Zhang Le wrote:
Hello, I'm writing a little Tkinter application to retrieve news from various news websites such as http://news.bbc.co.uk/, and display them in a Tk listbox. All I want is the title and URL of each news item. Since each news site has a different layout, I think I need some template-based technique to build a news extractor for each site, ignoring things I'm not interested in such as tables, images, advertisements and Flash.
So far I have built a simple GUI using Tkinter, and a link extractor using htmllib to extract HREFs from a web page. But I really have no idea how to extract news from a web site. Is anyone aware of general techniques for extracting web news? Or can anyone point me to some similar projects? I have seen some search engines doing this, for example http://news.ithaki.net/, but I do not know the technique used. Any tips?
Thanks in advance,
Zhang Le

Well, for Python-related news I suck stuff from O'Reilly's meerkat service using xmlrpc. Once upon a time I used to update www.holdenweb.com every four hours, but until my current hosting situation changes I can't be arsed.
However, the code to extract the news is pretty simple. Here's the whole program, modulo newsreader wrapping. It would be shorter if I weren't stashing the extracted links in a relational database:
#!/usr/bin/python
#
# mkcheck.py: Get a list of article categories from the O'Reilly Network
#             and update the appropriate section database
#
import xmlrpclib
import mx.DateTime as dt

# db is a local module: conn is an open DB-API connection and pmark
# is used below as the parameter marker in the SQL statements
from db import conn, pmark

server = xmlrpclib.Server("http://www.oreillynet.com/meerkat/xml-rpc/server.php")
curs = conn.cursor()

# Ask meerkat for the ten most recent items matching "Python" or "python"
pyitems = server.meerkat.getItems(
    {'search': '/[Pp]ython/', 'num_items': 10, 'descriptions': 100})

sqlinsert = "INSERT INTO PyLink (pylWhen, pylURL, pylDescription) VALUES(%s, %s, %s)" % (pmark, pmark, pmark)

for itm in pyitems:
    # Fall back to the title when there is no description
    description = itm['description'] or itm['title']
    # Skip items with no link or with markup in the description
    if itm['link'] and not ("<" in description):
        curs.execute("""SELECT COUNT(*) FROM PyLink
                        WHERE pylURL=%s""" % pmark, (itm['link'], ))
        newlink = curs.fetchone()[0] == 0
        # Only store links that are not already in the table
        if newlink:
            print "Adding", itm['link']
            curs.execute(sqlinsert,
                (dt.DateTimeFromTicks(int(dt.now())), itm['link'], description))

conn.commit()
conn.close()
Similar techniques can be used on many other sites, and you will find that (some) RSS feeds are a fruitful source of news.
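
As a rough sketch of the RSS route (not part of the program above), something like the following pulls out just the title and URL of each item, which is all a listbox needs. It is written for Python 3 using only the standard library. The feed text, the example.com URLs and the function name extract_headlines are made up for illustration, and it assumes a plain RSS 2.0 feed with no XML namespaces on the item elements.

```python
import xml.etree.ElementTree as ET

# Made-up RSS 2.0 sample standing in for a real feed. In practice the
# text would be downloaded first, e.g. with urllib.request.urlopen().
SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example News</title>
    <link>http://example.com/</link>
    <item>
      <title>First headline</title>
      <link>http://example.com/first</link>
    </item>
    <item>
      <title>Item without a link</title>
    </item>
    <item>
      <title>  Second headline  </title>
      <link>http://example.com/second</link>
    </item>
  </channel>
</rss>"""


def extract_headlines(rss_text):
    """Return a list of (title, link) pairs found in RSS 2.0 text."""
    root = ET.fromstring(rss_text)
    headlines = []
    # Each news story in RSS 2.0 is an <item> inside <channel>
    for item in root.iter("item"):
        title = (item.findtext("title") or "").strip()
        link = (item.findtext("link") or "").strip()
        # Keep only items that have both a title and a link
        if title and link:
            headlines.append((title, link))
    return headlines


if __name__ == "__main__":
    for title, link in extract_headlines(SAMPLE_RSS):
        print(title, link)
```

The list of pairs can be fed straight into a Tkinter Listbox, keeping the links in a parallel list so that a selection maps back to its URL. Feeds in other formats (Atom, RSS 1.0) use namespaced elements and would need the tag names adjusted.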
regards
 Steve
--
Steve Holden               http://www.holdenweb.com/
Python Web Programming     http://pydish.holdenweb.com/
Holden Web LLC      +1 703 861 4237  +1 800 494 3119