Re: extract news article from web

Steve Holden Wed, 22 Dec 2004 12:25:40 -0800

Zhang Le wrote:

Hello,
I'm writing a little Tkinter application to retrieve news from
various news websites such as http://news.bbc.co.uk/, and display it
in a Tk listbox. All I want is the title and URL of each news item.
Since each news site has a different layout, I think I need some
template-based technique to build a news extractor for each site,
ignoring things such as tables, images, advertisements and Flash that
I'm not interested in.

So far I have built a simple GUI using Tkinter, and a link extractor
using htmllib to extract HREFs from a web page. But I really have no
idea how to extract news from a web site. Is anyone aware of general
techniques for extracting web news? Or can anyone point me to some
similar projects?
I have seen some search engines doing this, for
example: http://news.ithaki.net/, but do not know the technique used.
Any tips?

Thanks in advance,

Zhang Le

Well, for Python-related news I suck stuff from O'Reilly's Meerkat service using xmlrpc. Once upon a time I used to update www.holdenweb.com every four hours, but until my current hosting situation changes I can't be arsed.

However, the code to extract the news is pretty simple. Here's the whole program, modulo newsreader wrapping. It would be shorter if I weren't stashing the extracted links in a relational database:

#!/usr/bin/python
#
# mkcheck.py: Get a list of article categories from the O'Reilly Network
#             and update the appropriate section database
#
import xmlrpclib
server = xmlrpclib.Server("http://www.oreillynet.com/meerkat/xml-rpc/server.php")

from db import conn, pmark
import mx.DateTime as dt
curs = conn.cursor()

pyitems = server.meerkat.getItems(
        {'search':'/[Pp]ython/','num_items':10,'descriptions':100})

sqlinsert = "INSERT INTO PyLink (pylWhen, pylURL, pylDescription) VALUES(%s, %s, %s)" % (pmark, pmark, pmark)
for itm in pyitems:
    description = itm['description'] or itm['title']
    if itm['link'] and not ("<" in description):
        curs.execute("""SELECT COUNT(*) FROM PyLink WHERE pylURL=%s""" % pmark, (itm['link'], ))
        newlink = curs.fetchone()[0] == 0
        if newlink:
            print "Adding", itm['link']
            curs.execute(sqlinsert,
                (dt.DateTimeFromTicks(int(dt.now())), itm['link'], description))

conn.commit()
conn.close()
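
The db module imported above (conn, pmark) is local to my setup and isn't shown here. For anyone who wants to try the "only insert links we haven't seen" logic without it, the following is a rough, self-contained sketch in current Python using the standard sqlite3 module. It assumes a PyLink table with the same three columns, uses sqlite3's "?" parameter marker in place of pmark, and stores a plain integer timestamp instead of an mx.DateTime value. The function names add_new_links and make_db are made up for the illustration, and unlike the program above it also skips items that have neither a description nor a title.

```python
import sqlite3
import time


def make_db():
    """Create an in-memory database with an (assumed) PyLink table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE PyLink (pylWhen INTEGER, pylURL TEXT, pylDescription TEXT)")
    return conn


def add_new_links(conn, items):
    """Insert every item whose link is not stored yet; return the links added."""
    curs = conn.cursor()
    added = []
    for itm in items:
        description = itm.get("description") or itm.get("title")
        link = itm.get("link")
        # Skip items with no link, no text, or markup in the text.
        if not link or not description or "<" in description:
            continue
        curs.execute("SELECT COUNT(*) FROM PyLink WHERE pylURL=?", (link,))
        if curs.fetchone()[0] == 0:
            curs.execute(
                "INSERT INTO PyLink (pylWhen, pylURL, pylDescription) "
                "VALUES (?, ?, ?)",
                (int(time.time()), link, description))
            added.append(link)
    conn.commit()
    return added


if __name__ == "__main__":
    conn = make_db()
    # Made-up items shaped like the dictionaries used in the loop above.
    items = [
        {"title": "A story", "description": "", "link": "http://example.com/a"},
        {"title": "Markup", "description": "<b>x</b>", "link": "http://example.com/b"},
    ]
    for link in add_new_links(conn, items):
        print("Adding", link)
    conn.close()
```

Running add_new_links a second time with the same items adds nothing, which is the point of the SELECT COUNT(*) check.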

Similar techniques can be used on many other sites, and you will find that (some) RSS feeds are a fruitful source of news.
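
Since all you want is titles and URLs, an RSS feed hands you exactly that with no per-site template. As a minimal sketch in current Python, assuming a plain RSS 2.0 feed with no XML namespaces, the standard xml.etree.ElementTree module is enough. The sample feed text and the function name extract_items are invented for the illustration; a real program would fetch the feed text over HTTP first, and RSS 1.0 (RDF) feeds use namespaces that this sketch does not handle.

```python
import xml.etree.ElementTree as ET

# A made-up RSS 2.0 document standing in for a downloaded feed.
SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <link>http://example.com/</link>
    <item>
      <title>First story</title>
      <link>http://example.com/1</link>
    </item>
    <item>
      <title>Second story</title>
      <link>http://example.com/2</link>
    </item>
    <item>
      <title>No link here</title>
    </item>
  </channel>
</rss>
"""


def extract_items(rss_text):
    """Return (title, link) pairs for every <item> that has a link."""
    root = ET.fromstring(rss_text)
    results = []
    for item in root.iter("item"):
        title = (item.findtext("title") or "").strip()
        link = (item.findtext("link") or "").strip()
        if link:
            results.append((title, link))
    return results


if __name__ == "__main__":
    for title, link in extract_items(SAMPLE_RSS):
        print(title, link)
```

The resulting list of pairs can go straight into a Tk listbox, with the title as the displayed text and the link kept alongside it.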

regards
 Steve
--
Steve Holden               http://www.holdenweb.com/
Python Web Programming  http://pydish.holdenweb.com/
Holden Web LLC      +1 703 861 4237  +1 800 494 3119
