DMoz RDF and RSS 1.0 RDF are two distinct dialects of RDF. I'd be surprised
if Nutch can import RSS 1.0 RDF directly (although I may be wrong).

Since you are using Rome, you may be better off using the Rome Fetcher's
Event API to detect whether a feed has changed. Currently the fetcher will
retrieve the feed if it has changed, so you could pass the updated
content directly to Nutch.
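
To illustrate the idea, here is a minimal sketch of change detection with a
callback. It is not the Rome Fetcher API: FeedChangeDetector, offer() and the
onChanged callback are hypothetical names made up for this example, and it
assumes you already have the feed body as a string. It only shows the
"notify when the content differs from last time" pattern, using a SHA-256
hash per feed URL.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;
import java.util.function.BiConsumer;

/** Hypothetical helper (not part of Rome): fires a callback only when a feed's content changes. */
class FeedChangeDetector {
    // Last seen SHA-256 digest (hex) for each feed URL.
    private final Map<String, String> lastHash = new HashMap<>();
    // Invoked with (feedUrl, content) when the content is new or differs from the previous poll.
    private final BiConsumer<String, String> onChanged;

    FeedChangeDetector(BiConsumer<String, String> onChanged) {
        this.onChanged = onChanged;
    }

    /** Records the content's hash; returns true and notifies the callback if it changed. */
    boolean offer(String feedUrl, String content) {
        String hash = sha256Hex(content);
        String previous = lastHash.put(feedUrl, hash);
        if (hash.equals(previous)) {
            return false; // same hash as last poll: nothing to do
        }
        onChanged.accept(feedUrl, content);
        return true;
    }

    /** Hex-encoded SHA-256 of the UTF-8 bytes of s. */
    static String sha256Hex(String s) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] digest = md.digest(s.getBytes(StandardCharsets.UTF_8));
            StringBuilder sb = new StringBuilder();
            for (byte b : digest) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 support is mandatory for Java platform implementations.
            throw new IllegalStateException(e);
        }
    }
}
```

The callback is where the updated content would be handed on for indexing.
How that hand-off to Nutch looks is left open here, since it depends on your
setup.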

Another alternative is to hook into FeedMesh
(http://bobwyman.pubsub.com/main/2005/04/feedmesh_works_.html) to get
immediate updates of RSS feeds.

Nick

> -----Original Message-----
> From: Hasan Diwan [mailto:[EMAIL PROTECTED] 
> Sent: Wednesday, 20 April 2005 10:50 AM
> To: [email protected]
> Subject: RSS Updates -- Best strategy
> 
> Instead of recrawling the web every few months, I'd like 
> nutch to monitor RSS feeds for site updates. The way I'm 
> currently thinking of doing this is:
> 1. If a website indicates syndication (<link rel="alternate"
> type="application/(atom|rss|rsd)">), grab the file with the 
> information (indicated by the "href" attribute). If there's 
> no RDF file, I'll fetch it using rome and have it convert the 
> feed to RDF.
> 2. See if it matches with the URI's stored hash, if so, skip 
> on to the next site.
> 3. If not, fetch all URLs in the file and add them to the 
> segments to be indexed using WebDBGenerator.main() with the 
> -dmozfile parameter. I still need to determine if the DMOZ 
> RDF file is a strict superset of the rdf format or if it is 
> incompatible. The validators all choke on it because of its 
> size; rome and feedparser run out of memory with it.
> 4. Optimise the index once every 24 hours.
> Is this the best way to do what I'd like? Thanks in advance 
> for the help!
> --
> 
> Cheers,
> Hasan Diwan <[EMAIL PROTECTED]>
> 


