Question on crawling RSS feeds with Nutch

robg Mon, 10 Dec 2007 15:53:19 -0800

I'm currently working with the trunk version of Nutch and am trying to
configure it to crawl a list of RSS feeds. I want the generated Lucene index
to contain only one document per RSS feed URL, with that document's content
including all of the phrases indexed for each outlink in the feed. I've done
lots of searches through the newsgroup and have read everything I could find
on RSS.
I'm currently running the crawl command with a depth of 2 (but have also
tried running it with a depth of 1).
I'm using three RSS feeds from the New York Times to test:
http://www.nytimes.com/services/xml/rss/nyt/WorldBusiness.xml
http://www.nytimes.com/services/xml/rss/nyt/YourMoney.xml
http://www.nytimes.com/services/xml/rss/nyt/Automobiles.xml
This thread
http://www.nabble.com/Indexing-Feeds---Blog-Posts-with-Nutch-to13159411.html#a13214886
seems to say that the older parse-rss plugin indexes the whole feed and its
items as one document, but it is a little vague. I've tried using both plugins
together and each one on its own, but can't seem to get the behavior I want.
I'm also confused about why both RSS plugins are configured as active in the
parse-plugins.xml file for application/xml. Are they meant to complement each
other? I would have thought you'd want to choose one or the other. I noticed
that if I only use the feed plugin, my resulting Lucene index has only 20
documents. If I use both, or just parse-rss, I get 52.
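
To make the target concrete, here is a minimal Python sketch of the "one
document per feed" shape described above. It is not Nutch code and says nothing
about how parse-rss or the feed plugin work internally. The function name
feed_to_single_document, the sample feed, and the example.com URLs are all made
up for illustration. It also simplifies things by using only each item's title
and description instead of fetching the outlinked pages.

```python
import xml.etree.ElementTree as ET


def feed_to_single_document(feed_url, rss_xml):
    """Collapse one RSS 2.0 feed into a single document (hypothetical helper)."""
    root = ET.fromstring(rss_xml)
    parts = []
    outlinks = []
    # RSS 2.0 elements carry no XML namespace, so plain tag names are enough.
    for item in root.iter("item"):
        title = (item.findtext("title") or "").strip()
        description = (item.findtext("description") or "").strip()
        link = (item.findtext("link") or "").strip()
        if link:
            outlinks.append(link)
        text = " ".join(p for p in (title, description) if p)
        if text:
            parts.append(text)
    # One record keyed by the feed URL; item text is merged into its content.
    return {"url": feed_url, "content": "\n".join(parts), "outlinks": outlinks}


# Made-up sample feed, used only to exercise the sketch.
SAMPLE = """<rss version="2.0"><channel>
<title>Example</title>
<item><title>First</title><link>http://example.com/1</link>
<description>Alpha story</description></item>
<item><title>Second</title><link>http://example.com/2</link>
<description>Beta story</description></item>
</channel></rss>"""

doc = feed_to_single_document("http://example.com/feed.xml", SAMPLE)
print(doc["content"])
```

With this shape, three feed URLs would yield three index documents, however
many items each feed lists.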


Thanks for any help you can provide

-- 
View this message in context: 
http://www.nabble.com/Question-on-crawling-RSS-feeds-with-Nutch-tp14264498p14264498.html
Sent from the Nutch - User mailing list archive at Nabble.com.
