I'm currently working with the trunk version of Nutch and am trying to configure it to crawl a list of RSS feeds. I want the generated Lucene index to contain only one document per RSS feed URL, with that document's content including all of the phrases indexed for each outlink in the feed. I've searched the newsgroup extensively and read everything I could find on RSS.

I'm running the crawl command with a depth of 2 (I have also tried a depth of 1). I'm using three RSS feeds from the NYTimes to test:

http://www.nytimes.com/services/xml/rss/nyt/WorldBusiness.xml
http://www.nytimes.com/services/xml/rss/nyt/YourMoney.xml
http://www.nytimes.com/services/xml/rss/nyt/Automobiles.xml

This thread

http://www.nabble.com/Indexing-Feeds---Blog-Posts-with-Nutch-to13159411.html#a13214886

seems to say that the older parse-rss plugin indexes the whole feed and its items as one document, but it is a little vague. I've tried using both plugins together and each one on its own, but can't get the behavior I want.

I'm also confused about why both RSS plugins are configured as active in the parse-plugins.xml file for application/xml. Are they meant to complement each other? I would have thought you'd want to choose one or the other.

I noticed that if I use only the feed plugin, my resulting Lucene index has only 20 documents. If I use both, or just parse-rss, I get 52.
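To make the goal concrete, here is a rough sketch of the document shape I'm after. It is plain Python, not Nutch code, and it does not reflect how either plugin works internally. The function names, the sample feed, and the page_text mapping are all made up for illustration; in a real crawl the page text would come from the fetched and parsed outlinks.

```python
import xml.etree.ElementTree as ET

# Hypothetical two-item RSS 2.0 feed, used only for illustration.
SAMPLE_RSS = """<rss version="2.0"><channel>
<title>Example</title>
<link>http://example.com/</link>
<item><title>A</title><link>http://example.com/a</link></item>
<item><title>B</title><link>http://example.com/b</link></item>
</channel></rss>"""


def outlinks_from_rss(rss_xml):
    """Return the <link> URL of every <item> in an RSS 2.0 document."""
    root = ET.fromstring(rss_xml)
    return [
        item.findtext("link").strip()
        for item in root.iter("item")
        if item.findtext("link")  # skip items with a missing or empty link
    ]


def merge_feed_document(feed_url, rss_xml, page_text):
    """Build ONE document per feed.

    page_text is an assumed mapping of outlink URL -> extracted page text.
    The feed document's content is the text of all its outlinks, joined
    in feed order; outlinks with no fetched text are skipped.
    """
    links = outlinks_from_rss(rss_xml)
    content = "\n".join(page_text[u] for u in links if u in page_text)
    return {"url": feed_url, "content": content, "outlinks": links}
```

With this shape, three feed URLs would yield exactly three documents in the index, no matter how many items each feed lists.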
Thanks for any help you can provide.
