Renaud Richardet wrote:
> Doug Cutting wrote:
>> Renaud Richardet wrote:
>>> The usecase is that you index RSS-feeds, but your users can search
>>> each feed-entry as a single document. Does it makes sense?
>>
>> But each feed item also contains a link whose content will be indexed
>> and that's generally a superset of the item.
> Agreed
>> So should there be two urls indexed per item?
> I don't think so
>> In many cases, the best thing to do is to index only the linked page,
>> not the feed item at all. In some (rare?) cases, there might be
>> items without a link, whose only content is directly in the feed, or
>> where the content in the feed is complementary to that in the linked
>> page. In these cases it might be useful to combine the two (the feed
>> item and the linked content), indexing both. The proposed change
>> might permit that. Is that the case you're concerned about?
> I see. I was thinking that I could index the feed items without having
> to fetch them individually.
>
> More fundamentally, I want to index only the blog-entry text, and not
> the elements around it (header, menus, ads, ...), so as to improve the
> search results.
>
> Here's my case, the proposed changes would allow me to do (*)
>
> 1) parse feeds:
>
> for each (feedentry : feed) do
> |
> | if (full-text entries) then
> | | index each feed entry as a single document; blog header, menus are not indexed. *
> | else
> | | create a "special outlink" for each feed entry, which include metadata (content, time, etc)
> | endif
> |
> done
>
> 2) on a next fetch loop:
>
> for each (link) do
> |
> | if (this is a normal link)
> | | fetch it and index it normally
> | else if (this link come from an already indexed feed entry) then
> | | end, do not fetch it *
> | else if (this is a "special outlink")
> | | guess which DOM nodes hold the post content
> | | index it; blog header, menus are not indexed.
> | endif
> |
> done

I agree with Renaud Richardet.
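To make the two loops quoted above a bit more concrete, here is a rough sketch of the per-link decision. It is only an illustration under my own assumptions: `Action`, `FeedEntry`, `CrawlState`, `parse_feed` and `decide` are made-up names, not Nutch APIs, and it assumes the crawler can remember, per URL, whether a link came from a full-text feed entry or was recorded as a "special outlink".

```python
from dataclasses import dataclass, field
from enum import Enum, auto


class Action(Enum):
    FETCH_AND_INDEX = auto()    # normal link: fetch it and index it normally
    SKIP = auto()               # already indexed from a full-text feed entry
    FETCH_AND_EXTRACT = auto()  # "special outlink": fetch, index only the post body


@dataclass
class FeedEntry:
    link: str
    content: str = ""
    full_text: bool = False     # hypothetical flag: entry carries the whole post


@dataclass
class CrawlState:
    # URLs whose content was already indexed straight from the feed
    indexed_from_feed: set = field(default_factory=set)
    # URL -> metadata carried over from the feed (content, time, etc.)
    special_outlinks: dict = field(default_factory=dict)


def parse_feed(entries, state, index):
    """Step 1: index full-text entries now, defer the others as special outlinks."""
    for entry in entries:
        if entry.full_text:
            index(entry.link, entry.content)
            state.indexed_from_feed.add(entry.link)
        else:
            state.special_outlinks[entry.link] = {"summary": entry.content}


def decide(url, state):
    """Step 2: choose what the next fetch loop does with a link."""
    if url in state.indexed_from_feed:
        return Action.SKIP
    if url in state.special_outlinks:
        return Action.FETCH_AND_EXTRACT
    return Action.FETCH_AND_INDEX
```

The part that actually guesses which DOM nodes hold the post content is left out on purpose; the sketch only covers the bookkeeping that lets the second loop skip the fetch for entries already indexed from the feed.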
Also, I think it all boils down to speed. If you are building a blog
search engine, you want it to update feeds as fast as it can. Crawling
two depths (one for the RSS feeds, one for the outlinks) will slow it
down.

Besides that, many blog crawlers (like
http://help.yahoo.com/help/us/ysearch/crawling/crawling-02.html) set
crawl-delay to 1, so I guess most web servers are OK with that for RSS
feeds, but not necessarily for HTML pages. (So you would do depth 1, the
RSS feeds, very fast with a 1 second delay, and then fetch the items
with a 5 second delay.)

(I hope it is not stupid to point out Yahoo's crawler to someone who
works at Yahoo :)

--
Doğacan Güney

> Thanks,
> Renaud

_______________________________________________
Nutch-developers mailing list
Nutch-developers@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nutch-developers