Hi Doug,

Okay, I see your points. It seems like this would be really useful for some current folks, and for Nutch going forward. I see that there has been some initial work today on preparing patches. I'd be happy to shepherd this into the sources. I will begin reviewing what's required and contacting the folks who've begun work on this issue.
Thanks!

Cheers,
Chris

On 2/7/07 1:31 PM, "Doug Cutting" <[EMAIL PROTECTED]> wrote:

> Chris Mattmann wrote:
>> Got it. So, the logic behind this is, why bother waiting until the
>> following fetch to parse (and create ParseData objects from) the RSS items
>> out of the feed. Okay, I get it, assuming that the RSS feed has *all* of the
>> RSS metadata in it. However, it's perfectly acceptable to have feeds that
>> simply have a title, description, and link in it.
>
> Almost. The feed may have less than the referenced page, but it's also
> a lot easier to parse, since the link could be an anchor within a large
> page, or could be a page that has lots of navigation links, spam
> comments, etc. So feed entries are generally much more precise than the
> pages they reference, and may make for a higher-quality search experience.
>
>> I guess this is still
>> valuable metadata information to have, however, the only caveat is that the
>> implication of the proposed change is:
>>
>> 1. We won't have cached copies, or fetched copies of the Content represented
>> by the item links. Therefore, in this model, we won't be able to pull up a
>> Nutch cache of the page corresponding to the RSS item, because we are
>> circumventing the fetch step
>
> Good point. We indeed wouldn't have these URLs in the cache.
>
>> 2. It sounds like a pretty fundamental API shift in Nutch, to support a
>> single type of content, RSS. Even if there are more content types that
>> follow this model, as Doug and Renaud both pointed out, there aren't a
>> multitude of them (perhaps archive files, but can you think of any others)?
>
> Also true. On the other hand, Nutch provides 98% of an RSS search
> engine. It'd be a shame to have to re-invent everything else and it
> would be great if Nutch could evolve to support RSS well.
>
> Could image search also benefit from this? One could generate a
> Parse for each image on a page whose text was from the page. Product
> search too, perhaps.
>
>> The other main thing that comes to mind about this for me is it prevents the
>> fetched Content for the RSS items from being able to provide useful
>> metadata, in the sense that it doesn't explicitly fetch the content. What if
>> we wanted to apply some super cool metadata extractor X that used
>> word-stemming, HTML design analysis, and other techniques to extract
>> metadata from the content pointed to by an RSS item link? In the proposed
>> model, we assume that the RSS xml item tag already contains all necessary
>> metadata for indexing, which in my mind, limits the model. Does what I am
>> saying make sense? I'm not shooting down the issue, I'm just trying to
>> brainstorm a bit here about the issue.
>
> Sure, the RSS feed may contain less than the page it references, but
> that might be all that one wishes to index. Otherwise, if, e.g., a blog
> includes titles from other recent posts you're going to get lots of
> false positives. Ideally Nutch should support various options:
> searching the feed only, searching the referenced page only, or perhaps
> searching both.
>
> Doug

_______________________________________________
Nutch-developers mailing list
Nutch-developers@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nutch-developers