Chris Mattmann wrote:
> Got it. So, the logic behind this is, why bother waiting until the
> following fetch to parse (and create ParseData objects from) the RSS items
> out of the feed. Okay, I get it, assuming that the RSS feed has *all* of the
> RSS metadata in it. However, it's perfectly acceptable to have feeds that
> simply have a title, description, and link in it.

Almost. The feed may have less than the referenced page, but it's also a lot
easier to parse, since the link could be an anchor within a large page, or
could be a page that has lots of navigation links, spam comments, etc. So
feed entries are generally much more precise than the pages they reference,
and may make for a higher-quality search experience.

> I guess this is still
> valuable metadata information to have, however, the only caveat is that the
> implication of the proposed change is:
>
> 1. We won't have cached copies, or fetched copies of the Content represented
> by the item links. Therefore, in this model, we won't be able to pull up a
> Nutch cache of the page corresponding to the RSS item, because we are
> circumventing the fetch step

Good point. We indeed wouldn't have these URLs in the cache.

> 2. It sounds like a pretty fundamental API shift in Nutch, to support a
> single type of content, RSS. Even if there are more content types that
> follow this model, as Doug and Renaud both pointed out, there aren't a
> multitude of them (perhaps archive files, but can you think of any others)?

Also true. On the other hand, Nutch provides 98% of an RSS search engine.
It'd be a shame to have to re-invent everything else, and it would be great
if Nutch could evolve to support RSS well.

Could image search also benefit from this? One could generate a Parse for
each image on a page whose text was from the page. Product search too,
perhaps.

> The other main thing that comes to mind about this for me is it prevents the
> fetched Content for the RSS items from being able to provide useful
> metadata, in the sense that it doesn't explicitly fetch the content. What if
> we wanted to apply some super cool metadata extractor X that used
> word-stemming, HTML design analysis, and other techniques to extract
> metadata from the content pointed to by an RSS item link?
> In the proposed
> model, we assume that the RSS xml item tag already contains all necessary
> metadata for indexing, which in my mind, limits the model. Does what I am
> saying make sense? I'm not shooting down the issue, I'm just trying to
> brainstorm a bit here about the issue.

Sure, the RSS feed may contain less than the page it references, but that
might be all that one wishes to index. Otherwise, if, e.g., a blog includes
titles from other recent posts you're going to get lots of false positives.

Ideally Nutch should support various options: searching the feed only,
searching the referenced page only, or perhaps searching both.

Doug

_______________________________________________
Nutch-developers mailing list
Nutch-developers@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nutch-developers
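
For illustration only, the three options mentioned above (index the feed
entry, the referenced page, or both) might be sketched roughly like this.
This is plain Python, not Nutch code: ItemDoc, item_docs, the mode names and
the fetch_page callable are all made up for the example, and it assumes a
simple RSS 2.0 feed with un-namespaced <item> elements.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class ItemDoc:
    """Hypothetical per-item record: one indexable document per feed entry."""
    url: str
    title: str
    text: str  # the text that would be indexed under this URL


def item_docs(feed_xml: str,
              mode: str = "feed",
              fetch_page: Optional[Callable[[str], str]] = None) -> List[ItemDoc]:
    """Build one ItemDoc per <item> in an RSS 2.0 feed.

    mode is an assumed switch: "feed" indexes only the item description,
    "page" only the fetched page text, "both" concatenates the two.
    fetch_page is a caller-supplied stand-in for the fetch step.
    """
    if mode not in ("feed", "page", "both"):
        raise ValueError("mode must be 'feed', 'page' or 'both'")
    if mode != "feed" and fetch_page is None:
        raise ValueError("fetch_page is required unless mode is 'feed'")

    docs = []
    root = ET.fromstring(feed_xml)
    for item in root.iter("item"):
        link = (item.findtext("link") or "").strip()
        if not link:
            continue  # nothing to key the document on, so skip the item
        title = (item.findtext("title") or "").strip()
        parts = []
        if mode in ("feed", "both"):
            parts.append((item.findtext("description") or "").strip())
        if mode in ("page", "both"):
            parts.append(fetch_page(link))
        docs.append(ItemDoc(link, title, "\n".join(p for p in parts if p)))
    return docs
```

With mode="feed" no page is fetched at all, which is also why there would be
no cached copy of the item URL in that case; "page" and "both" put the fetch
step back in.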