Doug, Renaud,

Got it. So the logic behind this is: why bother waiting until the following fetch to parse the RSS items out of the feed (and create ParseData objects from them)? Okay, I get it, assuming that the RSS feed has *all* of the RSS metadata in it. However, it's perfectly acceptable to have feeds that simply have a title, description, and link in them. I guess that is still valuable metadata to have. The caveats are that the proposed change implies:

1. We won't have cached or fetched copies of the Content represented by the item links. Therefore, in this model, we won't be able to pull up a Nutch cache of the page corresponding to the RSS item, because we are circumventing the fetch step.

2. It sounds like a pretty fundamental API shift in Nutch to support a single type of content, RSS. Even if there are more content types that follow this model, as Doug and Renaud both pointed out, there aren't a multitude of them (perhaps archive files, but can you think of any others?).

The other main thing that comes to mind for me is that it prevents the fetched Content for the RSS items from providing useful metadata, in the sense that it never explicitly fetches that content. What if we wanted to apply some super cool metadata extractor X that used word-stemming, HTML design analysis, and other techniques to extract metadata from the content pointed to by an RSS item link? The proposed model assumes that the RSS XML item tag already contains all the metadata necessary for indexing, which in my mind limits the model.

Does what I am saying make sense? I'm not shooting down the issue, I'm just trying to brainstorm a bit here.

Cheers,
  Chris

On 2/7/07 11:11 AM, "Doug Cutting" <[EMAIL PROTECTED]> wrote:

> Chris Mattmann wrote:
>> Sorry to be so thick-headed, but could someone explain to me in really
>> simple language what this change is requesting that is different from the
>> current Nutch API? I still don't get it, sorry...
>
> A Content would no longer generate a single Parse. Instead, a Content
> could potentially generate many Parses. For most types of content,
> e.g., HTML, each Content would still generate a single Parse. But for
> RSS, a Content might generate multiple Parses, each indexed separately
> and each with a distinct URL.
>
> Another potential application could be processing archives: the parser
> could unpack the archive and each item in it indexed separately rather
> than indexing the archive as a whole. This only makes sense if each
> item has a distinct URL, which it does in RSS, but it might not in an
> archive. However some archive file formats do contain URLs, like that
> used by the Internet Archive.
>
> http://www.archive.org/web/researcher/ArcFileFormat.php
>
> Does that help?
>
> Doug
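
For reference, the one-Content-to-many-Parses idea described in the quoted message can be sketched in plain Java as below. This is only an illustration under stated assumptions: `MultiParseSketch`, `ParseSketch`, `FeedItem`, `parseHtml`, and `parseFeed` are all hypothetical names made up for this sketch, and none of them is the actual Nutch API. It shows an HTML-like document yielding a single parse keyed by its own URL, and an RSS-like feed yielding one parse per item keyed by the item's link.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustration only: hypothetical types, not the Nutch Content/Parse classes. */
public class MultiParseSketch {

    /** Hypothetical stand-in for one parse result. */
    public static final class ParseSketch {
        public final String title;
        public final String text;

        public ParseSketch(String title, String text) {
            this.title = title;
            this.text = text;
        }
    }

    /** Hypothetical stand-in for one RSS item: link, title, description. */
    public static final class FeedItem {
        public final String link;
        public final String title;
        public final String description;

        public FeedItem(String link, String title, String description) {
            this.link = link;
            this.title = title;
            this.description = description;
        }
    }

    /** HTML-like content: one fetched document yields one parse, keyed by its own URL. */
    public static Map<String, ParseSketch> parseHtml(String url, String title, String text) {
        Map<String, ParseSketch> parses = new LinkedHashMap<>();
        parses.put(url, new ParseSketch(title, text));
        return parses;
    }

    /** RSS-like content: one fetched feed yields one parse per item, keyed by the item's link. */
    public static Map<String, ParseSketch> parseFeed(List<FeedItem> items) {
        Map<String, ParseSketch> parses = new LinkedHashMap<>();
        for (FeedItem item : items) {
            // Only the metadata carried in the feed is available here;
            // the page behind item.link is never fetched in this model.
            parses.put(item.link, new ParseSketch(item.title, item.description));
        }
        return parses;
    }
}
```

Note that in this sketch each parse produced by `parseFeed` carries only the title and description found in the feed, which is the limitation raised above: anything that would require fetching the linked page is unavailable.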