Thank you very much for the helpful reply, I'm back on track.
On Tue, Oct 20, 2009 at 2:01 AM, Andrzej Bialecki <a...@getopt.org> wrote:
> malcolm smith wrote:
>> I am looking to create a parser for a groupware product that would read
>> pages from a message board type web site (think phpBB). But rather than
>> creating a single Content item which is parsed and indexed to a single
>> Lucene document, I am planning to have the parser create a master
>> document (for the original post) and an additional document for each
>> reply item.
>>
>> I've reviewed the code for protocol plugins, parser plugins and indexing
>> plugins, but each interface allows only a single document or content
>> object to be passed around.
>>
>> Am I missing something simple?
>>
>> My best bet at the moment is to implement some kind of new "fake"
>> protocol for the reply items. I would use the HTTP client plugin for the
>> first request to the page and generate outlinks such as
>> "fakereplyto://originalurl/reply1" and "fakereplyto://originalurl/reply2"
>> to go back through and fetch the sub-page content. But this seems
>> round-about and would probably generate an HTTP request for each reply
>> on the original page. Perhaps there is a way to look up the original
>> page in the segment db before requesting it again.
>>
>> Needless to say, it would seem more straightforward to tackle this in
>> some kind of parser plugin that could break the original page into
>> pieces that are treated as standalone pages for indexing purposes.
>>
>> Last but not least, conceptually a plugin for the indexer might be able
>> to take a set of custom metadata for a "replies" collection and index it
>> as separate Lucene documents - but I can't find a way to do this given
>> the interfaces in the indexer plugins.
>>
>> Thanks in advance
>> Malcolm Smith
>
> What version of Nutch are you using? This should already be possible
> using the 1.0 release or a nightly build.
> ParseResult (which is what parsers produce) can hold multiple Parse
> objects, each with its own URL.
>
> The common approach to handling whole-part relationships (like zip/tar
> archives, RSS, and other compound docs) is to split them in the parser
> and parse each part, then give each sub-document its own URL (e.g.
> file.tar!myfile.txt) and add the original URL to the metadata, to keep
> track of the parent URL. The rest should be handled automatically,
> although there are some other complications that need to be handled as
> well (e.g. don't recrawl sub-documents).
>
> --
> Best regards,
> Andrzej Bialecki <><
>  ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
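For anyone following along, here is a minimal, self-contained sketch of the splitting approach described above. The class and method names are hypothetical illustrations, not the actual Nutch API; in a real parser plugin each SubDoc below would instead be a Parse object added to the ParseResult under its derived URL. The "!reply-N" suffix mirrors the "file.tar!myfile.txt" convention, and the "parent.url" metadata key is an assumed name for tracking the original page:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of splitting one fetched forum page into a master
// document plus one sub-document per reply, each with its own URL.
public class ReplySplitter {

    // Stands in for what would become one Parse entry / Lucene document.
    public static class SubDoc {
        public final String url;
        public final String text;
        public final Map<String, String> metadata = new LinkedHashMap<>();

        public SubDoc(String url, String text) {
            this.url = url;
            this.text = text;
        }
    }

    // Split a page into the original post plus one SubDoc per reply.
    public static List<SubDoc> split(String pageUrl, String post, List<String> replies) {
        List<SubDoc> docs = new ArrayList<>();
        // The master document keeps the real page URL.
        docs.add(new SubDoc(pageUrl, post));
        for (int i = 0; i < replies.size(); i++) {
            // Derive a unique URL per reply, as with compound archives.
            SubDoc d = new SubDoc(pageUrl + "!reply-" + (i + 1), replies.get(i));
            // Record the parent URL so sub-documents can be traced back
            // (and, e.g., excluded from recrawling).
            d.metadata.put("parent.url", pageUrl);
            docs.add(d);
        }
        return docs;
    }
}
```

The same shape carries over to a real parser plugin: one entry per part, a derived URL per entry, and the parent URL stashed in metadata.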