I am looking to create a parser for a groupware product that would read pages from a message-board-style web site (think phpBB). But rather than creating a single Content item which is parsed and indexed into a single Lucene document, I am planning to have the parser create a master document (for the original post) and an additional document for each reply.
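Concretely, the split I have in mind would look something like the sketch below. All of the names here (the Doc class, the "#reply-N" URL fragments, the reply delimiter) are hypothetical and just illustrate the shape of the output, not any existing plugin interface:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: turn one fetched thread page into a "master" record
// plus one record per reply, each with its own synthetic URL so that each
// could be indexed as a standalone Lucene document.
public class ThreadSplitter {

    // Minimal stand-in for whatever the indexer would ultimately receive.
    static class Doc {
        final String url;
        final String text;
        Doc(String url, String text) { this.url = url; this.text = text; }
    }

    // Split raw thread content into master + replies. Replies are assumed
    // to be delimited by a marker string here; a real parser plugin would
    // walk the page's HTML instead.
    static List<Doc> split(String originalUrl, String raw) {
        List<Doc> docs = new ArrayList<>();
        String[] parts = raw.split("<!--reply-->");
        docs.add(new Doc(originalUrl, parts[0].trim()));
        for (int i = 1; i < parts.length; i++) {
            docs.add(new Doc(originalUrl + "#reply-" + i, parts[i].trim()));
        }
        return docs;
    }

    public static void main(String[] args) {
        String page = "Original post body<!--reply-->First reply<!--reply-->Second reply";
        for (Doc d : split("http://forum.example/thread/42", page)) {
            System.out.println(d.url + " :: " + d.text);
        }
    }
}
```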
I've reviewed the code for the protocol, parser, and indexing plugins, but each interface only allows a single document or content object to be passed around. Am I missing something simple?

My best bet at the moment is to implement some kind of new "fake" protocol for the reply items: I would use the http client plugin for the first request to the page, then generate outlinks of the form "fakereplyto://originalurl/reply1", "fakereplyto://originalurl/reply2", etc., and go back through to fetch the sub-page content. But this seems roundabout and would probably generate an http request for each reply on the original page. Perhaps there is a way to look up the original page in the segment db before requesting it again?

Needless to say, it would seem more straightforward to tackle this in some kind of parser plugin that could break the original page into pieces that are treated as standalone pages for indexing purposes. Last but not least, conceptually a plugin for the indexer might be able to take a set of custom metadata for a "replies" collection and index it as separate Lucene documents, but I can't find a way to do this given the interfaces in the indexer plugins.

Thanks in advance,
Malcolm Smith