Thank you very much for the helpful reply, I'm back on track.
On Tue, Oct 20, 2009 at 2:01 AM, Andrzej Bialecki <a...@getopt.org> wrote:
> malcolm smith wrote:
>> I am looking to create a parser for a groupware product that would read
>> pages from a message board type web site (think phpBB). But rather than
>> creating a single Content item which is parsed and indexed to a single
>> Lucene document, I am planning to have the parser create a master
>> document (for the original post) and an additional document for each
>> reply item.
>>
>> I've reviewed the code for protocol plugins, parser plugins and indexing
>> plugins, but each interface allows only a single document or content
>> object to be passed around.
>>
>> Am I missing something simple?
>>
>> My best bet at the moment is to implement some kind of new "fake"
>> protocol for the reply items. I would use the HTTP client plugin for the
>> first request to the page and generate outlinks such as
>> "fakereplyto://originalurl/reply1" and "fakereplyto://originalurl/reply2"
>> to go back through and fetch the sub-page content. But this seems
>> round-about and would probably generate an HTTP request for each reply
>> on the original page. Perhaps there is a way to look up the original
>> page in the segment db before requesting it again.
>>
>> Needless to say, it would seem more straightforward to tackle this in
>> some kind of parser plugin that could break the original page into
>> pieces that are treated as standalone pages for indexing purposes.
>>
>> Last but not least, conceptually a plugin for the indexer might be able
>> to take a set of custom metadata for a "replies" collection and index it
>> as separate Lucene documents - but I can't find a way to do this given
>> the interfaces in the indexer plugins.
>>
>> Thanks in advance
>> Malcolm Smith
>
> What version of Nutch are you using? This should already be possible
> using the 1.0 release or a nightly build.
> ParseResult (which is what parsers produce) can hold multiple Parse
> objects, each with its own URL.
>
> The common approach to handling whole-part relationships (like zip/tar
> archives, RSS, and other compound docs) is to split them in the parser
> and parse each part, then give each sub-document its own URL (e.g.
> file.tar!myfile.txt) and add the original URL to the metadata, to keep
> track of the parent URL. The rest should be handled automatically,
> although there are some other complications that need to be handled as
> well (e.g. don't recrawl sub-documents).
>
> --
> Best regards,
> Andrzej Bialecki <><
>  ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
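For anyone following along, here is a minimal, self-contained sketch of the splitting approach described above. The class and method names are hypothetical illustrations, not the actual Nutch API; in a real parser plugin each SubDoc below would instead be a Parse object added to the ParseResult under its derived URL. The "!reply-N" suffix mirrors the "file.tar!myfile.txt" convention, and the "parent.url" metadata key is an assumed name for tracking the original page:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of splitting one fetched forum page into a master
// document plus one sub-document per reply, each with its own URL.
public class ReplySplitter {

    // Stands in for what would become one Parse entry / Lucene document.
    public static class SubDoc {
        public final String url;
        public final String text;
        public final Map<String, String> metadata = new LinkedHashMap<>();

        public SubDoc(String url, String text) {
            this.url = url;
            this.text = text;
        }
    }

    // Split a page into the original post plus one SubDoc per reply.
    public static List<SubDoc> split(String pageUrl, String post, List<String> replies) {
        List<SubDoc> docs = new ArrayList<>();
        // The master document keeps the real page URL.
        docs.add(new SubDoc(pageUrl, post));
        for (int i = 0; i < replies.size(); i++) {
            // Derive a unique URL per reply, as with compound archives.
            SubDoc d = new SubDoc(pageUrl + "!reply-" + (i + 1), replies.get(i));
            // Record the parent URL so sub-documents can be traced back
            // (and, e.g., excluded from recrawling).
            d.metadata.put("parent.url", pageUrl);
            docs.add(d);
        }
        return docs;
    }
}
```

The same shape carries over to a real parser plugin: one entry per part, a derived URL per entry, and the parent URL stashed in metadata.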