Re: Nutch Post-Processing

Doğacan Güney Mon, 09 Feb 2009 03:56:26 -0800

On Sat, Feb 7, 2009 at 3:20 PM, Andrzej Bialecki <[email protected]> wrote:
> (moving this to nutch-user - nutch-agent is for reporting abuse/misbehavior
> of Nutch-based crawlers)
>
> John Crepezzi wrote:
>>
>> I'm interested in writing an application that analyzes sources every time
>> they are updated,
>> and uses the parsedText, tags, title, etc to perform some operations and
>> export the finished data to
>> a database.
>>
>> I'm not sure where this application should be placed within nutch/lucene,
>> so any advice anyone can offer would be greatly appreciated.
>>
>> I thought plugins would work for me, but I'm unable to find an extension
>> point that will give me access
>> to the parsed data and tag sets.
>
> This issue comes up occasionally, but so far no one was desperate enough to
> work out a patch ;)
>
> You can define an additional extension point (please see how eg.
> HtmlParseFilter extension is designed - perhaps this extension is all you
> need?), and invoke this new extension point right after you parse the
> content, so that you can access both the content and the parsed data/text
> even before it's recorded in a segment.
>
> The best place to put this hook would be in ParseUtil class, because that's
> what other Nutch tools use to parse the content.
>
>

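To make the extension-point idea a bit more concrete, here is a rough sketch
of what such a hook could look like. Everything in it is an assumption for
illustration: PostParseHook and PostParseDispatcher are made-up names, plain
strings and a map stand in for Nutch's content and parse objects, and the
dispatcher only marks the spot in ParseUtil where the call would go. It is
not the existing plugin API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical extension point: called once per page, right after parsing.
interface PostParseHook {
    void afterParse(String url, String title, String text, Map<String, String> meta);
}

// Hypothetical stand-in for the place in ParseUtil where hooks would be invoked.
class PostParseDispatcher {
    private final List<PostParseHook> hooks = new ArrayList<>();

    void register(PostParseHook hook) {
        hooks.add(hook);
    }

    // Runs every registered hook. A hook that throws is counted as a failure
    // but does not stop the remaining hooks (or the parse itself).
    int dispatch(String url, String title, String text, Map<String, String> meta) {
        int failures = 0;
        for (PostParseHook hook : hooks) {
            try {
                hook.afterParse(url, title, text, meta);
            } catch (RuntimeException e) {
                failures++;
            }
        }
        return failures;
    }
}
```

Isolating hook failures like this is a design choice, not a requirement: a
broken export step probably should not make the whole parse job fail.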

Another way to do it is to use the new NutchIndexWriter implementations. You
can add a new DBIndexWriter (modeled on SolrIndexWriter or LuceneIndexWriter),
then modify the indexing job (or add a new Indexer) so that it pushes
documents to your database.
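
A minimal sketch of the shape such a writer might take, under stated
assumptions: DocWriter is a simplified stand-in for the open/write/close
lifecycle of an index writer (the real NutchIndexWriter signatures should be
checked against the Nutch source), a document is reduced to a map of field
names to values, and RowSink is a hypothetical placeholder for whatever
actually talks to the database.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the writer contract (open / write / close).
interface DocWriter {
    void open(String name) throws IOException;
    void write(Map<String, String> doc) throws IOException;
    void close() throws IOException;
}

// Hypothetical destination; a real one would wrap JDBC or a similar client.
interface RowSink {
    void insertAll(List<Map<String, String>> rows) throws IOException;
}

// Buffers documents and pushes them to the sink in batches.
class DBIndexWriter implements DocWriter {
    private final RowSink sink;
    private final int batchSize;
    private final List<Map<String, String>> buffer = new ArrayList<>();

    DBIndexWriter(RowSink sink, int batchSize) {
        this.sink = sink;
        this.batchSize = batchSize;
    }

    @Override
    public void open(String name) {
        buffer.clear(); // start each output with an empty buffer
    }

    @Override
    public void write(Map<String, String> doc) throws IOException {
        buffer.add(new LinkedHashMap<>(doc)); // defensive copy of the fields
        if (buffer.size() >= batchSize) {
            flush();
        }
    }

    @Override
    public void close() throws IOException {
        flush(); // push whatever is left in the buffer
    }

    private void flush() throws IOException {
        if (buffer.isEmpty()) {
            return;
        }
        sink.insertAll(new ArrayList<>(buffer));
        buffer.clear();
    }
}
```

Batching is only there to avoid one database round trip per document; the
indexing job would call write() once per document and close() at the end.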

> --
> Best regards,
> Andrzej Bialecki     <><
>  ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
>
>



-- 
Doğacan Güney
