Hi all,

My organization is mulling the creation of a Nutch Extension Point that would 
enable realtime processing of Nutch documents as they’re fetched. We want to 
pass Nutch-fetched documents to a realtime framework such as Storm or 
Spark. Currently, it’s trivial to implement a custom Indexer plugin that sort 
of gets the job done. However, this doesn’t really meet the realtime 
requirement: you must wait for the fetch, parse, updatedb, index cycle to 
complete.

Our idea is to create a FetcherDisseminator extension point. A 
FetcherDisseminator would implement a disseminate() method that would take care 
of serialization (JSON, Avro, etc.) and disseminating the data to an external 
entity (for example, a REST interface or a Kafka broker).
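
To make the idea concrete, here is a rough sketch of what the extension point 
might look like. Nothing here exists in Nutch today: the interface shape, the 
disseminate(String, byte[]) signature, and the JsonDisseminator class are all 
assumptions for illustration. A real version would probably take Nutch's own 
Content or CrawlDatum types and use a proper JSON library. The sink is a 
stand-in for a Kafka producer or REST client.

```java
import java.nio.charset.StandardCharsets;
import java.util.function.Consumer;

// Hypothetical extension point; the name comes from the proposal above,
// but this signature is only an assumption.
interface FetcherDisseminator {
    void disseminate(String url, byte[] content) throws Exception;
}

// Hypothetical plugin: serializes a fetched document to a small JSON record
// and hands it to a sink (a stand-in for a Kafka broker or REST endpoint).
class JsonDisseminator implements FetcherDisseminator {
    private final Consumer<String> sink;

    JsonDisseminator(Consumer<String> sink) {
        this.sink = sink;
    }

    @Override
    public void disseminate(String url, byte[] content) {
        // Assumes the fetched bytes are UTF-8 text; binary content would
        // need a different encoding (for example Base64).
        String body = new String(content, StandardCharsets.UTF_8);
        sink.accept("{\"url\":\"" + escape(url)
                + "\",\"content\":\"" + escape(body) + "\"}");
    }

    // Minimal JSON string escaping, enough for this sketch.
    static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n");  break;
                case '\r': sb.append("\\r");  break;
                case '\t': sb.append("\\t");  break;
                default:
                    if (c < 0x20) {
                        sb.append(String.format("\\u%04x", (int) c));
                    } else {
                        sb.append(c);
                    }
            }
        }
        return sb.toString();
    }
}
```

Keeping serialization and transport behind a single method means each plugin 
can pick its own format and destination without the fetcher knowing about 
either.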

The FetcherDisseminators would be called from within the 
org.apache.nutch.fetcher.Fetcher.FetcherThread class. The implementation would 
be such that the normal fetch-parse-update-index cycle would be unaffected, 
even in the case of disseminator failure. 
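
As a sketch of the failure isolation we have in mind, the fetcher thread 
could invoke each disseminator inside its own try/catch, so that an exception 
is logged and then ignored. The DisseminatorRunner class and the interface 
signature below are hypothetical, not existing Nutch code, and the interface 
is declared here only so the sketch compiles on its own.

```java
import java.util.List;
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical extension point (assumed signature, not existing Nutch API).
interface FetcherDisseminator {
    void disseminate(String url, byte[] content) throws Exception;
}

// Hypothetical helper that a fetcher thread could call after a fetch.
final class DisseminatorRunner {
    private static final Logger LOG =
            Logger.getLogger(DisseminatorRunner.class.getName());

    private final List<FetcherDisseminator> disseminators;

    DisseminatorRunner(List<FetcherDisseminator> disseminators) {
        this.disseminators = disseminators;
    }

    // Calls every disseminator; a failure in one is logged and does not
    // prevent the others from running or propagate to the caller.
    // Returns the number of disseminators that failed.
    int disseminateAll(String url, byte[] content) {
        int failures = 0;
        for (FetcherDisseminator d : disseminators) {
            try {
                d.disseminate(url, content);
            } catch (Exception e) {
                failures++;
                LOG.log(Level.WARNING, "Disseminator failed for " + url, e);
            }
        }
        return failures;
    }
}
```

This only guards against exceptions. A slow or hanging disseminator would 
still hold up the fetcher thread, so a real implementation might hand 
documents to a bounded queue or a separate executor instead.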

My first question is whether something like this has been discussed before by 
the Nutch developers, and if so, whether there is any current work on it.

My second question is whether there is any interest from the community in such 
a feature. If so, we’d love your input on how to go about contributing to the 
Nutch project.

Cheers

Jake
