RE: Nutch Extension for realtime processing

Markus Jelsma Tue, 17 Jun 2014 10:56:29 -0700

Hi Jake,

It would be more pluggable to implement an indexer backend plugin for your 
target (Storm, Spark), so you can reuse the existing indexing filter 
framework and plugins to enrich the data. If you then couple the indexing logic 
to FetcherOutputFormat, you can skip the parse job (this approach requires a 
parsing fetcher) and the updatedb job, as well as the separate indexing job. 
This is certainly not real time, but the delay is much smaller, especially if 
you keep to (many) small fetch jobs. In our environment we can guarantee that a 
fetched document is always indexed within 15 minutes.


Markus 
 
-----Original message-----
> From: Jake Dodd <[email protected]>
> Sent: Tuesday 17th June 2014 19:30
> To: [email protected]
> Subject: Nutch Extension for realtime processing
> 
> Hi all,
> 
> My organization is mulling the creation of a Nutch Extension Point that would 
> enable realtime processing of Nutch documents as they’re fetched. We have the 
> desire to pass Nutch-fetched documents to a realtime framework such as Storm 
> or Spark. Currently, it’s trivial to implement a custom Indexer plugin that 
> sort of gets the job done. However, this doesn’t really meet the realtime 
> requirement: you must wait for the fetch, parse, updatedb, index cycle to 
> complete.
> 
> Our idea is to create a FetcherDisseminator extension point. A 
> FetcherDisseminator would implement a disseminate() method that would take 
> care of serialization (JSON, Avro, etc) and disseminating the data to an 
> external entity (for example a REST interface, or a Kafka broker).
> 
> The FetcherDisseminators would be called from within the 
> org.apache.nutch.fetcher.Fetcher.FetcherThread class. The implementation 
> would be such that the normal fetch-parse-update-index cycle would be 
> unaffected, even in the case of disseminator failure. 
> 
> My first question is whether something like this has been discussed before by 
> the Nutch developers, and if so, if there is any current work on the project.
> 
> My second question is whether there is any interest from the community in 
> such a feature. If so, we’d love your input on how to go about contributing 
> to the Nutch project.
> 
> Cheers
> 
> Jake
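
For illustration, the extension point proposed in the quoted message could be 
sketched roughly as below. This is only a sketch under assumptions: neither 
FetcherDisseminator nor the DisseminatorChain helper exists in Nutch, the 
disseminate() signature (URL plus raw content bytes) is a guess, and a real 
version would be invoked from the fetcher thread with Nutch's own types. The 
point it shows is the failure isolation: an exception in one disseminator is 
caught and counted, so the remaining disseminators and the normal fetch cycle 
continue.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical extension point; NOT part of the Nutch API.
interface FetcherDisseminator {
    // Serialize the fetched document and push it to an external system
    // (for example a REST endpoint or a Kafka broker).
    void disseminate(String url, byte[] content) throws Exception;
}

// Hypothetical helper that the fetcher thread would call after each fetch.
final class DisseminatorChain {
    private final List<FetcherDisseminator> disseminators = new ArrayList<>();

    void register(FetcherDisseminator disseminator) {
        disseminators.add(disseminator);
    }

    // Calls every registered disseminator. A failure in one is logged and
    // counted but never propagated, so the caller's fetch loop is unaffected.
    // Returns the number of disseminators that failed.
    int disseminateAll(String url, byte[] content) {
        int failures = 0;
        for (FetcherDisseminator disseminator : disseminators) {
            try {
                disseminator.disseminate(url, content);
            } catch (Exception e) {
                failures++;
                System.err.println("Disseminator failed for " + url + ": " + e);
            }
        }
        return failures;
    }
}
```

A caller would register one or more implementations once, then call 
disseminateAll(url, content) for each fetched document and carry on regardless 
of the returned failure count.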
