Re: Nutch Extension for realtime processing

Jake Dodd Tue, 17 Jun 2014 15:55:34 -0700

Markus: The indexer plugin idea definitely works if the goal is only to pass 
Nutch-collected data to realtime frameworks. However, there are some cool 
things that you can do in “real” realtime (heh), as opposed to the batch nature 
of Nutch’s indexing plugins and the FetcherOutputFormat. Moreover, it would be 
cool to have Nutch working as designed (with fetching, parsing, indexing and 
all) while basically gaining the realtime capabilities for free.


Chris: Glad to hear you’re interested, and thanks for the link! Today I was 
actually able to finish a prototype version of this, along with two example 
Disseminator plugins (one to stdout, the other to a Kafka topic—both working 
beautifully). I’d be happy to create a New Feature JIRA and start working on 
this.

Cheers

Jake


On Jun 17, 2014, at 11:02 AM, Mattmann, Chris A (3980) 
<[email protected]> wrote:

> Jake I am totally interested in this. Contributing to Nutch (and more
> generally to Apache projects) is described really well (by Dennis Kubes)
> here:
> 
> http://wiki.apache.org/nutch/Becoming_A_Nutch_Developer
> 
> 
> Looking forward to seeing your contributions!
> 
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Chris Mattmann, Ph.D.
> Chief Architect
> Instrument Software and Science Data Systems Section (398)
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 168-519, Mailstop: 168-527
> Email: [email protected]
> WWW:  http://sunset.usc.edu/~mattmann/
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Adjunct Associate Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> 
> 
> 
> 
> 
> 
> -----Original Message-----
> From: Markus Jelsma <[email protected]>
> Reply-To: "[email protected]" <[email protected]>
> Date: Tuesday, June 17, 2014 10:55 AM
> To: "[email protected]" <[email protected]>
> Subject: RE: Nutch Extension for realtime processing
> 
>> Hi Jake,
>> 
>> It would be more pluggable if you just implement an indexer backend
>> plugin for your target (storm, spark) so you can use the existing
>> indexing filtering framework and plugins to enrich the data. If you then
>> couple the indexing logic to FetcherOutputFormat, you can skip the parse
>> (because this requires a parsing fetcher) and updatedb jobs, as well as
>> the separate indexing job. This is certainly not real time but the delay
>> is much smaller, especially if you keep to (many) small fetch jobs. In
>> our environment we can guarantee a fetched document is always indexed
>> within 15 minutes.
>> 
>> Markus 
>> 
>> -----Original message-----
>>> From:Jake Dodd <[email protected]>
>>> Sent: Tuesday 17th June 2014 19:30
>>> To: [email protected]
>>> Subject: Nutch Extension for realtime processing
>>> 
>>> Hi all,
>>> 
>>> My organization is mulling the creation of a Nutch Extension Point that
>>> would enable realtime processing of Nutch documents as they’re fetched.
>>> We have the desire to pass Nutch-fetched documents to a realtime
>>> framework such as Storm or Spark. Currently, it’s trivial to implement a
>>> custom Indexer plugin that sort of gets the job done. However, this
>>> doesn’t really meet the realtime requirement: you must wait for the
>>> fetch, parse, updatedb, index cycle to complete.
>>> 
>>> Our idea is to create a FetcherDisseminator extension point. A
>>> FetcherDisseminator would implement a disseminate() method that would
>>> take care of serialization (JSON, Avro, etc) and disseminating the data
>>> to an external entity (for example a REST interface, or a Kafka broker).
>>> 
>>> The FetcherDisseminators would be called from within the
>>> org.apache.nutch.fetcher.Fetcher.FetcherThread class. The implementation
>>> would be such that the normal fetch-parse-update-index cycle would be
>>> unaffected, even in the case of disseminator failure.
>>> 
>>> My first question is whether something like this has been discussed
>>> before by the Nutch developers, and if so, if there is any current work
>>> on the project.
>>> 
>>> My second question is whether there is any interest from the community
>>> in such a feature. If so, we’d love your input on how to go about
>>> contributing to the Nutch project.
>>> 
>>> Cheers
>>> 
>>> Jake
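
For readers following the thread, the FetcherDisseminator idea from the
original message can be sketched roughly as below. This is not the prototype
mentioned in the reply and it does not use any Nutch classes: the interface
shape, the disseminate(url, content) signature, the hand-rolled JSON
serialization, and the disseminateAll() helper are all assumptions made for
illustration. The part it tries to show is the stated requirement that a
failing disseminator must not disturb the normal fetch cycle.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only. None of these types exist in Nutch; all names
// and signatures here are hypothetical.
final class DisseminatorSketch {

    /** Hypothetical extension point: pushes one fetched document somewhere. */
    interface FetcherDisseminator {
        void disseminate(String url, String content) throws Exception;
    }

    /** Example plugin in the spirit of the "stdout" one mentioned in the thread. */
    static final class StdoutDisseminator implements FetcherDisseminator {
        @Override
        public void disseminate(String url, String content) {
            System.out.println(toJson(url, content));
        }
    }

    /** Minimal JSON serialization of a document (assumed two-field layout). */
    static String toJson(String url, String content) {
        return "{\"url\":\"" + escape(url) + "\",\"content\":\"" + escape(content) + "\"}";
    }

    /** Escapes quotes, backslashes and control characters for a JSON string. */
    static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n");  break;
                case '\r': sb.append("\\r");  break;
                case '\t': sb.append("\\t");  break;
                default:
                    if (c < 0x20) {
                        sb.append(String.format("\\u%04x", (int) c));
                    } else {
                        sb.append(c);
                    }
            }
        }
        return sb.toString();
    }

    /**
     * Stand-in for the call a fetcher thread might make after a fetch.
     * Each disseminator runs in its own try/catch, so one failure is
     * reported and skipped rather than propagated to the caller.
     * Returns the number of disseminators that failed.
     */
    static int disseminateAll(List<FetcherDisseminator> disseminators,
                              String url, String content) {
        int failures = 0;
        for (FetcherDisseminator d : disseminators) {
            try {
                d.disseminate(url, content);
            } catch (Exception e) {
                failures++;
                System.err.println("Disseminator failed for " + url + ": " + e);
            }
        }
        return failures;
    }

    public static void main(String[] args) {
        List<FetcherDisseminator> disseminators = new ArrayList<>();
        // A deliberately broken plugin, to show that the next one still runs.
        disseminators.add((url, content) -> {
            throw new IllegalStateException("broker unreachable");
        });
        disseminators.add(new StdoutDisseminator());

        int failed = disseminateAll(disseminators,
                "http://example.org/", "<html>hello</html>");
        System.out.println("failed disseminators: " + failed);
    }
}
```

A real plugin would more likely hand the serialized record to a client
library (for example a Kafka producer) and might do so asynchronously, so
that a slow endpoint does not hold up the fetcher thread; that choice is
outside what the thread specifies.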
