Jake, I am totally interested in this. Contributing to Nutch (and more
generally to Apache projects) is described really well (by Dennis Kubes)
here:

http://wiki.apache.org/nutch/Becoming_A_Nutch_Developer


Looking forward to seeing your contributions!

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: [email protected]
WWW:  http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++






-----Original Message-----
From: Markus Jelsma <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Tuesday, June 17, 2014 10:55 AM
To: "[email protected]" <[email protected]>
Subject: RE: Nutch Extension for realtime processing

>Hi Jake,
>
>It would be more pluggable if you just implement an indexer backend
>plugin for your target (storm, spark) so you can use the existing
>indexing filtering framework and plugins to enrich the data. If you then
>couple the indexing logic to FetcherOutputFormat, you can skip the parse
>(because this requires a parsing fetcher) and updatedb jobs, as well as
>the separate indexing job. This is certainly not real time but the delay
>is much smaller, especially if you keep to (many) small fetch jobs. In
>our environment we can guarantee a fetched document is always indexed
>within 15 minutes.
>
>Markus 
> 
>-----Original message-----
>> From:Jake Dodd <[email protected]>
>> Sent: Tuesday 17th June 2014 19:30
>> To: [email protected]
>> Subject: Nutch Extension for realtime processing
>> 
>> Hi all,
>> 
>> My organization is mulling the creation of a Nutch Extension Point that
>>would enable realtime processing of Nutch documents as they're fetched.
>>We have the desire to pass Nutch-fetched documents to a realtime
>>framework such as Storm or Spark. Currently, it's trivial to implement a
>>custom Indexer plugin that sort of gets the job done. However, this
>>doesn't really meet the realtime requirement: you must wait for the
>>fetch, parse, updatedb, index cycle to complete.
>> 
>> Our idea is to create a FetcherDisseminator extension point. A
>>FetcherDisseminator would implement a disseminate() method that would
>>take care of serialization (JSON, Avro, etc) and disseminating the data
>>to an external entity (for example a REST interface, or a Kafka broker).
>> 
>> The FetcherDisseminators would be called from within the
>>org.apache.nutch.fetcher.Fetcher.FetcherThread class. The implementation
>>would be such that the normal fetch-parse-update-index cycle would be
>>unaffected, even in the case of disseminator failure.
>> 
>> My first question is whether something like this has been discussed
>>before by the Nutch developers, and if so, if there is any current work
>>on the project.
>> 
>> My second question is whether there is any interest from the community
>>in such a feature. If so, we'd love your input on how to go about
>>contributing to the Nutch project.
>> 
>> Cheers
>> 
>> Jake
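
For illustration only, the FetcherDisseminator idea described in Jake's message could be sketched roughly as below. This is a minimal, self-contained sketch and not existing Nutch code: the interface shape, the disseminate(url, content) signature, and the disseminateAll helper are all assumptions made for the example. It shows only the failure-isolation property from the proposal, namely that a failing disseminator is caught and logged so the caller (the fetcher thread, in the proposal) carries on unaffected. Serialization and transport (JSON, Avro, REST, Kafka) are left to each implementation.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class DisseminatorSketch {

    // Hypothetical extension point; the name follows the proposal in the
    // thread, but this signature is an assumption, not a Nutch API.
    public interface FetcherDisseminator {
        void disseminate(String url, byte[] content) throws Exception;
    }

    // Calls every disseminator in turn. Any exception is caught and logged
    // so that one failing disseminator cannot interrupt the caller.
    // Returns the number of disseminators that failed.
    public static int disseminateAll(List<FetcherDisseminator> disseminators,
                                     String url, byte[] content) {
        int failures = 0;
        for (FetcherDisseminator d : disseminators) {
            try {
                d.disseminate(url, content);
            } catch (Exception e) {
                failures++;
                System.err.println("Disseminator failed for " + url + ": " + e);
            }
        }
        return failures;
    }

    public static void main(String[] args) {
        List<FetcherDisseminator> disseminators = new ArrayList<>();
        // Stand-in for a real target such as a REST endpoint or Kafka broker.
        disseminators.add((url, content) ->
                System.out.println("Sent " + content.length + " bytes for " + url));
        // Simulates an unavailable target.
        disseminators.add((url, content) -> {
            throw new IllegalStateException("target unavailable");
        });

        byte[] page = "<html>example</html>".getBytes(StandardCharsets.UTF_8);
        int failures = disseminateAll(disseminators, "http://example.com/", page);
        System.out.println("Failures: " + failures);
    }
}
```

In a real implementation the call would presumably be made asynchronously or with a timeout so that a slow target does not delay fetching; that choice is not covered by the thread and is left open here.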
