Jake,

I am totally interested in this. Contributing to Nutch (and more generally to Apache projects) is described really well (by Dennis Kubes) here:
http://wiki.apache.org/nutch/Becoming_A_Nutch_Developer

Looking forward to seeing your contributions!

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: [email protected]
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

-----Original Message-----
From: Markus Jelsma <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Tuesday, June 17, 2014 10:55 AM
To: "[email protected]" <[email protected]>
Subject: RE: Nutch Extension for realtime processing

>Hi Jake,
>
>It would be more pluggable if you just implement an indexer backend
>plugin for your target (Storm, Spark) so you can use the existing
>indexing filtering framework and plugins to enrich the data. If you then
>couple the indexing logic to FetcherOutputFormat, you can skip the parse
>(because this requires a parsing fetcher) and updatedb jobs, as well as
>the separate indexing job. This is certainly not real time but the delay
>is much smaller, especially if you keep to (many) small fetch jobs. In
>our environment we can guarantee a fetched document is always indexed
>within 15 minutes.
>
>Markus
>
>-----Original message-----
>> From:Jake Dodd <[email protected]>
>> Sent: Tuesday 17th June 2014 19:30
>> To: [email protected]
>> Subject: Nutch Extension for realtime processing
>>
>> Hi all,
>>
>> My organization is mulling the creation of a Nutch Extension Point that
>> would enable realtime processing of Nutch documents as they're fetched.
>> We have the desire to pass Nutch-fetched documents to a realtime
>> framework such as Storm or Spark. Currently, it's trivial to implement a
>> custom Indexer plugin that sort of gets the job done. However, this
>> doesn't really meet the realtime requirement; you must wait for the
>> fetch, parse, updatedb, index cycle to complete.
>>
>> Our idea is to create a FetcherDisseminator extension point. A
>> FetcherDisseminator would implement a disseminate() method that would
>> take care of serialization (JSON, Avro, etc.) and disseminating the data
>> to an external entity (for example a REST interface, or a Kafka broker).
>>
>> The FetcherDisseminators would be called from within the
>> org.apache.nutch.fetcher.Fetcher.FetcherThread class. The implementation
>> would be such that the normal fetch-parse-update-index cycle would be
>> unaffected, even in the case of disseminator failure.
>>
>> My first question is whether something like this has been discussed
>> before by the Nutch developers, and if so, if there is any current work
>> on the project.
>>
>> My second question is whether there is any interest from the community
>> in such a feature. If so, we'd love your input on how to go about
>> contributing to the Nutch project.
>>
>> Cheers
>>
>> Jake
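
For illustration only, here is a rough sketch of what the FetcherDisseminator idea described in the original message might look like. Nothing below exists in Nutch: the interface name comes from the proposal, while the disseminate(String, byte[]) signature, the InMemoryDisseminator and FailingDisseminator classes, and the disseminateAll() helper are assumptions made up for this example. The in-memory list stands in for a REST endpoint or Kafka broker, and the hand-built JSON string stands in for real serialization.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class DisseminatorSketch {

  /** Hypothetical extension point from the proposal; not part of Nutch. */
  public interface FetcherDisseminator {
    // Assumed signature: the proposal only names a disseminate() method.
    void disseminate(String url, byte[] content) throws Exception;
  }

  /** Stand-in for a REST or Kafka target: serializes and keeps messages in memory. */
  public static class InMemoryDisseminator implements FetcherDisseminator {
    public final List<String> sent = new ArrayList<>();

    @Override
    public void disseminate(String url, byte[] content) {
      // Minimal hand-rolled JSON with the URL and the content length.
      String safeUrl = url.replace("\\", "\\\\").replace("\"", "\\\"");
      sent.add("{\"url\":\"" + safeUrl + "\",\"bytes\":" + content.length + "}");
    }
  }

  /** Simulates an unreachable external system. */
  public static class FailingDisseminator implements FetcherDisseminator {
    @Override
    public void disseminate(String url, byte[] content) throws Exception {
      throw new Exception("external endpoint unavailable");
    }
  }

  /**
   * Calls every disseminator and swallows failures, so the caller (in the
   * proposal, the fetcher thread) carries on regardless. Returns the number
   * of disseminators that failed.
   */
  public static int disseminateAll(List<FetcherDisseminator> disseminators,
                                   String url, byte[] content) {
    int failures = 0;
    for (FetcherDisseminator d : disseminators) {
      try {
        d.disseminate(url, content);
      } catch (Exception e) {
        failures++;
        System.err.println("Disseminator failed for " + url + ": " + e.getMessage());
      }
    }
    return failures;
  }

  public static void main(String[] args) {
    InMemoryDisseminator ok = new InMemoryDisseminator();
    List<FetcherDisseminator> all = Arrays.asList(new FailingDisseminator(), ok);
    byte[] page = "<html></html>".getBytes(StandardCharsets.UTF_8);
    int failures = disseminateAll(all, "http://example.org/", page);
    System.out.println("failures=" + failures + ", delivered=" + ok.sent);
  }
}
```

The point of the sketch is the try/catch around each call: a failing disseminator is logged and counted, but the loop continues and nothing propagates to the caller, which is the "unaffected, even in the case of disseminator failure" property the proposal asks for.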

