On Wed, Jun 11, 2008 at 3:06 PM, David Grandinetti <[EMAIL PROTECTED]> wrote:
> Chris,
>
> I'm not sure I understand completely, but I would try to write a parser
> plugin that pipes content to an external ruby process...or even just use
> JRuby. This way would keep you from having to worry about the complexities
> of interacting with Hadoop directly.
Well, I have lots of reasons to run additional Hadoop jobs on the data, and so far I've found that interacting with Hadoop is easier than interacting with Nutch, for the most part -- so I'd like to be able to process fetched documents using the streaming jar or other generic Hadoop jobs. It looks like our team is about to have a solution to the problem, so hopefully we'll be able to post something soon.

The nice thing about streaming.jar is that it works with any process that can take input over stdin, so Ruby is strictly an afterthought. :) Our solution will likely use JSON as the line protocol for the map and reduce scripts to handle -- well-formed JSON is a great container for fetched HTML and a little associated metadata.

> What kind of ruby parsing are you looking to do? I had considered doing
> the same thing to parse and sanitize news feeds.

The Ruby parsers all use Hpricot.XML, and they're wrapped really tightly around our domain. I do recommend Hpricot for all your parsing needs; it's fast and not prone to (many) surprises.

-- 
Chris Anderson
http://jchris.mfdz.com
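[Editor's sketch: the message describes a Hadoop Streaming mapper that reads one JSON record per line over stdin, but the thread doesn't include the actual scripts. The following is a minimal illustration of that shape, not Chris's code; the field names ("url", "html") and the emitted value are assumptions.]

```ruby
#!/usr/bin/env ruby
# Hypothetical Hadoop Streaming mapper speaking a JSON line protocol:
# one well-formed JSON document per stdin line, tab-separated
# key/value pairs on stdout. Field names are illustrative only.
require 'json'

# Turn one JSON record into a "key\tvalue" line for Hadoop,
# or nil if the record is malformed.
def map_line(line)
  doc = JSON.parse(line)
  "#{doc['url']}\t#{doc['html'].to_s.length}"
rescue JSON::ParserError
  nil
end

# Streaming contract: read records from stdin, write results to stdout.
if __FILE__ == $PROGRAM_NAME
  STDIN.each_line do |line|
    out = map_line(line)
    puts out if out
  end
end
```

Because the mapper is just a stdin/stdout filter, you can test it locally with `cat fetched.json | ./mapper.rb` before handing it to streaming.jar.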
