I am currently working on a solution that writes the data out in JSON format during the fetch itself, in the Fetcher phase. This is the only point at which the CrawlDatum and Content data are available together, so it makes sense if you are only interested in the actual website content plus the metadata and fetch status (provided by the CrawlDatum).
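
To make the idea concrete, here is a minimal sketch of what one record could look like. It is not the Fetcher code itself: the field names, the base64 encoding of the content, and the helper names to_json_line/from_json_line are all assumptions made for illustration, and Python is used only to show the shape of the data. Keeping each record on a single line is what would let a line-based consumer read it.

```python
import base64
import json


def to_json_line(url, fetch_status, metadata, content):
    """Serialize one fetched page as a single line of JSON.

    `content` is the raw page as bytes. It is base64-encoded so the
    record can never contain a literal newline or invalid text.
    All field names here are hypothetical.
    """
    record = {
        "url": url,
        "status": fetch_status,
        "metadata": metadata,
        "content": base64.b64encode(content).decode("ascii"),
    }
    # json.dumps escapes newlines inside strings, so the result is one line.
    return json.dumps(record, sort_keys=True)


def from_json_line(line):
    """Inverse of to_json_line: parse a line and decode the content bytes."""
    record = json.loads(line)
    record["content"] = base64.b64decode(record["content"])
    return record
```

A consumer in any language only needs a JSON parser and a base64 decoder to get back the URL, the fetch status, the metadata, and the original bytes.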

If you are interested, I will post it in a pastie later tonight.

Lincoln Ritter wrote:
Chris,

We've been considering doing something similar.  Since our development
environment is primarily Ruby and we don't have a ton of Java
expertise in our shop, we're looking for a way to give our Ruby people
a low-friction way of processing our crawl data.

I've mainly been thinking about a JRuby solution.  My concern is
performance, but if it's not too bad then I'll take the trade-off so
that our devs have a nicer time working with the data.

I'm pretty new to Nutch/Hadoop, but I'd love to collaborate on any
solution that makes using Ruby with Hadoop/Nutch easier.

-lincoln

--
lincolnritter.com



On Wed, Jun 11, 2008 at 3:06 PM, David Grandinetti <[EMAIL PROTECTED]> wrote:
Chris,

I'm not sure I understand completely, but I would try to write a parser
plugin that pipes content to an external Ruby process, or even just use
JRuby. That would keep you from having to worry about the complexities
of interacting with Hadoop directly.
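
The piping idea can be sketched roughly as follows. This is only an illustration of the pattern (write the fetched bytes to a child process's stdin, read the parsed result from its stdout), not a Nutch plugin; a real plugin would do the equivalent from Java. The function name parse_with_external and the 30-second timeout are assumptions.

```python
import subprocess


def parse_with_external(command, content, timeout=30):
    """Send `content` (bytes) to an external parser and return its stdout.

    `command` is the argv list of the external process. The call raises
    CalledProcessError if the parser exits with a non-zero status and
    TimeoutExpired if it runs longer than `timeout` seconds.
    """
    result = subprocess.run(
        command,
        input=content,            # written to the child's stdin
        stdout=subprocess.PIPE,   # captured and returned to the caller
        stderr=subprocess.PIPE,   # captured so it does not mix with output
        timeout=timeout,
        check=True,
    )
    return result.stdout
```

With a hypothetical script named parse_feed.rb that reads stdin and writes its result to stdout, the call would be parse_with_external(["ruby", "parse_feed.rb"], page_bytes). Starting one process per page is simple but slow; a long-lived child that reads one record per line would amortize the startup cost.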

What kind of Ruby parsing are you looking to do? I had considered doing the
same thing to parse and sanitize news feeds.

-dave

--
david grandinetti
ideas for food and code


On Jun 11, 2008, at 16:46, "Chris Anderson" <[EMAIL PROTECTED]> wrote:

We're planning to run some Ruby parsers on the fetched content from a
Nutch crawl. It seems like the best way to do this would be through an
interface like Hadoop's streaming.jar, but streaming.jar expects a
line-based input format.

Has anyone written a version of streaming.jar for Nutch? We're working
on one, so if you'd like to collaborate (or have any advice), please
reply!

Thanks,
Chris

--
Chris Anderson
http://jchris.mfdz.com


