Chris,

We've been considering doing something similar. Since our development
environment is primarily Ruby and we don't have a ton of Java expertise
in our shop, we're looking for a low-friction way for our Ruby people
to process our crawl data.

I've mainly been thinking about a JRuby solution.  My concern is
performance, but if it's not too bad then I'll take the trade-off so
that our devs have a nicer time working with the data.

I'm pretty new to Nutch/Hadoop, but I'd love to collaborate on any
solution that makes using Ruby with Hadoop/Nutch easier.
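
For concreteness, here's a minimal sketch of what I've been imagining:
a plain JRuby script that reads fetched pages straight out of a
segment's content SequenceFile, so everything past the read loop is
ordinary Ruby. This assumes the Nutch and Hadoop jars are on the
classpath, and the segment path below is just an example:

  require 'java'

  java_import org.apache.hadoop.conf.Configuration
  java_import org.apache.hadoop.fs.FileSystem
  java_import org.apache.hadoop.fs.Path
  java_import org.apache.hadoop.io.SequenceFile
  java_import org.apache.hadoop.io.Text
  java_import org.apache.nutch.protocol.Content

  conf = Configuration.new
  fs   = FileSystem.get(conf)

  # example path; point this at one of your own segments
  path = Path.new('crawl/segments/20080611150600/content/part-00000/data')

  reader = SequenceFile::Reader.new(fs, path, conf)
  key, value = Text.new, Content.new

  while reader.next(key, value)
    # key is the page URL; value.getContent is the raw fetched bytes,
    # so from here on it's ordinary Ruby string processing
    puts "#{key} (#{value.getContentType}): #{value.getContent.length} bytes"
  end

  reader.close

No idea yet how that performs compared to a real MapReduce job, since
it's a single sequential pass rather than a parallel one, but it might
be good enough for the exploratory work our devs want to do.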

-lincoln

--
lincolnritter.com



On Wed, Jun 11, 2008 at 3:06 PM, David Grandinetti <[EMAIL PROTECTED]> wrote:
> Chris,
>
> I'm not sure I understand completely, but I would try to write a parser
> plugin that pipes content to an external Ruby process...or even just use
> JRuby. That way you wouldn't have to worry about the complexities of
> interacting with Hadoop directly.
>
> What kind of Ruby parsing are you looking to do? I had considered doing
> the same thing to parse and sanitize news feeds.
>
> -dave
>
> --
> david grandinetti
> ideas for food and code
>
>
> On Jun 11, 2008, at 16:46, "Chris Anderson" <[EMAIL PROTECTED]> wrote:
>
>> We're planning to run some Ruby parsers on the fetched content from a
>> Nutch crawl. It seems like the best way to do this would be through an
>> interface like Hadoop's streaming.jar, but streaming.jar expects a
>> line-based input format.
>>
>> Has anyone written a version of streaming.jar for Nutch? We're working
>> on one, so if you'd like to collaborate (or have any advice), please
>> reply!
>>
>> Thanks,
>> Chris
>>
>> --
>> Chris Anderson
>> http://jchris.mfdz.com
>
