Chris,

We've been considering doing something similar. Our development environment is primarily Ruby, and we don't have much Java expertise in our shop, so we're looking for a low-friction way for our Ruby people to process our crawl data.

I've mainly been thinking about a JRuby solution. My concern is performance, but if the hit isn't too bad I'll take the trade-off so that our devs have a nicer time working with the data.
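To make that concrete, here's roughly what I'm picturing: JRuby reading fetched content straight out of a segment. This is only a sketch; it assumes the Hadoop and Nutch jars are on the classpath, and the segment path and the Text/Content record classes are my reading of the Nutch segment layout, so treat them as assumptions.

require 'java'

java_import 'org.apache.hadoop.conf.Configuration'
java_import 'org.apache.hadoop.fs.FileSystem'
java_import 'org.apache.hadoop.fs.Path'
java_import 'org.apache.hadoop.io.SequenceFile'
java_import 'org.apache.hadoop.io.Text'
java_import 'org.apache.nutch.protocol.Content'

conf = Configuration.new
fs   = FileSystem.get(conf)
# the 'content' part of a segment is a MapFile directory; its 'data' file
# is a SequenceFile of url (Text) => fetched page (Content)
data = Path.new('crawl/segments/20080611150000/content/part-00000/data')

reader = SequenceFile::Reader.new(fs, data, conf)
url, page = Text.new, Content.new
while reader.next(url, page)
  # page.getContent is the raw fetched byte[]; hand it off to plain Ruby here
  puts "#{url}  #{page.getContentType}  #{page.getContent.length} bytes"
end
reader.close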
I'm pretty new to Nutch/Hadoop, but I'd love to collaborate on any solution that makes using Ruby with Hadoop/Nutch easier.

-lincoln

--
lincolnritter.com

On Wed, Jun 11, 2008 at 3:06 PM, David Grandinetti <[EMAIL PROTECTED]> wrote:
> Chris,
>
> I'm not sure I understand completely, but I would try to write a parser
> plugin that pipes content to an external Ruby process... or even just use
> JRuby. That way you wouldn't have to worry about the complexities of
> interacting with Hadoop directly.
>
> What kind of Ruby parsing are you looking to do? I had considered doing the
> same thing to parse and sanitize news feeds.
>
> -dave
>
> --
> david grandinetti
> ideas for food and code
>
>
> On Jun 11, 2008, at 16:46, "Chris Anderson" <[EMAIL PROTECTED]> wrote:
>
>> We're planning to run some Ruby parsers on the fetched content from a
>> Nutch crawl. It seems like the best way to do this would be through an
>> interface like Hadoop's streaming.jar, but streaming.jar expects a
>> line-based input format.
>>
>> Has anyone written a version of streaming.jar for Nutch? We're working
>> on one, so if you'd like to collaborate (or have any advice), please
>> reply!
>>
>> Thanks,
>> Chris
>>
>> --
>> Chris Anderson
>> http://jchris.mfdz.com
>
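P.S. For what it's worth, the Ruby half of the external-process route Dave describes could be as simple as the script below. The framing is my own guess at a contract (one document per invocation: raw bytes on stdin, extracted text on stdout); Nutch doesn't define anything like it.

# the Java parse plugin would exec this script, write the fetched page
# to its stdin, and read the extracted text back from its stdout
$stdin.binmode
raw  = $stdin.read
text = raw.gsub(/<[^>]*>/m, ' ')   # naive tag stripping, just for illustration
$stdout.write(text.squeeze(' ').strip)

Spawning a process per document would be slow, so a long-lived worker with some length-prefixed framing is probably where this would end up, but the shape is the same.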
