Chris,
I'm not sure I understand completely, but I would try writing a
parser plugin that pipes content to an external Ruby process, or even
just use JRuby. That way you wouldn't have to worry about the
complexities of interacting with Hadoop directly.
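To make the piping idea concrete, here is a rough sketch of the process-handling part only. ExternalParser and the parse.rb script name are made up for illustration and are not part of Nutch. Wiring it into an actual parser plugin is left out. It assumes Java 9 or later for readAllBytes.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.List;

// Hypothetical helper, not part of Nutch: hands one fetched document to an
// external command on stdin and returns whatever the command prints on stdout.
final class ExternalParser {
    // For example List.of("ruby", "parse.rb"); the script name is an assumption.
    private final List<String> command;

    ExternalParser(List<String> command) {
        this.command = command;
    }

    String parse(byte[] content) throws IOException, InterruptedException {
        ProcessBuilder builder = new ProcessBuilder(command);
        // Let the child's stderr show up in our own stderr.
        builder.redirectError(ProcessBuilder.Redirect.INHERIT);
        Process process = builder.start();

        // Feed stdin from a separate thread so a large document cannot
        // deadlock against a full stdout pipe.
        Thread writer = new Thread(() -> {
            try (OutputStream stdin = process.getOutputStream()) {
                stdin.write(content);
            } catch (IOException e) {
                // The child closed stdin early; the exit status check below
                // reports real failures.
            }
        });
        writer.start();

        byte[] output = process.getInputStream().readAllBytes();
        writer.join();
        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("external parser exited with status " + exitCode);
        }
        // Assumes the external parser writes UTF-8.
        return new String(output, StandardCharsets.UTF_8);
    }
}
```

Starting one process per document is the simplest thing that could work. For a large crawl you would probably want a long-lived Ruby process and some record framing instead.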
What kind of Ruby parsing are you looking to do? I had considered
doing the same thing to parse and sanitize news feeds.
-dave
--
david grandinetti
ideas for food and code
On Jun 11, 2008, at 16:46, "Chris Anderson" <[EMAIL PROTECTED]> wrote:
We're planning to run some Ruby parsers on the fetched content from a
Nutch crawl. It seems like the best way to do this would be through an
interface like Hadoop's streaming.jar, but streaming.jar expects a
line-based input format.
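One possible way around the line-based constraint, as a sketch only: flatten each fetched page into a single line of the form URL, tab, Base64-encoded content, and decode it again on the other side. LineRecord is a hypothetical name, and the layout is an assumption for illustration, not something Nutch or streaming.jar defines. It also assumes the URL itself contains no tab or newline.

```java
import java.util.Base64;

// Hypothetical record format: "<url>\t<base64 of content>", one record per line.
final class LineRecord {
    // The basic Base64 encoder never emits line breaks, so the result is one line.
    static String encode(String url, byte[] content) {
        return url + "\t" + Base64.getEncoder().encodeToString(content);
    }

    // Everything before the first tab is the URL.
    static String url(String line) {
        return line.substring(0, line.indexOf('\t'));
    }

    // Everything after the first tab is the Base64-encoded page content.
    static byte[] content(String line) {
        return Base64.getDecoder().decode(line.substring(line.indexOf('\t') + 1));
    }
}
```

A Ruby script on the receiving end could split each line on the first tab and decode the second field with its standard Base64 module.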
Has anyone written a version of streaming.jar for Nutch? We're working
on one, so if you'd like to collaborate (or have any advice), please
reply!
Thanks,
Chris
--
Chris Anderson
http://jchris.mfdz.com