On Wed, Jun 11, 2008 at 3:06 PM, David Grandinetti <[EMAIL PROTECTED]> wrote:
> Chris,
>
> I'm not sure I understand completely, but I would try to write a parser
> plugin that pipes content to an external ruby process...or even just use
> JRuby. This way would keep you from having to worry about the complexities
> of interacting with Hadoop directly.
Well, I have lots of reasons to run additional Hadoop jobs on the data, and so far I've found that interacting with Hadoop is easier than interacting with Nutch, for the most part -- so I'd like to be able to process fetched documents using the streaming jar or other generic Hadoop jobs. It looks like our team is about to have a solution to the problem, so hopefully we'll be able to post something soon.

The nice thing about streaming.jar is that it works with any process that can take input over stdin, so Ruby is strictly an afterthought. :) Our solution will likely use JSON as the line protocol for the map and reduce scripts to handle -- well-formed JSON is a great container for fetched HTML and a little associated metadata.

> What kind of ruby parsing are you looking to do? I had considered doing
> the same thing to parse and sanitize news feeds.

The Ruby parsers all use Hpricot.XML, and they're wrapped really tightly around our domain. I do recommend Hpricot for all your parsing needs; it's fast and not prone to (many) surprises.

-- 
Chris Anderson
http://jchris.mfdz.com
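[Editor's sketch: the message describes a Hadoop Streaming mapper that reads one JSON record per line over stdin, but the thread doesn't include the actual scripts. The following is a minimal illustration of that shape, not Chris's code; the field names ("url", "html") and the emitted value are assumptions.]

```ruby
#!/usr/bin/env ruby
# Hypothetical Hadoop Streaming mapper speaking a JSON line protocol:
# one well-formed JSON document per stdin line, tab-separated
# key/value pairs on stdout. Field names are illustrative only.
require 'json'

# Turn one JSON record into a "key\tvalue" line for Hadoop,
# or nil if the record is malformed.
def map_line(line)
  doc = JSON.parse(line)
  "#{doc['url']}\t#{doc['html'].to_s.length}"
rescue JSON::ParserError
  nil
end

# Streaming contract: read records from stdin, write results to stdout.
if __FILE__ == $PROGRAM_NAME
  STDIN.each_line do |line|
    out = map_line(line)
    puts out if out
  end
end
```

Because the mapper is just a stdin/stdout filter, you can test it locally with `cat fetched.json | ./mapper.rb` before handing it to streaming.jar.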
