On Thu, May 29, 2008 at 11:53 PM,  <[EMAIL PROTECTED]> wrote:
> We've successfully used:
>
> 1) Nutch to fetch + parse pages
> 2) Custom Nutch2Solr indexer

That's what I started out doing, but I hit a couple of issues.

First, it turned out I needed to parse the pages myself.  The Nutch
parse_text field is useful, but it tends to contain lots of navigation
text from the pages I've been crawling.  I needed only a subset of
each page, plus more information from external sources.

Right now it's a several-pass system: Nutch fetches the data, then one
pass parses the raw HTML, and a final pass turns the parsed data into
XML suitable for feeding into Solr.
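For reference, the XML that Solr's update handler accepts looks
roughly like this (the field names below are made up for illustration
and would have to match your schema.xml):

```xml
<add>
  <doc>
    <field name="id">http://example.com/page1</field>
    <field name="title">Example page</field>
    <field name="body">Parsed page text, minus the navigation chrome.</field>
  </doc>
</add>
```

The final pass just has to emit files in this shape, one <doc> per
crawled page.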

I don't send anything straight to Solr from Nutch - instead, I build
XML files in Hadoop, and after Hadoop is finished, I post them to
Solr using something simple like curl.
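That posting step can be sketched as a small shell script.  This is a
minimal example, not my actual script: the file name, field names, and
SOLR_URL guard are all made up, and it assumes a Solr 1.x-style
/solr/update endpoint.

```shell
#!/bin/sh
# Write one Solr <add> document (in practice, Hadoop produces these files).
cat > docs.xml <<'EOF'
<add>
  <doc>
    <field name="id">http://example.com/page1</field>
    <field name="title">Example page</field>
    <field name="body">Parsed page text, minus the navigation chrome.</field>
  </doc>
</add>
EOF

# Set SOLR_URL to a live instance to actually post, e.g.
#   SOLR_URL=http://localhost:8983/solr/update
if [ -n "${SOLR_URL:-}" ]; then
  # POST the document batch as XML.
  curl "$SOLR_URL" -H 'Content-Type: text/xml' --data-binary @docs.xml
  # Commit so the new documents become searchable.
  curl "$SOLR_URL" -H 'Content-Type: text/xml' --data-binary '<commit/>'
fi
```

The nice part of doing it this way is that the Hadoop output is plain
files, so re-posting after a Solr schema change is just re-running curl.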

-- 
James Moore | [EMAIL PROTECTED]
Ruby and Ruby on Rails consulting
blog.restphone.com
