On Thu, May 29, 2008 at 11:53 PM, <[EMAIL PROTECTED]> wrote:
> We've successfully used:
>
> 1) Nutch to fetch + parse pages
> 2) Custom Nutch2Solr indexer
That's what I started out doing, but I hit a couple of issues. First, it turned out I needed to parse the pages myself. The Nutch parse_text field is useful, but it tends to contain lots of navigation text from the pages I've been crawling, and I needed only a subset of the page, plus more information from external sources.

Right now it's a several-pass system: Nutch to fetch the data, then a pass to parse the raw HTML, then a pass to turn the parsed data into XML suitable for feeding into Solr. I don't send anything straight to Solr from Nutch - instead, I build the XML files in Hadoop, and after Hadoop is finished, I post them to Solr using something simple like curl.

--
James Moore | [EMAIL PROTECTED]
Ruby and Ruby on Rails consulting
blog.restphone.com
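For anyone curious, the last pass is simple. A minimal sketch of it in Python - the field names (id, title, body), filename, and Solr URL here are made up for illustration, not taken from my actual setup:

```python
# Build a Solr <add> document from parsed page data, then post it
# with curl once the Hadoop jobs have finished.
import xml.etree.ElementTree as ET

def solr_add_xml(docs):
    """Turn a list of field dicts into Solr update XML."""
    add = ET.Element("add")
    for fields in docs:
        doc = ET.SubElement(add, "doc")
        for name, value in fields.items():
            field = ET.SubElement(doc, "field", name=name)
            field.text = str(value)
    return ET.tostring(add, encoding="unicode")

# Hypothetical parsed output from the earlier passes:
xml = solr_add_xml([{"id": "1", "title": "Example", "body": "parsed page text"}])
with open("docs.xml", "w") as f:
    f.write(xml)

# Then post the files and commit, e.g.:
#   curl http://localhost:8983/solr/update -H 'Content-Type: text/xml' --data-binary @docs.xml
#   curl http://localhost:8983/solr/update -H 'Content-Type: text/xml' --data-binary '<commit/>'
```

The nice part of writing plain XML files is that the Hadoop side doesn't need to know anything about Solr at all.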
