Very interesting, thanks! I'm still working out the best way to do this for my project, and a multi-pass solution might make sense. Currently I'm using Python to fetch (urllib), parse (BeautifulSoup), and post to Solr (with solr.py).
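
For what it's worth, my current single-pass flow looks roughly like the sketch below. It's Python 2 era code to match the libraries above (urllib, BeautifulSoup 3, solr.py); the Solr URL and the id/title/text field names are placeholders for whatever your schema uses, and the SolrConnection/add/commit calls assume the solr.py client interface (exact constructor arguments differ between solr.py versions):

import urllib
import solr                               # solr.py / solrpy client
from BeautifulSoup import BeautifulSoup   # BeautifulSoup 3 import path

conn = solr.SolrConnection('http://localhost:8983/solr')  # assumed local Solr

def index_page(url):
    html = urllib.urlopen(url).read()
    soup = BeautifulSoup(html)
    title = soup.title.string if soup.title else url
    # Crude whole-page text; a real parsing pass would strip navigation here.
    body = ' '.join(soup.findAll(text=True))
    conn.add(id=url, title=title, text=body)

index_page('http://example.com/')
conn.commit()
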
On Wed, Jun 4, 2008 at 7:01 PM, James Moore <[EMAIL PROTECTED]> wrote:
> On Thu, May 29, 2008 at 11:53 PM, <[EMAIL PROTECTED]> wrote:
>> We've successfully used:
>>
>> 1) Nutch to fetch + parse pages
>> 2) Custom Nutch2Solr indexer
>
> That's what I started out doing, but I hit a couple issues.
>
> First, turned out I needed to parse the pages myself. The nutch
> parse_text field is useful, but tends to contain lots of navigation
> text from the pages I've been crawling. I needed a subset of the
> page, plus more information from external sources.
>
> Right now it's a several-pass system. Nutch to fetch the data, then a
> pass to parse the raw HTML, then a pass to turn the parsed data into
> XML suitable for feeding into solr.
>
> I don't send anything straight to solr from nutch - instead, I build
> XML files in hadoop and then after hadoop is finished, I post them to
> solr using something simple like curl.
>
> --
> James Moore | [EMAIL PROTECTED]
> Ruby and Ruby on Rails consulting
> blog.restphone.com
>
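
For anyone following the thread, the "build XML files ... post them to solr using something simple like curl" step is just Solr's standard XML update format. A minimal example, saved as docs.xml (field names are placeholders, and the URL assumes a default local Solr install):

<add>
  <doc>
    <field name="id">http://example.com/page1</field>
    <field name="title">Example page</field>
    <field name="text">Parsed body text goes here</field>
  </doc>
</add>

# post the batch, then commit
curl 'http://localhost:8983/solr/update' -H 'Content-Type: text/xml' --data-binary @docs.xml
curl 'http://localhost:8983/solr/update' -H 'Content-Type: text/xml' --data-binary '<commit/>'
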
