I think you might be doing a bit of extra work there. There's no need to create XML files for Solr: when you read your fetched/parsed data, use something like SolrJ to post documents to Solr directly, without writing intermediate XML files to disk.
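
Something along these lines would do it (a minimal sketch against the SolrJ 1.x API; the URL and field names below are just placeholders, not taken from your setup):

    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class PostToSolr {
        public static void main(String[] args) throws Exception {
            // Point at the running Solr instance (example URL).
            SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");

            // Build a document straight from your parsed data -- nothing written to disk.
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "http://example.com/some-page");          // hypothetical fields,
            doc.addField("title", "Parsed page title");                   // match them to your schema
            doc.addField("content", "Cleaned body text from your own parser");

            server.add(doc);
            server.commit();
        }
    }

You could call this from the same pass that parses the raw HTML, which removes the separate XML-generation and curl steps entirely.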
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message ----
> From: James Moore <[EMAIL PROTECTED]>
> To: [email protected]
> Sent: Wednesday, June 4, 2008 3:01:38 AM
> Subject: Re: Ideas for solutions to Crawling and Solr
>
> On Thu, May 29, 2008 at 11:53 PM,  wrote:
> > We've successfully used:
> >
> > 1) Nutch to fetch + parse pages
> > 2) Custom Nutch2Solr indexer
>
> That's what I started out doing, but I hit a couple issues.
>
> First, turned out I needed to parse the pages myself. The nutch
> parse_text field is useful, but tends to contain lots of navigation
> text from the pages I've been crawling. I needed a subset of the
> page, plus more information from external sources.
>
> Right now it's a several-pass system. Nutch to fetch the data, then a
> pass to parse the raw HTML, then a pass to turn the parsed data into
> XML suitable for feeding into solr.
>
> I don't send anything straight to solr from nutch - instead, I build
> XML files in hadoop and then after hadoop is finished, I post them to
> solr using something simple like curl.
>
> --
> James Moore | [EMAIL PROTECTED]
> Ruby and Ruby on Rails consulting
> blog.restphone.com
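
(For anyone following the thread: the "post them to solr using something simple like curl" step quoted above is roughly the following, hitting Solr's standard /update handler; the URL and filename are placeholders:)

    curl http://localhost:8983/solr/update -H 'Content-Type: text/xml' --data-binary @docs.xml
    curl http://localhost:8983/solr/update -H 'Content-Type: text/xml' --data-binary '<commit/>'

The SolrJ approach above folds both of those calls into the indexing job itself.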
