I think you might be doing a bit of extra work there. There's no need to create XML files for Solr: when you read your fetched/parsed data, use something like SolrJ to post documents to Solr directly, without writing intermediate XML files to disk.
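
Something along these lines would do it (a minimal sketch against the SolrJ 1.x API; the URL and field names below are just placeholders, not taken from your setup):

    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class PostToSolr {
        public static void main(String[] args) throws Exception {
            // Point at the running Solr instance (example URL).
            SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");

            // Build a document straight from your parsed data -- nothing written to disk.
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "http://example.com/some-page");          // hypothetical fields,
            doc.addField("title", "Parsed page title");                   // match them to your schema
            doc.addField("content", "Cleaned body text from your own parser");

            server.add(doc);
            server.commit();
        }
    }

You could call this from the same pass that parses the raw HTML, which removes the separate XML-generation and curl steps entirely.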
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message ----
> From: James Moore <[EMAIL PROTECTED]>
> To: [email protected]
> Sent: Wednesday, June 4, 2008 3:01:38 AM
> Subject: Re: Ideas for solutions to Crawling and Solr
>
> On Thu, May 29, 2008 at 11:53 PM,  wrote:
> > We've successfully used:
> >
> > 1) Nutch to fetch + parse pages
> > 2) Custom Nutch2Solr indexer
>
> That's what I started out doing, but I hit a couple issues.
>
> First, turned out I needed to parse the pages myself. The nutch
> parse_text field is useful, but tends to contain lots of navigation
> text from the pages I've been crawling. I needed a subset of the
> page, plus more information from external sources.
>
> Right now it's a several-pass system. Nutch to fetch the data, then a
> pass to parse the raw HTML, then a pass to turn the parsed data into
> XML suitable for feeding into solr.
>
> I don't send anything straight to solr from nutch - instead, I build
> XML files in hadoop and then after hadoop is finished, I post them to
> solr using something simple like curl.
>
> --
> James Moore | [EMAIL PROTECTED]
> Ruby and Ruby on Rails consulting
> blog.restphone.com
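
(For anyone following the thread: the "post them to solr using something simple like curl" step quoted above is roughly the following, hitting Solr's standard /update handler; the URL and filename are placeholders:)

    curl http://localhost:8983/solr/update -H 'Content-Type: text/xml' --data-binary @docs.xml
    curl http://localhost:8983/solr/update -H 'Content-Type: text/xml' --data-binary '<commit/>'

The SolrJ approach above folds both of those calls into the indexing job itself.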
