I thought you were first writing to HDFS via the usual Nutch mechanism (fetching, parsing, various Sequence and Map files), *then* processing that and converting it to XML, and *then* posting that to Solr. If that's what you are doing, then the conversion to XML is an extra step that you can just skip. I wasn't suggesting you have jobs send data directly to Solr.
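For what it's worth, the solrj side of that is only a few lines. A rough, untested sketch (assumes solrj's CommonsHttpSolrServer, a Solr instance on localhost:8983, and made-up field names -- in practice the values would come from the parsed records you read back from HDFS):

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class DirectPost {
  public static void main(String[] args) throws Exception {
    // Point solrj at one of your Solr instances.
    SolrServer solr = new CommonsHttpSolrServer("http://localhost:8983/solr");

    // In practice you'd loop over the fetched/parsed records read back
    // from HDFS; the field values below are just stand-ins.
    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", "http://example.com/some-page");
    doc.addField("title", "Example title from the parse data");
    doc.addField("content", "Body text extracted by the parser");

    solr.add(doc);
    solr.commit();
  }
}

The same process that reads the segments could post to whichever Solr instance owns that slice, so the Hadoop jobs themselves stay free of Solr dependencies.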
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message ----
> From: James Moore <[EMAIL PROTECTED]>
> To: [email protected]
> Sent: Wednesday, June 4, 2008 7:34:53 PM
> Subject: Re: Ideas for solutions to Crawling and Solr
>
> On Wed, Jun 4, 2008 at 1:35 PM,  wrote:
> > I think you might be doing a bit of extra work there. There is no need to
> > create XML files for Solr. When you read fetched/parsed data, use something
> > like solrj to post to Solr without creating intermediary XML files on disk.
>
> I might be misunderstanding you, but it seems like it's better for me to
> deal with the xml files rather than something like ruby-solr or solrj.
> I don't want any of the hadoop jobs to have solr dependencies - they
> just write text xml files in the normal hadoop way, and someone
> else is responsible for getting the results into solr. In this case,
> it's some fairly trivial shell scripts that run on each solr machine
> and do a dfs cat /whatever.xml | post_to_a_solr_instance at the end of
> the run. (Using solr clustering here, so each machine is responsible
> for loading only its own xml files.)
>
> But I'd be happy to skip a step - am I just missing something obvious?
>
> --
> James Moore | [EMAIL PROTECTED]
> Ruby and Ruby on Rails consulting
> blog.restphone.com
