I thought you were first writing to HDFS via the usual Nutch mechanism (fetching, parsing, various Sequence and Map files), *then* processing that and converting it to XML, and *then* posting that to Solr. If that's what you are doing, then the conversion to XML is an extra step that you can just skip. I wasn't suggesting you have jobs send data directly to Solr.
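For what it's worth, the solrj side of that is only a few lines. A rough, untested sketch (assumes solrj's CommonsHttpSolrServer, a Solr instance on localhost:8983, and made-up field names -- in practice the values would come from the parsed records you read back from HDFS):

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class DirectPost {
  public static void main(String[] args) throws Exception {
    // Point solrj at one of your Solr instances.
    SolrServer solr = new CommonsHttpSolrServer("http://localhost:8983/solr");

    // In practice you'd loop over the fetched/parsed records read back
    // from HDFS; the field values below are just stand-ins.
    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", "http://example.com/some-page");
    doc.addField("title", "Example title from the parse data");
    doc.addField("content", "Body text extracted by the parser");

    solr.add(doc);
    solr.commit();
  }
}

The same process that reads the segments could post to whichever Solr instance owns that slice, so the Hadoop jobs themselves stay free of Solr dependencies.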
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message ----
> From: James Moore <[EMAIL PROTECTED]>
> To: [email protected]
> Sent: Wednesday, June 4, 2008 7:34:53 PM
> Subject: Re: Ideas for solutions to Crawling and Solr
>
> On Wed, Jun 4, 2008 at 1:35 PM,  wrote:
> > I think you might be doing a bit of extra work there. There is no need to
> > create XML files for Solr. When you read fetched/parsed data, use something
> > like solrj to post to Solr without creating intermediary XML files on disk.
>
> I might be misunderstanding you, but it seems like it's better for me to
> deal with the xml files rather than something like ruby-solr or solrj.
> I don't want any of the hadoop jobs to have solr dependencies - they
> just write text xml files in the normal hadoop way, and someone
> else is responsible for getting the results into solr. In this case,
> it's some fairly trivial shell scripts that run on each solr machine
> and do a dfs cat /whatever.xml | post_to_a_solr_instance at the end of
> the run. (Using solr clustering here, so each machine is responsible
> for loading only its own xml files.)
>
> But I'd be happy to skip a step - am I just missing something obvious?
>
> --
> James Moore | [EMAIL PROTECTED]
> Ruby and Ruby on Rails consulting
> blog.restphone.com
