[Nutch Wiki] Update of "RunningNutchAndSolr" by NickTkach

Apache Wiki Thu, 27 Mar 2008 13:32:00 -0700

The following page has been changed by NickTkach:
http://wiki.apache.org/nutch/RunningNutchAndSolr

New page:
This is a quick first pass at a guide for getting Nutch running with Solr. 
There are probably better ways of doing some or all of it, but I'm not aware 
of them, so please correct or update this page if you have a better idea.  
Many thanks to [http://variogram.com|Brian Whitman at Variogr.am] and 
[http://blog.foofactory.fi|Sami Siren at FooFactory] for all the help!  You guys 
saved me a lot of time! :)

I'm posting it under Nutch rather than Solr on the presumption that people are 
more likely to learn or use Solr first and then come here looking to combine 
it with Nutch.  For now I'm skipping a command-by-command walkthrough.  I'm 
building and running on Ubuntu 7.10 with Java 1.6.0_05, and I'm assuming that 
the Solr trunk code is checked out into solr-trunk and the Nutch trunk code 
into nutch-trunk.

 1. Check out solr-trunk and nutch-trunk
 1. Go into solr-trunk and run 'ant dist dist-solrj'
 1. Get the zip from [http://variogram.com/latest/SolrIndexer.zip|Variogr.am] and 
unzip it to solr-trunk
 1. Copy apache-solr-solrj-1.3-dev.jar and apache-solr-common-1.3-dev.jar to 
nutch-trunk/lib
 1. Get the zip file from 
[http://blog.foofactory.fi/2007/02/online-indexing-integrating-nutch-with.html|FooFactory]
 for SOLR-20
 1. Unzip solr-client.zip somewhere, go into java/solr/src and run 'ant'
 1. Copy solr-client.jar from dist to nutch-trunk/lib
 1. Copy xpp3-1.1.3.4.0.jar from lib to nutch-trunk/lib
 1. Get SolrClientAdapter.java from 
[http://www.foofactory.fi/files/nutch-solr/nutch_solr.patch|FooFactory patch] 
and copy it to nutch-trunk/src/java/org/apache/nutch/indexer
 1. Edit nutch-trunk/src/java/org/apache/nutch/indexer/SolrIndexer.java:
   * Replace 'int res = new SolrIndexer().doMain(NutchConfiguration.create(), 
args);' with 'int res = ToolRunner.run(NutchConfiguration.create(), new 
SolrIndexer(), args);'
   * Edit the imports to pick up ToolRunner (org.apache.hadoop.util.ToolRunner)
 1. Edit nutch-trunk/src/java/org/apache/nutch/indexer/Indexer.java, changing 
the visibility of LuceneDocumentWrapper from private to protected
 1. Configure nutch-trunk/conf/nutch-site.xml with settings for your site 
including a value for property indexer.solr.url (something like 
http://localhost:8983/solr/)
 1. Configure some url(s) to crawl (files in a urls directory)
 1. Get the [http://www.foofactory.fi/files/nutch-solr/crawl.sh|crawl.sh script] 
from FooFactory and copy it to nutch-trunk/bin (editing if needed)
 1. Start a Solr server (such as the solr-trunk/example instance)
 1. Run a Nutch crawl using the bin/crawl.sh script.
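
As a rough illustration of the two configuration steps above (nutch-site.xml and the urls directory), the following sketch writes a minimal config file and a seed list. The property name indexer.solr.url and the example Solr URL come from the steps above. The NUTCH_HOME, SOLR_URL and SEED_URL variables, the location of the urls directory under nutch-trunk, the file name seed.txt and the example seed URL are all assumptions made for this sketch. It overwrites any existing conf/nutch-site.xml, so merge by hand if you already have settings there.

```shell
#!/bin/sh
# Sketch only: writes a minimal Nutch config and a seed list.
# NUTCH_HOME, SOLR_URL and SEED_URL are illustrative variables, not Nutch settings.
set -e

NUTCH_HOME="${NUTCH_HOME:-nutch-trunk}"
SOLR_URL="${SOLR_URL:-http://localhost:8983/solr/}"
SEED_URL="${SEED_URL:-http://example.com/}"

mkdir -p "$NUTCH_HOME/conf" "$NUTCH_HOME/urls"

# WARNING: this overwrites an existing nutch-site.xml.
cat > "$NUTCH_HOME/conf/nutch-site.xml" <<EOF
<?xml version="1.0"?>
<configuration>
  <property>
    <name>indexer.solr.url</name>
    <value>$SOLR_URL</value>
  </property>
</configuration>
EOF

# One URL per line; the file name seed.txt is an arbitrary choice.
printf '%s\n' "$SEED_URL" > "$NUTCH_HOME/urls/seed.txt"
```

A real nutch-site.xml will normally carry other site-specific properties as well; this only shows where the Solr URL goes.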

If you watch the output from your Solr instance (logs) you should see a bunch 
of messages scroll by when Nutch finishes crawling and posts new documents.  If 
you don't, something is probably misconfigured (the indexer.solr.url value is 
a good first thing to check).  I'll try to add more notes here as people have 
questions/issues.
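
Besides watching the logs, one way to check that documents arrived is to ask Solr how many it holds. The sketch below only builds the query URL; solr_count_url is a hypothetical helper (not part of Solr or Nutch), and it assumes the Solr instance exposes the standard /select handler and accepts the match-all query *:*.

```shell
#!/bin/sh
# Sketch only: builds a match-all query URL against a Solr base URL.
# solr_count_url is a hypothetical helper, not part of Solr or Nutch.
solr_count_url() {
  base="${1:-http://localhost:8983/solr/}"
  # Strip a trailing slash so the path is joined with exactly one.
  base="${base%/}"
  printf '%s/select?q=*:*&rows=0\n' "$base"
}

# Example (needs curl and a running Solr instance):
#   curl -s "$(solr_count_url)"
# The numFound attribute in the response is the number of indexed documents.
```

If numFound stays at 0 after a crawl, the documents most likely never reached Solr.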
