[Nutch Wiki] Update of "RunningNutchAndSolr" by NickTkach

Apache Wiki Wed, 09 Apr 2008 14:03:48 -0700

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change 
notification.


The following page has been changed by NickTkach:
http://wiki.apache.org/nutch/RunningNutchAndSolr

The comment on the change is:
Major rewrite of order of steps.

------------------------------------------------------------------------------
  
  I'm posting it under Nutch rather than Solr on the presumption that people 
are more likely to be learning/using Solr first, then come here looking to 
combine it with Nutch.  I'm going to skip over doing command by command for 
right now.  I'm running/building on Ubuntu 7.10 using Java 1.6.0_05.  I'm 
assuming that the Solr trunk code is checked out into solr-trunk and Nutch 
trunk code is checked out into nutch-trunk.
  
-  1. Check out solr-trunk and nutch-trunk
+  1. Check out solr-trunk ( svn co http://svn.apache.org/repos/solr/ 
solr-trunk )
+  1. Check out nutch-trunk ( svn co http://svn.apache.org/repos/nutch/ 
nutch-trunk )
   1. Go into the solr-trunk and run 'ant dist dist-solrj'
-  1. Copy apache-solr-solrj-1.3-dev.jar and apache-solr-common-1.3-dev.jar to 
nutch-trunk/lib
+  1. Copy apache-solr-solrj-1.3-dev.jar and apache-solr-common-1.3-dev.jar 
from solr-trunk/dist to nutch-trunk/lib
+  1. Apply patch from 
[http://www.foofactory.fi/files/nutch-solr/nutch_solr.patch FooFactory patch] 
to nutch-trunk (cd nutch-trunk; patch -p0 < nutch_solr.patch)
+  1. Get zip file from [http://variogram.com/latest/SolrIndexer.zip 
Variogr.am] and unzip somewhere other than nutch-trunk
+  1. Copy ONLY SolrIndexer.java from src/java/org/apache/nutch/indexer/ to 
nutch-trunk/src/java/org/apache/nutch/indexer
+  1. Edit nutch-trunk/src/java/org/apache/nutch/indexer/SolrIndexer.java 
(somewhere around line 92):
+    * Replace int res = new SolrIndexer().doMain(NutchConfiguration.create(), 
args); with int res = ToolRunner.run(NutchConfiguration.create(), new 
SolrIndexer(), args);
+    * Edit the imports to pick up org.apache.hadoop.util.ToolRunner
+  1. Edit nutch-trunk/src/java/org/apache/nutch/indexer/Indexer.java changing 
scope on LuceneDocumentWrapper from private to protected
   1. Get the zip file from 
[http://blog.foofactory.fi/2007/02/online-indexing-integrating-nutch-with.html  
FooFactory] for SOLR-20
   1. Unzip solr-client.zip somewhere, go into java/solr/src and run 'ant'
   1. Copy solr-client.jar from dist to nutch-trunk/lib
   1. Copy xpp3-1.1.3.4.0.jar from lib to nutch-trunk/lib
+  1. Configure nutch-trunk/conf/nutch-site.xml with *at least* settings for 
your site including a value for property indexer.solr.url (something like 
http://localhost:8983/solr/), but you should also set http.agent.name, 
http.agent.description, http.agent.url, and http.agent.email.
+  1. Edit nutch-trunk/conf/regex-urlfilter.txt to include some pattern for 
what to grab (such as +^http://([a-z0-9]*\.)apache.org/)
+  1. Configure some url(s) to crawl (make a nutch-trunk/urls directory with a 
text file with just a url in it like http://lucene.apache.org/nutch)
-  1. Apply patch from 
[http://www.foofactory.fi/files/nutch-solr/nutch_solr.patch FooFactory patch] 
to nutch/trunk (something like cd nutch-trunk; patch -p0 < nutch_solr.patch)
-  1. Edit nutch-trunk/src/java/org/apache/nutch/indexer/SolrIndexer.java:
-    * Replace int res = new SolrIndexer().doMain(NutchConfiguration.create(), 
args); with int res = ToolRunner.run(NutchConfiguration.create(), new 
SolrIndexer(), args);
-    * Change job.setOutputValueClass(ObjectWritable.class) to 
job.setOutputValueClass(NutchWritable.class)
-    * Edit imports to pick up org.apache.nutch.crawl.NutchWritable
-    * Edit the imports to pick up ToolRunner
-  1. Edit nutch-trunk/src/java/org/apache/nutch/indexer/Indexer.java changing 
scope on LuceneDocumentWrapper from private to protected
-  1. Configure nutch-trunk/conf/nutch-site.xml with settings for your site 
including a value for property indexer.solr.url (something like 
http://localhost:8983/solr/)
-  1. Configure some url(s) to crawl (files in a urls directory)
-  1. Copy [http://www.foofactory.fi/files/nutch-solr/crawl.sh Crawl.sh script] 
from FooFactory and copy it to nutch-trunk/bin (editing if needed)
+  1. Get the [http://www.foofactory.fi/files/nutch-solr/crawl.sh Crawl.sh script] 
from FooFactory and copy it to nutch-trunk/bin (editing if needed for things 
like topN)
-  1. Start a Solr server (such as the solr-trunk/example instance)
+  1. Go into solr-trunk and make an example server instance (run 'ant example')
+  1. Copy example off somewhere (like /tmp/mysolr)
+  1. Edit mysolr/solr/conf/schema.xml
+    * Add the fields that Nutch needs (url, content, segment, digest, host, 
site, anchor, title, tstamp, text--see 
[http://blog.foofactory.fi/2007/02/online-indexing-integrating-nutch-with.html 
FooFactory Article on Nutch + Solr])
+    * Change defaultSearchField to 'text'
+    * Change defaultOperator to 'AND'
+    * Add lines to "copyField" section to copy cat & name into the text field
+  1. Start the Solr you just made (cd /tmp/mysolr; java -jar start.jar)
   1. Run a Nutch crawl using the bin/crawl.sh script.
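
  The nutch-site.xml and seed-url steps above could be scripted roughly like 
this. It is only a sketch: NUTCH_HOME, the agent values, and the seed.txt file 
name are placeholders I made up for illustration, and it overwrites any existing 
nutch-site.xml. The URL filter step is left out because the pattern has to go 
in the right place in the filter file by hand.

```shell
#!/bin/sh
# Sketch only. NUTCH_HOME, the agent values and seed.txt are placeholders
# (assumptions for illustration), not values taken from the page above.
NUTCH_HOME="${NUTCH_HOME:-./nutch-trunk}"
mkdir -p "$NUTCH_HOME/conf" "$NUTCH_HOME/urls"

# Write a minimal nutch-site.xml: the Solr URL plus the http.agent.* properties.
# WARNING: this replaces any existing nutch-site.xml.
cat > "$NUTCH_HOME/conf/nutch-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>indexer.solr.url</name>
    <value>http://localhost:8983/solr/</value>
  </property>
  <property>
    <name>http.agent.name</name>
    <value>my-test-crawler</value>
  </property>
  <property>
    <name>http.agent.description</name>
    <value>Test crawler for a Nutch + Solr setup</value>
  </property>
  <property>
    <name>http.agent.url</name>
    <value>http://example.com/crawler.html</value>
  </property>
  <property>
    <name>http.agent.email</name>
    <value>crawler@example.com</value>
  </property>
</configuration>
EOF

# Seed list: a text file with just one url in it.
echo 'http://lucene.apache.org/nutch' > "$NUTCH_HOME/urls/seed.txt"
```

  Run it from the directory that holds nutch-trunk, or set NUTCH_HOME first.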
  
  If you watch the output from your Solr instance (logs) you should see a bunch 
of messages scroll by when Nutch finishes crawling and posts new documents.  If 
not, then you've got something not configured right.  I'll try to add more 
notes here as people have questions/issues.
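
  Besides watching the logs, one rough way to check that documents arrived is 
to ask Solr how many it holds. The helper below is a sketch: num_found is a 
name I made up, and it assumes the standard XML select response, which carries 
a numFound attribute on the result element.

```shell
# Sketch: read a Solr XML select response on stdin and print its numFound value.
# num_found is a hypothetical helper name, not part of Nutch or Solr.
num_found() {
  sed -n 's/.*numFound="\([0-9][0-9]*\)".*/\1/p' | head -n 1
}

# Usage (assumes the example server from the steps above on localhost:8983):
#   curl -s 'http://localhost:8983/solr/select?q=*:*&rows=0' | num_found
```

  If the number stays at 0 after a crawl finishes, the indexing step probably 
never reached Solr.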
  
  '''Troubleshooting:'''
   * If you get errors about "Type mismatch in value from map:" (expected 
ObjectWritable, but received NutchWritable), then you likely are missing the 
two steps I just added in step 9 above.  Sorry about that, I forgot about 
making the change there in SolrIndexer.
-  * Note that I've changed the steps above.  I was mistaken-you don't need the 
zip from Variogr.am.  You only need the files from FooFactory.  If you take 
that part out then you shouldn't see errors about ClassCastExceptions any more.
+  * Note: Double-mistaken.  I've re-written the order of the steps.  Turns out 
you do need both the Variogram file and the FooFactory files. 
+  * When in doubt, look at nutch-trunk/logs/hadoop.log .  It frequently shows 
details about what's gone wrong and can be a big help when you start getting 
"unexplained" errors.
+  * See original articles at 
[http://blog.foofactory.fi/2007/02/online-indexing-integrating-nutch-with.html 
FooFactory Article on Nutch + Solr] and [http://variogram.com/latest/?p=26 
Variogr.am Updates to FooFactory Posting]
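
  For the hadoop.log tip above, a small filter can save some scrolling. This is 
a sketch: recent_errors is a made-up helper name, and the patterns it looks for 
(ERROR, FATAL, Exception) are my guess at what is usually worth reading first.

```shell
# Sketch: show the last few ERROR/FATAL/Exception lines of a Hadoop log,
# prefixed with their line numbers. recent_errors is a hypothetical helper.
recent_errors() {
  log="${1:-nutch-trunk/logs/hadoop.log}"   # log file to scan
  count="${2:-20}"                          # how many matching lines to show
  grep -n -E 'ERROR|FATAL|Exception' "$log" | tail -n "$count"
}

# Usage:
#   recent_errors nutch-trunk/logs/hadoop.log 10
```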
  ---------------------------------------------------
  ERROR
  I did everything, but I got this error. Any ideas?
