Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.

The following page has been changed by NickTkach:
http://wiki.apache.org/nutch/RunningNutchAndSolr

The comment on the change is:
Major rewrite of order of steps.

------------------------------------------------------------------------------
  I'm posting it under Nutch rather than Solr on the presumption that people are more likely to be learning/using Solr first, then come here looking to combine it with Nutch. I'm going to skip over doing command by command for right now. I'm running/building on Ubuntu 7.10 using Java 1.6.0_05. I'm assuming that the Solr trunk code is checked out into solr-trunk and Nutch trunk code is checked out into nutch-trunk.

- 1. Check out solr-trunk and nutch-trunk
+ 1. Check out solr-trunk ( svn co http://svn.apache.org/repos/solr/ solr-trunk )
+ 1. Check out nutch-trunk ( svn co http://svn.apache.org/repos/nutch/ nutch-trunk )
  1. Go into the solr-trunk and run 'ant dist dist-solrj'
- 1. Copy apache-solr-solrj-1.3-dev.jar and apache-solr-common-1.3-dev.jar to nutch-trunk/lib
+ 1. Copy apache-solr-solrj-1.3-dev.jar and apache-solr-common-1.3-dev.jar from solr-trunk/dist to nutch-trunk/lib
+ 1. Apply patch from [http://www.foofactory.fi/files/nutch-solr/nutch_solr.patch FooFactory patch] to nutch-trunk (cd nutch-trunk; patch -p0 < nutch_solr.patch)
+ 1. Get zip file from [http://variogram.com/latest/SolrIndexer.zip Variogr.am] and unzip somewhere other than nutch-trunk
+ 1. Copy ONLY SolrIndexer.java from src/java/org/apache/nutch/indexer/ to nutch-trunk/src/java/org/apache/nutch/indexer
+ 1. Edit nutch-trunk/src/java/org/apache/nutch/indexer/SolrIndexer.java (somewhere around line 92):
+   * Replace int res = new SolrIndexer().doMain(NutchConfiguration.create(), args); with int res = ToolRunner.run(NutchConfiguration.create(), new SolrIndexer(), args);
+   * Edit the imports to pick up org.apache.hadoop.util.ToolRunner
+ 1. Edit nutch-trunk/src/java/org/apache/nutch/indexer/Indexer.java changing scope on LuceneDocumentWrapper from private to protected
  1. Get the zip file from [http://blog.foofactory.fi/2007/02/online-indexing-integrating-nutch-with.html FooFactory] for SOLR-20
  1. Unzip solr-client.zip somewhere, go into java/solr/src and run 'ant'
  1. Copy solr-client.jar from dist to nutch-trunk/lib
  1. Copy xpp3-1.1.3.4.0.jar from lib to nutch-trunk/lib
+ 1. Configure nutch-trunk/conf/nutch-site.xml with *at least* settings for your site including a value for property indexer.solr.url (something like http://localhost:8983/solr/), but you should also have http.agent.name, http.agent.description, http.agent.url, and http.agent.email as well.
+ 1. Edit nutch-trunk/conf/regex-urlfilter.txt to include some pattern for what to grab (such as +^http://([a-z0-9]*\.)apache.org/)
+ 1. Configure some url(s) to crawl (make a nutch-trunk/urls directory with a text file with just a url in it like http://lucene.apache.org/nutch)
- 1. Apply patch from [http://www.foofactory.fi/files/nutch-solr/nutch_solr.patch FooFactory patch] to nutch/trunk (something like cd nutch-trunk; patch -p0 < nutch_solr.patch)
- 1. Edit nutch-trunk/src/java/org/apache/nutch/indexer/SolrIndexer.java:
-   * Replace int res = new SolrIndexer().doMain(NutchConfiguration.create(), args); with int res = ToolRunner.run(NutchConfiguration.create(), new SolrIndexer(), args);
-   * Change job.setOutputValueClass(ObjectWritable.class) to job.setOutputValueClass(NutchWritable.class)
-   * Edit imports to pick up org.apache.nutch.crawl.NutchWritable
-   * Edit the imports to pick up ToolRunner
- 1. Edit nutch-trunk/src/java/org/apache/nutch/indexer/Indexer.java changing scope on LuceneDocumentWrapper from private to protected
- 1. Configure nutch-trunk/conf/nutch-site.xml with settings for your site including a value for property indexer.solr.url (something like http://localhost:8983/solr/)
- 1. Configure some url(s) to crawl (files in a urls directory)
- 1. Copy [http://www.foofactory.fi/files/nutch-solr/crawl.sh Crawl.sh script] from FooFactory and copy it to nutch-trunk/bin (editing if needed)
+ 1. Copy [http://www.foofactory.fi/files/nutch-solr/crawl.sh Crawl.sh script] from FooFactory and copy it to nutch-trunk/bin (editing if needed for things like topN)
- 1. Start a Solr server (such as the solr-trunk/example instance)
+ 1. Go into solr-trunk and make an example server instance (run 'ant example')
+ 1. Copy example off somewhere (like /tmp/mysolr)
+ 1. Edit mysolr/solr/conf/schema.xml
+   * Add the fields that Nutch needs (url, content, segment, digest, host, site, anchor, title, tstamp, text--see [http://blog.foofactory.fi/2007/02/online-indexing-integrating-nutch-with.html FooFactory Article on Nutch + Solr])
+   * Change defaultSearchField to 'text'
+   * Change defaultOperator to 'AND'
+   * Add lines to "copyField" section to copy cat & name into the text field
+ 1. Start the Solr you just made (cd /tmp/mysolr; java -jar start.jar)
  1. Run a Nutch crawl using the bin/crawl.sh script.

  If you watch the output from your Solr instance (logs) you should see a bunch of messages scroll by when Nutch finishes crawling and posts new documents. If not, then you've got something not configured right. I'll try to add more notes here as people have questions/issues.

  '''Troubleshooting:'''
  * If you get errors about "Type mismatch in value from map:" (expected ObjectWritable, but received NutchWritable), then you likely are missing the two steps I just added in step 9 above. Sorry about that, I forgot about making the change there in SolrIndexer.
- * Note that I've changed the steps above. I was mistaken-you don't need the zip from Variogr.am. You only need the files from FooFactory. If you take that part out then you shouldn't see errors about ClassCastExceptions any more.
+ * Note: Double-mistaken. I've re-written the order of the steps. Turns out you do need both the Variogram file and the FooFactory files.
+ * When in doubt, look at nutch-trunk/logs/hadoop.log . It frequently shows details about what's gone wrong and can be a big help when you start getting "unexplained" errors.
+ * See original articles at [http://blog.foofactory.fi/2007/02/online-indexing-integrating-nutch-with.html FooFactory Article on Nutch + Solr] and [http://variogram.com/latest/?p=26 Variogr.am Updates to FooFactory Posting]

---------------------------------------------------
ERROR

I did everything, but I got this error. Any idea?
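
A note on the URL filter step above: before starting a crawl, it can help to check that the example pattern actually accepts your seed URL. The following is a minimal sketch in plain Java using only java.util.regex. The class UrlFilterCheck and its accepts method are hypothetical helpers, not part of Nutch, and the sketch assumes that an include rule is a "+" followed by a Java regular expression that is searched for within the URL rather than matched against the whole string.

```java
import java.util.regex.Pattern;

/** Hypothetical helper (not part of Nutch): checks one include rule against one URL. */
public class UrlFilterCheck {

    /**
     * Returns true when a "+regex" rule accepts the URL.
     * Assumption: the regex is searched for inside the URL (find(), not a
     * full match), so the leading "^" is what anchors it to the start.
     */
    static boolean accepts(String rule, String url) {
        if (rule.isEmpty() || rule.charAt(0) != '+') {
            return false; // only "+" (include) rules are handled in this sketch
        }
        Pattern pattern = Pattern.compile(rule.substring(1));
        return pattern.matcher(url).find();
    }

    public static void main(String[] args) {
        // The example pattern from the filter step, written as a Java string literal.
        String rule = "+^http://([a-z0-9]*\\.)apache.org/";
        String[] urls = {
            "http://lucene.apache.org/nutch",
            "http://apache.org/",
            "http://example.com/"
        };
        for (String url : urls) {
            System.out.println((accepts(rule, url) ? "accept " : "reject ") + url);
        }
    }
}
```

As written, the example pattern requires one host label and a dot in front of apache.org, so the seed URL http://lucene.apache.org/nutch is accepted while a bare http://apache.org/ is not. If nothing gets fetched, the filter pattern is one of the first things worth checking, alongside nutch-trunk/logs/hadoop.log.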