Dear Wiki user, You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.
The following page has been changed by amitkumar: http://wiki.apache.org/nutch/RunningNutchAndSolr

------------------------------------------------------------------------------

 * apt-get install sun-java6-jdk subversion ant patch unzip

== Steps ==

The first step to get started is to download the required software components, namely Apache Solr and Nutch.

'''1.''' Download Solr version 1.3.0 or LucidWorks for Solr from the Download page

'''2.''' Extract the Solr package

'''3.''' Download Nutch version 1.0 or later (alternatively, download the nightly version of Nutch, which contains the required functionality)

'''4.''' Extract the Nutch package

tar xzf apache-nutch-1.0.tar.gz

'''5.''' Configure Solr

For the sake of simplicity we are going to use the example configuration of Solr as a base.

'''a.''' Copy the provided Nutch schema from directory apache-nutch-1.0/conf to directory apache-solr-1.3.0/example/solr/conf (overriding the existing file)

We want to allow Solr to create the snippets for search results, so we need to store the content in addition to indexing it:

'''b.''' Change schema.xml so that the stored attribute of field "content" is true.

<field name="content" type="text" stored="true" indexed="true"/>

We want to be able to tweak the relevancy of queries easily, so we'll create a new dismax request handler configuration for our use case:

'''d.''' Open apache-solr-1.3.0/example/solr/conf/solrconfig.xml and paste the following fragment into it

<requestHandler name="/nutch" class="solr.SearchHandler" >

@@ -93, +89 @@

</requestHandler>

'''6.''' Start Solr

cd apache-solr-1.3.0/example
java -jar start.jar

'''7. Configure Nutch'''

'''a.''' Open nutch-site.xml in directory apache-nutch-1.0/conf and replace its contents with the following (we specify our crawler name, the active plugins, and limit the maximum url count for a single host per run to 100):

<?xml version="1.0"?>
<configuration>
<property>
<name>http.agent.name</name>
<value>nutch-solr-integration</value>
</property>
<property>
<name>generate.max.per.host</name>
<value>100</value>
</property>
<property>
<name>plugin.includes</name>
<value>protocol-http|urlfilter-regex|parse-html|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
</configuration>

'''b.''' Open regex-urlfilter.txt in directory apache-nutch-1.0/conf and replace its content with the following:

-^(https|telnet|file|ftp|mailto):

@@ -135, +143 @@

# deny anything else
-.

'''8.''' Create a seed list (the initial urls to fetch)

mkdir urls
echo "http://www.lucidimagination.com/" > urls/seed.txt

'''9.''' Inject the seed url(s) into the Nutch crawldb (execute in the nutch directory)

bin/nutch inject crawl/crawldb urls

'''10.''' Generate a fetch list, then fetch and parse the content

bin/nutch generate crawl/crawldb crawl/segments

@@ -166, +174 @@

Now a full Fetch cycle is completed.
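The generate/fetch/parse round of step 10 can be wrapped in a loop when you want several cycles in a row. The sketch below is a hypothetical convenience script, not part of the original page: it assumes the Nutch 1.0 layout and that you run it from the nutch directory, and the fetch/parse/updatedb commands for the middle of the cycle (elided above) are standard Nutch 1.0 subcommands. By default `NUTCH` is a dry-run stub that only prints the commands; set `NUTCH=bin/nutch` to execute them for real.

```shell
# Dry-run stub by default: prints each command instead of running it.
# Set NUTCH=bin/nutch (and run from the nutch directory) to execute for real.
NUTCH="${NUTCH:-echo bin/nutch}"
plan=$(
  for round in 1 2 3; do
    $NUTCH generate crawl/crawldb crawl/segments
    # Pick the newest segment directory; a placeholder is used in dry-run mode,
    # since no segments exist until fetching has actually happened.
    segment=$(ls -d crawl/segments/2* 2>/dev/null | tail -1)
    $NUTCH fetch "${segment:-crawl/segments/NEWEST}"
    $NUTCH parse "${segment:-crawl/segments/NEWEST}"
    $NUTCH updatedb crawl/crawldb "${segment:-crawl/segments/NEWEST}"
  done
)
printf '%s\n' "$plan"
```

In dry-run mode this prints the twelve commands of three full cycles, which is a cheap way to review the plan before pointing it at a real crawldb.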
Next you can repeat step 10 a couple more times to get some more content.

'''11.''' Create the linkdb

bin/nutch invertlinks crawl/linkdb -dir crawl/segments

'''12.''' Finally, index all content from all segments into Solr

bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb crawl/linkdb crawl/segments/*
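As a hypothetical sanity check, not part of the original page: once step 12 finishes, you can query the /nutch request handler configured in step 5d. The URL below assumes the example Solr instance from step 6 is still listening on port 8983; the query term and field list are illustrative.

```shell
# Query the /nutch handler set up earlier; adjust q= and fl= as needed.
SOLR_QUERY="http://127.0.0.1:8983/solr/nutch?q=lucid&fl=url,title"
# Hit Solr if it is reachable; otherwise report the URL we would have queried,
# so the snippet degrades gracefully when Solr is not running.
resp=$(curl -s --max-time 3 "$SOLR_QUERY" 2>/dev/null || echo "no response from $SOLR_QUERY")
echo "$resp"
```

A non-empty result listing your crawled urls confirms that the solrindex run in step 12 actually reached Solr.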