[Nutch Wiki] Update of "RunningNutchAndSolr" by EricPugh

Apache Wiki Sun, 17 Jul 2011 18:18:16 -0700

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change 
notification.


The "RunningNutchAndSolr" page has been changed by EricPugh:
http://wiki.apache.org/nutch/RunningNutchAndSolr?action=diff&rev1=64&rev2=65

Comment:
minor edits for clarity.

  
  This tutorial was originally constructed and posted by 'waycool' on the user 
lists. It has been edited slightly for integration into the Apache Nutch 
project.
  
- Apache Nutch is an open source web crawler written in Java. By using it, we 
can find out the hyperlinks in automated manner, reduce lots of maintenance 
work, for example checking broken links, and create a copy of all the visited 
pages for future search. That’s where Apache Solr comes in. Solr is an open 
source full text search framework, with Solr we can search the visited pages 
from Nutch. Luckily, integration between Nutch and Solr is pretty 
straightforward as explained below.
+ Apache Nutch is an open source web crawler written in Java. With it, we can 
discover web page hyperlinks automatically, cut down on maintenance work such 
as checking for broken links, and keep a copy of every visited page for 
searching over. That’s where Apache Solr comes in. Solr is an open source 
full-text search framework; with Solr we can search the pages Nutch has 
visited. Luckily, integration between Nutch and Solr is pretty 
straightforward, as explained below.
  
- Apache Nutch release 1.3 has Solr integration embedded, this greatly eases 
Nutch-Solr integration. It also removes the legacy dependence upon both Apache 
Tomcat for running the old Nutch Web Application and upon Apache Lucene for 
indexing. Just download a 1.3 release from 
[[http://www.apache.org/dyn/closer.cgi/nutch/|here]]. NOTE: You can download 
release 1.3 in either binary or source format, both of which are covered in 
this tutorial.
+ Apache Nutch release 1.3 has Solr integration embedded, greatly simplifying 
Nutch-Solr integration. It also removes the legacy dependence upon both Apache 
Tomcat for running the old Nutch Web Application and upon Apache Lucene for 
indexing. Just download a 1.3 release from 
[[http://www.apache.org/dyn/closer.cgi/nutch/|here]]. NOTE: You can download 
release 1.3 in either binary or source format, both of which are covered in 
this tutorial.
  
  == Table of Contents ==
  <<TableOfContents(3)>>
@@ -60, +60 @@

  </property>
  }}}
   * mkdir -p urls
-  * create a file nutch under /urls with the following content (or any site 
you want Nutch to crawl).
+  * create a file nutch under urls/ with the following content (one URL per 
line for each site you want Nutch to crawl).
  {{{
  http://nutch.apache.org/
  }}}
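The two seed-list steps above can be sketched as shell commands. This is a minimal sketch, assuming it is run from the directory where the crawl will be started and that the seed file is named nutch, as in the step above; add further URLs one per line.

```shell
# Create the seed directory relative to the current directory.
mkdir -p urls

# Write the seed list: one URL per line. Append more lines for more sites.
printf '%s\n' 'http://nutch.apache.org/' > urls/nutch

# Show the seeds that Nutch will be given.
cat urls/nutch
```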
@@ -68, +68 @@

  {{{
  bin/nutch crawl urls -dir crawl -depth 3 -topN 5
  }}}
-  * Now you should be able to see the following directories exist:
+  * Now you should be able to see the following directories created:
  {{{
  crawl/crawldb 
  crawl/linkdb
@@ -102, +102 @@

  
  == 6. Integrate Solr with Nutch ==
  
- We have both Nutch and Solr installed and setup correctly. And Nutch already 
created crawl data from the seed url(s). Below are the steps to delagte 
searching to Solr for links to be searchable:
+ We have both Nutch and Solr installed and set up correctly, and Nutch has 
already created crawl data from the seed URL(s). Below are the steps to 
delegate searching to Solr so that the crawled links become searchable:
   * cp ${NUTCH_RUNTIME_HOME}/conf/schema.xml 
${APACHE_SOLR_HOME}/example/solr/conf/ 
   * restart Solr with the command “java -jar start.jar” under 
${APACHE_SOLR_HOME}/example 
   * run the Solr Index command:
@@ -111, +111 @@

  }}}
  This will send all crawl data to Solr for indexing. For more information 
please see bin/nutch solrindex
   
- If all has gone to plan, we are now ready to search with 
http://localhost:8983/solr/admin/. 
+ If all has gone to plan, we are now ready to search with 
http://localhost:8983/solr/admin/. If you want to see the HTML indexed by Solr 
in its raw form, then in schema.xml change the content field to stored:
+ {{{
+ <field name="content" type="text" stored="true" indexed="true"/>
+ }}}
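One way to confirm the field change is in place before restarting Solr is to grep the schema file for it. This is a minimal sketch, not part of the original tutorial: the helper name content_is_stored is hypothetical, and the path assumes the schema.xml location used in the cp step above, with APACHE_SOLR_HOME set in the environment.

```shell
# Hypothetical helper: succeeds if the given schema file declares the
# "content" field with stored="true" on a single line.
content_is_stored() {
  grep -q '<field name="content"[^>]*stored="true"' "$1"
}

# Assumed location, following the cp step earlier in this section.
SCHEMA="${APACHE_SOLR_HOME:-.}/example/solr/conf/schema.xml"

if [ -f "$SCHEMA" ] && content_is_stored "$SCHEMA"; then
  echo "content field is stored"
else
  echo "content field is not stored (or no schema found at $SCHEMA)"
fi
```

The check only matches a field definition written on one line, as in the snippet above; a definition split across several lines would need a more careful parse.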
