Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change 
notification.

The "RunningNutchAndSolr" page has been changed by AlexMc.
http://wiki.apache.org/nutch/RunningNutchAndSolr?action=diff&rev1=33&rev2=34

--------------------------------------------------

  
  I'm posting it under Nutch rather than Solr on the presumption that people are more likely to be learning/using Solr first, then come here looking to combine it with Nutch. I won't go through it command by command for now. I'm running/building on Ubuntu 7.10 using Java 1.6.0_05. I'm assuming that the Solr trunk code is checked out into solr-trunk and the Nutch trunk code is checked out into nutch-trunk.
  
+ 
  == Prerequisites ==
   * apt-get install sun-java6-jdk subversion ant patch unzip
+ 
+ == Ubuntu Note ==
+ 
+ If you are using a more recent version of Ubuntu, Solr is available as a package installable through apt-get.
+ You might wish to install it that way instead of as described below. If so, you will find the Solr config in /etc/solr/conf.
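
For example (the package name here is an assumption and varies by release; on some versions the package is `solr-tomcat` instead):

{{{
sudo apt-get install solr-jetty
}}}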
  
  == Steps ==
  The first step to get started is to download the required software 
components, namely Apache Solr and Nutch.
@@ -62, +68 @@

  
  '''6.''' Start Solr
  
+ Assuming you have installed Solr as per the instructions above:
+ {{{
  cd apache-solr-1.3.0/example
  java -jar start.jar
+ }}}
+ 
+ 
  
  '''7. Configure Nutch'''
  
@@ -119, +130 @@

  bin/nutch generate crawl/crawldb crawl/segments
  }}}
  
- The above command will generate a new segment directory under crawl/segments 
that at this point contains files that store the url(s) to be fetched. In the 
following commands we need the latest segment dir as parameter so we’ll store 
it in an environment variable:
+ The above command will generate a new segment directory under crawl/segments that, at this point, contains files storing the URL(s) to be fetched. The following commands need the latest segment directory as a parameter, so we'll store it in an environment variable.
  
+ {{{
  export SEGMENT=crawl/segments/`ls -tr crawl/segments|tail -1`
+ echo $SEGMENT
+ }}}
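
To see how the `ls -tr | tail -1` trick selects the newest directory, here is a self-contained sketch using a scratch directory (the timestamp names below are invented for illustration; real Nutch segments are named by fetch timestamp):

```shell
# Build a scratch crawl/segments tree with two fake segment directories,
# created a second apart so their modification times differ.
demo=$(mktemp -d)
mkdir -p "$demo/crawl/segments/20100101000000"
sleep 1
mkdir -p "$demo/crawl/segments/20100102000000"
cd "$demo"
# ls -tr lists oldest first by modification time; tail -1 keeps the newest.
SEGMENT=crawl/segments/$(ls -tr crawl/segments | tail -1)
echo "$SEGMENT"
# prints crawl/segments/20100102000000
```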
+ 
+ Note: this only works if you are using your local file system. If your crawl is on Hadoop DFS, you will need some other way of setting the SEGMENT environment variable, perhaps based on the output of
+ 
+ {{{
+ bin/hadoop fs -ls crawl/segments
+ }}}
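
One possible sketch (untested assumptions: each `hadoop fs -ls` output line ends with the path as its last whitespace-separated field, and segment names sort lexicographically by timestamp, so the last listed entry is the newest; you may also need to trim a leading absolute-path prefix):

{{{
export SEGMENT=$(bin/hadoop fs -ls crawl/segments | tail -1 | awk '{print $NF}')
echo $SEGMENT
}}}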
  
  Now I launch the fetcher that actually goes to get the content:
  
+ {{{
  bin/nutch fetch $SEGMENT -noParsing
+ }}}
  
  Next I parse the content:
  
+ {{{
  bin/nutch parse $SEGMENT
+ }}}
  
  Then I update the Nutch crawldb. The updatedb command will store all new URLs discovered during the fetch and parse of the previous segment into the Nutch database so they can be fetched later. Nutch also stores information about the pages that were fetched, so the same URLs won't be fetched again and again.
  
@@ -153, +177 @@

  
  
http://127.0.0.1:8983/solr/nutch/?q=solr&version=2.2&start=0&rows=10&indent=on&wt=json
  
+ === Comments ===
+ --------------------------------------
+ 
  Hi, I too faced problems integrating Solr and Nutch. After some work I found the article below and integrated them successfully. http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/
  
