Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change 
notification.

The "RunningNutchAndSolr" page has been changed by LewisJohnMcgibbney:
http://wiki.apache.org/nutch/RunningNutchAndSolr?action=diff&rev1=58&rev2=59

Comment:
This revision is a first attempt at getting a local

  
  '''4a.''' Set up Solr for search from the binary distribution:
   * download the binary distribution from [[http://www.apache.org/dyn/closer.cgi/lucene/solr/|here]]
-  * unzip to $HOME/apache-solr-3.X
-  * cd apache-solr-3.X/example
+  * unzip to $HOME/apache-solr-3.X; we will refer to this directory as ${APACHE_SOLR_HOME} from now on
+  * cd ${APACHE_SOLR_HOME}/example
   * java -jar start.jar
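+ 
+ Taken together, the steps above can be run as a single shell sequence (a minimal sketch; the 3.X version number is a placeholder for the release you actually downloaded, and the archive is assumed to be a .tgz saved to $HOME):
+ {{{
+ # use unzip instead of tar if you downloaded the .zip archive
+ cd $HOME
+ tar -xzf apache-solr-3.X.tgz
+ export APACHE_SOLR_HOME=$HOME/apache-solr-3.X
+ cd ${APACHE_SOLR_HOME}/example
+ java -jar start.jar
+ }}}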
  
  '''4b.''' Set up Solr for search from the source distribution:
   * You can set up Solr from the source distribution with Maven. This [[http://thetechietutorials.blogspot.com/2011/06/how-to-build-and-start-apache-solr.html|link]] shows how to do that.
  
+ '''5.''' Verify Solr installation:
+ After you have started Solr, you should be able to access the following links:
- 
- '''a.''' Copy the provided Nutch schema from directory apache-nutch-1.0/conf to directory apache-solr-1.3.0/example/solr/conf (overwriting the existing file)
- 
- We want to allow Solr to create the snippets for search results so we need to 
store the content in addition to indexing it:
- 
- '''b.''' Change schema.xml so that the stored attribute of field “content” is 
true.
- 
  {{{
- <field name="content" type="text" stored="true" indexed="true"/>
+ http://localhost:8983/solr/admin/
+ http://localhost:8983/solr/admin/stats.jsp
  }}}
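+ 
+ As a quick sanity check from the command line (a minimal sketch; it assumes Solr is running on the default port 8983 and that curl is available):
+ {{{
+ # should report an HTTP 200 status if the admin console is up
+ curl -I http://localhost:8983/solr/admin/
+ }}}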
  
+ '''6.''' Integrate Solr with Nutch
+ We now have both Nutch and Solr installed and set up correctly, and Nutch has already created crawl data from the seed URL(s). Below are the steps to delegate searching to Solr so that the crawled links become searchable:
+  * cp ${NUTCH_RUNTIME_HOME}/conf/schema.xml ${APACHE_SOLR_HOME}/example/solr/conf/
+  * restart Solr with the command "java -jar start.jar" under ${APACHE_SOLR_HOME}/example
+  * run the Solr Index command:
- We want to be able to tweak the relevancy of queries easily, so we'll create a new [[http://wiki.apache.org/solr/DisMaxRequestHandler|dismax request handler]] configuration for our use case:
- 
- '''d.''' Open apache-solr-1.3.0/example/solr/conf/solrconfig.xml and paste the following fragment into it:
- 
- {{{
- <requestHandler name="/nutch" class="solr.SearchHandler" >
-   <lst name="defaults">
-     <str name="defType">dismax</str>
-     <str name="echoParams">explicit</str>
-     <float name="tie">0.01</float>
-     <str name="qf">content^0.5 anchor^1.0 title^1.2</str>
-     <str name="pf">content^0.5 anchor^1.5 title^1.2 site^1.5</str>
-     <str name="fl">url</str>
-     <str name="mm">2&lt;-1 5&lt;-2 6&lt;90%</str>
-     <int name="ps">100</int>
-     <bool name="hl">true</bool>
-     <str name="q.alt">*:*</str>
-     <str name="hl.fl">title url content</str>
-     <str name="f.title.hl.fragsize">0</str>
-     <str name="f.title.hl.alternateField">title</str>
-     <str name="f.url.hl.fragsize">0</str>
-     <str name="f.url.hl.alternateField">url</str>
-     <str name="f.content.hl.fragmenter">regex</str>
-   </lst>
- </requestHandler>
- }}}
- 
- 
- '''6.''' Start Solr
- 
- Assuming you have installed Solr as per the instructions above:
- {{{
- cd apache-solr-1.3.0/example
- java -jar start.jar
- }}}
- 
- 
- 
- '''7'''. Configure Nutch
- 
- '''a.''' Open nutch-site.xml in directory apache-nutch-1.0/conf and replace its contents with the following (we specify our crawler name and active plugins, and limit the maximum URL count per host per run to 100):
- 
- {{{
- <?xml version="1.0"?>
- <configuration>
-   <property>
-     <name>http.agent.name</name>
-     <value>nutch-solr-integration</value>
-   </property>
-   <property>
-     <name>generate.max.per.host</name>
-     <value>100</value>
-   </property>
-   <property>
-     <name>plugin.includes</name>
-     <value>protocol-http|urlfilter-regex|parse-html|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
-   </property>
- </configuration>
- }}}
- 
- 
- '''b.''' Open regex-urlfilter.txt in directory apache-nutch-1.0/conf and replace its content with something similar to the following:
- 
- {{{
- -^(https|telnet|file|ftp|mailto):
- # skip some suffixes 
- -\.(swf|SWF|doc|DOC|mp3|MP3|WMV|wmv|txt|TXT|rtf|RTF|avi|AVI|m3u|M3U|flv|FLV|WAV|wav|mp4|MP4|avi|AVI|rss|RSS|xml|XML|pdf|PDF|js|JS|gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
- # skip URLs containing certain characters as probable queries, etc. 
- -[?*!@=]
- # allow urls in foofactory.fi domain (or lucidimagination.com...)
- +^http://([a-z0-9\-A-Z]*\.)*lucidimagination.com/
- # deny anything else 
- -.
- }}}
- 
- 
- '''8.''' Create a seed list (the initial urls to fetch)
- 
- {{{
- mkdir urls 
- echo "http://www.lucidimagination.com/"; > urls/seed.txt
- }}}
- 
- '''9.''' Inject the seed URL(s) into the Nutch crawldb (execute in the Nutch directory)
- 
- {{{
- bin/nutch inject crawl/crawldb urls
- }}}
- 
- '''10.''' Generate fetch list, fetch and parse content
- 
- {{{
- bin/nutch generate crawl/crawldb crawl/segments
- }}}
- 
- The above command will generate a new segment directory under crawl/segments that, at this point, contains files that store the URL(s) to be fetched. In the following commands we need the latest segment directory as a parameter, so we'll store it in an environment variable.
- 
- {{{
- export SEGMENT=crawl/segments/`ls -tr crawl/segments|tail -1`
- echo $SEGMENT
- }}}
- 
- Note: This only works if you are using your local file system. If your crawl 
is on Hadoop DFS then you will need to figure out some other way of setting the 
SEGMENT environment variable - possibly using something like 
- 
- {{{
- bin/hadoop fs  -ls crawl/segments
- }}}
- 
- Now I launch the fetcher that actually goes to get the content:
- 
- {{{
- bin/nutch fetch $SEGMENT -noParsing
- }}}
- 
- Next I parse the content:
- 
- {{{
- bin/nutch parse $SEGMENT
- }}}
- 
- Then I update the Nutch crawldb. The updatedb command will store all new URLs discovered during the fetch and parse of the previous segment into the Nutch database so they can be fetched later. Nutch also stores information about the pages that were fetched so the same URLs won't be fetched again and again.
- 
- {{{
- bin/nutch updatedb crawl/crawldb $SEGMENT -filter -normalize
- }}}
- 
- Now a full fetch cycle is complete. Next you can repeat step 10 a couple more times to get some more content.
- 
- '''11.''' Create linkdb
- 
- {{{
- bin/nutch invertlinks crawl/linkdb -dir crawl/segments
- }}}
- 
- '''12.''' Finally index all content from all segments to Solr
- 
  {{{
  bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb crawl/linkdb crawl/segments/*
  }}}
+ This will send all crawl data to Solr for indexing. For more information, please see bin/nutch solrindex.
+  
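+ To confirm that documents actually reached the index, you can ask Solr for the total document count (a minimal sketch; it assumes the default port 8983 and the standard select handler):
+ {{{
+ # the numFound attribute in the response shows how many documents are indexed
+ curl "http://localhost:8983/solr/select?q=*:*&rows=0&indent=on"
+ }}}
+ 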
+ If all has gone to plan, we are now ready to search with http://localhost:8983/solr/admin/.
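+ 
+ Searches can also be issued directly over HTTP, for example (a minimal sketch; the query term "nutch" and the parameter values are only an illustration):
+ {{{
+ curl "http://localhost:8983/solr/select?q=nutch&start=0&rows=10&indent=on&wt=json"
+ }}}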
  
- Now the indexed content is available through Solr. You can try to execute searches from the Solr admin UI at
- 
- http://127.0.0.1:8983/solr/admin
- 
- or directly with a URL like
- 
- http://127.0.0.1:8983/solr/nutch/?q=solr&version=2.2&start=0&rows=10&indent=on&wt=json
- 
- (Note: If you installed Solr via the Ubuntu package manager then the port Solr is on will be "8080" and not "8983". *sigh*)
- 
- === Comments ===
- --------------------------------------
- 
- Hi, I too faced problems integrating Solr and Nutch. After some work I found the article below and integrated successfully. http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/
- 
- When I run solrindex, I always get a 400 error. That means we should add some fields to the Solr schema.xml. Here is one blog entry: http://androidyou.blogspot.com/2010/09/update-nutch-index-to-solr.html
- 
