Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change 
notification.

The "RunningNutchAndSolr" page has been changed by AlexMc.
http://wiki.apache.org/nutch/RunningNutchAndSolr?action=diff&rev1=32&rev2=33

--------------------------------------------------

  
  '''d.''' Open apache-solr-1.3.0/example/solr/conf/solrconfig.xml and paste the following fragment into it:
  
+ {{{
  <requestHandler name="/nutch" class="solr.SearchHandler" >
- 
  <lst name="defaults">
- 
  <str name="defType">dismax</str>
- 
  <str name="echoParams">explicit</str>
- 
  <float name="tie">0.01</float>
- 
  <str name="qf">
- 
  content^0.5 anchor^1.0 title^1.2 </str>
- 
  <str name="pf"> content&#94;0.5 anchor&#94;1.5 title&#94;1.2 site&#94;1.5 
</str>
- 
  <str name="fl"> url </str>
- 
  <str name="mm"> 2<-1 5<-2 6<90% </str>
- 
  <int name="ps">100</int>
- 
  <bool name="hl">true</bool>
- 
  <str name="q.alt">*:*</str>
- 
  <str name="hl.fl">title url content</str>
- 
  <str name="f.title.hl.fragsize">0</str>
- 
  <str name="f.title.hl.alternateField">title</str>
- 
  <str name="f.url.hl.fragsize">0</str>
- 
  <str name="f.url.hl.alternateField">url</str>
- 
  <str name="f.content.hl.fragmenter">regex</str>
- 
  </lst>
- 
  </requestHandler>
+ }}}
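Once Solr is up, the handler above can be exercised over plain HTTP. A minimal sketch of building the request URL (the host and port assume Solr's default example Jetty setup, and the query term is illustrative):

```shell
# Build a query URL for the /nutch handler defined above.
# localhost:8983 assumes the default Solr example server.
SOLR_URL="http://localhost:8983/solr/nutch"
QUERY="apache"
REQUEST="${SOLR_URL}?q=${QUERY}"
echo "${REQUEST}"
# Fetch results once Solr is running:
# curl -s "${REQUEST}"
```

The dismax defaults (qf, pf, mm) declared in the handler apply automatically, so the request only needs a q parameter.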
  
  '''6.''' Start Solr
  
@@ -86, +68 @@

  
  a. Open nutch-site.xml in the directory apache-nutch-1.0/conf and replace its contents with the following (we specify our crawler name and active plugins, and limit the maximum URL count per host to 100 per run):
  
+ {{{
  <?xml version="1.0"?>
  <configuration>
+ <property>
+ <name>http.agent.name</name>
+ <value>nutch-solr-integration</value>
+ </property>
+ <property>
+ <name>generate.max.per.host</name>
+ <value>100</value>
+ </property>
+ <property>
+ <name>plugin.includes</name>
+ <value>protocol-http|urlfilter-regex|parse-html|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
+ </property>
+ </configuration>
+ }}}
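The effect of generate.max.per.host can be illustrated with a small stand-in (a sketch, not Nutch code): keep at most N URLs per host from a candidate list. The hostnames below are made up for the example.

```shell
# Stand-in for generate.max.per.host: keep at most MAX_PER_HOST
# urls per host (the host is field 3 of an http:// url split on "/").
MAX_PER_HOST=2
printf '%s\n' \
  http://a.example/1 http://a.example/2 http://a.example/3 \
  http://b.example/1 |
awk -F/ -v max="$MAX_PER_HOST" '{ if (++seen[$3] <= max) print }'
# keeps a.example/1, a.example/2, b.example/1
```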
  
- <property>
  
- <name>http.agent.name</name>
+ '''b.''' Open regex-urlfilter.txt in the directory apache-nutch-1.0/conf and replace its contents with something similar to the following:
  
- <value>nutch-solr-integration</value>
+ {{{
+ -^(https|telnet|file|ftp|mailto):
+ # skip some suffixes 
+ -\.(swf|SWF|doc|DOC|mp3|MP3|WMV|wmv|txt|TXT|rtf|RTF|avi|AVI|m3u|M3U|flv|FLV|WAV|wav|mp4|MP4|rss|RSS|xml|XML|pdf|PDF|js|JS|gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
+ # skip URLs containing certain characters as probable queries, etc. 
+ -[?*!@=]
+ # allow urls in foofactory.fi domain (or lucidimagination.com...)
+ +^http://([a-z0-9\-A-Z]*\.)*lucidimagination.com/
+ # deny anything else 
+ -.
+ }}}
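A quick sketch of how these rules behave, using grep -E as a stand-in for Nutch's regex-urlfilter (rules are applied top-down and the first match wins; the suffix rule is abbreviated and the sample URLs are illustrative):

```shell
# Stand-in for regex-urlfilter: test sample urls against two of the rules.
suffix='\.(pdf|PDF|jpg|JPG)$'                             # abbreviated suffix rule
allow='^http://([a-zA-Z0-9-]*\.)*lucidimagination\.com/'  # domain allow rule

echo 'http://lucidimagination.com/brochure.pdf' | grep -qE "$suffix" \
  && echo 'skipped by suffix rule'
echo 'http://www.lucidimagination.com/about' | grep -qE "$allow" \
  && echo 'allowed by domain rule'
```

The final `-.` rule means any URL that matches none of the earlier patterns is rejected.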
  
- </property>
- 
- <property> <name>generate.max.per.host</name>
- 
- <value>100</value>
- 
- </property>
- 
- <property>
- 
- <name>plugin.includes</name>
- 
- 
<value>protocol-http|urlfilter-regex|parse-html|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
- 
- </property>
- 
- </configuration>
- 
- '''b.''' Open regex-urlfilter.txt in directory apache-nutch-1.0/conf,replace 
it’s content with following:
- 
- -^(https|telnet|file|ftp|mailto):
- 
- # skip some suffixes 
- 
- 
-\.(swf|SWF|doc|DOC|mp3|MP3|WMV|wmv|txt|TXT|rtf|RTF|avi|AVI|m3u|M3U|flv|FLV|WAV|wav|mp4|MP4|avi|AVI|rss|RSS|xml|XML|pdf|PDF|js|JS|gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
- 
- # skip URLs containing certain characters as probable queries, etc. 
- 
- -[...@=]
- 
- # allow urls in foofactory.fi domain (or lucidimagination.com...)
- 
- +^http://([a-z0-9\-A-Z]*\.)*lucidimagination.com/
- 
- # deny anything else 
- 
- -.
  
  '''8.''' Create a seed list (the initial URLs to fetch)
  
+ {{{
  mkdir urls 
  echo "http://www.lucidimagination.com/"; > urls/seed.txt
+ }}}
  
  '''9.''' Inject the seed URL(s) into the Nutch crawldb (execute in the Nutch directory)
  
+ {{{
  bin/nutch inject crawl/crawldb urls
+ }}}
  
  '''10.''' Generate fetch list, fetch and parse content
  
+ {{{
  bin/nutch generate crawl/crawldb crawl/segments
+ }}}
  
  The above command will generate a new segment directory under crawl/segments that, at this point, contains files storing the URL(s) to be fetched. The following commands need the latest segment directory as a parameter, so we'll store it in an environment variable:
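Segment names are timestamps, so the newest one sorts last; a sketch of capturing it (the two mkdir'd directory names are illustrative stand-ins for what bin/nutch generate would create):

```shell
# Store the latest segment directory in a variable.
# Segment names are timestamps, so lexicographic order is chronological.
mkdir -p crawl/segments/20090401000000 crawl/segments/20090402000000  # stand-ins
SEGMENT=crawl/segments/$(ls crawl/segments | sort | tail -1)
echo "$SEGMENT"
```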
  
