The "IntranetDocumentSearch" page has been changed by MichaelAlleblas: http://wiki.apache.org/nutch/IntranetDocumentSearch?action=diff&rev1=1&rev2=2

When configured correctly, there should be a core located at ''http://localhost:8983/solr/nutch''. You can test this by accessing the administration page at ''http://localhost:8983/solr/nutch/admin'', where you can also verify that the schema is loaded correctly.

== Apache Nutch ==
Put the Nutch package ''apache-nutch-1.3-bin.tar.gz'' into ''/opt'' and extract it with the command ''tar -xvf apache-nutch-1.3-bin.tar.gz''. There is an example runtime setup under the directory ''/opt/nutch-1.3/runtime/local''; change into this directory. All configuration files below are referenced relative to this path.

=== File: conf/regex-urlfilter.txt ===
Make the following changes.

Comment out the line that prevents file:// URLs from being handled:

{{{
# skip file: ftp: and mailto: urls
#-^(file|ftp|mailto):
}}}
Add the following lines to skip HTTP(S), FTP and MailTo URLs instead:

{{{
# skip http: https: ftp: and mailto: urls
-^(http|ftp|mailto|https):
}}}
In this example the documents to be indexed are under ''/srv/samba/files''. Use a regular expression appropriate for the paths you would like to index:

{{{
# accept anything else
+^file:///srv/samba/files
#-.
}}}
Now create a directory called ''urls''. This will hold text files containing lists of URLs for the crawler to process.

{{{
> mkdir urls
> touch urls/local-fs
}}}
=== File: urls/local-fs ===
Example content for ''urls/local-fs'' is shown below. The URL filter configured above should allow these paths to be accepted.

{{{
file:///srv/samba/files/
}}}
=== File: conf/nutch-site.xml ===
This file configures the plugins that Nutch will use. It must be edited to allow the file:// protocol plugin to be loaded.
There are many configuration options that can be set here, but the following should suffice to get a working local crawler going:

{{{
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
  <property>
    <name>http.agent.name</name>
    <value>My Nutch Spider</value>
  </property>
  <property>
    <name>plugin.includes</name>
    <value>protocol-file|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
    <description>Regular expression naming plugin directory names to
    include. Any plugin not matching this expression is excluded.
    In any case you need at least include the nutch-extensionpoints plugin. By
    default Nutch includes crawling just HTML and plain text via HTTP,
    and basic indexing and search plugins. In order to use HTTPS please enable
    protocol-httpclient, but be aware of possible intermittent problems with the
    underlying commons-httpclient library.
    </description>
  </property>
</configuration>
}}}
=== Running the Crawler ===
The following command is used to start crawling:

{{{
> ./bin/nutch crawl urls -dir crawl -depth 2 -solr http://localhost:8983/solr/nutch
}}}
==== Command Breakdown ====
'''./bin/nutch''' - the Nutch binary

'''crawl''' - tells Nutch what action to perform

'''urls''' - the urls directory we created earlier, containing our URL lists

'''-dir crawl''' - tells Nutch where to store its crawl data

'''-depth 2''' - limits link depth; this is for testing purposes, since otherwise crawling can take a very long time. It tests recursive retrieval without going too deep.
You can remove this option when you want to do a full index.

'''-solr http://localhost:8983/solr/nutch''' - tells Nutch where the Solr server is located, so that the index data can be uploaded once crawling has finished.

=== Notes ===
'''MichaelAlleblas:''' I am very new to Solr and Nutch myself, so this is by no means a comprehensive or completely accurate guide. I just hope it can be a starting point for others to pool their collective knowledge, help this capability of Nutch be exploited more often, and allow newcomers like myself to get things up and running as simply and quickly as possible.
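The URL filter rules configured in ''conf/regex-urlfilter.txt'' earlier can be sanity-checked before running a crawl. The sketch below is not part of Nutch; it simply emulates the two rules with grep on some invented sample URLs, so you can confirm that only paths under ''/srv/samba/files'' would survive filtering:

```shell
# Emulate the regex-urlfilter rules with grep (sanity check only;
# the sample URLs are invented for illustration).
printf '%s\n' \
    'file:///srv/samba/files/report.pdf' \
    'http://example.com/index.html' \
    'file:///etc/passwd' \
  | grep -Ev '^(http|ftp|mailto|https):' \
  | grep -E '^file:///srv/samba/files'
# Only file:///srv/samba/files/report.pdf should be printed.
```

If a URL you expect to be indexed does not survive this pipeline, adjust the accept pattern before starting a long crawl.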

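A possible refinement of the ''urls/local-fs'' seed file (an assumption on my part, not from the guide above): instead of a single root URL, you can seed one file:// URL per directory, so that even a shallow ''-depth'' crawl reaches nested folders. A minimal sketch, assuming the documents live under ''/srv/samba/files'' as in this guide:

```shell
# Sketch: seed one file:// URL per directory under the share,
# so a shallow crawl still has an entry point into every folder.
find /srv/samba/files -type d \
  | sed 's|^|file://|; s|$|/|' > urls/local-fs
```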
