The "IntranetDocumentSearch" page has been changed by MichaelAlleblas: http://wiki.apache.org/nutch/IntranetDocumentSearch?action=diff&rev1=1&rev2=2

When configured correctly, there should be a core located at ''http://localhost:8983/solr/nutch''. You can test this by accessing the administration page at ''http://localhost:8983/solr/nutch/admin'', where you can also verify that the schema is loaded correctly.

== Apache Nutch ==
Put the Nutch package ''apache-nutch-1.3-bin.tar.gz'' into ''/opt'' and extract it with the command ''tar -xvf apache-nutch-1.3-bin.tar.gz''. There is an example runtime setup under the directory ''/opt/nutch-1.3/runtime/local''; change into this directory. All configuration files below are referenced relative to this path.

=== File: conf/regex-urlfilter.txt ===
Make the following changes.

Comment out the line that prevents file:// URLs from being handled:

{{{
# skip file: ftp: and mailto: urls
#-^(file|ftp|mailto):
}}}
Add the following lines to skip HTTP(S), FTP and MailTo URLs instead:

{{{
# skip http: https: ftp: and mailto: urls
-^(http|ftp|mailto|https):
}}}
In this example the documents to be indexed are under ''/srv/samba/files''. Use a regular expression appropriate for the paths you would like to index:

{{{
# accept anything else
+^file:///srv/samba/files
#-.
}}}
Now create a directory called ''urls''. This will hold text files containing lists of URLs for the crawler to process.

{{{
> mkdir urls
> touch urls/local-fs
}}}
=== File: urls/local-fs ===
Example content for ''urls/local-fs'' is shown below. The URL filter configured above should allow these paths to be accepted.

{{{
file:///srv/samba/files/
}}}
=== File: conf/nutch-site.xml ===
This file configures the plugins that Nutch will use. It must be edited to allow the file:// protocol plugin to be loaded.
There are many configuration options that can be set here, but the following should suffice to get a working local crawler going:

{{{
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
  <property>
    <name>http.agent.name</name>
    <value>My Nutch Spider</value>
  </property>
  <property>
    <name>plugin.includes</name>
    <value>protocol-file|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
    <description>Regular expression naming plugin directory names to
    include. Any plugin not matching this expression is excluded.
    In any case you need at least include the nutch-extensionpoints plugin. By
    default Nutch includes crawling just HTML and plain text via HTTP,
    and basic indexing and search plugins. In order to use HTTPS please enable
    protocol-httpclient, but be aware of possible intermittent problems with the
    underlying commons-httpclient library.
    </description>
  </property>
</configuration>
}}}
=== Running the Crawler ===
The following command is used to start crawling:

{{{
> ./bin/nutch crawl urls -dir crawl -depth 2 -solr http://localhost:8983/solr/nutch
}}}
==== Command Breakdown ====
'''./bin/nutch''' - the Nutch binary

'''crawl''' - tells Nutch what action to perform

'''urls''' - the urls directory we created earlier, containing our URL lists

'''-dir crawl''' - tells Nutch where to store its crawl data

'''-depth 2''' - limits link depth; this is for testing purposes, since otherwise crawling can take a very long time. It tests recursive retrieval without going too deep.
You can remove this option when you want to do a full index.

'''-solr http://localhost:8983/solr/nutch''' - tells Nutch where the Solr server is located, so that the index data can be uploaded once crawling has finished.

=== Notes ===
'''MichaelAlleblas:''' I am very new to Solr and Nutch myself, so this is by no means a comprehensive or completely accurate guide. I just hope it can be a starting point for others to pool their collective knowledge, help this capability of Nutch be exploited more often, and allow newcomers like myself to get things up and running as simply and quickly as possible.
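The URL filter rules configured in ''conf/regex-urlfilter.txt'' earlier can be sanity-checked before running a crawl. The sketch below is not part of Nutch; it simply emulates the two rules with grep on some invented sample URLs, so you can confirm that only paths under ''/srv/samba/files'' would survive filtering:

```shell
# Emulate the regex-urlfilter rules with grep (sanity check only;
# the sample URLs are invented for illustration).
printf '%s\n' \
    'file:///srv/samba/files/report.pdf' \
    'http://example.com/index.html' \
    'file:///etc/passwd' \
  | grep -Ev '^(http|ftp|mailto|https):' \
  | grep -E '^file:///srv/samba/files'
# Only file:///srv/samba/files/report.pdf should be printed.
```

If a URL you expect to be indexed does not survive this pipeline, adjust the accept pattern before starting a long crawl.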

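A possible refinement of the ''urls/local-fs'' seed file (an assumption on my part, not from the guide above): instead of a single root URL, you can seed one file:// URL per directory, so that even a shallow ''-depth'' crawl reaches nested folders. A minimal sketch, assuming the documents live under ''/srv/samba/files'' as in this guide:

```shell
# Sketch: seed one file:// URL per directory under the share,
# so a shallow crawl still has an entry point into every folder.
find /srv/samba/files -type d \
  | sed 's|^|file://|; s|$|/|' > urls/local-fs
```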
