Dear Wiki user, You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.
The following page has been changed by DanielLaLiberte:
http://wiki.apache.org/nutch/Nutch_-_The_Java_Search_Engine

------------------------------------------------------------------------------

  === 3.2.1 Create a flat file of root urls. ===
- For example, to crawl the http://www.virtusa.com site from scratch, you might start with a file named 'urls' containing just the Virtusa home page. All other pages should be reachable from this page. The 'urls' file would therefore contain: http://www.virtusa.com
+ For example, to crawl the http://www.virtusa.com site from scratch, you might start with a file named 'urls' containing just the URL for the Virtusa home page. All other pages should be reachable by links from this page. The 'urls' file would therefore contain: http://www.virtusa.com
+ 
+ The 'urls' file could be put anywhere. It will be used below in the nutch crawl command, which assumes the file is in the nutch directory.

  === 3.2.2 Edit the file conf/crawl-urlfilter.txt ===
  If you are using TRUNK then there is no file called conf/crawl-urlfilter.txt, only conf/crawl-urlfilter.txt.template. Just do
  {{{
- cat conf/crawl-urlfilter.txt.template|sed 's/MY.DOMAIN.NAME/criaturitas.org/'g> conf/crawl-urlfilter.txt
+ cat conf/crawl-urlfilter.txt.template | sed 's/MY.DOMAIN.NAME/criaturitas.org/g' > conf/crawl-urlfilter.txt }}}
- }}}

  If you already have this file then replace the existing domain name with the name of the domain you wish to crawl. For example, if you wished to limit the crawl to the virtusa.com domain, the line should read:
+ {{{
- {{{ +^http://([a-z0-9]*\.)*virtusa.com/ }}}
+ +^http://([a-z0-9]*\.)*virtusa.com/ }}}

  This will include any url in the domain virtusa.com in the crawl.

@@ -127, +129 @@

  * -dir dir names the directory to put the crawl in.
  * -depth depth indicates the link depth from the root page that should be crawled.
  * -delay delay determines the number of seconds between accesses to each host.
- * -threads threads determines the number of threads that will fetch in parallel.
+ * -threads threads determines the number of threads that will fetch in parallel. }}}
- }}}

  For example, a typical command might be:
+ {{{
- {{{ bin/nutch crawl urls -dir crawl.virtusa -depth 10 }}}
+ bin/nutch crawl urls -dir crawl.virtusa -depth 10 }}}

  === 3.2.4 Output of the crawl ===

@@ -175, +177 @@

  <property>
  <name>searcher.dir</name>
  <value>/home/tyrell/nutch-0.7/crawl.virtusa</value>
- </property>
+ </property> }}}
- }}}

  4. Re-start Tomcat

@@ -196, +197 @@

  Now that all is working, we need to think about the long term maintenance of the index. This is a required activity because the web gets updated frequently. New content will appear on sites while existing content might get modified or deleted altogether.

- Nutch provides the administrator with a set of commands to update a given index, however performing them manually will not only be tiresome but also non productive. Since this task need to be carried out periodically it should ideally be scheduled.
+ Nutch provides the administrator with a set of commands to update a given index; however, performing them manually would be not only tiresome but also unproductive. Since this task needs to be carried out periodically, it should ideally be scheduled.

  === 3.5.1 Creating a Maintenance Shell Script ===

@@ -233, +234 @@

  }}}

- === 3.5.2 Scheduling Index Updations ===
+ === 3.5.2 Scheduling Index Updates ===
  The above shell script can be scheduled to be run periodically using a 'cron' job.
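The maintenance script referenced in sections 3.5.1 and 3.5.2 is not shown in this hunk. As a rough illustration of the idea only, a minimal script along these lines could re-run the crawl command the page already demonstrates and swap the result in; the paths, the `crawl.new`/`crawl.old` directory names, and the crawl depth are all assumptions, not part of the page:

```shell
#!/bin/sh
# Hypothetical maintenance script -- paths and names are illustrative only.
# Re-runs the crawl into a fresh directory, then swaps it into the location
# that searcher.dir in nutch-site.xml points at, so Tomcat always serves a
# complete index.

NUTCH_HOME=/home/tyrell/nutch-0.7        # assumed install directory
CRAWL_DIR="$NUTCH_HOME/crawl.virtusa"    # directory named in searcher.dir

cd "$NUTCH_HOME" || exit 1

# Crawl into a temporary directory first, so a failed crawl does not
# destroy the index currently being served.
rm -rf crawl.new
bin/nutch crawl urls -dir crawl.new -depth 10 || exit 1

# Keep the previous crawl as a backup, then swap the new one in.
rm -rf crawl.old
[ -d "$CRAWL_DIR" ] && mv "$CRAWL_DIR" crawl.old
mv crawl.new "$CRAWL_DIR"
```

Saved as, say, /home/tyrell/bin/update-index.sh and made executable, it could then be scheduled with a crontab entry such as `0 3 * * 0 /home/tyrell/bin/update-index.sh` (3 a.m. every Sunday); after each run Tomcat would be restarted or the webapp reloaded so the searcher picks up the new index, as the page describes.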