When I tried to deploy Nutch in intranet crawl mode, it built fine, but when I ran the command:
$NUTCH_HOME/bin/nutch crawl $HOME/SearchTest/urls -dir $HOME/SearchTest/crawl -depth 2

bin/nutch returns the following log. For the sake of completeness, it is duplicated in its entirety below:

060227 150621 parsing file:/home/hdiwan/nutch-0.7.1/conf/nutch-default.xml
060227 150621 parsing file:/home/hdiwan/nutch-0.7.1/conf/crawl-tool.xml
060227 150621 parsing file:/home/hdiwan/nutch-0.7.1/conf/nutch-site.xml
060227 150621 No FS indicated, using default:local
060227 150621 crawl started in: /home/hdiwan/nutch/crawl20060227150607
060227 150621 rootUrlFile = /home/hdiwan/SpectraSearch/urls
060227 150621 threads = 10
060227 150621 depth = 2
060227 150621 Created webdb at LocalFS,/home/hdiwan/nutch/crawl20060227150607/db
060227 150621 Starting URL processing
060227 150621 Plugins: looking in: /home/hdiwan/nutch-0.7.1/build/plugins
060227 150621 parsing: /home/hdiwan/nutch-0.7.1/build/plugins/nutch-extensionpoints/plugin.xml
060227 150621 not including: /home/hdiwan/nutch-0.7.1/build/plugins/protocol-ftp
060227 150621 not including: /home/hdiwan/nutch-0.7.1/build/plugins/protocol-http
060227 150621 parsing: /home/hdiwan/nutch-0.7.1/build/plugins/protocol-httpclient/plugin.xml
060227 150621 impl: point=org.apache.nutch.protocol.Protocol class=org.apache.nutch.protocol.httpclient.Http
060227 150621 impl: point=org.apache.nutch.protocol.Protocol class=org.apache.nutch.protocol.httpclient.Http
060227 150621 parsing: /home/hdiwan/nutch-0.7.1/build/plugins/parse-html/plugin.xml
060227 150621 impl: point=org.apache.nutch.parse.Parser class=org.apache.nutch.parse.html.HtmlParser
060227 150621 not including: /home/hdiwan/nutch-0.7.1/build/plugins/parse-js
060227 150621 parsing: /home/hdiwan/nutch-0.7.1/build/plugins/parse-text/plugin.xml
060227 150621 impl: point=org.apache.nutch.parse.Parser class=org.apache.nutch.parse.text.TextParser
060227 150621 not including: /home/hdiwan/nutch-0.7.1/build/plugins/parse-pdf
060227 150621 not including: /home/hdiwan/nutch-0.7.1/build/plugins/parse-rss
060227 150621 not including: /home/hdiwan/nutch-0.7.1/build/plugins/parse-msword
060227 150621 not including: /home/hdiwan/nutch-0.7.1/build/plugins/parse-ext
060227 150621 not including: /home/hdiwan/nutch-0.7.1/build/plugins/index-basic
060227 150621 parsing: /home/hdiwan/nutch-0.7.1/build/plugins/index-more/plugin.xml
060227 150622 impl: point=org.apache.nutch.indexer.IndexingFilter class=org.apache.nutch.indexer.more.MoreIndexingFilter
060227 150622 parsing: /home/hdiwan/nutch-0.7.1/build/plugins/query-basic/plugin.xml
060227 150622 impl: point=org.apache.nutch.searcher.QueryFilter class=org.apache.nutch.searcher.basic.BasicQueryFilter
060227 150622 parsing: /home/hdiwan/nutch-0.7.1/build/plugins/query-more/plugin.xml
060227 150622 impl: point=org.apache.nutch.searcher.QueryFilter class=org.apache.nutch.searcher.more.TypeQueryFilter
060227 150622 impl: point=org.apache.nutch.searcher.QueryFilter class=org.apache.nutch.searcher.more.DateQueryFilter
060227 150622 parsing: /home/hdiwan/nutch-0.7.1/build/plugins/query-site/plugin.xml
060227 150622 impl: point=org.apache.nutch.searcher.QueryFilter class=org.apache.nutch.searcher.site.SiteQueryFilter
060227 150622 parsing: /home/hdiwan/nutch-0.7.1/build/plugins/query-url/plugin.xml
060227 150622 impl: point=org.apache.nutch.searcher.QueryFilter class=org.apache.nutch.searcher.url.URLQueryFilter
060227 150622 not including: /home/hdiwan/nutch-0.7.1/build/plugins/urlfilter-regex
060227 150622 not including: /home/hdiwan/nutch-0.7.1/build/plugins/urlfilter-prefix
060227 150622 not including: /home/hdiwan/nutch-0.7.1/build/plugins/creativecommons
060227 150622 not including: /home/hdiwan/nutch-0.7.1/build/plugins/language-identifier
060227 150622 not including: /home/hdiwan/nutch-0.7.1/build/plugins/clustering-carrot2
060227 150622 not including: /home/hdiwan/nutch-0.7.1/build/plugins/ontology
060227 150622 Using URL normalizer: org.apache.nutch.net.BasicUrlNormalizer
060227 150622 Added 30 pages
060227 150622 Processing pagesByURL: Sorted 30 instructions in 0.0080 seconds.
060227 150622 Processing pagesByURL: Sorted 3750.0 instructions/second
060227 150622 Processing pagesByURL: Merged to new DB containing 18 records in 0.0050 seconds
060227 150622 Processing pagesByURL: Merged 3600.0 records/second
060227 150622 Processing pagesByMD5: Sorted 18 instructions in 0.0040 seconds.
060227 150622 Processing pagesByMD5: Sorted 4500.0 instructions/second
060227 150622 Processing pagesByMD5: Merged to new DB containing 18 records in 0.0010 seconds
060227 150622 Processing pagesByMD5: Merged 18000.0 records/second
060227 150622 Processing linksByMD5: Copied file (4096 bytes) in 0.0050 secs.
060227 150622 Processing linksByURL: Copied file (4096 bytes) in 0.0030 secs.
060227 150622 Processing /home/hdiwan/nutch/crawl20060227150607/segments/20060227150622/fetchlist.unsorted: Sorted 18 entries in 0.0030 seconds.
060227 150622 Processing /home/hdiwan/nutch/crawl20060227150607/segments/20060227150622/fetchlist.unsorted: Sorted 6000.0 entries/second
060227 150622 Overall processing: Sorted 18 entries in 0.0030 seconds.
060227 150622 Overall processing: Sorted 1.6666666666666666E-4 entries/second
060227 150622 FetchListTool completed
060227 150622 logging at INFO
060227 150622 fetching http://hasan.wits2020.net/~hdiwan/blog/2006/02/27/murder_in_samarkand.html
060227 150622 fetching http://hasan.wits2020.net/~hdiwan/blog/2006/02/16/book_search_presentation.html
060227 150622 fetching http://hasan.wits2020.net/~hdiwan/blog/2006/02/20/emailcasting_revisited.html
060227 150622 http.proxy.host = null
060227 150622 http.proxy.port = 8118
060227 150622 http.timeout = 10000
060227 150622 http.content.limit = -1
060227 150622 http.agent = Spectra/200602 (Spectra; http://hasan.wits2020.net/typo/public; [EMAIL PROTECTED])
060227 150622 http.auth.ntlm.username =
060227 150622 fetcher.server.delay = 1000
060227 150622 http.max.delays = 100
060227 150623 Configured Client
060227 150623 fetching http://hasan.wits2020.net/~hdiwan/blog/2006/02/26/automating_photographic_workfl.html
060227 150623 fetching http://hasan.wits2020.net/~hdiwan/blog/2006/02/25/pint_search.html
060227 150623 fetching http://hasan.wits2020.net/~hdiwan/blog/2006/02/16/spectrasearch_privacy_statemen.html
060227 150623 fetching http://hasan.wits2020.net/~hdiwan/blog/2006/02/27/directtv_videoondemand.html
060227 150623 fetching http://hasan.wits2020.net/~hdiwan/blog/2006/02/15/atms_and_googlemaps_tad_buggy.html
060227 150623 fetching http://hasan.wits2020.net/~hdiwan/blog/2006/02/19/capannina.html
060227 150623 fetching http://hasan.wits2020.net/~hdiwan/blog/2006/02/15/opera_tries_to_converge_bittor.html
060227 150623 fetching http://hasan.wits2020.net/~hdiwan/blog/2006/02/25/transamerica.html
060227 150623 fetching http://hasan.wits2020.net/~hdiwan/blog/2006/02/27/nobody_likes_me.html
060227 150623 fetching http://hasan.wits2020.net/~hdiwan/blog/2006/02/21/blogging_system_critiques.html
060227 150623 fetching http://hasan.wits2020.net/~hdiwan/blog/2006/02/25/sorry_haters.html
060227 150623 fetching http://hasan.wits2020.net/~hdiwan/blog/2006/02/15/valentines_overseas.html
060227 150623 fetching http://hasan.wits2020.net/~hdiwan/blog/2006/02/18/spectrasearch_update.html
060227 150624 Updating /home/hdiwan/nutch/crawl20060227150607/db
060227 150624 Updating for /home/hdiwan/nutch/crawl20060227150607/segments/20060227150622
060227 150624 Processing document 0
060227 150624 Finishing update
060227 150626 Update finished
060227 150626 Updating /home/hdiwan/nutch/crawl20060227150607/segments from /home/hdiwan/nutch/crawl20060227150607/db
060227 150626 reading /home/hdiwan/nutch/crawl20060227150607/segments/20060227150622
060227 150626 reading /home/hdiwan/nutch/crawl20060227150607/segments/20060227150624
060227 150626 Sorting pages by url...
060227 150626 Getting updated scores and anchors from db...
060227 150626 Sorting updates by segment...
060227 150626 Updating segments...
060227 150626 updating /home/hdiwan/nutch/crawl20060227150607/segments/20060227150622
060227 150626 Done updating /home/hdiwan/nutch/crawl20060227150607/segments from /home/hdiwan/nutch/crawl20060227150607/db
060227 150626 indexing segment: /home/hdiwan/nutch/crawl20060227150607/segments/20060227150622
060227 150626 * Opening segment 20060227150622
060227 150626 * Indexing segment 20060227150622
060227 150626 * Optimizing index...
060227 150626 * Moving index to NFS if needed...
060227 150626 DONE indexing segment 20060227150622: total 18 records in 0.047 s (Infinity rec/s).
060227 150626 done indexing
060227 150626 done indexing
060227 150626 Reading url hashes...
060227 150626 Sorting url hashes...
060227 150626 Deleting url duplicates...
060227 150626 Deleted 0 url duplicates.
060227 150626 Reading content hashes...
060227 150626 Sorting content hashes...
060227 150626 Deleting content duplicates...
060227 150626 Deleted 0 content duplicates.
060227 150626 Duplicate deletion complete locally. Now returning to NFS...
060227 150626 DeleteDuplicates complete
060227 150626 Merging segment indexes...
060227 150626 crawl finished: /home/hdiwan/nutch/crawl20060227150607

Now, I'm sure there are duplicates in the URL list, yet Nutch doesn't delete anything. I'm also going to be adding new pages fairly frequently, and the crawl command does not let you add new URLs without removing the last crawl. So, how would I go about doing this? Thanks for the help! Please CC replies to my personal address. Thanks a bunch!
--
Cheers, Hasan Diwan <[EMAIL PROTECTED]>
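P.S. One thing I noticed in the log above: the 30 seed URLs were already merged down to 18 records at inject time ("Added 30 pages ... Merged to new DB containing 18 records"), so perhaps there were simply no duplicates left by the time DeleteDuplicates ran. As for adding pages incrementally, my best guess so far, based on the whole-web crawling tutorial, is to skip the all-in-one crawl command and drive the individual Nutch 0.7 tools myself so the WebDB persists across runs. A rough sketch of what I have in mind (the new_urls.txt filename and the db/segments layout are my assumptions; the exact sub-command options may differ in this build):

```shell
# One-time setup: create a persistent WebDB instead of using
# the all-in-one crawl command (hypothetical layout: ./db, ./segments).
bin/nutch admin db -create

# Each time new pages need to be added:
bin/nutch inject db -urlfile new_urls.txt  # new_urls.txt is a placeholder name
bin/nutch generate db segments             # write a fetchlist into a new segment
s=`ls -d segments/2* | tail -1`            # pick the newest segment directory
bin/nutch fetch $s                         # fetch the pages in that fetchlist
bin/nutch updatedb db $s                   # fold fetch results back into the WebDB
bin/nutch index $s                         # build a Lucene index for the segment
bin/nutch dedup segments dedup.tmp         # drop duplicate content across segments
```

If that is right, inject should only add URLs the WebDB hasn't seen, so nothing from the last crawl would need to be thrown away. Can anyone confirm?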
