When I tried to deploy nutch in intranet crawl mode, it built fine, but when
I tried to run the command:

$NUTCH_HOME/bin/nutch crawl $HOME/SearchTest/urls -dir
$HOME/SearchTest/crawl -depth 2

bin/nutch produces the following log. For the sake of completeness, it is
reproduced in its entirety below:

060227 150621 parsing file:/home/hdiwan/nutch-0.7.1/conf/nutch-default.xml
060227 150621 parsing file:/home/hdiwan/nutch-0.7.1/conf/crawl-tool.xml
060227 150621 parsing file:/home/hdiwan/nutch-0.7.1/conf/nutch-site.xml
060227 150621 No FS indicated, using default:local
060227 150621 crawl started in: /home/hdiwan/nutch/crawl20060227150607
060227 150621 rootUrlFile = /home/hdiwan/SpectraSearch/urls
060227 150621 threads = 10
060227 150621 depth = 2
060227 150621 Created webdb at
LocalFS,/home/hdiwan/nutch/crawl20060227150607/db
060227 150621 Starting URL processing
060227 150621 Plugins: looking in: /home/hdiwan/nutch-0.7.1/build/plugins
060227 150621 parsing: /home/hdiwan/nutch-0.7.1/build/plugins/nutch-extensionpoints/plugin.xml
060227 150621 not including: /home/hdiwan/nutch-0.7.1/build/plugins/protocol-ftp
060227 150621 not including: /home/hdiwan/nutch-0.7.1/build/plugins/protocol-http
060227 150621 parsing: /home/hdiwan/nutch-0.7.1/build/plugins/protocol-httpclient/plugin.xml
060227 150621 impl: point=org.apache.nutch.protocol.Protocol class=org.apache.nutch.protocol.httpclient.Http
060227 150621 impl: point=org.apache.nutch.protocol.Protocol class=org.apache.nutch.protocol.httpclient.Http
060227 150621 parsing: /home/hdiwan/nutch-0.7.1/build/plugins/parse-html/plugin.xml
060227 150621 impl: point=org.apache.nutch.parse.Parser class=org.apache.nutch.parse.html.HtmlParser
060227 150621 not including: /home/hdiwan/nutch-0.7.1/build/plugins/parse-js
060227 150621 parsing: /home/hdiwan/nutch-0.7.1/build/plugins/parse-text/plugin.xml
060227 150621 impl: point=org.apache.nutch.parse.Parser class=org.apache.nutch.parse.text.TextParser
060227 150621 not including: /home/hdiwan/nutch-0.7.1/build/plugins/parse-pdf
060227 150621 not including: /home/hdiwan/nutch-0.7.1/build/plugins/parse-rss
060227 150621 not including: /home/hdiwan/nutch-0.7.1/build/plugins/parse-msword
060227 150621 not including: /home/hdiwan/nutch-0.7.1/build/plugins/parse-ext
060227 150621 not including: /home/hdiwan/nutch-0.7.1/build/plugins/index-basic
060227 150621 parsing: /home/hdiwan/nutch-0.7.1/build/plugins/index-more/plugin.xml
060227 150622 impl: point=org.apache.nutch.indexer.IndexingFilter class=org.apache.nutch.indexer.more.MoreIndexingFilter
060227 150622 parsing: /home/hdiwan/nutch-0.7.1/build/plugins/query-basic/plugin.xml
060227 150622 impl: point=org.apache.nutch.searcher.QueryFilter class=org.apache.nutch.searcher.basic.BasicQueryFilter
060227 150622 parsing: /home/hdiwan/nutch-0.7.1/build/plugins/query-more/plugin.xml
060227 150622 impl: point=org.apache.nutch.searcher.QueryFilter class=org.apache.nutch.searcher.more.TypeQueryFilter
060227 150622 impl: point=org.apache.nutch.searcher.QueryFilter class=org.apache.nutch.searcher.more.DateQueryFilter
060227 150622 parsing: /home/hdiwan/nutch-0.7.1/build/plugins/query-site/plugin.xml
060227 150622 impl: point=org.apache.nutch.searcher.QueryFilter class=org.apache.nutch.searcher.site.SiteQueryFilter
060227 150622 parsing: /home/hdiwan/nutch-0.7.1/build/plugins/query-url/plugin.xml
060227 150622 impl: point=org.apache.nutch.searcher.QueryFilter class=org.apache.nutch.searcher.url.URLQueryFilter
060227 150622 not including: /home/hdiwan/nutch-0.7.1/build/plugins/urlfilter-regex
060227 150622 not including: /home/hdiwan/nutch-0.7.1/build/plugins/urlfilter-prefix
060227 150622 not including: /home/hdiwan/nutch-0.7.1/build/plugins/creativecommons
060227 150622 not including: /home/hdiwan/nutch-0.7.1/build/plugins/language-identifier
060227 150622 not including: /home/hdiwan/nutch-0.7.1/build/plugins/clustering-carrot2
060227 150622 not including: /home/hdiwan/nutch-0.7.1/build/plugins/ontology
060227 150622 Using URL normalizer: org.apache.nutch.net.BasicUrlNormalizer
060227 150622 Added 30 pages
060227 150622 Processing pagesByURL: Sorted 30 instructions in 0.0080seconds.
060227 150622 Processing pagesByURL: Sorted 3750.0 instructions/second
060227 150622 Processing pagesByURL: Merged to new DB containing 18 records
in 0.0050 seconds
060227 150622 Processing pagesByURL: Merged 3600.0 records/second
060227 150622 Processing pagesByMD5: Sorted 18 instructions in 0.0040seconds.
060227 150622 Processing pagesByMD5: Sorted 4500.0 instructions/second
060227 150622 Processing pagesByMD5: Merged to new DB containing 18 records
in 0.0010 seconds
060227 150622 Processing pagesByMD5: Merged 18000.0 records/second
060227 150622 Processing linksByMD5: Copied file (4096 bytes) in 0.0050secs.
060227 150622 Processing linksByURL: Copied file (4096 bytes) in 0.0030secs.
060227 150622 Processing /home/hdiwan/nutch/crawl20060227150607/segments/20060227150622/fetchlist.unsorted: Sorted 18 entries in 0.0030 seconds.
060227 150622 Processing /home/hdiwan/nutch/crawl20060227150607/segments/20060227150622/fetchlist.unsorted: Sorted 6000.0 entries/second
060227 150622 Overall processing: Sorted 18 entries in 0.0030 seconds.
060227 150622 Overall processing: Sorted 1.6666666666666666E-4entries/second
060227 150622 FetchListTool completed
060227 150622 logging at INFO
060227 150622 fetching
http://hasan.wits2020.net/~hdiwan/blog/2006/02/27/murder_in_samarkand.html
060227 150622 fetching
http://hasan.wits2020.net/~hdiwan/blog/2006/02/16/book_search_presentation.html
060227 150622 fetching
http://hasan.wits2020.net/~hdiwan/blog/2006/02/20/emailcasting_revisited.html
060227 150622 http.proxy.host = null
060227 150622 http.proxy.port = 8118
060227 150622 http.timeout = 10000
060227 150622 http.content.limit = -1
060227 150622 http.agent = Spectra/200602 (Spectra;
http://hasan.wits2020.net/typo/public; [EMAIL PROTECTED])
060227 150622 http.auth.ntlm.username =
060227 150622 fetcher.server.delay = 1000
060227 150622 http.max.delays = 100
060227 150623 Configured Client
060227 150623 fetching
http://hasan.wits2020.net/~hdiwan/blog/2006/02/26/automating_photographic_workfl.html
060227 150623 fetching
http://hasan.wits2020.net/~hdiwan/blog/2006/02/25/pint_search.html
060227 150623 fetching
http://hasan.wits2020.net/~hdiwan/blog/2006/02/16/spectrasearch_privacy_statemen.html
060227 150623 fetching
http://hasan.wits2020.net/~hdiwan/blog/2006/02/27/directtv_videoondemand.html
060227 150623 fetching
http://hasan.wits2020.net/~hdiwan/blog/2006/02/15/atms_and_googlemaps_tad_buggy.html
060227 150623 fetching
http://hasan.wits2020.net/~hdiwan/blog/2006/02/19/capannina.html
060227 150623 fetching
http://hasan.wits2020.net/~hdiwan/blog/2006/02/15/opera_tries_to_converge_bittor.html
060227 150623 fetching
http://hasan.wits2020.net/~hdiwan/blog/2006/02/25/transamerica.html
060227 150623 fetching
http://hasan.wits2020.net/~hdiwan/blog/2006/02/27/nobody_likes_me.html
060227 150623 fetching
http://hasan.wits2020.net/~hdiwan/blog/2006/02/21/blogging_system_critiques.html
060227 150623 fetching
http://hasan.wits2020.net/~hdiwan/blog/2006/02/25/sorry_haters.html
060227 150623 fetching
http://hasan.wits2020.net/~hdiwan/blog/2006/02/15/valentines_overseas.html
060227 150623 fetching
http://hasan.wits2020.net/~hdiwan/blog/2006/02/18/spectrasearch_update.html
060227 150624 Updating /home/hdiwan/nutch/crawl20060227150607/db
060227 150624 Updating for /home/hdiwan/nutch/crawl20060227150607/segments/20060227150622
060227 150624 Processing document 0
060227 150624 Finishing update
060227 150626 Update finished
060227 150626 Updating /home/hdiwan/nutch/crawl20060227150607/segments from /home/hdiwan/nutch/crawl20060227150607/db
060227 150626  reading /home/hdiwan/nutch/crawl20060227150607/segments/20060227150622
060227 150626  reading /home/hdiwan/nutch/crawl20060227150607/segments/20060227150624
060227 150626 Sorting pages by url...
060227 150626 Getting updated scores and anchors from db...
060227 150626 Sorting updates by segment...
060227 150626 Updating segments...
060227 150626  updating /home/hdiwan/nutch/crawl20060227150607/segments/20060227150622
060227 150626 Done updating /home/hdiwan/nutch/crawl20060227150607/segments from /home/hdiwan/nutch/crawl20060227150607/db
060227 150626 indexing segment: /home/hdiwan/nutch/crawl20060227150607/segments/20060227150622
060227 150626 * Opening segment 20060227150622
060227 150626 * Indexing segment 20060227150622
060227 150626 * Optimizing index...
060227 150626 * Moving index to NFS if needed...
060227 150626 DONE indexing segment 20060227150622: total 18 records in 0.047 s (Infinity rec/s).
060227 150626 done indexing
060227 150626 done indexing
060227 150626 Reading url hashes...
060227 150626 Sorting url hashes...
060227 150626 Deleting url duplicates...
060227 150626 Deleted 0 url duplicates.
060227 150626 Reading content hashes...
060227 150626 Sorting content hashes...
060227 150626 Deleting content duplicates...
060227 150626 Deleted 0 content duplicates.
060227 150626 Duplicate deletion complete locally.  Now returning to NFS...
060227 150626 DeleteDuplicates complete
060227 150626 Merging segment indexes...
060227 150626 crawl finished: /home/hdiwan/nutch/crawl20060227150607

Now, I'm sure there are duplicates in the URL list, yet Nutch doesn't delete
anything. I'm also going to be adding new pages fairly frequently, and the
crawl command does not let you add new URLs without discarding the previous
crawl. How would I go about doing this? Thanks for the help! Please CC
replies to my personal address. Thanks a bunch!
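From the whole-web crawling section of the tutorial, I gather the one-shot
crawl command can be replaced by the individual tools, which would let me
inject new URLs into the existing webdb without starting over. If I
understand correctly, the sequence would look something like the sketch
below -- the command names are from the 0.7 tutorial, but I haven't verified
the exact arguments against my build, so please correct me if they're wrong:

```shell
# Sketch only: commands per the Nutch 0.7 whole-web tutorial, unverified
# against 0.7.1 -- check "bin/nutch" usage output for the exact arguments.
DB=$HOME/SearchTest/crawl/db
SEGMENTS=$HOME/SearchTest/crawl/segments

# Inject only the newly added URLs into the existing webdb
# (new-urls.txt is a hypothetical file of one URL per line)
bin/nutch inject $DB -urlfile $HOME/SearchTest/new-urls.txt

# Generate a fetchlist from the db, fetch the newest segment,
# and fold the fetched pages back into the db
bin/nutch generate $DB $SEGMENTS
segment=`ls -d $SEGMENTS/2* | tail -1`
bin/nutch fetch $segment
bin/nutch updatedb $DB $segment

# Index the new segment; dedup afterwards to drop duplicate pages
bin/nutch index $segment
bin/nutch dedup $SEGMENTS
```

Is that the intended workflow for incremental additions, or is there a
supported way to re-run crawl over an existing directory?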
--
Cheers,
Hasan Diwan <[EMAIL PROTECTED]>
