That brings up a question: does Nutch count the URL in a form action and an image's source URL toward the default limit of 100 links per page, or does it only count <a href> tags? And what about Google, Yahoo, etc.: do they only count <a href> links?
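One way to see how close a page comes to the 100-link limit under each interpretation is to tally the tags directly. The sketch below is plain Python (standard library only), not Nutch code. The tag-to-attribute mapping and the names `URL_ATTRS`, `LinkTally`, and `tally_links` are assumptions made for illustration, and the sketch does not claim to reproduce what Nutch's HTML parser actually extracts.

```python
from collections import Counter
from html.parser import HTMLParser

# Which attribute carries a URL for each tag we want to tally.
# Illustrative mapping only; not taken from Nutch's extraction rules.
URL_ATTRS = {"a": "href", "form": "action", "img": "src"}


class LinkTally(HTMLParser):
    """Counts URL-bearing attributes per tag while parsing HTML."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        wanted = URL_ATTRS.get(tag)
        if wanted is None:
            return
        for name, value in attrs:
            # Skip tags without the attribute (e.g. <a name="top">)
            # and attributes with an empty or missing value.
            if name == wanted and value:
                self.counts[tag] += 1


def tally_links(html):
    """Return a dict mapping tag name to the number of URLs found."""
    parser = LinkTally()
    parser.feed(html)
    parser.close()
    return dict(parser.counts)
```

Running `tally_links` on a saved copy of the page gives separate counts for `a`, `form`, and `img`, which shows whether the page would exceed 100 only if forms and images were counted as well.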
http://tonalweb.com

----- Original Message -----
From: "Stefan Groschupf" <[EMAIL PROTECTED]>
To: <[email protected]>
Sent: Friday, July 07, 2006 1:59 AM
Subject: Re: why i can't crawl all the linked pages in the specified page to crawl.

> Hi,
> maybe you can try a much higher depth, something like 20?
> However, in general check:
> + the regex URL filter file
> + the robots.txt
> + nofollow tags in the pages
> + the number of outlinks to extract in nutch-default.xml
>
> Stefan
>
> On 06.07.2006, at 19:12, kevin pang wrote:
>
> > I set up Nutch to crawl the URL: http://www.haha365.com/gd_joke/
> > but after the crawl completed, only 54 pages were fetched.
> >
> > Here is the log info:
> >
> > 060705 154332 parsing file:/C:/cygwin/nutch-0.7.2/conf/nutch- > > default.xml > > 060705 154332 parsing file:/C:/cygwin/nutch-0.7.2/conf/crawl-tool.xml > > 060705 154332 parsing file:/C:/cygwin/nutch-0.7.2/conf/nutch-site.xml > > 060705 154332 No FS indicated, using default:local > > 060705 154332 crawl started in: crawled2 > > 060705 154332 rootUrlFile = url.txt > > 060705 154332 threads = 4 > > 060705 154332 depth = 3 > > 060705 154333 Created webdb at LocalFS,C:\cygwin\nutch-0.7.2\bin > > \crawled2\db > > 060705 154333 Starting URL processing > > 060705 154333 Plugins: looking in: C:\cygwin\nutch-0.7.2\plugins > > 060705 154333 parsing: C:\cygwin\nutch- > > 0.7.2\plugins\urlfilter-regex\plugin.xml > > 060705 154333 impl: point=org.apache.nutch.net.URLFilter class= > > org.apache.nutch.net.RegexURLFilter > > 060705 154333 not including: C:\cygwin\nutch-0.7.2\plugins > > \urlfilter-prefix > > 060705 154333 parsing: C:\cygwin\nutch-0.7.2\plugins\query-url > > \plugin.xml > > 060705 154333 impl: point=org.apache.nutch.searcher.QueryFilter class= > > org.apache.nutch.searcher.url.URLQueryFilter > > 060705 154333 parsing: C:\cygwin\nutch-0.7.2\plugins\query-site > > \plugin.xml > > 060705 154333 impl: point=org.apache.nutch.searcher.QueryFilter class= > > 
org.apache.nutch.searcher.site.SiteQueryFilter > > 060705 154333 not including: C:\cygwin\nutch-0.7.2\plugins\query-more > > 060705 154333 parsing: C:\cygwin\nutch-0.7.2\plugins\query-basic > > \plugin.xml > > 060705 154333 impl: point=org.apache.nutch.searcher.QueryFilter class= > > org.apache.nutch.searcher.basic.BasicQueryFilter > > 060705 154333 not including: C:\cygwin\nutch- > > 0.7.2\plugins\protocol-httpclient > > 060705 154333 parsing: C:\cygwin\nutch- > > 0.7.2\plugins\protocol-http\plugin.xml > > 060705 154333 impl: point=org.apache.nutch.protocol.Protocol class= > > org.apache.nutch.protocol.http.Http > > 060705 154333 not including: C:\cygwin\nutch-0.7.2\plugins\protocol- > > ftp > > 060705 154333 not including: C:\cygwin\nutch-0.7.2\plugins\protocol- > > file > > 060705 154333 parsing: C:\cygwin\nutch-0.7.2\plugins\parse-text > > \plugin.xml > > 060705 154333 impl: point=org.apache.nutch.parse.Parser class= > > org.apache.nutch.parse.text.TextParser > > 060705 154333 not including: C:\cygwin\nutch-0.7.2\plugins\parse-rss > > 060705 154333 not including: C:\cygwin\nutch-0.7.2\plugins\parse-pdf > > 060705 154333 not including: C:\cygwin\nutch-0.7.2\plugins\parse- > > msword > > 060705 154333 not including: C:\cygwin\nutch-0.7.2\plugins\parse-js > > 060705 154333 parsing: C:\cygwin\nutch-0.7.2\plugins\parse-html > > \plugin.xml > > 060705 154333 impl: point=org.apache.nutch.parse.Parser class= > > org.apache.nutch.parse.html.HtmlParser > > 060705 154333 not including: C:\cygwin\nutch-0.7.2\plugins\parse-ext > > 060705 154333 not including: C:\cygwin\nutch-0.7.2\plugins\ontology > > 060705 154333 parsing: C:\cygwin\nutch- > > 0.7.2\plugins\nutch-extensionpoints\plugin.xml > > 060705 154333 not including: C:\cygwin\nutch- > > 0.7.2\plugins\language-identifier > > 060705 154333 not including: C:\cygwin\nutch-0.7.2\plugins\index-more > > 060705 154333 parsing: C:\cygwin\nutch-0.7.2\plugins\index-basic > > \plugin.xml > > 060705 154333 impl: 
point=org.apache.nutch.indexer.IndexingFilter > > class= > > org.apache.nutch.indexer.basic.BasicIndexingFilter > > 060705 154333 not including: C:\cygwin\nutch-0.7.2\plugins > > \creativecommons > > 060705 154333 not including: C:\cygwin\nutch- > > 0.7.2\plugins\clustering-carrot2 > > 060705 154333 found resource crawl-urlfilter.txt at file:/C:/cygwin/ > > nutch- > > 0.7.2/conf/crawl-urlfilter.txt > > 060705 154333 Using URL normalizer: > > org.apache.nutch.net.BasicUrlNormalizer > > 060705 154333 Added 1 pages > > 060705 154333 Processing pagesByURL: Sorted 1 instructions in 0.016 > > seconds. > > 060705 154333 Processing pagesByURL: Sorted 62.5 instructions/second > > 060705 154333 Processing pagesByURL: Merged to new DB containing 1 > > records > > in 0.0 seconds > > 060705 154333 Processing pagesByURL: Merged Infinity records/second > > 060705 154333 Processing pagesByMD5: Sorted 1 instructions in 0.0 > > seconds. > > 060705 154333 Processing pagesByMD5: Sorted Infinity instructions/ > > second > > 060705 154333 Processing pagesByMD5: Merged to new DB containing 1 > > records > > in 0.0 seconds > > 060705 154333 Processing pagesByMD5: Merged Infinity records/second > > 060705 154333 Processing linksByMD5: Copied file (0 bytes) in 0.016 > > secs. > > 060705 154333 Processing linksByURL: Copied file (0 bytes) in 0.015 > > secs. > > 060705 154333 FetchListTool started > > 060705 154333 Processing pagesByURL: Sorted 1 instructions in 0.0 > > seconds. > > 060705 154333 Processing pagesByURL: Sorted Infinity instructions/ > > second > > 060705 154333 Processing pagesByURL: Merged to new DB containing 1 > > records > > in 0.0 seconds > > 060705 154333 Processing pagesByURL: Merged Infinity records/second > > 060705 154333 Processing pagesByMD5: Sorted 1 instructions in 0.0 > > seconds. 
> > 060705 154333 Processing pagesByMD5: Sorted Infinity instructions/ > > second > > 060705 154334 Processing pagesByMD5: Merged to new DB containing 1 > > records > > in 0.0 seconds > > 060705 154334 Processing pagesByMD5: Merged Infinity records/second > > 060705 154334 Processing linksByMD5: Copied file (0 bytes) in 0.031 > > secs. > > 060705 154334 Processing linksByURL: Copied file (0 bytes) in 0.015 > > secs. > > 060705 154334 Processing C:\cygwin\nutch- > > 0.7.2\bin\crawled2\segments\20060705154333\fetchlist.unsorted: > > Sorted 1 > > entries in 0.015 seconds. > > 060705 154334 Processing C:\cygwin\nutch- > > 0.7.2\bin\crawled2\segments\20060705154333\fetchlist.unsorted: Sorted > > 66.66666666666667 entries/second > > 060705 154334 Overall processing: Sorted 1 entries in 0.015 seconds. > > 060705 154334 Overall processing: Sorted 0.015 entries/second > > 060705 154334 FetchListTool completed > > 060705 154334 logging at INFO > > 060705 154334 fetching http://www.haha365.com/gd_joke/index_3.htm > > 060705 154334 http.proxy.host = null > > 060705 154334 http.proxy.port = 8080 > > 060705 154334 http.timeout = 10000 > > 060705 154334 http.content.limit = 65536 > > 060705 154334 http.agent = NutchCVS/0.7.2 (Nutch; > > http://lucene.apache.org/nutch/bot.html; nutch- > > [EMAIL PROTECTED]) > > 060705 154334 fetcher.server.delay = 1000 > > 060705 154334 http.max.delays = 100 > > 060705 154336 status: segment 20060705154333, 1 pages, 0 errors, 19172 > > bytes, 2000 ms > > 060705 154336 status: 0.5 pages/s, 74.890625 kb/s, 19172.0 bytes/page > > 060705 154337 Updating C:\cygwin\nutch-0.7.2\bin\crawled2\db > > 060705 154337 Updating for C:\cygwin\nutch- > > 0.7.2\bin\crawled2\segments\20060705154333 > > 060705 154337 Processing document 0 > > 060705 154337 Finishing update > > 060705 154337 Processing pagesByURL: Sorted 27 instructions in > > 0.015seconds. 
> > 060705 154337 Processing pagesByURL: Sorted 1800.0 instructions/second > > 060705 154337 Processing pagesByURL: Merged to new DB containing 27 > > records > > in 0.0 seconds > > 060705 154337 Processing pagesByURL: Merged Infinity records/second > > 060705 154337 Processing pagesByMD5: Sorted 28 instructions in > > 0.015seconds. > > 060705 154337 Processing pagesByMD5: Sorted > > 1866.6666666666667instructions/second > > 060705 154337 Processing pagesByMD5: Merged to new DB containing 27 > > records > > in 0.016 seconds > > 060705 154337 Processing pagesByMD5: Merged 1687.5 records/second > > 060705 154337 Processing linksByMD5: Sorted 27 instructions in > > 0.015seconds. > > 060705 154337 Processing linksByMD5: Sorted 1800.0 instructions/second > > 060705 154337 Processing linksByMD5: Merged to new DB containing 26 > > records > > in 0.0 seconds > > 060705 154337 Processing linksByMD5: Merged Infinity records/second > > 060705 154337 Processing linksByURL: Sorted 26 instructions in > > 0.015seconds. > > 060705 154337 Processing linksByURL: Sorted > > 1733.3333333333335instructions/second > > 060705 154337 Processing linksByURL: Merged to new DB containing 26 > > records > > in 0.0 seconds > > 060705 154337 Processing linksByURL: Merged Infinity records/second > > 060705 154337 Processing linksByMD5: Sorted 26 instructions in > > 0.031seconds. > > 060705 154337 Processing linksByMD5: Sorted > > 838.7096774193549instructions/second > > 060705 154337 Processing linksByMD5: Merged to new DB containing 26 > > records > > in 0.0 seconds > > 060705 154337 Processing linksByMD5: Merged Infinity records/second > > 060705 154337 Update finished > > 060705 154337 FetchListTool started > > 060705 154338 Processing pagesByURL: Sorted 26 instructions in > > 0.016seconds. 
> > 060705 154338 Processing pagesByURL: Sorted 1625.0 instructions/second > > 060705 154338 Processing pagesByURL: Merged to new DB containing 27 > > records > > in 0.0 seconds > > 060705 154338 Processing pagesByURL: Merged Infinity records/second > > 060705 154338 Processing pagesByMD5: Sorted 26 instructions in 0.0 > > seconds. > > 060705 154338 Processing pagesByMD5: Sorted Infinity instructions/ > > second > > 060705 154338 Processing pagesByMD5: Merged to new DB containing 27 > > records > > in 0.015 seconds > > 060705 154338 Processing pagesByMD5: Merged 1800.0 records/second > > 060705 154338 Processing linksByMD5: Copied file (0 bytes) in 0.016 > > secs. > > 060705 154338 Processing linksByURL: Copied file (0 bytes) in 0.0 > > secs. > > 060705 154338 Processing C:\cygwin\nutch- > > 0.7.2\bin\crawled2\segments\20060705154337\fetchlist.unsorted: > > Sorted 26 > > entries in 0.0 seconds. > > 060705 154338 Processing C:\cygwin\nutch- > > 0.7.2\bin\crawled2\segments\20060705154337\fetchlist.unsorted: Sorted > > Infinity entries/second > > 060705 154338 Overall processing: Sorted 26 entries in 0.0 seconds. 
> > 060705 154338 Overall processing: Sorted 0.0 entries/second > > 060705 154338 FetchListTool completed > > 060705 154338 logging at INFO > > 060705 154338 fetching http://www.haha365.com/gd_joke/ > > 20050319084431.htm > > 060705 154338 fetching http://www.haha365.com/gd_joke/ > > 20050319084733.htm > > 060705 154338 fetching http://www.haha365.com/gd_joke/ > > 20050319085110.htm > > 060705 154338 fetching http://www.haha365.com/gd_joke/ > > 20050319084338.htm > > 060705 154339 fetching http://www.haha365.com/gd_joke/ > > 20050319085226.htm > > 060705 154340 fetching http://www.haha365.com/gd_joke/ > > 20050318163740.htm > > 060705 154341 fetching http://www.haha365.com/gd_joke/ > > 20050319085344.htm > > 060705 154343 fetching http://www.haha365.com/gd_joke/ > > 20050318163709.htm > > 060705 154345 fetching http://www.haha365.com/gd_joke/ > > 20050319085310.htm > > 060705 154347 fetching http://www.haha365.com/gd_joke/ > > 20050319085028.htm > > 060705 154349 fetching http://www.haha365.com/gd_joke/ > > 20050319084052.htm > > 060705 154350 fetching http://www.haha365.com/gd_joke/index.htm > > 060705 154352 fetching http://www.haha365.com/gd_joke/ > > 20050319084902.htm > > 060705 154353 fetching http://www.haha365.com/gd_joke/ > > 20050319084945.htm > > 060705 154355 fetching http://www.haha365.com/gd_joke/ > > 20050319084129.htm > > 060705 154356 fetching http://www.haha365.com/gd_joke/ > > 20050319084202.htm > > 060705 154358 fetching http://www.haha365.com/gd_joke/ > > 20050318163642.htm > > 060705 154359 fetching http://www.haha365.com/gd_joke/ > > 20050319084304.htm > > 060705 154400 fetching http://www.haha365.com/gd_joke/ > > 20050319084822.htm > > 060705 154402 fetching http://www.haha365.com/gd_joke/ > > 20050319085142.htm > > 060705 154403 fetching http://www.haha365.com/gd_joke/ > > 20050319084232.htm > > 060705 154408 fetching http://www.haha365.com/gd_joke/ > > 20050318163829.htm > > 060705 154411 fetching http://www.haha365.com/gd_joke/ > > 
20050318163920.htm > > 060705 154415 fetching http://www.haha365.com/gd_joke/ > > 20050319084559.htm > > 060705 154419 fetching http://www.haha365.com/gd_joke/ > > 060705 154423 fetching http://www.haha365.com/gd_joke/ > > 20050318163807.htm > > 060705 154440 status: segment 20060705154337, 26 pages, 0 errors, > > 323050 > > bytes, 62047 ms > > 060705 154440 status: 0.41903716 pages/s, 40.67607 kb/s, 12425.0 > > bytes/page > > 060705 154441 Updating C:\cygwin\nutch-0.7.2\bin\crawled2\db > > 060705 154441 Updating for C:\cygwin\nutch- > > 0.7.2\bin\crawled2\segments\20060705154337 > > 060705 154441 Processing document 0 > > 060705 154441 Finishing update > > 060705 154441 Processing pagesByURL: Sorted 174 instructions in > > 0.016seconds. > > 060705 154441 Processing pagesByURL: Sorted 10875.0 instructions/ > > second > > 060705 154441 Processing pagesByURL: Merged to new DB containing 53 > > records > > in 0.0 seconds > > 060705 154441 Processing pagesByURL: Merged Infinity records/second > > 060705 154441 Processing pagesByMD5: Sorted 78 instructions in > > 0.015seconds. > > 060705 154441 Processing pagesByMD5: Sorted 5200.0 instructions/second > > 060705 154441 Processing pagesByMD5: Merged to new DB containing 53 > > records > > in 0.0 seconds > > 060705 154441 Processing pagesByMD5: Merged Infinity records/second > > 060705 154441 Processing linksByMD5: Sorted 174 instructions in > > 0.016seconds. > > 060705 154441 Processing linksByMD5: Sorted 10875.0 instructions/ > > second > > 060705 154441 Processing linksByMD5: Merged to new DB containing > > 148 records > > in 0.015 seconds > > 060705 154441 Processing linksByMD5: Merged 9866.666666666668 > > records/second > > 060705 154441 Processing linksByURL: Sorted 122 instructions in 0.0 > > seconds. 
> > 060705 154441 Processing linksByURL: Sorted Infinity instructions/ > > second > > 060705 154441 Processing linksByURL: Merged to new DB containing > > 148 records > > in 0.015 seconds > > 060705 154441 Processing linksByURL: Merged 9866.666666666668 > > records/second > > 060705 154441 Processing linksByMD5: Sorted 148 instructions in 0.0 > > seconds. > > 060705 154441 Processing linksByMD5: Sorted Infinity instructions/ > > second > > 060705 154441 Processing linksByMD5: Merged to new DB containing > > 148 records > > in 0.016 seconds > > 060705 154441 Processing linksByMD5: Merged 9250.0 records/second > > 060705 154442 Update finished > > 060705 154442 FetchListTool started > > 060705 154442 Processing pagesByURL: Sorted 26 instructions in > > 0.016seconds. > > 060705 154442 Processing pagesByURL: Sorted 1625.0 instructions/second > > 060705 154442 Processing pagesByURL: Merged to new DB containing 53 > > records > > in 0.015 seconds > > 060705 154442 Processing pagesByURL: Merged > > 3533.3333333333335records/second > > 060705 154442 Processing pagesByMD5: Sorted 26 instructions in 0.0 > > seconds. > > 060705 154442 Processing pagesByMD5: Sorted Infinity instructions/ > > second > > 060705 154442 Processing pagesByMD5: Merged to new DB containing 53 > > records > > in 0.0 seconds > > 060705 154442 Processing pagesByMD5: Merged Infinity records/second > > 060705 154442 Processing linksByMD5: Copied file (0 bytes) in 0.016 > > secs. > > 060705 154442 Processing linksByURL: Copied file (0 bytes) in 0.0 > > secs. > > 060705 154442 Processing C:\cygwin\nutch- > > 0.7.2\bin\crawled2\segments\20060705154442\fetchlist.unsorted: > > Sorted 26 > > entries in 0.093 seconds. > > 060705 154442 Processing C:\cygwin\nutch- > > 0.7.2\bin\crawled2\segments\20060705154442\fetchlist.unsorted: Sorted > > 279.5698924731183 entries/second > > 060705 154442 Overall processing: Sorted 26 entries in 0.093 seconds. 
> > 060705 154442 Overall processing: Sorted 0.003576923076923077 > > entries/second > > 060705 154443 FetchListTool completed > > 060705 154443 logging at INFO > > 060705 154443 fetching http://www.haha365.com/gd_joke/ > > 20050815111532.htm > > 060705 154443 fetching http://www.haha365.com/gd_joke/ > > 20050815105800.htm > > 060705 154443 fetching http://www.haha365.com/gd_joke/ > > 20050319085605.htm > > 060705 154443 fetching http://www.haha365.com/gd_joke/ > > 20050815110121.htm > > 060705 154446 fetching http://www.haha365.com/gd_joke/ > > 20060625064748.htm > > 060705 154448 fetching http://www.haha365.com/gd_joke/ > > 20050815105937.htm > > 060705 154449 fetching http://www.haha365.com/gd_joke/ > > 20050815110925.htm > > 060705 154450 fetching http://www.haha365.com/gd_joke/ > > 20050815111651.htm > > 060705 154452 fetching http://www.haha365.com/gd_joke/ > > 20050706110014.htm > > 060705 154453 fetching http://www.haha365.com/gd_joke/ > > 20050318163615.htm > > 060705 154454 fetching http://www.haha365.com/gd_joke/ > > 20050815111228.htm > > 060705 154456 fetching http://www.haha365.com/gd_joke/ > > 20050706105833.htm > > 060705 154457 fetching http://www.haha365.com/gd_joke/ > > 20050815110411.htm > > 060705 154459 fetching http://www.haha365.com/gd_joke/ > > 20050815105527.htm > > 060705 154500 fetching http://www.haha365.com/gd_joke/ > > 20050815111758.htm > > 060705 154502 fetching http://www.haha365.com/gd_joke/ > > 20050706110230.htm > > 060705 154503 fetching http://www.haha365.com/gd_joke/ > > 20050706105453.htm > > 060705 154504 fetching http://www.haha365.com/gd_joke/ > > 20050706110522.htm > > 060705 154506 fetching http://www.haha365.com/gd_joke/ > > 20050706105104.htm > > 060705 154507 fetching http://www.haha365.com/gd_joke/ > > 20050709144044.htm > > 060705 154509 fetching http://www.haha365.com/gd_joke/ > > 20060611112617.htm > > 060705 154510 fetching http://www.haha365.com/gd_joke/ > > 20050815105330.htm > > 060705 154511 fetching 
http://www.haha365.com/gd_joke/ > > 20050709144708.htm > > 060705 154513 fetching http://www.haha365.com/gd_joke/ > > 20050706105324.htm > > 060705 154514 fetching http://www.haha365.com/gd_joke/ > > 20050815110707.htm > > 060705 154516 fetching http://www.haha365.com/gd_joke/ > > 20050706105218.htm > > 060705 154523 status: segment 20060705154442, 26 pages, 0 errors, > > 314308 > > bytes, 40063 ms > > 060705 154523 status: 0.6489779 pages/s, 61.291748 kb/s, 12088.77 > > bytes/page > > 060705 154524 Updating C:\cygwin\nutch-0.7.2\bin\crawled2\db > > 060705 154524 Updating for C:\cygwin\nutch- > > 0.7.2\bin\crawled2\segments\20060705154442 > > 060705 154524 Processing document 0 > > 060705 154524 Finishing update > > 060705 154524 Processing pagesByURL: Sorted 127 instructions in 0.0 > > seconds. > > 060705 154524 Processing pagesByURL: Sorted Infinity instructions/ > > second > > 060705 154524 Processing pagesByURL: Merged to new DB containing 56 > > records > > in 0.0 seconds > > 060705 154524 Processing pagesByURL: Merged Infinity records/second > > 060705 154524 Processing pagesByMD5: Sorted 55 instructions in > > 0.016seconds. > > 060705 154524 Processing pagesByMD5: Sorted 3437.5 instructions/second > > 060705 154524 Processing pagesByMD5: Merged to new DB containing 56 > > records > > in 0.015 seconds > > 060705 154524 Processing pagesByMD5: Merged > > 3733.3333333333335records/second > > 060705 154524 Processing linksByMD5: Sorted 127 instructions in > > 0.016seconds. > > 060705 154524 Processing linksByMD5: Sorted 7937.5 instructions/second > > 060705 154524 Processing linksByMD5: Merged to new DB containing > > 249 records > > in 0.0 seconds > > 060705 154524 Processing linksByMD5: Merged Infinity records/second > > 060705 154524 Processing linksByURL: Sorted 101 instructions in 0.0 > > seconds. 
> > 060705 154524 Processing linksByURL: Sorted Infinity instructions/ > > second > > 060705 154524 Processing linksByURL: Merged to new DB containing > > 249 records > > in 0.016 seconds > > 060705 154524 Processing linksByURL: Merged 15562.5 records/second > > 060705 154524 Processing linksByMD5: Sorted 127 instructions in > > 0.015seconds. > > 060705 154524 Processing linksByMD5: Sorted > > 8466.666666666668instructions/second > > 060705 154524 Processing linksByMD5: Merged to new DB containing > > 249 records > > in 0.0 seconds > > 060705 154524 Processing linksByMD5: Merged Infinity records/second > > 060705 154524 Update finished > > 060705 154524 Updating C:\cygwin\nutch-0.7.2\bin\crawled2\segments > > from > > C:\cygwin\nutch-0.7.2\bin\crawled2\db > > 060705 154524 reading C:\cygwin\nutch- > > 0.7.2\bin\crawled2\segments\20060705154333 > > 060705 154524 reading C:\cygwin\nutch- > > 0.7.2\bin\crawled2\segments\20060705154337 > > 060705 154524 reading C:\cygwin\nutch- > > 0.7.2\bin\crawled2\segments\20060705154442 > > 060705 154524 Sorting pages by url... > > 060705 154524 Getting updated scores and anchors from db... > > 060705 154524 Sorting updates by segment... > > 060705 154524 Updating segments... > > 060705 154524 updating C:\cygwin\nutch- > > 0.7.2\bin\crawled2\segments\20060705154333 > > 060705 154525 updating C:\cygwin\nutch- > > 0.7.2\bin\crawled2\segments\20060705154337 > > 060705 154525 updating C:\cygwin\nutch- > > 0.7.2\bin\crawled2\segments\20060705154442 > > 060705 154525 Done updating C:\cygwin\nutch-0.7.2\bin\crawled2 > > \segments from > > C:\cygwin\nutch-0.7.2\bin\crawled2\db > > 060705 154525 indexing segment: C:\cygwin\nutch- > > 0.7.2\bin\crawled2\segments\20060705154333 > > 060705 154525 * Opening segment 20060705154333 > > 060705 154525 * Indexing segment 20060705154333 > > 060705 154525 found resource common-terms.utf8 at file:/C:/cygwin/ > > nutch- > > 0.7.2/conf/common-terms.utf8 > > 060705 154525 * Optimizing index... 
> > 060705 154525 * Moving index to NFS if needed... > > 060705 154525 DONE indexing segment 20060705154333: total 1 records in > > 0.187s (Infinity rec/s). > > 060705 154525 done indexing > > 060705 154525 indexing segment: C:\cygwin\nutch- > > 0.7.2\bin\crawled2\segments\20060705154337 > > 060705 154525 * Opening segment 20060705154337 > > 060705 154525 * Indexing segment 20060705154337 > > 060705 154525 * Optimizing index... > > 060705 154525 * Moving index to NFS if needed... > > 060705 154525 DONE indexing segment 20060705154337: total 26 > > records in > > 0.391 s (Infinity rec/s). > > 060705 154525 done indexing > > 060705 154525 indexing segment: C:\cygwin\nutch- > > 0.7.2\bin\crawled2\segments\20060705154442 > > 060705 154525 * Opening segment 20060705154442 > > 060705 154525 * Indexing segment 20060705154442 > > 060705 154525 * Optimizing index... > > 060705 154525 * Moving index to NFS if needed... > > 060705 154525 DONE indexing segment 20060705154442: total 26 > > records in > > 0.219 s (Infinity rec/s). > > 060705 154525 done indexing > > 060705 154526 Reading url hashes... > > 060705 154526 Sorting url hashes... > > 060705 154526 Deleting url duplicates... > > 060705 154526 Deleted 0 url duplicates. > > 060705 154526 Reading content hashes... > > 060705 154526 Sorting content hashes... > > 060705 154526 Deleting content duplicates... > > 060705 154526 Deleted 1 content duplicates. > > 060705 154526 Duplicate deletion complete locally. Now returning > > to NFS... > > 060705 154526 DeleteDuplicates complete > > 060705 154526 Merging segment indexes... > > 060705 154526 crawl finished: crawled2 > > ______________________________________ Tonal web design and hosting http://tonalweb.com eCommerce development & marketing
_______________________________________________ Nutch-general mailing list [email protected] https://lists.sourceforge.net/lists/listinfo/nutch-general
