Hello All,

I was running an intranet crawl and it seems it did not finish properly. It is a pretty default setup, but the crawl depth was 15, and I had turned on queries by commenting out the rule under this comment:

# skip URLs containing certain characters as probable queries, etc.
[EMAIL PROTECTED]

Other than a bunch of fetch messages and a bunch of messages about exceeding max.delays, I am seeing the following. The crawl starts normally...

050228 064335 status: segment 20050228044354, 6300 pages, 91 errors, 140194211 bytes, 7163124 ms
050228 064335 status: 0.8795045 pages/s, 152.90356 kb/s, 22253.049 bytes/page
.......
050228 064551 status: segment 20050228044354, 6400 pages, 97 errors, 142348797 bytes, 7298549 ms
050228 064551 status: 0.87688667 pages/s, 152.37276 kb/s, 22242.0 bytes/page
.....
050228 064759 status: segment 20050228044354, 6500 pages, 102 errors, 144522915 bytes, 7427113 ms

The result of all this was a nutch-searcher.dir that looked like this:

du -h nutch-searcher.dir/
5.3M  nutch-searcher.dir/db/webdb/pagesByURL
3.4M  nutch-searcher.dir/db/webdb/pagesByMD5
14M   nutch-searcher.dir/db/webdb/linksByMD5
14M   nutch-searcher.dir/db/webdb/linksByURL
36M   nutch-searcher.dir/db/webdb
36M   nutch-searcher.dir/db
12K   nutch-searcher.dir/segments/20050228020140/fetchlist
12K   nutch-searcher.dir/segments/20050228020140/fetcher
20K   nutch-searcher.dir/segments/20050228020140/content
12K   nutch-searcher.dir/segments/20050228020140/parse_text
16K   nutch-searcher.dir/segments/20050228020140/parse_data
76K   nutch-searcher.dir/segments/20050228020140
16K   nutch-searcher.dir/segments/20050228020146/fetchlist
16K   nutch-searcher.dir/segments/20050228020146/fetcher
316K  nutch-searcher.dir/segments/20050228020146/content
52K   nutch-searcher.dir/segments/20050228020146/parse_text
144K  nutch-searcher.dir/segments/20050228020146/parse_data
548K  nutch-searcher.dir/segments/20050228020146
56K   nutch-searcher.dir/segments/20050228020257/fetchlist
68K   nutch-searcher.dir/segments/20050228020257/fetcher
2.2M  nutch-searcher.dir/segments/20050228020257/content
260K  nutch-searcher.dir/segments/20050228020257/parse_text
912K  nutch-searcher.dir/segments/20050228020257/parse_data
3.5M  nutch-searcher.dir/segments/20050228020257
232K  nutch-searcher.dir/segments/20050228020931/fetchlist
276K  nutch-searcher.dir/segments/20050228020931/fetcher
9.4M  nutch-searcher.dir/segments/20050228020931/content
1.1M  nutch-searcher.dir/segments/20050228020931/parse_text
4.1M  nutch-searcher.dir/segments/20050228020931/parse_data
15M   nutch-searcher.dir/segments/20050228020931
900K  nutch-searcher.dir/segments/20050228024012/fetchlist
1.1M  nutch-searcher.dir/segments/20050228024012/fetcher
37M   nutch-searcher.dir/segments/20050228024012/content
3.9M  nutch-searcher.dir/segments/20050228024012/parse_text
16M   nutch-searcher.dir/segments/20050228024012/parse_data
58M   nutch-searcher.dir/segments/20050228024012
3.2M  nutch-searcher.dir/segments/20050228044354/fetchlist
1.1M  nutch-searcher.dir/segments/20050228044354/fetcher
39M   nutch-searcher.dir/segments/20050228044354/content
3.6M  nutch-searcher.dir/segments/20050228044354/parse_text
16M   nutch-searcher.dir/segments/20050228044354/parse_data
62M   nutch-searcher.dir/segments/20050228044354
139M  nutch-searcher.dir/segments
175M  nutch-searcher.dir

The crawl ran for about 2 hours and 43 minutes. When I search, it looks at the right searcher.dir, but it's not returning anything for me:

050228 085819 10 query request from 64.171.1.207
050228 085819 10 query: bhangra
050228 085819 10 searching for 20 raw hits
050228 085819 10 total hits: 0

What am I doing wrong? TIA for the help.

Regards,
Paul

_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
