Hi Paul,

the most likely problem seems to me to be the depth of 15. If your first page and every subsequent one had 10 links, your crawler would have to fetch roughly 24414062500 gigabytes from the Internet.
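To show how quickly this grows, here is a rough Python sketch of the arithmetic. It is only an illustration under stated assumptions: the function names are made up for this example, every page is assumed to carry 10 links to pages not seen before (the worst case described above), depth is treated as the number of link hops from the start page, and the average page size of about 22253 bytes is rounded from the "bytes/page" value in the status lines quoted below.

```python
def total_pages(links_per_page: int, depth: int) -> int:
    """Pages reachable within `depth` link hops, assuming every page
    contributes `links_per_page` links to pages not seen before."""
    return sum(links_per_page ** hop for hop in range(depth + 1))


def estimate_gigabytes(links_per_page: int, depth: int, avg_page_bytes: float) -> float:
    """Rough download volume in gigabytes (10**9 bytes)."""
    return total_pages(links_per_page, depth) * avg_page_bytes / 1e9


if __name__ == "__main__":
    AVG_PAGE_BYTES = 22253  # rounded from the "bytes/page" status line (assumption)
    LINKS_PER_PAGE = 10     # worst-case assumption, not a measured value
    for depth in (3, 5, 10, 15):
        pages = total_pages(LINKS_PER_PAGE, depth)
        gigabytes = estimate_gigabytes(LINKS_PER_PAGE, depth, AVG_PAGE_BYTES)
        print(f"depth {depth:2d}: {pages:>20,d} pages, about {gigabytes:,.1f} GB")
```

With these assumptions a depth of 15 comes out on the order of 10^10 gigabytes, the same ballpark as the figure above. A real intranet is finite and its links repeat, so an actual crawl fetches far less, but the sketch shows why a small depth is the safer starting point.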
Depending on your data, start with a much smaller depth.

Kind regards,
Olaf

On Mon, 28 Feb 2005 08:20:22 -0800, sub paul <[EMAIL PROTECTED]> wrote:
> Hello All,
>
> I was running an intranet crawl and it seems like it did not finish properly.
> It is a pretty default setup, but crawl's depth was 15, and I had
> turned on queries by commenting out
> # skip URLs containing certain characters as probable queries, etc.
> [EMAIL PROTECTED]
>
> Other than a bunch of fetch messages and a bunch of "Exceeding max.delays"
> messages, I am seeing the following:
>
> crawl starts normally...
> 050228 064335 status: segment 20050228044354, 6300 pages, 91 errors, 140194211 bytes, 7163124 ms
> 050228 064335 status: 0.8795045 pages/s, 152.90356 kb/s, 22253.049 bytes/page
> .......
> 050228 064551 status: segment 20050228044354, 6400 pages, 97 errors, 142348797 bytes, 7298549 ms
> 050228 064551 status: 0.87688667 pages/s, 152.37276 kb/s, 22242.0 bytes/page
> .....
> 050228 064759 status: segment 20050228044354, 6500 pages, 102 errors, 144522915 bytes, 7427113 ms
>
> The result of all this was a nutch-searcher.dir that looked like this:
> du -h nutch-searcher.dir/
> 5.3M nutch-searcher.dir/db/webdb/pagesByURL
> 3.4M nutch-searcher.dir/db/webdb/pagesByMD5
> 14M nutch-searcher.dir/db/webdb/linksByMD5
> 14M nutch-searcher.dir/db/webdb/linksByURL
> 36M nutch-searcher.dir/db/webdb
> 36M nutch-searcher.dir/db
> 12K nutch-searcher.dir/segments/20050228020140/fetchlist
> 12K nutch-searcher.dir/segments/20050228020140/fetcher
> 20K nutch-searcher.dir/segments/20050228020140/content
> 12K nutch-searcher.dir/segments/20050228020140/parse_text
> 16K nutch-searcher.dir/segments/20050228020140/parse_data
> 76K nutch-searcher.dir/segments/20050228020140
> 16K nutch-searcher.dir/segments/20050228020146/fetchlist
> 16K nutch-searcher.dir/segments/20050228020146/fetcher
> 316K nutch-searcher.dir/segments/20050228020146/content
> 52K nutch-searcher.dir/segments/20050228020146/parse_text
> 144K nutch-searcher.dir/segments/20050228020146/parse_data
> 548K nutch-searcher.dir/segments/20050228020146
> 56K nutch-searcher.dir/segments/20050228020257/fetchlist
> 68K nutch-searcher.dir/segments/20050228020257/fetcher
> 2.2M nutch-searcher.dir/segments/20050228020257/content
> 260K nutch-searcher.dir/segments/20050228020257/parse_text
> 912K nutch-searcher.dir/segments/20050228020257/parse_data
> 3.5M nutch-searcher.dir/segments/20050228020257
> 232K nutch-searcher.dir/segments/20050228020931/fetchlist
> 276K nutch-searcher.dir/segments/20050228020931/fetcher
> 9.4M nutch-searcher.dir/segments/20050228020931/content
> 1.1M nutch-searcher.dir/segments/20050228020931/parse_text
> 4.1M nutch-searcher.dir/segments/20050228020931/parse_data
> 15M nutch-searcher.dir/segments/20050228020931
> 900K nutch-searcher.dir/segments/20050228024012/fetchlist
> 1.1M nutch-searcher.dir/segments/20050228024012/fetcher
> 37M nutch-searcher.dir/segments/20050228024012/content
> 3.9M nutch-searcher.dir/segments/20050228024012/parse_text
> 16M nutch-searcher.dir/segments/20050228024012/parse_data
> 58M nutch-searcher.dir/segments/20050228024012
> 3.2M nutch-searcher.dir/segments/20050228044354/fetchlist
> 1.1M nutch-searcher.dir/segments/20050228044354/fetcher
> 39M nutch-searcher.dir/segments/20050228044354/content
> 3.6M nutch-searcher.dir/segments/20050228044354/parse_text
> 16M nutch-searcher.dir/segments/20050228044354/parse_data
> 62M nutch-searcher.dir/segments/20050228044354
> 139M nutch-searcher.dir/segments
> 175M nutch-searcher.dir
>
> Crawl ran for about 2 hours and 43 minutes.
>
> When I search, it looks at the right searcher.dir, but it's not
> returning anything for me:
> 050228 085819 10 query request from 64.171.1.207
> 050228 085819 10 query: bhangra
> 050228 085819 10 searching for 20 raw hits
> 050228 085819 10 total hits: 0
>
> What am I doing wrong? TIA for the help.
>
> Regards,
> Paul
>
> -------------------------------------------------------
> SF email is sponsored by - The IT Product Guide
> Read honest & candid reviews on hundreds of IT Products from real users.
> Discover which products truly live up to the hype. Start reading now.
> http://ads.osdn.com/?ad_id=6595&alloc_id=14396&op=click
> _______________________________________________
> Nutch-developers mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/nutch-developers
>

--
<SimpleHuman gender="male">
  <Physical name="Olaf Thiele" />
  <Virtual adress="http://www.olafthiele.de" />
</SimpleHuman>
