Hi Olaf,

Thanks for the reply. I am trying the same crawl with depth 5, and it has been going on for about 3 hours, so it's a good sign.
I read somewhere on the Nutch website that even if the crawl has not finished, you can use whatever results it has so far. I tried doing just that, and all my queries returned 0 results. If depth was the issue here, shouldn't Nutch tell me something like "dude... are you sure?" And if something did go wrong, how can I repair my data so I can at least use it for my search? Should I post this question on the general list?

TIA,
Paul

On Mon, 28 Feb 2005 21:06:22 +0100, Olaf Thiele <[EMAIL PROTECTED]> wrote:
> Hi Paul,
> the most likely problem seems to me the depth of 15.
> If your first page and every consecutive one had 10 links,
> your crawler would have to fetch roughly 24414062500
> GigaByte from the Internet.
>
> Depending on your data, start with a much smaller depth.
>
> Kind regards,
> Olaf
>
>
> On Mon, 28 Feb 2005 08:20:22 -0800, sub paul <[EMAIL PROTECTED]> wrote:
> > Hello All,
> >
> > I was running an intranet crawl and It seems like it did not finish,
> > properly.
> > It is a pretty default setup, but crawl's depth was 15, and I had
> > turned on queries by commenting out
> > # skip URLs containing certain characters as probable queries, etc.
> > [EMAIL PROTECTED]
> >
> > other than bunch of fetch messages, and bunch of Exceeding max.delays
> > meaning message I am seeing the following..
> >
> > crawl starts normally...
> > 050228 064335 status: segment 20050228044354, 6300 pages, 91 errors,
> > 140194211 bytes, 7163124 ms
> > 050228 064335 status: 0.8795045 pages/s, 152.90356 kb/s, 22253.049
> > bytes/page
> > .......
> > 050228 064551 status: segment 20050228044354, 6400 pages, 97 errors,
> > 142348797 bytes, 7298549 ms
> > 050228 064551 status: 0.87688667 pages/s, 152.37276 kb/s, 22242.0 bytes/page
> > .....
> > 050228 064759 status: segment 20050228044354, 6500 pages, 102 errors,
> > 144522915 bytes, 7427113 ms
> >
> > Results of all this was a nutch-searcher.dir looked like this:
> > du -h nutch-searcher.dir/
> > 5.3M nutch-searcher.dir/db/webdb/pagesByURL
> > 3.4M nutch-searcher.dir/db/webdb/pagesByMD5
> > 14M nutch-searcher.dir/db/webdb/linksByMD5
> > 14M nutch-searcher.dir/db/webdb/linksByURL
> > 36M nutch-searcher.dir/db/webdb
> > 36M nutch-searcher.dir/db
> > 12K nutch-searcher.dir/segments/20050228020140/fetchlist
> > 12K nutch-searcher.dir/segments/20050228020140/fetcher
> > 20K nutch-searcher.dir/segments/20050228020140/content
> > 12K nutch-searcher.dir/segments/20050228020140/parse_text
> > 16K nutch-searcher.dir/segments/20050228020140/parse_data
> > 76K nutch-searcher.dir/segments/20050228020140
> > 16K nutch-searcher.dir/segments/20050228020146/fetchlist
> > 16K nutch-searcher.dir/segments/20050228020146/fetcher
> > 316K nutch-searcher.dir/segments/20050228020146/content
> > 52K nutch-searcher.dir/segments/20050228020146/parse_text
> > 144K nutch-searcher.dir/segments/20050228020146/parse_data
> > 548K nutch-searcher.dir/segments/20050228020146
> > 56K nutch-searcher.dir/segments/20050228020257/fetchlist
> > 68K nutch-searcher.dir/segments/20050228020257/fetcher
> > 2.2M nutch-searcher.dir/segments/20050228020257/content
> > 260K nutch-searcher.dir/segments/20050228020257/parse_text
> > 912K nutch-searcher.dir/segments/20050228020257/parse_data
> > 3.5M nutch-searcher.dir/segments/20050228020257
> > 232K nutch-searcher.dir/segments/20050228020931/fetchlist
> > 276K nutch-searcher.dir/segments/20050228020931/fetcher
> > 9.4M nutch-searcher.dir/segments/20050228020931/content
> > 1.1M nutch-searcher.dir/segments/20050228020931/parse_text
> > 4.1M nutch-searcher.dir/segments/20050228020931/parse_data
> > 15M nutch-searcher.dir/segments/20050228020931
> > 900K nutch-searcher.dir/segments/20050228024012/fetchlist
> > 1.1M nutch-searcher.dir/segments/20050228024012/fetcher
> > 37M nutch-searcher.dir/segments/20050228024012/content
> > 3.9M nutch-searcher.dir/segments/20050228024012/parse_text
> > 16M nutch-searcher.dir/segments/20050228024012/parse_data
> > 58M nutch-searcher.dir/segments/20050228024012
> > 3.2M nutch-searcher.dir/segments/20050228044354/fetchlist
> > 1.1M nutch-searcher.dir/segments/20050228044354/fetcher
> > 39M nutch-searcher.dir/segments/20050228044354/content
> > 3.6M nutch-searcher.dir/segments/20050228044354/parse_text
> > 16M nutch-searcher.dir/segments/20050228044354/parse_data
> > 62M nutch-searcher.dir/segments/20050228044354
> > 139M nutch-searcher.dir/segments
> > 175M nutch-searcher.dir
> >
> > Crawl ran for about 2 hours and 43 minutes.
> >
> > when I search, it looks at the right searcher.dir, but it's not
> > returning anything for me:
> > 050228 085819 10 query request from 64.171.1.207
> > 050228 085819 10 query: bhangra
> > 050228 085819 10 searching for 20 raw hits
> > 050228 085819 10 total hits: 0
> >
> > what am I doing wrong? TIA for the help.
> >
> > Regards,
> > Paul
> >
> > -------------------------------------------------------
> > SF email is sponsored by - The IT Product Guide
> > Read honest & candid reviews on hundreds of IT Products from real users.
> > Discover which products truly live up to the hype. Start reading now.
> > http://ads.osdn.com/?ad_id=6595&alloc_id=14396&op=click
> > _______________________________________________
> > Nutch-developers mailing list
> > Nutch-developers@lists.sourceforge.net
> > https://lists.sourceforge.net/lists/listinfo/nutch-developers
> >
>
> --
>
> <SimpleHuman gender="male">
> <Physical name="Olaf Thiele" />
> <Virtual adress="http://www.olafthiele.de" />
> </SimpleHuman>
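
For anyone following the thread, the reasoning behind Olaf's estimate can be sketched in a few lines of Python. This is only a back-of-the-envelope model, not Nutch code: it assumes every fetched page contributes exactly 10 previously unseen links (Olaf's figure), that each depth level is one fetch round starting from a single seed page, and that a page averages about 22,250 bytes, which is roughly what the fetcher status lines report. The function names are made up for illustration. It is not meant to reproduce Olaf's exact number, which depends on how the rounds and page size are counted, only the order of magnitude.

```python
# Rough model of crawl growth with depth. All names here are hypothetical
# helpers for illustration; none of them come from Nutch.

BYTES_PER_PAGE = 22_250  # assumption: approximate average from the status lines


def pages_at_depth(depth, links_per_page=10):
    """Pages fetched in round `depth`, if every page yields
    `links_per_page` new URLs and round 1 is a single seed page."""
    return links_per_page ** (depth - 1)


def total_pages(depth, links_per_page=10):
    """Pages fetched over all rounds from 1 up to `depth`."""
    return sum(pages_at_depth(d, links_per_page) for d in range(1, depth + 1))


def total_gigabytes(depth, links_per_page=10, bytes_per_page=BYTES_PER_PAGE):
    """Total download volume in decimal gigabytes (10**9 bytes)."""
    return total_pages(depth, links_per_page) * bytes_per_page / 1e9


if __name__ == "__main__":
    for depth in (5, 15):
        print(f"depth {depth:2d}: {total_pages(depth):,} pages, "
              f"about {total_gigabytes(depth):,.2f} GB")
```

On a real intranet the growth is far slower, because most links point at pages that have already been seen, but the model shows why moving from depth 5 to depth 15 changes the picture so drastically.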