Hi Olaf,

Thanks for the reply. I am trying the same crawl with depth 5, and it
has been going on for about 3 hours. So it's a good sign.
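
To get a feel for why depth matters so much, here is a rough sketch of the
growth in your 10-links-per-page example (quoted below). It assumes every
page links to 10 pages not seen before, which a real intranet won't do, and
it uses the 22253 bytes/page average from my fetcher status lines. The
function names are made up for illustration, and how Nutch counts depth
exactly is not modeled here.

```python
# Back-of-the-envelope estimate only: assumes every page links to
# `links_per_page` pages that were not seen before (real sites overlap a lot).

def pages_up_to_depth(links_per_page: int, depth: int) -> int:
    """Total pages reached when following links `depth` levels from one seed page."""
    return sum(links_per_page ** level for level in range(depth + 1))


def estimated_bytes(links_per_page: int, depth: int, bytes_per_page: float) -> float:
    """Approximate download volume for that many pages."""
    return pages_up_to_depth(links_per_page, depth) * bytes_per_page


if __name__ == "__main__":
    avg_page = 22253  # bytes/page, taken from the fetcher status lines
    for depth in (5, 15):
        pages = pages_up_to_depth(10, depth)
        gib = estimated_bytes(10, depth, avg_page) / 2**30
        print(f"depth {depth}: {pages} pages, about {gib:.1f} GiB")
```

Under those assumptions the page count grows by a factor of 10 with every
extra level, so depth 15 is out of reach while depth 5 is not.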

I read somewhere on the Nutch website that even if the crawl has not
finished, you can use whatever results it has so far. I tried
doing just that, and all my queries returned 0 results.

If depth was the issue here, shouldn't Nutch tell me something
like "dude... are you sure?" And if something did go wrong, how can
I repair my data so I can at least use it for my search?

Should I post this kind of question on the general list?

TIA,
Paul



On Mon, 28 Feb 2005 21:06:22 +0100, Olaf Thiele <[EMAIL PROTECTED]> wrote:
> Hi Paul,
> the most likely problem seems to me to be the depth of 15.
> If your first page and every subsequent one had 10 links,
> your crawler would have to fetch roughly 24414062500
> gigabytes from the Internet.
> 
> Depending on your data, start with a much smaller depth.
> 
> Kind regards,
> Olaf
> 
> 
> On Mon, 28 Feb 2005 08:20:22 -0800, sub paul <[EMAIL PROTECTED]> wrote:
> > Hello All,
> >
> > I was running an intranet crawl, and it seems it did not finish
> > properly.
> > It is a pretty default setup, but crawl's depth was 15, and I had
> > turned on queries by commenting out
> > # skip URLs containing certain characters as probable queries, etc.
> > -[?*!@=]
> >
> > Other than a bunch of fetch messages and a bunch of "Exceeding max.delays"
> > messages, I am seeing the following:
> >
> > crawl starts normally...
> > 050228 064335 status: segment 20050228044354, 6300 pages, 91 errors,
> > 140194211 bytes, 7163124 ms
> > 050228 064335 status: 0.8795045 pages/s, 152.90356 kb/s, 22253.049 bytes/page
> > .......
> > 050228 064551 status: segment 20050228044354, 6400 pages, 97 errors,
> > 142348797 bytes, 7298549 ms
> > 050228 064551 status: 0.87688667 pages/s, 152.37276 kb/s, 22242.0 bytes/page
> > .....
> > 050228 064759 status: segment 20050228044354, 6500 pages, 102 errors,
> > 144522915 bytes, 7427113 ms
> >
> > The result of all this was a nutch-searcher.dir that looked like this:
> > du -h nutch-searcher.dir/
> > 5.3M    nutch-searcher.dir/db/webdb/pagesByURL
> > 3.4M    nutch-searcher.dir/db/webdb/pagesByMD5
> > 14M     nutch-searcher.dir/db/webdb/linksByMD5
> > 14M     nutch-searcher.dir/db/webdb/linksByURL
> > 36M     nutch-searcher.dir/db/webdb
> > 36M     nutch-searcher.dir/db
> > 12K     nutch-searcher.dir/segments/20050228020140/fetchlist
> > 12K     nutch-searcher.dir/segments/20050228020140/fetcher
> > 20K     nutch-searcher.dir/segments/20050228020140/content
> > 12K     nutch-searcher.dir/segments/20050228020140/parse_text
> > 16K     nutch-searcher.dir/segments/20050228020140/parse_data
> > 76K     nutch-searcher.dir/segments/20050228020140
> > 16K     nutch-searcher.dir/segments/20050228020146/fetchlist
> > 16K     nutch-searcher.dir/segments/20050228020146/fetcher
> > 316K    nutch-searcher.dir/segments/20050228020146/content
> > 52K     nutch-searcher.dir/segments/20050228020146/parse_text
> > 144K    nutch-searcher.dir/segments/20050228020146/parse_data
> > 548K    nutch-searcher.dir/segments/20050228020146
> > 56K     nutch-searcher.dir/segments/20050228020257/fetchlist
> > 68K     nutch-searcher.dir/segments/20050228020257/fetcher
> > 2.2M    nutch-searcher.dir/segments/20050228020257/content
> > 260K    nutch-searcher.dir/segments/20050228020257/parse_text
> > 912K    nutch-searcher.dir/segments/20050228020257/parse_data
> > 3.5M    nutch-searcher.dir/segments/20050228020257
> > 232K    nutch-searcher.dir/segments/20050228020931/fetchlist
> > 276K    nutch-searcher.dir/segments/20050228020931/fetcher
> > 9.4M    nutch-searcher.dir/segments/20050228020931/content
> > 1.1M    nutch-searcher.dir/segments/20050228020931/parse_text
> > 4.1M    nutch-searcher.dir/segments/20050228020931/parse_data
> > 15M     nutch-searcher.dir/segments/20050228020931
> > 900K    nutch-searcher.dir/segments/20050228024012/fetchlist
> > 1.1M    nutch-searcher.dir/segments/20050228024012/fetcher
> > 37M     nutch-searcher.dir/segments/20050228024012/content
> > 3.9M    nutch-searcher.dir/segments/20050228024012/parse_text
> > 16M     nutch-searcher.dir/segments/20050228024012/parse_data
> > 58M     nutch-searcher.dir/segments/20050228024012
> > 3.2M    nutch-searcher.dir/segments/20050228044354/fetchlist
> > 1.1M    nutch-searcher.dir/segments/20050228044354/fetcher
> > 39M     nutch-searcher.dir/segments/20050228044354/content
> > 3.6M    nutch-searcher.dir/segments/20050228044354/parse_text
> > 16M     nutch-searcher.dir/segments/20050228044354/parse_data
> > 62M     nutch-searcher.dir/segments/20050228044354
> > 139M    nutch-searcher.dir/segments
> > 175M    nutch-searcher.dir
> >
> > Crawl ran for about 2 hours and 43 minutes.
> >
> > When I search, it looks at the right searcher.dir, but it's not
> > returning anything for me:
> > 050228 085819 10 query request from 64.171.1.207
> > 050228 085819 10 query: bhangra
> > 050228 085819 10 searching for 20 raw hits
> > 050228 085819 10 total hits: 0
> >
> > What am I doing wrong? TIA for the help.
> >
> > Regards,
> > Paul
> >
> > -------------------------------------------------------
> > SF email is sponsored by - The IT Product Guide
> > Read honest & candid reviews on hundreds of IT Products from real users.
> > Discover which products truly live up to the hype. Start reading now.
> > http://ads.osdn.com/?ad_id=6595&alloc_id=14396&op=click
> > _______________________________________________
> > Nutch-developers mailing list
> > Nutch-developers@lists.sourceforge.net
> > https://lists.sourceforge.net/lists/listinfo/nutch-developers
> >
> 
> --
> 
> <SimpleHuman gender="male">
>    <Physical name="Olaf Thiele" />
>    <Virtual adress="http://www.olafthiele.de" />
> </SimpleHuman>
> 

