Hi Olaf,

Thank you very much for your response. Luke looks like a great tool for getting familiar with Lucene and Nutch.
I was able to build my index 3-4 times, but I had queries turned off, so it would only catch the first few pages, as most of the website uses queries. You were right when you said depth 15 was the culprit: when I re-crawled the whole website with depth 5, it actually gave me a valid index. Now my searches are working fine. This is for just a single website (intranet).

Thanks for your help again,
Paul

On Tue, 1 Mar 2005 07:57:51 +0100, Olaf Thiele <[EMAIL PROTECTED]> wrote:
> Hi Paul,
> the general list would indeed be more appropriate, but
> let's finish the thread here. That makes it easier for everyone.
>
> It sounds like you have not built your index yet. I
> recommend following the tutorial steps closely if you are a
> first-time user. After you have built the index, you should be able
> to look at it using Luke (http://www.getopt.org/luke/).
>
> Launch it and open crawl_dir/index/segments
> You will be able to browse all indexed documents.
>
> If that doesn't help, try fetching only every 6000th page from
> dmoz. It is not unusual to let Nutch run overnight, as it needs
> to fetch a lot of sites.
>
> Kind regards,
> Olaf
>
>
> On Mon, 28 Feb 2005 14:04:35 -0800, sub paul <[EMAIL PROTECTED]> wrote:
> > Hi Olaf,
> >
> > Thanks for the reply. I am trying the same crawl with depth 5, and it
> > has been going on for about 3 hours. So it's a good sign.
> >
> > I read somewhere on the Nutch website that even if the crawl is
> > not finished, you can use the results from whatever it has. I tried
> > doing just that, and all my queries returned 0 results.
> >
> > If depth was the issue here, should Nutch tell me something
> > like "dude... are you sure?" And if something did go wrong, how can
> > I repair my data so I can at least use it for my search?
> >
> > Should I post such questions on the general list?
> >
> > TIA,
> > Paul
> >
> >
> > On Mon, 28 Feb 2005 21:06:22 +0100, Olaf Thiele <[EMAIL PROTECTED]> wrote:
> > > Hi Paul,
> > > the most likely problem seems to me to be the depth of 15.
> > > If your first page and every subsequent one had 10 links,
> > > your crawler would have to fetch roughly 24414062500
> > > gigabytes from the Internet.
> > >
> > > Depending on your data, start with a much smaller depth.
> > >
> > > Kind regards,
> > > Olaf
> > >
> > >
> > > On Mon, 28 Feb 2005 08:20:22 -0800, sub paul <[EMAIL PROTECTED]> wrote:
> > > > Hello All,
> > > >
> > > > I was running an intranet crawl and it seems like it did not finish
> > > > properly.
> > > > It is a pretty default setup, but the crawl's depth was 15, and I had
> > > > turned on queries by commenting out
> > > > # skip URLs containing certain characters as probable queries, etc.
> > > > [EMAIL PROTECTED]
> > > >
> > > > Other than a bunch of fetch messages and a bunch of "Exceeding max.delays"
> > > > messages, I am seeing the following:
> > > >
> > > > crawl starts normally...
> > > > 050228 064335 status: segment 20050228044354, 6300 pages, 91 errors, 140194211 bytes, 7163124 ms
> > > > 050228 064335 status: 0.8795045 pages/s, 152.90356 kb/s, 22253.049 bytes/page
> > > > .......
> > > > 050228 064551 status: segment 20050228044354, 6400 pages, 97 errors, 142348797 bytes, 7298549 ms
> > > > 050228 064551 status: 0.87688667 pages/s, 152.37276 kb/s, 22242.0 bytes/page
> > > > .....
> > > > 050228 064759 status: segment 20050228044354, 6500 pages, 102 errors, 144522915 bytes, 7427113 ms
> > > >
> > > > The result of all this was a nutch-searcher.dir that looked like this:
> > > > du -h nutch-searcher.dir/
> > > > 5.3M  nutch-searcher.dir/db/webdb/pagesByURL
> > > > 3.4M  nutch-searcher.dir/db/webdb/pagesByMD5
> > > > 14M   nutch-searcher.dir/db/webdb/linksByMD5
> > > > 14M   nutch-searcher.dir/db/webdb/linksByURL
> > > > 36M   nutch-searcher.dir/db/webdb
> > > > 36M   nutch-searcher.dir/db
> > > > 12K   nutch-searcher.dir/segments/20050228020140/fetchlist
> > > > 12K   nutch-searcher.dir/segments/20050228020140/fetcher
> > > > 20K   nutch-searcher.dir/segments/20050228020140/content
> > > > 12K   nutch-searcher.dir/segments/20050228020140/parse_text
> > > > 16K   nutch-searcher.dir/segments/20050228020140/parse_data
> > > > 76K   nutch-searcher.dir/segments/20050228020140
> > > > 16K   nutch-searcher.dir/segments/20050228020146/fetchlist
> > > > 16K   nutch-searcher.dir/segments/20050228020146/fetcher
> > > > 316K  nutch-searcher.dir/segments/20050228020146/content
> > > > 52K   nutch-searcher.dir/segments/20050228020146/parse_text
> > > > 144K  nutch-searcher.dir/segments/20050228020146/parse_data
> > > > 548K  nutch-searcher.dir/segments/20050228020146
> > > > 56K   nutch-searcher.dir/segments/20050228020257/fetchlist
> > > > 68K   nutch-searcher.dir/segments/20050228020257/fetcher
> > > > 2.2M  nutch-searcher.dir/segments/20050228020257/content
> > > > 260K  nutch-searcher.dir/segments/20050228020257/parse_text
> > > > 912K  nutch-searcher.dir/segments/20050228020257/parse_data
> > > > 3.5M  nutch-searcher.dir/segments/20050228020257
> > > > 232K  nutch-searcher.dir/segments/20050228020931/fetchlist
> > > > 276K  nutch-searcher.dir/segments/20050228020931/fetcher
> > > > 9.4M  nutch-searcher.dir/segments/20050228020931/content
> > > > 1.1M  nutch-searcher.dir/segments/20050228020931/parse_text
> > > > 4.1M  nutch-searcher.dir/segments/20050228020931/parse_data
> > > > 15M   nutch-searcher.dir/segments/20050228020931
> > > > 900K  nutch-searcher.dir/segments/20050228024012/fetchlist
> > > > 1.1M  nutch-searcher.dir/segments/20050228024012/fetcher
> > > > 37M   nutch-searcher.dir/segments/20050228024012/content
> > > > 3.9M  nutch-searcher.dir/segments/20050228024012/parse_text
> > > > 16M   nutch-searcher.dir/segments/20050228024012/parse_data
> > > > 58M   nutch-searcher.dir/segments/20050228024012
> > > > 3.2M  nutch-searcher.dir/segments/20050228044354/fetchlist
> > > > 1.1M  nutch-searcher.dir/segments/20050228044354/fetcher
> > > > 39M   nutch-searcher.dir/segments/20050228044354/content
> > > > 3.6M  nutch-searcher.dir/segments/20050228044354/parse_text
> > > > 16M   nutch-searcher.dir/segments/20050228044354/parse_data
> > > > 62M   nutch-searcher.dir/segments/20050228044354
> > > > 139M  nutch-searcher.dir/segments
> > > > 175M  nutch-searcher.dir
> > > >
> > > > The crawl ran for about 2 hours and 43 minutes.
> > > >
> > > > When I search, it looks at the right searcher.dir, but it's not
> > > > returning anything for me:
> > > > 050228 085819 10 query request from 64.171.1.207
> > > > 050228 085819 10 query: bhangra
> > > > 050228 085819 10 searching for 20 raw hits
> > > > 050228 085819 10 total hits: 0
> > > >
> > > > What am I doing wrong? TIA for the help.
> > > >
> > > > Regards,
> > > > Paul
> > > >
> > > > -------------------------------------------------------
> > > > SF email is sponsored by - The IT Product Guide
> > > > Read honest & candid reviews on hundreds of IT Products from real users.
> > > > Discover which products truly live up to the hype. Start reading now.
> > > > http://ads.osdn.com/?ad_id=6595&alloc_id=14396&op=click
> > > > _______________________________________________
> > > > Nutch-developers mailing list
> > > > Nutch-developers@lists.sourceforge.net
> > > > https://lists.sourceforge.net/lists/listinfo/nutch-developers
> > > >
> > >
> > > --
> > >
> > > <SimpleHuman gender="male">
> > > <Physical name="Olaf Thiele" />
> > > <Virtual adress="http://www.olafthiele.de" />
> > > </SimpleHuman>
> > >
> >
>
> --
>
> <SimpleHuman gender="male">
> <Physical name="Olaf Thiele" />
> <Virtual adress="http://www.olafthiele.de" />
> </SimpleHuman>
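
A footnote on the depth arithmetic discussed in this thread, as a minimal Python sketch. It assumes the worst case Olaf describes (every page links to 10 pages not seen before), which a real intranet will not reach because the site is finite and the crawler does not refetch URLs it already knows. The function names, the 25 KB average page size, and the mixed 1024/1000 unit conversion are assumptions made here for illustration. They are used only because they happen to reproduce the 24414062500 figure quoted in the thread; the thread itself does not state how that number was derived.

```python
def pages_at_depth(links_per_page: int, depth: int) -> int:
    """Worst-case number of pages first reached at exactly `depth` hops."""
    return links_per_page ** depth


def total_pages(links_per_page: int, depth: int) -> int:
    """Worst-case number of pages reached at any level from 0 to `depth`."""
    return sum(pages_at_depth(links_per_page, d) for d in range(depth + 1))


def download_gb(pages: int, kb_per_page: int = 25) -> int:
    """Rough download volume in GB for `pages` pages.

    Assumption: 25 KB per page, KB -> MB divided by 1024, MB -> GB
    divided by 1000. This mix is hypothetical, picked because it
    matches the number quoted in the thread.
    """
    return pages * kb_per_page // 1024 // 1000


if __name__ == "__main__":
    for depth in (5, 15):
        last_level = pages_at_depth(10, depth)
        print(
            f"depth {depth}: {last_level} pages on the last level, "
            f"{total_pages(10, depth)} in total, "
            f"about {download_gb(last_level)} GB for the last level"
        )
```

Under the same assumptions, depth 5 gives at most 100000 pages on the last level (111111 in total), ten orders of magnitude fewer than depth 15.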