Hi Olaf,

Thank you very much for your response. Luke looks like a great tool
for getting familiar with Lucene and Nutch.

I was able to build my index 3-4 times, but I had queries turned off,
so it only caught the first few pages, since most of the website uses
query URLs.

You were right that depth 15 was the culprit. When I re-crawled the
whole website with depth 5, it actually gave me a valid index.
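
For anyone finding this thread later, here is a rough sketch of why the
depth setting matters so much. It assumes, as in Olaf's example quoted
below, that every page links to 10 new, unique pages, and it borrows the
average of roughly 22 KB per page from the fetcher status lines quoted
below. The function name and its parameters are made up for illustration
and are not part of Nutch; a real intranet is of course capped by the
number of pages it actually has.

```python
def estimate_crawl(links_per_page: int, depth: int, bytes_per_page: int):
    """Return (pages, total_bytes) for an idealized crawl.

    Assumption: level 0 is the single seed page and every page at
    level k links to `links_per_page` pages nobody has seen before.
    """
    # Level k holds links_per_page**k pages; add up levels 0..depth.
    pages = sum(links_per_page ** level for level in range(depth + 1))
    return pages, pages * bytes_per_page


if __name__ == "__main__":
    # 22_000 bytes/page is an approximation taken from the status log.
    for depth in (5, 15):
        pages, size = estimate_crawl(10, depth, 22_000)
        print(f"depth {depth:2d}: {pages:.2e} pages, {size / 1e9:.2e} GB")
```

Under these assumptions each extra level multiplies the work by about 10,
so the gap between depth 5 and depth 15 is about ten orders of magnitude.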

Now my searches are working fine. 

This is for just a single website (intranet).

Thanks for your help again,

Paul


On Tue, 1 Mar 2005 07:57:51 +0100, Olaf Thiele <[EMAIL PROTECTED]> wrote:
> Hi Paul,
> the general list would indeed be more appropriate, but
> let's finish the thread here. That makes it easier for everyone.
> 
> It sounds like you have not built your index yet. I
> recommend following the tutorial steps closely if you are a
> first-time user. After you have built the index, you should be able
> to look at it using Luke (http://www.getopt.org/luke/).
> 
> Launch it and open crawl_dir/index/segments
> You will be able to browse all indexed documents.
> 
> If that doesn't help, try fetching only every 6000th page from
> dmoz. It is not unusual to let Nutch run overnight, as it needs
> to fetch a lot of sites.
> 
> Kind regards,
> Olaf
> 
> 
> On Mon, 28 Feb 2005 14:04:35 -0800, sub paul <[EMAIL PROTECTED]> wrote:
> > Hi Olaf,
> >
> > Thanks for the reply. I am trying the same crawl with depth 5, and it
> > has been going on for about 3 hours. So it's a good sign.
> >
> > I read somewhere on the Nutch website that even if the crawl is
> > not finished, you can use the results from whatever it has. I tried
> > doing just that, and all my queries returned 0 results.
> >
> > If depth was the issue here, shouldn't Nutch tell me something
> > like "dude... are you sure?" And if something did go wrong, how can
> > I repair my data so I can at least use it for my search?
> >
> > Should I post such questions on the general list?
> >
> > TIA,
> > Paul
> >
> >
> > On Mon, 28 Feb 2005 21:06:22 +0100, Olaf Thiele <[EMAIL PROTECTED]> wrote:
> > > Hi Paul,
> > > the most likely problem seems to me to be the depth of 15.
> > > If your first page and every subsequent one had 10 links,
> > > your crawler would have to fetch roughly 24414062500
> > > gigabytes from the Internet.
> > >
> > > Depending on your data, start with a much smaller depth.
> > >
> > > Kind regards,
> > > Olaf
> > >
> > >
> > > On Mon, 28 Feb 2005 08:20:22 -0800, sub paul <[EMAIL PROTECTED]> wrote:
> > > > Hello All,
> > > >
> > > > I was running an intranet crawl and it seems like it did not finish
> > > > properly.
> > > > It is a pretty default setup, but the crawl's depth was 15, and I had
> > > > turned on queries by commenting out
> > > > # skip URLs containing certain characters as probable queries, etc.
> > > > [EMAIL PROTECTED]
> > > >
> > > > Other than a bunch of fetch messages and a bunch of "Exceeding max.delays"
> > > > messages, I am seeing the following.
> > > >
> > > > crawl starts normally...
> > > > 050228 064335 status: segment 20050228044354, 6300 pages, 91 errors, 140194211 bytes, 7163124 ms
> > > > 050228 064335 status: 0.8795045 pages/s, 152.90356 kb/s, 22253.049 bytes/page
> > > > .......
> > > > 050228 064551 status: segment 20050228044354, 6400 pages, 97 errors, 142348797 bytes, 7298549 ms
> > > > 050228 064551 status: 0.87688667 pages/s, 152.37276 kb/s, 22242.0 bytes/page
> > > > .....
> > > > 050228 064759 status: segment 20050228044354, 6500 pages, 102 errors, 144522915 bytes, 7427113 ms
> > > >
> > > > The result of all this was a nutch-searcher.dir that looked like this:
> > > > du -h nutch-searcher.dir/
> > > > 5.3M    nutch-searcher.dir/db/webdb/pagesByURL
> > > > 3.4M    nutch-searcher.dir/db/webdb/pagesByMD5
> > > > 14M     nutch-searcher.dir/db/webdb/linksByMD5
> > > > 14M     nutch-searcher.dir/db/webdb/linksByURL
> > > > 36M     nutch-searcher.dir/db/webdb
> > > > 36M     nutch-searcher.dir/db
> > > > 12K     nutch-searcher.dir/segments/20050228020140/fetchlist
> > > > 12K     nutch-searcher.dir/segments/20050228020140/fetcher
> > > > 20K     nutch-searcher.dir/segments/20050228020140/content
> > > > 12K     nutch-searcher.dir/segments/20050228020140/parse_text
> > > > 16K     nutch-searcher.dir/segments/20050228020140/parse_data
> > > > 76K     nutch-searcher.dir/segments/20050228020140
> > > > 16K     nutch-searcher.dir/segments/20050228020146/fetchlist
> > > > 16K     nutch-searcher.dir/segments/20050228020146/fetcher
> > > > 316K    nutch-searcher.dir/segments/20050228020146/content
> > > > 52K     nutch-searcher.dir/segments/20050228020146/parse_text
> > > > 144K    nutch-searcher.dir/segments/20050228020146/parse_data
> > > > 548K    nutch-searcher.dir/segments/20050228020146
> > > > 56K     nutch-searcher.dir/segments/20050228020257/fetchlist
> > > > 68K     nutch-searcher.dir/segments/20050228020257/fetcher
> > > > 2.2M    nutch-searcher.dir/segments/20050228020257/content
> > > > 260K    nutch-searcher.dir/segments/20050228020257/parse_text
> > > > 912K    nutch-searcher.dir/segments/20050228020257/parse_data
> > > > 3.5M    nutch-searcher.dir/segments/20050228020257
> > > > 232K    nutch-searcher.dir/segments/20050228020931/fetchlist
> > > > 276K    nutch-searcher.dir/segments/20050228020931/fetcher
> > > > 9.4M    nutch-searcher.dir/segments/20050228020931/content
> > > > 1.1M    nutch-searcher.dir/segments/20050228020931/parse_text
> > > > 4.1M    nutch-searcher.dir/segments/20050228020931/parse_data
> > > > 15M     nutch-searcher.dir/segments/20050228020931
> > > > 900K    nutch-searcher.dir/segments/20050228024012/fetchlist
> > > > 1.1M    nutch-searcher.dir/segments/20050228024012/fetcher
> > > > 37M     nutch-searcher.dir/segments/20050228024012/content
> > > > 3.9M    nutch-searcher.dir/segments/20050228024012/parse_text
> > > > 16M     nutch-searcher.dir/segments/20050228024012/parse_data
> > > > 58M     nutch-searcher.dir/segments/20050228024012
> > > > 3.2M    nutch-searcher.dir/segments/20050228044354/fetchlist
> > > > 1.1M    nutch-searcher.dir/segments/20050228044354/fetcher
> > > > 39M     nutch-searcher.dir/segments/20050228044354/content
> > > > 3.6M    nutch-searcher.dir/segments/20050228044354/parse_text
> > > > 16M     nutch-searcher.dir/segments/20050228044354/parse_data
> > > > 62M     nutch-searcher.dir/segments/20050228044354
> > > > 139M    nutch-searcher.dir/segments
> > > > 175M    nutch-searcher.dir
> > > >
> > > > Crawl ran for about 2 hours and 43 minutes.
> > > >
> > > > When I search, it looks at the right searcher.dir, but it's not
> > > > returning anything for me:
> > > > 050228 085819 10 query request from 64.171.1.207
> > > > 050228 085819 10 query: bhangra
> > > > 050228 085819 10 searching for 20 raw hits
> > > > 050228 085819 10 total hits: 0
> > > >
> > > > What am I doing wrong? TIA for the help.
> > > >
> > > > Regards,
> > > > Paul
> > > >
> > > > -------------------------------------------------------
> > > > SF email is sponsored by - The IT Product Guide
> > > > Read honest & candid reviews on hundreds of IT Products from real users.
> > > > Discover which products truly live up to the hype. Start reading now.
> > > > http://ads.osdn.com/?ad_id=6595&alloc_id=14396&op=click
> > > > _______________________________________________
> > > > Nutch-developers mailing list
> > > > Nutch-developers@lists.sourceforge.net
> > > > https://lists.sourceforge.net/lists/listinfo/nutch-developers
> > > >
> > >
> > > --
> > >
> > > <SimpleHuman gender="male">
> > >    <Physical name="Olaf Thiele" />
> > >    <Virtual adress="http://www.olafthiele.de" />
> > > </SimpleHuman>
> > >
> >
> 
>

