Interesting, thanks.

I believe I have a similar problem. My solution was to delete my webdb and recreate it, and then things worked fine for one analysis iteration.

There was a specific site that behaved just as you describe below, but instead of /index/index the repeating path was /english/russian/german/romanian/english/french/german and so on.

I went ahead and created an /etc/hosts entry for this site and pointed it to a webserver with a robots.txt that denies access. But thanks to your suggestion, I'll look into purging it from my DB.

Thanks,
Gus

On Tue, 29 Mar 2005, Phoebe Miller wrote:

I ran into this symptom before.
If it is the same problem, it has to do with circular links in the site
being crawled.
If you run into circular links, the link analysis step will take longer with
each successive crawl and then keep going until you run out of disk space.
To check, look at the file

db/tmp<some number>/scoreEdits.0.unsorted

Run tail on it and pipe the output to strings.
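
If tail and strings are not to hand, the same check can be roughed out in
Python. This is only a sketch: the function name tail_strings, the 64 KB
window, and the minimum run length of 4 are assumptions for illustration,
not anything taken from Nutch.

```python
import re
from pathlib import Path

def tail_strings(path, tail_bytes=65536, min_len=4):
    """Return printable ASCII runs found in the last tail_bytes of a file."""
    size = Path(path).stat().st_size
    with open(path, "rb") as f:
        # Jump to the start of the tail window (or the file start if smaller).
        f.seek(max(0, size - tail_bytes))
        data = f.read()
    # Runs of printable ASCII characters at least min_len bytes long.
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    return [m.group().decode("ascii") for m in pattern.finditer(data)]
```

Calling it on the scoreEdits file and printing the result should show the
embedded URLs one per list entry, which makes a repeating pattern easy to
spot.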

If you have the same problem, you will see the same url over and over again,
in the form of

http://site.com/index/index/page.html
http://site.com/index/index/index/page.html
http://site.com/index/index/index/index/page.html
...
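
With a long list of URLs, one rough way to confirm the pattern is to count
how often each path segment occurs. The sketch below is illustrative only;
has_repeated_segment and its max_repeats threshold are hypothetical names,
not part of Nutch.

```python
from collections import Counter
from urllib.parse import urlsplit

def has_repeated_segment(url, max_repeats=1):
    """True if any path segment occurs more than max_repeats times."""
    segments = [s for s in urlsplit(url).path.split("/") if s]
    if not segments:
        return False
    # The repeats need not be adjacent, so /a/b/a also counts.
    return max(Counter(segments).values()) > max_repeats
```

Some legitimate URLs do repeat a segment, so raising max_repeats to 2 or 3
may cut down on false positives.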

I filed a bug report on this a while ago. The only way I found to deal with
it was to write code using WebDBReader and WebDBWriter to delete the bad
pages and their links. You can then also use one of the urlfilters to block
the specific URLs from coming back.
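
To illustrate the filtering idea, here is a small sketch of an accept/reject
rule list where the first matching rule wins. The RULES list, the accept
function, and the regular expression are all assumptions made for this
example; check the rule syntax of the urlfilter you actually use before
adapting the pattern.

```python
import re

# Hypothetical rule list: "-" rejects, "+" accepts, first matching rule wins.
RULES = [
    # Reject a path segment that is immediately repeated, e.g. /index/index
    ("-", re.compile(r"(/[^/]+)\1(?=/|$)")),
    # Accept everything else.
    ("+", re.compile(r".")),
]

def accept(url):
    """Return True if the first rule matching url is an accept rule."""
    for sign, pattern in RULES:
        if pattern.search(url):
            return sign == "+"
    return False  # no rule matched: reject
```

This pattern only catches back-to-back repeats. A loop that cycles through
several different segments before repeating would need a broader rule or a
site-specific one.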

Phoebe



