I have run into this symptom before. If it is the same problem, it is 
caused by circular links in the site being crawled.
If you run into circular links, the link analysis step will take longer 
with each successive crawl, and will eventually keep going until you run 
out of disk space.
To check, look at the file

db/tmp<some number>/scoreEdits.0.unsorted

Run tail on it and pipe the output to strings.

If you have the same problem, you will see the same URL over and over again, 
with one more repeated path segment each time, in the form of:

http://site.com/index/index/page.html
http://site.com/index/index/index/page.html
http://site.com/index/index/index/index/page.html
...
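
As a rough illustration of the check described above, here is a minimal 
Python sketch that pulls printable runs out of binary data (similar in 
spirit to strings) and flags URLs whose path repeats the same segment 
back to back. The function names, the 64 KB tail size, and the threshold 
of 2 are assumptions made for the example; nothing here is part of Nutch.

```python
import re
from urllib.parse import urlsplit


def tail_bytes(path: str, count: int = 65536) -> bytes:
    """Return up to the last `count` bytes of a file (assumed tail size)."""
    with open(path, "rb") as handle:
        handle.seek(0, 2)                      # jump to end of file
        size = handle.tell()
        handle.seek(max(0, size - count))
        return handle.read()


def printable_strings(data: bytes, min_len: int = 4):
    """Yield runs of printable ASCII, roughly what `strings` prints."""
    for match in re.finditer(rb"[\x20-\x7e]{%d,}" % min_len, data):
        yield match.group().decode("ascii")


def max_repeat(url: str) -> int:
    """Length of the longest run of identical consecutive path segments."""
    segments = [s for s in urlsplit(url).path.split("/") if s]
    best = run = 0
    prev = None
    for seg in segments:
        run = run + 1 if seg == prev else 1
        best = max(best, run)
        prev = seg
    return best


def suspicious_urls(data: bytes, threshold: int = 2):
    """Return URLs found in `data` whose path repeats a segment."""
    found = []
    for text in printable_strings(data):
        for url in re.findall(r"https?://\S+", text):
            if max_repeat(url) >= threshold:
                found.append(url)
    return found
```

It could be called as suspicious_urls(tail_bytes(path)) with the path of 
the scoreEdits file. A long list of near-identical URLs would suggest 
the circular link problem.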

I filed a bug report on this a while ago. The only way I found to deal with 
it was to write code using WebDBReader and WebDBWriter to delete the bad 
pages and their links. You can then also use one of the urlfilters to block 
the specific URLs from coming back. 
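
To make the filtering idea concrete, here is a small Python sketch of the 
kind of test a URL filter could apply: reject any URL whose path contains 
the same segment twice in a row. The pattern and the accept function are 
hypothetical and only illustrate the logic; this is not the Nutch 
urlfilter API, and a rule this strict may also reject legitimate URLs on 
some sites.

```python
import re
from urllib.parse import urlsplit

# Matches a path segment immediately followed by an identical segment,
# e.g. "/index/index". The lookahead ensures the second copy is a whole
# segment, so "/a/ab" does not count as a repeat.
REPEATED_SEGMENT = re.compile(r"(/[^/]+)\1(?=/|$)")


def accept(url: str) -> bool:
    """Return True if the URL path has no back-to-back repeated segment."""
    return REPEATED_SEGMENT.search(urlsplit(url).path) is None
```

With this sketch, http://site.com/index/page.html would be accepted, 
while http://site.com/index/index/page.html would be rejected.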

Phoebe


_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general