I have run into this symptom before. If it is the same problem, it is caused by circular links in the site being crawled. When the crawl hits circular links, the link analysis step takes longer with each successive crawl and eventually keeps going until you run out of disk space.

To check, look at db/tmp<some number>/scoreEdits.0.unsorted: do a tail on it and pipe the output to strings. If you have the same problem, you will see the same URL over and over again, with one path segment repeated one more time in each entry, in the form of

http://site.com/index/index/page.html
http://site.com/index/index/index/page.html
http://site.com/index/index/index/index/page.html
...

I filed a bug report on this a while ago. The only way I found to deal with it was to write code using WebDBReader and WebDBWriter to delete the bad pages and their links. You can then also use one of the urlfilters to block those specific URLs from coming back.

Phoebe

_______________________________________________
Nutch-general mailing list
https://lists.sourceforge.net/lists/listinfo/nutch-general
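For anyone who wants to automate the check described above, here is a minimal Python sketch. It is not part of Nutch and does not touch the WebDB. It only mimics what `strings` does on a chunk of bytes and flags URLs whose path repeats the same segment back to back. The function names (`printable_strings`, `longest_segment_run`, `looks_circular`) and the default threshold of 2 repeats are my own assumptions for illustration.

```python
import re
from typing import List
from urllib.parse import urlsplit

# Runs of 4 or more printable ASCII bytes, roughly what `strings` prints.
_PRINTABLE_RUN = re.compile(rb"[\x20-\x7e]{4,}")


def printable_strings(data: bytes) -> List[str]:
    """Return the printable ASCII runs found in a chunk of binary data."""
    return [m.group().decode("ascii") for m in _PRINTABLE_RUN.finditer(data)]


def longest_segment_run(url: str) -> int:
    """Length of the longest run of identical consecutive path segments."""
    segments = [s for s in urlsplit(url).path.split("/") if s]
    longest = 0
    current = 0
    previous = None
    for segment in segments:
        # Extend the run if this segment equals the previous one, else restart.
        current = current + 1 if segment == previous else 1
        longest = max(longest, current)
        previous = segment
    return longest


def looks_circular(url: str, min_repeats: int = 2) -> bool:
    """Hypothetical heuristic: flag URLs like /index/index/page.html."""
    return longest_segment_run(url) >= min_repeats
```

To use it, read the last part of the scoreEdits file in binary mode, pass the bytes to `printable_strings`, and keep the entries for which `looks_circular` returns true. A threshold of 2 can flag legitimate URLs that happen to repeat a directory name once, so raise `min_repeats` if that turns out to be too aggressive for your site.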
