Hi,
> Also when I am trying to do 4 and 5 million page fetches I always come back the next day and the fetcher is either hung up, or it will sit idle for a while and then fetch a little bit and stall. Is this a memory thing or how do I get the fetcher to get through a complete segment?
Crawling segments with half a million pages works fine for me with a newer Nutch version from CVS. Maybe you should also get a newer one.
If you have many URLs from one domain, your fetcher may fetch one URL, wait for some seconds, and only then fetch the next page from the same domain. This is implemented to keep the fetcher polite. Toward the end of a fetch cycle, when the remaining URLs belong to only a few domains, this means your fetcher can spend far more time waiting than crawling.
Maybe this information helps.
> I am still using .05 for the code at the moment. How do I eliminate .pdf's so there is no chance of a PDF hanging the system up?
Use the RegexUrlFilter and configure it to exclude PDFs in the regex file in your conf dir.
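As a rough sketch of the regex filter advice above, the rules might look something like the following. The file name (regex-urlfilter.txt) and the rule syntax shown here (one regular expression per line, prefixed with - to exclude or + to include, first match wins) are assumptions based on the usual Nutch layout, so check them against the file in your own conf dir.

```
# Assumed file: conf/regex-urlfilter.txt
# Skip any URL that ends in .pdf (both cases listed explicitly)
-\.(pdf|PDF)$
# Accept everything else
+.
```

Note that a pattern anchored with $ will not catch PDF URLs that carry a query string (for example file.pdf?id=1), so a looser pattern may be needed if your crawl contains such links.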
If you use the prefixUrl filter, you should switch to a combination of both filters. I don't know whether it is in CVS, but you can find the one we use at http://nutch.eventax.com/ (PrefixB4URLFilter).
Bye
Matthias
