Hi,
Crawling segments of half a million pages works fine for me with a newer Nutch version from CVS. Maybe you should update to a newer one as well.
> Also when I am trying to do 4 and 5 million page fetches I always come back the next day and the fetcher is either hung up, or it will sit idle for a while and then fetch a little bit and stall. Is this a memory thing or how do I get the fetcher to get through a complete segment?
If you have many URLs from one domain, your fetcher will fetch one URL, wait for some seconds, and then fetch the next page from the same domain. This is implemented on purpose, to make the fetcher polite. Towards the end of a fetch cycle, when mostly URLs from a few large domains are left, the fetcher can therefore spend much more time waiting than fetching. That would explain the idle periods you see. Maybe this information helps.
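To get a feeling for how long the polite waiting can take, here is a rough sketch in Java. It is not Nutch code: the class name, the method names and the 5 second delay are assumptions for illustration only, so check the delay your own configuration uses. It counts the URLs per host and estimates the minimum time needed for one host when requests to it are made strictly one after another.

```java
import java.net.URI;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class PolitenessEstimate {

    // Lower bound in seconds for fetching all URLs of one host when the
    // fetcher waits delaySeconds between two requests to the same host.
    // Download time itself is ignored, so the real time will be higher.
    public static long minSecondsForHost(long urlsOnHost, long delaySeconds) {
        if (urlsOnHost <= 0) {
            return 0;
        }
        return (urlsOnHost - 1) * delaySeconds;
    }

    // Counts how many URLs in the list belong to each host.
    // URLs without a parsable host are skipped.
    public static Map<String, Integer> countByHost(List<String> urls) {
        Map<String, Integer> counts = new HashMap<>();
        for (String url : urls) {
            String host = URI.create(url).getHost();
            if (host != null) {
                counts.merge(host, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Hypothetical sample URLs, only to show the calls.
        List<String> urls = Arrays.asList(
                "http://a.example/1",
                "http://a.example/2",
                "http://b.example/");
        long assumedDelaySeconds = 5; // assumption, not a confirmed Nutch default
        for (Map.Entry<String, Integer> e : countByHost(urls).entrySet()) {
            System.out.println(e.getKey() + ": at least "
                    + minSecondsForHost(e.getValue(), assumedDelaySeconds) + " s");
        }
    }
}
```

With these assumed numbers, a segment that contains 100,000 URLs from a single host needs at least 499,995 seconds for that host alone, which is nearly six days, no matter how many fetcher threads are running.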
> I am still using .05 for the code at the moment. How do I eliminate .pdf's so there is no chance of a PDF hanging the system up?
Use the RegexURLFilter and configure it to exclude PDFs in the regex file in your conf directory.
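As far as I know, the regex file holds one rule per line, with a leading `-` to exclude and `+` to include matching URLs, so a rule such as `-\.pdf$` should do it. Please check this against the comments in the file shipped with your version. The following standalone Java sketch only shows what such a pattern accepts and rejects. It is not the Nutch filter class, and the class and method names are made up for this example.

```java
import java.util.regex.Pattern;

public class PdfUrlCheck {

    // Matches URLs whose path ends in ".pdf", ignoring case.
    // Mirrors an exclude rule like "-\.pdf$" in a regex filter file.
    private static final Pattern PDF_SUFFIX =
            Pattern.compile("\\.pdf$", Pattern.CASE_INSENSITIVE);

    // Returns true if the URL should be fetched, false if it looks like a PDF.
    public static boolean accept(String url) {
        return !PDF_SUFFIX.matcher(url).find();
    }

    public static void main(String[] args) {
        // Hypothetical sample URLs.
        String[] samples = {
            "http://www.example.com/index.html",
            "http://www.example.com/docs/manual.pdf",
            "http://www.example.com/docs/MANUAL.PDF"
        };
        for (String url : samples) {
            System.out.println((accept(url) ? "fetch " : "skip  ") + url);
        }
    }
}
```

Note that a suffix pattern like this cannot catch PDFs that are served under a URL without the `.pdf` ending, for example from a download script with query parameters.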
If you use the PrefixURLFilter, you should switch to a combination of both filters. I don't know if such a combined filter is in CVS, but you will find the one we use at http://nutch.eventax.com/ (PrefixB4URLFilter).
Bye
Matthias
--
http://gmbh.eventax.de - eventax GmbH
http://www.umkreisfinder.de - Die Suchmaschine für Lokales und Events
http://www.fahnen-drucken.de - Flaggen einfach selbst gemacht
