Hi,

Crawling segments with 1/2 million pages works fine for me with a newer Nutch version from CVS. You may want to update to a newer version as well.

> Also when I am trying to do 4 and 5 million page fetches I always come back
> the next day and the fetcher is either hung up, or it will sit idle for a
> while and then fetch a little bit and stall.  Is this a memory thing or how
> do I get the fetcher to get through a complete segment?
If you have many URLs from one domain, the fetcher may fetch one URL, wait a few seconds, and then fetch the next page from the same domain.
This is deliberate: it keeps the fetcher polite. Toward the end of a fetch cycle, when only URLs from a few domains remain, the fetcher can therefore spend much more time waiting than fetching.
Maybe this information helps.
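
To get a feeling for how much the politeness delay can cost, here is a rough back-of-the-envelope sketch. It is not Nutch code: the class and method names, the 5-second delay, and the host names are all assumptions chosen for illustration. It only computes a lower bound on fetch time when requests to the same host are spaced by a fixed delay.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class PolitenessEstimate {

    // Lower bound on wall-clock seconds for a single host, assuming the
    // fetcher waits delaySeconds between consecutive requests to that host.
    // Download time itself is ignored.
    static long minSecondsForHost(long urlsOnHost, long delaySeconds) {
        if (urlsOnHost <= 1) {
            return 0;
        }
        return (urlsOnHost - 1) * delaySeconds;
    }

    // A segment cannot finish before its slowest host has finished,
    // no matter how many fetcher threads are running.
    static long minSecondsForSegment(Map<String, Long> urlsPerHost, long delaySeconds) {
        long worst = 0;
        for (long count : urlsPerHost.values()) {
            worst = Math.max(worst, minSecondsForHost(count, delaySeconds));
        }
        return worst;
    }

    public static void main(String[] args) {
        // Hypothetical segment: one very large host and one small one.
        Map<String, Long> urlsPerHost = new LinkedHashMap<>();
        urlsPerHost.put("big.example.com", 200_000L);
        urlsPerHost.put("small.example.org", 50L);

        long delaySeconds = 5; // assumed delay, check your own configuration
        long seconds = minSecondsForSegment(urlsPerHost, delaySeconds);
        System.out.println("Segment needs at least " + (seconds / 3600.0) + " hours");
    }
}
```

If one host dominates a segment like this, the fetcher will look idle for long stretches even though nothing is broken.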


> I am still using .05 for the code at the moment.  How do I eliminate .pdf's
> so there is no chance of a PDF hanging the system up?
Use the RegexUrlFilter and add a rule to the regex file in your conf directory that excludes PDFs.
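
As a sketch of the kind of rule meant here: in the regex file, an exclusion line would look something like `-\.(pdf|PDF)$`, assuming your version uses the usual convention of a leading `-` to reject and `+` to accept (please check the default file shipped with your version). The small Java class below is not part of Nutch; its name and method are made up, and it only lets you try the pattern against a few URLs before putting it into the config.

```java
import java.util.regex.Pattern;

public class PdfUrlCheck {

    // Assumed exclusion pattern: the URL ends in ".pdf", in any letter case.
    // URLs with a query string after ".pdf" are NOT caught by this pattern.
    private static final Pattern PDF_SUFFIX =
            Pattern.compile("\\.pdf$", Pattern.CASE_INSENSITIVE);

    // Returns true if the URL would be rejected by the pattern above.
    static boolean isPdf(String url) {
        return PDF_SUFFIX.matcher(url).find();
    }

    public static void main(String[] args) {
        String[] samples = {
            "http://www.example.com/docs/manual.pdf",
            "http://www.example.com/docs/MANUAL.PDF",
            "http://www.example.com/pdf/index.html"
        };
        for (String url : samples) {
            System.out.println((isPdf(url) ? "skip  " : "fetch ") + url);
        }
    }
}
```

Note that this filters on the URL only. A PDF served from a URL without a `.pdf` suffix would still reach the fetcher.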

If you use the prefix URL filter, you should switch to a combination of both filters. I don't know whether it is in CVS, but the one we use is available at http://nutch.eventax.com/ (PrefixB4URLFilter).

Bye

Matthias


