Hi,

Crawling segments with 1/2 million pages works fine for me with a newer
Nutch version from CVS. You may want to update to a newer version too.

> Also when I am trying to do 4 and 5 million page fetches I always come
> back the next day and the fetcher is either hung up, or it will sit
> idle for a while and then fetch a little bit and stall. Is this a
> memory thing or how do I get the fetcher to get through a complete
> segment?
If you have many URLs from one domain, the fetcher may fetch one URL,
wait for some seconds, and then fetch the next page from the same
domain. This delay is implemented to keep the fetcher polite. Toward
the end of a fetching cycle, when only a few domains are left, this
means the fetcher can spend much more time waiting than crawling.
Maybe this information helps.
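
To illustrate why the tail of a segment can look like a stall, here is a
minimal simulation sketch. It is not Nutch code: the function name
simulate_polite_fetch, the 5 second delay and the 1 second fetch time
are assumptions chosen only for the example, and time is simulated
rather than slept.

```python
from collections import deque
from urllib.parse import urlparse


def simulate_polite_fetch(urls, delay=5.0, fetch_time=1.0):
    """Simulate a single-threaded fetcher that waits `delay` seconds
    between two requests to the same host.

    Returns (busy_seconds, idle_seconds). All values are hypothetical.
    """
    next_allowed = {}  # host -> earliest time the next request may start
    queue = deque(urls)
    clock = busy = idle = 0.0
    while queue:
        # Pick the first queued URL whose host is outside its delay window.
        ready = next(
            (u for u in queue
             if next_allowed.get(urlparse(u).hostname, 0.0) <= clock),
            None,
        )
        if ready is None:
            # Every remaining host is still waiting: the fetcher sits idle.
            wake = min(next_allowed[urlparse(u).hostname] for u in queue)
            idle += wake - clock
            clock = wake
            continue
        queue.remove(ready)
        clock += fetch_time
        busy += fetch_time
        next_allowed[urlparse(ready).hostname] = clock + delay
    return busy, idle


# Three different hosts: no waiting at all.
mixed = ["http://a.example/", "http://b.example/", "http://c.example/"]
print(simulate_polite_fetch(mixed))  # (3.0, 0.0)

# Four pages left on one host: mostly waiting.
tail = ["http://big.example/page%d" % i for i in range(4)]
print(simulate_polite_fetch(tail))   # (4.0, 15.0)
```

With only one host left, the simulated fetcher is idle for 15 of 19
seconds, which from the outside looks like "fetch a little bit and
stall".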

> I am still using .05 for the code at the moment. How do I eliminate
> .pdf's so there is no chance of a PDF hanging the system up?
Use the RegexUrlFilter and configure it to exclude PDFs in the regex
file in your conf directory.
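
As a rough sketch of how such a rule file behaves, the Python below
mimics the usual convention of the regex filter file: each rule line
starts with "-" (reject) or "+" (accept) followed by a regular
expression, the first matching rule decides, and a URL that matches no
rule is rejected. The rule text and the helper names load_rules and
accepts are assumptions for illustration; check the comments in the
regex file shipped with your version for the exact syntax.

```python
import re

# Hypothetical rule text in the "+regex" / "-regex" style.
RULES = r"""
# skip PDF documents
-\.(pdf|PDF)$
# accept anything else
+.
"""


def load_rules(text):
    """Parse rule lines into (allow, compiled_pattern) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # ignore blank lines and comments
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError("rule must start with + or -: %r" % line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """Return the verdict of the first matching rule, else reject."""
    for allow, pattern in rules:
        if pattern.search(url):
            return allow
    return False


rules = load_rules(RULES)
print(accepts(rules, "http://example.com/manual.pdf"))   # False
print(accepts(rules, "http://example.com/index.html"))   # True
```

Note that a pattern anchored with "$" does not catch URLs such as
report.pdf?download=1; those would need an additional rule.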

If you use the prefixUrl filter, you should switch to a combination of
both filters. I don't know whether it is in CVS, but you will find the
one we use at http://nutch.eventax.com/ (PrefixB4URLFilter).
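
How PrefixB4URLFilter works internally is not described here, so the
following sketch only illustrates the general idea of combining the two
checks: a URL must first start with one of the allowed prefixes and must
then also pass a regex exclusion. The prefixes, the pattern and the
function name accepts are all hypothetical.

```python
import re

# Hypothetical allowed prefixes and exclusion pattern.
ALLOWED_PREFIXES = (
    "http://www.example.com/",
    "http://example.org/docs/",
)
EXCLUDE = re.compile(r"\.pdf$", re.IGNORECASE)


def accepts(url):
    """Accept a URL only if it has an allowed prefix and is not a PDF."""
    # Cheap prefix test first; str.startswith accepts a tuple of prefixes.
    if not url.startswith(ALLOWED_PREFIXES):
        return False
    # Then apply the regex exclusion to whatever passed the prefix test.
    return EXCLUDE.search(url) is None


print(accepts("http://www.example.com/page.html"))  # True
print(accepts("http://www.example.com/paper.pdf"))  # False
print(accepts("http://other.example/page.html"))    # False
```

Running the prefix test first keeps the more expensive regex matching
off URLs that would be rejected anyway.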

Bye

Matthias



--
http://gmbh.eventax.de - eventax GmbH
http://www.umkreisfinder.de - Die Suchmaschine für Lokales und Events
http://www.fahnen-drucken.de - Flaggen einfach selbst gemacht



