Hi,

This segment consistently fails to fetch.  It gets about a third of the
way through and then hangs.  Is there anything obvious in the log below
that explains why it is failing?

040927 233415 fetch of http://www.phila.gov/districtattorney/community/youthaid/ failed with: net.nutch.protocol.http.HttpException: java.net.UnknownHostException: www.phila.gov
040927 233415 http://www.avenida.net/links/spain/se/: falling back to windows-1252
040927 233415 fetch of http://www.avenida.net/links/spain/se failed with: org.w3c.dom.DOMException: INVALID_CHARACTER_ERR: An invalid or illegal XML character is specified.
040927 233416 fetch of http://www.cse.msu.edu/~vailayaa/India-Club.html failed with: net.nutch.protocol.http.HttpException: java.net.SocketTimeoutException: connect timed out
040927 233416 http://caspianworld.com/ru/go/1665913384/-1960478727/342827978/: setting encoding to UTF-8
040927 233419 http://es.groups.yahoo.com/group/genealhispana/auth?check=G&done=%2Fgroup%2Fgenealhispana%2Fmessage%2F256: setting encoding to windows-1252
040927 233420 http://re-fe-rat.narod.ru/54000.html: setting encoding to windows-1251
040927 233432 fetch of http://www.iuma.com/site-bin/mp3gen/8336/IUMA/Bands/Manual_Motiv/audio/Manual_Motiv_-_Intro.mp3 failed with: net.nutch.parse.ParseException: Content-Type not text/html: audio/mpeg
040927 233438 fetch of null failed with: java.io.EOFException

Also, whenever I try to fetch 4 or 5 million pages, I come back the next
day and the fetcher is either hung, or it sits idle for a while, fetches
a little, and then stalls again.  Is this a memory problem?  How do I get
the fetcher through a complete segment?

I am still using the .05 code at the moment.  How do I exclude PDFs so
there is no chance of one hanging the system?
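
To make clear what I mean by excluding PDFs, here is a rough sketch of
the kind of check I have in mind.  It is a standalone illustration only,
not Nutch's actual API: the class name UrlSuffixFilter, the method
shouldFetch, and the suffix list are all made up for this example.

```java
import java.util.Locale;
import java.util.regex.Pattern;

// Hypothetical standalone filter; not part of Nutch.
public class UrlSuffixFilter {
    // Assumed list of suffixes to skip; extend as needed.
    private static final Pattern SKIP = Pattern.compile("\\.(pdf|mp3)$");

    /** Returns true if the URL should be fetched, false if it should be skipped. */
    public static boolean shouldFetch(String url) {
        if (url == null) {
            return false; // nothing to fetch
        }
        // Strip the query string and fragment so only the path suffix is tested.
        String path = url;
        int q = path.indexOf('?');
        if (q >= 0) {
            path = path.substring(0, q);
        }
        int h = path.indexOf('#');
        if (h >= 0) {
            path = path.substring(0, h);
        }
        // Case-insensitive suffix match against the skip list.
        return !SKIP.matcher(path.toLowerCase(Locale.ROOT)).find();
    }

    public static void main(String[] args) {
        System.out.println(shouldFetch("http://example.com/report.PDF"));
        System.out.println(shouldFetch("http://re-fe-rat.narod.ru/54000.html"));
    }
}
```

A suffix check like this only catches URLs that end in the extension.  A
PDF served from a URL without ".pdf" in it would still get through, so I
assume something keyed on the Content-Type would also be needed.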

Thanks,

Jason



