Hi,

This is a segment that consistently fails to fetch. It gets about a third of the way through and then hangs. Is there anything obvious in the log below that explains why it is failing?

040927 233415 fetch of http://www.phila.gov/districtattorney/community/youthaid/ failed with: net.nutch.protocol.http.HttpException: java.net.UnknownHostException: www.phila.gov
040927 233415 http://www.avenida.net/links/spain/se/: falling back to windows-1252
040927 233415 fetch of http://www.avenida.net/links/spain/se failed with: org.w3c.dom.DOMException: INVALID_CHARACTER_ERR: An invalid or illegal XML character is specified.
040927 233416 fetch of http://www.cse.msu.edu/~vailayaa/India-Club.html failed with: net.nutch.protocol.http.HttpException: java.net.SocketTimeoutException: connect timed out
040927 233416 http://caspianworld.com/ru/go/1665913384/-1960478727/342827978/: setting encoding to UTF-8
040927 233419 http://es.groups.yahoo.com/group/genealhispana/auth?check=G&done=%2Fgroup%2Fgenealhispana%2Fmessage%2F256: setting encoding to windows-1252
040927 233420 http://re-fe-rat.narod.ru/54000.html: setting encoding to windows-1251
040927 233432 fetch of http://www.iuma.com/site-bin/mp3gen/8336/IUMA/Bands/Manual_Motiv/audio/Manual_Motiv_-_Intro.mp3 failed with: net.nutch.parse.ParseException: Content-Type not text/html: audio/mpeg
040927 233438 fetch of null failed with: java.io.EOFException

Also, when I try to fetch 4 to 5 million pages, I always come back the next day to find that the fetcher has either hung, or is sitting idle for a while, fetching a little, and then stalling again. Is this a memory problem? How do I get the fetcher through a complete segment? I am still using the .05 code at the moment.

How do I exclude PDFs so that there is no chance of a PDF hanging the system?

Thanks,
Jason

_______________________________________________
Nutch-general mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/nutch-general
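
To see which kinds of failure dominate a fetcher log like the excerpt above, one option is to tally the "failed with" lines by their innermost exception class. The following is a minimal Python sketch and is not part of Nutch: the function name summarize_failures and both regular expressions are assumptions based only on the line format visible in the excerpt, so a different log layout would need different patterns.

```python
import re
from collections import Counter

# Assumed line format, taken from the excerpt above:
#   <YYMMDD> <HHMMSS> fetch of <url> failed with: <exception chain>
FAILURE_RE = re.compile(r"^\d{6} \d{6} fetch of (\S+) failed with: (.+)$")

# A dotted or plain class name ending in "Exception" or "Error".
EXCEPTION_RE = re.compile(r"[A-Za-z_][\w.]*(?:Exception|Error)\b")


def summarize_failures(log_lines):
    """Count failed fetches, keyed by the last exception class on each line.

    The last class named in the chain is treated as the root cause,
    e.g. java.net.UnknownHostException wrapped in an HttpException.
    """
    counts = Counter()
    for line in log_lines:
        match = FAILURE_RE.match(line.strip())
        if match is None:
            continue  # encoding notices and other non-failure lines
        names = EXCEPTION_RE.findall(match.group(2))
        counts[names[-1] if names else "unknown"] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "040927 233416 fetch of http://www.cse.msu.edu/~vailayaa/India-Club.html "
        "failed with: net.nutch.protocol.http.HttpException: "
        "java.net.SocketTimeoutException: connect timed out",
        "040927 233420 http://re-fe-rat.narod.ru/54000.html: "
        "setting encoding to windows-1251",
        "040927 233438 fetch of null failed with: java.io.EOFException",
    ]
    for name, count in summarize_failures(sample).most_common():
        print(count, name)
```

Run over a whole fetcher log, a summary like this can show whether the run is dominated by DNS and connection timeouts or by parse errors, which may help narrow down where the stall comes from.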
