On May 7, 2007, at 9:07 AM, Dennis Kubes wrote:

> Brian Whitman wrote:
>> Hi all,
>> I looked into this a bit more after it crashed for the third time
>> in a row.
>> Every time it has segfaulted it's had this url as one of the past
>> few fetches:
>>
>> fetching http://www.c bs.nu/cgi-bin/ac/adcycle.cgi?gid=4&layout=multi&id=125
>>
>> Note the space in there. This URL is not in my initial fetchlist
>> so it was found somewhere. Not sure if the space is actually a
>> space or an encoding -> terminal issue, either way I think this
>> has something to do with it. Does anyone know what happens when
>> java/nutch gets a hostname that is obviously malformed?
>
> I believe it should throw a malformed url exception.
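
For anyone who wants to test the malformed-hostname question outside of a crawl, here is a minimal sketch. It does not use Nutch's own URL handling (which is not shown in this thread). It only relies on `java.net.URI`, whose single-argument constructor rejects strings containing illegal characters such as a literal space. The class name `UrlCheck`, the helper `isWellFormed`, and the `example.com` comparison URL are made up for illustration.

```java
import java.net.URI;
import java.net.URISyntaxException;

public class UrlCheck {

    /**
     * Returns true if the string parses as a URI and has a host component.
     * Hypothetical helper for illustration, not part of Nutch.
     */
    public static boolean isWellFormed(String url) {
        try {
            URI uri = new URI(url);       // throws on illegal characters
            return uri.getHost() != null; // no host means nothing to resolve
        } catch (URISyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // The URL from the fetch log, with the space left in the hostname.
        String suspect =
            "http://www.c bs.nu/cgi-bin/ac/adcycle.cgi?gid=4&layout=multi&id=125";
        System.out.println(suspect + " -> " + isWellFormed(suspect));

        // An ordinary URL for comparison (placeholder host).
        String plain = "http://example.com/index.html";
        System.out.println(plain + " -> " + isWellFormed(plain));
    }
}
```

A check like this only tells you whether the string is syntactically acceptable. It says nothing about what the fetcher or the JVM's resolver does with it.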

OK. I got the crash again today on different urls. It's strange
because I've been crawling quite regularly with the same nutch setup
for a while. It's possible that a recent system-level change is
getting in the way (I'm running debian with a recent full upgrade).

After googling the culprit for a while I found this trick:

-Djava.net.preferIPv4Stack=true

I'm running a large crawl with it now and will let you know if the
crash doesn't come back for a while!

-Brian

_______________________________________________
Nutch-developers mailing list
Nutch-developers@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nutch-developers
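
Since the `-Djava.net.preferIPv4Stack=true` workaround above only helps if the option actually reaches the JVM that runs the crawl, a small sanity check can be useful. The sketch below just reads the system property back. The class name `Ipv4StackCheck` and the method `preferIpv4Requested` are hypothetical, and how the flag gets passed to the crawl JVM (wrapper script, environment variable) depends on the setup and is not shown here.

```java
public class Ipv4StackCheck {

    // Name of the standard JVM networking property used in the workaround.
    static final String KEY = "java.net.preferIPv4Stack";

    /** Returns true if the JVM was started with -Djava.net.preferIPv4Stack=true. */
    public static boolean preferIpv4Requested() {
        // Boolean.getBoolean reads the named system property and
        // returns true only if its value equals "true" (ignoring case).
        return Boolean.getBoolean(KEY);
    }

    public static void main(String[] args) {
        System.out.println(KEY + " = " + System.getProperty(KEY));
        if (!preferIpv4Requested()) {
            System.err.println("Warning: " + KEY + " is not set to true for this JVM");
        }
    }
}
```

Run it with the same options as the crawl, for example `java -Djava.net.preferIPv4Stack=true Ipv4StackCheck`, to confirm the property is visible.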