On May 7, 2007, at 9:07 AM, Dennis Kubes wrote:
Brian Whitman wrote:
Hi all,
I looked into this a bit more after it crashed for the third time
in a row.
Every time it has segfaulted, it has had this URL as one of the past
few fetches:
fetching http://www.c bs.nu/cgi-bin/ac/adcycle.cgi?gid=4&layout=multi&id=125
Note the space in there. This URL is not in my initial fetchlist
so it was found somewhere. Not sure if the space is actually a
space or an encoding -> terminal issue, either way I think this
has something to do with it. Does anyone know what happens when
Java/Nutch gets a hostname that is obviously malformed?
I believe it should throw a MalformedURLException.
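Whether java.net.URL actually rejects this particular string is not something I have confirmed. As a rough sketch of one way to screen such URLs before fetching, the stricter java.net.URI parser can be used, since it does not accept a raw space in the authority. The class and method names here (UrlCheck, isWellFormed) are made up for illustration and are not part of Nutch:

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical helper, not a Nutch class.
class UrlCheck {
    // Returns true only if the string parses as a URI and yields a host.
    static boolean isWellFormed(String url) {
        try {
            URI uri = new URI(url);
            return uri.getHost() != null;
        } catch (URISyntaxException e) {
            // Illegal characters (such as a space in the hostname) end up here.
            return false;
        }
    }

    public static void main(String[] args) {
        String bad = "http://www.c bs.nu/cgi-bin/ac/adcycle.cgi?gid=4&layout=multi&id=125";
        String good = "http://www.example.com/index.html";
        System.out.println(bad + " -> " + isWellFormed(bad));
        System.out.println(good + " -> " + isWellFormed(good));
    }
}
```

The URL with the space should be reported as not well formed, and the second (a placeholder URL) as well formed.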
OK. I got the crash again today on different URLs. It's strange
because I've been crawling quite regularly with the same Nutch setup
for a while. It's possible that a recent system-level change is
getting in the way (I'm running Debian with a recent full upgrade).
After googling the culprit for a while I found this trick:
-Djava.net.preferIPv4Stack=true
I'm running a large crawl with it now and will let you know if the
crash doesn't come back for a while!
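To double-check that the flag actually reaches the JVM, something like the following could be run with the same options. This is only a sketch: Ipv4StackCheck is a made-up name, and it only reads the system property, it does not prove which network stack is in use.

```java
// Hypothetical check, not part of Nutch.
class Ipv4StackCheck {
    static final String PROP = "java.net.preferIPv4Stack";

    // True only when the property is set to "true" (case-insensitive).
    static boolean prefersIpv4() {
        return Boolean.getBoolean(PROP);
    }

    public static void main(String[] args) {
        System.out.println(PROP + " = " + System.getProperty(PROP));
        System.out.println("prefers IPv4 stack: " + prefersIpv4());
    }
}
```

Run it as "java -Djava.net.preferIPv4Stack=true Ipv4StackCheck" and compare with a run without the flag.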
-Brian