On May 7, 2007, at 9:07 AM, Dennis Kubes wrote:
> Brian Whitman wrote:
>> Hi all,
>> I looked into this a bit more after it crashed for the third time
>> in a row. Every time it has segfaulted, this URL has been among the
>> last few fetches:
>> fetching http://www.c bs.nu/cgi-bin/ac/adcycle.cgi?gid=4&layout=multi&id=125
>> Note the space in there. This URL is not in my initial fetchlist,
>> so it was found somewhere. I'm not sure whether the space is really a
>> space or an encoding -> terminal issue; either way, I think it has
>> something to do with the crash. Does anyone know what happens when
>> java/nutch gets a hostname that is obviously malformed?
>
> I believe it should throw a malformed URL exception.

OK. I got the crash again today on different URLs. It's strange
because I've been crawling quite regularly with the same Nutch setup
for a while. It's possible that a recent system-level change is
getting in the way (I'm running Debian with a recent full upgrade).

After googling the culprit for a while I found this trick:

-Djava.net.preferIPv4Stack=true
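
For anyone who wants to confirm that the flag actually reached the JVM, here is a minimal sketch. The class name Ipv4StackCheck and the helper prefersIpv4 are made up for illustration and are not part of Nutch; the only real name is the java.net.preferIPv4Stack system property, which is what the -D option sets.

```java
// Hypothetical helper (not part of Nutch): reports whether the JVM was
// started with -Djava.net.preferIPv4Stack=true.
public class Ipv4StackCheck {

    // Interprets the raw property value; null (flag absent) counts as false.
    public static boolean prefersIpv4(String propertyValue) {
        return Boolean.parseBoolean(propertyValue);
    }

    public static void main(String[] args) {
        // -Dname=value on the java command line becomes a system property.
        String value = System.getProperty("java.net.preferIPv4Stack");
        System.out.println("java.net.preferIPv4Stack = " + value);
        System.out.println(prefersIpv4(value)
                ? "IPv4-only sockets requested"
                : "flag not set (JVM default network stack)");
    }
}
```

Run it as "java -Djava.net.preferIPv4Stack=true Ipv4StackCheck". The property is read when the JVM's networking classes initialize, so it has to be given at startup. How it gets onto the crawler's command line depends on the launch script; I am assuming an extra-options environment variable such as NUTCH_OPTS, so check your own bin/nutch before relying on that.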

I'm running a large crawl with it now and will let you know if the
crash stays away for a while!

-Brian



_______________________________________________
Nutch-developers mailing list
Nutch-developers@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nutch-developers
