Hi,

Can anyone help me with the following problem? My crawl.log contains lots of messages like those below, yet the same URLs load fine when I test them in my browser. Is there a regular expression I need to update somewhere? For example, one of the URLs below has a space in it, so I was thinking I might need to change or add a line in crawl-urlfilter.txt.


fetch of http://planetbp.bp.com/general/aptrix/bani.nsf/Content/XXXXPS%5FMB%5F090605%5CXXXXps%5FManagement+Briefing%5F090605 failed with: org.apache.nutch.protocol.http.HttpError: HTTP Error: 400

fetch of http://planetbp.bp.com/general/aptrix/aptrix.nsf/Content/BP websites failed with: org.apache.nutch.protocol.http.HttpError: HTTP Error: 400


fetch of http://planetbp.bp.com/general/aptrix/aptcsops.nsf/Content/GoHi+Services+Home%5CSocial failed with: org.apache.nutch.protocol.http.HttpError: HTTP Error: 400


fetch of http://planetbp.bp.com/general/aptrix/aptppl.nsf/Content/Training+Home%5CBusiness+Tools%5CPatrol+Medical failed with: org.apache.nutch.protocol.http.HttpError: HTTP Error: 500
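
To illustrate what I mean about the space: a browser normally percent-encodes a literal space as %20 before sending the request, which might be why the "BP websites" URL works there but not in the fetcher. Below is a minimal Python sketch of that encoding step, just to show the difference. The helper name encode_url_path is made up for this example and is not part of Nutch, and it only covers the space case; it does not explain the other failures, whose URLs are already encoded.

```python
from urllib.parse import urlsplit, urlunsplit, quote


def encode_url_path(url):
    """Percent-encode unsafe characters (such as spaces) in the URL path only.

    Hypothetical helper for illustration; not part of Nutch.
    """
    parts = urlsplit(url)
    # Treat '/', '%' and '+' as safe so that sequences which are already
    # encoded (e.g. %5C) are left alone rather than double-encoded.
    safe_path = quote(parts.path, safe="/%+")
    return urlunsplit(
        (parts.scheme, parts.netloc, safe_path, parts.query, parts.fragment)
    )


raw = "http://planetbp.bp.com/general/aptrix/aptrix.nsf/Content/BP websites"
print(encode_url_path(raw))
```

If the encoded form fetches cleanly while the raw form gives a 400, that would suggest the problem is URL normalization rather than a filter rule in crawl-urlfilter.txt.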




_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
