Hi,
Can anyone help me with the following problem? My crawl.log contains lots of
messages like those below, yet the same URLs load fine when I test them in my
browser. Is there a regular expression I need to update somewhere? For
example, one of the URLs below has a space in it, so I was thinking I might
need to change or add a line in crawl-urlfilter.txt.
fetch of http://planetbp.bp.com/general/aptrix/bani.nsf/Content/XXXXPS%5FMB%5F090605%5CXXXXps%5FManagement+Briefing%5F090605 failed with: org.apache.nutch.protocol.http.HttpError: HTTP Error: 400
fetch of http://planetbp.bp.com/general/aptrix/aptrix.nsf/Content/BP websites failed with: org.apache.nutch.protocol.http.HttpError: HTTP Error: 400
fetch of http://planetbp.bp.com/general/aptrix/aptcsops.nsf/Content/GoHi+Services+Home%5CSocial failed with: org.apache.nutch.protocol.http.HttpError: HTTP Error: 400
fetch of http://planetbp.bp.com/general/aptrix/aptppl.nsf/Content/Training+Home%5CBusiness+Tools%5CPatrol+Medical failed with: org.apache.nutch.protocol.http.HttpError: HTTP Error: 500
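To illustrate what I mean about the space: a browser would normally escape it as %20 before sending the request, which might explain why the same URL works there. Below is a rough sketch in Python (not Nutch code) of that kind of escaping. The function name escape_unsafe is made up for illustration, and whether the unescaped space is really what triggers the 400 is only my guess.

```python
from urllib.parse import urlsplit, urlunsplit, quote

# Characters left as-is: path separators, existing %XX escapes, '+',
# and the usual URL punctuation. Everything else unsafe (e.g. a space)
# gets percent-encoded.
SAFE_CHARS = "/%+:@!$&'()*,;=~"

def escape_unsafe(url: str) -> str:
    """Hypothetical helper: percent-encode unsafe characters in the path
    and query of a URL without double-encoding existing escapes."""
    parts = urlsplit(url)
    path = quote(parts.path, safe=SAFE_CHARS)
    query = quote(parts.query, safe=SAFE_CHARS + "?")
    return urlunsplit((parts.scheme, parts.netloc, path, query, parts.fragment))

if __name__ == "__main__":
    # URL taken from the crawl.log entry above that contains a raw space.
    raw = "http://planetbp.bp.com/general/aptrix/aptrix.nsf/Content/BP websites"
    print(escape_unsafe(raw))
```

Note that this only touches the space; URLs that are already escaped (such as the ones containing %5C) pass through unchanged, so it would not explain those failures.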
_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general