Hi,
During my Intranet crawl, Nutch reports an error 400 for the following URL:
050614 075430 fetch of http://planetbp.bp.com/general/aptrix/bani.nsf/Content/XXXXPS%5FMB%5F090605%5CXXXXps%5FManagement+Briefing%5F090605 failed with: org.apache.nutch.protocol.http.HttpError: HTTP Error: 400
If I open the page in my browser it works fine. However, as the headers
below show, the browser's first GET also returns a 400; the URL is then
rewritten to append ?OpenDocument, and the second GET request succeeds.
Is there something I can do to get round this?
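One workaround I can imagine is rewriting such URLs before they are fetched, so that the crawler asks for the ?OpenDocument form directly. Below is a minimal, untested sketch of that idea in plain Java. The class DominoUrlFixer and the method addOpenDocument are hypothetical names of my own, not part of Nutch, and the rule "path contains .nsf/ and has no query string" is only my assumption about which URLs need the suffix.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;

// Hypothetical helper, not part of Nutch: appends "?OpenDocument" to
// Domino (.nsf) URLs that carry no query string and no fragment.
final class DominoUrlFixer {
    private DominoUrlFixer() {}

    static String addOpenDocument(String url) {
        URI uri;
        try {
            uri = new URI(url);
        } catch (URISyntaxException e) {
            return url; // leave anything unparseable untouched
        }
        String path = uri.getRawPath();
        // Assumption: a document inside a Domino database has ".nsf/" in its path.
        boolean looksLikeDomino =
                path != null && path.toLowerCase(Locale.ROOT).contains(".nsf/");
        if (!looksLikeDomino
                || uri.getRawQuery() != null
                || uri.getRawFragment() != null) {
            return url; // not a Domino document URL, or already has a query/fragment
        }
        return url + "?OpenDocument";
    }

    public static void main(String[] args) {
        // Prints the failing URL from the log with "?OpenDocument" appended.
        System.out.println(addOpenDocument(
                "http://planetbp.bp.com/general/aptrix/bani.nsf/Content/"
                + "XXXXPS%5FMB%5F090605%5CXXXXps%5FManagement+Briefing%5F090605"));
    }
}
```

The original string is returned unchanged unless the rule matches, so URLs such as /favicon.ico or ones that already end in ?OpenDocument pass through as they are. Where something like this would hook into the crawl is exactly what I am unsure about.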
Thanks for any help.
JS.
Here are the headers:
http://planetbp.bp.com/general/aptrix/bani.nsf/Content/XXXXPS%5FMB%5F090605%5CXXXXps%5FManagement+Briefing%5F090605

GET /general/aptrix/bani.nsf/Content/XXXXPS%5FMB%5F090605%5CXXXXps%5FManagement+Briefing%5F090605 HTTP/1.1
Host: planetbp.bp.com
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.7.5) Gecko/20041110 Firefox/1.0
Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5
Accept-Language: en-gb,en;q=0.5
Accept-Encoding: gzip,deflate
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Keep-Alive: 300
Connection: keep-alive

HTTP/1.x 400 Bad Request
Server: Lotus-Domino
Date: Tue, 14 Jun 2005 07:24:38 GMT
Connection: close
Expires: Tue, 01 Jan 1980 06:00:00 GMT
Content-Type: text/html; charset=US-ASCII
Content-Length: 526
Cache-Control: no-cache
----------------------------------------------------------
http://planetbp.bp.com/general/aptrix/bani.nsf/Content/XXXXPS%5FMB%5F090605%5CXXXXps%5FManagement+Briefing%5F090605?OpenDocument

GET /general/aptrix/bani.nsf/Content/XXXXPS%5FMB%5F090605%5CXXXXps%5FManagement+Briefing%5F090605?OpenDocument HTTP/1.1
Host: planetbp.bp.com
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.7.5) Gecko/20041110 Firefox/1.0
Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5
Accept-Language: en-gb,en;q=0.5
Accept-Encoding: gzip,deflate
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Keep-Alive: 300
Connection: keep-alive
If-Modified-Since: Tue, 14 Jun 2005 07:24:05 GMT

HTTP/1.x 200 OK
Server: Lotus-Domino
Date: Tue, 14 Jun 2005 07:24:39 GMT
Last-Modified: Tue, 14 Jun 2005 07:24:37 GMT
Expires: Tue, 01 Jan 1980 06:00:00 GMT
Content-Type: text/html; charset=US-ASCII
Content-Length: 16061
Cache-Control: no-cache
----------------------------------------------------------
http://planetbp.bp.com/favicon.ico

GET /favicon.ico HTTP/1.1
Host: planetbp.bp.com
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.7.5) Gecko/20041110 Firefox/1.0
Accept: image/png,*/*;q=0.5
Accept-Language: en-gb,en;q=0.5
Accept-Encoding: gzip,deflate
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Keep-Alive: 300
Connection: keep-alive

HTTP/1.x 404 Not Found
Server: Lotus-Domino
Date: Tue, 14 Jun 2005 07:24:39 GMT
Connection: close
Pragma: no-cache
Cache-Control: no-cache
Expires: Tue, 14 Jun 2005 07:24:39 GMT
Content-Type: text/html
Content-Length: 159
----------------------------------------------------------
http://planetbp.bp.com/favicon.ico

GET /favicon.ico HTTP/1.1
Host: planetbp.bp.com
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.7.5) Gecko/20041110 Firefox/1.0
Accept: image/png,*/*;q=0.5
Accept-Language: en-gb,en;q=0.5
Accept-Encoding: gzip,deflate
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Keep-Alive: 300
Connection: keep-alive

HTTP/1.x 404 Not Found
Server: Lotus-Domino
Date: Tue, 14 Jun 2005 07:24:39 GMT
Connection: close
Pragma: no-cache
Cache-Control: no-cache
Expires: Tue, 14 Jun 2005 07:24:39 GMT
Content-Type: text/html
Content-Length: 159
----------------------------------------------------------
_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general