Hi,

During my Intranet crawl, Nutch reports an error 400 for the following URL:

050614 075430 fetch of http://planetbp.bp.com/general/aptrix/bani.nsf/Content/XXXXPS%5FMB%5F090605%5CXXXXps%5FManagement+Briefing%5F090605 failed with: org.apache.nutch.protocol.http.HttpError: HTTP Error: 400

The page loads fine in my browser. However, as the headers below show, the first GET returns a 400 there too; the URL is then rewritten with ?OpenDocument appended, and the second GET succeeds.
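
To make the rewrite concrete, here is a minimal sketch in Python (not Nutch's Java, and not Nutch code). The function name is made up for illustration, and it assumes that Domino document URLs can be recognised by ".nsf/" in the path and that appending ?OpenDocument is always the right rewrite, which I have only observed for the URL above.

```python
from urllib.parse import urlsplit, urlunsplit


def append_open_document(url: str) -> str:
    """Append ?OpenDocument to a Domino-style URL that has no query string.

    Hypothetical helper: the ".nsf/" test is an assumption about how to
    spot Domino document URLs, not something taken from Nutch or Domino.
    """
    parts = urlsplit(url)
    # Only touch URLs that point inside an .nsf database and carry no query.
    if ".nsf/" in parts.path.lower() and not parts.query:
        parts = parts._replace(query="OpenDocument")
    # The path is passed through untouched, so %5F, %5C and + survive as-is.
    return urlunsplit(parts)


if __name__ == "__main__":
    failing = (
        "http://planetbp.bp.com/general/aptrix/bani.nsf/Content/"
        "XXXXPS%5FMB%5F090605%5CXXXXps%5FManagement+Briefing%5F090605"
    )
    print(append_open_document(failing))
```

URLs that already have a query string, or that do not point into an .nsf database (such as /favicon.ico), are returned unchanged.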

Is there something I can do to get around this?

Thanks for any help.

JS.

Here are the headers:

http://planetbp.bp.com/general/aptrix/bani.nsf/Content/XXXXPS%5FMB%5F090605%5CXXXXps%5FManagement+Briefing%5F090605

GET /general/aptrix/bani.nsf/Content/XXXXPS%5FMB%5F090605%5CXXXXps%5FManagement+Briefing%5F090605 HTTP/1.1
Host: planetbp.bp.com
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.7.5) Gecko/20041110 Firefox/1.0
Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5
Accept-Language: en-gb,en;q=0.5
Accept-Encoding: gzip,deflate
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Keep-Alive: 300
Connection: keep-alive

HTTP/1.x 400 Bad Request
Server: Lotus-Domino
Date: Tue, 14 Jun 2005 07:24:38 GMT
Connection: close
Expires: Tue, 01 Jan 1980 06:00:00 GMT
Content-Type: text/html; charset=US-ASCII
Content-Length: 526
Cache-Control: no-cache
----------------------------------------------------------
http://planetbp.bp.com/general/aptrix/bani.nsf/Content/XXXXPS%5FMB%5F090605%5CXXXXps%5FManagement+Briefing%5F090605?OpenDocument

GET /general/aptrix/bani.nsf/Content/XXXXPS%5FMB%5F090605%5CXXXXps%5FManagement+Briefing%5F090605?OpenDocument HTTP/1.1
Host: planetbp.bp.com
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.7.5) Gecko/20041110 Firefox/1.0
Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5
Accept-Language: en-gb,en;q=0.5
Accept-Encoding: gzip,deflate
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Keep-Alive: 300
Connection: keep-alive
If-Modified-Since: Tue, 14 Jun 2005 07:24:05 GMT

HTTP/1.x 200 OK
Server: Lotus-Domino
Date: Tue, 14 Jun 2005 07:24:39 GMT
Last-Modified: Tue, 14 Jun 2005 07:24:37 GMT
Expires: Tue, 01 Jan 1980 06:00:00 GMT
Content-Type: text/html; charset=US-ASCII
Content-Length: 16061
Cache-Control: no-cache
----------------------------------------------------------
http://planetbp.bp.com/favicon.ico

GET /favicon.ico HTTP/1.1
Host: planetbp.bp.com
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.7.5) Gecko/20041110 Firefox/1.0
Accept: image/png,*/*;q=0.5
Accept-Language: en-gb,en;q=0.5
Accept-Encoding: gzip,deflate
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Keep-Alive: 300
Connection: keep-alive

HTTP/1.x 404 Not Found
Server: Lotus-Domino
Date: Tue, 14 Jun 2005 07:24:39 GMT
Connection: close
Pragma: no-cache
Cache-Control: no-cache
Expires: Tue, 14 Jun 2005 07:24:39 GMT
Content-Type: text/html
Content-Length: 159
----------------------------------------------------------
http://planetbp.bp.com/favicon.ico

GET /favicon.ico HTTP/1.1
Host: planetbp.bp.com
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.7.5) Gecko/20041110 Firefox/1.0
Accept: image/png,*/*;q=0.5
Accept-Language: en-gb,en;q=0.5
Accept-Encoding: gzip,deflate
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Keep-Alive: 300
Connection: keep-alive

HTTP/1.x 404 Not Found
Server: Lotus-Domino
Date: Tue, 14 Jun 2005 07:24:39 GMT
Connection: close
Pragma: no-cache
Cache-Control: no-cache
Expires: Tue, 14 Jun 2005 07:24:39 GMT
Content-Type: text/html
Content-Length: 159
----------------------------------------------------------




_______________________________________________
Nutch-general mailing list
https://lists.sourceforge.net/lists/listinfo/nutch-general
