Re: [Nutch-general] http.content.limit not respected when the Content-Type header has charset attributes

Doğacan Güney Thu, 21 Jun 2007 04:15:01 -0700

On 6/21/07, Vishal Shah <[EMAIL PROTECTED]> wrote:
>  Hi,
>
>   Many of the urls we crawl have headers that look like this:
>
> Connection: close
> Date: Thu, 21 Jun 2007 09:28:42 GMT
> Accept-Ranges: bytes
> ETag: "2c0c3-650-cc1eb800"
> Server: Apache/2.0.40 (Red Hat Linux)
> Content-Length: 1616
> Content-Type: text/html; charset=ISO-8859-1
> Last-Modified: Mon, 09 Apr 2007 13:13:04 GMT
> Client-Date: Thu, 21 Jun 2007 07:42:10 GMT
> Client-Peer: 202.141.129.22:80
> Client-Response-Num: 1
>
> In this case, the ctype variable is set to "text/html; charset=ISO-8859-1"
> in HttpResponse.java (for both protocol-http and protocol-httpclient), so
> the mimeType cannot be looked up correctly. I am talking about this piece
> of code here:
>
>      /*
>       * Extract the content type from the response and then look for its
>       * mimetype preferences specified in mime-type.xml
>       */
>      String ctype = headers.get(Response.CONTENT_TYPE);
>      int downloadSize = 0;
>      if (ctype != null && (mimeType = http.getMimeTypes().forName(ctype)) != null) {
>
> Here, ctype should actually be set to just "text/html". Since it is
> currently set to "text/html; charset=ISO-8859-1", the mimeType variable
> comes out null. Thus neither the content limit specified in mimetypes.xml
> nor the http.content.limit setting is respected for these documents.
>
> One solution to the problem is to check the ctype, split it on ";" and
> take the first part to look up the mimeType. Anyone got any other ideas?
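
For reference, the split-on-";" idea quoted above could look roughly like
the sketch below. The class and method names (ContentTypeUtil, baseType)
are hypothetical and are not part of Nutch; the result is only meant to be
what would be handed to a lookup such as forName(...).

```java
import java.util.Locale;

// Hypothetical helper, not Nutch code: strips parameters such as
// "; charset=..." from a Content-Type header value.
public class ContentTypeUtil {

    /** Returns the bare media type, or null if the header is missing or empty. */
    public static String baseType(String contentType) {
        if (contentType == null) {
            return null;
        }
        // Everything before the first ';' is the media type itself.
        int semi = contentType.indexOf(';');
        String base = (semi >= 0) ? contentType.substring(0, semi) : contentType;
        // Media type names are case-insensitive, so normalize for lookups.
        base = base.trim().toLowerCase(Locale.ROOT);
        return base.isEmpty() ? null : base;
    }

    public static void main(String[] args) {
        System.out.println(baseType("text/html; charset=ISO-8859-1"));
    }
}
```

A header without parameters passes through unchanged apart from trimming
and lower-casing, so the same call can be used for every response.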


Are you sure about this? I haven't examined the code there carefully,
but I tested a crawl with a sample url: http://www.metu.edu.tr/

Page returns these headers:

Date: Thu, 21 Jun 2007 10:59:35 GMT
Server: Apache
X-Powered-By: PHP/5.1.4
Connection: close
Content-Type: text/html; charset=ISO-8859-9

and this is the output of readseg -get:

Content::
Version: 2
url: http://www.metu.edu.tr/
base: http://www.metu.edu.tr/
contentType: text/html
metadata: X-Powered-By=PHP/5.1.4 Connection=close
nutch.segment.name=20070621125200 nutch.crawl.score=1.0 Date=Thu, 21
Jun 2007 10:52:13 GMT Server=Apache Content-Type=text/html;
charset=ISO-8859-9
Content:
...

Content-type seems to be picked up correctly.

btw, there is already a StringUtil.parseCharacterEncoding that is
designed to parse the encoding part of the Content-Type header.
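
To illustrate what "the encoding part" means, here is a standalone sketch
that pulls the charset parameter out of a Content-Type value. It is not the
implementation of StringUtil.parseCharacterEncoding; the class and method
names (CharsetParam, charsetOf) are made up for this example.

```java
// Hypothetical helper, not Nutch code: extracts the charset parameter
// from a Content-Type header value.
public class CharsetParam {

    /** Returns the charset value, or null if no charset parameter is present. */
    public static String charsetOf(String contentType) {
        if (contentType == null) {
            return null;
        }
        String[] parts = contentType.split(";");
        // parts[0] is the media type; parameters start at index 1.
        for (int i = 1; i < parts.length; i++) {
            String param = parts[i].trim();
            int eq = param.indexOf('=');
            if (eq > 0 && param.substring(0, eq).trim().equalsIgnoreCase("charset")) {
                String value = param.substring(eq + 1).trim();
                // Drop surrounding double quotes if the value is quoted.
                if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                    value = value.substring(1, value.length() - 1);
                }
                return value.isEmpty() ? null : value;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(charsetOf("text/html; charset=ISO-8859-9"));
    }
}
```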

(Also, I couldn't find the code you were mentioning. Where is it, exactly?)

>
> -vishal.
>


-- 
Doğacan Güney