Hi Dogacan,

We are pretty sure. We were having problems with 3 urls. We put some debug statements in HttpResponse.java. This is what we got:

URL = http://perso0.free.fr/cgi-bin/guestbook.pl?login=kobudo.okinawa
cType = 'text/plain'
mimeType = 'text/plain'
allowed download limit for this mimetype is 0
download file is smalled then the max therefore setting actual filesize as download limit
Download size 282

URL = http://www.prospect-magazine.co.uk/list.php?related_article=9635
cType = 'text/html; charset=ISO-8859-1'
setting filesize as Integer.Max
Download size 2147483647
cType = 'text/plain; charset=ISO-8859-1'
setting filesize as Integer.Max
Download size 2147483647

URL = http://www.muschihaus.de/vol4/templates/guestbook.php?name=Guestbook&image=guestbook
cType = 'text/html; charset=UTF-8'
setting filesize as Integer.Max
Download size 2147483647
cType = 'text/html; charset=ISO-8859-1'
setting filesize as Integer.Max
Download size 2147483647

From this, I inferred that the cType is not set correctly to "text/html" here. Also, the content limit is set to Integer.Max, and the http.content.limit (64K) is ignored for 2 of the urls.

Regards,

-vishal.

-----Original Message-----
From: Dogacan Güney [mailto:[EMAIL PROTECTED]
Sent: Thursday, June 21, 2007 4:44 PM
To: [EMAIL PROTECTED]; [EMAIL PROTECTED]
Subject: Re: http.content.limit not respected when the Content-Type header has charset attributes

On 6/21/07, Vishal Shah <[EMAIL PROTECTED]> wrote:
> Hi,
>
> Many of the urls we crawl have headers that look like this:
>
> Connection: close
> Date: Thu, 21 Jun 2007 09:28:42 GMT
> Accept-Ranges: bytes
> ETag: "2c0c3-650-cc1eb800"
> Server: Apache/2.0.40 (Red Hat Linux)
> Content-Length: 1616
> Content-Type: text/html; charset=ISO-8859-1
> Last-Modified: Mon, 09 Apr 2007 13:13:04 GMT
> Client-Date: Thu, 21 Jun 2007 07:42:10 GMT
> Client-Peer: 202.141.129.22:80
> Client-Response-Num: 1
>
> In this case, the cType variable is set to "text/html; charset=ISO-8859-1"
> in HttpResponse.java (for both protocol-http and protocol-httpclient). In
> this case, the mimeType cannot be found correctly in HttpResponse.java. I am
> talking about this piece of code here:
>
> /*
>  * Extract the content type from the response and then look for its
>  * mimetype preferences specified in mime-type.xml
>  */
> String ctype = headers.get(Response.CONTENT_TYPE);
> int downloadSize = 0;
> if (ctype != null && (mimeType = http.getMimeTypes().forName(ctype))
> != null) {
>
> In this case, the ctype should actually be set to just "text/html".
> Currently, since it's set to "text/html; charset=ISO-8859-1", mimeType
> variable is coming out to be null. Thus neither the content limit specified
> in mimetypes.xml nor the http.content.limit setting is respected for these
> documents.
>
> One solution to the problem is to actually check the cType, split on ";" and
> take the first part to lookup the mimeType. Anyone got any other ideas?

Are you sure about this? I haven't examined the code there carefully; however, I tested a crawl with a sample url: http://www.metu.edu.tr/

Page returns these headers:

Date: Thu, 21 Jun 2007 10:59:35 GMT
Server: Apache
X-Powered-By: PHP/5.1.4
Connection: close
Content-Type: text/html; charset=ISO-8859-9

and this is the output of readseg -get:

Content::
Version: 2
url: http://www.metu.edu.tr/
base: http://www.metu.edu.tr/
contentType: text/html
metadata: X-Powered-By=PHP/5.1.4 Connection=close nutch.segment.name=20070621125200 nutch.crawl.score=1.0 Date=Thu, 21 Jun 2007 10:52:13 GMT Server=Apache Content-Type=text/html; charset=ISO-8859-9
Content:
...

Content-type seems to be picked up correctly.

btw, there is already a StringUtil.parseCharacterEncoding that is designed to parse the encoding part of the Content-Type header.

(Also, I couldn't find the code you were mentioning. Where is it, exactly?)

>
> -vishal.
>

--
Dogacan Güney

_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
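
For reference, the workaround proposed in the quoted message (split the Content-Type value on ";" and use the first part for the mimeType lookup) might look roughly like the sketch below. The class and method names (ContentTypeSketch, baseType) are made up for illustration and are not part of Nutch. The trimming and lower-casing are extra assumptions on my part, and the result would still have to be passed to whatever lookup HttpResponse.java really performs.

```java
import java.util.Locale;

// Hypothetical helper, NOT part of Nutch: reduces a raw Content-Type header
// value to its bare media type so it can serve as a lookup key.
final class ContentTypeSketch {

    // Returns the part before the first ';', trimmed and lower-cased,
    // or null if the header is missing or has no media type.
    static String baseType(String contentType) {
        if (contentType == null) {
            return null;
        }
        int semicolon = contentType.indexOf(';');
        String base = (semicolon >= 0)
                ? contentType.substring(0, semicolon)
                : contentType;
        // Assumption: media types are compared case-insensitively,
        // so normalise to lower case before the lookup.
        base = base.trim().toLowerCase(Locale.ROOT);
        return base.isEmpty() ? null : base;
    }

    public static void main(String[] args) {
        // Header values taken from the debug output in this thread.
        System.out.println(baseType("text/html; charset=ISO-8859-1")); // text/html
        System.out.println(baseType("text/plain"));                    // text/plain
    }
}
```

With something like this, the lookup key for "text/html; charset=ISO-8859-1" would be plain "text/html", while the original header value stays available for charset detection.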
