Re: Charset trouble, questionmarks

Ken Krugler Wed, 02 Sep 2009 09:48:58 -0700

Hi Magnus,

I used curl to grab the file, and the bytes at 0x1845...0x1847 are 0xC3 0xA5, which is valid UTF-8 for the U+00E5 code point (Latin small letter a with ring above).
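For anyone who wants to double-check that byte pair without the network involved, here is a minimal sketch using only the JDK. The class and method names (Utf8Check, decode) are made up for illustration and are not part of HttpClient.

```java
import java.nio.charset.StandardCharsets;

public class Utf8Check {
    // Decode raw bytes as UTF-8 (hypothetical helper, JDK only).
    static String decode(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // The two bytes seen in the curl dump.
        byte[] raw = { (byte) 0xC3, (byte) 0xA5 };
        String s = decode(raw);
        // Prints: length=1 codePoint=U+00E5
        System.out.printf("length=%d codePoint=U+%04X%n", s.length(), s.codePointAt(0));
    }
}
```

If the decoded string has length 1 and code point U+00E5, the bytes on the wire are fine and the problem lies somewhere after decoding.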

I also used Bixo (http://bixo.101tec.com) to crawl the same page, and wound up with the same raw data. Bixo uses HttpClient 4.0, so it's a good test.

Given what you've tried (in your initial email), I've only got one weak guess - that your tools are showing you stuff that isn't actually there.

If you use HttpClient to dump the string and the byte array to files, and look at the files with a real, honest-to-gosh hex editor, do you still see 0x3F at offset 0x1845, or 0xC3 0xA5?
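As one illustration of how a 0x3F can show up even when the string itself is correct, here is a small JDK-only sketch. It is an assumption about the kind of thing that could be happening, not a diagnosis of your setup, and the class and method names (QuestionMarkDemo, hex) are hypothetical. Encoding a correctly decoded string into a charset that cannot represent the character substitutes a question mark.

```java
import java.nio.charset.StandardCharsets;

public class QuestionMarkDemo {
    // Format bytes as space-separated hex, e.g. "0xC3 0xA5".
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("0x%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = "\u00E5"; // a correctly decoded "a with ring above"
        // Written out as UTF-8, the original bytes come back.
        System.out.println(hex(s.getBytes(StandardCharsets.UTF_8)));    // 0xC3 0xA5
        // Written out as US-ASCII, the character is unmappable and becomes '?'.
        System.out.println(hex(s.getBytes(StandardCharsets.US_ASCII))); // 0x3F
    }
}
```

So a console, log file, or file writer using a charset that lacks the character would show a question mark even though the in-memory string is intact, which is why dumping the raw byte array and checking it in a hex editor is the more reliable test.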


-- Ken


On Sep 2, 2009, at 7:58am, Magnus Olstad Hansen wrote:

Thanks for the reply, Ken.


The basic problem is that determining the character set of a web page
is complex, and not something that HttpClient is designed to handle.

If you check out (for example) the Nutch source, you'll see that it
has a multi-step process, where it uses the Content-type in the
response header, the meta http-equiv tag in the HTML, and low-level
charset sniffing code to try to guess at the right encoding, but even
then sometimes it gets it wrong.

Yep - I know it's a complex matter - headers are not always specified,
metas the same... :) However, if the page is actually in one specific
charset, and one can force this charset to be used by HttpClient, it
should all work, in my opinion.


In your case, the page looks pretty clean - all UTF-8.

But when you call httpclient.execute(httpget, responseHandler), the
BasicResponseHandler will call EntityUtils.toString, and that in turn
uses ISO-8859-1 as its default charset when converting the bytes it
receives into the string.

Well, I've read the source on that and it's not entirely what I saw.
One of three possibilities exists for the charset variable used within
EntityUtils.toString(), in order of preference:
1) Result of EntityUtils.getContentCharset()
2) Default charset given as argument to EntityUtils.toString().
3) HttpClient-global HTTP.DEFAULT_CHARSET (probably not the correct name
of the constant, but you get the point).

I tested EntityUtils.getContentCharset() separately, and it returns UTF-8
(as found in the Content-Type of the given page).
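The fallback order described above can be sketched with the JDK alone. This is not the HttpClient source: the names CharsetFallback, resolve, decode and GLOBAL_DEFAULT are hypothetical, and the ISO-8859-1 global default is assumed from the earlier remark in this thread.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharsetFallback {
    // Assumed library-wide fallback (ISO-8859-1, per the discussion above).
    static final Charset GLOBAL_DEFAULT = StandardCharsets.ISO_8859_1;

    // Pick a charset in the preferred order:
    // 1) charset from the Content-Type, 2) caller-supplied default, 3) global default.
    static Charset resolve(String contentCharset, String defaultCharset) {
        if (contentCharset != null) return Charset.forName(contentCharset);
        if (defaultCharset != null) return Charset.forName(defaultCharset);
        return GLOBAL_DEFAULT;
    }

    // Decode a response body using the resolved charset.
    static String decode(byte[] body, String contentCharset, String defaultCharset) {
        return new String(body, resolve(contentCharset, defaultCharset));
    }

    public static void main(String[] args) {
        byte[] body = { (byte) 0xC3, (byte) 0xA5 };
        // Content-Type says UTF-8: one character, U+00E5.
        System.out.println(decode(body, "UTF-8", null).length()); // 1
        // Nothing specified: ISO-8859-1 yields two characters (U+00C3, U+00A5).
        System.out.println(decode(body, null, null).length());    // 2
    }
}
```

Note that under this sketch neither path produces a question mark: the wrong charset gives two garbage characters, not 0x3F.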

So go one level deeper, and get the HttpEntity from the response, then
try EntityUtils.toString(entity, "UTF-8") and see what you get back.

So the conclusion must be that I have already tried this, without luck...?

Regards,
Magnus


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]


--------------------------
Ken Krugler
TransPac Software, Inc.
<http://www.transpac.com>
+1 530-210-6378

