Hi Magnus,

I used curl to grab the file, and the bytes at 0x1845...0x1847 are 0xC3 0xA5, which is valid UTF-8 for the U+00E5 code point (Latin small letter a with ring above).
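
For what it's worth, here is a small JDK-only sketch for double-checking that byte pair. The class and method names are made up for illustration. A strict decoder rejects malformed input instead of quietly substituting a replacement character, so a successful decode suggests the bytes really are well-formed UTF-8.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of HttpClient.
class Utf8Check {
    // Decodes strictly: malformed or unmappable input raises an exception
    // instead of being silently replaced.
    static String decodeStrictUtf8(byte[] bytes) {
        try {
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes))
                    .toString();
        } catch (CharacterCodingException e) {
            throw new IllegalArgumentException("not well-formed UTF-8", e);
        }
    }

    public static void main(String[] args) {
        byte[] pair = { (byte) 0xC3, (byte) 0xA5 };
        String decoded = decodeStrictUtf8(pair);
        // Prints the code point of the single decoded character.
        System.out.printf("U+%04X%n", decoded.codePointAt(0)); // prints U+00E5
    }
}
```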

I also used Bixo (http://bixo.101tec.com) to crawl the same page, and wound up with the same raw data. Bixo uses HttpClient 4.0, so it's a good test.

Given what you've tried (in your initial email), I've only got one weak guess - that your tools are showing you stuff that isn't actually there.

If you use HttpClient to dump the string and the byte array to files, and look at the files with a real, honest-to-gosh hex editor, do you still see 0x3F at offset 0x1845, or 0xC3 0xA5?
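
A rough sketch of that check, using only the JDK. The sample bytes stand in for the real response body, and the class name, the hexWindow helper, and the temp file names are all hypothetical. The idea is to write the raw bytes and the decoded string to separate files, then print the same window of each as hex.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper class for illustration only.
class DumpCheck {
    // Formats up to `length` bytes starting at `offset` as upper-case hex pairs.
    static String hexWindow(byte[] data, int offset, int length) {
        StringBuilder sb = new StringBuilder();
        int end = Math.min(data.length, offset + length);
        for (int i = Math.max(0, offset); i < end; i++) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", data[i] & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the response body; in the real test these bytes
        // would come from the HTTP response entity.
        byte[] raw = "bl\u00E5".getBytes(StandardCharsets.UTF_8);
        String text = new String(raw, StandardCharsets.UTF_8);

        Path rawFile = Files.createTempFile("body-raw", ".bin");
        Path textFile = Files.createTempFile("body-text", ".txt");
        Files.write(rawFile, raw);
        // Encode explicitly so the platform default charset cannot interfere.
        Files.write(textFile, text.getBytes(StandardCharsets.UTF_8));

        // Both lines should show the same two bytes for the last character.
        System.out.println(hexWindow(Files.readAllBytes(rawFile), 2, 2));
        System.out.println(hexWindow(Files.readAllBytes(textFile), 2, 2));
    }
}
```

For the real page, the window would start at offset 0x1845 rather than 2.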

-- Ken


On Sep 2, 2009, at 7:58am, Magnus Olstad Hansen wrote:

Thanks for the reply, Ken.

The basic problem is that determining the character set of a web page
is complex, and not something that HttpClient is designed to handle.

If you check out (for example) the Nutch source, you'll see that it
has a multi-step process, where it uses the Content-type in the
response header, the meta http-equiv tag in the HTML, and low-level
charset sniffing code to try to guess at the right encoding, but even
then sometimes it gets it wrong.
Yep - I know it's a complex matter - headers are not always specified,
metas the same... :) However if the page is actually in one specific
charset, and one can force this charset to be used by HttpClient, it
should all work in my opinion.

In your case, the page looks pretty clean - all UTF-8.

But when you call httpclient.execute(httpget, responseHandler), the
BasicResponseHandler will call EntityUtils.toString, and that in turn
uses ISO-8859-1 as its default charset when converting the bytes it
receives into the string.
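
To illustrate what decoding with the wrong charset would look like, here is a minimal JDK-only sketch. It does not call HttpClient, and the class name is made up. The same two bytes become one character under UTF-8 and two characters under ISO-8859-1.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; plain JDK string decoding, no HttpClient involved.
class WrongCharset {
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        byte[] utf8 = { (byte) 0xC3, (byte) 0xA5 };

        String right = decode(utf8, StandardCharsets.UTF_8);      // U+00E5
        String wrong = decode(utf8, StandardCharsets.ISO_8859_1); // U+00C3 U+00A5

        System.out.println(right.length() + " " + wrong.length()); // prints 1 2
    }
}
```

Note that neither decoding produces a '?' (0x3F), since ISO-8859-1 maps every byte value to some character.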
Well, I've read the source on that and it's not entirely what I saw. One
of three possibilities exists for the charset variable used within
EntityUtils.toString(), in order of preference:
1) Result of EntityUtils.getContentCharset()
2) Default charset given as argument to EntityUtils.toString().
3) HttpClient-global HTTP.DEFAULT_CHARSET (probably not the correct name
of the constant, but you get the point).

I tested EntityUtils.getContentCharset() separately, and it returns UTF-8
(as found in the Content-Type of the given page).
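
The order of preference described above can be sketched like this. It is an illustration of the fallback logic as I read it, not HttpClient's actual code. The class, the resolve method, and GLOBAL_DEFAULT are hypothetical names, and ISO-8859-1 as the global default is an assumption taken from the earlier remark in this thread.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch of the three-step fallback; not the library's source.
class CharsetFallback {
    // Assumed stand-in for the library-wide default charset.
    static final Charset GLOBAL_DEFAULT = StandardCharsets.ISO_8859_1;

    // contentCharset: charset named in the Content-Type header, or null.
    // callerDefault:  charset passed in by the caller, or null.
    static Charset resolve(String contentCharset, String callerDefault) {
        if (contentCharset != null) {
            return Charset.forName(contentCharset); // 1) from the response
        }
        if (callerDefault != null) {
            return Charset.forName(callerDefault);  // 2) caller's default
        }
        return GLOBAL_DEFAULT;                      // 3) global default
    }

    public static void main(String[] args) {
        System.out.println(resolve("UTF-8", "ISO-8859-1"));
        System.out.println(resolve(null, "UTF-8"));
        System.out.println(resolve(null, null));
    }
}
```

If this reading is right, a page that declares UTF-8 in its Content-Type would be decoded as UTF-8 regardless of the default passed in.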

So go one level deeper, and get the HttpEntity from the response, then
try EntityUtils.toString(entity, "UTF-8") and see what you get back.
Conclusion must be I have tried this without luck ... ?

Regards,
Magnus


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]


--------------------------
Ken Krugler
TransPac Software, Inc.
<http://www.transpac.com>
+1 530-210-6378

