Thanks for the reply, Ken.
>
> The basic problem is that determining the character set of a web page
> is complex, and not something that HttpClient is designed to handle.
>
> If you check out (for example) the Nutch source, you'll see that it
> has a multi-step process, where it uses the Content-type in the
> response header, the meta http-equiv tag in the HTML, and low-level
> charset sniffing code to try to guess at the right encoding, but even
> then sometimes it gets it wrong.
Yep - I know it's a complex matter - headers are not always specified,
and neither are meta tags... :) However, if the page really is in one
specific charset, and one can force HttpClient to use that charset, it
should all work, in my opinion.
>
> In your case, the page looks pretty clean - all UTF-8.
>
> But when you call httpclient.execute(httpget, responseHandler), the
> BasicResponseHandler will call EntityUtils.toString, and that in turn
> uses ISO-8859-1 as its default charset when converting the bytes it
> receives into the string.
Well, I've read the source on that, and it's not entirely what I saw.
The charset variable used within EntityUtils.toString() takes one of
three possible values, in order of preference:
1) Result of EntityUtils.getContentCharset()
2) Default charset given as argument to EntityUtils.toString().
3) HttpClient-global HTTP.DEFAULT_CHARSET (probably not the correct name
of the constant, but you get the point).
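
The order above can be sketched roughly as follows. This is plain JDK code, not HttpClient's actual implementation: the class name, the method names and the GLOBAL_DEFAULT constant are made up for illustration, and ISO-8859-1 is assumed as the global default only because that is the value Ken mentions.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Illustrative sketch only; none of these names exist in HttpClient.
class CharsetOrderSketch {

    // Assumed stand-in for the HttpClient-global default charset.
    static final Charset GLOBAL_DEFAULT = StandardCharsets.ISO_8859_1;

    // Step 1: look for a charset parameter in a Content-Type header value.
    // Returns null if the header is missing, has no charset parameter,
    // or names a charset this JVM does not know.
    static Charset fromContentType(String contentType) {
        if (contentType == null) {
            return null;
        }
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String name = p.substring("charset=".length()).trim().replace("\"", "");
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    return null; // illegal or unsupported charset name
                }
            }
        }
        return null;
    }

    // Steps 1-3 in preferred order: header charset, caller default, global default.
    static Charset resolve(String contentType, Charset callerDefault) {
        Charset fromHeader = fromContentType(contentType);
        if (fromHeader != null) {
            return fromHeader;
        }
        if (callerDefault != null) {
            return callerDefault;
        }
        return GLOBAL_DEFAULT;
    }

    public static void main(String[] args) {
        // Header wins over the caller's default.
        System.out.println(resolve("text/html; charset=UTF-8", StandardCharsets.ISO_8859_1));
        // No charset in the header: the caller's default is used.
        System.out.println(resolve("text/html", StandardCharsets.UTF_8));
        // Nothing given at all: fall back to the global default.
        System.out.println(resolve(null, null));
    }
}
```

If the real code follows this order, a default passed as an argument would only matter when the response itself declares no charset.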

I tested EntityUtils.getContentCharset() separately, and it returns UTF-8
(as found in the Content-Type of the given page).
>
> So go one level deeper, and get the HttpEntity from the response, then
> try EntityUtils.toString(entity, "UTF-8") and see what you get back.
So the conclusion must be that I have already tried this, without luck ... ?
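
For reference, this is what I would expect forcing the charset to change at the byte level. It is a minimal sketch using only the JDK, with no HttpClient involved; the class name, the method name and the sample character are assumptions chosen for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Illustrative sketch only: decodes the same bytes with two different charsets.
class DecodeSketch {

    // Turn raw response bytes into a String using an explicitly chosen charset.
    static String decode(byte[] body, Charset charset) {
        return new String(body, charset);
    }

    public static void main(String[] args) {
        // U+00E5 (a with ring above) is two bytes in UTF-8: 0xC3 0xA5.
        byte[] body = "\u00e5".getBytes(StandardCharsets.UTF_8);

        String asUtf8 = decode(body, StandardCharsets.UTF_8);
        String asLatin1 = decode(body, StandardCharsets.ISO_8859_1);

        // Decoded as UTF-8 the two bytes become one character again.
        System.out.println(asUtf8.length());   // 1
        // Decoded as ISO-8859-1 each byte becomes its own character.
        System.out.println(asLatin1.length()); // 2
    }
}
```

If the string still comes out garbled after decoding as UTF-8, the problem is probably somewhere other than the byte-to-string conversion, for example in how the result is printed or stored afterwards.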

Regards,
Magnus

