Thanks for the reply, Ken.

> The basic problem is that determining the character set of a web page
> is complex, and not something that HttpClient is designed to handle.
>
> If you check out (for example) the Nutch source, you'll see that it
> has a multi-step process, where it uses the Content-type in the
> response header, the meta http-equiv tag in the HTML, and low-level
> charset sniffing code to try to guess at the right encoding, but even
> then sometimes it gets it wrong.

Yep - I know it's a complex matter - headers are not always specified, and neither are meta tags... :) However, if the page is actually in one specific charset, and one can force HttpClient to use that charset, it should all work in my opinion.

> In your case, the page looks pretty clean - all UTF-8.
>
> But when you call httpclient.execute(httpget, responseHandler), the
> BasicResponseHandler will call EntityUtils.toString, and that in turn
> uses ISO-8859-1 as its default charset when converting the bytes it
> receives into the string.

Well, I've read the source for that, and it's not entirely what I saw. EntityUtils.toString() takes its charset from one of three sources, in order of preference:

1) The result of EntityUtils.getContentCharset()
2) The default charset given as an argument to EntityUtils.toString()
3) The HttpClient-global HTTP.DEFAULT_CHARSET (probably not the correct name of the constant, but you get the point)

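To make that order of preference concrete, here is a minimal stand-alone sketch in plain Java. It is not HttpClient's actual code: the class and method names are hypothetical, the header parsing is deliberately simplified, and the global fallback is assumed to be ISO-8859-1 as mentioned in the quoted text.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only; none of these names come from HttpClient.
final class CharsetResolutionSketch {

    // Assumed library-wide fallback (ISO-8859-1, per the quoted message).
    static final Charset GLOBAL_DEFAULT = StandardCharsets.ISO_8859_1;

    /** Returns the charset parameter of a Content-Type value, or null if absent. */
    static String charsetFromContentType(String contentType) {
        if (contentType == null) {
            return null;
        }
        for (String part : contentType.split(";")) {
            String p = part.trim();
            // Case-insensitive match on the "charset=" parameter name.
            if (p.regionMatches(true, 0, "charset=", 0, 8)) {
                String value = p.substring(8).trim();
                // Strip optional surrounding double quotes.
                if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                    value = value.substring(1, value.length() - 1);
                }
                return value.isEmpty() ? null : value;
            }
        }
        return null;
    }

    /** Picks a charset: 1) Content-Type, 2) caller's default, 3) global default. */
    static Charset resolve(String contentType, String defaultCharset) {
        String name = charsetFromContentType(contentType);
        if (name == null) {
            name = defaultCharset;
        }
        return name != null ? Charset.forName(name) : GLOBAL_DEFAULT;
    }

    /** Decodes the raw body bytes using the resolved charset. */
    static String decode(byte[] body, String contentType, String defaultCharset) {
        return new String(body, resolve(contentType, defaultCharset));
    }

    public static void main(String[] args) {
        byte[] body = "bl\u00e5b\u00e6r".getBytes(StandardCharsets.UTF_8);
        // Charset taken from the header.
        System.out.println(decode(body, "text/html; charset=UTF-8", null));
        // No charset in the header, so the caller's default applies.
        System.out.println(decode(body, "text/html", "UTF-8"));
        // Neither is available, so the global default applies and the text is garbled.
        System.out.println(decode(body, "text/html", null));
    }
}
```

With this ordering, a default passed by the caller only matters when the Content-Type carries no charset at all.
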
I tested EntityUtils.getContentCharset() separately, and it returns UTF-8 (as found in the Content-Type of the given page).

> So go one level deeper, and get the HttpEntity from the response, then
> try EntityUtils.toString(entity, "UTF-8") and see what you get back.

So the conclusion must be that I have, in effect, already tried this without luck ... ?

Regards,
Magnus
