Re: Charset trouble, questionmarks

Ken Krugler Wed, 02 Sep 2009 06:20:17 -0700

Hi Magnus,

On Sep 2, 2009, at 1:22am, Magnus Olstad Hansen wrote:

Hello,
I'm using HttpClient 4.0 to download a webpage the same way as shown in one of the examples. This is my method to return a webpage as a string:
protected static String leechUrl(String url) throws IOException {
    HttpClient httpclient = new DefaultHttpClient();
    HttpGet httpget = new HttpGet(url);
    System.out.println("executing request " + httpget.getURI());
    // Create a response handler
    ResponseHandler<String> responseHandler = new BasicResponseHandler();
    String responseBody = httpclient.execute(httpget, responseHandler);
    // When HttpClient instance is no longer needed,
    // shut down the connection manager to ensure
    // immediate deallocation of all system resources
    httpclient.getConnectionManager().shutdown();
    return responseBody;
}
However, the responseBody returned here contains ? (question marks) for all Norwegian characters (æøåÆØÅ) on the page. For example, if I try to dump "http://www.vg.no", I can find the following at line 107:
<li><a href="http://go.vg.no/cgi-bin/go.cgi/meny/http://elisabeth.vgb.no/">Frue *p?* veggen (blogg)</a></li>
...that question mark there should've been the character å. For certainty I've compared to the same page and line dumped with wget:
<li><a href="http://go.vg.no/cgi-bin/go.cgi/meny/http://elisabeth.vgb.no/">Frue på veggen (blogg)</a></li>
My question is simply what I need to do to keep the Norwegian letters intact? So far I've tried:
- Copying BasicResponseHandler to debug that EntityUtils.getContentCharset() finds a reasonable charset; it does.
- Hacking EntityUtils.toString() to override both the detected and default charset with "ISO-8859-1" and "UTF-8".
- Adding a header to the request with content-type and charset (which isn't really logical to add to a request, but I tried anyway)
All I've accomplished with this is to sometimes get two ?'s instead of one for the Norwegian letters. I also tried to dump the response as directly as I saw possible by using EntityUtils.toByteArray() and writing directly to a file. To my surprise I can see that the ?'s are still there, and via hexdump I can see that they are all 3F (question mark) - so it's in fact impossible to recover the Norwegian letters. They must have been replaced with a question mark somewhere.
Please advise, and a thousand thanks for reading my problem!

The basic problem is that determining the character set of a web page is complex, and not something that HttpClient is designed to handle.

If you check out (for example) the Nutch source, you'll see that it has a multi-step process, where it uses the Content-Type in the response header, the meta http-equiv tag in the HTML, and low-level charset sniffing code to try to guess at the right encoding, but even then sometimes it gets it wrong.
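
To give a feel for that kind of fallback chain, here is a rough sketch in plain Java (Java 7 or later, for StandardCharsets). It is not Nutch's code and uses no HttpClient classes; the names CharsetGuesser, fromHeader, fromMeta and guess are made up for illustration, the regexes are deliberately crude, and the last step simply assumes UTF-8 where a real crawler would run byte-level sniffing.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: header first, then <meta>, then a fixed default.
final class CharsetGuesser {
    // charset=... inside a Content-Type header value
    private static final Pattern HEADER = Pattern.compile(
            "charset\\s*=\\s*\"?([^\\s;\"]+)", Pattern.CASE_INSENSITIVE);
    // charset=... inside a <meta ...> tag (http-equiv or HTML5 style)
    private static final Pattern META = Pattern.compile(
            "<meta[^>]*?charset\\s*=\\s*[\"']?([^\\s\"'/>;]+)", Pattern.CASE_INSENSITIVE);

    // Returns the charset captured by the pattern, or null if none is
    // declared or the JVM does not support the declared name.
    private static Charset find(Pattern pattern, String text) {
        if (text == null) {
            return null;
        }
        Matcher m = pattern.matcher(text);
        if (!m.find()) {
            return null;
        }
        try {
            return Charset.forName(m.group(1));
        } catch (IllegalArgumentException e) {
            return null; // illegal or unsupported charset name
        }
    }

    static Charset fromHeader(String contentType) {
        return find(HEADER, contentType);
    }

    static Charset fromMeta(byte[] body) {
        // ISO-8859-1 maps every byte to a char, so this decode cannot fail;
        // only the first few KB are scanned, where <meta> normally lives.
        int n = Math.min(body.length, 4096);
        return find(META, new String(body, 0, n, StandardCharsets.ISO_8859_1));
    }

    static Charset guess(String contentType, byte[] body) {
        Charset cs = fromHeader(contentType);
        if (cs == null) {
            cs = fromMeta(body);
        }
        // Assumption: default to UTF-8 instead of real charset sniffing.
        return cs != null ? cs : StandardCharsets.UTF_8;
    }

    public static void main(String[] args) {
        byte[] html = "<html><head><meta charset=\"ISO-8859-1\"></head></html>"
                .getBytes(StandardCharsets.US_ASCII);
        System.out.println(guess("text/html", html));
    }
}
```

The header is checked first here only because that is the order listed above; when the header and the meta tag disagree, either one can be the wrong one.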


In your case, the page looks pretty clean - all UTF-8.

But when you call httpclient.execute(httpget, responseHandler), the BasicResponseHandler will call EntityUtils.toString, and that in turn uses ISO-8859-1 as its default charset when converting the bytes it receives into the string.

So go one level deeper, and get the HttpEntity from the response, then try EntityUtils.toString(entity, "UTF-8") and see what you get back.
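
To see why the explicit charset matters, here is a small stand-alone sketch that leaves HttpClient out entirely and just decodes the same bytes two ways. The class DecodeDemo and its decode helper are hypothetical, and the byte array is hand-built to stand in for what EntityUtils.toByteArray() would hand back for a UTF-8 page containing "på".

```java
import java.nio.charset.Charset;

// Hypothetical demo: same bytes, two different charsets.
final class DecodeDemo {
    // Decodes a response body using the charset the caller names.
    static String decode(byte[] body, String charsetName) {
        return new String(body, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        // 'p' followed by a-ring (U+00E5), which UTF-8 encodes as C3 A5.
        byte[] body = {0x70, (byte) 0xC3, (byte) 0xA5};

        String asUtf8 = decode(body, "UTF-8");        // two chars: p, U+00E5
        String asLatin1 = decode(body, "ISO-8859-1"); // three chars: p, U+00C3, U+00A5

        System.out.println(asUtf8.length() + " chars as UTF-8");
        System.out.println(asLatin1.length() + " chars as ISO-8859-1");
    }
}
```

Decoding with the wrong charset typically shows up as two odd characters per letter rather than a literal ?, so if the raw bytes really are 3F this will not bring the letters back, but it does rule out the decoding step.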


-- Ken