I've had to deal with this problem myself. Right now the only solution is to use getResponseBody() and convert bytes into a string using the appropriate encoding. I like the idea of having getResponseBodyAsString() use the encoding specified in the Content-Type header, but the problem is that it still won't be very useful.
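For what it's worth, here is a minimal sketch of that workaround: take the raw bytes from getResponseBody() and decode them with whatever charset the Content-Type header names. The class ResponseDecoder and its methods are hypothetical helpers of mine, not part of HttpClient, and the ISO-8859-1 fallback is an assumption (it is the default HTTP/1.1 gives for text types when no charset is sent).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical helper, not part of HttpClient.
final class ResponseDecoder {
    private ResponseDecoder() {}

    /**
     * Returns the charset named in a Content-Type header value,
     * or the fallback if no usable charset parameter is present.
     */
    static Charset charsetOf(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        for (String param : contentType.split(";")) {
            String p = param.trim();
            if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String name = p.substring("charset=".length()).trim();
                // The value may be quoted: charset="utf-8"
                if (name.length() >= 2 && name.startsWith("\"") && name.endsWith("\"")) {
                    name = name.substring(1, name.length() - 1);
                }
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    // Illegal or unsupported charset name: use the fallback.
                    return fallback;
                }
            }
        }
        return fallback;
    }

    /** Decodes the body bytes; assumes ISO-8859-1 when the header names no charset. */
    static String decode(byte[] body, String contentType) {
        if (body == null) {
            return null;
        }
        return new String(body, charsetOf(contentType, StandardCharsets.ISO_8859_1));
    }
}
```

You would call it with the bytes from getResponseBody() and the value of the Content-Type response header, however your version of the API exposes that.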
The vast majority of web servers out there don't include a "; charset=" attribute in the Content-Type header, nor do they provide a reasonable mechanism for content authors to make the server set the attribute correctly on a per-file basis. Most pages with non-ISO-LATIN-1 charsets use a <META HTTP-EQUIV> tag in the HTML header to specify the page encoding. That means you still have to read at least part of the response body (as ISO-LATIN-1) in order to determine the correct encoding.

I don't have a problem with changing getResponseBodyAsString() to check the Content-Type header; I just doubt that doing so will make it much more useful in the real world.

What do others think?

Marc Saegesser

> -----Original Message-----
> From: Rapheal Kaplan [mailto:[EMAIL PROTECTED]]
> Sent: Wednesday, March 20, 2002 12:46 PM
> To: [EMAIL PROTECTED]
> Subject: [HttpClient]Encoding
>
>
> Was working with a friend trying to determine the best way to read the
> contents of an HTTP response into a string. Since he's working within the
> Jakarta framework, including the HttpClient, we decided to use that API.
> The simplest way seems to be:
>
>     HttpClient hc = new HttpClient();
>     UrlGetMethod gm = new UrlGetMethod(query);
>     hc.startSession(url,80);
>     hc.executeMethod(gm);
>
>     String htmlText = gm.getResponseBodyAsString();
>
> I thought that seemed like a good idea, and wanted to check to make sure
> that the encoding was working correctly in getResponseBodyAsString. I
> noticed there is also "byte[] getResponseBody" and getResponseBodyAsStream.
> It doesn't seem like getResponseBodyAsString would decode the byte array
> properly. Here is how it is written in
> org.apache.commons.httpclient.methods.GetMethod.java:
>
>     /**
>      * Return my response body, if any,
>      * as a {@link String}.
>      * Otherwise return <tt>null</tt>.
>      */
>     public String getResponseBodyAsString() {
>         byte[] data = getResponseBody();
>         if(null == data) {
>             return null;
>         } else {
>             return new String(data);
>         }
>     }
>
> The problem is that the string is constructed using the default encoding
> of the VM, not the encoding that the server might be sending the data in.
> For example, if the client is requesting a document written in Chinese, it
> could well use an entirely different encoding.
>
> Of course I am not worried about the getResponseBody and
> getResponseBodyAsStream methods. Those should expose binary data. However,
> the get...AsString should do something like:
>
>     /**
>      * Return my response body, if any,
>      * as a {@link String}.
>      * Otherwise return <tt>null</tt>.
>      */
>     public String getResponseBodyAsString() {
>         byte[] data = getResponseBody();
>         if(null == data) {
>             return null;
>         } else {
>             return new String(data, getResponseEncoding());
>         }
>     }
>
> Of course I am making up the method getResponseEncoding as an example.
>
> Likewise, I would recommend a getResponseAsReader method that would return
> an InputStreamReader set to the proper encoding.
>
> Has anyone given this problem any thought? Or, is this design intentional
> and encoding is handled somewhere else? Are there other issues?
>
> If there is a desire to solve the encoding problem (assuming I am correct
> in thinking it is missing), I am quite willing to participate in the design
> and encoding.
>
> Thank you.
>
> - Rapheal Kaplan
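
To illustrate the <META HTTP-EQUIV> point from the reply at the top of this message, here is a rough sketch of reading the start of the body as ISO-LATIN-1, looking for a charset in a META tag, and then decoding the whole body with it. MetaCharsetSniffer is a hypothetical class, not part of HttpClient; the 4096-byte prefix limit and the regular expression are my own assumptions, and a regex like this is nowhere near a real HTML parser.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of HttpClient.
final class MetaCharsetSniffer {
    // Assumption: the META tag, if any, appears within the first 4096 bytes.
    private static final int PREFIX_LIMIT = 4096;

    // Loose match for charset=... inside a <meta ...> tag, case-insensitive.
    private static final Pattern META_CHARSET = Pattern.compile(
            "<meta[^>]*charset\\s*=\\s*[\"']?([\\w.:+-]+)",
            Pattern.CASE_INSENSITIVE);

    private MetaCharsetSniffer() {}

    /** Returns the charset declared in a META tag, or the fallback if none is found. */
    static Charset sniff(byte[] body, Charset fallback) {
        int n = Math.min(body.length, PREFIX_LIMIT);
        // ISO-8859-1 maps every byte to a character, so this never fails
        // and leaves the ASCII markup readable.
        String head = new String(body, 0, n, StandardCharsets.ISO_8859_1);
        Matcher m = META_CHARSET.matcher(head);
        if (m.find()) {
            try {
                return Charset.forName(m.group(1));
            } catch (IllegalArgumentException e) {
                // Illegal or unsupported charset name: fall through.
            }
        }
        return fallback;
    }

    /** Decodes the whole body using the sniffed charset. */
    static String decode(byte[] body, Charset fallback) {
        if (body == null) {
            return null;
        }
        return new String(body, sniff(body, fallback));
    }
}
```

A fuller version would probably try the Content-Type header first and only fall back to sniffing the body when the header carries no charset.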
