I take it you're talking about character sets (called "character encodings" in
MIME) in HTTP. I've added my comments below. ;)
Sung-Gu
----- Original Message -----
Subject: Re: [HttpClient]Encoding
> I'll see about changing getResponseBodyAsString() to use the charset from
> the content-type (if it exists). I'm up to my ears with day job work right
> now, so it'll probably be a while before I can get to it.
I think we'll need to support language tags (within the Accept-Language and
Content-Language fields) and Accept and Content-Type (for internet media types) at
some point.
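For the charset-from-Content-Type idea quoted above, here is a minimal sketch in
plain Java of what that lookup might look like. The class and method names
(ContentTypeCharset, charsetOf, decodeBody) are made up for illustration and are
not part of HttpClient. It assumes a fallback of ISO-8859-1, the HTTP/1.1 default
for text types, when no usable charset parameter is present.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not an HttpClient class.
final class ContentTypeCharset {
    // Matches a charset parameter such as: text/html; charset="EUC-KR"
    private static final Pattern CHARSET_PARAM = Pattern.compile(
            ";\\s*charset\\s*=\\s*\"?([^\\s;\"]+)\"?", Pattern.CASE_INSENSITIVE);

    /** Returns the charset named in the header value, or the fallback if absent or unknown. */
    static Charset charsetOf(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        Matcher m = CHARSET_PARAM.matcher(contentType);
        if (!m.find()) {
            return fallback;
        }
        try {
            return Charset.forName(m.group(1));
        } catch (IllegalArgumentException e) {
            // Covers both illegal charset names and charsets this JVM does not support.
            return fallback;
        }
    }

    /** Decodes the raw body bytes using the header's charset (assumed fallback: ISO-8859-1). */
    static String decodeBody(byte[] body, String contentType) {
        return new String(body, charsetOf(contentType, StandardCharsets.ISO_8859_1));
    }
}
```

As Marc points out below, this only helps when the server actually sends the
charset parameter.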
>
> People still need to understand (and I'll improve the JavaDoc) that
> getResponseBodyAsString() is never really going to be all that useful in the
> real world. From HttpClient's perspective the response body is simply a
> sequence of bytes, nothing more. It is up to a higher application layer to
> actually *interpret* those bytes based on the mime type specified in the
> content-type header.
>
> Marc Saegesser
>
> > -----Original Message-----
> > From: Rapheal Kaplan [mailto:[EMAIL PROTECTED]]
> > Sent: Wednesday, March 20, 2002 1:53 PM
> > To: Jakarta Commons Developers List
> > Subject: Re: [HttpClient]Encoding
> >
> >
> > Makes sense to me. Because the encoding is handled in the body itself, it
> > doesn't necessarily help that much to set the encoding in the
> > getResponseBodyAsString method. Also, this kind of means that you can't rely
> > on the getResponseBodyAsString method for all purposes. There needs to be
> > some other layer of a client application that manages encoding.
> >
> > I still see the use of get...AsString, of course. It could be an in-between
> > step that is sent to a parser to determine the actual encoding, but then you
> > would need to return to the original byte stream anyway to re-string the
> > body. Maybe the documentation should reflect this information.
> >
> > Also, if people start using charset info in the future, it would probably
> > be nice to provide support. It might be that doing body to string conversion
> > should be somewhere else in the API. Any ideas?
> >
> > My first guess would be to have a utility class that can do the correct
> > encoding, from both the header and maybe even parsing the content. However,
> > I don't think I am familiar enough with the API to say decisively.
> >
> > I do know that such features might be very useful for some work that I
> > need to do in the near future. I am working on software that needs to
> > interact with several languages with non-Latin character sets.
In your previous mail,
> For example, if the client is requesting a document written in Chinese, it
> could well use an entirely different encoding.
if you want to solve this problem purely from the perspective of character encoding,
you should consider the conversion between the local character set and the transfer
character set on both the client and the server side.
It can get more complicated!
If you mix non-ASCII characters (Korean and Chinese...), you have to handle
bi-directional display for these character sets. Then you need a two-step process
to convert between the local character set and UTF-8: first, convert the local
character set to UCS; second, convert UCS to UTF-8. How complicated, huh?
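In Java the two steps are fairly short, because a String already holds Unicode
(as UTF-16), so it can stand in for the UCS step. This is only a sketch; the
class and method names are made up. Note that a local charset such as EUC-KR,
looked up with Charset.forName, is not guaranteed to be available on every JVM.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper illustrating the two-step conversion.
final class Transcode {
    /** Local character set to UTF-8, going through Unicode. */
    static byte[] toUtf8(byte[] localBytes, Charset localCharset) {
        String unicode = new String(localBytes, localCharset); // step 1: local -> Unicode
        return unicode.getBytes(StandardCharsets.UTF_8);       // step 2: Unicode -> UTF-8
    }

    /** The reverse direction: UTF-8 to the local character set. */
    static byte[] fromUtf8(byte[] utf8Bytes, Charset localCharset) {
        String unicode = new String(utf8Bytes, StandardCharsets.UTF_8);
        // Characters the local charset cannot represent are replaced (typically by '?').
        return unicode.getBytes(localCharset);
    }
}
```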
And one more thing!
Some old clients or servers don't support 8-bit transfer encodings like UTF-8. Then
what? We would have to check whether the data is valid UTF-8 or not.
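One way such a validity check could be written in Java is with a strict
CharsetDecoder that reports errors instead of silently replacing bad bytes. The
class name Utf8Check is made up for this sketch.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: strict UTF-8 validation.
final class Utf8Check {
    /** Returns true only if the whole byte array is well-formed UTF-8. */
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            // Malformed or truncated sequence somewhere in the input.
            return false;
        }
    }
}
```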
However, there is an easier way to solve this problem.
( I WANT to say this a bit! ^^ )
That is to use an "escaped encoding" that contains only ASCII characters.
It looks like the application/x-www-form-urlencoded media type used in HTML,
but it's somewhat different.
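To give a feel for what an ASCII-only escaped form looks like, here is a sketch
using java.net.URLEncoder and URLDecoder. Be aware that these implement the
application/x-www-form-urlencoded flavor (a space becomes "+"), so this is only a
stand-in for the escaped encoding I mean, and it assumes UTF-8 as the underlying
character set. The class name EscapedEncoding is made up.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

// Hypothetical helper: form-urlencoded escaping over UTF-8, as a stand-in.
final class EscapedEncoding {
    /** Escapes text so that the result contains ASCII characters only. */
    static String escape(String text) {
        try {
            return URLEncoder.encode(text, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // Every Java platform is required to support UTF-8.
            throw new IllegalStateException(e);
        }
    }

    /** Reverses escape(). */
    static String unescape(String escaped) {
        try {
            return URLDecoder.decode(escaped, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

The escaped string survives a 7-bit channel, and the receiver gets the original
text back with unescape().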
> >
> > - Rapheal Kaplan
> >
> >
> >
> > On Wednesday 20 March 2002 14:27, you wrote:
> > > I've had to deal with this problem myself. Right now the only solution is
> > > to use getResponseBody() and convert bytes into a string using the
> > > appropriate encoding. I like the idea of having getResponseBodyAsString()
> > > use the encoding specified in the Content-Type header, but the problem is
> > > that it still won't be very useful.
> > >
> > > The vast majority of web servers out there don't include a "; charset="
> > > attribute in the content-type header or provide a reasonable mechanism for
> > > content authors to cause the server to set the attribute correctly on a
> > > per-file basis. Most pages with non-ISO-LATIN-1 charsets use a <META
> > > HTTP-EQUIV> tag in the HTML header to specify the page encoding. That
> > > means you still have to read at least part of the response body (as
> > > ISO-LATIN-1) in order to determine the correct encoding.
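The sniffing step described here could look roughly like the sketch below: decode
the start of the body as ISO-8859-1 and search it for a charset declared in a META
tag. The class name, the 1024-byte limit, and the regular expression are all
assumptions made for illustration, and a regex is no substitute for a real HTML
parser.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: looks for a charset declared in a META tag.
final class MetaCharsetSniffer {
    // Assumption: the declaration appears within the first 1024 bytes.
    private static final int SNIFF_LIMIT = 1024;

    private static final Pattern META_CHARSET = Pattern.compile(
            "<meta[^>]*charset\\s*=\\s*[\"']?([A-Za-z0-9._:-]+)", Pattern.CASE_INSENSITIVE);

    /** Returns the charset name declared in a META tag, or null if none is found. */
    static String sniff(byte[] body) {
        int length = Math.min(body.length, SNIFF_LIMIT);
        // ISO-8859-1 maps every byte to a character, so this decoding never fails.
        String head = new String(body, 0, length, StandardCharsets.ISO_8859_1);
        Matcher m = META_CHARSET.matcher(head);
        return m.find() ? m.group(1) : null;
    }
}
```

Once a name is found, the original bytes still have to be decoded again with that
charset, which is the "re-string" step Rapheal mentions.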
> > >
> > > I don't have a problem with changing getResponseBodyAsString() to check the
> > > content-type header, I just doubt that doing that will make it much more
> > > useful in the real world.
> > >
> > > What do others think?
> > >
> > > Marc Saegesser
> > >
> >
>
>
>