Hi Jesse,
I don't know exactly why this method was introduced; perhaps the elders can explain the reasons. If the method is useless, it should be deleted outright. I disagree with keeping a no-op method around.
Perhaps some test cases should highlight the problem...
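For instance, here is a self-contained sketch (JUnit 3 style, plain JDK
calls only, no JWebUnit internals; the class and method names are made
up for illustration) that reproduces the corruption Jesse describes:

    import java.io.UnsupportedEncodingException;
    import junit.framework.TestCase;

    public class EncodingRoundTripTest extends TestCase {

        // Mirrors what toEncodedString() does internally: encode with
        // one charset, then decode the resulting bytes with another.
        private String reencode(String text, String from, String to)
                throws UnsupportedEncodingException {
            return new String(text.getBytes(from), to);
        }

        public void testMatchingCharsetsRoundTrip() throws Exception {
            String euro = "\u20AC";
            // UTF-8 can represent the Euro sign, so encoding and
            // decoding with the same charset is lossless.
            assertEquals(euro, reencode(euro, "UTF-8", "UTF-8"));
        }

        public void testMismatchedCharsetsCorrupt() throws Exception {
            String euro = "\u20AC";
            // Encoding as UTF-8 (3 bytes) and decoding as ISO-8859-1
            // produces three garbage characters, not the Euro sign.
            assertFalse(euro.equals(reencode(euro, "UTF-8", "ISO-8859-1")));
        }
    }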
++
Julien
----- Original Message ----
From: Jesse Wilson <[EMAIL PROTECTED]>
To: [email protected]
Sent: Thursday, June 8, 2006, 3:10:20 AM
Subject: [Jwebunit-users] Context.toEncodedString() doesn't make sense
Hi JWebUnit team!
I'm a new user of JWebUnit and I'm having problems using
it with multi-byte characters, the Euro character in particular.
The toEncodedString() method supposedly converts a String from
one encoding to another:
    public String toEncodedString(String text) {
        try {
            return new String(text.getBytes(), encodingScheme);
        } catch (UnsupportedEncodingException e) {
            e.printStackTrace();
            return text;
        }
    }
Unfortunately, this doesn't make sense. Internally, all Strings
in Java are UTF-16, regardless of what encoding they were
in when you created them from bytes. The String constructors
automatically convert bytes from their specified encoding to UTF-16.
There's no reason to worry about the encoding of a Java String
until you're reading from or writing to bytes, since every String
uses the same internal representation.
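To illustrate (plain JDK code, nothing JWebUnit-specific; the checked
UnsupportedEncodingException is ignored for brevity), decoding bytes
with the same charset that produced them recovers the original String:

    String euro = "\u20AC";
    byte[] utf8Bytes = euro.getBytes("UTF-8");       // chars -> bytes
    String decoded = new String(utf8Bytes, "UTF-8"); // bytes -> chars
    // decoded.equals(euro) is true: a matching encode/decode pair is
    // lossless for any character the charset can represent.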
So if toEncodedString() does not convert Strings from one
encoding to another, what does it actually do?
1. text.getBytes() converts the UTF-16 String into a byte[] array
using the platform's default character encoding. On my Linux
box, the default charset is ISO-8859-1, a poor choice because it
can represent at most 256 distinct characters. Whenever a
character outside that set is encoded, information is lost.
2. new String(platformEncodedBytes, encodingScheme) takes an
array of bytes encoded in the platform's default scheme and uses
a potentially different encoding scheme to decode it into a proper
Java UTF-16 String. If the encoding charset and the decoding
charset differ, this step corrupts the data.
For example, on my Red Hat box (default charset ISO-8859-1), the
Euro character is converted to a "?":

    String euro = "\u20AC";
    byte[] euroAsBytes = euro.getBytes("ISO-8859-1");
    String euroEncoded = new String(euroAsBytes, "ISO-8859-1"); // equals "?"
On my coworker's Ubuntu box (default charset UTF-8), the Euro
character is converted to three garbage characters:

    String euro = "\u20AC";
    byte[] euroAsBytes = euro.getBytes("UTF-8");                // array length is 3
    String euroEncoded = new String(euroAsBytes, "ISO-8859-1"); // 3 garbage chars
Since the encoding step is unnecessary, I strongly recommend
replacing the method's implementation with a no-op:

    public String toEncodedString(String text) {
        return text;
    }
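The one place encodingScheme genuinely matters is at the byte
boundary, when the response bytes first become a String. A sketch of
that (the method name and parameters here are hypothetical, not
existing JWebUnit API):

    // Hypothetical helper, not existing JWebUnit API: decode the raw
    // response bytes exactly once, using the charset the server declared.
    public String decodeResponse(byte[] responseBytes, String encodingScheme)
            throws UnsupportedEncodingException {
        // The result is a normal UTF-16 String; no further conversion
        // is needed before comparing or asserting on it.
        return new String(responseBytes, encodingScheme);
    }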
Thanks in advance,
Jesse Wilson
_______________________________________________
Jwebunit-users mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/jwebunit-users
