Hi Jesse,
I don't know exactly why this method was introduced; perhaps the elders can explain the reasons. If the method is useless, it should be deleted outright. I disagree with keeping a no-op method around.
Perhaps some test cases should highlight the problem...
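For instance, here is a self-contained sketch (JUnit 3 style, plain JDK
calls only, no JWebUnit internals; the class and method names are made
up for illustration) that reproduces the corruption Jesse describes:

    import java.io.UnsupportedEncodingException;
    import junit.framework.TestCase;

    public class EncodingRoundTripTest extends TestCase {

        // Mirrors what toEncodedString() does internally: encode with
        // one charset, then decode the resulting bytes with another.
        private String reencode(String text, String from, String to)
                throws UnsupportedEncodingException {
            return new String(text.getBytes(from), to);
        }

        public void testMatchingCharsetsRoundTrip() throws Exception {
            String euro = "\u20AC";
            // UTF-8 can represent the Euro sign, so encoding and
            // decoding with the same charset is lossless.
            assertEquals(euro, reencode(euro, "UTF-8", "UTF-8"));
        }

        public void testMismatchedCharsetsCorrupt() throws Exception {
            String euro = "\u20AC";
            // Encoding as UTF-8 (3 bytes) and decoding as ISO-8859-1
            // produces three garbage characters, not the Euro sign.
            assertFalse(euro.equals(reencode(euro, "UTF-8", "ISO-8859-1")));
        }
    }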
++
Julien
----- Original Message ----
From: Jesse Wilson <[EMAIL PROTECTED]>
To: [email protected]
Sent: Thursday, June 8, 2006, 3:10:20 AM
Subject: [Jwebunit-users] Context.toEncodedString() doesn't make sense
Hi JWebUnit team!
I'm a new user of JWebUnit and I'm having problems using
it with multi-byte characters, the Euro character in particular.
The toEncodedString() method supposedly converts a String from
one encoding to another:
    public String toEncodedString(String text) {
        try {
            return new String(text.getBytes(), encodingScheme);
        } catch (UnsupportedEncodingException e) {
            e.printStackTrace();
            return text;
        }
    }
Unfortunately, this doesn't make sense. Internally, all Strings
in Java are UTF-16, regardless of what encoding they were
in when you created them from bytes. The String constructors
automatically convert bytes from their specified encoding to UTF-16.
There's no reason to worry about the encoding of a Java String
until you're reading from or writing to bytes, since every String
uses the same internal representation.
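To illustrate (plain JDK code, nothing JWebUnit-specific; the checked
UnsupportedEncodingException is ignored for brevity), decoding bytes
with the same charset that produced them recovers the original String:

    String euro = "\u20AC";
    byte[] utf8Bytes = euro.getBytes("UTF-8");       // chars -> bytes
    String decoded = new String(utf8Bytes, "UTF-8"); // bytes -> chars
    // decoded.equals(euro) is true: a matching encode/decode pair is
    // lossless for any character the charset can represent.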
So if toEncodedString() does not convert Strings from one
encoding to another, what does it actually do?
1. text.getBytes() converts the UTF-16 String into a byte[] array
using the platform's default character encoding. On my Linux
box, the default charset is ISO-8859-1, a poor choice because it
can represent at most 256 distinct characters. Whenever a
character outside that set is encoded, information is lost.
2. new String(platformEncodedBytes, encodingScheme) takes an
array of bytes encoded in the platform's default scheme and uses
a potentially different encoding scheme to decode it into a proper
Java UTF-16 String. If the encoding charset and the decoding
charset differ, this step corrupts the data.
For example, on my Red Hat box (default charset ISO-8859-1), the
Euro character is converted to a "?":

    String euro = "\u20AC";
    byte[] euroAsBytes = euro.getBytes("ISO-8859-1");
    String euroEncoded = new String(euroAsBytes, "ISO-8859-1"); // equals "?"
On my coworker's Ubuntu box (default charset UTF-8), the Euro
character is converted to three garbage characters:

    String euro = "\u20AC";
    byte[] euroAsBytes = euro.getBytes("UTF-8");                // array length is 3
    String euroEncoded = new String(euroAsBytes, "ISO-8859-1"); // 3 garbage chars
Since the encoding step is unnecessary, I strongly recommend
replacing the method's implementation with a no-op:

    public String toEncodedString(String text) {
        return text;
    }
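The one place encodingScheme genuinely matters is at the byte
boundary, when the response bytes first become a String. A sketch of
that (the method name and parameters here are hypothetical, not
existing JWebUnit API):

    // Hypothetical helper, not existing JWebUnit API: decode the raw
    // response bytes exactly once, using the charset the server declared.
    public String decodeResponse(byte[] responseBytes, String encodingScheme)
            throws UnsupportedEncodingException {
        // The result is a normal UTF-16 String; no further conversion
        // is needed before comparing or asserting on it.
        return new String(responseBytes, encodingScheme);
    }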
Thanks in advance,
Jesse Wilson
_______________________________________________
Jwebunit-users mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/jwebunit-users
