Pier Fumagalli wrote:
On 5 Sep 2005, at 01:53, Antonio Gallardo wrote:
Pier Fumagalli wrote:
Depending on your platform encoding (yours apparently ISO8859-1,
mine UTF-8, my wife's (she's Japanese) Shift-JIS), that sequence
(B4) of BYTES in the original source code will be interpreted
as a different character.
The char encoding Shift-JIS (JIS X 0201:1997 or JIS X 0208:1997) is
exactly the same as using ISO-8859-1. We need to keep the sources
in UNICODE, which also covers Japanese (Hiragana, Katakana, et
al.): http://www.unicode.org/charts/
Err... Ehmmm.. No... The character in question (Latin-1 character B4,
Acute Accent) is encoded in ISO8859-1 as the byte sequence "B4",
while in Shift-JIS the same character is encoded as byte sequence "81
4C", quite different.
Reading the byte sequence "B4" in Shift-JIS will produce Unicode
character FF74 (Halfwidth katakana "E"), which is quite different
from an acute accent as you intended.
Trust me, I've been doing this for 9 years! :-)
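
The effect described above can be reproduced with a small Java sketch. It decodes the single byte B4 under three charsets. The class name ByteB4Demo and the helper decodeB4 are made up for illustration, and Shift_JIS is only tried if the running JVM happens to ship that charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ByteB4Demo {
    // Decode the single byte 0xB4 with the given charset and
    // return the code point of the first resulting character.
    static int decodeB4(Charset cs) {
        byte[] raw = { (byte) 0xB4 };
        return new String(raw, cs).codePointAt(0);
    }

    public static void main(String[] args) {
        // ISO-8859-1 maps every byte straight to the same code point.
        System.out.printf("ISO-8859-1: U+%04X%n",
                decodeB4(StandardCharsets.ISO_8859_1));
        // A lone B4 is malformed UTF-8, so the decoder substitutes
        // the replacement character.
        System.out.printf("UTF-8:      U+%04X%n",
                decodeB4(StandardCharsets.UTF_8));
        // Shift_JIS is an optional charset, hence the guard.
        if (Charset.isSupported("Shift_JIS")) {
            System.out.printf("Shift_JIS:  U+%04X%n",
                    decodeB4(Charset.forName("Shift_JIS")));
        }
    }
}
```

The same byte yields three different characters, which is why a raw B4 in a source file is only safe if everyone compiles with the same platform encoding.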
Yes, I believe you. :-) When I said that using Shift-JIS and ISO-8859-1
is the same, I had in mind that neither represents the full Unicode
spectrum. I was just trying to show this problem in another charset. So in
fact we have the same problem. Of course I am aware that both
codesets (Shift-JIS and ISO-8859-1) are different UNICODE subsets. This
is the same as you stated.
Changing the binary sequence B4 to \u00B4 tells the Java compiler that
no matter what encoding your platform is set to, the resulting
character will always (always) be UNICODE 00B4, the Acute Accent,
part of the Latin-1 (0X0080) table.
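
A minimal sketch of that point, assuming nothing beyond the standard library: the string constant is written with the \u00B4 escape, so the source file stays pure ASCII and the character is fixed at U+00B4 whatever encoding the compiler assumes. The class AcuteEscapeDemo and the helper hex are hypothetical names. The helper then shows that the byte form of that one character still depends on the encoding chosen when writing it out.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class AcuteEscapeDemo {
    // Written as a Unicode escape, so the source file itself is ASCII only.
    static final String ACUTE = "\u00B4";

    // Encode the text with the given charset and format the bytes as hex.
    static String hex(String text, Charset cs) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(cs)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.printf("code point: U+%04X%n", ACUTE.codePointAt(0));
        System.out.println("ISO-8859-1: " + hex(ACUTE, StandardCharsets.ISO_8859_1));
        System.out.println("UTF-8:      " + hex(ACUTE, StandardCharsets.UTF_8));
        // Shift_JIS is an optional charset, hence the guard.
        if (Charset.isSupported("Shift_JIS")) {
            System.out.println("Shift_JIS:  " + hex(ACUTE, Charset.forName("Shift_JIS")));
        }
    }
}
```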
If we wrote the code in UNICODE you would have the same effect. It is
exactly the same as with XML, isn't it?
Unicode is simply a list of characters. To save them on a disk, you
_need_ to use an encoding. Unicode characters are 32 bits long (they
were 16 bits until Unicode 4 came along, but that ain't important
right now), bytes are 8 bits long. It's as easy as that. To represent
32 bits in 8, you need to "compress" them (or as said in I18N,
"encode" them).
Some encodings are complete (such as the family of UTF encodings)
meaning that the encoding CAN represent ALL Unicode characters, some
are not (such as ISO8859-1 which can represent only Unicode
characters from 00 to FF).
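
The difference between complete and incomplete encodings can be checked from Java with CharsetEncoder.canEncode. The following is only a sketch: the class CoverageDemo and its helper are illustrative names, and the three sample characters (one inside 00 to FF, one halfwidth katakana, one beyond 16 bits) are chosen here as examples.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CoverageDemo {
    // True if every character of the text can be represented in the charset.
    static boolean canEncode(Charset cs, String text) {
        return cs.newEncoder().canEncode(text);
    }

    public static void main(String[] args) {
        String acute = "\u00B4";      // inside the 00..FF range
        String katakana = "\uFF74";   // outside the 00..FF range
        // A code point that does not fit in 16 bits (a surrogate pair in Java).
        String beyond16 = new String(Character.toChars(0x1F600));

        for (String s : new String[] { acute, katakana, beyond16 }) {
            System.out.printf("U+%04X  ISO-8859-1: %-5b  UTF-8: %b (%d bytes)%n",
                    s.codePointAt(0),
                    canEncode(StandardCharsets.ISO_8859_1, s),
                    canEncode(StandardCharsets.UTF_8, s),
                    s.getBytes(StandardCharsets.UTF_8).length);
        }

        // getBytes does not fail on an unmappable character; it silently
        // substitutes the charset's replacement byte, '?' for ISO-8859-1.
        byte[] lossy = katakana.getBytes(StandardCharsets.ISO_8859_1);
        System.out.printf("lossy byte: %02X%n", lossy[0] & 0xFF);
    }
}
```

The silent substitution in the last step is the practical danger of an incomplete encoding: the data is damaged without any error being raised.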
Yes. Please correct me here if I am wrong: does our SVN use UTF-8 as the
default charset (or encoding) or not? If not, then we need to take care
not only of Java sources but also of the chars above 7F in the XML files.
I have a special interest in that, since we wrote mostly Spanish messages.
I would like to know if this is needed or not.
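
One way to find out whether a given file needs attention is to look at its bytes directly. The sketch below is an assumption about how such a check might be written, not something taken from the project: Utf8Check, hasNonAscii and isValidUtf8 are hypothetical names. It flags content with bytes above 7F and tests whether those bytes form valid UTF-8, using a strict decoder that reports malformed input instead of replacing it.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class Utf8Check {
    // True if any byte is above 0x7F, i.e. the content is not plain ASCII.
    static boolean hasNonAscii(byte[] data) {
        for (byte b : data) {
            if ((b & 0x80) != 0) {
                return true;
            }
        }
        return false;
    }

    // True if the bytes decode as UTF-8 without any malformed sequence.
    static boolean isValidUtf8(byte[] data) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String message = "se\u00F1al"; // a Spanish word with n-tilde
        byte[] latin1 = message.getBytes(StandardCharsets.ISO_8859_1);
        byte[] utf8 = message.getBytes(StandardCharsets.UTF_8);
        System.out.println("latin1 valid UTF-8: " + isValidUtf8(latin1));
        System.out.println("utf8 valid UTF-8:   " + isValidUtf8(utf8));
    }
}
```

To apply it to real files, the byte array could come from java.nio.file.Files.readAllBytes. A file that has bytes above 7F but fails the UTF-8 check was most likely saved in a legacy encoding such as ISO-8859-1.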
Best Regards,
Antonio Gallardo.