XSPExpressionParser.java

Antonio Gallardo Sun, 04 Sep 2005 23:43:21 -0700

Pier Fumagalli wrote:

On 5 Sep 2005, at 01:53, Antonio Gallardo wrote:
Pier Fumagalli wrote:
Depending on your platform encoding (yours apparently ISO8859-1, mine UTF-8, my wife's - she's Japanese - Shift-JIS), that sequence (B4) of BYTES as in the original source code will be interpreted as a different character.
The char encoding Shift-JIS (JIS X 0201:1997 or JIS X 0208:1997) is exactly the same as using ISO-8859-1. We need to keep the sources in UNICODE, and there is also for Japanese: Hiragana, Katakana, et al.: http://www.unicode.org/charts/
Err... Ehmmm... No... The character in question (Latin-1 character B4, Acute Accent) is encoded in ISO8859-1 as the byte sequence "B4", while in Shift-JIS the same character is encoded as byte sequence "814C", quite different.
Reading the byte sequence "B4" in Shift-JIS will produce Unicode character FF74 (Halfwidth Katakana "E"), which is quite different from an acute accent as you intended.
Trust me, I've been doing this for 9 years! :-)
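
The point above about byte B4 can be checked with a few lines of Java. This is only a minimal sketch: the class and method names (ByteB4Demo, decodeB4, codePoint) are made up for illustration, and the Shift_JIS charset is assumed to be installed in the JVM, which is not guaranteed, so the code checks for it first.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ByteB4Demo {
    // Decode the single byte 0xB4 using the given charset.
    static String decodeB4(Charset charset) {
        return new String(new byte[] { (byte) 0xB4 }, charset);
    }

    // Format the first code point of a string as U+XXXX.
    static String codePoint(String s) {
        return String.format("U+%04X", s.codePointAt(0));
    }

    public static void main(String[] args) {
        // ISO-8859-1 maps every byte directly to the code point of the same value.
        System.out.println("ISO-8859-1: " + codePoint(decodeB4(StandardCharsets.ISO_8859_1)));

        // A lone 0xB4 is not valid UTF-8, so the decoder substitutes U+FFFD.
        System.out.println("UTF-8:      " + codePoint(decodeB4(StandardCharsets.UTF_8)));

        // Shift_JIS is an optional charset; only try it if this JVM provides it.
        if (Charset.isSupported("Shift_JIS")) {
            System.out.println("Shift_JIS:  " + codePoint(decodeB4(Charset.forName("Shift_JIS"))));
        } else {
            System.out.println("Shift_JIS is not available in this JVM");
        }
    }
}
```

The same byte yields a different character under each encoding, which is why a raw B4 byte in a source file is not portable across platforms.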

Yes, I believe you. :-) When I said that using Shift-JIS and ISO-8859-1 is the same, I had in mind that neither of them represents the full Unicode spectrum. I was just trying to show this problem in another charset. So in fact we are facing the same problem. Of course I am aware that both codesets (Shift-JIS and ISO-8859-1) are different UNICODE subsets. This is the same as you stated.

Changing the binary sequence B4 to \u00B4 instructs the JVM that no matter what encoding your platform is set to, the resulting character will always (always) be UNICODE 00B4, the Acute Accent, part of the Latin-1 (0X0080) table.
If we wrote the code in UNICODE you would have the same effect. It is exactly the same as with XML, isn't it?
Unicode is simply a list of characters. To save them on a disk, you _need_ to use an encoding. Unicode characters are 32 bits long (they were 16 bits until Unicode 4 came along, but that ain't important right now), bytes are 8 bits long. It's as easy as that. To represent 32 bits in 8, you need to "compress" them (or as said in I18N, "encode" them).
Some encodings are complete (such as the family of UTF encodings), meaning that the encoding CAN represent ALL Unicode characters; some are not (such as ISO8859-1, which can represent only Unicode characters from 00 to FF).
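
To make the escape and the "complete vs. incomplete encoding" distinction concrete, here is a small Java sketch. The class and helper names (EncodeAcuteDemo, hex, encode, canEncode) are hypothetical and chosen only for this example; the charsets used are the ones every JVM is required to provide.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodeAcuteDemo {
    // Written as a Unicode escape, so the value does not depend on the
    // encoding used to read this source file.
    static final String ACUTE = "\u00B4";

    // Render a byte array as upper-case hex, two digits per byte.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    // Encode a string with the given charset and show the resulting bytes.
    static String encode(String s, Charset charset) {
        return hex(s.getBytes(charset));
    }

    // Ask the charset whether it can represent every character of the string.
    static boolean canEncode(String s, Charset charset) {
        return charset.newEncoder().canEncode(s);
    }

    public static void main(String[] args) {
        System.out.println("ISO-8859-1 bytes: " + encode(ACUTE, StandardCharsets.ISO_8859_1));
        System.out.println("UTF-8 bytes:      " + encode(ACUTE, StandardCharsets.UTF_8));

        // U+65E5 is a CJK ideograph, well outside the 00..FF range.
        String kanji = "\u65E5";
        System.out.println("ISO-8859-1 can encode U+65E5: " + canEncode(kanji, StandardCharsets.ISO_8859_1));
        System.out.println("UTF-8 can encode U+65E5:      " + canEncode(kanji, StandardCharsets.UTF_8));
    }
}
```

The acute accent comes out as one byte in ISO-8859-1 and two bytes in UTF-8, while the ideograph can only be written by the UTF-8 encoder.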

Yes. Please correct me here if I am wrong: does our SVN use UTF-8 as the default charset (or encoding) or not? If not, then we need to take care not only of the Java sources but also of the chars above 7F in the XML files.

I have a special interest in that, since we write mostly Spanish messages. I would like to know if this is needed or not.


Best Regards,

Antonio Gallardo.

Re: svn commit: r278641 - /cocoon/branches/BRANCH_2_1_X/src/blocks/xsp/java/org/apache/cocoon/components/language/markup/xsp/XSPExpressionParser.java
