On 5 Sep 2005, at 01:53, Antonio Gallardo wrote:
Pier Fumagalli wrote:

It's not a UTF-8 character, it's a UNICODE character: \u doesn't mean "UTF" but rather "UNICODE" (which is not an encoding).

First, I apologize, because I worded the previous phrase very badly. I meant to say that I don't see a reason to use Java "Unicode escaping" in this case. Reading the Java Language Specification, we find [0]: "Programs are written using the Unicode character set." So IMO UNICODE 00B4, the Acute Accent in Latin-1, should be represented by only one code.

Programs are written using (yes) the UNICODE character set, but source ".java" files are not. If you look at the output of "javac -help" it will say:

-encoding <encoding> Specify character encoding used by source files

So, the encoding parameter (which, methinks, defaults to the platform's default encoding) tells javac how to interpret the byte stream, and then the decoded UNICODE character stream is parsed. Much like:

InputStream javaSource = new FileInputStream(javaSourceFile);
Reader reader = new InputStreamReader(javaSource, encodingSpecifiedInCommandLine);
parse(reader);

So, programs are written using UNICODE characters, yes; the source files, though, are encoded using some sort of encoding mechanism (UTF-8, UTF-16, Shift-JIS, blablabla).

In our case the sequence "\u00B4" has the same byte representation (5c 75 30 30 62 34) in almost all encodings (UTF-8, US-ASCII, ISO8859-1, Shift-JIS, ...), as it's composed of bytes in the range 00 to 7F (which hardly change in whatever encoding you put them into).

Java uses this syntax to represent a UNICODE character, because with most of the encodings you can use, it won't normally change its UNICODE meaning.

That said, this is NOT a safe mechanism, because if (for example) you were to read the byte sequence "5c 75 30 30 62 34" using the EBCDIC encoding (IBM's mainframes encoding) you wouldn't read "backslash" "letter u" "zero" "zero" "letter b" "four", but you would read something quite different: "asterisk" "nil" "nil" "nil" "nil" "pn".
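To see this concretely, here's a small sketch (the charset names are the standard JDK ones; IBM037 is just one EBCDIC flavour, and the extended charsets may or may not ship with your JDK) that decodes that very byte sequence under several encodings:

```java
import java.nio.charset.Charset;

public class EscapeBytes {
    public static void main(String[] args) {
        // The bytes of the six characters "\u00b4" as they sit in a source file
        byte[] bytes = { 0x5c, 0x75, 0x30, 0x30, 0x62, 0x34 };
        for (String name : new String[] { "US-ASCII", "ISO-8859-1", "UTF-8", "Shift_JIS", "IBM037" }) {
            if (Charset.isSupported(name)) {
                System.out.println(name + " -> " + new String(bytes, Charset.forName(name)));
            }
        }
        // The ASCII-compatible encodings all print the literal text \u00b4;
        // the EBCDIC one prints six unrelated characters.
    }
}
```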

For an example of the EBCDIC encoding, look here: http://www.dynamoo.com/technical/ebcdic.htm

Depending on your platform encoding (yours apparently ISO8859-1, mine UTF-8, my wife's - she's Japanese - Shift-JIS) that sequence (B4) of BYTES in the original source code will be interpreted as a different character.

The char encoding Shift-JIS (JIS X 0201:1997 or JIS X 0208:1997) is exactly the same as using ISO-8859-1. We need to keep the sources in UNICODE, and there are also charts for Japanese: Hiragana, Katakana, et al: http://www.unicode.org/charts/

Err... Ehmmm.. No... The character in question (Latin-1 character B4, Acute Accent) is encoded in ISO8859-1 as the byte sequence "B4", while in Shift-JIS the same character is encoded as the byte sequence "81 4C", quite different.

Reading the byte sequence "B4" in Shift-JIS will produce Unicode character FF74 (Halfwidth Katakana "E"), which is quite different from the acute accent you intended.
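A quick sketch of that mismatch (standard JDK charset names assumed):

```java
import java.nio.charset.Charset;

public class ByteB4 {
    public static void main(String[] args) {
        // The single byte B4, decoded two ways
        byte[] bytes = { (byte) 0xB4 };
        char latin = new String(bytes, Charset.forName("ISO-8859-1")).charAt(0);
        char sjis  = new String(bytes, Charset.forName("Shift_JIS")).charAt(0);
        System.out.printf("ISO-8859-1 -> U+%04X%n", (int) latin); // U+00B4, the acute accent
        System.out.printf("Shift_JIS  -> U+%04X%n", (int) sjis);  // U+FF74, halfwidth katakana
    }
}
```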

Trust me, I've been doing this for nine years! :-)

Changing the binary sequence B4 to \u00B4 instructs the compiler that no matter what encoding your platform is set to, the resulting character will always (always) be UNICODE 00B4, the Acute Accent, part of the Latin-1 Supplement table (starting at U+0080).
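As a minimal illustration (the class name is just for the example): the escape is resolved by the compiler, before any platform encoding gets involved, so the character is pinned to that code point.

```java
public class AcuteAccent {
    public static void main(String[] args) {
        // javac turns the escape into the character U+00B4 at translation time
        char c = '\u00B4';
        System.out.printf("U+%04X%n", (int) c); // U+00B4
        System.out.println(c == '\u00B4');      // true
    }
}
```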

If we wrote the code in UNICODE we would have the same effect. It is exactly the same as with XML, isn't it?

Unicode is simply a list of characters. To save them on a disk, you _need_ to use an encoding. Unicode code points no longer fit in 16 bits and are usually handled as 32 (they did fit in 16 before the supplementary planes came along, but that ain't important right now), while bytes are 8 bits long. It's as easy as that. To represent 32 bits in 8, you need to "compress" them (or, as it's said in I18N, "encode" them).
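For instance, UTF-8 "compresses" each code point into one to four bytes, depending on how big it is; a small sketch (plain String.getBytes with a JDK charset; the sample characters are my own picks):

```java
import java.nio.charset.Charset;

public class Utf8Lengths {
    public static void main(String[] args) {
        Charset utf8 = Charset.forName("UTF-8");
        // One code point each; the UTF-8 byte count grows with the code point value
        System.out.println("A -> " + "A".getBytes(utf8).length + " byte");                    // 1
        System.out.println("\u00B4 -> " + "\u00B4".getBytes(utf8).length + " bytes");         // 2 (acute accent)
        System.out.println("\u3042 -> " + "\u3042".getBytes(utf8).length + " bytes");         // 3 (Hiragana A)
        System.out.println("U+10400 -> " + "\uD801\uDC00".getBytes(utf8).length + " bytes");  // 4 (a supplementary-plane character)
    }
}
```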

Some encodings are complete (such as the family of UTF encodings) meaning that the encoding CAN represent ALL Unicode characters, some are not (such as ISO8859-1 which can represent only Unicode characters from 00 to FF).
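You can check whether a given encoding covers a given character with the standard CharsetEncoder.canEncode; a small sketch:

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

public class EncodingCoverage {
    public static void main(String[] args) {
        CharsetEncoder latin1 = Charset.forName("ISO-8859-1").newEncoder();
        CharsetEncoder utf8   = Charset.forName("UTF-8").newEncoder();

        System.out.println(latin1.canEncode('\u00B4')); // true  - the acute accent is within 00-FF
        System.out.println(latin1.canEncode('\u3042')); // false - Hiragana A is outside Latin-1
        System.out.println(utf8.canEncode('\u3042'));   // true  - the UTF family covers everything
    }
}
```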

Comparing Unicode to an encoding is like comparing an apple to the speed of light: there's nothing in common between the two, but if you say that an apple is 1 meter per second, you can say that the speed of light is (roughly) 299,792,458 apples.

Let's call it defensive programming, and actually, in the source code, we should be using only characters in the range 00-7F (Unicode Basic Latin, encoding US-ASCII), as that's the most common range across all the different encodings (even if, thinking about IBM's EBCDIC, even that one might have some problems in some cases).

I am sorry, but I do not like to cover the sun with a finger.

???

 I believe Thorsten Schalab can tell us more about this topic. ;-)

Nah, I'm pretty confident that on this little nag, I'm right...

    Pier
