XSPExpressionParser.java

Antonio Gallardo Sun, 04 Sep 2005 17:53:44 -0700

Pier Fumagalli wrote:

On 5 Sep 2005, at 00:33, Antonio Gallardo wrote:
[EMAIL PROTECTED] wrote:
Author: pier
Date: Sun Sep  4 16:29:09 2005
New Revision: 278641

URL: http://svn.apache.org/viewcvs?rev=278641&view=rev
Log:
Fixing wrong encoding bug

Modified:
cocoon/branches/BRANCH_2_1_X/src/blocks/xsp/java/org/apache/cocoon/components/language/markup/xsp/XSPExpressionParser.java
@@ -211,7 +211,7 @@
                    parser.setState(EXPRESSION_CHAR_STATE);
                    break;
-                case 'ï¿½':
+                case '\u00B4':
                    parser.append(ch);
                    parser.setState(EXPRESSION_SHELL_STATE);
                    break;
@@ -235,10 +235,10 @@
protected static final State EXPRESSION_CHAR_STATE = newQuotedState('\'');
    /**
- * The parser has encountered 'ï¿½' in <code>[EMAIL PROTECTED]EXPRESSION_STATE}</code>
-     * to start a Python string constant.
+     * The parser has encountered '\u00B4' (Unicode Latin-1 Acute Accent) in
+     * <code>[EMAIL PROTECTED] EXPRESSION_STATE}</code> to start a Python string constant.
     */
-    protected static final State EXPRESSION_SHELL_STATE = newQuotedState('ï¿½');
+    protected static final State EXPRESSION_SHELL_STATE = newQuotedState('\u00B4');
Why not just leave the original char as it was before your first change? It was working. Having a UTF-8 character there is, IMO, not good.
It's not a UTF-8 character, it's a UNICODE character: \u doesn't mean "UTF" but rather "UNICODE" (which is not an encoding).

First, I apologize: I phrased my previous sentence very badly. I wanted to say that I don't see a reason to use Java "Unicode escaping" in this case. Reading the Java Language Specification, we find [0]: "Programs are written using the Unicode character set." So, IMO, UNICODE 00B4, the Acute Accent in Latin-1, should be represented by only one code.

Depending on your platform encoding (yours apparently ISO-8859-1, mine UTF-8, my wife's - she's Japanese - Shift-JIS), that sequence (B4) of BYTES, as in the original source code, will be interpreted as a different character.
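
The point about the raw byte can be checked with a small, hypothetical sketch (the class and method names below are illustrative and are not part of Cocoon). It decodes the single byte 0xB4 under the three encodings mentioned above. It assumes the runtime ships the Shift_JIS charset, which a minimal JRE may not.

```java
import java.nio.charset.Charset;

// Hypothetical demo class, not part of the Cocoon code base.
class ByteB4Demo {
    /** Decodes the single byte 0xB4 using the named charset. */
    static String decode(String charsetName) {
        byte[] raw = { (byte) 0xB4 };
        // Malformed input is replaced with U+FFFD rather than throwing.
        return new String(raw, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        // Assumption: all three charsets are available in this runtime.
        for (String name : new String[] { "ISO-8859-1", "Shift_JIS", "UTF-8" }) {
            String s = decode(name);
            System.out.printf("%-10s -> U+%04X%n", name, (int) s.charAt(0));
        }
    }
}
```

Under ISO-8859-1 the byte decodes to U+00B4, while under UTF-8 a lone 0xB4 is a malformed sequence and comes back as the replacement character U+FFFD, so the same source bytes yield different `char` values depending on how the file is read.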

Using the Shift-JIS char encoding (JIS X 0201:1997 or JIS X 0208:1997) is exactly the same as using ISO-8859-1. We need to keep the sources in UNICODE, which also covers Japanese: Hiragana, Katakana, et al.: http://www.unicode.org/charts/

Changing the binary sequence B4 to \u00B4 instructs the JVM that no matter what encoding your platform is set to, the resulting character will always (always) be UNICODE 00B4, the Acute Accent, part of the Latin-1 (0X0080) table.
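
A minimal sketch of that behaviour, using a hypothetical class name: the Unicode escape is translated when the source is compiled, so the resulting `char` does not depend on the encoding used to read the file, and the file itself stays pure ASCII.

```java
// Hypothetical demo class, not part of the Cocoon code base.
class AcuteEscapeDemo {
    // Written as a Unicode escape, so the source contains only ASCII bytes.
    static final char ACUTE = '\u00B4';

    /** Returns true when ch is the acute accent, U+00B4. */
    static boolean isAcute(char ch) {
        switch (ch) {
            case '\u00B4':
                return true;
            default:
                return false;
        }
    }

    public static void main(String[] args) {
        // The escape and the numeric code point denote the same char.
        System.out.println(ACUTE == (char) 0xB4);
        System.out.println(isAcute((char) 0xB4));
        // The ASCII apostrophe (0x27) is a different character.
        System.out.println(isAcute('\''));
    }
}
```

The `switch` mirrors the shape of the `case '\u00B4':` line in the diff, but it is only an illustration and not the parser's actual code.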

If we wrote the code in UNICODE we would have the same effect. It is exactly the same as with XML, isn't it?

Let's call it defensive programming, and actually, in the source code, we should be using only characters in the range 00-7F (Unicode BASIC-Latin, encoding US-ASCII), as that's the "most common" range amongst all the different encodings (although, thinking about IBM's EBCDIC, even that one might have problems in some cases).
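
One way to apply the "00-7F only" rule is sketched below. The helper is hypothetical: it rewrites every character above 0x7F as a backslash-u escape, in the spirit of the JDK's `native2ascii` tool, and is not something the commit itself uses.

```java
// Hypothetical helper, not part of the Cocoon code base.
class AsciiEscaper {
    /** Replaces each char above 0x7F with its four-digit Unicode escape. */
    static String escapeNonAscii(String source) {
        StringBuilder out = new StringBuilder(source.length());
        for (int i = 0; i < source.length(); i++) {
            char ch = source.charAt(i);
            if (ch <= 0x7F) {
                out.append(ch);               // already US-ASCII, keep as is
            } else {
                // Emit a literal backslash, 'u', and four hex digits.
                out.append(String.format("\\u%04X", (int) ch));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Build the input from a code point so this file stays ASCII too.
        String line = "case '" + (char) 0xB4 + "':";
        System.out.println(escapeNonAscii(line));
    }
}
```

Characters outside the Basic Multilingual Plane are stored as two `char` values in Java, so this sketch escapes each surrogate separately, which is also how such characters are written with escapes in Java source.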


I am sorry, but I do not like to cover the sun with a finger.

I believe Thorsten Schalab can tell us more about this topic. ;-)

Best Regards,

Antonio Gallardo.

[0] http://java.sun.com/docs/books/jls/second_edition/html/lexical.doc.html#95413

Re: svn commit: r278641 - /cocoon/branches/BRANCH_2_1_X/src/blocks/xsp/java/org/apache/cocoon/components/language/markup/xsp/XSPExpressionParser.java
