Pier Fumagalli wrote:

On 5 Sep 2005, at 00:33, Antonio Gallardo wrote:

[EMAIL PROTECTED] wrote:

Author: pier
Date: Sun Sep  4 16:29:09 2005
New Revision: 278641

URL: http://svn.apache.org/viewcvs?rev=278641&view=rev
Log:
Fixing wrong encoding bug

Modified:
cocoon/branches/BRANCH_2_1_X/src/blocks/xsp/java/org/apache/cocoon/components/language/markup/xsp/XSPExpressionParser.java

@@ -211,7 +211,7 @@
                    parser.setState(EXPRESSION_CHAR_STATE);
                    break;
-                case '�':
+                case '\u00B4':
                    parser.append(ch);
                    parser.setState(EXPRESSION_SHELL_STATE);
                    break;
@@ -235,10 +235,10 @@
protected static final State EXPRESSION_CHAR_STATE = new QuotedState('\'');
    /**
- * The parser has encountered '�' in <code>[EMAIL PROTECTED] EXPRESSION_STATE}</code>
-     * to start a Python string constant.
+ * The parser has encountered '\u00B4' (Unicode Latin-1 Acute Accent) in
+     * <code>[EMAIL PROTECTED] EXPRESSION_STATE}</code> to start a Python string constant.
     */
- protected static final State EXPRESSION_SHELL_STATE = new QuotedState('�');
+ protected static final State EXPRESSION_SHELL_STATE = new QuotedState('\u00B4');


Why not just leave the original char as it was before your first change? It was working. Having a UTF-8 character there is not good, IMO.


It's not a UTF-8 character, it's a UNICODE character: \u doesn't mean "UTF" but rather "UNICODE" (which is not an encoding).

First, my apologies: I phrased my previous sentence very badly. I wanted to say that I don't see a reason to use Java "Unicode escaping" in this case. The Java Language Specification says [0]: "Programs are written using the Unicode character set." So IMO UNICODE 00B4, the Acute Accent in Latin-1, should be represented by only one code.

Depending on your platform encoding (yours apparently ISO8859-1, mine UTF-8, my wife's, she's Japanese, Shift-JIS), that sequence (B4) of BYTES as in the original source code will be interpreted as a different character.
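The point about platform encodings can be checked with a small sketch. It decodes the single byte B4 under a few charsets and prints the resulting code point. The class and method names (ByteB4Demo, decodeB4) are made up for illustration and are not part of Cocoon. Shift_JIS is only tried when the running JDK ships that charset.

```java
import java.nio.charset.Charset;

final class ByteB4Demo {
    /** Decodes the single byte 0xB4 with the named charset (hypothetical helper). */
    static String decodeB4(String charsetName) {
        byte[] raw = { (byte) 0xB4 };
        // Malformed input is replaced with U+FFFD instead of throwing.
        return new String(raw, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        String[] names = { "ISO-8859-1", "UTF-8", "Shift_JIS" };
        for (String name : names) {
            if (!Charset.isSupported(name)) {
                System.out.println(name + " is not available in this JDK");
                continue;
            }
            String decoded = decodeB4(name);
            // Print the code point, not the glyph, so the console encoding is irrelevant.
            System.out.printf("%-10s -> U+%04X%n", name, (int) decoded.charAt(0));
        }
        // ISO-8859-1 yields U+00B4 (acute accent).
        // UTF-8 yields U+FFFD, because a lone 0xB4 is not a valid UTF-8 sequence.
    }
}
```

A lone B4 read as UTF-8 comes out as the replacement character, which is presumably what happened to the removed lines in the diff above.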

The char encoding Shift-JIS (JIS X 0201:1997 or JIS X 0208:1997) is exactly the same as using ISO-8859-1. We need to keep the sources in UNICODE, and it also covers Japanese: Hiragana, Katakana, et al.: http://www.unicode.org/charts/


Changing the binary sequence B4 to \u00B4 instructs the JVM that no matter what encoding your platform is set to, the resulting character will always (always) be UNICODE 00B4, the Acute Accent, part of the Latin-1 (0X0080) table.
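As a minimal sketch of that claim: a Unicode escape is translated by the Java compiler before the rest of the source is parsed, so a file that spells the character as an escape contains only US-ASCII bytes and compiles to the same constant under any platform encoding. The class name AcuteEscapeDemo and its members are hypothetical, not taken from XSPExpressionParser.

```java
final class AcuteEscapeDemo {
    // Written as a Unicode escape, so this source file stays pure US-ASCII.
    static final char ACUTE = '\u00B4';

    /** True when ch is the acute accent, U+00B4. */
    static boolean isAcute(char ch) {
        return ch == ACUTE;
    }

    public static void main(String[] args) {
        // Print the code point rather than the glyph, so the console encoding does not matter.
        System.out.printf("U+%04X%n", (int) ACUTE); // prints U+00B4
    }
}
```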

If we wrote the code in UNICODE, you would have the same effect. It is exactly the same as with XML, isn't it?

Let's call it defensive programming, and actually, in the source code, we should be using only characters in the range 00-7F (Unicode BASIC-Latin, encoding US-ASCII), as that's the "most-common" amongst all different encodings (even though, when thinking about IBM's EBCDIC, even that one might have problems in some cases).
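One way such a rule could be checked mechanically is to scan a source file for any byte outside 00-7F. This is only a sketch under that assumption. AsciiSourceCheck and firstNonAscii are illustrative names, not an existing Cocoon or JDK tool.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

final class AsciiSourceCheck {
    /** Returns the offset of the first byte outside 0x00-0x7F, or -1 if every byte is US-ASCII. */
    static int firstNonAscii(byte[] data) {
        for (int i = 0; i < data.length; i++) {
            if ((data[i] & 0x80) != 0) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) throws IOException {
        // Each argument is the path of a source file to check.
        for (String path : args) {
            int pos = firstNonAscii(Files.readAllBytes(Paths.get(path)));
            System.out.println(path + ": "
                    + (pos < 0 ? "US-ASCII only" : "non-ASCII byte at offset " + pos));
        }
    }
}
```

Run against the file before the commit above, a check like this would have flagged the raw B4 byte. After the change to the escape form it would report nothing.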

I am sorry, but I do not like to cover the sun with a finger.

I believe Thorsten Schalab can tell us more about this topic. ;-)

Best Regards,

Antonio Gallardo.

[0] http://java.sun.com/docs/books/jls/second_edition/html/lexical.doc.html#95413
