Pier Fumagalli wrote:
On 5 Sep 2005, at 00:33, Antonio Gallardo wrote:
[EMAIL PROTECTED] wrote:
Author: pier
Date: Sun Sep 4 16:29:09 2005
New Revision: 278641
URL: http://svn.apache.org/viewcvs?rev=278641&view=rev
Log:
Fixing wrong encoding bug
Modified:
cocoon/branches/BRANCH_2_1_X/src/blocks/xsp/java/org/apache/cocoon/components/language/markup/xsp/XSPExpressionParser.java
@@ -211,7 +211,7 @@
parser.setState(EXPRESSION_CHAR_STATE);
break;
- case '�':
+ case '\u00B4':
parser.append(ch);
parser.setState(EXPRESSION_SHELL_STATE);
break;
@@ -235,10 +235,10 @@
protected static final State EXPRESSION_CHAR_STATE = new QuotedState('\'');
/**
- * The parser has encountered '�' in <code>{@link EXPRESSION_STATE}</code>
- * to start a Python string constant.
+ * The parser has encountered '\u00B4' (Unicode Latin-1 Acute Accent) in
+ * <code>{@link EXPRESSION_STATE}</code> to start a Python string constant.
*/
- protected static final State EXPRESSION_SHELL_STATE = new QuotedState('�');
+ protected static final State EXPRESSION_SHELL_STATE = new QuotedState('\u00B4');
Why not just leave the original char as it was before your first
change? It was working. Having a UTF-8 character there is, IMO, not good.
It's not a UTF-8 character, it's a Unicode character: \u doesn't
mean "UTF" but rather "Unicode" (which is not an encoding).
First, I apologize: I worded my previous sentence very badly. I meant
that I don't see a reason to use Java "Unicode escaping" in this case.
The Java Language Specification says [0]: "Programs are written using
the Unicode character set." So IMO Unicode 00B4, the Acute Accent in
Latin-1, should be represented by only one code.
Depending on your platform encoding (yours apparently ISO8859-1, mine
UTF-8, my wife's, she's Japanese, Shift-JIS), that sequence (B4) of
BYTES as in the original source code will be interpreted as a
different character.
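A minimal sketch of that point, assuming a standard JDK: it decodes the
single byte B4 under two of the charsets named above. The class name
ByteB4Demo and the helper decode() are made up for this illustration and
are not part of Cocoon. Shift-JIS is left out only because the JDK does
not guarantee that charset on every installation.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ByteB4Demo {
    // Decode the single byte 0xB4 with the given charset and return
    // the first UTF-16 code unit of the resulting string.
    static char decode(Charset charset) {
        byte[] raw = { (byte) 0xB4 };
        return new String(raw, charset).charAt(0);
    }

    public static void main(String[] args) {
        // In ISO-8859-1 the byte maps directly to U+00B4 (acute accent).
        System.out.printf("ISO-8859-1: U+%04X%n",
                (int) decode(StandardCharsets.ISO_8859_1));
        // In UTF-8 a lone 0xB4 is a malformed sequence, so the decoder
        // substitutes U+FFFD (replacement character).
        System.out.printf("UTF-8:      U+%04X%n",
                (int) decode(StandardCharsets.UTF_8));
    }
}
```

The same byte yields two different characters, which is the situation a
compiler is in when it reads a source file under a platform encoding
other than the one the file was saved with.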
The Shift-JIS character encoding (JIS X 0201:1997 or JIS X 0208:1997) is
exactly the same as using ISO-8859-1. We need to keep the sources in
Unicode, which also covers Japanese: Hiragana, Katakana, et al.:
http://www.unicode.org/charts/
Changing the binary sequence B4 to \u00B4 instructs the JVM that no
matter what encoding your platform is set to, the resulting character
will always (always) be UNICODE 00B4, the Acute Accent, part of the
Latin-1 (0X0080) table.
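As a small illustration of the escape itself, here is a sketch (the
class name EscapeDemo and the constant ACUTE are hypothetical, chosen
only for this example). The Java compiler translates Unicode escapes
before tokenizing, so the literal below denotes code point 00B4 even
though the source file holds nothing but ASCII bytes.

```java
public class EscapeDemo {
    // Written as a Unicode escape, so the source file itself contains
    // only ASCII bytes and reads the same under any ASCII-compatible
    // platform encoding.
    static final char ACUTE = '\u00B4';

    public static void main(String[] args) {
        // The escape denotes code point 0xB4: prints true.
        System.out.println((int) ACUTE == 0xB4);
        // The character belongs to the Latin-1 Supplement block,
        // which starts at U+0080: prints true.
        System.out.println(Character.UnicodeBlock.of(ACUTE)
                == Character.UnicodeBlock.LATIN_1_SUPPLEMENT);
    }
}
```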
If we write the code in Unicode, we get the same effect. It is
exactly the same as with XML, isn't it?
Let's call it defensive programming. In the source code we should
actually be using only characters in the range 00-7F (Unicode
Basic Latin, encoding US-ASCII), as that's the "most common" subset
amongst all the different encodings (though with IBM's EBCDIC
even that range might cause problems in some cases).
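One way to picture that rule is a check for characters outside 00-7F.
This is only a sketch: the class AsciiCheck and the method
firstNonAscii() are invented for the example and are not an existing
Cocoon or JDK facility.

```java
public class AsciiCheck {
    // Returns the index of the first char above 0x7F,
    // or -1 when the text is pure US-ASCII.
    static int firstNonAscii(CharSequence text) {
        for (int i = 0; i < text.length(); i++) {
            if (text.charAt(i) > 0x7F) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // The escaped form, as it appears in the source file
        // (backslash, 'u', four hex digits): pure ASCII, prints -1.
        System.out.println(firstNonAscii("case '\\u00B4':"));
        // The same line holding the real acute accent: prints 6.
        System.out.println(firstNonAscii("case '\u00B4':"));
    }
}
```

Running such a check over the text of a source file would flag the
original line with the raw accent and accept the escaped version.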
I am sorry, but I do not like to cover the sun with a finger.
I believe Thorsten Schalab can tell us more about this topic. ;-)
Best Regards,
Antonio Gallardo.
[0]
http://java.sun.com/docs/books/jls/second_edition/html/lexical.doc.html#95413