Re: svn commit: r278641 - /cocoon/branches/BRANCH_2_1_X/src/blocks/xsp/java/org/apache/cocoon/components/language/markup/xsp/XSPExpressionParser.java
Pier Fumagalli wrote: On 5 Sep 2005, at 01:53, Antonio Gallardo wrote: Pier Fumagalli wrote:

Depending on your platform encoding (yours apparently ISO8859-1, mine UTF-8, my wife's -she's Japanese- Shift-JIS) that sequence (B4) of BYTES as in the original source code will be interpreted as a different character.

The char encoding Shift-JIS (JIS X 0201:1997 or JIS X 0208:1997) is exactly the same as using ISO-8859-1. We need to keep the sources in UNICODE and there is also for Japanese: Hiragana, Katakana, et al: http://www.unicode.org/charts/

Err... Ehmmm.. No... The character in question (Latin-1 character B4, Acute Accent) is encoded in ISO8859-1 as the byte sequence B4, while in Shift-JIS the same character is encoded as the byte sequence 81 4C, quite different. Reading the byte sequence B4 in Shift-JIS will produce Unicode character FF74 (Halfwidth Katakana E), which is quite different from the acute accent you intended. Trust me, I've been doing this for 9 years! :-)

Yes, I believe you. :-) When I said that using Shift-JIS and ISO-8859-1 is the same, I had in mind that neither of them represents the full Unicode spectrum. I was just trying to show this problem in another charset. So in fact we have the same problem. Of course I am aware that both code sets (Shift-JIS and ISO-8859-1) are different UNICODE subsets. This is the same as you stated.

Changing the binary sequence B4 to \u00B4 instructs the JVM that no matter what encoding your platform is set to, the resulting character will always (always) be UNICODE 00B4, the Acute Accent, part of the Latin-1 (0X0080) table.

If we wrote the code in UNICODE you would have the same effect. It is exactly the same as with XML, isn't it?

Unicode is simply a list of characters. To save them on a disk, you _need_ to use an encoding. Unicode characters are 32 bits long (they were 16 bits until Unicode 4 came along, but that ain't important right now), bytes are 8 bits long. It's as easy as that. To represent 32 bits in 8, you need to compress them (or, as they say in I18N, encode them). Some encodings are complete (such as the family of UTF encodings), meaning that the encoding CAN represent ALL Unicode characters; some are not (such as ISO8859-1, which can represent only Unicode characters from 00 to FF).

Yes. Please correct me here if I am wrong: does our SVN use UTF-8 as the default charset (or encoding) or not? If not, then we need to take care not only of the Java sources but also of the chars above 7F in the XML files. I have a special interest in that, since we write mostly Spanish messages. I would like to know whether this is needed or not.

Best Regards, Antonio Gallardo.
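The byte-versus-character distinction discussed in this message can be tried out directly. The following is a minimal Java sketch, not code from Cocoon; the class and method names (EncodingDemo, encodeHex, decodeB4) are made up for illustration. It encodes U+00B4 under several charsets and decodes the lone byte B4 under each. The Shift_JIS part is guarded because that charset is not guaranteed to be present in every Java runtime.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Illustrative helper; names are hypothetical and not part of Cocoon.
final class EncodingDemo {

    /** Encodes text with the given charset and renders the bytes as upper-case hex. */
    static String encodeHex(String text, Charset charset) {
        StringBuilder hex = new StringBuilder();
        for (byte b : text.getBytes(charset)) {
            if (hex.length() > 0) {
                hex.append(' ');
            }
            hex.append(String.format("%02X", b & 0xFF));
        }
        return hex.toString();
    }

    /** Decodes the single byte 0xB4 with the given charset and returns the first code point. */
    static int decodeB4(Charset charset) {
        return new String(new byte[] {(byte) 0xB4}, charset).codePointAt(0);
    }

    public static void main(String[] args) {
        String acute = "\u00B4"; // ACUTE ACCENT, written as an escape so this file stays ASCII

        System.out.println(encodeHex(acute, StandardCharsets.ISO_8859_1)); // B4
        System.out.println(encodeHex(acute, StandardCharsets.UTF_8));      // C2 B4

        // The same byte read back under two different assumptions:
        System.out.printf("ISO-8859-1: U+%04X%n", decodeB4(StandardCharsets.ISO_8859_1));
        // A lone B4 is malformed UTF-8, so the decoder substitutes U+FFFD.
        System.out.printf("UTF-8:      U+%04X%n", decodeB4(StandardCharsets.UTF_8));

        // Shift_JIS is an optional charset, so check before using it.
        if (Charset.isSupported("Shift_JIS")) {
            Charset sjis = Charset.forName("Shift_JIS");
            System.out.println("Shift_JIS bytes: " + encodeHex(acute, sjis));
            System.out.printf("Shift_JIS: U+%04X%n", decodeB4(sjis));
        }
    }
}
```

The UTF-8 case is also a plausible explanation for replacement characters showing up where a Latin-1 byte was read as UTF-8.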
Re: svn commit: r278641 - /cocoon/branches/BRANCH_2_1_X/src/blocks/xsp/java/org/apache/cocoon/components/language/markup/xsp/XSPExpressionParser.java
On 5 Sep 2005, at 04:44, David Crossley wrote: Pier Fumagalli wrote: Nah, I'm pretty confident that on this little nag, I'm right... Does anyone have a pier2doc transformer? I need to get into a meeting right now, but the first part of the pier2doc translation of this thread is here: http://www.betaversion.org/~pier/wiki/display/pier/Unicode+and+Encodings Pier
Re: svn commit: r278641 - /cocoon/branches/BRANCH_2_1_X/src/blocks/xsp/java/org/apache/cocoon/components/language/markup/xsp/XSPExpressionParser.java
On Monday 05 September 2005 14:43, Antonio Gallardo wrote:

Of course I am aware that both code sets (Shift-JIS and ISO-8859-1) are different UNICODE subsets. This is the same as you stated.

No. Pier doesn't mix up Unicode (a sequence of characters) with the mapping of those characters to fixed- or variable-length encoded byte streams. The fact that character 65 in Unicode is mapped to the byte value 65 in many encodings is for convenience only, and that fact should be ignored.

Our SVN uses UTF-8 as the default charset (or encoding) or not?

Subversion uses binary data, and is agnostic to any encodings in the data (or so they say). AFAIU, marking files as text only deals with the line endings and how the diff mails are generated. The --encoding argument applies to commit messages. Paths and URLs/URIs have additional encoding requirements.

Cheers
Niclas
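The remark about character 65 can be checked with a small Java sketch. The class and method names (LetterA, encodeA) are hypothetical. It shows that "A" becomes the single byte 65 in several common charsets but not in all of them. The EBCDIC code page IBM037 is an optional charset, so that part is guarded.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Illustrative only; names are hypothetical.
final class LetterA {

    /** Returns the bytes that represent U+0041 ("A") in the given charset. */
    static byte[] encodeA(Charset charset) {
        return "A".getBytes(charset);
    }

    public static void main(String[] args) {
        // Many encodings agree on the single byte 65 (0x41)...
        System.out.println(Arrays.toString(encodeA(StandardCharsets.US_ASCII)));   // [65]
        System.out.println(Arrays.toString(encodeA(StandardCharsets.ISO_8859_1))); // [65]
        System.out.println(Arrays.toString(encodeA(StandardCharsets.UTF_8)));      // [65]

        // ...but that is a property of those encodings, not of Unicode itself.
        System.out.println(Arrays.toString(encodeA(StandardCharsets.UTF_16BE)));   // [0, 65]

        // EBCDIC code page 037 is optional in Java runtimes, hence the guard.
        if (Charset.isSupported("IBM037")) {
            System.out.println(Arrays.toString(encodeA(Charset.forName("IBM037"))));
        }
    }
}
```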
Re: svn commit: r278641 - /cocoon/branches/BRANCH_2_1_X/src/blocks/xsp/java/org/apache/cocoon/components/language/markup/xsp/XSPExpressionParser.java
Pier Fumagalli wrote: David Crossley wrote: Pier Fumagalli wrote: Nah, I'm pretty confident that on this little nag, I'm right... Does anyone have a pier2doc transformer? I need to get into a meeting right now, but the first part of the pier2doc translation of this thread is here: http://www.betaversion.org/~pier/wiki/display/pier/Unicode+and+Encodings Ah, that was easy for us. I knew there would be one lying around somewhere. :-) Thanks. Ever grateful. -David
Re: svn commit: r278641 - /cocoon/branches/BRANCH_2_1_X/src/blocks/xsp/java/org/apache/cocoon/components/language/markup/xsp/XSPExpressionParser.java
Niclas Hedhman wrote: On Monday 05 September 2005 14:43, Antonio Gallardo wrote:

Of course I am aware that both code sets (Shift-JIS and ISO-8859-1) are different UNICODE subsets. This is the same as you stated.

No. Pier doesn't mix up Unicode (a sequence of characters) with the mapping of those characters to fixed- or variable-length encoded byte streams. The fact that character 65 in Unicode is mapped to the byte value 65 in many encodings is for convenience only, and that fact should be ignored.

Our SVN uses UTF-8 as the default charset (or encoding) or not?

Subversion uses binary data, and is agnostic to any encodings in the data (or so they say). AFAIU, marking files as text only deals with the line endings and how the diff mails are generated.

The problem is the interpretation of the line ending. On Unix, it's 0x0A, which can be part of a multibyte character in some encodings (UTF-16, for example; in UTF-8 every byte of a multibyte sequence is 0x80 or above). In such a case, although the file is a text file, setting the eol-style=native property may well break the file... Or is there a way to specify the encoding to SVN?

Sylvain

--
Sylvain Wallez
Anyware Technologies
http://people.apache.org/~sylvain
http://www.anyware-tech.com
Apache Software Foundation Member
Research & Technology Director
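Whether a line-feed byte can hide inside a multibyte character depends on the encoding. The following minimal Java sketch (hypothetical names LineFeedBytes, containsLfByte, utf8NeverEmbedsLf) checks this for UTF-8 and UTF-16BE. It says nothing about how Subversion itself processes svn:eol-style; it only looks at the byte patterns.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Illustrative only; names are hypothetical.
final class LineFeedBytes {

    /** Returns true if the encoded form of text contains the byte 0x0A. */
    static boolean containsLfByte(String text, Charset charset) {
        for (byte b : text.getBytes(charset)) {
            if (b == 0x0A) {
                return true;
            }
        }
        return false;
    }

    /** Checks every code point except U+000A itself: none may produce a 0x0A byte in UTF-8. */
    static boolean utf8NeverEmbedsLf() {
        for (int cp = 0; cp <= Character.MAX_CODE_POINT; cp++) {
            boolean surrogate = cp >= 0xD800 && cp <= 0xDFFF; // not encodable on their own
            if (cp == 0x0A || surrogate) {
                continue;
            }
            String s = new String(Character.toChars(cp));
            if (containsLfByte(s, StandardCharsets.UTF_8)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String cDot = "\u010A"; // LATIN CAPITAL LETTER C WITH DOT ABOVE

        // UTF-8 encodes U+010A as C4 8A: no line-feed byte.
        System.out.println(containsLfByte(cDot, StandardCharsets.UTF_8));    // false
        // UTF-16BE encodes U+010A as 01 0A: the second byte looks like a line feed.
        System.out.println(containsLfByte(cDot, StandardCharsets.UTF_16BE)); // true

        System.out.println(utf8NeverEmbedsLf()); // true
    }
}
```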
Re: svn commit: r278641 - /cocoon/branches/BRANCH_2_1_X/src/blocks/xsp/java/org/apache/cocoon/components/language/markup/xsp/XSPExpressionParser.java
David Crossley wrote: Pier Fumagalli wrote: Nah, I'm pretty confident that on this little nag, I'm right... Does anyone have a pier2doc transformer? Why do you think this project was started? :-) -- Stefano.
Re: svn commit: r278641 - /cocoon/branches/BRANCH_2_1_X/src/blocks/xsp/java/org/apache/cocoon/components/language/markup/xsp/XSPExpressionParser.java
Niclas Hedhman wrote: On Monday 05 September 2005 14:43, Antonio Gallardo wrote:

Of course I am aware that both code sets (Shift-JIS and ISO-8859-1) are different UNICODE subsets. This is the same as you stated.

No. Pier doesn't mix up Unicode (a sequence of characters) with the mapping of those characters to fixed- or variable-length encoded byte streams. The fact that character 65 in Unicode is mapped to the byte value 65 in many encodings is for convenience only, and that fact should be ignored.

Our SVN uses UTF-8 as the default charset (or encoding) or not?

Subversion uses binary data, and is agnostic to any encodings in the data (or so they say). AFAIU, marking files as text only deals with the line endings and how the diff mails are generated. The --encoding argument applies to commit messages. Paths and URLs/URIs have additional encoding requirements.

Correct. It is also worth noting that SVN before 1.2 and CVS2SVN make a pretty broken combination when the commit message in CVS used an encoding that was not UTF-8.

As an example, try to get the svn log of the Apache repository and the svn client will fail. We have three commit messages in Latin-1 that cvs2svn placed into svn as binary (prior to 1.2 there was no encoding validation in svn). Those messages get copied into the XML that is passed between the svn server and client, which uses UTF-8 as its encoding.

I've asked infra@ to fix this, but since it is not really high priority (only data archaeologists like myself care about those things) it is unlikely to get fixed.

Anyhow, I agree with Pier: we should *only* use ASCII and escape Unicode characters explicitly the \u way.

-- Stefano.
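The failure mode described here, Latin-1 bytes landing in a stream that is declared to be UTF-8, can be reproduced with a strict decoder. This is a minimal Java sketch with hypothetical names (StrictUtf8, isValidUtf8). It is not how Subversion validates commit messages; it only illustrates why a UTF-8 consumer can reject such bytes.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Illustrative only; names are hypothetical.
final class StrictUtf8 {

    /** Returns true only if the bytes form well-formed UTF-8. */
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)      // fail instead of substituting U+FFFD
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String message = "caf\u00E9"; // a word with one non-ASCII letter

        byte[] asLatin1 = message.getBytes(StandardCharsets.ISO_8859_1); // 63 61 66 E9
        byte[] asUtf8 = message.getBytes(StandardCharsets.UTF_8);        // 63 61 66 C3 A9

        System.out.println(isValidUtf8(asLatin1)); // false: E9 starts a sequence that never completes
        System.out.println(isValidUtf8(asUtf8));   // true
    }
}
```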
Re: svn commit: r278641 - /cocoon/branches/BRANCH_2_1_X/src/blocks/xsp/java/org/apache/cocoon/components/language/markup/xsp/XSPExpressionParser.java
On 5 Sep 2005, at 17:25, Stefano Mazzocchi wrote: David Crossley wrote: Pier Fumagalli wrote: Nah, I'm pretty confident that on this little nag, I'm right... Does anyone have a pier2doc transformer? Why do you think this project was started? :-) Darn, I can see that in 9 years my communication skills have hardly improved! Pier
Re: svn commit: r278641 - /cocoon/branches/BRANCH_2_1_X/src/blocks/xsp/java/org/apache/cocoon/components/language/markup/xsp/XSPExpressionParser.java
Pier Fumagalli wrote: Stefano Mazzocchi wrote: David Crossley wrote: Pier Fumagalli wrote: Nah, I'm pretty confident that on this little nag, I'm right... Does anyone have a pier2doc transformer? Why do you think this project was started? :-) ;-) Darn, I can see that in 9 years my communication skills have hardly improved! Argh, I did not specify the need properly. Pier communicates brilliantly. The reason for pier2doc is to automate the documentation of that. -David
Re: svn commit: r278641 - /cocoon/branches/BRANCH_2_1_X/src/blocks/xsp/java/org/apache/cocoon/components/language/markup/xsp/XSPExpressionParser.java
[EMAIL PROTECTED] wrote:

Author: pier
Date: Sun Sep 4 16:29:09 2005
New Revision: 278641

URL: http://svn.apache.org/viewcvs?rev=278641&view=rev
Log:
Fixing wrong encoding bug

Modified:
    cocoon/branches/BRANCH_2_1_X/src/blocks/xsp/java/org/apache/cocoon/components/language/markup/xsp/XSPExpressionParser.java

@@ -211,7 +211,7 @@
     parser.setState(EXPRESSION_CHAR_STATE);
     break;
-case '´':
+case '\u00B4':
     parser.append(ch);
     parser.setState(EXPRESSION_SHELL_STATE);
     break;
@@ -235,10 +235,10 @@
 protected static final State EXPRESSION_CHAR_STATE = new QuotedState('\'');
 /**
- * The parser has encountered '´' in code[EMAIL PROTECTED] EXPRESSION_STATE}/code
- * to start a Python string constant.
+ * The parser has encountered '\u00B4' (Unicode Latin-1 Acute Accent) in
+ * code[EMAIL PROTECTED] EXPRESSION_STATE}/code to start a Python string constant.
  */
-protected static final State EXPRESSION_SHELL_STATE = new QuotedState('´');
+protected static final State EXPRESSION_SHELL_STATE = new QuotedState('\u00B4');

Why not just leave the original char as it was before your first change? It was working. Having a UTF-8 IMO is not good.

Best Regards, Antonio Gallardo.
Re: svn commit: r278641 - /cocoon/branches/BRANCH_2_1_X/src/blocks/xsp/java/org/apache/cocoon/components/language/markup/xsp/XSPExpressionParser.java
On 5 Sep 2005, at 00:33, Antonio Gallardo wrote: [EMAIL PROTECTED] wrote:

Author: pier
Date: Sun Sep 4 16:29:09 2005
New Revision: 278641

URL: http://svn.apache.org/viewcvs?rev=278641&view=rev
Log:
Fixing wrong encoding bug

Modified:
    cocoon/branches/BRANCH_2_1_X/src/blocks/xsp/java/org/apache/cocoon/components/language/markup/xsp/XSPExpressionParser.java

@@ -211,7 +211,7 @@
     parser.setState(EXPRESSION_CHAR_STATE);
     break;
-case '´':
+case '\u00B4':
     parser.append(ch);
     parser.setState(EXPRESSION_SHELL_STATE);
     break;
@@ -235,10 +235,10 @@
 protected static final State EXPRESSION_CHAR_STATE = new QuotedState('\'');
 /**
- * The parser has encountered '´' in code[EMAIL PROTECTED] EXPRESSION_STATE}/code
- * to start a Python string constant.
+ * The parser has encountered '\u00B4' (Unicode Latin-1 Acute Accent) in
+ * code[EMAIL PROTECTED] EXPRESSION_STATE}/code to start a Python string constant.
  */
-protected static final State EXPRESSION_SHELL_STATE = new QuotedState('´');
+protected static final State EXPRESSION_SHELL_STATE = new QuotedState('\u00B4');

Why not just leave the original char as it was before your first change? It was working. Having a UTF-8 IMO is not good.

It's not a UTF-8 character, it's a UNICODE character: \u doesn't mean UTF but rather UNICODE (which is not an encoding).

Depending on your platform encoding (yours apparently ISO8859-1, mine UTF-8, my wife's -she's Japanese- Shift-JIS) that sequence (B4) of BYTES as in the original source code will be interpreted as a different character.

Changing the binary sequence B4 to \u00B4 instructs the JVM that no matter what encoding your platform is set to, the resulting character will always (always) be UNICODE 00B4, the Acute Accent, part of the Latin-1 (0X0080) table.

Let's call it defensive programming. Actually, in the source code we should be using only characters in the range 00-7F (Unicode BASIC-Latin, encoding US-ASCII), as that range is the most widely shared among all the different encodings (although, thinking about IBM's EBCDIC, even that one might cause problems in some cases).

Pier
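The "ASCII only, escape the rest" rule can be sketched in Java as follows. The names (AsciiOnlySource, ACUTE, firstNonAscii) are hypothetical and not part of Cocoon. The sketch shows that the escape yields the same char value as the raw Latin-1 character, and that a simple scan can flag source bytes outside 00-7F.

```java
import java.nio.charset.StandardCharsets;

// Illustrative only; names are hypothetical.
final class AsciiOnlySource {

    /** ACUTE ACCENT written as a Unicode escape, so this file contains only ASCII bytes. */
    static final char ACUTE = '\u00B4';

    /** Returns the index of the first byte above 0x7F, or -1 if every byte is US-ASCII. */
    static int firstNonAscii(byte[] bytes) {
        for (int i = 0; i < bytes.length; i++) {
            if ((bytes[i] & 0xFF) > 0x7F) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // The escape denotes code point 0xB4 regardless of the platform encoding.
        System.out.println(ACUTE == (char) 0xB4); // true

        // A source line that spells the escape out consists of ASCII bytes only.
        byte[] escaped = "case '\\u00B4':".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(firstNonAscii(escaped)); // -1

        // The same line with the raw character, saved as ISO-8859-1, has byte B4 at index 6.
        byte[] raw = ("case '" + ACUTE + "':").getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(firstNonAscii(raw)); // 6
    }
}
```

A scan like firstNonAscii could run over source files before a commit, but that is only one possible way to enforce the convention.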
Re: svn commit: r278641 - /cocoon/branches/BRANCH_2_1_X/src/blocks/xsp/java/org/apache/cocoon/components/language/markup/xsp/XSPExpressionParser.java
Pier Fumagalli wrote: On 5 Sep 2005, at 00:33, Antonio Gallardo wrote: [EMAIL PROTECTED] wrote:

Author: pier
Date: Sun Sep 4 16:29:09 2005
New Revision: 278641

URL: http://svn.apache.org/viewcvs?rev=278641&view=rev
Log:
Fixing wrong encoding bug

Modified:
    cocoon/branches/BRANCH_2_1_X/src/blocks/xsp/java/org/apache/cocoon/components/language/markup/xsp/XSPExpressionParser.java

@@ -211,7 +211,7 @@
     parser.setState(EXPRESSION_CHAR_STATE);
     break;
-case '´':
+case '\u00B4':
     parser.append(ch);
     parser.setState(EXPRESSION_SHELL_STATE);
     break;
@@ -235,10 +235,10 @@
 protected static final State EXPRESSION_CHAR_STATE = new QuotedState('\'');
 /**
- * The parser has encountered '´' in code[EMAIL PROTECTED] EXPRESSION_STATE}/code
- * to start a Python string constant.
+ * The parser has encountered '\u00B4' (Unicode Latin-1 Acute Accent) in
+ * code[EMAIL PROTECTED] EXPRESSION_STATE}/code to start a Python string constant.
  */
-protected static final State EXPRESSION_SHELL_STATE = new QuotedState('´');
+protected static final State EXPRESSION_SHELL_STATE = new QuotedState('\u00B4');

Why not just leave the original char as it was before your first change? It was working. Having a UTF-8 IMO is not good.

It's not a UTF-8 character, it's a UNICODE character: \u doesn't mean UTF but rather UNICODE (which is not an encoding).

First, I apologize: I worded my previous sentence very badly. I wanted to say that I don't see a reason to use Java Unicode escaping in this case. Reading the Java Specification, we find [0]: "Programs are written using the Unicode character set." So IMO UNICODE 00B4, the Acute Accent in Latin-1, should be represented by only one code.

Depending on your platform encoding (yours apparently ISO8859-1, mine UTF-8, my wife's -she's Japanese- Shift-JIS) that sequence (B4) of BYTES as in the original source code will be interpreted as a different character.

The char encoding Shift-JIS (JIS X 0201:1997 or JIS X 0208:1997) is exactly the same as using ISO-8859-1. We need to keep the sources in UNICODE and there is also for Japanese: Hiragana, Katakana, et al: http://www.unicode.org/charts/

Changing the binary sequence B4 to \u00B4 instructs the JVM that no matter what encoding your platform is set to, the resulting character will always (always) be UNICODE 00B4, the Acute Accent, part of the Latin-1 (0X0080) table.

If we wrote the code in UNICODE you would have the same effect. It is exactly the same as with XML, isn't it?

Let's call it defensive programming. Actually, in the source code we should be using only characters in the range 00-7F (Unicode BASIC-Latin, encoding US-ASCII), as that range is the most widely shared among all the different encodings (although, thinking about IBM's EBCDIC, even that one might cause problems in some cases).

I am sorry, but I do not like to cover the sun with a finger. I believe Thorsten Schalab can tell us more about this topic. ;-)

Best Regards, Antonio Gallardo.

[0] http://java.sun.com/docs/books/jls/second_edition/html/lexical.doc.html#95413
Re: svn commit: r278641 - /cocoon/branches/BRANCH_2_1_X/src/blocks/xsp/java/org/apache/cocoon/components/language/markup/xsp/XSPExpressionParser.java
On Monday 05 September 2005 09:52, Pier Fumagalli wrote:

Nah, I'm pretty confident that on this little nag, I'm right...

Yes, you are (reservation below). And I find it amazing how difficult this topic is to understand for most people, some of them pretty clever.

Adding to the confusion, the JLS initially specified that Java source files had to have ISO-8859-1 (IIRC) encoding, later introduced the -encoding argument to the compiler, and AFAIU changed the default in Java 5. Pier seems to suggest that the platform settings also play a role in which encoding the compiler chooses. This I am not aware of.

The only proper way is for Cocoon to declare an encoding for its source files, to pass that setting explicitly in the javac arguments, and to treat any deviations as bugs.

Cheers
Niclas
Re: svn commit: r278641 - /cocoon/branches/BRANCH_2_1_X/src/blocks/xsp/java/org/apache/cocoon/components/language/markup/xsp/XSPExpressionParser.java
Pier Fumagalli wrote: Nah, I'm pretty confident that on this little nag, I'm right... Does anyone have a pier2doc transformer? -David