Re: svn commit: r278641 - /cocoon/branches/BRANCH_2_1_X/src/blocks/xsp/java/org/apache/cocoon/components/language/markup/xsp/XSPExpressionParser.java

2005-09-05 Thread Antonio Gallardo

Pier Fumagalli wrote:


On 5 Sep 2005, at 01:53, Antonio Gallardo wrote:


Pier Fumagalli wrote:


Depending on your platform encoding (yours apparently ISO8859-1,  
mine  UTF-8, my wife's -she's japanese- Shift-JIS) that sequence  
(B4) of  BYTES as in the original source code will be interpreted  
as a  different character.



The char encoding Shift-JIS (JIS X 0201:1997 or JIS X 0208:1997) is
exactly the same as using ISO-8859-1. We need to keep the sources
in UNICODE, and Unicode also covers Japanese: Hiragana, Katakana, et
al: http://www.unicode.org/charts/



Err... Ehmmm.. No... The character in question (Latin-1 character B4,
Acute Accent) is encoded in ISO8859-1 as the byte sequence B4,
while in Shift-JIS the same character is encoded as byte sequence 81
4C, quite different.


Reading the byte sequence B4 in Shift-JIS will produce Unicode
character FF74 (Halfwidth katakana E), which is quite different
from an acute accent as you intended.
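
A minimal sketch of the point above, not code from the thread: it decodes
the single byte B4 under several charsets and prints the bytes that the
acute accent encodes to. The class and method names are made up for
illustration, and the Shift_JIS part is guarded because that charset is
not one of the charsets every Java runtime is required to ship.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ByteB4Demo {
    // Decode the single byte 0xB4 using the given charset.
    static String decodeB4(Charset cs) {
        return new String(new byte[] {(byte) 0xB4}, cs);
    }

    // Space-separated hex dump of the bytes a string encodes to.
    static String hex(String s, Charset cs) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(cs)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Same byte, different character depending on the charset.
        System.out.printf("B4 as ISO-8859-1 -> U+%04X%n",
                (int) decodeB4(StandardCharsets.ISO_8859_1).charAt(0));
        // A lone B4 is malformed UTF-8, so the decoder substitutes U+FFFD.
        System.out.printf("B4 as UTF-8      -> U+%04X%n",
                (int) decodeB4(StandardCharsets.UTF_8).charAt(0));
        // Same character, different bytes depending on the charset.
        System.out.println("U+00B4 in UTF-8  -> "
                + hex("\u00B4", StandardCharsets.UTF_8));
        if (Charset.isSupported("Shift_JIS")) {
            Charset sjis = Charset.forName("Shift_JIS");
            System.out.printf("B4 as Shift_JIS  -> U+%04X%n",
                    (int) decodeB4(sjis).charAt(0));
            System.out.println("U+00B4 in SJIS   -> " + hex("\u00B4", sjis));
        }
    }
}
```

If the runtime has Shift_JIS, the last two lines should match the FF74 and
81 4C values quoted above.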


Trust me, I've been doing this for 9 years! :-)


Yes, I believe you. :-) When I said that using Shift-JIS and ISO-8859-1
is the same, I had in mind that neither represents the full Unicode
spectrum. I was just trying to show this problem in another charset. So in
fact we have the same problem. Of course I am aware that both
codesets (Shift-JIS and ISO-8859-1) are different UNICODE subsets. This
is the same as you stated.




Changing the binary sequence B4 to \u00B4 instructs the JVM that  
no  matter what encoding your platform is set to, the resulting  
character  will always (always) be UNICODE 00B4, the Acute Accent,  
part of the  Latin-1 (0X0080) table.



If we wrote the code in UNICODE you would have the same effect. It is
exactly the same as with XML, isn't it?



Unicode is simply a list of characters. To save them on a disk, you
_need_ to use an encoding. Unicode code points need up to 21 bits and
are usually held in 32 (they fit in 16 bits in the early versions of
Unicode, but that ain't important right now), bytes are 8 bits long.
It's as easy as that. To represent 32 bits in 8, you need to compress
them (or, as we say in I18N, encode them).


Some encodings are complete (such as the family of UTF encodings)  
meaning that the encoding CAN represent ALL Unicode characters, some  
are not (such as ISO8859-1 which can represent only Unicode  
characters from 00 to FF).
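
A small sketch of the "complete" versus "incomplete" distinction, assuming
nothing beyond the standard java.nio.charset API; the class and method
names are hypothetical. It asks each charset's encoder whether it can
represent a given piece of text.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CoverageDemo {
    // True if every character of the text can be encoded in the charset.
    static boolean canEncode(Charset cs, String text) {
        return cs.newEncoder().canEncode(text);
    }

    public static void main(String[] args) {
        String acute = "\u00B4";     // Acute Accent, inside 00-FF
        String katakana = "\uFF74";  // Halfwidth Katakana E, outside 00-FF
        Charset[] charsets = {
            StandardCharsets.US_ASCII,
            StandardCharsets.ISO_8859_1,
            StandardCharsets.UTF_8
        };
        for (Charset cs : charsets) {
            System.out.println(cs.name()
                    + ": U+00B4 " + canEncode(cs, acute)
                    + ", U+FF74 " + canEncode(cs, katakana));
        }
    }
}
```

ISO-8859-1 accepts the accent but not the katakana, US-ASCII accepts
neither, and UTF-8 accepts both.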


Yes. Please correct me here if I am wrong: does our SVN use UTF-8 as the
default charset (or encoding) or not? If not, then we need to take care
not only of Java sources but also of the chars above 7F in the XML files.


I have a special interest in that, since we write mostly Spanish messages.
I would like to know whether this is needed or not.


Best Regards,

Antonio Gallardo.


Re: svn commit: r278641 - /cocoon/branches/BRANCH_2_1_X/src/blocks/xsp/java/org/apache/cocoon/components/language/markup/xsp/XSPExpressionParser.java

2005-09-05 Thread Pier Fumagalli

On 5 Sep 2005, at 04:44, David Crossley wrote:

Pier Fumagalli wrote:


Nah, I'm pretty confident that on this little nag, I'm right...


Does anyone have a pier2doc transformer?


I need to get into a meeting right now, but the first part of the  
pier2doc translation of this thread is here:


http://www.betaversion.org/~pier/wiki/display/pier/Unicode+and+Encodings

Pier





Re: svn commit: r278641 - /cocoon/branches/BRANCH_2_1_X/src/blocks/xsp/java/org/apache/cocoon/components/language/markup/xsp/XSPExpressionParser.java

2005-09-05 Thread Niclas Hedhman
On Monday 05 September 2005 14:43, Antonio Gallardo wrote:

 Of course that I am aware that both codesets (Shift-JIS and ISO-8859-1) are
 different UNICODE subset. This is same as you stated. 

No. Pier doesn't mix up Unicode (a sequence of characters) with
the mapping of those characters to fixed or variable length encoded
bytestreams.
The fact that character 65 in Unicode is in many encodings mapped to the byte 
value 65 is for convenience only, and that fact should be ignored.

 Our SVN uses UTF-8 as the default charset (or encoding) or not?

Subversion uses binary data, and is agnostic to any encodings in the data (or 
so they say). AFAIU, marking files as text only deals with the line endings 
and how the diff mails are generated.
The --encoding argument applies to commit messages.
Paths, URLs/URIs have additional encoding requirements.


Cheers
Niclas


Re: svn commit: r278641 - /cocoon/branches/BRANCH_2_1_X/src/blocks/xsp/java/org/apache/cocoon/components/language/markup/xsp/XSPExpressionParser.java

2005-09-05 Thread David Crossley
Pier Fumagalli wrote:
 David Crossley wrote:
 Pier Fumagalli wrote:
 
 Nah, I'm pretty confident that on this little nag, I'm right...
 
 Does anyone have a pier2doc transformer?
 
 I need to get into a meeting right now, but the first part of the  
 pier2doc translation of this thread is here:
 
 http://www.betaversion.org/~pier/wiki/display/pier/Unicode+and+Encodings

Ah, that was easy for us. I knew there would be one
lying around somewhere. :-) Thanks. Ever grateful.

-David


Re: svn commit: r278641 - /cocoon/branches/BRANCH_2_1_X/src/blocks/xsp/java/org/apache/cocoon/components/language/markup/xsp/XSPExpressionParser.java

2005-09-05 Thread Sylvain Wallez

Niclas Hedhman wrote:


On Monday 05 September 2005 14:43, Antonio Gallardo wrote:

 


Of course that I am aware that both codesets (Shift-JIS and ISO-8859-1) are
different UNICODE subset. This is same as you stated. 
   



No. Pier doesn't mix up Unicode (a sequence of characters) with
the mapping of those characters to fixed or variable length encoded
bytestreams.
The fact that character 65 in Unicode is in many encodings mapped to the byte 
value 65 is for convenience only, and that fact should be ignored.


 


Our SVN uses UTF-8 as the default charset (or encoding) or not?
   



Subversion uses binary data, and is agnostic to any encodings in the data (or 
so they say). AFAIU, marking files as text only deals with the line endings 
and how the diff mails are generated.
 



The problem is the interpretation of line endings. On Unix, it's 0x0A, which
can be part of a multibyte character in a file encoded in UTF-8.


In such a case, although the file is a text file, setting the 
eol-style=native property may well break the file... Or is there a way 
to specify the encoding to SVN?
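
A quick way to check this worry, as a sketch with made-up names: look for
the line feed byte (0x0A) inside the encoded form of a character that is
not a line feed. By the definition of UTF-8, every byte of a multibyte
sequence has its high bit set, so 0x0A can only ever be a real line feed
there; the situation described above can arise in UTF-16, where U+010A
encodes as the bytes 01 0A.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class LineFeedByteDemo {
    // True if the encoded form of the text contains the byte 0x0A.
    static boolean containsByte0A(String text, Charset cs) {
        for (byte b : text.getBytes(cs)) {
            if (b == 0x0A) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // U+010A (Latin capital C with dot above): contains no line feed.
        String text = "\u010A";
        System.out.println("UTF-8:    "
                + containsByte0A(text, StandardCharsets.UTF_8));
        System.out.println("UTF-16BE: "
                + containsByte0A(text, StandardCharsets.UTF_16BE));
    }
}
```

So a line-ending conversion that works byte by byte looks safe for UTF-8
text but could corrupt UTF-16 text.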


Sylvain

--
Sylvain Wallez                        Anyware Technologies
http://people.apache.org/~sylvain     http://www.anyware-tech.com
Apache Software Foundation Member     Research & Technology Director



Re: svn commit: r278641 - /cocoon/branches/BRANCH_2_1_X/src/blocks/xsp/java/org/apache/cocoon/components/language/markup/xsp/XSPExpressionParser.java

2005-09-05 Thread Stefano Mazzocchi

David Crossley wrote:

Pier Fumagalli wrote:


Nah, I'm pretty confident that on this little nag, I'm right...



Does anyone have a pier2doc transformer?


Why do you think this project was started? :-)

--
Stefano.



Re: svn commit: r278641 - /cocoon/branches/BRANCH_2_1_X/src/blocks/xsp/java/org/apache/cocoon/components/language/markup/xsp/XSPExpressionParser.java

2005-09-05 Thread Stefano Mazzocchi

Niclas Hedhman wrote:

On Monday 05 September 2005 14:43, Antonio Gallardo wrote:



Of course that I am aware that both codesets (Shift-JIS and ISO-8859-1) are
different UNICODE subset. This is same as you stated. 



No. Pier doesn't mix up Unicode (a sequence of characters) with
the mapping of those characters to fixed or variable length encoded
bytestreams.
The fact that character 65 in Unicode is in many encodings mapped to the byte 
value 65 is for convenience only, and that fact should be ignored.




Our SVN uses UTF-8 as the default charset (or encoding) or not?



Subversion uses binary data, and is agnostic to any encodings in the data (or 
so they say). AFAIU, marking files as text only deals with the line endings 
and how the diff mails are generated.

The --encoding argument applies to commit messages.
Paths, URLs/URIs have additional encoding requirements.


Correct.

And it is also worth noting that SVN before 1.2 and CVS2SVN create a pretty
broken combination when the commit message in CVS used an encoding that 
was not UTF-8.


As an example, try to get svn log of the apache repository and the svn 
client will fail, because we have three commit messages in latin-1 
placed, as binary, by cvs2svn into svn (and prior to 1.2 there was no 
encoding validation checking in svn) that get moved into the XML file 
that is passed between the svn server and client, which is using UTF-8 
as the encoding.
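
To illustrate the kind of validation check that was missing, here is a
hedged sketch in Java (this is not how svn implements it, and the names
are hypothetical): a strict UTF-8 decoder that reports malformed input
instead of silently replacing it.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class Utf8Validator {
    // True only if the bytes form a well-formed UTF-8 sequence.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Hypothetical commit message containing one Latin-1 character.
        String message = "caf\u00E9 fix";
        System.out.println("stored as latin-1: "
                + isValidUtf8(message.getBytes(StandardCharsets.ISO_8859_1)));
        System.out.println("stored as UTF-8:   "
                + isValidUtf8(message.getBytes(StandardCharsets.UTF_8)));
    }
}
```

The latin-1 bytes fail the check because E9 announces a three-byte UTF-8
sequence that never arrives.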


I've asked infra@ to fix this, but since it's not really high priority (only
data archeologists like myself care about those things) it is unlikely to
get fixed.


Anyhow, I agree with Pier, we should *only* use ASCII and escape unicode 
characters explicitly the \u way.


--
Stefano.



Re: svn commit: r278641 - /cocoon/branches/BRANCH_2_1_X/src/blocks/xsp/java/org/apache/cocoon/components/language/markup/xsp/XSPExpressionParser.java

2005-09-05 Thread Pier Fumagalli

On 5 Sep 2005, at 17:25, Stefano Mazzocchi wrote:

David Crossley wrote:

Pier Fumagalli wrote:


Nah, I'm pretty confident that on this little nag, I'm right...


Does anyone have a pier2doc transformer?



Why do you think this project was started? :-)


Darn, I can see that in 9 years my communication skills have hardly
improved!


Pier





Re: svn commit: r278641 - /cocoon/branches/BRANCH_2_1_X/src/blocks/xsp/java/org/apache/cocoon/components/language/markup/xsp/XSPExpressionParser.java

2005-09-05 Thread David Crossley
Pier Fumagalli wrote:
 Stefano Mazzocchi wrote:
 David Crossley wrote:
 Pier Fumagalli wrote:
 
 Nah, I'm pretty confident that on this little nag, I'm right...
 
 Does anyone have a pier2doc transformer?
 
 Why do you think this project was started? :-)

;-)

 Darn, I can see that in 9 years my communication skills have hardly
 improved!

Argh, I did not specify the need properly.

Pier communicates brilliantly. The reason for pier2doc
is to automate the documentation of that.

-David


Re: svn commit: r278641 - /cocoon/branches/BRANCH_2_1_X/src/blocks/xsp/java/org/apache/cocoon/components/language/markup/xsp/XSPExpressionParser.java

2005-09-04 Thread Antonio Gallardo

[EMAIL PROTECTED] wrote:


Author: pier
Date: Sun Sep  4 16:29:09 2005
New Revision: 278641

URL: http://svn.apache.org/viewcvs?rev=278641&view=rev
Log:
Fixing wrong encoding bug

Modified:
   
cocoon/branches/BRANCH_2_1_X/src/blocks/xsp/java/org/apache/cocoon/components/language/markup/xsp/XSPExpressionParser.java

@@ -211,7 +211,7 @@
parser.setState(EXPRESSION_CHAR_STATE);
break;

-case '´':
+case '\u00B4':
parser.append(ch);
parser.setState(EXPRESSION_SHELL_STATE);
break;
@@ -235,10 +235,10 @@
protected static final State EXPRESSION_CHAR_STATE = new QuotedState('\'');

/**
- * The parser has encountered '´' in code[EMAIL PROTECTED] EXPRESSION_STATE}/code
- * to start a Python string constant.
+ * The parser has encountered '\u00B4' (Unicode Latin-1 Acute Accent) in
+ * code[EMAIL PROTECTED] EXPRESSION_STATE}/code to start a Python string constant.
 */
-protected static final State EXPRESSION_SHELL_STATE = new QuotedState('´');
+protected static final State EXPRESSION_SHELL_STATE = new QuotedState('\u00B4');

 

Why not just leave the original char as it was before your first change?
It was working. Having a UTF-8 character there is IMO not good.


Best Regards,

Antonio Gallardo.



Re: svn commit: r278641 - /cocoon/branches/BRANCH_2_1_X/src/blocks/xsp/java/org/apache/cocoon/components/language/markup/xsp/XSPExpressionParser.java

2005-09-04 Thread Pier Fumagalli

On 5 Sep 2005, at 00:33, Antonio Gallardo wrote:

[EMAIL PROTECTED] wrote:


Author: pier
Date: Sun Sep  4 16:29:09 2005
New Revision: 278641

URL: http://svn.apache.org/viewcvs?rev=278641&view=rev
Log:
Fixing wrong encoding bug

Modified:
   cocoon/branches/BRANCH_2_1_X/src/blocks/xsp/java/org/apache/cocoon/components/language/markup/xsp/XSPExpressionParser.java

@@ -211,7 +211,7 @@
parser.setState(EXPRESSION_CHAR_STATE);
break;
-case '´':
+case '\u00B4':
parser.append(ch);
parser.setState(EXPRESSION_SHELL_STATE);
break;
@@ -235,10 +235,10 @@
protected static final State EXPRESSION_CHAR_STATE = new QuotedState('\'');

/**
- * The parser has encountered '´' in code[EMAIL PROTECTED] EXPRESSION_STATE}/code
- * to start a Python string constant.
+ * The parser has encountered '\u00B4' (Unicode Latin-1 Acute Accent) in
+ * code[EMAIL PROTECTED] EXPRESSION_STATE}/code to start a Python string constant.
 */
-protected static final State EXPRESSION_SHELL_STATE = new QuotedState('´');
+protected static final State EXPRESSION_SHELL_STATE = new QuotedState('\u00B4');



Why not just leave the original char as it was before your first
change? It was working. Having a UTF-8 character there is IMO not good.


It's not a UTF-8 character, it's an UNICODE character: \u doesn't  
mean UTF but rather UNICODE (which is not an encoding).


Depending on your platform encoding (yours apparently ISO8859-1, mine  
UTF-8, my wife's -she's japanese- Shift-JIS) that sequence (B4) of  
BYTES as in the original source code will be interpreted as a  
different character.


Changing the binary sequence B4 to \u00B4 instructs the JVM that no  
matter what encoding your platform is set to, the resulting character  
will always (always) be UNICODE 00B4, the Acute Accent, part of the  
Latin-1 (0X0080) table.


Let's call it defensive programming, and actually, in the source  
code, we should be using only characters in the range 00-7F (Unicode  
BASIC-Latin, encoding US-ASCII), as that's the most-common amongst  
all different encodings (even if when thinking about IBM's EBCDIC,  
even that one might have some problems in some cases).
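
As a rough illustration of that defensive approach (a sketch only; the
class and method names are invented here, and the JDK's old native2ascii
tool did a similar job), the following rewrites any character outside
00-7F as a Java Unicode escape so the source file stays pure US-ASCII:

```java
class AsciiEscaper {
    // Replace every char above 7F with its six-character Unicode escape.
    static String escapeNonAscii(String source) {
        StringBuilder out = new StringBuilder(source.length());
        for (int i = 0; i < source.length(); i++) {
            char c = source.charAt(i);
            if (c <= 0x7F) {
                out.append(c);
            } else {
                // Characters beyond FFFF arrive as two surrogate chars
                // and become two escapes, which javac also accepts.
                out.append(String.format("\\u%04X", (int) c));
            }
        }
        return out.toString();
    }

    // True if the text is already safe under any ASCII-compatible encoding.
    static boolean isPureAscii(String source) {
        return escapeNonAscii(source).equals(source);
    }

    public static void main(String[] args) {
        String line = "case '\u00B4':";
        System.out.println(isPureAscii(line));
        System.out.println(escapeNonAscii(line));
    }
}
```

Run against the line from the commit, it prints the escaped form that the
change introduced.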


Pier





Re: svn commit: r278641 - /cocoon/branches/BRANCH_2_1_X/src/blocks/xsp/java/org/apache/cocoon/components/language/markup/xsp/XSPExpressionParser.java

2005-09-04 Thread Antonio Gallardo

Pier Fumagalli wrote:


On 5 Sep 2005, at 00:33, Antonio Gallardo wrote:


[EMAIL PROTECTED] wrote:


Author: pier
Date: Sun Sep  4 16:29:09 2005
New Revision: 278641

URL: http://svn.apache.org/viewcvs?rev=278641&view=rev
Log:
Fixing wrong encoding bug

Modified:
   cocoon/branches/BRANCH_2_1_X/src/blocks/xsp/java/org/apache/cocoon/components/language/markup/xsp/XSPExpressionParser.java

@@ -211,7 +211,7 @@
parser.setState(EXPRESSION_CHAR_STATE);
break;
-case '´':
+case '\u00B4':
parser.append(ch);
parser.setState(EXPRESSION_SHELL_STATE);
break;
@@ -235,10 +235,10 @@
protected static final State EXPRESSION_CHAR_STATE = new QuotedState('\'');

/**
- * The parser has encountered '´' in code[EMAIL PROTECTED] EXPRESSION_STATE}/code
- * to start a Python string constant.
+ * The parser has encountered '\u00B4' (Unicode Latin-1 Acute Accent) in
+ * code[EMAIL PROTECTED] EXPRESSION_STATE}/code to start a Python string constant.
 */
-protected static final State EXPRESSION_SHELL_STATE = new QuotedState('´');
+protected static final State EXPRESSION_SHELL_STATE = new QuotedState('\u00B4');



Why not just leave the original char as it was before your first
change? It was working. Having a UTF-8 character there is IMO not good.



It's not a UTF-8 character, it's an UNICODE character: \u doesn't  
mean UTF but rather UNICODE (which is not an encoding).


First, my apologies: I worded the previous sentence very badly. I wanted
to state that I don't see a reason to use Java Unicode escaping in this
case. Reading the Java Language Specification, we find [0]:
"Programs are written using the Unicode character set." So IMO
UNICODE 00B4, the Acute Accent in Latin-1, should be represented by
only one code.


Depending on your platform encoding (yours apparently ISO8859-1, mine  
UTF-8, my wife's -she's japanese- Shift-JIS) that sequence (B4) of  
BYTES as in the original source code will be interpreted as a  
different character.


The char encoding Shift-JIS (JIS X 0201:1997 or JIS X 0208:1997) is
exactly the same as using ISO-8859-1. We need to keep the sources in
UNICODE, and Unicode also covers Japanese: Hiragana, Katakana, et al:
http://www.unicode.org/charts/




Changing the binary sequence B4 to \u00B4 instructs the JVM that no  
matter what encoding your platform is set to, the resulting character  
will always (always) be UNICODE 00B4, the Acute Accent, part of the  
Latin-1 (0X0080) table.


If we wrote the code in UNICODE you would have the same effect. It is
exactly the same as with XML, isn't it?


Let's call it defensive programming, and actually, in the source  
code, we should be using only characters in the range 00-7F (Unicode  
BASIC-Latin, encoding US-ASCII), as that's the most-common amongst  
all different encodings (even if when thinking about IBM's EBCDIC,  
even that one might have some problems in some cases).


I am sorry, but I do not like to cover the sun with a finger.

I believe Thorsten Schalab can tell us more about this topic. ;-)

Best Regards,

Antonio Gallardo.

[0] 
http://java.sun.com/docs/books/jls/second_edition/html/lexical.doc.html#95413 



Re: svn commit: r278641 - /cocoon/branches/BRANCH_2_1_X/src/blocks/xsp/java/org/apache/cocoon/components/language/markup/xsp/XSPExpressionParser.java

2005-09-04 Thread Niclas Hedhman
On Monday 05 September 2005 09:52, Pier Fumagalli wrote:
 Nah, I'm pretty confident that on this little nag, I'm right...

Yes, you are (reservation below).

And I find it amazing how difficult this topic is to understand for most 
people, some of them pretty clever.

Now, what adds to the confusion is that the JLS initially specified that Java
source files had to have ISO-8859-1 (IIRC) encoding, later introduced the
-encoding argument to the compiler, and AFAIU in Java 5 changed the default.

Pier seems to suggest that the platform settings also play a role in which 
encoding the compiler chooses. This I am not aware of.


The only proper way is that Cocoon declare an encoding for source files to 
use, and that this setting is explicitly given in the javac argument, and 
any deviations are bugs.
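
On the platform question, a hedged note: as far as the javac documentation
goes, when -encoding is not given the compiler falls back to the platform
default converter, so the build machine's settings do matter unless the
flag is passed explicitly. The sketch below (hypothetical class name) only
prints what the running JVM considers its default, which is a reasonable
first thing to compare between two developers' machines.

```java
import java.nio.charset.Charset;

class DefaultEncodingDemo {
    // Summarise the defaults this JVM would apply when none is specified.
    static String describe() {
        return "default charset = " + Charset.defaultCharset().name()
                + ", file.encoding = " + System.getProperty("file.encoding");
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```

Declaring the encoding in the build, for example by passing
"-encoding ISO-8859-1" (or whichever encoding the project settles on) to
javac, removes that dependency on the local setup.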


Cheers
Niclas


Re: svn commit: r278641 - /cocoon/branches/BRANCH_2_1_X/src/blocks/xsp/java/org/apache/cocoon/components/language/markup/xsp/XSPExpressionParser.java

2005-09-04 Thread David Crossley
Pier Fumagalli wrote:
 
 Nah, I'm pretty confident that on this little nag, I'm right...

Does anyone have a pier2doc transformer?

-David