questions about string literals

Rong Chen Sun, 08 Oct 2006 10:32:13 +0200

Thomas Beale wrote:
> Andrew Patterson wrote:
>>> - use the \uNNNN approach Andrew suggests (is this hex or decimal?)
>> This is hexadecimal (as per the unicode spec for unicode codepoints).
>> C# and Java use this notation - C# extends it to also have \UXXXXXXXX
>> for 32 bit codepoints (as per the new unicode versions)
>>
> One of Andrew's issues I don't think matters too much - the quoting of 
> &; this is because &amp; is used to quote the '&' character. However, I 
> agree that using a more uniform '\'-based quoting approach will be 
> clearer, and make for easier parser construction. So let's say that we 
> will go for the \uHHHH \UHHHHHHHH approach.
>
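
To make the \uHHHH / \UHHHHHHHH proposal concrete, here is a minimal sketch in Python of how a reader might expand such escapes in a string literal. The function name `decode_unicode_escapes` is hypothetical and not taken from any ADL tool. The sketch assumes exactly 4 hex digits after \u and exactly 8 after \U, and it does not handle an escaped backslash in front of the 'u'.

```python
import re

# Hypothetical helper (an assumption for illustration, not part of any openEHR tool).
# Matches \uHHHH (4 hex digits) or \UHHHHHHHH (8 hex digits).
_ESCAPE = re.compile(r"\\u([0-9A-Fa-f]{4})|\\U([0-9A-Fa-f]{8})")


def decode_unicode_escapes(text: str) -> str:
    """Replace each \\uHHHH or \\UHHHHHHHH escape with the character it names."""

    def replace(match: re.Match) -> str:
        hex_digits = match.group(1) or match.group(2)
        # chr() raises ValueError for values above U+10FFFF.
        return chr(int(hex_digits, 16))

    return _ESCAPE.sub(replace, text)
```

Text without any escapes passes through unchanged, so the same routine can be applied to plain ASCII files.
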
> Onto more important issues. When do we use real unicode, and when do we 
> use ASCII files containing quoted unicode? Currently we have made real 
> unicode work in the ADL workbench and Archetype Editor, and I would not 
> anticipate any problems in the Java Archetype tools. So it is mostly 
>   
The Java ADL parser currently uses UTF-8 as the encoding for parsing. 
Support for more encodings is planned, which is quite easy to do in 
Java.
> likely a question not of archetype tools, but of sharing ADL files. With 
> no unicode, and assuming latin-1 based languages, ADL files are (as far 
> as I can tell) safe to transport as text files. However, even for 
> languages like Turkish (which has an odd situation to do with upper and 
> lower case), these files get broken, and unicode is needed; but then an 
> ADL file is no longer a "text file" from the point of view of file 
> sharing, mime-type and so on. We have not defined a mime-type, but it 
> would be one of the application ones I guess.
>
> One problem is that a person receiving an ADL file under the quoting 
> proposal here might be receiving:
> - a "safe" text file with only ASCII / latin-1 alphabet characters in it 
> ("real" ascii)
> - a "safe" text file with quoted unicode, that is in fact an archetype 
> written in say Turkish, Farsi, Chinese etc
> - a binary file containing UTF-8 unicode characters, that will look like 
> a text file with some funny characters in it depending on how smart your 
> editor is...
> - or....a UTF-8 encoded file that also contained \uHHHH encoded 
> characters (due to cut and paste in some editor environment)
>
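
The four cases listed above can be told apart mechanically. The following is only a sketch in Python. The function name and the returned labels are hypothetical, and it adds a fifth outcome for bytes that are not valid UTF-8 (for example ISO-8859-1 text with accented characters).

```python
import re

# Byte-level pattern for \uHHHH or \UHHHHHHHH escapes.
_ESCAPE = re.compile(rb"\\u[0-9A-Fa-f]{4}|\\U[0-9A-Fa-f]{8}")


def classify_adl_bytes(data: bytes) -> str:
    """Return a hypothetical label describing how the bytes appear to be encoded."""
    has_escapes = _ESCAPE.search(data) is not None
    is_ascii = all(byte < 0x80 for byte in data)
    if is_ascii:
        return "ascii-with-escapes" if has_escapes else "plain-ascii"
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        # Bytes above 0x7F that do not form valid UTF-8 sequences.
        return "not-utf-8"
    return "utf-8-with-escapes" if has_escapes else "utf-8"
```

A check like this only reports what the bytes look like; it cannot tell ISO-8859-1 from other single-byte encodings, which is one reason an explicit marker is still useful.
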
> There seem to be a couple of ways of dealing with this:
> - include an "encoding" attribute at the top of ADL files, indicating 
> how to read the file
>   
I like the idea of including an "encoding" attribute in the ADL, 
probably in the archetype header section. It's also good to keep the 
encoding information in the archetype (in AOM form) so that the ADL 
serializer can use the right encoding for output.
> - create a new file extension and specify that .adl is for UTF-8 encoded 
> files, and that (say) .uadl is for ascii encoded files containing 
> unicode quoting...
>   
This doesn't seem as flexible as the first option. We would need to 
create a new file extension each time we support a new encoding. It's 
good to keep all the metadata about the archetype, including the 
encoding, in the header section.
> The first is the more obvious thing to do, since it is what XML, HTML 
> and probably other formats (RTF?) do; this is easy to add to ADL 
> archetypes as a field. It would have to be an optional field, so that 
> no current ADL files are invalidated. This means we a) have to 
> choose the allowable encoding names (UTF-8 is the default in openEHR for 
> true unicode; the other will presumably be ISO-8859-1), and b) need to 
> specify which encoding is assumed for an ADL file with no encoding 
> marker. I propose that it is UTF-8, since we already have "cracked" that 
> problem, and we say that it is only ISO-8859-1 if it actually says so. 
>   
Agree. This is supported in the Java ADL parser.
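
As an illustration of the "UTF-8 unless the file says otherwise" rule, here is a minimal sketch in Python. It is not the Java parser's code. The marker syntax `encoding=NAME` on the first line is an assumption made only for this example, since the actual syntax has not been decided in this thread, and the names `decode_adl` and `ALLOWED_ENCODINGS` are hypothetical.

```python
import re

# Assumed marker syntax for this sketch only: "encoding=NAME" on the first line.
_MARKER = re.compile(rb"encoding\s*=\s*([A-Za-z0-9_-]+)")

# The two encoding names mentioned in the discussion.
ALLOWED_ENCODINGS = {"utf-8", "iso-8859-1"}


def decode_adl(data: bytes) -> str:
    """Decode ADL bytes, defaulting to UTF-8 when no marker is present."""
    # The marker is pure ASCII, so it can be found before the encoding is known.
    first_line = data.split(b"\n", 1)[0]
    match = _MARKER.search(first_line)
    name = match.group(1).decode("ascii").lower() if match else "utf-8"
    if name not in ALLOWED_ENCODINGS:
        raise ValueError(f"unsupported encoding: {name}")
    return data.decode(name)
```

Because the marker is optional and the default is UTF-8, existing files without a marker are read the same way as before.
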


Cheers,
Rong
> This might sound odd, but remember UTF-8 is a proper superset of ASCII 
> anyway, so for all us western language people wondering if our files 
> will look funny, they won't. However, we could do it the other way round 
> - I don't see any terribly strong arguments one way or the other.
>
> further thoughts anyone?
>
> - thomas beale
>
>
> _______________________________________________
> openEHR-technical mailing list
> openEHR-technical at openehr.org
> http://www.chime.ucl.ac.uk/mailman/listinfo/openehr-technical
>
>   
