questions about string literals

Thomas Beale Fri, 06 Oct 2006 11:58:37 +0100

Andrew Patterson wrote:
>   
>> - use the \uNNNN approach Andrew suggests (is this hex or decimal?)
>>     
>
> This is hexadecimal (as per the unicode spec for unicode codepoints).
> C# and Java use this notation - C# extends it to also have \UXXXXXXXX
> for 32 bit codepoints (as per the new unicode versions)
>
>   
One of Andrew's issues I don't think matters too much - the quoting of 
&; this is because &amp; is used to quote the '&' character. However, I 
agree that using a more uniform '\'-based quoting approach will be 
clearer, and make for easier parser construction. So let's say that we 
will go for the \uHHHH \UHHHHHHHH approach.
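
For illustration only, here is a minimal sketch of how a reader of ADL text might expand the two escape forms agreed above. The function name `decode_escapes` is hypothetical and not part of any openEHR tool. The sketch assumes hexadecimal digits, exactly 4 after `\u` and 8 after `\U`, and it does not deal with an escaped backslash preceding the `u`.

```python
import re

# Matches \uHHHH (4 hex digits) or \UHHHHHHHH (8 hex digits).
_ESCAPE = re.compile(r"\\u([0-9A-Fa-f]{4})|\\U([0-9A-Fa-f]{8})")


def decode_escapes(text: str) -> str:
    # Hypothetical helper: replace each escape with the character it names.
    def repl(match: "re.Match[str]") -> str:
        hex_digits = match.group(1) or match.group(2)
        # chr() raises ValueError for values above U+10FFFF.
        return chr(int(hex_digits, 16))

    return _ESCAPE.sub(repl, text)


if __name__ == "__main__":
    print(decode_escapes(r"\u0130stanbul"))
```

Text without escapes passes through unchanged, so the same routine could be applied to both quoted and unquoted files.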


On to more important issues. When do we use real unicode, and when do we 
use ASCII files containing quoted unicode? Currently we have made real 
unicode work in the ADL workbench and Archetype Editor, and I would not 
anticipate any problems in the Java Archetype tools. So it is most 
likely a question not of archetype tools, but of sharing ADL files. With 
no unicode, and assuming latin-1 based languages, ADL files are (as far 
as I can tell) safe to transport as text files. However, even for 
languages like Turkish (which has an odd situation to do with upper and 
lower case), these files get broken, and unicode is needed; but then an 
ADL file is no longer a "text file" from the point of view of file 
sharing, mime-type and so on. We have not defined a mime-type, but it 
would be one of the application ones I guess.

One problem under the quoting proposal here is that a person receiving 
an ADL file might be receiving:
- a "safe" text file with only ASCII / latin-1 alphabet characters in it 
("real" ascii)
- a "safe" text file with quoted unicode, that is in fact an archetype 
written in say Turkish, Farsi, Chinese etc
- a binary file containing UTF-8 unicode characters, that will look like 
a text file with some funny characters in it depending on how smart your 
editor is...
- or....a UTF-8 encoded file that also contained \uHHHH encoded 
characters (due to cut and paste in some editor environment)
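
As a rough sketch of how a receiving tool could tell the cases above apart, the following inspects the raw bytes of a file. The function name `classify_adl_bytes` and the returned labels are hypothetical. Note that it cannot positively identify latin-1: bytes that fail to decode as UTF-8 are only reported as "not-utf8", and some latin-1 byte sequences can happen to be valid UTF-8.

```python
import re

# Byte-level pattern for \uHHHH or \UHHHHHHHH escapes.
_ESCAPE = re.compile(rb"\\u[0-9A-Fa-f]{4}|\\U[0-9A-Fa-f]{8}")


def classify_adl_bytes(data: bytes) -> str:
    # Hypothetical helper: guess which kind of file was received.
    has_escapes = _ESCAPE.search(data) is not None
    is_ascii = all(b < 0x80 for b in data)
    if is_ascii:
        # "Safe" text file, with or without quoted unicode.
        return "ascii-with-escapes" if has_escapes else "plain-ascii"
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        # High-bit bytes that are not valid UTF-8, possibly latin-1.
        return "not-utf8"
    # Real UTF-8, possibly mixed with escapes after cut and paste.
    return "utf8-with-escapes" if has_escapes else "utf8"


if __name__ == "__main__":
    print(classify_adl_bytes(rb"\u0130stanbul"))
```

A guess of this kind is only a heuristic, which is part of the argument for stating the encoding explicitly.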

There seem to be a couple of ways of dealing with this:
- include an "encoding" attribute at the top of ADL files, indicating 
how to read the file
- create a new file extension and specify that .adl is for UTF-8 encoded 
files, and that (say) .uadl is for ascii encoded files containing 
unicode quoting...

The first is the more obvious thing to do, since it is what XML, HTML 
and probably other formats (RTF?) do; this is easy to add to ADL 
archetypes as a field. It would have to be an optional field, so that 
current ADL files are not invalidated. This means we have to a) choose 
the allowable encoding names (UTF-8 is the default in openEHR for true 
unicode; the other will presumably be ISO-8859-1), and b) specify which 
encoding is assumed for an ADL file with no encoding marker. I propose 
that it is UTF-8, since we have already "cracked" that problem, and we 
say that it is only ISO-8859-1 if it actually says so. 
This might sound odd, but remember UTF-8 is a proper superset of ASCII 
anyway, so for all us western language people wondering if our files 
will look funny, they won't. However, we could do it the other way round 
- I don't see any terribly strong arguments one way or the other.
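
A minimal sketch of the "optional marker, UTF-8 by default" rule follows. The marker syntax is not defined in this thread, so the first-line form `encoding = <name>` used here is purely an assumption for illustration, as are the names `read_adl`, `ALLOWED` and `DEFAULT`. The marker line itself is plain ASCII, so it can be read before the encoding of the rest of the file is known.

```python
import re

ALLOWED = {"utf-8", "iso-8859-1"}  # the two names discussed above
DEFAULT = "utf-8"                  # assumed when no marker is present

# Hypothetical marker syntax: a first line such as "encoding = ISO-8859-1".
_MARKER = re.compile(rb"^\s*encoding\s*=\s*([A-Za-z0-9_\-]+)\s*$")


def read_adl(data: bytes) -> str:
    # Hypothetical helper: decode ADL bytes using the marker or the default.
    first_line = data.split(b"\n", 1)[0]
    match = _MARKER.match(first_line)
    encoding = match.group(1).decode("ascii").lower() if match else DEFAULT
    if encoding not in ALLOWED:
        raise ValueError("unsupported encoding: " + encoding)
    return data.decode(encoding)


if __name__ == "__main__":
    print(read_adl("encoding = ISO-8859-1\ncaf\u00e9".encode("latin-1")))
```

Existing files with no marker are decoded as UTF-8, which leaves pure ASCII files unaffected.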

further thoughts anyone?

- thomas beale


_______________________________________________
openEHR-technical mailing list
openEHR-technical at openehr.org
http://www.chime.ucl.ac.uk/mailman/listinfo/openehr-technical
