Andrew Patterson wrote:
>
>> - use the \uNNNN approach Andrew suggests (is this hex or decimal?)
>>
>
> This is hexadecimal (as per the unicode spec for unicode codepoints).
> C# and Java use this notation - C# extends it to also have \UXXXXXXXX
> for 32 bit codepoints (as per the new unicode versions)
>
>
One of Andrew's issues - the quoting of '&' - I don't think matters too
much, since '&' quoting is only needed to escape the '&' character
itself. However, I agree that a more uniform '\'-based quoting approach
will be clearer, and will make for easier parser construction. So let's
say that we will go for the \uHHHH / \UHHHHHHHH approach.
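As a quick illustration of what that quoting rule means for tool builders, here is a minimal sketch (not taken from any openEHR tool; the function name is mine) of decoding \uHHHH (16-bit) and \UHHHHHHHH (32-bit) escapes, where the digits are hexadecimal as per the Unicode codepoint notation:

```python
import re

# Matches \uHHHH (4 hex digits) or \UHHHHHHHH (8 hex digits)
_ESCAPE = re.compile(r'\\u([0-9A-Fa-f]{4})|\\U([0-9A-Fa-f]{8})')

def decode_adl_escapes(text: str) -> str:
    """Replace \\uHHHH and \\UHHHHHHHH escapes in an otherwise-ASCII
    string with the corresponding Unicode characters."""
    def repl(m):
        hex_digits = m.group(1) or m.group(2)
        return chr(int(hex_digits, 16))  # digits are hexadecimal
    return _ESCAPE.sub(repl, text)

# Turkish dotless i (U+0131) via a 16-bit escape:
print(decode_adl_escapes(r"d\u0131l"))
# A codepoint outside the BMP needs the 32-bit \U form:
print(decode_adl_escapes(r"\U0001D49E"))
```

The same logic would port directly to the Java tools, since Java already treats \uHHHH this way in its own source syntax.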
On to more important issues: when do we use real unicode, and when do we
use ASCII files containing quoted unicode? Currently we have made real
unicode work in the ADL workbench and Archetype Editor, and I would not
anticipate any problems in the Java Archetype tools. So it is most
likely a question not of archetype tools, but of sharing ADL files. With
no unicode, and assuming latin-1-based languages, ADL files are (as far
as I can tell) safe to transport as text files. However, even for
languages like Turkish (which has an odd situation to do with upper and
lower case), these files get broken, and unicode is needed; but then an
ADL file is no longer a "text file" from the point of view of file
sharing, mime-type and so on. We have not defined a mime-type, but it
would be one of the application ones, I guess.
One problem is that a person receiving an ADL file under the quoting
proposal here might be receiving any of the following:
- a "safe" text file with only ASCII / latin-1 alphabet characters in it
("real" ascii)
- a "safe" text file with quoted unicode, that is in fact an archetype
written in say Turkish, Farsi, Chinese etc
- a binary file containing UTF-8 unicode characters, that will look like
a text file with some funny characters in it depending on how smart your
editor is...
- or... a UTF-8 encoded file that also contains \uHHHH-encoded
characters (due to cut and paste in some editor environment)
There seem to be a couple of ways of dealing with this:
- include an "encoding" attribute at the top of ADL files, indicating
how to read the file
- create a new file extension and specify that .adl is for UTF-8 encoded
files, and that (say) .uadl is for ASCII-encoded files containing
unicode quoting...
The first is the more obvious thing to do, since it is what XML, HTML
and probably other formats (RTF?) do, and it is easy to add to ADL
archetypes as a field. It would have to be an optional field, so that
all current ADL files are not invalidated. This means we have to a)
choose the allowable encoding names (UTF-8 is the default in openEHR for
true unicode; the other will presumably be ISO-8859-1), and b) specify
which encoding is assumed for an ADL file with no encoding marker. I
propose that the default be UTF-8, since we have already "cracked" that
problem, and that a file is treated as ISO-8859-1 only if it actually
says so. This might sound odd, but remember that UTF-8 is a proper
superset of ASCII anyway, so for all of us western-language people
wondering if our files will look funny: they won't. However, we could do
it the other way round - I don't see any terribly strong arguments one
way or the other.
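To make the proposed reading rule concrete, here is a small sketch (the marker syntax and function name are my assumptions, not anything in the ADL spec): honour an optional "encoding" marker near the top of the file, default to UTF-8 when none is present, and fall back to ISO-8859-1 only when the file declares it:

```python
import re

# Hypothetical marker syntax, e.g.:  encoding = <"ISO-8859-1">
_ENCODING_MARKER = re.compile(r'encoding\s*=\s*<"([^"]+)">')

def read_adl(raw: bytes) -> str:
    """Decode raw ADL bytes per the proposed rule: declared encoding
    wins; otherwise assume UTF-8."""
    # The marker itself is pure ASCII, so an ASCII peek at the header
    # (ignoring any non-ASCII bytes) is safe for finding it.
    header = raw[:512].decode('ascii', errors='ignore')
    m = _ENCODING_MARKER.search(header)
    encoding = m.group(1) if m else 'UTF-8'  # UTF-8 is the proposed default
    return raw.decode(encoding)

# Because UTF-8 is a proper superset of ASCII, a plain-ASCII file
# decodes identically under either assumption:
ascii_bytes = b"archetype text with no special characters"
assert ascii_bytes.decode('UTF-8') == ascii_bytes.decode('ISO-8859-1')
```

This is why defaulting to UTF-8 costs nothing for existing latin-1-alphabet files that happen to contain only ASCII: they read the same either way.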
Further thoughts, anyone?
- thomas beale
_______________________________________________
openEHR-technical mailing list
openEHR-technical at openehr.org
http://www.chime.ucl.ac.uk/mailman/listinfo/openehr-technical