questions about string literals

Andrew Patterson Sun, 8 Oct 2006 00:40:12 +1000

> One of Andrew's issues I don't think matters too much - the quoting of
> &; this is because &amp; is used to quote the '&' character. However, I
> agree that using a more uniform '\'-based quoting approach will be
> clearer, and make for easier parser construction. So let's say that we
> will go for the \uHHHH \UHHHHHHHH approach.


Are you saying that the \u quoting will be used instead of the
XML quoting or in addition to? If you are saying the first, please
ignore the following rant :)
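
To make the \uHHHH / \UHHHHHHHH proposal concrete, here is a minimal Python sketch of what decoding such a string body might look like. The function name decode_adl_string is hypothetical, and the assumption that \\ and \" are the only other escapes is mine, not something the ADL specification states.

```python
import re

# Assumed escape grammar (not taken from any ADL spec):
#   \uHHHH, \UHHHHHHHH, \\ and \"
_ESCAPE = re.compile(r'\\(?:u([0-9A-Fa-f]{4})|U([0-9A-Fa-f]{8})|(["\\]))')


def decode_adl_string(body: str) -> str:
    """Decode the body of a string literal (surrounding quotes already removed).

    Unknown escapes are left untouched. A \\U value above 10FFFF raises
    ValueError, because chr() rejects it.
    """
    def replace(match: re.Match) -> str:
        short_hex, long_hex, literal = match.groups()
        if literal is not None:
            # \\ or \" : keep the character, drop the leading backslash
            return literal
        return chr(int(short_hex or long_hex, 16))

    return _ESCAPE.sub(replace, body)
```

With this, a clinician's "pain to head & chest" passes through unchanged, and a tool that wants to be careful can write the ampersand as \u0026 instead.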

I still think the & quoting is needlessly confusing and pointless. My
issues are:

1) It is completely non-obvious. As an ADL user I would never expect to
use XML quoting rules in an ADL string definition, because ADL
is clearly not an XML document. Sure, it has bits that are like XML, but
if you want it to be XML, then go the whole way. More importantly,
for one of the target groups of ADL, the clinicians, it is a behaviour
that I imagine could confuse them. They have never heard of XML
quoting rules and hence may just type in strings like
"term code meaning pain to head & chest" in their ADL strings.
Now this may be mitigated by the fact that they will often
be editing ADL in a tool, but if ADL is only going to be edited
by tools we should drop the human parseable format and
do the whole thing in XML.

2) It is a pain to implement - now every ADL parser needs to
have an XML entity converter built in as well - which entities are
included - just the XML ones (&lt; &gt; &amp;)? What about the
HTML/SGML ones (&acute; &grave;)?? Does every ADL
implementation need to have the table of standard unicode
names built in to be able to parse strings? Do angle brackets
need to be quoted - they do in XML but that is because they
have special meaning. Yet within ADL strings they don't. Of course,
the two characters that do need to be quoted are the \ and the
quotation mark. Are these quoted in XML? Not by default, and
so now the XML programmers are confused :)
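
The "which entities?" question is easy to see with two decoders from the Python standard library, which already disagree with each other. This is only an illustration of the ambiguity; decode_both_ways is a hypothetical helper, and neither decoder is meant to represent what an ADL parser does.

```python
import html
from xml.sax.saxutils import unescape as xml_unescape


def decode_both_ways(text):
    """Return (xml_result, html_result) for the same input string."""
    # xml.sax.saxutils.unescape only knows &lt; &gt; and &amp; by default
    xml_result = xml_unescape(text)
    # html.unescape knows the full HTML5 named-entity table
    html_result = html.unescape(text)
    return xml_result, html_result


xml_only, html_all = decode_both_ways("&lt;5 &amp; &acute;")
# xml_only keeps "&acute;" as literal text, html_all turns it into U+00B4
```

A bare "&" followed by a space comes back unchanged from both decoders, so the clinician's string happens to survive here, but a stricter XML-style parser could just as well reject it.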

> choose the allowable encoding names (UTF-8 is the default in openEHR for
> true unicode; the other will presumably be ISO-8859-1); we then need to
> specify which encoding is assumed for an ADL file with no encoding
> marker; I propose that it is UTF-8, since we already have "cracked" that
> problem, and we say that it is only ISO-8859-1 if it actually says so.
> This might sound odd, but remember UTF-8 is a proper superset of ASCII
> anyway, so for all us western language people wondering if our files
> will look funny, they won't. However, we could do it the other way round
> - I don't see any terribly strong arguments one way or the other.

I think you are right that it should default to UTF-8. I am not sure
of the correct way to put the encoding marker in - if it's a standard
archetype field then the parser is obviously well into parsing the
file before it finds out what encoding the file is in. That then
rules out encodings such as UTF-16, because it would be impossible
to write even the first "archetype" keyword in such a way that the
parser could parse it.
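
One way around the chicken-and-egg problem, borrowed from how XML and plain text files are commonly handled rather than from anything agreed for ADL, is to look for a byte order mark before parsing and fall back to UTF-8 otherwise. A rough Python sketch, with sniff_encoding and read_adl_text as hypothetical names:

```python
import codecs

# UTF-32 must be checked before UTF-16: the UTF-32-LE BOM (FF FE 00 00)
# begins with the UTF-16-LE BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_encoding(raw: bytes, default: str = "utf-8"):
    """Return (encoding_name, bom_length); assume `default` when no BOM is found."""
    for bom, name in _BOMS:
        if raw.startswith(bom):
            return name, len(bom)
    return default, 0


def read_adl_text(raw: bytes) -> str:
    """Decode raw file bytes using the sniffed (or default) encoding."""
    name, skip = sniff_encoding(raw)
    return raw[skip:].decode(name)
```

Without a BOM, a UTF-16 file decoded as UTF-8 yields "archetype" interleaved with NUL characters, which is exactly the unparseable first keyword described above.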

I actually don't feel too strongly that ADL needs to be 7-bit safe
(i.e. I would be happy with UTF-8 as the default and leave it at that
- still including the \uxxxx rules to allow the insertion of characters
that are hard to _edit_, but assume UTF-8 can be read/transported).
Is there any web/email transport mechanism in existence now that
can't pass through an 8-bit stream untouched? Even more so, is there
any modern environment that can't parse UTF-8?? (keeping in mind
that this is not saying that openEHR systems won't have to exchange
data with old legacy systems, but I doubt the openEHR system will be
sending the legacy systems ADL files to parse??)
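
For anyone who does want 7-bit-safe output, the \uxxxx rules make that a purely mechanical step on the writing side. A small Python sketch, where encode_7bit is a hypothetical name and the choice to also escape \ and " is my assumption:

```python
def encode_7bit(text: str) -> str:
    """Return an ASCII-only form of `text` using \\uHHHH / \\UHHHHHHHH escapes."""
    out = []
    for ch in text:
        cp = ord(ch)
        if ch in ('\\', '"'):
            # Assumed: backslash and double quote are escaped with a backslash
            out.append('\\' + ch)
        elif cp < 0x80:
            out.append(ch)                 # plain ASCII passes through
        elif cp <= 0xFFFF:
            out.append(f'\\u{cp:04X}')     # Basic Multilingual Plane
        else:
            out.append(f'\\U{cp:08X}')     # supplementary planes
    return ''.join(out)
```

A file written this way is still valid UTF-8, so a parser that defaults to UTF-8 needs no special mode to read it.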

Andrew
_______________________________________________
openEHR-technical mailing list
openEHR-technical at openehr.org
http://www.chime.ucl.ac.uk/mailman/listinfo/openehr-technical
