questions about string literals

Thomas Beale Sat, 07 Oct 2006 16:46:31 +0100

Andrew Patterson wrote:
>> One of Andrew's issues I don't think matters too much - the quoting of
>> '&'; this is because &amp; is used to quote the '&' character. However, I
>> agree that using a more uniform '\'-based quoting approach will be
>> clearer, and make for easier parser construction. So let's say that we
>> will go for the \uHHHH \UHHHHHHHH approach.
>>     
>
> Are you saying that the \u quoting will be used instead of the
> XML quoting or in addition to? If you are saying the first, please
> ignore the following rant :)
>   
I am following your original suggestion, to replace the current XML 
quoting rules with \u and \U (since we already use \ to quote anyway, 
and as you point out, the & stuff is ugly.)
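
A minimal sketch of what decoding the proposed escapes might look like, in Python. The escape set is an assumption made for illustration: only \\ and \" (the two characters noted in the thread as needing quoting) plus \uHHHH and \UHHHHHHHH. The function name unescape_adl_string is hypothetical and not taken from any ADL parser.

```python
import re

# Assumed escape set (not from the ADL specification): \\ and \" plus
# \uHHHH (4 hex digits) and \UHHHHHHHH (8 hex digits).
# ".?" also catches a lone backslash at the very end of the string.
_ESCAPE = re.compile(r'\\(u[0-9A-Fa-f]{4}|U[0-9A-Fa-f]{8}|.?)', re.DOTALL)


def unescape_adl_string(body: str) -> str:
    """Decode the body of a quoted string (surrounding quotes already removed)."""
    def replace(match: re.Match) -> str:
        token = match.group(1)
        if len(token) > 1:
            # 'u' or 'U' followed by hex digits: convert the code point.
            return chr(int(token[1:], 16))
        if token in ('\\', '"'):
            # The two characters that must be quoted inside a string.
            return token
        # Anything else (including a malformed \u or a trailing backslash).
        raise ValueError(f"unknown or malformed escape: \\{token}")

    return _ESCAPE.sub(replace, body)


if __name__ == "__main__":
    print(unescape_adl_string(r'caf\u00e9'))
    print(unescape_adl_string('term code meaning pain to head & chest'))
```

With this approach '&' and angle brackets pass through untouched, so no entity table is needed in the parser.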
> I still think the & is needlessly confusing and pointless. My
> issues are:
>
> 1) it is completely non-obvious - as an ADL user I would never expect to use
> the XML quoting rules in the string definition in ADL because ADL
> is clearly not an XML document.. sure, it has bits that are like XML, but
> if you want it to be XML, then go the whole way. More importantly,
> for one of the target groups of ADL, the clinicians, it is a behaviour
> that I imagine could confuse them. They have never heard of XML
> quoting rules and hence may just type in strings like
> "term code meaning pain to head & chest" in their ADL strings.
> Now this may be mitigated by the fact that they will often
> be editing ADL in a tool, but if ADL is only going to be edited
> by tools we should drop the human parseable format and
> do the whole thing in XML.
>   
agreed. I personally don't see XML as useful other than a purely literal 
transfer syntax, i.e. a serialisation of objects. ADL is an abstract 
syntax, which is both readable by humans, and for which abstract parsers 
can be written; the parser that can read the XML form (which will be 
supported fairly soon, but is completely unreadable) is a pure object 
serialiser/deserialiser, not a language parser.
> 2) It is a pain to implement - now every ADL parser needs to
> have an XML entity converter built in as well - which entities are
> included - just the XML ones (&lt; &gt; &amp;)? What about the
> HTML/SGML ones (&acute; &grave;)?? Does every ADL
> implementation need to have the table of standard unicode
> names built in to be able to parse strings? Do angle brackets
> need to be quoted - they do in XML but that is because they
> have special meaning. Yet within ADL strings they don't. Of course,
> the two characters that do need to be quoted are the \ and the
> quotation mark. Are these quoted in XML? Not by default, and
> so now the XML programmers are confused :)
>   
yes, I also agree with this ;-)
>   
>> choose the allowable encoding names (UTF-8 is the default in openEHR for
>> true unicode; the other will presumably be ISO-8859-1); we then need to
>> specify which encoding is assumed for an ADL file with no encoding
>> marker; I propose that it is UTF-8, since we already have "cracked" that
>> problem, and we say that it is only ISO-8859-1 if it actually says so.
>> This might sound odd, but remember UTF-8 is a proper superset of ASCII
>> anyway, so for all us western language people wondering if our files
>> will look funny, they won't. However, we could do it the other way round
>> - I don't see any terribly strong arguments one way or the other.
>>     
>
> I think you are right that it should default to UTF-8. I am not sure
> the correct way of putting the encoding marker in - if it's a standard
> archetype field then the parser is obviously well into parsing the
> file before it finds out what encoding the file is in? Which then
> invalidates encodings such as UTF-16 because it would be impossible
> to write even the first "archetype" keyword in such a way that the
> parser could parse it.
>   
It probably has to be on the first line, which is easy enough to deal 
with. At this stage, I think it's reasonable to just allow UTF-8 and 
ISO-8859-1 only. UTF-16 et al need byte order markers at the start of 
the file (which removes the need for the encoding indicator in the file 
I guess); but let's not go there yet.
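
A rough sketch of the first-line approach, in Python. The marker syntax is not defined anywhere in this thread, so the "encoding=NAME" form used here is purely hypothetical, as is the function name decode_adl. The sketch relies on the fact that the first line can be scanned as raw bytes, since ASCII characters have the same byte values in UTF-8 and ISO-8859-1.

```python
import re

# Only the two encodings proposed in this thread are accepted.
ALLOWED = {"utf-8", "iso-8859-1"}

# Hypothetical marker syntax, invented for illustration only.
_MARKER = re.compile(rb"encoding\s*=\s*([A-Za-z0-9_-]+)")


def decode_adl(raw: bytes) -> str:
    """Decode file bytes, defaulting to UTF-8 when no marker is present."""
    # Look only at the first line, before committing to any decoding.
    first_line = raw.split(b"\n", 1)[0]
    match = _MARKER.search(first_line)
    encoding = match.group(1).decode("ascii").lower() if match else "utf-8"
    if encoding not in ALLOWED:
        raise ValueError(f"unsupported encoding: {encoding}")
    return raw.decode(encoding)


if __name__ == "__main__":
    # No marker: treated as UTF-8.
    print(decode_adl("archetype\ncaf\u00e9".encode("utf-8")))
    # Explicit marker: treated as ISO-8859-1.
    print(decode_adl(b"archetype (encoding=ISO-8859-1)\ncaf\xe9"))
```

A UTF-16 file would not get this far, since its first line is not ASCII-compatible; that case would need a byte order mark check before the scan.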
> I actually don't feel too strongly that ADL needs to be 7-bit safe
> (i.e. I would be happy with UTF-8 as the default and leave it at that
> - still including the \uxxxx rules to allow the insertion of characters
> that are hard to _edit_, but assume UTF-8 can be read/transported).
> Is there any web/email transport mechanism in existence now that
> can't pass through an 8-bit stream untouched? Even more so, is there
> any modern environment that can't parse UTF-8?? (keeping in mind
> that this is not saying that openEHR systems won't have to exchange
> data with old legacy systems, but I doubt the openEHR system will be
> sending the legacy systems ADL files to parse??)
>   
well, Notepad and gvim on Windows don't get it right....but that may 
just be display...


- thomas


_______________________________________________
openEHR-technical mailing list
openEHR-technical at openehr.org
http://www.chime.ucl.ac.uk/mailman/listinfo/openehr-technical
