> One of Andrew's issues I don't think matters too much - the quoting of
> &amp;; this is because &amp; is used to quote the '&' character. However, I
> agree that using a more uniform '\'-based quoting approach will be
> clearer, and make for easier parser construction. So let's say that we
> will go for the \uHHHH \UHHHHHHHH approach.

Are you saying that the \u quoting will be used instead of the XML
quoting or in addition to it? If you are saying the first, please ignore
the following rant :)

I still think the &amp; is needlessly confusing and pointless. My issues
are:

1) It is completely non-obvious - as an ADL user I would never expect to
use the XML quoting rules in a string definition in ADL, because ADL is
clearly not an XML document. Sure, it has bits that are like XML, but if
you want it to be XML, then go the whole way. More importantly, for one
of the target groups of ADL, the clinicians, it is a behaviour that I
imagine could confuse them. They have never heard of XML quoting rules
and hence may just type in strings like "term code meaning pain to head
& chest" in their ADL strings. Now this may be mitigated by the fact
that they will often be editing ADL in a tool, but if ADL is only going
to be edited by tools we should drop the human-parseable format and do
the whole thing in XML.

2) It is a pain to implement - now every ADL parser needs to have an XML
entity converter built in as well. Which entities are included - just
the XML ones (&lt; &gt; &amp;)? What about the HTML/SGML ones (&acute;
&grave;)? Does every ADL implementation need to have the table of
standard Unicode names built in to be able to parse strings? Do angle
brackets need to be quoted? They do in XML, but that is because they
have special meaning there; within ADL strings they don't. Of course,
the two characters that do need to be quoted are the \ and the quotation
mark. Are these quoted in XML? Not by default, and so now the XML
programmers are confused :)

> choose the allowable encoding names (UTF-8 is the default in openEHR for
> true unicode; the other will presumably be ISO-8859-1); we then need to
> specify which encoding is assumed for an ADL file with no encoding
> marker; I propose that it is UTF-8, since we already have "cracked" that
> problem, and we say that it is only ISO-8859-1 if it actually says so.
> This might sound odd, but remember UTF-8 is a proper superset of ASCII
> anyway, so for all us western language people wondering if our files
> will look funny, they won't. However, we could do it the other way round
> - I don't see any terribly strong arguments one way or the other.

I think you are right that it should default to UTF-8. I am not sure of
the correct way to put the encoding marker in, though. If it's a
standard archetype field, then the parser is obviously well into parsing
the file before it finds out what encoding the file is in. That rules
out encodings such as UTF-16, because it would be impossible to write
even the first "archetype" keyword in such a way that the parser could
parse it.

I actually don't feel too strongly that ADL needs to be 7-bit safe (i.e.
I would be happy with UTF-8 as the default and leave it at that - still
including the \uxxxx rules to allow the insertion of characters that are
hard to _edit_, but assuming UTF-8 can be read/transported). Is there
any web/email transport mechanism in existence now that can't pass
through an 8-bit stream untouched? Even more so, is there any modern
environment that can't parse UTF-8? (Keep in mind that this is not
saying that openEHR systems won't have to exchange data with old legacy
systems, but I doubt the openEHR system will be sending the legacy
systems ADL files to parse.)

Andrew
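
To make the backslash-based quoting discussed above concrete, here is a
minimal sketch of how a parser might decode the body of a string. It
rests on assumptions that are not settled anywhere in this thread: that
only \\, \", \uHHHH and \UHHHHHHHH are recognised, that any other
character after a backslash is an error, and that the input is the text
between the surrounding quotes. The name decode_adl_string is
hypothetical and not part of any ADL specification.

```python
import re

# One backslash followed by: u + 4 hex digits, U + 8 hex digits,
# any single character, or end of input (a dangling backslash).
_ESCAPE = re.compile(r'\\(u[0-9A-Fa-f]{4}|U[0-9A-Fa-f]{8}|.|$)', re.DOTALL)


def decode_adl_string(body: str) -> str:
    """Decode backslash escapes in the text between the quotes.

    Hypothetical helper: the accepted escape set is an assumption,
    not something defined by the ADL specification.
    """
    def replace(match):
        token = match.group(1)
        if token in ('\\', '"'):
            # The only two characters that must be escaped in a string.
            return token
        if len(token) > 1:
            # 'u' + 4 hex digits or 'U' + 8 hex digits.
            return chr(int(token[1:], 16))
        # Unknown escape such as \q, malformed \u12, or trailing backslash.
        raise ValueError("unsupported escape at offset %d" % match.start())

    return _ESCAPE.sub(replace, body)
```

Under this scheme a string such as "pain to head & chest" passes through
unchanged, because & has no special meaning and no entity table is
needed; only the backslash and the quotation mark ever require escaping.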

