Hoylen Sue wrote:

> Thomas Beale <thomas at deepthought.com.au> writes:
>
>> I wonder if this is true for people using openEHR-based components via
>> an API rather than communicating via data messages. I assume that the
>> unicode implementation used in the String type in most of today's
>> languages makes it easy to determine what width unicode characters you
>> have in the data?
>
> In all the cases I know of, once the data has been read into
> the native string type, discovering its "width" is no longer
> an issue. This is because the native string type is defined
> to support only a single encoding. Data is converted into
> that encoding when it is read in.

So if the data is stored in UTF-16, say, will the library in a Unix application detect the relevant byte pattern and dispatch the appropriate conversion routine to do UTF-16 -> UTF-8, or UTF-32 -> UTF-8? If I remember correctly, this is possible because the width encoding in use can be determined from the actual binary data; there is no guarantee that the data is in XML form, with its encoding stated.
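As an illustration of this kind of byte-pattern sniffing, here is a minimal Python sketch (the function names and the fallback-to-UTF-8 policy are my own assumptions; real libraries such as ICU do considerably more):

```python
# Illustrative sketch, not openEHR code: guess the encoding of raw Unicode
# bytes from a leading byte-order mark (BOM), then normalise to UTF-8.
# Without a BOM (or an XML declaration), detection is only a heuristic.
import codecs

def detect_unicode_encoding(data: bytes) -> str:
    """Guess the Unicode encoding of raw bytes from a leading BOM.

    Falls back to UTF-8 when no BOM is present, since plain UTF-8
    carries no mandatory in-band encoding marker.
    """
    # Order matters: the UTF-32 LE BOM begins with the UTF-16 LE BOM bytes.
    if data.startswith(codecs.BOM_UTF32_LE) or data.startswith(codecs.BOM_UTF32_BE):
        return "utf-32"
    if data.startswith(codecs.BOM_UTF16_LE) or data.startswith(codecs.BOM_UTF16_BE):
        return "utf-16"
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"   # UTF-8 with BOM; decoding strips the marker
    return "utf-8"

def to_utf8(data: bytes) -> bytes:
    """Convert raw Unicode bytes of any detected width to UTF-8."""
    return data.decode(detect_unicode_encoding(data)).encode("utf-8")
```

Note that this only tells you the encoding *form* (UTF-8/16/32), not which *version* of Unicode the text conforms to, which is exactly why the version question below arises.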
If the above is true, we still presumably need to mark the data as being Unicode v2, 3 or 4 etc.? This would only be necessary if the earlier versions were not pure subsets of the later ones. Can you clarify this, Hoylen?

> P.S. As an aside, Java 1.1 and later uses the Unicode 2.0
> character set. So if Unicode 3.0 or Unicode 4.0 is the
> target character set, implementations may be forced to
> implement their own string class rather than using the
> native java.lang.String. Something to consider when picking
> which Unicode version as the standard character set.

I would suggest that the requirements with respect to string representation are:

* systems get to store their data in whatever form is most convenient locally (e.g. whatever the DBMS wants to use);
* it must be possible for a third-party application which is openEHR-compliant to read the data in a system, even if it is not in its own "preferred" form (vertical interoperability);
* it must be possible for the data to be exported in a way that it can be universally read, or transformed into a readable form for use in another system (horizontal interoperability);
* the specifications commit implementors to as little as possible, while allowing the above requirements to be met.

Based on Hoylen's more recent info, a draft modelling solution seems to be:

1. openEHR states in its abstract specifications that all strings are in Unicode. Question: if there are no "simple strings" at all in the data and everything is a Unicode string, is this safe?

2. The following abstract model is used (improved from the version of a few weeks ago):

ENTRY class has
- a mandatory language attribute
- a mandatory character encoding attribute (which says which VERSION and which flavour of Unicode).

This forces the whole ENTRY to be encoded the same way no matter what, but also allows distinct ENTRYs to be encoded in e.g. Unicode 3.0/UTF-8 and Unicode 4.0/UTF-16.
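The ENTRY/DV_TEXT arrangement being proposed might be sketched roughly as follows (a Python sketch only; the class and attribute names are my own approximations, not the actual openEHR reference model definitions):

```python
# Illustrative sketch, not the openEHR reference model: ENTRY carries
# mandatory language and character-encoding attributes, while a text item
# may carry an optional language that overrides its enclosing ENTRY's.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class DvText:
    value: str
    language: Optional[str] = None   # optional override of the ENTRY language

@dataclass
class Entry:
    language: str        # mandatory, e.g. "en"
    char_encoding: str   # mandatory: Unicode version + flavour, e.g. "Unicode 4.0/UTF-16"
    items: List[DvText] = field(default_factory=list)

    def effective_language(self, text: DvText) -> str:
        """Resolve the language of a text item within this ENTRY."""
        return text.language if text.language is not None else self.language
```

So an ENTRY in "en" containing one item tagged "de" would resolve that item to "de", while untagged items fall back to "en"; the encoding, by contrast, is fixed once per ENTRY.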
DV_TEXT class has
- an optional language attribute, which is understood to override the one from its enclosing ENTRY.

3. Implementation specifications such as XML schemas and software APIs are required to make the character encoding and Unicode version attributes visible, so that clients can process / convert the data properly.

Further comments to clean this up will be much appreciated.

- thomas

--
___________________________________________________________________________________
CTO Ocean Informatics (http://www.OceanInformatics.biz)
Hon. Research Fellow, University College London
openEHR (http://www.openEHR.org)
Archetypes (http://www.oceaninformatics.biz/adl.html)
Community Informatics (http://www.deepthought.com.au/ci/rii/Output/mainTOC.html)

-
If you have any questions about using this list, please send a message to d.lloyd at openehr.org