Hoylen Sue wrote:

>Thomas Beale <thomas at deepthought.com.au> writes:
>  
>
>>I wonder if this is true for people using openEHR-based components via
>>an API rather than communicating via data messages. I assume that the
>>Unicode implementation used in the String type in most of today's
>>languages make it easy to determine what width unicode characters you
>>have in the data?
>>    
>>
>
>In all the cases I know of, once the data has been read into
>the native string type, discovering its "width" is no longer
>an issue.  This is because the native string type is defined
>to support only a single encoding.  Data is converted into
>that encoding when it is read in.
>  
>
So if the data is stored in UTF-16, say, the library in a Unix application
will detect the relevant byte pattern and dispatch the appropriate
conversion routine to do UTF-16 -> UTF-8, or UTF-32 -> UTF-8? If I
remember correctly, this is possible because the width of the encoding in
use can be determined from the binary data itself; there is no
guarantee that the data is in XML form, with its encoding stated.
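
For illustration, here is a minimal sketch (in Java, since it comes up
below) of this kind of detection and dispatch, assuming the data starts
with a byte-order mark (BOM); without a BOM the encoding cannot be
reliably determined from the bytes alone. The class and method names are
mine, not from any library:

    import java.nio.charset.Charset;

    class BomSniffer {
        // Guess the Unicode encoding from the first bytes of the data.
        // UTF-32LE must be tested before UTF-16LE, since its BOM
        // (FF FE 00 00) begins with the UTF-16LE pattern (FF FE).
        static String detectEncoding(byte[] b) {
            if (b.length >= 4 && b[0] == 0 && b[1] == 0
                    && (b[2] & 0xFF) == 0xFE && (b[3] & 0xFF) == 0xFF)
                return "UTF-32BE";
            if (b.length >= 4 && (b[0] & 0xFF) == 0xFF
                    && (b[1] & 0xFF) == 0xFE && b[2] == 0 && b[3] == 0)
                return "UTF-32LE";
            if (b.length >= 3 && (b[0] & 0xFF) == 0xEF
                    && (b[1] & 0xFF) == 0xBB && (b[2] & 0xFF) == 0xBF)
                return "UTF-8";
            if (b.length >= 2 && (b[0] & 0xFF) == 0xFE && (b[1] & 0xFF) == 0xFF)
                return "UTF-16BE";
            if (b.length >= 2 && (b[0] & 0xFF) == 0xFF && (b[1] & 0xFF) == 0xFE)
                return "UTF-16LE";
            return null; // no BOM: width cannot be determined this way
        }

        // Convert to UTF-8 once the source encoding is known
        // (UTF-32 charset support is not guaranteed on older JVMs).
        static byte[] toUtf8(byte[] data, String sourceEncoding) {
            String s = new String(data, Charset.forName(sourceEncoding));
            if (s.length() > 0 && s.charAt(0) == '\uFEFF')
                s = s.substring(1); // drop the decoded BOM character
            return s.getBytes(Charset.forName("UTF-8"));
        }
    }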

If the above is true, we still presumably need to mark the data as being 
Unicode version 2, 3 or 4, etc.? This would only be necessary if the 
earlier versions were not pure subsets of the later ones. Can you clarify 
this, Hoylen?

>P.S. As an aside, Java 1.1 and later uses the Unicode 2.0
>character set.  So if Unicode 3.0 or Unicode 4.0 is the
>target character set, implementations may be forced to
>implement their own string class rather than using the
>native java.lang.String.  Something to consider when picking
>which Unicode version to use as the standard character set.
>  
>
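To make the point above concrete: java.lang.String is a sequence of
16-bit chars, and characters outside the Basic Multilingual Plane
(assigned from Unicode 3.1 onwards) can only be held as surrogate pairs,
with length() counting 16-bit units rather than characters. A small
illustration:

    class SupplementaryDemo {
        public static void main(String[] args) {
            // U+10400 (DESERET CAPITAL LETTER LONG I), a Unicode 3.1
            // supplementary character, must be stored as a surrogate pair
            String s = "\uD801\uDC00";
            System.out.println(s.length()); // prints 2: two 16-bit units
            // a string API built on Unicode 2.0 sees two chars here,
            // not one character, hence the problem Hoylen describes
        }
    }
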
I would suggest that the requirements with respect to string 
representation are:

    * systems get to store their data in whatever form is most convenient 
locally (e.g. whatever the DBMS wants to use)
    * it must be possible for a 3rd-party application which is openEHR 
compliant to read the data in a system, even if it is not in its own 
"preferred" form
       (vertical interoperability)
    * it must be possible for the data to be exported in a way that it 
can be universally read, or transformed into a readable form for use in 
another system
       (horizontal interoperability)
    * the specifications commit implementors to as little as possible, 
while allowing the above requirements to be met.

Based on Hoylen's more recent info, a draft modelling solution seems to be:

1. openEHR states that all strings are in Unicode in its abstract 
specifications.

Question: if there are no "simple strings" at all in the data and 
everything is a Unicode string, is this safe?

2. that the following abstract model is used (improved from the version 
of a few weeks ago; a rough sketch in code follows the list):
    ENTRY class has
        - a mandatory language attribute
        - a mandatory character encoding attribute (says which VERSION 
and which flavour of Unicode).
            This forces the whole ENTRY to be encoded the same way no 
matter what,
            but also allows distinct ENTRYs to be encoded in e.g. 
Unicode 3.0/UTF-8 and Unicode 4.0/UTF-16.

    DV_TEXT class has
        - an optional language attribute, which is understood to 
override the one from its enclosing ENTRY.
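
The sketch mentioned above (illustrative names only, not the reference
model as published):

    class Entry {
        String language;          // mandatory, e.g. "en"
        String charsetEncoding;   // mandatory, e.g. "Unicode 4.0/UTF-16"
    }

    class DvText {
        Entry enclosingEntry;
        String value;
        String language;          // optional; overrides the ENTRY's

        // the effective language: the local one if set, else the ENTRY's
        String effectiveLanguage() {
            return language != null ? language : enclosingEntry.language;
        }
    }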

3. Implementation specifications like XML schemas and software APIs are 
required to make the character encoding and Unicode version attributes 
visible, so that clients can process / convert the data properly.
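
For example, an API might expose something like the following
(hypothetical names, not an actual openEHR interface):

    interface EncodingAware {
        String charsetName();     // e.g. "UTF-8" or "UTF-16"
        String unicodeVersion();  // e.g. "3.0" or "4.0"
    }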


Further comments to clean this up will be much appreciated.

- thomas



-- 
___________________________________________________________________________________
CTO Ocean Informatics (http://www.OceanInformatics.biz)
Hon. Research Fellow, University College London

openEHR (http://www.openEHR.org)
Archetypes (http://www.oceaninformatics.biz/adl.html)
Community Informatics (http://www.deepthought.com.au/ci/rii/Output/mainTOC.html)

