character sets and languages in openEHR

Thomas Beale Thu, 18 Mar 2004 08:46:09 +1000

Tim Churches wrote:

>
>Yes, I thought of examples which were similar to these. And it is not
>just a matter of the recording health professional not knowing what
>"Engelse ziekte" means, and thus having to record it verbatim and
>untranslated - many diagnoses have no equivalent in other
>languages/cultures, and are thus untranslatable (at least not without
>some information loss).
>
Actually, these kinds of expressions are not the problem - they can 
happily be recorded inside a DV_TEXT object whose language is set 
to English or Dutch or whatever it may be. An inline occurrence of a 
'foreign' term that is routinely used by speakers of a different 
language (the way we use 'gesundheit' or 'triage' in English) can be 
assumed to be understood, and is probably even in the dictionary of the 
language of narration.


The problem arises when recorded text fragments contain words that are 
valid in more than one language but do not usually have the same 
meaning in each. Words in Danish and Norwegian should be almost the 
same, but I assume there are by now some small differences; and most 
European languages certainly have words which also occur in another 
language with a completely unrelated meaning. So in theory a language 
marker is needed to ensure that a later reader knows what language the 
words were in (maybe even to allow them to know what kind of translator 
to call). So the question remains - do we need the ability to have 
multiple languages inside a single entry? For Gerard's examples - would 
it really be necessary to indicate what the other languages were, given 
that they are probably obvious to most users who will use them?

The real reason for the question is that having to record the language 
everywhere all the time means wasting a certain amount of data storage 
on every text fragment stored in the record. The alternative seems to be 
to record it on the Entry. If we decide that it has to be possible to 
have text fragments within an Entry for which a different language is 
actually recorded, we can use an optional language attribute on DV_TEXT 
which is understood as overriding the value on the Entry. In general I 
am against this kind of overriding of values in lower objects in a 
composition - it is not OO, and it is often misunderstood by programmers 
working from the specifications; in general it is dangerous. However, 
maybe this is an exception which justifies its use...
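
To make the override idea concrete, here is a minimal sketch in Python. 
It is only an illustration under stated assumptions: the classes DvText 
and Entry below are simplified stand-ins invented for this example, not 
the openEHR reference model classes, and effective_language is a 
hypothetical helper name. The language is stored once on the Entry, and 
a text fragment carries its own language only when it differs.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class DvText:
    """Simplified stand-in for a text fragment (not the real DV_TEXT)."""
    value: str
    # Hypothetical optional override; None means "use the Entry's language".
    language: Optional[str] = None


@dataclass
class Entry:
    """Simplified stand-in for an Entry that records the default language."""
    language: str
    items: List[DvText] = field(default_factory=list)

    def effective_language(self, item: DvText) -> str:
        # The fragment's own language wins if present; otherwise fall back
        # to the language recorded once on the Entry.
        return item.language if item.language is not None else self.language


entry = Entry(
    language="en",
    items=[
        DvText("Patient reports a childhood history of rickets."),
        DvText("Engelse ziekte", language="nl"),  # explicitly marked as Dutch
    ],
)

languages = [entry.effective_language(item) for item in entry.items]
print(languages)
```

The danger mentioned above shows up directly in this sketch: any code 
that reads item.language without going through effective_language will 
see None for most fragments and has to know about the fallback rule.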

As for Unicode, obviously we cannot do much about the standard; but I 
guess someone had to have the 8-bit part of the code space.

> Given that the "foreign" language text may
>require accented characters, or even a completely different character
>set, then the Unicode encoding used for the entry will need to be
>captured as well as the language, unless openEHR will be restricted
>purely to one Unicode encoding, such as UTF-8. Remember the golden rule
>with Unicode: "If you don't know the encoding, you don't know nuffin'."
>
>The only problem with "UTF-8 everywhere" is that it is Roman alphabet
>chauvinistic, in that the basic Roman characters are all represented
>with one byte, but everything else needs two bytes. That dooms all
>Russian openEHR records to using twice as much storage as the equivalent
>English openEHR records. In these days of massive cheap disc storage and
>high speed networks, that fact probably doesn't matter, but it just
>seems unfair, although I can't think of a better alternative. As an
>English speaker, I would not be keen if openEHR mandated the use of
>UTF-16, thus forcing me to use two bytes for every letter. Yet that's
>what UTF-8 forces Russians, and Greeks, and Thais and Vietnamese and
>just about every other non-Roman alphabetic language speaker to do. Of
>course, ideographic languages like Chinese are doomed to use more than
>one byte per character, but then the language itself encodes a lot more
>information in each character, so it probably works out about the same
>in the end.
>
>  
>
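
The storage cost per script can be checked directly. The short Python 
sketch below (the sample words are arbitrary choices for illustration) 
counts bytes per code point under UTF-8 and UTF-16. In UTF-8, basic 
Latin letters take one byte and Cyrillic letters take two, while Thai 
and most Chinese characters take three; in UTF-16 all of these take two.

```python
# Arbitrary sample words, one per script, chosen only for illustration.
samples = {
    "English": "health",
    "Russian": "здоровье",
    "Thai": "สุขภาพ",
    "Chinese": "健康",
}


def bytes_per_code_point(text: str, encoding: str) -> float:
    """Average number of encoded bytes per Unicode code point."""
    return len(text.encode(encoding)) / len(text)


# "utf-16-le" is used so that no byte-order mark is added to the count.
ratios = {
    name: (
        bytes_per_code_point(text, "utf-8"),
        bytes_per_code_point(text, "utf-16-le"),
    )
    for name, text in samples.items()
}

for name, (utf8, utf16) in ratios.items():
    print(f"{name:8s} UTF-8: {utf8:.1f}  UTF-16: {utf16:.1f} bytes per code point")
```

These figures cover only the text itself; any markup or structural 
data stored around it is typically ASCII and costs one byte per 
character in UTF-8.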


-- 
___________________________________________________________________________________
CTO Ocean Informatics (http://www.OceanInformatics.biz)
Hon. Research Fellow, University College London

openEHR (http://www.openEHR.org)
Archetypes (http://www.oceaninformatics.biz/adl.html)
Community Informatics (http://www.deepthought.com.au/ci/rii/Output/mainTOC.html)


