character sets and languages in openEHR

Tim Churches 18 Mar 2004 06:50:04 +1100

On Thu, 2004-03-18 at 00:51, gfrer wrote:
> Hi,
> 
> 
> Anamnesis in psychiatry:
> 
> And then the disturbed patient said: "Merdre". [Translation:
> shit]
> 
> Family history:
> 
> My father was diagnosed as suffering from: "Engelse ziekte"
> [Translation: Rickets disease]
> 
> Coding systems:
> 
> ICPC-1 Dutch version.
> 
> Code: R05.
> 
> Displayed text: Hoest
> 
> Added translation: Cough


Yes, I thought of examples similar to these. And it is not
just a matter of the recording health professional not knowing what
"Engelse ziekte" means, and thus having to record it verbatim and
untranslated - many diagnoses have no equivalent in other
languages/cultures, and thus cannot be translated without
some information loss. Given that the "foreign" language text may
require accented characters, or even a completely different character
set, then the Unicode encoding used for the entry will need to be
captured as well as the language, unless openEHR will be restricted
purely to one Unicode encoding, such as UTF-8. Remember the golden rule
with Unicode: "If you don't know the encoding, you don't know nuffin'."
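
To illustrate that rule, here is a minimal Python sketch. The sample
string and the helper name decode_as are made up for illustration and
have nothing to do with any openEHR specification. The same stored bytes
read correctly under one encoding and as garbage under another, which is
why the encoding has to be recorded (or fixed by the standard) along
with the text.

```python
def decode_as(raw: bytes, encoding: str) -> str:
    """Decode raw bytes under the given encoding (hypothetical helper)."""
    return raw.decode(encoding, errors="replace")


# Hypothetical free-text fragment containing accented characters.
original = "Señor Müller"

# What actually gets stored or transmitted is just a sequence of bytes.
raw = original.encode("utf-8")

# Decoding with the right encoding recovers the text...
as_utf8 = decode_as(raw, "utf-8")

# ...decoding with the wrong one silently produces mojibake.
as_latin1 = decode_as(raw, "latin-1")

print(as_utf8)    # Señor Müller
print(as_latin1)  # SeÃ±or MÃ¼ller
```

Note that the wrong decoding raises no error at all: Latin-1 assigns a
character to every byte value, so nothing flags the mistake.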

The only problem with "UTF-8 everywhere" is that it is Roman alphabet
chauvinistic, in that the basic (ASCII) Roman characters are all
represented with one byte, but everything else needs two or more bytes.
That dooms all Russian openEHR records to using roughly twice as much
storage for their text as the equivalent English openEHR records. In
these days of massive cheap disc storage and high speed networks, that
fact probably doesn't matter, but it just seems unfair, although I can't
think of a better alternative. As an English speaker, I would not be
keen if openEHR mandated the use of UTF-16, thus forcing me to use two
bytes for every letter. Yet that's what UTF-8 forces Russians and Greeks
to do, and it is worse for Thais, whose characters need three bytes
each, and for Vietnamese, many of whose accented letters need two or
three bytes. Of course, ideographic languages like Chinese are doomed to
use more than one byte per character (three in UTF-8 for most
characters, two in UTF-16), but then the language itself encodes a lot
more information in each character, so it probably works out about the
same in the end.
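
The storage argument can be checked with a short Python sketch. The
sample words are my own illustrative translations of "cough" (the ICPC
example above), and encoded_size is a hypothetical helper. "utf-16-le"
is used so that the two-byte byte order mark is not counted.

```python
def encoded_size(text: str, encoding: str) -> int:
    """Return the number of bytes text occupies in the given encoding."""
    return len(text.encode(encoding))


# Illustrative translations of "cough"; assumptions, not from any coding system.
samples = {
    "English": "cough",
    "Russian": "кашель",
    "Thai": "ไอ",
    "Chinese": "咳嗽",
}

# Collect (characters, UTF-8 bytes, UTF-16 bytes) per language.
sizes = {
    language: (
        len(word),
        encoded_size(word, "utf-8"),
        encoded_size(word, "utf-16-le"),
    )
    for language, word in samples.items()
}

for language, (chars, utf8, utf16) in sizes.items():
    print(f"{language:8} chars={chars} utf-8={utf8} utf-16={utf16}")
```

For these samples, English doubles in size under UTF-16, Russian is the
same size in both, and Thai and Chinese are actually smaller in UTF-16
than in UTF-8.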

-- 

Tim C

PGP/GnuPG Key 1024D/EAF993D0 available from keyservers everywhere
or at http://members.optushome.com.au/tchur/pubkey.asc
Key fingerprint = 8C22 BF76 33BA B3B5 1D5B  EB37 7891 46A9 EAF9 93D0

