Re: [HAPI-devel] UTF-8 support, how to define and test?

Rahul Somasunderam Tue, 14 Jan 2014 14:58:34 -0800

This is a write up I had distributed to my team to help them understand and 
convince partner systems to use UTF 8.
I've used groovy for all code samples.
HTH.


Content Encoding is what decides how Strings which are first class citizens in 
most modern programming languages, get converted into byte arrays. Byte arrays 
are what get sent over the wire, or get written to disk. In the reverse 
direction, they decide how a byte array must be converted to a String.

This program takes a string in 3 different languages and shows how each 
language is affected by different charsets.

import java.nio.charset.Charset
// println Charset.availableCharsets()*.key.join('\n')
void reportOnText(String text) {
  final encodings = [
      'ASCII', 'ISO-8859-1', 'windows-1251', 'UTF-8', 'UTF-16', 'UTF-32'
  ]
  println ''
  println text
  println text.replaceAll(/./,'=')
  encodings.each { enc ->
    def theBytes = text.getBytes(Charset.forName(enc))
    def reparse = new String(theBytes, enc)
    println "${enc.padRight(12)}: ${theBytes.encodeHex()} --> ${reparse}"
  }
}
reportOnText('Happy New Year!')
reportOnText('¡Feliz Año Nuevo!')
reportOnText('新年あけましておめでとうございます！')
reportOnText('KYPHON® Balloon Kyphoplasty')
Let's look at the output before we dig into the explanation

Happy New Year!
===============
ASCII       : 4861707079204e6577205965617221 --> Happy New Year!
ISO-8859-1  : 4861707079204e6577205965617221 --> Happy New Year!
windows-1251: 4861707079204e6577205965617221 --> Happy New Year!
UTF-8       : 4861707079204e6577205965617221 --> Happy New Year!
UTF-16      : feff004800610070007000790020004e00650077002000590065006100720021 
--> Happy New Year!
UTF-32      : 
0000004800000061000000700000007000000079000000200000004e0000006500000077000000200000005900000065000000610000007200000021
 --> Happy New Year!

¡Feliz Año Nuevo!
=================
ASCII       : 3f46656c697a20413f6f204e7565766f21 --> ?Feliz A?o Nuevo!
ISO-8859-1  : a146656c697a2041f16f204e7565766f21 --> ¡Feliz Año Nuevo!
windows-1251: 3f46656c697a20413f6f204e7565766f21 --> ?Feliz A?o Nuevo!
UTF-8       : c2a146656c697a2041c3b16f204e7565766f21 --> ¡Feliz Año Nuevo!
UTF-16      : 
feff00a100460065006c0069007a0020004100f1006f0020004e007500650076006f0021 --> 
¡Feliz Año Nuevo!
UTF-32      : 
000000a100000046000000650000006c000000690000007a0000002000000041000000f10000006f000000200000004e0000007500000065000000760000006f00000021
 --> ¡Feliz Año Nuevo!

新年あけましておめでとうございます！
==================
ASCII       : 3f3f3f3f3f3f3f3f3f3f3f3f3f3f3f3f3f3f --> ??????????????????
ISO-8859-1  : 3f3f3f3f3f3f3f3f3f3f3f3f3f3f3f3f3f3f --> ??????????????????
windows-1251: 3f3f3f3f3f3f3f3f3f3f3f3f3f3f3f3f3f3f --> ??????????????????
UTF-8       : 
e696b0e5b9b4e38182e38191e381bee38197e381a6e3818ae38281e381a7e381a8e38186e38194e38196e38184e381bee38199efbc81
 --> 新年あけましておめでとうございます！
UTF-16      : 
feff65b05e7430423051307e30573066304a3081306730683046305430563044307e3059ff01 
--> 新年あけましておめでとうございます！
UTF-32      : 
000065b000005e7400003042000030510000307e00003057000030660000304a000030810000306700003068000030460000305400003056000030440000307e000030590000ff01
 --> 新年あけましておめでとうございます！

KYPHON® Balloon Kyphoplasty
===========================
ASCII       : 4b5950484f4e3f2042616c6c6f6f6e204b7970686f706c61737479 --> 
KYPHON? Balloon Kyphoplasty
ISO-8859-1  : 4b5950484f4eae2042616c6c6f6f6e204b7970686f706c61737479 --> 
KYPHON® Balloon Kyphoplasty
windows-1251: 4b5950484f4eae2042616c6c6f6f6e204b7970686f706c61737479 --> 
KYPHON® Balloon Kyphoplasty
UTF-8       : 4b5950484f4ec2ae2042616c6c6f6f6e204b7970686f706c61737479 --> 
KYPHON® Balloon Kyphoplasty
UTF-16      : 
feff004b005900500048004f004e00ae002000420061006c006c006f006f006e0020004b007900700068006f0070006c0061007300740079
 --> KYPHON® Balloon Kyphoplasty
UTF-32      : 
0000004b0000005900000050000000480000004f0000004e000000ae0000002000000042000000610000006c0000006c0000006f0000006f0000006e000000200000004b0000007900000070000000680000006f000000700000006c00000061000000730000007400000079
 --> KYPHON® Balloon Kyphoplasty
As you can see, all the encodings we use do a great job with plain English 
text. That's because all encodings have support for the characters in the 
English alphabet. As we start making the alphabet more and more complex, we 
start seeing the difference between the Universal Encodings and the regional 
Encodings.

ASCII and windows-1521 have very limited support for anything other than 
English.

ISO-8859-1 improves support for Spanish, but Japanese is still broken.

All the UTF encodings are great for all languages.

Among the UTF charsets, UTF-32 takes 32 bits per character. UTF-16 takes a 
minimum of 16 bits. UTF-8 takes a minimum of 8 bits, but adds more bits to 
expand the character set.

What if we change charsets after encoding?

This is precisely what happens when one system encodes a message in one charset 
and another tries to parse it using a different charset.

import java.nio.charset.Charset
void testWrongEncoding(String text) {
  def theBytes = text.getBytes(Charset.forName('ISO-8859-1'))
  def reparse = new String(theBytes, 'UTF-8')
  println "${text}: ${theBytes.encodeHex()} --> ${reparse}"
}
println "Wrong Encoding"
println "Wrong Encoding".replaceAll(/./,'=')
testWrongEncoding('Happy New Year!')
testWrongEncoding('¡Feliz Año Nuevo!')
testWrongEncoding('新年あけましておめでとうございます！')
testWrongEncoding('KYPHON® Balloon Kyphoplasty')
This is the result

Wrong Encoding
==============
Happy New Year!: 4861707079204e6577205965617221 --> Happy New Year!
¡Feliz Año Nuevo!: a146656c697a2041f16f204e7565766f21 --> �Feliz A�o Nuevo!
新年あけましておめでとうございます！: 3f3f3f3f3f3f3f3f3f3f3f3f3f3f3f3f3f3f --> ??????????????????
KYPHON® Balloon Kyphoplasty: 
4b5950484f4eae2042616c6c6f6f6e204b7970686f706c61737479 --> KYPHON� Balloon 
Kyphoplasty
As you can see, there will be problems in parsing the string into UTF-8. This 
is what happens when another system uses a different encoding and we parse it 
using UTF-8.

Why UTF-8

You might already have guessed why we use UTF. It's because UTF can support all 
major characters we are likely to run into. But why UTF-8 in specific?

Because UTF-8 is the most compressed for the typical inputs we receive.

Other Charsets supported by the Java Runtime

[Big5, Big5-HKSCS, EUC-JP, EUC-KR, GB18030, GB2312, GBK, IBM-Thai, IBM00858, 
IBM01140, IBM01141, IBM01142, IBM01143, IBM01144, IBM01145, IBM01146, IBM01147, 
IBM01148, IBM01149, IBM037, IBM1026, IBM1047, IBM273, IBM277, IBM278, IBM280, 
IBM284, IBM285, IBM297, IBM420, IBM424, IBM437, IBM500, IBM775, IBM850, IBM852, 
IBM855, IBM857, IBM860, IBM861, IBM862, IBM863, IBM864, IBM865, IBM866, IBM868, 
IBM869, IBM870, IBM871, IBM918, ISO-2022-CN, ISO-2022-JP, ISO-2022-JP-2, 
ISO-2022-KR, ISO-8859-1, ISO-8859-13, ISO-8859-15, ISO-8859-2, ISO-8859-3, 
ISO-8859-4, ISO-8859-5, ISO-8859-6, ISO-8859-7, ISO-8859-8, ISO-8859-9, 
JIS_X0201, JIS_X0212-1990, KOI8-R, KOI8-U, Shift_JIS, TIS-620, US-ASCII, 
UTF-16, UTF-16BE, UTF-16LE, UTF-32, UTF-32BE, UTF-32LE, UTF-8, windows-1250, 
windows-1251, windows-1252, windows-1253, windows-1254, windows-1255, 
windows-1256, windows-1257, windows-1258, windows-31j, x-Big5-HKSCS-2001, 
x-Big5-Solaris, x-COMPOUND_TEXT, x-euc-jp-linux, x-EUC-TW, x-eucJP-Open, 
x-IBM1006, x-IBM1025, x-IBM1046, x-IBM1097, x-IBM1098, x-IBM1112, x-IBM1122, 
x-IBM1123, x-IBM1124, x-IBM1364, x-IBM1381, x-IBM1383, x-IBM33722, x-IBM737, 
x-IBM833, x-IBM834, x-IBM856, x-IBM874, x-IBM875, x-IBM921, x-IBM922, x-IBM930, 
x-IBM933, x-IBM935, x-IBM937, x-IBM939, x-IBM942, x-IBM942C, x-IBM943, 
x-IBM943C, x-IBM948, x-IBM949, x-IBM949C, x-IBM950, x-IBM964, x-IBM970, 
x-ISCII91, x-ISO-2022-CN-CNS, x-ISO-2022-CN-GB, x-iso-8859-11, x-JIS0208, 
x-JISAutoDetect, x-Johab, x-MacArabic, x-MacCentralEurope, x-MacCroatian, 
x-MacCyrillic, x-MacDingbat, x-MacGreek, x-MacHebrew, x-MacIceland, x-MacRoman, 
x-MacRomania, x-MacSymbol, x-MacThai, x-MacTurkish, x-MacUkraine, x-MS932_0213, 
x-MS950-HKSCS, x-MS950-HKSCS-XP, x-mswin-936, x-PCK, x-SJIS_0213, 
x-UTF-16LE-BOM, X-UTF-32BE-BOM, X-UTF-32LE-BOM, x-windows-50220, 
x-windows-50221, x-windows-874, x-windows-949, x-windows-950, 
x-windows-iso2022jp]


On Jan 14, 2014, at 1:40 PM, Ian Vowles <[email protected]> wrote:

> I have encountered some issues with character sets, although not to the 
> extent Tom is dealing with.
>  
> Rahul's comment about the character sets at each end is something I have 
> encountered, and I will add that on the UNIX platforms I have dealt with, the 
> base character set is NOT configured to UTF-8, which adds another layer at 
> which there can be differing interpretations.
>  
> My experience has made me cautious, particularly where Microsoft based 
> systems send to a UNIX based system which transforms and sends on to another 
> Microsoft system. Since the base character set interpretation effectively 
> changes at each boundary, loss of 'special' characters is likely. Many a time 
> I have seen the ?, and seldom have I been able to prevent it's appearance. 
> Pasting reports from MS-Word is particularly fraught.
>  
> During my search I found several Wiki's and sites discussing the history of 
> character sets and the growth of UTF-xx. What a revelation that is!
>  
> Thanks for the example message, and the advice about MSH-18. I will be saving 
> that away for further use.
>  
> Ian
> 
> >>> Rahul Somasunderam <[email protected]> 15/01/14 5:53 >>>
> That is a sign of not having the same charset on both ends. 
> In the MSH segment, I recall there are some restrictions that require you to 
> stay within ASCII 7 bit. The other segments can be any charset you choose.
> 
> 
> On Jan 14, 2014, at 11:27 AM, Tom Wilson <[email protected]> wrote:
> 
>> After some research, I have answered my own question. If no character set is 
>> defined in MSH-18, then the default is single byte printable ASCII (decimal 
>> 32-127).
>>  
>> If anyone is interested, I am attaching a test ORU file which includes a 
>> full set of UTF-8 characters, above and beyond what is supported. You can 
>> trim it to use this in your tests, or test a more extensive UTF-8 support if 
>> you like.
>>  
>> -tom
>>  
>>  
>> From: Tom Wilson [mailto:[email protected]] 
>> Sent: Tuesday, January 14, 2014 10:27 AM
>> To: [email protected]
>> Subject: [HAPI-devel] UTF-8 support, how to define and test?
>> Hi.
>> I'm in the final testing phase of a HAPI-based application, and I want to 
>> define precisely what character encoding it can support. I know the HL7v2 
>> spec defines UTF-8 as the supported character set. However, it looks like it 
>> is only supporting a subset of UTF-8. I am testing by ingesting an HL7v2 
>> message in a unit test and serializing to XML. For example, it looks like 
>> Simplified Chinese, Vietnamese, Cyrillic, are not supported.
>> Sending this in an NTE segment:
>> 我能吞下玻璃而不伤身体
>> Produces this on the other end:
>> ???????????
>> So, exactly what UTF-8 characters can I expect to work? I also want to 
>> create a unit test with a full range of the supported characters.
>> It might be nice to support other languages, but I don't know if I can 
>> expect to receive them from EMR systems.
>> Thanks in advance,
>> -tom
>> <complete-utf8-set.oru>------------------------------------------------------------------------------
>> CenturyLink Cloud: The Leader in Enterprise Cloud Services.
>> Learn Why More Businesses Are Choosing CenturyLink Cloud For
>> Critical Workloads, Development Environments & Everything In Between.
>> Get a Quote or Start a Free Trial Today. 
>> http://pubads.g.doubleclick.net/gampad/clk?id=119420431&iu=/4140/ostg.clktrk_______________________________________________
>> Hl7api-devel mailing list
>> [email protected]
>> https://lists.sourceforge.net/lists/listinfo/hl7api-devel
> 
> ********************************************************************************
> 
> This email, including any attachments sent with it, is confidential and for 
> the sole use of the intended recipient(s). This confidentiality is not waived 
> or lost, if you receive it and you are not the intended recipient(s), or if 
> it is transmitted/received in error.
> 
> Any unauthorised use, alteration, disclosure, distribution or review of this 
> email is strictly prohibited. The information contained in this email, 
> including any attachment sent with it, may be subject to a statutory duty of 
> confidentiality if it relates to health service matters.
> 
> If you are not the intended recipient(s), or if you have received this email 
> in error, you are asked to immediately notify the sender by telephone collect 
> on Australia +61 1800 198 175 or by return email. You should also delete this 
> email, and any copies, from your computer system network and destroy any hard 
> copies produced.
> 
> If not an intended recipient of this email, you must not copy, distribute or 
> take any action(s) that relies on it; any form of disclosure, modification, 
> distribution and/or publication of this email is also prohibited.
> 
> Although Queensland Health takes all reasonable steps to ensure this email 
> does not contain malicious software, Queensland Health does not accept 
> responsibility for the consequences if any person's computer inadvertently 
> suffers any disruption to services, loss of information, harm or is infected 
> with a virus, other malicious computer programme or code that may occur as a 
> consequence of receiving this email.
> 
> Unless stated otherwise, this email represents only the views of the sender 
> and not the views of the Queensland Government.
> 
> **********************************************************************************
> 
> ------------------------------------------------------------------------------
> CenturyLink Cloud: The Leader in Enterprise Cloud Services.
> Learn Why More Businesses Are Choosing CenturyLink Cloud For
> Critical Workloads, Development Environments & Everything In Between.
> Get a Quote or Start a Free Trial Today. 
> http://pubads.g.doubleclick.net/gampad/clk?id=119420431&iu=/4140/ostg.clktrk_______________________________________________
> Hl7api-devel mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/hl7api-devel

------------------------------------------------------------------------------
CenturyLink Cloud: The Leader in Enterprise Cloud Services.
Learn Why More Businesses Are Choosing CenturyLink Cloud For
Critical Workloads, Development Environments & Everything In Between.
Get a Quote or Start a Free Trial Today. 
http://pubads.g.doubleclick.net/gampad/clk?id=119420431&iu=/4140/ostg.clktrk

_______________________________________________
Hl7api-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/hl7api-devel

Re: [HAPI-devel] UTF-8 support, how to define and test?

Reply via email to