Re: [HAPI-devel] UTF-8 support, how to define and test?

Ian Vowles Tue, 14 Jan 2014 18:31:54 -0800

Many thanks for this helpful information.
 
I have posted it to our internal developer wiki, with credit to
yourself, so that all may be enlightened by your good work :-)
 
Thanks again
Ian


>>> Rahul Somasunderam <r...@certifydatasystems.com> 15/01/14 8:56 >>>
This is a write-up I distributed to my team to help them understand
UTF-8 and convince partner systems to use it.
I've used Groovy for all code samples.
HTH.


Content encoding decides how Strings, which are first-class citizens
in most modern programming languages, get converted into byte arrays.
Byte arrays are what get sent over the wire or written to disk. In the
reverse direction, the encoding decides how a byte array must be
converted back into a String.
This program takes a string in 3 different languages and shows how each
language is affected by different charsets.

import java.nio.charset.Charset

// println Charset.availableCharsets()*.key.join('\n')
void reportOnText(String text) {
  final encodings = [
      'ASCII', 'ISO-8859-1', 'windows-1251', 'UTF-8', 'UTF-16', 'UTF-32'
  ]
  println ''
  println text
  println text.replaceAll(/./, '=')
  encodings.each { enc ->
    def theBytes = text.getBytes(Charset.forName(enc))
    def reparse = new String(theBytes, enc)
    println "${enc.padRight(12)}: ${theBytes.encodeHex()} --> ${reparse}"
  }
}

reportOnText('Happy New Year!')
reportOnText('¡Feliz Año Nuevo!')
reportOnText('新年あけましておめでとうございます！')
reportOnText('KYPHON® Balloon Kyphoplasty')
Let's look at the output before we dig into the explanation.

Happy New Year!
===============
ASCII       : 4861707079204e6577205965617221 --> Happy New Year!
ISO-8859-1  : 4861707079204e6577205965617221 --> Happy New Year!
windows-1251: 4861707079204e6577205965617221 --> Happy New Year!
UTF-8       : 4861707079204e6577205965617221 --> Happy New Year!
UTF-16      : feff004800610070007000790020004e00650077002000590065006100720021 --> Happy New Year!
UTF-32      : 0000004800000061000000700000007000000079000000200000004e0000006500000077000000200000005900000065000000610000007200000021 --> Happy New Year!

¡Feliz Año Nuevo!
=================
ASCII       : 3f46656c697a20413f6f204e7565766f21 --> ?Feliz A?o Nuevo!
ISO-8859-1  : a146656c697a2041f16f204e7565766f21 --> ¡Feliz Año Nuevo!
windows-1251: 3f46656c697a20413f6f204e7565766f21 --> ?Feliz A?o Nuevo!
UTF-8       : c2a146656c697a2041c3b16f204e7565766f21 --> ¡Feliz Año Nuevo!
UTF-16      : feff00a100460065006c0069007a0020004100f1006f0020004e007500650076006f0021 --> ¡Feliz Año Nuevo!
UTF-32      : 000000a100000046000000650000006c000000690000007a0000002000000041000000f10000006f000000200000004e0000007500000065000000760000006f00000021 --> ¡Feliz Año Nuevo!

新年あけましておめでとうございます！
==================
ASCII       : 3f3f3f3f3f3f3f3f3f3f3f3f3f3f3f3f3f3f --> ??????????????????
ISO-8859-1  : 3f3f3f3f3f3f3f3f3f3f3f3f3f3f3f3f3f3f --> ??????????????????
windows-1251: 3f3f3f3f3f3f3f3f3f3f3f3f3f3f3f3f3f3f --> ??????????????????
UTF-8       : e696b0e5b9b4e38182e38191e381bee38197e381a6e3818ae38281e381a7e381a8e38186e38194e38196e38184e381bee38199efbc81 --> 新年あけましておめでとうございます！
UTF-16      : feff65b05e7430423051307e30573066304a3081306730683046305430563044307e3059ff01 --> 新年あけましておめでとうございます！
UTF-32      : 000065b000005e7400003042000030510000307e00003057000030660000304a000030810000306700003068000030460000305400003056000030440000307e000030590000ff01 --> 新年あけましておめでとうございます！

KYPHON® Balloon Kyphoplasty
===========================
ASCII       : 4b5950484f4e3f2042616c6c6f6f6e204b7970686f706c61737479 --> KYPHON? Balloon Kyphoplasty
ISO-8859-1  : 4b5950484f4eae2042616c6c6f6f6e204b7970686f706c61737479 --> KYPHON® Balloon Kyphoplasty
windows-1251: 4b5950484f4eae2042616c6c6f6f6e204b7970686f706c61737479 --> KYPHON® Balloon Kyphoplasty
UTF-8       : 4b5950484f4ec2ae2042616c6c6f6f6e204b7970686f706c61737479 --> KYPHON® Balloon Kyphoplasty
UTF-16      : feff004b005900500048004f004e00ae002000420061006c006c006f006f006e0020004b007900700068006f0070006c0061007300740079 --> KYPHON® Balloon Kyphoplasty
UTF-32      : 0000004b0000005900000050000000480000004f0000004e000000ae0000002000000042000000610000006c0000006c0000006f0000006f0000006e000000200000004b0000007900000070000000680000006f000000700000006c00000061000000730000007400000079 --> KYPHON® Balloon Kyphoplasty

As you can see, all the encodings we tried do a great job with plain
English text. That's because all of them support the characters in the
English alphabet. As the alphabet gets more complex, we start seeing the
difference between the universal encodings and the regional encodings.
ASCII and windows-1251 have very limited support for anything other
than English.
ISO-8859-1 improves support for Spanish, but Japanese is still broken.
All the UTF encodings are great for all languages.
Among the UTF charsets, UTF-32 takes 32 bits per character. UTF-16
takes a minimum of 16 bits. UTF-8 takes a minimum of 8 bits, and uses
up to 32 bits for characters outside the ASCII range.

What if we change charsets after encoding?
This is precisely what happens when one system encodes a message in one
charset and another tries to parse it using a different charset.

import java.nio.charset.Charset

void testWrongEncoding(String text) {
  def theBytes = text.getBytes(Charset.forName('ISO-8859-1'))
  def reparse = new String(theBytes, 'UTF-8')
  println "${text}: ${theBytes.encodeHex()} --> ${reparse}"
}

println "Wrong Encoding"
println "Wrong Encoding".replaceAll(/./, '=')
testWrongEncoding('Happy New Year!')
testWrongEncoding('¡Feliz Año Nuevo!')
testWrongEncoding('新年あけましておめでとうございます！')
testWrongEncoding('KYPHON® Balloon Kyphoplasty')
This is the result:

Wrong Encoding
==============
Happy New Year!: 4861707079204e6577205965617221 --> Happy New Year!
¡Feliz Año Nuevo!: a146656c697a2041f16f204e7565766f21 --> �Feliz A�o Nuevo!
新年あけましておめでとうございます！: 3f3f3f3f3f3f3f3f3f3f3f3f3f3f3f3f3f3f --> ??????????????????
KYPHON® Balloon Kyphoplasty: 4b5950484f4eae2042616c6c6f6f6e204b7970686f706c61737479 --> KYPHON� Balloon Kyphoplasty
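
A side note on the result above: new String(bytes, 'UTF-8') quietly
substitutes a replacement character for bytes it cannot decode. If a
loud failure is preferable, a strict decoder is one option. The following
is only a minimal sketch; decodeStrictly is a hypothetical helper name,
and the hex string is the ISO-8859-1 output for the Spanish sample.

```groovy
import java.nio.ByteBuffer
import java.nio.charset.CharacterCodingException
import java.nio.charset.Charset
import java.nio.charset.CodingErrorAction

// Hypothetical helper: returns the decoded text, or null when the bytes
// are not valid in the named charset.
String decodeStrictly(byte[] bytes, String charsetName) {
  def decoder = Charset.forName(charsetName).newDecoder()
  // REPORT makes the decoder throw instead of substituting U+FFFD.
  decoder.onMalformedInput(CodingErrorAction.REPORT)
  decoder.onUnmappableCharacter(CodingErrorAction.REPORT)
  try {
    return decoder.decode(ByteBuffer.wrap(bytes)).toString()
  } catch (CharacterCodingException e) {
    return null
  }
}

// ISO-8859-1 bytes of the Spanish sample, taken from the output above.
def latin1Bytes = 'a146656c697a2041f16f204e7565766f21'.decodeHex()
println decodeStrictly(latin1Bytes, 'UTF-8')       // null: 0xa1 alone is not valid UTF-8
println decodeStrictly(latin1Bytes, 'ISO-8859-1')  // the original Spanish text
```

Note that this only catches bytes that are invalid in the target
charset. The Japanese sample was already reduced to '?' (0x3f) when it
was encoded as ISO-8859-1, and 0x3f is valid UTF-8, so that loss cannot
be detected on the receiving side.
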

As you can see, the non-English strings do not survive being parsed as
UTF-8. This is what happens when another system uses a different
encoding and we parse its output as UTF-8.

Why UTF-8
You might already have guessed why we use UTF. It's because UTF can
support all major characters we are likely to run into. But why UTF-8
specifically?
Because UTF-8 is the most compact for the typical inputs we receive.
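
To put rough numbers on that, the byte counts can be compared directly.
This is only a sketch; utfSizes is a hypothetical helper name, and the
BE variants are used so that no byte order mark is counted.

```groovy
// Hypothetical helper: number of bytes each UTF encoding needs for the text.
Map<String, Integer> utfSizes(String text) {
  ['UTF-8', 'UTF-16BE', 'UTF-32BE'].collectEntries { enc ->
    [(enc): text.getBytes(enc).length]
  }
}

// 15 characters: 15 bytes in UTF-8, 30 in UTF-16BE, 60 in UTF-32BE
println utfSizes('Happy New Year!')
```

The advantage depends on the input. The Japanese sample above takes 54
bytes in UTF-8 but only 36 in UTF-16 without the byte order mark, so
"most compact" holds for mostly-ASCII text such as typical HL7 messages,
not for every language.
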

Other Charsets supported by the Java Runtime

[Big5, Big5-HKSCS, EUC-JP,
EUC-KR, GB18030, GB2312, GBK, IBM-Thai, IBM00858, IBM01140, IBM01141,
IBM01142, IBM01143, IBM01144, IBM01145, IBM01146, IBM01147, IBM01148,
IBM01149, IBM037, IBM1026, IBM1047, IBM273, IBM277, IBM278, IBM280,
IBM284, IBM285, IBM297, IBM420, IBM424, IBM437, IBM500, IBM775, IBM850,
IBM852, IBM855, IBM857, IBM860, IBM861, IBM862, IBM863, IBM864, IBM865,
IBM866, IBM868, IBM869, IBM870, IBM871, IBM918, ISO-2022-CN,
ISO-2022-JP, ISO-2022-JP-2, ISO-2022-KR, ISO-8859-1, ISO-8859-13,
ISO-8859-15, ISO-8859-2, ISO-8859-3, ISO-8859-4, ISO-8859-5, ISO-8859-6,
ISO-8859-7, ISO-8859-8, ISO-8859-9, JIS_X0201, JIS_X0212-1990, KOI8-R,
KOI8-U, Shift_JIS, TIS-620, US-ASCII, UTF-16, UTF-16BE, UTF-16LE,
UTF-32, UTF-32BE, UTF-32LE, UTF-8, windows-1250, windows-1251,
windows-1252, windows-1253, windows-1254, windows-1255, windows-1256,
windows-1257, windows-1258, windows-31j, x-Big5-HKSCS-2001,
x-Big5-Solaris, x-COMPOUND_TEXT, x-euc-jp-linux, x-EUC-TW, x-eucJP-Open,
x-IBM1006, x-IBM1025, x-IBM1046, x-IBM1097, x-IBM1098, x-IBM1112,
x-IBM1122, x-IBM1123, x-IBM1124, x-IBM1364, x-IBM1381, x-IBM1383,
x-IBM33722, x-IBM737, x-IBM833, x-IBM834, x-IBM856, x-IBM874, x-IBM875,
x-IBM921, x-IBM922, x-IBM930, x-IBM933, x-IBM935, x-IBM937, x-IBM939,
x-IBM942, x-IBM942C, x-IBM943, x-IBM943C, x-IBM948, x-IBM949, x-IBM949C,
x-IBM950, x-IBM964, x-IBM970, x-ISCII91, x-ISO-2022-CN-CNS,
x-ISO-2022-CN-GB, x-iso-8859-11, x-JIS0208, x-JISAutoDetect, x-Johab,
x-MacArabic, x-MacCentralEurope, x-MacCroatian, x-MacCyrillic,
x-MacDingbat, x-MacGreek, x-MacHebrew, x-MacIceland, x-MacRoman,
x-MacRomania, x-MacSymbol, x-MacThai, x-MacTurkish, x-MacUkraine,
x-MS932_0213, x-MS950-HKSCS, x-MS950-HKSCS-XP, x-mswin-936, x-PCK,
x-SJIS_0213, x-UTF-16LE-BOM, X-UTF-32BE-BOM, X-UTF-32LE-BOM,
x-windows-50220, x-windows-50221, x-windows-874, x-windows-949,
x-windows-950, x-windows-iso2022jp]


On Jan 14, 2014, at 1:40 PM, Ian Vowles <ian_vow...@health.qld.gov.au>
wrote:



I have encountered some issues with character sets, although not to the
extent Tom is dealing with.
 
Rahul's comment about the character sets at each end is something I
have encountered, and I will add that on the UNIX platforms I have dealt
with, the base character set is NOT configured to UTF-8, which adds
another layer at which there can be differing interpretations.
 
My experience has made me cautious, particularly where Microsoft based
systems send to a UNIX based system which transforms and sends on to
another Microsoft system. Since the base character set interpretation
effectively changes at each boundary, loss of 'special' characters is
likely. Many a time I have seen the ?, and seldom have I been able to
prevent its appearance. Pasting reports from MS-Word is particularly
fraught.
 
During my search I found several wikis and sites discussing the
history of character sets and the growth of UTF-xx. What a revelation
that is!
 
Thanks for the example message, and the advice about MSH-18. I will be
saving that away for further use.
 
Ian

>>> Rahul Somasunderam <r...@certifydatasystems.com> 15/01/14 5:53 >>>
That is a sign of not having the same charset on both ends.
In the MSH segment, I recall there are some restrictions that require
you to stay within 7-bit ASCII. The other segments can be any charset
you choose.
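
If that restriction does apply, a quick way to check whether a piece of
text stays within 7-bit ASCII is to ask the US-ASCII encoder. This is a
minimal sketch only; isSevenBitAscii is a hypothetical helper name and
the sample strings are made up for illustration.

```groovy
import java.nio.charset.Charset

// Hypothetical helper: true when every character fits in 7-bit US-ASCII.
boolean isSevenBitAscii(String text) {
  Charset.forName('US-ASCII').newEncoder().canEncode(text)
}

println isSevenBitAscii('Happy New Year!')  // true
println isSevenBitAscii('KYPHON\u00ae')     // false, because of the (R) sign
```
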


On Jan 14, 2014, at 11:27 AM, Tom Wilson <twil...@sujansky.com> wrote:



After some research, I have answered my own question. If no character
set is defined in MSH-18, then the default is single-byte printable
ASCII (decimal 32-126).
 
If anyone is interested, I am attaching a test ORU file which includes
a full set of UTF-8 characters, above and beyond what is supported. You
can trim it to use this in your tests, or test a more extensive UTF-8
support if you like.
 
-tom
 
 
From: Tom Wilson [mailto:twil...@sujansky.com] 
Sent: Tuesday, January 14, 2014 10:27 AM
To: hl7api-devel@lists.sourceforge.net
Subject: [HAPI-devel] UTF-8 support, how to define and test?
Hi.
I'm in the final testing phase of a HAPI-based application, and I want
to define precisely what character encoding it can support. I know the
HL7v2 spec defines UTF-8 as the supported character set. However, it
looks like it is only supporting a subset of UTF-8. I am testing by
ingesting an HL7v2 message in a unit test and serializing to XML. For
example, it looks like Simplified Chinese, Vietnamese, Cyrillic, are not
supported.
Sending this in an NTE segment:
我能吞下玻璃而不伤身体
Produces this on the other end:
???????????
So, exactly what UTF-8 characters can I expect to work? I also want to
create a unit test with a full range of the supported characters.
It might be nice to support other languages, but I don't know if I can
expect to receive them from EMR systems.
Thanks in advance,
-tom
<complete-utf8-set.oru>
------------------------------------------------------------------------------
CenturyLink Cloud: The Leader in Enterprise Cloud Services.
Learn Why More Businesses Are Choosing CenturyLink Cloud For
Critical Workloads, Development Environments & Everything In Between.
Get a Quote or Start a Free Trial Today.
http://pubads.g.doubleclick.net/gampad/clk?id=119420431&iu=/4140/ostg.clktrk
_______________________________________________
Hl7api-devel mailing list
Hl7api-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/hl7api-devel



