Anders Andersson created CAMEL-22712:
----------------------------------------
Summary: Add UNICODE -> UTF-8 to
MllpProtocolConstants.MSH18_VALUES and also values
Key: CAMEL-22712
URL: https://issues.apache.org/jira/browse/CAMEL-22712
Project: Camel
Issue Type: Bug
Components: camel-mllp
Affects Versions: 4.x
Reporter: Anders Andersson
I would like to add:
{code:java}
MSH18_VALUES.put("UNICODE", StandardCharsets.UTF_8);
MSH18_VALUES.put("UNICODE UTF-16", StandardCharsets.UTF_16);
MSH18_VALUES.put("UNICODE UTF-32", Charset.forName("UTF-32"));{code}
to class org.apache.camel.component.mllp.MllpProtocolConstants.
h4. Reasoning why UNICODE should be UTF-8
Since "UNICODE" is missing from the HashMap,
org.apache.camel.component.mllp.MllpCharsetHelper#getCharset(org.apache.camel.Exchange,
byte[], org.apache.camel.component.mllp.internal.Hl7Util,
java.nio.charset.Charset)
will not use any of the predefined values and resort to:
{code:java}
if (Charset.isSupported(msh18)) {
return Charset.forName(msh18);
}{code}
which will return UTF-16, because the JDK accepts "UNICODE" as an alias for
UTF-16. I think the vast majority of users, including me, expect UTF-8.
Result is:
{code:java}
MSH|^~\&|SE165561187179-B10|SE165561187179-100K|SE2321000131-S000000014080|SE2321000131-E000000000001|20251111144547.473+0100||ORM^O01|13DD9455820B4A28A7A15E4C286438FD|P|2.3.1||||||UNICODE{code}
Gets turned into:
{code:java}
䵓䡼幾尦籓䔱㘵㔶ㄱ㠷ㄷ㤭䈱ぼ卅ㄶ㔵㘱ㄸ...{code}
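The garbling can be reproduced outside Camel with a minimal sketch. The class name Msh18UnicodeDemo and the helper decode are hypothetical and only mimic the fallback path quoted above; this is not the actual camel-mllp code. It assumes a JDK that registers "UNICODE" as an alias of UTF-16, which matches the behaviour described in this issue.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class Msh18UnicodeDemo {

    // Hypothetical helper: decodes the payload with whatever charset the
    // JDK resolves for the raw MSH-18 value, as the fallback path does.
    static String decode(byte[] raw, String msh18) {
        return new String(raw, Charset.forName(msh18));
    }

    public static void main(String[] args) {
        // First eight bytes of the message: M S H | ^ ~ \ &
        byte[] raw = "MSH|^~\\&".getBytes(StandardCharsets.UTF_8);

        // The JDK resolves the alias "UNICODE" to UTF-16.
        System.out.println(Charset.forName("UNICODE").name());

        // Without a BOM, UTF-16 decoding defaults to big-endian, so each
        // pair of ASCII bytes is fused into one CJK code point.
        System.out.println(decode(raw, "UNICODE"));

        // Decoding the same bytes as UTF-8 gives the expected text.
        System.out.println(new String(raw, StandardCharsets.UTF_8));
    }
}
```

The second line printed is the same four characters that start the garbled output above, since 0x4D 0x53 ("MS") becomes U+4D53 and so on.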
The specification is very vague:
An incoming message can specify in MSH-18 which encoding should be used for the
rest of the message. Valid values are specified in
[https://terminology.hl7.org/CodeSystem-v2-0211.html] and for "UNICODE" it says:
{quote}Deprecated. Retained for backward compatibility only as v 2.5. Replaced
by specific Unicode encoding codes.
{quote}
and
{quote}The world wide character standard from ISO/IEC 10646-1-1993
{quote}
It also mentions for UNICODE UTF-8:
{quote}UTF-8 is a variable-length encoding, each code value is represented by
1,2 or 3 bytes, depending on the code value. 7 bit ASCII is a proper subset of
UTF-8. Note that the code contains a space before UTF but not before and after
the hyphen. Since UTF-8 represents the full UNICODE character set, the
following restriction apply to its use: 1. UTF-8 must be the default encoding
of the message, UTF-8 cannot be specified as an additional character set in
MSH-18 2. There are no other character sets allowed in a message where UTF-8 is
the default encoding in the message. In other words, UNICODE UTF-8 can only be
specified as a single value in MSH-18 3. A message encoded in UTF-8 must not
use a Byte Order Mark (BOM).
{quote}
Versions before v2.5 (e.g. 2.3.1, which my problematic message uses) had only
"UNICODE" and not "UNICODE UTF-8":
[https://hl7-definition.caristix.com/v2/HL7v2.3.1/Tables/0211]
Besides, UNICODE UTF-16 and UNICODE UTF-32 are being deprecated in 2.9 according
to the first link I referenced:
[https://terminology.hl7.org/CodeSystem-v2-0211.html]
I therefore argue that UNICODE should be hardcoded to mean UTF-8, to avoid
falling back to UTF-16.
h4. Reasoning behind UTF-16 and 32
For the sake of completeness I would also like to add:
{code:java}
MSH18_VALUES.put("UNICODE UTF-16", StandardCharsets.UTF_16);
MSH18_VALUES.put("UNICODE UTF-32", Charset.forName("UTF-32"));{code}
Arguments for "UNICODE UTF-16" as StandardCharsets.UTF_16:
The link above says UTF-16 is ISO/IEC 10646 UCS-2, which according to Wikipedia
[https://en.wikipedia.org/wiki/Unicode#Mapping_and_encodings] is an "obsolete
subset of UTF-16" and:
{quote}The UCS-2 and UTF-16 encodings specify the Unicode [byte order
mark|https://en.wikipedia.org/wiki/Byte_order_mark] (BOM) for use at the
beginnings of text files, which may be used for byte-order detection (or [byte
endianness|https://en.wikipedia.org/wiki/Endianness] detection).
{quote}
Compared with what Oracle writes in the Javadoc for the Java constant, this
seems like a very good fit:
{quote}public static final
[Charset|https://docs.oracle.com/en/java/javase/21/docs/api/java.base/java/nio/charset/Charset.html]
UTF_16
Sixteen-bit UCS Transformation Format, byte order identified by an optional
byte-order mark.{quote}
from
[https://docs.oracle.com/en/java/javase/21/docs/api/java.base/java/nio/charset/StandardCharsets.html]
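To illustrate the "optional byte-order mark" wording, here is a small sketch of how StandardCharsets.UTF_16 decodes the same character with and without a BOM. The class name Utf16BomDemo and the helper decodeUtf16 are hypothetical and used for illustration only.

```java
import java.nio.charset.StandardCharsets;

public class Utf16BomDemo {

    // Hypothetical helper: decode raw bytes with StandardCharsets.UTF_16.
    static String decodeUtf16(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_16);
    }

    public static void main(String[] args) {
        // "A" (U+0041) in three byte layouts.
        byte[] bigEndianWithBom = {(byte) 0xFE, (byte) 0xFF, 0x00, 0x41};
        byte[] littleEndianWithBom = {(byte) 0xFF, (byte) 0xFE, 0x41, 0x00};
        byte[] noBom = {0x00, 0x41}; // decoder falls back to big-endian

        // The BOM selects the byte order and is not part of the decoded text.
        System.out.println(decodeUtf16(bigEndianWithBom));
        System.out.println(decodeUtf16(littleEndianWithBom));
        System.out.println(decodeUtf16(noBom));
    }
}
```

All three inputs decode to the single character A, so a sender may use either byte order as long as it includes a BOM.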
If these are not added and a sender specifies UNICODE UTF-32 or UNICODE UTF-16,
the fallback would run Charset.forName("UNICODE UTF-32"), which would result in
java.nio.charset.IllegalCharsetNameException because charset names must not
contain spaces.
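The effect of the proposed entries can be sketched as follows. The class Msh18LookupSketch, the map PROPOSED and the method resolve are hypothetical stand-ins for MllpProtocolConstants.MSH18_VALUES and the lookup in MllpCharsetHelper; the real implementation may differ in details such as exception handling.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

public class Msh18LookupSketch {

    // Hypothetical stand-in for MSH18_VALUES with the proposed entries.
    static final Map<String, Charset> PROPOSED = new HashMap<>();
    static {
        PROPOSED.put("UNICODE", StandardCharsets.UTF_8);
        PROPOSED.put("UNICODE UTF-16", StandardCharsets.UTF_16);
        PROPOSED.put("UNICODE UTF-32", Charset.forName("UTF-32"));
    }

    // Assumed lookup order: predefined values first, then the JDK registry.
    static Charset resolve(Map<String, Charset> known, String msh18, Charset fallback) {
        Charset predefined = known.get(msh18);
        if (predefined != null) {
            return predefined;
        }
        // Throws IllegalCharsetNameException for names containing a space.
        if (Charset.isSupported(msh18)) {
            return Charset.forName(msh18);
        }
        return fallback;
    }

    public static void main(String[] args) {
        Charset fallback = StandardCharsets.ISO_8859_1;
        Map<String, Charset> withoutEntries = new HashMap<>();

        // With the proposed entries, "UNICODE" maps to UTF-8.
        System.out.println(resolve(PROPOSED, "UNICODE", fallback));
        // Without them, the JDK alias wins and UTF-16 is returned.
        System.out.println(resolve(withoutEntries, "UNICODE", fallback));
        // With the entry, the space in the name is never seen by the JDK.
        System.out.println(resolve(PROPOSED, "UNICODE UTF-32", fallback));
        try {
            resolve(withoutEntries, "UNICODE UTF-32", fallback);
        } catch (IllegalArgumentException e) {
            // IllegalCharsetNameException is a subclass of IllegalArgumentException.
            System.out.println(e.getClass().getName());
        }
    }
}
```

In this sketch the predefined map is consulted before the JDK registry, so an explicit entry both overrides the UTF-16 alias and prevents the illegal-name exception.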