Anders Andersson created CAMEL-22712:
----------------------------------------

             Summary: Add UNICODE -> UTF-8 to 
MllpProtocolConstants.MSH18_VALUES and also values 
                 Key: CAMEL-22712
                 URL: https://issues.apache.org/jira/browse/CAMEL-22712
             Project: Camel
          Issue Type: Bug
          Components: camel-mllp
    Affects Versions: 4.x
            Reporter: Anders Andersson


I would like to add:

 
{code:java}
MSH18_VALUES.put("UNICODE", StandardCharsets.UTF_8);

MSH18_VALUES.put("UNICODE UTF-16", StandardCharsets.UTF_16);
MSH18_VALUES.put("UNICODE UTF-32", Charset.forName("UTF-32"));{code}
to class org.apache.camel.component.mllp.MllpProtocolConstants.

 
h4. Reasoning why UNICODE should be UTF-8

Since this is missing from the HashMap, 
org.apache.camel.component.mllp.MllpCharsetHelper#getCharset(org.apache.camel.Exchange,
 byte[], org.apache.camel.component.mllp.internal.Hl7Util, 
java.nio.charset.Charset)

will not use any of the predefined values and resort to:



 
{code:java}
if (Charset.isSupported(msh18)) {
    return Charset.forName(msh18);
}{code}
 

which will return UTF-16. I think the vast majority of users, including me, 
expect UTF-8.
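
A minimal sketch to reproduce this outside Camel, assuming an OpenJDK-based JVM 
where "unicode" is registered as an alias of UTF-16 (other runtimes may 
differ). The class and method names are hypothetical; only the 
Charset.isSupported / Charset.forName calls mirror the fallback quoted above.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class UnicodeAliasDemo {

    // Resolves an MSH-18 value the same way the quoted fallback does:
    // ask the JVM whether it knows the name, then look it up.
    static Charset resolveViaJdk(String msh18) {
        if (Charset.isSupported(msh18)) {
            return Charset.forName(msh18);
        }
        return null; // not a charset name or alias known to this JVM
    }

    public static void main(String[] args) {
        Charset resolved = resolveViaJdk("UNICODE");
        System.out.println("UNICODE resolves to: " + resolved);
        System.out.println("Same as UTF-16: " + StandardCharsets.UTF_16.equals(resolved));
    }
}
```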

The result is that this message:

 
{code:java}
MSH|^~\&|SE165561187179-B10|SE165561187179-100K|SE2321000131-S000000014080|SE2321000131-E000000000001|20251111144547.473+0100||ORM^O01|13DD9455820B4A28A7A15E4C286438FD|P|2.3.1||||||UNICODE{code}

gets turned into:

 

 
{code:java}
䵓䡼幾尦籓䔱㘵㔶ㄱ㠷ㄷ㤭䈱ぼ卅ㄶ㔵㘱ㄸ...{code}
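
The garbled output can be reproduced with plain JDK calls. This is only a 
sketch: the class name is hypothetical, and it decodes just the first 14 bytes 
of the message above. Java's UTF-16 decoder assumes big-endian when there is no 
byte-order mark, so each pair of ASCII bytes is read as a single character.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class Msh18MojibakeDemo {

    // First 14 bytes of the MSH segment from the report; all are 7-bit ASCII.
    static final byte[] RAW = "MSH|^~\\&|SE165".getBytes(StandardCharsets.US_ASCII);

    static String decode(byte[] raw, Charset charset) {
        return new String(raw, charset);
    }

    public static void main(String[] args) {
        // Decoded as UTF-8 the text is unchanged, because ASCII is a subset of UTF-8.
        System.out.println(decode(RAW, StandardCharsets.UTF_8));
        // Decoded as UTF-16 every two bytes form one code unit,
        // e.g. 'M' (0x4D) followed by 'S' (0x53) becomes U+4D53.
        System.out.println(decode(RAW, StandardCharsets.UTF_16));
    }
}
```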
 

The specification is very vague:

An incoming message can specify in MSH-18 which encoding should be used for the 
rest of the message. Valid values are specified in 
[https://terminology.hl7.org/CodeSystem-v2-0211.html] and for "UNICODE" it says:
{quote}Deprecated. Retained for backward compatibility only as v 2.5. Replaced 
by specific Unicode encoding codes.
{quote}
and
{quote}The world wide character standard from ISO/IEC 10646-1-1993
{quote}
It also mentions for UNICODE UTF-8:
{quote}UTF-8 is a variable-length encoding, each code value is represented by 
1,2 or 3 bytes, depending on the code value. 7 bit ASCII is a proper subset of 
UTF-8. Note that the code contains a space before UTF but not before and after 
the hyphen. Since UTF-8 represents the full UNICODE character set, the 
following restriction apply to its use: 1. UTF-8 must be the default encoding 
of the message, UTF-8 cannot be specified as an additional character set in 
MSH-18 2. There are no other character sets allowed in a message where UTF-8 is 
the default encoding in the message. In other words, UNICODE UTF-8 can only be 
specified as a single value in MSH-18 3. A message encoded in UTF-8 must not 
use a Byte Order Mark (BOM).
{quote}
Versions prior to v2.5 (e.g. 2.3.1, in which my problematic message is written) 
had only "UNICODE" and not "UNICODE UTF-8": 
[https://hl7-definition.caristix.com/v2/HL7v2.3.1/Tables/0211]

Besides, UNICODE UTF-16 and UNICODE UTF-32 are being deprecated in 2.9 
according to the first link I referenced: 
[https://terminology.hl7.org/CodeSystem-v2-0211.html]  

I therefore argue that UNICODE should be hardcoded to mean UTF-8 to avoid UTF-16.
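
To illustrate the intended effect, here is a rough sketch of an explicit 
mapping that is consulted before the JVM lookup. It is not the actual camel-mllp 
code: the class, the resolve method and the map are hypothetical stand-ins, and 
the map holds only the one entry discussed in this section.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

class Msh18LookupSketch {

    // Hypothetical stand-in for MllpProtocolConstants.MSH18_VALUES.
    static final Map<String, Charset> MSH18_VALUES = new HashMap<>();

    static {
        MSH18_VALUES.put("UNICODE", StandardCharsets.UTF_8);
    }

    // Explicit mapping first, JVM charset names and aliases second, default last.
    static Charset resolve(String msh18, Charset defaultCharset) {
        Charset mapped = MSH18_VALUES.get(msh18);
        if (mapped != null) {
            return mapped;
        }
        try {
            if (Charset.isSupported(msh18)) {
                return Charset.forName(msh18);
            }
        } catch (IllegalArgumentException e) {
            // Null or syntactically illegal charset name: fall through to the default.
        }
        return defaultCharset;
    }

    public static void main(String[] args) {
        System.out.println(resolve("UNICODE", StandardCharsets.ISO_8859_1));
        System.out.println(resolve("UTF-16", StandardCharsets.ISO_8859_1));
    }
}
```

With the explicit entry in place, the JVM alias for "UNICODE" is never 
consulted, so the message is decoded as UTF-8.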
h4. Reasoning behind UTF-16 and 32

For the sake of completeness I would also like to add
{code:java}
MSH18_VALUES.put("UNICODE UTF-16", StandardCharsets.UTF_16);
MSH18_VALUES.put("UNICODE UTF-32", Charset.forName("UTF-32"));{code}
Arguments for "UNICODE UTF-16" as StandardCharsets.UTF_16

The link above says UTF-16 is ISO/IEC 10646 UCS-2, which according to Wikipedia 
[https://en.wikipedia.org/wiki/Unicode#Mapping_and_encodings] is an "obsolete 
subset of UTF-16" and:
{quote}The UCS-2 and UTF-16 encodings specify the Unicode [byte order 
mark|https://en.wikipedia.org/wiki/Byte_order_mark] (BOM) for use at the 
beginnings of text files, which may be used for byte-order detection (or [byte 
endianness|https://en.wikipedia.org/wiki/Endianness] detection).
{quote}
 

Comparing that text with what Oracle writes for Java it seems like a very good 
fit: 

 
{quote}public static final 
[Charset|https://docs.oracle.com/en/java/javase/21/docs/api/java.base/java/nio/charset/Charset.html]
 UTF_16
Sixteen-bit UCS Transformation Format, byte order identified by an optional 
byte-order mark.{quote}
 
from 
[https://docs.oracle.com/en/java/javase/21/docs/api/java.base/java/nio/charset/StandardCharsets.html]
 
If these are not added and someone sends UNICODE UTF-32 or UNICODE UTF-16, the 
fallback would try to resolve the charset by name, e.g. 
Charset.forName("UNICODE UTF-32"), which would result in 
java.nio.charset.IllegalCharsetNameException.
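
A small sketch of why these names fail: a space is not a legal character in a 
Java charset name, so both Charset.isSupported and Charset.forName reject the 
HL7 codes before any lookup happens. The class and helper names below are 
hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;

class IllegalNameDemo {

    // Returns false when the JDK rejects the name as syntactically illegal.
    static boolean isLegalCharsetName(String name) {
        try {
            Charset.isSupported(name);
            return true;
        } catch (IllegalCharsetNameException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        for (String name : new String[] {"UNICODE UTF-16", "UNICODE UTF-32", "UTF-32"}) {
            System.out.println(name + " is a legal charset name: " + isLegalCharsetName(name));
        }
    }
}
```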



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
