[ 
https://issues.apache.org/jira/browse/CAMEL-22712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18039361#comment-18039361
 ] 

Anders Andersson edited comment on CAMEL-22712 at 11/20/25 6:34 AM:
--------------------------------------------------------------------

Just stumbled upon:

org.apache.camel.component.hl7.HL7Charset

It seems to do roughly the same thing as 
org.apache.camel.component.mllp.MllpCharsetHelper but has more support for 
other charsets. At least it looks like we have two different implementations of 
HL7 encoding and decoding.

This is out of scope for this issue and probably needs a discussion on what to do, 
but wouldn't it be a good thing if camel-mllp used camel-hl7 for encoding and 
decoding?

Editing for clarification


was (Author: JIRAUSER309480):
Just stumbled upon:

org.apache.camel.component.hl7.HL7Charset

It seems to do the same thing as 
org.apache.camel.component.mllp.MllpCharsetHelper but have more support for 
other charsets.

It will be out of scope for this issue and probably a discussion on what to do, 
but it seems Camel-mllp maybe should be switching to use camel-hl7?

> Add UNICODE -> UTF-8 to MllpProtocolConstants.MSH18_VALUES and also values 
> ---------------------------------------------------------------------------
>
>                 Key: CAMEL-22712
>                 URL: https://issues.apache.org/jira/browse/CAMEL-22712
>             Project: Camel
>          Issue Type: Improvement
>          Components: camel-mllp
>    Affects Versions: 4.16.0
>            Reporter: Anders Andersson
>            Priority: Minor
>             Fix For: 4.x
>
>
> I would like to add:
>  
> {code:java}
> MSH18_VALUES.put("UNICODE", StandardCharsets.UTF_8);
> MSH18_VALUES.put("UNICODE UTF-16", StandardCharsets.UTF_16);
> MSH18_VALUES.put("UNICODE UTF-32", Charset.forName("UTF-32"));{code}
> to class org.apache.camel.component.mllp.MllpProtocolConstants.
>  
> h4. Reasoning why UNICODE should be UTF-8
> Since this is missing from the HashMap, 
> org.apache.camel.component.mllp.MllpCharsetHelper#getCharset(org.apache.camel.Exchange,
>  byte[], org.apache.camel.component.mllp.internal.Hl7Util, 
> java.nio.charset.Charset)
> will not use any of the predefined values and resort to:
>  
> {code:java}
> if (Charset.isSupported(msh18)) {
>     return Charset.forName(msh18);
> }{code}
>  
> which will return UTF-16. I think the vast majority of users, including me, 
> expect UTF-8.
> The result is:
>  
> {code:java}
> MSH|^~\&|SE165561187179-B10|SE165561187179-100K|SE2321000131-S000000014080|SE2321000131-E000000000001|20251111144547.473+0100||ORM^O01|13DD9455820B4A28A7A15E4C286438FD|P|2.3.1||||||UNICODE{code}
> Gets turned into:
>  
>  
> {code:java}
> 䵓䡼幾尦籓䔱㘵㔶ㄱ㠷ㄷ㤭䈱ぼ卅ㄶ㔵㘱ㄸ...{code}
>  
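
The garbling shown above can be reproduced with the plain JDK. This is a minimal sketch, not camel-mllp code: the class `UnicodeAliasDemo` and its `decode` helper are hypothetical names used only for illustration. It relies on the JDK treating "UNICODE" as an alias of UTF-16 and on UTF-16 decoding defaulting to big-endian when no byte order mark is present, so every pair of ASCII bytes collapses into one 16-bit code unit.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class UnicodeAliasDemo {

    // Hypothetical helper: decode raw bytes with whatever charset the JDK
    // resolves for the given name, as the fallback branch in the issue does.
    static String decode(byte[] raw, String charsetName) {
        return new String(raw, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        // First eight bytes of the MSH segment from the issue: MSH|^~\&
        byte[] raw = "MSH|^~\\&".getBytes(StandardCharsets.US_ASCII);

        // The JDK resolves the name "UNICODE" to its UTF-16 charset.
        System.out.println(Charset.forName("UNICODE").name());

        // 0x4D 0x53 ('M','S') becomes U+4D53, 0x48 0x7C ('H','|') becomes U+487C, ...
        String garbled = decode(raw, "UNICODE");
        System.out.println(garbled.equals("\u4D53\u487C\u5E7E\u5C26"));

        // Decoding the same bytes as UTF-8 keeps the segment readable.
        System.out.println(decode(raw, "UTF-8"));
    }
}
```

The four code points in the comparison are the first four characters of the garbled output quoted above.
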
> The specification is very vague:
> An incoming message can specify in MSH-18 which encoding should be used for the 
> rest of the message. Valid values are specified in 
> [https://terminology.hl7.org/CodeSystem-v2-0211.html] and for "UNICODE" it 
> says:
> {quote}Deprecated. Retained for backward compatibility only as v 2.5. 
> Replaced by specific Unicode encoding codes.
> {quote}
> and
> {quote}The world wide character standard from ISO/IEC 10646-1-1993
> {quote}
> It also mentions for UNICODE UTF-8:
> {quote}UTF-8 is a variable-length encoding, each code value is represented by 
> 1,2 or 3 bytes, depending on the code value. 7 bit ASCII is a proper subset 
> of UTF-8. Note that the code contains a space before UTF but not before and 
> after the hyphen. Since UTF-8 represents the full UNICODE character set, the 
> following restriction apply to its use: 1. UTF-8 must be the default encoding 
> of the message, UTF-8 cannot be specified as an additional character set in 
> MSH-18 2. There are no other character sets allowed in a message where UTF-8 
> is the default encoding in the message. In other words, UNICODE UTF-8 can 
> only be specified as a single value in MSH-18 3. A message encoded in UTF-8 
> must not use a Byte Order Mark (BOM).
> {quote}
> Versions prior to v2.5 (e.g. 2.3.1, which my problematic message is written in) had 
> only "UNICODE" and not "UNICODE UTF-8": 
> [https://hl7-definition.caristix.com/v2/HL7v2.3.1/Tables/0211 
> |https://hl7-definition.caristix.com/v2/HL7v2.3.1/Tables/0211]
> Besides, UNICODE UTF-16 and UNICODE UTF-32 are being deprecated in 2.9 
> according to the first link I referenced: 
> [https://terminology.hl7.org/CodeSystem-v2-0211.html]  
> I therefore argue that UNICODE should be hardcoded to mean UTF-8 to avoid UTF-16.
> h4. Reasoning behind UTF-16 and 32
> For the sake of completeness I would also like to add
> {code:java}
> MSH18_VALUES.put("UNICODE UTF-16", StandardCharsets.UTF_16);
> MSH18_VALUES.put("UNICODE UTF-32", Charset.forName("UTF-32"));{code}
> Arguments for "UNICODE UTF-16" as StandardCharsets.UTF_16:
> The link above says UTF-16 is ISO/IEC 10646 UCS-2, which according to 
> Wikipedia [https://en.wikipedia.org/wiki/Unicode#Mapping_and_encodings] is an 
> "obsolete subset of UTF-16" and:
> {quote}The UCS-2 and UTF-16 encodings specify the Unicode [byte order 
> mark|https://en.wikipedia.org/wiki/Byte_order_mark] (BOM) for use at the 
> beginnings of text files, which may be used for byte-order detection (or 
> [byte endianness|https://en.wikipedia.org/wiki/Endianness] detection).
> {quote}
>  
> Comparing that text with what Oracle writes for Java it seems like a very 
> good fit: 
>  
> {quote}public static final 
> [Charset|https://docs.oracle.com/en/java/javase/21/docs/api/java.base/java/nio/charset/Charset.html]
>  UTF_16
> Sixteen-bit UCS Transformation Format, byte order identified by an optional 
> byte-order mark.{quote}
>  
> from 
> [https://docs.oracle.com/en/java/javase/21/docs/api/java.base/java/nio/charset/StandardCharsets.html]
>  
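
The "optional byte-order mark" behaviour quoted above can be checked with a small sketch. The class `Utf16BomDemo` and its `decodeUtf16` helper are hypothetical names for illustration only; the calls themselves are standard JDK. When decoding, `StandardCharsets.UTF_16` honours a leading BOM of either byte order and falls back to big-endian when there is none; when encoding, it writes a big-endian BOM.

```java
import java.nio.charset.StandardCharsets;

public class Utf16BomDemo {

    // Hypothetical helper: decode bytes with the JDK's UTF_16 charset.
    static String decodeUtf16(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_16);
    }

    public static void main(String[] args) {
        byte[] bigEndianBom    = {(byte) 0xFE, (byte) 0xFF, 0x00, 0x4D};
        byte[] littleEndianBom = {(byte) 0xFF, (byte) 0xFE, 0x4D, 0x00};
        byte[] noBom           = {0x00, 0x4D};

        // All three inputs decode to the single character "M";
        // the BOM is consumed and does not appear in the result.
        System.out.println(decodeUtf16(bigEndianBom));
        System.out.println(decodeUtf16(littleEndianBom));
        System.out.println(decodeUtf16(noBom));

        // Encoding prepends a big-endian BOM: FE FF 00 4D.
        byte[] encoded = "M".getBytes(StandardCharsets.UTF_16);
        System.out.println(encoded.length);
    }
}
```
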
> If these are not added and a message specifies UNICODE UTF-32 or UNICODE UTF-16, 
> the fallback code would pass the raw value (e.g. "UNICODE UTF-32") to the 
> Charset lookup, which would result in 
> java.nio.charset.IllegalCharsetNameException.
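
To make the proposal concrete, here is a hedged sketch of a table-first lookup. It is not the actual MllpCharsetHelper implementation: the class `Msh18LookupSketch`, its local `MSH18_VALUES` map (holding only the three proposed entries) and the `resolve` helper are hypothetical stand-ins. Catching the exception and returning a default is a choice made for this sketch, and it assumes the running JDK ships a UTF-32 charset, as the proposed code does.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

public class Msh18LookupSketch {

    // Hypothetical stand-in for MllpProtocolConstants.MSH18_VALUES,
    // containing only the entries proposed in the issue.
    static final Map<String, Charset> MSH18_VALUES = new HashMap<>();
    static {
        MSH18_VALUES.put("UNICODE", StandardCharsets.UTF_8);
        MSH18_VALUES.put("UNICODE UTF-16", StandardCharsets.UTF_16);
        MSH18_VALUES.put("UNICODE UTF-32", Charset.forName("UTF-32"));
    }

    // Hypothetical helper: consult the table first, then fall back to the
    // JDK lookup quoted in the issue, then to the supplied default.
    static Charset resolve(String msh18, Charset defaultCharset) {
        if (msh18 == null || msh18.isEmpty()) {
            return defaultCharset;
        }
        Charset mapped = MSH18_VALUES.get(msh18);
        if (mapped != null) {
            return mapped;
        }
        try {
            if (Charset.isSupported(msh18)) {
                return Charset.forName(msh18);
            }
        } catch (IllegalCharsetNameException e) {
            // Names containing spaces are illegal charset names for the JDK;
            // this sketch falls through to the default instead of failing.
        }
        return defaultCharset;
    }

    public static void main(String[] args) {
        System.out.println(resolve("UNICODE", StandardCharsets.ISO_8859_1).name());
        System.out.println(resolve("UNICODE UTF-16", StandardCharsets.ISO_8859_1).name());
        System.out.println(resolve("UNICODE UTF-32", StandardCharsets.ISO_8859_1).name());
    }
}
```

With the table entries in place, "UNICODE" resolves to UTF-8 before the JDK alias can turn it into UTF-16, and the two names containing a space never reach the JDK lookup.
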



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
