[
https://issues.apache.org/jira/browse/AXIS-2908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14592545#comment-14592545
]
Steve Onorato edited comment on AXIS-2908 at 6/18/15 9:32 PM:
--------------------------------------------------------------
I also had this problem. For example, 𩶘 (U+29D98) should be serialized to XML
either as the UTF-8 byte sequence 0xF0 0xA9 0xB6 0x98 or as the single Numeric
Character Reference &#x29D98;.
Unfortunately, the two UTF-16 surrogates are instead converted one by one into
separate Numeric Character References (&#xD867;&#xDD98;), which is not valid
under either the XML 1.0 or the XML 1.1 specification. As a result, the XML
parser receiving the invalid XML throws an exception.
As a workaround, I applied the patch "AXIS_2342.diff" from
https://issues.apache.org/jira/browse/AXIS-2342 - it solves this problem since
it avoids the logic that causes the bad Numeric Character References to be
emitted.
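For illustration, here is a minimal sketch of code-point-aware escaping. It is
not the Axis encoder and not the contents of AXIS_2342.diff; the class name
NcrSketch and the method escape are hypothetical. It only shows the idea of
walking the String by code point, so that a surrogate pair becomes one Numeric
Character Reference. It assumes that a lone surrogate should be replaced with
U+FFFD, and it does not filter the control characters that XML 1.0 forbids.

```java
final class NcrSketch {

    /**
     * Escapes markup characters and writes every non-ASCII code point as a
     * single hexadecimal Numeric Character Reference.
     */
    static String escape(String text) {
        StringBuilder out = new StringBuilder(text.length() + 16);
        for (int i = 0; i < text.length(); ) {
            // codePointAt joins a valid surrogate pair into one code point.
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);

            // An unpaired surrogate comes back as itself. It is not a legal
            // XML character, so substitute U+FFFD (an assumption of this sketch).
            if (cp >= 0xD800 && cp <= 0xDFFF) {
                cp = 0xFFFD;
            }

            switch (cp) {
                case '&': out.append("&amp;"); break;
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                case '"': out.append("&quot;"); break;
                default:
                    if (cp < 0x80) {
                        out.append((char) cp);
                    } else {
                        out.append("&#x")
                           .append(Integer.toHexString(cp).toUpperCase())
                           .append(';');
                    }
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String fish = new String(Character.toChars(0x29D98));
        System.out.println(escape("a<" + fish)); // prints a&lt;&#x29D98;
    }
}
```

Iterating with charAt instead of codePointAt would visit 0xD867 and 0xDD98
separately, which is how the two invalid references described above can arise.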
was (Author: steveonorato):
I also had this problem. For example, U+29D98 (see
http://www.fileformat.info/info/unicode/char/29d98/index.htm) should either be
converted to the UTF-8 byte sequence 0xF0 0xA9 0xB6 0x98 or Numeric Character
Reference &#x29D98; when serialized to XML.
Unfortunately, the UTF-16 surrogates are getting directly converted to
&#xD867;&#xDD98; which is not valid according to both XML 1.0 and 1.1 specs.
As a result, the XML parser receiving the invalid XML throws an exception.
As a workaround, I applied the patch "AXIS_2342.diff" from
https://issues.apache.org/jira/browse/AXIS-2342 - it solves this problem since
it avoids the logic that causes the bad Numeric Character References to be
emitted.
> Apache Axis fails to handle non Basic Multilingual Plane characters(U+10000
> and above) while creating SOAP request
> ------------------------------------------------------------------------------------------------------------------
>
> Key: AXIS-2908
> URL: https://issues.apache.org/jira/browse/AXIS-2908
> Project: Axis
> Issue Type: Bug
> Components: Serialization/Deserialization
> Affects Versions: 1.4
> Environment: OS - CentOS
> Software Platform - JDK 7
> Reporter: Siddhesh Sundar Toraskar
> Labels: charset, xml-rpc
>
> When a SOAP request is created with non-BMP characters (e.g. emoji) in its
> content, those characters are not inserted into the XML correctly.
> My content arrives as UTF-8 and is held as a UTF-16 Java String (the default)
> once the program receives it.
> The Apache Axis library we are using then takes those UTF-16 Java Strings and
> tries to convert them back into UTF-8 to build the XML before sending. This
> fails whenever I send a 4-byte UTF-8 character such as an emoji (:grin:). I
> found that any 4-byte UTF-8 character is represented as a surrogate pair in
> UTF-16. I suspect that in this case Axis does not recognize the surrogate
> pair and so cannot convert it into a valid UTF-8 encoding.
> As a result, although UTF-8 is specified, these emoji appear in UTF-16 form,
> which corrupts them because they are then processed incorrectly.
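The relationship described in the report can be checked with a small sketch.
The class name SurrogateDemo and the method describe are hypothetical, and
U+1F601 is assumed here as the code point behind the :grin: shortcode. The
sketch builds a String from one supplementary code point and lists its UTF-16
code units and its UTF-8 bytes.

```java
import java.nio.charset.StandardCharsets;

final class SurrogateDemo {

    /** Summarizes how one code point is stored in a Java String and in UTF-8. */
    static String describe(int codePoint) {
        String s = new String(Character.toChars(codePoint));
        StringBuilder sb = new StringBuilder();
        sb.append("chars=").append(s.length());
        sb.append(" codePoints=").append(s.codePointCount(0, s.length()));
        sb.append(" utf16=");
        for (char c : s.toCharArray()) {
            sb.append(String.format("%04X ", (int) c));
        }
        sb.append("utf8=");
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            sb.append(String.format("%02X ", b & 0xFF));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        // prints chars=2 codePoints=1 utf16=D83D DE01 utf8=F0 9F 98 81
        System.out.println(describe(0x1F601));
    }
}
```

One code point occupies two char values in the String, and encoding the String
as a whole yields the expected four UTF-8 bytes. The String itself is
therefore intact, and the damage would have to come from a later step that
handles each char separately.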