[
https://issues.apache.org/jira/browse/BEAM-11875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17318109#comment-17318109
]
Chamikara Madhusanka Jayalath edited comment on BEAM-11875 at 4/9/21, 4:13 PM:
-------------------------------------------------------------------------------
The limitation is documented, but I don't think we reject input that contains
multi-byte characters:
[https://github.com/apache/beam/blob/ed3df93e747ddc271db5186faf2e05af0b57de1d/sdks/java/io/xml/src/main/java/org/apache/beam/sdk/io/xml/XmlIO.java#L102]
I'm not sure how easy this is to detect. The complexity comes from the code where
we try to detect the record element when reading a byte stream:
[https://github.com/apache/beam/blob/ed3df93e747ddc271db5186faf2e05af0b57de1d/sdks/java/io/xml/src/main/java/org/apache/beam/sdk/io/xml/XmlSource.java#L213]
> XmlIO.Read does not handle XML encoding per spec
> ------------------------------------------------
>
> Key: BEAM-11875
> URL: https://issues.apache.org/jira/browse/BEAM-11875
> Project: Beam
> Issue Type: Bug
> Components: io-java-xml
> Affects Versions: 2.28.0
> Reporter: Elliotte Rusty Harold
> Priority: P1
>
> Not sure what the implementation problem is but based on the API doc, there's
> a real flaw in XmlIO.Read:
>
> By default, UTF-8 charset is used. To specify a different charset, use
> [{{XmlIO.Read.withCharset(java.nio.charset.Charset)}}|https://beam.apache.org/releases/javadoc/2.2.0/org/apache/beam/sdk/io/xml/XmlIO.Read.html#withCharset-java.nio.charset.Charset-].
> Currently, only XML files that use single-byte characters are supported.
> Using a file that contains multi-byte characters may result in data loss or
> duplication.
>
> Properly handled, there is never any need to specify the character encoding
> when reading an XML document. XML documents fully identify their character
> encoding. The developer at this level doesn't need to know and shouldn't
> think about the character encoding. Perhaps somewhere in the source code someone is
> using a Reader where they should be using an InputStream instead? That might
> lead to this problem.
> Also, the text contradicts itself. UTF-8 is a multibyte character set. I hope
> that doesn't lead to data loss or duplication by default.
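
To illustrate the point in the description that an XML document identifies its own encoding, here is a minimal sketch using the JDK's built-in JAXP parser. It is not Beam's XmlIO or XmlSource code; the class name XmlEncodingSketch and the helper rootText are hypothetical and exist only for this example. The parser is handed raw bytes through an InputStream and reads the encoding from the XML declaration, so the caller never names a charset.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

public class XmlEncodingSketch {

    // Hypothetical helper: parse raw bytes and return the root element's text.
    // No charset is passed in; the parser takes it from the XML declaration.
    static String rootText(byte[] xmlBytes) {
        try (InputStream in = new ByteArrayInputStream(xmlBytes)) {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(in);
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        // The same logical document, serialized in two different encodings.
        String latin1Doc =
                "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><name>caf\u00e9</name>";
        String utf8Doc =
                "<?xml version=\"1.0\" encoding=\"UTF-8\"?><name>caf\u00e9</name>";
        byte[] latin1 = latin1Doc.getBytes(StandardCharsets.ISO_8859_1);
        byte[] utf8 = utf8Doc.getBytes(StandardCharsets.UTF_8);

        // U+00E9 takes one byte in ISO-8859-1 but two bytes in UTF-8,
        // so UTF-8 is itself a multi-byte encoding.
        System.out.println("\u00e9".getBytes(StandardCharsets.ISO_8859_1).length);
        System.out.println("\u00e9".getBytes(StandardCharsets.UTF_8).length);

        // Both byte arrays decode to the same text without a caller-supplied charset.
        System.out.println(rootText(latin1).equals(rootText(utf8)));
    }
}
```

Wrapping the bytes in a Reader with a hard-coded charset would bypass the parser's own detection, which is the kind of mistake the description speculates about.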
--
This message was sent by Atlassian Jira
(v8.3.4#803005)