[GitHub] [beam] damccorm opened a new issue, #20818: XmlIO.Read does not handle XML encoding per spec

GitBox Sat, 04 Jun 2022 12:50:21 -0700


damccorm opened a new issue, #20818:
URL: https://github.com/apache/beam/issues/20818

Not sure what the implementation problem is but based on the API doc,
there's a real flaw in XmlIO.Read:

> By default, UTF-8 charset is used. To specify a different charset, use
> [`XmlIO.Read.withCharset(java.nio.charset.Charset)`](https://beam.apache.org/releases/javadoc/2.2.0/org/apache/beam/sdk/io/xml/XmlIO.Read.html#withCharset-java.nio.charset.Charset-).
>
> Currently, only XML files that use single-byte characters are supported.
> Using a file that contains multi-byte characters may result in data loss or
> duplication.

Properly handled, there is never any need to specify the character encoding
when reading an XML document. XML documents fully identify their character
encoding. The developer at this level doesn't need to know and shouldn't think
about the character encoding. Perhaps somewhere in the source code someone is
using a Reader where they should be using an InputStream instead? That might
lead to this problem.
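
To illustrate the Reader-versus-InputStream point, here is a minimal sketch using the JDK's built-in JAXP parser. It is not Beam's XmlIO code, and the class and method names (`XmlEncodingSketch`, `textViaInputStream`, `textViaReader`) are made up for the example. It parses the same ISO-8859-1 bytes twice: once as a byte stream, where the parser can read the XML declaration, and once through a Reader that has already decoded the bytes as UTF-8.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical demo class; not part of Apache Beam.
public class XmlEncodingSketch {

    // A small document that declares ISO-8859-1 and is really encoded that way.
    public static byte[] sampleBytes() {
        String xml = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><name>caf\u00e9</name>";
        return xml.getBytes(StandardCharsets.ISO_8859_1);
    }

    // Byte-stream input: the parser inspects the XML declaration and
    // chooses the decoder itself, so the caller never names a charset.
    public static String textViaInputStream(byte[] bytes) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        try (InputStream in = new ByteArrayInputStream(bytes)) {
            Document doc = builder.parse(in);
            return doc.getDocumentElement().getTextContent();
        }
    }

    // Character-stream input: the bytes have already been decoded with the
    // caller's charset (UTF-8 here), so the declared encoding cannot help.
    public static String textViaReader(byte[] bytes) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        try (Reader reader =
                new InputStreamReader(new ByteArrayInputStream(bytes), StandardCharsets.UTF_8)) {
            Document doc = builder.parse(new InputSource(reader));
            return doc.getDocumentElement().getTextContent();
        }
    }

    public static void main(String[] args) throws Exception {
        byte[] bytes = sampleBytes();
        System.out.println("InputStream: " + textViaInputStream(bytes));
        System.out.println("Reader:      " + textViaReader(bytes));
    }
}
```

In this sketch the InputStream path recovers the accented character, while the Reader path has already lost it before the parser sees any input. Whether XmlIO.Read does something similar internally is only a guess.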

Also, the text contradicts itself. UTF-8 is a multibyte character set. I
hope that doesn't lead to data loss or duplication by default.
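
As a quick check of the multibyte point, the following sketch (the class name `Utf8WidthSketch` is made up for illustration) counts how many bytes UTF-8 uses for a few characters:

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of Apache Beam.
public class Utf8WidthSketch {

    // Number of bytes the given string occupies when encoded as UTF-8.
    public static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        System.out.println(utf8Length("e"));       // ASCII letter: prints 1
        System.out.println(utf8Length("\u00e9"));  // e with acute accent: prints 2
        System.out.println(utf8Length("\u20ac"));  // euro sign: prints 3
    }
}
```

Only the ASCII range is single-byte in UTF-8, so a UTF-8 default combined with a "single-byte characters only" restriction would in effect mean ASCII-only input.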

Imported from Jira
[BEAM-11875](https://issues.apache.org/jira/browse/BEAM-11875). Original Jira
may contain additional context.
Reported by: elharo.

