Elliotte Rusty Harold created BEAM-11875:
--------------------------------------------

             Summary: XmlIO.Read does not handle XML encoding per spec
                 Key: BEAM-11875
                 URL: https://issues.apache.org/jira/browse/BEAM-11875
             Project: Beam
          Issue Type: Bug
          Components: io-java-xml
    Affects Versions: 2.28.0
            Reporter: Elliotte Rusty Harold


I'm not sure what the implementation problem is, but based on the API doc 
there's a real flaw in XmlIO.Read. The doc says:

 
By default, UTF-8 charset is used. To specify a different charset, use 
[{{XmlIO.Read.withCharset(java.nio.charset.Charset)}}|https://beam.apache.org/releases/javadoc/2.2.0/org/apache/beam/sdk/io/xml/XmlIO.Read.html#withCharset-java.nio.charset.Charset-].

Currently, only XML files that use single-byte characters are supported. Using 
a file that contains multi-byte characters may result in data loss or 
duplication.

 

Properly handled, there is never any need to specify the character encoding 
when reading an XML document. XML documents fully identify their character 
encoding. The developer at this level doesn't need to know and shouldn't think 
about the character encoding. Perhaps somewhere in the source code someone is 
using a Reader where they should be using an InputStream instead? That might 
lead to this problem.
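
A minimal sketch of what I mean, using the JDK's built-in DOM parser rather than anything from Beam. The class and helper names ({{XmlEncodingDemo}}, {{rootText}}, {{sample}}) are made up for illustration. The parser is handed raw bytes and works out the encoding itself from the byte order mark and the XML declaration, so the caller never names a charset when reading:

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

class XmlEncodingDemo {

    // Parses raw bytes and returns the text content of the root element.
    // No charset is passed in: the parser detects it from the bytes.
    static String rootText(byte[] xmlBytes) throws Exception {
        DocumentBuilder builder =
            DocumentBuilderFactory.newInstance().newDocumentBuilder();
        try (InputStream in = new ByteArrayInputStream(xmlBytes)) {
            Document doc = builder.parse(in);
            return doc.getDocumentElement().getTextContent();
        }
    }

    // Hypothetical helper: builds a tiny document that declares the given
    // encoding and encodes it with that same charset.
    static byte[] sample(Charset charset) {
        String xml = "<?xml version=\"1.0\" encoding=\"" + charset.name() + "\"?>"
            + "<word>caf\u00e9</word>";
        return xml.getBytes(charset);
    }

    public static void main(String[] args) throws Exception {
        Charset[] charsets = {
            StandardCharsets.UTF_8,
            StandardCharsets.ISO_8859_1,
            StandardCharsets.UTF_16
        };
        for (Charset cs : charsets) {
            // Same call for every encoding; the result should be identical.
            System.out.println(cs.name() + " -> " + rootText(sample(cs)));
        }
    }
}
```

If the same bytes were first wrapped in a Reader with the wrong charset, the parser would receive already-decoded characters and could no longer apply the detection rules, which is the kind of mistake I suspect here. I have not confirmed that this is what XmlIO does.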

Also, the text contradicts itself. UTF-8 is a multibyte character set. I hope 
that doesn't lead to data loss or duplication by default.
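
To make the contradiction concrete, here is a small sketch in plain JDK code (the class name {{Utf8Width}} is invented for this example) that counts how many bytes a single character takes in UTF-8. Only ASCII fits in one byte:

```java
import java.nio.charset.StandardCharsets;

class Utf8Width {

    // Number of bytes the given text occupies when encoded as UTF-8.
    static int utf8Length(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        System.out.println(utf8Length("e"));       // ASCII letter: prints 1
        System.out.println(utf8Length("\u00e9"));  // e with acute accent: prints 2
        System.out.println(utf8Length("\u20ac"));  // euro sign: prints 3
    }
}
```

So a reader limited to single-byte characters handles the default UTF-8 charset correctly only for pure ASCII content.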

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
