[jira] [Comment Edited] (BEAM-11875) XmlIO.Read does not handle XML encoding per spec

Brian Hulette (Jira) Fri, 09 Apr 2021 09:55:15 -0700


    [ 
https://issues.apache.org/jira/browse/BEAM-11875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17318125#comment-17318125
 ]


Brian Hulette edited comment on BEAM-11875 at 4/9/21, 4:54 PM:
---------------------------------------------------------------

Per BEAM-10883: the action is to check whether the JDK XML implementation has 
improved (and, if so, perhaps remove the woodstox dependency?), or to update 
woodstox (BEAM-8720)


was (Author: bhulette):
Per BEAM-10883: Action is to check if JDK XML implementation has improved, or 
update woodstox (BEAM-8720)

> XmlIO.Read does not handle XML encoding per spec
> ------------------------------------------------
>
>                 Key: BEAM-11875
>                 URL: https://issues.apache.org/jira/browse/BEAM-11875
>             Project: Beam
>          Issue Type: Bug
>          Components: io-java-xml
>    Affects Versions: 2.28.0
>            Reporter: Elliotte Rusty Harold
>            Priority: P1
>
> Not sure what the implementation problem is but based on the API doc, there's 
> a real flaw in XmlIO.Read:
>  
> By default, UTF-8 charset is used. To specify a different charset, use 
> [{{XmlIO.Read.withCharset(java.nio.charset.Charset)}}|https://beam.apache.org/releases/javadoc/2.2.0/org/apache/beam/sdk/io/xml/XmlIO.Read.html#withCharset-java.nio.charset.Charset-].
> Currently, only XML files that use single-byte characters are supported. 
> Using a file that contains multi-byte characters may result in data loss or 
> duplication.
>  
> Properly handled, there is never any need to specify the character encoding 
> when reading an XML document. XML documents fully identify their character 
> encoding. The developer at this level doesn't need to know and shouldn't 
> think about the character encoding. Perhaps in the source code someone is 
> using a Reader where they should be using an InputStream instead? That might 
> lead to this problem.
> Also, the text contradicts itself. UTF-8 is a multibyte character set. I hope 
> that doesn't lead to data loss or duplication by default.
>  
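
To illustrate the reporter's point about Reader versus InputStream, here is a 
minimal sketch using the JDK's built-in StAX API. It is not Beam's XmlIO code; 
the class name XmlEncodingSketch and the helper readText are hypothetical, 
chosen only for this example. The parser is handed raw bytes, so it can work 
out the encoding from the XML declaration itself, and the caller never names a 
charset.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical illustration class; not part of Beam.
public class XmlEncodingSketch {

    /** Returns all character data in the document, as decoded by the parser. */
    static String readText(InputStream in) throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // Passing an InputStream (raw bytes) rather than a Reader leaves
        // encoding detection to the parser, which inspects the byte order
        // mark and the XML declaration.
        XMLStreamReader reader = factory.createXMLStreamReader(in);
        StringBuilder text = new StringBuilder();
        try {
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.CHARACTERS) {
                    text.append(reader.getText());
                }
            }
        } finally {
            reader.close();
        }
        return text.toString();
    }

    public static void main(String[] args) throws Exception {
        // The same logical document, serialized in two different encodings.
        String latin1Doc =
            "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><name>caf\u00e9</name>";
        String utf8Doc =
            "<?xml version=\"1.0\" encoding=\"UTF-8\"?><name>caf\u00e9</name>";

        String fromLatin1 = readText(new ByteArrayInputStream(
            latin1Doc.getBytes(StandardCharsets.ISO_8859_1)));
        String fromUtf8 = readText(new ByteArrayInputStream(
            utf8Doc.getBytes(StandardCharsets.UTF_8)));

        // Both inputs should decode to the same text without a charset hint.
        System.out.println(fromLatin1.equals(fromUtf8));
    }
}
```

If the bytes were instead wrapped in a Reader with a fixed charset, the parser 
would receive already-decoded characters and could no longer correct a 
mismatch between the chosen charset and the one the document declares.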



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
