[ 
https://issues.apache.org/jira/browse/BEAM-11875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17318215#comment-17318215
 ] 

Elliotte Rusty Harold commented on BEAM-11875:
----------------------------------------------

Yes, that's the problem. You can't use byte positions short of implementing 
an entire XML parser, and even then you really shouldn't. The existing code 
can break in many ways and produce bad results, depending on the input you 
feed it.
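
To make the byte-position problem concrete, here is a minimal sketch. It is not 
taken from the Beam source; the class and method names are hypothetical. It 
shows a cut at an arbitrary byte offset landing in the middle of a multi-byte 
UTF-8 character.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class, not part of Beam.
class ByteSplitDemo {

    /** Decodes only the first splitAt bytes of the UTF-8 encoding of text. */
    static String decodePrefix(String text, int splitAt) {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        byte[] prefix = Arrays.copyOfRange(bytes, 0, splitAt);
        // The String constructor replaces malformed input with U+FFFD.
        return new String(prefix, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "<a>" is 3 bytes; U+00E9 (e with acute accent) is 2 bytes in UTF-8.
        String xml = "<a>\u00e9</a>";
        // Offset 4 falls between the two bytes of U+00E9, so the character is lost.
        System.out.println(decodePrefix(xml, 4));
        // Offset 5 happens to fall on a character boundary, so decoding succeeds.
        System.out.println(decodePrefix(xml, 5));
    }
}
```

Whether a given offset is safe depends on the encoding and the content, which 
is why a splitter that works on raw byte offsets can't be made reliable without 
understanding the document.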

My next question is whether anyone is using this. If it's not in common use, we 
should just deprecate it and put some big warning signs in the API docs to 
scare people off.

> XmlIO.Read does not handle XML encoding per spec
> ------------------------------------------------
>
>                 Key: BEAM-11875
>                 URL: https://issues.apache.org/jira/browse/BEAM-11875
>             Project: Beam
>          Issue Type: Bug
>          Components: io-java-xml
>    Affects Versions: 2.28.0
>            Reporter: Elliotte Rusty Harold
>            Priority: P1
>
> Not sure what the implementation problem is but based on the API doc, there's 
> a real flaw in XmlIO.Read:
>  
> By default, UTF-8 charset is used. To specify a different charset, use 
> [{{XmlIO.Read.withCharset(java.nio.charset.Charset)}}|https://beam.apache.org/releases/javadoc/2.2.0/org/apache/beam/sdk/io/xml/XmlIO.Read.html#withCharset-java.nio.charset.Charset-].
> Currently, only XML files that use single-byte characters are supported. 
> Using a file that contains multi-byte characters may result in data loss or 
> duplication.
>  
> Properly handled, there is never any need to specify the character encoding 
> when reading an XML document. XML documents fully identify their character 
> encoding. The developer at this level doesn't need to know and shouldn't 
> think about the character encoding. Perhaps in the source code someone is 
> using a Reader where they should be using an InputStream instead? That might 
> lead to this problem.
> Also, the text contradicts itself. UTF-8 is a multibyte character set. I hope 
> that doesn't lead to data loss or duplication by default.
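
As an illustration of the point in the description that XML documents identify 
their own encoding, here is a rough sketch using the JDK's built-in DOM parser. 
It is not the XmlIO implementation, and the class and method names are 
hypothetical. The parser receives raw bytes through an InputStream and works 
out the encoding from the byte order mark or the XML declaration, so the caller 
never names a charset.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Hypothetical demo class, not part of Beam.
class EncodingDetectionDemo {

    /** Parses raw XML bytes and returns the text content of the root element. */
    static String rootText(byte[] xmlBytes) throws Exception {
        // An InputStream (not a Reader) leaves encoding detection to the parser.
        try (InputStream in = new ByteArrayInputStream(xmlBytes)) {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(in);
            return doc.getDocumentElement().getTextContent();
        }
    }

    public static void main(String[] args) throws Exception {
        String latin1 = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><a>\u00e9</a>";
        String utf16 = "<?xml version=\"1.0\" encoding=\"UTF-16\"?><a>\u00e9</a>";
        String utf8 = "<a>\u00e9</a>"; // no declaration: UTF-8 is the default

        // Same character, three different byte encodings, no charset argument.
        System.out.println(rootText(latin1.getBytes(StandardCharsets.ISO_8859_1)));
        System.out.println(rootText(utf16.getBytes(StandardCharsets.UTF_16)));
        System.out.println(rootText(utf8.getBytes(StandardCharsets.UTF_8)));
    }
}
```

Wrapping the same bytes in a Reader would force the caller to pick a charset 
before the parser ever sees the declaration, which is the failure mode the 
description speculates about.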



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
