Changming Ma created BEAM-8579:
----------------------------------

             Summary: Strip UTF-8 BOM bytes (if present) in TextSource.
                 Key: BEAM-8579
                 URL: https://issues.apache.org/jira/browse/BEAM-8579
             Project: Beam
          Issue Type: Bug
          Components: io-java-text
    Affects Versions: 2.15.0
            Reporter: Changming Ma


TextSource in the org.apache.beam.sdk.io package can read UTF-8 encoded 
files, but when a file starts with a byte order mark (BOM), TextSource 
preserves the BOM in its output. According to the Unicode Standard 
([http://www.unicode.org/versions/Unicode5.0.0/ch02.pdf]):
 "Use of a BOM is neither required nor recommended for UTF-8". A UTF-8 BOM 
can also cause problems for some Java implementations (e.g., 
[https://bugs.java.com/bugdatabase/view_bug.do?bug_id=4508058]).
 As a general practice, UTF-8 without a BOM is preferred.

Proposal: strip the leading UTF-8 BOM bytes (EF BB BF), if present, from the 
output of TextSource.
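
A minimal sketch of the proposed behavior, for illustration only. This is not 
the TextSource implementation; the class name Utf8BomStripper and the method 
stripUtf8Bom are hypothetical. It assumes the check is applied only to the 
first bytes of a file (offset 0), since a BOM is meaningful only there.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper, not part of Beam: drops a leading UTF-8 BOM if present.
public class Utf8BomStripper {
    // UTF-8 encoding of U+FEFF, the byte order mark.
    private static final byte[] UTF8_BOM = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF};

    // Returns a copy without the leading BOM, or the input unchanged if
    // it does not start with one. Assumes `data` begins at file offset 0.
    public static byte[] stripUtf8Bom(byte[] data) {
        if (data.length >= UTF8_BOM.length
                && data[0] == UTF8_BOM[0]
                && data[1] == UTF8_BOM[1]
                && data[2] == UTF8_BOM[2]) {
            return Arrays.copyOfRange(data, UTF8_BOM.length, data.length);
        }
        return data;
    }

    public static void main(String[] args) {
        byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'h', 'i'};
        String line = new String(stripUtf8Bom(withBom), StandardCharsets.UTF_8);
        System.out.println(line); // prints "hi"
    }
}
```

In a splittable source, only the reader that starts at the beginning of the 
file would need this check; readers that start mid-file would skip it.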



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
