[
https://issues.apache.org/jira/browse/BEAM-8579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Anonymous updated BEAM-8579:
----------------------------
Status: Triage Needed (was: Resolved)
> Strip UTF-8 BOM bytes (if present) in TextSource.
> -------------------------------------------------
>
> Key: BEAM-8579
> URL: https://issues.apache.org/jira/browse/BEAM-8579
> Project: Beam
> Issue Type: Bug
> Components: io-java-text
> Affects Versions: 2.15.0
> Reporter: Changming Ma
> Assignee: Changming Ma
> Priority: P3
> Fix For: 2.18.0
>
> Time Spent: 1h 20m
> Remaining Estimate: 0h
>
> TextSource in the org.apache.beam.sdk.io package can read UTF-8 encoded
> files, but when a file contains a byte order mark (BOM), TextSource
> preserves the BOM in its output. According to the Unicode Standard
> (http://www.unicode.org/versions/Unicode5.0.0/ch02.pdf):
> "Use of a BOM is neither required nor recommended for UTF-8". A UTF-8 BOM
> can also cause problems in some Java implementations (e.g.,
> https://bugs.java.com/bugdatabase/view_bug.do?bug_id=4508058).
> As a general practice, UTF-8 without a BOM is preferred.
> Proposal: remove the BOM bytes from the output of TextSource.
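For illustration only, the proposal can be sketched as a small standalone helper. This is not the actual TextSource change: the class name Utf8BomStripper and its methods are hypothetical, and the sketch assumes the caller applies it only to bytes read from the very start of a file, since the three-byte sequence EF BB BF is a BOM only in that position.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Hypothetical helper (not part of Beam) that drops a leading UTF-8 BOM. */
public final class Utf8BomStripper {
    // The UTF-8 encoding of U+FEFF, the byte order mark.
    private static final byte[] UTF8_BOM = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF};

    private Utf8BomStripper() {}

    /** Returns true if data begins with the three UTF-8 BOM bytes. */
    public static boolean startsWithBom(byte[] data) {
        if (data.length < UTF8_BOM.length) {
            return false;
        }
        for (int i = 0; i < UTF8_BOM.length; i++) {
            if (data[i] != UTF8_BOM[i]) {
                return false;
            }
        }
        return true;
    }

    /**
     * Returns a copy of data without a leading BOM, or data itself when no
     * BOM is present. Assumes data starts at offset 0 of the file.
     */
    public static byte[] stripBom(byte[] data) {
        if (!startsWithBom(data)) {
            return data;
        }
        return Arrays.copyOfRange(data, UTF8_BOM.length, data.length);
    }

    public static void main(String[] args) {
        byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'h', 'i'};
        // The decoded string no longer starts with U+FEFF.
        System.out.println(new String(stripBom(withBom), StandardCharsets.UTF_8));
    }
}
```

A reader that processes a file in byte ranges would need to apply such a check only to the range beginning at offset 0; the same three bytes appearing later in the file encode an ordinary U+FEFF character and are left untouched by this sketch.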
--
This message was sent by Atlassian Jira
(v8.20.10#820010)