[
https://issues.apache.org/jira/browse/FLINK-10134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Flink Jira Bot updated FLINK-10134:
-----------------------------------
Labels: auto-unassigned pull-request-available stale-major (was:
auto-unassigned pull-request-available)
I am the [Flink Jira Bot|https://github.com/apache/flink-jira-bot/] and I help
the community manage its development. I see this issue has been marked as Major
but is unassigned, and neither it nor its Sub-Tasks have been updated for 30
days. I have gone ahead and added the "stale-major" label to the issue. If this
ticket is still a Major, please either assign yourself or give an update.
Afterwards, please remove the label, or in 7 days the issue will be
deprioritized.
> UTF-16 support for TextInputFormat
> ----------------------------------
>
> Key: FLINK-10134
> URL: https://issues.apache.org/jira/browse/FLINK-10134
> Project: Flink
> Issue Type: Bug
> Components: API / DataSet
> Affects Versions: 1.4.2
> Reporter: David Dreyfus
> Priority: Major
> Labels: auto-unassigned, pull-request-available, stale-major
> Fix For: 1.14.0
>
>
> It does not appear that Flink supports a charset encoding of "UTF-16". In
> particular, it doesn't appear that Flink consumes the Byte Order Mark (BOM)
> to establish whether a UTF-16 file is UTF-16LE or UTF-16BE.
>
> TextInputFormat.setCharset("UTF-16") calls DelimitedInputFormat.setCharset(),
> which sets TextInputFormat.charsetName and then modifies the previously set
> delimiterString to construct the proper byte string encoding of the
> delimiter. This same charsetName is also used in TextInputFormat.readRecord()
> to interpret the bytes read from the file.
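>
> As a quick illustration (plain JDK behavior, not Flink code), encoding a
> newline delimiter with Java's "UTF-16" charset prepends a big-endian BOM to
> big-endian code units:
> {code:java}
> import java.nio.charset.StandardCharsets;
>
> public class Utf16DelimiterDemo {
>     public static void main(String[] args) {
>         // Encode the delimiter the way DelimitedInputFormat.setCharset() does:
>         byte[] delim = "\n".getBytes(StandardCharsets.UTF_16);
>         for (byte b : delim) {
>             System.out.printf("%02X ", b & 0xFF); // prints: FE FF 00 0A
>         }
>         System.out.println();
>     }
> }
> {code}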
>
> There are two problems that this implementation would seem to have when using
> UTF-16.
> # delimiterString.getBytes(getCharset()) in DelimitedInputFormat.java will
> return a Big Endian byte sequence that includes the Byte Order Mark (BOM).
> The actual text file will not contain a BOM at each line ending, so the
> delimiter will never be matched. Moreover, if the actual byte encoding of the
> file is Little Endian, the bytes will be interpreted incorrectly.
> # TextInputFormat.readRecord() will not see a BOM each time it decodes a
> byte sequence with the String(bytes, offset, numBytes, charset) call.
> Therefore, it will assume Big Endian, which may not always be correct. [1]
> [https://github.com/apache/flink/blob/master/flink-java/src/main/java/org/apache/flink/api/java/io/TextInputFormat.java#L95]
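>
> As a quick check of that decode path (again plain JDK behavior, not Flink
> code), a little-endian byte sequence decoded without a BOM comes out as the
> wrong character:
> {code:java}
> import java.nio.charset.StandardCharsets;
>
> public class Utf16DecodeDemo {
>     public static void main(String[] args) {
>         // A newline as it appears in a UTF-16LE file; no BOM in the buffer:
>         byte[] leNewline = {0x0A, 0x00};
>         // Same constructor readRecord() uses; "UTF-16" defaults to big-endian:
>         String s =
>             new String(leNewline, 0, leNewline.length, StandardCharsets.UTF_16);
>         System.out.printf("U+%04X%n", (int) s.charAt(0)); // U+0A00, not U+000A
>     }
> }
> {code}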
>
> While there are likely many solutions, I would think that all of them would
> have to start by reading the BOM from the file when a Split is opened, using
> it to pick a byte-order-specific charset when the caller specifies plain
> "UTF-16", and overriding the caller's specification when the BOM conflicts
> with it. That is, if the BOM indicates Little Endian and the caller indicates
> UTF-16BE, Flink should rewrite the charsetName as UTF-16LE.
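> Sketched very roughly (the class and method names here are hypothetical, not
> existing Flink API), that charset resolution might look like:
> {code:java}
> import java.io.IOException;
> import java.io.InputStream;
>
> // Hypothetical helper: resolve the effective charset from a split's first
> // two bytes, overriding the caller's request when a BOM contradicts it.
> public final class BomSniffer {
>     public static String resolveCharset(InputStream in, String requested)
>             throws IOException {
>         if (!in.markSupported()) {
>             return requested; // cannot peek; fall back to the caller's choice
>         }
>         in.mark(2);
>         int b0 = in.read();
>         int b1 = in.read();
>         if (b0 == 0xFE && b1 == 0xFF) {
>             return "UTF-16BE"; // BOM consumed; big-endian wins
>         }
>         if (b0 == 0xFF && b1 == 0xFE) {
>             return "UTF-16LE"; // BOM consumed; little-endian wins
>         }
>         in.reset(); // no BOM: push the bytes back, keep the caller's charset
>         return requested;
>     }
> }
> {code}
> One wrinkle any such fix would have to handle: the BOM only appears at the
> very start of a file, so only the first Split of each file would see it, and
> the detected charset would need to be carried over to the remaining Splits.
>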
> I hope this makes sense and that I haven't been testing incorrectly or
> misreading the code.
>
> I've verified the problem on version 1.4.2. I believe the problem exists on
> all versions.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)