Laurence Rouesnel created PARQUET-158:
-----------------------------------------
Summary: Thrift binary fields are not serialized correctly
Key: PARQUET-158
URL: https://issues.apache.org/jira/browse/PARQUET-158
Project: Parquet
Issue Type: Bug
Reporter: Laurence Rouesnel
Thrift binary fields are not serialized correctly - I believe UTF-8 encoding is
applied.
A demonstration of this bug is shown in
https://github.com/laurencer/parquet-mr-bug
This appears to be an issue in Parquet-MR, where the Thrift TType is used to
determine the type of fields. TType actually represents the on-disk/encoded
field type tag, which does not distinguish between binary and string fields.
String data is not represented any differently on disk; instead, it is up to
the program to interpret the binary data as a UTF-8 encoded string.
Parquet-MR, however, assumes that every binary field is a UTF-8 encoded string.
This may have arisen because the type tag for binary fields is TType.STRING,
even though that tag just denotes a raw binary field.
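
To illustrate why treating a binary field as a UTF-8 string would corrupt it,
here is a minimal, self-contained sketch. It is not Parquet-MR or Thrift code:
the class and method names are hypothetical, and it only assumes (as suspected
above) that the bytes get decoded as UTF-8 and re-encoded somewhere along the
write path.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class; not part of Parquet-MR or Thrift.
class BinaryRoundTripDemo {

    // Simulates treating a binary field as a UTF-8 string:
    // decode the raw bytes to a String, then encode the String back to bytes.
    static byte[] roundTripAsUtf8(byte[] raw) {
        String decoded = new String(raw, StandardCharsets.UTF_8);
        return decoded.getBytes(StandardCharsets.UTF_8);
    }

    // True only if the bytes come back unchanged after the round trip.
    static boolean survivesRoundTrip(byte[] raw) {
        return Arrays.equals(raw, roundTripAsUtf8(raw));
    }

    public static void main(String[] args) {
        // Valid UTF-8 text is unaffected.
        byte[] text = "hello".getBytes(StandardCharsets.UTF_8);
        // 0xFF and 0xFE are never valid in UTF-8, so the decoder replaces
        // each of them with U+FFFD, which re-encodes as three bytes.
        byte[] blob = {(byte) 0xFF, (byte) 0xFE, 0x00, 0x01};

        System.out.println(survivesRoundTrip(text)); // true
        System.out.println(survivesRoundTrip(blob)); // false
    }
}
```

If this is what happens, fields that happen to hold valid UTF-8 would pass
through unchanged, which could explain why the problem only shows up with
genuinely binary payloads.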
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)