Laurence Rouesnel created PARQUET-158:
-----------------------------------------

             Summary: Thrift binary fields are not serialized correctly
                 Key: PARQUET-158
                 URL: https://issues.apache.org/jira/browse/PARQUET-158
             Project: Parquet
          Issue Type: Bug
            Reporter: Laurence Rouesnel


Thrift binary fields are not serialized correctly - I believe UTF-8 encoding is 
applied.

A demonstration of this bug is shown in 
https://github.com/laurencer/parquet-mr-bug

This appears to be an issue in Parquet-MR, where the Thrift TType is used to
determine the type of fields. TType actually represents the on-disk/encoded
field type tag, which does not distinguish between binary and string fields.

String data is not actually represented on-disk as being different - instead it 
is up to the program to interpret the binary data as a UTF-8 encoded string. 
Parquet-MR instead assumes that every binary field is a UTF-8 encoded string.
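
To illustrate why that assumption is harmful, here is a minimal, standalone
sketch in plain Java. It does not use Parquet-MR or Thrift, and the class and
method names are hypothetical. It only shows that decoding arbitrary bytes as
UTF-8 and encoding them again does not, in general, return the original bytes.
Whether Parquet-MR performs exactly this conversion is an assumption based on
the behaviour described above.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper, for illustration only; not part of Parquet-MR or Thrift.
final class BinaryRoundTrip {

    // Treats raw bytes as a UTF-8 string and converts them back to bytes,
    // mimicking a writer that assumes every binary field holds UTF-8 text.
    static byte[] utf8RoundTrip(byte[] raw) {
        String decoded = new String(raw, StandardCharsets.UTF_8);
        return decoded.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Valid UTF-8 text survives the round trip unchanged.
        byte[] text = "hello".getBytes(StandardCharsets.UTF_8);
        System.out.println("text preserved:   "
                + Arrays.equals(text, utf8RoundTrip(text)));

        // Arbitrary binary data is not valid UTF-8: malformed sequences are
        // replaced with U+FFFD during decoding, so the original bytes are lost.
        byte[] binary = {(byte) 0xFF, (byte) 0xFE, 0x00, (byte) 0x80};
        System.out.println("binary preserved: "
                + Arrays.equals(binary, utf8RoundTrip(binary)));
    }
}
```

A payload such as a hash, a compressed blob, or a serialized record would be
altered in the same way as the second array in this sketch.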

This may have arisen because the tag used for binary fields is TType.STRING,
even though it just represents a raw binary field.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
