[
https://issues.apache.org/jira/browse/PARQUET-172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14355894#comment-14355894
]
Ryan Blue commented on PARQUET-172:
-----------------------------------
This was based on a [reported parquet-scrooge
bug|https://github.com/laurencer/parquet-mr-bug/commit/d09126e03e2dc9f60eb5fd7b13b8166ab4d52ba0]
and an incomplete analysis on my part. I looked only at the Schema converter,
which doesn't show support for binary but doesn't need it. Support for binary
already exists;
[{{ParquetWriteProtocol}}|https://github.com/apache/incubator-parquet-mr/blob/master/parquet-thrift/src/main/java/parquet/thrift/ParquetWriteProtocol.java#L298]
allows binary data to be written as either a String or binary, and the
[{{ThriftRecordConverter}}|https://github.com/apache/incubator-parquet-mr/blob/master/parquet-thrift/src/main/java/parquet/thrift/ThriftRecordConverter.java#L413]
generates correct protocol events that handle both.
It turns out that the original bug report wasn't correct. The summary does
appear to have corrupt data:
{code}
Parquet Example
---------------
./sbt "run-main com.rouesnel.parquetmr.bug.ParquetExample"
This should print the following:
Parquet
=======
Parquet File written to /some/random/location/test-324324-foo.parquet
After encoding - binary field is equal to original binary field: false
After encoding - binary field is equal to UTF8 encoded binary field: false
Original
-----
binaryField: -123, 20, 33
stringField: foo
binaryAsStringField: -17, -65, -67, 20, 33
Thrift Serialized
-----
binaryField: 3, 0, 0, 0, -123, 20, 33
stringField: foo
binaryAsStringField: -17, -65, -67, 20, 33
{code}
Tests show that the underlying byte buffers are correct and the error is caused
by assuming the ByteBuffer's backing array contains only the bytes serialized.
When restricted to the position and limit, the data is correct.
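As a rough illustration of that last point (a minimal sketch, not the code from the bug report): the class name {{ByteBufferSlice}}, the helper {{readableBytes}}, and the 4-byte prefix layout are assumptions made up for this example. It shows how {{array()}} exposes the whole backing array, while copying only the bytes between position and limit recovers the original field value.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

public class ByteBufferSlice {
    // Hypothetical helper: copies only the bytes between position and limit.
    // Works on a duplicate so the caller's position is left untouched.
    static byte[] readableBytes(ByteBuffer buf) {
        byte[] out = new byte[buf.remaining()];
        buf.duplicate().get(out);
        return out;
    }

    public static void main(String[] args) {
        // Assumed layout for illustration: 4 leading bytes, then the 3-byte field value.
        byte[] backing = {3, 0, 0, 0, -123, 20, 33};

        // The buffer views only the field value: position = 4, limit = 7.
        ByteBuffer buf = ByteBuffer.wrap(backing, 4, 3);

        // array() ignores position and limit and returns the whole backing array.
        System.out.println(Arrays.toString(buf.array()));        // [3, 0, 0, 0, -123, 20, 33]

        // Restricting to position and limit yields the original bytes.
        System.out.println(Arrays.toString(readableBytes(buf))); // [-123, 20, 33]
    }
}
```

Comparing the result of {{array()}} against the original bytes would report a mismatch even though the buffer's readable content is intact, which is consistent with the "false" lines in the summary above.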
> Add support for non-String binary in parquet-thrift
> ---------------------------------------------------
>
> Key: PARQUET-172
> URL: https://issues.apache.org/jira/browse/PARQUET-172
> Project: Parquet
> Issue Type: Bug
> Components: parquet-mr
> Affects Versions: 1.5.0
> Reporter: Ryan Blue
>
> Thrift [considers binary a "special"
> type|https://thrift.apache.org/docs/types] that isn't in the official spec
> but is "to provide better interoperability with java". The parquet-thrift
> side doesn't currently support binary because Thrift String fields are
> converted to UTF8-annotated binary. The result is that binary fields get
> mangled when stored in Parquet because Parquet assumes they are UTF8.
> I think some storage layer in Java Thrift must know about binary and pass the
> unencoded bytes, but that Parquet hasn't implemented a similar hack. (The
> [type
> conversion|https://github.com/apache/incubator-parquet-mr/blob/master/parquet-thrift/src/main/java/parquet/thrift/ThriftSchemaConverter.java#L86]
> code at least has no entry for binary.)