We had a bug report about this a while back and I wrote some tests to verify that the binary behavior was correct:

  https://github.com/apache/parquet-mr/pull/145/files

I believe that Thrift binary is correctly stored as binary and Thrift String is correctly stored as binary (UTF8). Is that not the case? Could you write a Thrift schema conversion unit test that demonstrates the case you're talking about here?

rb

On 07/07/2015 01:19 PM, Cheng Lian wrote:
Hi all,

I’m working on Spark SQL Parquet support. While running compatibility tests
against parquet-thrift, I noticed that parquet-thrift always treats Thrift
BINARY types as UTF-8 strings. This isn’t Parquet’s fault, though, because
Thrift doesn’t have a genuine BINARY type at all. The following is quoted
from the Thrift docs <https://thrift.apache.org/docs/types>:


    Base Types

    …

      * string: A text string encoded using UTF-8 encoding

    …

    Special Types

    binary: a sequence of unencoded bytes

    N.B.: This is currently a specialized form of the string type above,
    added to provide better interoperability with Java. The current
    plan-of-record is to elevate this to a base type at some point.

I think this implies we can’t treat Thrift STRING as BINARY (UTF8) in
parquet-thrift, since it may contain arbitrary unencoded bytes that are
not valid UTF-8. My proposal is to always treat Thrift STRING as raw
Parquet BINARY without any annotation. We still need to think about
backwards compatibility, though. Thoughts?
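
To make the concern concrete, here is a minimal Java sketch. It is not
parquet-thrift code: the class and method names (Utf8RoundTrip,
isValidUtf8, survivesStringRoundTrip) are hypothetical and use only the
JDK. It shows that a byte sequence that is not well-formed UTF-8 is
altered when it passes through a String, which is the risk if such bytes
are labelled as UTF8.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper for illustration only; not part of parquet-thrift.
final class Utf8RoundTrip {

    // Returns true only if the bytes form a well-formed UTF-8 sequence.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    // Decodes leniently (malformed input becomes U+FFFD), re-encodes,
    // and reports whether the original bytes came back unchanged.
    static boolean survivesStringRoundTrip(byte[] bytes) {
        String decoded = new String(bytes, StandardCharsets.UTF_8);
        byte[] reencoded = decoded.getBytes(StandardCharsets.UTF_8);
        return Arrays.equals(bytes, reencoded);
    }
}
```

With this sketch, the bytes of "hello" pass both checks, while a payload
such as {0xFF, 0xFE, 0x00, 0x01} fails both, because 0xFF and 0xFE can
never appear in well-formed UTF-8.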

I will try to provide a minimal test case for this later.

Cheng




--
Ryan Blue
Software Engineer
Cloudera, Inc.
