Thrift binary type in Parquet

Cheng Lian Tue, 07 Jul 2015 13:19:52 -0700

Hi all,

I’m working on Spark SQL Parquet support. While doing compatibility tests with parquet-thrift, I noticed that parquet-thrift always treats Thrift BINARY types as UTF-8 strings. But this isn’t Parquet’s fault, because Thrift doesn’t even have a genuine BINARY type at all. Below is quoted from the Thrift docs <https://thrift.apache.org/docs/types>:



         Base Types

   …

     * string: A text string encoded using UTF-8 encoding

   …


         Special Types

   binary: a sequence of unencoded bytes

   N.B.: This is currently a specialized form of the string type above,
   added to provide better interoperability with Java. The current
   plan-of-record is to elevate this to a base type at some point.

I think this implies we can’t treat Thrift STRING as BINARY (UTF8) in parquet-thrift, since it may contain arbitrary unencoded bytes. My proposal here is to always treat Thrift STRING as raw Parquet BINARY without any annotation. We still need to think about backwards compatibility, though. Thoughts?
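To illustrate the concern, here is a minimal sketch in plain Python. It is not parquet-thrift code: the byte values and the helper name is_valid_utf8 are made up for illustration. It only shows that a payload a Thrift binary field could legally carry may fail UTF-8 decoding, and that a lenient decode does not round-trip.

```python
# Hypothetical payload: arbitrary bytes that a Thrift `binary` field could carry.
# These values are an assumption chosen for illustration only.
raw = bytes([0x00, 0xFF, 0xFE, 0x80])


def is_valid_utf8(data: bytes) -> bool:
    """Return True if `data` decodes cleanly as UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Genuine text encoded as UTF-8 passes the check.
text_ok = is_valid_utf8("héllo".encode("utf-8"))

# The arbitrary bytes do not: 0xFF, 0xFE and a lone 0x80 are invalid in UTF-8.
raw_ok = is_valid_utf8(raw)

# A reader that trusts a UTF8 annotation and decodes leniently replaces each
# invalid byte with U+FFFD, so re-encoding no longer yields the original bytes.
round_tripped = raw.decode("utf-8", errors="replace").encode("utf-8")
lossless = round_tripped == raw

print(text_ok, raw_ok, lossless)
```

If a column like this were annotated as UTF8, a reader that trusts the annotation would either fail or silently alter the data, which is why leaving the column as un-annotated BINARY looks safer.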


I will try to provide a minimal test case for this later.

Cheng
