I think you mean that we can’t treat Thrift BINARY type as UTF-8 string, right?

On Tue, Jul 7, 2015 at 1:19 PM, Cheng Lian <[email protected]> wrote:

> Hi all,
>
> I’m working on Spark SQL Parquet support. While doing compatibility test
> with parquet-thrift, I noticed that parquet-thrift always treats Thrift
> BINARY types as UTF-8 strings. But this isn’t Parquet’s fault, because
> Thrift doesn’t even have a genuine BINARY type at all. Below is quoted from
> Thrift docs <https://thrift.apache.org/docs/types>:
>
> > Base Types
> >
> > …
> >
> > * string: A text string encoded using UTF-8 encoding
> >
> > …
> >
> > Special Types
> >
> > binary: a sequence of unencoded bytes
> >
> > N.B.: This is currently a specialized form of the string type above,
> > added to provide better interoperability with Java. The current
> > plan-of-record is to elevate this to a base type at some point.
>
> I think this implies we can’t treat Thrift STRING as |BINARY (UTF8)| in
> parquet-thrift, since it may contain unencoded random bytes. My proposal
> here is to always treat Thrift STRING as raw Parquet BINARY without any
> annotation. But still need to think about backwards compatibility though.
> Thoughts?
>
> Will try to give a minimum test case for this later.
>
> Cheng
>

--
Regards,
Ashish
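
For reference, the concern in the quoted message (a field that may hold unencoded bytes being labelled as UTF-8) can be sketched in plain Java. This is only an illustration using the standard library: the class and method names (Utf8RoundTrip, isValidUtf8, roundTrip) are made up for this sketch and are not part of parquet-thrift or Thrift, and the lenient decode is an assumption about how a reader trusting a UTF8 annotation might behave, not a description of any actual reader.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper, for illustration only; not a parquet-thrift class.
public class Utf8RoundTrip {

    // Strict check: returns true only if the bytes are well-formed UTF-8.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    // Lenient decode followed by re-encode. new String(...) silently
    // replaces malformed sequences with U+FFFD, so the result can differ
    // from the input when the input was not valid UTF-8.
    static byte[] roundTrip(byte[] bytes) {
        String decoded = new String(bytes, StandardCharsets.UTF_8);
        return decoded.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] text = "parquet".getBytes(StandardCharsets.UTF_8);
        byte[] raw = {(byte) 0xFF, (byte) 0xFE, 0x00, 0x01}; // arbitrary bytes

        System.out.println(isValidUtf8(text));                  // true
        System.out.println(isValidUtf8(raw));                   // false
        System.out.println(Arrays.equals(raw, roundTrip(raw))); // false
    }
}
```

The last line is the point: once arbitrary bytes pass through a UTF-8 decode, the original payload cannot be recovered, which is why the annotation matters for fields that may hold binary data.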
