On 7/7/15 3:48 PM, Ryan Blue wrote:
On 07/07/2015 03:23 PM, Cheng Lian wrote:
On 7/7/15 1:28 PM, Ashish Singh wrote:
I think you mean that we can’t treat the Thrift BINARY type as a UTF-8
string, right?
Yeah, it's possible that a Thrift BINARY contains illegal UTF-8 byte
sequence(s), and I suppose this may cause problems. Trying to verify
this.
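
A quick way to see the concern, as a minimal Python sketch using only the standard library (the helper names `is_valid_utf8` and `lossy_round_trip` are made up for illustration and are not part of Thrift or Parquet): arbitrary bytes may fail strict UTF-8 decoding, and a lenient decode does not give the original bytes back.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if `data` decodes cleanly as strict UTF-8."""
    try:
        data.decode("utf-8")  # strict error handling is the default
        return True
    except UnicodeDecodeError:
        return False


def lossy_round_trip(data: bytes) -> bytes:
    """Decode with replacement characters, then re-encode as UTF-8."""
    return data.decode("utf-8", errors="replace").encode("utf-8")


# A real string, encoded as UTF-8.
text_bytes = "héllo".encode("utf-8")

# Arbitrary binary payload: 0xC3 must be followed by a continuation byte
# (0x28 is not one), and 0xFF never appears in well-formed UTF-8.
raw_bytes = bytes([0xC3, 0x28, 0xFF])

print(is_valid_utf8(text_bytes))
print(is_valid_utf8(raw_bytes))
print(lossy_round_trip(raw_bytes) == raw_bytes)
```

So a reader that trusts a UTF8 annotation and decodes such a value either fails or silently changes the data.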
Isn't this the right behavior? As long as it isn't annotated as
UTF8, then storing it as binary should be fine.
Ah, it’s actually annotated as UTF8…
Internally Thrift just maps BINARY to STRING and doesn’t have any
annotation indicating that this field is a BINARY, so Parquet just
assumes it’s a normal UTF8 string and writes “BINARY (UTF8)”.
Here are my test Thrift schema and the Parquet schema that
|parquet-schema| extracts from the written Parquet file:

    struct ParquetThriftCompat {
      1: binary binaryColumn;
      2: string stringColumn;
    }

    message ParquetSchema {
      optional binary binaryColumn (UTF8);
      optional binary stringColumn (UTF8);
    }

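
To illustrate why both columns end up annotated the same way, here is a toy Python model of the mapping described above. It is not the actual Thrift-to-Parquet conversion code: `IDL_TO_WIRE_TYPE` and `parquet_field` are hypothetical names, and the only fact borrowed from Thrift is that `string` and `binary` share one wire type (TType.STRING, id 11).

```python
# Hypothetical model for illustration only, not the real converter.
T_STRING = 11  # Thrift wire type id used for both `string` and `binary`

# Once a field is reduced to its wire type, the IDL distinction is gone.
IDL_TO_WIRE_TYPE = {"string": T_STRING, "binary": T_STRING}


def parquet_field(name: str, idl_type: str) -> str:
    """Render a Parquet field, deciding the annotation from the wire type only."""
    wire_type = IDL_TO_WIRE_TYPE[idl_type]
    if wire_type == T_STRING:
        # Nothing here says "raw bytes", so the field is assumed to be text.
        return f"optional binary {name} (UTF8);"
    raise ValueError(f"unsupported wire type: {wire_type}")


fields = [
    parquet_field("binaryColumn", "binary"),
    parquet_field("stringColumn", "string"),
]
for field in fields:
    print(field)
```

Under this assumption the converter has no way to tell the two columns apart, which matches the schema above.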
rb