You can see that parquet-mr 1.7.0 only handles Thrift STRING, and always adds the UTF8 annotation: https://github.com/apache/parquet-mr/blob/apache-parquet-1.7.0/parquet-thrift/src/main/java/org/apache/parquet/thrift/ThriftSchemaConvertVisitor.java#L249-L252

Because there’s just no |ThriftType.BinaryType|.

On 7/7/15 3:56 PM, Cheng Lian wrote:

On 7/7/15 3:48 PM, Ryan Blue wrote:

On 07/07/2015 03:23 PM, Cheng Lian wrote:
On 7/7/15 1:28 PM, Ashish Singh wrote:
I think you mean that we can’t treat Thrift BINARY type as UTF-8 string,
right?
Yeah, it's possible that a Thrift BINARY contains illegal UTF-8 byte
sequence(s) and I suppose this may cause problems. Trying to verify this.

Isn't this the right behavior? As long as it isn't annotated as UTF8, then storing it as binary should be fine.

Ah, it’s actually annotated as UTF8…

Internally Thrift just maps BINARY to STRING and doesn’t have any annotation indicating that this field is a BINARY, so Parquet just assumes it’s a normal UTF8 string and writes “BINARY (UTF8)”.

Here are my test Thrift schema and the Parquet schema extracted from the written Parquet file by |parquet-schema|:

    struct ParquetThriftCompat {
      1: binary binaryColumn;
      2: string stringColumn;
    }

    message ParquetSchema {
      optional binary binaryColumn (UTF8);
      optional binary stringColumn (UTF8);
    }
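
To illustrate why the UTF8 annotation on binaryColumn can matter, here is a minimal Python sketch. It is not parquet-mr code: the byte values and the helper name is_valid_utf8 are made up for illustration, and it assumes a reader that trusts the annotation and decodes the column leniently as a string.

```python
# Hypothetical payload of a Thrift `binary` field.
# The bytes 0xff and 0xfe can never appear in well-formed UTF-8.
raw = b"\xff\xfe\x00binary"


def is_valid_utf8(data: bytes) -> bool:
    """Return True if `data` is a well-formed UTF-8 byte sequence."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Assumed reader behaviour: decode as UTF-8 and substitute U+FFFD
# for every invalid byte, then write the string back out.
decoded = raw.decode("utf-8", errors="replace")
round_tripped = decoded.encode("utf-8")

print(is_valid_utf8(raw))    # the payload is not valid UTF-8
print(round_tripped == raw)  # the original bytes are not recovered
```

Under that assumption the original bytes cannot be recovered from the decoded string, which is the kind of problem an incorrect UTF8 annotation could cause for readers that rely on it.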

rb

