Hi All,
Greetings.
I am finding that parquet-cli throws an exception when trying to read UUID
values.
I have a Parquet file with two columns: Message, encoded as a byte array
(STRING), and Number, encoded as a 16-byte fixed-length byte array (UUID).
The file has one row of data and is readable by parquet-cpp.
Schema:
message root {
  required binary Message (STRING);
  required fixed_len_byte_array(16) Number (UUID);
}
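To make the Number column concrete: as I understand the Parquet UUID logical type, the 16 bytes hold the UUID in big-endian order (most significant 8 bytes first). Below is a minimal Java sketch of that mapping using only the JDK. The class and method names (UuidFixedBytes, toBytes, fromBytes, toHex) are my own for illustration and are not part of parquet-mr or parquet-cli.

```java
import java.nio.ByteBuffer;
import java.util.UUID;

// Illustrative helper only; these names are hypothetical, not a parquet-mr API.
public class UuidFixedBytes {

    // Pack a UUID into 16 bytes, most significant 8 bytes first.
    static byte[] toBytes(UUID uuid) {
        ByteBuffer buf = ByteBuffer.allocate(16); // ByteBuffer is big-endian by default
        buf.putLong(uuid.getMostSignificantBits());
        buf.putLong(uuid.getLeastSignificantBits());
        return buf.array();
    }

    // Rebuild the UUID from a 16-byte fixed-length value.
    static UUID fromBytes(byte[] bytes) {
        if (bytes.length != 16) {
            throw new IllegalArgumentException("expected 16 bytes, got " + bytes.length);
        }
        ByteBuffer buf = ByteBuffer.wrap(bytes);
        long mostSignificant = buf.getLong();
        long leastSignificant = buf.getLong();
        return new UUID(mostSignificant, leastSignificant);
    }

    // Render bytes as lowercase hex so the layout is easy to inspect.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        UUID id = UUID.fromString("00112233-4455-6677-8899-aabbccddeeff");
        byte[] fixed = toBytes(id);
        System.out.println(toHex(fixed));     // the 16 raw bytes as hex
        System.out.println(fromBytes(fixed)); // round-trips to the original UUID
    }
}
```

The point is that the stored value is 16 raw bytes, not the 36-character text form, so a reader that expects a string for this column is asking for a different physical type.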
Here is the exception stack from parquet-cli when trying to read uuid
values:
Caused by: org.apache.parquet.io.ParquetDecodingException: The requested schema is not compatible with the file schema. incompatible types: required binary Number (STRING) != required fixed_len_byte_array(16) Number (UUID)
    at org.apache.parquet.io.ColumnIOFactory$ColumnIOCreatorVisitor.incompatibleSchema(ColumnIOFactory.java:101)
    at org.apache.parquet.io.ColumnIOFactory$ColumnIOCreatorVisitor.visit(ColumnIOFactory.java:93)
    at org.apache.parquet.schema.PrimitiveType.accept(PrimitiveType.java:602)
    at org.apache.parquet.io.ColumnIOFactory$ColumnIOCreatorVisitor.visitChildren(ColumnIOFactory.java:83)
    at org.apache.parquet.io.ColumnIOFactory$ColumnIOCreatorVisitor.visit(ColumnIOFactory.java:57)
    at org.apache.parquet.schema.MessageType.accept(MessageType.java:55)
    at org.apache.parquet.io.ColumnIOFactory.getColumnIO(ColumnIOFactory.java:162)
    at org.apache.parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:135)
    at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:225)
I debugged the parquet-cli code and found that it tries to project the
UUID column as a string, which later throws because the two types are not
compatible.
Source code references:
At AvroReadSupport.java, line 97
>>>>>>
String requestedProjectionString = configuration.get(AVRO_REQUESTED_PROJECTION);
if (requestedProjectionString != null) {
  Schema avroRequestedProjection = new Schema.Parser().parse(requestedProjectionString);
  projection = new AvroSchemaConverter(configuration).convert(avroRequestedProjection);
}
<<<<<<<<
Debugger values for requestedProjectionString=
{"type":"record","name":"root","fields":[{"name":"Message","type":"string"},{"name":"Number","type":{"type":"string","logicalType":"uuid"}}]}
[Note that `Number` now has a type of `string` and a logicalType of `uuid`]
At ColumnIOFactory.java line 93
>>>>>>
incompatibleSchema(primitiveType, currentRequestedType);
<<<<<<
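Debugger values:
primitiveType = required fixed_len_byte_array(16) Number (UUID)
currentRequestedType = required binary Number (STRING)
Because the requested type (binary) does not match the file type
(fixed_len_byte_array), this call throws the exception shown above.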
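Here is a toy model of what I believe is happening at that line. It is only a sketch: the classes below (SchemaCheckSketch, Physical, Column) are made up for illustration and are not parquet-mr types, and I am assuming the check boils down to comparing the physical (primitive) types of the file column and the requested column, ignoring the logical annotation.

```java
// Hypothetical, simplified model of the compatibility check; not parquet-mr code.
public class SchemaCheckSketch {

    // Stand-ins for the two physical types involved in my file.
    enum Physical { BINARY, FIXED_LEN_BYTE_ARRAY }

    // A column reduced to its name, physical type, and logical annotation.
    static final class Column {
        final String name;
        final Physical physical;
        final String logical;

        Column(String name, Physical physical, String logical) {
            this.name = name;
            this.physical = physical;
            this.logical = logical;
        }

        @Override
        public String toString() {
            return "required " + physical.name().toLowerCase() + " " + name + " (" + logical + ")";
        }
    }

    // Assumption: only the physical types are compared.
    static boolean isCompatible(Column fileType, Column requestedType) {
        return fileType.physical == requestedType.physical;
    }

    // Mirrors the shape of the error message: requested type first, then file type.
    static void check(Column fileType, Column requestedType) {
        if (!isCompatible(fileType, requestedType)) {
            throw new IllegalStateException(
                "incompatible types: " + requestedType + " != " + fileType);
        }
    }

    public static void main(String[] args) {
        Column fileType = new Column("Number", Physical.FIXED_LEN_BYTE_ARRAY, "UUID");
        Column requestedType = new Column("Number", Physical.BINARY, "STRING");
        try {
            check(fileType, requestedType);
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

In this model the projection derived from the Avro schema asks for a binary column while the file stores a fixed-length one, so the check fails even though both sides describe the same logical value.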
If I skip over the projection code in AvroReadSupport, parquet-cli is able
to read my file.
Can someone help me understand what the issue is here? Is this a bug in
parquet-cli, or is there a configuration I must tweak so that it does not
convert UUID to string?
Thanks,
Balaji