Hi All,

Greetings.


I am finding that parquet-cli throws an exception when it tries to read
UUID values.


I have a Parquet file with two columns: Message, encoded as a byte array,
and Number, encoded as a 16-byte fixed-length byte array (UUID). The file
contains a single row of data and is readable by parquet-cpp.


Schema:

message root {
  required binary Message (STRING);
  required fixed_len_byte_array(16) Number (UUID);
}
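
For reference, here is a minimal sketch of how a UUID maps onto the 16 bytes of a fixed_len_byte_array(16) column, assuming the big-endian layout that the Parquet format uses for the UUID logical type. The class and method names (UuidFixedBytes, toBytes, fromBytes) are made up for illustration and are not part of parquet-cli or parquet-mr.

```java
import java.nio.ByteBuffer;
import java.util.UUID;

// Illustrative helper only; not part of any Parquet library.
public class UuidFixedBytes {

    // Pack a UUID into 16 bytes, most significant byte first
    // (ByteBuffer is big-endian by default).
    static byte[] toBytes(UUID uuid) {
        ByteBuffer buf = ByteBuffer.allocate(16);
        buf.putLong(uuid.getMostSignificantBits());
        buf.putLong(uuid.getLeastSignificantBits());
        return buf.array();
    }

    // Rebuild a UUID from a 16-byte big-endian value.
    static UUID fromBytes(byte[] bytes) {
        if (bytes.length != 16) {
            throw new IllegalArgumentException("expected 16 bytes, got " + bytes.length);
        }
        ByteBuffer buf = ByteBuffer.wrap(bytes);
        long mostSignificant = buf.getLong();
        long leastSignificant = buf.getLong();
        return new UUID(mostSignificant, leastSignificant);
    }

    public static void main(String[] args) {
        UUID original = UUID.fromString("123e4567-e89b-12d3-a456-426614174000");
        byte[] fixed = toBytes(original);
        System.out.println(fixed.length);     // prints 16
        System.out.println(fromBytes(fixed)); // prints 123e4567-e89b-12d3-a456-426614174000
    }
}
```

The point of the sketch is that the Number column holds 16 raw bytes, not the 36-character text form, which is why a reader that asks for a string-typed column does not match the file.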


Here is the exception stack trace from parquet-cli when it tries to read
the UUID values:

Caused by: org.apache.parquet.io.ParquetDecodingException: The requested schema is not compatible with the file schema. incompatible types: required binary Number (STRING) != required fixed_len_byte_array(16) Number (UUID)
        at org.apache.parquet.io.ColumnIOFactory$ColumnIOCreatorVisitor.incompatibleSchema(ColumnIOFactory.java:101)
        at org.apache.parquet.io.ColumnIOFactory$ColumnIOCreatorVisitor.visit(ColumnIOFactory.java:93)
        at org.apache.parquet.schema.PrimitiveType.accept(PrimitiveType.java:602)
        at org.apache.parquet.io.ColumnIOFactory$ColumnIOCreatorVisitor.visitChildren(ColumnIOFactory.java:83)
        at org.apache.parquet.io.ColumnIOFactory$ColumnIOCreatorVisitor.visit(ColumnIOFactory.java:57)
        at org.apache.parquet.schema.MessageType.accept(MessageType.java:55)
        at org.apache.parquet.io.ColumnIOFactory.getColumnIO(ColumnIOFactory.java:162)
        at org.apache.parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:135)
        at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:225)



I debugged the parquet-cli code and found that it requests a projection in
which the UUID column is a string. The read then throws because that type
is not compatible with the type stored in the file.


Source code references:

At AvroReadSupport.java, line 97:

>>>>>>
    String requestedProjectionString = configuration.get(AVRO_REQUESTED_PROJECTION);
    if (requestedProjectionString != null) {
      Schema avroRequestedProjection = new Schema.Parser().parse(requestedProjectionString);
      projection = new AvroSchemaConverter(configuration).convert(avroRequestedProjection);
    }
<<<<<<


Debugger value of requestedProjectionString:

{"type":"record","name":"root","fields":[{"name":"Message","type":"string"},{"name":"Number","type":{"type":"string","logicalType":"uuid"}}]}

[Note that `Number` now has a type of `string` and a logicalType of `uuid`]



At ColumnIOFactory.java, line 93:

>>>>>>
incompatibleSchema(primitiveType, currentRequestedType);
<<<<<<

Debugger values:

primitiveType = required fixed_len_byte_array(16) Number (UUID)
currentRequestedType = required binary Number (STRING)

Because the file type and the requested type differ, this call throws.
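
To make the failure easier to reason about, here is a simplified, self-contained model of the comparison. It is only a sketch under the assumption that the check compares the physical type of the file column with that of the requested column; it does not reproduce the actual ColumnIOFactory code, and every name in it (ProjectionCheckSketch, Column, PhysicalType, checkCompatible) is hypothetical.

```java
// Simplified stand-in for the schema compatibility check; NOT parquet-mr code.
public class ProjectionCheckSketch {

    // Only the two physical types that appear in this report.
    enum PhysicalType { BINARY, FIXED_LEN_BYTE_ARRAY }

    // Hypothetical, minimal description of a column.
    static final class Column {
        final String name;
        final PhysicalType physicalType;
        final String logicalType;

        Column(String name, PhysicalType physicalType, String logicalType) {
            this.name = name;
            this.physicalType = physicalType;
            this.logicalType = logicalType;
        }

        @Override
        public String toString() {
            return physicalType + " " + name + " (" + logicalType + ")";
        }
    }

    // Assumption: a mismatch in physical type is what triggers the error.
    static void checkCompatible(Column fileColumn, Column requestedColumn) {
        if (fileColumn.physicalType != requestedColumn.physicalType) {
            throw new IllegalStateException(
                "incompatible types: " + requestedColumn + " != " + fileColumn);
        }
    }

    public static void main(String[] args) {
        Column inFile = new Column("Number", PhysicalType.FIXED_LEN_BYTE_ARRAY, "UUID");
        Column requested = new Column("Number", PhysicalType.BINARY, "STRING");
        try {
            checkCompatible(inFile, requested);
        } catch (IllegalStateException e) {
            // Same shape as the debugger values above: BINARY vs FIXED_LEN_BYTE_ARRAY.
            System.out.println(e.getMessage());
        }
    }
}
```

In this model the error disappears as soon as the requested column uses the same physical type as the file column, which matches the observation that the file reads fine when no converted projection is applied.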


If I skip over the projection code in AvroReadSupport, parquet-cli is able
to read my file.


Can someone help me understand what the issue is here? Is this a bug in
parquet-cli, or is there a configuration setting I must change so that it
does not convert the UUID to a string?


Thanks,

Balaji
