Hi,
I'm wondering if I could get some help with this. If someone can confirm
that they also find parquet-cli unable to read UUID values from files
written using parquet-mr, that would be enough for me to file a bug report.
Alternatively, if I get some pointers on how to fix it, I'm happy to
contribute a PR.

Thanks

On Wed, Apr 13, 2022 at 12:32 PM gamaken k <[email protected]> wrote:

> Hi All,
>
>
> Greetings.
>
>
> I am finding that parquet-cli throws when trying to read UUID values.
>
>
> I have a parquet file with 2 columns: Message, encoded as a byte array,
> and Number, encoded as a fixed-length byte array (UUID). The file has one
> row of data and is readable by parquet-cpp.
>
>
> Schema:
>
> message root {
>
>   required binary Message (STRING);
>
>   required fixed_len_byte_array(16) Number (UUID);
>
> }
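
For anyone reproducing this, here is a minimal sketch of how a UUID maps onto the 16 bytes of the fixed_len_byte_array(16) column, assuming the big-endian layout described in the Parquet logical types spec. The class and method names are hypothetical and only the JDK is used, so this is not the code that wrote my file.

```java
import java.nio.ByteBuffer;
import java.util.UUID;

// Hypothetical helper, not part of parquet-mr or parquet-cli.
public class UuidFixed16 {

    // Pack a UUID into 16 bytes, most significant bits first.
    // ByteBuffer is big-endian by default.
    static byte[] toBytes(UUID uuid) {
        return ByteBuffer.allocate(16)
                .putLong(uuid.getMostSignificantBits())
                .putLong(uuid.getLeastSignificantBits())
                .array();
    }

    // Rebuild the UUID from its 16-byte form.
    static UUID fromBytes(byte[] bytes) {
        if (bytes.length != 16) {
            throw new IllegalArgumentException("expected 16 bytes, got " + bytes.length);
        }
        ByteBuffer buf = ByteBuffer.wrap(bytes);
        long mostSignificant = buf.getLong();
        long leastSignificant = buf.getLong();
        return new UUID(mostSignificant, leastSignificant);
    }

    public static void main(String[] args) {
        UUID id = UUID.fromString("00112233-4455-6677-8899-aabbccddeeff");
        byte[] raw = toBytes(id);
        System.out.println(raw.length + " bytes, round-trips: " + id.equals(fromBytes(raw)));
    }
}
```

The point is only that the stored value is a 16-byte binary and not a 36-character string, which is why a string projection cannot line up with it.
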
>
>
> Here is the exception stack from parquet-cli when trying to read uuid
> values:
>
> Caused by: org.apache.parquet.io.ParquetDecodingException: The requested
> schema is not compatible with the file schema. incompatible types: required
> binary Number (STRING) != required fixed_len_byte_array(16) Number (UUID)
>
>         at org.apache.parquet.io.ColumnIOFactory$ColumnIOCreatorVisitor.incompatibleSchema(ColumnIOFactory.java:101)
>         at org.apache.parquet.io.ColumnIOFactory$ColumnIOCreatorVisitor.visit(ColumnIOFactory.java:93)
>         at org.apache.parquet.schema.PrimitiveType.accept(PrimitiveType.java:602)
>         at org.apache.parquet.io.ColumnIOFactory$ColumnIOCreatorVisitor.visitChildren(ColumnIOFactory.java:83)
>         at org.apache.parquet.io.ColumnIOFactory$ColumnIOCreatorVisitor.visit(ColumnIOFactory.java:57)
>         at org.apache.parquet.schema.MessageType.accept(MessageType.java:55)
>         at org.apache.parquet.io.ColumnIOFactory.getColumnIO(ColumnIOFactory.java:162)
>         at org.apache.parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:135)
>         at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:225)
>
>
>
> I debugged the parquet-cli code and found that it appears to project the
> UUID column as a string, which later throws because the two types are not
> compatible.
>
>
> Source code references:
>
> At AvroReadSupport.java, line 97
>
> >>>>>>
>
>     String requestedProjectionString = configuration.get(AVRO_REQUESTED_PROJECTION);
>     if (requestedProjectionString != null) {
>       Schema avroRequestedProjection = new Schema.Parser().parse(requestedProjectionString);
>       projection = new AvroSchemaConverter(configuration).convert(avroRequestedProjection);
>     }
>
> <<<<<<<<
>
>
> Debugger value of requestedProjectionString:
>
>
> {"type":"record","name":"root","fields":[{"name":"Message","type":"string"},{"name":"Number","type":{"type":"string","logicalType":"uuid"}}]}
>
> [Note that `Number` now has a type of `string` and a logicalType of `uuid`]
>
>
>
> At ColumnIOFactory.java line 93
>
> >>>>>>
>
> incompatibleSchema(primitiveType, currentRequestedType);
>
> <<<<<<
>
> Debugger values for
>
> primitiveType = required fixed_len_byte_array(16) Number (UUID)
>
> currentRequestedType = required binary Number (STRING)
>
>
>
> and this will throw.
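
To make the failure concrete, here is a toy model of the check as I understand it from the debugger values above. It assumes the comparison comes down to the physical type of the requested column versus the file column; that is my reading, not something confirmed against the parquet-mr source. Every type and name below is a hypothetical stand-in, not a parquet-mr class.

```java
// Toy model only: Physical, Column and compatible() are hypothetical
// stand-ins, not the types used by ColumnIOFactory.
public class ProjectionCheckSketch {

    enum Physical { BINARY, FIXED_LEN_BYTE_ARRAY }

    static final class Column {
        final String name;
        final Physical physical;
        final String logical;

        Column(String name, Physical physical, String logical) {
            this.name = name;
            this.physical = physical;
            this.logical = logical;
        }

        @Override
        public String toString() {
            return "required " + physical + " " + name + " (" + logical + ")";
        }
    }

    // Assumed rule: the requested column is accepted only when its
    // physical type matches the physical type stored in the file.
    static boolean compatible(Column requested, Column inFile) {
        return requested.physical == inFile.physical;
    }

    public static void main(String[] args) {
        Column inFile = new Column("Number", Physical.FIXED_LEN_BYTE_ARRAY, "UUID");
        Column requested = new Column("Number", Physical.BINARY, "STRING");

        if (!compatible(requested, inFile)) {
            System.out.println("incompatible types: " + requested + " != " + inFile);
        }
    }
}
```

Under that assumption, the projection derived from the Avro schema (string with logicalType uuid) can never match the file column, whatever the data looks like.
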
>
>
> If I skip over the projection code in AvroReadSupport, parquet-cli is able
> to read my file.
>
>
> Can someone help me understand what the issue is here? Is this a bug in
> parquet-cli, or is there a configuration I must change so that it does not
> convert uuid to string?
>
>
> Thanks,
>
> Balaji
>
