Hi, wondering if I could get some help with this. If someone can confirm that they too find that parquet-cli cannot read UUID values from files written using parquet-mr, that would be enough for me to file a bug report. Alternatively, if I can get some help on how to fix it, I'm happy to contribute a PR.
Thanks

On Wed, Apr 13, 2022 at 12:32 PM gamaken k <[email protected]> wrote:

> Hi All,
>
> Greetings.
>
> I am finding that parquet-cli throws when trying to read UUID values.
>
> I have a parquet file with 2 columns, message encoded as byte-array and
> number encoded as fixed length byte array (UUID). The file has one row
> worth of data and is readable by parquet-cpp.
>
> Schema:
>
> message root {
>   required binary Message (STRING);
>   required fixed_len_byte_array(16) Number (UUID);
> }
>
> Here is the exception stack from parquet-cli when trying to read uuid
> values:
>
> Caused by: org.apache.parquet.io.ParquetDecodingException: The requested
> schema is not compatible with the file schema. incompatible types: required
> binary Number (STRING) != required fixed_len_byte_array(16) Number (UUID)
>   at org.apache.parquet.io.ColumnIOFactory$ColumnIOCreatorVisitor.incompatibleSchema(ColumnIOFactory.java:101)
>   at org.apache.parquet.io.ColumnIOFactory$ColumnIOCreatorVisitor.visit(ColumnIOFactory.java:93)
>   at org.apache.parquet.schema.PrimitiveType.accept(PrimitiveType.java:602)
>   at org.apache.parquet.io.ColumnIOFactory$ColumnIOCreatorVisitor.visitChildren(ColumnIOFactory.java:83)
>   at org.apache.parquet.io.ColumnIOFactory$ColumnIOCreatorVisitor.visit(ColumnIOFactory.java:57)
>   at org.apache.parquet.schema.MessageType.accept(MessageType.java:55)
>   at org.apache.parquet.io.ColumnIOFactory.getColumnIO(ColumnIOFactory.java:162)
>   at org.apache.parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:135)
>   at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:225)
>
> I debugged parquet-cli code and found that parquet-cli is trying to
> project the UUID as a string and later on that throws as these types are
> not compatible?
>
> Source code references:
>
> At AvroReadSupport.java, line 97
>
> >>>>>>
> String requestedProjectionString = configuration.get(AVRO_REQUESTED_PROJECTION);
> if (requestedProjectionString != null) {
>   Schema avroRequestedProjection = new Schema.Parser().parse(requestedProjectionString);
>   projection = new AvroSchemaConverter(configuration).convert(avroRequestedProjection);
> }
> <<<<<<<<
>
> Debugger values for requestedProjectionString=
>
> {"type":"record","name":"root","fields":[{"name":"Message","type":"string"},{"name":"Number","type":{"type":"string","logicalType":"uuid"}}]}
>
> [Note that `Number` now has a type of `string` and a logicalType of `uuid`]
>
> At ColumnIOFactory.java line 93
>
> >>>>>>
> incompatibleSchema(primitiveType, currentRequestedType);
> <<<<<<
>
> Debugger values for
>
> primitiveType = required fixed_len_byte_array(16) Number (UUID)
> currentRequestedType = required binary Number (STRING)
>
> and this will throw.
>
> If I skip over the projection code in AvroReadSupport, parquet-cli is able
> to read my file.
>
> Can someone help me understand what's the issue here -- bug in
> parquet-cli or is there a configuration I must tweak to not make it convert
> uuid to string?
>
> Thanks,
>
> Balaji
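
For anyone trying to reproduce this, here is a minimal sketch of the two representations involved. The file stores `Number` as a 16-byte fixed-length value, while the requested projection asks for a string. The sketch assumes the big-endian byte order that the Parquet format specification describes for the UUID logical type. It uses only the JDK; the class `UuidFixedBytes` and its methods are hypothetical names for illustration and are not part of parquet-mr or parquet-cli.

```java
import java.nio.ByteBuffer;
import java.util.UUID;

// Hypothetical helper for illustration only; not parquet-mr or parquet-cli code.
public class UuidFixedBytes {

    // Pack a UUID into 16 bytes, most significant bits first.
    // ByteBuffer is big-endian by default, which matches the assumed layout.
    public static byte[] toBytes(UUID uuid) {
        ByteBuffer buf = ByteBuffer.allocate(16);
        buf.putLong(uuid.getMostSignificantBits());
        buf.putLong(uuid.getLeastSignificantBits());
        return buf.array();
    }

    // Rebuild a UUID from a 16-byte fixed-length value.
    public static UUID fromBytes(byte[] bytes) {
        if (bytes.length != 16) {
            throw new IllegalArgumentException("expected 16 bytes, got " + bytes.length);
        }
        ByteBuffer buf = ByteBuffer.wrap(bytes);
        long mostSignificant = buf.getLong();
        long leastSignificant = buf.getLong();
        return new UUID(mostSignificant, leastSignificant);
    }

    public static void main(String[] args) {
        UUID original = UUID.fromString("123e4567-e89b-12d3-a456-426614174000");
        byte[] raw = toBytes(original);   // the fixed_len_byte_array(16) form
        UUID decoded = fromBytes(raw);    // back to the canonical string form
        System.out.println(raw.length + " bytes -> " + decoded);
    }
}
```

The 36-character string form and the 16-byte form carry the same value but are different physical types, which is consistent with the `binary (STRING)` versus `fixed_len_byte_array(16) (UUID)` mismatch reported in the exception.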
