Hi,

This reminds me of some similar problems I've seen in the bug tracker. I 
suggest creating a JIRA ticket with steps to reproduce, and attaching a parquet 
file for others to look at. Also describe how you wrote the file. If you're 
linking ParquetMR to your own code, please include minimal code to reproduce 
the problem, such as a maven project we can build. In theory, ParquetMR 
shouldn't be able to write files that can't be read back, but it happens, and 
it might be due to a mistake in the code that feeds ParquetMR the data it is 
supposed to store.
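
For a reproducer, one thing worth ruling out on the writing side is how the 
UUID becomes the 16 bytes handed to the writer. Here is a minimal sketch using 
only the JDK, assuming the big-endian layout that the Parquet UUID logical type 
describes. The class and method names (UuidBytes, toBytes, fromBytes) are made 
up for illustration and are not part of ParquetMR:

```java
import java.nio.ByteBuffer;
import java.util.UUID;

// Hypothetical helper, not a ParquetMR class: converts between java.util.UUID
// and the 16-byte big-endian form used for fixed_len_byte_array(16) (UUID).
final class UuidBytes {

    // Most significant 8 bytes first, then the least significant 8 bytes.
    static byte[] toBytes(UUID uuid) {
        ByteBuffer buf = ByteBuffer.allocate(16); // big-endian by default
        buf.putLong(uuid.getMostSignificantBits());
        buf.putLong(uuid.getLeastSignificantBits());
        return buf.array();
    }

    // Inverse of toBytes; rejects anything that is not exactly 16 bytes.
    static UUID fromBytes(byte[] bytes) {
        if (bytes.length != 16) {
            throw new IllegalArgumentException("expected 16 bytes, got " + bytes.length);
        }
        ByteBuffer buf = ByteBuffer.wrap(bytes);
        long most = buf.getLong();
        long least = buf.getLong();
        return new UUID(most, least);
    }

    public static void main(String[] args) {
        UUID original = UUID.fromString("00112233-4455-6677-8899-aabbccddeeff");
        byte[] encoded = toBytes(original);
        System.out.println(encoded.length);                      // prints 16
        System.out.println(fromBytes(encoded).equals(original)); // prints true
    }
}
```

If the bytes round-trip like this and the file still can't be read, that 
points away from the feeding code and toward the reader.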

I'm pretty sure that parquet-cli is the same codebase as ParquetMR, just 
wrapped up with a command line interface.


On 4/22/22, 1:51 PM, "gamaken k" <[email protected]> wrote:

    Hi,
    I'm wondering if I could get some help with this. If someone can confirm
    that they also find that parquet-cli cannot read UUID values from files
    written using parquet-mr, that would be enough for me to file a bug report.
    Or, if I get some help on how to fix it, I'm happy to contribute a PR.

    Thanks

    On Wed, Apr 13, 2022 at 12:32 PM gamaken k <[email protected]> wrote:

    > Hi All,
    >
    >
    > Greetings.
    >
    >
    > I am finding that parquet-cli throws when trying to read UUID values.
    >
    >
    > I have a parquet file with 2 columns: Message, encoded as a byte array,
    > and Number, encoded as a fixed-length byte array (UUID). The file has one
    > row of data and is readable by parquet-cpp.
    >
    >
    > Schema:
    >
    > message root {
    >   required binary Message (STRING);
    >   required fixed_len_byte_array(16) Number (UUID);
    > }
    >
    >
    > Here is the exception stack from parquet-cli when trying to read uuid
    > values:
    >
    > Caused by: org.apache.parquet.io.ParquetDecodingException: The requested
    > schema is not compatible with the file schema. incompatible types: required
    > binary Number (STRING) != required fixed_len_byte_array(16) Number (UUID)
    >         at org.apache.parquet.io.ColumnIOFactory$ColumnIOCreatorVisitor.incompatibleSchema(ColumnIOFactory.java:101)
    >         at org.apache.parquet.io.ColumnIOFactory$ColumnIOCreatorVisitor.visit(ColumnIOFactory.java:93)
    >         at org.apache.parquet.schema.PrimitiveType.accept(PrimitiveType.java:602)
    >         at org.apache.parquet.io.ColumnIOFactory$ColumnIOCreatorVisitor.visitChildren(ColumnIOFactory.java:83)
    >         at org.apache.parquet.io.ColumnIOFactory$ColumnIOCreatorVisitor.visit(ColumnIOFactory.java:57)
    >         at org.apache.parquet.schema.MessageType.accept(MessageType.java:55)
    >         at org.apache.parquet.io.ColumnIOFactory.getColumnIO(ColumnIOFactory.java:162)
    >         at org.apache.parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:135)
    >         at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:225)
    >
    >
    >
    > I debugged the parquet-cli code and found that parquet-cli is trying to
    > project the UUID as a string, and that later throws because the two types
    > are not compatible.
    >
    >
    > Source code references:
    >
    > At AvroReadSupport.java, line 97
    >
    > >>>>>>
    >
    >     String requestedProjectionString = configuration.get(AVRO_REQUESTED_PROJECTION);
    >     if (requestedProjectionString != null) {
    >       Schema avroRequestedProjection = new Schema.Parser().parse(requestedProjectionString);
    >       projection = new AvroSchemaConverter(configuration).convert(avroRequestedProjection);
    >     }
    >
    > <<<<<<<<
    >
    >
    > Debugger values for requestedProjectionString=
    >
    >
    > {"type":"record","name":"root","fields":[{"name":"Message","type":"string"},{"name":"Number","type":{"type":"string","logicalType":"uuid"}}]}
    >
    > [Note that `Number` now has a type of `string` and a logicalType of `uuid`]
    >
    >
    >
    > At ColumnIOFactory.java line 93
    >
    > >>>>>>
    >
    > incompatibleSchema(primitiveType, currentRequestedType);
    >
    > <<<<<<
    >
    > Debugger values for
    >
    > primitiveType = required fixed_len_byte_array(16) Number (UUID)
    >
    > currentRequestedType = required binary Number (STRING)
    >
    >
    >
    > and this will throw.
    >
    >
    > If I skip over the projection code in AvroReadSupport, parquet-cli is able
    > to read my file.
    >
    >
    > Can someone help me understand what the issue is here: is it a bug in
    > parquet-cli, or is there a configuration I must tweak so that it does not
    > convert uuid to string?
    >
    >
    > Thanks,
    >
    > Balaji
    >
