Hi Tim,
Thanks for responding. To answer your question, I wrote the parquet file
using a private .NET implementation of parquet that I'm building. I am
using plain encoding for UUID, according to the spec. My goal is to maintain
compatibility with parquet-mr and parquet-cpp, so I'm running roundtrip
tests: encoding data with my private library and decoding it with parquet-mr
and parquet-cpp. While I was able to read the values with parquet-cpp, I
noticed that parquet-cli throws when reading the file. I'm using parquet-cli
as a proxy for parquet-mr and trying to get to the bottom of this issue.
I have filed a report here including the sample file:
https://issues.apache.org/jira/browse/PARQUET-2140
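
For reference, the Parquet spec defines the UUID logical type as a 16-byte
fixed_len_byte_array holding the value in big-endian order. Below is a minimal
Java sketch of that byte layout using only the JDK. The class name
UuidFixedLenBytes is hypothetical; it is not part of parquet-mr, and it is not
the code of the .NET writer that produced the sample file.

```java
import java.nio.ByteBuffer;
import java.util.UUID;

// Hypothetical helper for illustration only; not a parquet-mr class.
final class UuidFixedLenBytes {
    private UuidFixedLenBytes() {}

    // Encode a UUID as 16 bytes, most significant byte first (big-endian).
    static byte[] toBytes(UUID uuid) {
        // ByteBuffer uses big-endian byte order by default.
        ByteBuffer buf = ByteBuffer.allocate(16);
        buf.putLong(uuid.getMostSignificantBits());
        buf.putLong(uuid.getLeastSignificantBits());
        return buf.array();
    }

    // Decode 16 big-endian bytes back into a UUID.
    static UUID fromBytes(byte[] bytes) {
        if (bytes.length != 16) {
            throw new IllegalArgumentException("expected 16 bytes, got " + bytes.length);
        }
        ByteBuffer buf = ByteBuffer.wrap(bytes);
        long mostSignificant = buf.getLong();
        long leastSignificant = buf.getLong();
        return new UUID(mostSignificant, leastSignificant);
    }
}
```

With this layout, the bytes of a value read in the same order as the hex
digits of its canonical string form, which makes it easy to compare a hex dump
of the column data against the expected UUID.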




On Fri, Apr 22, 2022 at 1:19 PM Miller, Tim <[email protected]>
wrote:

> Hi,
>
> This reminds me of some similar problems I've seen in the bug tracker. I
> suggest creating a JIRA ticket, with some instructions, and attaching a
> parquet file for others to look at. Also include how you did the writing.
> If you're linking ParquetMR to your own code, please include minimal code
> to reproduce the problem, such as a maven project we can build. In theory,
> ParquetMR shouldn't be able to write files that can't be read back, but it
> happens, and it might be due to a mistake in the code that feeds ParquetMR
> what it is supposed to store.
>
> I'm pretty sure that parquet-cli is the same codebase as ParquetMR, just
> wrapped up with a command line interface.
>
>
> On 4/22/22, 1:51 PM, "gamaken k" <[email protected]> wrote:
>
>     Hi,
>     I'm wondering if I could get some help with this. If someone can confirm
>     that they also find parquet-cli cannot read UUID values from files
>     written using parquet-mr, that would be enough for me and I'll file a bug
>     report. Or, if I get some help on how to fix it, I'm happy to contribute
>     a PR.
>
>     Thanks
>
>     On Wed, Apr 13, 2022 at 12:32 PM gamaken k <[email protected]>
> wrote:
>
>     > Hi All,
>     >
>     >
>     > Greetings.
>     >
>     >
>     > I am finding that parquet-cli throws when trying to read UUID values.
>     >
>     >
>     > I have a parquet file with 2 columns: Message, encoded as a byte array,
>     > and Number, encoded as a fixed-length byte array (UUID). The file has
>     > one row of data and is readable by parquet-cpp.
>     >
>     >
>     > Schema:
>     >
>     > message root {
>     >
>     >   required binary Message (STRING);
>     >
>     >   required fixed_len_byte_array(16) Number (UUID);
>     >
>     > }
>     >
>     >
>     > Here is the exception stack from parquet-cli when trying to read uuid
>     > values:
>     >
>     > Caused by: org.apache.parquet.io.ParquetDecodingException: The requested schema is not compatible with the file schema. incompatible types: required binary Number (STRING) != required fixed_len_byte_array(16) Number (UUID)
>     >         at org.apache.parquet.io.ColumnIOFactory$ColumnIOCreatorVisitor.incompatibleSchema(ColumnIOFactory.java:101)
>     >         at org.apache.parquet.io.ColumnIOFactory$ColumnIOCreatorVisitor.visit(ColumnIOFactory.java:93)
>     >         at org.apache.parquet.schema.PrimitiveType.accept(PrimitiveType.java:602)
>     >         at org.apache.parquet.io.ColumnIOFactory$ColumnIOCreatorVisitor.visitChildren(ColumnIOFactory.java:83)
>     >         at org.apache.parquet.io.ColumnIOFactory$ColumnIOCreatorVisitor.visit(ColumnIOFactory.java:57)
>     >         at org.apache.parquet.schema.MessageType.accept(MessageType.java:55)
>     >         at org.apache.parquet.io.ColumnIOFactory.getColumnIO(ColumnIOFactory.java:162)
>     >         at org.apache.parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:135)
>     >         at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:225)
>     >
>     >
>     >
>     > I debugged the parquet-cli code and found that parquet-cli tries to
>     > project the UUID column as a string, which later throws because the two
>     > types are not compatible.
>     >
>     >
>     > Source code references:
>     >
>     > At AvroReadSupport.java, line 97
>     >
>     > >>>>>>
>     >
>     >     String requestedProjectionString = configuration.get(AVRO_REQUESTED_PROJECTION);
>     >     if (requestedProjectionString != null) {
>     >       Schema avroRequestedProjection = new Schema.Parser().parse(requestedProjectionString);
>     >       projection = new AvroSchemaConverter(configuration).convert(avroRequestedProjection);
>     >     }
>     >
>     > <<<<<<
>     >
>     >
>     > Debugger value of requestedProjectionString:
>     >
>     > {"type":"record","name":"root","fields":[{"name":"Message","type":"string"},{"name":"Number","type":{"type":"string","logicalType":"uuid"}}]}
>     >
>     > [Note that `Number` now has a type of `string` and a logicalType of `uuid`]
>     >
>     >
>     >
>     > At ColumnIOFactory.java line 93
>     >
>     > >>>>>>
>     >
>     > incompatibleSchema(primitiveType, currentRequestedType);
>     >
>     > <<<<<<
>     >
>     > Debugger values for
>     >
>     > primitiveType = required fixed_len_byte_array(16) Number (UUID)
>     >
>     > currentRequestedType = required binary Number (STRING)
>     >
>     >
>     >
>     > and this will throw.
>     >
>     >
>     > If I skip over the projection code in AvroReadSupport, parquet-cli is
>     > able to read my file.
>     >
>     >
>     > Can someone help me understand what the issue is here? Is this a bug in
>     > parquet-cli, or is there a configuration I must tweak so that it does not
>     > convert uuid to string?
>     >
>     >
>     > Thanks,
>     >
>     > Balaji
>     >
>
>
