Hi Tim,

Thanks for responding. To answer your question: I wrote the parquet file using a private .NET implementation of Parquet that I'm building, and I am using plain encoding for the UUID column according to the spec. My goal is to maintain compatibility with parquet-mr and parquet-cpp, so I'm running round-trip tests that encode data with my library and decode it with parquet-mr and parquet-cpp. parquet-cpp reads the values without problems, but parquet-cli throws when reading the same file. I'm using parquet-cli as a proxy for parquet-mr while I try to get to the bottom of this issue.

I have filed a report, including the sample file, here: https://issues.apache.org/jira/browse/PARQUET-2140
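
For readers following along, here is a minimal sketch of the UUID layout the Parquet format spec describes: a 16-byte fixed_len_byte_array holding the UUID in big-endian order, so 00112233-4455-6677-8899-aabbccddeeff becomes the bytes 00 11 22 ... ff. The class and method names (UuidPlainEncoding, toFixedLenBytes, fromFixedLenBytes) are hypothetical and chosen for illustration only; this is neither the private .NET writer nor parquet-mr code.

```java
import java.nio.ByteBuffer;
import java.util.UUID;

// Hypothetical helper, for illustration only.
class UuidPlainEncoding {

    // Encode a UUID as the 16 bytes stored in a fixed_len_byte_array(16) column.
    // ByteBuffer is big-endian by default, which matches the layout in the spec.
    static byte[] toFixedLenBytes(UUID uuid) {
        return ByteBuffer.allocate(16)
                .putLong(uuid.getMostSignificantBits())
                .putLong(uuid.getLeastSignificantBits())
                .array();
    }

    // Decode the 16 stored bytes back into a UUID.
    static UUID fromFixedLenBytes(byte[] bytes) {
        if (bytes.length != 16) {
            throw new IllegalArgumentException("UUID needs exactly 16 bytes, got " + bytes.length);
        }
        ByteBuffer buf = ByteBuffer.wrap(bytes);
        long mostSignificant = buf.getLong();
        long leastSignificant = buf.getLong();
        return new UUID(mostSignificant, leastSignificant);
    }

    public static void main(String[] args) {
        UUID original = UUID.fromString("00112233-4455-6677-8899-aabbccddeeff");
        byte[] encoded = toFixedLenBytes(original);

        // Print the stored bytes as hex, then the decoded value.
        StringBuilder hex = new StringBuilder();
        for (byte b : encoded) {
            hex.append(String.format("%02x ", b & 0xff));
        }
        System.out.println(hex.toString().trim());
        System.out.println(fromFixedLenBytes(encoded));
    }
}
```

A round trip through these two methods is a quick way to check the byte order of a value before looking at schema-level issues such as the projection mismatch discussed in the thread below.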
On Fri, Apr 22, 2022 at 1:19 PM Miller, Tim <[email protected]> wrote:

> Hi,
>
> This reminds me of some similar problems I've seen in the bug tracker. I
> suggest creating a JIRA ticket, with some instructions, and attaching a
> parquet file for others to look at. Also include how you did the writing.
> If you're linking ParquetMR to your own code, please include minimal code
> to reproduce the problem, such as a maven project we can build. In theory,
> ParquetMR shouldn't be able to write files that can't be read back, but it
> happens, and it might be due to a mistake in the code that feeds ParquetMR
> what it is supposed to store.
>
> I'm pretty sure that parquet-cli is the same codebase as ParquetMR, just
> wrapped up with a command line interface.
>
> On 4/22/22, 1:51 PM, "gamaken k" <[email protected]> wrote:
>
> Hi,
> wondering if I could get some help with this. If someone can confirm that
> they too find parquet-cli could not read UUID from files written using
> parquet-mr, that would be enough for me and I'll file a bug report, or if I
> get some help on how to fix it, I'm happy to contribute a PR.
>
> Thanks
>
> On Wed, Apr 13, 2022 at 12:32 PM gamaken k <[email protected]> wrote:
>
> > Hi All,
> >
> > Greetings.
> >
> > I am finding that parquet-cli throws when trying to read UUID values.
> >
> > I have a parquet file with 2 columns, message encoded as byte-array and
> > number encoded as fixed length byte array (UUID). The file has one row
> > worth of data and is readable by parquet-cpp.
> >
> > Schema:
> >
> >   message root {
> >     required binary Message (STRING);
> >     required fixed_len_byte_array(16) Number (UUID);
> >   }
> >
> > Here is the exception stack from parquet-cli when trying to read uuid
> > values:
> >
> > Caused by: org.apache.parquet.io.ParquetDecodingException: The requested
> > schema is not compatible with the file schema. incompatible types: required
> > binary Number (STRING) != required fixed_len_byte_array(16) Number (UUID)
> >     at org.apache.parquet.io.ColumnIOFactory$ColumnIOCreatorVisitor.incompatibleSchema(ColumnIOFactory.java:101)
> >     at org.apache.parquet.io.ColumnIOFactory$ColumnIOCreatorVisitor.visit(ColumnIOFactory.java:93)
> >     at org.apache.parquet.schema.PrimitiveType.accept(PrimitiveType.java:602)
> >     at org.apache.parquet.io.ColumnIOFactory$ColumnIOCreatorVisitor.visitChildren(ColumnIOFactory.java:83)
> >     at org.apache.parquet.io.ColumnIOFactory$ColumnIOCreatorVisitor.visit(ColumnIOFactory.java:57)
> >     at org.apache.parquet.schema.MessageType.accept(MessageType.java:55)
> >     at org.apache.parquet.io.ColumnIOFactory.getColumnIO(ColumnIOFactory.java:162)
> >     at org.apache.parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:135)
> >     at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:225)
> >
> > I debugged parquet-cli code and found that parquet-cli is trying to
> > project the UUID as a string and later on that throws as these types are
> > not compatible?
> >
> > Source code references:
> >
> > At AvroReadSupport.java, line 97
> >
> > >>>>>>
> >
> >   String requestedProjectionString = configuration.get(AVRO_REQUESTED_PROJECTION);
> >   if (requestedProjectionString != null) {
> >     Schema avroRequestedProjection = new Schema.Parser().parse(requestedProjectionString);
> >     projection = new AvroSchemaConverter(configuration).convert(avroRequestedProjection);
> >   }
> >
> > <<<<<<<<
> >
> > Debugger values for requestedProjectionString=
> >
> >   {"type":"record","name":"root","fields":[{"name":"Message","type":"string"},{"name":"Number","type":{"type":"string","logicalType":"uuid"}}]}
> >
> > [Note that `Number` now has a type of `string` and a logicalType of `uuid`]
> >
> > At ColumnIOFactory.java line 93
> >
> > >>>>>>
> >
> >   incompatibleSchema(primitiveType, currentRequestedType);
> >
> > <<<<<<
> >
> > Debugger values for
> >
> >   primitiveType = required fixed_len_byte_array(16) Number (UUID)
> >   currentRequestedType = required binary Number (STRING)
> >
> > and this will throw.
> >
> > If I skip over the projection code in AvroReadSupport, parquet-cli is able
> > to read my file.
> >
> > Can someone help me understand what's the issue here -- bug in
> > parquet-cli or is there a configuration I must tweak to not make it convert
> > uuid to string?
> >
> > Thanks,
> >
> > Balaji
