[ https://issues.apache.org/jira/browse/PARQUET-2140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17528248#comment-17528248 ]

Timothy Miller commented on PARQUET-2140:
-----------------------------------------

I'm still working on this, but the problem appears to be in 
org.apache.parquet.avro.AvroSchemaConverter.convertField(). It contains an 
anonymous inner class that defines this method:
{code:java}
            @Override
            public Schema convertFIXED_LEN_BYTE_ARRAY(PrimitiveTypeName primitiveTypeName) {
              if (annotation instanceof LogicalTypeAnnotation.UUIDLogicalTypeAnnotation) {
                return Schema.create(Schema.Type.STRING);
              } else {
                int size = parquetType.asPrimitiveType().getTypeLength();
                return Schema.createFixed(parquetType.getName(), null, null, size);
              }
            }{code}
The UUID column in the file is indeed a UUIDLogicalTypeAnnotation, so this code 
tells the converter to use an Avro STRING schema internally, and that projection 
is later rejected elsewhere as incompatible with the file schema.
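
For what it's worth, the bad projection is easy to see outside the CLI. Here is a 
rough standalone snippet (the class name and structure are just for illustration; 
it uses the requested projection string from the debugger values in the report 
below):
{code:java}
import org.apache.avro.Schema;
import org.apache.parquet.avro.AvroSchemaConverter;
import org.apache.parquet.schema.MessageType;

public class UuidProjectionRepro {
  public static void main(String[] args) {
    // The requested projection parquet-cli builds: Number is an Avro string
    // with logicalType uuid (see the debugger values in the report below).
    String projectionJson = "{\"type\":\"record\",\"name\":\"root\",\"fields\":["
        + "{\"name\":\"Message\",\"type\":\"string\"},"
        + "{\"name\":\"Number\",\"type\":{\"type\":\"string\",\"logicalType\":\"uuid\"}}]}";
    Schema avroProjection = new Schema.Parser().parse(projectionJson);

    // Converting the projection back to Parquet with default settings yields
    // "required binary Number (STRING)", which ColumnIOFactory then rejects
    // against the file's "required fixed_len_byte_array(16) Number (UUID)".
    MessageType requested = new AvroSchemaConverter().convert(avroProjection);
    System.out.println(requested);
  }
}{code}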

It looks like someone was attempting to add UUID support but didn't finish the 
job. If I edit this code so that it doesn't special-case UUID, then everything 
works. I don't get the UUID printed all pretty, but there's no exception. This 
is the output I get:
{noformat}
{"Message": "First record", "Number": [-41, 48, -44, -37, -38, 57, -87, 67, -90, 116, 7, 79, 44, -29, 50, 17]}{noformat}
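To be concrete, the edit I mean is roughly this (just a sketch, not necessarily 
what the eventual PR will look like): let UUID fall through to the same handling 
as any other fixed_len_byte_array.
{code:java}
            @Override
            public Schema convertFIXED_LEN_BYTE_ARRAY(PrimitiveTypeName primitiveTypeName) {
              // Sketch: stop forcing UUID columns to an Avro STRING schema; treat them
              // like any other fixed_len_byte_array so the projection matches the file.
              int size = parquetType.asPrimitiveType().getTypeLength();
              return Schema.createFixed(parquetType.getName(), null, null, size);
            }{code}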
I have no idea how to fix this properly, since 
org.apache.parquet.io.ColumnIOFactory.visit() insists on strict type checking. 
So I'm going to create a PR to just make this not crash, in the hopes that 
someone who understands this better might provide some feedback.

> parquet-cli unable to read UUID values
> --------------------------------------
>
>                 Key: PARQUET-2140
>                 URL: https://issues.apache.org/jira/browse/PARQUET-2140
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-cli
>            Reporter: Balaji K
>            Priority: Minor
>         Attachments: guid.parquet
>
>
> I am finding that parquet-cli throws when trying to read UUID values. 
> Attached to this bug report is a parquet file with 2 columns: Message, encoded 
> as a byte array, and Number, encoded as a fixed-length byte array (UUID). This 
> file was written by my .NET implementation of the Parquet specification. The 
> file has one row worth of data and is readable by parquet-cpp.
> +Schema as read by parquet-cli:+
> message root {
>   required binary Message (STRING);
>   required fixed_len_byte_array(16) Number (UUID);
> }
> +Values as read by parquet-cpp:+
> --- Values ---
> Message                       |Number                        |
> First record                  |215 48 212 219 218 57 169 67 166 116 7 79 44 227 50 17 |
>  
> +Here is the exception stack from parquet-cli when trying to read UUID values:+
> Caused by: org.apache.parquet.io.ParquetDecodingException: The requested schema is not compatible with the file schema. incompatible types: required binary Number (STRING) != required fixed_len_byte_array(16) Number (UUID)
>         at org.apache.parquet.io.ColumnIOFactory$ColumnIOCreatorVisitor.incompatibleSchema(ColumnIOFactory.java:101)
>         at org.apache.parquet.io.ColumnIOFactory$ColumnIOCreatorVisitor.visit(ColumnIOFactory.java:93)
>         at org.apache.parquet.schema.PrimitiveType.accept(PrimitiveType.java:602)
>         at org.apache.parquet.io.ColumnIOFactory$ColumnIOCreatorVisitor.visitChildren(ColumnIOFactory.java:83)
>         at org.apache.parquet.io.ColumnIOFactory$ColumnIOCreatorVisitor.visit(ColumnIOFactory.java:57)
>         at org.apache.parquet.schema.MessageType.accept(MessageType.java:55)
>         at org.apache.parquet.io.ColumnIOFactory.getColumnIO(ColumnIOFactory.java:162)
>         at org.apache.parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:135)
>         at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:225)
>  
> I debugged the parquet-cli code and found that parquet-cli is trying to project 
> the UUID column as a string, and later on that throws because these types are 
> not compatible.
>  
> +Source code references:+
> At AvroReadSupport.java, line 97
> ~~~~~~~~~~~~
>     String requestedProjectionString = configuration.get(AVRO_REQUESTED_PROJECTION);
>     if (requestedProjectionString != null) {
>       Schema avroRequestedProjection = new Schema.Parser().parse(requestedProjectionString);
>       projection = new AvroSchemaConverter(configuration).convert(avroRequestedProjection);
>     }
> ~~~~~~~~~~~~
>  
> +Debugger values for+ 
> requestedProjectionString=
> {"type":"record","name":"root","fields":[\{"name":"Message","type":"string"},\{"name":"Number","type":{"type":"string","logicalType":"uuid"}}]}
> [Note that `Number` now has a type of `string` and a logicalType of `uuid`]
>  
> At ColumnIOFactory.java line 93
> ~~~~~~~~~~~~
> incompatibleSchema(primitiveType, currentRequestedType);
> ~~~~~~~~~~~~
> +Debugger values for+ 
> primitiveType = required fixed_len_byte_array(16) Number (UUID)
> currentRequestedType = required binary Number (STRING)
>  
> and this will throw.
>  
> If I skip over the projection code in AvroReadSupport, parquet-cli is able to 
> read my file.
> I am not sure whether the bug is in parquet-cli, in parquet-mr, or in the 
> library I used to encode this file. The fact that parquet-cpp is able to read 
> it gives me some confidence to say that the problem is in either parquet-cli 
> or parquet-mr.
> Please point me in the right direction so I can verify this UUID round-tripping 
> purely from parquet-mr itself in the form of a unit test. Happy to contribute 
> tests or a fix if needed.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)
