RussellSpitzer opened a new pull request #2167:
URL: https://github.com/apache/iceberg/pull/2167


   Previously, the Iceberg conversion functions for Parquet would throw an exception if
   they encountered a Binary type field that Parquet represented internally as a repeated
   primitive field not nested in another group type. This violated some expectations
   within our schema conversion code.
   
   We encountered this with a user who was using Parquet's AvroParquetWriter class to
   write Parquet files. The files, while readable by Hive and Spark, were not readable
   by Iceberg.
   
   While investigating, I found that the following Avro schema element caused the problem:
   
   ```java
     String schema = "{\n" +
           "   \"type\":\"record\",\n" +
           "   \"name\":\"DbRecord\",\n" +
           "   \"namespace\":\"com.russ\",\n" +
           "   \"fields\":[\n" +
           "      {\n" +
           "         \"name\":\"foo\",\n" +
           "         \"type\":[\n" +
           "            \"null\",\n" +
           "            {\n" +
           "               \"type\":\"array\",\n" +
           "               \"items\":\"bytes\"\n" +
           "            }\n" +
           "         ],\n" +
           "         \"default\":null\n" +
           "      }\n" +
           "   ]\n" +
           "}";
   ```

   Parquet would convert this element into

   ```
   foo:      OPTIONAL F:1
   .array:   REPEATED BINARY R:1 D:2
   ```

   This violates an assumption in Iceberg's reader, which expects the list to be nested.
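
To make the structural difference concrete, here is a minimal, dependency-free sketch of how a reader can locate the element of a list under the backward-compatibility rules in the Parquet format's LIST specification. It is not the code changed in this PR and does not use Parquet's or Iceberg's API: `SchemaNode`, `ListShapeSketch`, and `elementOf` are hypothetical names, and a node with no children stands in for a primitive such as `binary`.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a Parquet schema node (not Parquet's Type class).
class SchemaNode {
    final String name;
    final String repetition;          // "required", "optional" or "repeated"
    final List<SchemaNode> children;  // empty means "primitive" in this sketch

    SchemaNode(String name, String repetition, SchemaNode... children) {
        this.name = name;
        this.repetition = repetition;
        this.children = Arrays.asList(children);
    }

    boolean isPrimitive() {
        return children.isEmpty();
    }
}

class ListShapeSketch {
    // Returns the node that carries the element type of a list group.
    static SchemaNode elementOf(SchemaNode listGroup) {
        if (listGroup.children.size() != 1) {
            throw new IllegalArgumentException("list group must have exactly one field");
        }
        SchemaNode repeated = listGroup.children.get(0);
        if (!"repeated".equals(repeated.repetition)) {
            throw new IllegalArgumentException("list field must be repeated");
        }
        // Rule 1: a repeated primitive is itself the element (the case in this PR).
        if (repeated.isPrimitive()) {
            return repeated;
        }
        // Rule 2: a repeated group with several fields is itself the element.
        if (repeated.children.size() > 1) {
            return repeated;
        }
        // Rule 3: a single-field group named "array" or "<list>_tuple" is the element.
        if (repeated.name.equals("array") || repeated.name.equals(listGroup.name + "_tuple")) {
            return repeated;
        }
        // Rule 4: standard three-level layout, the element is the single nested field.
        return repeated.children.get(0);
    }

    public static void main(String[] args) {
        // Two-level layout, matching the dump above: foo -> repeated binary array
        SchemaNode legacy = new SchemaNode("foo", "optional",
            new SchemaNode("array", "repeated"));
        // Three-level layout: foo -> repeated group list -> element
        SchemaNode standard = new SchemaNode("foo", "optional",
            new SchemaNode("list", "repeated", new SchemaNode("element", "optional")));
        System.out.println(elementOf(legacy).name);
        System.out.println(elementOf(standard).name);
    }
}
```

A reader that only handles the last branch fails on the two-level layout, which is the kind of restriction described here.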
      
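   Doing a quick test with

   ```java
   import org.apache.parquet.avro.AvroSchemaConverter;
   import org.apache.parquet.schema.MessageType;

   // Parse the Avro schema string defined above
   org.apache.avro.Schema.Parser parser = new org.apache.avro.Schema.Parser();
   org.apache.avro.Schema avroSchema = parser.parse(schema);
   // Convert it to a Parquet schema with Parquet's own converter
   AvroSchemaConverter converter = new AvroSchemaConverter();
   MessageType parquetSchema = converter.convert(avroSchema);
   ```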
   
   I saw that this was reproducible in the current version of Parquet and not just in
   our user's code.

   To fix this, I added some tests for this particular data type and loosened some of
   the restrictions in our Parquet schema parsing code.
      

