[GitHub] [iceberg] zhongyujiang opened a new issue #3604: Parquet: read compatibility for migrated Iceberg table of parquet data file

GitBox Wed, 24 Nov 2021 19:24:50 -0800


zhongyujiang opened a new issue #3604:
URL: https://github.com/apache/iceberg/issues/3604



   I migrated several Hive tables stored as Parquet to Iceberg, but 
encountered an exception when reading the migrated Iceberg tables. I found that 
the Parquet files of the source table use a legacy schema in which the repeated 
group of a list type is named 'bag':
   ```
   optional group myList (LIST) {
       repeated group bag {
           optional binary element (STRING);
       }
   }
   ```
   
   Because the Parquet schema has no field IDs, Iceberg applies the name 
mapping (ApplyNameMapping) to the Parquet fileSchema:
   ```
   else if (nameMapping != null) {
     typeWithIds = ParquetSchemaUtil.applyNameMapping(fileSchema, nameMapping);
     this.projection = ParquetSchemaUtil.pruneColumns(typeWithIds, expectedSchema);
   }
   ```
   
   ApplyNameMapping currently builds a new list/map GroupType that uses the 
default repeated group name (list, key_value) in the type it returns:
   ```
   public Type list(GroupType list, Type elementType) {
     Preconditions.checkArgument(elementType != null,
         "List type must have element field");

     MappedField field = nameMapping.find(currentPath());
     Type listType = org.apache.parquet.schema.Types.list(list.getRepetition())
         .element(elementType)
         .named(list.getName());

     return field == null ? listType : listType.withId(field.id());
   }
   ```
   So the type returned by ApplyNameMapping has this schema:
   ```
   optional group myList (LIST) = 1 {
       repeated group list {
           optional binary element (STRING) = 2;
       }
   }
   ```
   This doesn't match the actual Parquet file schema shown above, and it 
causes an exception when reading the Parquet file.
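
   To make the mismatch concrete, here is a minimal, self-contained sketch. 
It does not use the Parquet or Iceberg APIs: ListGroup, leafPath, 
rebuildWithDefaultName and rebuildPreservingName are hypothetical names 
invented for illustration. It only models the column path that a reader would 
look up, and shows one possible direction (carrying over the repeated group 
name from the file), which is an assumption and not a confirmed fix.

```java
class RepeatedNameSketch {
    /** Toy stand-in for a 3-level Parquet LIST group (not the parquet-mr API). */
    static final class ListGroup {
        final String name;          // outer group, e.g. "myList"
        final String repeatedName;  // repeated group, e.g. "bag" or "list"
        final String elementName;   // leaf field, e.g. "element"

        ListGroup(String name, String repeatedName, String elementName) {
            this.name = name;
            this.repeatedName = repeatedName;
            this.elementName = elementName;
        }

        /** Dotted column path a reader would use to locate the leaf column. */
        String leafPath() {
            return name + "." + repeatedName + "." + elementName;
        }
    }

    /** Models the behaviour described above: the repeated group always becomes "list". */
    static ListGroup rebuildWithDefaultName(ListGroup file) {
        return new ListGroup(file.name, "list", file.elementName);
    }

    /** Hypothetical alternative: keep whatever repeated group name the file used. */
    static ListGroup rebuildPreservingName(ListGroup file) {
        return new ListGroup(file.name, file.repeatedName, file.elementName);
    }

    public static void main(String[] args) {
        ListGroup file = new ListGroup("myList", "bag", "element");
        System.out.println("file:      " + file.leafPath());
        System.out.println("default:   " + rebuildWithDefaultName(file).leafPath());
        System.out.println("preserved: " + rebuildPreservingName(file).leafPath());
    }
}
```

   In this toy model, the default rebuild yields a path that is absent from 
the file, while the preserving rebuild yields the file's own path.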


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


