zhongyujiang opened a new issue #3604:
URL: https://github.com/apache/iceberg/issues/3604
I migrated several Hive tables stored in Parquet format to Iceberg, but
encountered an exception when reading a migrated Iceberg table. I found that
the Parquet files of the source table use a legacy schema in which the
repeated element of a list type is named 'bag':
```
optional group myList (LIST) {
  repeated group bag {
    optional binary element (STRING);
  }
}
```
Because the Parquet schema has no field IDs, Iceberg runs ApplyNameMapping on
the Parquet fileSchema:
```
else if (nameMapping != null) {
  typeWithIds = ParquetSchemaUtil.applyNameMapping(fileSchema, nameMapping);
  this.projection = ParquetSchemaUtil.pruneColumns(typeWithIds, expectedSchema);
}
```
ApplyNameMapping currently builds a new list/map GroupType that uses the
default repeated element names (list, key_value) in the type it returns:
```
public Type list(GroupType list, Type elementType) {
  Preconditions.checkArgument(elementType != null,
      "List type must have element field");
  MappedField field = nameMapping.find(currentPath());
  Type listType = org.apache.parquet.schema.Types.list(list.getRepetition())
      .element(elementType)
      .named(list.getName());
  return field == null ? listType : listType.withId(field.id());
}
```
so the Type returned by ApplyNameMapping has a schema like this:
```
optional group myList (LIST) = 1 {
  repeated group list {
    optional binary element (STRING) = 2;
  }
}
```
This doesn't match the actual Parquet file schema shown above (the repeated
group is named 'list' instead of 'bag'), and it causes an exception when
reading the Parquet file.
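
To make the mismatch concrete, here is a minimal sketch using a toy schema model rather than the real Parquet or Iceberg classes. `ListNameSketch`, `Node`, `leafPaths`, `rebuildWithDefaultName` and `rebuildPreservingName` are all hypothetical names made up for illustration. It assumes the reader resolves columns by their dotted path through the schema, and it shows one possible direction for a fix (reusing the repeated group name found in the file), not necessarily how Iceberg should or does handle it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Toy model of a schema node; this is NOT the parquet-mr GroupType API. */
class ListNameSketch {
  static final class Node {
    final String name;
    final List<Node> children;

    Node(String name, Node... children) {
      this.name = name;
      this.children = Arrays.asList(children);
    }
  }

  /** Collects dotted leaf paths such as "myList.bag.element". */
  static List<String> leafPaths(Node node) {
    List<String> paths = new ArrayList<>();
    collect(node, "", paths);
    return paths;
  }

  private static void collect(Node node, String prefix, List<String> out) {
    String path = prefix.isEmpty() ? node.name : prefix + "." + node.name;
    if (node.children.isEmpty()) {
      out.add(path);
      return;
    }
    for (Node child : node.children) {
      collect(child, path, out);
    }
  }

  /** Mirrors the reported behavior: the repeated group is always renamed to "list". */
  static Node rebuildWithDefaultName(Node list) {
    Node element = list.children.get(0).children.get(0);
    return new Node(list.name, new Node("list", element));
  }

  /** Hypothetical alternative: reuse the name the file gave the repeated group. */
  static Node rebuildPreservingName(Node list) {
    Node repeated = list.children.get(0);
    Node element = repeated.children.get(0);
    return new Node(list.name, new Node(repeated.name, element));
  }

  public static void main(String[] args) {
    // Legacy file schema from the report: myList -> bag -> element
    Node fileList = new Node("myList", new Node("bag", new Node("element")));

    System.out.println(leafPaths(fileList));                         // [myList.bag.element]
    System.out.println(leafPaths(rebuildWithDefaultName(fileList))); // [myList.list.element]
    System.out.println(leafPaths(rebuildPreservingName(fileList)));  // [myList.bag.element]
  }
}
```

In this toy model the rebuilt schema only lines up with the file when the original repeated group name is carried over. The same idea would apply to the key_value group of a map.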