qphien commented on a change in pull request #2052:
URL: https://github.com/apache/iceberg/pull/2052#discussion_r554938002
##########
File path: mr/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergSerDe.java
##########
@@ -82,7 +81,17 @@ public void initialize(@Nullable Configuration configuration, Properties serDePr
 }
 String[] selectedColumns = ColumnProjectionUtils.getReadColumnNames(configuration);
- Schema projectedSchema = selectedColumns.length > 0 ? tableSchema.select(selectedColumns) : tableSchema;
+ // When same table is joined multiple times, it is possible some selected columns are duplicated,
+ // in this case wrong recordStructField position leads wrong value or ArrayIndexOutOfBoundException
+ String[] distinctSelectedColumns = Arrays.stream(selectedColumns).distinct().toArray(String[]::new);
+ Schema projectedSchema = distinctSelectedColumns.length > 0 ?
+     tableSchema.select(distinctSelectedColumns) : tableSchema;
+ // the input split mapper handles does not belong to this table
+ // it is necessary to ensure projectedSchema equals to tableSchema,
+ // or we cannot find selectOperator's column from inspector
+ if (projectedSchema.columns().size() != distinctSelectedColumns.length) {
+   projectedSchema = tableSchema;
+ }
Review comment:
With test case `testSelectedColumnsOverlapJoin`, assume the mapper is
handling a split that belongs to table `default.orders`. The columns set in
`hive.io.file.readcolumn.names` are [order_id, customer_id], so the inspector
created for table `default.customers` contains only the column [customer_id].
When the selectOperator for table `default.customers` is initialized, the
field `first_name` cannot be found in the inspector we just created, so the
exception below is thrown:
```
cannot find field first_name from [org.apache.iceberg.mr.hive.serde.objectinspector.IcebergRecordObjectInspector$IcebergRecordStructField@a6807525]
at org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorUtils.getStandardStructFieldRef
```
The cause of this exception is that the schema returned by `Schema.select()`
is not the one we want here: it silently contains fewer columns than were
requested. Returning an inspector that contains all table columns is an easy
workaround for this issue.
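
To make the fallback easier to follow, here is a minimal, library-free sketch of the same logic. It is not the SerDe code itself: the class `ProjectionSketch` and its `select` helper are hypothetical, a schema is modeled as a plain list of column names, and `select` is assumed to drop requested names that the table does not have (which is the behavior the size check in the diff relies on). The column list for `default.customers` is limited to the two columns mentioned in this comment.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class ProjectionSketch {

  // Hypothetical stand-in for Schema.select(): keeps only the requested
  // names that exist in the table, in table order. Unknown names are
  // silently dropped (an assumption of this sketch).
  static List<String> select(List<String> tableColumns, Set<String> requested) {
    List<String> result = new ArrayList<>();
    for (String column : tableColumns) {
      if (requested.contains(column)) {
        result.add(column);
      }
    }
    return result;
  }

  // Mirrors the shape of the change under review: de-duplicate the selected
  // columns, project, and fall back to the full table schema when at least
  // one requested column was not found in this table.
  static List<String> project(List<String> tableColumns, String[] selectedColumns) {
    Set<String> distinct = new LinkedHashSet<>(Arrays.asList(selectedColumns));
    if (distinct.isEmpty()) {
      return tableColumns;
    }
    List<String> projected = select(tableColumns, distinct);
    // A size mismatch means the read columns belong (at least partly) to
    // another table, so the projection cannot be trusted.
    return projected.size() != distinct.size() ? tableColumns : projected;
  }

  public static void main(String[] args) {
    List<String> customers = Arrays.asList("customer_id", "first_name");
    // Read columns configured for a split of default.orders.
    String[] readColumns = {"order_id", "customer_id"};
    // prints [customer_id, first_name]
    System.out.println(project(customers, readColumns));
  }
}
```

With the read columns of `default.orders`, the projection of `default.customers` would contain only `customer_id`, so the size check triggers and the full column list is returned, which keeps `first_name` resolvable.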