qphien commented on a change in pull request #2052:
URL: https://github.com/apache/iceberg/pull/2052#discussion_r554938002



##########
File path: mr/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergSerDe.java
##########
@@ -82,7 +81,17 @@ public void initialize(@Nullable Configuration configuration, Properties serDePr
     }
 
     String[] selectedColumns = ColumnProjectionUtils.getReadColumnNames(configuration);
-    Schema projectedSchema = selectedColumns.length > 0 ? tableSchema.select(selectedColumns) : tableSchema;
+    // When same table is joined multiple times, it is possible some selected columns are duplicated,
+    // in this case wrong recordStructField position leads wrong value or ArrayIndexOutOfBoundException
+    String[] distinctSelectedColumns = Arrays.stream(selectedColumns).distinct().toArray(String[]::new);
+    Schema projectedSchema = distinctSelectedColumns.length > 0 ?
+            tableSchema.select(distinctSelectedColumns) : tableSchema;
+    // the input split mapper handles does not belong to this table
+    // it is necessary to ensure projectedSchema equals to tableSchema,
+    // or we cannot find selectOperator's column from inspector
+    if (projectedSchema.columns().size() != distinctSelectedColumns.length) {
+      projectedSchema = tableSchema;
+    }

Review comment:
   Take the test case `testSelectedColumnsOverlapJoin` and assume the mapper is
   handling a split that belongs to table `default.orders`. The columns set in
   `hive.io.file.readcolumn.names` are then [order_id, customer_id], so the
   inspector created for table `default.customers` contains only the column
   [customer_id]. When the selectOperator for table `default.customers` is
   initialized, the field `first_name` cannot be found in the inspector we just
   created, and the exception below is thrown:
   ```
   cannot find field first_name from [org.apache.iceberg.mr.hive.serde.objectinspector.IcebergRecordObjectInspector$IcebergRecordStructField@a6807525]
        at org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorUtils.getStandardStructFieldRef
   ```
   
   The cause of this exception is that the schema returned by `Schema.select()`
   is not the one we want here. Returning an inspector that contains all table
   columns is an easy workaround for this issue.
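
To make the fallback concrete, here is a minimal standalone sketch of the same logic using plain Java collections instead of Iceberg types. The class `ProjectionSketch` and its `select` helper are hypothetical stand-ins, not Iceberg API; the helper assumes, as described above, that selecting by name silently drops names the table does not have. The `total` column used in the usage note is also made up for illustration.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

final class ProjectionSketch {

  // Hypothetical stand-in for Schema.select(): keeps only the table columns
  // whose names were requested and ignores names the table does not contain.
  static List<String> select(List<String> tableColumns, Set<String> names) {
    return tableColumns.stream().filter(names::contains).collect(Collectors.toList());
  }

  // Mirrors the logic in the diff: de-duplicate the selected names, project,
  // and fall back to the full table schema when the projection is incomplete.
  static List<String> project(List<String> tableColumns, String[] selectedColumns) {
    Set<String> distinct = new LinkedHashSet<>(Arrays.asList(selectedColumns));
    if (distinct.isEmpty()) {
      return tableColumns;
    }
    List<String> projected = select(tableColumns, distinct);
    // A size mismatch means some selected names belong to another table,
    // so the split being handled is not from this table.
    return projected.size() == distinct.size() ? projected : tableColumns;
  }
}
```

With table columns [customer_id, first_name] and selected columns [order_id, customer_id], `project` returns the full column list, so `first_name` can still be resolved. With table columns [order_id, customer_id, total] and selected columns [order_id, customer_id, customer_id], it returns [order_id, customer_id], with the duplicate removed.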






