pvary commented on pull request #3392:
URL: https://github.com/apache/iceberg/pull/3392#issuecomment-961111912
The issue is reproducible with this:
```
@Test
public void testBug() {
  shell.setHiveSessionValue("hive.cbo.enable", true);
  String process_info = "CREATE TABLE process_info (\n" +
      " application_id STRING,\n" +
      " engine STRING,\n" +
      " node_type STRING,\n" +
      " bytes_read BIGINT,\n" +
      " bytes_write BIGINT,\n" +
      " fd_count_avg BIGINT,\n" +
      " thread_count_avg INT\n" +
      " )\n" +
      " STORED BY 'org.apache.iceberg.mr.hive.HiveIcebergStorageHandler'";
  shell.executeStatement(process_info);
  shell.executeStatement(
      "INSERT INTO process_info VALUES('APP-1','map', 'MR', 0, 0, 0, 0)");
  String mr_job_info = "CREATE TABLE mr_job_info (\n" +
      " application_id STRING,\n" +
      " queue STRING,\n" +
      " user_name STRING\n" +
      " )\n" +
      " STORED BY 'org.apache.iceberg.mr.hive.HiveIcebergStorageHandler'";
  shell.executeStatement(mr_job_info);
  shell.executeStatement(
      "INSERT INTO mr_job_info VALUES ('APP-1','QUEUE-1', 'openinx')");
  String query = "SELECT\n" +
      " j.user_name\n" +
      "FROM\n" +
      " process_info r\n" +
      " JOIN mr_job_info j ON \n" +
      " j.application_id = r.application_id\n";
  shell.executeStatement(query);
}
```
The explain plan is this:
```
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1
STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: r
            filterExpr: application_id is not null (type: boolean)
            Statistics: Num rows: 1 Data size: 2055 Basic stats: COMPLETE Column stats: NONE
            Filter Operator
              predicate: application_id is not null (type: boolean)
              Statistics: Num rows: 1 Data size: 2055 Basic stats: COMPLETE Column stats: NONE
              Select Operator
                expressions: application_id (type: string)
                outputColumnNames: _col0
                Statistics: Num rows: 1 Data size: 2055 Basic stats: COMPLETE Column stats: NONE
                Reduce Output Operator
                  key expressions: _col0 (type: string)
                  sort order: +
                  Map-reduce partition columns: _col0 (type: string)
                  Statistics: Num rows: 1 Data size: 2055 Basic stats: COMPLETE Column stats: NONE
          TableScan
            alias: j
            filterExpr: application_id is not null (type: boolean)
            Statistics: Num rows: 1 Data size: 1014 Basic stats: COMPLETE Column stats: NONE
            Filter Operator
              predicate: application_id is not null (type: boolean)
              Statistics: Num rows: 1 Data size: 1014 Basic stats: COMPLETE Column stats: NONE
              Select Operator
                expressions: application_id (type: string), user_name (type: string)
                outputColumnNames: _col0, _col1
                Statistics: Num rows: 1 Data size: 1014 Basic stats: COMPLETE Column stats: NONE
                Reduce Output Operator
                  key expressions: _col0 (type: string)
                  sort order: +
                  Map-reduce partition columns: _col0 (type: string)
                  Statistics: Num rows: 1 Data size: 1014 Basic stats: COMPLETE Column stats: NONE
                  value expressions: _col1 (type: string)
      Reduce Operator Tree:
        Join Operator
          condition map:
               Inner Join 0 to 1
          keys:
            0 _col0 (type: string)
            1 _col0 (type: string)
          outputColumnNames: _col2
          Statistics: Num rows: 1 Data size: 2260 Basic stats: COMPLETE Column stats: NONE
          Select Operator
            expressions: _col2 (type: string)
            outputColumnNames: _col0
            Statistics: Num rows: 1 Data size: 2260 Basic stats: COMPLETE Column stats: NONE
            File Output Operator
              compressed: false
              Statistics: Num rows: 1 Data size: 2260 Basic stats: COMPLETE Column stats: NONE
              table:
                  input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                  output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                  serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink
```
The issue is that two `TableScanOperator`s run in the same `Map` and share
the same `HiveConf`, which holds the column pruning information, so the
projection meant for one table can end up applied to the other. This is the
fix for Hive:
```
diff --git a/ql/src/java/org/apache/hadoop/hive/ql/exec/MapOperator.java b/ql/src/java/org/apache/hadoop/hive/ql/exec/MapOperator.java
index ea8e634484..c921cf68e1 100644
--- a/ql/src/java/org/apache/hadoop/hive/ql/exec/MapOperator.java
+++ b/ql/src/java/org/apache/hadoop/hive/ql/exec/MapOperator.java
@@ -369,10 +369,13 @@ else if (partRawRowObjectInspector.equals(tblRawRowObjectInspector)) {
           if (!tableNameToConf.containsKey(tableName)) {
             Configuration clonedConf = new Configuration(hconf);
             clonedConf.unset(ColumnProjectionUtils.READ_NESTED_COLUMN_PATH_CONF_STR);
+            clonedConf.unset(ColumnProjectionUtils.READ_COLUMN_NAMES_CONF_STR);
+            clonedConf.unset(ColumnProjectionUtils.READ_COLUMN_IDS_CONF_STR);
             tableNameToConf.put(tableName, clonedConf);
           }
           Configuration newConf = tableNameToConf.get(tableName);
-          ColumnProjectionUtils.appendNestedColumnPaths(newConf, nestedColumnPaths);
+          ColumnProjectionUtils.appendReadColumns(newConf, tableScanDesc.getNeededColumnIDs(),
+              tableScanDesc.getOutputColumnNames(), tableScanDesc.getNeededNestedColumnPaths());
         }
       }
```
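To illustrate the idea behind the patch, here is a minimal, self-contained
sketch. It is a simulation only: a plain `Map<String, String>` stands in for
Hadoop's `Configuration`, and `READ_COLUMNS`, `appendReadColumns`,
`sharedConf` and `perTableConf` are hypothetical names made up for this
example, not Hive's real constants or methods. It contrasts both scans
writing into one shared conf with cloning the conf per table, clearing the
inherited projection, and appending only that table's columns.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class PerTableProjectionSketch {
  // Hypothetical key for this sketch; NOT the value of Hive's real constant.
  static final String READ_COLUMNS = "sketch.read.column.names";

  // Appends column names to the comma-separated projection stored in conf.
  static void appendReadColumns(Map<String, String> conf, List<String> columns) {
    conf.merge(READ_COLUMNS, String.join(",", columns),
        (oldVal, newVal) -> oldVal + "," + newVal);
  }

  // Problem case: every scan appends its projection to the same conf.
  static Map<String, String> sharedConf(Map<String, List<String>> scans) {
    Map<String, String> conf = new HashMap<>();
    for (List<String> columns : scans.values()) {
      appendReadColumns(conf, columns);
    }
    return conf;
  }

  // Idea of the fix: clone the base conf per table, drop the inherited
  // projection, then append only the columns that table's scan needs.
  static Map<String, Map<String, String>> perTableConf(
      Map<String, String> base, Map<String, List<String>> scans) {
    Map<String, Map<String, String>> tableNameToConf = new HashMap<>();
    for (Map.Entry<String, List<String>> scan : scans.entrySet()) {
      Map<String, String> cloned = new HashMap<>(base);
      cloned.remove(READ_COLUMNS);
      appendReadColumns(cloned, scan.getValue());
      tableNameToConf.put(scan.getKey(), cloned);
    }
    return tableNameToConf;
  }

  public static void main(String[] args) {
    // Projections taken from the two TableScans in the plan above.
    Map<String, List<String>> scans = new LinkedHashMap<>();
    scans.put("process_info", List.of("application_id"));
    scans.put("mr_job_info", List.of("application_id", "user_name"));

    Map<String, String> shared = sharedConf(scans);
    System.out.println("shared: " + shared.get(READ_COLUMNS));

    Map<String, Map<String, String>> perTable = perTableConf(shared, scans);
    for (String table : scans.keySet()) {
      System.out.println(table + ": " + perTable.get(table).get(READ_COLUMNS));
    }
  }
}
```

In the shared case the single projection entry mixes the columns of both
tables; with per-table clones each table only sees its own columns, which is
what the two added `unset` calls plus `appendReadColumns` aim for in the
patch.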
I still have to think about how to get this fix into Hive, since MR is
deprecated there and the problem does not occur with Tez.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]