[GitHub] [iceberg] pvary commented on pull request #3392: Hive: Bug when runing SQL with multiple table join.

GitBox Thu, 04 Nov 2021 07:57:21 -0700


pvary commented on pull request #3392:
URL: https://github.com/apache/iceberg/pull/3392#issuecomment-961111912



   The issue is reproducible with this:
   ```
     @Test
     public void testBug() {
       shell.setHiveSessionValue("hive.cbo.enable", true);
   
       String process_info = "CREATE TABLE process_info (\n" +
               "   application_id STRING,\n" +
               "   engine STRING,\n" +
               "   node_type STRING,\n" +
               "   bytes_read BIGINT,\n" +
               "   bytes_write BIGINT,\n" +
               "   fd_count_avg BIGINT,\n" +
               "   thread_count_avg INT\n" +
               " )\n" +
               " STORED BY 'org.apache.iceberg.mr.hive.HiveIcebergStorageHandler'";
       shell.executeStatement(process_info);
       shell.executeStatement(
               "INSERT INTO process_info VALUES('APP-1','map', 'MR', 0, 0, 0, 0)");
   
       String mr_job_info = "CREATE TABLE mr_job_info (\n" +
               "   application_id STRING,\n" +
               "   queue STRING,\n" +
               "   user_name STRING\n" +
               " )\n" +
               " STORED BY 'org.apache.iceberg.mr.hive.HiveIcebergStorageHandler'";
       shell.executeStatement(mr_job_info);
       shell.executeStatement("INSERT INTO mr_job_info VALUES ('APP-1','QUEUE-1', 'openinx')");
   
       String query = "SELECT\n" +
               "    j.user_name\n" +
               "FROM\n" +
               "    process_info r\n" +
               "    JOIN mr_job_info j ON \n" +
               "        j.application_id = r.application_id\n";
       shell.executeStatement(query);
     }
   ```
   
   The explain plan is this:
   ```
   
   STAGE DEPENDENCIES:
     Stage-1 is a root stage
     Stage-0 depends on stages: Stage-1
   
   STAGE PLANS:
     Stage: Stage-1
       Map Reduce
         Map Operator Tree:
             TableScan
               alias: r
               filterExpr: application_id is not null (type: boolean)
               Statistics: Num rows: 1 Data size: 2055 Basic stats: COMPLETE Column stats: NONE
               Filter Operator
                 predicate: application_id is not null (type: boolean)
                 Statistics: Num rows: 1 Data size: 2055 Basic stats: COMPLETE Column stats: NONE
                 Select Operator
                   expressions: application_id (type: string)
                   outputColumnNames: _col0
                   Statistics: Num rows: 1 Data size: 2055 Basic stats: COMPLETE Column stats: NONE
                   Reduce Output Operator
                     key expressions: _col0 (type: string)
                     sort order: +
                     Map-reduce partition columns: _col0 (type: string)
                     Statistics: Num rows: 1 Data size: 2055 Basic stats: COMPLETE Column stats: NONE
             TableScan
               alias: j
               filterExpr: application_id is not null (type: boolean)
               Statistics: Num rows: 1 Data size: 1014 Basic stats: COMPLETE Column stats: NONE
               Filter Operator
                 predicate: application_id is not null (type: boolean)
                 Statistics: Num rows: 1 Data size: 1014 Basic stats: COMPLETE Column stats: NONE
                 Select Operator
                   expressions: application_id (type: string), user_name (type: string)
                   outputColumnNames: _col0, _col1
                   Statistics: Num rows: 1 Data size: 1014 Basic stats: COMPLETE Column stats: NONE
                   Reduce Output Operator
                     key expressions: _col0 (type: string)
                     sort order: +
                     Map-reduce partition columns: _col0 (type: string)
                     Statistics: Num rows: 1 Data size: 1014 Basic stats: COMPLETE Column stats: NONE
                     value expressions: _col1 (type: string)
         Reduce Operator Tree:
           Join Operator
             condition map:
                  Inner Join 0 to 1
             keys:
               0 _col0 (type: string)
               1 _col0 (type: string)
             outputColumnNames: _col2
             Statistics: Num rows: 1 Data size: 2260 Basic stats: COMPLETE Column stats: NONE
             Select Operator
               expressions: _col2 (type: string)
               outputColumnNames: _col0
               Statistics: Num rows: 1 Data size: 2260 Basic stats: COMPLETE Column stats: NONE
               File Output Operator
                 compressed: false
                 Statistics: Num rows: 1 Data size: 2260 Basic stats: COMPLETE Column stats: NONE
                 table:
                     input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                     output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                     serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
   
     Stage: Stage-0
       Fetch Operator
         limit: -1
         Processor Tree:
           ListSink
   ```
   
   The issue is that two `TableScanOperator`s run in the same `Map` and share the same HiveConf, which contains the column pruning information. This is the fix for Hive:
   ```
   diff --git a/ql/src/java/org/apache/hadoop/hive/ql/exec/MapOperator.java b/ql/src/java/org/apache/hadoop/hive/ql/exec/MapOperator.java
   index ea8e634484..c921cf68e1 100644
   --- a/ql/src/java/org/apache/hadoop/hive/ql/exec/MapOperator.java
   +++ b/ql/src/java/org/apache/hadoop/hive/ql/exec/MapOperator.java
   @@ -369,10 +369,13 @@ else if (partRawRowObjectInspector.equals(tblRawRowObjectInspector)) {
            if (!tableNameToConf.containsKey(tableName)) {
              Configuration clonedConf = new Configuration(hconf);
              clonedConf.unset(ColumnProjectionUtils.READ_NESTED_COLUMN_PATH_CONF_STR);
   +          clonedConf.unset(ColumnProjectionUtils.READ_COLUMN_NAMES_CONF_STR);
   +          clonedConf.unset(ColumnProjectionUtils.READ_COLUMN_IDS_CONF_STR);
              tableNameToConf.put(tableName, clonedConf);
            }
            Configuration newConf = tableNameToConf.get(tableName);
   -        ColumnProjectionUtils.appendNestedColumnPaths(newConf, nestedColumnPaths);
   +        ColumnProjectionUtils.appendReadColumns(newConf, tableScanDesc.getNeededColumnIDs(),
   +            tableScanDesc.getOutputColumnNames(), tableScanDesc.getNeededNestedColumnPaths());
          }
        }
   ```
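
The idea behind the patch (clone the configuration once per table, clear any inherited projection entry, then append only the columns that table's scan needs) can be illustrated with a minimal, self-contained Java sketch. It does not use Hive or Hadoop classes: a plain `Map<String, String>` stands in for `Configuration`, and the class name `PerTableProjectionSketch`, the key `sketch.read.columns`, and both helper methods are hypothetical names made up for this illustration, not Hive APIs.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch only: a Map<String, String> plays the role of a Hadoop Configuration.
class PerTableProjectionSketch {
  // Hypothetical key standing in for the real column-projection settings.
  static final String READ_COLUMNS = "sketch.read.columns";

  // Appends the given columns to the projection entry of one configuration.
  static void appendReadColumns(Map<String, String> conf, List<String> columns) {
    String joined = String.join(",", columns);
    String existing = conf.get(READ_COLUMNS);
    conf.put(READ_COLUMNS,
        existing == null || existing.isEmpty() ? joined : existing + "," + joined);
  }

  // Builds one configuration per table: clone the base configuration, drop any
  // projection it already carries, then add only that table's needed columns.
  static Map<String, Map<String, String>> perTableConfs(
      Map<String, String> base, Map<String, List<String>> neededColumns) {
    Map<String, Map<String, String>> tableNameToConf = new LinkedHashMap<>();
    for (Map.Entry<String, List<String>> scan : neededColumns.entrySet()) {
      Map<String, String> conf = tableNameToConf.computeIfAbsent(scan.getKey(), name -> {
        Map<String, String> cloned = new HashMap<>(base);
        cloned.remove(READ_COLUMNS); // do not inherit another scan's projection
        return cloned;
      });
      appendReadColumns(conf, scan.getValue());
    }
    return tableNameToConf;
  }

  public static void main(String[] args) {
    // The shared base configuration already carries a projection from elsewhere.
    Map<String, String> base = new HashMap<>();
    base.put(READ_COLUMNS, "stale_column");

    // Columns each scan needs, taken from the Select Operators in the plan above.
    Map<String, List<String>> needed = new LinkedHashMap<>();
    needed.put("process_info", Arrays.asList("application_id"));
    needed.put("mr_job_info", Arrays.asList("application_id", "user_name"));

    perTableConfs(base, needed).forEach((table, conf) ->
        System.out.println(table + " -> " + conf.get(READ_COLUMNS)));
  }
}
```

In this sketch each table ends up with its own projection entry and the shared base map is left untouched, which is the property the patch is after for the two scans in the same `Map`.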
   
   I still have to think about how to push the fix to Hive, as MR is deprecated there and the problem does not occur with Tez.




