[GitHub] [iceberg] liubo1022126 opened a new pull request #2614: MR: Fix selectedColumns not belong to current Map (#2567)

GitBox Wed, 19 May 2021 08:10:34 -0700


liubo1022126 opened a new pull request #2614:
URL: https://github.com/apache/iceberg/pull/2614



   Issue: https://github.com/apache/iceberg/issues/2567
   
   Run the following Hive SQL in the Hive shell, where table A is left-joined 
with table B:
   
   > select * from 
   (select * from ta)p1
   left join 
   (select id,name,age from tb) p2 
   on p1.id=p2.id limit 10; 
   
   Regardless of whether table A and table B are in Iceberg format, when the 
table on the right side holds a relatively large amount of data, some map 
operators fail to initialize.
   
   I found that the code `String[] selectedColumns = 
ColumnProjectionUtils.getReadColumnNames(configuration)` in class 
HiveIcebergSerDe reads the selectedColumns value from hconf via 
hive.io.file.readcolumn.names, but that value sometimes does not correspond to 
the current map.
   
   Someone may have noticed this problem before, because the code already 
contains the following comment and check:
   >  // the input split mapper handles does not belong to this table
       // it is necessary to ensure projectedSchema equals to tableSchema,
       // or we cannot find selectOperator's column from inspector
       if (projectedSchema.columns().size() != distinctSelectedColumns.length) {
         projectedSchema = tableSchema;
       }
   
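   But this is not enough in some cases, e.g. when table ta also has the 
columns [name] and [age], which are the selected columns of table tb. The 
selected names then all resolve against ta's schema, so the size check passes 
even though the selection belongs to tb.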
   
   While debugging I noticed that, when the above situation occurs, 
`serDeProperties.getProperty("columns")` corresponds to the schema columns of 
the current map, while `configuration.get("schema.evolution.columns")` 
corresponds to the schema columns of another map. So I compare the two to 
verify that the configuration belongs to the current map, and with this check 
the query runs correctly.
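
   The comparison described above can be sketched roughly as follows. This is 
only an illustration, not the code in this pull request: the class and method 
names are hypothetical, a plain `Map` stands in for the Hadoop `Configuration`, 
and falling back to all table columns on a mismatch is an assumption.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Properties;

// Hypothetical helper; a Map stands in for the Hadoop Configuration.
final class SelectedColumnsCheck {

  // Splits a comma-separated column list into trimmed, lower-cased names.
  static List<String> parseColumns(String csv) {
    List<String> names = new ArrayList<>();
    if (csv == null) {
      return names;
    }
    for (String part : csv.split(",")) {
      String name = part.trim().toLowerCase(Locale.ROOT);
      if (!name.isEmpty()) {
        names.add(name);
      }
    }
    return names;
  }

  // True when the job-level column list matches the columns of the table
  // this SerDe instance was initialized for.
  static boolean confDescribesCurrentTable(Properties serDeProperties, Map<String, String> conf) {
    List<String> tableColumns = parseColumns(serDeProperties.getProperty("columns"));
    List<String> confColumns = parseColumns(conf.get("schema.evolution.columns"));
    return !tableColumns.isEmpty() && tableColumns.equals(confColumns);
  }

  // Uses the projected columns only when they belong to the current table;
  // otherwise falls back to all table columns (assumed fallback).
  static List<String> selectedColumnsFor(Properties serDeProperties, Map<String, String> conf) {
    if (confDescribesCurrentTable(serDeProperties, conf)) {
      return parseColumns(conf.get("hive.io.file.readcolumn.names"));
    }
    return parseColumns(serDeProperties.getProperty("columns"));
  }

  public static void main(String[] args) {
    // Table ta has more columns than the projection made for tb.
    Properties taProps = new Properties();
    taProps.setProperty("columns", "id,name,age,city");

    // The configuration still carries tb's schema and projection.
    Map<String, String> conf = new HashMap<>();
    conf.put("schema.evolution.columns", "id,name,age");
    conf.put("hive.io.file.readcolumn.names", "id,name,age");

    System.out.println(selectedColumnsFor(taProps, conf));
  }
}
```

   In the example, the projection `id,name,age` would resolve against ta's 
schema by name, so only the schema comparison reveals that it was made for tb.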
   
   **But I'm not sure whether this is sufficient. Can someone please help 
check?**
   
   -------------------------
   
   I also found another way to fix this problem, which I think is the best 
one, but it also requires a code change in Hive.
   
   In Hive, in class org.apache.hadoop.hive.ql.exec.MapOperator, we can get the 
needed columns from `((TableScanOperator) 
conf.getAliasToWork().get(alias)).getConf().getNeededColumns()` and set them in 
hconf under a property px inside `public void setChildren(Configuration hconf)`.
   
   Then, in class org.apache.iceberg.mr.hive.HiveIcebergSerDe in Iceberg, we 
can read the needed columns from property px correctly.
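
   A rough sketch of that hand-off, under stated assumptions: `px` is kept as 
the placeholder property name from the text above, a plain `Map` stands in for 
hconf, and the class and method names are hypothetical. The list passed to 
`publish` is meant to be whatever `getNeededColumns()` returns for the current 
alias; neither Hive nor Iceberg contains this code today.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; a Map stands in for the Hadoop Configuration (hconf).
final class NeededColumnsProperty {

  // Placeholder property name taken from the description.
  static final String PX = "px";

  // Hive side: store the needed columns of the current table scan.
  static void publish(Map<String, String> hconf, List<String> neededColumns) {
    hconf.put(PX, String.join(",", neededColumns));
  }

  // Iceberg side: read the needed columns back; empty when the property is unset.
  static List<String> read(Map<String, String> hconf) {
    String value = hconf.get(PX);
    if (value == null || value.isEmpty()) {
      return new ArrayList<>();
    }
    return new ArrayList<>(Arrays.asList(value.split(",")));
  }

  public static void main(String[] args) {
    Map<String, String> hconf = new HashMap<>();
    publish(hconf, Arrays.asList("id", "name", "age"));
    System.out.println(read(hconf));
  }
}
```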
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]


