[ https://issues.apache.org/jira/browse/HIVE-18779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Keith Sun updated HIVE-18779:
-----------------------------
Description:

*Issue:* Hive does not enable PPD to the underlying storage format by default, even with hive.optimize.ppd=true and hive.optimize.ppd.storage=true set and an inputFormat that supports filter push down.

*How to reproduce:*
{code:java}
CREATE TABLE MYDUAL (ID INT) stored as parquet;
insert overwrite table mydual ...
set hive.optimize.ppd=true;
set hive.optimize.ppd.storage=true;
explain select * from mydual where id = 100;

//No filterExpr is generated which would be utilized by the Parquet InputFormat
STAGE PLANS:
  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        TableScan
          alias: mydual
          Statistics: Num rows: 362 Data size: 362 Basic stats: COMPLETE Column stats: NONE
          Filter Operator
            predicate: (id = 100) (type: boolean)
            Statistics: Num rows: 181 Data size: 181 Basic stats: COMPLETE Column stats: NONE
            Select Operator
              expressions: 100 (type: int)

//set hive.optimize.index.filter=true, which is false by default;
//then we get the filterExpr which can be pushed down to Parquet:
STAGE PLANS:
  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        TableScan
          alias: mydual
          filterExpr: (id = 100) (type: boolean)
          Statistics: Num rows: 362 Data size: 362 Basic stats: COMPLETE Column stats: NONE
          Filter Operator
            predicate: (id = 100) (type: boolean)
            Statistics: Num rows: 181 Data size: 181 Basic stats: COMPLETE Column stats: NONE
            Select Operator
              expressions: 100 (type: int)
              outputColumnNames: _col0
              Statistics: Num rows: 181 Data size: 181 Basic stats: COMPLETE Column stats: NONE
              ListSink
{code}
By checking the code of org.apache.hadoop.hive.ql.ppd.OpProcFactory, I found that to get the filterExpr into the plan we have to set hive.optimize.index.filter=true as a workaround, but this parameter is unrelated to the Parquet input format, since we do not have an index at all.
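For reference, the workaround described above amounts to the following session settings (these are just the properties from the reproduction above, collected in one place):

{code:sql}
-- enable predicate pushdown to the storage layer
set hive.optimize.ppd=true;
set hive.optimize.ppd.storage=true;
-- workaround: currently also required for the filterExpr to reach the
-- Parquet InputFormat, even though no index is involved
set hive.optimize.index.filter=true;

explain select * from mydual where id = 100;
{code}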
{code:java}
private static ExprNodeGenericFuncDesc pushFilterToStorageHandler(
    TableScanOperator tableScanOp,
    ExprNodeGenericFuncDesc originalPredicate,
    OpWalkerInfo owi,
    HiveConf hiveConf) {

  TableScanDesc tableScanDesc = tableScanOp.getConf();
  Table tbl = tableScanDesc.getTableMetadata();
  if (HiveConf.getBoolVar(hiveConf, HiveConf.ConfVars.HIVEOPTINDEXFILTER)) {
    // attach the original predicate to the table scan operator for index
    // optimizations that require the pushed predicate before pcr & later
    // optimizations are applied
    tableScanDesc.setFilterExpr(originalPredicate);
  }
  if (!tbl.isNonNative()) {
    return originalPredicate;
  }
  HiveStorageHandler storageHandler = tbl.getStorageHandler();
  if (!(storageHandler instanceof HiveStoragePredicateHandler)) {
    // The storage handler does not provide predicate decomposition
    // support, so we'll implement the entire filter in Hive. However,
    // we still provide the full predicate to the storage handler in
    // case it wants to do any of its own prefiltering.
    tableScanDesc.setFilterExpr(originalPredicate);
    return originalPredicate;
  }
  HiveStoragePredicateHandler predicateHandler =
      (HiveStoragePredicateHandler) storageHandler;
  JobConf jobConf = new JobConf(owi.getParseContext().getConf());
  Utilities.setColumnNameList(jobConf, tableScanOp);
  Utilities.setColumnTypeList(jobConf, tableScanOp);
  Utilities.copyTableJobPropertiesToConf(
      Utilities.getTableDesc(tbl),
      jobConf);
{code}
From what I checked, the "getFilterExpr" method of TableScanDesc is called in the places below, and if Hive always sets the filterExpr it should not cause trouble (chime in if I am wrong).

!image-2018-02-22-20-29-41-589.png!

I could propose a pull request then.

!image-2018-02-22-20-23-07-732.png!
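A minimal sketch of the change proposed above (untested, and assuming no caller of getFilterExpr depends on the predicate being absent when hive.optimize.index.filter is off): the setFilterExpr call could run unconditionally instead of only under the HIVEOPTINDEXFILTER guard.

{code:java}
// Sketch only, not a tested patch: always attach the predicate so that
// format-level PPD (e.g. Parquet) can see the filterExpr, rather than
// guarding the call behind hive.optimize.index.filter.
TableScanDesc tableScanDesc = tableScanOp.getConf();
tableScanDesc.setFilterExpr(originalPredicate);
{code}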
> Hive does not enable ppd to underlying storage format by default
> ----------------------------------------------------------------
>
>                 Key: HIVE-18779
>                 URL: https://issues.apache.org/jira/browse/HIVE-18779
>             Project: Hive
>          Issue Type: Bug
>          Components: Physical Optimizer, Query Planning
>    Affects Versions: 1.2.1, 1.2.2, 2.3.2
>         Environment: Hive 1.2.1; also checked the latest version, which still has this issue.
>            Reporter: Keith Sun
>            Priority: Major
>         Attachments: image-2018-02-22-20-29-41-589.png

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)