Keith Sun created HIVE-18779:
--------------------------------
Summary: Hive does not enable ppd to underlying storage format by
default
Key: HIVE-18779
URL: https://issues.apache.org/jira/browse/HIVE-18779
Project: Hive
Issue Type: Bug
Components: Physical Optimizer, Query Planning
Affects Versions: 2.3.2, 1.2.2, 1.2.1
Environment: Hive 1.2.1; also checked the latest version, which still
has this issue.
Reporter: Keith Sun
*Issue:* Hive does not push predicates down (PPD) to the underlying storage
format by default, even with hive.optimize.ppd=true and
hive.optimize.ppd.storage=true and an InputFormat that supports filter pushdown.
*How to reproduce:*
{code:java}
CREATE TABLE MYDUAL (ID INT) stored as parquet;
insert overwrite table mydual ...
set hive.optimize.ppd=true;
set hive.optimize.ppd.storage=true;
explain select * from mydual where id =100;
//No filterExpr generated which will be utilized by Parquet InputFormat
STAGE PLANS:
Stage: Stage-0
Fetch Operator
limit: -1
Processor Tree:
TableScan
alias: mydual
Statistics: Num rows: 362 Data size: 362 Basic stats: COMPLETE Column stats: NONE
Filter Operator
predicate: (id = 100) (type: boolean)
Statistics: Num rows: 181 Data size: 181 Basic stats: COMPLETE Column stats: NONE
Select Operator
expressions: 100 (type: int)
//set hive.optimize.index.filter=true which is false by default.
//then we get the filterExpr which can be pushed down to parquet.
STAGE PLANS:
Stage: Stage-0
Fetch Operator
limit: -1
Processor Tree:
TableScan
alias: mydual
filterExpr: (id = 100) (type: boolean)
Statistics: Num rows: 362 Data size: 362 Basic stats: COMPLETE Column stats: NONE
Filter Operator
predicate: (id = 100) (type: boolean)
Statistics: Num rows: 181 Data size: 181 Basic stats: COMPLETE Column stats: NONE
Select Operator
expressions: 100 (type: int)
outputColumnNames: _col0
Statistics: Num rows: 181 Data size: 181 Basic stats: COMPLETE Column stats: NONE
ListSink
{code}
Looking at the code of org.apache.hadoop.hive.ql.ppd.OpProcFactory, I found
that the filterExpr is only generated in the plan when
hive.optimize.index.filter=true, so setting it works as a workaround. However,
this parameter is unrelated to the Parquet input format, since we do not have
an index at all.
{code:java}
private static ExprNodeGenericFuncDesc pushFilterToStorageHandler(
    TableScanOperator tableScanOp,
    ExprNodeGenericFuncDesc originalPredicate,
    OpWalkerInfo owi,
    HiveConf hiveConf) {

  TableScanDesc tableScanDesc = tableScanOp.getConf();
  Table tbl = tableScanDesc.getTableMetadata();
  if (HiveConf.getBoolVar(hiveConf, HiveConf.ConfVars.HIVEOPTINDEXFILTER)) {
    // attach the original predicate to the table scan operator for index
    // optimizations that require the pushed predicate before pcr & later
    // optimizations are applied
    tableScanDesc.setFilterExpr(originalPredicate);
  }
  if (!tbl.isNonNative()) {
    return originalPredicate;
  }
  HiveStorageHandler storageHandler = tbl.getStorageHandler();
  if (!(storageHandler instanceof HiveStoragePredicateHandler)) {
    // The storage handler does not provide predicate decomposition
    // support, so we'll implement the entire filter in Hive. However,
    // we still provide the full predicate to the storage handler in
    // case it wants to do any of its own prefiltering.
    tableScanDesc.setFilterExpr(originalPredicate);
    return originalPredicate;
  }
  HiveStoragePredicateHandler predicateHandler =
      (HiveStoragePredicateHandler) storageHandler;
  JobConf jobConf = new JobConf(owi.getParseContext().getConf());
  Utilities.setColumnNameList(jobConf, tableScanOp);
  Utilities.setColumnTypeList(jobConf, tableScanOp);
  Utilities.copyTableJobPropertiesToConf(
      Utilities.getTableDesc(tbl),
      jobConf);
{code}
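To make the branch structure of the excerpt above easier to follow, here is a minimal, self-contained model of it. It is only a sketch: it replaces the Hive classes with plain booleans, the class and method names (FilterExprGateModel, filterExprAttached) are hypothetical, and it covers only the part of the method shown in the excerpt. It assumes, as the report implies, that a table stored as Parquet takes the native-table branch.

```java
// Simplified model of the control flow in the quoted excerpt.
// Hypothetical names; no Hive classes are used.
final class FilterExprGateModel {

    /**
     * Returns true if setFilterExpr(...) would be reached within the
     * portion of pushFilterToStorageHandler quoted in the report.
     */
    static boolean filterExprAttached(boolean indexFilterEnabled,
                                      boolean nonNativeTable,
                                      boolean handlerDecomposesPredicates) {
        boolean attached = false;
        if (indexFilterEnabled) {
            // First branch: HIVEOPTINDEXFILTER is on.
            attached = true;
        }
        if (!nonNativeTable) {
            // Native tables return here, so nothing else can attach the predicate.
            return attached;
        }
        if (!handlerDecomposesPredicates) {
            // Handler is not a HiveStoragePredicateHandler: the full predicate is still attached.
            return true;
        }
        // The excerpt is cut off before this path completes; only the shown part is modelled.
        return attached;
    }

    public static void main(String[] args) {
        System.out.println("native table, index.filter=false: "
            + filterExprAttached(false, false, false));
        System.out.println("native table, index.filter=true:  "
            + filterExprAttached(true, false, false));
        System.out.println("non-native, plain handler:        "
            + filterExprAttached(false, true, false));
    }
}
```

In this model, the only way a native table ends up with a filterExpr is the index-filter flag, which matches the two plans shown earlier.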
As far as I can tell, the "getFilterExpr" method of TableScanDesc is called in
the places shown below.
If Hive always sets the filterExpr, it should not cause trouble (chime in if I
am wrong).
I could then propose a pull request.
!image-2018-02-22-20-23-07-732.png!
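As a rough illustration of what a change could look like, the sketch below compares the gate from the quoted code with two alternatives. Everything in it is hypothetical and is not a patch against Hive: the class ProposedGateSketch, its methods, and the use of a plain Map in place of HiveConf are assumptions made for illustration. The "always attach" variant mirrors the suggestion above; the variant that also honours hive.optimize.ppd.storage is an additional assumption, not something proposed in this report. Unset keys are treated as false purely for simplicity, which does not model Hive's real defaults.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical comparison of gating conditions for attaching filterExpr.
final class ProposedGateSketch {
    static final String INDEX_FILTER = "hive.optimize.index.filter";
    static final String PPD_STORAGE = "hive.optimize.ppd.storage";

    // Reads a boolean setting; unset keys count as false in this sketch only.
    private static boolean flag(Map<String, String> conf, String key) {
        return Boolean.parseBoolean(conf.getOrDefault(key, "false"));
    }

    // Gate as in the quoted code: only the index-filter setting matters.
    static boolean currentGate(Map<String, String> conf) {
        return flag(conf, INDEX_FILTER);
    }

    // Variant suggested in the report: attach the predicate unconditionally.
    static boolean alwaysAttachGate(Map<String, String> conf) {
        return true;
    }

    // Narrower variant (assumption): also honour the storage pushdown setting.
    static boolean storageAwareGate(Map<String, String> conf) {
        return flag(conf, INDEX_FILTER) || flag(conf, PPD_STORAGE);
    }

    public static void main(String[] args) {
        // Settings from the reproduction steps, with index.filter left off.
        Map<String, String> conf = new HashMap<>();
        conf.put("hive.optimize.ppd", "true");
        conf.put(PPD_STORAGE, "true");

        System.out.println("current gate:       " + currentGate(conf));
        System.out.println("always attach:      " + alwaysAttachGate(conf));
        System.out.println("storage-aware gate: " + storageAwareGate(conf));
    }
}
```

Which variant is appropriate depends on whether every caller of getFilterExpr tolerates a predicate being present when index filtering is off, which is the open question raised above.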