[
https://issues.apache.org/jira/browse/DRILL-4194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Dechang Gu closed DRILL-4194.
-----------------------------
Verified with the HIVE test case in the Drill perf test framework. Comparing
this commit (bc74629) with the commit right before it (2953fe5), the planning
time improved by ~20x. Without the fix (2953fe5):
[root@ucs-node1 private-drill-perf-test-framework]# grep TOTAL log/49_2953fe5_HIVE_20160413_094226/*/*log
log/49_2953fe5_HIVE_20160413_094226/HIVE_expPlan_01/HIVE_expPlan_01.log:[STAT] TOTAL TIME : 185351 msec
log/49_2953fe5_HIVE_20160413_094226/HIVE_expPlan_01/HIVE_expPlan_01.log:[STAT] TOTAL TIME : 163891 msec
log/49_2953fe5_HIVE_20160413_094226/HIVE_expPlan_01/HIVE_expPlan_01.log:[STAT] TOTAL TIME : 165902 msec
log/49_2953fe5_HIVE_20160413_094226/HIVE_limit1_02/HIVE_limit1_02.log:[STAT] TOTAL TIME : 137410 msec
log/49_2953fe5_HIVE_20160413_094226/HIVE_limit1_02/HIVE_limit1_02.log:[STAT] TOTAL TIME : 165910 msec
log/49_2953fe5_HIVE_20160413_094226/HIVE_limit1_02/HIVE_limit1_02.log:[STAT] TOTAL TIME : 164624 msec
With this fix (bc74629):
[root@ucs-node1 private-drill-perf-test-framework]# grep TOTAL log/50_bc74629_HIVE_20160413_100625/*/*log
log/50_bc74629_HIVE_20160413_100625/HIVE_expPlan_01/HIVE_expPlan_01.log:[STAT] TOTAL TIME : 8094 msec
log/50_bc74629_HIVE_20160413_100625/HIVE_expPlan_01/HIVE_expPlan_01.log:[STAT] TOTAL TIME : 6801 msec
log/50_bc74629_HIVE_20160413_100625/HIVE_expPlan_01/HIVE_expPlan_01.log:[STAT] TOTAL TIME : 6166 msec
log/50_bc74629_HIVE_20160413_100625/HIVE_limit1_02/HIVE_limit1_02.log:[STAT] TOTAL TIME : 11229 msec
log/50_bc74629_HIVE_20160413_100625/HIVE_limit1_02/HIVE_limit1_02.log:[STAT] TOTAL TIME : 7533 msec
log/50_bc74629_HIVE_20160413_100625/HIVE_limit1_02/HIVE_limit1_02.log:[STAT] TOTAL TIME : 7764 msec
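For reference, the per-test speedup can be worked out from the TOTAL TIME values reported by the two grep runs. The following is only a minimal sketch: the class and method names are made up for illustration and are not part of the perf test framework, and the numbers are copied by hand from the log output.

```java
import java.util.Arrays;

public class PlanningSpeedup {
    // Mean of the TOTAL TIME samples, in milliseconds.
    static double mean(long... samples) {
        return Arrays.stream(samples).average().orElse(Double.NaN);
    }

    // Ratio of the mean time before the fix to the mean time after it.
    static double speedup(long[] before, long[] after) {
        return mean(before) / mean(after);
    }

    public static void main(String[] args) {
        // Values copied from the grep output for commits 2953fe5 and bc74629.
        long[] expPlanBefore = {185351, 163891, 165902};
        long[] expPlanAfter = {8094, 6801, 6166};
        long[] limit1Before = {137410, 165910, 164624};
        long[] limit1After = {11229, 7533, 7764};

        System.out.printf("HIVE_expPlan_01: %.1fx%n", speedup(expPlanBefore, expPlanAfter));
        System.out.printf("HIVE_limit1_02:  %.1fx%n", speedup(limit1Before, limit1After));
    }
}
```

Averaging the three runs per test smooths out run-to-run noise; both ratios come out in the same range as the ~20x estimate.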
> Improve the performance of metadata fetch operation in HiveScan
> ---------------------------------------------------------------
>
> Key: DRILL-4194
> URL: https://issues.apache.org/jira/browse/DRILL-4194
> Project: Apache Drill
> Issue Type: Bug
> Components: Storage - Hive
> Affects Versions: 1.4.0
> Reporter: Venki Korukanti
> Assignee: Venki Korukanti
> Fix For: 1.5.0
>
>
> Currently, HiveScan fetches the InputSplits for all partitions when
> {{HiveScan}} is created. This causes long delays when the table contains a
> large number of partitions. If we end up pruning the majority of partitions,
> this delay is unnecessary.
> We need the InputSplits info from the beginning of planning because:
> * It is used in calculating the cost of the {{HiveScan}}. Currently, when
> calculating the cost, we first look at the rowCount (from Hive MetaStore); if
> it is available, we use it in the cost calculation. Otherwise we estimate the
> rowCount from the InputSplits.
> * We also need the InputSplits to determine whether {{HiveScan}} is a
> singleton or distributed, so that the appropriate traits can be added in
> {{ScanPrule}}.
> The fix is to delay loading the InputSplits until we need them. If we end up
> fetching the InputSplits, store them until the query completes. There are two
> cases where we need them:
> * If the stats are not available, we need the InputSplits to estimate the
> rowCount.
> * If the partition is not pruned, we need them for parallelization purposes.
> Regarding getting the parallelization info in {{ScanPrule}}: Had a discussion
> with [~amansinha100]. All we need at this point is whether the data is
> distributed or singleton. Added a method {{isSingleton()}} to
> GroupScan. Returning {{false}} seems to work fine for HiveScan, but I am not
> sure of the implications here. We also have {{ExcessiveExchangeIdentifier}}
> which removes unnecessary exchanges by looking at the parallelization info. I
> think it is ok to return the parallelization info here as the pruning must
> have already completed.
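
The lazy-loading approach described in the issue can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code from the actual commit: `LazyScan`, `getSplits()`, `estimateRowCount()` and the `rowsPerSplit` parameter are hypothetical names, the splits are modelled as plain strings, and only `isSingleton()` returning `false` is taken from the description.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Supplier;

public class LazySplitsSketch {
    /** Hypothetical stand-in for a Hive scan; not Drill's actual HiveScan class. */
    static final class LazyScan {
        private final Long metastoreRowCount;         // null when stats are missing
        private final Supplier<List<String>> loader;  // the expensive split fetch
        private List<String> splits;                  // cached after the first load
        int loads = 0;                                // how many times the loader ran

        LazyScan(Long metastoreRowCount, Supplier<List<String>> loader) {
            this.metastoreRowCount = metastoreRowCount;
            this.loader = loader;
        }

        // Fetch the splits on first use, then reuse them until the query completes.
        List<String> getSplits() {
            if (splits == null) {
                splits = loader.get();
                loads++;
            }
            return splits;
        }

        // Prefer the MetaStore row count; fall back to an estimate from the splits.
        long estimateRowCount(long rowsPerSplit) {
            if (metastoreRowCount != null) {
                return metastoreRowCount;
            }
            return getSplits().size() * rowsPerSplit;
        }

        // Planning-time answer that does not touch the splits at all.
        boolean isSingleton() {
            return false;
        }
    }

    public static void main(String[] args) {
        Supplier<List<String>> fetch = () -> Arrays.asList("split-0", "split-1", "split-2");

        LazyScan withStats = new LazyScan(1000L, fetch);
        System.out.println(withStats.estimateRowCount(10) + " rows, loads=" + withStats.loads);

        LazyScan noStats = new LazyScan(null, fetch);
        System.out.println(noStats.estimateRowCount(10) + " rows, loads=" + noStats.loads);
    }
}
```

With stats available, the cost estimate never triggers the split fetch; without stats, the fetch happens once and the cached result is reused by later calls, for example when parallelization info is requested after pruning.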