[
https://issues.apache.org/jira/browse/DRILL-4194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15057195#comment-15057195
]
ASF GitHub Bot commented on DRILL-4194:
---------------------------------------
GitHub user vkorukanti opened a pull request:
https://github.com/apache/drill/pull/301
DRILL-4194: Improve the performance of metadata fetch operation in HiveScan
+ Use the stats (numRows) stored in the Hive metastore, whenever available, to
calculate costs for planning purposes.
+ Delay the costly loading of InputSplits until they are needed. Once
InputSplits are loaded, cache them at the query level to speed up subsequent
access.
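
For readers unfamiliar with the pattern, the lazy loading and query-level caching described above can be sketched roughly as follows. This is only an illustration, not the code in this pull request: `LazySplitsSketch`, `Lazy`, `TableMetadata`, and the string-based "splits" are hypothetical stand-ins for Drill's and Hive's real types.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Supplier;

/** Hypothetical sketch of load-on-first-use with caching; not Drill's actual classes. */
public class LazySplitsSketch {

    /** Runs the loader on the first get() and returns the cached value afterwards. */
    static final class Lazy<T> implements Supplier<T> {
        private final Supplier<T> loader;
        private T value;
        private boolean loaded;

        Lazy(Supplier<T> loader) {
            this.loader = loader;
        }

        @Override
        public synchronized T get() {
            if (!loaded) {
                value = loader.get();
                loaded = true;
            }
            return value;
        }
    }

    /** Stand-in for a per-query metadata holder (assumed to live as long as the query). */
    static final class TableMetadata {
        int loadCount = 0; // counts how many times the expensive load actually ran
        private final Lazy<List<String>> splits;

        TableMetadata(List<String> partitions) {
            this.splits = new Lazy<>(() -> {
                loadCount++;
                // Placeholder for the expensive split computation per partition.
                List<String> result = new ArrayList<>();
                for (String partition : partitions) {
                    result.add(partition + "/split-0");
                }
                return result;
            });
        }

        List<String> getInputSplits() {
            return splits.get();
        }
    }

    public static void main(String[] args) {
        TableMetadata md = new TableMetadata(Arrays.asList("dt=2015-12-01", "dt=2015-12-02"));
        System.out.println("loads before first access: " + md.loadCount);
        md.getInputSplits();
        md.getInputSplits();
        System.out.println("loads after two accesses: " + md.loadCount);
    }
}
```

Constructing the holder does no work; the loader runs only when the splits are first requested, and later requests reuse the cached list.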
@jinfengni, @amansinha100 Could you please review the patch?
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/vkorukanti/drill hive_lazyload
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/drill/pull/301.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #301
----
commit 0b8654e472bad2c76bbd84f5f9bbe0cd521a3898
Author: vkorukanti <[email protected]>
Date: 2015-12-15T00:02:55Z
DRILL-4198: Enhance StoragePlugin interface to expose logical space rules
for planning purpose
Also move Hive partition pruning rules to logical storage plugin rulesets.
commit 48520c93744f0865731a763c8447a720f80842a0
Author: vkorukanti <[email protected]>
Date: 2015-12-11T19:36:11Z
DRILL-4194: Improve performance of the HiveScan metadata fetch operation
+ Use the stats (numRows) stored in Hive metastore whenever available to
calculate the costs for planning purpose
+ Delay the costly operation of loading of InputSplits until needed. When
InputSplits are loaded, cache them at query level to speedup subsequent
access.
----
> Improve the performance of metadata fetch operation in HiveScan
> ---------------------------------------------------------------
>
> Key: DRILL-4194
> URL: https://issues.apache.org/jira/browse/DRILL-4194
> Project: Apache Drill
> Issue Type: Bug
> Components: Storage - Hive
> Affects Versions: 1.4.0
> Reporter: Venki Korukanti
> Assignee: Venki Korukanti
> Fix For: 1.5.0
>
>
> Currently, {{HiveScan}} fetches the InputSplits for all partitions when it
> is created. This causes long delays when the table contains a large number of
> partitions. If we end up pruning the majority of the partitions, this delay is
> unnecessary.
> We need the InputSplits info from the beginning of planning because:
> * It is used in calculating the cost of the {{HiveScan}}. Currently, when
> calculating the cost, we first look at the rowCount (from the Hive metastore);
> if it is available, we use it in the cost calculation. Otherwise, we estimate
> the rowCount from the InputSplits.
> * We also need the InputSplits to determine whether {{HiveScan}} is a
> singleton or distributed, so that appropriate traits can be added in {{ScanPrule}}.
> The fix is to delay loading the InputSplits until they are needed. If we end
> up fetching the InputSplits, store them until the query completes. There are
> two cases where we need them:
> * If the stats are not available, we need the InputSplits for the cost estimate.
> * If the partition is not pruned, we need them for parallelization purposes.
> Regarding getting the parallelization info in {{ScanPrule}}: I had a discussion
> with [~amansinha100]. All we need at this point is whether the data is
> distributed or singleton. I added a method {{isSingleton()}} to
> GroupScan. Returning {{false}} seems to work fine for HiveScan, but I am not
> sure of the implications here. We also have {{ExcessiveExchangeIdentifier}},
> which removes unnecessary exchanges by looking at the parallelization info. I
> think it is OK to return the parallelization info here, as the pruning must
> have already completed.
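
The cost-calculation order described in the issue (metastore row count first, InputSplits only as a fallback) might look roughly like the sketch below. The names `RowCountSketch`, `estimateRowCount`, and `ASSUMED_BYTES_PER_ROW` are hypothetical, and treating a non-positive count as "stats unavailable" and dividing split bytes by a fixed row width are assumptions made for illustration, not Drill's actual estimation logic.

```java
import java.util.Arrays;
import java.util.List;
import java.util.OptionalLong;
import java.util.function.Supplier;

/** Hypothetical sketch of "stats first, splits as fallback"; not Drill's actual code. */
public class RowCountSketch {

    /** Assumed average row width in bytes; a placeholder value for this sketch only. */
    static final long ASSUMED_BYTES_PER_ROW = 100;

    /**
     * Returns the metastore row count when one is present and positive. Otherwise
     * invokes the supplier (the expensive step) and estimates from split lengths.
     */
    static long estimateRowCount(OptionalLong numRowsFromMetastore,
                                 Supplier<List<Long>> splitLengthsInBytes) {
        if (numRowsFromMetastore.isPresent() && numRowsFromMetastore.getAsLong() > 0) {
            // Stats are available, so the splits are never requested on this path.
            return numRowsFromMetastore.getAsLong();
        }
        long totalBytes = 0;
        for (long length : splitLengthsInBytes.get()) {
            totalBytes += length;
        }
        return Math.max(1, totalBytes / ASSUMED_BYTES_PER_ROW);
    }

    public static void main(String[] args) {
        Supplier<List<Long>> splits = () -> Arrays.asList(1000L, 2000L);
        System.out.println(estimateRowCount(OptionalLong.of(500), splits));
        System.out.println(estimateRowCount(OptionalLong.empty(), splits));
    }
}
```

Passing the splits as a supplier keeps the expensive load off the planning path whenever the metastore already has a usable row count.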
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)