[jira] [Commented] (DRILL-4194) Improve the performance of metadata fetch operation in HiveScan

ASF GitHub Bot (JIRA) Mon, 14 Dec 2015 18:17:02 -0800

    [ 
https://issues.apache.org/jira/browse/DRILL-4194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15057195#comment-15057195
 ]


ASF GitHub Bot commented on DRILL-4194:
---------------------------------------

GitHub user vkorukanti opened a pull request:

    https://github.com/apache/drill/pull/301

    DRILL-4194: Improve the performance of metadata fetch operation in HiveScan

    + Use the stats (numRows) stored in Hive metastore whenever available to
      calculate the costs for planning purpose
    + Delay the costly operation of loading InputSplits until needed. When
      InputSplits are loaded, cache them at query level to speed up subsequent
      access.
    
    @jinfengni, @amansinha100  Could you please review the patch?

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/vkorukanti/drill hive_lazyload

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/drill/pull/301.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #301
    
----
commit 0b8654e472bad2c76bbd84f5f9bbe0cd521a3898
Author: vkorukanti <[email protected]>
Date:   2015-12-15T00:02:55Z

    DRILL-4198: Enhance StoragePlugin interface to expose logical space rules for planning purpose
    
    Also move Hive partition pruning rules to logical storage plugin rulesets.

commit 48520c93744f0865731a763c8447a720f80842a0
Author: vkorukanti <[email protected]>
Date:   2015-12-11T19:36:11Z

    DRILL-4194: Improve performance of the HiveScan metadata fetch operation
    
    + Use the stats (numRows) stored in Hive metastore whenever available to
      calculate the costs for planning purpose
    + Delay the costly operation of loading InputSplits until needed. When
      InputSplits are loaded, cache them at query level to speed up subsequent
      access.

----


> Improve the performance of metadata fetch operation in HiveScan
> ---------------------------------------------------------------
>
>                 Key: DRILL-4194
>                 URL: https://issues.apache.org/jira/browse/DRILL-4194
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Storage - Hive
>    Affects Versions: 1.4.0
>            Reporter: Venki Korukanti
>            Assignee: Venki Korukanti
>             Fix For: 1.5.0
>
>
> Currently, the InputSplits for all partitions are fetched when {{HiveScan}} 
> is created. This causes long delays when the table contains a large number of 
> partitions. If we end up pruning the majority of partitions, this delay is 
> unnecessary.
> We need the InputSplits info from the beginning of planning because
>  * it is used in calculating the cost of the {{HiveScan}}. Currently, when 
> calculating the cost, we first look at the rowCount (from the Hive MetaStore); 
> if it is available, we use it in the cost calculation. Otherwise we estimate 
> the rowCount from the InputSplits. 
>  * we also need the InputSplits to determine whether {{HiveScan}} is a 
> singleton or distributed, so that {{ScanPrule}} can add the appropriate traits.
> The fix is to delay the loading of the InputSplits until we need them. If we 
> end up fetching the InputSplits, store them until the query completes. There 
> are two cases where we need them:
>  * If the stats are not available, we need the InputSplits to estimate the 
> rowCount.
>  * If the partition is not pruned, we need them for parallelization purposes.
> Regarding getting the parallelization info in {{ScanPrule}}: Had a discussion 
> with [~amansinha100]. All we need at this point is whether the data is 
> distributed or singleton. Added a method {{isSingleton()}} to 
> GroupScan. Returning {{false}} seems to work fine for HiveScan, but I am not 
> sure of the implications here. We also have {{ExcessiveExchangeIdentifier}}, 
> which removes unnecessary exchanges by looking at the parallelization info. I 
> think it is ok to return the parallelization info there, as the pruning must 
> have already completed by then.
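
The approach described above (use the metastore row count when it is there,
otherwise load the InputSplits once and keep them for the rest of the query)
can be sketched roughly as follows. This is only an illustration and not the
code in the pull request: the class LazySplitScan, its methods, the use of
plain strings in place of real InputSplits, and the rows-per-split constant
are all hypothetical names and values made up for this example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.function.Supplier;

/** Hypothetical stand-in for a scan over a partitioned table; NOT Drill's HiveScan. */
final class LazySplitScan {
    /** Assumed rows per split, used only when the metastore has no row count. */
    static final long ASSUMED_ROWS_PER_SPLIT = 1_000L;

    private final long metastoreRowCount;             // negative means "stats not available"
    private final Supplier<List<String>> splitLoader; // the costly fetch, run at most once
    private List<String> cachedSplits;                // filled on first use, kept afterwards
    private int loadCount;                            // how many times the costly fetch ran

    LazySplitScan(long metastoreRowCount, Supplier<List<String>> splitLoader) {
        this.metastoreRowCount = metastoreRowCount;
        this.splitLoader = splitLoader;
    }

    /** Loads the splits on first call and returns the cached list on later calls. */
    synchronized List<String> getSplits() {
        if (cachedSplits == null) {
            cachedSplits = Collections.unmodifiableList(new ArrayList<>(splitLoader.get()));
            loadCount++;
        }
        return cachedSplits;
    }

    /** Prefers the metastore row count; falls back to an estimate from the splits. */
    long estimateRowCount() {
        if (metastoreRowCount >= 0) {
            return metastoreRowCount; // no split loading needed for costing
        }
        return getSplits().size() * ASSUMED_ROWS_PER_SPLIT;
    }

    synchronized int getLoadCount() {
        return loadCount;
    }
}
```

With stats present, estimateRowCount() never triggers the loader, so a query
whose partitions are mostly pruned does not pay for splits it will not read.
Without stats, the first call loads the splits and later calls to getSplits()
reuse the cached list.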



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
