[
https://issues.apache.org/jira/browse/DRILL-4194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Venki Korukanti resolved DRILL-4194.
------------------------------------
Resolution: Fixed
> Improve the performance of metadata fetch operation in HiveScan
> ---------------------------------------------------------------
>
> Key: DRILL-4194
> URL: https://issues.apache.org/jira/browse/DRILL-4194
> Project: Apache Drill
> Issue Type: Bug
> Components: Storage - Hive
> Affects Versions: 1.4.0
> Reporter: Venki Korukanti
> Assignee: Venki Korukanti
> Fix For: 1.5.0
>
>
> Currently, the InputSplits for all partitions are fetched when {{HiveScan}}
> is created. This causes long delays when the table contains a large number of
> partitions. If we end up pruning the majority of partitions, this delay is
> unnecessary.
> We need the InputSplits info from the beginning of planning because:
> * It is used in calculating the cost of the {{HiveScan}}. Currently, when
> calculating the cost, we first look at the rowCount (from the Hive MetaStore);
> if it is available, we use it in the cost calculation. Otherwise, we estimate
> the rowCount from the InputSplits.
> * We also need the InputSplits to determine whether {{HiveScan}} is a
> singleton or distributed, so that {{ScanPrule}} can add the appropriate traits.
> The fix is to delay loading the InputSplits until we need them. If we end up
> fetching the InputSplits, store them until the query completes. There are two
> cases where we need them:
> * If the stats are not available, we need the InputSplits to estimate the
> rowCount.
> * If the partition is not pruned, we need them for parallelization purposes.
> Regarding getting the parallelization info in {{ScanPrule}}: I had a discussion
> with [~amansinha100]. All we need at this point is whether the data is
> distributed or singleton. I added a method {{isSingleton()}} to
> GroupScan. Returning {{false}} seems to work fine for HiveScan, but I am not
> sure of the implications here. We also have {{ExcessiveExchangeIdentifier}},
> which removes unnecessary exchanges by looking at the parallelization info. I
> think it is OK to return the parallelization info here, as the pruning must
> have already completed.
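
The lazy-loading idea in the description can be sketched roughly as below. This is only an illustration under stated assumptions, not the actual Drill patch: the class LazySplitScanSketch, its method names, the representation of splits as byte lengths, and the ASSUMED_BYTES_PER_ROW constant are all hypothetical. It shows splits being loaded on first use and cached, the rowCount coming from stats when available, and isSingleton() answering without touching the splits.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Supplier;

// Hypothetical stand-in for a scan over a partitioned table; NOT Drill's HiveScan.
class LazySplitScanSketch {
    // Assumption for illustration only: rows are estimated as bytes / this value.
    static final long ASSUMED_BYTES_PER_ROW = 100L;

    private final long statsRowCount;                // metastore row count, or -1 if unavailable
    private final Supplier<List<Long>> splitLoader;  // expensive call; returns split lengths in bytes
    private List<Long> cachedSplits;                 // filled on first use, kept for the query's lifetime
    private int loadCount = 0;                       // how many times the expensive load actually ran

    LazySplitScanSketch(long statsRowCount, Supplier<List<Long>> splitLoader) {
        this.statsRowCount = statsRowCount;
        this.splitLoader = splitLoader;
    }

    // Loads the splits at most once and reuses them afterwards.
    List<Long> getSplits() {
        if (cachedSplits == null) {
            cachedSplits = splitLoader.get();
            loadCount++;
        }
        return cachedSplits;
    }

    // Case 1: use the stats if present; fall back to the splits only when they are missing.
    long estimateRowCount() {
        if (statsRowCount >= 0) {
            return statsRowCount;
        }
        long totalBytes = 0;
        for (long length : getSplits()) {
            totalBytes += length;
        }
        return totalBytes / ASSUMED_BYTES_PER_ROW;
    }

    // Early planning question answered without loading any splits.
    boolean isSingleton() {
        return false;
    }

    // Case 2: after pruning, the surviving splits drive parallelization.
    int maxParallelizationWidth() {
        return getSplits().size();
    }

    int getLoadCount() {
        return loadCount;
    }

    public static void main(String[] args) {
        Supplier<List<Long>> loader = () -> Arrays.asList(1000L, 2000L, 3000L);

        LazySplitScanSketch withStats = new LazySplitScanSketch(500L, loader);
        System.out.println(withStats.estimateRowCount()); // stats are used
        System.out.println(withStats.getLoadCount());     // splits were never loaded

        LazySplitScanSketch noStats = new LazySplitScanSketch(-1L, loader);
        System.out.println(noStats.estimateRowCount());        // estimated from the splits
        System.out.println(noStats.maxParallelizationWidth()); // reuses the cached splits
        System.out.println(noStats.getLoadCount());            // loaded only once
    }
}
```

In this sketch, a scan whose stats are available never pays for the split fetch during costing, and a scan that does fetch them pays only once, which is the behavior the description asks for.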