[ 
https://issues.apache.org/jira/browse/HIVE-15339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15748718#comment-15748718
 ] 

Jesus Camacho Rodriguez commented on HIVE-15339:
------------------------------------------------

[~rajesh.balamohan], I have taken a look at the latest patch and I have a 
couple of comments.

* We might end up retrieving stats for more columns than we actually need, e.g, 
if folding leads to new pruning opportunities? If most of the time is spent in 
the round trip to gather the stats, that should still be fine. But is this 
always the case? Could this harm us for some queries?
* Further, your latest patch does the consolidation at column projection time 
and thus assumes that we have projected some columns from the table. If we bail 
out in L167 (no columns are projected), we will not consolidate the calls. This 
might introduce some more variability in the time to plan the query.

> Batch metastore calls to get column stats for fields needed in 
> FilterSelectivityEstimator
> -----------------------------------------------------------------------------------------
>
>                 Key: HIVE-15339
>                 URL: https://issues.apache.org/jira/browse/HIVE-15339
>             Project: Hive
>          Issue Type: Improvement
>            Reporter: Rajesh Balamohan
>            Priority: Minor
>         Attachments: HIVE-15339.1.patch, HIVE-15339.3.patch
>
>
> Based on query pattern, {{FilterSelectivityEstimator}} gets column statistics 
> from metastore in multiple calls. For instance, in the following query, it 
> ends up getting individual column statistics for for flights multiple number 
> of times.
> When the table has large number of partitions, getting statistics for columns 
> via multiple calls can be very expensive. This would adversely impact the 
> overall compilation time. The following query took 14 seconds to compile.
> {noformat}
> SELECT COUNT(`flights`.`flightnum`) AS `cnt_flightnum_ok`,
> YEAR(`flights`.`dateofflight`) AS `yr_flightdate_ok`
> FROM `flights` as `flights`
> JOIN `airlines` ON (`flights`.`uniquecarrier` = `airlines`.`code`)
> JOIN `airports` as `source_airport` ON (`flights`.`origin` = 
> `source_airport`.`iata`)
> JOIN `airports` as `dest_airport` ON (`flights`.`dest` = 
> `dest_airport`.`iata`)
> GROUP BY YEAR(`flights`.`dateofflight`);
> {noformat}
> It may be helpful to club all columns that need statistics and fetch these 
> details in single remote call.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to