[ 
https://issues.apache.org/jira/browse/HIVE-15339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15750688#comment-15750688
 ] 

Rajesh Balamohan commented on HIVE-15339:
-----------------------------------------

Thank you for the comments [~jcamachorodriguez].

If the col stats are already present in cache, these calls would be very fast. 
Otherwise, it ends up fetching col stats per column and multiple queries to DB 
is slowing things down especially with larger set of partitions. I checked by 
running tpc-ds dataset and haven't observed any regression with this. I will 
check with more queries.

If it bails out in L167, it would be fine still as it would follow the old code 
path. That is, it wouldn't be worse than the current runtime.

> Batch metastore calls to get column stats for fields needed in 
> FilterSelectivityEstimator
> -----------------------------------------------------------------------------------------
>
>                 Key: HIVE-15339
>                 URL: https://issues.apache.org/jira/browse/HIVE-15339
>             Project: Hive
>          Issue Type: Improvement
>            Reporter: Rajesh Balamohan
>            Priority: Minor
>         Attachments: HIVE-15339.1.patch, HIVE-15339.3.patch
>
>
> Based on query pattern, {{FilterSelectivityEstimator}} gets column statistics 
> from metastore in multiple calls. For instance, in the following query, it 
> ends up getting individual column statistics for for flights multiple number 
> of times.
> When the table has large number of partitions, getting statistics for columns 
> via multiple calls can be very expensive. This would adversely impact the 
> overall compilation time. The following query took 14 seconds to compile.
> {noformat}
> SELECT COUNT(`flights`.`flightnum`) AS `cnt_flightnum_ok`,
> YEAR(`flights`.`dateofflight`) AS `yr_flightdate_ok`
> FROM `flights` as `flights`
> JOIN `airlines` ON (`flights`.`uniquecarrier` = `airlines`.`code`)
> JOIN `airports` as `source_airport` ON (`flights`.`origin` = 
> `source_airport`.`iata`)
> JOIN `airports` as `dest_airport` ON (`flights`.`dest` = 
> `dest_airport`.`iata`)
> GROUP BY YEAR(`flights`.`dateofflight`);
> {noformat}
> It may be helpful to club all columns that need statistics and fetch these 
> details in single remote call.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to