[ 
https://issues.apache.org/jira/browse/HIVE-13995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hari Sankar Sivarama Subramaniyan updated HIVE-13995:
-----------------------------------------------------
    Attachment: HIVE-13995.6.patch

> Hive generates inefficient metastore queries for TPCDS tables with 1800+ 
> partitions leading to higher compile time
> ------------------------------------------------------------------------------------------------------------------
>
>                 Key: HIVE-13995
>                 URL: https://issues.apache.org/jira/browse/HIVE-13995
>             Project: Hive
>          Issue Type: Bug
>          Components: Hive
>    Affects Versions: 2.2.0
>            Reporter: Nita Dembla
>            Assignee: Hari Sankar Sivarama Subramaniyan
>         Attachments: HIVE-13995.1.patch, HIVE-13995.2.patch, 
> HIVE-13995.3.patch, HIVE-13995.4.patch, HIVE-13995.5.patch, HIVE-13995.6.patch
>
>
> TPCDS fact tables (store_sales, catalog_sales) have 1800+ partitions and when 
> the query does not a filter on the partition column, metastore queries 
> generated have a large IN clause listing all the partition names. Most RDBMS 
> systems have issues optimizing large IN clause and even when a good index 
> plan is chosen , comparing to 1800+ string values will not lead to best 
> execution time.
> When all partitions are chosen, not specifying the partition list and having 
> filters only on table and column name will generate the same result set as 
> long as there are no concurrent modifications to partition list of the hive 
> table (adding/dropping partitions).
> For eg: For TPCDS query18, the metastore query gathering partition column 
> statistics runs in 0.5 secs in Mysql. Following is output from mysql log
> {noformat}
> -- Query_time: 0.482063  Lock_time: 0.003037 Rows_sent: 1836  Rows_examined: 
> 18360
> select count("COLUMN_NAME") from "PART_COL_STATS"
>  where "DB_NAME" = 'tpcds_bin_partitioned_orc_30000' and "TABLE_NAME" = 
> 'catalog_sales' 
>  and "COLUMN_NAME" in 
> ('cs_bill_customer_sk','cs_bill_cdemo_sk','cs_item_sk','cs_quantity','cs_list_price','cs_sales_price','cs_coupon_amt','cs_net_profit')
>  and "PARTITION_NAME" in 
> ('cs_sold_date_sk=2450815','cs_sold_date_sk=2450816','cs_sold_date_sk=2450817','cs_sold_date_sk=2450818','cs_sold_date_sk=2450819','cs_sold_date_sk=2450820','cs_sold_date_sk=2450821','cs_sold_date_sk=2450822','cs_sold_date_sk=2450823','cs_sold_date_sk=2450824','cs_sold_date_sk=2450825','cs_sold_date_sk=2450826','cs_sold_date_sk=2450827','cs_sold_date_sk=2450828','cs_sold_date_sk=2450829','cs_sold_date_sk=2450830','cs_sold_date_sk=2450831','cs_sold_date_sk=2450832','cs_sold_date_sk=2450833','cs_sold_date_sk=2450834','cs_sold_date_sk=2450835','cs_sold_date_sk=2450836','cs_sold_date_sk=2450837','cs_sold_date_sk=2450838','cs_sold_date_sk=2450839','cs_sold_date_sk=2450840','cs_sold_date_sk=2450841','cs_sold_date_sk=2450842','cs_sold_date_sk=2450843','cs_sold_date_sk=2450844','cs_sold_date_sk=2450845','cs_sold_date_sk=2450846','cs_sold_date_sk=2450847','cs_sold_date_sk=2450848','cs_sold_date_sk=2450849','cs_sold_date_sk=2450850','cs_sold_date_sk=2450851','cs_sold_date_sk=2450852','cs_sold_date_sk=2450853','cs_sold_date_sk=2450854','cs_sold_date_sk=2450855','cs_sold_date_sk=2450856',...,'cs_sold_date_sk=2452654')
>  group by "PARTITION_NAME";
> {noformat}
> Functionally equivalent query runs in 0.1 seconds
> {noformat}
> --Query_time: 0.121296  Lock_time: 0.000156 Rows_sent: 1836  Rows_examined: 
> 18360
> select count("COLUMN_NAME") from "PART_COL_STATS"
>  where "DB_NAME" = 'tpcds_bin_partitioned_orc_30000' and "TABLE_NAME" = 
> 'catalog_sales'  and "COLUMN_NAME" in 
> ('cs_bill_customer_sk','cs_bill_cdemo_sk','cs_item_sk','cs_quantity','cs_list_price','cs_sales_price','cs_coupon_amt','cs_net_profit')
>  group by "PARTITION_NAME";
> {noformat}
> If removing the partition list seems drastic, its also possible to simply 
> list the range since hive gets a ordered list of partition names. This 
> performs equally well as earlier query
> {noformat}
> # Query_time: 0.143874  Lock_time: 0.000154 Rows_sent: 1836  Rows_examined: 
> 18360
> SET timestamp=1464014881;
> select count("COLUMN_NAME") from "PART_COL_STATS" where "DB_NAME" = 
> 'tpcds_bin_partitioned_orc_30000' and "TABLE_NAME" = 'catalog_sales'  and 
> "COLUMN_NAME" in 
> ('cs_bill_customer_sk','cs_bill_cdemo_sk','cs_item_sk','cs_quantity','cs_list_price','cs_sales_price','cs_coupon_amt','cs_net_profit')
>   and "PARTITION_NAME" >= 'cs_sold_date_sk=2450815' and "PARTITION_NAME" <= 
> 'cs_sold_date_sk=2452654' 
> group by "PARTITION_NAME";
> {noformat}
> Another thing to check is the IN clause of column names. Columns in 
> projection list of hive query are mentioned here. Not sure if statistics of 
> these columns are required for hive query optimization.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to