[
https://issues.apache.org/jira/browse/HIVE-13995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Hari Sankar Sivarama Subramaniyan updated HIVE-13995:
-----------------------------------------------------
Status: Open (was: Patch Available)
> Hive generates inefficient metastore queries for TPCDS tables with 1800+
> partitions leading to higher compile time
> ------------------------------------------------------------------------------------------------------------------
>
> Key: HIVE-13995
> URL: https://issues.apache.org/jira/browse/HIVE-13995
> Project: Hive
> Issue Type: Bug
> Components: Hive
> Affects Versions: 2.2.0
> Reporter: Nita Dembla
> Assignee: Hari Sankar Sivarama Subramaniyan
> Attachments: HIVE-13995.1.patch, HIVE-13995.2.patch,
> HIVE-13995.3.patch, HIVE-13995.4.patch, HIVE-13995.5.patch
>
>
> TPCDS fact tables (store_sales, catalog_sales) have 1800+ partitions and when
> the query does not a filter on the partition column, metastore queries
> generated have a large IN clause listing all the partition names. Most RDBMS
> systems have issues optimizing large IN clause and even when a good index
> plan is chosen , comparing to 1800+ string values will not lead to best
> execution time.
> When all partitions are chosen, not specifying the partition list and having
> filters only on table and column name will generate the same result set as
> long as there are no concurrent modifications to partition list of the hive
> table (adding/dropping partitions).
> For eg: For TPCDS query18, the metastore query gathering partition column
> statistics runs in 0.5 secs in Mysql. Following is output from mysql log
> {noformat}
> -- Query_time: 0.482063 Lock_time: 0.003037 Rows_sent: 1836 Rows_examined:
> 18360
> select count("COLUMN_NAME") from "PART_COL_STATS"
> where "DB_NAME" = 'tpcds_bin_partitioned_orc_30000' and "TABLE_NAME" =
> 'catalog_sales'
> and "COLUMN_NAME" in
> ('cs_bill_customer_sk','cs_bill_cdemo_sk','cs_item_sk','cs_quantity','cs_list_price','cs_sales_price','cs_coupon_amt','cs_net_profit')
> and "PARTITION_NAME" in
> ('cs_sold_date_sk=2450815','cs_sold_date_sk=2450816','cs_sold_date_sk=2450817','cs_sold_date_sk=2450818','cs_sold_date_sk=2450819','cs_sold_date_sk=2450820','cs_sold_date_sk=2450821','cs_sold_date_sk=2450822','cs_sold_date_sk=2450823','cs_sold_date_sk=2450824','cs_sold_date_sk=2450825','cs_sold_date_sk=2450826','cs_sold_date_sk=2450827','cs_sold_date_sk=2450828','cs_sold_date_sk=2450829','cs_sold_date_sk=2450830','cs_sold_date_sk=2450831','cs_sold_date_sk=2450832','cs_sold_date_sk=2450833','cs_sold_date_sk=2450834','cs_sold_date_sk=2450835','cs_sold_date_sk=2450836','cs_sold_date_sk=2450837','cs_sold_date_sk=2450838','cs_sold_date_sk=2450839','cs_sold_date_sk=2450840','cs_sold_date_sk=2450841','cs_sold_date_sk=2450842','cs_sold_date_sk=2450843','cs_sold_date_sk=2450844','cs_sold_date_sk=2450845','cs_sold_date_sk=2450846','cs_sold_date_sk=2450847','cs_sold_date_sk=2450848','cs_sold_date_sk=2450849','cs_sold_date_sk=2450850','cs_sold_date_sk=2450851','cs_sold_date_sk=2450852','cs_sold_date_sk=2450853','cs_sold_date_sk=2450854','cs_sold_date_sk=2450855','cs_sold_date_sk=2450856',...,'cs_sold_date_sk=2452654')
> group by "PARTITION_NAME";
> {noformat}
> Functionally equivalent query runs in 0.1 seconds
> {noformat}
> --Query_time: 0.121296 Lock_time: 0.000156 Rows_sent: 1836 Rows_examined:
> 18360
> select count("COLUMN_NAME") from "PART_COL_STATS"
> where "DB_NAME" = 'tpcds_bin_partitioned_orc_30000' and "TABLE_NAME" =
> 'catalog_sales' and "COLUMN_NAME" in
> ('cs_bill_customer_sk','cs_bill_cdemo_sk','cs_item_sk','cs_quantity','cs_list_price','cs_sales_price','cs_coupon_amt','cs_net_profit')
> group by "PARTITION_NAME";
> {noformat}
> If removing the partition list seems drastic, its also possible to simply
> list the range since hive gets a ordered list of partition names. This
> performs equally well as earlier query
> {noformat}
> # Query_time: 0.143874 Lock_time: 0.000154 Rows_sent: 1836 Rows_examined:
> 18360
> SET timestamp=1464014881;
> select count("COLUMN_NAME") from "PART_COL_STATS" where "DB_NAME" =
> 'tpcds_bin_partitioned_orc_30000' and "TABLE_NAME" = 'catalog_sales' and
> "COLUMN_NAME" in
> ('cs_bill_customer_sk','cs_bill_cdemo_sk','cs_item_sk','cs_quantity','cs_list_price','cs_sales_price','cs_coupon_amt','cs_net_profit')
> and "PARTITION_NAME" >= 'cs_sold_date_sk=2450815' and "PARTITION_NAME" <=
> 'cs_sold_date_sk=2452654'
> group by "PARTITION_NAME";
> {noformat}
> Another thing to check is the IN clause of column names. Columns in
> projection list of hive query are mentioned here. Not sure if statistics of
> these columns are required for hive query optimization.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)