[
https://issues.apache.org/jira/browse/HIVE-13995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15388723#comment-15388723
]
Hive QA commented on HIVE-13995:
--------------------------------
Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12819476/HIVE-13995.7.patch
{color:green}SUCCESS:{color} +1 due to 6 test(s) being added or modified.
{color:red}ERROR:{color} -1 due to 20 failed/errored test(s), 10341 tests
executed
*Failed tests:*
{noformat}
TestMsgBusConnection - did not produce a TEST-*.xml file
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_acid_globallimit
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_annotate_stats_part
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_autoColumnStats_1
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_autoColumnStats_2
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_extrapolate_part_stats_partial
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_extrapolate_part_stats_partial_ndv
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_index_auto_mult_tables
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_index_bitmap_auto_partitioned
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_stats_list_bucket
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_subquery_multiinsert
org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver_acid_globallimit
org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver_explainuser_1
org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver_explainuser_2
org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver_mapjoin_mapjoin
org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver_tez_join_hash
org.apache.hadoop.hive.llap.daemon.impl.TestLlapTokenChecker.testCheckPermissions
org.apache.hadoop.hive.llap.daemon.impl.TestLlapTokenChecker.testGetToken
org.apache.hadoop.hive.metastore.TestMetaStoreMetrics.testConnections
org.apache.hadoop.hive.thrift.TestHadoopAuthBridge23.testSaslWithHiveMetaStore
{noformat}
Test results:
https://builds.apache.org/job/PreCommit-HIVE-MASTER-Build/593/testReport
Console output:
https://builds.apache.org/job/PreCommit-HIVE-MASTER-Build/593/console
Test logs:
http://ec2-204-236-174-241.us-west-1.compute.amazonaws.com/logs/PreCommit-HIVE-MASTER-Build-593/
Messages:
{noformat}
Executing org.apache.hive.ptest.execution.TestCheckPhase
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 20 tests failed
{noformat}
This message is automatically generated.
ATTACHMENT ID: 12819476 - PreCommit-HIVE-MASTER-Build
> Hive generates inefficient metastore queries for TPCDS tables with 1800+
> partitions leading to higher compile time
> ------------------------------------------------------------------------------------------------------------------
>
> Key: HIVE-13995
> URL: https://issues.apache.org/jira/browse/HIVE-13995
> Project: Hive
> Issue Type: Bug
> Components: Hive
> Affects Versions: 2.2.0
> Reporter: Nita Dembla
> Assignee: Hari Sankar Sivarama Subramaniyan
> Attachments: HIVE-13995.1.patch, HIVE-13995.2.patch,
> HIVE-13995.3.patch, HIVE-13995.4.patch, HIVE-13995.5.patch,
> HIVE-13995.6.patch, HIVE-13995.7.patch
>
>
> TPCDS fact tables (store_sales, catalog_sales) have 1800+ partitions and when
> the query does not a filter on the partition column, metastore queries
> generated have a large IN clause listing all the partition names. Most RDBMS
> systems have issues optimizing large IN clause and even when a good index
> plan is chosen , comparing to 1800+ string values will not lead to best
> execution time.
> When all partitions are chosen, not specifying the partition list and having
> filters only on table and column name will generate the same result set as
> long as there are no concurrent modifications to partition list of the hive
> table (adding/dropping partitions).
> For eg: For TPCDS query18, the metastore query gathering partition column
> statistics runs in 0.5 secs in Mysql. Following is output from mysql log
> {noformat}
> -- Query_time: 0.482063 Lock_time: 0.003037 Rows_sent: 1836 Rows_examined:
> 18360
> select count("COLUMN_NAME") from "PART_COL_STATS"
> where "DB_NAME" = 'tpcds_bin_partitioned_orc_30000' and "TABLE_NAME" =
> 'catalog_sales'
> and "COLUMN_NAME" in
> ('cs_bill_customer_sk','cs_bill_cdemo_sk','cs_item_sk','cs_quantity','cs_list_price','cs_sales_price','cs_coupon_amt','cs_net_profit')
> and "PARTITION_NAME" in
> ('cs_sold_date_sk=2450815','cs_sold_date_sk=2450816','cs_sold_date_sk=2450817','cs_sold_date_sk=2450818','cs_sold_date_sk=2450819','cs_sold_date_sk=2450820','cs_sold_date_sk=2450821','cs_sold_date_sk=2450822','cs_sold_date_sk=2450823','cs_sold_date_sk=2450824','cs_sold_date_sk=2450825','cs_sold_date_sk=2450826','cs_sold_date_sk=2450827','cs_sold_date_sk=2450828','cs_sold_date_sk=2450829','cs_sold_date_sk=2450830','cs_sold_date_sk=2450831','cs_sold_date_sk=2450832','cs_sold_date_sk=2450833','cs_sold_date_sk=2450834','cs_sold_date_sk=2450835','cs_sold_date_sk=2450836','cs_sold_date_sk=2450837','cs_sold_date_sk=2450838','cs_sold_date_sk=2450839','cs_sold_date_sk=2450840','cs_sold_date_sk=2450841','cs_sold_date_sk=2450842','cs_sold_date_sk=2450843','cs_sold_date_sk=2450844','cs_sold_date_sk=2450845','cs_sold_date_sk=2450846','cs_sold_date_sk=2450847','cs_sold_date_sk=2450848','cs_sold_date_sk=2450849','cs_sold_date_sk=2450850','cs_sold_date_sk=2450851','cs_sold_date_sk=2450852','cs_sold_date_sk=2450853','cs_sold_date_sk=2450854','cs_sold_date_sk=2450855','cs_sold_date_sk=2450856',...,'cs_sold_date_sk=2452654')
> group by "PARTITION_NAME";
> {noformat}
> Functionally equivalent query runs in 0.1 seconds
> {noformat}
> --Query_time: 0.121296 Lock_time: 0.000156 Rows_sent: 1836 Rows_examined:
> 18360
> select count("COLUMN_NAME") from "PART_COL_STATS"
> where "DB_NAME" = 'tpcds_bin_partitioned_orc_30000' and "TABLE_NAME" =
> 'catalog_sales' and "COLUMN_NAME" in
> ('cs_bill_customer_sk','cs_bill_cdemo_sk','cs_item_sk','cs_quantity','cs_list_price','cs_sales_price','cs_coupon_amt','cs_net_profit')
> group by "PARTITION_NAME";
> {noformat}
> If removing the partition list seems drastic, its also possible to simply
> list the range since hive gets a ordered list of partition names. This
> performs equally well as earlier query
> {noformat}
> # Query_time: 0.143874 Lock_time: 0.000154 Rows_sent: 1836 Rows_examined:
> 18360
> SET timestamp=1464014881;
> select count("COLUMN_NAME") from "PART_COL_STATS" where "DB_NAME" =
> 'tpcds_bin_partitioned_orc_30000' and "TABLE_NAME" = 'catalog_sales' and
> "COLUMN_NAME" in
> ('cs_bill_customer_sk','cs_bill_cdemo_sk','cs_item_sk','cs_quantity','cs_list_price','cs_sales_price','cs_coupon_amt','cs_net_profit')
> and "PARTITION_NAME" >= 'cs_sold_date_sk=2450815' and "PARTITION_NAME" <=
> 'cs_sold_date_sk=2452654'
> group by "PARTITION_NAME";
> {noformat}
> Another thing to check is the IN clause of column names. Columns in
> projection list of hive query are mentioned here. Not sure if statistics of
> these columns are required for hive query optimization.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)