Bharath Vissapragada has posted comments on this change. ( http://gerrit.cloudera.org:8080/12221 )
Change subject: [PROTOTYPE] IMPALA-5872: Test case builder for query planner
......................................................................
Patch Set 1:
The following example exports a testcase from an 8-node cluster and replays the
plan on my local mini cluster (1 node).
On 8-node cluster (tpcds1000 SF, q30)
[xxxxxxxx:21000] default> use tpcds_1000_parquet;
Query: use tpcds_1000_parquet
[xxxxxxxx:21000] tpcds_1000_parquet> export testcase into outfile
'hdfs:///tmp/' with customer_total_return as
(select wr_returning_customer_sk as ctr_customer_sk
,ca_state as ctr_state,
sum(wr_return_amt) as ctr_total_return
from web_returns
,date_dim
,customer_address
where wr_returned_date_sk = d_date_sk
and d_year =2002
and wr_returning_addr_sk = ca_address_sk
group by wr_returning_customer_sk
,ca_state)
select
c_customer_id,c_salutation,c_first_name,c_last_name,c_preferred_cust_flag
,c_birth_day,c_birth_month,c_birth_year,c_birth_country,c_login,c_email_address
,c_last_review_date,ctr_total_return
from customer_total_return ctr1
,customer_address
,customer
where ctr1.ctr_total_return > (select avg(ctr_total_return)*1.2
from customer_total_return ctr2
where ctr1.ctr_state = ctr2.ctr_state)
and ca_address_sk = c_current_addr_sk
and ca_state = 'IL'
and ctr1.ctr_customer_sk = c_customer_sk
order by
c_customer_id,c_salutation,c_first_name,c_last_name,c_preferred_cust_flag
,c_birth_day,c_birth_month,c_birth_year,c_birth_country,c_login,c_email_address
,c_last_review_date,ctr_total_return
limit 100;
Query: export testcase into outfile 'hdfs:///tmp/' with customer_total_return as
(select wr_returning_customer_sk as ctr_customer_sk
,ca_state as ctr_state,
sum(wr_return_amt) as ctr_total_return
from web_returns
,date_dim
,customer_address
where wr_returned_date_sk = d_date_sk
and d_year =2002
and wr_returning_addr_sk = ca_address_sk
group by wr_returning_customer_sk
,ca_state)
select
c_customer_id,c_salutation,c_first_name,c_last_name,c_preferred_cust_flag
,c_birth_day,c_birth_month,c_birth_year,c_birth_country,c_login,c_email_address
,c_last_review_date,ctr_total_return
from customer_total_return ctr1
,customer_address
,customer
where ctr1.ctr_total_return > (select avg(ctr_total_return)*1.2
from customer_total_return ctr2
where ctr1.ctr_state = ctr2.ctr_state)
and ca_address_sk = c_current_addr_sk
and ca_state = 'IL'
and ctr1.ctr_customer_sk = c_customer_sk
order by
c_customer_id,c_salutation,c_first_name,c_last_name,c_preferred_cust_flag
,c_birth_day,c_birth_month,c_birth_year,c_birth_country,c_login,c_email_address
,c_last_review_date,ctr_total_return
limit 100
Query submitted at: 2019-01-11 14:20:57 (Coordinator: http://xxxxxxxx:25000)
Query progress can be monitored at:
http://xxxxxxxx:25000/query_plan?query_id=a74384f0af33b8ca:ca1342c000000000
+-----------------------------------------------------------------------------------------------------+
| Test case data output path
|
+-----------------------------------------------------------------------------------------------------+
|
hdfs://xxxxxxxx:8020/tmp/impala-testcase-data-1dc27e8a-a293-48bb-b91e-8384550cebed
|
+-----------------------------------------------------------------------------------------------------+
WARNINGS: Impala does not have READ_WRITE access to path 'hdfs://xxxxxxxx:8020/'
Fetched 1 row(s) in 1.46s
[xxxxxxxx:21000] tpcds_1000_parquet>
[user@host ~]$ ls -lSh impala-testcase-data-1dc27e8a-a293-48bb-b91e-8384550cebed
-rw-r--r-- 1 user user 503K Jan 11 14:22
impala-testcase-data-1dc27e8a-a293-48bb-b91e-8384550cebed
Copy the file
hdfs://xxxxxxxx:8020/tmp/impala-testcase-data-1dc27e8a-a293-48bb-b91e-8384550cebed
into a target cluster.
-- Load the testcase on the local cluster. (the target cluster doesn't contain
these Dbs/tables, so they are imported)
[localhost:21000] default> load testcase from
'hdfs:///tmp/impala-testcase-data-1dc27e8a-a293-48bb-b91e-8384550cebed';
Query: load testcase from
'hdfs:///tmp/impala-testcase-data-1dc27e8a-a293-48bb-b91e-8384550cebed'
+----------------------------------------------+
| summary |
+----------------------------------------------+
| 1 db(s), 11 table(s) and 0 view(s) imported. |
+----------------------------------------------+
Fetched 1 row(s) in 1.07s
[localhost:21000] tpcds_1000_parquet> show tables in tpcds_1000_parquet;
Query: show tables in tpcds_1000_parquet
+------------------+
| name |
+------------------+
| customer |
| customer_address |
| date_dim |
| web_returns |
+------------------+
Fetched 4 row(s) in 0.00s
[localhost:21000] tpcds_1000_parquet>
[localhost:21000] tpcds_1000_parquet> set EXPLAIN_LEVEL=3;
EXPLAIN_LEVEL set to 3
[localhost:21000] tpcds_1000_parquet> set PLANNER_DEBUG_MODE=true;
PLANNER_DEBUG_MODE set to true
[localhost:21000] tpcds_1000_parquet> explain with customer_total_return as
(select wr_returning_customer_sk as ctr_customer_sk
,ca_state as ctr_state,
sum(wr_return_amt) as ctr_total_return
from web_returns
,date_dim
,customer_address
where wr_returned_date_sk = d_date_sk
and d_year =2002
and wr_returning_addr_sk = ca_address_sk
group by wr_returning_customer_sk
,ca_state)
select
c_customer_id,c_salutation,c_first_name,c_last_name,c_preferred_cust_flag
,c_birth_day,c_birth_month,c_birth_year,c_birth_country,c_login,c_email_address
,c_last_review_date,ctr_total_return
from customer_total_return ctr1
,customer_address
,customer
where ctr1.ctr_total_return > (select avg(ctr_total_return)*1.2
from customer_total_return ctr2
where ctr1.ctr_state = ctr2.ctr_state)
and ca_address_sk = c_current_addr_sk
and ca_state = 'IL'
and ctr1.ctr_customer_sk = c_customer_sk
order by
c_customer_id,c_salutation,c_first_name,c_last_name,c_preferred_cust_flag
,c_birth_day,c_birth_month,c_birth_year,c_birth_country,c_login,c_email_address
,c_last_review_date,ctr_total_return
limit 100;
here is a snippet of the explain plan
| F08:PLAN FRAGMENT [RANDOM] hosts=7 instances=7 <=====
| Per-Host Resources: mem-estimate=52.95MB mem-reservation=20.94MB
thread-reservation=2 runtime-filters-memory=3.00MB
|
| DATASTREAM SINK [FRAGMENT=F11, EXCHANGE=28, HASH(wr_returning_addr_sk)]
|
| | mem-estimate=0B mem-reservation=0B thread-reservation=0
|
| 11:HASH JOIN [INNER JOIN, BROADCAST]
|
| | hash predicates: wr_returned_date_sk = d_date_sk
|
| | fk/pk conjuncts: wr_returned_date_sk = d_date_sk
|
| | runtime filters: RF012[bloom] <- d_date_sk
|
| | mem-estimate=1.94MB mem-reservation=1.94MB spill-buffer=64.00KB
thread-reservation=0
|
| | tuple-ids=7,8 row-size=24B cardinality=12290653
|
| | in pipelines: 08(GETNEXT), 09(OPEN)
|
| |
|
| |--27:EXCHANGE [BROADCAST]
|
| | mem-estimate=16.00KB mem-reservation=0B thread-reservation=0
|
| | tuple-ids=8 row-size=8B cardinality=373
|
| | in pipelines: 09(GETNEXT)
|
| |
|
| 08:SCAN HDFS [tpcds_1000_parquet.web_returns, RANDOM]
|
| partitions=2185/2185 files=2185 size=6.50GB
|
| runtime filters: RF010[bloom] -> wr_returning_addr_sk, RF012[bloom] ->
wr_returned_date_sk
|
| stored statistics:
|
| table: rows=71997522 size=6.50GB
|
| partitions: 2185/2185 rows=71997522
|
| columns: all
|
| extrapolated-rows=disabled max-scan-range-rows=3239259
|
| mem-estimate=48.00MB mem-reservation=16.00MB thread-reservation=1
|
| tuple-ids=7 row-size=16B cardinality=71997522
|
| in pipelines: 08(GETNEXT)
|
|
|
| F09:PLAN FRAGMENT [RANDOM] hosts=1 instances=1
|
| Per-Host Resources: mem-estimate=32.00MB mem-reservation=512.00KB
thread-reservation=2
|
| DATASTREAM SINK [FRAGMENT=F08, EXCHANGE=27, BROADCAST]
|
| | mem-estimate=0B mem-reservation=0B thread-reservation=0
|
| 09:SCAN HDFS [tpcds_1000_parquet.date_dim, RANDOM]
|
| partitions=1/1 files=1 size=2.15MB
|
| predicates: d_year = CAST(2002 AS INT)
|
| stored statistics:
|
| table: rows=73049 size=2.15MB
|
| columns: all
|
| extrapolated-rows=disabled max-scan-range-rows=73049
|
| parquet statistics predicates: d_year = CAST(2002 AS INT)
|
| parquet dictionary predicates: d_year = CAST(2002 AS INT)
|
| mem-estimate=32.00MB mem-reservation=512.00KB thread-reservation=1
|
| tuple-ids=8 row-size=8B cardinality=373
|
| in pipelines: 09(GETNEXT)
|
|
|
| F10:PLAN FRAGMENT [RANDOM] hosts=7 instances=7 <====
|
| Per-Host Resources: mem-estimate=48.00MB mem-reservation=4.00MB
thread-reservation=2
|
| DATASTREAM SINK [FRAGMENT=F11, EXCHANGE=29, HASH(ca_address_sk)]
|
| | mem-estimate=0B mem-reservation=0B thread-reservation=0
|
| 10:SCAN HDFS [tpcds_1000_parquet.customer_address, RANDOM]
|
| partitions=1/1 files=7 size=111.28MB
|
| stored statistics:
|
| table: rows=6000000 size=111.28MB
|
| columns: all
|
| extrapolated-rows=disabled max-scan-range-rows=858070
|
| mem-estimate=48.00MB mem-reservation=4.00MB thread-reservation=1
|
| tuple-ids=9 row-size=18B cardinality=6000000
|
| in pipelines: 10(GETNEXT)
|
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
Fetched 353 row(s) in 0.39s
--
To view, visit http://gerrit.cloudera.org:8080/12221
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Iec83eeb2dc5136768b70ed581fb8d3ed0335cb52
Gerrit-Change-Number: 12221
Gerrit-PatchSet: 1
Gerrit-Owner: Bharath Vissapragada <[email protected]>
Gerrit-Reviewer: Bharath Vissapragada <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Comment-Date: Fri, 11 Jan 2019 22:33:41 +0000
Gerrit-HasComments: No
