Aman Sinha created IMPALA-10179:
-----------------------------------

             Summary: After inverting a join's inputs the join's parallelism does not get reset
                 Key: IMPALA-10179
                 URL: https://issues.apache.org/jira/browse/IMPALA-10179
             Project: IMPALA
          Issue Type: Bug
          Components: Frontend
    Affects Versions: Impala 3.4.0
            Reporter: Aman Sinha
            Assignee: Aman Sinha


In the following query, the left semi join gets flipped to a right semi join 
because of the cardinalities of the two tables, but the parallelism of the 
HashJoin fragment (see Fragment F01) remains at hosts=1, instances=1. The 
correct behavior is to inherit the parallelism of the new probe input, 
store_sales, so F01 should have hosts=3, instances=3 to avoid 
under-parallelizing the HashJoin.

{noformat}
[localhost:21000] default> set explain_level=2;
EXPLAIN_LEVEL set to 2
[localhost:21000] default> use tpcds_parquet;
Query: use tpcds_parquet
[localhost:21000] tpcds_parquet> explain select count(*) from store_returns 
where sr_customer_sk in (select ss_customer_sk from store_sales);
Query: explain select count(*) from store_returns where sr_customer_sk in 
(select ss_customer_sk from store_sales)
Max Per-Host Resource Reservation: Memory=10.31MB Threads=6
Per-Host Resource Estimates: Memory=85MB
Analyzed query: SELECT count(*) FROM tpcds_parquet.store_returns LEFT SEMI JOIN
(SELECT ss_customer_sk FROM tpcds_parquet.store_sales) `$a$1` (`$c$1`) ON
sr_customer_sk = `$a$1`.`$c$1`
""
F03:PLAN FRAGMENT [UNPARTITIONED] hosts=1 instances=1
|  Per-Host Resources: mem-estimate=10.02MB mem-reservation=0B 
thread-reservation=1
PLAN-ROOT SINK
|  output exprs: count(*)
|  mem-estimate=0B mem-reservation=0B thread-reservation=0
|
09:AGGREGATE [FINALIZE]
|  output: count:merge(*)
|  mem-estimate=10.00MB mem-reservation=0B spill-buffer=2.00MB 
thread-reservation=0
|  tuple-ids=3 row-size=8B cardinality=1
|  in pipelines: 09(GETNEXT), 04(OPEN)
|
08:EXCHANGE [UNPARTITIONED]
|  mem-estimate=16.00KB mem-reservation=0B thread-reservation=0
|  tuple-ids=3 row-size=8B cardinality=1
|  in pipelines: 04(GETNEXT)
|
F01:PLAN FRAGMENT [HASH(tpcds_parquet.store_sales.ss_customer_sk)] hosts=1 
instances=1
Per-Host Resources: mem-estimate=23.88MB mem-reservation=5.81MB 
thread-reservation=1 runtime-filters-memory=1.00MB
04:AGGREGATE
|  output: count(*)
|  mem-estimate=10.00MB mem-reservation=0B spill-buffer=2.00MB 
thread-reservation=0
|  tuple-ids=3 row-size=8B cardinality=1
|  in pipelines: 04(GETNEXT), 06(OPEN)
|
03:HASH JOIN [RIGHT SEMI JOIN, PARTITIONED]
|  hash predicates: tpcds_parquet.store_sales.ss_customer_sk = sr_customer_sk
|  runtime filters: RF000[bloom] <- sr_customer_sk
|  mem-estimate=2.88MB mem-reservation=2.88MB spill-buffer=128.00KB 
thread-reservation=0
|  tuple-ids=0 row-size=4B cardinality=287.51K
|  in pipelines: 06(GETNEXT), 00(OPEN)
|
|--07:EXCHANGE [HASH(sr_customer_sk)]
|  |  mem-estimate=1.10MB mem-reservation=0B thread-reservation=0
|  |  tuple-ids=0 row-size=4B cardinality=287.51K
|  |  in pipelines: 00(GETNEXT)
|  |
|  F02:PLAN FRAGMENT [RANDOM] hosts=1 instances=1
|  Per-Host Resources: mem-estimate=24.00MB mem-reservation=1.00MB 
thread-reservation=2
|  00:SCAN HDFS [tpcds_parquet.store_returns, RANDOM]
|     HDFS partitions=1/1 files=1 size=15.43MB
|     stored statistics:
|       table: rows=287.51K size=15.43MB
|       columns: all
|     extrapolated-rows=disabled max-scan-range-rows=287.51K
|     mem-estimate=24.00MB mem-reservation=1.00MB thread-reservation=1
|     tuple-ids=0 row-size=4B cardinality=287.51K
|     in pipelines: 00(GETNEXT)
|
06:AGGREGATE [FINALIZE]
|  group by: tpcds_parquet.store_sales.ss_customer_sk
|  mem-estimate=10.00MB mem-reservation=1.94MB spill-buffer=64.00KB 
thread-reservation=0
|  tuple-ids=6 row-size=4B cardinality=90.63K
|  in pipelines: 06(GETNEXT), 01(OPEN)
|
05:EXCHANGE [HASH(tpcds_parquet.store_sales.ss_customer_sk)]
|  mem-estimate=142.01KB mem-reservation=0B thread-reservation=0
|  tuple-ids=6 row-size=4B cardinality=90.63K
|  in pipelines: 01(GETNEXT)
|
F00:PLAN FRAGMENT [RANDOM] hosts=3 instances=3
Per-Host Resources: mem-estimate=27.00MB mem-reservation=3.50MB 
thread-reservation=2 runtime-filters-memory=1.00MB
02:AGGREGATE [STREAMING]
|  group by: tpcds_parquet.store_sales.ss_customer_sk
|  mem-estimate=10.00MB mem-reservation=2.00MB spill-buffer=64.00KB 
thread-reservation=0
|  tuple-ids=6 row-size=4B cardinality=90.63K
|  in pipelines: 01(GETNEXT)
|
01:SCAN HDFS [tpcds_parquet.store_sales, RANDOM]
   HDFS partitions=1824/1824 files=1824 size=200.95MB
   runtime filters: RF000[bloom] -> tpcds_parquet.store_sales.ss_customer_sk
   stored statistics:
     table: rows=2.88M size=200.95MB
     partitions: 1824/1824 rows=2.88M
     columns: all
   extrapolated-rows=disabled max-scan-range-rows=130.09K
   mem-estimate=16.00MB mem-reservation=512.00KB thread-reservation=1
   tuple-ids=1 row-size=4B cardinality=2.88M
   in pipelines: 01(GETNEXT)
{noformat}

The same behavior is seen for inner joins as well, but to reproduce it I have 
to comment out the 'ordering' of tables for the joins, which forces an 
initial join order that is sub-optimal.
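
The expected behavior described above can be illustrated with a minimal 
sketch. This is a simplified, hypothetical model and not Impala's planner 
code: the names JoinParallelismSketch, Fragment, HashJoin, invertInputs and 
resetParallelism are assumptions made up for illustration. It only shows the 
idea that, after the inputs are swapped, the join fragment's hosts/instances 
should be re-derived from the new probe input.

```java
// Hypothetical, simplified model; these are NOT Impala's planner classes.
final class JoinParallelismSketch {

    // A plan fragment with the two parallelism figures shown in EXPLAIN output.
    static final class Fragment {
        final String id;
        int numHosts;
        int numInstances;

        Fragment(String id, int numHosts, int numInstances) {
            this.id = id;
            this.numHosts = numHosts;
            this.numInstances = numInstances;
        }
    }

    // A hash join whose fragment inherits parallelism from its probe input.
    static final class HashJoin {
        String joinOp;
        Fragment probeInput;
        Fragment buildInput;
        final Fragment joinFragment;

        HashJoin(String joinOp, Fragment probeInput, Fragment buildInput,
                 String fragmentId) {
            this.joinOp = joinOp;
            this.probeInput = probeInput;
            this.buildInput = buildInput;
            // Initial parallelism comes from the original probe input.
            this.joinFragment = new Fragment(
                fragmentId, probeInput.numHosts, probeInput.numInstances);
        }

        // Swap probe and build sides and flip the semi join direction.
        // Note that this alone leaves joinFragment's parallelism stale.
        void invertInputs() {
            Fragment tmp = probeInput;
            probeInput = buildInput;
            buildInput = tmp;
            if (joinOp.equals("LEFT SEMI JOIN")) {
                joinOp = "RIGHT SEMI JOIN";
            } else if (joinOp.equals("RIGHT SEMI JOIN")) {
                joinOp = "LEFT SEMI JOIN";
            }
        }

        // Assumed fix: re-derive parallelism from the current probe input.
        void resetParallelism() {
            joinFragment.numHosts = probeInput.numHosts;
            joinFragment.numInstances = probeInput.numInstances;
        }
    }

    static String describe(Fragment f) {
        return f.id + " hosts=" + f.numHosts + " instances=" + f.numInstances;
    }

    public static void main(String[] args) {
        // Figures taken from the plan above: F02 scans store_returns,
        // F00 scans store_sales.
        Fragment storeReturns = new Fragment("F02", 1, 1);
        Fragment storeSales = new Fragment("F00", 3, 3);
        HashJoin join =
            new HashJoin("LEFT SEMI JOIN", storeReturns, storeSales, "F01");

        join.invertInputs();
        System.out.println(join.joinOp + ": " + describe(join.joinFragment));

        join.resetParallelism();
        System.out.println(join.joinOp + ": " + describe(join.joinFragment));
    }
}
```

With the figures from the plan above, the first line printed shows F01 still 
at hosts=1 instances=1 after the inversion (the reported behavior), and the 
second shows hosts=3 instances=3 once the parallelism is re-derived from the 
store_sales side.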



