Riza Suminto has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/21927 )

Change subject: IMPALA-13445: Ignore num partition for unpartitioned writes
......................................................................


Patch Set 7:

(4 comments)

http://gerrit.cloudera.org:8080/#/c/21927/4//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/21927/4//COMMIT_MSG@18
PS4, Line 18: 1. If the insert is unpartitioned, use the byte-based estimate 
fully.
            :    Shuffling should only happen if num writers is less than num 
input
            :    fragment instances.
            : 2. If the insert is partitioned, try to plan at least one writer 
for
            :    each shuffling executor nodes, but do not exceed number of
            :    partitions.
> Scheduling less than num nodes for partitioned insert necessitate inserting
Added store_sales_5_rows, store_sales_1_row_per_part, and 
store_sales_5_part_1_row_per_part.
All looks holding up pretty well.


http://gerrit.cloudera.org:8080/#/c/21927/6/fe/src/main/java/org/apache/impala/planner/HdfsTableSink.java
File fe/src/main/java/org/apache/impala/planner/HdfsTableSink.java:

http://gerrit.cloudera.org:8080/#/c/21927/6/fe/src/main/java/org/apache/impala/planner/HdfsTableSink.java@453
PS6, Line 453: shuffling
> nit: typo
Done


http://gerrit.cloudera.org:8080/#/c/21927/6/fe/src/main/java/org/apache/impala/planner/HdfsTableSink.java@457
PS6, Line 457: maxWriters = totalNumPartitions
> Shouldn't we set maxWriters higher, e.g. totalNumPartitions * 2, to limit h
I'm not sure. totalNumPartitions most likely has been over estimated because it 
is a simple NDV multiplication of partition columns. So probably no need to do 
that.


http://gerrit.cloudera.org:8080/#/c/21927/6/testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-ddl-iceberg.test
File 
testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-ddl-iceberg.test:

http://gerrit.cloudera.org:8080/#/c/21927/6/testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-ddl-iceberg.test@283
PS6, Line 283: costs=[7250411, 7] cp
> Why is it 10 hosts and 10 instances when we have only a single input file?
This is because we force colocation between scan and writing.
Worst case, only 1 out of 10 fragment instance is active.
Best case, file is split into 10 scan ranges and all active reading and writing.

I think this is a non-issue because the whole TpcdsCpuCostPlannerTest runs with 
synthetic stats from 3TB TPC-DS. So real num files in 3TB TPC-DS probably is 
higher or larger in size.



--
To view, visit http://gerrit.cloudera.org:8080/21927
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I51ab8fc35a5489351a88d372b28642b35449acfc
Gerrit-Change-Number: 21927
Gerrit-PatchSet: 7
Gerrit-Owner: Riza Suminto <[email protected]>
Gerrit-Reviewer: Abhishek Rawat <[email protected]>
Gerrit-Reviewer: David Rorke <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Riza Suminto <[email protected]>
Gerrit-Reviewer: Wenzhe Zhou <[email protected]>
Gerrit-Reviewer: Zoltan Borok-Nagy <[email protected]>
Gerrit-Comment-Date: Fri, 18 Oct 2024 19:59:46 +0000
Gerrit-HasComments: Yes

Reply via email to