[Impala-ASF-CR] IMPALA-3120: Support Bucket Shuffle Join for bucketed table

Baike Xia (Code Review) Wed, 01 Feb 2023 00:53:35 -0800

Baike Xia has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/19430 )


Change subject: IMPALA-3120: Support Bucket Shuffle Join for bucketed table
......................................................................


Patch Set 10:

(12 comments)

Hi Csaba, Thanks for your reply and suggestions.
Look forward to your further comments.

http://gerrit.cloudera.org:8080/#/c/19430/9//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/19430/9//COMMIT_MSG@9
PS9, Line 9: rovides better
> Besides the bucket operations do we also apply predicates to buckets? For e
That's right, and I think we can add related optimizations in the future.


http://gerrit.cloudera.org:8080/#/c/19430/9//COMMIT_MSG@10
PS9, Line 10: performance for some Join queries. Th
> Can you add more info about these optimizations?
Yes, I can.
1. For sort, bucketing not limited to  a partitioned analytic function, the 
aggregate function works as well;
2. It is OK to have multiple bucket columns in one table or multiple bucket 
columns in multiple bucket table;
3. Yes, support bucket shuffle.


http://gerrit.cloudera.org:8080/#/c/19430/9//COMMIT_MSG@11
PS9, Line 11:  the dat
> Can you add some info about the tradeoffs? My understanding is that while b
1. For the first question, it might cause a decrease in parallelism, but, if 
buckets are properly divided, a single bucket is not too large, which does not 
affect query performance;
2. In scheduling, localization execution is still judged first, and bucket 
shuffle does not affect localization execution.


http://gerrit.cloudera.org:8080/#/c/19430/9//COMMIT_MSG@13
PS9, Line 13:
> Can you add some info about the effect on scheduling?
The executor is assigned to the node where the bucket is located.


http://gerrit.cloudera.org:8080/#/c/19430/9/fe/src/main/java/org/apache/impala/planner/DistributedPlanner.java
File fe/src/main/java/org/apache/impala/planner/DistributedPlanner.java:

http://gerrit.cloudera.org:8080/#/c/19430/9/fe/src/main/java/org/apache/impala/planner/DistributedPlanner.java@324
PS9, Line 324:    * TODO: hbase scans are range-partitioned on the row key
             :    */
> Todo can be removed
Done


http://gerrit.cloudera.org:8080/#/c/19430/9/fe/src/main/java/org/apache/impala/planner/DistributedPlanner.java@510
PS9, Line 510: butionMode() for more details.
> Can you mention buckating join?
Done


http://gerrit.cloudera.org:8080/#/c/19430/9/fe/src/main/java/org/apache/impala/planner/DistributedPlanner.java@521
PS9, Line 521: (ctx_.getQueryOptions().isEnable_bucket_shuffle()
> Is this always the optimal solution?
That's right, for predicate filtering, I want to optimize it in the future.


http://gerrit.cloudera.org:8080/#/c/19430/9/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java:

http://gerrit.cloudera.org:8080/#/c/19430/9/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@2539
PS9, Line 2539:
> Isn't it Hive hash?
Done


http://gerrit.cloudera.org:8080/#/c/19430/9/fe/src/main/java/org/apache/impala/planner/SortNode.java
File fe/src/main/java/org/apache/impala/planner/SortNode.java:

http://gerrit.cloudera.org:8080/#/c/19430/9/fe/src/main/java/org/apache/impala/planner/SortNode.java@387
PS9, Line 387:     if (isBucketedNode()) {
> Can you translate this to English?
Done


http://gerrit.cloudera.org:8080/#/c/19430/9/testdata/datasets/functional/functional_schema_template.sql
File testdata/datasets/functional/functional_schema_template.sql:

http://gerrit.cloudera.org:8080/#/c/19430/9/testdata/datasets/functional/functional_schema_template.sql@3913
PS9, Line 3913: CLUSTERED BY(id)
> Can you also add a test table that is bucketed by more than 1 column?
OK, that's right.


http://gerrit.cloudera.org:8080/#/c/19430/9/testdata/workloads/functional-planner/queries/PlannerTest/bucket-shuffle.test
File 
testdata/workloads/functional-planner/queries/PlannerTest/bucket-shuffle.test:

http://gerrit.cloudera.org:8080/#/c/19430/9/testdata/workloads/functional-planner/queries/PlannerTest/bucket-shuffle.test@9
PS9, Line 9: 06:AGGREGATE [FINALIZE]
           : |  output: count:merge(b.id), count:merge(b.string_col)
           : |  row-size=16B cardinality=1
           : |
           : 05:EXCHANGE [UNPARTITIONED]
> I don't understand this part of the plan - shouldn't be there a pre-aggrega
Yes, this is wrong, I fixed it.


http://gerrit.cloudera.org:8080/#/c/19430/9/testdata/workloads/functional-planner/queries/PlannerTest/bucket-shuffle.test@28
PS9, Line 28: |     HDFS partitions=12/24 files=12 size=239.77KB
> For bucketed tables it could be useful to add something like buckets=4/4
That's great.



--
To view, visit http://gerrit.cloudera.org:8080/19430
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: If321e7987bc88374d79500cffb77ea25b2ed0316
Gerrit-Change-Number: 19430
Gerrit-PatchSet: 10
Gerrit-Owner: Baike Xia <[email protected]>
Gerrit-Reviewer: Baike Xia <[email protected]>
Gerrit-Reviewer: Csaba Ringhofer <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Quanlong Huang <[email protected]>
Gerrit-Comment-Date: Wed, 01 Feb 2023 08:49:25 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-3120: Support Bucket Shuffle Join for bucketed table

Reply via email to