Baike Xia has posted comments on this change. ( http://gerrit.cloudera.org:8080/19430 )
Change subject: IMPALA-3120: Support Bucket Shuffle Join for bucketed table ...................................................................... Patch Set 10: (12 comments) Hi Csaba, Thanks for your reply and suggestions. Look forward to your further comments. http://gerrit.cloudera.org:8080/#/c/19430/9//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/19430/9//COMMIT_MSG@9 PS9, Line 9: rovides better > Besides the bucket operations do we also apply predicates to buckets? For e That's right, and I think we can add related optimizations in the future. http://gerrit.cloudera.org:8080/#/c/19430/9//COMMIT_MSG@10 PS9, Line 10: performance for some Join queries. Th > Can you add more info about these optimizations? Yes, I can. 1. For sort, bucketing not limited to a partitioned analytic function, the aggregate function works as well; 2. It is OK to have multiple bucket columns in one table or multiple bucket columns in multiple bucket table; 3. Yes, support bucket shuffle. http://gerrit.cloudera.org:8080/#/c/19430/9//COMMIT_MSG@11 PS9, Line 11: the dat > Can you add some info about the tradeoffs? My understanding is that while b 1. For the first question, it might cause a decrease in parallelism, but, if buckets are properly divided, a single bucket is not too large, which does not affect query performance; 2. In scheduling, localization execution is still judged first, and bucket shuffle does not affect localization execution. http://gerrit.cloudera.org:8080/#/c/19430/9//COMMIT_MSG@13 PS9, Line 13: > Can you add some info about the effect on scheduling? The executor is assigned to the node where the bucket is located. http://gerrit.cloudera.org:8080/#/c/19430/9/fe/src/main/java/org/apache/impala/planner/DistributedPlanner.java File fe/src/main/java/org/apache/impala/planner/DistributedPlanner.java: http://gerrit.cloudera.org:8080/#/c/19430/9/fe/src/main/java/org/apache/impala/planner/DistributedPlanner.java@324 PS9, Line 324: * TODO: hbase scans are range-partitioned on the row key : */ > Todo can be removed Done http://gerrit.cloudera.org:8080/#/c/19430/9/fe/src/main/java/org/apache/impala/planner/DistributedPlanner.java@510 PS9, Line 510: butionMode() for more details. > Can you mention buckating join? Done http://gerrit.cloudera.org:8080/#/c/19430/9/fe/src/main/java/org/apache/impala/planner/DistributedPlanner.java@521 PS9, Line 521: (ctx_.getQueryOptions().isEnable_bucket_shuffle() > Is this always the optimal solution? That's right, for predicate filtering, I want to optimize it in the future. http://gerrit.cloudera.org:8080/#/c/19430/9/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java: http://gerrit.cloudera.org:8080/#/c/19430/9/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@2539 PS9, Line 2539: > Isn't it Hive hash? Done http://gerrit.cloudera.org:8080/#/c/19430/9/fe/src/main/java/org/apache/impala/planner/SortNode.java File fe/src/main/java/org/apache/impala/planner/SortNode.java: http://gerrit.cloudera.org:8080/#/c/19430/9/fe/src/main/java/org/apache/impala/planner/SortNode.java@387 PS9, Line 387: if (isBucketedNode()) { > Can you translate this to English? Done http://gerrit.cloudera.org:8080/#/c/19430/9/testdata/datasets/functional/functional_schema_template.sql File testdata/datasets/functional/functional_schema_template.sql: http://gerrit.cloudera.org:8080/#/c/19430/9/testdata/datasets/functional/functional_schema_template.sql@3913 PS9, Line 3913: CLUSTERED BY(id) > Can you also add a test table that is bucketed by more than 1 column? OK, that's right. http://gerrit.cloudera.org:8080/#/c/19430/9/testdata/workloads/functional-planner/queries/PlannerTest/bucket-shuffle.test File testdata/workloads/functional-planner/queries/PlannerTest/bucket-shuffle.test: http://gerrit.cloudera.org:8080/#/c/19430/9/testdata/workloads/functional-planner/queries/PlannerTest/bucket-shuffle.test@9 PS9, Line 9: 06:AGGREGATE [FINALIZE] : | output: count:merge(b.id), count:merge(b.string_col) : | row-size=16B cardinality=1 : | : 05:EXCHANGE [UNPARTITIONED] > I don't understand this part of the plan - shouldn't be there a pre-aggrega Yes, this is wrong, I fixed it. http://gerrit.cloudera.org:8080/#/c/19430/9/testdata/workloads/functional-planner/queries/PlannerTest/bucket-shuffle.test@28 PS9, Line 28: | HDFS partitions=12/24 files=12 size=239.77KB > For bucketed tables it could be useful to add something like buckets=4/4 That's great. -- To view, visit http://gerrit.cloudera.org:8080/19430 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: If321e7987bc88374d79500cffb77ea25b2ed0316 Gerrit-Change-Number: 19430 Gerrit-PatchSet: 10 Gerrit-Owner: Baike Xia <[email protected]> Gerrit-Reviewer: Baike Xia <[email protected]> Gerrit-Reviewer: Csaba Ringhofer <[email protected]> Gerrit-Reviewer: Impala Public Jenkins <[email protected]> Gerrit-Reviewer: Quanlong Huang <[email protected]> Gerrit-Comment-Date: Wed, 01 Feb 2023 08:49:25 +0000 Gerrit-HasComments: Yes
