Baike Xia has posted comments on this change. ( http://gerrit.cloudera.org:8080/19430 )
Change subject: IMPALA-3120: Support Bucket Shuffle Join for bucketed table
......................................................................


Patch Set 13:

(6 comments)

Hi Csaba and Aman,

I was busy with work for a while and did not have time to reply. Sorry for the delay. I'm back now and look forward to your replies and suggestions.

http://gerrit.cloudera.org:8080/#/c/19430/9//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/19430/9//COMMIT_MSG@10
PS9, Line 10: performance for some Join queries. Th
> I still don't get the non-partitoned sort case. Can you give an example que
Yes, I'll add it later.


http://gerrit.cloudera.org:8080/#/c/19430/9//COMMIT_MSG@13
PS9, Line 13:
> Is there a node where the whole bucket is located? I mean that if there are
I don't think I understand what you mean. Can you explain that again?


http://gerrit.cloudera.org:8080/#/c/19430/13//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/19430/13//COMMIT_MSG@25
PS13, Line 25: based on hdfs storage are supported.
> Thanks for the detailed patch. I have a high level question about the phys
Hi Aman,
Thanks for your reply. HDFS rebalancing does not move whole files; it moves blocks of data. Moving the underlying blocks does not change the content or size of a file, so the buckets are not broken.


http://gerrit.cloudera.org:8080/#/c/19430/13/be/src/runtime/query-state.h
File be/src/runtime/query-state.h:

http://gerrit.cloudera.org:8080/#/c/19430/13/be/src/runtime/query-state.h@149
PS13, Line 149:   /// Define locks to ensure thread safety when replenishing reserved memory.
:   std::mutex increase_memory_reservation_mtx_;
:
:   /// Configure a semaphore to control FragmentInstanceState::Exec
:   /// for each fragment instance that is executed in a bucket.
:   /// To save memory, only one concurrency is supported in the open phase and beyond,
:   /// after the completion of prepare.
:   std::unordered_map<TFragmentIdx, sem_t> bucket_fragment_sem_;
:
:   /// Configure a counter for each fragment instance to count the number of fragment
:   /// instances that have not yet completed execution, to prevent invalid
:   /// increase_memory_reservation, and to destroy the semaphore after the execution of
:   /// all instances of the fragment in the bucket has completed.
:   std::unordered_map<TFragmentIdx, int> bucket_fragment_un_finished_instances_;
> I couldn't grasp the changes in query life-cycle yet. Can you give some exp
Yes, you are right. In particular, KrpcDataStreamSender uses the hash function to send each row to the corresponding fragment. In this case, Hive hash is used. The number of fragment instances running at the same time is limited so that concurrent execution does not exhaust resources. However, this logic comes from an internal modification our company made on top of Impala 3.2, and I'm still not sure whether it is necessary. Could you give me some advice?


http://gerrit.cloudera.org:8080/#/c/19430/13/be/src/util/hash-util.h
File be/src/util/hash-util.h:

http://gerrit.cloudera.org:8080/#/c/19430/13/be/src/util/hash-util.h@287
PS13, Line 287:  {
> Can you add some tests for this in https://github.com/apache/impala/blob/ma
Yes, I'll add them later.


http://gerrit.cloudera.org:8080/#/c/19430/13/fe/src/main/java/org/apache/impala/catalog/Table.java
File fe/src/main/java/org/apache/impala/catalog/Table.java:

http://gerrit.cloudera.org:8080/#/c/19430/13/fe/src/main/java/org/apache/impala/catalog/Table.java@1045
PS13, Line 1045: TBucketType.NONE
> This is not from this patch, but I saw that the other value of TBucketType
Yes, I see what you mean. TBucketType is meant to accommodate multiple bucketing algorithms. NONE indicates that the table is not bucketed. HASH indicates that the Hive hash algorithm is used. Other hash algorithms, such as those of Iceberg or Kudu, can be added later.
The name HIVE_HASH or HIVE_BUCKET_V2_HASH is not used here because HASH is compatible with Hive SQL, which makes it easier to run Hive SQL in Impala.


--
To view, visit http://gerrit.cloudera.org:8080/19430
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: If321e7987bc88374d79500cffb77ea25b2ed0316
Gerrit-Change-Number: 19430
Gerrit-PatchSet: 13
Gerrit-Owner: Baike Xia <[email protected]>
Gerrit-Reviewer: Aman Sinha <[email protected]>
Gerrit-Reviewer: Baike Xia <[email protected]>
Gerrit-Reviewer: Csaba Ringhofer <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Quanlong Huang <[email protected]>
Gerrit-Comment-Date: Tue, 21 Feb 2023 08:21:25 +0000
Gerrit-HasComments: Yes
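
Illustrative note on the query-state.h and hash-util.h replies above: the idea that a Hive hash decides which fragment receives each row can be sketched roughly as follows. This is a minimal sketch, not the code in the patch. It assumes Hive's legacy (bucketing version 1) hashing, where an INT hashes to itself and a string hashes as h = 31 * h + byte, and the usual bucket rule (hash & Integer.MAX_VALUE) % numBuckets. The names HiveHashInt, HiveHashString, BucketOf and DestinationInstance are hypothetical and do not come from Impala.

```cpp
#include <cassert>
#include <cstdint>
#include <string_view>

// Assumption: Hive's legacy (bucketing version 1) hash of an INT is the value itself.
inline int32_t HiveHashInt(int32_t v) { return v; }

// Assumption: Hive's legacy string hash is h = 31 * h + byte over the UTF-8
// bytes, with bytes treated as signed and 32-bit wraparound, as in Java.
inline int32_t HiveHashString(std::string_view s) {
  uint32_t h = 0;  // unsigned arithmetic gives well-defined wraparound
  for (char c : s) {
    h = 31u * h + static_cast<uint32_t>(static_cast<int8_t>(c));
  }
  return static_cast<int32_t>(h);
}

// Clear the sign bit, then take the remainder, so the result is in [0, num_buckets).
inline int32_t BucketOf(int32_t hash, int32_t num_buckets) {
  assert(num_buckets > 0);
  return (hash & INT32_MAX) % num_buckets;
}

// Hypothetical helper: map a bucket to a receiving fragment instance so that all
// rows of one bucket reach the same instance. The real assignment in the patch
// may differ.
inline int32_t DestinationInstance(int32_t bucket, int32_t num_instances) {
  assert(num_instances > 0);
  return bucket % num_instances;
}
```

Under these assumptions, a sender would compute BucketOf(HiveHashInt(key), num_buckets) for each row and send the row to DestinationInstance(bucket, num_instances). Because the join key then lands on the instance that scans the matching bucket, the bucketed side of the join would not need to be reshuffled.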
