Light-City opened a new issue, #37491:
URL: https://github.com/apache/arrow/issues/37491

   ### Describe the enhancement requested
   
   When the build side of a hash join is an empty table, there is no need to
   keep pushing data from the source node on the probe side. The time spent in
   the source callback can be saved, and when the outer (probe) table is large,
   the performance improvement is significant.
   
   <img width="632" alt="Screenshot 2023-08-31 4.34.39 PM" src="https://github.com/apache/arrow/assets/25699850/4cf91278-59e5-4952-867f-1fe5b4ba141f">
   
   
   In the figure, the left side is the outer (probe) table and the right side
   is the inner (build) table.
   
   We can do some optimizations for inner join and right join:
   
   When the inner table is empty, we don't need to asynchronously generate the
   outer table's data. The corresponding code is in source_node.cc:
   
   ```
   generator_()
   ```
   
   When the outer table is very large, this step is completely unnecessary, so
   we can design the following solution:
   
   **Step 1: In the hash join's InputFinished() for the inner table, output an
   empty batch.**
   
   **Step 2: Notify the plan to end all tasks.**
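
The two steps can be modeled outside of Arrow with a small toy sketch: a plan-wide atomic flag that the source loop checks before it pulls the next batch from its generator. `ToyPlan`, `should_early_stop`, `Generator`, and `RunSource` are hypothetical names used only for illustration; none of them is an Arrow API, and the real source node is asynchronous rather than a simple loop.

```cpp
#include <atomic>
#include <cstdint>
#include <functional>
#include <optional>
#include <vector>

// Hypothetical stand-in for a plan-wide stop flag; not an Arrow API.
struct ToyPlan {
  std::atomic<bool> should_early_stop{false};
};

using Batch = std::vector<std::int64_t>;
// A generator returns the next batch, or std::nullopt once it is exhausted.
using Generator = std::function<std::optional<Batch>()>;

// Pulls batches from `gen` and hands them to `push` until the generator is
// exhausted or the plan requests an early stop. The flag is checked before
// every pull, so no further batch is generated once it is set.
// Returns the number of batches that were pulled.
int RunSource(ToyPlan& plan, Generator gen,
              const std::function<void(const Batch&)>& push) {
  int pulled = 0;
  while (!plan.should_early_stop.load(std::memory_order_acquire)) {
    std::optional<Batch> batch = gen();
    if (!batch) break;  // generator exhausted
    ++pulled;
    push(*batch);
  }
  return pulled;
}
```

In this model, the join node would set `should_early_stop` in Step 2, and a probe-side source that has not started yet would never call its generator at all.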
   
   The version we use internally is 6.0, where this is easy to implement (as a
   hack, with a variable in the plan). In the latest code, however, an
   interface that lets a node actively stop the entire plan seems to be
   **not supported.**
   
   
   The following is some reference code:
   
   
   ```
   Status InputFinished(ExecNode* input, int total_batches) override {
     ARROW_DCHECK(std::find(inputs_.begin(), inputs_.end(), input) != inputs_.end());
     size_t thread_index = plan_->query_context()->GetThreadIndex();
     int side = (input == inputs_[0]) ? 0 : 1;
   
     if (batch_count_[side].SetTotal(total_batches)) {
       // If the build side (side 1) finished with zero rows, an inner or
       // right outer join cannot produce any output, so stop and finish early.
       bool stop_join =
           join_type_ == JoinType::INNER || join_type_ == JoinType::RIGHT_OUTER;
       bool should_stop =
           side == 1 && build_accumulator_.row_count() == 0 && stop_join;
       if (should_stop) {
         RETURN_NOT_OK(OutputEmptyBatch());
         // Internal hack: a flag on the plan that tells it to end all tasks.
         plan_->should_early_stop_ = true;
         return Status::OK();
       }
   
       if (side == 0) {
         return OnProbeSideFinished(thread_index);
       } else {
         return OnBuildSideFinished(thread_index);
       }
     }
     return Status::OK();
   }
   ```
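
The stop condition in the snippet above can also be isolated into a pure function, which makes it easy to test independently of the node. This is only a sketch: `ToyJoinType` and `CanStopEarly` are hypothetical names, and the enum is a simplified stand-in for the join types, not Arrow's `JoinType`.

```cpp
#include <cstdint>

// Simplified, hypothetical stand-in for the join type enum.
enum class ToyJoinType { kInner, kLeftOuter, kRightOuter, kFullOuter };

// Returns true when the side that just finished is the build side (side 1),
// it contains no rows, and the join type cannot produce output without
// build-side rows (inner and right outer, as in the snippet above).
constexpr bool CanStopEarly(ToyJoinType type, int finished_side,
                            std::int64_t build_rows) {
  return finished_side == 1 && build_rows == 0 &&
         (type == ToyJoinType::kInner || type == ToyJoinType::kRightOuter);
}
```

A left outer or full outer join is deliberately excluded: with an empty build side it still has to emit every probe-side row, padded with nulls, so the probe-side source must keep running.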
     
   
   ### Component(s)
   
   C++


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
