caiwanli opened a new issue, #39071:
URL: https://github.com/apache/arrow/issues/39071

   ### Describe the usage question you have. Please include as many useful details as possible.
   
   
   In my join operator, all input data is cached first. The cache is defined as follows:
   `std::vector<std::shared_ptr<arrow::RecordBatch>> buffer_chunk;`
   After caching all the data, I want to radix-partition `buffer_chunk`. The function interface is designed as follows:
   `void radix_partition(vector<shared_ptr<arrow::RecordBatch>> &input, std::shared_ptr<arrow::RecordBatch> &output, int *histogram);`
   After the function returns, `output` holds the reordered data, and `histogram` records the starting position of each partition.
   For example, dividing the data in table 1 into 4 partitions yields the data in table 2, and the histogram is {0, 3, 7, 9}.
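
To spell out what I mean by the histogram: it is the exclusive prefix sum of the per-partition row counts. A minimal sketch, assuming row counts of 3, 4, 2 and 3 (the last count is a made-up value, since the histogram only determines the first three; `counts_to_offsets` is a hypothetical helper name, not an Arrow function):

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: turn per-partition row counts into the starting
// offset of each partition (an exclusive prefix sum).
std::vector<int> counts_to_offsets(const std::vector<int>& counts) {
    std::vector<int> offsets(counts.size());
    int running = 0;  // number of rows in all earlier partitions
    for (std::size_t p = 0; p < counts.size(); ++p) {
        offsets[p] = running;
        running += counts[p];
    }
    return offsets;
}
```

With the assumed counts, `counts_to_offsets({3, 4, 2, 3})` gives `{0, 3, 7, 9}`, which matches the histogram in the example.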
   
![Arrow](https://github.com/apache/arrow/assets/22023446/fc78c99b-d728-4b11-bee5-00b1844c057d)
   
   To implement `radix_partition`, I need to resolve two issues:
   
   1. How to efficiently traverse the "input" data?
   2. How to efficiently insert data into the "output" while traversing?
   
   Partition calculation method (the low 2 bits of the key `tar`, which gives 4 partitions):
   `part = tar & ((1 << 2) - 1);`
   For example:
   ```
   part(57): 57 & ((1 << 2) - 1) = 1
   part(92): 92 & ((1 << 2) - 1) = 0
   ```
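
For reference, here is a rough sketch of the two-pass scheme I have in mind, written against a plain `std::vector` of keys rather than Arrow types. Everything in it (`partition_of`, `PartitionResult`, `radix_partition_keys`, `kRadixBits`) is a hypothetical name for illustration. It only computes the row permutation and the histogram; how to read the keys out of the `RecordBatch` columns and how to materialize `output` from the permutation is exactly what I am unsure about.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

constexpr int kRadixBits = 2;  // 2 bits -> 4 partitions, as in the example
constexpr std::uint64_t kMask = (std::uint64_t{1} << kRadixBits) - 1;

// Partition id = low kRadixBits bits of the key.
inline std::size_t partition_of(std::uint64_t key) {
    return static_cast<std::size_t>(key & kMask);
}

struct PartitionResult {
    std::vector<std::size_t> row_order;  // row_order[i] = input row stored at output position i
    std::vector<std::size_t> histogram;  // starting output position of each partition
};

PartitionResult radix_partition_keys(const std::vector<std::uint64_t>& keys) {
    const std::size_t num_parts = std::size_t{1} << kRadixBits;

    // Pass 1: count the rows that fall into each partition.
    std::vector<std::size_t> histogram(num_parts, 0);
    for (std::uint64_t key : keys) {
        ++histogram[partition_of(key)];
    }

    // Convert counts into starting offsets (exclusive prefix sum).
    std::size_t running = 0;
    for (std::size_t p = 0; p < num_parts; ++p) {
        const std::size_t count = histogram[p];
        histogram[p] = running;
        running += count;
    }
    assert(running == keys.size());

    // Pass 2: scatter each row index to the next free slot of its partition.
    std::vector<std::size_t> cursor = histogram;  // per-partition write position
    std::vector<std::size_t> row_order(keys.size());
    for (std::size_t row = 0; row < keys.size(); ++row) {
        row_order[cursor[partition_of(keys[row])]++] = row;
    }
    return {std::move(row_order), std::move(histogram)};
}
```

For the keys `{57, 92, 6, 3, 8}` this yields the histogram `{0, 2, 3, 4}` and the row order `{1, 4, 0, 2, 3}`. Rows within a partition keep their input order, so the partitioning is stable. My assumption is that such a permutation could then be applied to every column with a take/gather-style kernel, but I have not verified that this is the efficient way to do it in Arrow.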
   
   
   ### Component(s)
   
   C++


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.