baharberna opened a new pull request, #6346:
URL: https://github.com/apache/arrow-datafusion/pull/6346

   
   # Rationale for this change
   
   RepartitionExec, when handling multiple input partitions, creates N 
channels for each input partition, where N is the output partition count. This 
results in a total of input_partitions * output_partitions channels. During 
processing, each output partition pulls from its channels in whatever order 
data becomes available, which depends on processing time, so the order of 
records is not preserved. This matters when the input partition count is 
greater than 1, since the order of records within the output partitions 
becomes unpredictable. To address this issue, a more sophisticated algorithm 
is needed, one that combines the existing hash partitioner and round-robin 
partitioner functionality while preserving the original order of records 
within partitions, even when the input partition count is greater than 1.
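
   The effect described above can be illustrated with a small, self-contained 
sketch. It does not use DataFusion types: plain integers stand in for record 
batches, and the helper names (`channel_count`, `drain_in_arrival_order`, 
`is_ascending`) are hypothetical, chosen only for this illustration.

```rust
/// One channel per (input partition, output partition) pair.
fn channel_count(input_partitions: usize, output_partitions: usize) -> usize {
    input_partitions * output_partitions
}

/// Simulates a single output partition receiving values from several sorted
/// inputs. `arrival` lists which input delivers its next value at each step,
/// standing in for timing-dependent scheduling (an assumption of this sketch).
fn drain_in_arrival_order(inputs: &[Vec<i32>], arrival: &[usize]) -> Vec<i32> {
    let mut cursors = vec![0usize; inputs.len()];
    let mut out = Vec::new();
    for &i in arrival {
        // Skip the step if that input is already exhausted.
        if let Some(&v) = inputs[i].get(cursors[i]) {
            out.push(v);
            cursors[i] += 1;
        }
    }
    out
}

/// True when the values are in non-decreasing order.
fn is_ascending(values: &[i32]) -> bool {
    values.windows(2).all(|w| w[0] <= w[1])
}
```

   With two sorted inputs `[1, 3, 5]` and `[2, 4, 6]`, an arrival order such 
as `[1, 1, 0, 0, 1, 0]` yields an output that is no longer sorted, even 
though each input was.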
   
   # What changes are included in this PR?
   
   This PR adds SortPreservingRepartitionExec, which implements the 
ExecutionPlan trait and its associated APIs. 
   The sort-preserving repartition operator maps N input partitions to M 
output partitions based on a partitioning scheme while preserving their order. 
To achieve this, we make use of SortPreservingMergeStream: we first merge the 
multiple input partitions into one output stream, preserving their order, and 
then feed this stream into RepartitionExec. Since RepartitionExec preserves 
the order when the number of input partitions is one, we reach our goal, 
hopefully :)
   In short, SortPreservingRepartitionExec applies SortPreservingMergeStream 
first and RepartitionExec second.
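
   A minimal sketch of the merge-then-repartition idea, under simplifying 
assumptions: integers stand in for sorted record batches, round-robin is 
applied per value, and the function names (`merge_sorted`, 
`repartition_round_robin`, `sort_preserving_repartition`) are hypothetical. 
It is not the code in this PR and does not call any DataFusion API.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// K-way merge of already-sorted inputs into one sorted stream.
fn merge_sorted(inputs: &[Vec<i32>]) -> Vec<i32> {
    // Min-heap of (value, input index, position within that input).
    let mut heap = BinaryHeap::new();
    for (idx, input) in inputs.iter().enumerate() {
        if let Some(&v) = input.first() {
            heap.push(Reverse((v, idx, 0usize)));
        }
    }
    let mut merged = Vec::new();
    while let Some(Reverse((value, idx, pos))) = heap.pop() {
        merged.push(value);
        // Refill the heap from the input that just produced a value.
        if let Some(&next) = inputs[idx].get(pos + 1) {
            heap.push(Reverse((next, idx, pos + 1)));
        }
    }
    merged
}

/// Distributes a single ordered stream over `m` outputs in round-robin order.
/// Each output sees a subsequence of the stream, so it stays sorted.
fn repartition_round_robin(stream: &[i32], m: usize) -> Vec<Vec<i32>> {
    assert!(m > 0, "output partition count must be positive");
    let mut outputs = vec![Vec::new(); m];
    for (i, &v) in stream.iter().enumerate() {
        outputs[i % m].push(v);
    }
    outputs
}

/// N sorted inputs -> one merged stream -> M sorted outputs.
fn sort_preserving_repartition(inputs: &[Vec<i32>], m: usize) -> Vec<Vec<i32>> {
    repartition_round_robin(&merge_sorted(inputs), m)
}
```

   The trade-off in this design is that all inputs are funneled through a 
single merged stream before being fanned out again, in exchange for each 
output partition remaining sorted.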
   
   # Are these changes tested?
   
   Tests are not included in this PR. There are some tests in 
sort_enforcement.rs, but they are currently failing :(
   
   

