R-JunmingChen commented on issue #35268:
URL: https://github.com/apache/arrow/issues/35268#issuecomment-1555318525

   > The n-way merge can then call `FetchBatch` appropriately. An n-way merge 
is going to be challenging to implement performantly because it is not a 
columnar algorithm. Thinking about this more, the n-way merge kernel will 
probably not be a compute function. You will probably want to use something 
like `ExecBatchBuilder` to accumulate the results.
   
   Helpful suggestions, It is indeed challenging to implement external merge 
sort performantly. 
   Draft plan:
   1. In `InsertBatch`, we do what you comment says.
   2. we do n-way merge in buffer. We could also just compare the columns 
needed for sorting and take the indices, whose format should be 
{batch_index}_{indice}, to materialize a result. 
   
   > Keep in mind that there is another approach which will not require an 
n-way merge (external distribution sort). This approach may be simpler to 
implement but I don't know.
   
   I have roughly investigated external distribution sort. May be we shouldn't 
choose it.
   The external distribution sort prefers n-pivots could divide the entire data 
evenly. However, It's hard to get the perfect n-pivots in the top n rounds of  
`InsertBatch` unless every batch is identically distribution. If the n-pivots 
can't divide the data evenly, the sorting would call **extra IO**, compared to 
external merge sort, to recusively divide data untill the smallest segment fits 
buffer.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to