R-JunmingChen commented on issue #35268:
URL: https://github.com/apache/arrow/issues/35268#issuecomment-1555318525
> The n-way merge can then call `FetchBatch` appropriately. An n-way merge
is going to be challenging to implement performantly because it is not a
columnar algorithm. Thinking about this more, the n-way merge kernel will
probably not be a compute function. You will probably want to use something
like `ExecBatchBuilder` to accumulate the results.
Helpful suggestions, It is indeed challenging to implement external merge
sort performantly.
Draft plan:
1. In `InsertBatch`, we do what you comment says.
2. we do n-way merge in buffer. We could also just compare the columns
needed for sorting and take the indices, whose format should be
{batch_index}_{indice}, to materialize a result.
> Keep in mind that there is another approach which will not require an
n-way merge (external distribution sort). This approach may be simpler to
implement but I don't know.
I have roughly investigated external distribution sort. May be we shouldn't
choose it.
The external distribution sort prefers n-pivots could divide the entire data
evenly. However, It's hard to get the perfect n-pivots in the top n rounds of
`InsertBatch` unless every batch is identically distribution. If the n-pivots
can't divide the data evenly, the sorting would call **extra IO**, compared to
external merge sort, to recusively divide data untill the smallest segment fits
buffer.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]