westonpace commented on issue #35268:
URL: https://github.com/apache/arrow/issues/35268#issuecomment-1552967192
`order_by_sink` is the old way. We should be migrating to use `order_by` in
`order_by_node.cc`. You should make these changes there.
> It will be a little weird, cause I think the two kernels don't conform
arrow's design principle for kernel.
I agree that compute kernels are not best for some of this
> we should sort the buffer and write the sorted data to disk.
This can be a kernel (this already exists with SortIndices).
> Implement two kernels in arrow::compute, one does the spillover things
Kernels should not write to disk. I think we should create a separate
abstraction, a spilling accumulation queue, that writes to disk. For example,
it could be something like...
```
class OrderedSpillingAccumulationQueue {
public:
OrderedSpillingAccumulationQueue(int64_t buffer_size);
// Inserts a batch into the queue. This may trigger a write to disk if
enough data is accumulated
// If it does, then SpillCount should be incremented before this method
returns (but the write can
// happen in the background, asynchronously)
Status InsertBatch(ExecBatch batch);
// The number of files that have been written to disk. This should also
include any data in memory
// so it will be the number of files written to disk + 1 if there is
in-memory data.
int SpillCount();
Future<ExecBatch> FetchBatch(int spill_index, int index_in_spill);
};
```
The n-way merge can then call `FetchBatch` appropriately. An n-way merge is
going to be challenging to implement performantly because it is not a columnar
algorithm. Thinking about this more, the n-way merge kernel will probably not
be a compute function. You will probably want to use something like
`ExecBatchBuilder` to accumulate the results.
Keep in mind that there is another approach which will not require an n-way
merge (external distribution sort). This approach may be simpler to implement.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]