R-JunmingChen commented on issue #35268:
URL: https://github.com/apache/arrow/issues/35268#issuecomment-1546690050
> Great! You might take a look at #35320 as you're getting started. It helps
> give some overview of Acero in general.
I am stuck on how to come up with an appropriate design for external sort in
Acero.
Based on your doc, I think it would be good to first create an external
sorting kernel and then use it in Acero.
But Acero needs the separate parts of external sorting. To be precise:
- In the `InputReceived()` of the exec node `order_by_sink` with spillover, we
should read data into a buffer of size `buffer_size`. When the buffer is
full, we should sort it and write the sorted data to disk.
- In the `DoFinish()` of the exec node `order_by_sink` with spillover, we should
perform an n-way external merge sort.
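The two phases above can be sketched with the standard library alone. This is a minimal illustration under stated assumptions, not Acero code: `ExternalSorter`, `num_runs()`, and the method signatures are hypothetical, rows are plain `int64_t` values rather than Arrow batches, `buffer_size` counts rows, and `std::tmpfile()` stands in for whatever spill storage a real node would use.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// Hypothetical stand-in for a spilling order-by implementation (not an Acero class).
class ExternalSorter {
 public:
  explicit ExternalSorter(std::size_t buffer_size) : buffer_size_(buffer_size) {}
  ExternalSorter(const ExternalSorter&) = delete;
  ExternalSorter& operator=(const ExternalSorter&) = delete;

  ~ExternalSorter() {
    // Files created by std::tmpfile() are removed automatically when closed.
    for (std::FILE* f : runs_) std::fclose(f);
  }

  // Phase 1: accumulate rows; sort and spill one run whenever the buffer fills.
  bool InputReceived(const std::vector<int64_t>& batch) {
    for (int64_t v : batch) {
      buffer_.push_back(v);
      if (buffer_.size() >= buffer_size_ && !Spill()) return false;
    }
    return true;
  }

  // Phase 2: spill the remainder, then n-way merge all runs with a min-heap.
  // A real node would stream merged rows downstream instead of collecting them.
  bool DoFinish(std::vector<int64_t>* out) {
    if (!buffer_.empty() && !Spill()) return false;
    using Entry = std::pair<int64_t, std::size_t>;  // (value, run index)
    std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> heap;
    for (std::size_t i = 0; i < runs_.size(); ++i) {
      std::rewind(runs_[i]);  // switch each run from writing to reading
      int64_t v;
      if (std::fread(&v, sizeof(v), 1, runs_[i]) == 1) heap.emplace(v, i);
    }
    while (!heap.empty()) {
      Entry top = heap.top();
      heap.pop();
      out->push_back(top.first);
      // Refill from the run that just produced the smallest value.
      // A short read is treated as end of run; read errors are not distinguished.
      int64_t next;
      if (std::fread(&next, sizeof(next), 1, runs_[top.second]) == 1) {
        heap.emplace(next, top.second);
      }
    }
    return true;
  }

  std::size_t num_runs() const { return runs_.size(); }

 private:
  // Sort the in-memory buffer and write it out as one sorted run.
  bool Spill() {
    std::sort(buffer_.begin(), buffer_.end());
    std::FILE* f = std::tmpfile();
    if (f == nullptr) return false;
    runs_.push_back(f);  // tracked first so the destructor always closes it
    std::size_t written =
        std::fwrite(buffer_.data(), sizeof(int64_t), buffer_.size(), f);
    bool ok = (written == buffer_.size());
    buffer_.clear();
    return ok;
  }

  std::size_t buffer_size_;
  std::vector<int64_t> buffer_;
  std::vector<std::FILE*> runs_;
};
```

With `buffer_size = 3`, feeding the batches `{9, 1, 5, 3}` and `{8, 2, 7}` produces three runs on disk, and `DoFinish()` merges them into a single ascending sequence. Only one value per run is held in the heap during the merge, so memory use in phase 2 grows with the number of runs rather than with the total row count.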
So I have two simple plans:
1. Implement an `externalOrderByImpl` that implements the above methods on its
own. This way, we do not implement anything as a compute kernel. We may still
need to call the sorting function in compute from `InputReceived()`.
2. Implement two kernels: one does the spillover, the other does the n-way
external merge sort. `externalOrderByImpl` then just calls the two kernels in
its `InputReceived()` and `DoFinish()`. This would be a little weird, because I
think the two kernels don't conform to Arrow's design principles for kernels.
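To make the difference between the plans concrete, here is one possible reading of plan 2, sketched with the standard library only. Every name here is hypothetical (`Run`, `RunStore`, `SortAndSpill`, `MergeRuns`, and the method signatures); an in-memory `RunStore` stands in for disk, and rows are plain `int64_t` values. In plan 1 the bodies of the two free functions would instead live inside the impl class.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

using Run = std::vector<int64_t>;  // stand-in for one sorted run on disk

// Hypothetical spill target; a real version would write files instead.
struct RunStore {
  std::vector<Run> runs;
};

// Hypothetical "kernel" 1: sort one full buffer and append it to the store.
// Note the side effect on `store`, unlike a pure function over its input.
void SortAndSpill(Run buffer, RunStore* store) {
  std::sort(buffer.begin(), buffer.end());
  store->runs.push_back(std::move(buffer));
}

// Hypothetical "kernel" 2: n-way merge of all sorted runs using a min-heap.
Run MergeRuns(const RunStore& store) {
  using Entry = std::pair<int64_t, std::size_t>;  // (value, run index)
  std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> heap;
  std::vector<std::size_t> pos(store.runs.size(), 0);  // read cursor per run
  for (std::size_t i = 0; i < store.runs.size(); ++i) {
    if (!store.runs[i].empty()) heap.emplace(store.runs[i][0], i);
  }
  Run out;
  while (!heap.empty()) {
    Entry top = heap.top();
    heap.pop();
    out.push_back(top.first);
    const Run& run = store.runs[top.second];
    std::size_t next = ++pos[top.second];
    if (next < run.size()) heap.emplace(run[next], top.second);
  }
  return out;
}

// Under plan 2 the impl only buffers rows and delegates to the two functions.
class ExternalOrderByImpl {
 public:
  explicit ExternalOrderByImpl(std::size_t buffer_size)
      : buffer_size_(buffer_size) {}

  void InputReceived(const Run& batch) {
    for (int64_t v : batch) {
      buffer_.push_back(v);
      if (buffer_.size() >= buffer_size_) Flush();
    }
  }

  Run DoFinish() {
    if (!buffer_.empty()) Flush();
    return MergeRuns(store_);
  }

 private:
  void Flush() {
    SortAndSpill(std::move(buffer_), &store_);
    buffer_.clear();  // put the moved-from vector back into a known empty state
  }

  std::size_t buffer_size_;
  Run buffer_;
  RunStore store_;
};
```

In this shape the impl class stays thin, but the first function only makes sense together with the store it mutates, and the second only makes sense on runs the first one produced.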
I would like some suggestions before I start implementing external sorting.