[GitHub] [arrow] westonpace commented on issue #35268: [C++] OrderBy with spillover

via GitHub Thu, 18 May 2023 05:14:56 -0700


westonpace commented on issue #35268:
URL: https://github.com/apache/arrow/issues/35268#issuecomment-1552967192


   `order_by_sink` is the old way.  We should be migrating to use `order_by` in 
`order_by_node.cc`.  You should make these changes there.
   
   > It will be a little weird, cause I think the two kernels don't conform 
arrow's design principle for kernel.
   
   I agree that compute kernels are not best for some of this
   
   > we should sort the buffer and write the sorted data to disk.
   
   This can be a kernel (this already exists with SortIndices).
   
   > Implement two kernels in arrow::compute, one does the spillover things
   
   Kernels should not write to disk.  I think we should create a separate 
abstraction, a spilling accumulation queue, that writes to disk.  For example, 
it could be something like...
   
   ```
   class OrderedSpillingAccumulationQueue {
   public:
     
     OrderedSpillingAccumulationQueue(int64_t buffer_size);
   
     // Inserts a batch into the queue.  This may trigger a write to disk if 
enough data is accumulated
     // If it does, then SpillCount should be incremented before this method 
returns (but the write can
     // happen in the background, asynchronously)
     Status InsertBatch(ExecBatch batch);
   
     // The number of files that have been written to disk.  This should also 
include any data in memory
     // so it will be the number of files written to disk + 1 if there is 
in-memory data.
     int SpillCount();
   
     Future<ExecBatch> FetchBatch(int spill_index, int  index_in_spill);
   };
   ```
   
   The n-way merge can then call `FetchBatch` appropriately.  An n-way merge is 
going to be challenging to implement performantly because it is not a columnar 
algorithm.  Thinking about this more, the n-way merge kernel will probably not 
be a compute function.  You will probably want to use something like 
`ExecBatchBuilder` to accumulate the results.
   
   Keep in mind that there is another approach which will not require an n-way 
merge (external distribution sort).  This approach may be simpler to implement.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [arrow] westonpace commented on issue #35268: [C++] OrderBy with spillover

Reply via email to