[GitHub] [arrow-datafusion] yjshen commented on pull request #1596: Consolidate sort and external_sort

GitBox Wed, 19 Jan 2022 20:35:03 -0800


yjshen commented on pull request #1596:
URL: 
https://github.com/apache/arrow-datafusion/pull/1596#issuecomment-1017112437



   @alamb I find some of my messages for you are lost, maybe because of the 
network.
   
   Yes, I think the PR is ready for another round of review now.
   
   1. for `SortKeyCursor`: 
   
   `AtomicUsize` is introduced for sharing the SortKeyCursors in a BinaryHeap 
as well as in a Vec of cursors. I can think of deduplicating its usage and only 
keeping it in heap, **but let me do in a follow-up PR.**
   `batch_comparators: RwLock` it's required to make SortKeyCursor impl 
PartialOrd and Ord, since the original cmp in SortKeyCursor has the mut 
signature, but Ord needs it to be immutable.  **Do you have any suggestions on 
simplifying this?**
   
   2. for `AggregatedMetricsSet`:
   
   > I don't really understand this code -- it appears to be merging in new 
(zero'd) metrics to sef.all_metrics`? Why not just return self.all_metrics ?
   
   This means to aggregates all metrics during a complex operation, which is 
composed of multiple stages and each stage reports its statistics separately. 
For the sort case here: when the dataset is more significant than available 
memory, it will report multiple in-mem sort metrics and final merge-sort 
metrics from `SortPreservingMergeStream`.
   Therefore, We need a separation of metrics for which are final metrics (for 
output_rows accumulation), and which are intermediate metrics that we only 
account for elapsed_compute time.  So I create a zero'd metrics here and gather 
`elapse_time` and `output_rows` separately.
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [arrow-datafusion] yjshen commented on pull request #1596: Consolidate sort and external_sort

Reply via email to