[GitHub] [arrow-datafusion] yjshen edited a comment on pull request #1596: Consolidate sort and external_sort

GitBox Fri, 21 Jan 2022 02:53:22 -0800


yjshen edited a comment on pull request #1596:
URL: 
https://github.com/apache/arrow-datafusion/pull/1596#issuecomment-1017112437



   @alamb I find some of my messages for you are lost, maybe because of my 
unstable network.
   
   Yes, I think the PR is ready for another round of review now.
   
   1. for `SortKeyCursor`: 
   
   `AtomicUsize` is introduced for sharing the SortKeyCursors in a BinaryHeap 
and a Vec of cursors. I can think of deduplicating its usage and only keeping 
it in heap, **but let me do in a follow-up PR.**. #1624 
   `batch_comparators: RwLock` it's required to make SortKeyCursor impl 
PartialOrd and Ord, since the original cmp in SortKeyCursor has the mut 
signature, but Ord needs it to be immutable.  **Do you have any suggestions on 
simplifying this?**
   
   2. for `AggregatedMetricsSet`:
   
   > I don't really understand this code -- it appears to be merging in new 
(zero'd) metrics to sef.all_metrics`? Why not just return self.all_metrics ?
   
   This means aggregating all metrics during a complex operation, composed of 
multiple steps, and each step reports its statistics separately. For the sort 
case here: when the dataset is more significant than available memory, it will 
report multiple in-mem sort metrics and final merge-sort metrics from 
`SortPreservingMergeStream`.
   Therefore, We need a separation of metrics: final metrics (for output_rows 
accumulation), and intermediate metrics that we only account for 
elapsed_compute time. So I create a zero'd metrics and gather `elapse_time` and 
`output_rows` separately.
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [arrow-datafusion] yjshen edited a comment on pull request #1596: Consolidate sort and external_sort

Reply via email to