[GitHub] [arrow-datafusion] alamb commented on a diff in pull request #7130: Account for memory usage in SortPreservingMerge (#5885)

via GitHub Wed, 09 Aug 2023 06:50:45 -0700


alamb commented on code in PR #7130:
URL: https://github.com/apache/arrow-datafusion/pull/7130#discussion_r1288509219



##########
datafusion/core/tests/memory_limit.rs:
##########
@@ -17,12 +17,21 @@
 
 //! This module contains tests for limiting memory at runtime in DataFusion
 
-use arrow::datatypes::SchemaRef;
+use arrow::datatypes::{Int32Type, SchemaRef};
 use arrow::record_batch::RecordBatch;
+use arrow_array::{ArrayRef, DictionaryArray};
+use arrow_schema::SortOptions;
+use async_trait::async_trait;
+use datafusion::assert_batches_eq;
 use datafusion::physical_optimizer::PhysicalOptimizerRule;
+use datafusion::physical_plan::common::batch_byte_size;

Review Comment:
   That is a good idea -- I will do so in a follow on PR



##########
datafusion/core/src/physical_plan/sorts/sort.rs:
##########
@@ -210,23 +210,37 @@ struct ExternalSorter {
     /// If Some, the maximum number of output rows that will be
     /// produced.
     fetch: Option<usize>,
-    /// Memory usage tracking
+    /// Reservation for in_mem_batches
     reservation: MemoryReservation,
-    /// The partition id that this Sort is handling (for identification)
-    partition_id: usize,
-    /// A handle to the runtime to get Disk spill files
+    /// Reservation for the merging of in-memory batches. If the sort
+    /// might spill, `sort_spill_reservation_bytes` will be
+    /// pre-reserved to ensure there is some space for this sort/merg.

Review Comment:
   in f87705e630 



##########
datafusion/common/src/config.rs:
##########
@@ -235,6 +235,23 @@ config_namespace! {
         ///
         /// Defaults to the number of CPU cores on the system
         pub planning_concurrency: usize, default = num_cpus::get()
+
+        /// How much memory is set aside, for each spillable sort, to
+        /// ensure an in-memory merge can occur. This setting has no
+        /// if the sort can not spill (there is no `DiskManager`
+        /// configured)
+        ///
+        /// As part of spilling to disk, in memory data must be sorted
+        /// / merged before writing the file. This in-memory
+        /// sort/merge requires memory as well, so To avoid allocating
+        /// once memory is exhausted, DataFusion sets aside this
+        /// many bytes before.

Review Comment:
   That is a much better wording. Thank you @yjshen  -- in f87705e630



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [arrow-datafusion] alamb commented on a diff in pull request #7130: Account for memory usage in SortPreservingMerge (#5885)

Reply via email to