Re: [PR] Fix record batch memory size double counting [datafusion]

via GitHub Tue, 12 Nov 2024 17:29:05 -0800


blaginin commented on code in PR #13377:
URL: https://github.com/apache/datafusion/pull/13377#discussion_r1838621019



##########
datafusion/physical-plan/src/spill.rs:
##########
@@ -175,11 +247,103 @@ mod tests {
         )?;
 
         let file = BufReader::new(File::open(spill_file.path())?);
-        let reader = arrow::ipc::reader::FileReader::try_new(file, None)?;
+        let reader = FileReader::try_new(file, None)?;
 
         assert_eq!(reader.num_batches(), 4);
         assert_eq!(reader.schema(), schema);
 
         Ok(())
     }
+
+    #[test]
+    fn test_get_record_batch_memory_size() {
+        // Create a simple record batch with two columns
+        let schema = Arc::new(Schema::new(vec![
+            Field::new("ints", DataType::Int32, true),
+            Field::new("float64", DataType::Float64, false),
+        ]));
+
+        let int_array =
+            Int32Array::from(vec![Some(1), Some(2), Some(3), Some(4), Some(5)]);
+        let float64_array = Float64Array::from(vec![1.0, 2.0, 3.0, 4.0, 5.0]);
+
+        let batch = RecordBatch::try_new(
+            schema,
+            vec![Arc::new(int_array), Arc::new(float64_array)],
+        )
+        .unwrap();
+
+        let size = get_record_batch_memory_size(&batch);
+        assert_eq!(size, 60);

Review Comment:
   My only concern with this PR is that the result of 
`get_record_batch_memory_size` differs from `get_array_memory_size`. For 
example, here `batch.get_array_memory_size()` would return 252 instead of 60.
   
   This could be dangerous because the project would end up with two different 
methods of calculating memory sizes. I can imagine a future scenario where we 
_reserve_ memory based on one calculation and _shrink_ the reservation using 
the result of the other. The difference may be small each time, but over many 
repetitions or a large dataset the accounting would drift and behave almost 
like a memory leak (in the reserved byte count rather than in actually 
allocated memory), which would make debugging very challenging.
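
   To make the concern concrete, here is a minimal sketch of the drift, using 
only the two numbers quoted above (252 and 60). `ToyReservation` and 
`simulate_drift` are hypothetical names invented for illustration; this is not 
DataFusion's memory pool API, just a plain counter that grows by one estimate 
and shrinks by the other.

```rust
/// Toy stand-in for a memory reservation (hypothetical; NOT DataFusion's API).
#[derive(Debug, Default)]
struct ToyReservation {
    reserved: usize,
}

impl ToyReservation {
    /// Account for `bytes` more memory.
    fn grow(&mut self, bytes: usize) {
        self.reserved += bytes;
    }

    /// Release `bytes` from the accounting, never going below zero.
    fn shrink(&mut self, bytes: usize) {
        self.reserved = self.reserved.saturating_sub(bytes);
    }
}

// Sizes quoted in the review comment for the same five-row batch.
const SIZE_BY_ARRAY_METHOD: usize = 252; // batch.get_array_memory_size()
const SIZE_BY_BATCH_HELPER: usize = 60; // get_record_batch_memory_size(&batch)

/// Reserve with one estimate and release with the other, `iterations` times.
/// Returns the bytes still accounted for after every batch has been released.
fn simulate_drift(iterations: usize) -> usize {
    let mut reservation = ToyReservation::default();
    for _ in 0..iterations {
        reservation.grow(SIZE_BY_ARRAY_METHOD);
        // The batch is gone at this point, but only the smaller estimate
        // is handed back to the accounting.
        reservation.shrink(SIZE_BY_BATCH_HELPER);
    }
    reservation.reserved
}

fn main() {
    let leftover = simulate_drift(1_000);
    println!("bytes still reserved after 1000 batches: {leftover}");
}
```

   Each round leaves 252 - 60 = 192 bytes in the counter, so the reserved total 
grows linearly with the number of batches even though no batch is still held.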



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

