alamb commented on code in PR #8489:
URL: https://github.com/apache/arrow-datafusion/pull/8489#discussion_r1427237213


##########
datafusion/physical-plan/src/common.rs:
##########
@@ -685,20 +694,30 @@ mod tests {
         let schema = Arc::new(Schema::new(vec![
             Field::new("f32", DataType::Float32, false),
             Field::new("f64", DataType::Float64, false),
+            Field::new("u64", DataType::UInt64, false),
         ]));
         let batch = RecordBatch::try_new(
             Arc::clone(&schema),
             vec![
                 Arc::new(Float32Array::from(vec![1., 2., 3.])),
                 Arc::new(Float64Array::from(vec![9., 8., 7.])),
+                Arc::new(UInt64Array::from(vec![4, 5, 6])),
             ],
         )?;
+
+        // just select f32,f64
+        let select_projection = Some(vec![0, 1]);
+        let byte_size = batch
+            .project(&select_projection.clone().unwrap())
+            .unwrap()
+            .get_array_memory_size();
+
         let actual =
-            compute_record_batch_statistics(&[vec![batch]], &schema, Some(vec![0, 1]));
+            compute_record_batch_statistics(&[vec![batch]], &schema, select_projection);
 
-        let mut expected = Statistics {
+        let expected = Statistics {
             num_rows: Precision::Exact(3),
-            total_byte_size: Precision::Exact(464), // this might change a bit if the way we compute the size changes
+            total_byte_size: Precision::Exact(byte_size),

Review Comment:
   I think this is ok and a nice way to make the code less brittle to future changes in arrow's layout
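The pattern being endorsed here — deriving the expected value from the same measuring function the assertion exercises, instead of hardcoding a layout-dependent constant like `464` — can be sketched without arrow. The `project` and `memory_size` functions below are hypothetical stand-ins for `RecordBatch::project` and `get_array_memory_size`, not arrow's actual API:

```rust
// Toy stand-ins: a "batch" is a Vec of f64 columns, `project` selects a
// subset of columns by index, and `memory_size` plays the role of
// get_array_memory_size().
fn project(batch: &[Vec<f64>], indices: &[usize]) -> Vec<Vec<f64>> {
    indices.iter().map(|&i| batch[i].clone()).collect()
}

fn memory_size(batch: &[Vec<f64>]) -> usize {
    batch
        .iter()
        .map(|col| col.len() * std::mem::size_of::<f64>())
        .sum()
}

fn main() {
    let batch = vec![vec![1., 2., 3.], vec![9., 8., 7.], vec![4., 5., 6.]];
    let projection = vec![0, 1]; // select the first two columns only

    // As in the PR: compute the expected size through the same code path,
    // rather than asserting a hardcoded byte count tied to today's layout.
    let byte_size = memory_size(&project(&batch, &projection));
    assert_eq!(byte_size, 2 * 3 * 8); // 2 columns * 3 rows * 8 bytes
    println!("{byte_size}"); // prints 48
}
```

If the representation ever changes (as arrow's buffer layout might), only the measured value shifts, and the test keeps asserting the property that actually matters: the statistics match what the projection reports.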



##########
datafusion/sqllogictest/test_files/groupby.slt:
##########
@@ -2021,14 +2021,15 @@ SortPreservingMergeExec: [col0@0 ASC NULLS LAST]
 ----------RepartitionExec: partitioning=Hash([col0@0, col1@1, col2@2], 4), input_partitions=4
 ------------AggregateExec: mode=Partial, gby=[col0@0 as col0, col1@1 as col1, col2@2 as col2], aggr=[LAST_VALUE(r.col1)], ordering_mode=PartiallySorted([0])
 --------------SortExec: expr=[col0@3 ASC NULLS LAST]
-----------------CoalesceBatchesExec: target_batch_size=8192
-------------------HashJoinExec: mode=Partitioned, join_type=Inner, on=[(col0@0, col0@0)]
---------------------CoalesceBatchesExec: target_batch_size=8192
-----------------------RepartitionExec: partitioning=Hash([col0@0], 4), input_partitions=1
-------------------------MemoryExec: partitions=1, partition_sizes=[3]
---------------------CoalesceBatchesExec: target_batch_size=8192
-----------------------RepartitionExec: partitioning=Hash([col0@0], 4), input_partitions=1
-------------------------MemoryExec: partitions=1, partition_sizes=[3]
+----------------ProjectionExec: expr=[col0@2 as col0, col1@3 as col1, col2@4 as col2, col0@0 as col0, col1@1 as col1]

Review Comment:
   Looks to me like this change is due to the fact that the join inputs were reordered, and this projection puts the columns back in the expected order.

   Same thing with the projection below.
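   The effect described above can be illustrated with a small sketch (plain slices, not DataFusion's actual `ProjectionExec`): when the join sides are swapped, the output columns arrive as [right..., left...], and a projection with the right index list restores the original order.

   ```rust
   // Hypothetical illustration: `project` reorders a row's columns by index,
   // mirroring what the extra ProjectionExec does in the plan above.
   fn project<T: Clone>(row: &[T], indices: &[usize]) -> Vec<T> {
       indices.iter().map(|&i| row[i].clone()).collect()
   }

   fn main() {
       // Expected output order: col0, col1, col2, then the two r.* columns.
       // After the join inputs were reordered, the row arrives with the
       // r.* columns first.
       let swapped = ["r.col0", "r.col1", "col0", "col1", "col2"];

       // Index list matching expr=[col0@2, col1@3, col2@4, col0@0, col1@1]
       // from the plan in the diff.
       let restored = project(&swapped, &[2, 3, 4, 0, 1]);
       assert_eq!(restored, ["col0", "col1", "col2", "r.col0", "r.col1"]);
       println!("{restored:?}");
   }
   ```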



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
