GitHub user zheniasigayev added a comment to the discussion: Best practices for memory-efficient deduplication of pre-sorted Parquet files
Note that both physical plans contain an `AggregateExec` with `mode=Partial` feeding the final aggregation.
#### Addressing Question / Query 1)
```
logical_plan
  CopyTo: format=parquet output_url=/tmp/result.parquet options: (format.compression zstd(1))
    Sort: example.col_1 ASC NULLS LAST, example.col_2 ASC NULLS LAST
      Projection: example.col_1, example.col_2, example.col_3, example.col_4, example.col_5, example.col_6, first_value(example.col_7) AS col_7, first_value(example.col_8) AS col_8
        Aggregate: groupBy=[[example.col_1, example.col_2, example.col_3, example.col_4, example.col_5, example.col_6]], aggr=[[first_value(example.col_7), first_value(example.col_8)]]
          TableScan: example projection=[col_1, col_2, col_3, col_4, col_5, col_6, col_7, col_8]

physical_plan
  DataSinkExec: sink=ParquetSink(file_groups=[])
    SortPreservingMergeExec: [col_1@0 ASC NULLS LAST, col_2@1 ASC NULLS LAST]
      SortExec: expr=[col_1@0 ASC NULLS LAST, col_2@1 ASC NULLS LAST], preserve_partitioning=[true]
        ProjectionExec: expr=[col_1@0 as col_1, col_2@1 as col_2, col_3@2 as col_3, col_4@3 as col_4, col_5@4 as col_5, col_6@5 as col_6, first_value(example.col_7)@6 as col_7, first_value(example.col_8)@7 as col_8]
          AggregateExec: mode=FinalPartitioned, gby=[col_1@0 as col_1, col_2@1 as col_2, col_3@2 as col_3, col_4@3 as col_4, col_5@4 as col_5, col_6@5 as col_6], aggr=[first_value(example.col_7), first_value(example.col_8)]
            CoalesceBatchesExec: target_batch_size=8192
              RepartitionExec: partitioning=Hash([col_1@0, col_2@1, col_3@2, col_4@3, col_5@4, col_6@5], 10), input_partitions=10
                AggregateExec: mode=Partial, gby=[col_1@0 as col_1, col_2@1 as col_2, col_3@2 as col_3, col_4@3 as col_4, col_5@4 as col_5, col_6@5 as col_6], aggr=[first_value(example.col_7), first_value(example.col_8)]
                  DataSourceExec: file_groups={10 groups: [[tmp/redacted/reproducible_data_0.parquet:0..12537303, tmp/redacted/reproducible_data_1.parquet:0..6245726], [tmp/redacted/reproducible_data_1.parquet:6245726..12518047, tmp/redacted/reproducible_data_10.parquet:0..12510708], [tmp/redacted/reproducible_data_10.parquet:12510708..12530931, tmp/redacted/reproducible_data_11.parquet:0..12518206, tmp/redacted/reproducible_data_12.parquet:0..6244600], [tmp/redacted/reproducible_data_12.parquet:6244600..12522074, tmp/redacted/reproducible_data_13.parquet:0..12505555], [tmp/redacted/reproducible_data_13.parquet:12505555..12523871, tmp/redacted/reproducible_data_14.parquet:0..12515635, tmp/redacted/reproducible_data_2.parquet:0..6249078], ...]}, projection=[col_1, col_2, col_3, col_4, col_5, col_6, col_7, col_8], file_type=parquet
```
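For reference, this logical plan corresponds to a query along the following lines. This is a reconstruction from the plan above (column names, output path, and compression come from the plan); the exact `COPY ... OPTIONS` spelling is an assumption and varies between DataFusion versions:

```sql
-- Hypothetical reconstruction of Query 1, derived from the logical plan.
COPY (
    SELECT
        col_1, col_2, col_3, col_4, col_5, col_6,
        first_value(col_7) AS col_7,
        first_value(col_8) AS col_8
    FROM example
    GROUP BY col_1, col_2, col_3, col_4, col_5, col_6
    ORDER BY col_1, col_2
)
TO '/tmp/result.parquet'
OPTIONS ('format.compression' 'zstd(1)');
```

Note that `first_value` without an `ORDER BY` inside the aggregate picks an arbitrary row per group, so which duplicate survives is not deterministic.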
#### Addressing Question / Query 2)
```
logical_plan
  CopyTo: format=parquet output_url=/tmp/result_part2.parquet options: (format.compression zstd(1))
    Sort: example.col_1 ASC NULLS LAST, example.col_2 ASC NULLS LAST
      Aggregate: groupBy=[[example.col_1, example.col_2, example.col_3, example.col_4, example.col_5, example.col_6]], aggr=[[]]
        TableScan: example projection=[col_1, col_2, col_3, col_4, col_5, col_6]

physical_plan
  DataSinkExec: sink=ParquetSink(file_groups=[])
    SortPreservingMergeExec: [col_1@0 ASC NULLS LAST, col_2@1 ASC NULLS LAST]
      SortExec: expr=[col_1@0 ASC NULLS LAST, col_2@1 ASC NULLS LAST], preserve_partitioning=[true]
        AggregateExec: mode=FinalPartitioned, gby=[col_1@0 as col_1, col_2@1 as col_2, col_3@2 as col_3, col_4@3 as col_4, col_5@4 as col_5, col_6@5 as col_6], aggr=[]
          CoalesceBatchesExec: target_batch_size=8192
            RepartitionExec: partitioning=Hash([col_1@0, col_2@1, col_3@2, col_4@3, col_5@4, col_6@5], 10), input_partitions=10
              AggregateExec: mode=Partial, gby=[col_1@0 as col_1, col_2@1 as col_2, col_3@2 as col_3, col_4@3 as col_4, col_5@4 as col_5, col_6@5 as col_6], aggr=[]
                DataSourceExec: file_groups={10 groups: [[tmp/redacted/reproducible_data_0.parquet:0..12537303, tmp/redacted/reproducible_data_1.parquet:0..6245726], [tmp/redacted/reproducible_data_1.parquet:6245726..12518047, tmp/redacted/reproducible_data_10.parquet:0..12510708], [tmp/redacted/reproducible_data_10.parquet:12510708..12530931, tmp/redacted/reproducible_data_11.parquet:0..12518206, tmp/redacted/reproducible_data_12.parquet:0..6244600], [tmp/redacted/reproducible_data_12.parquet:6244600..12522074, tmp/redacted/reproducible_data_13.parquet:0..12505555], [tmp/redacted/reproducible_data_13.parquet:12505555..12523871, tmp/redacted/reproducible_data_14.parquet:0..12515635, tmp/redacted/reproducible_data_2.parquet:0..6249078], ...]}, projection=[col_1, col_2, col_3, col_4, col_5, col_6], file_type=parquet
```
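Here the `Aggregate` with an empty aggregate list (`aggr=[[]]`) is what `SELECT DISTINCT` (or an aggregate-free `GROUP BY` over all output columns) lowers to. A hedged reconstruction of the second query, with the same caveat that the `COPY ... OPTIONS` spelling is version-dependent:

```sql
-- Hypothetical reconstruction of Query 2, derived from the logical plan.
COPY (
    SELECT DISTINCT col_1, col_2, col_3, col_4, col_5, col_6
    FROM example
    ORDER BY col_1, col_2
)
TO '/tmp/result_part2.parquet'
OPTIONS ('format.compression' 'zstd(1)');
```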
GitHub link:
https://github.com/apache/datafusion/discussions/16776#discussioncomment-13882212