GitHub user zheniasigayev added a comment to the discussion: Best practices for memory-efficient deduplication of pre-sorted Parquet files
Note that both physical plans contain an `AggregateExec` with `mode=Partial` feeding the final aggregation.
#### Addressing Question / Query 1)
```
logical_plan
  CopyTo: format=parquet output_url=/tmp/result.parquet options: (format.compression zstd(1))
    Sort: example.col_1 ASC NULLS LAST, example.col_2 ASC NULLS LAST
      Projection: example.col_1, example.col_2, example.col_3, example.col_4, example.col_5, example.col_6, first_value(example.col_7) AS col_7, first_value(example.col_8) AS col_8
        Aggregate: groupBy=[[example.col_1, example.col_2, example.col_3, example.col_4, example.col_5, example.col_6]], aggr=[[first_value(example.col_7), first_value(example.col_8)]]
          TableScan: example projection=[col_1, col_2, col_3, col_4, col_5, col_6, col_7, col_8]

physical_plan
  DataSinkExec: sink=ParquetSink(file_groups=[])
    SortPreservingMergeExec: [col_1@0 ASC NULLS LAST, col_2@1 ASC NULLS LAST]
      SortExec: expr=[col_1@0 ASC NULLS LAST, col_2@1 ASC NULLS LAST], preserve_partitioning=[true]
        ProjectionExec: expr=[col_1@0 as col_1, col_2@1 as col_2, col_3@2 as col_3, col_4@3 as col_4, col_5@4 as col_5, col_6@5 as col_6, first_value(example.col_7)@6 as col_7, first_value(example.col_8)@7 as col_8]
          AggregateExec: mode=FinalPartitioned, gby=[col_1@0 as col_1, col_2@1 as col_2, col_3@2 as col_3, col_4@3 as col_4, col_5@4 as col_5, col_6@5 as col_6], aggr=[first_value(example.col_7), first_value(example.col_8)]
            CoalesceBatchesExec: target_batch_size=8192
              RepartitionExec: partitioning=Hash([col_1@0, col_2@1, col_3@2, col_4@3, col_5@4, col_6@5], 10), input_partitions=10
                AggregateExec: mode=Partial, gby=[col_1@0 as col_1, col_2@1 as col_2, col_3@2 as col_3, col_4@3 as col_4, col_5@4 as col_5, col_6@5 as col_6], aggr=[first_value(example.col_7), first_value(example.col_8)]
                  DataSourceExec: file_groups={10 groups: [[tmp/redacted/reproducible_data_0.parquet:0..12537303, tmp/redacted/reproducible_data_1.parquet:0..6245726], [tmp/redacted/reproducible_data_1.parquet:6245726..12518047, tmp/redacted/reproducible_data_10.parquet:0..12510708], [tmp/redacted/reproducible_data_10.parquet:12510708..12530931, tmp/redacted/reproducible_data_11.parquet:0..12518206, tmp/redacted/reproducible_data_12.parquet:0..6244600], [tmp/redacted/reproducible_data_12.parquet:6244600..12522074, tmp/redacted/reproducible_data_13.parquet:0..12505555], [tmp/redacted/reproducible_data_13.parquet:12505555..12523871, tmp/redacted/reproducible_data_14.parquet:0..12515635, tmp/redacted/reproducible_data_2.parquet:0..6249078], ...]}, projection=[col_1, col_2, col_3, col_4, col_5, col_6, col_7, col_8], file_type=parquet
```
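For reference, this logical plan corresponds to a query along the following lines. This is a reconstruction from the plan above (column names, output path, and compression come from the plan); the exact `COPY ... OPTIONS` spelling is an assumption and varies between DataFusion versions:

```sql
-- Hypothetical reconstruction of Query 1, derived from the logical plan.
COPY (
    SELECT
        col_1, col_2, col_3, col_4, col_5, col_6,
        first_value(col_7) AS col_7,
        first_value(col_8) AS col_8
    FROM example
    GROUP BY col_1, col_2, col_3, col_4, col_5, col_6
    ORDER BY col_1, col_2
)
TO '/tmp/result.parquet'
OPTIONS ('format.compression' 'zstd(1)');
```

Note that `first_value` without an `ORDER BY` inside the aggregate picks an arbitrary row per group, so which duplicate survives is not deterministic.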
#### Addressing Question / Query 2)
```
logical_plan
  CopyTo: format=parquet output_url=/tmp/result_part2.parquet options: (format.compression zstd(1))
    Sort: example.col_1 ASC NULLS LAST, example.col_2 ASC NULLS LAST
      Aggregate: groupBy=[[example.col_1, example.col_2, example.col_3, example.col_4, example.col_5, example.col_6]], aggr=[[]]
        TableScan: example projection=[col_1, col_2, col_3, col_4, col_5, col_6]

physical_plan
  DataSinkExec: sink=ParquetSink(file_groups=[])
    SortPreservingMergeExec: [col_1@0 ASC NULLS LAST, col_2@1 ASC NULLS LAST]
      SortExec: expr=[col_1@0 ASC NULLS LAST, col_2@1 ASC NULLS LAST], preserve_partitioning=[true]
        AggregateExec: mode=FinalPartitioned, gby=[col_1@0 as col_1, col_2@1 as col_2, col_3@2 as col_3, col_4@3 as col_4, col_5@4 as col_5, col_6@5 as col_6], aggr=[]
          CoalesceBatchesExec: target_batch_size=8192
            RepartitionExec: partitioning=Hash([col_1@0, col_2@1, col_3@2, col_4@3, col_5@4, col_6@5], 10), input_partitions=10
              AggregateExec: mode=Partial, gby=[col_1@0 as col_1, col_2@1 as col_2, col_3@2 as col_3, col_4@3 as col_4, col_5@4 as col_5, col_6@5 as col_6], aggr=[]
                DataSourceExec: file_groups={10 groups: [[tmp/redacted/reproducible_data_0.parquet:0..12537303, tmp/redacted/reproducible_data_1.parquet:0..6245726], [tmp/redacted/reproducible_data_1.parquet:6245726..12518047, tmp/redacted/reproducible_data_10.parquet:0..12510708], [tmp/redacted/reproducible_data_10.parquet:12510708..12530931, tmp/redacted/reproducible_data_11.parquet:0..12518206, tmp/redacted/reproducible_data_12.parquet:0..6244600], [tmp/redacted/reproducible_data_12.parquet:6244600..12522074, tmp/redacted/reproducible_data_13.parquet:0..12505555], [tmp/redacted/reproducible_data_13.parquet:12505555..12523871, tmp/redacted/reproducible_data_14.parquet:0..12515635, tmp/redacted/reproducible_data_2.parquet:0..6249078], ...]}, projection=[col_1, col_2, col_3, col_4, col_5, col_6], file_type=parquet
```
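Here the `Aggregate` with an empty aggregate list (`aggr=[[]]`) is what `SELECT DISTINCT` (or an aggregate-free `GROUP BY` over all output columns) lowers to. A hedged reconstruction of the second query, with the same caveat that the `COPY ... OPTIONS` spelling is version-dependent:

```sql
-- Hypothetical reconstruction of Query 2, derived from the logical plan.
COPY (
    SELECT DISTINCT col_1, col_2, col_3, col_4, col_5, col_6
    FROM example
    ORDER BY col_1, col_2
)
TO '/tmp/result_part2.parquet'
OPTIONS ('format.compression' 'zstd(1)');
```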
GitHub link:
https://github.com/apache/datafusion/discussions/16776#discussioncomment-13882212