GitHub user zheniasigayev added a comment to the discussion: Best practices for
memory-efficient deduplication of pre-sorted Parquet files
Addressing Question 1.
Here is the table setup and the query plan for the original query:
```sql
CREATE EXTERNAL TABLE example (
    col_1 VARCHAR(50) NOT NULL,
    col_2 BIGINT NOT NULL,
    col_3 VARCHAR(50),
    col_4 VARCHAR(50),
    col_5 VARCHAR(50),
    col_6 VARCHAR(100) NOT NULL,
    col_7 VARCHAR(50),
    col_8 DOUBLE
)
WITH ORDER (col_1 ASC, col_2 ASC)
STORED AS PARQUET
LOCATION '/tmp/redacted/*.parquet';

EXPLAIN COPY (
    SELECT
        col_1,
        col_2,
        col_3,
        col_4,
        col_5,
        col_6,
        first_value(col_7) AS col_7,
        first_value(col_8) AS col_8
    FROM example
    GROUP BY col_1, col_2, col_3, col_4, col_5, col_6
    ORDER BY col_1 ASC, col_2 ASC
)
TO '/tmp/result.parquet'
STORED AS PARQUET
OPTIONS (compression 'zstd(1)');
```
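For intuition, the deduplication this query performs can be sketched in plain Python (no DataFusion involved, and the rows here are made up for illustration): since the input is already sorted on the grouping key, a single streaming pass can keep the first `col_7`/`col_8` per distinct `(col_1..col_6)` key, which is exactly what makes a sort-based plan memory-efficient compared to a hash aggregate.

```python
# Streaming "first value per group" over pre-sorted rows,
# mirroring GROUP BY col_1..col_6 + first_value(col_7), first_value(col_8).
from itertools import groupby

# Hypothetical rows, pre-sorted on the grouping key (as the Parquet
# files are sorted on col_1, col_2).
rows = [
    ("a", 1, "x", "x", "x", "k", "first7", 1.0),
    ("a", 1, "x", "x", "x", "k", "dup7",   2.0),  # duplicate key
    ("b", 2, "y", "y", "y", "m", "only7",  3.0),
]

def dedup_first(rows):
    # groupby only works like this because the input is sorted on the key;
    # each group is consumed lazily, so memory stays bounded by one group.
    for key, group in groupby(rows, key=lambda r: r[:6]):
        first = next(group)
        yield key + (first[6], first[7])

result = list(dedup_first(rows))
```

The same logic applied by a hash aggregate must instead hold every distinct key in memory at once, which is the trade-off the plan below makes visible.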
The resulting `EXPLAIN` output:
```
+---------------+-------------------------------+
| plan_type | plan |
+---------------+-------------------------------+
| physical_plan | ┌───────────────────────────┐ |
| | │ DataSinkExec │ |
| | └─────────────┬─────────────┘ |
| | ┌─────────────┴─────────────┐ |
| | │ SortPreservingMergeExec │ |
| | │ -------------------- │ |
| | │ col_1 ASC NULLS LAST, │ |
| | │ col_2 ASC NULLS LAST │ |
| | └─────────────┬─────────────┘ |
| | ┌─────────────┴─────────────┐ |
| | │ SortExec │ |
| | │ -------------------- │ |
| | │ col_1@0 ASC NULLS LAST, │ |
| | │ col_2@1 ASC NULLS LAST │ |
| | └─────────────┬─────────────┘ |
| | ┌─────────────┴─────────────┐ |
| | │ ProjectionExec │ |
| | │ -------------------- │ |
| | │ col_1: col_1 │ |
| | │ col_2: col_2 │ |
| | │ col_3: col_3 │ |
| | │ col_4: col_4 │ |
| | │ col_5: col_5 │ |
| | │ col_6: col_6 │ |
| | │ │ |
| | │ col_7: │ |
| | │ first_value(example.col_7)│ |
| | │ │ |
| | │ col_8: │ |
| | │ first_value(example.col_8)│ |
| | └─────────────┬─────────────┘ |
| | ┌─────────────┴─────────────┐ |
| | │ AggregateExec │ |
| | │ -------------------- │ |
| | │ aggr: │ |
| | │ first_value(example.col_7)│ |
| | │ , first_value(example │ |
| | │ .col_8) │ |
| | │ │ |
| | │ group_by: │ |
| | │ col_1, col_2, col_3, col_4│ |
| | │ , col_5, col_6 │ |
| | │ │ |
| | │ mode: │ |
| | │ FinalPartitioned │ |
| | └─────────────┬─────────────┘ |
| | ┌─────────────┴─────────────┐ |
| | │ CoalesceBatchesExec │ |
| | │ -------------------- │ |
| | │ target_batch_size: │ |
| | │ 8192 │ |
| | └─────────────┬─────────────┘ |
| | ┌─────────────┴─────────────┐ |
| | │ RepartitionExec │ |
| | │ -------------------- │ |
| | │ partition_count(in->out): │ |
| | │ 10 -> 10 │ |
| | │ │ |
| | │ partitioning_scheme: │ |
| | │ Hash([col_1@0, col_2@1, │ |
| | │ col_3@2, col_4@3, │ |
| | │ col_5@4, col_6@5], │ |
| | │ 10) │ |
| | └─────────────┬─────────────┘ |
| | ┌─────────────┴─────────────┐ |
| | │ AggregateExec │ |
| | │ -------------------- │ |
| | │ aggr: │ |
| | │ first_value(example.col_7)│ |
| | │ , first_value(example │ |
| | │ .col_8) │ |
| | │ │ |
| | │ group_by: │ |
| | │ col_1, col_2, col_3, col_4│ |
| | │ , col_5, col_6 │ |
| | │ │ |
| | │ mode: Partial │ |
| | └─────────────┬─────────────┘ |
| | ┌─────────────┴─────────────┐ |
| | │ DataSourceExec │ |
| | │ -------------------- │ |
| | │ files: 24 │ |
| | │ format: parquet │ |
| | └───────────────────────────┘ |
| | |
+---------------+-------------------------------+
```
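Note that the plan hash-repartitions on all six group keys and then re-sorts on `col_1, col_2` even though the table was declared `WITH ORDER`: the hash exchange destroys the file-level ordering, so a full `SortExec` reappears before the merge. Two configuration knobs are worth experimenting with here (option names assume a recent DataFusion version; verify what is available in yours with `SHOW ALL` before relying on them):

```sql
-- Prefer plans that exploit an existing sort order instead of re-sorting.
SET datafusion.optimizer.prefer_existing_sort = true;
-- Fewer target partitions can eliminate the RepartitionExec/SortExec pair
-- entirely, at the cost of less parallelism.
SET datafusion.execution.target_partitions = 1;
```

Re-running the `EXPLAIN` after each change shows whether the sort was actually removed from the plan.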
GitHub link:
https://github.com/apache/datafusion/discussions/16776#discussioncomment-13780110
----
This is an automatically sent email for [email protected].
To unsubscribe, please send an email to:
[email protected]