GitHub user zheniasigayev added a comment to the discussion: Best practices for
memory-efficient deduplication of pre-sorted Parquet files
Addressing Question 1.
Here is the table setup and the query plan for the original query:
```sql
CREATE EXTERNAL TABLE example (
    col_1 VARCHAR(50) NOT NULL,
    col_2 BIGINT NOT NULL,
    col_3 VARCHAR(50),
    col_4 VARCHAR(50),
    col_5 VARCHAR(50),
    col_6 VARCHAR(100) NOT NULL,
    col_7 VARCHAR(50),
    col_8 DOUBLE
)
WITH ORDER (col_1 ASC, col_2 ASC)
STORED AS PARQUET
LOCATION '/tmp/redacted/*.parquet';

EXPLAIN COPY (
    SELECT
        col_1,
        col_2,
        col_3,
        col_4,
        col_5,
        col_6,
        first_value(col_7) AS col_7,
        first_value(col_8) AS col_8
    FROM example
    GROUP BY col_1, col_2, col_3, col_4, col_5, col_6
    ORDER BY col_1 ASC, col_2 ASC
)
TO '/tmp/result.parquet'
STORED AS PARQUET
OPTIONS (compression 'zstd(1)');
```
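For intuition, the deduplication this query performs can be sketched in plain Python (no DataFusion involved, and the rows here are made up for illustration): since the input is already sorted on the grouping key, a single streaming pass can keep the first `col_7`/`col_8` per distinct `(col_1..col_6)` key, which is exactly what makes a sort-based plan memory-efficient compared to a hash aggregate.

```python
# Streaming "first value per group" over pre-sorted rows,
# mirroring GROUP BY col_1..col_6 + first_value(col_7), first_value(col_8).
from itertools import groupby

# Hypothetical rows, pre-sorted on the grouping key (as the Parquet
# files are sorted on col_1, col_2).
rows = [
    ("a", 1, "x", "x", "x", "k", "first7", 1.0),
    ("a", 1, "x", "x", "x", "k", "dup7",   2.0),  # duplicate key
    ("b", 2, "y", "y", "y", "m", "only7",  3.0),
]

def dedup_first(rows):
    # groupby only works like this because the input is sorted on the key;
    # each group is consumed lazily, so memory stays bounded by one group.
    for key, group in groupby(rows, key=lambda r: r[:6]):
        first = next(group)
        yield key + (first[6], first[7])

result = list(dedup_first(rows))
```

The same logic applied by a hash aggregate must instead hold every distinct key in memory at once, which is the trade-off the plan below makes visible.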
The resulting `EXPLAIN` output:
```
+---------------+-------------------------------+
| plan_type | plan |
+---------------+-------------------------------+
| physical_plan | ┌───────────────────────────┐ |
| | │ DataSinkExec │ |
| | └─────────────┬─────────────┘ |
| | ┌─────────────┴─────────────┐ |
| | │ SortPreservingMergeExec │ |
| | │ -------------------- │ |
| | │ col_1 ASC NULLS LAST, │ |
| | │ col_2 ASC NULLS LAST │ |
| | └─────────────┬─────────────┘ |
| | ┌─────────────┴─────────────┐ |
| | │ SortExec │ |
| | │ -------------------- │ |
| | │ col_1@0 ASC NULLS LAST, │ |
| | │ col_2@1 ASC NULLS LAST │ |
| | └─────────────┬─────────────┘ |
| | ┌─────────────┴─────────────┐ |
| | │ ProjectionExec │ |
| | │ -------------------- │ |
| | │ col_1: col_1 │ |
| | │ col_2: col_2 │ |
| | │ col_3: col_3 │ |
| | │ col_4: col_4 │ |
| | │ col_5: col_5 │ |
| | │ col_6: col_6 │ |
| | │ │ |
| | │ col_7: │ |
| | │ first_value(example.col_7)│ |
| | │ │ |
| | │ col_8: │ |
| | │ first_value(example.col_8)│ |
| | └─────────────┬─────────────┘ |
| | ┌─────────────┴─────────────┐ |
| | │ AggregateExec │ |
| | │ -------------------- │ |
| | │ aggr: │ |
| | │ first_value(example.col_7)│ |
| | │ , first_value(example │ |
| | │ .col_8) │ |
| | │ │ |
| | │ group_by: │ |
| | │ col_1, col_2, col_3, col_4│ |
| | │ , col_5, col_6 │ |
| | │ │ |
| | │ mode: │ |
| | │ FinalPartitioned │ |
| | └─────────────┬─────────────┘ |
| | ┌─────────────┴─────────────┐ |
| | │ CoalesceBatchesExec │ |
| | │ -------------------- │ |
| | │ target_batch_size: │ |
| | │ 8192 │ |
| | └─────────────┬─────────────┘ |
| | ┌─────────────┴─────────────┐ |
| | │ RepartitionExec │ |
| | │ -------------------- │ |
| | │ partition_count(in->out): │ |
| | │ 10 -> 10 │ |
| | │ │ |
| | │ partitioning_scheme: │ |
| | │ Hash([col_1@0, col_2@1, │ |
| | │ col_3@2, col_4@3, │ |
| | │ col_5@4, col_6@5], │ |
| | │ 10) │ |
| | └─────────────┬─────────────┘ |
| | ┌─────────────┴─────────────┐ |
| | │ AggregateExec │ |
| | │ -------------------- │ |
| | │ aggr: │ |
| | │ first_value(example.col_7)│ |
| | │ , first_value(example │ |
| | │ .col_8) │ |
| | │ │ |
| | │ group_by: │ |
| | │ col_1, col_2, col_3, col_4│ |
| | │ , col_5, col_6 │ |
| | │ │ |
| | │ mode: Partial │ |
| | └─────────────┬─────────────┘ |
| | ┌─────────────┴─────────────┐ |
| | │ DataSourceExec │ |
| | │ -------------------- │ |
| | │ files: 24 │ |
| | │ format: parquet │ |
| | └───────────────────────────┘ |
| | |
+---------------+-------------------------------+
```
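Note that the plan hash-repartitions on all six group keys and then re-sorts on `col_1, col_2` even though the table was declared `WITH ORDER`: the hash exchange destroys the file-level ordering, so a full `SortExec` reappears before the merge. Two configuration knobs are worth experimenting with here (option names assume a recent DataFusion version; verify what is available in yours with `SHOW ALL` before relying on them):

```sql
-- Prefer plans that exploit an existing sort order instead of re-sorting.
SET datafusion.optimizer.prefer_existing_sort = true;
-- Fewer target partitions can eliminate the RepartitionExec/SortExec pair
-- entirely, at the cost of less parallelism.
SET datafusion.execution.target_partitions = 1;
```

Re-running the `EXPLAIN` after each change shows whether the sort was actually removed from the plan.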
GitHub link:
https://github.com/apache/datafusion/discussions/16776#discussioncomment-13780110
----
This is an automatically sent email for [email protected].
To unsubscribe, please send an email to:
[email protected]