renzepost opened a new issue, #36793:
URL: https://github.com/apache/airflow/issues/36793

   ### Description
   
   When using an operator derived from `BaseSQLToGCSOperator` with 
`output_format=parquet`, the default `parquet_row_group_size` is 1. This seems 
like a very strange default, and in my experience it leads to some very 
unwanted results: enormous Parquet files, workers running out of memory, and 
long task durations.
   
   I know this parameter is configurable, but my point is that this default 
setting should be changed to something more usable out of the box.
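   As a workaround, the row group size can be set explicitly on any operator 
derived from `BaseSQLToGCSOperator`. A minimal sketch, assuming the 
`PostgresToGCSOperator` from the Google provider and its `export_format` / 
`parquet_row_group_size` parameters (the task and values here are illustrative, 
not from this issue):

   ```python
   from airflow.providers.google.cloud.transfers.postgres_to_gcs import (
       PostgresToGCSOperator,
   )

   # Hypothetical task: export a query result to GCS as Parquet with an
   # explicit row group size instead of the default of 1 row per group.
   export = PostgresToGCSOperator(
       task_id="export_orders",
       sql="SELECT * FROM orders",
       bucket="my-bucket",
       filename="orders/{}.parquet",
       export_format="parquet",
       parquet_row_group_size=100_000,  # rows per row group; illustrative value
   )
   ```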
   
   ### Use case/motivation
   
   I looked up some other Parquet-writing systems' default settings. Spark 
seems to default to 128 MB row groups. DuckDB has a default of 122,880 
rows per row group according to the 
[docs](https://duckdb.org/docs/data/parquet/tips.html), and Polars uses a 
default of [512^2 
rows](https://docs.pola.rs/py-polars/html/reference/api/polars.DataFrame.write_parquet.html).
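   To make the effect concrete, here is a small sketch using `pyarrow` 
(assumed available; the table and sizes are illustrative). With a cap of 1 row 
per row group, every row gets its own row group and its own per-group metadata, 
which is where the file bloat comes from:

   ```python
   import io

   import pyarrow as pa
   import pyarrow.parquet as pq

   # A small table standing in for an exported SQL result set.
   table = pa.table({"id": list(range(10_000))})

   def num_row_groups(row_group_size: int) -> int:
       buf = io.BytesIO()
       # row_group_size caps the rows per row group, analogous to the
       # operator's parquet_row_group_size parameter.
       pq.write_table(table, buf, row_group_size=row_group_size)
       return pq.ParquetFile(io.BytesIO(buf.getvalue())).num_row_groups

   print(num_row_groups(1))        # 10000 row groups, one per row
   print(num_row_groups(122_880))  # 1 row group (DuckDB-style default)
   ```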
   
   Considering these other defaults, and the unwanted effects I noticed of 
having 1 row per row group, I'd say the default should be changed. However, 
I'm not sure what a good replacement default for this Airflow operator would be.
   
   ### Related issues
   
   _No response_
   
   ### Are you willing to submit a PR?
   
   - [X] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [X] I agree to follow this project's [Code of 
Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)
   

