renzepost opened a new issue, #36793:
URL: https://github.com/apache/airflow/issues/36793

   ### Description
   
   When using an operator derived from `BaseSQLToGCSOperator` with 
`output_format=parquet`, the default `parquet_row_group_size` is 1. This seems 
like a very strange default, and in my experience it leads to some very 
unwanted results: enormous Parquet files, workers running out of memory, and 
long task durations.
   
   I know this parameter is configurable, but my point is that this default 
setting should be changed to something more usable out of the box.
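   As a workaround, the row group size can be set explicitly on any operator 
derived from `BaseSQLToGCSOperator`. A minimal sketch, assuming the 
`PostgresToGCSOperator` from the Google provider and its `export_format` / 
`parquet_row_group_size` parameters (the task and values here are illustrative, 
not from this issue):

   ```python
   from airflow.providers.google.cloud.transfers.postgres_to_gcs import (
       PostgresToGCSOperator,
   )

   # Hypothetical task: export a query result to GCS as Parquet with an
   # explicit row group size instead of the default of 1 row per group.
   export = PostgresToGCSOperator(
       task_id="export_orders",
       sql="SELECT * FROM orders",
       bucket="my-bucket",
       filename="orders/{}.parquet",
       export_format="parquet",
       parquet_row_group_size=100_000,  # rows per row group; illustrative value
   )
   ```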
   
   ### Use case/motivation
   
   I looked up some other Parquet-writing systems' default settings. Spark 
seems to default to 128 MB row groups. DuckDB has a default of 122,880 
rows per row group according to the 
[docs](https://duckdb.org/docs/data/parquet/tips.html), and Polars uses a 
default of [512^2 
rows](https://docs.pola.rs/py-polars/html/reference/api/polars.DataFrame.write_parquet.html).
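   To make the effect concrete, here is a small sketch using `pyarrow` 
(assumed available; the table and sizes are illustrative). With a cap of 1 row 
per row group, every row gets its own row group and its own per-group metadata, 
which is where the file bloat comes from:

   ```python
   import io

   import pyarrow as pa
   import pyarrow.parquet as pq

   # A small table standing in for an exported SQL result set.
   table = pa.table({"id": list(range(10_000))})

   def num_row_groups(row_group_size: int) -> int:
       buf = io.BytesIO()
       # row_group_size caps the rows per row group, analogous to the
       # operator's parquet_row_group_size parameter.
       pq.write_table(table, buf, row_group_size=row_group_size)
       return pq.ParquetFile(io.BytesIO(buf.getvalue())).num_row_groups

   print(num_row_groups(1))        # 10000 row groups, one per row
   print(num_row_groups(122_880))  # 1 row group (DuckDB-style default)
   ```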
   
   Considering these other defaults, and the unwanted effects I noticed of 
having 1 row per row group, I'd say the default should be changed. However, 
I'm not sure what a good replacement default for this Airflow operator would be.
   
   ### Related issues
   
   _No response_
   
   ### Are you willing to submit a PR?
   
   - [X] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [X] I agree to follow this project's [Code of 
Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)
   

