[PR] feat: Support Parquet writer options [datafusion-python]

via GitHub Mon, 05 May 2025 14:45:37 -0700


nuno-faria opened a new pull request, #1123:
URL: https://github.com/apache/datafusion-python/pull/1123

# Which issue does this PR close?

N/A.

# Rationale for this change

Supporting all Parquet writer options gives users more flexibility when
writing Parquet files directly from `datafusion-python`.

For consistency, this PR supports every writer option defined by
`ParquetOptions` in `datafusion`, with the same defaults:
https://github.com/apache/datafusion/blob/555fc2e24dd669e44ac23a9a1d8406f4ac58a9ed/datafusion/common/src/config.rs#L423.

# What changes are included in this PR?

- Extended `write_parquet` with all writer options, including
column-specific options.
- Added relevant tests. Since `pyarrow` does not expose page-level
information, some options, such as enabling bloom filters, could not be
tested directly (an external tool confirmed that this option works). For
this specific case, a test compares file sizes instead, since bloom filters
increase the storage required.

# Are there any user-facing changes?

The main difference relates to the existing `compression` field, which now
takes a `str`, as in `datafusion`, instead of a custom enum. The main
advantage is that future algorithms will not require updating the
Python-side code.

Additionally, the default compression was changed from `zstd(4)` to
`zstd(3)`, the same as `datafusion`.
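
To illustrate the string format, here is a minimal sketch of how a `datafusion`-style compression string such as `zstd(3)` splits into a codec name and an optional level. The helper `parse_compression` is hypothetical and is not part of this PR or of `datafusion-python`; the actual parsing and validation happen inside `datafusion`. The sketch only checks the shape of the string, not whether the codec exists or the level is in range.

```python
import re
from typing import Optional, Tuple

# Shape assumed for this sketch: a codec name, optionally followed by an
# integer level in parentheses, e.g. "snappy" or "zstd(3)".
_COMPRESSION_RE = re.compile(r"^\s*([a-z0-9_]+)\s*(?:\(\s*(-?\d+)\s*\))?\s*$")


def parse_compression(spec: str) -> Tuple[str, Optional[int]]:
    """Split a compression string into (codec, level).

    Hypothetical helper for illustration only. It does not validate the
    codec name or the level range; it only checks the overall syntax.
    """
    match = _COMPRESSION_RE.match(spec.lower())
    if match is None:
        raise ValueError(f"invalid compression spec: {spec!r}")
    codec, level = match.group(1), match.group(2)
    return codec, int(level) if level is not None else None
```

With this helper, `parse_compression("zstd(3)")` returns `("zstd", 3)` and `parse_compression("snappy")` returns `("snappy", None)`. Because the codec is carried as a plain string, a new algorithm needs no new enum variant on the Python side.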

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]