This is an automated email from the ASF dual-hosted git repository.

github-bot pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/datafusion.git
The following commit(s) were added to refs/heads/main by this push:
new d103d8886f chore: remove LZO Parquet compression (#19726)
d103d8886f is described below
commit d103d8886fcef989b0465a4a7dba28114869431c
Author: Kumar Ujjawal <[email protected]>
AuthorDate: Mon Jan 12 08:01:20 2026 +0530
chore: remove LZO Parquet compression (#19726)
## Which issue does this PR close?
- Closes #19720.
## Rationale for this change
- Choosing LZO compression currently results in an error, and LZO support
seems unlikely to ever land, so the best option moving forward is to remove
it altogether and update the docs.
## What changes are included in this PR?
- Removed the `lzo` arm from the `parse_compression_string()` function
- Removed LZO from the documentation
- Updated expected test output
## Are these changes tested?
Yes
## Are there any user-facing changes?
Users choosing LZO as the compression codec now get a clear error message:
```
Unknown or unsupported parquet compression: lzo. Valid values are:
uncompressed, snappy, gzip(level), brotli(level), lz4, zstd(level), and lz4_raw.
```
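For illustration, a minimal sketch of hitting that error through the SQL interface. This is a hedged example, not part of the change: it assumes the `datafusion` and `tokio` crates, a throwaway output path, and does not pin down whether the error surfaces at planning or at execution time.

```rust
use datafusion::prelude::*;

#[tokio::main]
async fn main() {
    let ctx = SessionContext::new();
    // Hypothetical write using the removed codec. The statement is
    // rejected with the error shown above; depending on the stage it
    // may surface from `sql()` or from `collect()`, so both are checked.
    let result = match ctx
        .sql(
            "COPY (SELECT 1 AS a) TO '/tmp/out.parquet' \
             STORED AS PARQUET OPTIONS ('compression' 'lzo')",
        )
        .await
    {
        Ok(df) => df.collect().await.map(|_| ()),
        Err(e) => Err(e),
    };
    assert!(result.is_err(), "lzo should be rejected");
    println!("{}", result.unwrap_err());
}
```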
---
datafusion/common/src/config.rs | 4 +-
.../common/src/file_options/parquet_writer.rs | 6 +-
.../sqllogictest/test_files/information_schema.slt | 2 +-
docs/source/user-guide/configs.md | 2 +-
docs/source/user-guide/sql/format_options.md | 64 +++++++++++-----------
5 files changed, 37 insertions(+), 41 deletions(-)
diff --git a/datafusion/common/src/config.rs b/datafusion/common/src/config.rs
index b7a7841593..87344914d2 100644
--- a/datafusion/common/src/config.rs
+++ b/datafusion/common/src/config.rs
@@ -772,7 +772,7 @@ config_namespace! {
         /// (writing) Sets default parquet compression codec.
         /// Valid values are: uncompressed, snappy, gzip(level),
-        /// lzo, brotli(level), lz4, zstd(level), and lz4_raw.
+        /// brotli(level), lz4, zstd(level), and lz4_raw.
         /// These values are not case sensitive. If NULL, uses
         /// default parquet writer setting
         ///
@@ -2499,7 +2499,7 @@ config_namespace_with_hashmap! {
         /// Sets default parquet compression codec for the column path.
         /// Valid values are: uncompressed, snappy, gzip(level),
-        /// lzo, brotli(level), lz4, zstd(level), and lz4_raw.
+        /// brotli(level), lz4, zstd(level), and lz4_raw.
         /// These values are not case-sensitive. If NULL, uses
         /// default parquet options
         pub compression: Option<String>, transform = str::to_lowercase,
         default = None
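As a quick illustration of the option documented above, a hedged sketch of setting the session default codec (assuming current `datafusion` crate APIs; the `gzip(6)` value is just an example level):

```rust
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    // Set the default writer codec when building the session...
    let config = SessionConfig::new()
        .set_str("datafusion.execution.parquet.compression", "zstd(3)");
    let ctx = SessionContext::new_with_config(config);
    // ...or change it later for the same session via SQL.
    ctx.sql("SET datafusion.execution.parquet.compression = 'gzip(6)'")
        .await?
        .collect()
        .await?;
    Ok(())
}
```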
diff --git a/datafusion/common/src/file_options/parquet_writer.rs b/datafusion/common/src/file_options/parquet_writer.rs
index 196cb96f38..f6608d16c1 100644
--- a/datafusion/common/src/file_options/parquet_writer.rs
+++ b/datafusion/common/src/file_options/parquet_writer.rs
@@ -341,10 +341,6 @@ pub fn parse_compression_string(
                 level,
             )?))
         }
-        "lzo" => {
-            check_level_is_none(codec, &level)?;
-            Ok(parquet::basic::Compression::LZO)
-        }
         "brotli" => {
             let level = require_level(codec, level)?;
             Ok(parquet::basic::Compression::BROTLI(BrotliLevel::try_new(
@@ -368,7 +364,7 @@ pub fn parse_compression_string(
         _ => Err(DataFusionError::Configuration(format!(
             "Unknown or unsupported parquet compression: \
             {str_setting}. Valid values are: uncompressed, snappy, gzip(level), \
-            lzo, brotli(level), lz4, zstd(level), and lz4_raw."
+            brotli(level), lz4, zstd(level), and lz4_raw."
         ))),
     }
 }
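For readers skimming the diff, a minimal standalone sketch of the resulting match shape. This is a hypothetical simplification: the real `parse_compression_string()` returns `parquet::basic::Compression` and uses helpers such as `require_level`; here plain strings stand in for the codec type.

```rust
/// Simplified stand-in for the parser after this change: with the "lzo"
/// arm removed, that value falls through to the catch-all error.
fn parse_codec(setting: &str) -> Result<String, String> {
    let lower = setting.to_lowercase();
    // Split an optional "(level)" suffix, e.g. "zstd(3)" -> ("zstd", Some("3")).
    let (codec, _level) = match lower.split_once('(') {
        Some((name, rest)) => (name, rest.strip_suffix(')')),
        None => (lower.as_str(), None),
    };
    match codec {
        "uncompressed" | "snappy" | "gzip" | "brotli" | "lz4" | "zstd" | "lz4_raw" => {
            Ok(codec.to_string())
        }
        // "lzo" used to be matched here; it now hits the error arm below.
        _ => Err(format!(
            "Unknown or unsupported parquet compression: {setting}. Valid values are: \
             uncompressed, snappy, gzip(level), brotli(level), lz4, zstd(level), and lz4_raw."
        )),
    }
}

fn main() {
    assert!(parse_codec("zstd(3)").is_ok());
    assert!(parse_codec("LZO").is_err());
    println!("{}", parse_codec("lzo").unwrap_err());
}
```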
diff --git a/datafusion/sqllogictest/test_files/information_schema.slt b/datafusion/sqllogictest/test_files/information_schema.slt
index 860d81b098..2039ee93df 100644
--- a/datafusion/sqllogictest/test_files/information_schema.slt
+++ b/datafusion/sqllogictest/test_files/information_schema.slt
@@ -373,7 +373,7 @@ datafusion.execution.parquet.bloom_filter_on_read true (reading) Use any availab
 datafusion.execution.parquet.bloom_filter_on_write false (writing) Write bloom filters for all columns when creating parquet files
 datafusion.execution.parquet.coerce_int96 NULL (reading) If true, parquet reader will read columns of physical type int96 as originating from a different resolution than nanosecond. This is useful for reading data from systems like Spark which stores microsecond resolution timestamps in an int96 allowing it to write values with a larger date range than 64-bit timestamps with nanosecond resolution.
 datafusion.execution.parquet.column_index_truncate_length 64 (writing) Sets column index truncate length
-datafusion.execution.parquet.compression zstd(3) (writing) Sets default parquet compression codec. Valid values are: uncompressed, snappy, gzip(level), lzo, brotli(level), lz4, zstd(level), and lz4_raw. These values are not case sensitive. If NULL, uses default parquet writer setting Note that this default setting is not the same as the default parquet writer setting.
+datafusion.execution.parquet.compression zstd(3) (writing) Sets default parquet compression codec. Valid values are: uncompressed, snappy, gzip(level), brotli(level), lz4, zstd(level), and lz4_raw. These values are not case sensitive. If NULL, uses default parquet writer setting Note that this default setting is not the same as the default parquet writer setting.
 datafusion.execution.parquet.created_by datafusion (writing) Sets "created by" property
 datafusion.execution.parquet.data_page_row_count_limit 20000 (writing) Sets best effort maximum number of rows in data page
 datafusion.execution.parquet.data_pagesize_limit 1048576 (writing) Sets best effort maximum size of data page in bytes
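To inspect the effective value the way this sqllogictest does, one can query it from a session. A small hedged sketch, assuming `with_information_schema` is enabled (which `SHOW` requires):

```rust
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    let config = SessionConfig::new().with_information_schema(true);
    let ctx = SessionContext::new_with_config(config);
    // SHOW reads from the same information_schema.df_settings view
    // that the test output above asserts against.
    ctx.sql("SHOW datafusion.execution.parquet.compression")
        .await?
        .show()
        .await?;
    Ok(())
}
```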
diff --git a/docs/source/user-guide/configs.md b/docs/source/user-guide/configs.md
index b59af0c13d..99c94b2c78 100644
--- a/docs/source/user-guide/configs.md
+++ b/docs/source/user-guide/configs.md
@@ -96,7 +96,7 @@ The following configuration settings are available:
 | datafusion.execution.parquet.write_batch_size | 1024 | (writing) Sets write_batch_size in bytes |
 | datafusion.execution.parquet.writer_version | 1.0 | (writing) Sets parquet writer version valid values are "1.0" and "2.0" |
 | datafusion.execution.parquet.skip_arrow_metadata | false | (writing) Skip encoding the embedded arrow metadata in the KV_meta This is analogous to the `ArrowWriterOptions::with_skip_arrow_metadata`. Refer to <https://docs.rs/parquet/53.3.0/parquet/arrow/arrow_writer/struct.ArrowWriterOptions.html#method.with_skip_arrow_metadata> |
-| datafusion.execution.parquet.compression | zstd(3) | (writing) Sets default parquet compression codec. Valid values are: uncompressed, snappy, gzip(level), lzo, brotli(level), lz4, zstd(level), and lz4_raw. These values are not case sensitive. If NULL, uses default parquet writer setting Note that this default setting is not the same as the default parquet writer setting. |
+| datafusion.execution.parquet.compression | zstd(3) | (writing) Sets default parquet compression codec. Valid values are: uncompressed, snappy, gzip(level), brotli(level), lz4, zstd(level), and lz4_raw. These values are not case sensitive. If NULL, uses default parquet writer setting Note that this default setting is not the same as the default parquet writer setting. |
 | datafusion.execution.parquet.dictionary_enabled | true | (writing) Sets if dictionary encoding is enabled. If NULL, uses default parquet writer setting |
 | datafusion.execution.parquet.dictionary_page_size_limit | 1048576 | (writing) Sets best effort maximum dictionary page size, in bytes |
 | datafusion.execution.parquet.statistics_enabled | page | (writing) Sets if statistics are enabled for any column Valid values are: "none", "chunk", and "page" These values are not case sensitive. If NULL, uses default parquet writer setting |
diff --git a/docs/source/user-guide/sql/format_options.md b/docs/source/user-guide/sql/format_options.md
index d349bc1c98..c04a6b5d52 100644
--- a/docs/source/user-guide/sql/format_options.md
+++ b/docs/source/user-guide/sql/format_options.md
@@ -132,38 +132,38 @@ OPTIONS('DELIMITER' '|', 'HAS_HEADER' 'true', 'NEWLINES_IN_VALUES' 'true');
 The following options are available when reading or writing Parquet files. If any unsupported option is specified, an error will be raised and the query will fail. If a column-specific option is specified for a column that does not exist, the option will be ignored without error.
-| Option | Can be Column Specific? | Description | OPTIONS Key | Default Value |
-| ------ | ----------------------- | ----------- | ----------- | ------------- |
-| COMPRESSION | Yes | Sets the internal Parquet **compression codec** for data pages, optionally including the compression level. Applies globally if set without `::col`, or specifically to a column if set using `'compression::column_name'`. Valid values: `uncompressed`, `snappy`, `gzip(level)`, `lzo`, `brotli(level)`, `lz4`, `zstd(level)`, `lz4_raw`. | `'compression'` or `'compression::col'` | zstd(3) |
-| ENCODING | Yes | Sets the **encoding** scheme for data pages. Valid values: `plain`, `plain_dictionary`, `rle`, `bit_packed`, `delta_binary_packed`, `delta_length_byte_array`, `delta_byte_array`, `rle_dictionary`, `byte_stream_split`. Use key `'encoding'` or `'encoding::col'` in OPTIONS. | `'encoding'` or `'encoding::col'` | None |
-| DICTIONARY_ENABLED | Yes | Sets whether dictionary encoding should be enabled globally or for a specific column. | `'dictionary_enabled'` or `'dictionary_enabled::col'` | true |
-| STATISTICS_ENABLED | Yes | Sets the level of statistics to write (`none`, `chunk`, `page`). | `'statistics_enabled'` or `'statistics_enabled::col'` | page |
-| BLOOM_FILTER_ENABLED | Yes | Sets whether a bloom filter should be written for a specific column. | `'bloom_filter_enabled::column_name'` | None |
-| BLOOM_FILTER_FPP | Yes | Sets bloom filter false positive probability (global or per column). | `'bloom_filter_fpp'` or `'bloom_filter_fpp::col'` | None |
-| BLOOM_FILTER_NDV | Yes | Sets bloom filter number of distinct values (global or per column). | `'bloom_filter_ndv'` or `'bloom_filter_ndv::col'` | None |
-| MAX_ROW_GROUP_SIZE | No | Sets the maximum number of rows per row group. Larger groups require more memory but can improve compression and scan efficiency. | `'max_row_group_size'` | 1048576 |
-| ENABLE_PAGE_INDEX | No | If true, reads the Parquet data page level metadata (the Page Index), if present, to reduce I/O and decoding. | `'enable_page_index'` | true |
-| PRUNING | No | If true, enables row group pruning based on min/max statistics. | `'pruning'` | true |
-| SKIP_METADATA | No | If true, skips optional embedded metadata in the file schema. | `'skip_metadata'` | true |
-| METADATA_SIZE_HINT | No | Sets the size hint (in bytes) for fetching Parquet file metadata. | `'metadata_size_hint'` | None |
-| PUSHDOWN_FILTERS | No | If true, enables filter pushdown during Parquet decoding. | `'pushdown_filters'` | false |
-| REORDER_FILTERS | No | If true, enables heuristic reordering of filters during Parquet decoding. | `'reorder_filters'` | false |
-| SCHEMA_FORCE_VIEW_TYPES | No | If true, reads Utf8/Binary columns as view types. | `'schema_force_view_types'` | true |
-| BINARY_AS_STRING | No | If true, reads Binary columns as strings. | `'binary_as_string'` | false |
-| DATA_PAGESIZE_LIMIT | No | Sets best effort maximum size of data page in bytes. | `'data_pagesize_limit'` | 1048576 |
-| DATA_PAGE_ROW_COUNT_LIMIT | No | Sets best effort maximum number of rows in data page. | `'data_page_row_count_limit'` | 20000 |
-| DICTIONARY_PAGE_SIZE_LIMIT | No | Sets best effort maximum dictionary page size, in bytes. | `'dictionary_page_size_limit'` | 1048576 |
-| WRITE_BATCH_SIZE | No | Sets write_batch_size in bytes. | `'write_batch_size'` | 1024 |
-| WRITER_VERSION | No | Sets the Parquet writer version (`1.0` or `2.0`). | `'writer_version'` | 1.0 |
-| SKIP_ARROW_METADATA | No | If true, skips writing Arrow schema information into the Parquet file metadata. | `'skip_arrow_metadata'` | false |
-| CREATED_BY | No | Sets the "created by" string in the Parquet file metadata. | `'created_by'` | datafusion version X.Y.Z |
-| COLUMN_INDEX_TRUNCATE_LENGTH | No | Sets the length (in bytes) to truncate min/max values in column indexes. | `'column_index_truncate_length'` | 64 |
-| STATISTICS_TRUNCATE_LENGTH | No | Sets statistics truncate length. | `'statistics_truncate_length'` | None |
-| BLOOM_FILTER_ON_WRITE | No | Sets whether bloom filters should be written for all columns by default (can be overridden per column). | `'bloom_filter_on_write'` | false |
-| ALLOW_SINGLE_FILE_PARALLELISM | No | Enables parallel serialization of columns in a single file. | `'allow_single_file_parallelism'` | true |
-| MAXIMUM_PARALLEL_ROW_GROUP_WRITERS | No | Maximum number of parallel row group writers. | `'maximum_parallel_row_group_writers'` | 1 |
-| MAXIMUM_BUFFERED_RECORD_BATCHES_PER_STREAM | No | Maximum number of buffered record batches per stream. | `'maximum_buffered_record_batches_per_stream'` | 2 |
-| KEY_VALUE_METADATA | No (Key is specific) | Adds custom key-value pairs to the file metadata. Use the format `'metadata::your_key_name' 'your_value'`. Multiple entries allowed. | `'metadata::key_name'` | None |
+| Option | Can be Column Specific? | Description | OPTIONS Key | Default Value |
+| ------ | ----------------------- | ----------- | ----------- | ------------- |
+| COMPRESSION | Yes | Sets the internal Parquet **compression codec** for data pages, optionally including the compression level. Applies globally if set without `::col`, or specifically to a column if set using `'compression::column_name'`. Valid values: `uncompressed`, `snappy`, `gzip(level)`, `brotli(level)`, `lz4`, `zstd(level)`, `lz4_raw`. | `'compression'` or `'compression::col'` | zstd(3) |
+| ENCODING | Yes | Sets the **encoding** scheme for data pages. Valid values: `plain`, `plain_dictionary`, `rle`, `bit_packed`, `delta_binary_packed`, `delta_length_byte_array`, `delta_byte_array`, `rle_dictionary`, `byte_stream_split`. Use key `'encoding'` or `'encoding::col'` in OPTIONS. | `'encoding'` or `'encoding::col'` | None |
+| DICTIONARY_ENABLED | Yes | Sets whether dictionary encoding should be enabled globally or for a specific column. | `'dictionary_enabled'` or `'dictionary_enabled::col'` | true |
+| STATISTICS_ENABLED | Yes | Sets the level of statistics to write (`none`, `chunk`, `page`). | `'statistics_enabled'` or `'statistics_enabled::col'` | page |
+| BLOOM_FILTER_ENABLED | Yes | Sets whether a bloom filter should be written for a specific column. | `'bloom_filter_enabled::column_name'` | None |
+| BLOOM_FILTER_FPP | Yes | Sets bloom filter false positive probability (global or per column). | `'bloom_filter_fpp'` or `'bloom_filter_fpp::col'` | None |
+| BLOOM_FILTER_NDV | Yes | Sets bloom filter number of distinct values (global or per column). | `'bloom_filter_ndv'` or `'bloom_filter_ndv::col'` | None |
+| MAX_ROW_GROUP_SIZE | No | Sets the maximum number of rows per row group. Larger groups require more memory but can improve compression and scan efficiency. | `'max_row_group_size'` | 1048576 |
+| ENABLE_PAGE_INDEX | No | If true, reads the Parquet data page level metadata (the Page Index), if present, to reduce I/O and decoding. | `'enable_page_index'` | true |
+| PRUNING | No | If true, enables row group pruning based on min/max statistics. | `'pruning'` | true |
+| SKIP_METADATA | No | If true, skips optional embedded metadata in the file schema. | `'skip_metadata'` | true |
+| METADATA_SIZE_HINT | No | Sets the size hint (in bytes) for fetching Parquet file metadata. | `'metadata_size_hint'` | None |
+| PUSHDOWN_FILTERS | No | If true, enables filter pushdown during Parquet decoding. | `'pushdown_filters'` | false |
+| REORDER_FILTERS | No | If true, enables heuristic reordering of filters during Parquet decoding. | `'reorder_filters'` | false |
+| SCHEMA_FORCE_VIEW_TYPES | No | If true, reads Utf8/Binary columns as view types. | `'schema_force_view_types'` | true |
+| BINARY_AS_STRING | No | If true, reads Binary columns as strings. | `'binary_as_string'` | false |
+| DATA_PAGESIZE_LIMIT | No | Sets best effort maximum size of data page in bytes. | `'data_pagesize_limit'` | 1048576 |
+| DATA_PAGE_ROW_COUNT_LIMIT | No | Sets best effort maximum number of rows in data page. | `'data_page_row_count_limit'` | 20000 |
+| DICTIONARY_PAGE_SIZE_LIMIT | No | Sets best effort maximum dictionary page size, in bytes. | `'dictionary_page_size_limit'` | 1048576 |
+| WRITE_BATCH_SIZE | No | Sets write_batch_size in bytes. | `'write_batch_size'` | 1024 |
+| WRITER_VERSION | No | Sets the Parquet writer version (`1.0` or `2.0`). | `'writer_version'` | 1.0 |
+| SKIP_ARROW_METADATA | No | If true, skips writing Arrow schema information into the Parquet file metadata. | `'skip_arrow_metadata'` | false |
+| CREATED_BY | No | Sets the "created by" string in the Parquet file metadata. | `'created_by'` | datafusion version X.Y.Z |
+| COLUMN_INDEX_TRUNCATE_LENGTH | No | Sets the length (in bytes) to truncate min/max values in column indexes. | `'column_index_truncate_length'` | 64 |
+| STATISTICS_TRUNCATE_LENGTH | No | Sets statistics truncate length. | `'statistics_truncate_length'` | None |
+| BLOOM_FILTER_ON_WRITE | No | Sets whether bloom filters should be written for all columns by default (can be overridden per column). | `'bloom_filter_on_write'` | false |
+| ALLOW_SINGLE_FILE_PARALLELISM | No | Enables parallel serialization of columns in a single file. | `'allow_single_file_parallelism'` | true |
+| MAXIMUM_PARALLEL_ROW_GROUP_WRITERS | No | Maximum number of parallel row group writers. | `'maximum_parallel_row_group_writers'` | 1 |
+| MAXIMUM_BUFFERED_RECORD_BATCHES_PER_STREAM | No | Maximum number of buffered record batches per stream. | `'maximum_buffered_record_batches_per_stream'` | 2 |
+| KEY_VALUE_METADATA | No (Key is specific) | Adds custom key-value pairs to the file metadata. Use the format `'metadata::your_key_name' 'your_value'`. Multiple entries allowed. | `'metadata::key_name'` | None |
**Example:**
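The diff excerpt ends before the file's own example. As a stand-in, here is a hedged sketch of the global and column-specific key forms from the table above; the query, the column name `b`, and the output path are hypothetical:

```rust
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    let ctx = SessionContext::new();
    // Global default zstd(3), snappy for the hypothetical column `b`.
    // Note that `lzo` is no longer a valid value for either key form.
    ctx.sql(
        "COPY (SELECT 1 AS a, 'x' AS b) TO '/tmp/out.parquet' \
         STORED AS PARQUET \
         OPTIONS ('compression' 'zstd(3)', 'compression::b' 'snappy')",
    )
    .await?
    .collect()
    .await?;
    Ok(())
}
```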
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
