AlyAbdelmoneim commented on PR #19498:
URL: https://github.com/apache/datafusion/pull/19498#issuecomment-3693064245
> I don't know if this config field is one we should be tackling. I don't
think we can ever have an exhaustive list of valid values here if we consider
user extensibility?
Oh you are right! I just wasn't aware of that use case where users can
register custom table provider factories. I saw the `default_table_factories()`
in `session_state_defaults.rs` and thought those were the only valid variants.
I'll revert this change.
For my next PR, I'll be working on validating `ParquetOptions` fields.
Here's what I plan to validate:
**Fields to validate with enum wrappers (following the pattern from PR
#17549):**
1. **`writer_version: String`** → `DFWriterVersion` enum
- Valid values: `"1.0"`, `"2.0"`
- Maps to `parquet::file::properties::WriterVersion`
- Files: `datafusion/common/src/config.rs` (update field type), new enum
in `datafusion/common/src/format.rs` (or separate file)
2. **`coerce_int96: Option<String>`** → `Option<DFTimeUnit>` enum
- Valid values: `"ns"`, `"us"`, `"ms"`, `"s"`
- Maps to `arrow::datatypes::TimeUnit`
- Files: `datafusion/common/src/config.rs`, enum in
`datafusion/common/src/format.rs`
- Reference: `parse_coerce_int96_string()` in
`datafusion/datasource-parquet/src/source.rs:497`
3. **`statistics_enabled: Option<String>`** → `Option<DFEnabledStatistics>`
enum
- Valid values: `"none"`, `"chunk"`, `"page"`
- Maps to `parquet::file::properties::EnabledStatistics`
- Files: `datafusion/common/src/config.rs`, enum in
`datafusion/common/src/format.rs`
- Reference: `parse_statistics_string()` in
`datafusion/common/src/file_options/parquet_writer.rs:388`
4. **`encoding: Option<String>`** → `Option<DFEncoding>` enum
- Valid values: `"plain"`, `"plain_dictionary"`, `"rle"`, `"bit_packed"`,
`"delta_binary_packed"`, `"delta_length_byte_array"`, `"delta_byte_array"`,
`"rle_dictionary"`, `"byte_stream_split"`
- Maps to `parquet::basic::Encoding`
- Files: `datafusion/common/src/config.rs`, enum in
`datafusion/common/src/format.rs`
**And I have a question for this field**
5. **`compression: Option<String>`** → Format pattern validation
- Pattern: `codec_name` or `codec_name(level)` where:
- Codecs requiring levels: `gzip(level)`, `brotli(level)`, `zstd(level)`
- Codecs without levels: `uncompressed`, `snappy`, `lzo`, `lz4`,
`lz4_raw`
- This is more complex than a simple enum due to the pattern and level
requirements
- Reference: `parse_compression_string()` in
`datafusion/common/src/file_options/parquet_writer.rs:323`
- **Question**: Should I validate the compression format pattern at
config time, or would you prefer a different approach? The current
implementation only validates when actually writing parquet files, which means
invalid values like `"invalid_codec"` or `"snappy(2)"` (snappy doesn't support
levels) are accepted at `SET` time and only fail later.
if you think this are the edits we want, I will create a new PR for these
changes.
@Jefffrey please review when you have time, thanks!!
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]