AlyAbdelmoneim commented on PR #19498:
URL: https://github.com/apache/datafusion/pull/19498#issuecomment-3693064245

   > I don't know if this config field is one we should be tackling. I don't 
think we can ever have an exhaustive list of valid values here if we consider 
user extensibility?
   
   Oh you are right! I just wasn't aware of that use case where users can 
register custom table provider factories. I saw the `default_table_factories()` 
in `session_state_defaults.rs` and thought those were the only valid variants. 
I'll revert this change.
   
   For my next PR, I'll be working on validating `ParquetOptions` fields. 
Here's what I plan to validate:
   
   **Fields to validate with enum wrappers (following the pattern from PR 
#17549):**
   
   1. **`writer_version: String`** → `DFWriterVersion` enum
      - Valid values: `"1.0"`, `"2.0"`
      - Maps to `parquet::file::properties::WriterVersion`
      - Files: `datafusion/common/src/config.rs` (update field type), new enum 
in `datafusion/common/src/format.rs` (or separate file)
   
   2. **`coerce_int96: Option<String>`** → `Option<DFTimeUnit>` enum
      - Valid values: `"ns"`, `"us"`, `"ms"`, `"s"`
      - Maps to `arrow::datatypes::TimeUnit`
      - Files: `datafusion/common/src/config.rs`, enum in 
`datafusion/common/src/format.rs`
      - Reference: `parse_coerce_int96_string()` in 
`datafusion/datasource-parquet/src/source.rs:497`
   
   3. **`statistics_enabled: Option<String>`** → `Option<DFEnabledStatistics>` 
enum
      - Valid values: `"none"`, `"chunk"`, `"page"`
      - Maps to `parquet::file::properties::EnabledStatistics`
      - Files: `datafusion/common/src/config.rs`, enum in 
`datafusion/common/src/format.rs`
      - Reference: `parse_statistics_string()` in 
`datafusion/common/src/file_options/parquet_writer.rs:388`
   
   4. **`encoding: Option<String>`** → `Option<DFEncoding>` enum
      - Valid values: `"plain"`, `"plain_dictionary"`, `"rle"`, `"bit_packed"`, 
`"delta_binary_packed"`, `"delta_length_byte_array"`, `"delta_byte_array"`, 
`"rle_dictionary"`, `"byte_stream_split"`
      - Maps to `parquet::basic::Encoding`
      - Files: `datafusion/common/src/config.rs`, enum in 
`datafusion/common/src/format.rs`
   
   **And I have a question for this field**
   
   5. **`compression: Option<String>`** → Format pattern validation
      - Pattern: `codec_name` or `codec_name(level)` where:
        - Codecs requiring levels: `gzip(level)`, `brotli(level)`, `zstd(level)`
        - Codecs without levels: `uncompressed`, `snappy`, `lzo`, `lz4`, 
`lz4_raw`
      - This is more complex than a simple enum due to the pattern and level 
requirements
      - Reference: `parse_compression_string()` in 
`datafusion/common/src/file_options/parquet_writer.rs:323`
      - **Question**: Should I validate the compression format pattern at 
config time, or would you prefer a different approach? The current 
implementation only validates when actually writing parquet files, which means 
invalid values like `"invalid_codec"` or `"snappy(2)"` (snappy doesn't support 
levels) are accepted at `SET` time and only fail later.
   
   if you think this are the edits we want, I will create a new PR for these 
changes.
   @Jefffrey please review when you have time, thanks!!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to