coderfender opened a new issue, #3459:
URL: https://github.com/apache/datafusion-comet/issues/3459

   ### What is the problem the feature request solves?
   
   # RFC: Cast Module Modularization
   
   **Author:** Bhargava Vadlamani (@coderfender)
   **Date:** 2026-02-09
   **Status:** Draft
   
   ## Summary
   
   Refactor the monolithic `cast.rs` (3,700 LOC) into separate focused modules 
organized by data type category, using a two-level static dispatch flow
   
   ## Problems
   
   1. **Merge Conflicts** - Multiple contributors editing the same large file 
causes frequent conflicts
   2. **No Enforced Contract** - No trait defining what a cast implementation 
must provide
   3. **Poor Discoverability** - Hard to find specific cast logic in 3,700 lines
   4. **Scattered Changes** - Adding a new cast type requires edits across 
distant parts of the file
   
   ## Proposal
   
   ### Trait Definition
   
   ```rust
   /// Core trait for Spark-compatible cast operations.
   ///
   /// # Future Optimization
   /// TODO: Impl should branch on EvalMode at array level, not row level.
   pub trait SparkCast: Send + Sync + Debug {
       fn can_cast(&self, from: &DataType, to: &DataType) -> bool;
   
       fn cast(
           &self,
           array: &ArrayRef,
           from: &DataType,
           to: &DataType,
           opts: &SparkCastOptions,
       ) -> SparkResult<ArrayRef>;
   }
   ```
   
   ### Module Structure
   
   ```
   native/spark-expr/src/cast/
   ├── mod.rs              # Public API, dispatcher
   ├── traits.rs           # SparkCast trait
   ├── string.rs           # String ↔ other types
   ├── numeric.rs          # Integer, Float, Decimal
   ├── temporal.rs         # Date, Timestamp
   ├── boolean.rs          # Boolean conversions
   ├── complex_types.rs    # Struct, Array, Map
   ├── binary.rs           # Binary conversions
   └── utils.rs            # Shared helpers
   ```
   
   ### Two-Level Static Dispatch
   
   **Level 1 - mod.rs** (routes by `from` type category):
   
   ```rust
   match from {
       _ if is_string(from) => string::cast_from_string(array, to, opts),
       _ if is_boolean(from) => boolean::cast_from_boolean(array, to, opts),
       _ if is_numeric(from) => numeric::cast_from_numeric(array, from, to, 
opts),
       _ if is_temporal(from) => temporal::cast_from_temporal(array, from, to, 
opts),
       _ if is_complex(from) => complex_types::cast_from_complex(array, from, 
to, opts),
       _ if is_binary(from) => binary::cast_from_binary(array, to, opts),
   }
   ```
   
   **utils.rs** (type category helpers):
   
   ```rust
   pub fn is_numeric(dt: &DataType) -> bool {
       matches!(dt, Int8 | Int16 | Int32 | Int64 | Float32 | Float64 | 
Decimal128(_, _))
   }
   
   pub fn is_temporal(dt: &DataType) -> bool {
       matches!(dt, Date32 | Date64 | Timestamp(_, _))
   }
   
   pub fn is_string(dt: &DataType) -> bool {
       matches!(dt, Utf8 | LargeUtf8)
   }
   
   // ...
   ```
   
   **Level 2 - string.rs** (routes by `to` type):
   
   ```rust
   pub fn cast_from_string(array: &ArrayRef, to: &DataType, opts: 
&SparkCastOptions) -> SparkResult<ArrayRef> {
       match to {
           Boolean => utf8_to_boolean(array, opts),
           Int8 => utf8_to_int8(array, opts),
           Int16 => utf8_to_int16(array, opts),
           // ...
       }
   }
   ```
   
   **Level 3 - actual function** (branches by eval_mode at array level): (This 
is adaptable and not all casts might need separate implementation at an 
EValMode level)
   
   ```rust
   fn utf8_to_int8(array: &ArrayRef, opts: &SparkCastOptions) -> 
SparkResult<ArrayRef> {
       match opts.eval_mode {
           EvalMode::Legacy => utf8_to_int8_legacy(array),
           EvalMode::Ansi => utf8_to_int8_ansi(array),
           EvalMode::Try => utf8_to_int8_try(array),
       }
   }
   ```
   
   All static dispatch. Zero overhead. Clean separation.
   
   ## Migration Strategy
   
   **Phase 1: Setup (1 PR)**
   - Create `cast/` directory
   - Add `mod.rs`, `traits.rs`, `utils.rs` with type category helpers
   - Old `cast.rs` re-exports from new location (backward compatible)
   
   **Phase 2: Move modules (1 PR per module)**
   - Start with smallest: `boolean.rs`
   - Proceed to other cast operation refactor based on datatypes
   
   **Phase 3: Cleanup (1 PR)**
   - Remove old `cast.rs`
   - refactor imports across cast subsystem
   
   
   ## Testing
   
   - Existing tests remain valid (re-exports preserve API)
   - Each module gets its own `#[cfg(test)]` section
   - No new tests needed initially - existing coverage carries over
   - Future: add module-specific unit tests as needed
   
   ## Alternatives Considered
   
   1. **Keep single file, just reorganize** - Doesn't solve merge conflicts
   2. **Trait objects with dynamic dispatch** - Virtual call overhead, rejected 
for performance
   3. **Macros** - Less readable, harder to debug
   
   ## Open Questions
   
   1. Should we handle Dictionary types in a separate module or within each 
category?
   2. Naming convention can be better
   3. Generate dynamic cast support documentation (similar to scala side) and 
probably handle uncreachable paths during compile time ?
   
   
   ### Describe the potential solution
   
   _No response_
   
   ### Additional context
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to