coderfender opened a new issue, #3459:
URL: https://github.com/apache/datafusion-comet/issues/3459
### What is the problem the feature request solves?
# RFC: Cast Module Modularization
**Author:** Bhargava Vadlamani (@coderfender)
**Date:** 2026-02-09
**Status:** Draft
## Summary
Refactor the monolithic `cast.rs` (3,700 LOC) into separate focused modules
organized by data type category, using a two-level static dispatch flow
## Problems
1. **Merge Conflicts** - Multiple contributors editing the same large file
causes frequent conflicts
2. **No Enforced Contract** - No trait defining what a cast implementation
must provide
3. **Poor Discoverability** - Hard to find specific cast logic in 3,700 lines
4. **Scattered Changes** - Adding a new cast type requires edits across
distant parts of the file
## Proposal
### Trait Definition
```rust
/// Core trait for Spark-compatible cast operations.
///
/// # Future Optimization
/// TODO: Impl should branch on EvalMode at array level, not row level.
pub trait SparkCast: Send + Sync + Debug {
fn can_cast(&self, from: &DataType, to: &DataType) -> bool;
fn cast(
&self,
array: &ArrayRef,
from: &DataType,
to: &DataType,
opts: &SparkCastOptions,
) -> SparkResult<ArrayRef>;
}
```
### Module Structure
```
native/spark-expr/src/cast/
├── mod.rs # Public API, dispatcher
├── traits.rs # SparkCast trait
├── string.rs # String ↔ other types
├── numeric.rs # Integer, Float, Decimal
├── temporal.rs # Date, Timestamp
├── boolean.rs # Boolean conversions
├── complex_types.rs # Struct, Array, Map
├── binary.rs # Binary conversions
└── utils.rs # Shared helpers
```
### Two-Level Static Dispatch
**Level 1 - mod.rs** (routes by `from` type category):
```rust
match from {
_ if is_string(from) => string::cast_from_string(array, to, opts),
_ if is_boolean(from) => boolean::cast_from_boolean(array, to, opts),
_ if is_numeric(from) => numeric::cast_from_numeric(array, from, to,
opts),
_ if is_temporal(from) => temporal::cast_from_temporal(array, from, to,
opts),
_ if is_complex(from) => complex_types::cast_from_complex(array, from,
to, opts),
_ if is_binary(from) => binary::cast_from_binary(array, to, opts),
}
```
**utils.rs** (type category helpers):
```rust
pub fn is_numeric(dt: &DataType) -> bool {
matches!(dt, Int8 | Int16 | Int32 | Int64 | Float32 | Float64 |
Decimal128(_, _))
}
pub fn is_temporal(dt: &DataType) -> bool {
matches!(dt, Date32 | Date64 | Timestamp(_, _))
}
pub fn is_string(dt: &DataType) -> bool {
matches!(dt, Utf8 | LargeUtf8)
}
// ...
```
**Level 2 - string.rs** (routes by `to` type):
```rust
pub fn cast_from_string(array: &ArrayRef, to: &DataType, opts:
&SparkCastOptions) -> SparkResult<ArrayRef> {
match to {
Boolean => utf8_to_boolean(array, opts),
Int8 => utf8_to_int8(array, opts),
Int16 => utf8_to_int16(array, opts),
// ...
}
}
```
**Level 3 - actual function** (branches by eval_mode at array level): (This
is adaptable and not all casts might need separate implementation at an
EValMode level)
```rust
fn utf8_to_int8(array: &ArrayRef, opts: &SparkCastOptions) ->
SparkResult<ArrayRef> {
match opts.eval_mode {
EvalMode::Legacy => utf8_to_int8_legacy(array),
EvalMode::Ansi => utf8_to_int8_ansi(array),
EvalMode::Try => utf8_to_int8_try(array),
}
}
```
All static dispatch. Zero overhead. Clean separation.
## Migration Strategy
**Phase 1: Setup (1 PR)**
- Create `cast/` directory
- Add `mod.rs`, `traits.rs`, `utils.rs` with type category helpers
- Old `cast.rs` re-exports from new location (backward compatible)
**Phase 2: Move modules (1 PR per module)**
- Start with smallest: `boolean.rs`
- Proceed to other cast operation refactor based on datatypes
**Phase 3: Cleanup (1 PR)**
- Remove old `cast.rs`
- refactor imports across cast subsystem
## Testing
- Existing tests remain valid (re-exports preserve API)
- Each module gets its own `#[cfg(test)]` section
- No new tests needed initially - existing coverage carries over
- Future: add module-specific unit tests as needed
## Alternatives Considered
1. **Keep single file, just reorganize** - Doesn't solve merge conflicts
2. **Trait objects with dynamic dispatch** - Virtual call overhead, rejected
for performance
3. **Macros** - Less readable, harder to debug
## Open Questions
1. Should we handle Dictionary types in a separate module or within each
category?
2. Naming convention can be better
3. Generate dynamic cast support documentation (similar to scala side) and
probably handle uncreachable paths during compile time ?
### Describe the potential solution
_No response_
### Additional context
_No response_
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]