davidm-db opened a new pull request, #54223:
URL: https://github.com/apache/spark/pull/54223

   ### What changes were proposed in this pull request?
   
   This PR introduces the foundation of the **Spark Types Framework** - a 
system for centralizing type-specific operations that are currently scattered 
across 50+ files using diverse patterns.
   
   **Framework interfaces** (9 new files in `sql/api` and `sql/catalyst`):
   - Base traits: `TypeOps` (catalyst), `TypeApiOps` (sql-api)
   - Core interfaces: `PhyTypeOps` (physical type representation), 
`LiteralTypeOps` (literal creation), `ExternalTypeOps` (internal/external type 
conversion), `FormatTypeOps` (string formatting), `EncodeTypeOps` (row encoders)
   - Proof-of-concept implementation: `TimeTypeOps` + `TimeTypeApiOps` for 
TimeType
   
   **Integration points** (10 existing files modified with a check-and-delegate pattern):
   - `PhysicalDataType.scala` - physical type dispatch
   - `CatalystTypeConverters.scala` - external/internal type conversion (via 
`TypeOpsConverter` adapter)
   - `ToStringBase.scala` - string formatting
   - `RowEncoder.scala` - row encoding
   - `literals.scala` - default literal creation (`Literal.default`)
   - `EncoderUtils.scala` - encoder Java class mapping
   - `CodeGenerator.scala` - codegen Java class mapping
   - `SpecificInternalRow.scala` - mutable value creation
   - `InternalRow.scala` - row writer dispatch
   
   **Feature flag**: `spark.sql.types.framework.enabled` (defaults to `true` in 
tests via `Utils.isTesting`, `false` otherwise), configured in `SQLConf.scala` 
+ `SqlApiConf.scala`.
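   
   For orientation, a boolean flag with a test-only default is typically declared inside `SQLConf` roughly as sketched below; the exact builder chain, doc text, and the `SqlApiConf` counterpart in this PR may differ.
   
   ```scala
   // Rough sketch of the flag declaration in SQLConf.scala; the actual entry in
   // this PR (and its SqlApiConf mirror) may differ in naming, wording, and
   // details such as .version(...).
   val TYPES_FRAMEWORK_ENABLED =
     buildConf("spark.sql.types.framework.enabled")
       .internal()
       .doc("When true, framework-supported types are dispatched through " +
         "centralized TypeOps implementations instead of per-file pattern matches.")
       .booleanConf
       .createWithDefault(Utils.isTesting)  // true in tests, false otherwise
   
   def typesFrameworkEnabled: Boolean = getConf(TYPES_FRAMEWORK_ENABLED)
   ```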
   
   **Interface hierarchy:**
   ```
   TypeOps (catalyst)
   +-- PhyTypeOps      - Physical representation (PhysicalDataType, Java class, 
MutableValue, row writer)
   +-- LiteralTypeOps  - Literal creation and defaults
   +-- ExternalTypeOps - External <-> internal type conversion
   
   TypeApiOps (sql-api)
   +-- FormatTypeOps   - String formatting (CAST to STRING, SQL values)
   +-- EncodeTypeOps   - Row encoding (AgnosticEncoder)
   ```
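   
   In code, that hierarchy could look roughly like the sketch below - the trait names come from this PR, but every member shown is an assumption made for illustration, not the PR's actual signatures.
   
   ```scala
   import org.apache.spark.sql.catalyst.encoders.AgnosticEncoder
   import org.apache.spark.sql.types.DataType
   
   // Illustrative sketch only: trait names match the PR, members are assumed.
   trait TypeOps {
     def dataType: DataType
   }
   
   trait PhyTypeOps extends TypeOps {
     // Physical representation (PhysicalDataType, Java class, MutableValue, row writer)
     def physicalJavaClass: Class[_]            // assumed member
   }
   
   trait LiteralTypeOps extends TypeOps {
     // Literal creation and defaults
     def defaultLiteralValue: Any               // assumed member
   }
   
   trait ExternalTypeOps extends TypeOps {
     // External <-> internal type conversion
     def toInternal(external: Any): Any         // assumed member
     def toExternal(internal: Any): Any         // assumed member
   }
   
   // Client-side (sql/api) counterparts
   trait TypeApiOps {
     def dataType: DataType
   }
   
   trait FormatTypeOps extends TypeApiOps {
     // String formatting (CAST to STRING, SQL values)
     def formatToString(internal: Any): String  // assumed member
   }
   
   trait EncodeTypeOps extends TypeApiOps {
     // Row encoding (AgnosticEncoder)
     def encoder: AgnosticEncoder[_]            // assumed member
   }
   ```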
   
   The split across `sql/api` and `sql/catalyst` follows existing Spark module 
separation - `TypeApiOps` lives in `sql/api` for client-side operations that 
depend on `AgnosticEncoder`, while `TypeOps` lives in `sql/catalyst` for 
server-side operations that depend on `InternalRow`, `PhysicalDataType`, etc.
   
   All integration points use a **check-and-delegate** pattern at the beginning 
of existing match statements:
   ```scala
   def someOperation(dt: DataType) = dt match {
     // Framework dispatch (guarded by feature flag)
     case _ if SQLConf.get.typesFrameworkEnabled && SomeOps.supports(dt) =>
       SomeOps(dt).someMethod()
     // Legacy types (unchanged)
     case DateType => ...
     case TimestampType => ...
   }
   ```
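   
   The `supports`/`apply` pair implies a small factory per Ops interface. Below is a minimal, self-contained sketch of that dispatch shape - the trait body, the `lookup` registry, and the placeholder behavior are assumptions, with `TimeType` as the only registered type since it is the proof of concept in this PR.
   
   ```scala
   import org.apache.spark.sql.types.{DataType, TimeType}
   
   // Hypothetical sketch of the supports/apply dispatch used above; the real
   // factory and registration mechanism in this PR may look different.
   trait SomeOps {
     def someMethod(): String
   }
   
   object SomeOps {
     // One Ops instance per framework-supported type.
     private def lookup(dt: DataType): Option[SomeOps] = dt match {
       case t: TimeType => Some(new SomeOps {
         override def someMethod(): String = t.catalogString  // placeholder behavior
       })
       case _ => None
     }
   
     def supports(dt: DataType): Boolean = lookup(dt).isDefined
   
     def apply(dt: DataType): SomeOps = lookup(dt).getOrElse(
       throw new IllegalArgumentException(s"$dt is not framework-supported"))
   }
   ```
   
   Because the guard combines the feature flag with `supports`, any type not registered in the factory (and any run with the flag off) falls through to the unchanged legacy match arms.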
   
   This is the first of several planned PRs. Subsequent PRs will add 
client-side integrations (Spark Connect proto, Arrow SerDe, JDBC, Python, 
Thrift) and storage format integrations (Parquet, ORC, CSV, JSON, etc.).
   
   ### Why are the changes needed?
   
   Adding a new data type to Spark currently requires modifying **50+ files** 
with scattered type-specific logic. Each file has its own conventions, and 
there is no compiler assistance to ensure completeness. Integration points are 
non-obvious and easy to miss - patterns include `_: TimeType` in Scala pattern 
matching, `TimeNanoVector` in Arrow SerDe, `.hasTime()`/`.getTime()` in proto 
fields, `LocalTimeEncoder` in encoder helpers, `java.sql.Types.TIME` in JDBC, 
`instanceof TimeType` in Java files, and compound matches like `case LongType | 
... | _: TimeType =>` that are invisible to naive searches.
   
   The framework centralizes type-specific infrastructure operations in Ops 
interface classes. When adding a new type with the framework in place, a 
developer creates two Ops classes (one in `sql/api`, one in `sql/catalyst`) and 
registers them in the corresponding factory objects. The compiler enforces that 
all required interface methods are implemented, significantly reducing the risk 
of missing integration points.
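   
   A rough illustration of that two-step workflow - the `LiteralTypeOps` members, `MyType`, and `MyTypeOps` below are hypothetical stand-ins, not code from this PR.
   
   ```scala
   import org.apache.spark.sql.types.DataType
   
   // Hypothetical stand-in for one of the PR's interfaces; members are assumed.
   trait LiteralTypeOps {
     def dataType: DataType
     def defaultLiteralValue: Any
   }
   
   // Step 1: the new type's Ops class implements the required interfaces.
   // The compiler rejects it until every abstract member is defined, which is
   // what replaces grep-based hunting for scattered integration points.
   final case class MyTypeOps(dataType: DataType) extends LiteralTypeOps {
     override def defaultLiteralValue: Any = 0L  // placeholder default
   }
   
   // Step 2: register the class in the corresponding factory object, e.g. by
   // adding a `case _: MyType => Some(MyTypeOps(dt))` arm to its lookup, so
   // that supports()/apply() route the type through every integration point.
   ```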
   
   **Concrete example - TimeType:** TimeType has integration points spread 
across 50+ files using the diverse patterns listed above (physical type 
mapping, literals, type converters, encoders, formatters, Arrow SerDe, proto 
conversion, JDBC, Python, Thrift, storage formats). With the framework, these 
are consolidated into two Ops classes: `TimeTypeOps` (~150 lines) and 
`TimeTypeApiOps` (~90 lines). A developer adding a new type with similar 
complexity would create two analogous files instead of touching 50+ files. The framework does not cover type-specific expressions (e.g., `CurrentTime`, `TimeAddInterval`) or SQL parser changes - those remain inherently per-type work; the framework provides the primitives they build on.
   
   This PR covers the core infrastructure integration. Subsequent PRs will add 
client-side integrations (Spark Connect proto, Arrow SerDe, JDBC, Python, 
Thrift) and storage format integrations (Parquet, ORC, CSV, JSON, etc.).
   
   ### Does this PR introduce _any_ user-facing change?
   
   No. This is an internal refactoring behind a feature flag 
(`spark.sql.types.framework.enabled`). When the flag is enabled, 
framework-supported types use centralized Ops dispatch instead of direct 
pattern matching. Behavior is identical in both paths. The flag defaults to 
`true` in tests and `false` otherwise.
   
   ### How was this patch tested?
   
   The framework is a refactoring of existing dispatch logic - it changes the 
mechanism but preserves identical behavior. The feature flag is enabled by 
default in test environments (`Utils.isTesting`), so the entire existing test 
suite validates the framework code path. No new tests are added in this PR 
because the framework delegates to the same underlying logic that existing 
tests already cover.
   
   In subsequent phases, the testing focus will be on:
   1. Testing the framework itself (Ops interface contracts, roundtrip 
correctness, edge cases)
   2. Designing a generalized testing mechanism that enforces proper test 
coverage for each type added through the framework
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   Co-authored with: claude-opus-4-6

