davidm-db opened a new pull request, #54223:
URL: https://github.com/apache/spark/pull/54223
### What changes were proposed in this pull request?
This PR introduces the foundation of the **Spark Types Framework** - a
system for centralizing type-specific operations that are currently scattered
across 50+ files using diverse patterns.
**Framework interfaces** (9 new files in `sql/api` and `sql/catalyst`):
- Base traits: `TypeOps` (catalyst), `TypeApiOps` (sql-api)
- Core interfaces: `PhyTypeOps` (physical type representation),
`LiteralTypeOps` (literal creation), `ExternalTypeOps` (internal/external type
conversion), `FormatTypeOps` (string formatting), `EncodeTypeOps` (row encoders)
- Proof-of-concept implementation: `TimeTypeOps` + `TimeTypeApiOps` for
TimeType
**Integration points** (10 existing files modified with check-and-delegate
pattern):
- `PhysicalDataType.scala` - physical type dispatch
- `CatalystTypeConverters.scala` - external/internal type conversion (via
`TypeOpsConverter` adapter)
- `ToStringBase.scala` - string formatting
- `RowEncoder.scala` - row encoding
- `literals.scala` - default literal creation (`Literal.default`)
- `EncoderUtils.scala` - encoder Java class mapping
- `CodeGenerator.scala` - codegen Java class mapping
- `SpecificInternalRow.scala` - mutable value creation
- `InternalRow.scala` - row writer dispatch
**Feature flag**: `spark.sql.types.framework.enabled` (defaults to `true` in
tests via `Utils.isTesting`, `false` otherwise), configured in `SQLConf.scala`
+ `SqlApiConf.scala`.
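For reviewers, the flag definition presumably follows the standard config builder pattern. A minimal sketch (the flag name and defaults are from this PR; the doc text and exact builder chain are assumptions):
```scala
// Hypothetical sketch of the entry inside the SQLConf object; the flag name is
// from this PR, everything else here is illustrative.
val TYPES_FRAMEWORK_ENABLED = buildConf("spark.sql.types.framework.enabled")
  .internal()
  .doc("When true, framework-supported types dispatch type-specific operations " +
    "through their Ops classes instead of direct pattern matching.")
  .booleanConf
  .createWithDefault(Utils.isTesting)

// Getter used by the check-and-delegate call sites (SQLConf.get.typesFrameworkEnabled).
def typesFrameworkEnabled: Boolean = getConf(TYPES_FRAMEWORK_ENABLED)
```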
**Interface hierarchy:**
```
TypeOps (catalyst)
+-- PhyTypeOps - Physical representation (PhysicalDataType, Java class, MutableValue, row writer)
+-- LiteralTypeOps - Literal creation and defaults
+-- ExternalTypeOps - External <-> internal type conversion

TypeApiOps (sql-api)
+-- FormatTypeOps - String formatting (CAST to STRING, SQL values)
+-- EncodeTypeOps - Row encoding (AgnosticEncoder)
```
The split across `sql/api` and `sql/catalyst` follows existing Spark module
separation - `TypeApiOps` lives in `sql/api` for client-side operations that
depend on `AgnosticEncoder`, while `TypeOps` lives in `sql/catalyst` for
server-side operations that depend on `InternalRow`, `PhysicalDataType`, etc.
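To make the split concrete, here is a rough sketch of the trait shapes implied by the hierarchy above. The trait names come from this PR; the methods and signatures are illustrative assumptions only:
```scala
import org.apache.spark.sql.catalyst.encoders.AgnosticEncoder
import org.apache.spark.sql.catalyst.expressions.Literal
import org.apache.spark.sql.catalyst.types.PhysicalDataType
import org.apache.spark.sql.types.DataType

// --- sql/api: client-side operations, no catalyst internals ---
trait TypeApiOps {
  def dataType: DataType
}

trait FormatTypeOps extends TypeApiOps {
  // String rendering of a value (CAST to STRING, SQL literal text).
  def format(value: Any): String
}

trait EncodeTypeOps extends TypeApiOps {
  // Encoder used when building row encoders for this type.
  def encoder: AgnosticEncoder[_]
}

// --- sql/catalyst: server-side operations needing InternalRow, PhysicalDataType, etc. ---
trait TypeOps {
  def dataType: DataType
}

trait PhyTypeOps extends TypeOps {
  def physicalDataType: PhysicalDataType
  def javaClass: Class[_]
}

trait LiteralTypeOps extends TypeOps {
  def defaultLiteral: Literal
}

trait ExternalTypeOps extends TypeOps {
  def toInternal(external: Any): Any
  def toExternal(internal: Any): Any
}
```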
All integration points use a **check-and-delegate** pattern at the beginning
of existing match statements:
```scala
def someOperation(dt: DataType) = dt match {
  // Framework dispatch (guarded by feature flag)
  case _ if SQLConf.get.typesFrameworkEnabled && SomeOps.supports(dt) =>
    SomeOps(dt).someMethod()

  // Legacy types (unchanged)
  case DateType => ...
  case TimestampType => ...
}
```
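`SomeOps` above is a placeholder. A factory object backing `supports`/`apply` might look roughly like this (registering a type by adding a case here is an assumption about how the PR's factory objects are organized):
```scala
import org.apache.spark.sql.types.{DataType, TimeType}

// Hypothetical sketch of one Ops factory. `supports` and `apply` are the two
// entry points used by the dispatch above.
object PhyTypeOps {
  private def lookup(dt: DataType): Option[PhyTypeOps] = dt match {
    case t: TimeType => Some(TimeTypeOps(t)) // proof-of-concept type from this PR
    case _           => None                 // type not handled by the framework
  }

  def supports(dt: DataType): Boolean = lookup(dt).isDefined

  def apply(dt: DataType): PhyTypeOps = lookup(dt).getOrElse(
    throw new IllegalArgumentException(s"$dt is not supported by the types framework"))
}
```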
This is the first of several planned PRs. Subsequent PRs will add
client-side integrations (Spark Connect proto, Arrow SerDe, JDBC, Python,
Thrift) and storage format integrations (Parquet, ORC, CSV, JSON, etc.).
### Why are the changes needed?
Adding a new data type to Spark currently requires modifying **50+ files**
with scattered type-specific logic. Each file has its own conventions, and
there is no compiler assistance to ensure completeness. Integration points are
non-obvious and easy to miss - patterns include `_: TimeType` in Scala pattern
matching, `TimeNanoVector` in Arrow SerDe, `.hasTime()`/`.getTime()` in proto
fields, `LocalTimeEncoder` in encoder helpers, `java.sql.Types.TIME` in JDBC,
`instanceof TimeType` in Java files, and compound matches like `case LongType |
... | _: TimeType =>` that are invisible to naive searches.
The framework centralizes type-specific infrastructure operations in Ops
interface classes. When adding a new type with the framework in place, a
developer creates two Ops classes (one in `sql/api`, one in `sql/catalyst`) and
registers them in the corresponding factory objects. The compiler enforces that
all required interface methods are implemented, significantly reducing the risk
of missing integration points.
**Concrete example - TimeType:** TimeType has integration points spread
across 50+ files using the diverse patterns listed above (physical type
mapping, literals, type converters, encoders, formatters, Arrow SerDe, proto
conversion, JDBC, Python, Thrift, storage formats). With the framework, these
are consolidated into two Ops classes: `TimeTypeOps` (~150 lines) and
`TimeTypeApiOps` (~90 lines). A developer adding a new type with similar
complexity would create two analogous files instead of touching 50+ files. The
framework does not cover type-specific expressions (e.g., `CurrentTime`,
`TimeAddInterval`) or SQL parser changes, which remain inherently type-specific -
it provides the primitives that those features build on.
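For illustration, the shape of such an Ops class might look roughly like the following, continuing the trait sketch above. The method names and bodies here are assumptions; the actual `TimeTypeOps` in this PR is larger and may differ:
```scala
import java.time.LocalTime
import org.apache.spark.sql.catalyst.expressions.Literal
import org.apache.spark.sql.catalyst.types.{PhysicalDataType, PhysicalLongType}
import org.apache.spark.sql.types.{DataType, TimeType}

// Hypothetical sketch only; the compiler forces every interface method to be implemented.
case class TimeTypeOps(timeType: TimeType)
  extends PhyTypeOps with LiteralTypeOps with ExternalTypeOps {

  override def dataType: DataType = timeType

  // TIME values are held internally as nanoseconds of the day in a Long.
  override def physicalDataType: PhysicalDataType = PhysicalLongType
  override def javaClass: Class[_] = classOf[Long]

  override def defaultLiteral: Literal = Literal(0L, timeType)

  // External java.time.LocalTime <-> internal Long (nanos of day).
  override def toInternal(external: Any): Any =
    external.asInstanceOf[LocalTime].toNanoOfDay
  override def toExternal(internal: Any): Any =
    LocalTime.ofNanoOfDay(internal.asInstanceOf[Long])
}
```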
This PR covers the core infrastructure integration; the client-side and
storage format integrations listed above will follow in subsequent PRs.
### Does this PR introduce _any_ user-facing change?
No. This is an internal refactoring behind a feature flag
(`spark.sql.types.framework.enabled`). When the flag is enabled,
framework-supported types use centralized Ops dispatch instead of direct
pattern matching. Behavior is identical in both paths. The flag defaults to
`true` in tests and `false` otherwise.
### How was this patch tested?
The framework is a refactoring of existing dispatch logic - it changes the
mechanism but preserves identical behavior. The feature flag is enabled by
default in test environments (`Utils.isTesting`), so the entire existing test
suite validates the framework code path. No new tests are added in this PR
because the framework delegates to the same underlying logic that existing
tests already cover.
In subsequent phases, the testing focus will be on:
1. Testing the framework itself (Ops interface contracts, roundtrip
correctness, edge cases); a rough shape for such a test is sketched below
2. Designing a generalized testing mechanism that enforces proper test
coverage for each type added through the framework
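As an example of point 1, a roundtrip test over the Ops contracts might take roughly this shape (it uses the hypothetical method names from the sketches above, not necessarily the API in this PR):
```scala
import java.time.LocalTime
import org.scalatest.funsuite.AnyFunSuite
import org.apache.spark.sql.types.TimeType

// Hypothetical test shape; TimeTypeOps and toInternal/toExternal follow the
// sketches above.
class TimeTypeOpsRoundtripSuite extends AnyFunSuite {
  test("external <-> internal conversion roundtrips LocalTime values") {
    val ops = TimeTypeOps(TimeType(6)) // microsecond precision
    val values = Seq(LocalTime.MIDNIGHT, LocalTime.NOON, LocalTime.of(23, 59, 59, 999999000))
    values.foreach { t =>
      assert(ops.toExternal(ops.toInternal(t)) === t)
    }
  }
}
```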
### Was this patch authored or co-authored using generative AI tooling?
Co-authored with: claude-opus-4-6