David Milicevic created SPARK-55439:
---------------------------------------

             Summary: Types Framework
                 Key: SPARK-55439
                 URL: https://issues.apache.org/jira/browse/SPARK-55439
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 4.2.0
            Reporter: David Milicevic


*Summary:*

Introduce a Types Framework to centralize Spark's scattered type-specific 
pattern matching.

*Description:*

Adding a new data type to Spark currently requires modifying 50+ files that 
contain scattered type-specific logic expressed in diverse patterns (taking the 
TIME type as an example: '_: TimeType', 'TimeNanoVector', '.hasTime()', 
'LocalTimeEncoder', 'instanceof TimeType', etc.). There is no compiler 
assistance to ensure completeness, and integration points are easy to miss.

This effort introduces a *Types Framework* that centralizes type-specific 
infrastructure operations in Ops interface classes. Instead of modifying dozens 
of files, a developer implementing a new type creates two Ops classes and 
registers them in factory objects. The compiler enforces interface completeness.
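
To illustrate the intended workflow, here is a minimal sketch of what "two Ops 
classes plus factory registration" could look like. Note that 
{{TypeOpsFactory}}, {{TimeTypeOps}}, and {{TimeTypeApiOps}} are hypothetical 
names for illustration only, not the framework's actual identifiers:

{code:java}
// Hypothetical registry; the real factory objects may differ.
object TypeOpsFactory {
  private var registry = Map.empty[String, AnyRef]

  // A new type registers its Ops instance once, here.
  def register(typeName: String, ops: AnyRef): Unit =
    registry += (typeName -> ops)

  def lookup(typeName: String): Option[AnyRef] = registry.get(typeName)
}

// A developer adding TIME would implement the two Ops classes...
object TimeTypeOps     // catalyst-side operations (physical type, literals, ...)
object TimeTypeApiOps  // api-side operations (formatting, encoding)

// ...and register them once, instead of touching 50+ files:
object RegistrationExample {
  def main(args: Array[String]): Unit = {
    TypeOpsFactory.register("time", TimeTypeOps)
    TypeOpsFactory.register("time-api", TimeTypeApiOps)
  }
}
{code}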

*Concrete example - TimeType:*

TimeType (the proof-of-concept for this framework) has integration points 
spread across 50+ files for infrastructure concerns like physical type mapping, 
literals, type converters, encoders, formatters, Arrow SerDe, proto conversion, 
JDBC, Python, Thrift, and storage formats. With the framework, these are 
consolidated into two Ops classes (~240 lines total). A developer adding a new 
type with similar complexity would create two analogous files instead of 
touching 50+ files.

*Architecture:*

The framework defines a hierarchy of Ops traits that each cover a specific 
category of type operations:

 
{code:java}
TypeOps (sql/catalyst)
+-- PhyTypeOps      - Physical type representation
+-- LiteralTypeOps  - Literal creation and defaults
+-- ExternalTypeOps - External <-> internal type conversion
+-- ProtoTypeOps    - Spark Connect proto serialization
+-- ClientTypeOps   - JDBC, Arrow, Python, Thrift integration

TypeApiOps (sql/api)
+-- FormatTypeOps   - String formatting
+-- EncodeTypeOps   - Row encoding (AgnosticEncoder) {code}
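
The hierarchy above could be sketched as Scala traits along the following 
lines. The trait names mirror the diagram, but the member signatures are 
illustrative assumptions, not the framework's actual API:

{code:java}
// Root trait for catalyst-side type operations (illustrative signatures).
trait TypeOps {
  def dataTypeName: String              // the DataType this Ops covers
}
trait PhyTypeOps extends TypeOps {
  def physicalTypeName: String          // physical type representation
}
trait LiteralTypeOps extends TypeOps {
  def defaultLiteral: Any               // literal creation and defaults
}

// Root trait for api-side type operations.
trait TypeApiOps {
  def dataTypeName: String
}
trait FormatTypeOps extends TypeApiOps {
  def format(value: Any): String        // string formatting
}

// Hypothetical TIME implementations covering two of the categories:
object TimePhyOps extends PhyTypeOps {
  def dataTypeName = "time"
  def physicalTypeName = "Long"         // e.g. nanoseconds since midnight
}
object TimeFormatOps extends FormatTypeOps {
  def dataTypeName = "time"
  def format(value: Any): String = s"TIME($value)"
}
{code}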
 

Existing integration points use a *check-and-delegate* pattern guarded by a 
feature flag to dispatch to the framework while preserving legacy behavior as 
fallback:

 
{code:java}
def someOperation(dt: DataType) = dt match {
  case _ if SQLConf.get.typesFrameworkEnabled && SomeOps.supports(dt) =>
    SomeOps(dt).someMethod()
  case DateType => ...  // legacy types unchanged
} {code}
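
The check-and-delegate dispatch can be made concrete with a small 
self-contained sketch. {{SimpleOps}}, {{frameworkEnabled}}, and the toy 
{{DataType}} hierarchy below are stand-ins for illustration, not Spark's 
actual classes or configuration:

{code:java}
// Toy type hierarchy standing in for Spark's DataType.
sealed trait DataType
case object DateType extends DataType
case object TimeType extends DataType

object SimpleOps {
  // The framework "supports" only the new type under migration.
  def supports(dt: DataType): Boolean = dt == TimeType
  def apply(dt: DataType): SimpleOps = new SimpleOps(dt)
}

class SimpleOps(dt: DataType) {
  def typeName(): String = "framework:" + dt.toString
}

object Dispatch {
  val frameworkEnabled = true  // stands in for SQLConf.get.typesFrameworkEnabled

  def someOperation(dt: DataType): String = dt match {
    case _ if frameworkEnabled && SimpleOps.supports(dt) =>
      SimpleOps(dt).typeName()          // new path: delegate to the framework
    case DateType => "legacy:DateType"  // legacy types unchanged
    case other    => "legacy:" + other.toString
  }
}
{code}

With the flag off, or for a type the framework does not support, dispatch 
falls through to the legacy cases unchanged, which is what makes incremental 
migration safe.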
 

TimeType serves as the proof-of-concept implementation. Once the framework is 
validated, additional types can be implemented or migrated incrementally.

*Scope:*
 * In scope: Physical type representation, literal creation, type conversion, 
string formatting, row encoding, proto serialization, client integration (JDBC, 
Arrow, Python, Thrift), storage formats, testing infrastructure, documentation
 * Out of scope: Type-specific expressions and arithmetic, SQL parser changes. 
The framework provides primitives that expressions use, not the expressions 
themselves.

See the attached design document for full details.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
