[
https://issues.apache.org/jira/browse/SPARK-53504?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
David Milicevic updated SPARK-53504:
------------------------------------
Description:
*Summary:*
Introduce a Types Framework to centralize scattered type-specific pattern
matching
*Description:*
Adding a new data type to Spark currently requires modifying 50+ files with
scattered type-specific logic that uses diverse patterns (examples from the TIME
type: '_: TimeType', 'TimeNanoVector', '.hasTime()', 'LocalTimeEncoder',
'instanceof TimeType', etc.). There is no compiler assistance to ensure
completeness, and integration points are easy to miss.
This effort introduces a *Types Framework* that centralizes type-specific
infrastructure operations in Ops interface classes. Instead of modifying dozens
of files, a developer implementing a new type creates two Ops classes and
registers them in factory objects. The compiler enforces interface completeness.
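To make the intended workflow concrete, here is a minimal, self-contained sketch;
the member signatures and the names SimpleDataType, TimeLikeType, TimeLikeTypeOps
and TypeOpsRegistry are illustrative assumptions, not the framework's actual API:
{code:java}
// Minimal illustrative sketch, not the framework's actual API.
// SimpleDataType stands in for Spark's DataType hierarchy.
sealed trait SimpleDataType
case object TimeLikeType extends SimpleDataType

// Framework-side trait: every member must be implemented, so the compiler
// flags an incomplete integration instead of it surfacing at runtime.
trait TypeOps {
  def dataType: SimpleDataType
  def physicalTypeName: String // physical representation, e.g. "LONG"
  def defaultLiteral: Any      // default value used when creating literals
}

// One Ops object per new type.
object TimeLikeTypeOps extends TypeOps {
  override def dataType: SimpleDataType = TimeLikeType
  override def physicalTypeName: String = "LONG"
  override def defaultLiteral: Any = 0L
}

// Central factory: registering the Ops object here replaces edits scattered
// across many files.
object TypeOpsRegistry {
  private val registry: Map[SimpleDataType, TypeOps] =
    Map(TimeLikeType -> TimeLikeTypeOps)

  def supports(dt: SimpleDataType): Boolean = registry.contains(dt)
  def apply(dt: SimpleDataType): TypeOps = registry(dt)
} {code}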
*Concrete example - TimeType:*
TimeType (the proof-of-concept for this framework) has integration points
spread across 50+ files for infrastructure concerns like physical type mapping,
literals, type converters, encoders, formatters, Arrow SerDe, proto conversion,
JDBC, Python, Thrift, and storage formats. With the framework, these are
consolidated into two Ops classes (~240 lines total). A developer adding a new
type with similar complexity would create two analogous files instead of
touching 50+ files.
*Architecture:*
The framework defines a hierarchy of Ops traits that each cover a specific
category of type operations:
{code:java}
TypeOps (sql/catalyst)
  +-- PhyTypeOps      - Physical type representation
  +-- LiteralTypeOps  - Literal creation and defaults
  +-- ExternalTypeOps - External <-> internal type conversion
  +-- ProtoTypeOps    - Spark Connect proto serialization
  +-- ClientTypeOps   - JDBC, Arrow, Python, Thrift integration
TypeApiOps (sql/api)
  +-- FormatTypeOps   - String formatting
  +-- EncodeTypeOps   - Row encoding (AgnosticEncoder) {code}
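As a sketch of how these categories could compose (the member signatures below
are assumptions for illustration, not definitions taken from the design doc), a
single type's Ops implementation might mix in only the categories it supports:
{code:java}
// Illustrative sketch of category traits and mixin composition; member
// signatures are assumed, not the framework's actual definitions.
trait PhyTypeOps     { def physicalTypeName: String }   // physical representation
trait LiteralTypeOps { def defaultLiteral: Any }        // literal creation and defaults
trait FormatTypeOps  { def format(value: Any): String } // string formatting

// One Ops class per type mixes in the relevant categories; a missing member
// is a compile-time error rather than a missed integration file.
class TimeLikeOps extends PhyTypeOps with LiteralTypeOps with FormatTypeOps {
  override def physicalTypeName: String = "LONG"
  override def defaultLiteral: Any = 0L
  override def format(value: Any): String = s"TIME '$value'"
} {code}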
Existing integration points use a *check-and-delegate* pattern, guarded by a
feature flag, that dispatches to the framework while preserving legacy behavior
as a fallback:
{code:java}
def someOperation(dt: DataType) = dt match {
  case _ if SQLConf.get.typesFrameworkEnabled && SomeOps.supports(dt) =>
    SomeOps(dt).someMethod()
  case DateType => ... // legacy types unchanged
} {code}
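Spelled out as a runnable, self-contained sketch (the flag, the SomeOps registry,
and the legacy branch below are stand-ins that mirror the snippet above, not the
actual Spark code):
{code:java}
// Runnable sketch of check-and-delegate dispatch; all identifiers are
// illustrative stand-ins for SQLConf, SomeOps, and the legacy match arms.
object DispatchSketch {
  sealed trait SimpleDataType
  case object DateLikeType extends SimpleDataType
  case object TimeLikeType extends SimpleDataType

  trait SomeOps { def someMethod(): String }

  object SomeOps {
    private val registry: Map[SimpleDataType, SomeOps] = Map(
      TimeLikeType -> new SomeOps { def someMethod(): String = "handled by the framework" }
    )
    def supports(dt: SimpleDataType): Boolean = registry.contains(dt)
    def apply(dt: SimpleDataType): SomeOps = registry(dt)
  }

  // Stand-in for the SQLConf feature flag.
  val typesFrameworkEnabled: Boolean = true

  def someOperation(dt: SimpleDataType): String = dt match {
    // New path: taken only when the flag is on and the framework knows the type.
    case _ if typesFrameworkEnabled && SomeOps.supports(dt) =>
      SomeOps(dt).someMethod()
    // Legacy path: existing types keep their current behavior untouched.
    case DateLikeType => "handled by the legacy branch"
    case other => throw new IllegalArgumentException(s"Unsupported type: $other")
  }

  def main(args: Array[String]): Unit = {
    println(someOperation(TimeLikeType)) // handled by the framework
    println(someOperation(DateLikeType)) // handled by the legacy branch
  }
} {code}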
TimeType serves as the proof-of-concept implementation. Once the framework is
validated, additional types can be implemented or migrated incrementally.
*Scope:*
* In scope: Physical type representation, literal creation, type conversion,
string formatting, row encoding, proto serialization, client integration (JDBC,
Arrow, Python, Thrift), storage formats, testing infrastructure, documentation
* Out of scope: Type-specific expressions and arithmetic, SQL parser changes.
The framework provides primitives that expressions use, not the expressions
themselves.
*Note:*
[2026/02/09] Only the basic set of sub-tasks has been created for now. It is an
estimate of what needs to be covered and will most likely be updated as
development of the feature proceeds. The scope of the changes may also be
extended along the way, for example if any topics turn out to have been missed
or once a proper way to abstract the expressions is figured out.
*Design doc:*
See the attached design document for full details.
> Types framework
> ---------------
>
> Key: SPARK-53504
> URL: https://issues.apache.org/jira/browse/SPARK-53504
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 4.2.0
> Reporter: Max Gekk
> Assignee: Max Gekk
> Priority: Major
> Labels: pull-request-available
> Attachments: TYPES_FRAMEWORK_DESIGN_V2.md