[
https://issues.apache.org/jira/browse/SPARK-53504?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
David Milicevic updated SPARK-53504:
------------------------------------
Description:
*Summary:*
Introduce a Types Framework to centralize scattered type-specific pattern
matching
*Description:*
Adding a new data type to Spark currently requires modifying 50+ files with
scattered type-specific logic that uses diverse patterns (examples from the TIME
type: '_: TimeType', 'TimeNanoVector', '.hasTime()', 'LocalTimeEncoder',
'instanceof TimeType', etc.). There is no compiler assistance to ensure
completeness, and integration points are easy to miss.
This effort introduces a *Types Framework* that centralizes type-specific
infrastructure operations in Ops interface classes. Instead of modifying dozens
of files, a developer implementing a new type creates two Ops classes and
registers them in factory objects. The compiler enforces interface completeness.
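To make the intended workflow concrete, here is a minimal, self-contained sketch;
the member signatures and the names SimpleDataType, TimeLikeType, TimeLikeTypeOps
and TypeOpsRegistry are illustrative assumptions, not the framework's actual API:
{code:java}
// Minimal illustrative sketch, not the framework's actual API.
// SimpleDataType stands in for Spark's DataType hierarchy.
sealed trait SimpleDataType
case object TimeLikeType extends SimpleDataType

// Framework-side trait: every member must be implemented, so the compiler
// flags an incomplete integration instead of it surfacing at runtime.
trait TypeOps {
  def dataType: SimpleDataType
  def physicalTypeName: String // physical representation, e.g. "LONG"
  def defaultLiteral: Any      // default value used when creating literals
}

// One Ops object per new type.
object TimeLikeTypeOps extends TypeOps {
  override def dataType: SimpleDataType = TimeLikeType
  override def physicalTypeName: String = "LONG"
  override def defaultLiteral: Any = 0L
}

// Central factory: registering the Ops object here replaces edits scattered
// across many files.
object TypeOpsRegistry {
  private val registry: Map[SimpleDataType, TypeOps] =
    Map(TimeLikeType -> TimeLikeTypeOps)

  def supports(dt: SimpleDataType): Boolean = registry.contains(dt)
  def apply(dt: SimpleDataType): TypeOps = registry(dt)
} {code}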
*Concrete example - TimeType:*
TimeType (the proof-of-concept for this framework) has integration points
spread across 50+ files for infrastructure concerns like physical type mapping,
literals, type converters, encoders, formatters, Arrow SerDe, proto conversion,
JDBC, Python, Thrift, and storage formats. With the framework, these are
consolidated into two Ops classes (~240 lines total). A developer adding a new
type with similar complexity would create two analogous files instead of
touching 50+ files.
*Architecture:*
The framework defines a hierarchy of Ops traits that each cover a specific
category of type operations:
{code:java}
TypeOps (sql/catalyst)
  +-- PhyTypeOps      - Physical type representation
  +-- LiteralTypeOps  - Literal creation and defaults
  +-- ExternalTypeOps - External <-> internal type conversion
  +-- ProtoTypeOps    - Spark Connect proto serialization
  +-- ClientTypeOps   - JDBC, Arrow, Python, Thrift integration
TypeApiOps (sql/api)
  +-- FormatTypeOps   - String formatting
  +-- EncodeTypeOps   - Row encoding (AgnosticEncoder) {code}
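As a sketch of how these categories could compose (the member signatures below
are assumptions for illustration, not definitions taken from the design doc), a
single type's Ops implementation might mix in only the categories it supports:
{code:java}
// Illustrative sketch of category traits and mixin composition; member
// signatures are assumed, not the framework's actual definitions.
trait PhyTypeOps     { def physicalTypeName: String }   // physical representation
trait LiteralTypeOps { def defaultLiteral: Any }        // literal creation and defaults
trait FormatTypeOps  { def format(value: Any): String } // string formatting

// One Ops class per type mixes in the relevant categories; a missing member
// is a compile-time error rather than a missed integration file.
class TimeLikeOps extends PhyTypeOps with LiteralTypeOps with FormatTypeOps {
  override def physicalTypeName: String = "LONG"
  override def defaultLiteral: Any = 0L
  override def format(value: Any): String = s"TIME '$value'"
} {code}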
Existing integration points use a *check-and-delegate* pattern, guarded by a
feature flag, that dispatches to the framework while preserving legacy behavior
as a fallback:
{code:java}
def someOperation(dt: DataType) = dt match {
  case _ if SQLConf.get.typesFrameworkEnabled && SomeOps.supports(dt) =>
    SomeOps(dt).someMethod()
  case DateType => ... // legacy types unchanged
} {code}
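Spelled out as a runnable, self-contained sketch (the flag, the SomeOps registry,
and the legacy branch below are stand-ins that mirror the snippet above, not the
actual Spark code):
{code:java}
// Runnable sketch of check-and-delegate dispatch; all identifiers are
// illustrative stand-ins for SQLConf, SomeOps, and the legacy match arms.
object DispatchSketch {
  sealed trait SimpleDataType
  case object DateLikeType extends SimpleDataType
  case object TimeLikeType extends SimpleDataType

  trait SomeOps { def someMethod(): String }

  object SomeOps {
    private val registry: Map[SimpleDataType, SomeOps] = Map(
      TimeLikeType -> new SomeOps { def someMethod(): String = "handled by the framework" }
    )
    def supports(dt: SimpleDataType): Boolean = registry.contains(dt)
    def apply(dt: SimpleDataType): SomeOps = registry(dt)
  }

  // Stand-in for the SQLConf feature flag.
  val typesFrameworkEnabled: Boolean = true

  def someOperation(dt: SimpleDataType): String = dt match {
    // New path: taken only when the flag is on and the framework knows the type.
    case _ if typesFrameworkEnabled && SomeOps.supports(dt) =>
      SomeOps(dt).someMethod()
    // Legacy path: existing types keep their current behavior untouched.
    case DateLikeType => "handled by the legacy branch"
    case other => throw new IllegalArgumentException(s"Unsupported type: $other")
  }

  def main(args: Array[String]): Unit = {
    println(someOperation(TimeLikeType)) // handled by the framework
    println(someOperation(DateLikeType)) // handled by the legacy branch
  }
} {code}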
TimeType serves as the proof-of-concept implementation. Once the framework is
validated, additional types can be implemented or migrated incrementally.
*Scope:*
* In scope: Physical type representation, literal creation, type conversion,
string formatting, row encoding, proto serialization, client integration (JDBC,
Arrow, Python, Thrift), storage formats, testing infrastructure, documentation
* Out of scope: Type-specific expressions and arithmetic, SQL parser changes.
The framework provides primitives that expressions use, not the expressions
themselves.
*Note:*
[2026/02/09] Only the basic set of sub-tasks has been created for now. It is an
estimate of what needs to be covered and will most likely be updated as
development of the feature proceeds. The scope of the changes may also be
extended along the way, for example if any topics turn out to have been missed
or once a proper way to abstract the expressions is figured out.
*Design doc:*
See the attached design document for full details.
> Types framework
> ---------------
>
> Key: SPARK-53504
> URL: https://issues.apache.org/jira/browse/SPARK-53504
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 4.2.0
> Reporter: Max Gekk
> Assignee: Max Gekk
> Priority: Major
> Labels: pull-request-available
> Attachments: TYPES_FRAMEWORK_DESIGN_V2.md