[
https://issues.apache.org/jira/browse/SPARK-55439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
David Milicevic resolved SPARK-55439.
-------------------------------------
Resolution: Duplicate
I didn't realize I had permission to edit the original work item:
https://issues.apache.org/jira/browse/SPARK-53504.
I'm closing this one and will update the original one!
> Types Framework
> ---------------
>
> Key: SPARK-55439
> URL: https://issues.apache.org/jira/browse/SPARK-55439
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 4.2.0
> Reporter: David Milicevic
> Priority: Major
> Attachments: TYPES_FRAMEWORK_DESIGN_V2.md
>
>
> *Summary:*
> Introduce a Types Framework to centralize scattered type-specific pattern
> matching.
> *Description:*
> Adding a new data type to Spark currently requires modifying 50+ files with
> scattered type-specific logic expressed through diverse patterns (taking the
> TIME type as an example: '_: TimeType', 'TimeNanoVector', '.hasTime()',
> 'LocalTimeEncoder', 'instanceof TimeType', etc.). There is no compiler
> assistance to ensure completeness, and integration points are easy to miss.
> This effort introduces a *Types Framework* that centralizes type-specific
> infrastructure operations in Ops interface classes. Instead of modifying
> dozens of files, a developer implementing a new type creates two Ops classes
> and registers them in factory objects. The compiler enforces interface
> completeness.
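> A rough sketch of the intended mechanics (PhyTypeOps comes from the hierarchy
> in the Architecture section below; all other names here are placeholders, not
> the actual API):
>
> {code:java}
> // Hypothetical sketch. Implementing an Ops trait forces a complete
> // implementation of every operation in that category, which is where
> // the compiler-enforced completeness comes from.
> trait PhyTypeOps {
>   def physicalType: PhysicalDataType
> }
>
> // Registration: one case in a factory object instead of 50+ scattered
> // match statements across the codebase.
> object PhyTypeOpsFactory {
>   def supports(dt: DataType): Boolean = dt.isInstanceOf[TimeType]
>   def apply(dt: DataType): PhyTypeOps = dt match {
>     case _: TimeType => TimeTypeOps
>   }
> }
> {code}
>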
> *Concrete example - TimeType:*
> TimeType (the proof-of-concept for this framework) has integration points
> spread across 50+ files for infrastructure concerns like physical type
> mapping, literals, type converters, encoders, formatters, Arrow SerDe, proto
> conversion, JDBC, Python, Thrift, and storage formats. With the framework,
> these are consolidated into two Ops classes (~240 lines total). A developer
> adding a new type with similar complexity would create two analogous files
> instead of touching 50+ files.
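> As an illustration of that consolidation (a sketch only - the attached design
> doc has the actual signatures; trait names follow the Architecture section
> below, and the method names are assumptions), the catalyst-side Ops object
> might look like:
>
> {code:java}
> // Hypothetical consolidated catalyst-side Ops object for TimeType.
> object TimeTypeOps extends PhyTypeOps with LiteralTypeOps with ExternalTypeOps {
>   // Physical representation: nanoseconds since midnight, stored as a Long.
>   override def physicalType: PhysicalDataType = PhysicalLongType
>
>   // Literal creation and defaults.
>   override def defaultLiteral: Literal = Literal(0L, TimeType())
>
>   // External (java.time.LocalTime) <-> internal (Long nanos) conversion.
>   override def toInternal(external: Any): Any =
>     external.asInstanceOf[java.time.LocalTime].toNanoOfDay
>   override def toExternal(internal: Any): Any =
>     java.time.LocalTime.ofNanoOfDay(internal.asInstanceOf[Long])
> }
> {code}
>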
> *Architecture:*
> The framework defines a hierarchy of Ops traits that each cover a specific
> category of type operations:
>
> {code:java}
> TypeOps (sql/catalyst)
>   +-- PhyTypeOps      - Physical type representation
>   +-- LiteralTypeOps  - Literal creation and defaults
>   +-- ExternalTypeOps - External <-> internal type conversion
>   +-- ProtoTypeOps    - Spark Connect proto serialization
>   +-- ClientTypeOps   - JDBC, Arrow, Python, Thrift integration
>
> TypeApiOps (sql/api)
>   +-- FormatTypeOps   - String formatting
>   +-- EncodeTypeOps   - Row encoding (AgnosticEncoder)
> {code}
>
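> On the sql/api side, a hedged sketch of what the two traits might require
> (signatures are assumptions; LocalTimeEncoder is the existing encoder
> mentioned in the description):
>
> {code:java}
> // Hypothetical sql/api-side traits and a TimeType implementation.
> trait TypeApiOps
>
> trait FormatTypeOps extends TypeApiOps {
>   def format(internal: Any): String
> }
>
> trait EncodeTypeOps extends TypeApiOps {
>   def encoder: AgnosticEncoder[_]
> }
>
> object TimeApiOps extends FormatTypeOps with EncodeTypeOps {
>   // String formatting of the internal Long-nanos value.
>   override def format(internal: Any): String =
>     java.time.LocalTime.ofNanoOfDay(internal.asInstanceOf[Long]).toString
>   // Row encoding via the existing AgnosticEncoder for java.time.LocalTime.
>   override def encoder: AgnosticEncoder[_] = AgnosticEncoders.LocalTimeEncoder
> }
> {code}
>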
> Existing integration points use a *check-and-delegate* pattern, guarded by a
> feature flag, to dispatch to the framework while preserving legacy behavior
> as a fallback:
>
> {code:java}
> def someOperation(dt: DataType) = dt match {
>   // New path: dispatch to the framework when the feature flag is on and
>   // the framework has an Ops implementation for this type.
>   case _ if SQLConf.get.typesFrameworkEnabled && SomeOps.supports(dt) =>
>     SomeOps(dt).someMethod()
>   // Legacy path: existing types keep their current behavior unchanged.
>   case DateType => ... // legacy types unchanged
> }
> {code}
>
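> The flag itself would presumably be an ordinary SQLConf entry, along these
> lines (the config key, version, and default are assumptions, not the final
> definition):
>
> {code:java}
> // Hypothetical SQLConf definition backing typesFrameworkEnabled.
> val TYPES_FRAMEWORK_ENABLED = buildConf("spark.sql.typesFramework.enabled")
>   .doc("When true, type-specific operations dispatch through the Types " +
>     "Framework; otherwise they fall back to legacy per-type handling.")
>   .version("4.2.0")
>   .booleanConf
>   .createWithDefault(false)
> {code}
>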
> TimeType serves as the proof-of-concept implementation. Once the framework is
> validated, additional types can be implemented or migrated incrementally.
> *Scope:*
> * In scope: Physical type representation, literal creation, type conversion,
> string formatting, row encoding, proto serialization, client integration
> (JDBC, Arrow, Python, Thrift), storage formats, testing infrastructure,
> documentation
> * Out of scope: Type-specific expressions and arithmetic, SQL parser
> changes. The framework provides primitives that expressions use, not the
> expressions themselves.
> See the attached design document for full details.