[ https://issues.apache.org/jira/browse/SPARK-53504?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18057294#comment-18057294 ]

David Milicevic commented on SPARK-53504:
-----------------------------------------

Updated the description and attached a design doc. I will create sub-tasks and 
start working on this!

> Types framework
> ---------------
>
>                 Key: SPARK-53504
>                 URL: https://issues.apache.org/jira/browse/SPARK-53504
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 4.2.0
>            Reporter: Max Gekk
>            Assignee: Max Gekk
>            Priority: Major
>              Labels: pull-request-available
>         Attachments: TYPES_FRAMEWORK_DESIGN_V2.md
>
>
> *Summary:*
> Introduce a Types Framework to centralize scattered type-specific pattern 
> matching.
> *Description:*
> Adding a new data type to Spark currently requires modifying 50+ files that 
> contain scattered type-specific logic written in diverse patterns (for the 
> TIME type alone: '_: TimeType', 'TimeNanoVector', '.hasTime()', 
> 'LocalTimeEncoder', 'instanceof TimeType', etc.). There is no compiler 
> assistance to ensure completeness, and integration points are easy to miss.
> This effort introduces a *Types Framework* that centralizes type-specific 
> infrastructure operations in Ops interface classes. Instead of modifying 
> dozens of files, a developer implementing a new type creates two Ops classes 
> and registers them in factory objects. The compiler enforces interface 
> completeness.
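> As a rough sketch of the registration flow (all names below are 
> hypothetical, not the framework's actual API), one of the two Ops classes 
> and its factory could look like:
>  
> {code:java}
> import org.apache.spark.sql.types.{DataType, TimeType}
> 
> trait TypeOps {
>   def supports(dt: DataType): Boolean
> }
> 
> class TimeTypeOps extends TypeOps {
>   override def supports(dt: DataType): Boolean = dt.isInstanceOf[TimeType]
>   // ... plus the category-specific members the real trait would define;
>   // forgetting one is a compile error, not a missed integration point.
> }
> 
> object TypeOpsFactory {
>   // Single registration point, replacing edits scattered across 50+ files.
>   private val registered: Seq[TypeOps] = Seq(new TimeTypeOps)
>   def supports(dt: DataType): Boolean = registered.exists(_.supports(dt))
>   def apply(dt: DataType): TypeOps = registered.find(_.supports(dt)).get
> } {code}
>  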
> *Concrete example - TimeType:*
> TimeType (the proof-of-concept for this framework) has integration points 
> spread across 50+ files for infrastructure concerns like physical type 
> mapping, literals, type converters, encoders, formatters, Arrow SerDe, proto 
> conversion, JDBC, Python, Thrift, and storage formats. With the framework, 
> these are consolidated into two Ops classes (~240 lines total). A developer 
> adding a new type with similar complexity would create two analogous files 
> instead of touching 50+ files.
> *Architecture:*
> The framework defines a hierarchy of Ops traits, each covering a specific 
> category of type operations:
>  
> {code:java}
> TypeOps (sql/catalyst)
> +-- PhyTypeOps      - Physical type representation
> +-- LiteralTypeOps  - Literal creation and defaults
> +-- ExternalTypeOps - External <-> internal type conversion
> +-- ProtoTypeOps    - Spark Connect proto serialization
> +-- ClientTypeOps   - JDBC, Arrow, Python, Thrift integration
> TypeApiOps (sql/api)
> +-- FormatTypeOps   - String formatting
> +-- EncodeTypeOps   - Row encoding (AgnosticEncoder) {code}
>  
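> As a rough illustration of one category's shape (signatures hypothetical, 
> not the actual framework API), a leaf trait and its TimeType implementation 
> might look like:
>  
> {code:java}
> import java.time.LocalTime
> import org.apache.spark.sql.types.{DataType, TimeType}
> 
> trait FormatTypeOps {
>   def supports(dt: DataType): Boolean
>   // Render an internal value of this type as a user-facing string.
>   def format(dt: DataType, value: Any): String
> }
> 
> object TimeFormatOps extends FormatTypeOps {
>   override def supports(dt: DataType): Boolean = dt.isInstanceOf[TimeType]
>   override def format(dt: DataType, value: Any): String =
>     // Assumes the internal TIME value is nanoseconds since midnight (Long).
>     LocalTime.ofNanoOfDay(value.asInstanceOf[Long]).toString
> } {code}
>  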
> Existing integration points use a *check-and-delegate* pattern: a feature 
> flag guards dispatch to the framework, while the legacy per-type code paths 
> remain in place as the fallback, keeping the migration reversible:
>  
> {code:java}
> def someOperation(dt: DataType) = dt match {
>   case _ if SQLConf.get.typesFrameworkEnabled && SomeOps.supports(dt) =>
>     SomeOps(dt).someMethod()
>   case DateType => ...  // legacy types unchanged
> } {code}
>  
> TimeType serves as the proof-of-concept implementation. Once the framework is 
> validated, additional types can be implemented or migrated incrementally.
> *Scope:*
>  * In scope: Physical type representation, literal creation, type conversion, 
> string formatting, row encoding, proto serialization, client integration 
> (JDBC, Arrow, Python, Thrift), storage formats, testing infrastructure, 
> documentation
>  * Out of scope: Type-specific expressions and arithmetic, SQL parser 
> changes. The framework provides primitives that expressions use, not the 
> expressions themselves.
> See the attached design document for full details.


