[ 
https://issues.apache.org/jira/browse/SPARK-53504?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Milicevic updated SPARK-53504:
------------------------------------
    Description: 
*Summary:*

Introduce a Types Framework to centralize scattered type-specific pattern 
matching

*Description:*

Adding a new data type to Spark currently requires modifying 50+ files with 
scattered type-specific logic expressed through diverse patterns (for example, 
for the TIME type: '_: TimeType', 'TimeNanoVector', '.hasTime()', 
'LocalTimeEncoder', 'instanceof TimeType', etc.). There is no compiler 
assistance to ensure completeness, and integration points are easy to miss.

This effort introduces a *Types Framework* that centralizes type-specific 
infrastructure operations in Ops interface classes. Instead of modifying dozens 
of files, a developer implementing a new type creates two Ops classes and 
registers them in factory objects. The compiler enforces interface completeness.
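
For a rough sense of that workflow, a hedged sketch is shown below; the type 
name, trait combinations and comments are illustrative assumptions, not the 
final framework API:

{code:scala}
// sql/catalyst side: infrastructure operations for a hypothetical new type.
case class MyTypeOps(t: MyType) extends TypeOps
    with PhyTypeOps
    with LiteralTypeOps
    with ExternalTypeOps {
  // The compiler rejects this class until every abstract method of the
  // mixed-in traits is implemented, which is what enforces completeness.
  // ... method implementations ...
}

// sql/api side: string formatting and row encoding for the same type.
case class MyTypeApiOps(t: MyType) extends TypeApiOps
    with FormatTypeOps
    with EncodeTypeOps {
  // ... method implementations ...
}

// Both classes are then registered in the framework's factory objects so that
// existing code can discover them (see the dispatch sketch under Architecture).
{code}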

*Concrete example - TimeType:*

TimeType (the proof-of-concept for this framework) has integration points 
spread across 50+ files for infrastructure concerns like physical type mapping, 
literals, type converters, encoders, formatters, Arrow SerDe, proto conversion, 
JDBC, Python, Thrift, and storage formats. With the framework, these are 
consolidated into two Ops classes (~240 lines total). A developer adding a new 
type with similar complexity would create two analogous files instead of 
touching 50+ files.

*Architecture:*

The framework defines a hierarchy of Ops traits that each cover a specific 
category of type operations:

 
{code}
TypeOps (sql/catalyst)
+-- PhyTypeOps      - Physical type representation
+-- LiteralTypeOps  - Literal creation and defaults
+-- ExternalTypeOps - External <-> internal type conversion
+-- ProtoTypeOps    - Spark Connect proto serialization
+-- ClientTypeOps   - JDBC, Arrow, Python, Thrift integration

TypeApiOps (sql/api)
+-- FormatTypeOps   - String formatting
+-- EncodeTypeOps   - Row encoding (AgnosticEncoder) {code}
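
For illustration only, a minimal Scala sketch of what these traits might 
declare follows; apart from the LiteralTypeOps methods, which match the 
original design sketch quoted further below, the member names are assumptions:

{code:scala}
// Illustrative declarations; the real signatures may differ.
trait TypeApiOps                   // root for Ops living in sql/api
trait TypeOps                      // root for Ops living in sql/catalyst

trait FormatTypeOps extends TypeApiOps {
  // Assumed: render an internal value in its user-facing string form.
  def format(value: Any): String
}

trait LiteralTypeOps extends TypeOps {
  // A literal with the default value of the type.
  def getDefaultLiteral: Literal
  // A Java literal as a string, usable in codegen.
  def getJavaLiteral(v: Any): String
}
{code}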
 

Existing integration points use a *check-and-delegate* pattern guarded by a 
feature flag to dispatch to the framework while preserving legacy behavior as 
fallback:

 
{code:scala}
def someOperation(dt: DataType) = dt match {
  case _ if SQLConf.get.typesFrameworkEnabled && SomeOps.supports(dt) =>
    SomeOps(dt).someMethod()
  case DateType => ...  // legacy types unchanged
} {code}
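
The 'SomeOps.supports' / 'SomeOps(dt)' calls above imply a small factory on 
each Ops companion object. A minimal sketch, assuming TimeType is the only 
type registered so far (the names and structure are illustrative):

{code:scala}
object SomeOps {
  // True only for types that have been migrated onto the framework; the
  // feature-flag check stays at the call site.
  def supports(dt: DataType): Boolean = dt match {
    case _: TimeType => true
    case _ => false
  }

  // Resolves the Ops implementation registered for the given type. Callers
  // are expected to check supports(dt) first.
  def apply(dt: DataType): SomeOps = dt match {
    case t: TimeType => TimeTypeOps(t)
  }
}
{code}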
 

TimeType serves as the proof-of-concept implementation. Once the framework is 
validated, additional types can be implemented or migrated incrementally.

*Scope:*
 * In scope: Physical type representation, literal creation, type conversion, 
string formatting, row encoding, proto serialization, client integration (JDBC, 
Arrow, Python, Thrift), storage formats, testing infrastructure, documentation
 * Out of scope: Type-specific expressions and arithmetic, SQL parser changes. 
The framework provides primitives that expressions use, not the expressions 
themselves.

See the attached design document for full details.

  was:
h3. Issues with adding new Catalyst types:
Based on the experience of adding ANSI intervals, the TIME type, and migrating 
TIMESTAMP_LTZ onto the Proleptic Gregorian calendar.

* Public data type classes like DateType and TimestampType contain minimal 
information and no implementation, while classes like StructType, DecimalType 
and StringType carry many more operations over the types.
* Type operations are spread across the entire codebase, so there is a high 
chance of missing the processing of a new type.
* The approach is error-prone since all mistakes are caught only by tests at 
runtime, with no help from the compiler.

h3. Examples of the current implementation:
{code}
find . -name "*.scala" -print0 | xargs -0 grep case | grep '=>' | \
  grep DayTimeIntervalType | grep -v test | wc -l
     133
{code}

h3. The goal is to add a set of interfaces and ops objects.
The interfaces define (internal) operations over the Catalyst types, and the 
ops objects implement those interfaces. For instance:

{code:scala}
case class TimeTypeOps(t: TimeType)
  extends TypeApiOps
  with EncodeTypeOps
  with FormatTypeOps
  with TypeOps
  with PhyTypeOps
  with LiteralTypeOps {
  // ... implementations of the trait methods ...
}
{code}
where *LiteralTypeOps* is

{code:scala}
trait LiteralTypeOps {
  // Gets a literal with the default value of the type
  def getDefaultLiteral: Literal
  // Gets a Java literal as a string; it can be used in codegen
  def getJavaLiteral(v: Any): String
}
{code}
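
For illustration, the LiteralTypeOps part of TimeTypeOps could be implemented 
roughly as below; this is a sketch that assumes TIME values are physically 
stored as nanoseconds of the day in a Long, not the actual implementation:

{code:scala}
// Only the LiteralTypeOps portion is shown; the real class also mixes in the
// other traits listed above.
case class TimeTypeOps(t: TimeType) extends LiteralTypeOps {
  // Default TIME value: midnight, i.e. 0 nanoseconds since the start of day.
  override def getDefaultLiteral: Literal = Literal(0L, t)
  // Emit a Long literal so the value can be inlined into generated Java code.
  override def getJavaLiteral(v: Any): String = s"${v}L"
}
{code}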




> Type framework
> --------------
>
>                 Key: SPARK-53504
>                 URL: https://issues.apache.org/jira/browse/SPARK-53504
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 4.2.0
>            Reporter: Max Gekk
>            Assignee: Max Gekk
>            Priority: Major
>              Labels: pull-request-available
>         Attachments: TYPES_FRAMEWORK_DESIGN_V2.md
>
>


