[
https://issues.apache.org/jira/browse/SPARK-53504?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
David Milicevic updated SPARK-53504:
------------------------------------
Description:
*Summary:*
Introduce a Types Framework to centralize scattered type-specific pattern
matching
*Description:*
Adding a new data type to Spark currently requires modifying 50+ files with
scattered type-specific logic expressed through diverse patterns (taking the
TIME type as an example: '_: TimeType', 'TimeNanoVector', '.hasTime()',
'LocalTimeEncoder', 'instanceof TimeType', etc.). There is no compiler
assistance to ensure completeness, and integration points are easy to miss.
This effort introduces a *Types Framework* that centralizes type-specific
infrastructure operations in Ops interface classes. Instead of modifying dozens
of files, a developer implementing a new type creates two Ops classes and
registers them in factory objects. The compiler enforces interface completeness.
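As a rough illustration of that workflow, here is a minimal, self-contained
sketch; the trait members and the TypeOpsFactory object below are placeholder
names, not the framework's actual API:
{code:scala}
// Illustrative toy model of "two Ops classes + factory registration".
// The compiler enforces completeness: an Ops class that omits any member of the
// traits it mixes in fails to compile, instead of missing a runtime integration point.
trait LiteralTypeOps { def getDefaultLiteral: Any }
trait FormatTypeOps { def format(v: Any): String }

// The Ops class a developer writes for the new type (here: TIME).
case class TimeTypeOps() extends LiteralTypeOps with FormatTypeOps {
  override def getDefaultLiteral: Any = 0L // midnight, as nanoseconds of the day
  override def format(v: Any): String = s"TIME '$v'"
}

// Factory object where the Ops class is registered once; call sites look types
// up here instead of hard-coding per-type pattern matches all over the codebase.
object TypeOpsFactory {
  private var registry = Map.empty[String, LiteralTypeOps with FormatTypeOps]
  def register(typeName: String, ops: LiteralTypeOps with FormatTypeOps): Unit =
    registry += typeName -> ops
  def supports(typeName: String): Boolean = registry.contains(typeName)
  def apply(typeName: String): LiteralTypeOps with FormatTypeOps = registry(typeName)

  register("time", TimeTypeOps()) // the single registration step for the new type
}
{code}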
*Concrete example - TimeType:*
TimeType (the proof-of-concept for this framework) has integration points
spread across 50+ files for infrastructure concerns like physical type mapping,
literals, type converters, encoders, formatters, Arrow SerDe, proto conversion,
JDBC, Python, Thrift, and storage formats. With the framework, these are
consolidated into two Ops classes (~240 lines total). A developer adding a new
type with similar complexity would create two analogous files instead of
touching 50+ files.
*Architecture:*
The framework defines a hierarchy of Ops traits that each cover a specific
category of type operations:
{code}
TypeOps (sql/catalyst)
 +-- PhyTypeOps      - Physical type representation
 +-- LiteralTypeOps  - Literal creation and defaults
 +-- ExternalTypeOps - External <-> internal type conversion
 +-- ProtoTypeOps    - Spark Connect proto serialization
 +-- ClientTypeOps   - JDBC, Arrow, Python, Thrift integration

TypeApiOps (sql/api)
 +-- FormatTypeOps   - String formatting
 +-- EncodeTypeOps   - Row encoding (AgnosticEncoder)
{code}
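A rough Scala rendering of this tree might look as follows; the member names and
signatures are assumptions derived from the category descriptions above, not the
actual framework code:
{code:scala}
import org.apache.spark.sql.catalyst.encoders.AgnosticEncoder
import org.apache.spark.sql.catalyst.expressions.Literal
import org.apache.spark.sql.catalyst.types.PhysicalDataType
import org.apache.spark.sql.types.DataType

// sql/api side: usable without depending on Catalyst internals.
trait TypeApiOps { def dataType: DataType }
trait FormatTypeOps extends TypeApiOps { def format(value: Any): String }
trait EncodeTypeOps extends TypeApiOps { def encoder: AgnosticEncoder[_] }

// sql/catalyst side: may reference physical types, literals, converters, etc.
trait TypeOps { def dataType: DataType }
trait PhyTypeOps extends TypeOps { def physicalDataType: PhysicalDataType }
trait LiteralTypeOps extends TypeOps { def getDefaultLiteral: Literal }
trait ExternalTypeOps extends TypeOps {
  def toCatalyst(external: Any): Any
  def toExternal(internal: Any): Any
}
trait ProtoTypeOps extends TypeOps  // Spark Connect proto (de)serialization hooks
trait ClientTypeOps extends TypeOps // JDBC, Arrow, Python, Thrift integration hooks
{code}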
Existing integration points use a *check-and-delegate* pattern guarded by a
feature flag to dispatch to the framework while preserving legacy behavior as a
fallback:
{code:scala}
def someOperation(dt: DataType) = dt match {
  case _ if SQLConf.get.typesFrameworkEnabled && SomeOps.supports(dt) =>
    SomeOps(dt).someMethod()
  case DateType => ... // legacy types unchanged
}
{code}
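To make the snippet above concrete, here is a small self-contained mock of the
dispatch; the conf flag, the SomeOps factory, and the operation itself are
stand-ins for whatever the framework ultimately defines:
{code:scala}
import org.apache.spark.sql.types.{DataType, DateType, TimeType}

// Stand-in for the proposed SQLConf.get.typesFrameworkEnabled flag.
object MockConf { val typesFrameworkEnabled: Boolean = true }

// Stand-in for one Ops category and its factory object.
trait SomeOps { def someMethod(): String }

object SomeOps {
  // In this sketch only TIME has been migrated to the framework.
  def supports(dt: DataType): Boolean = dt.isInstanceOf[TimeType]
  def apply(dt: DataType): SomeOps = new SomeOps {
    def someMethod(): String = s"framework path for $dt"
  }
}

object CheckAndDelegateExample {
  // The framework branch comes first (behind the flag); legacy branches stay untouched.
  def someOperation(dt: DataType): String = dt match {
    case _ if MockConf.typesFrameworkEnabled && SomeOps.supports(dt) =>
      SomeOps(dt).someMethod()
    case DateType => "legacy branch, unchanged"
    case other => throw new UnsupportedOperationException(s"Unsupported type: $other")
  }
}
{code}
When the flag is off, every call site falls through to its existing branches, so
the framework can be rolled out and rolled back without touching legacy behavior.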
TimeType serves as the proof-of-concept implementation. Once the framework is
validated, additional types can be implemented or migrated incrementally.
*Scope:*
* In scope: Physical type representation, literal creation, type conversion,
string formatting, row encoding, proto serialization, client integration (JDBC,
Arrow, Python, Thrift), storage formats, testing infrastructure, documentation
* Out of scope: Type-specific expressions and arithmetic, SQL parser changes.
The framework provides primitives that expressions use, not the expressions
themselves.
See the attached design document for full details.
was:
h3. Issues with adding new Catalyst types:
Based on experience from adding ANSI intervals and the TIME type, and from
migrating TIMESTAMP_LTZ onto the Proleptic Gregorian calendar.
* Public data type classes like DateType and TimestampType contain minimal
information and no implementation, while classes like StructType, DecimalType
and StringType carry many more operations over the types.
* Type operations are spread across the entire codebase, so there is a high
chance of missing a place where a new type must be handled.
* Error prone: mistakes are caught only by tests at runtime, with no help from
the compiler.
h3. Examples of the current implementation:
{code}
find . -name "*.scala" -print0 | xargs -0 grep case | grep '=>' | grep DayTimeIntervalType | grep -v test | wc -l
133
{code}
h3. The goal is to add a set of interfaces and ops objects.
The interfaces define (internal) operations over the Catalyst types, and the ops
objects implement such interfaces. For instance:
{code:scala}
case class TimeTypeOps(t: TimeType)
  extends TypeApiOps
  with EncodeTypeOps
  with FormatTypeOps
  with TypeOps
  with PhyTypeOps
  with LiteralTypeOps {
  ...
}
{code}
where *LiteralTypeOps* is
{code:scala}
trait LiteralTypeOps {
  // Gets a literal with the default value of the type
  def getDefaultLiteral: Literal

  // Gets a Java literal as a string; it can be used in codegen
  def getJavaLiteral(v: Any): String
}
{code}
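For illustration, the LiteralTypeOps part of the TIME ops could look roughly
like this (a sketch only; it assumes TIME values are stored internally as a
long holding nanoseconds since midnight, and the class name is hypothetical):
{code:scala}
import org.apache.spark.sql.catalyst.expressions.Literal
import org.apache.spark.sql.types.TimeType

// Sketch: implementing the LiteralTypeOps trait above for TIME.
case class TimeLiteralOps(t: TimeType) extends LiteralTypeOps {
  // Default literal: midnight, i.e. 0 nanoseconds into the day.
  override def getDefaultLiteral: Literal = Literal(0L, t)

  // Java literal usable in generated code, e.g. "3600000000000L" for 01:00:00.
  override def getJavaLiteral(v: Any): String = s"${v.asInstanceOf[Long]}L"
}
{code}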
> Type framework
> --------------
>
> Key: SPARK-53504
> URL: https://issues.apache.org/jira/browse/SPARK-53504
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 4.2.0
> Reporter: Max Gekk
> Assignee: Max Gekk
> Priority: Major
> Labels: pull-request-available
> Attachments: TYPES_FRAMEWORK_DESIGN_V2.md
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)