Sounds like a great plan! Thank you.

+1 for the refactoring.

Dongjoon.

On Thu, Sep 11, 2025 at 1:04 PM Max Gekk <max.g...@gmail.com> wrote:

> Hello Dongjoon,
>
> > can we do this migration safely in a step-by-step manner over multiple
> Apache Spark versions without blocking any Apache Spark releases?
>
> Sure, we can start from the TIME type, and refactor the existing pattern
> matches. After that, I would add support for new features of TIME using the
> framework (highly likely we will need to add new interfaces). This is not
> risky since the type hasn't been released yet. After the 4.1.0 release, we
> could refactor some of the existing data types, for example TIMESTAMP and/or
> DATE.
>
> Yours faithfully,
> Max Gekk
>
>
> On Thu, Sep 11, 2025 at 5:01 PM Dongjoon Hyun <dongj...@apache.org> wrote:
>
>> Thank you for sharing the direction, Max.
>>
>> Since this is internal refactoring, can we do this migration safely in a
>> step-by-step manner over multiple Apache Spark versions without blocking
>> any Apache Spark releases?
>>
>> The proposed direction itself looks reasonable and doable to me.
>>
>> Thanks,
>> Dongjoon.
>>
>> On 2025/09/10 13:44:45 "serge rielau.com" wrote:
>> > I think this is a great idea. There is a significant backlog of types
>> which should be added, e.g. TIMESTAMP(9), TIMESTAMP WITH TIME ZONE, TIME
>> WITH TIME ZONE, and some sort of big decimal (like DECFLOAT), to name a few.
>> > Making these more "plug and play" is goodness.
>> >
>> > +1
>> >
>> > On Sep 10, 2025, at 1:22 PM, Max Gekk <max.g...@gmail.com> wrote:
>> >
>> > Hi All,
>> >
>> > I would like to propose refactoring of internal operations over
>> Catalyst's data types. In the current implementation, data types are
>> handled in an ad hoc manner, and processing logic is dispersed across the
>> entire code base. There are more than 100 places where every data type is
>> pattern matched. For example, formatting of type values (converting to
>> strings) is implemented in the same way in ToStringBase and in toString
>> (literals.scala). This leads to a few issues:
>> >
>> > 1. If you change the handling in one place, you might miss other
>> places. The compiler won't help you in such cases.
>> > 2. Adding a new data type has constant and significant overhead. Based
>> on our experience of adding new data types: ANSI intervals (
>> https://issues.apache.org/jira/browse/SPARK-27790) took > 1.5 years,
>> TIMESTAMP_NTZ (https://issues.apache.org/jira/browse/SPARK-35662) took >
>> 1 year, TIME (https://issues.apache.org/jira/browse/SPARK-51162) has not
>> been finished yet, but we have spent more than half a year on it so far.
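>> >
>> > As a simplified sketch of the duplication described above (this is not the
>> > actual Spark code, and the helper names are made up for illustration), two
>> > unrelated places end up repeating essentially the same per-type match:
>> > ---
>> > import java.time.{Instant, LocalDate}
>> > import org.apache.spark.sql.types.{DataType, DateType, TimestampType}
>> >
>> > object FormattingSketch {
>> >   // Catalyst stores DATE as days since the epoch (Int) and TIMESTAMP as
>> >   // microseconds since the epoch (Long).
>> >   def daysToString(days: Int): String = LocalDate.ofEpochDay(days).toString
>> >   def microsToString(micros: Long): String =
>> >     Instant.EPOCH.plusNanos(micros * 1000L).toString
>> >
>> >   // Place 1: formatting values for display (cf. ToStringBase).
>> >   def formatValue(value: Any, dt: DataType): String = dt match {
>> >     case DateType      => daysToString(value.asInstanceOf[Int])
>> >     case TimestampType => microsToString(value.asInstanceOf[Long])
>> >     case _             => value.toString
>> >   }
>> >
>> >   // Place 2: the same match again, this time for literal SQL text
>> >   // (cf. toString in literals.scala).
>> >   def literalSql(value: Any, dt: DataType): String = dt match {
>> >     case DateType      => s"DATE '${daysToString(value.asInstanceOf[Int])}'"
>> >     case TimestampType => s"TIMESTAMP '${microsToString(value.asInstanceOf[Long])}'"
>> >     case _             => value.toString
>> >   }
>> > }
>> > ---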
>> >
>> > I propose to define a set of interfaces and operation classes for
>> every data type. The operation classes (Ops) should implement subsets of
>> interfaces that are suitable for a particular data type.
>> > For example, TimeType will have the companion class TimeTypeOps which
>> implements the following operations:
>> > - Operations over the underlying physical type
>> > - Literal-related operations
>> > - Formatting of type values to strings
>> > - Converting to/from external Java type: java.time.LocalTime in the
>> case of TimeType
>> > - Hashing data type values
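>> >
>> > To make the shape of this concrete, here is a rough sketch of a few such
>> > interfaces and the TimeType Ops class (names and signatures are
>> > illustrative rather than the exact ones from the prototype, and the sketch
>> > assumes TIME values are stored internally as a Long of nanoseconds since
>> > midnight):
>> > ---
>> > import java.time.LocalTime
>> >
>> > // Narrow interfaces; each Ops class mixes in only those that make sense
>> > // for its data type.
>> > trait FormatTypeOps { def toSQLValue(value: Any): String }
>> > trait ExternalTypeOps {
>> >   def toExternal(value: Any): Any
>> >   def fromExternal(obj: Any): Any
>> > }
>> > trait HashTypeOps { def hash(value: Any, seed: Long): Long }
>> >
>> > // Companion Ops for TimeType.
>> > object TimeTypeOps extends FormatTypeOps with ExternalTypeOps {
>> >   override def toSQLValue(value: Any): String =
>> >     s"TIME '${toExternal(value)}'"
>> >   // The external Java type is java.time.LocalTime.
>> >   override def toExternal(value: Any): Any =
>> >     LocalTime.ofNanoOfDay(value.asInstanceOf[Long])
>> >   override def fromExternal(obj: Any): Any =
>> >     obj.asInstanceOf[LocalTime].toNanoOfDay
>> > }
>> > ---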
>> >
>> > On the handling side, we won't need to examine every data type. We can
>> check whether a data type's Ops instance supports a required interface,
>> and invoke the needed method. For example:
>> > ---
>> >   override def sql: String = dataTypeOps match {
>> >     case fops: FormatTypeOps => fops.toSQLValue(value)
>> >     case _ => value.toString
>> >   }
>> > ---
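>> > With this kind of dispatch, sql does not have to enumerate data types at
>> > all, and adding a new type only means adding its Ops object. Continuing
>> > the illustrative sketch above (DECFLOAT is purely hypothetical here):
>> > ---
>> > object DecFloatTypeOps extends FormatTypeOps {
>> >   override def toSQLValue(value: Any): String = s"DECFLOAT '$value'"
>> > }
>> > ---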
>> > Here is the prototype of the proposal:
>> https://github.com/apache/spark/pull/51467
>> >
>> > Your comments and feedback would be greatly appreciated.
>> >
>> > Yours faithfully,
>> > Max Gekk
>> >
>> >
>>
