Sounds like a great plan! Thank you. +1 for the refactoring.
Dongjoon.

On Thu, Sep 11, 2025 at 1:04 PM Max Gekk <max.g...@gmail.com> wrote:

> Hello Dongjoon,
>
> > can we do this migration safely in a step-by-step manner over multiple
> > Apache Spark versions without blocking any Apache Spark releases?
>
> Sure, we can start from the TIME type, and refactor the existing pattern
> matchings. After that I would support new features of TIME using the
> framework (highly likely we will need to add new interfaces). This is not
> risky since the type hasn't been released yet. After the 4.1.0 release, we
> could refactor some of the existing data types, for example TIMESTAMP
> and/or DATE.
>
> Yours faithfully,
> Max Gekk
>
>
> On Thu, Sep 11, 2025 at 5:01 PM Dongjoon Hyun <dongj...@apache.org> wrote:
>
>> Thank you for sharing the direction, Max.
>>
>> Since this is internal refactoring, can we do this migration safely in a
>> step-by-step manner over multiple Apache Spark versions without blocking
>> any Apache Spark releases?
>>
>> The proposed direction itself looks reasonable and doable to me.
>>
>> Thanks,
>> Dongjoon.
>>
>> On 2025/09/10 13:44:45 "serge rielau.com" wrote:
>> > I think this is a great idea. There is a significant backlog of types
>> > which should be added, e.g. TIMESTAMP(9), TIMESTAMP WITH TIME ZONE,
>> > TIME WITH TIMEZONE, and some sort of big decimal, to name a few.
>> > Making these more "plug and play" is goodness.
>> >
>> > +1
>> >
>> > On Sep 10, 2025, at 1:22 PM, Max Gekk <max.g...@gmail.com> wrote:
>> >
>> > Hi All,
>> >
>> > I would like to propose a refactoring of internal operations over
>> > Catalyst's data types. In the current implementation, data types are
>> > handled in an ad hoc manner, and processing logic is dispersed across
>> > the entire code base. There are more than 100 places where every data
>> > type is pattern matched. For example, formatting of type values
>> > (converting to strings) is implemented in the same way in ToStringBase
>> > and in toString (literals.scala). This leads to a few issues:
>> >
>> > 1. If you change the handling in one place, you might miss other
>> > places. The compiler won't help you in such cases.
>> > 2. Adding a new data type has constant and significant overhead. Based
>> > on our experience of adding new data types: ANSI intervals
>> > (https://issues.apache.org/jira/browse/SPARK-27790) took more than 1.5
>> > years, TIMESTAMP_NTZ (https://issues.apache.org/jira/browse/SPARK-35662)
>> > took more than 1 year, and TIME
>> > (https://issues.apache.org/jira/browse/SPARK-51162) has not been
>> > finished yet, but we have spent more than half a year so far.
>> >
>> > I propose to define a set of interfaces, and operation classes for
>> > every data type. The operation classes (Ops) should implement the
>> > subsets of interfaces that are suitable for a particular data type.
>> > For example, TimeType will have the companion class TimeTypeOps which
>> > implements the following operations:
>> > - Operations over the underlying physical type
>> > - Literal-related operations
>> > - Formatting of type values to strings
>> > - Converting to/from the external Java type: java.time.LocalTime in
>> >   the case of TimeType
>> > - Hashing of data type values
>> >
>> > On the handling side, we won't need to examine every data type. We can
>> > check that a data type and its ops instance support a required
>> > interface, and invoke the needed method.
>> > For example:
>> > ---
>> > override def sql: String = dataTypeOps match {
>> >   case fops: FormatTypeOps => fops.toSQLValue(value)
>> >   case _ => value.toString
>> > }
>> > ---
>> >
>> > Here is the prototype of the proposal:
>> > https://github.com/apache/spark/pull/51467
>> >
>> > Your comments and feedback would be greatly appreciated.
>> >
>> > Yours faithfully,
>> > Max Gekk
>>
>> ---------------------------------------------------------------------
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
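For context, a minimal Scala sketch of the Ops layering described in the
thread could look like the following. It reuses only the names mentioned
above (FormatTypeOps, TimeTypeOps, toSQLValue); the ExternalTypeOps trait,
the method signatures, and the assumption that TIME values are physically
stored as nanoseconds-of-day in a Long are illustrative guesses, not the
prototype's actual API.

---
import java.time.LocalTime
import java.time.format.DateTimeFormatter

// Narrow capability interfaces; a type's Ops object mixes in only the
// capabilities that the data type actually supports.
trait FormatTypeOps {
  def toSQLValue(value: Any): String
}

trait ExternalTypeOps {
  def toExternal(internal: Any): Any    // internal physical value -> external Java value
  def fromExternal(external: Any): Any  // external Java value -> internal physical value
}

// Hypothetical companion Ops for TimeType; assumes the physical
// representation is nanoseconds since midnight held in a Long.
object TimeTypeOps extends FormatTypeOps with ExternalTypeOps {
  override def toSQLValue(value: Any): String = {
    val nanos = value.asInstanceOf[Long]
    s"TIME '${LocalTime.ofNanoOfDay(nanos).format(DateTimeFormatter.ISO_LOCAL_TIME)}'"
  }

  override def toExternal(internal: Any): Any =
    LocalTime.ofNanoOfDay(internal.asInstanceOf[Long])

  override def fromExternal(external: Any): Any =
    external.asInstanceOf[LocalTime].toNanoOfDay
}

// Call-site dispatch mirrors the example from the thread: check for the
// capability and fall back to a generic rendering otherwise.
def sqlValue(dataTypeOps: Any, value: Any): String = dataTypeOps match {
  case fops: FormatTypeOps => fops.toSQLValue(value)
  case _                   => value.toString
}
---

Under this layering, adding a new data type means writing one Ops object,
while call sites that only need, say, formatting keep a single capability
check instead of a per-type pattern match.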