Hi All,

I would like to propose refactoring the internal operations over Catalyst's
data types. In the current implementation, data types are handled in an
ad hoc manner, and the processing logic is dispersed across the entire code
base. There are more than 100 places where every data type is pattern
matched. For example, formatting of type values (converting to strings) is
implemented in the same way in both ToStringBase and toString
(literals.scala). This leads to a few issues:

1. If you change the handling in one place, you might miss other places.
The compiler won't help you in such cases.
2. Adding a new data type has a constant and significant overhead. Based on
our experience of adding new data types: ANSI intervals (
https://issues.apache.org/jira/browse/SPARK-27790) took more than 1.5 years,
TIMESTAMP_NTZ (https://issues.apache.org/jira/browse/SPARK-35662) took more
than 1 year, and TIME (https://issues.apache.org/jira/browse/SPARK-51162) has
not been finished yet, although we have spent more than half a year on it so
far.

I propose defining a set of interfaces and an operation class for every
data type. Each operation class (Ops) should implement the subset of
interfaces that is suitable for its data type.
For example, TimeType will have the companion class TimeTypeOps which
implements the following operations:
- Operations over the underlying physical type
- Literal related operations
- Formatting of type values to strings
- Converting to/from external Java type: java.time.LocalTime in the case of
TimeType
- Hashing data type values
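
To make the list above more concrete, here is a minimal sketch of what a few
of these interfaces and TimeTypeOps might look like. All trait and method
names except TimeTypeOps, FormatTypeOps and toSQLValue are hypothetical and
are not taken from the prototype. The sketch also assumes that a TIME value
is stored physically as a Long holding nanoseconds since midnight.

```scala
import java.time.LocalTime

// Hypothetical root interface for all per-type operation classes.
trait TypeOps

// Formatting of type values to strings.
trait FormatTypeOps extends TypeOps {
  def format(value: Any): String
  def toSQLValue(value: Any): String
}

// Converting to/from the external Java type (hypothetical name).
trait ExternalTypeOps extends TypeOps {
  def toExternal(value: Any): Any
  def fromExternal(value: Any): Any
}

// Hashing data type values (hypothetical name and signature).
trait HashTypeOps extends TypeOps {
  def hash(value: Any, seed: Int): Int
}

// Ops for TimeType. Assumption: the internal value is a Long that holds
// nanoseconds since midnight.
object TimeTypeOps extends FormatTypeOps with ExternalTypeOps with HashTypeOps {
  private def nanos(value: Any): Long = value.asInstanceOf[Long]

  // Uses LocalTime.toString; a real implementation would likely use
  // Spark's own formatter.
  def format(value: Any): String = LocalTime.ofNanoOfDay(nanos(value)).toString

  def toSQLValue(value: Any): String = s"TIME '${format(value)}'"

  def toExternal(value: Any): Any = LocalTime.ofNanoOfDay(nanos(value))

  def fromExternal(value: Any): Any = value.asInstanceOf[LocalTime].toNanoOfDay

  // Placeholder hash for illustration only, not Spark's Murmur3 or XxHash64.
  def hash(value: Any, seed: Int): Int =
    31 * seed + java.lang.Long.hashCode(nanos(value))
}
```

A type that has no meaningful external Java representation would simply not
mix in ExternalTypeOps, so callers can detect the missing capability with a
single pattern match instead of enumerating every data type.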

On the handling side, we won't need to examine every data type. We can
check whether the ops instance of a data type supports the required
interface, and invoke the needed method. For example:
---
  override def sql: String = dataTypeOps match {
    case fops: FormatTypeOps => fops.toSQLValue(value)
    case _ => value.toString
  }
---
Here is the prototype of the proposal:
https://github.com/apache/spark/pull/51467

Your comments and feedback would be greatly appreciated.

Yours faithfully,
Max Gekk
