Enrico Minack created SPARK-31056:
-------------------------------------
Summary: Add CalendarIntervals division
Key: SPARK-31056
URL: https://issues.apache.org/jira/browse/SPARK-31056
Project: Spark
Issue Type: Improvement
Components: SQL
Affects Versions: 3.1.0
Reporter: Enrico Minack
{{CalendarInterval}} should be allowed for division. The {{CalendarInterval}}
consists of three time components: {{months}}, {{days}} and {{microseconds}}.
The division can only be defined between intervals that have a single non-zero
time component, while both intervals have the same non-zero time component.
Otherwise the division expression would be ambiguous.
This allows to evaluate the magnitude of {{CalendarInterval}} in SQL
expressions:
{code}
Seq((Timestamp.valueOf("2020-02-01 12:00:00"), Timestamp.valueOf("2020-02-01
13:30:25")))
.toDF("start", "end")
.withColumn("interval", $"end" - $"start")
.withColumn("interval [h]", $"interval" / lit("1
hour").cast(CalendarIntervalType))
.withColumn("rate [€/h]", lit(1.45))
.withColumn("price [€]", $"interval [h]" * $"rate [€/h]")
.show(false)
+-------------------+-------------------+-----------------------------+------------------+----------+------------------+
|start |end |interval |interval
[h] |rate [€/h]|price [€] |
+-------------------+-------------------+-----------------------------+------------------+----------+------------------+
|2020-02-01 12:00:00|2020-02-01 13:30:25|1 hours 30 minutes 25
seconds|1.5069444444444444|1.45 |2.1850694444444443|
+-------------------+-------------------+-----------------------------+------------------+----------+------------------+
{code}
The currently available approach is
{code}
Seq((Timestamp.valueOf("2020-02-01 12:00:00"), Timestamp.valueOf("2020-02-01
13:30:25")))
.toDF("start", "end")
.withColumn("interval [s]", unix_timestamp($"end") - unix_timestamp($"start"))
.withColumn("interval [h]", $"interval [s]" / 3600)
.withColumn("rate [€/h]", lit(1.45))
.withColumn("price [€]", $"interval [h]" * $"rate [€/h]")
.show(false)
{code}
Going through {{unix_timestamp}} is a hack and it pollutes the SQL query with
unrelated semantics (unix timestamp is completely irrelevant for this
computation). It is merely there because there is currently no way to access
the length of an {{CalendarInterval}}. Dividing an interval by another interval
provides means to measure the length in an arbitrary unit (minutes, hours,
quarter hours).
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]