EnricoMi opened a new pull request #27805: Add CalendarIntervals division
URL: https://github.com/apache/spark/pull/27805
 
 
   ### What changes were proposed in this pull request?
   This defines the devision of two `CalendarInterval`s as the ratio between 
the time duration of both. The `CalendarInterval` consists of three time 
components: `months`, `days` and `microseconds`. The division is only defined 
between intervals that have a single non-zero time component, while both 
intervals have the same non-zero time component. Otherwise the division 
expression would be ambiguous.
   
   ### Why are the changes needed?
   Putting interval into relation is a natural expression: 1 day / 1 hour = 
number of hours per day.
   
   This allows to evaluate the magnitude of `CalendarInterval`s in SQL 
expressions:
   
   ```
   Seq((Timestamp.valueOf("2020-02-01 12:00:00"), Timestamp.valueOf("2020-02-01 
13:30:25")))
     .toDF("start", "end")
     .withColumn("interval", $"end" - $"start")
     .withColumn("interval [h]", $"interval" / lit("1 
hour").cast(CalendarIntervalType))
     .withColumn("rate [€/h]", lit(1.45))
     .withColumn("price [€]", $"interval [h]" * $"rate [€/h]")
     .show(false)
   
+-------------------+-------------------+-----------------------------+------------------+----------+------------------+
   |start              |end                |interval                     
|interval [h]      |rate [€/h]|price [€]         |
   
+-------------------+-------------------+-----------------------------+------------------+----------+------------------+
   |2020-02-01 12:00:00|2020-02-01 13:30:25|1 hours 30 minutes 25 
seconds|1.5069444444444444|1.45      |2.1850694444444443|
   
+-------------------+-------------------+-----------------------------+------------------+----------+------------------+
   ```
   
   The currently available approach is
   
   ```
   Seq((Timestamp.valueOf("2020-02-01 12:00:00"), Timestamp.valueOf("2020-02-01 
13:30:25")))
     .toDF("start", "end")
     .withColumn("interval [s]", unix_timestamp($"end") - 
unix_timestamp($"start"))
     .withColumn("interval [h]", $"interval [s]" / 3600)
     .withColumn("rate [€/h]", lit(1.45))
     .withColumn("price [€]", $"interval [h]" * $"rate [€/h]")
     .show(false)
   ```
   
   Going through `unix_timestamp` is a hack and it pollutes the SQL query with 
unrelated semantics (unix timestamp is completely irrelevant for this 
computation). It is merely there because there is currently no way to access 
the length of an `CalendarInterval`. Dividing an interval by another interval 
provides means to measure the length in an arbitrary unit (minutes, hours, 
quarter hours).
   
   ### Does this PR introduce any user-facing change?
   It allows interval division in SQL expressions and adds the 
`DivideIntervals` catalyst expression.
   
   ### How was this patch tested?
   Unit tests in `IntervalExpressionsSuite`.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to