MaxGekk commented on a change in pull request #25981: [SPARK-28420][SQL] 
Support the `INTERVAL` type in `date_part()`
URL: https://github.com/apache/spark/pull/25981#discussion_r332571726
 
 

 ##########
 File path: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala
 ##########
 @@ -2067,6 +2082,10 @@ object DatePart {
        224
       > SELECT _FUNC_('SECONDS', timestamp'2019-10-01 00:00:01.000001');
        1.000001
+      > SELECT _FUNC_('days', interval 1 year 10 months 5 days);
 
 Review comment:
   > Sometimes the entire interval's length is return in the given unit ...
   
   The entire interval's length is returned only for `EPOCH`; for all other `field` values, just the corresponding component is returned.
   
   > I wonder if there is any other standard to look at.
   
   I like the Java 8 approach, which has two types: `java.time.Duration` and `java.time.Period`. The first stores an interval's duration with nanosecond precision; the second stores the components `years`, `months` and `days`. I would reuse this model and extend the types slightly (see the sketch after this list):
   - A `DURATION` type stores the interval's duration in microseconds. A `long` should be enough to hold the difference between any two supported timestamps.
   - A `PERIOD` type stores `years`, `months`, `days`, `hours`, `minutes`, `seconds`, `milliseconds` and `microseconds`, for example `(-10 years, 5 months, -3 hours, 100 microseconds)`.
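   
   To make the analogy concrete, here is a minimal sketch of the `java.time` model this proposal mirrors. It uses only the plain JDK API; the proposed `DURATION`/`PERIOD` types themselves are just my suggestion, not anything in this PR:
   ```scala
   import java.time.{Duration, Period}

   // Duration: an exact length of time, kept with nanosecond precision.
   // This is what a DURATION type would model in microseconds.
   val d = Duration.ofDays(3).plusHours(4).plusNanos(500)

   // Period: calendar components (years, months, days); not a fixed length,
   // because a month or a year has no fixed number of microseconds.
   val p = Period.of(1, 10, 5) // 1 year, 10 months, 5 days

   // Durations have a total ordering; periods do not.
   assert(Duration.ofDays(30).compareTo(Duration.ofHours(1)) > 0)
   ```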
   
   > I'm mostly concerned with having consistent semantics, whatever they are. 
If it can't be done reasonably, hm, should we implement this?
   
   I don't know. From my point of view, the semantics of each extraction field are well defined. There is a difference between the implementation in this PR and PostgreSQL. As I wrote above, Spark stores an interval in 2 variables, `months` and `microseconds`, but PostgreSQL uses 3: `month`, `day` and `time` (in microseconds). In PostgreSQL, `days` is therefore independent of the other components, and as a consequence `hours` is not limited to `[0, 24)`:
   ```sql
   maxim=# SELECT interval '-100 years 100 days -100 hours'; 
               interval             
   ---------------------------------
    -100 years +100 days -100:00:00
   (1 row)
   ```
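   
   A rough illustration of the two layouts (the case classes, field names and extraction helpers below are hypothetical, only to show where `days` lives in each model):
   ```scala
   // Spark-like layout: days are folded into microseconds, so the 'hours'
   // component extracted from the time part stays within a single day.
   case class SparkLikeInterval(months: Int, microseconds: Long)

   // PostgreSQL-like layout: days are stored separately, so the time part,
   // and hence 'hours', is not bounded by a day (e.g. -100:00:00 above).
   case class PgLikeInterval(months: Int, days: Int, timeMicros: Long)

   val microsPerHour = 60L * 60 * 1000 * 1000

   // Spark-like extraction of hours: bounded to (-24, 24).
   def hoursSparkLike(i: SparkLikeInterval): Long =
     (i.microseconds / microsPerHour) % 24

   // PostgreSQL-like extraction of hours: unbounded.
   def hoursPgLike(i: PgLikeInterval): Long =
     i.timeMicros / microsPerHour
   ```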
   
   > I'm trying to figure out the use case for "a date part of an interval". I 
can only think of cases where the interval should be converted entirely into 
some unit.
   
   I can imagine at least 2 use cases (see the sketch after this list):
   - Adjusting some timestamp/date components. For example, we have `timestamp '2019-10-08 10:11:12.123456'` and want to adjust it by `-1 year 10 hours`; as a result, we get `timestamp '2018-10-08 20:11:12.123456'`. The `PERIOD` type could be useful here.
   - When we need `Ordered` (and comparable) intervals, i.e. when we can calculate the absolute interval duration in some unit. For example, in the query:
   ```sql
   SELECT name, place
   FROM t1, t2
   WHERE t1.sellTimestamp - t2.deliveredTimestamp < interval 1 month 15 days;
   ```
   Here the `DURATION` type should be used: `PERIOD` cannot, because its values cannot be ordered, and Spark's `CalendarIntervalType` cannot be used here either.
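   
   A minimal sketch of both use cases on top of plain `java.time` (the timestamps and the fixed-length stand-in for `interval 1 month 15 days` are made up for illustration):
   ```scala
   import java.time.{Duration, LocalDateTime, Period}

   // Use case 1: adjust timestamp components with a period-like value.
   val ts = LocalDateTime.parse("2019-10-08T10:11:12.123456")
   val adjusted = ts.minus(Period.ofYears(1)).plusHours(10)
   // adjusted == 2018-10-08T20:11:12.123456

   // Use case 2: compare absolute durations; periods are not comparable,
   // so the threshold has to be a fixed-length duration.
   val sell      = LocalDateTime.parse("2019-10-20T12:00:00")
   val delivered = LocalDateTime.parse("2019-09-10T00:00:00")
   val threshold = Duration.ofDays(45) // fixed-length stand-in for "1 month 15 days"
   val matches = Duration.between(delivered, sell).compareTo(threshold) < 0
   ```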
