cloud-fan commented on a change in pull request #32949:
URL: https://github.com/apache/spark/pull/32949#discussion_r659867783
##########
File path: docs/sql-migration-guide.md
##########
@@ -83,7 +83,9 @@ license: |
- In Spark 3.2, the `TRANSFORM` operator can support `ArrayType/MapType/StructType` without Hive SerDe. In this mode, we use `StructsToJson` to convert an `ArrayType/MapType/StructType` column to `STRING` and `JsonToStructs` to parse a `STRING` back to `ArrayType/MapType/StructType`. In Spark 3.1, Spark only supports casting an `ArrayType/MapType/StructType` column to `STRING`, but does not support parsing a `STRING` into `ArrayType/MapType/StructType` output columns.
- - In Spark 3.2, the unit-to-unit interval literals like `INTERVAL '1-1' YEAR TO MONTH` are converted to ANSI interval types: `YearMonthIntervalType` or `DayTimeIntervalType`. In Spark 3.1 and earlier, such interval literals are converted to `CalendarIntervalType`. To restore the behavior before Spark 3.2, you can set `spark.sql.legacy.interval.enabled` to `true`.
+ - In Spark 3.2, the unit-to-unit interval literals like `INTERVAL '1-1' YEAR TO MONTH` and the unit list interval literals like `INTERVAL '3' DAYS '1' HOUR` are converted to ANSI interval types: `YearMonthIntervalType` or `DayTimeIntervalType`. In Spark 3.1 and earlier, such interval literals are converted to `CalendarIntervalType`. To restore the behavior before Spark 3.2, you can set `spark.sql.legacy.interval.enabled` to `true`.
+
+ - In Spark 3.2, the unit list interval literals support two categories of units: year-month units (YEAR and MONTH) and day-time units (DAY, HOUR, MINUTE and SECOND). Units may only be combined within the same category, like `5 YEARS 8 MONTHS` or `1 DAY 2 MINUTE`. In Spark 3.1 and earlier, WEEK, MILLISECOND, MICROSECOND and NANOSECOND are also supported as units, and units from any category can be combined in one literal. To restore the behavior before Spark 3.2, you can set `spark.sql.legacy.interval.enabled` to `true`.
Review comment:
Do we have to forbid `WEEK`, `MILLISECOND`, `MICROSECOND` and `NANOSECOND`? I think the only limitation now is that people can't mix year-month and day-time fields, but I don't see why we need to forbid `WEEK`, `MILLISECOND`, `MICROSECOND` and `NANOSECOND`.
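For reference, a minimal sketch of the restriction being discussed (assuming a Spark 3.2 session named `spark` with the default `spark.sql.legacy.interval.enabled=false`; the comments paraphrase the migration note above, not verbatim behavior):
```scala
// Units from the same category can be combined freely.
spark.sql("SELECT INTERVAL 5 YEARS 8 MONTHS")  // year-month units only: OK
spark.sql("SELECT INTERVAL 1 DAY 2 MINUTES")   // day-time units only: OK

// Mixing categories cannot be represented by a single ANSI interval type,
// since a literal is either YearMonthIntervalType or DayTimeIntervalType.
spark.sql("SELECT INTERVAL 1 YEAR 2 DAYS")     // rejected in Spark 3.2
```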
##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala
##########
@@ -2358,20 +2358,56 @@ class AstBuilder extends SqlBaseBaseVisitor[AnyRef] with SQLConfHelper with Logg
       // `CalendarInterval` doesn't have enough info. For instance, new CalendarInterval(0, 0, 0)
       // can be derived from INTERVAL '0-0' YEAR TO MONTH as well as from
       // INTERVAL '0 00:00:00' DAY TO SECOND.
+      val fromUnit =
+        ctx.errorCapturingUnitToUnitInterval.body.from.getText.toLowerCase(Locale.ROOT)
       val toUnit = ctx.errorCapturingUnitToUnitInterval.body.to.getText.toLowerCase(Locale.ROOT)
       if (toUnit == "month") {
         assert(calendarInterval.days == 0 && calendarInterval.microseconds == 0)
-        // TODO(SPARK-35773): Parse year-month interval literals to tightest types
-        Literal(calendarInterval.months, YearMonthIntervalType())
+        val start = YearMonthIntervalType.stringToField(fromUnit)
+        val end = YearMonthIntervalType.stringToField(toUnit)
+        Literal(calendarInterval.months, YearMonthIntervalType(start, end))
       } else {
         assert(calendarInterval.months == 0)
-        val fromUnit =
-          ctx.errorCapturingUnitToUnitInterval.body.from.getText.toLowerCase(Locale.ROOT)
         val micros = IntervalUtils.getDuration(calendarInterval, TimeUnit.MICROSECONDS)
         val start = DayTimeIntervalType.stringToField(fromUnit)
         val end = DayTimeIntervalType.stringToField(toUnit)
         Literal(micros, DayTimeIntervalType(start, end))
       }
+    } else if (ctx.errorCapturingMultiUnitsInterval != null && !conf.legacyIntervalEnabled) {
+      val units =
+        ctx.errorCapturingMultiUnitsInterval.body.unit.asScala.map(
+          _.getText.toLowerCase(Locale.ROOT).stripSuffix("s"))
+      if (units.forall(YearMonthIntervalType.stringToField.contains)) {
+        val fields = units.map(YearMonthIntervalType.stringToField)
+        val start = if (0 < IntervalUtils.getYears(calendarInterval)) {
Review comment:
what does this mean? why can't we always use `fields.min`?
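For context, the simplification being asked about would look roughly like this (a sketch against the PR's local variables above, not a tested change; the diff itself shows `stringToField` maps unit names to ordered field values, so `min`/`max` give the outermost and innermost listed units):
```scala
// Sketch: derive the bounds from the listed units alone, instead of
// inspecting the CalendarInterval's component values.
val fields = units.map(YearMonthIntervalType.stringToField)
val start = fields.min  // e.g. YEAR for "5 YEARS 8 MONTHS"
val end = fields.max    // e.g. MONTH
Literal(calendarInterval.months, YearMonthIntervalType(start, end))
```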
##########
File path: sql/core/src/test/resources/sql-tests/results/ansi/interval.sql.out
##########
@@ -1215,9 +1251,9 @@ SELECT
   to_csv(named_struct('a', interval 32 month, 'b', interval 70 minute)),
   from_csv(to_csv(named_struct('a', interval 32 month, 'b', interval 70 minute)), 'a interval, b interval')
-- !query schema
-struct<from_csv(1, 1 day):struct<a:int,b:interval>,to_csv(from_csv(1, 1 day)):string,to_csv(named_struct(a, INTERVAL '2 years 8 months', b, INTERVAL '1 hours 10 minutes')):string,from_csv(to_csv(named_struct(a, INTERVAL '2 years 8 months', b, INTERVAL '1 hours 10 minutes'))):struct<a:interval,b:interval>>
+struct<from_csv(1, 1 day):struct<a:int,b:interval>,to_csv(from_csv(1, 1 day)):string,to_csv(named_struct(a, INTERVAL '2-8' YEAR TO MONTH, b, INTERVAL '01:10' HOUR TO MINUTE)):string,from_csv(to_csv(named_struct(a, INTERVAL '2-8' YEAR TO MONTH, b, INTERVAL '01:10' HOUR TO MINUTE))):struct<a:interval,b:interval>>
-- !query output
-{"a":1,"b":1 days}	1,1 days	2 years 8 months,1 hours 10 minutes	{"a":2 years 8 months,"b":1 hours 10 minutes}
+{"a":1,"b":1 days}	1,1 days	32,4200000000	{"a":null,"b":null}
Review comment:
> because the result of named_struct('a', interval 32 month, 'b', interval 70 minute) is 32,4200000000

Is it correct? I thought it should be `interval '32' month`
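To make the question concrete (a sketch; `spark` is an assumed session, and the first commented line is taken from the golden file above):
```scala
spark.sql(
  "SELECT to_csv(named_struct('a', interval 32 month, 'b', interval 70 minute))"
).show(false)
// Golden file in this PR: 32,4200000000
//   (the raw month count and microsecond count of the ANSI interval values)
// Reviewer's expectation: the literal forms, e.g. INTERVAL '32' MONTH for 'a'
```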
##########
File path: sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q12.sf100/explain.txt
##########
@@ -142,15 +142,15 @@ BroadcastExchange (28)
Output [2]: [d_date_sk#13, d_date#24]
Batched: true
Location [not included in comparison]/{warehouse_dir}/date_dim]
-PushedFilters: [IsNotNull(d_date), GreaterThanOrEqual(d_date,1999-02-22), LessThanOrEqual(d_date,1999-03-24), IsNotNull(d_date_sk)]
+PushedFilters: [IsNotNull(d_date), GreaterThanOrEqual(d_date,1999-02-22), IsNotNull(d_date_sk)]
Review comment:
do we break filter pushdown in this PR?
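For context, this plan presumably comes from the q12 date-range predicate, roughly (a hypothetical reconstruction, assuming the usual TPC-DS `date_dim` schema):
```scala
// If the upper-bound expression no longer resolves to DateType, the
// LessThanOrEqual on d_date can no longer be pushed down as a date comparison.
spark.sql("""
  SELECT d_date_sk, d_date FROM date_dim
  WHERE d_date BETWEEN cast('1999-02-22' AS DATE)
                   AND cast('1999-02-22' AS DATE) + INTERVAL 30 DAYS
""")
```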
##########
File path: sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q12.sf100/explain.txt
##########
@@ -142,15 +142,15 @@ BroadcastExchange (28)
Output [2]: [d_date_sk#13, d_date#24]
Batched: true
Location [not included in comparison]/{warehouse_dir}/date_dim]
-PushedFilters: [IsNotNull(d_date), GreaterThanOrEqual(d_date,1999-02-22), LessThanOrEqual(d_date,1999-03-24), IsNotNull(d_date_sk)]
+PushedFilters: [IsNotNull(d_date), GreaterThanOrEqual(d_date,1999-02-22), IsNotNull(d_date_sk)]
Review comment:
> cast('1999-02-22' AS DATE) + INTERVAL 30 days is TimestampType

I checked the corresponding code in `ResolveBinaryArithmetic`:
```
case (DateType, DayTimeIntervalType(DAY, DAY)) => DateAdd(l, ExtractANSIIntervalDays(r))
```
`DateAdd` returns a date, so I think we are fine now?
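A quick way to verify (a sketch; `spark` is an assumed session, and the expected schema follows from the rule quoted above):
```scala
// Per the ResolveBinaryArithmetic case above, date + a DAY-to-DAY ANSI
// interval resolves to DateAdd, which returns DateType.
val df = spark.sql("SELECT cast('1999-02-22' AS DATE) + INTERVAL 30 DAYS AS d")
df.printSchema()
// expected: d: date (so the filter stays a date comparison and can be pushed down)
```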