MaxGekk opened a new pull request #26092: [SPARK-29440][SQL] Support java.time.Duration as an external type of CalendarIntervalType
URL: https://github.com/apache/spark/pull/26092
 
 
   ### What changes were proposed in this pull request?
   In the PR, I propose to convert values of the Catalyst type `CalendarIntervalType` to `java.time.Duration` values when such values are needed outside of Spark, for example in a UDF. If an `INTERVAL` value has a non-zero `months` field, it is converted to a number of seconds assuming `2629746` seconds per month. This average number of seconds per month follows from the average Gregorian calendar year being `365.2425` days long (see https://en.wikipedia.org/wiki/Gregorian_calendar): `60 * 60 * 24 * 365.2425` = `31556952.0` seconds per year = `12 * 2629746`.
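   For illustration, a minimal sketch of that conversion in plain Scala (the helper name `intervalToDuration` is hypothetical; the PR's actual conversion lives in Catalyst's converters):
   ```scala
   import java.time.Duration

   // Average seconds per month, derived from the Gregorian year:
   // 60 * 60 * 24 * 365.2425 = 31556952.0 seconds/year = 12 * 2629746
   val SECONDS_PER_MONTH = 2629746L

   // Hypothetical helper: fold an interval's (months, microseconds) pair
   // into a java.time.Duration at the average rate above.
   def intervalToDuration(months: Int, microseconds: Long): Duration =
     Duration.ofSeconds(months * SECONDS_PER_MONTH).plusNanos(microseconds * 1000L)
   ```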
   
   For example:
   ```scala
   scala> val plusDay = udf((i: java.time.Duration) => i.plusDays(1))
   plusDay: org.apache.spark.sql.expressions.UserDefinedFunction = 
SparkUserDefinedFunction($Lambda$1855/165450258@485996f7,CalendarIntervalType,List(Some(Schema(CalendarIntervalType,true))),None,true,true)
   
   scala> val df = spark.sql("SELECT interval 40 minutes as i")
   df: org.apache.spark.sql.DataFrame = [i: interval]
   
   scala> df.show
   +-------------------+
   |                  i|
   +-------------------+
   |interval 40 minutes|
   +-------------------+
   
   scala> df.select(plusDay('i)).show(false)
   +--------------------------+
   |UDF(i)                    |
   +--------------------------+
   |interval 1 days 40 minutes|
   +--------------------------+
   ```
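   The same UDF can also be registered for use from SQL (a usage sketch; the name `plus_day` is illustrative):
   ```scala
   // Register the Scala UDF under a SQL-callable name and invoke it from SQL.
   spark.udf.register("plus_day", plusDay)
   spark.sql("SELECT plus_day(interval 40 minutes)").show(false)
   ```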
   I added an implicit encoder for `java.time.Duration` which allows creating Spark datasets from external collections:
   ```scala
   scala> import java.time.Duration
   import java.time.Duration

   scala> Seq(Duration.ofDays(10), Duration.ofHours(10)).toDS.show(false)
   +-----------------------+
   |value                  |
   +-----------------------+
   |interval 1 weeks 3 days|
   |interval 10 hours      |
   +-----------------------+
   ```
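   Outside the shell (which imports the implicits automatically), the encoder would presumably be brought into scope the usual way; a minimal sketch, assuming the new encoder ships with the other implicits:
   ```scala
   import java.time.Duration
   import org.apache.spark.sql.SparkSession

   val spark = SparkSession.builder().master("local[*]").appName("duration-encoder").getOrCreate()
   import spark.implicits._  // assumed to bring the new Duration encoder into scope

   Seq(Duration.ofDays(10), Duration.ofHours(10)).toDS.show(false)
   ```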
   
   ### Why are the changes needed?
   This should allow users to:
   - Write UDFs over interval inputs
   - Use the Java 8 time API to manipulate `java.time.Duration` values, either collected from a dataframe or inside UDFs (see the sketch after this list)
   - Create dataframes from collections of `java.time.Duration` values.
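
   As a sketch of the second point, collected interval values could be combined with standard `java.time.Duration` arithmetic (assuming these changes are in place, so collected values come back as `Duration`):
   ```scala
   import java.time.Duration

   val total: Duration = spark.sql("SELECT interval 90 minutes AS i")
     .collect()
     .map(_.getAs[Duration](0))  // external values are java.time.Duration
     .reduce(_ plus _)           // plain Java 8 time arithmetic
   // total.toMinutes == 90
   ```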
   
   ### Does this PR introduce any user-facing change?
   Yes. Currently, `collect()` returns instances of the non-public class `CalendarInterval`:
   ```scala
   scala> spark.sql("select interval 1 week").collect().apply(0).get(0).isInstanceOf[org.apache.spark.unsafe.types.CalendarInterval]
   res2: Boolean = true
   ```
   After the changes:
   ```scala
   scala> spark.sql("select interval 1 week").collect().apply(0).get(0).isInstanceOf[Duration]
   res8: Boolean = true
   ```
   ### How was this patch tested?
   - Added new tests to `CatalystTypeConvertersSuite` to check conversion of `CalendarIntervalType` to/from `java.time.Duration`
   - By `JavaUDFSuite`/`UDFSuite` to test usage of the `Duration` type in Java/Scala UDFs.
   
