MaxGekk opened a new pull request #26092: [SPARK-29440][SQL] Support java.time.Duration as an external type of CalendarIntervalType
URL: https://github.com/apache/spark/pull/26092

### What changes were proposed in this pull request?

In this PR, I propose to convert values of Catalyst's `CalendarIntervalType` to `java.time.Duration` values whenever such values are needed outside of Spark, for example in UDFs. If an `INTERVAL` value has a non-zero `months` field, it is converted to a number of seconds assuming `2629746` seconds per month. This average number of seconds per month follows from the average year of the Gregorian calendar being `365.2425` days long (see https://en.wikipedia.org/wiki/Gregorian_calendar): `60 * 60 * 24 * 365.2425` = `31556952.0` = `12 * 2629746` (see the sketch at the end of this description).

For example:
```scala
scala> val plusDay = udf((i: java.time.Duration) => i.plusDays(1))
plusDay: org.apache.spark.sql.expressions.UserDefinedFunction = SparkUserDefinedFunction($Lambda$1855/165450258@485996f7,CalendarIntervalType,List(Some(Schema(CalendarIntervalType,true))),None,true,true)

scala> val df = spark.sql("SELECT interval 40 minutes as i")
df: org.apache.spark.sql.DataFrame = [i: interval]

scala> df.show
+-------------------+
|                  i|
+-------------------+
|interval 40 minutes|
+-------------------+

scala> df.select(plusDay('i)).show(false)
+--------------------------+
|UDF(i)                    |
+--------------------------+
|interval 1 days 40 minutes|
+--------------------------+
```

I also added an implicit encoder for `java.time.Duration`, which allows creating Spark DataFrames from external collections:
```scala
scala> Seq(Duration.ofDays(10), Duration.ofHours(10)).toDS.show(false)
+-----------------------+
|value                  |
+-----------------------+
|interval 1 weeks 3 days|
|interval 10 hours      |
+-----------------------+
```

### Why are the changes needed?

This should allow users to:
- Write UDFs over interval inputs
- Use the Java 8 `java.time.Duration` API to manipulate collected values or UDF inputs
- Create DataFrames from collections of `java.time.Duration` values

### Does this PR introduce any user-facing change?

Yes. Currently, `collect()` returns the non-public class `CalendarInterval`:
```scala
scala> spark.sql("select interval 1 week").collect().apply(0).get(0).isInstanceOf[org.apache.spark.unsafe.types.CalendarInterval]
res2: Boolean = true
```
After the changes:
```scala
scala> spark.sql("select interval 1 week").collect().apply(0).get(0).isInstanceOf[Duration]
res8: Boolean = true
```

### How was this patch tested?

- Added new tests to `CatalystTypeConvertersSuite` to check conversion of `CalendarIntervalType` to/from `java.time.Duration`
- By `JavaUDFSuite`/`UDFSuite` to test usage of the `Duration` type in Java and Scala UDFs
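
For reference, below is a minimal, self-contained sketch of the months-to-seconds rule described above. The constant `SECONDS_PER_MONTH` and the helper `intervalToDuration` are illustrative names only, not identifiers from the actual patch:

```scala
import java.time.Duration

// Average seconds per month, assuming a 365.2425-day Gregorian year:
// 60 * 60 * 24 * 365.2425 / 12 = 2629746.
val SECONDS_PER_MONTH = 2629746L

// Illustrative helper: flatten a CalendarInterval-like (months, microseconds)
// pair into a java.time.Duration using the average-month assumption.
def intervalToDuration(months: Int, microseconds: Long): Duration =
  Duration.ofSeconds(months * SECONDS_PER_MONTH).plusNanos(microseconds * 1000L)

// interval 1 month 40 minutes: 2629746 s + 2400 s = 2632146 s = PT731H9M6S
println(intervalToDuration(1, 40L * 60 * 1000 * 1000))
```

This flattening is deliberately lossy with respect to calendar semantics: per the rule above, a month is always treated as a fixed `2629746` seconds.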
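
The reverse direction, used by the encoder when building a Dataset from `Duration` values, could look roughly like the hedged guess below: presumably the whole duration is stored in the interval's microseconds with `months = 0`, which would be consistent with `Duration.ofDays(10)` rendering as `interval 1 weeks 3 days` rather than as months. The helper `durationToMicros` is hypothetical:

```scala
import java.time.Duration

// Hypothetical sketch: map a Duration to whole microseconds (months = 0),
// truncating any sub-microsecond remainder.
def durationToMicros(d: Duration): Long =
  Math.addExact(Math.multiplyExact(d.getSeconds, 1000000L), (d.getNano / 1000).toLong)

// Duration.ofDays(10) = 864000 seconds = 864000000000 microseconds,
// which Spark's interval formatter shows as "interval 1 weeks 3 days".
println(durationToMicros(Duration.ofDays(10)))
```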
