Hi all, Personally, I'm a big fan of the .summary() function to compute statistics of a dataframe. I often use this for debugging pipelines, and check what the impact of the RDD is after changing code.
I've noticed that not all datatypes are in this summary. Currently, there is a list <https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/stat/StatFunctions.scala#L267> of types that allowed to be included in the summary, and I love to extend that list. The first one is the date type. It is important to define this together with the community, and that we get consensus, as this will be part of the public API. Changing this will be costly (or maybe impossible) to do. I've checked what other DBMS'es do with averages out of dates: Postgres Unsupported: postgres@366ecc8a0fb9:/$ psql psql (12.3 (Debian 12.3-1.pgdg100+1)) Type "help" for help. postgres=# SELECT CAST(CAST('2020-01-01' AS DATE) AS decimal); ERROR: cannot cast type date to numeric LINE 1: SELECT CAST(CAST('2020-01-01' AS DATE) AS decimal); ^ postgres=# SELECT CAST(CAST('2020-01-01' AS DATE) AS integer); ERROR: cannot cast type date to integer LINE 1: SELECT CAST(CAST('2020-01-01' AS DATE) AS integer); ^ The way to get the epoch in days is: postgres=# SELECT EXTRACT(DAYS FROM (now() - '1970-01-01')); date_part ----------- 18422 (1 row) MySQL Converts to a YYYYMMDD format: mysql> SELECT CAST(CAST('2020-01-01' AS DATE) AS decimal); +---------------------------------------------+ | CAST(CAST('2020-01-01' AS DATE) AS decimal) | +---------------------------------------------+ | 20200101 | +---------------------------------------------+ 1 row in set (0.00 sec) However, converting to an int, isn't allowed: mysql> SELECT CAST(CAST('2020-01-01' AS DATE) AS int); ERROR 1064 (42000): You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'int)' at line 1 mysql> SELECT CAST(CAST('2020-01-01' AS DATE) AS bigint); ERROR 1064 (42000): You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'bigint)' at line 1 Bigquery Unsupported: [image: image.png] Excel Converts it to the days since epoch. This feels weird, but I can see it, as it is being used as a physical format internally in many data formats. [image: image.png] For me, returning a Date as the output of avg(date) seems like a logical choice. Internally it is handled as dates since epoch, which makes sense as well: *Avro* it is milliseconds since epoch: https://github.com/apache/avro/blob/master/lang/java/avro/src/main/java/org/apache/avro/reflect/DateAsLongEncoding.java *Parquet* it is days since epoch: https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#date *ORC* is based around days since Epoch: https://github.com/apache/orc/blob/master/java/core/src/java/org/threeten/extra/chrono/HybridDate.java Also with this, we keep parity with the Catalyst type :) Any further thoughts on this before moving forward? Kind regards, Fokko