Hi all,

Personally, I'm a big fan of the .summary() function to compute statistics
of a dataframe. I often use this for debugging pipelines, and check what
the impact of the RDD is after changing code.

I've noticed that not all datatypes are in this summary. Currently, there
is a list
<https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/stat/StatFunctions.scala#L267>
of types that allowed to be included in the summary, and I love to extend
that list.

The first one is the date type. It is important to define this together
with the community, and that we get consensus, as this will be part of the
public API. Changing this will be costly (or maybe impossible) to do.

I've checked what other DBMS'es do with averages out of dates:

Postgres

Unsupported:

postgres@366ecc8a0fb9:/$ psql
psql (12.3 (Debian 12.3-1.pgdg100+1))
Type "help" for help.

postgres=# SELECT CAST(CAST('2020-01-01' AS DATE) AS decimal);
ERROR:  cannot cast type date to numeric
LINE 1: SELECT CAST(CAST('2020-01-01' AS DATE) AS decimal);
               ^

postgres=# SELECT CAST(CAST('2020-01-01' AS DATE) AS integer);
ERROR:  cannot cast type date to integer
LINE 1: SELECT CAST(CAST('2020-01-01' AS DATE) AS integer);
               ^

The way to get the epoch in days is:

postgres=# SELECT EXTRACT(DAYS FROM (now() - '1970-01-01'));
date_part
-----------
    18422
(1 row)


MySQL

Converts to a YYYYMMDD format:

mysql> SELECT CAST(CAST('2020-01-01' AS DATE) AS decimal);
+---------------------------------------------+
| CAST(CAST('2020-01-01' AS DATE) AS decimal) |
+---------------------------------------------+
|                                    20200101 |
+---------------------------------------------+
1 row in set (0.00 sec)

However, converting to an int, isn't allowed:

mysql> SELECT CAST(CAST('2020-01-01' AS DATE) AS int);
ERROR 1064 (42000): You have an error in your SQL syntax; check the
manual that corresponds to your MySQL server version for the right
syntax to use near 'int)' at line 1

mysql> SELECT CAST(CAST('2020-01-01' AS DATE) AS bigint);
ERROR 1064 (42000): You have an error in your SQL syntax; check the
manual that corresponds to your MySQL server version for the right
syntax to use near 'bigint)' at line 1

Bigquery

Unsupported:

[image: image.png]

Excel

Converts it to the days since epoch. This feels weird, but I can see it, as
it is being used as a physical format internally in many data formats.

[image: image.png]

For me, returning a Date as the output of avg(date) seems like a logical
choice. Internally it is handled as dates since epoch, which makes sense as
well:

*Avro* it is milliseconds since epoch:
https://github.com/apache/avro/blob/master/lang/java/avro/src/main/java/org/apache/avro/reflect/DateAsLongEncoding.java

*Parquet* it is days since epoch:
https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#date

*ORC* is based around days since Epoch:
https://github.com/apache/orc/blob/master/java/core/src/java/org/threeten/extra/chrono/HybridDate.java

Also with this, we keep parity with the Catalyst type :)

Any further thoughts on this before moving forward?

Kind regards, Fokko

Reply via email to