The spec purposely avoids timestamp conversion. Iceberg returns values as they are passed from the engine, and it is the engine's responsibility to perform any date/time conversion. I don't think we should change this and take on that responsibility in Iceberg.
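To illustrate the division of responsibility, here's a minimal sketch of the engine-side conversion (the helper name is hypothetical, not an Iceberg API): the engine computes microseconds from the Unix epoch on java.time's proleptic Gregorian calendar, and Iceberg stores and returns that long untouched.

import java.time.Instant;
import java.time.OffsetDateTime;
import java.time.ZoneOffset;
import java.time.temporal.ChronoUnit;

public class EngineSideConversion {
  // Hypothetical engine-side helper: convert a wall-clock value into the
  // microseconds-from-epoch long that is handed to Iceberg. All calendar
  // interpretation happens here, in java.time (proleptic Gregorian).
  static long toTimestamptzMicros(OffsetDateTime ts) {
    return ChronoUnit.MICROS.between(Instant.EPOCH, ts.toInstant());
  }

  public static void main(String[] args) {
    OffsetDateTime ts = OffsetDateTime.of(2024, 9, 12, 0, 32, 0, 0, ZoneOffset.UTC);
    System.out.println(toTimestamptzMicros(ts)); // 1726101120000000
  }
}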
On Thu, Sep 12, 2024 at 12:32 AM Bart Samwel <b...@databricks.com.invalid> wrote:

> I have some historical context that may or may not be relevant. I still
> remember how we did the transition for Spark. This was ca. 2019, and there
> were still many people mixing Spark 2.x and 3.0. Also, many other systems
> were still using Java 7, which only supported Julian. As a result, Spark
> 3.0+ can even still write with the Julian calendar, at least if using the
> Spark-native Parquet read and write path.
>
> 1) The Parquet files written by Spark 3.0+ have metadata keys that contain
> a Spark version ("org.apache.spark.version") and whether the timestamps are
> in Julian a.k.a. Java 7 ("org.apache.spark.legacyDateTime"). There's also
> "org.apache.spark.legacyINT96", which is about whether INT96 timestamps
> have been written with the Julian calendar in the date part.
>
> 2) Files that don't have a Spark version are interpreted as Julian or
> proleptic Gregorian depending on a config,
> "spark.sql.parquet.datetimeRebaseModeInRead" /
> "spark.sql.parquet.int96RebaseModeInRead". (There are similar configs for
> ORC and Avro.) This defaults to EXCEPTION, which means "if a date is
> different in the two calendars, fail the read and force the users to
> choose". If it's set to LEGACY, then Spark will actually "rebase" the dates
> at read time, because Spark 3.0+ uses the Java 8 proleptic Gregorian
> calendar internally.
>
> 3) Writing mode is controlled by the configs
> "spark.sql.parquet.datetimeRebaseModeInWrite" and
> "spark.sql.parquet.int96RebaseModeInWrite". These were also until recently
> set to EXCEPTION (i.e., force the user to choose when a value is
> encountered where it matters). See
> https://issues.apache.org/jira/browse/SPARK-46440.
>
> I'm not sure if any of this matters for Iceberg, though. It may matter if
> any Iceberg implementation writes using the Spark-native Parquet/ORC/Avro
> write path AND the user has configured it to use LEGACY dates. Or are there
> paths where Iceberg can convert from Parquet files? Then you might
> encounter these metadata flags. I'm not sure if it's worth complicating the
> spec by supporting this. :)
>
> On Thu, Sep 12, 2024 at 8:03 AM Micah Kornfield <emkornfi...@gmail.com>
> wrote:
>
>> At the moment, the specification is ambiguous on which calendar is used
>> for temporal conversion/writing [1]. Reading the Java code, it appears it
>> is using Java's OffsetDateTime, which conforms to ISO8601 [2]. ISO8601
>> appears to explicitly disallow the Julian calendar (but only says
>> proleptic Gregorian can be used by mutual consent [3]).
>>
>> Therefore I'd propose:
>> 1. We make the ISO8601 + proleptic Gregorian + Gregorian calendars
>> explicit in the specification.
>> 2. Mention in an implementation note that data migrated from other
>> systems or data written by older systems might follow the Julian calendar
>> (e.g., it looks like Spark only transitioned in 3.0 [4]).
>> * Does anybody know of metadata available for systems to make this
>> determination?
>> * Or a recommendation on how to handle these?
>>
>> Thoughts?
>>
>> Thanks,
>> Micah
>>
>> [1] This is esoteric, but a few systems use 0001-01-01 as a sentinel
>> value for null, so it does have some wider applicability
>> [2] https://docs.oracle.com/javase/8/docs/api/java/time/OffsetDateTime.html
>> [3] https://en.wikipedia.org/wiki/ISO_8601#Dates
>> [4] https://issues.apache.org/jira/browse/SPARK-26651
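On Micah's metadata question: the Spark-written keys Bart describes are ordinary entries in the Parquet footer's key/value metadata, so any reader can sniff them with parquet-mr. A rough sketch (treating the legacy keys as presence-based flags, per Bart's description, is an assumption):

import java.io.IOException;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.util.HadoopInputFile;

public class LegacyCalendarCheck {
  public static void main(String[] args) throws IOException {
    Path path = new Path(args[0]);
    try (ParquetFileReader reader =
        ParquetFileReader.open(HadoopInputFile.fromPath(path, new Configuration()))) {
      // Application-level key/value pairs from the Parquet footer.
      Map<String, String> kv =
          reader.getFooter().getFileMetaData().getKeyValueMetaData();
      String sparkVersion = kv.get("org.apache.spark.version");
      boolean legacyDateTime = kv.containsKey("org.apache.spark.legacyDateTime");
      boolean legacyInt96 = kv.containsKey("org.apache.spark.legacyINT96");
      System.out.printf("spark=%s legacyDateTime=%s legacyINT96=%s%n",
          sparkVersion, legacyDateTime, legacyInt96);
    }
  }
}

A file with no "org.apache.spark.version" key may not have been written by Spark at all, so the absence of these keys doesn't prove proleptic Gregorian, which is what the rebase-mode configs above are for.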
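To make the size of the discrepancy concrete: on the JVM, the Java 7-era hybrid calendar (java.util.GregorianCalendar) and the Java 8 proleptic Gregorian calendar (java.time) disagree by five days for dates around year 1000, which is the shift users saw when unrebased legacy data was read back.

import java.time.LocalDate;
import java.util.Calendar;
import java.util.GregorianCalendar;
import java.util.TimeZone;

public class CalendarShift {
  public static void main(String[] args) {
    // Proleptic Gregorian (java.time): epoch day for 1000-01-01.
    long proleptic = LocalDate.of(1000, 1, 1).toEpochDay(); // -354285

    // Hybrid Julian/Gregorian (java.util): the same field values are
    // interpreted on the Julian calendar, because the date precedes the
    // 1582-10-15 cutover.
    GregorianCalendar cal = new GregorianCalendar(TimeZone.getTimeZone("UTC"));
    cal.clear();
    cal.set(1000, Calendar.JANUARY, 1);
    long hybrid = Math.floorDiv(cal.getTimeInMillis(), 86_400_000L); // -354280

    // Five days apart: Julian 1000-01-01 is proleptic Gregorian 1000-01-06.
    System.out.println(proleptic + " vs " + hybrid);
  }
}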