The spec purposely avoids timestamp conversion. Iceberg returns values
exactly as they were passed in from the engine, and it is the engine's
responsibility to do any date/time conversion. I don't think we should
change this and take on that responsibility in Iceberg.
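
For engines that do need to reconcile legacy data before handing it to
Iceberg, the conversion amounts to a rebase on the engine side. A minimal
sketch, assuming the stored day count was produced under the hybrid
Julian/Gregorian calendar (the class and helper names are hypothetical, and
BC-era dates are ignored for brevity):

    import java.time.LocalDate;
    import java.util.Calendar;
    import java.util.GregorianCalendar;
    import java.util.TimeZone;
    import java.util.concurrent.TimeUnit;

    public class RebaseSketch {
        // Reinterpret a days-from-epoch value computed under the hybrid
        // Julian/Gregorian calendar as the same local date in the
        // proleptic Gregorian calendar used by java.time.
        static int rebaseJulianToGregorianDays(int hybridDays) {
            GregorianCalendar cal =
                new GregorianCalendar(TimeZone.getTimeZone("UTC"));
            cal.clear();
            cal.setTimeInMillis(TimeUnit.DAYS.toMillis(hybridDays));
            LocalDate sameLocalDate = LocalDate.of(
                cal.get(Calendar.YEAR),
                cal.get(Calendar.MONTH) + 1,   // Calendar months are 0-based
                cal.get(Calendar.DAY_OF_MONTH));
            return (int) sameLocalDate.toEpochDay();
        }

        public static void main(String[] args) {
            // Dates after the 1582 cutover are unchanged...
            System.out.println(rebaseJulianToGregorianDays(0)); // 0 = 1970-01-01
            // ...while earlier day counts shift by the calendar difference.
        }
    }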

On Thu, Sep 12, 2024 at 12:32 AM Bart Samwel <b...@databricks.com.invalid>
wrote:

> I have some historical context that may or may not be relevant. I still
> remember how we did the transition for Spark. This was ca. 2019, and there
> were still many people mixing Spark 2.x and 3.0. Also, many other systems
> were still on Java 7, whose date/time APIs only support the hybrid
> Julian/Gregorian calendar. As a result, Spark 3.0+ can even still write
> with the Julian calendar, at least when using the Spark-native Parquet
> read and write path.
>
> 1) The Parquet files written by Spark 3.0+ have metadata keys that record
> the Spark version ("org.apache.spark.version") and whether the timestamps
> were written in the Julian (a.k.a. Java 7) calendar
> ("org.apache.spark.legacyDateTime"). There's also
> "org.apache.spark.legacyINT96", which indicates whether INT96 timestamps
> were written with the Julian calendar in the date part.
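>
> A minimal sketch of checking those keys with parquet-mr (the file path is
> a placeholder; IIRC Spark writes the two legacy keys with empty values,
> so presence is what matters):
>
>     import java.io.IOException;
>     import java.util.Map;
>     import org.apache.hadoop.conf.Configuration;
>     import org.apache.hadoop.fs.Path;
>     import org.apache.parquet.hadoop.ParquetFileReader;
>     import org.apache.parquet.hadoop.util.HadoopInputFile;
>
>     public class LegacyCalendarCheck {
>         public static void main(String[] args) throws IOException {
>             try (ParquetFileReader reader = ParquetFileReader.open(
>                     HadoopInputFile.fromPath(new Path("data.parquet"),
>                                              new Configuration()))) {
>                 Map<String, String> kv =
>                     reader.getFooter().getFileMetaData().getKeyValueMetaData();
>                 System.out.println("version:        "
>                     + kv.get("org.apache.spark.version"));
>                 System.out.println("legacyDateTime: "
>                     + kv.containsKey("org.apache.spark.legacyDateTime"));
>                 System.out.println("legacyINT96:    "
>                     + kv.containsKey("org.apache.spark.legacyINT96"));
>             }
>         }
>     }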
>
> 2) Files that don't have a Spark version are interpreted as Julian or
> proleptic Gregorian depending on a config
> "spark.sql.parquet.datetimeRebaseModeInRead" /
> "spark.sql.parquet.int96RebaseModeInRead". (There are similar configs for
> ORC and avro.) This defaults to EXCEPTION, which means "if a date is
> different in the two calendars, fail the write and force the users to
> choose". If it's set to LEGACY, then Spark will actually "rebase" the dates
> at read time because Spark 3.0+ uses the Java 8 proleptic gregorian
> calendar internally.
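>
> A minimal sketch of the read-side configs in Spark's Java API (the path is
> a placeholder):
>
>     import org.apache.spark.sql.Dataset;
>     import org.apache.spark.sql.Row;
>     import org.apache.spark.sql.SparkSession;
>
>     public class ReadRebaseExample {
>         public static void main(String[] args) {
>             SparkSession spark =
>                 SparkSession.builder().master("local").getOrCreate();
>             // EXCEPTION = fail on affected values; CORRECTED = no rebase;
>             // LEGACY = rebase hybrid-calendar values on read.
>             spark.conf().set("spark.sql.parquet.datetimeRebaseModeInRead", "LEGACY");
>             spark.conf().set("spark.sql.parquet.int96RebaseModeInRead", "LEGACY");
>             Dataset<Row> df = spark.read().parquet("/path/to/pre-3.0-files");
>             df.show();
>         }
>     }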
>
> 3) Writing mode is controlled by configs
> "spark.sql.parquet.datetimeRebaseModeInWrite" and
> "spark.sql.parquet.int96RebaseModeInWrite". These were also until recently
> set to EXCEPTION (i.e., force the user to choose when a value is
> encountered for which the two calendars disagree). See
> https://issues.apache.org/jira/browse/SPARK-46440.
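>
> The write-side counterparts, continuing the sketch above (again, the
> output path is a placeholder):
>
>     // Per SPARK-46440 the defaults moved to CORRECTED, i.e. write
>     // proleptic Gregorian values without rebasing; LEGACY opts back in
>     // to the hybrid calendar for old readers.
>     spark.conf().set("spark.sql.parquet.datetimeRebaseModeInWrite", "LEGACY");
>     spark.conf().set("spark.sql.parquet.int96RebaseModeInWrite", "LEGACY");
>     df.write().parquet("/path/to/legacy-calendar-output");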
>
> I'm not sure if any of this matters for Iceberg, though. It may matter if
> any Iceberg implementation writes using the Spark-native Parquet/ORC/Avro
> write path AND the user has configured it to use LEGACY dates. Or are there
> paths where Iceberg can import existing Parquet files? Then you might
> encounter these metadata flags. I'm not sure it's worth complicating the
> spec by supporting this. :)
>
> On Thu, Sep 12, 2024 at 8:03 AM Micah Kornfield <emkornfi...@gmail.com>
> wrote:
>
>> At the moment, the specification is ambiguous on which calendar is used
>> for temporal conversion/writing [1]. Reading the Java code, it appears to
>> use Java's OffsetDateTime, which conforms to ISO8601 [2]. ISO8601 appears
>> to explicitly disallow the Julian calendar (but says the proleptic
>> Gregorian calendar can only be used by mutual consent [3]).
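>>
>> A small JDK-only illustration of the hybrid vs. proleptic difference:
>>
>>     import java.time.LocalDate;
>>     import java.util.GregorianCalendar;
>>
>>     public class CalendarDemo {
>>         public static void main(String[] args) {
>>             // java.time is proleptic Gregorian: 1582-10-10 is valid.
>>             System.out.println(LocalDate.of(1582, 10, 10)); // 1582-10-10
>>             // java.util.GregorianCalendar is the hybrid calendar:
>>             // Oct 5-14, 1582 don't exist, and lenient mode rolls
>>             // past the cutover gap.
>>             GregorianCalendar hybrid = new GregorianCalendar(1582, 9, 10);
>>             System.out.println(hybrid.getTime()); // Oct 20, 1582
>>         }
>>     }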
>>
>> Therefore I'd propose:
>> 1. We make the ISO8601 + proleptic Gregorian + Gregorian calendars
>> explicit in the specification.
>> 2. Mention in an implementation note that data migrated from other
>> systems or data written by older systems might follow the Julian calendar
>> (e.g., it looks like Spark only transitioned in 3.0 [4]).
>>   *  Does anybody know of metadata available for systems to make this
>> determination?
>>   *  Or have a recommendation on how to handle such data?
>>
>> Thoughts?
>>
>> Thanks,
>> Micah
>>
>> [1] This is esoteric, but a few systems use 0001-01-01 as a sentinel value
>> for null, so this does have some wider applicability.
>> [2]
>> https://docs.oracle.com/javase/8/docs/api/java/time/OffsetDateTime.html
>> [3] https://en.wikipedia.org/wiki/ISO_8601#Dates
>> [4] https://issues.apache.org/jira/browse/SPARK-26651
>>
>>
