It seems we all agree that adding a timestamp type with nanosecond precision is necessary, and that we need to store it in 10 bytes. In addition, the Spark side will not wait for Parquet to change. I will continue to address all the comments on implementation details.
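To make the 10-byte point concrete, here is a minimal Python sketch (not Spark code) of why a single int64 nanosecond count cannot cover years 0001 to 9999, and of the "throw on out-of-range write" idea. The layout used here, an 8-byte signed microsecond count since the Unix epoch followed by a 2-byte nanosecond-within-microsecond field, is only an assumption for illustration and may differ from what the SPIP ends up specifying. All function names are hypothetical.

```python
import struct
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
INT64_MIN, INT64_MAX = -(2**63), 2**63 - 1


def int64_nanos_bounds():
    """Earliest and latest instants an int64 nanosecond count can represent.

    Python's datetime only has microsecond resolution, so the bounds are
    rounded down to whole microseconds for display purposes.
    """
    lo = EPOCH + timedelta(microseconds=INT64_MIN // 1000)
    hi = EPOCH + timedelta(microseconds=INT64_MAX // 1000)
    return lo, hi


def encode_10_bytes(ts: datetime, nanos_of_micro: int = 0) -> bytes:
    """Assumed layout: int64 microseconds since epoch + int16 nanos (0..999)."""
    if not 0 <= nanos_of_micro < 1000:
        raise ValueError("nanos_of_micro must be in [0, 999]")
    delta = ts - EPOCH
    micros = (delta.days * 86_400 + delta.seconds) * 1_000_000 + delta.microseconds
    # ">qh" = big-endian, no padding: 8-byte signed + 2-byte signed = 10 bytes.
    return struct.pack(">qh", micros, nanos_of_micro)


def to_int64_nanos(payload: bytes) -> int:
    """Convert to a single int64 nanosecond count, as an 8-byte sink would need.

    Raises OverflowError when the value falls outside the int64 range,
    mirroring the runtime-error approach discussed in this thread.
    """
    micros, nanos = struct.unpack(">qh", payload)
    total = micros * 1000 + nanos
    if not INT64_MIN <= total <= INT64_MAX:
        raise OverflowError("timestamp outside the int64 nanosecond range")
    return total


if __name__ == "__main__":
    lo, hi = int64_nanos_bounds()
    print("int64 nanoseconds cover", lo.isoformat(), "to", hi.isoformat())
    year_one = encode_10_bytes(datetime(1, 1, 1, tzinfo=timezone.utc))
    print("encoded size:", len(year_one), "bytes")
    try:
        to_int64_nanos(year_one)
    except OverflowError as exc:
        print("write rejected:", exc)
```

With microseconds in the 8-byte field, the whole 0001 to 9999 range fits comfortably. It is only the conversion to a single nanosecond count that overflows, which is where an 8-byte data source would have to raise.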

On Sun, Mar 30, 2025 at 19:20, huaxin gao <huaxin.ga...@gmail.com> wrote:

> Thanks all for the discussion!
>
> I agree that we first need to reach a consensus on adding the
> TIMESTAMP(nanosecond) data type to Apache Spark. It's a standard data
> type supported by major databases like Oracle and IBM DB2, making it a
> necessary inclusion in Spark to align with industry practices.
>
> For the storage format, Spark supports the full ANSI SQL range from the
> year 0001 to 9999, which requires us to use a 10-byte format. Currently,
> Parquet/Iceberg timestamps cover only those after the Unix epoch, so 8
> bytes suffice.
>
> While adopting a unified 10-byte format in Parquet/Iceberg is worth
> considering, it may not be essential at this moment. Instead, we can handle
> timestamps that fall outside the Parquet/Iceberg range by throwing an
> exception when they occur. This approach allows us to move forward without
> having to rely on external dependencies.
>
> Thanks,
>
> Huaxin
>
>
> On Fri, Mar 28, 2025 at 2:11 PM Szehon Ho <szehon.apa...@gmail.com> wrote:
>
>> Trying to catch up on this, Serge's suggestion in the doc seems the best
>> way forward,
>> https://docs.google.com/document/d/1wjFsBdlV2YK75x7UOk2HhDOqWVA0yC7iEiqOMnNnxlA/edit?disco=AAABe5AUnWU.
>> Spark would support the full ANSI SQL timestamp range, and Iceberg /
>> Parquet / other data sources will throw a runtime error if they try to
>> write a value outside their supported range, until we get a wider
>> timestamp type in Parquet (Iceberg's V3 timestamp_ns type is just built
>> on top of that).
>>
>> Thanks,
>> Szehon
>>
>> On Thu, Mar 27, 2025 at 9:45 PM Micah Kornfield <emkornfi...@gmail.com>
>> wrote:
>>
>>>> I think the key issue is the format. The proposed 10-byte format doesn't
>>>> seem like a standard and the one in Iceberg/Parquet does not support the
>>>> required range by ANSI SQL: year 0001 to year 9999. We should address this
>>>> issue first.
>>>> Note that Parquet has an INT96 timestamp that supports
>>>> nanosecond precision, but it's deprecated. Shall we work with the Parquet
>>>> community to revive it?
>>>
>>>
>>> It would be great to discuss a plan for this in Parquet. This has come
>>> up in passing in some of the recent Parquet syncs. I don't think
>>> resurrecting int96 is necessarily a great idea since it is defined in terms
>>> of Julian days [1], and most systems these days are standardizing on
>>> proleptic-Gregorian.
>>>
>>> A fair number of the OSS implementations I've seen that interact with
>>> int96 do the conversion assuming all timestamps are post Unix epoch
>>> and therefore have errors/idiosyncrasies when translating dates prior to
>>> the Gregorian cutover.
>>>
>>> Cheers,
>>> Micah
>>>
>>> [1] https://github.com/apache/parquet-format/pull/49
>>>
>>> On Thu, Mar 27, 2025 at 7:02 PM Wenchen Fan <cloud0...@gmail.com> wrote:
>>>
>>>> Maybe we should discuss the key issues on the dev list as it's easy to
>>>> lose track of Google Doc comments.
>>>>
>>>> I think all the proposals for adding new data types need to prove that
>>>> the new data type is common/standard in the ecosystem. This means 3 things:
>>>> - it has common/standard semantics. TIMESTAMP with nanosecond precision
>>>>   is definitely a standard data type, in both ANSI SQL and mainstream
>>>>   databases.
>>>> - it has a common/standard storage format. Parquet/Iceberg supports
>>>>   nanosecond timestamps using int64, which is different from what is
>>>>   proposed here.
>>>> - it has common/standard processing methods. The Java datetime library
>>>>   Spark is using now already supports nanoseconds, so we are fine here.
>>>>
>>>> I think the key issue is the format. The proposed 10-byte format
>>>> doesn't seem like a standard and the one in Iceberg/Parquet does not
>>>> support the required range by ANSI SQL: year 0001 to year 9999. We should
>>>> address this issue first.
>>>> Note that Parquet has an INT96 timestamp that
>>>> supports nanosecond precision, but it's deprecated. Shall we work with the
>>>> Parquet community to revive it?
>>>>
>>>> On Fri, Mar 28, 2025 at 7:03 AM DB Tsai <dbt...@dbtsai.com> wrote:
>>>>
>>>>> Thanks!!!
>>>>>
>>>>> DB Tsai | https://www.dbtsai.com/ | PGP 42E5B25A8F7A82C1
>>>>>
>>>>> On Mar 27, 2025, at 3:56 PM, Qi Tan <qi.tan.j...@gmail.com> wrote:
>>>>>
>>>>> Thanks DB,
>>>>>
>>>>> I just noticed a few more comments came in after I initiated the vote.
>>>>> I'm going to postpone the voting process and address those outstanding
>>>>> comments.
>>>>>
>>>>> Qi Tan
>>>>>
>>>>> On Thu, Mar 27, 2025 at 15:12, DB Tsai <dbt...@dbtsai.com> wrote:
>>>>>
>>>>>> Hello Qi,
>>>>>>
>>>>>> I'm supportive of the NanoSecond Timestamps proposal; however, before
>>>>>> we initiate the vote, there are a few outstanding comments in the SPIP
>>>>>> document that haven't been addressed yet. Since the vote is on the
>>>>>> document itself, could we resolve these items beforehand?
>>>>>>
>>>>>> For example:
>>>>>>
>>>>>> - The default precision of TimestampNsNTZType is set to 6, which
>>>>>>   overlaps with the existing TimestampNTZ.
>>>>>> - The specified range exceeds the capacity of an int64, but the
>>>>>>   document doesn't clarify how this type will be represented in memory
>>>>>>   or serialized in data sources.
>>>>>> - Schema inference details for data sources are missing.
>>>>>>
>>>>>> These points still need discussion.
>>>>>>
>>>>>> I appreciate your efforts in putting the doc together and look
>>>>>> forward to your contribution!
>>>>>>
>>>>>> Thanks,
>>>>>> DB Tsai | https://www.dbtsai.com/ | PGP 42E5B25A8F7A82C1
>>>>>>
>>>>>> On Mar 27, 2025, at 1:24 PM, huaxin gao <huaxin.ga...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>> +1
>>>>>>
>>>>>> On Thu, Mar 27, 2025 at 1:22 PM Qi Tan <qi.tan.j...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi all,
>>>>>>>
>>>>>>> I would like to start a vote on adding support for nanoseconds
>>>>>>> timestamps.
>>>>>>>
>>>>>>> *Discussion thread: *
>>>>>>> https://lists.apache.org/thread/y2vzrjl1499j5dvbpg3m81jxdhf4b6of
>>>>>>> *SPIP:*
>>>>>>> https://docs.google.com/document/d/1wjFsBdlV2YK75x7UOk2HhDOqWVA0yC7iEiqOMnNnxlA/edit?usp=sharing
>>>>>>> *JIRA:* https://issues.apache.org/jira/browse/SPARK-50532
>>>>>>>
>>>>>>> Please vote on the SPIP for the next 72 hours:
>>>>>>>
>>>>>>> [ ] +1: Accept the proposal as an official SPIP
>>>>>>> [ ] +0
>>>>>>> [ ] -1: I don’t think this is a good idea because