It seems we all agree that a timestamp type with nanosecond precision is
necessary, and that it needs to be stored in 10 bytes. In addition, the
Spark side will not wait for the corresponding change in Parquet. I will
continue to address the comments on implementation details.
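
As a rough illustration of why 8 bytes are not enough and what the
fallback could look like, here is a minimal Python sketch. The 10-byte
layout shown (a signed 64-bit count of microseconds since the Unix epoch
followed by an unsigned 16-bit nanosecond remainder) is an assumption made
for illustration only, not necessarily the layout in the SPIP, and all
function names are hypothetical.

```python
import struct
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)
INT64_MIN, INT64_MAX = -(2**63), 2**63 - 1


def to_micros(dt):
    """Microseconds between the Unix epoch and a naive datetime."""
    return (dt - EPOCH) // timedelta(microseconds=1)


def encode_10_bytes(epoch_micros, nanos_of_micro):
    """Assumed layout: int64 micros since epoch + uint16 nanos (0..999)."""
    if not 0 <= nanos_of_micro <= 999:
        raise ValueError("nanos_of_micro must be in [0, 999]")
    return struct.pack(">qH", epoch_micros, nanos_of_micro)


def decode_10_bytes(buf):
    """Inverse of encode_10_bytes; returns (epoch_micros, nanos_of_micro)."""
    return struct.unpack(">qH", buf)


def to_int64_nanos(epoch_micros, nanos_of_micro):
    """Convert to a single int64 nanosecond count, or raise if it overflows.

    This mirrors the idea of failing at write time when a value falls
    outside the range an 8-byte nanosecond timestamp can hold.
    """
    total = epoch_micros * 1000 + nanos_of_micro
    if not INT64_MIN <= total <= INT64_MAX:
        raise OverflowError("timestamp outside the int64 nanosecond range")
    return total


# Bounds of an int64 nanosecond timestamp, truncated to microseconds
# because Python's datetime has no nanosecond field.
INT64_NANOS_MIN = EPOCH - timedelta(microseconds=INT64_MAX // 1000)
INT64_NANOS_MAX = EPOCH + timedelta(microseconds=INT64_MAX // 1000)

if __name__ == "__main__":
    print(INT64_NANOS_MIN.isoformat(), INT64_NANOS_MAX.isoformat())
    year_0001 = to_micros(datetime(1, 1, 1))
    print(len(encode_10_bytes(year_0001, 0)))
```

With this layout both ends of the year 0001 to 9999 range fit in 10 bytes,
while an int64 nanosecond count only reaches from 1677-09-21 to
2262-04-11, so to_int64_nanos raises for anything outside that window.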

huaxin gao <huaxin.ga...@gmail.com> wrote on Sun, Mar 30, 2025 at 19:20:

> Thanks all for the discussion!
>
> I agree that we first need to reach a consensus on adding the
> TIMESTAMP(nanosecond) data type to Apache Spark. It's a standard data
> type supported by major databases like Oracle and IBM DB2, making it a
> necessary inclusion in Spark to align with industry practices.
>
> For the storage format, Spark supports the full ANSI SQL range from the
> year 0001 to 9999, which requires us to use a 10-byte format. Currently,
> Parquet/Iceberg nanosecond timestamps are an int64 offset from the Unix
> epoch, which covers only roughly the years 1677 to 2262, so 8 bytes
> suffice there.
>
> While adopting a unified 10-byte format in Parquet/Iceberg is worth
> considering, it may not be essential at this moment. Instead, we can handle
> timestamps that fall outside the Parquet/Iceberg range by throwing an
> exception when they occur. This approach allows us to move forward without
> having to rely on external dependencies.
>
> Thanks,
>
> Huaxin
>
>
> On Fri, Mar 28, 2025 at 2:11 PM Szehon Ho <szehon.apa...@gmail.com> wrote:
>
>> Trying to catch up on this. Serge's suggestion in the doc seems like the
>> best way forward:
>> https://docs.google.com/document/d/1wjFsBdlV2YK75x7UOk2HhDOqWVA0yC7iEiqOMnNnxlA/edit?disco=AAABe5AUnWU
>> Spark would support the full ANSI SQL timestamp range, and Iceberg,
>> Parquet, and other data sources would throw a runtime error when trying
>> to write a value outside their supported range, until we get a wider
>> timestamp type in Parquet (Iceberg's V3 timestamp_ns type is just built
>> on top of that).
>>
>> Thanks,
>> Szehon
>>
>> On Thu, Mar 27, 2025 at 9:45 PM Micah Kornfield <emkornfi...@gmail.com>
>> wrote:
>>
>>>> I think the key issue is the format. The proposed 10-byte format doesn't
>>>> seem like a standard and the one in Iceberg/Parquet does not support the
>>>> required range by ANSI SQL: year 0001 to year 9999. We should address this
>>>> issue first. Note that Parquet has an INT96 timestamp that supports
>>>> nanosecond precision, but it's deprecated. Shall we work with the Parquet
>>>> community to revive it?
>>>
>>>
>>> It would be great to discuss a plan for this in Parquet. This has come
>>> up in passing in some of the recent Parquet syncs. I don't think
>>> resurrecting INT96 is necessarily a great idea, since it is defined in
>>> terms of Julian days [1], and most systems these days are standardizing
>>> on the proleptic Gregorian calendar.
>>>
>>> A fair number of the OSS implementations I've seen that interact with
>>> INT96 convert values assuming all timestamps are after the Unix epoch,
>>> and therefore have errors/idiosyncrasies when translating dates prior to
>>> the Gregorian cutover.
>>>
>>> Cheers,
>>> Micah
>>>
>>> [1] https://github.com/apache/parquet-format/pull/49
>>>
>>> On Thu, Mar 27, 2025 at 7:02 PM Wenchen Fan <cloud0...@gmail.com> wrote:
>>>
>>>> Maybe we should discuss the key issues on the dev list as it's easy to
>>>> lose track of Google Doc comments.
>>>>
>>>> I think all the proposals for adding new data types need to prove that
>>>> the new data type is common/standard in the ecosystem. This means 3 things:
>>>> - it has common/standard semantics. TIMESTAMP with nanosecond precision
>>>> is definitely a standard data type, in both ANSI SQL and mainstream
>>>> databases.
>>>> - it has a common/standard storage format. Parquet/Iceberg supports
>>>> nanosecond timestamps using int64, which is different from what is
>>>> proposed here.
>>>> - it has common/standard processing methods. The Java datetime library
>>>> Spark is using now already supports nanoseconds, so we are fine here.
>>>>
>>>> I think the key issue is the format. The proposed 10-byte format
>>>> doesn't seem like a standard and the one in Iceberg/Parquet does not
>>>> support the required range by ANSI SQL: year 0001 to year 9999. We should
>>>> address this issue first. Note that Parquet has an INT96 timestamp that
>>>> supports nanosecond precision, but it's deprecated. Shall we work with the
>>>> Parquet community to revive it?
>>>>
>>>> On Fri, Mar 28, 2025 at 7:03 AM DB Tsai <dbt...@dbtsai.com> wrote:
>>>>
>>>>> Thanks!!!
>>>>>
>>>>> DB Tsai  |  https://www.dbtsai.com/  |  PGP 42E5B25A8F7A82C1
>>>>>
>>>>> On Mar 27, 2025, at 3:56 PM, Qi Tan <qi.tan.j...@gmail.com> wrote:
>>>>>
>>>>> Thanks DB,
>>>>>
>>>>> I just noticed a few more comments came in after I initiated the vote.
>>>>> I'm going to postpone the voting process and address those outstanding
>>>>> comments.
>>>>>
>>>>> Qi Tan
>>>>>
>>>>> DB Tsai <dbt...@dbtsai.com> wrote on Thu, Mar 27, 2025 at 15:12:
>>>>>
>>>>>> Hello Qi,
>>>>>>
>>>>>> I'm supportive of the NanoSecond Timestamps proposal; however, before
>>>>>> we initiate the vote, there are a few outstanding comments in the SPIP
>>>>>> document that haven't been addressed yet. Since the vote is on the
>>>>>> document itself, could we resolve these items beforehand?
>>>>>>
>>>>>> For example:
>>>>>>
>>>>>>    - The default precision of TimestampNsNTZType is set to 6, which
>>>>>>      overlaps with the existing TimestampNTZ.
>>>>>>    - The specified range exceeds the capacity of an int64, but the
>>>>>>      document doesn't clarify how this type will be represented in
>>>>>>      memory or serialized in data sources.
>>>>>>    - Schema inference details for data sources are missing.
>>>>>>
>>>>>> These points still need discussion.
>>>>>>
>>>>>> I appreciate your efforts in putting the doc together and look
>>>>>> forward to your contribution!
>>>>>>
>>>>>> Thanks,
>>>>>> DB Tsai  |  https://www.dbtsai.com/  |  PGP 42E5B25A8F7A82C1
>>>>>>
>>>>>> On Mar 27, 2025, at 1:24 PM, huaxin gao <huaxin.ga...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>> +1
>>>>>>
>>>>>> On Thu, Mar 27, 2025 at 1:22 PM Qi Tan <qi.tan.j...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi all,
>>>>>>>
>>>>>>> I would like to start a vote on adding support for nanosecond
>>>>>>> timestamps.
>>>>>>>
>>>>>>> *Discussion thread: *
>>>>>>> https://lists.apache.org/thread/y2vzrjl1499j5dvbpg3m81jxdhf4b6of
>>>>>>> *SPIP:*
>>>>>>> https://docs.google.com/document/d/1wjFsBdlV2YK75x7UOk2HhDOqWVA0yC7iEiqOMnNnxlA/edit?usp=sharing
>>>>>>> *JIRA:*  https://issues.apache.org/jira/browse/SPARK-50532
>>>>>>>
>>>>>>> Please vote on the SPIP for the next 72 hours:
>>>>>>>
>>>>>>> [ ] +1: Accept the proposal as an official SPIP
>>>>>>> [ ] +0
>>>>>>> [ ] -1: I don’t think this is a good idea because
>>>>>>>
>>>>>>
>>>>>>
>>>>>
