hi all,

I'm working on type conversions between different systems, and the
details of the time and date data types raised some questions about their
behaviour and its potential impact on interoperability:

*Question 1* (for my own understanding): what purpose does the
millisecond-resolution date64 type serve?
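
If I understand the spec correctly, date32 stores days since the epoch
while date64 stores the same calendar dates as milliseconds, so both
lines below represent 1970-01-02:

>>> import pyarrow as pa
>>> pa.scalar(1, pa.date32())  # days since epoch
<pyarrow.Date32Scalar: datetime.date(1970, 1, 2)>
>>> pa.scalar(86400000, pa.date64())  # same day, as milliseconds
<pyarrow.Date64Scalar: datetime.date(1970, 1, 2)>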

*Question 2*: relates to the definition and implementation of the date64
data type.

The definition of date64 in Schema.fbs[1] is:
*Milliseconds (64 bits) indicating UNIX time elapsed since the epoch (no
leap seconds), where the values are evenly divisible by 86400000*

However, in PyArrow I can create date64 instances from integer input
values that are not evenly divisible by 86400000, and the original input
value is stored unchanged. That seems very counterintuitive and a
potential cause of bugs in low-level transformations and when moving data
between systems with Arrow. Shouldn't (Py)Arrow either reject such input,
or convert it when explicitly asked to?

>>> pa.scalar(86499999, pa.date64())
<pyarrow.Date64Scalar: datetime.date(1970, 1, 2)>
>>> pa.scalar(86499999, pa.date64()).cast(pa.int64())
<pyarrow.Int64Scalar: 86499999>
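
For now I guard against this on my side. A minimal sketch of the check I
had expected, assuming rejection is the intended semantics (to_date64 is
my own hypothetical helper, not a pyarrow API):

>>> def to_date64(ms):  # hypothetical helper, not part of pyarrow
...     if ms % 86400000 != 0:
...         raise ValueError(f"{ms} is not evenly divisible by 86400000")
...     return pa.scalar(ms, pa.date64())
...
>>> to_date64(86400000)
<pyarrow.Date64Scalar: datetime.date(1970, 1, 2)>
>>> to_date64(86499999)
Traceback (most recent call last):
  ...
ValueError: 86499999 is not evenly divisible by 86400000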


*Question 3*: both the time32 and time64 time-of-day types, in either
precision, accept and store integer input that falls outside the 24-hour
window. As with the date64 issue above, this seems like unexpected
behaviour, possibly even impacting interoperability. I expected the
boundaries of these values to be enforced. What is the desired behaviour
from the Arrow specification's perspective? Is it the current behaviour,
or should the input either be rejected or explicitly converted?

See:

>>> pa.scalar(-1, pa.time32('s'))  # expected: exception or warning
<pyarrow.Time32Scalar: datetime.time(23, 59, 59)>
>>> pa.scalar(-1, pa.time32('s')).cast(pa.int32())  # expected: 86399
<pyarrow.Int32Scalar: -1>
>>> pa.scalar(86400, pa.time32('s'))  # expected: exception or warning
<pyarrow.Time32Scalar: datetime.time(0, 0)>
>>> pa.scalar(86400, pa.time32('s')).cast(pa.int32())  # expected: 0
<pyarrow.Int32Scalar: 86400>
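
And if wrapping into the 24-hour window is the intended semantics rather
than rejection, I'd expect normalisation at construction time. A sketch,
assuming wrap-around modulo 86400 is what's wanted (to_time32s is my own
hypothetical helper, not a pyarrow API):

>>> def to_time32s(seconds):  # hypothetical helper: wrap into [0, 86400)
...     return pa.scalar(seconds % 86400, pa.time32('s'))
...
>>> to_time32s(-1).cast(pa.int32())
<pyarrow.Int32Scalar: 86399>
>>> to_time32s(86400).cast(pa.int32())
<pyarrow.Int32Scalar: 0>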


I'm looking for answers to understand the intended behaviour. If
questions 2 and 3 are actually implementation issues, let me know and
I'll raise them on GitHub (or Jira, if that's where they belong).

Thanks,
Marnix van den Broek

Data Engineer at bundlesandbatches.io

[1]
https://github.com/apache/arrow/blob/4591d76fce2846a29dac33bf01e9ba0337b118e9/format/Schema.fbs#L200-L201
