jorisvandenbossche commented on issue #38050:
URL: https://github.com/apache/arrow/issues/38050#issuecomment-1750225910

   Interesting. This is because the `date64` data type stores the data as 
milliseconds under the hood. However, because it is a "date", it should not 
actually take into account any milliseconds that are not an exact multiple of 
a day. 
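
   As a rough illustration of what "ignoring sub-day milliseconds" means for 
the raw integers, here is a plain-Python sketch. The helper name 
`truncate_to_day` is hypothetical, and flooring negative (pre-epoch) values 
toward the earlier day is an assumption, not necessarily what Arrow does:

```python
MS_PER_DAY = 86_400_000  # milliseconds in one day (no leap seconds)


def truncate_to_day(ms: int) -> int:
    """Drop the sub-day part of a date64-style millisecond value.

    Hypothetical helper for illustration; not part of pyarrow.
    """
    # Floor division keeps pre-epoch (negative) values on the earlier day.
    return (ms // MS_PER_DAY) * MS_PER_DAY


raw_values = [1, 2, 3]  # the int64 values shown for date64_array
truncated = [truncate_to_day(v) for v in raw_values]
print(truncated)  # [0, 0, 0]
```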
   
   After roundtripping, the data has become more "correct":
   
   ```
   In [24]: date64_array.cast("int64")
   Out[24]: 
   <pyarrow.lib.Int64Array object at 0x7f67e06dbee0>
   [
     1,
     2,
     3
   ]
   
   In [25]: date64_roundtripped.cast("int64")
   Out[25]: 
   <pyarrow.lib.Int64Array object at 0x7f67e009f040>
   [
     0,
     0,
     0
   ]
   ```
   
   But as long as we allow storing milliseconds that are not a multiple of a 
single day, we should also ignore those sub-day milliseconds in operations 
like equality. For example, `date64_roundtripped == date64_array` should 
evaluate to True, even though the underlying values are not equal.
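
   A minimal sketch of such a day-aware comparison on the raw integers, in 
plain Python rather than as an Arrow kernel. The function `same_date` is a 
hypothetical name used only for this example:

```python
MS_PER_DAY = 86_400_000  # milliseconds in one day


def same_date(a_ms: int, b_ms: int) -> bool:
    """Compare two date64-style values by calendar day only (hypothetical)."""
    return a_ms // MS_PER_DAY == b_ms // MS_PER_DAY


original = [1, 2, 3]      # raw values of date64_array
roundtripped = [0, 0, 0]  # raw values of date64_roundtripped

# Comparing the raw integers reports a difference ...
raw_equal = original == roundtripped
# ... while comparing whole days treats the two arrays as equal.
date_equal = all(same_date(a, b) for a, b in zip(original, roundtripped))

print(raw_equal, date_equal)  # False True
```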
   
   Our format spec says:
   
   ```
   /// Date is either a 32-bit or 64-bit signed integer type representing an
   /// elapsed time since UNIX epoch (1970-01-01), stored in either of two units:
   ///
   /// * Milliseconds (64 bits) indicating UNIX time elapsed since the epoch (no
   ///   leap seconds), where the values are evenly divisible by 86400000
   /// * Days (32 bits) since the UNIX epoch
   ```
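
   Read literally, the millisecond variant only admits values that are evenly 
divisible by 86400000. A small plain-Python check of that condition might look 
like the following; `is_spec_conformant` and `to_days` are hypothetical names 
for this sketch, not Arrow APIs:

```python
MS_PER_DAY = 86_400_000  # the divisor named in the format spec


def is_spec_conformant(ms: int) -> bool:
    """True if a millisecond value is an exact multiple of one day."""
    return ms % MS_PER_DAY == 0


def to_days(ms: int) -> int:
    """Whole days since the UNIX epoch (the 32-bit variant's unit)."""
    return ms // MS_PER_DAY


for value in (0, 1, 86_400_000):
    print(value, is_spec_conformant(value), to_days(value))
```

   Under this reading, the raw values 1, 2 and 3 from the example above are 
not conformant, while the roundtripped zeros are.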
   
   So the question is whether we should always truncate the values when 
creating the array, or instead deal with sub-day milliseconds later on.

