Jefffrey opened a new issue, #5288:
URL: https://github.com/apache/arrow-rs/issues/5288

   **Which part is this question about**
   
   Date64 array values.
   
   **Describe your question**
   
   The docs for the Date64 type state:
   
   
https://github.com/apache/arrow-rs/blob/a61e824abdd7b38ea214828480430ff2a13f2ead/arrow-schema/src/datatype.rs#L150-L152
   
   - Mirrored by `Schema.fbs` docs: 
https://github.com/apache/arrow/blob/37a8bf04bc713858a5b247d4424c1e8505e61947/format/Schema.fbs#L245-L253
   
   > Values are evenly divisible by 86400000
   
   This seems to suggest that Date64 should NOT store a time of day, and should 
only represent days since the UNIX epoch, akin to Date32 (but as milliseconds, 
not days).
   
   What is the point of Date64 type, then? It would be the same as Date32 but 
multiplied by 86400000 **assuming it's used according to spec**.
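
To make the relationship concrete, here is a minimal sketch in plain Python. The helper names `date32_to_date64` and `date64_to_date32` are hypothetical and are not part of any Arrow API; the sketch only assumes the divisibility rule quoted above.

```python
MS_PER_DAY = 86_400_000  # milliseconds in one day, the divisor quoted above


def date32_to_date64(days):
    """Convert a Date32-style day count to a spec-conforming Date64 value."""
    return days * MS_PER_DAY


def date64_to_date32(ms):
    """Convert a Date64-style value back to days, rejecting a time-of-day part."""
    if ms % MS_PER_DAY != 0:
        raise ValueError(f"{ms} ms is not a whole number of days")
    return ms // MS_PER_DAY


print(date32_to_date64(2))            # 172800000
print(date64_to_date32(172_800_000))  # 2
```

Under the spec as written, the two representations carry the same information; only the unit differs.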
   
   The bold is important, as there are examples where you can set values that 
are not evenly divisible by the factor, and the printing code even shows the 
time as well:
   
   
https://github.com/apache/arrow-rs/blob/a61e824abdd7b38ea214828480430ff2a13f2ead/arrow-cast/src/pretty.rs#L476-L487
   
   The C++ implementation seems to have a validate function, see 
https://github.com/apache/arrow/pull/12014
   
   But I can still set 'invalid' values via PyArrow as this full validation is 
optional:
   
   ```python
   >>> import pyarrow as pa
   >>> days = pa.array([0, 1, 2], type=pa.date64())
   >>> days
   <pyarrow.lib.Date64Array object at 0x7f6810ecba60>
   [
     1970-01-01,
     1970-01-01,
     1970-01-01
   ]
   >>> days.validate()
   >>> days.validate(full=True)
   Traceback (most recent call last):
     File "<stdin>", line 1, in <module>
     File "pyarrow/array.pxi", line 1630, in pyarrow.lib.Array.validate
     File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
   pyarrow.lib.ArrowInvalid: date64[ms] 1 does not represent a whole number of days
   >>>
   ```
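
The rule that the full validation enforces can be approximated in plain Python. This is only a sketch of the divisibility check, not the actual PyArrow or arrow-rs implementation; `invalid_date64_values` is a hypothetical helper, and treating `None` as a null slot is an assumption made for illustration.

```python
MS_PER_DAY = 86_400_000  # milliseconds in one day


def invalid_date64_values(values):
    """Return (index, value) pairs that are not a whole number of days.

    `values` is an iterable of millisecond counts; None stands in for null.
    """
    return [
        (i, v)
        for i, v in enumerate(values)
        if v is not None and v % MS_PER_DAY != 0
    ]


print(invalid_date64_values([0, 1, 2]))               # [(1, 1), (2, 2)]
print(invalid_date64_values([0, 86_400_000, None]))   # []
```

For the `[0, 1, 2]` array from the session above, the first offending value is `1`, which matches the value named in the error message.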
   
   So I'm wondering: even if we implement some sort of validation for these 
values (there is an old issue about this on the arrow repo: 
https://github.com/apache/arrow/issues/26853), what is the point of having that 
restriction on the Date64 type if the validation is not mandatory?
   
   Do we need to implement this optional validation on arrow-rs too, and also 
fix the print code to not show the time for Date64? Or just embrace that Date64 
will also store time, contrary to the docs (both in arrow-rs and the official 
arrow repo)?
   
   **Additional context**
   
   This might be a wider arrow discussion, I'm not sure if it's been had 
before, feel free to link if so.
   
   Came across this whilst looking into 
https://github.com/apache/arrow-rs/issues/5266
   
   Given a Date64 array, I wasn't sure whether extracting the millisecond part 
should always return 0 (assuming the array contains valid values) or should 
return the actual millisecond part (though that would technically mean the 
value is invalid).
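
The two possible outcomes can be illustrated with a small plain-Python sketch. `millisecond_part` is a hypothetical helper, not the arrow-rs extraction kernel, and it assumes the "return the actual milliseconds" interpretation.

```python
MS_PER_DAY = 86_400_000  # milliseconds in one day


def millisecond_part(ms_since_epoch):
    """Return the sub-second millisecond component of a Date64-style value."""
    # Python's % always yields a non-negative result for a positive modulus,
    # so values before the epoch are handled as well.
    return ms_since_epoch % 1000


# A spec-conforming value (a whole number of days) always yields 0.
print(millisecond_part(3 * MS_PER_DAY))        # 0

# A value carrying a time of day yields its real millisecond component.
print(millisecond_part(3 * MS_PER_DAY + 123))  # 123
```

If every value is a multiple of 86400000, both interpretations agree, so the question only matters for arrays that already violate the spec.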
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
