jorisvandenbossche commented on pull request #10933:
URL: https://github.com/apache/arrow/pull/10933#issuecomment-901676354


   Personally, I think having a separate (non-cast) kernel to extract those 
components make sense from a user perspective (but can of course share 
implementation), and complementing the other timestamp component extraction 
kernels we already have. 
   As @lidavidm also mentions, this is in general "unsafe" cast, but when you 
explicitly want to extract a time/date component, it is obvious you want this 
and having to allow an unsafe cast feels unnecessarily.
   
   > There's already a cast from timestamp to date32/date64, however, placing 
this implementation there would change semantics a little bit:
   
   I would say that the current casting semantics are wrong and should be 
fixed? (in any case, the "extracted" date is clearly wrong, whether that's seen 
as a consequence of the "unsafe" cast or not is to be discussed I suppose)
   
   There are actually two "unsafe" steps in this conversion it seems (which 
explains the wrong part):
   
   ```python
   arr = pa.array(["1970-01-01 00:00:59.123456789","2000-02-29 
23:23:23.999999999","1899-01-01 00:59:20.001001001"]).cast(pa.timestamp("ns"))
   
   >>> arr.cast(pa.date64())
   ...
   ArrowInvalid: Casting from timestamp[ns] to date64[ms] would lose data: 
59123456789
   ../src/arrow/compute/kernels/scalar_cast_temporal.cc:178  
(ShiftTime<int64_t, int64_t>(ctx, conversion.first, conversion.second, input, 
output))
   
   # that error is actually coming from a conversion to milliseconds
   # (and you don't really care about the part being lost for conversion to 
date ..)
   >>> arr.cast(pa.timestamp("ms"))
   ...
   ArrowInvalid: Casting from timestamp[ns] to timestamp[ms] would lose data: 
59123456789
   
   # when ignoring this lost part in conversion to ms, then casting to date 
gives another error:
   >>> arr.cast(pa.timestamp("ms"), safe=False).cast(pa.date64())
   ...
   ArrowInvalid: Timestamp value had non-zero intraday milliseconds
   ```
   
   And I _suppose_ that when those intraday milliseconds are ignored by doing 
`safe=False`, we do a simple round to get rid of those:
   
   
https://github.com/apache/arrow/blob/d3af6d479bd3f6966b793cc9a1cf5bafe0cd032e/cpp/src/arrow/compute/kernels/scalar_cast_temporal.cc#L198-L202
   
   By subtracting the remainder we basically round towards zero (it seems C++ 
module operator behaves differently as Python when involving negative integers 
`-12 % 10 = -2` in C++ and `-12 % 10 = 8` in Python), which means that for 
negative values we are rounding up, and we should round down instead? (that 
could be seen as a bug fix?)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


Reply via email to