jorisvandenbossche commented on PR #12528:
URL: https://github.com/apache/arrow/pull/12528#issuecomment-1131644858

   > I really dislike that sortedness of list can get destroyed in 
calendar-based-origin mode as shown by 
https://github.com/apache/arrow/pull/12528#issuecomment-1131369183.
   
   Yes, indeed, that's also what I found "off" about it. But I suppose that 
is a caveat of this kind of rounding (and a good reason we don't do it by 
default). It also doesn't satisfy the idempotency criterion.
   
   > I don't know much about user expectations beyond precedents by Pandas and 
lubridate, but we should probably have idempotency, maintain sortedness of 
"timestamp continuum" in UTC and maintain sortedness in local time. That might 
constrain the problem enough to only have one solution? :)
   
   That sounds like a good starting point. 
   
   Considering those 4 example times around a DST jump again: 
   
   ```
   "01:50:00+01:00", 
   "01:59:59+01:00", 
   "03:00:00+02:00", 
   "03:10:00+02:00"
   ```
   
   I can currently think of two options. 1) A value that gets rounded "into 
the jump", i.e. to a nonexistent time, is moved to the border of the jump. 
Whether that is the start or the end of the jump doesn't really matter, as both 
are the same point in time; in practice it is represented as the end. I also 
wouldn't do anything different for floor vs ceil. In that case we get something 
like:
   
   ```
   data -> rounded in local naive time -> with timezone
   "01:50:00+01:00" -> "01:52:00" -> "01:52:00+01:00"
   "01:59:59+01:00" -> "01:52:00" -> "01:52:00+01:00"
   "03:00:00+02:00" -> "02:56:00" -> "03:00:00+02:00"
   "03:10:00+02:00" -> "03:12:00" -> "03:12:00+02:00"
   ```
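   
   A minimal sketch of option 1, to make the table above reproducible. It is 
not the Arrow implementation. It assumes a 16-minute multiple (inferred from 
the rounded values in the table), a nonexistent local range of 02:00 to 03:00 
(inferred from the offsets), and round-half-up in naive local time. Wall-clock 
times are modelled as seconds since local midnight, and all names are 
hypothetical:

```python
# Sketch of option 1. Wall-clock times are seconds since local midnight.
UNIT = 16 * 60        # assumed multiple (16 minutes), inferred from the table
GAP_START = 2 * 3600  # assumed: local 02:00 is the first nonexistent time
GAP_END = 3 * 3600    # assumed: local 03:00 is the first valid time again


def parse(hms):
    h, m, s = (int(part) for part in hms.split(":"))
    return h * 3600 + m * 60 + s


def fmt(sec):
    offset = "+01:00" if sec < GAP_START else "+02:00"
    return f"{sec // 3600:02d}:{sec % 3600 // 60:02d}:{sec % 60:02d}{offset}"


def round_option1(sec):
    rounded = (sec + UNIT // 2) // UNIT * UNIT  # round half up, naive local time
    if GAP_START <= rounded < GAP_END:
        return GAP_END  # nonexistent result: snap to the end of the jump
    return rounded


for t in ["01:50:00", "01:59:59", "03:00:00", "03:10:00"]:
    print(t, "->", fmt(round_option1(parse(t))))
```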
   
   Or otherwise 2) a value that gets rounded into the jump, i.e. to a 
nonexistent time, is moved to the "closest" rounded value (that would otherwise 
occur) outside of the jump:
   
   ```
   data -> rounded in local naive time -> with timezone
   "01:50:00+01:00" -> "01:52:00" -> "01:52:00+01:00"
   "01:59:59+01:00" -> "01:52:00" -> "01:52:00+01:00"
   "03:00:00+02:00" -> "02:56:00" -> "03:12:00+02:00"  <-- only this one is different
   "03:10:00+02:00" -> "03:12:00" -> "03:12:00+02:00"
   ```
   
   In this case floor and ceil could use the rounded value before and after 
the jump, respectively.
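   
   Option 2 could be sketched as follows, under the same assumptions as the 
example (16-minute multiple, nonexistent local range 02:00 to 03:00, times as 
seconds since local midnight). "Closest" is taken here to mean closest to the 
naive rounded value in wall-clock terms, which is an assumption but matches the 
table; the names are hypothetical:

```python
UNIT = 16 * 60                            # assumed multiple, in seconds
GAP_START, GAP_END = 2 * 3600, 3 * 3600   # assumed nonexistent local range


def hms(h, m, s=0):
    return h * 3600 + m * 60 + s


def naive_round(sec, mode):
    if mode == "floor":
        return sec // UNIT * UNIT
    if mode == "ceil":
        return -(-sec // UNIT) * UNIT
    return (sec + UNIT // 2) // UNIT * UNIT  # "round": half up


def round_option2(sec, mode="round"):
    rounded = naive_round(sec, mode)
    if not GAP_START <= rounded < GAP_END:
        return rounded
    before = (GAP_START - 1) // UNIT * UNIT  # last multiple before the jump
    after = -(-GAP_END // UNIT) * UNIT       # first multiple at or after its end
    if mode == "floor":
        return before
    if mode == "ceil":
        return after
    # "round": pick the candidate closest in naive local time (ties go to `after`)
    return before if rounded - before < after - rounded else after
```

   With these definitions, `round_option2(hms(3, 0))` gives 03:12, while 
`round_option2(hms(3, 10), "floor")` gives 01:52 and 
`round_option2(hms(1, 59, 59), "ceil")` gives 03:12.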
   
   Both options preserve the sortedness and are idempotent, in this example at 
least. I don't know if we could come up with an example where the value at the 
jump (in this case "03:00:00") would not round to itself. That is probably 
possible by playing with the exact multiple, in which case it would be a reason 
to maybe go for option 2.
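   
   One way to probe this is a brute-force check over every valid wall-clock 
minute of the day. The sketch below models option 1 only, with the same 
assumptions as before (nonexistent local range 02:00 to 03:00, round-half-up in 
naive local time, times as seconds since local midnight); the helper names and 
the 70-minute multiple are chosen purely for illustration:

```python
GAP_START, GAP_END = 2 * 3600, 3 * 3600  # assumed nonexistent local range


def round_option1(sec, unit):
    rounded = (sec + unit // 2) // unit * unit  # round half up, naive local time
    return GAP_END if GAP_START <= rounded < GAP_END else rounded


def valid_local_seconds(step=60):
    # Every wall-clock time of the day that actually exists, at `step` resolution.
    return [s for s in range(0, 24 * 3600, step)
            if not GAP_START <= s < GAP_END]


def is_idempotent(unit):
    return all(
        round_option1(round_option1(s, unit), unit) == round_option1(s, unit)
        for s in valid_local_seconds()
    )


def preserves_order(unit):
    out = [round_option1(s, unit) for s in valid_local_seconds()]
    return out == sorted(out)


for minutes in (16, 70):
    unit = minutes * 60
    print(minutes, is_idempotent(unit), preserves_order(unit))
```

   Under this model, the 16-minute multiple passes both checks, while a 
70-minute multiple keeps the order but is not idempotent: 01:50 rounds to 
02:20, which is moved to 03:00, and 03:00 itself then rounds to 03:30.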

