jorisvandenbossche commented on PR #12528: URL: https://github.com/apache/arrow/pull/12528#issuecomment-1131644858
> I really dislike that sortedness of list can get destroyed in calendar-based-origin mode as shown by https://github.com/apache/arrow/pull/12528#issuecomment-1131369183.

Yes, indeed, that's also what I found "off" about it. But I suppose that is a caveat of this kind of rounding? (It is also a good reason we don't do that by default, and it doesn't meet the idempotency criterion either.)

> I don't know much about user expectations beyond precedents by Pandas and lubridate, but we should probably have idempotency, maintain sortedness of "timestamp continuum" in UTC and maintain sortedness in local time. That might constrain the problem enough to only have one solution? :)

That sounds like a good starting point. Considering those 4 example times around a DST jump again:

```
"01:50:00+01:00", "01:59:59+01:00", "03:00:00+02:00", "03:10:00+02:00"
```

I can currently think of two options:

1) Something that gets rounded "into the jump" as a nonexistent time gets moved to the border of the jump. (Start or end of the jump doesn't really matter, as this is the same point in time; in practice it is represented as the end. I also wouldn't do anything different for floor vs ceil.) In that case we get something like:

```
data             -> rounded in local naive time -> with timezone
"01:50:00+01:00" -> "01:52:00"                  -> "01:52:00+01:00"
"01:59:59+01:00" -> "01:52:00"                  -> "01:52:00+01:00"
"03:00:00+02:00" -> "02:56:00"                  -> "03:00:00+02:00"
"03:10:00+02:00" -> "03:12:00"                  -> "03:12:00+02:00"
```

2) Or otherwise, something that gets rounded into a jump as a nonexistent time gets moved to the "closest" rounded value (that would otherwise occur) outside of the jump:

```
data             -> rounded in local naive time -> with timezone
"01:50:00+01:00" -> "01:52:00"                  -> "01:52:00+01:00"
"01:59:59+01:00" -> "01:52:00"                  -> "01:52:00+01:00"
"03:00:00+02:00" -> "02:56:00"                  -> "03:12:00+02:00"   <-- only this one is different
"03:10:00+02:00" -> "03:12:00"                  -> "03:12:00+02:00"
```

In this case floor vs ceil could use the rounded value before vs after the jump.

Both options preserve sortedness and are idempotent, in this example at least. I don't know if we could come up with an example where the value at the jump (in this case "03:00:00") would not round to itself. That is probably possible by playing with the exact multiple, in which case it would be a reason to maybe go for option 2.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
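The two options discussed in the comment can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Arrow's implementation: it works on naive wall-clock times, hard-codes a hypothetical gap from 02:00 to 03:00 on an assumed date, uses a 16-minute unit (the multiple that appears to reproduce the tables in the comment), and reads "closest" in option 2 as closest to the rounded naive value, with ties going to the later value. The names `round_naive`, `option1` and `option2` are made up for this sketch.

```python
from datetime import datetime, timedelta

# Assumed spring-forward gap: local clocks jump from 02:00 straight to 03:00.
GAP_START = datetime(2022, 3, 27, 2, 0)
GAP_END = datetime(2022, 3, 27, 3, 0)
UNIT = timedelta(minutes=16)  # assumed rounding multiple


def round_naive(local, unit=UNIT):
    """Round a naive local time to the nearest multiple of `unit`,
    counted from local midnight (calendar-based origin)."""
    midnight = local.replace(hour=0, minute=0, second=0, microsecond=0)
    steps = round((local - midnight) / unit)
    return midnight + steps * unit


def in_gap(local):
    """True if the naive local time does not exist because of the jump."""
    return GAP_START <= local < GAP_END


def option1(local, unit=UNIT):
    """Option 1: a result inside the gap moves to the border of the jump."""
    rounded = round_naive(local, unit)
    return GAP_END if in_gap(rounded) else rounded


def option2(local, unit=UNIT):
    """Option 2: a result inside the gap moves to the closest rounded
    value outside the gap (ties go to the later value, an assumption)."""
    rounded = round_naive(local, unit)
    if not in_gap(rounded):
        return rounded
    before = rounded
    while in_gap(before):
        before -= unit
    after = rounded
    while in_gap(after):
        after += unit
    return before if rounded - before < after - rounded else after


# The four example times from the comment, as naive local wall-clock times.
SAMPLES = [
    datetime(2022, 3, 27, 1, 50, 0),
    datetime(2022, 3, 27, 1, 59, 59),
    datetime(2022, 3, 27, 3, 0, 0),
    datetime(2022, 3, 27, 3, 10, 0),
]

for t in SAMPLES:
    print(t.time(), "->", option1(t).time(), "|", option2(t).time())
```

For these four inputs both functions return non-decreasing results, and applying either function a second time leaves its output unchanged, which matches the sortedness and idempotency claims for this particular example. Whether that still holds for other multiples is the open question raised in the comment.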
