jorisvandenbossche commented on issue #33962: URL: https://github.com/apache/arrow/issues/33962#issuecomment-1864285994
We might need some more discussion about what we actually want here.

The current PR adds "day", "second", "milli/micro/nanosecond" and "subsecond" kernels. I think this is mostly modelled after the Python `datetime.timedelta` attributes (see also https://pandas.pydata.org/docs/user_guide/timedeltas.html#attributes for some context).

For example, the "second" kernel in the PR would return the seconds component of the duration value, i.e. the number of seconds >= 0 and < 1 day that remains after whole days have been removed. Equivalent Python example:

```python
>>> import datetime
>>> td = datetime.timedelta(days=2, hours=3, seconds=4, milliseconds=5)
>>> td.seconds
10804
# which is 3 hours (3*60*60 seconds) + 4 seconds
>>> 3*3600+4
10804
```

But the reason Python has those attributes is that this is how it is implemented under the hood: it stores separate numbers of days, seconds and microseconds (https://docs.python.org/3/library/datetime.html#timedelta-objects). In Arrow, we simply store a single value (the number of (milli/micro/nano)seconds, depending on the unit), so it doesn't necessarily make sense to copy the interface of Python's `datetime.timedelta` to extract those components (for example, why days and seconds, and not also hours?).

Also note that the Python attributes are plural, in contrast to the names for the timestamp/date/time parts.

Checking some other software for the kinds of operations that are supported for duration types:

- Python's `datetime.timedelta` has an additional method `total_seconds()`, which always returns all seconds as a float (in the example above, `td.total_seconds()` returns 183604.005). This could be useful to add as an easier way to get the duration in seconds, regardless of the unit (you can already achieve this currently by dividing by a duration of 1 second).
- As mentioned earlier in this thread, pandas has an additional `components` attribute that gives you the different components as they would be _displayed_ (i.e. actually split into days/hours/minutes/seconds/milli...).
- The R lubridate package doesn't seem to have specific methods on its duration type for this kind of operation (https://lubridate.tidyverse.org/reference/index.html#durations).
- The Joda-Time Java package has `getStandardDays`/`getStandardHours`/`getStandardMinutes`/`getStandardSeconds` methods (https://www.joda.org/joda-time/key_duration.html, https://www.joda.org/joda-time/apidocs/org/joda/time/Duration.html). But in this case they are not "mutually exclusive", i.e. the seconds still include the days/hours/minutes as well.
- The Rust chrono crate has a Duration type with `num_days`/`num_hours`/`num_minutes`/... methods (https://docs.rs/chrono/latest/chrono/struct.Duration.html), but again they return the total number of days/hours/minutes/seconds/..., and not e.g. the number of hours after the days have already been subtracted (i.e. the number of days is simply "number of seconds / seconds_per_day").
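
To make the two conventions concrete, here is a minimal sketch in plain Python (not the PR's kernels and not an Arrow API). It assumes a non-negative duration stored as a single integer count of milliseconds; the helper names `component_parts` and `total_parts` are hypothetical and only serve the illustration.

```python
import datetime

MS_PER_SECOND = 1000
MS_PER_MINUTE = 60 * MS_PER_SECOND
MS_PER_HOUR = 60 * MS_PER_MINUTE
MS_PER_DAY = 24 * MS_PER_HOUR


def component_parts(total_ms: int) -> dict:
    """timedelta-style split: days, then seconds within the day, then the rest.

    Hypothetical helper; assumes total_ms >= 0.
    """
    days, rem_ms = divmod(total_ms, MS_PER_DAY)
    seconds, ms = divmod(rem_ms, MS_PER_SECOND)
    return {"days": days, "seconds": seconds, "microseconds": ms * 1000}


def total_parts(total_ms: int) -> dict:
    """Joda-Time/chrono-style totals: each field counts the whole duration.

    Hypothetical helper; assumes total_ms >= 0.
    """
    return {
        "days": total_ms // MS_PER_DAY,
        "hours": total_ms // MS_PER_HOUR,
        "minutes": total_ms // MS_PER_MINUTE,
        "seconds": total_ms // MS_PER_SECOND,
    }


td = datetime.timedelta(days=2, hours=3, seconds=4, milliseconds=5)

# The single stored value, as a duration[ms] column would hold it.
total_ms = td // datetime.timedelta(milliseconds=1)

components = component_parts(total_ms)
totals = total_parts(total_ms)

# Dividing by a duration of 1 second gives the same float as total_seconds().
seconds_as_float = td / datetime.timedelta(seconds=1)

print(components)
print(totals)
print(seconds_as_float)
```

With the example above, the component-style "seconds" is 10804 while the total-style "seconds" is 183604, which is exactly the ambiguity a kernel name like "second" would have to resolve.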
