jorisvandenbossche commented on pull request #10647: URL: https://github.com/apache/arrow/pull/10647#issuecomment-901783928
Sorry for the slow response here, but I think there are still a few behavioural aspects to fix/clarify:

* Related to @westonpace's comment above (https://github.com/apache/arrow/pull/10647#discussion_r668364905), you added a "Z" to the default format. However, this is only correct if you have a UTC timezone, and not for any other timezone. For example:

  ```python
  >>> import pandas as pd
  >>> import pyarrow as pa
  >>> import pyarrow.compute as pc
  >>> ts = pd.to_datetime(["2018-03-10 09:00"]).tz_localize("US/Eastern")
  >>> ts
  DatetimeIndex(['2018-03-10 09:00:00-05:00'], dtype='datetime64[ns, US/Eastern]', freq=None)
  >>> tsa = pa.array(ts)
  >>> tsa
  <pyarrow.lib.TimestampArray object at 0x7f7350b087c0>
  [
    2018-03-10 14:00:00.000000000
  ]
  >>> pc.strftime(tsa)
  <pyarrow.lib.StringArray object at 0x7f7350b74a60>
  [
    "2018-03-10T09:00:00.000000000Z"
  ]
  ```

  So it's correctly showing the timestamp in the timezone's local time, which makes the "Z" indicator for UTC wrong (the correct UTC time is 14:00, not 09:00). I think we should only add the "Z" indicator if the timezone is UTC. I am not fully sure what we should then use as the default format for non-UTC timezones, though: don't show any timezone information, include a numeric offset, or error. That would also mean that the "default" format string would depend on the input type of the data, which might not be easy / desirable.
* I commented about the timezone handling when the initial PR had a keyword for this, but I forgot to reply after you removed that keyword (and support for local timestamps) altogether. But what's the reasoning for disallowing local timestamps without a timezone? I don't think there is any ambiguity in how they would be formatted? (After all, they represent "clock" time, which in the end is kind of a formatted string.)
* There was some discussion above about the behaviour of `%S` (https://github.com/apache/arrow/pull/10647#discussion_r670410876), where `date.h` / C++ handles it differently from Python or R (i.e. we are including the fractional sub-second decimals, and there is no easy way to only show integer seconds apart from casting to `timestamp("s")` first, AFAIK). Since the standards and the language implementations conflict, there is no easy way to solve this. But I think it would be good to at least document this difference (it will be surprising for Python/R users) and how to work around it.
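To make the options for the default format a bit more concrete, here is a minimal stdlib-only sketch of the "`Z` only for UTC, numeric offset otherwise" variant. This is not Arrow code: `format_default` is a hypothetical helper, the fixed `-05:00` offset merely stands in for US/Eastern on that date, and treating any zero offset as UTC is a simplification.

```python
from datetime import datetime, timedelta, timezone


def format_default(dt: datetime) -> str:
    """Hypothetical helper: ISO-like string whose suffix depends on the zone."""
    base = dt.strftime("%Y-%m-%dT%H:%M:%S")
    offset = dt.utcoffset()
    if offset is None:
        # Naive "clock" time: there is no zone information to show.
        return base
    if offset == timedelta(0):
        # Simplification: any zero offset is treated as UTC here.
        return base + "Z"
    # Non-UTC zone: append the numeric offset instead of "Z".
    return base + dt.strftime("%z")


# Fixed offset standing in for US/Eastern on 2018-03-10 (an assumption).
eastern = timezone(timedelta(hours=-5))
local = datetime(2018, 3, 10, 9, 0, tzinfo=eastern)

print(format_default(local))                           # 2018-03-10T09:00:00-0500
print(format_default(local.astimezone(timezone.utc)))  # 2018-03-10T14:00:00Z
print(format_default(local.replace(tzinfo=None)))      # 2018-03-10T09:00:00
```

Note that the suffix here depends on the value's zone, which is exactly the "default format depends on the input type" complication mentioned above.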

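For the documentation of the `%S` difference, this is the behaviour a Python user will probably have in mind, shown with the standard library only (the timestamp value is made up for illustration): `%S` gives integer seconds, and the fractional part has to be requested separately with `%f`.

```python
from datetime import datetime

# Illustrative value with a sub-second component.
ts = datetime(2018, 3, 10, 9, 0, 5, 123456)

# Python: %S is the zero-padded integer seconds only.
print(ts.strftime("%S"))     # 05

# The fraction has to be asked for explicitly via %f (microseconds).
print(ts.strftime("%S.%f"))  # 05.123456
```

With the kernel in this PR, `%S` alone already includes the sub-second decimals, so the docs should point to casting to `timestamp("s")` first as the way to get the integer-seconds output.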