[
https://issues.apache.org/jira/browse/ARROW-16022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17518652#comment-17518652
]
Joris Van den Bossche commented on ARROW-16022:
-----------------------------------------------
One more thing worth pointing out: be careful with how you use {{pytz}} (some
call it "broken"). Your initial example might not be doing what you expect:
{code:python}
>>> import datetime, pytz, pyarrow
>>> t = pyarrow.timestamp('s', tz='America/New_York')
>>> # pytz timezone passed directly as tzinfo (the problematic pattern)
>>> dt = datetime.datetime(2013, 11, 3, 1, 3, 14,
...                        tzinfo=pytz.timezone('America/New_York'))
>>> za = pyarrow.array([dt], t)
>>> print(dt)
2013-11-03 01:03:14-04:56
>>> za
<pyarrow.lib.TimestampArray object at 0x7fa58ecf0fa0>
[
  2013-11-03 05:59:14
]
{code}
Note the strange "-04:56" offset when printing (we would expect either
"-04:00" or "-05:00"), and the correspondingly strange UTC value after
conversion to a pyarrow array (a time of "05:59", instead of "05:03" or "06:03").
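To see where the "05:59" value comes from, the arithmetic can be reproduced with the standard library alone. This is only a sketch: the fixed -04:56 offset is typed in by hand from the printed output above, not obtained from pytz.

```python
import datetime

# Fixed offset copied from the printed value above (-04:56); an assumption
# made for illustration, not something read out of pytz.
odd_offset = datetime.timezone(datetime.timedelta(hours=-4, minutes=-56))

dt = datetime.datetime(2013, 11, 3, 1, 3, 14, tzinfo=odd_offset)
as_utc = dt.astimezone(datetime.timezone.utc)

print(dt)      # 2013-11-03 01:03:14-04:56
print(as_utc)  # 2013-11-03 05:59:14+00:00
```

Adding 4 hours 56 minutes to 01:03:14 gives 05:59:14, which matches the value stored in the array.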
This is because the {{dt}} value was created "incorrectly" for how pytz works
(note that your code above works fine when using zoneinfo timezones). In short,
a pytz timezone passed directly as {{tzinfo}} ends up with the zone's historical
local mean time offset rather than the offset in effect on that date. See
https://bugs.launchpad.net/pytz/+bug/1746179 and
https://blog.ganssle.io/articles/2018/03/pytz-fastest-footgun.html for a more
detailed explanation.
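For comparison, here is a minimal sketch of the same wall-clock time built with a zoneinfo timezone (standard library, Python 3.9+, assuming the system time zone database is available). The variable names are illustrative only. Because 01:03:14 occurs twice on that date, the {{fold}} attribute selects which occurrence is meant:

```python
import datetime
from zoneinfo import ZoneInfo  # standard library since Python 3.9

tz = ZoneInfo("America/New_York")

# 01:03:14 occurs twice on 2013-11-03 (clocks go back from 02:00 to 01:00).
# fold=0 (the default) is the first occurrence, fold=1 the second.
first = datetime.datetime(2013, 11, 3, 1, 3, 14, tzinfo=tz)
second = datetime.datetime(2013, 11, 3, 1, 3, 14, fold=1, tzinfo=tz)

print(first)   # 2013-11-03 01:03:14-04:00
print(second)  # 2013-11-03 01:03:14-05:00
print(second.astimezone(datetime.timezone.utc))  # 2013-11-03 06:03:14+00:00
```

Both results carry one of the two expected offsets, with no "-04:56" value appearing.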
The "correct" way to do this with the pytz library is to use {{localize}}
(needing this is one reason many people recommend moving away from pytz):
{code:python}
>>> dt = pytz.timezone('America/New_York').localize(
...     datetime.datetime(2013, 11, 3, 1, 3, 14))
>>> print(dt)
2013-11-03 01:03:14-05:00
>>> pyarrow.array([dt])
<pyarrow.lib.TimestampArray object at 0x7fa58edb4340>
[
  2013-11-03 06:03:14.000000
]
{code}
> [C++] Temporal floor/ceil/round throws exception for timestamps ambiguous due
> to DST
> ------------------------------------------------------------------------------------
>
> Key: ARROW-16022
> URL: https://issues.apache.org/jira/browse/ARROW-16022
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 7.0.0
> Reporter: Kevin Crouse
> Priority: Major
>
> Running pyarrow.compute.floor_temporal for timestamps that exist will throw
> exceptions if the times are ambiguous during the daylight saving time
> transitions.
> As the *_temporal functions do not fundamentally change the times, it does
> not make sense that they would fail due to a timezone issue. If they must
> fail, it should be when the pyarrow.Timestamp is created.
>
>
> {code:python}
> import pyarrow
> import pyarrow.compute as pc
> import datetime
> import pytz
> t = pyarrow.timestamp('s', tz='America/New_York')
> dt = datetime.datetime(2013, 11, 3, 1, 3, 14, tzinfo =
> pytz.timezone('America/New_York'))
> # if a timestamp must be invalid, this could fail
> za = pyarrow.array([dt], t)
> # raises an exception, even though this is conceptually an identity function here
> pc.floor_temporal(za, unit = 'second')
> {code}
>
> And this actually works just fine (continued from above)
> {code:python}
> pc.cast(
>     pc.floor_temporal(
>         pc.cast(za, pyarrow.timestamp('s', 'UTC')),
>         unit='second'),
>     pyarrow.timestamp('s', 'America/New_York')
> )
> {code}