[
https://issues.apache.org/jira/browse/ARROW-16022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17512411#comment-17512411
]
Kevin Crouse edited comment on ARROW-16022 at 3/25/22, 2:39 PM:
----------------------------------------------------------------
Hi Rok,
Thanks for this.
We are accessing multiple data systems, none of which use UTC to begin with -
so we are generally constrained to using local time. Also, to avoid confusing
the issue by introducing pandas, here's an example using python core datetime
that demonstrates localtime issues in pyarrow.
Also, I just realized this is only an issue for ambiguous times. It appears
that floor_temporal handles nonexistent times correctly. I'll demonstrate that
below as well.
{code:python}
import datetime
import zoneinfo  # standard library in Python 3.9+
import pyarrow as pa
import pyarrow.compute as pc
tz = zoneinfo.ZoneInfo(key='America/New_York')
# In the US, the 1:00am hour is ambiguous when DST ends, because the minute
# after 1:59am Daylight Saving Time is 1:00am Standard Time.
# However, these times do exist.
date_ambig = datetime.datetime(2013, 11, 3, 1, 3, 14, tzinfo = tz)
arr = pa.array([ date_ambig ], pa.timestamp("s", "America/New_York"))
#
# Here, let me introspect and annotate the objects created above
#
date_ambig
# > datetime.datetime(2013, 11, 3, 1, 3, 14,
# >     tzinfo=zoneinfo.ZoneInfo(key='America/New_York'))
print(date_ambig)
# > 2013-11-03 01:03:14-04:00
# The native datetime object defaults to daylight time.
arr[0]
# > <pyarrow.TimestampScalar: datetime.datetime(2013, 11, 3, 1, 3, 14,
# >     tzinfo=<DstTzInfo 'America/New_York' EDT-1 day, 20:00:00 DST>)>
arr
# > [ 2013-11-03 05:03:14 ]
# Notice that pyarrow actually understands the timestamp just fine: that is
# the UTC value for it.
pc.floor_temporal(arr, unit="second")
# > pyarrow.lib.ArrowInvalid: Local time is ambiguous ...
{code}
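To make the ambiguity concrete without pyarrow, here is a minimal sketch using only the standard library. It assumes Python 3.9+ with time zone data available; the variable names are just for illustration. The same wall-clock reading maps to two distinct UTC instants, and the `fold` attribute selects which one is meant.

```python
import datetime
import zoneinfo  # standard library in Python 3.9+

tz = zoneinfo.ZoneInfo("America/New_York")
utc = datetime.timezone.utc

# The wall-clock time 01:03:14 occurs twice on 2013-11-03.
# fold=0 (the default) selects the first occurrence (EDT),
# fold=1 selects the second occurrence (EST).
first = datetime.datetime(2013, 11, 3, 1, 3, 14, tzinfo=tz)
second = datetime.datetime(2013, 11, 3, 1, 3, 14, fold=1, tzinfo=tz)

first_utc = first.astimezone(utc)
second_utc = second.astimezone(utc)

print(first_utc)               # 2013-11-03 05:03:14+00:00
print(second_utc)              # 2013-11-03 06:03:14+00:00
print(second_utc - first_utc)  # 1:00:00
```

The first value matches the UTC value that pyarrow stores in the example above, so the stored instant itself is unambiguous.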
I wrote the following to demonstrate the issue for nonexistent times, but it
raises no error. I went back to my error logs and confirmed that the failure
only happens for the ambiguous hour, when DST ends.
{code:python}
import datetime
import zoneinfo  # standard library in Python 3.9+
import pyarrow as pa
import pyarrow.compute as pc
tz = zoneinfo.ZoneInfo(key='America/New_York')
# In the US, the minute after 1:59am standard time is 3:00am in daylight time.
# Native python interprets a timestamp in the 2am hour as standard time, since
# daylight time does not yet exist.
before_dst = datetime.datetime(2022, 3, 13, 1, 30, 14,
                               tzinfo=zoneinfo.ZoneInfo(key='America/New_York'))
nonext_time = datetime.datetime(2022, 3, 13, 2, 30, 14,
                                tzinfo=zoneinfo.ZoneInfo(key='America/New_York'))
after_dst = datetime.datetime(2022, 3, 13, 3, 30, 14,
                              tzinfo=zoneinfo.ZoneInfo(key='America/New_York'))
print(before_dst)
# > 2022-03-13 01:30:14-05:00
print(nonext_time)
# > 2022-03-13 02:30:14-05:00
print(after_dst)
# > 2022-03-13 03:30:14-04:00
pc.floor_temporal(
    pa.array([before_dst, nonext_time, after_dst],
             pa.timestamp("s", "America/New_York")),
    unit="second")
# <pyarrow.lib.TimestampArray object at 0x7f17eb5ce0a0>
# [
# 2022-03-13 06:30:14,
# 2022-03-13 07:30:14,
# 2022-03-13 07:30:14
# ]
{code}
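For anyone who wants to find the affected rows in their own data, here is a rough sketch of how a wall-clock time could be classified with the standard library alone. `classify_local_time` is a hypothetical helper written for this comment, not part of pyarrow or of `datetime`. It relies on two behaviours of `zoneinfo`: the UTC offset depends on `fold` only inside a transition, and a nonexistent time does not survive a round trip through UTC.

```python
import datetime
import zoneinfo  # standard library in Python 3.9+

UTC = datetime.timezone.utc


def classify_local_time(naive, tz):
    """Label a naive wall-clock time as 'normal', 'ambiguous' or 'nonexistent' in tz."""
    first = naive.replace(tzinfo=tz, fold=0)
    second = naive.replace(tzinfo=tz, fold=1)
    # Outside a DST transition, fold has no effect on the UTC offset.
    if first.utcoffset() == second.utcoffset():
        return "normal"
    # An ambiguous time round-trips through UTC unchanged; a nonexistent
    # time comes back shifted out of the gap.
    round_trip = first.astimezone(UTC).astimezone(tz).replace(tzinfo=None)
    return "ambiguous" if round_trip == naive else "nonexistent"


tz = zoneinfo.ZoneInfo("America/New_York")
print(classify_local_time(datetime.datetime(2013, 11, 3, 1, 3, 14), tz))   # ambiguous
print(classify_local_time(datetime.datetime(2022, 3, 13, 2, 30, 14), tz))  # nonexistent
print(classify_local_time(datetime.datetime(2022, 3, 13, 1, 30, 14), tz))  # normal
```

This agrees with the output above: the nonexistent 2:30am value and the 3:30am value resolve to the same UTC instant, 07:30:14.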
> floor_temporal / ceil_temporal throws exception for existing timestamps if
> ambiguous/existing
> ---------------------------------------------------------------------------------------------
>
> Key: ARROW-16022
> URL: https://issues.apache.org/jira/browse/ARROW-16022
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 7.0.0
> Reporter: Kevin Crouse
> Priority: Major
>
> Running pyarrow.compute.floor_temporal for timestamps that exist will throw
> exceptions if the times are ambiguous during the daylight savings time
> transitions.
> As the *_temporal functions do not fundamentally change the times, it does
> not make sense that they would fail due to a timezone issue. If they must
> fail, it should be when the pyarrow.Timestamp is created.
>
>
> {code:java}
> import pyarrow
> import pyarrow.compute as pc
> import datetime
> import pytz
> t = pyarrow.timestamp('s', tz='America/New_York')
> dt = datetime.datetime(2013, 11, 3, 1, 3, 14,
>                        tzinfo=pytz.timezone('America/New_York'))
> # if a timestamp must be invalid, this could fail
> za = pyarrow.array([dt], t)
> # raises an exception, even though this is conceptually an identity function here
> pc.floor_temporal(za, unit = 'second') {code}
>
> And this actually works just fine (continued from above)
> {code:java}
> pc.cast(
> pc.floor_temporal(
> pc.cast(za, pyarrow.timestamp('s', 'UTC')),
> unit='second'),
> pyarrow.timestamp('s','America/New_York')
> )
> {code}
>
>
>
>