[ 
https://issues.apache.org/jira/browse/ARROW-16022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17512411#comment-17512411
 ] 

Kevin Crouse edited comment on ARROW-16022 at 3/25/22, 2:39 PM:
----------------------------------------------------------------

Hi Rok,

Thanks for this. 

We are accessing multiple data systems, none of which uses UTC to begin with, 
so we are generally constrained to using local time. Also, to avoid confusing 
the issue by introducing pandas, here's an example using Python's core datetime 
module that demonstrates the local-time issue in pyarrow.

Also, I just realized this is only an issue for ambiguous times. It appears 
that floor_temporal handles nonexistent times correctly.  I'll demonstrate that 
below as well.

 

 
{code:java}
import datetime
import zoneinfo # native in python 3.9+

import pyarrow as pa
import pyarrow.compute as pc

tz = zoneinfo.ZoneInfo(key='America/New_York')

# In the US, the 1:00am hour is ambiguous because the minute after 1:59am
# Daylight Saving Time is 1:00am Standard Time.
# However, these times do exist.
date_ambig = datetime.datetime(2013, 11, 3, 1, 3, 14, tzinfo = tz)
arr = pa.array([ date_ambig ], pa.timestamp("s", "America/New_York")) 

#
# Here, let me introspect and annotate the objects created above
#
date_ambig
# > datetime.datetime(2013, 11, 3, 1, 3, 14,
# >     tzinfo=zoneinfo.ZoneInfo(key='America/New_York'))

print(date_ambig) 
# > 2013-11-03 01:03:14-04:00
# The native datetime object defaults to daylight time.  

arr[0] 
# > <pyarrow.TimestampScalar: datetime.datetime(2013, 11, 3, 1, 3, 14,
# >     tzinfo=<DstTzInfo 'America/New_York' EDT-1 day, 20:00:00 DST>)>

arr
# > [  2013-11-03 05:03:14 ] 
# Notice that pyarrow actually understands the timestamp just fine:
# that is the UTC value for it.

pc.floor_temporal(arr, unit="second")

# > pyarrow.lib.ArrowInvalid: Local time is ambiguous ...

{code}
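For reference, the ambiguity itself can be shown with the standard library alone. The following is only a minimal sketch of what "ambiguous" means here, not of what pyarrow does internally: the same wall-clock time maps to two different UTC instants, and `datetime`'s `fold` attribute selects between them. The variable names are mine and purely illustrative.

```python
import datetime
import zoneinfo  # standard library in Python 3.9+

tz = zoneinfo.ZoneInfo("America/New_York")
utc = datetime.timezone.utc

# fold=0 (the default) selects the first occurrence of the wall-clock time
# (daylight time); fold=1 selects the second occurrence (standard time).
first = datetime.datetime(2013, 11, 3, 1, 3, 14, tzinfo=tz)
second = first.replace(fold=1)

first_utc = first.astimezone(utc)
second_utc = second.astimezone(utc)

print(first_utc)   # 2013-11-03 05:03:14+00:00
print(second_utc)  # 2013-11-03 06:03:14+00:00
```

The first value matches the UTC value pyarrow stored in the array above, so the instant is already unambiguous by the time floor_temporal sees it.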
I wrote the example below to demonstrate the issue for nonexistent times, but 
there's no error. I went back to my error logs and realized that it indeed only 
happens for the ambiguous hour, when DST ends in the fall. 

 

 
{code:java}
import datetime
import zoneinfo # native in python 3.9+

import pyarrow as pa
import pyarrow.compute as pc

tz = zoneinfo.ZoneInfo(key='America/New_York')

# In the US, the minute after 1:59am standard time is 3:00am in daylight time.
# Native python interprets a timestamp in the 2am hour as standard time, since
# daylight time does not yet exist.

before_dst = datetime.datetime(2022, 3, 13, 1, 30, 14, tzinfo=tz)
nonext_time = datetime.datetime(2022, 3, 13, 2, 30, 14, tzinfo=tz)
after_dst = datetime.datetime(2022, 3, 13, 3, 30, 14, tzinfo=tz)

print(before_dst)
# > 2022-03-13 01:30:14-05:00
print(nonext_time)
# > 2022-03-13 02:30:14-05:00
print(after_dst)
# > 2022-03-13 03:30:14-04:00

pc.floor_temporal(
    pa.array([before_dst, nonext_time, after_dst],
             pa.timestamp("s", "America/New_York")),
    unit="second",
)

# <pyarrow.lib.TimestampArray object at 0x7f17eb5ce0a0>
# [
#    2022-03-13 06:30:14,
#    2022-03-13 07:30:14,
#    2022-03-13 07:30:14
# ]
{code}
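As a cross-check on the output above, here is a minimal stdlib-only sketch that detects a nonexistent wall-clock time by round-tripping it through UTC. `exists_in_zone` is a hypothetical helper name of my own, not part of pyarrow or the standard library, and this says nothing about how pyarrow handles the case internally.

```python
import datetime
import zoneinfo  # standard library in Python 3.9+


def exists_in_zone(naive, tz):
    """Return True if the naive wall-clock time occurs in the zone `tz`.

    A time inside a DST gap does not survive a round trip through UTC:
    it comes back shifted to the other side of the gap.
    """
    utc = datetime.timezone.utc
    aware = naive.replace(tzinfo=tz)
    round_trip = aware.astimezone(utc).astimezone(tz)
    return round_trip.replace(tzinfo=None) == naive


tz = zoneinfo.ZoneInfo("America/New_York")

before_gap = datetime.datetime(2022, 3, 13, 1, 30, 14)
in_gap = datetime.datetime(2022, 3, 13, 2, 30, 14)
after_gap = datetime.datetime(2022, 3, 13, 3, 30, 14)

print(exists_in_zone(before_gap, tz))  # True
print(exists_in_zone(in_gap, tz))      # False
print(exists_in_zone(after_gap, tz))   # True
```

The 2:30am value is the only one that fails the round trip, which is consistent with it sharing a UTC instant with the 3:30am value in the floor_temporal output.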
 



> floor_temporal / ceil_temporal throws exception for existing timestamps if 
> ambiguous/existing
> ---------------------------------------------------------------------------------------------
>
>                 Key: ARROW-16022
>                 URL: https://issues.apache.org/jira/browse/ARROW-16022
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 7.0.0
>            Reporter: Kevin Crouse
>            Priority: Major
>
> Running pyarrow.compute.floor_temporal for timestamps that exist will throw 
> exceptions if the times are ambiguous during the daylight savings time 
> transitions. 
> As the *_temporal functions do not fundamentally change the times, it does 
> not make sense that they would fail due to a timezone issue. If they must 
> fail, it should be when the pyarrow.Timestamp is created.
>  
>  
> {code:java}
> import pyarrow
> import pyarrow.compute as pc
> import datetime
> import pytz
> t = pyarrow.timestamp('s', tz='America/New_York')
> dt = datetime.datetime(2013, 11, 3, 1, 3, 14, tzinfo = 
> pytz.timezone('America/New_York'))
> # if a timestamp must be invalid, this could fail
> za = pyarrow.array([dt], t) 
> # raises an exception, even though this is conceptually an identity
> # function here
> pc.floor_temporal(za, unit = 'second') {code}
>  
> And this actually works just fine (continued from above)
> {code:java}
> pc.cast(
>     pc.floor_temporal(
>         pc.cast(za, pyarrow.timestamp('s', 'UTC')),
>         unit='second'),
>     pyarrow.timestamp('s', 'America/New_York')
> )
>  {code}
>  
>  
>  
>  


