[ 
https://issues.apache.org/jira/browse/ARROW-13033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17361850#comment-17361850
 ] 

Joris Van den Bossche edited comment on ARROW-13033 at 6/14/21, 1:14 PM:
-------------------------------------------------------------------------

 If we do it at the boundary, we need to have the functionality anyway. But 
indeed, it can just be an internal helper function, and doesn't necessarily 
need to be a public kernel. It's then a question of what we think is useful 
user API to expose.

But personally I think only having it at the boundary level will not be very 
practical:

* We would need to add it to all readers (CSV, JSON, Parquet, etc), but since 
we don't support all data sources, we also need to add it to the in-memory 
conversion functions (from numpy/pandas in the Python case) for when the user 
uses another library to obtain the data.
* This functionality probably will need additional options (e.g. for how to 
deal with ambiguous and non-existent timestamps due to DST changes). Adding 
options for this in each functionality at the boundary might be more complex 
than having one kernel with options that users can directly use after IO.
* Interpreting local times is inherently messy (which relates to the need of 
the options, bullet point above), and I think it's not uncommon that you might 
want to inspect the before and after values, to check what went wrong. "Hiding" 
this operation in the boundary functions will make such process more difficult.

bq. Maybe a name like {{interpret_naive_as_being_in_timezone()}} would make 
this clearer?

That's certainly explicit ;) 
Some other names that I encountered: {{withZoneRetainFields}} 
(https://www.joda.org/joda-time/apidocs/index.html), {{atZone}} 
(https://docs.oracle.com/javase/8/docs/api/java/time/LocalDateTime.html#atZone-java.time.ZoneId-),
 {{AT TIME ZONE}} 
(https://www.postgresql.org/docs/13/functions-datetime.html#FUNCTIONS-DATETIME-ZONECONVERT)



was (Author: jorisvandenbossche):
 If we do it at the boundary, we need to functionality anyway. But indeed, it 
can just be an internal helper function, and doesn't necessarily need to be a 
public kernel. It's then a question of what we think is useful user API to 
expose.

But personally I think only having it at the boundary level will not be very 
practical:

* We would need to add it to all readers (CSV, JSON, Parquet, etc), but since 
we don't support all data sources, we also need to add it to the in-memory 
conversion functions (from numpy/pandas in the Python case) for when the user 
uses another library to obtain the data.
* This functionality probably will need additional options (e.g. for how to 
deal with ambiguous and non-existent timestamps due to DST changes). Adding 
options for this in each functionality at the boundary might be more complex 
than having one kernel with options that users can directly use after IO.
* Interpreting local times is inherently messy (which relates to the need of 
the options, bullet point above), and I think it's not uncommon that you might 
want to inspect the before and after values, to check what went wrong. "Hiding" 
this operation in the boundary functions will make such process more difficult.

bq. Maybe a name like {{interpret_naive_as_being_in_timezone()}} would make 
this clearer?

That's certainly explicit ;) 
Some other names that I encountered: {{withZoneRetainFields}} 
(https://www.joda.org/joda-time/apidocs/index.html), {{atZone}} 
(https://docs.oracle.com/javase/8/docs/api/java/time/LocalDateTime.html#atZone-java.time.ZoneId-),
 {{AT TIME ZONE}} 
(https://www.postgresql.org/docs/13/functions-datetime.html#FUNCTIONS-DATETIME-ZONECONVERT)


> [C++] Kernel to localize naive timestamps to a timezone (preserving 
> clock-time)
> -------------------------------------------------------------------------------
>
>                 Key: ARROW-13033
>                 URL: https://issues.apache.org/jira/browse/ARROW-13033
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Joris Van den Bossche
>            Priority: Major
>
> Given a tz-naive timestamp, "localize" would interpret that timestamp as 
> local in a given timezone, and return a tz-aware timestamp keeping the same 
> "clock time" (the same year/month/day/hour/etc in the printed 
> representation). Under the hood this converts the timestamp value from that 
> timezone to UTC, since tz-aware timestamps are stored as UTC.
> References: 
> [tz_localize|https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DatetimeIndex.tz_localize.html]
>  in pandas, or 
> [force_tz|https://lubridate.tidyverse.org/reference/force_tz.html] in R's 
> lubridate package
> This will (eventually) also have to deal with ambiguous or non-existing times.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to