[
https://issues.apache.org/jira/browse/ARROW-13033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17361850#comment-17361850
]
Joris Van den Bossche edited comment on ARROW-13033 at 6/14/21, 1:14 PM:
-------------------------------------------------------------------------
If we do it at the boundary, we need to have the functionality anyway. But
indeed, it can just be an internal helper function, and doesn't necessarily
need to be a public kernel. It's then a question of what we think is useful
user API to expose.
But personally I think only having it at the boundary level will not be very
practical:
* We would need to add it to all readers (CSV, JSON, Parquet, etc), but since
we don't support all data sources, we also need to add it to the in-memory
conversion functions (from numpy/pandas in the Python case) for when the user
uses another library to obtain the data.
* This functionality probably will need additional options (e.g. for how to
deal with ambiguous and non-existent timestamps due to DST changes). Adding
options for this in each functionality at the boundary might be more complex
than having one kernel with options that users can directly use after IO.
* Interpreting local times is inherently messy (which relates to the need of
the options, bullet point above), and I think it's not uncommon that you might
want to inspect the before and after values, to check what went wrong. "Hiding"
this operation in the boundary functions will make such process more difficult.
bq. Maybe a name like {{interpret_naive_as_being_in_timezone()}} would make
this clearer?
That's certainly explicit ;)
Some other names that I encountered: {{withZoneRetainFields}}
(https://www.joda.org/joda-time/apidocs/index.html), {{atZone}}
(https://docs.oracle.com/javase/8/docs/api/java/time/LocalDateTime.html#atZone-java.time.ZoneId-),
{{AT TIME ZONE}}
(https://www.postgresql.org/docs/13/functions-datetime.html#FUNCTIONS-DATETIME-ZONECONVERT)
was (Author: jorisvandenbossche):
If we do it at the boundary, we need to functionality anyway. But indeed, it
can just be an internal helper function, and doesn't necessarily need to be a
public kernel. It's then a question of what we think is useful user API to
expose.
But personally I think only having it at the boundary level will not be very
practical:
* We would need to add it to all readers (CSV, JSON, Parquet, etc), but since
we don't support all data sources, we also need to add it to the in-memory
conversion functions (from numpy/pandas in the Python case) for when the user
uses another library to obtain the data.
* This functionality probably will need additional options (e.g. for how to
deal with ambiguous and non-existent timestamps due to DST changes). Adding
options for this in each functionality at the boundary might be more complex
than having one kernel with options that users can directly use after IO.
* Interpreting local times is inherently messy (which relates to the need of
the options, bullet point above), and I think it's not uncommon that you might
want to inspect the before and after values, to check what went wrong. "Hiding"
this operation in the boundary functions will make such process more difficult.
bq. Maybe a name like {{interpret_naive_as_being_in_timezone()}} would make
this clearer?
That's certainly explicit ;)
Some other names that I encountered: {{withZoneRetainFields}}
(https://www.joda.org/joda-time/apidocs/index.html), {{atZone}}
(https://docs.oracle.com/javase/8/docs/api/java/time/LocalDateTime.html#atZone-java.time.ZoneId-),
{{AT TIME ZONE}}
(https://www.postgresql.org/docs/13/functions-datetime.html#FUNCTIONS-DATETIME-ZONECONVERT)
> [C++] Kernel to localize naive timestamps to a timezone (preserving
> clock-time)
> -------------------------------------------------------------------------------
>
> Key: ARROW-13033
> URL: https://issues.apache.org/jira/browse/ARROW-13033
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++
> Reporter: Joris Van den Bossche
> Priority: Major
>
> Given a tz-naive timestamp, "localize" would interpret that timestamp as
> local in a given timezone, and return a tz-aware timestamp keeping the same
> "clock time" (the same year/month/day/hour/etc in the printed
> representation). Under the hood this converts the timestamp value from that
> timezone to UTC, since tz-aware timestamps are stored as UTC.
> References:
> [tz_localize|https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DatetimeIndex.tz_localize.html]
> in pandas, or
> [force_tz|https://lubridate.tidyverse.org/reference/force_tz.html] in R's
> lubridate package
> This will (eventually) also have to deal with ambiguous or non-existing times.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)