Since there was a lot of discussion around timestamp localization, I'd
like to point out that there is now an open PR for it [1].

[1] https://github.com/apache/arrow/pull/10610

Rok

On Thu, Jun 10, 2021 at 11:11 PM Wes McKinney <wesmck...@gmail.com> wrote:
>
> I agree that we need to implement the equivalent of pandas's
> "tz_localize" method which performs UTC normalization on tz-naive data
> and sets the timezone field. Here's a demo of this functionality (I
> originally implemented this years ago by porting pytz's logic to run
> against NumPy arrays in Cython):
>
> https://gist.github.com/wesm/0e02567c0c4bab768bc0ecabc2fcb6a8
>
> On Thu, Jun 10, 2021 at 3:04 PM Joris Van den Bossche
> <jorisvandenboss...@gmail.com> wrote:
> >
> > On Thu, 10 Jun 2021 at 18:06, Antoine Pitrou <anto...@python.org> wrote:
> > >
> > > On Thu, 10 Jun 2021 17:33:23 +0200
> > > Joris Van den Bossche <jorisvandenboss...@gmail.com> wrote:
> > > >
> > > > > We just merged a PR to add some kernels to extract fields from
> > > > > timestamps (year, month, day, hour, etc -> ARROW-11759
> > > > > <https://github.com/apache/arrow/pull/10176>). But once you start with
> > > > > kernels for timestamp data, you quickly run into the question: what to
> > > > > do with tz-aware timestamps?
> > > >
> > > > For example, we have:
> > > > > - ARROW-12980 <https://issues.apache.org/jira/browse/ARROW-12980> about
> > > > > making the kernels that extract timestamp fields timezone-aware. For
> > > > > example, if you have a tz-aware timestamp with time "09:30:00+02:00",
> > > > > this is stored internally as "07:30:00 UTC" (plus the actual timezone
> > > > > as metadata of the type). For a kernel that extracts the "hour" field,
> > > > > you want it to return 9 and not 7 (which is what would happen if we
> > > > > used the internal UTC value and ignored the timezone information).
> > > > > - ARROW-13033 <https://issues.apache.org/jira/browse/ARROW-13033>
> > > > > (which I opened today) about adding functionality to convert a tz-naive
> > > > > "local time" (local "clock" time in a not-yet-specified time zone) to a
> > > > > properly timezone-aware timestamp with the user-specified time zone
> > > > > attached. This can be useful to handle data that does not have
> > > > > sufficient timezone information attached to the data/type itself, but
> > > > > for which you know what the timezone should be. For example, having a
> > > > > timestamp with hour "09:30:00" (no explicit timezone, implicitly UTC),
> > > > > but the user knows this is actually "09:30:00 CEST", so then you want
> > > > > to convert this to the UTC time ("07:30:00Z") that is equivalent to
> > > > > "09:30:00 CEST".
> > >
> > > I don't think it's helpful to discuss those two use cases together.
> > > The first case is talking about the semantics of a kernel on valid
> > > timestamp data.
> > > The second case is talking about invalid timestamp data (with values
> > > expressed in a non-UTC timezone).
> > >
> >
> > > What both cases have in common is that they need to look up timezone
> > > offsets to do a conversion and thus require access to a timezone
> > > database (which in turn requires us to deal with things like Windows
> > > not having a system tz database available). That was the main aspect I
> > > wanted to ensure we are OK with in general ("dealing with timezones"),
> > > and less the specifics of the two examples I gave.
> >
> > If that general issue doesn't turn out to be such a discussion point,
> > I think that would be a good start. And then indeed each case where we
> > might want to add timezone handling can be discussed separately (since
> > adding it to a second or third etc kernel is much less of an issue
> > than *starting* to do timezone handling).
> >
> > Joris
> >
> > > Regards
> > >
> > > Antoine.
> > >
> > >
