Re: Strategies for converting UTC data to local windows for arbitrary resolutions and timezones

Marian Wendt Wed, 04 Oct 2023 03:27:36 -0700

Am 04.10.2023 um 05:36 schrieb Steve Crawford:

On Tue, Oct 3, 2023 at 7:33 PM Lincoln Swaine-Moore<lswainemo...@gmail.com> wrote:


    Hi all,

    I'm attempting to generate some reports using user-driven
    timezones for UTC data, and I'm having trouble writing a query
    that meets the following criteria:

    1) Data should be averaged at one of daily, hourly, or 15 minute
    granularities (user-driven).
    2) Data over a given period should reflect the localized value of
    the UTC data for a user-provided time zone. E.g.:
        a) The 1 hour period starting at '2023-10-03
    12:00:00.000000-0400' (America/New_York) should reflect data with
    timestamps between '2023-10-03 16:00:00.000000Z' and '2023-10-03
    17:00:00.000000Z'.
        b) The 1 day period starting at '2023-10-03
    00:00:00.000000-0400' should reflect data with timestamps between
    '2023-10-03 04:00:00.000000Z' and '2023-10-04 04:00:00.000000Z'.
    3) When a period interacts with a DST change in the given
    timezone, the data should not be clumped together. E.g.:
        a) Data points occurring at 2023-11-05 05:30:00.000000Z and
    2023-11-05 06:30:00.000000Z should be treated as falling into
    separate buckets when the time zone is America/New_York:
    (2023-11-05 01:30:00.000000-0400 and 2023-11-05
    01:30:00.000000-0500, respectively). This should be true for
    either the 15 minute or 1 hour intervals.
        b) Some clumping for day resolution seems ok! E.g. the 1 day
    period starting at '2023-11-05 00:00:00.000000-0400' can and
    probably should contain 25 hours' worth of data. Certainly it
    should not reflect the data falling between '2023-11-05
    00:04:00.000000Z' and '2023-11-05 00:04:00.000000Z' + '24
    hours'::interval (= '2023-11-06 00:04:00.000000Z'), because that
    would be the local times of '2023-11-05 00:00:00.000000-0400' and
    '2023-11-04 23:00:00.000000-0500'.
    4) It would be relatively simple to do 15 minute and 1 hour
    periods were all timezone offsets multiples of 1 hour (in that
    case, all operations could be done in UTC and then converted after
    the fact), but unfortunately some time zones have 30 min-based
    offsets, which interferes with this approach.
    5) Ideally, the solution would not involve messing with the
    server/connection's value of timezone. (Though I would be
    interested if there was a solution that relaxed this constraint
    and was relatively safe/compatible with transactions and psycopg2.)
    6) Ideally, my query would return periods that are missing data
    (though could plausibly fill these in in the layer above). This
    points toward generate_series but...
    7) Sigh: I can't upgrade to 16. I gather that 16 has a timezone
    parameter for generate_series, which I believe might help. But
    tragically, Digitalocean doesn't yet support 16, and it's not
    practical to migrate elsewhere. Based on historical release ->
    support timing, I'd imagine they will not support it until Q2
    2024, which is too late for this feature. If anyone had the inside
    scoop about when they'd likely support it, I'd welcome it!

    This looks pretty hairy written out as above, but I actually think
    it reflects most people's intuitions about what data a local
    period "should" correspond to (though I'd welcome feedback on this
    point).

    Here are some thoughts about approaches that I've tried, and what
    their drawbacks seem to be. For all these, I'll use the following
    CTE to demonstrate some data that crosses a DST boundary:

    ```
    with original_data as (
        select
            ('2023-11-05 00:00:00.000000Z'::timestamptz) + (15 * x ||
    ' minutes')::interval as "t"
        from
            generate_series(0, 1000) as x
    )
    ```

    1) date_trunc: it seems like as of v12, date_trunc accepts a third
    argument of timezone, which essentially plays the role of the
    server timezone setting for the scope of the function. This is
    very handy, and *I believe* solves my issues for the hour/day periods:
    ```
    [etc]
    select
        date_trunc('day', t, 'America/New_York'),
        min(t),
        max(t),
        count(*)
    from original_data
    group by 1
    order by 1;

           date_trunc       |          min           |  max          
    | count
    
------------------------+------------------------+------------------------+-------
     2023-11-04 04:00:00+00 | 2023-11-05 00:00:00+00 | 2023-11-05
    03:45:00+00 |    16
     2023-11-05 04:00:00+00 | 2023-11-05 04:00:00+00 | 2023-11-06
    04:45:00+00 |   100
     2023-11-06 05:00:00+00 | 2023-11-06 05:00:00+00 | 2023-11-07
    04:45:00+00 |    96
     2023-11-07 05:00:00+00 | 2023-11-07 05:00:00+00 | 2023-11-08
    04:45:00+00 |    96
     2023-11-08 05:00:00+00 | 2023-11-08 05:00:00+00 | 2023-11-09
    04:45:00+00 |    96
     2023-11-09 05:00:00+00 | 2023-11-09 05:00:00+00 | 2023-11-10
    04:45:00+00 |    96

    [etc]
    ```

    This checks out, but unfortunately doesn't seem to work for 15
    minutes. I think, by the way, that this behavior is identical to
    what I would've gotten if my server timezone was
    "America/New_York" and I ran it without the 3rd argument, though
    I'd be curious to hear if there are discrepancies.

    On this point, I've read up on some of the history around this
    feature, and was a little puzzled by this assertion in this
    thread:
    
https://www.postgresql.org/message-id/87in1k73nr....@news-spur.riddles.org.uk:

    > If you have a timestamp-in-UTC column and want to do a
    date_trunc in some other specified zone (that's not the session
    timezone), you need FOUR uses of AT TIME ZONE to do it correctly:
    > date_trunc('day', col AT TIME ZONE 'UTC' AT TIME ZONE
    'Asia/Kathmandu') AT TIME ZONE 'Asia/Kathmandu' AT TIME ZONE 'UTC'

    I *think* that this doesn't comport with behavior I've seen,
    because at a DST boundary the initial `col AT TIME ZONE 'UTC' AT
    TIME ZONE 'Asia/Kathmandu'` would drop the offset, which from then
    on would be unrecoverable. For example:

    ```
    select date_trunc('hour', '2023-11-05 05:30:00.000000'::timestamp
    AT TIME ZONE 'UTC' AT TIME ZONE 'America/New_York') AT TIME ZONE
    'America/New_York' AT TIME ZONE 'UTC';
          timezone
    ---------------------
     2023-11-05 06:00:00

    select date_trunc('hour', '2023-11-05 06:30:00.000000'::timestamp
    AT TIME ZONE 'UTC' AT TIME ZONE 'America/New_York') AT TIME ZONE
    'America/New_York' AT TIME ZONE 'UTC';
          timezone
    ---------------------
     2023-11-05 06:00:00
    ```

    This is what I meant above by "clumping" in point (3). I believe
    this issue is ameliorated by setting the database timezone:

    ```
    set timezone to 'America/New_York';
    select date_trunc('hour', '2023-11-05 06:30:00.000000'::timestamp
    AT TIME ZONE 'UTC');
           date_trunc
    ------------------------
     2023-11-05 01:00:00-05

    select date_trunc('hour', '2023-11-05 05:30:00.000000'::timestamp
    AT TIME ZONE 'UTC');
           date_trunc
    ------------------------
     2023-11-05 01:00:00-04
    ```

    The same (correct, in my opinion) behavior seems to occur when
    using the third argument to date_trunc, so all is well on that
    front, but I'm being pedantic about this because if I was
    misunderstanding and usage of four `at time zone`s was a
    legitimate strategy, it could plausibly be applied to the below.
    Hoping to be incorrect here!


    2) date_bin: this will take a "stride", which should accommodate
    all the periods I'm interested in, and "origin" to help deal with
    initial day/hour offsets. But I'm having trouble getting it to
    respect time zone/DST stuff. For example:

    ```
    [etc]
    select
        date_bin('24 hours', t, '2023-11-03
    00:00:00.000000-0400'::timestamptz),
        min(t),
        max(t),
        count(*)
    from
        original_data
    group by 1
    order by 1;

            date_bin        |          min           |  max          
    | count
    
------------------------+------------------------+------------------------+-------
     2023-11-04 00:00:00-04 | 2023-11-04 20:00:00-04 | 2023-11-04
    23:45:00-04 |    16
     2023-11-05 00:00:00-04 | 2023-11-05 00:00:00-04 | 2023-11-05
    22:45:00-05 |    96
     2023-11-05 23:00:00-05 | 2023-11-05 23:00:00-05 | 2023-11-06
    22:45:00-05 |    96
     2023-11-06 23:00:00-05 | 2023-11-06 23:00:00-05 | 2023-11-07
    22:45:00-05 |    96
     2023-11-07 23:00:00-05 | 2023-11-07 23:00:00-05 | 2023-11-08
    22:45:00-05 |    96
     2023-11-08 23:00:00-05 | 2023-11-08 23:00:00-05 | 2023-11-09
    22:45:00-05 |    96
     2023-11-09 23:00:00-05 | 2023-11-09 23:00:00-05 | 2023-11-10
    22:45:00-05 |    96
    [etc]
    ```

    As you can see, the days get "bumped" by DST, and are off by one
    (until the spring). I actually think this makes intuitive sense
    for this concept of "stride"--it just means that it's not as
    useful for timezones with DST.


    3) generate_series: If I could get this working, it would be
    ideal, because it would also help fill in gaps in my data with
    null rows. But again, I can't seem to, except in v16, have it
    respect timezones other than the server's:

    ```
    set timezone to 'UTC';
    select
        days as start_time,
        lead(days) over (order BY days) as end_time
    from generate_series(
        '2023-11-03 00:00:00.000000-0400'::timestamptz,
        '2023-11-07 00:00:00.000000-0500'::timestamptz,
        '1 day'::interval
    ) days;

           start_time       |        end_time
    ------------------------+------------------------
     2023-11-03 04:00:00+00 | 2023-11-04 04:00:00+00
     2023-11-04 04:00:00+00 | 2023-11-05 04:00:00+00
     2023-11-05 04:00:00+00 | 2023-11-06 04:00:00+00
     2023-11-06 04:00:00+00 | 2023-11-07 04:00:00+00
    [etc.]


    set timezone to 'America/New_York';
    [same as above]

           start_time       |        end_time
    ------------------------+------------------------
     2023-11-03 00:00:00-04 | 2023-11-04 00:00:00-04
     2023-11-04 00:00:00-04 | 2023-11-05 00:00:00-04
     2023-11-05 00:00:00-04 | 2023-11-06 00:00:00-05
     2023-11-06 00:00:00-05 | 2023-11-07 00:00:00-05
    [etc.]
    ```

    The latter being correct for these purposes, but not seeming super
    reliable/practical (see: point 5).

    Can anyone think of any other approaches to this problem? I'd be
    thrilled if I could manage it without resorting to manually
    post-processing in pandas or something, because I have a number of
    datapoints, and each individual one is quite wide.

    Apologies for the length of this message--just trying to be
    thorough. I sincerely appreciate any help or pointers!

    Best,
    Lincoln

--Lincoln Swaine-Moore




That's a long email to digest but a couple thoughts that may help.

I've always disliked the term "timestamp with time zone" and alwaysmentally translate that to "point in time". As long as data is storedas a timestamp with time zone (point in time), it can be manipulatedin any time zone you desire including handling DST.

Time calculation always involves some assumption of meaning andPostgreSQL makes some reasonable ones. I'm in US Pacific time and if Irun:

select '2023-11-05'::timestamptz ;
I get:
     timestamptz
------------------------
2023-11-05 00:00:00-07

If I add a day (crossing the DST boundary):
select '2023-11-05'::timestamptz + '1 day'::interval;
       ?column?
------------------------
2023-11-06 00:00:00-08

Observe that I end up at midnight the following day. But if I insteadadd 24 hours:

select '2023-11-05'::timestamptz + '24 hours'::interval;
       ?column?
------------------------
2023-11-05 23:00:00-08

24 hours is exactly what I get.

You are generating 15-minute intervals the hard way. You can do itdirectly and have DST handled for you:select generate_series('2023-11-05'::timestamptz,'2023-11-06'::timestamptz, '15 minutes'::interval);

    generate_series
------------------------
2023-11-05 00:00:00-07
2023-11-05 00:15:00-07
2023-11-05 00:30:00-07
2023-11-05 00:45:00-07
2023-11-05 01:00:00-07
2023-11-05 01:15:00-07
2023-11-05 01:30:00-07
2023-11-05 01:45:00-07
2023-11-05 01:00:00-08
2023-11-05 01:15:00-08
...
2023-11-06 00:00:00-08
(101 rows)

Note that 01:00:00-07 is a different point in time than 01:00:00-08.Sticking with timestamp with time zone, aka a fully qualified point intime, removes any ambiguity. Also observe that I get the correctnumber of "bins". If I run the same thing but for March 12 2023(spring forward) I'll get 93 rows but on "normal days" there will be 97.

I suspect your best bet will be to store the data as type timestamp with time zone and to set the time zone before running your queries.

Note that setting the time zone is a client/connection setting so ifyou set it within a transaction, it will stay set when the transactionconcludes. But time manipulation is tricky and trying to DIY reinventthe wheel is painful and often buggy. Let PostgreSQL do the work for you.


Cheers,
Steve

UTC is an excellent form of timestamps as a linear quantity in the db(like Steve wrote: "point in time"); no gaps, no duplications.

For the sake of simplicity, I only included the possible date-binvariants so that the results can be compared; ordering and grouping withjust one date_bin etc. can be easily customized...


SET TIMEZONE='Etc/UTC';
SELECT
    sub.gs
    ,date_bin('15 minutes', sub.gs, '2023-01-01') AS norm15minutes
    ,date_bin('1 hours', sub.gs, '2023-01-01') AS norm1hour
    ,date_bin('1 days', sub.gs, '2023-01-01') AS norm1day

FROM (SELECT generate_series('2023-11-03 00:00:00Z'::timestamptz,'2023-11-06 00:00:00Z'::timestamptz, '5 minutes'::interval) AS gs) AS sub;

For the WHERE clause also everything in UTC (the conversion of theparameters of "user time zone" takes place before).

Conversion of the results into the "user time zone" takes place in theclient app.

--
regards, marian wendt

Re: Strategies for converting UTC data to local windows for arbitrary resolutions and timezones

Reply via email to