Triggered by Michael mentioning subqueries I ended up trying
explain
select d.short_id,mv.timestamp ,mv.I64_01
from device d, device_configuration dc, measurement_value mv
where mv.device_id=d.short_id and dc.device_id = d.id and dc.latest=true and
mv.timestamp > '2020-11-06'::timestamp - (select tzn.utc_offset from
pg_timezone_names tzn where tzn.name=dc.timezone) and mv.timestamp <
'2020-11-07'::timestamp - (select tzn.utc_offset from pg_timezone_names tzn
where tzn.name=dc.timezone)
==>
Nested Loop (cost=25.85..84540074.64 rows=16996885 width=20)
-> Seq Scan on device_configuration dc (cost=0.00..470.01 rows=615 width=30)
Filter: latest
-> Nested Loop (cost=25.85..137027.83 rows=43494 width=36)
-> Index Scan using device_pkey on device d (cost=0.28..7.23 rows=1
width=20)
Index Cond: (id = dc.device_id)
-> Index Scan using measurement_values_pkey on measurement_value mv
(cost=25.58..136585.66 rows=43494 width=20)
Index Cond: ((device_id = d.short_id) AND ("timestamp" >
('2020-11-06 00:00:00'::timestamp without time zone - (SubPlan 1))) AND
("timestamp" < ('2020-11-07 00:00:00'::timestamp without time zone - (SubPlan
2))))
SubPlan 1
-> Function Scan on pg_timezone_names (cost=0.00..12.50
rows=5 width=16)
Filter: (name = dc.timezone)
SubPlan 2
-> Function Scan on pg_timezone_names pg_timezone_names_1
(cost=0.00..12.50 rows=5 width=16)
Filter: (name = dc.timezone)
Now returns the 320K in less than 5sec.
I was till now convinced that correlated subqueries or joins are equivalent. I
guess I was wrong :). Wonder how stable this plan will be though
Peter
> On 10 Nov 2020, at 09:06, Peter Coppens <[email protected]> wrote:
>
> Pavel
>
> Tx for the tip. But given that if the I64_01 column is removed from the
> select list, the index is used I guess the cast is not likely to be the cause.
>
> Like so
>
> explain
> select d.short_id,mv.timestamp --,mv.I64_01
> from device d, device_configuration dc, measurement_value mv,
> pg_timezone_names tzn
> where mv.device_id=d.short_id and dc.device_id = d.id <http://d.id/> and
> dc.latest=true and dc.timezone=tzn.name and
> mv.timestamp > '2020-11-06'::timestamp - tzn.utc_offset and
> mv.timestamp < '2020-11-07'::timestamp - tzn.utc_offset
>
>
> ==>
>
> Nested Loop (cost=1.13..6217004.08 rows=60425437 width=12)
> -> Nested Loop (cost=0.56..21334.84 rows=2186 width=20)
> Join Filter: (dc.timezone = pg_timezone_names.name)
> -> Nested Loop (cost=0.56..7497.34 rows=615 width=18)
> -> Index Scan using device_short_id_key on device d
> (cost=0.28..2423.90 rows=683 width=20)
> -> Index Scan using device_configuration_device_latest_idx on
> device_configuration dc (cost=0.28..7.42 rows=1 width=30)
> Index Cond: ((device_id = d.id <http://d.id/>) AND
> (latest = true))
> Filter: latest
> -> Function Scan on pg_timezone_names (cost=0.00..10.00 rows=1000
> width=48)
> -> Index Only Scan using measurement_values_pkey on measurement_value mv
> (cost=0.57..2399.33 rows=43492 width=12)
> Index Cond: ((device_id = d.short_id) AND ("timestamp" > ('2020-11-06
> 00:00:00'::timestamp without time zone - pg_timezone_names.utc_offset)) AND
> ("timestamp" < ('2020-11-07 00:00:00'::timestamp without time zone -
> pg_timezone_names.utc_offset)))
>
>
> Peter
>
>> On 10 Nov 2020, at 08:25, Pavel Stehule <[email protected]
>> <mailto:[email protected]>> wrote:
>>
>>
>>
>> út 10. 11. 2020 v 8:18 odesílatel Peter Coppens <[email protected]
>> <mailto:[email protected]>> napsal:
>> Michael
>>
>> Many thanks for spending your time on this. Your alternative does not help
>> unfortunately (see execution plan)
>>
>> Still a sequential scan on the complete table. I have tried many
>> alternatives and somehow whenever I add a column that is not in the index
>> (I64_01) the optimizer decides not to use the index. If I remove that
>> column, the index is used. I guess it estimates that the extra indirection
>> from index pages to the row pages is more costly than scanning the 168M
>> records. Pretty sure it’s not, but I cannot explain it to the stubborn thing
>> :)
>>
>> Btw, thanks for the >= tip (I was aware of it)
>>
>> Wkr,
>>
>> Peter
>>
>>
>> Hash Join (cost=683.93..7270857.46 rows=458127 width=20)
>> Hash Cond: (mv_inner.device_id = d.short_id)
>> Join Filter: ((mv_inner."timestamp" > ('2020-11-06 00:00:00'::timestamp
>> without time zone - pg_timezone_names.utc_offset)) AND (mv_inner."timestamp"
>> < ('2020-11-07 00:00:00'::timestamp without time zone -
>> pg_timezone_names.utc_offset)))
>> -> Seq Scan on measurement_value mv_inner (cost=0.00..7166797.33
>> rows=1287989 width=1006)
>> Filter: (("timestamp" > '2020-11-05 00:00:00'::timestamp without
>> time zone) AND ("timestamp" < '2020-11-08 00:00:00'::timestamp without time
>> zone))
>>
>> when you see cast in filter, then you should check type equality in
>> constraints. With some exception Postgres uses indexes only when filtered
>> value has same type like column type.
>>
>> Maybe there is inconsistency between timestamp (with time zone), and
>> timestamp without time zone
>>
>> Regards
>>
>> Pavel
>>
>>
>> -> Hash (cost=656.61..656.61 rows=2186 width=20)
>> -> Hash Join (cost=77.87..656.61 rows=2186 width=20)
>> Hash Cond: (dc.timezone = pg_timezone_names.name
>> <http://pg_timezone_names.name/>)
>> -> Hash Join (cost=55.37..533.83 rows=615 width=18)
>> Hash Cond: (dc.device_id = d.id <http://d.id/>)
>> -> Seq Scan on device_configuration dc
>> (cost=0.00..470.01 rows=615 width=30)
>> Filter: latest
>> -> Hash (cost=46.83..46.83 rows=683 width=20)
>> -> Seq Scan on device d (cost=0.00..46.83
>> rows=683 width=20)
>> -> Hash (cost=10.00..10.00 rows=1000 width=48)
>> -> Function Scan on pg_timezone_names
>> (cost=0.00..10.00 rows=1000 width=48)
>>
>>
>>
>>> On 10 Nov 2020, at 01:15, Michael Lewis <[email protected]
>>> <mailto:[email protected]>> wrote:
>>>
>>> On Mon, Nov 9, 2020 at 1:11 PM Peter Coppens <[email protected]
>>> <mailto:[email protected]>> wrote:
>>> Adding the tzn.utc_offset results in the fact that the execution plan no
>>> longer considers to use the index on the measurement_value table. Is there
>>> any way the SQL can be rewritten so that the index is used? Or any other
>>> solution so that the query with the timezone offset returns in a comparable
>>> time?
>>>
>>> I am not aware of a best practice to handle this. Your where condition on
>>> mv.timestamp now depends on several joins to do a filtering that used to be
>>> a static range that can be scanned into the index as a first node in the
>>> plan. I have sometimes used a sub-query on a broader condition that allows
>>> the use of the index, and then fully reducing the set later. Something like
>>> this-
>>>
>>> select d.short_id,mv.timestamp,mv.I64_01
>>> from device d, device_configuration dc, (
>>> select mv.*
>>> from measurement_value AS mv_inner
>>> where mv.timestamp > '2020-11-06'::timestamp - interval '1 day' and
>>> mv.timestamp < '2020-11-07'::timestamp + interval '1 day'
>>> offset 0 /* to prevent in-lining the join to the outside set */
>>> ) mv, pg_timezone_names tzn
>>> where mv.device_id=d.short_id and dc.device_id = d.id <http://d.id/> and
>>> dc.latest=true and dc.timezone=tzn.name <http://tzn.name/> and
>>> mv.timestamp > '2020-11-06'::timestamp - tzn.utc_offset and
>>> mv.timestamp < '2020-11-07'::timestamp - tzn.utc_offset
>>>
>>> By the way, it seems a little odd to be exclusive on both the begin and
>>> end. I'd usually expect timestamp >= start_date and timestamp < end_date +
>>> interval '1 day' to fully capture a 24 hour period. Right now, you are
>>> excluding any data that happens to have a timestamp value with .000000
>>> seconds (midnight exactly).
>>
>