Re: Faster distinct query?

Israel Brewster Fri, 24 Sep 2021 11:03:44 -0700

> On Sep 23, 2021, at 8:55 PM, Michael Lewis <[email protected]> wrote:
> 
> It’s only when I add in the AND data.channels=channels.channel that the query 
> time blows up to 5+ minutes. I personally don’t understand why there would be 
> such a large difference between the two queries - something wrong with my 
> indexes?
> 
> Did you do the manual vacuum as suggested by Tom?


I ran a VACUUM ANALYZE, yes.

> Maybe at least check pg_stat_user_tables for the last vacuum or autovacuum 
> timestamp, and how many records have changed since then.

volcano_seismology=# SELECT 
seq_scan,seq_tup_read,idx_scan,idx_tup_fetch,n_tup_ins,n_tup_upd,n_tup_del,n_tup_hot_upd,n_live_tup,n_dead_tup,n_mod_since_analyze,n_ins_since_vacuum,last_vacuum,last_autovacuum,last_analyze,last_autoanalyze,vacuum_count,autovacuum_count,analyze_count,autoanalyze_count
 FROM pg_stat_user_tables WHERE relname='data';
-[ RECORD 1 ]-------+------------------------------
seq_scan            | 95
seq_tup_read        | 25899340540
idx_scan            | 728372
idx_tup_fetch       | 51600217033
n_tup_ins           | 840283699
n_tup_upd           | 66120702
n_tup_del           | 2375651
n_tup_hot_upd       | 0
n_live_tup          | 839266956
n_dead_tup          | 66585751
n_mod_since_analyze | 58896052
n_ins_since_vacuum  | 24890460
last_vacuum         | 2021-09-22 21:32:11.367855+00
last_autovacuum     | 2021-09-14 07:13:23.745862+00
last_analyze        | 2021-09-22 21:32:21.071092+00
last_autoanalyze    | 2021-09-21 11:54:36.924762+00
vacuum_count        | 1
autovacuum_count    | 1
analyze_count       | 1
autoanalyze_count   | 2

Note that the update count was due to a (hopefully) one-time process where I 
had to change the value of a bunch of records. Generally this *should be* an 
insert-once-read-many database.


> The system is deciding on an index scan because that should be very fast, but 
> if it needs to look at the actual table table to determine if the process 
> executing that query should still be allowed to see that tuple (row version) 
> then the index only scan slows down a lot I believe. The huge number of heap 
> fetches that Tom pointed out mean that the system doesn't know that ALL 
> processes should be able to see those tuples. As someone else suggested 
> lowering the max freeze age, you might just do a manual "vacuum freeze" as 
> well to allow future auto vacuum processes to quickly skip over those pages 
> as well.

Ok, ran a VACUUM (ANALYZE, FREEZE), and am now showing this from 
pg_stat_user_tables:

volcano_seismology=# SELECT 
seq_scan,seq_tup_read,idx_scan,idx_tup_fetch,n_tup_ins,n_tup_upd,n_tup_del,n_tup_hot_upd,n_live_tup,n_dead_tup,n_mod_since_analyze,n_ins_since_vacuum,last_vacuum,last_autovacuum,last_analyze,last_autoanalyze,vacuum_count,autovacuum_count,analyze_count,autoanalyze_count
 FROM pg_stat_user_tables WHERE relname='data';
-[ RECORD 1 ]-------+------------------------------
seq_scan            | 96
seq_tup_read        | 26737263238
idx_scan            | 732396
idx_tup_fetch       | 52571927369
n_tup_ins           | 841017819
n_tup_upd           | 66120702
n_tup_del           | 2388723
n_tup_hot_upd       | 0
n_live_tup          | 840198830
n_dead_tup          | 10173
n_mod_since_analyze | 0
n_ins_since_vacuum  | 0
last_vacuum         | 2021-09-24 17:18:18.34282+00
last_autovacuum     | 2021-09-14 07:13:23.745862+00
last_analyze        | 2021-09-24 17:18:31.576238+00
last_autoanalyze    | 2021-09-21 11:54:36.924762+00
vacuum_count        | 2
autovacuum_count    | 1
analyze_count       | 2
autoanalyze_count   | 2

However, adding the AND data.channels=channels.channel to the query still makes 
it take around 5 minutes (https://explain.depesz.com/s/7hb1 
<https://explain.depesz.com/s/7hb1>). So, again, running VACUUM didn’t appear 
to help any.

Also perhaps interestingly, if I again modify the query to only match on 
channel, not station, it is again fast (though not quite as fast): 
https://explain.depesz.com/s/HLb8 <https://explain.depesz.com/s/HLb8>

So, basically, I can quickly get a list of all channels for which I have data, 
or all stations for which I have data, but getting a list of all channels for 
each station is slow.

> I've heard of the loose indexscan before mentioned on this thread, but I'm 
> not seeing how to implement that for multiple columns. Anyone have an example 
> or pseudo code perhaps?

This is my stumbling block to implementing this option as well. That said, with 
the lookup table in place, perhaps all the rest of this is a moot point? 

---
Israel Brewster
Software Engineer
Alaska Volcano Observatory 
Geophysical Institute - UAF 
2156 Koyukuk Drive 
Fairbanks AK 99775-7320
Work: 907-474-5172
cell:  907-328-9145

Re: Faster distinct query?

Reply via email to