> On Sep 23, 2021, at 4:34 AM, Ryan Booz <r...@timescale.com> wrote:
>
> Heh, I honestly forgot about the recursive CTE. Certainly worth a try and
> wouldn't require installing other extensions.
>
> This is what depesz is referring to:
> https://wiki.postgresql.org/wiki/Loose_indexscan

Thanks for the pointer. Will definitely have to spend some time wrapping my brain around that one - I've done some CTEs before, but no recursive ones that I can recall. Should be fun!
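If I'm reading the wiki page right, a first attempt adapted to my schema might look something like the following. This is untested and just my rough understanding of the technique: the recursive step uses the two-column (station, channel) index to jump straight to the next distinct pair instead of scanning all ~800 million rows.

```sql
WITH RECURSIVE pairs AS (
    -- Seed: the first (station, channel) pair in index order.
    (SELECT station, channel
       FROM data
      ORDER BY station, channel
      LIMIT 1)
    UNION ALL
    -- Step: skip ahead to the next distinct pair via the index.
    SELECT nxt.station, nxt.channel
      FROM pairs p
     CROSS JOIN LATERAL (
          SELECT station, channel
            FROM data
           WHERE (station, channel) > (p.station, p.channel)
           ORDER BY station, channel
           LIMIT 1) AS nxt
)
SELECT station, array_agg(channel) AS channels
  FROM pairs
 GROUP BY station;
```

Since each pair is emitted exactly once, the DISTINCT inside array_agg() shouldn't be needed any more. The recursion stops on its own when the LATERAL subquery finds no pair beyond the last one.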
If it helps matters any, my structure is currently the following:

table "stations" listing station details (name, latitude, longitude, etc.) with a smallint primary key "id"

table "data" with many (many!) data columns (mostly doubles), a "station" column that is a smallint referencing the stations table, and a "channel" column which is a varchar containing the *name* of the channel the data came in on.

I will readily accept that this may not be the best structure for the DB. For example, perhaps the channel column should be normalized out, as has been mentioned a couple of times as an option. This would make sense, and would certainly simplify this portion of the project.

If I do go with a lookup table updated by a trigger, what would be the best option for the query the trigger runs - an upsert (ON CONFLICT DO NOTHING)? Or a query followed by an insert if needed? The normal case would be that the entry already exists (millions of hits vs. only the occasional insert needed).

Thanks again for all the suggestions!
---
Israel Brewster
Software Engineer
Alaska Volcano Observatory
Geophysical Institute - UAF
2156 Koyukuk Drive
Fairbanks AK 99775-7320
Work: 907-474-5172 cell: 907-328-9145

> On Thu, Sep 23, 2021 at 3:04 AM hubert depesz lubaczewski <dep...@depesz.com> wrote:
> On Wed, Sep 22, 2021 at 12:05:22PM -0800, Israel Brewster wrote:
> > I was wondering if there was any way to improve the performance of this
> > query:
> >
> > SELECT station,array_agg(distinct(channel)) as channels FROM data GROUP BY
> > station;
> >
> > The explain execution plan can be found here:
> > https://explain.depesz.com/s/mtxB#html
> >
> > and it looks pretty straightforward. It does an index_only scan, followed
> > by an aggregate, to produce a result that is a list of stations along with
> > a list of channels associated with each (there can be anywhere from 1 to 3
> > channels associated with each station). This query takes around 5 minutes
> > to run.
> >
> > To work around the issue, I created a materialized view that I can update
> > periodically, and of course I can query said view in no time flat. However,
> > I'm concerned that as the dataset grows, the time it takes to refresh the
> > view will also grow (correct me if I am wrong there).
> >
> > This is running PostgreSQL 13, and the index referenced is a two-column
> > index on data(station, channel)
>
> It looks like there are ~170 stations, and ~800 million rows in the
> table.
>
> Can you tell us how many rows this returns:
>
> select distinct station, channel from data;
>
> If this is not huge, then you can make the query run much faster using
> a skip scan - recursive CTE.
>
> Best regards,
>
> depesz
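P.S. In case a concrete sketch helps the discussion of the lookup-table idea above: this is roughly what I'm imagining for the trigger-maintained table. It's untested, and the table, function, and trigger names here are just placeholders I made up, nothing that exists yet.

```sql
-- Placeholder lookup table of the distinct (station, channel) pairs.
CREATE TABLE station_channels (
    station smallint REFERENCES stations (id),
    channel varchar,
    PRIMARY KEY (station, channel)
);

-- Trigger function: record the pair, quietly doing nothing in the
-- common case where it is already present.
CREATE FUNCTION note_station_channel() RETURNS trigger
LANGUAGE plpgsql AS $$
BEGIN
    INSERT INTO station_channels (station, channel)
         VALUES (NEW.station, NEW.channel)
    ON CONFLICT (station, channel) DO NOTHING;
    RETURN NEW;
END;
$$;

CREATE TRIGGER data_note_channel
    AFTER INSERT ON data
    FOR EACH ROW
    EXECUTE FUNCTION note_station_channel();
```

My understanding is that the single-statement upsert is usually preferred over select-then-insert because it avoids the race window between the check and the insert - but corrections welcome.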