> On Oct 4, 2021, at 8:46 AM, Rob Sargent <robjsarg...@gmail.com> wrote:
> 
> >> On Oct 4, 2021, at 10:22 AM, Israel Brewster <ijbrews...@alaska.edu> wrote:
> Guessing the “sd” is “standard deviation”? Any chance those stddevs are
> easily calculable from base data? Could cut your table size in half (and
> put those 20 cores to work on the reporting).

Possible - I’d have to dig into that with the script author. I was just handed 
an R script (I don’t work with R…) and told “here’s the data it needs, here’s 
the output we need stored in the DB.” I then spent just enough time with the 
script to figure out how to hook up the I/O. The schema is pretty much just a 
raw dump of the script’s output - I haven’t really spent any time figuring out 
what, exactly, the data is. Maybe I should :-)
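
If the “sd” columns do turn out to be plain standard deviations over the raw 
samples, I suppose the reporting side could compute them on the fly with 
stddev_samp() instead of storing them. A rough sketch - the raw_samples table 
and the column names here are guesses, not the actual schema:

-- recompute mean and standard deviation per station/channel/hour bucket
SELECT station, channel,
       date_trunc('hour', datetime) AS bucket,
       avg(value)         AS mean,
       stddev_samp(value) AS sd  -- sample standard deviation of the raw values
FROM raw_samples
GROUP BY station, channel, bucket;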

>  And I wonder if the last three indices are strictly necessary? They take 
> disc space too.

Not sure. Here’s the output from pg_stat_all_indexes:

volcano_seismology=# select * from pg_stat_all_indexes where relname='data';
 relid | indexrelid | schemaname | relname |       indexrelname        | idx_scan | idx_tup_read | idx_tup_fetch 
-------+------------+------------+---------+---------------------------+----------+--------------+---------------
 19847 |      19869 | public     | data    | data_pkey                 |        0 |            0 |             0
 19847 |      19873 | public     | data    | date_station_channel_idx  |   811884 |  12031143199 |    1192412952
 19847 |      19875 | public     | data    | station_channel_epoch_idx |        8 |       318506 |        318044
 19847 |      19876 | public     | data    | station_data_idx          |     9072 |         9734 |          1235
 19847 |      19877 | public     | data    | station_date_idx          |   721616 |  10927533403 |   10908912092
 19847 |      20479 | public     | data    | data_station_channel_idx  |    47293 | 194422257262 |    6338753379
(6 rows)

so they *have* been used (although station_data_idx not so much). This doesn’t 
tell me when they were last used, though, so some of those scans may be from 
queries I was experimenting with to see what was fastest but that are no longer 
in use. Maybe I should keep an eye on the counters for a while and see which 
ones are still increasing.
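
Something like this could snapshot the counters now so I can diff them in a few 
weeks (the snapshot table name is just a placeholder):

-- take a snapshot of the current per-index scan counts
CREATE TABLE idx_usage_snapshot AS
SELECT now() AS taken, indexrelname, idx_scan
FROM pg_stat_all_indexes
WHERE relname = 'data';

-- later: which indexes have actually been scanned since the snapshot?
SELECT cur.indexrelname, cur.idx_scan - snap.idx_scan AS new_scans
FROM pg_stat_all_indexes cur
JOIN idx_usage_snapshot snap USING (indexrelname)
WHERE cur.relname = 'data'
ORDER BY new_scans DESC;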

> 
> But my bet is you’re headed for partitioning on datetime or perhaps station.

While datetime partitioning seems to be the most common, I’m not clear on how 
that would help here, as the most intensive queries need *all* the datetimes 
for a given station, and even the smaller queries would be fetching an 
arbitrary time range that could span several, if not all, partitions. 
Partitioning on station, on the other hand, seems to make sense - there are 
over 100 of those, and pretty much any query will only deal with a single 
station at a time. Or, if more (and smaller) partitions would be better, 
perhaps partition by both station and channel? The queries that need to be 
fastest will only ever be looking at a single channel of a single station.
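
From what I can see in the docs, declarative LIST partitioning would map onto 
that scheme fairly directly - something like the following, where the station 
and channel codes and the column list are placeholders for the real schema:

-- parent table, partitioned by station
CREATE TABLE data (
    station  text        NOT NULL,
    channel  text        NOT NULL,
    datetime timestamptz NOT NULL,
    value    double precision
) PARTITION BY LIST (station);

-- one sub-partitioned child per station...
CREATE TABLE data_ok01 PARTITION OF data
    FOR VALUES IN ('OK01')
    PARTITION BY LIST (channel);

-- ...and one leaf table per channel within that station
CREATE TABLE data_ok01_bhz PARTITION OF data_ok01
    FOR VALUES IN ('BHZ');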

I’ll look into this a bit more, maybe try some experimenting while I still have 
*relatively* little data. My main hesitation here is that in the brief look 
I’ve given partitioning so far, it looks to be a royal pain to get set up. Any 
tips for making that easier?
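
(One thing I might try, rather than typing out 100+ CREATE TABLE statements, is 
generating the per-station partitions from the existing data - roughly like 
this, assuming the old table has been renamed to data_old and the partitions 
are plain per-station leaves with no channel sub-partitioning:)

DO $$
DECLARE
    s text;
BEGIN
    -- create one partition per distinct station in the existing table
    FOR s IN SELECT DISTINCT station FROM data_old LOOP
        EXECUTE format(
            'CREATE TABLE IF NOT EXISTS %I PARTITION OF data FOR VALUES IN (%L)',
            'data_' || lower(s), s);
    END LOOP;
END $$;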

Thanks for the suggestion!
---
Israel Brewster
Software Engineer
Alaska Volcano Observatory 
Geophysical Institute - UAF 
2156 Koyukuk Drive 
Fairbanks AK 99775-7320
Work: 907-474-5172
cell:  907-328-9145
