Here's the distribution of parameter_id's select count(parameter_id), parameter_id from datavalue group by parameter_id 88169 142889171 815805 178570 124257262 213947049 151225902 24091090 3103877 10633764 11994442 1849232 2014935 4563638 132955919 7
On Fri, Feb 26, 2016 at 2:02 PM, David G. Johnston < david.g.johns...@gmail.com> wrote: > On Fri, Feb 26, 2016 at 12:43 PM, joe meiring <josephmeir...@gmail.com> > wrote: > >> Also available on S.O.: >> >> >> http://stackoverflow.com/questions/35658238/postgres-odd-behavior-with-indices >> >> I've got a datavalue table with ~200M rows or so, with indices on both >> site_id and parameter_id. I need to execute queries like "return all >> sites with data" and "return all parameters with data". The site table >> has only 200 rows or so, and the parameter table has only 100 or so rows. >> >> The site query is fast and uses the index: >> >> EXPLAIN ANALYZEselect *from sitewhere exists ( >> select 1 from datavalue >> where datavalue.site_id = site.id limit 1); >> >> Seq Scan on site (cost=0.00..64.47 rows=64 width=113) (actual >> time=0.046..1.106 rows=89 loops=1) >> Filter: (SubPlan 1) >> Rows Removed by Filter: 39 >> SubPlan 1 >> -> Limit (cost=0.44..0.47 rows=1 width=0) (actual time=0.008..0.008 >> rows=1 loops=128) >> -> Index Only Scan using ix_datavalue_site_id on datavalue >> (cost=0.44..8142.71 rows=248930 width=0) (actual time=0.008..0.008 rows=1 >> loops=128) >> Index Cond: (site_id = site.id) >> Heap Fetches: 0 >> Planning time: 0.361 ms >> Execution time: 1.149 ms >> >> The same query for parameters is rather slow and does NOT use the index: >> >> EXPLAIN ANALYZEselect *from parameterwhere exists ( >> select 1 from datavalue >> where datavalue.parameter_id = parameter.id limit 1); >> >> Seq Scan on parameter (cost=0.00..20.50 rows=15 width=2648) (actual >> time=2895.972..21331.701 rows=15 loops=1) >> Filter: (SubPlan 1) >> Rows Removed by Filter: 6 >> SubPlan 1 >> -> Limit (cost=0.00..0.34 rows=1 width=0) (actual >> time=1015.790..1015.790 rows=1 loops=21) >> -> Seq Scan on datavalue (cost=0.00..502127.10 rows=1476987 >> width=0) (actual time=1015.786..1015.786 rows=1 loops=21) >> Filter: (parameter_id = parameter.id) >> Rows Removed by Filter: 7739355 >> Planning time: 0.123 ms >> Execution time: 21331.736 ms >> >> What the deuce is going on here? Alternatively, whats a good way to do >> this? >> >> Any help/guidance appreciated! >> >> >> >> Some of the table description: >> >> \d datavalue >> >> id BIGINT DEFAULT nextval('datavalue_id_seq'::regclass) NOT NULL, >> value DOUBLE PRECISION NOT NULL, >> site_id INTEGER NOT NULL, >> parameter_id INTEGER NOT NULL, >> deployment_id INTEGER, >> instrument_id INTEGER, >> invalid BOOLEAN, >> Indexes: >> "datavalue_pkey" PRIMARY KEY, btree (id) >> "datavalue_datetime_utc_site_id_parameter_id_instrument_id_key" UNIQUE >> CONSTRAINT, btree (datetime_utc, site_id, parameter_id, instrument_id) >> "ix_datavalue_instrument_id" btree (instrument_id) >> "ix_datavalue_parameter_id" btree (parameter_id) >> "ix_datavalue_site_id" btree (site_id) >> "tmp_idx" btree (site_id, datetime_utc) >> Foreign-key constraints: >> "datavalue_instrument_id_fkey" FOREIGN KEY (instrument_id) REFERENCES >> instrument(id) ON UPDATE CASCADE ON DELETE CASCADE >> "datavalue_parameter_id_fkey" FOREIGN KEY (parameter_id) REFERENCES >> parameter(id) ON UPDATE CASCADE ON DELETE CASCADE >> "datavalue_site_id_fkey" FOREIGN KEY (site_id) REFERENCES >> coastal.site(id) ON UPDATE CASCADE ON DELETE CASCADE >> "datavalue_statistic_type_id_fkey" >> >> >> I'm not great with the details but the short answer - aside from the > fact that you should consider increasing the statistics on these columns - > is that at a certain point querying the index and then subsequently > checking the table for visibility is more expensive than simply scanning > and then discarding the extra rows. > > The fact that you could perform an INDEX ONLY scan in the first query > makes that cost go away since no subsequent heap check is required. In the > parameters query the planner thinks it needs 1.5 million of the rows and > will have to check each of them for visibility. It decided that scanning > the entire table was more efficient. > > The LIMIT 1 in both queries should not be necessary. The planner is smart > enough to stop once it finds what it is looking for. In fact the LIMIT's > presence may be a contributing factor...but I cannot say for sure. > > A better query seems like it would be: > > WITH active_sites AS ( > SELECT DISTINCT site_id FROM datavalues; > ) > SELECT * > FROM sites > JOIN active_sites USING (site_id); > > David J. >