On Tue, 2005-11-15 at 13:12, Bill McGonigle wrote:

> I have a query that's making the planner do the wrong thing (for my
> definition of wrong) and I'm looking for advice on what to tune to make
> it do what I want.
>
> The query consists of SELECT'ing a few fields from a table for a large
> number of rows. The table has about seventy thousand rows and the user
> is selecting some subset of them. I first do a SELECT...WHERE to
> determine the unique identifiers I want (works fine) and then I do a
> SELECT WHERE IN giving the list of id's I need additional data on
> (which I see from EXPLAIN just gets translated into a very long list of
> OR's).
>
> Everything works perfectly until I get to 65301 rows. At 65300 rows,
> it does an index scan and takes 2197.193 ms. At 65301 rows it switches
> to a sequential scan and takes 778951.556 ms. Values known not to
> affect this are: work_mem, effective_cache_size. Setting
> random_page_cost from 4 to 1 helps (79543.214 ms) but I'm not really
> sure what '1' means, except it's relative. Of course, setting
> 'enable_seqscan false' helps immensely (2337.289 ms) but that's as
> inelegant a solution as I've found - if there were other databases
> on this install that wouldn't be the right approach.
>
> Now I can break this down into multiple SELECT's in code, capping each
> query at 65300 rows, and that's a usable workaround, but academically
> I'd like to know how to convince the planner to do it my way. It's
> making a bad guess about something but I'm not sure what. I didn't see
> any hard-coded limits grepping through the source (though it is close
> to the 16-bit unsigned boundary - probably coincidental) so if anyone
> has ideas or pointers to how I might figure out what's going wrong that
> would be helpful.
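For reference, the two-step pattern described above looks roughly like this (the "items" table and "item_id" column are hypothetical stand-ins for the poster's schema):

```sql
-- Step 1: find the ids matching some user-chosen filter.
SELECT item_id FROM items WHERE category = 'widgets';

-- Step 2: fetch additional data for those ids; EXPLAIN shows the
-- planner expanding the IN list into a long chain of OR conditions.
SELECT item_id, name, price
FROM items
WHERE item_id IN (101, 102, 103 /* ... up to ~65300 ids ... */);
```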
OK, there IS a point at which switching to a sequential scan will be faster - i.e. when you're getting everything (or nearly everything) in the table. But the database is picking a switchover point that is too low.

First, we need to know whether the statistics are giving the query planner a good enough idea of how many rows it's really going to get versus how many it expects. Do an EXPLAIN <your query here> and see how many rows it thinks it's going to get. Since you've actually run it, you know how many it really returns, so there's no need for an EXPLAIN ANALYZE <your query here> just yet.

Now, as long as the estimate is pretty close, fine. But if it's off by factors, then we need to increase the statistics target on that column, with:

ALTER TABLE name ALTER columnname SET STATISTICS xxx;

where xxx is the new target. The default (default_statistics_target) is set in your postgresql.conf file, and is usually pretty low, say 10. You can go up to 1000, but higher targets make query planning take longer. Try some incremental increase to say 20 or 40 or even 100, run ANALYZE on that table, then do the EXPLAIN again, until the estimate is close.

Once the estimate is close, change random_page_cost to get the query planner to switch at the "right" time. Vary the number of values in the IN() list, play with random_page_cost, and see where that sweet spot is. Note that what seems right on a single table for a single user may not be best as you increase load or access other tables.

random_page_cost represents the cost of a random page fetch relative to a sequential one. As long as your data fit into RAM, the difference is pretty much none (i.e. random_page_cost = 1), so don't set it too low, or accessing REALLY large data sets could become REALLY slow, because the planner will use indexes when it should have been sequentially scanning.

Also, check what you've got effective_cache_size set to.
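A sketch of that tuning loop, again using the hypothetical "items" / "item_id" names (the statistics target and cost values are just starting points to experiment with):

```sql
-- Check the planner's row estimate against the known real row count.
EXPLAIN SELECT * FROM items WHERE item_id IN (101, 102, 103);

-- If the estimate is off by factors, raise the per-column statistics
-- target and re-gather statistics:
ALTER TABLE items ALTER item_id SET STATISTICS 100;
ANALYZE items;

-- Re-check until the estimated row count is close to reality:
EXPLAIN SELECT * FROM items WHERE item_id IN (101, 102, 103);

-- Once estimates are close, experiment with the switchover point.
-- SET is session-local, so other databases on the same install
-- are unaffected, unlike editing postgresql.conf:
SET random_page_cost = 2;
EXPLAIN ANALYZE SELECT * FROM items WHERE item_id IN (101, 102, 103);
```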
This tells PostgreSQL how much memory your kernel is using for disk cache, and so lets it estimate how likely the data your query touches is to already be in memory.

Also, read this: http://www.varlena.com/varlena/GeneralBits/Tidbits/perf.html

---------------------------(end of broadcast)---------------------------
TIP 6: explain analyze is your friend