Hi Tomas,

I couldn't agree more. Many default values are simply too conservative, and the documentation doesn't provide best practices. I think we should either reduce the default to something like 1.x, or add a tip to the documentation with recommended values for different kinds of SSDs.
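For example, such a documentation tip might look something like the sketch below; the 1.1 value and the "fast_ssd" tablespace name are only placeholders for illustration, not measured recommendations:

    -- Cluster-wide, e.g. when the whole instance sits on fast SSD storage
    -- (1.1 is a placeholder; derive the actual value from your own benchmarks):
    ALTER SYSTEM SET random_page_cost = 1.1;
    SELECT pg_reload_conf();

    -- Or only for a tablespace that lives on fast storage:
    ALTER TABLESPACE fast_ssd SET (random_page_cost = 1.1);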
On Mon, 6 Oct 2025 at 08:59, Tomas Vondra <[email protected]> wrote:
> Hi,
>
> I wonder if it's time to consider updating the random_page_cost default value. There are multiple reasons to maybe do that.
>
> The GUC (and the 4.0 default) was introduced in ~2000 [1], so ~25 years ago. During that time the world went from rotational drives through multiple generations of flash / network-attached storage. I find it hard to believe those changes wouldn't affect random_page_cost.
>
> And indeed, it's common advice to reduce the GUC closer to 1.0 on SSDs. I myself recommended doing that in the past, but over time I got somewhat skeptical about it. The advice is based on the "obvious" wisdom that SSDs are much better at handling random I/O than rotational disks.
>
> But this has two flaws. First, it assumes the current 4.0 default makes sense - maybe it doesn't, and then it's useless as a "starting point". Second, it's not obvious how much better SSDs are without concurrent I/Os (which are needed to fully leverage the SSDs). We don't issue concurrent I/O for index scans (yet), and even if we did, the cost model has no concept of such concurrency.
>
> Recently, I've been doing some experiments evaluating how often we pick an optimal scan for simple select queries, assuming accurate estimates. Turns out we pick the wrong plan fairly often, even with almost perfect estimates. I somewhat expected that, with the default random_page_cost value. What did surprise me was that to improve the plans, I had to *increase* the value, even on really new/fast SSDs ...
>
> I started looking at how we calculated the 4.0 default back in 2000. Unfortunately, there's not a lot of info, as Tom pointed out in 2024 [2]. But he outlined how the experiment worked:
>
> - generate a large table (much bigger than RAM)
> - measure the runtime of a seq scan
> - measure the runtime of a full-table index scan
> - calculate how much more expensive a random page access is
>
> So I decided to try doing this on a couple of different devices, and see what random_page_cost values that gives me. Attached is a script doing such a benchmark. It:
>
> (1) initializes a new cluster, with a couple of parameters adjusted
>
> (2) creates a random table (with uniform distribution)
>
> (3) runs a sequential scan
>
>     SELECT * FROM (SELECT * FROM test_table) OFFSET $nrows;
>
> (4) runs an index scan
>
>     SELECT * FROM (SELECT * FROM test_table ORDER BY id) OFFSET $nrows;
>
> The script does a couple of things to force the query plan, and it reports the timings of the two queries at the end.
>
> I've been running this on machines with 64GB of RAM, so I chose the table to have 500M rows. With fillfactor=20 that means a ~182GB table (23809524 pages).
>
> Let's say that
>
>     T(seq)   = timing of the seqscan query
>     T(idx)   = timing of the index scan query
>     IOS(seq) = number of sequential page reads
>     IOS(idx) = number of random page reads
>     P        = number of pages
>     N        = number of rows
>
> Then the time to read a sequential page (because IOS(seq) == P) is
>
>     PT(seq) = T(seq) / IOS(seq) = T(seq) / P
>
> and the time to read a random page (assuming the table is perfectly random, with no cache hits) is
>
>     PT(idx) = T(idx) / IOS(idx) = T(idx) / N
>
> which gives us the "ideal" random page cost (as a value relative to reading a page sequentially):
>
>     random_page_cost = PT(idx) / PT(seq)
>
>                          T(idx) * P
>                        = ----------
>                          T(seq) * N
>
> The "no cache hits" part is not quite correct - with 182GB there's about a 30% chance of a page being in memory.
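(For anyone wanting to repeat the two measurement queries above by hand: I assume the attached script forces the two plans with something like the enable_* GUCs, roughly as in the sketch below. It may well do it differently, and it also adjusts other parameters.)

    -- Rough sketch in psql; $nrows is 500000000 for the ~182GB table.
    \timing on

    SET enable_indexscan = off;        -- measure the sequential scan
    SET enable_bitmapscan = off;
    SELECT * FROM (SELECT * FROM test_table) OFFSET 500000000;

    SET enable_indexscan = on;         -- measure the full-table index scan
    SET enable_seqscan = off;
    SET enable_sort = off;             -- make the ORDER BY use the index
    SELECT * FROM (SELECT * FROM test_table ORDER BY id) OFFSET 500000000;

    RESET ALL;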
> I didn't think of using debug_io_direct at the time, but it doesn't affect the conclusion very much (in fact, it would *increase* the random_page_cost value, which makes it worse).
>
> I did this on the three SSDs I use for testing - 4x NVMe RAID0, 4x SATA RAID0, and a single NVMe drive. Here are the results:
>
>                   seqscan (s)   index scan (s)   random_page_cost
>     --------------------------------------------------------------
>     NVMe                   98            42232               20.4
>     NVMe/RAID0             24            25462               49.3
>     SATA/RAID0            109            48141               21.0
>
> These are reasonably good SSDs, and yet the "correct" random_page_cost comes out at about 5-10x our default.
>
> FWIW I double-checked that the test is actually I/O bound. The CPU usage never goes over ~50% (not even in the seqscan case).
>
> These calculated values also align with the "optimal" plan choice, i.e. the plans flip much closer to the actual crossing point (compared to where they'd flip with 4.0).
>
> This obviously contradicts the advice to set the value closer to 1.0. But why is that? SSDs are certainly better at random I/O, even if the I/O is not concurrent and the SSD is not fully utilized. So the 4.0 seems off - if anything, the value for rotational disks should be even higher than what we got for the SSDs ...
>
> I don't have any rotational devices in my test machines anymore, but I got an Azure VM with a local "SCSI" disk and a "standard HDD" volume, and I got this (with 10M rows, ~3.7GB, and direct I/O):
>
>                   seqscan (s)   index scan (s)   random_page_cost
>     --------------------------------------------------------------
>     SCSI                  2.1             1292               28.5
>     standard HDD        209.8            27586               62.6
>
> I suspect the SCSI disk is not actually rotational - it seems more like an SSD behind a SCSI interface, or some such model. The "standard HDD" seems much closer to rotational, with ~370 IOPS (it'd take ages to do the index scan on more than 10M rows).
>
> Unless I made some silly mistake, these results suggest the current 4.0 value is a bit too low, and something like ~20 would be better even on SSDs. This is not the first time it was suggested that a higher default might be better - see this 2008 post [3]. Of course, that's from before SSDs became a thing; it's about the evolution of hard disks and our code.
>
> However, it also says this:
>
>     Yeah, it seems like raising random_page_cost is not something we
>     ever recommend in practice.  I suspect what we'd really need here to
>     make any progress is a more detailed cost model, not just fooling
>     with the parameters of the existing one.
>
> I don't quite follow the reasoning. If increasing the value would require making the cost model more detailed, why wouldn't the same apply to lowering it? I don't see a reason for the asymmetry.
>
> Also, I intentionally used a table with "perfectly random" data, because that's about the simplest thing to estimate, and it indeed makes all the estimates almost perfect (including the internal ones in cost_index). If we can't cost such simple cases correctly, what's the point of costing?
>
> From a robustness point of view, wouldn't it be better to err on the side of a higher random_page_cost value? That'd mean we flip to "more-sequential" scans sooner, with much "flatter" behavior. That doesn't need to be a seqscan (which is about as flat as it gets), but e.g. a bitmap scan - which probably silently "fixes" many cases where the index scan gets costed too low.
>
> It also says this:
>
>     And the value of 4 seems to work well in practice.
>
> I wonder how we know that?
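As a quick sanity check of the arithmetic, plugging the NVMe row from the first table into the formula earlier in the mail gives essentially the value reported there (the small difference is just rounding of the timings):

    -- random_page_cost = (T(idx) * P) / (T(seq) * N), using the NVMe row:
    -- T(idx) = 42232 s, T(seq) = 98 s, P = 23809524 pages, N = 500M rows
    SELECT round((42232.0 * 23809524) / (98.0 * 500000000), 1) AS random_page_cost;
    -- => 20.5 (vs. the ~20.4 in the table)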
> Most users don't experiment with different values very much. They just run with the default, or maybe even lower it, based on some recommendation. But they don't run the same query with different values, so they can't spot the differences unless they hit a particularly bad plan.
>
> Of course, it's also true that most workloads tend to access well-cached data, which makes the errors much cheaper. Or maybe queries with the "problematic selectivities" are just not that common. Still, even if it doesn't change the scan choice, it seems important to keep the cost somewhat closer to reality, because of the plan nodes above ...
>
> It seems to me the current default is a bit too low, but changing a GUC this important is not trivial. So what should we do about it?
>
> regards
>
>
> [1] https://www.postgresql.org/message-id/flat/[email protected]
>
> [2] https://www.postgresql.org/message-id/3866858.1728961439%40sss.pgh.pa.us
>
> [3] https://www.postgresql.org/message-id/23625.1223642230%40sss.pgh.pa.us
>
> --
> Tomas Vondra
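One low-effort way to see the effect Tomas describes - most users never compare the same query under different values - is to flip the GUC per session and look at the plans. A minimal sketch (the predicate and the 20.0 value are only illustrative, not taken from the benchmark):

    SET random_page_cost = 4.0;
    EXPLAIN SELECT * FROM test_table WHERE id < 1000000;

    SET random_page_cost = 20.0;
    EXPLAIN SELECT * FROM test_table WHERE id < 1000000;

    RESET random_page_cost;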
