Hi Tomas,

I couldn't agree more. Many default values are simply too conservative, and the documentation doesn't provide best practices. I think we should either reduce the default to something like 1.x, or add a tip to the documentation with recommended values for different kinds of SSDs.
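For example, such a documentation tip might look something like the sketch below; the 1.1 value and the "fast_ssd" tablespace name are only placeholders for illustration, not measured recommendations:

    -- Cluster-wide, e.g. when the whole instance sits on fast SSD storage
    -- (1.1 is a placeholder; derive the actual value from your own benchmarks):
    ALTER SYSTEM SET random_page_cost = 1.1;
    SELECT pg_reload_conf();

    -- Or only for a tablespace that lives on fast storage:
    ALTER TABLESPACE fast_ssd SET (random_page_cost = 1.1);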
On Mon, 6 Oct 2025 at 08:59, Tomas Vondra <[email protected]> wrote:
> Hi,
>
> I wonder if it's time to consider updating the random_page_cost default value. There are multiple reasons to maybe do that.
>
> The GUC (and the 4.0 default) was introduced in ~2000 [1], so ~25 years ago. During that time the world went from rotational drives through multiple generations of flash / network-attached storage. I find it hard to believe those changes wouldn't affect random_page_cost.
>
> And indeed, it's common advice to reduce the GUC closer to 1.0 on SSDs. I myself recommended doing that in the past, but over time I got somewhat skeptical about it. The advice is based on the "obvious" wisdom that SSDs are much better at handling random I/O than rotational disks.
>
> But this has two flaws. First, it assumes the current 4.0 default makes sense - maybe it doesn't, and then it's useless as a "starting point". Second, it's not obvious how much better SSDs are without concurrent I/Os (which are needed to fully leverage the SSDs). We don't issue concurrent I/O for index scans (yet), and even if we did, the cost model has no concept of such concurrency.
>
> Recently, I've been doing some experiments evaluating how often we pick an optimal scan for simple select queries, assuming accurate estimates. Turns out we pick the wrong plan fairly often, even with almost perfect estimates. I somewhat expected that, with the default random_page_cost value. What did surprise me was that to improve the plans, I had to *increase* the value, even on really new/fast SSDs ...
>
> I started looking at how we calculated the 4.0 default back in 2000. Unfortunately, there's not a lot of info, as Tom pointed out in 2024 [2]. But he outlined how the experiment worked:
>
> - generate a large table (much bigger than RAM)
> - measure the runtime of a seq scan
> - measure the runtime of a full-table index scan
> - calculate how much more expensive a random page access is
>
> So I decided to try doing this on a couple of different devices, and see what random_page_cost values that gives me. Attached is a script doing such a benchmark. It:
>
> (1) initializes a new cluster, with a couple of parameters adjusted
>
> (2) creates a random table (with uniform distribution)
>
> (3) runs a sequential scan
>
>     SELECT * FROM (SELECT * FROM test_table) OFFSET $nrows;
>
> (4) runs an index scan
>
>     SELECT * FROM (SELECT * FROM test_table ORDER BY id) OFFSET $nrows;
>
> The script does a couple of things to force the query plan, and it reports the timings of the two queries at the end.
>
> I've been running this on machines with 64GB of RAM, so I chose the table to have 500M rows. With fillfactor=20 that means a ~182GB table (23809524 pages).
>
> Let's say that
>
>     T(seq)   = timing of the seqscan query
>     T(idx)   = timing of the index scan query
>     IOS(seq) = number of sequential page reads
>     IOS(idx) = number of random page reads
>     P        = number of pages
>     N        = number of rows
>
> Then the time to read a sequential page (because IOS(seq) == P) is
>
>     PT(seq) = T(seq) / IOS(seq) = T(seq) / P
>
> and the time to read a random page (assuming the table is perfectly random, with no cache hits) is
>
>     PT(idx) = T(idx) / IOS(idx) = T(idx) / N
>
> which gives us the "ideal" random page cost (as a value relative to reading a page sequentially):
>
>     random_page_cost = PT(idx) / PT(seq)
>
>                          T(idx) * P
>                        = ----------
>                          T(seq) * N
>
> The "no cache hits" part is not quite correct - with 182GB there's about a 30% chance of a page being in memory.
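(For anyone wanting to repeat the two measurement queries above by hand: I assume the attached script forces the two plans with something like the enable_* GUCs, roughly as in the sketch below. It may well do it differently, and it also adjusts other parameters.)

    -- Rough sketch in psql; $nrows is 500000000 for the ~182GB table.
    \timing on

    SET enable_indexscan = off;        -- measure the sequential scan
    SET enable_bitmapscan = off;
    SELECT * FROM (SELECT * FROM test_table) OFFSET 500000000;

    SET enable_indexscan = on;         -- measure the full-table index scan
    SET enable_seqscan = off;
    SET enable_sort = off;             -- make the ORDER BY use the index
    SELECT * FROM (SELECT * FROM test_table ORDER BY id) OFFSET 500000000;

    RESET ALL;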
> I didn't think of using debug_io_direct at the time, but it doesn't affect the conclusion very much (in fact, it would *increase* the random_page_cost value, which makes it worse).
>
> I did this on the three SSDs I use for testing - 4x NVMe RAID0, 4x SATA RAID0, and a single NVMe drive. Here are the results:
>
>                   seqscan (s)   index scan (s)   random_page_cost
>     --------------------------------------------------------------
>     NVMe                   98            42232               20.4
>     NVMe/RAID0             24            25462               49.3
>     SATA/RAID0            109            48141               21.0
>
> These are reasonably good SSDs, and yet the "correct" random_page_cost comes out at about 5-10x our default.
>
> FWIW I double-checked that the test is actually I/O bound. The CPU usage never goes over ~50% (not even in the seqscan case).
>
> These calculated values also align with the "optimal" plan choice, i.e. the plans flip much closer to the actual crossing point (compared to where they'd flip with 4.0).
>
> This obviously contradicts the advice to set the value closer to 1.0. But why is that? SSDs are certainly better at random I/O, even if the I/O is not concurrent and the SSD is not fully utilized. So the 4.0 seems off - if anything, the value for rotational disks should be even higher than what we got for the SSDs ...
>
> I don't have any rotational devices in my test machines anymore, but I got an Azure VM with a local "SCSI" disk and a "standard HDD" volume, and I got this (with 10M rows, ~3.7GB, and direct I/O):
>
>                   seqscan (s)   index scan (s)   random_page_cost
>     --------------------------------------------------------------
>     SCSI                  2.1             1292               28.5
>     standard HDD        209.8            27586               62.6
>
> I suspect the SCSI disk is not actually rotational - it seems more like an SSD behind a SCSI interface, or some such model. The "standard HDD" seems much closer to rotational, with ~370 IOPS (it'd take ages to do the index scan on more than 10M rows).
>
> Unless I made some silly mistake, these results suggest the current 4.0 value is a bit too low, and something like ~20 would be better even on SSDs. This is not the first time it was suggested that a higher default might be better - see this 2008 post [3]. Of course, that's from before SSDs became a thing; it's about the evolution of hard disks and our code.
>
> However, it also says this:
>
>     Yeah, it seems like raising random_page_cost is not something we
>     ever recommend in practice.  I suspect what we'd really need here to
>     make any progress is a more detailed cost model, not just fooling
>     with the parameters of the existing one.
>
> I don't quite follow the reasoning. If increasing the value would require making the cost model more detailed, why wouldn't the same apply to lowering it? I don't see a reason for the asymmetry.
>
> Also, I intentionally used a table with "perfectly random" data, because that's about the simplest thing to estimate, and it indeed makes all the estimates almost perfect (including the internal ones in cost_index). If we can't cost such simple cases correctly, what's the point of costing?
>
> From a robustness point of view, wouldn't it be better to err on the side of a higher random_page_cost value? That'd mean we flip to "more-sequential" scans sooner, with much "flatter" behavior. That doesn't need to be a seqscan (which is about as flat as it gets), but e.g. a bitmap scan - which probably silently "fixes" many cases where the index scan gets costed too low.
>
> It also says this:
>
>     And the value of 4 seems to work well in practice.
>
> I wonder how we know that?
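As a quick sanity check of the arithmetic, plugging the NVMe row from the first table into the formula earlier in the mail gives essentially the value reported there (the small difference is just rounding of the timings):

    -- random_page_cost = (T(idx) * P) / (T(seq) * N), using the NVMe row:
    -- T(idx) = 42232 s, T(seq) = 98 s, P = 23809524 pages, N = 500M rows
    SELECT round((42232.0 * 23809524) / (98.0 * 500000000), 1) AS random_page_cost;
    -- => 20.5 (vs. the ~20.4 in the table)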
> Most users don't experiment with different values very much. They just run with the default, or maybe even lower it, based on some recommendation. But they don't run the same query with different values, so they can't spot the differences unless they hit a particularly bad plan.
>
> Of course, it's also true that most workloads tend to access well-cached data, which makes the errors much cheaper. Or maybe queries with the "problematic selectivities" are just not that common. Still, even if it doesn't change the scan choice, it seems important to keep the cost somewhat closer to reality, because of the plan nodes above ...
>
> It seems to me the current default is a bit too low, but changing a GUC this important is not trivial. So what should we do about it?
>
> regards
>
>
> [1] https://www.postgresql.org/message-id/flat/[email protected]
>
> [2] https://www.postgresql.org/message-id/3866858.1728961439%40sss.pgh.pa.us
>
> [3] https://www.postgresql.org/message-id/23625.1223642230%40sss.pgh.pa.us
>
> --
> Tomas Vondra
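One low-effort way to see the effect Tomas describes - most users never compare the same query under different values - is to flip the GUC per session and look at the plans. A minimal sketch (the predicate and the 20.0 value are only illustrative, not taken from the benchmark):

    SET random_page_cost = 4.0;
    EXPLAIN SELECT * FROM test_table WHERE id < 1000000;

    SET random_page_cost = 20.0;
    EXPLAIN SELECT * FROM test_table WHERE id < 1000000;

    RESET random_page_cost;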
