On Sun, Jan 19, 2025 at 5:51 AM Tomas Vondra <to...@vondra.me> wrote: > * Does it still make sense to default to eic=1? For this particular test > increasing eic=4 often cuts the duration in half (especially on nvme > storage).
Maybe it wasn't a bad choice for systems with one spinning disk, but obviously typical hardware has changed completely since then. Bruce even changed the docs to recommend "hundreds" on SSDs (46eafc88). We could definitely consider changing the value, but this particular thing is using the READ_STREAM_MAINTENANCE flag, so it uses maintenance_io_concurrency, and that one defaults to 10. That's also arbitrary and quite small, but I think it means we can avoid choosing new defaults for now :-) For the non-maintenance one, we might want to think about tweaking that in the context of bitmap heapscan? (Interestingly, MySQL seems to have a related setting defaulting to 10k, but that may be a system-wide setting, IDK. Our settings are quite low level, per "operation", or really per stream. In an off-list chat with Robert and Andres, we bounced around some new names, and the one I liked best was io_concurrency_per_stream. It would be accurate but bring a new implementation detail to the UX. I actually like that about it: it's like max_parallel_workers_per_gather, and also just the way work_mem is really work_mem_per_<something>: yes it is low level but is an honest expression of how (un)sophisticated our resource usage controls are today. Perhaps we'll eventually figure out how to balance all resources dynamically from global limits...) > * Why are we limiting ioc to <= 256kB? Per the benchmark it seems it > might be beneficial to set even higher values. That comes from: #define MAX_IO_COMBINE_LIMIT PG_IOV_MAX ... which comes from: /* Define a reasonable maximum that is safe to use on the stack. */ #define PG_IOV_MAX Min(IOV_MAX, 32) There are a few places that use either PG_IOV_MAX or MAX_IO_COMBINE_LIMIT to size a stack array, but those should be OK with a bigger number as the types are small: buffer, pointer, or struct iovec (16 bytes). For example, if it were 128 then we could do 1MB I/O, a nice round number, and arrays of those types would still only be 2kB or less. The other RDBMSes I googled seem to have max I/O sizes of around 512kB or 1MB, some tunable explicitly, some derived from other settings. I picked 128kB as the default combine limit because it comes up in lots of other places eg readahead size, and seemed to work pretty well, and is definitely supportable on all POSIX systems (see POSIX IOV_MAX, which is allowed to be as low as 16). I chose a maximum that was just a bit more, but not much more because I was also worried about how many places would eventually finish up wasting a lot of memory by having to multiply that by number-of-possible-I/Os-in-progress or other similar ideas, but I was expecting we'd want to increase it. read_stream.c doesn't do that sort of multiplication itself, but you can see a case like that in Andres's AIO patchset here: https://github.com/anarazel/postgres/blob/aio-2/src/backend/storage/aio/aio_init.c So for example if io_max_concurrency defaults to 1024; you'd get 2MB of iovecs in shared memory with PG_IOV_MAX of 128, instead of 512kB today. We can always discuss defaults separately but that case doesn't seem like a problem from here... Nice results, thanks!