Sorry for another late reply, finally found some time to formulate a couple of
thoughts.

> On Thu, Feb 25, 2021 at 09:22:43AM +0100, Dmitry Dolgov wrote:
> > On Wed, Feb 24, 2021 at 01:45:10PM -0800, Andres Freund wrote:
> >
> > > I'm curious if it makes sense
> > > to explore possibility to have these sort of "backpressure", e.g. if
> > > number of inflight requests is too large calculate inflight_limit a bit
> > > lower than possible (to avoid hard performance deterioration when the db
> > > is trying to do too much IO, and rather do it smooth).
> >
> > What I do think is needed and feasible (there's a bunch of TODOs in the
> > code about it already) is to be better at only utilizing deeper queues
> > when lower queues don't suffice. So we e.g. don't read ahead more than a
> > few blocks for a scan where the query is spending most of the time
> > "elsewhere.
> >
> > There's definitely also some need for a bit better global, instead of
> > per-backend, control over the number of IOs in flight. That's not too
> > hard to implement - the hardest probably is to avoid it becoming a
> > scalability issue.
> >
> > I think the area with the most need for improvement is figuring out how
> > we determine the queue depths for different things using IO. Don't
> > really want to end up with 30 parameters influencing what queue depth to
> > use for (vacuum, index builds, sequential scans, index scans, bitmap
> > heap scans, ...) - but they benefit from a deeper queue will differ
> > between places.

Speaking of parameters: from what I understand, the actual number of queues
(e.g. io_uring instances) created is specified by PGAIO_NUM_CONTEXTS. Shouldn't
it be configurable? Maybe in fact there don't need to be that many knobs after
all, if the model assumes the storage has:

* Some number of hardware queues, then the number of queues the AIO
  implementation needs to use depends on it. For example, when lowering the
  number of contexts between different benchmark runs, I could see that some
  of the hardware queues were significantly underutilized. Potentially there
  could also be such a thing as too many contexts.

* A certain bandwidth, then the submit batch size (io_max_concurrency or
  PGAIO_SUBMIT_BATCH_SIZE) depends on it. This would make it possible to
  distinguish attached storage with high bandwidth and high latency from
  local storage.
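
To illustrate the bandwidth point with a back-of-the-envelope calculation:
Little's law says the number of IOs that must be in flight to saturate a device
is roughly throughput times latency. This is only a sketch. The function name
and the device figures below are made up for illustration; they are not
measurements and not values from the patch.

```python
import math

def ios_in_flight_to_saturate(bandwidth_bytes_per_s, latency_s, io_size_bytes=8192):
    """Little's law: concurrency = throughput (IOs/s) * latency (s)."""
    iops_at_full_bandwidth = bandwidth_bytes_per_s / io_size_bytes
    return math.ceil(iops_at_full_bandwidth * latency_s)

# Hypothetical devices, numbers chosen only to show the contrast.
local_nvme = ios_in_flight_to_saturate(2 * 1024**3, 100e-6)  # 2 GiB/s, 100 us
attached = ios_in_flight_to_saturate(1 * 1024**3, 2e-3)      # 1 GiB/s, 2 ms
print(local_nvme, attached)
```

Under these assumed numbers the attached storage needs about ten times more
concurrent IOs than the local device despite having half the bandwidth, which
is why a single fixed setting is unlikely to suit both.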

From what I see, max_aio_in_flight is used as the queue depth for contexts,
which is workload dependent and not easy to figure out, as you mentioned. To
avoid having 30 different parameters, maybe it's more feasible to introduce
"shallow" and "deep" queues, where the particular depth for each could be
derived from the depth of the hardware queues. The question of which activity
should use which queue is not easy, but if I get it right from queuing theory
(assuming IO producers are stationary processes and the storage has a fixed IO
latency), it depends on the IO arrival distribution in each particular case,
and this in turn could be roughly estimated for each type of activity. One can
expect different IO arrival distributions for e.g. a normal point-query backend
and a checkpoint or vacuum process, regardless of other conditions (collecting
those for a few benchmark runs indeed gives pretty distinct distributions).
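
A rough sketch of what such an estimate could look like, assuming a fixed IO
latency as above. Everything here is hypothetical: the function name, the
shallow depth of 4 and the burst headroom factor are my own placeholders, and
the headroom rule is a heuristic rather than something derived rigorously.

```python
import statistics

def suggest_queue(interarrival_s, io_latency_s, shallow_depth=4):
    """Pick "shallow" or "deep" from observed IO inter-arrival times.

    Offered load (Little's law) = arrival rate * latency, i.e. the mean
    number of IOs that would be in flight if nothing were throttled.
    """
    mean_gap = statistics.fmean(interarrival_s)
    offered_load = io_latency_s / mean_gap
    # Coefficient of variation of the gaps: 0 for a perfectly steady
    # stream, larger values for burstier arrivals.
    cv = statistics.pstdev(interarrival_s) / mean_gap
    # Assumed heuristic: leave headroom proportional to burstiness.
    needed = offered_load * (1 + cv)
    kind = "deep" if needed > shallow_depth else "shallow"
    return kind, offered_load, cv

# Made-up traces: a point-query backend issuing an IO every 10 ms and a
# checkpoint-like process issuing one every 10 us, both with 100 us latency.
point_query = suggest_queue([0.01] * 5, 100e-6)
checkpointer = suggest_queue([1e-5] * 10, 100e-6)
print(point_query[0], checkpointer[0])
```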

If I understand correctly, the contexts defined by PGAIO_NUM_CONTEXTS are the
main workhorse, right? I'm asking because there is also something called
local_ring, but it seems no IOs are submitted into those. Assuming that
contexts are the main way of submitting IO, it would also be interesting to
explore contexts isolated for different purposes. I haven't finished my changes
here yet, so I have no results to share, but at least some tests with fio show
different latencies when two io_urings are processing mixed reads/writes vs
isolated reads or writes. As a side note, at the end of the day there are so
many queues - application queue, io_uring, mq software queue, hardware queue -
that I'm really curious whether it would amplify tail latencies.
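
To put a number on the tail latency concern, here is a toy calculation. It
assumes each queue is independently "slow" for a request with the same
probability, which real queues certainly are not, so take it only as an
intuition for how chained stages compound. The function name is made up.

```python
def p_any_stage_slow(p_slow_per_stage, n_stages):
    """Probability that at least one of n independent stages is slow."""
    return 1.0 - (1.0 - p_slow_per_stage) ** n_stages

# Four queues: application queue, io_uring, mq software queue, hardware
# queue. Suppose a request exceeds the p99 latency of each with 1% chance.
p = p_any_stage_slow(0.01, 4)
print(f"{p:.4f}")
```

With these assumptions almost 4% of requests would hit the tail of at least one
queue, compared to 1% for a single queue.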

Another thing I've noticed is that the AIO implementation is much more
significantly affected by side IO activity than the synchronous one. E.g. tps
of the AIO version drops from tens of thousands to a couple of hundred just
because some kworker started to flush dirty buffers (especially with writeback
throttling disabled), while the synchronous version doesn't suffer that much.
Not sure what to make of it. Btw, overall I've managed to get better numbers
from the AIO implementation on IO-bound test cases with a local NVMe device,
but non-IO-bound ones were mostly a bit slower - is that expected, or am I
missing something?

An interesting thing to note is that the io_uring implementation apparently
relaxed the requirements for polling operations: now one needs only the
CAP_SYS_NICE capability, not CAP_SYS_ADMIN. I guess theoretically there are no
issues with using it within the current design?
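
For experimenting with this, a quick way to see which of the two capabilities a
process actually holds is to decode the CapEff mask from /proc/<pid>/status.
This is a minimal sketch for Linux only; the helper names are mine, and the bit
numbers are the ones from <linux/capability.h>.

```python
CAP_SYS_ADMIN = 21  # bit numbers as defined in <linux/capability.h>
CAP_SYS_NICE = 23

def has_capability(cap_eff_hex, cap):
    """Test one capability bit in a hex mask such as the CapEff field."""
    return bool(int(cap_eff_hex, 16) & (1 << cap))

def effective_caps(status_path="/proc/self/status"):
    """Return the CapEff hex mask of a process (Linux only)."""
    with open(status_path) as f:
        for line in f:
            if line.startswith("CapEff:"):
                return line.split()[1]
    raise RuntimeError("CapEff not found in " + status_path)

# Example usage on Linux:
#   mask = effective_caps()
#   print(has_capability(mask, CAP_SYS_NICE), has_capability(mask, CAP_SYS_ADMIN))
```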
