Hello, Shaohua.
On Wed, Aug 23, 2017 at 11:15:15AM -0700, Shaohua Li wrote:
> loop block device handles IO in a separate thread. The actual IO
> dispatched isn't cloned from the IO loop device received, so the
> dispatched IO loses the cgroup context.
Ah, right. Thanks for spotting this.
> I'm ignoring buffer IO case now, which is quite complicated. Making the
> loop thread aware of cgroup context doesn't really help. The loop device
> only writes to a single file. In current writeback cgroup
> implementation, the file can only belong to one cgroup.
Yeah, writeback tracks the most active cgroup and associates writeback
ios with that cgroup. For buffered loop devices, I think it'd be fine
to make the loop driver forward the cgroup association and let the
writeback layer determine the matching association as usual.
> For direct IO case, we could workaround the issue in theory. For
> example, say we assign cgroup1 5M/s BW for loop device and cgroup2
> 10M/s. We can create a special cgroup for loop thread and assign at
> least 15M/s for the underlayer disk. In this way, we correctly throttle
> the two cgroups. But this is tricky to setup.
>
> This patch tries to address the issue. When loop thread is handling IO,
> it declares the IO is on behalf of the original task, then in block IO
> we use the original task to find cgroup. The concept probably works for
> other scenarios too, but right now I don't make it generic yet.
The overall approach looks sound to me. Some comments below.
> @@ -2058,7 +2058,10 @@ int bio_associate_current(struct bio *bio)
>
> get_io_context_active(ioc);
> bio->bi_ioc = ioc;
> - bio->bi_css = task_get_css(current, io_cgrp_id);
> + if (current->cgroup_task)
> + bio->bi_css = task_get_css(current->cgroup_task, io_cgrp_id);
> + else
> + bio->bi_css = task_get_css(current, io_cgrp_id);
Do we need a full pointer field for this? I think we should be able
to mark lo writers with a flag (PF or whatever) and then to
kthread_data() to get the lo struct which contains the target css.
Oh, let's do csses instead of tasks for consistency, correctness
(please see below) and performance (csses are cheaper to pin).
> @@ -502,11 +504,16 @@ static int lo_rw_aio(struct loop_device *lo, struct
> loop_cmd *cmd,
> cmd->iocb.ki_complete = lo_rw_aio_complete;
> cmd->iocb.ki_flags = IOCB_DIRECT;
>
> + if (cmd->cgroup_task)
> + current->cgroup_task = cmd->cgroup_task;
> +
> if (rw == WRITE)
> ret = call_write_iter(file, &cmd->iocb, &iter);
> else
> ret = call_read_iter(file, &cmd->iocb, &iter);
>
> + current->cgroup_task = NULL;
> +
> if (ret != -EIOCBQUEUED)
> cmd->iocb.ki_complete(&cmd->iocb, ret, 0);
> return 0;
Hmm... why do we need double forwarding (original issuer -> aio cmd ->
ios) in loop? If we need this, doesn't this mean that we're missing
ownership forwarding in aio paths and should make the forwarding
happen there?
> @@ -1705,6 +1712,11 @@ static blk_status_t loop_queue_rq(struct blk_mq_hw_ctx
> *hctx,
> break;
> }
>
> + if (cmd->use_aio) {
> + cmd->cgroup_task = current;
> + get_task_struct(current);
> + } else
> + cmd->cgroup_task = NULL;
What if %current isn't the original issuer of the io? It could be a
writeback worker trying to flush to a loop device and we don't want to
attribute those ios to the writeback worker. We should forward the
bi_css not the current task.
Thanks!
--
tejun