On Mon, Aug 28, 2017 at 03:54:59PM -0700, Tejun Heo wrote:
> Hello, Shaohua.
>
> On Wed, Aug 23, 2017 at 11:15:15AM -0700, Shaohua Li wrote:
> > loop block device handles IO in a separate thread. The actual IO
> > dispatched isn't cloned from the IO loop device received, so the
> > dispatched IO loses the cgroup context.
>
> Ah, right. Thanks for spotting this.
>
> > I'm ignoring buffer IO case now, which is quite complicated. Making the
> > loop thread aware of cgroup context doesn't really help. The loop device
> > only writes to a single file. In current writeback cgroup
> > implementation, the file can only belong to one cgroup.
>
> Yeah, writeback tracks the most active cgroup and associates writeback
> ios with that cgroup. For buffered loop devices, I think it'd be fine
> to make the loop driver forward the cgroup association and let the
> writeback layer determine the matching association as usual.
Doing this means we must forward cgroup info to page, not bio. I need to check
if we can make the mechanism work for memcg.
> > For direct IO case, we could workaround the issue in theory. For
> > example, say we assign cgroup1 5M/s BW for loop device and cgroup2
> > 10M/s. We can create a special cgroup for loop thread and assign at
> > least 15M/s for the underlayer disk. In this way, we correctly throttle
> > the two cgroups. But this is tricky to setup.
> >
> > This patch tries to address the issue. When loop thread is handling IO,
> > it declares the IO is on behalf of the original task, then in block IO
> > we use the original task to find cgroup. The concept probably works for
> > other scenarios too, but right now I don't make it generic yet.
>
> The overall approach looks sound to me. Some comments below.
>
> > @@ -2058,7 +2058,10 @@ int bio_associate_current(struct bio *bio)
> >
> > get_io_context_active(ioc);
> > bio->bi_ioc = ioc;
> > - bio->bi_css = task_get_css(current, io_cgrp_id);
> > + if (current->cgroup_task)
> > + bio->bi_css = task_get_css(current->cgroup_task, io_cgrp_id);
> > + else
> > + bio->bi_css = task_get_css(current, io_cgrp_id);
>
> Do we need a full pointer field for this? I think we should be able
> to mark lo writers with a flag (PF or whatever) and then to
> kthread_data() to get the lo struct which contains the target css.
> Oh, let's do csses instead of tasks for consistency, correctness
> (please see below) and performance (csses are cheaper to pin).
Forwarding css sounds better.
> > @@ -502,11 +504,16 @@ static int lo_rw_aio(struct loop_device *lo, struct
> > loop_cmd *cmd,
> > cmd->iocb.ki_complete = lo_rw_aio_complete;
> > cmd->iocb.ki_flags = IOCB_DIRECT;
> >
> > + if (cmd->cgroup_task)
> > + current->cgroup_task = cmd->cgroup_task;
> > +
> > if (rw == WRITE)
> > ret = call_write_iter(file, &cmd->iocb, &iter);
> > else
> > ret = call_read_iter(file, &cmd->iocb, &iter);
> >
> > + current->cgroup_task = NULL;
> > +
> > if (ret != -EIOCBQUEUED)
> > cmd->iocb.ki_complete(&cmd->iocb, ret, 0);
> > return 0;
>
> Hmm... why do we need double forwarding (original issuer -> aio cmd ->
> ios) in loop? If we need this, doesn't this mean that we're missing
> ownership forwarding in aio paths and should make the forwarding
> happen there?
what do you mean double forwarding?
>
> > @@ -1705,6 +1712,11 @@ static blk_status_t loop_queue_rq(struct
> > blk_mq_hw_ctx *hctx,
> > break;
> > }
> >
> > + if (cmd->use_aio) {
> > + cmd->cgroup_task = current;
> > + get_task_struct(current);
> > + } else
> > + cmd->cgroup_task = NULL;
>
> What if %current isn't the original issuer of the io? It could be a
> writeback worker trying to flush to a loop device and we don't want to
> attribute those ios to the writeback worker. We should forward the
> bi_css not the current task.
Makes sense.
Thanks,
Shaohua