From: Tang Junhui <[email protected]>
Hello Mike:
For the second question, I thinks this modification is somewhat complex,
cannot we do something simple to resolve it? I remember there were some
patches trying to avoid too small writeback rate, Coly, is there any
progress now?
-------
Tang Junhui
> Ah-- re #1 -- I was investigating earlier why not as much was combined
> as I thought should be when idle. This is surely a factor. Thanks
> for the catch-- KEY_OFFSET is correct. I will fix and retest.
>
> (Under heavy load, the correct thing still happens, but not under
> light or intermediate load0.
>
> About #2-- I wanted to attain a bounded amount of "combining" of
> operations. If we have 5 4k extents in a row to dispatch, it seems
> really wasteful to issue them as 5 IOs 60ms apart, which the existing
> code would be willing to do-- I'd rather do a 20k write IO (basically
> the same cost as a 4k write IO) and then sleep 300ms. It is dependent
> on the elevator/IO scheduler merging the requests. At the same time,
> I'd rather not combine a really large request.
>
> It would be really neat to blk_plug the backing device during the
> write issuance, but that requires further work.
>
> Thanks
>
> Mike
>
> On Tue, Sep 26, 2017 at 11:51 PM, <[email protected]> wrote:
> > From: Tang Junhui <[email protected]>
> >
> > Hello Lyle:
> >
> > Two questions:
> > 1) In keys_contiguous(), you judge I/O contiguous in cache device, but not
> > in backing device. I think you should judge it by backing device (remove
> > PTR_CACHE() and use KEY_OFFSET() instead of PTR_OFFSET()?).
> >
> > 2) I did not see you combine samll contiguous I/Os to big I/O, so I think
> > it is useful when writeback rate was low by avoiding single I/O write, but
> > have no sense in high writeback rate, since previously it is also write
> > I/Os asynchronously.
> >
> > -----------
> > Tang Junhui
> >
> >> Previously, there was some logic that attempted to immediately issue
> >> writeback of backing-contiguous blocks when the writeback rate was
> >> fast.
> >>
> >> The previous logic did not have any limits on the aggregate size it
> >> would issue, nor the number of keys it would combine at once. It
> >> would also discard the chance to do a contiguous write when the
> >> writeback rate was low-- e.g. at "background" writeback of target
> >> rate = 8, it would not combine two adjacent 4k writes and would
> >> instead seek the disk twice.
> >>
> >> This patch imposes limits and explicitly understands the size of
> >> contiguous I/O during issue. It also will combine contiguous I/O
> >> in all circumstances, not just when writeback is requested to be
> >> relatively fast.
> >>
> >> It is a win on its own, but also lays the groundwork for skip writes to
> >> short keys to make the I/O more sequential/contiguous.
> >>
> >> Signed-off-by: Michael Lyle <[email protected]>
> >> ---
> >> drivers/md/bcache/bcache.h | 6 --
> >> drivers/md/bcache/writeback.c | 131
> >> ++++++++++++++++++++++++++++++------------
> >> 2 files changed, 93 insertions(+), 44 deletions(-)
> >>
> >> diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
> >> index eb83be693d60..da803a3b1981 100644
> >> --- a/drivers/md/bcache/bcache.h
> >> +++ b/drivers/md/bcache/bcache.h
> >> @@ -321,12 +321,6 @@ struct cached_dev {
> >> struct bch_ratelimit writeback_rate;
> >> struct delayed_work writeback_rate_update;
> >>
> >> - /*
> >> - * Internal to the writeback code, so read_dirty() can keep
> >> track of
> >> - * where it's at.
> >> - */
> >> - sector_t last_read;
> >> -
> >> /* Limit number of writeback bios in flight */
> >> struct semaphore in_flight;
> >> struct task_struct *writeback_thread;
> >> diff --git a/drivers/md/bcache/writeback.c b/drivers/md/bcache/writeback.c
> >> index 0b7c89813635..cf29c44605b9 100644
> >> --- a/drivers/md/bcache/writeback.c
> >> +++ b/drivers/md/bcache/writeback.c
> >> @@ -229,10 +229,26 @@ static void read_dirty_submit(struct closure *cl)
> >> continue_at(cl, write_dirty, io->dc->writeback_write_wq);
> >> }
> >>
> >> +static inline bool keys_contiguous(struct cached_dev *dc,
> >> + struct keybuf_key *first, struct keybuf_key
> >> *second)
> >> +{
> >> + if (PTR_CACHE(dc->disk.c, &second->key, 0)->bdev !=
> >> + PTR_CACHE(dc->disk.c,
> >> &first->key, 0)->bdev)
> >> + return false;
> >> +
> >> + if (PTR_OFFSET(&second->key, 0) !=
> >> + (PTR_OFFSET(&first->key, 0)
> >> + KEY_SIZE(&first->key)))
> >> + return false;
> >> +
> >> + return true;
> >> +}
> >> +
> >> static void read_dirty(struct cached_dev *dc)
> >> {
> >> unsigned delay = 0;
> >> - struct keybuf_key *w;
> >> + struct keybuf_key *next, *keys[5], *w;
> >> + size_t size;
> >> + int nk, i;
> >> struct dirty_io *io;
> >> struct closure cl;
> >>
> >> @@ -243,45 +259,84 @@ static void read_dirty(struct cached_dev *dc)
> >> * mempools.
> >> */
> >>
> >> - while (!kthread_should_stop()) {
> >> -
> >> - w = bch_keybuf_next(&dc->writeback_keys);
> >> - if (!w)
> >> - break;
> >> -
> >> - BUG_ON(ptr_stale(dc->disk.c, &w->key, 0));
> >> -
> >> - if (KEY_START(&w->key) != dc->last_read ||
> >> - jiffies_to_msecs(delay) > 50)
> >> - while
> >> (!kthread_should_stop() && delay)
> >> - delay =
> >> schedule_timeout_interruptible(delay);
> >> -
> >> - dc->last_read =
> >> KEY_OFFSET(&w->key);
> >> -
> >> - io = kzalloc(sizeof(struct dirty_io) +
> >> sizeof(struct bio_vec)
> >> - *
> >> DIV_ROUND_UP(KEY_SIZE(&w->key), PAGE_SECTORS),
> >> - GFP_KERNEL);
> >> - if (!io)
> >> - goto err;
> >> -
> >> - w->private = io;
> >> - io->dc = dc;
> >> -
> >> - dirty_init(w);
> >> - bio_set_op_attrs(&io->bio, REQ_OP_READ, 0);
> >> - io->bio.bi_iter.bi_sector =
> >> PTR_OFFSET(&w->key, 0);
> >> - bio_set_dev(&io->bio, PTR_CACHE(dc->disk.c,
> >> &w->key, 0)->bdev);
> >> - io->bio.bi_end_io =
> >> read_dirty_endio;
> >> -
> >> - if (bio_alloc_pages(&io->bio, GFP_KERNEL))
> >> - goto err_free;
> >> -
> >> - trace_bcache_writeback(&w->key);
> >> + next = bch_keybuf_next(&dc->writeback_keys);
> >> +
> >> + while (!kthread_should_stop() && next) {
> >> + size = 0;
> >> + nk = 0;
> >> +
> >> + do {
> >> +
> >> BUG_ON(ptr_stale(dc->disk.c, &next->key, 0));
> >> +
> >> + /* Don't combine too many
> >> operations, even if they
> >> + * are all small.
> >> + */
> >> + if (nk >= 5)
> >> + break;
> >> +
> >> + /* If the current operation
> >> is very large, don't
> >> + * further combine
> >> operations.
> >> + */
> >> + if (size > 5000)
> >> + break;
> >> +
> >> + /* Operations are only
> >> eligible to be combined
> >> + * if they are contiguous.
> >> + *
> >> + * TODO: add a heuristic
> >> willing to fire a
> >> + * certain amount of
> >> non-contiguous IO per pass,
> >> + * so that we can benefit
> >> from backing device
> >> + * command queueing.
> >> + */
> >> + if (nk != 0 &&
> >> !keys_contiguous(dc, keys[nk-1], next))
> >> + break;
> >> +
> >> + size +=
> >> KEY_SIZE(&next->key);
> >> + keys[nk++] = next;
> >> + } while ((next =
> >> bch_keybuf_next(&dc->writeback_keys)));
> >> +
> >> + /* Now we have gathered a set of 1..5 keys
> >> to write back. */
> >> +
> >> + for (i = 0; i < nk; i++) {
> >> + w = keys[i];
> >> +
> >> + io = kzalloc(sizeof(struct
> >> dirty_io) +
> >> +
> >> sizeof(struct bio_vec) *
> >> +
> >> DIV_ROUND_UP(KEY_SIZE(&w->key), PAGE_SECTORS),
> >> +
> >> GFP_KERNEL);
> >> + if (!io)
> >> + goto err;
> >> +
> >> + w->private =
> >> io;
> >> + io->dc
> >> = dc;
> >> +
> >> + dirty_init(w);
> >> + bio_set_op_attrs(&io->bio,
> >> REQ_OP_READ, 0);
> >> + io->bio.bi_iter.bi_sector =
> >> PTR_OFFSET(&w->key, 0);
> >> + bio_set_dev(&io->bio,
> >> +
> >> PTR_CACHE(dc->disk.c, &w->key, 0)->bdev);
> >> + io->bio.bi_end_io
> >> = read_dirty_endio;
> >> +
> >> + if
> >> (bio_alloc_pages(&io->bio, GFP_KERNEL))
> >> + goto
> >> err_free;
> >> +
> >> +
> >> trace_bcache_writeback(&w->key);
> >> +
> >> + down(&dc->in_flight);
> >> +
> >> + /* We've acquired a
> >> semaphore for the maximum
> >> + * simultaneous number of
> >> writebacks; from here
> >> + * everything happens
> >> asynchronously.
> >> + */
> >> + closure_call(&io->cl,
> >> read_dirty_submit, NULL, &cl);
> >> + }
> >>
> >> - down(&dc->in_flight);
> >> - closure_call(&io->cl, read_dirty_submit,
> >> NULL, &cl);
> >> + delay = writeback_delay(dc, size);
> >>
> >> - delay = writeback_delay(dc,
> >> KEY_SIZE(&w->key));
> >> + while (!kthread_should_stop() && delay) {
> >> +
> >> schedule_timeout_interruptible(delay);
> >> + delay = writeback_delay(dc,
> >> 0);
> >> + }
> >> }
> >>
> >> if (0) {
> >> --
> > --