On Tue, Mar 19 2024 at 11:41P -0400,
Martin Wilck <[email protected]> wrote:
> Hello Ming,
>
> On Tue, 2024-03-19 at 21:04 +0800, Ming Lei wrote:
> > Hello Martin,
> >
> > On Sat, Mar 16, 2024 at 12:10:35AM +0100, Martin Wilck wrote:
> > > In a recent kernel dump analysis, we found that the kernel crashed
> > > because
> > > dm_rq_target_io tio->ti was pointing to invalid memory in
> > > dm_end_request(),
> > > in a situation where multipathd was doing map reloads because of a
> > > storage
> > > failover. The map of the respective mapped_device had been replaced
> > > by a
> > > different struct dm_table.
> > >
> > > We obverved this with a 5.3.18 distro kernel, but the code in
> > > question
> > > hasn't change much since then. Basically, we were only missing
> > > b4459b11e840 ("dm rq: don't queue request to blk-mq during DM
> > > suspend"),
> > > which doesn't guarantee that the race I'm thinking of (see below)
> > > can't
> > > happen.
> > >
> > > When a map is resumed after a table reload, the live table is
> > > swapped, and
> > > the tio->ti member of any live request becomes stale.
> > > __dm_resume() avoids
> > > this by quiescing the queue and calling dm_wait_for_completion(),
> > > which
> > > waits until blk_mq_queue_inflight() doesn't report any in-flight
> > > requests.
> > >
> > > However, blk_mq_queue_inflight() counts only "started" requests.
> > > So, if a
> > > request is dispatched before the queue was quiesced, but
> > > dm_wait_for_completion() doesn't observe MQ_RQ_IN_FLIGHT for this
> > > request
> > > because of memory ordering effects, __dm_suspend() may finish
> > > successfully,
> >
> > Can you explain a bit about the exact memory order which causes
> > MQ_RQ_IN_FLIGHT
> > not observed?
> >
> > blk-mq quiesce includes synchronize_rcu() which drains all in-flight
> > dispatch, so after blk_mq_quiesce_queue() returns,
> > if blk_mq_queue_inflight() returns 0, it does mean there isn't any
> > active inflight requests.
> >
> > If there is bug in this pattern, I guess more drivers may have such
> > 'risk'.
>
> What we know for sure is that there was a bad dm_target reference in
> (struct dm_rq_target_io *tio)->ti:
>
> crash> struct -x dm_rq_target_io c00000245ca90128
> struct dm_rq_target_io {
> md = 0xc0000031c66a4000,
> ti = 0xc0080000020d0080 <fscache_object_list_lock+665632>,
>
> crash> struct -x dm_target 0xc0080000020d0080
> struct dm_target struct: invalid kernel virtual address:
> c0080000020d0080 type: "gdb_readmem_callback"
>
> The question is how this could have come to pass. It can only happen
> if tio->ti had been set before the map was reloaded.
> My theory is that the IO had been dispatched before the queue had been
> quiesced, like this:
>
> Task A Task B
> (dispatching IO) (executing a DM_SUSPEND ioctl to
> resume after DM_TABLE_LOAD)
> do_resume()
> dm_suspend()
> __dm_suspend()
> dm_mq_queue_rq()
> struct dm_target *ti =
> md->immutable_target;
> dm_stop_queue()
> blk_mq_quiesce_queue()
> /*
> * At this point, the queue is quiesced, but task A
> * has alreadyentered dm_mq_queue_rq()
> */
> dm_wait_for_completion()
> blk_mq_queue_inflight()
> /*
> * blk_mq_queue_inflight() doesn't see Task A's
> * request because it isn't started yet
> */
> set_bit(dmf_suspended_flag)
> dm_start_request(md, rq); dm_swap_table()
> __bind()
> md->immutable_target = ...
> dm_target_destroy()
> /* the previous md->immutable_target is freed */
> init_tio(tio, rq, md);
> /* the stale ti pointer is assigned to tio->ti */
> tio->ti = ti;
>
> dm_mq_queue_rq() contains no synchronization code if
> md->immutable_target is set, so I think that this can happen, even
> though it looks unlikely. With b4459b11e840 (which was not applied in
> the customer kernel), there would be a
> set_bit(DMF_BLOCK_IO_FOR_SUSPEND) statement before dm_stop_queue(),
> but IMO that the above would still be possible.
>
> If this can't happen, I have no more ideas how the observed situation
> came to pass. The customer who sent us the core claims that
> he has seen this multiple times already (but we have only this single
> core dump).
While I appreciate you making us aware of this crash, I'm really not
interested in going down the rabbit hole of reasoning through fairly
complex concerns unless you have gotten your customer a kernel with
latest fixes (e.g. commit b4459b11e840) and they _still_ crash.
Has that happened?
NOTE: There is also 85067747cf98 ("dm: do not use waitqueue for
request-based DM") but you probably have it since you said other than
missing the changes from commit b4459b11e840 that your dm-rq.c was
same.
Mike