On Wed, 2007-05-30 at 20:01 +0200, Jens Axboe wrote:
> On Mon, May 28 2007, Andrew Patterson wrote:
> > I am running into deadlock during domain validation when the request
> > queue is full. I am using the MPT Fusion spi driver and have run into
> > this problem with 2.6.16 and the latest scsi_misc kernels. The system
> > is running a load test on a u320 pSCSI bus with a drive that will
> > occasionally hang the bus until a host reset clears the condition. This
> > particular drive is known to not handle QAS very well. After the host
> > reset, the MPT Fusion driver attempts domain validation on all drives on
> > the bus. During DV, one or more of the queues lock up while trying to
> > execute various SCSI commands (INQUIRY, WRITE_BUFFER, etc) using the
> > scsi_execute() call. A stack trace shows:
>
> Ugh, that's nasty. If that is a valid scenario (and it looks like it
> is), then we have to reserve a request (and SCSI command) for such
> uses, as the scenario below is definitely livelock country.
>
> > [ 2318.524898] events/1 D a0000001007258f0 0 16 2 (L-TLB)
> > [ 2318.532030]
> > [ 2318.532031] Call Trace:
> > [ 2318.532202] [<a000000100724750>] schedule+0x1550/0x1840
> > [ 2318.532204] sp=e00000010a8dfc60
> > bsp=e00000010a8d8ff0
> > [ 2318.546975] [<a0000001007258f0>] io_schedule+0x50/0x80
> > [ 2318.546977] sp=e00000010a8dfcf0
> > bsp=e00000010a8d8fd0
> > [ 2318.554417] [<a0000001003b8820>] get_request_wait+0x200/0x2c0
> > [ 2318.554419] sp=e00000010a8dfcf0
> > bsp=e00000010a8d8f78
> > [ 2318.562540] [<a0000001003b8990>] blk_get_request+0xb0/0x120
> > [ 2318.562542] sp=e00000010a8dfd40
> > bsp=e00000010a8d8f40
> > [ 2318.579166] [<a00000010058b5e0>] scsi_execute+0x40/0x1e0
> > [ 2318.579168] sp=e00000010a8dfd40
> > bsp=e00000010a8d8ee8
> > [ 2318.586863] [<a0000001005980f0>] spi_execute+0x70/0x120
> > [ 2318.586865] sp=e00000010a8dfd40
> > bsp=e00000010a8d8e88
> > [ 2318.594204] [<a000000100599650>] spi_dv_device_echo_buffer+0x2f0/0x520
> > [ 2318.594206] sp=e00000010a8dfdc0
> > bsp=e00000010a8d8e30
> > [ 2318.607333] [<a000000100597a30>] spi_dv_retrain+0x70/0x520
> > [ 2318.607335] sp=e00000010a8dfde0
> > bsp=e00000010a8d8dc0
> > [ 2318.616119] [<a000000100599170>] spi_dv_device+0xdf0/0xf00
> > [ 2318.616121] sp=e00000010a8dfde0
> > bsp=e00000010a8d8d40
> > [ 2318.630538] [<a00000020db7e360>] mptspi_dv_device+0x160/0x2c0 [mptspi]
> > [ 2318.630540] sp=e00000010a8dfdf0
> > bsp=e00000010a8d8ce0
> > [ 2318.638341] [<a00000020db7e660>] mptspi_dv_renegotiate_work+0x1a0/0x220 [mptspi]
> > [ 2318.638343] sp=e00000010a8dfdf0
> > bsp=e00000010a8d8cb0
> > [ 2318.652773] [<a0000001000b80c0>] run_workqueue+0x1c0/0x320
> > [ 2318.652775] sp=e00000010a8dfe00
> > bsp=e00000010a8d8c80
> > [ 2318.660003] [<a0000001000b8460>] worker_thread+0x240/0x280
> > [ 2318.660005] sp=e00000010a8dfe00
> > bsp=e00000010a8d8c50
> > [ 2318.667536] [<a0000001000c24e0>] kthread+0xa0/0x120
> > [ 2318.667538] sp=e00000010a8dfe30
> > bsp=e00000010a8d8c20
> > [ 2318.681699] [<a0000001000129f0>] kernel_thread_helper+0xd0/0x100
> > [ 2318.681701] sp=e00000010a8dfe30
> > bsp=e00000010a8d8bf0
> > [ 2318.689121] [<a0000001000094c0>] start_kernel_thread+0x20/0x40
> > [ 2318.689124] sp=e00000010a8dfe30
> > bsp=e00000010a8d8bf0
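> >
> > The trace shows scsi_execute() blocking in blk_get_request(). For
> > reference, a slightly simplified sketch of the allocation at the top
> > of scsi_execute() (2.6-era block API, not the verbatim source):
> >
> >         /*
> >          * scsi_execute() allocates its request with __GFP_WAIT, so
> >          * blk_get_request() falls into get_request_wait() and sleeps
> >          * until another request is freed.  With the host in error
> >          * recovery nothing completes, so the sleep never ends.
> >          */
> >         struct request *req;
> >
> >         req = blk_get_request(sdev->request_queue, write, __GFP_WAIT);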
> >
> >
> > Some code examination and tracing show that get_request_wait() calls
> > get_request() to obtain a request. If get_request() returns NULL, it
> > will wait and try again. Here is the code from get_request_wait():
> >
> >         rq = get_request(q, rw_flags, bio, GFP_NOIO);
> >         while (!rq) {
> >                 DEFINE_WAIT(wait);
> >                 struct request_list *rl = &q->rq;
> >
> >                 prepare_to_wait_exclusive(&rl->wait[rw], &wait,
> >                                 TASK_UNINTERRUPTIBLE);
> >
> >                 rq = get_request(q, rw_flags, bio, GFP_NOIO);
> >
> >                 if (!rq) {
> >                         struct io_context *ioc;
> >
> >                         blk_add_trace_generic(q, bio, rw, BLK_TA_SLEEPRQ);
> >
> >                         __generic_unplug_device(q);
> >                         spin_unlock_irq(q->queue_lock);
> >                         io_schedule();
> >
> >                         /*
> >                          * After sleeping, we become a "batching" process and
> >                          * will be able to allocate at least one request, and
> >                          * up to a big batch of them for a small period of time.
> >                          * See ioc_batching, ioc_set_batching
> >                          */
> >                         ioc = current_io_context(GFP_NOIO, q->node);
> >                         ioc_set_batching(q, ioc);
> >
> >                         spin_lock_irq(q->queue_lock);
> >                 }
> >                 finish_wait(&rl->wait[rw], &wait);
> >         }
> >
> > Note the io_schedule() here. As far as I can tell, there is no wakeup
> > for this wait queue. The only wakeups occur when a request is freed.
> > No requests can be processed because the error handling is holding off
> > request processing until the error condition is cleared, so we get a
> > deadlock.
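> >
> > For reference, the wakeup side lives in the request-free path; a
> > slightly simplified sketch of __freed_request() from the same block
> > layer:
> >
> >         /*
> >          * Waiters on rl->wait[rw] are only woken here, i.e. when a
> >          * request completes and goes back to the pool.  With error
> >          * handling holding off all IO, nothing completes and the
> >          * wake_up() is never reached.
> >          */
> >         static void __freed_request(request_queue_t *q, int rw)
> >         {
> >                 struct request_list *rl = &q->rq;
> >
> >                 if (rl->count[rw] < queue_congestion_off_threshold(q))
> >                         blk_clear_queue_congested(q, rw);
> >
> >                 if (rl->count[rw] + 1 <= q->nr_requests) {
> >                         if (waitqueue_active(&rl->wait[rw]))
> >                                 wake_up(&rl->wait[rw]);
> >
> >                         blk_clear_queue_full(q, rw);
> >                 }
> >         }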
> >
> > Looking through get_request() we see:
> >
> >         if (rl->count[rw]+1 >= queue_congestion_on_threshold(q)) {
> >                 if (rl->count[rw]+1 >= q->nr_requests) {
> >                         ioc = current_io_context(GFP_ATOMIC, q->node);
> >                         /*
> >                          * The queue will fill after this allocation, so set
> >                          * it as full, and mark this process as "batching".
> >                          * This process will be allowed to complete a batch of
> >                          * requests, others will be blocked.
> >                          */
> >                         if (!blk_queue_full(q, rw)) {
> >                                 ioc_set_batching(q, ioc);
> >                                 blk_set_queue_full(q, rw);
> >                         } else {
> >                                 if (may_queue != ELV_MQUEUE_MUST
> >                                     && !ioc_batching(q, ioc)) {
> >                                         /*
> >                                          * The queue is full and the
> >                                          * allocating process is not a
> >                                          * "batcher", and not exempted
> >                                          * by the IO scheduler
> >                                          */
> >                                         goto out;
> >                                 }
> >                         }
> >                 }
> >                 blk_set_queue_congested(q, rw);
> >         }
> >
> > In this heavily loaded system, we get into the "goto out" because
> > count > nr_requests. The "goto out" will lead to returning NULL. This
> > condition would not occur if ioc_batching were set, but that is not
> > done until after the io_schedule() in get_request_wait().
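> >
> > For completeness, ioc_batching() itself (slightly simplified): a task
> > only counts as a batcher once ioc_set_batching() has run, and
> > get_request_wait() only does that after io_schedule() returns, which
> > is too late if the wakeup never arrives:
> >
> >         static inline int ioc_batching(request_queue_t *q,
> >                                        struct io_context *ioc)
> >         {
> >                 if (!ioc)
> >                         return 0;
> >
> >                 return ioc->nr_batch_requests == q->nr_batch_requests ||
> >                        (ioc->nr_batch_requests > 0 &&
> >                         time_before(jiffies,
> >                                     ioc->last_waited + BLK_BATCH_TIME));
> >         }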
>
> It doesn't matter; memory allocation could still block due to reclaim,
> which won't happen because no more IO is getting through. Or if you
> went atomic, it could also fail.
>
> There's no other solution than maintaining a cached request + command
> for this. libata has a similar issue wrt error handling with NCQ; we
> may need a command in error handling to retrieve the log page.
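>
> Purely as an illustration of the idea (nothing like this exists in the
> tree today; scsi_get_reserved_req() and the sdev->reserved field are
> made-up names), a reserved request might look like:
>
>         /*
>          * Hypothetical sketch only: keep one preallocated request per
>          * scsi_device that eh/DV paths can fall back on when the
>          * queue's request pool is exhausted.
>          */
>         struct scsi_reserved {
>                 struct request *req;    /* preallocated at device setup */
>                 unsigned long in_use;
>         };
>
>         static struct request *scsi_get_reserved_req(struct scsi_device *sdev)
>         {
>                 struct scsi_reserved *rsv = sdev->reserved; /* made-up field */
>
>                 /* atomically claim the spare; NULL if already taken */
>                 if (!test_and_set_bit(0, &rsv->in_use))
>                         return rsv->req;
>                 return NULL;
>         }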
Actually, there is another solution: DV is careful to use only a
single command for its processes ... if we could use the eh command for
this, then I think the problem would go away ... unfortunately, that's a
bit more complex to achieve than it sounds.
James