On Mon, Sep 28, 2009 at 5:27 PM, Roland Dreier <[email protected]> wrote:
>
> > If an SRP target processes SRP I/O slow enough, the SRP initiator locks up.
>
> > INFO: task fio:6389 blocked for more than 120 seconds.
> > "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > fio D 0000000000000000 0 6389 6388 0x00000000
> > ffff880071dc5bd8 0000000000000046 ffff880071dc5b08 000000018107764d
> > 0000000000012cc0 000000000000de20 0000000000000001 ffff880070cd8000
> > ffff880070cd83b0 0000000100000000 000000010001193e ffff88007fb99050
> > Call Trace:
> > [<ffffffff812ec5e5>] ? _spin_unlock_irqrestore+0x65/0x80
> > [<ffffffff812e9b37>] io_schedule+0x37/0x50
> > [<ffffffff8110cff2>] __blockdev_direct_IO+0x692/0xd80
> > [<ffffffff810e0357>] ? get_super+0x27/0xc0
> > [<ffffffff8110b169>] blkdev_direct_IO+0x49/0x50
> > [<ffffffff8110a1f0>] ? blkdev_get_blocks+0x0/0xc0
> > [<ffffffff810a1799>] generic_file_aio_read+0x679/0x690
> > [<ffffffff810dc35a>] ? __dentry_open+0x13a/0x340
> > [<ffffffff810de091>] do_sync_read+0xf1/0x140
> > [<ffffffff810775ed>] ? trace_hardirqs_on_caller+0x14d/0x1a0
> > [<ffffffff810662f0>] ? autoremove_wake_function+0x0/0x40
> > [<ffffffff810775ed>] ? trace_hardirqs_on_caller+0x14d/0x1a0
> > [<ffffffff8107764d>] ? trace_hardirqs_on+0xd/0x10
> > [<ffffffff810ded28>] vfs_read+0xc8/0x180
> > [<ffffffff810deed0>] sys_read+0x50/0x90
> > [<ffffffff8100be6b>] system_call_fastpath+0x16/0x1b
> > no locks held by fio/6389.
>
> It will probably be a while until I can get the time to build an scst
> test set up to reproduce this unfortunately. So we'll have to debug
> this with your set up for the moment.
>
> I don't have a good idea of where in the SRP initiator the problem could
> be... the non-error path for ordinary SCSI commands is pretty trivial.
> Presumably slowing down the target means that the queue of outstanding
> commands fills up, but they should complete and let things make
> progress. I guess the possibilities are a bug higher up in the block or
> SCSI stack, or some accounting problem in SRP.
>
> You could try adding printks to srp_queuecommand() to see that all SCSI
> commands are sent on the SRP connection and also add tracing to
> srp_process_rsp() to make sure there's a matching call to ->scsi_done
> for each SCSI command. And also we should make sure there's no
> disconnections or task management commands or anything like that
> confusing things ... there is definitely more room for bugs in the parts
> of the SRP driver that handle exceptions.

(replying to an e-mail of two months ago -- finally got the time to have
a closer look at the SRP initiator source code)

I'm not sure that the non-error path for ordinary SCSI commands is that
trivial. If my interpretation of the SRP initiator source code is
correct, the statements complete(&target->done) and
init_completion(&target->done) can be executed concurrently. Although I
do not know what the exact consequences are, and although I do not know
whether this is related to the issue I reported, this is a race
condition. I'm not sure that allowing such races is good kernel
programming practice.

Bart.
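
To make the concern about complete() racing with init_completion() a bit more concrete, here is a minimal userspace sketch in C. It is not the ib_srp code and does not reproduce any of its code paths. The struct below is a simplified stand-in for the kernel's struct completion that models only the "done" counter, and the two helper functions (init_then_complete, complete_then_init) are hypothetical names chosen for illustration. The real race is concurrent; the sketch merely serializes one possible interleaving, in which the re-initialization lands after the completion has already been signalled, to show that the event would then be lost.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Simplified stand-in for the kernel's struct completion (assumption:
 * only the "done" counter is modelled; the real structure also holds a
 * wait queue with its own lock, which init_completion() re-initializes).
 */
struct completion {
	unsigned int done;
};

/* Like the kernel helper, this resets the counter to zero. */
static void init_completion(struct completion *x)
{
	x->done = 0;
}

/* Record one completion event. */
static void complete(struct completion *x)
{
	x->done++;
}

/* Non-blocking check: consume one event if one is pending. */
static bool try_wait_for_completion(struct completion *x)
{
	if (x->done == 0)
		return false;
	x->done--;
	return true;
}

/* Intended order: initialize first, then the event is signalled. */
static bool init_then_complete(void)
{
	struct completion c;

	init_completion(&c);
	complete(&c);
	return try_wait_for_completion(&c);	/* event is seen */
}

/*
 * One interleaving a race could produce: the event is signalled first
 * and the (re-)initialization runs afterwards, wiping the counter.
 */
static bool complete_then_init(void)
{
	struct completion c;

	init_completion(&c);	/* left over from an earlier use */
	complete(&c);		/* event arrives early */
	init_completion(&c);	/* late re-init discards it */
	return try_wait_for_completion(&c);	/* event is lost */
}

/* Check both orderings. */
static void demo(void)
{
	assert(init_then_complete());
	assert(!complete_then_init());
}
```

Calling demo() from a main() exercises both orderings. In the second one a caller that blocked instead of polling would never be woken, which is one way a task could end up stuck in D state. Whether that is what happens in the report above is not established by this sketch.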
