> Il giorno 01 apr 2018, alle ore 10:56, Paolo Valente
> <paolo.vale...@linaro.org> ha scritto:
>> Il giorno 30 mar 2018, alle ore 18:57, Bart Van Assche
>> <bart.vanass...@wdc.com> ha scritto:
>> On Fri, 2018-03-30 at 10:23 +0200, Paolo Valente wrote:
>>> Still 4.16-rc1, being that the version for which you reported this
>>> issue in the first place.
>> A vanilla v4.16-rc1 kernel is not sufficient to run the srp-test software
>> since RDMA/CM support for the SRP target driver is missing from that kernel.
>> That's why I asked you to use the for-next branch from my github repository
>> in a previous e-mail.
> Yep, that's the branch/top commit I used (as you suggested):
> 190943ce1824 [bvanassche/for-next] scsi: mpt3sas: fix oops in error handlers
> after shutdown/unload
> bvanassche https://github.com/bvanassche/linux.git
> The kernel in that branch presents itself as 4.16-rc1, but, as you
> point out, it should contain the needed support.
>> Anyway, since the necessary patches are now in
>> linux-next, the srp-test software can also be run against linux-next. Here
>> are the results that I obtained with label next-20180329 and the kernel
>> config attached to your previous e-mail:
>> # while ./srp-test/run_tests -c -d -r 10 -e bfq; do :; done
>> BUG: unable to handle kernel NULL pointer dereference at 0000000000000200
>> PGD 0 P4D 0
>> Oops: 0002 [#1] SMP PTI
>> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
>> 1.0.0-prebuilt.qemu-project.org 04/01/2014
>> RIP: 0010:rb_erase+0x284/0x380
>> Call Trace:
>> bfq_remove_request+0x9a/0x2e0 [bfq]
>> ? rcu_read_lock_sched_held+0x64/0x70
>> ? update_load_avg+0x72b/0x760
>> bfq_finish_requeue_request+0x2e1/0x3b0 [bfq]
>> ? __lock_is_held+0x5a/0xa0
> This new trace just confirms my suspects. Looking forward to some
> feedback from Mike or Jens. Otherwise I'll try to look into it
> myself, although I don't think I am the right person to suggest the
> best cure for this cloning issue.
I tried to investigate this further, but the corruption of a cloned
request (or some other mishappening) that then causes this failure
occurs somewhere, earlier, in the cloning phase; and, as I feared, I
was not able to spot the mistake in that part of the code, especially
because I'm not able to reproduce the failure itself.
I might possibly have more luck after some hints from knowledgeable
Otherwise, if, in your test, this failure occurs immediately after you
start the test, and if you are willing to repeat this test with my
development version of bfq, then we may have hope to get a detailed
trace of what happens under the hood.