Re: [lkp-robot] [rcu] b332151a29: kernel_BUG_at_mm/slab.c

2017-01-20 Thread Jens Axboe
On 01/20/2017 09:09 AM, Sebastian Andrzej Siewior wrote:
> On 2017-01-20 08:32:37 [-0800], Jens Axboe wrote:
>> That's alright, sounds like it's not a -next regression, but rather something
>> that is already broken. I can reproduce a lot of breakage if I enable
>> CONFIG_DEBUG_TEST_DRIVER_REMOVE, in fact my system doesn't boot at all. This
>> is the first bug:
>>
>> [   18.247895] [ cut here ]
>> [   18.247907] WARNING: CPU: 21 PID: 2223 at drivers/ata/libata-core.c:6522 
>> ata_host_detach+0x11b]
>> [   18.247908] Modules linked in: igb(+) ahci(+) libahci i2c_algo_bit dca 
>> libata nvme(+) nvme_core
>> [   18.247917] CPU: 21 PID: 2223 Comm: systemd-udevd Tainted: GW 
>>   4.10.0-rc4+ #30
>> [   18.247919] Hardware name: Dell Inc. PowerEdge T630/0NT78X, BIOS 2.3.4 
>> 11/09/2016
>> [   18.247919] Call Trace:
>> [   18.247928]  dump_stack+0x68/0x93
>> [   18.247934]  __warn+0xc6/0xe0
>> [   18.247937]  warn_slowpath_null+0x18/0x20
>> [   18.247943]  ata_host_detach+0x11b/0x120 [libata]
> …
> 
>> and it's even more downhill from there. That option is marked unstable, are 
>> we
>> expecting it to work right now?
> 
> Well, as per 248ff0216543 ("driver core: Make Kconfig text for
> DEBUG_TEST_DRIVER_REMOVE stronger"):
> 
> |   The current state of driver removal is not great.
> |   CONFIG_DEBUG_TEST_DRIVER_REMOVE finds lots of errors. The help text
> |   currently undersells exactly how many errors this option will find. Add
> |   a bit more description to indicate this option shouldn't be turned on
> |   unless you actually want to debug driver removal. The text can be
> |   changed later when more drivers are fixed up.
> 
> so it looks like the option is working but it uncovers bugs. I've put
> you in TO because the breakage in kvm test went away after I disabled
> the MQ support in SCSI. So I *assumed* that MQ was not doing something
> right in the removal path. I don't know if this libata-core backtrace is
> a false positive or not.

Sure, I get that, my question is just if it's always finding valid bugs,
or if the test itself is buggy. The fact that I can't boot anything after
enabling it makes me suspicious.

Or maybe the state of load/remove/load is just pretty sad.

-- 
Jens Axboe

--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [lkp-robot] [rcu] b332151a29: kernel_BUG_at_mm/slab.c

2017-01-20 Thread Sebastian Andrzej Siewior
On 2017-01-20 08:32:37 [-0800], Jens Axboe wrote:
> That's alright, sounds like it's not a -next regression, but rather something
> that is already broken. I can reproduce a lot of breakage if I enable
> CONFIG_DEBUG_TEST_DRIVER_REMOVE, in fact my system doesn't boot at all. This
> is the first bug:
> 
> [   18.247895] [ cut here ]
> [   18.247907] WARNING: CPU: 21 PID: 2223 at drivers/ata/libata-core.c:6522 
> ata_host_detach+0x11b]
> [   18.247908] Modules linked in: igb(+) ahci(+) libahci i2c_algo_bit dca 
> libata nvme(+) nvme_core
> [   18.247917] CPU: 21 PID: 2223 Comm: systemd-udevd Tainted: GW  
>  4.10.0-rc4+ #30
> [   18.247919] Hardware name: Dell Inc. PowerEdge T630/0NT78X, BIOS 2.3.4 
> 11/09/2016
> [   18.247919] Call Trace:
> [   18.247928]  dump_stack+0x68/0x93
> [   18.247934]  __warn+0xc6/0xe0
> [   18.247937]  warn_slowpath_null+0x18/0x20
> [   18.247943]  ata_host_detach+0x11b/0x120 [libata]
…

> and it's even more downhill from there. That option is marked unstable, are we
> expecting it to work right now?

Well, as per 248ff0216543 ("driver core: Make Kconfig text for
DEBUG_TEST_DRIVER_REMOVE stronger"):

|   The current state of driver removal is not great.
|   CONFIG_DEBUG_TEST_DRIVER_REMOVE finds lots of errors. The help text
|   currently undersells exactly how many errors this option will find. Add
|   a bit more description to indicate this option shouldn't be turned on
|   unless you actually want to debug driver removal. The text can be
|   changed later when more drivers are fixed up.

so it looks like the option is working but it uncovers bugs. I've put
you in TO because the breakage in kvm test went away after I disabled
the MQ support in SCSI. So I *assumed* that MQ was not doing something
right in the removal path. I don't know if this libata-core backtrace is
a false positive or not.

Sebastian
--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [lkp-robot] [rcu] b332151a29: kernel_BUG_at_mm/slab.c

2017-01-20 Thread Sebastian Andrzej Siewior
On 2017-01-20 08:09:36 [-0800], Jens Axboe wrote:
> Is there a full trace of this?

[3.654003] scsi host0: scsi_debug: version 1.86 [20160430]
[3.654003]   dev_size_mb=8, opts=0x0, submit_queues=1, statistics=0
[3.660755] scsi 0:0:0:0: Direct-Access Linuxscsi_debug   0186 
PQ: 0 ANSI: 7
[3.711231] sd 0:0:0:0: [sda] 16384 512-byte logical blocks: (8.39 MB/8.00 
MiB)
[3.716202] sd 0:0:0:0: [sda] Write Protect is off
[3.717244] sd 0:0:0:0: [sda] Mode Sense: 73 00 10 08
[3.725059] sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, 
supports DPO and FUA
[3.795093] sd 0:0:0:0: [sda] Attached SCSI disk
[3.796686] sd 0:0:0:0: [sda] Synchronizing SCSI cache
[3.804770] sd 0:0:0:0: [sda] 16384 512-byte logical blocks: (8.39 MB/8.00 
MiB)
[3.809748] sd 0:0:0:0: [sda] Write Protect is off
[3.810806] sd 0:0:0:0: [sda] Mode Sense: 73 00 10 08
[3.818599] sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, 
supports DPO and FUA
[3.830894] kobject (be01a5fc): tried to init an initialized object, 
something is seriously wrong.
[3.832820] CPU: 6 PID: 6 Comm: kworker/u14:0 Not tainted 4.9.0 #86
[3.834172] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 
1.10.1-1 04/01/2014
[3.835886] Workqueue: events_unbound async_run_entry_fn
[3.837028]  80079da8 83a5f6cc be01a5fc be01a5fc 80079dc4 83a61e33 842a883c 
be01a5fc
[3.838802]   84397488 be01a108 80079dec 83a46afa be01a5d8 840670a0 
be01a5d8
[3.840570]  be01a108 bd983468 be01a108 bd983470 be01a5d8 80079e14 83a3c857 
be01a5d8
[3.842350] Call Trace:
[3.842884]  [<83a5f6cc>] dump_stack+0x58/0x7c
[3.843834]  [<83a61e33>] kobject_init+0x73/0x80
[3.844828]  [<83a46afa>] blk_mq_register_dev+0x2a/0x110
[3.845971]  [<83a3c857>] blk_register_queue+0x87/0x140
[3.847085]  [<83a49c7e>] device_add_disk+0x1ce/0x470
[3.848170]  [<83c3d328>] sd_probe_async+0xe8/0x1b0
[3.849207]  [<83672347>] async_run_entry_fn+0x37/0xe0
[3.850313]  [<836694f7>] process_one_work+0x1b7/0x3e0
[3.851414]  [<8366947c>] ? process_one_work+0x13c/0x3e0
[3.852552]  [<83669759>] worker_thread+0x39/0x460
[3.853569]  [<83669720>] ? process_one_work+0x3e0/0x3e0
[3.854717]  [<8366f394>] kthread+0xb4/0xd0
[3.855611]  [<83f8cb4d>] ? _raw_spin_unlock_irq+0x2d/0x50
[3.856775]  [<8366f2e0>] ? __kthread_create_on_node+0x160/0x160
[3.858066]  [<8366f2e0>] ? __kthread_create_on_node+0x160/0x160
[3.859340]  [<83f8d363>] ret_from_fork+0x1b/0x28
[3.863951] kobject (ff0ca36c): tried to init an initialized object, 
something is seriously wrong.
[3.865875] CPU: 6 PID: 6 Comm: kworker/u14:0 Not tainted 4.9.0 #86
[3.867202] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 
1.10.1-1 04/01/2014
[3.868924] Workqueue: events_unbound async_run_entry_fn
[3.870079]  80079da8 83a5f6cc ff0ca36c be01a5fc 80079dc4 83a61e33 842a883c 
ff0ca36c
[3.871846]  80079dc4 84397474 be01a108 80079dec 83a46b1c be01a5d8 840670a0 
be01a5d8
[3.873605]  be01a108 bd983468 be01a108 bd983470 be01a5d8 80079e14 83a3c857 
be01a5d8
[3.875395] Call Trace:
[3.875928]  [<83a5f6cc>] dump_stack+0x58/0x7c
[3.876878]  [<83a61e33>] kobject_init+0x73/0x80
[3.877884]  [<83a46b1c>] blk_mq_register_dev+0x4c/0x110
[3.879018]  [<83a3c857>] blk_register_queue+0x87/0x140
[3.880136]  [<83a49c7e>] device_add_disk+0x1ce/0x470
[3.881227]  [<83c3d328>] sd_probe_async+0xe8/0x1b0
[3.882283]  [<83672347>] async_run_entry_fn+0x37/0xe0
[3.883381]  [<836694f7>] process_one_work+0x1b7/0x3e0
[3.884481]  [<8366947c>] ? process_one_work+0x13c/0x3e0
[3.885613]  [<83669759>] worker_thread+0x39/0x460
[3.886658]  [<83669720>] ? process_one_work+0x3e0/0x3e0
[3.887791]  [<8366f394>] kthread+0xb4/0xd0
[3.888680]  [<83f8cb4d>] ? _raw_spin_unlock_irq+0x2d/0x50
[3.889881]  [<8366f2e0>] ? __kthread_create_on_node+0x160/0x160
[3.891189]  [<8366f2e0>] ? __kthread_create_on_node+0x160/0x160
[3.892506]  [<83f8d363>] ret_from_fork+0x1b/0x28
[3.893563] kobject (ff21b36c): tried to init an initialized object, 
something is seriously wrong.
[3.895559] CPU: 6 PID: 6 Comm: kworker/u14:0 Not tainted 4.9.0 #86
[3.896938] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 
1.10.1-1 04/01/2014
[3.898741] Workqueue: events_unbound async_run_entry_fn
[3.899911]  80079da8 83a5f6cc ff21b36c be01a5fc 80079dc4 83a61e33 842a883c 
ff21b36c
[3.901739]  80079dc4 84397474 be01a108 80079dec 83a46b1c be01a5d8 840670a0 
be01a5d8
[3.903558]  be01a108 bd983468 be01a108 bd983470 be01a5d8 80079e14 83a3c857 
be01a5d8
[3.905381] Call Trace:
[3.905936]  [<83a5f6cc>] dump_stack+0x58/0x7c
[3.906913]  [<83a61e33>] kobject_init+0x73/0x80
[3.907931]  [<83a46b1c>] blk_mq_register_dev+0x4c/0x110
[3.909108]  [<83a3c857>] blk_register_queue+0x87/0x140
[3.910267]  [<83a49c7e>] device_add_disk+0x1ce/0x470
[

Re: [lkp-robot] [rcu] b332151a29: kernel_BUG_at_mm/slab.c

2017-01-20 Thread Jens Axboe
On 01/20/2017 08:23 AM, Sebastian Andrzej Siewior wrote:
>>> yes. With and without the patch there is a lot of wrong stuff like
>>> complains about a kobject initialized again. This leads to a double free
>>> at some point.
>>
>> And what patch are we talking about? I don't mind being CC'ed into a thread,
>> but some context and background would be immensely helpful here...
> 
> The patch is irrelevant. lkp-robot found a bug which was there before
> the patch in question but the pattern changed so it blamed the Author.
> It triggers even v4.9 with
>   CONFIG_SCSI_DEBUG
>   CONFIG_DEBUG_TEST_DRIVER_REMOVE
>   CONFIG_SCSI_MQ_DEFAULT
> enabled and CONFIG_SCSI_DEBUG is simply a SCSI host controller which is
> always there. I can send you a complete config against current HEAD
> which boots in kvm if you want.

That's alright, sounds like it's not a -next regression, but rather something
that is already broken. I can reproduce a lot of breakage if I enable
CONFIG_DEBUG_TEST_DRIVER_REMOVE, in fact my system doesn't boot at all. This
is the first bug:

[   18.247895] [ cut here ]
[   18.247907] WARNING: CPU: 21 PID: 2223 at drivers/ata/libata-core.c:6522 
ata_host_detach+0x11b]
[   18.247908] Modules linked in: igb(+) ahci(+) libahci i2c_algo_bit dca 
libata nvme(+) nvme_core
[   18.247917] CPU: 21 PID: 2223 Comm: systemd-udevd Tainted: GW   
4.10.0-rc4+ #30
[   18.247919] Hardware name: Dell Inc. PowerEdge T630/0NT78X, BIOS 2.3.4 
11/09/2016
[   18.247919] Call Trace:
[   18.247928]  dump_stack+0x68/0x93
[   18.247934]  __warn+0xc6/0xe0
[   18.247937]  warn_slowpath_null+0x18/0x20
[   18.247943]  ata_host_detach+0x11b/0x120 [libata]
[   18.247950]  ata_pci_remove_one+0x10/0x20 [libata]
[   18.247955]  ahci_remove_one+0x10/0x20 [ahci]
[   18.247958]  pci_device_remove+0x34/0xb0
[   18.247966]  driver_probe_device+0xd0/0x370
[   18.247969]  __driver_attach+0x9a/0xa0
[   18.247971]  ? driver_probe_device+0x370/0x370
[   18.247973]  bus_for_each_dev+0x5d/0x90
[   18.247975]  driver_attach+0x19/0x20
[   18.247977]  bus_add_driver+0x11f/0x220
[   18.247980]  driver_register+0x5b/0xd0
[   18.247982]  __pci_register_driver+0x58/0x60
[   18.247984]  ? 0xa00d9000
[   18.247988]  ahci_pci_driver_init+0x1e/0x20 [ahci]
[   18.247992]  do_one_initcall+0x3e/0x170
[   18.247997]  ? rcu_read_lock_sched_held+0x45/0x80
[   18.248001]  ? kmem_cache_alloc_trace+0x22e/0x290
[   18.248004]  do_init_module+0x5a/0x1cb
[   18.248007]  load_module+0x1e60/0x2570
[   18.248008]  ? __symbol_put+0x70/0x70
[   18.248010]  ? show_coresize+0x30/0x30
[   18.248013]  ? kernel_read_file+0x19e/0x1c0
[   18.248015]  ? kernel_read_file_from_fd+0x44/0x70
[   18.248016]  SYSC_finit_module+0xba/0xc0
[   18.248018]  SyS_finit_module+0x9/0x10
[   18.248021]  entry_SYSCALL_64_fastpath+0x18/0xad
[   18.248022] RIP: 0033:0x7f49c5a645b9
[   18.248023] RSP: 002b:7ffccf512658 EFLAGS: 0246 ORIG_RAX: 
0139
[   18.248025] RAX: ffda RBX: 7f49c61659dd RCX: 7f49c5a645b9
[   18.248026] RDX:  RSI: 7f49c53152c7 RDI: 0009
[   18.248026] RBP: 0002 R08:  R09: 
[   18.248027] R10: 0009 R11: 0246 R12: 555737e82b30
[   18.248028] R13: 555737e71200 R14: 555737e82b30 R15: 
[   18.248030] ---[ end trace b0ae5eae3430d5d6 ]---

and it's even more downhill from there. That option is marked unstable, are we
expecting it to work right now?

-- 
Jens Axboe

--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [lkp-robot] [rcu] b332151a29: kernel_BUG_at_mm/slab.c

2017-01-20 Thread Jens Axboe
On 01/20/2017 08:01 AM, Sebastian Andrzej Siewior wrote:
> On 2017-01-19 09:02:16 [+0800], kernel test robot wrote:
>> test-description: Trinity is a linux system call fuzz tester.
> 
> you don't even get to fire up trinity. With and without the patch you
> crash very early.
> 
>> +-+++
>> | | d5f6ab9c11 | 
>> b332151a29 |
>> +-+++
>> | boot_successes  | 0  | 0   
>>|
>> | boot_failures   | 6  | 8   
>>|
>> | WARNING:at_include/linux/kref.h:#kobject_get| 6  | 8   
>>|
>> | WARNING:at_arch/x86/mm/dump_pagetables.c:#note_page | 2  | 2   
>>|
>> | kernel_BUG_at_mm/slab.c | 0  | 4   
>>|
>> | invalid_opcode:#[##]PREEMPT_SMP | 0  | 4   
>>|
>> | Kernel_panic-not_syncing:Fatal_exception| 0  | 6   
>>|
>> | BUG:unable_to_handle_kernel | 0  | 2   
>>|
>> | Oops| 0  | 2   
>>|
>> +-+++
> 
> There is no successful boot. The pattern changes with patch in question
> applied.
> 
>> [8.044624] sd 0:0:0:0: [sda] Synchronizing SCSI cache
>> [8.055721] slab: double free detected in cache 'kmalloc-32', objp 
>> 8af558c0
>> [8.057138] [ cut here ]
>> [8.058085] kernel BUG at mm/slab.c:2624!
>> [8.059255] invalid opcode:  [#1] PREEMPT SMP

Is there a full trace of this?

> yes. With and without the patch there is a lot of wrong stuff like
> complains about a kobject initialized again. This leads to a double free
> at some point.

And what patch are we talking about? I don't mind being CC'ed into a thread,
but some context and background would be immensely helpful here...

-- 
Jens Axboe

--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [lkp-robot] [rcu] b332151a29: kernel_BUG_at_mm/slab.c

2017-01-20 Thread Sebastian Andrzej Siewior
On 2017-01-19 09:02:16 [+0800], kernel test robot wrote:
> test-description: Trinity is a linux system call fuzz tester.

you don't even get to fire up trinity. With and without the patch you
crash very early.

> +-+++
> | | d5f6ab9c11 | 
> b332151a29 |
> +-+++
> | boot_successes  | 0  | 0
>   |
> | boot_failures   | 6  | 8
>   |
> | WARNING:at_include/linux/kref.h:#kobject_get| 6  | 8
>   |
> | WARNING:at_arch/x86/mm/dump_pagetables.c:#note_page | 2  | 2
>   |
> | kernel_BUG_at_mm/slab.c | 0  | 4
>   |
> | invalid_opcode:#[##]PREEMPT_SMP | 0  | 4
>   |
> | Kernel_panic-not_syncing:Fatal_exception| 0  | 6
>   |
> | BUG:unable_to_handle_kernel | 0  | 2
>   |
> | Oops| 0  | 2
>   |
> +-+++

There is no successful boot. The pattern changes with patch in question
applied.

> [8.044624] sd 0:0:0:0: [sda] Synchronizing SCSI cache
> [8.055721] slab: double free detected in cache 'kmalloc-32', objp 8af558c0
> [8.057138] [ cut here ]
> [8.058085] kernel BUG at mm/slab.c:2624!
> [8.059255] invalid opcode:  [#1] PREEMPT SMP

yes. With and without the patch there is a lot of wrong stuff like
complains about a kobject initialized again. This leads to a double free
at some point.

What happens is the following: CONFIG_SCSI_DEBUG is enabled which adds a
dummy host controller with a dummy disk. This gets probed during boot.
Since you also enabled CONFIG_DEBUG_TEST_DRIVER_REMOVE it gets removed
and re-added. The request_queue in genhd disk is re-used while the disk
is added for the second time:

[1.314404] scsi host0: scsi_debug: version 1.86 [20160430]
[1.314404]   dev_size_mb=8, opts=0x0, submit_queues=1, statistics=0
[1.315994] scsi 0:0:0:0: Direct-Access Linuxscsi_debug   0186 
PQ: 0 ANSI: 7
[1.351052] sd 0:0:0:0: [sda] 16384 512-byte logical blocks: (8.39 MB/8.00 
MiB)
[1.355916] sd 0:0:0:0: [sda] Write Protect is off
[1.356838] sd 0:0:0:0: [sda] Mode Sense: 73 00 10 08
[1.364455] sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, 
supports DPO and FUA
[1.437642] sd 0:0:0:0: [sda] Attached SCSI disk
[1.438413] sd 0:0:0:0: [sda] Synchronizing SCSI cache
[1.445868] sd 0:0:0:0: [sda] 16384 512-byte logical blocks: (8.39 MB/8.00 
MiB)
[1.450819] sd 0:0:0:0: [sda] Write Protect is off
[1.451853] sd 0:0:0:0: [sda] Mode Sense: 73 00 10 08
[1.459636] sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, 
supports DPO and FUA
[1.471446] kobject (beb87d44): tried to init an initialized object, 
something is seriously wrong.

Since you also need CONFIG_SCSI_MQ_DEFAULT enabled I assume the MQ block
code is buggy here.
But commit b332151a29 in Paul's tree innocent.

Sebastian
--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html