Re: Kernel crash in cgroup_pidlist_destroy_work_fn()
Hi Zefan,

Thanks for your reply. You are right, the vfs refcount should guarantee that there are no more file reads before we destroy that cgroup. I thought something else could release that cgroup refcount. Maybe I didn't make it clear: this bug is hardly reproducible; we only saw it once (otherwise I would have more clues and could have tested my patch). The full stack trace is at the end of this email for your reference.

Reading the code again, I thought we could have a race between in-flight work and pending work, but cgroup_pidlist_destroy_work_fn() should get this right by checking the pending bit after taking the mutex. What's more, we should only have one in-flight work item on this workqueue, so the code seems pretty safe.

The unbalanced mutex_unlock() you fixed is actually hard to hit, since memory must run out in that case, and I didn't get any OOM information from the bug reporter. So I think I can rule out that case.

Thanks!

---

[15955.313964] BUG: unable to handle kernel NULL pointer dereference at 0027
[15955.314958] IP: [814ad8b5] __mutex_lock_slowpath+0x2f/0x1a0
[15955.316088] PGD 0
[15955.316310] Oops: [#1] SMP
[15955.316810] Modules linked in: tun ipv6 bonding dm_multipath scsi_dh video sbs sbshc acpi_pad acpi_ipmi parport_pc lp parport tcp_diag inet_diag ipmi_devintf dell_rbu sg igb i2c_algo_bit ptp pps_core iTCO_wdt iTCO_vendor_support dcdbas hed ipmi_si ipmi_msghandler i7core_edac i2c_i801 shpchp i2c_core edac_core ioatdma dca lpc_ich mfd_core microcode ahci libahci libata sd_mod scsi_mod
[15955.321422] CPU: 5 PID: 7280 Comm: kworker/5:2 Not tainted 3.14.14 #1
[15955.322275] Hardware name: Dell C6100 /0D61XP, BIOS 1.66 12/23/2011
[15955.323326] Workqueue: cgroup_pidlist_destroy cgroup_pidlist_destroy_work_fn
[15955.324317] task: 8800ad899840 ti: 880323b5 task.ti: 880323b5
[15955.325236] RIP: 0010:[814ad8b5] [814ad8b5] __mutex_lock_slowpath+0x2f/0x1a0
[15955.326465] RSP: 0018:880323b51dc0 EFLAGS: 00010286
[15955.327178] RAX: RBX: 88032c7de0f8 RCX: 88032e188ac0
[15955.328061] RDX: 8000 RSI: 0140 RDI: 88032c7de0f8
[15955.328927] RBP: 880323b51e08 R08: 88032e188ac0 R09:
[15955.329784] R10: R11: 0040 R12:
[15955.330778] R13: 8800ad899840 R14: 880333cb7200 R15:
[15955.331650] FS: () GS:880333ca() knlGS:
[15955.332616] CS: 0010 DS: ES: CR0: 8005003b
[15955.29] CR2: 0027 CR3: 01a0c000 CR4: 07e0
[15955.334584] Stack:
[15955.334798] 880323b51e08 8100163f 880323b51de8 8107d6e8
[15955.335751] 88032c7de0f8 880333cb2300 880333cb7200
[15955.336831] 880323b51e20 814ada45 88032e188ab8
[15955.337779] Call Trace:
[15955.338053] [8100163f] ? __switch_to+0x1ca/0x3d2
[15955.338739] [8107d6e8] ? mmdrop+0x11/0x20
[15955.339386] [814ada45] mutex_lock+0x1f/0x2f
[15955.340054] [810bd2b9] cgroup_pidlist_destroy_work_fn+0x22/0x66
[15955.340898] [810710b9] process_one_work+0x16d/0x289
[15955.341629] [81071877] worker_thread+0x149/0x1f5
[15955.342327] [8107172e] ? rescuer_thread+0x27f/0x27f
[15955.343049] [810764db] kthread+0xae/0xb6
[15955.343685] [8107] ? idle_worker_timeout+0x4/0x5b
[15955.344499] [8107642d] ? __kthread_parkme+0x61/0x61
[15955.345245] [814b532c] ret_from_fork+0x7c/0xb0
[15955.345934] [8107642d] ? __kthread_parkme+0x61/0x61
[15955.346663] Code: 90 55 48 89 e5 41 57 41 56 41 55 41 54 53 48 89 fb 48 83 e4 f0 48 83 ec 20 65 4c 8b 2c 25 80 b8 00 00 48 8b 47 18 48 85 c0 74 05 <8b> 40 28 eb 05 b8 01 00 00 00 85 c0 0f 84 9f 00 00 00 4c 8d 63
[15955.349439] RIP [814ad8b5] __mutex_lock_slowpath+0x2f/0x1a0
[15955.350270] RSP 880323b51dc0
[15955.350635] CR2: 0027
[15955.351481] ---[ end trace 169754b1b86c507e ]---

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: Kernel crash in cgroup_pidlist_destroy_work_fn()
On 2014/9/17 13:29, Li Zefan wrote:
> On 2014/9/17 7:56, Cong Wang wrote:
>> Hi, Tejun
>>
>> We saw some kernel NULL pointer dereferences in
>> cgroup_pidlist_destroy_work_fn(), more precisely at
>> __mutex_lock_slowpath(), on 3.14. I can show you the full stack trace
>> on request.
>>
>
> Yes, please.
>
>> Looking at the code, it seems flush_workqueue() doesn't care about new
>> incoming work; it only processes currently pending items. If this is
>> correct, then we could have the following race condition:
>>
>> cgroup_pidlist_destroy_all():
>>         //...
>>         mutex_lock(&cgrp->pidlist_mutex);
>>         list_for_each_entry_safe(l, tmp_l, &cgrp->pidlists, links)
>>                 mod_delayed_work(cgroup_pidlist_destroy_wq,
>>                                  &l->destroy_dwork, 0);
>>         mutex_unlock(&cgrp->pidlist_mutex);
>>
>>         // <--- another process calls cgroup_pidlist_start() here
>>         // since the mutex is released
>>
>>         flush_workqueue(cgroup_pidlist_destroy_wq); // <--- another
>>         // process adds a new pidlist and queues work in parallel
>>         BUG_ON(!list_empty(&cgrp->pidlists)); // <--- this check
>>         // passes; list_add() could happen after this
>>
>
> Did you confirm this is what happened when the bug was triggered?
>
> I don't think the race condition you described exists. In the 3.14
> kernel, cgroup_diput() won't be called if there is any thread running
> cgroup_pidlist_start(). This is guaranteed by vfs.
>
> But newer kernels are different. Looks like the bug exists in those
> kernels.

Newer kernels should also be fine. If cgroup_pidlist_destroy_all() is called, it means kernfs has already removed the tasks file, and even if you still have it opened, when you try to read it, it will immediately return an errno:

    fd = open(cgrp/tasks)
    cgroup_rmdir(cgrp)
      cgroup_destroy_locked(c)
        kernfs_remove()
      ...
      css_free_work_fn()
        cgroup_pidlist_destroy_all()
    read(fd of cgrp/tasks)
      return -ENODEV

So cgroup_pidlist_destroy_all() won't race with cgroup_pidlist_start().
Re: Kernel crash in cgroup_pidlist_destroy_work_fn()
On 2014/9/17 7:56, Cong Wang wrote:
> Hi, Tejun
>
> We saw some kernel NULL pointer dereferences in
> cgroup_pidlist_destroy_work_fn(), more precisely at
> __mutex_lock_slowpath(), on 3.14. I can show you the full stack trace
> on request.
>

Yes, please.

> Looking at the code, it seems flush_workqueue() doesn't care about new
> incoming work; it only processes currently pending items. If this is
> correct, then we could have the following race condition:
>
> cgroup_pidlist_destroy_all():
>         //...
>         mutex_lock(&cgrp->pidlist_mutex);
>         list_for_each_entry_safe(l, tmp_l, &cgrp->pidlists, links)
>                 mod_delayed_work(cgroup_pidlist_destroy_wq,
>                                  &l->destroy_dwork, 0);
>         mutex_unlock(&cgrp->pidlist_mutex);
>
>         // <--- another process calls cgroup_pidlist_start() here
>         // since the mutex is released
>
>         flush_workqueue(cgroup_pidlist_destroy_wq); // <--- another
>         // process adds a new pidlist and queues work in parallel
>         BUG_ON(!list_empty(&cgrp->pidlists)); // <--- this check
>         // passes; list_add() could happen after this
>

Did you confirm this is what happened when the bug was triggered?

I don't think the race condition you described exists. In the 3.14 kernel, cgroup_diput() won't be called if there is any thread running cgroup_pidlist_start(). This is guaranteed by vfs.

But newer kernels are different. Looks like the bug exists in those kernels.

> Therefore, the newly added pidlist will point to a freed cgroup, and
> when it is freed in the delayed work we will crash.
>
> The attached patch (compile test ONLY) could be a possible fix, since
> it checks and holds a refcount on this cgroup in
> cgroup_pidlist_start(). But I could very easily miss something here,
> since there are many cgroup changes after 3.14 and I don't follow
> cgroup development.
>
> What do you think?
Kernel crash in cgroup_pidlist_destroy_work_fn()
Hi, Tejun

We saw some kernel NULL pointer dereferences in cgroup_pidlist_destroy_work_fn(), more precisely at __mutex_lock_slowpath(), on 3.14. I can show you the full stack trace on request.

Looking at the code, it seems flush_workqueue() doesn't care about new incoming work; it only processes currently pending items. If this is correct, then we could have the following race condition:

cgroup_pidlist_destroy_all():
        //...
        mutex_lock(&cgrp->pidlist_mutex);
        list_for_each_entry_safe(l, tmp_l, &cgrp->pidlists, links)
                mod_delayed_work(cgroup_pidlist_destroy_wq,
                                 &l->destroy_dwork, 0);
        mutex_unlock(&cgrp->pidlist_mutex);

        // <--- another process calls cgroup_pidlist_start() here
        // since the mutex is released

        flush_workqueue(cgroup_pidlist_destroy_wq); // <--- another
        // process adds a new pidlist and queues work in parallel
        BUG_ON(!list_empty(&cgrp->pidlists)); // <--- this check
        // passes; list_add() could happen after this

Therefore, the newly added pidlist will point to a freed cgroup, and when it is freed in the delayed work we will crash.

The attached patch (compile test ONLY) could be a possible fix, since it checks and holds a refcount on this cgroup in cgroup_pidlist_start(). But I could very easily miss something here, since there are many cgroup changes after 3.14 and I don't follow cgroup development.

What do you think? Thanks.
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 940aced..2206151 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -4084,6 +4084,9 @@ static void *cgroup_pidlist_start(struct seq_file *s, loff_t *pos)
 	int index = 0, pid = *pos;
 	int *iter, ret;
 
+	if (!cgroup_tryget(cgrp))
+		return NULL;
+
 	mutex_lock(&cgrp->pidlist_mutex);
 
 	/*
@@ -4132,13 +4135,15 @@ static void *cgroup_pidlist_start(struct seq_file *s, loff_t *pos)
 
 static void cgroup_pidlist_stop(struct seq_file *s, void *v)
 {
+	struct cgroup *cgrp = seq_css(s)->cgroup;
 	struct kernfs_open_file *of = s->private;
 	struct cgroup_pidlist *l = of->priv;
 
 	if (l)
 		mod_delayed_work(cgroup_pidlist_destroy_wq, &l->destroy_dwork,
 				 CGROUP_PIDLIST_DESTROY_DELAY);
-	mutex_unlock(&seq_css(s)->cgroup->pidlist_mutex);
+	mutex_unlock(&cgrp->pidlist_mutex);
+	cgroup_put(cgrp);
 }
 
 static void *cgroup_pidlist_next(struct seq_file *s, void *v, loff_t *pos)