Re: Kernel crash in cgroup_pidlist_destroy_work_fn()

2014-09-18 Thread Cong Wang
Hi, Zefan


Thanks for your reply. You are right, the vfs refcount should guarantee
that there are no more file reads before we destroy that cgroup. I thought
there was somewhere else that could release that cgroup refcount.

Maybe I didn't make it clear: this bug is hardly reproducible; we have only
seen it once (otherwise I would have more clues and could test my patch).
The full stack trace is at the end of this email, for your reference.

Reading the code again, I thought we could have a race between in-flight
work and pending work, but cgroup_pidlist_destroy_work_fn() should
get this right by checking the pending bit after taking the mutex. What's
more, we should only have one in-flight work item on this workqueue, so
the code seems pretty safe.
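
For reference, the check I mean looks roughly like this -- a sketch from
memory of the 3.14 code, so details may differ slightly (the workqueue is
created with max_active == 1, which is why at most one work item should
be in flight):

    static void cgroup_pidlist_destroy_work_fn(struct work_struct *work)
    {
            struct delayed_work *dwork = to_delayed_work(work);
            struct cgroup_pidlist *l = container_of(dwork,
                            struct cgroup_pidlist, destroy_dwork);
            struct cgroup_pidlist *tofree = NULL;

            /* the reported oops would be here if l->owner points into
             * freed memory (a NULL deref at a small offset fits a
             * mutex field inside struct cgroup) */
            mutex_lock(&l->owner->pidlist_mutex);

            /* destroy iff we didn't get queued again; destroy_dwork
             * can only be (re)queued while pidlist_mutex is held */
            if (!delayed_work_pending(dwork)) {
                    list_del(&l->links);
                    pidlist_free(l->list);
                    put_pid_ns(l->key.ns);
                    tofree = l;
            }

            mutex_unlock(&l->owner->pidlist_mutex);
            kfree(tofree);
    }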

The unbalanced mutex_unlock() you fixed is actually hard to hit, since
memory must have run out in that case, and I didn't get any OOM
information from the bug reporter. So I think I can rule that case out.


Thanks!

--->

[15955.313964] BUG: unable to handle kernel NULL pointer dereference at 0000000000000027
[15955.314958] IP: [<ffffffff814ad8b5>] __mutex_lock_slowpath+0x2f/0x1a0
[15955.316088] PGD 0
[15955.316310] Oops: 0000 [#1] SMP
[15955.316810] Modules linked in: tun ipv6 bonding dm_multipath scsi_dh video sbs sbshc acpi_pad acpi_ipmi parport_pc lp parport tcp_diag inet_diag ipmi_devintf dell_rbu sg igb i2c_algo_bit ptp pps_core iTCO_wdt iTCO_vendor_support dcdbas hed ipmi_si ipmi_msghandler i7core_edac i2c_i801 shpchp i2c_core edac_core ioatdma dca lpc_ich mfd_core microcode ahci libahci libata sd_mod scsi_mod
[15955.321422] CPU: 5 PID: 7280 Comm: kworker/5:2 Not tainted 3.14.14 #1
[15955.322275] Hardware name: Dell   C6100   /0D61XP, BIOS 1.66 12/23/2011
[15955.323326] Workqueue: cgroup_pidlist_destroy cgroup_pidlist_destroy_work_fn
[15955.324317] task: ffff8800ad899840 ti: ffff880323b50000 task.ti: ffff880323b50000
[15955.325236] RIP: 0010:[<ffffffff814ad8b5>]  [<ffffffff814ad8b5>] __mutex_lock_slowpath+0x2f/0x1a0
[15955.326465] RSP: 0018:ffff880323b51dc0  EFLAGS: 00010286
[15955.327178] RAX: 0000000000000000 RBX: ffff88032c7de0f8 RCX: ffff88032e188ac0
[15955.328061] RDX: 0000000000008000 RSI: 0000000000000140 RDI: ffff88032c7de0f8
[15955.328927] RBP: ffff880323b51e08 R08: ffff88032e188ac0 R09: 0000000000000000
[15955.329784] R10: 0000000000000000 R11: 0000000000000040 R12: 0000000000000000
[15955.330778] R13: ffff8800ad899840 R14: ffff880333cb7200 R15: 0000000000000000
[15955.331650] FS:  0000000000000000(0000) GS:ffff880333ca0000(0000) knlGS:0000000000000000
[15955.332616] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[15955.29] CR2: 0000000000000027 CR3: 0000000001a0c000 CR4: 00000000000007e0
[15955.334584] Stack:
[15955.334798]  ffff880323b51e08 ffffffff8100163f ffff880323b51de8 ffffffff8107d6e8
[15955.335751]  ffff88032c7de0f8 0000000000000000 ffff880333cb2300 ffff880333cb7200
[15955.336831]  0000000000000000 ffff880323b51e20 ffffffff814ada45 ffff88032e188ab8
[15955.337779] Call Trace:
[15955.338053]  [<ffffffff8100163f>] ? __switch_to+0x1ca/0x3d2
[15955.338739]  [<ffffffff8107d6e8>] ? mmdrop+0x11/0x20
[15955.339386]  [<ffffffff814ada45>] mutex_lock+0x1f/0x2f
[15955.340054]  [<ffffffff810bd2b9>] cgroup_pidlist_destroy_work_fn+0x22/0x66
[15955.340898]  [<ffffffff810710b9>] process_one_work+0x16d/0x289
[15955.341629]  [<ffffffff81071877>] worker_thread+0x149/0x1f5
[15955.342327]  [<ffffffff8107172e>] ? rescuer_thread+0x27f/0x27f
[15955.343049]  [<ffffffff810764db>] kthread+0xae/0xb6
[15955.343685]  [<ffffffff8107...>] ? idle_worker_timeout+0x4/0x5b
[15955.344499]  [<ffffffff8107642d>] ? __kthread_parkme+0x61/0x61
[15955.345245]  [<ffffffff814b532c>] ret_from_fork+0x7c/0xb0
[15955.345934]  [<ffffffff8107642d>] ? __kthread_parkme+0x61/0x61
[15955.346663] Code: 90 55 48 89 e5 41 57 41 56 41 55 41 54 53 48 89 fb 48 83 e4 f0 48 83 ec 20 65 4c 8b 2c 25 80 b8 00 00 48 8b 47 18 48 85 c0 74 05 <8b> 40 28 eb 05 b8 01 00 00 00 85 c0 0f 84 9f 00 00 00 4c 8d 63
[15955.349439] RIP  [<ffffffff814ad8b5>] __mutex_lock_slowpath+0x2f/0x1a0
[15955.350270]  RSP <ffff880323b51dc0>
[15955.350635] CR2: 0000000000000027
[15955.351481] ---[ end trace 169754b1b86c507e ]---


Re: Kernel crash in cgroup_pidlist_destroy_work_fn()

2014-09-17 Thread Li Zefan
On 2014/9/17 13:29, Li Zefan wrote:
> On 2014/9/17 7:56, Cong Wang wrote:
>> Hi, Tejun
>>
>>
>> We saw a kernel NULL pointer dereference in
>> cgroup_pidlist_destroy_work_fn(), more precisely at
>> __mutex_lock_slowpath(), on 3.14. I can show you the full stack trace
>> on request.
>>
> 
> Yes, please.
> 
>> Looking at the code, it seems flush_workqueue() doesn't care about new
>> incoming work; it only processes currently pending items. If this is
>> correct, then we could have the following race condition:
>>
>> cgroup_pidlist_destroy_all():
>> //...
>> mutex_lock(&cgrp->pidlist_mutex);
>> list_for_each_entry_safe(l, tmp_l, &cgrp->pidlists, links)
>> mod_delayed_work(cgroup_pidlist_destroy_wq,
>> &l->destroy_dwork, 0);
>> mutex_unlock(&cgrp->pidlist_mutex);
>>
>> // <--- another process calls cgroup_pidlist_start() here
>> since the mutex is released
>>
>> flush_workqueue(cgroup_pidlist_destroy_wq); // <--- another
>> process adds a new pidlist and queues work in parallel
>> BUG_ON(!list_empty(&cgrp->pidlists)); // <--- this check
>> passes; list_add() could happen after this
>>
> 
> Did you confirm this is what happened when the bug was triggered?
> 
> I don't think the race condition you described exists. In the 3.14 kernel,
> cgroup_diput() won't be called while any thread is running
> cgroup_pidlist_start(). This is guaranteed by the vfs.
> 
> But newer kernels are different. Looks like the bug exists in those
> kernels.
> 

Newer kernels should also be fine.

If cgroup_pidlist_destroy_all() is called, it means kernfs has already
removed the tasks file, and even if you still have it open, any attempt
to read it will immediately return an error:

fd = open(cgrp/tasks)
cgroup_rmdir(cgrp)
  cgroup_destroy_locked(c)
kernfs_remove()
  ...
css_free_work_fn()
  cgroup_pidlist_destroy_all()
   read(fd of cgrp/tasks)
 return -ENODEV

So cgroup_pidlist_destroy_all() won't race with cgroup_pidlist_start().
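
The gate is in the kernfs read path, roughly like this (paraphrased from
my memory of fs/kernfs/file.c, not verbatim): kernfs_remove() deactivates
the node first, so kernfs_get_active() fails and the cgroup seq_file
->start(), i.e. cgroup_pidlist_start(), is never reached.

    static void *kernfs_seq_start(struct seq_file *sf, loff_t *ppos)
    {
            struct kernfs_open_file *of = sf->private;

            mutex_lock(&of->mutex);
            if (!kernfs_get_active(of->kn))
                    return ERR_PTR(-ENODEV);  /* node already removed */
            /* ... only now is the cgroup ->seq_start() called ... */
    }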



Re: Kernel crash in cgroup_pidlist_destroy_work_fn()

2014-09-16 Thread Li Zefan
On 2014/9/17 7:56, Cong Wang wrote:
> Hi, Tejun
> 
> 
> We saw a kernel NULL pointer dereference in
> cgroup_pidlist_destroy_work_fn(), more precisely at
> __mutex_lock_slowpath(), on 3.14. I can show you the full stack trace
> on request.
> 

Yes, please.

> Looking at the code, it seems flush_workqueue() doesn't care about new
> incoming work; it only processes currently pending items. If this is
> correct, then we could have the following race condition:
> 
> cgroup_pidlist_destroy_all():
> //...
> mutex_lock(&cgrp->pidlist_mutex);
> list_for_each_entry_safe(l, tmp_l, &cgrp->pidlists, links)
> mod_delayed_work(cgroup_pidlist_destroy_wq,
> &l->destroy_dwork, 0);
> mutex_unlock(&cgrp->pidlist_mutex);
> 
> // <--- another process calls cgroup_pidlist_start() here
> since the mutex is released
> 
> flush_workqueue(cgroup_pidlist_destroy_wq); // <--- another
> process adds a new pidlist and queues work in parallel
> BUG_ON(!list_empty(&cgrp->pidlists)); // <--- this check
> passes; list_add() could happen after this
> 

Did you confirm this is what happened when the bug was triggered?

I don't think the race condition you described exists. In the 3.14 kernel,
cgroup_diput() won't be called while any thread is running
cgroup_pidlist_start(). This is guaranteed by the vfs.
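
To spell out the lifetime rule I mean (a sketch, not actual code): on
3.14 the tasks file is an ordinary cgroupfs dentry, and an in-flight
read pins it, so the final dput() that would trigger cgroup_diput()
cannot happen concurrently:

    fd = open("cgrp/tasks", O_RDONLY); /* takes a dentry reference     */
    read(fd, buf, len);                /* cgroup_pidlist_start() runs;
                                          the dentry stays pinned      */
    close(fd);                         /* final dput() is the earliest
                                          point cgroup_diput() can run */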

But newer kernels are different. Looks like the bug exists in those
kernels.

> 
> Therefore, the newly added pidlist will point to a freed cgroup, and
> we will crash when it is freed by the delayed work.
> 
> The attached patch (compile-tested ONLY) could be a possible fix, since
> it checks and holds a refcount on the cgroup in
> cgroup_pidlist_start(). But I could very easily have missed something,
> since there have been many cgroup changes after 3.14 and I don't follow
> cgroup development.
> 
> What do you think?
> 



Kernel crash in cgroup_pidlist_destroy_work_fn()

2014-09-16 Thread Cong Wang
Hi, Tejun


We saw a kernel NULL pointer dereference in
cgroup_pidlist_destroy_work_fn(), more precisely at
__mutex_lock_slowpath(), on 3.14. I can show you the full stack trace
on request.

Looking at the code, it seems flush_workqueue() doesn't care about new
incoming work; it only processes currently pending items. If this is
correct, then we could have the following race condition:

cgroup_pidlist_destroy_all():
//...
mutex_lock(&cgrp->pidlist_mutex);
list_for_each_entry_safe(l, tmp_l, &cgrp->pidlists, links)
mod_delayed_work(cgroup_pidlist_destroy_wq,
&l->destroy_dwork, 0);
mutex_unlock(&cgrp->pidlist_mutex);

// <--- another process calls cgroup_pidlist_start() here
since the mutex is released

flush_workqueue(cgroup_pidlist_destroy_wq); // <--- another
process adds a new pidlist and queues work in parallel
BUG_ON(!list_empty(&cgrp->pidlists)); // <--- this check
passes; list_add() could happen after this


Therefore, the newly added pidlist will point to a freed cgroup, and
we will crash when it is freed by the delayed work.

The attached patch (compile-tested ONLY) could be a possible fix, since
it checks and holds a refcount on the cgroup in
cgroup_pidlist_start(). But I could very easily have missed something,
since there have been many cgroup changes after 3.14 and I don't follow
cgroup development.

What do you think?

Thanks.
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 940aced..2206151 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -4084,6 +4084,9 @@ static void *cgroup_pidlist_start(struct seq_file *s, loff_t *pos)
 	int index = 0, pid = *pos;
 	int *iter, ret;
 
+	if (!cgroup_tryget(cgrp))
+		return NULL;
+
 	mutex_lock(&cgrp->pidlist_mutex);
 
 	/*
@@ -4132,13 +4135,15 @@ static void *cgroup_pidlist_start(struct seq_file *s, loff_t *pos)
 
 static void cgroup_pidlist_stop(struct seq_file *s, void *v)
 {
+	struct cgroup *cgrp = seq_css(s)->cgroup;
 	struct kernfs_open_file *of = s->private;
 	struct cgroup_pidlist *l = of->priv;
 
 	if (l)
 		mod_delayed_work(cgroup_pidlist_destroy_wq, &l->destroy_dwork,
 				 CGROUP_PIDLIST_DESTROY_DELAY);
-	mutex_unlock(&seq_css(s)->cgroup->pidlist_mutex);
+	mutex_unlock(&cgrp->pidlist_mutex);
+	cgroup_put(cgrp);
 }
 
 static void *cgroup_pidlist_next(struct seq_file *s, void *v, loff_t *pos)
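
For clarity, cgroup_tryget()/cgroup_put() used above are meant to be the
existing helpers operating on the cgroup's self css -- roughly this, from
my reading of the current tree (paraphrased, not verbatim):

    static bool cgroup_tryget(struct cgroup *cgrp)
    {
            /* fails once the cgroup has started dying */
            return css_tryget(&cgrp->self);
    }

    static void cgroup_put(struct cgroup *cgrp)
    {
            css_put(&cgrp->self);
    }

so if the cgroup is already being destroyed, cgroup_pidlist_start() bails
out before touching cgrp->pidlists.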

