Re: [PATCH] futex: fix a race condition between REQUEUE_PI and task death

2014-10-29 Thread Mike Galbraith
On Wed, 2014-10-29 at 21:28 -0700, Darren Hart wrote: 
> On Thu, Oct 23, 2014 at 03:28:07PM -0400, Brian Silverman wrote:
> > Here's the test code:
> > 
> 
> I want to say "Thanks!" and pull it into futextest... but destroying 
> filesystems
> and BIOS errors?!? may not be ideal failure detection modes.

Yup, suse now has a test farm box with a slightly bent up btrfs root.
I'll have to remember to poke the install button when I return it :)

-Mike

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] futex: fix a race condition between REQUEUE_PI and task death

2014-10-29 Thread Darren Hart
On Thu, Oct 23, 2014 at 03:28:07PM -0400, Brian Silverman wrote:
> Here's the test code:
> 

I want to say "Thanks!" and pull it into futextest... but destroying filesystems
and BIOS errors?!? may not be ideal failure detection modes.

(Apologies for being so late to this particular party).

-- 
Darren Hart
Intel Open Source Technology Center


Re: [PATCH] futex: fix a race condition between REQUEUE_PI and task death

2014-10-26 Thread Thomas Gleixner
On Sat, 25 Oct 2014, Brian Silverman wrote:
> On Sat, 25 Oct 2014, Thomas Gleixner wrote:
> 
> > > pi_state_free and exit_pi_state_list both clean up futex_pi_state's.
> > > exit_pi_state_list takes the hb lock first, and most callers of
> > > pi_state_free do too. requeue_pi didn't, which causes lots of problems.
> >
> > "causes lots of problems" is not really a good explanation of the root
> > cause. That wants a proper description of the race, i.e.
> >
> > CPU 0  CPU 1
> > ...
> >
> > I'm certainly familiar with that code, but it took me
> > quite some time to understand what's going on. The casual reader will
> > just go into brain spiral mode and give up.
> 
> Thinking about it again, I'm no longer so sure that exit_pi_state_list is the
> only place that it can race against. However, I did catch that one with a
> particularly lucky crashdump, so I know it _can_ happen there. Is just
> giving an example for that good?

Sure. That immediately makes it clear what's going on.

Thanks,

tglx


Re: [PATCH] futex: fix a race condition between REQUEUE_PI and task death

2014-10-25 Thread Brian Silverman
On Sat, 25 Oct 2014, Thomas Gleixner wrote:

> > pi_state_free and exit_pi_state_list both clean up futex_pi_state's.
> > exit_pi_state_list takes the hb lock first, and most callers of
> > pi_state_free do too. requeue_pi didn't, which causes lots of problems.
>
> "causes lots of problems" is not really a good explanation of the root
> cause. That wants a proper description of the race, i.e.
>
> CPU 0  CPU 1
> ...
>
> I'm certainly familiar with that code, but it took me
> quite some time to understand what's going on. The casual reader will
> just go into brain spiral mode and give up.

Thinking about it again, I'm no longer so sure that exit_pi_state_list is the
only place that it can race against. However, I did catch that one with a
particularly lucky crashdump, so I know it _can_ happen there. Is just
giving an example for that good?

> >  static void free_pi_state(struct futex_pi_state *pi_state)
>
> > @@ -1558,6 +1552,14 @@ retry_private:
> >   ret = get_futex_value_locked(&curval, uaddr1);
> >
> >   if (unlikely(ret)) {
> > + if (flags & FLAGS_SHARED && pi_state != NULL) {
>
> Why is this dependent on "flags & FLAGS_SHARED"? The shared/private
> property has nothing to do with that at all, but I might be missing
> something.

Nothing... Good catch. It was a bad rebase.

Thanks,

Brian


Re: [PATCH] futex: fix a race condition between REQUEUE_PI and task death

2014-10-25 Thread Thomas Gleixner
On Thu, 23 Oct 2014, Brian Silverman wrote:

First of all. Nice catch!

> pi_state_free and exit_pi_state_list both clean up futex_pi_state's.
> exit_pi_state_list takes the hb lock first, and most callers of
> pi_state_free do too. requeue_pi didn't, which causes lots of problems.

"causes lots of problems" is not really a good explanation of the root
cause. That wants a proper description of the race, i.e.

CPU 0  CPU 1
...

I'm certainly familiar with that code, but it took me
quite some time to understand what's going on. The casual reader will
just go into brain spiral mode and give up.

> +/**
> + * Must be called with the hb lock held.
> + */

Having that comment is nice, but there is nothing which enforces
it. So we really should add another argument to that function,
i.e. struct futex_hash_bucket *hb and verify that the lock is held at
least when lockdep is enabled.

>  static void free_pi_state(struct futex_pi_state *pi_state)

> @@ -1558,6 +1552,14 @@ retry_private:
>   ret = get_futex_value_locked(&curval, uaddr1);
>  
>   if (unlikely(ret)) {
> + if (flags & FLAGS_SHARED && pi_state != NULL) {

Why is this dependent on "flags & FLAGS_SHARED"? The shared/private
property has nothing to do with that at all, but I might be missing
something.

> + /*
> +  * We will have to lookup the pi_state again, so
> +  * free this one to keep the accounting correct.
> +  */
> + free_pi_state(pi_state);
> + pi_state = NULL;

Instead of copying the same code over and over, we should change
free_pi_state() to handle being called with a NULL pointer.

Thanks,

tglx


Re: [PATCH] futex: fix a race condition between REQUEUE_PI and task death

2014-10-23 Thread Mike Galbraith
(CCs more eyeballs)

On Thu, 2014-10-23 at 15:28 -0400, Brian Silverman wrote: 
> Here's the test code:

Which took a 2 socket 28 core box (NOPREEMPT) out in short order.  With
patchlet applied, looks like it'll stay up (37 minutes and counting),
I'll squeak if it explodes.

Tested-by: Mike Galbraith 

[  387.396020] BUG: unable to handle kernel NULL pointer dereference at 
0b34
[  387.414177] IP: [] free_pi_state+0x4e/0xb0
[  387.427638] PGD 8394fe067 PUD 847c37067 PMD 0
[  387.438457] Oops: 0002 [#1] SMP
[  387.446534] Modules linked in: nfs(E) lockd(E) grace(E) sunrpc(E) fscache(E) 
iscsi_ibft(E) iscsi_boot_sysfs(E) af_packet(E) ext4(E) crc16(E) mbcache(E) 
jbd2(E) joydev(E) hid_generic(E) intel_rapl(E) usbhid(E) 
x86_pkg_temp_thermal(E) iTCO_wdt(E) intel_powerclamp(E) iTCO_vendor_support(E) 
coretemp(E) kvm_intel(E) kvm(E) crct10dif_pclmul(E) crc32_pclmul(E) 
ghash_clmulni_intel(E) aesni_intel(E) aes_x86_64(E) lrw(E) ixgbe(E) gf128mul(E) 
glue_helper(E) ptp(E) ablk_helper(E) pps_core(E) cryptd(E) sb_edac(E) pcspkr(E) 
mdio(E) edac_core(E) ipmi_si(E) dca(E) ipmi_msghandler(E) lpc_ich(E) 
mfd_core(E) wmi(E) acpi_power_meter(E) xhci_pci(E) mei_me(E) i2c_i801(E) 
acpi_pad(E) processor(E) mei(E) xhci_hcd(E) shpchp(E) button(E) dm_mod(E) 
btrfs(E) xor(E) raid6_pq(E) sr_mod(E) cdrom(E) sd_mod(E) mgag200(E) 
syscopyarea(E)
[  387.610406]  sysfillrect(E) sysimgblt(E) i2c_algo_bit(E) drm_kms_helper(E) 
ehci_pci(E) ttm(E) ahci(E) ehci_hcd(E) crc32c_intel(E) libahci(E) drm(E) 
usbcore(E) libata(E) usb_common(E) sg(E) scsi_mod(E) autofs4(E)
[  387.651513] CPU: 27 PID: 136696 Comm: futex-exit-race Tainted: G
E  3.18.0-default #51
[  387.672339] Hardware name: Intel Corporation S2600WTT/S2600WTT, BIOS 
GRNDSDP1.86B.0030.R03.1405061547 05/06/2014
[  387.696066] task: 880833002250 ti: 880830d04000 task.ti: 
880830d04000
[  387.713855] RIP: 0010:[]  [] 
free_pi_state+0x4e/0xb0
[  387.733015] RSP: 0018:880830d07d78  EFLAGS: 00010046
[  387.746030] RAX:  RBX: 8804592ea340 RCX: 0bb6
[  387.763089] RDX: 8804592ea340 RSI: 880866c3fb48 RDI: 88083b40cb84
[  387.780167] RBP: 880830d07d88 R08: c900136b8a70 R09: 88046afe8150
[  387.797255] R10:  R11: 000101b0 R12: 880830d07e08
[  387.814360] R13:  R14: c900136b8a70 R15: 0001
[  387.831476] FS:  7fb2637c5700() GS:88087f1a() 
knlGS:
[  387.850710] CS:  0010 DS:  ES:  CR0: 80050033
[  387.864766] CR2: 0b34 CR3: 0008374e7000 CR4: 001407e0
[  387.881901] Stack:
[  387.887707]  880830d07d88  880830d07e58 
810d52c4
[  387.905497]  001b0001 880830d07e20 0002018230d07dc8 
0001
[  387.923290]  c900136b8a84 880830d07f08 7fb2637d838c 
00010001
[  387.941071] Call Trace:
[  387.947864]  [] futex_requeue+0x2b4/0x8e0
[  387.961578]  [] do_futex+0xa9/0x580
[  387.974128]  [] ? do_nanosleep+0x82/0x110
[  387.987814]  [] ? hrtimer_nanosleep+0xac/0x160
[  388.002458]  [] SyS_futex+0x71/0x150
[  388.015181]  [] system_call_fastpath+0x12/0x17
[  388.029826] Code: 30 48 85 ff 74 41 48 81 c7 34 0b 00 00 e8 7b 1f 4b 00 48 
8b 43 08 48 8b 13 48 89 42 08 48 89 10 48 89 1b 48 89 5b 08 48 8b 43 30 <66> 83 
80 34 0b 00 00 01 fb 66 0f 1f 44 00 00 48 8b 73 30 48 8d
[  388.075136] RIP  [] free_pi_state+0x4e/0xb0
[  388.089266]  RSP 
[  388.098386] CR2: 0b34




Re: [PATCH] futex: fix a race condition between REQUEUE_PI and task death

2014-10-23 Thread Brian Silverman
Here's the test code:

#define _GNU_SOURCE

#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <assert.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <inttypes.h>
#include <linux/futex.h>

// Whether to use a pthread mutex+condvar or do the raw futex operations. Either
// one will break.
// There's less user-space code involved with the non-pthread version, but the
// syscalls I think are relevant end up the same (except they're FUTEX_PRIVATE
// with the pthread version).
#define PTHREAD_MUTEX 0

// The number of processes that repeatedly call RunTest to fork off.
// Making this big (like 2500) makes it reproduce faster, but I have seen it go
// with as little as 4.
// With big values, `ulimit -u unlimited` (or something big at least) is
// necessary before running it (or do it as root).
#define NUMBER_TESTERS 2500

#if PTHREAD_MUTEX
#include 

typedef struct {
  pthread_mutex_t mutex;
  pthread_cond_t condition;
} Shared;
#else
typedef volatile uint32_t my_futex __attribute__((aligned(sizeof(int))));

void condition_wait(my_futex *c, my_futex *m) {
  syscall(SYS_futex, c, FUTEX_WAIT_REQUEUE_PI, *c, NULL, m);
}

void condition_signal(my_futex *c, my_futex *m) {
  syscall(SYS_futex, c, FUTEX_CMP_REQUEUE_PI, 1, 0, m, *c);
}

typedef struct {
  my_futex mutex;
  my_futex condition;
} Shared;
#endif

void RunTest(Shared *shared) {
#if PTHREAD_MUTEX
  pthread_mutexattr_t mutexattr;
  assert(pthread_mutexattr_init(&mutexattr) == 0);
  assert(pthread_mutexattr_setprotocol(&mutexattr, PTHREAD_PRIO_INHERIT) == 0);
  assert(pthread_mutex_init(&shared->mutex, &mutexattr) == 0);
  assert(pthread_mutexattr_destroy(&mutexattr) == 0);

  assert(pthread_cond_init(&shared->condition, NULL) == 0);
#else
  shared->mutex = shared->condition = 0;
#endif

  pid_t child = fork();
  if (child == 0) {
#if PTHREAD_MUTEX
pthread_mutex_lock(&shared->mutex);
pthread_cond_wait(&shared->condition, &shared->mutex);
#else
condition_wait(&shared->condition, &shared->mutex);
#endif
_exit(0);
  } else {
assert(child != -1);
// This sleep makes it reproduce way faster, but it will still break
// eventually without it.
usleep(2);
#if PTHREAD_MUTEX
assert(pthread_cond_broadcast(&shared->condition) == 0);
#else
condition_signal(&shared->condition, &shared->mutex);
#endif
assert(kill(child, SIGKILL) == 0);
assert(waitpid(child, NULL, 0) == child);
  }
}

int main() {
  Shared *shared_memory = (Shared *)(mmap(NULL, sizeof(Shared) * NUMBER_TESTERS,
  PROT_READ | PROT_WRITE,
  MAP_SHARED | MAP_ANONYMOUS,
  -1, 0));
  int i;
  for (i = 0; i < NUMBER_TESTERS; ++i) {
if (fork() == 0) {
  while (1) {
RunTest(&shared_memory[i]);
  }
}
  }
}

On Thu, Oct 23, 2014 at 3:22 PM, Brian Silverman  wrote:
> pi_state_free and exit_pi_state_list both clean up futex_pi_state's.
> exit_pi_state_list takes the hb lock first, and most callers of
> pi_state_free do too. requeue_pi didn't, which causes lots of problems.
> Move the pi_state_free calls in requeue_pi to before it drops the hb
> locks which it's already holding.
>
> I have test code that forks a bunch of processes, which all fork more
> processes that do futex(FUTEX_WAIT_REQUEUE_PI), then do
> futex(FUTEX_CMP_REQUEUE_PI) before killing the waiting processes. That
> test consistently breaks QEMU VMs without this patch.
>
> Although they both make it faster to crash, CONFIG_PREEMPT_RT and lots
> of CPUs are not necessary to reproduce this problem. The symptoms range
> from random reboots to corrupting the test program to corrupting whole
> filesystems to strange BIOS errors. Crashdumps (when they get created at
> all) are usually a mess, to the point of crashing tools used to examine
> them.
>
> Signed-off-by: Brian Silverman 
> ---
> I am not subscribed to the list so please CC me on any responses.
>
>  kernel/futex.c |   32 +---
>  1 file changed, 21 insertions(+), 11 deletions(-)
>
> diff --git a/kernel/futex.c b/kernel/futex.c
> index 815d7af..dc69775 100644
> --- a/kernel/futex.c
> +++ b/kernel/futex.c
> @@ -639,6 +639,9 @@ static struct futex_pi_state * alloc_pi_state(void)
> return pi_state;
>  }
>
> +/**
> + * Must be called with the hb lock held.
> + */
>  static void free_pi_state(struct futex_pi_state *pi_state)
>  {
> if (!atomic_dec_and_test(&pi_state->refcount))
> @@ -1519,15 +1522,6 @@ static int futex_requeue(u32 __user *uaddr1, unsigned 
> int flags,
> }
>
>  retry:
> -   if (pi_state != NULL) {
> -   /*
> -* We will have to lookup the pi_state again, so free this one
> -* to keep the accounting correct.
> -*/
> -   free_pi_state(pi_state);
> -   pi_state = NULL;
> -   }
> -
> ret = get_futex_key(uaddr1, flags & FLAGS_SHARED, &key1, VERIFY_READ);
> if (unlikely(ret != 0))
> goto out;
> @@ -1558,6 +1552,14 @@ retry_private:
> ret = get_futex_value_locked(&curval, uaddr1);
>
> if (unlikely(ret)) {
> 

[PATCH] futex: fix a race condition between REQUEUE_PI and task death

2014-10-23 Thread Brian Silverman
pi_state_free and exit_pi_state_list both clean up futex_pi_state's.
exit_pi_state_list takes the hb lock first, and most callers of
pi_state_free do too. requeue_pi didn't, which causes lots of problems.
Move the pi_state_free calls in requeue_pi to before it drops the hb
locks which it's already holding.

I have test code that forks a bunch of processes, which all fork more
processes that do futex(FUTEX_WAIT_REQUEUE_PI), then do
futex(FUTEX_CMP_REQUEUE_PI) before killing the waiting processes. That
test consistently breaks QEMU VMs without this patch.

Although they both make it faster to crash, CONFIG_PREEMPT_RT and lots
of CPUs are not necessary to reproduce this problem. The symptoms range
from random reboots to corrupting the test program to corrupting whole
filesystems to strange BIOS errors. Crashdumps (when they get created at
all) are usually a mess, to the point of crashing tools used to examine
them.

Signed-off-by: Brian Silverman 
---
I am not subscribed to the list so please CC me on any responses.

 kernel/futex.c |   32 +---
 1 file changed, 21 insertions(+), 11 deletions(-)

diff --git a/kernel/futex.c b/kernel/futex.c
index 815d7af..dc69775 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -639,6 +639,9 @@ static struct futex_pi_state * alloc_pi_state(void)
return pi_state;
 }
 
+/**
+ * Must be called with the hb lock held.
+ */
 static void free_pi_state(struct futex_pi_state *pi_state)
 {
if (!atomic_dec_and_test(&pi_state->refcount))
@@ -1519,15 +1522,6 @@ static int futex_requeue(u32 __user *uaddr1, unsigned 
int flags,
}
 
 retry:
-   if (pi_state != NULL) {
-   /*
-* We will have to lookup the pi_state again, so free this one
-* to keep the accounting correct.
-*/
-   free_pi_state(pi_state);
-   pi_state = NULL;
-   }
-
ret = get_futex_key(uaddr1, flags & FLAGS_SHARED, &key1, VERIFY_READ);
if (unlikely(ret != 0))
goto out;
@@ -1558,6 +1552,14 @@ retry_private:
ret = get_futex_value_locked(&curval, uaddr1);
 
if (unlikely(ret)) {
+   if (flags & FLAGS_SHARED && pi_state != NULL) {
+   /*
+* We will have to lookup the pi_state again, so
+* free this one to keep the accounting correct.
+*/
+   free_pi_state(pi_state);
+   pi_state = NULL;
+   }
double_unlock_hb(hb1, hb2);
hb_waiters_dec(hb2);
 
@@ -1617,6 +1619,10 @@ retry_private:
case 0:
break;
case -EFAULT:
+   if (pi_state != NULL) {
+   free_pi_state(pi_state);
+   pi_state = NULL;
+   }
double_unlock_hb(hb1, hb2);
hb_waiters_dec(hb2);
put_futex_key(&key2);
@@ -1632,6 +1638,10 @@ retry_private:
 *   exit to complete.
 * - The user space value changed.
 */
+   if (pi_state != NULL) {
+   free_pi_state(pi_state);
+   pi_state = NULL;
+   }
double_unlock_hb(hb1, hb2);
hb_waiters_dec(hb2);
put_futex_key(&key2);
@@ -1708,6 +1718,8 @@ retry_private:
}
 
 out_unlock:
+   if (pi_state != NULL)
+   free_pi_state(pi_state);
double_unlock_hb(hb1, hb2);
hb_waiters_dec(hb2);
 
@@ -1725,8 +1737,6 @@ out_put_keys:
 out_put_key1:
put_futex_key(&key1);
 out:
-   if (pi_state != NULL)
-   free_pi_state(pi_state);
return ret ? ret : task_count;
 }
 
-- 
1.7.10.4



[PATCH] futex: fix a race condition between REQUEUE_PI and task death

2014-10-23 Thread Brian Silverman
pi_state_free and exit_pi_state_list both clean up futex_pi_state's.
exit_pi_state_list takes the hb lock first, and most callers of
pi_state_free do too. requeue_pi didn't, which causes lots of problems.
Move the pi_state_free calls in requeue_pi to before it drops the hb
locks which it's already holding.

I have test code that forks a bunch of processes, which all fork more
processes that do futex(FUTEX_WAIT_REQUEUE_PI), then do
futex(FUTEX_CMP_REQUEUE_PI) before killing the waiting processes. That
test consistently breaks QEMU VMs without this patch.

Although they both make it faster to crash, CONFIG_PREEMPT_RT and lots
of CPUs are not necessary to reproduce this problem. The symptoms range
from random reboots to corrupting the test program to corrupting whole
filesystems to strange BIOS errors. Crashdumps (when they get created at
all) are usually a mess, to the point of crashing tools used to examine
them.

Signed-off-by: Brian Silverman bsilver16...@gmail.com
---
I am not subscribed to the list so please CC me on any responses.

 kernel/futex.c |   32 +---
 1 file changed, 21 insertions(+), 11 deletions(-)

diff --git a/kernel/futex.c b/kernel/futex.c
index 815d7af..dc69775 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -639,6 +639,9 @@ static struct futex_pi_state * alloc_pi_state(void)
return pi_state;
 }
 
+/**
+ * Must be called with the hb lock held.
+ */
 static void free_pi_state(struct futex_pi_state *pi_state)
 {
if (!atomic_dec_and_test(pi_state-refcount))
@@ -1519,15 +1522,6 @@ static int futex_requeue(u32 __user *uaddr1, unsigned 
int flags,
}
 
 retry:
-   if (pi_state != NULL) {
-   /*
-* We will have to lookup the pi_state again, so free this one
-* to keep the accounting correct.
-*/
-   free_pi_state(pi_state);
-   pi_state = NULL;
-   }
-
ret = get_futex_key(uaddr1, flags  FLAGS_SHARED, key1, VERIFY_READ);
if (unlikely(ret != 0))
goto out;
@@ -1558,6 +1552,14 @@ retry_private:
ret = get_futex_value_locked(curval, uaddr1);
 
if (unlikely(ret)) {
+   if (flags  FLAGS_SHARED  pi_state != NULL) {
+   /*
+* We will have to lookup the pi_state again, so
+* free this one to keep the accounting correct.
+*/
+   free_pi_state(pi_state);
+   pi_state = NULL;
+   }
double_unlock_hb(hb1, hb2);
hb_waiters_dec(hb2);
 
@@ -1617,6 +1619,10 @@ retry_private:
case 0:
break;
case -EFAULT:
+   if (pi_state != NULL) {
+   free_pi_state(pi_state);
+   pi_state = NULL;
+   }
double_unlock_hb(hb1, hb2);
hb_waiters_dec(hb2);
put_futex_key(key2);
@@ -1632,6 +1638,10 @@ retry_private:
 *   exit to complete.
 * - The user space value changed.
 */
+   if (pi_state != NULL) {
+   free_pi_state(pi_state);
+   pi_state = NULL;
+   }
double_unlock_hb(hb1, hb2);
hb_waiters_dec(hb2);
put_futex_key(key2);
@@ -1708,6 +1718,8 @@ retry_private:
}
 
 out_unlock:
+   if (pi_state != NULL)
+   free_pi_state(pi_state);
double_unlock_hb(hb1, hb2);
hb_waiters_dec(hb2);
 
@@ -1725,8 +1737,6 @@ out_put_keys:
 out_put_key1:
put_futex_key(key1);
 out:
-   if (pi_state != NULL)
-   free_pi_state(pi_state);
return ret ? ret : task_count;
 }
 
-- 
1.7.10.4

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] futex: fix a race condition between REQUEUE_PI and task death

2014-10-23 Thread Brian Silverman
Here's the test code:

#define _GNU_SOURCE

#include unistd.h
#include sys/types.h
#include sys/wait.h
#include assert.h
#include sys/mman.h
#include sys/syscall.h
#include inttypes.h
#include linux/futex.h

// Whether to use a pthread mutex+condvar or do the raw futex operations. Either
// one will break.
// There's less user-space code involved with the non-pthread version, but the
// syscalls I think are relevant end up the same (except they're FUTEX_PRIVATE
// with the pthread version).
#define PTHREAD_MUTEX 0

// The number of processes that repeatedly call RunTest to fork off.
// Making this big (like 2500) makes it reproduce faster, but I have seen it go
// with as little as 4.
// With big values, `ulimit -u unlimited` (or something big at least) is
// necessary before running it (or do it as root).
#define NUMBER_TESTERS 2500

#if PTHREAD_MUTEX
#include pthread.h

typedef struct {
  pthread_mutex_t mutex;
  pthread_cond_t condition;
} Shared;
#else
typedef volatile uint32_t my_futex __attribute__((aligned(sizeof(int;

void condition_wait(my_futex *c, my_futex *m) {
  syscall(SYS_futex, c, FUTEX_WAIT_REQUEUE_PI, *c, NULL, m);
}

void condition_signal(my_futex *c, my_futex *m) {
  syscall(SYS_futex, c, FUTEX_CMP_REQUEUE_PI, 1, 0, m, *c);
}

typedef struct {
  my_futex mutex;
  my_futex condition;
} Shared;
#endif

void RunTest(Shared *shared) {
#if PTHREAD_MUTEX
  pthread_mutexattr_t mutexattr;
  assert(pthread_mutexattr_init(mutexattr) == 0);
  assert(pthread_mutexattr_setprotocol(mutexattr, PTHREAD_PRIO_INHERIT) == 0);
  assert(pthread_mutex_init(shared-mutex, mutexattr) == 0);
  assert(pthread_mutexattr_destroy(mutexattr) == 0);

  assert(pthread_cond_init(shared-condition, NULL) == 0);
#else
  shared-mutex = shared-condition = 0;
#endif

  pid_t child = fork();
  if (child == 0) {
#if PTHREAD_MUTEX
pthread_mutex_lock(shared-mutex);
pthread_cond_wait(shared-condition, shared-mutex);
#else
condition_wait(shared-condition, shared-mutex);
#endif
_exit(0);
  } else {
assert(child != -1);
// This sleep makes it reproduce way faster, but it will still break
// eventually without it.
usleep(2);
#if PTHREAD_MUTEX
assert(pthread_cond_broadcast(shared-condition) == 0);
#else
condition_signal(shared-condition, shared-mutex);
#endif
assert(kill(child, SIGKILL) == 0);
assert(waitpid(child, NULL, 0) == child);
  }
}

int main() {
  Shared *shared_memory = (Shared *)(mmap(NULL, sizeof(Shared) * NUMBER_TESTERS,
  PROT_READ | PROT_WRITE,
  MAP_SHARED | MAP_ANONYMOUS,
  -1, 0));
  int i;
  for (i = 0; i  NUMBER_TESTERS; ++i) {
if (fork() == 0) {
  while (1) {
RunTest(shared_memory[i]);
  }
}
  }
}

On Thu, Oct 23, 2014 at 3:22 PM, Brian Silverman bsilver16...@gmail.com wrote:
 pi_state_free and exit_pi_state_list both clean up futex_pi_state's.
 exit_pi_state_list takes the hb lock first, and most callers of
 pi_state_free do too. requeue_pi didn't, which causes lots of problems.
 Move the pi_state_free calls in requeue_pi to before it drops the hb
 locks which it's already holding.

 I have test code that forks a bunch of processes, which all fork more
 processes that do futex(FUTEX_WAIT_REQUEUE_PI), then do
 futex(FUTEX_CMP_REQUEUE_PI) before killing the waiting processes. That
 test consistently breaks QEMU VMs without this patch.

 Although they both make it faster to crash, CONFIG_PREEMPT_RT and lots
 of CPUs are not necessary to reproduce this problem. The symptoms range
 from random reboots to corrupting the test program to corrupting whole
 filesystems to strange BIOS errors. Crashdumps (when they get created at
 all) are usually a mess, to the point of crashing tools used to examine
 them.

 Signed-off-by: Brian Silverman bsilver16...@gmail.com
 ---
 I am not subscribed to the list so please CC me on any responses.

  kernel/futex.c |   32 +---
  1 file changed, 21 insertions(+), 11 deletions(-)

> diff --git a/kernel/futex.c b/kernel/futex.c
> index 815d7af..dc69775 100644
> --- a/kernel/futex.c
> +++ b/kernel/futex.c
> @@ -639,6 +639,9 @@ static struct futex_pi_state * alloc_pi_state(void)
>  	return pi_state;
>  }
>
> +/**
> + * Must be called with the hb lock held.
> + */
>  static void free_pi_state(struct futex_pi_state *pi_state)
>  {
>  	if (!atomic_dec_and_test(&pi_state->refcount))
> @@ -1519,15 +1522,6 @@ static int futex_requeue(u32 __user *uaddr1, unsigned int flags,
>  	}
>
>  retry:
> -	if (pi_state != NULL) {
> -		/*
> -		 * We will have to lookup the pi_state again, so free this one
> -		 * to keep the accounting correct.
> -		 */
> -		free_pi_state(pi_state);
> -		pi_state = NULL;
> -	}
> -
>  	ret = get_futex_key(uaddr1, flags & FLAGS_SHARED, &key1, VERIFY_READ);
>  	if

Re: [PATCH] futex: fix a race condition between REQUEUE_PI and task death

2014-10-23 Thread Mike Galbraith
(CCs more eyeballs)

On Thu, 2014-10-23 at 15:28 -0400, Brian Silverman wrote: 
> Here's the test code:

Which took a 2 socket 28 core box (NOPREEMPT) out in short order.  With
patchlet applied, looks like it'll stay up (37 minutes and counting),
I'll squeak if it explodes.

Tested-by: Mike Galbraith umgwanakikb...@gmail.com

[  387.396020] BUG: unable to handle kernel NULL pointer dereference at 0000000000000b34
[  387.414177] IP: [<ffffffff810d411e>] free_pi_state+0x4e/0xb0
[  387.427638] PGD 8394fe067 PUD 847c37067 PMD 0
[  387.438457] Oops: 0002 [#1] SMP
[  387.446534] Modules linked in: nfs(E) lockd(E) grace(E) sunrpc(E) fscache(E) 
iscsi_ibft(E) iscsi_boot_sysfs(E) af_packet(E) ext4(E) crc16(E) mbcache(E) 
jbd2(E) joydev(E) hid_generic(E) intel_rapl(E) usbhid(E) 
x86_pkg_temp_thermal(E) iTCO_wdt(E) intel_powerclamp(E) iTCO_vendor_support(E) 
coretemp(E) kvm_intel(E) kvm(E) crct10dif_pclmul(E) crc32_pclmul(E) 
ghash_clmulni_intel(E) aesni_intel(E) aes_x86_64(E) lrw(E) ixgbe(E) gf128mul(E) 
glue_helper(E) ptp(E) ablk_helper(E) pps_core(E) cryptd(E) sb_edac(E) pcspkr(E) 
mdio(E) edac_core(E) ipmi_si(E) dca(E) ipmi_msghandler(E) lpc_ich(E) 
mfd_core(E) wmi(E) acpi_power_meter(E) xhci_pci(E) mei_me(E) i2c_i801(E) 
acpi_pad(E) processor(E) mei(E) xhci_hcd(E) shpchp(E) button(E) dm_mod(E) 
btrfs(E) xor(E) raid6_pq(E) sr_mod(E) cdrom(E) sd_mod(E) mgag200(E) 
syscopyarea(E)
[  387.610406]  sysfillrect(E) sysimgblt(E) i2c_algo_bit(E) drm_kms_helper(E) 
ehci_pci(E) ttm(E) ahci(E) ehci_hcd(E) crc32c_intel(E) libahci(E) drm(E) 
usbcore(E) libata(E) usb_common(E) sg(E) scsi_mod(E) autofs4(E)
[  387.651513] CPU: 27 PID: 136696 Comm: futex-exit-race Tainted: G
E  3.18.0-default #51
[  387.672339] Hardware name: Intel Corporation S2600WTT/S2600WTT, BIOS 
GRNDSDP1.86B.0030.R03.1405061547 05/06/2014
[  387.696066] task: 880833002250 ti: 880830d04000 task.ti: 
880830d04000
[  387.713855] RIP: 0010:[<ffffffff810d411e>]  [<ffffffff810d411e>] free_pi_state+0x4e/0xb0
[  387.733015] RSP: 0018:880830d07d78  EFLAGS: 00010046
[  387.746030] RAX:  RBX: 8804592ea340 RCX: 0bb6
[  387.763089] RDX: 8804592ea340 RSI: 880866c3fb48 RDI: 88083b40cb84
[  387.780167] RBP: 880830d07d88 R08: c900136b8a70 R09: 88046afe8150
[  387.797255] R10:  R11: 000101b0 R12: 880830d07e08
[  387.814360] R13:  R14: c900136b8a70 R15: 0001
[  387.831476] FS:  7fb2637c5700() GS:88087f1a() 
knlGS:
[  387.850710] CS:  0010 DS:  ES:  CR0: 80050033
[  387.864766] CR2: 0b34 CR3: 0008374e7000 CR4: 001407e0
[  387.881901] Stack:
[  387.887707]  880830d07d88  880830d07e58 
810d52c4
[  387.905497]  001b0001 880830d07e20 0002018230d07dc8 
0001
[  387.923290]  c900136b8a84 880830d07f08 7fb2637d838c 
00010001
[  387.941071] Call Trace:
[  387.947864]  [<ffffffff810d52c4>] futex_requeue+0x2b4/0x8e0
[  387.961578]  [<ffffffff810d5e89>] do_futex+0xa9/0x580
[  387.974128]  [<ffffffff81585502>] ? do_nanosleep+0x82/0x110
[  387.987814]  [<ffffffff810c414c>] ? hrtimer_nanosleep+0xac/0x160
[  388.002458]  [<ffffffff810d63d1>] SyS_futex+0x71/0x150
[  388.015181]  [<ffffffff815868a9>] system_call_fastpath+0x12/0x17
[  388.029826] Code: 30 48 85 ff 74 41 48 81 c7 34 0b 00 00 e8 7b 1f 4b 00 48 
8b 43 08 48 8b 13 48 89 42 08 48 89 10 48 89 1b 48 89 5b 08 48 8b 43 30 66 83 
80 34 0b 00 00 01 fb 66 0f 1f 44 00 00 48 8b 73 30 48 8d
[  388.075136] RIP  [<ffffffff810d411e>] free_pi_state+0x4e/0xb0
[  388.089266]  RSP 880830d07d78
[  388.098386] CR2: 0000000000000b34

