Re: race condition in crypto larval handling

2013-09-07 Thread Neil Horman
On Fri, Sep 06, 2013 at 04:20:50PM -0700, Kees Cook wrote:
 Hi,
 
 I've tracked down a race condition and ref counting problem in the
 crypto API internals. We've been seeing it under Chrome OS, but it
 seems it's not isolated to just us:
 
 https://code.google.com/p/chromium/issues/detail?id=244581
 http://marc.info/?l=linux-crypto-vgerm=135429403609373w=2
 https://bugzilla.redhat.com/show_bug.cgi?id=983682
 http://www.mail-archive.com/linux-cifs@vger.kernel.org/msg07933.html
 
 I think I've found the basic origin of the problem.
 crypto_larval_add() has synchronization to make sure only a single
 larval can ever be created. That logic seems totally fine. However,
 this means that crypto_larval_lookup() from two threads can return the
 same larval, which means that crypto_alg_mod_lookup() runs the risk of
 calling crypto_larval_kill() on the same larval twice, which does not
 appear to be handled sanely.
 
 I can easily crash the kernel by forcing a synchronization point just
 before the return crypt_larval_add(...) call in
 crypto_larval_lookup(). Basically, I added this sloppy sync code (I
 feel like there must be a better way to do this):
 
 +static atomic_t global_sync = ATOMIC_INIT(0);
 ...
 struct crypto_alg *crypto_larval_lookup(const char *name, u32 type, u32 mask)
 ...
 +   if (strncmp(name, asdf, 4) == 0) {
 +   int value;
 +
 +   value = atomic_add_return(1, global_sync);
 +   if (value == 1) {
 +   /* I was first to stall here, wait for inc. */
 +   while (atomic_read(global_sync) != 2)
 +   cpu_relax();
 +   } else if (value == 2) {
 +   /* I was second to stall here, wait for dec. */
 +   while (atomic_read(global_sync) != 1)
 +   cpu_relax();
 +   } else {
 +   BUG();
 +   }
 +   atomic_dec(global_sync);
 +   }
 
 return crypto_larval_add(name, type, mask);
  }
 
 And then ran code from userspace that did, effectively:
 
 struct sockaddr_alg sa = {
 .salg_family = AF_ALG,
 .salg_type   = hash,
 };
 strcpy(sa.salg_name, argv[1]);
 
 fork();
 ...
 sds[0] = socket(AF_ALG, SOCK_SEQPACKET, 0);
 bind(sds[0], (struct sockaddr *) sa, sizeof(sa));
 
 And ran this as ./tickle asdf1 to generate the two threads trying to
 find an invalid alg. The race looks possible even with valid algs, but
 this was the fastest path I could see to tickle the issue.
 
 With added printks to the kernel, it was clear that crypto_larval_kill
 was being called twice on the same larval, and the list_del() call was
 doing bad things. When I fixed that, the refcnt bug became very
 obvious. Here's the change I made for crypto_larval_kill:
 
 @@ -161,7 +166,8 @@ void crypto_larval_kill(struct crypto_alg *alg)
 struct crypto_larval *larval = (void *)alg;
 
 down_write(crypto_alg_sem);
 -   list_del(alg-cra_list);
 +   if (!list_empty(alg-cra_list))
 +   list_del_init(alg-cra_list);
 up_write(crypto_alg_sem);
 complete_all(larval-completion);
 crypto_alg_put(alg);
 
 It seems that there are too many put calls (mixed between
 crypto_alg_put() and crypto_mod_put(), which also calls
 crypto_alg_put()). See this annotated portion of
 crypto_alg_mod_lookup:
 
 /* This can (correctly) return the same larval for two threads. */
 larval = crypto_larval_lookup(name, type, mask);
 if (IS_ERR(larval) || !crypto_is_larval(larval))
 return larval;
 
 ok = crypto_probing_notify(CRYPTO_MSG_ALG_REQUEST, larval);
 
 if (ok == NOTIFY_STOP)
 /* This calls crypto_mod_put(), but sometimes also get?? */
 alg = crypto_larval_wait(larval);
 else {
 /* This directly calls crypto_mod_put */
 crypto_mod_put(larval);
 alg = ERR_PTR(-ENOENT);
 }
 /* This calls crypto_alg_put */
 crypto_larval_kill(larval);
 return alg;
 
 In the two-thread situation, the first thread gets a larval with
 refcnt 2 via crypto_larval_add. (Why 2?) The next thread finds the
 larval via crypto_larval_add's call to __crypto_alg_lookup() and sees
 the ref bump to 3. While exiting crypto_alg_mod_lookup, each thread
 decrements the ref count twice.
 
 It seems to me like either each call to crypto_larval_lookup() should
 result in a refcount bumped by two, or crypto_alg_mod_lookup() should
 decrement only once, and the initial refcount should be 1 not 2 from
 crypto_larval_add. But it's not clear to me which is sensible here.
 
 What's the right solution here?
 
 Thanks,
 
 -Kees
 
 -- 
 Kees Cook
 Chrome OS Security
 --
 To unsubscribe from this list: send the line unsubscribe linux-crypto in
 the body of a message to majord...@vger.kernel.org
 More 

Re: race condition in crypto larval handling

2013-09-07 Thread Kees Cook
On Sat, Sep 7, 2013 at 7:39 AM, Neil Horman nhor...@tuxdriver.com wrote:
 On Fri, Sep 06, 2013 at 04:20:50PM -0700, Kees Cook wrote:
 Hi,

 I've tracked down a race condition and ref counting problem in the
 crypto API internals. We've been seeing it under Chrome OS, but it
 seems it's not isolated to just us:

 https://code.google.com/p/chromium/issues/detail?id=244581
 http://marc.info/?l=linux-crypto-vgerm=135429403609373w=2
 https://bugzilla.redhat.com/show_bug.cgi?id=983682
 http://www.mail-archive.com/linux-cifs@vger.kernel.org/msg07933.html

 I think I've found the basic origin of the problem.
 crypto_larval_add() has synchronization to make sure only a single
 larval can ever be created. That logic seems totally fine. However,
 this means that crypto_larval_lookup() from two threads can return the
 same larval, which means that crypto_alg_mod_lookup() runs the risk of
 calling crypto_larval_kill() on the same larval twice, which does not
 appear to be handled sanely.

 I can easily crash the kernel by forcing a synchronization point just
 before the return crypt_larval_add(...) call in
 crypto_larval_lookup(). Basically, I added this sloppy sync code (I
 feel like there must be a better way to do this):

 +static atomic_t global_sync = ATOMIC_INIT(0);
 ...
 struct crypto_alg *crypto_larval_lookup(const char *name, u32 type, u32 mask)
 ...
 +   if (strncmp(name, asdf, 4) == 0) {
 +   int value;
 +
 +   value = atomic_add_return(1, global_sync);
 +   if (value == 1) {
 +   /* I was first to stall here, wait for inc. */
 +   while (atomic_read(global_sync) != 2)
 +   cpu_relax();
 +   } else if (value == 2) {
 +   /* I was second to stall here, wait for dec. */
 +   while (atomic_read(global_sync) != 1)
 +   cpu_relax();
 +   } else {
 +   BUG();
 +   }
 +   atomic_dec(global_sync);
 +   }

 return crypto_larval_add(name, type, mask);
  }

 And then ran code from userspace that did, effectively:

 struct sockaddr_alg sa = {
 .salg_family = AF_ALG,
 .salg_type   = hash,
 };
 strcpy(sa.salg_name, argv[1]);

 fork();
 ...
 sds[0] = socket(AF_ALG, SOCK_SEQPACKET, 0);
 bind(sds[0], (struct sockaddr *) sa, sizeof(sa));

 And ran this as ./tickle asdf1 to generate the two threads trying to
 find an invalid alg. The race looks possible even with valid algs, but
 this was the fastest path I could see to tickle the issue.

 With added printks to the kernel, it was clear that crypto_larval_kill
 was being called twice on the same larval, and the list_del() call was
 doing bad things. When I fixed that, the refcnt bug became very
 obvious. Here's the change I made for crypto_larval_kill:

 @@ -161,7 +166,8 @@ void crypto_larval_kill(struct crypto_alg *alg)
 struct crypto_larval *larval = (void *)alg;

 down_write(crypto_alg_sem);
 -   list_del(alg-cra_list);
 +   if (!list_empty(alg-cra_list))
 +   list_del_init(alg-cra_list);
 up_write(crypto_alg_sem);
 complete_all(larval-completion);
 crypto_alg_put(alg);

 It seems that there are too many put calls (mixed between
 crypto_alg_put() and crypto_mod_put(), which also calls
 crypto_alg_put()). See this annotated portion of
 crypto_alg_mod_lookup:

 /* This can (correctly) return the same larval for two threads. */
 larval = crypto_larval_lookup(name, type, mask);
 if (IS_ERR(larval) || !crypto_is_larval(larval))
 return larval;

 ok = crypto_probing_notify(CRYPTO_MSG_ALG_REQUEST, larval);

 if (ok == NOTIFY_STOP)
 /* This calls crypto_mod_put(), but sometimes also get?? */
 alg = crypto_larval_wait(larval);
 else {
 /* This directly calls crypto_mod_put */
 crypto_mod_put(larval);
 alg = ERR_PTR(-ENOENT);
 }
 /* This calls crypto_alg_put */
 crypto_larval_kill(larval);
 return alg;

 In the two-thread situation, the first thread gets a larval with
 refcnt 2 via crypto_larval_add. (Why 2?) The next thread finds the
 larval via crypto_larval_add's call to __crypto_alg_lookup() and sees
 the ref bump to 3. While exiting crypto_alg_mod_lookup, each thread
 decrements the ref count twice.

 It seems to me like either each call to crypto_larval_lookup() should
 result in a refcount bumped by two, or crypto_alg_mod_lookup() should
 decrement only once, and the initial refcount should be 1 not 2 from
 crypto_larval_add. But it's not clear to me which is sensible here.

 What's the right solution here?

 Thanks,

 -Kees

 --
 Kees Cook
 Chrome OS Security

 I've been tracking this problem too:
 https://bugzilla.redhat.com/show_bug.cgi?id=1002351


Re: race condition in crypto larval handling

2013-09-07 Thread Herbert Xu
On Fri, Sep 06, 2013 at 04:20:50PM -0700, Kees Cook wrote:

 In the two-thread situation, the first thread gets a larval with
 refcnt 2 via crypto_larval_add. (Why 2?) The next thread finds the
 larval via crypto_larval_add's call to __crypto_alg_lookup() and sees
 the ref bump to 3. While exiting crypto_alg_mod_lookup, each thread
 decrements the ref count twice.
 
 It seems to me like either each call to crypto_larval_lookup() should
 result in a refcount bumped by two, or crypto_alg_mod_lookup() should
 decrement only once, and the initial refcount should be 1 not 2 from
 crypto_larval_add. But it's not clear to me which is sensible here.
 
 What's the right solution here?

First of all thanks a lot for tracking this problem down! It's
been bothering me for months but I was unable to find a good
reproducer.

So now that you've identified the problem, the solution is easy.
crypto_larval_lookup should only ever return a larval if it created
one.  Any larval created earlier must be waited on first before we
return.

So does this patch fix the crash for you?

diff --git a/crypto/api.c b/crypto/api.c
index 320ea4d..a2b39c5 100644
--- a/crypto/api.c
+++ b/crypto/api.c
@@ -34,6 +34,8 @@ EXPORT_SYMBOL_GPL(crypto_alg_sem);
 BLOCKING_NOTIFIER_HEAD(crypto_chain);
 EXPORT_SYMBOL_GPL(crypto_chain);
 
+static struct crypto_alg *crypto_larval_wait(struct crypto_alg *alg);
+
 struct crypto_alg *crypto_mod_get(struct crypto_alg *alg)
 {
return try_module_get(alg-cra_module) ? crypto_alg_get(alg) : NULL;
@@ -144,8 +146,11 @@ static struct crypto_alg *crypto_larval_add(const char 
*name, u32 type,
}
up_write(crypto_alg_sem);
 
-   if (alg != larval-alg)
+   if (alg != larval-alg) {
kfree(larval);
+   if (crypto_is_larval(alg))
+   alg = crypto_larval_wait(alg);
+   }
 
return alg;
 }

Cheers,
-- 
Email: Herbert Xu herb...@gondor.apana.org.au
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--
To unsubscribe from this list: send the line unsubscribe linux-crypto in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: race condition in crypto larval handling

2013-09-07 Thread Kees Cook
On Sat, Sep 7, 2013 at 6:32 PM, Herbert Xu herb...@gondor.apana.org.au wrote:
 On Fri, Sep 06, 2013 at 04:20:50PM -0700, Kees Cook wrote:

 In the two-thread situation, the first thread gets a larval with
 refcnt 2 via crypto_larval_add. (Why 2?) The next thread finds the
 larval via crypto_larval_add's call to __crypto_alg_lookup() and sees
 the ref bump to 3. While exiting crypto_alg_mod_lookup, each thread
 decrements the ref count twice.

 It seems to me like either each call to crypto_larval_lookup() should
 result in a refcount bumped by two, or crypto_alg_mod_lookup() should
 decrement only once, and the initial refcount should be 1 not 2 from
 crypto_larval_add. But it's not clear to me which is sensible here.

 What's the right solution here?

 First of all thanks a lot for tracking this problem down! It's
 been bothering me for months but I was unable to find a good
 reproducer.

 So now that you've identified the problem, the solution is easy.
 crypto_larval_lookup should only ever return a larval if it created
 one.  Any larval created earlier must be waited on first before we
 return.

 So does this patch fix the crash for you?

 diff --git a/crypto/api.c b/crypto/api.c
 index 320ea4d..a2b39c5 100644
 --- a/crypto/api.c
 +++ b/crypto/api.c
 @@ -34,6 +34,8 @@ EXPORT_SYMBOL_GPL(crypto_alg_sem);
  BLOCKING_NOTIFIER_HEAD(crypto_chain);
  EXPORT_SYMBOL_GPL(crypto_chain);

 +static struct crypto_alg *crypto_larval_wait(struct crypto_alg *alg);
 +
  struct crypto_alg *crypto_mod_get(struct crypto_alg *alg)
  {
 return try_module_get(alg-cra_module) ? crypto_alg_get(alg) : NULL;
 @@ -144,8 +146,11 @@ static struct crypto_alg *crypto_larval_add(const char 
 *name, u32 type,
 }
 up_write(crypto_alg_sem);

 -   if (alg != larval-alg)
 +   if (alg != larval-alg) {
 kfree(larval);
 +   if (crypto_is_larval(alg))
 +   alg = crypto_larval_wait(alg);
 +   }

 return alg;
  }

Thanks! This does stop the crash I was seeing on the race with no
valid alg path, yay!

However, I noticed on the good path (even without the above patch),
I sometimes see a double-kfree triggered by the modprobe process. I
can't, however, see how that's happening, since larval_destroy should
only be called when refcnt == 0.

Here's the debugging output I added that noticed this. pid 15797 and
15798 are my hash tool that forks before doing identical AF_ALG
calls for sha512, which has to be modprobed. The trailing 31 and
32 was an atomic counter I added just to make sure I wasn't going
crazy:

[  112.495789] crypto_alg_mod_lookup 15797: looking up [sha512]
[  112.495815] crypto_larval_lookup 15797: looking up [sha512]
[  112.495839] crypto_larval_lookup 15797: requesting module [sha512]
[  112.495915] crypto_alg_mod_lookup 15798: looking up [sha512]
[  112.495937] crypto_larval_lookup 15798: looking up [sha512]
[  112.495960] crypto_larval_lookup 15798: requesting module [sha512]
[  112.498691] crypto_larval_destroy 15808(modprobe): larval
kfree(880123e9da00) refcnt:0 (serial:31)
[  112.498771] crypto_larval_destroy 15808(modprobe): larval
kfree(880123e9da00) refcnt:0 (serial:32)
[  112.500888] crypto_larval_lookup 15797: after requesting module
[sha512], got alg c0299050
[  112.500904] crypto_larval_lookup 15797: for [sha512], have alg
c0299050
[  112.500934] crypto_larval_lookup 15797: for [sha512], alg
c0299050 is NOT larval
[  112.500953] crypto_alg_mod_lookup 15797: [sha512] is not larval
c0299050
[  112.501384] crypto_larval_lookup 15798: after requesting module
[sha512], got alg c0299050
[  112.501404] crypto_larval_lookup 15798: for [sha512], have alg
c0299050
[  112.501417] crypto_larval_lookup 15798: for [sha512], alg
c0299050 is NOT larval
[  112.501432] crypto_alg_mod_lookup 15798: [sha512] is not larval
c0299050

Regardless, this seems to be a separate bug.

-Kees

-- 
Kees Cook
Chrome OS Security
--
To unsubscribe from this list: send the line unsubscribe linux-crypto in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: race condition in crypto larval handling

2013-09-07 Thread Herbert Xu
On Sat, Sep 07, 2013 at 08:34:15PM -0700, Kees Cook wrote:

 However, I noticed on the good path (even without the above patch),
 I sometimes see a double-kfree triggered by the modprobe process. I
 can't, however, see how that's happening, since larval_destroy should
 only be called when refcnt == 0.

Do you still see this double free with this patch? Without the
patch it is completely expected as killing the same lavral twice
will cause memory corruption leading to all sorts of weirdness,
even if you stop it from deleting the list entry twice.

Cheers,
-- 
Email: Herbert Xu herb...@gondor.apana.org.au
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--
To unsubscribe from this list: send the line unsubscribe linux-crypto in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: race condition in crypto larval handling

2013-09-07 Thread Herbert Xu
On Sun, Sep 08, 2013 at 02:37:03PM +1000, Herbert Xu wrote:
 On Sat, Sep 07, 2013 at 08:34:15PM -0700, Kees Cook wrote:
 
  However, I noticed on the good path (even without the above patch),
  I sometimes see a double-kfree triggered by the modprobe process. I
  can't, however, see how that's happening, since larval_destroy should
  only be called when refcnt == 0.
 
 Do you still see this double free with this patch? Without the
 patch it is completely expected as killing the same lavral twice
 will cause memory corruption leading to all sorts of weirdness,
 even if you stop it from deleting the list entry twice.

Actually I know what it is.  sha512 registers two algorithms.
Therefore, it will create two larvals in sequence and then destroy
them in turn.  So it's not a double free at all.  If you put a
printk in crypto_larval_alloc that should confirm this.

Cheers,
-- 
Email: Herbert Xu herb...@gondor.apana.org.au
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--
To unsubscribe from this list: send the line unsubscribe linux-crypto in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html