Re: race condition in crypto larval handling
On Fri, Sep 06, 2013 at 04:20:50PM -0700, Kees Cook wrote: Hi, I've tracked down a race condition and ref counting problem in the crypto API internals. We've been seeing it under Chrome OS, but it seems it's not isolated to just us: https://code.google.com/p/chromium/issues/detail?id=244581 http://marc.info/?l=linux-crypto-vgerm=135429403609373w=2 https://bugzilla.redhat.com/show_bug.cgi?id=983682 http://www.mail-archive.com/linux-cifs@vger.kernel.org/msg07933.html I think I've found the basic origin of the problem. crypto_larval_add() has synchronization to make sure only a single larval can ever be created. That logic seems totally fine. However, this means that crypto_larval_lookup() from two threads can return the same larval, which means that crypto_alg_mod_lookup() runs the risk of calling crypto_larval_kill() on the same larval twice, which does not appear to be handled sanely. I can easily crash the kernel by forcing a synchronization point just before the return crypt_larval_add(...) call in crypto_larval_lookup(). Basically, I added this sloppy sync code (I feel like there must be a better way to do this): +static atomic_t global_sync = ATOMIC_INIT(0); ... struct crypto_alg *crypto_larval_lookup(const char *name, u32 type, u32 mask) ... + if (strncmp(name, asdf, 4) == 0) { + int value; + + value = atomic_add_return(1, global_sync); + if (value == 1) { + /* I was first to stall here, wait for inc. */ + while (atomic_read(global_sync) != 2) + cpu_relax(); + } else if (value == 2) { + /* I was second to stall here, wait for dec. */ + while (atomic_read(global_sync) != 1) + cpu_relax(); + } else { + BUG(); + } + atomic_dec(global_sync); + } return crypto_larval_add(name, type, mask); } And then ran code from userspace that did, effectively: struct sockaddr_alg sa = { .salg_family = AF_ALG, .salg_type = hash, }; strcpy(sa.salg_name, argv[1]); fork(); ... sds[0] = socket(AF_ALG, SOCK_SEQPACKET, 0); bind(sds[0], (struct sockaddr *) sa, sizeof(sa)); And ran this as ./tickle asdf1 to generate the two threads trying to find an invalid alg. The race looks possible even with valid algs, but this was the fastest path I could see to tickle the issue. With added printks to the kernel, it was clear that crypto_larval_kill was being called twice on the same larval, and the list_del() call was doing bad things. When I fixed that, the refcnt bug became very obvious. Here's the change I made for crypto_larval_kill: @@ -161,7 +166,8 @@ void crypto_larval_kill(struct crypto_alg *alg) struct crypto_larval *larval = (void *)alg; down_write(crypto_alg_sem); - list_del(alg-cra_list); + if (!list_empty(alg-cra_list)) + list_del_init(alg-cra_list); up_write(crypto_alg_sem); complete_all(larval-completion); crypto_alg_put(alg); It seems that there are too many put calls (mixed between crypto_alg_put() and crypto_mod_put(), which also calls crypto_alg_put()). See this annotated portion of crypto_alg_mod_lookup: /* This can (correctly) return the same larval for two threads. */ larval = crypto_larval_lookup(name, type, mask); if (IS_ERR(larval) || !crypto_is_larval(larval)) return larval; ok = crypto_probing_notify(CRYPTO_MSG_ALG_REQUEST, larval); if (ok == NOTIFY_STOP) /* This calls crypto_mod_put(), but sometimes also get?? */ alg = crypto_larval_wait(larval); else { /* This directly calls crypto_mod_put */ crypto_mod_put(larval); alg = ERR_PTR(-ENOENT); } /* This calls crypto_alg_put */ crypto_larval_kill(larval); return alg; In the two-thread situation, the first thread gets a larval with refcnt 2 via crypto_larval_add. (Why 2?) The next thread finds the larval via crypto_larval_add's call to __crypto_alg_lookup() and sees the ref bump to 3. While exiting crypto_alg_mod_lookup, each thread decrements the ref count twice. It seems to me like either each call to crypto_larval_lookup() should result in a refcount bumped by two, or crypto_alg_mod_lookup() should decrement only once, and the initial refcount should be 1 not 2 from crypto_larval_add. But it's not clear to me which is sensible here. What's the right solution here? Thanks, -Kees -- Kees Cook Chrome OS Security -- To unsubscribe from this list: send the line unsubscribe linux-crypto in the body of a message to majord...@vger.kernel.org More
Re: race condition in crypto larval handling
On Sat, Sep 7, 2013 at 7:39 AM, Neil Horman nhor...@tuxdriver.com wrote: On Fri, Sep 06, 2013 at 04:20:50PM -0700, Kees Cook wrote: Hi, I've tracked down a race condition and ref counting problem in the crypto API internals. We've been seeing it under Chrome OS, but it seems it's not isolated to just us: https://code.google.com/p/chromium/issues/detail?id=244581 http://marc.info/?l=linux-crypto-vgerm=135429403609373w=2 https://bugzilla.redhat.com/show_bug.cgi?id=983682 http://www.mail-archive.com/linux-cifs@vger.kernel.org/msg07933.html I think I've found the basic origin of the problem. crypto_larval_add() has synchronization to make sure only a single larval can ever be created. That logic seems totally fine. However, this means that crypto_larval_lookup() from two threads can return the same larval, which means that crypto_alg_mod_lookup() runs the risk of calling crypto_larval_kill() on the same larval twice, which does not appear to be handled sanely. I can easily crash the kernel by forcing a synchronization point just before the return crypt_larval_add(...) call in crypto_larval_lookup(). Basically, I added this sloppy sync code (I feel like there must be a better way to do this): +static atomic_t global_sync = ATOMIC_INIT(0); ... struct crypto_alg *crypto_larval_lookup(const char *name, u32 type, u32 mask) ... + if (strncmp(name, asdf, 4) == 0) { + int value; + + value = atomic_add_return(1, global_sync); + if (value == 1) { + /* I was first to stall here, wait for inc. */ + while (atomic_read(global_sync) != 2) + cpu_relax(); + } else if (value == 2) { + /* I was second to stall here, wait for dec. */ + while (atomic_read(global_sync) != 1) + cpu_relax(); + } else { + BUG(); + } + atomic_dec(global_sync); + } return crypto_larval_add(name, type, mask); } And then ran code from userspace that did, effectively: struct sockaddr_alg sa = { .salg_family = AF_ALG, .salg_type = hash, }; strcpy(sa.salg_name, argv[1]); fork(); ... sds[0] = socket(AF_ALG, SOCK_SEQPACKET, 0); bind(sds[0], (struct sockaddr *) sa, sizeof(sa)); And ran this as ./tickle asdf1 to generate the two threads trying to find an invalid alg. The race looks possible even with valid algs, but this was the fastest path I could see to tickle the issue. With added printks to the kernel, it was clear that crypto_larval_kill was being called twice on the same larval, and the list_del() call was doing bad things. When I fixed that, the refcnt bug became very obvious. Here's the change I made for crypto_larval_kill: @@ -161,7 +166,8 @@ void crypto_larval_kill(struct crypto_alg *alg) struct crypto_larval *larval = (void *)alg; down_write(crypto_alg_sem); - list_del(alg-cra_list); + if (!list_empty(alg-cra_list)) + list_del_init(alg-cra_list); up_write(crypto_alg_sem); complete_all(larval-completion); crypto_alg_put(alg); It seems that there are too many put calls (mixed between crypto_alg_put() and crypto_mod_put(), which also calls crypto_alg_put()). See this annotated portion of crypto_alg_mod_lookup: /* This can (correctly) return the same larval for two threads. */ larval = crypto_larval_lookup(name, type, mask); if (IS_ERR(larval) || !crypto_is_larval(larval)) return larval; ok = crypto_probing_notify(CRYPTO_MSG_ALG_REQUEST, larval); if (ok == NOTIFY_STOP) /* This calls crypto_mod_put(), but sometimes also get?? */ alg = crypto_larval_wait(larval); else { /* This directly calls crypto_mod_put */ crypto_mod_put(larval); alg = ERR_PTR(-ENOENT); } /* This calls crypto_alg_put */ crypto_larval_kill(larval); return alg; In the two-thread situation, the first thread gets a larval with refcnt 2 via crypto_larval_add. (Why 2?) The next thread finds the larval via crypto_larval_add's call to __crypto_alg_lookup() and sees the ref bump to 3. While exiting crypto_alg_mod_lookup, each thread decrements the ref count twice. It seems to me like either each call to crypto_larval_lookup() should result in a refcount bumped by two, or crypto_alg_mod_lookup() should decrement only once, and the initial refcount should be 1 not 2 from crypto_larval_add. But it's not clear to me which is sensible here. What's the right solution here? Thanks, -Kees -- Kees Cook Chrome OS Security I've been tracking this problem too: https://bugzilla.redhat.com/show_bug.cgi?id=1002351
Re: race condition in crypto larval handling
On Fri, Sep 06, 2013 at 04:20:50PM -0700, Kees Cook wrote: In the two-thread situation, the first thread gets a larval with refcnt 2 via crypto_larval_add. (Why 2?) The next thread finds the larval via crypto_larval_add's call to __crypto_alg_lookup() and sees the ref bump to 3. While exiting crypto_alg_mod_lookup, each thread decrements the ref count twice. It seems to me like either each call to crypto_larval_lookup() should result in a refcount bumped by two, or crypto_alg_mod_lookup() should decrement only once, and the initial refcount should be 1 not 2 from crypto_larval_add. But it's not clear to me which is sensible here. What's the right solution here? First of all thanks a lot for tracking this problem down! It's been bothering me for months but I was unable to find a good reproducer. So now that you've identified the problem, the solution is easy. crypto_larval_lookup should only ever return a larval if it created one. Any larval created earlier must be waited on first before we return. So does this patch fix the crash for you? diff --git a/crypto/api.c b/crypto/api.c index 320ea4d..a2b39c5 100644 --- a/crypto/api.c +++ b/crypto/api.c @@ -34,6 +34,8 @@ EXPORT_SYMBOL_GPL(crypto_alg_sem); BLOCKING_NOTIFIER_HEAD(crypto_chain); EXPORT_SYMBOL_GPL(crypto_chain); +static struct crypto_alg *crypto_larval_wait(struct crypto_alg *alg); + struct crypto_alg *crypto_mod_get(struct crypto_alg *alg) { return try_module_get(alg-cra_module) ? crypto_alg_get(alg) : NULL; @@ -144,8 +146,11 @@ static struct crypto_alg *crypto_larval_add(const char *name, u32 type, } up_write(crypto_alg_sem); - if (alg != larval-alg) + if (alg != larval-alg) { kfree(larval); + if (crypto_is_larval(alg)) + alg = crypto_larval_wait(alg); + } return alg; } Cheers, -- Email: Herbert Xu herb...@gondor.apana.org.au Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt -- To unsubscribe from this list: send the line unsubscribe linux-crypto in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: race condition in crypto larval handling
On Sat, Sep 7, 2013 at 6:32 PM, Herbert Xu herb...@gondor.apana.org.au wrote: On Fri, Sep 06, 2013 at 04:20:50PM -0700, Kees Cook wrote: In the two-thread situation, the first thread gets a larval with refcnt 2 via crypto_larval_add. (Why 2?) The next thread finds the larval via crypto_larval_add's call to __crypto_alg_lookup() and sees the ref bump to 3. While exiting crypto_alg_mod_lookup, each thread decrements the ref count twice. It seems to me like either each call to crypto_larval_lookup() should result in a refcount bumped by two, or crypto_alg_mod_lookup() should decrement only once, and the initial refcount should be 1 not 2 from crypto_larval_add. But it's not clear to me which is sensible here. What's the right solution here? First of all thanks a lot for tracking this problem down! It's been bothering me for months but I was unable to find a good reproducer. So now that you've identified the problem, the solution is easy. crypto_larval_lookup should only ever return a larval if it created one. Any larval created earlier must be waited on first before we return. So does this patch fix the crash for you? diff --git a/crypto/api.c b/crypto/api.c index 320ea4d..a2b39c5 100644 --- a/crypto/api.c +++ b/crypto/api.c @@ -34,6 +34,8 @@ EXPORT_SYMBOL_GPL(crypto_alg_sem); BLOCKING_NOTIFIER_HEAD(crypto_chain); EXPORT_SYMBOL_GPL(crypto_chain); +static struct crypto_alg *crypto_larval_wait(struct crypto_alg *alg); + struct crypto_alg *crypto_mod_get(struct crypto_alg *alg) { return try_module_get(alg-cra_module) ? crypto_alg_get(alg) : NULL; @@ -144,8 +146,11 @@ static struct crypto_alg *crypto_larval_add(const char *name, u32 type, } up_write(crypto_alg_sem); - if (alg != larval-alg) + if (alg != larval-alg) { kfree(larval); + if (crypto_is_larval(alg)) + alg = crypto_larval_wait(alg); + } return alg; } Thanks! This does stop the crash I was seeing on the race with no valid alg path, yay! However, I noticed on the good path (even without the above patch), I sometimes see a double-kfree triggered by the modprobe process. I can't, however, see how that's happening, since larval_destroy should only be called when refcnt == 0. Here's the debugging output I added that noticed this. pid 15797 and 15798 are my hash tool that forks before doing identical AF_ALG calls for sha512, which has to be modprobed. The trailing 31 and 32 was an atomic counter I added just to make sure I wasn't going crazy: [ 112.495789] crypto_alg_mod_lookup 15797: looking up [sha512] [ 112.495815] crypto_larval_lookup 15797: looking up [sha512] [ 112.495839] crypto_larval_lookup 15797: requesting module [sha512] [ 112.495915] crypto_alg_mod_lookup 15798: looking up [sha512] [ 112.495937] crypto_larval_lookup 15798: looking up [sha512] [ 112.495960] crypto_larval_lookup 15798: requesting module [sha512] [ 112.498691] crypto_larval_destroy 15808(modprobe): larval kfree(880123e9da00) refcnt:0 (serial:31) [ 112.498771] crypto_larval_destroy 15808(modprobe): larval kfree(880123e9da00) refcnt:0 (serial:32) [ 112.500888] crypto_larval_lookup 15797: after requesting module [sha512], got alg c0299050 [ 112.500904] crypto_larval_lookup 15797: for [sha512], have alg c0299050 [ 112.500934] crypto_larval_lookup 15797: for [sha512], alg c0299050 is NOT larval [ 112.500953] crypto_alg_mod_lookup 15797: [sha512] is not larval c0299050 [ 112.501384] crypto_larval_lookup 15798: after requesting module [sha512], got alg c0299050 [ 112.501404] crypto_larval_lookup 15798: for [sha512], have alg c0299050 [ 112.501417] crypto_larval_lookup 15798: for [sha512], alg c0299050 is NOT larval [ 112.501432] crypto_alg_mod_lookup 15798: [sha512] is not larval c0299050 Regardless, this seems to be a separate bug. -Kees -- Kees Cook Chrome OS Security -- To unsubscribe from this list: send the line unsubscribe linux-crypto in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: race condition in crypto larval handling
On Sat, Sep 07, 2013 at 08:34:15PM -0700, Kees Cook wrote: However, I noticed on the good path (even without the above patch), I sometimes see a double-kfree triggered by the modprobe process. I can't, however, see how that's happening, since larval_destroy should only be called when refcnt == 0. Do you still see this double free with this patch? Without the patch it is completely expected as killing the same lavral twice will cause memory corruption leading to all sorts of weirdness, even if you stop it from deleting the list entry twice. Cheers, -- Email: Herbert Xu herb...@gondor.apana.org.au Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt -- To unsubscribe from this list: send the line unsubscribe linux-crypto in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: race condition in crypto larval handling
On Sun, Sep 08, 2013 at 02:37:03PM +1000, Herbert Xu wrote: On Sat, Sep 07, 2013 at 08:34:15PM -0700, Kees Cook wrote: However, I noticed on the good path (even without the above patch), I sometimes see a double-kfree triggered by the modprobe process. I can't, however, see how that's happening, since larval_destroy should only be called when refcnt == 0. Do you still see this double free with this patch? Without the patch it is completely expected as killing the same lavral twice will cause memory corruption leading to all sorts of weirdness, even if you stop it from deleting the list entry twice. Actually I know what it is. sha512 registers two algorithms. Therefore, it will create two larvals in sequence and then destroy them in turn. So it's not a double free at all. If you put a printk in crypto_larval_alloc that should confirm this. Cheers, -- Email: Herbert Xu herb...@gondor.apana.org.au Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt -- To unsubscribe from this list: send the line unsubscribe linux-crypto in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html