[cryptodev:master 81/86] htmldocs: include/linux/crypto.h:614: warning: Function parameter or member 'stats.aead' not described in 'crypto_alg'
tree:   https://git.kernel.org/pub/scm/linux/kernel/git/herbert/cryptodev-2.6.git master
head:   88d905e20b11f7ad841e3afddaf1d59b6693c4a1
commit: 17c18f9e33282a170458cb5ea20759bfcb0da7d8 [81/86] crypto: user - Split stats in multiple structures
reproduce: make htmldocs

All warnings (new ones prefixed by >>):

WARNING: convert(1) not found, for SVG to PDF conversion install ImageMagick (https://www.imagemagick.org)
kernel/resource.c:337: warning: Function parameter or member 'start' not described in 'find_next_iomem_res'
kernel/resource.c:337: warning: Function parameter or member 'end' not described in 'find_next_iomem_res'
kernel/resource.c:337: warning: Function parameter or member 'flags' not described in 'find_next_iomem_res'
kernel/resource.c:337: warning: Function parameter or member 'desc' not described in 'find_next_iomem_res'
kernel/resource.c:337: warning: Function parameter or member 'first_lvl' not described in 'find_next_iomem_res'
kernel/resource.c:337: warning: Function parameter or member 'res' not described in 'find_next_iomem_res'
kernel/resource.c:409: warning: Function parameter or member 'arg' not described in 'walk_iomem_res_desc'
kernel/resource.c:409: warning: Function parameter or member 'func' not described in 'walk_iomem_res_desc'
include/linux/rcutree.h:1: warning: no structured comments found
kernel/rcu/tree.c:684: warning: Excess function parameter 'irq' description in 'rcu_nmi_exit'
include/linux/srcu.h:175: warning: Function parameter or member 'p' not described in 'srcu_dereference_notrace'
include/linux/srcu.h:175: warning: Function parameter or member 'sp' not described in 'srcu_dereference_notrace'
include/linux/gfp.h:1: warning: no structured comments found
>> include/linux/crypto.h:614: warning: Function parameter or member 'stats.aead' not described in 'crypto_alg'
>> include/linux/crypto.h:614: warning: Function parameter or member 'stats.akcipher' not described in 'crypto_alg'
>> include/linux/crypto.h:614: warning: Function parameter or member 'stats.cipher' not described in 'crypto_alg'
>> include/linux/crypto.h:614: warning: Function parameter or member 'stats.compress' not described in 'crypto_alg'
>> include/linux/crypto.h:614: warning: Function parameter or member 'stats.hash' not described in 'crypto_alg'
>> include/linux/crypto.h:614: warning: Function parameter or member 'stats.rng' not described in 'crypto_alg'
>> include/linux/crypto.h:614: warning: Function parameter or member 'stats.kpp' not described in 'crypto_alg'
include/net/cfg80211.h:2838: warning: cannot understand function prototype: 'struct cfg80211_ftm_responder_stats ' [this identical warning appears 13 times in the log]
include/net/cfg80211.h:4439: warning: Function parameter or member 'wext.ibss' not described in 'wireless_dev'
include/net/cfg80211.h:4439: warning: Function parameter or member 'wext.connect' not described in 'wireless_dev'
include/net/cfg80211.h:4439: warning: Function parameter or member 'wext.keys' not described in 'wireless_dev'
include/net/cfg80211.h:4439: warning: Function parameter or member 'wext.ie' not described in 'wireless_dev'
include/net/cfg80211.h:4439: warning: Function parameter or member 'wext.ie_len' not described in 'wireless_dev'
include/net/cfg80211.h:4439: warning: Function
[PATCH] crypto: caam - fix setting IV after decrypt
The crypto API wants the updated IV in req->info after decryption. The
updated IV used to be copied correctly to req->info after running the
decryption job. Since 115957bb3e59 this is done before running the job,
so instead of the updated IV only the unmodified input IV is given back
to the crypto API. This was observed running the gcm(aes) selftest,
which internally uses ctr(aes) implemented by the CAAM engine.

Fixes: 115957bb3e59 ("crypto: caam - fix IV DMA mapping and updating")
Signed-off-by: Sascha Hauer
Cc: sta...@vger.kernel.org
---
 drivers/crypto/caam/caamalg.c | 17 +++++++++--------
 1 file changed, 9 insertions(+), 8 deletions(-)

diff --git a/drivers/crypto/caam/caamalg.c b/drivers/crypto/caam/caamalg.c
index 869f092432de..c05c7938439c 100644
--- a/drivers/crypto/caam/caamalg.c
+++ b/drivers/crypto/caam/caamalg.c
@@ -917,10 +917,10 @@ static void skcipher_decrypt_done(struct device *jrdev, u32 *desc, u32 err,
 {
 	struct skcipher_request *req = context;
 	struct skcipher_edesc *edesc;
-#ifdef DEBUG
 	struct crypto_skcipher *skcipher = crypto_skcipher_reqtfm(req);
 	int ivsize = crypto_skcipher_ivsize(skcipher);
 
+#ifdef DEBUG
 	dev_err(jrdev, "%s %d: err 0x%x\n", __func__, __LINE__, err);
 #endif
 
@@ -937,6 +937,14 @@ static void skcipher_decrypt_done(struct device *jrdev, u32 *desc, u32 err,
 		     edesc->dst_nents > 1 ? 100 : req->cryptlen, 1);
 
 	skcipher_unmap(jrdev, edesc, req);
+
+	/*
+	 * The crypto API expects us to set the IV (req->iv) to the last
+	 * ciphertext block.
+	 */
+	scatterwalk_map_and_copy(req->iv, req->src, req->cryptlen - ivsize,
+				 ivsize, 0);
+
 	kfree(edesc);
 
 	skcipher_request_complete(req, err);
@@ -1588,13 +1596,6 @@ static int skcipher_decrypt(struct skcipher_request *req)
 	if (IS_ERR(edesc))
 		return PTR_ERR(edesc);
 
-	/*
-	 * The crypto API expects us to set the IV (req->iv) to the last
-	 * ciphertext block.
-	 */
-	scatterwalk_map_and_copy(req->iv, req->src, req->cryptlen - ivsize,
-				 ivsize, 0);
-
 	/* Create and submit job descriptor*/
 	init_skcipher_job(req, edesc, false);
 	desc = edesc->hw_desc;
-- 
2.19.1
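[Context for the fix above — a minimal sketch, not from this patch, of why req->iv must hold the last ciphertext block after decryption: a caller may split one long CBC/CTR stream across two requests and chain them through the same IV buffer. Assumes a synchronous tfm; error handling trimmed.]

static int decrypt_two_chunks(struct crypto_skcipher *tfm,
			      struct scatterlist *sg1,
			      struct scatterlist *sg2,
			      unsigned int len, u8 *iv)
{
	struct skcipher_request *req = skcipher_request_alloc(tfm, GFP_KERNEL);
	int err;

	if (!req)
		return -ENOMEM;
	skcipher_request_set_callback(req, 0, NULL, NULL);

	/* Chunk 1 consumes the caller-provided IV... */
	skcipher_request_set_crypt(req, sg1, sg1, len, iv);
	err = crypto_skcipher_decrypt(req);

	/*
	 * ...and chunk 2 decrypts correctly only if the driver wrote the
	 * last ciphertext block of chunk 1 back into 'iv' when the job
	 * completed -- which is exactly what this patch restores.
	 */
	if (!err) {
		skcipher_request_set_crypt(req, sg2, sg2, len, iv);
		err = crypto_skcipher_decrypt(req);
	}

	skcipher_request_free(req);
	return err;
}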
Re: [PATCH v5 00/11] crypto: crypto_user_stat: misc enhancement
On Thu, Nov 29, 2018 at 02:42:15PM +0000, Corentin Labbe wrote:
> Hello
> 
> This patchset fixes all the problems reported by Eric Biggers.
> 
> Regards
> 
> Changes since v4:
> - Inlined functions when !CRYPTO_STATS
> 
> Changes since v3:
> - Added a crypto_stats_init as asked by Neil Horman
> - Fixed some checkpatch complaints
> 
> Changes since v2:
> - moved all crypto_stats functions from header to algapi.c for using
>   crypto_alg_get/put
> 
> Changes since v1:
> - Better locking of crypto_alg via crypto_alg_get/crypto_alg_put
> - remove all intermediate variables in crypto/crypto_user_stat.c
> - split all internal stats variables into different structures
> 
> Corentin Labbe (11):
>   crypto: crypto_user_stat: made crypto_user_stat optional
>   crypto: CRYPTO_STATS should depend on CRYPTO_USER
>   crypto: crypto_user_stat: convert all stats from u32 to u64
>   crypto: crypto_user_stat: split user space crypto stat structures
>   crypto: tool: getstat: convert user space example to the new
>     crypto_user_stat uapi
>   crypto: crypto_user_stat: fix use_after_free of struct xxx_request
>   crypto: crypto_user_stat: Fix invalid stat reporting
>   crypto: crypto_user_stat: remove intermediate variable
>   crypto: crypto_user_stat: Split stats in multiple structures
>   crypto: crypto_user_stat: rename err_cnt parameter
>   crypto: crypto_user_stat: Add crypto_stats_init
> 
>  crypto/Kconfig                       |   1 +
>  crypto/Makefile                      |   3 +-
>  crypto/ahash.c                       |  17 +-
>  crypto/algapi.c                      | 247 ++-
>  crypto/crypto_user_stat.c            | 160 +--
>  crypto/rng.c                         |   4 +-
>  include/crypto/acompress.h           |  38 +---
>  include/crypto/aead.h                |  38 +---
>  include/crypto/akcipher.h            |  74 ++-
>  include/crypto/hash.h                |  32 +--
>  include/crypto/internal/cryptouser.h |  17 ++
>  include/crypto/kpp.h                 |  48 +
>  include/crypto/rng.h                 |  27 +--
>  include/crypto/skcipher.h            |  36 +---
>  include/linux/crypto.h               | 290 ++-
>  include/uapi/linux/cryptouser.h      | 102 ++
>  tools/crypto/getstat.c               |  72 +++
>  17 files changed, 676 insertions(+), 530 deletions(-)

All applied.  Thanks.
-- 
Email: Herbert Xu
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
Re: [crypto chcr 1/2] small packet Tx stalls the queue
On Fri, Nov 30, 2018 at 02:31:48PM +0530, Atul Gupta wrote:
> Immediate packets sent to hardware should include the work request
> length in calculating the flits. A WR occupies one flit and, if not
> accounted for, results in an invalid request which stalls the HW
> queue.
> 
> Cc: sta...@vger.kernel.org
> Signed-off-by: Atul Gupta
> ---
>  drivers/crypto/chelsio/chcr_ipsec.c | 5 ++++-
>  1 file changed, 4 insertions(+), 1 deletion(-)

All applied.  Thanks.
-- 
Email: Herbert Xu
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
[PATCH] crypto: adiantum - adjust some comments to match latest paper
From: Eric Biggers

The 2018-11-28 revision of the Adiantum paper has revised some notation:

- 'M' was replaced with 'L' (meaning "Left", for the left-hand part of
  the message) in the definition of Adiantum hashing, to avoid confusion
  with the full message
- ε-almost-∆-universal is now abbreviated as ε-∆U instead of εA∆U
- "block" is now used only to mean block cipher and Poly1305 blocks

Also, Adiantum hashing was moved from the appendix to the main paper.

To avoid confusion, update relevant comments in the code to match.

Signed-off-by: Eric Biggers
---
 crypto/adiantum.c   | 35 +++
 crypto/nhpoly1305.c |  8
 2 files changed, 23 insertions(+), 20 deletions(-)

diff --git a/crypto/adiantum.c b/crypto/adiantum.c
index ca27e0dc2958c..e62e34f5e389b 100644
--- a/crypto/adiantum.c
+++ b/crypto/adiantum.c
@@ -9,7 +9,7 @@
  * Adiantum is a tweakable, length-preserving encryption mode designed for fast
  * and secure disk encryption, especially on CPUs without dedicated crypto
  * instructions.  Adiantum encrypts each sector using the XChaCha12 stream
- * cipher, two passes of an ε-almost-∆-universal (εA∆U) hash function based on
+ * cipher, two passes of an ε-almost-∆-universal (ε-∆U) hash function based on
  * NH and Poly1305, and an invocation of the AES-256 block cipher on a single
  * 16-byte block.  See the paper for details:
  *
@@ -21,12 +21,12 @@
  *	- Stream cipher: XChaCha12 or XChaCha20
  *	- Block cipher: any with a 128-bit block size and 256-bit key
  *
- * This implementation doesn't currently allow other εA∆U hash functions, i.e.
+ * This implementation doesn't currently allow other ε-∆U hash functions, i.e.
  * HPolyC is not supported.  This is because Adiantum is ~20% faster than HPolyC
- * but still provably as secure, and also the εA∆U hash function of HBSH is
+ * but still provably as secure, and also the ε-∆U hash function of HBSH is
  * formally defined to take two inputs (tweak, message) which makes it difficult
  * to wrap with the crypto_shash API.  Rather, some details need to be handled
- * here.  Nevertheless, if needed in the future, support for other εA∆U hash
+ * here.  Nevertheless, if needed in the future, support for other ε-∆U hash
  * functions could be added here.
  */

@@ -41,7 +41,7 @@
 #include "internal.h"
 
 /*
- * Size of right-hand block of input data, in bytes; also the size of the block
+ * Size of right-hand part of input data, in bytes; also the size of the block
  * cipher's block size and the hash function's output.
  */
 #define BLOCKCIPHER_BLOCK_SIZE		16
@@ -77,7 +77,7 @@ struct adiantum_tfm_ctx {
 struct adiantum_request_ctx {
 
 	/*
-	 * Buffer for right-hand block of data, i.e.
+	 * Buffer for right-hand part of data, i.e.
 	 *
 	 *    P_L => P_M => C_M => C_R when encrypting, or
 	 *    C_R => C_M => P_M => P_L when decrypting.
@@ -93,8 +93,8 @@ struct adiantum_request_ctx {
 	bool enc; /* true if encrypting, false if decrypting */
 
 	/*
-	 * The result of the Poly1305 εA∆U hash function applied to
-	 * (message length, tweak).
+	 * The result of the Poly1305 ε-∆U hash function applied to
+	 * (bulk length, tweak)
 	 */
 	le128 header_hash;
@@ -213,13 +213,16 @@ static inline void le128_sub(le128 *r, const le128 *v1, const le128 *v2)
 }
 
 /*
- * Apply the Poly1305 εA∆U hash function to (message length, tweak) and save the
- * result to rctx->header_hash.
+ * Apply the Poly1305 ε-∆U hash function to (bulk length, tweak) and save the
+ * result to rctx->header_hash.  This is the calculation
  *
- * This value is reused in both the first and second hash steps.  Specifically,
- * it's added to the result of an independently keyed εA∆U hash function (for
- * equal length inputs only) taken over the message.  This gives the overall
- * Adiantum hash of the (tweak, message) pair.
+ *	H_T ← Poly1305_{K_T}(bin_{128}(|L|) || T)
+ *
+ * from the procedure in section 6.4 of the Adiantum paper.  The resulting value
+ * is reused in both the first and second hash steps.  Specifically, it's added
+ * to the result of an independently keyed ε-∆U hash function (for equal length
+ * inputs only) taken over the left-hand part (the "bulk") of the message, to
+ * give the overall Adiantum hash of the (tweak, left-hand part) pair.
  */
 static void adiantum_hash_header(struct skcipher_request *req)
 {
@@ -248,7 +251,7 @@ static void adiantum_hash_header(struct skcipher_request *req)
 	poly1305_core_emit(&state, &rctx->header_hash);
 }
 
-/* Hash the left-hand block (the "bulk") of the message using NHPoly1305 */
+/* Hash the left-hand part (the "bulk") of the message using NHPoly1305 */
 static int adiantum_hash_message(struct skcipher_request *req,
 				 struct scatterlist *sgl, le128 *digest)
 {
@@ -550,7 +553,7 @@ static int adiantum_create(struct crypto_template *tmpl, struct rtattr **tb)
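[For reference, the two-step hash these comments describe can be written out as follows. This is reconstructed from the comments above, with illustrative key subscripts; see section 6.4 of the paper for the authoritative definition:

	H_T ← Poly1305_{K_T}(bin_{128}(|L|) || T)
	H(T, L) = H_T ⊞ NHPoly1305_{K_L}(L)

where L is the left-hand ("bulk") part of the message, T is the tweak, and ⊞ is addition modulo 2^128 — le128_add()/le128_sub() in the code, for the encrypt and decrypt directions.]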
[PATCH] crypto: xchacha20 - fix comments for test vectors
From: Eric Biggers

The kernel's ChaCha20 uses the RFC7539 convention of the nonce being
12 bytes rather than 8, so actually I only appended 12 random bytes (not
16) to its test vectors to form 24-byte nonces for the XChaCha20 test
vectors.  The other 4 bytes were just from zero-padding the stream
position to 8 bytes.  Fix the comments above the test vectors.

Signed-off-by: Eric Biggers
---
 crypto/testmgr.h | 14 ++++++--------
 1 file changed, 6 insertions(+), 8 deletions(-)

diff --git a/crypto/testmgr.h b/crypto/testmgr.h
index 357cf4cbcbb1c..e8f47d7b92cdd 100644
--- a/crypto/testmgr.h
+++ b/crypto/testmgr.h
@@ -32281,8 +32281,9 @@ static const struct cipher_testvec xchacha20_tv_template[] = {
 			  "\x57\x78\x8e\x6f\xae\x90\xfc\x31"
 			  "\x09\x7c\xfc",
 		.len	= 91,
-	}, { /* Taken from the ChaCha20 test vectors, appended 16 random bytes
-		to nonce, and recomputed the ciphertext with libsodium */
+	}, { /* Taken from the ChaCha20 test vectors, appended 12 random bytes
+		to the nonce, zero-padded the stream position from 4 to 8 bytes,
+		and recomputed the ciphertext using libsodium's XChaCha20 */
 		.key	= "\x00\x00\x00\x00\x00\x00\x00\x00"
 			  "\x00\x00\x00\x00\x00\x00\x00\x00"
 			  "\x00\x00\x00\x00\x00\x00\x00\x00"
@@ -32309,8 +32310,7 @@ static const struct cipher_testvec xchacha20_tv_template[] = {
 			  "\x03\xdc\xf8\x2b\xc1\xe1\x75\x67"
 			  "\x23\x7b\xe6\xfc\xd4\x03\x86\x54",
 		.len	= 64,
-	}, { /* Taken from the ChaCha20 test vectors, appended 16 random bytes
-		to nonce, and recomputed the ciphertext with libsodium */
+	}, { /* Derived from a ChaCha20 test vector, via the process above */
 		.key	= "\x00\x00\x00\x00\x00\x00\x00\x00"
 			  "\x00\x00\x00\x00\x00\x00\x00\x00"
 			  "\x00\x00\x00\x00\x00\x00\x00\x00"
@@ -32419,8 +32419,7 @@ static const struct cipher_testvec xchacha20_tv_template[] = {
 		.np	= 3,
 		.tap	= { 375 - 20, 4, 16 },
-	}, { /* Taken from the ChaCha20 test vectors, appended 16 random bytes
-		to nonce, and recomputed the ciphertext with libsodium */
+	}, { /* Derived from a ChaCha20 test vector, via the process above */
 		.key	= "\x1c\x92\x40\xa5\xeb\x55\xd3\x8a"
 			  "\xf3\x33\x88\x86\x04\xf6\xb5\xf0"
 			  "\x47\x39\x17\xc1\x40\x2b\x80\x09"
@@ -32463,8 +32462,7 @@ static const struct cipher_testvec xchacha20_tv_template[] = {
 			  "\x65\x03\xfa\x45\xf7\x9e\x53\x7a"
 			  "\x99\xf1\x82\x25\x4f\x8d\x07",
 		.len	= 127,
-	}, { /* Taken from the ChaCha20 test vectors, appended 16 random bytes
-		to nonce, and recomputed the ciphertext with libsodium */
+	}, { /* Derived from a ChaCha20 test vector, via the process above */
 		.key	= "\x1c\x92\x40\xa5\xeb\x55\xd3\x8a"
 			  "\xf3\x33\x88\x86\x04\xf6\xb5\xf0"
 			  "\x47\x39\x17\xc1\x40\x2b\x80\x09"
-- 
2.20.0.rc2.403.gdbc3b29805-goog
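[Illustration of the IV layout the corrected comments describe — a hypothetical struct, not in testmgr.h, assuming the xchacha IV convention used by crypto/chacha20_generic.c at the time (24-byte nonce followed by a 64-bit stream position):]

#include <linux/types.h>

struct xchacha_testvec_iv {
	u8 nonce[24];		/* original 12-byte ChaCha20 nonce + 12 appended random bytes */
	__le64 stream_pos;	/* ChaCha20's 4-byte counter, zero-padded to 8 bytes */
};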
[PATCH] crypto: xchacha - add test vector from XChaCha20 draft RFC
From: Eric Biggers There is a draft specification for XChaCha20 being worked on. Add the XChaCha20 test vector from the appendix so that we can be extra sure the kernel's implementation is compatible. I also recomputed the ciphertext with XChaCha12 and added it there too, to keep the tests for XChaCha20 and XChaCha12 in sync. Signed-off-by: Eric Biggers --- crypto/testmgr.h | 178 ++- 1 file changed, 176 insertions(+), 2 deletions(-) diff --git a/crypto/testmgr.h b/crypto/testmgr.h index e7e56a8febbca..357cf4cbcbb1c 100644 --- a/crypto/testmgr.h +++ b/crypto/testmgr.h @@ -32800,7 +32800,94 @@ static const struct cipher_testvec xchacha20_tv_template[] = { .also_non_np = 1, .np = 3, .tap= { 1200, 1, 80 }, - }, + }, { /* test vector from https://tools.ietf.org/html/draft-arciszewski-xchacha-02#appendix-A.3.2 */ + .key= "\x80\x81\x82\x83\x84\x85\x86\x87" + "\x88\x89\x8a\x8b\x8c\x8d\x8e\x8f" + "\x90\x91\x92\x93\x94\x95\x96\x97" + "\x98\x99\x9a\x9b\x9c\x9d\x9e\x9f", + .klen = 32, + .iv = "\x40\x41\x42\x43\x44\x45\x46\x47" + "\x48\x49\x4a\x4b\x4c\x4d\x4e\x4f" + "\x50\x51\x52\x53\x54\x55\x56\x58" + "\x00\x00\x00\x00\x00\x00\x00\x00", + .ptext = "\x54\x68\x65\x20\x64\x68\x6f\x6c" + "\x65\x20\x28\x70\x72\x6f\x6e\x6f" + "\x75\x6e\x63\x65\x64\x20\x22\x64" + "\x6f\x6c\x65\x22\x29\x20\x69\x73" + "\x20\x61\x6c\x73\x6f\x20\x6b\x6e" + "\x6f\x77\x6e\x20\x61\x73\x20\x74" + "\x68\x65\x20\x41\x73\x69\x61\x74" + "\x69\x63\x20\x77\x69\x6c\x64\x20" + "\x64\x6f\x67\x2c\x20\x72\x65\x64" + "\x20\x64\x6f\x67\x2c\x20\x61\x6e" + "\x64\x20\x77\x68\x69\x73\x74\x6c" + "\x69\x6e\x67\x20\x64\x6f\x67\x2e" + "\x20\x49\x74\x20\x69\x73\x20\x61" + "\x62\x6f\x75\x74\x20\x74\x68\x65" + "\x20\x73\x69\x7a\x65\x20\x6f\x66" + "\x20\x61\x20\x47\x65\x72\x6d\x61" + "\x6e\x20\x73\x68\x65\x70\x68\x65" + "\x72\x64\x20\x62\x75\x74\x20\x6c" + "\x6f\x6f\x6b\x73\x20\x6d\x6f\x72" + "\x65\x20\x6c\x69\x6b\x65\x20\x61" + "\x20\x6c\x6f\x6e\x67\x2d\x6c\x65" + "\x67\x67\x65\x64\x20\x66\x6f\x78" + "\x2e\x20\x54\x68\x69\x73\x20\x68" + "\x69\x67\x68\x6c\x79\x20\x65\x6c" + "\x75\x73\x69\x76\x65\x20\x61\x6e" + "\x64\x20\x73\x6b\x69\x6c\x6c\x65" + "\x64\x20\x6a\x75\x6d\x70\x65\x72" + "\x20\x69\x73\x20\x63\x6c\x61\x73" + "\x73\x69\x66\x69\x65\x64\x20\x77" + "\x69\x74\x68\x20\x77\x6f\x6c\x76" + "\x65\x73\x2c\x20\x63\x6f\x79\x6f" + "\x74\x65\x73\x2c\x20\x6a\x61\x63" + "\x6b\x61\x6c\x73\x2c\x20\x61\x6e" + "\x64\x20\x66\x6f\x78\x65\x73\x20" + "\x69\x6e\x20\x74\x68\x65\x20\x74" + "\x61\x78\x6f\x6e\x6f\x6d\x69\x63" + "\x20\x66\x61\x6d\x69\x6c\x79\x20" + "\x43\x61\x6e\x69\x64\x61\x65\x2e", + .ctext = "\x45\x59\xab\xba\x4e\x48\xc1\x61" + "\x02\xe8\xbb\x2c\x05\xe6\x94\x7f" + "\x50\xa7\x86\xde\x16\x2f\x9b\x0b" + "\x7e\x59\x2a\x9b\x53\xd0\xd4\xe9" + "\x8d\x8d\x64\x10\xd5\x40\xa1\xa6" + "\x37\x5b\x26\xd8\x0d\xac\xe4\xfa" + "\xb5\x23\x84\xc7\x31\xac\xbf\x16" + "\xa5\x92\x3c\x0c\x48\xd3\x57\x5d" + "\x4d\x0d\x2c\x67\x3b\x66\x6f\xaa" + "\x73\x10\x61\x27\x77\x01\x09\x3a" + "\x6b\xf7\xa1\x58\xa8\x86\x42\x92" + "\xa4\x1c\x48\xe3\xa9\xb4\xc0\xda" + "\xec\xe0\xf8\xd9\x8d\x0d\x7e\x05" + "\xb3\x7a\x30\x7b\xbb\x66\x33\x31" + "\x64\xec\x9e\x1b\x24\xea\x0d\x6c" + "\x3f\xfd\xdc\xec\x4f\x68\xe7\x44" + "\x30\x56\x19\x3a\x03\xc8\x10\xe1" + "\x13\x44\xca\x06\xd8\xed\x8a\x2b" + "\xfb\x1e\x8d\x48\xcf\xa6\xbc\x0e" +
Using Advanced Vector eXtensions with hand-coded x64 algorithms (e.g /arch/x86/blowfish-x86_64-asm_64.S)
I was curious whether it might make implementing F() faster to use instructions that are meant to work with sets of data similar to what would be processed.
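[For context — Blowfish's round function in C, from the standard definition rather than the .S file. The obstacle to vectorizing it is that each call makes four data-dependent S-box lookups, which SIMD only helps with if fast gather loads are available:]

/* Blowfish F(): ((S1[a] + S2[b]) ^ S3[c]) + S4[d], all mod 2^32. */
static u32 blowfish_F(const u32 S[4][256], u32 x)
{
	u8 a = x >> 24, b = x >> 16, c = x >> 8, d = x;

	/* AVX2's vpgatherdd could fetch the lookups for several blocks
	 * at once, but gathers have historically been slow enough that
	 * scalar code wins -- presumably why the hand-coded x86_64
	 * version stays scalar. */
	return ((S[0][a] + S[1][b]) ^ S[2][c]) + S[3][d];
}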
[PATCH] crypto: adiantum - propagate CRYPTO_ALG_ASYNC flag to instance
From: Eric Biggers

If the stream cipher implementation is asynchronous, then the Adiantum
instance must be flagged as asynchronous as well.  Otherwise someone
asking for a synchronous algorithm can get an asynchronous algorithm.

There are no asynchronous xchacha12 or xchacha20 implementations yet
which makes this largely a theoretical issue, but it should be fixed.

Fixes: 059c2a4d8e16 ("crypto: adiantum - add Adiantum support")
Signed-off-by: Eric Biggers
---
 crypto/adiantum.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/crypto/adiantum.c b/crypto/adiantum.c
index 2dfcf12fd4529..ca27e0dc2958c 100644
--- a/crypto/adiantum.c
+++ b/crypto/adiantum.c
@@ -590,6 +590,8 @@ static int adiantum_create(struct crypto_template *tmpl, struct rtattr **tb)
 		     hash_alg->base.cra_driver_name) >= CRYPTO_MAX_ALG_NAME)
 		goto out_drop_hash;
 
+	inst->alg.base.cra_flags = streamcipher_alg->base.cra_flags &
+				   CRYPTO_ALG_ASYNC;
 	inst->alg.base.cra_blocksize = BLOCKCIPHER_BLOCK_SIZE;
 	inst->alg.base.cra_ctxsize = sizeof(struct adiantum_tfm_ctx);
 	inst->alg.base.cra_alignmask = streamcipher_alg->base.cra_alignmask |
-- 
2.20.0.rc1.387.gf8505762e3-goog
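[The caller-side contract this fixes — standard crypto API usage, not from the patch: type = 0 with CRYPTO_ALG_ASYNC set in the mask requests an implementation whose ASYNC flag is clear, i.e. a synchronous one. Before this fix, an Adiantum instance built on an async stream cipher could slip through such a request:]

#include <crypto/skcipher.h>

struct crypto_skcipher *tfm =
	crypto_alloc_skcipher("adiantum(xchacha12,aes)", 0, CRYPTO_ALG_ASYNC);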
Re: [PATCH] fscrypt: remove CRYPTO_CTR dependency
On Thu, Sep 06, 2018 at 12:43:41PM +0200, Ard Biesheuvel wrote:
> On 5 September 2018 at 21:24, Eric Biggers wrote:
> > From: Eric Biggers
> >
> > fscrypt doesn't use the CTR mode of operation for anything, so there's
> > no need to select CRYPTO_CTR.  It was added by commit 71dea01ea2ed
> > ("ext4 crypto: require CONFIG_CRYPTO_CTR if ext4 encryption is
> > enabled").  But, I've been unable to identify the arm64 crypto bug it
> > was supposedly working around.
> >
> > I suspect the issue was seen only on some old Android device kernel
> > (circa 3.10?).  So if the fix wasn't mistaken, the real bug is probably
> > already fixed.  Or maybe it was actually a bug in a non-upstream crypto
> > driver.
> >
> > So, remove the dependency.  If it turns out there's actually still a
> > bug, we'll fix it properly.
> >
> > Signed-off-by: Eric Biggers
> 
> Acked-by: Ard Biesheuvel
> 
> This may be related to
> 
> 11e3b725cfc2 crypto: arm64/aes-blk - honour iv_out requirement in CBC
> and CTR modes
> 
> given that the commit in question mentions CTS. How it actually works
> around the issue is unclear to me, though.
> 
> > ---
> >  fs/crypto/Kconfig | 1 -
> >  1 file changed, 1 deletion(-)
> >
> > diff --git a/fs/crypto/Kconfig b/fs/crypto/Kconfig
> > index 02b7d91c92310..284b589b4774d 100644
> > --- a/fs/crypto/Kconfig
> > +++ b/fs/crypto/Kconfig
> > @@ -6,7 +6,6 @@ config FS_ENCRYPTION
> >  	select CRYPTO_ECB
> >  	select CRYPTO_XTS
> >  	select CRYPTO_CTS
> > -	select CRYPTO_CTR
> >  	select CRYPTO_SHA256
> >  	select KEYS
> >  	help
> > --
> > 2.19.0.rc2.392.g5ba43deb5a-goog

Ping.  Ted, can you consider applying this to the fscrypt tree for 4.21?

Thanks,

- Eric
[PATCH v2 0/3] crypto: arm64/chacha - performance improvements
Improve the performance of NEON based ChaCha:

Patch #1 adds a block size of 1472 to the tcrypt test template so we
have something that reflects the VPN case.

Patch #2 improves performance for arbitrary length inputs: on deep
pipelines, throughput increases ~30% when running on input blocks whose
size is drawn randomly from the interval [64, 1024)

Patch #3 adopts the OpenSSL approach to use the ALU in parallel with
the SIMD unit to process a fifth block while the SIMD is operating on
4 blocks.

Performance on Cortex-A57:

BEFORE:
=======
testing speed of async chacha20 (chacha20-neon) encryption
tcrypt: test 0 (256 bit key, 16 byte blocks): 2528223 operations in 1 seconds (40451568 bytes)
tcrypt: test 1 (256 bit key, 64 byte blocks): 2518155 operations in 1 seconds (161161920 bytes)
tcrypt: test 2 (256 bit key, 256 byte blocks): 1207948 operations in 1 seconds (309234688 bytes)
tcrypt: test 3 (256 bit key, 1024 byte blocks): 332194 operations in 1 seconds (340166656 bytes)
tcrypt: test 4 (256 bit key, 1472 byte blocks): 185659 operations in 1 seconds (273290048 bytes)
tcrypt: test 5 (256 bit key, 8192 byte blocks): 41829 operations in 1 seconds (342663168 bytes)

AFTER:
======
testing speed of async chacha20 (chacha20-neon) encryption
tcrypt: test 0 (256 bit key, 16 byte blocks): 2530018 operations in 1 seconds (40480288 bytes)
tcrypt: test 1 (256 bit key, 64 byte blocks): 2518270 operations in 1 seconds (161169280 bytes)
tcrypt: test 2 (256 bit key, 256 byte blocks): 1187760 operations in 1 seconds (304066560 bytes)
tcrypt: test 3 (256 bit key, 1024 byte blocks): 361652 operations in 1 seconds (370331648 bytes)
tcrypt: test 4 (256 bit key, 1472 byte blocks): 280971 operations in 1 seconds (413589312 bytes)
tcrypt: test 5 (256 bit key, 8192 byte blocks): 53654 operations in 1 seconds (439533568 bytes)

Zinc:
=====
testing speed of async chacha20 (chacha20-software) encryption
tcrypt: test 0 (256 bit key, 16 byte blocks): 2510300 operations in 1 seconds (40164800 bytes)
tcrypt: test 1 (256 bit key, 64 byte blocks): 2663794 operations in 1 seconds (170482816 bytes)
tcrypt: test 2 (256 bit key, 256 byte blocks): 1237617 operations in 1 seconds (316829952 bytes)
tcrypt: test 3 (256 bit key, 1024 byte blocks): 364645 operations in 1 seconds (373396480 bytes)
tcrypt: test 4 (256 bit key, 1472 byte blocks): 251548 operations in 1 seconds (370278656 bytes)
tcrypt: test 5 (256 bit key, 8192 byte blocks): 47650 operations in 1 seconds (390348800 bytes)

Cc: Eric Biggers
Cc: Martin Willi

Ard Biesheuvel (3):
  crypto: tcrypt - add block size of 1472 to skcipher template
  crypto: arm64/chacha - optimize for arbitrary length inputs
  crypto: arm64/chacha - use combined SIMD/ALU routine for more speed

 arch/arm64/crypto/chacha-neon-core.S | 396 +++-
 arch/arm64/crypto/chacha-neon-glue.c |  59 ++-
 crypto/tcrypt.c                      |   2 +-
 3 files changed, 404 insertions(+), 53 deletions(-)

-- 
2.19.2
[PATCH v2 1/3] crypto: tcrypt - add block size of 1472 to skcipher template
In order to have better coverage of algorithms operating on block sizes
that are in the ballpark of a VPN packet, add 1472 to the block_sizes
array.

Signed-off-by: Ard Biesheuvel
---
 crypto/tcrypt.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/crypto/tcrypt.c b/crypto/tcrypt.c
index 0590a9204562..e7fb87e114a5 100644
--- a/crypto/tcrypt.c
+++ b/crypto/tcrypt.c
@@ -81,7 +81,7 @@ static char *check[] = {
 	NULL
 };
 
-static u32 block_sizes[] = { 16, 64, 256, 1024, 8192, 0 };
+static u32 block_sizes[] = { 16, 64, 256, 1024, 1472, 8192, 0 };
 
 static u32 aead_sizes[] = { 16, 64, 256, 512, 1024, 2048, 4096, 8192, 0 };
 
 #define XBUFSIZE	8
-- 
2.19.2
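[For context, 1472 is presumably chosen as the largest UDP payload that fits a standard 1500-byte Ethernet MTU:

	1500 − 20 (IPv4 header) − 8 (UDP header) = 1472 bytes

which makes it representative of a UDP-encapsulated VPN packet.]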
[PATCH v2 3/3] crypto: arm64/chacha - use combined SIMD/ALU routine for more speed
To some degree, most known AArch64 micro-architectures appear to be able to issue ALU instructions in parellel to SIMD instructions without affecting the SIMD throughput. This means we can use the ALU to process a fifth ChaCha block while the SIMD is processing four blocks in parallel. Signed-off-by: Ard Biesheuvel --- arch/arm64/crypto/chacha-neon-core.S | 235 ++-- arch/arm64/crypto/chacha-neon-glue.c | 39 ++-- 2 files changed, 239 insertions(+), 35 deletions(-) diff --git a/arch/arm64/crypto/chacha-neon-core.S b/arch/arm64/crypto/chacha-neon-core.S index 32086709e6b3..534e0a3fafa4 100644 --- a/arch/arm64/crypto/chacha-neon-core.S +++ b/arch/arm64/crypto/chacha-neon-core.S @@ -1,13 +1,13 @@ /* * ChaCha/XChaCha NEON helper functions * - * Copyright (C) 2016 Linaro, Ltd. + * Copyright (C) 2016-2018 Linaro, Ltd. * * This program is free software; you can redistribute it and/or modify * it under the terms of the GNU General Public License version 2 as * published by the Free Software Foundation. * - * Based on: + * Originally based on: * ChaCha20 256-bit cipher algorithm, RFC7539, x64 SSSE3 functions * * Copyright (C) 2015 Martin Willi @@ -160,8 +160,27 @@ ENTRY(hchacha_block_neon) ret x9 ENDPROC(hchacha_block_neon) + a0 .reqw12 + a1 .reqw13 + a2 .reqw14 + a3 .reqw15 + a4 .reqw16 + a5 .reqw17 + a6 .reqw19 + a7 .reqw20 + a8 .reqw21 + a9 .reqw22 + a10 .reqw23 + a11 .reqw24 + a12 .reqw25 + a13 .reqw26 + a14 .reqw27 + a15 .reqw28 + .align 6 ENTRY(chacha_4block_xor_neon) + frame_push 10 + // x0: Input state matrix, s // x1: 4 data blocks output, o // x2: 4 data blocks input, i @@ -181,6 +200,9 @@ ENTRY(chacha_4block_xor_neon) // matrix by interleaving 32- and then 64-bit words, which allows us to // do XOR in NEON registers. // + // At the same time, a fifth block is encrypted in parallel using + // scalar registers + // adr_l x9, CTRINC // ... and ROT8 ld1 {v30.4s-v31.4s}, [x9] @@ -191,7 +213,24 @@ ENTRY(chacha_4block_xor_neon) ld4r{ v8.4s-v11.4s}, [x8], #16 ld4r{v12.4s-v15.4s}, [x8] - // x12 += counter values 0-3 + mov a0, v0.s[0] + mov a1, v1.s[0] + mov a2, v2.s[0] + mov a3, v3.s[0] + mov a4, v4.s[0] + mov a5, v5.s[0] + mov a6, v6.s[0] + mov a7, v7.s[0] + mov a8, v8.s[0] + mov a9, v9.s[0] + mov a10, v10.s[0] + mov a11, v11.s[0] + mov a12, v12.s[0] + mov a13, v13.s[0] + mov a14, v14.s[0] + mov a15, v15.s[0] + + // x12 += counter values 1-4 add v12.4s, v12.4s, v30.4s .Ldoubleround4: @@ -200,33 +239,53 @@ ENTRY(chacha_4block_xor_neon) // x2 += x6, x14 = rotl32(x14 ^ x2, 16) // x3 += x7, x15 = rotl32(x15 ^ x3, 16) add v0.4s, v0.4s, v4.4s + add a0, a0, a4 add v1.4s, v1.4s, v5.4s + add a1, a1, a5 add v2.4s, v2.4s, v6.4s + add a2, a2, a6 add v3.4s, v3.4s, v7.4s + add a3, a3, a7 eor v12.16b, v12.16b, v0.16b + eor a12, a12, a0 eor v13.16b, v13.16b, v1.16b + eor a13, a13, a1 eor v14.16b, v14.16b, v2.16b + eor a14, a14, a2 eor v15.16b, v15.16b, v3.16b + eor a15, a15, a3 rev32 v12.8h, v12.8h + ror a12, a12, #16 rev32 v13.8h, v13.8h + ror a13, a13, #16 rev32 v14.8h, v14.8h + ror a14, a14, #16 rev32 v15.8h, v15.8h + ror a15, a15, #16 // x8 += x12, x4 = rotl32(x4 ^ x8, 12) // x9 += x13, x5 = rotl32(x5 ^ x9, 12) // x10 += x14, x6 = rotl32(x6 ^ x10, 12) // x11 += x15, x7 = rotl32(x7 ^ x11, 12) add v8.4s, v8.4s, v12.4s + add a8, a8, a12 add v9.4s, v9.4s, v13.4s + add a9, a9, a13 add v10.4s, v10.4s, v14.4s + add a10, a10, a14 add v11.4s, v11.4s, v15.4s + add
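[For readers mapping the interleaved instructions above back to the algorithm: each scalar add/eor/ror triple, like each vector add/eor/rev32 (or shift) triple, is one step of the standard ChaCha quarter-round. A reference version in C, from the RFC 7539 definition rather than this patch:]

#define ROTL32(v, n) (((v) << (n)) | ((v) >> (32 - (n))))

/* One quarter-round; the NEON code runs four copies across vector
 * lanes while the scalar registers a0..a15 carry a fifth block. */
static void chacha_quarterround(u32 x[16], int a, int b, int c, int d)
{
	x[a] += x[b]; x[d] = ROTL32(x[d] ^ x[a], 16);
	x[c] += x[d]; x[b] = ROTL32(x[b] ^ x[c], 12);
	x[a] += x[b]; x[d] = ROTL32(x[d] ^ x[a], 8);
	x[c] += x[d]; x[b] = ROTL32(x[b] ^ x[c], 7);
}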
[PATCH v2 2/3] crypto: arm64/chacha - optimize for arbitrary length inputs
Update the 4-way NEON ChaCha routine so it can handle input of any length >64 bytes in its entirety, rather than having to call into the 1-way routine and/or memcpy()s via temp buffers to handle the tail of a ChaCha invocation that is not a multiple of 256 bytes. On inputs that are a multiple of 256 bytes (and thus in tcrypt benchmarks), performance drops by around 1% on Cortex-A57, while performance for inputs drawn randomly from the range [64, 1024) increases by around 30%. Signed-off-by: Ard Biesheuvel --- arch/arm64/crypto/chacha-neon-core.S | 183 ++-- arch/arm64/crypto/chacha-neon-glue.c | 38 ++-- 2 files changed, 184 insertions(+), 37 deletions(-) diff --git a/arch/arm64/crypto/chacha-neon-core.S b/arch/arm64/crypto/chacha-neon-core.S index 75b4e06cee79..32086709e6b3 100644 --- a/arch/arm64/crypto/chacha-neon-core.S +++ b/arch/arm64/crypto/chacha-neon-core.S @@ -19,6 +19,8 @@ */ #include +#include +#include .text .align 6 @@ -36,7 +38,7 @@ */ chacha_permute: - adr x10, ROT8 + adr_l x10, ROT8 ld1 {v12.4s}, [x10] .Ldoubleround: @@ -164,6 +166,12 @@ ENTRY(chacha_4block_xor_neon) // x1: 4 data blocks output, o // x2: 4 data blocks input, i // w3: nrounds + // x4: byte count + + adr_l x10, .Lpermute + and x5, x4, #63 + add x10, x10, x5 + add x11, x10, #64 // // This function encrypts four consecutive ChaCha blocks by loading @@ -173,15 +181,15 @@ ENTRY(chacha_4block_xor_neon) // matrix by interleaving 32- and then 64-bit words, which allows us to // do XOR in NEON registers. // - adr x9, CTRINC // ... and ROT8 + adr_l x9, CTRINC // ... and ROT8 ld1 {v30.4s-v31.4s}, [x9] // x0..15[0-3] = s0..3[0..3] - mov x4, x0 - ld4r{ v0.4s- v3.4s}, [x4], #16 - ld4r{ v4.4s- v7.4s}, [x4], #16 - ld4r{ v8.4s-v11.4s}, [x4], #16 - ld4r{v12.4s-v15.4s}, [x4] + add x8, x0, #16 + ld4r{ v0.4s- v3.4s}, [x0] + ld4r{ v4.4s- v7.4s}, [x8], #16 + ld4r{ v8.4s-v11.4s}, [x8], #16 + ld4r{v12.4s-v15.4s}, [x8] // x12 += counter values 0-3 add v12.4s, v12.4s, v30.4s @@ -425,24 +433,47 @@ ENTRY(chacha_4block_xor_neon) zip1v30.4s, v14.4s, v15.4s zip2v31.4s, v14.4s, v15.4s + mov x3, #64 + subsx5, x4, #64 + add x6, x5, x2 + cselx3, x3, xzr, ge + cselx2, x2, x6, ge + // interleave 64-bit words in state n, n+2 zip1v0.2d, v16.2d, v18.2d zip2v4.2d, v16.2d, v18.2d zip1v8.2d, v17.2d, v19.2d zip2v12.2d, v17.2d, v19.2d - ld1 {v16.16b-v19.16b}, [x2], #64 + ld1 {v16.16b-v19.16b}, [x2], x3 + + subsx6, x4, #128 + ccmpx3, xzr, #4, lt + add x7, x6, x2 + cselx3, x3, xzr, eq + cselx2, x2, x7, eq zip1v1.2d, v20.2d, v22.2d zip2v5.2d, v20.2d, v22.2d zip1v9.2d, v21.2d, v23.2d zip2v13.2d, v21.2d, v23.2d - ld1 {v20.16b-v23.16b}, [x2], #64 + ld1 {v20.16b-v23.16b}, [x2], x3 + + subsx7, x4, #192 + ccmpx3, xzr, #4, lt + add x8, x7, x2 + cselx3, x3, xzr, eq + cselx2, x2, x8, eq zip1v2.2d, v24.2d, v26.2d zip2v6.2d, v24.2d, v26.2d zip1v10.2d, v25.2d, v27.2d zip2v14.2d, v25.2d, v27.2d - ld1 {v24.16b-v27.16b}, [x2], #64 + ld1 {v24.16b-v27.16b}, [x2], x3 + + subsx8, x4, #256 + ccmpx3, xzr, #4, lt + add x9, x8, x2 + cselx2, x2, x9, eq zip1v3.2d, v28.2d, v30.2d zip2v7.2d, v28.2d, v30.2d @@ -451,29 +482,155 @@ ENTRY(chacha_4block_xor_neon) ld1 {v28.16b-v31.16b}, [x2] // xor with corresponding input, write to output + tbnzx5, #63, 0f eor v16.16b, v16.16b, v0.16b eor v17.16b, v17.16b, v1.16b eor v18.16b, v18.16b, v2.16b eor v19.16b, v19.16b, v3.16b + st1 {v16.16b-v19.16b}, [x1], #64 + + tbnzx6, #63, 1f eor v20.16b, v20.16b, v4.16b eor v21.16b, v21.16b, v5.16b
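[What the patch avoids, conceptually — a sketch of the bounce-buffer tail path that the reworked assembly makes unnecessary. Helper names are as in kernels of that era; treat this as illustrative rather than the actual glue code:]

/* Old-style tail handling: generate a full 64-byte keystream block
 * into a temporary buffer, then XOR only the remaining tail bytes. */
static void chacha_xor_tail(u32 state[16], u8 *dst, const u8 *src,
			    unsigned int tail, int nrounds)
{
	u8 stream[CHACHA_BLOCK_SIZE];

	chacha_block(state, stream, nrounds);
	crypto_xor_cpy(dst, src, stream, tail);	/* tail < CHACHA_BLOCK_SIZE */
}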
[crypto chcr 2/2] ESN for Inline IPSec Tx
Send SPI, 64b seq nos and 64b IV with aadiv drop for inline crypto. This information is added in outgoing packet after the CPL TX PKT XT and removed by hardware. The aad, auth and cipher offsets are then adjusted for ESN enabled tunnel. Signed-off-by: Atul Gupta --- drivers/crypto/chelsio/chcr_core.h | 9 ++ drivers/crypto/chelsio/chcr_ipsec.c | 175 2 files changed, 148 insertions(+), 36 deletions(-) diff --git a/drivers/crypto/chelsio/chcr_core.h b/drivers/crypto/chelsio/chcr_core.h index de3a9c0..4616663 100644 --- a/drivers/crypto/chelsio/chcr_core.h +++ b/drivers/crypto/chelsio/chcr_core.h @@ -159,8 +159,17 @@ struct chcr_ipsec_wr { struct chcr_ipsec_req req; }; +#define ESN_IV_INSERT_OFFSET 12 +struct chcr_ipsec_aadiv { + __be32 spi; + u8 seq_no[8]; + u8 iv[8]; +}; + struct ipsec_sa_entry { int hmac_ctrl; + u16 esn; + u16 imm; unsigned int enckey_len; unsigned int kctx_len; unsigned int authsize; diff --git a/drivers/crypto/chelsio/chcr_ipsec.c b/drivers/crypto/chelsio/chcr_ipsec.c index 1ff8738..9321d2b 100644 --- a/drivers/crypto/chelsio/chcr_ipsec.c +++ b/drivers/crypto/chelsio/chcr_ipsec.c @@ -76,12 +76,14 @@ static void chcr_xfrm_del_state(struct xfrm_state *x); static void chcr_xfrm_free_state(struct xfrm_state *x); static bool chcr_ipsec_offload_ok(struct sk_buff *skb, struct xfrm_state *x); +static void chcr_advance_esn_state(struct xfrm_state *x); static const struct xfrmdev_ops chcr_xfrmdev_ops = { .xdo_dev_state_add = chcr_xfrm_add_state, .xdo_dev_state_delete = chcr_xfrm_del_state, .xdo_dev_state_free = chcr_xfrm_free_state, .xdo_dev_offload_ok = chcr_ipsec_offload_ok, + .xdo_dev_state_advance_esn = chcr_advance_esn_state, }; /* Add offload xfrms to Chelsio Interface */ @@ -210,10 +212,6 @@ static int chcr_xfrm_add_state(struct xfrm_state *x) pr_debug("CHCR: Cannot offload compressed xfrm states\n"); return -EINVAL; } - if (x->props.flags & XFRM_STATE_ESN) { - pr_debug("CHCR: Cannot offload ESN xfrm states\n"); - return -EINVAL; - } if (x->props.family != AF_INET && x->props.family != AF_INET6) { pr_debug("CHCR: Only IPv4/6 xfrm state offloaded\n"); @@ -266,6 +264,8 @@ static int chcr_xfrm_add_state(struct xfrm_state *x) } sa_entry->hmac_ctrl = chcr_ipsec_setauthsize(x, sa_entry); + if (x->props.flags & XFRM_STATE_ESN) + sa_entry->esn = 1; chcr_ipsec_setkey(x, sa_entry); x->xso.offload_handle = (unsigned long)sa_entry; try_module_get(THIS_MODULE); @@ -294,31 +294,57 @@ static void chcr_xfrm_free_state(struct xfrm_state *x) static bool chcr_ipsec_offload_ok(struct sk_buff *skb, struct xfrm_state *x) { - /* Offload with IP options is not supported yet */ - if (ip_hdr(skb)->ihl > 5) - return false; - + if (x->props.family == AF_INET) { + /* Offload with IP options is not supported yet */ + if (ip_hdr(skb)->ihl > 5) + return false; + } else { + /* Offload with IPv6 extension headers is not support yet */ + if (ipv6_ext_hdr(ipv6_hdr(skb)->nexthdr)) + return false; + } return true; } -static inline int is_eth_imm(const struct sk_buff *skb, unsigned int kctx_len) +static void chcr_advance_esn_state(struct xfrm_state *x) +{ + /* do nothing */ + if (!x->xso.offload_handle) + return; +} + +static inline int is_eth_imm(const struct sk_buff *skb, +struct ipsec_sa_entry *sa_entry) { + unsigned int kctx_len; int hdrlen; + kctx_len = sa_entry->kctx_len; hdrlen = sizeof(struct fw_ulptx_wr) + sizeof(struct chcr_ipsec_req) + kctx_len; hdrlen += sizeof(struct cpl_tx_pkt); + if (sa_entry->esn) + hdrlen += (DIV_ROUND_UP(sizeof(struct chcr_ipsec_aadiv), 16) + << 4); if (skb->len <= 
MAX_IMM_TX_PKT_LEN - hdrlen) return hdrlen; return 0; } static inline unsigned int calc_tx_sec_flits(const struct sk_buff *skb, -unsigned int kctx_len) +struct ipsec_sa_entry *sa_entry) { + unsigned int kctx_len; unsigned int flits; - int hdrlen = is_eth_imm(skb, kctx_len); + int aadivlen; + int hdrlen; + + kctx_len = sa_entry->kctx_len; + hdrlen = is_eth_imm(skb, sa_entry); + aadivlen = sa_entry->esn ? DIV_ROUND_UP(sizeof(struct chcr_ipsec_aadiv), + 16) : 0; + aadivlen <<= 4; /* If the skb is small enough, we can pump it out as a work
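[To make the aadiv layout from this patch concrete — a hypothetical helper; which xfrm fields feed each member is an assumption here, not taken from the driver:]

/* Illustration only: populating the block that hardware strips after
 * CPL_TX_PKT_XT, per the commit message above. */
static void fill_aadiv(struct chcr_ipsec_aadiv *aadiv,
		       const struct xfrm_state *x,
		       __be64 seq_no, __be64 iv)
{
	aadiv->spi = x->id.spi;				/* SPI of the offloaded SA */
	memcpy(aadiv->seq_no, &seq_no, sizeof(aadiv->seq_no));	/* 64-bit ESN */
	memcpy(aadiv->iv, &iv, sizeof(aadiv->iv));		/* 64-bit IV */
}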
[crypto chcr 1/2] small packet Tx stalls the queue
Immediate packets sent to hardware should include the work request
length in calculating the flits.  A WR occupies one flit and, if not
accounted for, results in an invalid request which stalls the HW queue.

Cc: sta...@vger.kernel.org
Signed-off-by: Atul Gupta
---
 drivers/crypto/chelsio/chcr_ipsec.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/crypto/chelsio/chcr_ipsec.c b/drivers/crypto/chelsio/chcr_ipsec.c
index 461b97e..1ff8738 100644
--- a/drivers/crypto/chelsio/chcr_ipsec.c
+++ b/drivers/crypto/chelsio/chcr_ipsec.c
@@ -303,7 +303,10 @@ static bool chcr_ipsec_offload_ok(struct sk_buff *skb, struct xfrm_state *x)
 
 static inline int is_eth_imm(const struct sk_buff *skb, unsigned int kctx_len)
 {
-	int hdrlen = sizeof(struct chcr_ipsec_req) + kctx_len;
+	int hdrlen;
+
+	hdrlen = sizeof(struct fw_ulptx_wr) +
+		 sizeof(struct chcr_ipsec_req) + kctx_len;
 
 	hdrlen += sizeof(struct cpl_tx_pkt);
 	if (skb->len <= MAX_IMM_TX_PKT_LEN - hdrlen)
-- 
1.8.3.1
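[For context on the arithmetic ("flit" sizing is Chelsio driver convention, not spelled out in the patch): a flit is 8 bytes on this hardware, so the immediate-packet decision boils down to a computation along these lines:]

/* Illustrative only: flits needed when header + payload travel inline
 * ("immediate") in the work request.  Omitting the fw_ulptx_wr header
 * from hdrlen under-counts the budget and produces the malformed WR
 * this patch fixes. */
static unsigned int imm_tx_flits(unsigned int hdrlen, unsigned int payload)
{
	return DIV_ROUND_UP(hdrlen + payload, 8);
}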
Re: [PATCH 0/3] crypto: x86/chacha20 - AVX-512VL block functions
On Tue, Nov 20, 2018 at 05:30:47PM +0100, Martin Willi wrote: > In the quest for pushing the limits of chacha20 encryption for both IPsec > and Wireguard, this small series adds AVX-512VL block functions. The VL > variant works on 256-bit ymm registers, but compared to AVX2 can benefit > from the new instructions. > > Compared to the AVX2 version, these block functions bring an overall > speed improvement across encryption lengths of ~20%. Below the tcrypt > results for additional block sizes in kOps/s, for the current AVX2 > code path, the new AVX-512VL code path and the comparison to Zinc in > AVX2 and AVX-512VL. All numbers from a Xeon Platinum 8168 (2.7GHz). > > These numbers result in a very nice chart, available at: > https://download.strongswan.org/misc/chacha-avx-512vl.svg > > zinc zinc > len avx2 512vl avx2 512vl >8 5719 5672 5468 5612 > 16 5675 5627 5355 5621 > 24 5687 5601 5322 5633 > 32 5667 5622 5244 5564 > 40 5603 5582 5337 5578 > 48 5638 5539 5400 5556 > 56 5624 5566 5375 5482 > 64 5590 5573 5352 5531 > 72 4841 5467 3365 3457 > 80 5316 5761 3310 3381 > 88 4798 5470 3239 3343 > 96 5324 5723 3197 3281 > 104 4819 5460 3155 3232 > 112 5266 5749 3020 3195 > 120 4776 5391 2959 3145 > 128 5291 5723 3398 3489 > 136 4122 4837 3321 3423 > 144 4507 5057 3247 3389 > 152 4139 4815 3233 3329 > 160 4482 5043 3159 3256 > 168 4142 4766 3131 3224 > 176 4506 5028 3073 3162 > 184 4119 4772 3010 3109 > 192 4499 5016 3402 3502 > 200 4127 4766 3329 3448 > 208 4452 5012 3276 3371 > 216 4128 4744 3243 3334 > 224 4484 5008 3203 3298 > 232 4103 4772 3141 3237 > 240 4458 4963 3115 3217 > 248 4121 4751 3085 3177 > 256 4461 4987 3364 4046 > 264 3406 4282 3270 4006 > 272 3408 4287 3207 3961 > 280 3371 4271 3203 3825 > 288 3625 4301 3129 3751 > 296 3402 4283 3093 3688 > 304 3401 4247 3062 3637 > 312 3382 4282 2995 3614 > 320 3611 4279 3305 4070 > 328 3386 4260 3276 3968 > 336 3369 4288 3171 3929 > 344 3389 4289 3134 3847 > 352 3609 4266 3127 3720 > 360 3355 4252 3076 3692 > 368 3387 4264 3048 3650 > 376 3387 4238 2967 3553 > 384 3568 4265 3277 4035 > 392 3369 4262 3299 3973 > 400 3362 4235 3239 3899 > 408 3352 4269 3196 3843 > 416 3585 4243 3127 3736 > 424 3364 4216 3092 3672 > 432 3341 4246 3067 3628 > 440 3353 4235 3018 3593 > 448 3538 4245 3327 4035 > 456 3322 4244 3275 3900 > 464 3340 4237 3212 3880 > 472 3330 4242 3054 3802 > 480 3530 4234 3078 3707 > 488 3337 4228 3094 3664 > 496 3330 4223 3015 3591 > 504 3317 4214 3002 3517 > 512 3531 4197 3339 4016 > 520 2511 3101 2030 2682 > 528 2627 3087 2027 2641 > 536 2508 3102 2001 2601 > 544 2638 3090 1964 2564 > 552 2494 3077 1962 2516 > 560 2625 3064 1941 2515 > 568 2500 3086 1922 2493 > 576 2611 3074 2050 2689 > 584 2482 3062 2041 2680 > 592 2595 3074 2026 2644 > 600 2470 3060 1985 2595 > 608 2581 3039 1961 2555 > 616 2478 3062 1956 2521 > 624 2587 3066 1930 2493 > 632 2457 3053 1923 2486 > 640 2581 3050 2059 2712 > 648 2296 2839 2024 2655 > 656 2389 2845 2019 2642 > 664 2292 2842 2002 2610 > 672 2404 2838 1959 2537 > 680 2273 2827 1956 2527 > 688 2389 2840 1938 2510 > 696 2280 2837 1911 2463 > 704 2370 2819 2055 2702 > 712 2277 2834 2029 2663 > 720 2369 2829 2020 2625 > 728 2255 2820 2001 2600 > 736 2373 2819 1958 2543 > 744 2269 2827 1956 2524 > 752 2364 2817 1937 2492 > 760 2270 2805 1909 2483 > 768 2378 2820 2050 2696 > 776 2053 2700 2002 2643 > 784 2066 2693 1922 2640 > 792 2065 2703 1928 2602 > 800 2138 2706 1962 2535 > 808 2065 2679 1938 2528 > 816 2063 2699 1929 2500 > 824 2053 2676 1915 2468 > 832 2149 2692 2036 2693 > 840 2055 2689 2024 2659 > 
848 2049 2689 2006 2610 > 856 2057 2702 1979 2585 > 864 2144 2703 1960 2547 > 872 2047 2685 1945 2501 > 880 2055 2683 1902 2497 > 888 2060 2689 1897 2478 > 896 2139 2693 2023 2663 > 904 2049 2686 1970 2644 > 912 2055 2688 1925 2621 > 920 2047 2685 1911 2572 > 928 2114 2695 1907 2545 > 936 2055 2681 1927 2492 > 944 2055 2693 1930 2478
Re: [Help] Null pointer exception in scatterwalk_start() in kernel-4.9
On Tue, Nov 20, 2018 at 07:09:53AM +0000, gongchen (E) wrote:
> Hi Dear Herbert,
> 
> 	Sorry to bother you, but we've met a problem in the crypto module;
> would you please kindly help us look into it? Thank you very much.
> 
> 	In the below function chain, scatterwalk_start() doesn't check the
> result of sg_next(), so the kernel will crash if sg_next() returns a null
> pointer, which is our case. (The full stack is at the end of the letter.)
> 
> blkcipher_walk_done() -> scatterwalk_done() -> scatterwalk_pagedone() ->
> scatterwalk_start(walk, sg_next(walk->sg));
> 
> 	Should we add a null-pointer-check in scatterwalk_start()? Or is
> there any process that can ensure there is a valid sg pointer if the
> condition (walk->offset >= walk->sg->offset + walk->sg->length) is true?
> 
> 	We are really looking forward to your reply; any information will
> be appreciated. Thanks again.

Did you apply the following patch?

commit 0868def3e4100591e7a1fdbf3eed1439cc8f7ca3
Author: Eric Biggers
Date:   Mon Jul 23 10:54:57 2018 -0700

    crypto: blkcipher - fix crash flushing dcache in error path

Cheers,
-- 
Email: Herbert Xu
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
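[For reference, the helper under discussion was, approximately, this tiny inline in include/crypto/scatterwalk.h at the time (check your tree), which shows the crash site — sg is dereferenced before any check:]

static inline void scatterwalk_start(struct scatter_walk *walk,
				     struct scatterlist *sg)
{
	walk->sg = sg;
	walk->offset = sg->offset;	/* a NULL sg from sg_next() oopses here */
}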
[PATCH 1/1] cavium: Update firmware for CNN55XX crypto driver
Firmware upgraded to v10.

Signed-off-by: Nagadheeraj Rottela
---
 WHENCE               |   2 +-
 cavium/cnn55xx_se.fw | Bin 27698 -> 35010 bytes
 2 files changed, 1 insertion(+), 1 deletion(-)

diff --git a/WHENCE b/WHENCE
index a188c0d..ed10d5b 100644
--- a/WHENCE
+++ b/WHENCE
@@ -3586,7 +3586,7 @@ Licence: Redistributable. See LICENCE.cavium_liquidio for details
 
 Driver: nitrox -- Cavium CNN55XX crypto driver
 
 File: cavium/cnn55xx_se.fw
-Version: v07
+Version: v10
 
 Licence: Redistributable. See LICENCE.cavium for details
 
diff --git a/cavium/cnn55xx_se.fw b/cavium/cnn55xx_se.fw
index 076e270d383488e7ae67e8f4224a519c4173bedc..bc3c4d070625794b0b5aa61df48e186502a75e1d 100644
GIT binary patch
[binary firmware delta omitted]
Re: [PATCH 0/6] crypto: x86/chacha20 - SIMD performance improvements
Hi Martin,

On Tue, Nov 20, 2018 at 5:29 PM Martin Willi wrote:
> Thanks for the offer, no need at this time. But I certainly would
> welcome if you could do some (Wireguard) benching with that code to see
> if it works for you.

I certainly will test it in a few different network circumstances,
especially since real testing like this is sometimes more telling than
busy-loop benchmarks.

> > Actually, similarly here, a 10nm Cannon Lake machine should be
> > arriving at my house this week, which should make for some
> > interesting testing ground for non-throttled zmm, if you'd like to
> > play with it.
>
> Maybe in a future iteration, thanks. In fact would it be interesting to
> know if Cannon Lake can handle that throttling better.

Everything I've read on the Internet seems to indicate that's the case,
so one of the first things I'll be doing is seeing if that's true.
There are also the AVX512 IFMA instructions to play with!

Jason
[PATCH 3/3] crypto: x86/chacha20 - Add a 4-block AVX-512VL variant
This version uses the same principle as the AVX2 version by scheduling the operations for two block pairs in parallel. It benefits from the AVX-512VL rotate instructions and the more efficient partial block handling using "vmovdqu8", resulting in a speedup of the raw block function of ~20%. Signed-off-by: Martin Willi --- arch/x86/crypto/chacha20-avx512vl-x86_64.S | 272 + arch/x86/crypto/chacha20_glue.c| 7 + 2 files changed, 279 insertions(+) diff --git a/arch/x86/crypto/chacha20-avx512vl-x86_64.S b/arch/x86/crypto/chacha20-avx512vl-x86_64.S index 261097578715..55d34de29e3e 100644 --- a/arch/x86/crypto/chacha20-avx512vl-x86_64.S +++ b/arch/x86/crypto/chacha20-avx512vl-x86_64.S @@ -12,6 +12,11 @@ CTR2BL:.octa 0x .octa 0x0001 +.section .rodata.cst32.CTR4BL, "aM", @progbits, 32 +.align 32 +CTR4BL:.octa 0x0002 + .octa 0x0003 + .section .rodata.cst32.CTR8BL, "aM", @progbits, 32 .align 32 CTR8BL:.octa 0x000300020001 @@ -185,6 +190,273 @@ ENTRY(chacha20_2block_xor_avx512vl) ENDPROC(chacha20_2block_xor_avx512vl) +ENTRY(chacha20_4block_xor_avx512vl) + # %rdi: Input state matrix, s + # %rsi: up to 4 data blocks output, o + # %rdx: up to 4 data blocks input, i + # %rcx: input/output length in bytes + + # This function encrypts four ChaCha20 block by loading the state + # matrix four times across eight AVX registers. It performs matrix + # operations on four words in two matrices in parallel, sequentially + # to the operations on the four words of the other two matrices. The + # required word shuffling has a rather high latency, we can do the + # arithmetic on two matrix-pairs without much slowdown. + + vzeroupper + + # x0..3[0-4] = s0..3 + vbroadcasti128 0x00(%rdi),%ymm0 + vbroadcasti128 0x10(%rdi),%ymm1 + vbroadcasti128 0x20(%rdi),%ymm2 + vbroadcasti128 0x30(%rdi),%ymm3 + + vmovdqa %ymm0,%ymm4 + vmovdqa %ymm1,%ymm5 + vmovdqa %ymm2,%ymm6 + vmovdqa %ymm3,%ymm7 + + vpaddd CTR2BL(%rip),%ymm3,%ymm3 + vpaddd CTR4BL(%rip),%ymm7,%ymm7 + + vmovdqa %ymm0,%ymm11 + vmovdqa %ymm1,%ymm12 + vmovdqa %ymm2,%ymm13 + vmovdqa %ymm3,%ymm14 + vmovdqa %ymm7,%ymm15 + + mov $10,%rax + +.Ldoubleround4: + + # x0 += x1, x3 = rotl32(x3 ^ x0, 16) + vpaddd %ymm1,%ymm0,%ymm0 + vpxord %ymm0,%ymm3,%ymm3 + vprold $16,%ymm3,%ymm3 + + vpaddd %ymm5,%ymm4,%ymm4 + vpxord %ymm4,%ymm7,%ymm7 + vprold $16,%ymm7,%ymm7 + + # x2 += x3, x1 = rotl32(x1 ^ x2, 12) + vpaddd %ymm3,%ymm2,%ymm2 + vpxord %ymm2,%ymm1,%ymm1 + vprold $12,%ymm1,%ymm1 + + vpaddd %ymm7,%ymm6,%ymm6 + vpxord %ymm6,%ymm5,%ymm5 + vprold $12,%ymm5,%ymm5 + + # x0 += x1, x3 = rotl32(x3 ^ x0, 8) + vpaddd %ymm1,%ymm0,%ymm0 + vpxord %ymm0,%ymm3,%ymm3 + vprold $8,%ymm3,%ymm3 + + vpaddd %ymm5,%ymm4,%ymm4 + vpxord %ymm4,%ymm7,%ymm7 + vprold $8,%ymm7,%ymm7 + + # x2 += x3, x1 = rotl32(x1 ^ x2, 7) + vpaddd %ymm3,%ymm2,%ymm2 + vpxord %ymm2,%ymm1,%ymm1 + vprold $7,%ymm1,%ymm1 + + vpaddd %ymm7,%ymm6,%ymm6 + vpxord %ymm6,%ymm5,%ymm5 + vprold $7,%ymm5,%ymm5 + + # x1 = shuffle32(x1, MASK(0, 3, 2, 1)) + vpshufd $0x39,%ymm1,%ymm1 + vpshufd $0x39,%ymm5,%ymm5 + # x2 = shuffle32(x2, MASK(1, 0, 3, 2)) + vpshufd $0x4e,%ymm2,%ymm2 + vpshufd $0x4e,%ymm6,%ymm6 + # x3 = shuffle32(x3, MASK(2, 1, 0, 3)) + vpshufd $0x93,%ymm3,%ymm3 + vpshufd $0x93,%ymm7,%ymm7 + + # x0 += x1, x3 = rotl32(x3 ^ x0, 16) + vpaddd %ymm1,%ymm0,%ymm0 + vpxord %ymm0,%ymm3,%ymm3 + vprold $16,%ymm3,%ymm3 + + vpaddd %ymm5,%ymm4,%ymm4 + vpxord %ymm4,%ymm7,%ymm7 + vprold $16,%ymm7,%ymm7 + + # x2 += x3, x1 = rotl32(x1 ^ x2, 12) + vpaddd %ymm3,%ymm2,%ymm2 + vpxord %ymm2,%ymm1,%ymm1 + vprold $12,%ymm1,%ymm1 + + vpaddd %ymm7,%ymm6,%ymm6 + vpxord 
%ymm6,%ymm5,%ymm5 + vprold $12,%ymm5,%ymm5 + + # x0 += x1, x3 = rotl32(x3 ^ x0, 8) + vpaddd %ymm1,%ymm0,%ymm0 + vpxord %ymm0,%ymm3,%ymm3 + vprold $8,%ymm3,%ymm3 + + vpaddd
[PATCH 0/3] crypto: x86/chacha20 - AVX-512VL block functions
In the quest for pushing the limits of chacha20 encryption for both IPsec and Wireguard, this small series adds AVX-512VL block functions. The VL variant works on 256-bit ymm registers, but compared to AVX2 can benefit from the new instructions. Compared to the AVX2 version, these block functions bring an overall speed improvement across encryption lengths of ~20%. Below the tcrypt results for additional block sizes in kOps/s, for the current AVX2 code path, the new AVX-512VL code path and the comparison to Zinc in AVX2 and AVX-512VL. All numbers from a Xeon Platinum 8168 (2.7GHz). These numbers result in a very nice chart, available at: https://download.strongswan.org/misc/chacha-avx-512vl.svg zinc zinc len avx2 512vl avx2 512vl 8 5719 5672 5468 5612 16 5675 5627 5355 5621 24 5687 5601 5322 5633 32 5667 5622 5244 5564 40 5603 5582 5337 5578 48 5638 5539 5400 5556 56 5624 5566 5375 5482 64 5590 5573 5352 5531 72 4841 5467 3365 3457 80 5316 5761 3310 3381 88 4798 5470 3239 3343 96 5324 5723 3197 3281 104 4819 5460 3155 3232 112 5266 5749 3020 3195 120 4776 5391 2959 3145 128 5291 5723 3398 3489 136 4122 4837 3321 3423 144 4507 5057 3247 3389 152 4139 4815 3233 3329 160 4482 5043 3159 3256 168 4142 4766 3131 3224 176 4506 5028 3073 3162 184 4119 4772 3010 3109 192 4499 5016 3402 3502 200 4127 4766 3329 3448 208 4452 5012 3276 3371 216 4128 4744 3243 3334 224 4484 5008 3203 3298 232 4103 4772 3141 3237 240 4458 4963 3115 3217 248 4121 4751 3085 3177 256 4461 4987 3364 4046 264 3406 4282 3270 4006 272 3408 4287 3207 3961 280 3371 4271 3203 3825 288 3625 4301 3129 3751 296 3402 4283 3093 3688 304 3401 4247 3062 3637 312 3382 4282 2995 3614 320 3611 4279 3305 4070 328 3386 4260 3276 3968 336 3369 4288 3171 3929 344 3389 4289 3134 3847 352 3609 4266 3127 3720 360 3355 4252 3076 3692 368 3387 4264 3048 3650 376 3387 4238 2967 3553 384 3568 4265 3277 4035 392 3369 4262 3299 3973 400 3362 4235 3239 3899 408 3352 4269 3196 3843 416 3585 4243 3127 3736 424 3364 4216 3092 3672 432 3341 4246 3067 3628 440 3353 4235 3018 3593 448 3538 4245 3327 4035 456 3322 4244 3275 3900 464 3340 4237 3212 3880 472 3330 4242 3054 3802 480 3530 4234 3078 3707 488 3337 4228 3094 3664 496 3330 4223 3015 3591 504 3317 4214 3002 3517 512 3531 4197 3339 4016 520 2511 3101 2030 2682 528 2627 3087 2027 2641 536 2508 3102 2001 2601 544 2638 3090 1964 2564 552 2494 3077 1962 2516 560 2625 3064 1941 2515 568 2500 3086 1922 2493 576 2611 3074 2050 2689 584 2482 3062 2041 2680 592 2595 3074 2026 2644 600 2470 3060 1985 2595 608 2581 3039 1961 2555 616 2478 3062 1956 2521 624 2587 3066 1930 2493 632 2457 3053 1923 2486 640 2581 3050 2059 2712 648 2296 2839 2024 2655 656 2389 2845 2019 2642 664 2292 2842 2002 2610 672 2404 2838 1959 2537 680 2273 2827 1956 2527 688 2389 2840 1938 2510 696 2280 2837 1911 2463 704 2370 2819 2055 2702 712 2277 2834 2029 2663 720 2369 2829 2020 2625 728 2255 2820 2001 2600 736 2373 2819 1958 2543 744 2269 2827 1956 2524 752 2364 2817 1937 2492 760 2270 2805 1909 2483 768 2378 2820 2050 2696 776 2053 2700 2002 2643 784 2066 2693 1922 2640 792 2065 2703 1928 2602 800 2138 2706 1962 2535 808 2065 2679 1938 2528 816 2063 2699 1929 2500 824 2053 2676 1915 2468 832 2149 2692 2036 2693 840 2055 2689 2024 2659 848 2049 2689 2006 2610 856 2057 2702 1979 2585 864 2144 2703 1960 2547 872 2047 2685 1945 2501 880 2055 2683 1902 2497 888 2060 2689 1897 2478 896 2139 2693 2023 2663 904 2049 2686 1970 2644 912 2055 2688 1925 2621 920 2047 2685 1911 2572 928 2114 2695 1907 2545 936 2055 2681 1927 2492 944 2055 2693 1930 
2478 952 2042 2688 1909 2471 960 2136 2682 2014 2672 968 2054 2687 1999 2626 976 2040 2682 1982 2598 984 2055 2687 1943 2569 992 2138 2694 1884 2522 1000 2036 2681 1929 2506 1008 2052 2676 1926 2475 1016 2050 2686 1889 2430 1024 2125 2670 2039 2656
[PATCH 2/3] crypto: x86/chacha20 - Add a 2-block AVX-512VL variant
This version uses the same principle as the AVX2 version. It benefits
from the AVX-512VL rotate instructions and the more efficient partial
block handling using "vmovdqu8", resulting in a speedup of ~20%.

Unlike the AVX2 version, it is faster than the single block SSSE3
version when processing a single block, so we engage that function for
(partial) single block lengths as well.

Signed-off-by: Martin Willi
---
 arch/x86/crypto/chacha20-avx512vl-x86_64.S | 171 +
 arch/x86/crypto/chacha20_glue.c            |   7 +
 2 files changed, 178 insertions(+)

diff --git a/arch/x86/crypto/chacha20-avx512vl-x86_64.S b/arch/x86/crypto/chacha20-avx512vl-x86_64.S
index e1877afcaa73..261097578715 100644
--- a/arch/x86/crypto/chacha20-avx512vl-x86_64.S
+++ b/arch/x86/crypto/chacha20-avx512vl-x86_64.S
@@ -7,6 +7,11 @@

 #include <linux/linkage.h>

+.section	.rodata.cst32.CTR2BL, "aM", @progbits, 32
+.align 32
+CTR2BL:	.octa 0x00000000000000000000000000000000
+	.octa 0x00000000000000000000000000000001
+
 .section	.rodata.cst32.CTR8BL, "aM", @progbits, 32
 .align 32
 CTR8BL:	.octa 0x00000003000000020000000100000000
@@ -14,6 +19,172 @@ CTR8BL:	.octa 0x00000003000000020000000100000000
 	.octa 0x00000007000000060000000500000004

 .text

+ENTRY(chacha20_2block_xor_avx512vl)
+	# %rdi: Input state matrix, s
+	# %rsi: up to 2 data blocks output, o
+	# %rdx: up to 2 data blocks input, i
+	# %rcx: input/output length in bytes
+
+	# This function encrypts two ChaCha20 blocks by loading the state
+	# matrix twice across four AVX registers. It performs matrix operations
+	# on four words in each matrix in parallel, but requires shuffling to
+	# rearrange the words after each round.
+
+	vzeroupper
+
+	# x0..3[0-2] = s0..3
+	vbroadcasti128	0x00(%rdi),%ymm0
+	vbroadcasti128	0x10(%rdi),%ymm1
+	vbroadcasti128	0x20(%rdi),%ymm2
+	vbroadcasti128	0x30(%rdi),%ymm3
+
+	vpaddd		CTR2BL(%rip),%ymm3,%ymm3
+
+	vmovdqa		%ymm0,%ymm8
+	vmovdqa		%ymm1,%ymm9
+	vmovdqa		%ymm2,%ymm10
+	vmovdqa		%ymm3,%ymm11
+
+	mov		$10,%rax
+
+.Ldoubleround:
+
+	# x0 += x1, x3 = rotl32(x3 ^ x0, 16)
+	vpaddd		%ymm1,%ymm0,%ymm0
+	vpxord		%ymm0,%ymm3,%ymm3
+	vprold		$16,%ymm3,%ymm3
+
+	# x2 += x3, x1 = rotl32(x1 ^ x2, 12)
+	vpaddd		%ymm3,%ymm2,%ymm2
+	vpxord		%ymm2,%ymm1,%ymm1
+	vprold		$12,%ymm1,%ymm1
+
+	# x0 += x1, x3 = rotl32(x3 ^ x0, 8)
+	vpaddd		%ymm1,%ymm0,%ymm0
+	vpxord		%ymm0,%ymm3,%ymm3
+	vprold		$8,%ymm3,%ymm3
+
+	# x2 += x3, x1 = rotl32(x1 ^ x2, 7)
+	vpaddd		%ymm3,%ymm2,%ymm2
+	vpxord		%ymm2,%ymm1,%ymm1
+	vprold		$7,%ymm1,%ymm1
+
+	# x1 = shuffle32(x1, MASK(0, 3, 2, 1))
+	vpshufd		$0x39,%ymm1,%ymm1
+	# x2 = shuffle32(x2, MASK(1, 0, 3, 2))
+	vpshufd		$0x4e,%ymm2,%ymm2
+	# x3 = shuffle32(x3, MASK(2, 1, 0, 3))
+	vpshufd		$0x93,%ymm3,%ymm3
+
+	# x0 += x1, x3 = rotl32(x3 ^ x0, 16)
+	vpaddd		%ymm1,%ymm0,%ymm0
+	vpxord		%ymm0,%ymm3,%ymm3
+	vprold		$16,%ymm3,%ymm3
+
+	# x2 += x3, x1 = rotl32(x1 ^ x2, 12)
+	vpaddd		%ymm3,%ymm2,%ymm2
+	vpxord		%ymm2,%ymm1,%ymm1
+	vprold		$12,%ymm1,%ymm1
+
+	# x0 += x1, x3 = rotl32(x3 ^ x0, 8)
+	vpaddd		%ymm1,%ymm0,%ymm0
+	vpxord		%ymm0,%ymm3,%ymm3
+	vprold		$8,%ymm3,%ymm3
+
+	# x2 += x3, x1 = rotl32(x1 ^ x2, 7)
+	vpaddd		%ymm3,%ymm2,%ymm2
+	vpxord		%ymm2,%ymm1,%ymm1
+	vprold		$7,%ymm1,%ymm1
+
+	# x1 = shuffle32(x1, MASK(2, 1, 0, 3))
+	vpshufd		$0x93,%ymm1,%ymm1
+	# x2 = shuffle32(x2, MASK(1, 0, 3, 2))
+	vpshufd		$0x4e,%ymm2,%ymm2
+	# x3 = shuffle32(x3, MASK(0, 3, 2, 1))
+	vpshufd		$0x39,%ymm3,%ymm3
+
+	dec		%rax
+	jnz		.Ldoubleround
+
+	# o0 = i0 ^ (x0 + s0)
+	vpaddd		%ymm8,%ymm0,%ymm7
+	cmp		$0x10,%rcx
+	jl		.Lxorpart2
+	vpxord		0x00(%rdx),%xmm7,%xmm6
+	vmovdqu		%xmm6,0x00(%rsi)
+	vextracti128	$1,%ymm7,%xmm0
+	# o1 = i1 ^ (x1 + s1)
+	vpaddd		%ymm9,%ymm1,%ymm7
+	cmp		$0x20,%rcx
+	jl		.Lxorpart2
+	vpxord		0x10(%rdx),%xmm7,%xmm6
+	vmovdqu		%xmm6,0x10(%rsi)
+	vextracti128	$1,%ymm7,%xmm1
+	# o2 = i2 ^ (x2 + s2)
+	vpaddd		%ymm10,%ymm2,%ymm7
+	cmp		$0x30,%rcx
+	jl		.Lxorpart2
+	vpxord		0x20(%rdx),%xmm7,%xmm6
+	vmovdqu		%xmm6,0x20(%rsi)
+	vextracti128	$1,%ymm7,%xmm2
+	# o3 = i3 ^ (x3 +
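The vpaddd/vpxord/vprold triples above are the standard ChaCha20 quarter-round from RFC 7539, applied to whole vectors of state words at once. For readers less fluent in AVX-512 assembly, a minimal scalar C sketch of the operation being vectorized (illustrative only, not part of the patch):

	#include <stdint.h>

	static inline uint32_t rotl32(uint32_t v, int n)
	{
		return (v << n) | (v >> (32 - n));
	}

	/* One ChaCha20 quarter-round; the assembly performs the same
	 * add/xor/rotate steps on vectors of words via vpaddd, vpxord
	 * and vprold. */
	static void quarterround(uint32_t *a, uint32_t *b,
				 uint32_t *c, uint32_t *d)
	{
		*a += *b; *d = rotl32(*d ^ *a, 16);
		*c += *d; *b = rotl32(*b ^ *c, 12);
		*a += *b; *d = rotl32(*d ^ *a, 8);
		*c += *d; *b = rotl32(*b ^ *c, 7);
	}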
[PATCH 1/3] crypto: x86/chacha20 - Add an 8-block AVX-512VL variant
This variant is similar to the AVX2 version, but benefits from the
AVX-512 rotate instructions and the additional registers, so it can
operate without any data on the stack. It uses ymm registers only to
avoid the massive core throttling on Skylake-X platforms. Nonetheless
it brings a ~30% speed improvement compared to the AVX2 variant for
random encryption lengths.

The AVX2 version uses "rep movsb" for partial block XORing via the
stack. With AVX-512, the new "vmovdqu8" can do this much more
efficiently. The associated "kmov" instructions needed to work with
dynamic masks are not part of the AVX-512VL instruction set, hence we
depend on AVX-512BW as well. Given that the major AVX-512VL
architectures provide AVX-512BW and this extension does not affect
core clocking, this seems to be no problem, at least for now.

Signed-off-by: Martin Willi
---
 arch/x86/crypto/Makefile                   |   5 +
 arch/x86/crypto/chacha20-avx512vl-x86_64.S | 396 +
 arch/x86/crypto/chacha20_glue.c            |  26 ++
 3 files changed, 427 insertions(+)
 create mode 100644 arch/x86/crypto/chacha20-avx512vl-x86_64.S

diff --git a/arch/x86/crypto/Makefile b/arch/x86/crypto/Makefile
index a4b0007a54e1..ce4e43642984 100644
--- a/arch/x86/crypto/Makefile
+++ b/arch/x86/crypto/Makefile
@@ -8,6 +8,7 @@ OBJECT_FILES_NON_STANDARD := y

 avx_supported := $(call as-instr,vpxor %xmm0$(comma)%xmm0$(comma)%xmm0,yes,no)
 avx2_supported := $(call as-instr,vpgatherdd %ymm0$(comma)(%eax$(comma)%ymm1\
			$(comma)4)$(comma)%ymm2,yes,no)
+avx512_supported :=$(call as-instr,vpmovm2b %k1$(comma)%zmm5,yes,no)
 sha1_ni_supported :=$(call as-instr,sha1msg1 %xmm0$(comma)%xmm1,yes,no)
 sha256_ni_supported :=$(call as-instr,sha256msg1 %xmm0$(comma)%xmm1,yes,no)

@@ -103,6 +104,10 @@ ifeq ($(avx2_supported),yes)
	morus1280-avx2-y := morus1280-avx2-asm.o morus1280-avx2-glue.o
 endif

+ifeq ($(avx512_supported),yes)
+	chacha20-x86_64-y += chacha20-avx512vl-x86_64.o
+endif
+
 aesni-intel-y := aesni-intel_asm.o aesni-intel_glue.o
 aesni-intel-$(CONFIG_64BIT) += aesni-intel_avx-x86_64.o aes_ctrby8_avx-x86_64.o
 ghash-clmulni-intel-y := ghash-clmulni-intel_asm.o ghash-clmulni-intel_glue.o

diff --git a/arch/x86/crypto/chacha20-avx512vl-x86_64.S b/arch/x86/crypto/chacha20-avx512vl-x86_64.S
new file mode 100644
index 000000000000..e1877afcaa73
--- /dev/null
+++ b/arch/x86/crypto/chacha20-avx512vl-x86_64.S
@@ -0,0 +1,396 @@
+/* SPDX-License-Identifier: GPL-2.0+ */
+/*
+ * ChaCha20 256-bit cipher algorithm, RFC7539, x64 AVX-512VL functions
+ *
+ * Copyright (C) 2018 Martin Willi
+ */
+
+#include <linux/linkage.h>
+
+.section	.rodata.cst32.CTR8BL, "aM", @progbits, 32
+.align 32
+CTR8BL:	.octa 0x00000003000000020000000100000000
+	.octa 0x00000007000000060000000500000004
+
+.text
+
+ENTRY(chacha20_8block_xor_avx512vl)
+	# %rdi: Input state matrix, s
+	# %rsi: up to 8 data blocks output, o
+	# %rdx: up to 8 data blocks input, i
+	# %rcx: input/output length in bytes
+
+	# This function encrypts eight consecutive ChaCha20 blocks by loading
+	# the state matrix in AVX registers eight times. Compared to AVX2, this
+	# mostly benefits from the new rotate instructions in VL and the
+	# additional registers.
+
+	vzeroupper
+
+	# x0..15[0-7] = s[0..15]
+	vpbroadcastd	0x00(%rdi),%ymm0
+	vpbroadcastd	0x04(%rdi),%ymm1
+	vpbroadcastd	0x08(%rdi),%ymm2
+	vpbroadcastd	0x0c(%rdi),%ymm3
+	vpbroadcastd	0x10(%rdi),%ymm4
+	vpbroadcastd	0x14(%rdi),%ymm5
+	vpbroadcastd	0x18(%rdi),%ymm6
+	vpbroadcastd	0x1c(%rdi),%ymm7
+	vpbroadcastd	0x20(%rdi),%ymm8
+	vpbroadcastd	0x24(%rdi),%ymm9
+	vpbroadcastd	0x28(%rdi),%ymm10
+	vpbroadcastd	0x2c(%rdi),%ymm11
+	vpbroadcastd	0x30(%rdi),%ymm12
+	vpbroadcastd	0x34(%rdi),%ymm13
+	vpbroadcastd	0x38(%rdi),%ymm14
+	vpbroadcastd	0x3c(%rdi),%ymm15
+
+	# x12 += counter values 0-3
+	vpaddd		CTR8BL(%rip),%ymm12,%ymm12
+
+	vmovdqa64	%ymm0,%ymm16
+	vmovdqa64	%ymm1,%ymm17
+	vmovdqa64	%ymm2,%ymm18
+	vmovdqa64	%ymm3,%ymm19
+	vmovdqa64	%ymm4,%ymm20
+	vmovdqa64	%ymm5,%ymm21
+	vmovdqa64	%ymm6,%ymm22
+	vmovdqa64	%ymm7,%ymm23
+	vmovdqa64	%ymm8,%ymm24
+	vmovdqa64	%ymm9,%ymm25
+	vmovdqa64	%ymm10,%ymm26
+	vmovdqa64	%ymm11,%ymm27
+	vmovdqa64	%ymm12,%ymm28
+	vmovdqa64	%ymm13,%ymm29
+	vmovdqa64	%ymm14,%ymm30
+	vmovdqa64	%ymm15,%ymm31
+
+	mov		$10,%eax
+
+.Ldoubleround8:
+	# x0 += x4, x12 = rotl32(x12 ^ x0, 16)
+	vpaddd		%ymm0,%ymm4,%ymm0
+	vpxord		%ymm0,%ymm12,%ymm12
+	vprold		$16,%ymm12,%ymm12
+	# x1 += x5, x13 = rotl32(x13 ^ x1, 16)
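The glue-code half of the patch is not shown above; conceptually it feeds the widest function as many blocks as possible and lets the assembly mask off the tail with vmovdqu8. A hedged C sketch of that dispatch, with names modeled on the existing chacha20_glue.c conventions (the exact patch may differ):

	/* Sketch only: process full 8-block chunks, then hand any
	 * remainder (including a partial block) to the same routine,
	 * which masks the tail store itself. */
	static void chacha20_dosimd(u32 *state, u8 *dst, const u8 *src,
				    unsigned int bytes)
	{
		const unsigned int chunk = CHACHA20_BLOCK_SIZE * 8;

		while (bytes >= chunk) {
			chacha20_8block_xor_avx512vl(state, dst, src, chunk);
			bytes -= chunk;
			src += chunk;
			dst += chunk;
			state[12] += 8;		/* advance block counter */
		}
		if (bytes) {
			chacha20_8block_xor_avx512vl(state, dst, src, bytes);
			state[12] += DIV_ROUND_UP(bytes, CHACHA20_BLOCK_SIZE);
		}
	}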
Re: [PATCH 0/6] crypto: x86/chacha20 - SIMD performance improvements
Hi Jason,

> [...] I have a massive Xeon Gold 5120 machine that I can give you
> access to if you'd like to do some testing and benching.

Thanks for the offer, no need at this time. But I would certainly welcome it if you could do some (WireGuard) benching with that code to see if it works for you.

> Actually, similarly here, a 10nm Cannon Lake machine should be
> arriving at my house this week, which should make for some
> interesting testing ground for non-throttled zmm, if you'd like to
> play with it.

Maybe in a future iteration, thanks. In fact, it would be interesting to know whether Cannon Lake handles that throttling better.

Regards
Martin
[Help] Null pointer exception in scatterwalk_start() in kernel-4.9
Hi Herbert,

Sorry to bother you, but we have hit a problem in the crypto module; could you please help us look into it? Thank you very much.

In the call chain below, scatterwalk_start() does not check the result of sg_next(), so the kernel crashes if sg_next() returns a NULL pointer, which is what happens in our case (the full stack trace is at the end of this mail):

blkcipher_walk_done() -> scatterwalk_done() -> scatterwalk_pagedone() -> scatterwalk_start(walk, sg_next(walk->sg));

Should we add a NULL-pointer check in scatterwalk_start()? Or is there some mechanism that ensures a valid sg pointer whenever the condition (walk->offset >= walk->sg->offset + walk->sg->length) is true?

We are really looking forward to your reply; any information will be appreciated. Thanks again.

Best regards
Chen Gong
2018.11.20

---
Full Stack:
<1>[395491.178009s][pid:29501,cpu4,Binder:708_A]Unable to handle kernel NULL pointer dereference at virtual address 0008
<1>[395491.178039s][pid:29501,cpu4,Binder:708_A]pgd = ffc112c27000
<1>[395491.178039s][pid:29501,cpu4,Binder:708_A][0008] *pgd=, *pud=
<0>[395491.178070s][pid:29501,cpu4,Binder:708_A]Internal error: Oops: 9605 [#1] PREEMPT SMP
<4>[395491.178070s][pid:29501,cpu4,Binder:708_A]Modules linked in: hisi_dummy_ko
<4>[395491.178100s][pid:29501,cpu4,Binder:708_A]CPU: 4 PID: 29501 Comm: Binder:708_A VIP: 00 Tainted: GW 4.9.111 #1
<4>[395491.178100s][pid:29501,cpu4,Binder:708_A]TGID: 708 Comm: Binder:708_2
<4>[395491.178100s][pid:29501,cpu4,Binder:708_A]Hardware name: hi3660 (DT)
<4>[395491.178100s][pid:29501,cpu4,Binder:708_A]task: ffc1d43ec880 task.stack: ffc3007e
<4>[395491.178100s][pid:29501,cpu4,Binder:708_A]PC is at blkcipher_walk_done+0x210/0x354
<4>[395491.178131s][pid:29501,cpu4,Binder:708_A]LR is at blkcipher_walk_done+0x20c/0x354
<4>[395491.178131s][pid:29501,cpu4,Binder:708_A]pc : [] lr : [] pstate: 6145
<4>[395491.178131s][pid:29501,cpu4,Binder:708_A]sp : ffc3007e3950
<4>[395491.178131s][pid:29501,cpu4,Binder:708_A]x29: ffc3007e3950 x28:
<4>[395491.178161s][pid:29501,cpu4,Binder:708_A]x27: ffc1c6ef501e x26: 0100
<4>[395491.178161s][pid:29501,cpu4,Binder:708_A]x25: ffc3007e3b40 x24: ffc3007e3be8
<4>[395491.178161s][pid:29501,cpu4,Binder:708_A]x23: 0001 x22: 0500
<4>[395491.178161s][pid:29501,cpu4,Binder:708_A]x21: ffc3007e3a90 x20: ffc3007e3a10
<4>[395491.178192s][pid:29501,cpu4,Binder:708_A]x19: ffc3007e39d8 x18: 0001
<4>[395491.178192s][pid:29501,cpu4,Binder:708_A]x17: 0075aca06934 x16: ff9c1b032d10
<4>[395491.178192s][pid:29501,cpu4,Binder:708_A]x15: 0075aaffe5b8 x14:
<4>[395491.178222s][pid:29501,cpu4,Binder:708_A]x13: 0075ac08642d x12: 0001
<4>[395491.178222s][pid:29501,cpu4,Binder:708_A]x11: x10: ffc3175e1680
<4>[395491.178222s][pid:29501,cpu4,Binder:708_A]x9 : ff9c1d408000 x8 :
<4>[395491.178253s][pid:29501,cpu4,Binder:708_A]x7 : ff9c1c28 x6 : 0001
<4>[395491.178253s][pid:29501,cpu4,Binder:708_A]x5 : ffc3007e3be8 x4 :
<4>[395491.178253s][pid:29501,cpu4,Binder:708_A]x3 : 0100 x2 : 0500
<4>[395491.178253s][pid:29501,cpu4,Binder:708_A]x1 : ffc31aa934c2 x0 :
<4>[395491.180725s][pid:29501,cpu4,Binder:708_A][] blkcipher_walk_done+0x210/0x354
<4>[395491.180755s][pid:29501,cpu4,Binder:708_A][] cbc_decrypt+0xa0/0xe8
<4>[395491.180755s][pid:29501,cpu4,Binder:708_A][] ablk_decrypt+0x78/0xf4
<4>[395491.180755s][pid:29501,cpu4,Binder:708_A][] skcipher_decrypt_ablkcipher+0x70/0x80
<4>[395491.180786s][pid:29501,cpu4,Binder:708_A][] crypto_cts_decrypt+0xf0/0x184
<4>[395491.180786s][pid:29501,cpu4,Binder:708_A][] fname_decrypt.isra.1+0x110/0x1d8
<4>[395491.180786s][pid:29501,cpu4,Binder:708_A][] fscrypt_fname_disk_to_usr+0x1d8/0x264
<4>[395491.180816s][pid:29501,cpu4,Binder:708_A][] f2fs_fill_dentries+0x13c/0x1d4
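For reference, the guard being asked about would look roughly like the sketch below (against the 4.9-era scatterwalk API; whether bailing out here is safe, or whether the real bug is a malformed scatterlist further up the stack, is exactly the open question):

	/* Sketch only: check the sg_next() result before
	 * scatterwalk_start() dereferences it. */
	static inline void scatterwalk_start(struct scatter_walk *walk,
					     struct scatterlist *sg)
	{
		if (WARN_ON_ONCE(!sg))
			return;		/* walk is left untouched */
		walk->sg = sg;
		walk->offset = sg->offset;
	}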
Re: [PATCH] crypto: drop mask=CRYPTO_ALG_ASYNC from 'shash' tfm allocations
On Wed, Nov 14, 2018 at 12:21:11PM -0800, Eric Biggers wrote: > From: Eric Biggers > > 'shash' algorithms are always synchronous, so passing CRYPTO_ALG_ASYNC > in the mask to crypto_alloc_shash() has no effect. Many users therefore > already don't pass it, but some still do. This inconsistency can cause > confusion, especially since the way the 'mask' argument works is > somewhat counterintuitive. > > Thus, just remove the unneeded CRYPTO_ALG_ASYNC flags. > > This patch shouldn't change any actual behavior. > > Signed-off-by: Eric Biggers > --- > drivers/block/drbd/drbd_receiver.c | 2 +- > drivers/md/dm-integrity.c | 2 +- > drivers/net/wireless/intersil/orinoco/mic.c | 6 ++ > fs/ubifs/auth.c | 5 ++--- > net/bluetooth/smp.c | 2 +- > security/apparmor/crypto.c | 2 +- > security/integrity/evm/evm_crypto.c | 3 +-- > security/keys/encrypted-keys/encrypted.c| 4 ++-- > security/keys/trusted.c | 4 ++-- > 9 files changed, 13 insertions(+), 17 deletions(-) Patch applied. Thanks. -- Email: Herbert Xu Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
Re: [PATCH] crypto: drop mask=CRYPTO_ALG_ASYNC from 'cipher' tfm allocations
On Wed, Nov 14, 2018 at 12:19:39PM -0800, Eric Biggers wrote: > From: Eric Biggers > > 'cipher' algorithms (single block ciphers) are always synchronous, so > passing CRYPTO_ALG_ASYNC in the mask to crypto_alloc_cipher() has no > effect. Many users therefore already don't pass it, but some still do. > This inconsistency can cause confusion, especially since the way the > 'mask' argument works is somewhat counterintuitive. > > Thus, just remove the unneeded CRYPTO_ALG_ASYNC flags. > > This patch shouldn't change any actual behavior. > > Signed-off-by: Eric Biggers > --- > arch/s390/crypto/aes_s390.c | 2 +- > drivers/crypto/amcc/crypto4xx_alg.c | 3 +-- > drivers/crypto/ccp/ccp-crypto-aes-cmac.c | 4 +--- > drivers/crypto/geode-aes.c| 2 +- > drivers/md/dm-crypt.c | 2 +- > drivers/net/wireless/cisco/airo.c | 2 +- > drivers/staging/rtl8192e/rtllib_crypt_ccmp.c | 2 +- > drivers/staging/rtl8192u/ieee80211/ieee80211_crypt_ccmp.c | 2 +- > drivers/usb/wusbcore/crypto.c | 2 +- > net/bluetooth/smp.c | 6 +++--- > net/mac80211/wep.c| 4 ++-- > net/wireless/lib80211_crypt_ccmp.c| 2 +- > net/wireless/lib80211_crypt_tkip.c| 4 ++-- > net/wireless/lib80211_crypt_wep.c | 4 ++-- > 14 files changed, 19 insertions(+), 22 deletions(-) Patch applied. Thanks. -- Email: Herbert Xu Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
Re: [PATCH] crypto: remove useless initializations of cra_list
On Wed, Nov 14, 2018 at 11:35:48AM -0800, Eric Biggers wrote: > From: Eric Biggers > > Some algorithms initialize their .cra_list prior to registration. > But this is unnecessary since crypto_register_alg() will overwrite > .cra_list when adding the algorithm to the 'crypto_alg_list'. > Apparently the useless assignment has just been copy+pasted around. > > So, remove the useless assignments. > > Exception: paes_s390.c uses cra_list to check whether the algorithm is > registered or not, so I left that as-is for now. > > This patch shouldn't change any actual behavior. > > Signed-off-by: Eric Biggers > --- > arch/sparc/crypto/aes_glue.c | 5 - > arch/sparc/crypto/camellia_glue.c | 5 - > arch/sparc/crypto/des_glue.c | 5 - > crypto/lz4.c | 1 - > crypto/lz4hc.c| 1 - > drivers/crypto/bcm/cipher.c | 2 -- > drivers/crypto/omap-aes.c | 2 -- > drivers/crypto/omap-des.c | 1 - > drivers/crypto/qce/ablkcipher.c | 1 - > drivers/crypto/qce/sha.c | 1 - > drivers/crypto/sahara.c | 1 - > 11 files changed, 25 deletions(-) Patch applied. Thanks. -- Email: Herbert Xu Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
Re: [PATCH] crypto: inside-secure - remove useless setting of type flags
On Wed, Nov 14, 2018 at 11:10:53AM -0800, Eric Biggers wrote: > From: Eric Biggers > > Remove the unnecessary setting of CRYPTO_ALG_TYPE_SKCIPHER. > Commit 2c95e6d97892 ("crypto: skcipher - remove useless setting of type > flags") took care of this everywhere else, but a few more instances made > it into the tree at about the same time. Squash them before they get > copy+pasted around again. > > This patch shouldn't change any actual behavior. > > Signed-off-by: Eric Biggers > --- > drivers/crypto/inside-secure/safexcel_cipher.c | 8 > 1 file changed, 4 insertions(+), 4 deletions(-) Patch applied. Thanks. -- Email: Herbert Xu Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
Re: [PATCH 0/6] crypto: x86/chacha20 - SIMD performance improvements
Hi Martin, On Mon, Nov 19, 2018 at 8:52 AM Martin Willi wrote: > > Adding AVX-512VL support is relatively simple. I have a patchset mostly > ready that is more than competitive with the code from Zinc. I'll clean > that up and do more testing before posting it later this week. Terrific. Depending on how it turns out, it'll be nice to try integrating this into Zinc. I have a massive Xeon Gold 5120 machine that I can give you access to if you'd like to do some testing and benching. Poke me on IRC -- I'm zx2c4. > I don't think that having AVX-512F is that important until it is really > usable on CPUs in the market. Actually, similarly here, a 10nm Cannon Lake machine should be arriving at my house this week, which should make for some interesting testing ground for non-throttled zmm, if you'd like to play with it. Jason
Re: [RFCv3 PATCH 1/6] uacce: Add documents for WarpDrive/uacce
On Mon, Nov 19, 2018 at 05:19:10PM +0800, Kenneth Lee wrote: > On Mon, Nov 19, 2018 at 05:14:05PM +0800, Kenneth Lee wrote: > > Date: Mon, 19 Nov 2018 17:14:05 +0800 > > From: Kenneth Lee > > To: Leon Romanovsky > > CC: Tim Sell , linux-...@vger.kernel.org, > > Alexander Shishkin , Zaibo Xu > > , zhangfei@foxmail.com, linux...@huawei.com, > > haojian.zhu...@linaro.org, Christoph Lameter , Hao Fang > > , Gavin Schenk , RDMA mailing > > list , Vinod Koul , Jason > > Gunthorpe , Doug Ledford , Uwe > > Kleine-König , David Kershner > > , Kenneth Lee , Johan > > Hovold , Cyrille Pitchen > > , Sagar Dharia > > , Jens Axboe , > > guodong...@linaro.org, linux-netdev , Randy Dunlap > > , linux-ker...@vger.kernel.org, Zhou Wang > > , linux-crypto@vger.kernel.org, Philippe > > Ombredanne , Sanyog Kale , > > "David S. Miller" , > > linux-accelerat...@lists.ozlabs.org > > Subject: Re: [RFCv3 PATCH 1/6] uacce: Add documents for WarpDrive/uacce > > User-Agent: Mutt/1.5.21 (2010-09-15) > > Message-ID: <20181119091405.GE157308@Turing-Arch-b> > > > > On Thu, Nov 15, 2018 at 04:54:55PM +0200, Leon Romanovsky wrote: > > > Date: Thu, 15 Nov 2018 16:54:55 +0200 > > > From: Leon Romanovsky > > > To: Kenneth Lee > > > CC: Kenneth Lee , Tim Sell , > > > linux-...@vger.kernel.org, Alexander Shishkin > > > , Zaibo Xu , > > > zhangfei@foxmail.com, linux...@huawei.com, haojian.zhu...@linaro.org, > > > Christoph Lameter , Hao Fang , > > > Gavin > > > Schenk , RDMA mailing list > > > , Zhou Wang , Jason > > > Gunthorpe , Doug Ledford , Uwe > > > Kleine-König , David Kershner > > > , Johan Hovold , Cyrille > > > Pitchen , Sagar Dharia > > > , Jens Axboe , > > > guodong...@linaro.org, linux-netdev , Randy > > > Dunlap > > > , linux-ker...@vger.kernel.org, Vinod Koul > > > , linux-crypto@vger.kernel.org, Philippe Ombredanne > > > , Sanyog Kale , "David S. > > > Miller" , linux-accelerat...@lists.ozlabs.org > > > Subject: Re: [RFCv3 PATCH 1/6] uacce: Add documents for WarpDrive/uacce > > > User-Agent: Mutt/1.10.1 (2018-07-13) > > > Message-ID: <20181115145455.gn3...@mtr-leonro.mtl.com> > > > > > > On Thu, Nov 15, 2018 at 04:51:09PM +0800, Kenneth Lee wrote: > > > > On Wed, Nov 14, 2018 at 06:00:17PM +0200, Leon Romanovsky wrote: > > > > > Date: Wed, 14 Nov 2018 18:00:17 +0200 > > > > > From: Leon Romanovsky > > > > > To: Kenneth Lee > > > > > CC: Tim Sell , linux-...@vger.kernel.org, > > > > > Alexander Shishkin , Zaibo Xu > > > > > , zhangfei@foxmail.com, linux...@huawei.com, > > > > > haojian.zhu...@linaro.org, Christoph Lameter , Hao > > > > > Fang > > > > > , Gavin Schenk , RDMA > > > > > mailing > > > > > list , Zhou Wang > > > > > , > > > > > Jason Gunthorpe , Doug Ledford , > > > > > Uwe > > > > > Kleine-König , David Kershner > > > > > , Johan Hovold , Cyrille > > > > > Pitchen , Sagar Dharia > > > > > , Jens Axboe , > > > > > guodong...@linaro.org, linux-netdev , Randy > > > > > Dunlap > > > > > , linux-ker...@vger.kernel.org, Vinod Koul > > > > > , linux-crypto@vger.kernel.org, Philippe Ombredanne > > > > > , Sanyog Kale , > > > > > Kenneth Lee > > > > > , "David S. 
Miller" , > > > > > linux-accelerat...@lists.ozlabs.org > > > > > Subject: Re: [RFCv3 PATCH 1/6] uacce: Add documents for > > > > > WarpDrive/uacce > > > > > User-Agent: Mutt/1.10.1 (2018-07-13) > > > > > Message-ID: <20181114160017.gi3...@mtr-leonro.mtl.com> > > > > > > > > > > On Wed, Nov 14, 2018 at 10:58:09AM +0800, Kenneth Lee wrote: > > > > > > > > > > > > 在 2018/11/13 上午8:23, Leon Romanovsky 写道: > > > > > > > On Mon, Nov 12, 2018 at 03:58:02PM +0800, Kenneth Lee wrote: > > > > > > > > From: Kenneth Lee > > > > > > > > > > > > > > > > WarpDrive is a general accelerator framework for the user > > > > > > > > application to > > > > > > > > access the hardware without going through the kernel in data > > > > > > > > path. > > > > > > > > > > > > > > > > The kernel component to provide kernel facility to driver for > > > > > > > > expose the > > > > > > > > user interface is called uacce. It a short name for > > > > > > > > "Unified/User-space-access-intended Accelerator Framework". > > > > > > > > > > > > > > > > This patch add document to explain how it works. > > > > > > > + RDMA and netdev folks > > > > > > > > > > > > > > Sorry, to be late in the game, I don't see other patches, but from > > > > > > > the description below it seems like you are reinventing RDMA verbs > > > > > > > model. I have hard time to see the differences in the proposed > > > > > > > framework to already implemented in drivers/infiniband/* for the > > > > > > > kernel > > > > > > > space and for the https://github.com/linux-rdma/rdma-core/ for > > > > > > > the user > > > > > > > space parts. > > > > > > > > > > > > Thanks Leon, > > > > > > > > > > > > Yes, we tried to solve similar problem in RDMA. We also learned a > > > > > > lot from > > > > > > the exist code of RDMA. But we we have to make a new one
Re: [RFCv3 PATCH 1/6] uacce: Add documents for WarpDrive/uacce
On Mon, Nov 19, 2018 at 05:14:05PM +0800, Kenneth Lee wrote: > Date: Mon, 19 Nov 2018 17:14:05 +0800 > From: Kenneth Lee > To: Leon Romanovsky > CC: Tim Sell , linux-...@vger.kernel.org, > Alexander Shishkin , Zaibo Xu > , zhangfei@foxmail.com, linux...@huawei.com, > haojian.zhu...@linaro.org, Christoph Lameter , Hao Fang > , Gavin Schenk , RDMA mailing > list , Vinod Koul , Jason > Gunthorpe , Doug Ledford , Uwe > Kleine-König , David Kershner > , Kenneth Lee , Johan > Hovold , Cyrille Pitchen > , Sagar Dharia > , Jens Axboe , > guodong...@linaro.org, linux-netdev , Randy Dunlap > , linux-ker...@vger.kernel.org, Zhou Wang > , linux-crypto@vger.kernel.org, Philippe > Ombredanne , Sanyog Kale , > "David S. Miller" , > linux-accelerat...@lists.ozlabs.org > Subject: Re: [RFCv3 PATCH 1/6] uacce: Add documents for WarpDrive/uacce > User-Agent: Mutt/1.5.21 (2010-09-15) > Message-ID: <20181119091405.GE157308@Turing-Arch-b> > > On Thu, Nov 15, 2018 at 04:54:55PM +0200, Leon Romanovsky wrote: > > Date: Thu, 15 Nov 2018 16:54:55 +0200 > > From: Leon Romanovsky > > To: Kenneth Lee > > CC: Kenneth Lee , Tim Sell , > > linux-...@vger.kernel.org, Alexander Shishkin > > , Zaibo Xu , > > zhangfei@foxmail.com, linux...@huawei.com, haojian.zhu...@linaro.org, > > Christoph Lameter , Hao Fang , Gavin > > Schenk , RDMA mailing list > > , Zhou Wang , Jason > > Gunthorpe , Doug Ledford , Uwe > > Kleine-König , David Kershner > > , Johan Hovold , Cyrille > > Pitchen , Sagar Dharia > > , Jens Axboe , > > guodong...@linaro.org, linux-netdev , Randy Dunlap > > , linux-ker...@vger.kernel.org, Vinod Koul > > , linux-crypto@vger.kernel.org, Philippe Ombredanne > > , Sanyog Kale , "David S. > > Miller" , linux-accelerat...@lists.ozlabs.org > > Subject: Re: [RFCv3 PATCH 1/6] uacce: Add documents for WarpDrive/uacce > > User-Agent: Mutt/1.10.1 (2018-07-13) > > Message-ID: <20181115145455.gn3...@mtr-leonro.mtl.com> > > > > On Thu, Nov 15, 2018 at 04:51:09PM +0800, Kenneth Lee wrote: > > > On Wed, Nov 14, 2018 at 06:00:17PM +0200, Leon Romanovsky wrote: > > > > Date: Wed, 14 Nov 2018 18:00:17 +0200 > > > > From: Leon Romanovsky > > > > To: Kenneth Lee > > > > CC: Tim Sell , linux-...@vger.kernel.org, > > > > Alexander Shishkin , Zaibo Xu > > > > , zhangfei@foxmail.com, linux...@huawei.com, > > > > haojian.zhu...@linaro.org, Christoph Lameter , Hao Fang > > > > , Gavin Schenk , RDMA > > > > mailing > > > > list , Zhou Wang , > > > > Jason Gunthorpe , Doug Ledford , > > > > Uwe > > > > Kleine-König , David Kershner > > > > , Johan Hovold , Cyrille > > > > Pitchen , Sagar Dharia > > > > , Jens Axboe , > > > > guodong...@linaro.org, linux-netdev , Randy > > > > Dunlap > > > > , linux-ker...@vger.kernel.org, Vinod Koul > > > > , linux-crypto@vger.kernel.org, Philippe Ombredanne > > > > , Sanyog Kale , Kenneth > > > > Lee > > > > , "David S. Miller" , > > > > linux-accelerat...@lists.ozlabs.org > > > > Subject: Re: [RFCv3 PATCH 1/6] uacce: Add documents for WarpDrive/uacce > > > > User-Agent: Mutt/1.10.1 (2018-07-13) > > > > Message-ID: <20181114160017.gi3...@mtr-leonro.mtl.com> > > > > > > > > On Wed, Nov 14, 2018 at 10:58:09AM +0800, Kenneth Lee wrote: > > > > > > > > > > 在 2018/11/13 上午8:23, Leon Romanovsky 写道: > > > > > > On Mon, Nov 12, 2018 at 03:58:02PM +0800, Kenneth Lee wrote: > > > > > > > From: Kenneth Lee > > > > > > > > > > > > > > WarpDrive is a general accelerator framework for the user > > > > > > > application to > > > > > > > access the hardware without going through the kernel in data path. 
> > > > > > > > > > > > > > The kernel component to provide kernel facility to driver for > > > > > > > expose the > > > > > > > user interface is called uacce. It a short name for > > > > > > > "Unified/User-space-access-intended Accelerator Framework". > > > > > > > > > > > > > > This patch add document to explain how it works. > > > > > > + RDMA and netdev folks > > > > > > > > > > > > Sorry, to be late in the game, I don't see other patches, but from > > > > > > the description below it seems like you are reinventing RDMA verbs > > > > > > model. I have hard time to see the differences in the proposed > > > > > > framework to already implemented in drivers/infiniband/* for the > > > > > > kernel > > > > > > space and for the https://github.com/linux-rdma/rdma-core/ for the > > > > > > user > > > > > > space parts. > > > > > > > > > > Thanks Leon, > > > > > > > > > > Yes, we tried to solve similar problem in RDMA. We also learned a lot > > > > > from > > > > > the exist code of RDMA. But we we have to make a new one because we > > > > > cannot > > > > > register accelerators such as AI operation, encryption or compression > > > > > to the > > > > > RDMA framework:) > > > > > > > > Assuming that you did everything right and still failed to use RDMA > > > > framework, you was supposed to fix it and not to reinvent new exactly > > > > same one. It is how we
Re: [RFCv3 PATCH 1/6] uacce: Add documents for WarpDrive/uacce
On Thu, Nov 15, 2018 at 04:54:55PM +0200, Leon Romanovsky wrote: > Date: Thu, 15 Nov 2018 16:54:55 +0200 > From: Leon Romanovsky > To: Kenneth Lee > CC: Kenneth Lee , Tim Sell , > linux-...@vger.kernel.org, Alexander Shishkin > , Zaibo Xu , > zhangfei@foxmail.com, linux...@huawei.com, haojian.zhu...@linaro.org, > Christoph Lameter , Hao Fang , Gavin > Schenk , RDMA mailing list > , Zhou Wang , Jason > Gunthorpe , Doug Ledford , Uwe > Kleine-König , David Kershner > , Johan Hovold , Cyrille > Pitchen , Sagar Dharia > , Jens Axboe , > guodong...@linaro.org, linux-netdev , Randy Dunlap > , linux-ker...@vger.kernel.org, Vinod Koul > , linux-crypto@vger.kernel.org, Philippe Ombredanne > , Sanyog Kale , "David S. > Miller" , linux-accelerat...@lists.ozlabs.org > Subject: Re: [RFCv3 PATCH 1/6] uacce: Add documents for WarpDrive/uacce > User-Agent: Mutt/1.10.1 (2018-07-13) > Message-ID: <20181115145455.gn3...@mtr-leonro.mtl.com> > > On Thu, Nov 15, 2018 at 04:51:09PM +0800, Kenneth Lee wrote: > > On Wed, Nov 14, 2018 at 06:00:17PM +0200, Leon Romanovsky wrote: > > > Date: Wed, 14 Nov 2018 18:00:17 +0200 > > > From: Leon Romanovsky > > > To: Kenneth Lee > > > CC: Tim Sell , linux-...@vger.kernel.org, > > > Alexander Shishkin , Zaibo Xu > > > , zhangfei@foxmail.com, linux...@huawei.com, > > > haojian.zhu...@linaro.org, Christoph Lameter , Hao Fang > > > , Gavin Schenk , RDMA > > > mailing > > > list , Zhou Wang , > > > Jason Gunthorpe , Doug Ledford , Uwe > > > Kleine-König , David Kershner > > > , Johan Hovold , Cyrille > > > Pitchen , Sagar Dharia > > > , Jens Axboe , > > > guodong...@linaro.org, linux-netdev , Randy > > > Dunlap > > > , linux-ker...@vger.kernel.org, Vinod Koul > > > , linux-crypto@vger.kernel.org, Philippe Ombredanne > > > , Sanyog Kale , Kenneth > > > Lee > > > , "David S. Miller" , > > > linux-accelerat...@lists.ozlabs.org > > > Subject: Re: [RFCv3 PATCH 1/6] uacce: Add documents for WarpDrive/uacce > > > User-Agent: Mutt/1.10.1 (2018-07-13) > > > Message-ID: <20181114160017.gi3...@mtr-leonro.mtl.com> > > > > > > On Wed, Nov 14, 2018 at 10:58:09AM +0800, Kenneth Lee wrote: > > > > > > > > 在 2018/11/13 上午8:23, Leon Romanovsky 写道: > > > > > On Mon, Nov 12, 2018 at 03:58:02PM +0800, Kenneth Lee wrote: > > > > > > From: Kenneth Lee > > > > > > > > > > > > WarpDrive is a general accelerator framework for the user > > > > > > application to > > > > > > access the hardware without going through the kernel in data path. > > > > > > > > > > > > The kernel component to provide kernel facility to driver for > > > > > > expose the > > > > > > user interface is called uacce. It a short name for > > > > > > "Unified/User-space-access-intended Accelerator Framework". > > > > > > > > > > > > This patch add document to explain how it works. > > > > > + RDMA and netdev folks > > > > > > > > > > Sorry, to be late in the game, I don't see other patches, but from > > > > > the description below it seems like you are reinventing RDMA verbs > > > > > model. I have hard time to see the differences in the proposed > > > > > framework to already implemented in drivers/infiniband/* for the > > > > > kernel > > > > > space and for the https://github.com/linux-rdma/rdma-core/ for the > > > > > user > > > > > space parts. > > > > > > > > Thanks Leon, > > > > > > > > Yes, we tried to solve similar problem in RDMA. We also learned a lot > > > > from > > > > the exist code of RDMA. 
> > > > But we we have to make a new one because we cannot
> > > > register accelerators such as AI operation, encryption or compression
> > > > to the RDMA framework:)
> > >
> > > Assuming that you did everything right and still failed to use RDMA
> > > framework, you was supposed to fix it and not to reinvent new exactly
> > > same one. It is how we develop kernel, by reusing existing code.
> >
> > Yes, but we don't force other system such as NIC or GPU into RDMA, do we?
>
> You don't introduce new NIC or GPU, but proposing another interface to
> directly access HW memory and bypass kernel for the data path. This is
> whole idea of RDMA and this is why it is already present in the kernel.
>
> Various hardware devices are supported in our stack allow a ton of crazy
> stuff, including GPUs interconnections and NIC functionalities.

Yes. We don't want to reinvent the wheel. That is why we did it behind VFIO in RFC v1 and v2. But finally we were persuaded by Mr. Jerome Glisse that VFIO was not a good place to solve the problem.

And currently, as you see, IB is bound to devices doing RDMA. The register function, ib_register_device(), hints that it is a netdev (the get_netdev() callback); it knows about GIDs, PKeys, and Memory Windows. IB is not simply an address space management framework, and the verbs in IB are not transparent. If we start to add compression/decompression, AI (RNN, CNN) operations, and encryption/decryption to the verb set, it will become very complex. Or maybe I
Re: [PATCH 0/6] crypto: x86/chacha20 - SIMD performance improvements
Hi Jason, > I'd be inclined to roll with your implementation if it can eventually > become competitive with Andy Polyakov's, [...] I think for the SSSE3/AVX2 code paths it is competitive; especially for small sizes it is faster, which is not that unimportant when implementing layer 3 VPNs. > there are still no AVX-512 paths, which means it's considerably > slower on all newer generation Intel chips. Andy's has the AVX-512VL > implementation for Skylake (using ymm, so as not to hit throttling) > and AVX-512F for Cannon Lake and beyond (using zmm). I don't think that having AVX-512F is that important until it is really usable on CPUs in the market. Adding AVX-512VL support is relatively simple. I have a patchset mostly ready that is more than competitive with the code from Zinc. I'll clean that up and do more testing before posting it later this week. Best regards Martin
Re: [PATCH 0/5] crypto: caam - add support for Era 10
On Thu, Nov 08, 2018 at 03:36:26PM +0200, Horia Geantă wrote: > This patch set adds support for CAAM Era 10, currently used in LX2160A SoC: > -new register mapping: some registers/fields are deprecated and moved > to different locations, mainly version registers > -algorithms > chacha20 (over DPSECI - Data Path SEC Interface on fsl-mc bus) > rfc7539(chacha20,poly1305) (over both DPSECI and Job Ring Interface) > rfc7539esp(chacha20,poly1305) (over both DPSECI and Job Ring Interface) > > Note: the patch set is generated on top of cryptodev-2.6, however testing > was performed based on linux-next (tag: next-20181108) - which includes > LX2160A platform support + manually updating LX2160A dts with: > -fsl-mc bus DT node > -missing dma-ranges property in soc DT node > > Cristian Stoica (1): > crypto: export CHACHAPOLY_IV_SIZE > > Horia Geantă (4): > crypto: caam - add register map changes cf. Era 10 > crypto: caam/qi2 - add support for ChaCha20 > crypto: caam/jr - add support for Chacha20 + Poly1305 > crypto: caam/qi2 - add support for Chacha20 + Poly1305 > > crypto/chacha20poly1305.c | 2 - > drivers/crypto/caam/caamalg.c | 266 > ++--- > drivers/crypto/caam/caamalg_desc.c | 139 ++- > drivers/crypto/caam/caamalg_desc.h | 5 + > drivers/crypto/caam/caamalg_qi.c | 37 -- > drivers/crypto/caam/caamalg_qi2.c | 156 +- > drivers/crypto/caam/caamhash.c | 20 ++- > drivers/crypto/caam/caampkc.c | 10 +- > drivers/crypto/caam/caamrng.c | 10 +- > drivers/crypto/caam/compat.h | 2 + > drivers/crypto/caam/ctrl.c | 28 +++- > drivers/crypto/caam/desc.h | 28 > drivers/crypto/caam/desc_constr.h | 7 +- > drivers/crypto/caam/regs.h | 74 +-- > include/crypto/chacha20.h | 1 + > 15 files changed, 724 insertions(+), 61 deletions(-) All applied. Thanks. -- Email: Herbert Xu Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
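Once these templates are registered, the new algorithms are reachable through the regular AEAD API by name; a minimal illustrative allocation (not taken from the patch set):

	#include <crypto/aead.h>

	static int try_chachapoly(void)
	{
		struct crypto_aead *tfm;

		/* On a CAAM Era 10 system the hardware-backed
		 * implementation can win the priority-based selection
		 * over the generic chacha20poly1305 code. */
		tfm = crypto_alloc_aead("rfc7539(chacha20,poly1305)", 0, 0);
		if (IS_ERR(tfm))
			return PTR_ERR(tfm);

		/* ... crypto_aead_setkey(), crypto_aead_encrypt(), ... */

		crypto_free_aead(tfm);
		return 0;
	}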
Re: [PATCH 0/6] crypto: x86/chacha20 - SIMD performance improvements
On Sun, Nov 11, 2018 at 10:36:24AM +0100, Martin Willi wrote: > This patchset improves performance of the ChaCha20 SIMD implementations > for x86_64. For some specific encryption lengths, performance is more > than doubled. Two mechanisms are used to achieve this: > > * Instead of calculating the minimal number of required blocks for a > given encryption length, functions producing more blocks are used > more aggressively. Calculating a 4-block function can be faster than > calculating a 2-block and a 1-block function, even if only three > blocks are actually required. > > * In addition to the 8-block AVX2 function, a 4-block and a 2-block > function are introduced. > > Patches 1-3 add support for partial lengths to the existing 1-, 4- and > 8-block functions. Patch 4 makes use of that by engaging the next higher > level block functions more aggressively. Patch 5 and 6 add the new AVX2 > functions for 2 and 4 blocks. Patches are based on cryptodev and would > need adjustments to apply on top of the Adiantum patchset. > > Note that the more aggressive use of larger block functions calculate > blocks that may get discarded. This may have a negative impact on energy > usage or the processors thermal budget. However, with the new block > functions we can avoid this over-calculation for many lengths, so the > performance win can be considered more important. > > Below are performance numbers measured with tcrypt using additional > encryption lengths; numbers in kOps/s, on my i7-5557U. old is the > existing, new the implementation with this patchset. As comparison > the numbers for zinc in v6: > > len old new zinc >8 5908 5818 5818 > 16 5917 5828 5726 > 24 5916 5869 5757 > 32 5920 5789 5813 > 40 5868 5799 5710 > 48 5877 5761 5761 > 56 5869 5797 5742 > 64 5897 5862 5685 > 72 3381 4979 3520 > 80 3364 5541 3475 > 88 3350 4977 3424 > 96 3342 5530 3371 > 104 3328 4923 3313 > 112 3317 5528 3207 > 120 3313 4970 3150 > 128 3492 5535 3568 > 136 2487 4570 3690 > 144 2481 5047 3599 > 152 2473 4565 3566 > 160 2459 5022 3515 > 168 2461 4550 3437 > 176 2454 5020 3325 > 184 2449 4535 3279 > 192 2538 5011 3762 > 200 1962 4537 3702 > 208 1962 4971 3622 > 216 1954 4487 3518 > 224 1949 4936 3445 > 232 1948 4497 3422 > 240 1941 4947 3317 > 248 1940 4481 3279 > 256 3798 4964 3723 > 264 2638 3577 3639 > 272 2637 3567 3597 > 280 2628 3563 3565 > 288 2630 3795 3484 > 296 2621 3580 3422 > 304 2612 3569 3352 > 312 2602 3599 3308 > 320 2694 3821 3694 > 328 2060 3538 3681 > 336 2054 3565 3599 > 344 2054 3553 3523 > 352 2049 3809 3419 > 360 2045 3575 3403 > 368 2035 3560 3334 > 376 2036 3555 3257 > 384 2092 3785 3715 > 392 1691 3505 3612 > 400 1684 3527 3553 > 408 1686 3527 3496 > 416 1684 3804 3430 > 424 1681 3555 3402 > 432 1675 3559 3311 > 440 1672 3558 3275 > 448 1710 3780 3689 > 456 1431 3541 3618 > 464 1428 3538 3576 > 472 1430 3527 3509 > 480 1426 3788 3405 > 488 1423 3502 3397 > 496 1423 3519 3298 > 504 1418 3519 3277 > 512 3694 3736 3735 > 520 2601 2571 2209 > 528 2601 2677 2148 > 536 2587 2534 2164 > 544 2578 2659 2138 > 552 2570 2552 2126 > 560 2566 2661 2035 > 568 2567 2542 2041 > 576 2639 2674 2199 > 584 2031 2531 2183 > 592 2027 2660 2145 > 600 2016 2513 2155 > 608 2009 2638 2133 > 616 2006 2522 2115 > 624 2000 2649 2064 > 632 1996 2518 2045 > 640 2053 2651 2188 > 648 1666 2402 2182 > 656 1663 2517 2158 > 664 1659 2397 2147 > 672 1657 2510 2139 > 680 1656 2394 2114 > 688 1653 2497 2077 > 696 1646 2393 2043 > 704 1678 2510 2208 > 712 1414 2391 2189 > 720 1412 2506 2169 > 728 1411 2384 2145 > 736 1408 2494 2142 
> 744 1408 2379 2081 > 752 1405 2485 2064 > 760 1403 2376 2043 > 768 2189 2498 2211 > 776 1756 2137 2192 > 784 1746 2145 2146 > 792 1744 2141 2141 > 800 1743 2094 > 808 1742 2140 2100 > 816 1735 2134 2061 > 824 1731 2135 2045 > 832 1778 2223 > 840 1480 2132 2184 > 848 1480 2134 2173 > 856 1476 2124 2145 > 864 1474 2210 2126 > 872 1472 2127 2105 > 880 1463 2123 2056 > 888 1468 2123 2043 > 896 1494 2208 2219 > 904 1278 2120 2192 > 912 1277 2121 2170 > 920 1273 2118 2149 > 928 1272 2207 2125 > 936 1267 2125 2098 > 944 1265 2127 2060 > 952 1267 2126 2049 > 960 1289 2213 2204 > 968 1125 2123 2187 > 976 1122 2127 2166 > 984 1120 2123 2136 > 992 1118 2207 2119 > 1000 1118 2120 2101 > 1008 1117 2122 2042 > 1016 1115 2121 2048 > 1024 2174 2191 2195 > 1032 1748 1724 1565 > 1040 1745 1782 1544 > 1048 1736 1737 1554 > 1056 1738 1802 1541 > 1064 1735 1728 1523 > 1072 1730 1780 1507 > 1080 1729 1724 1497 > 1088 1757 1783 1592 > 1096 1475 1723 1575 > 1104 1474 1778 1563 > 1112 1472 1708 1544 > 1120 1468 1774 1521 > 1128 1466 1718 1521 > 1136 1462 1780 1501 > 1144 1460 1719 1491 > 1152 1481 1782 1575 > 1160 1271 1647 1558 > 1168 1271 1706 1554 > 1176 1268 1645 1545 > 1184 1265 1711 1538 > 1192 1265 1648 1530 > 1200 1264 1705 1493 > 1208 1262 1647 1498 > 1216 1277 1695 1581
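To make the first mechanism quoted above concrete: for a tail of up to eight blocks, the glue code rounds up to the next available function width instead of composing narrower calls. A hedged sketch of that selection (function names follow the patch titles; the exact code in chacha20_glue.c may differ):

	/* Sketch only: 3 blocks are served by one 4-block call rather
	 * than a 2-block plus a 1-block call, trading some discarded
	 * keystream for fewer invocations. */
	if (bytes > CHACHA20_BLOCK_SIZE * 4)
		chacha20_8block_xor_avx2(state, dst, src, bytes);
	else if (bytes > CHACHA20_BLOCK_SIZE * 2)
		chacha20_4block_xor_avx2(state, dst, src, bytes);
	else if (bytes > CHACHA20_BLOCK_SIZE)
		chacha20_2block_xor_avx2(state, dst, src, bytes);
	else if (bytes)
		chacha20_block_xor_ssse3(state, dst, src, bytes);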
Re: [PATCH 0/6] crypto: x86/chacha20 - SIMD performance improvements
Hi Martin,

This is nice work, and given that it's quite clean -- and that it's usually hard to screw up chacha in subtle ways when test vectors pass (unlike, say, poly1305 or curve25519) -- I'd be inclined to roll with your implementation if it can eventually become competitive with Andy Polyakov's, which I'm currently working on for Zinc (which no longer has pre-generated code, addressing the biggest hurdle; v9 will be sent shortly).

Specifically, I'm not quite sure the improvements here tip the balance on all avx2 microarchitectures, and most importantly, there are still no AVX-512 paths, which means it's considerably slower on all newer generation Intel chips. Andy's has the AVX-512VL implementation for Skylake (using ymm, so as not to hit throttling) and AVX-512F for Cannon Lake and beyond (using zmm). I've attached some measurements below showing how stark the difference is.

The takeaway is that while Andy's implementation is still ahead in terms of performance today, I'd certainly encourage your efforts to gain parity with that, and I'd be happy to have that when the performance and fuzzing time is right for it. So please do keep chipping away at it; I think it's a potentially useful effort.

Regards,
Jason

size old zinc
0 64 54
16 386 372
32 388 396
48 388 420
64 366 350
80 708 666
96 708 692
112 706 736
128 692 648
144 1036 682
160 1036 708
176 1036 730
192 1016 658
208 1360 684
224 1362 708
240 1360 732
256 644 500
272 990 526
288 988 556
304 988 576
320 972 500
336 1314 532
352 1316 558
368 1318 578
384 1308 506
400 1644 532
416 1644 556
432 1644 594
448 1624 508
464 1970 534
480 1970 556
496 1968 582
512 660 624
528 1016 682
544 1016 702
560 1018 728
576 998 654
592 1344 680
608 1344 708
624 1344 730
640 1326 654
656 1670 686
672 1670 708
688 1670 732
704 1652 658
720 1998 682
736 1998 710
752 1996 734
768 1256 662
784 1606 688
800 1606 714
816 1606 736
832 1584 660
848 1948 688
864 1950 714
880 1948 736
896 1912 688
912 2258 718
928 2258 744
944 2256 768
960 2238 692
976 2584 718
992 2584 744
1008 2584 770

On Thu, Nov 15, 2018 at 6:21 PM Herbert Xu wrote:
>
> On Sun, Nov 11, 2018 at 10:36:24AM +0100, Martin Willi wrote:
> > This patchset improves performance of the ChaCha20 SIMD implementations
> > for x86_64. For some specific encryption lengths, performance is more
> > than doubled. Two mechanisms are used to achieve this:
> >
> > * Instead of calculating the minimal number of required blocks for a
> >   given encryption length, functions producing more blocks are used
> >   more aggressively. Calculating a 4-block function can be faster than
> >   calculating a 2-block and a 1-block function, even if only three
> >   blocks are actually required.
> >
> > * In addition to the 8-block AVX2 function, a 4-block and a 2-block
> >   function are introduced.
> >
> > Patches 1-3 add support for partial lengths to the existing 1-, 4- and
> > 8-block functions. Patch 4 makes use of that by engaging the next higher
> > level block functions more aggressively. Patch 5 and 6 add the new AVX2
> > functions for 2 and 4 blocks. Patches are based on cryptodev and would
> > need adjustments to apply on top of the Adiantum patchset.
> >
> > Note that the more aggressive use of larger block functions calculate
> > blocks that may get discarded. This may have a negative impact on energy
> > usage or the processors thermal budget. However, with the new block
> > functions we can avoid this over-calculation for many lengths, so the
> > performance win can be considered more important.
> >
> > Below are performance numbers measured with tcrypt using additional
> > encryption lengths; numbers in kOps/s, on my i7-5557U. old is the
> > existing, new the implementation with this patchset. As comparison
> > the numbers for zinc in v6:
> >
> > len old new zinc
> > 8 5908 5818 5818
> > 16 5917 5828 5726
> > 24 5916 5869 5757
> > 32 5920 5789 5813
> > 40 5868 5799 5710
> > 48 5877 5761 5761
> > 56 5869 5797 5742
> > 64 5897 5862 5685
> > 72 3381 4979 3520
> > 80 3364 5541 3475
> > 88 3350 4977 3424
> > 96 3342 5530 3371
> > 104 3328 4923 3313
> > 112 3317 5528 3207
> > 120 3313 4970 3150
> > 128 3492 5535 3568
> > 136 2487 4570 3690
> > 144 2481 5047 3599
> > 152 2473 4565 3566
> > 160 2459 5022 3515
> > 168 2461 4550 3437
> > 176 2454 5020 3325
> > 184 2449 4535 3279
> > 192 2538 5011 3762
> > 200 1962 4537 3702
> > 208 1962 4971 3622
> > 216 1954 4487 3518
> > 224 1949 4936 3445
> > 232 1948 4497 3422
> > 240 1941 4947 3317
> > 248 1940 4481 3279
> > 256 3798 4964 3723
> > 264 2638 3577 3639
> > 272 2637 3567 3597
> > 280 2628 3563 3565
> > 288 2630 3795 3484
> > 296 2621 3580 3422
> > 304 2612 3569 3352
> > 312 2602 3599 3308
> > 320 2694 3821 3694
> > 328 2060 3538 3681
> > 336 2054 3565 3599
> > 344 2054 3553 3523
> > 352 2049 3809 3419
> > 360 2045 3575 3403
> > 368 2035 3560 3334
> > 376 2036 3555 3257
> > 384 2092 3785 3715
Re: [PATCH] crypto: inside-secure - remove useless setting of type flags
Hi Eric, On Wed, Nov 14, 2018 at 11:10:53AM -0800, Eric Biggers wrote: > From: Eric Biggers > > Remove the unnecessary setting of CRYPTO_ALG_TYPE_SKCIPHER. > Commit 2c95e6d97892 ("crypto: skcipher - remove useless setting of type > flags") took care of this everywhere else, but a few more instances made > it into the tree at about the same time. Squash them before they get > copy+pasted around again. > > This patch shouldn't change any actual behavior. > > Signed-off-by: Eric Biggers Acked-by: Antoine Tenart Thanks! Antoine > --- > drivers/crypto/inside-secure/safexcel_cipher.c | 8 > 1 file changed, 4 insertions(+), 4 deletions(-) > > diff --git a/drivers/crypto/inside-secure/safexcel_cipher.c > b/drivers/crypto/inside-secure/safexcel_cipher.c > index 3aef1d43e4351..d531c14020dcb 100644 > --- a/drivers/crypto/inside-secure/safexcel_cipher.c > +++ b/drivers/crypto/inside-secure/safexcel_cipher.c > @@ -970,7 +970,7 @@ struct safexcel_alg_template safexcel_alg_cbc_des = { > .cra_name = "cbc(des)", > .cra_driver_name = "safexcel-cbc-des", > .cra_priority = 300, > - .cra_flags = CRYPTO_ALG_TYPE_SKCIPHER | > CRYPTO_ALG_ASYNC | > + .cra_flags = CRYPTO_ALG_ASYNC | >CRYPTO_ALG_KERN_DRIVER_ONLY, > .cra_blocksize = DES_BLOCK_SIZE, > .cra_ctxsize = sizeof(struct safexcel_cipher_ctx), > @@ -1010,7 +1010,7 @@ struct safexcel_alg_template safexcel_alg_ecb_des = { > .cra_name = "ecb(des)", > .cra_driver_name = "safexcel-ecb-des", > .cra_priority = 300, > - .cra_flags = CRYPTO_ALG_TYPE_SKCIPHER | > CRYPTO_ALG_ASYNC | > + .cra_flags = CRYPTO_ALG_ASYNC | >CRYPTO_ALG_KERN_DRIVER_ONLY, > .cra_blocksize = DES_BLOCK_SIZE, > .cra_ctxsize = sizeof(struct safexcel_cipher_ctx), > @@ -1074,7 +1074,7 @@ struct safexcel_alg_template safexcel_alg_cbc_des3_ede > = { > .cra_name = "cbc(des3_ede)", > .cra_driver_name = "safexcel-cbc-des3_ede", > .cra_priority = 300, > - .cra_flags = CRYPTO_ALG_TYPE_SKCIPHER | > CRYPTO_ALG_ASYNC | > + .cra_flags = CRYPTO_ALG_ASYNC | >CRYPTO_ALG_KERN_DRIVER_ONLY, > .cra_blocksize = DES3_EDE_BLOCK_SIZE, > .cra_ctxsize = sizeof(struct safexcel_cipher_ctx), > @@ -1114,7 +1114,7 @@ struct safexcel_alg_template safexcel_alg_ecb_des3_ede > = { > .cra_name = "ecb(des3_ede)", > .cra_driver_name = "safexcel-ecb-des3_ede", > .cra_priority = 300, > - .cra_flags = CRYPTO_ALG_TYPE_SKCIPHER | > CRYPTO_ALG_ASYNC | > + .cra_flags = CRYPTO_ALG_ASYNC | >CRYPTO_ALG_KERN_DRIVER_ONLY, > .cra_blocksize = DES3_EDE_BLOCK_SIZE, > .cra_ctxsize = sizeof(struct safexcel_cipher_ctx), > -- > 2.19.1.930.g4563a0d9d0-goog > -- Antoine Ténart, Bootlin Embedded Linux and Kernel engineering https://bootlin.com
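For anyone wondering why the flag is redundant: the skcipher registration path forces the type bits itself, so a driver-supplied value is overwritten. Paraphrased from memory of crypto/skcipher.c's skcipher_prepare_alg(), so treat as a sketch:

	/* The core clears the type bits and sets CRYPTO_ALG_TYPE_SKCIPHER
	 * itself; setting it in a driver's .cra_flags is a no-op. */
	base->cra_flags &= ~CRYPTO_ALG_TYPE_MASK;
	base->cra_flags |= CRYPTO_ALG_TYPE_SKCIPHER;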
[PATCH] crypto: drop mask=CRYPTO_ALG_ASYNC from 'shash' tfm allocations
From: Eric Biggers 'shash' algorithms are always synchronous, so passing CRYPTO_ALG_ASYNC in the mask to crypto_alloc_shash() has no effect. Many users therefore already don't pass it, but some still do. This inconsistency can cause confusion, especially since the way the 'mask' argument works is somewhat counterintuitive. Thus, just remove the unneeded CRYPTO_ALG_ASYNC flags. This patch shouldn't change any actual behavior. Signed-off-by: Eric Biggers --- drivers/block/drbd/drbd_receiver.c | 2 +- drivers/md/dm-integrity.c | 2 +- drivers/net/wireless/intersil/orinoco/mic.c | 6 ++ fs/ubifs/auth.c | 5 ++--- net/bluetooth/smp.c | 2 +- security/apparmor/crypto.c | 2 +- security/integrity/evm/evm_crypto.c | 3 +-- security/keys/encrypted-keys/encrypted.c| 4 ++-- security/keys/trusted.c | 4 ++-- 9 files changed, 13 insertions(+), 17 deletions(-) diff --git a/drivers/block/drbd/drbd_receiver.c b/drivers/block/drbd/drbd_receiver.c index 61c392752fe4b..ccfcf00f2798d 100644 --- a/drivers/block/drbd/drbd_receiver.c +++ b/drivers/block/drbd/drbd_receiver.c @@ -3623,7 +3623,7 @@ static int receive_protocol(struct drbd_connection *connection, struct packet_in * change. */ - peer_integrity_tfm = crypto_alloc_shash(integrity_alg, 0, CRYPTO_ALG_ASYNC); + peer_integrity_tfm = crypto_alloc_shash(integrity_alg, 0, 0); if (IS_ERR(peer_integrity_tfm)) { peer_integrity_tfm = NULL; drbd_err(connection, "peer data-integrity-alg %s not supported\n", diff --git a/drivers/md/dm-integrity.c b/drivers/md/dm-integrity.c index bb3096bf2cc6b..d4ad0bfee2519 100644 --- a/drivers/md/dm-integrity.c +++ b/drivers/md/dm-integrity.c @@ -2804,7 +2804,7 @@ static int get_mac(struct crypto_shash **hash, struct alg_spec *a, char **error, int r; if (a->alg_string) { - *hash = crypto_alloc_shash(a->alg_string, 0, CRYPTO_ALG_ASYNC); + *hash = crypto_alloc_shash(a->alg_string, 0, 0); if (IS_ERR(*hash)) { *error = error_alg; r = PTR_ERR(*hash); diff --git a/drivers/net/wireless/intersil/orinoco/mic.c b/drivers/net/wireless/intersil/orinoco/mic.c index 08bc7822f8209..709d9ab3e7bcb 100644 --- a/drivers/net/wireless/intersil/orinoco/mic.c +++ b/drivers/net/wireless/intersil/orinoco/mic.c @@ -16,8 +16,7 @@ // int orinoco_mic_init(struct orinoco_private *priv) { - priv->tx_tfm_mic = crypto_alloc_shash("michael_mic", 0, - CRYPTO_ALG_ASYNC); + priv->tx_tfm_mic = crypto_alloc_shash("michael_mic", 0, 0); if (IS_ERR(priv->tx_tfm_mic)) { printk(KERN_DEBUG "orinoco_mic_init: could not allocate " "crypto API michael_mic\n"); @@ -25,8 +24,7 @@ int orinoco_mic_init(struct orinoco_private *priv) return -ENOMEM; } - priv->rx_tfm_mic = crypto_alloc_shash("michael_mic", 0, - CRYPTO_ALG_ASYNC); + priv->rx_tfm_mic = crypto_alloc_shash("michael_mic", 0, 0); if (IS_ERR(priv->rx_tfm_mic)) { printk(KERN_DEBUG "orinoco_mic_init: could not allocate " "crypto API michael_mic\n"); diff --git a/fs/ubifs/auth.c b/fs/ubifs/auth.c index 124e965a28b30..5bf5fd08879e6 100644 --- a/fs/ubifs/auth.c +++ b/fs/ubifs/auth.c @@ -269,8 +269,7 @@ int ubifs_init_authentication(struct ubifs_info *c) goto out; } - c->hash_tfm = crypto_alloc_shash(c->auth_hash_name, 0, -CRYPTO_ALG_ASYNC); + c->hash_tfm = crypto_alloc_shash(c->auth_hash_name, 0, 0); if (IS_ERR(c->hash_tfm)) { err = PTR_ERR(c->hash_tfm); ubifs_err(c, "Can not allocate %s: %d", @@ -286,7 +285,7 @@ int ubifs_init_authentication(struct ubifs_info *c) goto out_free_hash; } - c->hmac_tfm = crypto_alloc_shash(hmac_name, 0, CRYPTO_ALG_ASYNC); + c->hmac_tfm = crypto_alloc_shash(hmac_name, 0, 0); if (IS_ERR(c->hmac_tfm)) { err = 
PTR_ERR(c->hmac_tfm); ubifs_err(c, "Can not allocate %s: %d", hmac_name, err); diff --git a/net/bluetooth/smp.c b/net/bluetooth/smp.c index 1f94a25beef69..621146d04c038 100644 --- a/net/bluetooth/smp.c +++ b/net/bluetooth/smp.c @@ -3912,7 +3912,7 @@ int __init bt_selftest_smp(void) return PTR_ERR(tfm_aes); } - tfm_cmac = crypto_alloc_shash("cmac(aes)", 0, CRYPTO_ALG_ASYNC); + tfm_cmac = crypto_alloc_shash("cmac(aes)", 0, 0); if (IS_ERR(tfm_cmac)) { BT_ERR("Unable to create CMAC crypto
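For context on why the flag has no effect: the algorithm lookup matches flags with a test along these lines (simplified sketch of the core lookup logic), so CRYPTO_ALG_ASYNC in the mask merely demands that the algorithm's ASYNC bit equal the requested type's ASYNC bit, which is already clear, and every shash is synchronous anyway:

	/* Simplified: an algorithm matches when the bits selected by
	 * 'mask' agree between its cra_flags and the requested 'type'. */
	if ((alg->cra_flags ^ type) & mask)
		continue;	/* not a match */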
[PATCH] crypto: drop mask=CRYPTO_ALG_ASYNC from 'cipher' tfm allocations
From: Eric Biggers 'cipher' algorithms (single block ciphers) are always synchronous, so passing CRYPTO_ALG_ASYNC in the mask to crypto_alloc_cipher() has no effect. Many users therefore already don't pass it, but some still do. This inconsistency can cause confusion, especially since the way the 'mask' argument works is somewhat counterintuitive. Thus, just remove the unneeded CRYPTO_ALG_ASYNC flags. This patch shouldn't change any actual behavior. Signed-off-by: Eric Biggers --- arch/s390/crypto/aes_s390.c | 2 +- drivers/crypto/amcc/crypto4xx_alg.c | 3 +-- drivers/crypto/ccp/ccp-crypto-aes-cmac.c | 4 +--- drivers/crypto/geode-aes.c| 2 +- drivers/md/dm-crypt.c | 2 +- drivers/net/wireless/cisco/airo.c | 2 +- drivers/staging/rtl8192e/rtllib_crypt_ccmp.c | 2 +- drivers/staging/rtl8192u/ieee80211/ieee80211_crypt_ccmp.c | 2 +- drivers/usb/wusbcore/crypto.c | 2 +- net/bluetooth/smp.c | 6 +++--- net/mac80211/wep.c| 4 ++-- net/wireless/lib80211_crypt_ccmp.c| 2 +- net/wireless/lib80211_crypt_tkip.c| 4 ++-- net/wireless/lib80211_crypt_wep.c | 4 ++-- 14 files changed, 19 insertions(+), 22 deletions(-) diff --git a/arch/s390/crypto/aes_s390.c b/arch/s390/crypto/aes_s390.c index 812d9498d97be..dd456725189f2 100644 --- a/arch/s390/crypto/aes_s390.c +++ b/arch/s390/crypto/aes_s390.c @@ -137,7 +137,7 @@ static int fallback_init_cip(struct crypto_tfm *tfm) struct s390_aes_ctx *sctx = crypto_tfm_ctx(tfm); sctx->fallback.cip = crypto_alloc_cipher(name, 0, - CRYPTO_ALG_ASYNC | CRYPTO_ALG_NEED_FALLBACK); +CRYPTO_ALG_NEED_FALLBACK); if (IS_ERR(sctx->fallback.cip)) { pr_err("Allocating AES fallback algorithm %s failed\n", diff --git a/drivers/crypto/amcc/crypto4xx_alg.c b/drivers/crypto/amcc/crypto4xx_alg.c index f5c07498ea4f0..4092c2aad8e21 100644 --- a/drivers/crypto/amcc/crypto4xx_alg.c +++ b/drivers/crypto/amcc/crypto4xx_alg.c @@ -520,8 +520,7 @@ static int crypto4xx_compute_gcm_hash_key_sw(__le32 *hash_start, const u8 *key, uint8_t src[16] = { 0 }; int rc = 0; - aes_tfm = crypto_alloc_cipher("aes", 0, CRYPTO_ALG_ASYNC | - CRYPTO_ALG_NEED_FALLBACK); + aes_tfm = crypto_alloc_cipher("aes", 0, CRYPTO_ALG_NEED_FALLBACK); if (IS_ERR(aes_tfm)) { rc = PTR_ERR(aes_tfm); pr_warn("could not load aes cipher driver: %d\n", rc); diff --git a/drivers/crypto/ccp/ccp-crypto-aes-cmac.c b/drivers/crypto/ccp/ccp-crypto-aes-cmac.c index 3c6fe57f91f8c..9108015e56cc5 100644 --- a/drivers/crypto/ccp/ccp-crypto-aes-cmac.c +++ b/drivers/crypto/ccp/ccp-crypto-aes-cmac.c @@ -346,9 +346,7 @@ static int ccp_aes_cmac_cra_init(struct crypto_tfm *tfm) crypto_ahash_set_reqsize(ahash, sizeof(struct ccp_aes_cmac_req_ctx)); - cipher_tfm = crypto_alloc_cipher("aes", 0, -CRYPTO_ALG_ASYNC | -CRYPTO_ALG_NEED_FALLBACK); + cipher_tfm = crypto_alloc_cipher("aes", 0, CRYPTO_ALG_NEED_FALLBACK); if (IS_ERR(cipher_tfm)) { pr_warn("could not load aes cipher driver\n"); return PTR_ERR(cipher_tfm); diff --git a/drivers/crypto/geode-aes.c b/drivers/crypto/geode-aes.c index eb2a0a73cbed1..b4c24a35b3d08 100644 --- a/drivers/crypto/geode-aes.c +++ b/drivers/crypto/geode-aes.c @@ -261,7 +261,7 @@ static int fallback_init_cip(struct crypto_tfm *tfm) struct geode_aes_op *op = crypto_tfm_ctx(tfm); op->fallback.cip = crypto_alloc_cipher(name, 0, - CRYPTO_ALG_ASYNC | CRYPTO_ALG_NEED_FALLBACK); + CRYPTO_ALG_NEED_FALLBACK); if (IS_ERR(op->fallback.cip)) { printk(KERN_ERR "Error allocating fallback algo %s\n", name); diff --git a/drivers/md/dm-crypt.c b/drivers/md/dm-crypt.c index b8eec515a003c..a7195eb5b8d89 100644 --- a/drivers/md/dm-crypt.c +++ 
b/drivers/md/dm-crypt.c @@ -377,7 +377,7 @@ static struct crypto_cipher *alloc_essiv_cipher(struct crypt_config *cc, int err; /* Setup the essiv_tfm with the given salt */ - essiv_tfm = crypto_alloc_cipher(cc->cipher, 0, CRYPTO_ALG_ASYNC); + essiv_tfm = crypto_alloc_cipher(cc->cipher, 0, 0); if (IS_ERR(essiv_tfm)) { ti->error = "Error allocating crypto tfm for ESSIV"; return essiv_tfm; diff --git a/drivers/net/wireless/cisco/airo.c b/drivers/net/wireless/cisco/airo.c index 04dd7a9365938..6fab69fe6c92c 100644 ---
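As an aside, a minimal sketch (illustrative only, not part of the patch) of why the flag is a no-op: the mask argument only constrains algorithm lookup, and 'cipher' algorithms never have CRYPTO_ALG_ASYNC set in cra_flags, so the two calls below resolve to the same algorithm.

        /*
         * Sketch, not from the patch: type = 0 with CRYPTO_ALG_ASYNC in the
         * mask requests an algorithm whose ASYNC flag is clear. Single block
         * ciphers are always synchronous, so the constraint filters nothing
         * and the two calls are equivalent.
         */
        struct crypto_cipher *tfm;

        tfm = crypto_alloc_cipher("aes", 0, CRYPTO_ALG_ASYNC); /* old style */
        tfm = crypto_alloc_cipher("aes", 0, 0);                /* after this patch */
        if (IS_ERR(tfm))
                return PTR_ERR(tfm);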
[PATCH] crypto: remove useless initializations of cra_list
From: Eric Biggers Some algorithms initialize their .cra_list prior to registration. But this is unnecessary since crypto_register_alg() will overwrite .cra_list when adding the algorithm to the 'crypto_alg_list'. Apparently the useless assignment has just been copy+pasted around. So, remove the useless assignments. Exception: paes_s390.c uses cra_list to check whether the algorithm is registered or not, so I left that as-is for now. This patch shouldn't change any actual behavior. Signed-off-by: Eric Biggers --- arch/sparc/crypto/aes_glue.c | 5 - arch/sparc/crypto/camellia_glue.c | 5 - arch/sparc/crypto/des_glue.c | 5 - crypto/lz4.c | 1 - crypto/lz4hc.c| 1 - drivers/crypto/bcm/cipher.c | 2 -- drivers/crypto/omap-aes.c | 2 -- drivers/crypto/omap-des.c | 1 - drivers/crypto/qce/ablkcipher.c | 1 - drivers/crypto/qce/sha.c | 1 - drivers/crypto/sahara.c | 1 - 11 files changed, 25 deletions(-) diff --git a/arch/sparc/crypto/aes_glue.c b/arch/sparc/crypto/aes_glue.c index 3cd4f6b198b65..a9b8b0b94a8d4 100644 --- a/arch/sparc/crypto/aes_glue.c +++ b/arch/sparc/crypto/aes_glue.c @@ -476,11 +476,6 @@ static bool __init sparc64_has_aes_opcode(void) static int __init aes_sparc64_mod_init(void) { - int i; - - for (i = 0; i < ARRAY_SIZE(algs); i++) - INIT_LIST_HEAD(&algs[i].cra_list); - if (sparc64_has_aes_opcode()) { pr_info("Using sparc64 aes opcodes optimized AES implementation\n"); return crypto_register_algs(algs, ARRAY_SIZE(algs)); diff --git a/arch/sparc/crypto/camellia_glue.c b/arch/sparc/crypto/camellia_glue.c index 561a84d93cf68..900d5c617e83b 100644 --- a/arch/sparc/crypto/camellia_glue.c +++ b/arch/sparc/crypto/camellia_glue.c @@ -299,11 +299,6 @@ static bool __init sparc64_has_camellia_opcode(void) static int __init camellia_sparc64_mod_init(void) { - int i; - - for (i = 0; i < ARRAY_SIZE(algs); i++) - INIT_LIST_HEAD(&algs[i].cra_list); - if (sparc64_has_camellia_opcode()) { pr_info("Using sparc64 camellia opcodes optimized CAMELLIA implementation\n"); return crypto_register_algs(algs, ARRAY_SIZE(algs)); diff --git a/arch/sparc/crypto/des_glue.c b/arch/sparc/crypto/des_glue.c index 61af794aa2d31..56499ea39fd36 100644 --- a/arch/sparc/crypto/des_glue.c +++ b/arch/sparc/crypto/des_glue.c @@ -510,11 +510,6 @@ static bool __init sparc64_has_des_opcode(void) static int __init des_sparc64_mod_init(void) { - int i; - - for (i = 0; i < ARRAY_SIZE(algs); i++) - INIT_LIST_HEAD(&algs[i].cra_list); - if (sparc64_has_des_opcode()) { pr_info("Using sparc64 des opcodes optimized DES implementation\n"); return crypto_register_algs(algs, ARRAY_SIZE(algs)); diff --git a/crypto/lz4.c b/crypto/lz4.c index 2ce2660d3519e..c160dfdbf2e07 100644 --- a/crypto/lz4.c +++ b/crypto/lz4.c @@ -122,7 +122,6 @@ static struct crypto_alg alg_lz4 = { .cra_flags = CRYPTO_ALG_TYPE_COMPRESS, .cra_ctxsize= sizeof(struct lz4_ctx), .cra_module = THIS_MODULE, - .cra_list = LIST_HEAD_INIT(alg_lz4.cra_list), .cra_init = lz4_init, .cra_exit = lz4_exit, .cra_u = { .compress = { diff --git a/crypto/lz4hc.c b/crypto/lz4hc.c index 2be14f054dafd..583b5e013d7a5 100644 --- a/crypto/lz4hc.c +++ b/crypto/lz4hc.c @@ -123,7 +123,6 @@ static struct crypto_alg alg_lz4hc = { .cra_flags = CRYPTO_ALG_TYPE_COMPRESS, .cra_ctxsize= sizeof(struct lz4hc_ctx), .cra_module = THIS_MODULE, - .cra_list = LIST_HEAD_INIT(alg_lz4hc.cra_list), .cra_init = lz4hc_init, .cra_exit = lz4hc_exit, .cra_u = { .compress = { diff --git a/drivers/crypto/bcm/cipher.c b/drivers/crypto/bcm/cipher.c index 2d1f1db9f8074..8808eacc65801 100644 --- a/drivers/crypto/bcm/cipher.c +++
b/drivers/crypto/bcm/cipher.c @@ -4605,7 +4605,6 @@ static int spu_register_ablkcipher(struct iproc_alg_s *driver_alg) crypto->cra_priority = cipher_pri; crypto->cra_alignmask = 0; crypto->cra_ctxsize = sizeof(struct iproc_ctx_s); - INIT_LIST_HEAD(&crypto->cra_list); crypto->cra_init = ablkcipher_cra_init; crypto->cra_exit = generic_cra_exit; @@ -4687,7 +4686,6 @@ static int spu_register_aead(struct iproc_alg_s *driver_alg) aead->base.cra_priority = aead_pri; aead->base.cra_alignmask = 0; aead->base.cra_ctxsize = sizeof(struct iproc_ctx_s); - INIT_LIST_HEAD(&aead->base.cra_list); aead->base.cra_flags |= CRYPTO_ALG_ASYNC; /* setkey set in alg initialization */ diff --git a/drivers/crypto/omap-aes.c b/drivers/crypto/omap-aes.c
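For readers unfamiliar with the list API, a simplified sketch (paraphrased from include/linux/list.h; not part of the patch) of why the assignments are dead code: registration ends in list_add(), which unconditionally rewrites the entry's pointers.

        static inline void __list_add(struct list_head *new,
                                      struct list_head *prev,
                                      struct list_head *next)
        {
                next->prev = new;
                new->next = next;       /* any prior INIT_LIST_HEAD() value is discarded */
                new->prev = prev;
                prev->next = new;
        }

        /* crypto_register_alg() effectively does:
         *      list_add(&alg->cra_list, &crypto_alg_list);
         */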
[PATCH] crypto: inside-secure - remove useless setting of type flags
From: Eric Biggers Remove the unnecessary setting of CRYPTO_ALG_TYPE_SKCIPHER. Commit 2c95e6d97892 ("crypto: skcipher - remove useless setting of type flags") took care of this everywhere else, but a few more instances made it into the tree at about the same time. Squash them before they get copy+pasted around again. This patch shouldn't change any actual behavior. Signed-off-by: Eric Biggers --- drivers/crypto/inside-secure/safexcel_cipher.c | 8 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/drivers/crypto/inside-secure/safexcel_cipher.c b/drivers/crypto/inside-secure/safexcel_cipher.c index 3aef1d43e4351..d531c14020dcb 100644 --- a/drivers/crypto/inside-secure/safexcel_cipher.c +++ b/drivers/crypto/inside-secure/safexcel_cipher.c @@ -970,7 +970,7 @@ struct safexcel_alg_template safexcel_alg_cbc_des = { .cra_name = "cbc(des)", .cra_driver_name = "safexcel-cbc-des", .cra_priority = 300, - .cra_flags = CRYPTO_ALG_TYPE_SKCIPHER | CRYPTO_ALG_ASYNC | + .cra_flags = CRYPTO_ALG_ASYNC | CRYPTO_ALG_KERN_DRIVER_ONLY, .cra_blocksize = DES_BLOCK_SIZE, .cra_ctxsize = sizeof(struct safexcel_cipher_ctx), @@ -1010,7 +1010,7 @@ struct safexcel_alg_template safexcel_alg_ecb_des = { .cra_name = "ecb(des)", .cra_driver_name = "safexcel-ecb-des", .cra_priority = 300, - .cra_flags = CRYPTO_ALG_TYPE_SKCIPHER | CRYPTO_ALG_ASYNC | + .cra_flags = CRYPTO_ALG_ASYNC | CRYPTO_ALG_KERN_DRIVER_ONLY, .cra_blocksize = DES_BLOCK_SIZE, .cra_ctxsize = sizeof(struct safexcel_cipher_ctx), @@ -1074,7 +1074,7 @@ struct safexcel_alg_template safexcel_alg_cbc_des3_ede = { .cra_name = "cbc(des3_ede)", .cra_driver_name = "safexcel-cbc-des3_ede", .cra_priority = 300, - .cra_flags = CRYPTO_ALG_TYPE_SKCIPHER | CRYPTO_ALG_ASYNC | + .cra_flags = CRYPTO_ALG_ASYNC | CRYPTO_ALG_KERN_DRIVER_ONLY, .cra_blocksize = DES3_EDE_BLOCK_SIZE, .cra_ctxsize = sizeof(struct safexcel_cipher_ctx), @@ -1114,7 +1114,7 @@ struct safexcel_alg_template safexcel_alg_ecb_des3_ede = { .cra_name = "ecb(des3_ede)", .cra_driver_name = "safexcel-ecb-des3_ede", .cra_priority = 300, - .cra_flags = CRYPTO_ALG_TYPE_SKCIPHER | CRYPTO_ALG_ASYNC | + .cra_flags = CRYPTO_ALG_ASYNC | CRYPTO_ALG_KERN_DRIVER_ONLY, .cra_blocksize = DES3_EDE_BLOCK_SIZE, .cra_ctxsize = sizeof(struct safexcel_cipher_ctx), -- 2.19.1.930.g4563a0d9d0-goog
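The type flag is redundant because the skcipher registration path forces it itself; roughly, as a paraphrase of crypto/skcipher.c from around this time (details may differ between kernel versions):

        static int skcipher_prepare_alg(struct skcipher_alg *alg)
        {
                struct crypto_alg *base = &alg->base;

                /* ... size sanity checks elided ... */

                base->cra_type = &crypto_skcipher_type2;
                base->cra_flags &= ~CRYPTO_ALG_TYPE_MASK;
                base->cra_flags |= CRYPTO_ALG_TYPE_SKCIPHER;

                return 0;
        }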
Re: Something wrong with cryptodev-2.6 tree?
On Mon, Nov 12, 2018 at 09:44:41AM +0200, Gilad Ben-Yossef wrote: > Hi, > > It seems that the cryptodev-2.6 tree at > https://git.kernel.org/pub/scm/linux/kernel/git/herbert/cryptodev-2.6.git > has somehow rolled back 3 months ago. > > Not sure if it's a git.kernel.org issue or something else but probably > worth taking a look? Thanks Gilad. It should be fixed now. -- Email: Herbert Xu Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
Something wrong with cryptodev-2.6 tree?
Hi, It seems that the cryptodev-2.6 tree at https://git.kernel.org/pub/scm/linux/kernel/git/herbert/cryptodev-2.6.git has somehow rolled back 3 months ago. Not sure if it's a git.kernel.org issue or something else but probably worth taking a look? Thanks, Gilad -- Gilad Ben-Yossef Chief Coffee Drinker values of β will give rise to dom!
Re: [PATCH 03/17] hw_random: bcm2835-rng: Switch to SPDX identifier
On Sat, 2018-11-10 at 15:51 +0100, Stefan Wahren wrote: > Adopt the SPDX license identifier headers to ease license compliance > management. While we are at this fix the comment style, too. > > Cc: Lubomir Rintel > Signed-off-by: Stefan Wahren > --- > drivers/char/hw_random/bcm2835-rng.c | 7 ++- > 1 file changed, 2 insertions(+), 5 deletions(-) > > diff --git a/drivers/char/hw_random/bcm2835-rng.c > b/drivers/char/hw_random/bcm2835-rng.c > index 6767d96..256b0b1 100644 > --- a/drivers/char/hw_random/bcm2835-rng.c > +++ b/drivers/char/hw_random/bcm2835-rng.c > @@ -1,10 +1,7 @@ > -/** > +// SPDX-License-Identifier: GPL-2.0 > +/* > * Copyright (c) 2010-2012 Broadcom. All rights reserved. > * Copyright (c) 2013 Lubomir Rintel > - * > - * This program is free software; you can redistribute it and/or > - * modify it under the terms of the GNU General Public License > ("GPL") > - * version 2, as published by the Free Software Foundation. > */ > > #include Acked-by: Lubomir Rintel
[PATCH 6/6] crypto: x86/chacha20 - Add a 4-block AVX2 variant
This variant builds upon the idea of the 2-block AVX2 variant that shuffles words after each round. The shuffling has a rather high latency, so the arithmetic units are not optimally used. Given that we have plenty of registers in AVX, this version parallelizes the 2-block variant to do four blocks. While the first two blocks are shuffling, the CPU can do the XORing on the second two blocks and vice-versa, which makes this version much faster than the SSSE3 variant for four blocks. The latter is now mostly for systems that do not have AVX2, but there it is the work-horse, so we keep it in place. The partial XORing function trailer is very similar to the AVX2 2-block variant. While it could be shared, that code segment is rather short; profiling is also easier with the trailer integrated, so we keep it per function. Signed-off-by: Martin Willi --- arch/x86/crypto/chacha20-avx2-x86_64.S | 310 + arch/x86/crypto/chacha20_glue.c| 7 + 2 files changed, 317 insertions(+) diff --git a/arch/x86/crypto/chacha20-avx2-x86_64.S b/arch/x86/crypto/chacha20-avx2-x86_64.S index 8247076b0ba7..b6ab082be657 100644 --- a/arch/x86/crypto/chacha20-avx2-x86_64.S +++ b/arch/x86/crypto/chacha20-avx2-x86_64.S @@ -31,6 +31,11 @@ CTRINC: .octa 0x00000003000000020000000100000000 CTR2BL:.octa 0x00000000000000000000000000000000 .octa 0x00000000000000000000000000000001 +.section .rodata.cst32.CTR4BL, "aM", @progbits, 32 +.align 32 +CTR4BL:.octa 0x00000000000000000000000000000002 + .octa 0x00000000000000000000000000000003 + .text ENTRY(chacha20_2block_xor_avx2) @@ -225,6 +230,311 @@ ENTRY(chacha20_2block_xor_avx2) ENDPROC(chacha20_2block_xor_avx2) +ENTRY(chacha20_4block_xor_avx2) + # %rdi: Input state matrix, s + # %rsi: up to 4 data blocks output, o + # %rdx: up to 4 data blocks input, i + # %rcx: input/output length in bytes + + # This function encrypts four ChaCha20 blocks by loading the state + # matrix four times across eight AVX registers. It performs matrix + # operations on four words in two matrices in parallel, sequentially + # to the operations on the four words of the other two matrices. The + # required word shuffling has a rather high latency, but we can do the + # arithmetic on two matrix-pairs without much slowdown.
+ + vzeroupper + + # x0..3[0-4] = s0..3 + vbroadcasti128 0x00(%rdi),%ymm0 + vbroadcasti128 0x10(%rdi),%ymm1 + vbroadcasti128 0x20(%rdi),%ymm2 + vbroadcasti128 0x30(%rdi),%ymm3 + + vmovdqa %ymm0,%ymm4 + vmovdqa %ymm1,%ymm5 + vmovdqa %ymm2,%ymm6 + vmovdqa %ymm3,%ymm7 + + vpaddd CTR2BL(%rip),%ymm3,%ymm3 + vpaddd CTR4BL(%rip),%ymm7,%ymm7 + + vmovdqa %ymm0,%ymm11 + vmovdqa %ymm1,%ymm12 + vmovdqa %ymm2,%ymm13 + vmovdqa %ymm3,%ymm14 + vmovdqa %ymm7,%ymm15 + + vmovdqa ROT8(%rip),%ymm8 + vmovdqa ROT16(%rip),%ymm9 + + mov %rcx,%rax + mov $10,%ecx + +.Ldoubleround4: + + # x0 += x1, x3 = rotl32(x3 ^ x0, 16) + vpaddd %ymm1,%ymm0,%ymm0 + vpxor %ymm0,%ymm3,%ymm3 + vpshufb %ymm9,%ymm3,%ymm3 + + vpaddd %ymm5,%ymm4,%ymm4 + vpxor %ymm4,%ymm7,%ymm7 + vpshufb %ymm9,%ymm7,%ymm7 + + # x2 += x3, x1 = rotl32(x1 ^ x2, 12) + vpaddd %ymm3,%ymm2,%ymm2 + vpxor %ymm2,%ymm1,%ymm1 + vmovdqa %ymm1,%ymm10 + vpslld $12,%ymm10,%ymm10 + vpsrld $20,%ymm1,%ymm1 + vpor%ymm10,%ymm1,%ymm1 + + vpaddd %ymm7,%ymm6,%ymm6 + vpxor %ymm6,%ymm5,%ymm5 + vmovdqa %ymm5,%ymm10 + vpslld $12,%ymm10,%ymm10 + vpsrld $20,%ymm5,%ymm5 + vpor%ymm10,%ymm5,%ymm5 + + # x0 += x1, x3 = rotl32(x3 ^ x0, 8) + vpaddd %ymm1,%ymm0,%ymm0 + vpxor %ymm0,%ymm3,%ymm3 + vpshufb %ymm8,%ymm3,%ymm3 + + vpaddd %ymm5,%ymm4,%ymm4 + vpxor %ymm4,%ymm7,%ymm7 + vpshufb %ymm8,%ymm7,%ymm7 + + # x2 += x3, x1 = rotl32(x1 ^ x2, 7) + vpaddd %ymm3,%ymm2,%ymm2 + vpxor %ymm2,%ymm1,%ymm1 + vmovdqa %ymm1,%ymm10 + vpslld $7,%ymm10,%ymm10 + vpsrld $25,%ymm1,%ymm1 + vpor%ymm10,%ymm1,%ymm1 + + vpaddd %ymm7,%ymm6,%ymm6 + vpxor %ymm6,%ymm5,%ymm5 + vmovdqa %ymm5,%ymm10 + vpslld $7,%ymm10,%ymm10 + vpsrld $25,%ymm5,%ymm5 + vpor%ymm10,%ymm5,%ymm5 + + # x1 = shuffle32(x1, MASK(0, 3, 2, 1)) + vpshufd $0x39,%ymm1,%ymm1 + vpshufd
[PATCH 3/6] crypto: x86/chacha20 - Support partial lengths in 8-block AVX2 variant
Add a length argument to the eight block function for AVX2, so the block function may XOR only a partial length of eight blocks. To avoid unnecessary operations, we integrate XORing of the first four blocks in the final lane interleaving; this also avoids some work in the partial lengths path. Signed-off-by: Martin Willi --- arch/x86/crypto/chacha20-avx2-x86_64.S | 189 + arch/x86/crypto/chacha20_glue.c| 5 +- 2 files changed, 133 insertions(+), 61 deletions(-) diff --git a/arch/x86/crypto/chacha20-avx2-x86_64.S b/arch/x86/crypto/chacha20-avx2-x86_64.S index f3cd26f48332..7b62d55bee3d 100644 --- a/arch/x86/crypto/chacha20-avx2-x86_64.S +++ b/arch/x86/crypto/chacha20-avx2-x86_64.S @@ -30,8 +30,9 @@ CTRINC: .octa 0x00000003000000020000000100000000 ENTRY(chacha20_8block_xor_avx2) # %rdi: Input state matrix, s - # %rsi: 8 data blocks output, o - # %rdx: 8 data blocks input, i + # %rsi: up to 8 data blocks output, o + # %rdx: up to 8 data blocks input, i + # %rcx: input/output length in bytes # This function encrypts eight consecutive ChaCha20 blocks by loading # the state matrix in AVX registers eight times. As we need some @@ -48,6 +49,7 @@ ENTRY(chacha20_8block_xor_avx2) lea 8(%rsp),%r10 and $~31, %rsp sub $0x80, %rsp + mov %rcx,%rax # x0..15[0-7] = s[0..15] vpbroadcastd 0x00(%rdi),%ymm0 @@ -375,74 +377,143 @@ ENTRY(chacha20_8block_xor_avx2) vpunpckhqdq %ymm15,%ymm0,%ymm15 # interleave 128-bit words in state n, n+4 - vmovdqa 0x00(%rsp),%ymm0 - vperm2i128 $0x20,%ymm4,%ymm0,%ymm1 - vperm2i128 $0x31,%ymm4,%ymm0,%ymm4 - vmovdqa %ymm1,0x00(%rsp) - vmovdqa 0x20(%rsp),%ymm0 - vperm2i128 $0x20,%ymm5,%ymm0,%ymm1 - vperm2i128 $0x31,%ymm5,%ymm0,%ymm5 - vmovdqa %ymm1,0x20(%rsp) - vmovdqa 0x40(%rsp),%ymm0 - vperm2i128 $0x20,%ymm6,%ymm0,%ymm1 - vperm2i128 $0x31,%ymm6,%ymm0,%ymm6 - vmovdqa %ymm1,0x40(%rsp) - vmovdqa 0x60(%rsp),%ymm0 - vperm2i128 $0x20,%ymm7,%ymm0,%ymm1 - vperm2i128 $0x31,%ymm7,%ymm0,%ymm7 - vmovdqa %ymm1,0x60(%rsp) + # xor/write first four blocks + vmovdqa 0x00(%rsp),%ymm1 + vperm2i128 $0x20,%ymm4,%ymm1,%ymm0 + cmp $0x0020,%rax + jl .Lxorpart8 + vpxor 0x0000(%rdx),%ymm0,%ymm0 + vmovdqu %ymm0,0x0000(%rsi) + vperm2i128 $0x31,%ymm4,%ymm1,%ymm4 + vperm2i128 $0x20,%ymm12,%ymm8,%ymm0 + cmp $0x0040,%rax + jl .Lxorpart8 + vpxor 0x0020(%rdx),%ymm0,%ymm0 + vmovdqu %ymm0,0x0020(%rsi) vperm2i128 $0x31,%ymm12,%ymm8,%ymm12 - vmovdqa %ymm0,%ymm8 - vperm2i128 $0x20,%ymm13,%ymm9,%ymm0 - vperm2i128 $0x31,%ymm13,%ymm9,%ymm13 - vmovdqa %ymm0,%ymm9 + + vmovdqa 0x40(%rsp),%ymm1 + vperm2i128 $0x20,%ymm6,%ymm1,%ymm0 + cmp $0x0060,%rax + jl .Lxorpart8 + vpxor 0x0040(%rdx),%ymm0,%ymm0 + vmovdqu %ymm0,0x0040(%rsi) + vperm2i128 $0x31,%ymm6,%ymm1,%ymm6 + vperm2i128 $0x20,%ymm14,%ymm10,%ymm0 + cmp $0x0080,%rax + jl .Lxorpart8 + vpxor 0x0060(%rdx),%ymm0,%ymm0 + vmovdqu %ymm0,0x0060(%rsi) vperm2i128 $0x31,%ymm14,%ymm10,%ymm14 - vmovdqa %ymm0,%ymm10 - vperm2i128 $0x20,%ymm15,%ymm11,%ymm0 - vperm2i128 $0x31,%ymm15,%ymm11,%ymm15 - vmovdqa %ymm0,%ymm11 - # xor with corresponding input, write to output - vmovdqa 0x00(%rsp),%ymm0 - vpxor 0x0000(%rdx),%ymm0,%ymm0 - vmovdqu %ymm0,0x0000(%rsi) - vmovdqa 0x20(%rsp),%ymm0 + vmovdqa 0x20(%rsp),%ymm1 + vperm2i128 $0x20,%ymm5,%ymm1,%ymm0 + cmp $0x00a0,%rax + jl .Lxorpart8 vpxor 0x0080(%rdx),%ymm0,%ymm0 vmovdqu %ymm0,0x0080(%rsi) - vmovdqa 0x40(%rsp),%ymm0 - vpxor 0x0040(%rdx),%ymm0,%ymm0 - vmovdqu %ymm0,0x0040(%rsi) - vmovdqa 0x60(%rsp),%ymm0 + vperm2i128 $0x31,%ymm5,%ymm1,%ymm5 + + vperm2i128 $0x20,%ymm13,%ymm9,%ymm0 + cmp $0x00c0,%rax + jl .Lxorpart8 + vpxor 0x00a0(%rdx),%ymm0,%ymm0 + vmovdqu %ymm0,0x00a0(%rsi) +
vperm2i128 $0x31,%ymm13,%ymm9,%ymm13 + + vmovdqa 0x60(%rsp),%ymm1 + vperm2i128 $0x20,%ymm7,%ymm1,%ymm0 + cmp $0x00e0,%rax +
[PATCH 4/6] crypto: x86/chacha20 - Use larger block functions more aggressively
Now that all block functions support partial lengths, engage the wider block sizes more aggressively. This prevents using smaller block functions multiple times, where the next larger block function would have been faster. Signed-off-by: Martin Willi --- arch/x86/crypto/chacha20_glue.c | 39 - 1 file changed, 24 insertions(+), 15 deletions(-) diff --git a/arch/x86/crypto/chacha20_glue.c b/arch/x86/crypto/chacha20_glue.c index 882e8bf5965a..b541da71f11e 100644 --- a/arch/x86/crypto/chacha20_glue.c +++ b/arch/x86/crypto/chacha20_glue.c @@ -29,6 +29,12 @@ asmlinkage void chacha20_8block_xor_avx2(u32 *state, u8 *dst, const u8 *src, static bool chacha20_use_avx2; #endif +static unsigned int chacha20_advance(unsigned int len, unsigned int maxblocks) +{ + len = min(len, maxblocks * CHACHA20_BLOCK_SIZE); + return round_up(len, CHACHA20_BLOCK_SIZE) / CHACHA20_BLOCK_SIZE; +} + static void chacha20_dosimd(u32 *state, u8 *dst, const u8 *src, unsigned int bytes) { @@ -41,6 +47,11 @@ static void chacha20_dosimd(u32 *state, u8 *dst, const u8 *src, dst += CHACHA20_BLOCK_SIZE * 8; state[12] += 8; } + if (bytes > CHACHA20_BLOCK_SIZE * 4) { + chacha20_8block_xor_avx2(state, dst, src, bytes); + state[12] += chacha20_advance(bytes, 8); + return; + } } #endif while (bytes >= CHACHA20_BLOCK_SIZE * 4) { @@ -50,15 +61,14 @@ static void chacha20_dosimd(u32 *state, u8 *dst, const u8 *src, dst += CHACHA20_BLOCK_SIZE * 4; state[12] += 4; } - while (bytes >= CHACHA20_BLOCK_SIZE) { - chacha20_block_xor_ssse3(state, dst, src, bytes); - bytes -= CHACHA20_BLOCK_SIZE; - src += CHACHA20_BLOCK_SIZE; - dst += CHACHA20_BLOCK_SIZE; - state[12]++; + if (bytes > CHACHA20_BLOCK_SIZE) { + chacha20_4block_xor_ssse3(state, dst, src, bytes); + state[12] += chacha20_advance(bytes, 4); + return; } if (bytes) { chacha20_block_xor_ssse3(state, dst, src, bytes); + state[12]++; } } @@ -82,17 +92,16 @@ static int chacha20_simd(struct skcipher_request *req) kernel_fpu_begin(); - while (walk.nbytes >= CHACHA20_BLOCK_SIZE) { - chacha20_dosimd(state, walk.dst.virt.addr, walk.src.virt.addr, - rounddown(walk.nbytes, CHACHA20_BLOCK_SIZE)); - err = skcipher_walk_done(&walk, -walk.nbytes % CHACHA20_BLOCK_SIZE); - } + while (walk.nbytes > 0) { + unsigned int nbytes = walk.nbytes; + + if (nbytes < walk.total) + nbytes = round_down(nbytes, walk.stride); - if (walk.nbytes) { chacha20_dosimd(state, walk.dst.virt.addr, walk.src.virt.addr, - walk.nbytes); - err = skcipher_walk_done(&walk, 0); + nbytes); + + err = skcipher_walk_done(&walk, walk.nbytes - nbytes); } kernel_fpu_end(); -- 2.17.1
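The counter bookkeeping above is worth a worked example. chacha20_advance() is quoted verbatim from the patch; the sample values are illustrative:

        static unsigned int chacha20_advance(unsigned int len, unsigned int maxblocks)
        {
                len = min(len, maxblocks * CHACHA20_BLOCK_SIZE);
                return round_up(len, CHACHA20_BLOCK_SIZE) / CHACHA20_BLOCK_SIZE;
        }

        /* With CHACHA20_BLOCK_SIZE == 64:
         *      chacha20_advance(330, 8) -> min(330, 512) = 330 -> round_up = 384 -> 6
         *      chacha20_advance(700, 8) -> min(700, 512) = 512 -> round_up = 512 -> 8
         * so state[12] also advances past a trailing, partially XORed block.
         */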
[PATCH 1/6] crypto: x86/chacha20 - Support partial lengths in 1-block SSSE3 variant
Add a length argument to the single block function for SSSE3, so the block function may XOR only a partial length of the full block. Given that the setup code is rather cheap, the function does not process more than one block; this allows us to keep the block function selection in the C glue code. The required branching does not negatively affect performance for full block sizes. The partial XORing uses simple "rep movsb" to copy the data before and after doing XOR in SSE. This is rather efficient on modern processors; movsw can be slightly faster, but the additional complexity is probably not worth it. Signed-off-by: Martin Willi --- arch/x86/crypto/chacha20-ssse3-x86_64.S | 74 - arch/x86/crypto/chacha20_glue.c | 11 ++-- 2 files changed, 63 insertions(+), 22 deletions(-) diff --git a/arch/x86/crypto/chacha20-ssse3-x86_64.S b/arch/x86/crypto/chacha20-ssse3-x86_64.S index 512a2b500fd1..98d130b5e4ab 100644 --- a/arch/x86/crypto/chacha20-ssse3-x86_64.S +++ b/arch/x86/crypto/chacha20-ssse3-x86_64.S @@ -25,12 +25,13 @@ CTRINC: .octa 0x000300020001 ENTRY(chacha20_block_xor_ssse3) # %rdi: Input state matrix, s - # %rsi: 1 data block output, o - # %rdx: 1 data block input, i + # %rsi: up to 1 data block output, o + # %rdx: up to 1 data block input, i + # %rcx: input/output length in bytes # This function encrypts one ChaCha20 block by loading the state matrix # in four SSE registers. It performs matrix operation on four words in - # parallel, but requireds shuffling to rearrange the words after each + # parallel, but requires shuffling to rearrange the words after each # round. 8/16-bit word rotation is done with the slightly better # performing SSSE3 byte shuffling, 7/12-bit word rotation uses # traditional shift+OR. @@ -48,7 +49,8 @@ ENTRY(chacha20_block_xor_ssse3) movdqa ROT8(%rip),%xmm4 movdqa ROT16(%rip),%xmm5 - mov $10,%ecx + mov %rcx,%rax + mov $10,%ecx .Ldoubleround: @@ -122,27 +124,69 @@ ENTRY(chacha20_block_xor_ssse3) jnz .Ldoubleround # o0 = i0 ^ (x0 + s0) - movdqu 0x00(%rdx),%xmm4 paddd %xmm8,%xmm0 + cmp $0x10,%rax + jl .Lxorpart + movdqu 0x00(%rdx),%xmm4 pxor%xmm4,%xmm0 movdqu %xmm0,0x00(%rsi) # o1 = i1 ^ (x1 + s1) - movdqu 0x10(%rdx),%xmm5 paddd %xmm9,%xmm1 - pxor%xmm5,%xmm1 - movdqu %xmm1,0x10(%rsi) + movdqa %xmm1,%xmm0 + cmp $0x20,%rax + jl .Lxorpart + movdqu 0x10(%rdx),%xmm0 + pxor%xmm1,%xmm0 + movdqu %xmm0,0x10(%rsi) # o2 = i2 ^ (x2 + s2) - movdqu 0x20(%rdx),%xmm6 paddd %xmm10,%xmm2 - pxor%xmm6,%xmm2 - movdqu %xmm2,0x20(%rsi) + movdqa %xmm2,%xmm0 + cmp $0x30,%rax + jl .Lxorpart + movdqu 0x20(%rdx),%xmm0 + pxor%xmm2,%xmm0 + movdqu %xmm0,0x20(%rsi) # o3 = i3 ^ (x3 + s3) - movdqu 0x30(%rdx),%xmm7 paddd %xmm11,%xmm3 - pxor%xmm7,%xmm3 - movdqu %xmm3,0x30(%rsi) - + movdqa %xmm3,%xmm0 + cmp $0x40,%rax + jl .Lxorpart + movdqu 0x30(%rdx),%xmm0 + pxor%xmm3,%xmm0 + movdqu %xmm0,0x30(%rsi) + +.Ldone: ret + +.Lxorpart: + # xor remaining bytes from partial register into output + mov %rax,%r9 + and $0x0f,%r9 + jz .Ldone + and $~0x0f,%rax + + mov %rsi,%r11 + + lea 8(%rsp),%r10 + sub $0x10,%rsp + and $~31,%rsp + + lea (%rdx,%rax),%rsi + mov %rsp,%rdi + mov %r9,%rcx + rep movsb + + pxor0x00(%rsp),%xmm0 + movdqa %xmm0,0x00(%rsp) + + mov %rsp,%rsi + lea (%r11,%rax),%rdi + mov %r9,%rcx + rep movsb + + lea -8(%r10),%rsp + jmp .Ldone + ENDPROC(chacha20_block_xor_ssse3) ENTRY(chacha20_4block_xor_ssse3) diff --git a/arch/x86/crypto/chacha20_glue.c b/arch/x86/crypto/chacha20_glue.c index dce7c5d39c2f..cc4571736ce8 100644 --- a/arch/x86/crypto/chacha20_glue.c +++ b/arch/x86/crypto/chacha20_glue.c @@ -19,7 
+19,8 @@ #define CHACHA20_STATE_ALIGN 16 -asmlinkage void chacha20_block_xor_ssse3(u32 *state, u8 *dst, const u8 *src); +asmlinkage void chacha20_block_xor_ssse3(u32
[PATCH 2/6] crypto: x86/chacha20 - Support partial lengths in 4-block SSSE3 variant
Add a length argument to the quad block function for SSSE3, so the block function may XOR only a partial length of four blocks. As we already have the stack set up, the partial XORing does not need to. This gives a slightly different function trailer, so we keep that separate from the 1-block function. Signed-off-by: Martin Willi --- arch/x86/crypto/chacha20-ssse3-x86_64.S | 163 ++-- arch/x86/crypto/chacha20_glue.c | 5 +- 2 files changed, 128 insertions(+), 40 deletions(-) diff --git a/arch/x86/crypto/chacha20-ssse3-x86_64.S b/arch/x86/crypto/chacha20-ssse3-x86_64.S index 98d130b5e4ab..d8ac75bb448f 100644 --- a/arch/x86/crypto/chacha20-ssse3-x86_64.S +++ b/arch/x86/crypto/chacha20-ssse3-x86_64.S @@ -191,8 +191,9 @@ ENDPROC(chacha20_block_xor_ssse3) ENTRY(chacha20_4block_xor_ssse3) # %rdi: Input state matrix, s - # %rsi: 4 data blocks output, o - # %rdx: 4 data blocks input, i + # %rsi: up to 4 data blocks output, o + # %rdx: up to 4 data blocks input, i + # %rcx: input/output length in bytes # This function encrypts four consecutive ChaCha20 blocks by loading the # the state matrix in SSE registers four times. As we need some scratch @@ -207,6 +208,7 @@ ENTRY(chacha20_4block_xor_ssse3) lea 8(%rsp),%r10 sub $0x80,%rsp and $~63,%rsp + mov %rcx,%rax # x0..15[0-3] = s0..3[0..3] movq0x00(%rdi),%xmm1 @@ -617,58 +619,143 @@ ENTRY(chacha20_4block_xor_ssse3) # xor with corresponding input, write to output movdqa 0x00(%rsp),%xmm0 + cmp $0x10,%rax + jl .Lxorpart4 movdqu 0x00(%rdx),%xmm1 pxor%xmm1,%xmm0 movdqu %xmm0,0x00(%rsi) - movdqa 0x10(%rsp),%xmm0 - movdqu 0x80(%rdx),%xmm1 + + movdqu %xmm4,%xmm0 + cmp $0x20,%rax + jl .Lxorpart4 + movdqu 0x10(%rdx),%xmm1 pxor%xmm1,%xmm0 - movdqu %xmm0,0x80(%rsi) + movdqu %xmm0,0x10(%rsi) + + movdqu %xmm8,%xmm0 + cmp $0x30,%rax + jl .Lxorpart4 + movdqu 0x20(%rdx),%xmm1 + pxor%xmm1,%xmm0 + movdqu %xmm0,0x20(%rsi) + + movdqu %xmm12,%xmm0 + cmp $0x40,%rax + jl .Lxorpart4 + movdqu 0x30(%rdx),%xmm1 + pxor%xmm1,%xmm0 + movdqu %xmm0,0x30(%rsi) + movdqa 0x20(%rsp),%xmm0 + cmp $0x50,%rax + jl .Lxorpart4 movdqu 0x40(%rdx),%xmm1 pxor%xmm1,%xmm0 movdqu %xmm0,0x40(%rsi) + + movdqu %xmm6,%xmm0 + cmp $0x60,%rax + jl .Lxorpart4 + movdqu 0x50(%rdx),%xmm1 + pxor%xmm1,%xmm0 + movdqu %xmm0,0x50(%rsi) + + movdqu %xmm10,%xmm0 + cmp $0x70,%rax + jl .Lxorpart4 + movdqu 0x60(%rdx),%xmm1 + pxor%xmm1,%xmm0 + movdqu %xmm0,0x60(%rsi) + + movdqu %xmm14,%xmm0 + cmp $0x80,%rax + jl .Lxorpart4 + movdqu 0x70(%rdx),%xmm1 + pxor%xmm1,%xmm0 + movdqu %xmm0,0x70(%rsi) + + movdqa 0x10(%rsp),%xmm0 + cmp $0x90,%rax + jl .Lxorpart4 + movdqu 0x80(%rdx),%xmm1 + pxor%xmm1,%xmm0 + movdqu %xmm0,0x80(%rsi) + + movdqu %xmm5,%xmm0 + cmp $0xa0,%rax + jl .Lxorpart4 + movdqu 0x90(%rdx),%xmm1 + pxor%xmm1,%xmm0 + movdqu %xmm0,0x90(%rsi) + + movdqu %xmm9,%xmm0 + cmp $0xb0,%rax + jl .Lxorpart4 + movdqu 0xa0(%rdx),%xmm1 + pxor%xmm1,%xmm0 + movdqu %xmm0,0xa0(%rsi) + + movdqu %xmm13,%xmm0 + cmp $0xc0,%rax + jl .Lxorpart4 + movdqu 0xb0(%rdx),%xmm1 + pxor%xmm1,%xmm0 + movdqu %xmm0,0xb0(%rsi) + movdqa 0x30(%rsp),%xmm0 + cmp $0xd0,%rax + jl .Lxorpart4 movdqu 0xc0(%rdx),%xmm1 pxor%xmm1,%xmm0 movdqu %xmm0,0xc0(%rsi) - movdqu 0x10(%rdx),%xmm1 - pxor%xmm1,%xmm4 - movdqu %xmm4,0x10(%rsi) - movdqu 0x90(%rdx),%xmm1 - pxor%xmm1,%xmm5 - movdqu %xmm5,0x90(%rsi) - movdqu 0x50(%rdx),%xmm1 -
[PATCH 0/6] crypto: x86/chacha20 - SIMD performance improvements
This patchset improves performance of the ChaCha20 SIMD implementations for x86_64. For some specific encryption lengths, performance is more than doubled. Two mechanisms are used to achieve this: * Instead of calculating the minimal number of required blocks for a given encryption length, functions producing more blocks are used more aggressively. Calculating a 4-block function can be faster than calculating a 2-block and a 1-block function, even if only three blocks are actually required. * In addition to the 8-block AVX2 function, a 4-block and a 2-block function are introduced. Patches 1-3 add support for partial lengths to the existing 1-, 4- and 8-block functions. Patch 4 makes use of that by engaging the next higher level block functions more aggressively. Patch 5 and 6 add the new AVX2 functions for 2 and 4 blocks. Patches are based on cryptodev and would need adjustments to apply on top of the Adiantum patchset. Note that the more aggressive use of larger block functions calculate blocks that may get discarded. This may have a negative impact on energy usage or the processors thermal budget. However, with the new block functions we can avoid this over-calculation for many lengths, so the performance win can be considered more important. Below are performance numbers measured with tcrypt using additional encryption lengths; numbers in kOps/s, on my i7-5557U. old is the existing, new the implementation with this patchset. As comparison the numbers for zinc in v6: len old new zinc 8 5908 5818 5818 16 5917 5828 5726 24 5916 5869 5757 32 5920 5789 5813 40 5868 5799 5710 48 5877 5761 5761 56 5869 5797 5742 64 5897 5862 5685 72 3381 4979 3520 80 3364 5541 3475 88 3350 4977 3424 96 3342 5530 3371 104 3328 4923 3313 112 3317 5528 3207 120 3313 4970 3150 128 3492 5535 3568 136 2487 4570 3690 144 2481 5047 3599 152 2473 4565 3566 160 2459 5022 3515 168 2461 4550 3437 176 2454 5020 3325 184 2449 4535 3279 192 2538 5011 3762 200 1962 4537 3702 208 1962 4971 3622 216 1954 4487 3518 224 1949 4936 3445 232 1948 4497 3422 240 1941 4947 3317 248 1940 4481 3279 256 3798 4964 3723 264 2638 3577 3639 272 2637 3567 3597 280 2628 3563 3565 288 2630 3795 3484 296 2621 3580 3422 304 2612 3569 3352 312 2602 3599 3308 320 2694 3821 3694 328 2060 3538 3681 336 2054 3565 3599 344 2054 3553 3523 352 2049 3809 3419 360 2045 3575 3403 368 2035 3560 3334 376 2036 3555 3257 384 2092 3785 3715 392 1691 3505 3612 400 1684 3527 3553 408 1686 3527 3496 416 1684 3804 3430 424 1681 3555 3402 432 1675 3559 3311 440 1672 3558 3275 448 1710 3780 3689 456 1431 3541 3618 464 1428 3538 3576 472 1430 3527 3509 480 1426 3788 3405 488 1423 3502 3397 496 1423 3519 3298 504 1418 3519 3277 512 3694 3736 3735 520 2601 2571 2209 528 2601 2677 2148 536 2587 2534 2164 544 2578 2659 2138 552 2570 2552 2126 560 2566 2661 2035 568 2567 2542 2041 576 2639 2674 2199 584 2031 2531 2183 592 2027 2660 2145 600 2016 2513 2155 608 2009 2638 2133 616 2006 2522 2115 624 2000 2649 2064 632 1996 2518 2045 640 2053 2651 2188 648 1666 2402 2182 656 1663 2517 2158 664 1659 2397 2147 672 1657 2510 2139 680 1656 2394 2114 688 1653 2497 2077 696 1646 2393 2043 704 1678 2510 2208 712 1414 2391 2189 720 1412 2506 2169 728 1411 2384 2145 736 1408 2494 2142 744 1408 2379 2081 752 1405 2485 2064 760 1403 2376 2043 768 2189 2498 2211 776 1756 2137 2192 784 1746 2145 2146 792 1744 2141 2141 800 1743 2094 808 1742 2140 2100 816 1735 2134 2061 824 1731 2135 2045 832 1778 2223 840 1480 2132 2184 848 1480 2134 2173 856 1476 2124 2145 864 1474 2210 2126 872 1472 2127 
2105 880 1463 2123 2056 888 1468 2123 2043 896 1494 2208 2219 904 1278 2120 2192 912 1277 2121 2170 920 1273 2118 2149 928 1272 2207 2125 936 1267 2125 2098 944 1265 2127 2060 952 1267 2126 2049 960 1289 2213 2204 968 1125 2123 2187 976 1122 2127 2166 984 1120 2123 2136 992 1118 2207 2119 1000 1118 2120 2101 1008 1117 2122 2042 1016 1115 2121 2048 1024 2174 2191 2195 1032 1748 1724 1565 1040 1745 1782 1544 1048 1736 1737 1554 1056 1738 1802 1541 1064 1735 1728 1523 1072 1730 1780 1507 1080 1729 1724 1497 1088 1757 1783 1592 1096 1475 1723 1575 1104 1474 1778 1563 1112 1472 1708 1544 1120 1468 1774 1521 1128 1466 1718 1521 1136 1462 1780 1501 1144 1460 1719 1491 1152 1481 1782 1575 1160 1271 1647 1558 1168 1271 1706 1554 1176 1268 1645 1545 1184 1265 1711 1538 1192 1265 1648 1530 1200 1264 1705 1493 1208 1262 1647 1498 1216 1277 1695 1581 1224 1120 1642 1563 1232 1115 1702 1549 1240 1121 1646 1538 1248 1119 1703 1527 1256 1115 1640 1520 1264 1114 1693 1505 1272 1112 1642 1492 1280 1552 1699 1574 1288 1314 1525 1573 1296 1315 1522 1551 1304 1312 1521 1548 1312 1311 1564 1535 1320 1309 1518 1524 1328 1302 1527 1508 1336 1303 1521 1500 1344 1333 1561 1579 1352 1157 1524 1573 1360 1152 1520 1546 1368 1154 1522 1545 1376 1153 1562 1536 1384 1151 1525 1526 1392
[PATCH 5/6] crypto: x86/chacha20 - Add a 2-block AVX2 variant
This variant uses the same principle as the single block SSSE3 variant by shuffling the state matrix after each round. With the wider AVX registers, we can do two blocks in parallel, though. This function can increase performance and efficiency significantly for lengths that would otherwise require a 4-block function. Signed-off-by: Martin Willi --- arch/x86/crypto/chacha20-avx2-x86_64.S | 197 + arch/x86/crypto/chacha20_glue.c| 7 + 2 files changed, 204 insertions(+) diff --git a/arch/x86/crypto/chacha20-avx2-x86_64.S b/arch/x86/crypto/chacha20-avx2-x86_64.S index 7b62d55bee3d..8247076b0ba7 100644 --- a/arch/x86/crypto/chacha20-avx2-x86_64.S +++ b/arch/x86/crypto/chacha20-avx2-x86_64.S @@ -26,8 +26,205 @@ ROT16: .octa 0x0d0c0f0e09080b0a0504070601000302 CTRINC:.octa 0x00000003000000020000000100000000 .octa 0x00000007000000060000000500000004 +.section .rodata.cst32.CTR2BL, "aM", @progbits, 32 +.align 32 +CTR2BL:.octa 0x00000000000000000000000000000000 + .octa 0x00000000000000000000000000000001 + .text +ENTRY(chacha20_2block_xor_avx2) + # %rdi: Input state matrix, s + # %rsi: up to 2 data blocks output, o + # %rdx: up to 2 data blocks input, i + # %rcx: input/output length in bytes + + # This function encrypts two ChaCha20 blocks by loading the state + # matrix twice across four AVX registers. It performs matrix operations + # on four words in each matrix in parallel, but requires shuffling to + # rearrange the words after each round. + + vzeroupper + + # x0..3[0-2] = s0..3 + vbroadcasti128 0x00(%rdi),%ymm0 + vbroadcasti128 0x10(%rdi),%ymm1 + vbroadcasti128 0x20(%rdi),%ymm2 + vbroadcasti128 0x30(%rdi),%ymm3 + + vpaddd CTR2BL(%rip),%ymm3,%ymm3 + + vmovdqa %ymm0,%ymm8 + vmovdqa %ymm1,%ymm9 + vmovdqa %ymm2,%ymm10 + vmovdqa %ymm3,%ymm11 + + vmovdqa ROT8(%rip),%ymm4 + vmovdqa ROT16(%rip),%ymm5 + + mov %rcx,%rax + mov $10,%ecx + +.Ldoubleround: + + # x0 += x1, x3 = rotl32(x3 ^ x0, 16) + vpaddd %ymm1,%ymm0,%ymm0 + vpxor %ymm0,%ymm3,%ymm3 + vpshufb %ymm5,%ymm3,%ymm3 + + # x2 += x3, x1 = rotl32(x1 ^ x2, 12) + vpaddd %ymm3,%ymm2,%ymm2 + vpxor %ymm2,%ymm1,%ymm1 + vmovdqa %ymm1,%ymm6 + vpslld $12,%ymm6,%ymm6 + vpsrld $20,%ymm1,%ymm1 + vpor%ymm6,%ymm1,%ymm1 + + # x0 += x1, x3 = rotl32(x3 ^ x0, 8) + vpaddd %ymm1,%ymm0,%ymm0 + vpxor %ymm0,%ymm3,%ymm3 + vpshufb %ymm4,%ymm3,%ymm3 + + # x2 += x3, x1 = rotl32(x1 ^ x2, 7) + vpaddd %ymm3,%ymm2,%ymm2 + vpxor %ymm2,%ymm1,%ymm1 + vmovdqa %ymm1,%ymm7 + vpslld $7,%ymm7,%ymm7 + vpsrld $25,%ymm1,%ymm1 + vpor%ymm7,%ymm1,%ymm1 + + # x1 = shuffle32(x1, MASK(0, 3, 2, 1)) + vpshufd $0x39,%ymm1,%ymm1 + # x2 = shuffle32(x2, MASK(1, 0, 3, 2)) + vpshufd $0x4e,%ymm2,%ymm2 + # x3 = shuffle32(x3, MASK(2, 1, 0, 3)) + vpshufd $0x93,%ymm3,%ymm3 + + # x0 += x1, x3 = rotl32(x3 ^ x0, 16) + vpaddd %ymm1,%ymm0,%ymm0 + vpxor %ymm0,%ymm3,%ymm3 + vpshufb %ymm5,%ymm3,%ymm3 + + # x2 += x3, x1 = rotl32(x1 ^ x2, 12) + vpaddd %ymm3,%ymm2,%ymm2 + vpxor %ymm2,%ymm1,%ymm1 + vmovdqa %ymm1,%ymm6 + vpslld $12,%ymm6,%ymm6 + vpsrld $20,%ymm1,%ymm1 + vpor%ymm6,%ymm1,%ymm1 + + # x0 += x1, x3 = rotl32(x3 ^ x0, 8) + vpaddd %ymm1,%ymm0,%ymm0 + vpxor %ymm0,%ymm3,%ymm3 + vpshufb %ymm4,%ymm3,%ymm3 + + # x2 += x3, x1 = rotl32(x1 ^ x2, 7) + vpaddd %ymm3,%ymm2,%ymm2 + vpxor %ymm2,%ymm1,%ymm1 + vmovdqa %ymm1,%ymm7 + vpslld $7,%ymm7,%ymm7 + vpsrld $25,%ymm1,%ymm1 + vpor%ymm7,%ymm1,%ymm1 + + # x1 = shuffle32(x1, MASK(2, 1, 0, 3)) + vpshufd $0x93,%ymm1,%ymm1 + # x2 = shuffle32(x2, MASK(1, 0, 3, 2)) + vpshufd $0x4e,%ymm2,%ymm2 + # x3 = shuffle32(x3, MASK(0, 3, 2, 1)) + vpshufd $0x39,%ymm3,%ymm3 + + dec %ecx + jnz .Ldoubleround + + # o0 = i0 ^ (x0 + s0) + vpaddd %ymm8,%ymm0,%ymm7 + cmp $0x10,%rax + jl .Lxorpart2 +
vpxor 0x00(%rdx),%xmm7,%xmm6 + vmovdqu %xmm6,0x00(%rsi) + vextracti128$1,%ymm7,%xmm0 + # o1 = i1 ^ (x1 + s1) + vpaddd %ymm9,%ymm1,%ymm7 + cmp
Re: [PATCH 03/17] hw_random: bcm2835-rng: Switch to SPDX identifier
Stefan Wahren writes: > Adopt the SPDX license identifier headers to ease license compliance > management. While we are at this fix the comment style, too. Reviewed-by: Eric Anholt
Re: [PATCH 03/17] hw_random: bcm2835-rng: Switch to SPDX identifier
On Sat, Nov 10, 2018 at 03:51:16PM +0100, Stefan Wahren wrote: > Adopt the SPDX license identifier headers to ease license compliance > management. While we are at this fix the comment style, too. > > Cc: Lubomir Rintel > Signed-off-by: Stefan Wahren > --- > drivers/char/hw_random/bcm2835-rng.c | 7 ++- > 1 file changed, 2 insertions(+), 5 deletions(-) Acked-by: Greg Kroah-Hartman
[PATCH 03/17] hw_random: bcm2835-rng: Switch to SPDX identifier
Adopt the SPDX license identifier headers to ease license compliance management. While we are at this fix the comment style, too. Cc: Lubomir Rintel Signed-off-by: Stefan Wahren --- drivers/char/hw_random/bcm2835-rng.c | 7 ++- 1 file changed, 2 insertions(+), 5 deletions(-) diff --git a/drivers/char/hw_random/bcm2835-rng.c b/drivers/char/hw_random/bcm2835-rng.c index 6767d96..256b0b1 100644 --- a/drivers/char/hw_random/bcm2835-rng.c +++ b/drivers/char/hw_random/bcm2835-rng.c @@ -1,10 +1,7 @@ -/** +// SPDX-License-Identifier: GPL-2.0 +/* * Copyright (c) 2010-2012 Broadcom. All rights reserved. * Copyright (c) 2013 Lubomir Rintel - * - * This program is free software; you can redistribute it and/or - * modify it under the terms of the GNU General Public License ("GPL") - * version 2, as published by the Free Software Foundation. */ #include -- 2.7.4
How driver can mark the algo implementation Unavailable
Hi All, PCI-based devices can be shut down from the sysfs interface: echo "unbind" > /sys/bus/pci/drivers/cxgb4/unbind In case the device has an active transformation (tfm), the driver cannot unregister the algorithms because alg->cra_refcnt will be non-zero. Can a driver use the "CRYPTO_ALG_DEAD" flag to mark the algorithm unavailable, so that crypto_alg_lookup does not allocate a new tfm using the dead algo? Regards Harsh Jain
Re: [PATCH 1/2] crypto: fix cfb mode decryption
On Sat, Oct 20, 2018 at 02:01:52AM +0300, Dmitry Eremin-Solenikov wrote: > crypto_cfb_decrypt_segment() incorrectly XOR'ed generated keystream with > IV, rather than with data stream, resulting in incorrect decryption. > Test vectors will be added in the next patch. > > Signed-off-by: Dmitry Eremin-Solenikov > Cc: sta...@vger.kernel.org > --- > crypto/cfb.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) All applied. Thanks. -- Email: Herbert Xu Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
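For context, the decryption flow the fix restores, as an illustrative sketch (not the literal crypto/cfb.c code):

        /* CFB decrypts by XORing the ciphertext with E_k(iv), then feeding
         * the ciphertext forward as the next IV. The bug XORed the keystream
         * into the IV instead of into the data stream.
         */
        static void cfb_decrypt_one_block(struct crypto_cipher *tfm, u8 *dst,
                                          const u8 *src, u8 *iv, unsigned int bs)
        {
                u8 ks[MAX_CIPHER_BLOCKSIZE];

                crypto_cipher_encrypt_one(tfm, ks, iv); /* keystream = E_k(iv) */
                crypto_xor_cpy(dst, src, ks, bs);       /* plaintext = ct ^ keystream */
                memcpy(iv, src, bs);                    /* next IV = this ciphertext */
        }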
Re: [PATCH v3 0/2] crypto: some hardening against AES cache-timing attacks
On Wed, Oct 17, 2018 at 09:37:57PM -0700, Eric Biggers wrote: > This series makes the "aes-fixed-time" and "aes-arm" implementations of > AES more resistant to cache-timing attacks. > > Note that even after these changes, the implementations still aren't > necessarily guaranteed to be constant-time; see > https://cr.yp.to/antiforgery/cachetiming-20050414.pdf for a discussion > of the many difficulties involved in writing truly constant-time AES > software. But it's valuable to make such attacks more difficult. > > Changed since v2: > - In aes-arm, move the IRQ disable/enable into the assembly file. > - Other aes-arm tweaks. > - Add Kconfig help text. > > Thanks to Ard Biesheuvel for the suggestions. > > Eric Biggers (2): > crypto: aes_ti - disable interrupts while accessing S-box > crypto: arm/aes - add some hardening against cache-timing attacks > > arch/arm/crypto/Kconfig | 9 + > arch/arm/crypto/aes-cipher-core.S | 62 ++- > crypto/Kconfig| 3 +- > crypto/aes_generic.c | 9 +++-- > crypto/aes_ti.c | 18 + > 5 files changed, 86 insertions(+), 15 deletions(-) All applied. Thanks. -- Email: Herbert Xu Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
Re: [PATCH] crypto/simd: correctly take reqsize of wrapped skcipher into account
On 9 November 2018 at 10:45, Herbert Xu wrote: > On Fri, Nov 09, 2018 at 05:44:47PM +0800, Herbert Xu wrote: >> On Fri, Nov 09, 2018 at 12:33:23AM +0100, Ard Biesheuvel wrote: >> > >> > This should be >> > >> > reqsize += max(crypto_skcipher_reqsize(&cryptd_tfm->base), >> >crypto_skcipher_reqsize(cryptd_skcipher_child(cryptd_tfm))); >> > >> > since the cryptd path in simd still needs some space in the subreq for >> > the completion. >> >> OK this is what I applied to the cryptodev tree, please double-check >> to see if I did anything silly: > > I meant the crypto tree rather than cryptodev. > That looks fine. Thanks Herbert.
Re: [PATCH] crypto/simd: correctly take reqsize of wrapped skcipher into account
On Fri, Nov 09, 2018 at 05:44:47PM +0800, Herbert Xu wrote: > On Fri, Nov 09, 2018 at 12:33:23AM +0100, Ard Biesheuvel wrote: > > > > This should be > > > > reqsize += max(crypto_skcipher_reqsize(&cryptd_tfm->base), > >crypto_skcipher_reqsize(cryptd_skcipher_child(cryptd_tfm))); > > > > since the cryptd path in simd still needs some space in the subreq for > > the completion. > > OK this is what I applied to the cryptodev tree, please double-check > to see if I did anything silly: I meant the crypto tree rather than cryptodev. Cheers, -- Email: Herbert Xu Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
Re: [PATCH] crypto/simd: correctly take reqsize of wrapped skcipher into account
On Fri, Nov 09, 2018 at 12:33:23AM +0100, Ard Biesheuvel wrote: > > This should be > > reqsize += max(crypto_skcipher_reqsize(&cryptd_tfm->base), >crypto_skcipher_reqsize(cryptd_skcipher_child(cryptd_tfm))); > > since the cryptd path in simd still needs some space in the subreq for > the completion. OK this is what I applied to the cryptodev tree, please double-check to see if I did anything silly: diff --git a/crypto/simd.c b/crypto/simd.c index ea7240be3001..78e8d037ae2b 100644 --- a/crypto/simd.c +++ b/crypto/simd.c @@ -124,8 +124,9 @@ static int simd_skcipher_init(struct crypto_skcipher *tfm) ctx->cryptd_tfm = cryptd_tfm; - reqsize = sizeof(struct skcipher_request); - reqsize += crypto_skcipher_reqsize(&cryptd_tfm->base); + reqsize = crypto_skcipher_reqsize(cryptd_skcipher_child(cryptd_tfm)); + reqsize = max(reqsize, crypto_skcipher_reqsize(&cryptd_tfm->base)); + reqsize += sizeof(struct skcipher_request); crypto_skcipher_set_reqsize(tfm, reqsize); Thanks, -- Email: Herbert Xu Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
Re: .S_shipped unnecessary?
On Fri, Nov 9, 2018 at 8:42 AM Ard Biesheuvel wrote: > > (+ Masahiro, kbuild ml) > > On 8 November 2018 at 21:37, Jason A. Donenfeld wrote: > > Hi Ard, Eric, and others, > > > > As promised, the next Zinc patchset will have less generated code! After a > > bit of work with Andy and Samuel, I'll be bundling the perlasm. > > > > Wonderful! Any problems doing that for x86_64 ? > > > One thing I'm wondering about, though, is the wisdom behind the current > > .S_shipped pattern. Usually the _shipped is for big firmware blobs that are > > hard (or impossible) to build independently. But in this case, the .S is > > generated from the .pl significantly faster than gcc even compiles a basic > > C file. And, since perl is needed to build the kernel anyway, it's not like > > it will be impossible to find the right tools. Rather than clutter up > > commits > > with the .pl _and_ the .S_shipped, what would you think if I just generated > > the .S each time as an ordinary build artifact. AFAICT, this is fairly > > usual, > > and it's hard to see downsides. Hence, why I'm writing this email: are there > > any downsides to that? > > > > I agree 100%. When I added this the first time, it was at the request > of the ARM maintainer, who was reluctant to rely on Perl for some > reason. > > Recently, we have had to add a kludge to prevent spurious rebuilds of > the .S_shipped files as well. > > I'd be perfectly happy to get rid of this entirely, and always > generate the .S from the .pl, which to me is kind of the point of > carrying these files in the first place. > > Masahiro: do you see any problems with this? No problem. Documentation/process/changes.rst says the following: You will need perl 5 and the following modules: ``Getopt::Long``, ``Getopt::Std``, ``File::Basename``, and ``File::Find`` to build the kernel. We can assume perl is installed on the user's build machine. -- Best Regards Masahiro Yamada
Re: [PATCH] crypto/simd: correctly take reqsize of wrapped skcipher into account
> On Nov 8, 2018, at 6:33 PM, Ard Biesheuvel wrote: > > On 8 November 2018 at 23:55, Ard Biesheuvel wrote: >> The simd wrapper's skcipher request context structure consists >> of a single subrequest whose size is taken from the subordinate >> skcipher. However, in simd_skcipher_init(), the reqsize that is >> retrieved is not from the subordinate skcipher but from the >> cryptd request structure, whose size is completely unrelated to >> the actual wrapped skcipher. >> >> Reported-by: Qian Cai >> Signed-off-by: Ard Biesheuvel >> --- >> crypto/simd.c | 2 +- >> 1 file changed, 1 insertion(+), 1 deletion(-) >> >> diff --git a/crypto/simd.c b/crypto/simd.c >> index ea7240be3001..2f3d6e897afc 100644 >> --- a/crypto/simd.c >> +++ b/crypto/simd.c >> @@ -125,7 +125,7 @@ static int simd_skcipher_init(struct crypto_skcipher >> *tfm) >>ctx->cryptd_tfm = cryptd_tfm; >> >>reqsize = sizeof(struct skcipher_request); >> - reqsize += crypto_skcipher_reqsize(&cryptd_tfm->base); >> + reqsize += >> crypto_skcipher_reqsize(cryptd_skcipher_child(cryptd_tfm)); >> > > This should be > > reqsize += max(crypto_skcipher_reqsize(&cryptd_tfm->base), > crypto_skcipher_reqsize(cryptd_skcipher_child(cryptd_tfm))); > > since the cryptd path in simd still needs some space in the subreq for > the completion. Tested-by: Qian Cai
Re: .S_shipped unnecessary?
Hey Ard, On Fri, Nov 9, 2018 at 12:42 AM Ard Biesheuvel wrote: > Wonderful! Any problems doing that for x86_64 ? The x86_64 is still a WIP, but hopefully we'll succeed. > I agree 100%. When I added this the first time, it was at the request > of the ARM maintainer, who was reluctant to rely on Perl for some > reason. > > Recently, we have had to add a kludge to prevent spurious rebuilds of > the .S_shipped files as well. > > I'd be perfectly happy to get rid of this entirely, and always > generate the .S from the .pl, which to me is kind of the point of > carrying these files in the first place. Terrific. I'll move ahead in that direction then. It makes things _so_ much cleaner, and doesn't introduce new build modes ("should the generated _ship go into the build directory or the source directory? what kind of artifact is it? how to address $(srcdir) vs $(src) in that context? bla bla") that really over complicate things. Jason
Re: .S_shipped unnecessary?
(+ Masahiro, kbuild ml) On 8 November 2018 at 21:37, Jason A. Donenfeld wrote: > Hi Ard, Eric, and others, > > As promised, the next Zinc patchset will have less generated code! After a > bit of work with Andy and Samuel, I'll be bundling the perlasm. > Wonderful! Any problems doing that for x86_64 ? > One thing I'm wondering about, though, is the wisdom behind the current > .S_shipped pattern. Usually the _shipped is for big firmware blobs that are > hard (or impossible) to build independently. But in this case, the .S is > generated from the .pl significantly faster than gcc even compiles a basic > C file. And, since perl is needed to build the kernel anyway, it's not like > it will be impossible to find the right tools. Rather than clutter up commits > with the .pl _and_ the .S_shipped, what would you think if I just generated > the .S each time as an ordinary build artifact. AFAICT, this is fairly usual, > and it's hard to see downsides. Hence, why I'm writing this email: are there > any downsides to that? > I agree 100%. When I added this the first time, it was at the request of the ARM maintainer, who was reluctant to rely on Perl for some reason. Recently, we have had to add a kludge to prevent spurious rebuilds of the .S_shipped files as well. I'd be perfectly happy to get rid of this entirely, and always generate the .S from the .pl, which to me is kind of the point of carrying these files in the first place. Masahiro: do you see any problems with this?
Re: [PATCH] crypto/simd: correctly take reqsize of wrapped skcipher into account
On 8 November 2018 at 23:55, Ard Biesheuvel wrote: > The simd wrapper's skcipher request context structure consists > of a single subrequest whose size is taken from the subordinate > skcipher. However, in simd_skcipher_init(), the reqsize that is > retrieved is not from the subordinate skcipher but from the > cryptd request structure, whose size is completely unrelated to > the actual wrapped skcipher. > > Reported-by: Qian Cai > Signed-off-by: Ard Biesheuvel > --- > crypto/simd.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/crypto/simd.c b/crypto/simd.c > index ea7240be3001..2f3d6e897afc 100644 > --- a/crypto/simd.c > +++ b/crypto/simd.c > @@ -125,7 +125,7 @@ static int simd_skcipher_init(struct crypto_skcipher *tfm) > ctx->cryptd_tfm = cryptd_tfm; > > reqsize = sizeof(struct skcipher_request); > - reqsize += crypto_skcipher_reqsize(&cryptd_tfm->base); > + reqsize += crypto_skcipher_reqsize(cryptd_skcipher_child(cryptd_tfm)); > This should be reqsize += max(crypto_skcipher_reqsize(&cryptd_tfm->base), crypto_skcipher_reqsize(cryptd_skcipher_child(cryptd_tfm))); since the cryptd path in simd still needs some space in the subreq for the completion.
[PATCH] crypto/simd: correctly take reqsize of wrapped skcipher into account
The simd wrapper's skcipher request context structure consists of a single subrequest whose size is taken from the subordinate skcipher. However, in simd_skcipher_init(), the reqsize that is retrieved is not from the subordinate skcipher but from the cryptd request structure, whose size is completely unrelated to the actual wrapped skcipher. Reported-by: Qian Cai Signed-off-by: Ard Biesheuvel --- crypto/simd.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/crypto/simd.c b/crypto/simd.c index ea7240be3001..2f3d6e897afc 100644 --- a/crypto/simd.c +++ b/crypto/simd.c @@ -125,7 +125,7 @@ static int simd_skcipher_init(struct crypto_skcipher *tfm) ctx->cryptd_tfm = cryptd_tfm; reqsize = sizeof(struct skcipher_request); - reqsize += crypto_skcipher_reqsize(&cryptd_tfm->base); + reqsize += crypto_skcipher_reqsize(cryptd_skcipher_child(cryptd_tfm)); crypto_skcipher_set_reqsize(tfm, reqsize); -- 2.19.1
.S_shipped unnecessary?
Hi Ard, Eric, and others, As promised, the next Zinc patchset will have less generated code! After a bit of work with Andy and Samuel, I'll be bundling the perlasm. One thing I'm wondering about, though, is the wisdom behind the current .S_shipped pattern. Usually the _shipped is for big firmware blobs that are hard (or impossible) to build independently. But in this case, the .S is generated from the .pl significantly faster than gcc even compiles a basic C file. And, since perl is needed to build the kernel anyway, it's not like it will be impossible to find the right tools. Rather than clutter up commits with the .pl _and_ the .S_shipped, what would you think if I just generated the .S each time as an ordinary build artifact. AFAICT, this is fairly usual, and it's hard to see downsides. Hence, why I'm writing this email: are there any downsides to that? Thanks, Jason
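A hypothetical kbuild rule for what is being proposed here — generating the .S from the bundled perlasm as an ordinary build artifact. Rule and file names are illustrative, not from this thread:

        # Sketch only: generate the assembly from the perl script at build
        # time instead of shipping a pre-generated .S_shipped copy.
        quiet_cmd_perlasm = PERLASM $@
              cmd_perlasm = $(PERL) $< > $@

        $(obj)/chacha20-x86_64.S: $(src)/chacha20-x86_64.pl FORCE
                $(call if_changed,perlasm)

        targets += chacha20-x86_64.S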
[PATCH 5/5] crypto: caam/qi2 - add support for Chacha20 + Poly1305
Add support for Chacha20 + Poly1305 combined AEAD: -generic (rfc7539) -IPsec (rfc7634 - known as rfc7539esp in the kernel) Signed-off-by: Horia Geantă --- drivers/crypto/caam/caamalg.c | 4 +- drivers/crypto/caam/caamalg_desc.c | 24 ++- drivers/crypto/caam/caamalg_desc.h | 3 +- drivers/crypto/caam/caamalg_qi2.c | 129 - 4 files changed, 154 insertions(+), 6 deletions(-) diff --git a/drivers/crypto/caam/caamalg.c b/drivers/crypto/caam/caamalg.c index cbaeb264a261..523565ce0060 100644 --- a/drivers/crypto/caam/caamalg.c +++ b/drivers/crypto/caam/caamalg.c @@ -527,13 +527,13 @@ static int chachapoly_set_sh_desc(struct crypto_aead *aead) desc = ctx->sh_desc_enc; cnstr_shdsc_chachapoly(desc, &ctx->cdata, &ctx->adata, ivsize, - ctx->authsize, true); + ctx->authsize, true, false); dma_sync_single_for_device(jrdev, ctx->sh_desc_enc_dma, desc_bytes(desc), ctx->dir); desc = ctx->sh_desc_dec; cnstr_shdsc_chachapoly(desc, &ctx->cdata, &ctx->adata, ivsize, - ctx->authsize, false); + ctx->authsize, false, false); dma_sync_single_for_device(jrdev, ctx->sh_desc_dec_dma, desc_bytes(desc), ctx->dir); diff --git a/drivers/crypto/caam/caamalg_desc.c b/drivers/crypto/caam/caamalg_desc.c index 0eb2add7e4e2..7db1640d3577 100644 --- a/drivers/crypto/caam/caamalg_desc.c +++ b/drivers/crypto/caam/caamalg_desc.c @@ -1227,10 +1227,12 @@ EXPORT_SYMBOL(cnstr_shdsc_rfc4543_decap); * @ivsize: initialization vector size * @icvsize: integrity check value (ICV) size (truncated or full) * @encap: true if encapsulation, false if decapsulation + * @is_qi: true when called from caam/qi */ void cnstr_shdsc_chachapoly(u32 * const desc, struct alginfo *cdata, struct alginfo *adata, unsigned int ivsize, - unsigned int icvsize, const bool encap) + unsigned int icvsize, const bool encap, + const bool is_qi) { u32 *key_jump_cmd, *wait_cmd; u32 nfifo; @@ -1267,6 +1269,26 @@ void cnstr_shdsc_chachapoly(u32 * const desc, struct alginfo *cdata, OP_ALG_DECRYPT); } + if (is_qi) { + u32 *wait_load_cmd; + u32 ctx1_iv_off = is_ipsec ?
8 : 4; + + /* REG3 = assoclen */ + append_seq_load(desc, 4, LDST_CLASS_DECO | + LDST_SRCDST_WORD_DECO_MATH3 | + 4 << LDST_OFFSET_SHIFT); + + wait_load_cmd = append_jump(desc, JUMP_JSL | JUMP_TEST_ALL | + JUMP_COND_CALM | JUMP_COND_NCP | + JUMP_COND_NOP | JUMP_COND_NIP | + JUMP_COND_NIFP); + set_jump_tgt_here(desc, wait_load_cmd); + + append_seq_load(desc, ivsize, LDST_CLASS_1_CCB | + LDST_SRCDST_BYTE_CONTEXT | + ctx1_iv_off << LDST_OFFSET_SHIFT); + } + /* * MAGIC with NFIFO * Read associated data from the input and send them to class1 and diff --git a/drivers/crypto/caam/caamalg_desc.h b/drivers/crypto/caam/caamalg_desc.h index a1a7b0e6889d..d5ca42ff961a 100644 --- a/drivers/crypto/caam/caamalg_desc.h +++ b/drivers/crypto/caam/caamalg_desc.h @@ -98,7 +98,8 @@ void cnstr_shdsc_rfc4543_decap(u32 * const desc, struct alginfo *cdata, void cnstr_shdsc_chachapoly(u32 * const desc, struct alginfo *cdata, struct alginfo *adata, unsigned int ivsize, - unsigned int icvsize, const bool encap); + unsigned int icvsize, const bool encap, + const bool is_qi); void cnstr_shdsc_skcipher_encap(u32 * const desc, struct alginfo *cdata, unsigned int ivsize, const bool is_rfc3686, diff --git a/drivers/crypto/caam/caamalg_qi2.c b/drivers/crypto/caam/caamalg_qi2.c index a9e264bb9629..2598640aa98b 100644 --- a/drivers/crypto/caam/caamalg_qi2.c +++ b/drivers/crypto/caam/caamalg_qi2.c @@ -462,7 +462,15 @@ static struct aead_edesc *aead_edesc_alloc(struct aead_request *req, edesc->dst_nents = dst_nents; edesc->iv_dma = iv_dma; - edesc->assoclen = cpu_to_caam32(req->assoclen); + if ((alg->caam.class1_alg_type & OP_ALG_ALGSEL_MASK) == + OP_ALG_ALGSEL_CHACHA20 && ivsize != CHACHAPOLY_IV_SIZE) + /* +* The associated data comes already with the IV but we need +* to skip it when we authenticate or encrypt... +*/ + edesc->assoclen = cpu_to_caam32(req->assoclen - ivsize); + else
[PATCH 4/5] crypto: caam/jr - add support for Chacha20 + Poly1305
Add support for Chacha20 + Poly1305 combined AEAD: -generic (rfc7539) -IPsec (rfc7634 - known as rfc7539esp in the kernel) Signed-off-by: Cristian Stoica Signed-off-by: Horia Geantă --- drivers/crypto/caam/caamalg.c | 221 - drivers/crypto/caam/caamalg_desc.c | 111 +++ drivers/crypto/caam/caamalg_desc.h | 4 + drivers/crypto/caam/compat.h | 1 + drivers/crypto/caam/desc.h | 15 +++ drivers/crypto/caam/desc_constr.h | 7 +- 6 files changed, 354 insertions(+), 5 deletions(-) diff --git a/drivers/crypto/caam/caamalg.c b/drivers/crypto/caam/caamalg.c index 9f1414030bc2..cbaeb264a261 100644 --- a/drivers/crypto/caam/caamalg.c +++ b/drivers/crypto/caam/caamalg.c @@ -72,6 +72,8 @@ #define AUTHENC_DESC_JOB_IO_LEN(AEAD_DESC_JOB_IO_LEN + \ CAAM_CMD_SZ * 5) +#define CHACHAPOLY_DESC_JOB_IO_LEN (AEAD_DESC_JOB_IO_LEN + CAAM_CMD_SZ * 6) + #define DESC_MAX_USED_BYTES(CAAM_DESC_BYTES_MAX - DESC_JOB_IO_LEN) #define DESC_MAX_USED_LEN (DESC_MAX_USED_BYTES / CAAM_CMD_SZ) @@ -513,6 +515,61 @@ static int rfc4543_setauthsize(struct crypto_aead *authenc, return 0; } +static int chachapoly_set_sh_desc(struct crypto_aead *aead) +{ + struct caam_ctx *ctx = crypto_aead_ctx(aead); + struct device *jrdev = ctx->jrdev; + unsigned int ivsize = crypto_aead_ivsize(aead); + u32 *desc; + + if (!ctx->cdata.keylen || !ctx->authsize) + return 0; + + desc = ctx->sh_desc_enc; + cnstr_shdsc_chachapoly(desc, &ctx->cdata, &ctx->adata, ivsize, + ctx->authsize, true); + dma_sync_single_for_device(jrdev, ctx->sh_desc_enc_dma, + desc_bytes(desc), ctx->dir); + + desc = ctx->sh_desc_dec; + cnstr_shdsc_chachapoly(desc, &ctx->cdata, &ctx->adata, ivsize, + ctx->authsize, false); + dma_sync_single_for_device(jrdev, ctx->sh_desc_dec_dma, + desc_bytes(desc), ctx->dir); + + return 0; +} + +static int chachapoly_setauthsize(struct crypto_aead *aead, + unsigned int authsize) +{ + struct caam_ctx *ctx = crypto_aead_ctx(aead); + + if (authsize != POLY1305_DIGEST_SIZE) + return -EINVAL; + + ctx->authsize = authsize; + return chachapoly_set_sh_desc(aead); +} + +static int chachapoly_setkey(struct crypto_aead *aead, const u8 *key, +unsigned int keylen) +{ + struct caam_ctx *ctx = crypto_aead_ctx(aead); + unsigned int ivsize = crypto_aead_ivsize(aead); + unsigned int saltlen = CHACHAPOLY_IV_SIZE - ivsize; + + if (keylen != CHACHA20_KEY_SIZE + saltlen) { + crypto_aead_set_flags(aead, CRYPTO_TFM_RES_BAD_KEY_LEN); + return -EINVAL; + } + + ctx->cdata.key_virt = key; + ctx->cdata.keylen = keylen - saltlen; + + return chachapoly_set_sh_desc(aead); +} + static int aead_setkey(struct crypto_aead *aead, const u8 *key, unsigned int keylen) { @@ -1031,6 +1088,40 @@ static void init_gcm_job(struct aead_request *req, /* End of blank commands */ } +static void init_chachapoly_job(struct aead_request *req, + struct aead_edesc *edesc, bool all_contig, + bool encrypt) +{ + struct crypto_aead *aead = crypto_aead_reqtfm(req); + unsigned int ivsize = crypto_aead_ivsize(aead); + unsigned int assoclen = req->assoclen; + u32 *desc = edesc->hw_desc; + u32 ctx_iv_off = 4; + + init_aead_job(req, edesc, all_contig, encrypt); + + if (ivsize != CHACHAPOLY_IV_SIZE) { + /* IPsec specific: CONTEXT1[223:128] = {NONCE, IV} */ + ctx_iv_off += 4; + + /* +* The associated data comes already with the IV but we need +* to skip it when we authenticate or encrypt... +*/ + assoclen -= ivsize; + } + + append_math_add_imm_u32(desc, REG3, ZERO, IMM, assoclen); + + /* +* For IPsec load the IV further in the same register.
+* For RFC7539 simply load the 12 bytes nonce in a single operation +*/ + append_load_as_imm(desc, req->iv, ivsize, LDST_CLASS_1_CCB | + LDST_SRCDST_BYTE_CONTEXT | + ctx_iv_off << LDST_OFFSET_SHIFT); +} + static void init_authenc_job(struct aead_request *req, struct aead_edesc *edesc, bool all_contig, bool encrypt) @@ -1289,6 +1380,72 @@ static int gcm_encrypt(struct aead_request *req) return ret; } +static int chachapoly_encrypt(struct aead_request *req) +{ + struct aead_edesc *edesc; + struct crypto_aead *aead =
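As a usage illustration (not part of the patch), the new AEADs are reached through the generic crypto API like any other transform. A minimal kernel-style sketch for rfc7539(chacha20,poly1305), with a hypothetical helper name and trimmed error handling, assuming a 4.20-era API:

#include <crypto/aead.h>
#include <linux/err.h>
#include <linux/scatterlist.h>

/* Sketch: one-shot in-place encryption with rfc7539(chacha20,poly1305).
 * "buf" holds AAD || plaintext and must have room for the 16-byte tag.
 */
static int chachapoly_encrypt_example(u8 *buf, unsigned int assoclen,
				      unsigned int ptlen,
				      const u8 key[32], u8 iv[12])
{
	struct crypto_aead *tfm;
	struct aead_request *req;
	struct scatterlist sg;
	DECLARE_CRYPTO_WAIT(wait);
	int err;

	tfm = crypto_alloc_aead("rfc7539(chacha20,poly1305)", 0, 0);
	if (IS_ERR(tfm))
		return PTR_ERR(tfm);

	err = crypto_aead_setkey(tfm, key, 32);		/* ChaCha20 key */
	if (err)
		goto out;
	err = crypto_aead_setauthsize(tfm, 16);		/* Poly1305 tag */
	if (err)
		goto out;

	req = aead_request_alloc(tfm, GFP_KERNEL);
	if (!req) {
		err = -ENOMEM;
		goto out;
	}

	/* src == dst: ciphertext + tag are written over the plaintext */
	sg_init_one(&sg, buf, assoclen + ptlen + 16);
	aead_request_set_callback(req, CRYPTO_TFM_REQ_MAY_BACKLOG,
				  crypto_req_done, &wait);
	aead_request_set_ad(req, assoclen);
	aead_request_set_crypt(req, &sg, &sg, ptlen, iv);

	err = crypto_wait_req(crypto_aead_encrypt(req), &wait);
	aead_request_free(req);
out:
	crypto_free_aead(tfm);
	return err;
}

For rfc7539esp the same flow applies, except that the key carries a trailing 4-byte salt (keylen 36) and the IV is 8 bytes, as the setkey path above shows.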
[PATCH 0/5] crypto: caam - add support for Era 10
This patch set adds support for CAAM Era 10, currently used in LX2160A SoC:
-new register mapping: some registers/fields are deprecated and moved to
 different locations, mainly version registers
-algorithms:
 chacha20 (over DPSECI - Data Path SEC Interface on fsl-mc bus)
 rfc7539(chacha20,poly1305) (over both DPSECI and Job Ring Interface)
 rfc7539esp(chacha20,poly1305) (over both DPSECI and Job Ring Interface)

Note: the patch set is generated on top of cryptodev-2.6, however testing
was performed based on linux-next (tag: next-20181108) - which includes
LX2160A platform support - plus manually updating the LX2160A dts with:
-fsl-mc bus DT node
-missing dma-ranges property in the soc DT node

Cristian Stoica (1):
  crypto: export CHACHAPOLY_IV_SIZE

Horia Geantă (4):
  crypto: caam - add register map changes cf. Era 10
  crypto: caam/qi2 - add support for ChaCha20
  crypto: caam/jr - add support for Chacha20 + Poly1305
  crypto: caam/qi2 - add support for Chacha20 + Poly1305

 crypto/chacha20poly1305.c          |   2 -
 drivers/crypto/caam/caamalg.c      | 266 ++---
 drivers/crypto/caam/caamalg_desc.c | 139 ++-
 drivers/crypto/caam/caamalg_desc.h |   5 +
 drivers/crypto/caam/caamalg_qi.c   |  37 --
 drivers/crypto/caam/caamalg_qi2.c  | 156 +-
 drivers/crypto/caam/caamhash.c     |  20 ++-
 drivers/crypto/caam/caampkc.c      |  10 +-
 drivers/crypto/caam/caamrng.c      |  10 +-
 drivers/crypto/caam/compat.h       |   2 +
 drivers/crypto/caam/ctrl.c         |  28 +++-
 drivers/crypto/caam/desc.h         |  28
 drivers/crypto/caam/desc_constr.h  |   7 +-
 drivers/crypto/caam/regs.h         |  74 +--
 include/crypto/chacha20.h          |   1 +
 15 files changed, 724 insertions(+), 61 deletions(-)

--
2.16.2
[PATCH 1/5] crypto: caam - add register map changes cf. Era 10
Era 10 changes the register map. The updates that affect the drivers: -new version registers are added -DBG_DBG[deco_state] field is moved to a new register - DBG_EXEC[19:16] @ 8_0E3Ch. Signed-off-by: Horia Geantă --- drivers/crypto/caam/caamalg.c| 47 + drivers/crypto/caam/caamalg_qi.c | 37 +++- drivers/crypto/caam/caamhash.c | 20 --- drivers/crypto/caam/caampkc.c| 10 -- drivers/crypto/caam/caamrng.c| 10 +- drivers/crypto/caam/ctrl.c | 28 +++ drivers/crypto/caam/desc.h | 7 drivers/crypto/caam/regs.h | 74 ++-- 8 files changed, 184 insertions(+), 49 deletions(-) diff --git a/drivers/crypto/caam/caamalg.c b/drivers/crypto/caam/caamalg.c index 869f092432de..9f1414030bc2 100644 --- a/drivers/crypto/caam/caamalg.c +++ b/drivers/crypto/caam/caamalg.c @@ -3135,7 +3135,7 @@ static int __init caam_algapi_init(void) struct device *ctrldev; struct caam_drv_private *priv; int i = 0, err = 0; - u32 cha_vid, cha_inst, des_inst, aes_inst, md_inst; + u32 aes_vid, aes_inst, des_inst, md_vid, md_inst; unsigned int md_limit = SHA512_DIGEST_SIZE; bool registered = false; @@ -3168,14 +3168,34 @@ static int __init caam_algapi_init(void) * Register crypto algorithms the device supports. * First, detect presence and attributes of DES, AES, and MD blocks. */ - cha_vid = rd_reg32(>ctrl->perfmon.cha_id_ls); - cha_inst = rd_reg32(>ctrl->perfmon.cha_num_ls); - des_inst = (cha_inst & CHA_ID_LS_DES_MASK) >> CHA_ID_LS_DES_SHIFT; - aes_inst = (cha_inst & CHA_ID_LS_AES_MASK) >> CHA_ID_LS_AES_SHIFT; - md_inst = (cha_inst & CHA_ID_LS_MD_MASK) >> CHA_ID_LS_MD_SHIFT; + if (priv->era < 10) { + u32 cha_vid, cha_inst; + + cha_vid = rd_reg32(>ctrl->perfmon.cha_id_ls); + aes_vid = cha_vid & CHA_ID_LS_AES_MASK; + md_vid = (cha_vid & CHA_ID_LS_MD_MASK) >> CHA_ID_LS_MD_SHIFT; + + cha_inst = rd_reg32(>ctrl->perfmon.cha_num_ls); + des_inst = (cha_inst & CHA_ID_LS_DES_MASK) >> + CHA_ID_LS_DES_SHIFT; + aes_inst = cha_inst & CHA_ID_LS_AES_MASK; + md_inst = (cha_inst & CHA_ID_LS_MD_MASK) >> CHA_ID_LS_MD_SHIFT; + } else { + u32 aesa, mdha; + + aesa = rd_reg32(>ctrl->vreg.aesa); + mdha = rd_reg32(>ctrl->vreg.mdha); + + aes_vid = (aesa & CHA_VER_VID_MASK) >> CHA_VER_VID_SHIFT; + md_vid = (mdha & CHA_VER_VID_MASK) >> CHA_VER_VID_SHIFT; + + des_inst = rd_reg32(>ctrl->vreg.desa) & CHA_VER_NUM_MASK; + aes_inst = aesa & CHA_VER_NUM_MASK; + md_inst = mdha & CHA_VER_NUM_MASK; + } /* If MD is present, limit digest size based on LP256 */ - if (md_inst && ((cha_vid & CHA_ID_LS_MD_MASK) == CHA_ID_LS_MD_LP256)) + if (md_inst && md_vid == CHA_VER_VID_MD_LP256) md_limit = SHA256_DIGEST_SIZE; for (i = 0; i < ARRAY_SIZE(driver_algs); i++) { @@ -3196,10 +3216,10 @@ static int __init caam_algapi_init(void) * Check support for AES modes not available * on LP devices. */ - if ((cha_vid & CHA_ID_LS_AES_MASK) == CHA_ID_LS_AES_LP) - if ((t_alg->caam.class1_alg_type & OP_ALG_AAI_MASK) == -OP_ALG_AAI_XTS) - continue; + if (aes_vid == CHA_VER_VID_AES_LP && + (t_alg->caam.class1_alg_type & OP_ALG_AAI_MASK) == + OP_ALG_AAI_XTS) + continue; caam_skcipher_alg_init(t_alg); @@ -3236,9 +3256,8 @@ static int __init caam_algapi_init(void) * Check support for AES algorithms not available * on LP devices. 
*/ - if ((cha_vid & CHA_ID_LS_AES_MASK) == CHA_ID_LS_AES_LP) - if (alg_aai == OP_ALG_AAI_GCM) - continue; + if (aes_vid == CHA_VER_VID_AES_LP && alg_aai == OP_ALG_AAI_GCM) + continue; /* * Skip algorithms requiring message digests diff --git a/drivers/crypto/caam/caamalg_qi.c b/drivers/crypto/caam/caamalg_qi.c index 23c9fc4975f8..c0d55310aade 100644 --- a/drivers/crypto/caam/caamalg_qi.c +++ b/drivers/crypto/caam/caamalg_qi.c @@ -2462,7 +2462,7 @@ static int __init caam_qi_algapi_init(void) struct device *ctrldev; struct caam_drv_private *priv; int i = 0, err = 0; - u32 cha_vid, cha_inst, des_inst, aes_inst, md_inst; + u32 aes_vid, aes_inst, des_inst, md_vid, md_inst; unsigned int md_limit = SHA512_DIGEST_SIZE; bool registered = false; @@ -2497,14
[PATCH 3/5] crypto: export CHACHAPOLY_IV_SIZE
From: Cristian Stoica

Move CHACHAPOLY_IV_SIZE to header file, so it can be reused.

Signed-off-by: Cristian Stoica
Signed-off-by: Horia Geantă
---
 crypto/chacha20poly1305.c | 2 --
 include/crypto/chacha20.h | 1 +
 2 files changed, 1 insertion(+), 2 deletions(-)

diff --git a/crypto/chacha20poly1305.c b/crypto/chacha20poly1305.c
index 600afa99941f..f9dd5453046a 100644
--- a/crypto/chacha20poly1305.c
+++ b/crypto/chacha20poly1305.c
@@ -22,8 +22,6 @@

 #include "internal.h"

-#define CHACHAPOLY_IV_SIZE	12
-
 struct chachapoly_instance_ctx {
 	struct crypto_skcipher_spawn chacha;
 	struct crypto_ahash_spawn poly;
diff --git a/include/crypto/chacha20.h b/include/crypto/chacha20.h
index f76302d99e2b..2d3129442a52 100644
--- a/include/crypto/chacha20.h
+++ b/include/crypto/chacha20.h
@@ -13,6 +13,7 @@
 #define CHACHA20_IV_SIZE	16
 #define CHACHA20_KEY_SIZE	32
 #define CHACHA20_BLOCK_SIZE	64
+#define CHACHAPOLY_IV_SIZE	12

 struct chacha20_ctx {
 	u32 key[8];
--
2.16.2
[PATCH 2/5] crypto: caam/qi2 - add support for ChaCha20
Add support for ChaCha20 skcipher algorithm. Signed-off-by: Carmen Iorga Signed-off-by: Horia Geantă --- drivers/crypto/caam/caamalg_desc.c | 6 -- drivers/crypto/caam/caamalg_qi2.c | 27 +-- drivers/crypto/caam/compat.h | 1 + drivers/crypto/caam/desc.h | 6 ++ 4 files changed, 36 insertions(+), 4 deletions(-) diff --git a/drivers/crypto/caam/caamalg_desc.c b/drivers/crypto/caam/caamalg_desc.c index 1a6f0da14106..d850590079a2 100644 --- a/drivers/crypto/caam/caamalg_desc.c +++ b/drivers/crypto/caam/caamalg_desc.c @@ -1228,7 +1228,8 @@ static inline void skcipher_append_src_dst(u32 *desc) * @desc: pointer to buffer used for descriptor construction * @cdata: pointer to block cipher transform definitions * Valid algorithm values - one of OP_ALG_ALGSEL_{AES, DES, 3DES} ANDed - * with OP_ALG_AAI_CBC or OP_ALG_AAI_CTR_MOD128. + * with OP_ALG_AAI_CBC or OP_ALG_AAI_CTR_MOD128 + *- OP_ALG_ALGSEL_CHACHA20 * @ivsize: initialization vector size * @is_rfc3686: true when ctr(aes) is wrapped by rfc3686 template * @ctx1_iv_off: IV offset in CONTEXT1 register @@ -1293,7 +1294,8 @@ EXPORT_SYMBOL(cnstr_shdsc_skcipher_encap); * @desc: pointer to buffer used for descriptor construction * @cdata: pointer to block cipher transform definitions * Valid algorithm values - one of OP_ALG_ALGSEL_{AES, DES, 3DES} ANDed - * with OP_ALG_AAI_CBC or OP_ALG_AAI_CTR_MOD128. + * with OP_ALG_AAI_CBC or OP_ALG_AAI_CTR_MOD128 + *- OP_ALG_ALGSEL_CHACHA20 * @ivsize: initialization vector size * @is_rfc3686: true when ctr(aes) is wrapped by rfc3686 template * @ctx1_iv_off: IV offset in CONTEXT1 register diff --git a/drivers/crypto/caam/caamalg_qi2.c b/drivers/crypto/caam/caamalg_qi2.c index 7d8ac0222fa3..a9e264bb9629 100644 --- a/drivers/crypto/caam/caamalg_qi2.c +++ b/drivers/crypto/caam/caamalg_qi2.c @@ -816,7 +816,9 @@ static int skcipher_setkey(struct crypto_skcipher *skcipher, const u8 *key, u32 *desc; u32 ctx1_iv_off = 0; const bool ctr_mode = ((ctx->cdata.algtype & OP_ALG_AAI_MASK) == - OP_ALG_AAI_CTR_MOD128); + OP_ALG_AAI_CTR_MOD128) && + ((ctx->cdata.algtype & OP_ALG_ALGSEL_MASK) != + OP_ALG_ALGSEL_CHACHA20); const bool is_rfc3686 = alg->caam.rfc3686; print_hex_dump_debug("key in @" __stringify(__LINE__)": ", @@ -1494,7 +1496,23 @@ static struct caam_skcipher_alg driver_algs[] = { .ivsize = AES_BLOCK_SIZE, }, .caam.class1_alg_type = OP_ALG_ALGSEL_AES | OP_ALG_AAI_XTS, - } + }, + { + .skcipher = { + .base = { + .cra_name = "chacha20", + .cra_driver_name = "chacha20-caam-qi2", + .cra_blocksize = 1, + }, + .setkey = skcipher_setkey, + .encrypt = skcipher_encrypt, + .decrypt = skcipher_decrypt, + .min_keysize = CHACHA20_KEY_SIZE, + .max_keysize = CHACHA20_KEY_SIZE, + .ivsize = CHACHA20_IV_SIZE, + }, + .caam.class1_alg_type = OP_ALG_ALGSEL_CHACHA20, + }, }; static struct caam_aead_alg driver_aeads[] = { @@ -4908,6 +4926,11 @@ static int dpaa2_caam_probe(struct fsl_mc_device *dpseci_dev) alg_sel == OP_ALG_ALGSEL_AES) continue; + /* Skip CHACHA20 algorithms if not supported by device */ + if (alg_sel == OP_ALG_ALGSEL_CHACHA20 && + !priv->sec_attr.ccha_acc_num) + continue; + t_alg->caam.dev = dev; caam_skcipher_alg_init(t_alg); diff --git a/drivers/crypto/caam/compat.h b/drivers/crypto/caam/compat.h index 9604ff7a335e..a5081b4050b6 100644 --- a/drivers/crypto/caam/compat.h +++ b/drivers/crypto/caam/compat.h @@ -36,6 +36,7 @@ #include #include #include +#include #include #include #include diff --git a/drivers/crypto/caam/desc.h b/drivers/crypto/caam/desc.h index ec1ef06049b4..9d117e51629f 100644 --- a/drivers/crypto/caam/desc.h +++ 
b/drivers/crypto/caam/desc.h @@ -1159,6 +1159,7 @@ #define OP_ALG_ALGSEL_KASUMI (0x70 << OP_ALG_ALGSEL_SHIFT) #define OP_ALG_ALGSEL_CRC (0x90 << OP_ALG_ALGSEL_SHIFT) #define OP_ALG_ALGSEL_SNOW_F9 (0xA0 << OP_ALG_ALGSEL_SHIFT) +#define OP_ALG_ALGSEL_CHACHA20 (0xD0 << OP_ALG_ALGSEL_SHIFT) #define OP_ALG_AAI_SHIFT 4 #define OP_ALG_AAI_MASK(0x1ff << OP_ALG_AAI_SHIFT) @@ -1206,6 +1207,11 @@ #define OP_ALG_AAI_RNG4_AI (0x80 << OP_ALG_AAI_SHIFT) #define OP_ALG_AAI_RNG4_SK (0x100 <<
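For completeness, a rough sketch of how the new "chacha20" skcipher would be exercised from kernel code (hypothetical helper; assumes the kernel's 16-byte chacha20 IV layout of a 32-bit little-endian block counter followed by the 96-bit nonce):

#include <crypto/skcipher.h>
#include <linux/err.h>
#include <linux/scatterlist.h>

/* Sketch: in-place chacha20 encryption of "data" (any length; stream cipher). */
static int chacha20_encrypt_example(u8 *data, unsigned int len,
				    const u8 key[32], u8 iv[16])
{
	struct crypto_skcipher *tfm;
	struct skcipher_request *req;
	struct scatterlist sg;
	DECLARE_CRYPTO_WAIT(wait);
	int err;

	tfm = crypto_alloc_skcipher("chacha20", 0, 0);
	if (IS_ERR(tfm))
		return PTR_ERR(tfm);

	err = crypto_skcipher_setkey(tfm, key, 32);
	if (err)
		goto out;

	req = skcipher_request_alloc(tfm, GFP_KERNEL);
	if (!req) {
		err = -ENOMEM;
		goto out;
	}

	sg_init_one(&sg, data, len);
	skcipher_request_set_callback(req, CRYPTO_TFM_REQ_MAY_BACKLOG,
				      crypto_req_done, &wait);
	skcipher_request_set_crypt(req, &sg, &sg, len, iv);

	err = crypto_wait_req(crypto_skcipher_encrypt(req), &wait);
	skcipher_request_free(req);
out:
	crypto_free_skcipher(tfm);
	return err;
}

With the driver entry above, the crypto core will pick "chacha20-caam-qi2" when the accelerator is present and instantiated.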
Re: [RFC PATCH 1/4] kconfig: add as-instr macro to scripts/Kconfig.include
On 07/11/18 14:55, Will Deacon wrote:
> On Wed, Nov 07, 2018 at 09:40:05AM +0000, Vladimir Murzin wrote:
>> There are cases where a whole feature, for instance arm64/lse or
>> arm/crypto, can depend on the assembler. Current practice is to report
>> at build time that the selected feature is not supported, which can be
>> quite annoying...
>
> Why is it annoying? You still end up with a working kernel.

.config doesn't really represent whether an option was actually built or
not; the annoying part is digging through build logs (if anyone saved them
at all!) or the relevant parts of dmesg (if the option logs anything there
at all - and that is not always included in reports).

>> It'd be nicer if we could check the assembler first and opt in to
>> feature visibility in Kconfig.
>>
>> Cc: Masahiro Yamada
>> Cc: Will Deacon
>> Cc: Marc Zyngier
>> Cc: Ard Biesheuvel
>> Signed-off-by: Vladimir Murzin
>> ---
>> scripts/Kconfig.include | 4 ++++
>> 1 file changed, 4 insertions(+)
>
> One issue I have with doing the check like this is that if somebody sends
> you a .config with e.g. ARM64_LSE_ATOMICS=y and you try to build a kernel
> using that .config and an old toolchain, the option is silently dropped.

I see... at least we have some tools like ./scripts/diffconfig

> I think the diagnostic is actually useful in this case.

Fully agree on the diagnostic side - any suggestions for how it can be
improved?

Cheers
Vladimir
Re: [RFC PATCH 1/4] kconfig: add as-instr macro to scripts/Kconfig.include
On Wed, Nov 07, 2018 at 09:40:05AM +0000, Vladimir Murzin wrote:
> There are cases where a whole feature, for instance arm64/lse or
> arm/crypto, can depend on the assembler. Current practice is to report
> at build time that the selected feature is not supported, which can be
> quite annoying...

Why is it annoying? You still end up with a working kernel.

> It'd be nicer if we could check the assembler first and opt in to
> feature visibility in Kconfig.
>
> Cc: Masahiro Yamada
> Cc: Will Deacon
> Cc: Marc Zyngier
> Cc: Ard Biesheuvel
> Signed-off-by: Vladimir Murzin
> ---
> scripts/Kconfig.include | 4 ++++
> 1 file changed, 4 insertions(+)

One issue I have with doing the check like this is that if somebody sends
you a .config with e.g. ARM64_LSE_ATOMICS=y and you try to build a kernel
using that .config and an old toolchain, the option is silently dropped.

I think the diagnostic is actually useful in this case.

Will
[RFC PATCH 2/4] arm64: lse: expose dependency on gas via Kconfig
So we can simply hide LSE support if dependency is not satisfied. Cc: Will Deacon Signed-off-by: Vladimir Murzin --- arch/arm64/Kconfig | 1 + arch/arm64/Makefile | 13 ++--- arch/arm64/include/asm/atomic.h | 2 +- arch/arm64/include/asm/lse.h| 6 +++--- arch/arm64/kernel/cpufeature.c | 4 ++-- 5 files changed, 9 insertions(+), 17 deletions(-) diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig index 964f682..7978aee 100644 --- a/arch/arm64/Kconfig +++ b/arch/arm64/Kconfig @@ -1072,6 +1072,7 @@ config ARM64_PAN config ARM64_LSE_ATOMICS bool "Atomic instructions" default y + depends on $(as-instr,.arch_extension lse) help As part of the Large System Extensions, ARMv8.1 introduces new atomic instructions that are designed specifically to scale in diff --git a/arch/arm64/Makefile b/arch/arm64/Makefile index b4e994c..3054757 100644 --- a/arch/arm64/Makefile +++ b/arch/arm64/Makefile @@ -32,15 +32,6 @@ endif KBUILD_DEFCONFIG := defconfig -# Check for binutils support for specific extensions -lseinstr := $(call as-instr,.arch_extension lse,-DCONFIG_AS_LSE=1) - -ifeq ($(CONFIG_ARM64_LSE_ATOMICS), y) - ifeq ($(lseinstr),) -$(warning LSE atomics not supported by binutils) - endif -endif - ifeq ($(CONFIG_ARM64), y) brokengasinst := $(call as-instr,1:\n.inst 0\n.rept . - 1b\n\nnop\n.endr\n,,-DCONFIG_BROKEN_GAS_INST=1) @@ -49,9 +40,9 @@ $(warning Detected assembler with broken .inst; disassembly will be unreliable) endif endif -KBUILD_CFLAGS += -mgeneral-regs-only $(lseinstr) $(brokengasinst) +KBUILD_CFLAGS += -mgeneral-regs-only $(brokengasinst) KBUILD_CFLAGS += -fno-asynchronous-unwind-tables -KBUILD_AFLAGS += $(lseinstr) $(brokengasinst) +KBUILD_AFLAGS += $(brokengasinst) KBUILD_CFLAGS += $(call cc-option,-mabi=lp64) KBUILD_AFLAGS += $(call cc-option,-mabi=lp64) diff --git a/arch/arm64/include/asm/atomic.h b/arch/arm64/include/asm/atomic.h index 9bca54d..9d8d029 100644 --- a/arch/arm64/include/asm/atomic.h +++ b/arch/arm64/include/asm/atomic.h @@ -30,7 +30,7 @@ #define __ARM64_IN_ATOMIC_IMPL -#if defined(CONFIG_ARM64_LSE_ATOMICS) && defined(CONFIG_AS_LSE) +#ifdef CONFIG_ARM64_LSE_ATOMICS #include #else #include diff --git a/arch/arm64/include/asm/lse.h b/arch/arm64/include/asm/lse.h index 8262325..1fd31c7 100644 --- a/arch/arm64/include/asm/lse.h +++ b/arch/arm64/include/asm/lse.h @@ -2,7 +2,7 @@ #ifndef __ASM_LSE_H #define __ASM_LSE_H -#if defined(CONFIG_AS_LSE) && defined(CONFIG_ARM64_LSE_ATOMICS) +#ifdef CONFIG_ARM64_LSE_ATOMICS #include #include @@ -36,7 +36,7 @@ ALTERNATIVE(llsc, lse, ARM64_HAS_LSE_ATOMICS) #endif /* __ASSEMBLER__ */ -#else /* CONFIG_AS_LSE && CONFIG_ARM64_LSE_ATOMICS */ +#else /* CONFIG_ARM64_LSE_ATOMICS */ #ifdef __ASSEMBLER__ @@ -53,5 +53,5 @@ #define ARM64_LSE_ATOMIC_INSN(llsc, lse) llsc #endif /* __ASSEMBLER__ */ -#endif /* CONFIG_AS_LSE && CONFIG_ARM64_LSE_ATOMICS */ +#endif /* CONFIG_ARM64_LSE_ATOMICS */ #endif /* __ASM_LSE_H */ diff --git a/arch/arm64/kernel/cpufeature.c b/arch/arm64/kernel/cpufeature.c index 74e9dcb..46f1bac 100644 --- a/arch/arm64/kernel/cpufeature.c +++ b/arch/arm64/kernel/cpufeature.c @@ -1170,7 +1170,7 @@ static void cpu_clear_disr(const struct arm64_cpu_capabilities *__unused) .cpu_enable = cpu_enable_pan, }, #endif /* CONFIG_ARM64_PAN */ -#if defined(CONFIG_AS_LSE) && defined(CONFIG_ARM64_LSE_ATOMICS) +#ifdef CONFIG_ARM64_LSE_ATOMICS { .desc = "LSE atomic instructions", .capability = ARM64_HAS_LSE_ATOMICS, @@ -1181,7 +1181,7 @@ static void cpu_clear_disr(const struct arm64_cpu_capabilities *__unused) .sign = FTR_UNSIGNED, .min_field_value = 2, }, 
-#endif /* CONFIG_AS_LSE && CONFIG_ARM64_LSE_ATOMICS */ +#endif /* CONFIG_ARM64_LSE_ATOMICS */ { .desc = "Software prefetching using PRFM", .capability = ARM64_HAS_NO_HW_PREFETCH, -- 1.9.1
[RFC PATCH 1/4] kconfig: add as-instr macro to scripts/Kconfig.include
There are cases where a whole feature, for instance arm64/lse or
arm/crypto, can depend on the assembler. Current practice is to report
at build time that the selected feature is not supported, which can be
quite annoying...

It'd be nicer if we could check the assembler first and opt in to
feature visibility in Kconfig.

Cc: Masahiro Yamada
Cc: Will Deacon
Cc: Marc Zyngier
Cc: Ard Biesheuvel
Signed-off-by: Vladimir Murzin
---
 scripts/Kconfig.include | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/scripts/Kconfig.include b/scripts/Kconfig.include
index dad5583..07c145c 100644
--- a/scripts/Kconfig.include
+++ b/scripts/Kconfig.include
@@ -22,6 +22,10 @@ success = $(if-success,$(1),y,n)
 # Return y if the compiler supports <flag>, n otherwise
 cc-option = $(success,$(CC) -Werror $(1) -E -x c /dev/null -o /dev/null)

+# $(as-instr,<instr>)
+# Return y if the assembler supports <instr>, n otherwise
+as-instr = $(success,printf "%b\n" "$(1)" | $(CC) -Werror -c -x assembler -o /dev/null -)
+
 # $(ld-option,<flag>)
 # Return y if the linker supports <flag>, n otherwise
 ld-option = $(success,$(LD) -v $(1))
--
1.9.1
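To see what the macro actually tests: $(success,...) returns y when the command exits 0, so the probe is equivalent to running something like the following by hand (illustrative invocation; the cross-compiler name is an assumption):

  $ printf "%b\n" ".arch_extension lse" | aarch64-linux-gnu-gcc -Werror -c -x assembler -o /dev/null - && echo y || echo n

A toolchain whose assembler rejects the directive makes the pipeline fail, and the Kconfig symbol that depends on the macro simply becomes invisible.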
[RFC PATCH 3/4] arm64: turn "broken gas inst" into real config option
So it is available everywhere and there is no need to keep CONFIG_ARM64 workaround ;) Cc: Marc Zyngier Signed-off-by: Vladimir Murzin --- arch/arm64/Kconfig | 3 +++ arch/arm64/Makefile | 9 ++--- 2 files changed, 5 insertions(+), 7 deletions(-) diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig index 7978aee..86fc357 100644 --- a/arch/arm64/Kconfig +++ b/arch/arm64/Kconfig @@ -287,6 +287,9 @@ config ARCH_SUPPORTS_UPROBES config ARCH_PROC_KCORE_TEXT def_bool y +config BROKEN_GAS_INST + def_bool y if !$(as-instr,1:\n.inst 0\n.rept . - 1b\n\nnop\n.endr\n) + source "arch/arm64/Kconfig.platforms" menu "Bus support" diff --git a/arch/arm64/Makefile b/arch/arm64/Makefile index 3054757..9860d3a 100644 --- a/arch/arm64/Makefile +++ b/arch/arm64/Makefile @@ -32,17 +32,12 @@ endif KBUILD_DEFCONFIG := defconfig -ifeq ($(CONFIG_ARM64), y) -brokengasinst := $(call as-instr,1:\n.inst 0\n.rept . - 1b\n\nnop\n.endr\n,,-DCONFIG_BROKEN_GAS_INST=1) - - ifneq ($(brokengasinst),) +ifeq ($(CONFIG_BROKEN_GAS_INST),y) $(warning Detected assembler with broken .inst; disassembly will be unreliable) - endif endif -KBUILD_CFLAGS += -mgeneral-regs-only $(brokengasinst) +KBUILD_CFLAGS += -mgeneral-regs-only KBUILD_CFLAGS += -fno-asynchronous-unwind-tables -KBUILD_AFLAGS += $(brokengasinst) KBUILD_CFLAGS += $(call cc-option,-mabi=lp64) KBUILD_AFLAGS += $(call cc-option,-mabi=lp64) -- 1.9.1
[RFC PATCH 0/4] Minor improvements over handling dependency on GAS
With recent changes in Kconfig processing it is now possible to expose
dependencies on specific tools and supported options via Kconfig rather
than burying them deep in the Makefiles.

This small series tries to address the case where a whole feature, for
instance arm64/lse or arm/crypto, depends on GAS.

Vladimir Murzin (4):
  kconfig: add as-instr macro to scripts/Kconfig.include
  arm64: lse: expose dependency on gas via Kconfig
  arm64: turn "broken gas inst" into real config option
  ARM: crypto: expose dependency on gas via Kconfig

 arch/arm/crypto/Kconfig         | 31 +--
 arch/arm/crypto/Makefile        | 31 ++-
 arch/arm64/Kconfig              |  4
 arch/arm64/Makefile             | 18 ++
 arch/arm64/include/asm/atomic.h |  2 +-
 arch/arm64/include/asm/lse.h    |  6 +++---
 arch/arm64/kernel/cpufeature.c  |  4 ++--
 scripts/Kconfig.include         |  4
 8 files changed, 43 insertions(+), 57 deletions(-)

--
1.9.1
[RFC PATCH 4/4] ARM: crypto: expose dependency on gas via Kconfig
So we can advertise only those entries which dependency is satisfied. Cc: Ard Biesheuvel Signed-off-by: Vladimir Murzin --- arch/arm/crypto/Kconfig | 31 +-- arch/arm/crypto/Makefile | 31 ++- 2 files changed, 27 insertions(+), 35 deletions(-) diff --git a/arch/arm/crypto/Kconfig b/arch/arm/crypto/Kconfig index ef0c7fe..f437a91f 100644 --- a/arch/arm/crypto/Kconfig +++ b/arch/arm/crypto/Kconfig @@ -9,6 +9,12 @@ menuconfig ARM_CRYPTO if ARM_CRYPTO +config ARM_AS_HAS_CE + def_bool $(as-instr,.fpu crypto-neon-fp-armv8) + +config ARM_AS_HAS_CRC + def_bool $(as-instr,.arch armv8-a\n.arch_extension crc) + config CRYPTO_SHA1_ARM tristate "SHA1 digest algorithm (ARM-asm)" select CRYPTO_SHA1 @@ -30,21 +36,21 @@ config CRYPTO_SHA1_ARM_NEON config CRYPTO_SHA1_ARM_CE tristate "SHA1 digest algorithm (ARM v8 Crypto Extensions)" - depends on KERNEL_MODE_NEON + depends on KERNEL_MODE_NEON && ARM_AS_HAS_CE select CRYPTO_SHA1_ARM select CRYPTO_HASH help SHA-1 secure hash standard (FIPS 180-1/DFIPS 180-2) implemented - using special ARMv8 Crypto Extensions. + using special ARMv8 Crypto Extensions (need binutils 2.23 or higher). config CRYPTO_SHA2_ARM_CE tristate "SHA-224/256 digest algorithm (ARM v8 Crypto Extensions)" - depends on KERNEL_MODE_NEON + depends on KERNEL_MODE_NEON && ARM_AS_HAS_CE select CRYPTO_SHA256_ARM select CRYPTO_HASH help SHA-256 secure hash standard (DFIPS 180-2) implemented - using special ARMv8 Crypto Extensions. + using special ARMv8 Crypto Extensions (need binutils 2.23 or higher). config CRYPTO_SHA256_ARM tristate "SHA-224/256 digest algorithm (ARM-asm and NEON)" @@ -87,16 +93,16 @@ config CRYPTO_AES_ARM_BS config CRYPTO_AES_ARM_CE tristate "Accelerated AES using ARMv8 Crypto Extensions" - depends on KERNEL_MODE_NEON + depends on KERNEL_MODE_NEON && ARM_AS_HAS_CE select CRYPTO_BLKCIPHER select CRYPTO_SIMD help Use an implementation of AES in CBC, CTR and XTS modes that uses - ARMv8 Crypto Extensions + ARMv8 Crypto Extensions (need binutils 2.23 or higher) config CRYPTO_GHASH_ARM_CE tristate "PMULL-accelerated GHASH using NEON/ARMv8 Crypto Extensions" - depends on KERNEL_MODE_NEON + depends on KERNEL_MODE_NEON && ARM_AS_HAS_CE select CRYPTO_HASH select CRYPTO_CRYPTD select CRYPTO_GF128MUL @@ -104,17 +110,22 @@ config CRYPTO_GHASH_ARM_CE Use an implementation of GHASH (used by the GCM AEAD chaining mode) that uses the 64x64 to 128 bit polynomial multiplication (vmull.p64) that is part of the ARMv8 Crypto Extensions, or a slower variant that - uses the vmull.p8 instruction that is part of the basic NEON ISA. + uses the vmull.p8 instruction that is part of the basic NEON ISA (need + binutils 2.23 or higher). 
config CRYPTO_CRCT10DIF_ARM_CE tristate "CRCT10DIF digest algorithm using PMULL instructions" - depends on KERNEL_MODE_NEON && CRC_T10DIF + depends on KERNEL_MODE_NEON && CRC_T10DIF && ARM_AS_HAS_CE select CRYPTO_HASH + help + Need binutils 2.23 or higher config CRYPTO_CRC32_ARM_CE tristate "CRC32(C) digest algorithm using CRC and/or PMULL instructions" - depends on KERNEL_MODE_NEON && CRC32 + depends on KERNEL_MODE_NEON && CRC32 && ARM_AS_HAS_CRC select CRYPTO_HASH + help + Need binutils 2.23 or higher config CRYPTO_CHACHA20_NEON tristate "NEON accelerated ChaCha20 symmetric cipher" diff --git a/arch/arm/crypto/Makefile b/arch/arm/crypto/Makefile index bd5bcee..e897327 100644 --- a/arch/arm/crypto/Makefile +++ b/arch/arm/crypto/Makefile @@ -11,32 +11,13 @@ obj-$(CONFIG_CRYPTO_SHA256_ARM) += sha256-arm.o obj-$(CONFIG_CRYPTO_SHA512_ARM) += sha512-arm.o obj-$(CONFIG_CRYPTO_CHACHA20_NEON) += chacha20-neon.o -ce-obj-$(CONFIG_CRYPTO_AES_ARM_CE) += aes-arm-ce.o -ce-obj-$(CONFIG_CRYPTO_SHA1_ARM_CE) += sha1-arm-ce.o -ce-obj-$(CONFIG_CRYPTO_SHA2_ARM_CE) += sha2-arm-ce.o -ce-obj-$(CONFIG_CRYPTO_GHASH_ARM_CE) += ghash-arm-ce.o -ce-obj-$(CONFIG_CRYPTO_CRCT10DIF_ARM_CE) += crct10dif-arm-ce.o -crc-obj-$(CONFIG_CRYPTO_CRC32_ARM_CE) += crc32-arm-ce.o +obj-$(CONFIG_CRYPTO_AES_ARM_CE) += aes-arm-ce.o +obj-$(CONFIG_CRYPTO_SHA1_ARM_CE) += sha1-arm-ce.o +obj-$(CONFIG_CRYPTO_SHA2_ARM_CE) += sha2-arm-ce.o +obj-$(CONFIG_CRYPTO_GHASH_ARM_CE) += ghash-arm-ce.o +obj-$(CONFIG_CRYPTO_CRCT10DIF_ARM_CE) += crct10dif-arm-ce.o -ifneq ($(crc-obj-y)$(crc-obj-m),) -ifeq ($(call as-instr,.arch armv8-a\n.arch_extension crc,y,n),y) -ce-obj-y += $(crc-obj-y) -ce-obj-m += $(crc-obj-m) -else -$(warning These CRC Extensions modules need binutils 2.23 or higher) -$(warning $(crc-obj-y) $(crc-obj-m)) -endif -endif -
Re: [PATCH 1/2] crypto: fix cfb mode decryption
On Thu, 1 Nov 2018 at 11:41, Herbert Xu wrote:
>
> On Thu, Nov 01, 2018 at 11:32:37AM +0300, Dmitry Eremin-Solenikov wrote:
> >
> > Since the 4.20 pull went into Linus's tree, is there any chance of
> > getting these two patches into the crypto tree?
>
> These aren't critical enough for the current mainline so they will
> go in at the next merge window.

Thank you.

--
With best wishes
Dmitry
Re: [PATCH 1/2] crypto: fix cfb mode decryption
On Thu, Nov 01, 2018 at 11:32:37AM +0300, Dmitry Eremin-Solenikov wrote:
>
> Since the 4.20 pull went into Linus's tree, is there any chance of
> getting these two patches into the crypto tree?

These aren't critical enough for the current mainline so they will
go in at the next merge window.

Cheers,
--
Email: Herbert Xu
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
Re: [PATCH 1/2] crypto: fix cfb mode decryption
Hello,

On Sun, 21 Oct 2018 at 11:07, James Bottomley wrote:
>
> On Sun, 2018-10-21 at 09:05 +0200, Ard Biesheuvel wrote:
> > (+ James)
>
> Thanks!
>
> > On 20 October 2018 at 01:01, Dmitry Eremin-Solenikov wrote:
> > > crypto_cfb_decrypt_segment() incorrectly XOR'ed the generated
> > > keystream with the IV, rather than with the data stream, resulting
> > > in incorrect decryption.
> > > Test vectors will be added in the next patch.
> > >
> > > Signed-off-by: Dmitry Eremin-Solenikov
> > > Cc: sta...@vger.kernel.org
> > > ---
> > > crypto/cfb.c | 2 +-
> > > 1 file changed, 1 insertion(+), 1 deletion(-)
> > >
> > > diff --git a/crypto/cfb.c b/crypto/cfb.c
> > > index a0d68c09e1b9..fd4e8500e121 100644
> > > --- a/crypto/cfb.c
> > > +++ b/crypto/cfb.c
> > > @@ -144,7 +144,7 @@ static int crypto_cfb_decrypt_segment(struct
> > > skcipher_walk *walk,
> > >
> > > do {
> > > crypto_cfb_encrypt_one(tfm, iv, dst);
> > > - crypto_xor(dst, iv, bsize);
> > > + crypto_xor(dst, src, bsize);
>
> This does look right. I think the reason the TPM code works is that it
> always does encrypt/decrypt in-place, which is a separate piece of the
> code which appears to be correct.

Since the 4.20 pull went into Linus's tree, is there any chance of getting
these two patches into the crypto tree?

--
With best wishes
Dmitry
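To make the one-line fix concrete, here is a self-contained userspace sketch of CFB mode (a toy byte permutation stands in for the real block cipher, since CFB only ever uses the cipher's forward direction; this is an illustration, not the kernel code):

#include <stdio.h>
#include <string.h>
#include <stdint.h>

#define BSIZE 16

/* Toy stand-in for E_k(block). Any fixed function works for the demo. */
static void toy_encrypt_block(const uint8_t in[BSIZE], uint8_t out[BSIZE])
{
	for (int i = 0; i < BSIZE; i++)
		out[i] = (uint8_t)(in[i] * 5 + 0x3a + i);
}

static void cfb_encrypt(const uint8_t *src, uint8_t *dst, size_t blocks,
			const uint8_t *iv0)
{
	uint8_t iv[BSIZE], ks[BSIZE];

	memcpy(iv, iv0, BSIZE);
	for (size_t b = 0; b < blocks; b++, src += BSIZE, dst += BSIZE) {
		toy_encrypt_block(iv, ks);
		for (int i = 0; i < BSIZE; i++)
			dst[i] = ks[i] ^ src[i];
		memcpy(iv, dst, BSIZE);	/* next IV = this ciphertext block */
	}
}

static void cfb_decrypt(const uint8_t *src, uint8_t *dst, size_t blocks,
			const uint8_t *iv0)
{
	uint8_t iv[BSIZE], ks[BSIZE];

	memcpy(iv, iv0, BSIZE);
	for (size_t b = 0; b < blocks; b++, src += BSIZE, dst += BSIZE) {
		toy_encrypt_block(iv, ks);
		/* the fix: XOR the keystream with the ciphertext (src),
		 * not with the IV as the buggy code did */
		for (int i = 0; i < BSIZE; i++)
			dst[i] = ks[i] ^ src[i];
		memcpy(iv, src, BSIZE);	/* next IV = this ciphertext block */
	}
}

int main(void)
{
	uint8_t iv[BSIZE] = "fixed-demo-iv..";
	uint8_t pt[32] = "CFB roundtrip demonstration!!!";
	uint8_t ct[32], out[32];

	cfb_encrypt(pt, ct, 2, iv);
	cfb_decrypt(ct, out, 2, iv);
	printf("roundtrip %s\n", memcmp(pt, out, 32) ? "FAILED" : "ok");
	return 0;
}

The buggy version XORed E_k(iv) with the IV itself, producing garbage for every block; as noted above, the kernel's in-place path is a separate loop that was already correct, which is why the TPM code (always in-place) never tripped over it.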
Re: [PATCH v4 2/7] tpm2-sessions: Add full HMAC and encrypt/decrypt session handling
On Wed, 24 Oct 2018, James Bottomley wrote:
> > > +static void KDFa(u8 *key, int keylen, const char *label, u8 *u,
> > > +		 u8 *v, int bytes, u8 *out)
> >
> > Should this be in lower case? I would rename it as tpm_kdfa().
>
> This one is defined as KDFa() in the standards and it's not TPM
> specific (although some standards refer to it as KDFA). I'm not averse
> to making them tpm_kdfe() and tpm_kdfa() but I was hoping that one day
> the crypto subsystem would need them and we could move them in there
> because KDFs are the new shiny in crypto primitives (TLS 1.2 started
> using them, for instance).

I care more about tracing and debugging than naming, and having 'tpm_' in
front of every TPM function makes tracing a lean process. AFAIK using
upper case letters is against kernel coding conventions. I'm not sure why
this would be an exception to that.

> > Why doesn't it matter here?
>
> Because, as the comment says, it eventually gets overwritten by running
> ecdh to derive the two co-ordinates. (pointers to these two
> uninitialized areas are passed into the ecdh destination sg list).

Oh, I just misunderstood the comment. Wouldn't it be easier to say that
the data is initialized later?

> > > +	buf_len = crypto_ecdh_key_len(&p);
> > > +	if (sizeof(encoded_key) < buf_len) {
> > > +		dev_err(&chip->dev, "salt buffer too small needs %d\n",
> > > +			buf_len);
> > > +		goto out;
> > > +	}
> >
> > In what situation can this happen? Can sizeof(encoded_key) >= buf_len?
>
> Yes, but only if someone is trying to crack your ecdh. One of the
> security issues in ecdh is if someone makes a very specific point
> choice (usually in the cofactor space) that has a very short period,
> the attacker can guess the input to KDFe. In this case if TPM genie
> provided a specially crafted attack EC point, we'd detect it here
> because the resulting buffer would be too short.

Right. Thank you for the explanation. Here some kind of comment might not
be a bad idea...

> > In general this function should have a clear explanation of what it
> > does, and maybe fewer of these one-character variables in favour of
> > variables with more documenting names. Explain at a high level what
> > algorithms are used and how the salt is calculated.
>
> I'll try, but this is a rather complex function.

Understood. I do not expect perfection here and we can improve the
documentation later on.

For anyone wanting to review James' patches without much experience on EC,
I recommend reading this article:

https://arstechnica.com/information-technology/2013/10/a-relatively-easy-to-understand-primer-on-elliptic-curve-cryptography/

I read it a few years ago and refreshed my memory a few days ago by
re-reading it.

> > > +/**
> > > + * tpm_buf_append_hmac_session() append a TPM session element
> > > + * @buf: The buffer to be appended
> > > + * @auth: the auth structure allocated by tpm2_start_auth_session()
> > > + * @attributes: The session attributes
> > > + * @passphrase: The session authority (NULL if none)
> > > + * @passphraselen: The length of the session authority (0 if none)
> >
> > The alignment.
>
> the alignment of what?

We generally have parameter descriptions tab-aligned.

> > Why would there be trailing zeros?
>
> Because TPM 1.2 mandated zero padded fixed size passphrases so the TPM
> 2.0 standard specifies a way of converting these to variable size
> strings by eliminating the zero padding.

Ok.

James, I'm also looking forward to the CONTEXT_GAP patch based on
yesterday's discussion. We do want it and I was stupid not to take it a
couple of years ago :-)

Thanks.

/Jarkko
Re: [PATCH v4 0/7] add integrity and security to TPM2 transactions
On Wed, 24 Oct 2018, James Bottomley wrote:
> On Wed, 2018-10-24 at 02:51 +0300, Jarkko Sakkinen wrote:
> > I would consider sending first a patch set that would iterate the
> > existing session stuff to be ready for this, i.e. merge in two
> > iterations (emphasis on the word "consider"). We can probably merge
> > the groundwork quite fast.
>
> I realise we're going to have merge conflicts on the later ones, so
> why don't we do this: I'll still send as one series, but you apply the
> ones you think are precursors and I'll rebase and resend the rest?
>
> James

Works for me, and now I think, after yesterday's discussions etc., that
this should be merged as one series.

/Jarkko
Re: [PATCH v4 2/7] tpm2-sessions: Add full HMAC and encrypt/decrypt session handling
On Wed, 2018-10-24 at 02:48 +0300, Jarkko Sakkinen wrote: > On Mon, 22 Oct 2018, James Bottomley wrote: > > [...] I'll tidy up the descriptions. > These all sould be combined with the existing session stuff inside > tpm2-cmd.c and not have duplicate infrastructures. The file name > should be tpm2-session.c (we neither have tpm2-cmds.c). You mean move tpm2_buf_append_auth() into the new sessions file as well ... sure, I can do that. [...] > > + > > +/* > > + * assume hash sha256 and nonces u, v of size SHA256_DIGEST_SIZE > > but > > + * otherwise standard KDFa. Note output is in bytes not bits. > > + */ > > +static void KDFa(u8 *key, int keylen, const char *label, u8 *u, > > +u8 *v, int bytes, u8 *out) > > Should this be in lower case? I would rename it as tpm_kdfa(). This one is defined as KDFa() in the standards and it's not TPM specific (although some standards refer to it as KDFA). I'm not averse to making them tpm_kdfe() and tpm_kdfa() but I was hoping that one day the crypto subsystem would need them and we could move them in there because KDFs are the new shiny in crypto primitives (TLS 1.2 started using them, for instance). > > +{ > > + u32 counter; > > + const __be32 bits = cpu_to_be32(bytes * 8); > > + > > + for (counter = 1; bytes > 0; bytes -= SHA256_DIGEST_SIZE, > > counter++, > > +out += SHA256_DIGEST_SIZE) { > > Only one counter is actually used for anything so this is overly > complicated and IMHO it is ok to call the counter just 'i'. Maybe > just: > > for (i = 1; (bytes - (i - 1) * SHA256_DIGEST_SIZE) > 0; i++) { > > > + SHASH_DESC_ON_STACK(desc, sha256_hash); > > + __be32 c = cpu_to_be32(counter); > > + > > + hmac_init(desc, key, keylen); > > + crypto_shash_update(desc, (u8 *), sizeof(c)); > > + crypto_shash_update(desc, label, strlen(label)+1); > > + crypto_shash_update(desc, u, SHA256_DIGEST_SIZE); > > + crypto_shash_update(desc, v, SHA256_DIGEST_SIZE); > > + crypto_shash_update(desc, (u8 *), > > sizeof(bits)); > > + hmac_final(desc, key, keylen, out); > > + } > > +} > > + > > +/* > > + * Somewhat of a bastardization of the real KDFe. We're assuming > > + * we're working with known point sizes for the input parameters > > and > > + * the hash algorithm is fixed at sha256. Because we know that > > the > > + * point size is 32 bytes like the hash size, there's no need to > > loop > > + * in this KDF. > > + */ > > +static void KDFe(u8 z[EC_PT_SZ], const char *str, u8 *pt_u, u8 > > *pt_v, > > +u8 *keyout) > > +{ > > + SHASH_DESC_ON_STACK(desc, sha256_hash); > > + /* > > +* this should be an iterative counter, but because we > > know > > +* we're only taking 32 bytes for the point using a > > sha256 > > +* hash which is also 32 bytes, there's only one loop > > +*/ > > + __be32 c = cpu_to_be32(1); > > + > > + desc->tfm = sha256_hash; > > + desc->flags = CRYPTO_TFM_REQ_MAY_SLEEP; > > + > > + crypto_shash_init(desc); > > + /* counter (BE) */ > > + crypto_shash_update(desc, (u8 *), sizeof(c)); > > + /* secret value */ > > + crypto_shash_update(desc, z, EC_PT_SZ); > > + /* string including trailing zero */ > > + crypto_shash_update(desc, str, strlen(str)+1); > > + crypto_shash_update(desc, pt_u, EC_PT_SZ); > > + crypto_shash_update(desc, pt_v, EC_PT_SZ); > > + crypto_shash_final(desc, keyout); > > +} > > + > > +static void tpm_buf_append_salt(struct tpm_buf *buf, struct > > tpm_chip *chip, > > + struct tpm2_auth *auth) > > Given the complexity of this function and some not that obvious > choices in the implementation (coordinates), it would make sense to > document this function. 
I'll try to beef up the salting description > > +{ > > + struct crypto_kpp *kpp; > > + struct kpp_request *req; > > + struct scatterlist s[2], d[1]; > > + struct ecdh p = {0}; > > + u8 encoded_key[EC_PT_SZ], *x, *y; > > Why you use one character variable name 'p' and longer name > 'encoded_key'? > > > + unsigned int buf_len; > > + u8 *secret; > > + > > + secret = kmalloc(EC_PT_SZ, GFP_KERNEL); > > + if (!secret) > > + return; > > + > > + p.curve_id = ECC_CURVE_NIST_P256; > > Could this be set already in the initialization? I'm never sure about designated initializers, but I think, after looking them up again, it will zero fill unmentioned elements. > > + > > + /* secret is two sized points */ > > + tpm_buf_append_u16(buf, (EC_PT_SZ + 2)*2); > > White space missing. Should be "(EC_PT_SZ + 2) * 2". The comment is a > bit obscure (maybe, do not have any specific suggestion how to make > it less obscure). > > > + /* > > +* we cheat here and append uninitialized data to form > > +* the points. All we care about is getting the two > > +* co-ordinate pointers, which will be used to overwrite > > +* the uninitialized data > > +*/ > >
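For readers following the KDFa discussion above, the construction (TPM 2.0's counter-mode KDF per SP800-108) can be sketched in plain userspace C roughly as follows. This is an illustration against OpenSSL's HMAC, not the kernel code; as in the patch, the hash is fixed at SHA-256 with 32-byte nonces, and the label is assumed shorter than 64 bytes. Note the 0x00 separator comes from copying the label's trailing NUL:

#include <stdint.h>
#include <string.h>
#include <arpa/inet.h>		/* htonl */
#include <openssl/evp.h>
#include <openssl/hmac.h>

#define SHA256_SZ 32

/* KDFa(key, label, u, v, bytes) with hash fixed at SHA-256:
 * out = HMAC(key, BE32(counter) || label || 0x00 || u || v || BE32(bits))
 * concatenated for counter = 1, 2, ... until "bytes" bytes are produced.
 */
static void kdfa(const uint8_t *key, int keylen, const char *label,
		 const uint8_t u[SHA256_SZ], const uint8_t v[SHA256_SZ],
		 int bytes, uint8_t *out)
{
	uint32_t bits = htonl(bytes * 8);	/* total bits, fixed up front */
	uint32_t counter;

	for (counter = 1; bytes > 0;
	     counter++, bytes -= SHA256_SZ, out += SHA256_SZ) {
		uint8_t buf[4 + 64 + 1 + 2 * SHA256_SZ + 4], tag[SHA256_SZ];
		uint32_t c = htonl(counter);
		size_t n = 0;

		memcpy(buf + n, &c, 4); n += 4;
		memcpy(buf + n, label, strlen(label) + 1); n += strlen(label) + 1;
		memcpy(buf + n, u, SHA256_SZ); n += SHA256_SZ;
		memcpy(buf + n, v, SHA256_SZ); n += SHA256_SZ;
		memcpy(buf + n, &bits, 4); n += 4;

		HMAC(EVP_sha256(), key, keylen, buf, n, tag, NULL);
		memcpy(out, tag,
		       bytes < SHA256_SZ ? (size_t)bytes : SHA256_SZ);
	}
}

The kernel version in the patch is the same loop built on the shash API; KDFe differs only in hashing the shared secret and the two EC points with a counter instead of keying an HMAC.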
Re: [PATCH v4 2/7] tpm2-sessions: Add full HMAC and encrypt/decrypt session handling
On Tue, 23 Oct 2018, Ard Biesheuvel wrote:
> On 23 October 2018 at 04:01, James Bottomley wrote:
> > On Mon, 2018-10-22 at 19:19 -0300, Ard Biesheuvel wrote:
> > [...]
> > > > +static void hmac_init(struct shash_desc *desc, u8 *key, int keylen)
> > > > +{
> > > > +	u8 pad[SHA256_BLOCK_SIZE];
> > > > +	int i;
> > > > +
> > > > +	desc->tfm = sha256_hash;
> > > > +	desc->flags = CRYPTO_TFM_REQ_MAY_SLEEP;
> > >
> > > I don't think this actually does anything in the shash API
> > > implementation, so you can drop this.
> >
> > OK, I find crypto somewhat hard to follow. There were bits I had to
> > understand, like when I wrote the CFB implementation or when I fixed
> > the ECDH scatterlist handling, but I've got to confess, in time
> > honoured tradition I simply copied this from EVM crypto without
> > actually digging into the code to understand why.
>
> Yeah, it is notoriously hard to use, and we should try to improve that.

James, I would hope (as already said in my review) for longer than
one-character variable names for most of the stuff. I did not quite
understand why you decided to use 'counter' for the obvious counter
variable and one-character names for the non-obvious stuff :-) I'm also
not sure where the 'encoded' exactly comes from in the variable name
'encoded_key', especially in the context of these cryptic names.

/Jarkko
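For contrast with the open-coded hmac_init()/hmac_final() pair being reviewed, the conventional shash route, where the key lives in the tfm and no desc->flags are needed, looks roughly like this (a sketch with a hypothetical helper name, not code from the series):

#include <crypto/hash.h>
#include <linux/err.h>

/* Sketch: one-shot HMAC-SHA256 of "data" using the shash API. */
static int hmac_sha256_example(const u8 *key, unsigned int keylen,
			       const u8 *data, unsigned int len, u8 *digest)
{
	struct crypto_shash *tfm;
	int err;

	tfm = crypto_alloc_shash("hmac(sha256)", 0, 0);
	if (IS_ERR(tfm))
		return PTR_ERR(tfm);

	err = crypto_shash_setkey(tfm, key, keylen);
	if (!err) {
		SHASH_DESC_ON_STACK(desc, tfm);

		desc->tfm = tfm;
		err = crypto_shash_digest(desc, data, len, digest);
		shash_desc_zero(desc);
	}

	crypto_free_shash(tfm);
	return err;
}

The series open-codes the ipad/opad scheme instead because it reuses a bare sha256 tfm across differently-keyed HMAC invocations; the sketch above is the shape the code would take if it allocated a dedicated hmac(sha256) tfm per key.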
Re: [PATCH v4 0/7] add integrity and security to TPM2 transactions
On Wed, 2018-10-24 at 02:51 +0300, Jarkko Sakkinen wrote:
> I would consider sending first a patch set that would iterate the
> existing session stuff to be ready for this, i.e. merge in two
> iterations (emphasis on the word "consider"). We can probably merge
> the groundwork quite fast.

I realise we're going to have merge conflicts on the later ones, so why
don't we do this: I'll still send as one series, but you apply the ones
you think are precursors and I'll rebase and resend the rest?

James
Re: [PATCH v4 5/7] trusted keys: Add session encryption protection to the seal/unseal path
The tag in the short description does not look at all. Should be either "tpm:" or "keys, trusted:". On Mon, 22 Oct 2018, James Bottomley wrote: If some entity is snooping the TPM bus, the can see the data going in to be sealed and the data coming out as it is unsealed. Add parameter and response encryption to these cases to ensure that no secrets are leaked even if the bus is snooped. As part of doing this conversion it was discovered that policy sessions can't work with HMAC protected authority because of missing pieces (the tpm Nonce). I've added code to work the same way as before, which will result in potential authority exposure (while still adding security for the command and the returned blob), and a fixme to redo the API to get rid of this security hole. Signed-off-by: James Bottomley --- drivers/char/tpm/tpm2-cmd.c | 155 1 file changed, 98 insertions(+), 57 deletions(-) diff --git a/drivers/char/tpm/tpm2-cmd.c b/drivers/char/tpm/tpm2-cmd.c index 22f1c7bee173..a8655cd535d1 100644 --- a/drivers/char/tpm/tpm2-cmd.c +++ b/drivers/char/tpm/tpm2-cmd.c @@ -425,7 +425,9 @@ int tpm2_seal_trusted(struct tpm_chip *chip, { unsigned int blob_len; struct tpm_buf buf; + struct tpm_buf t2b; u32 hash; + struct tpm2_auth *auth; int i; int rc; @@ -439,45 +441,56 @@ int tpm2_seal_trusted(struct tpm_chip *chip, if (i == ARRAY_SIZE(tpm2_hash_map)) return -EINVAL; - rc = tpm_buf_init(, TPM2_ST_SESSIONS, TPM2_CC_CREATE); + rc = tpm2_start_auth_session(chip, ); if (rc) return rc; - tpm_buf_append_u32(, options->keyhandle); - tpm2_buf_append_auth(, TPM2_RS_PW, -NULL /* nonce */, 0, -0 /* session_attributes */, -options->keyauth /* hmac */, -TPM_DIGEST_SIZE); + rc = tpm_buf_init(, TPM2_ST_SESSIONS, TPM2_CC_CREATE); + if (rc) { + tpm2_end_auth_session(auth); + return rc; + } + + rc = tpm_buf_init_2b(); + if (rc) { + tpm_buf_destroy(); + tpm2_end_auth_session(auth); + return rc; + } + tpm_buf_append_name(, auth, options->keyhandle, NULL); + tpm_buf_append_hmac_session(, auth, TPM2_SA_DECRYPT, + options->keyauth, TPM_DIGEST_SIZE); /* sensitive */ - tpm_buf_append_u16(, 4 + TPM_DIGEST_SIZE + payload->key_len + 1); + tpm_buf_append_u16(, TPM_DIGEST_SIZE); + tpm_buf_append(, options->blobauth, TPM_DIGEST_SIZE); + tpm_buf_append_u16(, payload->key_len + 1); + tpm_buf_append(, payload->key, payload->key_len); + tpm_buf_append_u8(, payload->migratable); - tpm_buf_append_u16(, TPM_DIGEST_SIZE); - tpm_buf_append(, options->blobauth, TPM_DIGEST_SIZE); - tpm_buf_append_u16(, payload->key_len + 1); - tpm_buf_append(, payload->key, payload->key_len); - tpm_buf_append_u8(, payload->migratable); + tpm_buf_append_2b(, ); /* public */ - tpm_buf_append_u16(, 14 + options->policydigest_len); - tpm_buf_append_u16(, TPM2_ALG_KEYEDHASH); - tpm_buf_append_u16(, hash); + tpm_buf_append_u16(, TPM2_ALG_KEYEDHASH); + tpm_buf_append_u16(, hash); /* policy */ if (options->policydigest_len) { - tpm_buf_append_u32(, 0); - tpm_buf_append_u16(, options->policydigest_len); - tpm_buf_append(, options->policydigest, + tpm_buf_append_u32(, 0); + tpm_buf_append_u16(, options->policydigest_len); + tpm_buf_append(, options->policydigest, options->policydigest_len); } else { - tpm_buf_append_u32(, TPM2_OA_USER_WITH_AUTH); - tpm_buf_append_u16(, 0); + tpm_buf_append_u32(, TPM2_OA_USER_WITH_AUTH); + tpm_buf_append_u16(, 0); } /* public parameters */ - tpm_buf_append_u16(, TPM2_ALG_NULL); - tpm_buf_append_u16(, 0); + tpm_buf_append_u16(, TPM2_ALG_NULL); + /* unique (zero) */ + tpm_buf_append_u16(, 0); + + tpm_buf_append_2b(, ); /* outside info */ 
tpm_buf_append_u16(, 0); @@ -490,8 +503,11 @@ int tpm2_seal_trusted(struct tpm_chip *chip, goto out; } - rc = tpm_transmit_cmd(chip, NULL, buf.data, PAGE_SIZE, 4, 0, - "sealing data"); + tpm_buf_fill_hmac_session(, auth); + + rc = tpm_transmit_cmd(chip, >kernel_space, buf.data, + PAGE_SIZE, 4, 0, "sealing data"); + rc = tpm_buf_check_hmac_response(, auth, rc); if (rc) goto out; @@ -509,6 +525,7 @@ int tpm2_seal_trusted(struct tpm_chip *chip, payload->blob_len = blob_len;
Re: [PATCH v4 0/7] add integrity and security to TPM2 transactions
I would consider sending first a patch set that would iterate the existing
session stuff to be ready for this, i.e. merge in two iterations (emphasis
on the word "consider"). We can probably merge the groundwork quite fast.

/Jarkko

On Mon, 22 Oct 2018, James Bottomley wrote:
> By now, everybody knows we have a problem with the TPM2_RS_PW easy
> button on TPM2 in that transactions on the TPM bus can be intercepted
> and altered. The way to fix this is to use real sessions for HMAC
> capabilities to ensure integrity and to use parameter and response
> encryption to ensure confidentiality of the data flowing over the TPM
> bus.
>
> This patch series is about adding a simple API which can ensure the
> above properties as a layered addition to the existing TPM handling
> code. This series now includes protections for PCR extend, getting
> random numbers from the TPM and data sealing and unsealing. It
> therefore eliminates all uses of TPM2_RS_PW in the kernel and adds
> encryption protection to sensitive data flowing into and out of the
> TPM.
>
> In the third version I added data sealing and unsealing protection,
> apart from one API-based problem: with the way trusted keys are
> currently protected it's not possible to HMAC-protect an authority that
> comes with a policy, so the API will have to be extended to fix that
> case.
>
> In this fourth version, I tidy up some of the code and add more
> security features, the most notable being that we now calculate the
> NULL seed name and compare our calculation to the value returned in
> TPM2_ReadPublic, which means we now can't be spoofed. This version also
> gives a sysfs variable for the null seed which userspace can use to run
> a key certification operation to prove that the TPM was always secure
> when communicating with the kernel.
>
> I've verified this using the test suite in the last patch on a VM
> connected to a tpm2 emulator. I also instrumented the emulator to make
> sure the sensitive data was properly encrypted.
>
> James
>
> ---
>
> James Bottomley (7):
>   tpm-buf: create new functions for handling TPM buffers
>   tpm2-sessions: Add full HMAC and encrypt/decrypt session handling
>   tpm2: add hmac checks to tpm2_pcr_extend()
>   tpm2: add session encryption protection to tpm2_get_random()
>   trusted keys: Add session encryption protection to the seal/unseal path
>   tpm: add the null key name as a tpm2 sysfs variable
>   tpm2-sessions: NOT FOR COMMITTING add sessions testing
>
>  drivers/char/tpm/Kconfig              |    3 +
>  drivers/char/tpm/Makefile             |    3 +-
>  drivers/char/tpm/tpm-buf.c            |  191 ++
>  drivers/char/tpm/tpm-chip.c           |    1 +
>  drivers/char/tpm/tpm-sysfs.c          |   27 +-
>  drivers/char/tpm/tpm.h                |  129 ++--
>  drivers/char/tpm/tpm2-cmd.c           |  248 ---
>  drivers/char/tpm/tpm2-sessions-test.c |  360 ++
>  drivers/char/tpm/tpm2-sessions.c      | 1188 +
>  drivers/char/tpm/tpm2-sessions.h      |   57 ++
>  10 files changed, 2027 insertions(+), 180 deletions(-)
>
>  create mode 100644 drivers/char/tpm/tpm-buf.c
>  create mode 100644 drivers/char/tpm/tpm2-sessions-test.c
>  create mode 100644 drivers/char/tpm/tpm2-sessions.c
>  create mode 100644 drivers/char/tpm/tpm2-sessions.h
>
> --
> 2.16.4