[PATCH V3 6/6] crypto/nx: Add P9 NX support for 842 compression engine
This patch adds P9 NX support for the 842 compression engine. The Virtual Accelerator Switchboard (VAS) is used to access the 842 engine on P9.

For each NX engine per chip, set up a receive window using vas_rx_win_open(), which configures the RxFIFO with the FIFO address, lpid, pid and tid values. This unique (lpid, pid, tid) combination is used to identify the target engine.

For a crypto open request, open a send window on the NX engine for the corresponding chip / CPU where the open request is executed. This send window is closed upon the crypto close request.

NX provides high and normal priority FIFOs. For compression / decompression requests, we use only the high priority FIFOs in the kernel.

Each NX request is communicated to VAS using copy/paste instructions with the vas_copy_crb() / vas_paste_crb() functions.

Signed-off-by: Haren Myneni
---
 drivers/crypto/nx/Kconfig          |   1 +
 drivers/crypto/nx/nx-842-powernv.c | 375 -
 drivers/crypto/nx/nx-842.c         |   2 +-
 3 files changed, 371 insertions(+), 7 deletions(-)

diff --git a/drivers/crypto/nx/Kconfig b/drivers/crypto/nx/Kconfig
index ad7552a6998c..cd5dda9c48f4 100644
--- a/drivers/crypto/nx/Kconfig
+++ b/drivers/crypto/nx/Kconfig
@@ -38,6 +38,7 @@ config CRYPTO_DEV_NX_COMPRESS_PSERIES
 config CRYPTO_DEV_NX_COMPRESS_POWERNV
 	tristate "Compression acceleration support on PowerNV platform"
 	depends on PPC_POWERNV
+	depends on PPC_VAS
 	default y
 	help
 	  Support for PowerPC Nest (NX) compression acceleration. This

diff --git a/drivers/crypto/nx/nx-842-powernv.c b/drivers/crypto/nx/nx-842-powernv.c
index c0dd4c7e17d3..13089a0b9dfa 100644
--- a/drivers/crypto/nx/nx-842-powernv.c
+++ b/drivers/crypto/nx/nx-842-powernv.c
@@ -23,6 +23,7 @@
 #include
 #include
 #include
+#include

 MODULE_LICENSE("GPL");
 MODULE_AUTHOR("Dan Streetman ");
@@ -32,6 +33,9 @@ MODULE_ALIAS_CRYPTO("842-nx");

 #define WORKMEM_ALIGN	(CRB_ALIGN)
 #define CSB_WAIT_MAX	(5000) /* ms */
+#define VAS_RETRIES	(10)
+/* # of requests allowed per RxFIFO at a time. 0 for unlimited */
+#define MAX_CREDITS_PER_RXFIFO	(1024)

 struct nx842_workmem {
 	/* Below fields must be properly aligned */
@@ -42,16 +46,27 @@ struct nx842_workmem {

 	ktime_t start;

+	struct vas_window *txwin;	/* Used with VAS function */
 	char padding[WORKMEM_ALIGN]; /* unused, to allow alignment */
 } __packed __aligned(WORKMEM_ALIGN);

 struct nx842_coproc {
 	unsigned int chip_id;
 	unsigned int ct;
-	unsigned int ci;
+	unsigned int ci;	/* Coprocessor instance, used with icswx */
+	struct {
+		struct vas_window *rxwin;
+		int id;
+	} vas;
 	struct list_head list;
 };

+/*
+ * Send the request to NX engine on the chip for the corresponding CPU
+ * where the process is executing. Use with VAS function.
+ */
+static DEFINE_PER_CPU(struct nx842_coproc *, coproc_inst);
+
 /* no cpu hotplug on powernv, so this list never changes after init */
 static LIST_HEAD(nx842_coprocs);
 static unsigned int nx842_ct;	/* used in icswx function */
@@ -513,6 +528,105 @@ static int nx842_exec_icswx(const unsigned char *in, unsigned int inlen,
 }

 /**
+ * nx842_exec_vas - compress/decompress data using the 842 algorithm
+ *
+ * (De)compression provided by the NX842 coprocessor on IBM PowerNV systems.
+ * This compresses or decompresses the provided input buffer into the provided
+ * output buffer.
+ *
+ * Upon return from this function @outlen contains the length of the
+ * output data. If there is an error then @outlen will be 0 and an
+ * error will be specified by the return code from this function.
+ *
+ * The @workmem buffer should only be used by one function call at a time.
+ *
+ * @in: input buffer pointer
+ * @inlen: input buffer size
+ * @out: output buffer pointer
+ * @outlenp: output buffer size pointer
+ * @workmem: working memory buffer pointer, size determined by
+ *           nx842_powernv_driver.workmem_size
+ * @fc: function code, see CCW Function Codes in nx-842.h
+ *
+ * Returns:
+ *   0		Success, output of length @outlenp stored in the buffer at @out
+ *   -ENODEV	Hardware unavailable
+ *   -ENOSPC	Output buffer is too small
+ *   -EMSGSIZE	Input buffer too large
+ *   -EINVAL	buffer constraints do not fit nx842_constraints
+ *   -EPROTO	hardware error during operation
+ *   -ETIMEDOUT	hardware did not complete operation in reasonable time
+ *   -EINTR	operation was aborted
+ */
+static int nx842_exec_vas(const unsigned char *in, unsigned int inlen,
+			  unsigned char *out, unsigned int *outlenp,
+			  void *workmem, int fc)
+{
+	struct coprocessor_request_block *crb;
+	struct coprocessor_status_block *csb;
+	struct nx842_workmem *wmem;
+	struct vas_window *txwin;
+
[PATCH V3 5/6] crypto/nx: Add P9 NX specific error codes for 842 engine
This patch adds checks for P9-specific 842 engine error codes. These errors are reported in the coprocessor status block (CSB) on failures.

Signed-off-by: Haren Myneni
---
 arch/powerpc/include/asm/icswx.h   |  3 +++
 drivers/crypto/nx/nx-842-powernv.c | 18 ++
 drivers/crypto/nx/nx-842.h         |  8
 3 files changed, 29 insertions(+)

diff --git a/arch/powerpc/include/asm/icswx.h b/arch/powerpc/include/asm/icswx.h
index 27e588f6c72e..6a2c87577541 100644
--- a/arch/powerpc/include/asm/icswx.h
+++ b/arch/powerpc/include/asm/icswx.h
@@ -69,7 +69,10 @@ struct coprocessor_completion_block {
 #define CSB_CC_WR_PROTECTION	(16)
 #define CSB_CC_UNKNOWN_CODE	(17)
 #define CSB_CC_ABORT		(18)
+#define CSB_CC_EXCEED_BYTE_COUNT	(19)	/* P9 or later */
 #define CSB_CC_TRANSPORT	(20)
+#define CSB_CC_INVALID_CRB	(21)	/* P9 or later */
+#define CSB_CC_INVALID_DDE	(30)	/* P9 or later */
 #define CSB_CC_SEGMENTED_DDL	(31)
 #define CSB_CC_PROGRESS_POINT	(32)
 #define CSB_CC_DDE_OVERFLOW	(33)

diff --git a/drivers/crypto/nx/nx-842-powernv.c b/drivers/crypto/nx/nx-842-powernv.c
index 829b5cad0043..c0dd4c7e17d3 100644
--- a/drivers/crypto/nx/nx-842-powernv.c
+++ b/drivers/crypto/nx/nx-842-powernv.c
@@ -243,6 +243,13 @@ static int wait_for_csb(struct nx842_workmem *wmem,
 	case CSB_CC_TEMPL_OVERFLOW:
 		CSB_ERR(csb, "Compressed data template shows data past end");
 		return -EINVAL;
+	case CSB_CC_EXCEED_BYTE_COUNT:	/* P9 or later */
+		/*
+		 * DDE byte count exceeds the limit specified in Maximum
+		 * byte count register.
+		 */
+		CSB_ERR(csb, "DDE byte count exceeds the limit");
+		return -EINVAL;

 	/* these should not happen */
 	case CSB_CC_INVALID_ALIGN:
@@ -284,9 +291,17 @@ static int wait_for_csb(struct nx842_workmem *wmem,
 		CSB_ERR(csb, "Too many DDEs in DDL");
 		return -EINVAL;
 	case CSB_CC_TRANSPORT:
+	case CSB_CC_INVALID_CRB:	/* P9 or later */
 		/* shouldn't happen, we setup CRB correctly */
 		CSB_ERR(csb, "Invalid CRB");
 		return -EINVAL;
+	case CSB_CC_INVALID_DDE:	/* P9 or later */
+		/*
+		 * shouldn't happen, setup_direct/indirect_dde creates
+		 * DDE right
+		 */
+		CSB_ERR(csb, "Invalid DDE");
+		return -EINVAL;
 	case CSB_CC_SEGMENTED_DDL:
 		/* shouldn't happen, setup_ddl creates DDL right */
 		CSB_ERR(csb, "Segmented DDL error");
@@ -330,6 +345,9 @@ static int wait_for_csb(struct nx842_workmem *wmem,
 	case CSB_CC_HW:
 		CSB_ERR(csb, "Correctable hardware error");
 		return -EPROTO;
+	case CSB_CC_HW_EXPIRED_TIMER:	/* P9 or later */
+		CSB_ERR(csb, "Job did not finish within allowed time");
+		return -EPROTO;

 	default:
 		CSB_ERR(csb, "Invalid CC %d", csb->cc);

diff --git a/drivers/crypto/nx/nx-842.h b/drivers/crypto/nx/nx-842.h
index 30929bd7d1a9..bb2f31792683 100644
--- a/drivers/crypto/nx/nx-842.h
+++ b/drivers/crypto/nx/nx-842.h
@@ -76,9 +76,17 @@
 #define CSB_CC_DECRYPT_OVERFLOW	(64)
 /* asym crypt codes */
 #define CSB_CC_MINV_OVERFLOW	(128)
+/*
+ * HW error - Job did not finish in the maximum time allowed.
+ * Job terminated.
+ */
+#define CSB_CC_HW_EXPIRED_TIMER	(224)
 /* These are reserved for hypervisor use */
 #define CSB_CC_HYP_RESERVE_START	(240)
 #define CSB_CC_HYP_RESERVE_END	(253)
+#define CSB_CC_HYP_RESERVE_P9_END	(251)
+/* No valid interrupt server (P9 or later). */
+#define CSB_CC_HYP_RESERVE_NO_INTR_SERVER	(252)
 #define CSB_CC_HYP_NO_HW	(254)
 #define CSB_CC_HYP_HANG_ABORTED	(255)
--
2.11.0
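The pattern above — mapping CSB completion codes to errno values in wait_for_csb() — can be modelled outside the kernel. This sketch uses the code values from the patch (19, 21, 30, 224) but is a standalone illustration, not the driver's full switch:

```c
#include <errno.h>

/* Completion-code values as added by this patch (asm/icswx.h) */
#define CSB_CC_SUCCESS			0
#define CSB_CC_EXCEED_BYTE_COUNT	19	/* P9 or later */
#define CSB_CC_INVALID_CRB		21	/* P9 or later */
#define CSB_CC_INVALID_DDE		30	/* P9 or later */
#define CSB_CC_HW_EXPIRED_TIMER		224	/* P9 or later */

/* Map a CSB completion code to a negative errno, mirroring the policy in
 * wait_for_csb(): malformed-request codes become -EINVAL, hardware-side
 * failures become -EPROTO. Unknown codes are treated as protocol errors. */
static int csb_cc_to_errno(int cc)
{
	switch (cc) {
	case CSB_CC_SUCCESS:
		return 0;
	case CSB_CC_EXCEED_BYTE_COUNT:	/* DDE byte count over the max */
	case CSB_CC_INVALID_CRB:	/* malformed CRB */
	case CSB_CC_INVALID_DDE:	/* malformed DDE */
		return -EINVAL;
	case CSB_CC_HW_EXPIRED_TIMER:	/* job exceeded its time budget */
	default:
		return -EPROTO;
	}
}
```

Note the asymmetry the patch preserves: request-construction errors are the caller's fault (-EINVAL), while the expired-timer case is a hardware condition (-EPROTO).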
[PATCH V3 3/6] crypto/nx: Create nx842_delete_coprocs function
Move the code that deletes the coprocessor info on exit or failure into nx842_delete_coprocs().

Signed-off-by: Haren Myneni
---
 drivers/crypto/nx/nx-842-powernv.c | 25 -
 1 file changed, 12 insertions(+), 13 deletions(-)

diff --git a/drivers/crypto/nx/nx-842-powernv.c b/drivers/crypto/nx/nx-842-powernv.c
index 1bd19e03eb7d..67dc06f9b557 100644
--- a/drivers/crypto/nx/nx-842-powernv.c
+++ b/drivers/crypto/nx/nx-842-powernv.c
@@ -593,6 +593,16 @@ static int __init nx842_powernv_probe(struct device_node *dn)
 	return 0;
 }

+static void nx842_delete_coprocs(void)
+{
+	struct nx842_coproc *coproc, *n;
+
+	list_for_each_entry_safe(coproc, n, &nx842_coprocs, list) {
+		list_del(&coproc->list);
+		kfree(coproc);
+	}
+}
+
 static struct nx842_constraints nx842_powernv_constraints = {
 	.alignment =	DDE_BUFFER_ALIGN,
 	.multiple =	DDE_BUFFER_LAST_MULT,
@@ -652,13 +662,7 @@ static __init int nx842_powernv_init(void)

 	ret = crypto_register_alg(&nx842_powernv_alg);
 	if (ret) {
-		struct nx842_coproc *coproc, *n;
-
-		list_for_each_entry_safe(coproc, n, &nx842_coprocs, list) {
-			list_del(&coproc->list);
-			kfree(coproc);
-		}
-
+		nx842_delete_coprocs();
 		return ret;
 	}

@@ -668,13 +672,8 @@ module_init(nx842_powernv_init);

 static void __exit nx842_powernv_exit(void)
 {
-	struct nx842_coproc *coproc, *n;
-
 	crypto_unregister_alg(&nx842_powernv_alg);

-	list_for_each_entry_safe(coproc, n, &nx842_coprocs, list) {
-		list_del(&coproc->list);
-		kfree(coproc);
-	}
+	nx842_delete_coprocs();
 }
 module_exit(nx842_powernv_exit);
--
2.11.0
[PATCH V3 4/6] crypto/nx: Add nx842_add_coprocs_list function
The code that updates the coprocessor list is moved to nx842_add_coprocs_list(). This function will be used by both the icswx and VAS paths.

Signed-off-by: Haren Myneni
---
 drivers/crypto/nx/nx-842-powernv.c | 12 +---
 1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/drivers/crypto/nx/nx-842-powernv.c b/drivers/crypto/nx/nx-842-powernv.c
index 67dc06f9b557..829b5cad0043 100644
--- a/drivers/crypto/nx/nx-842-powernv.c
+++ b/drivers/crypto/nx/nx-842-powernv.c
@@ -550,6 +550,14 @@ static int nx842_powernv_decompress(const unsigned char *in, unsigned int inlen,
 			wmem, CCW_FC_842_DECOMP_CRC);
 }

+static inline void nx842_add_coprocs_list(struct nx842_coproc *coproc,
+					  int chipid)
+{
+	coproc->chip_id = chipid;
+	INIT_LIST_HEAD(&coproc->list);
+	list_add(&coproc->list, &nx842_coprocs);
+}
+
 static int __init nx842_powernv_probe(struct device_node *dn)
 {
 	struct nx842_coproc *coproc;
@@ -576,11 +584,9 @@ static int __init nx842_powernv_probe(struct device_node *dn)
 	if (!coproc)
 		return -ENOMEM;

-	coproc->chip_id = chip_id;
 	coproc->ct = ct;
 	coproc->ci = ci;
-	INIT_LIST_HEAD(&coproc->list);
-	list_add(&coproc->list, &nx842_coprocs);
+	nx842_add_coprocs_list(coproc, chip_id);

 	pr_info("coprocessor found on chip %d, CT %d CI %d\n", chip_id, ct, ci);
--
2.11.0
[PATCH V3 2/6] crypto/nx: Create nx842_configure_crb function
The CRB setup code is moved into nx842_config_crb() so that it can be used by both the icswx and VAS exec functions. The VAS exec function will be added later with P9 support.

Signed-off-by: Haren Myneni
---
 drivers/crypto/nx/nx-842-powernv.c | 57 +-
 1 file changed, 38 insertions(+), 19 deletions(-)

diff --git a/drivers/crypto/nx/nx-842-powernv.c b/drivers/crypto/nx/nx-842-powernv.c
index 161987698bbc..1bd19e03eb7d 100644
--- a/drivers/crypto/nx/nx-842-powernv.c
+++ b/drivers/crypto/nx/nx-842-powernv.c
@@ -358,6 +358,40 @@ static int wait_for_csb(struct nx842_workmem *wmem,
 	return 0;
 }

+static int nx842_config_crb(const unsigned char *in, unsigned int inlen,
+			    unsigned char *out, unsigned int outlen,
+			    struct nx842_workmem *wmem)
+{
+	struct coprocessor_request_block *crb;
+	struct coprocessor_status_block *csb;
+	u64 csb_addr;
+	int ret;
+
+	crb = &wmem->crb;
+	csb = &crb->csb;
+
+	/* Clear any previous values */
+	memset(crb, 0, sizeof(*crb));
+
+	/* set up DDLs */
+	ret = setup_ddl(&crb->source, wmem->ddl_in,
+			(unsigned char *)in, inlen, true);
+	if (ret)
+		return ret;
+
+	ret = setup_ddl(&crb->target, wmem->ddl_out,
+			out, outlen, false);
+	if (ret)
+		return ret;
+
+	/* set up CRB's CSB addr */
+	csb_addr = nx842_get_pa(csb) & CRB_CSB_ADDRESS;
+	csb_addr |= CRB_CSB_AT;	/* Addrs are phys */
+	crb->csb_addr = cpu_to_be64(csb_addr);
+
+	return 0;
+}
+
 /**
  * nx842_exec_icswx - compress/decompress data using the 842 algorithm
  *
@@ -397,7 +431,6 @@ static int nx842_exec_icswx(const unsigned char *in, unsigned int inlen,
 	struct coprocessor_status_block *csb;
 	struct nx842_workmem *wmem;
 	int ret;
-	u64 csb_addr;
 	u32 ccw;
 	unsigned int outlen = *outlenp;

@@ -411,33 +444,19 @@ static int nx842_exec_icswx(const unsigned char *in, unsigned int inlen,
 		return -ENODEV;
 	}

-	crb = &wmem->crb;
-	csb = &crb->csb;
-
-	/* Clear any previous values */
-	memset(crb, 0, sizeof(*crb));
-
-	/* set up DDLs */
-	ret = setup_ddl(&crb->source, wmem->ddl_in,
-			(unsigned char *)in, inlen, true);
-	if (ret)
-		return ret;
-	ret = setup_ddl(&crb->target, wmem->ddl_out,
-			out, outlen, false);
+	ret = nx842_config_crb(in, inlen, out, outlen, wmem);
 	if (ret)
 		return ret;

+	crb = &wmem->crb;
+	csb = &crb->csb;
+
 	/* set up CCW */
 	ccw = 0;
 	ccw = SET_FIELD(CCW_CT, ccw, nx842_ct);
 	ccw = SET_FIELD(CCW_CI_842, ccw, 0); /* use 0 for hw auto-selection */
 	ccw = SET_FIELD(CCW_FC_842, ccw, fc);

-	/* set up CRB's CSB addr */
-	csb_addr = nx842_get_pa(csb) & CRB_CSB_ADDRESS;
-	csb_addr |= CRB_CSB_AT;	/* Addrs are phys */
-	crb->csb_addr = cpu_to_be64(csb_addr);
-
 	wmem->start = ktime_get();

 	/* do ICSWX */
--
2.11.0
[PATCH V3 1/6] crypto/nx842: Rename nx842_powernv_function as icswx function
Rename nx842_powernv_function to nx842_powernv_exec. nx842_powernv_exec points to nx842_exec_icswx for now, and will point to the VAS exec function that will be added later for P9 NX support.

Signed-off-by: Haren Myneni
---
 drivers/crypto/nx/nx-842-powernv.c | 20 +---
 1 file changed, 13 insertions(+), 7 deletions(-)

diff --git a/drivers/crypto/nx/nx-842-powernv.c b/drivers/crypto/nx/nx-842-powernv.c
index 3abb045cdba7..161987698bbc 100644
--- a/drivers/crypto/nx/nx-842-powernv.c
+++ b/drivers/crypto/nx/nx-842-powernv.c
@@ -54,7 +54,11 @@ struct nx842_coproc {

 /* no cpu hotplug on powernv, so this list never changes after init */
 static LIST_HEAD(nx842_coprocs);
-static unsigned int nx842_ct;
+static unsigned int nx842_ct;	/* used in icswx function */
+
+static int (*nx842_powernv_exec)(const unsigned char *in,
+				 unsigned int inlen, unsigned char *out,
+				 unsigned int *outlenp, void *workmem, int fc);

 /**
  * setup_indirect_dde - Setup an indirect DDE
@@ -355,7 +359,7 @@ static int wait_for_csb(struct nx842_workmem *wmem,
 }

 /**
- * nx842_powernv_function - compress/decompress data using the 842 algorithm
+ * nx842_exec_icswx - compress/decompress data using the 842 algorithm
  *
  * (De)compression provided by the NX842 coprocessor on IBM PowerNV systems.
  * This compresses or decompresses the provided input buffer into the provided
@@ -385,7 +389,7 @@ static int wait_for_csb(struct nx842_workmem *wmem,
  * -ETIMEDOUT	hardware did not complete operation in reasonable time
  * -EINTR	operation was aborted
  */
-static int nx842_powernv_function(const unsigned char *in, unsigned int inlen,
+static int nx842_exec_icswx(const unsigned char *in, unsigned int inlen,
 			    unsigned char *out, unsigned int *outlenp,
 			    void *workmem, int fc)
 {
@@ -489,13 +493,13 @@ static int nx842_powernv_function(const unsigned char *in, unsigned int inlen,
  * @workmem: working memory buffer pointer, size determined by
  *           nx842_powernv_driver.workmem_size
  *
- * Returns: see @nx842_powernv_function()
+ * Returns: see @nx842_powernv_exec()
  */
 static int nx842_powernv_compress(const unsigned char *in, unsigned int inlen,
 				  unsigned char *out, unsigned int *outlenp,
 				  void *wmem)
 {
-	return nx842_powernv_function(in, inlen, out, outlenp,
+	return nx842_powernv_exec(in, inlen, out, outlenp,
 				      wmem, CCW_FC_842_COMP_CRC);
 }

@@ -517,13 +521,13 @@ static int nx842_powernv_compress(const unsigned char *in, unsigned int inlen,
  * @workmem: working memory buffer pointer, size determined by
  *           nx842_powernv_driver.workmem_size
  *
- * Returns: see @nx842_powernv_function()
+ * Returns: see @nx842_powernv_exec()
  */
 static int nx842_powernv_decompress(const unsigned char *in, unsigned int inlen,
 				    unsigned char *out, unsigned int *outlenp,
 				    void *wmem)
 {
-	return nx842_powernv_function(in, inlen, out, outlenp,
+	return nx842_powernv_exec(in, inlen, out, outlenp,
 				      wmem, CCW_FC_842_DECOMP_CRC);
 }

@@ -625,6 +629,8 @@ static __init int nx842_powernv_init(void)
 	if (!nx842_ct)
 		return -ENODEV;

+	nx842_powernv_exec = nx842_exec_icswx;
+
 	ret = crypto_register_alg(&nx842_powernv_alg);
 	if (ret) {
 		struct nx842_coproc *coproc, *n;
--
2.11.0
[PATCH V3 0/6] Enable NX 842 compression engine on Power9
P9 introduces the Virtual Accelerator Switchboard (VAS) to communicate with the NX 842 engine. Previously, the icswx function was used to access NX.

On PowerNV systems, the NX-842 driver invokes VAS functions to configure the RxFIFO (receive window) for each NX engine. VAS uses this FIFO to communicate requests to NX. The kernel opens a send window, which is used to transfer compression/decompression requests to VAS, and maps the send window to the corresponding RxFIFO. copy/paste instructions are used to pass the CRB to VAS.

This patch series adds P9 NX support for the 842 compression engine. The first 4 patches reorganize the current code so that the VAS function can be added:
- nx842_powernv_function points to the VAS function if the VAS feature is available; otherwise the icswx function is used.
- Move the CRB configuration code into nx842_config_crb().
- In addition to freeing the coprocessor structs on initialization failure and exit, for VAS both the send and receive windows have to be closed.
- Move updating the coprocessor info list to nx842_add_coprocs_list().

The last 2 patches add the code that configures and invokes VAS, and check the P9 NX specific errors that are reported in the coprocessor status block (CSB) on failures.

Patches have been tested on a P9 DD1 system using the VAS changes, and on P8 HW to make sure there is no regression. This patchset depends on the VAS kernel changes:
https://lists.ozlabs.org/pipermail/linuxppc-dev/2017-May/158178.html

Thanks to Sukadev Bhattiprolu for his review, input and testing with the VAS changes. Also thanks to Michael Ellerman and Benjamin Herrenschmidt for their valuable guidance and comments.

Changelog[v3]
- Disable preemption around copy/paste, as Nicholas Piggin suggested.
- PTR_ALIGN for the workmem buffer, based on Ram Pai's comment.

Changelog[v2]
- Open/close send windows in nx842_powernv_crypto_init/exit_vas().
- Changes for the new device-tree NX properties such as the priority and compatible properties.
- Incorporated review comments from Michael Ellerman.
- Other minor issues found during HW testing.
Haren Myneni (6):
  crypto/nx842: Rename nx842_powernv_function as icswx function
  crypto/nx: Create nx842_configure_crb function
  crypto/nx: Create nx842_delete_coprocs function
  crypto/nx: Add nx842_add_coprocs_list function
  crypto/nx: Add P9 NX specific error codes for 842 engine
  crypto/nx: Add P9 NX support for 842 compression engine.

 arch/powerpc/include/asm/icswx.h   |   3 +
 drivers/crypto/nx/Kconfig          |   1 +
 drivers/crypto/nx/nx-842-powernv.c | 499 +
 drivers/crypto/nx/nx-842.c         |   2 +-
 drivers/crypto/nx/nx-842.h         |   8 +
 5 files changed, 465 insertions(+), 48 deletions(-)
--
2.11.0
Re: [PATCH V2 6/6] crypto/nx: Add P9 NX support for 842 compression engine
On 07/17/2017 11:53 PM, Ram Pai wrote:
> On Mon, Jul 17, 2017 at 04:50:38PM -0700, Haren Myneni wrote:
>>
>> This patch adds P9 NX support for 842 compression engine. Virtual
>> Accelerator Switchboard (VAS) is used to access 842 engine on P9.
>>
>> For each NX engine per chip, setup receive window using
>> vas_rx_win_open() which configures RxFIFO with FIFO address, lpid,
>> pid and tid values. This unique (lpid, pid, tid) combination will
>> be used to identify the target engine.
>>
>> For crypto open request, open send window on the NX engine for
>> the corresponding chip / cpu where the open request is executed.
>> This send window will be closed upon crypto close request.
>>
>> NX provides high and normal priority FIFOs. For compression /
>> decompression requests, we use only high priority FIFOs in kernel.
>>
>> Each NX request will be communicated to VAS using copy/paste
>> instructions with vas_copy_crb() / vas_paste_crb() functions.
>>
>> Signed-off-by: Haren Myneni
>> ---
>>  drivers/crypto/nx/Kconfig          |   1 +
>>  drivers/crypto/nx/nx-842-powernv.c | 369 -
>>  drivers/crypto/nx/nx-842.c         |   2 +-
>>  3 files changed, 365 insertions(+), 7 deletions(-)
>>
>> diff --git a/drivers/crypto/nx/Kconfig b/drivers/crypto/nx/Kconfig
>> index ad7552a6998c..cd5dda9c48f4 100644
>> --- a/drivers/crypto/nx/Kconfig
>> +++ b/drivers/crypto/nx/Kconfig
>> @@ -38,6 +38,7 @@ config CRYPTO_DEV_NX_COMPRESS_PSERIES
>>  config CRYPTO_DEV_NX_COMPRESS_POWERNV
>>  	tristate "Compression acceleration support on PowerNV platform"
>>  	depends on PPC_POWERNV
>> +	depends on PPC_VAS
>>  	default y
>>  	help
>>  	  Support for PowerPC Nest (NX) compression acceleration. This
>>
>> diff --git a/drivers/crypto/nx/nx-842-powernv.c b/drivers/crypto/nx/nx-842-powernv.c
>> index c0dd4c7e17d3..8d9d21420144 100644
>> --- a/drivers/crypto/nx/nx-842-powernv.c
>> +++ b/drivers/crypto/nx/nx-842-powernv.c
>> @@ -23,6 +23,7 @@
>>  #include
>>  #include
>>  #include
>> +#include
>>
>>  MODULE_LICENSE("GPL");
>>  MODULE_AUTHOR("Dan Streetman ");
>> @@ -32,6 +33,9 @@ MODULE_ALIAS_CRYPTO("842-nx");
>>
>>  #define WORKMEM_ALIGN	(CRB_ALIGN)
>>  #define CSB_WAIT_MAX	(5000) /* ms */
>> +#define VAS_RETRIES	(10)
>> +/* # of requests allowed per RxFIFO at a time. 0 for unlimited */
>> +#define MAX_CREDITS_PER_RXFIFO	(64)
>>
>>  struct nx842_workmem {
>>  	/* Below fields must be properly aligned */
>> @@ -42,16 +46,27 @@ struct nx842_workmem {
>>
>>  	ktime_t start;
>>
>> +	struct vas_window *txwin;	/* Used with VAS function */
>>  	char padding[WORKMEM_ALIGN]; /* unused, to allow alignment */
>>  } __packed __aligned(WORKMEM_ALIGN);
>>
>>  struct nx842_coproc {
>>  	unsigned int chip_id;
>>  	unsigned int ct;
>> -	unsigned int ci;
>> +	unsigned int ci;	/* Coprocessor instance, used with icswx */
>> +	struct {
>> +		struct vas_window *rxwin;
>> +		int id;
>> +	} vas;
>
> ci and vas are mutually exclusive. a few bytes could be saved by
> unionizing them?

We will have few coproc entries - NX engine per chip.

>
>>  	struct list_head list;
>>  };
>>
>> +/*
>> + * Send the request to NX engine on the chip for the corresponding CPU
>> + * where the process is executing. Use with VAS function.
>> + */
>> +static DEFINE_PER_CPU(struct nx842_coproc *, coproc_inst);
>> +
>>  /* no cpu hotplug on powernv, so this list never changes after init */
>>  static LIST_HEAD(nx842_coprocs);
>>  static unsigned int nx842_ct;	/* used in icswx function */
>> @@ -513,6 +528,108 @@ static int nx842_exec_icswx(const unsigned char *in, unsigned int inlen,
>>  }
>>
>>  /**
>> + * nx842_exec_vas - compress/decompress data using the 842 algorithm
>> + *
>> + * (De)compression provided by the NX842 coprocessor on IBM PowerNV systems.
>> + * This compresses or decompresses the provided input buffer into the
>> provided
>> + * output buffer.
>> + *
>> + * Upon return from this function @outlen contains the length of the
>> + * output data. If there is an error then @outlen will be 0 and an
>> + * error will be specified by the return code from this function.
>> + *
>> + * The @workmem buffer should only be used by one function call at a time.
>> + *
>> + * @in: input buffer pointer
>> + * @inlen: input buffer size
>> + * @out: output buffer pointer
>> + * @outlenp: output buffer size pointer
>> + * @workmem: working memory buffer pointer, size determined by
>> + *           nx842_powernv_driver.workmem_size
>> + * @fc: function code, see CCW Function Codes in nx-842.h
>> + *
>> + * Returns:
>> + *   0		Success, output of length @outlenp stored in the buffer
>> + *		at @out
>> + *   -ENODEV	Hardware unavailable
>> + *   -ENOSPC	Output buffer is too small
>> + *   -EMSGSIZE	Input buffer too large
>> + *   -EINVAL	buffer
[PATCH v2 4/4] crypto: ccp - Add XTS-AES-256 support for CCP version 5
Signed-off-by: Gary R Hook
---
 drivers/crypto/ccp/ccp-crypto-aes-xts.c | 16 +---
 drivers/crypto/ccp/ccp-crypto.h         |  2 +-
 drivers/crypto/ccp/ccp-ops.c            |  3 +++
 3 files changed, 17 insertions(+), 4 deletions(-)

diff --git a/drivers/crypto/ccp/ccp-crypto-aes-xts.c b/drivers/crypto/ccp/ccp-crypto-aes-xts.c
index 3c37794ffe2d..4a3fe4d5ac71 100644
--- a/drivers/crypto/ccp/ccp-crypto-aes-xts.c
+++ b/drivers/crypto/ccp/ccp-crypto-aes-xts.c
@@ -80,19 +80,24 @@ static int ccp_aes_xts_setkey(struct crypto_ablkcipher *tfm, const u8 *key,
 {
 	struct crypto_tfm *xfm = crypto_ablkcipher_tfm(tfm);
 	struct ccp_ctx *ctx = crypto_tfm_ctx(xfm);
+	unsigned int ccpversion = ccp_version();
 	int ret;

 	ret = xts_check_key(xfm, key, key_len);
 	if (ret)
 		return ret;

-	/* Only support 128-bit AES key with a 128-bit Tweak key,
-	 * otherwise use the fallback
+	/* Version 3 devices support 128-bit keys; version 5 devices can
+	 * accommodate 128- and 256-bit keys.
 	 */
 	switch (key_len) {
 	case AES_KEYSIZE_128 * 2:
 		memcpy(ctx->u.aes.key, key, key_len);
 		break;
+	case AES_KEYSIZE_256 * 2:
+		if (ccpversion > CCP_VERSION(3, 0))
+			memcpy(ctx->u.aes.key, key, key_len);
+		break;
 	}
 	ctx->u.aes.key_len = key_len / 2;
 	sg_init_one(&ctx->u.aes.key_sg, ctx->u.aes.key, key_len);
@@ -105,6 +110,7 @@ static int ccp_aes_xts_crypt(struct ablkcipher_request *req,
 {
 	struct ccp_ctx *ctx = crypto_tfm_ctx(req->base.tfm);
 	struct ccp_aes_req_ctx *rctx = ablkcipher_request_ctx(req);
+	unsigned int ccpversion = ccp_version();
 	unsigned int fallback = 0;
 	unsigned int unit;
 	u32 block_size;
@@ -141,7 +147,11 @@ static int ccp_aes_xts_crypt(struct ablkcipher_request *req,
 	 */
 	if (unit_size == CCP_XTS_AES_UNIT_SIZE__LAST)
 		fallback = 1;
-	if (ctx->u.aes.key_len != AES_KEYSIZE_128)
+	if ((ccpversion < CCP_VERSION(5, 0)) &&
+	    (ctx->u.aes.key_len != AES_KEYSIZE_128))
+		fallback = 1;
+	if ((ctx->u.aes.key_len != AES_KEYSIZE_128) &&
+	    (ctx->u.aes.key_len != AES_KEYSIZE_256))
 		fallback = 1;
 	if (fallback) {
 		SKCIPHER_REQUEST_ON_STACK(subreq, ctx->u.aes.tfm_skcipher);

diff --git a/drivers/crypto/ccp/ccp-crypto.h b/drivers/crypto/ccp/ccp-crypto.h
index 156b8233853f..880f8acdd0cd 100644
--- a/drivers/crypto/ccp/ccp-crypto.h
+++ b/drivers/crypto/ccp/ccp-crypto.h
@@ -91,7 +91,7 @@ struct ccp_aes_ctx {
 	struct scatterlist key_sg;
 	unsigned int key_len;
-	u8 key[AES_MAX_KEY_SIZE];
+	u8 key[AES_MAX_KEY_SIZE * 2];

 	u8 nonce[CTR_RFC3686_NONCE_SIZE];

diff --git a/drivers/crypto/ccp/ccp-ops.c b/drivers/crypto/ccp/ccp-ops.c
index 8113355151d2..fbd024f6e898 100644
--- a/drivers/crypto/ccp/ccp-ops.c
+++ b/drivers/crypto/ccp/ccp-ops.c
@@ -1065,6 +1065,8 @@ static int ccp_run_xts_aes_cmd(struct ccp_cmd_queue *cmd_q,
 	if (xts->key_len == AES_KEYSIZE_128)
 		aestype = CCP_AES_TYPE_128;
+	else if (xts->key_len == AES_KEYSIZE_256)
+		aestype = CCP_AES_TYPE_256;
 	else
 		return -EINVAL;
@@ -1089,6 +1091,7 @@ static int ccp_run_xts_aes_cmd(struct ccp_cmd_queue *cmd_q,
 	op.sb_ctx = cmd_q->sb_ctx;
 	op.init = 1;
 	op.u.xts.action = xts->action;
+	op.u.xts.type = aestype;
 	op.u.xts.unit_size = xts->unit_size;

 	/* A version 3 device only supports 128-bit keys, which fits into a
[PATCH v2 0/4] Update support for XTS-AES on AMD CCPs
The following series adds support for XTS-AES on version 5 CCPs, both 128- and 256-bit, and enhances/clarifies/simplifies some crypto layer code.

Changes since v1:
- rework the validation of the unit-size; move to a separate patch
- expand the key buffer to accommodate 256-bit keys
- use xts_check_key() in the crypto layer

---

Gary R Hook (4):
      crypto: ccp - Add a call to xts_check_key()
      crypto: ccp - Enable XTS-AES-128 support on all CCPs
      crypto: ccp - Rework the unit-size check for XTS-AES
      crypto: ccp - Add XTS-AES-256 support for CCP version 5

 drivers/crypto/ccp/ccp-crypto-aes-xts.c | 98 +--
 drivers/crypto/ccp/ccp-crypto.h         |  2 -
 drivers/crypto/ccp/ccp-dev-v5.c         |  2 +
 drivers/crypto/ccp/ccp-dev.h            |  2 +
 drivers/crypto/ccp/ccp-ops.c            | 55 +++--
 5 files changed, 106 insertions(+), 53 deletions(-)
--
[PATCH v2 3/4] crypto: ccp - Rework the unit-size check for XTS-AES
The CCP supports a limited set of unit-size values. Change the check for this parameter so that acceptable values match the enumeration. Then clarify the conditions under which the fallback implementation must be used.

Signed-off-by: Gary R Hook
---
 drivers/crypto/ccp/ccp-crypto-aes-xts.c | 75 ++-
 1 file changed, 35 insertions(+), 40 deletions(-)

diff --git a/drivers/crypto/ccp/ccp-crypto-aes-xts.c b/drivers/crypto/ccp/ccp-crypto-aes-xts.c
index 4a313f62dbea..3c37794ffe2d 100644
--- a/drivers/crypto/ccp/ccp-crypto-aes-xts.c
+++ b/drivers/crypto/ccp/ccp-crypto-aes-xts.c
@@ -1,8 +1,9 @@
 /*
  * AMD Cryptographic Coprocessor (CCP) AES XTS crypto API support
  *
- * Copyright (C) 2013 Advanced Micro Devices, Inc.
+ * Copyright (C) 2013,2017 Advanced Micro Devices, Inc.
  *
+ * Author: Gary R Hook
  * Author: Tom Lendacky
  *
  * This program is free software; you can redistribute it and/or modify
@@ -38,46 +39,26 @@ struct ccp_unit_size_map {
 	u32 value;
 };

-static struct ccp_unit_size_map unit_size_map[] = {
+static struct ccp_unit_size_map xts_unit_sizes[] = {
 	{
-		.size	= 4096,
-		.value	= CCP_XTS_AES_UNIT_SIZE_4096,
-	},
-	{
-		.size	= 2048,
-		.value	= CCP_XTS_AES_UNIT_SIZE_2048,
-	},
-	{
-		.size	= 1024,
-		.value	= CCP_XTS_AES_UNIT_SIZE_1024,
+		.size	= 16,
+		.value	= CCP_XTS_AES_UNIT_SIZE_16,
 	},
 	{
-		.size	= 512,
+		.size	= 512,
 		.value	= CCP_XTS_AES_UNIT_SIZE_512,
 	},
 	{
-		.size	= 256,
-		.value	= CCP_XTS_AES_UNIT_SIZE__LAST,
-	},
-	{
-		.size	= 128,
-		.value	= CCP_XTS_AES_UNIT_SIZE__LAST,
-	},
-	{
-		.size	= 64,
-		.value	= CCP_XTS_AES_UNIT_SIZE__LAST,
-	},
-	{
-		.size	= 32,
-		.value	= CCP_XTS_AES_UNIT_SIZE__LAST,
+		.size	= 1024,
+		.value	= CCP_XTS_AES_UNIT_SIZE_1024,
 	},
 	{
-		.size	= 16,
-		.value	= CCP_XTS_AES_UNIT_SIZE_16,
+		.size	= 2048,
+		.value	= CCP_XTS_AES_UNIT_SIZE_2048,
 	},
 	{
-		.size	= 1,
-		.value	= CCP_XTS_AES_UNIT_SIZE__LAST,
+		.size	= 4096,
+		.value	= CCP_XTS_AES_UNIT_SIZE_4096,
 	},
 };

@@ -124,7 +105,9 @@ static int ccp_aes_xts_crypt(struct ablkcipher_request *req,
 {
 	struct ccp_ctx *ctx = crypto_tfm_ctx(req->base.tfm);
 	struct ccp_aes_req_ctx *rctx = ablkcipher_request_ctx(req);
+	unsigned int fallback = 0;
 	unsigned int unit;
+	u32 block_size;
 	u32 unit_size;
 	int ret;

@@ -137,18 +120,30 @@ static int ccp_aes_xts_crypt(struct ablkcipher_request *req,
 	if (!req->info)
 		return -EINVAL;

+	/* Check conditions under which the CCP can fulfill a request. The
+	 * device can handle input plaintext of a length that is a multiple
+	 * of the unit_size, but the crypto implementation only supports
+	 * the unit_size being equal to the input length. This limits the
+	 * number of scenarios we can handle. Also validate the key length.
+	 */
+	block_size = 0;
 	unit_size = CCP_XTS_AES_UNIT_SIZE__LAST;
-	if (req->nbytes <= unit_size_map[0].size) {
-		for (unit = 0; unit < ARRAY_SIZE(unit_size_map); unit++) {
-			if (!(req->nbytes & (unit_size_map[unit].size - 1))) {
-				unit_size = unit_size_map[unit].value;
-				break;
-			}
+	for (unit = 0; unit < ARRAY_SIZE(xts_unit_sizes); unit++) {
+		if (req->nbytes == xts_unit_sizes[unit].size) {
+			unit_size = unit;
+			block_size = xts_unit_sizes[unit].size;
+			break;
 		}
 	}
-
-	if ((unit_size == CCP_XTS_AES_UNIT_SIZE__LAST) ||
-	    (ctx->u.aes.key_len != AES_KEYSIZE_128)) {
+	/* The CCP has restrictions on block sizes. Also, a version 3 device
+	 * only supports AES-128 operations; version 5 CCPs support both
+	 * AES-128 and -256 operations.
+	 */
+	if (unit_size == CCP_XTS_AES_UNIT_SIZE__LAST)
+		fallback = 1;
+	if (ctx->u.aes.key_len != AES_KEYSIZE_128)
+		fallback = 1;
+	if (fallback) {
 		SKCIPHER_REQUEST_ON_STACK(subreq, ctx->u.aes.tfm_skcipher);

 		/* Use the fallback to process the request for any
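The reworked check above replaces a power-of-two divisibility test with an exact match against the supported unit sizes: the request length must equal one of the enumerated sizes, otherwise the software fallback handles it. A standalone model of that lookup (sentinel value and table mirror the patch; this is an illustration, not the kernel enum):

```c
/* Supported XTS unit sizes, in the order the patch enumerates them */
static const unsigned int xts_unit_sizes[] = { 16, 512, 1024, 2048, 4096 };
#define XTS_UNIT_SIZE_LAST (-1)	/* "not supported, use the fallback" */

/* Return the table index for an exactly-matching request length, or the
 * sentinel when the CCP cannot handle it and the fallback must be used. */
static int xts_unit_lookup(unsigned int nbytes)
{
	unsigned int i;

	for (i = 0; i < sizeof(xts_unit_sizes) / sizeof(xts_unit_sizes[0]); i++) {
		if (nbytes == xts_unit_sizes[i])
			return (int)i;
	}
	return XTS_UNIT_SIZE_LAST;
}
```

The exact-match rule is the point of the patch: the hardware could process multiples of a unit size, but the crypto API path only supports unit_size == request length, so 256-byte or 8192-byte requests now fall back instead of being mis-mapped.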
[PATCH v2 1/4] crypto: ccp - Add a call to xts_check_key()
Vet the key using the available standard function.

Signed-off-by: Gary R Hook
---
 drivers/crypto/ccp/ccp-crypto-aes-xts.c | 9 +-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/drivers/crypto/ccp/ccp-crypto-aes-xts.c b/drivers/crypto/ccp/ccp-crypto-aes-xts.c
index 58a4244b4752..4a313f62dbea 100644
--- a/drivers/crypto/ccp/ccp-crypto-aes-xts.c
+++ b/drivers/crypto/ccp/ccp-crypto-aes-xts.c
@@ -15,6 +15,7 @@
 #include
 #include
 #include
+#include
 #include
 #include

@@ -96,7 +97,13 @@ static int ccp_aes_xts_complete(struct crypto_async_request *async_req, int ret)
 static int ccp_aes_xts_setkey(struct crypto_ablkcipher *tfm, const u8 *key,
 			      unsigned int key_len)
 {
-	struct ccp_ctx *ctx = crypto_tfm_ctx(crypto_ablkcipher_tfm(tfm));
+	struct crypto_tfm *xfm = crypto_ablkcipher_tfm(tfm);
+	struct ccp_ctx *ctx = crypto_tfm_ctx(xfm);
+	int ret;
+
+	ret = xts_check_key(xfm, key, key_len);
+	if (ret)
+		return ret;

 	/* Only support 128-bit AES key with a 128-bit Tweak key,
 	 * otherwise use the fallback
[PATCH v2 2/4] crypto: ccp - Enable XTS-AES-128 support on all CCPs
Version 5 CCPs have some new requirements for XTS-AES: the type field must be specified, and the key requires 512 bits, with each part occupying 256 bits and padded with zeroes. Signed-off-by: Gary R Hook--- drivers/crypto/ccp/ccp-dev-v5.c |2 ++ drivers/crypto/ccp/ccp-dev.h|2 ++ drivers/crypto/ccp/ccp-ops.c| 52 --- 3 files changed, 47 insertions(+), 9 deletions(-) diff --git a/drivers/crypto/ccp/ccp-dev-v5.c b/drivers/crypto/ccp/ccp-dev-v5.c index b3526336d608..0fb4519c5194 100644 --- a/drivers/crypto/ccp/ccp-dev-v5.c +++ b/drivers/crypto/ccp/ccp-dev-v5.c @@ -145,6 +145,7 @@ union ccp_function { #defineCCP_AES_MODE(p) ((p)->aes.mode) #defineCCP_AES_TYPE(p) ((p)->aes.type) #defineCCP_XTS_SIZE(p) ((p)->aes_xts.size) +#defineCCP_XTS_TYPE(p) ((p)->aes_xts.type) #defineCCP_XTS_ENCRYPT(p) ((p)->aes_xts.encrypt) #defineCCP_DES3_SIZE(p)((p)->des3.size) #defineCCP_DES3_ENCRYPT(p) ((p)->des3.encrypt) @@ -346,6 +347,7 @@ static int ccp5_perform_xts_aes(struct ccp_op *op) function.raw = 0; CCP_XTS_ENCRYPT() = op->u.xts.action; CCP_XTS_SIZE() = op->u.xts.unit_size; + CCP_XTS_TYPE() = op->u.xts.type; CCP5_CMD_FUNCTION() = function.raw; CCP5_CMD_LEN() = op->src.u.dma.length; diff --git a/drivers/crypto/ccp/ccp-dev.h b/drivers/crypto/ccp/ccp-dev.h index 9320931d89da..3d51180199ac 100644 --- a/drivers/crypto/ccp/ccp-dev.h +++ b/drivers/crypto/ccp/ccp-dev.h @@ -194,6 +194,7 @@ #define CCP_AES_CTX_SB_COUNT 1 #define CCP_XTS_AES_KEY_SB_COUNT 1 +#define CCP5_XTS_AES_KEY_SB_COUNT 2 #define CCP_XTS_AES_CTX_SB_COUNT 1 #define CCP_DES3_KEY_SB_COUNT 1 @@ -497,6 +498,7 @@ struct ccp_aes_op { }; struct ccp_xts_aes_op { + enum ccp_aes_type type; enum ccp_aes_action action; enum ccp_xts_aes_unit_size unit_size; }; diff --git a/drivers/crypto/ccp/ccp-ops.c b/drivers/crypto/ccp/ccp-ops.c index e23d138fc1ce..8113355151d2 100644 --- a/drivers/crypto/ccp/ccp-ops.c +++ b/drivers/crypto/ccp/ccp-ops.c @@ -1038,6 +1038,8 @@ static int ccp_run_xts_aes_cmd(struct ccp_cmd_queue *cmd_q, struct ccp_op op; 
unsigned int unit_size, dm_offset; bool in_place = false; + unsigned int sb_count = 0; + enum ccp_aes_type aestype; int ret; switch (xts->unit_size) { @@ -1061,9 +1063,12 @@ static int ccp_run_xts_aes_cmd(struct ccp_cmd_queue *cmd_q, return -EINVAL; } - if (xts->key_len != AES_KEYSIZE_128) + if (xts->key_len == AES_KEYSIZE_128) + aestype = CCP_AES_TYPE_128; + else return -EINVAL; + if (!xts->final && (xts->src_len & (AES_BLOCK_SIZE - 1))) return -EINVAL; @@ -1086,20 +1091,47 @@ static int ccp_run_xts_aes_cmd(struct ccp_cmd_queue *cmd_q, op.u.xts.action = xts->action; op.u.xts.unit_size = xts->unit_size; - /* All supported key sizes fit in a single (32-byte) SB entry -* and must be in little endian format. Use the 256-bit byte -* swap passthru option to convert from big endian to little -* endian. + /* A version 3 device only supports 128-bit keys, which fits into a +* single SB entry. A version 5 device uses a 512-bit vector, so two +* SB entries. */ + if (cmd_q->ccp->vdata->version == CCP_VERSION(3, 0)) + sb_count = CCP_XTS_AES_KEY_SB_COUNT; + else + sb_count = CCP5_XTS_AES_KEY_SB_COUNT; ret = ccp_init_dm_workarea(, cmd_q, - CCP_XTS_AES_KEY_SB_COUNT * CCP_SB_BYTES, + sb_count * CCP_SB_BYTES, DMA_TO_DEVICE); if (ret) return ret; - dm_offset = CCP_SB_BYTES - AES_KEYSIZE_128; - ccp_set_dm_area(, dm_offset, xts->key, 0, xts->key_len); - ccp_set_dm_area(, 0, xts->key, dm_offset, xts->key_len); + if (cmd_q->ccp->vdata->version == CCP_VERSION(3, 0)) { + /* All supported key sizes must be in little endian format. +* Use the 256-bit byte swap passthru option to convert from +* big endian to little endian. +*/ + dm_offset = CCP_SB_BYTES - AES_KEYSIZE_128; + ccp_set_dm_area(, dm_offset, xts->key, 0, xts->key_len); + ccp_set_dm_area(, 0, xts->key, xts->key_len, xts->key_len); + } else { + /* The AES key is at the little end and the tweak key is +* at the big end. Since the byteswap operation is only +* 256-bit, load the buffer according to the way things +* will end up. 
+*/ + unsigned int pad; + + op.sb_key =
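The 512-bit key vector described in this patch (AES key half and tweak key half, each padded with zeroes to 256 bits) can be sketched as a layout illustration. The offsets below are illustrative only and ignore the 256-bit byteswap passthru details, so this is a sketch of the idea rather than the driver's exact `ccp_set_dm_area()` sequence:

```c
#include <string.h>

#define SB_BYTES 32 /* one 256-bit storage-block entry */

/* Layout sketch: a 512-bit (2 x 256-bit) key vector with the AES cipher
 * key half in one slot and the tweak key half in the other, each
 * zero-padded up to 256 bits. */
static void build_xts_key_vector(unsigned char out[2 * SB_BYTES],
				 const unsigned char *key, /* cipher||tweak */
				 unsigned int half_len)    /* 16 or 32 bytes */
{
	memset(out, 0, 2 * SB_BYTES);
	/* cipher key half, padded into slot 0 */
	memcpy(out + (SB_BYTES - half_len), key, half_len);
	/* tweak key half, padded into slot 1 */
	memcpy(out + SB_BYTES + (SB_BYTES - half_len), key + half_len, half_len);
}

/* Self-check helper: build a vector from the key bytes 0x01..0x20 and
 * return the byte at idx, so the layout can be inspected. */
static unsigned char demo_byte(int idx)
{
	unsigned char key[32], out[2 * SB_BYTES];
	int i;

	for (i = 0; i < 32; i++)
		key[i] = (unsigned char)(i + 1);
	build_xts_key_vector(out, key, 16);
	return out[idx];
}
```

A version 3 device, by contrast, packs the same 2 x 128-bit key into a single 256-bit SB entry, which is why the patch selects `sb_count` by device version.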
[PATCH] crypto: Kconfig: Correct help text about feeding entropy pool
Modify Kconfig help text to reflect the fact that random data from hwrng is fed into kernel random number generator's entropy pool. Signed-off-by: PrasannaKumar Muralidharan--- drivers/char/hw_random/Kconfig | 6 ++ 1 file changed, 2 insertions(+), 4 deletions(-) diff --git a/drivers/char/hw_random/Kconfig b/drivers/char/hw_random/Kconfig index 1b223c3..2f45e15 100644 --- a/drivers/char/hw_random/Kconfig +++ b/drivers/char/hw_random/Kconfig @@ -13,10 +13,8 @@ menuconfig HW_RANDOM that's usually called /dev/hwrng, and which exposes one of possibly several hardware random number generators. - These hardware random number generators do not feed directly - into the kernel's random number generator. That is usually - handled by the "rngd" daemon. Documentation/hw_random.txt - has more information. + These hardware random number generators do feed into the + kernel's random number generator entropy pool. If unsure, say Y. -- 2.10.0
Re: [PATCH] crypto: ccp - Fix XTS-AES support on a version 5 CCP
On 07/17/2017 04:48 PM, Lendacky, Thomas wrote: On 7/17/2017 3:08 PM, Gary R Hook wrote: Version 5 CCPs have differing requirements for XTS-AES: key components are stored in a 512-bit vector. The context must be little-endian justified. AES-256 is supported now, so propagate the cipher size to the command descriptor. Signed-off-by: Gary R HookLook for a version 2 --- drivers/crypto/ccp/ccp-crypto-aes-xts.c | 79 --- drivers/crypto/ccp/ccp-dev-v5.c |2 + drivers/crypto/ccp/ccp-dev.h|2 + drivers/crypto/ccp/ccp-ops.c| 56 ++ 4 files changed, 89 insertions(+), 50 deletions(-) diff --git a/drivers/crypto/ccp/ccp-crypto-aes-xts.c b/drivers/crypto/ccp/ccp-crypto-aes-xts.c index 58a4244b4752..8d248b198e22 100644 --- a/drivers/crypto/ccp/ccp-crypto-aes-xts.c +++ b/drivers/crypto/ccp/ccp-crypto-aes-xts.c @@ -1,7 +1,7 @@ /* * AMD Cryptographic Coprocessor (CCP) AES XTS crypto API support * - * Copyright (C) 2013 Advanced Micro Devices, Inc. + * Copyright (C) 2013,2017 Advanced Micro Devices, Inc. * * Author: Tom Lendacky * @@ -37,46 +37,26 @@ struct ccp_unit_size_map { u32 value; }; -static struct ccp_unit_size_map unit_size_map[] = { +static struct ccp_unit_size_map xts_unit_sizes[] = { { - .size = 4096, - .value = CCP_XTS_AES_UNIT_SIZE_4096, - }, - { - .size = 2048, - .value = CCP_XTS_AES_UNIT_SIZE_2048, - }, - { - .size = 1024, - .value = CCP_XTS_AES_UNIT_SIZE_1024, + .size = 16, + .value = CCP_XTS_AES_UNIT_SIZE_16, }, { - .size = 512, + .size = 512, .value = CCP_XTS_AES_UNIT_SIZE_512, }, { - .size = 256, - .value = CCP_XTS_AES_UNIT_SIZE__LAST, - }, - { - .size = 128, - .value = CCP_XTS_AES_UNIT_SIZE__LAST, - }, - { - .size = 64, - .value = CCP_XTS_AES_UNIT_SIZE__LAST, - }, - { - .size = 32, - .value = CCP_XTS_AES_UNIT_SIZE__LAST, + .size = 1024, + .value = CCP_XTS_AES_UNIT_SIZE_1024, }, { - .size = 16, - .value = CCP_XTS_AES_UNIT_SIZE_16, + .size = 2048, + .value = CCP_XTS_AES_UNIT_SIZE_2048, }, { - .size = 1, - .value = CCP_XTS_AES_UNIT_SIZE__LAST, + .size = 4096, + .value 
= CCP_XTS_AES_UNIT_SIZE_4096, }, }; Because of the way the unit size check is performed, you can't delete the intermediate size checks. Those must remain so that unit sizes that aren't supported by the CCP are sent to the fallback mechanism. Given the limitations of the CCP (w.r.t. XTS-AES) I thought it more clear to look for only those unit-sizes that are supported, and to use the enumerated value as the index. Also, re-arranging the order should be a separate patch if that doesn't really fix anything. Yes, agreed. @@ -97,14 +77,20 @@ static int ccp_aes_xts_setkey(struct crypto_ablkcipher *tfm, const u8 *key, unsigned int key_len) { struct ccp_ctx *ctx = crypto_tfm_ctx(crypto_ablkcipher_tfm(tfm)); + unsigned int ccpversion = ccp_version(); /* Only support 128-bit AES key with a 128-bit Tweak key, * otherwise use the fallback */ + Remove the addition of the blank line and update the above comment to indicate the new supported key size added below. Yes. switch (key_len) { case AES_KEYSIZE_128 * 2: memcpy(ctx->u.aes.key, key, key_len); break; + case AES_KEYSIZE_256 * 2: + if (ccpversion > CCP_VERSION(3, 0)) + memcpy(ctx->u.aes.key, key, key_len); + break; Isn't u.aes.key defined with a maximum buffer size of AES_MAX_KEY_SIZE (which is 32)? I think this will cause a buffer overrun on memcpy. Yes, the structure member needs to be made larger. } ctx->u.aes.key_len = key_len / 2; sg_init_one(>u.aes.key_sg, ctx->u.aes.key, key_len); @@ -117,7 +103,10 @@ static int ccp_aes_xts_crypt(struct ablkcipher_request *req, { struct ccp_ctx *ctx = crypto_tfm_ctx(req->base.tfm); struct ccp_aes_req_ctx *rctx = ablkcipher_request_ctx(req); + unsigned int ccpversion = ccp_version(); + unsigned int fallback = 0; unsigned int unit; + u32 block_size = 0; u32 unit_size; int ret; @@ -131,17 +120,29 @@ static int ccp_aes_xts_crypt(struct ablkcipher_request *req, return -EINVAL; unit_size = CCP_XTS_AES_UNIT_SIZE__LAST; - if (req->nbytes <= unit_size_map[0].size) { This check can't be deleted. 
It was added
[PATCH v2 0/3] crypto: scompress - defer allocation of percpu scratch buffers
This is a followup to 'crypto: scompress - eliminate percpu scratch buffers', which attempted to replace the scompress per-CPU buffer entirely, but as Herbert pointed out, this is not going to fly in the targeted use cases. Instead, move the alloc/free of the buffers into the tfm init/exit hooks, so that we only have the per-CPU buffers allocated if there are any acomp ciphers of the right kind (i.e., ones that encapsulate a synchronous implementation) in use (#3). Patches #1 and #2 are fixes for issues I spotted when working on this code. Ard Biesheuvel (3): crypto: scompress - don't sleep with preemption disabled crypto: scompress - free partially allocated scratch buffers on failure crypto: scompress - defer allocation of scratch buffer to first use crypto/scompress.c | 55 1 file changed, 22 insertions(+), 33 deletions(-) -- 2.11.0
[PATCH v2 3/3] crypto: scompress - defer allocation of scratch buffer to first use
The scompress code allocates 2 x 128 KB of scratch buffers for each CPU, so that clients of the async API can use synchronous implementations even from atomic context. However, on systems such as Cavium Thunderx (which has 96 cores), this adds up to a non-negligible 24 MB. Also, 32-bit systems may prefer to use their precious vmalloc space for other things,especially since there don't appear to be any clients for the async compression API yet. So let's defer allocation of the scratch buffers until the first time we allocate an acompress cipher based on an scompress implementation. Signed-off-by: Ard Biesheuvel--- crypto/scompress.c | 46 1 file changed, 17 insertions(+), 29 deletions(-) diff --git a/crypto/scompress.c b/crypto/scompress.c index 2c07648305ad..2075e2c4e7df 100644 --- a/crypto/scompress.c +++ b/crypto/scompress.c @@ -65,11 +65,6 @@ static void crypto_scomp_show(struct seq_file *m, struct crypto_alg *alg) seq_puts(m, "type : scomp\n"); } -static int crypto_scomp_init_tfm(struct crypto_tfm *tfm) -{ - return 0; -} - static void crypto_scomp_free_scratches(void * __percpu *scratches) { int i; @@ -134,6 +129,17 @@ static int crypto_scomp_alloc_all_scratches(void) return 0; } +static int crypto_scomp_init_tfm(struct crypto_tfm *tfm) +{ + int ret; + + mutex_lock(_lock); + ret = crypto_scomp_alloc_all_scratches(); + mutex_unlock(_lock); + + return ret; +} + static void crypto_scomp_sg_free(struct scatterlist *sgl) { int i, n; @@ -241,6 +247,10 @@ static void crypto_exit_scomp_ops_async(struct crypto_tfm *tfm) struct crypto_scomp **ctx = crypto_tfm_ctx(tfm); crypto_free_scomp(*ctx); + + mutex_lock(_lock); + crypto_scomp_free_all_scratches(); + mutex_unlock(_lock); } int crypto_init_scomp_ops_async(struct crypto_tfm *tfm) @@ -317,40 +327,18 @@ static const struct crypto_type crypto_scomp_type = { int crypto_register_scomp(struct scomp_alg *alg) { struct crypto_alg *base = >base; - int ret = -ENOMEM; - - mutex_lock(_lock); - if 
(crypto_scomp_alloc_all_scratches()) - goto error; base->cra_type = _scomp_type; base->cra_flags &= ~CRYPTO_ALG_TYPE_MASK; base->cra_flags |= CRYPTO_ALG_TYPE_SCOMPRESS; - ret = crypto_register_alg(base); - if (ret) - goto error; - - mutex_unlock(_lock); - return ret; - -error: - crypto_scomp_free_all_scratches(); - mutex_unlock(_lock); - return ret; + return crypto_register_alg(base); } EXPORT_SYMBOL_GPL(crypto_register_scomp); int crypto_unregister_scomp(struct scomp_alg *alg) { - int ret; - - mutex_lock(_lock); - ret = crypto_unregister_alg(>base); - crypto_scomp_free_all_scratches(); - mutex_unlock(_lock); - - return ret; + return crypto_unregister_alg(>base); } EXPORT_SYMBOL_GPL(crypto_unregister_scomp); -- 2.11.0
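The deferred-allocation scheme in this patch can be illustrated with a userspace sketch: the first tfm init allocates the scratch storage, the last exit frees it. The reference count here is a hypothetical simplification of the bookkeeping the `crypto_scomp_*_scratches()` helpers perform, and the comments mark where the kernel code holds `scomp_lock`:

```c
#include <stdlib.h>
#include <stddef.h>

#define SCRATCH_SIZE (128 * 1024)

static void *scomp_scratch;	/* stand-in for the per-CPU buffers */
static int scomp_scratch_users;	/* hypothetical refcount */

static int scomp_init_tfm(void)
{
	/* kernel: mutex_lock(&scomp_lock); */
	if (scomp_scratch_users == 0) {
		scomp_scratch = malloc(SCRATCH_SIZE);
		if (!scomp_scratch)
			return -1;	/* -ENOMEM */
	}
	scomp_scratch_users++;
	/* kernel: mutex_unlock(&scomp_lock); */
	return 0;
}

static void scomp_exit_tfm(void)
{
	/* kernel: mutex_lock(&scomp_lock); */
	if (--scomp_scratch_users == 0) {
		free(scomp_scratch);
		scomp_scratch = NULL;
	}
	/* kernel: mutex_unlock(&scomp_lock); */
}
```

The upshot is the one stated in the cover letter: systems that never instantiate a synchronous-backed acomp cipher never pay the 2 x 128 KB per-CPU cost.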
[PATCH v2 2/3] crypto: scompress - free partially allocated scratch buffers on failure
When allocating the per-CPU scratch buffers, we allocate the source and destination buffers separately, but bail immediately if the second allocation fails, without freeing the first one. Fix that. Signed-off-by: Ard Biesheuvel--- crypto/scompress.c | 5 - 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/crypto/scompress.c b/crypto/scompress.c index 0b40d991d65f..2c07648305ad 100644 --- a/crypto/scompress.c +++ b/crypto/scompress.c @@ -125,8 +125,11 @@ static int crypto_scomp_alloc_all_scratches(void) if (!scomp_src_scratches) return -ENOMEM; scomp_dst_scratches = crypto_scomp_alloc_scratches(); - if (!scomp_dst_scratches) + if (!scomp_dst_scratches) { + crypto_scomp_free_scratches(scomp_src_scratches); + scomp_src_scratches = NULL; return -ENOMEM; + } } return 0; } -- 2.11.0
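The leak being fixed here is the classic allocate-two-things error path; a minimal userspace sketch of the corrected logic follows. The helper names stand in for the `crypto_scomp_*` functions, and the `force_fail` parameter is an illustration-only hook for simulating the second allocation failing:

```c
#include <stdlib.h>
#include <stddef.h>

/* Stand-in for crypto_scomp_alloc_scratches(); force_fail simulates an
 * allocation failure so the error path can be exercised. */
static void *alloc_scratches(int force_fail)
{
	return force_fail ? NULL : malloc(16);
}

static void *src_scratches, *dst_scratches;

static int alloc_all_scratches(int fail_second)
{
	src_scratches = alloc_scratches(0);
	if (!src_scratches)
		return -1;
	dst_scratches = alloc_scratches(fail_second);
	if (!dst_scratches) {
		free(src_scratches);	/* the fix: no leak ... */
		src_scratches = NULL;	/* ... and no dangling pointer */
		return -1;
	}
	return 0;
}
```

Resetting the pointer as well as freeing it matters because the real function can be retried later and tests the pointers for NULL.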
[PATCH v2 1/3] crypto: scompress - don't sleep with preemption disabled
Due to the use of per-CPU buffers, scomp_acomp_comp_decomp() executes with preemption disabled, and so whether the CRYPTO_TFM_REQ_MAY_SLEEP flag is set is irrelevant, since we cannot sleep anyway. So disregard the flag, and use GFP_ATOMIC unconditionally. Cc:# v4.10+ Signed-off-by: Ard Biesheuvel --- crypto/scompress.c | 4 +--- 1 file changed, 1 insertion(+), 3 deletions(-) diff --git a/crypto/scompress.c b/crypto/scompress.c index ae1d3cf209e4..0b40d991d65f 100644 --- a/crypto/scompress.c +++ b/crypto/scompress.c @@ -211,9 +211,7 @@ static int scomp_acomp_comp_decomp(struct acomp_req *req, int dir) scratch_dst, >dlen, *ctx); if (!ret) { if (!req->dst) { - req->dst = crypto_scomp_sg_alloc(req->dlen, - req->base.flags & CRYPTO_TFM_REQ_MAY_SLEEP ? - GFP_KERNEL : GFP_ATOMIC); + req->dst = crypto_scomp_sg_alloc(req->dlen, GFP_ATOMIC); if (!req->dst) goto out; } -- 2.11.0
Re: [RFC PATCH v12 3/4] Linux Random Number Generator
On Friday, 21 July 2017 at 17:09:11 CEST, Arnd Bergmann wrote: Hi Arnd, > On Fri, Jul 21, 2017 at 10:57 AM, Stephan Müller wrote: > > On Friday, 21 July 2017 at 05:08:47 CEST, Theodore Ts'o wrote: > >> Um, the timer is the largest number of interrupts on my system. Compare: > >> CPU0 CPU1 CPU2 CPU3 > >> > >> LOC:6396552603886565586466057102 Local timer > >> interrupts > >> > >> with the number of disk related interrupts: > >> 120: 21492 139284 405131705886 PCI-MSI 376832-edge > >> > >> ahci[:00:17.0] > > > > They seem to be not picked up with the add_interrupt_randomness function. > On x86, the local APIC timer has some special handling in > arch/x86/entry/entry_64.S that does not go through handle_irq_event(). > > I would assume that this is different when you boot with the "noapictimer" > option and use the hpet clockevent instead. > > On other architectures, the timer interrupt is often handled as a regular > IRQ as well. Thank you for the hint. Yet, I would think that timer interrupts can be identified by add_interrupt_randomness, either by the IRQ or the stuck test that was suggested with the LRNG patch set. Ciao Stephan
Re: [RFC PATCH v12 3/4] Linux Random Number Generator
On Fri, Jul 21, 2017 at 10:57 AM, Stephan Müller wrote: > On Friday, 21 July 2017 at 05:08:47 CEST, Theodore Ts'o wrote: >> Um, the timer is the largest number of interrupts on my system. Compare: >> >> CPU0 CPU1 CPU2 CPU3 >> LOC:6396552603886565586466057102 Local timer interrupts >> >> with the number of disk related interrupts: >> >> 120: 21492 139284 405131705886 PCI-MSI 376832-edge >> ahci[:00:17.0] > > They seem to be not picked up with the add_interrupt_randomness function. On x86, the local APIC timer has some special handling in arch/x86/entry/entry_64.S that does not go through handle_irq_event(). I would assume that this is different when you boot with the "noapictimer" option and use the hpet clockevent instead. On other architectures, the timer interrupt is often handled as a regular IRQ as well. Arnd
Re: Poor RNG performance on Ryzen
On 07/21/2017 09:47 AM, Theodore Ts'o wrote: On Fri, Jul 21, 2017 at 01:39:13PM +0200, Oliver Mangold wrote: Better, but obviously there is still much room for improvement by reducing the number of calls to RDRAND. Hmm, is there some way we can easily tell we are running on Ryzen? Or do we believe this is going to be true for all AMD devices? I guess we could add some kind of "has_crappy_arch_get_random()" call which could be defined by arch/x86, and change how aggressively we use arch_get_random_*() depending on whether has_crappy_arch_get_random() returns true or not Just a quick note to say that we are aware of the issue, and that it is being addressed.
Re: Poor RNG performance on Ryzen
On 21.07.2017 16:47, Theodore Ts'o wrote: On Fri, Jul 21, 2017 at 01:39:13PM +0200, Oliver Mangold wrote: Better, but obviously there is still much room for improvement by reducing the number of calls to RDRAND. Hmm, is there some way we can easily tell we are running on Ryzen? Or do we believe this is going to be true for all AMD devices? I would like to note that my first measurement on Broadwell suggest that the current frequency of RDRAND calls seems to slow things down on Intel also (but not as much as on Ryzen).
Re: Poor RNG performance on Ryzen
On Fri, Jul 21, 2017 at 01:39:13PM +0200, Oliver Mangold wrote: > Better, but obviously there is still much room for improvement by reducing > the number of calls to RDRAND. Hmm, is there some way we can easily tell we are running on Ryzen? Or do we believe this is going to be true for all AMD devices? I guess we could add some kind of "has_crappy_arch_get_random()" call which could be defined by arch/x86, and change how aggressively we use arch_get_random_*() depending on whether has_crappy_arch_get_random() returns true or not - Ted
Re: [PATCH] crypto: scompress - eliminate percpu scratch buffers
On 21 July 2017 at 14:44, Herbert Xu wrote: > On Fri, Jul 21, 2017 at 02:42:20PM +0100, Ard Biesheuvel wrote: >> >> >> - Would you mind a patch that makes the code only use the per-CPU >> >> buffers if we are running atomically to begin with? >> > >> > That would mean dropping the first packet so no. >> > >> >> I think you misunderstood me: the idea is not to allocate the per-CPU >> buffers at that time, but avoiding them and use the heap directly (as >> my patch does now). This way, we don't unnecessarily disable >> preemption for calls executing in process context. > > When we actually have a user other than IPcomp then we could > try this optimisation. Otherwise I think it'd be a waste of > time. > Fair enough.
Re: [PATCH] crypto: scompress - eliminate percpu scratch buffers
On Fri, Jul 21, 2017 at 02:42:20PM +0100, Ard Biesheuvel wrote: > > >> - Would you mind a patch that makes the code only use the per-CPU > >> buffers if we are running atomically to begin with? > > > > That would mean dropping the first packet so no. > > > > I think you misunderstood me: the idea is not to allocate the per-CPU > buffers at that time, but avoiding them and use the heap directly (as > my patch does now). This way, we don't unnecessarily disable > preemption for calls executing in process context. When we actually have a user other than IPcomp then we could try this optimisation. Otherwise I think it'd be a waste of time. Cheers, -- Email: Herbert XuHome Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
Re: [PATCH] crypto: scompress - eliminate percpu scratch buffers
On 21 July 2017 at 14:31, Herbert Xu wrote: > On Fri, Jul 21, 2017 at 02:24:02PM +0100, Ard Biesheuvel wrote: >> >> OK, but that doesn't really answer any of my questions: >> - Should we enforce that CRYPTO_ACOMP_ALLOC_OUTPUT is mutually >> exclusive with CRYPTO_TFM_REQ_MAY_SLEEP, or should >> crypto_scomp_sg_alloc() always use GFP_ATOMIC? We need one or the >> other, because the current situation is buggy. > > Yeah I'm not surprised that it is broken, because there are currently > no users of the API. We should fix the acomp API. > >> - Could we move the scratch buffers to the request structure instead? >> Or may that be allocated from softirq context as well? > > Yes the IPcomp requests would be allocated from softirq context. > __acomp_request_alloc() currently uses GFP_KERNEL explicitly, so that needs to be fixed as well then. >> - Would you mind a patch that defers the allocation of the per-CPU >> buffers to the first invocation of crypto_scomp_init_tfm()? > > Sure. That's what we currently do for IPcomp too. > OK. >> - Would you mind a patch that makes the code only use the per-CPU >> buffers if we are running atomically to begin with? > > That would mean dropping the first packet so no. > I think you misunderstood me: the idea is not to allocate the per-CPU buffers at that time, but avoiding them and use the heap directly (as my patch does now). This way, we don't unnecessarily disable preemption for calls executing in process context. I.e., if (req->base.flags & CRYPTO_TFM_REQ_MAY_SLEEP) { scratch_src = (u8 *)__get_free_pages(gfp, alloc_order); } else { cpu = get_cpu(); scratch_src = *per_cpu_ptr(scomp_src_scratches, cpu); ... } and make the put_cpu() at the end conditional in the same way. In any case, my main gripe with this code is the unconditional allocation of the scratch buffers (taking up memory and vmalloc space) so if we can defer that to crypto_scomp_init_tfm() instead, I already have most of what I need. -- Ard.
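The conditional scratch-buffer selection sketched inline in the mail above can be written out as a complete userspace illustration. `percpu_scratch` stands in for the kernel's per-CPU pointer, `REQ_MAY_SLEEP` mirrors `CRYPTO_TFM_REQ_MAY_SLEEP`, and the flag value is arbitrary; the logic is the one Ard describes: preallocated storage only when the caller may not sleep (atomic context), plain heap allocation otherwise:

```c
#include <stdlib.h>

#define REQ_MAY_SLEEP 0x1
#define SCRATCH_SIZE  (128 * 1024)

/* Stand-in for the kernel's per-CPU scratch buffer. */
static unsigned char percpu_scratch[SCRATCH_SIZE];

/* Process context (may sleep): allocate from the heap, so the per-CPU
 * path -- which disables preemption in the kernel -- is never taken.
 * Atomic context: fall back to the preallocated buffer. */
static void *get_scratch(unsigned int req_flags)
{
	if (req_flags & REQ_MAY_SLEEP)
		return malloc(SCRATCH_SIZE);
	return percpu_scratch;
}

static void put_scratch(unsigned int req_flags, void *buf)
{
	if (req_flags & REQ_MAY_SLEEP)
		free(buf);	/* the preallocated buffer is never freed */
}
```

In the kernel version, the `else` branch would additionally bracket the use of the buffer with `get_cpu()`/`put_cpu()`, exactly as the mail notes.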
Re: [PATCH] crypto: scompress - eliminate percpu scratch buffers
On Fri, Jul 21, 2017 at 02:24:02PM +0100, Ard Biesheuvel wrote: > > OK, but that doesn't really answer any of my questions: > - Should we enforce that CRYPTO_ACOMP_ALLOC_OUTPUT is mutually > exclusive with CRYPTO_TFM_REQ_MAY_SLEEP, or should > crypto_scomp_sg_alloc() always use GFP_ATOMIC? We need one or the > other, because the current situation is buggy. Yeah I'm not surprised that it is broken, because there are currently no users of the API. We should fix the acomp API. > - Could we move the scratch buffers to the request structure instead? > Or may that be allocated from softirq context as well? Yes the IPcomp requests would be allocated from softirq context. > - Would you mind a patch that defers the allocation of the per-CPU > buffers to the first invocation of crypto_scomp_init_tfm()? Sure. That's what we currently do for IPcomp too. > - Would you mind a patch that makes the code only use the per-CPU > buffers if we are running atomically to begin with? That would mean dropping the first packet so no. Cheers, -- Email: Herbert XuHome Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
Re: [PATCH] crypto: scompress - eliminate percpu scratch buffers
On 21 July 2017 at 14:24, Ard Biesheuvel wrote: > On 21 July 2017 at 14:11, Herbert Xu wrote: >> On Fri, Jul 21, 2017 at 02:09:39PM +0100, Ard Biesheuvel wrote: >>> >>> Right. And is req->dst guaranteed to be assigned in that case? Because >>> crypto_scomp_sg_alloc() happily allocates pages and kmalloc()s the >>> scatterlist if req->dst == NULL. >>> >>> Is there any way we could make these scratch buffers part of the >>> request structure instead? Or at least defer allocating them until the >>> first call to crypto_scomp_init_tfm()? And on top of that, we should >>> probably only use the per-CPU scratch buffers if >>> CRYPTO_TFM_REQ_MAY_SLEEP is cleared, because in that case, we are not >>> pre-emptible to begin with, and the concern does not apply. >> >> Well for now scomp is new code so nothing actually uses it. But >> the idea is to convert IPcomp over to acomp and scomp will fill in >> as the compatibility glue. >> > > Yes, I guessed as much. > >> The medium/long-term plan is to convert IPcomp over to a partial >> decompression model so that we can allocate pages instead of a >> contiguous chunk as we do now. >> > > OK, but that doesn't really answer any of my questions: > - Should we enforce that CRYPTO_ACOMP_ALLOC_OUTPUT is mutually > exclusive with CRYPTO_TFM_REQ_MAY_SLEEP, or should > crypto_scomp_sg_alloc() always use GFP_ATOMIC? We need one or the > other, because the current situation is buggy. I mean CRYPTO_ACOMP_ALLOC_OUTPUT should not be set if CRYPTO_TFM_REQ_MAY_SLEEP is cleared. > - Could we move the scratch buffers to the request structure instead? > Or may that be allocated from softirq context as well? > - Would you mind a patch that defers the allocation of the per-CPU > buffers to the first invocation of crypto_scomp_init_tfm()? > - Would you mind a patch that makes the code only use the per-CPU > buffers if we are running atomically to begin with? > > Thanks, > Ard.
Re: [PATCH] crypto: scompress - eliminate percpu scratch buffers
On 21 July 2017 at 14:11, Herbert Xu wrote: > On Fri, Jul 21, 2017 at 02:09:39PM +0100, Ard Biesheuvel wrote: >> >> Right. And is req->dst guaranteed to be assigned in that case? Because >> crypto_scomp_sg_alloc() happily allocates pages and kmalloc()s the >> scatterlist if req->dst == NULL. >> >> Is there any way we could make these scratch buffers part of the >> request structure instead? Or at least defer allocating them until the >> first call to crypto_scomp_init_tfm()? And on top of that, we should >> probably only use the per-CPU scratch buffers if >> CRYPTO_TFM_REQ_MAY_SLEEP is cleared, because in that case, we are not >> pre-emptible to begin with, and the concern does not apply. > > Well for now scomp is new code so nothing actually uses it. But > the idea is to convert IPcomp over to acomp and scomp will fill in > as the compatibility glue. > Yes, I guessed as much. > The medium/long-term plan is to convert IPcomp over to a partial > decompression model so that we can allocate pages instead of a > contiguous chunk as we do now. > OK, but that doesn't really answer any of my questions: - Should we enforce that CRYPTO_ACOMP_ALLOC_OUTPUT is mutually exclusive with CRYPTO_TFM_REQ_MAY_SLEEP, or should crypto_scomp_sg_alloc() always use GFP_ATOMIC? We need one or the other, because the current situation is buggy. - Could we move the scratch buffers to the request structure instead? Or may that be allocated from softirq context as well? - Would you mind a patch that defers the allocation of the per-CPU buffers to the first invocation of crypto_scomp_init_tfm()? - Would you mind a patch that makes the code only use the per-CPU buffers if we are running atomically to begin with? Thanks, Ard.
Re: [PATCH] crypto: scompress - eliminate percpu scratch buffers
On Fri, Jul 21, 2017 at 02:09:39PM +0100, Ard Biesheuvel wrote: > > Right. And is req->dst guaranteed to be assigned in that case? Because > crypto_scomp_sg_alloc() happily allocates pages and kmalloc()s the > scatterlist if req->dst == NULL. > > Is there any way we could make these scratch buffers part of the > request structure instead? Or at least defer allocating them until the > first call to crypto_scomp_init_tfm()? And on top of that, we should > probably only use the per-CPU scratch buffers if > CRYPTO_TFM_REQ_MAY_SLEEP is cleared, because in that case, we are not > pre-emptible to begin with, and the concern does not apply. Well for now scomp is new code so nothing actually uses it. But the idea is to convert IPcomp over to acomp and scomp will fill in as the compatibility glue. The medium/long-term plan is to convert IPcomp over to a partial decompression model so that we can allocate pages instead of a contiguous chunk as we do now. Cheers, -- Email: Herbert XuHome Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
Re: [PATCH] crypto: scompress - eliminate percpu scratch buffers
On 21 July 2017 at 13:42, Herbert Xu wrote: > On Thu, Jul 20, 2017 at 12:40:00PM +0100, Ard Biesheuvel wrote: >> The scompress code unconditionally allocates 2 per-CPU scratch buffers >> of 128 KB each, in order to avoid allocation overhead in the async >> wrapper that encapsulates the synchronous compression algorithm, since >> it may execute in atomic context. > > The whole point of pre-allocation is that we cannot allocate 128K > (or 64K as it was before scomp) at run-time, and in particular, > for IPsec which runs in softirq path. Am I missing something? > Right. And is req->dst guaranteed to be assigned in that case? Because crypto_scomp_sg_alloc() happily allocates pages and kmalloc()s the scatterlist if req->dst == NULL. Is there any way we could make these scratch buffers part of the request structure instead? Or at least defer allocating them until the first call to crypto_scomp_init_tfm()? And on top of that, we should probably only use the per-CPU scratch buffers if CRYPTO_TFM_REQ_MAY_SLEEP is cleared, because in that case, we are not pre-emptible to begin with, and the concern does not apply.
Re: [PATCH] crypto: scompress - eliminate percpu scratch buffers
On Thu, Jul 20, 2017 at 12:40:00PM +0100, Ard Biesheuvel wrote: > The scompress code unconditionally allocates 2 per-CPU scratch buffers > of 128 KB each, in order to avoid allocation overhead in the async > wrapper that encapsulates the synchronous compression algorithm, since > it may execute in atomic context. The whole point of pre-allocation is that we cannot allocate 128K (or 64K as it was before scomp) at run-time, and in particular, for IPsec which runs in softirq path. Am I missing something? Cheers, -- Email: Herbert XuHome Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
Re: Poor RNG performance on Ryzen
On Fri, Jul 21, 2017 at 3:12 AM, Oliver Mangold wrote: > Hi, > > I was wondering why reading from /dev/urandom is much slower on Ryzen than > on Intel, and did some analysis. It turns out that the RDRAND instruction is > at fault, which takes much longer on AMD. > > If I read this correctly: > > --- drivers/char/random.c --- > 862 spin_lock_irqsave(&crng->lock, flags); > 863 if (arch_get_random_long(&v)) > 864 crng->state[14] ^= v; > 865 chacha20_block(&crng->state[0], out); > > one call to RDRAND (with 64-bit operand) is issued per computation of a > chacha20 block. According to the measurements I did, it seems on Ryzen this > dominates the time usage: AMD's implementation of RDRAND and RDSEED is simply slow. It dates back to Bulldozer. While Intel can produce random numbers at about 10 cycles per byte, AMD regularly takes thousands of cycles for one byte. Bulldozer was measured at 4100 cycles per byte. It also appears AMD uses the same circuit for random numbers for both RDRAND and RDSEED. Both are equally fast (or equally slow). Here are some benchmarks if you are interested: https://www.cryptopp.com/wiki/RDRAND#Performance . Jeff
Re: Poor RNG performance on Ryzen
On 21.07.2017 11:26, Jan Glauber wrote: Nice catch. How much does the performance improve on Ryzen when you use arch_get_random_int()? Okay, now I have some results for you: On Ryzen 1800X (using arch_get_random_int()): --- # dd if=/dev/urandom of=/dev/null bs=1M status=progress 8751415296 bytes (8,8 GB, 8,2 GiB) copied, 71,0079 s, 123 MB/s # perf top 57,37% [kernel][k] _extract_crng 26,20% [kernel][k] chacha20_block --- Better, but obviously there is still much room for improvement by reducing the number of calls to RDRAND. On Ryzen 1800X (with nordrand kernel option): --- # dd if=/dev/urandom of=/dev/null bs=1M status=progress 22643998720 bytes (23 GB, 21 GiB) copied, 67,0025 s, 338 MB/s --- Here is the patch I used: --- drivers/char/random.c.orig 2017-07-03 01:07:02.0 +0200 +++ drivers/char/random.c 2017-07-21 11:57:40.541677118 +0200 @@ -859,13 +859,14 @@ static void _extract_crng(struct crng_state *crng, __u8 out[CHACHA20_BLOCK_SIZE]) { - unsigned long v, flags; + unsigned int v; + unsigned long flags; if (crng_init > 1 && time_after(jiffies, crng->init_time + CRNG_RESEED_INTERVAL)) crng_reseed(crng, crng == &primary_crng ? &input_pool : NULL); spin_lock_irqsave(&crng->lock, flags); - if (arch_get_random_long(&v)) + if (arch_get_random_int(&v)) crng->state[14] ^= v; chacha20_block(&crng->state[0], out); if (crng->state[12] == 0)
Re: [RFC PATCH v12 3/4] Linux Random Number Generator
Hi Ted,

Snipping one comment:

> Practically no one uses /dev/random. It's essentially a deprecated
> interface; the primary interfaces that have been recommended for well
> over a decade is /dev/urandom, and now, getrandom(2). We only need
> 384 bits of randomness every 5 minutes to reseed the CRNG, and that's
> plenty even given the very conservative entropy estimation currently
> being used.

The statement about /dev/random being deprecated is not well documented. A quick search is not turning up the expected results. The RANDOM(4) man page provides competing (conflicting?) information:

    When read, the /dev/random device will return random bytes only within
    the estimated number of bits of noise in the entropy pool. /dev/random
    should be suitable for uses that need very high quality randomness such
    as one-time pad or key generation...

We regularly test the /dev/random generator by reading 10K bytes in non-blocking mode, discarding them, and then asking for 16 bytes in blocking mode. We also compress as a poor man's fitness test. We are interested in how robust the generator is, how well it performs under stress, and how well it recovers.

After draining, it often takes minutes for the generator to produce 16 bytes. On Debian based systems the experiment usually fails unless rng-tools is installed. The failures occur even on systems with hardware based generators like rdrand and rdseed. I've witnessed the failure on i686, x86_64, ARM and MIPS. We recently suggested the GCC compile farm install rng-tools because we were witnessing the problem there on machines. Cf., https://lists.tetaneutral.net/pipermail/cfarm-users/2017-July/30.html .

I've even seen vendors recommend wiring /dev/random to /dev/urandom because of the entropy depletion problems. That's a big No-No according to https://lwn.net/Articles/525459/. The failures have always left me with an uncomfortable feeling because there are so many damn programs out there that do their own thing.
Distros don't perform a SecArch review before packaging, so problems lie in wait.

If the generator is truly deprecated, then it may be prudent to remove it completely or remove it from userland. Otherwise, improve its robustness. At minimum, update the documentation.

Jeff

On Thu, Jul 20, 2017 at 11:08 PM, Theodore Ts'o wrote:
> On Thu, Jul 20, 2017 at 09:00:02PM +0200, Stephan Müller wrote:
>> I concur with your rationale where de-facto the correlation effect is
>> diminished and eliminated with the fast_pool and the minimal entropy
>> estimation of interrupts.
>>
>> But it does not address my concern. Maybe I was not clear, please allow me to
>> explain it again.
>>
>> We have lots of entropy in the system which is discarded by the aforementioned
>> approach (if a high-res timer is present -- without it all bets are off anyway
>> and this should be covered in a separate discussion). At boot time, this issue
>> is fixed by injecting 256 interrupts in the CRNG and consider it seeded.
>>
>> But at runtime, where we still need entropy to reseed the CRNG and to supply
>> /dev/random. The accounting of entropy at runtime is much too conservative...
>
> Practically no one uses /dev/random. It's essentially a deprecated
> interface; the primary interfaces that have been recommended for well
> over a decade is /dev/urandom, and now, getrandom(2). We only need
> 384 bits of randomness every 5 minutes to reseed the CRNG, and that's
> plenty even given the very conservative entropy estimation currently
> being used.
>
> This was deliberate. I care a lot more that we get the initial
> boot-time CRNG initialization right on ARM32 and MIPS embedded
> devices, far, far, more than I care about making plenty of
> information-theoretic entropy available at /dev/random on an x86
> system. Further, I haven't seen an argument for the use case where
> this would be valuable.
>
> If you don't think they count because ARM32 and MIPS don't have a
> high-res timer, then you have very different priorities than I do. I
> will point out that numerically there are a huge number of these devices
> --- and very, very few users of /dev/random.
>
>> You mentioned that you are super conservative for interrupts due to timer
>> interrupts. In all measurements on the different systems I conducted, I have
>> not seen that the timer triggers an interrupt picked up by
>> add_interrupt_randomness.
>
> Um, the timer is the largest number of interrupts on my system. Compare:
>
>             CPU0     CPU1     CPU2     CPU3
> LOC:     6396552  6038865  6558646  6057102  Local timer interrupts
>
> with the number of disk related interrupts:
>
> 120:       21492   139284    40513  1705886  PCI-MSI 376832-edge ahci[0000:00:17.0]
>
> ... and add_interrupt_randomness() gets called for **every**
> interrupt. On a mostly idle machine (I was in meetings most of
> today) it's not surprising that timer interrupts dominate.
Re: Poor RNG performance on Ryzen
On Fri, Jul 21, 2017 at 09:12:01AM +0200, Oliver Mangold wrote:
> Hi,
>
> I was wondering why reading from /dev/urandom is much slower on
> Ryzen than on Intel, and did some analysis. It turns out that the
> RDRAND instruction is at fault, which takes much longer on AMD.
>
> if I read this correctly:
>
> --- drivers/char/random.c ---
> 862	spin_lock_irqsave(&crng->lock, flags);
> 863	if (arch_get_random_long(&v))
> 864		crng->state[14] ^= v;
> 865	chacha20_block(&crng->state[0], out);
>
> one call to RDRAND (with 64-bit operand) is issued per computation
> of a chacha20 block. According to the measurements I did, it seems
> on Ryzen this dominates the time usage:
>
> On Broadwell E5-2650 v4:
>
> ---
> # dd if=/dev/urandom of=/dev/null bs=1M status=progress
> 28827451392 bytes (29 GB) copied, 143.290349 s, 201 MB/s
> # perf top
> 49.88% [kernel] [k] chacha20_block
> 31.22% [kernel] [k] _extract_crng
> ---
>
> On Ryzen 1800X:
>
> ---
> # dd if=/dev/urandom of=/dev/null bs=1M status=progress
> 3169845248 bytes (3,2 GB, 3,0 GiB) copied, 42,0106 s, 75,5 MB/s
> # perf top
> 76,40% [kernel] [k] _extract_crng
> 13,05% [kernel] [k] chacha20_block
> ---
>
> An easy improvement might be to replace the usage of
> arch_get_random_long() by arch_get_random_int(), as the state array
> contains just 32-bit elements, and (contrary to Intel) on Ryzen
> 32-bit RDRAND is supposed to be faster by roughly a factor of 2.

Nice catch. How much does the performance improve on Ryzen when you use arch_get_random_int()?

--Jan

> Best regards,
>
> OM
Re: [RFC PATCH v12 3/4] Linux Random Number Generator
On Friday, 21 July 2017, 05:08:47 CEST, Theodore Ts'o wrote:

Hi Theodore,

> On Thu, Jul 20, 2017 at 09:00:02PM +0200, Stephan Müller wrote:
> > I concur with your rationale where de-facto the correlation effect is
> > diminished and eliminated with the fast_pool and the minimal entropy
> > estimation of interrupts.
> >
> > But it does not address my concern. Maybe I was not clear, please allow me
> > to explain it again.
> >
> > We have lots of entropy in the system which is discarded by the
> > aforementioned approach (if a high-res timer is present -- without it all
> > bets are off anyway and this should be covered in a separate discussion).
> > At boot time, this issue is fixed by injecting 256 interrupts in the CRNG
> > and consider it seeded.
> >
> > But at runtime, where we still need entropy to reseed the CRNG and to
> > supply /dev/random. The accounting of entropy at runtime is much too
> > conservative...
>
> Practically no one uses /dev/random. It's essentially a deprecated
> interface; the primary interfaces that have been recommended for well
> over a decade is /dev/urandom, and now, getrandom(2). We only need
> 384 bits of randomness every 5 minutes to reseed the CRNG, and that's
> plenty even given the very conservative entropy estimation currently
> being used.

On a headless system with SSDs, this is not enough based on measurements where entropy_avail is always low ...

> This was deliberate. I care a lot more that we get the initial
> boot-time CRNG initialization right on ARM32 and MIPS embedded
> devices, far, far, more than I care about making plenty of
> information-theoretic entropy available at /dev/random on an x86
> system. Further, I haven't seen an argument for the use case where
> this would be valuable.

My concern covers *both* /dev/random and /dev/urandom.

> If you don't think they count because ARM32 and MIPS don't have a
> high-res timer, then you have very different priorities than I do.
> I will point out that numerically there are a huge number of these devices
> --- and very, very few users of /dev/random.

With only jiffies, you will not get suitable entropy from interrupts or HID or block devices, as these actions can be monitored from user space with a suitable degree of precision. The only real entropy provider would be HID, as the entropy may come from the key strokes. But that is observable and should not count as entropy.

> > You mentioned that you are super conservative for interrupts due to timer
> > interrupts. In all measurements on the different systems I conducted, I
> > have not seen that the timer triggers an interrupt picked up by
> > add_interrupt_randomness.
>
> Um, the timer is the largest number of interrupts on my system. Compare:
>
>             CPU0     CPU1     CPU2     CPU3
> LOC:     6396552  6038865  6558646  6057102  Local timer interrupts
>
> with the number of disk related interrupts:
>
> 120:       21492   139284    40513  1705886  PCI-MSI 376832-edge ahci[0000:00:17.0]

They seem to be not picked up by the add_interrupt_randomness function. Execute the following SystemTap script:

global NUMSAMPLES = 1;
global num_events = 0;

probe kernel.function("add_interrupt_randomness")
{
	printf("%d\n", $irq);
	num_events++;
	if (num_events > NUMSAMPLES)
		exit();
}

The timer interrupt does not show up here.

> ... and add_interrupt_randomness() gets called for **every**
> interrupt. On a mostly idle machine (I was in meetings most of
> today) it's not surprising that timer interrupts dominate. That
> doesn't matter for me as much because I don't really care about
> /dev/random performance. What is **far** more important is that the
> entropy estimations behave correctly, across all of Linux's
> architectures, while the kernel is going through startup, before CRNG
> is declared initialized.
>
> > As we have no formal model about entropy to begin with, we can only assume
> > and hope we underestimate entropy with the entropy heuristic.
>
> Yes, and that's why I use an ultra-conservative estimate. If we start
> using a more aggressive heuristic, we open ourselves up to potentially
> very severe security bugs --- and for what? What's the cost-benefit
> ratio here which makes this a worthwhile thing to risk?

The benefit is that in case there is an entropy hog on the system, /dev/random and /dev/urandom recover faster from it. Otherwise they do not get reseeded at all.

> > Finally, I still think it is helpful to allow (not mandate) involving the
> > kernel crypto API for the DRNG maintenance (i.e. the supplier for
> > /dev/random and /dev/urandom). The reason is that now more and more DRNG
> > implementations in hardware pop up. Why not allow them to be used?
> > I.e. random.c would only contain the logic to manage entropy but use the
> > DRNG requested by a user.
>
> We *do* allow them to be used. And we support a large number
Poor RNG performance on Ryzen
Hi,

I was wondering why reading from /dev/urandom is much slower on Ryzen than on Intel, and did some analysis. It turns out that the RDRAND instruction is at fault, which takes much longer on AMD.

if I read this correctly:

--- drivers/char/random.c ---
862	spin_lock_irqsave(&crng->lock, flags);
863	if (arch_get_random_long(&v))
864		crng->state[14] ^= v;
865	chacha20_block(&crng->state[0], out);

one call to RDRAND (with 64-bit operand) is issued per computation of a chacha20 block. According to the measurements I did, it seems on Ryzen this dominates the time usage:

On Broadwell E5-2650 v4:
---
# dd if=/dev/urandom of=/dev/null bs=1M status=progress
28827451392 bytes (29 GB) copied, 143.290349 s, 201 MB/s
# perf top
49.88% [kernel] [k] chacha20_block
31.22% [kernel] [k] _extract_crng
---

On Ryzen 1800X:
---
# dd if=/dev/urandom of=/dev/null bs=1M status=progress
3169845248 bytes (3,2 GB, 3,0 GiB) copied, 42,0106 s, 75,5 MB/s
# perf top
76,40% [kernel] [k] _extract_crng
13,05% [kernel] [k] chacha20_block
---

An easy improvement might be to replace the usage of arch_get_random_long() by arch_get_random_int(), as the state array contains just 32-bit elements, and (contrary to Intel) on Ryzen 32-bit RDRAND is supposed to be faster by roughly a factor of 2.

Best regards,

OM
Re: [PATCH 3/4] crypto: axis: add ARTPEC-6/7 crypto accelerator driver
On 07/20/2017 04:51 PM, Stephan Müller wrote:
> On Thursday, 20 July 2017, 15:44:31 CEST, Lars Persson wrote:
>
> Hi Lars,
>
>> +static int
>> +artpec6_crypto_cipher_set_key(struct crypto_skcipher *cipher, const u8 *key,
>> +			      unsigned int keylen)
>> +{
>> +	struct artpec6_cryptotfm_context *ctx =
>> +		crypto_skcipher_ctx(cipher);
>> +
>> +	if (ctx->crypto_type == ARTPEC6_CRYPTO_CIPHER_NULL)
>> +		return 0;
>> +
>> +	switch (keylen) {
>> +	case 16:
>> +	case 24:
>> +	case 32:
>> +		break;
>> +	case 48:
>> +	case 64:
>> +		if (ctx->crypto_type == ARTPEC6_CRYPTO_CIPHER_AES_XTS)
>
> Could you please add xts_check_key for XTS keys?

Sure, I will add this in v2.

Thanks,
Lars