[PATCH V3 6/6] crypto/nx: Add P9 NX support for 842 compression engine

2017-07-21 Thread Haren Myneni

This patch adds P9 NX support for the 842 compression engine. The
Virtual Accelerator Switchboard (VAS) is used to access the 842
engine on P9.

For each NX engine per chip, set up a receive window using
vas_rx_win_open(), which configures the RxFIFO with the FIFO address,
lpid, pid and tid values. This unique (lpid, pid, tid) combination
is used to identify the target engine.
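
As a sketch only (this assumes the vas_init_rx_win_attr() /
vas_rx_win_open() interface from the prerequisite VAS series; the
variable names are illustrative, not taken from this patch):

	struct vas_rx_win_attr rxattr;
	struct vas_window *rxwin;

	/* Start from the VAS defaults for an NX-842 receive window */
	vas_init_rx_win_attr(&rxattr, coproc->ct);
	rxattr.rx_fifo = rx_fifo;		/* RxFIFO address */
	rxattr.rx_fifo_size = fifo_size;
	rxattr.lnotify_lpid = lpid;		/* (lpid, pid, tid) identify */
	rxattr.lnotify_pid = pid;		/* the target engine */
	rxattr.lnotify_tid = tid;
	rxattr.wcreds_max = MAX_CREDITS_PER_RXFIFO;

	rxwin = vas_rx_win_open(vasid, coproc->ct, &rxattr);
	if (IS_ERR(rxwin))
		return PTR_ERR(rxwin);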

For each crypto open request, open a send window on the NX engine for
the chip / CPU where the open request executes. This send window is
closed upon the crypto close request.

NX provides high and normal priority FIFOs. For compression /
decompression requests, only the high priority FIFOs are used in the
kernel.

Each NX request is communicated to VAS using copy/paste instructions
via the vas_copy_crb() / vas_paste_crb() functions.
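
Request submission then reduces to a copy of the CRB followed by a
paste to the send window. A minimal sketch, again assuming the VAS API
from the prerequisite series (preemption is disabled across the
copy/paste pair, per the V3 changelog):

	preempt_disable();
	ret = vas_copy_crb(crb, 0);		/* copy CRB to the buffer */
	if (!ret)
		ret = vas_paste_crb(txwin, 0, 1); /* paste it to the engine */
	preempt_enable();
	if (ret)
		/* e.g. RxFIFO full: the caller may retry, up to VAS_RETRIES */
		return -EAGAIN;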

Signed-off-by: Haren Myneni 
---
 drivers/crypto/nx/Kconfig  |   1 +
 drivers/crypto/nx/nx-842-powernv.c | 375 -
 drivers/crypto/nx/nx-842.c |   2 +-
 3 files changed, 371 insertions(+), 7 deletions(-)

diff --git a/drivers/crypto/nx/Kconfig b/drivers/crypto/nx/Kconfig
index ad7552a6998c..cd5dda9c48f4 100644
--- a/drivers/crypto/nx/Kconfig
+++ b/drivers/crypto/nx/Kconfig
@@ -38,6 +38,7 @@ config CRYPTO_DEV_NX_COMPRESS_PSERIES
 config CRYPTO_DEV_NX_COMPRESS_POWERNV
tristate "Compression acceleration support on PowerNV platform"
depends on PPC_POWERNV
+   depends on PPC_VAS
default y
help
  Support for PowerPC Nest (NX) compression acceleration. This
diff --git a/drivers/crypto/nx/nx-842-powernv.c b/drivers/crypto/nx/nx-842-powernv.c
index c0dd4c7e17d3..13089a0b9dfa 100644
--- a/drivers/crypto/nx/nx-842-powernv.c
+++ b/drivers/crypto/nx/nx-842-powernv.c
@@ -23,6 +23,7 @@
 #include 
 #include 
 #include 
+#include <asm/vas.h>
 
 MODULE_LICENSE("GPL");
 MODULE_AUTHOR("Dan Streetman ");
@@ -32,6 +33,9 @@ MODULE_ALIAS_CRYPTO("842-nx");
 
 #define WORKMEM_ALIGN  (CRB_ALIGN)
 #define CSB_WAIT_MAX   (5000) /* ms */
+#define VAS_RETRIES(10)
+/* # of requests allowed per RxFIFO at a time. 0 for unlimited */
+#define MAX_CREDITS_PER_RXFIFO (1024)
 
 struct nx842_workmem {
/* Below fields must be properly aligned */
@@ -42,16 +46,27 @@ struct nx842_workmem {
 
ktime_t start;
 
+   struct vas_window *txwin;   /* Used with VAS function */
char padding[WORKMEM_ALIGN]; /* unused, to allow alignment */
 } __packed __aligned(WORKMEM_ALIGN);
 
 struct nx842_coproc {
unsigned int chip_id;
unsigned int ct;
-   unsigned int ci;
+   unsigned int ci;/* Coprocessor instance, used with icswx */
+   struct {
+   struct vas_window *rxwin;
+   int id;
+   } vas;
struct list_head list;
 };
 
+/*
+ * Send the request to NX engine on the chip for the corresponding CPU
+ * where the process is executing. Use with VAS function.
+ */
+static DEFINE_PER_CPU(struct nx842_coproc *, coproc_inst);
+
 /* no cpu hotplug on powernv, so this list never changes after init */
 static LIST_HEAD(nx842_coprocs);
 static unsigned int nx842_ct;  /* used in icswx function */
@@ -513,6 +528,105 @@ static int nx842_exec_icswx(const unsigned char *in, unsigned int inlen,
 }
 
 /**
+ * nx842_exec_vas - compress/decompress data using the 842 algorithm
+ *
+ * (De)compression provided by the NX842 coprocessor on IBM PowerNV systems.
+ * This compresses or decompresses the provided input buffer into the provided
+ * output buffer.
+ *
+ * Upon return from this function @outlenp contains the length of the
+ * output data.  If there is an error then @outlenp will be 0 and an
+ * error will be specified by the return code from this function.
+ *
+ * The @workmem buffer should only be used by one function call at a time.
+ *
+ * @in: input buffer pointer
+ * @inlen: input buffer size
+ * @out: output buffer pointer
+ * @outlenp: output buffer size pointer
+ * @workmem: working memory buffer pointer, size determined by
+ *   nx842_powernv_driver.workmem_size
+ * @fc: function code, see CCW Function Codes in nx-842.h
+ *
+ * Returns:
+ *   0 Success, output of length @outlenp stored in the buffer
+ * at @out
+ *   -ENODEV   Hardware unavailable
+ *   -ENOSPC   Output buffer is too small
+ *   -EMSGSIZE Input buffer too large
+ *   -EINVAL   buffer constraints do not fit nx842_constraints
+ *   -EPROTO   hardware error during operation
+ *   -ETIMEDOUT hardware did not complete operation in reasonable time
+ *   -EINTR    operation was aborted
+ */
+static int nx842_exec_vas(const unsigned char *in, unsigned int inlen,
+ unsigned char *out, unsigned int *outlenp,
+ void *workmem, int fc)
+{
+   struct coprocessor_request_block *crb;
+   struct coprocessor_status_block *csb;
+   struct nx842_workmem *wmem;
+   struct vas_window *txwin;
+   

[PATCH V3 5/6] crypto/nx: Add P9 NX specific error codes for 842 engine

2017-07-21 Thread Haren Myneni

This patch adds checks for P9-specific 842 engine error codes. These
errors are reported in the coprocessor status block (CSB) on
failures.

Signed-off-by: Haren Myneni 
---
 arch/powerpc/include/asm/icswx.h   |  3 +++
 drivers/crypto/nx/nx-842-powernv.c | 18 ++
 drivers/crypto/nx/nx-842.h |  8 
 3 files changed, 29 insertions(+)

diff --git a/arch/powerpc/include/asm/icswx.h b/arch/powerpc/include/asm/icswx.h
index 27e588f6c72e..6a2c87577541 100644
--- a/arch/powerpc/include/asm/icswx.h
+++ b/arch/powerpc/include/asm/icswx.h
@@ -69,7 +69,10 @@ struct coprocessor_completion_block {
 #define CSB_CC_WR_PROTECTION   (16)
 #define CSB_CC_UNKNOWN_CODE(17)
 #define CSB_CC_ABORT   (18)
+#define CSB_CC_EXCEED_BYTE_COUNT   (19)/* P9 or later */
 #define CSB_CC_TRANSPORT   (20)
+#define CSB_CC_INVALID_CRB (21)/* P9 or later */
+#define CSB_CC_INVALID_DDE (30)/* P9 or later */
 #define CSB_CC_SEGMENTED_DDL   (31)
 #define CSB_CC_PROGRESS_POINT  (32)
 #define CSB_CC_DDE_OVERFLOW(33)
diff --git a/drivers/crypto/nx/nx-842-powernv.c b/drivers/crypto/nx/nx-842-powernv.c
index 829b5cad0043..c0dd4c7e17d3 100644
--- a/drivers/crypto/nx/nx-842-powernv.c
+++ b/drivers/crypto/nx/nx-842-powernv.c
@@ -243,6 +243,13 @@ static int wait_for_csb(struct nx842_workmem *wmem,
case CSB_CC_TEMPL_OVERFLOW:
CSB_ERR(csb, "Compressed data template shows data past end");
return -EINVAL;
+   case CSB_CC_EXCEED_BYTE_COUNT:  /* P9 or later */
+   /*
+* DDE byte count exceeds the limit specified in Maximum
+* byte count register.
+*/
+   CSB_ERR(csb, "DDE byte count exceeds the limit");
+   return -EINVAL;
 
/* these should not happen */
case CSB_CC_INVALID_ALIGN:
@@ -284,9 +291,17 @@ static int wait_for_csb(struct nx842_workmem *wmem,
CSB_ERR(csb, "Too many DDEs in DDL");
return -EINVAL;
case CSB_CC_TRANSPORT:
+   case CSB_CC_INVALID_CRB:/* P9 or later */
/* shouldn't happen, we setup CRB correctly */
CSB_ERR(csb, "Invalid CRB");
return -EINVAL;
+   case CSB_CC_INVALID_DDE:/* P9 or later */
+   /*
+* shouldn't happen, setup_direct/indirect_dde creates
+* DDE right
+*/
+   CSB_ERR(csb, "Invalid DDE");
+   return -EINVAL;
case CSB_CC_SEGMENTED_DDL:
/* shouldn't happen, setup_ddl creates DDL right */
CSB_ERR(csb, "Segmented DDL error");
@@ -330,6 +345,9 @@ static int wait_for_csb(struct nx842_workmem *wmem,
case CSB_CC_HW:
CSB_ERR(csb, "Correctable hardware error");
return -EPROTO;
+   case CSB_CC_HW_EXPIRED_TIMER:   /* P9 or later */
+   CSB_ERR(csb, "Job did not finish within allowed time");
+   return -EPROTO;
 
default:
CSB_ERR(csb, "Invalid CC %d", csb->cc);
diff --git a/drivers/crypto/nx/nx-842.h b/drivers/crypto/nx/nx-842.h
index 30929bd7d1a9..bb2f31792683 100644
--- a/drivers/crypto/nx/nx-842.h
+++ b/drivers/crypto/nx/nx-842.h
@@ -76,9 +76,17 @@
 #define CSB_CC_DECRYPT_OVERFLOW(64)
 /* asym crypt codes */
 #define CSB_CC_MINV_OVERFLOW   (128)
+/*
+ * HW error - Job did not finish in the maximum time allowed.
+ * Job terminated.
+ */
+#define CSB_CC_HW_EXPIRED_TIMER(224)
 /* These are reserved for hypervisor use */
 #define CSB_CC_HYP_RESERVE_START   (240)
 #define CSB_CC_HYP_RESERVE_END (253)
+#define CSB_CC_HYP_RESERVE_P9_END  (251)
+/* No valid interrupt server (P9 or later). */
+#define CSB_CC_HYP_RESERVE_NO_INTR_SERVER  (252)
 #define CSB_CC_HYP_NO_HW   (254)
 #define CSB_CC_HYP_HANG_ABORTED(255)
 
-- 
2.11.0





[PATCH V3 3/6] crypto/nx: Create nx842_delete_coprocs function

2017-07-21 Thread Haren Myneni

Move the deletion of coprocessor info on exit or failure into
nx842_delete_coprocs().

Signed-off-by: Haren Myneni 
---
 drivers/crypto/nx/nx-842-powernv.c | 25 -
 1 file changed, 12 insertions(+), 13 deletions(-)

diff --git a/drivers/crypto/nx/nx-842-powernv.c b/drivers/crypto/nx/nx-842-powernv.c
index 1bd19e03eb7d..67dc06f9b557 100644
--- a/drivers/crypto/nx/nx-842-powernv.c
+++ b/drivers/crypto/nx/nx-842-powernv.c
@@ -593,6 +593,16 @@ static int __init nx842_powernv_probe(struct device_node *dn)
return 0;
 }
 
+static void nx842_delete_coprocs(void)
+{
+   struct nx842_coproc *coproc, *n;
+
+   list_for_each_entry_safe(coproc, n, &nx842_coprocs, list) {
+           list_del(&coproc->list);
+           kfree(coproc);
+   }
+}
+
 static struct nx842_constraints nx842_powernv_constraints = {
.alignment =DDE_BUFFER_ALIGN,
.multiple = DDE_BUFFER_LAST_MULT,
@@ -652,13 +662,7 @@ static __init int nx842_powernv_init(void)
 
	ret = crypto_register_alg(&nx842_powernv_alg);
if (ret) {
-   struct nx842_coproc *coproc, *n;
-
-   list_for_each_entry_safe(coproc, n, &nx842_coprocs, list) {
-           list_del(&coproc->list);
-           kfree(coproc);
-   }
-
+   nx842_delete_coprocs();
return ret;
}
 
@@ -668,13 +672,8 @@ module_init(nx842_powernv_init);
 
 static void __exit nx842_powernv_exit(void)
 {
-   struct nx842_coproc *coproc, *n;
-
	crypto_unregister_alg(&nx842_powernv_alg);
 
-   list_for_each_entry_safe(coproc, n, &nx842_coprocs, list) {
-           list_del(&coproc->list);
-           kfree(coproc);
-   }
+   nx842_delete_coprocs();
 }
 module_exit(nx842_powernv_exit);
-- 
2.11.0





[PATCH V3 4/6] crypto/nx: Add nx842_add_coprocs_list function

2017-07-21 Thread Haren Myneni

Updating the coprocessor list is moved to nx842_add_coprocs_list().
This function will be used by both the icswx and VAS functions.

Signed-off-by: Haren Myneni 
---
 drivers/crypto/nx/nx-842-powernv.c | 12 +---
 1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/drivers/crypto/nx/nx-842-powernv.c b/drivers/crypto/nx/nx-842-powernv.c
index 67dc06f9b557..829b5cad0043 100644
--- a/drivers/crypto/nx/nx-842-powernv.c
+++ b/drivers/crypto/nx/nx-842-powernv.c
@@ -550,6 +550,14 @@ static int nx842_powernv_decompress(const unsigned char *in, unsigned int inlen,
  wmem, CCW_FC_842_DECOMP_CRC);
 }
 
+static inline void nx842_add_coprocs_list(struct nx842_coproc *coproc,
+   int chipid)
+{
+   coproc->chip_id = chipid;
+   INIT_LIST_HEAD(&coproc->list);
+   list_add(&coproc->list, &nx842_coprocs);
+}
+
 static int __init nx842_powernv_probe(struct device_node *dn)
 {
struct nx842_coproc *coproc;
@@ -576,11 +584,9 @@ static int __init nx842_powernv_probe(struct device_node *dn)
if (!coproc)
return -ENOMEM;
 
-   coproc->chip_id = chip_id;
coproc->ct = ct;
coproc->ci = ci;
-   INIT_LIST_HEAD(&coproc->list);
-   list_add(&coproc->list, &nx842_coprocs);
+   nx842_add_coprocs_list(coproc, chip_id);
 
pr_info("coprocessor found on chip %d, CT %d CI %d\n", chip_id, ct, ci);
 
-- 
2.11.0





[PATCH V3 2/6] crypto/nx: Create nx842_configure_crb function

2017-07-21 Thread Haren Myneni

The CRB configuration is moved to nx842_config_crb() so that it can
be used by both the icswx and VAS exec functions. The VAS function
will be added later with P9 support.

Signed-off-by: Haren Myneni 
---
 drivers/crypto/nx/nx-842-powernv.c | 57 +-
 1 file changed, 38 insertions(+), 19 deletions(-)

diff --git a/drivers/crypto/nx/nx-842-powernv.c b/drivers/crypto/nx/nx-842-powernv.c
index 161987698bbc..1bd19e03eb7d 100644
--- a/drivers/crypto/nx/nx-842-powernv.c
+++ b/drivers/crypto/nx/nx-842-powernv.c
@@ -358,6 +358,40 @@ static int wait_for_csb(struct nx842_workmem *wmem,
return 0;
 }
 
+static int nx842_config_crb(const unsigned char *in, unsigned int inlen,
+   unsigned char *out, unsigned int outlen,
+   struct nx842_workmem *wmem)
+{
+   struct coprocessor_request_block *crb;
+   struct coprocessor_status_block *csb;
+   u64 csb_addr;
+   int ret;
+
+   crb = &wmem->crb;
+   csb = &crb->csb;
+
+   /* Clear any previous values */
+   memset(crb, 0, sizeof(*crb));
+
+   /* set up DDLs */
+   ret = setup_ddl(&crb->source, wmem->ddl_in,
+   (unsigned char *)in, inlen, true);
+   if (ret)
+   return ret;
+
+   ret = setup_ddl(&crb->target, wmem->ddl_out,
+   out, outlen, false);
+   if (ret)
+   return ret;
+
+   /* set up CRB's CSB addr */
+   csb_addr = nx842_get_pa(csb) & CRB_CSB_ADDRESS;
+   csb_addr |= CRB_CSB_AT; /* Addrs are phys */
+   crb->csb_addr = cpu_to_be64(csb_addr);
+
+   return 0;
+}
+
 /**
  * nx842_exec_icswx - compress/decompress data using the 842 algorithm
  *
@@ -397,7 +431,6 @@ static int nx842_exec_icswx(const unsigned char *in, unsigned int inlen,
struct coprocessor_status_block *csb;
struct nx842_workmem *wmem;
int ret;
-   u64 csb_addr;
u32 ccw;
unsigned int outlen = *outlenp;
 
@@ -411,33 +444,19 @@ static int nx842_exec_icswx(const unsigned char *in, unsigned int inlen,
return -ENODEV;
}
 
-   crb = &wmem->crb;
-   csb = &crb->csb;
-
-   /* Clear any previous values */
-   memset(crb, 0, sizeof(*crb));
-
-   /* set up DDLs */
-   ret = setup_ddl(&crb->source, wmem->ddl_in,
-   (unsigned char *)in, inlen, true);
-   if (ret)
-   return ret;
-   ret = setup_ddl(&crb->target, wmem->ddl_out,
-   out, outlen, false);
+   ret = nx842_config_crb(in, inlen, out, outlen, wmem);
if (ret)
return ret;
 
+   crb = &wmem->crb;
+   csb = &crb->csb;
+
/* set up CCW */
ccw = 0;
ccw = SET_FIELD(CCW_CT, ccw, nx842_ct);
ccw = SET_FIELD(CCW_CI_842, ccw, 0); /* use 0 for hw auto-selection */
ccw = SET_FIELD(CCW_FC_842, ccw, fc);
 
-   /* set up CRB's CSB addr */
-   csb_addr = nx842_get_pa(csb) & CRB_CSB_ADDRESS;
-   csb_addr |= CRB_CSB_AT; /* Addrs are phys */
-   crb->csb_addr = cpu_to_be64(csb_addr);
-
wmem->start = ktime_get();
 
/* do ICSWX */
-- 
2.11.0





[PATCH V3 1/6] crypto/nx842: Rename nx842_powernv_function as icswx function

2017-07-21 Thread Haren Myneni

Rename nx842_powernv_function to nx842_powernv_exec.
nx842_powernv_exec points to nx842_exec_icswx and will point to the
VAS exec function, which will be added later for P9 NX support.

Signed-off-by: Haren Myneni 
---
 drivers/crypto/nx/nx-842-powernv.c | 20 +---
 1 file changed, 13 insertions(+), 7 deletions(-)

diff --git a/drivers/crypto/nx/nx-842-powernv.c b/drivers/crypto/nx/nx-842-powernv.c
index 3abb045cdba7..161987698bbc 100644
--- a/drivers/crypto/nx/nx-842-powernv.c
+++ b/drivers/crypto/nx/nx-842-powernv.c
@@ -54,7 +54,11 @@ struct nx842_coproc {
 
 /* no cpu hotplug on powernv, so this list never changes after init */
 static LIST_HEAD(nx842_coprocs);
-static unsigned int nx842_ct;
+static unsigned int nx842_ct;  /* used in icswx function */
+
+static int (*nx842_powernv_exec)(const unsigned char *in,
+   unsigned int inlen, unsigned char *out,
+   unsigned int *outlenp, void *workmem, int fc);
 
 /**
  * setup_indirect_dde - Setup an indirect DDE
@@ -355,7 +359,7 @@ static int wait_for_csb(struct nx842_workmem *wmem,
 }
 
 /**
- * nx842_powernv_function - compress/decompress data using the 842 algorithm
+ * nx842_exec_icswx - compress/decompress data using the 842 algorithm
  *
  * (De)compression provided by the NX842 coprocessor on IBM PowerNV systems.
  * This compresses or decompresses the provided input buffer into the provided
@@ -385,7 +389,7 @@ static int wait_for_csb(struct nx842_workmem *wmem,
 *   -ETIMEDOUT hardware did not complete operation in reasonable time
 *   -EINTR    operation was aborted
  */
-static int nx842_powernv_function(const unsigned char *in, unsigned int inlen,
+static int nx842_exec_icswx(const unsigned char *in, unsigned int inlen,
  unsigned char *out, unsigned int *outlenp,
  void *workmem, int fc)
 {
@@ -489,13 +493,13 @@ static int nx842_powernv_function(const unsigned char *in, unsigned int inlen,
  * @workmem: working memory buffer pointer, size determined by
  *   nx842_powernv_driver.workmem_size
  *
- * Returns: see @nx842_powernv_function()
+ * Returns: see @nx842_powernv_exec()
  */
 static int nx842_powernv_compress(const unsigned char *in, unsigned int inlen,
  unsigned char *out, unsigned int *outlenp,
  void *wmem)
 {
-   return nx842_powernv_function(in, inlen, out, outlenp,
+   return nx842_powernv_exec(in, inlen, out, outlenp,
  wmem, CCW_FC_842_COMP_CRC);
 }
 
@@ -517,13 +521,13 @@ static int nx842_powernv_compress(const unsigned char *in, unsigned int inlen,
  * @workmem: working memory buffer pointer, size determined by
  *   nx842_powernv_driver.workmem_size
  *
- * Returns: see @nx842_powernv_function()
+ * Returns: see @nx842_powernv_exec()
  */
 static int nx842_powernv_decompress(const unsigned char *in, unsigned int inlen,
unsigned char *out, unsigned int *outlenp,
void *wmem)
 {
-   return nx842_powernv_function(in, inlen, out, outlenp,
+   return nx842_powernv_exec(in, inlen, out, outlenp,
  wmem, CCW_FC_842_DECOMP_CRC);
 }
 
@@ -625,6 +629,8 @@ static __init int nx842_powernv_init(void)
if (!nx842_ct)
return -ENODEV;
 
+   nx842_powernv_exec = nx842_exec_icswx;
+
	ret = crypto_register_alg(&nx842_powernv_alg);
if (ret) {
struct nx842_coproc *coproc, *n;
-- 
2.11.0





[PATCH V3 0/6] Enable NX 842 compression engine on Power9

2017-07-21 Thread Haren Myneni

P9 introduces the Virtual Accelerator Switchboard (VAS) to communicate
with the NX 842 engine; previously the icswx instruction was used to
access NX. On PowerNV systems, the NX-842 driver invokes VAS functions
to configure an RxFIFO (receive window) for each NX engine. VAS uses
this FIFO to communicate requests to NX. The kernel opens a send
window, which is used to transfer compression/decompression requests
to VAS, and maps it to the corresponding RxFIFO. copy/paste
instructions are used to pass the CRB to VAS.
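
In rough outline (a sketch only - it assumes the VAS kernel API from
the series linked below; VAS_COP_TYPE_842_HIPRI and the variable names
are illustrative):

	/* init: one receive window (RxFIFO) per NX engine per chip */
	rxwin = vas_rx_win_open(vasid, VAS_COP_TYPE_842_HIPRI, &rxattr);

	/* crypto open: a send window on the local chip's engine,
	 * mapped to that RxFIFO
	 */
	txwin = vas_tx_win_open(vasid, VAS_COP_TYPE_842_HIPRI, &txattr);

	/* per request: build the CRB, then copy/paste it to VAS */
	ret = vas_copy_crb(crb, 0);
	if (!ret)
		ret = vas_paste_crb(txwin, 0, 1);

	/* crypto close: release the send window */
	vas_win_close(txwin);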

This patch series adds P9 NX support for the 842 compression engine.
The first 4 patches reorganize the current code so that the VAS
function can be added:
- nx842_powernv_function points to the VAS function if the VAS
  feature is available; otherwise the icswx function is used.
- Move the CRB configuration code into nx842_config_crb().
- In addition to freeing coprocessor structs on initialization
  failure and exit, both send and receive windows have to be closed
  for VAS.
- Move updating the coprocessor info list to nx842_add_coprocs_list().

The last 2 patches add the VAS configuration and invocation code, and
also checks for P9 NX specific errors that are reported in the
coprocessor status block (CSB) on failures.

Patches have been tested on a P9 DD1 system with the VAS changes, and
on P8 hardware to make sure there is no regression.

This patchset depends on VAS kernel changes:
https://lists.ozlabs.org/pipermail/linuxppc-dev/2017-May/158178.html

Thanks to Sukadev Bhattiprolu for his review, input and testing with
VAS changes. Also thanks to Michael Ellerman and Benjamin Herrenschmidt
for their valuable guidance and comments.

Changelog[V3]
- Disable preemption around copy/paste as Nicholas Piggin suggested.
- Use PTR_ALIGN for the workmem buffer based on Ram Pai's comment.
 
Changelog[v2]
- Open/close send windows in nx842_powernv_crypto_init/exit_vas().
- Changes for the new device-tree NX properties such as priority
  and compatible properties.
- Incorporated review comments from Michael Ellerman.
- Other minor issues found during HW testing.

Haren Myneni (6):
  crypto/nx842: Rename nx842_powernv_function as icswx function
  crypto/nx: Create nx842_configure_crb function
  crypto/nx: Create nx842_delete_coprocs function
  crypto/nx: Add nx842_add_coprocs_list function
  crypto/nx: Add P9 NX specific error codes for 842 engine
  crypto/nx: Add P9 NX support for 842 compression engine.

 arch/powerpc/include/asm/icswx.h   |   3 +
 drivers/crypto/nx/Kconfig  |   1 +
 drivers/crypto/nx/nx-842-powernv.c | 499 +
 drivers/crypto/nx/nx-842.c |   2 +-
 drivers/crypto/nx/nx-842.h |   8 +
 5 files changed, 465 insertions(+), 48 deletions(-)

-- 
2.11.0





Re: [PATCH V2 6/6] crypto/nx: Add P9 NX support for 842 compression engine

2017-07-21 Thread Haren Myneni
On 07/17/2017 11:53 PM, Ram Pai wrote:
> On Mon, Jul 17, 2017 at 04:50:38PM -0700, Haren Myneni wrote:
>>
>> This patch adds P9 NX support for 842 compression engine. Virtual
>> Accelerator Switchboard (VAS) is used to access 842 engine on P9.
>>
>> For each NX engine per chip, setup receive window using
>> vas_rx_win_open() which configures RxFIFO with FIFO address, lpid,
>> pid and tid values. This unique (lpid, pid, tid) combination will
>> be used to identify the target engine.
>>
>> For crypto open request, open send window on the NX engine for
>> the corresponding chip / cpu where the open request is executed.
>> This send window will be closed upon crypto close request.
>>
>> NX provides high and normal priority FIFOs. For compression /
>> decompression requests, we use only high priority FIFOs in kernel.
>>
>> Each NX request will be communicated to VAS using copy/paste
>> instructions with vas_copy_crb() / vas_paste_crb() functions.
>>
>> Signed-off-by: Haren Myneni 
>> ---
>>  drivers/crypto/nx/Kconfig  |   1 +
>>  drivers/crypto/nx/nx-842-powernv.c | 369 -
>>  drivers/crypto/nx/nx-842.c |   2 +-
>>  3 files changed, 365 insertions(+), 7 deletions(-)
>>
>> diff --git a/drivers/crypto/nx/Kconfig b/drivers/crypto/nx/Kconfig
>> index ad7552a6998c..cd5dda9c48f4 100644
>> --- a/drivers/crypto/nx/Kconfig
>> +++ b/drivers/crypto/nx/Kconfig
>> @@ -38,6 +38,7 @@ config CRYPTO_DEV_NX_COMPRESS_PSERIES
>>  config CRYPTO_DEV_NX_COMPRESS_POWERNV
>>  tristate "Compression acceleration support on PowerNV platform"
>>  depends on PPC_POWERNV
>> +depends on PPC_VAS
>>  default y
>>  help
>>Support for PowerPC Nest (NX) compression acceleration. This
>> diff --git a/drivers/crypto/nx/nx-842-powernv.c b/drivers/crypto/nx/nx-842-powernv.c
>> index c0dd4c7e17d3..8d9d21420144 100644
>> --- a/drivers/crypto/nx/nx-842-powernv.c
>> +++ b/drivers/crypto/nx/nx-842-powernv.c
>> @@ -23,6 +23,7 @@
>>  #include 
>>  #include 
>>  #include 
>> +#include <asm/vas.h>
>>
>>  MODULE_LICENSE("GPL");
>>  MODULE_AUTHOR("Dan Streetman ");
>> @@ -32,6 +33,9 @@ MODULE_ALIAS_CRYPTO("842-nx");
>>
>>  #define WORKMEM_ALIGN   (CRB_ALIGN)
>>  #define CSB_WAIT_MAX(5000) /* ms */
>> +#define VAS_RETRIES (10)
>> +/* # of requests allowed per RxFIFO at a time. 0 for unlimited */
>> +#define MAX_CREDITS_PER_RXFIFO  (64)
>>
>>  struct nx842_workmem {
>>  /* Below fields must be properly aligned */
>> @@ -42,16 +46,27 @@ struct nx842_workmem {
>>
>>  ktime_t start;
>>
>> +struct vas_window *txwin;   /* Used with VAS function */
>>  char padding[WORKMEM_ALIGN]; /* unused, to allow alignment */
>>  } __packed __aligned(WORKMEM_ALIGN);
>>
>>  struct nx842_coproc {
>>  unsigned int chip_id;
>>  unsigned int ct;
>> -unsigned int ci;
>> +unsigned int ci;/* Coprocessor instance, used with icswx */
>> +struct {
>> +struct vas_window *rxwin;
>> +int id;
>> +} vas;
> 
> ci and vas are mutually exclusive. a few bytes could be saved by unionizing 
> them?

We will have only a few coproc entries - one NX engine per chip.
> 
>>  struct list_head list;
>>  };
>>
>> +/*
>> + * Send the request to NX engine on the chip for the corresponding CPU
>> + * where the process is executing. Use with VAS function.
>> + */
>> +static DEFINE_PER_CPU(struct nx842_coproc *, coproc_inst);
>> +
>>  /* no cpu hotplug on powernv, so this list never changes after init */
>>  static LIST_HEAD(nx842_coprocs);
>>  static unsigned int nx842_ct;   /* used in icswx function */
>> @@ -513,6 +528,108 @@ static int nx842_exec_icswx(const unsigned char *in, unsigned int inlen,
>>  }
>>
>>  /**
>> + * nx842_exec_vas - compress/decompress data using the 842 algorithm
>> + *
>> + * (De)compression provided by the NX842 coprocessor on IBM PowerNV systems.
>> + * This compresses or decompresses the provided input buffer into the provided
>> + * output buffer.
>> + *
>> + * Upon return from this function @outlenp contains the length of the
>> + * output data.  If there is an error then @outlenp will be 0 and an
>> + * error will be specified by the return code from this function.
>> + *
>> + * The @workmem buffer should only be used by one function call at a time.
>> + *
>> + * @in: input buffer pointer
>> + * @inlen: input buffer size
>> + * @out: output buffer pointer
>> + * @outlenp: output buffer size pointer
>> + * @workmem: working memory buffer pointer, size determined by
>> + *   nx842_powernv_driver.workmem_size
>> + * @fc: function code, see CCW Function Codes in nx-842.h
>> + *
>> + * Returns:
>> + *   0  Success, output of length @outlenp stored in the buffer
>> + *  at @out
>> + *   -ENODEV    Hardware unavailable
>> + *   -ENOSPC    Output buffer is too small
>> + *   -EMSGSIZE  Input buffer too large
>> + *   -EINVAL    buffer 

[PATCH v2 4/4] crypto: ccp - Add XTS-AES-256 support for CCP version 5

2017-07-21 Thread Gary R Hook
Signed-off-by: Gary R Hook 
---
 drivers/crypto/ccp/ccp-crypto-aes-xts.c |   16 +---
 drivers/crypto/ccp/ccp-crypto.h |2 +-
 drivers/crypto/ccp/ccp-ops.c|3 +++
 3 files changed, 17 insertions(+), 4 deletions(-)

diff --git a/drivers/crypto/ccp/ccp-crypto-aes-xts.c b/drivers/crypto/ccp/ccp-crypto-aes-xts.c
index 3c37794ffe2d..4a3fe4d5ac71 100644
--- a/drivers/crypto/ccp/ccp-crypto-aes-xts.c
+++ b/drivers/crypto/ccp/ccp-crypto-aes-xts.c
@@ -80,19 +80,24 @@ static int ccp_aes_xts_setkey(struct crypto_ablkcipher *tfm, const u8 *key,
 {
struct crypto_tfm *xfm = crypto_ablkcipher_tfm(tfm);
struct ccp_ctx *ctx = crypto_tfm_ctx(xfm);
+   unsigned int ccpversion = ccp_version();
int ret;
 
ret = xts_check_key(xfm, key, key_len);
if (ret)
return ret;
 
-   /* Only support 128-bit AES key with a 128-bit Tweak key,
-* otherwise use the fallback
+   /* Version 3 devices support 128-bit keys; version 5 devices can
+* accommodate 128- and 256-bit keys.
 */
switch (key_len) {
case AES_KEYSIZE_128 * 2:
memcpy(ctx->u.aes.key, key, key_len);
break;
+   case AES_KEYSIZE_256 * 2:
+   if (ccpversion > CCP_VERSION(3, 0))
+   memcpy(ctx->u.aes.key, key, key_len);
+   break;
}
ctx->u.aes.key_len = key_len / 2;
	sg_init_one(&ctx->u.aes.key_sg, ctx->u.aes.key, key_len);
@@ -105,6 +110,7 @@ static int ccp_aes_xts_crypt(struct ablkcipher_request *req,
 {
struct ccp_ctx *ctx = crypto_tfm_ctx(req->base.tfm);
struct ccp_aes_req_ctx *rctx = ablkcipher_request_ctx(req);
+   unsigned int ccpversion = ccp_version();
unsigned int fallback = 0;
unsigned int unit;
u32 block_size;
@@ -141,7 +147,11 @@ static int ccp_aes_xts_crypt(struct ablkcipher_request *req,
 */
if (unit_size == CCP_XTS_AES_UNIT_SIZE__LAST)
fallback = 1;
-   if (ctx->u.aes.key_len != AES_KEYSIZE_128)
+   if ((ccpversion < CCP_VERSION(5, 0)) &&
+   (ctx->u.aes.key_len != AES_KEYSIZE_128))
+   fallback = 1;
+   if ((ctx->u.aes.key_len != AES_KEYSIZE_128) &&
+   (ctx->u.aes.key_len != AES_KEYSIZE_256))
fallback = 1;
if (fallback) {
SKCIPHER_REQUEST_ON_STACK(subreq, ctx->u.aes.tfm_skcipher);
diff --git a/drivers/crypto/ccp/ccp-crypto.h b/drivers/crypto/ccp/ccp-crypto.h
index 156b8233853f..880f8acdd0cd 100644
--- a/drivers/crypto/ccp/ccp-crypto.h
+++ b/drivers/crypto/ccp/ccp-crypto.h
@@ -91,7 +91,7 @@ struct ccp_aes_ctx {
 
struct scatterlist key_sg;
unsigned int key_len;
-   u8 key[AES_MAX_KEY_SIZE];
+   u8 key[AES_MAX_KEY_SIZE * 2];
 
u8 nonce[CTR_RFC3686_NONCE_SIZE];
 
diff --git a/drivers/crypto/ccp/ccp-ops.c b/drivers/crypto/ccp/ccp-ops.c
index 8113355151d2..fbd024f6e898 100644
--- a/drivers/crypto/ccp/ccp-ops.c
+++ b/drivers/crypto/ccp/ccp-ops.c
@@ -1065,6 +1065,8 @@ static int ccp_run_xts_aes_cmd(struct ccp_cmd_queue *cmd_q,
 
if (xts->key_len == AES_KEYSIZE_128)
aestype = CCP_AES_TYPE_128;
+   else if (xts->key_len == AES_KEYSIZE_256)
+   aestype = CCP_AES_TYPE_256;
else
return -EINVAL;
 
@@ -1089,6 +1091,7 @@ static int ccp_run_xts_aes_cmd(struct ccp_cmd_queue *cmd_q,
op.sb_ctx = cmd_q->sb_ctx;
op.init = 1;
op.u.xts.action = xts->action;
+   op.u.xts.type = aestype;
op.u.xts.unit_size = xts->unit_size;
 
/* A version 3 device only supports 128-bit keys, which fits into a



[PATCH v2 0/4] Update support for XTS-AES on AMD CCPs

2017-07-21 Thread Gary R Hook
The following series adds support for XTS-AES on version 5 CCPs,
both 128- and 256-bit, and enhances/clarifies/simplifies some
crypto layer code.

Changes since v1:
 - rework the validation of the unit-size; move to a separate patch
 - expand the key buffer to accommodate 256-bit keys
 - use xts_check_key() in the crypto layer

---

Gary R Hook (4):
  crypto: ccp - Add a call to xts_check_key()
  crypto: ccp - Enable XTS-AES-128 support on all CCPs
  crypto: ccp - Rework the unit-size check for XTS-AES
  crypto: ccp - Add XTS-AES-256 support for CCP version 5


 drivers/crypto/ccp/ccp-crypto-aes-xts.c |   98 +--
 drivers/crypto/ccp/ccp-crypto.h |2 -
 drivers/crypto/ccp/ccp-dev-v5.c |2 +
 drivers/crypto/ccp/ccp-dev.h|2 +
 drivers/crypto/ccp/ccp-ops.c|   55 +++--
 5 files changed, 106 insertions(+), 53 deletions(-)

--


[PATCH v2 3/4] crypto: ccp - Rework the unit-size check for XTS-AES

2017-07-21 Thread Gary R Hook
The CCP supports a limited set of unit-size values. Change the check
for this parameter such that acceptable values match the enumeration.
Then clarify the conditions under which we must use the fallback
implementation.

Signed-off-by: Gary R Hook 
---
 drivers/crypto/ccp/ccp-crypto-aes-xts.c |   75 ++-
 1 file changed, 35 insertions(+), 40 deletions(-)

diff --git a/drivers/crypto/ccp/ccp-crypto-aes-xts.c b/drivers/crypto/ccp/ccp-crypto-aes-xts.c
index 4a313f62dbea..3c37794ffe2d 100644
--- a/drivers/crypto/ccp/ccp-crypto-aes-xts.c
+++ b/drivers/crypto/ccp/ccp-crypto-aes-xts.c
@@ -1,8 +1,9 @@
 /*
  * AMD Cryptographic Coprocessor (CCP) AES XTS crypto API support
  *
- * Copyright (C) 2013 Advanced Micro Devices, Inc.
+ * Copyright (C) 2013,2017 Advanced Micro Devices, Inc.
  *
+ * Author: Gary R Hook 
  * Author: Tom Lendacky 
  *
  * This program is free software; you can redistribute it and/or modify
@@ -38,46 +39,26 @@ struct ccp_unit_size_map {
u32 value;
 };
 
-static struct ccp_unit_size_map unit_size_map[] = {
+static struct ccp_unit_size_map xts_unit_sizes[] = {
{
-   .size   = 4096,
-   .value  = CCP_XTS_AES_UNIT_SIZE_4096,
-   },
-   {
-   .size   = 2048,
-   .value  = CCP_XTS_AES_UNIT_SIZE_2048,
-   },
-   {
-   .size   = 1024,
-   .value  = CCP_XTS_AES_UNIT_SIZE_1024,
+   .size   = 16,
+   .value  = CCP_XTS_AES_UNIT_SIZE_16,
},
{
-   .size   = 512,
+   .size   = 512,
.value  = CCP_XTS_AES_UNIT_SIZE_512,
},
{
-   .size   = 256,
-   .value  = CCP_XTS_AES_UNIT_SIZE__LAST,
-   },
-   {
-   .size   = 128,
-   .value  = CCP_XTS_AES_UNIT_SIZE__LAST,
-   },
-   {
-   .size   = 64,
-   .value  = CCP_XTS_AES_UNIT_SIZE__LAST,
-   },
-   {
-   .size   = 32,
-   .value  = CCP_XTS_AES_UNIT_SIZE__LAST,
+   .size   = 1024,
+   .value  = CCP_XTS_AES_UNIT_SIZE_1024,
},
{
-   .size   = 16,
-   .value  = CCP_XTS_AES_UNIT_SIZE_16,
+   .size   = 2048,
+   .value  = CCP_XTS_AES_UNIT_SIZE_2048,
},
{
-   .size   = 1,
-   .value  = CCP_XTS_AES_UNIT_SIZE__LAST,
+   .size   = 4096,
+   .value  = CCP_XTS_AES_UNIT_SIZE_4096,
},
 };
 
@@ -124,7 +105,9 @@ static int ccp_aes_xts_crypt(struct ablkcipher_request *req,
 {
struct ccp_ctx *ctx = crypto_tfm_ctx(req->base.tfm);
struct ccp_aes_req_ctx *rctx = ablkcipher_request_ctx(req);
+   unsigned int fallback = 0;
unsigned int unit;
+   u32 block_size;
u32 unit_size;
int ret;
 
@@ -137,18 +120,30 @@ static int ccp_aes_xts_crypt(struct ablkcipher_request *req,
if (!req->info)
return -EINVAL;
 
+   /* Check conditions under which the CCP can fulfill a request. The
+* device can handle input plaintext of a length that is a multiple
+* of the unit_size, but the crypto implementation only supports
+* the unit_size being equal to the input length. This limits the
+* number of scenarios we can handle. Also validate the key length.
+*/
+   block_size = 0;
unit_size = CCP_XTS_AES_UNIT_SIZE__LAST;
-   if (req->nbytes <= unit_size_map[0].size) {
-   for (unit = 0; unit < ARRAY_SIZE(unit_size_map); unit++) {
-   if (!(req->nbytes & (unit_size_map[unit].size - 1))) {
-   unit_size = unit_size_map[unit].value;
-   break;
-   }
+   for (unit = 0; unit < ARRAY_SIZE(xts_unit_sizes); unit++) {
+   if (req->nbytes == xts_unit_sizes[unit].size) {
+   unit_size = unit;
+   block_size = xts_unit_sizes[unit].size;
+   break;
}
}
-
-   if ((unit_size == CCP_XTS_AES_UNIT_SIZE__LAST) ||
-   (ctx->u.aes.key_len != AES_KEYSIZE_128)) {
+   /* The CCP has restrictions on block sizes. Also, a version 3 device
+* only supports AES-128 operations; version 5 CCPs support both
+* AES-128 and -256 operations.
+*/
+   if (unit_size == CCP_XTS_AES_UNIT_SIZE__LAST)
+   fallback = 1;
+   if (ctx->u.aes.key_len != AES_KEYSIZE_128)
+   fallback = 1;
+   if (fallback) {
SKCIPHER_REQUEST_ON_STACK(subreq, ctx->u.aes.tfm_skcipher);
 
/* Use the fallback to process the request for any



[PATCH v2 1/4] crypto: ccp - Add a call to xts_check_key()

2017-07-21 Thread Gary R Hook
Vet the key using the available standard function

Signed-off-by: Gary R Hook 
---
 drivers/crypto/ccp/ccp-crypto-aes-xts.c |9 -
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/drivers/crypto/ccp/ccp-crypto-aes-xts.c b/drivers/crypto/ccp/ccp-crypto-aes-xts.c
index 58a4244b4752..4a313f62dbea 100644
--- a/drivers/crypto/ccp/ccp-crypto-aes-xts.c
+++ b/drivers/crypto/ccp/ccp-crypto-aes-xts.c
@@ -15,6 +15,7 @@
 #include 
 #include 
 #include 
+#include <crypto/xts.h>
 #include 
 #include 
 
@@ -96,7 +97,13 @@ static int ccp_aes_xts_complete(struct crypto_async_request *async_req, int ret)
 static int ccp_aes_xts_setkey(struct crypto_ablkcipher *tfm, const u8 *key,
  unsigned int key_len)
 {
-   struct ccp_ctx *ctx = crypto_tfm_ctx(crypto_ablkcipher_tfm(tfm));
+   struct crypto_tfm *xfm = crypto_ablkcipher_tfm(tfm);
+   struct ccp_ctx *ctx = crypto_tfm_ctx(xfm);
+   int ret;
+
+   ret = xts_check_key(xfm, key, key_len);
+   if (ret)
+   return ret;
 
/* Only support 128-bit AES key with a 128-bit Tweak key,
 * otherwise use the fallback



[PATCH v2 2/4] crypto: ccp - Enable XTS-AES-128 support on all CCPs

2017-07-21 Thread Gary R Hook
Version 5 CCPs have some new requirements for XTS-AES: the type field
must be specified, and the key requires 512 bits, with each part
occupying 256 bits and padded with zeroes.
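
For reference, a sketch of how the 512-bit vector could be filled on a
version 5 device (this assumes CCP_SB_BYTES is the 32-byte storage
block slot size and reuses this driver's ccp_set_dm_area() helper; the
variable names are illustrative):

	/* AES key in the low 256-bit slot, tweak key in the high slot,
	 * each zero-padded at the front to fill its slot
	 */
	unsigned int pad = CCP_SB_BYTES - xts->key_len;

	ccp_set_dm_area(&key, pad, xts->key, 0, xts->key_len);
	ccp_set_dm_area(&key, CCP_SB_BYTES + pad, xts->key,
			xts->key_len, xts->key_len);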

Signed-off-by: Gary R Hook 
---
 drivers/crypto/ccp/ccp-dev-v5.c |2 ++
 drivers/crypto/ccp/ccp-dev.h|2 ++
 drivers/crypto/ccp/ccp-ops.c|   52 ---
 3 files changed, 47 insertions(+), 9 deletions(-)

diff --git a/drivers/crypto/ccp/ccp-dev-v5.c b/drivers/crypto/ccp/ccp-dev-v5.c
index b3526336d608..0fb4519c5194 100644
--- a/drivers/crypto/ccp/ccp-dev-v5.c
+++ b/drivers/crypto/ccp/ccp-dev-v5.c
@@ -145,6 +145,7 @@ union ccp_function {
 #defineCCP_AES_MODE(p) ((p)->aes.mode)
 #defineCCP_AES_TYPE(p) ((p)->aes.type)
 #defineCCP_XTS_SIZE(p) ((p)->aes_xts.size)
+#defineCCP_XTS_TYPE(p) ((p)->aes_xts.type)
 #defineCCP_XTS_ENCRYPT(p)  ((p)->aes_xts.encrypt)
 #defineCCP_DES3_SIZE(p)((p)->des3.size)
 #defineCCP_DES3_ENCRYPT(p) ((p)->des3.encrypt)
@@ -346,6 +347,7 @@ static int ccp5_perform_xts_aes(struct ccp_op *op)
function.raw = 0;
	CCP_XTS_ENCRYPT(&function) = op->u.xts.action;
	CCP_XTS_SIZE(&function) = op->u.xts.unit_size;
+   CCP_XTS_TYPE(&function) = op->u.xts.type;
	CCP5_CMD_FUNCTION(&desc) = function.raw;
 
	CCP5_CMD_LEN(&desc) = op->src.u.dma.length;
diff --git a/drivers/crypto/ccp/ccp-dev.h b/drivers/crypto/ccp/ccp-dev.h
index 9320931d89da..3d51180199ac 100644
--- a/drivers/crypto/ccp/ccp-dev.h
+++ b/drivers/crypto/ccp/ccp-dev.h
@@ -194,6 +194,7 @@
 #define CCP_AES_CTX_SB_COUNT   1
 
 #define CCP_XTS_AES_KEY_SB_COUNT   1
+#define CCP5_XTS_AES_KEY_SB_COUNT  2
 #define CCP_XTS_AES_CTX_SB_COUNT   1
 
 #define CCP_DES3_KEY_SB_COUNT  1
@@ -497,6 +498,7 @@ struct ccp_aes_op {
 };
 
 struct ccp_xts_aes_op {
+   enum ccp_aes_type type;
enum ccp_aes_action action;
enum ccp_xts_aes_unit_size unit_size;
 };
diff --git a/drivers/crypto/ccp/ccp-ops.c b/drivers/crypto/ccp/ccp-ops.c
index e23d138fc1ce..8113355151d2 100644
--- a/drivers/crypto/ccp/ccp-ops.c
+++ b/drivers/crypto/ccp/ccp-ops.c
@@ -1038,6 +1038,8 @@ static int ccp_run_xts_aes_cmd(struct ccp_cmd_queue *cmd_q,
struct ccp_op op;
unsigned int unit_size, dm_offset;
bool in_place = false;
+   unsigned int sb_count = 0;
+   enum ccp_aes_type aestype;
int ret;
 
switch (xts->unit_size) {
@@ -1061,9 +1063,12 @@ static int ccp_run_xts_aes_cmd(struct ccp_cmd_queue *cmd_q,
return -EINVAL;
}
 
-   if (xts->key_len != AES_KEYSIZE_128)
+   if (xts->key_len == AES_KEYSIZE_128)
+   aestype = CCP_AES_TYPE_128;
+   else
return -EINVAL;
 
+
if (!xts->final && (xts->src_len & (AES_BLOCK_SIZE - 1)))
return -EINVAL;
 
@@ -1086,20 +1091,47 @@ static int ccp_run_xts_aes_cmd(struct ccp_cmd_queue *cmd_q,
op.u.xts.action = xts->action;
op.u.xts.unit_size = xts->unit_size;
 
-   /* All supported key sizes fit in a single (32-byte) SB entry
-* and must be in little endian format. Use the 256-bit byte
-* swap passthru option to convert from big endian to little
-* endian.
+   /* A version 3 device only supports 128-bit keys, which fits into a
+* single SB entry. A version 5 device uses a 512-bit vector, so two
+* SB entries.
 */
+   if (cmd_q->ccp->vdata->version == CCP_VERSION(3, 0))
+   sb_count = CCP_XTS_AES_KEY_SB_COUNT;
+   else
+   sb_count = CCP5_XTS_AES_KEY_SB_COUNT;
	ret = ccp_init_dm_workarea(&key, cmd_q,
-  CCP_XTS_AES_KEY_SB_COUNT * CCP_SB_BYTES,
+  sb_count * CCP_SB_BYTES,
   DMA_TO_DEVICE);
if (ret)
return ret;
 
-   dm_offset = CCP_SB_BYTES - AES_KEYSIZE_128;
-   ccp_set_dm_area(&key, dm_offset, xts->key, 0, xts->key_len);
-   ccp_set_dm_area(&key, 0, xts->key, dm_offset, xts->key_len);
+   if (cmd_q->ccp->vdata->version == CCP_VERSION(3, 0)) {
+   /* All supported key sizes must be in little endian format.
+* Use the 256-bit byte swap passthru option to convert from
+* big endian to little endian.
+*/
+   dm_offset = CCP_SB_BYTES - AES_KEYSIZE_128;
+   ccp_set_dm_area(&key, dm_offset, xts->key, 0, xts->key_len);
+   ccp_set_dm_area(&key, 0, xts->key, xts->key_len, xts->key_len);
+   } else {
+   /* The AES key is at the little end and the tweak key is
+* at the big end. Since the byteswap operation is only
+* 256-bit, load the buffer according to the way things
+* will end up.
+*/
+   unsigned int pad;
+
+   op.sb_key = 

[PATCH] crypto: Kconfig: Correct help text about feeding entropy pool

2017-07-21 Thread PrasannaKumar Muralidharan
Modify the Kconfig help text to reflect the fact that random data from
hwrng is fed into the kernel random number generator's entropy pool.

Signed-off-by: PrasannaKumar Muralidharan 
---
 drivers/char/hw_random/Kconfig | 6 ++
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/drivers/char/hw_random/Kconfig b/drivers/char/hw_random/Kconfig
index 1b223c3..2f45e15 100644
--- a/drivers/char/hw_random/Kconfig
+++ b/drivers/char/hw_random/Kconfig
@@ -13,10 +13,8 @@ menuconfig HW_RANDOM
  that's usually called /dev/hwrng, and which exposes one
  of possibly several hardware random number generators.
 
- These hardware random number generators do not feed directly
- into the kernel's random number generator.  That is usually
- handled by the "rngd" daemon.  Documentation/hw_random.txt
- has more information.
+ These hardware random number generators do feed into the
+ kernel's random number generator entropy pool.
 
  If unsure, say Y.
 
-- 
2.10.0



Re: [PATCH] crypto: ccp - Fix XTS-AES support on a version 5 CCP

2017-07-21 Thread Gary R Hook

On 07/17/2017 04:48 PM, Lendacky, Thomas wrote:

On 7/17/2017 3:08 PM, Gary R Hook wrote:

Version 5 CCPs have differing requirements for XTS-AES: key components
are stored in a 512-bit vector. The context must be little-endian
justified. AES-256 is supported now, so propagate the cipher size to
the command descriptor.

Signed-off-by: Gary R Hook 



Look for a version 2


---
  drivers/crypto/ccp/ccp-crypto-aes-xts.c |   79 ---
  drivers/crypto/ccp/ccp-dev-v5.c |2 +
  drivers/crypto/ccp/ccp-dev.h|2 +
  drivers/crypto/ccp/ccp-ops.c|   56 ++
  4 files changed, 89 insertions(+), 50 deletions(-)

diff --git a/drivers/crypto/ccp/ccp-crypto-aes-xts.c b/drivers/crypto/ccp/ccp-crypto-aes-xts.c
index 58a4244b4752..8d248b198e22 100644
--- a/drivers/crypto/ccp/ccp-crypto-aes-xts.c
+++ b/drivers/crypto/ccp/ccp-crypto-aes-xts.c
@@ -1,7 +1,7 @@
  /*
   * AMD Cryptographic Coprocessor (CCP) AES XTS crypto API support
   *
- * Copyright (C) 2013 Advanced Micro Devices, Inc.
+ * Copyright (C) 2013,2017 Advanced Micro Devices, Inc.
   *
   * Author: Tom Lendacky 
   *
@@ -37,46 +37,26 @@ struct ccp_unit_size_map {
   u32 value;
  };

-static struct ccp_unit_size_map unit_size_map[] = {
+static struct ccp_unit_size_map xts_unit_sizes[] = {
   {
- .size   = 4096,
- .value  = CCP_XTS_AES_UNIT_SIZE_4096,
- },
- {
- .size   = 2048,
- .value  = CCP_XTS_AES_UNIT_SIZE_2048,
- },
- {
- .size   = 1024,
- .value  = CCP_XTS_AES_UNIT_SIZE_1024,
+ .size   = 16,
+ .value  = CCP_XTS_AES_UNIT_SIZE_16,
   },
   {
- .size   = 512,
+ .size   = 512,
   .value  = CCP_XTS_AES_UNIT_SIZE_512,
   },
   {
- .size   = 256,
- .value  = CCP_XTS_AES_UNIT_SIZE__LAST,
- },
- {
- .size   = 128,
- .value  = CCP_XTS_AES_UNIT_SIZE__LAST,
- },
- {
- .size   = 64,
- .value  = CCP_XTS_AES_UNIT_SIZE__LAST,
- },
- {
- .size   = 32,
- .value  = CCP_XTS_AES_UNIT_SIZE__LAST,
+ .size   = 1024,
+ .value  = CCP_XTS_AES_UNIT_SIZE_1024,
   },
   {
- .size   = 16,
- .value  = CCP_XTS_AES_UNIT_SIZE_16,
+ .size   = 2048,
+ .value  = CCP_XTS_AES_UNIT_SIZE_2048,
   },
   {
- .size   = 1,
- .value  = CCP_XTS_AES_UNIT_SIZE__LAST,
+ .size   = 4096,
+ .value  = CCP_XTS_AES_UNIT_SIZE_4096,
   },
  };


Because of the way the unit size check is performed, you can't delete
the intermediate size checks.  Those must remain so that unit sizes
that aren't supported by the CCP are sent to the fallback mechanism.


Given the limitations of the CCP (w.r.t. XTS-AES) I thought it clearer
to look for only those unit-sizes that are supported, and to use the
enumerated value as the index.


Also, re-arranging the order should be a separate patch if that doesn't
really fix anything.


Yes, agreed.



@@ -97,14 +77,20 @@ static int ccp_aes_xts_setkey(struct crypto_ablkcipher *tfm, const u8 *key,
 unsigned int key_len)
  {
   struct ccp_ctx *ctx = crypto_tfm_ctx(crypto_ablkcipher_tfm(tfm));
+ unsigned int ccpversion = ccp_version();

   /* Only support 128-bit AES key with a 128-bit Tweak key,
* otherwise use the fallback
*/
+


Remove the addition of the blank line and update the above comment to
indicate the new supported key size added below.


Yes.




   switch (key_len) {
   case AES_KEYSIZE_128 * 2:
   memcpy(ctx->u.aes.key, key, key_len);
   break;
+ case AES_KEYSIZE_256 * 2:
+ if (ccpversion > CCP_VERSION(3, 0))
+ memcpy(ctx->u.aes.key, key, key_len);
+ break;


Isn't u.aes.key defined with a maximum buffer size of AES_MAX_KEY_SIZE
(which is 32)?  I think this will cause a buffer overrun on memcpy.


Yes, the structure member needs to be made larger.




   }
   ctx->u.aes.key_len = key_len / 2;
   sg_init_one(&ctx->u.aes.key_sg, ctx->u.aes.key, key_len);
@@ -117,7 +103,10 @@ static int ccp_aes_xts_crypt(struct ablkcipher_request *req,
  {
   struct ccp_ctx *ctx = crypto_tfm_ctx(req->base.tfm);
   struct ccp_aes_req_ctx *rctx = ablkcipher_request_ctx(req);
+ unsigned int ccpversion = ccp_version();
+ unsigned int fallback = 0;
   unsigned int unit;
+ u32 block_size = 0;
   u32 unit_size;
   int ret;

@@ -131,17 +120,29 @@ static int ccp_aes_xts_crypt(struct ablkcipher_request *req,
   return -EINVAL;

   unit_size = CCP_XTS_AES_UNIT_SIZE__LAST;
- if (req->nbytes <= unit_size_map[0].size) {


This check can't be deleted.  It was added 

[PATCH v2 0/3] crypto: scompress - defer allocation of percpu scratch buffers

2017-07-21 Thread Ard Biesheuvel
This is a followup to 'crypto: scompress - eliminate percpu scratch buffers',
which attempted to replace the scompress per-CPU buffer entirely, but as
Herbert pointed out, this is not going to fly in the targeted use cases.

Instead, move the alloc/free of the buffers into the tfm init/exit hooks,
so that we only have the per-CPU buffers allocated if there are any
acomp ciphers of the right kind (i.e., ones that encapsulate a
synchronous implementation) in use (#3).

Patches #1 and #2 are fixes for issues I spotted when working on this code.

Ard Biesheuvel (3):
  crypto: scompress - don't sleep with preemption disabled
  crypto: scompress - free partially allocated scratch buffers on
failure
  crypto: scompress - defer allocation of scratch buffer to first use

 crypto/scompress.c | 55 
 1 file changed, 22 insertions(+), 33 deletions(-)

-- 
2.11.0



[PATCH v2 3/3] crypto: scompress - defer allocation of scratch buffer to first use

2017-07-21 Thread Ard Biesheuvel
The scompress code allocates 2 x 128 KB of scratch buffers for each CPU,
so that clients of the async API can use synchronous implementations
even from atomic context. However, on systems such as Cavium ThunderX
(which has 96 cores), this adds up to a non-negligible 24 MB. Also,
32-bit systems may prefer to use their precious vmalloc space for other
things, especially since there don't appear to be any clients for the
async compression API yet.

So let's defer allocation of the scratch buffers until the first time
we allocate an acompress cipher based on an scompress implementation.

Signed-off-by: Ard Biesheuvel 
---
 crypto/scompress.c | 46 
 1 file changed, 17 insertions(+), 29 deletions(-)

diff --git a/crypto/scompress.c b/crypto/scompress.c
index 2c07648305ad..2075e2c4e7df 100644
--- a/crypto/scompress.c
+++ b/crypto/scompress.c
@@ -65,11 +65,6 @@ static void crypto_scomp_show(struct seq_file *m, struct crypto_alg *alg)
seq_puts(m, "type : scomp\n");
 }
 
-static int crypto_scomp_init_tfm(struct crypto_tfm *tfm)
-{
-   return 0;
-}
-
 static void crypto_scomp_free_scratches(void * __percpu *scratches)
 {
int i;
@@ -134,6 +129,17 @@ static int crypto_scomp_alloc_all_scratches(void)
return 0;
 }
 
+static int crypto_scomp_init_tfm(struct crypto_tfm *tfm)
+{
+   int ret;
+
+   mutex_lock(&scomp_lock);
+   ret = crypto_scomp_alloc_all_scratches();
+   mutex_unlock(&scomp_lock);
+
+   return ret;
+}
+
 static void crypto_scomp_sg_free(struct scatterlist *sgl)
 {
int i, n;
@@ -241,6 +247,10 @@ static void crypto_exit_scomp_ops_async(struct crypto_tfm *tfm)
struct crypto_scomp **ctx = crypto_tfm_ctx(tfm);
 
crypto_free_scomp(*ctx);
+
+   mutex_lock(&scomp_lock);
+   crypto_scomp_free_all_scratches();
+   mutex_unlock(&scomp_lock);
 }
 
 int crypto_init_scomp_ops_async(struct crypto_tfm *tfm)
@@ -317,40 +327,18 @@ static const struct crypto_type crypto_scomp_type = {
 int crypto_register_scomp(struct scomp_alg *alg)
 {
	struct crypto_alg *base = &alg->base;
-   int ret = -ENOMEM;
-
-   mutex_lock(&scomp_lock);
-   if (crypto_scomp_alloc_all_scratches())
-           goto error;
 
base->cra_type = _scomp_type;
base->cra_flags &= ~CRYPTO_ALG_TYPE_MASK;
base->cra_flags |= CRYPTO_ALG_TYPE_SCOMPRESS;
 
-   ret = crypto_register_alg(base);
-   if (ret)
-   goto error;
-
-   mutex_unlock(&scomp_lock);
-   return ret;
-
-error:
-   crypto_scomp_free_all_scratches();
-   mutex_unlock(&scomp_lock);
-   return ret;
+   return crypto_register_alg(base);
 }
 EXPORT_SYMBOL_GPL(crypto_register_scomp);
 
 int crypto_unregister_scomp(struct scomp_alg *alg)
 {
-   int ret;
-
-   mutex_lock(&scomp_lock);
-   ret = crypto_unregister_alg(&alg->base);
-   crypto_scomp_free_all_scratches();
-   mutex_unlock(&scomp_lock);
-
-   return ret;
+   return crypto_unregister_alg(&alg->base);
 }
 EXPORT_SYMBOL_GPL(crypto_unregister_scomp);
 
-- 
2.11.0



[PATCH v2 2/3] crypto: scompress - free partially allocated scratch buffers on failure

2017-07-21 Thread Ard Biesheuvel
When allocating the per-CPU scratch buffers, we allocate the source
and destination buffers separately, but bail immediately if the second
allocation fails, without freeing the first one. Fix that.

Signed-off-by: Ard Biesheuvel 
---
 crypto/scompress.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/crypto/scompress.c b/crypto/scompress.c
index 0b40d991d65f..2c07648305ad 100644
--- a/crypto/scompress.c
+++ b/crypto/scompress.c
@@ -125,8 +125,11 @@ static int crypto_scomp_alloc_all_scratches(void)
if (!scomp_src_scratches)
return -ENOMEM;
scomp_dst_scratches = crypto_scomp_alloc_scratches();
-   if (!scomp_dst_scratches)
+   if (!scomp_dst_scratches) {
+   crypto_scomp_free_scratches(scomp_src_scratches);
+   scomp_src_scratches = NULL;
return -ENOMEM;
+   }
}
return 0;
 }
-- 
2.11.0



[PATCH v2 1/3] crypto: scompress - don't sleep with preemption disabled

2017-07-21 Thread Ard Biesheuvel
Due to the use of per-CPU buffers, scomp_acomp_comp_decomp() executes
with preemption disabled, and so whether the CRYPTO_TFM_REQ_MAY_SLEEP
flag is set is irrelevant, since we cannot sleep anyway. So disregard
the flag, and use GFP_ATOMIC unconditionally.

Cc: <stable@vger.kernel.org> # v4.10+
Signed-off-by: Ard Biesheuvel 
---
 crypto/scompress.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/crypto/scompress.c b/crypto/scompress.c
index ae1d3cf209e4..0b40d991d65f 100644
--- a/crypto/scompress.c
+++ b/crypto/scompress.c
@@ -211,9 +211,7 @@ static int scomp_acomp_comp_decomp(struct acomp_req *req, int dir)
  scratch_dst, &req->dlen, *ctx);
if (!ret) {
if (!req->dst) {
-   req->dst = crypto_scomp_sg_alloc(req->dlen,
-  req->base.flags & CRYPTO_TFM_REQ_MAY_SLEEP ?
-  GFP_KERNEL : GFP_ATOMIC);
+   req->dst = crypto_scomp_sg_alloc(req->dlen, GFP_ATOMIC);
if (!req->dst)
goto out;
}
-- 
2.11.0



Re: [RFC PATCH v12 3/4] Linux Random Number Generator

2017-07-21 Thread Stephan Müller
On Friday, 21 July 2017, 17:09:11 CEST, Arnd Bergmann wrote:

Hi Arnd,

> On Fri, Jul 21, 2017 at 10:57 AM, Stephan Müller  
wrote:
> > On Friday, 21 July 2017, 05:08:47 CEST, Theodore Ts'o wrote:
> >> Um, the timer is the largest number of interrupts on my system.  Compare:
> >> CPU0   CPU1   CPU2   CPU3
> >>  
> >>  LOC:6396552603886565586466057102   Local timer
> >>  interrupts
> >> 
> >> with the number of disk related interrupts:
> >>  120:  21492 139284  405131705886   PCI-MSI 376832-edge
> >> 
> >> ahci[:00:17.0]
> > 
> > They seem to be not picked up with the add_interrupt_randomness function.
> 
> On x86, the local APIC timer has some special handling in
> arch/x86/entry/entry_64.S that does not go through handle_irq_event().
> 
> I would assume that this is different when you boot with the "noapictimer"
> option and use the hpet clockevent instead.
> 
> On other architectures, the timer interrupt is often handled as a regular
> IRQ as well.

Thank you for the hint.

Yet, I would think that timer interrupts can be identified by
add_interrupt_randomness, either by the IRQ number or by the stuck test
that is suggested with the LRNG patch set.
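
(For illustration only, a hypothetical sketch of such a stuck test in
the spirit of the LRNG one - an event is discarded when the first,
second and third derivatives of its time stamp are all zero; the names
here are not taken from the actual patch set:)

	struct stuck_state { u64 last; s64 d1, d2; };

	static bool event_is_stuck(struct stuck_state *s, u64 now)
	{
		s64 d1 = now - s->last;	/* first derivative */
		s64 d2 = d1 - s->d1;	/* second derivative */
		s64 d3 = d2 - s->d2;	/* third derivative */

		s->last = now;
		s->d1 = d1;
		s->d2 = d2;

		/* stuck if the time stamps show no variation at any order */
		return !d1 || !d2 || !d3;
	}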

Ciao
Stephan


Re: [RFC PATCH v12 3/4] Linux Random Number Generator

2017-07-21 Thread Arnd Bergmann
On Fri, Jul 21, 2017 at 10:57 AM, Stephan Müller  wrote:
> On Friday, 21 July 2017, 05:08:47 CEST, Theodore Ts'o wrote:

>> Um, the timer is the largest number of interrupts on my system.  Compare:
>>
>> CPU0   CPU1   CPU2   CPU3
>>  LOC:6396552603886565586466057102   Local timer interrupts
>>
>> with the number of disk related interrupts:
>>
>>  120:  21492 139284  405131705886   PCI-MSI 376832-edge
>> ahci[:00:17.0]
>
> They seem to be not picked up with the add_interrupt_randomness function.

On x86, the local APIC timer has some special handling in
arch/x86/entry/entry_64.S that does not go through handle_irq_event().

I would assume that this is different when you boot with the "noapictimer"
option and use the hpet clockevent instead.

On other architectures, the timer interrupt is often handled as a regular
IRQ as well.

   Arnd


Re: Poor RNG performance on Ryzen

2017-07-21 Thread Gary R Hook

On 07/21/2017 09:47 AM, Theodore Ts'o wrote:

On Fri, Jul 21, 2017 at 01:39:13PM +0200, Oliver Mangold wrote:

Better, but obviously there is still much room for improvement by reducing
the number of calls to RDRAND.


Hmm, is there some way we can easily tell we are running on Ryzen?  Or
do we believe this is going to be true for all AMD devices?

I guess we could add some kind of "has_crappy_arch_get_random()" call
which could be defined by arch/x86, and change how aggressively we use
arch_get_random_*() depending on whether has_crappy_arch_get_random()
returns true or not


Just a quick note to say that we are aware of the issue, and that it is
being addressed.


Re: Poor RNG performance on Ryzen

2017-07-21 Thread Oliver Mangold

On 21.07.2017 16:47, Theodore Ts'o wrote:

On Fri, Jul 21, 2017 at 01:39:13PM +0200, Oliver Mangold wrote:

Better, but obviously there is still much room for improvement by reducing
the number of calls to RDRAND.

Hmm, is there some way we can easily tell we are running on Ryzen?  Or
do we believe this is going to be true for all AMD devices?
I would like to note that my first measurements on Broadwell suggest that
the current frequency of RDRAND calls seems to slow things down on Intel
as well (but not as much as on Ryzen).


Re: Poor RNG performance on Ryzen

2017-07-21 Thread Theodore Ts'o
On Fri, Jul 21, 2017 at 01:39:13PM +0200, Oliver Mangold wrote:
> Better, but obviously there is still much room for improvement by reducing
> the number of calls to RDRAND.

Hmm, is there some way we can easily tell we are running on Ryzen?  Or
do we believe this is going to be true for all AMD devices?

I guess we could add some kind of "has_crappy_arch_get_random()" call
which could be defined by arch/x86, and change how aggressively we use
arch_get_random_*() depending on whether has_crappy_arch_get_random()
returns true or not
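
A purely hypothetical sketch of such a hook (the name follows the
suggestion above; the Zen-family check is an assumption, not an
established interface):

	/* arch/x86 could override this; generic code defaults to false */
	static inline bool has_crappy_arch_get_random(void)
	{
	#ifdef CONFIG_X86
		/* e.g. treat AMD family 17h (Zen) RDRAND as slow per call */
		return boot_cpu_data.x86_vendor == X86_VENDOR_AMD &&
		       boot_cpu_data.x86 == 0x17;
	#else
		return false;
	#endif
	}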

- Ted


Re: [PATCH] crypto: scompress - eliminate percpu scratch buffers

2017-07-21 Thread Ard Biesheuvel
On 21 July 2017 at 14:44, Herbert Xu  wrote:
> On Fri, Jul 21, 2017 at 02:42:20PM +0100, Ard Biesheuvel wrote:
>>
>> >> - Would you mind a patch that makes the code only use the per-CPU
>> >> buffers if we are running atomically to begin with?
>> >
>> > That would mean dropping the first packet so no.
>> >
>>
>> I think you misunderstood me: the idea is not to allocate the per-CPU
>> buffers at that time, but avoiding them and use the heap directly (as
>> my patch does now). This way, we don't unnecessarily disable
>> preemption for calls executing in process context.
>
> When we actually have a user other than IPcomp then we could
> try this optimisation.  Otherwise I think it'd be a waste of
> time.
>

Fair enough.


Re: [PATCH] crypto: scompress - eliminate percpu scratch buffers

2017-07-21 Thread Herbert Xu
On Fri, Jul 21, 2017 at 02:42:20PM +0100, Ard Biesheuvel wrote:
> 
> >> - Would you mind a patch that makes the code only use the per-CPU
> >> buffers if we are running atomically to begin with?
> >
> > That would mean dropping the first packet so no.
> >
> 
> I think you misunderstood me: the idea is not to allocate the per-CPU
> buffers at that time, but avoiding them and use the heap directly (as
> my patch does now). This way, we don't unnecessarily disable
> preemption for calls executing in process context.

When we actually have a user other than IPcomp then we could
try this optimisation.  Otherwise I think it'd be a waste of
time.

Cheers,
-- 
Email: Herbert Xu 
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


Re: [PATCH] crypto: scompress - eliminate percpu scratch buffers

2017-07-21 Thread Ard Biesheuvel
On 21 July 2017 at 14:31, Herbert Xu  wrote:
> On Fri, Jul 21, 2017 at 02:24:02PM +0100, Ard Biesheuvel wrote:
>>
>> OK, but that doesn't really answer any of my questions:
>> - Should we enforce that CRYPTO_ACOMP_ALLOC_OUTPUT is mutually
>> exclusive with CRYPTO_TFM_REQ_MAY_SLEEP, or should
>> crypto_scomp_sg_alloc() always use GFP_ATOMIC? We need one or the
>> other, because the current situation is buggy.
>
> Yeah I'm not surprised that it is broken, because there are currently
> no users of the API.  We should fix the acomp API.
>
>> - Could we move the scratch buffers to the request structure instead?
>> Or may that be allocated from softirq context as well?
>
> Yes the IPcomp requests would be allocated from softirq context.
>

__acomp_request_alloc() currently uses GFP_KERNEL explicitly, so that
needs to be fixed as well then.

>> - Would you mind a patch that defers the allocation of the per-CPU
>> buffers to the first invocation of crypto_scomp_init_tfm()?
>
> Sure.  That's what we currently do for IPcomp too.
>

OK.

>> - Would you mind a patch that makes the code only use the per-CPU
>> buffers if we are running atomically to begin with?
>
> That would mean dropping the first packet so no.
>

I think you misunderstood me: the idea is not to allocate the per-CPU
buffers at that time, but avoiding them and use the heap directly (as
my patch does now). This way, we don't unnecessarily disable
preemption for calls executing in process context.

I.e.:

if (!(req->base.flags & CRYPTO_TFM_REQ_MAY_SLEEP)) {
  cpu = get_cpu();
  scratch_src = *per_cpu_ptr(scomp_src_scratches, cpu);
  ...
} else {
  scratch_src = (u8 *)__get_free_pages(gfp, alloc_order);
}

and make the put_cpu() at the end conditional in the same way.
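
The matching release at the end would then be (same sketch, same
assumptions about scratch_src and alloc_order):

if (req->base.flags & CRYPTO_TFM_REQ_MAY_SLEEP)
	free_pages((unsigned long)scratch_src, alloc_order);
else
	put_cpu();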

In any case, my main gripe with this code is the unconditional
allocation of the scratch buffers (taking up memory and vmalloc space)
so if we can defer that to crypto_scomp_init_tfm() instead, I already
have most of what I need.
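
Roughly like this, as a sketch only (scomp_lock and scomp_scratch_users
are new; crypto_scomp_alloc_all_scratches() and
crypto_scomp_free_all_scratches() stand in for the existing allocation
helpers):

static DEFINE_MUTEX(scomp_lock);
static int scomp_scratch_users;

static int crypto_scomp_init_tfm(struct crypto_tfm *tfm)
{
	int ret = 0;

	mutex_lock(&scomp_lock);
	if (!scomp_scratch_users++)
		ret = crypto_scomp_alloc_all_scratches();
	if (ret)
		scomp_scratch_users--;
	mutex_unlock(&scomp_lock);

	return ret;
}

static void crypto_scomp_exit_tfm(struct crypto_tfm *tfm)
{
	mutex_lock(&scomp_lock);
	if (!--scomp_scratch_users)
		crypto_scomp_free_all_scratches();
	mutex_unlock(&scomp_lock);
}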

-- 
Ard.


Re: [PATCH] crypto: scompress - eliminate percpu scratch buffers

2017-07-21 Thread Herbert Xu
On Fri, Jul 21, 2017 at 02:24:02PM +0100, Ard Biesheuvel wrote:
>
> OK, but that doesn't really answer any of my questions:
> - Should we enforce that CRYPTO_ACOMP_ALLOC_OUTPUT is mutually
> exclusive with CRYPTO_TFM_REQ_MAY_SLEEP, or should
> crypto_scomp_sg_alloc() always use GFP_ATOMIC? We need one or the
> other, because the current situation is buggy.

Yeah I'm not surprised that it is broken, because there are currently
no users of the API.  We should fix the acomp API.

> - Could we move the scratch buffers to the request structure instead?
> Or may that be allocated from softirq context as well?

Yes the IPcomp requests would be allocated from softirq context.

> - Would you mind a patch that defers the allocation of the per-CPU
> buffers to the first invocation of crypto_scomp_init_tfm()?

Sure.  That's what we currently do for IPcomp too.

> - Would you mind a patch that makes the code only use the per-CPU
> buffers if we are running atomically to begin with?

That would mean dropping the first packet so no.

Cheers,
-- 
Email: Herbert Xu 
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


Re: [PATCH] crypto: scompress - eliminate percpu scratch buffers

2017-07-21 Thread Ard Biesheuvel
On 21 July 2017 at 14:24, Ard Biesheuvel  wrote:
> On 21 July 2017 at 14:11, Herbert Xu  wrote:
>> On Fri, Jul 21, 2017 at 02:09:39PM +0100, Ard Biesheuvel wrote:
>>>
>>> Right. And is req->dst guaranteed to be assigned in that case? Because
>>> crypto_scomp_sg_alloc() happily allocates pages and kmalloc()s the
>>> scatterlist if req->dst == NULL.
>>>
>>> Is there any way we could make these scratch buffers part of the
>>> request structure instead? Or at least defer allocating them until the
>>> first call to crypto_scomp_init_tfm()? And on top of that, we should
>>> probably only use the per-CPU scratch buffers if
>>> CRYPTO_TFM_REQ_MAY_SLEEP is cleared, because in that case, we are not
>>> pre-emptible to begin with, and the concern does not apply.
>>
>> Well for now scomp is new code so nothing actually uses it.  But
>> the idea is to convert IPcomp over to acomp and scomp will fill in
>> as the compatibility glue.
>>
>
> Yes, I guessed as much.
>
>> The medium/long-term plan is to convert IPcomp over to a partial
>> decompression model so that we can allocate pages instead of a
>> contiguous chunk as we do now.
>>
>
> OK, but that doesn't really answer any of my questions:
> - Should we enforce that CRYPTO_ACOMP_ALLOC_OUTPUT is mutually
> exclusive with CRYPTO_TFM_REQ_MAY_SLEEP, or should
> crypto_scomp_sg_alloc() always use GFP_ATOMIC? We need one or the
> other, because the current situation is buggy.

I mean CRYPTO_ACOMP_ALLOC_OUTPUT should not be set if
CRYPTO_TFM_REQ_MAY_SLEEP is cleared.
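
And crypto_scomp_sg_alloc() could then derive its gfp mask from the
request flags rather than hardcoding one; a sketch:

	gfp_t gfp = (req->base.flags & CRYPTO_TFM_REQ_MAY_SLEEP) ?
		    GFP_KERNEL : GFP_ATOMIC;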

> - Could we move the scratch buffers to the request structure instead?
> Or may that be allocated from softirq context as well?
> - Would you mind a patch that defers the allocation of the per-CPU
> buffers to the first invocation of crypto_scomp_init_tfm()?
> - Would you mind a patch that makes the code only use the per-CPU
> buffers if we are running atomically to begin with?
>
> Thanks,
> Ard.


Re: [PATCH] crypto: scompress - eliminate percpu scratch buffers

2017-07-21 Thread Ard Biesheuvel
On 21 July 2017 at 14:11, Herbert Xu  wrote:
> On Fri, Jul 21, 2017 at 02:09:39PM +0100, Ard Biesheuvel wrote:
>>
>> Right. And is req->dst guaranteed to be assigned in that case? Because
>> crypto_scomp_sg_alloc() happily allocates pages and kmalloc()s the
>> scatterlist if req->dst == NULL.
>>
>> Is there any way we could make these scratch buffers part of the
>> request structure instead? Or at least defer allocating them until the
>> first call to crypto_scomp_init_tfm()? And on top of that, we should
>> probably only use the per-CPU scratch buffers if
>> CRYPTO_TFM_REQ_MAY_SLEEP is cleared, because in that case, we are not
>> pre-emptible to begin with, and the concern does not apply.
>
> Well for now scomp is new code so nothing actually uses it.  But
> the idea is to convert IPcomp over to acomp and scomp will fill in
> as the compatibility glue.
>

Yes, I guessed as much.

> The medium/long-term plan is to convert IPcomp over to a partial
> decompression model so that we can allocate pages instead of a
> contiguous chunk as we do now.
>

OK, but that doesn't really answer any of my questions:
- Should we enforce that CRYPTO_ACOMP_ALLOC_OUTPUT is mutually
exclusive with CRYPTO_TFM_REQ_MAY_SLEEP, or should
crypto_scomp_sg_alloc() always use GFP_ATOMIC? We need one or the
other, because the current situation is buggy.
- Could we move the scratch buffers to the request structure instead?
Or may that be allocated from softirq context as well?
- Would you mind a patch that defers the allocation of the per-CPU
buffers to the first invocation of crypto_scomp_init_tfm()?
- Would you mind a patch that makes the code only use the per-CPU
buffers if we are running atomically to begin with?

Thanks,
Ard.


Re: [PATCH] crypto: scompress - eliminate percpu scratch buffers

2017-07-21 Thread Herbert Xu
On Fri, Jul 21, 2017 at 02:09:39PM +0100, Ard Biesheuvel wrote:
>
> Right. And is req->dst guaranteed to be assigned in that case? Because
> crypto_scomp_sg_alloc() happily allocates pages and kmalloc()s the
> scatterlist if req->dst == NULL.
> 
> Is there any way we could make these scratch buffers part of the
> request structure instead? Or at least defer allocating them until the
> first call to crypto_scomp_init_tfm()? And on top of that, we should
> probably only use the per-CPU scratch buffers if
> CRYPTO_TFM_REQ_MAY_SLEEP is cleared, because in that case, we are not
> pre-emptible to begin with, and the concern does not apply.

Well for now scomp is new code so nothing actually uses it.  But
the idea is to convert IPcomp over to acomp and scomp will fill in
as the compatibility glue.

The medium/long-term plan is to convert IPcomp over to a partial
decompression model so that we can allocate pages instead of a
contiguous chunk as we do now.

Cheers,
-- 
Email: Herbert Xu 
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


Re: [PATCH] crypto: scompress - eliminate percpu scratch buffers

2017-07-21 Thread Ard Biesheuvel
On 21 July 2017 at 13:42, Herbert Xu  wrote:
> On Thu, Jul 20, 2017 at 12:40:00PM +0100, Ard Biesheuvel wrote:
>> The scompress code unconditionally allocates 2 per-CPU scratch buffers
>> of 128 KB each, in order to avoid allocation overhead in the async
>> wrapper that encapsulates the synchronous compression algorithm, since
>> it may execute in atomic context.
>
> The whole point of pre-allocation is that we cannot allocate 128K
> (or 64K as it was before scomp) at run-time, and in particular
> for IPsec, which runs in the softirq path.  Am I missing something?
>

Right. And is req->dst guaranteed to be assigned in that case? Because
crypto_scomp_sg_alloc() happily allocates pages and kmalloc()s the
scatterlist if req->dst == NULL.

Is there any way we could make these scratch buffers part of the
request structure instead? Or at least defer allocating them until the
first call to crypto_scomp_init_tfm()? And on top of that, we should
probably only use the per-CPU scratch buffers if
CRYPTO_TFM_REQ_MAY_SLEEP is cleared, because in that case, we are not
pre-emptible to begin with, and the concern does not apply.


Re: [PATCH] crypto: scompress - eliminate percpu scratch buffers

2017-07-21 Thread Herbert Xu
On Thu, Jul 20, 2017 at 12:40:00PM +0100, Ard Biesheuvel wrote:
> The scompress code unconditionally allocates 2 per-CPU scratch buffers
> of 128 KB each, in order to avoid allocation overhead in the async
> wrapper that encapsulates the synchronous compression algorithm, since
> it may execute in atomic context.

The whole point of pre-allocation is that we cannot allocate 128K
(or 64K as it was before scomp) at run-time, and in particular
for IPsec, which runs in the softirq path.  Am I missing something?

Cheers,
-- 
Email: Herbert Xu 
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


Re: Poor RNG performance on Ryzen

2017-07-21 Thread Jeffrey Walton
On Fri, Jul 21, 2017 at 3:12 AM, Oliver Mangold  wrote:
> Hi,
>
> I was wondering why reading from /dev/urandom is much slower on Ryzen than
> on Intel, and did some analysis. It turns out that the RDRAND
> instruction, which takes much longer on AMD, is at fault.
>
> If I read this correctly:
>
> --- drivers/char/random.c ---
> 862 spin_lock_irqsave(&crng->lock, flags);
> 863 if (arch_get_random_long(&v))
> 864 crng->state[14] ^= v;
> 865 chacha20_block(&crng->state[0], out);
>
> one call to RDRAND (with 64-bit operand) is issued per computation of a
> chacha20 block. According to the measurements I did, it seems on Ryzen this
> dominates the time usage:

AMD's implementations of RDRAND and RDSEED are simply slow, dating
back to Bulldozer. While Intel can produce random numbers at about 10
cycles per byte, AMD regularly takes thousands of cycles for one byte;
Bulldozer was measured at 4100 cycles per byte.

It also appears AMD uses the same circuit for random numbers for both
RDRAND and RDSEED. Both are equally fast (or equally slow).

Here are some benchmarks if you are interested:
https://www.cryptopp.com/wiki/RDRAND#Performance .

Jeff


Re: Poor RNG performance on Ryzen

2017-07-21 Thread Oliver Mangold

On 21.07.2017 11:26, Jan Glauber wrote:

> Nice catch. How much does the performance improve on Ryzen when you
> use arch_get_random_int()?


Okay, now I have some results for you:

On Ryzen 1800X (using arch_get_random_int()):

---
# dd if=/dev/urandom of=/dev/null bs=1M status=progress
8751415296 bytes (8,8 GB, 8,2 GiB) copied, 71,0079 s, 123 MB/s
# perf top
   57,37%  [kernel]   [k] _extract_crng
   26,20%  [kernel]   [k] chacha20_block
---

Better, but obviously there is still much room for improvement by 
reducing the number of calls to RDRAND.


On Ryzen 1800X (with nordrand kernel option):

---
# dd if=/dev/urandom of=/dev/null bs=1M status=progress
22643998720 bytes (23 GB, 21 GiB) copied, 67,0025 s, 338 MB/s
---

Here is the patch I used:

--- drivers/char/random.c.orig  2017-07-03 01:07:02.0 +0200
+++ drivers/char/random.c   2017-07-21 11:57:40.541677118 +0200
@@ -859,13 +859,14 @@
 static void _extract_crng(struct crng_state *crng,
                           __u8 out[CHACHA20_BLOCK_SIZE])
 {
-       unsigned long v, flags;
+       unsigned int v;
+       unsigned long flags;
 
        if (crng_init > 1 &&
            time_after(jiffies, crng->init_time + CRNG_RESEED_INTERVAL))
                crng_reseed(crng, crng == &primary_crng ? &input_pool : NULL);
 
        spin_lock_irqsave(&crng->lock, flags);
-       if (arch_get_random_long(&v))
+       if (arch_get_random_int(&v))
                crng->state[14] ^= v;
        chacha20_block(&crng->state[0], out);
        if (crng->state[12] == 0)


Re: [RFC PATCH v12 3/4] Linux Random Number Generator

2017-07-21 Thread Jeffrey Walton
Hi Ted,

Snipping one comment:

> Practically no one uses /dev/random.  It's essentially a deprecated
> interface; the primary interfaces that have been recommended for well
> over a decade are /dev/urandom and, now, getrandom(2).  We only need
> 384 bits of randomness every 5 minutes to reseed the CRNG, and that's
> plenty even given the very conservative entropy estimation currently
> being used.

The statement about /dev/random being deprecated is not well
documented. A quick search is not turning up the expected results.

The RANDOM(4) man page provides competing (conflicting?) information:

   When read, the /dev/random device will return random bytes only  within
   the estimated number of bits of noise in the entropy pool.  /dev/random
   should be suitable for uses that need very high quality randomness such
   as  one-time  pad  or  key generation...

We regularly test the /dev/random generator by reading 10K bytes in
non-blocking mode, discarding them, and then asking for 16 bytes in
blocking mode. We also compress the output as a poor man's fitness
test. We are interested in how robust the generator is, how well it
performs under stress, and how well it recovers.

After draining it often takes minutes for the generator to produce 16
bytes. On Debian based systems the experiment usually fails unless
rng-tools is installed. The failures occur even on systems with
hardware-based generators like rdrand and rdseed. I've witnessed the
failure on i686, x86_64, ARM and MIPS.

We recently suggested the GCC compile farm install rng-tools because
we were witnessing the problem on its machines. Cf.,
https://lists.tetaneutral.net/pipermail/cfarm-users/2017-July/30.html
. I've even seen vendors recommend wiring /dev/random to /dev/urandom
because of the entropy depletion problems. That's a big No-No
according to https://lwn.net/Articles/525459/.

The failures have always left me with an uncomfortable feeling because
there are so many damn programs out there that do their own thing.
Distro don't perform a SecArch review before packaging, so problems
lie in wait.

If the generator is truly deprecated, then it may be prudent to remove
it completely or remove it from userland. Otherwise, improve its
robustness. At minimum, update the documentation.

Jeff

On Thu, Jul 20, 2017 at 11:08 PM, Theodore Ts'o  wrote:
> On Thu, Jul 20, 2017 at 09:00:02PM +0200, Stephan Müller wrote:
>> I concur with your rationale that de facto the correlation effect is
>> diminished and eliminated by the fast_pool and the minimal entropy
>> estimation of interrupts.
>>
>> But it does not address my concern. Maybe I was not clear, please allow me to
>> explain it again.
>>
>> We have lots of entropy in the system which is discarded by the
>> aforementioned approach (if a high-res timer is present -- without it
>> all bets are off anyway and this should be covered in a separate
>> discussion). At boot time, this issue is fixed by injecting 256
>> interrupts into the CRNG and considering it seeded.
>>
>> But at runtime, we still need entropy to reseed the CRNG and to supply
>> /dev/random. The accounting of entropy at runtime is much too conservative...
>
> Practically no one uses /dev/random.  It's essentially a deprecated
> interface; the primary interfaces that have been recommended for well
> over a decade are /dev/urandom and, now, getrandom(2).  We only need
> 384 bits of randomness every 5 minutes to reseed the CRNG, and that's
> plenty even given the very conservative entropy estimation currently
> being used.
>
> This was deliberate.  I care a lot more that we get the initial
> boot-time CRNG initialization right on ARM32 and MIPS embedded
> devices, far, far, more than I care about making plenty of
> information-theoretic entropy available at /dev/random on an x86
> system.  Further, I haven't seen an argument for the use case where
> this would be valuable.
>
> If you don't think they count because ARM32 and MIPS don't have a
> high-res timer, then you have very different priorities than I do.  I
> will point out that numerically there are a huge number of these devices
> --- and very, very few users of /dev/random.
>
>> You mentioned that you are super conservative for interrupts due to timer
>> interrupts. In all measurements on the different systems I conducted, I have
>> not seen that the timer triggers an interrupt picked up by
>> add_interrupt_randomness.
>
> Um, the timer is the largest source of interrupts on my system.  Compare:
>
>                 CPU0       CPU1       CPU2       CPU3
>  LOC:        6396552    6038865    6558646    6057102   Local timer interrupts
>
> with the number of disk related interrupts:
>
>  120:       21492     139284      40513    1705886   PCI-MSI 376832-edge   ahci[0000:00:17.0]
>
> ... and add_interrupt_randomness() gets called for **every**
> interrupt.  On a mostly idle machine (I was in meetings most of
> today) it's not surprising that timer interrupts dominate.

Re: Poor RNG performance on Ryzen

2017-07-21 Thread Jan Glauber
On Fri, Jul 21, 2017 at 09:12:01AM +0200, Oliver Mangold wrote:
> Hi,
> 
> I was wondering why reading from /dev/urandom is much slower on
> Ryzen than on Intel, and did some analysis. It turns out that the
> RDRAND instruction, which takes much longer on AMD, is at fault.
> 
> If I read this correctly:
> 
> --- drivers/char/random.c ---
> 862 spin_lock_irqsave(&crng->lock, flags);
> 863 if (arch_get_random_long(&v))
> 864 crng->state[14] ^= v;
> 865 chacha20_block(&crng->state[0], out);
> 
> one call to RDRAND (with 64-bit operand) is issued per computation
> of a chacha20 block. According to the measurements I did, it seems
> on Ryzen this dominates the time usage:
> 
> On Broadwell E5-2650 v4:
> 
> ---
> # dd if=/dev/urandom of=/dev/null bs=1M status=progress
> 28827451392 bytes (29 GB) copied, 143.290349 s, 201 MB/s
> # perf top
>   49.88%  [kernel]   [k] chacha20_block
>   31.22%  [kernel]   [k] _extract_crng
> ---
> 
> On Ryzen 1800X:
> 
> ---
> # dd if=/dev/urandom of=/dev/null bs=1M status=progress
> 3169845248 bytes (3,2 GB, 3,0 GiB) copied, 42,0106 s, 75,5 MB/s
> # perf top
>   76,40%  [kernel]   [k] _extract_crng
>   13,05%  [kernel]   [k] chacha20_block
> ---
> 
> An easy improvement might be to replace the usage of
> arch_get_random_long() by arch_get_random_int(), as the state array
> contains just 32-bit elements, and (unlike on Intel) on Ryzen
> 32-bit RDRAND is supposed to be faster by roughly a factor of 2.

Nice catch. How much does the performance improve on Ryzen when you
use arch_get_random_int()?

--Jan

> Best regards,
> 
> OM


Re: [RFC PATCH v12 3/4] Linux Random Number Generator

2017-07-21 Thread Stephan Müller
On Friday, 21 July 2017, 05:08:47 CEST, Theodore Ts'o wrote:

Hi Theodore,

> On Thu, Jul 20, 2017 at 09:00:02PM +0200, Stephan Müller wrote:
> > I concur with your rationale that de facto the correlation effect is
> > diminished and eliminated by the fast_pool and the minimal entropy
> > estimation of interrupts.
> > 
> > But it does not address my concern. Maybe I was not clear, please allow me
> > to explain it again.
> > 
> > We have lots of entropy in the system which is discarded by the
> > aforementioned approach (if a high-res timer is present -- without it all
> > bets are off anyway and this should be covered in a separate discussion).
> > At boot time, this issue is fixed by injecting 256 interrupts into the
> > CRNG and considering it seeded.
> > 
> > But at runtime, we still need entropy to reseed the CRNG and to
> > supply /dev/random. The accounting of entropy at runtime is much too
> > conservative...
> Practically no one uses /dev/random.  It's essentially a deprecated
> interface; the primary interfaces that have been recommended for well
> over a decade are /dev/urandom and, now, getrandom(2).  We only need
> 384 bits of randomness every 5 minutes to reseed the CRNG, and that's
> plenty even given the very conservative entropy estimation currently
> being used.

On a headless system with SSDs, this is not enough, based on measurements
where entropy_avail is always low ...
> 
> This was deliberate.  I care a lot more that we get the initial
> boot-time CRNG initialization right on ARM32 and MIPS embedded
> devices, far, far, more than I care about making plenty of
> information-theoretic entropy available at /dev/random on an x86
> system.  Further, I haven't seen an argument for the use case where
> this would be valuable.

My concern covers *both* /dev/random and /dev/urandom.
> 
> If you don't think they count because ARM32 and MIPS don't have a
> high-res timer, then you have very different priorities than I do.  I
> will point out that numerically there are a huge number of these devices
> --- and very, very few users of /dev/random.

With only jiffies, you will not get suitable entropy from interrupts or HID or 
block devices, as these actions can be monitored from user space with a 
suitable degree of precision.

The only real entropy provider would be HID, as the entropy may come from
keystrokes. But that is observable and should not count as entropy.
> 
> > You mentioned that you are super conservative for interrupts due to timer
> > interrupts. In all measurements on the different systems I conducted, I
> > have not seen that the timer triggers an interrupt picked up by
> > add_interrupt_randomness.
> 
> Um, the timer is the largest source of interrupts on my system.  Compare:
> 
>                 CPU0       CPU1       CPU2       CPU3
>  LOC:        6396552    6038865    6558646    6057102   Local timer interrupts
> 
> with the number of disk related interrupts:
> 
>  120:       21492     139284      40513    1705886   PCI-MSI 376832-edge   ahci[0000:00:17.0]

They do not seem to be picked up by the add_interrupt_randomness() function.

Execute the following SystemTap script:

global NUMSAMPLES = 1;

global num_events = 0;

probe kernel.function("add_interrupt_randomness")
{
printf("%d\n", $irq);
num_events++;

if (num_events > NUMSAMPLES)
exit();
}
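
(Saved as, say, irq.stp -- the file name is arbitrary -- the script can
be run with "stap irq.stp" while the system is under load.)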

The timer interrupt does not show up here.

> 
> ... and add_interrupt_randomness() gets called for **every**
> interrupt.  On a mostly idle machine (I was in meetings most of
> today) it's not surprising that timer interrupts dominate.  That
> doesn't matter for me as much because I don't really care about
> /dev/random performance.  What is **far** more important is that the
> entropy estimations behave correctly, across all of Linux's
> architectures, while the kernel is going through startup, before CRNG
> is declared initialized.
> 
> > As we have no formal model about entropy to begin with, we can only assume
> > and hope we underestimate entropy with the entropy heuristic.
> 
> Yes, and that's why I use an ultra-conservative estimate.  If we start
> using a more aggressive heuristic, we open ourselves up to potentially
> very severe security bugs --- and for what?  What's the cost benefit
> ratio here which makes this a worthwhile thing to risk?

The benefit is that if there is an entropy hog on the system, /dev/random
and /dev/urandom recover from it faster. Otherwise they do not get reseeded
at all.
> 
> > Finally, I still think it is helpful to allow (not mandate) involving
> > the kernel crypto API in the DRNG maintenance (i.e. the supplier for
> > /dev/random and /dev/urandom). The reason is that more and more DRNG
> > implementations in hardware are popping up. Why not allow them to be
> > used? I.e. random.c would only contain the logic to manage entropy but
> > would use the DRNG requested by a user.
> We *do* allow them to be used.  And we support a large number 

Poor RNG performance on Ryzen

2017-07-21 Thread Oliver Mangold

Hi,

I was wondering why reading from /dev/urandom is much slower on Ryzen 
than on Intel, and did some analysis. It turns out that the RDRAND
instruction, which takes much longer on AMD, is at fault.


If I read this correctly:

--- drivers/char/random.c ---
862 spin_lock_irqsave(&crng->lock, flags);
863 if (arch_get_random_long(&v))
864 crng->state[14] ^= v;
865 chacha20_block(&crng->state[0], out);

one call to RDRAND (with 64-bit operand) is issued per computation of a 
chacha20 block. According to the measurements I did, it seems on Ryzen 
this dominates the time usage:


On Broadwell E5-2650 v4:

---
# dd if=/dev/urandom of=/dev/null bs=1M status=progress
28827451392 bytes (29 GB) copied, 143.290349 s, 201 MB/s
# perf top
  49.88%  [kernel]   [k] chacha20_block
  31.22%  [kernel]   [k] _extract_crng
---

On Ryzen 1800X:

---
# dd if=/dev/urandom of=/dev/null bs=1M status=progress
3169845248 bytes (3,2 GB, 3,0 GiB) copied, 42,0106 s, 75,5 MB/s
# perf top
  76,40%  [kernel]   [k] _extract_crng
  13,05%  [kernel]   [k] chacha20_block
---

An easy improvement might be to replace the usage of 
arch_get_random_long() by arch_get_random_int(), as the state array 
contains just 32-bit elements, and (unlike on Intel) on Ryzen 32-bit
RDRAND is supposed to be faster by roughly a factor of 2.


Best regards,

OM


Re: [PATCH 3/4] crypto: axis: add ARTPEC-6/7 crypto accelerator driver

2017-07-21 Thread Lars Persson



On 07/20/2017 04:51 PM, Stephan Müller wrote:
> On Thursday, 20 July 2017, 15:44:31 CEST, Lars Persson wrote:
>
> Hi Lars,
>
>> +static int
>> +artpec6_crypto_cipher_set_key(struct crypto_skcipher *cipher, const u8 *key,
>> +			      unsigned int keylen)
>> +{
>> +	struct artpec6_cryptotfm_context *ctx =
>> +		crypto_skcipher_ctx(cipher);
>> +
>> +	if (ctx->crypto_type == ARTPEC6_CRYPTO_CIPHER_NULL)
>> +		return 0;
>> +
>> +	switch (keylen) {
>> +	case 16:
>> +	case 24:
>> +	case 32:
>> +		break;
>> +	case 48:
>> +	case 64:
>> +		if (ctx->crypto_type == ARTPEC6_CRYPTO_CIPHER_AES_XTS)
>
> Could you please add xts_check_key for XTS keys?
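>
> Something like this, perhaps (only a sketch, not tested: xts_check_key()
> is the existing helper from <crypto/xts.h>; the surrounding switch, the
> error path, and a declared ret are assumed from the hunk above):
>
>	case 48:
>	case 64:
>		if (ctx->crypto_type != ARTPEC6_CRYPTO_CIPHER_AES_XTS)
>			return -EINVAL;
>		ret = xts_check_key(crypto_skcipher_tfm(cipher), key, keylen);
>		if (ret)
>			return ret;
>		break;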


Sure, I will add this in v2.

Thanks,
 Lars