Re: [PATCH] x86/uaccess: use unrolled string copy for short strings
On Thu, 2017-06-22 at 10:30 -0700, Linus Torvalds wrote:
> So if you want to do this optimization, I'd argue that you should just
> do it inside the copy_user_enhanced_fast_string() function itself, the
> same way we already handle the really small case specially in
> copy_user_generic_string().
>
> And do *not* use the unrolled code, which isn't used for small copies
> anyway - rewrite the "copy_user_generic_unrolled" function in that
> same asm file to have the non-unrolled cases (label "17" and forward)
> accessible, so that you don't bother re-testing the size.

Thank you for the feedback. I'm quite new to the core x86 land; the
'rep stosb' cost popped out while I was working on the networking code.
I'll try to dig into the asm.

Regards,

Paolo
Re: [PATCH] x86/uaccess: use unrolled string copy for short strings
On Wed, Jun 21, 2017 at 4:09 AM, Paolo Abeni wrote:
>
> +	if (len <= 64)
> +		return copy_user_generic_unrolled(to, from, len);
> +
>	/*
>	 * If CPU has ERMS feature, use copy_user_enhanced_fast_string.
>	 * Otherwise, if CPU has rep_good feature, use
>	 * copy_user_generic_string.

NAK. Please do *not* do this.

It puts the check in completely the wrong place for several reasons:

 (a) it puts it in the inlined caller case (which could be ok for
     constant sizes, but not in general)

 (b) it uses the copy_user_generic_unrolled() function that will then
     just test the size *AGAIN* against small cases.

so it's both bigger than necessary, and stupid.

So if you want to do this optimization, I'd argue that you should just
do it inside the copy_user_enhanced_fast_string() function itself, the
same way we already handle the really small case specially in
copy_user_generic_string().

And do *not* use the unrolled code, which isn't used for small copies
anyway - rewrite the "copy_user_generic_unrolled" function in that
same asm file to have the non-unrolled cases (label "17" and forward)
accessible, so that you don't bother re-testing the size.

                Linus
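Linus's suggested placement - test the length once, inside the out-of-line copy routine rather than in the inlined caller - can be sketched in plain C. This is a simplified stand-in, not the kernel's actual asm: the counters, `copy_short`, and the `memcpy`-based bulk path are illustrative assumptions added here so the dispatch shape is visible.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Counters so a test can observe which path ran (illustration only). */
static int unrolled_calls, fast_string_calls;

/* Stand-in for the unrolled/byte-at-a-time short-copy path. */
static size_t copy_short(void *to, const void *from, size_t len)
{
	size_t i;

	unrolled_calls++;
	for (i = 0; i < len; i++)
		((char *)to)[i] = ((const char *)from)[i];
	return 0;	/* 0 bytes left uncopied */
}

/*
 * Stand-in for copy_user_enhanced_fast_string(): per Linus's suggestion,
 * the short-length cutover lives *inside* the out-of-line routine, so the
 * size is tested exactly once and the inlined caller stays small.  The
 * '64' threshold mirrors the value from Paolo's measurements.
 */
static size_t copy_fast_string(void *to, const void *from, size_t len)
{
	if (len <= 64)
		return copy_short(to, from, len);
	fast_string_calls++;
	memcpy(to, from, len);	/* models the 'rep movsb' bulk path */
	return 0;
}

/* The inlined caller just dispatches; no size check duplicated here. */
static size_t copy_user_generic(void *to, const void *from, size_t len)
{
	return copy_fast_string(to, from, len);
}
```

The point of the shape is that a caller-side check would be duplicated into every inlined call site and then re-tested inside copy_user_generic_unrolled(); here the test happens once.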
Re: [PATCH] x86/uaccess: use unrolled string copy for short strings
On Thu, 2017-06-22 at 10:47 +0200, Ingo Molnar wrote:
> * Paolo Abeni wrote:
>
> > The 'rep' prefix suffers from a relevant "setup cost"; as a result,
> > string copies with unrolled loops are faster than even the optimized
> > string copy using the 'rep' variant, for short strings.
> >
> > This change updates __copy_user_generic() to use the unrolled
> > version for small string lengths. The threshold length for short
> > strings - 64 - has been selected through empirical measurements as
> > the largest value that still ensures a measurable gain.
> >
> > A micro-benchmark of __copy_from_user() with different lengths shows
> > the following:
> >
> > string len  vanilla  patched  delta
> > bytes       ticks    ticks    tick(%)
> >
> >  0           58       26       32(55%)
> >  1           49       29       20(40%)
> >  2           49       31       18(36%)
> >  3           49       32       17(34%)
> >  4           50       34       16(32%)
> >  5           49       35       14(28%)
> >  6           49       36       13(26%)
> >  7           49       38       11(22%)
> >  8           50       31       19(38%)
> >  9           51       33       18(35%)
> > 10           52       36       16(30%)
> > 11           52       37       15(28%)
> > 12           52       38       14(26%)
> > 13           52       40       12(23%)
> > 14           52       41       11(21%)
> > 15           52       42       10(19%)
> > 16           51       34       17(33%)
> > 17           51       35       16(31%)
> > 18           52       37       15(28%)
> > 19           51       38       13(25%)
> > 20           52       39       13(25%)
> > 21           52       40       12(23%)
> > 22           51       42        9(17%)
> > 23           51       46        5(9%)
> > 24           52       35       17(32%)
> > 25           52       37       15(28%)
> > 26           52       38       14(26%)
> > 27           52       39       13(25%)
> > 28           52       40       12(23%)
> > 29           53       42       11(20%)
> > 30           52       43        9(17%)
> > 31           52       44        8(15%)
> > 32           51       36       15(29%)
> > 33           51       38       13(25%)
> > 34           51       39       12(23%)
> > 35           51       41       10(19%)
> > 36           52       41       11(21%)
> > 37           52       43        9(17%)
> > 38           51       44        7(13%)
> > 39           52       46        6(11%)
> > 40           51       37       14(27%)
> > 41           50       38       12(24%)
> > 42           50       39       11(22%)
> > 43           50       40       10(20%)
> > 44           50       42        8(16%)
> > 45           50       43        7(14%)
> > 46           50       43        7(14%)
> > 47           50       45        5(10%)
> > 48           50       37       13(26%)
> > 49           49       38       11(22%)
> > 50           50       40       10(20%)
> > 51           50       42        8(16%)
> > 52           50       42        8(16%)
> > 53           49       46        3(6%)
> > 54           50       46        4(8%)
> > 55           49       48        1(2%)
> > 56           50       39       11(22%)
> > 57           50       40       10(20%)
> > 58           49       42        7(14%)
> > 59           50       42        8(16%)
> > 60           50       46        4(8%)
> > 61           50       47        3(6%)
> > 62           50       48        2(4%)
> > 63           50       48        2(4%)
> > 64           51       38       13(25%)
> >
> > Above 64 bytes the gain fades away.
> >
> > Very similar values are collected for __copy_to_user().
> > UDP receive performance under flood with small packets using
> > recvfrom() increases by ~5%.
>
> What CPU model(s) were used for the performance testing, and was it
> performance tested on several different types of CPUs?
>
> Please add a comment here:
>
> +	if (len <= 64)
> +		return copy_user_generic_unrolled(to, from, len);
> +
>
> ... because it's not obvious at all that this is a performance
> optimization, not a correctness issue. Also explain that '64' is a
> number that we got from performance measurements.
Re: [PATCH] x86/uaccess: use unrolled string copy for short strings
> > diff --git a/arch/x86/include/asm/uaccess_64.h
> > b/arch/x86/include/asm/uaccess_64.h
> > index c5504b9..16a8871 100644
> > --- a/arch/x86/include/asm/uaccess_64.h
> > +++ b/arch/x86/include/asm/uaccess_64.h
> > @@ -28,6 +28,9 @@ copy_user_generic(void *to, const void *from, unsigned len)
> >  {
> >  	unsigned ret;
> >
> > +	if (len <= 64)
> > +		return copy_user_generic_unrolled(to, from, len);
> > +

Given this gets inlined, surely that should be bracketed with an ifdef
against optimizing for space.

Also, you give one set of benchmarks, but how do they look on different
processors - AMD v Intel, Atom v Core, Ryzen v older AMD, etc.?

Alan
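Alan's suggestion could look roughly like the following sketch, assuming the kernel's CONFIG_CC_OPTIMIZE_FOR_SIZE config symbol is the right guard; the helper bodies and the test hook are simplified stand-ins for illustration, not the real uaccess code.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Sketch of guarding the short-copy fast path so it is only compiled
 * when the kernel is not built with -Os.  CONFIG_CC_OPTIMIZE_FOR_SIZE
 * is the real kernel config symbol; everything else here is a
 * simplified stand-in.
 */
static int took_unrolled_path;	/* test hook, not kernel code */

static size_t copy_user_generic_unrolled(void *to, const void *from,
					 size_t len)
{
	took_unrolled_path = 1;
	memcpy(to, from, len);
	return 0;	/* 0 bytes left uncopied */
}

static size_t copy_user_generic(void *to, const void *from, size_t len)
{
#ifndef CONFIG_CC_OPTIMIZE_FOR_SIZE
	/* Short copies: skip the 'rep' setup cost (speed over size). */
	if (len <= 64)
		return copy_user_generic_unrolled(to, from, len);
#endif
	memcpy(to, from, len);	/* stand-in for the 'rep' based path */
	return 0;
}
```

With the symbol defined, the extra compare and the inlined branch disappear from every call site, which is the size concern Alan raises.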
Re: [PATCH] x86/uaccess: use unrolled string copy for short strings
* Paolo Abeni wrote:

> The 'rep' prefix suffers from a relevant "setup cost"; as a result,
> string copies with unrolled loops are faster than even the optimized
> string copy using the 'rep' variant, for short strings.
>
> This change updates __copy_user_generic() to use the unrolled
> version for small string lengths. The threshold length for short
> strings - 64 - has been selected through empirical measurements as
> the largest value that still ensures a measurable gain.
>
> A micro-benchmark of __copy_from_user() with different lengths shows
> the following:
>
> string len  vanilla  patched  delta
> bytes       ticks    ticks    tick(%)
>
>  0           58       26       32(55%)
>  1           49       29       20(40%)
>  2           49       31       18(36%)
>  3           49       32       17(34%)
>  4           50       34       16(32%)
>  5           49       35       14(28%)
>  6           49       36       13(26%)
>  7           49       38       11(22%)
>  8           50       31       19(38%)
>  9           51       33       18(35%)
> 10           52       36       16(30%)
> 11           52       37       15(28%)
> 12           52       38       14(26%)
> 13           52       40       12(23%)
> 14           52       41       11(21%)
> 15           52       42       10(19%)
> 16           51       34       17(33%)
> 17           51       35       16(31%)
> 18           52       37       15(28%)
> 19           51       38       13(25%)
> 20           52       39       13(25%)
> 21           52       40       12(23%)
> 22           51       42        9(17%)
> 23           51       46        5(9%)
> 24           52       35       17(32%)
> 25           52       37       15(28%)
> 26           52       38       14(26%)
> 27           52       39       13(25%)
> 28           52       40       12(23%)
> 29           53       42       11(20%)
> 30           52       43        9(17%)
> 31           52       44        8(15%)
> 32           51       36       15(29%)
> 33           51       38       13(25%)
> 34           51       39       12(23%)
> 35           51       41       10(19%)
> 36           52       41       11(21%)
> 37           52       43        9(17%)
> 38           51       44        7(13%)
> 39           52       46        6(11%)
> 40           51       37       14(27%)
> 41           50       38       12(24%)
> 42           50       39       11(22%)
> 43           50       40       10(20%)
> 44           50       42        8(16%)
> 45           50       43        7(14%)
> 46           50       43        7(14%)
> 47           50       45        5(10%)
> 48           50       37       13(26%)
> 49           49       38       11(22%)
> 50           50       40       10(20%)
> 51           50       42        8(16%)
> 52           50       42        8(16%)
> 53           49       46        3(6%)
> 54           50       46        4(8%)
> 55           49       48        1(2%)
> 56           50       39       11(22%)
> 57           50       40       10(20%)
> 58           49       42        7(14%)
> 59           50       42        8(16%)
> 60           50       46        4(8%)
> 61           50       47        3(6%)
> 62           50       48        2(4%)
> 63           50       48        2(4%)
> 64           51       38       13(25%)
>
> Above 64 bytes the gain fades away.
>
> Very similar values are collected for __copy_to_user().
> UDP receive performance under flood with small packets using
> recvfrom() increases by ~5%.

What CPU model(s) were used for the performance testing, and was it
performance tested on several different types of CPUs?

Please add a comment here:

+	if (len <= 64)
+		return copy_user_generic_unrolled(to, from, len);
+

... because it's not obvious at all that this is a performance
optimization, not a correctness issue. Also explain that '64' is a
number that we got from performance measurements.

But in general I like the change - as long as it was measured on
reasonably modern x86 CPUs.
Re: [PATCH] x86/uaccess: use unrolled string copy for short strings
On Wed, Jun 21, 2017 at 4:09 AM, Paolo Abeni wrote:
> The 'rep' prefix suffers from a relevant "setup cost"; as a result,
> string copies with unrolled loops are faster than even the optimized
> string copy using the 'rep' variant, for short strings.
>
> This change updates __copy_user_generic() to use the unrolled
> version for small string lengths. The threshold length for short
> strings - 64 - has been selected through empirical measurements as
> the largest value that still ensures a measurable gain.
>
> A micro-benchmark of __copy_from_user() with different lengths shows
> the following:
>
> string len  vanilla  patched  delta
> bytes       ticks    ticks    tick(%)
>
>  0           58       26       32(55%)
>  1           49       29       20(40%)
>  2           49       31       18(36%)
>  3           49       32       17(34%)
>  4           50       34       16(32%)
>  5           49       35       14(28%)
>  6           49       36       13(26%)
>  7           49       38       11(22%)
>  8           50       31       19(38%)
>  9           51       33       18(35%)
> 10           52       36       16(30%)
> 11           52       37       15(28%)
> 12           52       38       14(26%)
> 13           52       40       12(23%)
> 14           52       41       11(21%)
> 15           52       42       10(19%)
> 16           51       34       17(33%)
> 17           51       35       16(31%)
> 18           52       37       15(28%)
> 19           51       38       13(25%)
> 20           52       39       13(25%)
> 21           52       40       12(23%)
> 22           51       42        9(17%)
> 23           51       46        5(9%)
> 24           52       35       17(32%)
> 25           52       37       15(28%)
> 26           52       38       14(26%)
> 27           52       39       13(25%)
> 28           52       40       12(23%)
> 29           53       42       11(20%)
> 30           52       43        9(17%)
> 31           52       44        8(15%)
> 32           51       36       15(29%)
> 33           51       38       13(25%)
> 34           51       39       12(23%)
> 35           51       41       10(19%)
> 36           52       41       11(21%)
> 37           52       43        9(17%)
> 38           51       44        7(13%)
> 39           52       46        6(11%)
> 40           51       37       14(27%)
> 41           50       38       12(24%)
> 42           50       39       11(22%)
> 43           50       40       10(20%)
> 44           50       42        8(16%)
> 45           50       43        7(14%)
> 46           50       43        7(14%)
> 47           50       45        5(10%)
> 48           50       37       13(26%)
> 49           49       38       11(22%)
> 50           50       40       10(20%)
> 51           50       42        8(16%)
> 52           50       42        8(16%)
> 53           49       46        3(6%)
> 54           50       46        4(8%)
> 55           49       48        1(2%)
> 56           50       39       11(22%)
> 57           50       40       10(20%)
> 58           49       42        7(14%)
> 59           50       42        8(16%)
> 60           50       46        4(8%)
> 61           50       47        3(6%)
> 62           50       48        2(4%)
> 63           50       48        2(4%)
> 64           51       38       13(25%)
>
> Above 64 bytes the gain fades away.
>
> Very similar values are collected for __copy_to_user().
> UDP receive performance under flood with small packets using
> recvfrom() increases by ~5%.
>
> Signed-off-by: Paolo Abeni

Since there are no regressions here, this seems sensible to me. :)

Reviewed-by: Kees Cook

-Kees

> ---
>  arch/x86/include/asm/uaccess_64.h | 3 +++
>  1 file changed, 3 insertions(+)
>
> diff --git a/arch/x86/include/asm/uaccess_64.h
> b/arch/x86/include/asm/uaccess_64.h
> index c5504b9..16a8871 100644
> --- a/arch/x86/include/asm/uaccess_64.h
> +++ b/arch/x86/include/asm/uaccess_64.h
> @@ -28,6 +28,9 @@ copy_user_generic(void *to, const void *from, unsigned len)
>  {
>  	unsigned ret;
>
> +	if (len <= 64)
> +		return copy_user_generic_unrolled(to, from, len);
> +
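For reference, the check under review - with the kind of explanatory comment Ingo asked for folded in - can be modeled as a self-contained C stand-in. The helper bodies below are simplified assumptions for illustration; the real function lives in arch/x86/include/asm/uaccess_64.h and dispatches to asm routines.

```c
#include <assert.h>
#include <string.h>

/* Test hook so the dispatch can be observed; not kernel code. */
static int short_path_hits;

static unsigned copy_user_generic_unrolled(void *to, const void *from,
					   unsigned len)
{
	short_path_hits++;
	memcpy(to, from, len);
	return 0;		/* bytes left uncopied */
}

static unsigned copy_user_generic(void *to, const void *from, unsigned len)
{
	/*
	 * Performance optimization, not a correctness issue: short
	 * copies dodge the 'rep' setup cost.  The 64-byte threshold
	 * comes from the measurements in the commit message.
	 */
	if (len <= 64)
		return copy_user_generic_unrolled(to, from, len);

	memcpy(to, from, len);	/* models the ERMS/rep_good paths */
	return 0;
}
```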