On Tue, Apr 28, 2015 at 05:38:52PM -0700, Feng Kan wrote:
> Using the glibc cortex string work work authored by Linaro as base to
> create new copy to/from user kernel routine.
> 
> Iperf performance increase:
>               -l (size)               1 core result
> Optimized     64B                     44-51Mb/s
>               1500B                   4.9Gb/s
>               30000B                  16.2Gb/s
> Original      64B                     34-50.7Mb/s
>               1500B                   4.7Gb/s
>               30000B                  14.5Gb/s

There is indeed some performance improvement, especially for large
buffers, so I'm fine in principle with changing these functions.
However, I have some comments below.

>  arch/arm64/lib/copy_from_user.S |  92 +++++++++++------
>  arch/arm64/lib/copy_template.S  | 213 
> ++++++++++++++++++++++++++++++++++++++++
>  arch/arm64/lib/copy_to_user.S   |  56 ++++++-----

We should try to share copy_template.S with the memcpy routine as well.
They are basically the same, just the labels for user access differ.
More about this below.

We also have copy_in_user which is very similar.

> --- a/arch/arm64/lib/copy_from_user.S
> +++ b/arch/arm64/lib/copy_from_user.S
> @@ -15,7 +15,6 @@
>   */
>  
>  #include <linux/linkage.h>
> -#include <asm/assembler.h>
>  
>  /*
>   * Copy from user space to a kernel buffer (alignment handled by the 
> hardware)
[...]
> +9:
> +     /*
> +      * count is accurate
> +      * dst is accurate
> +      */
> +     mov     x0, count
> +     sub     dst, dst, tmp1
> +     b       .Lfinalize
> +
> +10:
> +     /*
> +      * count is in the last 15 bytes
> +      * dst is somewhere in there
> +      */
> +     mov     x0, count
> +     sub     dst, limit, count
> +     b       .Lfinalize

I'm confused by these labels and what they try to do. In the copy
template, usually 'count' is decremented before a set of post-indexed
load/store instructions are issued. I don't think 'count' is relevant
here for how many bytes have been copied. For example, copy_from_user()
can only fault on a load instruction. When you handle such exception,
you want to zero from the current dst (the store wasn't issued since the
load failed) to the end of the buffer (your limit).

> --- /dev/null
> +++ b/arch/arm64/lib/copy_template.S
> @@ -0,0 +1,213 @@
[...]
> +#include <asm/assembler.h>
> +#include <asm/cache.h>
> +
> +dstin        .req x0
> +src  .req x1
> +count        .req x2
> +tmp1 .req x3
> +tmp1w        .req w3
> +tmp2 .req x4
> +tmp2w        .req w4
> +limit        .req x5
> +dst  .req x6
> +
> +A_l  .req x7
> +A_h  .req x8
> +B_l  .req x9
> +B_h  .req x10
> +C_l  .req x11
> +C_h  .req x12
> +D_l  .req x13
> +D_h  .req x14
> +
> +     mov     dst, dstin
> +     add     limit, dst, count
> +     cmp     count, #16
> +     b.lo    .Ltail15
> +
> +     /*
> +      * We don't much care about the alignment of DST, but we want SRC
> +      * to be 128-bit (16 byte) aligned so that we don't cross cache line
> +      * boundaries on both loads and stores.
> +      */
> +     ands    tmp2, src, #15
> +     b.eq    .LSrcAligned
> +     sub     count, count, tmp2
> +
> +     tbz     tmp2, #0, 1f
> +     USER(11f, ldrb  tmp1w, [src], #1)
> +     USER(11f, strb  tmp1w, [dst], #1)

That's wrong. Depending own whether you want copy_from_user,
copy_to_user or copy_in_user, you have the USER annotation for load,
store or both respectively. It may be easier if we use asm macros
similar to the arm32 copy_template.S. Or maybe some macros like TMPL_LD,
TMPL_ST which may be defined to USER or just expanding the argument
based on the function they are called from (and we can easily share the
template with memcpy and copy_in_user).

> --- a/arch/arm64/lib/copy_to_user.S
> +++ b/arch/arm64/lib/copy_to_user.S
[...]
> +     .align    2
> +9:
> +10:
> +     /*
> +      * count is accurate
> +      */
> +     mov     x0, count
> +     b       .Lfinalize
> +11:
> +     /*
> +      * count is over accounted by tmp2
> +      */
> +     add     x0, count, tmp2
> +     b       .Lfinalize
> +12:
> +14:
> +     /*
> +      * (count + 64) bytes remain
> +      * dst is accurate
> +      */
> +     adds    x0, count, #64
> +     b       .Lfinalize
> +13:
> +     /*
> +      * (count + 128) bytes remain
> +      */
> +     add     x0, count, #128
> +.Lfinalize:
>       ret
>       .previous

Can we no just use something like (limit - dst) here and avoid checks
for whether 'count' was accurate or not?

-- 
Catalin
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [email protected]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Reply via email to