string: extend memcpy_flushcache() fixed-size fastpaths

Li Zhe Thu, 02 Jul 2026 00:57:45 -0700

On Wed, 1 Jul 2026 19:23:42 -0700, [email protected] wrote:

> On Wed, Jul 01, 2026 at 05:05:52PM +0800, Li Zhe wrote:
> > The new x86 memcpy_nt() helper in this series maps to
> > memcpy_flushcache(), and the ZONE_DEVICE fast path uses that primitive
> > for constant-sized struct page template copies.
> >
> > memcpy_flushcache() currently inlines only the 4, 8, and 16-byte
> > cases. Larger constant-sized copies fall back to __memcpy_flushcache()
> > even when the destination is naturally aligned. Extend the inline
> > movnti coverage to 32, 48, 64, 80, and 96 bytes so the struct
> 
> Why exactly those sizes and not bigger?
> 
> Any numbers to back this up?


The relevant target sizes here are the x86_64 sizeof(struct page)
values which patch 8 can hit: 64, 80, and 96 bytes.

On x86_64, CONFIG_HAVE_ALIGNED_STRUCT_PAGE keeps struct page 16-byte
aligned. That gives 64 bytes in the common case, 80 bytes when
WANT_PAGE_VIRTUAL or CONFIG_KMSAN adds extra fields, and 96 bytes when
both are present. I stopped there deliberately because this series does
not have a current caller which needs larger constant sizes.

The 32-byte and 48-byte cases were added only as intermediate fixed-size
combinations used to build those larger struct page-sized copies while
keeping them on the inline movnti path.

> > page-sized copies used by that path can stay on the inline
> > non-temporal store path instead of dropping into the out-of-line
> > helper.
> >
> > Factor the store sequences into 8/16/32/64-byte helpers, keep the
> > existing 4/8/16-byte cases handled directly in memcpy_flushcache(),
> > issue the stores in ascending address order, and leave all other sizes
> > on __memcpy_flushcache().
> >
> > Signed-off-by: Li Zhe <[email protected]>
> > ---
> >  arch/x86/include/asm/string_64.h | 80 +++++++++++++++++++++++++++++++-
> >  1 file changed, 79 insertions(+), 1 deletion(-)
> >
> > diff --git a/arch/x86/include/asm/string_64.h 
> > b/arch/x86/include/asm/string_64.h
> > index 6f36abedc56a..95ef2d481418 100644
> > --- a/arch/x86/include/asm/string_64.h
> > +++ b/arch/x86/include/asm/string_64.h
> > @@ -4,6 +4,7 @@
> >
> >  #ifdef __KERNEL__
> >  #include <linux/jump_label.h>
> > +#include <linux/align.h>
> >
> >  /* Written 2002 by Andi Kleen */
> >
> > @@ -82,8 +83,81 @@ int strcmp(const char *cs, const char *ct);
> >  #ifdef CONFIG_ARCH_HAS_UACCESS_FLUSHCACHE
> >  #define __HAVE_ARCH_MEMCPY_FLUSHCACHE 1
> >  void __memcpy_flushcache(void *dst, const void *src, size_t cnt);
> > -static __always_inline void memcpy_flushcache(void *dst, const void *src, 
> > size_t cnt)
> > +
> > +static __always_inline void memcpy_flushcache_8(void *dst, const void *src)
> 
> If this is calling MOVNTI, you might as well call it that way. Only leave the
> externally exposed wrapper called "flushcache".

Thanks for the review.

In the next version I will rename the internal leaf helpers to
movnti_*() names and keep memcpy_flushcache() as the externally exposed
wrapper.

> > +{
> > +   asm volatile("movntiq %1, %0"
> > +                : "=m"(*(u64 *)dst)
> > +                : "r"(*(const u64 *)src)
> > +                : "memory");
> > +}
> > +
> > +static __always_inline void memcpy_flushcache_16(void *dst,
> > +                                            const void *src)
> > +{
> > +   memcpy_flushcache_8(dst, src);
> > +   memcpy_flushcache_8(dst + 8, src + 8);
> > +}
> > +
> > +static __always_inline void memcpy_flushcache_32(void *dst,
> > +                                            const void *src)
> > +{
> > +   memcpy_flushcache_16(dst, src);
> > +   memcpy_flushcache_16(dst + 16, src + 16);
> > +}
> > +
> > +static __always_inline void memcpy_flushcache_64(void *dst,
> > +                                            const void *src)
> >  {
> > +   memcpy_flushcache_32(dst, src);
> > +   memcpy_flushcache_32(dst + 32, src + 32);
> > +}
> > +
> > +/*
> > + * Keep the additional aligned fixed-size cases on the inline movnti path.
> > + * Leave the existing 4/8/16-byte cases handled directly in
> > + * memcpy_flushcache() so their code generation stays unchanged.
> > + */
> > +static __always_inline int memcpy_flushcache_large(void *dst,
> > +                                              const void *src,
> > +                                              size_t cnt)
> > +{
> > +   char *dptr = dst;
> > +   const char *sptr = src;
> > +
> > +   if (!IS_ALIGNED((unsigned long)dst, 8))
> > +           return 0;
> > +
> > +   switch (cnt) {
> > +   case 32:
> > +           memcpy_flushcache_32(dptr, sptr);
> > +           return 1;
> > +   case 48:
> > +           memcpy_flushcache_32(dptr, sptr);
> > +           memcpy_flushcache_16(dptr + 32, sptr + 32);
> > +           return 1;
> > +   case 64:
> > +           memcpy_flushcache_64(dptr, sptr);
> > +           return 1;
> > +   case 80:
> > +           memcpy_flushcache_64(dptr, sptr);
> > +           memcpy_flushcache_16(dptr + 64, sptr + 64);
> > +           return 1;
> > +   case 96:
> > +           memcpy_flushcache_64(dptr, sptr);
> > +           memcpy_flushcache_32(dptr + 64, sptr + 64);
> > +           return 1;
> > +   }
> > +
> > +   return 0;
> > +}
> > +
> > +static __always_inline void memcpy_flushcache(void *dst, const void *src,
> > +                                         size_t cnt)
> > +{
> > +   if (!cnt)
> > +           return;
> > +
> >     if (__builtin_constant_p(cnt)) {
> >             switch (cnt) {
> >                     case 4:
> > @@ -97,7 +171,11 @@ static __always_inline void memcpy_flushcache(void 
> > *dst, const void *src, size_t
> >                             asm ("movntiq %1, %0" : "=m"(*(u64 *)(dst + 8)) 
> > : "r"(*(u64 *)(src + 8)));
> >                             return;
> >             }
> > +
> > +           if (memcpy_flushcache_large(dst, src, cnt))
> 
> Why is this a separate function and not extending the switch-case?
> 
> I betcha it has to do with the alignment check but nothing explains to me why.

Yes. The separate helper is there exactly because the added
32/48/64/80/96-byte cases carry an extra 8-byte destination-alignment
check, while I left the existing 4/8/16-byte inline cases unchanged.

The intent was to keep the larger fixed-size cases on the inline
movnti path only when they already match the alignment assumptions
used by __memcpy_flushcache(); otherwise they fall back to that
out-of-line helper, which already handles the misaligned head/tail
fragments before entering its movnti loops.

The current patch does not explain that clearly enough. In v6 I will
add an explicit comment to memcpy_flushcache_large() and spell the
same reasoning out in the changelog.

Thanks,
Zhe

Re: [PATCH v5 7/8] x86/string: extend memcpy_flushcache() fixed-size fastpaths

Reply via email to