On Sun, 7 Jun 2026 12:08:04 -0700, [email protected] wrote: > On Wed, Jun 03, 2026 at 04:01:50PM +0800, Li Zhe wrote: > > Introduce a generic memcpy_streaming() interface for write-once copy > > sites that can fall back to memcpy() when no architecture-specific > > optimization is available, or when an architecture-specific backend > > cannot safely handle a given transfer. > > > > Add memcpy_streaming_drain() alongside it so callers can separate the > > copy primitive from any required ordering point. On x86, use > > memcpy_flushcache() and sfence only for aligned transfers that can stay > > entirely on the non-temporal store path; otherwise fall back to memcpy() > > So you throwing "streaming", "non-temporal" and "flush-cache" wildly around > here and this is adding unnecessary confusion where it shouldn't. I'd suggest > you stick to "non-temporal" which you can abbreviate short'n'sweet to "nt" and > that's it. Keep it simple.
Thanks for the review. Will switch to nt-based naming in next revision. > > so the generic API does not expose flushcache semantics on cached > > head/tail fragments. > > > > Callers are responsible for invoking memcpy_streaming_drain() before > > later normal stores that must be ordered after the streaming copy. > > > > Signed-off-by: Li Zhe <[email protected]> > > --- > > arch/x86/include/asm/string_64.h | 32 ++++++++++++++++++++++++++++++++ > > include/linux/string.h | 20 ++++++++++++++++++++ > > 2 files changed, 52 insertions(+) > > > > diff --git a/arch/x86/include/asm/string_64.h > > b/arch/x86/include/asm/string_64.h > > index 4635616863f5..aee63108577f 100644 > > --- a/arch/x86/include/asm/string_64.h > > +++ b/arch/x86/include/asm/string_64.h > > There's arch/x86/include/asm/string.h. Why are those here, in the _64 variant? The current placement was meant to reflect that the x86 implementation here is really just a thin wrapper around the existing memcpy_flushcache() backend, and that backend is x86_64-only today. On x86, CONFIG_ARCH_HAS_UACCESS_FLUSHCACHE is selected only for X86_64, so a 32-bit build would still need to fall back to the generic memcpy()-based implementation anyway. Keeping it in string_64.h made that backend dependency explicit. That said, I see your layering point. If arch/x86/include/asm/string.h is the preferred place for the arch-visible wrapper, I can move the wrapper there in the next revision while keeping the x86_64-specific implementation details in string_64.h. > > @@ -4,6 +4,7 @@ > > > > #ifdef __KERNEL__ > > #include <linux/jump_label.h> > > +#include <linux/align.h> > > > > /* Written 2002 by Andi Kleen */ > > > > @@ -100,6 +101,37 @@ static __always_inline void memcpy_flushcache(void > > *dst, const void *src, size_t > > } > > __memcpy_flushcache(dst, src, cnt); > > } > > + > > +/* > > + * Only map memcpy_streaming() to memcpy_flushcache() when the destination > > + * is already 8-byte aligned and the size can be handled without cached > > + * head/tail fragments in __memcpy_flushcache(). > > + */ > > +static __always_inline bool memcpy_flushcache_nt_safe(const void *dst, > > + size_t cnt) > > This is checking alignment. Then call it that. > > > +{ > > + unsigned long d = (unsigned long)dst; > > Useless. > > > + > > + return cnt && IS_ALIGNED(d, 8) && IS_ALIGNED(cnt, 4); > > +} > > AFAICT, this helper is used only once. Zap it completely. Agreed. That helper is over-factored in its current form. I'll fold the alignment test into the callsite and drop the temporary variable in the next revision. > > + > > +#define __HAVE_ARCH_MEMCPY_STREAMING 1 > > +static __always_inline void memcpy_streaming(void *dst, const void *src, > > memcpy_nt() > > > + size_t cnt) > > +{ > > + if (!cnt) > > + return; > > + > > + if (memcpy_flushcache_nt_safe(dst, cnt)) > > That branch can cost. Why is that alignment checking so necessary? Why can't > you simply DTRT by handling the misaligned parts like __memcpy_flushcache(). > > What does this bring you? None of that is explained in the commit message so > why do I want this patch at all? > > The commit message is basically telling me what the patch does but I can kinda > read that from the diff itself. What it is not telling me is *why* it exists. The extra alignment gating was meant to keep this helper narrower than __memcpy_flushcache(), so patch 8 would not inherit the mixed cached head/tail handling from that implementation. Thinking about it more, I agree that this is hard to justify for a generic helper. For this series, what really matters is that the struct page copies in patch 8 can use the existing x86 memcpy_flushcache() fastpaths where that is beneficial; I do not need patch 6 to impose extra selection policy on unrelated callers. I'll simplify and rework this part in the next revision, rewrite the changelog to explain the actual motivation more clearly, and respin patches 6-8 accordingly. Thanks, Zhe

