>  > I don't think it works this way: if PAT is programmed to UC,
>  > I think you get UC access with movntq. No?
> 
> You're right -- I misremembered what the non-temporal stuff does, but
> I just checked and the manual says:
> 
>  "The memory type of the region being written to can override the
>   non-temporal hint, if the memory address specified for the
>   non-temporal store is in an uncacheable (UC) or write protected (WP)
>   memory region."

Actually, I think I just thought up a way to solve this, and I quote in full:

Vol. 1, 10-19
PROGRAMMING WITH STREAMING SIMD EXTENSIONS (SSE)

 "These SSE and SSE2 non-temporal store instructions minimize cache
  pollution by treating the memory being accessed as the write combining
  (WC) type. If a program specifies a non-temporal store with one of these
  instructions and the destination region is mapped as cacheable memory
  (write back [WB], write through [WT] or WC memory type), the processor
  will do the following:

  • If the memory location being written to is present in the cache
    hierarchy, the data in the caches is evicted.
  • The non-temporal data is written to memory with WC semantics.

  See also: Chapter 10, “Memory Cache Control,” in the Intel® 64 and
  IA-32 Architectures Software Developer’s Manual, Volume 3A.

  Using the WC semantics, the store transaction will be weakly ordered,
  meaning that the data may not be written to memory in program order,
  and the store will not write allocate (that is, the processor will not
  fetch the corresponding cache line into the cache hierarchy, prior to
  performing the store). Also, different processor implementations may
  choose to collapse and combine these stores.

  The memory type of the region being written to can override the
  non-temporal hint, if the memory address specified for the non-temporal
  store is in uncacheable memory. Uncacheable as referred to here means
  that the region being written to has been mapped with either an
  uncacheable (UC) or write protected (WP) memory type."

-------------

So we can map the device memory with WB or WT semantics, and movnt will enable
WC. And the nice thing about this trick is that both WB and WT *are already
programmed into PAT after reset*, which means that we can use them for pages we
map for userspace, without stepping on anyone's toes or waiting for
the generic in-kernel support for WC to materialize.

Another nice thing is that all WRs are 16-byte aligned so we can
use the aligned instructions there.

Given that full WC support in the kernel is likely to take
quite a while to materialize, maybe that's the way to go for now?
What do you think?

I attach a header file that implements WC memcpy with these
instructions for lengths from 16 to 128 bytes (and one can,
naturally, just call xmm_copy64 in a loop), that I wrote for fun
at some point. Feel free to read/flame/reuse in any way you like.

As far as I remember, replacing memcpy with this hack resulted
in a marginal latency speedup on Intel, likely on account
of the loop unrolling I did there.

-- 
MST

#ifndef XMM_COPY
#define XMM_COPY

static inline void xmm_copy16(const char *from, char *to)
{
	/* Both pointers must be 16-byte aligned (movdqa). */
	asm("// xmm_copy16\n"
	    "	movdqa %0, %%xmm0\n"
	    "	movntdq %%xmm0, %1"
	    :: "m"(from[0]), "m"(to[0]) : "xmm0", "memory");
}

static inline void xmm_copy32(const char *from, char *to)
{
	/* Both pointers must be 16-byte aligned (movdqa). */
	asm("// xmm_copy32_start\n"
	    "	movdqa %0, %%xmm0\n"
	    "	movdqa %2, %%xmm1\n"
	    "	movntdq %%xmm0, %1\n"
	    "	movntdq %%xmm1, %3\n"
	    "	// xmm_copy32_end"
	    ::
	    "m"(from[0]), "m"(to[0]),
	    "m"(from[16]), "m"(to[16])
	    : "xmm0", "xmm1", "memory");
}

static inline void xmm_copy64(const char *from, char *to)
{
	/* Both pointers must be 16-byte aligned (movdqa). */
	asm("// xmm_copy64_start\n"
	    "	movdqa %0, %%xmm0\n"
	    "	movdqa %2, %%xmm1\n"
	    "	movdqa %4, %%xmm2\n"
	    "	movdqa %6, %%xmm3\n"
	    "	movntdq %%xmm0, %1\n"
	    "	movntdq %%xmm1, %3\n"
	    "	movntdq %%xmm2, %5\n"
	    "	movntdq %%xmm3, %7\n"
	    "	// xmm_copy64_end"
	    ::
	    "m"(from[0]), "m"(to[0]),
	    "m"(from[16]), "m"(to[16]),
	    "m"(from[32]), "m"(to[32]),
	    "m"(from[48]), "m"(to[48])
	    : "xmm0", "xmm1", "xmm2", "xmm3", "memory");
}

static inline void xmm_copy128(const char *from, char *to)
{
	/* Both pointers must be 16-byte aligned (movdqa). */
	asm("// xmm_copy128_start\n"
	    "	movdqa %0, %%xmm0\n"
	    "	movdqa %2, %%xmm1\n"
	    "	movdqa %4, %%xmm2\n"
	    "	movdqa %6, %%xmm3\n"
	    "	movdqa %8, %%xmm4\n"
	    "	movdqa %10, %%xmm5\n"
	    "	movdqa %12, %%xmm6\n"
	    "	movdqa %14, %%xmm7\n"
	    "	movntdq %%xmm0, %1\n"
	    "	movntdq %%xmm1, %3\n"
	    "	movntdq %%xmm2, %5\n"
	    "	movntdq %%xmm3, %7\n"
	    "	movntdq %%xmm4, %9\n"
	    "	movntdq %%xmm5, %11\n"
	    "	movntdq %%xmm6, %13\n"
	    "	movntdq %%xmm7, %15\n"
	    "	// xmm_copy128_end"
	    ::
	    "m"(from[0x0 ]), "m"(to[0x0 ]),
	    "m"(from[0x10]), "m"(to[0x10]),
	    "m"(from[0x20]), "m"(to[0x20]),
	    "m"(from[0x30]), "m"(to[0x30]),
	    "m"(from[0x40]), "m"(to[0x40]),
	    "m"(from[0x50]), "m"(to[0x50]),
	    "m"(from[0x60]), "m"(to[0x60]),
	    "m"(from[0x70]), "m"(to[0x70])
	    : "xmm0", "xmm1", "xmm2", "xmm3",
	      "xmm4", "xmm5", "xmm6", "xmm7", "memory");
}

#endif

_______________________________________________
general mailing list
[email protected]
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
