On Tue, Jan 20, 2015 at 09:15:38AM -0800, Stephen Hemminger wrote:
> On Mon, 19 Jan 2015 09:53:34 +0800
> zhihong.wang at intel.com wrote:
>
> > Main code changes:
> >
> > 1. Differentiate architectural features based on CPU flags
> >
> > a. Implement separated move functions for SSE/AVX/AVX2 to make full
> > utilization of cache bandwidth
> >
> > b. Implement separated copy flow specifically optimized for target
> > architecture
> >
> > 2. Rewrite the memcpy function "rte_memcpy"
> >
> > a. Add store aligning
> >
> > b. Add load aligning based on architectural features
> >
> > c. Put block copy loop into inline move functions for better control of
> > instruction order
> >
> > d. Eliminate unnecessary MOVs
> >
> > 3. Rewrite the inline move functions
> >
> > a. Add move functions for unaligned load cases
> >
> > b. Change instruction order in copy loops for better pipeline
> > utilization
> >
> > c. Use intrinsics instead of assembly code
> >
> > 4. Remove slow glibc call for constant copies
> >
> > Signed-off-by: Zhihong Wang <zhihong.wang at intel.com>
>
> Dumb question: why not fix glibc memcpy instead?
> What is special about rte_memcpy?
>

Fair point. Though, does glibc implement optimized memcpys per arch? Or do
they just rely on the __builtins from gcc to get optimized variants?
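For reference, here is a rough sketch of the copy pattern items 2 and 3 of
the changelog describe: intrinsics instead of assembly, a 32-byte head copy
that aligns the stores, and an inline block loop with unaligned loads
feeding aligned stores. The function name and structure are mine, not the
patch's, and it assumes n >= 32 and non-overlapping buffers; the real
rte_memcpy has separate paths for small sizes.

#include <stddef.h>
#include <stdint.h>
#include <immintrin.h>  /* AVX path: build with -mavx2 */

static inline void
copy_avx2_sketch(void *dst, const void *src, size_t n)
{
	uint8_t *d = dst;
	const uint8_t *s = src;

	/* Store aligning: one unaligned 32-byte copy, then round d up
	 * to a 32-byte boundary so every store below is aligned. */
	_mm256_storeu_si256((__m256i *)d,
			_mm256_loadu_si256((const __m256i *)s));
	size_t head = 32 - ((uintptr_t)d & 31);
	d += head; s += head; n -= head;

	/* Inline block copy loop: loads stay unaligned (src alignment
	 * is unknown), stores are aligned; keeping the loop inline is
	 * what gives control over instruction order. */
	while (n >= 32) {
		_mm256_store_si256((__m256i *)d,
				_mm256_loadu_si256((const __m256i *)s));
		d += 32; s += 32; n -= 32;
	}

	/* Tail: one overlapping unaligned 32-byte copy ending at dst+n. */
	if (n)
		_mm256_storeu_si256((__m256i *)(d + n - 32),
				_mm256_loadu_si256((const __m256i *)(s + n - 32)));
}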
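And on point 4 and the __builtin question: gcc already expands a memcpy
whose size is a compile-time constant into a few straight-line MOVs via
its builtin, so point 4 presumably targets the cases where that expansion
does not happen and the copy falls through to a real glibc call. A
hypothetical wrapper (not the actual DPDK macro) showing the distinction:

#include <stddef.h>
#include <string.h>

static inline void *
copy_const_sketch(void *dst, const void *src, size_t n)
{
	/* After inlining, __builtin_constant_p(n) folds to 1 when the
	 * caller passed a literal size; gcc then emits the copy inline
	 * instead of calling into libc. */
	if (__builtin_constant_p(n))
		return __builtin_memcpy(dst, src, n);
	return memcpy(dst, src, n);
}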
Neil