I'm not as concerned with compile times given the potential performance boost.
A long time ago (mid-80s) I was at Convex and wanted to do a vector bcopy(), because it would make the I/O system (mostly disk then (*)) go faster. The architect explained to me that the vector registers were for applications, not the kernel (as well as re-explaining the expense of vector context switches, should the kernel be using the vector unit(s) while some application also wanted to use them).

The same is true today of AVX/AVX2, SSE, and even the AES-NI instructions: normally we don't use these in kernel code (which is traditionally where the networking stack has lived). The differences with DPDK are that a) entire cores (including the AVX/SSE units and even AES-NI (FPU)) are dedicated to DPDK, and b) DPDK is a library, and the resulting networking applications are exactly that, applications. The "operating system" is now a control plane.

Jim

(* Back then it was commonly thought that TCP would never be able to fill a 10Gbps Ethernet.)

> On Jan 21, 2015, at 2:54 PM, Neil Horman <nhorman at tuxdriver.com> wrote:
> 
> On Wed, Jan 21, 2015 at 11:49:47AM -0800, Stephen Hemminger wrote:
>> On Wed, 21 Jan 2015 13:26:20 +0000
>> Bruce Richardson <bruce.richardson at intel.com> wrote:
>> 
>>> On Wed, Jan 21, 2015 at 02:21:25PM +0100, Marc Sune wrote:
>>>> 
>>>> On 21/01/15 14:02, Bruce Richardson wrote:
>>>>> On Wed, Jan 21, 2015 at 01:36:41PM +0100, Marc Sune wrote:
>>>>>> On 21/01/15 04:44, Wang, Zhihong wrote:
>>>>>>>> -----Original Message-----
>>>>>>>> From: Richardson, Bruce
>>>>>>>> Sent: Wednesday, January 21, 2015 12:15 AM
>>>>>>>> To: Neil Horman
>>>>>>>> Cc: Wang, Zhihong; dev at dpdk.org
>>>>>>>> Subject: Re: [dpdk-dev] [PATCH 0/4] DPDK memcpy optimization
>>>>>>>> 
>>>>>>>> On Tue, Jan 20, 2015 at 10:11:18AM -0500, Neil Horman wrote:
>>>>>>>>> On Tue, Jan 20, 2015 at 03:01:44AM +0000, Wang, Zhihong wrote:
>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>> From: Neil Horman [mailto:nhorman at tuxdriver.com]
>>>>>>>>>>> Sent: Monday, January 19, 2015 9:02 PM
>>>>>>>>>>> To: Wang, Zhihong
>>>>>>>>>>> Cc: dev at dpdk.org
>>>>>>>>>>> Subject: Re: [dpdk-dev] [PATCH 0/4] DPDK memcpy optimization
>>>>>>>>>>> 
>>>>>>>>>>> On Mon, Jan 19, 2015 at 09:53:30AM +0800, zhihong.wang at intel.com wrote:
>>>>>>>>>>>> This patch set optimizes memcpy for DPDK for both SSE and AVX platforms.
>>>>>>>>>>>> It also extends memcpy test coverage with unaligned cases and more test points.
>>>>>>>>>>>> Optimization techniques are summarized below:
>>>>>>>>>>>> 
>>>>>>>>>>>> 1. Utilize full cache bandwidth
>>>>>>>>>>>> 
>>>>>>>>>>>> 2. Enforce aligned stores
>>>>>>>>>>>> 
>>>>>>>>>>>> 3. Apply load address alignment based on architecture features
>>>>>>>>>>>> 
>>>>>>>>>>>> 4. Make load/store address available as early as possible
>>>>>>>>>>>> 
>>>>>>>>>>>> 5. General optimization techniques like inlining, branch reducing, prefetch pattern access
>>>>>>>>>>>> 
>>>>>>>>>>>> Zhihong Wang (4):
>>>>>>>>>>>>   Disabled VTA for memcpy test in app/test/Makefile
>>>>>>>>>>>>   Removed unnecessary test cases in test_memcpy.c
>>>>>>>>>>>>   Extended test coverage in test_memcpy_perf.c
>>>>>>>>>>>>   Optimized memcpy in arch/x86/rte_memcpy.h for both SSE and AVX platforms
>>>>>>>>>>>> 
>>>>>>>>>>>>  app/test/Makefile                        |   6 +
>>>>>>>>>>>>  app/test/test_memcpy.c                   |  52 +-
>>>>>>>>>>>>  app/test/test_memcpy_perf.c              | 238 +++++---
>>>>>>>>>>>>  .../common/include/arch/x86/rte_memcpy.h | 664 +++++++++++++++------
>>>>>>>>>>>>  4 files changed, 656 insertions(+), 304 deletions(-)
>>>>>>>>>>>> 
>>>>>>>>>>>> --
>>>>>>>>>>>> 1.9.3
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>> Are you able to compile this with gcc 4.9.2? The compilation of
>>>>>>>>>>> test_memcpy_perf is taking forever for me. It appears hung.
>>>>>>>>>>> Neil
>>>>>>>>>> Neil,
>>>>>>>>>> 
>>>>>>>>>> Thanks for reporting this!
>>>>>>>>>> It should compile, but will take quite some time if the CPU doesn't support AVX2. The reasons are that:
>>>>>>>>>> 1. The SSE & AVX memcpy implementation is more complicated than the AVX2 version, so the compiler takes more time to compile and optimize it
>>>>>>>>>> 2. The new test_memcpy_perf.c contains 126 constant memcpy calls for better test case coverage, which is quite a lot
>>>>>>>>>> 
>>>>>>>>>> I've just tested this patch on an Ivy Bridge machine with GCC 4.9.2:
>>>>>>>>>> 1. The whole compile process takes 9'41" with the original test_memcpy_perf.c (63 + 63 = 126 constant memcpy calls)
>>>>>>>>>> 2. It takes only 2'41" after I reduce the constant memcpy call count to 12 + 12 = 24
>>>>>>>>>> 
>>>>>>>>>> I'll reduce the memcpy calls in the next version of the patch.
>>>>>>>>>> 
>>>>>>>>> ok, thank you. I'm all for optimization, but I think a compile that takes almost
>>>>>>>>> 10 minutes for a single file is going to generate some raised eyebrows
>>>>>>>>> when end users start tinkering with it
>>>>>>>>> 
>>>>>>>>> Neil
>>>>>>>>> 
>>>>>>>>>> Zhihong (John)
>>>>>>>>>> 
>>>>>>>> Even two minutes is a very long time to compile, IMHO. The whole of DPDK
>>>>>>>> doesn't take that long to compile right now, and that's with a couple of huge
>>>>>>>> header files with routing tables in it. Any chance you could cut compile time
>>>>>>>> down to a few seconds while still having reasonable tests?
>>>>>>>> Also, when there is AVX2 present on the system, what is the compile time
>>>>>>>> like for that code?
>>>>>>>> 
>>>>>>>> /Bruce
>>>>>>> Neil, Bruce,
>>>>>>> 
>>>>>>> Some data first.
>>>>>>> 
>>>>>>> Sandy Bridge without AVX2:
>>>>>>> 1. original w/ 10 constant memcpy: 2'25"
>>>>>>> 2. patch w/ 12 constant memcpy: 2'41"
>>>>>>> 3. patch w/ 63 constant memcpy: 9'41"
>>>>>>> 
>>>>>>> Haswell with AVX2:
>>>>>>> 1. original w/ 10 constant memcpy: 1'57"
>>>>>>> 2. patch w/ 12 constant memcpy: 1'56"
>>>>>>> 3. patch w/ 63 constant memcpy: 3'16"
>>>>>>> 
>>>>>>> Also, to address Bruce's question, we have to reduce test cases to cut
>>>>>>> down compile time, because we use:
>>>>>>> 1. intrinsics instead of assembly, for better flexibility and to let the compiler optimize more
>>>>>>> 2. a complex function body, for better performance
>>>>>>> 3. inlining
>>>>>>> This increases compile time.
>>>>>>> But I think it'd be okay to do that as long as we can select a fair set of test points.
>>>>>>> 
>>>>>>> It'd be great if you could give some suggestions, say, 12 points.
>>>>>>> 
>>>>>>> Zhihong (John)
>>>>>>> 
>>>>>>> 
>>>>>> While I agree that in the general case these long compilation times are painful
>>>>>> for the users, having a factor of 2-8x in memcpy operations is quite an
>>>>>> improvement, especially in DPDK applications which (unfortunately) need to rely
>>>>>> heavily on them -- e.g. IP fragmentation and reassembly.
>>>>>> 
>>>>>> Why not have fast compilation by default, and a tunable config flag to
>>>>>> enable a highly optimized version of rte_memcpy (e.g. RTE_EAL_OPT_MEMCPY)?
>>>>>> 
>>>>>> Marc
>>>>>> 
>>>>> Out of interest, are these 2-8x improvements something you have benchmarked
>>>>> in these app scenarios? [i.e. not just in micro-benchmarks]
>>>> 
>>>> How much that micro-speedup will end up affecting the performance of the
>>>> entire application is something I cannot say, so I agree that we should
>>>> probably have some additional benchmarks before deciding that it pays off to
>>>> maintain 2 versions of rte_memcpy.
>>>> 
>>>> There are however a bunch of possible DPDK applications that could
>>>> potentially benefit; IP fragmentation, tunneling and specialized DPI
>>>> applications, among others, since they involve a reasonable amount of
>>>> memcpys per pkt. My point was, *if* it proves beneficial enough, why
>>>> not have it optionally?
>>>> 
>>>> Marc
>>> 
>>> I agree, if it provides the speedups then we need to have it in - and quite possibly
>>> on by default, even.
>>> 
>>> /Bruce
>> 
>> One issue I have is that as a vendor we need to ship one binary, not different distributions
>> for each Intel chip variant. There is some support for multi-chip versioned functions,
>> but only in the latest GCC, which isn't in Debian stable. And the multi-chip versioning
>> of functions is going to be more expensive than inlining. For some cases, I have
>> seen that the overhead of fancy instructions looks good on paper but has nasty side effects
>> like CPU stalls and/or increased power consumption that turns off turbo boost.
>> 
>> Distros in general have the same problem with special-case optimizations.
>> 
> What we really need is to do something like borrow the alternatives mechanism
> from the kernel, so that we can dynamically replace instructions at run time
> based on CPU flags. That way we could make the choice at run time, and wouldn't
> have to do a lot of special-case jumping about.
> Neil
>
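
To make the "enforce aligned stores" and "load address alignment" techniques from Zhihong's summary concrete, here is a minimal sketch of the idea (not the patch's actual code; the function name is made up): copy a small head so the destination becomes 16-byte aligned, stream 16-byte chunks with unaligned SSE loads and aligned stores, and finish the tail with a plain memcpy. The real rte_memcpy.h additionally specializes on constant sizes, which is what drives the compile times discussed above.

    #include <stdint.h>
    #include <string.h>
    #include <emmintrin.h>

    static inline void *
    copy_aligned_stores(void *dst, const void *src, size_t n)
    {
        uint8_t *d = dst;
        const uint8_t *s = src;

        /* Head: advance until the store address is 16-byte aligned. */
        size_t head = (16 - ((uintptr_t)d & 15)) & 15;
        if (head > n)
            head = n;
        memcpy(d, s, head);
        d += head; s += head; n -= head;

        /* Body: unaligned loads, aligned stores, 16 bytes at a time. */
        while (n >= 16) {
            __m128i chunk = _mm_loadu_si128((const __m128i *)s);
            _mm_store_si128((__m128i *)d, chunk);
            d += 16; s += 16; n -= 16;
        }

        /* Tail: whatever is left. */
        memcpy(d, s, n);
        return dst;
    }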
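
Marc's RTE_EAL_OPT_MEMCPY suggestion could plausibly be wired up as a simple build-time switch along these lines (a hypothetical sketch, not an existing DPDK option; rte_memcpy_opt() stands in for the intrinsics-based implementation): fast builds map rte_memcpy() straight to the libc memcpy(), and only opting in pulls in the slow-to-compile optimized version.

    #include <string.h>

    #ifdef RTE_EAL_OPT_MEMCPY
    /* opt-in: heavily inlined, intrinsics-based copy (slower to compile) */
    #define rte_memcpy(dst, src, n) rte_memcpy_opt((dst), (src), (n))
    #else
    /* default: cheap to compile, just the libc implementation */
    #define rte_memcpy(dst, src, n) memcpy((dst), (src), (n))
    #endif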
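
For the run-time approach Neil describes at the end, a userspace stand-in for the kernel's alternatives mechanism could be as simple as choosing an implementation once at startup based on CPU flags and calling it through a function pointer. This is a sketch only: the variant functions are hypothetical, and it relies on GCC's __builtin_cpu_supports(), available since GCC 4.8.

    #include <stddef.h>
    #include <string.h>

    /* hypothetical variants, standing in for SSE/AVX2 implementations */
    void *memcpy_sse(void *dst, const void *src, size_t n);
    void *memcpy_avx2(void *dst, const void *src, size_t n);

    /* selected implementation; defaults to the libc one */
    static void *(*opt_memcpy)(void *, const void *, size_t) = memcpy;

    /* pick the best variant once at startup */
    void
    opt_memcpy_init(void)
    {
        __builtin_cpu_init();   /* only strictly needed before constructors run */
        if (__builtin_cpu_supports("avx2"))
            opt_memcpy = memcpy_avx2;
        else if (__builtin_cpu_supports("sse4.2"))
            opt_memcpy = memcpy_sse;
        /* otherwise keep the plain libc memcpy */
    }

The obvious cost, which Stephen already points at, is that a call through a function pointer can't be inlined, so the per-call-site specialization on constant sizes that makes the inline rte_memcpy attractive is lost.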