> -----Original Message-----
> From: Jianbo Liu [mailto:jianbo.liu at linaro.org]
> Sent: Thursday, September 22, 2016 10:42 PM
> To: Wang, Zhihong <zhihong.wang at intel.com>
> Cc: Yuanhan Liu <yuanhan.liu at linux.intel.com>; Maxime Coquelin
> <maxime.coquelin at redhat.com>; dev at dpdk.org
> Subject: Re: [dpdk-dev] [PATCH v3 0/5] vhost: optimize enqueue
>
> On 22 September 2016 at 18:04, Wang, Zhihong <zhihong.wang at intel.com>
> wrote:
> >
> >
> >> -----Original Message-----
> >> From: Jianbo Liu [mailto:jianbo.liu at linaro.org]
> >> Sent: Thursday, September 22, 2016 5:02 PM
> >> To: Wang, Zhihong <zhihong.wang at intel.com>
> >> Cc: Yuanhan Liu <yuanhan.liu at linux.intel.com>; Maxime Coquelin
> >> <maxime.coquelin at redhat.com>; dev at dpdk.org
> >> Subject: Re: [dpdk-dev] [PATCH v3 0/5] vhost: optimize enqueue
> >>
> >> On 22 September 2016 at 14:58, Wang, Zhihong <zhihong.wang at intel.com>
> >> wrote:
> >> >
> >> >
> >> >> -----Original Message-----
> >> >> From: Jianbo Liu [mailto:jianbo.liu at linaro.org]
> >> >> Sent: Thursday, September 22, 2016 1:48 PM
> >> >> To: Yuanhan Liu <yuanhan.liu at linux.intel.com>
> >> >> Cc: Wang, Zhihong <zhihong.wang at intel.com>; Maxime Coquelin
> >> >> <maxime.coquelin at redhat.com>; dev at dpdk.org
> >> >> Subject: Re: [dpdk-dev] [PATCH v3 0/5] vhost: optimize enqueue
> >> >>
> >> >> On 22 September 2016 at 10:29, Yuanhan Liu
> >> <yuanhan.liu at linux.intel.com>
> >> >> wrote:
> >> >> > On Wed, Sep 21, 2016 at 08:54:11PM +0800, Jianbo Liu wrote:
> >> >> >> >> > My setup consists of one host running a guest.
> >> >> >> >> > The guest generates as much 64bytes packets as possible using
> >> >> >> >>
> >> >> >> >> Have you tested with other different packet size?
> >> >> >> >> My testing shows that performance is dropping when packet size is
> >> >> more
> >> >> >> >> than 256.
> >> >> >> >
> >> >> >> >
> >> >> >> > Hi Jianbo,
> >> >> >> >
> >> >> >> > Thanks for reporting this.
> >> >> >> >
> >> >> >> > 1. Are you running the vector frontend with mrg_rxbuf=off?
> >> >> >> >
> >> >> Yes, my testing is mrg_rxbuf=off, but not vector frontend PMD.
> >> >>
> >> >> >> > 2. Could you please specify what CPU you're running? Is it Haswell
> >> >> >> > or Ivy Bridge?
> >> >> >> >
> >> >> It's an ARM server.
> >> >>
> >> >> >> > 3. How many percentage of drop are you seeing?
> >> >> The testing result:
> >> >> size (bytes)   improvement (%)
> >> >> 64             3.92
> >> >> 128            11.51
> >> >> 256            24.16
> >> >> 512            -13.79
> >> >> 1024           -22.51
> >> >> 1500           -12.22
> >> >> A correction is that performance is dropping if byte size is larger
> >> >> than 512.
> >> >
> >> >
> >> > Jianbo,
> >> >
> >> > Could you please verify does this patch really cause enqueue perf to
> >> > drop?
> >> >
> >> > You can test the enqueue path only by set guest to do rxonly, and compare
> >> > the mpps by show port stats all in the guest.
> >> >
> >> >
> >> Tested with testpmd, host: txonly, guest: rxonly
> >> size (bytes)   improvement (%)
> >> 64             4.12
> >> 128            6
> >> 256            2.65
> >> 512            -1.12
> >> 1024           -7.02
> >
> >
> >
> > I think your number is little bit hard to understand for me, this patch's
> > optimization contains 2 parts:
> >
> > 1. ring operation: works for both mrg_rxbuf on and off
> >
> > 2. remote write ordering: works for mrg_rxbuf=on only
> >
> > So, for mrg_rxbuf=off, if this patch is good for 64B packets, then it
> > shouldn't do anything bad for larger packets.
> >
> > This is the gain on x86 platform: host iofwd between nic and vhost,
> > guest rxonly.
> >
> > nic2vm   enhancement
> > 64       21.83%
> > 128      16.97%
> > 256      6.34%
> > 512      0.01%
> > 1024     0.00%
>
> I bootup a VM with 2 virtual port, and stress the traffic between them.
> First, I stressed with pktgen-dpdk in VM, and did iofwd in host.
> Then, as you told, I did rxonly in VM, and txonly in host.
>
> > I suspect there's some complication in ARM's micro-arch.
> >
> > Could you try v6 and apply all patches except the the last one:
> > [PATCH v6 6/6] vhost: optimize cache access
> >
> > And see if there's still perf drop?
> >
> The last patch can improve the performance. The drop is actually
> caused by the second patch.
This is expected because the 2nd patch is just a baseline; all the
optimizations are introduced in the rest of this patch set.

I think you can do a bottleneck analysis on ARM to see what's slowing
down the perf. There might be some micro-arch complications there, most
likely in memcpy. Do you use glibc's memcpy? I suggest hand-crafting it
on your own.

Could you publish the mrg_rxbuf=on data as well, since it's the more
widely used setting in terms of spec integrity?

Thanks
Zhihong

>
> Jianbo
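
PS: In case it's useful, here is a minimal sketch of the kind of
hand-rolled copy I had in mind, using NEON intrinsics. The intrinsics,
the 32-byte unroll, and the function name are illustrative assumptions
on my side, not code from this patch set:

#include <arm_neon.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical hand-rolled copy: 32 bytes per iteration with NEON
 * loads/stores, then a byte-wise tail. The point is only to avoid the
 * generic glibc call for the small, known-size copies in the enqueue
 * path; the unroll factor should be tuned to the target micro-arch.
 */
static inline void
copy_neon_sketch(uint8_t *dst, const uint8_t *src, size_t len)
{
    while (len >= 32) {
        vst1q_u8(dst, vld1q_u8(src));
        vst1q_u8(dst + 16, vld1q_u8(src + 16));
        dst += 32;
        src += 32;
        len -= 32;
    }
    while (len--)
        *dst++ = *src++;
}

Whether something like this actually beats glibc's memcpy will depend
on the typical copy sizes and alignment on your platform, so please
treat it as a starting point for measurement rather than a drop-in
replacement.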