> -----Original Message-----
> From: Maxime Coquelin [mailto:maxime.coquelin at redhat.com]
> Sent: Thursday, November 3, 2016 4:11 PM
> To: Wang, Zhihong <zhihong.wang at intel.com>; Yuanhan Liu
> <yuanhan.liu at linux.intel.com>
> Cc: stephen at networkplumber.org; Pierre Pfister (ppfister)
> <ppfister at cisco.com>; Xie, Huawei <huawei.xie at intel.com>; dev at dpdk.org;
> vkaplans at redhat.com; mst at redhat.com
> Subject: Re: [dpdk-dev] [PATCH v4] vhost: Add indirect descriptors support
> to the TX path
>
> On 11/02/2016 11:51 AM, Maxime Coquelin wrote:
> >
> > On 10/31/2016 11:01 AM, Wang, Zhihong wrote:
> >>
> >>> -----Original Message-----
> >>> From: Maxime Coquelin [mailto:maxime.coquelin at redhat.com]
> >>> Sent: Friday, October 28, 2016 3:42 PM
> >>> To: Wang, Zhihong <zhihong.wang at intel.com>; Yuanhan Liu
> >>> <yuanhan.liu at linux.intel.com>
> >>> Cc: stephen at networkplumber.org; Pierre Pfister (ppfister)
> >>> <ppfister at cisco.com>; Xie, Huawei <huawei.xie at intel.com>; dev at dpdk.org;
> >>> vkaplans at redhat.com; mst at redhat.com
> >>> Subject: Re: [dpdk-dev] [PATCH v4] vhost: Add indirect descriptors
> >>> support to the TX path
> >>>
> >>> On 10/28/2016 02:49 AM, Wang, Zhihong wrote:
> >>>>
> >>>>>> -----Original Message-----
> >>>>>> From: Yuanhan Liu [mailto:yuanhan.liu at linux.intel.com]
> >>>>>> Sent: Thursday, October 27, 2016 6:46 PM
> >>>>>> To: Maxime Coquelin <maxime.coquelin at redhat.com>
> >>>>>> Cc: Wang, Zhihong <zhihong.wang at intel.com>;
> >>>>>> stephen at networkplumber.org; Pierre Pfister (ppfister)
> >>>>>> <ppfister at cisco.com>; Xie, Huawei <huawei.xie at intel.com>; dev at dpdk.org;
> >>>>>> vkaplans at redhat.com; mst at redhat.com
> >>>>>> Subject: Re: [dpdk-dev] [PATCH v4] vhost: Add indirect descriptors
> >>>>>> support to the TX path
> >>>>>>
> >>>>>> On Thu, Oct 27, 2016 at 12:35:11PM +0200, Maxime Coquelin wrote:
> >>>>>>>>
> >>>>>>>> On 10/27/2016 12:33 PM, Yuanhan Liu wrote:
> >>>>>>>>>> On Thu, Oct 27, 2016 at 11:10:34AM +0200, Maxime Coquelin wrote:
> >>>>>>>>>>>> Hi Zhihong,
> >>>>>>>>>>>>
> >>>>>>>>>>>> On 10/27/2016 11:00 AM, Wang, Zhihong wrote:
> >>>>>>>>>>>>>> Hi Maxime,
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Seems the indirect desc feature is causing serious performance
> >>>>>>>>>>>>>> degradation on the Haswell platform, about a 20% drop for both
> >>>>>>>>>>>>>> mrg=on and mrg=off (--txqflags=0xf00, non-vector version),
> >>>>>>>>>>>>>> for both iofwd and macfwd.
> >>>>>>>>>>>> I tested PVP (with macswap on guest) and Txonly/Rxonly on an
> >>>>>>>>>>>> Ivy Bridge platform, and didn't face such a drop.
> >>>>>>>>>>
> >>>>>>>>>> I was actually wondering whether that may be the cause. I tested
> >>>>>>>>>> it with my IvyBridge server as well, and saw no drop.
> >>>>>>>>>>
> >>>>>>>>>> Maybe you should find a similar platform (Haswell) and have a
> >>>>>>>>>> try?
> >>>>>>>> Yes, that's why I asked Zhihong whether he could test Txonly in the
> >>>>>>>> guest to see if the issue is reproducible like this.
> >>>>>>
> >>>>>> I have no Haswell box, otherwise I could do a quick test for you.
> >>>>>> IIRC, he tried disabling the indirect_desc feature, and the
> >>>>>> performance recovered. So, it's likely that indirect_desc is the
> >>>>>> culprit here.
> >>>>>>
> >>>>>>>> It will be easier for me to find a Haswell machine if it does not
> >>>>>>>> have to be connected back to back to a HW/SW packet generator.
> >>>> In fact a simple loopback test will also do, without pktgen.
> >>>>
> >>>> Start testpmd in both host and guest, and do "start" in one
> >>>> and "start tx_first 32" in the other.
> >>>>
> >>>> Perf drop is about 24% in my test.
> >>>>
> >>>
> >>> Thanks, I never tried this test.
> >>> I managed to find a Haswell platform (Intel(R) Xeon(R) CPU E5-2699 v3
> >>> @ 2.30GHz), and can reproduce the problem with the loopback test you
> >>> mention. I see a performance drop of about 10% (8.94Mpps/8.08Mpps).
> >>> Out of curiosity, what are the numbers you get with your setup?
> >>
> >> Hi Maxime,
> >>
> >> Let's align our test case to RC2, mrg=on, loopback, on Haswell.
> >> My results below:
> >> 1. indirect=1: 5.26 Mpps
> >> 2. indirect=0: 6.54 Mpps
> >>
> >> It's about a 24% drop.
> > OK, so on my side, same setup on Haswell:
> > 1. indirect=1: 7.44 Mpps
> > 2. indirect=0: 8.18 Mpps
> >
> > Still a 10% drop in my case with mrg=on.
> >
> > The strange thing with both of our figures is that they are below what
> > I obtain with my SandyBridge machine. The SB cpu freq is 4% higher,
> > but that doesn't explain the gap between the measurements.
> >
> > I'm continuing the investigations on my side.
> > Maybe we should fix a deadline, and decide to disable indirect in the
> > Virtio PMD if the root cause is not identified/fixed by that point?
> >
> > Yuanhan, what do you think?
>
> I have done some measurements using perf, and now understand better
> what happens.
>
> With indirect descriptors, I can see a cache miss when fetching the
> descriptors in the indirect table. Actually, this is expected, so
> we prefetch the first desc as soon as possible, but still not soon
> enough to make it transparent.
> In the direct descriptors case, the desc in the virtqueue seems to
> remain in the cache from its previous use, so we have a hit.
>
> That said, in a realistic use-case, I think we should not have a hit,
> even with direct descriptors.
> Indeed, the test case uses testpmd on the guest side with forwarding
> set to IO mode. That means the packet content is never accessed by the
> guest.
>
> In my experiments, I usually set the "macswap" forwarding mode, which
> swaps the src and dest MAC addresses in the packet. I find it more
> realistic, because I don't see the point in sending packets to the
> guest if they are never accessed (not even their headers).
>
> I tried the test case again, this time setting the forwarding mode to
> macswap in the guest. This time, I get the same performance with both
> direct and indirect (indirect even a little better, with a small
> optimization consisting of systematically prefetching the first two
> descs, as we know they are contiguous).
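To make the prefetch idea above concrete, here is a minimal sketch. It
assumes the 16-byte virtio descriptor layout from the virtio spec and
DPDK's rte_prefetch0() helper; the function name and its placement in
the lib/librte_vhost dequeue path are illustrative, not the actual
patch under discussion:

    #include <stdint.h>
    #include <rte_prefetch.h>

    /* Virtio descriptor layout (16 bytes each), per the virtio spec. */
    struct vring_desc {
            uint64_t addr;   /* guest-physical buffer address */
            uint32_t len;    /* buffer length */
            uint16_t flags;  /* VRING_DESC_F_NEXT / _WRITE / _INDIRECT */
            uint16_t next;   /* index of the next descriptor in the chain */
    };

    /*
     * Illustrative sketch: prefetch the first two entries of an
     * indirect descriptor table before walking it. The entries are
     * contiguous in memory, so two prefetches pull in the start of
     * the table and hide part of the cache miss seen with perf when
     * fetching the indirect descriptors.
     */
    static inline void
    prefetch_indirect_descs(const struct vring_desc *idescs, uint32_t nr)
    {
            rte_prefetch0(&idescs[0]);
            if (nr > 1)
                    rte_prefetch0(&idescs[1]);
    }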
Hi Maxime,

I did some more macswap testing and found out a few more things:

1. I ran the loopback test on another HSW machine with the same H/W,
   and indirect_desc on and off seem to have close perf.

2. So I checked the gcc versions:

   * Previous: gcc version 6.2.1 20160916 (Fedora 24)
   * New: gcc version 5.4.0 20160609 (Ubuntu 16.04.1 LTS)

   On the previous one, indirect_desc has a 20% drop.

3. Then I compiled the binary on Ubuntu and scp'd it to Fedora, and as
   expected I got the same perf as on Ubuntu: the perf gap disappeared,
   so gcc is definitely one factor here.

4. Then I used the Ubuntu binary on Fedora for the PVP test; the perf
   gap came back, and the results match the Fedora binary's:
   indirect_desc causes about a 20% drop.

So all in all, could you try PVP traffic on HSW to see how it works?

>
> Do you agree we should assume that the packet (header and/or buf) will
> always be accessed by the guest application?
> If so, do you agree we should keep indirect descs enabled, and maybe
> update the test cases?

I agree with you that the mac/macswap test is more realistic and makes
more sense for real applications.

Thanks
Zhihong

>
> Thanks,
> Maxime
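For reference, the macswap forwarding mode discussed throughout this
thread amounts to swapping the Ethernet source and destination
addresses of each packet before retransmitting it, which forces the
guest to actually touch the header. A minimal sketch against the DPDK
16.11-era API, simplified from what testpmd's macswap engine does and
not the exact testpmd code:

    #include <rte_ether.h>
    #include <rte_mbuf.h>

    /*
     * Swap the src/dst MAC addresses of one packet in place. Unlike
     * pure iofwd, this accesses the packet header, so the buffer is
     * brought into the forwarding core's cache, which is the
     * "realistic" behaviour the benchmark discussion above is about.
     */
    static inline void
    macswap_one(struct rte_mbuf *m)
    {
            struct ether_hdr *eth = rte_pktmbuf_mtod(m, struct ether_hdr *);
            struct ether_addr tmp;

            ether_addr_copy(&eth->d_addr, &tmp);
            ether_addr_copy(&eth->s_addr, &eth->d_addr);
            ether_addr_copy(&tmp, &eth->s_addr);
    }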