On 10/23/2017 09:57 AM, Wei Xu wrote:
> On Wed, Oct 18, 2017 at 04:17:51PM -0400, Matthew Rosato wrote:
>> On 10/12/2017 02:31 PM, Wei Xu wrote:
>>> On Thu, Oct 05, 2017 at 04:07:45PM -0400, Matthew Rosato wrote:
>>>>
>>>> Ping... Jason, any other ideas or suggestions?
>>>
>>> Hi Matthew,
>>> Recently I have been running a similar test on x86 for this patch; here
>>> are some differences between our testbeds.
>>>
>>> 1. It is nice you have got an improvement with 50+ instances (or
>>> connections here?), which would be quite helpful for addressing the
>>> issue, and you've also figured out the cost (wait/wakeup).  Kind
>>> reminder: did you pin the uperf client/server along the whole path,
>>> besides the vhost and vcpu threads?
>>
>> Was not previously doing any pinning whatsoever, just reproducing an
>> environment that one of our testers here was running.  Reducing guest
>> vcpu count from 4->1, still see the regression.  Then, pinned each vcpu
>> thread and vhost thread to a separate host CPU -- still made no
>> difference (regression still present).
>>
>>> 2. It might be useful to shorten the traffic path as a reference.  What
>>> I am running is briefly like:
>>> pktgen(host kernel) -> tap(x) -> guest(DPDK testpmd)
>>>
>>> In my personal experience the bridge driver (br_forward(), etc.) might
>>> impact performance, so eventually I settled on this simplified testbed,
>>> which fully isolates the traffic from both userspace and the host kernel
>>> stack (1 and 50 instances, bridge driver, etc.) and therefore reduces
>>> potential interference.
>>>
>>> The downside of this is that it needs DPDK support in the guest; has
>>> this ever been run on an s390x guest?  An alternative approach is to
>>> directly run XDP drop on the virtio-net nic in the guest, though this
>>> requires compiling XDP inside the guest, which needs a newer distro
>>> (Fedora 25+ in my case, or Ubuntu 16.10, not sure).
>>
>> I made an attempt at DPDK, but it has not been run on s390x as far as
>> I'm aware and didn't seem trivial to get working.
>>
>> So instead I took your alternate suggestion & did:
>> pktgen(host) -> tap(x) -> guest(xdp_drop)
>
> It is really nice of you to have tried this.  I also tried it on x86 with
> two Ubuntu 16.04 guests, but unfortunately I couldn't reproduce it either,
> though I did get lower throughput with 50 instances than with one instance
> (1-4 vcpus); is this the same on s390x?
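(For reference, since xdp_drop came up above: the guest side there is just an
XDP program that returns XDP_DROP for every packet arriving on the virtio-net
interface.  A minimal sketch -- not necessarily the exact source I built, and
assuming a clang/BPF toolchain is available in the guest -- looks like this:)

    /* xdp_drop.c: drop every incoming frame at the driver level */
    #include <linux/bpf.h>

    #define SEC(name) __attribute__((section(name), used))

    SEC("xdp")
    int xdp_drop_prog(struct xdp_md *ctx)
    {
            return XDP_DROP;    /* discard before the guest stack sees it */
    }

    char _license[] SEC("license") = "GPL";

(Built with clang -O2 -target bpf and attached with something along the lines
of 'ip link set dev <guest-nic> xdp obj xdp_drop.o sec xdp', with the
interface name depending on the guest.)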
For me, the total throughput is higher for 50 instances than for 1 instance
when the host kernel is 4.13.  However, when running a 50-instance uperf load
I cannot reproduce the regression either.  Throughput is a little bit better
when the host is 4.13 vs 4.12 for a 50-instance run.

>
>> When running this setup, I am not able to reproduce the regression.  As
>> mentioned previously, I am also unable to reproduce when running one end
>> of the uperf connection from the host - I have only ever been able to
>> reproduce when both ends of the uperf connection are running within a
>> guest.
>
> Did you see an improvement when running uperf from the host, if there was
> no regression?
>
> It would be pretty nice to run pktgen from the VM as Jason suggested in
> another mail (pktgen(vm1) -> tap1 -> bridge -> tap2 -> vm2); this is super
> close to your original test case and can help determine whether we can get
> some clue about tcp or the bridge driver.
>
> Also I am interested in your hardware platform: how many NUMA nodes do you
> have?  What about your binding (vcpu/vhost/pktgen)?  For my case, I have a
> server with 4 NUMA nodes and 12 cpus per socket, and I am explicitly
> launching qemu from cpu0, then binding vhost (Rx/Tx) to cpus 2 & 3, with
> vcpus starting from cpu 4 (3 vcpus each).

I'm running in an LPAR on a z13.  The particular LPAR I am using to reproduce
has 20 CPUs and 40G of memory assigned, all in 1 NUMA node.

I was initially recreating an issue uncovered by someone else's test, and
thus was doing no cpu binding -- but I have since tried binding the vhost and
vcpu threads to individual host CPUs, and it seemed to have no impact on the
noted regression.  When doing said binding, I did: qemu-guestA -> cpu0 (or
0-3 when running 4 vcpus), qemu-guestA-vhost -> cpu4, qemu-guestB -> cpu8 (or
8-11 when running 4 vcpus), qemu-guestB-vhost -> cpu12.
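(In case it helps when comparing notes on the binding: the pinning above was
applied per thread id, to the qemu vcpu threads and the vhost kernel threads.
The mechanism boils down to sched_setaffinity() with a single-CPU mask, i.e.
what taskset does on a tid.  A rough, illustrative helper -- not the exact
commands I ran:)

    /* pin_task.c: pin one thread id to one host CPU (taskset equivalent) */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/types.h>

    int main(int argc, char **argv)
    {
            if (argc != 3) {
                    fprintf(stderr, "usage: %s <tid> <cpu>\n", argv[0]);
                    return 1;
            }
            pid_t tid = atoi(argv[1]);
            int cpu = atoi(argv[2]);

            cpu_set_t set;
            CPU_ZERO(&set);
            CPU_SET(cpu, &set);

            /* e.g. guestA vcpu thread -> cpu0, guestA vhost thread -> cpu4 */
            if (sched_setaffinity(tid, sizeof(set), &set)) {
                    perror("sched_setaffinity");
                    return 1;
            }
            return 0;
    }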