On Sat, Jan 23, 2016 at 09:33:32PM -0800, Luigi Rizzo wrote:
> On Sat, Jan 23, 2016 at 8:28 PM, Marcus Cenzatti <[email protected]> wrote:
> >
> > On 1/24/2016 at 1:10 AM, "Luigi Rizzo" <[email protected]> wrote:
> >>
> >> Thanks for re-running the experiments.
> >>
> >> I am changing the subject so that in the archives it is clear
> >> that the Chelsio card works fine.
> >>
> >> Overall, the tests confirm that whenever you hit the host stack you
> >> are bound to the poor performance of the latter. The problem does not
> >> appear when using Intel as a receiver because on the Intel card netmap
> >> mode disables the host stack.
> >>
> >> More comments on the experiments:
> >>
> >> The only meaningful test is the one where you use the DMAC of the
> >> ncxl0 port:
> >>
> >>   SENDER: ./pkt-gen -i ix0 -f tx -S 00:07:e9:44:d2:ba -D 00:07:43:33:8d:c1
> >>
> >> In the other experiment you transmit broadcast frames and hit the
> >> network stack. ARP etc. do not matter since tx and rx are directly
> >> connected.
> >>
> >> On the receiver you do not need to specify addresses:
> >>
> >>   RECEIVER: ./pkt-gen -i ncxl0 -f rx
> >>
> >> The numbers in netstat are clearly rounded, so 15M is probably 14.88M
> >> (line rate), and the 3.7M that you see correctly represents the
> >> difference between incoming and received packets.
> >>
> >> The fact that you see drops may be related to the NIC being unable to
> >> replenish the queue fast enough, which in turn may be a hardware or a
> >> software (netmap) issue.
> >> You may try experimenting with shorter batches on the receive side
> >> (say, -b 64 or less) and see if you have better results.
> >>
> >> A short batch replenishes the rx queue more frequently, but it is
> >> not a conclusive experiment because there is an optimization in
> >> the netmap poll code which, as an unintended side effect, replenishes
> >> the queue less often than it should.
> >> For a conclusive experiment you should grab the netmap code from
> >> github.com/luigirizzo/netmap and use pkt-gen-b, which uses busy wait
> >> and works around the poll "optimization".
> >>
> >> thanks again for investigating the issue.
> >>
> >> cheers
> >> luigi
> >>
> >
> > so as a summary: with an IP test on the Intel card, netmap disables the
> > host stack, while on Chelsio netmap does not disable the host stack and
> > we get things injected into the host, so the only reliable test is
> > MAC-based when using Chelsio cards?
> >
> > yes, I am already running github's netmap code, let's try with the
> > busy-wait code:
> ...
> > chelsio# ./pkt-gen-b -i ncxl0 -f rx
> > 785.659290 main [1930] interface is ncxl0
> > 785.659337 main [2050] running on 1 cpus (have 4)
> > 785.659477 extract_ip_range [367] range is 10.0.0.1:0 to 10.0.0.1:0
> > 785.659496 extract_ip_range [367] range is 10.1.0.1:0 to 10.1.0.1:0
> > 785.718707 main [2148] mapped 334980KB at 0x801800000
> > Receiving from netmap:ncxl0: 2 queues, 1 threads and 1 cpus.
> > 785.718784 main [2235] Wait 2 secs for phy reset
> > 787.729197 main [2237] Ready...
> > 787.729449 receiver_body [1412] reading from netmap:ncxl0 fd 3 main_fd 3
> > 788.730089 main_thread [1720] 11.159 Mpps (11.166 Mpkts 5.360 Gbps in 1000673 usec) 205.89 avg_batch 0 min_space
> > 789.730588 main_thread [1720] 11.164 Mpps (11.169 Mpkts 5.361 Gbps in 1000500 usec) 183.54 avg_batch 0 min_space
> > 790.734224 main_thread [1720] 11.172 Mpps (11.213 Mpkts 5.382 Gbps in 1003636 usec) 198.84 avg_batch 0 min_space
> > ^C791.140853 sigint_h [404] received control-C on thread 0x801406800
> > 791.742841 main_thread [1720] 4.504 Mpps (4.542 Mpkts 2.180 Gbps in 1008617 usec) 179.62 avg_batch 0 min_space
> > Received 38091031 packets 2285461860 bytes 196774 events 60 bytes each in 3.41 seconds.
> > Speed: 11.166 Mpps Bandwidth: 5.360 Gbps (raw 7.504 Gbps). Average batch: 193.58 pkts
> >
> > chelsio# ./pkt-gen-b -b 64 -i ncxl0 -f rx
> > 522.430459 main [1930] interface is ncxl0
> > 522.430507 main [2050] running on 1 cpus (have 4)
> > 522.430644 extract_ip_range [367] range is 10.0.0.1:0 to 10.0.0.1:0
> > 522.430662 extract_ip_range [367] range is 10.1.0.1:0 to 10.1.0.1:0
> > 522.677743 main [2148] mapped 334980KB at 0x801800000
> > Receiving from netmap:ncxl0: 2 queues, 1 threads and 1 cpus.
> > 522.677822 main [2235] Wait 2 secs for phy reset
> > 524.698114 main [2237] Ready...
> > 524.698373 receiver_body [1412] reading from netmap:ncxl0 fd 3 main_fd 3
> > 525.699118 main_thread [1720] 10.958 Mpps (10.966 Mpkts 5.264 Gbps in 1000765 usec) 61.84 avg_batch 0 min_space
> > 526.700108 main_thread [1720] 11.086 Mpps (11.097 Mpkts 5.327 Gbps in 1000991 usec) 61.06 avg_batch 0 min_space
> > 527.705650 main_thread [1720] 11.166 Mpps (11.227 Mpkts 5.389 Gbps in 1005542 usec) 61.91 avg_batch 0 min_space
> > 528.707113 main_thread [1720] 11.090 Mpps (11.107 Mpkts 5.331 Gbps in 1001463 usec) 61.34 avg_batch 0 min_space
> > 529.707617 main_thread [1720] 10.847 Mpps (10.853 Mpkts 5.209 Gbps in 1000504 usec) 62.51 avg_batch 0 min_space
> > ^C530.556309 sigint_h [404] received control-C on thread 0x801406800
> > 530.709133 main_thread [1720] 9.166 Mpps (9.180 Mpkts 4.406 Gbps in 1001516 usec) 62.92 avg_batch 0 min_space
> > Received 64430028 packets 3865801680 bytes 1041000 events 60 bytes each in 5.86 seconds.
> > Speed: 10.999 Mpps Bandwidth: 5.279 Gbps (raw 7.391 Gbps). Average batch: 61.89 pkts
> ...
> > so, the lower the batch, the lower the performance.
> >
> > did you expect some other behaviour?
>
> for very small batches, yes.
> For larger batch sizes I was hoping that refilling the ring more often
> could reduce losses.
>
> One last attempt: try using -l 64 on the sender; this will generate 64+4 byte
> packets, which may become just 64 bytes on the receiver if the Chelsio is
> configured to strip the CRC. This should result in well-aligned PCIe
> transactions and reduced PCIe traffic, which may help (the ix driver has a
> similar problem, but since it does not strip the CRC it can rx at line rate
> with 60 bytes but not with 64).
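
(For reference: 14.88 Mpps is the 10 GbE line rate for minimum-size frames,
since 64 B of frame plus 20 B of preamble and inter-frame gap is 672 bits and
10^10 / 672 ~= 14.88 Mpps; the ~11.2 Mpps received above therefore leaves
roughly 3.7 Mpps dropped, matching the netstat figure Luigi mentions. As a
sketch of his -l 64 suggestion, reusing the MAC addresses from the earlier
test, the sender side would look something like:

    SENDER: ./pkt-gen -i ix0 -f tx -l 64 -S 00:07:e9:44:d2:ba -D 00:07:43:33:8d:c1

i.e. 64-byte frames plus the 4-byte CRC on the wire, which the receiver may
strip before DMA.)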
Keep hw.cxgbe.fl_pktshift in mind for these kinds of tests. The default value
is 2, so the chip DMAs the payload at an offset of 2B from the start of the rx
buffer. So you'll need to adjust your frame size by 2 (66B on the wire, 62B
after the CRC is removed, making it exactly 64B across PCIe if pktshift is 2)
or just set hw.cxgbe.fl_pktshift=0 in /boot/loader.conf.

Regards,
Navdeep
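
(A concrete version of the two options above, as a sketch only: the -l value
assumes pkt-gen's -l is the frame length excluding CRC, as in Luigi's -l 64
example, and the MAC addresses are the ones used earlier in the thread.

    # Option 1: keep the default fl_pktshift=2 and shrink the frame by 2 bytes.
    # 62B + 4B CRC = 66B on the wire, 62B after CRC strip; with the 2B shift the
    # packet ends exactly on a 64B boundary in the rx buffer.
    SENDER: ./pkt-gen -i ix0 -f tx -l 62 -S 00:07:e9:44:d2:ba -D 00:07:43:33:8d:c1

    # Option 2: disable the 2-byte payload shift. This is a loader tunable, so
    # add the following line to /boot/loader.conf and reboot.
    hw.cxgbe.fl_pktshift=0
)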
