On Sat, Jan 23, 2016 at 09:33:32PM -0800, Luigi Rizzo wrote:
> On Sat, Jan 23, 2016 at 8:28 PM, Marcus Cenzatti <[email protected]> wrote:
> >
> > On 1/24/2016 at 1:10 AM, "Luigi Rizzo" <[email protected]> wrote:
> >>
> >> Thanks for re-running the experiments.
> >>
> >> I am changing the subject so that in the archives it is clear
> >> that the Chelsio card works fine.
> >>
> >> Overall, the tests confirm that whenever you hit the host stack you
> >> are bound to the poor performance of the latter. The problem does not
> >> appear when using Intel as a receiver because on the Intel card netmap
> >> mode disables the host stack.
> >>
> >> More comments on the experiments:
> >>
> >> The only meaningful test is the one where you use the DMAC of the
> >> ncxl0 port:
> >>
> >>   SENDER: ./pkt-gen -i ix0 -f tx -S 00:07:e9:44:d2:ba -D 00:07:43:33:8d:c1
> >>
> >> In the other experiment you transmit broadcast frames and hit the
> >> network stack. ARP etc. do not matter since tx and rx are directly
> >> connected.
> >>
> >> On the receiver you do not need to specify addresses:
> >>
> >>   RECEIVER: ./pkt-gen -i ncxl0 -f rx
> >>
> >> The numbers in netstat are clearly rounded, so 15M is probably 14.88M
> >> (line rate), and the 3.7M that you see correctly represents the
> >> difference between incoming and received packets.
> >>
> >> The fact that you see drops may be related to the NIC being unable to
> >> replenish the queue fast enough, which in turn may be a hardware or a
> >> software (netmap) issue.
> >> You may try experimenting with shorter batches on the receive side
> >> (say, -b 64 or less) and see if you have better results.
> >>
> >> A short batch replenishes the rx queue more frequently, but it is
> >> not a conclusive experiment because there is an optimization in
> >> the netmap poll code which, as an unintended side effect, replenishes
> >> the queue less often than it should.
> >> For a conclusive experiment you should grab the netmap code from
> >> github.com/luigirizzo/netmap and use pkt-gen-b, which uses busy wait
> >> and works around the poll "optimization".
> >>
> >> thanks again for investigating the issue.
> >>
> >> cheers
> >> luigi
> >>
> >
> > so as a summary: with an IP test on the Intel card, netmap disables the
> > host stack, while on Chelsio netmap does not disable the host stack and
> > we get things injected into the host, so the only reliable test is
> > MAC-based when using Chelsio cards?
> >
> > yes, I am already running github's netmap code, let's try with the
> > busy-wait code:
> ...
> > chelsio# ./pkt-gen-b -i ncxl0 -f rx
> > 785.659290 main [1930] interface is ncxl0
> > 785.659337 main [2050] running on 1 cpus (have 4)
> > 785.659477 extract_ip_range [367] range is 10.0.0.1:0 to 10.0.0.1:0
> > 785.659496 extract_ip_range [367] range is 10.1.0.1:0 to 10.1.0.1:0
> > 785.718707 main [2148] mapped 334980KB at 0x801800000
> > Receiving from netmap:ncxl0: 2 queues, 1 threads and 1 cpus.
> > 785.718784 main [2235] Wait 2 secs for phy reset
> > 787.729197 main [2237] Ready...
> > 787.729449 receiver_body [1412] reading from netmap:ncxl0 fd 3 main_fd 3
> > 788.730089 main_thread [1720] 11.159 Mpps (11.166 Mpkts 5.360 Gbps in 1000673 usec) 205.89 avg_batch 0 min_space
> > 789.730588 main_thread [1720] 11.164 Mpps (11.169 Mpkts 5.361 Gbps in 1000500 usec) 183.54 avg_batch 0 min_space
> > 790.734224 main_thread [1720] 11.172 Mpps (11.213 Mpkts 5.382 Gbps in 1003636 usec) 198.84 avg_batch 0 min_space
> > ^C791.140853 sigint_h [404] received control-C on thread 0x801406800
> > 791.742841 main_thread [1720] 4.504 Mpps (4.542 Mpkts 2.180 Gbps in 1008617 usec) 179.62 avg_batch 0 min_space
> > Received 38091031 packets 2285461860 bytes 196774 events 60 bytes each in 3.41 seconds.
> > Speed: 11.166 Mpps Bandwidth: 5.360 Gbps (raw 7.504 Gbps). Average batch: 193.58 pkts
> >
> > chelsio# ./pkt-gen-b -b 64 -i ncxl0 -f rx
> > 522.430459 main [1930] interface is ncxl0
> > 522.430507 main [2050] running on 1 cpus (have 4)
> > 522.430644 extract_ip_range [367] range is 10.0.0.1:0 to 10.0.0.1:0
> > 522.430662 extract_ip_range [367] range is 10.1.0.1:0 to 10.1.0.1:0
> > 522.677743 main [2148] mapped 334980KB at 0x801800000
> > Receiving from netmap:ncxl0: 2 queues, 1 threads and 1 cpus.
> > 522.677822 main [2235] Wait 2 secs for phy reset
> > 524.698114 main [2237] Ready...
> > 524.698373 receiver_body [1412] reading from netmap:ncxl0 fd 3 main_fd 3
> > 525.699118 main_thread [1720] 10.958 Mpps (10.966 Mpkts 5.264 Gbps in 1000765 usec) 61.84 avg_batch 0 min_space
> > 526.700108 main_thread [1720] 11.086 Mpps (11.097 Mpkts 5.327 Gbps in 1000991 usec) 61.06 avg_batch 0 min_space
> > 527.705650 main_thread [1720] 11.166 Mpps (11.227 Mpkts 5.389 Gbps in 1005542 usec) 61.91 avg_batch 0 min_space
> > 528.707113 main_thread [1720] 11.090 Mpps (11.107 Mpkts 5.331 Gbps in 1001463 usec) 61.34 avg_batch 0 min_space
> > 529.707617 main_thread [1720] 10.847 Mpps (10.853 Mpkts 5.209 Gbps in 1000504 usec) 62.51 avg_batch 0 min_space
> > ^C530.556309 sigint_h [404] received control-C on thread 0x801406800
> > 530.709133 main_thread [1720] 9.166 Mpps (9.180 Mpkts 4.406 Gbps in 1001516 usec) 62.92 avg_batch 0 min_space
> > Received 64430028 packets 3865801680 bytes 1041000 events 60 bytes each in 5.86 seconds.
> > Speed: 10.999 Mpps Bandwidth: 5.279 Gbps (raw 7.391 Gbps). Average batch: 61.89 pkts
> ...
> > so, the lower the batch, the lower the performance.
> >
> > did you expect some other behaviour?
>
> for very small batches, yes.
> For larger batch sizes I was hoping that refilling the ring more often
> could reduce losses.
>
> One last attempt: try using -l 64 on the sender; this will generate 64+4 byte
> packets, which may become just 64 bytes on the receiver if the Chelsio is
> configured to strip the CRC. This should result in well-aligned PCIe
> transactions and reduced PCIe traffic, which may help (the ix driver has a
> similar problem, but since it does not strip the CRC it can rx at line rate
> with 60 bytes but not with 64).
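
(For reference: 14.88 Mpps is the 10 GbE line rate for minimum-size frames,
since 64 B of frame plus 20 B of preamble and inter-frame gap is 672 bits and
10^10 / 672 ~= 14.88 Mpps; the ~11.2 Mpps received above therefore leaves
roughly 3.7 Mpps dropped, matching the netstat figure Luigi mentions. As a
sketch of his -l 64 suggestion, reusing the MAC addresses from the earlier
test, the sender side would look something like:

    SENDER: ./pkt-gen -i ix0 -f tx -l 64 -S 00:07:e9:44:d2:ba -D 00:07:43:33:8d:c1

i.e. 64-byte frames plus the 4-byte CRC on the wire, which the receiver may
strip before DMA.)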
Keep hw.cxgbe.fl_pktshift in mind for these kinds of tests. The default value
is 2, so the chip DMAs the payload at an offset of 2B from the start of the rx
buffer. So you'll need to adjust your frame size by 2 (66B on the wire, 62B
after the CRC is removed, making it exactly 64B across PCIe if pktshift is 2)
or just set hw.cxgbe.fl_pktshift=0 in /boot/loader.conf.

Regards,
Navdeep
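
(A concrete version of the two options above, as a sketch only: the -l value
assumes pkt-gen's -l is the frame length excluding CRC, as in Luigi's -l 64
example, and the MAC addresses are the ones used earlier in the thread.

    # Option 1: keep the default fl_pktshift=2 and shrink the frame by 2 bytes.
    # 62B + 4B CRC = 66B on the wire, 62B after CRC strip; with the 2B shift the
    # packet ends exactly on a 64B boundary in the rx buffer.
    SENDER: ./pkt-gen -i ix0 -f tx -l 62 -S 00:07:e9:44:d2:ba -D 00:07:43:33:8d:c1

    # Option 2: disable the 2-byte payload shift. This is a loader tunable, so
    # add the following line to /boot/loader.conf and reboot.
    hw.cxgbe.fl_pktshift=0
)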
