Hi Harry,
Thanks for your feedback!
> > > > On Wed, May 06, 2020 at 02:06:09PM +0100, Harry van Haaren wrote:
> > > > > This commit adds an AVX-512 dpcls lookup implementation.
> > > > > It uses the AVX-512 SIMD ISA to perform multiple miniflow
> > > > > operations in parallel.
>
> <snip lots of code/patch contents for readability>
>
> > Hi Harry,
> >
> > I managed to find a machine with avx512 in google cloud and did some
> > performance testing. I saw lower performance when enabling avx512,
> > I believe I did something wrong. Do you mind having a look:
> >
<snip>
> >
> > 3) start ovs and set avx and traffic gen
> > ovs-appctl dpif-netdev/subtable-lookup-set avx512_gather 5
> > ovs-vsctl add-port br0 tg0 -- set int tg0 type=dpdk
> > options:dpdk-devargs=vdev:net_pcap0,rx_pcap=/root/ovs/p0.pcap,infinite_rx=1
>
> The output of the first command (enabling the AVX512 lookup) posts some
> output at INFO log level; please ensure it's there?
>
> 2020-05-20T09:39:09Z|00262|dpif_netdev_lookup|INFO|Subtable function
> 'avx512_gather' set priority to 4
> 2020-05-20T09:39:09Z|00006|dpif_netdev(pmd-c15/id:99)|INFO|reprobing sub
> func, 5 1
>
Yes, I got these INFO logs.
ovs-vsctl -- add-br br0 -- set Bridge br0 datapath_type=netdev
ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=0x3
ovs-vsctl --no-wait set Open_vSwitch . other_config:dpdk-init=true
ovs-ofctl add-flow br0 'actions=drop'
ovs-appctl dpif-netdev/subtable-lookup-set avx512_gather 5
ovs-vsctl add-port br0 tg0 -- set int tg0 type=dpdk \
options:dpdk-devargs=vdev:net_pcap0,rx_pcap=/root/ovs/p0.pcap,infinite_rx=1
LOG:
2020-05-20T13:49:26.542Z|00047|dpdk|INFO|DPDK Enabled - initialized
2020-05-20T13:49:26.544Z|00048|connmgr|INFO|br0<->unix#2: 1 flow_mods
in the last 0 s (1 adds)
2020-05-20T13:49:26.547Z|00049|dpif_netdev_lookup|INFO|Subtable
function 'avx512_gather' set priority to 5
2020-05-20T13:49:26.553Z|00050|netdev_dpdk|INFO|Device
'vdev:net_pcap0,rx_pcap=/root/ovs/p0.pcap,infinite_rx=1' attached to
DPDK
2020-05-20T13:49:26.553Z|00051|dpif_netdev|INFO|PMD thread on numa_id:
0, core id: 0 created.
2020-05-20T13:49:26.554Z|00052|dpif_netdev|INFO|PMD thread on numa_id:
0, core id: 1 created.
2020-05-20T13:49:26.554Z|00053|dpif_netdev|INFO|There are 2 pmd
threads on numa node 0
2020-05-20T13:49:26.554Z|00054|dpdk|INFO|Device with port_id=0 already stopped
2020-05-20T13:49:26.648Z|00055|netdev_dpdk|WARN|Rx checksum offload is
not supported on port 0
2020-05-20T13:49:26.648Z|00056|netdev_dpdk|WARN|Interface tg0 does not
support MTU configuration, max packet size supported is 1500.
2020-05-20T13:49:26.648Z|00057|netdev_dpdk|INFO|Port 0: 02:70:63:61:70:00
2020-05-20T13:49:26.648Z|00058|dpif_netdev|INFO|Core 0 on numa node 0
assigned port 'tg0' rx queue 0 (measured processing cycles 0).
2020-05-20T13:49:26.648Z|00059|bridge|INFO|bridge br0: added interface
tg0 on port 1
2020-05-20T13:49:26.648Z|00001|ofproto_dpif_upcall(pmd-c00/id:9)|WARN|upcall_cb
failure: ukey installation fails
2020-05-20T13:49:27.562Z|00002|dpif_netdev(pmd-c00/id:9)|INFO|reprobing
sub func, 4 1
>
> > 4) dp flows with miniflow info
<snip>
> It seems the "packets:0, bytes:0,used:never" tags indicate that there is no
> traffic hitting these rules at all?
> Output here (with traffic running for a while) shows:
> packets:621588543, bytes:37295312580, used:0.000s, dp:ovs, actions:dpdk1,
> dp-extra-info:miniflow_bits(4,1)
>
Thanks, these are the rules being hit:
root@instance-3:~/ovs# ovs-appctl dpctl/dump-flows -m | grep -v never
flow-dump from pmd on cpu core: 0
ufid:f06998a0-9ff8-4ee5-b12f-5d7e2fcc7f0f,
skb_priority(0/0),skb_mark(0/0),ct_state(0/0),ct_zone(0/0),ct_mark(0/0),ct_label(0/0),recirc_id(0),dp_hash(0/0),in_port(tg0),packet_type(ns=0,id=0),eth(src=42:01:0a:b6:00:01/00:00:00:00:00:00,dst=42:01:0a:b6:00:02/00:00:00:00:00:00),eth_type(0x0800),ipv4(src=169.254.169.254/0.0.0.0,dst=10.182.0.2/0.0.0.0,proto=6/0,tos=0/0,ttl=64/0,frag=no),tcp(src=80/0,dst=51642/0),tcp_flags(0/0),
packets:3942096, bytes:2511115152, used:0.001s, flags:P., dp:ovs,
actions:drop, dp-extra-info:miniflow_bits(4,1)
ufid:cb3a6eac-3a7d-4e0d-a145-414dd482b4b9,
skb_priority(0/0),skb_mark(0/0),ct_state(0/0),ct_zone(0/0),ct_mark(0/0),ct_label(0/0),recirc_id(0),dp_hash(0/0),in_port(tg0),packet_type(ns=0,id=0),eth(src=42:01:0a:b6:00:01/00:00:00:00:00:00,dst=42:01:0a:b6:00:02/00:00:00:00:00:00),eth_type(0x0800),ipv4(src=169.254.169.254/0.0.0.0,dst=10.182.0.2/0.0.0.0,proto=6/0,tos=0/0,ttl=64/0,frag=no),tcp(src=80/0,dst=51650/0),tcp_flags(0/0),
packets:2779552, bytes:172332224, used:0.000s, flags:S., dp:ovs,
actions:drop, dp-extra-info:miniflow_bits(4,1)
ufid:781f3f48-ffd7-424f-ae99-62158ba05cbd,
skb_priority(0/0),skb_mark(0/0),ct_state(0/0),ct_zone(0/0),ct_mark(0/0),ct_label(0/0),recirc_id(0),dp_hash(0/0),in_port(tg0),packet_type(ns=0,id=0),eth(src=42:01:0a:b6:00:02/00:00:00:00:00:00,dst=42:01:0a:b6:00:01/00:00:00:00:00:00),eth_type(0x0800),ipv4(src=10.182.0.2/0.0.0.0,dst=169.254.169.254/0.0.0.0,proto=6/0,tos=0/0,ttl=64/0,frag=no),tcp(src=51650/0,dst=80/0),tcp_flags(0/0),
packets:637373, bytes:216706820, used:0.000s, flags:P., dp:ovs,
actions:drop, dp-extra-info:miniflow_bits(4,1)
>
> > 5) pmd-stat-show
> > root@instance-3:~/ovs# ovs-appctl dpif-netdev/pmd-stats-show
> > pmd thread numa_id 0 core_id 0:
> > packets received: 19838528
> > packet recirculations: 0
> > avg. datapath passes per packet: 1.00
> > emc hits: 0
> > smc hits: 0
> > megaflow hits: 0
> > avg. subtable lookups per megaflow hit: 0.00 (---> this doesn't
> > look right ....)
> > miss with success upcall: 78
> > miss with failed upcall: 19838418
> > avg. packets per output batch: 2.00
> > idle cycles: 0 (0.00%)
> > processing cycles: 103431787838 (100.00%)
> > avg cycles per packet: 5213.68 (103431787838/19838528)
> > avg processing cycles per packet: 5213.68 (103431787838/19838528)
>
> Would you try the pmd-stats-show command before setting the AVX512 lookup?
> If the issue is still present, it would indicate it's not related to the
> exact lookup implementation.
Before setting AVX512:
### Scalar Lookup
pmd thread numa_id 0 core_id 0:
packets received: 77470176
packet recirculations: 0
avg. datapath passes per packet: 1.00
emc hits: 70423947
smc hits: 0
megaflow hits: 7045897
avg. subtable lookups per megaflow hit: 1.00
miss with success upcall: 1
miss with failed upcall: 331
avg. packets per output batch: 0.00
idle cycles: 0 (0.00%)
processing cycles: 19596627706 (100.00%)
avg cycles per packet: 252.96 (19596627706/77470176)
avg processing cycles per packet: 252.96 (19596627706/77470176)
### AVX512 Lookup (restart ovs-vswitchd, then run "ovs-appctl
dpif-netdev/subtable-lookup-set avx512_gather 5")
pmd thread numa_id 0 core_id 0:
packets received: 1178784
packet recirculations: 0
avg. datapath passes per packet: 1.00
emc hits: 0
smc hits: 0
megaflow hits: 0
avg. subtable lookups per megaflow hit: 0.00
miss with success upcall: 13
miss with failed upcall: 1178739 ---> this doesn't look right
avg. packets per output batch: 0.00
idle cycles: 0 (0.00%)
processing cycles: 5408870500 (100.00%)
avg cycles per packet: 4588.52 (5408870500/1178784)
avg processing cycles per packet: 4588.52 (5408870500/1178784)
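As a side note, for an apples-to-apples comparison between the two lookup
implementations it may help to reset the PMD counters once traffic is in
steady state and sample over a fixed window; a minimal sketch, assuming
ovs-appctl is on the PATH and the daemon is running:

```shell
# Reset PMD counters after traffic reaches steady state, then sample
# over a fixed window so both runs cover comparable intervals.
ovs-appctl dpif-netdev/pmd-stats-clear
sleep 30
ovs-appctl dpif-netdev/pmd-stats-show
```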
>
>
> > 6) gdb output also doesn't look right... I didn't see any avx512 instructions
> > (gdb) b avx512_lookup_impl
> > Breakpoint 2 at 0x55e92342a8df: avx512_lookup_impl. (4 locations)
> > Dump of assembler code for function dpcls_avx512_gather_skx_mf_5_1:
> > 96 const uint64_t tbl_u0 = subtable->mask.mf.map.bits[0];
> > 0x000055e92342a8df <+31>: mov 0x30(%rdi),%r8
> > 97 const uint64_t tbl_u1 = subtable->mask.mf.map.bits[1];
> > 0x000055e92342a8e3 <+35>: mov 0x38(%rdi),%r9
> > 98 ovs_assert(__builtin_popcountll(tbl_u0) == bit_count_u0);
> <snip some ASM>
>
> (gdb) disas dpcls_avx512_gather_skx_mf_5_1
> <snip preamble>
> 0x0000555556103f34 <+724>: vmovdqu64 0x28(%rdi),%zmm2{%k1}{z}
> 0x0000555556103f3e <+734>: vmovdqu64 0x18(%rcx),%zmm0{%k1}{z}
> 0x0000555556103f48 <+744>: vpandd %zmm0,%zmm1,%zmm0
> 0x0000555556103f4e <+750>: vpcmpeqq %zmm2,%zmm0,%k7{%k1}
>
> Disassembly here shows AVX512 register usage here, as expected.
OK, I tried
(gdb) disas dpcls_avx512_gather_skx_mf_5_1
and it works for me. I can see avx512 instructions.
Setting a breakpoint at run time also works:
(gdb) b dpcls_avx512_gather_skx_mf_4_1
Thread 13 "pmd-c00/id:9" hit Breakpoint 1,
dpcls_avx512_gather_skx_mf_4_1 (subtable=0x7f732c008210,
keys_map=1, keys=0x7f733af2a798, rules=0x7f733af2a7a0) at
lib/dpif-netdev-lookup-avx512-gather.c:212
212 DECLARE_OPTIMIZED_LOOKUP_FUNCTION(4, 1)
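For a quicker, non-interactive check that AVX-512 code was actually
generated, the function can also be disassembled in batch mode; a sketch,
assuming the binary path from this setup and that debug symbols are present:

```shell
# Batch-mode disassembly; counting ZMM register references gives a
# quick signal that AVX-512 instructions were emitted. The binary
# path matches this setup and may differ elsewhere.
gdb --batch -ex 'disassemble dpcls_avx512_gather_skx_mf_4_1' \
    /usr/local/sbin/ovs-vswitchd | grep -c zmm
```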
Using perf record also shows:
3.11% pmd-c00/id:9 ovs-vswitchd [.] dpcls_avx512_gather_skx_mf_4_1
dpcls_avx512_gather_skx_mf_4_1 /usr/local/sbin/ovs-vswitchd [Percent:
local period]
0.48 │ lea -0x1(%rdi),%rax
│ _mm512_srli_epi64():
│ return (__m512i) __builtin_ia32_psrlqi512_mask ((__v8di) __A, __B,
│ vpsrlq $0x4,%zmm0,%zmm1
│ _mm512_shuffle_epi8():
0.80 │ vpandd %zmm3,%zmm0,%zmm0
│ avx512_lookup_impl():
│ ULLONG_FOR_EACH_1 (i, keys_map) {
0.32 │ and %rax,%rdi
│ _mm512_shuffle_epi8():
0.16 │ vpandd %zmm1,%zmm3,%zmm1
0.48 │ vpshufb %zmm0,%zmm4,%zmm0
0.80 │ vpshufb %zmm1,%zmm4,%zmm1
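For reference, a profile like the one above can be captured
non-interactively; a sketch, assuming perf is installed and ovs-vswitchd is
running:

```shell
# Sample the running vswitchd for 10 seconds, then print only the
# dpcls-related symbols from the text report.
perf record -p "$(pidof ovs-vswitchd)" -- sleep 10
perf report --stdio | grep dpcls
```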
>
> Note the "avx512_lookup_impl" is a static function in a .c file, so it is
> not visible outside the compilation unit. Further, it is also marked
> "ALWAYS_INLINE", so even inside the compilation unit, there isn't a symbol
> with that name. I'm surprised GDB let me set a breakpoint on it.
> Disassembling it doesn't work:
> (gdb) b avx512_lookup_impl
> Breakpoint 2 at 0x5555561035af: avx512_lookup_impl. (4 locations)
> (gdb) disas avx512_lookup_impl
> No symbol "avx512_lookup_impl" in current context.
>
> The functions it is inlined into are available for disassembly, as their
> symbols do exist in the binary. (Sidenote: going to add dpcls_ to the
> _any function for consistency in naming with the others):
> dpcls_avx512_gather_skx_mf_4_0
> dpcls_avx512_gather_skx_mf_4_1
> dpcls_avx512_gather_skx_mf_5_1
> avx512_gather_any
>
> Disassembling the _any version of the avx512 lookup function here
> shows the AVX512 instructions, using ZMM registers and {k} masks.
> (gdb) disas avx512_gather_mf_any
> Dump of assembler code for function avx512_gather_mf_any:
> 0x0000555556103fb0 <+0>: lea 0x8(%rsp),%r10
> 0x0000555556103fb5 <+5>: and $0xffffffffffffffc0,%rsp
> 0x0000555556103fb9 <+9>: pushq -0x8(%r10)
> <skipping preamble/pushes etc, to the fun AVX512 part>
> 0x00005555561040dd <+301>: vpandd %zmm0,%zmm5,%zmm0
> 0x00005555561040e3 <+307>: or %rdi,%rax
> 0x00005555561040e6 <+310>: test %r8,%r8
> 0x00005555561040e9 <+313>: kmovb %eax,%k4
> 0x00005555561040ed <+317>: vpsrlq $0x4,%zmm0,%zmm2
> 0x00005555561040f4 <+324>: vpandd %zmm3,%zmm0,%zmm0
> 0x00005555561040fa <+330>: vpandd %zmm2,%zmm3,%zmm2
> 0x0000555556104100 <+336>: vpshufb %zmm0,%zmm4,%zmm0
> 0x0000555556104106 <+342>: vpshufb %zmm2,%zmm4,%zmm2
> 0x000055555610410c <+348>: vpaddb %zmm2,%zmm0,%zmm0
> 0x0000555556104112 <+354>: vpsadbw %zmm7,%zmm0,%zmm0
> 0x0000555556104118 <+360>: vpaddq %zmm1,%zmm0,%zmm0
> 0x000055555610411e <+366>: vmovdqa64 %zmm8,%zmm1
> 0x0000555556104124 <+372>: vpgatherqq 0x18(%r9,%zmm0,8),%zmm1{%k3}
> 0x000055555610412c <+380>: vpandq %zmm6,%zmm1,%zmm0{%k4}{z}
>
> Would you try some of the above and see can it be reproduced?
Btw, I see OVS reprobing the subtable function every second:
2020-05-20T14:15:15.113Z|00373|dpif_netdev(pmd-c00/id:9)|INFO|reprobing
sub func, 4 1
2020-05-20T14:15:16.129Z|00374|dpif_netdev(pmd-c00/id:9)|INFO|reprobing
sub func, 4 1
2020-05-20T14:15:17.138Z|00375|dpif_netdev(pmd-c00/id:9)|INFO|reprobing
sub func, 4 1
2020-05-20T14:15:18.150Z|00376|dpif_netdev(pmd-c00/id:9)|INFO|reprobing
sub func, 4 1
2020-05-20T14:15:19.170Z|00377|dpif_netdev(pmd-c00/id:9)|INFO|reprobing
sub func, 4 1
2020-05-20T14:15:20.184Z|00378|dpif_netdev(pmd-c00/id:9)|INFO|reprobing
sub func, 4 1
2020-05-20T14:15:21.219Z|00379|dpif_netdev(pmd-c00/id:9)|INFO|reprobing
sub func, 4 1
Regards,
William
_______________________________________________
dev mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-dev