On 3/26/24 03:30, Chris Mi wrote:
> This patch set adds offload support for sFlow.
> 
> Psample is a genetlink channel for packet sampling. TC action act_sample
> uses psample to send sampled packets to userspace.
> 
> When offloading sample action to TC, userspace creates a unique ID to
> map sFlow action and tunnel info and passes this ID to kernel instead
> of the sFlow info. psample will send this ID and sampled packet to
> userspace. Using the ID, userspace can recover the sFlow info and send
> sampled packet to the right sFlow monitoring host.

Hi, Chris, others.

I've been looking through the set and psample in general for the past
few days.  There are a few fairly minor issues in the set, like a UBSan
crash caused by an incorrect OVS_RETURNS_NONNULL annotation and a few
style and formatting issues, which are not hard to fix.  However, I
encountered an issue with the way tunnel metadata is recovered that I'm
not sure we can actually fix.

The algorithm implemented in this patch set is described in the cover
letter above.  The key point is that when packet-1 goes through a MISS
upcall, we install a new datapath flow into TC and remember the tunnel
metadata of packet-1, associated with a sample ID.  When packet-2 hits
the datapath flow (tc actions), it gets sent to psample, and ovs-vswitchd
reads this packet from the psample netlink socket and presents it as an
ACTION upcall.  While doing that, it finds the tunnel metadata using the
sample ID and uses it as the tunnel metadata for packet-2.  The crux of
the problem is that this is metadata from packet-1, and it can be
different from the metadata of packet-2.

An example of this would be two separate TCP streams going through the
same VxLAN tunnel.  For load balancing reasons, most UDP tunnels choose
the source port of the outer UDP header based on RSS or a 5-tuple hash
of the inner packet.  This means that two TCP streams going through the
same tunnel will have different UDP source ports in the outer header.
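As a rough illustration (a sketch only, not the actual kernel flow-hash
code; the hash function and the port range are assumptions):

```python
# Sketch: a UDP tunnel derives the outer source port from a hash of the
# inner 5-tuple (CRC32 here is a stand-in for the kernel's flow hash).
import zlib

def outer_udp_src_port(five_tuple, port_min=32768, port_max=60999):
    """Map an inner 5-tuple into the ephemeral source port range."""
    h = zlib.crc32(repr(five_tuple).encode())
    return port_min + h % (port_max - port_min + 1)

# Two TCP streams through the same tunnel, differing only in the inner
# source port, hash differently and so get different outer ports:
stream1 = ("10.1.1.1", 40001, "10.1.1.100", 1234, "tcp")
stream2 = ("10.1.1.1", 40002, "10.1.1.100", 1234, "tcp")
print(outer_udp_src_port(stream1), outer_udp_src_port(stream2))
```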

When a packet from the first stream triggers a MISS upcall, a datapath
flow will be installed and the sample ID will be associated with the
metadata of the first stream.  Chances are this datapath flow will not
have an exact match on the source port of the tunnel metadata.  That
means that a packet from the second stream will successfully match the
installed datapath flow and will go to psample with the ID of the first
stream.  OVS will read it from the psample socket and send it to the
sFlow collector with the metadata it found by the ID, i.e. with tunnel
metadata that contains the tunnel source port of the first TCP stream,
which is incorrect.
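In pseudo-code form, the mismatch looks roughly like this (a minimal
sketch with made-up names such as install_flow/on_psample; the real
path goes through TC and the psample netlink socket):

```python
# Sketch of the metadata recovery problem.  All names here are
# hypothetical; they only model the behavior described above.
sample_id_map = {}  # sample ID -> tunnel metadata remembered by userspace

def install_flow(sample_id, tunnel_metadata):
    # MISS upcall for packet-1: remember packet-1's tunnel metadata.
    # The installed flow wildcards the outer source port, so packets
    # from other streams will match it too.
    sample_id_map[sample_id] = tunnel_metadata

def on_psample(sample_id):
    # ACTION upcall path: metadata is recovered by ID, not taken from
    # the sampled packet itself.
    return sample_id_map[sample_id]

install_flow(42, {"tunnel_src_port": 50000})  # packet-1, stream 1

# packet-2 (stream 2, actual outer source port 51111) matches the same
# flow and is sampled with the same ID:
recovered = on_psample(42)
print(recovered["tunnel_src_port"])  # 50000 -- stream 1's port, not 51111
```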

The test below reproduces the issue.  It's not a real test, it always
fails on purpose and it's not very pretty, but what it does is create a
tunnel, start a netcat server on one side, and send two files to it
from the other side.  These two transfers are separate TCP connections,
so they use different source ports, which causes the tunnel to choose
different UDP source ports for them.  After the test failure we can see
in sflow.log that all the values of 'tunnel4_in_src_port' are the same
for both streams and for some exchanges prior to that.  However, if
offload is disabled, the values are different, as they should be.

So, unless we can ensure that packets from different streams do not
match the same datapath flow, this scheme doesn't work.

The only way to ensure that packets from different streams within the
same tunnel do not match the same datapath flow is to always have an
exact match on the whole tunnel metadata.  But I don't think that is a
good idea, because that way we'd have a separate datapath flow per TCP
stream, which would be very bad for performance.  And it would be bad
for hardware offload, since NICs normally have a more limited amount of
available resources.
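A toy count shows the scale of the problem (assumed numbers, for
illustration only):

```python
# Toy comparison of datapath flow counts with and without an exact match
# on the outer UDP source port.  1000 concurrent TCP streams assumed.
import zlib

streams = [40000 + i for i in range(1000)]  # inner TCP source ports

def outer_src_port(inner_port):
    # Stand-in for the tunnel's 5-tuple hash.
    return 32768 + zlib.crc32(str(inner_port).encode()) % 28232

# Wildcarded tunnel source port: all streams share one datapath flow.
wildcarded_flows = {("10.1.1.1", "10.1.1.100") for _ in streams}

# Exact match on the whole tunnel metadata: roughly one flow per stream.
exact_flows = {("10.1.1.1", "10.1.1.100", outer_src_port(p)) for p in streams}

print(len(wildcarded_flows), len(exact_flows))
```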

This leads us to the conclusion that only non-tunneled traffic can be
offloaded this way.  For this to work, we'd have to add an explicit
match on the tunnel metadata being empty (can that be expressed in TC?).

But at this point the question arises whether this is even worth
implementing, taking into account that it requires a lot of
infrastructure changes throughout all the layers of OVS code just for
sampling of non-tunneled packets, i.e. a very narrow use case.

Also, my understanding was that there is a plan to offload other types
of userspace() actions in the future, such as the controller() action,
using the same psample infrastructure.  But that will not be possible
for the same reason.  In addition to the tunnel metadata, we would also
have to provide connection tracking information, which is even harder,
because conntrack state is more dynamic and is only actually known to
the datapath.  psample can't supply this information to userspace.

Thoughts?

Best regards, Ilya Maximets.

P.S.  The test that reproduces the issue:
---
diff --git a/tests/system-offloads-traffic.at b/tests/system-offloads-traffic.at
index ee4817e8e..51abc2782 100644
--- a/tests/system-offloads-traffic.at
+++ b/tests/system-offloads-traffic.at
@@ -100,6 +100,102 @@ OVS_TRAFFIC_VSWITCHD_STOP
 AT_CLEANUP
 
 
+AT_SETUP([offloads - two TCP streams over vxlan with sFlow - offloads enabled qwe])
+OVS_CHECK_TUNNEL_TSO()
+OVS_CHECK_VXLAN()
+OVS_LOAD_MODULE([psample])
+
+OVS_TRAFFIC_VSWITCHD_START([], [], [-- set Open_vSwitch . other_config:hw-offload=true])
+ADD_BR([br-underlay])
+
+AT_CHECK([ovs-ofctl add-flow br0 "actions=normal"])
+AT_CHECK([ovs-ofctl add-flow br-underlay "actions=normal"])
+
+ADD_NAMESPACES(at_ns0)
+
+dnl Set up underlay link from host into the namespace using veth pair.
+ADD_VETH(p0, at_ns0, br-underlay, "172.31.1.1/24")
+AT_CHECK([ip addr add dev br-underlay "172.31.1.100/24"])
+AT_CHECK([ip link set dev br-underlay up])
+
+dnl Set up tunnel endpoints on OVS outside the namespace and with a native
+dnl linux device inside the namespace.
+ADD_OVS_TUNNEL([vxlan], [br0], [at_vxlan0], [172.31.1.1], [10.1.1.100/24])
+ADD_NATIVE_TUNNEL([vxlan], [at_vxlan1], [at_ns0], [172.31.1.100], [10.1.1.1/24],
+                  [id 0 dstport 4789])
+
+on_exit 'kill $(cat test-sflow.pid)'
+AT_CHECK([ovstest test-sflow --log-file --detach --no-chdir --pidfile \
+                             0:127.0.0.1 > sflow.log], [0], [], [ignore])
+AT_CAPTURE_FILE([sflow.log])
+PARSE_LISTENING_PORT([test-sflow.log], [SFLOW_PORT])
+AT_CHECK([ovs-vsctl -- \
+    --id=@sflow create sflow agent=lo target=\"127.0.0.1:$SFLOW_PORT\" \
+                             header=128 sampling=1 polling=100 -- \
+    set bridge br0 sflow=@sflow], [0], [ignore])
+
+ovs-appctl vlog/set dpif:file:dbg
+ovs-appctl vlog/disable-rate-limit dpif
+ovs-appctl vlog/set dpif_netlink:file:dbg
+ovs-appctl vlog/set ofproto_dpif:file:dbg
+ovs-appctl vlog/set tc:file:dbg
+ovs-appctl vlog/set netdev_offload_tc:file:dbg
+
+dnl First, check the underlay.
+NS_CHECK_EXEC([at_ns0], [ping -q -c 3 -i 0.3 -W 2 172.31.1.100 | FORMAT_PING], [0], [dnl
+3 packets transmitted, 3 received, 0% packet loss, time 0ms
+])
+
+dnl Okay, now check the overlay.
+NS_CHECK_EXEC([at_ns0], [ping -q -c 3 -i 0.1 -W 2 10.1.1.100 | FORMAT_PING], [0], [dnl
+3 packets transmitted, 3 received, 0% packet loss, time 0ms
+])
+
+sleep 2
+OVS_APP_EXIT_AND_WAIT([test-sflow])
+echo "================================= 1 ========================" >> sflow.log
+AT_CHECK([ovstest test-sflow --log-file --detach --no-chdir --pidfile \
+                             ${SFLOW_PORT}:127.0.0.1 >> sflow.log], [0], [], [ignore])
+
+echo "================================= 1 ========================" >> ovs-vswitchd.log
+dnl Start ncat listener.
+OVS_DAEMONIZE([nc -k -l 10.1.1.100 1234 > tcp_data], [nc.pid])
+
+dnl Verify that ncat is ready.
+OVS_WAIT_UNTIL([netstat -ln | grep :1234])
+
+dnl Check large TCP.
+AT_CHECK([dd if=/dev/urandom of=payload.bin bs=60000 count=1 2> /dev/null])
+NS_CHECK_EXEC([at_ns0], [nc $NC_EOF_OPT 10.1.1.100 1234 < payload.bin])
+OVS_WAIT_UNTIL([diff -q payload.bin tcp_data])
+
+sleep 2
+OVS_APP_EXIT_AND_WAIT([test-sflow])
+echo "================================= 2 ========================" >> sflow.log
+AT_CHECK([ovstest test-sflow --log-file --detach --no-chdir --pidfile \
+                             ${SFLOW_PORT}:127.0.0.1 >> sflow.log], [0], [], [ignore])
+
+
+echo "================================= 2 ========================" >> ovs-vswitchd.log
+dnl Verify that ncat is ready.
+OVS_WAIT_UNTIL([netstat -ln | grep :1234])
+cp payload.bin payload0.bin
+dnl Check large TCP.
+AT_CHECK([dd if=/dev/urandom of=payload.bin bs=60000 count=1 2> /dev/null])
+NS_CHECK_EXEC([at_ns0], [nc $NC_EOF_OPT 10.1.1.100 1234 < payload.bin])
+cat payload.bin >> payload0.bin
+OVS_WAIT_UNTIL([diff -q payload0.bin tcp_data])
+
+sleep 2
+OVS_APP_EXIT_AND_WAIT([test-sflow])
+
+echo "================================= Done======================" >> sflow.log
+
+OVS_TRAFFIC_VSWITCHD_STOP
+AT_FAIL_IF([:])
+AT_CLEANUP
+
+
 AT_SETUP([offloads - sflow with sampling=1 - offloads enabled])
 OVS_LOAD_MODULE([psample])
 OVS_TRAFFIC_VSWITCHD_START([], [], [-- set Open_vSwitch . other_config:hw-offload=true])
---
_______________________________________________
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev
