On 9 Sep 2021, at 4:35, Chris Mi wrote:

> On 9/8/2021 9:57 PM, Eelco Chaudron wrote:
>>
>> On 8 Sep 2021, at 13:52, Chris Mi wrote:
>>
>>> On 9/6/2021 5:47 PM, Eelco Chaudron wrote:
>>>> On 6 Sep 2021, at 11:14, Chris Mi wrote:
>>>>
>>>>> On 9/3/2021 8:54 PM, Eelco Chaudron wrote:
>>>>>> On 3 Sep 2021, at 14:02, Eelco Chaudron wrote:
>>>>>>
>>>>>>       On 15 Jul 2021, at 8:01, Chris Mi wrote:
>>>>>>
>>>>>>           This patch set adds offload support for sFlow.
>>>>>>
>>>>>>           Psample is a genetlink channel for packet sampling. TC action
>>>>>>           act_sample
>>>>>>           uses psample to send sampled packets to userspace.
>>>>>>
>>>>>>           When offloading sample action to TC, userspace creates a
>>>>>>           unique ID to
>>>>>>           map sFlow action and tunnel info and passes this ID to kernel
>>>>>>           instead
>>>>>>           of the sFlow info. psample will send this ID and sampled 
>>>>>> packet to
>>>>>>           userspace. Using the ID, userspace can recover the sFlow info
>>>>>>           and send
>>>>>>           sampled packet to the right sFlow monitoring host.
>>>>>>
>>>>>>       Hi Chris,
>>>>>>
>>>>>>       One thing missing from this patchset is a test case. I think we
>>>>>>       should add it, as I’m going over this manually every patch 
>>>>>> iteration.
>>>>>>
>>>>>>       I would add the following:
>>>>>>
>>>>>>        *
>>>>>>
>>>>>>           Set the sample rate to 1, send 100 packets and make sure you
>>>>>>           receive all of them
>>>>>>
>>>>>>             o Also, verify the output of “ovs-appctl dpctl/dump-flows
>>>>>>               system@ovs-system type=tc” is correct, i.e., matches the
>>>>>>               kernel output
>>>>>>        *
>>>>>>
>>>>>>           Set the sample rate to 10, send 100 packets and make sure you
>>>>>>           receive 10.
>>>>>>
>>>>>>             o Also, verify the output of “ovs-appctl dpctl/dump-flows
>>>>>>               system@ovs-system type=tc” is correct, i.e., matches the
>>>>>>               kernel output
>>>>>>
>>>>>>       Cheers,
>>>>>>
>>>>>>       Eelco
>>>>>>
>>>>>>       PS: I also had a problem where only one packet got sent to the
>>>>>>       collector, and then no more packets would arrive. Of course, when
>>>>>>       I added some debug code, it never happened, and when removing the
>>>>>>       debug code, it also worked fine :( Did you ever experience
>>>>>>       something like this? I will play a bit more when reviewing
>>>>>>       specific code, maybe it will happen again.
>>>>>>
>>>>>>
>>>>>> One additional thing I’m seeing is the following:
>>>>>>
>>>>>> $ ovs-appctl dpctl/dump-flows system@ovs-system type=tc
>>>>>> recirc_id(0),in_port(3),eth(src=52:54:00:88:51:38,dst=04:f4:bc:28:57:01),eth_type(0x0800),ipv4(frag=no),
>>>>>>  packets:7144, bytes:600096, used:7.430s, 
>>>>>> actions:sample(sample=10.0%,actions(userspace(pid=3925927229,sFlow(vid=0,pcp=0,output=2147483650),actions))),2
>>>>>>
>>>>>> Sometimes I get a rather large sFlow(output=) value in the sFlow output. 
>>>>>> Converting the above to hex, 0x80000002, where I see this in 
>>>>>> fix_sflow_action() as being mentioned multiple output ports? This seems 
>>>>>> wrong to me as it should be 3 (as it always is in the none hw offload 
>>>>>> case). This odd value is also reported to the sFlow samples.
>>>>>>
>>>>>> Unfortunately, I do not have a reproducer for this.
>>>>>>
>>>>> Actually, in the very beginning of the development when I have a lot of 
>>>>> debug code, I also observed such odd port number sometimes.
>>>>> Such rule will disappear soon. So it is hard to test such corner case. 
>>>>> And it doesn't affect the functionality.
>>>>> So current code didn't deal with such corner case.
>>>> I’m getting this almost every time I restart OVS and initiate a flow 
>>>> (using a simple ping). This does not go away by itself, it stays until the 
>>>> flow times out. So this should be investigated and fixed.
>>>>
>>>> My test setup is rather simple, just two ports, one is a VM one is a 
>>>> external host, which I ping from the VM. All default config, so using the 
>>>> NORMAL rule. This is my sFlow config:
>>>>
>>>>    ovs-vsctl -- --id=@sflow create sflow agent=lo \
>>>>         target="127.0.0.1" header=128 \
>>>>         sampling=10 polling=4 \
>>>>         -- set bridge ovs_pvp_br0 sflow=@sflow
>>>>
>>> When writing the test, I can reproduce it. Not sure what changed. But it 
>>> happens even without offload. I don't know what need to be fixed.
>>> OVS generates the DP rule and installs it. If offload is enabled, 
>>> sFlow(vid=0,pcp=0,output=2147483650) (actually the whole action) is saved in
>>> dpif_sflow_attr->action when installing the DP rule. When receiving sampled 
>>> packets, tc or driver passes the gid to ovs daemon and
>>> the daemon will find the dpif_sflow_attr->action using the gid. TC didn't 
>>> change it. It is up to OVS sflow engine to process it.
>> Nice! I did not see it in the none-offload case, but if you do, it might not 
>> be related to your patchset and we should not worry about it for now :)
>> I’m assuming that you do get the 2147483650 value from the nlmsg attributes 
>> passed to the offload handler.
>>
> Yes, I got 2147483650. But give a second look, my rule is a little different:
>
> recirc_id(0),in_port(3),eth(src=26:f2:ed:17:7d:eb,dst=33:33:00:00:00:16),eth_type(0x86dd),ipv6(frag=no),
>  packets:0, bytes:0, used:0.000s, 
> actions:userspace(pid=4294967295,sFlow(vid=0,pcp=0,output=2147483650),actions),1,2
>
> or
>
> recirc_id(0),in_port(enp8s0f0_1),eth(src=02:25:d0:10:01:02,dst=33:33:00:00:00:02),eth_type(0x86dd),ipv6(frag=no),
>  packets:2, bytes:140, used:1.666s, 
> actions:userspace(pid=4294967295,sFlow(vid=0,pcp=0,output=2147483650),actions),br,enp8s0f0_2
>
> It seems the above rule is expected because the output is multiple ports. But 
> your rule only outputs to 1 port. So to be accurate, I haven't reproduced the 
> issue you hit.
> The suspicious code is dpif_sflow_attr_equal(). In the early version, I 
> compared the whole dpif_sflow_attr, but now I only compare the ufid. Maybe 
> the ufid is duplicate.
> But it should not happen. Anyway, I will keep an eye on this issue. Thanks 
> for your carefulness.

Here is some debug info, it seems related to an update to an existing handle, 
which does not update the tc part:

Here the flow gets added:

2021-09-09T14:01:12.278Z|00001|netdev_offload_tc(handler1)|ERR|EC_DBG: Enter 
b13915aa-c7a8-4574-8dcf-439f03929b8e.
2021-09-09T14:01:12.278Z|00002|netdev_offload_tc(handler1)|ERR|EC_DEBUG: port 
-2147483646
2021-09-09T14:01:12.285Z|00003|netdev_offload_tc(handler1)|ERR|EC_DBG: EXIT ok 
b13915aa-c7a8-4574-8dcf-439f03929b8e.


Here the revalidator updates it with new port information:

2021-09-09T14:01:12.537Z|00001|netdev_offload_tc(revalidator13)|ERR|EC_DBG: 
Enter b13915aa-c7a8-4574-8dcf-439f03929b8e.
2021-09-09T14:01:12.537Z|00002|netdev_offload_tc(revalidator13)|ERR|EC_DEBUG: 
port 3
2021-09-09T14:01:12.537Z|00004|netdev_offload_tc(revalidator13)|DBG|updating 
old handle: 1 prio: 2
2021-09-09T14:01:12.553Z|00005|netdev_offload_tc(revalidator13)|ERR|EC_DBG: 
EXIT ok b13915aa-c7a8-4574-8dcf-439f03929b8e.

Here is a dp dump showing the wrong/old port:

$ ovs-appctl dpctl/dump-flows -m system@ovs-system type=tc
ufid:b13915aa-c7a8-4574-8dcf-439f03929b8e, 
skb_priority(0/0),skb_mark(0/0),ct_state(0/0),ct_zone(0/0),ct_mark(0/0),ct_label(0/0),recirc_id(0),dp_hash(0/0),in_port(vnet4),packet_type(ns=0/0,id=0/0),eth(src=52:54:00:88:51:38,dst=04:f4:bc:28:57:01),eth_type(0x0800),ipv4(src=0.0.0.0/0.0.0.0,dst=0.0.0.0/0.0.0.0,proto=0/0,tos=0/0,ttl=0/0,frag=no),
 packets:180, bytes:15120, used:0.230s, dp:tc, 
actions:sample(sample=10.0%,actions(userspace(pid=2771969963,sFlow(vid=0,pcp=0,output=2147483650),actions))),enp5s0f0


Guess this is the area you need to look at del_filter_and_ufid_mapping():

2389     if (get_ufid_tc_mapping(ufid, &id) == 0) {
2390         VLOG_DBG_RL(&rl, "updating old handle: %d prio: %d",
2391                     id.handle, id.prio);
2392         info->tc_modify_flow_deleted = !del_filter_and_ufid_mapping(&id, 
ufid);
2393     }


There might also be a problem in the sgid mapping as it’s also using the uuid 
as the hash, and there could be a collision during the update phase.


I’m working on some escalations so, please take a look and see if you can fix 
this.
I need to restart OVS, and then send a ping, and I see the problem once, I need 
to restart OVS each time to see it.

I’ll wait for v15 to see how you fixed it ;)

_______________________________________________
dev mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-dev

Reply via email to