On 3/8/25 21:03, Aaron Conole wrote: > Ilya Maximets <i.maxim...@ovn.org> writes: > >> The actions length check is unreliable and produces different results >> depending on the initial length of the provided netlink attribute and >> the composition of the actual actions inside of it. For example, a >> user can add 4088 empty clone() actions without triggering -EMSGSIZE, >> on attempt to add 4089 such actions the operation will fail with the >> -EMSGSIZE verdict. However, if another 16 KB of other actions will >> be *appended* to the previous 4089 clone() actions, the check passes >> and the flow is successfully installed into the openvswitch datapath. >> >> The reason for a such a weird behavior is the way memory is allocated. >> When ovs_flow_cmd_new() is invoked, it calls ovs_nla_copy_actions(), >> that in turn calls nla_alloc_flow_actions() with either the actual >> length of the user-provided actions or the MAX_ACTIONS_BUFSIZE. The >> function adds the size of the sw_flow_actions structure and then the >> actually allocated memory is rounded up to the closest power of two. >> >> So, if the user-provided actions are larger than MAX_ACTIONS_BUFSIZE, >> then MAX_ACTIONS_BUFSIZE + sizeof(*sfa) rounded up is 32K + 24 -> 64K. >> Later, while copying individual actions, we look at ksize(), which is >> 64K, so this way the MAX_ACTIONS_BUFSIZE check is not actually >> triggered and the user can easily allocate almost 64 KB of actions. >> >> However, when the initial size is less than MAX_ACTIONS_BUFSIZE, but >> the actions contain ones that require size increase while copying >> (such as clone() or sample()), then the limit check will be performed >> during the reserve_sfa_size() and the user will not be allowed to >> create actions that yield more than 32 KB internally. >> >> This is one part of the problem. The other part is that it's not >> actually possible for the userspace application to know beforehand >> if the particular set of actions will be rejected or not. >> >> Certain actions require more space in the internal representation, >> e.g. an empty clone() takes 4 bytes in the action list passed in by >> the user, but it takes 12 bytes in the internal representation due >> to an extra nested attribute, and some actions require less space in >> the internal representations, e.g. set(tunnel(..)) normally takes >> 64+ bytes in the action list provided by the user, but only needs to >> store a single pointer in the internal implementation, since all the >> data is stored in the tunnel_info structure instead. >> >> And the action size limit is applied to the internal representation, >> not to the action list passed by the user. So, it's not possible for >> the userpsace application to predict if the certain combination of >> actions will be rejected or not, because it is not possible for it to >> calculate how much space these actions will take in the internal >> representation without knowing kernel internals. >> >> All that is causing random failures in ovs-vswitchd in userspace and >> inability to handle certain traffic patterns as a result. For example, >> it is reported that adding a bit more than a 1100 VMs in an OpenStack >> setup breaks the network due to OVS not being able to handle ARP >> traffic anymore in some cases (it tries to install a proper datapath >> flow, but the kernel rejects it with -EMSGSIZE, even though the action >> list isn't actually that large.) >> >> Kernel behavior must be consistent and predictable in order for the >> userspace application to use it in a reasonable way. ovs-vswitchd has >> a mechanism to re-direct parts of the traffic and partially handle it >> in userspace if the required action list is oversized, but that doesn't >> work properly if we can't actually tell if the action list is oversized >> or not. >> >> Solution for this is to check the size of the user-provided actions >> instead of the internal representation. This commit just removes the >> check from the internal part because there is already an implicit size >> check imposed by the netlink protocol. The attribute can't be larger >> than 64 KB. Realistically, we could reduce the limit to 32 KB, but >> we'll be risking to break some existing setups that rely on the fact >> that it's possible to create nearly 64 KB action lists today. >> >> Vast majority of flows in real setups are below 100-ish bytes. So >> removal of the limit will not change real memory consumption on the >> system. The absolutely worst case scenario is if someone adds a flow >> with 64 KB of empty clone() actions. That will yield a 192 KB in the >> internal representation consuming 256 KB block of memory. However, >> that list of actions is not meaningful and also a no-op. Real world >> very large action lists (that can occur for a rare cases of BUM >> traffic handling) are unlikely to contain a large number of clones and >> will likely have a lot of tunnel attributes making the internal >> representation comparable in size to the original action list. >> So, it should be fine to just remove the limit. >> >> Commit in the 'Fixes' tag is the first one that introduced the >> difference between internal representation and the user-provided action >> lists, but there were many more afterwards that lead to the situation >> we have today. >> >> Fixes: 7d5437c709de ("openvswitch: Add tunneling interface.") >> Signed-off-by: Ilya Maximets <i.maxim...@ovn.org> >> --- > > Thanks for the detailed explanation. Do you think it's useful to > check with selftest: > > # python3 ./ovs-dpctl.py add-dp flbr > # python3 ./ovs-dpctl.py add-flow flbr \ > "in_port(0),eth(),eth_type(0x806),arp()" \ > $(echo 'print("clone(),"*4089)' | python3) > > I think a limit test is probably a good thing to have anyway (although > after this commit we will rely on netlink limits).
I had a similar thought, but as you said, this commit will remove the limit so we'll not really be testing OVS code at this point. So, I thought it may be better to not include such a test for easier backporting to older kernels (given the Fixes tag goes far back). But I agree that it's a good thing in general to have tests that cover maximum size cases, so maybe we can add something like this to net-next instead, once the fix is accepted? Note: 4089 is too small for such a test, it should be somewhere around 16K. > > > Reviewed-by: Aaron Conole <acon...@redhat.com> > Thanks for review! Best regards, Ilya Maximets. _______________________________________________ dev mailing list d...@openvswitch.org https://mail.openvswitch.org/mailman/listinfo/ovs-dev