On 3/8/24 15:24, Ilya Maximets wrote:
On 3/7/24 22:29, Ilya Maximets wrote:
On 3/7/24 21:59, Adrian Moreno wrote:


On 3/7/24 17:54, Ilya Maximets wrote:
On 3/7/24 16:18, Adrian Moreno wrote:
** Background **
Currently, OVS supports several packet sampling mechanisms (sFlow,
per-bridge IPFIX, per-flow IPFIX). These end up being translated into
a userspace action that needs to be handled by ovs-vswitchd's handler
threads, only to be forwarded to some third-party application that
will somehow process the sample and provide observability into the
datapath.

The fact that sampled traffic shares netlink sockets and
handler-thread time with upcalls, apart from being a performance
bottleneck in the sample extraction itself, can severely compromise
the datapath, rendering this solution unfit for highly loaded
production systems.

Users are left with few options other than guessing what sampling
rate will be acceptable for their traffic pattern and system load,
and dealing with the lost accuracy.

** Proposal **
In this RFC, I'd like to request feedback on an attempt to fix this
situation by adding a flag to the userspace action to indicate that
the upcall should be sent to a netlink multicast group instead of
unicast to ovs-vswitchd.

This would allow for other processes to read samples directly,
freeing the netlink sockets and handler threads to process
packet upcalls.

** Notes on tc-offloading **
I am aware of the efforts being made to offload the sample action
with the help of psample. I did consider using psample to multicast
the samples. However, I found a limitation that I'd like to discuss:
I would like to support OVN-driven per-flow (IPFIX) sampling because
it allows OVN to insert two 32-bit values (obs_domain_id and
obs_point_id) that can be used to enrich the sample with "high
level" controller metadata (see the debug_drop_domain_id NBDB
option in ovn-nb(5)).

The existing fields in psample_metadata are not enough to
carry this information. Would it be possible to extend this
struct to make room for some extra "application-specific"
metadata?
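To illustrate what such an extension could look like, here is a rough
sketch of a metadata struct with room for an application-specific
cookie. The field names beyond the familiar trunc_size/ifindex ones
are invented for illustration; this is not the actual kernel
psample_metadata layout, just the shape of the idea:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical extension of psample's sample metadata: the existing
 * fields plus an opaque, application-defined cookie area that could
 * carry e.g. OVN's obs_domain_id and obs_point_id. */
struct psample_metadata_ext {
    uint32_t trunc_size;
    int in_ifindex;
    int out_ifindex;
    /* Invented fields for illustration only: */
    uint8_t user_cookie[16];
    uint8_t user_cookie_len;
};

/* Pack the two OVN 32-bit identifiers into the cookie area. */
static inline void set_ovn_ids(struct psample_metadata_ext *md,
                               uint32_t obs_domain_id,
                               uint32_t obs_point_id)
{
    memcpy(md->user_cookie, &obs_domain_id, sizeof obs_domain_id);
    memcpy(md->user_cookie + 4, &obs_point_id, sizeof obs_point_id);
    md->user_cookie_len = 8;
}
```

16 bytes would comfortably fit the two OVN identifiers while leaving
room for other users of the mechanism.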

** Alternatives **
An alternative approach that I'm considering (apart from using
psample, as explained above) is to use a brand-new action. This
would lead to a cleaner separation of concerns from the existing
userspace action (used for slow paths and OFP_CONTROLLER actions)
and cleaner statistics. Also, ovs-vswitchd could more easily make
the layout of this new userdata part of the public API, allowing
third-party sample collectors to decode it.

I am currently exploring this alternative but wanted to send
the RFC to get some early feedback, guidance or ideas.


Hi, Adrian.  Thanks for the patches!


Thanks for the quick feedback. Also adding Dumitru, whom I missed in
the original CC list.

Though I'm not sure if broadcasting is generally the best approach.
These messages contain opaque information that is not actually
parsable by any entity other than the process that created the
action. And I don't think the structure of these opaque fields
should become part of the uAPI of either the kernel or userspace
OVS.


I understand this can be cumbersome, especially given that the
opaque field is currently also used for some purely internal OVS
actions (e.g. CONTROLLER).

However, for features such as OVN-driven per-flow sampling, where
OVN-generated identifiers are placed in obs_domain_id and
obs_point_id, it would be _really_ useful if this opaque value
could be somehow decoded by external programs.

Two ideas come to mind to try to alleviate the potential
maintainability issues:
- As I suggested, using a new action may make things easier. By
  splitting the current "user_action_cookie" in two, one for
  internal actions and one for "observability" actions, we could
  expose the latter in the OVS userspace API without having to
  expose the former.
- Exposing functions in OVS that decode the opaque value. Third-party
  applications could link against, say, libopenvswitch.so and use it
  to extract obs_{domain,point}_ids.
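To make the second idea concrete, a decode helper exported by such a
library could look roughly like this. The cookie layout below
(a struct carrying only the two OVN identifiers) is purely
illustrative and the function name is invented; the real OVS-internal
cookie layout is different:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical public layout of an "observability" cookie. */
struct flow_sample_cookie {
    uint32_t obs_domain_id;
    uint32_t obs_point_id;
};

/* Extract the OVN identifiers from the opaque userdata attached to
 * a sample.  Returns false if the userdata is too short to be a
 * cookie this library understands. */
static bool decode_sample_cookie(const void *userdata, size_t len,
                                 uint32_t *obs_domain_id,
                                 uint32_t *obs_point_id)
{
    struct flow_sample_cookie c;

    if (len < sizeof c) {
        return false;
    }
    memcpy(&c, userdata, sizeof c);
    *obs_domain_id = c.obs_domain_id;
    *obs_point_id = c.obs_point_id;
    return true;
}
```

A collector would call this on the userdata of each received sample
and fall back to treating the sample as anonymous if decoding fails.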

Linking with OVS libraries is practically the same as just exposing
the internal structure, because once the external application is
running it either must have the same library version as the process
that installs the action, or it may not be able to parse the
message.

Any form of exposing to an external application will freeze the
opaque arguments and effectively make them a form of uAPI.

The separate action with a defined uAPI solves this problem by just
creating a new uAPI, but I'm not sure why it is needed.


What do you think?

The userspace() action already has an OVS_USERSPACE_ATTR_PID
argument. And it is not actually used when
OVS_DP_F_DISPATCH_UPCALL_PER_CPU is enabled.  All known users of
OVS_DP_F_DISPATCH_UPCALL_PER_CPU set OVS_USERSPACE_ATTR_PID to
UINT32_MAX, which is not a pid that the kernel could generate.

So, with a minimal and pretty much backward-compatible change in the
output_userspace() function, we can honor OVS_USERSPACE_ATTR_PID if
it's not U32_MAX.  This way, a userspace process can open a separate
socket and configure sampling to redirect all packets there, while
normal MISS upcalls would still arrive at the per-cpu sockets.  This
should cover the performance concern.
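The suggested tweak boils down to one portid-selection check. A
simplified sketch (the function name and parameters are stand-ins
for the actual output_userspace() logic, not the real kernel code):

```c
#include <stdint.h>

/* In per-cpu dispatch mode, pick the netlink portid for an upcall:
 * honor OVS_USERSPACE_ATTR_PID when it carries a real portid, and
 * fall back to the per-cpu socket when it holds the UINT32_MAX
 * placeholder that current userspace always sets. */
static uint32_t select_upcall_portid(uint32_t attr_pid,
                                     uint32_t per_cpu_portid)
{
    /* Any value other than UINT32_MAX points at a dedicated socket,
     * e.g. one owned by an external sample collector. */
    return attr_pid != UINT32_MAX ? attr_pid : per_cpu_portid;
}
```

Since no existing userspace ever sets a value other than UINT32_MAX
in this mode, the change is backward compatible.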


Do you mean creating a new thread to process samples or using
handlers? The latter would still have a performance impact, and the
former would likely fail to process all samples in a timely manner
if there are many.

Besides, the current userspace tc-offloading series uses netlink
broadcast with psample; can't we do the same for non-offloaded
actions? It enables building external observability applications
without overloading OVS.

Creating a separate thread solves the performance issue.  But you can
also write a separate application that would communicate its PID to
the running OVS daemon.  Let's say the same application that
configures sampling in the OVS database can also write a PID there.

The thing is that the existence of an external application
immediately breaks the opacity of the arguments and forces us to
define a uAPI. However, if there is explicit communication between
that application and the OVS userspace daemon, then we can establish
a contract (the structure of the opaque values) between these two
userspace applications without defining that contract in the kernel
uAPI.  But if we're going with multicast, which anyone can subscribe
to, then we have to define that contract in the kernel uAPI.

Also, in order for this observability to work with userspace datapath
we'll have to implement userspace-to-userspace netlink multicast
(does that even exist?).  Running the sample collection within OVS as
a thread will be much less painful.

One other thing worth mentioning is that the PID approach I suggested
is just a minor tweak of what is already supported in the kernel.  It
doesn't prohibit the introduction of a new action or a multicast
group in the future, while a premature uAPI definition may end up as
another action that nobody uses.  It can be added later if it ends up
being actually necessary.

Thinking more about this problem, it seems to make some sense to have
a way to ask OVS for sampling that multiple observers can subscribe to.
A unicast socket will not allow such functionality.  However, I still
don't think the creation of a new multicast group for that purpose is
justified.  The kernel already has a generic sampling mechanism
(psample) with a multicast group created specifically for a very
similar purpose.  So, instead of re-inventing it, we can add a small
modification to OVS's sampling action allowing it to sample to psample
instead of OVS's own unicast sockets.  This can be achieved by adding
a new OVS_SAMPLE_ATTR that would tell the kernel to direct packets to
psample instead of executing actions.  Or by adding a new
OVS_USERSPACE_ATTR that would do the same thing but from a userspace()
action instead, i.e. direct packets to psample instead of OVS's own
sockets, copying OVS_PACKET_ATTR_USERDATA into the
PSAMPLE_ATTR_SAMPLE_GROUP.  Might be cleaner this way, not sure.
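As a sketch of that mapping, the userdata could be split across the
existing group attribute and a hypothetical subgroup one. The struct
and function below are illustrative stand-ins, not real kernel code,
and SAMPLE_SUBGROUP is an assumption from the discussion below:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Stand-in for the psample metadata the kernel would fill in. */
struct psample_args {
    uint32_t group;      /* PSAMPLE_ATTR_SAMPLE_GROUP */
    uint32_t subgroup;   /* hypothetical SAMPLE_SUBGROUP */
};

/* Map OVS_PACKET_ATTR_USERDATA onto psample attributes: the first
 * 32 bits become the group, the next 32 bits the subgroup.  Userdata
 * of any other size is rejected, mirroring the idea that OVS would
 * refuse flows whose userdata doesn't fit psample's arguments. */
static bool userdata_to_psample(const void *userdata, size_t len,
                                struct psample_args *args)
{
    if (len != 2 * sizeof(uint32_t)) {
        return false;
    }
    memcpy(&args->group, userdata, sizeof args->group);
    memcpy(&args->subgroup, (const uint8_t *) userdata + 4,
           sizeof args->subgroup);
    return true;
}
```

With such a mapping, obs_domain_id and obs_point_id could travel in
the group/subgroup pair without any opaque blob in the uAPI.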

From the perspective of the OVS userspace daemon, this functionality
can be cleanly exposed as a separate sampling mechanism alongside
IPFIX, sFlow and NetFlow.

I see you alluded to this approach in the original cover letter above.
So, I'd vote for it instead.  Hopefully the psample API can be extended
to be more flexible and allow larger userdata to be passed in.  Maybe
have a SAMPLE_SUBGROUP in addition to the existing
PSAMPLE_ATTR_SAMPLE_GROUP?  OTOH, it is not really IPFIX, it's a
different interface, so it might have different requirements.  In any
case, OVS may check that the userdata fits into the psample arguments
and reject flows that are not compatible.

Thanks for your feedback Ilya.

I'll send an RFC_v2 with the proposed alternative.


Best regards, Ilya Maximets.



For the case without per-cpu dispatch, the feature comes for free
if a userspace application wants to use it.  However, there is no
currently supported version of OVS that doesn't use per-cpu
dispatch when available.

What do you think?

Best regards, Ilya Maximets.





--
Adrián Moreno
