On Thu, Jul 12, 2018 at 12:59:33PM +0530, Sriharsha Basavapatna via dev wrote:
> This is the third patch in the patch-set to support dynamic rebalancing
> of offloaded flows.
>
> The dynamic rebalancing functionality is implemented in this patch. The
> ukeys that are not scheduled for deletion are obtained and passed as input
> to the rebalancing routine. The rebalancing is done in the context of
> revalidation leader thread, after all other revalidator threads are
> done with gathering rebalancing data for flows.
>
> For each netdev that is in OOR state, a list of flows - both offloaded
> and non-offloaded (pending) - is obtained using the ukeys. For each netdev
> that is in OOR state, the flows are grouped and sorted into offloaded and
> pending flows. The offloaded flows are sorted in descending order of
> pps-rate, while pending flows are sorted in ascending order of pps-rate.
>
> The rebalancing is done in two phases. In the first phase, we try to
> offload all pending flows and if that succeeds, the OOR state on the device
> is cleared. If some (or none) of the pending flows could not be offloaded,
> then we start replacing an offloaded flow that has a lower pps-rate than
> a pending flow, until there are no more pending flows with a higher rate
> than an offloaded flow. The flows that are replaced from the device are
> added into kernel datapath.
>
> Signed-off-by: Sriharsha Basavapatna <[email protected]>
> Co-authored-by: Venkat Duvvuru <[email protected]>
> Signed-off-by: Venkat Duvvuru <[email protected]>
> Reviewed-by: Sathya Perla <[email protected]>
Thanks for the patch. I have some comments.
This whole rebalancing idea looks extraordinarily expensive in the case
where we have a million datapath flows and the user has no offload
capable hardware. Please make it user-configurable and default to on
only if offload capable hardware is in use.
Usually, one should avoid naming that says what not to do; instead, flip
names so that they say what to do. This is because negative names end
up causing confusion later due to double negatives. Actually I feel
confused already about "skip type", which seems like a negative name to
me. I would be inclined to use something more like this:
enum dpif_offload_type {
DPIF_OFFLOAD_AUTO, /* Offload if possible, fallback to software. */
DPIF_OFFLOAD_NEVER, /* Never offload to hardware. */
DPIF_OFFLOAD_ALWAYS, /* Always offload to hardware. */
};
dpif_netlink_operate() validates its argument skip_flag, but that should
happen at a higher level, probably in dpif_operate(). I would make it
an assertion rather than a log message.
It looks like dpif_netlink_operate() sometimes doesn't process all of
the requested operations. It is expected to always set at least the
'error' member of every op passed in, but the new implementation
sometimes omits of them. This breaks the interface invariants.
It is unusual for udpif_run_flow_rebalance() to keep static variables.
It would normally be better for the caller to pass in a place to store
its timer. In any case, the 'now' variable in
udpif_run_flow_rebalance() does not need to be static.
It would be better to initialize 'time' in udpif_run_flow_rebalance() to
LLONG_MIN rather than 0. This will avoid problems if the system's
monotonic clock starts out negative.
This test in udpif_flow_rebalance_prepare() seems weird because it has a
single caller that always passes nonnull. I would delete it:
if (!active_flows || !num_active_flows) {
return;
}
In udpif_flow_rebalance_prepare(), please declare the loop variables in
the loops themselves.
udpif_flow_rebalance_prepare() has some 8-space indentations. OVS uses
4-space. I see a few other incorrect indentations scattered around,
too.
In all new code, please declare variables at point of first use where
possible.
This code uses two passes on flows: first, collect all flows, then
discard all the flows that don't involve an out-of-resources device.
I believe that these two passes could be combined and that combining
them would significantly reduce the required time and space.
It could be valuable as a first step to determine whether the system has
any out-of-resource devices. If it does not, then all the work of
rebalancing can be skipped entirely.
The code in udpif_flow_rebalance() modifies netdev data directly. I
want to discourage that.
There is lots of code around that does something like "n->tunnel_netdev
? n->tunnel_netdev : n->in_netdev". In fact, I do not see *any* code
that uses in_netdev or tunnel_netdev directly. I believe that this code
would work the same, and be easier to understand, if tunnel_netdev were
dropped and in_netdev were just replaced by the tunnel_netdev whenever
the latter was nonnull.
flow_compare_rebalance() compares netdevs by name but I don't think the
ordering really matters. It would be more efficient to just compare
them by pointer, e.g. netdev1 < netdev2 etc.
flow_compare_rebalance() has a few "return (x);" statements that should
just be written "return x;".
In flow_compare_rebalance(), I do not recommend using subtraction to
form a 3-way comparison result, because it yields the wrong result when
overflow or underflow occurs. This is especially true for floating
point.
In udpif_flow_rebalance(), in the while loop, I believe the "continue;"
will lead to an infinite loop because nothing changes from one loop
iteration to the next in this case. (This makes me worry that this code
has not been thoroughly tested.)
In rebalance_device(), the indentation is odd here. Either the comment
should be above the 'if' or it should be indented the same as the
'return':
if (!pending_count) {
/*
* Successfully offloaded all pending flows. The device
* is no longer in OOR state; done rebalancing this device.
*/
return false;
}
In rebalance_device(), please remove the inner parentheses:
while ((churn_count < offload_count) &&
(churn_count < pending_count)) {
Also in rebalance_insert_pending():
if ((count >= insert_count) &&
(flow->flow_pps_rate < rate_threshold)) {
break;
}
In VLOG_*(), please omit the final \n from the message. The vlog code
inserts it automatically.
I don't see any value in ukey_to_ukey_op(). I think it should be
inlined in each case.
I don't see any value in FLOW_PGM_NUM_OPS. I would write 1 instead.
Thanks,
Ben.
_______________________________________________
dev mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-dev