On Sat, Sep 1, 2018 at 12:58 AM, Ben Pfaff <[email protected]> wrote:
> On Thu, Jul 12, 2018 at 12:59:33PM +0530, Sriharsha Basavapatna via dev wrote:
>> This is the third patch in the patch-set to support dynamic rebalancing
>> of offloaded flows.
>>
>> The dynamic rebalancing functionality is implemented in this patch. The
>> ukeys that are not scheduled for deletion are obtained and passed as input
>> to the rebalancing routine. The rebalancing is done in the context of
>> revalidation leader thread, after all other revalidator threads are
>> done with gathering rebalancing data for flows.
>>
>> For each netdev that is in OOR state, a list of flows - both offloaded
>> and non-offloaded (pending) - is obtained using the ukeys. For each netdev
>> that is in OOR state, the flows are grouped and sorted into offloaded and
>> pending flows. The offloaded flows are sorted in descending order of
>> pps-rate, while pending flows are sorted in ascending order of pps-rate.
>>
>> The rebalancing is done in two phases. In the first phase, we try to
>> offload all pending flows and if that succeeds, the OOR state on the device
>> is cleared. If some (or none) of the pending flows could not be offloaded,
>> then we start replacing an offloaded flow that has a lower pps-rate than
>> a pending flow, until there are no more pending flows with a higher rate
>> than an offloaded flow. The flows that are replaced from the device are
>> added into kernel datapath.
>>
>> Signed-off-by: Sriharsha Basavapatna <[email protected]>
>> Co-authored-by: Venkat Duvvuru <[email protected]>
>> Signed-off-by: Venkat Duvvuru <[email protected]>
>> Reviewed-by: Sathya Perla <[email protected]>
>
> Thanks for the patch. I have some comments.
Thanks for your review comments. Please see my response inline.
>
> This whole rebalancing idea looks extraordinarily expensive in the case
> where we have a million datapath flows and the user has no offload
> capable hardware. Please make it user-configurable and default to on
> only if offload capable hardware is in use.
Like you suggested, folded in patch-4 into patch-3 with the ovs config
parameter changes to enable/disable this feature.
>
> Usually, one should avoid naming that says what not to do; instead, flip
> names so that they say what to do. This is because negative names end
> up causing confusion later due to double negatives. Actually I feel
> confused already about "skip type", which seems like a negative name to
> me. I would be inclined to use something more like this:
>
> enum dpif_offload_type {
> DPIF_OFFLOAD_AUTO, /* Offload if possible, fallback to software.
> */
> DPIF_OFFLOAD_NEVER, /* Never offload to hardware. */
> DPIF_OFFLOAD_ALWAYS, /* Always offload to hardware. */
> };
Thanks for suggesting this change; it certainly looks better (more easily
readable) with the new types.
>
> dpif_netlink_operate() validates its argument skip_flag, but that should
> happen at a higher level, probably in dpif_operate(). I would make it
> an assertion rather than a log message.
done; added arg validation in dpif_operate().
>
> It looks like dpif_netlink_operate() sometimes doesn't process all of
> the requested operations. It is expected to always set at least the
> 'error' member of every op passed in, but the new implementation
> sometimes omits of them. This breaks the interface invariants.
fixed it.
>
> It is unusual for udpif_run_flow_rebalance() to keep static variables.
> It would normally be better for the caller to pass in a place to store
> its timer. In any case, the 'now' variable in
> udpif_run_flow_rebalance() does not need to be static.
done; added a new variable in struct udpif to track time of last rebalancing.
>
> It would be better to initialize 'time' in udpif_run_flow_rebalance() to
> LLONG_MIN rather than 0. This will avoid problems if the system's
> monotonic clock starts out negative.
done; it is now initialized to time_msec() in udpif init.
>
> This test in udpif_flow_rebalance_prepare() seems weird because it has a
> single caller that always passes nonnull. I would delete it:
>
> if (!active_flows || !num_active_flows) {
> return;
> }
removed.
>
> In udpif_flow_rebalance_prepare(), please declare the loop variables in
> the loops themselves.
udpif_flow_rebalance_prepare() routine has been removed, while fixing
another comment below: "This code uses two passes on flows..."
>
> udpif_flow_rebalance_prepare() has some 8-space indentations. OVS uses
> 4-space. I see a few other incorrect indentations scattered around,
> too.
>
done
> In all new code, please declare variables at point of first use where
> possible.
done
>
> This code uses two passes on flows: first, collect all flows, then
> discard all the flows that don't involve an out-of-resources device.
> I believe that these two passes could be combined and that combining
> them would significantly reduce the required time and space.
Thanks for this comment. The following changes are done to fix this:
- do not collect all flows; remove udpif_flow_rebalance_prepare() routine.
- collect only those flows that reference an oor netdev.
- allocate space dynamically as needed.
- reduce window of reference to oor-netdevs to udpif_flow_rebalance().
(previously reference was being taken during pps-rate computation)
>
> It could be valuable as a first step to determine whether the system has
> any out-of-resource devices. If it does not, then all the work of
> rebalancing can be skipped entirely.
done.
>
> The code in udpif_flow_rebalance() modifies netdev data directly. I
> want to discourage that.
>
done.
> There is lots of code around that does something like "n->tunnel_netdev
> ? n->tunnel_netdev : n->in_netdev". In fact, I do not see *any* code
> that uses in_netdev or tunnel_netdev directly. I believe that this code
> would work the same, and be easier to understand, if tunnel_netdev were
> dropped and in_netdev were just replaced by the tunnel_netdev whenever
> the latter was nonnull.
done.
>
> flow_compare_rebalance() compares netdevs by name but I don't think the
> ordering really matters. It would be more efficient to just compare
> them by pointer, e.g. netdev1 < netdev2 etc.
done.
>
> flow_compare_rebalance() has a few "return (x);" statements that should
> just be written "return x;".
done.
>
> In flow_compare_rebalance(), I do not recommend using subtraction to
> form a 3-way comparison result, because it yields the wrong result when
> overflow or underflow occurs. This is especially true for floating
> point.
done; float changed to uint64; result saved in an int64.
>
> In udpif_flow_rebalance(), in the while loop, I believe the "continue;"
> will lead to an infinite loop because nothing changes from one loop
> iteration to the next in this case. (This makes me worry that this code
> has not been thoroughly tested.)
This won't lead to an infinite loop because we would never hit the condition;
i.e, we must have only oor-netdevs in this loop. I've changed it to an assert.
>
> In rebalance_device(), the indentation is odd here. Either the comment
> should be above the 'if' or it should be indented the same as the
> 'return':
>
> if (!pending_count) {
> /*
> * Successfully offloaded all pending flows. The device
> * is no longer in OOR state; done rebalancing this device.
> */
> return false;
> }
>
done.
> In rebalance_device(), please remove the inner parentheses:
>
> while ((churn_count < offload_count) &&
> (churn_count < pending_count)) {
>
> Also in rebalance_insert_pending():
>
> if ((count >= insert_count) &&
> (flow->flow_pps_rate < rate_threshold)) {
> break;
> }
done.
>
> In VLOG_*(), please omit the final \n from the message. The vlog code
> inserts it automatically.
done.
>
> I don't see any value in ukey_to_ukey_op(). I think it should be
> inlined in each case.
done.
>
> I don't see any value in FLOW_PGM_NUM_OPS. I would write 1 instead.
done.
>
> Thanks,
>
> Ben.
Thanks,
-Harsha
_______________________________________________
dev mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-dev