On Thu, Oct 31, 2019 at 6:16 AM Han Zhou <[email protected]> wrote:
>
>
>
> On Tue, Oct 29, 2019 at 8:11 AM Dumitru Ceara <[email protected]> wrote:
> >
> > The incremental processing engine might stop a run before the
> > en_runtime_data node is processed. In such cases the ed_runtime_data
> > fields might contain pointers to already deleted SB records. For
> > example, if a port binding corresponding to a patch port is removed from
> > the SB database and the incremental processing engine aborts before the
> > en_runtime_data node is processed then the corresponding local_datapath
> > hashtable entry in ed_runtime_data is stale and will store a pointer to
> > the already freed sbrec_port_binding record.
> >
> > This will cause invalid memory accesses in various places (e.g.,
> > pinctrl_run() -> prepare_ipv6_ras()).
> >
> > To fix the issue we need a way to track how each node was processed
> > during an engine run. This commit transforms the 'changed' field in
> > struct engine_node into a 'state' field. Possible node states are:
> > - "Stale": data in the node is not up to date with the DB.
> > - "Updated": data in the node is valid but was updated during
> > the last run of the engine.
> > - "Valid": data in the node is valid and didn't change during
> > the last run of the engine.
> > - "Aborted": during the last run, processing was aborted for
> > this node.
> >
> > The commit also simplifies the logic of calling engine_run and
> > engine_need_run in order to reduce the number of external variables
> > required to track the result of the last engine execution.
> >
> > Functions that need to be called from the main loop and depend on
> > various data contents of the engine's nodes are now called only if
> > the data is up to date.
> >
> > CC: Han Zhou <[email protected]>
> > Fixes: a6b7d9f4f047 ("ovn-controller: Fix flow installation latency caused by recompute.")
> > Signed-off-by: Dumitru Ceara <[email protected]>
>
>
> Thanks Dumitru for the great fix! This was my omission.
> I realized that the problem must have existed even before commit
> a6b7d9f4f047 (so the commit message can be updated), because there are
> cases where the engine does not run at all, mostly when a previous SB
> update from this node is waiting for the corresponding ovsdb notification
> to come back. At that moment, since the transaction's reply hasn't arrived
> yet, the txn is always NULL and engine_run() is always skipped. If the
> transaction itself deleted some data that happens to be used by
> pinctrl_run() in that iteration, then it would hit the dangling pointer
> problem. I am not sure why I never encountered such a case, though.
>
Hi Han,
Right, I selected the wrong commit. I'll update the commit message. I
think it's harder to see the issue because you need to be waiting for
a transaction reply and at the same time deleting port bindings from
the SB database.
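To make the race concrete, here's a heavily simplified sketch of the old
main-loop shape (stub names, not the real ovn-controller code):

struct ovsdb_idl_txn;                  /* Opaque for this sketch. */

static void engine_run_stub(void)  { /* Refreshes ed_runtime_data. */ }
static void pinctrl_run_stub(void) { /* Dereferences stored SB records. */ }

static void
iteration(struct ovsdb_idl_txn *ovnsb_idl_txn)
{
    /* ovnsb_idl_txn stays NULL while the reply to our previous SB
     * transaction is pending, so the engine is skipped and
     * ed_runtime_data may keep pointers to rows that were deleted in
     * the meantime. */
    if (ovnsb_idl_txn) {
        engine_run_stub();
    }
    pinctrl_run_stub();                /* Ran unconditionally: stale refs. */
}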
> In general I think one option to fix this is to never store references to
> another node's data. For example, always replicate the IDL data into the
> engine node's own data. In practice, this is both inefficient and complex:
> inefficient because of the data copies, and complex because the data needs
> to be deep copied.
>
> The other option is the one implemented by your patch: maintain a state for
> each engine node and make sure node data is accessed only when it is valid.
> I think this is the more practical approach.
>
I think this will be fixed indirectly when/if pinctrl moves to a
different process. Until then this seems like the easiest solution.
> I also like the way you handle the case when engine_run is not invoked, by
> passing run_id in the checks.
>
> However, there are still some places where engine node data is used without
> checking engine_node_data_valid(), such as ofctrl_run() using
> ed_runtime_data.pending_ct_zones, ofctrl_inject_pkt() using
> &ed_addr_sets.addr_sets and &ed_port_groups.port_groups, etc. That data is
> actually valid all the time - it doesn't hold any reference pointers into
> its inputs - so it may not be a real problem, but it can be confusing that
> we validate in some places and not in others. So here is an improvement
> idea: always validate engine node data before accessing it, and improve the
> validation mechanism to reflect each node's real state. We can add an
> optional is_data_valid() interface to engine_node that overrides the
> default engine_node_data_valid() behavior: engine_node_data_valid() calls
> the node's is_data_valid() if it is implemented, and otherwise falls back
> to the default, which validates only based on the node state and run_id.
> For runtime_data, since part of it is always valid while another part may
> be invalid, I think we should split pending_ct_zones out as a separate
> engine node. This improvement is not urgent, so I think it is ok to add it
> as a follow-up patch.
True, I missed those cases. It does sound better to have an optional
is_data_valid() node callback. I'll add it in v2. If it's not too much
work I'll also move pending_ct_zones to a separate engine node;
otherwise we can add it as a follow-up.
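For reference, here's a rough sketch of how I'd picture it, with a
trimmed-down engine_node (the default check based on state and run_id
follows your description; none of this is final code):

#include <stdbool.h>
#include <stdint.h>

enum engine_node_state {
    EN_STALE,    /* Not up to date with the DB. */
    EN_UPDATED,  /* Valid; updated during the last run. */
    EN_VALID,    /* Valid; unchanged during the last run. */
    EN_ABORTED,  /* Processing was aborted during the last run. */
};

struct engine_node {
    uint64_t run_id;
    enum engine_node_state state;
    /* Optional: nodes whose data holds no references into their inputs
     * can report themselves valid regardless of state. */
    bool (*is_data_valid)(struct engine_node *);
};

static bool
engine_node_data_valid(struct engine_node *node, uint64_t run_id)
{
    if (node->is_data_valid) {
        return node->is_data_valid(node);
    }
    /* Default: only trust data that went through the current run and
     * ended up in a usable state. */
    return node->run_id == run_id
           && (node->state == EN_UPDATED || node->state == EN_VALID);
}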
>
> Regarding the current patch, I have some other comments below:
>
> >
> >
> > +            /* We need to make sure that at least the runtime data
> > +             * (e.g., local datapaths, ct zones) are fresh before
> > +             * calling ofctrl_put and pinctrl_run to avoid using
> > +             * potentially freed references to database records.
> > +             */
> > +            if (engine_node_data_valid(&en_runtime_data,
> > +                                       engine_run_id)) {
> > +                ofctrl_put(&ed_flow_output.flow_table,
> > +                           &ed_runtime_data.pending_ct_zones,
> > +                           sbrec_meter_table_get(ovnsb_idl_loop.idl),
> > +                           get_nb_cfg(sbrec_sb_global_table_get(
> > +                                          ovnsb_idl_loop.idl)),
> > +                           engine_node_data_changed(&en_flow_output,
> > +                                                    engine_run_id));
>
>
> I think ofctrl_put() needs to be invoked every iteration, because it takes
> care of reinstalling flows to OVS when ovs-vswitchd is restarted. See
> "need_reinstall_flows" in ofctrl.c.
> For now, I think it is ok to leave this function out of the if
> (engine_node_data_valid()) check, because pending_ct_zones is always valid.
> As an improvement, we should move pending_ct_zones to a separate engine
> node so that it always passes the check.
Ack.
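So each iteration would keep roughly this shape (all names below are
placeholders standing in for the real ovn-controller calls):

#include <stdbool.h>
#include <stdint.h>

static void
install_flows_and_ct_zones(void)
{
    /* Stands in for ofctrl_put(): must run every iteration so flows
     * are reinstalled after an ovs-vswitchd restart, and is safe here
     * because pending_ct_zones holds no references into engine
     * inputs. */
}

static void
process_packet_ins(void)
{
    /* Stands in for pinctrl_run(): dereferences SB records, so it must
     * only run on valid runtime data. */
}

static bool
runtime_data_valid(uint64_t run_id)
{
    (void) run_id;               /* Stands in for engine_node_data_valid(). */
    return true;
}

static void
main_loop_iteration(uint64_t engine_run_id)
{
    install_flows_and_ct_zones();
    if (runtime_data_valid(engine_run_id)) {
        process_packet_ins();
    }
}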
>
> >
> >
> > @@ -2095,17 +2118,16 @@ main(int argc, char *argv[])
> > }
> >
> > }
> > -        if (old_engine_run_id == engine_run_id || !engine_run_done) {
> > -            if (!engine_run_done || engine_need_run(&en_flow_output)) {
> > -                VLOG_DBG("engine did not run, force recompute next time: "
> > -                         "br_int %p, chassis %p", br_int, chassis);
> > -                engine_set_force_recompute(true);
> > -                poll_immediate_wake();
> > -            } else {
> > -                VLOG_DBG("engine did not run, and it was not needed"
> > -                         " either: br_int %p, chassis %p",
> > -                         br_int, chassis);
> > -            }
> > +        if (engine_need_run(&en_flow_output, engine_run_id)) {
> > +            VLOG_DBG("engine did not run, force recompute next time: "
> > +                     "br_int %p, chassis %p", br_int, chassis);
> > +            engine_set_force_recompute(true);
> > +            poll_immediate_wake();
> > +        } else if (engine_aborted(&en_flow_output)) {
> > +            VLOG_DBG("engine was aborted, force recompute next time: "
> > +                     "br_int %p, chassis %p", br_int, chassis);
> > +            engine_set_force_recompute(true);
> > +            poll_immediate_wake();
>
>
> The logic is clearer than before. However, can we have a DBG log that tells
> us when the engine didn't run and wasn't needed either? It could be an else
> if here, like:
>
> } else if (!engine_has_run(&en_flow_output, engine_run_id)) {
>     VLOG_DBG(...)
Sure, I'll add it.
>
> > } else {
> > engine_set_force_recompute(false);
> > }
> >
>
>
> >
> > @@ -82,6 +82,20 @@ struct engine_node_input {
> >      bool (*change_handler)(struct engine_node *node);
> >  };
> >
> > +enum engine_node_state {
> > +    EN_STALE,    /* Data in the node is not up to date with the DB. */
> > +    EN_UPDATED,  /* Data in the node is valid but was updated during the
> > +                  * last run.
> > +                  */
> > +    EN_VALID,    /* Data in the node is valid and didn't change during the
> > +                  * last run.
> > +                  */
> > +    EN_ABORTED,  /* During the last run, processing was aborted for
> > +                  * this node.
> > +                  */
> > +    EN_STATE_MAX,
> > +};
> > +
>
>
> Do we really need the STALE state? I think it is the same as ABORTED. It
> seems the node state would never end up as STALE, right? Even so, I am ok
> with keeping them separate for clarity; I just want to make sure you
> thought about it.
We don't actually end up with STALE as a final state because in
engine_run() we always move from STALE to VALID if the data didn't
change and we didn't abort execution. It did help me follow the logic
of the code though, so I would keep it.
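Roughly, the transitions look like this (a simplified sketch reusing
the enum from the earlier snippet; the input plumbing is trimmed and
the real abort condition comes from engine_abort_recompute):

#include <stddef.h>
#include <stdint.h>

struct en_sketch {
    uint64_t run_id;
    enum engine_node_state state;
    size_t n_inputs;
    struct en_sketch *inputs[16];
    void (*run)(struct en_sketch *);   /* Sets EN_UPDATED on change. */
};

static void
engine_run_sketch(struct en_sketch *node, uint64_t run_id)
{
    if (node->run_id == run_id) {
        return;                        /* Already settled this run. */
    }
    node->run_id = run_id;
    node->state = EN_STALE;            /* Not yet in sync with the DB. */

    for (size_t i = 0; i < node->n_inputs; i++) {
        engine_run_sketch(node->inputs[i], run_id);
        if (node->inputs[i]->state == EN_ABORTED) {
            node->state = EN_ABORTED;  /* Our data may hold freed refs. */
            return;
        }
    }

    node->run(node);                   /* May set EN_UPDATED/EN_ABORTED. */
    if (node->state == EN_STALE) {
        node->state = EN_VALID;        /* No change, no abort: fresh. */
    }
}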
>
> >
> >  struct engine_node {
> >      /* A unique id to distinguish each iteration of the engine_run(). */
> >      uint64_t run_id;
> > @@ -102,8 +116,8 @@ struct engine_node {
> >       * node. */
> >      void *data;
> >
> > -    /* Whether the data changed in the last engine run. */
> > -    bool changed;
> > +    /* State of the node after the last engine run. */
> > +    enum engine_node_state state;
> >
> >      /* Method to initialize data. It may be NULL. */
> >      void (*init)(struct engine_node *);
> > @@ -121,9 +135,9 @@ struct engine_node {
> > void engine_init(struct engine_node *);
> >
> > /* Execute the processing recursively, which should be called in the main
> > - * loop. Returns true if the execution is compelte, false if it is aborted,
> > - * which could happen when engine_abort_recompute is set. */
> > -bool engine_run(struct engine_node *, uint64_t run_id);
> > + * loop. Updates the engine node's states accordingly.
> > + */
> > +void engine_run(struct engine_node *, uint64_t run_id);
> >
> > /* Clean up the data for the engine nodes recursively. It calls each node's
> > * cleanup() method if not NULL. It should be called before the program
> > @@ -132,7 +146,7 @@ void engine_cleanup(struct engine_node *);
> >
> > /* Check if engine needs to run, i.e. any change to be processed. */
>
>
> We need to update the comment: Check if engine needs to run but didn't.
Ack.
>
> > bool
> > -engine_need_run(struct engine_node *);
> > +engine_need_run(struct engine_node *, uint64_t run_id);
> >
>
> Thanks,
> Han
I'll send a follow-up v2 soon.
Thanks,
Dumitru
_______________________________________________
dev mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-dev