Hello,

I am a Software Engineer from Citrix, Cambridge UK office.

We are having a problem when running our nightly test on a XenServer host with lacp bonds with OVS. Indeed, when deleting the bond the following /segfault/ error appears (from GDB):

Program terminated with signal 11, Segmentation fault.

#0  0x0000000000411d2f in hmap_remove (node=0x7fd9d402e730, hmap=0x1c3e9e8)

    at lib/hmap.h:236

236         while (*bucket != node) {

(gdb) bt

#0  0x0000000000411843 in hmap_remove (node=0x7fd9d402e730, hmap=0x1c3e9e8)

    at lib/hmap.h:236

#1  update_recirc_rules (bond=0x1c3e920) at ofproto/bond.c:382

#2  0x0000000000001ff5 in ?? ()

(gdb) p *hmap

$1 = {buckets = 0x7fd9d4039b70, one = 0x0, mask = 255, n = 147}

(gdb) p *node

$2 = {hash = 2450845263, next = 0x0}

(gdb) p *bucket

Cannot access memory at address 0x8

(gdb) p *(hmap->buckets)

$3 = (struct hmap_node *) 0x0

(gdb) p node->hash & hmap->mask

$4 = 79

(gdb) p *(&hmap->buckets[79])

$5 = (struct hmap_node *) 0x0

I managed to reproduce the error using a while loop in which I repeatedly create and destroy a lacp bond, waiting for 15 seconds after each of these operations.

The error is not constant in timing and I think that it can be a race condition due to the recirculation process. It happens indeed that when calling /hmap_remove()/ from /update_recirc_rules()/ (in case of DEL), the bucket points to NULL. My idea is the following:

/hmap_remove()/ is called from two different places in /ofproto/bond.c/:

1. from /update_recirc_rules()/ (in which it raises the /segfault/)

hmap_remove(&bond->pr_rule_ops, &pr_op->hmap_node);
*pr_op->pr_rule = NULL;
free(pr_op);

2. and from /bond_unref()/ (called when I try to delete the bond)

HMAP_FOR_EACH_SAFE(pr_op, next_op, hmap_node, &bond->pr_rule_ops) {
    hmap_remove(&bond->pr_rule_ops, &pr_op->hmap_node);
    free(pr_op);
}

Now, if these two calls happen at the same time, a conflict on the /bond pr_rule/ may happen. Looking at the code I can see that the recirculation /hmap_remove()/ is executed inside an external lock (/ovs_rwlock_wrlock(&rwlock)/) if called by /bond_rebalance()/

void bond_rebalance(struct bond *bond) {
...
ovs_rwlock_wrlock(&rwlock);
...
...
if (use_recirc && rebalanced) {
    bond_update_post_recirc_rules(bond,true); <---- hmap_remove()
}

done:
    ovs_rwlock_unlock(&rwlock);
}

but I can't see any lock when called by /output_normal()/

static void output_normal(struct xlate_ctx *ctx, const struct xbundle 
*out_xbundle, uint16_t vlan) {
...
if (ctx->use_recirc) {
    ...
    bond_update_post_recirc_rules(out_xbundle->bond, false); <---- hmap_remove()
    ...
}
...

The same lock is used in /bond_unref()/ but only for /hmap_remove(all_bonds, &bond->hmap_node)/ and not for /hmap_remove(&bond->pr_rule_ops, &pr_op->hmap_node)/.

...
ovs_rwlock_wrlock(&rwlock);
hmap_remove(all_bonds, &bond->hmap_node);
ovs_rwlock_unlock(&rwlock);
...
HMAP_FOR_EACH_SAFE(pr_op, next_op, hmap_node, &bond->pr_rule_ops) {
    hmap_remove(&bond->pr_rule_ops, &pr_op->hmap_node);
    free(pr_op);
}
...

I suppose that the goal is to remove any pointer to the bond, from all_bonds, inside the lock and to work on it locally. My doubt is: can the bond be still accessible from somewhere else (let's say from a /bundle/)? If yes, what happen if a /bundle/ tries to access a bond which was previously removed from /all_bonds/ (let's say from /bundle_destroy()/)?

The version of OVS that I am using is /openvswitch-2.3.0-7.8312.x86_64/.

Could you please help me to find the problem?


Thank you for the help & kind regards,

Salvatore


_______________________________________________
discuss mailing list
discuss@openvswitch.org
http://openvswitch.org/mailman/listinfo/discuss

Reply via email to