Hello,
I am a Software Engineer at Citrix, in the Cambridge UK office.
We are having a problem when running our nightly tests on a XenServer
host with LACP bonds on OVS: when the bond is deleted, the following
/segfault/ appears (from GDB):
Program terminated with signal 11, Segmentation fault.
#0 0x0000000000411d2f in hmap_remove (node=0x7fd9d402e730, hmap=0x1c3e9e8)
at lib/hmap.h:236
236 while (*bucket != node) {
(gdb) bt
#0 0x0000000000411843 in hmap_remove (node=0x7fd9d402e730, hmap=0x1c3e9e8)
at lib/hmap.h:236
#1 update_recirc_rules (bond=0x1c3e920) at ofproto/bond.c:382
#2 0x0000000000001ff5 in ?? ()
(gdb) p *hmap
$1 = {buckets = 0x7fd9d4039b70, one = 0x0, mask = 255, n = 147}
(gdb) p *node
$2 = {hash = 2450845263, next = 0x0}
(gdb) p *bucket
Cannot access memory at address 0x8
(gdb) p *(hmap->buckets)
$3 = (struct hmap_node *) 0x0
(gdb) p node->hash & hmap->mask
$4 = 79
(gdb) p *(&hmap->buckets[79])
$5 = (struct hmap_node *) 0x0
I managed to reproduce the error with a while loop that repeatedly
creates and destroys a LACP bond, waiting 15 seconds after each
operation.
The timing of the error varies, so I suspect a race condition in the
recirculation process: when /hmap_remove()/ is called from
/update_recirc_rules()/ (in the DEL case), the bucket points to NULL.
My idea is the following:
/hmap_remove()/ is called from two different places in /ofproto/bond.c/:
1. from /update_recirc_rules()/ (in which it raises the /segfault/)
hmap_remove(&bond->pr_rule_ops, &pr_op->hmap_node);
*pr_op->pr_rule = NULL;
free(pr_op);
2. and from /bond_unref()/ (called when I try to delete the bond)
HMAP_FOR_EACH_SAFE(pr_op, next_op, hmap_node, &bond->pr_rule_ops) {
hmap_remove(&bond->pr_rule_ops, &pr_op->hmap_node);
free(pr_op);
}
Now, if these two calls happen at the same time, a conflict on the
/bond pr_rule/ may occur. Looking at the code, I can see that the
recirculation /hmap_remove()/ is executed under an external lock
(/ovs_rwlock_wrlock(&rwlock)/) when called from /bond_rebalance()/:
void bond_rebalance(struct bond *bond) {
...
ovs_rwlock_wrlock(&rwlock);
...
...
if (use_recirc && rebalanced) {
bond_update_post_recirc_rules(bond,true); <---- hmap_remove()
}
done:
ovs_rwlock_unlock(&rwlock);
}
but I can't see any lock when it is called from /output_normal()/:
static void output_normal(struct xlate_ctx *ctx, const struct xbundle
*out_xbundle, uint16_t vlan) {
...
if (ctx->use_recirc) {
...
bond_update_post_recirc_rules(out_xbundle->bond, false); <---- hmap_remove()
...
}
...
The same lock is used in /bond_unref()/, but only around
/hmap_remove(all_bonds, &bond->hmap_node)/, not around
/hmap_remove(&bond->pr_rule_ops, &pr_op->hmap_node)/:
...
ovs_rwlock_wrlock(&rwlock);
hmap_remove(all_bonds, &bond->hmap_node);
ovs_rwlock_unlock(&rwlock);
...
HMAP_FOR_EACH_SAFE(pr_op, next_op, hmap_node, &bond->pr_rule_ops) {
hmap_remove(&bond->pr_rule_ops, &pr_op->hmap_node);
free(pr_op);
}
...
I suppose the goal is to remove every pointer to the bond from
/all_bonds/ inside the lock and then work on it locally. My doubt is:
can the bond still be reachable from somewhere else (say, from a
/bundle/)? If so, what happens if a /bundle/ tries to access a bond
that was previously removed from /all_bonds/ (say, from
/bundle_destroy()/)?
The version of OVS that I am using is /openvswitch-2.3.0-7.8312.x86_64/.
Could you please help me find the problem?
Thank you for the help & kind regards,
Salvatore
_______________________________________________
discuss mailing list
discuss@openvswitch.org
http://openvswitch.org/mailman/listinfo/discuss