Hi Andy, Yes, it makes a lot of sense!
I applied the patch and I'm running my script again in order to test it. It looks good so far! I will leave the script running all across the weekend and I'll come back to you with my results on Monday.
Thank you for your very fast response and for your effort, Salvatore On 13/02/15 02:10, Andy Zhou wrote:
Does not mean to drop the list. ---------- Forwarded message ---------- From: Andy Zhou <[email protected]> Date: Thu, Feb 12, 2015 at 3:42 PM Subject: Re: [ovs-discuss] OVS segfault in recirculation To: Salvatore Cambria <[email protected]> Hi, Salvatore, I think I found a bug: hmap_remove needs to be protected by a lock. As you pointed out, the output_normal() path is not properly protected. Would you please try the attached patch to see if it helps? Bond object is protected by reference counting. Assuming reference counter is valid, it should be O.K. for a bundle to hold on to the object. Notice bundle_destroy() calls unref_bond(). At any rate, I did not spot anything obvious there. Thanks for providing such a detailed description. It is very helpful. Andy ================================ iff --git a/ofproto/bond.c b/ofproto/bond.c index c4b3a3a..a9b7a83 100644 --- a/ofproto/bond.c +++ b/ofproto/bond.c @@ -923,8 +923,8 @@ bond_may_recirc(const struct bond *bond, uint32_t *recirc_id, } } -void -bond_update_post_recirc_rules(struct bond* bond, const bool force) +static void +bond_update_post_recirc_rules__(struct bond* bond, const bool force) { struct bond_entry *e; bool update_rules = force; /* Always update rules if caller forces it. */ @@ -945,6 +945,14 @@ bond_update_post_recirc_rules(struct bond* bond, const bool force) update_recirc_rules(bond); } } + +void +bond_update_post_recirc_rules(struct bond* bond, const bool force) +{ + ovs_rwlock_wrlock(&rwlock); + bond_update_post_recirc_rules__(bond, force); + ovs_rwlock_unlock(&rwlock); +} ^L /* Rebalancing. */ @@ -1203,7 +1211,7 @@ bond_rebalance(struct bond *bond) } if (use_recirc && rebalanced) { - bond_update_post_recirc_rules(bond,true); + bond_update_post_recirc_rules__(bond,true); } done: On Tue, Feb 10, 2015 at 7:57 AM, Salvatore Cambria <[email protected]> wrote:Hello, I am a Software Engineer from Citrix, Cambridge UK office. We are having a problem when running our nightly test on a XenServer host with lacp bonds with OVS. Indeed, when deleting the bond the following segfault error appears (from GDB): Program terminated with signal 11, Segmentation fault. #0 0x0000000000411d2f in hmap_remove (node=0x7fd9d402e730, hmap=0x1c3e9e8) at lib/hmap.h:236 236 while (*bucket != node) { (gdb) bt #0 0x0000000000411843 in hmap_remove (node=0x7fd9d402e730, hmap=0x1c3e9e8) at lib/hmap.h:236 #1 update_recirc_rules (bond=0x1c3e920) at ofproto/bond.c:382 #2 0x0000000000001ff5 in ?? () (gdb) p *hmap $1 = {buckets = 0x7fd9d4039b70, one = 0x0, mask = 255, n = 147} (gdb) p *node $2 = {hash = 2450845263, next = 0x0} (gdb) p *bucket Cannot access memory at address 0x8 (gdb) p *(hmap->buckets) $3 = (struct hmap_node *) 0x0 (gdb) p node->hash & hmap->mask $4 = 79 (gdb) p *(&hmap->buckets[79]) $5 = (struct hmap_node *) 0x0 I managed to reproduce the error using a while loop in which I repeatedly create and destroy a lacp bond, waiting for 15 seconds after each of these operations. The error is not constant in timing and I think that it can be a race condition due to the recirculation process. It happens indeed that when calling hmap_remove() from update_recirc_rules() (in case of DEL), the bucket points to NULL. My idea is the following: hmap_remove() is called from two different places in ofproto/bond.c: 1. from update_recirc_rules() (in which it raises the segfault) hmap_remove(&bond->pr_rule_ops, &pr_op->hmap_node); *pr_op->pr_rule = NULL; free(pr_op); 2. and from bond_unref() (called when I try to delete the bond) HMAP_FOR_EACH_SAFE(pr_op, next_op, hmap_node, &bond->pr_rule_ops) { hmap_remove(&bond->pr_rule_ops, &pr_op->hmap_node); free(pr_op); } Now, if these two calls happen at the same time, a conflict on the bond pr_rule may happen. Looking at the code I can see that the recirculation hmap_remove() is executed inside an external lock (ovs_rwlock_wrlock(&rwlock)) if called by bond_rebalance() void bond_rebalance(struct bond *bond) { ... ovs_rwlock_wrlock(&rwlock); ... ... if (use_recirc && rebalanced) { bond_update_post_recirc_rules(bond,true); <---- hmap_remove() } done: ovs_rwlock_unlock(&rwlock); } but I can't see any lock when called by output_normal() static void output_normal(struct xlate_ctx *ctx, const struct xbundle *out_xbundle, uint16_t vlan) { ... if (ctx->use_recirc) { ... bond_update_post_recirc_rules(out_xbundle->bond, false); <---- hmap_remove() ... } ... The same lock is used in bond_unref() but only for hmap_remove(all_bonds, &bond->hmap_node) and not for hmap_remove(&bond->pr_rule_ops, &pr_op->hmap_node). ... ovs_rwlock_wrlock(&rwlock); hmap_remove(all_bonds, &bond->hmap_node); ovs_rwlock_unlock(&rwlock); ... HMAP_FOR_EACH_SAFE(pr_op, next_op, hmap_node, &bond->pr_rule_ops) { hmap_remove(&bond->pr_rule_ops, &pr_op->hmap_node); free(pr_op); } ... I suppose that the goal is to remove any pointer to the bond, from all_bonds, inside the lock and to work on it locally. My doubt is: can the bond be still accessible from somewhere else (let's say from a bundle)? If yes, what happen if a bundle tries to access a bond which was previously removed from all_bonds (let's say from bundle_destroy())? The version of OVS that I am using is openvswitch-2.3.0-7.8312.x86_64. Could you please help me to find the problem? Thank you for the help & kind regards, Salvatore _______________________________________________ discuss mailing list [email protected] http://openvswitch.org/mailman/listinfo/discuss
_______________________________________________ dev mailing list [email protected] http://openvswitch.org/mailman/listinfo/dev
