Great! Looking forward to the test results.
On Fri, Feb 13, 2015 at 3:56 AM, Salvatore Cambria
<salvatore.camb...@citrix.com> wrote:
> Hi Andy,
>
> Yes, it makes a lot of sense!
>
> I applied the patch and I'm running my script again in order to test it.
> It looks good so far!
> I will leave the script running all through the weekend and I'll come
> back to you with my results on Monday.
>
> Thank you for your very fast response and for your effort,
>
> Salvatore
>
> On 13/02/15 02:10, Andy Zhou wrote:
>>
>> Did not mean to drop the list.
>>
>> ---------- Forwarded message ----------
>> From: Andy Zhou <az...@nicira.com>
>> Date: Thu, Feb 12, 2015 at 3:42 PM
>> Subject: Re: [ovs-discuss] OVS segfault in recirculation
>> To: Salvatore Cambria <salvatore.camb...@citrix.com>
>>
>> Hi Salvatore,
>>
>> I think I found a bug: hmap_remove() needs to be protected by a lock.
>> As you pointed out, the output_normal() path is not properly
>> protected. Would you please try the attached patch to see if it helps?
>>
>> The bond object is protected by reference counting. Assuming the
>> reference count is valid, it should be O.K. for a bundle to hold on
>> to the object. Notice that bundle_destroy() calls bond_unref(). At
>> any rate, I did not spot anything obvious there.
>>
>> Thanks for providing such a detailed description. It is very helpful.
>>
>> Andy
>>
>> ================================
>>
>> diff --git a/ofproto/bond.c b/ofproto/bond.c
>> index c4b3a3a..a9b7a83 100644
>> --- a/ofproto/bond.c
>> +++ b/ofproto/bond.c
>> @@ -923,8 +923,8 @@ bond_may_recirc(const struct bond *bond, uint32_t *recirc_id,
>>      }
>>  }
>>
>> -void
>> -bond_update_post_recirc_rules(struct bond* bond, const bool force)
>> +static void
>> +bond_update_post_recirc_rules__(struct bond* bond, const bool force)
>>  {
>>      struct bond_entry *e;
>>      bool update_rules = force;  /* Always update rules if caller forces it. */
>> @@ -945,6 +945,14 @@ bond_update_post_recirc_rules(struct bond* bond, const bool force)
>>          update_recirc_rules(bond);
>>      }
>>  }
>> +
>> +void
>> +bond_update_post_recirc_rules(struct bond* bond, const bool force)
>> +{
>> +    ovs_rwlock_wrlock(&rwlock);
>> +    bond_update_post_recirc_rules__(bond, force);
>> +    ovs_rwlock_unlock(&rwlock);
>> +}
>> ^L
>>  /* Rebalancing. */
>>
>> @@ -1203,7 +1211,7 @@ bond_rebalance(struct bond *bond)
>>      }
>>
>>      if (use_recirc && rebalanced) {
>> -        bond_update_post_recirc_rules(bond,true);
>> +        bond_update_post_recirc_rules__(bond,true);
>>      }
>>
>>  done:
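The patch's shape is worth noting: the real work moves into an internal
bond_update_post_recirc_rules__() that assumes the write lock is held,
while the public bond_update_post_recirc_rules() takes the lock itself;
bond_rebalance(), which already holds the lock, switches to the internal
variant so it does not try to re-acquire the non-recursive lock. Below
is a minimal standalone sketch of that pattern, using plain pthreads in
place of OVS's ovs_rwlock wrappers; update_rules, update_rules__, and
rebalance are illustrative names, not OVS code.

#include <pthread.h>

static pthread_rwlock_t rwlock = PTHREAD_RWLOCK_INITIALIZER;

/* Does the real work; the caller must hold 'rwlock' for writing. */
static void
update_rules__(void)
{
    /* ... mutate the shared hash map here ... */
}

/* Public entry point: takes the lock itself, so it is safe to call
 * from paths that hold no lock, like output_normal(). */
void
update_rules(void)
{
    pthread_rwlock_wrlock(&rwlock);
    update_rules__();
    pthread_rwlock_unlock(&rwlock);
}

/* A caller that already holds the write lock must use the "__"
 * variant; calling update_rules() here would try to re-acquire a
 * non-recursive lock and risk self-deadlock. */
void
rebalance(void)
{
    pthread_rwlock_wrlock(&rwlock);
    update_rules__();
    pthread_rwlock_unlock(&rwlock);
}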
>>
>> On Tue, Feb 10, 2015 at 7:57 AM, Salvatore Cambria
>> <salvatore.camb...@citrix.com> wrote:
>>>
>>> Hello,
>>>
>>> I am a Software Engineer from Citrix's Cambridge, UK office.
>>>
>>> We are having a problem when running our nightly tests on a XenServer
>>> host with LACP bonds on OVS. When deleting the bond, the following
>>> segfault occurs (from GDB):
>>>
>>> Program terminated with signal 11, Segmentation fault.
>>> #0  0x0000000000411d2f in hmap_remove (node=0x7fd9d402e730, hmap=0x1c3e9e8)
>>>     at lib/hmap.h:236
>>> 236         while (*bucket != node) {
>>> (gdb) bt
>>> #0  0x0000000000411843 in hmap_remove (node=0x7fd9d402e730, hmap=0x1c3e9e8)
>>>     at lib/hmap.h:236
>>> #1  update_recirc_rules (bond=0x1c3e920) at ofproto/bond.c:382
>>> #2  0x0000000000001ff5 in ?? ()
>>> (gdb) p *hmap
>>> $1 = {buckets = 0x7fd9d4039b70, one = 0x0, mask = 255, n = 147}
>>> (gdb) p *node
>>> $2 = {hash = 2450845263, next = 0x0}
>>> (gdb) p *bucket
>>> Cannot access memory at address 0x8
>>> (gdb) p *(hmap->buckets)
>>> $3 = (struct hmap_node *) 0x0
>>> (gdb) p node->hash & hmap->mask
>>> $4 = 79
>>> (gdb) p *(&hmap->buckets[79])
>>> $5 = (struct hmap_node *) 0x0
>>>
>>> I managed to reproduce the error with a while loop that repeatedly
>>> creates and destroys an LACP bond, waiting 15 seconds after each of
>>> these operations.
>>>
>>> The timing of the error is not consistent, and I think it may be a
>>> race condition in the recirculation code. When hmap_remove() is
>>> called from update_recirc_rules() (in the DEL case), the bucket
>>> points to NULL. My idea is the following:
>>>
>>> hmap_remove() is called from two different places in ofproto/bond.c:
>>>
>>> 1. from update_recirc_rules() (where it raises the segfault)
>>>
>>>     hmap_remove(&bond->pr_rule_ops, &pr_op->hmap_node);
>>>     *pr_op->pr_rule = NULL;
>>>     free(pr_op);
>>>
>>> 2. and from bond_unref() (called when I delete the bond)
>>>
>>>     HMAP_FOR_EACH_SAFE (pr_op, next_op, hmap_node, &bond->pr_rule_ops) {
>>>         hmap_remove(&bond->pr_rule_ops, &pr_op->hmap_node);
>>>         free(pr_op);
>>>     }
>>>
>>> Now, if these two calls happen at the same time, they can conflict on
>>> the bond's pr_rule_ops. Looking at the code, I can see that the
>>> recirculation hmap_remove() runs under an external lock
>>> (ovs_rwlock_wrlock(&rwlock)) when called from bond_rebalance():
>>>
>>>     void
>>>     bond_rebalance(struct bond *bond)
>>>     {
>>>         ...
>>>         ovs_rwlock_wrlock(&rwlock);
>>>         ...
>>>         if (use_recirc && rebalanced) {
>>>             bond_update_post_recirc_rules(bond, true);  <---- hmap_remove()
>>>         }
>>>
>>>     done:
>>>         ovs_rwlock_unlock(&rwlock);
>>>     }
>>>
>>> but I can't see any lock when it is called from output_normal():
>>>
>>>     static void
>>>     output_normal(struct xlate_ctx *ctx,
>>>                   const struct xbundle *out_xbundle, uint16_t vlan)
>>>     {
>>>         ...
>>>         if (ctx->use_recirc) {
>>>             ...
>>>             bond_update_post_recirc_rules(out_xbundle->bond, false);  <---- hmap_remove()
>>>             ...
>>>         }
>>>         ...
>>>
>>> The same lock is used in bond_unref(), but only around
>>> hmap_remove(all_bonds, &bond->hmap_node), not around
>>> hmap_remove(&bond->pr_rule_ops, &pr_op->hmap_node):
>>>
>>>     ...
>>>     ovs_rwlock_wrlock(&rwlock);
>>>     hmap_remove(all_bonds, &bond->hmap_node);
>>>     ovs_rwlock_unlock(&rwlock);
>>>     ...
>>>     HMAP_FOR_EACH_SAFE (pr_op, next_op, hmap_node, &bond->pr_rule_ops) {
>>>         hmap_remove(&bond->pr_rule_ops, &pr_op->hmap_node);
>>>         free(pr_op);
>>>     }
>>>     ...
>>>
>>> I suppose the goal is to remove every pointer to the bond from
>>> all_bonds inside the lock and then work on it locally. My doubt is:
>>> can the bond still be reached from somewhere else (say, from a
>>> bundle)? If so, what happens if a bundle tries to access a bond that
>>> was already removed from all_bonds (say, from bundle_destroy())?
>>>
>>> The version of OVS that I am using is openvswitch-2.3.0-7.8312.x86_64.
>>>
>>> Could you please help me to find the problem?
>>>
>>> Thank you for the help & kind regards,
>>>
>>> Salvatore
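The GDB session above fits this theory. hmap_remove() (lib/hmap.h)
walks the singly linked bucket chain until *bucket == node; if
bond_unref()'s unlocked HMAP_FOR_EACH_SAFE loop has already unlinked
the node, the walk runs off the NULL end of the chain, and the step
bucket = &(*bucket)->next with *bucket == NULL leaves bucket pointing
at offsetof(struct hmap_node, next), which is 0x8 on x86_64: exactly
the "Cannot access memory at address 0x8" above (and $4/$5 show the
node's bucket, buckets[79], already empty). A standalone sketch of
that arithmetic follows; the struct layout is copied from lib/hmap.h,
but the surrounding code is illustrative, not OVS code.

#include <stdio.h>
#include <stddef.h>

/* Same layout as struct hmap_node in lib/hmap.h. */
struct hmap_node {
    size_t hash;                /* offset 0 */
    struct hmap_node *next;     /* offset 8 on a 64-bit build */
};

int
main(void)
{
    /* The bucket chain as the unlocked remover leaves it: empty,
     * because HMAP_FOR_EACH_SAFE already unlinked and freed every
     * node on it. */
    struct hmap_node *chain = NULL;
    struct hmap_node **bucket = &chain;
    struct hmap_node stale = { 0, NULL };   /* unlinked by the other thread */

    /* The loop in hmap_remove(): while (*bucket != node) ... */
    if (*bucket != &stale) {
        /* The loop body would compute bucket = &(*bucket)->next;
         * with *bucket == NULL, the next load reads this address: */
        printf("fault address: 0x%zx\n", offsetof(struct hmap_node, next));
    }
    return 0;
}

Compiled on x86_64 this prints "fault address: 0x8", matching frame #0
in the backtrace.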