asaph-

> I think the problem still exists even with this fix.

did you test that this problem still exists and if so can you give me the 
platform and error?  please make it clearer when you post to the list 
whether the bug you are reporting is real or theoretical.

even if this is a theoretical bug it is important.  to be absolutely sure
there are no problems in the future, i've declared two separate barrier
pointers and i don't ever free the data they point to.  this increases the
memory footprint of gmond by 96 bytes but i think that we can live with
that.

thanks for your technical expertise
-matt

> You don't know the order by which threads leave the barrier,
> so you might still be calling barrier_destroy() while there
> are threads accessing b.
> 
> In general this kind of scheme:
> 
>    thread1:
>        b = allocate_barrier();
>        spawn_threads(b);
>        wait_barrier(b);
>        free(b);
> 
> 
>    threadN:
>        wait_barrier(b);
> 
> can't work because you have threads 1..N all accessing the 
> data structure pointed to by b simultaneously, and you have
> no control over which one will exit wait_barrier() first.
> If it happens to be thread1, then it will free() b while
> other threads are still reading the data pointed to by b.
> 
> If you REALLY want to solve this, I think you'd need two
> barriers:
> 
> 
>    thread1:
>        b2 = static_barrier;
>        b1 = allocate_barrier();
>        spawn_threads(b1,b2);
>        wait_barrier(b1);
>        wait_barrier(b2);
>        free(b1);
>        // b2 is never freed
> 
> 
>    threadN:
>        wait_barrier(b1);
>        wait_barrier(b2);
>  
> Of course, this is only interesting if you can't make do with just
> having only static barriers. If you are in a situation that you
> absolutely must allocate and free the memory held by the barriers
> I don't know of another safe way to do this.
> 
> 
> On Mon, Apr 08, 2002 at 03:48:41PM -0700, matt massie wrote:
> > mike-
> > 
> > you can blame me for the problem you were having.  i didn't code the 
> > barriers correctly in gmond.  the machines i tested gmond on before i 
> > released it didn't display the problem so i released it with this bug...
> > 
> > if you look at line 108 of gmond you'll see i initialize a barrier and 
> > then pass it to the mcast_threads that i spin off. directly afterwards i 
> > run a barrier_destroy().  bad.
> > 
> > if the main gmond runs the barrier_destroy() BEFORE all the mcast_threads 
> > can run a barrier_barrier() then you will have a problem.  the mcast 
> > threads will be operating on freed memory... otherwise.. everything is 
> > peachy.
> > 
> > the fix was just to increase the barrier count by one and place a 
> > barrier_barrier() just before the barrier_destroy() to force the main 
> > thread to wait until all the mcast threads are started.
> > 
> > thanks so much for the feedback.
> > 
> > also, i added the --no_setuid and --setuid flags in order to give you more 
> > debugging power.  i know you were having trouble creating a core file 
> > because gmond sets the uid to the uid of "nobody".  you can prevent gmond 
> > from starting up as nobody with the "--no_setuid" flag.
> > 
> > good luck!  and please let me know if i didn't solve your problem!
> > -matt
> > 
> > Saturday, Mike Snitzer wrote forth saying...
> > 
> > > gmond segfaults 50% of the time at startup.  The random nature of it
> > > suggests to me that there is a race condition when the gmond threads
> > > start up.  When I tried to strace or run gmond through gdb the problem
> > > wasn't apparent.. which is what led me to believe it's a threading problem
> > > that strace or gdb masks.
> > > 
> > > Any recommendations for accurately debugging gmond would be great; cause
> > > when running through strace and gdb I can't get it to segfault.
> > > 
> > > FYI, I'm running gmond v2.2.2 on 48 nodes; on 16 of those nodes gmond
> > > segfaulted at startup... 
> > > 
> > > Mike
> > > 
> > > ps.
> > > here's an example:
> > > `which gmond` --debug_level=1 -i eth0
> > > 
> > > mcast_listen_thread() received metric data cpu_speed
> > > mcast_value() mcasting cpu_user value
> > > 2051 pre_process_node() remote_ip=192.168.0.28encoded 8 XDR
> > > bytespre_process_node() has saved the hostname
> > > pre_process_node() has set the timestamp
> > > pre_process_node() received a new node
> > > 
> > > 
> > > XDR data successfully sent
> > > set_metric_value() got metric key 11
> > > set_metric_value() exec'd cpu_nice_func (11)
> > > Segmentation fault
> > > 
> > > 
> > > _______________________________________________
> > > Ganglia-general mailing list
> > > [email protected]
> > > https://lists.sourceforge.net/lists/listinfo/ganglia-general
> > > 
