I think the problem still exists even with this fix.
You don't know the order in which threads leave the barrier,
so you might still be calling barrier_destroy() while there
are threads accessing b.
In general this kind of scheme:
  thread1:
    b = allocate_barrier();
    spawn_threads(b);
    wait_barrier(b);
    free(b);
  threadN:
    wait_barrier(b);
can't work because you have threads 1..N all accessing the
data structure pointed to by b simultaneously, and you have
no control over which one will exit wait_barrier() first.
If it happens to be thread1, then it will free() b while
other threads are still reading the data pointed to by b.
If you REALLY want to solve this, I think you'd need two
barriers:
  thread1:
    b2 = static_barrier;
    b1 = allocate_barrier();
    spawn_threads(b1,b2);
    wait_barrier(b1);
    wait_barrier(b2);
    free(b1);
    // b2 is never freed
  threadN:
    wait_barrier(b1);
    wait_barrier(b2);
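This works because no thread can return from wait_barrier(b2) until
every thread has entered it, and a thread only enters it after it has
completely left wait_barrier(b1). So by the time thread1 calls free(b1),
nobody is touching b1 anymore.
Of course, this is only interesting if you can't make do with only
static barriers. If you are in a situation where you absolutely must
allocate and free the memory held by the barriers, I don't know of
another safe way to do this.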
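For what it's worth, here is a rough sketch of the two-barrier scheme in C
with pthreads. It is only an illustration under some assumptions: the
demo_barrier type, its init/wait functions, NWORKERS, and
run_two_barrier_demo() are all names I made up for this example, and the
barrier is a generic mutex/condvar one, not gmond's actual barrier code.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical minimal barrier (mutex + condition variable).
 * Not gmond's actual barrier code. */
typedef struct {
    pthread_mutex_t mutex;
    pthread_cond_t  cond;
    int      needed;   /* threads required to release the barrier */
    int      waiting;  /* threads that have arrived in this cycle */
    unsigned cycle;    /* bumped on each release; guards spurious wakeups */
} demo_barrier;

static void demo_barrier_init(demo_barrier *b, int needed)
{
    pthread_mutex_init(&b->mutex, NULL);
    pthread_cond_init(&b->cond, NULL);
    b->needed  = needed;
    b->waiting = 0;
    b->cycle   = 0;
}

static void demo_barrier_wait(demo_barrier *b)
{
    pthread_mutex_lock(&b->mutex);
    unsigned my_cycle = b->cycle;
    if (++b->waiting == b->needed) {
        /* Last arrival: release everyone. */
        b->waiting = 0;
        b->cycle++;
        pthread_cond_broadcast(&b->cond);
    } else {
        while (b->cycle == my_cycle)
            pthread_cond_wait(&b->cond, &b->mutex);
    }
    /* Woken threads still touch *b here, which is why freeing it
     * right after a single wait is unsafe. */
    pthread_mutex_unlock(&b->mutex);
}

#define NWORKERS 4

static demo_barrier  static_barrier;  /* b2: static, never freed */
static demo_barrier *b1;              /* b1: heap-allocated, freed by thread1 */
static int passed[NWORKERS];

static void *worker(void *arg)
{
    int id = *(int *)arg;
    demo_barrier_wait(b1);               /* last use of b1 in this thread */
    demo_barrier_wait(&static_barrier);
    passed[id] = 1;
    return NULL;
}

/* Runs the scheme once; returns how many workers passed both barriers. */
static int run_two_barrier_demo(void)
{
    pthread_t tids[NWORKERS];
    int ids[NWORKERS];
    int i, count = 0;

    demo_barrier_init(&static_barrier, NWORKERS + 1);  /* +1 for this thread */
    b1 = malloc(sizeof *b1);
    assert(b1 != NULL);
    demo_barrier_init(b1, NWORKERS + 1);

    for (i = 0; i < NWORKERS; i++) {
        ids[i] = i;
        passed[i] = 0;
        int rc = pthread_create(&tids[i], NULL, worker, &ids[i]);
        assert(rc == 0);
    }

    demo_barrier_wait(b1);
    demo_barrier_wait(&static_barrier);

    /* Every worker has now entered the b2 wait, so none of them is
     * still inside demo_barrier_wait(b1). */
    pthread_cond_destroy(&b1->cond);
    pthread_mutex_destroy(&b1->mutex);
    free(b1);

    for (i = 0; i < NWORKERS; i++) {
        pthread_join(tids[i], NULL);
        count += passed[i];
    }
    return count;
}
```

To try it, call run_two_barrier_demo() from a main() and build with
something like "cc -pthread"; it should return NWORKERS. If you drop the
second wait, the free() can race with workers that are still unlocking
b1's mutex, which is the problem described above.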
On Mon, Apr 08, 2002 at 03:48:41PM -0700, matt massie wrote:
> mike-
>
> you can blame me for the problem you were having. i didn't code the
> barriers correctly in gmond. the machines i tested gmond on before i
> released it didn't display the problem so i released it with this bug...
>
> if you look at line 108 of gmond you'll see i initialize a barrier and
> then pass it to the mcast_threads that i spin off. directly afterwards i
> run a barrier_destroy(). bad.
>
> if the main gmond runs the barrier_destroy() BEFORE all the mcast_threads
> can run a barrier_barrier() then you will have a problem. the mcast
> threads will be operating on freed memory... otherwise.. everything is
> peachy.
>
> the fix was just to increase the barrier count by one and place a
> barrier_barrier() just before the barrier_destroy() to force the main
> thread to wait until all the mcast threads are started.
>
> thanks so much for the feedback.
>
> also, i added the --no_setuid and --setuid flags in order to give you more
> debugging power. i know you were having trouble creating a core file
> because gmond sets the uid to the uid of "nobody". you can prevent gmond
> from starting up as nobody with the "--no_setuid" flag.
>
> good luck! and please let me know if i didn't solve your problem!
> -matt
>
> Saturday, Mike Snitzer wrote forth saying...
>
> > gmond segfaults 50% of the time at startup. The random nature of it
> > suggests to me that there is a race condition when the gmond threads
> > start up. When I tried to strace or run gmond through gdb the problem
> > wasn't apparent.. which is what led me to believe it's a threading problem
> > that strace or gdb masks.
> >
> > Any recommendations for accurately debugging gmond would be great; cause
> > when running through strace and gdb I can't get it to segfault.
> >
> > FYI, I'm running gmond v2.2.2 on 48 nodes; on 16 of those nodes gmond
> > segfaulted at startup...
> >
> > Mike
> >
> > ps.
> > here's an example:
> > `which gmond` --debug_level=1 -i eth0
> >
> > mcast_listen_thread() received metric data cpu_speed
> > mcast_value() mcasting cpu_user value
> > 2051 pre_process_node() remote_ip=192.168.0.28encoded 8 XDR
> > bytespre_process_node() has saved the hostname
> > pre_process_node() has set the timestamp
> > pre_process_node() received a new node
> >
> >
> > XDR data successfully sent
> > set_metric_value() got metric key 11
> > set_metric_value() exec'd cpu_nice_func (11)
> > Segmentation fault
> >
> >
> > _______________________________________________
> > Ganglia-general mailing list
> > [email protected]
> > https://lists.sourceforge.net/lists/listinfo/ganglia-general
> >
>
>