mike-

you can blame me for the problem you were having.  i didn't code the 
barriers correctly in gmond.  the machines i tested gmond on before i 
released it didn't display the problem so i released it with this bug...

if you look at line 108 of gmond you'll see i initialize a barrier and 
then pass it to the mcast_threads that i spin off. directly afterwards i 
run a barrier_destroy().  bad.

if the main gmond thread runs the barrier_destroy() BEFORE all the 
mcast_threads have run barrier_barrier() then you will have a problem: the 
mcast threads will be operating on freed memory.  otherwise, everything is 
peachy.

the fix was just to increase the barrier count by one and place a 
barrier_barrier() just before the barrier_destroy() to force the main 
thread to wait until all the mcast threads are started.

thanks so much for the feedback.

also, i added the --no_setuid and --setuid flags in order to give you more 
debugging power.  i know you were having trouble creating a core file 
because gmond sets the uid to the uid of "nobody".  you can prevent gmond 
from starting up as nobody with the "--no_setuid" flag.

good luck!  and please let me know if i didn't solve your problem!
-matt

Saturday, Mike Snitzer wrote forth saying...

> gmond segfaults 50% of the time at startup.  The random nature of it
> suggests to me that there is a race condition when the gmond threads
> start up.  When I tried to strace or run gmond through gdb the problem
> wasn't apparent, which is what led me to believe it's a threading problem
> that strace or gdb masks.
> 
> Any recommendations for accurately debugging gmond would be great; cause
> when running through strace and gdb I can't get it to segfault.
> 
> FYI, I'm running gmond v2.2.2 on 48 nodes; on 16 of those nodes gmond
> segfaulted at startup... 
> 
> Mike
> 
> ps.
> here's an example:
> `which gmond` --debug_level=1 -i eth0
> 
> mcast_listen_thread() received metric data cpu_speed
> mcast_value() mcasting cpu_user value
> 2051 pre_process_node() remote_ip=192.168.0.28encoded 8 XDR
> bytespre_process_node() has saved the hostname
> pre_process_node() has set the timestamp
> pre_process_node() received a new node
> 
> 
> XDR data successfully sent
> set_metric_value() got metric key 11
> set_metric_value() exec'd cpu_nice_func (11)
> Segmentation fault
> 
> 
> _______________________________________________
> Ganglia-general mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/ganglia-general
> 
