mike- you can blame me for the problem you were having. i didn't code the barriers correctly in gmond. the machines i tested gmond on before i released it didn't display the problem so i released it with this bug...
if you look at line 108 of gmond you'll see i initialize a barrier and then pass it to the mcast_threads that i spin off. directly afterwards i run a barrier_destroy(). bad. if the main gmond thread runs the barrier_destroy() BEFORE all the mcast_threads have run a barrier_barrier() then you will have a problem: the mcast threads will be operating on freed memory... otherwise.. everything is peachy.

the fix was just to increase the barrier count by one and place a barrier_barrier() just before the barrier_destroy() to force the main thread to wait until all the mcast threads are started.

thanks so much for the feedback.

also, i added the --no_setuid and --setuid flags in order to give you more debugging power. i know you were having trouble creating a core file because gmond sets the uid to the uid of "nobody". you can prevent gmond from starting up as nobody with the "--no_setuid" flag.

good luck! and please let me know if i didn't solve your problem!
-matt

Saturday, Mike Snitzer wrote forth saying...

> gmond segfaults 50% of the time at startup. The random nature of it
> suggests to me that their is a race condition when the gmond threads
> startup. When I tried to strace or run gmond through gdb the problem
> wasn't apparant.. which is what led me to believe it's a threading problem
> that strace or gdb masks.
>
> Any recommendations for accurately debugging gmond would be great; cause
> when running through strace and gdb I can't get it to segfault.
>
> FYI, I'm running gmond v2.2.2 on 48 nodes of those 16 of the nodes' gmond
> segfaulted at startup...
>
> Mike
>
> ps.
> here's an example:
>
> `which gmond` --debug_level=1 -i eth0
>
> mcast_listen_thread() received metric data cpu_speed
> mcast_value() mcasting cpu_user value
> 2051 pre_process_node() remote_ip=192.168.0.28encoded 8 XDR
> bytespre_process_node() has saved the hostname
> pre_process_node() has set the timestamp
> pre_process_node() received a new node
>
>
> XDR data successfully sent
> set_metric_value() got metric key 11
> set_metric_value() exec'd cpu_nice_func (11)
> Segmentation fault
>
>
> _______________________________________________
> Ganglia-general mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/ganglia-general
>

