The Solaris problem is definitely thread-related.  Gmetad has been rock
solid ever since I applied the patch below.  I'll forward this thread to
the developers list so the patch can be considered for the next Ganglia
release.

Regarding the performance problem with gmetad on Red Hat ... I tried
gmetad on SuSE (also a 2.4.21 kernel) and it did *not* cause the system
to grind to a halt.  So it seems that the performance problem I stumbled
on is particular to Red Hat Enterprise release 3.  I'm going to leave
gmetad running on Solaris for now since it's finally stable.

David

--- rrd_helpers.c       13 Sep 2004 20:55:16 -0000      1.1.1.1
+++ rrd_helpers.c       2 Nov 2004 23:58:32 -0000
@@ -20,10 +20,13 @@
 static void inline
 my_mkdir ( const char *dir )
 {
+   pthread_mutex_lock( &rrd_mutex );
    if ( mkdir ( dir, 0755 ) < 0 && errno != EEXIST)
       {
+         pthread_mutex_unlock( &rrd_mutex );
          err_sys("Unable to mkdir(%s)",dir);
       }
+   pthread_mutex_unlock( &rrd_mutex );
 }

 static int

Note: this patch is against v2.5.7
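
A possible variation on the patch above, offered only as a sketch: keep
the lock, but read errno while still holding it and release the mutex on
a single path.  locked_mkdir() and mkdir_mutex are hypothetical names,
not gmetad's (the patch uses rrd_mutex), and the function returns the
error code to the caller instead of calling err_sys().

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Hypothetical mutex for this sketch; the actual patch uses rrd_mutex. */
static pthread_mutex_t mkdir_mutex = PTHREAD_MUTEX_INITIALIZER;

/*
 * Create dir while holding the mutex.
 * Returns 0 on success or if dir already exists,
 * otherwise the errno value observed right after mkdir() failed.
 */
static int
locked_mkdir(const char *dir)
{
    int saved = 0;

    assert(dir != NULL);

    pthread_mutex_lock(&mkdir_mutex);
    if (mkdir(dir, 0755) < 0)
        saved = errno;          /* read errno before releasing the lock */
    pthread_mutex_unlock(&mkdir_mutex);

    /* An already-existing directory is not treated as an error. */
    return (saved == EEXIST) ? 0 : saved;
}
```

With a single unlock path there is no branch that can return while the
mutex is still held, and the caller decides how to report a non-zero
result.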

-----Original Message-----
From: Wood, David 
Sent: Tuesday, October 26, 2004 7:52 AM
To: '[email protected]'
Subject: gmetad on Solaris


I'm running gmetad v2.5.7 on Solaris 8 and it randomly falls over with a
mkdir() error:

deputy.nyc:/var/adm > grep gmetad messages messages.0
messages:Oct 25 04:25:29 deputy.nyc.deshaw.com
/usr/local/ganglia/sbin/gmetad[15969]: [ID 937774 user.error] Unable to
mkdir(/var/ganglia/rrds/__SummaryInfo__): Error 0
messages:Oct 25 04:25:29 deputy.nyc.deshaw.com
/usr/local/ganglia/sbin/gmetad[15969]: [ID 937774 user.error] Unable to
mkdir(/var/ganglia/rrds/__SummaryInfo__): Error 0
messages.0:Oct 23 02:57:32 deputy.nyc.deshaw.com
/usr/local/ganglia/sbin/gmetad[29731]: [ID 937774 user.error] Unable to
mkdir(/var/ganglia/rrds/__SummaryInfo__): Error 0
messages.0:Oct 23 02:57:32 deputy.nyc.deshaw.com
/usr/local/ganglia/sbin/gmetad[29731]: [ID 937774 user.error] Unable to
mkdir(/var/ganglia/rrds/__SummaryInfo__): Error 0

The permissions on /var/ganglia are definitely correct.  Notice the
error text is "Error 0" which implies that errno is 0.  Here's the
relevant code fragment (gmetad/rrd_helpers.c):

    static void inline
    my_mkdir ( const char *dir )
    {
       if ( mkdir ( dir, 0755 ) < 0 && errno != EEXIST)
          {
             err_sys("Unable to mkdir(%s)",dir);
          }
    }

I thought err_sys() might be mangling errno; however, some quick testing
shows that err_sys() is working fine.  Thus, it seems that mkdir returns
< 0 but errno is really 0!  Could this be some sort of weird Solaris
thread interaction with mkdir()?  Any other ideas? 
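
One way to narrow this down is to copy errno into a local variable on
the line immediately after mkdir() fails, before any other library call
runs, and log that copy.  The following is a minimal sketch, not code
from gmetad: checked_mkdir() is a hypothetical name, and it reports
through fprintf() rather than err_sys().  On Solaris it may also be
worth confirming that every object file was built with -D_REENTRANT (or
-mt), since without it errno refers to one process-wide variable rather
than a per-thread one; whether that applies to this build is an
assumption.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>

/*
 * Hypothetical diagnostic helper.
 * Returns 0 on success or if dir already exists,
 * otherwise the errno value captured right after mkdir() failed.
 */
static int
checked_mkdir(const char *dir)
{
    assert(dir != NULL);

    if (mkdir(dir, 0755) < 0) {
        int saved = errno;      /* capture before any other call can change it */

        if (saved != EEXIST) {
            fprintf(stderr, "mkdir(%s) failed: errno=%d (%s)\n",
                    dir, saved, strerror(saved));
            return saved;
        }
    }
    return 0;
}
```

If the captured value still reads 0 when mkdir() reports failure, that
would suggest another thread is overwriting a shared errno between the
failing call and the read, rather than err_sys() being at fault.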

BTW - I originally ran gmetad on Red Hat Enterprise release 3; however,
the system eventually ground to a halt.  I'm attributing it to this
Red Hat bug: https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=121434
(still unresolved).  Soon enough I'll be out of possible platforms for
gmetad ;-).

David
