On Sun, 2006-08-20 at 20:18 +0300, Sasha Khapyorsky wrote:
> On 13:01 Sun 20 Aug     , Hal Rosenstock wrote:
> > Hi Sasha,
> > 
> > On Sun, 2006-08-20 at 12:05, Sasha Khapyorsky wrote:
> > > When the OpenSM log file fills the filesystem and write() fails with
> > > 'No space left on device', try to truncate the log file and wrap
> > > around the logging.
> > 
> > Should it be an (admin) option as to whether to truncate the file or not
> > or is there no way to continue without logging (other than this) once
> > the log file fills the disk ?
> 
> In theory OpenSM may continue, but I don't think it is a good idea to
> leave a full disk on the SM machine (by default it is '/var/log'). To me
> truncating there looks like a reasonable default behavior; I don't think
> we need the option.

I would definitely put the option in, and in fact would default it to
*NOT* truncate.  If the disk is full, you have no idea why.  It *might*
be your logs, or it might be a mail bomb filling /var/spool/mail.  I'm
sure as an admin the last thing I would want is my apps deciding, based
upon incomplete information, that wiping out their log files is the
right thing to do.  To me that sounds more like an intruder covering his
tracks than a reasonable thing to do when confronted with ENOSPC.

Truncating logs is something best left up to the admin that's dealing
with the disk full problem in the first place.  After all, if it is
something like an errant app filling the mail spool, truncating the logs
just loses valuable logs while at the same time making room for the app
to keep on adding more to /var/spool/mail.  That's just wrong.  If you
run out of space, just quit logging things until the admin clears the
problem up.  If you put this code in, make the admin turn it on.  That
will keep opensm friendly to appliance-like devices that are single-task
subnet managers.  But I don't think having this patch always on makes
any sense on a multi-task server.
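
As a rough sketch of the opt-in behaviour described above: every name
here (struct log_opts, truncate_on_enospc, log_decide, log_truncate) is
made up for illustration and is not existing OpenSM code.  It assumes a
POSIX system and that the caller captures errno right after the failed
fprintf().

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical admin option; false (never truncate) is the default. */
struct log_opts {
    bool truncate_on_enospc;
};

enum log_action {
    LOG_CONTINUE,   /* write succeeded, carry on */
    LOG_TRUNCATE,   /* admin opted in: truncate the log and retry */
    LOG_SUSPEND     /* stop logging until the admin frees space */
};

/*
 * Decide what to do after a log write.  `ret` is the fprintf() return
 * value, `err` is errno saved immediately after the call.
 */
enum log_action log_decide(const struct log_opts *opts, int ret, int err)
{
    assert(opts != NULL);

    if (ret >= 0)
        return LOG_CONTINUE;
    if (err == ENOSPC && opts->truncate_on_enospc)
        return LOG_TRUNCATE;
    /* Default: leave the existing log alone and just quit logging. */
    return LOG_SUSPEND;
}

/*
 * Truncate the log and rewind it.  Unlike the patch, the ftruncate()
 * result is checked.  Returns 0 on success, -1 on failure.
 */
int log_truncate(FILE *fp)
{
    int fd = fileno(fp);

    if (fd < 0)
        return -1;
    (void)fflush(fp);           /* may itself fail on a full disk */
    if (ftruncate(fd, 0) != 0)
        return -1;
    clearerr(fp);               /* drop the earlier write error */
    return fseek(fp, 0L, SEEK_SET) == 0 ? 0 : -1;
}
```

With truncate_on_enospc left at false, an ENOSPC failure maps to
LOG_SUSPEND and the log file is never touched; only an explicit opt-in
reaches log_truncate().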

> > 
> > See comment below as well.
> > 
> > -- Hal
> > 
> > > Signed-off-by: Sasha Khapyorsky <[EMAIL PROTECTED]>
> > > ---
> > > 
> > >  osm/opensm/osm_log.c |   23 +++++++++++++++--------
> > >  1 files changed, 15 insertions(+), 8 deletions(-)
> > > 
> > > diff --git a/osm/opensm/osm_log.c b/osm/opensm/osm_log.c
> > > index 668e9a6..b4700c8 100644
> > > --- a/osm/opensm/osm_log.c
> > > +++ b/osm/opensm/osm_log.c
> > > @@ -58,6 +58,7 @@ #include <stdarg.h>
> > >  #include <fcntl.h>
> > >  #include <sys/types.h>
> > >  #include <sys/stat.h>
> > > +#include <errno.h>
> > >  
> > >  #ifndef WIN32
> > >  #include <sys/time.h>
> > > @@ -152,6 +153,7 @@ #endif    
> > >      cl_spinlock_acquire( &p_log->lock );
> > >  #ifdef WIN32
> > >      GetLocalTime(&st);
> > > + _retry:
> > >      ret = fprintf(   p_log->out_port, "[%02d:%02d:%02d:%03d][%04X] -> %s",
> > >                       st.wHour, st.wMinute, st.wSecond, st.wMilliseconds,
> > >                       pid, buffer);
> > > @@ -159,6 +161,7 @@ #ifdef WIN32
> > >  #else
> > >      pid = pthread_self();
> > >      tim = time(NULL);
> > > + _retry:
> > >      ret = fprintf( p_log->out_port, "%s %02d %02d:%02d:%02d %06d [%04X] -> %s",
> > >                     ((result.tm_mon < 12) && (result.tm_mon >= 0) ? 
> > >                      month_str[result.tm_mon] : "???"),
> > > @@ -166,6 +169,18 @@ #else
> > >                     result.tm_min, result.tm_sec,
> > >                     usecs, pid, buffer);
> > >  #endif /*  WIN32 */
> > > +
> > > +    if (ret >= 0)
> > > +      log_exit_count = 0;
> > > +    else if (errno == ENOSPC && log_exit_count < 3) {
> > > +      int fd = fileno(p_log->out_port);
> > > +      fprintf(stderr, "log write failed: %s. Will truncate the log file.\n",
> > > +              strerror(errno));
> > > +      ftruncate(fd, 0);
> > 
> > Should return from ftruncate be checked here ?
> 
> It may be checked, but I don't think that a potential ftruncate() failure
> should change the flow - in case of failure we will try to continue
> with lseek() anyway (in order to wrap around the file at least).
> 
> Sasha
> 
> > 
> > > +      lseek(fd, 0, SEEK_SET);
> > > +      log_exit_count++;
> > > +      goto _retry;
> > > +    }
> > >      
> > >      /*
> > >        Flush log on errors too.
> > > @@ -174,14 +189,6 @@ #endif /*  WIN32 */
> > >        fflush( p_log->out_port );
> > >      
> > >      cl_spinlock_release( &p_log->lock );
> > > -    
> > > -    if (ret < 0)
> > > -    {
> > > -      if (log_exit_count++ < 10)
> > > -      {
> > > -        fprintf(stderr, "OSM LOG FAILURE! Quota probably exceeded\n");
> > > -      }
> > > -    }
> > >    }
> > >  }
> > >  
> > 
> 
> _______________________________________________
> openib-general mailing list
> [email protected]
> http://openib.org/mailman/listinfo/openib-general
> 
> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
-- 
Doug Ledford <[EMAIL PROTECTED]>
              GPG KeyID: CFBFF194
              http://people.redhat.com/dledford

Infiniband specific RPMs available at
              http://people.redhat.com/dledford/Infiniband

