first shot at the data: this looks like a full queue where the enqueue
operation is waiting on a drain that does not happen (fast enough). Of
course, that doesn't explain why this happens...

Rainer

> -----Original Message-----
> From: [email protected] [mailto:rsyslog-
> [email protected]] On Behalf Of Aaron Wiebe
> Sent: Tuesday, November 03, 2009 6:27 PM
> To: rsyslog-users
> Subject: Re: [rsyslog] Queuing bug (4.5.5)
> 
> Ok - I captured an strace.  This appears to be the action thread.
> This specific set of calls took minutes.  I'll re-run this with -t
> and/or -T.
> 
> Note that this syslog instance is not actually receiving any data right
> now.
> 
> -Aaron
> 
> 26365 --- SIGSTOP (Stopped (signal)) @ 0 (0) ---
> 26365 futex(0x1bec924, FUTEX_WAIT, 97, NULL <unfinished ...>
> 26365 <... futex resumed> )             = 0
> 26365 clock_gettime(CLOCK_REALTIME, {1257268800, 165944000}) = 0
> 26365 futex(0x1bec8c8, FUTEX_WAKE, 1)   = 0
> 26365 clock_gettime(CLOCK_REALTIME, {1257268800, 166035000}) = 0
> 26365 futex(0x1bec924, FUTEX_WAIT, 99, {1, 999909000} <unfinished ...>
> 26365 <... futex resumed> )             = -1 ETIMEDOUT (Connection
> timed out)
> 26365 write(165, "Oct 29 10:01:42 R6002 WT-GARULF6"..., 229) = 229
> 26365 futex(0x1bec8c8, FUTEX_WAKE, 1)   = 0
> 26365 clock_gettime(CLOCK_REALTIME, {1257268802, 173123000}) = 0
> 26365 futex(0x1bec924, FUTEX_WAIT, 101, NULL <unfinished ...>
> 26365 <... futex resumed> )             = 0
> 26365 clock_gettime(CLOCK_REALTIME, {1257268893, 439872000}) = 0
> 26365 futex(0x1bec8c8, FUTEX_WAKE, 1)   = 0
> 26365 clock_gettime(CLOCK_REALTIME, {1257268893, 439999000}) = 0
> 26365 futex(0x1bec924, FUTEX_WAIT, 103, {1, 999873000} <unfinished ...>
> 26365 <... futex resumed> )             = -1 ETIMEDOUT (Connection
> timed out)
> 26365 write(165, "Oct 29 10:01:42 R6002 WT-GARULF6"..., 248) = 248
> 26365 futex(0x1bec8c8, FUTEX_WAKE, 1)   = 0
> 26365 clock_gettime(CLOCK_REALTIME, {1257268895, 443096000}) = 0
> 26365 futex(0x1bec924, FUTEX_WAIT, 105, NULL <unfinished ...>
> 26365 <... futex resumed> )             = 0
> 26365 clock_gettime(CLOCK_REALTIME,  <unfinished ...>
> 26365 <... clock_gettime resumed> {1257268950, 612585000}) = 0
> 26365 futex(0x1bec8c8, FUTEX_WAKE, 1 <unfinished ...>
> 26365 <... futex resumed> )             = 0
> 26365 clock_gettime(CLOCK_REALTIME, {1257268950, 612757000}) = 0
> 26365 futex(0x1bec924, FUTEX_WAIT, 107, {1, 999828000} <unfinished ...>
> 26365 <... futex resumed> )             = -1 ETIMEDOUT (Connection
> timed out)
> 26365 write(165, "Oct 29 10:01:42 R6002 WT-GARULF6"..., 180) = 180
> 26365 futex(0x1bec8c8, FUTEX_WAKE, 1)   = 0
> 26365 clock_gettime(CLOCK_REALTIME, {1257268952, 619980000}) = 0
> 26365 futex(0x1bec924, FUTEX_WAIT, 109, NULL <unfinished ...>
> 26365 <... futex resumed> )             = 0
> 26365 futex(0x1bec8c8, FUTEX_WAIT, 2, NULL <unfinished ...>
> 26365 <... futex resumed> )             = -1 EAGAIN (Resource
> temporarily unavailable)
> 26365 clock_gettime(CLOCK_REALTIME,  <unfinished ...>
> 26365 <... clock_gettime resumed> {1257268956, 628783000}) = 0
> 26365 futex(0x1bec8c8, FUTEX_WAKE, 1 <unfinished ...>
> 26365 <... futex resumed> )             = 0
> 26365 clock_gettime(CLOCK_REALTIME,  <unfinished ...>
> 26365 <... clock_gettime resumed> {1257268956, 628953000}) = 0
> 26365 futex(0x1bec924, FUTEX_WAIT, 111, {1, 999830000} <unfinished ...>
> 26365 <... futex resumed> )             = -1 ETIMEDOUT (Connection
> timed out)
> 26365 write(165, "Oct 29 10:01:42 R6002 WT-GARULF6"..., 335) = 335
> 26365 futex(0x1bec8c8, FUTEX_WAKE, 1)   = 0
> 26365 clock_gettime(CLOCK_REALTIME, {1257268958, 636092000}) = 0
> 26365 futex(0x1bec924, FUTEX_WAIT, 113, NULL <unfinished ...>
> 
> 
> On Tue, Nov 3, 2009 at 10:25 AM, Rainer Gerhards
> <[email protected]> wrote:
> >> -----Original Message-----
> >> From: [email protected] [mailto:rsyslog-
> >> [email protected]] On Behalf Of [email protected]
> >> Sent: Tuesday, November 03, 2009 4:06 PM
> >> To: rsyslog-users
> >> Subject: Re: [rsyslog] Queuing bug (4.5.5)
> >>
> >> On Tue, 3 Nov 2009, Aaron Wiebe wrote:
> >>
> >> > This is still taking place this hour - though I obviously can't
> >> > restart onto a newer version without losing this case.  Is there
> >> > anything I can do in the current configuration to try and debug
> this
> >> > situation?
> >>
> >> if you can strace each thread for a few seconds it may give you an
> idea
> >> what's happening.
> >
> > This is a good suggestion.
> >
> > It would also potentially be enlightening to attach gdb to the
> process at
> > various points in time and get a backtrace from all running threads.
> >
> > Other than that, there is little you can do on a running version. If
> it were
> > compiled for debug, and the environment setup so that we could
> capture
> > stdout/stderr, sending SIGUSR2 would yield to much the same
> information as
> > the gdb backtrace BUT when runtime instrumentation is enabled, the
> build is
> > operating at a third to a quarter of its normal speed.
> >
> >> in the meantime you need to stop the system from getting further
> >> behind,
> >> can you redirect or reconfigure the senders so that they are not
> >> sending
> >> new logs to this box so that it can dig itself out (stopping the
> input
> >> may
> >> be enough by itself, rsyslog has historicly done a LOT of locking
> >> around
> >> the main queue, and if you have a full queue with more data trying
> to
> >> be
> >> delivered it can really slow things down. I wouldn't expect it to be
> >> this
> >> much, but it could be part of it)
> >>
> >
> > This may clean up things, but I really doubt it, given the magnitude
> of
> > delays. Also, Aaron runs 4.4.5, which already has lots of the locking
> > removed/restructure (not to compare with the recent v5-devel, but
> much more
> > effcient than early v4 and v3).
> >
> > Rainer
> >> David Lang
> >>
> >> > (We're up to 18:46:51 now!)
> >> >
> >> > -Aaron
> >> >
> >> > On Tue, Nov 3, 2009 at 1:46 AM, Rainer Gerhards
> >> > <[email protected]> wrote:
> >> >> mhhh... This is obviously not intended behavior. There are some
> >> rate-limiting
> >> >> features that can deliberately slow down a queue, but I guess you
> >> have not
> >> >> configured these. So there either is a bug that activates them at
> >> some point
> >> >> during processing (e.g. an invalid memory access could do that)
> or
> >> there is
> >> >> some other bug that causes the dequeue to happen very slowly. In
> any
> >> case, it
> >> >> is hard to guess.
> >> >>
> >> >> Given the volume of the messages, a debug log probably does not
> >> help. We
> >> >> could gain a little insight in at least the queue sizes that
> rsyslog
> >> sees via
> >> >> imdiag plus the (very rudamentary) v5 debug front-end (it doesn't
> >> require the
> >> >> engine to be v5!). I would need to spend at least a little work
> on
> >> the
> >> >> front-end to enable remote access, but that's not really a
> problem.
> >> >>
> >> >> One other thing is that I am still holding a v4-beta with
> additional
> >> fixes as
> >> >> I didn't want to put these in the middle of some other debugging.
> >> However,
> >> >> these fixes address potential memory problems, so the most
> >> appropriate course
> >> >> of action would be to give that version at least a try. It needs
> to
> >> be
> >> >> released in the next days in any case.
> >> >>
> >> >> I have uploaded that (pre-4.5.6) version so that you can give it
> a
> >> try if you
> >> >> like:
> >> >>
> >> >> http://www.rsyslog.com/download/rsyslog/pre/rsyslog-4.5.6.tar.gz
> >> >>
> >> >> I think it would good if you could try it at least once, because
> I
> >> think any
> >> >> other troubleshooting will boil down to looking at the fixes this
> >> version
> >> >> contains.
> >> >>
> >> >> Rainer
> >> >>
> >> >>> -----Original Message-----
> >> >>> From: [email protected] [mailto:rsyslog-
> >> >>> [email protected]] On Behalf Of Aaron Wiebe
> >> >>> Sent: Monday, November 02, 2009 11:52 PM
> >> >>> To: rsyslog-users
> >> >>> Subject: [rsyslog] Queuing bug (4.5.5)
> >> >>>
> >> >>> Greetings all,
> >> >>>
> >> >>> I appear to have run into an issue with 4.5.5 where queues are
> not
> >> >>> being flushed in a timely manner.  In this specific case, I have
> >> data
> >> >>> from last wednesday that is being written to disk in small
> chunks
> >> >>> today since last wednesday.  Unfortunately I cannot be too
> specific
> >> in
> >> >>> details, but here is what I can see:
> >> >>>
> >> >>> Using omfile+gzip, there appears to have been a decent burst in
> >> >>> traffic sometime last wednesday.  The rsyslog instance has grown
> to
> >> >>> 1.8GB of memory and stayed there for a while.  I have both main
> >> >>> message and action queues defined using fixedarray, and I see no
> >> >>> on-disk queues (appears to be entirely in memory).
> >> >>>
> >> >>> I've got templates writing out to filenames using an hourly
> >> timestamp
> >> >>> (filenames like:  [token]-[host]-YYYYMMDD-HH.txt.gz)  In some of
> >> those
> >> >>> files, all of them less than 5k in size, there are between 5 and
> 15
> >> >>> lines of content, all of them from last wednesday, and within a
> few
> >> >>> seconds of each other.  It's almost like there was a significant
> >> queue
> >> >>> built up, the hour rolled over, and only the first block of
> lines
> >> were
> >> >>> pulled from the queue.  Then the hour rolled over again, and
> >> another
> >> >>> block of lines were pulled from the queue.  Then the next hour,
> >> then
> >> >>> another 5-15 lines.
> >> >>>
> >> >>> Is it possible that one of the queues still has a good chunk of
> >> data
> >> >>> built up, and is flushing it out very slowly?  It hasn't been
> >> >>> consistantly at the top of the hour either, and not every hour.
> >>  But
> >> >>> the log lines themselves are sequentially written out, and
> usually
> >> the
> >> >>> lines are within a few seconds of each other.
> >> >>>
> >> >>> For example:
> >> >>>
> >> >>> syslog-myhost-20091102-18.txt.gz:  3 lines, 2 with TS Oct 21
> >> 18:46:34
> >> >>> and one 18:46:35
> >> >>> syslog-myhost-20091102-19.txt.gz:  17 lines, 3 Oct 21 18:46:35,
> 14
> >> >>> lines Oct 21 18:46:36
> >> >>> syslog-myhost-20091102-20.txt.gz:  12 lines, 8 Oct 21 18:46:36,
> 4
> >> >>> lines Oct 21 18:46:37
> >> >>>
> >> >>> Thoughts?
> >> >>> -Aaron
> >> >>> _______________________________________________
> >> >>> rsyslog mailing list
> >> >>> http://lists.adiscon.net/mailman/listinfo/rsyslog
> >> >>> http://www.rsyslog.com
> >> >> _______________________________________________
> >> >> rsyslog mailing list
> >> >> http://lists.adiscon.net/mailman/listinfo/rsyslog
> >> >> http://www.rsyslog.com
> >> >>
> >> > _______________________________________________
> >> > rsyslog mailing list
> >> > http://lists.adiscon.net/mailman/listinfo/rsyslog
> >> > http://www.rsyslog.com
> >> >
> > _______________________________________________
> > rsyslog mailing list
> > http://lists.adiscon.net/mailman/listinfo/rsyslog
> > http://www.rsyslog.com
> >
> _______________________________________________
> rsyslog mailing list
> http://lists.adiscon.net/mailman/listinfo/rsyslog
> http://www.rsyslog.com
_______________________________________________
rsyslog mailing list
http://lists.adiscon.net/mailman/listinfo/rsyslog
http://www.rsyslog.com

Reply via email to