Here's one with times:

26365 17:27:59.996584 --- SIGSTOP (Stopped (signal)) @ 0 (0) ---
26365 17:28:00.002436 futex(0x1bec924, FUTEX_WAIT, 121, NULL <unfinished ...>
26365 17:28:00.003810 <... futex resumed> ) = 0 <0.001356>
26365 17:28:00.003871 clock_gettime(CLOCK_REALTIME,  <unfinished ...>
26365 17:28:00.003934 <... clock_gettime resumed> {1257269280,
3889000}) = 0 <0.000047>
26365 17:28:00.004001 futex(0x1bec8c8, FUTEX_WAKE, 1 <unfinished ...>
26365 17:28:00.004065 <... futex resumed> ) = 0 <0.000046>
26365 17:28:00.004129 clock_gettime(CLOCK_REALTIME, {1257269280,
4147000}) = 0 <0.000013>
26365 17:28:00.004185 futex(0x1bec924, FUTEX_WAIT, 123, {1, 999742000}
<unfinished ...>
26365 17:28:02.006675 <... futex resumed> ) = -1 ETIMEDOUT (Connection
timed out) <2.002462>
26365 17:28:02.006797 write(165, "Oct 29 10:01:42 R6002
WT-GARULF6"..., 250) = 250 <0.000305>
26365 17:28:02.007183 futex(0x1bec8c8, FUTEX_WAKE, 1) = 0 <0.000028>
26365 17:28:02.007264 clock_gettime(CLOCK_REALTIME, {1257269282,
7292000}) = 0 <0.000027>
26365 17:28:02.007349 futex(0x1bec924, FUTEX_WAIT, 125, NULL <unfinished ...>
26365 17:30:15.414854 <... futex resumed> ) = 0 <133.407486>
26365 17:30:15.414914 clock_gettime(CLOCK_REALTIME,  <unfinished ...>
26365 17:30:15.414978 <... clock_gettime resumed> {1257269415,
414932000}) = 0 <0.000048>
26365 17:30:15.415044 futex(0x1bec8c8, FUTEX_WAKE, 1 <unfinished ...>
26365 17:30:15.415108 <... futex resumed> ) = 0 <0.000047>
26365 17:30:15.415173 clock_gettime(CLOCK_REALTIME, {1257269415,
415191000}) = 0 <0.000013>
26365 17:30:15.415229 futex(0x1bec924, FUTEX_WAIT, 127, {1, 999741000}
<unfinished ...>
26365 17:30:17.417472 <... futex resumed> ) = -1 ETIMEDOUT (Connection
timed out) <2.002220>
26365 17:30:17.417540 write(165, "Oct 29 10:01:42 R6002
WT-GARULF6"..., 197) = 197 <0.000279>
26365 17:30:17.417886 futex(0x1bec8c8, FUTEX_WAKE, 1) = 0 <0.000027>
26365 17:30:17.417965 clock_gettime(CLOCK_REALTIME, {1257269417,
417992000}) = 0 <0.000027>
26365 17:30:17.418047 futex(0x1bec924, FUTEX_WAIT, 129, NULL <unfinished ...>
26365 17:31:15.608148 <... futex resumed> ) = 0 <58.190082>
26365 17:31:15.608209 clock_gettime(CLOCK_REALTIME,  <unfinished ...>
26365 17:31:15.608272 <... clock_gettime resumed> {1257269475,
608226000}) = 0 <0.000047>
26365 17:31:15.608339 futex(0x1bec8c8, FUTEX_WAKE, 1 <unfinished ...>
26365 17:31:15.608403 <... futex resumed> ) = 0 <0.000048>
26365 17:31:15.608468 clock_gettime(CLOCK_REALTIME, {1257269475,
608486000}) = 0 <0.000014>
26365 17:31:15.608524 futex(0x1bec924, FUTEX_WAIT, 131, {1, 999740000}
<unfinished ...>

It looks like the locks are waiting a -very- long time.

-Aaron

On Tue, Nov 3, 2009 at 12:30 PM, Rainer Gerhards
<[email protected]> wrote:
> first shot at the data: this looks like a full queue where the enqueue
> operation is waiting on a drain that does not happen (fast enough). Of
> course, that doesn't explain why this happens...
>
> Rainer
>
>> -----Original Message-----
>> From: [email protected] [mailto:rsyslog-
>> [email protected]] On Behalf Of Aaron Wiebe
>> Sent: Tuesday, November 03, 2009 6:27 PM
>> To: rsyslog-users
>> Subject: Re: [rsyslog] Queuing bug (4.5.5)
>>
>> Ok - I captured an strace.  This appears to be the action thread.
>> This specific set of calls took minutes.  I'll re-run this with -t
>> and/or -T.
>>
>> Note that this syslog instance is not actually receiving any data right
>> now.
>>
>> -Aaron
>>
>> 26365 --- SIGSTOP (Stopped (signal)) @ 0 (0) ---
>> 26365 futex(0x1bec924, FUTEX_WAIT, 97, NULL <unfinished ...>
>> 26365 <... futex resumed> )             = 0
>> 26365 clock_gettime(CLOCK_REALTIME, {1257268800, 165944000}) = 0
>> 26365 futex(0x1bec8c8, FUTEX_WAKE, 1)   = 0
>> 26365 clock_gettime(CLOCK_REALTIME, {1257268800, 166035000}) = 0
>> 26365 futex(0x1bec924, FUTEX_WAIT, 99, {1, 999909000} <unfinished ...>
>> 26365 <... futex resumed> )             = -1 ETIMEDOUT (Connection
>> timed out)
>> 26365 write(165, "Oct 29 10:01:42 R6002 WT-GARULF6"..., 229) = 229
>> 26365 futex(0x1bec8c8, FUTEX_WAKE, 1)   = 0
>> 26365 clock_gettime(CLOCK_REALTIME, {1257268802, 173123000}) = 0
>> 26365 futex(0x1bec924, FUTEX_WAIT, 101, NULL <unfinished ...>
>> 26365 <... futex resumed> )             = 0
>> 26365 clock_gettime(CLOCK_REALTIME, {1257268893, 439872000}) = 0
>> 26365 futex(0x1bec8c8, FUTEX_WAKE, 1)   = 0
>> 26365 clock_gettime(CLOCK_REALTIME, {1257268893, 439999000}) = 0
>> 26365 futex(0x1bec924, FUTEX_WAIT, 103, {1, 999873000} <unfinished ...>
>> 26365 <... futex resumed> )             = -1 ETIMEDOUT (Connection
>> timed out)
>> 26365 write(165, "Oct 29 10:01:42 R6002 WT-GARULF6"..., 248) = 248
>> 26365 futex(0x1bec8c8, FUTEX_WAKE, 1)   = 0
>> 26365 clock_gettime(CLOCK_REALTIME, {1257268895, 443096000}) = 0
>> 26365 futex(0x1bec924, FUTEX_WAIT, 105, NULL <unfinished ...>
>> 26365 <... futex resumed> )             = 0
>> 26365 clock_gettime(CLOCK_REALTIME,  <unfinished ...>
>> 26365 <... clock_gettime resumed> {1257268950, 612585000}) = 0
>> 26365 futex(0x1bec8c8, FUTEX_WAKE, 1 <unfinished ...>
>> 26365 <... futex resumed> )             = 0
>> 26365 clock_gettime(CLOCK_REALTIME, {1257268950, 612757000}) = 0
>> 26365 futex(0x1bec924, FUTEX_WAIT, 107, {1, 999828000} <unfinished ...>
>> 26365 <... futex resumed> )             = -1 ETIMEDOUT (Connection
>> timed out)
>> 26365 write(165, "Oct 29 10:01:42 R6002 WT-GARULF6"..., 180) = 180
>> 26365 futex(0x1bec8c8, FUTEX_WAKE, 1)   = 0
>> 26365 clock_gettime(CLOCK_REALTIME, {1257268952, 619980000}) = 0
>> 26365 futex(0x1bec924, FUTEX_WAIT, 109, NULL <unfinished ...>
>> 26365 <... futex resumed> )             = 0
>> 26365 futex(0x1bec8c8, FUTEX_WAIT, 2, NULL <unfinished ...>
>> 26365 <... futex resumed> )             = -1 EAGAIN (Resource
>> temporarily unavailable)
>> 26365 clock_gettime(CLOCK_REALTIME,  <unfinished ...>
>> 26365 <... clock_gettime resumed> {1257268956, 628783000}) = 0
>> 26365 futex(0x1bec8c8, FUTEX_WAKE, 1 <unfinished ...>
>> 26365 <... futex resumed> )             = 0
>> 26365 clock_gettime(CLOCK_REALTIME,  <unfinished ...>
>> 26365 <... clock_gettime resumed> {1257268956, 628953000}) = 0
>> 26365 futex(0x1bec924, FUTEX_WAIT, 111, {1, 999830000} <unfinished ...>
>> 26365 <... futex resumed> )             = -1 ETIMEDOUT (Connection
>> timed out)
>> 26365 write(165, "Oct 29 10:01:42 R6002 WT-GARULF6"..., 335) = 335
>> 26365 futex(0x1bec8c8, FUTEX_WAKE, 1)   = 0
>> 26365 clock_gettime(CLOCK_REALTIME, {1257268958, 636092000}) = 0
>> 26365 futex(0x1bec924, FUTEX_WAIT, 113, NULL <unfinished ...>
>>
>>
>> On Tue, Nov 3, 2009 at 10:25 AM, Rainer Gerhards
>> <[email protected]> wrote:
>> >> -----Original Message-----
>> >> From: [email protected] [mailto:rsyslog-
>> >> [email protected]] On Behalf Of [email protected]
>> >> Sent: Tuesday, November 03, 2009 4:06 PM
>> >> To: rsyslog-users
>> >> Subject: Re: [rsyslog] Queuing bug (4.5.5)
>> >>
>> >> On Tue, 3 Nov 2009, Aaron Wiebe wrote:
>> >>
>> >> > This is still taking place this hour - though I obviously can't
>> >> > restart onto a newer version without losing this case.  Is there
>> >> > anything I can do in the current configuration to try and debug
>> this
>> >> > situation?
>> >>
>> >> if you can strace each thread for a few seconds it may give you an
>> idea
>> >> what's happening.
>> >
>> > This is a good suggestion.
>> >
>> > It would also potentially be enlightening to attach gdb to the
>> process at
>> > various points in time and get a backtrace from all running threads.
>> >
>> > Other than that, there is little you can do on a running version. If
>> it were
>> > compiled for debug, and the environment setup so that we could
>> capture
>> > stdout/stderr, sending SIGUSR2 would yield to much the same
>> information as
>> > the gdb backtrace BUT when runtime instrumentation is enabled, the
>> build is
>> > operating at a third to a quarter of its normal speed.
>> >
>> >> in the meantime you need to stop the system from getting further
>> >> behind,
>> >> can you redirect or reconfigure the senders so that they are not
>> >> sending
>> >> new logs to this box so that it can dig itself out (stopping the
>> input
>> >> may
>> >> be enough by itself, rsyslog has historicly done a LOT of locking
>> >> around
>> >> the main queue, and if you have a full queue with more data trying
>> to
>> >> be
>> >> delivered it can really slow things down. I wouldn't expect it to be
>> >> this
>> >> much, but it could be part of it)
>> >>
>> >
>> > This may clean up things, but I really doubt it, given the magnitude
>> of
>> > delays. Also, Aaron runs 4.4.5, which already has lots of the locking
>> > removed/restructure (not to compare with the recent v5-devel, but
>> much more
>> > effcient than early v4 and v3).
>> >
>> > Rainer
>> >> David Lang
>> >>
>> >> > (We're up to 18:46:51 now!)
>> >> >
>> >> > -Aaron
>> >> >
>> >> > On Tue, Nov 3, 2009 at 1:46 AM, Rainer Gerhards
>> >> > <[email protected]> wrote:
>> >> >> mhhh... This is obviously not intended behavior. There are some
>> >> rate-limiting
>> >> >> features that can deliberately slow down a queue, but I guess you
>> >> have not
>> >> >> configured these. So there either is a bug that activates them at
>> >> some point
>> >> >> during processing (e.g. an invalid memory access could do that)
>> or
>> >> there is
>> >> >> some other bug that causes the dequeue to happen very slowly. In
>> any
>> >> case, it
>> >> >> is hard to guess.
>> >> >>
>> >> >> Given the volume of the messages, a debug log probably does not
>> >> help. We
>> >> >> could gain a little insight in at least the queue sizes that
>> rsyslog
>> >> sees via
>> >> >> imdiag plus the (very rudamentary) v5 debug front-end (it doesn't
>> >> require the
>> >> >> engine to be v5!). I would need to spend at least a little work
>> on
>> >> the
>> >> >> front-end to enable remote access, but that's not really a
>> problem.
>> >> >>
>> >> >> One other thing is that I am still holding a v4-beta with
>> additional
>> >> fixes as
>> >> >> I didn't want to put these in the middle of some other debugging.
>> >> However,
>> >> >> these fixes address potential memory problems, so the most
>> >> appropriate course
>> >> >> of action would be to give that version at least a try. It needs
>> to
>> >> be
>> >> >> released in the next days in any case.
>> >> >>
>> >> >> I have uploaded that (pre-4.5.6) version so that you can give it
>> a
>> >> try if you
>> >> >> like:
>> >> >>
>> >> >> http://www.rsyslog.com/download/rsyslog/pre/rsyslog-4.5.6.tar.gz
>> >> >>
>> >> >> I think it would good if you could try it at least once, because
>> I
>> >> think any
>> >> >> other troubleshooting will boil down to looking at the fixes this
>> >> version
>> >> >> contains.
>> >> >>
>> >> >> Rainer
>> >> >>
>> >> >>> -----Original Message-----
>> >> >>> From: [email protected] [mailto:rsyslog-
>> >> >>> [email protected]] On Behalf Of Aaron Wiebe
>> >> >>> Sent: Monday, November 02, 2009 11:52 PM
>> >> >>> To: rsyslog-users
>> >> >>> Subject: [rsyslog] Queuing bug (4.5.5)
>> >> >>>
>> >> >>> Greetings all,
>> >> >>>
>> >> >>> I appear to have run into an issue with 4.5.5 where queues are
>> not
>> >> >>> being flushed in a timely manner.  In this specific case, I have
>> >> data
>> >> >>> from last wednesday that is being written to disk in small
>> chunks
>> >> >>> today since last wednesday.  Unfortunately I cannot be too
>> specific
>> >> in
>> >> >>> details, but here is what I can see:
>> >> >>>
>> >> >>> Using omfile+gzip, there appears to have been a decent burst in
>> >> >>> traffic sometime last wednesday.  The rsyslog instance has grown
>> to
>> >> >>> 1.8GB of memory and stayed there for a while.  I have both main
>> >> >>> message and action queues defined using fixedarray, and I see no
>> >> >>> on-disk queues (appears to be entirely in memory).
>> >> >>>
>> >> >>> I've got templates writing out to filenames using an hourly
>> >> timestamp
>> >> >>> (filenames like:  [token]-[host]-YYYYMMDD-HH.txt.gz)  In some of
>> >> those
>> >> >>> files, all of them less than 5k in size, there are between 5 and
>> 15
>> >> >>> lines of content, all of them from last wednesday, and within a
>> few
>> >> >>> seconds of each other.  It's almost like there was a significant
>> >> queue
>> >> >>> built up, the hour rolled over, and only the first block of
>> lines
>> >> were
>> >> >>> pulled from the queue.  Then the hour rolled over again, and
>> >> another
>> >> >>> block of lines were pulled from the queue.  Then the next hour,
>> >> then
>> >> >>> another 5-15 lines.
>> >> >>>
>> >> >>> Is it possible that one of the queues still has a good chunk of
>> >> data
>> >> >>> built up, and is flushing it out very slowly?  It hasn't been
>> >> >>> consistantly at the top of the hour either, and not every hour.
>> >>  But
>> >> >>> the log lines themselves are sequentially written out, and
>> usually
>> >> the
>> >> >>> lines are within a few seconds of each other.
>> >> >>>
>> >> >>> For example:
>> >> >>>
>> >> >>> syslog-myhost-20091102-18.txt.gz:  3 lines, 2 with TS Oct 21
>> >> 18:46:34
>> >> >>> and one 18:46:35
>> >> >>> syslog-myhost-20091102-19.txt.gz:  17 lines, 3 Oct 21 18:46:35,
>> 14
>> >> >>> lines Oct 21 18:46:36
>> >> >>> syslog-myhost-20091102-20.txt.gz:  12 lines, 8 Oct 21 18:46:36,
>> 4
>> >> >>> lines Oct 21 18:46:37
>> >> >>>
>> >> >>> Thoughts?
>> >> >>> -Aaron
>> >> >>> _______________________________________________
>> >> >>> rsyslog mailing list
>> >> >>> http://lists.adiscon.net/mailman/listinfo/rsyslog
>> >> >>> http://www.rsyslog.com
>> >> >> _______________________________________________
>> >> >> rsyslog mailing list
>> >> >> http://lists.adiscon.net/mailman/listinfo/rsyslog
>> >> >> http://www.rsyslog.com
>> >> >>
>> >> > _______________________________________________
>> >> > rsyslog mailing list
>> >> > http://lists.adiscon.net/mailman/listinfo/rsyslog
>> >> > http://www.rsyslog.com
>> >> >
>> > _______________________________________________
>> > rsyslog mailing list
>> > http://lists.adiscon.net/mailman/listinfo/rsyslog
>> > http://www.rsyslog.com
>> >
>> _______________________________________________
>> rsyslog mailing list
>> http://lists.adiscon.net/mailman/listinfo/rsyslog
>> http://www.rsyslog.com
> _______________________________________________
> rsyslog mailing list
> http://lists.adiscon.net/mailman/listinfo/rsyslog
> http://www.rsyslog.com
>
_______________________________________________
rsyslog mailing list
http://lists.adiscon.net/mailman/listinfo/rsyslog
http://www.rsyslog.com

Reply via email to