I'll see what I can do.  I'm actually somewhat tied up with other
stuff this week, but I may be able to find time to update.

-Aaron

On Tue, Nov 3, 2009 at 12:38 PM, Rainer Gerhards
<[email protected]> wrote:
> It really looks like a full queue. The 2-second wait is the default wait
> interval before discarding a message when a queue is full. So for some reason
> the action queue seems to think it is full, so that the main queue worker can
> no longer add any data to it. Out of the strace, I unfortunately do not see
> why this happens.
>
> If possible, it would be good to try the 4.5.6 release, as this may actually
> be caused by the memory bug that is solved in that release. If it doesn't
> help, we would need to think about a way to get more in-depth info when the
> problem happens. I have an idea, but need to check if it can be done.
>
> Rainer
>
>> -----Original Message-----
>> From: [email protected] [mailto:rsyslog-
>> [email protected]] On Behalf Of Aaron Wiebe
>> Sent: Tuesday, November 03, 2009 6:34 PM
>> To: rsyslog-users
>> Subject: Re: [rsyslog] Queuing bug (4.5.5)
>>
>> Here's one with times:
>>
>> 26365 17:27:59.996584 --- SIGSTOP (Stopped (signal)) @ 0 (0) ---
>> 26365 17:28:00.002436 futex(0x1bec924, FUTEX_WAIT, 121, NULL
>> <unfinished ...>
>> 26365 17:28:00.003810 <... futex resumed> ) = 0 <0.001356>
>> 26365 17:28:00.003871 clock_gettime(CLOCK_REALTIME,  <unfinished ...>
>> 26365 17:28:00.003934 <... clock_gettime resumed> {1257269280,
>> 3889000}) = 0 <0.000047>
>> 26365 17:28:00.004001 futex(0x1bec8c8, FUTEX_WAKE, 1 <unfinished ...>
>> 26365 17:28:00.004065 <... futex resumed> ) = 0 <0.000046>
>> 26365 17:28:00.004129 clock_gettime(CLOCK_REALTIME, {1257269280,
>> 4147000}) = 0 <0.000013>
>> 26365 17:28:00.004185 futex(0x1bec924, FUTEX_WAIT, 123, {1, 999742000}
>> <unfinished ...>
>> 26365 17:28:02.006675 <... futex resumed> ) = -1 ETIMEDOUT (Connection
>> timed out) <2.002462>
>> 26365 17:28:02.006797 write(165, "Oct 29 10:01:42 R6002
>> WT-GARULF6"..., 250) = 250 <0.000305>
>> 26365 17:28:02.007183 futex(0x1bec8c8, FUTEX_WAKE, 1) = 0 <0.000028>
>> 26365 17:28:02.007264 clock_gettime(CLOCK_REALTIME, {1257269282,
>> 7292000}) = 0 <0.000027>
>> 26365 17:28:02.007349 futex(0x1bec924, FUTEX_WAIT, 125, NULL
>> <unfinished ...>
>> 26365 17:30:15.414854 <... futex resumed> ) = 0 <133.407486>
>> 26365 17:30:15.414914 clock_gettime(CLOCK_REALTIME,  <unfinished ...>
>> 26365 17:30:15.414978 <... clock_gettime resumed> {1257269415,
>> 414932000}) = 0 <0.000048>
>> 26365 17:30:15.415044 futex(0x1bec8c8, FUTEX_WAKE, 1 <unfinished ...>
>> 26365 17:30:15.415108 <... futex resumed> ) = 0 <0.000047>
>> 26365 17:30:15.415173 clock_gettime(CLOCK_REALTIME, {1257269415,
>> 415191000}) = 0 <0.000013>
>> 26365 17:30:15.415229 futex(0x1bec924, FUTEX_WAIT, 127, {1, 999741000}
>> <unfinished ...>
>> 26365 17:30:17.417472 <... futex resumed> ) = -1 ETIMEDOUT (Connection
>> timed out) <2.002220>
>> 26365 17:30:17.417540 write(165, "Oct 29 10:01:42 R6002
>> WT-GARULF6"..., 197) = 197 <0.000279>
>> 26365 17:30:17.417886 futex(0x1bec8c8, FUTEX_WAKE, 1) = 0 <0.000027>
>> 26365 17:30:17.417965 clock_gettime(CLOCK_REALTIME, {1257269417,
>> 417992000}) = 0 <0.000027>
>> 26365 17:30:17.418047 futex(0x1bec924, FUTEX_WAIT, 129, NULL
>> <unfinished ...>
>> 26365 17:31:15.608148 <... futex resumed> ) = 0 <58.190082>
>> 26365 17:31:15.608209 clock_gettime(CLOCK_REALTIME,  <unfinished ...>
>> 26365 17:31:15.608272 <... clock_gettime resumed> {1257269475,
>> 608226000}) = 0 <0.000047>
>> 26365 17:31:15.608339 futex(0x1bec8c8, FUTEX_WAKE, 1 <unfinished ...>
>> 26365 17:31:15.608403 <... futex resumed> ) = 0 <0.000048>
>> 26365 17:31:15.608468 clock_gettime(CLOCK_REALTIME, {1257269475,
>> 608486000}) = 0 <0.000014>
>> 26365 17:31:15.608524 futex(0x1bec924, FUTEX_WAIT, 131, {1, 999740000}
>> <unfinished ...>
>>
>> It looks like the locks are waiting a -very- long time.
>>
>> -Aaron
>>
>> On Tue, Nov 3, 2009 at 12:30 PM, Rainer Gerhards
>> <[email protected]> wrote:
>> > first shot at the data: this looks like a full queue where the
>> enqueue
>> > operation is waiting on a drain that does not happen (fast enough).
>> Of
>> > course, that doesn't explain why this happens...
>> >
>> > Rainer
>> >
>> >> -----Original Message-----
>> >> From: [email protected] [mailto:rsyslog-
>> >> [email protected]] On Behalf Of Aaron Wiebe
>> >> Sent: Tuesday, November 03, 2009 6:27 PM
>> >> To: rsyslog-users
>> >> Subject: Re: [rsyslog] Queuing bug (4.5.5)
>> >>
>> >> Ok - I captured an strace.  This appears to be the action thread.
>> >> This specific set of calls took minutes.  I'll re-run this with -t
>> >> and/or -T.
>> >>
>> >> Note that this syslog instance is not actually receiving any data
>> right
>> >> now.
>> >>
>> >> -Aaron
>> >>
>> >> 26365 --- SIGSTOP (Stopped (signal)) @ 0 (0) ---
>> >> 26365 futex(0x1bec924, FUTEX_WAIT, 97, NULL <unfinished ...>
>> >> 26365 <... futex resumed> )             = 0
>> >> 26365 clock_gettime(CLOCK_REALTIME, {1257268800, 165944000}) = 0
>> >> 26365 futex(0x1bec8c8, FUTEX_WAKE, 1)   = 0
>> >> 26365 clock_gettime(CLOCK_REALTIME, {1257268800, 166035000}) = 0
>> >> 26365 futex(0x1bec924, FUTEX_WAIT, 99, {1, 999909000} <unfinished
>> ...>
>> >> 26365 <... futex resumed> )             = -1 ETIMEDOUT (Connection
>> >> timed out)
>> >> 26365 write(165, "Oct 29 10:01:42 R6002 WT-GARULF6"..., 229) = 229
>> >> 26365 futex(0x1bec8c8, FUTEX_WAKE, 1)   = 0
>> >> 26365 clock_gettime(CLOCK_REALTIME, {1257268802, 173123000}) = 0
>> >> 26365 futex(0x1bec924, FUTEX_WAIT, 101, NULL <unfinished ...>
>> >> 26365 <... futex resumed> )             = 0
>> >> 26365 clock_gettime(CLOCK_REALTIME, {1257268893, 439872000}) = 0
>> >> 26365 futex(0x1bec8c8, FUTEX_WAKE, 1)   = 0
>> >> 26365 clock_gettime(CLOCK_REALTIME, {1257268893, 439999000}) = 0
>> >> 26365 futex(0x1bec924, FUTEX_WAIT, 103, {1, 999873000} <unfinished
>> ...>
>> >> 26365 <... futex resumed> )             = -1 ETIMEDOUT (Connection
>> >> timed out)
>> >> 26365 write(165, "Oct 29 10:01:42 R6002 WT-GARULF6"..., 248) = 248
>> >> 26365 futex(0x1bec8c8, FUTEX_WAKE, 1)   = 0
>> >> 26365 clock_gettime(CLOCK_REALTIME, {1257268895, 443096000}) = 0
>> >> 26365 futex(0x1bec924, FUTEX_WAIT, 105, NULL <unfinished ...>
>> >> 26365 <... futex resumed> )             = 0
>> >> 26365 clock_gettime(CLOCK_REALTIME,  <unfinished ...>
>> >> 26365 <... clock_gettime resumed> {1257268950, 612585000}) = 0
>> >> 26365 futex(0x1bec8c8, FUTEX_WAKE, 1 <unfinished ...>
>> >> 26365 <... futex resumed> )             = 0
>> >> 26365 clock_gettime(CLOCK_REALTIME, {1257268950, 612757000}) = 0
>> >> 26365 futex(0x1bec924, FUTEX_WAIT, 107, {1, 999828000} <unfinished
>> ...>
>> >> 26365 <... futex resumed> )             = -1 ETIMEDOUT (Connection
>> >> timed out)
>> >> 26365 write(165, "Oct 29 10:01:42 R6002 WT-GARULF6"..., 180) = 180
>> >> 26365 futex(0x1bec8c8, FUTEX_WAKE, 1)   = 0
>> >> 26365 clock_gettime(CLOCK_REALTIME, {1257268952, 619980000}) = 0
>> >> 26365 futex(0x1bec924, FUTEX_WAIT, 109, NULL <unfinished ...>
>> >> 26365 <... futex resumed> )             = 0
>> >> 26365 futex(0x1bec8c8, FUTEX_WAIT, 2, NULL <unfinished ...>
>> >> 26365 <... futex resumed> )             = -1 EAGAIN (Resource
>> >> temporarily unavailable)
>> >> 26365 clock_gettime(CLOCK_REALTIME,  <unfinished ...>
>> >> 26365 <... clock_gettime resumed> {1257268956, 628783000}) = 0
>> >> 26365 futex(0x1bec8c8, FUTEX_WAKE, 1 <unfinished ...>
>> >> 26365 <... futex resumed> )             = 0
>> >> 26365 clock_gettime(CLOCK_REALTIME,  <unfinished ...>
>> >> 26365 <... clock_gettime resumed> {1257268956, 628953000}) = 0
>> >> 26365 futex(0x1bec924, FUTEX_WAIT, 111, {1, 999830000} <unfinished
>> ...>
>> >> 26365 <... futex resumed> )             = -1 ETIMEDOUT (Connection
>> >> timed out)
>> >> 26365 write(165, "Oct 29 10:01:42 R6002 WT-GARULF6"..., 335) = 335
>> >> 26365 futex(0x1bec8c8, FUTEX_WAKE, 1)   = 0
>> >> 26365 clock_gettime(CLOCK_REALTIME, {1257268958, 636092000}) = 0
>> >> 26365 futex(0x1bec924, FUTEX_WAIT, 113, NULL <unfinished ...>
>> >>
>> >>
>> >> On Tue, Nov 3, 2009 at 10:25 AM, Rainer Gerhards
>> >> <[email protected]> wrote:
>> >> >> -----Original Message-----
>> >> >> From: [email protected] [mailto:rsyslog-
>> >> >> [email protected]] On Behalf Of [email protected]
>> >> >> Sent: Tuesday, November 03, 2009 4:06 PM
>> >> >> To: rsyslog-users
>> >> >> Subject: Re: [rsyslog] Queuing bug (4.5.5)
>> >> >>
>> >> >> On Tue, 3 Nov 2009, Aaron Wiebe wrote:
>> >> >>
>> >> >> > This is still taking place this hour - though I obviously can't
>> >> >> > restart onto a newer version without losing this case.  Is
>> there
>> >> >> > anything I can do in the current configuration to try and debug
>> >> this
>> >> >> > situation?
>> >> >>
>> >> >> if you can strace each thread for a few seconds it may give you
>> an
>> >> idea
>> >> >> what's happening.
>> >> >
>> >> > This is a good suggestion.
>> >> >
>> >> > It would also potentially be enlightening to attach gdb to the
>> >> process at
>> >> > various points in time and get a backtrace from all running
>> threads.
>> >> >
>> >> > Other than that, there is little you can do on a running version.
>> If
>> >> it were
>> >> > compiled for debug, and the environment setup so that we could
>> >> capture
>> >> > stdout/stderr, sending SIGUSR2 would yield to much the same
>> >> information as
>> >> > the gdb backtrace BUT when runtime instrumentation is enabled, the
>> >> build is
>> >> > operating at a third to a quarter of its normal speed.
>> >> >
>> >> >> in the meantime you need to stop the system from getting further
>> >> >> behind,
>> >> >> can you redirect or reconfigure the senders so that they are not
>> >> >> sending
>> >> >> new logs to this box so that it can dig itself out (stopping the
>> >> input
>> >> >> may
>> >> >> be enough by itself, rsyslog has historicly done a LOT of locking
>> >> >> around
>> >> >> the main queue, and if you have a full queue with more data
>> trying
>> >> to
>> >> >> be
>> >> >> delivered it can really slow things down. I wouldn't expect it to
>> be
>> >> >> this
>> >> >> much, but it could be part of it)
>> >> >>
>> >> >
>> >> > This may clean up things, but I really doubt it, given the
>> magnitude
>> >> of
>> >> > delays. Also, Aaron runs 4.4.5, which already has lots of the
>> locking
>> >> > removed/restructure (not to compare with the recent v5-devel, but
>> >> much more
>> >> > effcient than early v4 and v3).
>> >> >
>> >> > Rainer
>> >> >> David Lang
>> >> >>
>> >> >> > (We're up to 18:46:51 now!)
>> >> >> >
>> >> >> > -Aaron
>> >> >> >
>> >> >> > On Tue, Nov 3, 2009 at 1:46 AM, Rainer Gerhards
>> >> >> > <[email protected]> wrote:
>> >> >> >> mhhh... This is obviously not intended behavior. There are
>> some
>> >> >> rate-limiting
>> >> >> >> features that can deliberately slow down a queue, but I guess
>> you
>> >> >> have not
>> >> >> >> configured these. So there either is a bug that activates them
>> at
>> >> >> some point
>> >> >> >> during processing (e.g. an invalid memory access could do
>> that)
>> >> or
>> >> >> there is
>> >> >> >> some other bug that causes the dequeue to happen very slowly.
>> In
>> >> any
>> >> >> case, it
>> >> >> >> is hard to guess.
>> >> >> >>
>> >> >> >> Given the volume of the messages, a debug log probably does
>> not
>> >> >> help. We
>> >> >> >> could gain a little insight in at least the queue sizes that
>> >> rsyslog
>> >> >> sees via
>> >> >> >> imdiag plus the (very rudamentary) v5 debug front-end (it
>> doesn't
>> >> >> require the
>> >> >> >> engine to be v5!). I would need to spend at least a little
>> work
>> >> on
>> >> >> the
>> >> >> >> front-end to enable remote access, but that's not really a
>> >> problem.
>> >> >> >>
>> >> >> >> One other thing is that I am still holding a v4-beta with
>> >> additional
>> >> >> fixes as
>> >> >> >> I didn't want to put these in the middle of some other
>> debugging.
>> >> >> However,
>> >> >> >> these fixes address potential memory problems, so the most
>> >> >> appropriate course
>> >> >> >> of action would be to give that version at least a try. It
>> needs
>> >> to
>> >> >> be
>> >> >> >> released in the next days in any case.
>> >> >> >>
>> >> >> >> I have uploaded that (pre-4.5.6) version so that you can give
>> it
>> >> a
>> >> >> try if you
>> >> >> >> like:
>> >> >> >>
>> >> >> >> http://www.rsyslog.com/download/rsyslog/pre/rsyslog-
>> 4.5.6.tar.gz
>> >> >> >>
>> >> >> >> I think it would good if you could try it at least once,
>> because
>> >> I
>> >> >> think any
>> >> >> >> other troubleshooting will boil down to looking at the fixes
>> this
>> >> >> version
>> >> >> >> contains.
>> >> >> >>
>> >> >> >> Rainer
>> >> >> >>
>> >> >> >>> -----Original Message-----
>> >> >> >>> From: [email protected] [mailto:rsyslog-
>> >> >> >>> [email protected]] On Behalf Of Aaron Wiebe
>> >> >> >>> Sent: Monday, November 02, 2009 11:52 PM
>> >> >> >>> To: rsyslog-users
>> >> >> >>> Subject: [rsyslog] Queuing bug (4.5.5)
>> >> >> >>>
>> >> >> >>> Greetings all,
>> >> >> >>>
>> >> >> >>> I appear to have run into an issue with 4.5.5 where queues
>> are
>> >> not
>> >> >> >>> being flushed in a timely manner.  In this specific case, I
>> have
>> >> >> data
>> >> >> >>> from last wednesday that is being written to disk in small
>> >> chunks
>> >> >> >>> today since last wednesday.  Unfortunately I cannot be too
>> >> specific
>> >> >> in
>> >> >> >>> details, but here is what I can see:
>> >> >> >>>
>> >> >> >>> Using omfile+gzip, there appears to have been a decent burst
>> in
>> >> >> >>> traffic sometime last wednesday.  The rsyslog instance has
>> grown
>> >> to
>> >> >> >>> 1.8GB of memory and stayed there for a while.  I have both
>> main
>> >> >> >>> message and action queues defined using fixedarray, and I see
>> no
>> >> >> >>> on-disk queues (appears to be entirely in memory).
>> >> >> >>>
>> >> >> >>> I've got templates writing out to filenames using an hourly
>> >> >> timestamp
>> >> >> >>> (filenames like:  [token]-[host]-YYYYMMDD-HH.txt.gz)  In some
>> of
>> >> >> those
>> >> >> >>> files, all of them less than 5k in size, there are between 5
>> and
>> >> 15
>> >> >> >>> lines of content, all of them from last wednesday, and within
>> a
>> >> few
>> >> >> >>> seconds of each other.  It's almost like there was a
>> significant
>> >> >> queue
>> >> >> >>> built up, the hour rolled over, and only the first block of
>> >> lines
>> >> >> were
>> >> >> >>> pulled from the queue.  Then the hour rolled over again, and
>> >> >> another
>> >> >> >>> block of lines were pulled from the queue.  Then the next
>> hour,
>> >> >> then
>> >> >> >>> another 5-15 lines.
>> >> >> >>>
>> >> >> >>> Is it possible that one of the queues still has a good chunk
>> of
>> >> >> data
>> >> >> >>> built up, and is flushing it out very slowly?  It hasn't been
>> >> >> >>> consistantly at the top of the hour either, and not every
>> hour.
>> >> >>  But
>> >> >> >>> the log lines themselves are sequentially written out, and
>> >> usually
>> >> >> the
>> >> >> >>> lines are within a few seconds of each other.
>> >> >> >>>
>> >> >> >>> For example:
>> >> >> >>>
>> >> >> >>> syslog-myhost-20091102-18.txt.gz:  3 lines, 2 with TS Oct 21
>> >> >> 18:46:34
>> >> >> >>> and one 18:46:35
>> >> >> >>> syslog-myhost-20091102-19.txt.gz:  17 lines, 3 Oct 21
>> 18:46:35,
>> >> 14
>> >> >> >>> lines Oct 21 18:46:36
>> >> >> >>> syslog-myhost-20091102-20.txt.gz:  12 lines, 8 Oct 21
>> 18:46:36,
>> >> 4
>> >> >> >>> lines Oct 21 18:46:37
>> >> >> >>>
>> >> >> >>> Thoughts?
>> >> >> >>> -Aaron
>> >> >> >>> _______________________________________________
>> >> >> >>> rsyslog mailing list
>> >> >> >>> http://lists.adiscon.net/mailman/listinfo/rsyslog
>> >> >> >>> http://www.rsyslog.com
>> >> >> >> _______________________________________________
>> >> >> >> rsyslog mailing list
>> >> >> >> http://lists.adiscon.net/mailman/listinfo/rsyslog
>> >> >> >> http://www.rsyslog.com
>> >> >> >>
>> >> >> > _______________________________________________
>> >> >> > rsyslog mailing list
>> >> >> > http://lists.adiscon.net/mailman/listinfo/rsyslog
>> >> >> > http://www.rsyslog.com
>> >> >> >
>> >> > _______________________________________________
>> >> > rsyslog mailing list
>> >> > http://lists.adiscon.net/mailman/listinfo/rsyslog
>> >> > http://www.rsyslog.com
>> >> >
>> >> _______________________________________________
>> >> rsyslog mailing list
>> >> http://lists.adiscon.net/mailman/listinfo/rsyslog
>> >> http://www.rsyslog.com
>> > _______________________________________________
>> > rsyslog mailing list
>> > http://lists.adiscon.net/mailman/listinfo/rsyslog
>> > http://www.rsyslog.com
>> >
>> _______________________________________________
>> rsyslog mailing list
>> http://lists.adiscon.net/mailman/listinfo/rsyslog
>> http://www.rsyslog.com
> _______________________________________________
> rsyslog mailing list
> http://lists.adiscon.net/mailman/listinfo/rsyslog
> http://www.rsyslog.com
>
_______________________________________________
rsyslog mailing list
http://lists.adiscon.net/mailman/listinfo/rsyslog
http://www.rsyslog.com

Reply via email to