first shot at the data: this looks like a full queue where the enqueue operation is waiting on a drain that does not happen (fast enough). Of course, that doesn't explain why this happens...
Rainer > -----Original Message----- > From: [email protected] [mailto:rsyslog- > [email protected]] On Behalf Of Aaron Wiebe > Sent: Tuesday, November 03, 2009 6:27 PM > To: rsyslog-users > Subject: Re: [rsyslog] Queuing bug (4.5.5) > > Ok - I captured an strace. This appears to be the action thread. > This specific set of calls took minutes. I'll re-run this with -t > and/or -T. > > Note that this syslog instance is not actually receiving any data right > now. > > -Aaron > > 26365 --- SIGSTOP (Stopped (signal)) @ 0 (0) --- > 26365 futex(0x1bec924, FUTEX_WAIT, 97, NULL <unfinished ...> > 26365 <... futex resumed> ) = 0 > 26365 clock_gettime(CLOCK_REALTIME, {1257268800, 165944000}) = 0 > 26365 futex(0x1bec8c8, FUTEX_WAKE, 1) = 0 > 26365 clock_gettime(CLOCK_REALTIME, {1257268800, 166035000}) = 0 > 26365 futex(0x1bec924, FUTEX_WAIT, 99, {1, 999909000} <unfinished ...> > 26365 <... futex resumed> ) = -1 ETIMEDOUT (Connection > timed out) > 26365 write(165, "Oct 29 10:01:42 R6002 WT-GARULF6"..., 229) = 229 > 26365 futex(0x1bec8c8, FUTEX_WAKE, 1) = 0 > 26365 clock_gettime(CLOCK_REALTIME, {1257268802, 173123000}) = 0 > 26365 futex(0x1bec924, FUTEX_WAIT, 101, NULL <unfinished ...> > 26365 <... futex resumed> ) = 0 > 26365 clock_gettime(CLOCK_REALTIME, {1257268893, 439872000}) = 0 > 26365 futex(0x1bec8c8, FUTEX_WAKE, 1) = 0 > 26365 clock_gettime(CLOCK_REALTIME, {1257268893, 439999000}) = 0 > 26365 futex(0x1bec924, FUTEX_WAIT, 103, {1, 999873000} <unfinished ...> > 26365 <... futex resumed> ) = -1 ETIMEDOUT (Connection > timed out) > 26365 write(165, "Oct 29 10:01:42 R6002 WT-GARULF6"..., 248) = 248 > 26365 futex(0x1bec8c8, FUTEX_WAKE, 1) = 0 > 26365 clock_gettime(CLOCK_REALTIME, {1257268895, 443096000}) = 0 > 26365 futex(0x1bec924, FUTEX_WAIT, 105, NULL <unfinished ...> > 26365 <... futex resumed> ) = 0 > 26365 clock_gettime(CLOCK_REALTIME, <unfinished ...> > 26365 <... clock_gettime resumed> {1257268950, 612585000}) = 0 > 26365 futex(0x1bec8c8, FUTEX_WAKE, 1 <unfinished ...> > 26365 <... futex resumed> ) = 0 > 26365 clock_gettime(CLOCK_REALTIME, {1257268950, 612757000}) = 0 > 26365 futex(0x1bec924, FUTEX_WAIT, 107, {1, 999828000} <unfinished ...> > 26365 <... futex resumed> ) = -1 ETIMEDOUT (Connection > timed out) > 26365 write(165, "Oct 29 10:01:42 R6002 WT-GARULF6"..., 180) = 180 > 26365 futex(0x1bec8c8, FUTEX_WAKE, 1) = 0 > 26365 clock_gettime(CLOCK_REALTIME, {1257268952, 619980000}) = 0 > 26365 futex(0x1bec924, FUTEX_WAIT, 109, NULL <unfinished ...> > 26365 <... futex resumed> ) = 0 > 26365 futex(0x1bec8c8, FUTEX_WAIT, 2, NULL <unfinished ...> > 26365 <... futex resumed> ) = -1 EAGAIN (Resource > temporarily unavailable) > 26365 clock_gettime(CLOCK_REALTIME, <unfinished ...> > 26365 <... clock_gettime resumed> {1257268956, 628783000}) = 0 > 26365 futex(0x1bec8c8, FUTEX_WAKE, 1 <unfinished ...> > 26365 <... futex resumed> ) = 0 > 26365 clock_gettime(CLOCK_REALTIME, <unfinished ...> > 26365 <... clock_gettime resumed> {1257268956, 628953000}) = 0 > 26365 futex(0x1bec924, FUTEX_WAIT, 111, {1, 999830000} <unfinished ...> > 26365 <... futex resumed> ) = -1 ETIMEDOUT (Connection > timed out) > 26365 write(165, "Oct 29 10:01:42 R6002 WT-GARULF6"..., 335) = 335 > 26365 futex(0x1bec8c8, FUTEX_WAKE, 1) = 0 > 26365 clock_gettime(CLOCK_REALTIME, {1257268958, 636092000}) = 0 > 26365 futex(0x1bec924, FUTEX_WAIT, 113, NULL <unfinished ...> > > > On Tue, Nov 3, 2009 at 10:25 AM, Rainer Gerhards > <[email protected]> wrote: > >> -----Original Message----- > >> From: [email protected] [mailto:rsyslog- > >> [email protected]] On Behalf Of [email protected] > >> Sent: Tuesday, November 03, 2009 4:06 PM > >> To: rsyslog-users > >> Subject: Re: [rsyslog] Queuing bug (4.5.5) > >> > >> On Tue, 3 Nov 2009, Aaron Wiebe wrote: > >> > >> > This is still taking place this hour - though I obviously can't > >> > restart onto a newer version without losing this case. Is there > >> > anything I can do in the current configuration to try and debug > this > >> > situation? > >> > >> if you can strace each thread for a few seconds it may give you an > idea > >> what's happening. > > > > This is a good suggestion. > > > > It would also potentially be enlightening to attach gdb to the > process at > > various points in time and get a backtrace from all running threads. > > > > Other than that, there is little you can do on a running version. If > it were > > compiled for debug, and the environment setup so that we could > capture > > stdout/stderr, sending SIGUSR2 would yield to much the same > information as > > the gdb backtrace BUT when runtime instrumentation is enabled, the > build is > > operating at a third to a quarter of its normal speed. > > > >> in the meantime you need to stop the system from getting further > >> behind, > >> can you redirect or reconfigure the senders so that they are not > >> sending > >> new logs to this box so that it can dig itself out (stopping the > input > >> may > >> be enough by itself, rsyslog has historicly done a LOT of locking > >> around > >> the main queue, and if you have a full queue with more data trying > to > >> be > >> delivered it can really slow things down. I wouldn't expect it to be > >> this > >> much, but it could be part of it) > >> > > > > This may clean up things, but I really doubt it, given the magnitude > of > > delays. Also, Aaron runs 4.4.5, which already has lots of the locking > > removed/restructure (not to compare with the recent v5-devel, but > much more > > effcient than early v4 and v3). > > > > Rainer > >> David Lang > >> > >> > (We're up to 18:46:51 now!) > >> > > >> > -Aaron > >> > > >> > On Tue, Nov 3, 2009 at 1:46 AM, Rainer Gerhards > >> > <[email protected]> wrote: > >> >> mhhh... This is obviously not intended behavior. There are some > >> rate-limiting > >> >> features that can deliberately slow down a queue, but I guess you > >> have not > >> >> configured these. So there either is a bug that activates them at > >> some point > >> >> during processing (e.g. an invalid memory access could do that) > or > >> there is > >> >> some other bug that causes the dequeue to happen very slowly. In > any > >> case, it > >> >> is hard to guess. > >> >> > >> >> Given the volume of the messages, a debug log probably does not > >> help. We > >> >> could gain a little insight in at least the queue sizes that > rsyslog > >> sees via > >> >> imdiag plus the (very rudamentary) v5 debug front-end (it doesn't > >> require the > >> >> engine to be v5!). I would need to spend at least a little work > on > >> the > >> >> front-end to enable remote access, but that's not really a > problem. > >> >> > >> >> One other thing is that I am still holding a v4-beta with > additional > >> fixes as > >> >> I didn't want to put these in the middle of some other debugging. > >> However, > >> >> these fixes address potential memory problems, so the most > >> appropriate course > >> >> of action would be to give that version at least a try. It needs > to > >> be > >> >> released in the next days in any case. > >> >> > >> >> I have uploaded that (pre-4.5.6) version so that you can give it > a > >> try if you > >> >> like: > >> >> > >> >> http://www.rsyslog.com/download/rsyslog/pre/rsyslog-4.5.6.tar.gz > >> >> > >> >> I think it would good if you could try it at least once, because > I > >> think any > >> >> other troubleshooting will boil down to looking at the fixes this > >> version > >> >> contains. > >> >> > >> >> Rainer > >> >> > >> >>> -----Original Message----- > >> >>> From: [email protected] [mailto:rsyslog- > >> >>> [email protected]] On Behalf Of Aaron Wiebe > >> >>> Sent: Monday, November 02, 2009 11:52 PM > >> >>> To: rsyslog-users > >> >>> Subject: [rsyslog] Queuing bug (4.5.5) > >> >>> > >> >>> Greetings all, > >> >>> > >> >>> I appear to have run into an issue with 4.5.5 where queues are > not > >> >>> being flushed in a timely manner. In this specific case, I have > >> data > >> >>> from last wednesday that is being written to disk in small > chunks > >> >>> today since last wednesday. Unfortunately I cannot be too > specific > >> in > >> >>> details, but here is what I can see: > >> >>> > >> >>> Using omfile+gzip, there appears to have been a decent burst in > >> >>> traffic sometime last wednesday. The rsyslog instance has grown > to > >> >>> 1.8GB of memory and stayed there for a while. I have both main > >> >>> message and action queues defined using fixedarray, and I see no > >> >>> on-disk queues (appears to be entirely in memory). > >> >>> > >> >>> I've got templates writing out to filenames using an hourly > >> timestamp > >> >>> (filenames like: [token]-[host]-YYYYMMDD-HH.txt.gz) In some of > >> those > >> >>> files, all of them less than 5k in size, there are between 5 and > 15 > >> >>> lines of content, all of them from last wednesday, and within a > few > >> >>> seconds of each other. It's almost like there was a significant > >> queue > >> >>> built up, the hour rolled over, and only the first block of > lines > >> were > >> >>> pulled from the queue. Then the hour rolled over again, and > >> another > >> >>> block of lines were pulled from the queue. Then the next hour, > >> then > >> >>> another 5-15 lines. > >> >>> > >> >>> Is it possible that one of the queues still has a good chunk of > >> data > >> >>> built up, and is flushing it out very slowly? It hasn't been > >> >>> consistantly at the top of the hour either, and not every hour. > >> But > >> >>> the log lines themselves are sequentially written out, and > usually > >> the > >> >>> lines are within a few seconds of each other. > >> >>> > >> >>> For example: > >> >>> > >> >>> syslog-myhost-20091102-18.txt.gz: 3 lines, 2 with TS Oct 21 > >> 18:46:34 > >> >>> and one 18:46:35 > >> >>> syslog-myhost-20091102-19.txt.gz: 17 lines, 3 Oct 21 18:46:35, > 14 > >> >>> lines Oct 21 18:46:36 > >> >>> syslog-myhost-20091102-20.txt.gz: 12 lines, 8 Oct 21 18:46:36, > 4 > >> >>> lines Oct 21 18:46:37 > >> >>> > >> >>> Thoughts? > >> >>> -Aaron > >> >>> _______________________________________________ > >> >>> rsyslog mailing list > >> >>> http://lists.adiscon.net/mailman/listinfo/rsyslog > >> >>> http://www.rsyslog.com > >> >> _______________________________________________ > >> >> rsyslog mailing list > >> >> http://lists.adiscon.net/mailman/listinfo/rsyslog > >> >> http://www.rsyslog.com > >> >> > >> > _______________________________________________ > >> > rsyslog mailing list > >> > http://lists.adiscon.net/mailman/listinfo/rsyslog > >> > http://www.rsyslog.com > >> > > > _______________________________________________ > > rsyslog mailing list > > http://lists.adiscon.net/mailman/listinfo/rsyslog > > http://www.rsyslog.com > > > _______________________________________________ > rsyslog mailing list > http://lists.adiscon.net/mailman/listinfo/rsyslog > http://www.rsyslog.com _______________________________________________ rsyslog mailing list http://lists.adiscon.net/mailman/listinfo/rsyslog http://www.rsyslog.com

