I'll see what I can do. I'm actually somewhat tied up with other stuff this week, but I may be able to find time to update.
-Aaron On Tue, Nov 3, 2009 at 12:38 PM, Rainer Gerhards <[email protected]> wrote: > It really looks like a full queue. The 2-second wait is the default wait > interval before discarding a message when a queue is full. So for some reason > the action queue seems to think it is full, so that the main queue worker can > no longer add any data to it. Out of the strace, I unfortunately do not see > why this happens. > > If possible, it would be good to try the 4.5.6 release, as this may actually > be caused by the memory bug that is solved in that release. If it doesn't > help, we would need to think about a way to get more in-depth info when the > problem happens. I have an idea, but need to check if it can be done. > > Rainer > >> -----Original Message----- >> From: [email protected] [mailto:rsyslog- >> [email protected]] On Behalf Of Aaron Wiebe >> Sent: Tuesday, November 03, 2009 6:34 PM >> To: rsyslog-users >> Subject: Re: [rsyslog] Queuing bug (4.5.5) >> >> Here's one with times: >> >> 26365 17:27:59.996584 --- SIGSTOP (Stopped (signal)) @ 0 (0) --- >> 26365 17:28:00.002436 futex(0x1bec924, FUTEX_WAIT, 121, NULL >> <unfinished ...> >> 26365 17:28:00.003810 <... futex resumed> ) = 0 <0.001356> >> 26365 17:28:00.003871 clock_gettime(CLOCK_REALTIME, <unfinished ...> >> 26365 17:28:00.003934 <... clock_gettime resumed> {1257269280, >> 3889000}) = 0 <0.000047> >> 26365 17:28:00.004001 futex(0x1bec8c8, FUTEX_WAKE, 1 <unfinished ...> >> 26365 17:28:00.004065 <... futex resumed> ) = 0 <0.000046> >> 26365 17:28:00.004129 clock_gettime(CLOCK_REALTIME, {1257269280, >> 4147000}) = 0 <0.000013> >> 26365 17:28:00.004185 futex(0x1bec924, FUTEX_WAIT, 123, {1, 999742000} >> <unfinished ...> >> 26365 17:28:02.006675 <... futex resumed> ) = -1 ETIMEDOUT (Connection >> timed out) <2.002462> >> 26365 17:28:02.006797 write(165, "Oct 29 10:01:42 R6002 >> WT-GARULF6"..., 250) = 250 <0.000305> >> 26365 17:28:02.007183 futex(0x1bec8c8, FUTEX_WAKE, 1) = 0 <0.000028> >> 26365 17:28:02.007264 clock_gettime(CLOCK_REALTIME, {1257269282, >> 7292000}) = 0 <0.000027> >> 26365 17:28:02.007349 futex(0x1bec924, FUTEX_WAIT, 125, NULL >> <unfinished ...> >> 26365 17:30:15.414854 <... futex resumed> ) = 0 <133.407486> >> 26365 17:30:15.414914 clock_gettime(CLOCK_REALTIME, <unfinished ...> >> 26365 17:30:15.414978 <... clock_gettime resumed> {1257269415, >> 414932000}) = 0 <0.000048> >> 26365 17:30:15.415044 futex(0x1bec8c8, FUTEX_WAKE, 1 <unfinished ...> >> 26365 17:30:15.415108 <... futex resumed> ) = 0 <0.000047> >> 26365 17:30:15.415173 clock_gettime(CLOCK_REALTIME, {1257269415, >> 415191000}) = 0 <0.000013> >> 26365 17:30:15.415229 futex(0x1bec924, FUTEX_WAIT, 127, {1, 999741000} >> <unfinished ...> >> 26365 17:30:17.417472 <... futex resumed> ) = -1 ETIMEDOUT (Connection >> timed out) <2.002220> >> 26365 17:30:17.417540 write(165, "Oct 29 10:01:42 R6002 >> WT-GARULF6"..., 197) = 197 <0.000279> >> 26365 17:30:17.417886 futex(0x1bec8c8, FUTEX_WAKE, 1) = 0 <0.000027> >> 26365 17:30:17.417965 clock_gettime(CLOCK_REALTIME, {1257269417, >> 417992000}) = 0 <0.000027> >> 26365 17:30:17.418047 futex(0x1bec924, FUTEX_WAIT, 129, NULL >> <unfinished ...> >> 26365 17:31:15.608148 <... futex resumed> ) = 0 <58.190082> >> 26365 17:31:15.608209 clock_gettime(CLOCK_REALTIME, <unfinished ...> >> 26365 17:31:15.608272 <... clock_gettime resumed> {1257269475, >> 608226000}) = 0 <0.000047> >> 26365 17:31:15.608339 futex(0x1bec8c8, FUTEX_WAKE, 1 <unfinished ...> >> 26365 17:31:15.608403 <... futex resumed> ) = 0 <0.000048> >> 26365 17:31:15.608468 clock_gettime(CLOCK_REALTIME, {1257269475, >> 608486000}) = 0 <0.000014> >> 26365 17:31:15.608524 futex(0x1bec924, FUTEX_WAIT, 131, {1, 999740000} >> <unfinished ...> >> >> It looks like the locks are waiting a -very- long time. >> >> -Aaron >> >> On Tue, Nov 3, 2009 at 12:30 PM, Rainer Gerhards >> <[email protected]> wrote: >> > first shot at the data: this looks like a full queue where the >> enqueue >> > operation is waiting on a drain that does not happen (fast enough). >> Of >> > course, that doesn't explain why this happens... >> > >> > Rainer >> > >> >> -----Original Message----- >> >> From: [email protected] [mailto:rsyslog- >> >> [email protected]] On Behalf Of Aaron Wiebe >> >> Sent: Tuesday, November 03, 2009 6:27 PM >> >> To: rsyslog-users >> >> Subject: Re: [rsyslog] Queuing bug (4.5.5) >> >> >> >> Ok - I captured an strace. This appears to be the action thread. >> >> This specific set of calls took minutes. I'll re-run this with -t >> >> and/or -T. >> >> >> >> Note that this syslog instance is not actually receiving any data >> right >> >> now. >> >> >> >> -Aaron >> >> >> >> 26365 --- SIGSTOP (Stopped (signal)) @ 0 (0) --- >> >> 26365 futex(0x1bec924, FUTEX_WAIT, 97, NULL <unfinished ...> >> >> 26365 <... futex resumed> ) = 0 >> >> 26365 clock_gettime(CLOCK_REALTIME, {1257268800, 165944000}) = 0 >> >> 26365 futex(0x1bec8c8, FUTEX_WAKE, 1) = 0 >> >> 26365 clock_gettime(CLOCK_REALTIME, {1257268800, 166035000}) = 0 >> >> 26365 futex(0x1bec924, FUTEX_WAIT, 99, {1, 999909000} <unfinished >> ...> >> >> 26365 <... futex resumed> ) = -1 ETIMEDOUT (Connection >> >> timed out) >> >> 26365 write(165, "Oct 29 10:01:42 R6002 WT-GARULF6"..., 229) = 229 >> >> 26365 futex(0x1bec8c8, FUTEX_WAKE, 1) = 0 >> >> 26365 clock_gettime(CLOCK_REALTIME, {1257268802, 173123000}) = 0 >> >> 26365 futex(0x1bec924, FUTEX_WAIT, 101, NULL <unfinished ...> >> >> 26365 <... futex resumed> ) = 0 >> >> 26365 clock_gettime(CLOCK_REALTIME, {1257268893, 439872000}) = 0 >> >> 26365 futex(0x1bec8c8, FUTEX_WAKE, 1) = 0 >> >> 26365 clock_gettime(CLOCK_REALTIME, {1257268893, 439999000}) = 0 >> >> 26365 futex(0x1bec924, FUTEX_WAIT, 103, {1, 999873000} <unfinished >> ...> >> >> 26365 <... futex resumed> ) = -1 ETIMEDOUT (Connection >> >> timed out) >> >> 26365 write(165, "Oct 29 10:01:42 R6002 WT-GARULF6"..., 248) = 248 >> >> 26365 futex(0x1bec8c8, FUTEX_WAKE, 1) = 0 >> >> 26365 clock_gettime(CLOCK_REALTIME, {1257268895, 443096000}) = 0 >> >> 26365 futex(0x1bec924, FUTEX_WAIT, 105, NULL <unfinished ...> >> >> 26365 <... futex resumed> ) = 0 >> >> 26365 clock_gettime(CLOCK_REALTIME, <unfinished ...> >> >> 26365 <... clock_gettime resumed> {1257268950, 612585000}) = 0 >> >> 26365 futex(0x1bec8c8, FUTEX_WAKE, 1 <unfinished ...> >> >> 26365 <... futex resumed> ) = 0 >> >> 26365 clock_gettime(CLOCK_REALTIME, {1257268950, 612757000}) = 0 >> >> 26365 futex(0x1bec924, FUTEX_WAIT, 107, {1, 999828000} <unfinished >> ...> >> >> 26365 <... futex resumed> ) = -1 ETIMEDOUT (Connection >> >> timed out) >> >> 26365 write(165, "Oct 29 10:01:42 R6002 WT-GARULF6"..., 180) = 180 >> >> 26365 futex(0x1bec8c8, FUTEX_WAKE, 1) = 0 >> >> 26365 clock_gettime(CLOCK_REALTIME, {1257268952, 619980000}) = 0 >> >> 26365 futex(0x1bec924, FUTEX_WAIT, 109, NULL <unfinished ...> >> >> 26365 <... futex resumed> ) = 0 >> >> 26365 futex(0x1bec8c8, FUTEX_WAIT, 2, NULL <unfinished ...> >> >> 26365 <... futex resumed> ) = -1 EAGAIN (Resource >> >> temporarily unavailable) >> >> 26365 clock_gettime(CLOCK_REALTIME, <unfinished ...> >> >> 26365 <... clock_gettime resumed> {1257268956, 628783000}) = 0 >> >> 26365 futex(0x1bec8c8, FUTEX_WAKE, 1 <unfinished ...> >> >> 26365 <... futex resumed> ) = 0 >> >> 26365 clock_gettime(CLOCK_REALTIME, <unfinished ...> >> >> 26365 <... clock_gettime resumed> {1257268956, 628953000}) = 0 >> >> 26365 futex(0x1bec924, FUTEX_WAIT, 111, {1, 999830000} <unfinished >> ...> >> >> 26365 <... futex resumed> ) = -1 ETIMEDOUT (Connection >> >> timed out) >> >> 26365 write(165, "Oct 29 10:01:42 R6002 WT-GARULF6"..., 335) = 335 >> >> 26365 futex(0x1bec8c8, FUTEX_WAKE, 1) = 0 >> >> 26365 clock_gettime(CLOCK_REALTIME, {1257268958, 636092000}) = 0 >> >> 26365 futex(0x1bec924, FUTEX_WAIT, 113, NULL <unfinished ...> >> >> >> >> >> >> On Tue, Nov 3, 2009 at 10:25 AM, Rainer Gerhards >> >> <[email protected]> wrote: >> >> >> -----Original Message----- >> >> >> From: [email protected] [mailto:rsyslog- >> >> >> [email protected]] On Behalf Of [email protected] >> >> >> Sent: Tuesday, November 03, 2009 4:06 PM >> >> >> To: rsyslog-users >> >> >> Subject: Re: [rsyslog] Queuing bug (4.5.5) >> >> >> >> >> >> On Tue, 3 Nov 2009, Aaron Wiebe wrote: >> >> >> >> >> >> > This is still taking place this hour - though I obviously can't >> >> >> > restart onto a newer version without losing this case. Is >> there >> >> >> > anything I can do in the current configuration to try and debug >> >> this >> >> >> > situation? >> >> >> >> >> >> if you can strace each thread for a few seconds it may give you >> an >> >> idea >> >> >> what's happening. >> >> > >> >> > This is a good suggestion. >> >> > >> >> > It would also potentially be enlightening to attach gdb to the >> >> process at >> >> > various points in time and get a backtrace from all running >> threads. >> >> > >> >> > Other than that, there is little you can do on a running version. >> If >> >> it were >> >> > compiled for debug, and the environment setup so that we could >> >> capture >> >> > stdout/stderr, sending SIGUSR2 would yield to much the same >> >> information as >> >> > the gdb backtrace BUT when runtime instrumentation is enabled, the >> >> build is >> >> > operating at a third to a quarter of its normal speed. >> >> > >> >> >> in the meantime you need to stop the system from getting further >> >> >> behind, >> >> >> can you redirect or reconfigure the senders so that they are not >> >> >> sending >> >> >> new logs to this box so that it can dig itself out (stopping the >> >> input >> >> >> may >> >> >> be enough by itself, rsyslog has historicly done a LOT of locking >> >> >> around >> >> >> the main queue, and if you have a full queue with more data >> trying >> >> to >> >> >> be >> >> >> delivered it can really slow things down. I wouldn't expect it to >> be >> >> >> this >> >> >> much, but it could be part of it) >> >> >> >> >> > >> >> > This may clean up things, but I really doubt it, given the >> magnitude >> >> of >> >> > delays. Also, Aaron runs 4.4.5, which already has lots of the >> locking >> >> > removed/restructure (not to compare with the recent v5-devel, but >> >> much more >> >> > effcient than early v4 and v3). >> >> > >> >> > Rainer >> >> >> David Lang >> >> >> >> >> >> > (We're up to 18:46:51 now!) >> >> >> > >> >> >> > -Aaron >> >> >> > >> >> >> > On Tue, Nov 3, 2009 at 1:46 AM, Rainer Gerhards >> >> >> > <[email protected]> wrote: >> >> >> >> mhhh... This is obviously not intended behavior. There are >> some >> >> >> rate-limiting >> >> >> >> features that can deliberately slow down a queue, but I guess >> you >> >> >> have not >> >> >> >> configured these. So there either is a bug that activates them >> at >> >> >> some point >> >> >> >> during processing (e.g. an invalid memory access could do >> that) >> >> or >> >> >> there is >> >> >> >> some other bug that causes the dequeue to happen very slowly. >> In >> >> any >> >> >> case, it >> >> >> >> is hard to guess. >> >> >> >> >> >> >> >> Given the volume of the messages, a debug log probably does >> not >> >> >> help. We >> >> >> >> could gain a little insight in at least the queue sizes that >> >> rsyslog >> >> >> sees via >> >> >> >> imdiag plus the (very rudamentary) v5 debug front-end (it >> doesn't >> >> >> require the >> >> >> >> engine to be v5!). I would need to spend at least a little >> work >> >> on >> >> >> the >> >> >> >> front-end to enable remote access, but that's not really a >> >> problem. >> >> >> >> >> >> >> >> One other thing is that I am still holding a v4-beta with >> >> additional >> >> >> fixes as >> >> >> >> I didn't want to put these in the middle of some other >> debugging. >> >> >> However, >> >> >> >> these fixes address potential memory problems, so the most >> >> >> appropriate course >> >> >> >> of action would be to give that version at least a try. It >> needs >> >> to >> >> >> be >> >> >> >> released in the next days in any case. >> >> >> >> >> >> >> >> I have uploaded that (pre-4.5.6) version so that you can give >> it >> >> a >> >> >> try if you >> >> >> >> like: >> >> >> >> >> >> >> >> http://www.rsyslog.com/download/rsyslog/pre/rsyslog- >> 4.5.6.tar.gz >> >> >> >> >> >> >> >> I think it would good if you could try it at least once, >> because >> >> I >> >> >> think any >> >> >> >> other troubleshooting will boil down to looking at the fixes >> this >> >> >> version >> >> >> >> contains. >> >> >> >> >> >> >> >> Rainer >> >> >> >> >> >> >> >>> -----Original Message----- >> >> >> >>> From: [email protected] [mailto:rsyslog- >> >> >> >>> [email protected]] On Behalf Of Aaron Wiebe >> >> >> >>> Sent: Monday, November 02, 2009 11:52 PM >> >> >> >>> To: rsyslog-users >> >> >> >>> Subject: [rsyslog] Queuing bug (4.5.5) >> >> >> >>> >> >> >> >>> Greetings all, >> >> >> >>> >> >> >> >>> I appear to have run into an issue with 4.5.5 where queues >> are >> >> not >> >> >> >>> being flushed in a timely manner. In this specific case, I >> have >> >> >> data >> >> >> >>> from last wednesday that is being written to disk in small >> >> chunks >> >> >> >>> today since last wednesday. Unfortunately I cannot be too >> >> specific >> >> >> in >> >> >> >>> details, but here is what I can see: >> >> >> >>> >> >> >> >>> Using omfile+gzip, there appears to have been a decent burst >> in >> >> >> >>> traffic sometime last wednesday. The rsyslog instance has >> grown >> >> to >> >> >> >>> 1.8GB of memory and stayed there for a while. I have both >> main >> >> >> >>> message and action queues defined using fixedarray, and I see >> no >> >> >> >>> on-disk queues (appears to be entirely in memory). >> >> >> >>> >> >> >> >>> I've got templates writing out to filenames using an hourly >> >> >> timestamp >> >> >> >>> (filenames like: [token]-[host]-YYYYMMDD-HH.txt.gz) In some >> of >> >> >> those >> >> >> >>> files, all of them less than 5k in size, there are between 5 >> and >> >> 15 >> >> >> >>> lines of content, all of them from last wednesday, and within >> a >> >> few >> >> >> >>> seconds of each other. It's almost like there was a >> significant >> >> >> queue >> >> >> >>> built up, the hour rolled over, and only the first block of >> >> lines >> >> >> were >> >> >> >>> pulled from the queue. Then the hour rolled over again, and >> >> >> another >> >> >> >>> block of lines were pulled from the queue. Then the next >> hour, >> >> >> then >> >> >> >>> another 5-15 lines. >> >> >> >>> >> >> >> >>> Is it possible that one of the queues still has a good chunk >> of >> >> >> data >> >> >> >>> built up, and is flushing it out very slowly? It hasn't been >> >> >> >>> consistantly at the top of the hour either, and not every >> hour. >> >> >> But >> >> >> >>> the log lines themselves are sequentially written out, and >> >> usually >> >> >> the >> >> >> >>> lines are within a few seconds of each other. >> >> >> >>> >> >> >> >>> For example: >> >> >> >>> >> >> >> >>> syslog-myhost-20091102-18.txt.gz: 3 lines, 2 with TS Oct 21 >> >> >> 18:46:34 >> >> >> >>> and one 18:46:35 >> >> >> >>> syslog-myhost-20091102-19.txt.gz: 17 lines, 3 Oct 21 >> 18:46:35, >> >> 14 >> >> >> >>> lines Oct 21 18:46:36 >> >> >> >>> syslog-myhost-20091102-20.txt.gz: 12 lines, 8 Oct 21 >> 18:46:36, >> >> 4 >> >> >> >>> lines Oct 21 18:46:37 >> >> >> >>> >> >> >> >>> Thoughts? >> >> >> >>> -Aaron >> >> >> >>> _______________________________________________ >> >> >> >>> rsyslog mailing list >> >> >> >>> http://lists.adiscon.net/mailman/listinfo/rsyslog >> >> >> >>> http://www.rsyslog.com >> >> >> >> _______________________________________________ >> >> >> >> rsyslog mailing list >> >> >> >> http://lists.adiscon.net/mailman/listinfo/rsyslog >> >> >> >> http://www.rsyslog.com >> >> >> >> >> >> >> > _______________________________________________ >> >> >> > rsyslog mailing list >> >> >> > http://lists.adiscon.net/mailman/listinfo/rsyslog >> >> >> > http://www.rsyslog.com >> >> >> > >> >> > _______________________________________________ >> >> > rsyslog mailing list >> >> > http://lists.adiscon.net/mailman/listinfo/rsyslog >> >> > http://www.rsyslog.com >> >> > >> >> _______________________________________________ >> >> rsyslog mailing list >> >> http://lists.adiscon.net/mailman/listinfo/rsyslog >> >> http://www.rsyslog.com >> > _______________________________________________ >> > rsyslog mailing list >> > http://lists.adiscon.net/mailman/listinfo/rsyslog >> > http://www.rsyslog.com >> > >> _______________________________________________ >> rsyslog mailing list >> http://lists.adiscon.net/mailman/listinfo/rsyslog >> http://www.rsyslog.com > _______________________________________________ > rsyslog mailing list > http://lists.adiscon.net/mailman/listinfo/rsyslog > http://www.rsyslog.com > _______________________________________________ rsyslog mailing list http://lists.adiscon.net/mailman/listinfo/rsyslog http://www.rsyslog.com

