On Tue, 3 Nov 2009, Aaron Wiebe wrote:

There is an action queue defined per rule set.. of which there are four.

is it possible that something blocked one of the rule outputs? as I understand it, that would cause that action queue to fill up, and then the behavior of the main queue trying, then sleeping would make sense to me

Here's the tricky part:  input has been stopped.  For days.  The
queues just seem to continue to flush very very slowly.  I have a
feeling that if I increased the traffic flow, the queues would flush
faster.  I'm not 100% sure about that though, and I can't test in this
environment.  Unfortunately, due to outside requirements, I'm going to
have to pull rsyslog out of this deployment until we can sort out
these problems...

did you configure this with HUPisRestart off? if so you may try sending it a HUP and see if that gets a few log messages out. if it does, hammer it with HUPs to drain the queue

it is really a pain to run into problems only in production.

David Lang

I'll work on replicating the issue in another environment, if I can.

-Aaron

On Tue, Nov 3, 2009 at 1:20 PM,  <[email protected]> wrote:
what is the configuration? does it have seperate action queues defined?

if you have some way to stop the input, it would let the main queue drop
down from being full (and eliminate the input modules from needing to lock
it to try and deliver messages)

I did see some times during my testing of 4.x stuff where the throughput
would drop drasticly when the queue was full and there ws more incoming
traffic hammering on it. it would drop from 500K logs/min to ~3k logs/min
with a trivial config, slowing down the input would let it get out of this
mode and speed up again after a little while. so much other stuff was
happening at the time that I don't think that I reported it.

David Lang

On Tue, 3 Nov 2009, Rainer Gerhards wrote:

Date: Tue, 3 Nov 2009 18:38:40 +0100
From: Rainer Gerhards <[email protected]>
Reply-To: rsyslog-users <[email protected]>
To: rsyslog-users <[email protected]>
Subject: Re: [rsyslog] Queuing bug (4.5.5)

It really looks like a full queue. The 2-second wait is the default wait
interval before discarding a message when a queue is full. So for some
reason
the action queue seems to think it is full, so that the main queue worker
can
no longer add any data to it. Out of the strace, I unfortunately do not
see
why this happens.

If possible, it would be good to try the 4.5.6 release, as this may
actually
be caused by the memory bug that is solved in that release. If it doesn't
help, we would need to think about a way to get more in-depth info when
the
problem happens. I have an idea, but need to check if it can be done.

Rainer

-----Original Message-----
From: [email protected] [mailto:rsyslog-
[email protected]] On Behalf Of Aaron Wiebe
Sent: Tuesday, November 03, 2009 6:34 PM
To: rsyslog-users
Subject: Re: [rsyslog] Queuing bug (4.5.5)

Here's one with times:

26365 17:27:59.996584 --- SIGSTOP (Stopped (signal)) @ 0 (0) ---
26365 17:28:00.002436 futex(0x1bec924, FUTEX_WAIT, 121, NULL
<unfinished ...>
26365 17:28:00.003810 <... futex resumed> ) = 0 <0.001356>
26365 17:28:00.003871 clock_gettime(CLOCK_REALTIME,  <unfinished ...>
26365 17:28:00.003934 <... clock_gettime resumed> {1257269280,
3889000}) = 0 <0.000047>
26365 17:28:00.004001 futex(0x1bec8c8, FUTEX_WAKE, 1 <unfinished ...>
26365 17:28:00.004065 <... futex resumed> ) = 0 <0.000046>
26365 17:28:00.004129 clock_gettime(CLOCK_REALTIME, {1257269280,
4147000}) = 0 <0.000013>
26365 17:28:00.004185 futex(0x1bec924, FUTEX_WAIT, 123, {1, 999742000}
<unfinished ...>
26365 17:28:02.006675 <... futex resumed> ) = -1 ETIMEDOUT (Connection
timed out) <2.002462>
26365 17:28:02.006797 write(165, "Oct 29 10:01:42 R6002
WT-GARULF6"..., 250) = 250 <0.000305>
26365 17:28:02.007183 futex(0x1bec8c8, FUTEX_WAKE, 1) = 0 <0.000028>
26365 17:28:02.007264 clock_gettime(CLOCK_REALTIME, {1257269282,
7292000}) = 0 <0.000027>
26365 17:28:02.007349 futex(0x1bec924, FUTEX_WAIT, 125, NULL
<unfinished ...>
26365 17:30:15.414854 <... futex resumed> ) = 0 <133.407486>
26365 17:30:15.414914 clock_gettime(CLOCK_REALTIME,  <unfinished ...>
26365 17:30:15.414978 <... clock_gettime resumed> {1257269415,
414932000}) = 0 <0.000048>
26365 17:30:15.415044 futex(0x1bec8c8, FUTEX_WAKE, 1 <unfinished ...>
26365 17:30:15.415108 <... futex resumed> ) = 0 <0.000047>
26365 17:30:15.415173 clock_gettime(CLOCK_REALTIME, {1257269415,
415191000}) = 0 <0.000013>
26365 17:30:15.415229 futex(0x1bec924, FUTEX_WAIT, 127, {1, 999741000}
<unfinished ...>
26365 17:30:17.417472 <... futex resumed> ) = -1 ETIMEDOUT (Connection
timed out) <2.002220>
26365 17:30:17.417540 write(165, "Oct 29 10:01:42 R6002
WT-GARULF6"..., 197) = 197 <0.000279>
26365 17:30:17.417886 futex(0x1bec8c8, FUTEX_WAKE, 1) = 0 <0.000027>
26365 17:30:17.417965 clock_gettime(CLOCK_REALTIME, {1257269417,
417992000}) = 0 <0.000027>
26365 17:30:17.418047 futex(0x1bec924, FUTEX_WAIT, 129, NULL
<unfinished ...>
26365 17:31:15.608148 <... futex resumed> ) = 0 <58.190082>
26365 17:31:15.608209 clock_gettime(CLOCK_REALTIME,  <unfinished ...>
26365 17:31:15.608272 <... clock_gettime resumed> {1257269475,
608226000}) = 0 <0.000047>
26365 17:31:15.608339 futex(0x1bec8c8, FUTEX_WAKE, 1 <unfinished ...>
26365 17:31:15.608403 <... futex resumed> ) = 0 <0.000048>
26365 17:31:15.608468 clock_gettime(CLOCK_REALTIME, {1257269475,
608486000}) = 0 <0.000014>
26365 17:31:15.608524 futex(0x1bec924, FUTEX_WAIT, 131, {1, 999740000}
<unfinished ...>

It looks like the locks are waiting a -very- long time.

-Aaron

On Tue, Nov 3, 2009 at 12:30 PM, Rainer Gerhards
<[email protected]> wrote:

first shot at the data: this looks like a full queue where the

enqueue

operation is waiting on a drain that does not happen (fast enough).

Of

course, that doesn't explain why this happens...

Rainer

-----Original Message-----
From: [email protected] [mailto:rsyslog-
[email protected]] On Behalf Of Aaron Wiebe
Sent: Tuesday, November 03, 2009 6:27 PM
To: rsyslog-users
Subject: Re: [rsyslog] Queuing bug (4.5.5)

Ok - I captured an strace.  This appears to be the action thread.
This specific set of calls took minutes.  I'll re-run this with -t
and/or -T.

Note that this syslog instance is not actually receiving any data

right

now.

-Aaron

26365 --- SIGSTOP (Stopped (signal)) @ 0 (0) ---
26365 futex(0x1bec924, FUTEX_WAIT, 97, NULL <unfinished ...>
26365 <... futex resumed> )             = 0
26365 clock_gettime(CLOCK_REALTIME, {1257268800, 165944000}) = 0
26365 futex(0x1bec8c8, FUTEX_WAKE, 1)   = 0
26365 clock_gettime(CLOCK_REALTIME, {1257268800, 166035000}) = 0
26365 futex(0x1bec924, FUTEX_WAIT, 99, {1, 999909000} <unfinished

...>

26365 <... futex resumed> )             = -1 ETIMEDOUT (Connection
timed out)
26365 write(165, "Oct 29 10:01:42 R6002 WT-GARULF6"..., 229) = 229
26365 futex(0x1bec8c8, FUTEX_WAKE, 1)   = 0
26365 clock_gettime(CLOCK_REALTIME, {1257268802, 173123000}) = 0
26365 futex(0x1bec924, FUTEX_WAIT, 101, NULL <unfinished ...>
26365 <... futex resumed> )             = 0
26365 clock_gettime(CLOCK_REALTIME, {1257268893, 439872000}) = 0
26365 futex(0x1bec8c8, FUTEX_WAKE, 1)   = 0
26365 clock_gettime(CLOCK_REALTIME, {1257268893, 439999000}) = 0
26365 futex(0x1bec924, FUTEX_WAIT, 103, {1, 999873000} <unfinished

...>

26365 <... futex resumed> )             = -1 ETIMEDOUT (Connection
timed out)
26365 write(165, "Oct 29 10:01:42 R6002 WT-GARULF6"..., 248) = 248
26365 futex(0x1bec8c8, FUTEX_WAKE, 1)   = 0
26365 clock_gettime(CLOCK_REALTIME, {1257268895, 443096000}) = 0
26365 futex(0x1bec924, FUTEX_WAIT, 105, NULL <unfinished ...>
26365 <... futex resumed> )             = 0
26365 clock_gettime(CLOCK_REALTIME,  <unfinished ...>
26365 <... clock_gettime resumed> {1257268950, 612585000}) = 0
26365 futex(0x1bec8c8, FUTEX_WAKE, 1 <unfinished ...>
26365 <... futex resumed> )             = 0
26365 clock_gettime(CLOCK_REALTIME, {1257268950, 612757000}) = 0
26365 futex(0x1bec924, FUTEX_WAIT, 107, {1, 999828000} <unfinished

...>

26365 <... futex resumed> )             = -1 ETIMEDOUT (Connection
timed out)
26365 write(165, "Oct 29 10:01:42 R6002 WT-GARULF6"..., 180) = 180
26365 futex(0x1bec8c8, FUTEX_WAKE, 1)   = 0
26365 clock_gettime(CLOCK_REALTIME, {1257268952, 619980000}) = 0
26365 futex(0x1bec924, FUTEX_WAIT, 109, NULL <unfinished ...>
26365 <... futex resumed> )             = 0
26365 futex(0x1bec8c8, FUTEX_WAIT, 2, NULL <unfinished ...>
26365 <... futex resumed> )             = -1 EAGAIN (Resource
temporarily unavailable)
26365 clock_gettime(CLOCK_REALTIME,  <unfinished ...>
26365 <... clock_gettime resumed> {1257268956, 628783000}) = 0
26365 futex(0x1bec8c8, FUTEX_WAKE, 1 <unfinished ...>
26365 <... futex resumed> )             = 0
26365 clock_gettime(CLOCK_REALTIME,  <unfinished ...>
26365 <... clock_gettime resumed> {1257268956, 628953000}) = 0
26365 futex(0x1bec924, FUTEX_WAIT, 111, {1, 999830000} <unfinished

...>

26365 <... futex resumed> )             = -1 ETIMEDOUT (Connection
timed out)
26365 write(165, "Oct 29 10:01:42 R6002 WT-GARULF6"..., 335) = 335
26365 futex(0x1bec8c8, FUTEX_WAKE, 1)   = 0
26365 clock_gettime(CLOCK_REALTIME, {1257268958, 636092000}) = 0
26365 futex(0x1bec924, FUTEX_WAIT, 113, NULL <unfinished ...>


On Tue, Nov 3, 2009 at 10:25 AM, Rainer Gerhards
<[email protected]> wrote:

-----Original Message-----
From: [email protected] [mailto:rsyslog-
[email protected]] On Behalf Of [email protected]
Sent: Tuesday, November 03, 2009 4:06 PM
To: rsyslog-users
Subject: Re: [rsyslog] Queuing bug (4.5.5)

On Tue, 3 Nov 2009, Aaron Wiebe wrote:

This is still taking place this hour - though I obviously can't
restart onto a newer version without losing this case.  Is

there

anything I can do in the current configuration to try and debug

this

situation?

if you can strace each thread for a few seconds it may give you

an

idea

what's happening.

This is a good suggestion.

It would also potentially be enlightening to attach gdb to the

process at

various points in time and get a backtrace from all running

threads.

Other than that, there is little you can do on a running version.

If

it were

compiled for debug, and the environment setup so that we could

capture

stdout/stderr, sending SIGUSR2 would yield to much the same

information as

the gdb backtrace BUT when runtime instrumentation is enabled, the

build is

operating at a third to a quarter of its normal speed.

in the meantime you need to stop the system from getting further
behind,
can you redirect or reconfigure the senders so that they are not
sending
new logs to this box so that it can dig itself out (stopping the

input

may
be enough by itself, rsyslog has historicly done a LOT of locking
around
the main queue, and if you have a full queue with more data

trying

to

be
delivered it can really slow things down. I wouldn't expect it to

be

this
much, but it could be part of it)


This may clean up things, but I really doubt it, given the

magnitude

of

delays. Also, Aaron runs 4.4.5, which already has lots of the

locking

removed/restructure (not to compare with the recent v5-devel, but

much more

effcient than early v4 and v3).

Rainer

David Lang

(We're up to 18:46:51 now!)

-Aaron

On Tue, Nov 3, 2009 at 1:46 AM, Rainer Gerhards
<[email protected]> wrote:

mhhh... This is obviously not intended behavior. There are

some

rate-limiting

features that can deliberately slow down a queue, but I guess

you

have not

configured these. So there either is a bug that activates them

at

some point

during processing (e.g. an invalid memory access could do

that)

or

there is

some other bug that causes the dequeue to happen very slowly.

In

any

case, it

is hard to guess.

Given the volume of the messages, a debug log probably does

not

help. We

could gain a little insight in at least the queue sizes that

rsyslog

sees via

imdiag plus the (very rudamentary) v5 debug front-end (it

doesn't

require the

engine to be v5!). I would need to spend at least a little

work

on

the

front-end to enable remote access, but that's not really a

problem.

One other thing is that I am still holding a v4-beta with

additional

fixes as

I didn't want to put these in the middle of some other

debugging.

However,

these fixes address potential memory problems, so the most

appropriate course

of action would be to give that version at least a try. It

needs

to

be

released in the next days in any case.

I have uploaded that (pre-4.5.6) version so that you can give

it

a

try if you

like:

http://www.rsyslog.com/download/rsyslog/pre/rsyslog-

4.5.6.tar.gz

I think it would good if you could try it at least once,

because

I

think any

other troubleshooting will boil down to looking at the fixes

this

version

contains.

Rainer

-----Original Message-----
From: [email protected] [mailto:rsyslog-
[email protected]] On Behalf Of Aaron Wiebe
Sent: Monday, November 02, 2009 11:52 PM
To: rsyslog-users
Subject: [rsyslog] Queuing bug (4.5.5)

Greetings all,

I appear to have run into an issue with 4.5.5 where queues

are

not

being flushed in a timely manner.  In this specific case, I

have

data

from last wednesday that is being written to disk in small

chunks

today since last wednesday.  Unfortunately I cannot be too

specific

in

details, but here is what I can see:

Using omfile+gzip, there appears to have been a decent burst

in

traffic sometime last wednesday.  The rsyslog instance has

grown

to

1.8GB of memory and stayed there for a while.  I have both

main

message and action queues defined using fixedarray, and I see

no

on-disk queues (appears to be entirely in memory).

I've got templates writing out to filenames using an hourly

timestamp

(filenames like:  [token]-[host]-YYYYMMDD-HH.txt.gz)  In some

of

those

files, all of them less than 5k in size, there are between 5

and

15

lines of content, all of them from last wednesday, and within

a

few

seconds of each other.  It's almost like there was a

significant

queue

built up, the hour rolled over, and only the first block of

lines

were

pulled from the queue.  Then the hour rolled over again, and

another

block of lines were pulled from the queue.  Then the next

hour,

then

another 5-15 lines.

Is it possible that one of the queues still has a good chunk

of

data

built up, and is flushing it out very slowly?  It hasn't been
consistantly at the top of the hour either, and not every

hour.

 But

the log lines themselves are sequentially written out, and

usually

the

lines are within a few seconds of each other.

For example:

syslog-myhost-20091102-18.txt.gz:  3 lines, 2 with TS Oct 21

18:46:34

and one 18:46:35
syslog-myhost-20091102-19.txt.gz:  17 lines, 3 Oct 21

18:46:35,

14

lines Oct 21 18:46:36
syslog-myhost-20091102-20.txt.gz:  12 lines, 8 Oct 21

18:46:36,

4

lines Oct 21 18:46:37

Thoughts?
-Aaron
_______________________________________________
rsyslog mailing list
http://lists.adiscon.net/mailman/listinfo/rsyslog
http://www.rsyslog.com

_______________________________________________
rsyslog mailing list
http://lists.adiscon.net/mailman/listinfo/rsyslog
http://www.rsyslog.com

_______________________________________________
rsyslog mailing list
http://lists.adiscon.net/mailman/listinfo/rsyslog
http://www.rsyslog.com

_______________________________________________
rsyslog mailing list
http://lists.adiscon.net/mailman/listinfo/rsyslog
http://www.rsyslog.com

_______________________________________________
rsyslog mailing list
http://lists.adiscon.net/mailman/listinfo/rsyslog
http://www.rsyslog.com

_______________________________________________
rsyslog mailing list
http://lists.adiscon.net/mailman/listinfo/rsyslog
http://www.rsyslog.com

_______________________________________________
rsyslog mailing list
http://lists.adiscon.net/mailman/listinfo/rsyslog
http://www.rsyslog.com

_______________________________________________
rsyslog mailing list
http://lists.adiscon.net/mailman/listinfo/rsyslog
http://www.rsyslog.com

_______________________________________________
rsyslog mailing list
http://lists.adiscon.net/mailman/listinfo/rsyslog
http://www.rsyslog.com


_______________________________________________
rsyslog mailing list
http://lists.adiscon.net/mailman/listinfo/rsyslog
http://www.rsyslog.com
_______________________________________________
rsyslog mailing list
http://lists.adiscon.net/mailman/listinfo/rsyslog
http://www.rsyslog.com

Reply via email to