On 11/25/2013 03:06 AM, Pavel Levshin wrote:
25.11.2013 12:35, David Lang пишет:
On Mon, 25 Nov 2013, Pavel Levshin wrote:
25.11.2013 12:18, David Lang:\
Another possibly interesting message is:
7975.038523942:7fe2064cf700: main Q: doEnqSingleObject: LightDelay
mark reached for light delayable message - blocking a bit.
which was received approximately once per second during following
interval (this is also when the traffic went down to zero):
1385087975 Fri Nov 22 02:39:35 UTC 2013
1385088546 Fri Nov 22 02:49:06 UTC 2013
Does this shed any light on what's going on?
I bet this is the problem. This blocks main queue for a second.
There are possible workarounds, including increasing main queue
size (this watermark is set to 70% by default).
searching through the documentation, this seems to be a tcp input
option
old style
$InputTCPFlowControl defaults on set to off to disable
new style
FlowControl='off' on the module load
it still seems wrong that this triggers problems for such a long
time. unless some output is blocking and causing the queue to not
drain, it seems like it shoudl very quicly open up again.
I'm wrong in that it is blocking the whole queue; in reality, it is
blocking only one input. In this case, it is RELP. It is using flow
control, there is no option to disable it.
The question is: why queue length stays above LightDelayMark for a
while?
what does pstats show during this timeframe? set the reporting
frequency so that you will get several samples during the problem
timeframe.
Has been posted today here, already. Some excerpts:
Fri Nov 22 02:15:45 2013: main Q: size=9533 enqueued=56272 full=0
discarded.full=0 discarded.nf=0 maxqsize=9533
Fri Nov 22 02:16:46 2013: main Q: size=22597 enqueued=69336 full=0
discarded.full=0 discarded.nf=0 maxqsize=22597
Fri Nov 22 02:17:46 2013: main Q: size=26 enqueued=82099 full=0
discarded.full=0 discarded.nf=0 maxqsize=25655
Fri Nov 22 02:18:46 2013: main Q: size=27 enqueued=94622 full=0
discarded.full=0 discarded.nf=0 maxqsize=25655
Fri Nov 22 02:19:46 2013: main Q: size=28 enqueued=107148 full=0
discarded.full=0 discarded.nf=0 maxqsize=25655
Fri Nov 22 02:20:46 2013: main Q: size=9202 enqueued=119439 full=0
discarded.full=0 discarded.nf=0 maxqsize=25655
Fri Nov 22 02:21:46 2013: main Q: size=25 enqueued=131510 full=0
discarded.full=0 discarded.nf=0 maxqsize=25655
Fri Nov 22 02:22:46 2013: main Q: size=29 enqueued=143703 full=0
discarded.full=0 discarded.nf=0 maxqsize=25655
Fri Nov 22 02:23:46 2013: main Q: size=25 enqueued=156219 full=0
discarded.full=0 discarded.nf=0 maxqsize=25655
Fri Nov 22 02:24:46 2013: main Q: size=27 enqueued=169026 full=0
discarded.full=0 discarded.nf=0 maxqsize=25655
Fri Nov 22 02:25:46 2013: main Q: size=25 enqueued=181734 full=0
discarded.full=0 discarded.nf=0 maxqsize=25655
Fri Nov 22 02:26:46 2013: main Q: size=29 enqueued=193979 full=0
discarded.full=0 discarded.nf=0 maxqsize=25655
Fri Nov 22 02:27:46 2013: main Q: size=27 enqueued=206103 full=0
discarded.full=0 discarded.nf=0 maxqsize=25655
Fri Nov 22 02:28:46 2013: main Q: size=28 enqueued=218616 full=0
discarded.full=0 discarded.nf=0 maxqsize=25655
Fri Nov 22 02:29:46 2013: main Q: size=29 enqueued=231468 full=0
discarded.full=0 discarded.nf=0 maxqsize=25655
Fri Nov 22 02:30:46 2013: main Q: size=9396 enqueued=244058 full=0
discarded.full=0 discarded.nf=0 maxqsize=25655
Fri Nov 22 02:31:46 2013: main Q: size=21097 enqueued=255759 full=0
discarded.full=0 discarded.nf=0 maxqsize=25655
Fri Nov 22 02:32:47 2013: main Q: size=32811 enqueued=267473 full=0
discarded.full=0 discarded.nf=0 maxqsize=32811
Fri Nov 22 02:33:47 2013: main Q: size=18451 enqueued=279993 full=0
discarded.full=0 discarded.nf=0 maxqsize=40206
Fri Nov 22 02:34:47 2013: main Q: size=6527 enqueued=292854 full=0
discarded.full=0 discarded.nf=0 maxqsize=40206
Fri Nov 22 02:35:47 2013: main Q: size=19557 enqueued=305884 full=0
discarded.full=0 discarded.nf=0 maxqsize=40206
Fri Nov 22 02:36:47 2013: main Q: size=32967 enqueued=319294 full=0
discarded.full=0 discarded.nf=0 maxqsize=40206
Fri Nov 22 02:37:47 2013: main Q: size=46424 enqueued=332751 full=0
discarded.full=0 discarded.nf=0 maxqsize=46424
Fri Nov 22 02:38:47 2013: main Q: size=59397 enqueued=345724 full=0
discarded.full=0 discarded.nf=0 maxqsize=59397
Fri Nov 22 02:39:47 2013: main Q: size=70037 enqueued=356365 full=0
discarded.full=0 discarded.nf=0 maxqsize=70037
Fri Nov 22 02:40:47 2013: main Q: size=70329 enqueued=356657 full=0
discarded.full=0 discarded.nf=0 maxqsize=70329
Fri Nov 22 02:41:47 2013: main Q: size=70416 enqueued=356744 full=0
discarded.full=0 discarded.nf=0 maxqsize=70416
Fri Nov 22 02:42:47 2013: main Q: size=70503 enqueued=356831 full=0
discarded.full=0 discarded.nf=0 maxqsize=70503
Fri Nov 22 02:43:47 2013: main Q: size=70590 enqueued=356918 full=0
discarded.full=0 discarded.nf=0 maxqsize=70590
Fri Nov 22 02:44:47 2013: main Q: size=70677 enqueued=357005 full=0
discarded.full=0 discarded.nf=0 maxqsize=70677
Fri Nov 22 02:45:47 2013: main Q: size=70969 enqueued=357297 full=0
discarded.full=0 discarded.nf=0 maxqsize=70969
Fri Nov 22 02:46:47 2013: main Q: size=69207 enqueued=367465 full=0
discarded.full=0 discarded.nf=0 maxqsize=71020
Fri Nov 22 02:47:47 2013: main Q: size=65006 enqueued=406528 full=0
discarded.full=0 discarded.nf=0 maxqsize=71020
Fri Nov 22 02:48:48 2013: main Q: size=70055 enqueued=426682 full=0
discarded.full=0 discarded.nf=0 maxqsize=71020
Fri Nov 22 02:49:48 2013: main Q: size=66943 enqueued=454801 full=0
discarded.full=0 discarded.nf=0 maxqsize=71020
Fri Nov 22 02:51:36 2013: main Q: size=10 enqueued=8284 full=0
discarded.full=0 discarded.nf=0 maxqsize=14
Fri Nov 22 02:52:36 2013: main Q: size=9 enqueued=20954 full=0
discarded.full=0 discarded.nf=0 maxqsize=15
Fri Nov 22 02:53:36 2013: main Q: size=13 enqueued=33913 full=0
discarded.full=0 discarded.nf=0 maxqsize=20
Fri Nov 22 02:54:36 2013: main Q: size=22 enqueued=46568 full=0
discarded.full=0 discarded.nf=0 maxqsize=22
Fri Nov 22 02:55:36 2013: main Q: size=6 enqueued=60148 full=0
discarded.full=0 discarded.nf=0 maxqsize=70
Fri Nov 22 02:56:36 2013: main Q: size=15 enqueued=73316 full=0
discarded.full=0 discarded.nf=0 maxqsize=70
Fri Nov 22 02:57:36 2013: main Q: size=12 enqueued=85892 full=0
discarded.full=0 discarded.nf=0 maxqsize=70
Fri Nov 22 02:58:36 2013: main Q: size=24 enqueued=98845 full=0
discarded.full=0 discarded.nf=0 maxqsize=70
Fri Nov 22 02:59:37 2013: main Q: size=14 enqueued=111903 full=0
discarded.full=0 discarded.nf=0 maxqsize=70
Fri Nov 22 03:00:37 2013: main Q: size=8096 enqueued=125043 full=0
discarded.full=0 discarded.nf=0 maxqsize=8096
The queue is almost "locked" from 02:39:47 to 02:45:47, and rsyslog is
restarted short after this time.
Actually in this case there is no restart, I just stopped rsyslog
(running in debug mode) after the traffic went back to normal.
The pattern is burst, sometime shortly after burst there's silence,
another burst and traffic is normal again.
Sometiem the silence periods are very regular (usually two bursts
cause no problem, third one cause the the silence), sometime every other
burst causes the problem and sometime the problem disappears for longer
periods. Not sure what is it that makes the difference.
However the end of silence is 100% when the next or next to next
burst happens. It never seems to resolve itself and with 15 min burst
interval it is ALWAYS resolved when the next burst comes in, with 5 min
interval it is sometime second burst that makes it work again.
Reload does not seems to have much effect on this.
The queue does not shrink, so the problem can be in this output, which
is freezing many times:
Fri Nov 22 02:15:45 2013: action 7: processed=46613 failed=0
Fri Nov 22 02:16:46 2013: action 7: processed=46613 failed=0
Fri Nov 22 02:17:46 2013: action 7: processed=81687 failed=0
Fri Nov 22 02:18:46 2013: action 7: processed=94184 failed=0
Fri Nov 22 02:19:46 2013: action 7: processed=106683 failed=0
Fri Nov 22 02:20:46 2013: action 7: processed=109768 failed=0
Fri Nov 22 02:21:46 2013: action 7: processed=130786 failed=0
Fri Nov 22 02:22:46 2013: action 7: processed=142948 failed=0
Fri Nov 22 02:23:46 2013: action 7: processed=155441 failed=0
Fri Nov 22 02:24:46 2013: action 7: processed=168219 failed=0
Fri Nov 22 02:25:46 2013: action 7: processed=180697 failed=0
Fri Nov 22 02:26:46 2013: action 7: processed=192912 failed=0
Fri Nov 22 02:27:46 2013: action 7: processed=205010 failed=0
Fri Nov 22 02:28:46 2013: action 7: processed=217496 failed=0
Fri Nov 22 02:29:46 2013: action 7: processed=230320 failed=0
Fri Nov 22 02:30:46 2013: action 7: processed=233457 failed=0
Fri Nov 22 02:31:46 2013: action 7: processed=233457 failed=0
Fri Nov 22 02:32:47 2013: action 7: processed=233457 failed=0
Fri Nov 22 02:33:47 2013: action 7: processed=260539 failed=0
Fri Nov 22 02:34:47 2013: action 7: processed=285012 failed=0
Fri Nov 22 02:35:47 2013: action 7: processed=285012 failed=0
Fri Nov 22 02:36:47 2013: action 7: processed=285012 failed=0
Fri Nov 22 02:37:47 2013: action 7: processed=285012 failed=0
Fri Nov 22 02:38:47 2013: action 7: processed=285012 failed=0
Fri Nov 22 02:39:47 2013: action 7: processed=285012 failed=0
Fri Nov 22 02:40:47 2013: action 7: processed=285012 failed=0
Fri Nov 22 02:41:47 2013: action 7: processed=285012 failed=0
Fri Nov 22 02:42:47 2013: action 7: processed=285012 failed=0
Fri Nov 22 02:43:47 2013: action 7: processed=285012 failed=0
Fri Nov 22 02:44:47 2013: action 7: processed=285012 failed=0
Fri Nov 22 02:45:47 2013: action 7: processed=285012 failed=0
Fri Nov 22 02:46:47 2013: action 7: processed=296930 failed=0
Fri Nov 22 02:47:47 2013: action 7: processed=340113 failed=0
Fri Nov 22 02:48:48 2013: action 7: processed=354705 failed=0
Fri Nov 22 02:49:48 2013: action 7: processed=385790 failed=0
Fri Nov 22 02:51:36 2013: action 7: processed=8246 failed=0
Fri Nov 22 02:52:36 2013: action 7: processed=20889 failed=0
Fri Nov 22 02:53:36 2013: action 7: processed=33821 failed=0
Fri Nov 22 02:54:36 2013: action 7: processed=46449 failed=0
Fri Nov 22 02:55:36 2013: action 7: processed=59797 failed=0
Fri Nov 22 02:56:36 2013: action 7: processed=72938 failed=0
Fri Nov 22 02:57:36 2013: action 7: processed=85487 failed=0
Fri Nov 22 02:58:36 2013: action 7: processed=98413 failed=0
Fri Nov 22 02:59:37 2013: action 7: processed=111444 failed=0
Fri Nov 22 03:00:37 2013: action 7: processed=116395 failed=0
It was in initial posting:
- 6 hosts that were sending syslog messages to the log collector STOP
sending anything (as verified by stracing rsyslogd, tcpdump and in
amazon AWS console metric for network in)
- after this nothing is ever written into /path/2013-10-10/03/00/log.json
- the 6 clients start sending sysog messages again when the next file
is created (in this example it would be /path/2013-10-10/03/01/log.json)
So, most probably, rsyslog is unable to write the file.
that seems unlikely, specially since it is resolved by second burst.
There is tons of space on the disk, other operations on the disk work,
strace does not reveal anything funny related to disk writing.
in this last scenario the burst are coming over /dev/log and are
going out via RELP to a different machine (then to a file). The previous
scenario was the bursts coming over /dev/log and going out over RELP to
the SAME machine (then to a file). Behaviour is the same for both
scenarios, the machine that SENDS the bursts makes all other machines
STOP sending it the data.
In the second scenario, the machine that receives the burst of 200
messages over RELP is not affected (100 messages caused the problem just
the same).
erik
--
Pavel Levshin
_______________________________________________
rsyslog mailing list
http://lists.adiscon.net/mailman/listinfo/rsyslog
http://www.rsyslog.com/professional-services/
What's up with rsyslog? Follow https://twitter.com/rgerhards
NOTE WELL: This is a PUBLIC mailing list, posts are ARCHIVED by a myriad
of sites beyond our control. PLEASE UNSUBSCRIBE and DO NOT POST if you
DON'T LIKE THAT.
_______________________________________________
rsyslog mailing list
http://lists.adiscon.net/mailman/listinfo/rsyslog
http://www.rsyslog.com/professional-services/
What's up with rsyslog? Follow https://twitter.com/rgerhards
NOTE WELL: This is a PUBLIC mailing list, posts are ARCHIVED by a myriad of
sites beyond our control. PLEASE UNSUBSCRIBE and DO NOT POST if you DON'T LIKE
THAT.