On Wed, 30 Dec 2015, Muhammad Asif wrote:
How about if I go with something like this?
ruleset(name="Ruleset-for-TCP" queue.size="1000000"
# queue.filename="forwarding" queue.maxdiskspace="5g"
queue.highwatermark="900000" queue.lowwatermark="500000"
queue.dequeuebatchsize="30" queue.dequeueslowdown="10000"
queue.type="LinkedList"){
action(type="omfwd" name="Events-on-TCP" target="127.0.0.1"
port="5170" protocol="tcp" template="msgonly" )
action(type="omfwd" name="Events-on-TCP" target="127.0.0.1"
port="5171" protocol="tcp" template="msgonly" )
}
Shouldn't they perform load balancing of events in this ruleset?
no, this just sends two copies of each log message, one to each port.
Before this, rsyslog was receiving 12000/sec and processing about 2000/sec,
which is why the queue was filling rapidly.
Now I have tuned the sender end to send about 2000/sec (instead of 12000) and
added two actions to meet the load. But the queue is still growing, though very
slowly compared to the previous test.
before, you were sending a single batch of 1-2000 messages, then waiting for
1 sec before sending the next batch. so the max you would send is 2000/sec
(actually less, because it takes time to send the messages)
now you are sending a batch of 1-30 messages, then waiting for 0.01 sec, so
this is less than 3000/sec (again, because it takes time to deliver the
messages)
I think it should not increase at all, because the dequeue rate is 3000/sec,
which is 1000 messages/sec more than the receiving rate.
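The arithmetic behind those rate ceilings can be sketched in a few lines (a minimal check of the reasoning above; `max_rate` is just an illustrative name, and queue.dequeueslowdown is specified in microseconds):

```python
def max_rate(batch_size, slowdown_us):
    """Upper bound on messages/sec: one full batch per slowdown interval.

    The real rate is lower, because delivering the batch itself also
    takes time on top of the configured pause.
    """
    return batch_size * 1_000_000 / slowdown_us

# old settings: batch of up to 2000, then a 1-second pause
old_cap = max_rate(2000, 1_000_000)  # 2000.0 msgs/sec
# new settings: batch of up to 30, then a 0.01-second pause
new_cap = max_rate(30, 10_000)       # 3000.0 msgs/sec
```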
looking at yesterday's stats:
your receive rate when you startup is >120K/10 seconds, or 12K/second. once the
queue fills, it will only accept messages if it has space for them, so the
incoming rate drops to the processed rate (with the backlog sitting on the
sending servers where you don't have stats)
if you look at these stats, you will see that once the queue hits the watermark
level, the enqueued count for one report almost exactly matches the processed
count for the prior report (the small discrepancies are due to the
resetcounters inaccuracy issue)
origin=core.action processed=0 failed=0
origin=core.queue size=0 enqueued=0
origin=core.action processed=0 failed=0
origin=core.queue size=0 enqueued=0
origin=core.action processed=8744 failed=0
origin=core.queue size=48776 enqueued=55520
origin=core.action processed=18000 failed=0
origin=core.queue size=150353 enqueued=119577
origin=core.action processed=0 failed=0
origin=core.queue size=273953 enqueued=123600
origin=core.action processed=0 failed=0
origin=core.queue size=398355 enqueued=124402
origin=core.action processed=12000 failed=0
origin=core.queue size=513015 enqueued=126660
origin=core.action processed=6000 failed=0
origin=core.queue size=629848 enqueued=122833
origin=core.action processed=0 failed=0
origin=core.queue size=700003 enqueued=70156
origin=core.action processed=12000 failed=0
origin=core.queue size=700000 enqueued=11997
origin=core.action processed=0 failed=0
origin=core.queue size=700010 enqueued=10
origin=core.action processed=18000 failed=0
origin=core.queue size=700000 enqueued=17990
origin=core.action processed=0 failed=0
origin=core.queue size=700010 enqueued=10
origin=core.action processed=6000 failed=0
origin=core.queue size=700000 enqueued=5990
origin=core.action processed=2000 failed=0
origin=core.queue size=700009 enqueued=2009
origin=core.action processed=4000 failed=0
origin=core.queue size=700006 enqueued=3997
origin=core.action processed=6000 failed=0
origin=core.queue size=700006 enqueued=6000
origin=core.action processed=0 failed=0
origin=core.queue size=700016 enqueued=10
with whatever settings you were using for this, you are processing between
200-1800 logs/sec
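That full-queue behavior is easy to verify from a few sample intervals (a quick sanity check; the processed/enqueued pairs below are copied from the stats above, taken from the intervals where size is pinned at ~700,000):

```python
# (processed, enqueued) pairs from the full-queue intervals above
pairs = [(12000, 11997), (18000, 17990), (6000, 5990),
         (2000, 2009), (4000, 3997), (6000, 6000)]

# once the queue is full, new messages are only accepted as old ones
# are drained, so the two counters differ by at most a handful
for processed, enqueued in pairs:
    assert abs(processed - enqueued) <= 10
```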
looking at the latest stats
grep -e Ruleset-for-TCP -e Events-on-TCP KyUNp65r.txt |cut -f 8-9 -d ' '
processed=0 failed=0
processed=0 failed=0
size=0 enqueued=0
processed=0 failed=0
processed=0 failed=0
size=0 enqueued=0
processed=8339 failed=0
processed=8339 failed=0
size=513 enqueued=8552
processed=18675 failed=0
processed=18675 failed=0
size=524 enqueued=18686
processed=12982 failed=0
processed=12982 failed=0
size=5920 enqueued=18378
processed=9600 failed=0
processed=9600 failed=0
size=14725 enqueued=18405
processed=9300 failed=0
processed=9300 failed=0
size=23921 enqueued=18496
processed=15300 failed=0
processed=15300 failed=0
size=26938 enqueued=18317
processed=3000 failed=0
processed=3000 failed=0
size=50063 enqueued=26125
processed=6900 failed=0
processed=6900 failed=0
size=61733 enqueued=18570
processed=13800 failed=0
processed=13800 failed=0
size=67225 enqueued=19292
processed=10200 failed=0
processed=10200 failed=0
size=75307 enqueued=18282
processed=10500 failed=0
processed=10500 failed=0
size=82938 enqueued=18131
processed=11700 failed=0
processed=11700 failed=0
size=89817 enqueued=18579
processed=9300 failed=0
processed=9300 failed=0
size=99147 enqueued=18630
processed=9300 failed=0
processed=9300 failed=0
size=108306 enqueued=18459
processed=15300 failed=0
processed=15300 failed=0
size=111493 enqueued=18487
processed=6300 failed=0
processed=6300 failed=0
size=126404 enqueued=21211
processed=0 failed=0
processed=0 failed=0
size=145015 enqueued=18611
processed=8400 failed=0
processed=8400 failed=0
size=154578 enqueued=17963
processed=7200 failed=0
processed=7200 failed=0
size=166088 enqueued=18710
processed=9900 failed=0
processed=9900 failed=0
you see how you are sending the same number of logs to each destination, and
the old size + enqueued - one copy of processed = new size
you are correct that new logs are arriving at just under 2000 logs/sec, but you
are only processing ~1000 logs/sec, and there was one 10 second period when you
were able to deliver zero messages. If you remove the extra copy, you may find
that the receiving software can handle ~2000 logs/sec, but this is
uncomfortably close to what you are receiving.
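That queue-balance identity can be checked directly against consecutive samples (the values below are copied from the stats above; each row is one reporting interval):

```python
# (prev_size, processed_per_action, enqueued, new_size) from consecutive samples
rows = [(513, 18675, 18686, 524),
        (524, 12982, 18378, 5920),
        (5920, 9600, 18405, 14725),
        (14725, 9300, 18496, 23921)]

# both actions report the same processed count, but each message leaves
# the queue only once, so only one copy of processed is subtracted
for prev_size, processed, enqueued, new_size in rows:
    assert prev_size + enqueued - processed == new_size
```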
If you were doing this to separate systems instead of localhost, the answer
would be easy, have multiple machines receiving the logs and use iptables
CLUSTERIP to load balance across them as I describe in this paper/presentation
https://www.usenix.org/conference/lisa12/technical-sessions/presentation/lang_david
note that this is two parts: one on the receiving end to load balance based on
the hash of sourceip/sourceport (because destinationip/destinationport are
fixed), and the other to disconnect every X messages so that the load balancer
can do its job
There are probably ways to use iptables to redirect traffic arriving on port X
to ports Y and Z depending on the hash of sourceip/sourceport, but I'd have to
do some digging to track a solution down.
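The hash-based distribution idea can be sketched in a few lines (a toy illustration only, not the iptables mechanism itself; `pick_port` and the port list are made-up names):

```python
import zlib

def pick_port(src_ip: str, src_port: int, ports: list) -> int:
    """Deterministically map a source ip/port pair to one destination port.

    destination ip/port are fixed, so only the source side varies;
    hashing it keeps every packet of one TCP connection going to the
    same destination.
    """
    h = zlib.crc32(f"{src_ip}:{src_port}".encode())
    return ports[h % len(ports)]

# the same connection always lands on the same port
assert pick_port("10.0.0.1", 41234, [5170, 5171]) == \
       pick_port("10.0.0.1", 41234, [5170, 5171])
```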
the poor man's version of this is to use global variables in rsyslog to do
something like:
set $/counter = $/counter + 1;
if $/counter % 2 == 0 then {
    # send to the first port
    action(type="omfwd" target="127.0.0.1" port="5170" protocol="tcp")
} else {
    # send to the second port
    action(type="omfwd" target="127.0.0.1" port="5171" protocol="tcp")
}
this is rather inefficient to do in rsyslog because of the locking required
around global variables.
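For comparison, the same round-robin idea in a general-purpose language needs exactly that kind of locking around the shared counter (a hypothetical sketch, not rsyslog internals; `RoundRobin` is a made-up name):

```python
import threading

class RoundRobin:
    """Alternate messages between destinations.

    The lock is the cost mentioned above: every message must
    serialize on the shared counter before it can be routed.
    """
    def __init__(self, destinations):
        self.destinations = destinations
        self.counter = 0
        self.lock = threading.Lock()

    def next(self):
        with self.lock:
            dest = self.destinations[self.counter % len(self.destinations)]
            self.counter += 1
            return dest

rr = RoundRobin([5170, 5171])
ports = [rr.next() for _ in range(4)]  # alternates 5170, 5171, 5170, 5171
```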
David Lang
Queue stats: http://pastebin.com/KyUNp65r
On Wed, Dec 30, 2015 at 2:39 PM, Muhammad Asif <[email protected]> wrote:
I am really sorry, I attached the old config link. This is the actual one:
http://pastebin.com/H7JPvvzj
I am grateful to you for your detailed answer.
Thanks
On Wed, Dec 30, 2015 at 1:17 AM, David Lang <[email protected]> wrote:
On Tue, 29 Dec 2015, Muhammad Asif wrote:
Hi Guys,
I have three queues at rulesets level as following.
no, this config shows the queue at the action level, not the ruleset level
at the ruleset level, instead of:
ruleset(name="events-on-tcp"){
action(type="omfwd" target="127.0.0.1" port="5170" protocol="tcp"
name="tcp-events-queue" template="msgonly" queue.size="1000000"
queue.dequeuebatchsize="2000" queue.dequeueslowdown="1000000"
queue.workerthreads="2" queue.type="LinkedList" )
}
you would have:
ruleset(name="events-on-tcp" queue.size="1000000"
queue.dequeuebatchsize="2000" queue.dequeueslowdown="1000000"
queue.workerthreads="2" queue.type="LinkedList" ){
action(type="omfwd" target="127.0.0.1" port="5170" protocol="tcp"
name="tcp-events-queue" template="msgonly" )
}
http://pastebin.com/k4EWRwL7
you should try to use a unique name for everything; you use 'Events-on-TCP'
for both a ruleset (and therefore the queue for the ruleset) and the action.
Reuse of names can cause stats data to be lost.
Queue stats:
http://pastebin.com/Js5bEqxm
I restarted the rsyslog service and then observed that the ruleset queues
become empty, then start to fill rapidly, and then smoothly, as shown in the
stats above.
Please help me understand this behavior.
so the interesting data for Events-on-TCP queue is:
grep core.queue Js5bEqxm.txt |grep Events-on-TCP |grep size= |cut -f
8,9,13 -d ' '
size=0 enqueued=0 maxqsize=0
size=0 enqueued=0 maxqsize=0
size=48776 enqueued=55520 maxqsize=48776
size=150353 enqueued=119577 maxqsize=150353
size=273953 enqueued=123600 maxqsize=273953
size=398355 enqueued=124402 maxqsize=398355
size=513015 enqueued=126660 maxqsize=513015
size=629848 enqueued=122833 maxqsize=629848
size=700003 enqueued=70156 maxqsize=700003
size=700000 enqueued=11997 maxqsize=700007
size=700010 enqueued=10 maxqsize=700010
size=700000 enqueued=17990 maxqsize=700011
size=700010 enqueued=10 maxqsize=700011
size=700000 enqueued=5990 maxqsize=700019
size=700009 enqueued=2009 maxqsize=700019
size=700006 enqueued=3997 maxqsize=700019
size=700006 enqueued=6000 maxqsize=700019
size=700016 enqueued=10 maxqsize=700019
so you have a flood of messages arriving shortly after you start up, which
is the senders noticing that you are there and sending their backlog. After
that, the rate of new messages arriving stabilizes.
But the fact that the queue size climbs up to ~700,000 and stays there
means that you are not delivering messages to your destination at a
reasonable rate. Having the queue build initially is expected while you are
receiving hundreds of thousands of logs per sample period, but once the
volume of new logs drops to thousands per period, you should be making
headway, delivering the logs faster than you receive them.
The fact that the events-on-tcp action is processing very few messages,
and only in bursts, says that you are having problems delivering
messages. this may just be due to your throttling settings; you wait a LONG
time between bursts.
grep -e Events-on-TCP -e imtcp Js5bEqxm.txt |cut -f 7,8 -d ' '
origin=core.action processed=0
origin=imtcp submitted=0
origin=core.queue size=0
origin=core.action processed=0
origin=imtcp submitted=0
origin=core.queue size=0
origin=core.action processed=8744
origin=imtcp submitted=55549
origin=core.queue size=48776
origin=core.action processed=18000
origin=imtcp submitted=119548
origin=core.queue size=150353
origin=core.action processed=0
origin=imtcp submitted=123600
origin=core.queue size=273953
origin=core.action processed=0
origin=imtcp submitted=124402
origin=core.queue size=398355
origin=core.action processed=12000
origin=imtcp submitted=126660
origin=core.queue size=513015
origin=core.action processed=6000
origin=imtcp submitted=122833
origin=core.queue size=629848
origin=core.action processed=0
origin=imtcp submitted=70156
origin=core.queue size=700003
origin=core.action processed=12000
origin=imtcp submitted=12643
origin=core.queue size=700000
origin=core.action processed=0
origin=imtcp submitted=0
origin=core.queue size=700010
origin=core.action processed=18000
origin=imtcp submitted=17772
origin=core.queue size=700000
origin=core.action processed=0
origin=imtcp submitted=0
origin=core.queue size=700010
origin=core.action processed=6000
origin=imtcp submitted=5925
origin=core.queue size=700000
origin=core.action processed=2000
origin=imtcp submitted=2221
origin=core.queue size=700009
origin=core.action processed=4000
origin=imtcp submitted=3713
origin=core.queue size=700006
origin=core.action processed=6000
origin=imtcp submitted=5924
origin=core.queue size=700006
origin=core.action processed=0
origin=imtcp submitted=0
origin=core.queue size=700016
so during this time you have received 790946 messages, but only processed
92744 of them, leaving ~700,000 messages pending. the fact that the
submissions drop off may mean that the watermark settings are refusing to
accept more messages (the fact that the queue size is staying at almost
exactly 700k even when you have a wildly varying number of messages
processed is very suspicious).
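Those totals come straight from summing the stats columns above (a quick check; the numbers are copied from the imtcp submitted and core.action processed samples):

```python
submitted = [0, 0, 55549, 119548, 123600, 124402, 126660, 122833,
             70156, 12643, 0, 17772, 0, 5925, 2221, 3713, 5924, 0]
processed = [0, 0, 8744, 18000, 0, 0, 12000, 6000, 0, 12000, 0,
             18000, 0, 6000, 2000, 4000, 6000, 0]

total_in = sum(submitted)       # 790946 messages received
total_out = sum(processed)      # 92744 messages delivered
backlog = total_in - total_out  # ~700k pending, matching the queue size
```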
try setting your batch size to 20 instead of 2000 and your delay to 10000
instead of 1000000. this is the same total limit, but it should smooth things
out a lot so that you can see what the receiving processes can really
handle. Right now they are getting a huge batch of 2000 messages every
second and then nothing for the next second, instead of a continuous stream
of messages.
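Both settings cap out at the same long-run rate; the difference is burstiness (same arithmetic as the earlier rate check; `rate_and_burst` is just an illustrative name, and the slowdown is in microseconds):

```python
def rate_and_burst(batch, slowdown_us):
    """Return (msgs/sec cap, messages delivered per pause interval)."""
    return batch * 1_000_000 / slowdown_us, batch

# current: 2000 messages all at once, then a full second of silence
assert rate_and_burst(2000, 1_000_000) == (2000.0, 2000)
# suggested: the same 2000/sec cap, but in 20-message slices every 0.01 sec
assert rate_and_burst(20, 10_000) == (2000.0, 20)
```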
David Lang
_______________________________________________
rsyslog mailing list
http://lists.adiscon.net/mailman/listinfo/rsyslog
http://www.rsyslog.com/professional-services/
What's up with rsyslog? Follow https://twitter.com/rgerhards
NOTE WELL: This is a PUBLIC mailing list, posts are ARCHIVED by a myriad
of sites beyond our control. PLEASE UNSUBSCRIBE and DO NOT POST if you
DON'T LIKE THAT.