Re: [rsyslog] Help Optimizing Action Queue

Devin Christensen Fri, 19 Sep 2014 15:20:05 -0700

I've been digging into this a little deeper and I can consistently getrsyslog to freeze indefinitely upon shutdown. Below is a Vagrantfile youcan use to reproduce my testing environment followed by the commands I'mrunning and the results I'm getting. The output.rb file is available ifyou want to watch the logs come across, but it's not actually necessaryfor the tests.


---------------------------------------------------------------------
Vagrantfile
---------------------------------------------------------------------
# -*- mode: ruby -*-
# vi: set ft=ruby :


BASE_RSYSLOG_CONF = <<-EOF
##################################################
# Global Config
##################################################
$PrivDropToGroup syslog
$PrivDropToUser syslog
$WorkDirectory /var/spool/rsyslog

##################################################
# Input
##################################################
module(load="imuxsock")
module(load="imudp" threads="1")
input(type="imudp" Address="127.0.0.1" Port="514")

##################################################
# Output
##################################################
local1.* action(
  type="omfwd"
  Target="localhost"
  Port="4414"
  Protocol="tcp"
  action.resumeRetryCount="-1"
  action.resumeInterval="15"
  queue.type="FixedArray"
  queue.size="100000"
  queue.highwatermark="60000"
  queue.lowwatermark="50000"
  queue.dequeuebatchsize="10"
  queue.timeoutenqueue="0"
  queue.workerthreads="1"
  queue.filename="fwd_test"
  queue.maxdiskspace="4g"
  queue.maxfilesize="64m"
  queue.saveonshutdown="on"
)
EOF

INPUT_RB = <<-EOF
require "syslog"
require "securerandom"

ARGV[0].to_i.times do
  fork do
    syslog  = Syslog.open "input", 0, Syslog::LOG_LOCAL1
    message = SecureRandom.hex 512

    ARGV[1].to_i.times do
      syslog.log Syslog::LOG_INFO, message
    end
  end
end

Process.waitall
EOF

OUTPUT_RB = <<-EOF
require "socket"
require "thread"

server = TCPServer.new "localhost", 4414
count_mutex = Mutex.new
count = 0

loop do
  Thread.new(server.accept) do |client|
    while log = client.gets do
      count_mutex.synchronize do
        count = count + 1
        printf "[%06d] %s", count, log
      end
    end
  end
end
EOF

Vagrant.configure("2") do |config|
  config.vm.box = "chef/ubuntu-12.04"
  config.vm.provision :shell, inline: <<-EOF
    apt-get install -y python-software-properties
    apt-add-repository ppa:adiscon/v8-stable
    apt-get update
    apt-get install -y rsyslog=8.4.0-0adiscon2precise1
    apt-get install -y ruby1.9.1
    echo '#{BASE_RSYSLOG_CONF}' > /etc/rsyslog.conf
    echo '#{INPUT_RB}' > /home/vagrant/input.rb
    echo '#{OUTPUT_RB}' > /home/vagrant/output.rb
    mkdir -p /var/spool/rsyslog && chown syslog:syslog /var/spool/rsyslog
    stop rsyslog
  EOF

  config.vm.define "rsyslog" do |node|
    node.vm.hostname = "rsyslog"
    node.vm.network "private_network", ip: "10.0.0.11"
  end
end

---------------------------------------------------------------------
Steps for each test (running as root):
---------------------------------------------------------------------
root@rsyslog:/home/vagrant# rm -rf /var/spool/rsyslog/*
root@rsyslog:/home/vagrant# rsyslogd -n &
root@rsyslog:/home/vagrant# ruby input.rb 1 100000
root@rsyslog:/home/vagrant# kill %1; time wait %1
root@rsyslog:/home/vagrant# ls /var/spool/rsyslog

When rsyslog failed to shutdown I used ctrl+c followed by:
root@rsyslog:/home/vagrant# kill -9 %1; fg

---------------------------------------------------------------------

Results - changes made to the rsyslog config for each test are relativeto the BASE_RSYSLOG_CONF above. They do not include changes in precedingresult sets.

---------------------------------------------------------------------
[success] No changes
- Rsyslog shuts down in 1.386 seconds
- fwd_test.qi is written to /var/spool/rsyslog

[failure] With threads="2" on the imudp module
- Rsyslog does not shut down (waited over an hour during lunch)
- fwd_test.qi is never written

[success] With threads="2" and running with debug (replace "rsyslog -n&" with "rsyslogd -dn > /tmp/rsyslog.log &" in the testing steps above)

- Rsyslog shuts down in 4.673 seconds
- fwd_test.qi is written to /var/spool/rsyslog

Strange that it works with debug enabled. Here are the log lines aroundimudp during shutdown:

8124.409983502:main thread : request term via SIGTTIN for inputthread 'imudp' 0x76043700

8124.410282614:imudp.c        : imudp: epoll_wait() returned with -1 fds

8125.410418736:main thread : input thread term: timeout expiredwaiting on thread imudp termination - canceling8125.410464944:main thread : non-cancel input thread terminationsucceeded for thread imudp 0x760437008125.410541505:imudp.c : destructor for debug call stack0x7fad680008c0 called

The error on epoll_wait() might point to something, but I'm notsufficiently familiar with the source to say.


On 09/19/2014 01:43 AM, Devin Christensen wrote:

David, you're right. I'm on Ubuntu 12.04 and lo and behold, "killtimeout 30" appears in the upstart script. I can definitely duplicatethis at will and I will investigate "rsyslog -dn" after I've let mybrain rest a bit (currently 1:30AM in these parts). I appreciate theguidance regarding the number of workers and such. I probably slippedinto the trap of premature optimization. I'll set my workers to "1"and migrate to the FixedArray. Thanks for the help!
On 09/19/2014 01:20 AM, David Lang wrote:
On Fri, 19 Sep 2014, Devin Christensen wrote:
Alright, I've got everything working, but I came across a verystrang issue in the process. As far as I can tell, the key togetting a disk assisted queue to be re-loaded when rsyslog restartsis the *.qi file that gets stored in the spool directory.
I can confirm that this is correct.
The follwing config prevented that file from getting written whenrsyslog shutdown.
module(load="imudp" threads="2")
When I changed the the threads setting in the imudp module from "2"to "1", the *.qi files for my disk-assisted queues would correctlyget written to the spool directory when I shutdown rsyslog. I belivethis is becasue rsyslog would also take roughly 30 seconds toshutdown when threads was set to "2". I'm thinking the shutdownprocess is timing out for some reason while trying to cleanup theimudp input and never makes it to the disk-assisted queues to writeout the *.qi files.
All I know for sure is that setting threads in the imudp module to"2" causes rsyslog to take 29.685s to shutdown and the *.qi filesfor my disk-assisted queues never get written. When I set thethreads for the imudp module to "1", rsyslog shuts down in 0.121s,and the *.qi files for my disk-assisted queues are written to thespool directory. I was seeing this behavior in 7.6.3 as well as 8.4.0.
What distro are you using for this? Could you check the distroshutdown script/config to see if it's enforcing the timeout?
I believe that rsyslog has a setting for how long it's allowed tospend trying to flush it's queues when shutting down.
If you can duplicate this at will (especially if you can duplicate itstarting and stopping rsyslog manually), see if starting rsyslog with-dn has any interesting output at shutdown time.
It's interesting that you are running into grief with an inputmodule. that would say that it's probably an issue with locking onthe main queue.
The inudp module is rather efficient and you should not need morethan one thread with it unless you are dealing with more than acouple hundred thousand messages per second. With most things inrsyslog, you have the option of configuring additional threads, butusually shouldn't. Rsyslog has a lot of internal queues that themessage moves between. Normally this is done very efficiently inbatches of messages, but if you have too many threads for the workyou are doing, the batch size drops drastically and performanceplummets because the threads are spending too much time locking andunlocking the queues. I would suggest dropping the workers count downto the default 1 everywhere, unless benchmarking indicates that youhave a bottleneck.
Also, as a side note, I've measured better performance withFixedArray queues than LinkedList queues.
David Lang
Let me know if you need any other info to track this down.

On 09/18/2014 04:11 PM, Devin Christensen wrote:
I'm getting closer on this. Adding queue.timeoutenqueue="0" to theaction gave me what I was after in terms of discarding. It filledup the disk queue to the configured max and then started discardingall messages after that. I'm using impstats to track things. Hereare the relative output lines after filling up my disk-assisted queue:
action 1 queue[DA]: size=4101477 enqueued=5948040 full=1846563discarded.full=1846563 discarded.nf=0 maxqsize=4101477action 1 queue: size=51960 enqueued=6000000 full=0 discarded.full=0discarded.nf=0 maxqsize=61776main Q: size=17 enqueued=6000941 full=0 discarded.full=0discarded.nf=0 maxqsize=5309
As you can see, the messages are only getting discarded at "action1 queue[DA]" which is what I expected. So things are looking greatthere. Back pressure is not piling up on the main Q (maxqsize=5309)either. Very happy about the way that is working, but I'm nowbumping into an issue during restarts. When I restart rsyslog, theimpstats output changes from what's above to:
action 1 queue[DA]: size=0 enqueued=0 full=0 discarded.full=0discarded.nf=0 maxqsize=0rsyslogd-pstats: action 1 queue: size=0 enqueued=0 full=0discarded.full=0 discarded.nf=0 maxqsize=0main Q: size=17 enqueued=21 full=0 discarded.full=0 discarded.nf=0maxqsize=17
As you can see, all the logs in "action 1 queue[DA]" are now gone(size=0), even though the spool files still exist on disk. Iexpected the "action 1 queue[DA]" to remain at "size=4101477" aftera restart. I confirmed the logs were lost by starting the omfwdreceiver again and waiting for 5 minutes. No logs were evertransmitted; howerver, it does receive new logs just fine. When Ishutdown the receiver and send more logs I get:
action 1 queue[DA]: size=101666 enqueued=101667 full=0discarded.full=0 discarded.nf=0 maxqsize=101666action 1 queue: size=77270 enqueued=500000 full=290102discarded.full=290102 discarded.nf=0 maxqsize=100000main Q: size=17 enqueued=501421 full=0 discarded.full=0discarded.nf=0 maxqsize=475
Now the logs are getting discared at "action 1 queue" rather than"action 1 queue[DA]". It also looks like rsyslog reuses the spoolfiles. Even though "action 1 queue[DA]" filled to "size=101666",the actual on disk size never increased beyond the size it was atwhen I restarted rsyslog. I'm going to do some more experimenting,but do let me know if you see antyhing wrong with my setup. I'mcurrently doing all testing on rsyslog 8.4.0. I'll try 7.6.3 if Ican't get 8.4.0 working. Here's my test configuration:
################################################################################# Global Config################################################################################$FileGroup adm
$FileOwner syslog
$PrivDropToGroup syslog
$PrivDropToUser syslog
$RepeatedMsgReduction off
$Umask 0022
$WorkDirectory /var/spool/rsyslog
################################################################################# Input################################################################################module(load="imuxsock") # provides support for local system logging
module(load="imklog")   # provides kernel logging support
module(load="imfile")
module(load="impstats" interval="5")
################################################################################# Main Queue################################################################################main_queue(
  queue.type="LinkedList"
  queue.size="100000"
  queue.dequeuebatchsize="1000"
  queue.workerthreads="2"
  queue.dequeueslowdown="0"
)
################################################################################# Templates################################################################################template(name="preformatted" type="list") {
  property(name="msg")
}
################################################################################# Output################################################################################
# Defaults for file output
module(
  load="builtin:omfile"
  template="RSYSLOG_TraditionalFileFormat"
  dirCreateMode="0755"
  fileCreateMode="0644"
)

# Test forward action
local1.* action(
  type="omfwd"
  Target="localhost"
  Port="4414"
  Protocol="tcp"
  template="preformatted"
  action.resumeRetryCount="-1"
  action.resumeInterval="15"
  queue.type="LinkedList"
  queue.size="100000"
  queue.highwatermark="60000"
  queue.lowwatermark="50000"
  queue.dequeuebatchsize="10"
  queue.timeoutenqueue="0"
  queue.workerthreads="2"
  queue.filename="fwd_test"
  queue.maxdiskspace="4g"
  queue.maxfilesize="64m"
  queue.saveonshutdown="on"
)

auth,authpriv.* /var/log/auth.log
*.*;auth,authpriv.none;\
local0,local1,local2,local3,local4,local5,local6,local7.none/var/log/syslog
syslog.*        /var/log/rsyslog.log #rsyslog error messages
kern.*          /var/log/kern.log
mail.*          /var/log/mail.log
mail.err        /var/log/mail.err
news.crit       /var/log/news/news.crit
news.err        /var/log/news/news.err
news.notice     /var/log/news/news.notice
*.emerg         :omusrmsg:*

On 09/18/2014 02:50 AM, Rainer Gerhards wrote:
2014-09-18 9:03 GMT+02:00 Radu Gheorghe <[email protected]>:
Hi Devin,

My understanding is that a timeoutenqueue of 0 might have you lose
messages, as they're enqueued from the main queue to the actionqueue whenrsyslog is busy, because it waits for... no time... for theaction queue toenqueue that message. I was losing tons of messages during aperformancetest with that setting. So I changed it to the default and had nomore
problems.
It's used in cases where the queue in question is full ornear-full, as
Devin quoted:

    We can not hold processing infinitely, not even when throtteling.
For example, throtteling the local log socket too long wouldcausethe system at whole come to a standstill. To prevent this,rsyslogdtimes out after a configured period("$<object>QueueTimeoutEnqueue
",
specified in milliseconds) if no space becomes available. As alast
    resort, it then discards the newly arrived message.
You probably lost messages because the queue was full and theoutput wasnot able to free space. With defaults, the output could quicklyenough freespace and so the whole process continued without loss (remember,if there
is no timeout, even a nanosecond is "too long" a waittime).

Rainer
I'm still not 100% on what timeoutenqueue does, but I'd leave itto thedefault for now on the action queue. Maybe Rainer or someone elsewho
understand this stuff better can clarify.

Best regards,
Radu

--
Performance Monitoring * Log Analytics * Search Analytics
Solr & Elasticsearch Support * http://sematext.com/

On Thu, Sep 18, 2014 at 9:52 AM, Devin Christensen <
[email protected]> wrote:
Thanks for the clarification. In that case, am I right inthinking thatthe following action will discard all messages it receives whenits disk
queue reaches 1 gigabyte, and avoid any throttling?

local1.* action(
    type="omfwd"
    Target="remote.example.com"
    Port="4414"
    Protocol="tcp"
    template="preformatted"
    action.resumeRetryCount="-1"
    action.resumeInterval="15"
    queue.type="LinkedList"
    queue.size="100000"
    queue.highwatermark="60000"
    queue.lowwatermark="50000"
    queue.dequeuebatchsize="1000"
    queue.timeoutenqueue="0"
    queue.workerthreads="2"
    queue.filename="fwd_to_remote"
    queue.maxdisksize="1g"
    queue.maxfilesize="16m"
    queue.saveonshutdown="on"
)

On 09/18/2014 12:43 AM, Rainer Gerhards wrote:
2014-09-18 8:28 GMT+02:00 Devin Christensen <
[email protected]>:
Actually, it looks like there may be some conflictingdocumentation
around
"queue.timeoutenqueue". From "Understanding Rsyslog Queues"
We can not hold processing infinitely, not even whenthrotteling.For example, throtteling the local log socket too longwould causethe system at whole come to a standstill. To preventthis, rsyslogd
     times out after a configured period
("$<object>QueueTimeoutEnqueue",
specified in milliseconds) if no space becomes available.As a last
     resort, it then discards the newly arrived message.
*If you do not like throtteling, set the timeout to 0 -the messagewill then immediately be discarded*. If you use a hightimeout, besure you know what you do. If a high main message queueenqueuetimeout is set, it can lead to something like a completehang of
the
     system. The same problem does not apply to action queues.

  From "General Queue Parameters"
*queue.timeoutenqueue* number number is timeout in ms(1000ms is
     1sec!), default 2000, *0 means indefinite*


  This is wrong, just fixed the doc. 0 is "discard immediately".
Thx,
Rainer

  Guess I won't tinker with that without a bit of clarification.
On 09/18/2014 12:15 AM, Devin Christensen wrote:
Thanks for the quick response. The other setting that Ithought might
help is "queue.timeoutenqueue" which I was consideringsetting to 0 on
the
action queue. The documentation leads me to believe this willdiscard
any
new messages arriving to the action when the disk queuereaches its
max
size. Does that sound right?
If I can isolate the discarded messages to those going to theomfwdaction that would be ideal. None of the other logs shouldcause backpressure becuase they're not dependent on a remote host beingup. Of
course, I think I should also add queue.discardmark and
queue.discardseverity to the main queue for additionalreassurance.
On 09/17/2014 11:54 PM, Radu Gheorghe wrote:

  Hi Devin,
I'm not 100% sure about this, but it sounds like what youshould do
is
to
apply queue.discardmark and queue.discardseverity on themain queue.
This
should allow the action queue to fill up (to that 1GB), and put
pressure
on
the main queue. When main queue has more than $DISCARDMARKmessages,
it
should begin discarding messages with a severity numberhigher than
$DISCARDSEVERITY.

You could go all-or-nothing with this, and discard everything
(severity=1
or maybe even 0 works?) when you hit 999999 messages, or youcan
show a
bit
of mercy and, say, let only errors pass after you have 800Kmessages
in
the
queue. In the latter case you'd risk putting pressure backon the
socket,
though.
It sounds like you already know about all the queueparameters, but
just
in
case you missed the docs:
http://www.rsyslog.com/doc/master/rainerscript/queue_parameters.html
Best regards,
Radu
--
Performance Monitoring * Log Analytics * Search Analytics
Solr & Elasticsearch Support * http://sematext.com/

On Thu, Sep 18, 2014 at 8:41 AM, Devin Christensen <
[email protected]> wrote:
I'm trying to configure an action queue so that it willdiscard all
messages immediately if it fills up it's allocated diskspace. The
log
messages are coming in on the local socket. I justrecovered from ascenario where rsyslog was bringing systems to a halt,presumably
because
back pressure is ending up on the local log socket, fillingit up,
and
letting nothing else write.
Here is my current configuration for my main queue and theaction.
main_queue(
     queue.type="LinkedList"
     queue.size="1000000"
     queue.dequeuebatchsize="1000"
     queue.workerthreads="5"
     queue.dequeueslowdown="0"
)

local1.* action(
     type="omfwd"
     Target="remote.example.com"
     Port="4414"
     Protocol="tcp"
     template="preformatted"
     action.resumeRetryCount="-1"
     action.resumeInterval="15"
     queue.type="LinkedList"
     queue.size="100000"
     queue.highwatermark="60000"
     queue.lowwatermark="50000"
     queue.dequeuebatchsize="1000"
     queue.workerthreads="2"
queue.filename="fwd_preformatted_to_logflume"
     queue.maxdisksize="1g"
     queue.maxfilesize="16m"
     queue.saveonshutdown="on"
)
In the event that the target (remote.example.com) isunavailable, I
would
like logs to spool to disk upto 1 gigabyte, and discardeverythingimmediately after that. I want to avoid any back pressureending up
on
the
local log socket. It's much more valuable for our systems to
continue
running than to get all the log data.

My question is, what am I missing or completely messed up?
_______________________________________________
rsyslog mailing list
http://lists.adiscon.net/mailman/listinfo/rsyslog
http://www.rsyslog.com/professional-services/
What's up with rsyslog? Follow https://twitter.com/rgerhards
NOTE WELL: This is a PUBLIC mailing list, posts areARCHIVED by a
myriad
of sites beyond our control. PLEASE UNSUBSCRIBE and DO NOTPOST if
you
DON'T LIKE THAT.

_______________________________________________
rsyslog mailing list
http://lists.adiscon.net/mailman/listinfo/rsyslog
http://www.rsyslog.com/professional-services/
What's up with rsyslog? Follow https://twitter.com/rgerhards
NOTE WELL: This is a PUBLIC mailing list, posts are ARCHIVEDby a
myriad
of sites beyond our control. PLEASE UNSUBSCRIBE and DO NOTPOST if
you
DON'T LIKE THAT.

_______________________________________________
rsyslog mailing list
http://lists.adiscon.net/mailman/listinfo/rsyslog
http://www.rsyslog.com/professional-services/
What's up with rsyslog? Follow https://twitter.com/rgerhards
NOTE WELL: This is a PUBLIC mailing list, posts are ARCHIVEDby a
myriad
of sites beyond our control. PLEASE UNSUBSCRIBE and DO NOTPOST if you
DON'T LIKE THAT.

_______________________________________________
rsyslog mailing list
http://lists.adiscon.net/mailman/listinfo/rsyslog
http://www.rsyslog.com/professional-services/
What's up with rsyslog? Follow https://twitter.com/rgerhards
NOTE WELL: This is a PUBLIC mailing list, posts are ARCHIVED by a
myriad
of sites beyond our control. PLEASE UNSUBSCRIBE and DO NOTPOST if you
DON'T LIKE THAT.

  _______________________________________________
rsyslog mailing list
http://lists.adiscon.net/mailman/listinfo/rsyslog
http://www.rsyslog.com/professional-services/
What's up with rsyslog? Follow https://twitter.com/rgerhards
NOTE WELL: This is a PUBLIC mailing list, posts are ARCHIVED bya myriadof sites beyond our control. PLEASE UNSUBSCRIBE and DO NOT POSTif you
DON'T LIKE THAT.
_______________________________________________
rsyslog mailing list
http://lists.adiscon.net/mailman/listinfo/rsyslog
http://www.rsyslog.com/professional-services/
What's up with rsyslog? Follow https://twitter.com/rgerhards
NOTE WELL: This is a PUBLIC mailing list, posts are ARCHIVED bya myriadof sites beyond our control. PLEASE UNSUBSCRIBE and DO NOT POSTif you
DON'T LIKE THAT.
_______________________________________________
rsyslog mailing list
http://lists.adiscon.net/mailman/listinfo/rsyslog
http://www.rsyslog.com/professional-services/
What's up with rsyslog? Follow https://twitter.com/rgerhards
NOTE WELL: This is a PUBLIC mailing list, posts are ARCHIVED by amyriadof sites beyond our control. PLEASE UNSUBSCRIBE and DO NOT POSTif you
DON'T LIKE THAT.
_______________________________________________
rsyslog mailing list
http://lists.adiscon.net/mailman/listinfo/rsyslog
http://www.rsyslog.com/professional-services/
What's up with rsyslog? Follow https://twitter.com/rgerhards
NOTE WELL: This is a PUBLIC mailing list, posts are ARCHIVED by amyriad of sites beyond our control. PLEASE UNSUBSCRIBE and DO NOTPOST if you DON'T LIKE THAT.
_______________________________________________
rsyslog mailing list
http://lists.adiscon.net/mailman/listinfo/rsyslog
http://www.rsyslog.com/professional-services/
What's up with rsyslog? Follow https://twitter.com/rgerhards
NOTE WELL: This is a PUBLIC mailing list, posts are ARCHIVED by amyriad of sites beyond our control. PLEASE UNSUBSCRIBE and DO NOTPOST if you DON'T LIKE THAT.
_______________________________________________
rsyslog mailing list
http://lists.adiscon.net/mailman/listinfo/rsyslog
http://www.rsyslog.com/professional-services/
What's up with rsyslog? Follow https://twitter.com/rgerhards
NOTE WELL: This is a PUBLIC mailing list, posts are ARCHIVED by amyriad of sites beyond our control. PLEASE UNSUBSCRIBE and DO NOTPOST if you DON'T LIKE THAT.
_______________________________________________
rsyslog mailing list
http://lists.adiscon.net/mailman/listinfo/rsyslog
http://www.rsyslog.com/professional-services/
What's up with rsyslog? Follow https://twitter.com/rgerhards
NOTE WELL: This is a PUBLIC mailing list, posts are ARCHIVED by amyriad of sites beyond our control. PLEASE UNSUBSCRIBE and DO NOTPOST if you DON'T LIKE THAT.
_______________________________________________
rsyslog mailing list
http://lists.adiscon.net/mailman/listinfo/rsyslog
http://www.rsyslog.com/professional-services/
What's up with rsyslog? Follow https://twitter.com/rgerhards
NOTE WELL: This is a PUBLIC mailing list, posts are ARCHIVED by amyriad of sites beyond our control. PLEASE UNSUBSCRIBE and DO NOT POSTif you DON'T LIKE THAT.


_______________________________________________
rsyslog mailing list
http://lists.adiscon.net/mailman/listinfo/rsyslog
http://www.rsyslog.com/professional-services/
What's up with rsyslog? Follow https://twitter.com/rgerhards
NOTE WELL: This is a PUBLIC mailing list, posts are ARCHIVED by a myriad of 
sites beyond our control. PLEASE UNSUBSCRIBE and DO NOT POST if you DON'T LIKE 
THAT.

Re: [rsyslog] Help Optimizing Action Queue

Reply via email to