On Thu, 3 Jul 2014, Masuda, Bond wrote:

Date: Thu, 3 Jul 2014 07:55:15 +0000
From: "Masuda, Bond" <[email protected]>
Reply-To: rsyslog-users <[email protected]>
To: rsyslog-users <[email protected]>
Subject: Re: [rsyslog] question on reliability of message processing

Thanks David & RobertM for your replies. impstats definitely looks like it 
would help me understand what's going on - will use that tomorrow and repeat the 
load tests.



As for rsyslog, using version 8.2.1 on RHEL 6.5 64bit.

good



When you say disable DNS look ups, you mean running with -x option, yes?


yes


As for what i'm doing with logs, it goes like this (pulling out the relevant parts from multiple files and internal hostnames sanitized):



input(type="imudp" port="514" ruleset="MainRoutingEngine")



ruleset(name="MainRoutingEngine"

       queue.type="fixedArray"

       queue.size="250000"

       queue.dequeueBatchSize="4096"

       queue.workerThreads="4"

this is frequently a mistake, it turns out that having too many workers can be as bad as too few, as the contention between the workers slows things down

       queue.workerThreadMinimumMessages="60000"

       ) {

       call DataArchiving

       #call DataForwarding

}



ruleset(name="DataArchiving"

       queue.type="fixedArray"

       queue.size="250000"

       queue.dequeueBatchSize="4096"

       queue.workerThreads="4"

       queue.workerThreadMinimumMessages="60000"

       ) {


having the main queue do nothing but put the message in a second queue also ends up being one of those things that can cause problems.

what ends up happening is that your mutliple main worker threads are in a loop

1. lock the main queue

2. is there anything in the main queue

3. if so grab up to batchsize messages

4. unlock the main queue

5. lock the action queue

6. reserve slots in the action queue for the number of messages in this batch

7. unlock the action queue

8. move these messages to the action queue

9. lock the action queue

10. mark the messages as being completely in the action queue

11. unlock the action queue

12. lock the main queue

13. mark the messages as being processed (freeing the queue slots)

14. unlock the main queue

15. goto 1

with 4 workers and no actual work to do other than copy the messages, this means that all four workers are in a tight loop, each processing very few messages, and as a result, there is a LOT of time when the queue is locked by the workers

This seems like a horribly complicated process, but the result is that you can get a LOT of work done without the queues being locked (inserting into the main queue is similar to steps 5-11)

But if you have too many workers, the batch size ends up being 1 (because the messages are grabbed so aggressivly, and because the queues are locked so much of the time that it's hard to get the new messages inserted)

So you don't want to create additional worker threads unless you find that you need them (because the existing worker threads are maxing out the cpu), and you don't want to have an action queue unless the action is either potentially very slow, or is likely to fail. Writing to disk is neither of these (but writing to a remote machine, or to a database, etc is both)


       action( type="omfile"

               file="/data/logs/incoming-all.log"

               ioBufferSize="64k"

probably not needed, leave it out unless we find we need it

               flushOnTXEnd="off"

ditto

               asyncWriting="on")

asyncwriting is another one of those things that you probably shouldn't be using. AsyncWriting doesn't mean the same thing that traditional syslog async means (rsyslog does what traditional syslog async means by default, and you have to go to extreme lengths to disable it). AsyncWriting means that instead of the worker thread writing to disk, it formats the output line and writes it to another thread (with more locking involved for this handoff), which then writes it to disk. This can be useful if you have a very complex output format, or if you are doing a lot of work in the output (compressing/encrypting/signing the file as you go are the ony things that I've ever seen trigger this)



       if $syslogtag == 'LEEF:' and $msg contains 'Websense' then {

contains is one of the most expensive operations you can do, especially if you do it against rawmsg, sometimes you need it, but check carefully to see if you really do, and where possible, do like you do here and have the contains after another test that can narrow down the test first. Rsyslog will short-circuit the test so if the first part if false, it won't even evaluate the second part

               action( type="omfile"

                       dynaFile="sortByHostnameYearMonthWebsense"

using dynafile can be a problem. There is a parameter to set the dynafilecache size, you need to set that to a number larger than the number of files that you expect to be working with at any one time. If it's smaller, then you end up with the process

1. I have a message to write to file 'A1', but I don't have it open

2. I want to open file A1, but my dynafilecache is full

3. evict file A9 from the dynafile cach (as it hasn't been accessed the longest), flush any remaining messages to disk (fsync) and close the file

4. open file A1.
.
.
.
5. I hae a message to write to file A9, but I don't have it open

6. I want to open file A9, but my dynafilecache is full

7. evict file A1 from the cach...

8. open file A1

.
.
.

9. I have a message to write to file A1....

impstats should tell you stats on the dynafile cache, how many opens were needed.

When figuring out your working set, remember that including timestamps as part of the filename has some overlap, you may be processing messages with different timestamps at the same time.

                       ioBufferSize="64k"

                       flushOnTXEnd="off"

                       asyncWriting="on"

                       )

               stop

       }



       if $rawmsg contains '_SIEM' then {

               action( type="omfile"

                       dynaFile="sortByHostnameYearMonthCiscoWSASquid"

                       ioBufferSize="64k"

                       flushOnTXEnd="off"

                       asyncWriting="on"

                       )

               stop

       }



       if $rawmsg contains 'SFIMS:' then {

               action( type="omfile"

                       dynaFile="sortByHostnameYearMonthSourcefire"

                       ioBufferSize="64k"

                       flushOnTXEnd="off"

                       asyncWriting="on"

                       )

               stop

       }



       if $rawmsg contains 'fenotify-' then {

               action( type="omfile"

                       dynaFile="sortByHostnameYearMonthFireeye"

                       ioBufferSize="64k"

                       flushOnTXEnd="off"

                       asyncWriting="on"

                       )

               stop

       }



       if $hostname == 'server.com' then {

               action( type="omfile"

                       dynaFile="sortByHostnameYearMonthTippingpoint"

                       ioBufferSize="64k"

                       flushOnTXEnd="off"

                       asyncWriting="on"

                       )

               stop

       }


it's a good idea to put simple tests like this ahead of more expensive ones. If they match this test, stop will prevent any of the other tests from being done.


       if re_match($rawmsg, 'a|b|c|d' ) then {

regex match is one of the things that's even more expensive than contains. look to see if there's any way to filter this more.

               action( type="omfile"

                       dynaFile="sortByHostnameYearMonthJuniperSSLVPN"

                       ioBufferSize="64k"

                       flushOnTXEnd="off"

                       asyncWriting="on"

                       )

               stop

       }



       if $syslogtag startswith 'MSWinEventLog' then {

               action( type="omfile"

                       dynaFile="sortByHostnameYearMonthWindows"

                       ioBufferSize="64k"

                       flushOnTXEnd="off"

                       asyncWriting="on"

                       )

               stop

       }



       if $rawmsg contains 'RT_FLOW: RT_FLOW_SESSION' then {

               action( type="omfile"

                       dynaFile="sortByHostnameYearMonthSRXFW"

                       ioBufferSize="64k"

                       flushOnTXEnd="off"

                       asyncWriting="on"

                       )

               stop

       }



       if re_match($msg, '%ASA-[0-9]-[0-9]{6}' ) then {

use the pmcisco* module to fix the cisco "hostname is :" bug, then make this be if $syslogtag startswith '%ASA', that avoids needing to check the entire message and avoids the regex.

If there are only some messages you want to match, the arraymatch trick can help

if $syslogtag == ['%ASA-2-2645','%ASA-2-2646']

note that the array elements must be in alpha sorted order.

               action( type="omfile"

                       dynaFile="sortByHostnameYearMonthCiscoASA"

                       ioBufferSize="64k"

                       flushOnTXEnd="off"

                       asyncWriting="on"

                       )

               stop

       }



       if re_match($rawmsg, '%PIX-[0-9]-[0-9]{6}' ) then {

               action( type="omfile"

                       dynaFile="sortByHostnameYearMonthCiscoPIX"

                       ioBufferSize="64k"

                       flushOnTXEnd="off"

                       asyncWriting="on"

                       )

               stop

       }



       if re_match($rawmsg, '%FWSM-[0-9]-[0-9]{6}' ) then {

               action( type="omfile"

                       dynaFile="sortByHostnameYearMonthCiscoFWSM"

                       ioBufferSize="64k"

                       flushOnTXEnd="off"

                       asyncWriting="on"

                       )

               stop

       }



       if $msg contains '%SEC-6-IPACCESSLOG' then {

               action( type="omfile"

                       dynaFile="sortByHostnameYearMonthCiscoACLRouter"

                       ioBufferSize="64k"

                       flushOnTXEnd="off"

                       asyncWriting="on"

                       )

               stop

       }



       if $rawmsg contains 'CSCOacs' then {

               action( type="omfile"

                       dynaFile="sortByHostnameYearMonthCiscoACS"

                       ioBufferSize="64k"

                       flushOnTXEnd="off"

                       asyncWriting="on"

                       )

               stop

       }



       action( type="omfile"

               dynaFile="sortByHostnameYearMonthLog"

               ioBufferSize="64k"

               flushOnTXEnd="off"

               asyncWriting="on")

}



The files are written to a glusterfs file system sitting on top of XFS on a iSCSI device. iSCSI has its own dedicated interface (eth1) and so does glusterfs (eth2) that is separate from the incoming syslog traffic (eth0). iostat output during the load test showed no strain on the disk i/o sub-system.

Good thing to keep an eye on, but also watch the impstats for the outputs. If they get slow (potentially because of glusterfs locking for example), it will show up as the queue building up.

David Lang



Thanks,

-Bond



-----Original Message-----

From: [email protected] 
[mailto:[email protected]] On Behalf Of David Lang

Sent: Wednesday, July 02, 2014 11:06 PM

To: rsyslog-users

Subject: Re: [rsyslog] question on reliability of message processing



On Wed, 2 Jul 2014, Masuda, Bond wrote:



hi rsyslogers:



we've been doing some load testing of syslog messages over UDP/514 to rsyslog. 
we write all incoming messages to a file in /data/logs/incoming-all.log.



In our load test, we generated about 29 million messages in 300 seconds. On the 
server side, we are receiving about 25 million messages; and about 4 million 
messages are lost on the network (not an rsyslog issue). However, of the 25 
million messages we know arrive at the server, we are also seeing message lost 
in /data/logs/incoming-all.log, albeit to a much lesser degree than the network 
problem.



The actual numbers are:



29,561,113 messages generated and sent in 300 seconds

24,802,441 messages arrive at the rsyslog server (counting UDP packets via 
NETFILTER/mangle-PREROUTING accounting rule)

24,774,587 messages written to /data/logs/incoming-all.log



So it would seem that we lost 27854 messages within rsyslog.



My question is this:



1. Does rsyslog drop messages when its message queues are overflowing?



Yes, if the queue is full, what else can rsyslog do?



2. If answer to #1 is yes, does it keep any accounting of the lost messages and 
how can I see those numbers? or at least warn that its queues are overflowing?



yes, enable impstats



3. if answer to #1 is yes, is there some configuration setting to make rsyslog 
guarantee not to drop messages, potentially as trade off with some other 
problem? Or is it just a matter of increasing queue sizes?



not with UDP, because if rsyslog can't deal with the message _NOW_ the OS will

drop the packet



4. If answer to #1 is no, what's the best way to go about troubleshooting why 
messages are being lost?





BTW, under less stressful conditions, all the in/out numbers perfectly match.

We only start seeing "lost messages/packets" when we go above ~50,000 messages

per second.



first question, what version are you running, 8.2 is worlds better than 5.x



after that, have you tried disabling DNS lookups



run tcpdump (saving to a file, not outputting to the screen) to check if the

packets are really being lost on the network or if they are being lost in the

network stack of the receiving machine)



configure impstats to see what your queue sizes look like



what are you doing with the logs? it's possible that your bottleneck is in

outputting the logs and fixing that will solve your problem



you could configure disk assisted queues to spool logs to disk, but that is

slower than just outputting them to a file, so it doesn't solve the problem



a tuned rsyslog setup can do gig-E wire speed or faster, you are well below that

limit, so we should be able to help you.



David Lang

_______________________________________________

rsyslog mailing list

http://lists.adiscon.net/mailman/listinfo/rsyslog

http://www.rsyslog.com/professional-services/

What's up with rsyslog? Follow https://twitter.com/rgerhards

NOTE WELL: This is a PUBLIC mailing list, posts are ARCHIVED by a myriad of 
sites beyond our control. PLEASE UNSUBSCRIBE and DO NOT POST if you DON'T LIKE 
THAT.
_______________________________________________
rsyslog mailing list
http://lists.adiscon.net/mailman/listinfo/rsyslog
http://www.rsyslog.com/professional-services/
What's up with rsyslog? Follow https://twitter.com/rgerhards
NOTE WELL: This is a PUBLIC mailing list, posts are ARCHIVED by a myriad of 
sites beyond our control. PLEASE UNSUBSCRIBE and DO NOT POST if you DON'T LIKE 
THAT.

_______________________________________________
rsyslog mailing list
http://lists.adiscon.net/mailman/listinfo/rsyslog
http://www.rsyslog.com/professional-services/
What's up with rsyslog? Follow https://twitter.com/rgerhards
NOTE WELL: This is a PUBLIC mailing list, posts are ARCHIVED by a myriad of 
sites beyond our control. PLEASE UNSUBSCRIBE and DO NOT POST if you DON'T LIKE 
THAT.

Reply via email to