On Thu, 3 Jul 2014, Masuda, Bond wrote:
Date: Thu, 3 Jul 2014 07:55:15 +0000
From: "Masuda, Bond" <[email protected]>
Reply-To: rsyslog-users <[email protected]>
To: rsyslog-users <[email protected]>
Subject: Re: [rsyslog] question on reliability of message processing
Thanks David & RobertM for your replies. impstats definitely looks like it
would help me understand what's going on - will use that tomorrow and repeat the
load tests.
As for rsyslog, using version 8.2.1 on RHEL 6.5 64bit.
good
When you say disable DNS look ups, you mean running with -x option, yes?
yes
As for what i'm doing with logs, it goes like this (pulling out the relevant
parts from multiple files and internal hostnames sanitized):
input(type="imudp" port="514" ruleset="MainRoutingEngine")
ruleset(name="MainRoutingEngine"
queue.type="fixedArray"
queue.size="250000"
queue.dequeueBatchSize="4096"
queue.workerThreads="4"
this is frequently a mistake, it turns out that having too many workers can be
as bad as too few, as the contention between the workers slows things down
queue.workerThreadMinimumMessages="60000"
) {
call DataArchiving
#call DataForwarding
}
ruleset(name="DataArchiving"
queue.type="fixedArray"
queue.size="250000"
queue.dequeueBatchSize="4096"
queue.workerThreads="4"
queue.workerThreadMinimumMessages="60000"
) {
having the main queue do nothing but put the message in a second queue also ends
up being one of those things that can cause problems.
what ends up happening is that your mutliple main worker threads are in a loop
1. lock the main queue
2. is there anything in the main queue
3. if so grab up to batchsize messages
4. unlock the main queue
5. lock the action queue
6. reserve slots in the action queue for the number of messages in this batch
7. unlock the action queue
8. move these messages to the action queue
9. lock the action queue
10. mark the messages as being completely in the action queue
11. unlock the action queue
12. lock the main queue
13. mark the messages as being processed (freeing the queue slots)
14. unlock the main queue
15. goto 1
with 4 workers and no actual work to do other than copy the messages, this means
that all four workers are in a tight loop, each processing very few messages,
and as a result, there is a LOT of time when the queue is locked by the workers
This seems like a horribly complicated process, but the result is that you can
get a LOT of work done without the queues being locked (inserting into the main
queue is similar to steps 5-11)
But if you have too many workers, the batch size ends up being 1 (because the
messages are grabbed so aggressivly, and because the queues are locked so much
of the time that it's hard to get the new messages inserted)
So you don't want to create additional worker threads unless you find that you
need them (because the existing worker threads are maxing out the cpu), and you
don't want to have an action queue unless the action is either potentially very
slow, or is likely to fail. Writing to disk is neither of these (but writing to
a remote machine, or to a database, etc is both)
action( type="omfile"
file="/data/logs/incoming-all.log"
ioBufferSize="64k"
probably not needed, leave it out unless we find we need it
flushOnTXEnd="off"
ditto
asyncWriting="on")
asyncwriting is another one of those things that you probably shouldn't be
using. AsyncWriting doesn't mean the same thing that traditional syslog async
means (rsyslog does what traditional syslog async means by default, and you have
to go to extreme lengths to disable it). AsyncWriting means that instead of the
worker thread writing to disk, it formats the output line and writes it to
another thread (with more locking involved for this handoff), which then writes
it to disk. This can be useful if you have a very complex output format, or if
you are doing a lot of work in the output (compressing/encrypting/signing the
file as you go are the ony things that I've ever seen trigger this)
if $syslogtag == 'LEEF:' and $msg contains 'Websense' then {
contains is one of the most expensive operations you can do, especially if you
do it against rawmsg, sometimes you need it, but check carefully to see if you
really do, and where possible, do like you do here and have the contains after
another test that can narrow down the test first. Rsyslog will short-circuit the
test so if the first part if false, it won't even evaluate the second part
action( type="omfile"
dynaFile="sortByHostnameYearMonthWebsense"
using dynafile can be a problem. There is a parameter to set the dynafilecache
size, you need to set that to a number larger than the number of files that you
expect to be working with at any one time. If it's smaller, then you end up with
the process
1. I have a message to write to file 'A1', but I don't have it open
2. I want to open file A1, but my dynafilecache is full
3. evict file A9 from the dynafile cach (as it hasn't been accessed the
longest), flush any remaining messages to disk (fsync) and close the file
4. open file A1.
.
.
.
5. I hae a message to write to file A9, but I don't have it open
6. I want to open file A9, but my dynafilecache is full
7. evict file A1 from the cach...
8. open file A1
.
.
.
9. I have a message to write to file A1....
impstats should tell you stats on the dynafile cache, how many opens were
needed.
When figuring out your working set, remember that including timestamps as part
of the filename has some overlap, you may be processing messages with different
timestamps at the same time.
ioBufferSize="64k"
flushOnTXEnd="off"
asyncWriting="on"
)
stop
}
if $rawmsg contains '_SIEM' then {
action( type="omfile"
dynaFile="sortByHostnameYearMonthCiscoWSASquid"
ioBufferSize="64k"
flushOnTXEnd="off"
asyncWriting="on"
)
stop
}
if $rawmsg contains 'SFIMS:' then {
action( type="omfile"
dynaFile="sortByHostnameYearMonthSourcefire"
ioBufferSize="64k"
flushOnTXEnd="off"
asyncWriting="on"
)
stop
}
if $rawmsg contains 'fenotify-' then {
action( type="omfile"
dynaFile="sortByHostnameYearMonthFireeye"
ioBufferSize="64k"
flushOnTXEnd="off"
asyncWriting="on"
)
stop
}
if $hostname == 'server.com' then {
action( type="omfile"
dynaFile="sortByHostnameYearMonthTippingpoint"
ioBufferSize="64k"
flushOnTXEnd="off"
asyncWriting="on"
)
stop
}
it's a good idea to put simple tests like this ahead of more expensive ones. If
they match this test, stop will prevent any of the other tests from being done.
if re_match($rawmsg, 'a|b|c|d' ) then {
regex match is one of the things that's even more expensive than contains. look
to see if there's any way to filter this more.
action( type="omfile"
dynaFile="sortByHostnameYearMonthJuniperSSLVPN"
ioBufferSize="64k"
flushOnTXEnd="off"
asyncWriting="on"
)
stop
}
if $syslogtag startswith 'MSWinEventLog' then {
action( type="omfile"
dynaFile="sortByHostnameYearMonthWindows"
ioBufferSize="64k"
flushOnTXEnd="off"
asyncWriting="on"
)
stop
}
if $rawmsg contains 'RT_FLOW: RT_FLOW_SESSION' then {
action( type="omfile"
dynaFile="sortByHostnameYearMonthSRXFW"
ioBufferSize="64k"
flushOnTXEnd="off"
asyncWriting="on"
)
stop
}
if re_match($msg, '%ASA-[0-9]-[0-9]{6}' ) then {
use the pmcisco* module to fix the cisco "hostname is :" bug, then make this be
if $syslogtag startswith '%ASA', that avoids needing to check the entire message
and avoids the regex.
If there are only some messages you want to match, the arraymatch trick can help
if $syslogtag == ['%ASA-2-2645','%ASA-2-2646']
note that the array elements must be in alpha sorted order.
action( type="omfile"
dynaFile="sortByHostnameYearMonthCiscoASA"
ioBufferSize="64k"
flushOnTXEnd="off"
asyncWriting="on"
)
stop
}
if re_match($rawmsg, '%PIX-[0-9]-[0-9]{6}' ) then {
action( type="omfile"
dynaFile="sortByHostnameYearMonthCiscoPIX"
ioBufferSize="64k"
flushOnTXEnd="off"
asyncWriting="on"
)
stop
}
if re_match($rawmsg, '%FWSM-[0-9]-[0-9]{6}' ) then {
action( type="omfile"
dynaFile="sortByHostnameYearMonthCiscoFWSM"
ioBufferSize="64k"
flushOnTXEnd="off"
asyncWriting="on"
)
stop
}
if $msg contains '%SEC-6-IPACCESSLOG' then {
action( type="omfile"
dynaFile="sortByHostnameYearMonthCiscoACLRouter"
ioBufferSize="64k"
flushOnTXEnd="off"
asyncWriting="on"
)
stop
}
if $rawmsg contains 'CSCOacs' then {
action( type="omfile"
dynaFile="sortByHostnameYearMonthCiscoACS"
ioBufferSize="64k"
flushOnTXEnd="off"
asyncWriting="on"
)
stop
}
action( type="omfile"
dynaFile="sortByHostnameYearMonthLog"
ioBufferSize="64k"
flushOnTXEnd="off"
asyncWriting="on")
}
The files are written to a glusterfs file system sitting on top of XFS on a
iSCSI device. iSCSI has its own dedicated interface (eth1) and so does
glusterfs (eth2) that is separate from the incoming syslog traffic (eth0).
iostat output during the load test showed no strain on the disk i/o
sub-system.
Good thing to keep an eye on, but also watch the impstats for the outputs. If
they get slow (potentially because of glusterfs locking for example), it will
show up as the queue building up.
David Lang
Thanks,
-Bond
-----Original Message-----
From: [email protected]
[mailto:[email protected]] On Behalf Of David Lang
Sent: Wednesday, July 02, 2014 11:06 PM
To: rsyslog-users
Subject: Re: [rsyslog] question on reliability of message processing
On Wed, 2 Jul 2014, Masuda, Bond wrote:
hi rsyslogers:
we've been doing some load testing of syslog messages over UDP/514 to rsyslog.
we write all incoming messages to a file in /data/logs/incoming-all.log.
In our load test, we generated about 29 million messages in 300 seconds. On the
server side, we are receiving about 25 million messages; and about 4 million
messages are lost on the network (not an rsyslog issue). However, of the 25
million messages we know arrive at the server, we are also seeing message lost
in /data/logs/incoming-all.log, albeit to a much lesser degree than the network
problem.
The actual numbers are:
29,561,113 messages generated and sent in 300 seconds
24,802,441 messages arrive at the rsyslog server (counting UDP packets via
NETFILTER/mangle-PREROUTING accounting rule)
24,774,587 messages written to /data/logs/incoming-all.log
So it would seem that we lost 27854 messages within rsyslog.
My question is this:
1. Does rsyslog drop messages when its message queues are overflowing?
Yes, if the queue is full, what else can rsyslog do?
2. If answer to #1 is yes, does it keep any accounting of the lost messages and
how can I see those numbers? or at least warn that its queues are overflowing?
yes, enable impstats
3. if answer to #1 is yes, is there some configuration setting to make rsyslog
guarantee not to drop messages, potentially as trade off with some other
problem? Or is it just a matter of increasing queue sizes?
not with UDP, because if rsyslog can't deal with the message _NOW_ the OS will
drop the packet
4. If answer to #1 is no, what's the best way to go about troubleshooting why
messages are being lost?
BTW, under less stressful conditions, all the in/out numbers perfectly match.
We only start seeing "lost messages/packets" when we go above ~50,000 messages
per second.
first question, what version are you running, 8.2 is worlds better than 5.x
after that, have you tried disabling DNS lookups
run tcpdump (saving to a file, not outputting to the screen) to check if the
packets are really being lost on the network or if they are being lost in the
network stack of the receiving machine)
configure impstats to see what your queue sizes look like
what are you doing with the logs? it's possible that your bottleneck is in
outputting the logs and fixing that will solve your problem
you could configure disk assisted queues to spool logs to disk, but that is
slower than just outputting them to a file, so it doesn't solve the problem
a tuned rsyslog setup can do gig-E wire speed or faster, you are well below that
limit, so we should be able to help you.
David Lang
_______________________________________________
rsyslog mailing list
http://lists.adiscon.net/mailman/listinfo/rsyslog
http://www.rsyslog.com/professional-services/
What's up with rsyslog? Follow https://twitter.com/rgerhards
NOTE WELL: This is a PUBLIC mailing list, posts are ARCHIVED by a myriad of
sites beyond our control. PLEASE UNSUBSCRIBE and DO NOT POST if you DON'T LIKE
THAT.
_______________________________________________
rsyslog mailing list
http://lists.adiscon.net/mailman/listinfo/rsyslog
http://www.rsyslog.com/professional-services/
What's up with rsyslog? Follow https://twitter.com/rgerhards
NOTE WELL: This is a PUBLIC mailing list, posts are ARCHIVED by a myriad of
sites beyond our control. PLEASE UNSUBSCRIBE and DO NOT POST if you DON'T LIKE
THAT.
_______________________________________________
rsyslog mailing list
http://lists.adiscon.net/mailman/listinfo/rsyslog
http://www.rsyslog.com/professional-services/
What's up with rsyslog? Follow https://twitter.com/rgerhards
NOTE WELL: This is a PUBLIC mailing list, posts are ARCHIVED by a myriad of
sites beyond our control. PLEASE UNSUBSCRIBE and DO NOT POST if you DON'T LIKE
THAT.