On Wed, 22 Apr 2015, Joe Blow wrote:

Hey all,

I've got a log server (running the latest and greatest rsyslog as of
yesterday) which I've been seeing die at random.  I load balance and have
scripts that check whether rsyslog is running and restart it if it isn't,
but I'm having a really tough time tracking down what could be making
rsyslog crash.  Sometimes it seems like the rsyslog daemon dies, and because
of it I can't log in via SSH.  Because of this I've been forced to keep a
screen session open on the rsyslog boxes just in case I need to restart
rsyslog (which allows me to log in again).

I'm taking logs in from a number of different sources (asa, snare, etc.),
all of them with their own disk-assisted output queues, all outputting to
Elasticsearch.  When I monitor the queues, I don't see any of them
backing up or anything else that would lead me to believe rsyslog is
ballooning in memory and about to die.

I have a number of these logs in my catchall bucket:

Framing Error in received TCP message: delimiter is not SP but has ASCII
value 46. [v8.9.0]

I see a few of these within the error logs too:

"UnavailableShardsException[[cisco-20150420][4] [3] shardIt, [1] active :
Timeout waiting for [3m], request:
org.elasticsearch.action.bulk.BulkShardRequest@5d48afd4]"

Could either of these cause rsyslog to hard die?  How would you recommend
finding these seemingly random failures?

Here is what most of my ES output actions (and their queues) look like:

<snip>

if $rawmsg contains "%ASA-" or $rawmsg contains "%PIX-" then {
       action(type="mmnormalize" userawmsg="on"
               rulebase="/etc/rsyslog.d/asa.rule")
       action(type="omelasticsearch"
               name="rsys_asa"
               server="10.10.10.10"
               serverport="9200"
               template="ciscoasa"
               asyncrepl="on"
               searchType="asa"
               searchIndex="ciscoasaindex"
               timeout="3m"
               dynSearchIndex="on"
               bulkmode="on"
               errorfile="asa_err.log"
               queue.type="linkedlist"
               queue.filename="cisco.rsysq"
               queue.size="15000000"
               queue.saveonshutdown="on"
               queue.maxdiskspace="100g"
               queue.dequeuebatchsize="5000"
               action.resumeretrycount="30")
       stop
}

</snip>

Anything glaring here?  Could my retries be killing rsyslog?  Any ideas on
how I should go about troubleshooting?

start by configuring impstats and have it log to a local file so you get its info even if other logging can't happen.

the odds are good that what is happening isn't that rsyslog is crashing, but rather that it's filling its queues and then not accepting new input. If rsyslog isn't running at all, you would still be able to log in, but if it's running and its queues are full, the ssh login can't be logged and you get the symptoms that you are describing.

the impstats data will let you track down which action is not keeping up.
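
As a rough illustration of the impstats suggestion above, a minimal sketch might look like the following. The interval, the severity, and the file path /var/log/rsyslog-stats.log are assumptions chosen for the example, not values from this thread. The impstats documentation recommends loading the module at the top of rsyslog.conf so that counters are enabled everywhere.

```
# Sketch only: interval, severity, and the file path are illustrative assumptions.
# interval   - seconds between statistics reports
# log.syslog - "off" keeps the stats out of the normal rsyslog processing path
# log.file   - impstats writes here directly, so the data survives full queues
module(load="impstats"
       interval="60"
       severity="7"
       log.syslog="off"
       log.file="/var/log/rsyslog-stats.log")
```

With something like this in place, each named action should show up in the stats file with its own queue counters (for the config quoted above, the action named rsys_asa). A queue whose size keeps climbing toward its configured maximum, or whose "full" counter keeps increasing, is the likely candidate for the action that is not keeping up.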

David Lang
