Maybe this was too wordy; here's a simpler version.

  after adding:

  action.resumeRetryCount="-1"
  action.resumeInterval="5"

to the action definition, I see that rsyslogd tries to resend the message a few times, where sendto succeeds but no RELP response is received. After a few attempts, sendto takes approximately 10 minutes and fails.

  Afterwards the connection is closed, re-opened and everything works fine.

Is there any setting that would make the 10-minute sendto timeout shorter? I assume it's some TCP-level timeout, so maybe it's not settable through the rsyslog config (I didn't find anything myself).

Alternatively, is there anything that would make rsyslogd close and reopen the connection after not receiving a RELP response for some time (or some number of retries)?

I see there are some queue timeouts like $ActionQueueTimeoutActionCompletion (looking at http://www.rsyslog.com/doc/rsyslog_conf_global.html), but I'm not sure what those actually do or whether they would apply to this case.

  thanks!

        erik

On 12/16/2013 05:51 PM, Erik Steffl wrote:
   I changed the config as suggested; however, it does not seem to retry
that often. Here's what tcpdump shows:

ubuntu@ip-10-158-97-169:~$ sudo tcpdump -A port 5140
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 65535 bytes
01:05:43.652804 IP ip-10-158-97-169.ec2.internal.53118 >
ec2-50-19-250-187.compute-1.amazonaws.com.5140: Flags [P.], seq
3725470457:3725470859, ack 2328419555, win 237, options [nop,nop,TS val
840645376 ecr 1017783106], length 402
E.....@.@...
.a.2....~....*................
2.;.<.#B26 syslog 387 <134>2013-12-17T00:58:13.436136+00:00
ip-10-158-97-169 logBurst[3103]:
@cee:{"message-json":{"count":"0/1","now":"Tue Dec 17 00:58:13
2013"},"yummlyLogOrigin":{"supportLevel":"prod","system":"LOGS","cluster":"prod","role":"collectorErik","host":"ip-10-158-97-169","tag":"logBurst[3103]:","programname":"logBurst","priority":"local0.info","timestamp":"2013-12-17T00:58:13.436136+00:00"}}



01:07:43.972781 IP ip-10-158-97-169.ec2.internal.53118 >
ec2-50-19-250-187.compute-1.amazonaws.com.5140: Flags [P.], seq 0:402,
ack 1, win 237, options [nop,nop,TS val 840675456 ecr 1017783106],
length 402
E.....@[email protected]
.a.2....~....*................
2...<.#B26 syslog 387 <134>2013-12-17T00:58:13.436136+00:00
ip-10-158-97-169 logBurst[3103]:
@cee:{"message-json":{"count":"0/1","now":"Tue Dec 17 00:58:13
2013"},"yummlyLogOrigin":{"supportLevel":"prod","system":"LOGS","cluster":"prod","role":"collectorErik","host":"ip-10-158-97-169","tag":"logBurst[3103]:","programname":"logBurst","priority":"local0.info","timestamp":"2013-12-17T00:58:13.436136+00:00"}}


   syslog 387 keeps repeating every two minutes.
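
A likely reading of the two-minute spacing, offered as an assumption rather than something the capture proves: the second packet shows relative seq 0:402 with an unchanged ack and ecr, so it looks like a kernel TCP retransmission of the same 402-byte segment rather than a new rsyslog send. Linux doubles its retransmission timeout (RTO) after each unanswered attempt and caps it at 120 seconds. The sketch below assumes a 200 ms minimum and 120 s maximum RTO (the real starting value depends on the measured round-trip time) and compares that cap with the gap between the two tcpdump timestamps above.

```python
from datetime import datetime

RTO_MIN = 0.2    # seconds; assumed Linux minimum RTO
RTO_MAX = 120.0  # seconds; assumed Linux maximum RTO


def backoff_schedule(attempts):
    """Wait before each retransmission: doubles every time, capped at RTO_MAX."""
    return [min(RTO_MIN * 2 ** i, RTO_MAX) for i in range(attempts)]


def gap_seconds(first, second):
    """Seconds between two tcpdump-style HH:MM:SS.ffffff timestamps (same day)."""
    fmt = "%H:%M:%S.%f"
    delta = datetime.strptime(second, fmt) - datetime.strptime(first, fmt)
    return delta.total_seconds()


schedule = backoff_schedule(12)
observed_gap = gap_seconds("01:05:43.652804", "01:07:43.972781")

print("backoff waits (s):", schedule)
print("observed gap (s):", round(observed_gap, 3))
```

Under these assumptions the wait reaches the 120 s cap at the eleventh attempt, which is consistent with a stale connection that has already been retrying for several minutes.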

   Config:

if
   prifilt("local0.*") or
   ...
   prifilt("local7.*")
then {
   action(type="mmjsonparse")
   if $parsesuccess == "OK" then {
     action(
       type="omrelp"
       target="elb.collector.prod.logs.ylmmuy.com"
       port="5140"
       template="json"
       queue.type="LinkedList"
       queue.filename="json"
       queue.maxdiskspace="75161927680" # 70GB (valuable data)
       action.resumeRetryCount="-1"
       action.resumeInterval="5"
     )
   } else {
...
   }
   stop
}

   Same test as before (host to load balancer to another host using
RELP): no MARK, no other messages, just wait for the connection to go
stale, then start sending messages every 5 seconds. It takes about 15
minutes for it to recover.

   First message (strace output):

3081  00:58:13.528459 sendto(13, "26 syslog 387
<134>2013-12-17T00:58:13.436136+00:00 ip-10-158-97-169 logBurst[3103]:
@cee:{\"message-json\":{\"count\":\"0/1\",\"now\":\"Tue Dec 17 00:58:13
2013\"},\"yummlyLogOrigin\":{\"supportLevel\":\"prod\",\"system\":\"LOGS\",\"cluster\":\"prod\",\"role\":\"collectorErik\",\"host\":\"ip-10-158-97-169\",\"tag\":\"logBurst[3103]:\",\"programname\":\"logBurst\",\"priority\":\"local0.info\",\"timestamp\":\"2013-12-17T00:58:13.436136+00:00\"}}\n\n",
402, 0, NULL, 0) = 402
...
3081  00:58:13.529411 setsockopt(13, SOL_TCP, TCP_CORK, [0], 4) = 0
...
3081  00:58:18.725420 setsockopt(13, SOL_TCP, TCP_CORK, [1], 4) = 0
...
3081  00:58:18.726657 sendto(13, "27 syslog 387 ... same as above ...

   Just like tcpdump shows the message is being resent (strace output
just like the one above) until:

3081  01:02:27.982896 sendto(13, "77 syslog 387
<134>2013-12-17T01:02:27.893264+00:00 ip-10-158-97-169 logBurst[3257]:
@cee:{\"message-json\":{\"count\":\"0/1\",\"now\":\"Tue Dec 17 01:02:27
2013\"},\"yummlyLogOrigin\":{\"supportLevel\":\"prod\",\"system\":\"LOGS\",\"cluster\":\"prod\",\"role\":\"collectorErik\",\"host\":\"ip-10-158-97-169\",\"tag\":\"logBurst[3257]:\",\"programname\":\"logBurst\",\"priority\":\"local0.info\",\"timestamp\":\"2013-12-17T01:02:27.893264+00:00\"}}\n\n",
402, 0, NULL, 0 <unfinished ...>
... other threads ...
3081  01:13:44.932842 <... sendto resumed> ) = 45
... writing debug info ...
3081  01:13:44.934579 setsockopt(13, SOL_SOCKET, SO_LINGER, {onoff=1,
linger=0}, 8) = 0
3081  01:13:44.934662 close(13)         = 0

   After this it recovers. The total time is 15 minutes or so. Is there
any way to shorten this time?
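
One possible explanation for the 15 minutes, stated as an assumption: on Linux the kernel abandons an unacknowledged TCP segment after `net.ipv4.tcp_retries2` retransmissions (default 15), which the kernel documentation describes as a hypothetical timeout of 924.6 seconds. The sketch below reproduces that figure assuming a 200 ms minimum and 120 s maximum retransmission timeout, and compares it with the time between the first `sendto` and the `close(13)` in the strace output above. If this is indeed the cause, lowering `tcp_retries2` would shorten the wait, but it is a system-wide setting that affects every TCP connection on the host.

```python
from datetime import datetime

RTO_MIN = 0.2    # seconds; assumed Linux minimum RTO
RTO_MAX = 120.0  # seconds; assumed Linux maximum RTO


def give_up_after(retries):
    """Total seconds before the kernel abandons a segment.

    The original send and each of the `retries` retransmissions wait one
    (doubling, capped) RTO, so there are retries + 1 waits in total.
    """
    return sum(min(RTO_MIN * 2 ** i, RTO_MAX) for i in range(retries + 1))


def elapsed(start, end):
    """Seconds between two strace-style HH:MM:SS.ffffff timestamps (same day)."""
    fmt = "%H:%M:%S.%f"
    return (datetime.strptime(end, fmt) - datetime.strptime(start, fmt)).total_seconds()


predicted = give_up_after(15)                             # default tcp_retries2
observed = elapsed("00:58:13.528459", "01:13:44.934662")  # first sendto -> close(13)

print(f"predicted: {predicted:.1f} s, observed: {observed:.1f} s")
for retries in (15, 8, 5):
    print(f"tcp_retries2={retries}: about {give_up_after(retries):.1f} s")
```

The predicted and observed values differ by only a few seconds here, which fits this explanation but does not prove it; the actual timeout depends on the round-trip time the kernel measured for the connection.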

     erik

On 12/16/2013 12:42 AM, Rainer Gerhards wrote:
On Mon, Dec 16, 2013 at 9:35 AM, Erik Steffl <[email protected]> wrote:


if
   prifilt("local0.*") or
   ...
   (prifilt("kern.info") and ($msg == '-- MARK --'))

then {
   action(type="mmjsonparse")
   if $parsesuccess == "OK" then {
     action(
       type="omrelp"
       target="elb.collector.prod.logs.ylmmuy.com"
       port="5140"
       template="json"
     )
   } else {
     action(
       type="omrelp"
       target="elb.collector.prod.logs.ylmmuy.com"
       port="5140"
       template="text"
     )
   }
   stop
}


That's what I suspected. You use the defaults, which means "disable me
for 30 seconds if the connection breaks continuously". Try using

     action(
       type="omrelp"
       target="elb.collector.prod.logs.ylmmuy.com"
       port="5140"
       template="text"
       action.resumeRetryCount="-1"
       action.resumeInterval="5"
     )

to get you started. It will try infinitely to send messages, but will
pause 5 seconds between retries. Note that you may run into trouble if
the destination is offline for an extended period of time.

   I don't see the retry settings at
http://www.rsyslog.com/doc/omrelp.html; are these some generic action
retries?


action parameters applying to all actions:

http://www.rsyslog.com/doc/rsyslog_conf_actions.html

(you know the doc discussion, so no need to explain it may be unintuitive
to find ;-))

Rainer
_______________________________________________
rsyslog mailing list
http://lists.adiscon.net/mailman/listinfo/rsyslog
http://www.rsyslog.com/professional-services/
What's up with rsyslog? Follow https://twitter.com/rgerhards
NOTE WELL: This is a PUBLIC mailing list, posts are ARCHIVED by a
myriad of sites beyond our control. PLEASE UNSUBSCRIBE and DO NOT POST
if you DON'T LIKE THAT.


