On 12/18/2013 10:47 PM, Rainer Gerhards wrote:
On Thu, Dec 19, 2013 at 2:28 AM, Erik Steffl <[email protected]> wrote:

   maybe this was too wordy, here's simpler version.


I had finally to work on some internal things before vacation, thus the
silence. Today is meetings as well...


   after adding:

   action.resumeRetryCount="-1"
   action.resumeInterval="5"

   to action definition I see that rsyslogd tries to resend the message few
times, where sendto succeeds but not RELP response is received. After few
times sendto take approximately 10 minutes and fails.


You mean sendto() itself takes around 10 minutes?

yes, there are few sendto syscalls (same message) that return immediately, I assume the data is not actually sent out, just accepted by driver. Then after few minutes one sendto itself takes about 10 minutes (this is from strace -f -tt so the first number is thread pid then time then system call, I left out most of the data in sendto, the exact sendto is quoted below):

3081 01:02:27.982896 sendto(13, "77 syslog 387 <134>...",402, 0, NULL, 0 <unfinished ...>

 and it returns with an error 10 min later:

3081  01:13:44.932842 <... sendto resumed> ) = 45

  then rsyslogd closes the connection:

3081 01:13:44.934579 setsockopt(13, SOL_SOCKET, SO_LINGER, {onoff=1, linger=0}, 8) = 0
3081  01:13:44.934662 close(13)         = 0

  afterwards rsyslog opens connection again and it all works fine...

  thanks!

        erik


Rainer

   Afterwards the connection is closed, re-opened and everything works fine.

   Is there any setting that would make the 10 minute sendto timeout
shorter? Assuming it's some TCP level timeout so maybe it's not settable
using rsyslog config (didn't find anything myself).

   Alternatively is there anything that would make rsyslogd close and
reopen connection after not receiving RELP response for some time (or some
number of retries)?

   I see there are some queue timeouts like $ActionQueueTimeoutActionCompletion
(looking at http://www.rsyslog.com/doc/rsyslog_conf_global.html) but not
sure what those actually do and if they would apply to this case.

   thanks!

         erik


On 12/16/2013 05:51 PM, Erik Steffl wrote:

    chaged the config as suggested however it does not seem it retries
that often, here's what tcpdump shows:

ubuntu@ip-10-158-97-169:~$ sudo tcpdump -A port 5140
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 65535 bytes
01:05:43.652804 IP ip-10-158-97-169.ec2.internal.53118 >
ec2-50-19-250-187.compute-1.amazonaws.com.5140: Flags [P.], seq
3725470457:3725470859, ack 2328419555, win 237, options [nop,nop,TS val
840645376 ecr 1017783106], length 402
E.....@.@...
.a.2....~....*................
2.;.<.#B26 syslog 387 <134>2013-12-17T00:58:13.436136+00:00
ip-10-158-97-169 logBurst[3103]:
@cee:{"message-json":{"count":"0/1","now":"Tue Dec 17 00:58:13
2013"},"yummlyLogOrigin":{"supportLevel":"prod","system":
"LOGS","cluster":"prod","role":"collectorErik","host":"ip-
10-158-97-169","tag":"logBurst[3103]:","programname"
:"logBurst","priority":"local0.info","timestamp":"2013-12-
17T00:58:13.436136+00:00"}}



01:07:43.972781 IP ip-10-158-97-169.ec2.internal.53118 >
ec2-50-19-250-187.compute-1.amazonaws.com.5140: Flags [P.], seq 0:402,
ack 1, win 237, options [nop,nop,TS val 840675456 ecr 1017783106],
length 402
E.....@[email protected]
.a.2....~....*................
2...<.#B26 syslog 387 <134>2013-12-17T00:58:13.436136+00:00
ip-10-158-97-169 logBurst[3103]:
@cee:{"message-json":{"count":"0/1","now":"Tue Dec 17 00:58:13
2013"},"yummlyLogOrigin":{"supportLevel":"prod","system":
"LOGS","cluster":"prod","role":"collectorErik","host":"ip-
10-158-97-169","tag":"logBurst[3103]:","programname"
:"logBurst","priority":"local0.info","timestamp":"2013-12-
17T00:58:13.436136+00:00"}}


    syslog 387 keeps repeating every two minutes.

    Config:

if
    prifilt("local0.*") or
    ...
    prifilt("local7.*")
then {
    action(type="mmjsonparse")
    if $parsesuccess == "OK" then {
      action(
        type="omrelp"
        target="elb.collector.prod.logs.ylmmuy.com"
        port="5140"
        template="json"
        queue.type="LinkedList"
        queue.filename="json"
        queue.maxdiskspace="75161927680" # 70GB (valuable data)
        action.resumeRetryCount="-1"
        action.resumeInterval="5"
      )
    } else {
...
    }
    stop
}

    Same test as before (host to load balancer to another host using
RELP), no MARK, no other messages, just wait for connection to go stale
then start sending messages every 5 seconds. It takes about 15 minutes
for it to recover.

    First message (strace output):

3081  00:58:13.528459 sendto(13, "26 syslog 387
<134>2013-12-17T00:58:13.436136+00:00 ip-10-158-97-169 logBurst[3103]:
@cee:{\"message-json\":{\"count\":\"0/1\",\"now\":\"Tue Dec 17 00:58:13
2013\"},\"yummlyLogOrigin\":{\"supportLevel\":\"prod\",\"
system\":\"LOGS\",\"cluster\":\"prod\",\"role\":\"
collectorErik\",\"host\":\"ip-10-158-97-169\",\"tag\":\"
logBurst[3103]:\",\"programname\":\"logBurst\",\"priority\":\"local0.info
\",\"timestamp\":\"2013-12-17T00:58:13.436136+00:00\"}}\n\n",
402, 0, NULL, 0) = 402
...
3081  00:58:13.529411 setsockopt(13, SOL_TCP, TCP_CORK, [0], 4) = 0
...
3081  00:58:18.725420 setsockopt(13, SOL_TCP, TCP_CORK, [1], 4) = 0
...
3081  00:58:18.726657 sendto(13, "27 syslog 387 ... same as above ...

    Just like tcpdump shows the message is being resent (strace output
just like the one above) until:

3081  01:02:27.982896 sendto(13, "77 syslog 387
<134>2013-12-17T01:02:27.893264+00:00 ip-10-158-97-169 logBurst[3257]:
@cee:{\"message-json\":{\"count\":\"0/1\",\"now\":\"Tue Dec 17 01:02:27
2013\"},\"yummlyLogOrigin\":{\"supportLevel\":\"prod\",\"
system\":\"LOGS\",\"cluster\":\"prod\",\"role\":\"
collectorErik\",\"host\":\"ip-10-158-97-169\",\"tag\":\"
logBurst[3257]:\",\"programname\":\"logBurst\",\"priority\":\"local0.info
\",\"timestamp\":\"2013-12-17T01:02:27.893264+00:00\"}}\n\n",
402, 0, NULL, 0 <unfinished ...>
... other threads ...
3081  01:13:44.932842 <... sendto resumed> ) = 45
... writing debug info ...
3081  01:13:44.934579 setsockopt(13, SOL_SOCKET, SO_LINGER, {onoff=1,
linger=0}, 8) = 0
3081  01:13:44.934662 close(13)         = 0

    After this it recovers. The total time is 15 minutes or so. Is there
any way to shorten this time?

      erik

On 12/16/2013 12:42 AM, Rainer Gerhards wrote:

On Mon, Dec 16, 2013 at 9:35 AM, Erik Steffl <[email protected]> wrote:


if
    prifilt("local0.*") or
    ...
    (prifilt("kern.info") and ($msg == '-- MARK --'))

then {
    action(type="mmjsonparse")
    if $parsesuccess == "OK" then {
      action(
        type="omrelp"
        target="elb.collector.prod.logs.ylmmuy.com"
        port="5140"
        template="json"
      )
    } else {
      action(
        type="omrelp"
        target="elb.collector.prod.logs.ylmmuy.com"
        port="5140"
        template="text"
      )
    }
    stop
}


  that's what I suspected. You use the defaults, which means "disable me
for
30 seconds if the connections break continuesly". Try

use

      action(
        type="omrelp"
        target="elb.collector.prod.logs.ylmmuy.com"
        port="5140"
        template="text"
        *action.resumeRetryCount="-1"*
        *action.resumeInterval="5"*
      )

   to get you started. It will try infinitely to send messages, but will
pause 5 seconds between retries. Note that you may run into trouble if
the
destination is offline for an extended period of time.

    http://www.rsyslog.com/doc/omrelp.html don't see the retry
settings, are

these some generic action retries?



action parameters applying to all actions:

http://www.rsyslog.com/doc/rsyslog_conf_actions.html

(you know the doc discussion, so no need to explain it may be unintuitive
to find ;-))

Rainer
_______________________________________________
rsyslog mailing list
http://lists.adiscon.net/mailman/listinfo/rsyslog
http://www.rsyslog.com/professional-services/
What's up with rsyslog? Follow https://twitter.com/rgerhards
NOTE WELL: This is a PUBLIC mailing list, posts are ARCHIVED by a
myriad of sites beyond our control. PLEASE UNSUBSCRIBE and DO NOT POST
if you DON'T LIKE THAT.



_______________________________________________
rsyslog mailing list
http://lists.adiscon.net/mailman/listinfo/rsyslog
http://www.rsyslog.com/professional-services/
What's up with rsyslog? Follow https://twitter.com/rgerhards
NOTE WELL: This is a PUBLIC mailing list, posts are ARCHIVED by a myriad
of sites beyond our control. PLEASE UNSUBSCRIBE and DO NOT POST if you
DON'T LIKE THAT.

_______________________________________________
rsyslog mailing list
http://lists.adiscon.net/mailman/listinfo/rsyslog
http://www.rsyslog.com/professional-services/
What's up with rsyslog? Follow https://twitter.com/rgerhards
NOTE WELL: This is a PUBLIC mailing list, posts are ARCHIVED by a myriad of 
sites beyond our control. PLEASE UNSUBSCRIBE and DO NOT POST if you DON'T LIKE 
THAT.


_______________________________________________
rsyslog mailing list
http://lists.adiscon.net/mailman/listinfo/rsyslog
http://www.rsyslog.com/professional-services/
What's up with rsyslog? Follow https://twitter.com/rgerhards
NOTE WELL: This is a PUBLIC mailing list, posts are ARCHIVED by a myriad of 
sites beyond our control. PLEASE UNSUBSCRIBE and DO NOT POST if you DON'T LIKE 
THAT.

Reply via email to