even though I don't see the periods of silence on collector-test the 100-message-bursts are not received every 5 minutes, e.g. there was nothing received by collector-prod at 4:10 but at 4:15 I see (strace of the recevier, i.e. collector-prod):

[pid 8503] 04:15:05.295163 recvfrom(173, "2 syslog 389 <134>2013-12-07T04:10:01.577241+00:00 ip-10-158-97-169 logBurst[11492]: @cee:{\"message-json\":{\"count\":100,\"START\":\"Sat Dec 7 04:10:01 2013\"},\"yummlyLogOrigin\":{\"supportLevel\":\"prod\",\"system\":\"LOGS\",\"cluster\":\"prod\",\"role\":\"collectorErik\",\"host\":\"ip-10-158-97-169\",\"tag\":\"logBurst[11492]:\",\"programname\":\"logBurst\",\"priority\":\"local0.info\",\"timestamp\":\"2013-12-07T04:10:01.577241+00:00\"}}\n\n3 syslog 381 <134>2013-12-07T04:10:01.577610+00:00 ip-10-158-97-169 logBurst[11492]: @cee:{\"message-json\":{\"now\":\"Sat Dec 7 04:10:01 2013\",\"i\":0},\"yummlyLogOrigin\":{\"supportLevel\":\"prod\",\"system\":\"LOGS\",\"cluster\":\"prod\",\"role\":\"collectorErik\",\"host\":\"ip-10-158-97-169\",\"tag\":\"logBurst[11492]:\",\"programname\":\"logBurst\",\"priority\":\"local0.info\",\"timestamp\":\"2013-12-07T04:10:01.577610+00:00\"}}\n\n4 syslog 381 <134>2013-12-07T04:10:01.577868+00:00 ip-10-158-97-169 logBurst[11492]: @cee:{\"message-json\":{\"now\":\"Sat Dec 7 04:10:01 2013\",\"i\":1},\"yummlyLogOrigin\":{\"supportLevel\":\"prod\",\"system\":\"LOGS"..., 32768, MSG_DONTWAIT, NULL, NULL) = 32768

as you can see in the message it was actually sent at 4:10 but was received at 4:15 (the strave date). So I guess the total silence is just a worst case of the problem that happens much more often.

        erik

On 12/06/2013 06:59 PM, Erik Steffl wrote:
   getting some messages about suspend/resume in /var/log/syslog, but
this time I don't see any periods of silence (sometime they disappear,
not sure why). There was one silence at 2013-12-7 2:20 - 2:25 UTC time
(all times in logs are UTC as well)

   Filtered out pstats messages, full syslog is at:

https://drive.google.com/file/d/0B5zeQB_ZuLw0c010VnhfVkZLSWs/edit?usp=sharing


Dec  7 01:55:01 ip-10-158-97-169 CRON[11194]: (ubuntu) CMD
(/home/ubuntu/logBurst.pl --count 100)
Dec  7 01:56:01 ip-10-158-97-169 rsyslogd-3003: action '(null)'
suspended, next retry is Sat Dec  7 01:56:31 2013 [try
http://www.rsyslog.com/e/3003 ]
Dec  7 01:57:30 ip-10-158-97-169 rsyslogd-3003: action '(null)'
suspended, next retry is Sat Dec  7 01:58:00 2013 [try
http://www.rsyslog.com/e/3003 ]
Dec  7 01:58:59 ip-10-158-97-169 rsyslogd-3003: action '(null)'
suspended, next retry is Sat Dec  7 01:59:29 2013 [try
http://www.rsyslog.com/e/3003 ]
Dec  7 02:00:01 ip-10-158-97-169 CRON[11197]: (ubuntu) CMD
(/home/ubuntu/logBurst.pl --count 100)
Dec  7 02:00:28 ip-10-158-97-169 rsyslogd-3003: action '(null)'
suspended, next retry is Sat Dec  7 02:00:58 2013 [try
http://www.rsyslog.com/e/3003 ]
Dec  7 02:01:57 ip-10-158-97-169 rsyslogd-3003: action '(null)'
suspended, next retry is Sat Dec  7 02:02:27 2013 [try
http://www.rsyslog.com/e/3003 ]
Dec  7 02:03:26 ip-10-158-97-169 rsyslogd-3003: action '(null)'
suspended, next retry is Sat Dec  7 02:03:56 2013 [try
http://www.rsyslog.com/e/3003 ]
Dec  7 02:03:56 ip-10-158-97-169 rsyslogd0: action '(null)' resumed [try
http://www.rsyslog.com/e/0 ]
Dec  7 02:05:01 ip-10-158-97-169 CRON[11200]: (ubuntu) CMD
(/home/ubuntu/logBurst.pl --count 100)
Dec  7 02:05:02 ip-10-158-97-169 rsyslogd0: action '(null)' resumed [try
http://www.rsyslog.com/e/0 ]
Dec  7 02:10:01 ip-10-158-97-169 CRON[11203]: (ubuntu) CMD
(/home/ubuntu/logBurst.pl --count 100)
Dec  7 02:15:01 ip-10-158-97-169 CRON[11207]: (ubuntu) CMD
(/home/ubuntu/logBurst.pl --count 100)
Dec  7 02:17:01 ip-10-158-97-169 CRON[11211]: (root) CMD (   cd / &&
run-parts --report /etc/cron.hourly)
Dec  7 02:20:01 ip-10-158-97-169 CRON[11346]: (ubuntu) CMD
(/home/ubuntu/logBurst.pl --count 100)
Dec  7 02:25:01 ip-10-158-97-169 CRON[11349]: (ubuntu) CMD
(/home/ubuntu/logBurst.pl --count 100)
Dec  7 02:25:54 ip-10-158-97-169 rsyslogd0: action '(null)' resumed [try
http://www.rsyslog.com/e/0 ]
Dec  7 02:30:01 ip-10-158-97-169 CRON[11352]: (ubuntu) CMD
(/home/ubuntu/logBurst.pl --count 100)
Dec  7 02:31:14 ip-10-158-97-169 rsyslogd0: action '(null)' resumed [try
http://www.rsyslog.com/e/0 ]
Dec  7 02:35:01 ip-10-158-97-169 CRON[11355]: (ubuntu) CMD
(/home/ubuntu/logBurst.pl --count 100)

   the CRON logs saying that logBurst was ran show when the messages
where sent from this host (collector-test) to another rsyslog host
(collector-prod). Other than that nothing else should be sent between
these two hosts.

   There's a steady traffic of incoming messages to this host that are
being written to files.

   Let me know if I can provide more info or run different scenarios to
help figure out what's going on.

   thanks!

     erik

On 12/06/2013 01:04 AM, Rainer Gerhards wrote:
On Fri, Dec 6, 2013 at 10:00 AM, Erik Steffl <[email protected]> wrote:

On 12/05/2013 11:27 PM, Rainer Gerhards wrote:

On Fri, Dec 6, 2013 at 8:17 AM, Erik Steffl <[email protected]> wrote:

     compiled it successfully but didn't try it yet, was trying to
figure
out
what's going on with ELB and timeout.

    Will try it tomorrow (fri pacific time), what will I be looking
for?



sorry, don't understand that sentence.


   is the new version going to print additional logs to /var/log/syslog?


yes, it will log messages when an action is suspended and when it is
reactivated again. I think I also added stats counters for these events.
But was already a couple of days (and > 100 commits) ago, so I don't know
exactly out of my head.


Or debug logs? Would these be triggered when there is timeout waiting
for
RELP response?


It's triggered whenever the action decides it fails. That's exactly
what I
would like to know.

Rainer



  Do you think it will log anything interesting given the new
information I
just posted?


  Honestly, I haven't look deeply at it, because I need the other
information
first. Don't waste time on trying to find out things if you don't
have the
basics...


   alright, will test with rsyslog I complied from git (master) and see
what happens,

   thanks,

         erik



Rainer


     thanks!

          erik


On 12/05/2013 10:34 PM, Rainer Gerhards wrote:

  Erik,

have you tried the version I created for you? What does it report?

Rainer


On Fri, Dec 6, 2013 at 5:10 AM, Erik Steffl <[email protected]> wrote:

      at this point pretty sure it's related to Amazon Elastic Load
Balancer

(ELB) and some kind of timeout (it only happens if I use ELB and
there
is
enough time between bursts of messages, e.g. 5 min triggers
failure, 1
minute does not).

     Now I am back to original failure scenario (ELB and burst of
100
messages every 5 minutes).

     Here's output from lsof:

ubuntu@ip-10-158-97-169:~$ sudo lsof | grep rsyslogd | grep TCP
rsyslogd   9743          syslog    6u     IPv4            7177271
0t0
      TCP *:5140 (LISTEN)
rsyslogd   9743          syslog    7u     IPv6            7177272
0t0
      TCP *:5140 (LISTEN)
rsyslogd   9743          syslog   10u     IPv4            7187191
0t0
      TCP ip-10-158-97-169.ec2.internal:48176->ec2-50-19-250-187.
compute-1.amazonaws.com:5140 (CLOSE_WAIT)
rsyslogd   9743          syslog   12u     IPv4            7183726
0t0
      TCP ip-10-158-97-169.ec2.internal:5140->216.133.125.82:16543(
ESTABLISHED)

     where:

ip-10-158-97-169.ec2.internal is the collector-test, the host
that is
sending messages

ec2-50-19-250-187.compute-1.amazonaws.com is ELB

Strace shows some interesting stuff, even though I didn't catch the
beginning of the problem this time (running strace with -f
options so
the
first number on each line is thread pid):

9842  03:08:44.391025 sendto(10, "204 syslog 380
<134>2013-12-06T02:55:01.747931+00:00
ip-10-158-97-169 logBurst[9908]:
@cee:{\"message-json\":{\"now\":\"Fri
Dec  6 02:55:01 2013\",\"i\":99},\"yummlyLogOrigin\":{\"
supportLevel\":\"prod\",\"system\":\"LOGS\",\"cluster\":
\"prod\",\"role\":\"collectorErik\",\"host\":\"ip-
10-158-97-169\",\"tag\":\"logBurst[9908]:\",\"
programname\":\"logBurst\",\"priority\":\"local0.info\",\"
timestamp\":\"2013-12-06T02:55:01.747931+00:00\"}}\n\n", 396, 0,
NULL, 0
<unfinished ...>

... then nothing related to fd 10 until:

9842  03:10:32.676803 <... sendto resumed> ) = -1 ETIMEDOUT
(Connection
timed out)
9842  03:10:32.676887 gettimeofday({1386299432, 676918}, NULL) = 0
9842  03:10:32.676994 setsockopt(10, SOL_SOCKET, SO_LINGER,
{onoff=1,
linger=0}, 8) = 0
9842  03:10:32.677101 close(10)         = 0

     So I assume that means the connection is bad, rsyslog
figures it
out
and
closes the connection.

     Then rsyslog looks up IP of ELB and opens another connection:

9842  03:10:32.687983 socket(PF_INET, SOCK_STREAM, IPPROTO_TCP) = 10
9842  03:10:32.688088 connect(10, {sa_family=AF_INET,
sin_port=htons(5140), sin_addr=inet_addr("50.19.250.187")}, 16) = 0
9842  03:10:32.689592 clock_gettime(CLOCK_REALTIME, {1386299432,
689625848}) = 0
9842  03:10:32.689694 sendto(10, "1 open 85
relp_version=0\nrelp_software=
librelp,1.2.0,http://librelp.adiscon.com\ncommands=syslog\n";, 96, 0,
NULL, 0) = 96
9842  03:10:32.689812 recvfrom(10, 0x7ffc58b53490, 32768, 64, 0,
0) =
-1
EAGAIN (Resource temporarily unavailable)
...
9842  03:11:32.214608 recvfrom(10, "", 32768, MSG_DONTWAIT, NULL,
NULL)
= 0
...
9750  03:12:16.994327 setsockopt(10, SOL_SOCKET, SO_LINGER,
{onoff=1,
linger=0}, 8 <unfinished ...>
9750  03:12:16.994443 close(10 <unfinished ...>
9750  03:12:16.994597 <... close resumed> ) = 0

     So it seems like rsyslog is trying to open a new RELP
session but
it's
getting ETIMEDOUT. This is very weird because all involved hosts
(sender,
ELB, receiver) work fine at the same time, they continue to send and
receive data, ELB thinks the receiver is healthy etc. Everything
seems
to
be working fine except rsyslog is not able to open working
connection
to
ELB (to recevier via ELB).

     Any ideas how to investigate this further? This seems to be
a lot
more
specific then what I had before (just mysterious silence) so
hopefully
it
makes more sense...

     thanks!

           erik


On 11/30/2013 02:12 AM, Rainer Gerhards wrote:

   If the connection fails, rsyslog *immediately* tries to
re-open, but

that
seems to fail as well.

Btw: git v7-devel and master branch niw contain the changes i made.
   Could
you please try one of these versions.

Sent from phone, thus brief.
Am 30.11.2013 00:39 schrieb "Erik Steffl" <[email protected]>:

    On 11/29/2013 01:48 AM, Rainer Gerhards wrote:


    On Thu, Nov 28, 2013 at 2:07 AM, Erik Steffl <[email protected]>
wrote:


     On 11/26/2013 07:06 AM, Pavel Levshin wrote:



    26.11.2013 1:20, Erik Steffl:



        what does the above mean exactly? It does make
sense, in
each

  case
the burst of messages gets in via /dev/log then is send out
via
RELP
(either to the same machine or a different one).

       Any ideas why this doesn't fix itself until the next
burst
of
messages? Any suggestions on what to do or what to investigate
next?
I
guess I could run these with strace and see what exactly
they say
to
each other (both the sender and receiver).


     I'm not sure why (and if) this is 'fixed' by the next
burst
of

   messages,

but this can be somehow because next portion of messages pushes
the
queue. Perhaps it is unable to retry after suspend without a
push.


        not sure if 'fixed' is the right word but it is always
unstuck

  right
after next (sometime next to next) burst of messages, never at
random
time.
This behaviour is the same whether the period is 5 min or 15
min.


    While I have been silent, I followed the ML discussion (I
did
not

  check
the
debug log further, though, as Pavel did excellent work here).
To me,
it
looks like "normal" suspension code is kicking in. If an action
fails,
rsyslog retries once (except otherwise configured) and if it
fails
again,
the action is initially suspended for 30 seconds. Then, retries
happen
and
the suspension period is prolonged if they fail.

It looks very much like this is the mechanism at work. However,
what I
don't understand is why the suspension period is so long.

Out of all this, I think it would make much sense if rsyslog
had the
capability to report when an action is suspended and when it is
resumed. I
am right now adding this capability. I would suggest that when
this
change
is ready, you apply it and we can than see what it reports (much
easier
than walking the debug log, and very obvious to users ;)).


       as I mentioned I tried to do two changes to the test
scenario
and

tested
each of these separately:

      - send 200 message burst directly from collector-test to
collector-prod
(RELP), i.e. no load balancer

      - keep using the load balancer but in addition to 200
messages
curst
every 5 min also send few (3) messages every minute

      Each of these scenarios work, as in traffic is smooth, NO
silences.
This
means that something goes wrong if we use load balancer and
there is
no
traffic for 5 minutes.

      From our previous investigation of amazon elastic load
balancer
(which
is what we're using) it often lies about the connection, i.e.
it has
no
connection on the backend but it happily accepts connects and data
and
pretends everything is fine (that's why we initially switched from
plain
TCP to RELP).

      Not entirely sure what the load balancer is doing in this
case
but
it
seems that rsyslog thinks the connection is fine and keeps sending
data
to
the load balancer but connection is actually broken (maybe the
load
balancer closed the connection between itself and the destination
because
of some timeout).

      So the situation is not fixed until rsyslog closes the
connection
and
opens a new one.

      Is there any way for us to either make rsyslog keep the
connection
alive
or to re-open it sooner?

            erik

_______________________________________________
rsyslog mailing list
http://lists.adiscon.net/mailman/listinfo/rsyslog
http://www.rsyslog.com/professional-services/
What's up with rsyslog? Follow https://twitter.com/rgerhards
NOTE WELL: This is a PUBLIC mailing list, posts are ARCHIVED by a
myriad
of sites beyond our control. PLEASE UNSUBSCRIBE and DO NOT POST if
you
DON'T LIKE THAT.

    _______________________________________________

  rsyslog mailing list
http://lists.adiscon.net/mailman/listinfo/rsyslog
http://www.rsyslog.com/professional-services/
What's up with rsyslog? Follow https://twitter.com/rgerhards
NOTE WELL: This is a PUBLIC mailing list, posts are ARCHIVED by a
myriad
of sites beyond our control. PLEASE UNSUBSCRIBE and DO NOT POST
if you
DON'T LIKE THAT.


   _______________________________________________

rsyslog mailing list
http://lists.adiscon.net/mailman/listinfo/rsyslog
http://www.rsyslog.com/professional-services/
What's up with rsyslog? Follow https://twitter.com/rgerhards
NOTE WELL: This is a PUBLIC mailing list, posts are ARCHIVED by a
myriad
of sites beyond our control. PLEASE UNSUBSCRIBE and DO NOT POST
if you
DON'T LIKE THAT.

   _______________________________________________

rsyslog mailing list
http://lists.adiscon.net/mailman/listinfo/rsyslog
http://www.rsyslog.com/professional-services/
What's up with rsyslog? Follow https://twitter.com/rgerhards
NOTE WELL: This is a PUBLIC mailing list, posts are ARCHIVED by a
myriad
of sites beyond our control. PLEASE UNSUBSCRIBE and DO NOT POST if
you
DON'T LIKE THAT.


  _______________________________________________
rsyslog mailing list
http://lists.adiscon.net/mailman/listinfo/rsyslog
http://www.rsyslog.com/professional-services/
What's up with rsyslog? Follow https://twitter.com/rgerhards
NOTE WELL: This is a PUBLIC mailing list, posts are ARCHIVED by a
myriad
of sites beyond our control. PLEASE UNSUBSCRIBE and DO NOT POST if you
DON'T LIKE THAT.

  _______________________________________________
rsyslog mailing list
http://lists.adiscon.net/mailman/listinfo/rsyslog
http://www.rsyslog.com/professional-services/
What's up with rsyslog? Follow https://twitter.com/rgerhards
NOTE WELL: This is a PUBLIC mailing list, posts are ARCHIVED by a
myriad
of sites beyond our control. PLEASE UNSUBSCRIBE and DO NOT POST if you
DON'T LIKE THAT.


_______________________________________________
rsyslog mailing list
http://lists.adiscon.net/mailman/listinfo/rsyslog
http://www.rsyslog.com/professional-services/
What's up with rsyslog? Follow https://twitter.com/rgerhards
NOTE WELL: This is a PUBLIC mailing list, posts are ARCHIVED by a myriad
of sites beyond our control. PLEASE UNSUBSCRIBE and DO NOT POST if you
DON'T LIKE THAT.

_______________________________________________
rsyslog mailing list
http://lists.adiscon.net/mailman/listinfo/rsyslog
http://www.rsyslog.com/professional-services/
What's up with rsyslog? Follow https://twitter.com/rgerhards
NOTE WELL: This is a PUBLIC mailing list, posts are ARCHIVED by a
myriad of sites beyond our control. PLEASE UNSUBSCRIBE and DO NOT POST
if you DON'T LIKE THAT.



_______________________________________________
rsyslog mailing list
http://lists.adiscon.net/mailman/listinfo/rsyslog
http://www.rsyslog.com/professional-services/
What's up with rsyslog? Follow https://twitter.com/rgerhards
NOTE WELL: This is a PUBLIC mailing list, posts are ARCHIVED by a myriad of 
sites beyond our control. PLEASE UNSUBSCRIBE and DO NOT POST if you DON'T LIKE 
THAT.

Reply via email to