(ELB) and some kind of timeout (it only happens if I use ELB and there is
enough time between bursts of messages, e.g. 5 min triggers failure, 1
minute does not).
Now I am back to the original failure scenario (ELB and a burst of 100
messages every 5 minutes).
Here's output from lsof:
ubuntu@ip-10-158-97-169:~$ sudo lsof | grep rsyslogd | grep TCP
rsyslogd 9743 syslog 6u IPv4 7177271 0t0 TCP *:5140 (LISTEN)
rsyslogd 9743 syslog 7u IPv6 7177272 0t0 TCP *:5140 (LISTEN)
rsyslogd 9743 syslog 10u IPv4 7187191 0t0 TCP ip-10-158-97-169.ec2.internal:48176->ec2-50-19-250-187.compute-1.amazonaws.com:5140 (CLOSE_WAIT)
rsyslogd 9743 syslog 12u IPv4 7183726 0t0 TCP ip-10-158-97-169.ec2.internal:5140->216.133.125.82:16543 (ESTABLISHED)
where:
ip-10-158-97-169.ec2.internal is collector-test, the host that is sending messages
ec2-50-19-250-187.compute-1.amazonaws.com is the ELB
Strace shows some interesting stuff, even though I didn't catch the
beginning of the problem this time (running strace with the -f option, so
the first number on each line is the thread pid):
9842 03:08:44.391025 sendto(10, "204 syslog 380 <134>2013-12-06T02:55:01.747931+00:00 ip-10-158-97-169 logBurst[9908]: @cee:{\"message-json\":{\"now\":\"Fri Dec 6 02:55:01 2013\",\"i\":99},\"yummlyLogOrigin\":{\"supportLevel\":\"prod\",\"system\":\"LOGS\",\"cluster\":\"prod\",\"role\":\"collectorErik\",\"host\":\"ip-10-158-97-169\",\"tag\":\"logBurst[9908]:\",\"programname\":\"logBurst\",\"priority\":\"local0.info\",\"timestamp\":\"2013-12-06T02:55:01.747931+00:00\"}}\n\n", 396, 0, NULL, 0 <unfinished ...>
... then nothing related to fd 10 until:
9842 03:10:32.676803 <... sendto resumed> ) = -1 ETIMEDOUT (Connection timed out)
9842 03:10:32.676887 gettimeofday({1386299432, 676918}, NULL) = 0
9842 03:10:32.676994 setsockopt(10, SOL_SOCKET, SO_LINGER, {onoff=1, linger=0}, 8) = 0
9842 03:10:32.677101 close(10) = 0
So I assume that means the connection went bad; rsyslog figures it out
and closes the connection.
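For context, the SO_LINGER call seen in the trace, with {onoff=1, linger=0}, makes the subsequent close() do an abortive close: the kernel sends an RST and discards any unsent data instead of the normal FIN handshake. A minimal sketch at the plain-socket level (not rsyslog code):

```python
import socket
import struct

# Request an abortive close: SO_LINGER with onoff=1, linger=0 means
# close() sends RST and drops pending data rather than doing a FIN
# shutdown and lingering in TIME_WAIT.
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
linger = struct.pack("ii", 1, 0)  # onoff=1, linger=0 seconds
s.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER, linger)

# Read the option back to confirm the kernel accepted it.
onoff, seconds = struct.unpack(
    "ii", s.getsockopt(socket.SOL_SOCKET, socket.SO_LINGER, 8))
print(onoff, seconds)
s.close()
```

This matches what librelp appears to do before reconnecting: tear the dead connection down immediately rather than wait for a graceful shutdown that can never complete.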
Then rsyslog looks up the IP of the ELB and opens another connection:
9842 03:10:32.687983 socket(PF_INET, SOCK_STREAM, IPPROTO_TCP) = 10
9842 03:10:32.688088 connect(10, {sa_family=AF_INET, sin_port=htons(5140), sin_addr=inet_addr("50.19.250.187")}, 16) = 0
9842 03:10:32.689592 clock_gettime(CLOCK_REALTIME, {1386299432, 689625848}) = 0
9842 03:10:32.689694 sendto(10, "1 open 85 relp_version=0\nrelp_software=librelp,1.2.0,http://librelp.adiscon.com\ncommands=syslog\n", 96, 0, NULL, 0) = 96
9842 03:10:32.689812 recvfrom(10, 0x7ffc58b53490, 32768, 64, 0, 0) = -1 EAGAIN (Resource temporarily unavailable)
...
9842 03:11:32.214608 recvfrom(10, "", 32768, MSG_DONTWAIT, NULL, NULL) = 0
...
9750 03:12:16.994327 setsockopt(10, SOL_SOCKET, SO_LINGER, {onoff=1, linger=0}, 8 <unfinished ...>
9750 03:12:16.994443 close(10 <unfinished ...>
9750 03:12:16.994597 <... close resumed> ) = 0
So it seems like rsyslog is trying to open a new RELP session but is
getting ETIMEDOUT. This is very weird because all involved hosts (sender,
ELB, receiver) work fine at the same time: they continue to send and
receive data, the ELB thinks the receiver is healthy, etc. Everything
seems to be working fine except that rsyslog is not able to open a working
connection to the ELB (to the receiver via the ELB).
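As an aside on reading the trace: the "1 open 85 ..." payload is a RELP frame, laid out as transaction number, command, data length, data, and a trailing LF. A hypothetical helper that rebuilds such a frame, sketched purely from the framing visible in the strace output (not librelp code; frames with zero-length data omit the data part and are not handled here):

```python
def relp_frame(txnr: int, command: str, data: str) -> bytes:
    """Build a RELP frame: 'TXNR SP COMMAND SP DATALEN SP DATA LF'.
    Sketch for illustration; zero-data frames are not handled."""
    return f"{txnr} {command} {len(data)} {data}\n".encode()

# The 'open' frame from the trace above: 96 bytes on the wire,
# of which 85 bytes are the client's offer data.
offer = ("relp_version=0\n"
         "relp_software=librelp,1.2.0,http://librelp.adiscon.com\n"
         "commands=syslog")
frame = relp_frame(1, "open", offer)
print(len(frame))  # → 96
```

The explicit length prefix is what lets the receiver acknowledge each transaction, which is the whole point of RELP over plain TCP syslog here.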
Any ideas how to investigate this further? This seems to be a lot more
specific than what I had before (just mysterious silence), so hopefully it
makes more sense...
thanks!
erik
On 11/30/2013 02:12 AM, Rainer Gerhards wrote:
If the connection fails, rsyslog *immediately* tries to re-open, but
that seems to fail as well.
Btw: the git v7-devel and master branches now contain the changes I made.
Could you please try one of these versions?
Sent from phone, thus brief.
Am 30.11.2013 00:39 schrieb "Erik Steffl" <[email protected]>:
On 11/29/2013 01:48 AM, Rainer Gerhards wrote:
On Thu, Nov 28, 2013 at 2:07 AM, Erik Steffl <[email protected]>
wrote:
On 11/26/2013 07:06 AM, Pavel Levshin wrote:
26.11.2013 1:20, Erik Steffl:
what does the above mean exactly? It does make sense: in each case
the burst of messages gets in via /dev/log, then is sent out via RELP
(either to the same machine or a different one).
Any ideas why this doesn't fix itself until the next burst of
messages? Any suggestions on what to do or what to investigate next? I
guess I could run these with strace and see what exactly they say to
each other (both the sender and receiver).
I'm not sure why (and if) this is 'fixed' by the next burst of messages,
but it may be because the next portion of messages pushes the queue.
Perhaps it is unable to retry after suspend without a push.
not sure if 'fixed' is the right word, but it always gets unstuck right
after the next (sometimes the one after next) burst of messages, never at
a random time.
This behaviour is the same whether the period is 5 min or 15 min.
While I have been silent, I followed the ML discussion (I did not check
the debug log further, though, as Pavel did excellent work here). To me,
it looks like the "normal" suspension code is kicking in. If an action
fails, rsyslog retries once (unless otherwise configured) and if it fails
again, the action is initially suspended for 30 seconds. Then retries
happen, and the suspension period is prolonged if they fail.
It looks very much like this is the mechanism at work. However, what I
don't understand is why the suspension period is so long.
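The suspend/retry behaviour described above can be sketched roughly as follows. The doubling schedule and the cap are illustrative assumptions, not rsyslog's actual prolongation policy:

```python
def suspension_interval(consecutive_failures: int,
                        initial: int = 30, cap: int = 1800) -> int:
    """Seconds to wait before the next resume attempt, assuming the
    suspension period doubles each time a retry fails (illustrative
    schedule; rsyslog's exact algorithm may differ)."""
    return min(initial * (2 ** consecutive_failures), cap)

# First suspension is 30 s, then 60 s, 120 s, ... up to the cap.
print([suspension_interval(n) for n in range(4)])  # → [30, 60, 120, 240]
```

Under any schedule of this shape, a connection that goes bad right after a burst can sit suspended until well past the next burst, which would match the "unstuck only after the next burst" symptom.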
Out of all this, I think it would make much sense if rsyslog had the
capability to report when an action is suspended and when it is resumed.
I am right now adding this capability. I would suggest that when this
change is ready, you apply it and we can then see what it reports (much
easier than walking the debug log, and very obvious to users ;)).
as I mentioned, I tried two changes to the test scenario and tested
each of these separately:
- send a 200-message burst directly from collector-test to collector-prod
(RELP), i.e. no load balancer
- keep using the load balancer, but in addition to the 200-message burst
every 5 min also send a few (3) messages every minute
Each of these scenarios works, as in the traffic is smooth, NO silences.
This means that something goes wrong if we use the load balancer and there
is no traffic for 5 minutes.
From our previous investigation of the Amazon Elastic Load Balancer
(which is what we're using), it often lies about the connection, i.e. it
has no connection on the backend but happily accepts connects and data and
pretends everything is fine (that's why we initially switched from plain
TCP to RELP).
Not entirely sure what the load balancer is doing in this case, but it
seems that rsyslog thinks the connection is fine and keeps sending data to
the load balancer, while the connection is actually broken (maybe the load
balancer closed the connection between itself and the destination because
of some timeout).
So the situation is not fixed until rsyslog closes the connection and
opens a new one.
Is there any way for us to either make rsyslog keep the connection
alive or to re-open it sooner?
erik
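One general way to keep an idle TCP connection from being silently dropped by a load balancer's idle timeout is TCP keepalive with a probe interval shorter than that timeout. A sketch at the plain-socket level (the TCP_KEEP* option names are Linux-specific; whether rsyslog/librelp exposes equivalent knobs for RELP senders is exactly the open question here):

```python
import socket

def enable_keepalive(sock, idle=60, interval=30, count=3):
    """Send TCP keepalive probes after `idle` seconds of silence,
    then every `interval` seconds, declaring the peer dead after
    `count` unanswered probes. Linux-specific socket options."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, count)

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
enable_keepalive(s, idle=60)
print(s.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE))  # → 1
s.close()
```

With probes flowing every minute, the load balancer would never see the connection as idle across the 5-minute gap between bursts; the periodic 3-messages-per-minute workaround described earlier in the thread achieves the same effect at the application layer.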
_______________________________________________
rsyslog mailing list
http://lists.adiscon.net/mailman/listinfo/rsyslog
http://www.rsyslog.com/professional-services/
What's up with rsyslog? Follow https://twitter.com/rgerhards
NOTE WELL: This is a PUBLIC mailing list, posts are ARCHIVED by a myriad
of sites beyond our control. PLEASE UNSUBSCRIBE and DO NOT POST if you
DON'T LIKE THAT.