Re: [rsyslog] rsyslogd keeps sending data to closed connection

Rainer Gerhards Tue, 13 Aug 2013 05:55:50 -0700

I tried to have a quick lock, but that requires more time. I thik rsyslog
uses nonblocking mode vs. nc not. Maybe that makes the difference. Rest
probably needs to wait until next week.


Rainer


On Tue, Aug 13, 2013 at 4:16 AM, David Lang <[email protected]> wrote:

> On Mon, 12 Aug 2013, Erik Steffl wrote:
>
>  On 08/12/2013 03:53 PM, David Lang wrote:
>>
>>> On Mon, 12 Aug 2013, Erik Steffl wrote:
>>>
>>>  On 08/12/2013 01:21 PM, David Lang wrote:
>>>>
>>>>> Eric, with your particular load balancer, if you telnet to the ip on
>>>>> port 514 does telnet fail with an error, or does it establish a
>>>>> connection that then gets close almost immediatly?
>>>>>
>>>>
>>>> it opens connection that is closed immediately (to Amazon elastic load
>>>> balancer with no hosts behind it):
>>>>
>>>
>>> Ok, I think what is happening is that rsyslog is opening the connection
>>> successfully, and is then sending the log message before it gets closed.
>>>
>>> When you use netcat to do your test, does it show the same behavior as
>>> telnet? (connect then close), or is it somehow failing the connect?
>>>
>>
>>  in same scenario it behaves in similar way, it connects, sends one
>> message (which appears to be sent ok) then closes connection (it does not
>> retry), the difference is that connect return -1 and errno is EINPROGRESS
>> (unlike rsyslog connects successfully).
>>
>
> different configuration options, something for Rainer to look at when he
> gets back from vacation. Somehow detecting that things aren't in good shape
> would be a good thing, but I suspect that what's happening is that there is
> some data being sent in the nc case (negotiation of some sort) that's not
> being sent in the rsyslog case that's how the problem is being detected.
>
>
>   As far as I can tell it's the load balancer behaving badly, the connect
>> and sendto/write should not succeed if load balancer has nowhere to send it
>> (looking at Amazon console I see that load balancer has no hosts).
>>
>
> yes and no.
>
> There are two ways that a load balancer can operate
>
> 1. the load balancer does not terminate the connection. It just redirects
> it via NAT (or MAC rewriting) to the appropriate destination.
>
> In this case, the load balancer must make it's decision before any data is
> sent at all. It can't terminate the SSL on the load balancer, and it can't
> direct sessions based on data in the message (such as cookie values)
>
> In this case, if there is no active server, the sending system will not
> succeed in making a connection.
>
>
> 2. the load balancer does terminate the connenction, then it makes a new
> connection to the destination server.
>
> In this case, the load balancer has much more flexibility in it's
> decision, and it can terminate the SSL session on the load balancer.
>
> But when the load balancer acts like this, you get the behavior that you
> are seeing, where the connection is established even if there is no
> back-end system available. In theory the load balancer could stop listening
> on the VIP, but disconnecting and reconnecting like that is expensive, and
> you aren't supposed to be in a situation hwere you don't have any available
> servers, right? :-)
>
> I've seen this same behavior with several different brands of load
> balancers, it's not just the AWS ELB that does this, I see the same thing
> when you hit a radware or F5 load balancer that is in the mode that it
> terminates the connection.
>
>
>
> As far as the sendto/write failing, that isn't going to happen because the
> sendto/write doesn't actually involve the remote machine. All that happens
> at the sendto/write command is that the program hands the data to the
> kernel on the local machine for the kernel to send sometime later. The
> kernel will only return an error if there is no connection, or the queues
> are full on the sending system.
>
> In this case, the connection is made (syn, syn/ack, ack exchange), then
> when the sending machine first trys to send actual data, it gets an error
> and terminates the connection. But this isn't happening until after rsyslog
> has 'sent' the log.
>
> the same thing would happen if something happened to the recieving machine
> after the connection is established.
>
> If you think about this a bit. Suppose that you have a high speed
> connection across the country. Since your data can only travel at the speed
> of light, you can have a LOT of data in flight (64K was the old configured
> max, but with newer, high speed lines, it's common to end up with a few MB
> of data in flight). The sending software thinks the data has been delivered
> as soon as it hands it to the kernel to send, but when the recieving system
> dies (or a firewall cuts the connection), all the data in flight will not
> be recieved, but the sending software has no way of knowing this fact.
>
> This is why TCP is not really reliable when something can happen to the
> connection or the machine at the far end.
>
> The way around this is to have the recieving application acknowlege that
> it has the log message. This is _exactly_ what RELP does. And this is one
> of the reasons that it does this.
>
> David Lang
>
>
>          erik
>>
>>
>>> David Lang
>>>
>>>  erik@yummly-ubuntu-erik-**gazelle:~$ telnet $elbHost 5140
>>>> Trying 54.243.148.203...
>>>> Connected to ...elb host...
>>>> Escape character is '^]'.
>>>> Connection closed by foreign host.
>>>>
>>>> strace (part from opening to closing connection, the connection opens
>>>> successfully)):
>>>>
>>>> socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 3
>>>> setsockopt(3, SOL_IP, IP_TOS, [16], 4)  = 0
>>>> connect(3, {sa_family=AF_INET, sin_port=htons(5140),
>>>> sin_addr=inet_addr("54.243.**148.203")}, 16) = 0
>>>> open("/etc/telnetrc", O_RDONLY)         = -1 ENOENT (No such file or
>>>> directory)
>>>> open("/home/erik/.telnetrc", O_RDONLY)  = -1 ENOENT (No such file or
>>>> directory)
>>>> write(1, "Connected to
>>>> erik-tcp-test-2132493349.us-**east-1.elb.amazonaws.com<http://erik-tcp-test-2132493349.us-east-1.elb.amazonaws.com>.\n",
>>>> 67) = 67
>>>> write(1, "Escape character is '^]'.\n", 26) = 26
>>>> rt_sigprocmask(SIG_BLOCK, NULL, [], 8)  = 0
>>>> rt_sigaction(SIGINT, {0x407970, [INT], SA_RESTORER|SA_RESTART,
>>>> 0x7f0b9e3fc0b0}, {SIG_DFL, [], 0}, 8) = 0
>>>> rt_sigaction(SIGQUIT, {0x407920, [QUIT], SA_RESTORER|SA_RESTART,
>>>> 0x7f0b9e3fc0b0}, {SIG_DFL, [], 0}, 8) = 0
>>>> rt_sigaction(SIGWINCH, {0x407900, [WINCH], SA_RESTORER|SA_RESTART,
>>>> 0x7f0b9e3fc0b0}, {SIG_DFL, [], 0}, 8) = 0
>>>> rt_sigaction(SIGTSTP, {0x40b6c0, [TSTP], SA_RESTORER|SA_RESTART,
>>>> 0x7f0b9e3fc0b0}, {0x40b6c0, [TSTP], SA_RESTORER|SA_RESTART,
>>>> 0x7f0b9e3fc0b0}, 8) = 0
>>>> ioctl(0, SNDCTL_TMR_TIMEBASE or TCGETS, {B38400 opost isig icanon echo
>>>> ...}) = 0
>>>> ioctl(0, SNDCTL_TMR_STOP or TCSETSW, {B38400 opost isig icanon echo
>>>> ...}) = 0
>>>> ioctl(0, SNDCTL_TMR_TIMEBASE or TCGETS, {B38400 opost isig icanon echo
>>>> ...}) = 0
>>>> ioctl(0, FIONBIO, [1])                  = 0
>>>> ioctl(1, FIONBIO, [1])                  = 0
>>>> ioctl(3, FIONBIO, [1])                  = 0
>>>> setsockopt(3, SOL_SOCKET, SO_OOBINLINE, [1], 4) = 0
>>>> select(4, [0 3], [], [3], {0, 0})       = 0 (Timeout)
>>>> select(4, [0 3], [], [3], NULL)         = 1 (in [3])
>>>> recvfrom(3, "", 8191, 0, NULL, NULL)    = 0
>>>> rt_sigaction(SIGTSTP, {SIG_DFL, [TSTP], SA_RESTORER|SA_RESTART,
>>>> 0x7f0b9e3fc0b0}, {0x40b6c0, [TSTP], SA_RESTORER|SA_RESTART,
>>>> 0x7f0b9e3fc0b0}, 8) = 0
>>>> ioctl(0, SNDCTL_TMR_TIMEBASE or TCGETS, {B38400 opost isig icanon echo
>>>> ...}) = 0
>>>> ioctl(0, SNDCTL_TMR_STOP or TCSETSW, {B38400 opost isig icanon echo
>>>> ...}) = 0
>>>> ioctl(0, SNDCTL_TMR_TIMEBASE or TCGETS, {B38400 opost isig icanon echo
>>>> ...}) = 0
>>>> ioctl(0, FIONBIO, [0])                  = 0
>>>> ioctl(1, FIONBIO, [0])                  = 0
>>>> close(3)                                = 0
>>>>
>>>>     erik
>>>>
>>>>  David Lang
>>>>>
>>>>> On Fri, 9 Aug 2013, Erik Steffl wrote:
>>>>>
>>>>>  Date: Fri, 09 Aug 2013 19:10:29 -0700
>>>>>> From: Erik Steffl <[email protected]>
>>>>>> Reply-To: rsyslog-users <[email protected]>
>>>>>> To: rsyslog-users <[email protected]>
>>>>>> Subject: Re: [rsyslog] rsyslogd keeps sending data to closed
>>>>>> connection
>>>>>>
>>>>>> On 07/26/2013 04:33 PM, David Lang wrote:
>>>>>>
>>>>>>> On Fri, 26 Jul 2013, Erik Steffl wrote:
>>>>>>>
>>>>>>>   While testing rsyslogd sending logs to a remote server I
>>>>>>>> encountered
>>>>>>>> this scenario:
>>>>>>>>
>>>>>>>>  - remote server (which is behind amazon elastic load balancer)
>>>>>>>> closes
>>>>>>>> connection (the host is up but nobody listens on the port)
>>>>>>>>
>>>>>>>>  - rsyslogd seems to be sending data to the closed connection
>>>>>>>>
>>>>>>>>  - connection is in CLOSED_WAIT state
>>>>>>>>
>>>>>>>>  strace of rsyslogd reveals (replaced IP and content of message by
>>>>>>>> XXX):
>>>>>>>>
>>>>>>>> [pid 14188] socket(PF_INET, SOCK_STREAM, IPPROTO_TCP) = 2
>>>>>>>> [pid 14188] connect(2, {sa_family=AF_INET, sin_port=htons(5140),
>>>>>>>> sin_addr=inet_addr("XXX")}, 16) = 0
>>>>>>>> [pid 14188] recvfrom(2, 0x7f44d42e765f, 1, 66, 0, 0) = -1 EAGAIN
>>>>>>>> (Resource temporarily unavailable)
>>>>>>>> [pid 14188] sendto(2, "<135>2013-07-26T20:43:08+00:**00 XXX", 679,
>>>>>>>> 0,
>>>>>>>> NULL, 0) = 679
>>>>>>>>
>>>>>>>>  And everybody thinks the connection is in CLOSE_WAIT state:
>>>>>>>>
>>>>>>>> $ sudo lsof | grep rsyslogd | grep TCP
>>>>>>>> rsyslogd  14184           syslog    2u     IPv4             188760
>>>>>>>> 0t0
>>>>>>>> TCP
>>>>>>>> ip-10-2-35-151.ec2.internal:**48228->ec2-54-225-181-82.**
>>>>>>>> compute-1.amazonaws.com:5140<http://ec2-54-225-181-82.compute-1.amazonaws.com:5140>
>>>>>>>>
>>>>>>>>
>>>>>>>> (CLOSE_WAIT)
>>>>>>>>
>>>>>>>> $ netstat -a | grep 5140
>>>>>>>> tcp        1      0 XXX:48229 ELB:5140 CLOSE_WAIT
>>>>>>>>
>>>>>>>>  ELB in above is name/IP of Amazon elastic load balancer, it seems
>>>>>>>> that it behaves slightly suspiciously (why does connect succeed?,
>>>>>>>> why
>>>>>>>> does sendto succeed?)
>>>>>>>>
>>>>>>>>  Any ideas why rsyslogd does not close the connection that is in
>>>>>>>> CLOSE_WAIT state? The connection remains in CLOSE_WAIT state even
>>>>>>>> after the remote server starts listening on 5140 port. This does not
>>>>>>>> happen everytime there is no listener on remote host but when it
>>>>>>>> happens it doesn't seem to be fixed until I restart rsyslogd.
>>>>>>>>
>>>>>>>
>>>>>>> rsyslog is closing the connection, but then it's opening the
>>>>>>> connection
>>>>>>> again for the next message.
>>>>>>>
>>>>>>> My guess is that the ELB is accepting the connection (which is why
>>>>>>> the
>>>>>>> connect succeeds), and then discovering that there's no way for it to
>>>>>>> send the connection to one of the servers that would handle the
>>>>>>> traffic,
>>>>>>> so it turns around and closes the connection.
>>>>>>>
>>>>>>> While it is closing the connection, rsyslog is sending data out
>>>>>>> through
>>>>>>> that connection, because it thinks it's open. This results in some
>>>>>>> data
>>>>>>> being lost. but since rsyslog thinks some data got through (and it
>>>>>>> successfully opened the connection), it doesn't mark that
>>>>>>> destination as
>>>>>>> being down, and keeps trying to send data there.
>>>>>>>
>>>>>>> If you were to use RELP, it would properly detect that the messages
>>>>>>> are
>>>>>>> not being delivered and no messages would be lost. This is exactly
>>>>>>> the
>>>>>>> type of thing that triggered the creation of RELP.
>>>>>>>
>>>>>>
>>>>>>  it seems that using RELP should work, will try that,
>>>>>>
>>>>>>  investigating this I figured out that netcat does not suffer the same
>>>>>> problem for some reason. Not saying there is a problem with rsyslog
>>>>>> (seem load balancer problem), just wondering if anybody has any
>>>>>> insights.
>>>>>>
>>>>>>  I created a simple setup of client machine -> ELB (load balancer) ->
>>>>>> server machine that works like this:
>>>>>>
>>>>>> client runs:
>>>>>>
>>>>>> perl -e '$| = 1 ; for($i=0;$i<100000;$i++) {print localtime() . "
>>>>>> [$i]\n"; sleep(1);}' | nc $elb_hostname 5140
>>>>>>
>>>>>> server runs:
>>>>>>
>>>>>> nc -k -l $ip_address 5140
>>>>>>
>>>>>> In this setup client nc recognies all errors, strace example (when the
>>>>>> server is down):
>>>>>>
>>>>>> socket(PF_INET, SOCK_STREAM, IPPROTO_TCP) = 3
>>>>>> fcntl(3, F_GETFL)                       = 0x2 (flags O_RDWR)
>>>>>> fcntl(3, F_SETFL, O_RDWR|O_NONBLOCK)    = 0
>>>>>> connect(3, {sa_family=AF_INET, sin_port=htons(5140),
>>>>>> sin_addr=inet_addr("10.151.83.**76")}, 16) = -1 EINPROGRESS
>>>>>> (Operation
>>>>>> now in progress)
>>>>>> select(4, NULL, [3], NULL, NULL)        = 1 (out [3])
>>>>>> getsockopt(3, SOL_SOCKET, SO_ERROR, [111], [4]) = 0
>>>>>> fcntl(3, F_SETFL, O_RDWR)               = 0
>>>>>> close(3)                                = 0
>>>>>>
>>>>>> On the other hand rsyslog is doing this (same scenario, sending data
>>>>>> to load balancer that has nothing on the other side):
>>>>>>
>>>>>> 15218 socket(PF_INET, SOCK_STREAM, IPPROTO_TCP) = 6
>>>>>> 15218 connect(6, {sa_family=AF_INET, sin_port=htons(5140),
>>>>>> sin_addr=inet_addr("54.225.**181.82")}, 16) = 0
>>>>>> 15218 recvfrom(6, 0x7f009427065f, 1, 66, 0, 0) = -1 EAGAIN (Resource
>>>>>> temporarily unavailable)
>>>>>> 15218 sendto(6, "<135>2013-08-10T00:11:10+00:**00 ...data
>>>>>> removed...\n",
>>>>>> 679, 0, NULL, 0) = 679
>>>>>> 15218 gettimeofday({1376093470, 69447}, NULL) = 0
>>>>>> 15218 gettimeofday({1376093470, 69515}, NULL) = 0
>>>>>> 15218 futex(0xd80b94, FUTEX_WAIT_PRIVATE, 497273, NULL <unfinished
>>>>>> ...>
>>>>>> 15217 <... epoll_wait resumed> {{EPOLLIN, {u32=14173920,
>>>>>> u64=14173920}}}, 10, 4294967295) = 1
>>>>>> 15217 recvfrom(4, "<135>Aug 10 00:11:11  ...data removed... \n", 8096,
>>>>>> 0, {sa_family=AF_INET, sin_port=htons(57437),
>>>>>> sin_addr=inet_addr("127.0.0.1"**)}, [16]) = 399
>>>>>> 15217 gettimeofday({1376093471, 560833}, NULL) = 0
>>>>>> 15217 recvfrom(4, 0xd9e370, 8096, 0, 0x7f0094a714b0, 0x7f0094a6f364) =
>>>>>> -1 EAGAIN (Resource temporarily unavailable)
>>>>>> 15217 futex(0xd80b94, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0xd80b90,
>>>>>> {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1} <unfinished ...>
>>>>>> 15218 <... futex resumed> )             = 0
>>>>>> 15217 <... futex resumed> )             = 1
>>>>>> 15218 futex(0xd80dc0, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
>>>>>> 15217 epoll_wait(2,  <unfinished ...>
>>>>>> 15218 <... futex resumed> )             = 0
>>>>>> 15218 gettimeofday({1376093471, 561248}, NULL) = 0
>>>>>> 15218 gettimeofday({1376093471, 561367}, NULL) = 0
>>>>>> 15218 recvfrom(6, "", 1, MSG_PEEK|MSG_DONTWAIT, NULL, NULL) = 0
>>>>>> 15218 close(6)                          = 0
>>>>>>
>>>>>>  This repeats so efectively rsyslog keeps sending messages (sendto
>>>>>> returns success), closing connection, opening connection etc. As far
>>>>>> as I can tell load balancer should not successfully connect and even
>>>>>> receive data...
>>>>>>
>>>>>>  Any comments/insights?
>>>>>>
>>>>>>     erik
>>>>>> ______________________________**_________________
>>>>>> rsyslog mailing list
>>>>>> http://lists.adiscon.net/**mailman/listinfo/rsyslog<http://lists.adiscon.net/mailman/listinfo/rsyslog>
>>>>>> http://www.rsyslog.com/**professional-services/<http://www.rsyslog.com/professional-services/>
>>>>>> What's up with rsyslog? Follow https://twitter.com/rgerhards
>>>>>> NOTE WELL: This is a PUBLIC mailing list, posts are ARCHIVED by a
>>>>>> myriad of sites beyond our control. PLEASE UNSUBSCRIBE and DO NOT POST
>>>>>> if you DON'T LIKE THAT.
>>>>>>
>>>>>>  ______________________________**_________________
>>>>> rsyslog mailing list
>>>>> http://lists.adiscon.net/**mailman/listinfo/rsyslog<http://lists.adiscon.net/mailman/listinfo/rsyslog>
>>>>> http://www.rsyslog.com/**professional-services/<http://www.rsyslog.com/professional-services/>
>>>>> What's up with rsyslog? Follow https://twitter.com/rgerhards
>>>>> NOTE WELL: This is a PUBLIC mailing list, posts are ARCHIVED by a
>>>>> myriad
>>>>> of sites beyond our control. PLEASE UNSUBSCRIBE and DO NOT POST if you
>>>>> DON'T LIKE THAT.
>>>>>
>>>>
>>>> ______________________________**_________________
>>>> rsyslog mailing list
>>>> http://lists.adiscon.net/**mailman/listinfo/rsyslog<http://lists.adiscon.net/mailman/listinfo/rsyslog>
>>>> http://www.rsyslog.com/**professional-services/<http://www.rsyslog.com/professional-services/>
>>>> What's up with rsyslog? Follow https://twitter.com/rgerhards
>>>> NOTE WELL: This is a PUBLIC mailing list, posts are ARCHIVED by a
>>>> myriad of sites beyond our control. PLEASE UNSUBSCRIBE and DO NOT POST
>>>> if you DON'T LIKE THAT.
>>>>
>>>>  ______________________________**_________________
>>> rsyslog mailing list
>>> http://lists.adiscon.net/**mailman/listinfo/rsyslog<http://lists.adiscon.net/mailman/listinfo/rsyslog>
>>> http://www.rsyslog.com/**professional-services/<http://www.rsyslog.com/professional-services/>
>>> What's up with rsyslog? Follow https://twitter.com/rgerhards
>>> NOTE WELL: This is a PUBLIC mailing list, posts are ARCHIVED by a myriad
>>> of sites beyond our control. PLEASE UNSUBSCRIBE and DO NOT POST if you
>>> DON'T LIKE THAT.
>>>
>>
>> ______________________________**_________________
>> rsyslog mailing list
>> http://lists.adiscon.net/**mailman/listinfo/rsyslog<http://lists.adiscon.net/mailman/listinfo/rsyslog>
>> http://www.rsyslog.com/**professional-services/<http://www.rsyslog.com/professional-services/>
>> What's up with rsyslog? Follow https://twitter.com/rgerhards
>> NOTE WELL: This is a PUBLIC mailing list, posts are ARCHIVED by a myriad
>> of sites beyond our control. PLEASE UNSUBSCRIBE and DO NOT POST if you
>> DON'T LIKE THAT.
>>
>>  ______________________________**_________________
> rsyslog mailing list
> http://lists.adiscon.net/**mailman/listinfo/rsyslog<http://lists.adiscon.net/mailman/listinfo/rsyslog>
> http://www.rsyslog.com/**professional-services/<http://www.rsyslog.com/professional-services/>
> What's up with rsyslog? Follow https://twitter.com/rgerhards
> NOTE WELL: This is a PUBLIC mailing list, posts are ARCHIVED by a myriad
> of sites beyond our control. PLEASE UNSUBSCRIBE and DO NOT POST if you
> DON'T LIKE THAT.
>
_______________________________________________
rsyslog mailing list
http://lists.adiscon.net/mailman/listinfo/rsyslog
http://www.rsyslog.com/professional-services/
What's up with rsyslog? Follow https://twitter.com/rgerhards
NOTE WELL: This is a PUBLIC mailing list, posts are ARCHIVED by a myriad of 
sites beyond our control. PLEASE UNSUBSCRIBE and DO NOT POST if you DON'T LIKE 
THAT.

Re: [rsyslog] rsyslogd keeps sending data to closed connection

Reply via email to