http://bugzilla.adiscon.com/show_bug.cgi?id=470
-- Dave On 8/13/13 11:31 PM, David Lang wrote:
The application level acknowledgement takes care of just about everything.The one remaining gap is that if the sender gets restarted while the recipent is down, the sender gets confused and can loose some messagesI don't know what performance RELP will provide, unfortunantly it does not allow overlapping sent messages, so you have to do a round trip to acknoledge after every message.David Lang On Tue, 13 Aug 2013, Erik Steffl wrote:thanks, that makes sense. BTW switching to RELP :) Just trying to understand different failure scenarios to make sure we don't miss anything thatr might be relevant even when using RELP.erik On 08/12/2013 07:16 PM, David Lang wrote:On Mon, 12 Aug 2013, Erik Steffl wrote:On 08/12/2013 03:53 PM, David Lang wrote:On Mon, 12 Aug 2013, Erik Steffl wrote:On 08/12/2013 01:21 PM, David Lang wrote:Eric, with your particular load balancer, if you telnet to the ip onport 514 does telnet fail with an error, or does it establish a connection that then gets close almost immediatly?it opens connection that is closed immediately (to Amazon elastic loadbalancer with no hosts behind it):Ok, I think what is happening is that rsyslog is opening the connection successfully, and is then sending the log message before it gets closed.When you use netcat to do your test, does it show the same behavior astelnet? (connect then close), or is it somehow failing the connect?in same scenario it behaves in similar way, it connects, sends one message (which appears to be sent ok) then closes connection (it does not retry), the difference is that connect return -1 and errno is EINPROGRESS (unlike rsyslog connects successfully).different configuration options, something for Rainer to look at when hegets back from vacation. Somehow detecting that things aren't in goodshape would be a good thing, but I suspect that what's happening is thatthere is some data being sent in the nc case (negotiation of some sort) that's not being sent in the rsyslog case that's how the problem is being detected.As far as I can tell it's the load balancer behaving badly, the connect and sendto/write should not succeed if load balancer has nowhere to send it (looking at Amazon console I see that load balancer has no hosts).yes and no. There are two ways that a load balancer can operate 1. the load balancer does not terminate the connection. It just redirects it via NAT (or MAC rewriting) to the appropriate destination. In this case, the load balancer must make it's decision before any data is sent at all. It can't terminate the SSL on the load balancer, and itcan't direct sessions based on data in the message (such as cookie values)In this case, if there is no active server, the sending system will not succeed in making a connection.2. the load balancer does terminate the connenction, then it makes a newconnection to the destination server. In this case, the load balancer has much more flexibility in it's decision, and it can terminate the SSL session on the load balancer.But when the load balancer acts like this, you get the behavior that youare seeing, where the connection is established even if there is no back-end system available. In theory the load balancer could stop listening on the VIP, but disconnecting and reconnecting like that is expensive, and you aren't supposed to be in a situation hwere you don't have any available servers, right? :-) I've seen this same behavior with several different brands of load balancers, it's not just the AWS ELB that does this, I see the same thing when you hit a radware or F5 load balancer that is in the mode that it terminates the connection. As far as the sendto/write failing, that isn't going to happen because the sendto/write doesn't actually involve the remote machine. All that happens at the sendto/write command is that the program hands the data to the kernel on the local machine for the kernel to send sometime later. The kernel will only return an error if there is no connection, or the queues are full on the sending system. In this case, the connection is made (syn, syn/ack, ack exchange), then when the sending machine first trys to send actual data, it gets an error and terminates the connection. But this isn't happening until after rsyslog has 'sent' the log. the same thing would happen if something happened to the recieving machine after the connection is established. If you think about this a bit. Suppose that you have a high speed connection across the country. Since your data can only travel at the speed of light, you can have a LOT of data in flight (64K was the old configured max, but with newer, high speed lines, it's common to end up with a few MB of data in flight). The sending software thinks the data has been delivered as soon as it hands it to the kernel to send, but when the recieving system dies (or a firewall cuts the connection), allthe data in flight will not be recieved, but the sending software has noway of knowing this fact. This is why TCP is not really reliable when something can happen to the connection or the machine at the far end.The way around this is to have the recieving application acknowlege thatit has the log message. This is _exactly_ what RELP does. And this is one of the reasons that it does this. David LangerikDavid Langerik@yummly-ubuntu-erik-gazelle:~$ telnet $elbHost 5140 Trying 54.243.148.203... Connected to ...elb host... Escape character is '^]'. Connection closed by foreign host.strace (part from opening to closing connection, the connection openssuccessfully)): socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 3 setsockopt(3, SOL_IP, IP_TOS, [16], 4) = 0 connect(3, {sa_family=AF_INET, sin_port=htons(5140), sin_addr=inet_addr("54.243.148.203")}, 16) = 0 open("/etc/telnetrc", O_RDONLY) = -1 ENOENT (No such file or directory) open("/home/erik/.telnetrc", O_RDONLY) = -1 ENOENT (No such file or directory) write(1, "Connected to erik-tcp-test-2132493349.us-east-1.elb.amazonaws.com.\n", 67) = 67 write(1, "Escape character is '^]'.\n", 26) = 26 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0 rt_sigaction(SIGINT, {0x407970, [INT], SA_RESTORER|SA_RESTART, 0x7f0b9e3fc0b0}, {SIG_DFL, [], 0}, 8) = 0 rt_sigaction(SIGQUIT, {0x407920, [QUIT], SA_RESTORER|SA_RESTART, 0x7f0b9e3fc0b0}, {SIG_DFL, [], 0}, 8) = 0 rt_sigaction(SIGWINCH, {0x407900, [WINCH], SA_RESTORER|SA_RESTART, 0x7f0b9e3fc0b0}, {SIG_DFL, [], 0}, 8) = 0 rt_sigaction(SIGTSTP, {0x40b6c0, [TSTP], SA_RESTORER|SA_RESTART, 0x7f0b9e3fc0b0}, {0x40b6c0, [TSTP], SA_RESTORER|SA_RESTART, 0x7f0b9e3fc0b0}, 8) = 0ioctl(0, SNDCTL_TMR_TIMEBASE or TCGETS, {B38400 opost isig icanon echo...}) = 0 ioctl(0, SNDCTL_TMR_STOP or TCSETSW, {B38400 opost isig icanon echo ...}) = 0ioctl(0, SNDCTL_TMR_TIMEBASE or TCGETS, {B38400 opost isig icanon echo...}) = 0 ioctl(0, FIONBIO, [1]) = 0 ioctl(1, FIONBIO, [1]) = 0 ioctl(3, FIONBIO, [1]) = 0 setsockopt(3, SOL_SOCKET, SO_OOBINLINE, [1], 4) = 0 select(4, [0 3], [], [3], {0, 0}) = 0 (Timeout) select(4, [0 3], [], [3], NULL) = 1 (in [3]) recvfrom(3, "", 8191, 0, NULL, NULL) = 0 rt_sigaction(SIGTSTP, {SIG_DFL, [TSTP], SA_RESTORER|SA_RESTART, 0x7f0b9e3fc0b0}, {0x40b6c0, [TSTP], SA_RESTORER|SA_RESTART, 0x7f0b9e3fc0b0}, 8) = 0ioctl(0, SNDCTL_TMR_TIMEBASE or TCGETS, {B38400 opost isig icanon echo...}) = 0 ioctl(0, SNDCTL_TMR_STOP or TCSETSW, {B38400 opost isig icanon echo ...}) = 0ioctl(0, SNDCTL_TMR_TIMEBASE or TCGETS, {B38400 opost isig icanon echo...}) = 0 ioctl(0, FIONBIO, [0]) = 0 ioctl(1, FIONBIO, [0]) = 0 close(3) = 0 erikDavid Lang On Fri, 9 Aug 2013, Erik Steffl wrote:Date: Fri, 09 Aug 2013 19:10:29 -0700 From: Erik Steffl <[email protected]> Reply-To: rsyslog-users <[email protected]> To: rsyslog-users <[email protected]> Subject: Re: [rsyslog] rsyslogd keeps sending data to closed connection On 07/26/2013 04:33 PM, David Lang wrote:On Fri, 26 Jul 2013, Erik Steffl wrote:While testing rsyslogd sending logs to a remote server I encountered this scenario: - remote server (which is behind amazon elastic load balancer) closes connection (the host is up but nobody listens on the port) - rsyslogd seems to be sending data to the closed connection - connection is in CLOSED_WAIT statestrace of rsyslogd reveals (replaced IP and content of message byXXX): [pid 14188] socket(PF_INET, SOCK_STREAM, IPPROTO_TCP) = 2 [pid 14188] connect(2, {sa_family=AF_INET, sin_port=htons(5140), sin_addr=inet_addr("XXX")}, 16) = 0 [pid 14188] recvfrom(2, 0x7f44d42e765f, 1, 66, 0, 0) = -1 EAGAIN (Resource temporarily unavailable)[pid 14188] sendto(2, "<135>2013-07-26T20:43:08+00:00 XXX", 679, 0,NULL, 0) = 679 And everybody thinks the connection is in CLOSE_WAIT state: $ sudo lsof | grep rsyslogd | grep TCP rsyslogd 14184 syslog 2u IPv4 188760 0t0 TCPip-10-2-35-151.ec2.internal:48228->ec2-54-225-181-82.compute-1.amazonaws.com:5140(CLOSE_WAIT) $ netstat -a | grep 5140 tcp 1 0 XXX:48229 ELB:5140 CLOSE_WAITELB in above is name/IP of Amazon elastic load balancer, it seemsthat it behaves slightly suspiciously (why does connect succeed?, why does sendto succeed?) Any ideas why rsyslogd does not close the connection that is inCLOSE_WAIT state? The connection remains in CLOSE_WAIT state evenafter the remote server starts listening on 5140 port. This does not happen everytime there is no listener on remote host but when it happens it doesn't seem to be fixed until I restart rsyslogd.rsyslog is closing the connection, but then it's opening the connection again for the next message. My guess is that the ELB is accepting the connection (which is why the connect succeeds), and then discovering that there's no way for it to send the connection to one of the servers that would handle the traffic, so it turns around and closes the connection. While it is closing the connection, rsyslog is sending data out throughthat connection, because it thinks it's open. This results in somedatabeing lost. but since rsyslog thinks some data got through (and itsuccessfully opened the connection), it doesn't mark that destination as being down, and keeps trying to send data there.If you were to use RELP, it would properly detect that the messagesare not being delivered and no messages would be lost. This is exactly the type of thing that triggered the creation of RELP.it seems that using RELP should work, will try that, investigating this I figured out that netcat does not suffer the sameproblem for some reason. Not saying there is a problem with rsyslog(seem load balancer problem), just wondering if anybody has any insights.I created a simple setup of client machine -> ELB (load balancer) ->server machine that works like this: client runs: perl -e '$| = 1 ; for($i=0;$i<100000;$i++) {print localtime() . " [$i]\n"; sleep(1);}' | nc $elb_hostname 5140 server runs: nc -k -l $ip_address 5140 In this setup client nc recognies all errors, strace example (when the server is down): socket(PF_INET, SOCK_STREAM, IPPROTO_TCP) = 3 fcntl(3, F_GETFL) = 0x2 (flags O_RDWR) fcntl(3, F_SETFL, O_RDWR|O_NONBLOCK) = 0 connect(3, {sa_family=AF_INET, sin_port=htons(5140),sin_addr=inet_addr("10.151.83.76")}, 16) = -1 EINPROGRESS (Operationnow in progress) select(4, NULL, [3], NULL, NULL) = 1 (out [3]) getsockopt(3, SOL_SOCKET, SO_ERROR, [111], [4]) = 0 fcntl(3, F_SETFL, O_RDWR) = 0 close(3) = 0On the other hand rsyslog is doing this (same scenario, sending datato load balancer that has nothing on the other side): 15218 socket(PF_INET, SOCK_STREAM, IPPROTO_TCP) = 6 15218 connect(6, {sa_family=AF_INET, sin_port=htons(5140), sin_addr=inet_addr("54.225.181.82")}, 16) = 015218 recvfrom(6, 0x7f009427065f, 1, 66, 0, 0) = -1 EAGAIN (Resourcetemporarily unavailable) 15218 sendto(6, "<135>2013-08-10T00:11:10+00:00 ...data removed...\n", 679, 0, NULL, 0) = 679 15218 gettimeofday({1376093470, 69447}, NULL) = 0 15218 gettimeofday({1376093470, 69515}, NULL) = 0 15218 futex(0xd80b94, FUTEX_WAIT_PRIVATE, 497273, NULL <unfinished ...> 15217 <... epoll_wait resumed> {{EPOLLIN, {u32=14173920, u64=14173920}}}, 10, 4294967295) = 1 15217 recvfrom(4, "<135>Aug 10 00:11:11 ...data removed... \n", 8096, 0, {sa_family=AF_INET, sin_port=htons(57437), sin_addr=inet_addr("127.0.0.1")}, [16]) = 399 15217 gettimeofday({1376093471, 560833}, NULL) = 0 15217 recvfrom(4, 0xd9e370, 8096, 0, 0x7f0094a714b0, 0x7f0094a6f364) = -1 EAGAIN (Resource temporarily unavailable) 15217 futex(0xd80b94, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0xd80b90, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1} <unfinished ...> 15218 <... futex resumed> ) = 0 15217 <... futex resumed> ) = 1 15218 futex(0xd80dc0, FUTEX_WAKE_PRIVATE, 1 <unfinished ...> 15217 epoll_wait(2, <unfinished ...> 15218 <... futex resumed> ) = 0 15218 gettimeofday({1376093471, 561248}, NULL) = 0 15218 gettimeofday({1376093471, 561367}, NULL) = 0 15218 recvfrom(6, "", 1, MSG_PEEK|MSG_DONTWAIT, NULL, NULL) = 0 15218 close(6) = 0 This repeats so efectively rsyslog keeps sending messages (sendtoreturns success), closing connection, opening connection etc. As far as I can tell load balancer should not successfully connect and evenreceive data... Any comments/insights? erik _______________________________________________ rsyslog mailing list http://lists.adiscon.net/mailman/listinfo/rsyslog http://www.rsyslog.com/professional-services/ What's up with rsyslog? Follow https://twitter.com/rgerhards NOTE WELL: This is a PUBLIC mailing list, posts are ARCHIVED by a myriad of sites beyond our control. PLEASE UNSUBSCRIBE and DO NOT POST if you DON'T LIKE THAT._______________________________________________ rsyslog mailing list http://lists.adiscon.net/mailman/listinfo/rsyslog http://www.rsyslog.com/professional-services/ What's up with rsyslog? Follow https://twitter.com/rgerhards NOTE WELL: This is a PUBLIC mailing list, posts are ARCHIVED by a myriadof sites beyond our control. PLEASE UNSUBSCRIBE and DO NOT POST if youDON'T LIKE THAT._______________________________________________ rsyslog mailing list http://lists.adiscon.net/mailman/listinfo/rsyslog http://www.rsyslog.com/professional-services/ What's up with rsyslog? Follow https://twitter.com/rgerhards NOTE WELL: This is a PUBLIC mailing list, posts are ARCHIVED by amyriad of sites beyond our control. PLEASE UNSUBSCRIBE and DO NOT POSTif you DON'T LIKE THAT._______________________________________________ rsyslog mailing list http://lists.adiscon.net/mailman/listinfo/rsyslog http://www.rsyslog.com/professional-services/ What's up with rsyslog? Follow https://twitter.com/rgerhardsNOTE WELL: This is a PUBLIC mailing list, posts are ARCHIVED by a myriad of sites beyond our control. PLEASE UNSUBSCRIBE and DO NOT POST if youDON'T LIKE THAT._______________________________________________ rsyslog mailing list http://lists.adiscon.net/mailman/listinfo/rsyslog http://www.rsyslog.com/professional-services/ What's up with rsyslog? Follow https://twitter.com/rgerhards NOTE WELL: This is a PUBLIC mailing list, posts are ARCHIVED by a myriad of sites beyond our control. PLEASE UNSUBSCRIBE and DO NOT POST if you DON'T LIKE THAT._______________________________________________ rsyslog mailing list http://lists.adiscon.net/mailman/listinfo/rsyslog http://www.rsyslog.com/professional-services/ What's up with rsyslog? Follow https://twitter.com/rgerhardsNOTE WELL: This is a PUBLIC mailing list, posts are ARCHIVED by a myriadof sites beyond our control. PLEASE UNSUBSCRIBE and DO NOT POST if you DON'T LIKE THAT._______________________________________________ rsyslog mailing list http://lists.adiscon.net/mailman/listinfo/rsyslog http://www.rsyslog.com/professional-services/ What's up with rsyslog? Follow https://twitter.com/rgerhardsNOTE WELL: This is a PUBLIC mailing list, posts are ARCHIVED by a myriad of sites beyond our control. PLEASE UNSUBSCRIBE and DO NOT POST if you DON'T LIKE THAT._______________________________________________ rsyslog mailing list http://lists.adiscon.net/mailman/listinfo/rsyslog http://www.rsyslog.com/professional-services/ What's up with rsyslog? Follow https://twitter.com/rgerhardsNOTE WELL: This is a PUBLIC mailing list, posts are ARCHIVED by a myriad of sites beyond our control. PLEASE UNSUBSCRIBE and DO NOT POST if you DON'T LIKE THAT.
_______________________________________________ rsyslog mailing list http://lists.adiscon.net/mailman/listinfo/rsyslog http://www.rsyslog.com/professional-services/ What's up with rsyslog? Follow https://twitter.com/rgerhards NOTE WELL: This is a PUBLIC mailing list, posts are ARCHIVED by a myriad of sites beyond our control. PLEASE UNSUBSCRIBE and DO NOT POST if you DON'T LIKE THAT.

