I tried to have a quick lock, but that requires more time. I thik rsyslog uses nonblocking mode vs. nc not. Maybe that makes the difference. Rest probably needs to wait until next week.
Rainer On Tue, Aug 13, 2013 at 4:16 AM, David Lang <[email protected]> wrote: > On Mon, 12 Aug 2013, Erik Steffl wrote: > > On 08/12/2013 03:53 PM, David Lang wrote: >> >>> On Mon, 12 Aug 2013, Erik Steffl wrote: >>> >>> On 08/12/2013 01:21 PM, David Lang wrote: >>>> >>>>> Eric, with your particular load balancer, if you telnet to the ip on >>>>> port 514 does telnet fail with an error, or does it establish a >>>>> connection that then gets close almost immediatly? >>>>> >>>> >>>> it opens connection that is closed immediately (to Amazon elastic load >>>> balancer with no hosts behind it): >>>> >>> >>> Ok, I think what is happening is that rsyslog is opening the connection >>> successfully, and is then sending the log message before it gets closed. >>> >>> When you use netcat to do your test, does it show the same behavior as >>> telnet? (connect then close), or is it somehow failing the connect? >>> >> >> in same scenario it behaves in similar way, it connects, sends one >> message (which appears to be sent ok) then closes connection (it does not >> retry), the difference is that connect return -1 and errno is EINPROGRESS >> (unlike rsyslog connects successfully). >> > > different configuration options, something for Rainer to look at when he > gets back from vacation. Somehow detecting that things aren't in good shape > would be a good thing, but I suspect that what's happening is that there is > some data being sent in the nc case (negotiation of some sort) that's not > being sent in the rsyslog case that's how the problem is being detected. > > > As far as I can tell it's the load balancer behaving badly, the connect >> and sendto/write should not succeed if load balancer has nowhere to send it >> (looking at Amazon console I see that load balancer has no hosts). >> > > yes and no. > > There are two ways that a load balancer can operate > > 1. the load balancer does not terminate the connection. It just redirects > it via NAT (or MAC rewriting) to the appropriate destination. > > In this case, the load balancer must make it's decision before any data is > sent at all. It can't terminate the SSL on the load balancer, and it can't > direct sessions based on data in the message (such as cookie values) > > In this case, if there is no active server, the sending system will not > succeed in making a connection. > > > 2. the load balancer does terminate the connenction, then it makes a new > connection to the destination server. > > In this case, the load balancer has much more flexibility in it's > decision, and it can terminate the SSL session on the load balancer. > > But when the load balancer acts like this, you get the behavior that you > are seeing, where the connection is established even if there is no > back-end system available. In theory the load balancer could stop listening > on the VIP, but disconnecting and reconnecting like that is expensive, and > you aren't supposed to be in a situation hwere you don't have any available > servers, right? :-) > > I've seen this same behavior with several different brands of load > balancers, it's not just the AWS ELB that does this, I see the same thing > when you hit a radware or F5 load balancer that is in the mode that it > terminates the connection. > > > > As far as the sendto/write failing, that isn't going to happen because the > sendto/write doesn't actually involve the remote machine. All that happens > at the sendto/write command is that the program hands the data to the > kernel on the local machine for the kernel to send sometime later. The > kernel will only return an error if there is no connection, or the queues > are full on the sending system. > > In this case, the connection is made (syn, syn/ack, ack exchange), then > when the sending machine first trys to send actual data, it gets an error > and terminates the connection. But this isn't happening until after rsyslog > has 'sent' the log. > > the same thing would happen if something happened to the recieving machine > after the connection is established. > > If you think about this a bit. Suppose that you have a high speed > connection across the country. Since your data can only travel at the speed > of light, you can have a LOT of data in flight (64K was the old configured > max, but with newer, high speed lines, it's common to end up with a few MB > of data in flight). The sending software thinks the data has been delivered > as soon as it hands it to the kernel to send, but when the recieving system > dies (or a firewall cuts the connection), all the data in flight will not > be recieved, but the sending software has no way of knowing this fact. > > This is why TCP is not really reliable when something can happen to the > connection or the machine at the far end. > > The way around this is to have the recieving application acknowlege that > it has the log message. This is _exactly_ what RELP does. And this is one > of the reasons that it does this. > > David Lang > > > erik >> >> >>> David Lang >>> >>> erik@yummly-ubuntu-erik-**gazelle:~$ telnet $elbHost 5140 >>>> Trying 54.243.148.203... >>>> Connected to ...elb host... >>>> Escape character is '^]'. >>>> Connection closed by foreign host. >>>> >>>> strace (part from opening to closing connection, the connection opens >>>> successfully)): >>>> >>>> socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 3 >>>> setsockopt(3, SOL_IP, IP_TOS, [16], 4) = 0 >>>> connect(3, {sa_family=AF_INET, sin_port=htons(5140), >>>> sin_addr=inet_addr("54.243.**148.203")}, 16) = 0 >>>> open("/etc/telnetrc", O_RDONLY) = -1 ENOENT (No such file or >>>> directory) >>>> open("/home/erik/.telnetrc", O_RDONLY) = -1 ENOENT (No such file or >>>> directory) >>>> write(1, "Connected to >>>> erik-tcp-test-2132493349.us-**east-1.elb.amazonaws.com<http://erik-tcp-test-2132493349.us-east-1.elb.amazonaws.com>.\n", >>>> 67) = 67 >>>> write(1, "Escape character is '^]'.\n", 26) = 26 >>>> rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0 >>>> rt_sigaction(SIGINT, {0x407970, [INT], SA_RESTORER|SA_RESTART, >>>> 0x7f0b9e3fc0b0}, {SIG_DFL, [], 0}, 8) = 0 >>>> rt_sigaction(SIGQUIT, {0x407920, [QUIT], SA_RESTORER|SA_RESTART, >>>> 0x7f0b9e3fc0b0}, {SIG_DFL, [], 0}, 8) = 0 >>>> rt_sigaction(SIGWINCH, {0x407900, [WINCH], SA_RESTORER|SA_RESTART, >>>> 0x7f0b9e3fc0b0}, {SIG_DFL, [], 0}, 8) = 0 >>>> rt_sigaction(SIGTSTP, {0x40b6c0, [TSTP], SA_RESTORER|SA_RESTART, >>>> 0x7f0b9e3fc0b0}, {0x40b6c0, [TSTP], SA_RESTORER|SA_RESTART, >>>> 0x7f0b9e3fc0b0}, 8) = 0 >>>> ioctl(0, SNDCTL_TMR_TIMEBASE or TCGETS, {B38400 opost isig icanon echo >>>> ...}) = 0 >>>> ioctl(0, SNDCTL_TMR_STOP or TCSETSW, {B38400 opost isig icanon echo >>>> ...}) = 0 >>>> ioctl(0, SNDCTL_TMR_TIMEBASE or TCGETS, {B38400 opost isig icanon echo >>>> ...}) = 0 >>>> ioctl(0, FIONBIO, [1]) = 0 >>>> ioctl(1, FIONBIO, [1]) = 0 >>>> ioctl(3, FIONBIO, [1]) = 0 >>>> setsockopt(3, SOL_SOCKET, SO_OOBINLINE, [1], 4) = 0 >>>> select(4, [0 3], [], [3], {0, 0}) = 0 (Timeout) >>>> select(4, [0 3], [], [3], NULL) = 1 (in [3]) >>>> recvfrom(3, "", 8191, 0, NULL, NULL) = 0 >>>> rt_sigaction(SIGTSTP, {SIG_DFL, [TSTP], SA_RESTORER|SA_RESTART, >>>> 0x7f0b9e3fc0b0}, {0x40b6c0, [TSTP], SA_RESTORER|SA_RESTART, >>>> 0x7f0b9e3fc0b0}, 8) = 0 >>>> ioctl(0, SNDCTL_TMR_TIMEBASE or TCGETS, {B38400 opost isig icanon echo >>>> ...}) = 0 >>>> ioctl(0, SNDCTL_TMR_STOP or TCSETSW, {B38400 opost isig icanon echo >>>> ...}) = 0 >>>> ioctl(0, SNDCTL_TMR_TIMEBASE or TCGETS, {B38400 opost isig icanon echo >>>> ...}) = 0 >>>> ioctl(0, FIONBIO, [0]) = 0 >>>> ioctl(1, FIONBIO, [0]) = 0 >>>> close(3) = 0 >>>> >>>> erik >>>> >>>> David Lang >>>>> >>>>> On Fri, 9 Aug 2013, Erik Steffl wrote: >>>>> >>>>> Date: Fri, 09 Aug 2013 19:10:29 -0700 >>>>>> From: Erik Steffl <[email protected]> >>>>>> Reply-To: rsyslog-users <[email protected]> >>>>>> To: rsyslog-users <[email protected]> >>>>>> Subject: Re: [rsyslog] rsyslogd keeps sending data to closed >>>>>> connection >>>>>> >>>>>> On 07/26/2013 04:33 PM, David Lang wrote: >>>>>> >>>>>>> On Fri, 26 Jul 2013, Erik Steffl wrote: >>>>>>> >>>>>>> While testing rsyslogd sending logs to a remote server I >>>>>>>> encountered >>>>>>>> this scenario: >>>>>>>> >>>>>>>> - remote server (which is behind amazon elastic load balancer) >>>>>>>> closes >>>>>>>> connection (the host is up but nobody listens on the port) >>>>>>>> >>>>>>>> - rsyslogd seems to be sending data to the closed connection >>>>>>>> >>>>>>>> - connection is in CLOSED_WAIT state >>>>>>>> >>>>>>>> strace of rsyslogd reveals (replaced IP and content of message by >>>>>>>> XXX): >>>>>>>> >>>>>>>> [pid 14188] socket(PF_INET, SOCK_STREAM, IPPROTO_TCP) = 2 >>>>>>>> [pid 14188] connect(2, {sa_family=AF_INET, sin_port=htons(5140), >>>>>>>> sin_addr=inet_addr("XXX")}, 16) = 0 >>>>>>>> [pid 14188] recvfrom(2, 0x7f44d42e765f, 1, 66, 0, 0) = -1 EAGAIN >>>>>>>> (Resource temporarily unavailable) >>>>>>>> [pid 14188] sendto(2, "<135>2013-07-26T20:43:08+00:**00 XXX", 679, >>>>>>>> 0, >>>>>>>> NULL, 0) = 679 >>>>>>>> >>>>>>>> And everybody thinks the connection is in CLOSE_WAIT state: >>>>>>>> >>>>>>>> $ sudo lsof | grep rsyslogd | grep TCP >>>>>>>> rsyslogd 14184 syslog 2u IPv4 188760 >>>>>>>> 0t0 >>>>>>>> TCP >>>>>>>> ip-10-2-35-151.ec2.internal:**48228->ec2-54-225-181-82.** >>>>>>>> compute-1.amazonaws.com:5140<http://ec2-54-225-181-82.compute-1.amazonaws.com:5140> >>>>>>>> >>>>>>>> >>>>>>>> (CLOSE_WAIT) >>>>>>>> >>>>>>>> $ netstat -a | grep 5140 >>>>>>>> tcp 1 0 XXX:48229 ELB:5140 CLOSE_WAIT >>>>>>>> >>>>>>>> ELB in above is name/IP of Amazon elastic load balancer, it seems >>>>>>>> that it behaves slightly suspiciously (why does connect succeed?, >>>>>>>> why >>>>>>>> does sendto succeed?) >>>>>>>> >>>>>>>> Any ideas why rsyslogd does not close the connection that is in >>>>>>>> CLOSE_WAIT state? The connection remains in CLOSE_WAIT state even >>>>>>>> after the remote server starts listening on 5140 port. This does not >>>>>>>> happen everytime there is no listener on remote host but when it >>>>>>>> happens it doesn't seem to be fixed until I restart rsyslogd. >>>>>>>> >>>>>>> >>>>>>> rsyslog is closing the connection, but then it's opening the >>>>>>> connection >>>>>>> again for the next message. >>>>>>> >>>>>>> My guess is that the ELB is accepting the connection (which is why >>>>>>> the >>>>>>> connect succeeds), and then discovering that there's no way for it to >>>>>>> send the connection to one of the servers that would handle the >>>>>>> traffic, >>>>>>> so it turns around and closes the connection. >>>>>>> >>>>>>> While it is closing the connection, rsyslog is sending data out >>>>>>> through >>>>>>> that connection, because it thinks it's open. This results in some >>>>>>> data >>>>>>> being lost. but since rsyslog thinks some data got through (and it >>>>>>> successfully opened the connection), it doesn't mark that >>>>>>> destination as >>>>>>> being down, and keeps trying to send data there. >>>>>>> >>>>>>> If you were to use RELP, it would properly detect that the messages >>>>>>> are >>>>>>> not being delivered and no messages would be lost. This is exactly >>>>>>> the >>>>>>> type of thing that triggered the creation of RELP. >>>>>>> >>>>>> >>>>>> it seems that using RELP should work, will try that, >>>>>> >>>>>> investigating this I figured out that netcat does not suffer the same >>>>>> problem for some reason. Not saying there is a problem with rsyslog >>>>>> (seem load balancer problem), just wondering if anybody has any >>>>>> insights. >>>>>> >>>>>> I created a simple setup of client machine -> ELB (load balancer) -> >>>>>> server machine that works like this: >>>>>> >>>>>> client runs: >>>>>> >>>>>> perl -e '$| = 1 ; for($i=0;$i<100000;$i++) {print localtime() . " >>>>>> [$i]\n"; sleep(1);}' | nc $elb_hostname 5140 >>>>>> >>>>>> server runs: >>>>>> >>>>>> nc -k -l $ip_address 5140 >>>>>> >>>>>> In this setup client nc recognies all errors, strace example (when the >>>>>> server is down): >>>>>> >>>>>> socket(PF_INET, SOCK_STREAM, IPPROTO_TCP) = 3 >>>>>> fcntl(3, F_GETFL) = 0x2 (flags O_RDWR) >>>>>> fcntl(3, F_SETFL, O_RDWR|O_NONBLOCK) = 0 >>>>>> connect(3, {sa_family=AF_INET, sin_port=htons(5140), >>>>>> sin_addr=inet_addr("10.151.83.**76")}, 16) = -1 EINPROGRESS >>>>>> (Operation >>>>>> now in progress) >>>>>> select(4, NULL, [3], NULL, NULL) = 1 (out [3]) >>>>>> getsockopt(3, SOL_SOCKET, SO_ERROR, [111], [4]) = 0 >>>>>> fcntl(3, F_SETFL, O_RDWR) = 0 >>>>>> close(3) = 0 >>>>>> >>>>>> On the other hand rsyslog is doing this (same scenario, sending data >>>>>> to load balancer that has nothing on the other side): >>>>>> >>>>>> 15218 socket(PF_INET, SOCK_STREAM, IPPROTO_TCP) = 6 >>>>>> 15218 connect(6, {sa_family=AF_INET, sin_port=htons(5140), >>>>>> sin_addr=inet_addr("54.225.**181.82")}, 16) = 0 >>>>>> 15218 recvfrom(6, 0x7f009427065f, 1, 66, 0, 0) = -1 EAGAIN (Resource >>>>>> temporarily unavailable) >>>>>> 15218 sendto(6, "<135>2013-08-10T00:11:10+00:**00 ...data >>>>>> removed...\n", >>>>>> 679, 0, NULL, 0) = 679 >>>>>> 15218 gettimeofday({1376093470, 69447}, NULL) = 0 >>>>>> 15218 gettimeofday({1376093470, 69515}, NULL) = 0 >>>>>> 15218 futex(0xd80b94, FUTEX_WAIT_PRIVATE, 497273, NULL <unfinished >>>>>> ...> >>>>>> 15217 <... epoll_wait resumed> {{EPOLLIN, {u32=14173920, >>>>>> u64=14173920}}}, 10, 4294967295) = 1 >>>>>> 15217 recvfrom(4, "<135>Aug 10 00:11:11 ...data removed... \n", 8096, >>>>>> 0, {sa_family=AF_INET, sin_port=htons(57437), >>>>>> sin_addr=inet_addr("127.0.0.1"**)}, [16]) = 399 >>>>>> 15217 gettimeofday({1376093471, 560833}, NULL) = 0 >>>>>> 15217 recvfrom(4, 0xd9e370, 8096, 0, 0x7f0094a714b0, 0x7f0094a6f364) = >>>>>> -1 EAGAIN (Resource temporarily unavailable) >>>>>> 15217 futex(0xd80b94, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0xd80b90, >>>>>> {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1} <unfinished ...> >>>>>> 15218 <... futex resumed> ) = 0 >>>>>> 15217 <... futex resumed> ) = 1 >>>>>> 15218 futex(0xd80dc0, FUTEX_WAKE_PRIVATE, 1 <unfinished ...> >>>>>> 15217 epoll_wait(2, <unfinished ...> >>>>>> 15218 <... futex resumed> ) = 0 >>>>>> 15218 gettimeofday({1376093471, 561248}, NULL) = 0 >>>>>> 15218 gettimeofday({1376093471, 561367}, NULL) = 0 >>>>>> 15218 recvfrom(6, "", 1, MSG_PEEK|MSG_DONTWAIT, NULL, NULL) = 0 >>>>>> 15218 close(6) = 0 >>>>>> >>>>>> This repeats so efectively rsyslog keeps sending messages (sendto >>>>>> returns success), closing connection, opening connection etc. As far >>>>>> as I can tell load balancer should not successfully connect and even >>>>>> receive data... >>>>>> >>>>>> Any comments/insights? >>>>>> >>>>>> erik >>>>>> ______________________________**_________________ >>>>>> rsyslog mailing list >>>>>> http://lists.adiscon.net/**mailman/listinfo/rsyslog<http://lists.adiscon.net/mailman/listinfo/rsyslog> >>>>>> http://www.rsyslog.com/**professional-services/<http://www.rsyslog.com/professional-services/> >>>>>> What's up with rsyslog? Follow https://twitter.com/rgerhards >>>>>> NOTE WELL: This is a PUBLIC mailing list, posts are ARCHIVED by a >>>>>> myriad of sites beyond our control. PLEASE UNSUBSCRIBE and DO NOT POST >>>>>> if you DON'T LIKE THAT. >>>>>> >>>>>> ______________________________**_________________ >>>>> rsyslog mailing list >>>>> http://lists.adiscon.net/**mailman/listinfo/rsyslog<http://lists.adiscon.net/mailman/listinfo/rsyslog> >>>>> http://www.rsyslog.com/**professional-services/<http://www.rsyslog.com/professional-services/> >>>>> What's up with rsyslog? Follow https://twitter.com/rgerhards >>>>> NOTE WELL: This is a PUBLIC mailing list, posts are ARCHIVED by a >>>>> myriad >>>>> of sites beyond our control. PLEASE UNSUBSCRIBE and DO NOT POST if you >>>>> DON'T LIKE THAT. >>>>> >>>> >>>> ______________________________**_________________ >>>> rsyslog mailing list >>>> http://lists.adiscon.net/**mailman/listinfo/rsyslog<http://lists.adiscon.net/mailman/listinfo/rsyslog> >>>> http://www.rsyslog.com/**professional-services/<http://www.rsyslog.com/professional-services/> >>>> What's up with rsyslog? Follow https://twitter.com/rgerhards >>>> NOTE WELL: This is a PUBLIC mailing list, posts are ARCHIVED by a >>>> myriad of sites beyond our control. PLEASE UNSUBSCRIBE and DO NOT POST >>>> if you DON'T LIKE THAT. >>>> >>>> ______________________________**_________________ >>> rsyslog mailing list >>> http://lists.adiscon.net/**mailman/listinfo/rsyslog<http://lists.adiscon.net/mailman/listinfo/rsyslog> >>> http://www.rsyslog.com/**professional-services/<http://www.rsyslog.com/professional-services/> >>> What's up with rsyslog? Follow https://twitter.com/rgerhards >>> NOTE WELL: This is a PUBLIC mailing list, posts are ARCHIVED by a myriad >>> of sites beyond our control. PLEASE UNSUBSCRIBE and DO NOT POST if you >>> DON'T LIKE THAT. >>> >> >> ______________________________**_________________ >> rsyslog mailing list >> http://lists.adiscon.net/**mailman/listinfo/rsyslog<http://lists.adiscon.net/mailman/listinfo/rsyslog> >> http://www.rsyslog.com/**professional-services/<http://www.rsyslog.com/professional-services/> >> What's up with rsyslog? Follow https://twitter.com/rgerhards >> NOTE WELL: This is a PUBLIC mailing list, posts are ARCHIVED by a myriad >> of sites beyond our control. PLEASE UNSUBSCRIBE and DO NOT POST if you >> DON'T LIKE THAT. >> >> ______________________________**_________________ > rsyslog mailing list > http://lists.adiscon.net/**mailman/listinfo/rsyslog<http://lists.adiscon.net/mailman/listinfo/rsyslog> > http://www.rsyslog.com/**professional-services/<http://www.rsyslog.com/professional-services/> > What's up with rsyslog? Follow https://twitter.com/rgerhards > NOTE WELL: This is a PUBLIC mailing list, posts are ARCHIVED by a myriad > of sites beyond our control. PLEASE UNSUBSCRIBE and DO NOT POST if you > DON'T LIKE THAT. > _______________________________________________ rsyslog mailing list http://lists.adiscon.net/mailman/listinfo/rsyslog http://www.rsyslog.com/professional-services/ What's up with rsyslog? Follow https://twitter.com/rgerhards NOTE WELL: This is a PUBLIC mailing list, posts are ARCHIVED by a myriad of sites beyond our control. PLEASE UNSUBSCRIBE and DO NOT POST if you DON'T LIKE THAT.

