Just a quick follow-up to Rodney's mail: we've been
running with the attached patch to counter this issue
for a little while now, so far without a repeat
incident.
Cheers.
----- Original Message -----
> Hello,
>
> With the following config:
> ...
> # Send all logs onto the local relay
> #
> *.*;syslog.!=info @@log1;RSYSLOG_ForwardFormat
> $ActionExecOnlyWhenPreviousIsSuspended on
> & @@log2
> & /var/spool/rsyslog-buffer
> $ActionExecOnlyWhenPreviousIsSuspended off
>
>
> We have systems stuck in a connect state that do not appear to be
> recovering:
>
> # strace -p 32318
> Process 32318 attached - interrupt to quit
> connect(1, {sa_family=AF_INET, sin_port=htons(514), sin_addr=inet_addr("log2")}, 16 <unfinished ...>
> Process 32318 detached
>
>
> Loaded symbols for /lib64/libnss_dns.so.2
> Reading symbols from /lib64/libresolv.so.2...done.
> Loaded symbols for /lib64/libresolv.so.2
> Reading symbols from /lib64/rsyslog/lmnsd_ptcp.so...done.
> Loaded symbols for /lib64/rsyslog/lmnsd_ptcp.so
> 0x000000319ca0cf2b in connect () from /lib64/libpthread.so.0
> #0 0x000000319ca0cf2b in connect () from /lib64/libpthread.so.0
> #1 0x00002aaaab0d1d65 in Connect (pNsd=0x2aaaac8abe10, family=<value optimized out>, port=<value optimized out>, host=<value optimized out>) at nsd_ptcp.c:684
> #2 0x000000000040ff29 in TCPSendInit ()
> #3 0x0000000000410038 in doTryResume ()
> #4 0x0000000000436d30 in actionTryResume ()
> #5 0x0000000000437393 in submitBatch ()
> #6 0x0000000000437978 in processBatchMain ()
> #7 0x0000000000435896 in doSubmitToActionQBatch ()
> #8 0x00000000004361f9 in doSubmitToActionQNotAllMarkBatch ()
> #9 0x00000000004325b8 in processBatchDoActions ()
> #10 0x000000000041d5d8 in llExecFunc ()
> #11 0x0000000000432933 in processBatch ()
> #12 0x00000000004319de in processBatchDoRules ()
> #13 0x000000000041d5d8 in llExecFunc ()
> #14 0x0000000000431f04 in processBatch ()
> #15 0x000000000040b5cf in msgConsumer ()
> #16 0x0000000000430dcd in ConsumerReg ()
> #17 0x000000000042a51c in wtiWorker ()
> #18 0x000000000042a136 in wtpWorker ()
> #19 0x000000319ca062f7 in start_thread () from /lib64/libpthread.so.0
> #20 0x000000319c2d1b6d in clone () from /lib64/libc.so.6
>
> $ sudo /usr/sbin/lsof -p 20064
> COMMAND PID USER FD TYPE DEVICE SIZE NODE NAME
> rsyslogd 20064 root cwd DIR 8,7 4096 2 /
> rsyslogd 20064 root rtd DIR 8,7 4096 2 /
> rsyslogd 20064 root txt REG 8,7 441537 1310792 /sbin/rsyslogd
> rsyslogd 20064 root mem REG 8,7 134400 885017 /lib64/ld-2.5.so
> rsyslogd 20064 root mem REG 8,7 1699912 885018 /lib64/libc-2.5.so
> rsyslogd 20064 root mem REG 8,7 23360 885019 /lib64/libdl-2.5.so
> rsyslogd 20064 root mem REG 8,7 141440 885023 /lib64/libpthread-2.5.so
> rsyslogd 20064 root mem REG 8,7 53448 885024 /lib64/librt-2.5.so
> rsyslogd 20064 root mem REG 8,6 85928 164425 /usr/lib64/libz.so.1.2.3
> rsyslogd 20064 root mem REG 8,7 92736 884796 /lib64/libresolv-2.5.so
> rsyslogd 20064 root mem REG 8,7 53880 884764 /lib64/libnss_files-2.5.so
> rsyslogd 20064 root mem REG 8,7 23632 884762 /lib64/libnss_dns-2.5.so
> rsyslogd 20064 root mem REG 8,7 93320 885005 /lib64/rsyslog/lmnsd_ptcp.so
> rsyslogd 20064 root mem REG 8,7 75802 884921 /lib64/rsyslog/lmnet.so
> rsyslogd 20064 root mem REG 8,7 1295631 884806 /lib64/rsyslog/imuxsock.so
> rsyslogd 20064 root mem REG 8,7 81914 884794 /lib64/rsyslog/imklog.so
> rsyslogd 20064 root mem REG 8,7 57594 884804 /lib64/rsyslog/imudp.so
> rsyslogd 20064 root mem REG 8,7 37373 884801 /lib64/rsyslog/impstats.so
> rsyslogd 20064 root mem REG 8,7 90803 884994 /lib64/rsyslog/lmnetstrms.so
> rsyslogd 20064 root mem REG 8,7 35770 884873 /lib64/rsyslog/lmtcpclt.so
> rsyslogd 20064 root 0u unix 0xffff81031e95e0c0 876257540 /dev/log
> rsyslogd 20064 root 1u IPv4 1000073229 TCP app1.lhr.acx:43293->192.168.132.143:shell (SYN_SENT)
> rsyslogd 20064 root 2r 0000 0,10 0 876257542 eventpoll
> rsyslogd 20064 root 3u IPv6 876257538 UDP *:syslog
> rsyslogd 20064 root 4u IPv4 876257539 UDP *:syslog
> rsyslogd 20064 root 8r REG 0,3 0 4026531849 /proc/kmsg
>
>
> The suspected code is in nsd_ptcp.c as it does not appear to allow for
> a timeout with a NODELAY or other mechanism on the connect.
>
> if((pThis->sock = socket(res->ai_family, res->ai_socktype, res->ai_protocol)) == -1) {
>     ABORT_FINALIZE(RS_RET_IO_ERROR);
> }
>
> if(connect(pThis->sock, res->ai_addr, res->ai_addrlen) != 0) {
>     ABORT_FINALIZE(RS_RET_IO_ERROR);
> }
>
>
>
> Rgds
> Rodney
> _______________________________________________
> rsyslog mailing list
> http://lists.adiscon.net/mailman/listinfo/rsyslog
> http://www.rsyslog.com
--
Nathan
diff -Naurp rsyslog-5.8.6/runtime/nsd_ptcp.c rsyslog-connect/runtime/nsd_ptcp.c
--- rsyslog-5.8.6/runtime/nsd_ptcp.c 2011-10-21 20:53:02.000000000 +1100
+++ rsyslog-connect/runtime/nsd_ptcp.c 2011-11-29 13:15:22.000000000 +1100
@@ -662,6 +662,7 @@ Connect(nsd_t *pNsd, int family, uchar *
nsd_ptcp_t *pThis = (nsd_ptcp_t*) pNsd;
struct addrinfo *res = NULL;
struct addrinfo hints;
+ int fdflags;
DEFiRet;
ISOBJ_TYPE_assert(pThis, nsd_ptcp);
@@ -681,10 +682,22 @@ Connect(nsd_t *pNsd, int family, uchar *
ABORT_FINALIZE(RS_RET_IO_ERROR);
}
+ if((fdflags = fcntl(pThis->sock, F_GETFL)) == -1) {
+ ABORT_FINALIZE(RS_RET_IO_ERROR);
+ }
+
+ if(fcntl(pThis->sock, F_SETFL, fdflags | O_NONBLOCK) == -1) {
+ ABORT_FINALIZE(RS_RET_IO_ERROR);
+ }
+
if(connect(pThis->sock, res->ai_addr, res->ai_addrlen) != 0) {
ABORT_FINALIZE(RS_RET_IO_ERROR);
}
+ if(fcntl(pThis->sock, F_SETFL, fdflags) == -1) {
+ ABORT_FINALIZE(RS_RET_IO_ERROR);
+ }
+
finalize_it:
if(res != NULL)
freeaddrinfo(res);
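
For anyone wanting an explicit timeout on the connect rather than
relying on the kernel's SYN retry limit, here is a rough sketch of the
usual non-blocking pattern. It is not rsyslog code: the helper name
connect_with_timeout() and its timeout_ms parameter are made up for
illustration. Bear in mind that connect() on a non-blocking socket
normally returns -1 with errno set to EINPROGRESS, so the real outcome
has to be collected afterwards with poll() and SO_ERROR.

```c
#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

/* Hypothetical helper (not part of rsyslog): connect with an upper bound
 * on the wait. Returns 0 on success, -1 on failure with errno set.
 * On failure the socket may be left non-blocking; the caller is expected
 * to close it. */
static int
connect_with_timeout(int sock, const struct sockaddr *addr,
		     socklen_t addrlen, int timeout_ms)
{
	int fdflags, rc, soerr = 0;
	socklen_t soerrlen = sizeof(soerr);
	struct pollfd pfd;

	if((fdflags = fcntl(sock, F_GETFL)) == -1)
		return -1;
	if(fcntl(sock, F_SETFL, fdflags | O_NONBLOCK) == -1)
		return -1;

	rc = connect(sock, addr, addrlen);
	if(rc != 0 && errno != EINPROGRESS)
		return -1;	/* immediate hard failure */

	if(rc != 0) {
		/* Handshake in progress: wait until writable or timed out.
		 * A retry after EINTR restarts the full timeout. */
		pfd.fd = sock;
		pfd.events = POLLOUT;
		pfd.revents = 0;
		do {
			rc = poll(&pfd, 1, timeout_ms);
		} while(rc == -1 && errno == EINTR);
		if(rc == -1)
			return -1;
		if(rc == 0) {
			errno = ETIMEDOUT;
			return -1;
		}
		/* Writable does not imply connected: fetch the real result. */
		if(getsockopt(sock, SOL_SOCKET, SO_ERROR, &soerr, &soerrlen) == -1)
			return -1;
		if(soerr != 0) {
			errno = soerr;
			return -1;
		}
	}

	/* Restore the caller's original (blocking) flags. */
	if(fcntl(sock, F_SETFL, fdflags) == -1)
		return -1;
	return 0;
}
```

In Connect() this would stand in for the plain connect() call, e.g.
connect_with_timeout(pThis->sock, res->ai_addr, res->ai_addrlen, 5000),
with the 5 second value being an arbitrary choice for the example.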