I tweaked a few OS/kernel parameters, such as the Ethernet driver options,
but in the end this seems to have done the trick:
sysctl -w net.ipv4.tcp_window_scaling=0
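
For anyone who wants to confirm the setting on each box, here is a minimal Python sketch. It assumes the usual procfs path for this sysctl on Linux; the function name is my own and not part of any tool.

```python
from pathlib import Path

# Usual procfs location of this sysctl on Linux (assumed, adjust if needed).
WINDOW_SCALING_PATH = "/proc/sys/net/ipv4/tcp_window_scaling"


def window_scaling_enabled(path=WINDOW_SCALING_PATH):
    """Return True if the kernel reports TCP window scaling as enabled."""
    value = Path(path).read_text().strip()
    return value != "0"
```

Calling window_scaling_enabled() on A/B should return False once the sysctl above is applied. Note that sysctl -w only changes the running kernel; adding net.ipv4.tcp_window_scaling = 0 to /etc/sysctl.conf should keep it across reboots.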



On Wed, Dec 10, 2014 at 9:13 PM, Tim Smith <[email protected]> wrote:

> As I was typing out the email, it occurred to me that the issue is OS
> related:
>
> Looking at a sending server, A, I saw these messages in dmesg:
> TCP: Peer 10.2.1.2:514/47081 unexpectedly shrunk window 861404336:861405796 (repaired)
>
> The local TCP port, 47081, is the same one that is part of the stuck
> connection.
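
To match these kernel messages against stuck connections without doing it by eye, something like the following rough sketch could help. The field layout (peer address, peer port, local port, then a sequence range) is only inferred from the single dmesg sample quoted above, and all names in the code are my own.

```python
import re

# Pattern inferred from the one dmesg sample in this thread; the meaning
# of the two trailing numbers (a sequence range) is an assumption.
SHRUNK_WINDOW_RE = re.compile(
    r"TCP: Peer (?P<peer_ip>[\d.]+):(?P<peer_port>\d+)/(?P<local_port>\d+) "
    r"unexpectedly shrunk window (?P<seq_start>\d+):(?P<seq_end>\d+)"
)


def parse_shrunk_window(line):
    """Return the fields of a 'shrunk window' message as a dict, or None."""
    match = SHRUNK_WINDOW_RE.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    # Convert everything except the peer address to integers.
    for key in ("peer_port", "local_port", "seq_start", "seq_end"):
        fields[key] = int(fields[key])
    return fields
```

Feeding it the message above gives local_port 47081, which can then be compared with the local port of the connection that has the stuck send queue.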
>
> Now I know what the problem is :) However, I cannot seem to find a fix :(
>
>
>
>
> On Wed, Dec 10, 2014 at 8:46 PM, Tim Smith <[email protected]> wrote:
>
>> Hi,
>>
>> I have a pair of Linux/RHEL servers (RHEL 6.x), A and B, that forward
>> logs to multiple destinations:
>> - one copy to Splunk syslog listener
>> - one copy to local flume process over TCP
>> - one copy to a remote RSyslog receiver, X or Y (both RHEL 6.x)
>>
>> Forwarding copies to Splunk and Flume works fine. However, forwarding to
>> the remote Syslog receivers gets stuck in a strange way. The forwarding is
>> set up as:
>> RSyslog-Server-A -> RSyslog-Server-X
>> RSyslog-Server-B -> RSyslog-Server-Y
>>
>> All four (A, B, X and Y) are running exactly the same version of RSyslog,
>> 8.6.2-2, from the Adiscon repo:
>> rsyslog-8.6.0-2.el6.x86_64
>>
>> What happens is A/B stop sending logs to X/Y. Looking at the send/receive
>> TCP queues at both ends, the receive queue on X/Y is clear but the sendQ on
>> A/B gets stuck. As an example, this connection lingers forever (extracted
>> with netstat -an | grep EST):
>> tcp        0 103660 10.24.62.9:47081         10.2.1.2:514         ESTABLISHED
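
The check described above (same non-zero sendQ on an ESTABLISHED connection over time) could be sketched in Python as below. It assumes the plain six-column `netstat -an` layout shown in the sample (proto, Recv-Q, Send-Q, local, remote, state); the function names are hypothetical, and an unchanged Send-Q across two samples is only a hint, not proof, that a connection is stuck.

```python
def parse_netstat(output):
    """Map (local, remote) -> Send-Q for ESTABLISHED TCP lines of `netstat -an`."""
    sendq = {}
    for line in output.splitlines():
        parts = line.split()
        # Expected columns: proto, Recv-Q, Send-Q, local, remote, state
        if len(parts) != 6 or not parts[0].startswith("tcp"):
            continue
        if parts[5] != "ESTABLISHED":
            continue
        sendq[(parts[3], parts[4])] = int(parts[2])
    return sendq


def stuck_connections(first, second):
    """Return connections whose non-zero Send-Q is identical in both samples."""
    before = parse_netstat(first)
    after = parse_netstat(second)
    return {
        conn: queued
        for conn, queued in after.items()
        if queued > 0 and before.get(conn) == queued
    }
```

Pass it two captures of `netstat -an` taken some time apart; any connection it returns is a candidate worth checking with tcpdump.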
>>
>> Observations:
>> ==========
>> - The connection remains established with the same number of bytes in the
>> sendQ
>> - According to tcpdump, no data is transferred over the "stuck" connection
>> - Restarting the receive end, X/Y, does not help
>> - I don't see an "action suspended" error in the rsyslog logs
>> - Running the send side in debug mode doesn't help: I easily ended up with
>> 100+ GB of debug logs without the issue manifesting itself. The A/B pair
>> handles lots of traffic, and running rsyslogd in debug mode reduces its
>> throughput, so perhaps the issue does not manifest at lower EPS.
>> - Only restarting the send side, A/B, resolves the issue.
>>
>> I tweaked the omfwd action to change TCP_Framing from the default to octet-counted.
>> Here is the send side omfwd config on A/B:
>> --------------------
>> action (name="it_tcp_X" type="omfwd" Target="X.abc.com" Port="514"
>> Protocol="tcp" TCP_Framing="octet-counted" queue.filename="it_tcp_X"
>>  queue.maxdiskspace="10G" queue.Size="8640000"
>> queue.dequeuebatchsize="4096" queue.type="LinkedList"
>> queue.timeoutenqueue="0" queue.maxfilesize="1G" queue.saveonshutdown="on"
>> queue.workerThreads="4"  RebindInterval="10000000" template="fwdformat" )
>> --------------------
>>
>>
>> The receive side, X/Y, config:
>> --------------------
>> module(load="imptcp" threads="16") # needs to be done just once
>>
>> global (
>>     workdirectory="/data/rsyslog/queues"
>>     maxmessagesize="64K"
>>     debug.logfile="/data/rsyslog/debug/debug.log"
>>     net.enabledns="off"
>> )
>>
>> $DebugLevel 0
>>
>> main_queue (
>>     queue.FileName="globalqueue"
>>     queue.Type="LinkedList"
>>     queue.MaxDiskSpace="250g"
>>     queue.maxfilesize="5g"
>>     queue.Size="864000000"
>>     queue.dequeuebatchsize="1000"
>>     queue.TimeoutEnqueue="0"
>>     queue.workerThreads="4"
>>     queue.SaveOnShutdown="on"
>> )
>>
>> ruleset(name="aggregate") {
>> action (name="to_flume"
>>         type="omfwd"
>>         Target="localhost"
>>         Port="5614"
>>         Protocol="tcp"
>>         queue.filename="to_flume"
>>         queue.size="360000000"
>>         queue.maxdiskspace="360G"
>>         queue.highwatermark="216000000"   # 60% of queue.size
>>         queue.discardmark="288000000"     # 80% of queue.size
>>         queue.type="LinkedList"
>>         queue.dequeuebatchsize="4096"
>>         queue.timeoutenqueue="0"
>>         queue.maxfilesize="4G"
>>         queue.saveonshutdown="on"
>>         queue.workerThreads="4"
>>         RebindInterval="10000000"
>>         template="rawfwd"
>>       ) stop
>> }
>>
>> input(type="imptcp" port="514" ruleset="aggregate")
>> --------------------
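
As a sanity check on the watermark comments in the to_flume action above, the percentages can be recomputed with a few lines of Python. This is only arithmetic on the values from the config; the helper name is my own.

```python
def watermark(queue_size, percent):
    """Return the given percentage of queue_size as a whole number of messages."""
    return queue_size * percent // 100


QUEUE_SIZE = 360000000  # queue.size from the to_flume action

high = watermark(QUEUE_SIZE, 60)     # should equal queue.highwatermark
discard = watermark(QUEUE_SIZE, 80)  # should equal queue.discardmark
```

Both results agree with the configured values of 216000000 and 288000000.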
>>
>> Any pointers to troubleshoot and smoke out the bug will be highly
>> appreciated :)
>>
>> Thanks
>>
>>
>>
>>
>