This makes me think that you have a firewall between the two machines that doesn't understand TCP window scaling and is stripping the option out of the packets, which breaks things when scaling is in use.

This is not normally done by ISPs, but if you have an old firewall somewhere in the path, check it out. This is an old problem, so the firewall probably needs to be updated anyway, both to patch security holes and to get it onto a supported version.
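
To illustrate why a mangled window-scale option hurts, here is a minimal Python sketch of the arithmetic from RFC 7323. The window value 512 and the shift 7 are made-up examples, and `effective_window` is a hypothetical helper, not anything from the kernel or rsyslog. It assumes one end believes scaling was negotiated while the other end, having seen the option stripped, applies no shift.

```python
# Sketch of TCP window scaling arithmetic (RFC 7323).
# All numbers below are illustrative assumptions, not captured traffic.

def effective_window(advertised: int, shift: int) -> int:
    """Bytes actually offered: the 16-bit header field shifted left by the scale."""
    if not 0 <= advertised <= 0xFFFF:
        raise ValueError("the TCP window field is 16 bits wide")
    if not 0 <= shift <= 14:
        raise ValueError("RFC 7323 limits the shift count to 14")
    return advertised << shift

# The receiver believes shift 7 was negotiated and puts 512 in the header,
# meaning it is offering 64 KiB.
receiver_view = effective_window(512, 7)

# A sender that never saw the option applies no scaling and reads the same
# header field as only 512 bytes.
sender_view = effective_window(512, 0)

print(receiver_view, sender_view)
```

With these example values the two ends disagree about the open window by a factor of 128, which is the kind of mismatch that can leave data sitting in the send queue.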

David Lang

On Fri, 12 Dec 2014, Tim Smith wrote:

I tweaked a few OS/kernel parameters, such as Ethernet driver options, but
in the end this is what seems to have done the trick:
sysctl -w net.ipv4.tcp_window_scaling=0



On Wed, Dec 10, 2014 at 9:13 PM, Tim Smith wrote:

As I was typing out the email, it occurred to me that the issue is OS
related:

Looking at a sending server, A, I saw these messages in dmesg:
TCP: Peer 10.2.1.2:514/47081 unexpectedly shrunk window 861404336:861405796 (repaired)

The local TCP port, 47081, is the same one that is part of the stuck
connection.
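
Matching the kernel message to the stuck connection can be scripted. The following is a minimal Python sketch whose regular expression is modelled only on the single dmesg line quoted above; other kernel versions may word the message differently, and `parse_shrunk_window` is a hypothetical helper name.

```python
import re

# Pattern based on the one dmesg line quoted in this thread (an assumption:
# the format may vary between kernel versions).
SHRUNK_RE = re.compile(
    r"TCP: Peer (?P<peer_ip>[\d.]+):(?P<peer_port>\d+)/(?P<local_port>\d+) "
    r"unexpectedly shrunk window (?P<start>\d+):(?P<end>\d+)"
)

def parse_shrunk_window(line: str):
    """Return the peer address, the local port and the two sequence numbers, or None."""
    m = SHRUNK_RE.search(line)
    if m is None:
        return None
    return {
        "peer": (m["peer_ip"], int(m["peer_port"])),
        "local_port": int(m["local_port"]),
        "seq_range": (int(m["start"]), int(m["end"])),
    }

line = ("TCP: Peer 10.2.1.2:514/47081 unexpectedly shrunk window "
        "861404336:861405796 (repaired)")
info = parse_shrunk_window(line)
print(info)
```

For the line above, the local port comes out as 47081 and the two sequence numbers are 1460 bytes apart, which would be consistent with a single full-sized segment on a standard Ethernet MTU.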

Now I know what the problem is :) However, I cannot seem to find a fix :(




On Wed, Dec 10, 2014 at 8:46 PM, Tim Smith wrote:

Hi,

I have a pair of Linux/RHEL servers (RHEL 6.x), A and B, that forward
logs to multiple destinations:
- one copy to Splunk syslog listener
- one copy to local flume process over TCP
- one copy to a remote RSyslog receiver, X and Y (RHEL 6.x)

Forwarding copies to Splunk and Flume works fine. However, forwarding to
the remote RSyslog receivers gets stuck in a strange way. The forwarding is
set up as:
RSyslog-Server-A -> RSyslog-Server-X
RSyslog-Server-B -> RSyslog-Server-Y

All four (A, B, X and Y) are running exactly the same version of RSyslog,
8.6.0-2, from the Adiscon repo:
rsyslog-8.6.0-2.el6.x86_64

What happens is A/B stop sending logs to X/Y. Looking at the send/receive
TCP queues at both ends, the receive queue on X/Y is clear but the sendQ on
A/B gets stuck. As an example, this connection lingers forever (extracted
with netstat -an | grep EST):
tcp        0 103660 10.24.62.9:47081         10.2.1.2:514         ESTABLISHED

Observations:
==========
- The connection remains established with the same number of bytes in the
sendQ
- No data is transferred over the "stuck" connection, looking at tcpdump
- Re-starting the receive end, X/Y, does not help
- I don't see an action suspended error in the rsyslog logs
- Running the send side in debug doesn't help - I easily ended up with
100+ Gigs of debug logs without the issue manifesting itself. The A/B pair
handle lots of traffic and running rsyslogd in debug mode reduces their
throughput - perhaps the issue does not manifest at lower EPS.
- Only re-starting the send side, A/B, resolves the issue.
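
The first observation (same byte count in the sendQ over time) can be checked mechanically. Below is a minimal Python sketch that compares two `netstat -an` snapshots, assuming the usual Linux net-tools column order (proto, Recv-Q, Send-Q, local, foreign, state). The second connection in the sample data is invented for contrast, and the function names are hypothetical. It is only a heuristic: a busy, healthy connection could coincidentally show the same Send-Q twice.

```python
# Sketch: flag ESTABLISHED TCP connections whose Send-Q is non-zero and
# identical in two netstat snapshots taken some time apart.

def parse_netstat(text: str) -> dict:
    """Map (local, foreign) -> Send-Q bytes for ESTABLISHED TCP sockets."""
    conns = {}
    for line in text.splitlines():
        fields = line.split()
        if (len(fields) >= 6 and fields[0].startswith("tcp")
                and fields[5] == "ESTABLISHED"):
            conns[(fields[3], fields[4])] = int(fields[2])
    return conns

def stuck_connections(before: str, after: str) -> list:
    """Connections present in both snapshots with the same non-zero Send-Q."""
    old, new = parse_netstat(before), parse_netstat(after)
    return sorted(k for k, q in new.items() if q > 0 and old.get(k) == q)

# First line is the connection from this thread; the second is made up.
snap1 = """\
tcp        0 103660 10.24.62.9:47081   10.2.1.2:514    ESTABLISHED
tcp        0   2048 10.24.62.9:51000   10.2.1.3:514    ESTABLISHED
"""
snap2 = """\
tcp        0 103660 10.24.62.9:47081   10.2.1.2:514    ESTABLISHED
tcp        0      0 10.24.62.9:51000   10.2.1.3:514    ESTABLISHED
"""
print(stuck_connections(snap1, snap2))
```

Taking the snapshots a minute or more apart and confirming with tcpdump, as described above, would reduce the chance of a false positive.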

I tweaked the omfwd action to change TCP_Framing from the default to octet-counted.
Here is the send side omfwd config on A/B:
--------------------
action (name="it_tcp_X" type="omfwd" Target="X.abc.com" Port="514"
Protocol="tcp" TCP_Framing="octet-counted" queue.filename="it_tcp_X"
 queue.maxdiskspace="10G" queue.Size="8640000"
queue.dequeuebatchsize="4096" queue.type="LinkedList"
queue.timeoutenqueue="0" queue.maxfilesize="1G" queue.saveonshutdown="on"
queue.workerThreads="4"  RebindInterval="10000000" template="fwdformat" )
--------------------


The receive side, X/Y, config:
--------------------
module(load="imptcp" threads="16") # needs to be done just once

global (
    workdirectory="/data/rsyslog/queues"
    maxmessagesize="64K"
    debug.logfile="/data/rsyslog/debug/debug.log"
    net.enabledns="off"
)

$DebugLevel 0

main_queue (
    queue.FileName="globalqueue"
    queue.Type="LinkedList"
    queue.MaxDiskSpace="250g"
    queue.maxfilesize="5g"
    queue.Size="864000000"
    queue.dequeuebatchsize="1000"
    queue.TimeoutEnqueue="0"
    queue.workerThreads="4"
    queue.SaveOnShutdown="on"
)

ruleset(name="aggregate") {
action (name="to_flume"
        type="omfwd"
        Target="localhost"
        Port="5614"
        Protocol="tcp"
        queue.filename="to_flume"
        queue.size="360000000"
        queue.maxdiskspace="360G"
        queue.highwatermark="216000000"   # 60% of queue.size
        queue.discardmark="288000000"     # 80% of queue.size
        queue.type="LinkedList"
        queue.dequeuebatchsize="4096"
        queue.timeoutenqueue="0"
        queue.maxfilesize="4G"
        queue.saveonshutdown="on"
        queue.workerThreads="4"
        RebindInterval="10000000"
        template="rawfwd"
      ) stop
}

input(type="imptcp" port="514" ruleset="aggregate")
--------------------

Any pointers to troubleshoot and smoke out the bug will be highly
appreciated :)

Thanks





_______________________________________________
rsyslog mailing list
http://lists.adiscon.net/mailman/listinfo/rsyslog
http://www.rsyslog.com/professional-services/
What's up with rsyslog? Follow https://twitter.com/rgerhards
NOTE WELL: This is a PUBLIC mailing list, posts are ARCHIVED by a myriad of 
sites beyond our control. PLEASE UNSUBSCRIBE and DO NOT POST if you DON'T LIKE 
THAT.
