Thanks, David.

Issues with Layer-3 network devices (routers, firewalls, etc.) did occur to
me, but honestly my priority was to stabilize the log stream between the
two ends without involving others, since that usually slows down the
process. Also, unless it is a major outage-causing bug, network engineers
are very reluctant to cooperate closely on troubleshooting such issues :)
That said, now that my log stream is stable, I will open a case with
our network engineering team. I will post back if I find something useful.

As a side note, my sending rsyslog servers are running on hardware with tg3
Ethernet drivers. From what I found by googling, people seem to have had
lots of trouble with tg3 drivers, and I saw several recommendations to turn
off a number of TCP features that are offloaded to the Ethernet card:
"gso off tso off sg off gro off"
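
For anyone who wants to try those settings, here is a minimal sketch of how
they are usually applied with ethtool -K. The interface name eth0, the
DRY_RUN switch and the run helper are assumptions for illustration only and
are not taken from my servers. By default the script only prints the
commands it would run.

```shell
#!/bin/sh
# Sketch: turn off the four offload features quoted above on one interface.
# IFACE=eth0 is an assumption; substitute the real interface name.
IFACE="${IFACE:-eth0}"
# DRY_RUN=1 (the default here) only prints the commands; set DRY_RUN=0 and
# run as root to actually apply them.
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical helper: print the command in dry-run mode, otherwise run it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$@"
    else
        "$@"
    fi
}

# Show the current offload settings (lowercase -k only queries).
run ethtool -k "$IFACE"
# Disable generic segmentation, TCP segmentation, scatter-gather and
# generic receive offload (uppercase -K changes settings).
run ethtool -K "$IFACE" gso off tso off sg off gro off
```

Settings changed this way do not survive a reboot, so if they turn out to
help they would still need to be made persistent in the distribution's
network configuration.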





On Fri, Dec 12, 2014 at 4:53 AM, David Lang <[email protected]> wrote:
>
> This makes me think that you have a firewall between the two that doesn't
> understand window scaling and is stripping it out of the packets (breaking
> things when scaling is in use)
>
> This is not normally done by ISPs, but if you have an old firewall in the
> path somewhere, check it out. It probably needs to be updated to patch
> security holes (and to get it onto a supported version; this is an old
> problem).
>
> David Lang
>
>
> On Fri, 12 Dec 2014, Tim Smith wrote:
>
>> I tweaked a few OS/kernel parameters, such as eth driver options, but
>> finally this seems to have done the trick:
>> sysctl -w net.ipv4.tcp_window_scaling=0
>>
>>
>>
>> On Wed, Dec 10, 2014 at 9:13 PM, Tim Smith <[email protected]> wrote:
>>
>>> As I was typing out the email, it occurred to me that the issue is OS
>>> related:
>>>
>>> Looking at a sending server, A, I saw these messages in dmesg:
>>> TCP: Peer 10.2.1.2:514/47081 unexpectedly shrunk window 861404336:861405796 (repaired)
>>>
>>> The local TCP port, 47081, is the same one that is part of the stuck
>>> connection.
>>>
>>> Now I know what the problem is :) However, I cannot seem to find a fix :(
>>>
>>>
>>>
>>>
>>> On Wed, Dec 10, 2014 at 8:46 PM, Tim Smith <[email protected]> wrote:
>>>
>>>> Hi,
>>>>
>>>> I have a pair of Linux/RHEL servers (RHEL 6.x), A and B, that forward
>>>> logs to multiple destinations:
>>>> - one copy to Splunk syslog listener
>>>> - one copy to local flume process over TCP
>>>> - one copy to a remote RSyslog receiver, X and Y (RHEL 6.x)
>>>>
>>>> Forwarding copies to Splunk and Flume works fine. However, forwarding to
>>>> the remote Syslog receivers gets stuck in a strange way. The forwarding
>>>> is set up as:
>>>> RSyslog-Server-A -> RSyslog-Server-X
>>>> RSyslog-Server-B -> RSyslog-Server-Y
>>>>
>>>> All four (A, B, X and Y) are running exactly the same version of RSyslog,
>>>> 8.6.2-2, from the adiscon repo:
>>>> rsyslog-8.6.0-2.el6.x86_64
>>>>
>>>> What happens is A/B stop sending logs to X/Y. Looking at the send/receive
>>>> TCP queues at both ends, the receive queue on X/Y is clear but the sendQ
>>>> on A/B gets stuck. As an example, this connection lingers forever
>>>> (extracted with netstat -an | grep EST):
>>>> tcp        0 103660 10.24.62.9:47081         10.2.1.2:514  ESTABLISHED
>>>>
>>>> Observations:
>>>> ==========
>>>> - The connection remains established with the same number of bytes in
>>>>   the sendQ
>>>> - No data is transferred over the "stuck" connection, looking at tcpdump
>>>> - Re-starting the receive end, X/Y, does not help
>>>> - I don't see an action suspended error in the rsyslog logs
>>>> - Running the send side in debug doesn't help - I easily ended up with
>>>>   100+ Gigs of debug logs without the issue manifesting itself. The A/B
>>>>   pair handle lots of traffic and running rsyslogd in debug mode reduces
>>>>   their throughput - perhaps the issue does not manifest at lower EPS.
>>>> - Only re-starting the send side, A/B, resolves the issue.
>>>>
>>>> I tweaked the omfwd action to change TCP_Framing from the default to
>>>> octet-counted. Here is the send side omfwd config on A/B:
>>>> --------------------
>>>> action (name="it_tcp_X" type="omfwd" Target="X.abc.com" Port="514"
>>>> Protocol="tcp" TCP_Framing="octet-counted" queue.filename="it_tcp_X"
>>>>  queue.maxdiskspace="10G" queue.Size="8640000"
>>>> queue.dequeuebatchsize="4096" queue.type="LinkedList"
>>>> queue.timeoutenqueue="0" queue.maxfilesize="1G"
>>>> queue.saveonshutdown="on"
>>>> queue.workerThreads="4"  RebindInterval="10000000" template="fwdformat"
>>>> )
>>>> --------------------
>>>>
>>>>
>>>> The receive side, X/Y, config:
>>>> --------------------
>>>> module(load="imptcp" threads="16") # needs to be done just once
>>>>
>>>> global (
>>>>     workdirectory="/data/rsyslog/queues"
>>>>     maxmessagesize="64K"
>>>>     debug.logfile="/data/rsyslog/debug/debug.log"
>>>>     net.enabledns="off"
>>>> )
>>>>
>>>> $DebugLevel 0
>>>>
>>>> main_queue (
>>>>     queue.FileName="globalqueue"
>>>>     queue.Type="LinkedList"
>>>>     queue.MaxDiskSpace="250g"
>>>>     queue.maxfilesize="5g"
>>>>     queue.Size="864000000"
>>>>     queue.dequeuebatchsize="1000"
>>>>     queue.TimeoutEnqueue="0"
>>>>     queue.workerThreads="4"
>>>>     queue.SaveOnShutdown="on"
>>>> )
>>>>
>>>> ruleset(name="aggregate") {
>>>> action (name="to_flume"
>>>>         type="omfwd"
>>>>         Target="localhost"
>>>>         Port="5614"
>>>>         Protocol="tcp"
>>>>         queue.filename="to_flume"
>>>>         queue.size="360000000"
>>>>         queue.maxdiskspace="360G"
>>>>         queue.highwatermark="216000000"   # 60% of queue.size
>>>>         queue.discardmark="288000000"     # 80% of queue.size
>>>>         queue.type="LinkedList"
>>>>         queue.dequeuebatchsize="4096"
>>>>         queue.timeoutenqueue="0"
>>>>         queue.maxfilesize="4G"
>>>>         queue.saveonshutdown="on"
>>>>         queue.workerThreads="4"
>>>>         RebindInterval="10000000"
>>>>         template="rawfwd"
>>>>       ) stop
>>>> }
>>>>
>>>> input(type="imptcp" port="514" ruleset="aggregate")
>>>> --------------------
>>>>
>>>> Any pointers to troubleshoot and smoke out the bug will be highly
>>>> appreciated :)
>>>>
>>>> Thanks
>>>>
>>>>
>>>>
>>>>
>>>>
>>>  _______________________________________________
>> rsyslog mailing list
>> http://lists.adiscon.net/mailman/listinfo/rsyslog
>> http://www.rsyslog.com/professional-services/
>> What's up with rsyslog? Follow https://twitter.com/rgerhards
>> NOTE WELL: This is a PUBLIC mailing list, posts are ARCHIVED by a myriad
>> of sites beyond our control. PLEASE UNSUBSCRIBE and DO NOT POST if you
>> DON'T LIKE THAT.
>>