Hello. I have a very odd issue on my hands that I have never seen before, and I was hoping some of you might have suggestions about it.
Scenario: a two-member Load Sharing Unicast cluster running R75.10 on open servers with SPLAT. The cluster has worked like this for months without any problems, but today I received a report about problems with a new application whose traffic has to go through the cluster. The application runs on a DMZ interface, and the following information was provided about it:

- Web servers in the DMZ: IBM HTTP Server version 7.0.0.11 (build cf111021.10) over AIX 6.1-02
- Portal servers on the intranet: IBM WebSphere Application Server – ND version 7.0.0.11 (build cf111021.10) over AIX 6.1-02

Traffic comes from web services located in other network segments, through the firewall, and on to the DMZ in question.

After the application was deployed, I noticed significant delays in traffic through the cluster, but they appear only intermittently. I ran some tests, including "fw monitor" captures on both cluster members, and found that when traffic goes through the Pivot member of the Unicast cluster everything works perfectly, but when it is handled by the other cluster member the delays occur.

Here are several pieces of information that might help:

- Traffic goes over TCP ports 10039, 10040 and 10050.
- Only this new application is affected.
- No drops are shown in the logs.
- The cluster "advanced" configuration is set to share load by "IPs" only, and "use sticky decision function" is selected.
- If the cluster is changed from LS Unicast to HA operation mode, everything works perfectly.
- I checked the cluster status with several commands, and one in particular caught my attention: "cphaprob syncstat". SK34475 says the following:

  Lost sync connection (num of events)... SHOULD be 0 - positive value indicates connectivity problems
  Not held due to no members............. SHOULD be 0 - positive value indicates connectivity problem between the members

Running the command on both cluster members did show positive values for both counters (for example, 2144 on the primary member for "lost sync"). I noticed that changing from LS Unicast to HA increases these values, so I am currently unsure whether the positive values are normal given the multiple changes in cluster operation mode.

Since the issue affects only one application, it seems to me it might not be related to a general cluster problem, but I thought this might be useful information.

Any ideas on how to get this resolved would be much appreciated.

Regards

--
Sergio Alvarez
CISSP | CCSE+
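
In case it helps anyone compare numbers: a minimal sketch (plain Python, not Check Point tooling) of how two saved "cphaprob syncstat" outputs could be diffed to see whether the two counters quoted from SK34475 are still growing or only reflect the earlier mode changes. The dotted "name ..... value" line layout is an assumption based on the SK34475 excerpt, and the function names are made up for illustration.

```python
import re

# Counter names as quoted from SK34475 in the post; both are expected to be 0.
WATCHED = (
    "Lost sync connection (num of events)",
    "Not held due to no members",
)

# Assumed line layout: "<counter name> ..... <integer>" (dotted leader, then value).
LINE_RE = re.compile(r"^\s*(?P<name>.+?)\s*\.{2,}\s*(?P<value>\d+)\s*$")


def parse_counters(text):
    """Return {counter name: value} for every line matching the assumed layout."""
    counters = {}
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if match:
            counters[match.group("name")] = int(match.group("value"))
    return counters


def growth(before, after):
    """Return how much each watched counter grew between two saved outputs."""
    first = parse_counters(before)
    second = parse_counters(after)
    # A counter missing from a snapshot is treated as 0.
    return {name: second.get(name, 0) - first.get(name, 0) for name in WATCHED}
```

Feeding it two outputs taken some minutes apart, with no mode change in between, would show whether the counters move during normal Load Sharing operation. Growth during steady operation would point toward a sync problem; no growth would suggest the positive values are left over from the mode changes.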
