I'm in the process of setting up a failover group of SunRay servers (SRSS 3.1) 
on Solaris 10 1/06 (sparc) on a couple of V440s. While debugging why the load 
balancing (and utswitch(1)) doesn't work as expected for me, I stumbled 
across some kind of "multicast problem", which leads me to assume that this is 
probably a Solaris-related issue:

By default, the SunRay software sends out keep-alive messages to the multicast 
group 224.101.101.101 via both configured interfaces (ce0: LAN, ce1: dedicated 
SunRay interconnect). The strange thing is that *almost* none of these packets 
reach the other machines of the failover group via their ce1 interfaces; they 
arrive only via ce0. Since the SunRay software seems to listen for these 
packets only on ce1, all the other servers are incorrectly marked down...
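To take the SunRay software out of the picture, a standalone listener that joins the group explicitly on the ce1 address might help. The following is only a minimal sketch, assuming a Python 3 interpreter is available on the servers; the port 7009 is taken from the snoop capture further down, the names `build_mreq` and `listen` are made up for illustration, and whether the port can be shared with the running SunRay daemon via SO_REUSEADDR is an assumption.

```python
import socket

GROUP = "224.101.101.101"   # multicast group named in this post
PORT = 7009                 # UDP port seen in the snoop capture in this post
IFACE_ADDR = "10.0.0.4"     # ce1 address of the 1st machine; adjust per host


def build_mreq(group, iface_addr):
    """Pack a struct ip_mreq: group address, then local interface address."""
    return socket.inet_aton(group) + socket.inet_aton(iface_addr)


def listen(group=GROUP, port=PORT, iface_addr=IFACE_ADDR):
    """Join the group on one interface and print the sender of each datagram."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    # Assumption: SO_REUSEADDR lets us share the port with a process that
    # already has it bound; this may behave differently on a given system.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(("", port))
    # Join explicitly on iface_addr instead of letting the kernel choose
    # the interface.
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP,
                    build_mreq(group, iface_addr))
    while True:
        data, (src, sport) = sock.recvfrom(2048)
        print(f"{src}:{sport} -> {group} LEN={len(data)}")
```

Calling `listen()` on one server while the others keep sending should print one line per received keep-alive. Note that the socket is bound to the wildcard address, so the output alone does not say which interface a datagram arrived on; the source address (10.x vs. 129.70.160.x) is the only hint.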

I have no problems with our production systems, which have basically the same 
hardware setup (but run Solaris 9 12/03 + patches); there, snoop(1M) does 
indeed show the multicast traffic arriving via ce1.

Does anyone have an idea what's going wrong here? Did I overlook something, or 
is there a bug somewhere in the multicast handling?

Our setup:

4 identical V440 running Solaris 10 1/06 (with 119578-11, 118822-26) and SRSS 
3.1

1st machine:
- ce0: 129.70.160.101/24
- ce1: 10.0.0.4/8

2nd machine:
- ce0: 129.70.160.102/24
- ce1: 10.0.0.5/8

3rd machine:
- ce0: 129.70.160.103/24
- ce1: 10.0.0.6/8

4th machine:
- ce0: 129.70.160.105/24
- ce1: 10.0.0.10/8
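
As a sanity check on the addressing above, the following sketch (Python, standard library only) confirms that all ce0 addresses fall into one network and all ce1 addresses into another. The `HOSTS` table just restates the list above; the function name is made up for illustration.

```python
import ipaddress

# Addresses as listed in this post, keyed by machine number.
HOSTS = {
    1: {"ce0": "129.70.160.101/24", "ce1": "10.0.0.4/8"},
    2: {"ce0": "129.70.160.102/24", "ce1": "10.0.0.5/8"},
    3: {"ce0": "129.70.160.103/24", "ce1": "10.0.0.6/8"},
    4: {"ce0": "129.70.160.105/24", "ce1": "10.0.0.10/8"},
}


def networks_by_interface(hosts):
    """Map each interface name to the set of IP networks its addresses use."""
    result = {}
    for ifaces in hosts.values():
        for name, cidr in ifaces.items():
            net = ipaddress.ip_interface(cidr).network
            result.setdefault(name, set()).add(net)
    return result


nets = networks_by_interface(HOSTS)
for name in sorted(nets):
    # More than one network per interface name would point to a typo
    # in an address or netmask.
    print(name, sorted(str(n) for n in nets[name]))
```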

All the ce1 interfaces are connected to the same Cisco switch and belong to the 
same VLAN (no router or inter-switch connections involved). Please note that 
our working Sol9 production systems are connected to the same switch but use a 
different VLAN.

"netstat -gn" on all 4 Sol10 machines shows the correct subscription for the 
multicast group:

Group Memberships: IPv4
Interface Group                RefCnt
--------- -------------------- ------
lo0       224.0.0.1                 1
ce0       224.101.101.101           1
ce0       224.0.0.1                 1
ce1       224.101.101.101           1
ce1       224.0.0.1                 1
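
Checking this by eye on four machines is error-prone, so here is a rough sketch of how the check could be scripted. It assumes the output has been saved as text in exactly the three-column layout shown above; `memberships` and `joined` are hypothetical helper names, not part of any Solaris or SunRay tool.

```python
NETSTAT_GN = """\
Group Memberships: IPv4
Interface Group                RefCnt
--------- -------------------- ------
lo0       224.0.0.1                 1
ce0       224.101.101.101           1
ce0       224.0.0.1                 1
ce1       224.101.101.101           1
ce1       224.0.0.1                 1
"""


def memberships(text):
    """Return {interface: set(groups)} from 'netstat -gn' style IPv4 output."""
    table = {}
    for line in text.splitlines():
        fields = line.split()
        # Data rows have exactly three columns and a numeric reference count;
        # the title, header and separator rows fail the isdigit() test.
        if len(fields) == 3 and fields[2].isdigit():
            iface, group, _refcnt = fields
            table.setdefault(iface, set()).add(group)
    return table


def joined(text, iface, group="224.101.101.101"):
    """True if `iface` is listed as a member of `group`."""
    return group in memberships(text).get(iface, set())


print("ce1 joined:", joined(NETSTAT_GN, "ce1"))
```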

Snooping simultaneously on ce1 of two of these machines usually looks like the 
following ("snoop -ta -d ce1 multicast"):

1st system:
13:25:14.75893   buzzie-ce1 -> 224.101.101.101 UDP D=7009 S=7009 LEN=314
13:25:34.80855   buzzie-ce1 -> 224.101.101.101 UDP D=7009 S=7009 LEN=314
13:25:54.85827   buzzie-ce1 -> 224.101.101.101 UDP D=7009 S=7009 LEN=314
13:26:14.90793   buzzie-ce1 -> 224.101.101.101 UDP D=7009 S=7009 LEN=314
13:26:22.18350     10.0.0.5 -> (broadcast)  ARP C Who is 10.0.0.4, buzzie-ce1 ?
13:26:22.18358     10.0.0.5 -> (broadcast)  ARP C Who is 10.0.0.6, 10.0.0.6 ?
13:26:34.95759   buzzie-ce1 -> 224.101.101.101 UDP D=7009 S=7009 LEN=314
13:26:43.97907    10.0.0.10 -> 224.101.101.101 UDP D=7009 S=7009 LEN=313
13:26:55.00762   buzzie-ce1 -> 224.101.101.101 UDP D=7009 S=7009 LEN=314

4th system:
13:25:3.72962    flaps-ce1 -> 224.101.101.101 UDP D=7009 S=7009 LEN=313
13:25:23.77974    flaps-ce1 -> 224.101.101.101 UDP D=7009 S=7009 LEN=313
13:25:43.82956    flaps-ce1 -> 224.101.101.101 UDP D=7009 S=7009 LEN=313
13:26:3.87939    flaps-ce1 -> 224.101.101.101 UDP D=7009 S=7009 LEN=313
13:26:22.18358     10.0.0.5 -> (broadcast)  ARP C Who is 10.0.0.4, 10.0.0.4 ?
13:26:22.18368     10.0.0.5 -> (broadcast)  ARP C Who is 10.0.0.6, 10.0.0.6 ?
13:26:23.92929    flaps-ce1 -> 224.101.101.101 UDP D=7009 S=7009 LEN=313
13:26:43.97910    flaps-ce1 -> 224.101.101.101 UDP D=7009 S=7009 LEN=313

Each system evidently sends out the multicast packets via ce1 every 20s (as it 
should), but only one of them (sent by the 4th system at 13:26:43) is received 
by the other system on ce1. Via ce0, all the multicast traffic reaches the 
systems.
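
To put numbers on this for longer captures, the snoop output can be tallied per sender. This is a sketch only, assuming the summary-line format shown above; the regular expression and the function name `senders` are my own. Note that snoop prints the local host by name and the remote one by address here, so the keys depend on name resolution.

```python
import re
from collections import Counter

# Matches summary lines such as:
# 13:26:43.97907    10.0.0.10 -> 224.101.101.101 UDP D=7009 S=7009 LEN=313
LINE = re.compile(r"^\s*(\S+)\s+(\S+)\s+->\s+(\S+)\s+UDP\b")

# Capture from the 1st system, copied from this post.
FIRST_SYSTEM = """\
13:25:14.75893   buzzie-ce1 -> 224.101.101.101 UDP D=7009 S=7009 LEN=314
13:25:34.80855   buzzie-ce1 -> 224.101.101.101 UDP D=7009 S=7009 LEN=314
13:25:54.85827   buzzie-ce1 -> 224.101.101.101 UDP D=7009 S=7009 LEN=314
13:26:14.90793   buzzie-ce1 -> 224.101.101.101 UDP D=7009 S=7009 LEN=314
13:26:22.18350     10.0.0.5 -> (broadcast)  ARP C Who is 10.0.0.4, buzzie-ce1 ?
13:26:22.18358     10.0.0.5 -> (broadcast)  ARP C Who is 10.0.0.6, 10.0.0.6 ?
13:26:34.95759   buzzie-ce1 -> 224.101.101.101 UDP D=7009 S=7009 LEN=314
13:26:43.97907    10.0.0.10 -> 224.101.101.101 UDP D=7009 S=7009 LEN=313
13:26:55.00762   buzzie-ce1 -> 224.101.101.101 UDP D=7009 S=7009 LEN=314
"""


def senders(capture, group="224.101.101.101"):
    """Count UDP packets addressed to `group`, keyed by source as printed."""
    counts = Counter()
    for line in capture.splitlines():
        m = LINE.match(line)
        # ARP lines do not match the pattern and are skipped.
        if m and m.group(3) == group:
            counts[m.group(2)] += 1
    return counts


for src, n in senders(FIRST_SYSTEM).most_common():
    print(f"{src:>12}  {n}")
```

With four servers each sending every 20s, a healthy capture should show roughly equal counts for all four sources.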

On our perfectly working Sol9 systems, all the multicast traffic can be seen on 
the ce1 interface on all other machines.

Any clues about what's going on here are appreciated...
This message posted from opensolaris.org
_______________________________________________
networking-discuss mailing list