I'm in the process of setting up a failover group of SunRay servers (SRSS 3.1)
on Solaris 10 1/06 (sparc) on a couple of V440s. While debugging why load
balancing (and utswitch(1)) doesn't work as expected for me, I stumbled
across some kind of "multicast problem" which leads me to assume that this is
probably a Solaris-related issue:
By default, the SunRay software sends out keep-alive messages to the multicast
group 224.101.101.101 via both configured interfaces (ce0: LAN, ce1: dedicated
SunRay interconnect). The strange thing is that *almost* none of these packets
seem to reach the other machines of the failover group via their ce1
interfaces; they arrive only via ce0. Since the SunRay software seems to listen
for these packets only on ce1, all the other servers are incorrectly marked
down...
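To check independently of the SunRay software whether the group traffic shows
up at all, a small listener can join 224.101.101.101 on one specific local
address. The following is only a minimal sketch and not what SRSS itself does:
the UDP port 7009 is taken from the snoop traces further down, the function
names are made up for illustration, and SO_REUSEADDR is assumed to be enough to
share the port with an already running SunRay daemon.

```python
import socket

GROUP = "224.101.101.101"   # multicast group used for the keep-alives
PORT = 7009                 # assumption: UDP port as seen in the snoop traces


def build_mreq(group, iface_addr):
    """Pack a struct ip_mreq: group address followed by the local interface address."""
    return socket.inet_aton(group) + socket.inet_aton(iface_addr)


def open_listener(iface_addr, group=GROUP, port=PORT):
    """Join `group` on the interface that owns `iface_addr` and return the socket."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    # Allow sharing the port with another process already bound to it.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(("", port))
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP,
                    build_mreq(group, iface_addr))
    return sock


def listen(iface_addr, count=5):
    """Print source and length of the next `count` datagrams received."""
    sock = open_listener(iface_addr)
    try:
        for _ in range(count):
            data, (src, sport) = sock.recvfrom(2048)
            print("%s:%d -> %s len=%d" % (src, sport, GROUP, len(data)))
    finally:
        sock.close()
```

Calling listen("10.0.0.4") on the 1st machine would join the group via ce1.
Because the socket is bound to the wildcard address, datagrams arriving via ce0
may be delivered to it as well; the printed source address (10.0.0.x versus
129.70.160.x) tells the two networks apart.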
I have no such problems with our production systems, which are basically the
same hardware setup (but running Solaris 9 12/03 + patches); there, snoop(1M)
does indeed show the multicast traffic arriving via ce1.
Does anyone have an idea what's going wrong here? Did I overlook something, or
is there a bug somewhere in the multicast handling?
Our setup:
4 identical V440 running Solaris 10 1/06 (with 119578-11, 118822-26) and SRSS
3.1
1st machine:
- ce0: 129.70.160.101/24
- ce1: 10.0.0.4/8
2nd machine:
- ce0: 129.70.160.102/24
- ce1: 10.0.0.5/8
3rd machine:
- ce0: 129.70.160.103/24
- ce1: 10.0.0.6/8
4th machine:
- ce0: 129.70.160.105/24
- ce1: 10.0.0.10/8
All the ce1 interfaces are connected to the same Cisco switch and belong to the
same VLAN (no router or inter-switch connections involved). Please note that
our working Sol9 production systems are connected to the same switch but use a
different VLAN.
"netstat -gn" on all 4 Sol10 machines shows the correct subscription for the
multicast group:
Group Memberships: IPv4
Interface Group                RefCnt
--------- -------------------- ------
lo0       224.0.0.1                 1
ce0       224.101.101.101           1
ce0       224.0.0.1                 1
ce1       224.101.101.101           1
ce1       224.0.0.1                 1
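When comparing these memberships with what the switch or the Ethernet headers
in "snoop -v" show, it helps to know the link-layer address the group maps to.
The sketch below applies the standard IPv4 multicast mapping (01:00:5e followed
by the low 23 bits of the group address, as described in RFC 1112); the
function name is made up, and whether the switch plays any role in this problem
is an open question, not something established here.

```python
import socket
import struct


def multicast_mac(group):
    """Map an IPv4 multicast group to its Ethernet MAC: 01:00:5e + low 23 bits."""
    (addr,) = struct.unpack("!I", socket.inet_aton(group))
    if addr >> 28 != 0xE:
        raise ValueError("not an IPv4 multicast address: %s" % group)
    low23 = addr & 0x7FFFFF
    octets = (0x01, 0x00, 0x5E,
              (low23 >> 16) & 0xFF, (low23 >> 8) & 0xFF, low23 & 0xFF)
    return ":".join("%02x" % o for o in octets)


print(multicast_mac("224.101.101.101"))  # 01:00:5e:65:65:65
```

So frames for the keep-alive group should carry the destination MAC
01:00:5e:65:65:65, while the all-hosts group 224.0.0.1 maps to
01:00:5e:00:00:01.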
Snooping simultaneously on ce1 of two of these machines usually looks like the
following ("snoop -ta -d ce1 multicast"):
1st system:
13:25:14.75893 buzzie-ce1 -> 224.101.101.101 UDP D=7009 S=7009 LEN=314
13:25:34.80855 buzzie-ce1 -> 224.101.101.101 UDP D=7009 S=7009 LEN=314
13:25:54.85827 buzzie-ce1 -> 224.101.101.101 UDP D=7009 S=7009 LEN=314
13:26:14.90793 buzzie-ce1 -> 224.101.101.101 UDP D=7009 S=7009 LEN=314
13:26:22.18350 10.0.0.5 -> (broadcast) ARP C Who is 10.0.0.4, buzzie-ce1 ?
13:26:22.18358 10.0.0.5 -> (broadcast) ARP C Who is 10.0.0.6, 10.0.0.6 ?
13:26:34.95759 buzzie-ce1 -> 224.101.101.101 UDP D=7009 S=7009 LEN=314
13:26:43.97907 10.0.0.10 -> 224.101.101.101 UDP D=7009 S=7009 LEN=313
13:26:55.00762 buzzie-ce1 -> 224.101.101.101 UDP D=7009 S=7009 LEN=314
4th system:
13:25:3.72962 flaps-ce1 -> 224.101.101.101 UDP D=7009 S=7009 LEN=313
13:25:23.77974 flaps-ce1 -> 224.101.101.101 UDP D=7009 S=7009 LEN=313
13:25:43.82956 flaps-ce1 -> 224.101.101.101 UDP D=7009 S=7009 LEN=313
13:26:3.87939 flaps-ce1 -> 224.101.101.101 UDP D=7009 S=7009 LEN=313
13:26:22.18358 10.0.0.5 -> (broadcast) ARP C Who is 10.0.0.4, 10.0.0.4 ?
13:26:22.18368 10.0.0.5 -> (broadcast) ARP C Who is 10.0.0.6, 10.0.0.6 ?
13:26:23.92929 flaps-ce1 -> 224.101.101.101 UDP D=7009 S=7009 LEN=313
13:26:43.97910 flaps-ce1 -> 224.101.101.101 UDP D=7009 S=7009 LEN=313
Each system evidently sends its multicast packets via ce1 every 20s (as it
should), but in this window only one of them (sent by the 4th system at
13:26:43) is received by the other system via ce1. Via ce0, all the multicast
traffic reaches the systems.
On our perfectly working Sol9 systems, all the multicast traffic can be seen on
the ce1 interface on all other machines.
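To make this kind of comparison less error-prone than reading the traces by
eye, the snoop output can be summarized per sender. This is a rough sketch
assuming Python 3 and the "snoop -ta" line format exactly as pasted above; the
function names are hypothetical, and the sample data is simply the trace from
the 1st system.

```python
import re

# Matches e.g. "13:25:14.75893 buzzie-ce1 -> 224.101.101.101 UDP ..."
LINE = re.compile(r"^(\d+):(\d+):(\d+\.\d+)\s+(\S+)\s+->\s+(\S+)\s+UDP\b")


def parse(trace):
    """Yield (seconds_since_midnight, source) for each UDP line of the trace."""
    for line in trace.splitlines():
        m = LINE.match(line.strip())
        if m:
            h, mins, secs, src, _dst = m.groups()
            yield int(h) * 3600 + int(mins) * 60 + float(secs), src


def summarize(trace):
    """Return {source: [gaps in seconds between consecutive packets]}."""
    last, gaps = {}, {}
    for t, src in parse(trace):
        gaps.setdefault(src, [])
        if src in last:
            gaps[src].append(round(t - last[src], 2))
        last[src] = t
    return gaps


FIRST_SYSTEM = """\
13:25:14.75893 buzzie-ce1 -> 224.101.101.101 UDP D=7009 S=7009 LEN=314
13:25:34.80855 buzzie-ce1 -> 224.101.101.101 UDP D=7009 S=7009 LEN=314
13:25:54.85827 buzzie-ce1 -> 224.101.101.101 UDP D=7009 S=7009 LEN=314
13:26:14.90793 buzzie-ce1 -> 224.101.101.101 UDP D=7009 S=7009 LEN=314
13:26:22.18350 10.0.0.5 -> (broadcast) ARP C Who is 10.0.0.4, buzzie-ce1 ?
13:26:22.18358 10.0.0.5 -> (broadcast) ARP C Who is 10.0.0.6, 10.0.0.6 ?
13:26:34.95759 buzzie-ce1 -> 224.101.101.101 UDP D=7009 S=7009 LEN=314
13:26:43.97907 10.0.0.10 -> 224.101.101.101 UDP D=7009 S=7009 LEN=313
13:26:55.00762 buzzie-ce1 -> 224.101.101.101 UDP D=7009 S=7009 LEN=314
"""

# One line per sender: name, number of packets seen, gaps between them.
for src, g in sorted(summarize(FIRST_SYSTEM).items()):
    print(src, len(g) + 1, g)
```

For the trace from the 1st system this reports six packets from buzzie-ce1
spaced roughly 20.05s apart and a single packet from 10.0.0.10; the ARP lines
are skipped. In a healthy group every other member should show up with the
same regular spacing.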
Any clues as to what's going on here are appreciated...
This message posted from opensolaris.org
networking-discuss mailing list