On Wed, Feb 06, 2019 at 11:49:28AM +0200, Maged Mokhtar wrote:
> It could be used for sending cluster maps or other configuration in a
> push model; I believe corosync uses this by default. For use in sending
> actual data during write ops, a primary OSD can send to its replicas;
> they do not have to process all traffic but can listen on a specific group
> address associated with that PG, which could be an increment from a
> defined base multicast address. Some additional erasure codes and
> acknowledgment messages need to be added to account for errors/dropped
> packets.
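
A minimal sketch of the per-PG group addressing idea quoted above, assuming
the base address lives in the administratively scoped 239.0.0.0/8 block and
that each PG is identified by a small non-negative integer. The function name
pg_group_address and the base 239.1.0.0 are hypothetical, chosen only for
illustration; nothing like this exists in Ceph today.

```python
import ipaddress

# Assumption: all per-PG groups must stay inside this block.
ALLOWED_BLOCK = ipaddress.ip_network("239.0.0.0/8")


def pg_group_address(base: str, pg_index: int) -> ipaddress.IPv4Address:
    """Return base + pg_index as a multicast group address (hypothetical scheme)."""
    if pg_index < 0:
        raise ValueError("pg_index must be non-negative")
    addr = ipaddress.IPv4Address(base) + pg_index
    if addr not in ALLOWED_BLOCK:
        raise ValueError(f"{addr} falls outside {ALLOWED_BLOCK}")
    return addr


# PG 300 with a base of 239.1.0.0 lands 300 addresses above the base.
print(pg_group_address("239.1.0.0", 300))
```

Replicas would then join only the groups for the PGs they hold, which is
what makes the number of distinct groups the interesting quantity.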

> I doubt it will give an appreciable boost given most pools use 3
> replicas in total. Additionally, there could be issues getting multicast
> working correctly, like setting up IGMP, so all in all it could be a
> hassle.
A separate concern is that there are too many combinations of OSDs
relative to the multicast limits in switchgear. As a quick math test case:
3 replicas with 512 OSDs, split over 32 hosts, gives ~30k unique
host combinations (32 * 31 * 30 = 29,760 if host order matters).
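
As a rough check of that figure, here is a small sketch that counts host
combinations three ways. Which count applies depends on how groups would be
assigned, which is an assumption on my part: the ~30k number matches ordered
triples, where the primary and each replica slot are distinguished.

```python
from math import comb, perm

HOSTS = 32      # hosts in the example cluster
REPLICAS = 3    # replicas per placement group

# Ordered triples: every position (primary, 2nd, 3rd) is distinguished.
ordered = perm(HOSTS, REPLICAS)                            # 32 * 31 * 30

# Unordered sets of three hosts, ignoring which one is primary.
unordered = comb(HOSTS, REPLICAS)

# One sending primary plus an unordered pair of receiving replicas.
primary_plus_pair = HOSTS * comb(HOSTS - 1, REPLICAS - 1)

print(ordered, unordered, primary_plus_pair)  # 29760 4960 14880
```

Even the most favourable counting leaves thousands of groups for a fairly
small cluster, and the count grows roughly with the cube of the host count.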

At the IPv4 protocol layer, this does fit into the 232/8 network for SSM
scope or the 239/8 LSA scope; each of those holds 16.7M multicast addresses.

On the switchgear side, even on the big Cisco gear, the limits are far
lower: 32K.
| Output interface lists are stored in the multicast expansion table
| (MET). The MET has room for up to 32,000 output interface lists.  The
| MET resources are shared by both Layer 3 multicast routes and by Layer 2
| multicast entries. The actual number of output interface lists available
| in hardware depends on the specific configuration. If the total number
| of multicast routes exceed 32,000, multicast packets might not be
| switched by the Integrated Switching Engine. They would be forwarded by
| the CPU subsystem at much slower speeds.
Older switchgear was even lower :-(.
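
To put the address space and the hardware table side by side, a short
sketch follows. It assumes one multicast group per ordered host combination
and one MET entry per group; both are simplifications of mine, and the
helper max_hosts is hypothetical.

```python
import ipaddress
from math import perm

MET_LIMIT = 32_000   # output interface lists, from the vendor text quoted above
REPLICAS = 3


def max_hosts(limit: int, replicas: int) -> int:
    """Largest host count whose ordered replica combinations fit in limit."""
    n = replicas
    while perm(n + 1, replicas) <= limit:
        n += 1
    return n


# Address space is not the bottleneck: each /8 holds 2**24 group addresses.
for block in ("232.0.0.0/8", "239.0.0.0/8"):
    print(block, ipaddress.ip_network(block).num_addresses)

# The hardware table is: the 32-host example already uses most of it.
groups_needed = perm(32, REPLICAS)
print(groups_needed, round(groups_needed / MET_LIMIT, 2))
print(max_hosts(MET_LIMIT, REPLICAS))
```

Under those assumptions, 32 hosts is the largest cluster that fits, and that
is before any other Layer 2 or Layer 3 multicast entries share the table.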

This would also be a switch from TCP to UDP, and a redesign of other
pieces, including CephX security.

I'm not convinced of the overall gain at this scale for actual data.
For heartbeat and other cluster-wide stuff, yes, I do agree that
multicast might have benefits.

Robin Hugh Johnson
Gentoo Linux: Dev, Infra Lead, Foundation Treasurer
E-Mail   : robb...@gentoo.org
GnuPG FP : 11ACBA4F 4778E3F6 E4EDF38E B27B944E 34884E85
GnuPG FP : 7D0B3CEB E9B85B1F 825BCECF EE05E6F6 A48F6136


ceph-users mailing list
