On 07/01/2012 09:52 PM, Darren Reed wrote:
> On 2/07/2012 3:58 AM, Sašo Kiselkov wrote:
>> I did and indeed it fixed the ip_drop_input issue. However, I think
>> something broke because of it: my machine now stops responding to IGMP
>> queries from the router as soon as I switch this code path on. While
>> packets do fan out to multiple cores for IP processing, the upstream
>> router cuts me off after a while because it stops seeing IGMP
>> membership reports from me (the IGMP queries from the router get
>> filtered at input, so my machine never responds to them, and the
>> router concludes that I must have lost interest in the multicast
>> group). I'm still trying to determine exactly why this is
>> happening... I'll try to dtrace around to see where in the IP layer the
>> packets get lost.
> 
> Are you using a production size network here or just
> a small multicast network to see if things work?

This is a production-size network, so I've got multiple gigabits' worth
of multicast traffic at my disposal here :-) It's really useful for
putting networking stacks to the test.

> It might help others work on some of these issues if
> we could build a network setup similar to your own,
> including the tweaks you've made with dladm.
> 
> The goal here is to be able to build a functional (and small)
> test case to verify IGMP group membership is working
> properly so that future changes can be tested for correctness.
> 
> Just as long as we don't need to be running IPTV :)

I sorted the issue out by modifying the changes you proposed last
time. It turns out that mp->b_rptr must be adjusted only when a
higher-level transport protocol is in use (TCP, UDP, SCTP, etc.), as is
done, for example, in mac_rx_srs_proto_fanout, where everything else
(IGMP included) is classified as OTH and left untouched:

                switch (ipha->ipha_protocol) {
                case IPPROTO_TCP:
                        type = V4_TCP;
                        mp->b_rptr += hdrsize;
                        break;
                case IPPROTO_UDP:
                        type = V4_UDP;
                        mp->b_rptr += hdrsize;
                        break;
                default:
                        type = OTH;
                        break;
                }

                FANOUT_ENQUEUE_MP(headmp[type], tailmp[type], cnt[type],
                    bw_ctl, sz[type], sz1, mp);

So I modified mac_rx_srs_long_fanout to adjust b_rptr the same way just
before it returns in the TCP and UDP cases. Since mac_rx_srs_fanout
enqueues the packets immediately after calling mac_rx_srs_long_fanout,
this results in behavior identical to the mac_rx_srs_proto_fanout
snippet above.
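To make the rule concrete, here is a small user-space model of it. It is a sketch under stated assumptions, not the kernel code: sketch_mblk_t, sketch_classify() and the SK_* types are hypothetical stand-ins for mblk_t and the fanout types, and it assumes (as in mac_rx_srs_proto_fanout) that hdrsize is the link-layer header length, that the TCP/UDP delivery path expects b_rptr at the IP header, and that the OTH path strips the link-layer header itself.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* IANA protocol numbers, defined locally so the sketch is self-contained. */
#define	SKETCH_IPPROTO_TCP	6
#define	SKETCH_IPPROTO_UDP	17

/* Offset of the protocol field, and minimum length, of an IPv4 header. */
#define	SKETCH_IPV4_PROTO_OFF	9
#define	SKETCH_IPV4_MIN_HDR	20

/* Hypothetical, stripped-down stand-in for a STREAMS mblk_t. */
typedef struct sketch_mblk {
	unsigned char	*b_rptr;	/* first unread byte */
	unsigned char	*b_wptr;	/* one past the last valid byte */
} sketch_mblk_t;

typedef enum { SK_V4_TCP, SK_V4_UDP, SK_OTH } sketch_type_t;

/*
 * mp->b_rptr points at the link-layer header and hdrsize is that
 * header's length, so the IPv4 header starts at b_rptr + hdrsize.
 * Only TCP and UDP get b_rptr advanced to the IP header; every other
 * protocol (IGMP, ICMP, ...) is classified as OTH with b_rptr untouched.
 */
static sketch_type_t
sketch_classify(sketch_mblk_t *mp, size_t hdrsize)
{
	uint8_t proto;

	assert((size_t)(mp->b_wptr - mp->b_rptr) >=
	    hdrsize + SKETCH_IPV4_MIN_HDR);
	proto = mp->b_rptr[hdrsize + SKETCH_IPV4_PROTO_OFF];

	switch (proto) {
	case SKETCH_IPPROTO_TCP:
		mp->b_rptr += hdrsize;
		return (SK_V4_TCP);
	case SKETCH_IPPROTO_UDP:
		mp->b_rptr += hdrsize;
		return (SK_V4_UDP);
	default:
		return (SK_OTH);
	}
}
```

With an IGMP packet (protocol 2), b_rptr stays on the link-layer header; with UDP it moves forward by hdrsize. Under the assumptions above, advancing it for IGMP as well would make the OTH path strip hdrsize bytes a second time and cut into the IP header, which would plausibly explain the lost IGMP queries.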

To better handle fanout of multicast traffic originating from
professional IRDs and various other hardware appliances (which often
stream everything from a single source address+port combination and
differentiate streams only by destination multicast address), I also
implemented a new mac_fanout_type value, MAC_FANOUT_SRC_DST, which XORs
the source and destination addresses together before computing the hash.
That way, the default src-addr + src-port fanout behavior is unchanged
and the new behavior is selected only if the user asks for it. I've
collected my changes in the attached src_dst.patch file; you'll want to
apply it on top of the 918.patch from Nexenta (which fixes most of the
fanout problems but apparently breaks multicast, at least for me).
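
The effect of MAC_FANOUT_SRC_DST on UDP fanout can be modelled in a few lines of user-space C. This is a sketch, not the patch itself: sketch_hash_addr() and sketch_compute_index() are hypothetical stand-ins for the kernel's HASH_ADDR and COMPUTE_INDEX macros (the real definitions in mac_sched.c may differ in detail, e.g. in byte-order handling), and addresses are taken in host byte order.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for HASH_ADDR(): fold the 32-bit ports word
 * into an address value. Addresses are assumed to be in host byte
 * order, so no ntohl() is applied here.
 */
static uint32_t
sketch_hash_addr(uint32_t addr, uint32_t ports)
{
	return (addr ^ (ports >> 24) ^ (ports >> 16) ^ (ports >> 8) ^ ports);
}

/* Hypothetical stand-in for COMPUTE_INDEX(): pick a soft ring by modulo. */
static unsigned
sketch_compute_index(uint32_t hash, unsigned ring_count)
{
	assert(ring_count > 0);
	return (hash % ring_count);
}

/* Default mode: only the source address (plus ports) feeds the hash. */
static unsigned
sketch_fanout_src(uint32_t src, uint32_t dst, uint32_t ports,
    unsigned ring_count)
{
	(void) dst;	/* ignored in this mode */
	return (sketch_compute_index(sketch_hash_addr(src, ports),
	    ring_count));
}

/* MAC_FANOUT_SRC_DST-style mode: XOR source and destination first. */
static unsigned
sketch_fanout_src_dst(uint32_t src, uint32_t dst, uint32_t ports,
    unsigned ring_count)
{
	return (sketch_compute_index(sketch_hash_addr(src ^ dst, ports),
	    ring_count));
}
```

With one IRD at 10.0.0.1 sending to 239.1.1.1 through 239.1.1.4 from the same port pair (ports word 0x13881388 in this model) over four soft rings, the source-only key sends every group to ring 1, while the src ^ dst key spreads them over rings 0, 3, 2 and 1. XORing the two addresses keeps the key at 32 bits, which is why the patch can reuse HASH_ADDR unchanged.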

It would also be possible to change the code so that it always does
src-dst based fanout for multicast packets. For multicast that really
makes sense, since the machine may at any one time be a member of a
large number of multicast groups, all originating from a single source.
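
A minimal sketch of that variant, assuming the decision is made on the IPv4 destination address: sketch_is_v4_multicast(), sketch_fanout_key() and sketch_pick_ring() are hypothetical helpers, not illumos functions, addresses are in host byte order, and ports are left out of the hash for brevity.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True if a host-byte-order IPv4 address lies in 224.0.0.0/4. */
static bool
sketch_is_v4_multicast(uint32_t addr)
{
	return ((addr & 0xF0000000u) == 0xE0000000u);
}

/*
 * Hypothetical helper: choose the address value fed to the fanout hash.
 * Multicast destinations always use src ^ dst, so groups sent from one
 * source can land on different soft rings; unicast keeps the source key.
 */
static uint32_t
sketch_fanout_key(uint32_t src, uint32_t dst)
{
	return (sketch_is_v4_multicast(dst) ? (src ^ dst) : src);
}

/* Map the key onto one of ring_count soft rings (ports omitted). */
static unsigned
sketch_pick_ring(uint32_t src, uint32_t dst, unsigned ring_count)
{
	assert(ring_count > 0);
	return (sketch_fanout_key(src, dst) % ring_count);
}
```

Unicast traffic keeps the existing source-based spreading, so only flows to multicast groups would change behavior, without the user having to set mac_fanout_type.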

I hope to get this upstreamed into illumos, if the responsible
networking gurus agree with it.

Cheers,
--
Saso



-------------------------------------------
illumos-discuss
Archives: https://www.listbox.com/member/archive/182180/=now
--- old/usr/src/uts/common/io/mac/mac_sched.c	Sun Jul  1 23:53:14 2012
+++ new/usr/src/uts/common/io/mac/mac_sched.c	Sun Jul  1 23:50:38 2012
@@ -516,6 +516,7 @@
 
 #define	MAC_FANOUT_DEFAULT	0
 #define	MAC_FANOUT_RND_ROBIN	1
+#define	MAC_FANOUT_SRC_DST	2
 int mac_fanout_type = MAC_FANOUT_DEFAULT;
 
 #define	MAX_SR_TYPES	3
@@ -778,7 +779,7 @@
 	uint16_t	remlen;
 	uint8_t		nexthdr;
 	uint16_t	hdr_len;
-	uint32_t	src_val;
+	uint32_t	src_val, dst_val;
 	boolean_t	modifiable = B_TRUE, v6;
 
 	ASSERT(MBLKL(mp) >= hdrsize);
@@ -848,6 +849,7 @@
 		remlen = ntohs(ip6h->ip6_plen);
 		nexthdr = ip6h->ip6_nxt;
 		src_val = V4_PART_OF_V6(ip6h->ip6_src);
+		dst_val = V4_PART_OF_V6(ip6h->ip6_dst);
 		/*
 		 * Do src based fanout if below tunable is set to B_TRUE or
 		 * when mac_ip_hdr_length_v6() fails because of malformed
@@ -863,6 +865,7 @@
 		remlen = ntohs(ipha->ipha_length) - hdr_len;
 		nexthdr = ipha->ipha_protocol;
 		src_val = (uint32_t)ipha->ipha_src;
+		dst_val = (uint32_t)ipha->ipha_dst;
 		if (mac_src_ipv4_fanout)
 			goto src_based_fanout;
 	}
@@ -897,6 +900,7 @@
 			*type = OTH;
 		} else {
 			*type = V4_TCP;
+			mp->b_rptr += hdrsize;
 		}
 		break;
 	case IPPROTO_UDP:
@@ -906,6 +910,11 @@
 			hash = HASH_ADDR(src_val, *(uint32_t *)whereptr);
 			*indx = COMPUTE_INDEX(hash,
 			    mac_srs->srs_udp_ring_count);
+		} else if (mac_fanout_type == MAC_FANOUT_SRC_DST) {
+			hash = HASH_ADDR(src_val ^ dst_val,
+			    *(uint32_t *)whereptr);
+			*indx = COMPUTE_INDEX(hash,
+			    mac_srs->srs_udp_ring_count);
 		} else {
 			*indx = mac_srs->srs_ind % mac_srs->srs_udp_ring_count;
 			mac_srs->srs_ind++;
@@ -915,6 +924,7 @@
 			*type = OTH;
 		} else {
 			*type = V4_UDP;
+			mp->b_rptr += hdrsize;
 		}
 		break;
 	}
@@ -921,7 +931,10 @@
 	return (0);
 
 src_based_fanout:
-	hash = HASH_ADDR(src_val, (uint32_t)0);
+	if (mac_fanout_type == MAC_FANOUT_SRC_DST)
+		hash = HASH_ADDR(src_val ^ dst_val, (uint32_t)0);
+	else
+		hash = HASH_ADDR(src_val, (uint32_t)0);
 	*indx = COMPUTE_INDEX(hash, mac_srs->srs_oth_ring_count);
 	*type = OTH;
 	return (0);
@@ -1205,6 +1218,11 @@
 				hash = HASH_ADDR(ipha->ipha_src,
 				    *(uint32_t *)(mp->b_rptr + ports_offset));
 				indx = COMPUTE_INDEX(hash,
+				    mac_srs->srs_udp_ring_count);
+			} else if (mac_fanout_type == MAC_FANOUT_SRC_DST) {
+				hash = HASH_ADDR(ipha->ipha_src ^ ipha->ipha_dst,
+				    *(uint32_t *)(mp->b_rptr + ports_offset));
+				indx = COMPUTE_INDEX(hash,
 				    mac_srs->srs_udp_ring_count);
 			} else {
 				indx = mac_srs->srs_ind %
