[RFC][PATCH 2/3] TCP/IP Critical socket communication mechanism
When the 'system_in_emergency' flag is set, drop any incoming packets that
belong to non-critical sockets as soon as we can determine the destination
socket. This is necessary to prevent incoming non-critical packets from
consuming memory from the critical page pool.

 include/net/sock.h  | 14 ++++++++++++++
 net/dccp/ipv4.c     |  4 ++++
 net/ipv4/tcp_ipv4.c |  3 +++
 net/ipv4/udp.c      |  9 ++++++++-
 net/ipv6/tcp_ipv6.c |  3 +++
 net/sctp/input.c    |  3 +++
 6 files changed, 35 insertions(+), 1 deletions(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index 982b4ec..8de8a8b 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -1391,4 +1391,18 @@ extern int sysctl_optmem_max;
 extern __u32 sysctl_wmem_default;
 extern __u32 sysctl_rmem_default;
 
+extern int system_in_emergency;
+
+static inline int emergency_check(struct sock *sk, struct sk_buff *skb)
+{
+	if (system_in_emergency && !(sk->sk_allocation & __GFP_CRITICAL)) {
+		if (net_ratelimit())
+			printk("discarding skb:%p len:%d sk:%p protocol:%d\n",
+			       skb, skb->len, sk, sk->sk_protocol);
+		return 0;
+	}
+
+	return 1;
+}
+
 #endif	/* _SOCK_H */

diff --git a/net/dccp/ipv4.c b/net/dccp/ipv4.c
index ca03521..405cdf8 100644
--- a/net/dccp/ipv4.c
+++ b/net/dccp/ipv4.c
@@ -1130,6 +1130,10 @@ int dccp_v4_rcv(struct sk_buff *skb)
 		goto no_dccp_socket;
 	}
 
+	if (!emergency_check(sk, skb)) {
+		goto discard_and_relse;
+	}
+
 	/*
 	 * Step 2:
 	 * ...
 	 *	or S.state == TIMEWAIT,

diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 4d5021e..acfb9a1 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1232,6 +1232,9 @@ int tcp_v4_rcv(struct sk_buff *skb)
 	if (!sk)
 		goto no_tcp_socket;
 
+	if (!emergency_check(sk, skb))
+		goto discard_and_relse;
+
 process:
 	if (sk->sk_state == TCP_TIME_WAIT)
 		goto do_time_wait;

diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 2422a5f..f79cbfd 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -1150,7 +1150,14 @@ int udp_rcv(struct sk_buff *skb)
 	sk = udp_v4_lookup(saddr, uh->source, daddr, uh->dest, skb->dev->ifindex);
 
 	if (sk != NULL) {
-		int ret = udp_queue_rcv_skb(sk, skb);
+		int ret;
+
+		if (!emergency_check(sk, skb)) {
+			sock_put(sk);
+			goto drop;
+		} else
+			ret = udp_queue_rcv_skb(sk, skb);
+
 		sock_put(sk);
 
 		/* a return value > 0 means to resubmit the input, but

diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index 62c0e5b..d017181 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -1592,6 +1592,9 @@ static int tcp_v6_rcv(struct sk_buff **p
 	if (!sk)
 		goto no_tcp_socket;
 
+	if (!emergency_check(sk, skb))
+		goto discard_and_relse;
+
 process:
 	if (sk->sk_state == TCP_TIME_WAIT)
 		goto do_time_wait;

diff --git a/net/sctp/input.c b/net/sctp/input.c
index b24ff2c..553365b 100644
--- a/net/sctp/input.c
+++ b/net/sctp/input.c
@@ -181,6 +181,9 @@ int sctp_rcv(struct sk_buff *skb)
 	rcvr = asoc ? &asoc->base : &ep->base;
 	sk = rcvr->sk;
 
+	if (!emergency_check(sk, skb))
+		goto discard_it;
+
 	/*
 	 * If a frame arrives on an interface and the receiving socket is
 	 * bound to another interface, via SO_BINDTODEVICE, treat it as OOTB
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
[RFC][PATCH 1/3] TCP/IP Critical socket communication mechanism
Introduce a new socket option SO_CRITICAL to mark a socket as critical.
This socket option takes an integer boolean flag that can be set using
setsockopt() and read with getsockopt().

 include/asm-i386/socket.h    |  2 ++
 include/asm-powerpc/socket.h |  2 ++
 net/core/sock.c              | 13 ++++++++++++-
 3 files changed, 16 insertions(+), 1 deletions(-)

diff --git a/include/asm-i386/socket.h b/include/asm-i386/socket.h
index 802ae76..bd4ce8e 100644
--- a/include/asm-i386/socket.h
+++ b/include/asm-i386/socket.h
@@ -49,4 +49,6 @@
 
 #define SO_PEERSEC		31
 
+#define SO_CRITICAL		100
+
 #endif /* _ASM_SOCKET_H */

diff --git a/include/asm-powerpc/socket.h b/include/asm-powerpc/socket.h
index e4b8177..6cfb79a 100644
--- a/include/asm-powerpc/socket.h
+++ b/include/asm-powerpc/socket.h
@@ -56,4 +56,6 @@
 
 #define SO_PEERSEC		31
 
+#define SO_CRITICAL		100
+
 #endif	/* _ASM_POWERPC_SOCKET_H */

diff --git a/net/core/sock.c b/net/core/sock.c
index 13cc3be..d2d10cb 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -456,6 +456,13 @@ set_rcvbuf:
 			ret = -ENONET;
 			break;
 
+		case SO_CRITICAL:
+			if (valbool)
+				sk->sk_allocation |= __GFP_CRITICAL;
+			else
+				sk->sk_allocation &= ~__GFP_CRITICAL;
+			break;
+
 		/* We implement the SO_SNDLOWAT etc to not be settable
 		 * (1003.1g 5.3) */
 		default:
@@ -616,7 +623,11 @@ int sock_getsockopt(struct socket *sock,
 		case SO_PEERSEC:
 			return security_socket_getpeersec(sock, optval, optlen, len);
-
+
+		case SO_CRITICAL:
+			v.val = ((sk->sk_allocation & __GFP_CRITICAL) != 0);
+			break;
+
 		default:
 			return(-ENOPROTOOPT);
 	}
[RFC][PATCH 0/3] TCP/IP Critical socket communication mechanism
This set of patches provides a TCP/IP emergency communication mechanism
that can be used to guarantee that high priority communications over a
critical socket succeed even under very low memory conditions that last
for a couple of minutes. It uses the critical page pool facility provided
by Matt's patches that he posted recently on lkml.
   http://lkml.org/lkml/2005/12/14/34/index.html

This mechanism provides a new socket option SO_CRITICAL that can be used
to mark a socket as critical. A critical connection used for emergency
communications has to be established and marked as critical before we
enter the emergency condition.

It uses the __GFP_CRITICAL flag introduced in the critical page pool
patches to mark an allocation request as critical, meaning it should be
satisfied from the critical page pool if required. In the send path, this
flag is passed with all allocation requests that are made for a critical
socket. But in the receive path we do not know if a packet is critical or
not until we receive it and find the socket that it is destined to. So we
treat all the allocation requests in the receive path as critical.

The critical page pool patches also introduce a global flag
'system_in_emergency' that is used to indicate an emergency situation
(which could be a low memory condition). When this flag is set, any
incoming packets that belong to non-critical sockets are dropped as soon
as possible in the receive path. This is necessary to prevent incoming
non-critical packets from consuming memory from the critical page pool.

I would appreciate any feedback or comments on this approach.

Thanks
Sridhar
Re: [RFC][PATCH 0/3] TCP/IP Critical socket communication mechanism
> I would appreciate any feedback or comments on this approach.

Maybe I'm missing something, but wouldn't you need an own critical pool
(or at least a reservation) for each socket to be safe against deadlocks?
Otherwise, if a critical socket needs e.g. 2 pages to finish something
and 2 critical sockets are active, they can each steal the last pages
from each other and deadlock.

-Andi
Re: [RFC][PATCH 3/3] TCP/IP Critical socket communication mechanism
On Mer, 2005-12-14 at 01:12 -0800, Sridhar Samudrala wrote:
> Pass __GFP_CRITICAL flag with all allocation requests that are critical.
> - All allocations needed to process incoming packets are marked as
>   CRITICAL. This includes the allocations
>   - made by the driver to receive incoming packets
>   - to process and send ARP packets
>   - to create new routes for incoming packets

But your user space that would add the routes is not so protected, so I'm
not sure this is actually a solution, more of an extended fudge. In which
case I'm not clear why it is any better than the current GFP_ATOMIC
approach.

> +#define SK_CRIT_ALLOC(sk, flags) ((sk->sk_allocation & __GFP_CRITICAL) | flags)

Lots of hidden conditional logic on critical paths. Also sk should be in
brackets so that the macro evaluation order is defined, as should flags.

> +#define CRIT_ALLOC(flags) (__GFP_CRITICAL | flags)

Pointless obfuscation.
Re: [PATCH] cubic: pre-compute based on parameters
David S. Miller wrote:
> From: Stephen Hemminger [EMAIL PROTECTED]
> Date: Mon, 12 Dec 2005 12:03:22 -0800
>
> > -	d32 = d32 / HZ;
> > -	/* (wmax-cwnd) * (srtt<<3 / HZ) / c * 2^(3*bictcp_HZ) */
> > -	d64 = (d64 * dist * d32) >> (count + 3 - BICTCP_HZ);
> > -
> > -	/* cubic root */
> > -	d64 = cubic_root(d64);
> > -
> > -	result = (u32)d64;
> > -	return result;
> > +	return cubic_root((cube_factor * dist) >> (cube_scale + 3 - BICTCP_HZ));
> ...
> > +	while (!(d32 & 0x8000) && (cube_scale < BICTCP_HZ)) {
> > +		d32 = d32 << 1;
> > +		++cube_scale;
> > +	}
> > +	cube_factor = d64 * d32 / HZ;
>
> I don't think this transformation is equivalent. In the old code only
> the d32 is scaled by HZ. So in the old code we're saying something like:
>
>	d64 = (d64 * dist * (d32 / HZ)) >> (count + 3 - BICTCP_HZ);
>
> whereas the new code looks like:
>
>	d64 = (((d64 * d32) / HZ) * dist) >> (count + 3 - BICTCP_HZ);
>
> Is that really equivalent?

Almost. It depends on how large the numbers in d64 and d32 are; if their
multiplication may overflow, then the first option is better since it has
less of a chance to overflow. On the other hand, the second line can be
more accurate.

Baruch
Re: [RFC][PATCH 3/3] TCP/IP Critical socket communication mechanism
Alan Cox wrote:
> But your user space that would add the routes is not so protected so I'm
> not sure this is actually a solution, more of an extended fudge.

Yes, there's no 100% solution -- no matter how much memory you reserve and
how many paths you protect, if you try hard enough you can come up with
cases where it'll fail. (I'm swapping to NFS across a tun/tap interface to
a custom userland SSL tunnel to a server across a BGP route...)

However, if the 'extended fudge' pushes a problem from "can happen, even
in a very normal setup" territory to "only happens if you're doing
something pretty weird", then is it really such a bad thing? I think the
cost in code complexity looks pretty reasonable.

> > +#define SK_CRIT_ALLOC(sk, flags) ((sk->sk_allocation & __GFP_CRITICAL) | flags)
>
> Lots of hidden conditional logic on critical paths.

How expensive is it compared to the allocation itself?

> > +#define CRIT_ALLOC(flags) (__GFP_CRITICAL | flags)
>
> Pointless obfuscation

Fully agree.

-Mitch
Re: Fw: 2.6.15-rc5 gre tunnel checksum error
On Tue, Dec 13, 2005 at 06:30:38AM +0000, Paul Erkkila wrote:
> GRE tunnel.
> ip tunnel: tunnel0: gre/ip remote xx.xx.xx.xx local xx.xx.xx.xx ttl 255 key xx.xx.xx.xx
> Checksum in received packet is required. Checksum output packets.

Thanks. It turns out to be a bug in the GRE layer. I added that bug when
I introduced skb_postpull_rcsum.

[GRE]: Fix hardware checksum modification

The skb_postpull_rcsum introduced a bug to the checksum modification.
Although the length pulled is offset bytes, the origin of the pulling is
the GRE header, not the IP header.

Signed-off-by: Herbert Xu [EMAIL PROTECTED]

Dave, please apply this if this works for Paul.

Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmVHI~} [EMAIL PROTECTED]
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

diff --git a/net/ipv4/ip_gre.c b/net/ipv4/ip_gre.c
--- a/net/ipv4/ip_gre.c
+++ b/net/ipv4/ip_gre.c
@@ -618,7 +618,7 @@ static int ipgre_rcv(struct sk_buff *skb
 		skb->mac.raw = skb->nh.raw;
 		skb->nh.raw = __pskb_pull(skb, offset);
-		skb_postpull_rcsum(skb, skb->mac.raw, offset);
+		skb_postpull_rcsum(skb, skb->h.raw, offset);
 		memset(&(IPCB(skb)->opt), 0, sizeof(struct ip_options));
 		skb->pkt_type = PACKET_HOST;
 #ifdef CONFIG_NET_IPGRE_BROADCAST
Re: Fw: 2.6.15-rc5 gre tunnel checksum error
Herbert Xu wrote:
> Thanks. It turns out to be a bug in the GRE layer. I added that bug
> when I introduced skb_postpull_rcsum.
>
> [GRE]: Fix hardware checksum modification
>
> The skb_postpull_rcsum introduced a bug to the checksum modification.
> Although the length pulled is offset bytes, the origin of the pulling
> is the GRE header, not the IP header.
>
> Signed-off-by: Herbert Xu [EMAIL PROTECTED]
>
> Dave, please apply this if this works for Paul.
>
> Cheers,

Works fine here. Thanks =).

-pee
Re: [RFC] ip / ifconfig redesign
Bernd Eckenfels wrote:
> Al Boldi wrote:
> > The current ip / ifconfig configuration is arcane and inflexible. The
> > reason being that they are based on design principles inherited from
> > the last century.
>
> Yes I agree, however note that some of the assumptions are backed up
> and required by RFCs. For example the binding of addresses to
> interfaces. This is especially strongly required in the IPv6 world with
> all the scoping and renumbering RFCs.

Can you point me to those RFCs? Thanks!

> The things you want to change need to be changed in kernel space, btw.

True. I mentioned ip / ifconfig not to imply that they are the culprit,
but instead to expose the underlying kernel implementation. This does not
mean, though, that ip / ifconfig cannot offer an emulated OSI-compliant
mode, which would be an impetus to change the underlying implementation.

Thanks!

--
Al
Specs for Tulip3
Hello,

I've been reading the source code for the tg3 module (Broadcom Tigon3
Ethernet card) in the Linux kernel. Specifically, I need to access the
NIC-specific statistics, since I have to measure the performance of a
server under heavy network loads. Although the statistics exported with
ethtool are quite self-explanatory, I would like to understand in depth
the meaning of each variable. I guess those are described in the NIC
specs, but I wasn't able to find them on the Web (even on the Broadcom
web page).

How can I find the specs for the Tulip3 NIC?

Thank you
Regards
Aritz
Re: Resend [PATCH netdev-2.6 2/8] e1000: Performance Enhancements
jamal writes:
> Essentially the approach would be the same as Robert's old recycle
> patch where he doesn't recycle certain skbs - the only difference being
> in the case of forwarding, the recycle is done asynchronously at EOT
> whereas this is done synchronously upon return from the host path. The
> beauty of the approach is you don't have to worry about recycling on
> the wrong CPU ;-) (which has been a big problem for the forwarding
> path)

I have to chime in and say for the host stack - I like it ;-)

No, we don't solve any problems for forwarding, but as Dave pointed out
we can do nice things. Instead of dropping the skb in case of failures or
netfilter etc. we can reuse the skb, and if the skb is consumed within
the RX softirq we can just return it to the driver. You did the feedback
mechanism NET_RX_XX stuff six years ago. Now it can possibly be used :)

A practical problem is how to maintain compatibility with the current
behavior, which defaults to NET_RX_SKB_CONSUMED. A new driver entry
point? And can we increase skb->users to delay skb destruction until the
driver gets the indication back? So the driver will do the final kfree,
and not the protocol layers as now? This to avoid massive code changes.

Thoughts?

Cheers
--ro
Re: IPSEC tunnel: more than two networks?
Michael Tokarev wrote:
[..]
> So the question is: is the setup like this one supposed to work at all
> in linux? I know there are other less ugly ways to achieve the same
> effect, e.g. by using GRE/IPIP tunnels and encapsulating the traffic
> into IPSEC (this way, we'll have only one transport-mode IPSEC
> connection and normal interfaces to route traffic to/via), but this is
> NOT how the whole infrastructure in their network is implemented - they
> - it seems, for whatever reason - [...] use separate tunnels to route
> each network.

Yes, that's how I did it, too. It works perfectly to tunnel each network
segment separately. Simple routing is not enough.

Don't forget to mention your tunneled networks in the FORWARD chain, if
your ipsec gateway is also your firewall.

I implemented the separate tunnels via racoon and racoon-tool from the
latest Debian sarge. Connectivity to a Cisco PIX was possible that way.

Regards

Ingo Oeser
Re: Specs for Tulip3
On Wed, 2005-12-14 at 17:56 +0100, Aritz Bastida wrote:
> How can I find the specs for the Tulip3 NIC?

Most of the statistics counters follow the MIB definitions in the RFCs.
There are a few that are non-standard but should be self-explanatory.
Send me an email if you need more information on some of the counters.
Re: [RFC][PATCH 0/3] TCP/IP Critical socket communication mechanism
On Wed, 2005-12-14 at 10:22 +0100, Andi Kleen wrote:
> > I would appreciate any feedback or comments on this approach.
>
> Maybe I'm missing something but wouldn't you need an own critical pool
> (or at least reservation) for each socket to be safe against deadlocks?
>
> Otherwise if a critical socket needs e.g. 2 pages to finish something
> and 2 critical sockets are active they can each steal the last pages
> from each other and deadlock.

Here we are assuming that the pre-allocated critical page pool is big
enough to satisfy the requirements of all the critical sockets.

In the current critical page pool implementation, there is also a
limitation that only order-0 allocations (single page) are supported. I
think in the networking send/receive path, the only place where
multi-page allocs are requested is in the drivers, if the MTU >
PAGE_SIZE. But I guess the drivers are getting updated to avoid such
higher-order allocations.

Also, during the emergency we free the memory allocated for non-critical
packets as quickly as possible so that it can be re-used for critical
allocations.

Thanks
Sridhar
Re: [RFC][PATCH 3/3] TCP/IP Critical socket communication mechanism
On Wed, 2005-12-14 at 11:17 +0000, Alan Cox wrote:
> On Mer, 2005-12-14 at 01:12 -0800, Sridhar Samudrala wrote:
> > Pass __GFP_CRITICAL flag with all allocation requests that are
> > critical.
> > - All allocations needed to process incoming packets are marked as
> >   CRITICAL. This includes the allocations
> >   - made by the driver to receive incoming packets
> >   - to process and send ARP packets
> >   - to create new routes for incoming packets
>
> But your user space that would add the routes is not so protected so
> I'm not sure this is actually a solution, more of an extended fudge. In
> which case I'm not clear why it is any better than the current
> GFP_ATOMIC approach.

I am not referring to routes that are added by user-space, but to the
allocations needed for cached routes stored in skb->dst in the
ip_route_input() path.

> > +#define SK_CRIT_ALLOC(sk, flags) ((sk->sk_allocation & __GFP_CRITICAL) | flags)
>
> Lots of hidden conditional logic on critical paths. Also sk should be
> in brackets so that the macro evaluation order is defined, as should
> flags
>
> > +#define CRIT_ALLOC(flags) (__GFP_CRITICAL | flags)
>
> Pointless obfuscation

The only reason I made these macros is that I would expect this to be a
compile-time configurable option, so that there is zero overhead for
regular users.

#ifdef CONFIG_CRIT_SOCKET
#define SK_CRIT_ALLOC(sk, flags) ((sk->sk_allocation & __GFP_CRITICAL) | flags)
#define CRIT_ALLOC(flags)        (__GFP_CRITICAL | flags)
#else
#define SK_CRIT_ALLOC(sk, flags) flags
#define CRIT_ALLOC(flags)        flags
#endif

Thanks
Sridhar
Re: [RFC][PATCH 3/3] TCP/IP Critical socket communication mechanism
Sridhar Samudrala wrote:
> The only reason I made these macros is that I would expect this to be a
> compile-time configurable option, so that there is zero overhead for
> regular users.
>
> #ifdef CONFIG_CRIT_SOCKET
> #define SK_CRIT_ALLOC(sk, flags) ((sk->sk_allocation & __GFP_CRITICAL) | flags)
> #define CRIT_ALLOC(flags)        (__GFP_CRITICAL | flags)
> #else
> #define SK_CRIT_ALLOC(sk, flags) flags
> #define CRIT_ALLOC(flags)        flags
> #endif

Oh, that's much simpler to achieve:

#ifdef CONFIG_CRIT_SOCKET
#define __GFP_CRITICAL_SOCKET __GFP_CRITICAL
#else
#define __GFP_CRITICAL_SOCKET 0
#endif

Maybe we can get better naming here, but you get the point, I think.

Regards

Ingo Oeser
Re: [RFC][PATCH 3/3] TCP/IP Critical socket communication mechanism
On Wed, 2005-12-14 at 04:12 -0800, Mitchell Blank Jr wrote:
> Alan Cox wrote:
> > But your user space that would add the routes is not so protected so
> > I'm not sure this is actually a solution, more of an extended fudge.
>
> Yes, there's no 100% solution -- no matter how much memory you reserve
> and how many paths you protect, if you try hard enough you can come up
> with cases where it'll fail. (I'm swapping to NFS across a tun/tap
> interface to a custom userland SSL tunnel to a server across a BGP
> route...)
>
> However, if the 'extended fudge' pushes a problem from "can happen,
> even in a very normal setup" territory to "only happens if you're doing
> something pretty weird", then is it really such a bad thing? I think
> the cost in code complexity looks pretty reasonable.

Yes. This should work fine for cases where you need a limited number of
critical allocation requests to succeed for a short period of time.

> > > +#define SK_CRIT_ALLOC(sk, flags) ((sk->sk_allocation & __GFP_CRITICAL) | flags)
> >
> > Lots of hidden conditional logic on critical paths.
>
> How expensive is it compared to the allocation itself?

Also, as I said in my other response, we could make it a compile-time
configurable option with zero overhead when turned off.

Thanks
Sridhar

> > > +#define CRIT_ALLOC(flags) (__GFP_CRITICAL | flags)
> >
> > Pointless obfuscation
>
> Fully agree.
>
> -Mitch
Re: [PATCH] forcedeth TSO fix for large buffers
Has anyone had a chance to review this patch and apply it? I would like
it to make the 2.6.15 kernel, since it is a bug related to TSO in the
driver.

Thanks,
Ayaz
Re: [RFC][PATCH 0/3] TCP/IP Critical socket communication mechanism
> It has a lot more users that compete, true, but likely the set of
> GFP_CRITICAL users would grow over time too and it would develop the
> same problem.

No, because the critical set is determined by the user (by setting the
socket flag). The receive side has some things marked as critical until
we have processed enough to check the socket flag, but then they should
be released. Those short-lived allocations and frees are more or less a
net zero towards the pool.

Certainly, it wouldn't work very well if every socket is marked as
critical, but with an adequate pool for the workload, I expect it'll
work as advertised (esp. since it'll usually be only one socket
associated with swap management that'll be critical).

+-DLS
Re: Specs for Tulip3
On Wed, 2005-12-14 at 19:38 +0100, Aritz Bastida wrote:
> Thank you for your email. But could you tell me what RFC specifically?
> Is it RFC1284? The counters I am looking for are:

These are custom counters, not from any RFCs.

dma_writeq_full: DMA write queue full - meaning the host is not recycling
rx buffers fast enough.
rx_threshold_hit: Rx max coalescing frames threshold hit.
ring_status_update: Status block update.

> I have a dual AMD Opteron 1800MHz, which will be capturing all the
> traffic in a Gigabit Ethernet segment, and analyzing the packets it
> captures. It's a kind of IDS which must work under heavy network loads.
> I am testing the maximum speed it can receive packets (it has got a
> Broadcom Tulip3 NIC: BCM5704). For that purpose, I use another machine
> to inject the packets. I do that with the pktgen module. Here are the
> results for a sample test:
>
> Injection machine (Dual Pentium III 866MHz):
> * Number of packets: 21134488
> * Packet size: 100 bytes
> * Speed: 341242pps 272Mb/sec
>
> Receive machine (Dual AMD Opteron 1800MHz):
> (There are no processes running in this machine; specifically, the
> packet analysis is stopped)
> * rx_ucast_packets: 21134816
> * rx_65_to_127_octet_packets: 21134597
> * dma_writeq_full: 12919200
> * rx_discards: 12947380
> * rx_threshold_hit: 1549692
> * ring_status_update: 1677648

What bus is the NIC in? PCI or PCI-X? What speed? You may want to play
around with the rx ring sizes and rx coalescing parameters, all of which
can be changed with ethtool. Also, be sure to use the latest tg3 driver,
which is 3.45.
Re: [RFC][PATCH 0/3] TCP/IP Critical socket communication mechanism
On 12/14/05, Sridhar Samudrala [EMAIL PROTECTED] wrote:
> These set of patches provide a TCP/IP emergency communication mechanism
> that could be used to guarantee high priority communications over a
> critical socket to succeed even under very low memory conditions that
> last for a couple of minutes. It uses the critical page pool facility
> provided by Matt's patches that he posted recently on lkml.
> http://lkml.org/lkml/2005/12/14/34/index.html
>
> This mechanism provides a new socket option SO_CRITICAL that can be
> used to mark a socket as critical. A critical connection used for
> emergency

So now everyone writing commercial apps for Linux is going to set
SO_CRITICAL on sockets in their apps so their apps can survive better
under pressure than the competitors' apps, and clueless programmers all
over are going to think "cool, with this I can make my app more important
than everyone else's, I'm going to use this". When everyone and his dog
starts to set this, what's the point?

> communications has to be established and marked as critical before we
> enter the emergency condition. It uses the __GFP_CRITICAL flag
> introduced in the critical page pool patches to indicate an allocation
> request as critical and should be satisfied from the critical page pool
> if required. In the send path, this flag is passed with all allocation
> requests that are made for a critical socket. But in the receive path
> we do not know if a packet is critical or not until we receive it and
> find the socket that it is destined to. So we treat all the allocation
> requests in the receive path as critical.
>
> The critical page pool patches also introduce a global flag
> 'system_in_emergency' that is used to indicate an emergency situation
> (could be a low memory condition). When this flag is set, any incoming
> packets that belong to non-critical sockets are dropped as soon as
> possible in the receive path.
Hmm, so if I fire up an app that has SO_CRITICAL set on a socket and can
then somehow put a lot of memory pressure on the machine, I can cause
traffic on other sockets to be dropped... hmmm... sounds like something
to play with to create new and interesting DoS attacks...

> This is necessary to prevent incoming non-critical packets from
> consuming memory from the critical page pool.
>
> I would appreciate any feedback or comments on this approach.

To be a little serious, it sounds like something that could be used to
cause trouble and something that will lose its usefulness once enough
people start using it (for valid or invalid reasons), so what's the
point...

--
Jesper Juhl [EMAIL PROTECTED]
Don't top-post  http://www.catb.org/~esr/jargon/html/T/top-post.html
Plain text mails only, please  http://www.expita.com/nomime.html
Re: Poor performance with r8169
Carl-Daniel Hailfinger [EMAIL PROTECTED] :
[...]
> Performance with nttcp was approximately at 135 MBit/s in both
> directions. Both cards were connected directly with a CAT5e cable.
> Enabling/disabling NAPI didn't have any measurable effect.
> Are these results expected, and if so, is there any card

1 - I get more than that (141 Mbit/s) on an old PII;
2 - can you check with lspci -vvx if there is a difference between the
    two devices (latency or such)? The cards are built around the same
    chipset. I see no reason why one card could be slower than the
    other;
3 - please send:
    - complete dmesg and vmstat 1 output during the test
    - .config
    - ethtool -s eth0

> which delivers more reasonable performance? If the cards should deliver
> higher performance, do you have any patch or any tuning tip I can test?

Can you check 'top' output during the test? Any difference if you renice
ksoftirqd like crazy?

--
Ueimor
Re: [RFC][PATCH 0/3] TCP/IP Critical socket communication mechanism
Jesper Juhl wrote:
> To be a little serious, it sounds like something that could be used to
> cause trouble and something that will lose its usefulness once enough
> people start using it (for valid or invalid reasons), so what's the
> point...

It could easily be a user-configurable option in an application. If DoS
is a real concern, only let this work for root users...

Ben
--
Ben Greear [EMAIL PROTECTED]
Candela Technologies Inc  http://www.candelatech.com
Re: [RFC][PATCH 0/3] TCP/IP Critical socket communication mechanism
Jesper Juhl wrote:
> On 12/14/05, Sridhar Samudrala [EMAIL PROTECTED] wrote:
> > These set of patches provide a TCP/IP emergency communication
> > mechanism that could be used to guarantee high priority
> > communications over a critical socket to succeed even under very low
> > memory conditions that last for a couple of minutes. It uses the
> > critical page pool facility provided by Matt's patches that he
> > posted recently on lkml.
> > http://lkml.org/lkml/2005/12/14/34/index.html
> >
> > This mechanism provides a new socket option SO_CRITICAL that can be
> > used to mark a socket as critical. A critical connection used for
> > emergency
>
> So now everyone writing commercial apps for Linux is going to set
> SO_CRITICAL on sockets in their apps so their apps can survive better
> under pressure than the competitors' apps, and clueless programmers
> all over are going to think "cool, with this I can make my app more
> important than everyone else's, I'm going to use this". When everyone
> and his dog starts to set this, what's the point?

I don't think the initial patches that Matt did were intended for what
you are describing. When I had the conversation with Matt at KS, the
problem we were trying to solve was memory pressure with network-attached
swap space. I came up with the idea that I think Matt has implemented.

Letting the OS choose which are critical TCP/IP sessions is fine. But
letting an application choose is a recipe for disaster.

James
Re: 2.6.15-rc5 gre tunnel checksum error
From: Herbert Xu [EMAIL PROTECTED] Date: Wed, 14 Dec 2005 23:16:29 +1100 [GRE]: Fix hardware checksum modification The skb_postpull_rcsum introduced a bug to the checksum modification. Although the length pulled is offset bytes, the origin of the pulling is the GRE header, not the IP header. Signed-off-by: Herbert Xu [EMAIL PROTECTED] Dave, please apply this if this works for Paul. Applied, thanks. -stable needs this too, so I'll toss it there as well.
[2.6 patch] net/sunrpc/xdr.c: remove xdr_decode_string()
This patch removes the unused function xdr_decode_string(). Signed-off-by: Adrian Bunk [EMAIL PROTECTED] Acked-by: Neil Brown [EMAIL PROTECTED] Acked-by: Charles Lever [EMAIL PROTECTED] --- include/linux/sunrpc/xdr.h | 1 - net/sunrpc/xdr.c | 21 - 2 files changed, 22 deletions(-) --- linux-2.6.15-rc1-mm2-full/include/linux/sunrpc/xdr.h.old 2005-11-23 02:03:01.0 +0100 +++ linux-2.6.15-rc1-mm2-full/include/linux/sunrpc/xdr.h 2005-11-23 02:03:08.0 +0100 @@ -91,7 +91,6 @@ u32 * xdr_encode_opaque_fixed(u32 *p, const void *ptr, unsigned int len); u32 * xdr_encode_opaque(u32 *p, const void *ptr, unsigned int len); u32 * xdr_encode_string(u32 *p, const char *s); -u32 * xdr_decode_string(u32 *p, char **sp, int *lenp, int maxlen); u32 * xdr_decode_string_inplace(u32 *p, char **sp, int *lenp, int maxlen); u32 * xdr_encode_netobj(u32 *p, const struct xdr_netobj *); u32 * xdr_decode_netobj(u32 *p, struct xdr_netobj *); --- linux-2.6.15-rc1-mm2-full/net/sunrpc/xdr.c.old 2005-11-23 02:03:17.0 +0100 +++ linux-2.6.15-rc1-mm2-full/net/sunrpc/xdr.c 2005-11-23 02:03:27.0 +0100 @@ -93,27 +93,6 @@ } u32 * -xdr_decode_string(u32 *p, char **sp, int *lenp, int maxlen) -{ - unsigned int len; - char *string; - - if ((len = ntohl(*p++)) > maxlen) - return NULL; - if (lenp) - *lenp = len; - if ((len % 4) != 0) { - string = (char *) p; - } else { - string = (char *) (p - 1); - memmove(string, p, len); - } - string[len] = '\0'; - *sp = string; - return p + XDR_QUADLEN(len); -} - -u32 * xdr_decode_string_inplace(u32 *p, char **sp, int *lenp, int maxlen) { unsigned int len;
Re: [RFC][PATCH 0/3] TCP/IP Critical socket communication mechanism
On Wed, 2005-12-14 at 20:49 +, James Courtier-Dutton wrote: Jesper Juhl wrote: On 12/14/05, Sridhar Samudrala [EMAIL PROTECTED] wrote: This set of patches provides a TCP/IP emergency communication mechanism that could be used to guarantee high priority communications over a critical socket to succeed even under very low memory conditions that last for a couple of minutes. It uses the critical page pool facility provided by Matt's patches that he posted recently on lkml. http://lkml.org/lkml/2005/12/14/34/index.html This mechanism provides a new socket option SO_CRITICAL that can be used to mark a socket as critical. A critical connection used for emergency So now everyone writing commercial apps for Linux is going to set SO_CRITICAL on sockets in their apps so their apps can survive better under pressure than the competitors' apps, and clueless programmers all over are going to think "cool, with this I can make my app more important than everyone else's, I'm going to use this". When everyone and his dog starts to set this, what's the point? I don't think the initial patches that Matt did were intended for what you are describing. When I had the conversation with Matt at KS, the problem we were trying to solve was memory pressure with network-attached swap space. I came up with the idea that I think Matt has implemented. Letting the OS choose which are critical TCP/IP sessions is fine. But letting an application choose is a recipe for disaster. We could easily add a capable(CAP_NET_ADMIN) check to allow this option to be set only by privileged users. Thanks Sridhar
Re: [RFC][PATCH 0/3] TCP/IP Critical socket communication mechanism
Sridhar Samudrala wrote: On Wed, 2005-12-14 at 20:49 +, James Courtier-Dutton wrote: Jesper Juhl wrote: On 12/14/05, Sridhar Samudrala [EMAIL PROTECTED] wrote: This set of patches provides a TCP/IP emergency communication mechanism that could be used to guarantee high priority communications over a critical socket to succeed even under very low memory conditions that last for a couple of minutes. It uses the critical page pool facility provided by Matt's patches that he posted recently on lkml. http://lkml.org/lkml/2005/12/14/34/index.html This mechanism provides a new socket option SO_CRITICAL that can be used to mark a socket as critical. A critical connection used for emergency So now everyone writing commercial apps for Linux is going to set SO_CRITICAL on sockets in their apps so their apps can survive better under pressure than the competitors' apps, and clueless programmers all over are going to think "cool, with this I can make my app more important than everyone else's, I'm going to use this". When everyone and his dog starts to set this, what's the point? I don't think the initial patches that Matt did were intended for what you are describing. When I had the conversation with Matt at KS, the problem we were trying to solve was memory pressure with network-attached swap space. I came up with the idea that I think Matt has implemented. Letting the OS choose which are critical TCP/IP sessions is fine. But letting an application choose is a recipe for disaster. We could easily add a capable(CAP_NET_ADMIN) check to allow this option to be set only by privileged users. Thanks Sridhar Sridhar, Have you actually thought about what would happen in a real world scenario? There is no real world requirement for this sort of user land feature. In memory pressure mode, you don't care about user applications. In fact, under memory pressure no user applications are getting scheduled. All you care about is swapping out memory to achieve a net gain in free memory, so that the applications can then run ok again. James
[PATCH] fix multiple issues in MLDv2 reports
Dave, I tested these together, but let me know if you want me to split these into a few pieces, though they'll probably conflict with each other. :-) The below jumbo patch fixes the following problems in MLDv2. 1) Add necessary ntohs to recent pskb_may_pull check [breaks all nonzero source queries on little-endian (!)] 2) Add locking to source filter list [resend of prior patch] 3) fix mld_marksources() to a) send nothing when all queried sources are excluded b) send full exclude report when the queried sources are not excluded c) don't schedule a timer when there's nothing to report NOTE: RFC 3810 specifies the source list should be saved and each source reported individually as an IS_IN. This is an obvious DoS path, requiring the host to store and then multicast as many sources as are queried (e.g., millions...). This alternative sends a full, relevant report that's limited to the number of sources present on the machine. 4) fix add_grec() to send empty-source records when it should. The original check doesn't account for a non-empty source list with all sources inactive; the new code keeps that short-circuit case, and also generates the group header with an empty list if needed. 5) fix mca_crcount decrement to be after add_grec(), which needs its original value. These issues (other than item #1 ;-) ) were all found by Yan Zheng -- much thanks!
+-DLS [in-line for viewing, attached for applying] Signed-off-by: David L Stevens [EMAIL PROTECTED] diff -ruNp linux-2.6.15-rc5/include/net/if_inet6.h linux-2.6.15-rc5MC1/include/net/if_inet6.h --- linux-2.6.15-rc5/include/net/if_inet6.h 2005-10-27 17:02:08.0 -0700 +++ linux-2.6.15-rc5MC1/include/net/if_inet6.h 2005-12-09 15:22:46.0 -0800 @@ -82,6 +82,7 @@ struct ipv6_mc_socklist struct in6_addr addr; int ifindex; struct ipv6_mc_socklist *next; + rwlock_t sflock; unsigned int sfmode; /* MCAST_{INCLUDE,EXCLUDE} */ struct ip6_sf_socklist *sflist; }; diff -ruNp linux-2.6.15-rc5/net/ipv6/mcast.c linux-2.6.15-rc5MC1/net/ipv6/mcast.c --- linux-2.6.15-rc5/net/ipv6/mcast.c 2005-12-12 15:01:33.0 -0800 +++ linux-2.6.15-rc5MC1/net/ipv6/mcast.c 2005-12-13 16:02:46.0 -0800 @@ -224,6 +224,7 @@ int ipv6_sock_mc_join(struct sock *sk, i mc_lst->ifindex = dev->ifindex; mc_lst->sfmode = MCAST_EXCLUDE; + mc_lst->sflock = RW_LOCK_UNLOCKED; mc_lst->sflist = NULL; /* @@ -360,6 +361,7 @@ int ip6_mc_source(int add, int omode, st struct ip6_sf_socklist *psl; int i, j, rv; int leavegroup = 0; + int pmclocked = 0; int err; if (pgsr->gsr_group.ss_family != AF_INET6 || @@ -403,6 +405,9 @@ int ip6_mc_source(int add, int omode, st pmc->sfmode = omode; } + write_lock_bh(&pmc->sflock); + pmclocked = 1; + psl = pmc->sflist; if (!add) { if (!psl) @@ -475,6 +480,8 @@ int ip6_mc_source(int add, int omode, st /* update the interface list */ ip6_mc_add_src(idev, group, omode, 1, source, 1); done: + if (pmclocked) + write_unlock_bh(&pmc->sflock); read_unlock_bh(&ipv6_sk_mc_lock); read_unlock_bh(&idev->lock); in6_dev_put(idev); @@ -510,6 +517,8 @@ int ip6_mc_msfilter(struct sock *sk, str dev = idev->dev; err = 0; + read_lock_bh(&ipv6_sk_mc_lock); + if (gsf->gf_fmode == MCAST_INCLUDE && gsf->gf_numsrc == 0) { leavegroup = 1; goto done; @@ -549,6 +558,8 @@ int ip6_mc_msfilter(struct sock *sk, str newpsl = NULL; (void) ip6_mc_add_src(idev, group, gsf->gf_fmode, 0, NULL, 0); } + + write_lock_bh(&pmc->sflock); psl = pmc->sflist; if (psl) {
(void) ip6_mc_del_src(idev, group, pmc->sfmode, @@ -558,8 +569,10 @@ int ip6_mc_msfilter(struct sock *sk, str (void) ip6_mc_del_src(idev, group, pmc->sfmode, 0, NULL, 0); pmc->sflist = newpsl; pmc->sfmode = gsf->gf_fmode; + write_unlock_bh(&pmc->sflock); err = 0; done: + read_unlock_bh(&ipv6_sk_mc_lock); read_unlock_bh(&idev->lock); in6_dev_put(idev); dev_put(dev); @@ -592,6 +605,11 @@ int ip6_mc_msfget(struct sock *sk, struc dev = idev->dev; err = -EADDRNOTAVAIL; + /* +* changes to the ipv6_mc_list require the socket lock and +* a read lock on ipv6_sk_mc_lock. We have the socket lock, +* so reading the list is safe. +*/ for (pmc=inet6->ipv6_mc_list; pmc; pmc=pmc->next) { if (pmc->ifindex != gsf->gf_interface) @@ -614,6 +632,10 @@ int ip6_mc_msfget(struct sock *sk, struc
Re: Default net.ipv6.mld_max_msf = 10 and net.core.optmem_max=10240
Hi David, all. As implemented now, the default memory allocated in net.core.optmem_max permits joining up to 320 (S,G) channels per socket (for IPv6, each channel costs 32 bytes of net.core.optmem_max). The thing is that net.ipv6.mld_max_msf sets a hard limit on it, so assuming that you don't change the value of net.core.optmem_max, would it make sense to increase net.ipv6.mld_max_msf to, let's say, 256? The rest of the memory can still be used for various option setup on the socket. Cheers, Hoerdt Mickaël David Stevens wrote: [I'm CC-ing Dave Miller and Yoshifuji Hideaki; you probably ought to bring this up on [EMAIL PROTECTED]] Hoerdt, I don't object to increasing the default, but how much is a good question. For an include-mode filter, it'll do a linear search on the sources for every packet received for that group. If those are large numbers, then an administrator should decide that's a good use of the machine, I think. The reports are (roughly) an n^2 algorithm in the number of sources. The per-packet filtering can be improved by using a hash for source look-ups, but I don't think there's a significant improvement for report computations (it's n^3 done the obvious way, so n^2 is already pretty good). I've done testing with hundreds of sources and no apparent performance problems (though performance isn't what I was testing). I don't know what a reasonable limit on reasonable hardware is. Like the per-socket group limit, this one is probably too low for common applications, and also like that, easily evaded. 1024 or 2048 as the default seems high to me, on the assumption that a few apps doing that would kill performance, but since I haven't tested, I don't really know. I also see it appears not to be enforced in the full-state API (an oversight, unless I'm missing the check when I look now). I don't see any problem with bumping this up to, say, 64, immediately, which would solve the immediate problem, I guess. But I'm not the maintainer.
:-) I think some stress testing to show how well this scales for higher numbers would be appropriate before going too high. If you have numbers (or can get them), that'd help. I wouldn't mind doing some tests along these lines myself, but I don't expect to have much uncommitted time available through December. +-DLS Hoerdt Mickael [EMAIL PROTECTED] wrote on 11/30/2005 08:29:51 AM: Hello David, It seems to me that the net.ipv6.mld_max_msf and igmp_max_msf default values are a little bit too short for multi-source multicast applications. On the M6bone, we are using a software named dbeacon (http://mars.innerghost.net.ipv4.sixxs.org/matrix/) which joins a high number of SSM sources (currently up to 30) on the same socket. This creates a management problem because when users install it, the root admin must change this value, but dbeacon is run by normal users on the hosts. For layered multicast, this can be a problem too. It's easy to imagine a flow with 256 different layers; the FLUTE application is one implementation of this layered multicast concept (http://atm.tut.fi/mad/). Could it be possible to increase this default value to, let's say, 1024 or 2048? If not possible, could you tell me why, and then we may consider developing an application layer instantiating several sockets for joining a high number of SSM channels per application. Thank you, Hoerdt Mickaël
Re: [RFC][PATCH 0/3] TCP/IP Critical socket communication mechanism
James Courtier-Dutton wrote: Have you actually thought about what would happen in a real world scenario? There is no real world requirement for this sort of user land feature. In memory pressure mode, you don't care about user applications. In fact, under memory pressure no user applications are getting scheduled. All you care about is swapping out memory to achieve a net gain in free memory, so that the applications can then run ok again. Low 'ATOMIC' memory is different from the memory that user space typically uses, so just because you can't allocate an SKB does not mean you are swapping out user-space apps. I have an app that can have 2000+ sockets open. I would definitely like to make the management and other important sockets have priority over others in my app... Ben -- Ben Greear [EMAIL PROTECTED] Candela Technologies Inc http://www.candelatech.com
RE: netif_stop_queue() and multiple hardware queues
Hi Jeremy, I implemented this functionality in Devicescape's 802.11 stack. The approach I took was for the driver to install a device specific qdisc as the root qdisc on the device. This root qdisc's purpose is to expose the hardware queues directly, so other qdiscs can be attached as leaf qdiscs. This hardware specific root qdisc cannot be deleted or changed. This makes it possible to use tc to inspect/set/modify per hardware queue statistics and parameters. In order for this to work my device driver never calls netif_stop. Instead the qdisc dequeue function for the root qdisc looks to see which hardware queues can accept a frame, and if none then it returns no data. The driver's frame completion function calls __netif_schedule appropriately too to ensure the queue runs when it should. This allows Devicescape's 802.11 stack to properly integrate with the Linux tc framework. I don't think any other 802.11 drivers achieve this. In the future I plan to extend Devicescape's 802.11 root qdisc to further expose the 802.11 MAC's internal queues, in cases where this is useful (e.g. the scheduled access implementation). The same principle could apply to Intel's e1000. Simon -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Jeremy Jackson Sent: Wednesday, December 14, 2005 2:31 PM To: netdev@vger.kernel.org Subject: netif_stop_queue() and multiple hardware queues Hi, I posted this briefly on linux-net, before being redirected here. Two pieces of hardware now have Linux driver support for multiple hardware queues: Intel's e1000 (two queues from what I can see in the code) and Atheros's 5212 and up, in support of 802.11e and WME (four hardware queues). In the GigE case, this just reduces latency due to hardware queueing. In the WiFi case, the queues actually have significance in access to the shared medium. (ACKs can be disabled as well) It is also worthy of note that ADSL2, VDSL, and ADSL2+ have 4 different latency queues. 
These last two are significant; real-time applications suffer the most from low speed, shared, and/or non-deterministic media. I wonder where DOCSIS 2 is in this regard. Anyone? Beuler? So my question is, what's it going to take to get dev->hard_start_xmit() to hook up tc queues directly to hardware/driver queues? Right now, it seems no matter how elaborate a tc setup you have, everything funnels through one queue, where the only thing that survives from the classifying/queueing is skb->priority (i.e. nothing). The hardware driver can then try to reclassify packets. I suppose you could hack up an iptables classifier to set skb->priority... The Atheros driver tries to do its own classifying by first wiping out skb->priority, then hard-coding a mapping (tsk - policy is for the sysadmin!) between VLAN tag priority, IP TOS/DSCP, and skb->priority, then to one of the 4 queues and ACK states, blithely ignoring any fine work done by tc. It'd be sweet to head this nonsense off at the pass, before others discover the rabbit trail and make it into a trade route. -- Jeremy Jackson Coplanar Networks W:(519)489-4903 MSN: [EMAIL PROTECTED] ICQ: 43937409 http://www.coplanar.net
Re: Specs for Tulip3
Hello again, Sorry Michael, but I am a kind of newbie in this subject and couldn't understand everything you said clearly. I'm working on my final year project (I think it's said like that, I mean the project you do when you finish your degree :P). The purpose of my project is to capture and analyze network packets as fast as I can. So, I'll try to expose my doubts about this, and please don't be too concise, since I couldn't understand it. However, if there is a good reference I should read, tell me, since I couldn't get any good book centered on Linux kernel networking. The only one I know about (and have read it) is The Linux Kernel Networking Architecture, although it's quite old (kernel 2.4). I have also read Linux Device Drivers 3rd Edition and Linux Kernel Development. Also, some articles about NAPI and interrupt coalescing: Eliminating Receive Livelock by Jeffrey Mogul, Beyond Softnet by Jamal Hadi Salim, and probably something more. I know the concepts of NAPI but have not seen any real driver in action, except for the Realtek 8139too. So here go the questions: rx_threshold_hit Rx max coalescing frames threshold hit. Well, I didn't understand what this threshold is for. What bus is the NIC in? PCI or PCI-X? What speed? You may want to play around with the rx ring sizes and rx coalescing parameters, all can be changed with ethtool. Also, be sure to use the latest tg3 driver which is 3.45. I'm running Linux kernel 2.6.13 and tg3 version 3.37, so it should be new enough. I don't know how to verify if the NIC is in a PCI-X bus. How can I check that?
Running lspci I can see there are some PCI-X bridges: 0000:00:01.0 PCI bridge: Advanced Micro Devices [AMD] AMD-8131 PCI-X Bridge (rev 12) 0000:00:01.1 PIC: Advanced Micro Devices [AMD] AMD-8131 PCI-X IOAPIC 0000:00:02.0 PCI bridge: Advanced Micro Devices [AMD] AMD-8131 PCI-X Bridge (rev 12) 0000:00:02.1 PIC: Advanced Micro Devices [AMD] AMD-8131 PCI-X IOAPIC 0000:00:06.0 PCI bridge: Advanced Micro Devices [AMD] AMD-8111 PCI 0000:00:07.0 ISA bridge: Advanced Micro Devices [AMD] AMD-8111 LPC (...) 0000:02:03.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5704 Gigabit Ethernet (rev 02) 0000:02:03.1 Ethernet controller: Broadcom Corporation NetXtreme BCM5704 Gigabit Ethernet (rev 02) Are the NICs in a PCI-X bus? The bridges at least are. I have seen I can change the rx ring entries with ethtool, although in the driver code it says that the size is fixed to 512 entries. So what you actually change is the pending entries (defaulting to 200). What does that mean? That even if the ring is 512 entries long, it seems to be full if there are 200 packets the kernel didn't get? As I said, the only driver I have read before is the Realtek 8139too, which is quite simple, but at least I could find a tutorial which explains how it works. In that driver there was an rx_ring and a tx_ring (I don't know if there can be more than one in some other drivers). When a packet arrives the NIC stores in the rx_ring a packet descriptor (4 bytes, 2 for the packet length and 2 for the packet receive status), and just after that the packet itself. So the driver just has to read the descriptor and then read the following packet length bytes. As I have seen in tg3, the rx_ring seems to be only for packet descriptors. So, as I guess, the descriptor should also contain the address of the actual packet stored. Where is that packet stored? In another rx_ring just for incoming packets? What is the benefit compared to the way 8139too does it? To finish, what do you mean by changing the coalescing parameters?
The dev->quota and budget? Are there more things I can change for my benefit? Thank you for your patience. Regards Aritz
RE: netif_stop_queue() and multiple hardware queues
Oh - and re: policy - my 802.11 qdisc first calls out to the tc classify function - allowing the sysadmin to do what he wants, then if no class is selected it has a default implementation that reflects the appropriate 802.11 and WiFi specs for classification. Of course another implementation would be to implement an 802.11 classifier, and install this by default on the 802.11 qdisc. Simon -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Simon Barber Sent: Wednesday, December 14, 2005 3:07 PM To: Jeremy Jackson; netdev@vger.kernel.org Subject: RE: netif_stop_queue() and multiple hardware queues Hi Jeremy, I implemented this functionality in Devicescape's 802.11 stack. The approach I took was for the driver to install a device specific qdisc as the root qdisc on the device. This root qdisc's purpose is to expose the hardware queues directly, so other qdiscs can be attached as leaf qdiscs. This hardware specific root qdisc cannot be deleted or changed. This makes it possible to use tc to inspect/set/modify per hardware queue statistics and parameters. In order for this to work my device driver never calls netif_stop. Instead the qdisc dequeue function for the root qdisc looks to see which hardware queues can accept a frame, and if none then it returns no data. The driver's frame completion function calls __netif_schedule appropriately too to ensure the queue runs when it should. This allows Devicescape's 802.11 stack to properly integrate with the Linux tc framework. I don't think any other 802.11 drivers achieve this. In the future I plan to extend Devicescape's 802.11 root qdisc to further expose the 802.11 MAC's internal queues, in cases where this is useful (e.g. the scheduled access implementation). The same principle could apply to Intel's e1000. 
Simon -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of Jeremy Jackson Sent: Wednesday, December 14, 2005 2:31 PM To: netdev@vger.kernel.org Subject: netif_stop_queue() and multiple hardware queues Hi, I posted this briefly on linux-net, before being redirected here. Two pieces of hardware now have Linux driver support for multiple hardware queues: Intel's e1000 (two queues from what I can see in the code) and Atheros's 5212 and up, in support of 802.11e and WME (four hardware queues). In the GigE case, this just reduces latency due to hardware queueing. In the WiFi case, the queues actually have significance in access to the shared medium. (ACKs can be disabled as well) It is also worthy of note that ADSL2, VDSL, and ADSL2+ have 4 different latency queues. These last two are significant; real-time applications suffer the most from low speed, shared, and/or non-deterministic media. I wonder where DOCSIS 2 is in this regard. Anyone? Beuler? So my question is, what's it going to take to get dev->hard_start_xmit() to hook up tc queues directly to hardware/driver queues? Right now, it seems no matter how elaborate a tc setup you have, everything funnels through one queue, where the only thing that survives from the classifying/queueing is skb->priority (i.e. nothing). The hardware driver can then try to reclassify packets. I suppose you could hack up an iptables classifier to set skb->priority... The Atheros driver tries to do its own classifying by first wiping out skb->priority, then hard-coding a mapping (tsk - policy is for the sysadmin!) between VLAN tag priority, IP TOS/DSCP, and skb->priority, then to one of the 4 queues and ACK states, blithely ignoring any fine work done by tc. It'd be sweet to head this nonsense off at the pass, before others discover the rabbit trail and make it into a trade route.
-- Jeremy Jackson Coplanar Networks W:(519)489-4903 MSN: [EMAIL PROTECTED] ICQ: 43937409 http://www.coplanar.net
[PATCH 0/4] TCP Cubic updates for 2.6.16
This set of patches: * precomputes constants used in TCP cubic * uses Newton/Raphson for cube root * adds a 64-bit find-last-set-bit helper (fls64) to make the initial estimate -- Stephen Hemminger [EMAIL PROTECTED] OSDL http://developer.osdl.org/~shemminger
[PATCH 3/4] TCP cubic precompute constants
Revised version of patch to pre-compute values for TCP cubic. * d32,d64 replaced with descriptive names * cube_factor replaces srtt[scaled by count] / HZ * ((1 << (10+2*BICTCP_HZ)) / bic_scale) * beta_scale replaces 8*(BICTCP_BETA_SCALE+beta)/3/(BICTCP_BETA_SCALE-beta); Signed-off-by: Stephen Hemminger [EMAIL PROTECTED] --- net-2.6.16.orig/net/ipv4/tcp_cubic.c +++ net-2.6.16/net/ipv4/tcp_cubic.c @@ -16,7 +16,7 @@ #include <linux/mm.h> #include <linux/module.h> #include <net/tcp.h> - +#include <asm/div64.h> #define BICTCP_BETA_SCALE 1024 /* Scale factor beta calculation * max_cwnd = snd_cwnd * beta @@ -34,15 +34,20 @@ static int initial_ssthresh = 100; static int bic_scale = 41; static int tcp_friendliness = 1; +static u32 cube_rtt_scale; +static u32 beta_scale; +static u64 cube_factor; + +/* Note parameters that are used for precomputing scale factors are read-only */ module_param(fast_convergence, int, 0644); MODULE_PARM_DESC(fast_convergence, "turn on/off fast convergence"); module_param(max_increment, int, 0644); MODULE_PARM_DESC(max_increment, "Limit on increment allowed during binary search"); -module_param(beta, int, 0644); +module_param(beta, int, 0444); MODULE_PARM_DESC(beta, "beta for multiplicative increase"); module_param(initial_ssthresh, int, 0644); MODULE_PARM_DESC(initial_ssthresh, "initial value of slow start threshold"); -module_param(bic_scale, int, 0644); +module_param(bic_scale, int, 0444); MODULE_PARM_DESC(bic_scale, "scale (scaled by 1024) value for bic function (bic_scale/1024)"); module_param(tcp_friendliness, int, 0644); MODULE_PARM_DESC(tcp_friendliness, "turn on/off tcp friendliness"); @@ -151,65 +156,13 @@ static u32 cubic_root(u64 x) return (u32)end; } -static inline u32 bictcp_K(u32 dist, u32 srtt) -{ -u64 d64; -u32 d32; -u32 count; -u32 result; - -/* calculate the K for (wmax-cwnd) = c/rtt * K^3 - so K = cubic_root( (wmax-cwnd)*rtt/c ) - the unit of K is bictcp_HZ=2^10, not HZ - - c = bic_scale >> 10 - rtt = (tp->srtt >> 3) / HZ - - the following code has been
designed and tested for - cwnd < 1 million packets - RTT < 100 seconds - HZ < 1,000,00 (corresponding to 10 nano-second) - -*/ - -/* 1/c * 2^2*bictcp_HZ */ -d32 = (1 << (10+2*BICTCP_HZ)) / bic_scale; -d64 = (__u64)d32; - -/* srtt * 2^count / HZ - 1) to get a better accuracy of the following d32, - the larger the count, the better the accuracy - 2) and avoid overflow of the following d64 - the larger the count, the higher the possibility of overflow - 3) so find a count between bictcp_hz-3 and bictcp_hz - count may be less than bictcp_HZ, - then d64 becomes 0. that is OK -*/ -d32 = srtt; -count = 0; -while (((d32 & 0x80000000)==0) && (count < BICTCP_HZ)){ -d32 = d32 << 1; -count++; -} -d32 = d32 / HZ; - -/* (wmax-cwnd) * (srtt>>3 / HZ) / c * 2^(3*bictcp_HZ) */ -d64 = (d64 * dist * d32) >> (count+3-BICTCP_HZ); - -/* cubic root */ -d64 = cubic_root(d64); - -result = (u32)d64; -return result; -} - /* * Compute congestion window to use. */ static inline void bictcp_update(struct bictcp *ca, u32 cwnd) { - u64 d64; - u32 d32, t, srtt, bic_target, min_cnt, max_cnt; + u64 offs; + u32 delta, t, bic_target, min_cnt, max_cnt; ca->ack_cnt++; /* count the number of ACKs */ @@ -220,8 +173,6 @@ static inline void bictcp_update(struct ca->last_cwnd = cwnd; ca->last_time = tcp_time_stamp; - srtt = (HZ << 3)/10;/* use real time-based growth function */ - if (ca->epoch_start == 0) { ca->epoch_start = tcp_time_stamp; /* record the beginning of an epoch */ ca->ack_cnt = 1;/* start counting */ @@ -231,7 +182,11 @@ static inline void bictcp_update(struct ca->bic_K = 0; ca->bic_origin_point = cwnd; } else { - ca->bic_K = bictcp_K(ca->last_max_cwnd-cwnd, srtt); + /* Compute new K based on +* (wmax-cwnd) * (srtt>>3 / HZ) / c * 2^(3*bictcp_HZ) +*/ + ca->bic_K = cubic_root(cube_factor + * (ca->last_max_cwnd - cwnd)); ca->bic_origin_point = ca->last_max_cwnd; } } @@ -239,9 +194,9 @@ static inline void bictcp_update(struct /* cubic function - calc*/ /* calculate c * time^3 / rtt, * while considering overflow in calculation of time^3 -* (so time^3
is done by using d64) +* (so time^3 is done
[PATCH 2/4] fls64: x86_64 version
Index: net-2.6.16/include/asm-x86_64/bitops.h === --- net-2.6.16.orig/include/asm-x86_64/bitops.h +++ net-2.6.16/include/asm-x86_64/bitops.h @@ -340,6 +340,20 @@ static __inline__ unsigned long __ffs(un return word; } +/* + * __fls: find last bit set. + * @word: The word to search + * + * Undefined if no set bit exists, so code should check against 0 first. + */ +static __inline__ unsigned long __fls(unsigned long word) +{ + __asm__("bsrq %1,%0" + :"=r" (word) + :"rm" (word)); + return word; +} + #ifdef __KERNEL__ static inline int sched_find_first_bit(const unsigned long *b) @@ -370,6 +384,19 @@ static __inline__ int ffs(int x) } /** + * fls64 - find last bit set in 64 bit word + * @x: the word to search + * + * This is defined the same way as fls. + */ +static __inline__ int fls64(__u64 x) +{ + if (x == 0) + return 0; + return __fls(x) + 1; +} + +/** * hweightN - returns the hamming weight of a N-bit word * @x: the word to weigh * @@ -409,7 +436,6 @@ static __inline__ int ffs(int x) /* find last set bit */ #define fls(x) generic_fls(x) -#define fls64(x) generic_fls64(x) #endif /* __KERNEL__ */ -- Stephen Hemminger [EMAIL PROTECTED] OSDL http://developer.osdl.org/~shemminger
[PATCH 4/4] TCP Cubic use Newton-Raphson
Replace the cube root algorithm with a faster version using Newton-Raphson. Surprisingly, doing the scaled div64_64 is faster than a true 64 bit division on 64 bit CPUs.

Signed-off-by: Stephen Hemminger [EMAIL PROTECTED]

--- net-2.6.16.orig/net/ipv4/tcp_cubic.c
+++ net-2.6.16/net/ipv4/tcp_cubic.c
@@ -52,6 +52,7 @@ MODULE_PARM_DESC(bic_scale, "scale (scal
 module_param(tcp_friendliness, int, 0644);
 MODULE_PARM_DESC(tcp_friendliness, "turn on/off tcp friendliness");
 
+#include <asm/div64.h>
 
 /* BIC TCP Parameters */
 struct bictcp {
@@ -93,67 +94,51 @@ static void bictcp_init(struct sock *sk)
 		tcp_sk(sk)->snd_ssthresh = initial_ssthresh;
 }
 
-/* 65536 times the cubic root */
-static const u64 cubic_table[8]
-	= {0, 65536, 82570, 94519, 104030, 112063, 119087, 125367};
-
-/*
- * calculate the cubic root of x
- * the basic idea is that x can be expressed as i*8^j
- * so cubic_root(x) = cubic_root(i)*2^j
- * in the following code, x is i, and y is 2^j
- * because of integer calculation, there are errors in calculation
- * so finally use binary search to find out the exact solution
- */
-static u32 cubic_root(u64 x)
+/* 64bit divisor, dividend and result. dynamic precision */
+static inline u_int64_t div64_64(u_int64_t dividend, u_int64_t divisor)
 {
-	u64 y, app, target, start, end, mid, start_diff, end_diff;
+	u_int32_t d = divisor;
 
-	if (x == 0)
-		return 0;
+	if (divisor > 0xffffffffULL) {
+		unsigned int shift = fls(divisor >> 32);
 
-	target = x;
+		d = divisor >> shift;
+		dividend >>= shift;
+	}
 
-	/* first estimate lower and upper bound */
-	y = 1;
-	while (x >= 8) {
-		x = (x >> 3);
-		y = (y << 1);
-	}
-	start = (y*cubic_table[x]) >> 16;
-	if (x == 7)
-		end = (y << 1);
-	else
-		end = (y*cubic_table[x+1]+65535) >> 16;
+	/* avoid 64 bit division if possible */
+	if (dividend >> 32)
+		do_div(dividend, d);
+	else
+		dividend = (uint32_t) dividend / d;
 
-	/* binary search for more accurate one */
-	while (start < end-1) {
-		mid = (start+end) >> 1;
-		app = mid*mid*mid;
-		if (app < target)
-			start = mid;
-		else if (app > target)
-			end = mid;
-		else
-			return mid;
-	}
+	return dividend;
+}
 
-	/* find the most accurate one from start and end */
-	app = start*start*start;
-	if (app < target)
-		start_diff = target - app;
-	else
-		start_diff = app - target;
-	app = end*end*end;
-	if (app < target)
-		end_diff = target - app;
-	else
-		end_diff = app - target;
+/*
+ * calculate the cubic root of x using Newton-Raphson
+ */
+static u32 cubic_root(u64 a)
+{
+	u32 x, x1;
 
-	if (start_diff < end_diff)
-		return (u32)start;
-	else
-		return (u32)end;
+	/* Initial estimate is based on:
+	 * cbrt(x) = exp(log(x) / 3)
+	 */
+	x = 1u << (fls64(a)/3);
+
+	/*
+	 * Iteration based on:
+	 *                          2
+	 * x    = ( 2 * x  +  a / x   ) / 3
+	 *  k+1          k         k
+	 */
+	do {
+		x1 = x;
+		x = (2 * x + (uint32_t) div64_64(a, x*x)) / 3;
+	} while (abs(x1 - x) > 1);
+
+	return x;
}
 
 /*
--
Stephen Hemminger [EMAIL PROTECTED]
OSDL http://developer.osdl.org/~shemminger
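The convergence of the new routine can be sanity-checked outside the kernel. The transliteration below replaces div64_64 with plain 64-bit division and inlines a portable fls64 (names and the zero guard are additions of this sketch, not the patch):

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

static int fls64_portable(uint64_t x)
{
	int r = 0;

	while (x) {
		x >>= 1;
		r++;
	}
	return r;
}

/* Newton-Raphson cube root as in the patch:
 * x_{k+1} = (2*x_k + a/x_k^2) / 3, seeded with 2^(fls64(a)/3),
 * iterating until successive estimates differ by at most 1. */
static uint32_t cubic_root(uint64_t a)
{
	uint32_t x, x1;

	if (a == 0)		/* guard added for this sketch */
		return 0;

	x = 1u << (fls64_portable(a) / 3);
	do {
		x1 = x;
		x = (2 * x + (uint32_t)(a / ((uint64_t)x * x))) / 3;
	} while (abs((int)(x1 - x)) > 1);

	return x;
}
```

For a = 1000000 the seed is 2^(20/3) = 64 and the iteration settles at 100 in four steps, which illustrates why the exp(log(x)/3) seed keeps the iteration count small across the whole 64-bit range.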
[PATCH 1/4] fls64: generic version
Index: bic-2.6/include/linux/bitops.h
===================================================================
--- bic-2.6.orig/include/linux/bitops.h
+++ bic-2.6/include/linux/bitops.h
@@ -76,6 +76,15 @@ static __inline__ int generic_fls(int x)
  */
 #include <asm/bitops.h>
 
+
+static inline int generic_fls64(__u64 x)
+{
+	__u32 h = x >> 32;
+	if (h)
+		return fls(h) + 32;
+	return fls(x);
+}
+
 static __inline__ int get_bitmask_order(unsigned int count)
 {
 	int order;
Index: bic-2.6/include/asm-alpha/bitops.h
===================================================================
--- bic-2.6.orig/include/asm-alpha/bitops.h
+++ bic-2.6/include/asm-alpha/bitops.h
@@ -321,6 +321,7 @@ static inline int fls(int word)
 #else
 #define fls	generic_fls
 #endif
+#define fls64	generic_fls64
 
 /* Compute powers of two for the given integer. */
 static inline long floor_log2(unsigned long word)
Index: bic-2.6/include/asm-arm/bitops.h
===================================================================
--- bic-2.6.orig/include/asm-arm/bitops.h
+++ bic-2.6/include/asm-arm/bitops.h
@@ -332,6 +332,7 @@ static inline unsigned long __ffs(unsign
  */
 
 #define fls(x) generic_fls(x)
+#define fls64(x)   generic_fls64(x)
 
 /*
  * ffs: find first bit set. This is defined the same way as
@@ -351,6 +352,7 @@ static inline unsigned long __ffs(unsign
 #define fls(x) \
 	( __builtin_constant_p(x) ? generic_fls(x) : \
 	  ({ int __r; asm("clz\t%0, %1" : "=r"(__r) : "r"(x) : "cc"); 32-__r; }) )
+#define fls64(x)   generic_fls64(x)
 #define ffs(x) ({ unsigned long __t = (x); fls(__t & -__t); })
 #define __ffs(x) (ffs(x) - 1)
 #define ffz(x) __ffs( ~(x) )
Index: bic-2.6/include/asm-arm26/bitops.h
===================================================================
--- bic-2.6.orig/include/asm-arm26/bitops.h
+++ bic-2.6/include/asm-arm26/bitops.h
@@ -259,6 +259,7 @@ static inline unsigned long __ffs(unsign
  */
 
 #define fls(x) generic_fls(x)
+#define fls64(x)   generic_fls64(x)
 
 /*
  * ffs: find first bit set. This is defined the same way as
Index: bic-2.6/include/asm-cris/bitops.h
===================================================================
--- bic-2.6.orig/include/asm-cris/bitops.h
+++ bic-2.6/include/asm-cris/bitops.h
@@ -240,6 +240,7 @@ static inline int test_bit(int nr, const
  */
 
 #define fls(x) generic_fls(x)
+#define fls64(x)   generic_fls64(x)
 
 /*
  * hweightN - returns the hamming weight of a N-bit word
Index: bic-2.6/include/asm-frv/bitops.h
===================================================================
--- bic-2.6.orig/include/asm-frv/bitops.h
+++ bic-2.6/include/asm-frv/bitops.h
@@ -228,6 +228,7 @@ found_middle:
 	\
 	bit ? 33 - bit : bit;	\
 })
+#define fls64(x)   generic_fls64(x)
 
 /*
  * Every architecture must define this function. It's the fastest
Index: bic-2.6/include/asm-generic/bitops.h
===================================================================
--- bic-2.6.orig/include/asm-generic/bitops.h
+++ bic-2.6/include/asm-generic/bitops.h
@@ -56,6 +56,7 @@ extern __inline__ int test_bit(int nr, c
  */
 
 #define fls(x) generic_fls(x)
+#define fls64(x)   generic_fls64(x)
 
 #ifdef __KERNEL__
Index: bic-2.6/include/asm-h8300/bitops.h
===================================================================
--- bic-2.6.orig/include/asm-h8300/bitops.h
+++ bic-2.6/include/asm-h8300/bitops.h
@@ -406,5 +406,6 @@ found_middle:
 #endif /* __KERNEL__ */
 
 #define fls(x) generic_fls(x)
+#define fls64(x)   generic_fls64(x)
 
 #endif /* _H8300_BITOPS_H */
Index: bic-2.6/include/asm-i386/bitops.h
===================================================================
--- bic-2.6.orig/include/asm-i386/bitops.h
+++ bic-2.6/include/asm-i386/bitops.h
@@ -372,6 +372,7 @@ static inline unsigned long ffz(unsigned
  */
 
 #define fls(x) generic_fls(x)
+#define fls64(x)   generic_fls64(x)
 
 #ifdef __KERNEL__
Index: bic-2.6/include/asm-ia64/bitops.h
===================================================================
--- bic-2.6.orig/include/asm-ia64/bitops.h
+++ bic-2.6/include/asm-ia64/bitops.h
@@ -345,6 +345,7 @@ fls (int t)
 	x |= x >> 16;
 	return ia64_popcnt(x);
 }
+#define fls64(x)   generic_fls64(x)
 
 /*
  * ffs: find first bit set. This is defined the same way as the libc and compiler builtin
Index: bic-2.6/include/asm-m32r/bitops.h
===================================================================
--- bic-2.6.orig/include/asm-m32r/bitops.h
+++ bic-2.6/include/asm-m32r/bitops.h
@@ -465,6 +465,7 @@ static __inline__ unsigned long __ffs(un
  * fls: find last bit set.
  */
 #define fls(x) generic_fls(x)
+#define fls64(x)   generic_fls64(x)
 
 #ifdef __KERNEL__
Index: bic-2.6/include/asm-m68k/bitops.h
===================================================================
--- bic-2.6.orig/include/asm-m68k/bitops.h
+++ bic-2.6/include/asm-m68k/bitops.h
@@ -310,6 +310,7 @@ static inline
Re: Default net.ipv6.mld_max_msf = 10 and net.core.optmem_max=10240
From: Hoerdt Mickael [EMAIL PROTECTED]
Date: Wed, 14 Dec 2005 23:38:56 +0100

> As implemented now, the default memory allocated in net.core.optmem_max
> permits joining up to 320 (S,G) channels per socket (for IPv6, each
> channel costs 32 bytes of net.core.optmem_max). The thing is that
> net.ipv6.mld_max_msf sets a hard limit on it, so assuming that you
> don't change the value of net.core.optmem_max, would it make sense to
> increase net.ipv6.mld_max_msf to, let's say, 256? The rest of the
> memory can still be used for various option setup on the socket.

I think people running programs that need the higher value can increase the limit. This is no different than having to tweak tcp_wmem[] or the socket buffering limits via sysctl.
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
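The 320-channel figure follows directly from the defaults in the subject line (optmem_max = 10240 bytes, 32 bytes per IPv6 (S,G) channel); a quick check of the arithmetic:

```c
#include <assert.h>

/* Defaults quoted in this thread: net.core.optmem_max in bytes and
 * the per-(S,G)-channel cost for IPv6 source filters. */
#define OPTMEM_MAX_DEFAULT	10240
#define IPV6_CHANNEL_COST	32

/* Maximum number of (S,G) channels a single socket can join before
 * the optmem accounting limit is hit. */
static int max_channels(int optmem_max, int per_channel)
{
	return optmem_max / per_channel;	/* 10240 / 32 == 320 */
}
```

Raising net.ipv6.mld_max_msf to 256 would, by the same arithmetic, leave 10240 - 256*32 = 2048 bytes for other per-socket option state.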
[PATCH 3/6] skge: handle out of memory on MTU size changes
Changing the MTU size causes the receiver to have to reallocate buffers. If this allocation fails, then we need to return an error and take the device offline. It can then be brought back up or reconfigured for a smaller MTU.

Signed-off-by: Stephen Hemminger [EMAIL PROTECTED]

--- skge-2.6.orig/drivers/net/skge.c
+++ skge-2.6/drivers/net/skge.c
@@ -2192,6 +2192,7 @@ static int skge_up(struct net_device *de
 	kfree(skge->rx_ring.start);
  free_pci_mem:
 	pci_free_consistent(hw->pdev, skge->mem_size, skge->mem, skge->dma);
+	skge->mem = NULL;
 
 	return err;
 }
@@ -2202,6 +2203,9 @@ static int skge_down(struct net_device *
 	struct skge_hw *hw = skge->hw;
 	int port = skge->port;
 
+	if (skge->mem == NULL)
+		return 0;
+
 	if (netif_msg_ifdown(skge))
 		printk(KERN_INFO PFX "%s: disabling interface\n", dev->name);
@@ -2258,6 +2262,7 @@ static int skge_down(struct net_device *
 	kfree(skge->rx_ring.start);
 	kfree(skge->tx_ring.start);
 	pci_free_consistent(hw->pdev, skge->mem_size, skge->mem, skge->dma);
+	skge->mem = NULL;
 
 	return 0;
 }
@@ -2416,18 +2421,23 @@ static void skge_tx_timeout(struct net_d
 
 static int skge_change_mtu(struct net_device *dev, int new_mtu)
 {
-	int err = 0;
-	int running = netif_running(dev);
+	int err;
 
 	if (new_mtu < ETH_ZLEN || new_mtu > ETH_JUMBO_MTU)
 		return -EINVAL;
 
+	if (!netif_running(dev)) {
+		dev->mtu = new_mtu;
+		return 0;
+	}
+
+	skge_down(dev);
 
-	if (running)
-		skge_down(dev);
 	dev->mtu = new_mtu;
-	if (running)
-		skge_up(dev);
+
+	err = skge_up(dev);
+	if (err)
+		dev_close(dev);
 
 	return err;
 }
--
[PATCH 6/6] skge: version number (1.3)
Enough changes for one version.

Signed-off-by: Stephen Hemminger [EMAIL PROTECTED]

--- skge-2.6.orig/drivers/net/skge.c
+++ skge-2.6/drivers/net/skge.c
@@ -43,7 +43,7 @@
 #include "skge.h"
 
 #define DRV_NAME		"skge"
-#define DRV_VERSION		"1.2"
+#define DRV_VERSION		"1.3"
 #define PFX			DRV_NAME " "
 
 #define DEFAULT_TX_RING_SIZE	128
--
[PATCH 5/6] skge: handle out of memory on ring parameter change
If changing ring parameters is unable to allocate memory, we need to return an error and take the device down.

Fixes-bug: http://bugzilla.kernel.org/show_bug.cgi?id=5715

Signed-off-by: Stephen Hemminger [EMAIL PROTECTED]

--- skge-2.6.orig/drivers/net/skge.c
+++ skge-2.6/drivers/net/skge.c
@@ -397,6 +397,7 @@ static int skge_set_ring_param(struct ne
 			       struct ethtool_ringparam *p)
 {
 	struct skge_port *skge = netdev_priv(dev);
+	int err;
 
 	if (p->rx_pending == 0 || p->rx_pending > MAX_RX_RING_SIZE ||
 	    p->tx_pending == 0 || p->tx_pending > MAX_TX_RING_SIZE)
 		return -EINVAL;
@@ -407,7 +408,9 @@ static int skge_set_ring_param(struct ne
 
 	if (netif_running(dev)) {
 		skge_down(dev);
-		skge_up(dev);
+		err = skge_up(dev);
+		if (err)
+			dev_close(dev);
 	}
 
 	return 0;
--
[PATCH 1/6] skge: avoid up/down on speed changes
Changing the speed settings doesn't need to cause the link to go down/up. It can be handled by doing the same logic as nway_reset.

Signed-off-by: Stephen Hemminger [EMAIL PROTECTED]

--- skge-2.6.orig/drivers/net/skge.c
+++ skge-2.6/drivers/net/skge.c
@@ -88,15 +88,14 @@ MODULE_DEVICE_TABLE(pci, skge_id_table);
 
 static int skge_up(struct net_device *dev);
 static int skge_down(struct net_device *dev);
+static void skge_phy_reset(struct skge_port *skge);
 static void skge_tx_clean(struct skge_port *skge);
 static int xm_phy_write(struct skge_hw *hw, int port, u16 reg, u16 val);
 static int gm_phy_write(struct skge_hw *hw, int port, u16 reg, u16 val);
 static void genesis_get_stats(struct skge_port *skge, u64 *data);
 static void yukon_get_stats(struct skge_port *skge, u64 *data);
 static void yukon_init(struct skge_hw *hw, int port);
-static void yukon_reset(struct skge_hw *hw, int port);
 static void genesis_mac_init(struct skge_hw *hw, int port);
-static void genesis_reset(struct skge_hw *hw, int port);
 static void genesis_link_up(struct skge_port *skge);
 
 /* Avoid conditionals by using array */
@@ -276,10 +275,9 @@ static int skge_set_settings(struct net_
 	skge->autoneg = ecmd->autoneg;
 	skge->advertising = ecmd->advertising;
 
-	if (netif_running(dev)) {
-		skge_down(dev);
-		skge_up(dev);
-	}
+	if (netif_running(dev))
+		skge_phy_reset(skge);
+
 	return (0);
 }
 
@@ -430,21 +428,11 @@ static void skge_set_msglevel(struct net
 static int skge_nway_reset(struct net_device *dev)
 {
 	struct skge_port *skge = netdev_priv(dev);
-	struct skge_hw *hw = skge->hw;
-	int port = skge->port;
 
 	if (skge->autoneg != AUTONEG_ENABLE || !netif_running(dev))
 		return -EINVAL;
 
-	spin_lock_bh(&hw->phy_lock);
-	if (hw->chip_id == CHIP_ID_GENESIS) {
-		genesis_reset(hw, port);
-		genesis_mac_init(hw, port);
-	} else {
-		yukon_reset(hw, port);
-		yukon_init(hw, port);
-	}
-	spin_unlock_bh(&hw->phy_lock);
+	skge_phy_reset(skge);
 	return 0;
 }
 
@@ -2019,6 +2007,25 @@ static void yukon_phy_intr(struct skge_p
 	/* XXX restart autonegotiation? */
 }
 
+static void skge_phy_reset(struct skge_port *skge)
+{
+	struct skge_hw *hw = skge->hw;
+	int port = skge->port;
+
+	netif_stop_queue(skge->netdev);
+	netif_carrier_off(skge->netdev);
+
+	spin_lock_bh(&hw->phy_lock);
+	if (hw->chip_id == CHIP_ID_GENESIS) {
+		genesis_reset(hw, port);
+		genesis_mac_init(hw, port);
+	} else {
+		yukon_reset(hw, port);
+		yukon_init(hw, port);
+	}
+	spin_unlock_bh(&hw->phy_lock);
+}
+
 /* Basic MII support */
 static int skge_ioctl(struct net_device *dev, struct ifreq *ifr, int cmd)
 {
--
[PATCH 0/6] skge: error handling on config changes
--
[PATCH 4/6] skge: get rid of Yukon2 defines
We don't need to keep Yukon-2 related definitions around for the skge driver, which only supports Yukon-1 and Genesis.

Signed-off-by: Stephen Hemminger [EMAIL PROTECTED]

--- skge-2.6.orig/drivers/net/skge.h
+++ skge-2.6/drivers/net/skge.h
@@ -475,18 +475,6 @@ enum {
 	Q_T2	= 0x40,	/* 32 bit	Test Register 2 */
 	Q_T3	= 0x44,	/* 32 bit	Test Register 3 */
 
-/* Yukon-2 */
-	Q_DONE	= 0x24,	/* 16 bit	Done Index (Yukon-2 only) */
-	Q_WM	= 0x40,	/* 16 bit	FIFO Watermark */
-	Q_AL	= 0x42,	/*  8 bit	FIFO Alignment */
-	Q_RSP	= 0x44,	/* 16 bit	FIFO Read Shadow Pointer */
-	Q_RSL	= 0x46,	/*  8 bit	FIFO Read Shadow Level */
-	Q_RP	= 0x48,	/*  8 bit	FIFO Read Pointer */
-	Q_RL	= 0x4a,	/*  8 bit	FIFO Read Level */
-	Q_WP	= 0x4c,	/*  8 bit	FIFO Write Pointer */
-	Q_WSP	= 0x4d,	/*  8 bit	FIFO Write Shadow Pointer */
-	Q_WL	= 0x4e,	/*  8 bit	FIFO Write Level */
-	Q_WSL	= 0x4f,	/*  8 bit	FIFO Write Shadow Level */
 };
 
 #define Q_ADDR(reg, offs) (B8_Q_REGS + (reg) + (offs))
@@ -675,22 +663,16 @@ enum {
 	LED_OFF		= 10,	/* switch LED off */
 };
 
-/* Receive GMAC FIFO (YUKON and Yukon-2) */
+/* Receive GMAC FIFO (YUKON) */
 enum {
 	RX_GMF_EA	= 0x0c40,/* 32 bit	Rx GMAC FIFO End Address */
 	RX_GMF_AF_THR	= 0x0c44,/* 32 bit	Rx GMAC FIFO Almost Full Thresh. */
 	RX_GMF_CTRL_T	= 0x0c48,/* 32 bit	Rx GMAC FIFO Control/Test */
 	RX_GMF_FL_MSK	= 0x0c4c,/* 32 bit	Rx GMAC FIFO Flush Mask */
 	RX_GMF_FL_THR	= 0x0c50,/* 32 bit	Rx GMAC FIFO Flush Threshold */
-	RX_GMF_TR_THR	= 0x0c54,/* 32 bit	Rx Truncation Threshold (Yukon-2) */
-
-	RX_GMF_VLAN	= 0x0c5c,/* 32 bit	Rx VLAN Type Register (Yukon-2) */
 	RX_GMF_WP	= 0x0c60,/* 32 bit	Rx GMAC FIFO Write Pointer */
-
 	RX_GMF_WLEV	= 0x0c68,/* 32 bit	Rx GMAC FIFO Write Level */
-
 	RX_GMF_RP	= 0x0c70,/* 32 bit	Rx GMAC FIFO Read Pointer */
-
 	RX_GMF_RLEV	= 0x0c78,/* 32 bit	Rx GMAC FIFO Read Level */
 };
@@ -855,48 +837,6 @@ enum {
 	GMAC_TI_ST_TST	= 0x0e1a,/*  8 bit	Time Stamp Timer Test Reg */
 };
 
-/* Status BMU Registers (Yukon-2 only)*/
-enum {
-	STAT_CTRL	= 0x0e80,/* 32 bit	Status BMU Control Reg */
-	STAT_LAST_IDX	= 0x0e84,/* 16 bit	Status BMU Last Index */
-	/* 0x0e85 - 0x0e86:	reserved */
-	STAT_LIST_ADDR_LO = 0x0e88,/* 32 bit	Status List Start Addr (low) */
-	STAT_LIST_ADDR_HI = 0x0e8c,/* 32 bit	Status List Start Addr (high) */
-	STAT_TXA1_RIDX	= 0x0e90,/* 16 bit	Status TxA1 Report Index Reg */
-	STAT_TXS1_RIDX	= 0x0e92,/* 16 bit	Status TxS1 Report Index Reg */
-	STAT_TXA2_RIDX	= 0x0e94,/* 16 bit	Status TxA2 Report Index Reg */
-	STAT_TXS2_RIDX	= 0x0e96,/* 16 bit	Status TxS2 Report Index Reg */
-	STAT_TX_IDX_TH	= 0x0e98,/* 16 bit	Status Tx Index Threshold Reg */
-	STAT_PUT_IDX	= 0x0e9c,/* 16 bit	Status Put Index Reg */
-
-/* FIFO Control/Status Registers (Yukon-2 only)*/
-	STAT_FIFO_WP	= 0x0ea0,/*  8 bit	Status FIFO Write Pointer Reg */
-	STAT_FIFO_RP	= 0x0ea4,/*  8 bit	Status FIFO Read Pointer Reg */
-	STAT_FIFO_RSP	= 0x0ea6,/*  8 bit	Status FIFO Read Shadow Ptr */
-	STAT_FIFO_LEVEL	= 0x0ea8,/*  8 bit	Status FIFO Level Reg */
-	STAT_FIFO_SHLVL	= 0x0eaa,/*  8 bit	Status FIFO Shadow Level Reg */
-	STAT_FIFO_WM	= 0x0eac,/*  8 bit	Status FIFO Watermark Reg */
-	STAT_FIFO_ISR_WM= 0x0ead,/*  8 bit	Status FIFO ISR Watermark Reg */
-
-/* Level and ISR Timer Registers (Yukon-2 only)*/
-	STAT_LEV_TIMER_INI  = 0x0eb0,/* 32 bit	Level Timer Init. Value Reg */
-	STAT_LEV_TIMER_CNT  = 0x0eb4,/* 32 bit	Level Timer Counter Reg */
-	STAT_LEV_TIMER_CTRL = 0x0eb8,/*  8 bit	Level Timer Control Reg */
-	STAT_LEV_TIMER_TEST = 0x0eb9,/*  8 bit	Level Timer Test Reg */
-	STAT_TX_TIMER_INI   = 0x0ec0,/* 32 bit	Tx Timer Init. Value Reg */
-	STAT_TX_TIMER_CNT   = 0x0ec4,/* 32 bit	Tx Timer Counter Reg */
-	STAT_TX_TIMER_CTRL  = 0x0ec8,/*  8 bit	Tx Timer Control Reg */
-	STAT_TX_TIMER_TEST  = 0x0ec9,/*  8 bit	Tx Timer Test Reg */
-	STAT_ISR_TIMER_INI  = 0x0ed0,/* 32 bit	ISR Timer Init. Value Reg */
-	STAT_ISR_TIMER_CNT  = 0x0ed4,/* 32 bit	ISR Timer Counter Reg */
-	STAT_ISR_TIMER_CTRL = 0x0ed8,/*  8 bit	ISR Timer Control Reg */
-	STAT_ISR_TIMER_TEST = 0x0ed9,/*  8 bit	ISR Timer Test Reg */
-
-	ST_LAST_IDX_MASK    = 0x007f,/* Last Index Mask */
-	ST_TXRP_IDX_MASK    = 0x0fff,/* Tx Report Index Mask
Re: [RFC][PATCH 0/3] TCP/IP Critical socket communication mechanism
On Wed, 2005-12-14 at 14:39 -0800, Ben Greear wrote:
> James Courtier-Dutton wrote:
> > Have you actually thought about what would happen in a real world
> > scenario? There is no real world requirement for this sort of user
> > land feature. In memory pressure mode, you don't care about user
> > applications. In fact, under memory pressure no user applications
> > are getting scheduled. All you care about is swapping out memory to
> > achieve a net gain in free memory, so that the applications can then
> > run ok again.
>
> Low 'ATOMIC' memory is different from the memory that user space
> typically uses, so just because you can't allocate an SKB does not
> mean you are swapping out user-space apps. I have an app that can have
> 2000+ sockets open. I would definitely like to make the management and
> other important sockets have priority over others in my app...

The scenario we are trying to address is also a management connection between the nodes of a cluster and a server that manages the swap devices accessible by all the nodes of the cluster. The critical connection is supposed to be used to exchange status notifications of the swap devices, so that failover can happen and be propagated to all the nodes as quickly as possible. The management apps will be pinned into memory so that they are not swapped out. As such, the traffic that flows over the critical sockets is not high, but it should not stall even if we run into a memory constrained situation. That is the reason why we would like to have a pre-allocated critical page pool which could be used when we run out of ATOMIC memory.

Thanks
Sridhar
Re: [PATCH] vlan hardware rx csum errors
From: Stephen Hemminger [EMAIL PROTECTED]
Date: Tue, 13 Dec 2005 16:57:00 -0800

> Receiving VLAN packets over a device (without VLAN assist) that is
> doing hardware checksumming (CHECKSUM_HW) causes errors because the
> VLAN code forgets to adjust the hardware checksum.
>
> Signed-off-by: Stephen Hemminger [EMAIL PROTECTED]

Good catch, applied. I'll forward this off to -stable, as the fix is needed there as well.

Thanks.
Re: Resend [PATCH netdev-2.6 2/8] e1000: Performance Enhancements
David S. Miller wrote:
> From: John Ronciak [EMAIL PROTECTED]
> Date: Wed, 7 Dec 2005 11:48:46 -0800
>
> > Copybreak probably shouldn't be used in routing use cases.
>
> I think even this is arguable, routers route a lot more than small
> 64-byte frames. Unfortunately, that is what everyone uses for packet
> rate tests. :-/

Assuming only TCP flows go through a router, it is safe to say that the full-sized data frame to ACK ratio is about 2 to 1. Sadly, the picture most routers see is the opposite: about 2 sub-100-byte frames for every decent-sized one - and full-size is really rare, maybe just 1 in 5. This thread is semi-modern with some good data: http://www.cctec.com/maillists/nanog/historical/0312/msg00394.html and it is getting worse over time.. in 1998 it was more like 1:1. So the all-64-byte test isn't that crazy.

BTW - this has been a great thread - enjoyed reading it very much. But I've kind of lost a feel for what the prefetch and copybreak cases mean for local delivery (e.g. TCP termination) scenarios.. both in throughput and CPU left for the local application. That has to be a more important profile than IP forwarding. Any thoughts on that?

-Patrick
Re: SA switchover
On Wed, 2005-12-14 at 16:48 -0800, David S. Miller wrote:
> Please have a look at:
>
> 	http://bugzilla.kernel.org/show_bug.cgi?id=4952
>
> It should look familiar.

It is - the soup nazi got involved on that bug ;->
http://marc.theaimsgroup.com/?l=linux-netdev&m=113070963711648&w=2

> We were discussing this in depth a few weeks ago, but the discussion
> tailed off and I don't know how close we came to a consensus or what
> that consensus might be :-)

it sort of is still hanging, but there is progress.

> The crux of the matter, to reiterate, is that it is a non-trivial
> problem to determine what existing SA entries are subsumed by a newly
> inserted one. The kernel would need to execute a rather complicated
> search in order to determine this SA set.

Right - Herbert has some ideas that would require help from the KM. And we are actually agreeing we should implement a minimalist approach. More below ..

> The subsequent argument states that actually, unlike the kernel, the
> keying daemon does have some knowledge about what a new SA entry might
> be replacing. And therefore, that userland daemons such as racoon bear
> some responsibility in assisting in the smooth and efficient
> switchover from the dying state entry to the newly inserted SA.
>
> Any comments or corrections on this?

correct, with caveats: there are two sorts of problematic devices.

1) The Ciscos, I think PIX and their relatives (I heard linksys): These suckers have a fixed time between soft expiry time and hard expiry time ;-> IKE only negotiates hard expiry, and soft expiry is up to the peer. Racoon sets soft expiry = 80% of hard expiry. So if you have the expiry at 10 hours, racoon will set soft expiry at 8 hours. CISCO hardcodes 30 seconds to be between the hard and soft expiry ;-> Yep, when you have RFCs written in a natural language like English, shit like this happens. So at the 8 hour mark, racoon renegotiates. For 30 seconds more after that, things continue working. Then for the next 119.5 minutes nothing works, because in fact CISCO purges its old SA and Linux (as it should) starts using the new one. The proper way is for CISCO to send an IKE delete; it doesn't. To fix this I submitted a patch to racoon which is in their CVS - I was told it will show up around their release 0.7. The patch allows people to hardcode, like in cisco, a specific time. So this fixes the CISCO problem without touching the kernel.

2) There are other sorts of devices - I am told some made by a vendor called DrayTek in fact delete right away after renegotiation. But they do send an IKE delete, except racoon ignores it ;-> As was pointed out to me, since IKEv1 is unreliable such a message could be lost anyway. So a bug in racoon for sure, but not good enough given the unreliability of IKEv1.

So in the last discussion Herbert and I had, we talked about doing something in the kernel since this was getting frustrating ... Herbert has it on his TODO and I was going to do the racoon part once he has his patch.

cheers,
jamal
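The 119.5-minute outage window follows directly from the numbers above (10 hour hard expiry, racoon's 80% soft expiry, Cisco's fixed 30-second grace before purging the old SA); a quick check of the arithmetic:

```c
#include <assert.h>

/* Seconds of outage when the peer purges the old SA shortly after the
 * initiator's soft expiry, but the initiator keeps using the old SA
 * until its hard expiry. Times are in seconds. */
static int outage_seconds(int hard_expiry, int peer_grace)
{
	int soft_expiry = hard_expiry * 4 / 5;	/* racoon: 80% of hard */

	return hard_expiry - soft_expiry - peer_grace;
}
```

With a 10 hour (36000 s) hard expiry and a 30 s grace: 36000 - 28800 - 30 = 7170 seconds, i.e. 119.5 minutes of dead tunnel.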
Re: Specs for Tulip3
On Thu, 2005-12-15 at 00:07 +0100, Aritz Bastida wrote:
> > rx_threshold_hit: Rx max coalescing frames threshold hit.
>
> Well, I didn't understand what this threshold is for.

This counter counts the number of times rx packets have reached the max rx coalesced frames setting before an interrupt is generated. By default, the max rx coalesced frames is set to 6, which means that the chip will try to wait until 6 packets are received before generating an interrupt. Interrupt coalescing in addition to NAPI under heavy traffic may further increase throughput.

> I'm running Linux kernel 2.6.13 and tg3 version 3.37, so it should be
> new enough.

Newer versions have fancy prefetch added, a spinlock removed from the rx path, and an optimization in the use of the status tag. All these may allow you to receive a few more packets.

> I don't know how to verify if the NIC is in a PCIX bus. How can I
> check that? Running lspci I can see there are some PCIX bridges:

tg3's probing output will print the bus the device is in. You can also run lspci -vvv to find out.
Re: [RFC][PATCH 0/3] TCP/IP Critical socket communication mechanism
On Wed, Dec 14, 2005 at 09:55:45AM -0800, Sridhar Samudrala wrote:
> On Wed, 2005-12-14 at 10:22 +0100, Andi Kleen wrote:
> > > I would appreciate any feedback or comments on this approach.
> >
> > Maybe I'm missing something but wouldn't you need an own critical
> > pool (or at least reservation) for each socket to be safe against
> > deadlocks? Otherwise if a critical socket needs e.g. 2 pages to
> > finish something and 2 critical sockets are active they can each
> > steal the last pages from each other and deadlock.
>
> Here we are assuming that the pre-allocated critical page pool is big
> enough to satisfy the requirements of all the critical sockets.

Not a good assumption. A system can have between 1-1000 iSCSI connections open and we certainly don't want to preallocate enough room for 1000 connections to make progress when we might only have one in use.

I think we need a global receive pool and per-socket send pools.

--
Mathematics is the supreme nostalgia of our time.
Re: [RFC][PATCH 0/3] TCP/IP Critical socket communication mechanism
On Wed, Dec 14, 2005 at 08:30:23PM -0800, David S. Miller wrote:
> From: Matt Mackall [EMAIL PROTECTED]
> Date: Wed, 14 Dec 2005 19:39:37 -0800
>
> > I think we need a global receive pool and per-socket send pools.
>
> Mind telling everyone how you plan to make use of the global receive
> pool when the allocation happens in the device driver and we have no
> idea which socket the packet is destined for? What should be done for
> non-local packets being routed? The device drivers allocate packets
> for the entire system, long before we know who the eventually received
> packets are for. It is fully anonymous memory, and it's easy to design
> cases where the whole pool can be eaten up by non-local forwarded
> packets.

There need to be two rules. Iff the global memory critical flag is set:

- allocate from the global critical receive pool on receive
- return the packet to the global pool if it is not destined for a
  socket with an attached send mempool

I think this will provide the desired behavior, though only probabilistically. That is, we can fill the global receive pool with uninteresting packets such that we're forced to drop critical ACKs, but the boring packets will eventually be discarded as we walk up the stack and we'll eventually have room to receive retried ACKs.

> I truly dislike these patches being discussed because they are a
> complete hack, and admittedly don't even solve the problem fully. I
> don't have any concrete better ideas but that doesn't mean this stuff
> should go into the tree.

Agreed. I'm fairly convinced a full fix is doable, if you make a couple of assumptions (limited fragmentation), but it will unavoidably be less than pretty as it needs to cross some layers.

> I think GFP_ATOMIC memory pools are more powerful than they are given
> credit for. There is nothing preventing the implementation of dynamic
> GFP_ATOMIC watermarks, and having critical socket behavior kick in in
> response to hitting those watermarks.

There are two problems with GFP_ATOMIC. The first is that its users don't pre-state their worst-case usage, which means sizing the pool to reliably avoid deadlocks is impossible. The second is that there aren't any guarantees that GFP_ATOMIC allocations are actually critical in the needed-to-make-forward-VM-progress sense, or that they will be returned to the pool in a timely fashion. So I do think we need a distinct pool if we want to tackle this problem. Though it's probably worth mentioning that Linus was rather adamantly against even trying at KS.

--
Mathematics is the supreme nostalgia of our time.
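The sizing argument can be made concrete with a toy userspace sketch (hypothetical names, not the kernel mempool API): a reserve pool can only guarantee forward progress if every critical user pre-states its worst-case demand, so that reservations can be refused once the pool is oversubscribed.

```c
#include <assert.h>

/* Toy emergency pool: critical users reserve their worst-case page
 * count up front; reservation fails once the pool is oversubscribed.
 * This is the pre-stating of demand that GFP_ATOMIC users don't do. */
#define POOL_PAGES 8

struct emergency_pool {
	int reserved;	/* sum of declared worst-case demands */
	int in_use;	/* pages currently handed out */
};

/* Returns 0 on success, -1 if the pool cannot cover the worst case. */
static int pool_reserve(struct emergency_pool *p, int worst_case)
{
	if (p->reserved + worst_case > POOL_PAGES)
		return -1;
	p->reserved += worst_case;
	return 0;
}

/* Simplified: tracks aggregate usage rather than per-user quotas. */
static int pool_alloc(struct emergency_pool *p)
{
	if (p->in_use >= p->reserved)
		return -1;
	p->in_use++;
	return 0;
}
```

With an 8-page pool, two users each reserving 4 pages succeed, and a third reservation is refused up front instead of deadlocking later - exactly the property an unsized GFP_ATOMIC pool cannot offer.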
Re: [RFC][PATCH 0/3] TCP/IP Critical socket communication mechanism
From: Matt Mackall [EMAIL PROTECTED]
Date: Wed, 14 Dec 2005 21:02:50 -0800

> There need to be two rules. Iff the global memory critical flag is set:
>
> - allocate from the global critical receive pool on receive
> - return the packet to the global pool if it is not destined for a
>   socket with an attached send mempool

This shuts off a router and/or firewall just because iSCSI or NFS peed in its pants. Not really acceptable.

> I think this will provide the desired behavior

It's not desirable. What if iSCSI is protected by IPSEC, and the key management daemon has to process a security association expiration and negotiate a new one in order for iSCSI to further communicate with its peer when this memory shortage occurs? It needs to send packets back and forth with the remote key management daemon in order to do this, but since you cut it off with this critical receive pool, the negotiation will never succeed.

This stuff won't work. It's not a generic solution and that's why it has more holes than swiss cheese. :-)
Re: [RFC][PATCH 0/3] TCP/IP Critical socket communication mechanism
On Wed, Dec 14, 2005 at 08:30:23PM -0800, David S. Miller wrote:
> From: Matt Mackall [EMAIL PROTECTED]
> Date: Wed, 14 Dec 2005 19:39:37 -0800
>
> > I think we need a global receive pool and per-socket send pools.
>
> Mind telling everyone how you plan to make use of the global receive
> pool when the allocation happens in the device driver and we have no
> idea which socket the packet is destined for? What should be done for

In theory one could use multiple receive queues on an intelligent enough NIC, with the NIC distinguishing the sockets. But that would still be a nasty "you need advanced hardware FOO to avoid subtle problem Y" case. Also it would require lots of driver hacking. And most NICs seem to have limits on the size of the socket tables for this, which means you would end up in an "only N sockets supported safely" situation, with N likely being quite small on common hardware.

I think the idea of the original poster was that just freeing non-critical packets after a short time again would be good enough, but I'm a bit sceptical on that.

> I truly dislike these patches being discussed because they are a
> complete hack, and admittedly don't even solve the problem fully. I

I agree.

> I think GFP_ATOMIC memory pools are more powerful than they are given
> credit for. There is nothing preventing the implementation of dynamic

Their main problem is that they are used too widely and in a lot of situations that aren't really critical.

-Andi
Re: [RFC][PATCH 0/3] TCP/IP Critical socket communication mechanism
David S. Miller wrote:
> From: Matt Mackall [EMAIL PROTECTED]
> Date: Wed, 14 Dec 2005 21:02:50 -0800
>
> > There needs to be two rules: iff global memory critical flag is set
> >
> > - allocate from the global critical receive pool on receive
> > - return packet to global pool if not destined for a socket with an
> >   attached send mempool
>
> This shuts off a router and/or firewall just because iSCSI or NFS
> peed in its pants.  Not really acceptable.

But that should only happen (shut off a router and/or firewall) in
cases where we now completely deadlock and never recover, including
shutting off the router and firewall, because they don't have enough
memory to recv packets either.

> > I think this will provide the desired behavior
>
> It's not desirable.  What if iSCSI is protected by IPSEC, and the key
> management daemon has to process a security association expiration
> and negotiate a new one in order for iSCSI to further communicate
> with its peer when this memory shortage occurs?  It needs to send
> packets back and forth with the remote key management daemon in order
> to do this, but since you cut it off with this critical receive pool,
> the negotiation will never succeed.

I guess IPSEC would be a critical socket too, in that case. Sure,
there is nothing we can do if the daemon insists on allocating lots of
memory...

> This stuff won't work.  It's not a generic solution and that's why it
> has more holes than swiss cheese. :-)

True, it will have holes. I think something that is complementary and
would be desirable is to simply limit the amount of in-flight writeout
that things like NFS allow (or used to allow; haven't checked for a
while, and there were noises about it getting better).

--
SUSE Labs, Novell Inc.
Re: [RFC][PATCH 0/3] TCP/IP Critical socket communication mechanism
On Wed, 14 Dec 2005 21:23:09 -0800 (PST)
David S. Miller [EMAIL PROTECTED] wrote:

> From: Matt Mackall [EMAIL PROTECTED]
> Date: Wed, 14 Dec 2005 21:02:50 -0800
>
> > There needs to be two rules: iff global memory critical flag is set
> >
> > - allocate from the global critical receive pool on receive
> > - return packet to global pool if not destined for a socket with an
> >   attached send mempool
>
> This shuts off a router and/or firewall just because iSCSI or NFS
> peed in its pants.  Not really acceptable.
>
> > I think this will provide the desired behavior
>
> It's not desirable.  What if iSCSI is protected by IPSEC, and the key
> management daemon has to process a security association expiration
> and negotiate a new one in order for iSCSI to further communicate
> with its peer when this memory shortage occurs?  It needs to send
> packets back and forth with the remote key management daemon in order
> to do this, but since you cut it off with this critical receive pool,
> the negotiation will never succeed.
>
> This stuff won't work.  It's not a generic solution and that's why it
> has more holes than swiss cheese. :-)

Also, all this stuff is just a band-aid because Linux OOM behavior is
so fucked up. The VM system just lets the user dig themselves into a
huge overcommit, then we get into trying to change every other
subsystem to compensate.

How about cutting things off earlier, and not falling off the cliff?
How about pushing pages out to swap earlier, when memory pressure
first gets noticed? Then you can free those non-dirty pages to make
progress. Too many of the VM decisions seem to be made in favor of
keep-it-in-memory benchmark situations.
Re: [RFC][PATCH 0/3] TCP/IP Critical socket communication mechanism
On Thu, 15 Dec 2005 06:42:45 +0100
Andi Kleen [EMAIL PROTECTED] wrote:

> On Wed, Dec 14, 2005 at 08:30:23PM -0800, David S. Miller wrote:
> > From: Matt Mackall [EMAIL PROTECTED]
> > Date: Wed, 14 Dec 2005 19:39:37 -0800
> >
> > > I think we need a global receive pool and per-socket send pools.
> >
> > Mind telling everyone how you plan to make use of the global
> > receive pool when the allocation happens in the device driver and
> > we have no idea which socket the packet is destined for?  What
> > should be done for
>
> In theory one could use multiple receive queues on an intelligent
> enough NIC, with the NIC distinguishing the sockets. But that would
> still be a nasty "you need advanced hardware FOO to avoid subtle
> problem Y" case. Also it would require lots of driver hacking. And
> most NICs seem to have limits on the size of the socket tables for
> this, which means you would end up in an "only N sockets supported
> safely" situation, with N likely being quite small on common
> hardware.
>
> I think the idea of the original poster was that just freeing
> non-critical packets again after a short time would be good enough,
> but I'm a bit sceptical on that.
>
> > I truly dislike these patches being discussed because they are a
> > complete hack, and admittedly don't even solve the problem fully.
> > I
>
> I agree.
>
> > I think GFP_ATOMIC memory pools are more powerful than they are
> > given credit for.  There is nothing preventing the implementation
> > of dynamic
>
> Their main problem is that they are used too widely and in a lot of
> situations that aren't really critical.

Most of the use of GFP_ATOMIC is by stuff that could fail but can't
sleep waiting for memory. How about adding a GFP_NORMAL for
allocations made while holding a lock:

	#define GFP_NORMAL	(__GFP_NOMEMALLOC)

Then get people to change the unneeded GFP_ATOMIC's to GFP_NORMAL in
places where the error paths are reasonable.
Re: [RFC][PATCH 0/3] TCP/IP Critical socket communication mechanism
On Wed, 14 Dec 2005, David S. Miller wrote:
> From: Matt Mackall [EMAIL PROTECTED]
> Date: Wed, 14 Dec 2005 19:39:37 -0800
>
> > I think we need a global receive pool and per-socket send pools.
>
> Mind telling everyone how you plan to make use of the global receive
> pool when the allocation happens in the device driver and we have no
> idea which socket the packet is destined for?  What should be done
> for non-local packets being routed?  The device drivers allocate
> packets for the entire system, long before we know who the
> eventually received packets are for.  It is fully anonymous memory,
> and it's easy to design cases where the whole pool can be eaten up
> by non-local forwarded packets.
>
> I truly dislike these patches being discussed because they are a
> complete hack, and admittedly don't even solve the problem fully.  I
> don't have any concrete better ideas but that doesn't mean this
> stuff should go into the tree.
>
> I think GFP_ATOMIC memory pools are more powerful than they are
> given credit for.  There is nothing preventing the implementation of
> dynamic GFP_ATOMIC watermarks, and having critical socket behavior
> kick in in response to hitting those watermarks.

Does this mean that you are OK with having a mechanism to mark sockets
as critical and dropping the non-critical packets under emergency, but
you do not like having a separate critical page pool? Instead, you
seem to be suggesting that in_emergency be set dynamically when we are
about to run out of ATOMIC memory. Is this right?

Thanks
Sridhar