[RFC][PATCH 2/3] TCP/IP Critical socket communication mechanism

2005-12-14 Thread Sridhar Samudrala
When the 'system_in_emergency' flag is set, drop any incoming packets that
belong to non-critical sockets as soon as we can determine the destination
socket. This is necessary to prevent incoming non-critical packets from
consuming memory from the critical page pool.
---

 include/net/sock.h  |   14 ++++++++++++++
 net/dccp/ipv4.c     |    4 ++++
 net/ipv4/tcp_ipv4.c |    3 +++
 net/ipv4/udp.c      |    9 ++++++++-
 net/ipv6/tcp_ipv6.c |    3 +++
 net/sctp/input.c    |    3 +++
 6 files changed, 35 insertions(+), 1 deletions(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index 982b4ec..8de8a8b 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -1391,4 +1391,18 @@ extern int sysctl_optmem_max;
 extern __u32 sysctl_wmem_default;
 extern __u32 sysctl_rmem_default;

+extern int system_in_emergency;
+
+static inline int emergency_check(struct sock *sk, struct sk_buff *skb)
+{
+	if (system_in_emergency && !(sk->sk_allocation & __GFP_CRITICAL)) {
+		if (net_ratelimit())
+			printk("discarding skb:%p len:%d sk:%p protocol:%d\n",
+			       skb, skb->len, sk, sk->sk_protocol);
+   return 0;
+   }
+
+   return 1;
+}
+
 #endif /* _SOCK_H */
diff --git a/net/dccp/ipv4.c b/net/dccp/ipv4.c
index ca03521..405cdf8 100644
--- a/net/dccp/ipv4.c
+++ b/net/dccp/ipv4.c
@@ -1130,6 +1130,10 @@ int dccp_v4_rcv(struct sk_buff *skb)
goto no_dccp_socket;
}

+   if (!emergency_check(sk, skb)) {
+   goto discard_and_relse;
+   }
+
/*
 * Step 2:
 *  ... or S.state == TIMEWAIT,
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 4d5021e..acfb9a1 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1232,6 +1232,9 @@ int tcp_v4_rcv(struct sk_buff *skb)
if (!sk)
goto no_tcp_socket;

+   if (!emergency_check(sk, skb))
+   goto discard_and_relse;
+
 process:
if (sk->sk_state == TCP_TIME_WAIT)
goto do_time_wait;
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 2422a5f..f79cbfd 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -1150,7 +1150,14 @@ int udp_rcv(struct sk_buff *skb)
sk = udp_v4_lookup(saddr, uh->source, daddr, uh->dest,
skb->dev->ifindex);

if (sk != NULL) {
-   int ret = udp_queue_rcv_skb(sk, skb);
+   int ret;
+
+   if (!emergency_check(sk, skb)) {
+   sock_put(sk);
+   goto drop;
+   } else
+   ret = udp_queue_rcv_skb(sk, skb);
+
sock_put(sk);

/* a return value > 0 means to resubmit the input, but
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index 62c0e5b..d017181 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -1592,6 +1592,9 @@ static int tcp_v6_rcv(struct sk_buff **p
if (!sk)
goto no_tcp_socket;

+   if (!emergency_check(sk, skb))
+   goto discard_and_relse;
+
 process:
if (sk->sk_state == TCP_TIME_WAIT)
goto do_time_wait;
diff --git a/net/sctp/input.c b/net/sctp/input.c
index b24ff2c..553365b 100644
--- a/net/sctp/input.c
+++ b/net/sctp/input.c
@@ -181,6 +181,9 @@ int sctp_rcv(struct sk_buff *skb)
rcvr = asoc ? &asoc->base : &ep->base;
sk = rcvr->sk;

+   if (!emergency_check(sk, skb))
+   goto discard_it;
+
/*
 * If a frame arrives on an interface and the receiving socket is
 * bound to another interface, via SO_BINDTODEVICE, treat it as OOTB
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[RFC][PATCH 1/3] TCP/IP Critical socket communication mechanism

2005-12-14 Thread Sridhar Samudrala
Introduce a new socket option SO_CRITICAL to mark a socket as critical.
This socket option takes an integer boolean flag that can be set using
setsockopt() and read with getsockopt().
---

 include/asm-i386/socket.h    |    2 ++
 include/asm-powerpc/socket.h |    2 ++
 net/core/sock.c              |   13 ++++++++++++-
 3 files changed, 16 insertions(+), 1 deletions(-)

diff --git a/include/asm-i386/socket.h b/include/asm-i386/socket.h
index 802ae76..bd4ce8e 100644
--- a/include/asm-i386/socket.h
+++ b/include/asm-i386/socket.h
@@ -49,4 +49,6 @@

 #define SO_PEERSEC 31

+#define SO_CRITICAL100
+
 #endif /* _ASM_SOCKET_H */
diff --git a/include/asm-powerpc/socket.h b/include/asm-powerpc/socket.h
index e4b8177..6cfb79a 100644
--- a/include/asm-powerpc/socket.h
+++ b/include/asm-powerpc/socket.h
@@ -56,4 +56,6 @@

 #define SO_PEERSEC 31

+#define SO_CRITICAL100
+
 #endif /* _ASM_POWERPC_SOCKET_H */
diff --git a/net/core/sock.c b/net/core/sock.c
index 13cc3be..d2d10cb 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -456,6 +456,13 @@ set_rcvbuf:
ret = -ENONET;
break;

+   case SO_CRITICAL:
+   if (valbool)
+   sk->sk_allocation |= __GFP_CRITICAL;
+   else
+   sk->sk_allocation &= ~__GFP_CRITICAL;
+   break;
+
/* We implement the SO_SNDLOWAT etc to
   not be settable (1003.1g 5.3) */
default:
@@ -616,7 +623,11 @@ int sock_getsockopt(struct socket *sock,

case SO_PEERSEC:
return security_socket_getpeersec(sock, optval, optlen, 
len);
-
+
+   case SO_CRITICAL:
+   v.val = ((sk->sk_allocation & __GFP_CRITICAL) != 0);
+   break;
+
default:
return(-ENOPROTOOPT);
}



[RFC][PATCH 0/3] TCP/IP Critical socket communication mechanism

2005-12-14 Thread Sridhar Samudrala

This set of patches provides a TCP/IP emergency communication mechanism that
can be used to guarantee that high-priority communications over a critical
socket succeed even under very low memory conditions that last for a couple of
minutes. It uses the critical page pool facility provided by Matt's patches,
which he posted recently on lkml.
http://lkml.org/lkml/2005/12/14/34/index.html

This mechanism provides a new socket option SO_CRITICAL that can be used to
mark a socket as critical. A critical connection used for emergency
communications has to be established and marked as critical before we enter
the emergency condition.

It uses the __GFP_CRITICAL flag introduced in the critical page pool patches
to indicate that an allocation request is critical and should be satisfied from
the critical page pool if required. In the send path, this flag is passed with
all allocation requests that are made for a critical socket. But in the receive
path we do not know whether a packet is critical or not until we receive it and
find the socket that it is destined for. So we treat all the allocation
requests in the receive path as critical.

The critical page pool patches also introduce a global flag
'system_in_emergency' that is used to indicate an emergency situation (could be
a low memory condition). When this flag is set, any incoming packets that
belong to non-critical sockets are dropped as soon as possible in the receive
path. This is necessary to prevent incoming non-critical packets from consuming
memory from the critical page pool.

I would appreciate any feedback or comments on this approach.

Thanks
Sridhar


Re: [RFC][PATCH 0/3] TCP/IP Critical socket communication mechanism

2005-12-14 Thread Andi Kleen
 I would appreciate any feedback or comments on this approach.

Maybe I'm missing something, but wouldn't you need a separate critical
pool (or at least a reservation) for each socket to be safe against deadlocks?

Otherwise, if a critical socket needs e.g. 2 pages to finish something
and 2 critical sockets are active, they can each steal the last pages
from each other and deadlock.

-Andi


Re: [RFC][PATCH 3/3] TCP/IP Critical socket communication mechanism

2005-12-14 Thread Alan Cox
On Mer, 2005-12-14 at 01:12 -0800, Sridhar Samudrala wrote:
 Pass __GFP_CRITICAL flag with all allocation requests that are critical.
 - All allocations needed to process incoming packets are marked as CRITICAL.
   This includes the allocations
  - made by the driver to receive incoming packets
  - to process and send ARP packets
  - to create new routes for incoming packets

But your user space that would add the routes is not so protected, so I'm
not sure this is actually a solution - more of an extended fudge. In
which case I'm not clear why it is any better than the current
GFP_ATOMIC approach.

 +#define SK_CRIT_ALLOC(sk, flags) ((sk->sk_allocation & __GFP_CRITICAL) | flags)

Lots of hidden conditional logic on critical paths. Also, sk should be in
brackets so that the macro evaluation order is defined, as should flags.

 +#define CRIT_ALLOC(flags) (__GFP_CRITICAL | flags)

Pointless obfuscation




Re: [PATCH] cubic: pre-compute based on parameters

2005-12-14 Thread Baruch Even
David S. Miller wrote:
 From: Stephen Hemminger [EMAIL PROTECTED]
 Date: Mon, 12 Dec 2005 12:03:22 -0800
 
 
-d32 = d32 / HZ;
-
 /* (wmax-cwnd) * (srtt<<3 / HZ) / c * 2^(3*bictcp_HZ)  */
-d64 = (d64 * dist * d32) >> (count+3-BICTCP_HZ);
-
-/* cubic root */
-d64 = cubic_root(d64);
-
-result = (u32)d64;
-return result;
+ return cubic_root((cube_factor * dist) >> (cube_scale + 3 - BICTCP_HZ));
 
  ...
 
+ while (!(d32 & 0x8000) && (cube_scale < BICTCP_HZ)) {
+ 	d32 = d32 << 1;
+ ++cube_scale;
+ }
+ cube_factor = d64 * d32 / HZ;
+
 
 
 I don't think this transformation is equivalent.
 
 In the old code only the d32 is scaled by HZ.
 
 So in the old code we're saying something like:
 
   d64 = (d64 * dist * (d32 / HZ)) >> (count + 3 - BICTCP_HZ);
 
 whereas the new code looks like:
 
   d64 = (((d64 * d32) / HZ) * dist) >> (count + 3 - BICTCP_HZ);
 
 Is that really equivalent?

Almost. It depends on how large the numbers in d64 and d32 are: if their
multiplication may overflow, then the first option is better since it has
less of a chance to overflow.

On the other hand, the second line can be more accurate.

Baruch


Re: [RFC][PATCH 3/3] TCP/IP Critical socket communication mechanism

2005-12-14 Thread Mitchell Blank Jr
Alan Cox wrote:
 But your user space that would add the routes is not so protected so I'm
 not sure this is actually a solution, more of an extended fudge.

Yes, there's no 100% solution -- no matter how much memory you reserve and
how many paths you protect, if you try hard enough you can come up
with cases where it'll fail.  (I'm swapping to NFS across a tun/tap
interface to a custom userland SSL tunnel to a server across a BGP route...)

However, if the 'extended fudge' pushes a problem from "can happen, even
in a very normal setup" territory to "only happens if you're doing something
pretty weird", then is it really such a bad thing?  I think the cost in code
complexity looks pretty reasonable.

  +#define SK_CRIT_ALLOC(sk, flags) ((sk->sk_allocation & __GFP_CRITICAL) | flags)
 
 Lots of hidden conditional logic on critical paths.

How expensive is it compared to the allocation itself?

  +#define CRIT_ALLOC(flags) (__GFP_CRITICAL | flags)
 
 Pointless obfuscation

Fully agree.

-Mitch


Re: Fw: 2.6.15-rc5 gre tunnel checksum error

2005-12-14 Thread Herbert Xu
On Tue, Dec 13, 2005 at 06:30:38AM +, Paul Erkkila wrote:

 GRE tunnel.
 
 ip tunnel:
 tunnel0: gre/ip  remote xx.xx.xx.xx  local xx.xx.xx.xx  ttl 255  key
 xx.xx.xx.xx
   Checksum in received packet is required.
   Checksum output packets.

Thanks.  It turns out to be a bug in the GRE layer.  I added that
bug when I introduced skb_postpull_rcsum.

[GRE]: Fix hardware checksum modification

The skb_postpull_rcsum change introduced a bug into the checksum modification.
Although the length pulled is offset bytes, the origin of the pull
is the GRE header, not the IP header.

Signed-off-by: Herbert Xu [EMAIL PROTECTED]

Dave, please apply this if this works for Paul.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmVHI~} [EMAIL PROTECTED]
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
diff --git a/net/ipv4/ip_gre.c b/net/ipv4/ip_gre.c
--- a/net/ipv4/ip_gre.c
+++ b/net/ipv4/ip_gre.c
@@ -618,7 +618,7 @@ static int ipgre_rcv(struct sk_buff *skb
 
	skb->mac.raw = skb->nh.raw;
	skb->nh.raw = __pskb_pull(skb, offset);
-	skb_postpull_rcsum(skb, skb->mac.raw, offset);
+	skb_postpull_rcsum(skb, skb->h.raw, offset);
	memset(&(IPCB(skb)->opt), 0, sizeof(struct ip_options));
	skb->pkt_type = PACKET_HOST;
 #ifdef CONFIG_NET_IPGRE_BROADCAST


Re: Fw: 2.6.15-rc5 gre tunnel checksum error

2005-12-14 Thread Paul Erkkila
Herbert Xu wrote:
 Thanks.  It turns out to be a bug in the GRE layer.  I added that
 bug when I introduced skb_postpull_rcsum.

 [GRE]: Fix hardware checksum modification

 The skb_postpull_rcsum introduced a bug to the checksum modification.
 Although the length pulled is offset bytes, the origin of the pulling
 is the GRE header, not the IP header.

 Signed-off-by: Herbert Xu [EMAIL PROTECTED]

 Dave, please apply this if this works for Paul.

 Cheers,
   
Works fine here.

Thanks  =).

-pee


Re: [RFC] ip / ifconfig redesign

2005-12-14 Thread Al Boldi
Bernd Eckenfels wrote:
 Al Boldi wrote:
  The current ip / ifconfig configuration is arcane and inflexible.  The
  reason being that they are based on design principles inherited from
  the last century.

 Yes I agree; however, note that some of the assumptions are backed up and
 required by RFCs, for example the binding of addresses to interfaces. This
 is especially strongly required in the IPv6 world with all the scoping and
 renumbering RFCs.

Can you point me to those RFCs? Thanks!

 The things you want to change need to be changed in kernel space, btw.

True.

I mentioned ip / ifconfig not to imply that they are the culprit, but to
expose the underlying kernel implementation.

This does not mean, though, that ip / ifconfig cannot offer an emulated
OSI-compliant mode, which would be an impetus to change the underlying
implementation.

Thanks!

--
Al



Specs for Tulip3

2005-12-14 Thread Aritz Bastida
Hello,

I've been reading the source code for the tg3 module (Broadcom Tigon3
Ethernet card) in the Linux kernel. Specifically, I need to access the
NIC-specific statistics, since I have to measure the performance of a
server under heavy network loads. Although the statistics exported
with ethtool are quite self-explanatory, I would like to understand in
depth the meaning of each variable. I guess those are described in the
NIC specs, but I wasn't able to find them on the Web (not even on the
Broadcom web page).

How can I find the specs for the Tulip3 NIC?

Thank you
Regards

Aritz


Re: Resend [PATCH netdev-2.6 2/8] e1000: Performance Enhancements

2005-12-14 Thread Robert Olsson

jamal writes:

  Essentially the approach would be the same as Robert's old recycle patch
  where he doesn't recycle certain skbs - the only difference being that in
  the case of forwarding, the recycle is done asynchronously at EOT whereas
  this is done synchronously upon return from the host path.
  The beauty of the approach is you don't have to worry about recycling on
  the wrong CPU ;- (which has been a big problem for the forwarding path)

  I have to chime in and say for the host stack - I like it ;-

 No, we don't solve any problems for forwarding, but as Dave pointed out
 we can do nice things. Instead of dropping the skb in case of failures or
 netfilter etc. we can reuse the skb, and if the skb is consumed within
 the RX softirq we can just return it to the driver.

 You did the feedback mechanism NET_RX_XX stuff six years ago.
 Now it can possibly be used :)

 A practical problem is how to maintain compatibility with the current
 behavior, which defaults to NET_RX_SKB_CONSUMED.

 A new driver entry point? And can we increase skb->users to delay
 skb destruction until the driver gets the indication back?
 So the driver would do the final kfree, and not the protocol layers
 as now? This is to avoid massive code changes.

 Thoughts?

 Cheers
--ro


Re: IPSEC tunnel: more than two networks?

2005-12-14 Thread Ingo Oeser
Michael Tokarev wrote:

[..]
 So the question is: is the setup like this one supposed to work at all
 in linux?
 
 I know there are other less ugly ways to achieve the same effect, e.g.
 by using GRE/IPIP tunnels and encapsulating the traffic in IPSEC (this
 way, we'll have only one transport-mode IPSEC connection and normal
 interfaces to route traffic to/via), but this is NOT how the whole
 infrastructure in their network is implemented - they - it seems, for
 whatever reason -
[...]
 use separate tunnels to route each network. 

Yes, that's how I did it, too. It works perfectly to tunnel
each network segment separately. Simple routing is not enough.

Don't forget to mention your tunneled networks in the FORWARD chain,
if your ipsec gateway is also your firewall.

I implemented the separate tunnels via racoon and racoon-tool
from the latest Debian sarge. Connectivity to a Cisco PIX was possible that way.


Regards

Ingo Oeser



Re: Specs for Tulip3

2005-12-14 Thread Michael Chan
On Wed, 2005-12-14 at 17:56 +0100, Aritz Bastida wrote:

 How can I find the specs for the Tulip3 NIC?
 
Most of the statistics counters follow the MIB definitions in the RFCs.
There are a few that are non-standard but should be self-explanatory.
Send me an email if you need more information on some of the counters.



Re: [RFC][PATCH 0/3] TCP/IP Critical socket communication mechanism

2005-12-14 Thread Sridhar Samudrala
On Wed, 2005-12-14 at 10:22 +0100, Andi Kleen wrote:
  I would appreciate any feedback or comments on this approach.
 
  Maybe I'm missing something, but wouldn't you need a separate critical
  pool (or at least a reservation) for each socket to be safe against deadlocks?

  Otherwise, if a critical socket needs e.g. 2 pages to finish something
  and 2 critical sockets are active, they can each steal the last pages
  from each other and deadlock.

Here we are assuming that the pre-allocated critical page pool is big enough
to satisfy the requirements of all the critical sockets.

In the current critical page pool implementation, there is also a limitation
that only order-0 allocations (single page) are supported. I think in the
networking send/receive path, the only place where multi-page allocations are
requested is in the drivers, if the MTU > PAGE_SIZE. But I guess the drivers
are getting updated to avoid allocations of order > 0.

Also during the emergency, we free the memory allocated for non-critical 
packets as quickly as possible so that it can be re-used for critical
allocations.

Thanks
Sridhar



Re: [RFC][PATCH 3/3] TCP/IP Critical socket communication mechanism

2005-12-14 Thread Sridhar Samudrala
On Wed, 2005-12-14 at 11:17 +, Alan Cox wrote:
 On Mer, 2005-12-14 at 01:12 -0800, Sridhar Samudrala wrote:
  Pass __GFP_CRITICAL flag with all allocation requests that are critical.
  - All allocations needed to process incoming packets are marked as CRITICAL.
This includes the allocations
   - made by the driver to receive incoming packets
   - to process and send ARP packets
   - to create new routes for incoming packets
 
 But your user space that would add the routes is not so protected so I'm
 not sure this is actually a solution, more of an extended fudge. In
 which case I'm not clear why it is any better than the current
 GFP_ATOMIC approach.

I am not referring to routes that are added by user space, but to the
allocations needed for cached routes stored in skb->dst in the ip_route_input() path.

  +#define SK_CRIT_ALLOC(sk, flags) ((sk->sk_allocation & __GFP_CRITICAL) | flags)
 
  Lots of hidden conditional logic on critical paths. Also, sk should be in
  brackets so that the macro evaluation order is defined, as should flags.
 
  +#define CRIT_ALLOC(flags) (__GFP_CRITICAL | flags)
 
 Pointless obfuscation

The only reason I made these macros is that I would expect this to be a
compile-time configurable option so that there is zero overhead for regular
users.

#ifdef CONFIG_CRIT_SOCKET
#define SK_CRIT_ALLOC(sk, flags) ((sk->sk_allocation & __GFP_CRITICAL) | flags)
#define CRIT_ALLOC(flags) (__GFP_CRITICAL | flags)
#else
#define SK_CRIT_ALLOC(sk, flags) flags
#define CRIT_ALLOC(flags) flags
#endif

Thanks
Sridhar



Re: [RFC][PATCH 3/3] TCP/IP Critical socket communication mechanism

2005-12-14 Thread Ingo Oeser
Sridhar Samudrala wrote:
 The only reason I made these macros is that I would expect this to be a
 compile-time configurable option so that there is zero overhead for regular
 users.

 #ifdef CONFIG_CRIT_SOCKET
 #define SK_CRIT_ALLOC(sk, flags) ((sk->sk_allocation & __GFP_CRITICAL) | flags)
 #define CRIT_ALLOC(flags) (__GFP_CRITICAL | flags)
 #else
 #define SK_CRIT_ALLOC(sk, flags) flags
 #define CRIT_ALLOC(flags) flags
 #endif

Oh, that's much simpler to achieve:

#ifdef CONFIG_CRIT_SOCKET
#define __GFP_CRITICAL_SOCKET __GFP_CRITICAL
#else
#define __GFP_CRITICAL_SOCKET 0
#endif

Maybe we can get better naming here, but you get the point, I think.


Regards

Ingo Oeser



Re: [RFC][PATCH 3/3] TCP/IP Critical socket communication mechanism

2005-12-14 Thread Sridhar Samudrala
On Wed, 2005-12-14 at 04:12 -0800, Mitchell Blank Jr wrote:
 Alan Cox wrote:
  But your user space that would add the routes is not so protected so I'm
  not sure this is actually a solution, more of an extended fudge.
 
 Yes, there's no 100% solution -- no matter how much memory you reserve and
 how many paths you protect if you try hard enough you can come up
 with cases where it'll fail.  (I'm swapping to NFS across a tun/tap
 interface to a custom userland SSL tunnel to a server across a BGP route...)
 
 However, if the 'extended fudge' pushes a problem from "can happen, even
 in a very normal setup" territory to "only happens if you're doing something
 pretty weird", then is it really such a bad thing?  I think the cost in code
 complexity looks pretty reasonable.

Yes. This should work fine for cases where you need a limited number of
critical allocation requests to succeed for a short period of time.

   +#define SK_CRIT_ALLOC(sk, flags) ((sk->sk_allocation & __GFP_CRITICAL) | flags)
  
  Lots of hidden conditional logic on critical paths.
 
 How expensive is it compared to the allocation itself?

Also, as I said in my other response, we could make it a compile-time
configurable option with zero overhead when turned off.

Thanks
Sridhar

 
   +#define CRIT_ALLOC(flags) (__GFP_CRITICAL | flags)
  
  Pointless obfuscation
 
 Fully agree.
 
 -Mitch



Re: [PATCH] forcedeth TSO fix for large buffers

2005-12-14 Thread Ayaz Abdulla
Has anyone had a chance to review this patch and apply it? I would like 
it to make 2.6.15 kernel since it is a bug related to TSO in the driver.


Thanks,
Ayaz


Re: [RFC][PATCH 0/3] TCP/IP Critical socket communication mechanism

2005-12-14 Thread David Stevens
 It has a lot
 more users that compete, true, but likely the set of GFP_CRITICAL users
 would grow over time too and it would develop the same problem.

No, because the critical set is determined by the user (by setting
the socket flag).
The receive side has some things marked as critical until we
have processed enough to check the socket flag, but then they should
be released. Those short-lived allocations and frees are more or less
net zero towards the pool.
Certainly, it wouldn't work very well if every socket were
marked as critical, but with an adequate pool for the workload, I
expect it'll work as advertised (esp. since it'll usually be only one
socket, associated with swap management, that'll be critical).

+-DLS



Re: Specs for Tulip3

2005-12-14 Thread Michael Chan
On Wed, 2005-12-14 at 19:38 +0100, Aritz Bastida wrote:

 Thank you for your email. But could you tell me what RFC specifically?
 Is it RFC1284? The counters I am looking for are:
 

These are custom counters not from any RFCs.

dma_writeq_full

DMA write queue full - meaning host is not recycling rx buffers fast
enough.
rx_threshold_hit

Rx max coalescing frames threshold hit.

ring_status_update

Status block update.

 
 I have a dual AMD Opteron 1800MHz, which will be capturing all the
 traffic in a Gigabit Ethernet segment and analyzing the packets it
 captures. It's a kind of IDS which must work under heavy network
 loads. I am testing the maximum speed at which it can receive packets (it
 has got a Broadcom Tulip3 NIC: BCM5704). For that purpose, I use another
 machine to inject the packets. I do that with the pktgen module.
 
 Here are the results for a sample test:
 Injection machine (Dual Pentium III 866MHz):
  * Number of packets: 21134488
  * Packet size: 100 bytes
  * Speed: 341242pps 272Mb/sec
 
 Receive machine (Dual AMD Opteron 1800MHz ):
 (There are no processes running in this machine, specifically the packet
  analysis is stopped)
  * rx_ucast_packets: 21134816
  * rx_65_to_127_octet_packets: 21134597
  * dma_writeq_full: 12919200
  * rx_discards: 12947380
  * rx_threshold_hit: 1549692
  * ring_status_update: 1677648
 

What bus is the NIC in? PCI or PCI-X? At what speed? You may want to play
around with the rx ring sizes and rx coalescing parameters, all of which can
be changed with ethtool. Also, be sure to use the latest tg3 driver, which
is 3.45.
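That tuning advice translates to ethtool invocations along these lines (a
sketch; eth0 and the numeric values are placeholders to adapt to what
`ethtool -g` reports for your NIC):

```shell
# Inspect and enlarge the rx ring; check the maximums that -g reports first.
ethtool -g eth0              # show current/maximum ring sizes
ethtool -G eth0 rx 511       # grow the rx ring toward its reported maximum

# Relax rx interrupt coalescing so bursts are batched per interrupt.
ethtool -C eth0 rx-usecs 100 rx-frames 40

# Re-read the NIC statistics discussed above.
ethtool -S eth0 | grep -E 'rx_discards|dma_writeq_full|rx_threshold_hit'
```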





Re: [RFC][PATCH 0/3] TCP/IP Critical socket communication mechanism

2005-12-14 Thread Jesper Juhl
On 12/14/05, Sridhar Samudrala [EMAIL PROTECTED] wrote:

 This set of patches provides a TCP/IP emergency communication mechanism that
 can be used to guarantee that high-priority communications over a critical
 socket succeed even under very low memory conditions that last for a couple
 of minutes. It uses the critical page pool facility provided by Matt's
 patches, which he posted recently on lkml.
 http://lkml.org/lkml/2005/12/14/34/index.html

 This mechanism provides a new socket option SO_CRITICAL that can be used to
 mark a socket as critical. A critical connection used for emergency

So now everyone writing commercial apps for Linux is going to set
SO_CRITICAL on sockets in their apps so their apps can survive better
under pressure than the competitors' apps, and clueless programmers all
over are going to think "cool, with this I can make my app more
important than everyone else's, I'm going to use this". When everyone
and his dog starts to set this, what's the point?


 communications has to be established and marked as critical before we enter
 the emergency condition.

 It uses the __GFP_CRITICAL flag introduced in the critical page pool patches
 to indicate that an allocation request is critical and should be satisfied
 from the critical page pool if required. In the send path, this flag is
 passed with all allocation requests that are made for a critical socket. But
 in the receive path we do not know whether a packet is critical or not until
 we receive it and find the socket that it is destined for. So we treat all
 the allocation requests in the receive path as critical.

 The critical page pool patches also introduce a global flag
 'system_in_emergency' that is used to indicate an emergency situation (could
 be a low memory condition). When this flag is set, any incoming packets that
 belong to non-critical sockets are dropped as soon as possible in the
 receive path.

Hmm, so if I fire up an app that has SO_CRITICAL set on a socket and
can then somehow put a lot of memory pressure on the machine, I can
cause traffic on other sockets to be dropped... hmmm... sounds like
something to play with to create new and interesting DoS attacks...


 This is necessary to prevent incoming non-critical packets to consume memory
 from critical page pool.

 I would appreciate any feedback or comments on this approach.


To be a little serious, it sounds like something that could be used to
cause trouble, and something that will lose its usefulness once enough
people start using it (for valid or invalid reasons), so what's the
point...


--
Jesper Juhl [EMAIL PROTECTED]
Don't top-post  http://www.catb.org/~esr/jargon/html/T/top-post.html
Plain text mails only, please  http://www.expita.com/nomime.html


Re: Poor performance with r8169

2005-12-14 Thread Francois Romieu
Carl-Daniel Hailfinger [EMAIL PROTECTED] :
[...]
 Performance with nttcp was approximately 135 MBit/s in
 both directions.
 
 Both cards were connected directly with a CAT5e cable.
 Enabling/disabling NAPI didn't have any measurable effect.
 
 Are these results expected, and if so, is there any card

1 - I get more than 141 Mbit/s on an old PII;
2 - can you check with lspci -vvx if there is a difference
    between the two devices (latency or such)? The cards
    are built around the same chipset. I see no reason why
    one card could be slower than the other;
3 - please send:
    - complete dmesg and vmstat 1 output during the test
    - .config
    - ethtool -s eth0

 which delivers more reasonable performance? If the cards
 should deliver higher performance, do you have any patch
 or any tuning tip I can test?

Can you check 'top' output during the test?
Any difference if you renice ksoftirqd like crazy?

--
Ueimor


Re: [RFC][PATCH 0/3] TCP/IP Critical socket communication mechanism

2005-12-14 Thread Ben Greear

Jesper Juhl wrote:


To be a little serious, it sounds like something that could be used to
cause trouble and something that will lose its usefulness once enough
people start using it (for valid or invalid reasons), so what's the
point...


It could easily be a user-configurable option in an application.  If
DOS is a real concern, only let this work for root users...

Ben

--
Ben Greear [EMAIL PROTECTED]
Candela Technologies Inc  http://www.candelatech.com



Re: [RFC][PATCH 0/3] TCP/IP Critical socket communication mechanism

2005-12-14 Thread James Courtier-Dutton

Jesper Juhl wrote:

On 12/14/05, Sridhar Samudrala [EMAIL PROTECTED] wrote:


This set of patches provides a TCP/IP emergency communication mechanism that
could be used to guarantee high priority communications over a critical socket
to succeed even under very low memory conditions that last for a couple of
minutes. It uses the critical page pool facility provided by Matt's patches
that he posted recently on lkml.
   http://lkml.org/lkml/2005/12/14/34/index.html

This mechanism provides a new socket option SO_CRITICAL that can be used to
mark a socket as critical. A critical connection used for emergency



So now everyone writing commercial apps for Linux are going to set
SO_CRITICAL on sockets in their apps so their apps can survive better
under pressure than the competitors' apps, and clueless programmers all
over are going to think cool, with this I can make my app more
important than everyone else's, I'm going to use this.  When everyone
and his dog starts to set this, what's the point?




I don't think the initial patches that Matt did were intended for what 
you are describing.
When I had the conversation with Matt at KS, the problem we were trying 
to solve was Memory pressure with network attached swap space.

I came up with the idea that I think Matt has implemented.
Letting the OS choose which are critical TCP/IP sessions is fine. But 
letting an application choose is a recipe for disaster.


James


Re: 2.6.15-rc5 gre tunnel checksum error

2005-12-14 Thread David S. Miller
From: Herbert Xu [EMAIL PROTECTED]
Date: Wed, 14 Dec 2005 23:16:29 +1100

 [GRE]: Fix hardware checksum modification
 
 The skb_postpull_rcsum introduced a bug to the checksum modification.
 Although the length pulled is offset bytes, the origin of the pulling
 is the GRE header, not the IP header.
 
 Signed-off-by: Herbert Xu [EMAIL PROTECTED]
 
 Dave, please apply this if this works for Paul.

Applied, thanks.

-stable needs this too, so I'll toss it there as well.


[2.6 patch] net/sunrpc/xdr.c: remove xdr_decode_string()

2005-12-14 Thread Adrian Bunk
This patch removes the unused function xdr_decode_string().


Signed-off-by: Adrian Bunk [EMAIL PROTECTED]
Acked-by: Neil Brown [EMAIL PROTECTED]
Acked-by: Charles Lever [EMAIL PROTECTED]

---

 include/linux/sunrpc/xdr.h |1 -
 net/sunrpc/xdr.c   |   21 -
 2 files changed, 22 deletions(-)

--- linux-2.6.15-rc1-mm2-full/include/linux/sunrpc/xdr.h.old	2005-11-23 02:03:01.0 +0100
+++ linux-2.6.15-rc1-mm2-full/include/linux/sunrpc/xdr.h	2005-11-23 02:03:08.0 +0100
@@ -91,7 +91,6 @@
 u32 *  xdr_encode_opaque_fixed(u32 *p, const void *ptr, unsigned int len);
 u32 *  xdr_encode_opaque(u32 *p, const void *ptr, unsigned int len);
 u32 *  xdr_encode_string(u32 *p, const char *s);
-u32 *  xdr_decode_string(u32 *p, char **sp, int *lenp, int maxlen);
 u32 *  xdr_decode_string_inplace(u32 *p, char **sp, int *lenp, int maxlen);
 u32 *  xdr_encode_netobj(u32 *p, const struct xdr_netobj *);
 u32 *  xdr_decode_netobj(u32 *p, struct xdr_netobj *);
--- linux-2.6.15-rc1-mm2-full/net/sunrpc/xdr.c.old	2005-11-23 02:03:17.0 +0100
+++ linux-2.6.15-rc1-mm2-full/net/sunrpc/xdr.c	2005-11-23 02:03:27.0 +0100
@@ -93,27 +93,6 @@
 }
 
 u32 *
-xdr_decode_string(u32 *p, char **sp, int *lenp, int maxlen)
-{
-   unsigned intlen;
-   char*string;
-
-   if ((len = ntohl(*p++)) > maxlen)
-   return NULL;
-   if (lenp)
-   *lenp = len;
-   if ((len % 4) != 0) {
-   string = (char *) p;
-   } else {
-   string = (char *) (p - 1);
-   memmove(string, p, len);
-   }
-   string[len] = '\0';
-   *sp = string;
-   return p + XDR_QUADLEN(len);
-}
-
-u32 *
 xdr_decode_string_inplace(u32 *p, char **sp, int *lenp, int maxlen)
 {
unsigned intlen;



Re: [RFC][PATCH 0/3] TCP/IP Critical socket communication mechanism

2005-12-14 Thread Sridhar Samudrala
On Wed, 2005-12-14 at 20:49 +, James Courtier-Dutton wrote:
 Jesper Juhl wrote:
  On 12/14/05, Sridhar Samudrala [EMAIL PROTECTED] wrote:
  
 This set of patches provides a TCP/IP emergency communication mechanism that
 could be used to guarantee high priority communications over a critical 
 socket
 to succeed even under very low memory conditions that last for a couple of
 minutes. It uses the critical page pool facility provided by Matt's patches
 that he posted recently on lkml.
 http://lkml.org/lkml/2005/12/14/34/index.html
 
 This mechanism provides a new socket option SO_CRITICAL that can be used to
 mark a socket as critical. A critical connection used for emergency
  
  
  So now everyone writing commercial apps for Linux are going to set
  SO_CRITICAL on sockets in their apps so their apps can survive better
  under pressure than the competitors' apps, and clueless programmers all
  over are going to think cool, with this I can make my app more
  important than everyone else's, I'm going to use this.  When everyone
  and his dog starts to set this, what's the point?
  
  
 
 I don't think the initial patches that Matt did were intended for what 
 you are describing.
 When I had the conversation with Matt at KS, the problem we were trying 
 to solve was Memory pressure with network attached swap space.
 I came up with the idea that I think Matt has implemented.
 Letting the OS choose which are critical TCP/IP sessions is fine. But 
 letting an application choose is a recipe for disaster.

We could easily add a capable(CAP_NET_ADMIN) check to allow this option to
be set only by privileged users.

Thanks
Sridhar
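
A sketch of what that check could look like, in kernel-C pseudocode. The
SO_CRITICAL option and __GFP_CRITICAL flag are from the RFC patches; the
placement in sock_setsockopt() and the use of valbool are assumptions for
illustration, not the actual patch:

```
	case SO_CRITICAL:
		/* hypothetical: only privileged users may mark a socket critical */
		if (!capable(CAP_NET_ADMIN)) {
			ret = -EPERM;
			break;
		}
		if (valbool)
			sk->sk_allocation |= __GFP_CRITICAL;
		else
			sk->sk_allocation &= ~__GFP_CRITICAL;
		break;
```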



Re: [RFC][PATCH 0/3] TCP/IP Critical socket communication mechanism

2005-12-14 Thread James Courtier-Dutton

Sridhar Samudrala wrote:

On Wed, 2005-12-14 at 20:49 +, James Courtier-Dutton wrote:


Jesper Juhl wrote:


On 12/14/05, Sridhar Samudrala [EMAIL PROTECTED] wrote:



This set of patches provides a TCP/IP emergency communication mechanism that
could be used to guarantee high priority communications over a critical socket
to succeed even under very low memory conditions that last for a couple of
minutes. It uses the critical page pool facility provided by Matt's patches
that he posted recently on lkml.
  http://lkml.org/lkml/2005/12/14/34/index.html

This mechanism provides a new socket option SO_CRITICAL that can be used to
mark a socket as critical. A critical connection used for emergency



So now everyone writing commercial apps for Linux are going to set
SO_CRITICAL on sockets in their apps so their apps can survive better
under pressure than the competitors' apps, and clueless programmers all
over are going to think cool, with this I can make my app more
important than everyone else's, I'm going to use this.  When everyone
and his dog starts to set this, what's the point?




I don't think the initial patches that Matt did were intended for what 
you are describing.
When I had the conversation with Matt at KS, the problem we were trying 
to solve was Memory pressure with network attached swap space.

I came up with the idea that I think Matt has implemented.
Letting the OS choose which are critical TCP/IP sessions is fine. But 
letting an application choose is a recipe for disaster.



We could easily add a capable(CAP_NET_ADMIN) check to allow this option to
be set only by privileged users.

Thanks
Sridhar



Sridhar,

Have you actually thought about what would happen in a real world scenario?
There is no real world requirement for this sort of user land feature.
In memory pressure mode, you don't care about user applications. In 
fact, under memory pressure no user applications are getting scheduled.
All you care about is swapping out memory to achieve a net gain in free 
memory, so that the applications can then run ok again.


James


[PATCH] fix multiple issues in MLDv2 reports

2005-12-14 Thread David Stevens
Dave,
I tested these together, but let me know if you want me to
split these into a few pieces, though they'll probably conflict with
each other. :-)

The below jumbo patch fixes the following problems in MLDv2.
1) Add necessary ntohs to recent pskb_may_pull check [breaks
all nonzero source queries on little-endian (!)]
2) Add locking to source filter list [resend of prior patch]
3) fix mld_marksources() to
a) send nothing when all queried sources are excluded
b) send full exclude report when some queried sources are
not excluded
c) don't schedule a timer when there's nothing to report

NOTE: RFC 3810 specifies the source list should be saved and each
  source reported individually as an IS_IN. This is an obvious DOS
  path, requiring the host to store and then multicast as many sources
  as are queried (e.g., millions...). This alternative sends a full, 
relevant
  report that's limited to number of sources present on the machine.

4) fix add_grec() to send empty-source records when it should
The original check doesn't account for a non-empty source
list with all sources inactive; the new code keeps that
short-circuit case, and also generates the group header
with an empty list if needed.

5) fix mca_crcount decrement to be after add_grec(), which needs
its original value

These issues (other than item #1 ;-) ) were all found by Yan Zheng --
much thanks!

+-DLS

[in-line for viewing, attached for applying]

Signed-off-by: David L Stevens [EMAIL PROTECTED]
diff -ruNp linux-2.6.15-rc5/include/net/if_inet6.h linux-2.6.15-rc5MC1/include/net/if_inet6.h
--- linux-2.6.15-rc5/include/net/if_inet6.h	2005-10-27 17:02:08.0 -0700
+++ linux-2.6.15-rc5MC1/include/net/if_inet6.h	2005-12-09 15:22:46.0 -0800
@@ -82,6 +82,7 @@ struct ipv6_mc_socklist
 	struct in6_addr		addr;
 	int			ifindex;
 	struct ipv6_mc_socklist	*next;
+	rwlock_t		sflock;
 	unsigned int		sfmode;	/* MCAST_{INCLUDE,EXCLUDE} */
 	struct ip6_sf_socklist	*sflist;
 };
diff -ruNp linux-2.6.15-rc5/net/ipv6/mcast.c linux-2.6.15-rc5MC1/net/ipv6/mcast.c
--- linux-2.6.15-rc5/net/ipv6/mcast.c	2005-12-12 15:01:33.0 -0800
+++ linux-2.6.15-rc5MC1/net/ipv6/mcast.c	2005-12-13 16:02:46.0 -0800
@@ -224,6 +224,7 @@ int ipv6_sock_mc_join(struct sock *sk, i
 
 	mc_lst->ifindex = dev->ifindex;
 	mc_lst->sfmode = MCAST_EXCLUDE;
+	mc_lst->sflock = RW_LOCK_UNLOCKED;
 	mc_lst->sflist = NULL;
 
/*
@@ -360,6 +361,7 @@ int ip6_mc_source(int add, int omode, st
 	struct ip6_sf_socklist *psl;
 	int i, j, rv;
 	int leavegroup = 0;
+	int pmclocked = 0;
 	int err;
 
 	if (pgsr->gsr_group.ss_family != AF_INET6 ||
@@ -403,6 +405,9 @@ int ip6_mc_source(int add, int omode, st
 		pmc->sfmode = omode;
 	}
 
+	write_lock_bh(&pmc->sflock);
+	pmclocked = 1;
+
 	psl = pmc->sflist;
 	if (!add) {
 		if (!psl)
@@ -475,6 +480,8 @@ int ip6_mc_source(int add, int omode, st
 	/* update the interface list */
 	ip6_mc_add_src(idev, group, omode, 1, source, 1);
 done:
+	if (pmclocked)
+		write_unlock_bh(&pmc->sflock);
 	read_unlock_bh(&ipv6_sk_mc_lock);
 	read_unlock_bh(&idev->lock);
 	in6_dev_put(idev);
@@ -510,6 +517,8 @@ int ip6_mc_msfilter(struct sock *sk, str
 	dev = idev->dev;
 
 	err = 0;
+	read_lock_bh(&ipv6_sk_mc_lock);
+
 	if (gsf->gf_fmode == MCAST_INCLUDE && gsf->gf_numsrc == 0) {
 		leavegroup = 1;
 		goto done;
@@ -549,6 +558,8 @@ int ip6_mc_msfilter(struct sock *sk, str
 		newpsl = NULL;
 		(void) ip6_mc_add_src(idev, group, gsf->gf_fmode, 0, NULL, 0);
 	}
+
+	write_lock_bh(&pmc->sflock);
 	psl = pmc->sflist;
 	if (psl) {
 		(void) ip6_mc_del_src(idev, group, pmc->sfmode,
@@ -558,8 +569,10 @@ int ip6_mc_msfilter(struct sock *sk, str
 		(void) ip6_mc_del_src(idev, group, pmc->sfmode, 0, NULL, 0);
 	pmc->sflist = newpsl;
 	pmc->sfmode = gsf->gf_fmode;
+	write_unlock_bh(&pmc->sflock);
 	err = 0;
 done:
+	read_unlock_bh(&ipv6_sk_mc_lock);
 	read_unlock_bh(&idev->lock);
 	in6_dev_put(idev);
 	dev_put(dev);
@@ -592,6 +605,11 @@ int ip6_mc_msfget(struct sock *sk, struc
 	dev = idev->dev;
 
 	err = -EADDRNOTAVAIL;
+	/*
+	 * changes to the ipv6_mc_list require the socket lock and
+	 * a read lock on ip6_sk_mc_lock. We have the socket lock,
+	 * so reading the list is safe.
+	 */
 
 	for (pmc=inet6->ipv6_mc_list; pmc; pmc=pmc->next) {
 		if (pmc->ifindex != gsf->gf_interface)
@@ -614,6 +632,10 @@ int ip6_mc_msfget(struct sock *sk, struc

Re: Default net.ipv6.mld_max_msf = 10 and net.core.optmem_max=10240

2005-12-14 Thread Hoerdt Mickael

Hi david  all,

As implemented now, the default memory allocated via net.core.optmem_max
permits joining up to 320 (S,G) channels per socket (for IPv6, each channel
costs 32 bytes of net.core.optmem_max). The thing is that
net.ipv6.mld_max_msf sets a hard limit on it, so assuming that you don't
change the value of net.core.optmem_max, would it make sense to increase
net.ipv6.mld_max_msf to, let's say, 256? The rest of the memory can
still be used for various option setup on the socket.

Cheers,

Hoerdt Mickaël

David Stevens wrote:

[I'm CC-ing Dave Miller and Yoshifuji Hideaki; you probably ought to bring 
this up on

   [EMAIL PROTECTED]

Hoerdt,
   I don't object to increasing the default, but how much is a good 
question. For an
include-mode filter, it'll do a linear search on the sources for every 
packet received for
that group. If those are large numbers, then an administrator should 
decide that's a good

use of the machine, I think.
   The reports are (roughly) an n^2 algorithm in the number of 
sources. The per-packet
filtering can be improved by using a hash for source look-ups, but I don't 
think there's a
significant improvement for report computations (it's n^3 in the obvious 
way, so already pretty good).
   I've done testing with hundreds of sources and no apparent 
performance problems
(though performance isn't what I was testing). I don't know what a 
reasonable limit on

reasonable hardware is.
   Like the per-socket group limit, this one is probably too low for 
common applications,
and also like that, easily evaded. 1024 or 2048 as the default seems high 
to me, on the
assumption that a few apps doing that would kill performance, but since I 
haven't tested,

I don't really know.
   I also see it appears not to be enforced in the full-state API (an 
oversight, unless

I'm missing the check when I look now).

   I don't see any problem with bumping this up to, say, 64, 
immediately, which would
solve the immediate problem, I guess. But I'm not the maintainer. :-) I 
think some stress
testing to show how well this scales for higher numbers would be 
appropriate before
going too high. If you have numbers (or can get them), that'd help. I 
wouldn't mind doing
some tests along these lines myself, but I don't expect to have much 
uncommitted time

available through December.

   +-DLS

Hoerdt Mickael [EMAIL PROTECTED] wrote on 11/30/2005 08:29:51 AM:

Hello David,

It seems to me that the net.ipv6.mld_max_msf and igmp_max_msf default
values are a little bit too short for multi-source multicast applications.
On the M6bone, we are using a software named dbeacon
(http://mars.innerghost.net.ipv4.sixxs.org/matrix/) which joins a high
number (currently up to 30 sources) of SSM sources on the same socket.

This creates a management problem when users are installing it: the root
admin must change this value, but dbeacon is run by normal users on the
hosts.

For layered multicast, this can be a problem too. It's easy to imagine a
flow with 256 different layers; the FLUTE application is one implementation
of this layered multicast concept (http://atm.tut.fi/mad/). Could it be
possible to increase this default value to, let's say, 1024 or 2048? If not
possible, could you tell me why, and then we may consider developing an
application layer instantiating several sockets for joining a high number
of SSM channels per application.

Thank you,

Hoerdt Mickaël



Re: [RFC][PATCH 0/3] TCP/IP Critical socket communication mechanism

2005-12-14 Thread Ben Greear

James Courtier-Dutton wrote:


Have you actually thought about what would happen in a real world scenario?
There is no real world requirement for this sort of user land feature.
In memory pressure mode, you don't care about user applications. In 
fact, under memory pressure no user applications are getting scheduled.
All you care about is swapping out memory to achieve a net gain in free 
memory, so that the applications can then run ok again.


Low 'ATOMIC' memory is different from the memory that user space typically
uses, so just because you can't allocate an SKB does not mean you are swapping
out user-space apps.

I have an app that can have 2000+ sockets open.  I would definitely like to make
the management and other important sockets have priority over others in my
app...

Ben

--
Ben Greear [EMAIL PROTECTED]
Candela Technologies Inc  http://www.candelatech.com



RE: netif_stop_queue() and multiple hardware queues

2005-12-14 Thread Simon Barber
Hi Jeremy,

I implemented this functionality in Devicescape's 802.11 stack.

The approach I took was for the driver to install a device specific
qdisc as the root qdisc on the device. This root qdisc's purpose is to
expose the hardware queues directly, so other qdiscs can be attached as
leaf qdiscs. This hardware specific root qdisc cannot be deleted or
changed. This makes it possible to use tc to inspect/set/modify per
hardware queue statistics and parameters.

In order for this to work my device driver never calls netif_stop.
Instead the qdisc dequeue function for the root qdisc looks to see which
hardware queues can accept a frame, and if none then it returns no data.
The driver's frame completion function calls __netif_schedule
appropriately too to ensure the queue runs when it should.

This allows Devicescape's 802.11 stack to properly integrate with the
Linux tc framework. I don't think any other 802.11 drivers achieve this.

In the future I plan to extend Devicescape's 802.11 root qdisc to
further expose the 802.11 MAC's internal queues, in cases where this is
useful (e.g. the scheduled access implementation).

The same principle could apply to Intel's e1000.
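
In kernel-C pseudocode, the dequeue described above might look roughly like
this (all names — hwq_sched_data, hw_queue_full(), the leaf array — are
illustrative assumptions, not Devicescape's actual code):

```
static struct sk_buff *hwq_dequeue(struct Qdisc *sch)
{
	struct hwq_sched_data *q = qdisc_priv(sch);	/* hypothetical */
	struct sk_buff *skb;
	int i;

	for (i = 0; i < q->nr_hw_queues; i++) {
		/* skip hardware queues whose tx ring is currently full */
		if (hw_queue_full(q->dev, i))		/* hypothetical */
			continue;
		/* one leaf qdisc attached per hardware queue */
		skb = q->leaf[i]->dequeue(q->leaf[i]);
		if (skb)
			return skb;
	}
	return NULL;	/* no hardware queue can accept a frame: no data */
}
```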

Simon


-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
On Behalf Of Jeremy Jackson
Sent: Wednesday, December 14, 2005 2:31 PM
To: netdev@vger.kernel.org
Subject: netif_stop_queue() and multiple hardware queues

Hi,

I posted this briefly on linux-net, before being redirected here.

Two pieces of hardware now have Linux driver support for multiple
hardware queues: Intel's e1000 (two queues from what I can see in the
code) and Atheros's 5212 and up, in support of 802.11e and WME (four
hardware queues).  In the GigE case, this just reduces latency due to
hardware queueing.  In the WiFi case, the queues actually have
significance in access to the shared medium. (ACKs can be disabled as
well)  It is also worthy of note that ADSL2, VDSL, and ADSL2+ have 4
different latency queues.  These last two are significant; real-time
applications suffer the most from low speed, shared, and/or
non-deterministic media.  I wonder where DOCSIS 2 is in this regard.  
Anyone?  Beuler?

So my question is, what's it going to take to get dev->hard_start_xmit()
to hook up tc queues directly to hardware/driver queues?

Right now, it seems no matter how elaborate a tc setup you have,
everything funnels through one queue, where the only thing that survives
from the classifying/queueing is skb->priority (ie nothing).  The
hardware driver can then try to reclassify packets.  I suppose you
could hack up an iptables classifier to set skb->priority...

The Atheros driver tries to do its own classifying by first wiping
out skb->priority, then hard-coding a mapping (tsk - policy is for the
sysadmin!) between VLAN tag priority, IP TOS/DSCP, and skb->priority,
then to one of the 4 queues and ACK states, blithely ignoring any fine
work done by tc.

It'd be sweet to head this nonsense off at the pass, before others
discover the rabbit trail and make it into a trade route.

--
Jeremy Jackson
Coplanar Networks
W:(519)489-4903
MSN: [EMAIL PROTECTED]
ICQ: 43937409
http://www.coplanar.net



Re: Specs for Tulip3

2005-12-14 Thread Aritz Bastida
Hello again,

Sorry Michael, but I am kind of a newbie in this subject and couldn't
understand everything you said clearly. I'm working on my final year
project (I think it's said like that, I mean the project you do when
you finish your degree :P). The purpose of my project is to capture
and analyze network packets as fast as I can.

So, I'll try to expose my doubts about this, and please don't be too
concise, since I couldn't understand it. However, if there is a good
reference I should read, tell me, since I couldn't find any good book
centered on Linux kernel networking. The only one I know about (and
have read) is "The Linux Kernel Networking Architecture", although
it's quite old (kernel 2.4). I have also read "Linux Device Drivers,
3rd Edition" and "Linux Kernel Development", plus some articles about
NAPI and interrupt coalescing:
   "Eliminating Receive Livelock" by Jeffrey Mogul,
   "Beyond Softnet" by Jamal Hadi Salim
and something more, probably. I know the concepts of NAPI but have not
seen any real driver in action, except for the Realtek 8139too.

So here go the questions:


 rx_threshold_hit
 Rx max coalescing frames threshold hit.

Well, I didn't understand what this threshold is for.


 What bus is the NIC in? PCI or PCIX? What speed? You may want to play
 around with the rx ring sizes and rx coalescing parameters, all can be
 changed with ethtool. Also, be sure to use the latest tg3 driver which
 is 3.45.

I'm running Linux kernel 2.6.13 and tg3 version 3.37, so it should be new
enough. I don't know how to verify if the NIC is in a PCI-X bus. How can
I check that? Running lspci I can see there are some PCI-X bridges:

:00:01.0 PCI bridge: Advanced Micro Devices [AMD] AMD-8131 PCI-X
Bridge (rev 12)
:00:01.1 PIC: Advanced Micro Devices [AMD] AMD-8131 PCI-X IOAPIC
:00:02.0 PCI bridge: Advanced Micro Devices [AMD] AMD-8131 PCI-X
Bridge (rev 12)
:00:02.1 PIC: Advanced Micro Devices [AMD] AMD-8131 PCI-X IOAPIC
:00:06.0 PCI bridge: Advanced Micro Devices [AMD] AMD-8111 PCI
:00:07.0 ISA bridge: Advanced Micro Devices [AMD] AMD-8111 LPC
(...)
:02:03.0 Ethernet controller: Broadcom Corporation NetXtreme
BCM5704 Gigabit Ethernet (rev 02)
:02:03.1 Ethernet controller: Broadcom Corporation NetXtreme
BCM5704 Gigabit Ethernet (rev 02)

Are the NICs in a PCI-X bus? The bridges at least are.


I have seen that I can change the rx ring entries with ethtool, although
the driver code says that the size is fixed at 512 entries. So what
you actually change is the pending entries (defaulting to 200). What
does that mean? That even if the ring is 512 entries long, it appears to
be full if there are 200 packets the kernel didn't get?

As I said, the only driver I have read before is Realtek 8139too,
which is quite simple, but at least I could find a tutorial which
explains how it works. In that driver there was a rx_ring and a
tx_ring (I don't know if there can be more than one in some other
drivers). When a packet arrives the NIC stores in the rx_ring a packet
descriptor (4 bytes, 2 for the packet length and 2 for the packet
receive status), and just after that the packet itself. So the driver
has just to read the descriptor and then read the following packet
length bytes.

As I have seen in tg3, the rx_ring seems to be only for packet
descriptors. So, as I guess, the descriptor should also contain the
address of the actual packet stored. Where is that packet stored? In
another rx_ring just for incoming packets? What is the benefit compared
to the way the 8139too does it?

To finish, what do you mean by changing coalescing parameters?
The dev->quota and budget? Are there more things I can change for my
benefit?


Thank you for your patience.
Regards

Aritz


RE: netif_stop_queue() and multiple hardware queues

2005-12-14 Thread Simon Barber
Oh - and re: policy - my 802.11 qdisc first calls out to the tc classify
function - allowing the sysadmin to do what he wants, then if no class
is selected it has a default implementation that reflects the
appropriate 802.11 and WiFi specs for classification.

Of course another implementation would be to implement an 802.11
classifier, and install this by default on the 802.11 qdisc.

Simon
 

-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
On Behalf Of Simon Barber
Sent: Wednesday, December 14, 2005 3:07 PM
To: Jeremy Jackson; netdev@vger.kernel.org
Subject: RE: netif_stop_queue() and multiple hardware queues

Hi Jeremy,

I implemented this functionality in Devicescape's 802.11 stack.

The approach I took was for the driver to install a device specific
qdisc as the root qdisc on the device. This root qdisc's purpose is to
expose the hardware queues directly, so other qdiscs can be attached as
leaf qdiscs. This hardware specific root qdisc cannot be deleted or
changed. This makes it possible to use tc to inspect/set/modify per
hardware queue statistics and parameters.

In order for this to work my device driver never calls netif_stop.
Instead the qdisc dequeue function for the root qdisc looks to see which
hardware queues can accept a frame, and if none then it returns no data.
The driver's frame completion function calls __netif_schedule
appropriately too to ensure the queue runs when it should.

This allows Devicescape's 802.11 stack to properly integrate with the
Linux tc framework. I don't think any other 802.11 drivers achieve this.

In the future I plan to extend Devicescape's 802.11 root qdisc to
further expose the 802.11 MAC's internal queues, in cases where this is
useful (e.g. the scheduled access implementation).

The same principle could apply to Intel's e1000.

Simon


-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
On Behalf Of Jeremy Jackson
Sent: Wednesday, December 14, 2005 2:31 PM
To: netdev@vger.kernel.org
Subject: netif_stop_queue() and multiple hardware queues

Hi,

I posted this briefly on linux-net, before being redirected here.

Two pieces of hardware now have Linux driver support for multiple
hardware queues: Intel's e1000 (two queues from what I can see in the
code) and Atheros's 5212 and up, in support of 802.11e and WME (four
hardware queues).  In the GigE case, this just reduces latency due to
hardware queueing.  In the WiFi case, the queues actually have
significance in access to the shared medium. (ACKs can be disabled as
well)  It is also worthy of note that ADSL2, VDSL, and ADSL2+ have 4
different latency queues.  These last two are significant; real-time
applications suffer the most from low speed, shared, and/or
non-deterministic media.  I wonder where DOCSIS 2 is in this regard.  
Anyone?  Beuler?

So my question is, what's it going to take to get dev->hard_start_xmit()
to hook up tc queues directly to hardware/driver queues?

Right now, it seems no matter how elaborate a tc setup you have,
everything funnels through one queue, where the only thing that survives
from the classifying/queueing is skb->priority (ie nothing).  The
hardware driver can then try to reclassify packets.  I suppose you
could hack up an iptables classifier to set skb->priority...

The Atheros driver tries to do its own classifying by first wiping
out skb->priority, then hard-coding a mapping (tsk - policy is for the
sysadmin!) between VLAN tag priority, IP TOS/DSCP, and skb->priority,
then to one of the 4 queues and ACK states, blithely ignoring any fine
work done by tc.

It'd be sweet to head this nonsense off at the pass, before others
discover the rabbit trail and make it into a trade route.

--
Jeremy Jackson
Coplanar Networks
W:(519)489-4903
MSN: [EMAIL PROTECTED]
ICQ: 43937409
http://www.coplanar.net



[PATCH 0/4] TCP Cubic updates for 2.6.16

2005-12-14 Thread shemminger
This set of patches:
* precomputes constants used in TCP cubic
* uses Newton/Raphson for cube root
* adds find largest set bit 64 to make initial estimate

--
Stephen Hemminger [EMAIL PROTECTED]
OSDL http://developer.osdl.org/~shemminger



[PATCH 3/4] TCP cubic precompute constants

2005-12-14 Thread shemminger
Revised version of patch to pre-compute values for TCP cubic.
  * d32,d64 replaced with descriptive names
  * cube_factor replaces
	srtt[scaled by count] / HZ * ((1 << (10+2*BICTCP_HZ)) / bic_scale)
  * beta_scale replaces
	8*(BICTCP_BETA_SCALE+beta)/3/(BICTCP_BETA_SCALE-beta);

Signed-off-by: Stephen Hemminger [EMAIL PROTECTED]


--- net-2.6.16.orig/net/ipv4/tcp_cubic.c
+++ net-2.6.16/net/ipv4/tcp_cubic.c
@@ -16,7 +16,7 @@
 #include <linux/mm.h>
 #include <linux/module.h>
 #include <net/tcp.h>
-
+#include <asm/div64.h>
 
 #define BICTCP_BETA_SCALE    1024  /* Scale factor beta calculation
 * max_cwnd = snd_cwnd * beta
@@ -34,15 +34,20 @@ static int initial_ssthresh = 100;
 static int bic_scale = 41;
 static int tcp_friendliness = 1;
 
+static u32 cube_rtt_scale;
+static u32 beta_scale;
+static u64 cube_factor;
+
+/* Note parameters that are used for precomputing scale factors are read-only 
*/
 module_param(fast_convergence, int, 0644);
 MODULE_PARM_DESC(fast_convergence, turn on/off fast convergence);
 module_param(max_increment, int, 0644);
 MODULE_PARM_DESC(max_increment, Limit on increment allowed during binary 
search);
-module_param(beta, int, 0644);
+module_param(beta, int, 0444);
 MODULE_PARM_DESC(beta, beta for multiplicative increase);
 module_param(initial_ssthresh, int, 0644);
 MODULE_PARM_DESC(initial_ssthresh, initial value of slow start threshold);
-module_param(bic_scale, int, 0644);
+module_param(bic_scale, int, 0444);
 MODULE_PARM_DESC(bic_scale, scale (scaled by 1024) value for bic function 
(bic_scale/1024));
 module_param(tcp_friendliness, int, 0644);
 MODULE_PARM_DESC(tcp_friendliness, turn on/off tcp friendliness);
@@ -151,65 +156,13 @@ static u32 cubic_root(u64 x)
 	return (u32)end;
 }
 
-static inline u32 bictcp_K(u32 dist, u32 srtt)
-{
-	u64 d64;
-	u32 d32;
-	u32 count;
-	u32 result;
-
-	/* calculate the "K" for (wmax-cwnd) = c/rtt * K^3
-	   so K = cubic_root( (wmax-cwnd)*rtt/c )
-	   the unit of K is bictcp_HZ=2^10, not HZ
-
-	   c = bic_scale >> 10
-	   rtt = (tp->srtt >> 3) / HZ
-
-	   the following code has been designed and tested for
-	   cwnd < 1 million packets
-	   RTT < 100 seconds
-	   HZ < 1,000,00  (corresponding to 10 nano-second)
-
-	*/
-
-	/* 1/c * 2^2*bictcp_HZ */
-	d32 = (1 << (10+2*BICTCP_HZ)) / bic_scale;
-	d64 = (__u64)d32;
-
-	/* srtt * 2^count / HZ
-	   1) to get a better accuracy of the following d32,
-	      the larger the "count", the better the accuracy
-	   2) and avoid overflow of the following d64
-	      the larger the "count", the higher the possibility of overflow
-	   3) so find a "count" between bictcp_hz-3 and bictcp_hz
-	      "count" may be less than bictcp_HZ,
-	      then d64 becomes 0. that is OK
-	*/
-	d32 = srtt;
-	count = 0;
-	while (((d32 & 0x80000000) == 0) && (count < BICTCP_HZ)){
-		d32 = d32 << 1;
-		count++;
-	}
-	d32 = d32 / HZ;
-
-	/* (wmax-cwnd) * (srtt>>3 / HZ) / c * 2^(3*bictcp_HZ)  */
-	d64 = (d64 * dist * d32) >> (count+3-BICTCP_HZ);
-
-	/* cubic root */
-	d64 = cubic_root(d64);
-
-	result = (u32)d64;
-	return result;
-}
-
-
 /*
  * Compute congestion window to use.
  */
 static inline void bictcp_update(struct bictcp *ca, u32 cwnd)
 {
-	u64 d64;
-	u32 d32, t, srtt, bic_target, min_cnt, max_cnt;
+	u64 offs;
+	u32 delta, t, bic_target, min_cnt, max_cnt;
 
 	ca->ack_cnt++;	/* count the number of ACKs */
 
@@ -220,8 +173,6 @@ static inline void bictcp_update(struct 
 	ca->last_cwnd = cwnd;
 	ca->last_time = tcp_time_stamp;
 
-	srtt = (HZ << 3)/10;	/* use real time-based growth function */
-
 	if (ca->epoch_start == 0) {
 		ca->epoch_start = tcp_time_stamp;	/* record the beginning of an epoch */
 		ca->ack_cnt = 1;	/* start counting */
@@ -231,7 +182,11 @@ static inline void bictcp_update(struct 
 		ca->bic_K = 0;
 		ca->bic_origin_point = cwnd;
 	} else {
-		ca->bic_K = bictcp_K(ca->last_max_cwnd-cwnd, srtt);
+		/* Compute new K based on
+		 * (wmax-cwnd) * (srtt>>3 / HZ) / c * 2^(3*bictcp_HZ)
+		 */
+		ca->bic_K = cubic_root(cube_factor
+				       * (ca->last_max_cwnd - cwnd));
 		ca->bic_origin_point = ca->last_max_cwnd;
 	}
 }
@@ -239,9 +194,9 @@ static inline void bictcp_update(struct 
 /* cubic function - calc*/
 /* calculate c * time^3 / rtt,
  *  while considering overflow in calculation of time^3
-* (so time^3 is done by using d64)
+* (so time^3 is done 
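[Editor's note] The hunk that actually performs the precomputation is cut off in the archive. For reference, the idea described in the patch intro can be sketched in plain user-space C. The default values for bic_scale and beta, and the exact form of the init-time arithmetic, are assumptions reconstructed from the description above, not a copy of the kernel code:

```c
#include <assert.h>
#include <stdint.h>

/* User-space sketch of precomputing the TCP cubic scale factors once,
 * instead of redoing the math on every congestion event.
 * BICTCP_HZ, bic_scale and beta use the module defaults (assumed). */
#define BICTCP_HZ		10	/* BIC HZ 2^10 = 1024 */
#define BICTCP_BETA_SCALE	1024

static const int bic_scale = 41;
static const int beta = 819;	/* = 819/1024, i.e. ~0.8 */

static uint64_t cube_factor;
static uint32_t cube_rtt_scale;
static uint32_t beta_scale;

/* Done once at module load in the real patch. */
static void bictcp_precompute(void)
{
	cube_rtt_scale = bic_scale * 10;	/* 1024*c/rtt, rtt = 100ms */
	beta_scale = 8 * (BICTCP_BETA_SCALE + beta) / 3
		/ (BICTCP_BETA_SCALE - beta);
	/* replaces the per-call
	 *   srtt[scaled by count] / HZ * ((1 << (10+2*BICTCP_HZ)) / bic_scale)
	 * for the fixed 100ms srtt used by the real-time growth function */
	cube_factor = (1ULL << (10 + 3 * BICTCP_HZ)) / (bic_scale * 10);
}
```

With these constants in hand, K reduces to `cubic_root(cube_factor * (wmax - cwnd))`, as the bictcp_update() hunk above shows.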

[PATCH 2/4] fls64: x86_64 version

2005-12-14 Thread shemminger
Index: net-2.6.16/include/asm-x86_64/bitops.h
===
--- net-2.6.16.orig/include/asm-x86_64/bitops.h
+++ net-2.6.16/include/asm-x86_64/bitops.h
@@ -340,6 +340,20 @@ static __inline__ unsigned long __ffs(un
return word;
 }
 
+/*
+ * __fls: find last bit set.
+ * @word: The word to search
+ *
+ * Undefined if no set bit exists, so code should check against 0 first.
+ */
+static __inline__ unsigned long __fls(unsigned long word)
+{
+	__asm__("bsrq %1,%0"
+		:"=r" (word)
+		:"rm" (word));
+	return word;
+}
+
+
 #ifdef __KERNEL__
 
 static inline int sched_find_first_bit(const unsigned long *b)
@@ -370,6 +384,19 @@ static __inline__ int ffs(int x)
 }
 
 /**
+ * fls64 - find last bit set in 64 bit word
+ * @x: the word to search
+ *
+ * This is defined the same way as fls.
+ */
+static __inline__ int fls64(__u64 x)
+{
+   if (x == 0)
+   return 0;
+   return __fls(x) + 1;
+}
+
+/**
  * hweightN - returns the hamming weight of a N-bit word
  * @x: the word to weigh
  *
@@ -409,7 +436,6 @@ static __inline__ int ffs(int x)
 
 /* find last set bit */
 #define fls(x) generic_fls(x)
-#define fls64(x) generic_fls64(x)
 
 #endif /* __KERNEL__ */
 

--
Stephen Hemminger [EMAIL PROTECTED]
OSDL http://developer.osdl.org/~shemminger
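[Editor's note] A portable reference version of the fls64() semantics that the bsrq-based routine above implements can be handy for sanity-checking on any platform. This is illustrative stand-in code, not the kernel implementation:

```c
#include <assert.h>
#include <stdint.h>

/* Reference fls64: 1-based index of the highest set bit, 0 if none.
 * Matches the contract of the patch's fls64() (and generic_fls64). */
static int fls64_ref(uint64_t x)
{
	int r = 0;

	/* shift the word down until it is empty, counting the bits */
	while (x) {
		r++;
		x >>= 1;
	}
	return r;
}
```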



[PATCH 4/4] TCP Cubic use Newton-Raphson

2005-12-14 Thread shemminger
Replace the cube root algorithm with a faster version using Newton-Raphson.
Surprisingly, doing the scaled div64_64 is faster than a true 64 bit
division on 64 bit CPUs.
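[Editor's note] The "scaled div64_64" mentioned here trades a little precision for speed: if the divisor doesn't fit in 32 bits, both operands are shifted right by the same amount so a 64/32 divide suffices. A hedged user-space rendering of that idea (helper names are illustrative, not kernel API):

```c
#include <assert.h>
#include <stdint.h>

/* 1-based index of highest set bit in a 32-bit word, 0 if none */
static int fls32(uint32_t x)
{
	int r = 0;

	while (x) {
		r++;
		x >>= 1;
	}
	return r;
}

/* Sketch of the dynamic-precision 64/64 divide: shrink an oversized
 * divisor to 32 bits and drop the same low bits of the dividend, so
 * the quotient stays approximately right. */
static uint64_t div64_64_sketch(uint64_t dividend, uint64_t divisor)
{
	uint32_t d = (uint32_t)divisor;

	if (divisor > 0xffffffffULL) {
		int shift = fls32((uint32_t)(divisor >> 32));

		d = (uint32_t)(divisor >> shift);
		dividend >>= shift;
	}

	return dividend / d;	/* exact whenever the divisor already fit */
}
```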

Signed-off-by: Stephen Hemminger [EMAIL PROTECTED]


--- net-2.6.16.orig/net/ipv4/tcp_cubic.c
+++ net-2.6.16/net/ipv4/tcp_cubic.c
@@ -52,6 +52,7 @@ MODULE_PARM_DESC(bic_scale, "scale (scal
 module_param(tcp_friendliness, int, 0644);
 MODULE_PARM_DESC(tcp_friendliness, "turn on/off tcp friendliness");
 
+#include <asm/div64.h>
 
 /* BIC TCP Parameters */
 struct bictcp {
@@ -93,67 +94,51 @@ static void bictcp_init(struct sock *sk)
 		tcp_sk(sk)->snd_ssthresh = initial_ssthresh;
 }
 
-/* 65536 times the cubic root */
-static const u64 cubic_table[8]
-	= {0, 65536, 82570, 94519, 104030, 112063, 119087, 125367};
-
-/*
- * calculate the cubic root of x
- * the basic idea is that x can be expressed as i*8^j
- * so cubic_root(x) = cubic_root(i)*2^j
- *  in the following code, x is i, and y is 2^j
- *  because of integer calculation, there are errors in calculation
- *  so finally use binary search to find out the exact solution
- */
-static u32 cubic_root(u64 x)
+/* 64bit divisor, dividend and result. dynamic precision */
+static inline u_int64_t div64_64(u_int64_t dividend, u_int64_t divisor)
 {
-	u64 y, app, target, start, end, mid, start_diff, end_diff;
+	u_int32_t d = divisor;
 
-	if (x == 0)
-		return 0;
+	if (divisor > 0xffffffffULL) {
+		unsigned int shift = fls(divisor >> 32);
 
-	target = x;
+		d = divisor >> shift;
+		dividend >>= shift;
+	}
 
-	/* first estimate lower and upper bound */
-	y = 1;
-	while (x >= 8){
-		x = (x >> 3);
-		y = (y << 1);
-	}
-	start = (y*cubic_table[x])>>16;
-	if (x==7)
-		end = (y<<1);
-	else
-		end = (y*cubic_table[x+1]+65535)>>16;
+	/* avoid 64 bit division if possible */
+	if (dividend >> 32)
+		do_div(dividend, d);
+	else
+		dividend = (uint32_t) dividend / d;
 
-	/* binary search for more accurate one */
-	while (start < end-1) {
-		mid = (start+end) >> 1;
-		app = mid*mid*mid;
-		if (app < target)
-			start = mid;
-		else if (app > target)
-			end = mid;
-		else
-			return mid;
-	}
+	return dividend;
+}
 
-	/* find the most accurate one from start and end */
-	app = start*start*start;
-	if (app < target)
-		start_diff = target - app;
-	else
-		start_diff = app - target;
-	app = end*end*end;
-	if (app < target)
-		end_diff = target - app;
-	else
-		end_diff = app - target;
+/*
+ * calculate the cubic root of x using Newton-Raphson
+ */
+static u32 cubic_root(u64 a)
+{
+	u32 x, x1;
 
-	if (start_diff < end_diff)
-		return (u32)start;
-	else
-		return (u32)end;
+	/* Initial estimate is based on:
+	 * cbrt(x) = exp(log(x) / 3)
+	 */
+	x = 1u << (fls64(a)/3);
+
+	/*
+	 * Iteration based on:
+	 *                                2
+	 * x    = ( 2 * x  +  a / x  ) / 3
+	 *  k+1          k         k
+	 */
+	do {
+		x1 = x;
+		x = (2 * x + (uint32_t) div64_64(a, x*x)) / 3;
+	} while (abs(x1 - x) > 1);
+
+	return x;
 }
 
 /*

--
Stephen Hemminger [EMAIL PROTECTED]
OSDL http://developer.osdl.org/~shemminger
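[Editor's note] The Newton-Raphson cube root above can be exercised in user space. The sketch below mirrors the iteration, with a plain 64-bit divide standing in for div64_64() and a naive bit-count standing in for fls64(); the `_sketch` names are illustrative only, and like the kernel version the result may be off by one from the true cube root:

```c
#include <assert.h>
#include <stdint.h>

/* stand-in for the kernel's fls64(): 1-based highest set bit, 0 if none */
static int fls64_sketch(uint64_t x)
{
	int r = 0;

	while (x) {
		r++;
		x >>= 1;
	}
	return r;
}

/* Newton-Raphson integer cube root, as in the patch */
static uint32_t cubic_root_sketch(uint64_t a)
{
	uint32_t x, x1;

	if (a == 0)
		return 0;

	/* initial estimate from cbrt(x) = exp(log(x)/3): roughly 2^(bits/3) */
	x = 1u << (fls64_sketch(a) / 3);

	/* x_{k+1} = (2*x_k + a / x_k^2) / 3, stop once the estimate settles */
	do {
		x1 = x;
		x = (2 * x + (uint32_t)(a / ((uint64_t)x * x))) / 3;
	} while (x1 > x + 1 || x > x1 + 1);	/* i.e. |x1 - x| > 1 */

	return x;
}
```

The initial estimate is what makes a 64-bit fls worthwhile: seeding near 2^(bits/3) keeps the iteration count small across the whole 64-bit range.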



[PATCH 1/4] fls64: generic version

2005-12-14 Thread shemminger
Index: bic-2.6/include/linux/bitops.h
===
--- bic-2.6.orig/include/linux/bitops.h
+++ bic-2.6/include/linux/bitops.h
@@ -76,6 +76,15 @@ static __inline__ int generic_fls(int x)
  */
 #include asm/bitops.h
 
+
+static inline int generic_fls64(__u64 x)
+{
+	__u32 h = x >> 32;
+	if (h)
+		return fls(h) + 32;
+	return fls(x);
+}
+
+
 static __inline__ int get_bitmask_order(unsigned int count)
 {
int order;
Index: bic-2.6/include/asm-alpha/bitops.h
===
--- bic-2.6.orig/include/asm-alpha/bitops.h
+++ bic-2.6/include/asm-alpha/bitops.h
@@ -321,6 +321,7 @@ static inline int fls(int word)
 #else
 #define fls	generic_fls
 #endif
+#define fls64   generic_fls64
 
 /* Compute powers of two for the given integer.  */
 static inline long floor_log2(unsigned long word)
Index: bic-2.6/include/asm-arm/bitops.h
===
--- bic-2.6.orig/include/asm-arm/bitops.h
+++ bic-2.6/include/asm-arm/bitops.h
@@ -332,6 +332,7 @@ static inline unsigned long __ffs(unsign
  */
 
 #define fls(x) generic_fls(x)
+#define fls64(x)   generic_fls64(x)
 
 /*
  * ffs: find first bit set. This is defined the same way as
@@ -351,6 +352,7 @@ static inline unsigned long __ffs(unsign
 #define fls(x) \
 	( __builtin_constant_p(x) ? generic_fls(x) : \
 	  ({ int __r; asm("clz\t%0, %1" : "=r"(__r) : "r"(x) : "cc"); 32-__r; }) )
+#define fls64(x)   generic_fls64(x)
 #define ffs(x) ({ unsigned long __t = (x); fls(__t & -__t); })
 #define ffz(x) __ffs( ~(x) )
Index: bic-2.6/include/asm-arm26/bitops.h
===
--- bic-2.6.orig/include/asm-arm26/bitops.h
+++ bic-2.6/include/asm-arm26/bitops.h
@@ -259,6 +259,7 @@ static inline unsigned long __ffs(unsign
  */
 
 #define fls(x) generic_fls(x)
+#define fls64(x)   generic_fls64(x)
 
 /*
  * ffs: find first bit set. This is defined the same way as
Index: bic-2.6/include/asm-cris/bitops.h
===
--- bic-2.6.orig/include/asm-cris/bitops.h
+++ bic-2.6/include/asm-cris/bitops.h
@@ -240,6 +240,7 @@ static inline int test_bit(int nr, const
  */
 
 #define fls(x) generic_fls(x)
+#define fls64(x)   generic_fls64(x)
 
 /*
  * hweightN - returns the hamming weight of a N-bit word
Index: bic-2.6/include/asm-frv/bitops.h
===
--- bic-2.6.orig/include/asm-frv/bitops.h
+++ bic-2.6/include/asm-frv/bitops.h
@@ -228,6 +228,7 @@ found_middle:
\
bit ? 33 - bit : bit;   \
 })
+#define fls64(x)   generic_fls64(x)
 
 /*
  * Every architecture must define this function. It's the fastest
Index: bic-2.6/include/asm-generic/bitops.h
===
--- bic-2.6.orig/include/asm-generic/bitops.h
+++ bic-2.6/include/asm-generic/bitops.h
@@ -56,6 +56,7 @@ extern __inline__ int test_bit(int nr, c
  */
 
 #define fls(x) generic_fls(x)
+#define fls64(x)   generic_fls64(x)
 
 #ifdef __KERNEL__
 
Index: bic-2.6/include/asm-h8300/bitops.h
===
--- bic-2.6.orig/include/asm-h8300/bitops.h
+++ bic-2.6/include/asm-h8300/bitops.h
@@ -406,5 +406,6 @@ found_middle:
 #endif /* __KERNEL__ */
 
 #define fls(x) generic_fls(x)
+#define fls64(x)   generic_fls64(x)
 
 #endif /* _H8300_BITOPS_H */
Index: bic-2.6/include/asm-i386/bitops.h
===
--- bic-2.6.orig/include/asm-i386/bitops.h
+++ bic-2.6/include/asm-i386/bitops.h
@@ -372,6 +372,7 @@ static inline unsigned long ffz(unsigned
  */
 
 #define fls(x) generic_fls(x)
+#define fls64(x)   generic_fls64(x)
 
 #ifdef __KERNEL__
 
Index: bic-2.6/include/asm-ia64/bitops.h
===
--- bic-2.6.orig/include/asm-ia64/bitops.h
+++ bic-2.6/include/asm-ia64/bitops.h
@@ -345,6 +345,7 @@ fls (int t)
 	x |= x >> 16;
return ia64_popcnt(x);
 }
+#define fls64(x)   generic_fls64(x)
 
 /*
  * ffs: find first bit set. This is defined the same way as the libc and 
compiler builtin
Index: bic-2.6/include/asm-m32r/bitops.h
===
--- bic-2.6.orig/include/asm-m32r/bitops.h
+++ bic-2.6/include/asm-m32r/bitops.h
@@ -465,6 +465,7 @@ static __inline__ unsigned long __ffs(un
  * fls: find last bit set.
  */
 #define fls(x) generic_fls(x)
+#define fls64(x)   generic_fls64(x)
 
 #ifdef __KERNEL__
 
Index: bic-2.6/include/asm-m68k/bitops.h
===
--- bic-2.6.orig/include/asm-m68k/bitops.h
+++ bic-2.6/include/asm-m68k/bitops.h
@@ -310,6 +310,7 @@ static inline 

Re: Default net.ipv6.mld_max_msf = 10 and net.core.optmem_max=10240

2005-12-14 Thread David S. Miller
From: Hoerdt Mickael [EMAIL PROTECTED]
Date: Wed, 14 Dec 2005 23:38:56 +0100

 As implemented now, the default memory allocated in net.core.optmem_max
 permits joining up to 320 (S,G) channels per socket (for IPv6, each channel
 costs 32 bytes in net.core.optmem_max). The thing is that
 net.ipv6.mld_max_msf sets a hard limit on it, so assuming that you don't
 change the value of net.core.optmem_max, would it make sense to increase
 net.ipv6.mld_max_msf to, let's say, 256? The rest of the memory can
 still be used for various option setup on the socket.

I think people running programs that need the higher value
can increase the limit.  This is no different than having
to tweak tcp_wmem[] or the socket buffering limits via
sysctl.


[PATCH 3/6] skge: handle out of memory on MTU size changes

2005-12-14 Thread Stephen Hemminger
Changing the MTU size causes the receiver to have to reallocate buffers.
If this allocation fails, then we need to return an error, and take
the device offline. It can then be brought back up or reconfigured
for a smaller MTU.

Signed-off-by: Stephen Hemminger [EMAIL PROTECTED]


--- skge-2.6.orig/drivers/net/skge.c
+++ skge-2.6/drivers/net/skge.c
@@ -2192,6 +2192,7 @@ static int skge_up(struct net_device *de
 	kfree(skge->rx_ring.start);
  free_pci_mem:
 	pci_free_consistent(hw->pdev, skge->mem_size, skge->mem, skge->dma);
+	skge->mem = NULL;
 
 	return err;
 }
@@ -2202,6 +2203,9 @@ static int skge_down(struct net_device *
 	struct skge_hw *hw = skge->hw;
 	int port = skge->port;
 
+	if (skge->mem == NULL)
+		return 0;
+
 	if (netif_msg_ifdown(skge))
 		printk(KERN_INFO PFX "%s: disabling interface\n", dev->name);
 
@@ -2258,6 +2262,7 @@ static int skge_down(struct net_device *
 	kfree(skge->rx_ring.start);
 	kfree(skge->tx_ring.start);
 	pci_free_consistent(hw->pdev, skge->mem_size, skge->mem, skge->dma);
+	skge->mem = NULL;
 	return 0;
 }
 
@@ -2416,18 +2421,23 @@ static void skge_tx_timeout(struct net_d
 
 static int skge_change_mtu(struct net_device *dev, int new_mtu)
 {
-	int err = 0;
-	int running = netif_running(dev);
+	int err;
 
 	if (new_mtu < ETH_ZLEN || new_mtu > ETH_JUMBO_MTU)
 		return -EINVAL;
 
+	if (!netif_running(dev)) {
+		dev->mtu = new_mtu;
+		return 0;
+	}
+
+	skge_down(dev);
 
-	if (running)
-		skge_down(dev);
 	dev->mtu = new_mtu;
-	if (running)
-		skge_up(dev);
+
+	err = skge_up(dev);
+	if (err)
+		dev_close(dev);
 
 	return err;
 }

--



[PATCH 6/6] skge: version number (1.3)

2005-12-14 Thread Stephen Hemminger
Enough changes for one version.

Signed-off-by: Stephen Hemminger [EMAIL PROTECTED]


--- skge-2.6.orig/drivers/net/skge.c
+++ skge-2.6/drivers/net/skge.c
@@ -43,7 +43,7 @@
 #include skge.h
 
 #define DRV_NAME		"skge"
-#define DRV_VERSION		"1.2"
+#define DRV_VERSION		"1.3"
 #define PFX			DRV_NAME " "
 
 #define DEFAULT_TX_RING_SIZE   128

--



[PATCH 5/6] skge: handle out of memory on ring parameter change

2005-12-14 Thread Stephen Hemminger
If changing the ring parameters fails to allocate memory, we need
to return an error and take the device down.

Fixes-bug: http://bugzilla.kernel.org/show_bug.cgi?id=5715
Signed-off-by: Stephen Hemminger [EMAIL PROTECTED]


--- skge-2.6.orig/drivers/net/skge.c
+++ skge-2.6/drivers/net/skge.c
@@ -397,6 +397,7 @@ static int skge_set_ring_param(struct ne
 			       struct ethtool_ringparam *p)
 {
 	struct skge_port *skge = netdev_priv(dev);
+	int err;
 
 	if (p->rx_pending == 0 || p->rx_pending > MAX_RX_RING_SIZE ||
 	    p->tx_pending == 0 || p->tx_pending > MAX_TX_RING_SIZE)
@@ -407,7 +408,9 @@ static int skge_set_ring_param(struct ne
 
 	if (netif_running(dev)) {
 		skge_down(dev);
-		skge_up(dev);
+		err = skge_up(dev);
+		if (err)
+			dev_close(dev);
 	}
 
 	return 0;

--



[PATCH 1/6] skge: avoid up/down on speed changes

2005-12-14 Thread Stephen Hemminger
Changing the speed settings doesn't need to cause the link to go down/up.
It can be handled with the same logic as nway_reset.

Signed-off-by: Stephen Hemminger [EMAIL PROTECTED]

--- skge-2.6.orig/drivers/net/skge.c
+++ skge-2.6/drivers/net/skge.c
@@ -88,15 +88,14 @@ MODULE_DEVICE_TABLE(pci, skge_id_table);
 
 static int skge_up(struct net_device *dev);
 static int skge_down(struct net_device *dev);
+static void skge_phy_reset(struct skge_port *skge);
 static void skge_tx_clean(struct skge_port *skge);
 static int xm_phy_write(struct skge_hw *hw, int port, u16 reg, u16 val);
 static int gm_phy_write(struct skge_hw *hw, int port, u16 reg, u16 val);
 static void genesis_get_stats(struct skge_port *skge, u64 *data);
 static void yukon_get_stats(struct skge_port *skge, u64 *data);
 static void yukon_init(struct skge_hw *hw, int port);
-static void yukon_reset(struct skge_hw *hw, int port);
 static void genesis_mac_init(struct skge_hw *hw, int port);
-static void genesis_reset(struct skge_hw *hw, int port);
 static void genesis_link_up(struct skge_port *skge);
 
 /* Avoid conditionals by using array */
@@ -276,10 +275,9 @@ static int skge_set_settings(struct net_
 	skge->autoneg = ecmd->autoneg;
 	skge->advertising = ecmd->advertising;
 
-	if (netif_running(dev)) {
-		skge_down(dev);
-		skge_up(dev);
-	}
+	if (netif_running(dev))
+		skge_phy_reset(skge);
+
 	return (0);
 }
 }
 
@@ -430,21 +428,11 @@ static void skge_set_msglevel(struct net
 static int skge_nway_reset(struct net_device *dev)
 {
 	struct skge_port *skge = netdev_priv(dev);
-	struct skge_hw *hw = skge->hw;
-	int port = skge->port;
 
 	if (skge->autoneg != AUTONEG_ENABLE || !netif_running(dev))
 		return -EINVAL;
 
-	spin_lock_bh(&hw->phy_lock);
-	if (hw->chip_id == CHIP_ID_GENESIS) {
-		genesis_reset(hw, port);
-		genesis_mac_init(hw, port);
-	} else {
-		yukon_reset(hw, port);
-		yukon_init(hw, port);
-	}
-	spin_unlock_bh(&hw->phy_lock);
+	skge_phy_reset(skge);
 	return 0;
 }
 
@@ -2019,6 +2007,25 @@ static void yukon_phy_intr(struct skge_p
/* XXX restart autonegotiation? */
 }
 
+static void skge_phy_reset(struct skge_port *skge)
+{
+	struct skge_hw *hw = skge->hw;
+	int port = skge->port;
+
+	netif_stop_queue(skge->netdev);
+	netif_carrier_off(skge->netdev);
+
+	spin_lock_bh(&hw->phy_lock);
+	if (hw->chip_id == CHIP_ID_GENESIS) {
+		genesis_reset(hw, port);
+		genesis_mac_init(hw, port);
+	} else {
+		yukon_reset(hw, port);
+		yukon_init(hw, port);
+	}
+	spin_unlock_bh(&hw->phy_lock);
+}
+
 /* Basic MII support */
 static int skge_ioctl(struct net_device *dev, struct ifreq *ifr, int cmd)
 {

--



[PATCH 0/6] skge: error handling on config changes

2005-12-14 Thread Stephen Hemminger
--



[PATCH 4/6] skge: get rid of Yukon2 defines

2005-12-14 Thread Stephen Hemminger
There's no need to keep the Yukon-2 related definitions around for the skge
driver, which only supports Yukon-1 and Genesis.

Signed-off-by: Stephen Hemminger [EMAIL PROTECTED]

--- skge-2.6.orig/drivers/net/skge.h
+++ skge-2.6/drivers/net/skge.h
@@ -475,18 +475,6 @@ enum {
 	Q_T2	= 0x40,	/* 32 bit	Test Register 2 */
 	Q_T3	= 0x44,	/* 32 bit	Test Register 3 */
 
-/* Yukon-2 */
-	Q_DONE	= 0x24,	/* 16 bit	Done Index	(Yukon-2 only) */
-	Q_WM	= 0x40,	/* 16 bit	FIFO Watermark */
-	Q_AL	= 0x42,	/*  8 bit	FIFO Alignment */
-	Q_RSP	= 0x44,	/* 16 bit	FIFO Read Shadow Pointer */
-	Q_RSL	= 0x46,	/*  8 bit	FIFO Read Shadow Level */
-	Q_RP	= 0x48,	/*  8 bit	FIFO Read Pointer */
-	Q_RL	= 0x4a,	/*  8 bit	FIFO Read Level */
-	Q_WP	= 0x4c,	/*  8 bit	FIFO Write Pointer */
-	Q_WSP	= 0x4d,	/*  8 bit	FIFO Write Shadow Pointer */
-	Q_WL	= 0x4e,	/*  8 bit	FIFO Write Level */
-	Q_WSL	= 0x4f,	/*  8 bit	FIFO Write Shadow Level */
 };
 #define Q_ADDR(reg, offs) (B8_Q_REGS + (reg) + (offs))
 
@@ -675,22 +663,16 @@ enum {
 	LED_OFF		= 10, /* switch LED off */
 };
 
-/* Receive GMAC FIFO (YUKON and Yukon-2) */
+/* Receive GMAC FIFO (YUKON) */
 enum {
 	RX_GMF_EA	= 0x0c40,/* 32 bit	Rx GMAC FIFO End Address */
 	RX_GMF_AF_THR	= 0x0c44,/* 32 bit	Rx GMAC FIFO Almost Full Thresh. */
 	RX_GMF_CTRL_T	= 0x0c48,/* 32 bit	Rx GMAC FIFO Control/Test */
 	RX_GMF_FL_MSK	= 0x0c4c,/* 32 bit	Rx GMAC FIFO Flush Mask */
 	RX_GMF_FL_THR	= 0x0c50,/* 32 bit	Rx GMAC FIFO Flush Threshold */
-	RX_GMF_TR_THR	= 0x0c54,/* 32 bit	Rx Truncation Threshold (Yukon-2) */
-
-	RX_GMF_VLAN	= 0x0c5c,/* 32 bit	Rx VLAN Type Register (Yukon-2) */
 	RX_GMF_WP	= 0x0c60,/* 32 bit	Rx GMAC FIFO Write Pointer */
-
 	RX_GMF_WLEV	= 0x0c68,/* 32 bit	Rx GMAC FIFO Write Level */
-
 	RX_GMF_RP	= 0x0c70,/* 32 bit	Rx GMAC FIFO Read Pointer */
-
 	RX_GMF_RLEV	= 0x0c78,/* 32 bit	Rx GMAC FIFO Read Level */
 };
 
@@ -855,48 +837,6 @@ enum {
 	GMAC_TI_ST_TST	= 0x0e1a,/*  8 bit	Time Stamp Timer Test Reg */
 };
 
-/* Status BMU Registers (Yukon-2 only)*/
-enum {
-	STAT_CTRL	= 0x0e80,/* 32 bit	Status BMU Control Reg */
-	STAT_LAST_IDX	= 0x0e84,/* 16 bit	Status BMU Last Index */
-	/* 0x0e85 - 0x0e86: reserved */
-	STAT_LIST_ADDR_LO	= 0x0e88,/* 32 bit	Status List Start Addr (low) */
-	STAT_LIST_ADDR_HI	= 0x0e8c,/* 32 bit	Status List Start Addr (high) */
-	STAT_TXA1_RIDX	= 0x0e90,/* 16 bit	Status TxA1 Report Index Reg */
-	STAT_TXS1_RIDX	= 0x0e92,/* 16 bit	Status TxS1 Report Index Reg */
-	STAT_TXA2_RIDX	= 0x0e94,/* 16 bit	Status TxA2 Report Index Reg */
-	STAT_TXS2_RIDX	= 0x0e96,/* 16 bit	Status TxS2 Report Index Reg */
-	STAT_TX_IDX_TH	= 0x0e98,/* 16 bit	Status Tx Index Threshold Reg */
-	STAT_PUT_IDX	= 0x0e9c,/* 16 bit	Status Put Index Reg */
-
-/* FIFO Control/Status Registers (Yukon-2 only)*/
-	STAT_FIFO_WP	= 0x0ea0,/*  8 bit	Status FIFO Write Pointer Reg */
-	STAT_FIFO_RP	= 0x0ea4,/*  8 bit	Status FIFO Read Pointer Reg */
-	STAT_FIFO_RSP	= 0x0ea6,/*  8 bit	Status FIFO Read Shadow Ptr */
-	STAT_FIFO_LEVEL	= 0x0ea8,/*  8 bit	Status FIFO Level Reg */
-	STAT_FIFO_SHLVL	= 0x0eaa,/*  8 bit	Status FIFO Shadow Level Reg */
-	STAT_FIFO_WM	= 0x0eac,/*  8 bit	Status FIFO Watermark Reg */
-	STAT_FIFO_ISR_WM	= 0x0ead,/*  8 bit	Status FIFO ISR Watermark Reg */
-
-/* Level and ISR Timer Registers (Yukon-2 only)*/
-	STAT_LEV_TIMER_INI	= 0x0eb0,/* 32 bit	Level Timer Init. Value Reg */
-	STAT_LEV_TIMER_CNT	= 0x0eb4,/* 32 bit	Level Timer Counter Reg */
-	STAT_LEV_TIMER_CTRL	= 0x0eb8,/*  8 bit	Level Timer Control Reg */
-	STAT_LEV_TIMER_TEST	= 0x0eb9,/*  8 bit	Level Timer Test Reg */
-	STAT_TX_TIMER_INI	= 0x0ec0,/* 32 bit	Tx Timer Init. Value Reg */
-	STAT_TX_TIMER_CNT	= 0x0ec4,/* 32 bit	Tx Timer Counter Reg */
-	STAT_TX_TIMER_CTRL	= 0x0ec8,/*  8 bit	Tx Timer Control Reg */
-	STAT_TX_TIMER_TEST	= 0x0ec9,/*  8 bit	Tx Timer Test Reg */
-	STAT_ISR_TIMER_INI	= 0x0ed0,/* 32 bit	ISR Timer Init. Value Reg */
-	STAT_ISR_TIMER_CNT	= 0x0ed4,/* 32 bit	ISR Timer Counter Reg */
-	STAT_ISR_TIMER_CTRL	= 0x0ed8,/*  8 bit	ISR Timer Control Reg */
-	STAT_ISR_TIMER_TEST	= 0x0ed9,/*  8 bit	ISR Timer Test Reg */
-
-	ST_LAST_IDX_MASK	= 0x007f,/* Last Index Mask */
-	ST_TXRP_IDX_MASK	= 0x0fff,/* Tx Report Index Mask

Re: [RFC][PATCH 0/3] TCP/IP Critical socket communication mechanism

2005-12-14 Thread Sridhar Samudrala
On Wed, 2005-12-14 at 14:39 -0800, Ben Greear wrote:
 James Courtier-Dutton wrote:
 
  Have you actually thought about what would happen in a real-world scenario?
  There is no real world requirement for this sort of user land feature.
  In memory pressure mode, you don't care about user applications. In 
  fact, under memory pressure no user applications are getting scheduled.
  All you care about is swapping out memory to achieve a net gain in free 
  memory, so that the applications can then run ok again.
 
 Low 'ATOMIC' memory is different from the memory that user space typically
 uses, so just because you can't allocate an SKB does not mean you are swapping
 out user-space apps.
 
 I have an app that can have 2000+ sockets open.  I would definitely like to
 make the management and other important sockets have priority over others in
 my app...

The scenario we are trying to address is also a management connection between
the nodes of a cluster and a server that manages the swap devices accessible
by all the nodes of the cluster. The critical connection is supposed to be
used to exchange status notifications of the swap devices so that failover can
happen and be propagated to all the nodes as quickly as possible. The
management apps will be pinned into memory so that they are not swapped out.

As such, the traffic that flows over the critical sockets is not high, but it
should not stall even if we run into a memory-constrained situation. That is
the reason why we would like to have a pre-allocated critical page pool which
could be used when we run out of ATOMIC memory.

Thanks
Sridhar




Re: [PATCH] vlan hardware rx csum errors

2005-12-14 Thread David S. Miller
From: Stephen Hemminger [EMAIL PROTECTED]
Date: Tue, 13 Dec 2005 16:57:00 -0800

 Receiving VLAN packets over a device (without VLAN assist) that is
 doing hardware checksumming (CHECKSUM_HW), causes errors because the
 VLAN code forgets to adjust the hardware checksum.
 
 Signed-off-by: Stephen Hemminger [EMAIL PROTECTED]

Good catch, applied.

I'll forward this off to -stable, as the fix is needed there
as well.

Thanks.


Re: Resend [PATCH netdev-2.6 2/8] e1000: Performance Enhancements

2005-12-14 Thread Patrick McManus

David S. Miller wrote:

From: John Ronciak [EMAIL PROTECTED]
Date: Wed, 7 Dec 2005 11:48:46 -0800


Copybreak probably shouldn't be used in routing use cases.


I think even this is arguable, routers route a lot more than
small 64-byte frames.  Unfortunately, that is what everyone
uses for packet rate tests. :-/

Assuming only TCP flows go through a router, it is safe to
say that the full-sized data frame to ACK ratio is about 2
to 1.


Sadly, the picture most routers see is the opposite: about 2 sub-100 
byte frames for every 1 decent sized one - and fullsize is really rare, 
maybe just 1 in 5.


This thread is semi-modern with some good data:
http://www.cctec.com/maillists/nanog/historical/0312/msg00394.html

and it is getting worse over time.. in 1998 it was more like 1:1

So the all-64byte test isn't that crazy.

BTW - this has been a great thread - enjoyed reading it very much. But
I've kind of lost a feel for what the prefetch and copybreak cases mean
for local delivery (e.g. TCP termination) scenarios, both in throughput
and CPU left for the local application. That has to be a more important
profile than IP forwarding. Any thoughts on that?


-Patrick





Re: SA switchover

2005-12-14 Thread jamal

On Wed, 2005-14-12 at 16:48 -0800, David S. Miller wrote:
 Please have a look at:
 
http://bugzilla.kernel.org/show_bug.cgi?id=4952
 
 It should look familiar.

It is - the soup nazi got involved on that bug ;-
http://marc.theaimsgroup.com/?l=linux-netdevm=113070963711648w=2

 We were discussing this in depth a few weeks ago, but the
 discussion tailed off and I don't know how close we came
 to a consensus or what that consensus might be :-)
 

it sort of is still hanging but there is progress.

 The crux of the matter, to reiterate, is that it is a non-trivial
 problem to determine what existing SA entries are subsumed by a
 newly inserted one.  The kernel would need to execute a rather
 complicated search in order to determine this SA set.

Right - Herbert has some ideas that would require help from the KM.
And we are actually agreeing we should implement a minimalist approach.
More below ..

 The subsequent argument states that actually, unlike the kernel,
 the keying daemon does have some knowledge about what a new
 SA entry might be replacing.  And therefore, that userland
 daemons such as racoon bear some responsibility in assisting
 in the smooth and efficient switchover from the dying state
 entry to the newly inserted SA.
 
 Any comments or corrections on this?

correct with caveats:

there are two sorts of problematic devices. 

1) The Ciscos, I think PIX and their relatives (I heard linksys):
These suckers have a fixed time between soft expiry time and 
hard expiry time;-
IKE only negotiates hard expiry, and soft expiry is up to the peer.
Racoon says soft expiry = 80% of hard expiry.
So if you have the expiry at 10 hours, racoon will set soft expiry
at 8 hours. CISCO hardcodes 30 seconds to be between the hard and soft
expiry ;- Yep, when you have RFCs written in a natural language like
English shit like this happens. So at the 8 hour mark, racoon
renegotiates. For 30 seconds more after that, things continue working.
Then for the next 119.5 minutes nothing works because in fact CISCO
purges its old SA and Linux (as it should) starts using the new one.
The proper way is for CISCO to send an IKE delete; it doesn't.

To fix this I submitted a patch to racoon which is in their CVS - I was
told it will show up around their release 0.7. The patch allows people
to hardcode, like in CISCO, a specific time. So this fixes the CISCO
problem without touching the kernel.

2) There are other sorts of devices - I am told some made by a vendor
called DrayTek in fact delete right away after renegotiation.
But they do send an IKE delete, except racoon ignores it ;-
As was pointed out to me, since IKEv1 is unreliable such a
message could be lost anyway.
So a bug in racoon for sure, but not good enough given the unreliability of
IKEv1. So in the last discussion Herbert and I had, we talked about doing
something in the kernel since this was getting frustrating ...
Herbert has it on his TODO and I was going to do the racoon part once he
has his patch.

cheers,
jamal



Re: Specs for Tulip3

2005-12-14 Thread Michael Chan
On Thu, 2005-12-15 at 00:07 +0100, Aritz Bastida wrote:

  rx_threshold_hit
  Rx max coalescing frames threshold hit.
 
 Well, I didn't understand what this threshold is for.
 

This counter counts the number of times rx packets have reached the max
rx coalesced frames setting before an interrupt is generated. By
default, the max rx coalesced frames is set to 6 which means that the
chip will try to wait until 6 packets are received before generating an
interrupt. Interrupt coalescing in addition to NAPI under heavy traffic
may further increase throughput.
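As a rough illustration of the counter's meaning (a toy model, not tg3 code; real hardware also has a coalescing timer that flushes a partial batch), raising the max coalesced frames cuts the interrupt count by roughly that factor under back-to-back traffic:

```python
def interrupts_for(frames, rx_max_coalesced_frames):
    """Toy model of frame-count coalescing: under continuous traffic the
    chip raises one interrupt per rx_max_coalesced_frames received, plus
    one for any remainder (in reality a timer would flush the remainder)."""
    full, rem = divmod(frames, rx_max_coalesced_frames)
    return full + (1 if rem else 0)

# Without coalescing: one interrupt per packet.
print(interrupts_for(600, 1))  # -> 600
# The tg3 default of 6 coalesced frames: roughly 6x fewer interrupts.
print(interrupts_for(600, 6))  # -> 100
```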

 
 I'm running Linux kernel 2.6.13 and tg3 version 3.37, so should be new
 enough.

Newer versions have fancy prefetch added, a spinlock removed from the rx
path, and an optimization in the use of the status tag. All these may
allow you to receive a few more packets.

 I don't know how to verify if the NIC is on a PCI-X bus. How can
 I check that? Running lspci I can see there are some PCI-X bridges:

tg3's probing output will print the bus the device is in. You can also
run lspci -vvv to find out.




Re: [RFC][PATCH 0/3] TCP/IP Critical socket communication mechanism

2005-12-14 Thread Matt Mackall
On Wed, Dec 14, 2005 at 09:55:45AM -0800, Sridhar Samudrala wrote:
 On Wed, 2005-12-14 at 10:22 +0100, Andi Kleen wrote:
   I would appreciate any feedback or comments on this approach.
  
  Maybe I'm missing something but wouldn't you need an own critical
  pool (or at least reservation) for each socket to be safe against deadlocks?
  
  Otherwise if a critical socket needs e.g. 2 pages to finish something
  and 2 critical sockets are active they can each steal the last pages
  from each other and deadlock.
 
 Here we are assuming that the pre-allocated critical page pool is big enough
 to satisfy the requirements of all the critical sockets.

Not a good assumption. A system can have anywhere from 1 to 1000 iSCSI
connections open, and we certainly don't want to preallocate enough
room for 1000 connections to make progress when we might only have one
in use.

I think we need a global receive pool and per-socket send pools.

-- 
Mathematics is the supreme nostalgia of our time.


Re: [RFC][PATCH 0/3] TCP/IP Critical socket communication mechanism

2005-12-14 Thread Matt Mackall
On Wed, Dec 14, 2005 at 08:30:23PM -0800, David S. Miller wrote:
 From: Matt Mackall [EMAIL PROTECTED]
 Date: Wed, 14 Dec 2005 19:39:37 -0800
 
  I think we need a global receive pool and per-socket send pools.
 
 Mind telling everyone how you plan to make use of the global receive
 pool when the allocation happens in the device driver and we have no
 idea which socket the packet is destined for?  What should be done for
 non-local packets being routed?  The device drivers allocate packets
 for the entire system, long before we know who the eventually received
 packets are for.  It is fully anonymous memory, and it's easy to
 design cases where the whole pool can be eaten up by non-local
 forwarded packets.

There needs to be two rules:

iff global memory critical flag is set
- allocate from the global critical receive pool on receive
- return packet to global pool if not destined for a socket with an
  attached send mempool

I think this will provide the desired behavior, though only
probabilistically. That is, we can fill the global receive pool with
uninteresting packets such that we're forced to drop critical ACKs,
but the boring packets will eventually be discarded as we walk up the
stack and we'll eventually have room to receive retried ACKs.
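A sketch of the two rules above (hypothetical helper names; in the real stack this decision would sit in the protocol demux, alongside the emergency_check() hunks in this patch series):

```python
def rx_pool_decision(in_emergency, sk_has_send_mempool):
    """Decide where a received packet's memory comes from and whether to keep it.

    Returns (alloc_pool, keep) under the two proposed rules:
    - in an emergency, receive allocations come from the critical pool;
    - packets not destined for a socket with an attached send mempool are
      returned to the pool (dropped) instead of being queued.
    """
    if not in_emergency:
        return ("normal", True)
    if sk_has_send_mempool:
        return ("critical", True)
    return ("critical", False)  # boring packet: give the memory back

print(rx_pool_decision(False, False))  # -> ('normal', True)
print(rx_pool_decision(True, True))    # -> ('critical', True)
print(rx_pool_decision(True, False))   # -> ('critical', False)
```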

 I truly dislike these patches being discussed because they are a
 complete hack, and admittedly don't even solve the problem fully.  I
 don't have any concrete better ideas but that doesn't mean this stuff
 should go into the tree.

Agreed. I'm fairly convinced a full fix is doable if you make a
couple of assumptions (limited fragmentation), but it will unavoidably be
less than pretty, as it needs to cross some layers.

 I think GFP_ATOMIC memory pools are more powerful than they are given
 credit for.  There is nothing preventing the implementation of dynamic
 GFP_ATOMIC watermarks, and having critical socket behavior kick in
 in response to hitting those water marks.

There are two problems with GFP_ATOMIC. The first is that its users
don't pre-state their worst-case usage, which means sizing the pool to
reliably avoid deadlocks is impossible. The second is that there
aren't any guarantees that GFP_ATOMIC allocations are actually
critical in the needed-to-make-forward-VM-progress sense or will be
returned to the pool in a timely fashion.
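The sizing point can be made concrete with a toy reservation pool (purely hypothetical, not kernel code): if every user pre-states its worst case, the pool can be sized to the sum of the reservations and no user can starve another.

```python
class ReservedPool:
    """Toy pool where each user must pre-state its worst-case need.

    Because capacity is the sum of all reservations, one user exhausting
    its own reservation can never eat into another's -- the deadlock that
    GFP_ATOMIC cannot rule out, since its users state no worst case.
    """
    def __init__(self):
        self.reserved = {}  # user -> pages reserved
        self.used = {}      # user -> pages currently allocated

    def register(self, user, worst_case_pages):
        self.reserved[user] = worst_case_pages
        self.used[user] = 0

    def alloc(self, user, pages=1):
        if self.used[user] + pages > self.reserved[user]:
            return False    # over its own budget: fail, don't steal
        self.used[user] += pages
        return True

pool = ReservedPool()
pool.register("iscsi0", 2)
pool.register("iscsi1", 2)
assert pool.alloc("iscsi0", 2)      # iscsi0 takes its full reservation...
assert pool.alloc("iscsi1", 2)      # ...and iscsi1 still gets its own
assert not pool.alloc("iscsi0", 1)  # but nobody can steal beyond their budget
```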

So I do think we need a distinct pool if we want to tackle this
problem. Though it's probably worth mentioning that Linus was rather
adamantly against even trying at KS.

-- 
Mathematics is the supreme nostalgia of our time.


Re: [RFC][PATCH 0/3] TCP/IP Critical socket communication mechanism

2005-12-14 Thread David S. Miller
From: Matt Mackall [EMAIL PROTECTED]
Date: Wed, 14 Dec 2005 21:02:50 -0800

 There needs to be two rules:
 
 iff global memory critical flag is set
 - allocate from the global critical receive pool on receive
 - return packet to global pool if not destined for a socket with an
   attached send mempool

This shuts off a router and/or firewall just because iSCSI or NFS peed
in its pants.  Not really acceptable.

 I think this will provide the desired behavior

It's not desirable.

What if iSCSI is protected by IPSEC, and the key management daemon has
to process a security association expiration and negotiate a new one
in order for iSCSI to communicate further with its peer when this
memory shortage occurs?  It needs to send packets back and forth with
the remote key management daemon in order to do this, but since you
cut it off with this critical receive pool, the negotiation will never
succeed.

This stuff won't work.  It's not a generic solution and that's
why it has more holes than swiss cheese. :-)


Re: [RFC][PATCH 0/3] TCP/IP Critical socket communication mechanism

2005-12-14 Thread Andi Kleen
On Wed, Dec 14, 2005 at 08:30:23PM -0800, David S. Miller wrote:
 From: Matt Mackall [EMAIL PROTECTED]
 Date: Wed, 14 Dec 2005 19:39:37 -0800
 
  I think we need a global receive pool and per-socket send pools.
 
 Mind telling everyone how you plan to make use of the global receive
 pool when the allocation happens in the device driver and we have no
 idea which socket the packet is destined for?  What should be done for

In theory one could use multiple receive queues on an intelligent enough
NIC, with the NIC distinguishing the sockets.

But that would still be a nasty "you need advanced hardware FOO to avoid
subtle problem Y" case. Also it would require lots of driver hacking.

And most NICs seem to have limits on the size of the socket tables for this,
which means you would end up in an "only N sockets supported safely" situation,
with N likely being quite small on common hardware.

I think the idea of the original poster was that just freeing non-critical
packets after a short time would be good enough, but I'm a bit sceptical
about that.

 I truly dislike these patches being discussed because they are a
 complete hack, and admittedly don't even solve the problem fully.  I

I agree. 

 I think GFP_ATOMIC memory pools are more powerful than they are given
 credit for.  There is nothing preventing the implementation of dynamic

Their main problem is that they are used too widely and in a lot
of situations that aren't really critical.

-Andi



Re: [RFC][PATCH 0/3] TCP/IP Critical socket communication mechanism

2005-12-14 Thread Nick Piggin

David S. Miller wrote:

 From: Matt Mackall [EMAIL PROTECTED]
 Date: Wed, 14 Dec 2005 21:02:50 -0800

  There needs to be two rules:
 
  iff global memory critical flag is set
  - allocate from the global critical receive pool on receive
  - return packet to global pool if not destined for a socket with an
    attached send mempool

 This shuts off a router and/or firewall just because iSCSI or NFS peed
 in its pants.  Not really acceptable.

But that should only happen (shutting off a router and/or firewall) in cases
where we now completely deadlock and never recover, including shutting off
the router and firewall, because they don't have enough memory to receive
packets either.

  I think this will provide the desired behavior

 It's not desirable.

 What if iSCSI is protected by IPSEC, and the key management daemon has
 to process a security association expiration and negotiate a new one
 in order for iSCSI to communicate further with its peer when this
 memory shortage occurs?  It needs to send packets back and forth with
 the remote key management daemon in order to do this, but since you
 cut it off with this critical receive pool, the negotiation will never
 succeed.

I guess IPSEC would be a critical socket too, in that case. Sure,
there is nothing we can do if the daemon insists on allocating lots
of memory...

 This stuff won't work.  It's not a generic solution and that's
 why it has more holes than swiss cheese. :-)

True, it will have holes. I think something complementary and
desirable would be to simply limit the amount of in-flight writeout
that things like NFS allow (or used to allow; I haven't checked for a
while, and there were noises about it getting better).

--
SUSE Labs, Novell Inc.





Re: [RFC][PATCH 0/3] TCP/IP Critical socket communication mechanism

2005-12-14 Thread Stephen Hemminger
On Wed, 14 Dec 2005 21:23:09 -0800 (PST)
David S. Miller [EMAIL PROTECTED] wrote:

 From: Matt Mackall [EMAIL PROTECTED]
 Date: Wed, 14 Dec 2005 21:02:50 -0800
 
  There needs to be two rules:
  
  iff global memory critical flag is set
  - allocate from the global critical receive pool on receive
  - return packet to global pool if not destined for a socket with an
attached send mempool
 
 This shuts off a router and/or firewall just because iSCSI or NFS peed
 in its pants.  Not really acceptable.
 
  I think this will provide the desired behavior
 
 It's not desirable.
 
 What if iSCSI is protected by IPSEC, and the key management daemon has
 to process a security association expiration and negotiate a new one
 in order for iSCSI to communicate further with its peer when this
 memory shortage occurs?  It needs to send packets back and forth with
 the remote key management daemon in order to do this, but since you
 cut it off with this critical receive pool, the negotiation will never
 succeed.
 
 This stuff won't work.  It's not a generic solution and that's
 why it has more holes than swiss cheese. :-)

Also, all this stuff is just a band-aid because Linux OOM behavior is so
fucked up. The VM system just lets the user dig themselves into a huge
overcommit, then we get into trying to change every other subsystem to
compensate.  How about cutting things off earlier, and not falling
off the cliff? How about pushing pages out to swap earlier, when memory
pressure starts to be noticed? Then you can free those non-dirty pages
to make progress. Too many of the VM decisions seem to be made in favor
of keep-it-in-memory benchmark situations.


Re: [RFC][PATCH 0/3] TCP/IP Critical socket communication mechanism

2005-12-14 Thread Stephen Hemminger
On Thu, 15 Dec 2005 06:42:45 +0100
Andi Kleen [EMAIL PROTECTED] wrote:

 On Wed, Dec 14, 2005 at 08:30:23PM -0800, David S. Miller wrote:
  From: Matt Mackall [EMAIL PROTECTED]
  Date: Wed, 14 Dec 2005 19:39:37 -0800
  
   I think we need a global receive pool and per-socket send pools.
  
  Mind telling everyone how you plan to make use of the global receive
  pool when the allocation happens in the device driver and we have no
  idea which socket the packet is destined for?  What should be done for
 
 In theory one could use multiple receive queues on an intelligent enough
 NIC, with the NIC distinguishing the sockets.
 
 But that would still be a nasty "you need advanced hardware FOO to avoid
 subtle problem Y" case. Also it would require lots of driver hacking.
 
 And most NICs seem to have limits on the size of the socket tables for this,
 which means you would end up in an "only N sockets supported safely" situation,
 with N likely being quite small on common hardware.
 
 I think the idea of the original poster was that just freeing non-critical
 packets after a short time would be good enough, but I'm a bit sceptical
 about that.
 
  I truly dislike these patches being discussed because they are a
  complete hack, and admittedly don't even solve the problem fully.  I
 
 I agree. 
 
  I think GFP_ATOMIC memory pools are more powerful than they are given
  credit for.  There is nothing preventing the implementation of dynamic
 
 Their main problem is that they are used too widely and in a lot
 of situations that aren't really critical.

Most of the use of GFP_ATOMIC is by stuff that could fail but can't
sleep waiting for memory. How about adding a GFP_NORMAL for allocations
made while holding a lock?

#define GFP_NORMAL (__GFP_NOMEMALLOC)

Then get people to change the unneeded GFP_ATOMICs to GFP_NORMAL in
places where the error paths are reasonable.


Re: [RFC][PATCH 0/3] TCP/IP Critical socket communication mechanism

2005-12-14 Thread Sridhar Samudrala
On Wed, 14 Dec 2005, David S. Miller wrote:

 From: Matt Mackall [EMAIL PROTECTED]
 Date: Wed, 14 Dec 2005 19:39:37 -0800

  I think we need a global receive pool and per-socket send pools.

 Mind telling everyone how you plan to make use of the global receive
 pool when the allocation happens in the device driver and we have no
 idea which socket the packet is destined for?  What should be done for
 non-local packets being routed?  The device drivers allocate packets
 for the entire system, long before we know who the eventually received
 packets are for.  It is fully anonymous memory, and it's easy to
 design cases where the whole pool can be eaten up by non-local
 forwarded packets.

 I truly dislike these patches being discussed because they are a
 complete hack, and admittedly don't even solve the problem fully.  I
 don't have any concrete better ideas but that doesn't mean this stuff
 should go into the tree.

 I think GFP_ATOMIC memory pools are more powerful than they are given
 credit for.  There is nothing preventing the implementation of dynamic
 GFP_ATOMIC watermarks, and having critical socket behavior kick in
 in response to hitting those water marks.

Does this mean that you are OK with having a mechanism to mark
sockets as critical and dropping the non-critical packets under
emergency, but you do not like having a separate critical page pool?

Instead, you seem to be suggesting that in_emergency be set dynamically
when we are about to run out of ATOMIC memory. Is this right?

Thanks
Sridhar