date:20060816

Re: bonding: cannot remove certain named devices

2006-08-16 Thread Giacomo A. Catenazzi

David Miller wrote:
 From: Bodo Eggert [EMAIL PROTECTED]
 Date: Wed, 16 Aug 2006 02:02:03 +0200

 Stephen Hemminger [EMAIL PROTECTED] wrote:

 IMHO idiots who put spaces in filenames should be ignored. As long as the
 bonding code doesn't throw a fatal error, it has every right to return
 No such device to the fool.
 Maybe you should limit device names to eight uppercase characters and up to
 three characters extension, too. NOT! There is no reason to artificially
 impose limitations on device names, so don't do that.

 Are you willing to work to add the special case code necessary to
 handle whitespace characters in the device name over all of the kernel
 code and also all of the userland tools too?

But if you don't handle spaces in userspace, do you handle *, ?, [, ], $,
', and \ in userspace? Should the kernel also forbid these (insane)
characters in device names?

ciao
cate

 No?  Great, I'm glad that's settled.

-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
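
For illustration only, the kind of name check being argued about could look like the sketch below. The helper name name_is_shell_safe() and the exact set of rejected characters are assumptions made for this example and are not taken from the kernel or from the bonding code; IFNAMSIZ is the kernel's 16-byte limit on interface names.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define IFNAMSIZ 16 /* kernel limit on interface names, including the NUL */

/*
 * Hypothetical helper: returns true if the name is non-empty, fits in
 * IFNAMSIZ, and contains no whitespace or shell metacharacters.
 */
static bool name_is_shell_safe(const char *name)
{
	static const char unsafe[] = " \t\n*?[]$\"'\\";
	size_t len = strlen(name);

	if (len == 0 || len >= IFNAMSIZ)
		return false;
	/* strcspn() gives the length of the leading run free of unsafe chars */
	return strcspn(name, unsafe) == len;
}
```

Whether such a restriction belongs in the kernel at all is exactly what the thread disputes; the sketch only shows that the check itself is small.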

Re: proposal for new wireless configuration API

2006-08-16 Thread Johannes Berg

On Tue, 2006-08-15 at 12:29 -0400, Dan Williams wrote:

 We might want to take the time to fix up a few of the ambiguities of
 WEXT that we've encountered over the past few years:

Yes, I definitely agree.

 o Separate attributes for signal strength units; signed integer type for
 dBm, unsigned integer type for RSSI.  One 8-bit var to represent both is
 just too confusing for people, evidently (which is true...)

Yes, agreed, they should be separated. In general, I think that one
attribute should always have a single meaning and unit attached, except
for explicitly unit-less attributes (number of frames or whatever), or
attributes that explicitly have no stable unit (raw rssi).

 o Merge functionality ENCODE and ENCODEEXT handlers into one

Good one. I'm still not sure whether we should have an attribute for
this, or a command. The whole key business seems rather complex and it
might be good to have a command 'set key' with say a possible attribute
for the mac address of a pairwise key, a key material attribute and an
IV attribute or something. Otherwise we'll end up parsing the contents
of an attribute again, which rather sucks...

On the other hand, having it as a command won't allow the user to
optimize setting the key and other things at once. I'm not too sure we
should pay all that much attention to this problem though, it can't take
forever and typically a user with such a card won't be changing the key
or parameters all the time, hence it's usually probably done only at boot
or association time.

johannes
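
A minimal sketch of the "one attribute, one unit" idea discussed above, written as plain userspace C. All names here (sig_attr, sig_report, sig_get_dbm) are hypothetical and only illustrate keeping dBm and raw RSSI in separately typed attributes instead of one shared 8-bit value.

```c
#include <stdint.h>

/* Hypothetical attribute identifiers: each one has exactly one unit. */
enum sig_attr {
	SIG_ATTR_UNSPEC,
	SIG_ATTR_DBM,  /* signed value, unit is dBm */
	SIG_ATTR_RSSI, /* unsigned value, device-specific scale */
};

struct sig_report {
	enum sig_attr type;
	union {
		int32_t dbm;
		uint32_t rssi;
	} u;
};

static struct sig_report sig_from_dbm(int32_t dbm)
{
	struct sig_report r = { .type = SIG_ATTR_DBM };

	r.u.dbm = dbm;
	return r;
}

static struct sig_report sig_from_rssi(uint32_t rssi)
{
	struct sig_report r = { .type = SIG_ATTR_RSSI };

	r.u.rssi = rssi;
	return r;
}

/* Returns 0 and stores the value only if the report really is in dBm. */
static int sig_get_dbm(const struct sig_report *r, int32_t *out)
{
	if (r->type != SIG_ATTR_DBM)
		return -1;
	*out = r->u.dbm;
	return 0;
}
```

A consumer that only understands dBm can then refuse an RSSI report outright rather than misreading it.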

Re: proposal for new wireless configuration API

2006-08-16 Thread Johannes Berg

On Tue, 2006-08-15 at 15:59 -0400, Dan Williams wrote:

 Ok, so if somebody magically opens up new unlicensed ISM spectrum
 around, say, 7GHz, does that space get broken into channels and assigned
 specific numbers by the IEEE?
 
 I know there are stable channel #s for abg range.  What about the
 future? [1]  Can we guarantee that whenever new spectrum opens up that
 future 802.11 products may use, that the mappings are well-defined?
 
 That was my main question.

I'd expect them to actually break it into channels and assign channel
numbers. Or whoever creates the hardware first does it, and those
numbers then get adopted in the year-long specification process ;)

Besides, if we really really really needed something else later for
whatever weird reason, we could add a new attribute for those cases, and
have it reject the channel attribute then :)

johannes

Re: proposal for new wireless configuration API

2006-08-16 Thread Johannes Berg

On Tue, 2006-08-15 at 12:14 -0400, Luis R. Rodriguez wrote:

 Basically redo WE completely from scratch using netlink.

Not quite, I hope! As Dan mentioned, for example all the key management
stuff ought to be consolidated. Same for some other things.

 For per packet this makes sense, for modification of all packets I
 think configfs would be more suitable. Then again this is just an
 addition, I'm not disagreeing here with the approach. The same goes
 for several common wireless settings -- we could also have a configfs
 directory for each device which would allow manual read/writing for
 setting/getting certain values; mind you that configfs does allow for
 setting/getting multiple values at the same time, for those of you who
 have wondered. This could just as easily go in as a wrapper for
 configfs-new NL API.

Yeah, that might not even be undesirable. But we also need per-packet
controls, and a bunch of them. The current situation with a special
header in front of a packet injected into the management interface isn't
too great.

I'm not sure what kind of generic packet sending parameters we have.
Bitrate obviously, and all the other possible attributes...

 NL80211_ATTR_IFINDEX: index of interface to use

This was just meant to be the ifindex of the eth0 or whatever device.

 (NL80211_ATTR_PHYIDX: (later) index of wiphy to configure)
 
 Do you mean to have a wireless device have its own device index,
 separate from the netdevice index? Can you elaborate a bit on this?

Well, the d80211 stack gives each driver backend phyN
in /sys/class/ieee80211/. If we ever want to get rid of the wmasterN
interface, we probably want to allow configuring without an ifindex
because the physical device might not have any network devices attached
at that time. I'm not exactly sure if it really makes sense to configure
the device then, but hey.

 With WE we were restricted to the number of attributes possibly
 changed by the number of ioctls and later by sub-ioctl hack
 restrictions. What restrictions are we to face with this? 

We can have tons of attributes, it's a 16-bit field. I think that should
be sufficient :)

 Do we want
 to map each attribute directly to the respective WE ioctl number to
 make it easy to do backward compatibility?

No, because that would mean having very large attribute numbers
up-front, and due to the way genetlink works there is memory allocated
for each possible attribute. Hence, attribute numbers should be
allocated in an increasing fashion starting from 1, and not be sparse.

johannes
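
To make the point about dense numbering concrete, here is a rough userspace sketch. The enum, the demo_policy struct and policy_table_bytes() are hypothetical; the only assumption carried over from the mail is that a table with one slot per possible attribute number gets allocated, so its size follows the largest number in use.

```c
#include <stddef.h>

/* Hypothetical attribute list, numbered densely starting from 1. */
enum demo_attr {
	DEMO_ATTR_UNSPEC, /* 0 is left unused */
	DEMO_ATTR_IFINDEX,
	DEMO_ATTR_PHYIDX,
	DEMO_ATTR_CHANNEL,
	DEMO_ATTR_COUNT,
	DEMO_ATTR_MAX = DEMO_ATTR_COUNT - 1,
};

/* Stand-in for a per-attribute policy entry. */
struct demo_policy {
	unsigned char type;
	unsigned short len;
};

/* One slot per possible attribute number, 0..max_attr inclusive. */
static size_t policy_table_bytes(unsigned int max_attr)
{
	return ((size_t)max_attr + 1) * sizeof(struct demo_policy);
}
```

With dense numbering the table has four entries. Reusing WE ioctl numbers, which start at 0x8B00, would need a table with more than 35000 mostly empty slots.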

Re: hard_start_xmit conext

2006-08-16 Thread Jeff Garzik


Herbert Xu wrote:

kiran kandi [EMAIL PROTECTED] wrote:
In what context is the hard_start_xmit function called? Is it called in
softirq or process context?


softirq


Also, can you call kfree_skb in softirq context?


Yes.  Don't do it in hard irq context though.


FWIW there is also Documentation/networking/netdevices.txt where this 
sort of stuff is documented.


Jeff
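
The rule from the answers above (freeing is fine in softirq context, not in hard irq context) can be modelled in a few lines of userspace C. This is only a toy model with hypothetical names; in the kernel, drivers that may free from hard irq context normally use dev_kfree_skb_irq() or dev_kfree_skb_any() rather than open-coding such a decision.

```c
/* Toy model of the execution contexts mentioned in the thread. */
enum exec_ctx { CTX_PROCESS, CTX_SOFTIRQ, CTX_HARDIRQ };

/* What the caller should do with a buffer it no longer needs. */
enum free_path { FREE_NOW, FREE_DEFERRED };

/*
 * Hypothetical helper: free directly in process and softirq context,
 * hand the buffer off for later freeing when called from hard irq.
 */
static enum free_path pick_free_path(enum exec_ctx ctx)
{
	return ctx == CTX_HARDIRQ ? FREE_DEFERRED : FREE_NOW;
}
```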



[PATCH 2.6.17] net/ipv6/udp.c: remove duplicate udp_get_port code

2006-08-16 Thread gerrit

UDPv4 and UDPv6 use an almost identical version of the get_port function,
which is unnecessary since the (long) code differs in only one if-statement.

By disentangling the if statement and adding v4/v6 checks appropriately, this
code duplication can be removed and further
   * udp_port_rover can stay in net/ipv4/udp.c
   * udp_lport_inuse can become static in net/ipv4/udp.c (only called by
     udp_get_port)

The text below discusses the re-arrangement of the if-statement, which is
implemented by the enclosed patch (works on both stable and Torvalds'
release). The patch also dispenses with a goto statement whose jump label is
referenced only once.

  D i s c u s s i o n

The following compares the statements for udp_v{4,6}_get_port. 

A) In udp_v4_get_port():
========================
if (inet2->num == snum &&
    sk2 != sk &&
    !ipv6_only_sock(sk2) &&
    (!sk2->sk_bound_dev_if || !sk->sk_bound_dev_if
     || sk2->sk_bound_dev_if == sk->sk_bound_dev_if) &&
    (!inet2->rcv_saddr || !inet->rcv_saddr
     || inet2->rcv_saddr == inet->rcv_saddr) &&
    (!sk2->sk_reuse || !sk->sk_reuse) )
  goto fail;

This function is called from IPv4 context, hence sk->sk_family == PF_INET.


B) In udp_v6_get_port():
========================
if (inet_sk(sk2)->num == snum &&
    sk2 != sk &&
    (!sk2->sk_bound_dev_if || !sk->sk_bound_dev_if
     || sk2->sk_bound_dev_if == sk->sk_bound_dev_if) &&
    (!sk2->sk_reuse || !sk->sk_reuse) &&
    ipv6_rcv_saddr_equal(sk, sk2) )
  goto fail;

This function is called from IPv6 context, hence sk->sk_family == PF_INET6.

Common denominator:
===================
By re-ordering some of the last literals, both functions share the following
conjunction of conditions:

if (inet_sk(sk2)->num == snum && sk2 != sk &&
    (!sk2->sk_bound_dev_if || !sk->sk_bound_dev_if
     || sk2->sk_bound_dev_if == sk->sk_bound_dev_if) &&
    (!sk2->sk_reuse || !sk->sk_reuse) )

To make the function applicable to both v4 and v6 contexts, a second if statement
is added, which branches according to sk's sk_family. 
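
The re-arrangement described above can be tried out in userspace with mock types. The sketch below is not the patch: struct mock_sock, the FAM_* constants and mock_v6_saddr_equal() are stand-ins invented for this example, and the v6 address comparison is far simpler than the kernel's ipv6_rcv_saddr_equal(). It only shows the shape of "shared conjunction first, then a branch on the socket family".

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

enum { FAM_INET = 2, FAM_INET6 = 10 }; /* stand-ins for PF_INET / PF_INET6 */

struct mock_sock {
	int family;
	uint16_t num;                 /* bound local port */
	int bound_dev_if;             /* 0 = not bound to a device */
	bool reuse;
	bool ipv6_only;
	uint32_t rcv_saddr;           /* IPv4 address, 0 = wildcard */
	unsigned char rcv_saddr6[16]; /* IPv6 address, all zero = wildcard */
};

/* Conditions shared by the v4 and v6 variants. */
static bool common_conflict(const struct mock_sock *sk,
			    const struct mock_sock *sk2, uint16_t snum)
{
	return sk2->num == snum && sk2 != sk &&
	       (!sk2->bound_dev_if || !sk->bound_dev_if ||
		sk2->bound_dev_if == sk->bound_dev_if) &&
	       (!sk2->reuse || !sk->reuse);
}

/* Simplified stand-in: equal if either side is a wildcard or both match. */
static bool mock_v6_saddr_equal(const struct mock_sock *a,
				const struct mock_sock *b)
{
	static const unsigned char any[16];

	return !memcmp(a->rcv_saddr6, any, 16) ||
	       !memcmp(b->rcv_saddr6, any, 16) ||
	       !memcmp(a->rcv_saddr6, b->rcv_saddr6, 16);
}

/* True if binding sk to snum would collide with the bound socket sk2. */
static bool port_conflict(const struct mock_sock *sk,
			  const struct mock_sock *sk2, uint16_t snum)
{
	if (!common_conflict(sk, sk2, snum))
		return false;
	/* Second if statement: branch on the family of sk. */
	if (sk->family == FAM_INET)
		return !sk2->ipv6_only &&
		       (!sk2->rcv_saddr || !sk->rcv_saddr ||
			sk2->rcv_saddr == sk->rcv_saddr);
	if (sk->family == FAM_INET6)
		return mock_v6_saddr_equal(sk, sk2);
	return false;
}
```

Two IPv4 sockets on the same port with different specific addresses do not conflict in this model, while a wildcard address on either side does.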


Signed-off-by: Gerrit Renker [EMAIL PROTECTED]
---

 include/net/udp.h |   17 +--
 net/ipv4/udp.c|   57 --
 net/ipv6/udp.c|   79 +
 3 files changed, 38 insertions(+), 115 deletions(-)


diff --git a/include/net/udp.h b/include/net/udp.h
index 766fba1..69d4288 100644
--- a/include/net/udp.h
+++ b/include/net/udp.h
@@ -30,25 +30,9 @@ #include <linux/seq_file.h>
 
 #define UDP_HTABLE_SIZE 128
 
-/* udp.c: This needs to be shared by v4 and v6 because the lookup
- *        and hashing code needs to work with different AF's yet
- *        the port space is shared.
- */
 extern struct hlist_head udp_hash[UDP_HTABLE_SIZE];
 extern rwlock_t udp_hash_lock;
 
-extern int udp_port_rover;
-
-static inline int udp_lport_inuse(u16 num)
-{
-   struct sock *sk;
-   struct hlist_node *node;
-
-   sk_for_each(sk, node, &udp_hash[num & (UDP_HTABLE_SIZE - 1)])
-       if (inet_sk(sk)->num == num)
-           return 1;
-   return 0;
-}
 
 /* Note: this must match 'valbool' in sock_setsockopt */
 #define UDP_CSUM_NOXMIT 1
@@ -63,6 +47,7 @@ extern struct proto udp_prot;
 
 struct sk_buff;
 
+extern int udp_get_port(struct sock *sk, unsigned short snum);
 extern void udp_err(struct sk_buff *, u32);
 
 extern int udp_sendmsg(struct kiocb *iocb, struct sock *sk,
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 3f93292..eb3aa82 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -102,6 +102,7 @@ #include <net/protocol.h>
 #include <linux/skbuff.h>
 #include <linux/proc_fs.h>
 #include <linux/seq_file.h>
+#include <net/addrconf.h>
 #include <net/sock.h>
 #include <net/udp.h>
 #include <net/icmp.h>
@@ -119,10 +120,20 @@ DEFINE_SNMP_STAT(struct udp_mib, udp_sta
 struct hlist_head udp_hash[UDP_HTABLE_SIZE];
 DEFINE_RWLOCK(udp_hash_lock);
 
-/* Shared by v4/v6 udp. */
 int udp_port_rover;
 
-static int udp_v4_get_port(struct sock *sk, unsigned short snum)
+static inline int udp_lport_inuse(u16 num)
+{
+   struct sock *sk;
+   struct hlist_node *node;
+
+   sk_for_each(sk, node, &udp_hash[num & (UDP_HTABLE_SIZE - 1)])
+       if (inet_sk(sk)->num == num)
+           return 1;
+   return 0;
+}
+
+int udp_get_port(struct sock *sk, unsigned short snum)
 {
    struct hlist_node *node;
    struct sock *sk2;
@@ -151,11 +162,10 @@ static int udp_v4_get_port(struct sock *
    }
    size = 0;
    sk_for_each(sk2, node, list)
-       if (++size >= best_size_so_far)
-

[PATCH2 1/1] network memory allocator.

2006-08-16 Thread Evgeniy Polyakov

Hello.

Network tree allocator can be used to allocate memory for all network
operations from any context.

Changes from previous release:
 * added dynamically grown cache
 * changed some inline issues
 * reduced code size
 * removed AVL tree implementation from the sources
 * changed minimum allocation size to L1 cache line size (some arches
   require that)
 * removed skb->__tsize parameter
 * added a lot of comments
 * a lot of small cleanups
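
As a rough illustration of the "minimum allocation size is one L1 cache line" item, a size can be rounded up as in the sketch below. The constant DEMO_CACHE_LINE and the helper name are assumptions for this example (kernel code would use L1_CACHE_BYTES), and this is not necessarily how the patch itself does the rounding.

```c
#include <stddef.h>

#define DEMO_CACHE_LINE 64u /* assumed line size; must be a power of two */

/* Round a request up to a whole number of cache lines, at least one. */
static size_t round_to_cache_line(size_t size)
{
	if (size == 0)
		size = 1;
	return (size + DEMO_CACHE_LINE - 1) & ~((size_t)DEMO_CACHE_LINE - 1);
}
```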

Trivial epoll based web server achieved more than 2450 requests per
second with this version (usual numbers are about 1600-1800 when usual
kmalloc is used for all network operations).

Network allocator design and implementation notes, as well as performance
and fragmentation analysis, can be found at the project homepage:
http://tservice.net.ru/~s0mbre/old/?section=projects&item=nta

Signed-off-by: Evgeniy Polyakov [EMAIL PROTECTED]

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 19c96d4..f550f95 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -327,6 +327,10 @@ #include <linux/slab.h>
 
 #include <asm/system.h>
 
+extern void *avl_alloc(unsigned int size, gfp_t gfp_mask);
+extern void avl_free(void *ptr, unsigned int size);
+extern int avl_init(void);
+
 extern void kfree_skb(struct sk_buff *skb);
 extern void   __kfree_skb(struct sk_buff *skb);
 extern struct sk_buff *__alloc_skb(unsigned int size,
diff --git a/net/core/Makefile b/net/core/Makefile
index 2645ba4..d86d468 100644
--- a/net/core/Makefile
+++ b/net/core/Makefile
@@ -10,6 +10,8 @@ obj-$(CONFIG_SYSCTL) += sysctl_net_core.
 obj-y   += dev.o ethtool.o dev_mcast.o dst.o netevent.o \
neighbour.o rtnetlink.o utils.o link_watch.o filter.o
 
+obj-y += alloc/
+
 obj-$(CONFIG_XFRM) += flow.o
 obj-$(CONFIG_SYSFS) += net-sysfs.o
 obj-$(CONFIG_NET_DIVERT) += dv.o
diff --git a/net/core/alloc/Makefile b/net/core/alloc/Makefile
new file mode 100644
index 000..21b7c51
--- /dev/null
+++ b/net/core/alloc/Makefile
@@ -0,0 +1,3 @@
+obj-y  := allocator.o
+
+allocator-y:= avl.o
diff --git a/net/core/alloc/avl.c b/net/core/alloc/avl.c
new file mode 100644
index 000..c404cbe
--- /dev/null
+++ b/net/core/alloc/avl.c
@@ -0,0 +1,651 @@
+/*
+ * avl.c
+ * 
+ * 2006 Copyright (c) Evgeniy Polyakov [EMAIL PROTECTED]
+ * All rights reserved.
+ * 
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
+ */
+
+#include <linux/kernel.h>
+#include <linux/types.h>
+#include <linux/string.h>
+#include <linux/errno.h>
+#include <linux/slab.h>
+#include <linux/spinlock.h>
+#include <linux/percpu.h>
+#include <linux/list.h>
+#include <linux/mm.h>
+#include <linux/skbuff.h>
+
+#include "avl.h"
+
+static struct avl_allocator_data avl_allocator[NR_CPUS];
+
+/*
+ * Get node pointer from address.
+ */
+static inline struct avl_node *avl_get_node_ptr(unsigned long ptr)
+{
+   struct page *page = virt_to_page(ptr);
+   struct avl_node *node = (struct avl_node *)(page->lru.next);
+
+   return node;
+}
+
+/*
+ * Set node pointer for page for given address.
+ */
+static void avl_set_node_ptr(unsigned long ptr, struct avl_node *node, int order)
+{
+   int nr_pages = 1<<order, i;
+   struct page *page = virt_to_page(ptr);
+
+   for (i=0; i<nr_pages; ++i) {
+       page->lru.next = (void *)node;
+       page++;
+   }
+}
+
+/*
+ * Get allocation CPU from address.
+ */
+static inline int avl_get_cpu_ptr(unsigned long ptr)
+{
+   struct page *page = virt_to_page(ptr);
+   int cpu = (int)(unsigned long)(page->lru.prev);
+
+   return cpu;
+}
+
+/*
+ * Set allocation cpu for page for given address.
+ */
+static void avl_set_cpu_ptr(unsigned long ptr, int cpu, int order)
+{
+   int nr_pages = 1<<order, i;
+   struct page *page = virt_to_page(ptr);
+
+   for (i=0; i<nr_pages; ++i) {
+       page->lru.prev = (void *)(unsigned long)cpu;
+       page++;
+   }
+}
+
+/*
+ * Convert pointer to node's value.
+ * Node's value is a start address for contiguous chunk bound to given node.
+ */
+static inline unsigned long avl_ptr_to_value(void *ptr)
+{
+   struct avl_node *node = avl_get_node_ptr((unsigned long)ptr);
+   return node->value;
+}
+
+/*
+ * Convert pointer into offset from start address of the contiguous chunk
+ *

Re: hard_start_xmit conext

2006-08-16 Thread Ben Greear


Jeff Garzik wrote:

Herbert Xu wrote:


kiran kandi [EMAIL PROTECTED] wrote:

In what context is the hard_start_xmit function called? Is it called in
softirq or process context?



softirq


It can be process too...doesn't pktgen call it directly?

Ben


Re: [PATCH 1/1] network memory allocator.

2006-08-16 Thread Christoph Hellwig

On Wed, Aug 16, 2006 at 09:35:46AM +0400, Evgeniy Polyakov wrote:
 On Tue, Aug 15, 2006 at 10:21:22PM +0200, Arnd Bergmann ([EMAIL PROTECTED]) 
 wrote:
  Am Monday 14 August 2006 13:04 schrieb Evgeniy Polyakov:
   * full per CPU allocation and freeing (objects are never freed on
   different CPU)
  
  Many of your data structures are per cpu, but your underlying allocations
  are all using regular kzalloc/__get_free_page/__get_free_pages functions.
  Shouldn't these be converted to calls to kmalloc_node and alloc_pages_node
  in order to get better locality on NUMA systems?
 
  OTOH, we have recently experimented with doing the dev_alloc_skb calls
  with affinity to the NUMA node that holds the actual network adapter, and
  got significant improvements on the Cell blade server. That of course
  may be a conflicting goal since it would mean having per-cpu per-node
  page pools if any CPU is supposed to be able to allocate pages for use
  as DMA buffers on any node.
 
 Doesn't alloc_pages() automatically switch to alloc_pages_node() or
 alloc_pages_current()?

That's not what's wanted.  If you have a slow interconnect you always want
to allocate memory on the node the network device is attached to.


Re: [PATCH 2.6.17] net/ipv6/udp.c: remove duplicate udp_get_port code

2006-08-16 Thread YOSHIFUJI Hideaki / 吉藤英明

Hello.

In article [EMAIL PROTECTED] (at Wed, 16 Aug 2006 08:46:48 +0100), [EMAIL 
PROTECTED] says:

 UDPv4 and UDPv6 use an almost identical version of the get_port function,
 which is unnecessary since the (long) code differs in only one if-statement.
:

:
 +#if defined(CONFIG_IPV6) || defined(CONFIG_IPV6_MODULE)
 + else if(sk->sk_family == PF_INET6 &&
 + ipv6_rcv_saddr_equal(sk, sk2) )
 + goto fail;
 + }
 +#endif

This is not good because you cannot link ipv6_rcv_saddr_equal()
if you are compiling IPv6 as a module.

How about retaining udp_v{4,6}_get_port() and calling the
common udp_get_port() from both functions?

--yoshfuji
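
One possible way to realize this suggestion is sketched below in userspace C. The callback parameter, struct mock_sock and all function names are assumptions made for this illustration and do not come from the thread. The idea is that the per-family wrappers stay, each passing its own address comparison into the common code, so the common code never has to reference a v6-only symbol directly.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct mock_sock {
	uint16_t num;             /* bound local port, 0 = unbound */
	uint32_t saddr4;          /* IPv4 address, 0 = wildcard */
	unsigned char saddr6[16]; /* IPv6 address, all zero = wildcard */
};

typedef bool (*saddr_cmp_fn)(const struct mock_sock *,
			     const struct mock_sock *);

/* Common code: walks the bound sockets, knows nothing about families. */
static int common_get_port(struct mock_sock *sk, uint16_t snum,
			   struct mock_sock *const *bound, size_t n,
			   saddr_cmp_fn cmp)
{
	for (size_t i = 0; i < n; i++) {
		if (bound[i] != sk && bound[i]->num == snum &&
		    cmp(sk, bound[i]))
			return -1; /* port already in use */
	}
	sk->num = snum;
	return 0;
}

/* v4 comparison, would live next to the v4 wrapper. */
static bool v4_saddr_clash(const struct mock_sock *a,
			   const struct mock_sock *b)
{
	return !a->saddr4 || !b->saddr4 || a->saddr4 == b->saddr4;
}

/* v6 comparison, would live in the (possibly modular) v6 code. */
static bool v6_saddr_clash(const struct mock_sock *a,
			   const struct mock_sock *b)
{
	static const unsigned char any[16];

	return !memcmp(a->saddr6, any, 16) || !memcmp(b->saddr6, any, 16) ||
	       !memcmp(a->saddr6, b->saddr6, 16);
}

static int v4_get_port(struct mock_sock *sk, uint16_t snum,
		       struct mock_sock *const *bound, size_t n)
{
	return common_get_port(sk, snum, bound, n, v4_saddr_clash);
}

static int v6_get_port(struct mock_sock *sk, uint16_t snum,
		       struct mock_sock *const *bound, size_t n)
{
	return common_get_port(sk, snum, bound, n, v6_saddr_clash);
}
```

The common function is only ever handed a function pointer, so building the v6 side as a module would not leave an unresolved reference in the built-in part.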

Re: [PATCH 1/1] network memory allocator.

2006-08-16 Thread Evgeniy Polyakov

On Wed, Aug 16, 2006 at 09:48:08AM +0100, Christoph Hellwig ([EMAIL PROTECTED]) 
wrote:
  Doesn't alloc_pages() automatically switch to alloc_pages_node() or
  alloc_pages_current()?
 
 That's not what's wanted.  If you have a slow interconnect you always want
 to allocate memory on the node the network device is attached to.

There is a drawback here: if data was allocated on the CPU where the NIC is
closer and then processed on a different CPU, it will cost more than
in the case where the buffer was allocated on the CPU where it will be processed.

But from another point of view, most adapters preallocate a set of
skbs, and with MSI-X there will be a possibility to bind the irq and
processing to the CPU where the data was originally allocated.

So I would like to know how to determine which node should be used for
allocation. Changes of __get_user_pages() to alloc_pages_node() are
trivial.

-- 
Evgeniy Polyakov

Re: [PATCH 1/1] network memory allocator.

2006-08-16 Thread David Miller

From: Evgeniy Polyakov [EMAIL PROTECTED]
Date: Wed, 16 Aug 2006 13:00:31 +0400

 So I would like to know how to determine which node should be used for
 allocation. Changes of __get_user_pages() to alloc_pages_node() are
 trivial.

netdev_alloc_skb() knows the netdevice, and therefore you can
obtain the struct device; referenced inside of the netdev,
and therefore you can determine the node using the struct
device.

Christoph is working on adding support for this using the existing
allocator.

Re: [PATCH 1/1] network memory allocator.

2006-08-16 Thread Christoph Hellwig

On Wed, Aug 16, 2006 at 02:05:03AM -0700, David Miller wrote:
 From: Evgeniy Polyakov [EMAIL PROTECTED]
 Date: Wed, 16 Aug 2006 13:00:31 +0400

  So I would like to know how to determine which node should be used for
  allocation. Changes of __get_user_pages() to alloc_pages_node() are
  trivial.

 netdev_alloc_skb() knows the netdevice, and therefore you can
 obtain the struct device; referenced inside of the netdev,
 and therefore you can determine the node using the struct
 device.

It's not that easy unfortunately.  I did what you describe above in my
first prototype but then found out the hard way that the struct device
in the netdevice can be a non-pci one, e.g. for pcmcia.  In that case
the kernel will crash on you because you can only get the node information
for pci devices.  My current patchkit adds an int node member to struct
net_device instead.  I can repost the patchkit on top of -mm (which has
the required slab memory leak tracking changes) if anyone cares.

Re: [PATCH 1/1] network memory allocator.

2006-08-16 Thread Evgeniy Polyakov

On Wed, Aug 16, 2006 at 10:10:29AM +0100, Christoph Hellwig ([EMAIL PROTECTED]) 
wrote:
 On Wed, Aug 16, 2006 at 02:05:03AM -0700, David Miller wrote:
  From: Evgeniy Polyakov [EMAIL PROTECTED]
  Date: Wed, 16 Aug 2006 13:00:31 +0400

   So I would like to know how to determine which node should be used for
   allocation. Changes of __get_user_pages() to alloc_pages_node() are
   trivial.

  netdev_alloc_skb() knows the netdevice, and therefore you can
  obtain the struct device; referenced inside of the netdev,
  and therefore you can determine the node using the struct
  device.

 It's not that easy unfortunately.  I did what you describe above in my
 first prototype but then found out the hard way that the struct device
 in the netdevice can be a non-pci one, e.g. for pcmcia.  In that case
 the kernel will crash on you because you can only get the node information
 for pci devices.  My current patchkit adds an int node member to struct
 net_device instead.  I can repost the patchkit on top of -mm (which has
 the required slab memory leak tracking changes) if anyone cares.

Can we check device->bus_type or device->driver->bus against
pci_bus_type for that?

-- 
Evgeniy Polyakov

Re: [PATCH 1/1] network memory allocator.

2006-08-16 Thread Christoph Hellwig

On Wed, Aug 16, 2006 at 01:32:02PM +0400, Evgeniy Polyakov wrote:
 On Wed, Aug 16, 2006 at 10:10:29AM +0100, Christoph Hellwig ([EMAIL 
 PROTECTED]) wrote:
  On Wed, Aug 16, 2006 at 02:05:03AM -0700, David Miller wrote:
   From: Evgeniy Polyakov [EMAIL PROTECTED]
   Date: Wed, 16 Aug 2006 13:00:31 +0400

    So I would like to know how to determine which node should be used for
    allocation. Changes of __get_user_pages() to alloc_pages_node() are
    trivial.

   netdev_alloc_skb() knows the netdevice, and therefore you can
   obtain the struct device; referenced inside of the netdev,
   and therefore you can determine the node using the struct
   device.

  It's not that easy unfortunately.  I did what you describe above in my
  first prototype but then found out the hard way that the struct device
  in the netdevice can be a non-pci one, e.g. for pcmcia.  In that case
  the kernel will crash on you because you can only get the node information
  for pci devices.  My current patchkit adds an int node member to struct
  net_device instead.  I can repost the patchkit on top of -mm (which has
  the required slab memory leak tracking changes) if anyone cares.

 Can we check device->bus_type or device->driver->bus against
 pci_bus_type for that?

We could, but I'd rather waste 4 bytes in struct net_device than having
such ugly warts in common code.

Re: [PATCH 1/1] network memory allocator.

2006-08-16 Thread Christoph Hellwig

On Wed, Aug 16, 2006 at 01:00:31PM +0400, Evgeniy Polyakov wrote:
 On Wed, Aug 16, 2006 at 09:48:08AM +0100, Christoph Hellwig ([EMAIL 
 PROTECTED]) wrote:
   Doesn't alloc_pages() automatically switch to alloc_pages_node() or
   alloc_pages_current()?
  
  That's not what's wanted.  If you have a slow interconnect you always want
  to allocate memory on the node the network device is attached to.
 
 There is a drawback here: if data was allocated on the CPU where the NIC is
 closer and then processed on a different CPU, it will cost more than
 in the case where the buffer was allocated on the CPU where it will be processed.
 
 But from another point of view, most adapters preallocate a set of
 skbs, and with MSI-X there will be a possibility to bind the irq and
 processing to the CPU where the data was originally allocated.

The case we've benchmarked (spidernet) is the common preallocated case.
For allocate on demand I'd expect the slab allocator to get things right.
We do have the irq on the right node, not through MSI but due to the odd
interrupt architecture of the Cell blades.

 So I would like to know how to determine which node should be used for
 allocation. Changes of __get_user_pages() to alloc_pages_node() are
 trivial.

The patches I have add the node field to struct net_device and use it.
It's set via alloc_netdev_node, a function I add, and for the normal case
of PCI adapters the node argument comes from pcibus_to_node().  It's
arguable we should add an alloc_foodeve_pdev variant that hides that detail,
but I'm not entirely sure whether it's worth the effort.
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 1/1] network memory allocator.

2006-08-16 Thread David Miller

From: Christoph Hellwig [EMAIL PROTECTED]
Date: Wed, 16 Aug 2006 10:38:37 +0100

 We could, but I'd rather waste 4 bytes in struct net_device than
 having such ugly warts in common code.

Why not instead have struct device store some default node value?
The node decision will be sub-optimal on non-pci but it won't crash.
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 1/1] network memory allocator.

2006-08-16 Thread Christoph Hellwig

On Wed, Aug 16, 2006 at 02:40:08AM -0700, David Miller wrote:
 From: Christoph Hellwig [EMAIL PROTECTED]
 Date: Wed, 16 Aug 2006 10:38:37 +0100

  We could, but I'd rather waste 4 bytes in struct net_device than
  having such ugly warts in common code.

 Why not instead have struct device store some default node value?
 The node decision will be sub-optimal on non-pci but it won't crash.

Right now we don't even have the node stored in the pci_dev structure but
only arch-specific accessor functions/macros.  We could change those to
take a struct device instead and make them return -1 for everything non-pci
as we already do in architectures that don't support those helpers.  -1
means 'any node' for all common allocators.
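
The accessor idea can be modelled in userspace as below. Everything here is a hypothetical stand-in (demo_device, demo_dev_to_node(), demo_pick_node()), and the fallback to the caller's local node is an assumption of the model. The sketch only shows a lookup that takes a generic device and answers -1 for anything that is not PCI instead of crashing.

```c
#include <stddef.h>

#define DEMO_NODE_ANY (-1) /* "any node", as described above */

enum demo_bus { DEMO_BUS_PCI, DEMO_BUS_PCMCIA, DEMO_BUS_OTHER };

/* Stand-in for a generic device with an optional PCI node. */
struct demo_device {
	enum demo_bus bus;
	int pci_node; /* meaningful only when bus == DEMO_BUS_PCI */
};

/* Hypothetical accessor: never fails, non-PCI devices get "any node". */
static int demo_dev_to_node(const struct demo_device *dev)
{
	if (dev == NULL || dev->bus != DEMO_BUS_PCI)
		return DEMO_NODE_ANY;
	return dev->pci_node;
}

/* Model of an allocator's choice: "any node" falls back to the local one. */
static int demo_pick_node(int requested, int local_node)
{
	return requested == DEMO_NODE_ANY ? local_node : requested;
}
```

With this shape the common networking code needs no bus-type checks. A PCMCIA device simply yields a possibly sub-optimal but safe placement.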
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: hard_start_xmit conext

2006-08-16 Thread Herbert Xu

On Wed, Aug 16, 2006 at 01:48:56AM -0700, Ben Greear wrote:

 softirq
 
 It can be process too...doesn't pktgen call it directly?

Only with BH disabled.
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmVHI~} [EMAIL PROTECTED]
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

[PATCH 7/9] net_device seq_file

2006-08-16 Thread Andrey Savochkin

Library function to create a seq_file in proc filesystem,
showing some information for each netdevice.
This code is present in the kernel in about 10 instances, and
all of them can be converted to using introduced library function.

Signed-off-by: Andrey Savochkin [EMAIL PROTECTED]
---
 include/linux/netdevice.h |7 +++
 net/core/dev.c|   96 ++
 2 files changed, 103 insertions(+)

--- ./include/linux/netdevice.h.venetproc   Tue Aug 15 13:46:08 2006
+++ ./include/linux/netdevice.h Tue Aug 15 13:46:08 2006
@@ -592,6 +592,13 @@ extern int register_netdevice(struct ne
 extern int unregister_netdevice(struct net_device *dev);
 extern void free_netdev(struct net_device *dev);
 extern void synchronize_net(void);
+#ifdef CONFIG_PROC_FS
+extern int netdev_proc_create(char *name,
+   int (*show)(struct seq_file *,
+   struct net_device *, void *),
+   void *data, struct module *mod);
+void netdev_proc_remove(char *name);
+#endif
 extern int register_netdevice_notifier(struct notifier_block *nb);
 extern int unregister_netdevice_notifier(struct notifier_block *nb);
 extern int call_netdevice_notifiers(unsigned long val, void *v);
--- ./net/core/dev.c.venetproc  Tue Aug 15 13:46:08 2006
+++ ./net/core/dev.cTue Aug 15 13:46:08 2006
@@ -2100,6 +2100,102 @@ static int dev_ifconf(char __user *arg)
 }
 
 #ifdef CONFIG_PROC_FS
+
+struct netdev_proc_data {
+   struct file_operations fops;
+   int (*show)(struct seq_file *, struct net_device *, void *);
+   void *data;
+};
+
+static void *netdev_proc_seq_start(struct seq_file *seq, loff_t *pos)
+{
+   struct net_device *dev;
+   loff_t off;
+
+   read_lock(&dev_base_lock);
+   if (*pos == 0)
+       return SEQ_START_TOKEN;
+   for (dev = dev_base, off = 1; dev; dev = dev->next, off++) {
+       if (*pos == off)
+           return dev;
+   }
+   return NULL;
+}
+
+static void *netdev_proc_seq_next(struct seq_file *seq, void *v, loff_t *pos)
+{
+   ++*pos;
+   return (v == SEQ_START_TOKEN) ? dev_base
+       : ((struct net_device *)v)->next;
+}
+
+static void netdev_proc_seq_stop(struct seq_file *seq, void *v)
+{
+   read_unlock(&dev_base_lock);
+}
+
+static int netdev_proc_seq_show(struct seq_file *seq, void *v)
+{
+   struct netdev_proc_data *p;
+
+   p = seq->private;
+   return (*p->show)(seq, v, p->data);
+}
+
+static struct seq_operations netdev_proc_seq_ops = {
+   .start = netdev_proc_seq_start,
+   .next  = netdev_proc_seq_next,
+   .stop  = netdev_proc_seq_stop,
+   .show  = netdev_proc_seq_show,
+};
+
+static int netdev_proc_open(struct inode *inode, struct file *file)
+{
+   int err;
+   struct seq_file *p;
+
+   err = seq_open(file, &netdev_proc_seq_ops);
+   if (!err) {
+       p = file->private_data;
+       p->private = (struct netdev_proc_data *)PDE(inode)->data;
+   }
+   return err;
+}
+
+int netdev_proc_create(char *name,
+       int (*show)(struct seq_file *, struct net_device *, void *),
+       void *data, struct module *mod)
+{
+   struct netdev_proc_data *p;
+   struct proc_dir_entry *ent;
+
+   p = kzalloc(sizeof(*p), GFP_KERNEL);
+   p->fops.owner = mod;
+   p->fops.open = netdev_proc_open;
+   p->fops.read = seq_read;
+   p->fops.llseek = seq_lseek;
+   p->fops.release = seq_release;
+   p->show = show;
+   p->data = data;
+   ent = create_proc_entry(name, S_IRUGO, proc_net);
+   if (ent == NULL) {
+       kfree(p);
+       return -EINVAL;
+   }
+   ent->data = p;
+   ent->destructor = proc_data_destructor;
+   smp_wmb();
+   ent->proc_fops = &p->fops;
+   return 0;
+}
+EXPORT_SYMBOL(netdev_proc_create);
+
+void netdev_proc_remove(char *name)
+{
+   proc_net_remove(name);
+}
+EXPORT_SYMBOL(netdev_proc_remove);
+
 /*
  * This is invoked by the /proc filesystem handler to display a device
  * in detail.
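
The start/next iteration contract used by the library function above can be sketched in user space. This is only an illustration under stated assumptions: `struct dev`, `header_token`, `it_start()`, `it_next()` and `walk()` are hypothetical stand-ins for the device list, SEQ_START_TOKEN and the seq_file callbacks, and no locking is modelled.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the singly linked device list. */
struct dev {
	const char *name;
	struct dev *next;
};

/* Sentinel playing the role of SEQ_START_TOKEN (the header row). */
static struct dev header_token;

/* Position 0 is the header; position N (N >= 1) is the N-th device. */
static void *it_start(struct dev *head, long pos)
{
	struct dev *d;
	long off;

	if (pos == 0)
		return &header_token;
	for (d = head, off = 1; d; d = d->next, off++) {
		if (pos == off)
			return d;
	}
	return NULL;
}

/* Advance from the header to the first device, or to the next device. */
static void *it_next(struct dev *head, void *v, long *pos)
{
	++*pos;
	return (v == &header_token) ? head : ((struct dev *)v)->next;
}

/* Drive the iteration, calling show() for every element; returns the
 * number of elements visited, the header included. */
static int walk(struct dev *head, int (*show)(void *v, void *data), void *data)
{
	long pos = 0;
	void *v;
	int n = 0;

	assert(show != NULL);
	for (v = it_start(head, pos); v; v = it_next(head, v, &pos)) {
		if (show(v, data))
			break;
		n++;
	}
	return n;
}

/* Example show() callback: count real devices, skipping the header. */
static int count_devices(void *v, void *data)
{
	if (v != &header_token)
		++*(int *)data;
	return 0;
}
```

A caller supplies only the show() callback and its private data, which is the same division of labour the patch aims for.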

[PATCH 5/9] network namespaces: async socket operations

2006-08-16 Thread Andrey Savochkin

The non-trivial part of socket namespaces: asynchronous events
must run in the proper namespace context.

Signed-off-by: Andrey Savochkin [EMAIL PROTECTED]
---
 af_inet.c|   10 ++
 inet_timewait_sock.c |8 
 tcp_timer.c  |9 +
 3 files changed, 27 insertions(+)

--- ./net/ipv4/af_inet.c.venssock-asyn  Mon Aug 14 17:04:07 2006
+++ ./net/ipv4/af_inet.cTue Aug 15 13:45:44 2006
@@ -366,10 +366,17 @@ out_rcu_unlock:
 int inet_release(struct socket *sock)
 {
    struct sock *sk = sock->sk;
+   struct net_namespace *ns, *orig_net_ns;
 
    if (sk) {
        long timeout;
 
+       /* Need to change context here since protocol ->close
+        * operation may send packets.
+        */
+       ns = get_net_ns(sk->sk_net_ns);
+       push_net_ns(ns, orig_net_ns);
+
/* Applications forget to leave groups before exiting */
ip_mc_drop_socket(sk);
 
@@ -386,6 +393,9 @@ int inet_release(struct socket *sock)
            timeout = sk->sk_lingertime;
        sock->sk = NULL;
        sk->sk_prot->close(sk, timeout);
+
+       pop_net_ns(orig_net_ns);
+       put_net_ns(ns);
}
return 0;
 }
--- ./net/ipv4/inet_timewait_sock.c.venssock-asyn   Tue Aug 15 13:45:44 2006
+++ ./net/ipv4/inet_timewait_sock.c Tue Aug 15 13:45:44 2006
@@ -129,6 +129,7 @@ static int inet_twdr_do_twkill_work(stru
 {
struct inet_timewait_sock *tw;
struct hlist_node *node;
+   struct net_namespace *orig_net_ns;
unsigned int killed;
int ret;
 
@@ -140,8 +141,10 @@ static int inet_twdr_do_twkill_work(stru
 */
killed = 0;
ret = 0;
+   push_net_ns(current_net_ns, orig_net_ns);
 rescan:
    inet_twsk_for_each_inmate(tw, node, &twdr->cells[slot]) {
+       switch_net_ns(tw->tw_net_ns);
        __inet_twsk_del_dead_node(tw);
        spin_unlock(&twdr->death_lock);
        __inet_twsk_kill(tw, twdr->hashinfo);
@@ -164,6 +167,7 @@ rescan:
 
    twdr->tw_count -= killed;
NET_ADD_STATS_BH(LINUX_MIB_TIMEWAITED, killed);
+   pop_net_ns(orig_net_ns);
 
return ret;
 }
@@ -338,10 +342,12 @@ void inet_twdr_twcal_tick(unsigned long 
int n, slot;
unsigned long j;
unsigned long now = jiffies;
+   struct net_namespace *orig_net_ns;
int killed = 0;
int adv = 0;
 
twdr = (struct inet_timewait_death_row *)data;
+   push_net_ns(current_net_ns, orig_net_ns);
 
    spin_lock(&twdr->death_lock);
    if (twdr->twcal_hand < 0)
@@ -357,6 +363,7 @@ void inet_twdr_twcal_tick(unsigned long 
 
            inet_twsk_for_each_inmate_safe(tw, node, safe,
                                           &twdr->twcal_row[slot]) {
+               switch_net_ns(tw->tw_net_ns);
                __inet_twsk_del_dead_node(tw);
                __inet_twsk_kill(tw, twdr->hashinfo);
                inet_twsk_put(tw);
@@ -384,6 +391,7 @@ out:
        del_timer(&twdr->tw_timer);
    NET_ADD_STATS_BH(LINUX_MIB_TIMEWAITKILLED, killed);
    spin_unlock(&twdr->death_lock);
+   pop_net_ns(orig_net_ns);
 }
 
 EXPORT_SYMBOL_GPL(inet_twdr_twcal_tick);
--- ./net/ipv4/tcp_timer.c.venssock-asynMon Aug 14 16:43:51 2006
+++ ./net/ipv4/tcp_timer.c  Tue Aug 15 13:45:44 2006
@@ -171,7 +171,9 @@ static void tcp_delack_timer(unsigned lo
struct sock *sk = (struct sock*)data;
struct tcp_sock *tp = tcp_sk(sk);
struct inet_connection_sock *icsk = inet_csk(sk);
+   struct net_namespace *orig_net_ns;
 
+   push_net_ns(sk->sk_net_ns, orig_net_ns);
bh_lock_sock(sk);
if (sock_owned_by_user(sk)) {
/* Try again later. */
@@ -225,6 +227,7 @@ out:
 out_unlock:
bh_unlock_sock(sk);
sock_put(sk);
+   pop_net_ns(orig_net_ns);
 }
 
 static void tcp_probe_timer(struct sock *sk)
@@ -384,8 +387,10 @@ static void tcp_write_timer(unsigned lon
 {
struct sock *sk = (struct sock*)data;
struct inet_connection_sock *icsk = inet_csk(sk);
+   struct net_namespace *orig_net_ns;
int event;
 
+   push_net_ns(sk->sk_net_ns, orig_net_ns);
bh_lock_sock(sk);
if (sock_owned_by_user(sk)) {
/* Try again later */
@@ -419,6 +424,7 @@ out:
 out_unlock:
bh_unlock_sock(sk);
sock_put(sk);
+   pop_net_ns(orig_net_ns);
 }
 
 /*
@@ -447,9 +453,11 @@ static void tcp_keepalive_timer (unsigne
 {
struct sock *sk = (struct sock *) data;
struct inet_connection_sock *icsk = inet_csk(sk);
+   struct net_namespace *orig_net_ns;
struct tcp_sock *tp = tcp_sk(sk);
__u32 elapsed;
 
+   push_net_ns(sk->sk_net_ns, orig_net_ns);
/* Only process if socket is not in use. */
bh_lock_sock(sk);
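
The pattern applied throughout this patch (save the active namespace, enter the namespace of the object being processed, restore on exit) can be sketched in user space as follows. This is a minimal illustration, not the kernel implementation: `struct ns`, `current_ns`, `ns_push()` and `ns_pop()` are hypothetical stand-ins, since the definitions of push_net_ns() and pop_net_ns() are not part of this excerpt.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct net_namespace. */
struct ns {
	int id;
	int packets_sent;	/* work accounted to this namespace */
};

static struct ns root_ns = { 0, 0 };

/* Hypothetical stand-in for the per-task current namespace pointer. */
static struct ns *current_ns = &root_ns;

/* Enter 'target' and return the namespace that was active before. */
static struct ns *ns_push(struct ns *target)
{
	struct ns *orig = current_ns;

	assert(target != NULL);
	current_ns = target;
	return orig;
}

/* Restore the namespace saved by ns_push(). */
static void ns_pop(struct ns *orig)
{
	current_ns = orig;
}

/* Work that implicitly uses the active namespace, like sending a packet. */
static void send_packet(void)
{
	current_ns->packets_sent++;
}

/* A timer-like callback runs in an arbitrary context, so it enters the
 * namespace of the object it operates on before doing any work. */
static void timer_callback(struct ns *owner)
{
	struct ns *orig = ns_push(owner);

	send_packet();
	ns_pop(orig);
}
```

Every exit path of such a callback has to reach the pop, which is why the patch places pop_net_ns() after the common unlock labels.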

[RFC] network namespaces

2006-08-16 Thread Andrey Savochkin

Hi All,

I'd like to resurrect our discussion about network namespaces.
In our previous discussions it appeared that we have rather polar concepts
which seemed hard to reconcile.
Now I have an idea of how to look at all the discussed concepts in a way that
enables everyone's usage scenario.

1. The most straightforward concept is complete separation of namespaces,
   covering device list, routing tables, netfilter tables, socket hashes, and
   everything else.

   On the input path, each packet is tagged with a namespace right at the
   place where it arrives from a device, and is processed by each layer
   in the context of this namespace.
   Non-root namespaces communicate with the outside world in two ways: by
   owning hardware devices, or by receiving packets forwarded to them by their
   parent namespace via a pass-through device.

   This complete separation of namespaces is very useful for at least two
   purposes:
    - allowing users to create and manage various tunnels and VPNs on their
      own, and
    - enabling easier and more straightforward live migration of groups of
      processes with their environment.

2. People expressed concerns that complete separation of namespaces
   may introduce undesired overhead in certain usage scenarios.
   The overhead comes from packets traversing the input path, then the output
   path, then the input path again in the destination namespace if the root
   namespace acts as a router.

   So, we may introduce short-cuts, where an input packet starts being
   processed in one namespace but changes namespace at some upper layer.
   The places where a packet can change namespace are, for example:
   routing, the post-routing netfilter hook, or even the socket hash lookup.

   The cleanest example among them is the post-routing netfilter hook.
   Tagging input packets there means that a packet is checked against
   the root namespace's routing table, found to be local, and goes directly to
   the socket hash lookup in the destination namespace.
   In this scheme the ability to change routing tables or netfilter rules on
   a per-namespace basis is traded for lower overhead.

   All other optimized schemes where input packets do not travel
   input-output-input paths in the general case may be viewed as short-cuts in
   scheme (1).  The remaining question is exactly which short-cuts make the most
   sense, and how to make them consistent from the interface point of view.
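
The two schemes above can be sketched in user space as follows. This is a rough illustration only: `struct ns`, `struct device`, `struct packet`, `tag_on_input()` and `shortcut_retag()` are hypothetical names, and matching on a single local address stands in for a real routing decision.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical namespace owning exactly one local address. */
struct ns {
	int id;
	unsigned int local_addr;
};

struct device {
	struct ns *owner;	/* namespace the device belongs to */
};

struct packet {
	unsigned int daddr;
	struct ns *ns;		/* namespace tag carried by the packet */
};

/* Scheme (1): tag the packet at the point where it enters from a device. */
static void tag_on_input(struct packet *pkt, const struct device *dev)
{
	assert(pkt != NULL && dev != NULL);
	pkt->ns = dev->owner;
}

/* Scheme (2) short-cut: after the routing decision in the current
 * namespace, hand the packet straight to the child that owns the
 * destination address. Returns 1 if the tag was changed, 0 otherwise. */
static int shortcut_retag(struct packet *pkt, struct ns *const *children,
			  size_t n)
{
	size_t i;

	for (i = 0; i < n; i++) {
		if (children[i]->local_addr == pkt->daddr) {
			pkt->ns = children[i];
			return 1;
		}
	}
	return 0;
}
```

With the short-cut, the child namespace never sees the packet on its own input path, which is the trade-off described in item 2.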

My current idea is to reach some agreement on the basic concept, review
patches, and then move on to implementing feasible short-cuts.

Opinions?

Next in this thread are patches introducing namespaces to device list,
IPv4 routing, and socket hashes, and a pass-through device.
Patches are against 2.6.18-rc4-mm1.

Best regards,

Andrey

[PATCH 2/9] network namespaces: IPv4 routing

2006-08-16 Thread Andrey Savochkin

Structures related to IPv4 routing (FIB and routing cache)
are made per-namespace.

Signed-off-by: Andrey Savochkin [EMAIL PROTECTED]
---
 include/linux/net_ns.h   |   10 +++
 include/net/flow.h   |3 +
 include/net/ip_fib.h |   46 
 net/core/dev.c   |8 ++
 net/core/fib_rules.c |   43 ---
 net/ipv4/Kconfig |4 -
 net/ipv4/fib_frontend.c  |  132 +--
 net/ipv4/fib_hash.c  |   13 +++-
 net/ipv4/fib_rules.c |   86 +-
 net/ipv4/fib_semantics.c |   99 +++
 net/ipv4/route.c |   26 -
 11 files changed, 375 insertions(+), 95 deletions(-)

--- ./include/linux/net_ns.h.vensroute  Mon Aug 14 17:18:59 2006
+++ ./include/linux/net_ns.hMon Aug 14 19:19:14 2006
@@ -14,7 +14,17 @@ struct net_namespace {
    atomic_t             active_ref, use_ref;
    struct net_device    *dev_base_p, **dev_tail_p;
    struct net_device    *loopback;
+#ifndef CONFIG_IP_MULTIPLE_TABLES
+   struct fib_table     *fib4_local_table, *fib4_main_table;
+#else
+   struct list_head     fib_rules_ops_list;
+   struct fib_rules_ops *fib4_rules_ops;
+   struct hlist_head    *fib4_tables;
+#endif
+   struct hlist_head    *fib4_hash, *fib4_laddrhash;
+   unsigned             fib4_hash_size, fib4_info_cnt;
    unsigned int         hash;
+   char                 destroying;
    struct work_struct   destroy_work;
 };
 
--- ./include/net/flow.h.vensroute  Mon Aug 14 17:04:04 2006
+++ ./include/net/flow.hMon Aug 14 17:18:59 2006
@@ -79,6 +79,9 @@ struct flowi {
 #define fl_icmp_code   uli_u.icmpt.code
 #define fl_ipsec_spi   uli_u.spi
__u32   secid;  /* used by xfrm; see secid.txt */
+#ifdef CONFIG_NET_NS
+   struct net_namespace *net_ns;
+#endif
 } __attribute__((__aligned__(BITS_PER_LONG/8)));
 
 #define FLOW_DIR_IN0
--- ./include/net/ip_fib.h.vensrouteMon Aug 14 17:04:04 2006
+++ ./include/net/ip_fib.h  Tue Aug 15 11:53:22 2006
@@ -18,6 +18,7 @@
 
 #include <net/flow.h>
 #include <linux/seq_file.h>
+#include <linux/net_ns.h>
 #include <net/fib_rules.h>
 
 /* WARNING: The ordering of these elements must match ordering
@@ -171,14 +172,21 @@ struct fib_table {
 
 #ifndef CONFIG_IP_MULTIPLE_TABLES
 
-extern struct fib_table *ip_fib_local_table;
-extern struct fib_table *ip_fib_main_table;
+#ifndef CONFIG_NET_NS
+extern struct fib_table *ip_fib_local_table_static;
+extern struct fib_table *ip_fib_main_table_static;
+#define ip_fib_local_table_ns()    ip_fib_local_table_static
+#define ip_fib_main_table_ns()     ip_fib_main_table_static
+#else
+#define ip_fib_local_table_ns()    (current_net_ns->fib4_local_table)
+#define ip_fib_main_table_ns()     (current_net_ns->fib4_main_table)
+#endif
 
 static inline struct fib_table *fib_get_table(u32 id)
 {
if (id != RT_TABLE_LOCAL)
-   return ip_fib_main_table;
-   return ip_fib_local_table;
+   return ip_fib_main_table_ns();
+   return ip_fib_local_table_ns();
 }
 
 static inline struct fib_table *fib_new_table(u32 id)
@@ -188,21 +196,29 @@ static inline struct fib_table *fib_new_
 
 static inline int fib_lookup(const struct flowi *flp, struct fib_result *res)
 {
-   if (ip_fib_local_table->tb_lookup(ip_fib_local_table, flp, res) &&
-       ip_fib_main_table->tb_lookup(ip_fib_main_table, flp, res))
+   struct fib_table *tb;
+
+   tb = ip_fib_local_table_ns();
+   if (!tb->tb_lookup(tb, flp, res))
+       return 0;
+   tb = ip_fib_main_table_ns();
+   if (tb->tb_lookup(tb, flp, res))
        return -ENETUNREACH;
    return 0;
 }
 
 static inline void fib_select_default(const struct flowi *flp, struct fib_result *res)
 {
+   struct fib_table *tb;
+
+   tb = ip_fib_main_table_ns();
    if (FIB_RES_GW(*res) && FIB_RES_NH(*res).nh_scope == RT_SCOPE_LINK)
-       ip_fib_main_table->tb_select_default(ip_fib_main_table, flp, res);
+       tb->tb_select_default(tb, flp, res);
 }
 }
 
 #else /* CONFIG_IP_MULTIPLE_TABLES */
-#define ip_fib_local_table fib_get_table(RT_TABLE_LOCAL)
-#define ip_fib_main_table fib_get_table(RT_TABLE_MAIN)
+#define ip_fib_local_table_ns() fib_get_table(RT_TABLE_LOCAL)
+#define ip_fib_main_table_ns() fib_get_table(RT_TABLE_MAIN)
 
 extern int fib_lookup(struct flowi *flp, struct fib_result *res);
 
@@ -214,6 +230,10 @@ extern void fib_select_default(const str
 
 /* Exported by fib_frontend.c */
 extern void ip_fib_init(void);
+#ifdef CONFIG_NET_NS
+extern int ip_fib_struct_init(void);
+extern void ip_fib_struct_cleanup(void);
+#endif
 extern int inet_rtm_delroute(struct sk_buff *skb, struct nlmsghdr* nlh, void *arg);
 extern int inet_rtm_newroute(struct sk_buff *skb, struct nlmsghdr* nlh, void *arg);
 extern int
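
The per-namespace lookup order in fib_lookup() above (local table first, then main table, -ENETUNREACH if both miss) can be sketched in user space. This is an illustration under simplifying assumptions: `struct route_table` holds a single prefix, and `struct ns`, `table_lookup()` and `ns_fib_lookup()` are hypothetical names rather than kernel interfaces.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical one-entry routing table. */
struct route_table {
	unsigned int prefix;
	unsigned int mask;
	int valid;
};

/* Each namespace carries its own local and main tables. */
struct ns {
	struct route_table local_table;
	struct route_table main_table;
};

/* Returns 0 on a match and non-zero on a miss, like tb_lookup(). */
static int table_lookup(const struct route_table *tb, unsigned int daddr)
{
	if (tb->valid && (daddr & tb->mask) == tb->prefix)
		return 0;
	return 1;
}

/* Look up daddr in the tables of the given namespace only. */
static int ns_fib_lookup(const struct ns *ns, unsigned int daddr)
{
	assert(ns != NULL);
	if (!table_lookup(&ns->local_table, daddr))
		return 0;
	if (table_lookup(&ns->main_table, daddr))
		return -ENETUNREACH;
	return 0;
}
```

The same destination can therefore be reachable in one namespace and unreachable in another, depending solely on that namespace's tables.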

[PATCH 1/9] network namespaces: core and device list

2006-08-16 Thread Andrey Savochkin

CONFIG_NET_NS and the net_namespace structure are introduced.
The list of network devices is made per-namespace.
Each namespace gets its own loopback device.

Signed-off-by: Andrey Savochkin [EMAIL PROTECTED]
---
 drivers/net/loopback.c|   69 -
 include/linux/init_task.h |9 ++
 include/linux/net_ns.h|   82 +
 include/linux/netdevice.h |   13 +++
 include/linux/nsproxy.h   |3 
 include/linux/sched.h |3 
 kernel/nsproxy.c  |   14 
 net/Kconfig   |7 ++
 net/core/dev.c|  150 --
 net/core/net-sysfs.c  |   24 +++
 net/ipv4/devinet.c|2 
 net/ipv6/addrconf.c   |2 
 net/ipv6/route.c  |9 +-
 13 files changed, 349 insertions(+), 38 deletions(-)

--- ./drivers/net/loopback.c.vensdevMon Aug 14 17:02:18 2006
+++ ./drivers/net/loopback.cMon Aug 14 17:18:20 2006
@@ -196,42 +196,55 @@ static struct ethtool_ops loopback_ethto
.set_tso= ethtool_op_set_tso,
 };
 
-struct net_device loopback_dev = {
-   .name   = lo,
-   .mtu= (16 * 1024) + 20 + 20 + 12,
-   .hard_start_xmit= loopback_xmit,
-   .hard_header= eth_header,
-   .hard_header_cache  = eth_header_cache,
-   .header_cache_update= eth_header_cache_update,
-   .hard_header_len= ETH_HLEN, /* 14   */
-   .addr_len   = ETH_ALEN, /* 6*/
-   .tx_queue_len   = 0,
-   .type   = ARPHRD_LOOPBACK,  /* 0x0001*/
-   .rebuild_header = eth_rebuild_header,
-   .flags  = IFF_LOOPBACK,
-   .features   = NETIF_F_SG | NETIF_F_FRAGLIST
+struct net_device loopback_dev_static;
+EXPORT_SYMBOL(loopback_dev_static);
+
+void loopback_dev_dtor(struct net_device *dev)
+{
+   if (dev->priv) {
+       kfree(dev->priv);
+       dev->priv = NULL;
+   }
+   free_netdev(dev);
+}
+
+void loopback_dev_ctor(struct net_device *dev)
+{
+   struct net_device_stats *stats;
+
+   memset(dev, 0, sizeof(*dev));
+   strcpy(dev->name, "lo");
+   dev->mtu                 = (16 * 1024) + 20 + 20 + 12;
+   dev->hard_start_xmit     = loopback_xmit;
+   dev->hard_header         = eth_header;
+   dev->hard_header_cache   = eth_header_cache;
+   dev->header_cache_update = eth_header_cache_update;
+   dev->hard_header_len     = ETH_HLEN; /* 14 */
+   dev->addr_len            = ETH_ALEN; /* 6 */
+   dev->tx_queue_len        = 0;
+   dev->type                = ARPHRD_LOOPBACK; /* 0x0001 */
+   dev->rebuild_header      = eth_rebuild_header;
+   dev->flags               = IFF_LOOPBACK;
+   dev->features            = NETIF_F_SG | NETIF_F_FRAGLIST
 #ifdef LOOPBACK_TSO
  | NETIF_F_TSO
 #endif
  | NETIF_F_NO_CSUM | NETIF_F_HIGHDMA
- | NETIF_F_LLTX,
-   .ethtool_ops= loopback_ethtool_ops,
-};
-
-/* Setup and register the loopback device. */
-int __init loopback_init(void)
-{
-   struct net_device_stats *stats;
+ | NETIF_F_LLTX;
+   dev-ethtool_ops= loopback_ethtool_ops;
 
/* Can survive without statistics */
stats = kmalloc(sizeof(struct net_device_stats), GFP_KERNEL);
if (stats) {
memset(stats, 0, sizeof(struct net_device_stats));
-   loopback_dev.priv = stats;
-   loopback_dev.get_stats = get_stats;
+   dev-priv = stats;
+   dev-get_stats = get_stats;
}
-   
-   return register_netdev(loopback_dev);
-};
+}
 
-EXPORT_SYMBOL(loopback_dev);
+/* Setup and register the loopback device. */
+int __init loopback_init(void)
+{
+   loopback_dev_ctor(&loopback_dev_static);
+   return register_netdev(&loopback_dev_static);
+};
--- ./include/linux/init_task.h.vensdev Mon Aug 14 17:04:04 2006
+++ ./include/linux/init_task.h Mon Aug 14 17:18:21 2006
@@ -87,6 +87,14 @@ extern struct nsproxy init_nsproxy;
 
 extern struct group_info init_groups;
 
+#ifdef CONFIG_NET_NS
+extern struct net_namespace init_net_ns;
+#define INIT_NET_NS \
+   .net_context = &init_net_ns,
+#else
+#define INIT_NET_NS
+#endif
+
 /*
  *  INIT_TASK is used to set up the first task table, touch at
  * your own risk!. Base=0, limit=0x1f (=2MB)
@@ -129,6 +137,7 @@ extern struct group_info init_groups;
    .signal     = &init_signals,    \
    .sighand    = &init_sighand,    \
    .nsproxy    = &init_nsproxy,    \
+   INIT_NET_NS \
.pending= { \
.list =

[PATCH 3/9] network namespaces: playing and debugging

2006-08-16 Thread Andrey Savochkin

Temporary code to play with network namespaces in the simplest way.
Do
	exec 7< /proc/net/net_ns
in your bash shell and you'll get a brand new network namespace.
There you can, for example, do
	ip link set lo up
	ip addr list
	ip addr add 1.2.3.4 dev lo
	ping -n 1.2.3.4

Signed-off-by: Andrey Savochkin [EMAIL PROTECTED]
---
 dev.c |   20 
 1 files changed, 20 insertions(+)

--- ./net/core/dev.c.vensxdbg   Tue Aug 15 13:46:44 2006
+++ ./net/core/dev.cTue Aug 15 13:46:44 2006
@@ -3597,6 +3597,8 @@ int net_ns_start(void)
if (err)
goto out_register;
put_net_ns(orig_ns);
+   printk(KERN_DEBUG "NET_NS: created new netcontext %p for %s (pid=%d)\n",
+       ns, task->comm, task->tgid);
return 0;
 
 out_register:
@@ -3629,14 +3631,29 @@ static void net_ns_destroy(void *data)
ip_fib_struct_cleanup();
pop_net_ns(orig_ns);
kfree(ns);
+   printk(KERN_DEBUG "NET_NS: netcontext %p freed\n", ns);
 }
 
 void net_ns_stop(struct net_namespace *ns)
 {
+   printk(KERN_DEBUG "NET_NS: netcontext %p scheduled for stop\n", ns);
    INIT_WORK(&ns->destroy_work, net_ns_destroy, ns);
    schedule_work(&ns->destroy_work);
 }
 EXPORT_SYMBOL(net_ns_stop);
+
+static int net_ns_open(struct inode *i, struct file *f)
+{
+   return net_ns_start();
+}
+static struct file_operations net_ns_fops = {
+   .open   = net_ns_open,
+};
+static int net_ns_init(void)
+{
+   return proc_net_fops_create("net_ns", S_IRWXU, &net_ns_fops)
+       ? 0 : -ENOMEM;
+}
 #endif
 
 /*
@@ -3701,6 +3718,9 @@ static int __init net_dev_init(void)
hotcpu_notifier(dev_cpu_callback, 0);
dst_init();
dev_mcast_init();
+#ifdef CONFIG_NET_NS
+   net_ns_init();
+#endif
rc = 0;
 out:
return rc;

[PATCH 4/9] network namespaces: socket hashes

2006-08-16 Thread Andrey Savochkin

Socket hash lookups are made within a namespace.
The hash tables are shared by all namespaces, with an
additional per-namespace permutation of the indexes.

Signed-off-by: Andrey Savochkin [EMAIL PROTECTED]
---
 include/linux/ipv6.h |3 ++-
 include/net/inet6_hashtables.h   |6 --
 include/net/inet_hashtables.h|   38 +-
 include/net/inet_sock.h  |6 --
 include/net/inet_timewait_sock.h |2 ++
 include/net/sock.h   |4 
 include/net/udp.h|   12 +---
 net/core/sock.c  |5 +
 net/ipv4/inet_connection_sock.c  |   19 +++
 net/ipv4/inet_hashtables.c   |   29 ++---
 net/ipv4/inet_timewait_sock.c|8 ++--
 net/ipv4/raw.c   |2 ++
 net/ipv4/udp.c   |   20 +---
 net/ipv6/inet6_connection_sock.c |2 ++
 net/ipv6/inet6_hashtables.c  |   25 ++---
 net/ipv6/raw.c   |4 
 net/ipv6/udp.c   |   21 ++---
 17 files changed, 151 insertions(+), 55 deletions(-)

--- ./include/linux/ipv6.h.venssock Mon Aug 14 17:02:45 2006
+++ ./include/linux/ipv6.h  Tue Aug 15 13:38:47 2006
@@ -428,10 +428,11 @@ static inline struct raw6_sock *raw6_sk(
 #define inet_v6_ipv6only(__sk) 0
 #endif /* defined(CONFIG_IPV6) || defined(CONFIG_IPV6_MODULE) */
 
-#define INET6_MATCH(__sk, __hash, __saddr, __daddr, __ports, __dif)\
+#define INET6_MATCH(__sk, __hash, __saddr, __daddr, __ports, __dif, __ns)\
    (((__sk)->sk_hash == (__hash))   && \
     ((*((__u32 *)&(inet_sk(__sk)->dport))) == (__ports))    && \
     ((__sk)->sk_family == AF_INET6) && \
+    net_ns_match((__sk)->sk_net_ns, __ns)   && \
     ipv6_addr_equal(&inet6_sk(__sk)->daddr, (__saddr))  && \
     ipv6_addr_equal(&inet6_sk(__sk)->rcv_saddr, (__daddr))  && \
     (!((__sk)->sk_bound_dev_if) || ((__sk)->sk_bound_dev_if == (__dif
--- ./include/net/inet6_hashtables.h.venssock   Mon Aug 14 17:02:47 2006
+++ ./include/net/inet6_hashtables.hTue Aug 15 13:38:47 2006
@@ -26,11 +26,13 @@ struct inet_hashinfo;
 
 /* I have no idea if this is a good hash for v6 or not. -DaveM */
 static inline unsigned int inet6_ehashfn(const struct in6_addr *laddr, const u16 lport,
-   const struct in6_addr *faddr, const u16 fport)
+   const struct in6_addr *faddr, const u16 fport,
+   struct net_namespace *ns)
 {
    unsigned int hashent = (lport ^ fport);
 
    hashent ^= (laddr->s6_addr32[3] ^ faddr->s6_addr32[3]);
+   hashent ^= net_ns_hash(ns);
    hashent ^= hashent >> 16;
    hashent ^= hashent >> 8;
return hashent;
@@ -44,7 +46,7 @@ static inline int inet6_sk_ehashfn(const
    const struct in6_addr *faddr = &np->daddr;
    const __u16 lport = inet->num;
    const __u16 fport = inet->dport;
-   return inet6_ehashfn(laddr, lport, faddr, fport);
+   return inet6_ehashfn(laddr, lport, faddr, fport, current_net_ns);
 }
 
 extern void __inet6_hash(struct inet_hashinfo *hashinfo, struct sock *sk);
--- ./include/net/inet_hashtables.h.venssockMon Aug 14 17:04:04 2006
+++ ./include/net/inet_hashtables.h Tue Aug 15 13:38:47 2006
@@ -74,6 +74,9 @@ struct inet_ehash_bucket {
  * ports are created in O(1) time?  I thought so. ;-)  -DaveM
  */
 struct inet_bind_bucket {
+#ifdef CONFIG_NET_NS
+   struct net_namespace*net_ns;
+#endif
unsigned short  port;
signed shortfastreuse;
struct hlist_node   node;
@@ -142,30 +145,34 @@ extern struct inet_bind_bucket *
 extern void inet_bind_bucket_destroy(kmem_cache_t *cachep,
 struct inet_bind_bucket *tb);
 
-static inline int inet_bhashfn(const __u16 lport, const int bhash_size)
+static inline int inet_bhashfn(const __u16 lport,
+  struct net_namespace *ns,
+  const int bhash_size)
 {
-   return lport & (bhash_size - 1);
+   return (lport ^ net_ns_hash(ns)) & (bhash_size - 1);
 }
 
 extern void inet_bind_hash(struct sock *sk, struct inet_bind_bucket *tb,
   const unsigned short snum);
 
 /* These can have wildcards, don't try too hard. */
-static inline int inet_lhashfn(const unsigned short num)
+static inline int inet_lhashfn(const unsigned short num,
+  struct net_namespace *ns)
 {
-   return num & (INET_LHTABLE_SIZE - 1);
+   return (num ^ net_ns_hash(ns)) & (INET_LHTABLE_SIZE - 1);
 }
 
 static inline int inet_sk_listen_hashfn(const struct sock *sk)
 {
-   return inet_lhashfn(inet_sk(sk)->num);
+   return inet_lhashfn(inet_sk(sk)->num, current_net_ns);
 }
 
 /* Caller must disable local BH processing. */
 static inline void
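
The idea of sharing one hash table between namespaces while permuting the bucket index can be sketched in user space. This is an illustration only: `struct ns`, `ns_hash()`, `bhashfn()` and `entry_match()` are hypothetical stand-ins for net_ns_hash(), inet_bhashfn() and the namespace comparison added to the match macros, and BHASH_SIZE is an arbitrary power of two.

```c
#include <assert.h>
#include <stdint.h>

#define BHASH_SIZE 64	/* must be a power of two */

/* Hypothetical namespace carrying a per-namespace hash value. */
struct ns {
	unsigned int hash;
};

static unsigned int ns_hash(const struct ns *ns)
{
	return ns->hash;
}

/* XOR the namespace hash into the port before masking, so the same port
 * in different namespaces tends to land in different buckets. */
static unsigned int bhashfn(uint16_t lport, const struct ns *ns,
			    unsigned int size)
{
	assert((size & (size - 1)) == 0);
	return (lport ^ ns_hash(ns)) & (size - 1);
}

/* A table entry remembers which namespace it belongs to. */
struct sock_entry {
	uint16_t port;
	const struct ns *ns;
};

/* The bucket index alone is not enough: two namespaces can still collide,
 * so a lookup also compares the namespace of each candidate. */
static int entry_match(const struct sock_entry *e, uint16_t port,
		       const struct ns *ns)
{
	return e->port == port && e->ns == ns;
}
```

The permutation only spreads the load; correctness comes from the explicit namespace comparison during the match.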

[PATCH 6/9] allow proc_dir_entries to have destructor

2006-08-16 Thread Andrey Savochkin

A destructor field is added to proc_dir_entry, and a
standard destructor that kfree()s the entry's data is introduced.

Signed-off-by: Andrey Savochkin [EMAIL PROTECTED]
---
 fs/proc/generic.c   |   10 --
 fs/proc/root.c  |1 +
 include/linux/proc_fs.h |4 
 3 files changed, 13 insertions(+), 2 deletions(-)

--- ./fs/proc/generic.c.veprocdtor  Mon Aug 14 16:43:41 2006
+++ ./fs/proc/generic.c Tue Aug 15 13:45:51 2006
@@ -608,6 +608,11 @@ static struct proc_dir_entry *proc_creat
return ent;
 }
 
+void proc_data_destructor(struct proc_dir_entry *ent)
+{
+   kfree(ent->data);
+}
+
 struct proc_dir_entry *proc_symlink(const char *name,
struct proc_dir_entry *parent, const char *dest)
 {
@@ -620,6 +625,7 @@ struct proc_dir_entry *proc_symlink(cons
        ent->data = kmalloc((ent->size=strlen(dest))+1, GFP_KERNEL);
        if (ent->data) {
            strcpy((char*)ent->data,dest);
+           ent->destructor = proc_data_destructor;
            if (proc_register(parent, ent) < 0) {
                kfree(ent->data);
                kfree(ent);
@@ -698,8 +704,8 @@ void free_proc_entry(struct proc_dir_ent
 
release_inode_number(ino);
 
-   if (S_ISLNK(de->mode) && de->data)
-       kfree(de->data);
+   if (de->destructor)
+       de->destructor(de);
    kfree(de);
 }
 
--- ./fs/proc/root.c.veprocdtor Mon Aug 14 17:02:38 2006
+++ ./fs/proc/root.cTue Aug 15 13:45:51 2006
@@ -154,6 +154,7 @@ EXPORT_SYMBOL(proc_symlink);
 EXPORT_SYMBOL(proc_mkdir);
 EXPORT_SYMBOL(create_proc_entry);
 EXPORT_SYMBOL(remove_proc_entry);
+EXPORT_SYMBOL(proc_data_destructor);
 EXPORT_SYMBOL(proc_root);
 EXPORT_SYMBOL(proc_root_fs);
 EXPORT_SYMBOL(proc_net);
--- ./include/linux/proc_fs.h.veprocdtorMon Aug 14 17:02:47 2006
+++ ./include/linux/proc_fs.h   Tue Aug 15 13:45:51 2006
@@ -46,6 +46,8 @@ typedef   int (read_proc_t)(char *page, ch
 typedefint (write_proc_t)(struct file *file, const char __user *buffer,
   unsigned long count, void *data);
 typedef int (get_info_t)(char *, char **, off_t, int);
+struct proc_dir_entry;
+typedef void (destroy_proc_t)(struct proc_dir_entry *);
 
 struct proc_dir_entry {
unsigned int low_ino;
@@ -65,6 +67,7 @@ struct proc_dir_entry {
read_proc_t *read_proc;
write_proc_t *write_proc;
atomic_t count; /* use count */
+   destroy_proc_t *destructor;
int deleted;/* delete flag */
void *set;
 };
@@ -109,6 +112,7 @@ char *task_mem(struct mm_struct *, char 
 extern struct proc_dir_entry *create_proc_entry(const char *name, mode_t mode,
struct proc_dir_entry *parent);
 extern void remove_proc_entry(const char *name, struct proc_dir_entry *parent);
+extern void proc_data_destructor(struct proc_dir_entry *);
 
 extern struct vfsmount *proc_mnt;
 extern int proc_fill_super(struct super_block *,void *,int);
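
The destructor hook introduced by this patch follows a common pattern: the generic free routine calls an optional per-entry callback instead of hard-coding which entry types own their data. A user-space sketch, assuming hypothetical names (`struct entry`, `data_destructor()`, `entry_create()`, `entry_free()`) and malloc()/free() in place of the kernel allocators:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

struct entry;
typedef void (destroy_fn)(struct entry *);

/* Hypothetical counterpart of a proc entry with an optional destructor. */
struct entry {
	void *data;
	destroy_fn *destructor;
};

/* Counts destructor invocations, for demonstration purposes only. */
static int destroyed_count;

/* Standard destructor: releases the data owned by the entry. */
static void data_destructor(struct entry *ent)
{
	free(ent->data);
	ent->data = NULL;
	destroyed_count++;
}

/* Create an entry; a non-NULL payload is copied and owned by the entry. */
static struct entry *entry_create(const char *payload)
{
	struct entry *ent = calloc(1, sizeof(*ent));

	if (!ent)
		return NULL;
	if (payload) {
		size_t len = strlen(payload) + 1;

		ent->data = malloc(len);
		if (!ent->data) {
			free(ent);
			return NULL;
		}
		memcpy(ent->data, payload, len);
		ent->destructor = data_destructor;
	}
	return ent;
}

/* Generic free path: run the destructor if one is set, then free. */
static void entry_free(struct entry *ent)
{
	assert(ent != NULL);
	if (ent->destructor)
		ent->destructor(ent);
	free(ent);
}
```

Entries that do not own their data simply leave the destructor unset, so the free path needs no knowledge of entry types.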

[PATCH 8/9] network namespaces: device to pass packets between namespaces

2006-08-16 Thread Andrey Savochkin

A simple device to pass packets between a namespace and its child.

Signed-off-by: Andrey Savochkin [EMAIL PROTECTED]
---
 Makefile |3 
 veth.c   |  327 +++
 2 files changed, 330 insertions(+)

--- ./drivers/net/Makefile.veveth   Mon Aug 14 17:03:45 2006
+++ ./drivers/net/Makefile  Tue Aug 15 13:46:15 2006
@@ -124,6 +124,9 @@ obj-$(CONFIG_SLIP) += slip.o
 obj-$(CONFIG_SLHC) += slhc.o
 
 obj-$(CONFIG_DUMMY) += dummy.o
+ifeq ($(CONFIG_NET_NS),y)
+obj-m += veth.o
+endif
 obj-$(CONFIG_IFB) += ifb.o
 obj-$(CONFIG_DE600) += de600.o
 obj-$(CONFIG_DE620) += de620.o
--- ./drivers/net/veth.c.veveth Tue Aug 15 13:44:46 2006
+++ ./drivers/net/veth.cTue Aug 15 13:46:15 2006
@@ -0,0 +1,327 @@
+/*
+ * Copyright (C) 2006  SWsoft
+ *
+ * Written by Andrey Savochkin [EMAIL PROTECTED],
+ * reusing code by Andrey Mirkin [EMAIL PROTECTED].
+ */
+#include <linux/list.h>
+#include <linux/spinlock.h>
+#include <linux/ctype.h>
+#include <asm/semaphore.h>
+#include <linux/netdevice.h>
+#include <linux/etherdevice.h>
+#include <linux/proc_fs.h>
+#include <linux/seq_file.h>
+#include <net/dst.h>
+#include <net/xfrm.h>
+
+struct veth_struct
+{
+   struct net_device   *pair;
+   struct net_device_stats stats;
+};
+
+#define veth_from_netdev(dev) ((struct veth_struct *)(netdev_priv(dev)))
+
+/* --- *
+ *
+ * Device functions
+ *
+ * --- */
+
+static struct net_device_stats *get_stats(struct net_device *dev);
+static int veth_xmit(struct sk_buff *skb, struct net_device *dev)
+{
+   struct net_device_stats *stats;
+   struct veth_struct *entry;
+   struct net_device *rcv;
+   struct net_namespace *orig_net_ns;
+   int length;
+
+   stats = get_stats(dev);
+   entry = veth_from_netdev(dev);
+   rcv = entry->pair;
+
+   if (!(rcv->flags & IFF_UP))
+       /* Target namespace does not want to receive packets */
+       goto outf;
+
+   dst_release(skb->dst);
+   skb->dst = NULL;
+   secpath_reset(skb);
+   skb_orphan(skb);
+#ifdef CONFIG_NETFILTER
+   nf_conntrack_put(skb->nfct);
+#if defined(CONFIG_NF_CONNTRACK) || defined(CONFIG_NF_CONNTRACK_MODULE)
+   nf_conntrack_put_reasm(skb->nfct_reasm);
+#endif
+#ifdef CONFIG_BRIDGE_NETFILTER
+   nf_bridge_put(skb->nf_bridge);
+#endif
+#endif
+
+   push_net_ns(rcv->net_ns, orig_net_ns);
+   skb->dev = rcv;
+   skb->pkt_type = PACKET_HOST;
+   skb->protocol = eth_type_trans(skb, rcv);
+
+   length = skb->len;
+   stats->tx_bytes += length;
+   stats->tx_packets++;
+   stats = get_stats(rcv);
+   stats->rx_bytes += length;
+   stats->rx_packets++;
+
+   netif_rx(skb);
+   pop_net_ns(orig_net_ns);
+   return 0;
+
+outf:
+   stats->tx_dropped++;
+   kfree_skb(skb);
+   return 0;
+}
+
+static int veth_open(struct net_device *dev)
+{
+   return 0;
+}
+
+static int veth_close(struct net_device *dev)
+{
+   return 0;
+}
+
+static void veth_destructor(struct net_device *dev)
+{
+   free_netdev(dev);
+}
+
+static struct net_device_stats *get_stats(struct net_device *dev)
+{
+   return &veth_from_netdev(dev)->stats;
+}
+
+int veth_init_dev(struct net_device *dev)
+{
+   dev->hard_start_xmit = veth_xmit;
+   dev->open = veth_open;
+   dev->stop = veth_close;
+   dev->destructor = veth_destructor;
+   dev->get_stats = get_stats;
+
+   ether_setup(dev);
+
+   dev->tx_queue_len = 0;
+   return 0;
+}
+
+static void veth_setup(struct net_device *dev)
+{
+   dev->init = veth_init_dev;
+}
+
+static inline int is_veth_dev(struct net_device *dev)
+{
+   return dev->init == veth_init_dev;
+}
+
+/* --- *
+ *
+ * Management interface
+ *
+ * --- */
+
+struct net_device *veth_dev_alloc(char *name, char *addr)
+{
+   struct net_device *dev;
+
+   dev = alloc_netdev(sizeof(struct veth_struct), name, veth_setup);
+   if (dev != NULL) {
+       memcpy(dev->dev_addr, addr, ETH_ALEN);
+       dev->addr_len = ETH_ALEN;
+   }
+   return dev;
+}
+
+int veth_entry_add(char *parent_name, char *parent_addr,
+       char *child_name, char *child_addr,
+       struct net_namespace *child_ns)
+{
+   struct net_device *parent_dev, *child_dev;
+   struct net_namespace *parent_ns;
+   int err;
+
+   err = -ENOMEM;
+   if ((parent_dev = veth_dev_alloc(parent_name, parent_addr)) == NULL)
+       goto out_alocp;
+   if ((child_dev = veth_dev_alloc(child_name, child_addr)) == NULL)
+       goto out_alocc;
+   veth_from_netdev(parent_dev)->pair = child_dev;
+   veth_from_netdev(child_dev)->pair = parent_dev;
+
+   /*
+* About serialization, see
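
The core of the pass-through device (two devices pointing at each other, where transmitting on one means receiving on the other) can be sketched in user space. This is a simplified illustration with hypothetical names (`struct vdev`, `vdev_pair()`, `vdev_xmit()`); unlike veth_xmit() above, it returns 1 on delivery and 0 on drop so that the outcome is visible to the caller, and it does not model namespace switching.

```c
#include <assert.h>
#include <stddef.h>

struct vstats {
	unsigned long tx_packets, tx_bytes;
	unsigned long rx_packets, rx_bytes;
	unsigned long tx_dropped;
};

/* Hypothetical paired device: 'pair' is the other end of the link. */
struct vdev {
	struct vdev *pair;
	int up;			/* non-zero when the device accepts packets */
	struct vstats stats;
};

/* Link two devices into a pair. */
static void vdev_pair(struct vdev *a, struct vdev *b)
{
	a->pair = b;
	b->pair = a;
}

/* Transmit 'len' bytes on 'dev': the packet is accounted as sent on
 * 'dev' and as received on its peer, or dropped if the peer is down. */
static int vdev_xmit(struct vdev *dev, unsigned int len)
{
	struct vdev *rcv = dev->pair;

	assert(rcv != NULL);
	if (!rcv->up) {
		dev->stats.tx_dropped++;
		return 0;
	}
	dev->stats.tx_bytes += len;
	dev->stats.tx_packets++;
	rcv->stats.rx_bytes += len;
	rcv->stats.rx_packets++;
	return 1;
}
```

In the patch the two ends live in different namespaces, which is why the receive side runs between push_net_ns() and pop_net_ns().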

[PATCH 9/9] network namespaces: playing with pass-through device

2006-08-16 Thread Andrey Savochkin

Temporary code to debug and play with the pass-through device.
Create a device pair with
	modprobe veth
	echo 'add veth1 0:1:2:3:4:1 eth0 0:1:2:3:4:2' > /proc/net/veth_ctl
and your shell will be placed into a new namespace with an `eth0' device.
Configure the device in this namespace
	ip l s eth0 up
	ip a a 1.2.3.4/24 dev eth0
and in the root namespace
	ip l s veth1 up
	ip a a 1.2.3.1/24 dev veth1
to establish a communication channel between the root namespace and the newly
created one.

Signed-off-by: Andrey Savochkin [EMAIL PROTECTED]
---
 veth.c |  113 +
 1 files changed, 113 insertions(+)

--- ./drivers/net/veth.c.veveth-dbg Tue Aug 15 13:47:48 2006
+++ ./drivers/net/veth.cTue Aug 15 14:08:04 2006
@@ -251,6 +251,116 @@ void veth_entry_del_all(void)
 
 /* --- *
  *
+ * Temporary interface to create veth devices
+ *
+ * --- */
+
+#ifdef CONFIG_PROC_FS
+
+static int veth_debug_open(struct inode *inode, struct file *file)
+{
+   return 0;
+}
+
+static char *parse_addr(char *s, char *addr)
+{
+   int i, v;
+
+   for (i = 0; i < ETH_ALEN; i++) {
+       if (!isxdigit(*s))
+           return NULL;
+       *addr = 0;
+       v = isdigit(*s) ? *s - '0' : toupper(*s) - 'A' + 10;
+       s++;
+       if (isxdigit(*s)) {
+           *addr += v << 4;
+           v = isdigit(*s) ? *s - '0' : toupper(*s) - 'A' + 10;
+           s++;
+       }
+       *addr++ += v;
+       if (i < ETH_ALEN - 1 && ispunct(*s))
+           s++;
+   }
+   return s;
+}
+
+extern int net_ns_start(void);
+static ssize_t veth_debug_write(struct file *file, const char __user *user_buf,
+   size_t size, loff_t *ppos)
+{
+   char buf[128], *s, *parent_name, *child_name;
+   char parent_addr[ETH_ALEN], child_addr[ETH_ALEN];
+   struct net_namespace *parent_ns, *child_ns;
+   int err;
+
+   s = buf;
+   err = -EINVAL;
+   if (size >= sizeof(buf))
+   goto out;
+   err = -EFAULT;
+   if (copy_from_user(buf, user_buf, size))
+   goto out;
+   buf[size] = 0;
+
+   err = -EBADRQC;
+   if (!strncmp(buf, "add ", 4)) {
+   parent_name = buf + 4;
+   if ((s = strchr(parent_name, ' ')) == NULL)
+   goto out;
+   *s = 0;
+   if ((s = parse_addr(s + 1, parent_addr)) == NULL)
+   goto out;
+   if (!*s)
+   goto out;
+   child_name = s + 1;
+   if ((s = strchr(child_name, ' ')) == NULL)
+   goto out;
+   *s = 0;
+   if ((s = parse_addr(s + 1, child_addr)) == NULL)
+   goto out;
+
+   parent_ns = get_net_ns(current_net_ns);
+   err = net_ns_start();
+   if (err)
+   goto out;
+   /* return to parent context */
+   push_net_ns(parent_ns, child_ns);
+   err = veth_entry_add(parent_name, parent_addr,
+   child_name, child_addr, child_ns);
+   pop_net_ns(child_ns);
+   put_net_ns(parent_ns);
+   if (!err)
+   err = size;
+   }
+out:
+   return err;
+}
+
+static struct file_operations veth_debug_ops = {
+   .open   = veth_debug_open,
+   .write  = veth_debug_write,
+};
+
+static int veth_debug_create(void)
+{
+   proc_net_fops_create("veth_ctl", 0200, &veth_debug_ops);
+   return 0;
+}
+
+static void veth_debug_remove(void)
+{
+   proc_net_remove("veth_ctl");
+}
+
+#else
+
+static int veth_debug_create(void) { return -1; }
+static void veth_debug_remove(void) { }
+
+#endif
+
+/* --- *
+ *
  * Information in proc
  *
  * --- */
@@ -310,12 +420,15 @@ static inline void veth_proc_remove(void
 
 int __init veth_init(void)
 {
+   if (veth_debug_create())
+   return -EINVAL;
veth_proc_create();
return 0;
 }
 
 void __exit veth_exit(void)
 {
+   veth_debug_remove();
veth_proc_remove();
veth_entry_del_all();
 }
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
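
For readers who want to try the address format accepted by veth_ctl outside
the kernel, the following is a minimal userspace sketch of the same idea as
parse_addr() in the patch above. The names parse_mac(), hex_value() and
MAC_LEN are assumptions made for this illustration and are not part of the
patch; the sketch combines the two hex digits of each byte with a shift by 4.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

#define MAC_LEN 6 /* same value as the kernel's ETH_ALEN */

/* Convert one hex digit to its value; the caller checks isxdigit() first. */
static int hex_value(unsigned char c)
{
    return isdigit(c) ? c - '0' : toupper(c) - 'A' + 10;
}

/*
 * Hypothetical userspace counterpart of parse_addr(): parse one or two hex
 * digits per byte, optionally separated by one punctuation character
 * (for example "0:1:2:3:4:1"). Returns a pointer just past the address,
 * or NULL if a byte does not start with a hex digit.
 */
const char *parse_mac(const char *s, unsigned char mac[MAC_LEN])
{
    assert(s != NULL && mac != NULL);
    for (int i = 0; i < MAC_LEN; i++) {
        if (!isxdigit((unsigned char)*s))
            return NULL;
        int v = hex_value((unsigned char)*s++);
        if (isxdigit((unsigned char)*s))
            v = (v << 4) | hex_value((unsigned char)*s++);
        mac[i] = (unsigned char)v;
        /* Skip the separator between bytes, but not after the last one. */
        if (i < MAC_LEN - 1 && ispunct((unsigned char)*s))
            s++;
    }
    return s;
}
```

Called on "0:1:2:3:4:1 eth0", the sketch fills the six bytes and returns a
pointer to the space before "eth0", which is how the write handler in the
patch finds the next field.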

Re: [PATCH 09/16] [IPv6] address: Convert address notification to use rtnl_notify()

2006-08-16 Thread Thomas Graf

* Herbert Xu [EMAIL PROTECTED] 2006-08-16 12:58
 I'm not comfortable with that change since it implies the message
 originated from a user-space process.
 
 The netlink header pid is really akin to sadb_msg_pid from RFC 2367.
 IMHO it should always be zero if the kernel is the originator of the
 message.

All route and tc notifications already use the pid so applications
can decide whether the event was caused by them. A notification
is a reply to a request so it doesn't even violate RFC 2367.

Re: [PATCH 09/16] [IPv6] address: Convert address notification to use rtnl_notify()

2006-08-16 Thread jamal

On Wed, 2006-16-08 at 12:58 +0200, Thomas Graf wrote:
 * Herbert Xu [EMAIL PROTECTED] 2006-08-16 12:58
  I'm not comfortable with that change since it implies the message
  originated from a user-space process.
  
  The netlink header pid is really akin to sadb_msg_pid from RFC 2367.
  IMHO it should always be zero if the kernel is the originator of the
  message.
 
 All route and tc notifications already use the pid so applications
 can decide whether the event was caused by them. A notification
 is a reply to a request so it doesn't even violate RFC 2367.

I would agree with Thomas on this. Regardless, I dont think that 2367 is
really a glorified reference (that thing needs so much updating it is
not funny).

cheers,
jamal




Re: [PATCH 09/16] [IPv6] address: Convert address notification to use rtnl_notify()

2006-08-16 Thread Herbert Xu

Hi Thomas:

On Wed, Aug 16, 2006 at 12:58:56PM +0200, Thomas Graf wrote:
 
 All route and tc notifications already use the pid so applications
 can decide whether the event was caused by them. A notification
 is a reply to a request so it doesn't even violate RFC 2367.

Actually most IPv4 notifications *do* set the pid to zero which is
the right thing to do for kernel-generated messages.

You're right though that the IPv6 notification modified by this patch
does set the pid to the netlink originator.  Looking back in history
it seems that this behaviour was only introduced last year to a subset
of notifications.

This inconsistency is very bad.  IMHO this change (made last year)
should be reverted so that all kernel generated (broadcast) notifications
have the originator set to zero to match the source address of the
message.

Any notification that sets the netlink pid to current->pid is
*completely* bogus.  Let me repeat this, the netlink pid is not
a process ID.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmVHI~} [EMAIL PROTECTED]
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

Re: [PATCH 09/16] [IPv6] address: Convert address notification to use rtnl_notify()

2006-08-16 Thread Thomas Graf

* Herbert Xu [EMAIL PROTECTED] 2006-08-16 21:12
 On Wed, Aug 16, 2006 at 12:58:56PM +0200, Thomas Graf wrote:
  
  All route and tc notifications already use the pid so applications
  can decide whether the event was caused by them. A notification
  is a reply to a request so it doesn't even violate RFC 2367.
 
 Actually most IPv4 notifications *do* set the pid to zero which is
 the right thing to do for kernel-generated messages.
 
 You're right though that the IPv6 notification modified by this patch
 does set the pid to the netlink originator.  Looking back in history
 it seems that this behaviour was only introduced last year to a subset
 of notifications.

It was added to help quagga identify which route modifications
are self caused. It's not possible to use rtm_protocol for this
purpose as other applications can delete routes added by quagga.

 This inconsistency is very bad.  IMHO this change (made last year)
 should be reverted so that all kernel generated (broadcast) notifications
 have the originator set to zero to match the source address of the
 message.

We can't just knowingly break quagga.

I think it's a good thing to include the pid, it's additional
information that is helpful to userspace. Userspace is already
aware that the notifications are originating from the kernel,
we can't do userspace <-> userspace communication anymore anyway.

 Any notification that sets the netlink pid to current->pid is
 *completely* bogus.  Let me repeat this, the netlink pid is not
 a process ID.

Everyone is aware of that, actually these patches fix all
occurrences of current->pid by replacing them with a pid of 0.

Re: [PATCH 09/16] [IPv6] address: Convert address notification to use rtnl_notify()

2006-08-16 Thread Herbert Xu

On Wed, Aug 16, 2006 at 09:12:40PM +1000, herbert wrote:
 
 Any notification that sets the netlink pid to current->pid is
 *completely* bogus.  Let me repeat this, the netlink pid is not
 a process ID.

BTW, I'm not having a go at either Thomas or Jamal.  You guys
are on the same side for once :).

I honestly believe that we have a misunderstanding here which needs
to be sorted out.  It gets worse because that misunderstanding has
made it into the manpages package which only causes more confusion.

So let's step back a bit and think about where does this pid really
come from.  The field in question is nlmsg_pid.  Its primary purpose
is to identify unicast transactions along with the field nlmsg_seq.
It was not designed to identify the origin of a broadcast kernel
notification to a third party.

For this purpose, the value of nlmsg_pid is set to the address of
the destination socket for a particular unicast message (also known
as the pid).

That pid in turn has only a vague connection with the process ID
of the process owning the socket.  For practical purposes, we
should not treat it as a process ID, since it can easily be claimed by
another process (think socket + bind + fork).

For a broadcast notification, the nlmsg_pid field is meaningless
because the nlmsg_seq field is also meaningless.  I'm not denying
that it wouldn't be useful to have the originator's socket address
in there.  What I'm saying is that it's the wrong place to put
that information.

In any case, putting current->pid in this field is definitely
a bad idea because it only encourages people to confuse the
netlink pid with the process ID which can lead to security
problems later on.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmVHI~} [EMAIL PROTECTED]
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
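
The distinction drawn above between the netlink pid and a process ID can be
observed from userspace. The following is a minimal sketch, assuming a Linux
system with rtnetlink available; open_rtnl_socket() and netlink_port_id() are
hypothetical helper names introduced only for this illustration. It binds a
socket with nl_pid set to 0, which asks the kernel to choose the address, and
reads the chosen address back with getsockname().

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>
#include <linux/netlink.h>

/*
 * Open an rtnetlink socket and let the kernel pick its address
 * (nl_pid = 0 in bind() requests autobind). Returns the fd, or -1.
 */
int open_rtnl_socket(void)
{
    struct sockaddr_nl addr;
    int fd = socket(AF_NETLINK, SOCK_RAW, NETLINK_ROUTE);

    if (fd < 0)
        return -1;
    memset(&addr, 0, sizeof(addr));
    addr.nl_family = AF_NETLINK;
    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Read back the netlink address ("pid") of a bound socket. 0 on success. */
int netlink_port_id(int fd, uint32_t *port_id)
{
    struct sockaddr_nl addr;
    socklen_t len = sizeof(addr);

    assert(port_id != NULL);
    if (getsockname(fd, (struct sockaddr *)&addr, &len) < 0)
        return -1;
    *port_id = addr.nl_pid;
    return 0;
}
```

Two sockets opened this way by the same process report different addresses,
so at most one of them can coincide with getpid(). That is one practical
reason to treat the value as a socket address and not as a process ID.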

Re: [PATCH 1/1] network memory allocator.

2006-08-16 Thread Arnd Bergmann

On Wednesday 16 August 2006 11:00, Evgeniy Polyakov wrote:
 There is a drawback here: if data was allocated on the CPU where the NIC is
 closer and then processed on a different CPU, it will cost more than
 in the case where the buffer was allocated on the CPU where it will be processed.
 
 But from another point of view, most adapters preallocate a set of
 skbs, and with MSI-X help there will be a possibility to bind the irq and
 processing to the CPU where the data was originally allocated.
 
 So I would like to know how to determine which node should be used for
 allocation. Changes of __get_user_pages() to alloc_pages_node() are
 trivial.

There are two separate memory areas here: Your own metadata used by the
allocator and the memory used for skb data.

avl_node_array[cpu] and avl_container_array[cpu] are designed to
be accessed only by the local cpu, so these should be done like

avl_node_array[cpu] = kmalloc_node(AVL_NODE_PAGES * sizeof(void *),
GFP_KERNEL, cpu_to_node(cpu));

or you could make the whole array DEFINE_PER_CPU(void *, which would
waste some space in the kernel object file.

Now for the actual pages you get with __get_free_pages(), doing the
same (alloc_pages_node), will help accessing your avl_container 
members, but may not be the best solution for getting the data
next to the network adapter.

Arnd 

[PATCH] add bcm43xx-d80211 MAINTAINERS entry

2006-08-16 Thread Michael Buesch

Hi John,

Please pull the patch to add Larry as bcm43xx-softmac
maintainer into wireless-dev and _after_ that please apply
the following patch to mark the d80211 branch explicitly.

--

Add MAINTAINERS for bcm43xx-d80211

Signed-off-by: Michael Buesch [EMAIL PROTECTED]

Index: wireless-dev/MAINTAINERS
===
--- wireless-dev.orig/MAINTAINERS   2006-08-16 11:26:18.0 +0200
+++ wireless-dev/MAINTAINERS2006-08-16 11:27:29.0 +0200
@@ -456,6 +456,14 @@
 W: http://www.baycom.org/~tom/ham/ham.html
 S: Maintained
 
+BCM43XX WIRELESS DRIVER (DEVICESCAPE BASED VERSION)
+P: Michael Buesch
+M: [EMAIL PROTECTED]
+P: Stefano Brivio
+M: [EMAIL PROTECTED]
+W: http://bcm43xx.berlios.de/
+S: Maintained
+
 BCM43XX WIRELESS DRIVER (SOFTMAC BASED VERSION)
 P: Larry Finger
 M: [EMAIL PROTECTED]


-- 
Greetings Michael.

Re: [RFC] network namespaces

2006-08-16 Thread Serge E. Hallyn

Quoting Andrey Savochkin ([EMAIL PROTECTED]):
 Hi All,
 
 I'd like to resurrect our discussion about network namespaces.
 In our previous discussions it appeared that we have rather polar concepts
 which seemed hard to reconcile.
 Now I have an idea how to look at all discussed concepts to enable everyone's
 usage scenario.
 
 1. The most straightforward concept is complete separation of namespaces,
covering device list, routing tables, netfilter tables, socket hashes, and
everything else.
 
On input path, each packet is tagged with namespace right from the
place where it appears from a device, and is processed by each layer
in the context of this namespace.
Non-root namespaces communicate with the outside world in two ways: by
owning hardware devices, or receiving packets forwarded to them by their
parent namespace via pass-through device.
 
This complete separation of namespaces is very useful for at least two
purposes:
 - allowing users to create and manage by their own various tunnels and
   VPNs, and
 - enabling easier and more straightforward live migration of groups of
   processes with their environment.

I conceptually prefer this approach, but I seem to recall there were
actual problems in using this for checkpoint/restart of lightweight
(application) containers.  Performance aside, are there any reasons why
this approach would be problematic for c/r?

I'm afraid Daniel may be on vacation, and don't know who else other than
Eric might have thoughts on this.

 2. People expressed concerns that complete separation of namespaces
may introduce an undesired overhead in certain usage scenarios.
The overhead comes from packets traversing input path, then output path,
then input path again in the destination namespace if root namespace
acts as a router.
 
So, we may introduce short-cuts, when input packet starts to be processed
in one namespace, but changes it at some upper layer.
The places where packet can change namespace are, for example:
routing, post-routing netfilter hook, or even lookup in socket hash.
 
The cleanest example among them is post-routing netfilter hook.
Tagging of input packets there means that the packet is checked against
root namespace's routing table, found to be local, and goes directly to
the socket hash lookup in the destination namespace.
In this scheme the ability to change routing tables or netfilter rules on
a per-namespace basis is traded for lower overhead.
 
All other optimized schemes where input packets do not travel
input-output-input paths in general case may be viewed as short-cuts in
scheme (1).  The remaining question is which exactly short-cuts make most
sense, and how to make them consistent from the interface point of view.
 
 My current idea is to reach some agreement on the basic concept, review
 patches, and then move on to implementing feasible short-cuts.
 
 Opinions?
 
 Next in this thread are patches introducing namespaces to device list,
 IPv4 routing, and socket hashes, and a pass-through device.
 Patches are against 2.6.18-rc4-mm1.

Just to provide the extreme other end of implementation options, here is
the bsdjail based version I've been using for some testing while waiting
for network namespaces to show up in -mm  :)

(Not intended for *any* sort of inclusion consideration :)

Example usage:
ifconfig eth0:0 192.168.1.16
echo -n "ip 192.168.1.16" > /proc/$$/attr/exec
exec /bin/sh

-serge

From: Serge E. Hallyn [EMAIL PROTECTED](none)
Date: Wed, 26 Jul 2006 21:47:13 -0500
Subject: [PATCH 1/1] bsdjail: define bsdjail lsm

Define the actual bsdjail LSM.

Signed-off-by: Serge E. Hallyn [EMAIL PROTECTED]
---
 security/Kconfig   |   11 
 security/Makefile  |1 
 security/bsdjail.c | 1351 
 3 files changed, 1363 insertions(+), 0 deletions(-)

diff --git a/security/Kconfig b/security/Kconfig
index 67785df..fa30e40 100644
--- a/security/Kconfig
+++ b/security/Kconfig
@@ -105,6 +105,17 @@ config SECURITY_SECLVL
 
  If you are unsure how to answer this question, answer N.
 
+config SECURITY_BSDJAIL
+   tristate "BSD Jail LSM"
+   depends on SECURITY
+   select SECURITY_NETWORK
+   help
+ Provides BSD Jail compartmentalization functionality.
+ See Documentation/bsdjail.txt for more information and
+ usage instructions.
+
+ If you are unsure how to answer this question, answer N.
+
 source security/selinux/Kconfig
 
 endmenu
diff --git a/security/Makefile b/security/Makefile
index 8cbbf2f..050b588 100644
--- a/security/Makefile
+++ b/security/Makefile
@@ -17,3 +17,4 @@ obj-$(CONFIG_SECURITY_SELINUX)+= selin
 obj-$(CONFIG_SECURITY_CAPABILITIES)+= commoncap.o capability.o
 obj-$(CONFIG_SECURITY_ROOTPLUG)+= commoncap.o root_plug.o
 obj-$(CONFIG_SECURITY_SECLVL)  +=

Re: [PATCH 09/16] [IPv6] address: Convert address notification to use rtnl_notify()

2006-08-16 Thread Herbert Xu

On Wed, Aug 16, 2006 at 01:40:03PM +0200, Thomas Graf wrote:
 
 It was added to help quagga identify which route modifications
 are self caused. It's not possible to use rtm_protocol for this
 purpose as other applications can delete routes added by quagga.

Actually it's not that bad.  I just checked the quagga source and
the stuff it needs was already provided anyway, even before that
change.

In fact, the really bad bits in the changeset have already been
reverted by Alexey back in February :)

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmVHI~} [EMAIL PROTECTED]
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

Re: [PATCH 09/16] [IPv6] address: Convert address notification to use rtnl_notify()

2006-08-16 Thread jamal

On Wed, 2006-16-08 at 21:39 +1000, Herbert Xu wrote:

 So let's step back a bit and think about where does this pid really
 come from.  The field in question is nlmsg_pid.  Its primary purpose
 is to identify unicast transactions along with the field nlmsg_seq.
 It was not designed to identify the origin of a broadcast kernel
 notification to a third party.

There are quite a few things that the netlink design was not
intended to solve that became needed over time. This being one IMHO.
Design intent and eventual (sometimes creative) use occasionally create
an impedance ;-> Evolution is the only description i can think of.

 For this purpose, the value of nlmsg_pid is set to the address of
 the destination socket for a particular unicast message (also known
 as the pid).

Since we are talking history:
The idea of it being a destination socket _was not_ design intent. It
was evolution. I recall James Morris actually to be the first person
whining about this ambiguity when coding up nfqueue. I cant remember who
fixed it (I am inclined to think it was you;-)
 
 That pid in turn has only a vague connection with the process ID
 of the process owning the socket.  For practical purposes, we
 should not treat it as a process ID, since it can easily be claimed by
 another process (think socket + bind + fork).

If you want to be complete the kernel should fix the pid in
netlink::sendmsg().

 For a broadcast notification, the nlmsg_pid field is meaningless
 because the nlmsg_seq field is also meaningless.  

nlmsg_seq is meaningless; seq is again a bad noun. It should be
cookie.

 I'm not denying
 that it wouldn't be useful to have the originator's socket address
 in there.  What I'm saying is that it's the wrong place to put
 that information.

 In any case, putting current->pid in this field is definitely
 a bad idea because it only encourages people to confuse the
 netlink pid with the process ID which can lead to security
 problems later on.

current->pid i think is coming out to be a bad idea. Thomas' patches
revert it out. Again this has everything to do with the original idea
what maps to pid now changing to socketid.

What do you think of the idea of in fact rewriting the pid to be that of
the socket id?

cheers,
jamal


Re: [PATCH 09/16] [IPv6] address: Convert address notification to use rtnl_notify()

2006-08-16 Thread jamal


On Wed, 2006-16-08 at 14:05 +0200, Thomas Graf wrote:

 Right, but he forgot the bits in IPv6 which I now fixed. The
 changeset introducing those current->pid uses was definitely
 simply wrong. I'm not questioning that :)

Herbert, if you look at the thread as well I am no longer questioning
that either ;-

cheers,
jamal

PS:- Would a topic of things i wish netlink did better be of interest
for discussion (maybe for netconf)? (Un)fortunately, we are fixing some
of them with genetlink;- 


Re: [PATCH 09/16] [IPv6] address: Convert address notification to use rtnl_notify()

2006-08-16 Thread Thomas Graf

* Herbert Xu [EMAIL PROTECTED] 2006-08-16 21:57
 On Wed, Aug 16, 2006 at 01:40:03PM +0200, Thomas Graf wrote:
  
  It was added to help quagga identify which route modifications
  are self caused. It's not possible to use rtm_protocol for this
  purpose as other applications can delete routes added by quagga.
 
 Actually it's not that bad.  I just checked the quagga source and
 the stuff it needs was already provided anyway, even before that
 change.

If I recall correctly the quagga folks asked to get the same
behaviour for IPv6 routes as it was already done for IPv4
around the time of that bogus changeset.


 In fact, the really bad bits in the changeset have already been
 reverted by Alexey back in February :)

Right, but he forgot the bits in IPv6 which I now fixed. The
changeset introducing those current->pid uses was definitely
simply wrong. I'm not questioning that :)

Re: [PATCH 09/16] [IPv6] address: Convert address notification to use rtnl_notify()

2006-08-16 Thread Thomas Graf

* jamal [EMAIL PROTECTED] 2006-08-16 08:04
 current->pid i think is coming out to be a bad idea. Thomas' patches
 revert it out. Again this has everything to do with the original idea
 what maps to pid now changing to socketid.

It probably developed from autobind using current->tid.

Re: [PATCH 09/16] [IPv6] address: Convert address notification to use rtnl_notify()

2006-08-16 Thread Herbert Xu

On Wed, Aug 16, 2006 at 08:04:24AM -0400, jamal wrote:
 
 What do you think of the idea of in fact rewriting the pid to be that of
 the socket id?

Rewriting it with the netlink socket address? That's fine by me as
long as there is a clear 1-to-1 relationship between the request
and the notification.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmVHI~} [EMAIL PROTECTED]
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

Re: [PATCH 09/16] [IPv6] address: Convert address notification to use rtnl_notify()

2006-08-16 Thread Thomas Graf

* Herbert Xu [EMAIL PROTECTED] 2006-08-16 21:39
 For a broadcast notification, the nlmsg_pid field is meaningless
 because the nlmsg_seq field is also meaningless.  I'm not denying
 that it wouldn't be useful to have the originator's socket address
 in there.  What I'm saying is that it's the wrong place to put
 that information.

It might not be the best place to put it considering the original
intent of nlmsg_pid as you explained correctly. However, as you
state yourself, the nlmsg_pid field is meaningless/unused for
notifications so extending the definition of nlmsg_pid to have
a special meaning for broadcasts doesn't harm anyone.

When nlmsg_seq is set to the seq of the request, it becomes
meaningful together with nlmsg_pid, as applications can then easily
assign notifications to their own sent requests.

Secondly we already have applications depending on this whereas
the eventual breaking of applications depending on nlmsg_pid == 0
is uncertain.

Re: [PATCH 09/16] [IPv6] address: Convert address notification to use rtnl_notify()

2006-08-16 Thread jamal

On Wed, 2006-16-08 at 14:08 +0200, Thomas Graf wrote:
 * jamal [EMAIL PROTECTED] 2006-08-16 08:04
  current->pid i think is coming out to be a bad idea. Thomas' patches
  revert it out. Again this has everything to do with the original idea
  what maps to pid now changing to socketid.
 
 It probably developed from autobind using current->tid.

In one conversation with Alexey he told me there was some inspiration
from pfkey in the semantics of it i.e processid. Obviously with many
sockets on the same process etc, that assumption is no longer valid. 

On Wed, 2006-16-08 at 22:08 +1000, Herbert Xu wrote: 
 On Wed, Aug 16, 2006 at 08:04:24AM -0400, jamal wrote:
  
  What do you think of the idea of in fact rewriting the pid to be that of
  the socket id?
 
 Rewriting it with the netlink socket address? That's fine by me as
 long as there is a clear 1-to-1 relationship between the request
 and the notification.

You would have to call getpeername() to get a correct 1-1 mapping, as is
the case today when in doubt.
What i was suggesting is notifications using the pid that would id the
socket, and would therefore require a getpeername() to identify the real
socket it came from; if you are fine with what Thomas is doing, then this is
unnecessary, since i was suggesting it as a compromise for the consistency
you pointed out was lacking.

cheers,
jamal


Re: [PATCH 09/16] [IPv6] address: Convert address notification to use rtnl_notify()

2006-08-16 Thread Alexey Kuznetsov

Hello!

 The netlink header pid is really akin to sadb_msg_pid from RFC 2367.
 IMHO it should always be zero if the kernel is the originator of the
 message.

No. Analogue of sadb_msg_pid is nladdr.nl_pid.


Netlink header pid is not originator of the message, but author of
the change. The notion is ambiguous by definition, and the field
is a little ambiguous.

If the message is a plain ack or part of a dump, it is obviously
pid of requestor. But if it is notification about change, it can be
nl_pid of socket, which requested the operation, but may be 0.
Of course, all the 0s sent only because I was lazy to track authorship,
should be eliminated.

Alexey

[rt2500usb] link led weirdness

2006-08-16 Thread Johannes Berg

Hey,

I just noticed that my rt2500usb device turns on the link LED when I
just add an active monitor interface. I can't imagine that being on
purpose, but I'm not sure based on what it is controlled.

johannes

Re: [PATCH 09/16] [IPv6] address: Convert address notification to use rtnl_notify()

2006-08-16 Thread Alexey Kuznetsov

Hello!

 In one conversation with Alexey he told me there was some inspiration
 from pfkey in the semantics of it i.e processid.

Inspiration, but not a copy. :-)

Unlike pfkeyv2 it uses addressing usual for networking i.e.
struct sockaddr_nl.

Alexey

Re: [PATCH 09/16] [IPv6] address: Convert address notification to use rtnl_notify()

2006-08-16 Thread jamal

On Wed, 2006-16-08 at 17:04 +0400, Alexey Kuznetsov wrote:
 Hello!
 
  In one conversation with Alexey he told me there was some inspiration
  from pfkey in the semantics of it i.e processid.
 
 Inspiration, but not a copy. :-)

Oh, absolutely. Netlink is way superior. I should have said perspiration
instead of inspiration;- Calling inspiration was being polite - it is
as being polite as saying i was being economical with the truth[1] ;-

 Unlike pfkeyv2 it uses addressing usual for networking i.e.
 struct sockaddr_nl.

I think this needs to be captured somewhere. I dont know who is
maintaining the man pages these days.

cheers,
jamal

[1] A term i learnt from some British guy. They have ways with words
those Brits.


Re: [PATCH][IPSEC]: Aggregate make_jiffies

2006-08-16 Thread jamal

On Tue, 2006-15-08 at 22:59 -0400, jamal wrote:

   How about moving
  it to linux/jiffies.h and rewrite in the style of msec_to_jiffies?
  
 
 Is there something other than the boundary check already done you
 foresee being made? If yes, do you wanna take a crack at it?

Herbert, I actually dont know the answer that is why i am punting it to
you;- I would just move the whole thing to linux/jiffies.h as is but
you seem to suggest there may be other boundary checks. If yes, go at
it ;-

cheers,
jamal
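
To make the "boundary check" under discussion concrete, here is a minimal
userspace sketch of a seconds-to-ticks conversion that clamps instead of
overflowing. It is not the helper from the patch: SKETCH_HZ and
SKETCH_MAX_TIMEOUT are assumed stand-ins for the kernel's HZ and
MAX_SCHEDULE_TIMEOUT, and secs_to_ticks() is a hypothetical name.

```c
#include <assert.h>
#include <limits.h>

#define SKETCH_HZ 250L               /* assumed tick rate, stand-in for HZ */
#define SKETCH_MAX_TIMEOUT LONG_MAX  /* stand-in for MAX_SCHEDULE_TIMEOUT */

/*
 * Convert seconds to ticks. When the multiplication would overflow or
 * reach the "wait forever" value, clamp to SKETCH_MAX_TIMEOUT - 1.
 */
unsigned long secs_to_ticks(long secs)
{
    assert(secs >= 0);
    if (secs >= (SKETCH_MAX_TIMEOUT - 1) / SKETCH_HZ)
        return (unsigned long)(SKETCH_MAX_TIMEOUT - 1);
    return (unsigned long)(secs * SKETCH_HZ);
}
```

Any further checks a shared version in linux/jiffies.h might need are left
open here, as they are in the thread.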





[take10 2/2] kevent: poll/select() notifications. Timer notifications.

2006-08-16 Thread Evgeniy Polyakov


poll/select() notifications. Timer notifications.

This patch includes generic poll/select and timer notifications.

kevent_poll works similarly to epoll and has the same issues (the callback
is invoked not from the internal state machine of the caller, but through
process wakeup).

Timer notifications can be used for fine grained per-process time 
management, since interval timers are very inconvenient to use, 
and they are limited.
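
For comparison with the epoll model that kevent_poll is said to resemble, the
following is a minimal userspace sketch using the existing epoll API (it does
not use kevent). wait_readable() is a hypothetical helper name chosen for this
illustration. The caller sleeps in epoll_wait() and is woken up when the
descriptor becomes readable, which is the "process wakeup" path mentioned
above.

```c
#include <assert.h>
#include <sys/epoll.h>
#include <unistd.h>

/*
 * Return 1 if fd becomes readable within timeout_ms, 0 on timeout,
 * -1 on error. The caller is woken up and then inspects the event.
 */
int wait_readable(int fd, int timeout_ms)
{
    struct epoll_event ev = { .events = EPOLLIN }, out;
    int ep, n;

    assert(fd >= 0);
    ep = epoll_create(1);
    if (ep < 0)
        return -1;
    ev.data.fd = fd;
    if (epoll_ctl(ep, EPOLL_CTL_ADD, fd, &ev) < 0) {
        close(ep);
        return -1;
    }
    n = epoll_wait(ep, &out, 1, timeout_ms);
    close(ep);
    if (n < 0)
        return -1;
    return (n == 1 && (out.events & EPOLLIN)) ? 1 : 0;
}
```

With a pipe, the read end reports a timeout while the pipe is empty and
becomes readable once a byte has been written to the other end.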

Signed-off-by: Evgeniy Polyakov [EMAIL PROTECTED]

diff --git a/kernel/kevent/kevent_poll.c b/kernel/kevent/kevent_poll.c
new file mode 100644
index 000..8a4f863
--- /dev/null
+++ b/kernel/kevent/kevent_poll.c
@@ -0,0 +1,220 @@
+/*
+ * kevent_poll.c
+ * 
+ * 2006 Copyright (c) Evgeniy Polyakov [EMAIL PROTECTED]
+ * All rights reserved.
+ * 
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ */
+
+#include <linux/kernel.h>
+#include <linux/types.h>
+#include <linux/list.h>
+#include <linux/slab.h>
+#include <linux/spinlock.h>
+#include <linux/timer.h>
+#include <linux/file.h>
+#include <linux/kevent.h>
+#include <linux/poll.h>
+#include <linux/fs.h>
+
+static kmem_cache_t *kevent_poll_container_cache;
+static kmem_cache_t *kevent_poll_priv_cache;
+
+struct kevent_poll_ctl
+{
+	struct poll_table_struct	pt;
+	struct kevent			*k;
+};
+
+struct kevent_poll_wait_container
+{
+	struct list_head		container_entry;
+	wait_queue_head_t		*whead;
+	wait_queue_t			wait;
+	struct kevent			*k;
+};
+
+struct kevent_poll_private
+{
+	struct list_head		container_list;
+	spinlock_t			container_lock;
+};
+
+static int kevent_poll_enqueue(struct kevent *k);
+static int kevent_poll_dequeue(struct kevent *k);
+static int kevent_poll_callback(struct kevent *k);
+
+static int kevent_poll_wait_callback(wait_queue_t *wait,
+		unsigned mode, int sync, void *key)
+{
+	struct kevent_poll_wait_container *cont =
+		container_of(wait, struct kevent_poll_wait_container, wait);
+	struct kevent *k = cont->k;
+	struct file *file = k->st->origin;
+	u32 revents;
+
+	revents = file->f_op->poll(file, NULL);
+
+	kevent_storage_ready(k->st, NULL, revents);
+
+	return 0;
+}
+
+static void kevent_poll_qproc(struct file *file, wait_queue_head_t *whead,
+		struct poll_table_struct *poll_table)
+{
+	struct kevent *k =
+		container_of(poll_table, struct kevent_poll_ctl, pt)->k;
+	struct kevent_poll_private *priv = k->priv;
+	struct kevent_poll_wait_container *cont;
+	unsigned long flags;
+
+	cont = kmem_cache_alloc(kevent_poll_container_cache, SLAB_KERNEL);
+	if (!cont) {
+		kevent_break(k);
+		return;
+	}
+
+	cont->k = k;
+	init_waitqueue_func_entry(&cont->wait, kevent_poll_wait_callback);
+	cont->whead = whead;
+
+	spin_lock_irqsave(&priv->container_lock, flags);
+	list_add_tail(&cont->container_entry, &priv->container_list);
+	spin_unlock_irqrestore(&priv->container_lock, flags);
+
+	add_wait_queue(whead, &cont->wait);
+}
+
+static int kevent_poll_enqueue(struct kevent *k)
+{
+	struct file *file;
+	int err, ready = 0;
+	unsigned int revents;
+	struct kevent_poll_ctl ctl;
+	struct kevent_poll_private *priv;
+
+	file = fget(k->event.id.raw[0]);
+	if (!file)
+		return -ENODEV;
+
+	err = -EINVAL;
+	if (!file->f_op || !file->f_op->poll)
+		goto err_out_fput;
+
+	err = -ENOMEM;
+	priv = kmem_cache_alloc(kevent_poll_priv_cache, SLAB_KERNEL);
+	if (!priv)
+		goto err_out_fput;
+
+	spin_lock_init(&priv->container_lock);
+	INIT_LIST_HEAD(&priv->container_list);
+
+	k->priv = priv;
+
+	ctl.k = k;
+	init_poll_funcptr(&ctl.pt, kevent_poll_qproc);
+
+	err = kevent_storage_enqueue(&file->st, k);
+	if (err)
+		goto err_out_free;
+
+	revents = file->f_op->poll(file, &ctl.pt);
+	if (revents & k->event.event) {
+		ready = 1;
+		kevent_poll_dequeue(k);
+	}
+
+	return ready;
+
+err_out_free:
+	kmem_cache_free(kevent_poll_priv_cache, priv);
+err_out_fput:
+	fput(file);
+	return err;
+}
+
+static int kevent_poll_dequeue(struct kevent *k)
+{
+	struct file *file = k->st->origin;
+	struct kevent_poll_private *priv = k->priv;
+

[take10 1/2] kevent: Core files.

2006-08-16 Thread Evgeniy Polyakov


Core files.

This patch includes core kevent files:
 - userspace controlling
 - kernelspace interfaces
 - initialization
 - notification state machines

Signed-off-by: Evgeniy Polyakov [EMAIL PROTECTED]

diff --git a/arch/i386/kernel/syscall_table.S b/arch/i386/kernel/syscall_table.S
index dd63d47..091ff42 100644
--- a/arch/i386/kernel/syscall_table.S
+++ b/arch/i386/kernel/syscall_table.S
@@ -317,3 +317,5 @@ ENTRY(sys_call_table)
 	.long sys_tee			/* 315 */
 	.long sys_vmsplice
 	.long sys_move_pages
+	.long sys_kevent_get_events
+	.long sys_kevent_ctl
diff --git a/arch/x86_64/ia32/ia32entry.S b/arch/x86_64/ia32/ia32entry.S
index 5d4a7d1..b2af4a8 100644
--- a/arch/x86_64/ia32/ia32entry.S
+++ b/arch/x86_64/ia32/ia32entry.S
@@ -713,4 +713,6 @@ #endif
 	.quad sys_tee
 	.quad compat_sys_vmsplice
 	.quad compat_sys_move_pages
+	.quad sys_kevent_get_events
+	.quad sys_kevent_ctl
 ia32_syscall_end:  
diff --git a/include/asm-i386/unistd.h b/include/asm-i386/unistd.h
index fc1c8dd..c9dde13 100644
--- a/include/asm-i386/unistd.h
+++ b/include/asm-i386/unistd.h
@@ -323,10 +323,12 @@ #define __NR_sync_file_range  314
 #define __NR_tee		315
 #define __NR_vmsplice		316
 #define __NR_move_pages		317
+#define __NR_kevent_get_events	318
+#define __NR_kevent_ctl		319
 
 #ifdef __KERNEL__
 
-#define NR_syscalls 318
+#define NR_syscalls 320
 
 /*
  * user-visible error numbers are in the range -1 - -128: see
diff --git a/include/asm-x86_64/unistd.h b/include/asm-x86_64/unistd.h
index 94387c9..61363e0 100644
--- a/include/asm-x86_64/unistd.h
+++ b/include/asm-x86_64/unistd.h
@@ -619,10 +619,14 @@ #define __NR_vmsplice 278
 __SYSCALL(__NR_vmsplice, sys_vmsplice)
 #define __NR_move_pages		279
 __SYSCALL(__NR_move_pages, sys_move_pages)
+#define __NR_kevent_get_events	280
+__SYSCALL(__NR_kevent_get_events, sys_kevent_get_events)
+#define __NR_kevent_ctl		281
+__SYSCALL(__NR_kevent_ctl, sys_kevent_ctl)
 
 #ifdef __KERNEL__
 
-#define __NR_syscall_max __NR_move_pages
+#define __NR_syscall_max __NR_kevent_ctl
 
 #ifndef __NO_STUBS
 
diff --git a/include/linux/kevent.h b/include/linux/kevent.h
new file mode 100644
index 000..03a
--- /dev/null
+++ b/include/linux/kevent.h
@@ -0,0 +1,310 @@
+/*
+ * kevent.h
+ * 
+ * 2006 Copyright (c) Evgeniy Polyakov [EMAIL PROTECTED]
+ * All rights reserved.
+ * 
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
+ */
+
+#ifndef __KEVENT_H
+#define __KEVENT_H
+
+/*
+ * Kevent request flags.
+ */
+
+#define KEVENT_REQ_ONESHOT	0x1		/* Process this event only once and then dequeue. */
+
+/*
+ * Kevent return flags.
+ */
+#define KEVENT_RET_BROKEN	0x1		/* Kevent is broken. */
+#define KEVENT_RET_DONE		0x2		/* Kevent processing was finished successfully. */
+
+/*
+ * Kevent type set.
+ */
+#define KEVENT_SOCKET		0
+#define KEVENT_INODE		1
+#define KEVENT_TIMER		2
+#define KEVENT_POLL		3
+#define KEVENT_NAIO		4
+#define KEVENT_AIO		5
+#define	KEVENT_MAX		6
+
+/*
+ * Per-type event sets.
+ * Number of per-event sets should be exactly as number of kevent types.
+ */
+
+/*
+ * Timer events.
+ */
+#define	KEVENT_TIMER_FIRED	0x1
+
+/*
+ * Socket/network asynchronous IO events.
+ */
+#define	KEVENT_SOCKET_RECV	0x1
+#define	KEVENT_SOCKET_ACCEPT	0x2
+#define	KEVENT_SOCKET_SEND	0x4
+
+/*
+ * Inode events.
+ */
+#define	KEVENT_INODE_CREATE	0x1
+#define	KEVENT_INODE_REMOVE	0x2
+
+/*
+ * Poll events.
+ */
+#define	KEVENT_POLL_POLLIN	0x0001
+#define	KEVENT_POLL_POLLPRI	0x0002
+#define	KEVENT_POLL_POLLOUT	0x0004
+#define	KEVENT_POLL_POLLERR	0x0008
+#define	KEVENT_POLL_POLLHUP	0x0010
+#define	KEVENT_POLL_POLLNVAL	0x0020
+
+#define	KEVENT_POLL_POLLRDNORM	0x0040
+#define	KEVENT_POLL_POLLRDBAND	0x0080
+#define	KEVENT_POLL_POLLWRNORM	0x0100
+#define	KEVENT_POLL_POLLWRBAND	0x0200
+#define	KEVENT_POLL_POLLMSG	0x0400
+#define	KEVENT_POLL_POLLREMOVE	0x1000
+

Re: [PATCH 2.6.17] net/ipv6/udp.c: remove duplicate udp_get_port code

2006-08-16 Thread gerrit

Hi Yoshifuji, 

|   +#if defined(CONFIG_IPV6) || defined(CONFIG_IPV6_MODULE)
|   +  else if(sk->sk_family == PF_INET6 &&
|   +  ipv6_rcv_saddr_equal(sk, sk2) )
|   +  goto fail;
|   +  }
|   +#endif
|  
|  This is not good because you cannot link ipv6_rcv_saddr_equal()
|  if you are compiling IPv6 as module.
Yes and the second ugliness was that ipv4/udp.c suddenly had to include 
net/addrconf.h. 

|  How about retaining udp_v{4,6}_get_port() and call
|  common udp_get_port() from both functions?
I enclose a realisation below - do you think that is better? 
Tested both IPv6 as module and as `y', double-checked all changes. 
Thank you for reviewing and comments.
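
The shape of that suggestion, reduced to a userspace sketch: one common
lookup routine takes an address-comparison callback, and thin per-family
wrappers pass in their own comparison. Everything named demo_* below is a
hypothetical stand-in made up for illustration (including the flat array in
place of the hash table); only the idea of a shared udp_get_port() with a
saddr_cmp argument comes from the patch.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for struct sock: only what the sketch needs. */
struct demo_sock {
	unsigned short port;
	unsigned char addr[16];	/* IPv4 uses the first 4 bytes */
	bool any_addr;		/* bound to the wildcard address */
};

typedef bool (*demo_saddr_cmp)(const struct demo_sock *,
			       const struct demo_sock *);

/* Common part: returns 1 if the port conflicts with a bound socket, else 0. */
static int demo_get_port(const struct demo_sock *sk,
			 const struct demo_sock *bound, size_t n,
			 demo_saddr_cmp saddr_cmp)
{
	for (size_t i = 0; i < n; i++)
		if (bound[i].port == sk->port && saddr_cmp(sk, &bound[i]))
			return 1;
	return 0;
}

/* Family-specific comparisons: wildcard binds conflict with everything. */
static bool demo_v4_saddr_equal(const struct demo_sock *a,
				const struct demo_sock *b)
{
	return a->any_addr || b->any_addr || !memcmp(a->addr, b->addr, 4);
}

static bool demo_v6_saddr_equal(const struct demo_sock *a,
				const struct demo_sock *b)
{
	return a->any_addr || b->any_addr || !memcmp(a->addr, b->addr, 16);
}

/* Thin wrappers, one per address family, sharing the common routine. */
int demo_v4_get_port(const struct demo_sock *sk,
		     const struct demo_sock *bound, size_t n)
{
	return demo_get_port(sk, bound, n, demo_v4_saddr_equal);
}

int demo_v6_get_port(const struct demo_sock *sk,
		     const struct demo_sock *bound, size_t n)
{
	return demo_get_port(sk, bound, n, demo_v6_saddr_equal);
}
```

Because the comparison is handed in by the caller, the common routine never
references an IPv6 symbol, which is what makes this layout compatible with
IPv6 built as a module.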

Signed-off-by: Gerrit Renker [EMAIL PROTECTED]
---

 include/net/udp.h |   18 +-
 net/ipv4/udp.c|   95 ++
 net/ipv6/udp.c|   76 +--
 3 files changed, 64 insertions(+), 125 deletions(-)


diff --git a/include/net/udp.h b/include/net/udp.h
index 766fba1..c490a0f 100644
--- a/include/net/udp.h
+++ b/include/net/udp.h
@@ -30,25 +30,9 @@ #include <linux/seq_file.h>
 
 #define UDP_HTABLE_SIZE		128
 
-/* udp.c: This needs to be shared by v4 and v6 because the lookup
- *        and hashing code needs to work with different AF's yet
- *        the port space is shared.
- */
 extern struct hlist_head udp_hash[UDP_HTABLE_SIZE];
 extern rwlock_t udp_hash_lock;
 
-extern int udp_port_rover;
-
-static inline int udp_lport_inuse(u16 num)
-{
-	struct sock *sk;
-	struct hlist_node *node;
-
-	sk_for_each(sk, node, &udp_hash[num & (UDP_HTABLE_SIZE - 1)])
-		if (inet_sk(sk)->num == num)
-			return 1;
-	return 0;
-}
 
 /* Note: this must match 'valbool' in sock_setsockopt */
 #define UDP_CSUM_NOXMIT		1
@@ -63,6 +47,8 @@ extern struct proto udp_prot;
 
 struct sk_buff;
 
+extern int	udp_get_port(struct sock *sk, unsigned short snum,
+			     int (*saddr_cmp)(struct sock *, struct sock *));
 extern void	udp_err(struct sk_buff *, u32);
 
 extern int	udp_sendmsg(struct kiocb *iocb, struct sock *sk,
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 3f93292..c5ee645 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -119,14 +119,34 @@ DEFINE_SNMP_STAT(struct udp_mib, udp_sta
 struct hlist_head udp_hash[UDP_HTABLE_SIZE];
 DEFINE_RWLOCK(udp_hash_lock);
 
-/* Shared by v4/v6 udp. */
+/* Shared by v4/v6 udp_get_port */
 int udp_port_rover;
 
-static int udp_v4_get_port(struct sock *sk, unsigned short snum)
+static inline int udp_lport_inuse(u16 num)
 {
+	struct sock *sk;
 	struct hlist_node *node;
+
+	sk_for_each(sk, node, &udp_hash[num & (UDP_HTABLE_SIZE - 1)])
+		if (inet_sk(sk)->num == num)
+			return 1;
+	return 0;
+}
+
+/**
+ *  udp_get_port  -  common port lookup for IPv4 and IPv6
+ *
+ *  @sk:          socket struct in question
+ *  @snum:        port number to look up
+ *  @saddr_cmp:   AF-dependent comparison of bound local IP addresses
+ */
+int udp_get_port(struct sock *sk, unsigned short snum,
+		 int (*saddr_cmp)(struct sock *sk1, struct sock *sk2))
+{
+	struct hlist_node *node;
+	struct hlist_head *head;
 	struct sock *sk2;
-	struct inet_sock *inet = inet_sk(sk);
+	int    error = 1;
 
 	write_lock_bh(&udp_hash_lock);
 	if (snum == 0) {
@@ -138,11 +158,10 @@ static int udp_v4_get_port(struct sock *
 		best_size_so_far = 32767;
 		best = result = udp_port_rover;
 		for (i = 0; i < UDP_HTABLE_SIZE; i++, result++) {
-			struct hlist_head *list;
 			int size;
 
-			list = &udp_hash[result & (UDP_HTABLE_SIZE - 1)];
-			if (hlist_empty(list)) {
+			head = &udp_hash[result & (UDP_HTABLE_SIZE - 1)];
+			if (hlist_empty(head)) {
 				if (result > sysctl_local_port_range[1])
 					result = sysctl_local_port_range[0] +
 						((result - sysctl_local_port_range[0]) &
@@ -150,12 +169,11 @@ static int udp_v4_get_port(struct sock *
 				goto gotit;
 			}
 			size = 0;
-			sk_for_each(sk2, node, list)
-				if (++size >= best_size_so_far)
-					goto next;
-			best_size_so_far = size;
-			best = result;
-		next:;
+			sk_for_each(sk2, node, head)
+				if (++size < best_size_so_far) {
+					best_size_so_far = size;
+

Re: [take9 0/2] kevent: Generic event handling mechanism.

2006-08-16 Thread Christoph Hellwig

On Mon, Aug 14, 2006 at 10:21:36AM +0400, Evgeniy Polyakov wrote:
>
> Generic event handling mechanism.

Hi, I've just started looking into this, so some comments here first
on the submission process:

 - could you send new revisions of the patches in a new thread so one can
   easily find them?
 - the patch split is not very nice, your first patch adds Makefile and
   Kconfig entries for files only in the second patch or not actually
   submitted at all, that's a big no-no.


Re: [take9 2/2] kevent: poll/select() notifications. Timer notifications.

2006-08-16 Thread Christoph Hellwig

On Mon, Aug 14, 2006 at 10:21:36AM +0400, Evgeniy Polyakov wrote:
>
> poll/select() notifications. Timer notifications.
>
> This patch includes generic poll/select and timer notifications.
>
> kevent_poll works similar to epoll and has the same issues (callback
> is invoked not from internal state machine of the caller, but through
> process awake).

I'm not a big fan of duplicating code over and over.  kevent is a candidate
for a generic event delivery mechanism, which is a _very_ good thing.  But
starting that system by duplicating existing functionality is not very nice.

What speaks against a patch that replaces the epoll core with something that
builds on kevent while still supporting the epoll interface as a compatibility
shim?

> Timer notifications can be used for fine grained per-process time
> management, since interval timers are very inconvenient to use,
> and they are limited.

I have similar reservations about this one.  Having timers as part of a
generic events system is very nice, but having so much duplicated functionality
is not.  Cc'ed Thomas on behalf of the Timer cabal if there's a point in
integrating this into a larger framework of timer code.


Also it would be nice if you could submit each of the notifications as a patch
on its own.


Re: bonding: cannot remove certain named devices

2006-08-16 Thread Bill Nottingham

Giacomo A. Catenazzi ([EMAIL PROTECTED]) said: 
> > Are you willing to work to add the special case code necessary to
> > handle whitespace characters in the device name over all of the kernel
> > code and also all of the userland tools too?
>
> But if you don't handle spaces in userspace, you handle *, ?, [, ], $,
> , ', \  in userspace? Should kernel disable also these (insane device
> chars) chars?

Don't forget unicode characters!

Seriously, while it might be insane to use some of these, I'm wondering
if trying to filter names is more work than fixing the tools.

Bill

Re: [take9 0/2] kevent: Generic event handling mechanism.

2006-08-16 Thread Evgeniy Polyakov

On Wed, Aug 16, 2006 at 02:26:31PM +0100, Christoph Hellwig ([EMAIL PROTECTED]) wrote:
> On Mon, Aug 14, 2006 at 10:21:36AM +0400, Evgeniy Polyakov wrote:
> >
> > Generic event handling mechanism.
>
> Hi, I've just started looking into this, so some comments here first
> on the submission process:
>
>  - could you send new revisions of the patches in a new thread so one can
>    easily find them?

Ok.

>  - the patch split is not very nice, your first patch adds Makefile and
>    Kconfig entries for files only in the second patch or not actually
>    submitted at all, that's a big no-no.

The patches are generated by scripts from the list of files produced by
git-diff, but I can reorder them as follows:
core files
poll/select
timer
any other
main Kconfig/Makefile

Kevent's Makefile still contains entries for files that are added later;
is that a big problem right now?
I can split the patches manually, but it would be much better to do that
once a decision about inclusion has been made. Until the review and
feature addition process is complete, I would rather generate the
patches as is...

-- 
Evgeniy Polyakov

Re: [take9 2/2] kevent: poll/select() notifications. Timer notifications.

2006-08-16 Thread Evgeniy Polyakov

On Wed, Aug 16, 2006 at 02:30:14PM +0100, Christoph Hellwig ([EMAIL PROTECTED]) wrote:
> On Mon, Aug 14, 2006 at 10:21:36AM +0400, Evgeniy Polyakov wrote:
> >
> > poll/select() notifications. Timer notifications.
> >
> > This patch includes generic poll/select and timer notifications.
> >
> > kevent_poll works similar to epoll and has the same issues (callback
> > is invoked not from internal state machine of the caller, but through
> > process awake).
>
> I'm not a big fan of duplicating code over and over.  kevent is a candidate
> for a generic event delivery mechanism, which is a _very_ good thing.  But
> starting that system by duplicating existing functionality is not very nice.
>
> What speaks against a patch that replaces the epoll core with something that
> builds on kevent while still supporting the epoll interface as a compatibility
> shim?

There is no problem from my side, but epoll and kevent_poll differ in
some aspects, so it may be better not to replace epoll for a while.

> > Timer notifications can be used for fine grained per-process time
> > management, since interval timers are very inconvenient to use,
> > and they are limited.
>
> I have similar reservations about this one.  Having timers as part of a
> generic events system is very nice, but having so much duplicated functionality
> is not.  Cc'ed Thomas on behalf of the Timer cabal if there's a point in
> integrating this into a larger framework of timer code.
>
>
> Also it would be nice if you could submit each of the notifications as a patch
> on its own.

Ok.

-- 
Evgeniy Polyakov

Re: [take9 1/2] kevent: Core files.

2006-08-16 Thread Christoph Hellwig

> diff --git a/include/linux/kevent.h b/include/linux/kevent.h
> new file mode 100644
> index 000..03a
> --- /dev/null
> +++ b/include/linux/kevent.h
> @@ -0,0 +1,310 @@
> +/*
> + * 	kevent.h

Please don't put filenames in the top-of-file block comments.  They're
redundant and, as history shows, out of date far too often.

> +#ifdef __KERNEL__

Please split the user/kernel ABI and kernel implementation details into
two different headers.  That way we don't have to run unifdef as part of
the user headers generation process, and it's much clearer which bits are
kernel implementation details and which are the public ABI.

> +#define KEVENT_READY		0x1
> +#define KEVENT_STORAGE		0x2
> +#define KEVENT_USER		0x4

Please use enums here.
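
A minimal sketch of what that could look like, assuming the three values
quoted above are independent flag bits. The enum tag kevent_flags and the
helper kevent_flag_set() are made-up names for illustration, not taken from
the patch.

```c
/* The flag values from the quoted header, expressed as an enum. */
enum kevent_flags {
	KEVENT_READY	= 0x1,
	KEVENT_STORAGE	= 0x2,
	KEVENT_USER	= 0x4,
};

/* Hypothetical helper: returns 1 if flag f is set in the flags word. */
static inline int kevent_flag_set(unsigned int flags, enum kevent_flags f)
{
	return (flags & f) != 0;
}
```

Compared with the #define form, the enumerators are visible to a debugger
and the values are grouped under one type name.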

> +	void			*priv;		/* Private data for different storages.
> +						 * poll()/select storage has a list of wait_queue_t containers
> +						 * for each ->poll() { poll_wait()' } here.
> +						 */

Please try to avoid spilling over the 80 chars limit.  In this case it's
easy, just put the comment before the field being documented.

> +extern struct kevent_callbacks kevent_registered_callbacks[];

Having global arrays is not very nice.  Any chance this could be hidden
behind proper accessor functions?
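
One possible shape for such accessors, as a self-contained userspace sketch:
the table becomes file-local and callers go through two small functions that
do the bounds checking. All demo_* names and the void * argument type are
assumptions for illustration; only the idea of a per-type table of
callback/enqueue/dequeue pointers and the number of types (KEVENT_MAX is 6
in the posted header) come from the patch.

```c
#include <errno.h>
#include <stddef.h>

#define DEMO_KEVENT_MAX 6	/* mirrors KEVENT_MAX in the posted header */

/* Hypothetical stand-in for struct kevent_callbacks. */
struct demo_kevent_callbacks {
	int (*callback)(void *k);
	int (*enqueue)(void *k);
	int (*dequeue)(void *k);
};

/* File-local: nothing outside this file can index the table directly. */
static struct demo_kevent_callbacks demo_callbacks[DEMO_KEVENT_MAX];

/* Register the callbacks for one event type; rejects out-of-range types. */
int demo_kevent_add_callbacks(const struct demo_kevent_callbacks *cb,
			      unsigned int type)
{
	if (!cb || type >= DEMO_KEVENT_MAX)
		return -EINVAL;
	demo_callbacks[type] = *cb;
	return 0;
}

/* Look up the callbacks for one event type; NULL for out-of-range types. */
const struct demo_kevent_callbacks *demo_kevent_get_callbacks(unsigned int type)
{
	if (type >= DEMO_KEVENT_MAX)
		return NULL;
	return &demo_callbacks[type];
}
```

With this layout the range check lives in one place instead of at every
site that indexes the array.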

> +#ifdef CONFIG_KEVENT_INODE
> +void kevent_inode_notify(struct inode *inode, u32 event);
> +void kevent_inode_notify_parent(struct dentry *dentry, u32 event);
> +void kevent_inode_remove(struct inode *inode);
> +#else
> +static inline void kevent_inode_notify(struct inode *inode, u32 event)
> +{
> +}
> +static inline void kevent_inode_notify_parent(struct dentry *dentry, u32 event)
> +{
> +}
> +static inline void kevent_inode_remove(struct inode *inode)
> +{
> +}
> +#endif /* CONFIG_KEVENT_INODE */

The code implementing these prototypes doesn't exist.

> +#ifdef CONFIG_KEVENT_SOCKET
> +#ifdef CONFIG_LOCKDEP
> +void kevent_socket_reinit(struct socket *sock);
> +void kevent_sk_reinit(struct sock *sk);
> +#else
> +static inline void kevent_socket_reinit(struct socket *sock)
> +{
> +}
> +static inline void kevent_sk_reinit(struct sock *sk)
> +{
> +}
> +#endif

Ditto.  Please clean the header from all this dead code.

> +int kevent_storage_init(void *origin, struct kevent_storage *st)
> +{
> +	spin_lock_init(&st->lock);
> +	st->origin = origin;
> +	INIT_LIST_HEAD(&st->list);
> +	return 0;
> +}

Why does this need a return value?

> +int kevent_sys_init(void)
> +{
> +	int i;
> +
> +	kevent_cache = kmem_cache_create("kevent_cache",
> +			sizeof(struct kevent), 0, SLAB_PANIC, NULL, NULL);
> +
> +	for (i=0; i<ARRAY_SIZE(kevent_registered_callbacks); ++i) {
> +		struct kevent_callbacks *c = &kevent_registered_callbacks[i];
> +
> +		c->callback = c->enqueue = c->dequeue = NULL;
> +	}
> +
> +	return 0;
> +}

Please make this an initcall in this file and make sure it's linked before
kevent_users.c


> +static int kevent_user_open(struct inode *, struct file *);
> +static int kevent_user_release(struct inode *, struct file *);
> +static unsigned int kevent_user_poll(struct file *, struct poll_table_struct *);
> +static int kevent_user_mmap(struct file *, struct vm_area_struct *);

Could you reorder the file so these forward-declaring prototypes aren't
needed?

> +	for (i=0; i<ARRAY_SIZE(u->kevent_list); ++i)

	for (i = 0; i < ARRAY_SIZE(u->kevent_list); i++)

> +static struct page *kevent_user_nopage(struct vm_area_struct *vma, unsigned long addr, int *type)
> +{
> +	struct kevent_user *u = vma->vm_file->private_data;
> +	unsigned long off = (addr - vma->vm_start)/PAGE_SIZE;
> +	unsigned int pnum = ALIGN(KEVENT_MAX_EVENTS*sizeof(struct mukevent) + sizeof(unsigned int), PAGE_SIZE)/PAGE_SIZE;
> +
> +	if (type)
> +		*type = VM_FAULT_MINOR;
> +
> +	if (off >= pnum)
> +		goto err_out_sigbus;
> +
> +	u->pring[off] = __get_free_page(GFP_KERNEL);

So we have a pagefault handler that allocates pages.

> +static int kevent_user_mmap(struct file *file, struct vm_area_struct *vma)
> +{
> +	unsigned long start = vma->vm_start;
> +	struct kevent_user *u = file->private_data;
> +
> +	if (vma->vm_flags & VM_WRITE)
> +		return -EPERM;
> +
> +	vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
> +	vma->vm_ops = &kevent_user_vm_ops;
> +	vma->vm_flags |= VM_RESERVED;
> +	vma->vm_file = file;
> +
> +	if (remap_pfn_range(vma, start, virt_to_phys((void *)u->pring[0]), PAGE_SIZE,
> +				vma->vm_page_prot))
> +		return -EFAULT;

but you always map the first page.  This model sounds odd and rather confusing.
Do we really need to avoid the cost of the pagefault just for the special
first page?

If so please at least use vm_insert_page() instead of remap_pfn_range().

> +#if 0
> +static inline unsigned int

[PATCH] wireless-dev: relax sysfs permissions

2006-08-16 Thread Johannes Berg

The sysfs attributes add_iface and remove_iface both check for
CAP_NET_ADMIN whenever something is written. Hence, permissions for the
files should be relaxed so that someone who is not root but happens to
have CAP_NET_ADMIN can do things.

Signed-off-by: Johannes Berg [EMAIL PROTECTED]

--- wireless-dev.orig/net/d80211/ieee80211_sysfs.c	2006-08-16 15:45:41.0 +0200
+++ wireless-dev/net/d80211/ieee80211_sysfs.c	2006-08-16 15:46:05.0 +0200
@@ -195,8 +195,8 @@
 __IEEE80211_LOCAL_SHOW(rate_ctrl_alg);
 
 static struct class_device_attribute ieee80211_class_dev_attrs[] = {
-	__ATTR(add_iface, S_IWUSR, NULL, store_add_iface),
-	__ATTR(remove_iface, S_IWUSR, NULL, store_remove_iface),
+	__ATTR(add_iface, S_IWUGO, NULL, store_add_iface),
+	__ATTR(remove_iface, S_IWUGO, NULL, store_remove_iface),
 	__ATTR(channel, S_IRUGO, ieee80211_local_show_channel, NULL),
 	__ATTR(frequency, S_IRUGO, ieee80211_local_show_frequency, NULL),
 	__ATTR(radar_detect, S_IRUGO, ieee80211_local_show_radar_detect, NULL),


Re: Possible leak of multicast source filter sctructure #3

2006-08-16 Thread Michal Ruzicka

Hi
> I'm not sure the second one is quite right. The case of concern
> is where an interface is deleted. If you joined (or left) the group by
> address and then deleted the interface, then you wouldn't match the
> index (which wouldn't be set) so the leave wouldn't work, still.

That's right, I haven't thought of this case.

> Also, if you passed a completely bogus ifindex, it should return
> ENODEV, but with the patch it would return EADDRNOTAVAIL it appears.

The question is what counts as a completely bogus ifindex in this case.
The index of an interface that no longer exists but still happens to be on
the socket's multicast list shouldn't.

> So, I think the second patch needs some more work. I'll look at
> it some more and see if I can suggest something better.
>
>
> +-DLS

I've tried to implement something more complete, but especially in the case
of leaving a group by address it is still just a best effort and not something
absolutely perfect.

I've started with streamlining the ip_mc_find_dev() function, with one small
change in its behavior: the imr_address member of the ip_mreqn request
structure is cleared when an interface is found by index.
This should be no problem since this member is not used in that case and may
contain a random value. I clear it to get rid of this randomness, since
the value might now be used in ip_mc_leave_group().

And now the changes in ip_mc_leave_group():
I've split it into two different cases:
 1) leave by an interface index
 2) leave by an interface address / multicast address

In the first case I search for a match by the interface index specified
in the leave request. If a match is found I leave the group on the
interface irrespective of its existence.

In the second case I do a similar search (but this time using the interface
index found in ip_mc_find_dev()) while also checking for a match by
the interface address.
If no match is found by the interface index and there is a match (or more)
by the address I leave the group on the interface corresponding to the first
match by the address.
This certainly could produce weird results but such results could be
produced by the original algorithm as well with the additional problem
that there was no way to leave a group on a deleted interface.
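
The matching rules described above can be sketched in userspace C roughly as
follows. This is an illustration of the description only, not the code from
the patch: struct demo_membership, demo_find_leave() and its parameters are
hypothetical, and the real membership list and locking are left out.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified entry of a socket's multicast membership list. */
struct demo_membership {
	uint32_t group;		/* multicast group address */
	uint32_t ifaddr;	/* interface address recorded at join time */
	int ifindex;		/* interface index recorded at join time */
	struct demo_membership *next;
};

/*
 * req_ifindex:      index given in the leave request, 0 when leaving by address
 * req_ifaddr:       interface address given in the leave request
 * resolved_ifindex: index looked up from the address or route, 0 if the
 *                   lookup failed (for example, the interface is gone)
 *
 * Returns the entry to leave, or NULL if nothing matches.
 */
struct demo_membership *demo_find_leave(struct demo_membership *list,
					uint32_t group, int req_ifindex,
					uint32_t req_ifaddr,
					int resolved_ifindex)
{
	struct demo_membership *first_addr_match = NULL;
	struct demo_membership *m;

	for (m = list; m; m = m->next) {
		if (m->group != group)
			continue;

		if (req_ifindex != 0) {
			/* case 1: leave by interface index */
			if (m->ifindex == req_ifindex)
				return m;
			continue;
		}

		/* case 2: leave by address, a match by index wins */
		if (resolved_ifindex != 0 && m->ifindex == resolved_ifindex)
			return m;
		if (!first_addr_match && m->ifaddr == req_ifaddr)
			first_addr_match = m;
	}

	/* no index match: fall back to the first match by address, if any */
	return first_addr_match;
}
```

In case 1 the fallback pointer is never set, so a request with an index that
matches nothing yields NULL. In case 2 an entry can still be found after the
interface has been deleted, because the recorded address is enough.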

And here is the patch:

Signed-off-by: Michal Ruzicka [EMAIL PROTECTED]

--- linux-2.6.17.8/net/ipv4/igmp.c.orig 2006-08-11 11:50:46.0 +0200
+++ linux-2.6.17.8/net/ipv4/igmp.c  2006-08-16 15:06:18.0 +0200
@@ -1369,13 +1369,15 @@
 	struct flowi fl = { .nl_u = { .ip4_u =
 				      { .daddr = imr->imr_multiaddr.s_addr } } };
 	struct rtable *rt;
-	struct net_device *dev = NULL;
-	struct in_device *idev = NULL;
+	struct net_device *dev;
 
 	if (imr->imr_ifindex) {
-		idev = inetdev_by_index(imr->imr_ifindex);
-		if (idev)
+		struct in_device *idev = inetdev_by_index(imr->imr_ifindex);
+
+		if (idev) {
+			imr->imr_address.s_addr = 0;
 			__in_dev_put(idev);
+		}
 		return idev;
 	}
 	if (imr->imr_address.s_addr) {
@@ -1383,17 +1385,16 @@
 		if (!dev)
 			return NULL;
 		dev_put(dev);
-	}
-
-	if (!dev && !ip_route_output_key(&rt, &fl)) {
+	} else if (!ip_route_output_key(&rt, &fl)) {
 		dev = rt->u.dst.dev;
 		ip_rt_put(rt);
-	}
-	if (dev) {
-		imr->imr_ifindex = dev->ifindex;
-		idev = __in_dev_get_rtnl(dev);
-	}
-	return idev;
+		if (!dev)
+			return NULL;
+	} else
+		return NULL;
+
+	imr->imr_ifindex = dev->ifindex;
+	return __in_dev_get_rtnl(dev);
 }
 
 /*
@@ -1798,27 +1799,79 @@
 	u32 ifindex;
 
 	rtnl_lock();
-	in_dev = ip_mc_find_dev(imr);
-	if (!in_dev) {
-		rtnl_unlock();
-		return -ENODEV;
-	}
 	ifindex = imr->imr_ifindex;
-	for (imlp = &inet->mc_list; (iml = *imlp) != NULL; imlp = &iml->next) {
-		if (iml->multi.imr_multiaddr.s_addr == group &&
-		    iml->multi.imr_ifindex == ifindex) {
-			(void) ip_mc_leave_src(sk, iml, in_dev);
-
-			*imlp = iml->next;
-
-			ip_mc_dec_group(in_dev, group);
-			rtnl_unlock();
-			sock_kfree_s(sk, iml, sizeof(*iml));
-			return 0;
+	in_dev = ip_mc_find_dev(imr);
+	if (ifindex != 0) {
+		/* leave by interface index */
+		for (imlp = &inet->mc_list; (iml = *imlp) != NULL; imlp = &iml->next) {
+			if (iml->multi.imr_multiaddr.s_addr != group)
+				continue;
+
+			if

Re: [PATCH 1/9] network namespaces: core and device list

2006-08-16 Thread Dave Hansen

On Tue, 2006-08-15 at 18:48 +0400, Andrey Savochkin wrote:
 
> 	/* Can survive without statistics */
> 	stats = kmalloc(sizeof(struct net_device_stats), GFP_KERNEL);
> 	if (stats) {
> 		memset(stats, 0, sizeof(struct net_device_stats));
> -		loopback_dev.priv = stats;
> -		loopback_dev.get_stats = &get_stats;
> +		dev->priv = stats;
> +		dev->get_stats = &get_stats;
> 	}

With this much surgery it might be best to start using things that have
come along since this code was touched last, like kzalloc().
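
For readers outside the kernel, the same simplification can be shown with
the C library, where calloc() plays the role that kzalloc() plays in the
kernel. The struct and function names below are hypothetical stand-ins for
illustration; in the quoted code the kernel-side equivalent would be a
single kzalloc(sizeof(struct net_device_stats), GFP_KERNEL) call in place
of the kmalloc() and memset() pair.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for struct net_device_stats. */
struct demo_stats {
	unsigned long rx_packets;
	unsigned long tx_packets;
};

/* Two-step form, as in the quoted code: allocate, then clear. */
struct demo_stats *demo_stats_alloc_two_step(void)
{
	struct demo_stats *s = malloc(sizeof(*s));

	if (s)
		memset(s, 0, sizeof(*s));
	return s;
}

/* One-step form: the allocator hands back zeroed memory directly. */
struct demo_stats *demo_stats_alloc_zeroed(void)
{
	return calloc(1, sizeof(struct demo_stats));
}
```

Both functions return memory with the same contents; the one-step form
leaves no window in which the size passed to the allocator and the size
passed to memset() can drift apart.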

-- Dave


Re: Possible leak of multicast source filter sctructure #3a

2006-08-16 Thread Michal Ruzicka

The same patch as in previous e-mail with a few typos in comments corrected:

Signed-off-by: Michal Ruzicka [EMAIL PROTECTED]

--- linux-2.6.17.8/net/ipv4/igmp.c.orig 2006-08-11 11:50:46.0 +0200
+++ linux-2.6.17.8/net/ipv4/igmp.c  2006-08-16 16:53:08.0 +0200
@@ -1369,13 +1369,15 @@
struct flowi fl = { .nl_u = { .ip4_u =
  { .daddr = imr->imr_multiaddr.s_addr } } };
struct rtable *rt;
-   struct net_device *dev = NULL;
-   struct in_device *idev = NULL;
+   struct net_device *dev;
 
if (imr->imr_ifindex) {
-   idev = inetdev_by_index(imr->imr_ifindex);
-   if (idev)
+   struct in_device *idev = inetdev_by_index(imr->imr_ifindex);
+
+   if (idev) {
+   imr->imr_address.s_addr = 0;
__in_dev_put(idev);
+   }
return idev;
}
if (imr->imr_address.s_addr) {
@@ -1383,17 +1385,16 @@
if (!dev)
return NULL;
dev_put(dev);
-   }
-
-   if (!dev && !ip_route_output_key(&rt, &fl)) {
+   } else if (!ip_route_output_key(&rt, &fl)) {
dev = rt->u.dst.dev;
ip_rt_put(rt);
-   }
-   if (dev) {
-   imr->imr_ifindex = dev->ifindex;
-   idev = __in_dev_get_rtnl(dev);
-   }
-   return idev;
+   if (!dev)
+   return NULL;
+   } else
+   return NULL;
+
+   imr->imr_ifindex = dev->ifindex;
+   return __in_dev_get_rtnl(dev);
 }
 
 /*
@@ -1798,27 +1799,79 @@
u32 ifindex;
 
rtnl_lock();
-   in_dev = ip_mc_find_dev(imr);
-   if (!in_dev) {
-   rtnl_unlock();
-   return -ENODEV;
-   }
ifindex = imr->imr_ifindex;
-   for (imlp = &inet->mc_list; (iml = *imlp) != NULL; imlp = &iml->next) {
-   if (iml->multi.imr_multiaddr.s_addr == group &&
-   iml->multi.imr_ifindex == ifindex) {
-   (void) ip_mc_leave_src(sk, iml, in_dev);
-
-   *imlp = iml->next;
-
-   ip_mc_dec_group(in_dev, group);
-   rtnl_unlock();
-   sock_kfree_s(sk, iml, sizeof(*iml));
-   return 0;
+   in_dev = ip_mc_find_dev(imr);
+   if (ifindex != 0) {
+   /* leave by interface index */
+   for (imlp = &inet->mc_list; (iml = *imlp) != NULL; imlp = &iml->next) {
+   if (iml->multi.imr_multiaddr.s_addr != group)
+   continue;
+
+   if (iml->multi.imr_ifindex == ifindex)
+   goto leave;
+   }
+   } else {
+   /* leave by address / multicast group route */
+   struct ip_mc_socklist **cimlp = NULL;
+   u32 address = imr->imr_address.s_addr;
+
+   ifindex = imr->imr_ifindex;
+   for (imlp = &inet->mc_list; (iml = *imlp) != NULL; imlp = &iml->next) {
+   if (iml->multi.imr_multiaddr.s_addr != group)
+   continue;
+
+   if (iml->multi.imr_ifindex == ifindex)
+   /* direct match
+* NOTE: We do not have to test for in_dev != NULL
+* since we know that ifindex was zero before the call
+* to ip_mc_find_dev() but is non-zero now (as
+* it equals an interface index, which is never
+* zero). The ip_mc_find_dev() function modifies
+* the ifindex only if it finds an interface
+* (in which case it returns non-NULL). Thus the
+* in_dev must be non-NULL.
+*/
+   goto leave;
+
+   if (cimlp == NULL && iml->multi.imr_address.s_addr == address)
+   cimlp = imlp;
+   }
+
+   if (cimlp != NULL) {
+   /* We have found at least one candidate interface
+* for leaving by address but not a direct match.
+* Since there is no way to tell what interface the user
+* wanted to leave the multicast group on we are going
+* to leave it on the first candidate interface found.
+*/
+   iml = *(imlp = cimlp);
+
+   if (in_dev != NULL) {
+   /* If we have found an interface matching the leave
+* request chances are that the interface which we
+* are about to leave the multicast group on still

Can't turn off CONFIG_NET_ESTIMATOR on 2.6.17.7

2006-08-16 Thread Andy Furniss




I recently built a 2.6.17.7 kernel and wanted to turn off CONFIG_NET_ESTIMATOR,
but I can't do so using menuconfig.

Is it on by default now, or is it a config issue?

I wanted it off to play with chains of policers. Unless I misunderstand,
it uses Hz and is inaccurate when Hz=250, with its minimum time of
1/4 sec, which is too high for what I wanted anyway.


Andy.


Re: [RFC] network namespaces

2006-08-16 Thread Alexey Kuznetsov

Hello!

 (application) containers.  Performance aside, are there any reasons why
 this approach would be problematic for c/r?

This approach is just perfect for c/r.

Probably, this is the only approach when migration can be done
in a clean and self-consistent way.

Alexey

[PATCH] bcm43xx-softmac: optimization of DMA bitfields.]

2006-08-16 Thread Larry Finger

John,

Please pull this patch for the wireless-2.6 tree.

This patch depends on the 64bit DMA patch, which is already
submitted for inclusion.

Convert the bitfields in the bcm43xx DMA code to properly
aligned u8 booleans. These flags are accessed in the DMA
hotpath, so it's a good idea to waste a few bytes of memory
for the sake of speed by not requiring masking (and probably
shifting) of the bitfields.
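
The trade-off described above can be sketched in plain C. This is an illustrative sketch only, assuming a compiler that accepts bitfields on an 8-bit type (GCC and Clang do, as an extension); the struct names and the two accessor functions are hypothetical and are not part of the bcm43xx driver.

```c
#include <stdint.h>

/* Packed layout: the three flags share one byte, so every read has to
 * mask (and usually shift) the containing byte to isolate a flag. */
struct ring_bitfield {
	uint8_t tx:1,		/* is this a TX ring? */
		dma64:1,	/* 64bit DMA if set, 32bit DMA otherwise */
		suspended:1;	/* are transfers suspended on this ring? */
};

/* Byte-per-flag layout: costs two extra bytes, but each flag is a
 * plain byte load with no masking. */
struct ring_bytes {
	uint8_t tx;
	uint8_t dma64;
	uint8_t suspended;
};

/* Hypothetical accessors; compare the generated code for the two. */
int ring_bitfield_is_suspended(const struct ring_bitfield *ring)
{
	return ring->suspended;
}

int ring_bytes_is_suspended(const struct ring_bytes *ring)
{
	return ring->suspended;
}
```

Compiling both accessors with optimization enabled and comparing the assembly is one way to see the extra mask and shift on the bitfield path.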

Signed-off-by: Michael Buesch [EMAIL PROTECTED]
Signed-Off-By: Larry Finger [EMAIL PROTECTED]

Index: wireless-2.6/drivers/net/wireless/bcm43xx/bcm43xx_dma.h
===
--- wireless-2.6.orig/drivers/net/wireless/bcm43xx/bcm43xx_dma.h
2006-08-16 12:47:27.0 +0200
+++ wireless-2.6/drivers/net/wireless/bcm43xx/bcm43xx_dma.h 2006-08-16 
12:49:43.0 +0200
@@ -235,9 +235,12 @@
u16 mmio_base;
/* DMA controller index number (0-5). */
int index;
-   u8 tx:1,/* TRUE, if this is a TX ring. */
-  dma64:1, /* TRUE, if 64-bit DMA is enabled (FALSE if 32bit). */
-  suspended:1; /* TRUE, if transfers are suspended on this ring. */
+   /* Boolean. Is this a TX ring? */
+   u8 tx
+   /* Boolean. 64bit DMA if true, 32bit DMA otherwise. */
+   u8 dma64;
+   /* Boolean. Are transfers suspended on this ring? */
+   u8 suspended;
struct bcm43xx_private *bcm;
 #ifdef CONFIG_BCM43XX_DEBUG
/* Maximum number of used slots. */




Re: [PATCH] bcm43xx-softmac: optimization of DMA bitfields.]

2006-08-16 Thread Johannes Berg

On Wed, 2006-08-16 at 10:36 -0500, Larry Finger wrote:
 + /* Boolean. Is this a TX ring? */
 + u8 tx
 + /* Boolean. 64bit DMA if true, 32bit DMA otherwise. */
 + u8 dma64;

does that compile?

johannes

[PATCH 2/2] Add support for LAN8187 and LAN8700 PHYs

2006-08-16 Thread Steve Glendinning

Make functions and constants generic, add support for two more
SMSC PHY models with identical interrupt source and mask registers

Signed-off-by: Steve Glendinning [EMAIL PROTECTED]
---
 drivers/net/phy/smsc.c |  112 ++--
 1 files changed, 89 insertions(+), 23 deletions(-)

diff --git a/drivers/net/phy/smsc.c b/drivers/net/phy/smsc.c
index 2119bd7..22785fb 100644
--- a/drivers/net/phy/smsc.c
+++ b/drivers/net/phy/smsc.c
@@ -12,6 +12,7 @@
  * Free Software Foundation;  either version 2 of the  License, or (at your
  * option) any later version.
  *
+ * Support added for SMSC LAN8187 and LAN8700 by [EMAIL PROTECTED]
  */
 
 #include <linux/config.h>
@@ -22,42 +23,42 @@ #include <linux/ethtool.h>
 #include <linux/phy.h>
 #include <linux/netdevice.h>
 
-#define MII_LAN83C185_ISF 29 /* Interrupt Source Flags */
-#define MII_LAN83C185_IM  30 /* Interrupt Mask */
+#define MII_SMSC_ISF 29 /* Interrupt Source Flags */
+#define MII_SMSC_IM  30 /* Interrupt Mask */
 
-#define MII_LAN83C185_ISF_INT1 (1<<1) /* Auto-Negotiation Page Received */
-#define MII_LAN83C185_ISF_INT2 (1<<2) /* Parallel Detection Fault */
-#define MII_LAN83C185_ISF_INT3 (1<<3) /* Auto-Negotiation LP Ack */
-#define MII_LAN83C185_ISF_INT4 (1<<4) /* Link Down */
-#define MII_LAN83C185_ISF_INT5 (1<<5) /* Remote Fault Detected */
-#define MII_LAN83C185_ISF_INT6 (1<<6) /* Auto-Negotiation complete */
-#define MII_LAN83C185_ISF_INT7 (1<<7) /* ENERGYON */
+#define MII_SMSC_ISF_INT1 (1<<1) /* Auto-Negotiation Page Received */
+#define MII_SMSC_ISF_INT2 (1<<2) /* Parallel Detection Fault */
+#define MII_SMSC_ISF_INT3 (1<<3) /* Auto-Negotiation LP Ack */
+#define MII_SMSC_ISF_INT4 (1<<4) /* Link Down */
+#define MII_SMSC_ISF_INT5 (1<<5) /* Remote Fault Detected */
+#define MII_SMSC_ISF_INT6 (1<<6) /* Auto-Negotiation complete */
+#define MII_SMSC_ISF_INT7 (1<<7) /* ENERGYON */
 
-#define MII_LAN83C185_ISF_INT_ALL (0x0e)
+#define MII_SMSC_ISF_INT_ALL (0x0e)
 
-#define MII_LAN83C185_ISF_INT_PHYLIB_EVENTS \
-   (MII_LAN83C185_ISF_INT6 | MII_LAN83C185_ISF_INT4)
+#define MII_SMSC_ISF_INT_PHYLIB_EVENTS \
+   (MII_SMSC_ISF_INT6 | MII_SMSC_ISF_INT4)
 
 
-static int lan83c185_config_intr(struct phy_device *phydev)
+static int smsc_phy_config_intr(struct phy_device *phydev)
 {
-   int rc = phy_write(phydev, MII_LAN83C185_IM,
+   int rc = phy_write(phydev, MII_SMSC_IM,
   ((PHY_INTERRUPT_ENABLED == phydev->interrupts)
-   ? MII_LAN83C185_ISF_INT_PHYLIB_EVENTS : 0));
+   ? MII_SMSC_ISF_INT_PHYLIB_EVENTS : 0));
 
return rc < 0 ? rc : 0;
 }
 
-static int lan83c185_ack_interrupt(struct phy_device *phydev)
+static int smsc_phy_ack_interrupt(struct phy_device *phydev)
 {
-   int rc = phy_read(phydev, MII_LAN83C185_ISF);
+   int rc = phy_read(phydev, MII_SMSC_ISF);
 
return rc < 0 ? rc : 0;
 }
 
-static int lan83c185_config_init(struct phy_device *phydev)
+static int smsc_phy_config_init(struct phy_device *phydev)
 {
-   return lan83c185_ack_interrupt(phydev);
+   return smsc_phy_ack_interrupt(phydev);
 }
 
 
@@ -73,22 +74,87 @@ static struct phy_driver lan83c185_drive
/* basic functions */
.config_aneg= genphy_config_aneg,
.read_status= genphy_read_status,
-   .config_init= lan83c185_config_init,
+   .config_init= smsc_phy_config_init,
 
/* IRQ related */
-   .ack_interrupt  = lan83c185_ack_interrupt,
-   .config_intr= lan83c185_config_intr,
+   .ack_interrupt  = smsc_phy_ack_interrupt,
+   .config_intr= smsc_phy_config_intr,
+
+   .driver = { .owner = THIS_MODULE, }
+};
+
+static struct phy_driver lan8187_driver = {
+   .phy_id = 0x0007c0b0, /* OUI=0x00800f, Model#=0x0b */
+   .phy_id_mask= 0xfff0,
+   .name   = "SMSC LAN8187",
+
+   .features   = (PHY_BASIC_FEATURES | SUPPORTED_Pause
+   | SUPPORTED_Asym_Pause),
+   .flags  = PHY_HAS_INTERRUPT | PHY_HAS_MAGICANEG,
+
+   /* basic functions */
+   .config_aneg= genphy_config_aneg,
+   .read_status= genphy_read_status,
+   .config_init= smsc_phy_config_init,
+
+   /* IRQ related */
+   .ack_interrupt  = smsc_phy_ack_interrupt,
+   .config_intr= smsc_phy_config_intr,
+
+   .driver = { .owner = THIS_MODULE, }
+};
+
+static struct phy_driver lan8700_driver = {
+   .phy_id = 0x0007c0c0, /* OUI=0x00800f, Model#=0x0c */
+   .phy_id_mask= 0xfff0,
+   .name   = "SMSC LAN8700",
+
+   .features   = (PHY_BASIC_FEATURES | SUPPORTED_Pause
+   | SUPPORTED_Asym_Pause),
+   .flags  = PHY_HAS_INTERRUPT | PHY_HAS_MAGICANEG,
+
+   /* basic functions */
+   .config_aneg= genphy_config_aneg,
+   .read_status= genphy_read_status,
+   .config_init= smsc_phy_config_init,
+
+   /* IRQ

RE: [E1000-devel] e1000: ethtool -p + cable pull = system wedges hard

2006-08-16 Thread Brandeburg, Jesse

Kok, Auke-jan H wrote:
 Auke Kok wrote:
 Jay Vosburgh wrote:
 Running both 2.6.17.6 plus the e1000 7.2.7 from sourceforge, or
 the e1000 in netdev-2.6#upstream (7.1.9-k4).
 
 Starting up ethtool -p ethX then unplugging the cable
 connected to the identified port is causing my system to completely
 freeze; even sysrq is unresponsive.  I'm running on a 2-way x86
 box, with an 82545GM. 
 
 Is this by any chance a known problem?
 
 not at all.
 
 One of my brain halves (the third one ;)) poked me and told me that
 it *is* a known issue. Not good. Apparently as early as kernel 2.5.50
 a change was introduced that causes this. I am unsure what exactly
 caused it and I assume it is generic (other nic's might also suffer).
 The issue is documented in our standalone driver documentation. Not
 sure what to do with this. 

Has something to do with the RTNL lock being held and link notification,
as I remember.
We noticed it to be a global problem, happens with e100 too.

http://www.mail-archive.com/netdev@vger.kernel.org/msg01654.html

Jesse

[PATCH 1/2] Fix style to match drivers/net/phy/*

2006-08-16 Thread Steve Glendinning

Trivial style fixes to match other PHY drivers

Signed-off-by: Steve Glendinning [EMAIL PROTECTED]
---
 drivers/net/phy/smsc.c |   15 +++
 1 files changed, 7 insertions(+), 8 deletions(-)

diff --git a/drivers/net/phy/smsc.c b/drivers/net/phy/smsc.c
index 25e31fb..2119bd7 100644
--- a/drivers/net/phy/smsc.c
+++ b/drivers/net/phy/smsc.c
@@ -41,24 +41,23 @@ #define MII_LAN83C185_ISF_INT_PHYLIB_EVE
 
 static int lan83c185_config_intr(struct phy_device *phydev)
 {
-   int rc = phy_write (phydev, MII_LAN83C185_IM,
-   ((PHY_INTERRUPT_ENABLED == phydev->interrupts)
-   ? MII_LAN83C185_ISF_INT_PHYLIB_EVENTS
-   : 0));
+   int rc = phy_write(phydev, MII_LAN83C185_IM,
+  ((PHY_INTERRUPT_ENABLED == phydev->interrupts)
+   ? MII_LAN83C185_ISF_INT_PHYLIB_EVENTS : 0));
 
return rc < 0 ? rc : 0;
 }
 
 static int lan83c185_ack_interrupt(struct phy_device *phydev)
 {
-   int rc = phy_read (phydev, MII_LAN83C185_ISF);
+   int rc = phy_read(phydev, MII_LAN83C185_ISF);
 
return rc < 0 ? rc : 0;
 }
 
 static int lan83c185_config_init(struct phy_device *phydev)
 {
-   return lan83c185_ack_interrupt (phydev);
+   return lan83c185_ack_interrupt(phydev);
 }
 
 
@@ -85,12 +84,12 @@ static struct phy_driver lan83c185_drive
 
 static int __init smsc_init(void)
 {
-   return phy_driver_register (&lan83c185_driver);
+   return phy_driver_register(&lan83c185_driver);
 }
 
 static void __exit smsc_exit(void)
 {
-   phy_driver_unregister (&lan83c185_driver);
+   phy_driver_unregister(&lan83c185_driver);
 }
 
 MODULE_DESCRIPTION("SMSC PHY driver");
-- 
1.4.1


Re: [PATCH 1/9] network namespaces: core and device list

2006-08-16 Thread Stephen Hemminger

On Wed, 16 Aug 2006 07:46:43 -0700
Dave Hansen [EMAIL PROTECTED] wrote:

 On Tue, 2006-08-15 at 18:48 +0400, Andrey Savochkin wrote:
  
  /* Can survive without statistics */
  stats = kmalloc(sizeof(struct net_device_stats), GFP_KERNEL);
  if (stats) {
  memset(stats, 0, sizeof(struct net_device_stats));
  -   loopback_dev.priv = stats;
  -   loopback_dev.get_stats = get_stats;
  +   dev->priv = stats;
  +   dev->get_stats = get_stats;
  } 
 
 With this much surgery it might be best to start using things that have
 come along since this code was touched last, like kzalloc().
 


If you are going to make the loopback device dynamic, it MUST use
alloc_netdev().

Re: [PATCH 3/9] network namespaces: playing and debugging

2006-08-16 Thread Stephen Hemminger

On Tue, 15 Aug 2006 18:48:43 +0400
Andrey Savochkin [EMAIL PROTECTED] wrote:

 Temporary code to play with network namespaces in the simplest way.
 Do
 exec 7 /proc/net/net_ns
 in your bash shell and you'll get a brand new network namespace.
 There you can, for example, do
 ip link set lo up
 ip addr list
 ip addr add 1.2.3.4 dev lo
 ping -n 1.2.3.4
 
 Signed-off-by: Andrey Savochkin [EMAIL PROTECTED]

NACK, new /proc interfaces are not acceptable.

Re: [d80211 rfc] link master interface from wiphy

2006-08-16 Thread Jiri Benc

On Mon, 14 Aug 2006 10:12:01 +0200, Johannes Berg wrote:
 I'd like to see a link from the wiphy to the master interface that 
 belongs to it so one can tell this easily on systems that have multiple 
 wireless devices.

As wiphy and master interface are closely bound to each other, this makes
sense.

Btw, we will probably need some way to ask d80211 about all interfaces
belonging to given wiphy anyway. Crawling all network interfaces and
searching for correct wiphy symlinks is probably not the best way. I
think a new netlink interface can be used for this.

 wpa_supplicant could use this, I guess. I think 
 another link to wlan#ap should be created (or does wpa_supplicant set 
 the name of that so it knows which one it will get?), or something like 
 that anyway.

wmgmt# will go away in future. There is an ioctl to get its ifindex, so
no need for the link.

 On the other hand, is there any real reason we have this code:
 ndev->base_addr = dev->base_addr;
 ndev->irq = dev->irq;
 ndev->mem_start = dev->mem_start;
 ndev->mem_end = dev->mem_end;
 ndev->flags = dev->flags & IFF_MULTICAST;
 SET_NETDEV_DEV(ndev, dev->class_dev.dev);
 
 in ieee80211_if_add? Maybe we should make the virtual devices all 
 children of the wiphy (struct ieee80211_local) instead of making them 
 children of the physical device? I don't really know though. This is too 
 dark magic for me ;)

What do you mean by making the virtual devices all children of the
wiphy? Currently, all virtual devices (of one physical device) have the
same pointer to ieee80211_local in their net_dev structure and pointers
to them are stored in the linked list in ieee80211_local.

 However, I do know that I can trivially rename the wmaster0 interface 
 using just 'ip link set wmaster0 name wlan3' and things will probably be 
 very confusing for any program that relies on the naming to know which 
 device is which.

Any program that relies on particular device names is broken.

 Comments welcome. Userspace comments as well, I'm programming something 
 that'll use a bunch of interfaces (wmaster, a monitor one and a sta one 
 probably) and I want the user to just select the physical interface, not 
 all these three logical ones... (in fact, I'm creating the logical 
 monitor interface myself in code).

Do you know about /sys/class/net/X/wiphy symlinks? But as I said,
crawling sysfs is not the best idea - among other things, it is subject
to race conditions.

Thanks,

 Jiri

-- 
Jiri Benc
SUSE Labs

Re: [PATCH 1/2] rfkill - Add rfkill driver to misc input devices

2006-08-16 Thread Jiri Benc

On Tue, 8 Aug 2006 16:30:12 +0200, Ivo van Doorn wrote:
 This will add the rfkill driver to the input/misc section of the kernel.
 rfkill is useful for network devices that contain a hardware button
 to enable or disable the radio.
 With rfkill a generic interface is created for the network drivers,
 as well as providing a  uniform way for userspace to listen
 to the hardware button events.

You need to send this patch to lkml and to input subsystem maintainer.

 Jiri

-- 
Jiri Benc
SUSE Labs
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 3/9] network namespaces: playing and debugging

2006-08-16 Thread Eric W. Biederman

Stephen Hemminger [EMAIL PROTECTED] writes:

 On Tue, 15 Aug 2006 18:48:43 +0400
 Andrey Savochkin [EMAIL PROTECTED] wrote:

 Temporary code to play with network namespaces in the simplest way.
 Do
 exec 7 /proc/net/net_ns
 in your bash shell and you'll get a brand new network namespace.
 There you can, for example, do
 ip link set lo up
 ip addr list
 ip addr add 1.2.3.4 dev lo
 ping -n 1.2.3.4
 
 Signed-off-by: Andrey Savochkin [EMAIL PROTECTED]

 NACK, new /proc interfaces are not acceptable.

The rule is that new /proc interfaces that are not process related
are not acceptable.  If structured right a network namespace can
arguably be process related.

I do agree that this interface is pretty ugly there.

Eric

Re: [RFC] network namespaces

2006-08-16 Thread Eric W. Biederman

Alexey Kuznetsov [EMAIL PROTECTED] writes:

 Hello!

 (application) containers.  Performance aside, are there any reasons why
 this approach would be problematic for c/r?

 This approach is just perfect for c/r.

Yes.  For c/r you need to take your state with you.

 Probably, this is the only approach when migration can be done
 in a clean and self-consistent way.

Basically there are currently 3 approaches that have been proposed.

The trivial bsdjail style as implemented by Serge and in a slightly
more sophisticated version in vserver.  This approach, as it does not
touch the packets, has little to no packet level overhead.  Basically
this is what I have called the Layer 3 approach.

The more in depth approach where we modify the packet processing based
upon which network interface the packet comes in on, and it looks like
each namespace has its own instance of the network stack.  Roughly
what was proposed earlier in this thread: the Layer 2 approach.  This
potentially has per packet overhead so we need to watch the implementation
very carefully.

Some weird hybrid as proposed by Daniel, that I was never clear on the
semantics.

From the previous conversations my impression was that as long as
we could get a Layer 2 approach that did not slow down the networking
stack and was clean, everyone would be happy.

I'm buried in the process id namespace at the moment, and expect
to be so for the rest of the month, so I'm not going to be
very helpful except for a few stray comments.

Eric

skge crashes

2006-08-16 Thread Beschorner Daniel

Stephen,

the reproducible crashes with all skge versions (where sk98lin works
fine) on my box are SMP related.
I booted with maxcpus=1 and the box survived my usual crash test; I will
keep an eye on it.

Daniel

RE: proposal for new wireless configuration API

2006-08-16 Thread Simon Barber

I'd suggest that the new signal strength measure should be defined as
'RCPI' - the 'Received Channel Power Indicator' - which is defined in
IEEE 802.11k (the Radio Measurements amendment to 802.11).  Here is the
full text of the definition from 802.11k draft 5.0:

received channel power indicator (RCPI): An indication of the total
channel power (signal, noise, and interference) of a received IEEE
802.11 frame measured on a single channel and at the antenna connector
used to receive the frame.

The RCPI indicator is a measure of the received RF power in the selected
channel for a received frame. This parameter shall be a measure by the
PHY sublayer of the received RF power in the channel measured over the
entire received frame or by other equivalent means which meet the
specified accuracy. RCPI shall be a monotonically increasing,
logarithmic function of the received power level defined in dBm. The
allowed values for the Received Channel Power Indicator (RCPI) parameter
shall be an 8 bit value in the range from 0 through 220, with indicated
values rounded to the nearest 0.5 dB as follows:

0: Power < -110 dBm
1: Power = -109.5 dBm
2: Power = -109.0 dBm

and so on where

RCPI = int{(Power in dBm +110)*2} for 0 dBm > Power > -110 dBm

220: Power > -0 dBm
221-254: reserved
255: Measurement not available

RCPI shall equal the received RF power within an accuracy of +/-5 dB
(95% confidence interval) within the specified dynamic range of the
receiver. The received RF power shall be determined assuming a receiver
noise equivalent bandwidth equal to the channel bandwidth multiplied by
1.1.
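
As a rough illustration of the mapping quoted above, here is a small C sketch. It assumes that int{} means truncation toward zero (which equals floor here, since the argument is never negative in the valid range) and that exactly 0 dBm maps to 220; the function and macro names are made up for this example and do not come from 802.11k or any driver.

```c
#include <stdint.h>

#define RCPI_MAX		220	/* power at or above 0 dBm */
#define RCPI_UNAVAILABLE	255	/* measurement not available */

/* Convert a received power in dBm to an RCPI value in 0.5 dB steps. */
uint8_t rcpi_from_dbm(double power_dbm)
{
	if (power_dbm < -110.0)
		return 0;
	if (power_dbm >= 0.0)
		return RCPI_MAX;
	/* RCPI = int{(Power in dBm + 110) * 2} */
	return (uint8_t)((power_dbm + 110.0) * 2.0);
}

/* Inverse mapping; only meaningful for RCPI values 0 through 220. */
double rcpi_to_dbm(uint8_t rcpi)
{
	return rcpi / 2.0 - 110.0;
}
```

For example, rcpi_from_dbm(-109.5) gives 1 and rcpi_from_dbm(-109.0) gives 2, matching the table in the definition; values 221 through 254 are never produced because they are reserved.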



Simon

-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
On Behalf Of Johannes Berg
Sent: Tuesday, August 15, 2006 11:51 PM
To: Dan Williams
Cc: netdev@vger.kernel.org; Jean Tourrilhes
Subject: Re: proposal for new wireless configuration API

On Tue, 2006-08-15 at 12:29 -0400, Dan Williams wrote:

 We might want to take the time to fix up a few of the ambiguities of 
 WEXT that we've encountered over the past few years:

Yes, I definitely agree.

 o Separate attributes for signal strength units; signed integer type 
 for dBm, unsigned integer type for RSSI.  One 8-bit var to represent 
 both is just too confusing for people, evidently (which is true...)

Yes, agreed, they should be separated. In general, I think that one
attribute should always have a single meaning and unit attached, except
for explicitly unit-less attributes (number of frames or whatever), or
attributes that explicitly have no stable unit (raw rssi).

 o Merge functionality ENCODE and ENCODEEXT handlers into one

Good one. I'm still not sure whether we should have an attribute for
this, or a command. The whole key business seems rather complex and it
might be good to have a command 'set key' with say a possible attribute
for the mac address of a pairwise key, a key material attribute and an
IV attribute or something. Otherwise we'll end up parsing the contents
of an attribute again, which rather sucks...

On the other hand, having it as a command won't allow the user to
optimize setting the key and other things at once. I'm not too sure we
should pay all that much attention to this problem though, it can't take
forever and typically a user with such a card won't be changing the key
or parameters all the time, hence it's usually probably done only at boot
or association time.

johannes

Re: skge crashes

2006-08-16 Thread Stephen Hemminger

On Wed, 16 Aug 2006 19:47:08 +0200
Beschorner Daniel [EMAIL PROTECTED] wrote:

 Stephen,
 
 the reproducible crashes with all skge versions (where sk98lin works
 fine) on my box are SMP related.
 I booted with maxcpus=1 and the box survived my usual crash test, I will
 keep an eye.
 
 Daniel
 

Is this P3 SMP?
What form of IRQ balance are you using?

Re: New driver questions: Attansic L1 gigabit NIC

2006-08-16 Thread Stephen Hemminger

On Tue, 15 Aug 2006 18:23:19 -0500
Jay Cliburn [EMAIL PROTECTED] wrote:

 Stephen Hemminger wrote:
  On Sun, 13 Aug 2006 19:11:42 -0500
  Jay Cliburn [EMAIL PROTECTED] wrote:
 ...snip...
  I've read the LKML FAQ regarding new driver submissions, but it implies
  that the submitter be willing to maintain the driver, which I'm not
  qualified to do.  I haven't contacted Attansic to request a change to
  the above support statement, because my past attempts to contact vendors
  on matters of this tenor have been greeted with silence.
  
  I would recommend the module author to see if they would GPL it.
 
 Thank you for your reply.  I've contacted the author as you suggest.
 

IANAL but because they used GPL code in the driver, one could argue
that they created a derived work covered by GPL already. But I learned in
preschool it is always better to ask than take.

Re: [Bug 6936] BUG: warning at net/core/dev.c:1171/skb_checksum_help()

2006-08-16 Thread Stephen Hemminger

On Wed, 16 Aug 2006 11:29:00 +1000
Herbert Xu [EMAIL PROTECTED] wrote:

 On Tue, Aug 15, 2006 at 11:29:59AM -0700, [EMAIL PROTECTED] wrote:
  http://bugzilla.kernel.org/show_bug.cgi?id=6936
 
 It's actually a bug in the bridging code :)
 
 [BRIDGE]: Disable SG/GSO if TX checksum is off
 
 When the bridge recomputes features, it does not maintain the
 constraint that SG/GSO must be off if TX checksum is off.
 This patch adds that constraint.
 
 On a completely unrelated note, I've also added TSO6 and TSO_ECN
 feature bits if GSO is enabled on the underlying device through
 the new NETIF_F_GSO_SOFTWARE macro.
 
 Signed-off-by: Herbert Xu [EMAIL PROTECTED]
 
 Cheers,

Agreed. I assume this came in with the new GSO for 2.6.18, or do we need
to fix 2.6.17 as well?

Re: [E1000-devel] e1000: ethtool -p + cable pull = system wedges hard

2006-08-16 Thread Jay Vosburgh

Brandeburg, Jesse [EMAIL PROTECTED] wrote:

Kok, Auke-jan H wrote:
 Auke Kok wrote:
 Jay Vosburgh wrote:
Running both 2.6.17.6 plus the e1000 7.2.7 from sourceforge, or
 the e1000 in netdev-2.6#upstream (7.1.9-k4).
 
Starting up ethtool -p ethX then unplugging the cable
 connected to the identified port is causing my system to completely
 freeze; even sysrq is unresponsive.  I'm running on a 2-way x86
 box, with an 82545GM. 
[...]
Has something to do with the RTNL lock being held and link notification,
as I remember.
We noticed it to be a global problem, happens with e100 too.

http://www.mail-archive.com/netdev@vger.kernel.org/msg01654.html

Well, I thought maybe I'd messed it up when I tested the other
cards, but I just went and tried it again.  Only the e1000 wedges the
system if I pull the cable with ethtool -p running.  The e100 and tg3
don't lock up.  Pulling the cable ends the ethtool for tg3, but for e100
the blinky blinky keeps going even after the cable is back in.

Even so, as you mention, the operation is holding the RTNL, so
anything else that wants it has to wait.

-J

---
-Jay Vosburgh, IBM Linux Technology Center, [EMAIL PROTECTED]

Re: New driver questions: Attansic L1 gigabit NIC

2006-08-16 Thread John Haller


Stephen Hemminger wrote:

On Tue, 15 Aug 2006 18:23:19 -0500
Jay Cliburn [EMAIL PROTECTED] wrote:


Stephen Hemminger wrote:

On Sun, 13 Aug 2006 19:11:42 -0500
Jay Cliburn [EMAIL PROTECTED] wrote:

...snip...

I've read the LKML FAQ regarding new driver submissions, but it implies
that the submitter be willing to maintain the driver, which I'm not
qualified to do.  I haven't contacted Attansic to request a change to
the above support statement, because my past attempts to contact vendors
on matters of this tenor have been greeted with silence.

I would recommend the module author to see if they would GPL it.

Thank you for your reply.  I've contacted the author as you suggest.



IANAL but because they used GPL code in the driver, one could argue
that they created a derived work covered by GPL already. But I learned in
preschool it is always better to ask than take.

Not exactly.  What they wrote is covered by their copyright,
and there is no permission to use it in any way other than
how they licensed it.  Use of GPL code in their driver
would allow the author of the GPL code to sue them for
violating the license agreement, which would likely result
in the code being released under GPL.

IANAL either, but to paraphrase another preschool saying,
two wrongs (copyright violations) don't make a right
(legally licensed).

Re: Possible leak of multicast source filter sctructure #3a

2006-08-16 Thread David Stevens

Michal,
I believe the patch I submitted yesterday fixes this
problem, and in a simpler way.

+-DLS


Re: PATCH Fix bonding active-backup behavior for VLAN interfaces

2006-08-16 Thread Krzysztof Oledzki

On Mon, 14 Aug 2006, David Miller wrote:

From: Jay Vosburgh [EMAIL PROTECTED]
Date: Thu, 03 Aug 2006 18:01:35 -0700

In this case (bond0.555 above bond0 above eth0,eth1,etc),
skb_bond doesn't suppress duplicates because skb_bond is called with the
skb->dev set to the bond0.555 dev, not the ethX dev.  Non-accelerated
VLAN devices don't do this; they'll come in with skb->dev set to ethX
and will go through skb_bond as expected.

Ok, since __vlan_hwaccel_rx() bypasses the netif_receive_skb()
that would normally occur, we have to duplicate the bonding
drop checks.

The submitted patch put skb_bond() into if_vlan.h which is
definitely the wrong thing to do.  This is a generic operation
and therefore belongs in linux/netdevice.h at best.

Furthermore, we're only interested in the packet drop check,
so that's the only part of the logic we need to export,
the rest can stay private to skb_bond() in net/core/dev.c

Can the folks who can reproduce this try this patch?

Works for me, thank you.

Acked-by: Krzysztof Piotr Oledzki [EMAIL PROTECTED]

Best regards,

Krzysztof Olędzki

Re: [patch 32/41] lockdep: fix smc91x

2006-08-16 Thread Nicolas Pitre

On Mon, 14 Aug 2006, [EMAIL PROTECTED] wrote:

 From: Russell King [EMAIL PROTECTED]
 
 When booting using root-nfs, I'm seeing (independently) two lockdep dumps
 in the smc91x driver.  The patch below fixes both.  Both dumps look like
 real locking issues.
 
 Nico - please review and ack if you think the patch is correct.

The lock validator is rightfully complaining and the patch is correct.

Acked-by: Nicolas Pitre [EMAIL PROTECTED]


Nicolas

Re: [patch 32/41] lockdep: fix smc91x

2006-08-16 Thread Jeff Garzik


Nicolas Pitre wrote:

On Mon, 14 Aug 2006, [EMAIL PROTECTED] wrote:


From: Russell King [EMAIL PROTECTED]

When booting using root-nfs, I'm seeing (independently) two lockdep dumps
in the smc91x driver.  The patch below fixes both.  Both dumps look like
real locking issues.

Nico - please review and ack if you think the patch is correct.


The lock validator is rightfully complaining and the patch is correct.

Acked-by: Nicolas Pitre [EMAIL PROTECTED]


thanks

Jeff




Re: [PATCH 1/2]: powerpc/cell spidernet bottom half

2006-08-16 Thread Linas Vepstas

On Wed, Aug 16, 2006 at 12:30:29PM -0400, Jeff Garzik wrote:
 Linas Vepstas wrote:
 
 The recent set of low-watermark patches for the spider result in a
 
 Let's not reinvent NAPI, shall we...

?? 

I was under the impression that NAPI was for the receive side only.
This round of patches were for the transmit queue.

Let me describe the technical problem; perhaps there's some other
solution for it?  

The default socket send buffer size seems to be 128KB (cat
/proc/sys/net/core/wmem_default).  If a user application
writes more than 128 KB to a socket, the app is blocked by the
kernel until there's room in the socket for more.  At gigabit speeds,
a network card can drain 128KB in about a millisecond, or about
four times a jiffy (assuming HZ=250).  If the network card isn't
generating interrupts (and there are no other interrupts flying
around), then the tcp stack only wakes up once a jiffy, and so
the user app is scheduled only once a jiffy.  Thus, the max
bandwidth that the app can see is (HZ * wmem_default) bytes per
second, or about 250 Mbits/sec for my system.  Disappointing
for a gigabit adapter.
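
The arithmetic behind that ceiling can be checked with a small sketch.
The function names are made up for illustration and come from no driver;
the sketch assumes 128 KB means 131072 bytes and that the sender is woken
exactly once per jiffy.

```c
#include <stdint.h>

/* Upper bound on throughput, in bits per second, when the sending
 * application runs once per jiffy and can queue at most wmem_bytes
 * each time it runs. */
uint64_t wakeup_limited_bps(uint64_t hz, uint64_t wmem_bytes)
{
	return hz * wmem_bytes * 8;
}

/* Time, in whole microseconds, for a link running at link_bps bits
 * per second to drain wmem_bytes of queued data. */
uint64_t drain_time_us(uint64_t wmem_bytes, uint64_t link_bps)
{
	return wmem_bytes * 8 * 1000000 / link_bps;
}
```

With HZ=250 and 131072 bytes, wakeup_limited_bps() gives 262144000 bits
per second, in line with the "about 250 Mbits/sec" figure, and
drain_time_us() for a 1 Gbit/s link gives 1048 microseconds, roughly a
quarter of a 4 ms jiffy.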

There are three ways out of this:

(1) tell the sysadmin to
    echo 1234567 > /proc/sys/net/core/wmem_default
    which violates all the rules.

(2) Poll more frequently than once-a-jiffy. Arnd Bergmann and I
    got this working, using hrtimers. It worked pretty well,
    but seemed like a hack to me.

(3) Generate transmit queue low-watermark interrupts,
    which is an admittedly olde-fashioned but common
    engineering practice.  This round of patches implements
    this.
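
As a rough illustration of option (3), the low-watermark idea can be
modelled in a few lines. This is a toy model, not spidernet code: the
struct, its fields and the function name are all hypothetical, and real
hardware would raise the interrupt itself rather than return a flag.

```c
#include <stdbool.h>

/* Hypothetical TX queue bookkeeping; field names are illustrative only. */
struct tx_ring {
	unsigned int size;       /* total descriptors in the ring */
	unsigned int in_flight;  /* descriptors queued but not yet sent */
	unsigned int low_mark;   /* wake the sender when in_flight hits this */
};

/* Account for one completed descriptor. Returns true only on the
 * completion that brings the queue down to the low watermark, which is
 * the point where an interrupt would be raised so the sender can
 * refill the queue before it runs empty. */
bool tx_complete_one(struct tx_ring *ring)
{
	if (ring->in_flight == 0)
		return false;
	ring->in_flight--;
	return ring->in_flight == ring->low_mark;
}
```

The point of the watermark is that the sender is woken while there is
still data left to transmit, so the link does not go idle waiting for the
next jiffy.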


--linas



Re: [PATCH 1/2]: powerpc/cell spidernet bottom half

2006-08-16 Thread Jeff Garzik


Linas Vepstas wrote:

I was under the impression that NAPI was for the receive side only.


That depends on the driver implementation.

Jeff



Re: [PATCH 1/2]: powerpc/cell spidernet bottom half

2006-08-16 Thread David Miller

From: Jeff Garzik [EMAIL PROTECTED]
Date: Wed, 16 Aug 2006 16:34:31 -0400

 Linas Vepstas wrote:
  I was under the impression that NAPI was for the receive side only.

 That depends on the driver implementation.

What Jeff is trying to say is that TX reclaim can occur in
the NAPI poll routine, and in fact this is what the vast
majority of NAPI drivers do.

It also makes the locking simpler.

In practice, the best thing seems to be to put both RX and TX
work into ->poll() and have a very mild hw interrupt mitigation
setting programmed into the chip.
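
A toy model of that arrangement, for illustration only. It does not use
the kernel's NAPI interface; the struct and function names are
hypothetical, and the counters stand in for the descriptor rings a real
driver would walk.

```c
/* Hypothetical device state standing in for real descriptor rings. */
struct toy_nic {
	int tx_done_pending;  /* transmitted buffers waiting to be freed */
	int rx_pending;       /* received packets waiting to be processed */
	int irq_enabled;      /* nonzero when the device may interrupt */
};

/* Poll routine in the style described above: reclaim every completed TX
 * buffer first, then process at most `budget` RX packets. Interrupts are
 * re-enabled only when the RX work ran out before the budget did.
 * Returns the number of RX packets processed. */
int toy_poll(struct toy_nic *nic, int budget)
{
	int done = 0;

	nic->tx_done_pending = 0;  /* TX reclaim: free everything completed */

	while (done < budget && nic->rx_pending > 0) {
		nic->rx_pending--;
		done++;
	}

	if (done < budget)
		nic->irq_enabled = 1;  /* caught up, leave polling mode */

	return done;
}
```

Because TX reclaim and RX processing both run from the same poll context,
they never race with each other, which is one reading of the "simpler
locking" remark.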

I'm not familiar with the spidernet TX side interrupt capabilities
so I can't say whether that is something that can be directly
applied.  In fact, I get the impression that spidernet is limited
in some way and that's where all the strange approaches are coming
from :)

[PATCH?] tcp and delayed acks

2006-08-16 Thread Benjamin LaHaise

Hello folks,

In looking at a few benchmarks (especially netperf) run locally, it seems 
that tcp is unable to make full use of available CPU cycles as the sender 
is throttled waiting for ACKs to arrive.  The problem is exacerbated when 
the sender is using a small send buffer -- running netperf -C -c -- -s 1024 
shows a miserable 420Kbit/s at essentially 0% CPU usage.  Tests over gige 
are similarly constrained to a mere 96Mbit/s.

Since there is no way for the receiver to know if the sender is being 
blocked on transmit space, would it not make sense for the receiver to 
send out any delayed ACKs when it is clear that the receiving process is 
waiting for more data?  The patch below attempts this (I make no guarantees 
of its correctness with respect to the rest of the delayed ack code).  One 
point I'm still contemplating is what to do if the receiver is waiting in 
poll/select/epoll.

[All tests run with maxcpus=1 on a 2.67GHz Woodcrest system.]

Recv   Send   Send                          Utilization       Service Demand
Socket Socket Message  Elapsed              Send     Recv     Send     Recv
Size   Size   Size     Time     Throughput  local    remote   local    remote
bytes  bytes  bytes    secs.    10^6bits/s  % S      % S      us/KB    us/KB

Base (2.6.17-rc4):
default send buffer size
netperf -C -c
 87380  16384  16384    10.02    14127.79   99.90    99.90    0.579   0.579
 87380  16384  16384    10.02    13875.28   99.90    99.90    0.590   0.590
 87380  16384  16384    10.01    13777.25   99.90    99.90    0.594   0.594
 87380  16384  16384    10.02    13796.31   99.90    99.90    0.593   0.593
 87380  16384  16384    10.01    13801.97   99.90    99.90    0.593   0.593

netperf -C -c -- -s 1024
 87380   2048   2048    10.02        0.43   -0.04    -0.04   -7.105  -7.377
 87380   2048   2048    10.02        0.43   -0.01    -0.01   -2.337  -2.620
 87380   2048   2048    10.02        0.43   -0.03    -0.03   -5.683  -5.940
 87380   2048   2048    10.02        0.43   -0.05    -0.05   -9.373  -9.625
 87380   2048   2048    10.02        0.43   -0.05    -0.05   -9.373  -9.625

from a remote system over gigabit ethernet
netperf -H woody -C -c
 87380  16384  16384    10.03      936.23   19.32    20.47    3.382   1.791
 87380  16384  16384    10.03      936.27   17.67    20.95    3.091   1.833
 87380  16384  16384    10.03      936.17   19.18    20.77    3.356   1.817
 87380  16384  16384    10.03      936.26   18.22    20.26    3.188   1.773
 87380  16384  16384    10.03      936.26   17.35    20.54    3.036   1.797

netperf -H woody -C -c -- -s 1024
 87380   2048   2048    10.00       95.72   10.04     6.64   17.188   5.683
 87380   2048   2048    10.00       95.94    9.47     6.42   16.170   5.478
 87380   2048   2048    10.00       96.83    9.62     5.72   16.283   4.840
 87380   2048   2048    10.00       95.91    9.58     6.13   16.368   5.236
 87380   2048   2048    10.00       95.91    9.58     6.13   16.368   5.236


Patched:
default send buffer size
netperf -C -c
 87380  16384  16384    10.01    13923.16   99.90    99.90    0.588   0.588
 87380  16384  16384    10.01    13854.59   99.90    99.90    0.591   0.591
 87380  16384  16384    10.02    13840.42   99.90    99.90    0.591   0.591
 87380  16384  16384    10.01    13810.96   99.90    99.90    0.593   0.593
 87380  16384  16384    10.01    13771.27   99.90    99.90    0.594   0.594

netperf -C -c -- -s 1024
 87380   2048   2048    10.02     2473.48   99.90    99.90    3.309   3.309
 87380   2048   2048    10.02     2421.46   99.90    99.90    3.380   3.380
 87380   2048   2048    10.02     2288.07   99.90    99.90    3.577   3.577
 87380   2048   2048    10.02     2405.41   99.90    99.90    3.402   3.402
 87380   2048   2048    10.02     2284.41   99.90    99.90    3.582   3.582

netperf -H woody -C -c
 87380  16384  16384    10.04      936.10   23.04    21.60    4.033   1.890
 87380  16384  16384    10.03      936.20   18.52    21.06    3.242   1.843
 87380  16384  16384    10.03      936.52   17.61    21.05    3.082   1.841
 87380  16384  16384    10.03      936.18   18.24    20.73    3.191   1.814
 87380  16384  16384    10.03      936.28   18.30    21.04    3.202   1.841

netperf -H woody -C -c -- -s 1024
 87380   2048   2048    10.00      142.46   10.19     7.53   11.714   4.332
 87380   2048   2048    10.00      147.28    9.73     7.93   10.829   4.412
 87380   2048   2048    10.00      143.37   10.64     6.54   12.161   3.738
 87380   2048   2048    10.00      146.41    9.18     7.43   10.277   4.158
 87380   2048   2048    10.01      145.58    9.80     7.25   11.032   4.081

Comments/thoughts?

-ben
-- 
Time is of no importance, Mr. President, only life is important.
Don't Email: [EMAIL PROTECTED].


diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 934396b..e554ceb 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1277,8

PROBLEM: baycom_ser_fdx in kernel 2.6, transmit broken

2006-08-16 Thread Daniel Parthey

Last year I reported to the linux-hams list and to Tom Sailer
that I could not get AX25 to work using a serial baycom modem with
baycom_ser_fdx under vanilla kernel 2.6.11.6 although it ran fine
under kernel 2.4.30:
http://he.fi/archive/linux-hams/200505/0021.html

I got no answer, but other people experienced the same problem:
http://he.fi/archive/linux-hams/200508/0108.html

Now, I quickly checked again with Debian Kernel linux-image-2.6.17-2-686
(2.6.17-6) and the problem still persists.

Receiving packets works without any problems, but the transmitted
packets are erroneous.

I set up another testbox (situated in the same building) with a
soundcard and the soundmodem tools to analyse this problem.
I could only read some data after enabling pass all (no FCS check)
and then it looked like some bytes were missing.

Packet: fm D0CHZ0-3 to IATEp-14 via B0STL0-3,H2NOD0-1 I35  pid=65
  ^^

The correct frame header should look like:

Packet: fm ND0CHZ-0 to IGATE-0 via CB0STL-0,CH2NOD-0 I35  pid=65

As far as I can tell from looking at the waveforms, my packets are
complete and not cut off, neither at the beginning nor the end.

Packets received from other stations are decoded correctly.

[Sending station]

  - Vanilla kernel (kernel.org) 2.6.11.6 with the configuration
adapted from the debian kernel
  - Kernel compiled with gcc version 3.3.5
  - Processor Intel Celeron (Mendocino) 450 MHz

  - Serial configuration (Modem on ttyS0)
# setserial -g /dev/ttyS{0,1}
/dev/ttyS0, UART: unknown, Port: 0x03f8, IRQ: 4
/dev/ttyS1, UART: 16550A, Port: 0x02f8, IRQ: 3

  - Baycom startup options:
sethdlc -p -i bcsf0 mode ser12* io 0x3f8 irq 4
ifup bcsf0
sethdlc -i bcsf0 -a half

  - Module parameters for /etc/modules.conf
options baycom_ser_fdx mode=ser12* iobase=0x3f8 irq=4
alias bcsf0 baycom_ser_fdx
alias nr0 netrom
alias tty-ldisc-5  mkiss

  - Loaded modules:
hdlcdrv, baycom_ser_fdx, ax25, mkiss, af_packet

  - /etc/ax25/axports
cb  CBPORT-0  19200 255 7 CB-Funk
axudp AXUDP-0   19200 255 7 Netzwerk AX25UDP Link
lokal CH0CON-0  19200 255 7 Lokal

# cat /proc/ioports | grep baycom
03f8-03ff : baycom_ser_fdx

# cat /proc/interrupts 
           CPU0
  0:     285101          XT-PIC  timer
  1:        130          XT-PIC  i8042
  2:          0          XT-PIC  cascade
  4:      39576          XT-PIC  baycom_ser_fdx
  7:          0          XT-PIC  parport0
 11:       3473          XT-PIC  Intel ICH 82801AA, eth0
 12:       9693          XT-PIC  HiSax, uhci_hcd, eth1
 14:      17651          XT-PIC  ide0
 15:         13          XT-PIC  ide1
NMI:          0
LOC:          0
ERR:          0
MIS:          0


[Testbox]

- soundmodemconfig from debian package soundmodem version 0.9-1

The complete output of soundmodemconfig with pass all enabled:

Modulator: afsk Demodulator: afsk
Modulator: parameter bps value 1200
Modulator: parameter f0 value 1200
Modulator: parameter f1 value 2200
Modulator: parameter diffenc value 1
Demodulator: parameter bps value 1200
Demodulator: parameter f0 value 1200
Demodulator: parameter f1 value 2200
Demodulator: parameter diffdec value 1
Minimum sampling rate: 9600
Audio IO: type soundcard
sm[7286]: audio: starting /dev/dsp
sm[7286]: audio: forcing half duplex mode
sm[7286]: audio: sample rate 9600 input fragsz 256 numfrags 2 output fragsz 256 numfrags 256
Real sampling rate: 9600
passall: 1
passall: 0
passall: 1
Packet: fm ND0CHZ-0 to kIATE-0 via CB0M0C-8,2NOD0-8,*42'2:-7,722-4 I37^ pid=75
x), 1.79dp02 (JO60JT:ND0CHZ)
Internet-Node Chemnitz-Gruena - Mail DP9BOX

Packet: 
Packet: 
Packet: fm D0CHZ0-3 to IATEp-14 via B0STL0-3,H2NOD0-1 I35  pid=65
NetNode (Linux), 1.79dp02 (JO60JT:ND0CHZ)
Internet-Node Chemnitz-Gruena - Mail DP9BOX

Packet: fm NXHZ0C-2 to kIATE-0 via 0STL0C-8,2NOD0-8,*42'2:-7,722-4 I37  pid=75
x), 1.79dp0(JO60JT:NdHZ)
Internet-Node Chemnitz-Gruena - Mail DP9BOX

Packet: 
Joining TxThread
Joining RxThread
Releasing IO

I would be happy to provide further information if necessary.

Bye,
Daniel.
-- 
JabberID: [EMAIL PROTECTED]
http://de.wikipedia.org/wiki/Jabber

Re: [PATCH?] tcp and delayed acks

2006-08-16 Thread Stephen Hemminger

On Wed, 16 Aug 2006 16:55:32 -0400
Benjamin LaHaise [EMAIL PROTECTED] wrote:

 Hello folks,
 
 In looking at a few benchmarks (especially netperf) run locally, it seems 
 that tcp is unable to make full use of available CPU cycles as the sender 
 is throttled waiting for ACKs to arrive.  The problem is exacerbated when 
 the sender is using a small send buffer -- running netperf -C -c -- -s 1024 
 show a miserable 420Kbit/s at essentially 0% CPU usage.  Tests over gige 
 are similarly constrained to a mere 96Mbit/s.

What ethernet hardware? The defaults are often not big enough
for full speed on gigabit hardware. I need to increase rmem/wmem to allow
for more buffering. 

 Since there is no way for the receiver to know if the sender is being 
 blocked on transmit space, would it not make sense for the receiver to 
 send out any delayed ACKs when it is clear that the receiving process is 
 waiting for more data?  The patch below attempts this (I make no guarantees 
 of its correctness with respect to the rest of the delayed ack code).  One 
 point I'm still contemplating is what to do if the receiver is waiting in 
 poll/select/epoll.

The point of delayed ACKs was to merge the response and the ack on
request/response protocols like NFS or telnet. It does make sense to get
it out sooner though.


Re: [PATCH?] tcp and delayed acks

2006-08-16 Thread David Miller

From: Stephen Hemminger [EMAIL PROTECTED]
Date: Wed, 16 Aug 2006 12:11:12 -0700

 What ethernet hardware? The defaults are often not big enough
 for full speed on gigabit hardware. I need increase rmem/wmem to allow
 for more buffering. 

Current kernels allow the TCP send and receive socket buffers
to grow up to at least 4MB in size, how much more do you need?

tcp_{w,r}mem[2] will now have a value of at least 4MB, see
net/ipv4/tcp.c:tcp_init().
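
To check those limits on a running system, the three values in
/proc/sys/net/ipv4/tcp_wmem (or tcp_rmem) can be read from userspace.
The sketch below is only an illustration; the helper names are made up,
and the third field is the maximum referred to above as tcp_{w,r}mem[2].

```c
#include <stdio.h>

/* Parse one "min default max" line as found in
 * /proc/sys/net/ipv4/tcp_wmem or tcp_rmem.
 * Returns 0 on success, -1 if three numbers were not found. */
int parse_tcp_mem(const char *line, long vals[3])
{
	if (sscanf(line, "%ld %ld %ld", &vals[0], &vals[1], &vals[2]) != 3)
		return -1;
	return 0;
}

/* Read and parse the first line of the given proc file.
 * Returns 0 on success, -1 on any error. */
int read_tcp_mem(const char *path, long vals[3])
{
	char buf[128];
	FILE *f = fopen(path, "r");
	int ret;

	if (!f)
		return -1;
	ret = fgets(buf, sizeof(buf), f) ? parse_tcp_mem(buf, vals) : -1;
	fclose(f);
	return ret;
}
```

After a successful read_tcp_mem("/proc/sys/net/ipv4/tcp_wmem", vals),
vals[2] holds the upper bound the send buffer is allowed to grow to.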

Re: New driver questions: Attansic L1 gigabit NIC

2006-08-16 Thread Stephen Hemminger

On Wed, 16 Aug 2006 13:44:43 -0500
John Haller [EMAIL PROTECTED] wrote:

 Stephen Hemminger wrote:
  On Tue, 15 Aug 2006 18:23:19 -0500
  Jay Cliburn [EMAIL PROTECTED] wrote:
  
  Stephen Hemminger wrote:
  On Sun, 13 Aug 2006 19:11:42 -0500
  Jay Cliburn [EMAIL PROTECTED] wrote:
  ...snip...
  I've read the LKML FAQ regarding new driver submissions, but it implies
  that the submitter be willing to maintain the driver, which I'm not
  qualified to do.  I haven't contacted Attansic to request a change to
  the above support statement, because my past attempts to contact vendors
  on matters of this tenor have been greeted with silence.
  I would recommend the module author to see if they would GPL it.
  Thank you for your reply.  I've contacted the author as you suggest.
 
  
  IANAL but because they used GPL code in the driver, one could argue
  that they created a derived work covered by GPL already. But I learned in
  preschool it is always better to ask than take.
 Not exactly.  What they wrote is covered by their copyright,
 and there is no permission to use it in any way other than
 how they licensed it.  Use of GPL code in their driver
 would allow the author of the GPL code to sue them for
 violating the license agreement, which would likely result
 in the code being released under GPL.
 
 IANAL either, but to paraphrase another preschool saying,
 two wrongs (copyright violations) don't make a right
 (legally licensed).

In this case, though the vendor put a license file in the source that says GPL.
But they just forgot and put a different value in the MODULE_LICENSE().

Re: [PATCH?] tcp and delayed acks

2006-08-16 Thread Rick Jones


The point of delayed ack's was to merge the response and the ack on 
request/response
protocols like NFS or telnet. It does make sense to get it out sooner though.


Well, to a point at least - I wouldn't go so far as to suggest immediate 
ACKs.


However, I was always under the impression that ACKs were sent (in the 
mythical generic TCP stack) when:


a) there was data going the other way
b) there was a window update going the other way
c) the standalone ACK timer expired.

Does this patch then implement b?  Were there perhaps holes in the 
logic when things were smaller than the MTU/MSS?  (-v 2 on the netperf 
command line should show what the MSS was for the connection)
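
The three triggers (a), (b) and (c) listed above can be written down as
a single predicate. This is a sketch of the "mythical generic TCP stack"
only, not of the Linux code or of the patch under discussion; the struct
and field names are hypothetical.

```c
#include <stdbool.h>

/* Hypothetical per-connection state for a generic receiver. */
struct ack_state {
	bool ack_pending;        /* data received but not yet ACKed */
	bool have_data_to_send;  /* (a) data going the other way */
	bool window_update_due;  /* (b) window update going the other way */
	bool ack_timer_expired;  /* (c) standalone ACK timer expired */
};

/* Returns true when a pending ACK would go out under rules (a)-(c). */
bool should_send_ack(const struct ack_state *s)
{
	if (!s->ack_pending)
		return false;
	return s->have_data_to_send ||
	       s->window_update_due ||
	       s->ack_timer_expired;
}
```

In this model a receiver that has an ACK pending but none of the three
conditions true stays silent, which is the situation in which a sender
with a small send buffer would stall.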


rick jones

BTW, many points scored for including CPU utilization and service demand 
figures with the netperf output :)






[All tests run with maxcpus=1 on a 2.67GHz Woodcrest system.]

Recv   Send   Send                          Utilization       Service Demand
Socket Socket Message  Elapsed              Send     Recv     Send     Recv
Size   Size   Size     Time     Throughput  local    remote   local    remote
bytes  bytes  bytes    secs.    10^6bits/s  % S      % S      us/KB    us/KB

Base (2.6.17-rc4):
default send buffer size
netperf -C -c
 87380  16384  16384    10.02    14127.79   99.90    99.90    0.579   0.579
 87380  16384  16384    10.02    13875.28   99.90    99.90    0.590   0.590
 87380  16384  16384    10.01    13777.25   99.90    99.90    0.594   0.594
 87380  16384  16384    10.02    13796.31   99.90    99.90    0.593   0.593
 87380  16384  16384    10.01    13801.97   99.90    99.90    0.593   0.593


netperf -C -c -- -s 1024
 87380   2048   2048    10.02        0.43   -0.04    -0.04   -7.105  -7.377
 87380   2048   2048    10.02        0.43   -0.01    -0.01   -2.337  -2.620
 87380   2048   2048    10.02        0.43   -0.03    -0.03   -5.683  -5.940
 87380   2048   2048    10.02        0.43   -0.05    -0.05   -9.373  -9.625
 87380   2048   2048    10.02        0.43   -0.05    -0.05   -9.373  -9.625


Hmm, those CPU numbers don't look right.  I guess there must still be 
some holes in the procstat CPU method code in netperf :(




Re: [PATCH?] tcp and delayed acks

2006-08-16 Thread Benjamin LaHaise

On Wed, Aug 16, 2006 at 12:11:12PM -0700, Stephen Hemminger wrote:
  is throttled waiting for ACKs to arrive.  The problem is exacerbated when 
  the sender is using a small send buffer -- running netperf -C -c -- -s 1024 
  show a miserable 420Kbit/s at essentially 0% CPU usage.  Tests over gige 
  are similarly constrained to a mere 96Mbit/s.
 
 What ethernet hardware? The defaults are often not big enough
 for full speed on gigabit hardware. I need increase rmem/wmem to allow
 for more buffering. 

This is for small transmit buffer sizes over either loopback or 
e1000.  The artifact also shows up over localhost for somewhat larger buffer 
sizes, although it is much more difficult to get results that don't have 
large fluctuations because of other scheduling issues.  Pinning the tasks to 
CPUs is on my list of things to try, but something in the multiple variants 
of sched_setaffinity() has resulted in it being broken in netperf.

 The point of delayed ack's was to merge the response and the ack on 
 request/response
 protocols like NFS or telnet. It does make sense to get it out sooner though.

I would like to see what sort of effect this change has on higher latency.  
Ideally, quick ack mode should be doing the right thing, but it might need 
more input about the receiver's intent.

-ben
-- 
Time is of no importance, Mr. President, only life is important.
Don't Email: [EMAIL PROTECTED].

[PATCH 07/13] e1000: Allow NVM to setup LPLU for IGP2 and IGP3

2006-08-16 Thread Kok, Auke


Allow NVM to setup LPLU for IGP2 and IGP3. Only IGP needs LPLU D3
disabled during init here.

Signed-off-by: Jeff Kirsher [EMAIL PROTECTED]
Signed-off-by: Auke Kok [EMAIL PROTECTED]
---

 drivers/net/e1000/e1000_hw.c |   13 ++++++++-----
 1 files changed, 8 insertions(+), 5 deletions(-)

diff --git a/drivers/net/e1000/e1000_hw.c b/drivers/net/e1000/e1000_hw.c
index 583518a..3728f33 100644
--- a/drivers/net/e1000/e1000_hw.c
+++ b/drivers/net/e1000/e1000_hw.c
@@ -1324,11 +1324,14 @@ e1000_copper_link_igp_setup(struct e1000
         E1000_WRITE_REG(hw, LEDCTL, led_ctrl);
     }
 
-    /* disable lplu d3 during driver init */
-    ret_val = e1000_set_d3_lplu_state(hw, FALSE);
-    if (ret_val) {
-        DEBUGOUT("Error Disabling LPLU D3\n");
-        return ret_val;
+    /* The NVM settings will configure LPLU in D3 for IGP2 and IGP3 PHYs */
+    if (hw->phy_type == e1000_phy_igp) {
+        /* disable lplu d3 during driver init */
+        ret_val = e1000_set_d3_lplu_state(hw, FALSE);
+        if (ret_val) {
+            DEBUGOUT("Error Disabling LPLU D3\n");
+            return ret_val;
+        }
     }
 
     /* disable lplu d0 during driver init */



--
Auke Kok [EMAIL PROTECTED]

[PATCH 03/13] e1000: Same cosmetic fix as earlier sent out for IPV4.

2006-08-16 Thread Kok, Auke


Signed-off-by: Auke Kok [EMAIL PROTECTED]
---

 drivers/net/e1000/e1000_main.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/drivers/net/e1000/e1000_main.c b/drivers/net/e1000/e1000_main.c
index 627f224..ea3d504 100644
--- a/drivers/net/e1000/e1000_main.c
+++ b/drivers/net/e1000/e1000_main.c
@@ -2545,7 +2545,7 @@ e1000_tso(struct e1000_adapter *adapter,
             cmd_length = E1000_TXD_CMD_IP;
             ipcse = skb->h.raw - skb->data - 1;
 #ifdef NETIF_F_TSO_IPV6
-        } else if (skb->protocol == ntohs(ETH_P_IPV6)) {
+        } else if (skb->protocol == htons(ETH_P_IPV6)) {
             skb->nh.ipv6h->payload_len = 0;
             skb->h.th->check =
                 ~csum_ipv6_magic(&skb->nh.ipv6h->saddr,



--
Auke Kok [EMAIL PROTECTED]
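
Why this change is only cosmetic: on any given host htons() and ntohs()
perform the same operation (a byte swap on little-endian machines,
nothing on big-endian ones), so the old and new comparisons evaluate
identically. htons() is the one that documents intent, because
skb->protocol is kept in network byte order and the constant is being
converted to match it. A minimal userspace sketch of the equivalence
follows; the helper name is made up for illustration.

```c
#include <arpa/inet.h>
#include <stdbool.h>
#include <stdint.h>

/* Returns true when htons() and ntohs() produce the same result for v,
 * which holds on both little-endian and big-endian hosts. */
bool swap_is_symmetric(uint16_t v)
{
	return htons(v) == ntohs(v);
}
```

For example, swap_is_symmetric(0x86DD) holds for the IPv6 ethertype
(the value of ETH_P_IPV6), just as it does for any other 16-bit value.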

[PATCH 01/13] e100: Fix MDIO/MDIO-X

2006-08-16 Thread Kok, Auke


MDIO/MDIO-X was broken due to an incorrect erratum. Removing the workaround
code fixes it for affected NICs.

Signed-off-by: Jeff Kirsher [EMAIL PROTECTED]
Signed-off-by: Auke Kok [EMAIL PROTECTED]
---

 drivers/net/e100.c |   14 +++++---------
 1 files changed, 5 insertions(+), 9 deletions(-)

diff --git a/drivers/net/e100.c b/drivers/net/e100.c
index 91ef5f2..5de9843 100644
--- a/drivers/net/e100.c
+++ b/drivers/net/e100.c
@@ -1391,15 +1391,11 @@ static int e100_phy_init(struct nic *nic
        }
 
        if((nic->mac >= mac_82550_D102) || ((nic->flags & ich) &&
-          (mdio_read(netdev, nic->mii.phy_id, MII_TPISTATUS) & 0x8000))) {
-               /* enable/disable MDI/MDI-X auto-switching.
-                  MDI/MDI-X auto-switching is disabled for 82551ER/QM chips */
-               if((nic->mac == mac_82551_E) || (nic->mac == mac_82551_F) ||
-                  (nic->mac == mac_82551_10) || (nic->mii.force_media) ||
-                  !(nic->eeprom[eeprom_cnfg_mdix] & eeprom_mdix_enabled))
-                       mdio_write(netdev, nic->mii.phy_id, MII_NCONFIG, 0);
-               else
-                       mdio_write(netdev, nic->mii.phy_id, MII_NCONFIG, NCONFIG_AUTO_SWITCH);
+          (mdio_read(netdev, nic->mii.phy_id, MII_TPISTATUS) & 0x8000) &&
+               !(nic->eeprom[eeprom_cnfg_mdix] & eeprom_mdix_enabled))) {
+               /* enable/disable MDI/MDI-X auto-switching. */
+               mdio_write(netdev, nic->mii.phy_id, MII_NCONFIG,
+                               nic->mii.force_media ? 0 : NCONFIG_AUTO_SWITCH);
        }
 
        return 0;



--
Auke Kok [EMAIL PROTECTED]
