Re: bonding: cannot remove certain named devices
David Miller wrote:
> From: Bodo Eggert [EMAIL PROTECTED]
> Date: Wed, 16 Aug 2006 02:02:03 +0200
>
>> Stephen Hemminger [EMAIL PROTECTED] wrote:
>>
>>> IMHO idiots who put space's in filenames should be ignored. As long as the bonding code doesn't throw a fatal error, it has every right to return No such device to the fool.
>>
>> Maybe you should limit device names to eight uppercase characters and up to three characters extension, too. NOT!
>>
>> There is no reason to artificially impose limitations on device names, so don't do that.
>
> Are you willing to work to add the special case code necessary to handle whitespace characters in the device name over all of the kernel code and also all of the userland tools too?

But if you don't handle spaces in userspace, do you handle *, ?, [, ], $, , ', \ in userspace? Should the kernel also disable these (insane) device-name characters?

ciao
cate

> No? Great, I'm glad that's settled.

-
To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: proposal for new wireless configuration API
On Tue, 2006-08-15 at 12:29 -0400, Dan Williams wrote:

> We might want to take the time to fix up a few of the ambiguities of WEXT that we've encountered over the past few years:

Yes, I definitely agree.

> o Separate attributes for signal strength units; signed integer type for dBm, unsigned integer type for RSSI. One 8-bit var to represent both is just too confusing for people, evidently (which is true...)

Yes, agreed, they should be separated. In general, I think that one attribute should always have a single meaning and unit attached, except for explicitly unit-less attributes (number of frames or whatever), or attributes that explicitly have no stable unit (raw rssi).

> o Merge functionality ENCODE and ENCODEEXT handlers into one

Good one. I'm still not sure whether we should have an attribute for this, or a command. The whole key business seems rather complex and it might be good to have a command 'set key' with, say, a possible attribute for the MAC address of a pairwise key, a key material attribute and an IV attribute or something. Otherwise we'll end up parsing the contents of an attribute again, which rather sucks...

On the other hand, having it as a command won't allow the user to optimize setting the key and other things at once. I'm not too sure we should pay all that much attention to this problem though: it can't take forever, and typically a user with such a card won't be changing the key or parameters all the time, hence it's probably done only at boot or association time.

johannes
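To make the signal-strength point above concrete, here is a small userspace sketch in plain C (not kernel code). It shows how one raw byte yields two very different readings depending on whether it is taken as signed dBm or as unsigned RSSI, followed by one possible separated representation. All names (raw_as_dbm, raw_as_rssi, struct demo_signal) are hypothetical and invented for this illustration; they are not part of WEXT or of any proposed netlink attribute set.

```c
#include <stdint.h>

/* Read the raw byte as a signed dBm value (two's complement), written
 * without relying on implementation-defined narrowing casts. */
static int raw_as_dbm(uint8_t raw)
{
	return raw >= 128 ? (int)raw - 256 : (int)raw;
}

/* Read the very same byte as an unsigned RSSI on a device-specific scale. */
static unsigned int raw_as_rssi(uint8_t raw)
{
	return raw;
}

/* Hypothetical separated form: one field per meaning and unit, with
 * flags recording which of the two the driver actually supplied. */
struct demo_signal {
	int8_t  dbm;			/* signed, in dBm; valid if has_dbm */
	uint8_t rssi;			/* unsigned raw RSSI; valid if has_rssi */
	unsigned int has_dbm:1;
	unsigned int has_rssi:1;
};
```

With this, the byte 0xC4 reads as -60 under the dBm interpretation and as 196 under the RSSI interpretation, which is the ambiguity that separate attributes would remove.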
Re: proposal for new wireless configuration API
On Tue, 2006-08-15 at 15:59 -0400, Dan Williams wrote:

> Ok, so if somebody magically opens up new unlicensed ISM spectrum around, say, 7GHz, does that space get broken into channels and assigned specific numbers by the IEEE? I know there are stable channel #s for abg range. What about the future? [1] Can we guarantee that whenever new spectrum opens up that future 802.11 products may use, that the mappings are well-defined? That was my main question.

I'd expect them to actually break it into channels and assign channel numbers. Or whoever creates the hardware first does it, and those numbers then get adopted in the year-long specification process ;)

Besides, if we really really really needed something else later for whatever weird reason, we could add a new attribute for those cases, and have it reject the channel attribute then :)

johannes
Re: proposal for new wireless configuration API
On Tue, 2006-08-15 at 12:14 -0400, Luis R. Rodriguez wrote:

> Basically redo WE completely from scratch using netlink.

Not quite, I hope! As Dan mentioned, for example all the key management stuff ought to be consolidated. Same for some other things.

> For per packet this makes sense, for modification of all packets I think configfs would be more suitable. Then again this is just an addition, I'm not disagreeing here with the approach. The same goes for several common wireless settings -- we could also have a configfs directory for each device which would allow manual read/writing for setting/getting certain values; mind you that configfs does allow for setting/getting multiple values at the same time, for those of you who have wondered. This could just as easily go in as a wrapper for configfs->new NL API.

Yeah, that might not even be undesirable. But we also need per-packet controls, and a bunch of them. The current situation with a special header in front of a packet injected into the management interface isn't too great. I'm not sure what kind of generic packet sending parameters we have. Bitrate obviously, and all the other possible attributes...

>> NL80211_ATTR_IFINDEX: index of interface to use

This was just meant to be the ifindex of the eth0 or whatever device.

>> (NL80211_ATTR_PHYIDX: (later) index of wiphy to configure)
>
> Do you mean to have a wireless device have its own device index, separate from the netdevice index? Can you elaborate a bit on this?

Well, the d80211 stack gives each driver backend phyN in /sys/class/ieee80211/. If we ever want to get rid of the wmasterN interface, we probably want to allow configuring without an ifindex because the physical device might not have any network devices attached at that time. I'm not exactly sure if it really makes sense to configure the device then, but hey.

> With WE we were restricted to the number of attributes possibly changed by the number of ioctls and later by sub-ioctl hack restrictions. What restrictions are we to face with this?

We can have tons of attributes, it's a 16-bit field. I think that should be sufficient :)

> Do we want to map each attribute directly to the respective WE ioctl number to make it easy to do backward compatibility?

No, because that would mean having very large attribute numbers up-front, and due to the way genetlink works there is memory allocated for each possible attribute. Hence, attribute numbers should be allocated in an increasing fashion starting from 1, and not be sparse.

johannes
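The argument for dense attribute numbering can be sketched in userspace C. The model below assumes only what the message states, namely that a per-attribute table is sized by the highest attribute number. The enum, the struct demo_policy entry and table_bytes() are hypothetical names for this illustration and are not the genetlink API; the one external fact used is that WEXT ioctl numbers start at 0x8B00 (SIOCIWFIRST).

```c
#include <stddef.h>

/* Hypothetical attribute numbering: dense, increasing, starting at 1. */
enum demo_attr {
	DEMO_ATTR_UNSPEC,		/* 0 is left unused */
	DEMO_ATTR_IFINDEX,
	DEMO_ATTR_CHANNEL,
	DEMO_ATTR_KEY,
	__DEMO_ATTR_AFTER_LAST
};
#define DEMO_ATTR_MAX (__DEMO_ATTR_AFTER_LAST - 1)

/* Stand-in for whatever per-attribute record the receiver keeps. */
struct demo_policy {
	unsigned char  type;
	unsigned short len;
};

/* A table indexed directly by attribute number needs max_attr + 1 slots,
 * no matter how many of those numbers are actually in use. */
static size_t table_bytes(unsigned int max_attr)
{
	return ((size_t)max_attr + 1) * sizeof(struct demo_policy);
}
```

Under this model, table_bytes(DEMO_ATTR_MAX) covers four slots, while reusing a WEXT ioctl number such as 0x8B00 as an attribute number would require table_bytes(0x8B00), several thousand times larger, for the same handful of attributes.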
Re: hard_start_xmit context
Herbert Xu wrote:
> kiran kandi [EMAIL PROTECTED] wrote:
>> In what context hard_start_xmit function is called. Is it called in soft irq or a processes context.
>
> softirq
>
>> Also can you call kfree_skb in soft irq context.
>
> Yes. Don't do it in hard irq context though.

FWIW there is also Documentation/networking/netdevices.txt where this sort of stuff is documented.

Jeff
[PATCH 2.6.17] net/ipv6/udp.c: remove duplicate udp_get_port code
UDPv4 and UDPv6 use an almost identical version of the get_port function, which is unnecessary since the (long) code differs in only one if-statement.

By disentangling the if statement and adding v4/v6 checks appropriately, this code duplication can be removed and further

 * udp_port_rover can stay in net/ipv4/udp.c
 * udp_lport_inuse can become static in net/ipv4/udp.c (only called by udp_get_port)

The text below discusses the re-arrangement of the if-statement. This is implemented by enclosed patch (works both on stable and Torvalds' release). The patch also dispenses with a goto statement whose jump label is referenced only once.

D i s c u s s i o n

The following compares the statements for udp_v{4,6}_get_port.

A) In udp_v4_get_port():
========================
	if (inet2->num == snum &&
	    sk2 != sk &&
	    !ipv6_only_sock(sk2) &&
	    (!sk2->sk_bound_dev_if ||
	     !sk->sk_bound_dev_if ||
	     sk2->sk_bound_dev_if == sk->sk_bound_dev_if) &&
	    (!inet2->rcv_saddr ||
	     !inet->rcv_saddr ||
	     inet2->rcv_saddr == inet->rcv_saddr) &&
	    (!sk2->sk_reuse || !sk->sk_reuse) )
		goto fail;

This function is called from IPv4 context, hence sk->sk_family == PF_INET.

B) In udp_v6_get_port():
========================
	if (inet_sk(sk2)->num == snum &&
	    sk2 != sk &&
	    (!sk2->sk_bound_dev_if ||
	     !sk->sk_bound_dev_if ||
	     sk2->sk_bound_dev_if == sk->sk_bound_dev_if) &&
	    (!sk2->sk_reuse || !sk->sk_reuse) &&
	    ipv6_rcv_saddr_equal(sk, sk2) )
		goto fail;

This function is called from IPv6 context, hence sk->sk_family == PF_INET6.

Common denominator:
===================
By re-ordering some of the last literals, both functions share the following conjunction of conditions:

	if (inet_sk(sk2)->num == snum &&
	    sk2 != sk &&
	    (!sk2->sk_bound_dev_if ||
	     !sk->sk_bound_dev_if ||
	     sk2->sk_bound_dev_if == sk->sk_bound_dev_if) &&
	    (!sk2->sk_reuse || !sk->sk_reuse) )

To make the function applicable to both v4 and v6 contexts, a second if statement is added, which branches according to sk's sk_family.
Signed-off-by: Gerrit Renker [EMAIL PROTECTED]
---
 include/net/udp.h |   17 +--
 net/ipv4/udp.c    |   57 --
 net/ipv6/udp.c    |   79 +
 3 files changed, 38 insertions(+), 115 deletions(-)

diff --git a/include/net/udp.h b/include/net/udp.h
index 766fba1..69d4288 100644
--- a/include/net/udp.h
+++ b/include/net/udp.h
@@ -30,25 +30,9 @@ #include <linux/seq_file.h>
 
 #define UDP_HTABLE_SIZE		128
 
-/* udp.c: This needs to be shared by v4 and v6 because the lookup
- *        and hashing code needs to work with different AF's yet
- *        the port space is shared.
- */
 extern struct hlist_head udp_hash[UDP_HTABLE_SIZE];
 extern rwlock_t udp_hash_lock;
 
-extern int udp_port_rover;
-
-static inline int udp_lport_inuse(u16 num)
-{
-	struct sock *sk;
-	struct hlist_node *node;
-
-	sk_for_each(sk, node, &udp_hash[num & (UDP_HTABLE_SIZE - 1)])
-		if (inet_sk(sk)->num == num)
-			return 1;
-	return 0;
-}
 
 /* Note: this must match 'valbool' in sock_setsockopt */
 #define UDP_CSUM_NOXMIT		1
@@ -63,6 +47,7 @@ extern struct proto udp_prot;
 
 struct sk_buff;
 
+extern int	udp_get_port(struct sock *sk, unsigned short snum);
 extern void	udp_err(struct sk_buff *, u32);
 
 extern int	udp_sendmsg(struct kiocb *iocb, struct sock *sk,
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 3f93292..eb3aa82 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -102,6 +102,7 @@ #include <net/protocol.h>
 #include <linux/skbuff.h>
 #include <linux/proc_fs.h>
 #include <linux/seq_file.h>
+#include <net/addrconf.h>
 #include <net/sock.h>
 #include <net/udp.h>
 #include <net/icmp.h>
@@ -119,10 +120,20 @@ DEFINE_SNMP_STAT(struct udp_mib, udp_sta
 struct hlist_head udp_hash[UDP_HTABLE_SIZE];
 DEFINE_RWLOCK(udp_hash_lock);
 
-/* Shared by v4/v6 udp. */
 int udp_port_rover;
 
-static int udp_v4_get_port(struct sock *sk, unsigned short snum)
+static inline int udp_lport_inuse(u16 num)
+{
+	struct sock *sk;
+	struct hlist_node *node;
+
+	sk_for_each(sk, node, &udp_hash[num & (UDP_HTABLE_SIZE - 1)])
+		if (inet_sk(sk)->num == num)
+			return 1;
+	return 0;
+}
+
+int udp_get_port(struct sock *sk, unsigned short snum)
 {
 	struct hlist_node *node;
 	struct sock *sk2;
@@ -151,11 +162,10 @@ static int udp_v4_get_port(struct sock *
 			}
 			size = 0;
 			sk_for_each(sk2, node, list)
-				if (++size >= best_size_so_far)
-
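The re-arranged if-statement from the patch discussion can be modelled in userspace C as a shared conjunction followed by a family-specific branch. This is only a sketch under stated assumptions: struct demo_sock and its fields are simplified stand-ins for struct sock and inet_sock, the DEMO_PF_* constants are illustrative, and demo_v6_saddr_equal() is a deliberately naive placeholder that does not reproduce the real ipv6_rcv_saddr_equal().

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define DEMO_PF_INET	2
#define DEMO_PF_INET6	10

/* Hypothetical, flattened stand-in for struct sock plus inet_sock. */
struct demo_sock {
	int      family;
	uint16_t num;			/* bound local port */
	int      bound_dev_if;		/* 0 means not bound to a device */
	uint32_t rcv_saddr;		/* IPv4 address, 0 means wildcard */
	uint8_t  rcv_saddr6[16];	/* IPv6 address, all-zero means wildcard */
	bool     reuse;
	bool     ipv6only;
};

/* Naive placeholder for the IPv6 address comparison. */
static bool demo_v6_saddr_equal(const struct demo_sock *a,
				const struct demo_sock *b)
{
	static const uint8_t any[16];

	return !memcmp(a->rcv_saddr6, any, 16) ||
	       !memcmp(b->rcv_saddr6, any, 16) ||
	       !memcmp(a->rcv_saddr6, b->rcv_saddr6, 16);
}

/* True if already-bound socket sk2 blocks sk from using port snum. */
static bool demo_port_conflict(const struct demo_sock *sk,
			       const struct demo_sock *sk2, uint16_t snum)
{
	/* First if: the conjunction shared by the v4 and v6 variants. */
	if (!(sk2->num == snum && sk2 != sk &&
	      (!sk2->bound_dev_if || !sk->bound_dev_if ||
	       sk2->bound_dev_if == sk->bound_dev_if) &&
	      (!sk2->reuse || !sk->reuse)))
		return false;

	/* Second if: the part that depends on sk's address family. */
	if (sk->family == DEMO_PF_INET)
		return !sk2->ipv6only &&
		       (!sk2->rcv_saddr || !sk->rcv_saddr ||
			sk2->rcv_saddr == sk->rcv_saddr);
	return demo_v6_saddr_equal(sk, sk2);
}
```

In this model two IPv4 sockets on the same port conflict when either address is the wildcard, and do not conflict when both are bound to distinct addresses.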
[PATCH2 1/1] network memory allocator.
Hello.

Network tree allocator can be used to allocate memory for all network operations from any context.

Changes from previous release:
 * added dynamically grown cache
 * changed some inline issues
 * reduced code size
 * removed AVL tree implementation from the sources
 * changed minimum allocation size to l1 cache line size (some arches require that)
 * removed skb->__tsize parameter
 * added a lot of comments
 * a lot of small cleanups

Trivial epoll based web server achieved more than 2450 requests per second with this version (usual numbers are about 1600-1800 when usual kmalloc is used for all network operations).

Network allocator design and implementation notes as well as performance and fragmentation analysis can be found at project homepage:
http://tservice.net.ru/~s0mbre/old/?section=projects&item=nta

Signed-off-by: Evgeniy Polyakov [EMAIL PROTECTED]

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 19c96d4..f550f95 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -327,6 +327,10 @@ #include <linux/slab.h>
 
 #include <asm/system.h>
 
+extern void *avl_alloc(unsigned int size, gfp_t gfp_mask);
+extern void avl_free(void *ptr, unsigned int size);
+extern int avl_init(void);
+
 extern void kfree_skb(struct sk_buff *skb);
 extern void __kfree_skb(struct sk_buff *skb);
 extern struct sk_buff *__alloc_skb(unsigned int size,
diff --git a/net/core/Makefile b/net/core/Makefile
index 2645ba4..d86d468 100644
--- a/net/core/Makefile
+++ b/net/core/Makefile
@@ -10,6 +10,8 @@ obj-$(CONFIG_SYSCTL) += sysctl_net_core.
 obj-y		     += dev.o ethtool.o dev_mcast.o dst.o netevent.o \
 			neighbour.o rtnetlink.o utils.o link_watch.o filter.o
 
+obj-y += alloc/
+
 obj-$(CONFIG_XFRM) += flow.o
 obj-$(CONFIG_SYSFS) += net-sysfs.o
 obj-$(CONFIG_NET_DIVERT) += dv.o
diff --git a/net/core/alloc/Makefile b/net/core/alloc/Makefile
new file mode 100644
index 000..21b7c51
--- /dev/null
+++ b/net/core/alloc/Makefile
@@ -0,0 +1,3 @@
+obj-y		:= allocator.o
+
+allocator-y	:= avl.o
diff --git a/net/core/alloc/avl.c b/net/core/alloc/avl.c
new file mode 100644
index 000..c404cbe
--- /dev/null
+++ b/net/core/alloc/avl.c
@@ -0,0 +1,651 @@
+/*
+ * 	avl.c
+ *
+ * 2006 Copyright (c) Evgeniy Polyakov [EMAIL PROTECTED]
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
+ */
+
+#include <linux/kernel.h>
+#include <linux/types.h>
+#include <linux/string.h>
+#include <linux/errno.h>
+#include <linux/slab.h>
+#include <linux/spinlock.h>
+#include <linux/percpu.h>
+#include <linux/list.h>
+#include <linux/mm.h>
+#include <linux/skbuff.h>
+
+#include "avl.h"
+
+static struct avl_allocator_data avl_allocator[NR_CPUS];
+
+/*
+ * Get node pointer from address.
+ */
+static inline struct avl_node *avl_get_node_ptr(unsigned long ptr)
+{
+	struct page *page = virt_to_page(ptr);
+	struct avl_node *node = (struct avl_node *)(page->lru.next);
+
+	return node;
+}
+
+/*
+ * Set node pointer for page for given address.
+ */
+static void avl_set_node_ptr(unsigned long ptr, struct avl_node *node, int order)
+{
+	int nr_pages = 1<<order, i;
+	struct page *page = virt_to_page(ptr);
+
+	for (i=0; i<nr_pages; ++i) {
+		page->lru.next = (void *)node;
+		page++;
+	}
+}
+
+/*
+ * Get allocation CPU from address.
+ */
+static inline int avl_get_cpu_ptr(unsigned long ptr)
+{
+	struct page *page = virt_to_page(ptr);
+	int cpu = (int)(unsigned long)(page->lru.prev);
+
+	return cpu;
+}
+
+/*
+ * Set allocation cpu for page for given address.
+ */
+static void avl_set_cpu_ptr(unsigned long ptr, int cpu, int order)
+{
+	int nr_pages = 1<<order, i;
+	struct page *page = virt_to_page(ptr);
+
+	for (i=0; i<nr_pages; ++i) {
+		page->lru.prev = (void *)(unsigned long)cpu;
+		page++;
+	}
+}
+
+/*
+ * Convert pointer to node's value.
+ * Node's value is a start address for contiguous chunk bound to given node.
+ */
+static inline unsigned long avl_ptr_to_value(void *ptr)
+{
+	struct avl_node *node = avl_get_node_ptr((unsigned long)ptr);
+	return node->value;
+}
+
+/*
+ * Convert pointer into offset from start address of the contiguous chunk
+ *
Re: hard_start_xmit context
Jeff Garzik wrote:
> Herbert Xu wrote:
>> kiran kandi [EMAIL PROTECTED] wrote:
>>> In what context hard_start_xmit function is called. Is it called in soft irq or a processes context.
>>
>> softirq

It can be process too...doesn't pktgen call it directly?

Ben
Re: [PATCH 1/1] network memory allocator.
On Wed, Aug 16, 2006 at 09:35:46AM +0400, Evgeniy Polyakov wrote:
> On Tue, Aug 15, 2006 at 10:21:22PM +0200, Arnd Bergmann ([EMAIL PROTECTED]) wrote:
>> On Monday 14 August 2006 13:04, Evgeniy Polyakov wrote:
>>> * full per CPU allocation and freeing (objects are never freed on different CPU)
>>
>> Many of your data structures are per cpu, but your underlying allocations are all using regular kzalloc/__get_free_page/__get_free_pages functions. Shouldn't these be converted to calls to kmalloc_node and alloc_pages_node in order to get better locality on NUMA systems?
>>
>> OTOH, we have recently experimented with doing the dev_alloc_skb calls with affinity to the NUMA node that holds the actual network adapter, and got significant improvements on the Cell blade server. That of course may be a conflicting goal since it would mean having per-cpu per-node page pools if any CPU is supposed to be able to allocate pages for use as DMA buffers on any node.
>
> Doesn't alloc_pages() automatically switch to alloc_pages_node() or alloc_pages_current()?

That's not what's wanted. If you have a slow interconnect you always want to allocate memory on the node the network device is attached to.
Re: [PATCH 2.6.17] net/ipv6/udp.c: remove duplicate udp_get_port code
Hello.

In article [EMAIL PROTECTED] (at Wed, 16 Aug 2006 08:46:48 +0100), [EMAIL PROTECTED] says:

> UDPv4 and UDPv6 use an almost identical version of the get_port function, which is unnecessary since the (long) code differs in only one if-statement.
:
:
> +#if defined(CONFIG_IPV6) || defined(CONFIG_IPV6_MODULE)
> +	else if(sk->sk_family == PF_INET6 &&
> +		ipv6_rcv_saddr_equal(sk, sk2) )
> +		goto fail;
> +	}
> +#endif

This is not good because you cannot link ipv6_rcv_saddr_equal() if you are compiling IPv6 as module.

How about retaining udp_v{4,6}_get_port() and call common udp_get_port() from both functions?

--yoshfuji
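One way the suggestion above could look, sketched in userspace C: the common routine takes the address comparison as a function pointer, so the shared code never references an IPv6 symbol directly, and each family keeps a thin wrapper. Passing a callback is an assumption made for this illustration; the message only proposes keeping the two wrappers around a common function. All demo_* names are hypothetical, and the array of bound sockets stands in for the UDP hash table.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, minimal stand-in for a UDP socket. */
struct demo_sock {
	uint16_t num;		/* bound local port, 0 if unbound */
	uint32_t rcv_saddr;	/* IPv4 address, 0 means wildcard */
};

typedef bool (*demo_saddr_cmp_t)(const struct demo_sock *,
				 const struct demo_sock *);

/* Common part: family-agnostic, knows addresses only through cmp(). */
static int demo_get_port(struct demo_sock *sk, uint16_t snum,
			 struct demo_sock **bound, size_t n,
			 demo_saddr_cmp_t cmp)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (bound[i] != sk && bound[i]->num == snum &&
		    cmp(sk, bound[i]))
			return -1;	/* port already taken for this address */
	sk->num = snum;
	return 0;
}

/* IPv4-specific comparison; an IPv6 module would supply its own. */
static bool demo_v4_saddr_equal(const struct demo_sock *a,
				const struct demo_sock *b)
{
	return !a->rcv_saddr || !b->rcv_saddr || a->rcv_saddr == b->rcv_saddr;
}

/* Retained per-family wrapper that calls the common function. */
static int demo_v4_get_port(struct demo_sock *sk, uint16_t snum,
			    struct demo_sock **bound, size_t n)
{
	return demo_get_port(sk, snum, bound, n, demo_v4_saddr_equal);
}
```

The design point is that the IPv6 comparison would live next to the IPv6 wrapper, inside the module, so building IPv6 as a module leaves no unresolved symbol in the common code.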
Re: [PATCH 1/1] network memory allocator.
On Wed, Aug 16, 2006 at 09:48:08AM +0100, Christoph Hellwig ([EMAIL PROTECTED]) wrote:
>> Doesn't alloc_pages() automatically switch to alloc_pages_node() or alloc_pages_current()?
>
> That's not what's wanted. If you have a slow interconnect you always want to allocate memory on the node the network device is attached to.

There is a drawback here: if data was allocated on the CPU the NIC is closer to and then processed on a different CPU, it will cost more than in the case where the buffer was allocated on the CPU where it will be processed.

But from the other point of view, most adapters preallocate a set of skbs, and with the help of MSI-X it will be possible to bind the irq and the processing to the CPU where the data was originally allocated.

So I would like to know how to determine which node should be used for allocation. Changes of __get_user_pages() to alloc_pages_node() are trivial.

--
	Evgeniy Polyakov
Re: [PATCH 1/1] network memory allocator.
From: Evgeniy Polyakov [EMAIL PROTECTED]
Date: Wed, 16 Aug 2006 13:00:31 +0400

> So I would like to know how to determine which node should be used for allocation. Changes of __get_user_pages() to alloc_pages_node() are trivial.

netdev_alloc_skb() knows the netdevice, and therefore you can obtain the struct device referenced inside of the netdev, and therefore you can determine the node using the struct device.

Christoph is working on adding support for this using the existing allocator.
Re: [PATCH 1/1] network memory allocator.
On Wed, Aug 16, 2006 at 02:05:03AM -0700, David Miller wrote:
> From: Evgeniy Polyakov [EMAIL PROTECTED]
> Date: Wed, 16 Aug 2006 13:00:31 +0400
>
>> So I would like to know how to determine which node should be used for allocation. Changes of __get_user_pages() to alloc_pages_node() are trivial.
>
> netdev_alloc_skb() knows the netdevice, and therefore you can obtain the struct device referenced inside of the netdev, and therefore you can determine the node using the struct device.

It's not that easy unfortunately. I did what you describe above in my first prototype but then found out the hard way that the struct device in the netdevice can be a non-pci one, e.g. for pcmcia. In that case the kernel will crash on you because you can only get the node information for pci devices.

My current patchkit adds an int node member to struct net_device instead. I can repost the patchkit on top of -mm (which has the required slab memory leak tracking changes) if anyone cares.
Re: [PATCH 1/1] network memory allocator.
On Wed, Aug 16, 2006 at 10:10:29AM +0100, Christoph Hellwig ([EMAIL PROTECTED]) wrote:
> On Wed, Aug 16, 2006 at 02:05:03AM -0700, David Miller wrote:
>> From: Evgeniy Polyakov [EMAIL PROTECTED]
>> Date: Wed, 16 Aug 2006 13:00:31 +0400
>>
>>> So I would like to know how to determine which node should be used for allocation. Changes of __get_user_pages() to alloc_pages_node() are trivial.
>>
>> netdev_alloc_skb() knows the netdevice, and therefore you can obtain the struct device referenced inside of the netdev, and therefore you can determine the node using the struct device.
>
> It's not that easy unfortunately. I did what you describe above in my first prototype but then found out the hard way that the struct device in the netdevice can be a non-pci one, e.g. for pcmcia. In that case the kernel will crash on you because you can only get the node information for pci devices.
>
> My current patchkit adds an int node member to struct net_device instead. I can repost the patchkit on top of -mm (which has the required slab memory leak tracking changes) if anyone cares.

Can we check device->bus_type or device->driver->bus against pci_bus_type for that?

--
	Evgeniy Polyakov
Re: [PATCH 1/1] network memory allocator.
On Wed, Aug 16, 2006 at 01:32:02PM +0400, Evgeniy Polyakov wrote:
> On Wed, Aug 16, 2006 at 10:10:29AM +0100, Christoph Hellwig ([EMAIL PROTECTED]) wrote:
>> On Wed, Aug 16, 2006 at 02:05:03AM -0700, David Miller wrote:
>>> From: Evgeniy Polyakov [EMAIL PROTECTED]
>>> Date: Wed, 16 Aug 2006 13:00:31 +0400
>>>
>>>> So I would like to know how to determine which node should be used for allocation. Changes of __get_user_pages() to alloc_pages_node() are trivial.
>>>
>>> netdev_alloc_skb() knows the netdevice, and therefore you can obtain the struct device referenced inside of the netdev, and therefore you can determine the node using the struct device.
>>
>> It's not that easy unfortunately. I did what you describe above in my first prototype but then found out the hard way that the struct device in the netdevice can be a non-pci one, e.g. for pcmcia. In that case the kernel will crash on you because you can only get the node information for pci devices.
>>
>> My current patchkit adds an int node member to struct net_device instead. I can repost the patchkit on top of -mm (which has the required slab memory leak tracking changes) if anyone cares.
>
> Can we check device->bus_type or device->driver->bus against pci_bus_type for that?

We could, but I'd rather waste 4 bytes in struct net_device than having such ugly warts in common code.
Re: [PATCH 1/1] network memory allocator.
On Wed, Aug 16, 2006 at 01:00:31PM +0400, Evgeniy Polyakov wrote:
> On Wed, Aug 16, 2006 at 09:48:08AM +0100, Christoph Hellwig ([EMAIL PROTECTED]) wrote:
>>> Doesn't alloc_pages() automatically switch to alloc_pages_node() or alloc_pages_current()?
>>
>> That's not what's wanted. If you have a slow interconnect you always want to allocate memory on the node the network device is attached to.
>
> There is a drawback here: if data was allocated on the CPU the NIC is closer to and then processed on a different CPU, it will cost more than in the case where the buffer was allocated on the CPU where it will be processed.
>
> But from the other point of view, most adapters preallocate a set of skbs, and with the help of MSI-X it will be possible to bind the irq and the processing to the CPU where the data was originally allocated.

The case we've benchmarked (spidernet) is the common preallocated case. For allocate on demand I'd expect the slab allocator to get things right. We do have the irq on the right node, not through MSI but due to the odd interrupt architecture of the Cell blades.

> So I would like to know how to determine which node should be used for allocation. Changes of __get_user_pages() to alloc_pages_node() are trivial.

The patches I have add the node field to struct net_device and use it. It's set via alloc_netdev_node, a function I add, and for the normal case of PCI adapters the node argument comes from pcibus_to_node(). It's arguable we should add an alloc_foodeve_pdev variant that hides that detail, but I'm not entirely sure whether it's worth the effort.
Re: [PATCH 1/1] network memory allocator.
From: Christoph Hellwig [EMAIL PROTECTED]
Date: Wed, 16 Aug 2006 10:38:37 +0100

> We could, but I'd rather waste 4 bytes in struct net_device than having such ugly warts in common code.

Why not instead have struct device store some default node value? The node decision will be sub-optimal on non-pci but it won't crash.
Re: [PATCH 1/1] network memory allocator.
On Wed, Aug 16, 2006 at 02:40:08AM -0700, David Miller wrote:
> From: Christoph Hellwig [EMAIL PROTECTED]
> Date: Wed, 16 Aug 2006 10:38:37 +0100
>
>> We could, but I'd rather waste 4 bytes in struct net_device than having such ugly warts in common code.
>
> Why not instead have struct device store some default node value? The node decision will be sub-optimal on non-pci but it won't crash.

Right now we don't even have the node stored in the pci_dev structure but only arch-specific accessor functions/macros. We could change those to take a struct device instead and make them return -1 for everything non-pci as we already do in architectures that don't support those helpers. -1 means 'any node' for all common allocators.
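The fallback described above can be sketched as a small userspace model: a lookup that returns the device's node when it is known and -1 ('any node') otherwise, so callers never need a bus-type check. The types and names below (struct demo_device, demo_dev_to_node, DEMO_NODE_ANY) are hypothetical and exist only for this illustration; they are not the kernel's struct device or any existing helper.

```c
#include <stddef.h>

#define DEMO_NODE_ANY	(-1)	/* mirrors the "-1 means any node" convention */

enum demo_bus { DEMO_BUS_PCI, DEMO_BUS_PCMCIA };

/* Hypothetical stand-in for a device with bus information. */
struct demo_device {
	enum demo_bus bus;
	int pci_node;		/* meaningful only when bus == DEMO_BUS_PCI */
};

/* Return the NUMA node for the device, or DEMO_NODE_ANY when the node
 * is unknown (no device, or a bus without node information). */
static int demo_dev_to_node(const struct demo_device *dev)
{
	if (dev == NULL || dev->bus != DEMO_BUS_PCI)
		return DEMO_NODE_ANY;
	return dev->pci_node;
}
```

A caller would pass the result straight to a node-aware allocator: a PCI adapter on node 2 yields 2, while a PCMCIA device yields -1 and simply gets default placement instead of crashing.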
Re: hard_start_xmit context
On Wed, Aug 16, 2006 at 01:48:56AM -0700, Ben Greear wrote:
>> softirq
>
> It can be process too...doesn't pktgen call it directly?

Only with BH disabled.
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmVHI~} [EMAIL PROTECTED]
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
[PATCH 7/9] net_device seq_file
Library function to create a seq_file in proc filesystem, showing some information for each netdevice. This code is present in the kernel in about 10 instances, and all of them can be converted to using introduced library function.

Signed-off-by: Andrey Savochkin [EMAIL PROTECTED]
---
 include/linux/netdevice.h |    7 +++
 net/core/dev.c            |   96 ++
 2 files changed, 103 insertions(+)

--- ./include/linux/netdevice.h.venetproc	Tue Aug 15 13:46:08 2006
+++ ./include/linux/netdevice.h	Tue Aug 15 13:46:08 2006
@@ -592,6 +592,13 @@ extern int register_netdevice(struct ne
 extern int		unregister_netdevice(struct net_device *dev);
 extern void		free_netdev(struct net_device *dev);
 extern void		synchronize_net(void);
+#ifdef CONFIG_PROC_FS
+extern int		netdev_proc_create(char *name,
+				int (*show)(struct seq_file *,
+					struct net_device *, void *),
+				void *data, struct module *mod);
+void			netdev_proc_remove(char *name);
+#endif
 extern int		register_netdevice_notifier(struct notifier_block *nb);
 extern int		unregister_netdevice_notifier(struct notifier_block *nb);
 extern int		call_netdevice_notifiers(unsigned long val, void *v);
--- ./net/core/dev.c.venetproc	Tue Aug 15 13:46:08 2006
+++ ./net/core/dev.c	Tue Aug 15 13:46:08 2006
@@ -2100,6 +2100,102 @@ static int dev_ifconf(char __user *arg)
 }
 
 #ifdef CONFIG_PROC_FS
+
+struct netdev_proc_data {
+	struct file_operations fops;
+	int (*show)(struct seq_file *, struct net_device *, void *);
+	void *data;
+};
+
+static void *netdev_proc_seq_start(struct seq_file *seq, loff_t *pos)
+{
+	struct net_device *dev;
+	loff_t off;
+
+	read_lock(&dev_base_lock);
+	if (*pos == 0)
+		return SEQ_START_TOKEN;
+	for (dev = dev_base, off = 1; dev; dev = dev->next, off++) {
+		if (*pos == off)
+			return dev;
+	}
+	return NULL;
+}
+
+static void *netdev_proc_seq_next(struct seq_file *seq, void *v, loff_t *pos)
+{
+	++*pos;
+	return (v == SEQ_START_TOKEN) ? dev_base
+			: ((struct net_device *)v)->next;
+}
+
+static void netdev_proc_seq_stop(struct seq_file *seq, void *v)
+{
+	read_unlock(&dev_base_lock);
+}
+
+static int netdev_proc_seq_show(struct seq_file *seq, void *v)
+{
+	struct netdev_proc_data *p;
+
+	p = seq->private;
+	return (*p->show)(seq, v, p->data);
+}
+
+static struct seq_operations netdev_proc_seq_ops = {
+	.start = netdev_proc_seq_start,
+	.next  = netdev_proc_seq_next,
+	.stop  = netdev_proc_seq_stop,
+	.show  = netdev_proc_seq_show,
+};
+
+static int netdev_proc_open(struct inode *inode, struct file *file)
+{
+	int err;
+	struct seq_file *p;
+
+	err = seq_open(file, &netdev_proc_seq_ops);
+	if (!err) {
+		p = file->private_data;
+		p->private = (struct netdev_proc_data *)PDE(inode)->data;
+	}
+	return err;
+}
+
+int netdev_proc_create(char *name,
+		int (*show)(struct seq_file *, struct net_device *, void *),
+		void *data, struct module *mod)
+{
+	struct netdev_proc_data *p;
+	struct proc_dir_entry *ent;
+
+	p = kzalloc(sizeof(*p), GFP_KERNEL);
+	p->fops.owner = mod;
+	p->fops.open = netdev_proc_open;
+	p->fops.read = seq_read;
+	p->fops.llseek = seq_lseek;
+	p->fops.release = seq_release;
+	p->show = show;
+	p->data = data;
+	ent = create_proc_entry(name, S_IRUGO, proc_net);
+	if (ent == NULL) {
+		kfree(p);
+		return -EINVAL;
+	}
+	ent->data = p;
+	ent->destructor = proc_data_destructor;
+	smp_wmb();
+	ent->proc_fops = &p->fops;
+	return 0;
+}
+EXPORT_SYMBOL(netdev_proc_create);
+
+void netdev_proc_remove(char *name)
+{
+	proc_net_remove(name);
+}
+EXPORT_SYMBOL(netdev_proc_remove);
+
 /*
  * This is invoked by the /proc filesystem handler to display a device
  * in detail.
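The start/next pair in the patch follows the usual seq_file iteration pattern: position 0 yields a start token (typically used to print a header), and positions 1..N map to the list entries. A userspace model of that walk, with locking left out, may make the position arithmetic easier to follow. struct demo_dev, DEMO_START_TOKEN and the demo_* functions are hypothetical stand-ins for struct net_device, SEQ_START_TOKEN and the patch's callbacks.

```c
#include <stddef.h>

/* Hypothetical stand-in for struct net_device on a singly linked list. */
struct demo_dev {
	const char *name;
	struct demo_dev *next;
};

#define DEMO_START_TOKEN ((void *)1)	/* plays the role of SEQ_START_TOKEN */

/* Position 0 is the header token; position k >= 1 is the k-th device. */
static void *demo_seq_start(struct demo_dev *base, long pos)
{
	struct demo_dev *dev;
	long off;

	if (pos == 0)
		return DEMO_START_TOKEN;
	for (dev = base, off = 1; dev; dev = dev->next, off++)
		if (pos == off)
			return dev;
	return NULL;
}

/* Advance: from the token to the first device, else to the next one. */
static void *demo_seq_next(struct demo_dev *base, void *v, long *pos)
{
	++*pos;
	return v == DEMO_START_TOKEN ? (void *)base
				     : (void *)((struct demo_dev *)v)->next;
}

/* Drive the iterator the way a reader would, counting real devices. */
static int demo_count_devices(struct demo_dev *base)
{
	long pos = 0;
	int n = 0;
	void *v;

	for (v = demo_seq_start(base, pos); v; v = demo_seq_next(base, v, &pos))
		if (v != DEMO_START_TOKEN)
			n++;
	return n;
}
```

Restarting at an arbitrary position matters because a reader may consume the file in several read() calls, and each call begins again from start with the saved position.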
[PATCH 5/9] network namespaces: async socket operations
Non-trivial part of socket namespaces: asynchronous events should be run in proper context. Signed-off-by: Andrey Savochkin [EMAIL PROTECTED] --- af_inet.c| 10 ++ inet_timewait_sock.c |8 tcp_timer.c |9 + 3 files changed, 27 insertions(+) --- ./net/ipv4/af_inet.c.venssock-asyn Mon Aug 14 17:04:07 2006 +++ ./net/ipv4/af_inet.cTue Aug 15 13:45:44 2006 @@ -366,10 +366,17 @@ out_rcu_unlock: int inet_release(struct socket *sock) { struct sock *sk = sock-sk; + struct net_namespace *ns, *orig_net_ns; if (sk) { long timeout; + /* Need to change context here since protocol -close +* operation may send packets. +*/ + ns = get_net_ns(sk-sk_net_ns); + push_net_ns(ns, orig_net_ns); + /* Applications forget to leave groups before exiting */ ip_mc_drop_socket(sk); @@ -386,6 +393,9 @@ int inet_release(struct socket *sock) timeout = sk-sk_lingertime; sock-sk = NULL; sk-sk_prot-close(sk, timeout); + + pop_net_ns(orig_net_ns); + put_net_ns(ns); } return 0; } --- ./net/ipv4/inet_timewait_sock.c.venssock-asyn Tue Aug 15 13:45:44 2006 +++ ./net/ipv4/inet_timewait_sock.c Tue Aug 15 13:45:44 2006 @@ -129,6 +129,7 @@ static int inet_twdr_do_twkill_work(stru { struct inet_timewait_sock *tw; struct hlist_node *node; + struct net_namespace *orig_net_ns; unsigned int killed; int ret; @@ -140,8 +141,10 @@ static int inet_twdr_do_twkill_work(stru */ killed = 0; ret = 0; + push_net_ns(current_net_ns, orig_net_ns); rescan: inet_twsk_for_each_inmate(tw, node, twdr-cells[slot]) { + switch_net_ns(tw-tw_net_ns); __inet_twsk_del_dead_node(tw); spin_unlock(twdr-death_lock); __inet_twsk_kill(tw, twdr-hashinfo); @@ -164,6 +167,7 @@ rescan: twdr-tw_count -= killed; NET_ADD_STATS_BH(LINUX_MIB_TIMEWAITED, killed); + pop_net_ns(orig_net_ns); return ret; } @@ -338,10 +342,12 @@ void inet_twdr_twcal_tick(unsigned long int n, slot; unsigned long j; unsigned long now = jiffies; + struct net_namespace *orig_net_ns; int killed = 0; int adv = 0; twdr = (struct inet_timewait_death_row *)data; + 
push_net_ns(current_net_ns, orig_net_ns); spin_lock(twdr-death_lock); if (twdr-twcal_hand 0) @@ -357,6 +363,7 @@ void inet_twdr_twcal_tick(unsigned long inet_twsk_for_each_inmate_safe(tw, node, safe, twdr-twcal_row[slot]) { + switch_net_ns(tw-tw_net_ns); __inet_twsk_del_dead_node(tw); __inet_twsk_kill(tw, twdr-hashinfo); inet_twsk_put(tw); @@ -384,6 +391,7 @@ out: del_timer(twdr-tw_timer); NET_ADD_STATS_BH(LINUX_MIB_TIMEWAITKILLED, killed); spin_unlock(twdr-death_lock); + pop_net_ns(orig_net_ns); } EXPORT_SYMBOL_GPL(inet_twdr_twcal_tick); --- ./net/ipv4/tcp_timer.c.venssock-asynMon Aug 14 16:43:51 2006 +++ ./net/ipv4/tcp_timer.c Tue Aug 15 13:45:44 2006 @@ -171,7 +171,9 @@ static void tcp_delack_timer(unsigned lo struct sock *sk = (struct sock*)data; struct tcp_sock *tp = tcp_sk(sk); struct inet_connection_sock *icsk = inet_csk(sk); + struct net_namespace *orig_net_ns; + push_net_ns(sk-sk_net_ns, orig_net_ns); bh_lock_sock(sk); if (sock_owned_by_user(sk)) { /* Try again later. */ @@ -225,6 +227,7 @@ out: out_unlock: bh_unlock_sock(sk); sock_put(sk); + pop_net_ns(orig_net_ns); } static void tcp_probe_timer(struct sock *sk) @@ -384,8 +387,10 @@ static void tcp_write_timer(unsigned lon { struct sock *sk = (struct sock*)data; struct inet_connection_sock *icsk = inet_csk(sk); + struct net_namespace *orig_net_ns; int event; + push_net_ns(sk-sk_net_ns, orig_net_ns); bh_lock_sock(sk); if (sock_owned_by_user(sk)) { /* Try again later */ @@ -419,6 +424,7 @@ out: out_unlock: bh_unlock_sock(sk); sock_put(sk); + pop_net_ns(orig_net_ns); } /* @@ -447,9 +453,11 @@ static void tcp_keepalive_timer (unsigne { struct sock *sk = (struct sock *) data; struct inet_connection_sock *icsk = inet_csk(sk); + struct net_namespace *orig_net_ns; struct tcp_sock *tp = tcp_sk(sk); __u32 elapsed; + push_net_ns(sk-sk_net_ns, orig_net_ns); /* Only process if socket is not in use. */ bh_lock_sock(sk);
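The pattern used throughout this patch is to save the current namespace, switch to the one owning the socket, do the work, and restore. The sketch below models that in userspace. It assumes that the push/pop helpers simply save and restore a "current namespace" pointer; their real definitions are not part of this message, so `push_ns`, `pop_ns`, `current_ns` and the structures here are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

struct ns { int id; };

/* Stand-in for the per-task "current namespace" (assumption). */
static struct ns *current_ns;

/* Save the active namespace in 'orig' and switch to 'ns'. */
#define push_ns(ns, orig) \
	do { (orig) = current_ns; current_ns = (ns); } while (0)
/* Restore the namespace saved by push_ns(). */
#define pop_ns(orig) \
	do { current_ns = (orig); } while (0)

struct sock {
	struct ns *owner;	/* namespace the socket belongs to */
	int last_seen_ns;	/* namespace id observed while sending */
};

/* Stand-in for work that must run in the socket's namespace. */
static void send_packet(struct sock *sk)
{
	sk->last_seen_ns = current_ns ? current_ns->id : -1;
}

/* Timer-style handler: may fire while an unrelated namespace is current. */
static void timer_handler(struct sock *sk)
{
	struct ns *orig;

	assert(sk->owner != NULL);
	push_ns(sk->owner, orig);
	send_packet(sk);
	pop_ns(orig);
}
```

The handler leaves the caller's namespace untouched on return, which is the property the timer and release paths in the patch rely on.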
[RFC] network namespaces
Hi All,

I'd like to resurrect our discussion about network namespaces. In our previous discussions it appeared that we have rather polar concepts which seemed hard to reconcile. Now I have an idea of how to look at all the discussed concepts to enable everyone's usage scenario.

1. The most straightforward concept is complete separation of namespaces, covering the device list, routing tables, netfilter tables, socket hashes, and everything else.

On the input path, each packet is tagged with a namespace right at the place where it appears from a device, and is processed by each layer in the context of this namespace. Non-root namespaces communicate with the outside world in two ways: by owning hardware devices, or by receiving packets forwarded to them by their parent namespace via a pass-through device.

This complete separation of namespaces is very useful for at least two purposes:
 - allowing users to create and manage various tunnels and VPNs on their own, and
 - enabling easier and more straightforward live migration of groups of processes with their environment.

2. People expressed concerns that complete separation of namespaces may introduce an undesired overhead in certain usage scenarios. The overhead comes from packets traversing the input path, then the output path, then the input path again in the destination namespace if the root namespace acts as a router.

So, we may introduce short-cuts, where an input packet starts to be processed in one namespace but changes it at some upper layer. The places where a packet can change namespace are, for example: routing, the post-routing netfilter hook, or even the lookup in the socket hash.

The cleanest example among them is the post-routing netfilter hook. Tagging input packets there means that a packet is checked against the root namespace's routing table, found to be local, and goes directly to the socket hash lookup in the destination namespace. In this scheme the ability to change routing tables or netfilter rules on a per-namespace basis is traded for lower overhead.
All other optimized schemes where input packets do not travel input-output-input paths in the general case may be viewed as short-cuts in scheme (1). The remaining question is exactly which short-cuts make the most sense, and how to make them consistent from the interface point of view.

My current idea is to reach some agreement on the basic concept, review patches, and then move on to implementing feasible short-cuts.

Opinions?

Next in this thread are patches introducing namespaces to the device list, IPv4 routing, and socket hashes, and a pass-through device. Patches are against 2.6.18-rc4-mm1.

Best regards,
Andrey
[PATCH 2/9] network namespaces: IPv4 routing
Structures related to IPv4 rounting (FIB and routing cache) are made per-namespace. Signed-off-by: Andrey Savochkin [EMAIL PROTECTED] --- include/linux/net_ns.h | 10 +++ include/net/flow.h |3 + include/net/ip_fib.h | 46 net/core/dev.c |8 ++ net/core/fib_rules.c | 43 --- net/ipv4/Kconfig |4 - net/ipv4/fib_frontend.c | 132 +-- net/ipv4/fib_hash.c | 13 +++- net/ipv4/fib_rules.c | 86 +- net/ipv4/fib_semantics.c | 99 +++ net/ipv4/route.c | 26 - 11 files changed, 375 insertions(+), 95 deletions(-) --- ./include/linux/net_ns.h.vensroute Mon Aug 14 17:18:59 2006 +++ ./include/linux/net_ns.hMon Aug 14 19:19:14 2006 @@ -14,7 +14,17 @@ struct net_namespace { atomic_tactive_ref, use_ref; struct net_device *dev_base_p, **dev_tail_p; struct net_device *loopback; +#ifndef CONFIG_IP_MULTIPLE_TABLES + struct fib_table*fib4_local_table, *fib4_main_table; +#else + struct list_headfib_rules_ops_list; + struct fib_rules_ops*fib4_rules_ops; + struct hlist_head *fib4_tables; +#endif + struct hlist_head *fib4_hash, *fib4_laddrhash; + unsignedfib4_hash_size, fib4_info_cnt; unsigned inthash; + chardestroying; struct work_struct destroy_work; }; --- ./include/net/flow.h.vensroute Mon Aug 14 17:04:04 2006 +++ ./include/net/flow.hMon Aug 14 17:18:59 2006 @@ -79,6 +79,9 @@ struct flowi { #define fl_icmp_code uli_u.icmpt.code #define fl_ipsec_spi uli_u.spi __u32 secid; /* used by xfrm; see secid.txt */ +#ifdef CONFIG_NET_NS + struct net_namespace *net_ns; +#endif } __attribute__((__aligned__(BITS_PER_LONG/8))); #define FLOW_DIR_IN0 --- ./include/net/ip_fib.h.vensrouteMon Aug 14 17:04:04 2006 +++ ./include/net/ip_fib.h Tue Aug 15 11:53:22 2006 @@ -18,6 +18,7 @@ #include net/flow.h #include linux/seq_file.h +#include linux/net_ns.h #include net/fib_rules.h /* WARNING: The ordering of these elements must match ordering @@ -171,14 +172,21 @@ struct fib_table { #ifndef CONFIG_IP_MULTIPLE_TABLES -extern struct fib_table *ip_fib_local_table; -extern struct fib_table *ip_fib_main_table; +#ifndef 
CONFIG_NET_NS +extern struct fib_table *ip_fib_local_table_static; +extern struct fib_table *ip_fib_main_table_static; +#define ip_fib_local_table_ns()ip_fib_local_table_static +#define ip_fib_main_table_ns() ip_fib_main_table_static +#else +#define ip_fib_local_table_ns() (current_net_ns-fib4_local_table) +#define ip_fib_main_table_ns() (current_net_ns-fib4_main_table) +#endif static inline struct fib_table *fib_get_table(u32 id) { if (id != RT_TABLE_LOCAL) - return ip_fib_main_table; - return ip_fib_local_table; + return ip_fib_main_table_ns(); + return ip_fib_local_table_ns(); } static inline struct fib_table *fib_new_table(u32 id) @@ -188,21 +196,29 @@ static inline struct fib_table *fib_new_ static inline int fib_lookup(const struct flowi *flp, struct fib_result *res) { - if (ip_fib_local_table-tb_lookup(ip_fib_local_table, flp, res) - ip_fib_main_table-tb_lookup(ip_fib_main_table, flp, res)) + struct fib_table *tb; + + tb = ip_fib_local_table_ns(); + if (!tb-tb_lookup(tb, flp, res)) + return 0; + tb = ip_fib_main_table_ns(); + if (tb-tb_lookup(tb, flp, res)) return -ENETUNREACH; return 0; } static inline void fib_select_default(const struct flowi *flp, struct fib_result *res) { + struct fib_table *tb; + + tb = ip_fib_main_table_ns(); if (FIB_RES_GW(*res) FIB_RES_NH(*res).nh_scope == RT_SCOPE_LINK) - ip_fib_main_table-tb_select_default(ip_fib_main_table, flp, res); + tb-tb_select_default(main_table, flp, res); } #else /* CONFIG_IP_MULTIPLE_TABLES */ -#define ip_fib_local_table fib_get_table(RT_TABLE_LOCAL) -#define ip_fib_main_table fib_get_table(RT_TABLE_MAIN) +#define ip_fib_local_table_ns() fib_get_table(RT_TABLE_LOCAL) +#define ip_fib_main_table_ns() fib_get_table(RT_TABLE_MAIN) extern int fib_lookup(struct flowi *flp, struct fib_result *res); @@ -214,6 +230,10 @@ extern void fib_select_default(const str /* Exported by fib_frontend.c */ extern voidip_fib_init(void); +#ifdef CONFIG_NET_NS +extern int ip_fib_struct_init(void); +extern void 
ip_fib_struct_cleanup(void); +#endif extern int inet_rtm_delroute(struct sk_buff *skb, struct nlmsghdr* nlh, void *arg); extern int inet_rtm_newroute(struct sk_buff *skb, struct nlmsghdr* nlh, void *arg); extern int
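With the tables made per-namespace, the lookup order stays the same: the local table is consulted first, then the main table, and a miss in both reports an unreachable network. Below is a small userspace sketch of that order only. The table layout, the first-match rule and all names (`struct ns`, `table_lookup`, `ns_fib_lookup`) are assumptions made for illustration and do not mirror the kernel's FIB structures.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

struct route { uint32_t prefix, mask; int oif; };
struct table { const struct route *routes; size_t n; };

/* Each namespace carries its own pair of tables. */
struct ns { struct table local_tb, main_tb; };

/* Returns 0 and sets *oif on the first matching route, nonzero on a miss. */
static int table_lookup(const struct table *tb, uint32_t dst, int *oif)
{
	size_t i;

	for (i = 0; i < tb->n; i++) {
		if ((dst & tb->routes[i].mask) == tb->routes[i].prefix) {
			*oif = tb->routes[i].oif;
			return 0;
		}
	}
	return -1;
}

/* Local table first, then main; both belong to the given namespace. */
static int ns_fib_lookup(const struct ns *ns, uint32_t dst, int *oif)
{
	assert(ns != NULL);
	if (!table_lookup(&ns->local_tb, dst, oif))
		return 0;
	if (table_lookup(&ns->main_tb, dst, oif))
		return -ENETUNREACH;
	return 0;
}
```

The same destination can resolve differently, or not at all, depending on which namespace's tables are passed in.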
[PATCH 1/9] network namespaces: core and device list
CONFIG_NET_NS and net_namespace structure are introduced. List of network devices is made per-namespace. Each namespace gets its own loopback device. Signed-off-by: Andrey Savochkin [EMAIL PROTECTED] --- drivers/net/loopback.c| 69 - include/linux/init_task.h |9 ++ include/linux/net_ns.h| 82 + include/linux/netdevice.h | 13 +++ include/linux/nsproxy.h |3 include/linux/sched.h |3 kernel/nsproxy.c | 14 net/Kconfig |7 ++ net/core/dev.c| 150 -- net/core/net-sysfs.c | 24 +++ net/ipv4/devinet.c|2 net/ipv6/addrconf.c |2 net/ipv6/route.c |9 +- 13 files changed, 349 insertions(+), 38 deletions(-) --- ./drivers/net/loopback.c.vensdevMon Aug 14 17:02:18 2006 +++ ./drivers/net/loopback.cMon Aug 14 17:18:20 2006 @@ -196,42 +196,55 @@ static struct ethtool_ops loopback_ethto .set_tso= ethtool_op_set_tso, }; -struct net_device loopback_dev = { - .name = lo, - .mtu= (16 * 1024) + 20 + 20 + 12, - .hard_start_xmit= loopback_xmit, - .hard_header= eth_header, - .hard_header_cache = eth_header_cache, - .header_cache_update= eth_header_cache_update, - .hard_header_len= ETH_HLEN, /* 14 */ - .addr_len = ETH_ALEN, /* 6*/ - .tx_queue_len = 0, - .type = ARPHRD_LOOPBACK, /* 0x0001*/ - .rebuild_header = eth_rebuild_header, - .flags = IFF_LOOPBACK, - .features = NETIF_F_SG | NETIF_F_FRAGLIST +struct net_device loopback_dev_static; +EXPORT_SYMBOL(loopback_dev_static); + +void loopback_dev_dtor(struct net_device *dev) +{ + if (dev-priv) { + kfree(dev-priv); + dev-priv = NULL; + } + free_netdev(dev); +} + +void loopback_dev_ctor(struct net_device *dev) +{ + struct net_device_stats *stats; + + memset(dev, 0, sizeof(*dev)); + strcpy(dev-name, lo); + dev-mtu= (16 * 1024) + 20 + 20 + 12; + dev-hard_start_xmit= loopback_xmit; + dev-hard_header= eth_header; + dev-hard_header_cache = eth_header_cache; + dev-header_cache_update = eth_header_cache_update; + dev-hard_header_len= ETH_HLEN; /* 14 */ + dev-addr_len = ETH_ALEN; /* 6*/ + dev-tx_queue_len = 0; + dev-type = ARPHRD_LOOPBACK; /* 0x0001*/ + 
dev-rebuild_header = eth_rebuild_header; + dev-flags = IFF_LOOPBACK; + dev-features = NETIF_F_SG | NETIF_F_FRAGLIST #ifdef LOOPBACK_TSO | NETIF_F_TSO #endif | NETIF_F_NO_CSUM | NETIF_F_HIGHDMA - | NETIF_F_LLTX, - .ethtool_ops= loopback_ethtool_ops, -}; - -/* Setup and register the loopback device. */ -int __init loopback_init(void) -{ - struct net_device_stats *stats; + | NETIF_F_LLTX; + dev-ethtool_ops= loopback_ethtool_ops; /* Can survive without statistics */ stats = kmalloc(sizeof(struct net_device_stats), GFP_KERNEL); if (stats) { memset(stats, 0, sizeof(struct net_device_stats)); - loopback_dev.priv = stats; - loopback_dev.get_stats = get_stats; + dev-priv = stats; + dev-get_stats = get_stats; } - - return register_netdev(loopback_dev); -}; +} -EXPORT_SYMBOL(loopback_dev); +/* Setup and register the loopback device. */ +int __init loopback_init(void) +{ + loopback_dev_ctor(loopback_dev_static); + return register_netdev(loopback_dev_static); +}; --- ./include/linux/init_task.h.vensdev Mon Aug 14 17:04:04 2006 +++ ./include/linux/init_task.h Mon Aug 14 17:18:21 2006 @@ -87,6 +87,14 @@ extern struct nsproxy init_nsproxy; extern struct group_info init_groups; +#ifdef CONFIG_NET_NS +extern struct net_namespace init_net_ns; +#define INIT_NET_NS \ + .net_context= init_net_ns, +#else +#define INIT_NET_NS +#endif + /* * INIT_TASK is used to set up the first task table, touch at * your own risk!. Base=0, limit=0x1f (=2MB) @@ -129,6 +137,7 @@ extern struct group_info init_groups; .signal = init_signals,\ .sighand= init_sighand,\ .nsproxy= init_nsproxy,\ + INIT_NET_NS \ .pending= { \ .list =
[PATCH 3/9] network namespaces: playing and debugging
Temporary code to play with network namespaces in the simplest way.
Do
	exec 7< /proc/net/net_ns
in your bash shell and you'll get a brand new network namespace.
There you can, for example, do
	ip link set lo up
	ip addr list
	ip addr add 1.2.3.4 dev lo
	ping -n 1.2.3.4

Signed-off-by: Andrey Savochkin [EMAIL PROTECTED]
---
 dev.c |   20
 1 files changed, 20 insertions(+)

--- ./net/core/dev.c.vensxdbg	Tue Aug 15 13:46:44 2006
+++ ./net/core/dev.c	Tue Aug 15 13:46:44 2006
@@ -3597,6 +3597,8 @@ int net_ns_start(void)
 	if (err)
 		goto out_register;
 	put_net_ns(orig_ns);
+	printk(KERN_DEBUG "NET_NS: created new netcontext %p for %s (pid=%d)\n",
+			ns, task->comm, task->tgid);
 	return 0;

 out_register:
@@ -3629,14 +3631,29 @@ static void net_ns_destroy(void *data)
 	ip_fib_struct_cleanup();
 	pop_net_ns(orig_ns);
 	kfree(ns);
+	printk(KERN_DEBUG "NET_NS: netcontext %p freed\n", ns);
 }

 void net_ns_stop(struct net_namespace *ns)
 {
+	printk(KERN_DEBUG "NET_NS: netcontext %p scheduled for stop\n", ns);
 	INIT_WORK(&ns->destroy_work, net_ns_destroy, ns);
 	schedule_work(&ns->destroy_work);
 }
 EXPORT_SYMBOL(net_ns_stop);
+
+static int net_ns_open(struct inode *i, struct file *f)
+{
+	return net_ns_start();
+}
+static struct file_operations net_ns_fops = {
+	.open = net_ns_open,
+};
+static int net_ns_init(void)
+{
+	return proc_net_fops_create("net_ns", S_IRWXU, &net_ns_fops)
+			? 0 : -ENOMEM;
+}
 #endif

 /*
@@ -3701,6 +3718,9 @@ static int __init net_dev_init(void)
 	hotcpu_notifier(dev_cpu_callback, 0);
 	dst_init();
 	dev_mcast_init();
+#ifdef CONFIG_NET_NS
+	net_ns_init();
+#endif
 	rc = 0;
 out:
 	return rc;
[PATCH 4/9] network namespaces: socket hashes
Socket hash lookups are made within namespace. Hash tables are common for all namespaces, with additional permutation of indexes. Signed-off-by: Andrey Savochkin [EMAIL PROTECTED] --- include/linux/ipv6.h |3 ++- include/net/inet6_hashtables.h |6 -- include/net/inet_hashtables.h| 38 +- include/net/inet_sock.h |6 -- include/net/inet_timewait_sock.h |2 ++ include/net/sock.h |4 include/net/udp.h| 12 +--- net/core/sock.c |5 + net/ipv4/inet_connection_sock.c | 19 +++ net/ipv4/inet_hashtables.c | 29 ++--- net/ipv4/inet_timewait_sock.c|8 ++-- net/ipv4/raw.c |2 ++ net/ipv4/udp.c | 20 +--- net/ipv6/inet6_connection_sock.c |2 ++ net/ipv6/inet6_hashtables.c | 25 ++--- net/ipv6/raw.c |4 net/ipv6/udp.c | 21 ++--- 17 files changed, 151 insertions(+), 55 deletions(-) --- ./include/linux/ipv6.h.venssock Mon Aug 14 17:02:45 2006 +++ ./include/linux/ipv6.h Tue Aug 15 13:38:47 2006 @@ -428,10 +428,11 @@ static inline struct raw6_sock *raw6_sk( #define inet_v6_ipv6only(__sk) 0 #endif /* defined(CONFIG_IPV6) || defined(CONFIG_IPV6_MODULE) */ -#define INET6_MATCH(__sk, __hash, __saddr, __daddr, __ports, __dif)\ +#define INET6_MATCH(__sk, __hash, __saddr, __daddr, __ports, __dif, __ns)\ (((__sk)-sk_hash == (__hash)) \ ((*((__u32 *)(inet_sk(__sk)-dport))) == (__ports))\ ((__sk)-sk_family == AF_INET6) \ +net_ns_match((__sk)-sk_net_ns, __ns) \ ipv6_addr_equal(inet6_sk(__sk)-daddr, (__saddr)) \ ipv6_addr_equal(inet6_sk(__sk)-rcv_saddr, (__daddr)) \ (!((__sk)-sk_bound_dev_if) || ((__sk)-sk_bound_dev_if == (__dif --- ./include/net/inet6_hashtables.h.venssock Mon Aug 14 17:02:47 2006 +++ ./include/net/inet6_hashtables.hTue Aug 15 13:38:47 2006 @@ -26,11 +26,13 @@ struct inet_hashinfo; /* I have no idea if this is a good hash for v6 or not. 
-DaveM */ static inline unsigned int inet6_ehashfn(const struct in6_addr *laddr, const u16 lport, - const struct in6_addr *faddr, const u16 fport) + const struct in6_addr *faddr, const u16 fport, + struct net_namespace *ns) { unsigned int hashent = (lport ^ fport); hashent ^= (laddr-s6_addr32[3] ^ faddr-s6_addr32[3]); + hashent ^= net_ns_hash(ns); hashent ^= hashent 16; hashent ^= hashent 8; return hashent; @@ -44,7 +46,7 @@ static inline int inet6_sk_ehashfn(const const struct in6_addr *faddr = np-daddr; const __u16 lport = inet-num; const __u16 fport = inet-dport; - return inet6_ehashfn(laddr, lport, faddr, fport); + return inet6_ehashfn(laddr, lport, faddr, fport, current_net_ns); } extern void __inet6_hash(struct inet_hashinfo *hashinfo, struct sock *sk); --- ./include/net/inet_hashtables.h.venssockMon Aug 14 17:04:04 2006 +++ ./include/net/inet_hashtables.h Tue Aug 15 13:38:47 2006 @@ -74,6 +74,9 @@ struct inet_ehash_bucket { * ports are created in O(1) time? I thought so. ;-) -DaveM */ struct inet_bind_bucket { +#ifdef CONFIG_NET_NS + struct net_namespace*net_ns; +#endif unsigned short port; signed shortfastreuse; struct hlist_node node; @@ -142,30 +145,34 @@ extern struct inet_bind_bucket * extern void inet_bind_bucket_destroy(kmem_cache_t *cachep, struct inet_bind_bucket *tb); -static inline int inet_bhashfn(const __u16 lport, const int bhash_size) +static inline int inet_bhashfn(const __u16 lport, + struct net_namespace *ns, + const int bhash_size) { - return lport (bhash_size - 1); + return (lport ^ net_ns_hash(ns)) (bhash_size - 1); } extern void inet_bind_hash(struct sock *sk, struct inet_bind_bucket *tb, const unsigned short snum); /* These can have wildcards, don't try too hard. 
*/ -static inline int inet_lhashfn(const unsigned short num) +static inline int inet_lhashfn(const unsigned short num, + struct net_namespace *ns) { - return num (INET_LHTABLE_SIZE - 1); + return (num ^ net_ns_hash(ns)) (INET_LHTABLE_SIZE - 1); } static inline int inet_sk_listen_hashfn(const struct sock *sk) { - return inet_lhashfn(inet_sk(sk)-num); + return inet_lhashfn(inet_sk(sk)-num, current_net_ns); } /* Caller must disable local BH processing. */ static inline void
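The idea of this patch is that all namespaces share one hash table, the bucket index is permuted by a per-namespace value, and a match additionally requires the same namespace. Here is a userspace sketch under those assumptions. `ns_hash` is a hypothetical stand-in for `net_ns_hash()`, whose definition is not shown in this message, and pointer equality stands in for `net_ns_match()`.

```c
#include <assert.h>
#include <stdint.h>

struct ns { unsigned int hash; };

/* Hypothetical stand-in for net_ns_hash(): a per-namespace constant. */
static unsigned int ns_hash(const struct ns *ns)
{
	return ns->hash;
}

/* Bucket index: port XOR namespace hash, masked to a power-of-two size. */
static unsigned int bind_hashfn(uint16_t lport, const struct ns *ns,
				unsigned int size)
{
	assert(size != 0 && (size & (size - 1)) == 0);
	return (lport ^ ns_hash(ns)) & (size - 1);
}

struct sock { uint16_t port; const struct ns *ns; };

/* Buckets are shared, so a match must also compare the namespace. */
static int sock_matches(const struct sock *sk, uint16_t port,
			const struct ns *ns)
{
	return sk->port == port && sk->ns == ns;
}
```

Two namespaces binding the same port usually land in different buckets, and even on a collision the namespace comparison keeps their sockets apart.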
[PATCH 6/9] allow proc_dir_entries to have destructor
A destructor field is added to proc_dir_entries, and a standard destructor
kfree'ing the data is introduced.

Signed-off-by: Andrey Savochkin [EMAIL PROTECTED]
---
 fs/proc/generic.c       |   10 --
 fs/proc/root.c          |    1 +
 include/linux/proc_fs.h |    4
 3 files changed, 13 insertions(+), 2 deletions(-)

--- ./fs/proc/generic.c.veprocdtor	Mon Aug 14 16:43:41 2006
+++ ./fs/proc/generic.c	Tue Aug 15 13:45:51 2006
@@ -608,6 +608,11 @@ static struct proc_dir_entry *proc_creat
 	return ent;
 }

+void proc_data_destructor(struct proc_dir_entry *ent)
+{
+	kfree(ent->data);
+}
+
 struct proc_dir_entry *proc_symlink(const char *name,
 		struct proc_dir_entry *parent, const char *dest)
 {
@@ -620,6 +625,7 @@ struct proc_dir_entry *proc_symlink(cons
 		ent->data = kmalloc((ent->size=strlen(dest))+1, GFP_KERNEL);
 		if (ent->data) {
 			strcpy((char*)ent->data,dest);
+			ent->destructor = proc_data_destructor;
 			if (proc_register(parent, ent) < 0) {
 				kfree(ent->data);
 				kfree(ent);
@@ -698,8 +704,8 @@ void free_proc_entry(struct proc_dir_ent

 	release_inode_number(ino);

-	if (S_ISLNK(de->mode) && de->data)
-		kfree(de->data);
+	if (de->destructor)
+		de->destructor(de);
 	kfree(de);
 }

--- ./fs/proc/root.c.veprocdtor	Mon Aug 14 17:02:38 2006
+++ ./fs/proc/root.c	Tue Aug 15 13:45:51 2006
@@ -154,6 +154,7 @@ EXPORT_SYMBOL(proc_symlink);
 EXPORT_SYMBOL(proc_mkdir);
 EXPORT_SYMBOL(create_proc_entry);
 EXPORT_SYMBOL(remove_proc_entry);
+EXPORT_SYMBOL(proc_data_destructor);
 EXPORT_SYMBOL(proc_root);
 EXPORT_SYMBOL(proc_root_fs);
 EXPORT_SYMBOL(proc_net);
--- ./include/linux/proc_fs.h.veprocdtor	Mon Aug 14 17:02:47 2006
+++ ./include/linux/proc_fs.h	Tue Aug 15 13:45:51 2006
@@ -46,6 +46,8 @@ typedef	int (read_proc_t)(char *page, ch
 typedef	int (write_proc_t)(struct file *file, const char __user *buffer,
 			   unsigned long count, void *data);
 typedef int (get_info_t)(char *, char **, off_t, int);
+struct proc_dir_entry;
+typedef void (destroy_proc_t)(struct proc_dir_entry *);

 struct proc_dir_entry {
 	unsigned int low_ino;
@@ -65,6 +67,7 @@ struct proc_dir_entry {
 	read_proc_t *read_proc;
 	write_proc_t *write_proc;
 	atomic_t count;		/* use count */
+	destroy_proc_t *destructor;
 	int deleted;		/* delete flag */
 	void *set;
 };
@@ -109,6 +112,7 @@ char *task_mem(struct mm_struct *, char
 extern struct proc_dir_entry *create_proc_entry(const char *name, mode_t mode,
 						struct proc_dir_entry *parent);
 extern void remove_proc_entry(const char *name, struct proc_dir_entry *parent);
+extern void proc_data_destructor(struct proc_dir_entry *);

 extern struct vfsmount *proc_mnt;
 extern int proc_fill_super(struct super_block *,void *,int);
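The change replaces a type-specific check at free time with an optional callback stored in the entry itself. A minimal userspace sketch of that ownership pattern follows. The names `struct entry`, `entry_new`, `entry_free` and the `freed_count` counter are hypothetical and exist only to make the behaviour observable.

```c
#include <assert.h>
#include <stdlib.h>

struct entry;
typedef void (*destroy_fn)(struct entry *);

struct entry {
	void *data;
	destroy_fn destructor;	/* NULL: the entry does not own its data */
};

static int freed_count;	/* demonstration only: counts destructor calls */

/* Standard destructor: the entry owns heap-allocated data. */
static void data_destructor(struct entry *e)
{
	free(e->data);
	e->data = NULL;
	freed_count++;
}

static struct entry *entry_new(void *data, destroy_fn dtor)
{
	struct entry *e = calloc(1, sizeof(*e));

	if (!e)
		return NULL;
	e->data = data;
	e->destructor = dtor;
	return e;
}

/* Freeing an entry runs its destructor, if any, before releasing it. */
static void entry_free(struct entry *e)
{
	assert(e != NULL);
	if (e->destructor)
		e->destructor(e);
	free(e);
}
```

Callers that attach borrowed data leave the destructor unset, so the free path never needs to know what kind of entry it is releasing.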
[PATCH 8/9] network namespaces: device to pass packets between namespaces
A simple device to pass packets between a namespace and its child. Signed-off-by: Andrey Savochkin [EMAIL PROTECTED] --- Makefile |3 veth.c | 327 +++ 2 files changed, 330 insertions(+) --- ./drivers/net/Makefile.veveth Mon Aug 14 17:03:45 2006 +++ ./drivers/net/Makefile Tue Aug 15 13:46:15 2006 @@ -124,6 +124,9 @@ obj-$(CONFIG_SLIP) += slip.o obj-$(CONFIG_SLHC) += slhc.o obj-$(CONFIG_DUMMY) += dummy.o +ifeq ($(CONFIG_NET_NS),y) +obj-m += veth.o +endif obj-$(CONFIG_IFB) += ifb.o obj-$(CONFIG_DE600) += de600.o obj-$(CONFIG_DE620) += de620.o --- ./drivers/net/veth.c.veveth Tue Aug 15 13:44:46 2006 +++ ./drivers/net/veth.cTue Aug 15 13:46:15 2006 @@ -0,0 +1,327 @@ +/* + * Copyright (C) 2006 SWsoft + * + * Written by Andrey Savochkin [EMAIL PROTECTED], + * reusing code by Andrey Mirkin [EMAIL PROTECTED]. + */ +#include linux/list.h +#include linux/spinlock.h +#include linux/ctype.h +#include asm/semaphore.h +#include linux/netdevice.h +#include linux/etherdevice.h +#include linux/proc_fs.h +#include linux/seq_file.h +#include net/dst.h +#include net/xfrm.h + +struct veth_struct +{ + struct net_device *pair; + struct net_device_stats stats; +}; + +#define veth_from_netdev(dev) ((struct veth_struct *)(netdev_priv(dev))) + +/* --- * + * + * Device functions + * + * --- */ + +static struct net_device_stats *get_stats(struct net_device *dev); +static int veth_xmit(struct sk_buff *skb, struct net_device *dev) +{ + struct net_device_stats *stats; + struct veth_struct *entry; + struct net_device *rcv; + struct net_namespace *orig_net_ns; + int length; + + stats = get_stats(dev); + entry = veth_from_netdev(dev); + rcv = entry-pair; + + if (!(rcv-flags IFF_UP)) + /* Target namespace does not want to receive packets */ + goto outf; + + dst_release(skb-dst); + skb-dst = NULL; + secpath_reset(skb); + skb_orphan(skb); +#ifdef CONFIG_NETFILTER + nf_conntrack_put(skb-nfct); +#if defined(CONFIG_NF_CONNTRACK) || defined(CONFIG_NF_CONNTRACK_MODULE) + 
nf_conntrack_put_reasm(skb-nfct_reasm); +#endif +#ifdef CONFIG_BRIDGE_NETFILTER + nf_bridge_put(skb-nf_bridge); +#endif +#endif + + push_net_ns(rcv-net_ns, orig_net_ns); + skb-dev = rcv; + skb-pkt_type = PACKET_HOST; + skb-protocol = eth_type_trans(skb, rcv); + + length = skb-len; + stats-tx_bytes += length; + stats-tx_packets++; + stats = get_stats(rcv); + stats-rx_bytes += length; + stats-rx_packets++; + + netif_rx(skb); + pop_net_ns(orig_net_ns); + return 0; + +outf: + stats-tx_dropped++; + kfree_skb(skb); + return 0; +} + +static int veth_open(struct net_device *dev) +{ + return 0; +} + +static int veth_close(struct net_device *dev) +{ + return 0; +} + +static void veth_destructor(struct net_device *dev) +{ + free_netdev(dev); +} + +static struct net_device_stats *get_stats(struct net_device *dev) +{ + return veth_from_netdev(dev)-stats; +} + +int veth_init_dev(struct net_device *dev) +{ + dev-hard_start_xmit = veth_xmit; + dev-open = veth_open; + dev-stop = veth_close; + dev-destructor = veth_destructor; + dev-get_stats = get_stats; + + ether_setup(dev); + + dev-tx_queue_len = 0; + return 0; +} + +static void veth_setup(struct net_device *dev) +{ + dev-init = veth_init_dev; +} + +static inline int is_veth_dev(struct net_device *dev) +{ + return dev-init == veth_init_dev; +} + +/* --- * + * + * Management interface + * + * --- */ + +struct net_device *veth_dev_alloc(char *name, char *addr) +{ + struct net_device *dev; + + dev = alloc_netdev(sizeof(struct veth_struct), name, veth_setup); + if (dev != NULL) { + memcpy(dev-dev_addr, addr, ETH_ALEN); + dev-addr_len = ETH_ALEN; + } + return dev; +} + +int veth_entry_add(char *parent_name, char *parent_addr, + char *child_name, char *child_addr, + struct net_namespace *child_ns) +{ + struct net_device *parent_dev, *child_dev; + struct net_namespace *parent_ns; + int err; + + err = -ENOMEM; + if ((parent_dev = veth_dev_alloc(parent_name, parent_addr)) == NULL) + goto out_alocp; + if ((child_dev = 
veth_dev_alloc(child_name, child_addr)) == NULL) + goto out_alocc; + veth_from_netdev(parent_dev)-pair = child_dev; + veth_from_netdev(child_dev)-pair = parent_dev; + + /* +* About serialization, see
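The transmit path of the pair device hands each frame to its peer and accounts it as transmitted on one side and received on the other, or drops it when the peer is down. The sketch below models just that bookkeeping in userspace. `struct vdev`, `vdev_link`, `vdev_xmit` and the integer `ns_id` are hypothetical simplifications, and no packet data is actually moved.

```c
#include <assert.h>
#include <stddef.h>

struct stats {
	unsigned long tx_packets, tx_bytes, tx_dropped;
	unsigned long rx_packets, rx_bytes;
};

struct vdev {
	struct vdev *pair;	/* the other end of the tunnel */
	int up;			/* nonzero when the interface is up */
	int ns_id;		/* hypothetical namespace identifier */
	struct stats stats;
};

/* Cross-link two devices so that each one's peer is the other. */
static void vdev_link(struct vdev *a, struct vdev *b)
{
	a->pair = b;
	b->pair = a;
}

/*
 * "Transmit" len bytes: returns the namespace id in which the frame is
 * delivered, or -1 if it was dropped because the peer is down.
 */
static int vdev_xmit(struct vdev *dev, size_t len)
{
	struct vdev *rcv = dev->pair;

	assert(rcv != NULL);
	if (!rcv->up) {
		dev->stats.tx_dropped++;
		return -1;
	}
	dev->stats.tx_packets++;
	dev->stats.tx_bytes += len;
	rcv->stats.rx_packets++;
	rcv->stats.rx_bytes += len;
	return rcv->ns_id;
}
```

In the patch, delivery additionally happens with the receiver's namespace made current; the returned id in this sketch only marks which side would process the frame.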
[PATCH 9/9] network namespaces: playing with pass-through device
Temporary code to debug and play with the pass-through device.
Create a device pair by
	modprobe veth
	echo 'add veth1 0:1:2:3:4:1 eth0 0:1:2:3:4:2' > /proc/net/veth_ctl
and your shell will appear in a new namespace with an `eth0' device.
Configure the device in this namespace
	ip l s eth0 up
	ip a a 1.2.3.4/24 dev eth0
and in the root namespace
	ip l s veth1 up
	ip a a 1.2.3.1/24 dev veth1
to establish a communication channel between the root namespace and the newly
created one.

Signed-off-by: Andrey Savochkin [EMAIL PROTECTED]
---
 veth.c |  113 +
 1 files changed, 113 insertions(+)

--- ./drivers/net/veth.c.veveth-dbg	Tue Aug 15 13:47:48 2006
+++ ./drivers/net/veth.c	Tue Aug 15 14:08:04 2006
@@ -251,6 +251,116 @@ void veth_entry_del_all(void)

 /* --- *
  *
+ * Temporary interface to create veth devices
+ *
+ * --- */
+
+#ifdef CONFIG_PROC_FS
+
+static int veth_debug_open(struct inode *inode, struct file *file)
+{
+	return 0;
+}
+
+static char *parse_addr(char *s, char *addr)
+{
+	int i, v;
+
+	for (i = 0; i < ETH_ALEN; i++) {
+		if (!isxdigit(*s))
+			return NULL;
+		*addr = 0;
+		v = isdigit(*s) ? *s - '0' : toupper(*s) - 'A' + 10;
+		s++;
+		if (isxdigit(*s)) {
+			*addr += v * 16;
+			v = isdigit(*s) ? *s - '0' : toupper(*s) - 'A' + 10;
+			s++;
+		}
+		*addr++ += v;
+		if (i < ETH_ALEN - 1 && ispunct(*s))
+			s++;
+	}
+	return s;
+}
+
+extern int net_ns_start(void);
+static ssize_t veth_debug_write(struct file *file, const char __user *user_buf,
+		size_t size, loff_t *ppos)
+{
+	char buf[128], *s, *parent_name, *child_name;
+	char parent_addr[ETH_ALEN], child_addr[ETH_ALEN];
+	struct net_namespace *parent_ns, *child_ns;
+	int err;
+
+	s = buf;
+	err = -EINVAL;
+	if (size >= sizeof(buf))
+		goto out;
+	err = -EFAULT;
+	if (copy_from_user(buf, user_buf, size))
+		goto out;
+	buf[size] = 0;
+
+	err = -EBADRQC;
+	if (!strncmp(buf, "add ", 4)) {
+		parent_name = buf + 4;
+		if ((s = strchr(parent_name, ' ')) == NULL)
+			goto out;
+		*s = 0;
+		if ((s = parse_addr(s + 1, parent_addr)) == NULL)
+			goto out;
+		if (!*s)
+			goto out;
+		child_name = s + 1;
+		if ((s = strchr(child_name, ' ')) == NULL)
+			goto out;
+		*s = 0;
+		if ((s = parse_addr(s + 1, child_addr)) == NULL)
+			goto out;
+
+		parent_ns = get_net_ns(current_net_ns);
+		err = net_ns_start();
+		if (err)
+			goto out;
+		/* return to parent context */
+		push_net_ns(parent_ns, child_ns);
+		err = veth_entry_add(parent_name, parent_addr,
+				child_name, child_addr, child_ns);
+		pop_net_ns(child_ns);
+		put_net_ns(parent_ns);
+		if (!err)
+			err = size;
+	}
+out:
+	return err;
+}
+
+static struct file_operations veth_debug_ops = {
+	.open	= veth_debug_open,
+	.write	= veth_debug_write,
+};
+
+static int veth_debug_create(void)
+{
+	proc_net_fops_create("veth_ctl", 0200, &veth_debug_ops);
+	return 0;
+}
+
+static void veth_debug_remove(void)
+{
+	proc_net_remove("veth_ctl");
+}
+
+#else
+
+static int veth_debug_create(void) { return -1; }
+static void veth_debug_remove(void) { }
+
+#endif
+
+/* --- *
+ *
  * Information in proc
  *
  * --- */
@@ -310,12 +420,15 @@ static inline void veth_proc_remove(void

 int __init veth_init(void)
 {
+	if (veth_debug_create())
+		return -EINVAL;
 	veth_proc_create();
 	return 0;
 }

 void __exit veth_exit(void)
 {
+	veth_debug_remove();
 	veth_proc_remove();
 	veth_entry_del_all();
 }
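The control file accepts addresses with one or two hex digits per octet, such as 0:1:2:3:4:1. As an illustration of that parsing step, here is a standalone userspace variant, written as a sketch and not taken from the patch. `parse_mac`, `hexval` and `MAC_LEN` are hypothetical names, and the first digit of a two-digit octet is scaled by 16 so that it lands in the high nibble.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

#define MAC_LEN 6

/* Value of one hex digit; the caller guarantees isxdigit(c). */
static int hexval(int c)
{
	return isdigit(c) ? c - '0' : toupper(c) - 'A' + 10;
}

/*
 * Parse six octets of one or two hex digits each, separated by a single
 * punctuation character. Returns a pointer just past the parsed text,
 * or NULL if an octet does not start with a hex digit.
 */
static const char *parse_mac(const char *s, unsigned char *addr)
{
	int i, v;

	for (i = 0; i < MAC_LEN; i++) {
		if (!isxdigit((unsigned char)*s))
			return NULL;
		v = hexval((unsigned char)*s++);
		if (isxdigit((unsigned char)*s))
			v = v * 16 + hexval((unsigned char)*s++);
		addr[i] = (unsigned char)v;
		if (i < MAC_LEN - 1 && ispunct((unsigned char)*s))
			s++;
	}
	return s;
}
```

Returning the position after the address lets the caller continue with the next whitespace-separated field of the command, as the write handler does for the child device name.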
Re: [PATCH 09/16] [IPv6] address: Convert address notification to use rtnl_notify()
* Herbert Xu [EMAIL PROTECTED] 2006-08-16 12:58
> I'm not comfortable with that change since it implies the message originated from a user-space process. The netlink header pid is really akin to sadb_msg_pid from RFC 2367. IMHO it should always be zero if the kernel is the originator of the message.

All route and tc notifications already use the pid so applications can decide whether the event was caused by them. A notification is a reply to a request so it doesn't even violate RFC 2367.
Re: [PATCH 09/16] [IPv6] address: Convert address notification to use rtnl_notify()
On Wed, 2006-16-08 at 12:58 +0200, Thomas Graf wrote:
> * Herbert Xu [EMAIL PROTECTED] 2006-08-16 12:58
> > I'm not comfortable with that change since it implies the message originated from a user-space process. The netlink header pid is really akin to sadb_msg_pid from RFC 2367. IMHO it should always be zero if the kernel is the originator of the message.
>
> All route and tc notifications already use the pid so applications can decide whether the event was caused by them. A notification is a reply to a request so it doesn't even violate RFC 2367.

I would agree with Thomas on this. Regardless, I don't think that 2367 is really a glorified reference (that thing needs so much updating it is not funny).

cheers,
jamal
Re: [PATCH 09/16] [IPv6] address: Convert address notification to use rtnl_notify()
Hi Thomas: On Wed, Aug 16, 2006 at 12:58:56PM +0200, Thomas Graf wrote: All route and tc notifications already use the pid so applications can decide whether the event was caused by them. A notification is a reply to a request so it doesn't even violate RFC 2367. Actually most IPv4 notifications *do* set the pid to zero which is the right thing to do for kernel-generated messages. You're right though that the IPv6 notification modified by this patch does set the pid to the netlink originator. Looking back in history it seems that this behaviour was only introduced last year to a subset of notifications. This inconsistency is very bad. IMHO this change (made last year) should be reverted so that all kernel generated (broadcast) notifications have the originator set to zero to match the source address of the message. Any notification that sets the netlink pid to current->pid is *completely* bogus. Let me repeat this, the netlink pid is not a process ID. Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu 许志壬 [EMAIL PROTECTED] Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
Re: [PATCH 09/16] [IPv6] address: Convert address notification to use rtnl_notify()
* Herbert Xu [EMAIL PROTECTED] 2006-08-16 21:12 On Wed, Aug 16, 2006 at 12:58:56PM +0200, Thomas Graf wrote: All route and tc notifications already use the pid so applications can decide whether the event was caused by them. A notification is a reply to a request so it doesn't even violate RFC 2367. Actually most IPv4 notifications *do* set the pid to zero which is the right thing to do for kernel-generated messages. You're right though that the IPv6 notification modified by this patch does set the pid to the netlink originator. Looking back in history it seems that this behaviour was only introduced last year to a subset of notifications. It was added to help quagga identify which route modifications are self caused. It's not possible to use rtm_protocol for this purpose as other applications can delete routes added by quagga. This inconsistency is very bad. IMHO this change (made last year) should be reverted so that all kernel generated (broadcast) notifications have the originator set to zero to match the source address of the message. We can't just knowingly break quagga. I think it's a good thing to include the pid, it's additional information that is helpful to userspace. Userspace is already aware that the notifications are originating from the kernel, we can't do userspace-to-userspace communication anymore anyway. Any notification that sets the netlink pid to current->pid is *completely* bogus. Let me repeat this, the netlink pid is not a process ID. Everyone is aware of that, actually these patches fix all occurrences of current->pid by replacing them with a pid of 0.
Re: [PATCH 09/16] [IPv6] address: Convert address notification to use rtnl_notify()
On Wed, Aug 16, 2006 at 09:12:40PM +1000, herbert wrote: Any notification that sets the netlink pid to current->pid is *completely* bogus. Let me repeat this, the netlink pid is not a process ID. BTW, I'm not having a go at either Thomas or Jamal. You guys are on the same side for once :). I honestly believe that we have a misunderstanding here which needs to be sorted out. It gets worse because that misunderstanding has made it into the manpages package which only causes more confusion. So let's step back a bit and think about where does this pid really come from. The field in question is nlmsg_pid. Its primary purpose is to identify unicast transactions along with the field nlmsg_seq. It was not designed to identify the origin of a broadcast kernel notification to a third party. For this purpose, the value of nlmsg_pid is set to the address of the destination socket for a particular unicast message (also known as the pid). That pid in turn has only a vague connection with the process ID of the process owning the socket. For practical purposes, we should not treat it as a process ID since it can easily be claimed by another process (think socket + bind + fork). For a broadcast notification, the nlmsg_pid field is meaningless because the nlmsg_seq field is also meaningless. I'm not denying that it wouldn't be useful to have the originator's socket address in there. What I'm saying is that it's the wrong place to put that information. In any case, putting current->pid in this field is definitely a bad idea because it only encourages people to confuse the netlink pid with the process ID which can lead to security problems later on.
Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu 许志壬 [EMAIL PROTECTED] Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
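The claim that the netlink pid is a socket address rather than a process ID can be checked from userspace. The following is a minimal sketch and not part of the thread: it assumes a Linux system where NETLINK_ROUTE sockets can be opened, and the helper names open_route_socket() and netlink_port_id() are made up for illustration. It binds with nl_pid set to 0 so that the kernel picks the address, then reads the result back with getsockname().

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <linux/netlink.h>

/* Hypothetical helper: open a NETLINK_ROUTE socket and let the kernel
 * choose its address by binding with nl_pid == 0.
 * Returns the file descriptor, or -1 on failure. */
int open_route_socket(void)
{
	struct sockaddr_nl addr;
	int fd = socket(AF_NETLINK, SOCK_RAW, NETLINK_ROUTE);

	if (fd < 0)
		return -1;

	memset(&addr, 0, sizeof(addr));
	addr.nl_family = AF_NETLINK;	/* nl_pid stays 0: kernel assigns it */

	if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
		close(fd);
		return -1;
	}
	return fd;
}

/* Hypothetical helper: read back the address the socket is bound to.
 * This is the value that ends up in nlmsg_pid of unicast replies.
 * Returns 0 on success, -1 on failure. */
int netlink_port_id(int fd, uint32_t *port_id)
{
	struct sockaddr_nl addr;
	socklen_t len = sizeof(addr);

	assert(port_id != NULL);
	memset(&addr, 0, sizeof(addr));
	if (getsockname(fd, (struct sockaddr *)&addr, &len) < 0)
		return -1;

	*port_id = addr.nl_pid;
	return 0;
}
```

Two sockets opened this way in the same process should report different values, so at most one of them can match getpid(). That is a reason to treat the value as an opaque socket address.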
Re: [PATCH 1/1] network memory allocator.
On Wednesday 16 August 2006 11:00, Evgeniy Polyakov wrote: There is a drawback here - if data was allocated on the CPU where the NIC is closer and then processed on a different CPU it will cost more than in the case where the buffer was allocated on the CPU where it will be processed. But from other point of view, most of the adapters preallocate set of skbs, and with msi-x help there will be a possibility to bind irq and processing to the CPU where data was originally allocated. So I would like to know how to determine which node should be used for allocation. Changes of __get_user_pages() to alloc_pages_node() are trivial. There are two separate memory areas here: Your own metadata used by the allocator and the memory used for skb data. avl_node_array[cpu] and avl_container_array[cpu] are designed to be accessed only by the local cpu, so these should be done like avl_node_array[cpu] = kmalloc_node(AVL_NODE_PAGES * sizeof(void *), GFP_KERNEL, cpu_to_node(cpu)); or you could make the whole array DEFINE_PER_CPU(void *, which would waste some space in the kernel object file. Now for the actual pages you get with __get_free_pages(), doing the same (alloc_pages_node), will help accessing your avl_container members, but may not be the best solution for getting the data next to the network adapter. Arnd
[PATCH] add bcm43xx-d80211 MAINTAINERS entry
Hi John, Please pull the patch to add Larry as bcm43xx-softmac maintainer into wireless-dev and _after_ that please apply the following patch to mark the d80211 branch explicitly. -- Add MAINTAINERS for bcm43xx-d80211 Signed-off-by: Michael Buesch [EMAIL PROTECTED] Index: wireless-dev/MAINTAINERS === --- wireless-dev.orig/MAINTAINERS 2006-08-16 11:26:18.0 +0200 +++ wireless-dev/MAINTAINERS2006-08-16 11:27:29.0 +0200 @@ -456,6 +456,14 @@ W: http://www.baycom.org/~tom/ham/ham.html S: Maintained +BCM43XX WIRELESS DRIVER (DEVICESCAPE BASED VERSION) +P: Michael Buesch +M: [EMAIL PROTECTED] +P: Stefano Brivio +M: [EMAIL PROTECTED] +W: http://bcm43xx.berlios.de/ +S: Maintained + BCM43XX WIRELESS DRIVER (SOFTMAC BASED VERSION) P: Larry Finger M: [EMAIL PROTECTED] -- Greetings Michael.
Re: [RFC] network namespaces
Quoting Andrey Savochkin ([EMAIL PROTECTED]): Hi All, I'd like to resurrect our discussion about network namespaces. In our previous discussions it appeared that we have rather polar concepts which seemed hard to reconcile. Now I have an idea how to look at all discussed concepts to enable everyone's usage scenario. 1. The most straightforward concept is complete separation of namespaces, covering device list, routing tables, netfilter tables, socket hashes, and everything else. On input path, each packet is tagged with namespace right from the place where it appears from a device, and is processed by each layer in the context of this namespace. Non-root namespaces communicate with the outside world in two ways: by owning hardware devices, or receiving packets forwarded to them by their parent namespace via pass-through device. This complete separation of namespaces is very useful for at least two purposes: - allowing users to create and manage on their own various tunnels and VPNs, and - enabling easier and more straightforward live migration of groups of processes with their environment. I conceptually prefer this approach, but I seem to recall there were actual problems in using this for checkpoint/restart of lightweight (application) containers. Performance aside, are there any reasons why this approach would be problematic for c/r? I'm afraid Daniel may be on vacation, and don't know who else other than Eric might have thoughts on this. 2. People expressed concerns that complete separation of namespaces may introduce an undesired overhead in certain usage scenarios. The overhead comes from packets traversing input path, then output path, then input path again in the destination namespace if root namespace acts as a router. So, we may introduce short-cuts, when input packet starts to be processed in one namespace, but changes it at some upper layer.
The places where packet can change namespace are, for example: routing, post-routing netfilter hook, or even lookup in socket hash. The cleanest example among them is post-routing netfilter hook. Tagging of input packets there means that the packet is checked against root namespace's routing table, found to be local, and goes directly to the socket hash lookup in the destination namespace. In this scheme the ability to change routing tables or netfilter rules on a per-namespace basis is traded for lower overhead. All other optimized schemes where input packets do not travel input-output-input paths in general case may be viewed as short-cuts in scheme (1). The remaining question is exactly which short-cuts make most sense, and how to make them consistent from the interface point of view. My current idea is to reach some agreement on the basic concept, review patches, and then move on to implementing feasible short-cuts. Opinions? Next in this thread are patches introducing namespaces to device list, IPv4 routing, and socket hashes, and a pass-through device. Patches are against 2.6.18-rc4-mm1. Just to provide the extreme other end of implementation options, here is the bsdjail based version I've been using for some testing while waiting for network namespaces to show up in -mm :) (Not intended for *any* sort of inclusion consideration :) Example usage: ifconfig eth0:0 192.168.1.16 echo -n ip 192.168.1.16 /proc/$$/attr/exec exec /bin/sh -serge From: Serge E. Hallyn [EMAIL PROTECTED](none) Date: Wed, 26 Jul 2006 21:47:13 -0500 Subject: [PATCH 1/1] bsdjail: define bsdjail lsm Define the actual bsdjail LSM. Signed-off-by: Serge E. 
Hallyn [EMAIL PROTECTED] --- security/Kconfig | 11 security/Makefile |1 security/bsdjail.c | 1351 3 files changed, 1363 insertions(+), 0 deletions(-) diff --git a/security/Kconfig b/security/Kconfig index 67785df..fa30e40 100644 --- a/security/Kconfig +++ b/security/Kconfig @@ -105,6 +105,17 @@ config SECURITY_SECLVL If you are unsure how to answer this question, answer N. +config SECURITY_BSDJAIL + tristate BSD Jail LSM + depends on SECURITY + select SECURITY_NETWORK + help + Provides BSD Jail compartmentalization functionality. + See Documentation/bsdjail.txt for more information and + usage instructions. + + If you are unsure how to answer this question, answer N. + source security/selinux/Kconfig endmenu diff --git a/security/Makefile b/security/Makefile index 8cbbf2f..050b588 100644 --- a/security/Makefile +++ b/security/Makefile @@ -17,3 +17,4 @@ obj-$(CONFIG_SECURITY_SELINUX)+= selin obj-$(CONFIG_SECURITY_CAPABILITIES)+= commoncap.o capability.o obj-$(CONFIG_SECURITY_ROOTPLUG)+= commoncap.o root_plug.o obj-$(CONFIG_SECURITY_SECLVL) +=
Re: [PATCH 09/16] [IPv6] address: Convert address notification to use rtnl_notify()
On Wed, Aug 16, 2006 at 01:40:03PM +0200, Thomas Graf wrote: It was added to help quagga identify which route modifications are self caused. It's not possible to use rtm_protocol for this purpose as other applications can delete routes added by quagga. Actually it's not that bad. I just checked the quagga source and the stuff it needs was already provided anyway, even before that change. In fact, the really bad bits in the changeset have already been reverted by Alexey back in February :) Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu 许志壬 [EMAIL PROTECTED] Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
Re: [PATCH 09/16] [IPv6] address: Convert address notification to use rtnl_notify()
On Wed, 2006-16-08 at 21:39 +1000, Herbert Xu wrote: So let's step back a bit and think about where does this pid really come from. The field in question is nlmsg_pid. Its primary purpose is to identify unicast transactions along with the field nlmsg_seq. It was not designed to identify the origin of a broadcast kernel notification to a third party. There are quite a few things that netlink design intent was not intending to solve that became needed over time. This being one IMHO. Design intent and eventual (sometimes creative) use occasionally create an impedance ;-> Evolution is the only description i can think of. For this purpose, the value of nlmsg_pid is set to the address of the destination socket for a particular unicast message (also known as the pid). Since we are talking history: The idea of it being a destination socket _was not_ design intent. It was evolution. I recall James Morris actually to be the first person whining about this ambiguity when coding up nfqueue. I can't remember who fixed it (I am inclined to think it was you;->) That pid in turn has only a vague connection with the process ID of the process owning the socket. For practical purposes, we should not treat it as a process ID since it can easily be claimed by another process (think socket + bind + fork). If you want to be complete the kernel should fix the pid in netlink::sendmsg(). For a broadcast notification, the nlmsg_pid field is meaningless because the nlmsg_seq field is also meaningless. nlmsg_seq is meaningless; seq is again a bad noun. It should be cookie. I'm not denying that it wouldn't be useful to have the originator's socket address in there. What I'm saying is that it's the wrong place to put that information. In any case, putting current->pid in this field is definitely a bad idea because it only encourages people to confuse the netlink pid with the process ID which can lead to security problems later on. current->pid i think is coming out to be a bad idea.
Thomas' patches revert it out. Again this has everything to do with the original idea what maps to pid now changing to socketid. What do you think of the idea of in fact rewriting the pid to be that of the socket id? cheers, jamal
Re: [PATCH 09/16] [IPv6] address: Convert address notification to use rtnl_notify()
On Wed, 2006-16-08 at 14:05 +0200, Thomas Graf wrote: Right, but he forgot the bits in IPv6 which I now fixed. The changeset introducing those current->pid uses was definitely simply wrong. I'm not questioning that :) Herbert, if you look at the thread as well I am no longer questioning that either ;-> cheers, jamal PS:- Would a topic of things i wish netlink did better be of interest for discussion (maybe for netconf)? (Un)fortunately, we are fixing some of them with genetlink;->
Re: [PATCH 09/16] [IPv6] address: Convert address notification to use rtnl_notify()
* Herbert Xu [EMAIL PROTECTED] 2006-08-16 21:57 On Wed, Aug 16, 2006 at 01:40:03PM +0200, Thomas Graf wrote: It was added to help quagga identify which route modifications are self caused. It's not possible to use rtm_protocol for this purpose as other applications can delete routes added by quagga. Actually it's not that bad. I just checked the quagga source and the stuff it needs was already provided anyway, even before that change. If I recall correctly the quagga folks asked to get the same behaviour for IPv6 routes as it was already done for IPv4 around the time of that bogus changeset. In fact, the really bad bits in the changeset have already been reverted by Alexey back in February :) Right, but he forgot the bits in IPv6 which I now fixed. The changeset introducing those current->pid uses was definitely simply wrong. I'm not questioning that :)
Re: [PATCH 09/16] [IPv6] address: Convert address notification to use rtnl_notify()
* jamal [EMAIL PROTECTED] 2006-08-16 08:04 current->pid i think is coming out to be a bad idea. Thomas' patches revert it out. Again this has everything to do with the original idea what maps to pid now changing to socketid. It probably developed from autobind using current->tid.
Re: [PATCH 09/16] [IPv6] address: Convert address notification to use rtnl_notify()
On Wed, Aug 16, 2006 at 08:04:24AM -0400, jamal wrote: What do you think of the idea of in fact rewriting the pid to be that of the socket id? Rewriting it with the netlink socket address? That's fine by me as long as there is a clear 1-to-1 relationship between the request and the notification. Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu 许志壬 [EMAIL PROTECTED] Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
Re: [PATCH 09/16] [IPv6] address: Convert address notification to use rtnl_notify()
* Herbert Xu [EMAIL PROTECTED] 2006-08-16 21:39 For a broadcast notification, the nlmsg_pid field is meaningless because the nlmsg_seq field is also meaningless. I'm not denying that it wouldn't be useful to have the originator's socket address in there. What I'm saying is that it's the wrong place to put that information. It might not be the best place to put it considering the original intent of nlmsg_pid as you explained correctly. However, as you state yourself, the nlmsg_pid field is meaningless/unused for notifications so extending the definition of nlmsg_pid to have a special meaning for broadcasts doesn't harm anyone. When nlmsg_seq is set to the seq of the request, it gains a meaning together with nlmsg_pid, as applications can then easily assign notifications to their own sent requests. Secondly we already have applications depending on this whereas the eventual breaking of applications depending on nlmsg_pid == 0 is uncertain.
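The matching that this argument relies on can be sketched in a few lines of userspace C. This is an illustration only, under the assumption that the kernel copies the requester's socket address into nlmsg_pid and the request's sequence number into nlmsg_seq, which is exactly the behaviour under debate in this thread. The enum and both function names are hypothetical; only struct nlmsghdr and its fields come from linux/netlink.h.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <linux/netlink.h>

/* Hypothetical classification of a received notification. */
enum notif_origin {
	NOTIF_UNATTRIBUTED,	/* nlmsg_pid == 0: no requester recorded */
	NOTIF_SELF,		/* caused by a request from our own socket */
	NOTIF_OTHER		/* caused by some other netlink socket */
};

/* my_port_id is the address of our own netlink socket (the nl_pid
 * returned by getsockname()), not the result of getpid(). */
enum notif_origin classify_notification(const struct nlmsghdr *nlh,
					uint32_t my_port_id)
{
	assert(nlh != NULL);

	if (nlh->nlmsg_pid == 0)
		return NOTIF_UNATTRIBUTED;
	if (nlh->nlmsg_pid == my_port_id)
		return NOTIF_SELF;
	return NOTIF_OTHER;
}

/* Returns 1 if the notification can be tied to one specific request
 * we sent, identified by the sequence number we used for it. */
int notification_matches_request(const struct nlmsghdr *nlh,
				 uint32_t my_port_id, uint32_t request_seq)
{
	assert(nlh != NULL);

	return nlh->nlmsg_pid == my_port_id &&
	       nlh->nlmsg_seq == request_seq;
}
```

A routing daemon in quagga's position could use the first function to skip route changes it caused itself, and the second to tie a notification to one particular outstanding request.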
Re: [PATCH 09/16] [IPv6] address: Convert address notification to use rtnl_notify()
On Wed, 2006-16-08 at 14:08 +0200, Thomas Graf wrote: * jamal [EMAIL PROTECTED] 2006-08-16 08:04 current->pid i think is coming out to be a bad idea. Thomas' patches revert it out. Again this has everything to do with the original idea what maps to pid now changing to socketid. It probably developed from autobind using current->tid. In one conversation with Alexey he told me there was some inspiration from pfkey in the semantics of it i.e processid. Obviously with many sockets on the same process etc, that assumption is no longer valid. On Wed, 2006-16-08 at 22:08 +1000, Herbert Xu wrote: On Wed, Aug 16, 2006 at 08:04:24AM -0400, jamal wrote: What do you think of the idea of in fact rewriting the pid to be that of the socket id? Rewriting it with the netlink socket address? That's fine by me as long as there is a clear 1-to-1 relationship between the request and the notification. you would have to call getpeername() to get a correct 1-1 mapping as is today when in doubt. What i was suggesting is notifications using the pid that would id the socket and would therefore require a getpeername() which identifies the real socket it came from; if you are fine with what Thomas is doing, then this is unnecessary since i was suggesting it as a compromise for the consistency you pointed out was lacking. cheers, jamal
Re: [PATCH 09/16] [IPv6] address: Convert address notification to use rtnl_notify()
Hello! The netlink header pid is really akin to sadb_msg_pid from RFC 2367. IMHO it should always be zero if the kernel is the originator of the message. No. Analogue of sadb_msg_pid is nladdr.nl_pid. Netlink header pid is not originator of the message, but author of the change. The notion is ambiguous by definition, and the field is a little ambiguous. If the message is a plain ack or part of a dump, it is obviously pid of requestor. But if it is notification about change, it can be nl_pid of socket, which requested the operation, but may be 0. Of course, all the 0s sent only because I was lazy to track authorship, should be eliminated. Alexey
[rt2500usb] link led weirdness
Hey, I just noticed that my rt2500usb device turns on the link LED when I just add an active monitor interface. I can't imagine that being on purpose, but I'm not sure what controls it. johannes
Re: [PATCH 09/16] [IPv6] address: Convert address notification to use rtnl_notify()
Hello! In one conversation with Alexey he told me there was some inspiration from pfkey in the semantics of it i.e processid. Inspiration, but not a copy. :-) Unlike pfkeyv2 it uses addressing usual for networking i.e. struct sockaddr_nl. Alexey
Re: [PATCH 09/16] [IPv6] address: Convert address notification to use rtnl_notify()
On Wed, 2006-16-08 at 17:04 +0400, Alexey Kuznetsov wrote: Hello! In one conversation with Alexey he told me there was some inspiration from pfkey in the semantics of it i.e processid. Inspiration, but not a copy. :-) Oh, absolutely. Netlink is way superior. I should have said perspiration instead of inspiration;-> Calling it inspiration was being polite - it is as polite as saying i was being economical with the truth[1] ;-> Unlike pfkeyv2 it uses addressing usual for networking i.e. struct sockaddr_nl. I think this needs to be captured somewhere. I don't know who is maintaining the man pages these days. cheers, jamal [1] A term i learnt from some British guy. They have ways with words those Brits.
Re: [PATCH][IPSEC]: Aggregate make_jiffies
On Tue, 2006-15-08 at 22:59 -0400, jamal wrote: How about moving it to linux/jiffies.h and rewrite in the style of msec_to_jiffies? Is there something other than the boundary check already done you foresee being made? If yes, do you wanna take a crack at it? Herbert, I actually don't know the answer, which is why i am punting it to you;-> I would just move the whole thing to linux/jiffies.h as is but you seem to suggest there may be other boundary checks. If yes, go at it ;-> cheers, jamal
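The thread does not show the helper's body, so the following is only a sketch of the kind of boundary check being discussed. It assumes the helper converts a number of seconds to jiffies and clamps the result so that the multiplication cannot overflow. SKETCH_HZ, SKETCH_MAX_TIMEOUT and secs_to_jiffies_sketch() are made-up stand-ins for illustration, not kernel symbols or kernel values.

```c
#include <assert.h>
#include <limits.h>

/* Assumed values, for illustration only; the kernel defines its own
 * tick rate and maximum timeout. */
#define SKETCH_HZ		250L
#define SKETCH_MAX_TIMEOUT	LONG_MAX

/* Convert seconds to ticks, clamping instead of overflowing. */
unsigned long secs_to_jiffies_sketch(long secs)
{
	assert(secs >= 0);

	/* Boundary check: secs * SKETCH_HZ would exceed the maximum,
	 * so return the largest usable timeout instead. */
	if (secs >= (SKETCH_MAX_TIMEOUT - 1) / SKETCH_HZ)
		return (unsigned long)(SKETCH_MAX_TIMEOUT - 1);

	return (unsigned long)secs * SKETCH_HZ;
}
```

Dividing the limit by the tick rate before comparing keeps the check itself free of overflow, which is the main point of testing the input rather than the product.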
[take10 2/2] kevent: poll/select() notifications. Timer notifications.
poll/select() notifications. Timer notifications. This patch includes generic poll/select and timer notifications. kevent_poll works similarly to epoll and has the same issues (callback is invoked not from internal state machine of the caller, but through process awake). Timer notifications can be used for fine grained per-process time management, since interval timers are very inconvenient to use, and they are limited. Signed-off-by: Evgeniy Polyakov [EMAIL PROTECTED] diff --git a/kernel/kevent/kevent_poll.c b/kernel/kevent/kevent_poll.c new file mode 100644 index 000..8a4f863 --- /dev/null +++ b/kernel/kevent/kevent_poll.c @@ -0,0 +1,220 @@ +/* + * kevent_poll.c + * + * 2006 Copyright (c) Evgeniy Polyakov [EMAIL PROTECTED] + * All rights reserved. + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. 
+ */ + +#include linux/kernel.h +#include linux/types.h +#include linux/list.h +#include linux/slab.h +#include linux/spinlock.h +#include linux/timer.h +#include linux/file.h +#include linux/kevent.h +#include linux/poll.h +#include linux/fs.h + +static kmem_cache_t *kevent_poll_container_cache; +static kmem_cache_t *kevent_poll_priv_cache; + +struct kevent_poll_ctl +{ + struct poll_table_structpt; + struct kevent *k; +}; + +struct kevent_poll_wait_container +{ + struct list_headcontainer_entry; + wait_queue_head_t *whead; + wait_queue_twait; + struct kevent *k; +}; + +struct kevent_poll_private +{ + struct list_headcontainer_list; + spinlock_t container_lock; +}; + +static int kevent_poll_enqueue(struct kevent *k); +static int kevent_poll_dequeue(struct kevent *k); +static int kevent_poll_callback(struct kevent *k); + +static int kevent_poll_wait_callback(wait_queue_t *wait, + unsigned mode, int sync, void *key) +{ + struct kevent_poll_wait_container *cont = + container_of(wait, struct kevent_poll_wait_container, wait); + struct kevent *k = cont-k; + struct file *file = k-st-origin; + u32 revents; + + revents = file-f_op-poll(file, NULL); + + kevent_storage_ready(k-st, NULL, revents); + + return 0; +} + +static void kevent_poll_qproc(struct file *file, wait_queue_head_t *whead, + struct poll_table_struct *poll_table) +{ + struct kevent *k = + container_of(poll_table, struct kevent_poll_ctl, pt)-k; + struct kevent_poll_private *priv = k-priv; + struct kevent_poll_wait_container *cont; + unsigned long flags; + + cont = kmem_cache_alloc(kevent_poll_container_cache, SLAB_KERNEL); + if (!cont) { + kevent_break(k); + return; + } + + cont-k = k; + init_waitqueue_func_entry(cont-wait, kevent_poll_wait_callback); + cont-whead = whead; + + spin_lock_irqsave(priv-container_lock, flags); + list_add_tail(cont-container_entry, priv-container_list); + spin_unlock_irqrestore(priv-container_lock, flags); + + add_wait_queue(whead, cont-wait); +} + +static int 
kevent_poll_enqueue(struct kevent *k) +{ + struct file *file; + int err, ready = 0; + unsigned int revents; + struct kevent_poll_ctl ctl; + struct kevent_poll_private *priv; + + file = fget(k-event.id.raw[0]); + if (!file) + return -ENODEV; + + err = -EINVAL; + if (!file-f_op || !file-f_op-poll) + goto err_out_fput; + + err = -ENOMEM; + priv = kmem_cache_alloc(kevent_poll_priv_cache, SLAB_KERNEL); + if (!priv) + goto err_out_fput; + + spin_lock_init(priv-container_lock); + INIT_LIST_HEAD(priv-container_list); + + k-priv = priv; + + ctl.k = k; + init_poll_funcptr(ctl.pt, kevent_poll_qproc); + + err = kevent_storage_enqueue(file-st, k); + if (err) + goto err_out_free; + + revents = file-f_op-poll(file, ctl.pt); + if (revents k-event.event) { + ready = 1; + kevent_poll_dequeue(k); + } + + return ready; + +err_out_free: + kmem_cache_free(kevent_poll_priv_cache, priv); +err_out_fput: + fput(file); + return err; +} + +static int kevent_poll_dequeue(struct kevent *k) +{ + struct file *file = k-st-origin; + struct kevent_poll_private *priv = k-priv; +
[take10 1/2] kevent: Core files.
Core files. This patch includes core kevent files: - userspace controlling - kernelspace interfaces - initialization - notification state machines Signed-off-by: Evgeniy Polyakov [EMAIL PROTECTED] diff --git a/arch/i386/kernel/syscall_table.S b/arch/i386/kernel/syscall_table.S index dd63d47..091ff42 100644 --- a/arch/i386/kernel/syscall_table.S +++ b/arch/i386/kernel/syscall_table.S @@ -317,3 +317,5 @@ ENTRY(sys_call_table) .long sys_tee /* 315 */ .long sys_vmsplice .long sys_move_pages + .long sys_kevent_get_events + .long sys_kevent_ctl diff --git a/arch/x86_64/ia32/ia32entry.S b/arch/x86_64/ia32/ia32entry.S index 5d4a7d1..b2af4a8 100644 --- a/arch/x86_64/ia32/ia32entry.S +++ b/arch/x86_64/ia32/ia32entry.S @@ -713,4 +713,6 @@ #endif .quad sys_tee .quad compat_sys_vmsplice .quad compat_sys_move_pages + .quad sys_kevent_get_events + .quad sys_kevent_ctl ia32_syscall_end: diff --git a/include/asm-i386/unistd.h b/include/asm-i386/unistd.h index fc1c8dd..c9dde13 100644 --- a/include/asm-i386/unistd.h +++ b/include/asm-i386/unistd.h @@ -323,10 +323,12 @@ #define __NR_sync_file_range 314 #define __NR_tee 315 #define __NR_vmsplice 316 #define __NR_move_pages317 +#define __NR_kevent_get_events 318 +#define __NR_kevent_ctl319 #ifdef __KERNEL__ -#define NR_syscalls 318 +#define NR_syscalls 320 /* * user-visible error numbers are in the range -1 - -128: see diff --git a/include/asm-x86_64/unistd.h b/include/asm-x86_64/unistd.h index 94387c9..61363e0 100644 --- a/include/asm-x86_64/unistd.h +++ b/include/asm-x86_64/unistd.h @@ -619,10 +619,14 @@ #define __NR_vmsplice 278 __SYSCALL(__NR_vmsplice, sys_vmsplice) #define __NR_move_pages279 __SYSCALL(__NR_move_pages, sys_move_pages) +#define __NR_kevent_get_events 280 +__SYSCALL(__NR_kevent_get_events, sys_kevent_get_events) +#define __NR_kevent_ctl281 +__SYSCALL(__NR_kevent_ctl, sys_kevent_ctl) #ifdef __KERNEL__ -#define __NR_syscall_max __NR_move_pages +#define __NR_syscall_max __NR_kevent_ctl #ifndef __NO_STUBS diff --git 
a/include/linux/kevent.h b/include/linux/kevent.h
new file mode 100644
index 000..03a
--- /dev/null
+++ b/include/linux/kevent.h
@@ -0,0 +1,310 @@
+/*
+ * kevent.h
+ *
+ * 2006 Copyright (c) Evgeniy Polyakov [EMAIL PROTECTED]
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
+ */
+
+#ifndef __KEVENT_H
+#define __KEVENT_H
+
+/*
+ * Kevent request flags.
+ */
+
+#define KEVENT_REQ_ONESHOT	0x1	/* Process this event only once and then dequeue. */
+
+/*
+ * Kevent return flags.
+ */
+#define KEVENT_RET_BROKEN	0x1	/* Kevent is broken. */
+#define KEVENT_RET_DONE		0x2	/* Kevent processing was finished successfully. */
+
+/*
+ * Kevent type set.
+ */
+#define KEVENT_SOCKET		0
+#define KEVENT_INODE		1
+#define KEVENT_TIMER		2
+#define KEVENT_POLL		3
+#define KEVENT_NAIO		4
+#define KEVENT_AIO		5
+#define KEVENT_MAX		6
+
+/*
+ * Per-type event sets.
+ * Number of per-event sets should be exactly as number of kevent types.
+ */
+
+/*
+ * Timer events.
+ */
+#define KEVENT_TIMER_FIRED	0x1
+
+/*
+ * Socket/network asynchronous IO events.
+ */
+#define KEVENT_SOCKET_RECV	0x1
+#define KEVENT_SOCKET_ACCEPT	0x2
+#define KEVENT_SOCKET_SEND	0x4
+
+/*
+ * Inode events.
+ */
+#define KEVENT_INODE_CREATE	0x1
+#define KEVENT_INODE_REMOVE	0x2
+
+/*
+ * Poll events.
+ */
+#define KEVENT_POLL_POLLIN	0x0001
+#define KEVENT_POLL_POLLPRI	0x0002
+#define KEVENT_POLL_POLLOUT	0x0004
+#define KEVENT_POLL_POLLERR	0x0008
+#define KEVENT_POLL_POLLHUP	0x0010
+#define KEVENT_POLL_POLLNVAL	0x0020
+
+#define KEVENT_POLL_POLLRDNORM	0x0040
+#define KEVENT_POLL_POLLRDBAND	0x0080
+#define KEVENT_POLL_POLLWRNORM	0x0100
+#define KEVENT_POLL_POLLWRBAND	0x0200
+#define KEVENT_POLL_POLLMSG	0x0400
+#define KEVENT_POLL_POLLREMOVE	0x1000
+
Re: [PATCH 2.6.17] net/ipv6/udp.c: remove duplicate udp_get_port code
Hi Yoshifuji, | +#if defined(CONFIG_IPV6) || defined(CONFIG_IPV6_MODULE) | + else if(sk-sk_family == PF_INET6 | + ipv6_rcv_saddr_equal(sk, sk2) ) | + goto fail; | + } | +#endif | | This is not good because you cannot link ipv6_rcv_saddr_equal() | if you are compiling IPv6 as module. Yes and the second ugliness was that ipv4/udp.c suddenly had to include net/addrconf.h. | How about retaining udp_v{4,6}_get_port() and call | common udp_get_port() from both functions? I enclose a realisation below - do you think that is better? Tested both IPv6 as module and as `y', double-checked all changes. Thank you for reviewing and comments. Signed-off-by: Gerrit Renker [EMAIL PROTECTED] --- include/net/udp.h | 18 +- net/ipv4/udp.c| 95 ++ net/ipv6/udp.c| 76 +-- 3 files changed, 64 insertions(+), 125 deletions(-) diff --git a/include/net/udp.h b/include/net/udp.h index 766fba1..c490a0f 100644 --- a/include/net/udp.h +++ b/include/net/udp.h @@ -30,25 +30,9 @@ #include linux/seq_file.h #define UDP_HTABLE_SIZE128 -/* udp.c: This needs to be shared by v4 and v6 because the lookup - *and hashing code needs to work with different AF's yet - *the port space is shared. 
- */ extern struct hlist_head udp_hash[UDP_HTABLE_SIZE]; extern rwlock_t udp_hash_lock; -extern int udp_port_rover; - -static inline int udp_lport_inuse(u16 num) -{ - struct sock *sk; - struct hlist_node *node; - - sk_for_each(sk, node, udp_hash[num (UDP_HTABLE_SIZE - 1)]) - if (inet_sk(sk)-num == num) - return 1; - return 0; -} /* Note: this must match 'valbool' in sock_setsockopt */ #define UDP_CSUM_NOXMIT1 @@ -63,6 +47,8 @@ extern struct proto udp_prot; struct sk_buff; +extern int udp_get_port(struct sock *sk, unsigned short snum, +int (*saddr_cmp)(struct sock *, struct sock *)); extern voidudp_err(struct sk_buff *, u32); extern int udp_sendmsg(struct kiocb *iocb, struct sock *sk, diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c index 3f93292..c5ee645 100644 --- a/net/ipv4/udp.c +++ b/net/ipv4/udp.c @@ -119,14 +119,34 @@ DEFINE_SNMP_STAT(struct udp_mib, udp_sta struct hlist_head udp_hash[UDP_HTABLE_SIZE]; DEFINE_RWLOCK(udp_hash_lock); -/* Shared by v4/v6 udp. */ +/* Shared by v4/v6 udp_get_port */ int udp_port_rover; -static int udp_v4_get_port(struct sock *sk, unsigned short snum) +static inline int udp_lport_inuse(u16 num) { + struct sock *sk; struct hlist_node *node; + + sk_for_each(sk, node, udp_hash[num (UDP_HTABLE_SIZE - 1)]) + if (inet_sk(sk)-num == num) + return 1; + return 0; +} + +/** + * udp_get_port - common port lookup for IPv4 and IPv6 + * + * @sk: socket struct in question + * @snum:port number to look up + * @saddr_comp: AF-dependent comparison of bound local IP addresses + */ +int udp_get_port(struct sock *sk, unsigned short snum, +int (*saddr_cmp)(struct sock *sk1, struct sock *sk2)) +{ + struct hlist_node *node; + struct hlist_head *head; struct sock *sk2; - struct inet_sock *inet = inet_sk(sk); + interror = 1; write_lock_bh(udp_hash_lock); if (snum == 0) { @@ -138,11 +158,10 @@ static int udp_v4_get_port(struct sock * best_size_so_far = 32767; best = result = udp_port_rover; for (i = 0; i UDP_HTABLE_SIZE; i++, result++) { - struct hlist_head 
*list; int size; - list = udp_hash[result (UDP_HTABLE_SIZE - 1)]; - if (hlist_empty(list)) { + head = udp_hash[result (UDP_HTABLE_SIZE - 1)]; + if (hlist_empty(head)) { if (result sysctl_local_port_range[1]) result = sysctl_local_port_range[0] + ((result - sysctl_local_port_range[0]) @@ -150,12 +169,11 @@ static int udp_v4_get_port(struct sock * goto gotit; } size = 0; - sk_for_each(sk2, node, list) - if (++size = best_size_so_far) - goto next; - best_size_so_far = size; - best = result; - next:; + sk_for_each(sk2, node, head) + if (++size best_size_so_far) { + best_size_so_far = size; +
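The approach in this patch keeps one common port-lookup routine and hands it an address-family-specific comparison callback, so the IPv4 code never has to link against an IPv6 symbol. The following is only a minimal userspace sketch of that pattern, not the kernel code: `struct fake_sock`, `port_in_use()` and the two comparators are hypothetical names, and the address comparison is deliberately simplified (wildcard or exact match only).

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for struct sock; not the kernel type. */
struct fake_sock {
	int      family;    /* 4 or 6 in this sketch */
	uint16_t port;      /* bound local port */
	uint8_t  addr[16];  /* bound local address, all-zero means wildcard */
};

/* AF-dependent comparison of bound local addresses. */
typedef bool (*saddr_cmp_t)(const struct fake_sock *a,
			    const struct fake_sock *b);

static bool is_wildcard(const struct fake_sock *s, size_t len)
{
	static const uint8_t zero[16];

	return memcmp(s->addr, zero, len) == 0;
}

/* IPv4-style rule: clash if either side is wildcard or both addresses match. */
bool v4_saddr_equal(const struct fake_sock *a, const struct fake_sock *b)
{
	return is_wildcard(a, 4) || is_wildcard(b, 4) ||
	       memcmp(a->addr, b->addr, 4) == 0;
}

/* IPv6-style rule, same idea over 16 bytes (simplified). */
bool v6_saddr_equal(const struct fake_sock *a, const struct fake_sock *b)
{
	return is_wildcard(a, 16) || is_wildcard(b, 16) ||
	       memcmp(a->addr, b->addr, 16) == 0;
}

/* Common lookup: true if binding 'sk' would clash with an entry in 'table'. */
bool port_in_use(const struct fake_sock *sk, const struct fake_sock *table,
		 size_t n, saddr_cmp_t cmp)
{
	for (size_t i = 0; i < n; i++) {
		const struct fake_sock *sk2 = &table[i];

		if (sk2 != sk && sk2->port == sk->port && cmp(sk, sk2))
			return true;
	}
	return false;
}
```

Each address family would call `port_in_use()` with its own comparator, which mirrors how the thin `udp_v{4,6}_get_port()` wrappers are meant to call the shared `udp_get_port()`.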
Re: [take9 0/2] kevent: Generic event handling mechanism.
On Mon, Aug 14, 2006 at 10:21:36AM +0400, Evgeniy Polyakov wrote:
> Generic event handling mechanism.

Hi, I've just started looking into this, so some comments here first on the submission process:

 - could you send new revisions of the patches in a new thread so one can easily find them?
 - the patch split is not very nice: your first patch adds Makefile and Kconfig entries for files that are only in the second patch or not actually submitted at all. That's a big no-no.
Re: [take9 2/2] kevent: poll/select() notifications. Timer notifications.
On Mon, Aug 14, 2006 at 10:21:36AM +0400, Evgeniy Polyakov wrote:
> poll/select() notifications. Timer notifications.
>
> This patch includes generic poll/select and timer notifications.
>
> kevent_poll works similar to epoll and has the same issues (callback
> is invoked not from internal state machine of the caller, but through
> process awake).

I'm not a big fan of duplicating code over and over. kevent is a candidate for a generic event delivery mechanism, which is a _very_ good thing. But starting that system by duplicating existing functionality is not very nice. What speaks against a patch that replaces the epoll core with something that builds on kevent while still supporting the epoll interface as a compatibility shim?

> Timer notifications can be used for fine grained per-process time
> management, since interval timers are very inconvenient to use,
> and they are limited.

I have similar reservations about this one. Having timers as part of a generic events system is very nice, but having so much duplicated functionality is not. Cc'ed Thomas on behalf of the Timer cabal if there's a point in integrating this into a larger framework of timer code.

Also it would be nice if you could submit each of the notifications as a patch on its own.

> diff --git a/kernel/kevent/kevent_poll.c b/kernel/kevent/kevent_poll.c
> new file mode 100644
> index 000..8a4f863
> --- /dev/null
> +++ b/kernel/kevent/kevent_poll.c
> @@ -0,0 +1,220 @@
> +/*
> + * kevent_poll.c
> + *
> + * 2006 Copyright (c) Evgeniy Polyakov [EMAIL PROTECTED]
> + * All rights reserved.
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation; either version 2 of the License, or
> + * (at your option) any later version.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
> + * GNU General Public License for more details.
> + */
> +
> +#include <linux/kernel.h>
> +#include <linux/types.h>
> +#include <linux/list.h>
> +#include <linux/slab.h>
> +#include <linux/spinlock.h>
> +#include <linux/timer.h>
> +#include <linux/file.h>
> +#include <linux/kevent.h>
> +#include <linux/poll.h>
> +#include <linux/fs.h>
> +
> +static kmem_cache_t *kevent_poll_container_cache;
> +static kmem_cache_t *kevent_poll_priv_cache;
> +
> +struct kevent_poll_ctl
> +{
> +	struct poll_table_struct	pt;
> +	struct kevent			*k;
> +};
> +
> +struct kevent_poll_wait_container
> +{
> +	struct list_head		container_entry;
> +	wait_queue_head_t		*whead;
> +	wait_queue_t			wait;
> +	struct kevent			*k;
> +};
> +
> +struct kevent_poll_private
> +{
> +	struct list_head		container_list;
> +	spinlock_t			container_lock;
> +};
> +
> +static int kevent_poll_enqueue(struct kevent *k);
> +static int kevent_poll_dequeue(struct kevent *k);
> +static int kevent_poll_callback(struct kevent *k);
> +
> +static int kevent_poll_wait_callback(wait_queue_t *wait,
> +		unsigned mode, int sync, void *key)
> +{
> +	struct kevent_poll_wait_container *cont =
> +		container_of(wait, struct kevent_poll_wait_container, wait);
> +	struct kevent *k = cont->k;
> +	struct file *file = k->st->origin;
> +	u32 revents;
> +
> +	revents = file->f_op->poll(file, NULL);
> +
> +	kevent_storage_ready(k->st, NULL, revents);
> +
> +	return 0;
> +}
> +
> +static void kevent_poll_qproc(struct file *file, wait_queue_head_t *whead,
> +		struct poll_table_struct *poll_table)
> +{
> +	struct kevent *k =
> +		container_of(poll_table, struct kevent_poll_ctl, pt)->k;
> +	struct kevent_poll_private *priv = k->priv;
> +	struct kevent_poll_wait_container *cont;
> +	unsigned long flags;
> +
> +	cont = kmem_cache_alloc(kevent_poll_container_cache, SLAB_KERNEL);
> +	if (!cont) {
> +		kevent_break(k);
> +		return;
> +	}
> +
> +	cont->k = k;
> +	init_waitqueue_func_entry(&cont->wait, kevent_poll_wait_callback);
> +	cont->whead = whead;
> +
> +	spin_lock_irqsave(&priv->container_lock, flags);
> +	list_add_tail(&cont->container_entry, &priv->container_list);
> +	spin_unlock_irqrestore(&priv->container_lock, flags);
> +
> +	add_wait_queue(whead, &cont->wait);
> +}
> +
> +static int kevent_poll_enqueue(struct kevent *k)
> +{
> +	struct file *file;
> +	int err, ready = 0;
> +	unsigned int revents;
> +	struct kevent_poll_ctl ctl;
> +	struct kevent_poll_private *priv;
> +
> +	file = fget(k->event.id.raw[0]);
> +	if (!file)
> +		return -ENODEV;
> +
> +	err = -EINVAL;
> +	if (!file->f_op || !file->f_op->poll)
> +		goto err_out_fput;
> +
> +	err = -ENOMEM;
> +	priv = kmem_cache_alloc(kevent_poll_priv_cache, SLAB_KERNEL);
> +	if
Re: bonding: cannot remove certain named devices
Giacomo A. Catenazzi ([EMAIL PROTECTED]) said:
> > Are you willing to work to add the special case code necessary
> > to handle whitespace characters in the device name over all of
> > the kernel code and also all of the userland tools too?
>
> But if you don't handle spaces in userspace, you handle *, ?, [, ], $,
> , ', \ in userspace? Should kernel disable also these (insane device
> chars) chars?

Don't forget unicode characters!

Seriously, while it might be insane to use some of these, I'm wondering if trying to filter names is more work than fixing the tools.

Bill
Re: [take9 0/2] kevent: Generic event handling mechanism.
On Wed, Aug 16, 2006 at 02:26:31PM +0100, Christoph Hellwig ([EMAIL PROTECTED]) wrote:
> On Mon, Aug 14, 2006 at 10:21:36AM +0400, Evgeniy Polyakov wrote:
> > Generic event handling mechanism.
>
> Hi, I've just started looking into this, so some comments here first
> on the submission process:
>
>  - could you send new revisions of the patches in a new thread so one
>    can easily find them?

Ok.

>  - the patch split is not very nice, your first patch adds Makefile
>    and Kconfig entries for files only in the second patch or not
>    actually submitted at all, that's a big no-no.

It is done by scripts using the list of files generated by git-diff, but I can reformat the patches to come in this order:

core files
poll/select
timer
any other
main Kconfig/Makefile

Kevent's makefile still contains all entries for files added later; is that a big problem right now? I can split the patches manually, but it would be much better to do that once a decision about inclusion is made, and to generate the patches as is until the review and feature addition process is completed...

--
	Evgeniy Polyakov
Re: [take9 2/2] kevent: poll/select() notifications. Timer notifications.
On Wed, Aug 16, 2006 at 02:30:14PM +0100, Christoph Hellwig ([EMAIL PROTECTED]) wrote:
> On Mon, Aug 14, 2006 at 10:21:36AM +0400, Evgeniy Polyakov wrote:
> > poll/select() notifications. Timer notifications.
> >
> > This patch includes generic poll/select and timer notifications.
> >
> > kevent_poll works similar to epoll and has the same issues (callback
> > is invoked not from internal state machine of the caller, but through
> > process awake).
>
> I'm not a big fan of duplicating code over and over. kevent is a
> candidate for a generic event delivery mechanism, which is a _very_
> good thing. But starting that system by duplicating existing
> functionality is not very nice. What speaks against a patch that
> replaces the epoll core with something that builds on kevent while
> still supporting the epoll interface as a compatibility shim?

There is no problem from my side, but epoll and kevent_poll differ in some aspects, so it may be better not to replace it for a while.

> > Timer notifications can be used for fine grained per-process time
> > management, since interval timers are very inconvenient to use,
> > and they are limited.
>
> I have similar reservations about this one. Having timers as part of
> a generic events system is very nice, but having so much duplicated
> functionality is not. Cc'ed Thomas on behalf of the Timer cabal if
> there's a point in integrating this into a larger framework of timer
> code.
>
> Also it would be nice if you could submit each of the notifications
> as a patch on its own.

Ok.

--
	Evgeniy Polyakov
Re: [take9 1/2] kevent: Core files.
> diff --git a/include/linux/kevent.h b/include/linux/kevent.h
> new file mode 100644
> index 000..03a
> --- /dev/null
> +++ b/include/linux/kevent.h
> @@ -0,0 +1,310 @@
> +/*
> + * kevent.h

Please don't put filenames in the top-of-file block comments. They're redundant and, as history shows, out of date far too often.

> +#ifdef __KERNEL__

Please split the user/kernel ABI and kernel implementation details into two different headers. That way we don't have to run unifdef as part of the user headers generation process, and it's much clearer which bits are kernel implementation details and which are the public ABI.

> +#define KEVENT_READY		0x1
> +#define KEVENT_STORAGE		0x2
> +#define KEVENT_USER		0x4

Please use enums here.

> +	void			*priv;	/* Private data for different storages.
> +					 * poll()/select storage has a list of wait_queue_t containers
> +					 * for each ->poll() { poll_wait()' } here.
> +					 */

Please try to avoid spilling over the 80 chars limit. In this case it's easy, just put the comment before the field being documented.

> +extern struct kevent_callbacks kevent_registered_callbacks[];

Having global arrays is not very nice. Any chance this could be hidden behind proper accessor functions?

> +#ifdef CONFIG_KEVENT_INODE
> +void kevent_inode_notify(struct inode *inode, u32 event);
> +void kevent_inode_notify_parent(struct dentry *dentry, u32 event);
> +void kevent_inode_remove(struct inode *inode);
> +#else
> +static inline void kevent_inode_notify(struct inode *inode, u32 event)
> +{
> +}
> +static inline void kevent_inode_notify_parent(struct dentry *dentry, u32 event)
> +{
> +}
> +static inline void kevent_inode_remove(struct inode *inode)
> +{
> +}
> +#endif /* CONFIG_KEVENT_INODE */

The code implementing these prototypes doesn't exist.

> +#ifdef CONFIG_KEVENT_SOCKET
> +#ifdef CONFIG_LOCKDEP
> +void kevent_socket_reinit(struct socket *sock);
> +void kevent_sk_reinit(struct sock *sk);
> +#else
> +static inline void kevent_socket_reinit(struct socket *sock)
> +{
> +}
> +static inline void kevent_sk_reinit(struct sock *sk)
> +{
> +}
> +#endif

Ditto. Please clean the header of all this dead code.

> +int kevent_storage_init(void *origin, struct kevent_storage *st)
> +{
> +	spin_lock_init(&st->lock);
> +	st->origin = origin;
> +	INIT_LIST_HEAD(&st->list);
> +	return 0;
> +}

Why does this need a return value?

> +int kevent_sys_init(void)
> +{
> +	int i;
> +
> +	kevent_cache = kmem_cache_create("kevent_cache",
> +			sizeof(struct kevent), 0, SLAB_PANIC, NULL, NULL);
> +
> +	for (i=0; i<ARRAY_SIZE(kevent_registered_callbacks); ++i) {
> +		struct kevent_callbacks *c = &kevent_registered_callbacks[i];
> +
> +		c->callback = c->enqueue = c->dequeue = NULL;
> +	}
> +
> +	return 0;
> +}

Please make this an initcall in this file and make sure it's linked before kevent_users.c

> +static int kevent_user_open(struct inode *, struct file *);
> +static int kevent_user_release(struct inode *, struct file *);
> +static unsigned int kevent_user_poll(struct file *, struct poll_table_struct *);
> +static int kevent_user_mmap(struct file *, struct vm_area_struct *);

Could you reorder the file so these forward-declaring prototypes aren't needed?

> +	for (i=0; i<ARRAY_SIZE(u->kevent_list); ++i)

	for (i = 0; i < ARRAY_SIZE(u->kevent_list); i++)

> +static struct page *kevent_user_nopage(struct vm_area_struct *vma, unsigned long addr, int *type)
> +{
> +	struct kevent_user *u = vma->vm_file->private_data;
> +	unsigned long off = (addr - vma->vm_start)/PAGE_SIZE;
> +	unsigned int pnum = ALIGN(KEVENT_MAX_EVENTS*sizeof(struct mukevent) + sizeof(unsigned int), PAGE_SIZE)/PAGE_SIZE;
> +
> +	if (type)
> +		*type = VM_FAULT_MINOR;
> +
> +	if (off >= pnum)
> +		goto err_out_sigbus;
> +
> +	u->pring[off] = __get_free_page(GFP_KERNEL);

So we have a pagefault handler that allocates pages.

> +static int kevent_user_mmap(struct file *file, struct vm_area_struct *vma)
> +{
> +	unsigned long start = vma->vm_start;
> +	struct kevent_user *u = file->private_data;
> +
> +	if (vma->vm_flags & VM_WRITE)
> +		return -EPERM;
> +
> +	vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
> +	vma->vm_ops = &kevent_user_vm_ops;
> +	vma->vm_flags |= VM_RESERVED;
> +	vma->vm_file = file;
> +
> +	if (remap_pfn_range(vma, start, virt_to_phys((void *)u->pring[0]), PAGE_SIZE,
> +				vma->vm_page_prot))
> +		return -EFAULT;

but you always map the first page. This model sounds odd and rather confusing. Do we really need to avoid the cost of the pagefault just for the special first page? If so, please at least use vm_insert_page() instead of remap_pfn_range().

> +#if 0
> +static inline unsigned int
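As an illustration of the "please use enums here" comment, the three state flags could be declared roughly as follows. This is only a sketch: the values are the ones from the quoted patch, but the enum tag `kevent_state_flags` and the helper `kevent_flag_set()` are hypothetical names, not part of the submission.

```c
/* Values taken from the quoted #defines; the enum tag is an assumption. */
enum kevent_state_flags {
	KEVENT_READY   = 0x1,
	KEVENT_STORAGE = 0x2,
	KEVENT_USER    = 0x4,
};

/* Hypothetical helper: nonzero if 'flag' is set in the 'state' bitmask. */
static inline int kevent_flag_set(unsigned int state,
				  enum kevent_state_flags flag)
{
	return (state & flag) != 0;
}
```

Compared with bare `#define`s, the enum groups the related constants under one name and lets a debugger print them symbolically.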
[PATCH] wireless-dev: relax sysfs permissions
The sysfs attributes add_iface and remove_iface both check for CAP_NET_ADMIN whenever something is written. Hence, permissions for the files should be relaxed so that someone who is not root but happens to have CAP_NET_ADMIN can do things.

Signed-off-by: Johannes Berg [EMAIL PROTECTED]

--- wireless-dev.orig/net/d80211/ieee80211_sysfs.c	2006-08-16 15:45:41.0 +0200
+++ wireless-dev/net/d80211/ieee80211_sysfs.c	2006-08-16 15:46:05.0 +0200
@@ -195,8 +195,8 @@ __IEEE80211_LOCAL_SHOW(rate_ctrl_alg);
 
 static struct class_device_attribute ieee80211_class_dev_attrs[] = {
-	__ATTR(add_iface, S_IWUSR, NULL, store_add_iface),
-	__ATTR(remove_iface, S_IWUSR, NULL, store_remove_iface),
+	__ATTR(add_iface, S_IWUGO, NULL, store_add_iface),
+	__ATTR(remove_iface, S_IWUGO, NULL, store_remove_iface),
 	__ATTR(channel, S_IRUGO, ieee80211_local_show_channel, NULL),
 	__ATTR(frequency, S_IRUGO, ieee80211_local_show_frequency, NULL),
 	__ATTR(radar_detect, S_IRUGO, ieee80211_local_show_radar_detect, NULL),
Re: Possible leak of multicast source filter sctructure #3
Hi

> I'm not sure the second one is quite right. The case of concern is
> where an interface is deleted. If you joined (or left) the group by
> address and then deleted the interface, then you wouldn't match the
> index (which wouldn't be set) so the leave wouldn't work, still.

That's right, I haven't thought of this case.

> Also, if you passed a completely bogus ifindex, it should return
> ENODEV, but with the patch it would return EADDRNOTAVAIL it appears.

The question is what a completely bogus ifindex is in this case. An interface that no longer exists but happens to be on the socket's multicast list shouldn't be.

> So, I think the second patch needs some more work. I'll look at it
> some more and see if I can suggest something better.
>
> +-DLS

I've tried to implement something more complete, but especially in the case of leaving a group by address it is still just a best effort and not something absolutely perfect.

I've started with streamlining the ip_mc_find_dev() function, with one little change in its behavior: clearing the imr_address member of the ip_mreqn request structure in case an interface is found by an index. This should be no problem since this member is not used in this case and may contain a random value. So I clear it to get rid of this randomness, since this value might now be used in ip_mc_leave_group().

Now the changes in ip_mc_leave_group(). I've split it into two different cases:

1) leave by an interface index
2) leave by an interface address / multicast address

In the first case I search for a match by the interface index specified in the leave request. If a match is found I leave the group on the interface irrespective of its existence.

In the second case I do a similar search (but this time using the interface index found in ip_mc_find_dev()) while also checking for a match by the interface address.
If no match is found by the interface index and there is a match (or more) by the address I leave the group on the interface corresponding to the first match by the address. This certainly could produce weird results but such results could be produced by the original algorithm as well with the additional problem that there was no way to leave a group on a deleted interface. And here is the patch: Signed-off-by: Michal Ruzicka [EMAIL PROTECTED] --- linux-2.6.17.8/net/ipv4/igmp.c.orig 2006-08-11 11:50:46.0 +0200 +++ linux-2.6.17.8/net/ipv4/igmp.c 2006-08-16 15:06:18.0 +0200 @@ -1369,13 +1369,15 @@ struct flowi fl = { .nl_u = { .ip4_u = { .daddr = imr-imr_multiaddr.s_addr } } }; struct rtable *rt; - struct net_device *dev = NULL; - struct in_device *idev = NULL; + struct net_device *dev; if (imr-imr_ifindex) { - idev = inetdev_by_index(imr-imr_ifindex); - if (idev) + struct in_device *idev = inetdev_by_index(imr-imr_ifindex); + + if (idev) { + imr-imr_address.s_addr = 0; __in_dev_put(idev); + } return idev; } if (imr-imr_address.s_addr) { @@ -1383,17 +1385,16 @@ if (!dev) return NULL; dev_put(dev); - } - - if (!dev !ip_route_output_key(rt, fl)) { + } else if (!ip_route_output_key(rt, fl)) { dev = rt-u.dst.dev; ip_rt_put(rt); - } - if (dev) { - imr-imr_ifindex = dev-ifindex; - idev = __in_dev_get_rtnl(dev); - } - return idev; + if (!dev) + return NULL; + } else + return NULL; + + imr-imr_ifindex = dev-ifindex; + return __in_dev_get_rtnl(dev); } /* @@ -1798,27 +1799,79 @@ u32 ifindex; rtnl_lock(); - in_dev = ip_mc_find_dev(imr); - if (!in_dev) { - rtnl_unlock(); - return -ENODEV; - } ifindex = imr-imr_ifindex; - for (imlp = inet-mc_list; (iml = *imlp) != NULL; imlp = iml-next) { - if (iml-multi.imr_multiaddr.s_addr == group - iml-multi.imr_ifindex == ifindex) { - (void) ip_mc_leave_src(sk, iml, in_dev); - - *imlp = iml-next; - - ip_mc_dec_group(in_dev, group); - rtnl_unlock(); - sock_kfree_s(sk, iml, sizeof(*iml)); - return 0; + in_dev = ip_mc_find_dev(imr); + if 
(ifindex != 0) { + /* leave by interface index */ + for (imlp = inet-mc_list; (iml = *imlp) != NULL; imlp = iml-next) { + if (iml-multi.imr_multiaddr.s_addr != group) + continue; + + if
Re: [PATCH 1/9] network namespaces: core and device list
On Tue, 2006-08-15 at 18:48 +0400, Andrey Savochkin wrote:
> 	/* Can survive without statistics */
> 	stats = kmalloc(sizeof(struct net_device_stats), GFP_KERNEL);
> 	if (stats) {
> 		memset(stats, 0, sizeof(struct net_device_stats));
> -		loopback_dev.priv = stats;
> -		loopback_dev.get_stats = get_stats;
> +		dev->priv = stats;
> +		dev->get_stats = get_stats;
> 	}

With this much surgery it might be best to start using things that have come along since this code was touched last, like kzalloc().

-- Dave
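The suggestion amounts to replacing an allocate-then-clear pair with a single zeroing allocation. Below is a userspace analogue of that change, offered only as a sketch: `struct fake_stats` and both function names are hypothetical, with malloc()/memset() standing in for kmalloc()/memset() and calloc() standing in for kzalloc().

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for struct net_device_stats. */
struct fake_stats {
	unsigned long rx_packets;
	unsigned long tx_packets;
};

/* Before: allocate, then clear by hand (the kmalloc() + memset() pattern). */
struct fake_stats *stats_alloc_two_step(void)
{
	struct fake_stats *s = malloc(sizeof(*s));

	if (s)
		memset(s, 0, sizeof(*s));
	return s;
}

/* After: one call that returns zeroed memory (calloc() plays kzalloc()). */
struct fake_stats *stats_alloc_zeroed(void)
{
	return calloc(1, sizeof(struct fake_stats));
}
```

In the quoted kernel code this would presumably collapse to a single `kzalloc(sizeof(struct net_device_stats), GFP_KERNEL)` call, with the memset() dropped.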
Re: Possible leak of multicast source filter sctructure #3a
The same patch as in previous e-mail with a few typos in comments corrected: Signed-off-by: Michal Ruzicka [EMAIL PROTECTED] --- linux-2.6.17.8/net/ipv4/igmp.c.orig 2006-08-11 11:50:46.0 +0200 +++ linux-2.6.17.8/net/ipv4/igmp.c 2006-08-16 16:53:08.0 +0200 @@ -1369,13 +1369,15 @@ struct flowi fl = { .nl_u = { .ip4_u = { .daddr = imr-imr_multiaddr.s_addr } } }; struct rtable *rt; - struct net_device *dev = NULL; - struct in_device *idev = NULL; + struct net_device *dev; if (imr-imr_ifindex) { - idev = inetdev_by_index(imr-imr_ifindex); - if (idev) + struct in_device *idev = inetdev_by_index(imr-imr_ifindex); + + if (idev) { + imr-imr_address.s_addr = 0; __in_dev_put(idev); + } return idev; } if (imr-imr_address.s_addr) { @@ -1383,17 +1385,16 @@ if (!dev) return NULL; dev_put(dev); - } - - if (!dev !ip_route_output_key(rt, fl)) { + } else if (!ip_route_output_key(rt, fl)) { dev = rt-u.dst.dev; ip_rt_put(rt); - } - if (dev) { - imr-imr_ifindex = dev-ifindex; - idev = __in_dev_get_rtnl(dev); - } - return idev; + if (!dev) + return NULL; + } else + return NULL; + + imr-imr_ifindex = dev-ifindex; + return __in_dev_get_rtnl(dev); } /* @@ -1798,27 +1799,79 @@ u32 ifindex; rtnl_lock(); - in_dev = ip_mc_find_dev(imr); - if (!in_dev) { - rtnl_unlock(); - return -ENODEV; - } ifindex = imr-imr_ifindex; - for (imlp = inet-mc_list; (iml = *imlp) != NULL; imlp = iml-next) { - if (iml-multi.imr_multiaddr.s_addr == group - iml-multi.imr_ifindex == ifindex) { - (void) ip_mc_leave_src(sk, iml, in_dev); - - *imlp = iml-next; - - ip_mc_dec_group(in_dev, group); - rtnl_unlock(); - sock_kfree_s(sk, iml, sizeof(*iml)); - return 0; + in_dev = ip_mc_find_dev(imr); + if (ifindex != 0) { + /* leave by interface index */ + for (imlp = inet-mc_list; (iml = *imlp) != NULL; imlp = iml-next) { + if (iml-multi.imr_multiaddr.s_addr != group) + continue; + + if (iml-multi.imr_ifindex == ifindex) + goto leave; + } + } else { + /* leave by address / multicast group route */ + struct ip_mc_socklist 
**cimlp = NULL; + u32 address = imr-imr_address.s_addr; + + ifindex = imr-imr_ifindex; + for (imlp = inet-mc_list; (iml = *imlp) != NULL; imlp = iml-next) { + if (iml-multi.imr_multiaddr.s_addr != group) + continue; + + if (iml-multi.imr_ifindex == ifindex) + /* direct match +* NOTE: We do not have to test for in_dev != NULL +* since we know that ifindex was zero before call +* to ip_mc_find_dev() but is non-zero now (as +* it equals to an interface index which is never +* zero). The ip_mc_find_dev() function modifies +* the ifindex only if it finds an interface +* (in wich case it returns non-NULL). Thus the +* in_dev must be non-NULL. +*/ + goto leave; + + if (cimlp == NULL iml-multi.imr_address.s_addr == address) + cimlp = imlp; + } + + if (cimlp != NULL) { + /* We have found at least one candidate interface +* for leaving by address but not a direct match. +* Since there is no way to tell what interface the user +* wnated to leave the multicast group on we are going +* to leave it on the first candidate interface found. +*/ + iml = *(imlp = cimlp); + + if (in_dev != NULL) { + /* If we have found an interface matching the leave +* request chances are that the interface which we +* are about to leave the multicast group on still
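The two-case lookup described in this thread can be modelled outside the kernel. The sketch below is a simplified userspace illustration, not the patch itself: `struct membership`, `find_leave_entry()` and its parameters are hypothetical, and `resolved_ifindex` stands for whatever index the address or route lookup produced (0 if none).

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified view of one entry on a socket's multicast list. */
struct membership {
	uint32_t group;    /* multicast group address */
	int      ifindex;  /* interface index recorded at join time */
	uint32_t address;  /* interface address recorded at join time */
};

/*
 * Returns the position of the entry to leave, or -1 if none matches.
 *  req_ifindex      - index given in the leave request, 0 = leave by address
 *  resolved_ifindex - index found from the address/route, 0 if not found
 *  address          - interface address given in the leave request
 */
int find_leave_entry(const struct membership *list, size_t n, uint32_t group,
		     int req_ifindex, int resolved_ifindex, uint32_t address)
{
	int candidate = -1;

	for (size_t i = 0; i < n; i++) {
		if (list[i].group != group)
			continue;

		if (req_ifindex != 0) {
			/* Case 1: leave by interface index only. */
			if (list[i].ifindex == req_ifindex)
				return (int)i;
			continue;
		}

		/* Case 2: a direct match on the resolved index wins. */
		if (resolved_ifindex != 0 &&
		    list[i].ifindex == resolved_ifindex)
			return (int)i;

		/* Otherwise remember the first entry with the same address. */
		if (candidate < 0 && list[i].address == address)
			candidate = (int)i;
	}
	return candidate;
}
```

The fallback to the first address candidate is what allows leaving a group on an interface that no longer exists, at the cost of the ambiguity the message itself acknowledges.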
Can't turn off CONFIG_NET_ESTIMATOR on 2.6.17.7
I recently built a 2.6.17.7 kernel and wanted to turn off CONFIG_NET_ESTIMATOR, but can't do so using menuconfig. Is it on by default now, or is it a config issue?

I wanted it off to play with chains of policers. Unless I misunderstand, it uses HZ and is inaccurate when HZ=250 with its minimum time of 1/4 sec, which is too high for what I wanted anyway.

Andy.
Re: [RFC] network namespaces
Hello!

> (application) containers. Performance aside, are there any reasons why
> this approach would be problematic for c/r?

This approach is just perfect for c/r. Probably, this is the only approach where migration can be done in a clean and self-consistent way.

Alexey
[PATCH] bcm43xx-softmac: optimization of DMA bitfields.]
John,

Please pull this patch for the wireless-2.6 tree. This patch depends on the 64bit DMA patch, which is already submitted for inclusion.

Convert the bitfields in the bcm43xx DMA code to properly aligned u8 booleans. These flags are accessed in the DMA hotpath, so it's a good idea to waste a few bytes of memory for the sake of speed by not requiring masking (and probably shifting) of the bitfields.

Signed-off-by: Michael Buesch [EMAIL PROTECTED]
Signed-Off-By: Larry Finger [EMAIL PROTECTED]

Index: wireless-2.6/drivers/net/wireless/bcm43xx/bcm43xx_dma.h
===
--- wireless-2.6.orig/drivers/net/wireless/bcm43xx/bcm43xx_dma.h	2006-08-16 12:47:27.0 +0200
+++ wireless-2.6/drivers/net/wireless/bcm43xx/bcm43xx_dma.h	2006-08-16 12:49:43.0 +0200
@@ -235,9 +235,12 @@
 	u16 mmio_base;
 	/* DMA controller index number (0-5). */
 	int index;
-	u8 tx:1,	/* TRUE, if this is a TX ring. */
-	   dma64:1,	/* TRUE, if 64-bit DMA is enabled (FALSE if 32bit). */
-	   suspended:1;	/* TRUE, if transfers are suspended on this ring. */
+	/* Boolean. Is this a TX ring? */
+	u8 tx
+	/* Boolean. 64bit DMA if true, 32bit DMA otherwise. */
+	u8 dma64;
+	/* Boolean. Are transfers suspended on this ring? */
+	u8 suspended;
 	struct bcm43xx_private *bcm;
 #ifdef CONFIG_BCM43XX_DEBUG
 	/* Maximum number of used slots. */
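The trade-off behind this patch can be shown with two small structures. This is a standalone sketch under stated assumptions, not the driver code: the struct and function names are hypothetical, and `uint8_t` bit-fields are a common compiler extension (accepted by GCC and Clang) rather than strictly portable C.

```c
#include <stdint.h>

/* Before: three flags share one byte; each access needs a mask and maybe a shift. */
struct ring_flags_packed {
	uint8_t tx:1,
		dma64:1,
		suspended:1;
};

/* After: one byte per flag; each access is a plain byte load or store. */
struct ring_flags_bytes {
	uint8_t tx;
	uint8_t dma64;
	uint8_t suspended;
};

/* Hypothetical hot-path check reading two of the flags. */
int ring_is_active_tx(const struct ring_flags_bytes *f)
{
	return f->tx && !f->suspended;
}
```

The byte-per-flag layout costs a couple of bytes per ring and avoids the read-modify-write sequence the compiler has to emit when updating a single bit-field.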
Re: [PATCH] bcm43xx-softmac: optimization of DMA bitfields.]
On Wed, 2006-08-16 at 10:36 -0500, Larry Finger wrote:
> +	/* Boolean. Is this a TX ring? */
> +	u8 tx
> +	/* Boolean. 64bit DMA if true, 32bit DMA otherwise. */
> +	u8 dma64;

does that compile?

johannes
[PATCH 2/2] Add support for LAN8187 and LAN8700 PHYs
Make functions and constants generic, add support for two more SMSC PHY
models with identical interrupt source and mask registers

Signed-off-by: Steve Glendinning [EMAIL PROTECTED]
---
 drivers/net/phy/smsc.c | 112 ++--
 1 files changed, 89 insertions(+), 23 deletions(-)

diff --git a/drivers/net/phy/smsc.c b/drivers/net/phy/smsc.c
index 2119bd7..22785fb 100644
--- a/drivers/net/phy/smsc.c
+++ b/drivers/net/phy/smsc.c
@@ -12,6 +12,7 @@
  * Free Software Foundation; either version 2 of the License, or (at your
  * option) any later version.
  *
+ * Support added for SMSC LAN8187 and LAN8700 by [EMAIL PROTECTED]
  */
 
 #include <linux/config.h>
@@ -22,42 +23,42 @@ #include <linux/ethtool.h>
 #include <linux/phy.h>
 #include <linux/netdevice.h>
 
-#define MII_LAN83C185_ISF 29 /* Interrupt Source Flags */
-#define MII_LAN83C185_IM  30 /* Interrupt Mask */
+#define MII_SMSC_ISF 29 /* Interrupt Source Flags */
+#define MII_SMSC_IM  30 /* Interrupt Mask */
 
-#define MII_LAN83C185_ISF_INT1 (1<<1) /* Auto-Negotiation Page Received */
-#define MII_LAN83C185_ISF_INT2 (1<<2) /* Parallel Detection Fault */
-#define MII_LAN83C185_ISF_INT3 (1<<3) /* Auto-Negotiation LP Ack */
-#define MII_LAN83C185_ISF_INT4 (1<<4) /* Link Down */
-#define MII_LAN83C185_ISF_INT5 (1<<5) /* Remote Fault Detected */
-#define MII_LAN83C185_ISF_INT6 (1<<6) /* Auto-Negotiation complete */
-#define MII_LAN83C185_ISF_INT7 (1<<7) /* ENERGYON */
+#define MII_SMSC_ISF_INT1 (1<<1) /* Auto-Negotiation Page Received */
+#define MII_SMSC_ISF_INT2 (1<<2) /* Parallel Detection Fault */
+#define MII_SMSC_ISF_INT3 (1<<3) /* Auto-Negotiation LP Ack */
+#define MII_SMSC_ISF_INT4 (1<<4) /* Link Down */
+#define MII_SMSC_ISF_INT5 (1<<5) /* Remote Fault Detected */
+#define MII_SMSC_ISF_INT6 (1<<6) /* Auto-Negotiation complete */
+#define MII_SMSC_ISF_INT7 (1<<7) /* ENERGYON */
 
-#define MII_LAN83C185_ISF_INT_ALL (0x0e)
+#define MII_SMSC_ISF_INT_ALL (0x0e)
 
-#define MII_LAN83C185_ISF_INT_PHYLIB_EVENTS \
-	(MII_LAN83C185_ISF_INT6 | MII_LAN83C185_ISF_INT4)
+#define MII_SMSC_ISF_INT_PHYLIB_EVENTS \
+	(MII_SMSC_ISF_INT6 | MII_SMSC_ISF_INT4)
 
-static int lan83c185_config_intr(struct phy_device *phydev)
+static int smsc_phy_config_intr(struct phy_device *phydev)
 {
-	int rc = phy_write(phydev, MII_LAN83C185_IM,
+	int rc = phy_write(phydev, MII_SMSC_IM,
 			((PHY_INTERRUPT_ENABLED == phydev->interrupts)
-			? MII_LAN83C185_ISF_INT_PHYLIB_EVENTS : 0));
+			? MII_SMSC_ISF_INT_PHYLIB_EVENTS : 0));
 
 	return rc < 0 ? rc : 0;
 }
 
-static int lan83c185_ack_interrupt(struct phy_device *phydev)
+static int smsc_phy_ack_interrupt(struct phy_device *phydev)
 {
-	int rc = phy_read(phydev, MII_LAN83C185_ISF);
+	int rc = phy_read(phydev, MII_SMSC_ISF);
 
 	return rc < 0 ? rc : 0;
 }
 
-static int lan83c185_config_init(struct phy_device *phydev)
+static int smsc_phy_config_init(struct phy_device *phydev)
 {
-	return lan83c185_ack_interrupt(phydev);
+	return smsc_phy_ack_interrupt(phydev);
 }
 
 
@@ -73,22 +74,87 @@ static struct phy_driver lan83c185_drive
 	/* basic functions */
 	.config_aneg	= genphy_config_aneg,
 	.read_status	= genphy_read_status,
-	.config_init	= lan83c185_config_init,
+	.config_init	= smsc_phy_config_init,
 
 	/* IRQ related */
-	.ack_interrupt	= lan83c185_ack_interrupt,
-	.config_intr	= lan83c185_config_intr,
+	.ack_interrupt	= smsc_phy_ack_interrupt,
+	.config_intr	= smsc_phy_config_intr,
+
+	.driver		= { .owner = THIS_MODULE, }
+};
+
+static struct phy_driver lan8187_driver = {
+	.phy_id		= 0x0007c0b0, /* OUI=0x00800f, Model#=0x0b */
+	.phy_id_mask	= 0xfff0,
+	.name		= "SMSC LAN8187",
+
+	.features	= (PHY_BASIC_FEATURES | SUPPORTED_Pause
+				| SUPPORTED_Asym_Pause),
+	.flags		= PHY_HAS_INTERRUPT | PHY_HAS_MAGICANEG,
+
+	/* basic functions */
+	.config_aneg	= genphy_config_aneg,
+	.read_status	= genphy_read_status,
+	.config_init	= smsc_phy_config_init,
+
+	/* IRQ related */
+	.ack_interrupt	= smsc_phy_ack_interrupt,
+	.config_intr	= smsc_phy_config_intr,
+
+	.driver		= { .owner = THIS_MODULE, }
+};
+
+static struct phy_driver lan8700_driver = {
+	.phy_id		= 0x0007c0c0, /* OUI=0x00800f, Model#=0x0c */
+	.phy_id_mask	= 0xfff0,
+	.name		= "SMSC LAN8700",
+
+	.features	= (PHY_BASIC_FEATURES | SUPPORTED_Pause
+				| SUPPORTED_Asym_Pause),
+	.flags		= PHY_HAS_INTERRUPT | PHY_HAS_MAGICANEG,
+
+	/* basic functions */
+	.config_aneg	= genphy_config_aneg,
+	.read_status	= genphy_read_status,
+	.config_init	= smsc_phy_config_init,
+
+	/* IRQ
RE: [E1000-devel] e1000: ethtool -p + cable pull = system wedges hard
Kok, Auke-jan H wrote: Auke Kok wrote: Jay Vosburgh wrote: Running both 2.6.17.6 plus the e1000 7.2.7 from sourceforge, or the e1000 in netdev-2.6#upstream (7.1.9-k4). Starting up ethtool -p ethX then unplugging the cable connected to the identified port is causing my system to completely freeze; even sysrq is unresponsive. I'm running on a 2-way x86 box, with an 82545GM. Is this by any chance a known problem? not at all. One of my brain halves (the third one ;)) poked me and told me that it *is* a known issue. Not good. Apparently as early as kernel 2.5.50 a change was introduced that causes this. I am unsure what exactly caused it and I assume it is generic (other nic's might also suffer). The issue is documented in our standalone driver documentation. Not sure what to do with this. Has something to do with the RTNL lock being held and link notification, as I remember. We noticed it to be a global problem, happens with e100 too. http://www.mail-archive.com/netdev@vger.kernel.org/msg01654.html Jesse - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 1/2] Fix style to match drivers/net/phy/*
Trivial style fixes to match other PHY drivers

Signed-off-by: Steve Glendinning [EMAIL PROTECTED]
---
 drivers/net/phy/smsc.c | 15 +++
 1 files changed, 7 insertions(+), 8 deletions(-)

diff --git a/drivers/net/phy/smsc.c b/drivers/net/phy/smsc.c
index 25e31fb..2119bd7 100644
--- a/drivers/net/phy/smsc.c
+++ b/drivers/net/phy/smsc.c
@@ -41,24 +41,23 @@ #define MII_LAN83C185_ISF_INT_PHYLIB_EVE
 
 static int lan83c185_config_intr(struct phy_device *phydev)
 {
-	int rc = phy_write (phydev, MII_LAN83C185_IM,
-			    ((PHY_INTERRUPT_ENABLED == phydev->interrupts)
-			     ? MII_LAN83C185_ISF_INT_PHYLIB_EVENTS
-			     : 0));
+	int rc = phy_write(phydev, MII_LAN83C185_IM,
+			((PHY_INTERRUPT_ENABLED == phydev->interrupts)
+			? MII_LAN83C185_ISF_INT_PHYLIB_EVENTS : 0));
 
 	return rc < 0 ? rc : 0;
 }
 
 static int lan83c185_ack_interrupt(struct phy_device *phydev)
 {
-	int rc = phy_read (phydev, MII_LAN83C185_ISF);
+	int rc = phy_read(phydev, MII_LAN83C185_ISF);
 
 	return rc < 0 ? rc : 0;
 }
 
 static int lan83c185_config_init(struct phy_device *phydev)
 {
-	return lan83c185_ack_interrupt (phydev);
+	return lan83c185_ack_interrupt(phydev);
 }
 
 
@@ -85,12 +84,12 @@ static struct phy_driver lan83c185_drive
 
 static int __init smsc_init(void)
 {
-	return phy_driver_register (&lan83c185_driver);
+	return phy_driver_register(&lan83c185_driver);
 }
 
 static void __exit smsc_exit(void)
 {
-	phy_driver_unregister (&lan83c185_driver);
+	phy_driver_unregister(&lan83c185_driver);
 }
 
 MODULE_DESCRIPTION("SMSC PHY driver");
-- 
1.4.1
Re: [PATCH 1/9] network namespaces: core and device list
On Wed, 16 Aug 2006 07:46:43 -0700 Dave Hansen [EMAIL PROTECTED] wrote:
On Tue, 2006-08-15 at 18:48 +0400, Andrey Savochkin wrote:

 	/* Can survive without statistics */
 	stats = kmalloc(sizeof(struct net_device_stats), GFP_KERNEL);
 	if (stats) {
 		memset(stats, 0, sizeof(struct net_device_stats));
-		loopback_dev.priv = stats;
-		loopback_dev.get_stats = get_stats;
+		dev->priv = stats;
+		dev->get_stats = get_stats;
 	}

With this much surgery it might be best to start using things that have come along since this code was touched last, like kzalloc().

If you are going to make the loopback device dynamic, it MUST use alloc_netdev().
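For readers unfamiliar with the suggestion: kzalloc() folds the kmalloc() plus memset() pair quoted above into a single zeroing allocation. What follows is only a userspace sketch of that idea, with calloc() standing in for kzalloc() so that it builds outside the kernel. The struct dev_stats type and both helper names are hypothetical and are not taken from the patch.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for struct net_device_stats. */
struct dev_stats {
	unsigned long rx_packets;
	unsigned long tx_packets;
	unsigned long rx_bytes;
	unsigned long tx_bytes;
};

/* Older pattern: allocate, then clear by hand (kmalloc + memset). */
struct dev_stats *stats_alloc_memset(void)
{
	struct dev_stats *s = malloc(sizeof(*s));

	if (s)
		memset(s, 0, sizeof(*s));
	return s;
}

/* One-call pattern: calloc() here plays the role kzalloc() plays in-kernel. */
struct dev_stats *stats_alloc_zeroed(void)
{
	return calloc(1, sizeof(struct dev_stats));
}
```

Both helpers return either NULL or a fully zeroed structure, so the caller's "can survive without statistics" check stays exactly as it was.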
Re: [PATCH 3/9] network namespaces: playing and debugging
On Tue, 15 Aug 2006 18:48:43 +0400 Andrey Savochkin [EMAIL PROTECTED] wrote: Temporary code to play with network namespaces in the simplest way. Do exec 7 /proc/net/net_ns in your bash shell and you'll get a brand new network namespace. There you can, for example, do ip link set lo up ip addr list ip addr add 1.2.3.4 dev lo ping -n 1.2.3.4 Signed-off-by: Andrey Savochkin [EMAIL PROTECTED] NACK, new /proc interfaces are not acceptable. - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [d80211 rfc] link master interface from wiphy
On Mon, 14 Aug 2006 10:12:01 +0200, Johannes Berg wrote:

I'd like to see a link from the wiphy to the master interface that belongs to it so one can tell this easily on systems that have multiple wireless devices.

As wiphy and master interface are closely bound to each other, this makes sense. Btw, we will probably need some way to ask d80211 about all interfaces belonging to a given wiphy anyway. Crawling all network interfaces and searching for correct wiphy symlinks is probably not the best way. I think a new netlink interface can be used for this.

wpa_supplicant could use this, I guess. I think another link to wlan#ap should be created (or does wpa_supplicant set the name of that so it knows which one it will get?), or something like that anyway.

wmgmt# will go away in future. There is an ioctl to get its ifindex, so no need for the link.

On the other hand, is there any real reason we have this code:

	ndev->base_addr = dev->base_addr;
	ndev->irq = dev->irq;
	ndev->mem_start = dev->mem_start;
	ndev->mem_end = dev->mem_end;
	ndev->flags = dev->flags & IFF_MULTICAST;
	SET_NETDEV_DEV(ndev, dev->class_dev.dev);

in ieee80211_if_add? Maybe we should make the virtual devices all children of the wiphy (struct ieee80211_local) instead of making them children of the physical device? I don't really know though. This is too dark magic for me ;)

What do you mean by making the virtual devices all children of the wiphy? Currently, all virtual devices (of one physical device) have the same pointer to ieee80211_local in their net_dev structure and pointers to them are stored in the linked list in ieee80211_local.

However, I do know that I can trivially rename the wmaster0 interface using just 'ip link set wmaster0 name wlan3' and things will probably be very confusing for any program that relies on the naming to know which device is which.

Any program that relies on particular device names is broken.

Comments welcome. Userspace comments as well, I'm programming something that'll use a bunch of interfaces (wmaster, a monitor one and a sta one probably) and I want the user to just select the physical interface, not all these three logical ones... (in fact, I'm creating the logical monitor interface myself in code).

Do you know about /sys/class/net/X/wiphy symlinks? But as I said, crawling sysfs is not the best idea - among other things, it is subject to race conditions.

Thanks,
Jiri
-- 
Jiri Benc
SUSE Labs
Re: [PATCH 1/2] rfkill - Add rfkill driver to misc input devices
On Tue, 8 Aug 2006 16:30:12 +0200, Ivo van Doorn wrote: This will add the rfkill driver to the input/misc section of the kernel. rfkill is useful for network devices that contain a hardware button to enable or disable the radio. With rfkill a generic interface is created for the network drivers, as well as providing a uniform way for userspace to listen to the hardware button events. You need to send this patch to lkml and to the input subsystem maintainer. Jiri -- Jiri Benc SUSE Labs
Re: [PATCH 3/9] network namespaces: playing and debugging
Stephen Hemminger [EMAIL PROTECTED] writes: On Tue, 15 Aug 2006 18:48:43 +0400 Andrey Savochkin [EMAIL PROTECTED] wrote: Temporary code to play with network namespaces in the simplest way. Do exec 7 /proc/net/net_ns in your bash shell and you'll get a brand new network namespace. There you can, for example, do ip link set lo up ip addr list ip addr add 1.2.3.4 dev lo ping -n 1.2.3.4 Signed-off-by: Andrey Savochkin [EMAIL PROTECTED] NACK, new /proc interfaces are not acceptable. The rule is that new /proc interfaces that are not process related are not acceptable. If structured right a network namespace can arguably be process related. I do agree that this interface is pretty ugly there. Eric - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC] network namespaces
Alexey Kuznetsov [EMAIL PROTECTED] writes: Hello! (application) containers. Performance aside, are there any reasons why this approach would be problematic for c/r? This approach is just perfect for c/r. Yes. For c/r you need to take your state with you. Probably, this is the only approach when migration can be done in a clean and self-consistent way.

Basically there are currently 3 approaches that have been proposed.

- The trivial bsdjail style as implemented by Serge and in a slightly more sophisticated version in vserver. This approach, as it does not touch the packets, has little to no packet level overhead. Basically this is what I have called the Layer 3 approach.

- The more in depth approach where we modify the packet processing based upon which network interface the packet comes in on, and it looks like each namespace has its own instance of the network stack. Roughly what was proposed earlier in this thread: the Layer 2 approach. This potentially has per packet overhead so we need to watch the implementation very carefully.

- Some weird hybrid as proposed by Daniel, that I was never clear on the semantics.

From the previous conversations my impression was that as long as we could get a Layer 2 approach that did not slow down the networking stack and was clean, everyone would be happy.

I'm buried in the process id namespace at the moment, and expect to be so for the rest of the month, so I'm not going to be very helpful except for a few stray comments.

Eric
skge crashes
Stephen, the reproducible crashes with all skge versions (where sk98lin works fine) on my box are SMP related. I booted with maxcpus=1 and the box survived my usual crash test, I will keep an eye. Daniel
RE: proposal for new wireless configuration API
I'd suggest that the new signal strength measure should be defined as 'RCPI' - the 'Received Channel Power Indicator' - which is defined in IEEE 802.11k (the Radio Measurements amendment to 802.11). Here is the full text of the definition from 802.11k draft 5.0:

received channel power indicator (RCPI): An indication of the total channel power (signal, noise, and interference) of a received IEEE 802.11 frame measured on a single channel and at the antenna connector used to receive the frame.

The RCPI indicator is a measure of the received RF power in the selected channel for a received frame. This parameter shall be a measure by the PHY sublayer of the received RF power in the channel measured over the entire received frame or by other equivalent means which meet the specified accuracy. RCPI shall be a monotonically increasing, logarithmic function of the received power level defined in dBm. The allowed values for the Received Channel Power Indicator (RCPI) parameter shall be an 8 bit value in the range from 0 through 220, with indicated values rounded to the nearest 0.5 dB as follows:

  0: Power < -110 dBm
  1: Power = -109.5 dBm
  2: Power = -109.0 dBm
  and so on, where RCPI = int{(Power in dBm + 110) * 2} for 0 dBm > Power > -110 dBm
  220: Power > -0 dBm
  221-254: reserved
  255: Measurement not available

RCPI shall equal the received RF power within an accuracy of +/-5 dB (95% confidence interval) within the specified dynamic range of the receiver. The received RF power shall be determined assuming a receiver noise equivalent bandwidth equal to the channel bandwidth multiplied by 1.1.
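The mapping in the quoted definition is small enough to sketch in C. This is only an illustration of the formula RCPI = int{(Power in dBm + 110) * 2} with the saturation points described above. The function names and the RCPI_MAX / RCPI_UNAVAILABLE macros are made up for this example and are not part of any existing wireless API.

```c
#include <stdint.h>

#define RCPI_MAX         220u  /* top of the defined range */
#define RCPI_UNAVAILABLE 255u  /* "Measurement not available" */

/* Map a received power in dBm to an RCPI value (0.5 dB steps above -110 dBm). */
uint8_t rcpi_from_dbm(double power_dbm)
{
	if (power_dbm < -110.0)
		return 0;
	if (power_dbm >= 0.0)
		return RCPI_MAX;
	/* The operand is non-negative here, so adding 0.5 and truncating rounds to nearest. */
	return (uint8_t)((power_dbm + 110.0) * 2.0 + 0.5);
}

/*
 * Inverse mapping. Returns 1 and stores the power in dBm for values in
 * 0..220; returns 0 for the reserved range and for "not available".
 */
int rcpi_to_dbm(uint8_t rcpi, double *power_dbm)
{
	if (rcpi > RCPI_MAX)
		return 0;
	*power_dbm = (double)rcpi / 2.0 - 110.0;
	return 1;
}
```

A single unsigned 8-bit attribute with a fixed unit like this would avoid the dBm versus raw RSSI ambiguity discussed in the quoted thread, since every value has exactly one meaning.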
Simon -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Johannes Berg Sent: Tuesday, August 15, 2006 11:51 PM To: Dan Williams Cc: netdev@vger.kernel.org; Jean Tourrilhes Subject: Re: proposal for new wireless configuration API On Tue, 2006-08-15 at 12:29 -0400, Dan Williams wrote: We might want to take the time to fix up a few of the ambiguities of WEXT that we've encountered over the past few years: Yes, I definitely agree. o Separate attributes for signal strength units; signed integer type for dBm, unsigned integer type for RSSI. One 8-bit var to represent both is just too confusing for people, evidently (which is true...) Yes, agreed, they should be separated. In general, I think that one attribute should always have a single meaning and unit attached, except for explicitly unit-less attributes (number of frames or whatever), or attributes that explicitly have no stable unit (raw rssi). o Merge functionality ENCODE and ENCODEEXT handlers into one Good one. I'm still not sure whether we should have an attribute for this, or a command. The whole key business seems rather complex and it might be good to have a command 'set key' with say a possible attribute for the mac address of a pairwise key, a key material attribute and an IV attribute or something. Otherwise we'll end up parsing the contents of an attribute again, which rather sucks... On the other hand, having it as a command won't allow the user to optimize setting the key and other things at once. I'm not too sure we should pay all that much attention to this problem though, it can't take forever and typically a user with such a card won't be changing the key or parameters all the time, hence it's usually probably done only at boot or association time. 
johannes
Re: skge crashes
On Wed, 16 Aug 2006 19:47:08 +0200 Beschorner Daniel [EMAIL PROTECTED] wrote: Stephen, the reproducible crashes with all skge versions (where sk98lin works fine) on my box are SMP related. I booted with maxcpus=1 and the box survived my usual crash test, I will keep an eye. Daniel Is this P3 SMP? What form of IRQ balance are you using?
Re: New driver questions: Attansic L1 gigabit NIC
On Tue, 15 Aug 2006 18:23:19 -0500 Jay Cliburn [EMAIL PROTECTED] wrote: Stephen Hemminger wrote: On Sun, 13 Aug 2006 19:11:42 -0500 Jay Cliburn [EMAIL PROTECTED] wrote: ...snip... I've read the LKML FAQ regarding new driver submissions, but it implies that the submitter be willing to maintain the driver, which I'm not qualified to do. I haven't contacted Attansic to request a change to the above support statement, because my past attempts to contact vendors on matters of this tenor have been greeted with silence. I would recommend the module author to see if they would GPL it. Thank you for your reply. I've contacted the author as you suggest. IANAL but because they used GPL code in the driver, one could argue that they created a derived work covered by GPL already. But I learned in preschool it is always better to ask than take. - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Bug 6936] BUG: warning at net/core/dev.c:1171/skb_checksum_help()
On Wed, 16 Aug 2006 11:29:00 +1000 Herbert Xu [EMAIL PROTECTED] wrote: On Tue, Aug 15, 2006 at 11:29:59AM -0700, [EMAIL PROTECTED] wrote: http://bugzilla.kernel.org/show_bug.cgi?id=6936 It's actually a bug in the bridging code :) [BRIDGE]: Disable SG/GSO if TX checksum is off When the bridge recomputes features, it does not maintain the constraint that SG/GSO must be off if TX checksum is off. This patch adds that constraint. On a completely unrelated note, I've also added TSO6 and TSO_ECN feature bits if GSO is enabled on the underlying device through the new NETIF_F_GSO_SOFTWARE macro. Signed-off-by: Herbert Xu [EMAIL PROTECTED] Cheers, agree. I assume this came in with the new GSO for 2.6.18 or do we need to fix 2.6.17 as well. - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
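The constraint described in the patch summary can be sketched as follows. This is a simplified model rather than the bridge code itself: the FEAT_* flag values are invented for the example (they are not the kernel's NETIF_F_* bits), and combining port features with a plain AND is an assumption made to keep the sketch short.

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative feature bits; values are made up for this sketch. */
#define FEAT_SG    (1u << 0)  /* scatter/gather */
#define FEAT_CSUM  (1u << 1)  /* device can checksum on transmit */
#define FEAT_GSO   (1u << 2)  /* generic segmentation offload */
#define FEAT_ALL   (FEAT_SG | FEAT_CSUM | FEAT_GSO)

/* Enforce the rule from the patch text: no TX checksum means no SG and no GSO. */
uint32_t fixup_features(uint32_t features)
{
	if (!(features & FEAT_CSUM))
		features &= ~(FEAT_SG | FEAT_GSO);
	return features;
}

/* Recompute bridge features as the common subset of all ports, then fix up. */
uint32_t bridge_features(const uint32_t *port_features, size_t nports)
{
	uint32_t features = FEAT_ALL;
	size_t i;

	for (i = 0; i < nports; i++)
		features &= port_features[i];
	return fixup_features(features);
}
```

The point of the final fixup_features() call is that the intersection step alone can leave SG or GSO set while checksum offload has been dropped, which is the inconsistent state the bug report describes.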
Re: [E1000-devel] e1000: ethtool -p + cable pull = system wedges hard
Brandeburg, Jesse [EMAIL PROTECTED] wrote: Kok, Auke-jan H wrote: Auke Kok wrote: Jay Vosburgh wrote: Running both 2.6.17.6 plus the e1000 7.2.7 from sourceforge, or the e1000 in netdev-2.6#upstream (7.1.9-k4). Starting up ethtool -p ethX then unplugging the cable connected to the identified port is causing my system to completely freeze; even sysrq is unresponsive. I'm running on a 2-way x86 box, with an 82545GM. [...] Has something to do with the RTNL lock being held and link notification, as I remember. We noticed it to be a global problem, happens with e100 too. http://www.mail-archive.com/netdev@vger.kernel.org/msg01654.html Well, I thought maybe I'd messed it up when I tested the other cards, but I just went and tried it again. Only the e1000 wedges the system if I pull the cable with ethtool -p running. The e100 and tg3 don't lock up. Pulling the cable ends the ethtool for tg3, but for e100 the blinky blinky keeps going even after the cable is back in. Even so, as you mention, the operation is holding the RTNL, so anything else that wants it has to wait. -J --- -Jay Vosburgh, IBM Linux Technology Center, [EMAIL PROTECTED] - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: New driver questions: Attansic L1 gigabit NIC
Stephen Hemminger wrote: On Tue, 15 Aug 2006 18:23:19 -0500 Jay Cliburn [EMAIL PROTECTED] wrote: Stephen Hemminger wrote: On Sun, 13 Aug 2006 19:11:42 -0500 Jay Cliburn [EMAIL PROTECTED] wrote: ...snip... I've read the LKML FAQ regarding new driver submissions, but it implies that the submitter be willing to maintain the driver, which I'm not qualified to do. I haven't contacted Attansic to request a change to the above support statement, because my past attempts to contact vendors on matters of this tenor have been greeted with silence. I would recommend the module author to see if they would GPL it. Thank you for your reply. I've contacted the author as you suggest. IANAL but because they used GPL code in the driver, one could argue that they created a derived work covered by GPL already. But I learned in preschool it is always better to ask than take. Not exactly. What they wrote is covered by their copyright, and there is no permission to use it in any way other than how they licensed it. Use of GPL code in their driver would allow the author of the GPL code to sue them for violating the license agreement, which would likely result in the code being released under GPL. IANAL either, but to paraphrase another preschool saying, two wrongs (copyright violations) don't make a right (legally licensed). - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Possible leak of multicast source filter sctructure #3a
Michal, I believe the patch I submitted yesterday fixes this problem, and in a simpler way. +-DLS
Re: PATCH Fix bonding active-backup behavior for VLAN interfaces
On Mon, 14 Aug 2006, David Miller wrote: From: Jay Vosburgh [EMAIL PROTECTED] Date: Thu, 03 Aug 2006 18:01:35 -0700 In this case (bond0.555 above bond0 above eth0,eth1,etc), skb_bond doesn't suppress duplicates because skb_bond is called with the skb->dev set to the bond0.555 dev, not the ethX dev. Non-accelerated VLAN devices don't do this; they'll come in with skb->dev set to ethX and will go through skb_bond as expected. Ok, since __vlan_hwaccel_rx() bypasses the netif_receive_skb() that would normally occur, we have to duplicate the bonding drop checks. The submitted patch put skb_bond() into if_vlan.h which is definitely the wrong thing to do. This is a generic operation and therefore belongs in linux/netdevice.h at best. Furthermore, we're only interested in the packet drop check, so that's the only part of the logic we need to export, the rest can stay private to skb_bond() in net/core/dev.c Can the folks who can reproduce this try this patch? Works for me, thank you. Acked-by: Krzysztof Piotr Oledzki [EMAIL PROTECTED] Best regards, Krzysztof Olędzki
Re: [patch 32/41] lockdep: fix smc91x
On Mon, 14 Aug 2006, [EMAIL PROTECTED] wrote: From: Russell King [EMAIL PROTECTED] When booting using root-nfs, I'm seeing (independently) two lockdep dumps in the smc91x driver. The patch below fixes both. Both dumps look like real locking issues. Nico - please review and ack if you think the patch is correct. The lock validator is rightfully complaining and the patch is correct. Acked-by: Nicolas Pitre [EMAIL PROTECTED] Nicolas
Re: [patch 32/41] lockdep: fix smc91x
Nicolas Pitre wrote: On Mon, 14 Aug 2006, [EMAIL PROTECTED] wrote: From: Russell King [EMAIL PROTECTED] When booting using root-nfs, I'm seeing (independently) two lockdep dumps in the smc91x driver. The patch below fixes both. Both dumps look like real locking issues. Nico - please review and ack if you think the patch is correct. The lock validator is rightfully complaining and the patch is correct. Acked-by: Nicolas Pitre [EMAIL PROTECTED] thanks Jeff
Re: [PATCH 1/2]: powerpc/cell spidernet bottom half
On Wed, Aug 16, 2006 at 12:30:29PM -0400, Jeff Garzik wrote: Linas Vepstas wrote: The recent set of low-watermark patches for the spider result in a Let's not reinvent NAPI, shall we... ?? I was under the impression that NAPI was for the receive side only. This round of patches were for the transmit queue.

Let me describe the technical problem; perhaps there's some other solution for it? The default socket size seems to be 128KB (cat /proc/sys/net/core/wmem_default); if a user application writes more than 128 KB to a socket, the app is blocked by the kernel till there's room in the socket for more. At gigabit speeds, a network card can drain 128KB in about a millisecond, or about four times a jiffy (assuming HZ=250). If the network card isn't generating interrupts (and there are no other interrupts flying around), then the tcp stack only wakes up once a jiffy, and so the user app is scheduled only once a jiffy. Thus, the max bandwidth that the app can see is (HZ * wmem_default) bytes per second, or about 250 Mbits/sec for my system. Disappointing for a gigabit adapter.

There are three ways out of this:

(1) Tell the sysadmin to echo 1234567 > /proc/sys/net/core/wmem_default, which violates all the rules.

(2) Poll more frequently than once-a-jiffy. Arnd Bergmann and I got this working, using hrtimers. It worked pretty well, but seemed like a hack to me.

(3) Generate transmit queue low-watermark interrupts, which is an admittedly olde-fashioned but common engineering practice. This round of patches implement this.

--linas
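The arithmetic behind the "about 250 Mbits/sec" figure can be checked with a few lines of C. This is just a back-of-the-envelope sketch of the estimate given above (one socket buffer drained per wakeup, one wakeup per jiffy). The function names are invented for the example, and 128 KB is taken as 131072 bytes.

```c
#include <stdint.h>

/*
 * Throughput ceiling when the sender is refilled once per tick:
 * one full socket buffer per jiffy, expressed in bits per second.
 */
uint64_t tx_ceiling_bits_per_sec(uint32_t hz, uint64_t wmem_bytes)
{
	return (uint64_t)hz * wmem_bytes * 8u;
}

/* Time for the link to drain a full socket buffer, in microseconds. */
uint64_t drain_time_us(uint64_t wmem_bytes, uint64_t link_bits_per_sec)
{
	return wmem_bytes * 8u * 1000000u / link_bits_per_sec;
}
```

With HZ=250 and a 131072-byte buffer the ceiling is 262,144,000 bits/s, roughly a quarter of the link rate, and a gigabit link drains that buffer in about 1048 microseconds, which matches the "about four times a jiffy" remark.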
Re: [PATCH 1/2]: powerpc/cell spidernet bottom half
Linas Vepstas wrote: I was under the impression that NAPI was for the receive side only. That depends on the driver implementation. Jeff
Re: [PATCH 1/2]: powerpc/cell spidernet bottom half
From: Jeff Garzik [EMAIL PROTECTED] Date: Wed, 16 Aug 2006 16:34:31 -0400 Linas Vepstas wrote: I was under the impression that NAPI was for the receive side only. That depends on the driver implementation. What Jeff is trying to say is that TX reclaim can occur in the NAPI poll routine, and in fact this is what the vast majority of NAPI drivers do. It also makes the locking simpler. In practice, the best thing seems to be to put both RX and TX work into ->poll() and have a very mild hw interrupt mitigation setting programmed into the chip. I'm not familiar with the spidernet TX side interrupt capabilities so I can't say whether that is something that can be directly implied. In fact, I get the impression that spidernet is limited in some way and that's where all the strange approaches are coming from :)
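To make the "both RX and TX work in the poll routine" idea concrete, here is a toy model in plain C. It is not the kernel NAPI interface and not spidernet code. struct toy_nic, its fields and toy_poll() are all hypothetical, and the rings are reduced to simple counters.

```c
/* Hypothetical device state: counters stand in for descriptor rings. */
struct toy_nic {
	unsigned int tx_done;     /* transmitted descriptors awaiting reclaim */
	unsigned int rx_ready;    /* received packets awaiting processing */
	int irq_enabled;          /* 1 when the device may interrupt again */
	unsigned int tx_freed;    /* running total of reclaimed descriptors */
};

/*
 * One poll pass: reclaim all finished TX descriptors, then process RX
 * packets up to the budget. Interrupts are re-enabled only when the RX
 * queue was emptied within the budget; otherwise the caller polls again.
 */
int toy_poll(struct toy_nic *nic, int budget)
{
	int work = 0;

	/* TX reclaim runs in the same context as RX, so one lock-free path. */
	nic->tx_freed += nic->tx_done;
	nic->tx_done = 0;

	while (work < budget && nic->rx_ready > 0) {
		nic->rx_ready--;
		work++;
	}

	if (work < budget)
		nic->irq_enabled = 1;
	return work;
}
```

Because TX reclaim and RX processing share one execution context in this model, neither needs to lock against the other, which is one way to read the "makes the locking simpler" remark.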
[PATCH?] tcp and delayed acks
Hello folks,

In looking at a few benchmarks (especially netperf) run locally, it seems that tcp is unable to make full use of available CPU cycles as the sender is throttled waiting for ACKs to arrive. The problem is exacerbated when the sender is using a small send buffer -- running netperf -C -c -- -s 1024 shows a miserable 420Kbit/s at essentially 0% CPU usage. Tests over gige are similarly constrained to a mere 96Mbit/s.

Since there is no way for the receiver to know if the sender is being blocked on transmit space, would it not make sense for the receiver to send out any delayed ACKs when it is clear that the receiving process is waiting for more data? The patch below attempts this (I make no guarantees of its correctness with respect to the rest of the delayed ack code). One point I'm still contemplating is what to do if the receiver is waiting in poll/select/epoll.

[All tests run with maxcpus=1 on a 2.67GHz Woodcrest system.]

Recv   Send    Send                          Utilization       Service Demand
Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
Size   Size    Size     Time     Throughput  local    remote   local   remote
bytes  bytes   bytes    secs.    10^6bits/s  % S      % S      us/KB   us/KB

Base (2.6.17-rc4):

default send buffer size
netperf -C -c
 87380  16384  16384    10.02    14127.79    99.90    99.90    0.579   0.579
 87380  16384  16384    10.02    13875.28    99.90    99.90    0.590   0.590
 87380  16384  16384    10.01    13777.25    99.90    99.90    0.594   0.594
 87380  16384  16384    10.02    13796.31    99.90    99.90    0.593   0.593
 87380  16384  16384    10.01    13801.97    99.90    99.90    0.593   0.593

netperf -C -c -- -s 1024
 87380   2048   2048    10.02        0.43    -0.04    -0.04   -7.105  -7.377
 87380   2048   2048    10.02        0.43    -0.01    -0.01   -2.337  -2.620
 87380   2048   2048    10.02        0.43    -0.03    -0.03   -5.683  -5.940
 87380   2048   2048    10.02        0.43    -0.05    -0.05   -9.373  -9.625
 87380   2048   2048    10.02        0.43    -0.05    -0.05   -9.373  -9.625

from a remote system over gigabit ethernet
netperf -H woody -C -c
 87380  16384  16384    10.03      936.23    19.32    20.47    3.382   1.791
 87380  16384  16384    10.03      936.27    17.67    20.95    3.091   1.833
 87380  16384  16384    10.03      936.17    19.18    20.77    3.356   1.817
 87380  16384  16384    10.03      936.26    18.22    20.26    3.188   1.773
 87380  16384  16384    10.03      936.26    17.35    20.54    3.036   1.797

netperf -H woody -C -c -- -s 1024
 87380   2048   2048    10.00       95.72    10.04     6.64   17.188   5.683
 87380   2048   2048    10.00       95.94     9.47     6.42   16.170   5.478
 87380   2048   2048    10.00       96.83     9.62     5.72   16.283   4.840
 87380   2048   2048    10.00       95.91     9.58     6.13   16.368   5.236
 87380   2048   2048    10.00       95.91     9.58     6.13   16.368   5.236

Patched:

default send buffer size
netperf -C -c
 87380  16384  16384    10.01    13923.16    99.90    99.90    0.588   0.588
 87380  16384  16384    10.01    13854.59    99.90    99.90    0.591   0.591
 87380  16384  16384    10.02    13840.42    99.90    99.90    0.591   0.591
 87380  16384  16384    10.01    13810.96    99.90    99.90    0.593   0.593
 87380  16384  16384    10.01    13771.27    99.90    99.90    0.594   0.594

netperf -C -c -- -s 1024
 87380   2048   2048    10.02     2473.48    99.90    99.90    3.309   3.309
 87380   2048   2048    10.02     2421.46    99.90    99.90    3.380   3.380
 87380   2048   2048    10.02     2288.07    99.90    99.90    3.577   3.577
 87380   2048   2048    10.02     2405.41    99.90    99.90    3.402   3.402
 87380   2048   2048    10.02     2284.41    99.90    99.90    3.582   3.582

netperf -H woody -C -c
 87380  16384  16384    10.04      936.10    23.04    21.60    4.033   1.890
 87380  16384  16384    10.03      936.20    18.52    21.06    3.242   1.843
 87380  16384  16384    10.03      936.52    17.61    21.05    3.082   1.841
 87380  16384  16384    10.03      936.18    18.24    20.73    3.191   1.814
 87380  16384  16384    10.03      936.28    18.30    21.04    3.202   1.841

netperf -H woody -C -c -- -s 1024
 87380   2048   2048    10.00      142.46    10.19     7.53   11.714   4.332
 87380   2048   2048    10.00      147.28     9.73     7.93   10.829   4.412
 87380   2048   2048    10.00      143.37    10.64     6.54   12.161   3.738
 87380   2048   2048    10.00      146.41     9.18     7.43   10.277   4.158
 87380   2048   2048    10.01      145.58     9.80     7.25   11.032   4.081

Comments/thoughts?

		-ben
-- 
Time is of no importance, Mr. President, only life is important. Don't Email: [EMAIL PROTECTED].

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 934396b..e554ceb 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1277,8
PROBLEM: baycom_ser_fdx in kernel 2.6, transmit broken
Last year I reported to the linux-hams list and to Tom Sailer that I could not get AX25 to work using a serial baycom modem with baycom_ser_fdx under vanilla kernel 2.6.11.6 although it ran fine under kernel 2.4.30:
http://he.fi/archive/linux-hams/200505/0021.html

I got no answer, but other people experienced the same problem:
http://he.fi/archive/linux-hams/200508/0108.html

Now, I quickly checked again with Debian Kernel linux-image-2.6.17-2-686 (2.6.17-6) and the problem still persists.

Receiving packets works without any problems, but the transmitted packets are erroneous. I set up another testbox (situated in the same building) with a soundcard and the soundmodem tools to analyse this problem. I could only read some data after enabling pass all (no FCS check) and then it looked like some bytes were missing.

Packet: fm D0CHZ0-3 to IATEp-14 via B0STL0-3,H2NOD0-1 I35 pid=65
^^

The correct frame header should look like:

Packet: fm ND0CHZ-0 to IGATE-0 via CB0STL-0,CH2NOD-0 I35 pid=65

As far as I can tell from looking at the waveforms, my packets are complete and not cut off, neither at the beginning nor the end. Packets received from other stations are decoded correctly.
[Sending station]

- Vanilla kernel (kernel.org) 2.6.11.6 with the configuration adapted from the debian kernel
- Kernel compiled with gcc version 3.3.5
- Processor Intel Celeron (Mendocino) 450 MHz
- Serial configuration (Modem on ttyS0)
  # setserial -g /dev/ttyS{0,1}
  /dev/ttyS0, UART: unknown, Port: 0x03f8, IRQ: 4
  /dev/ttyS1, UART: 16550A, Port: 0x02f8, IRQ: 3
- Baycom startup options:
  sethdlc -p -i bcsf0 mode ser12* io 0x3f8 irq 4
  ifup bcsf0
  sethdlc -i bcsf0 -a half
- Module parameters for /etc/modules.conf
  options baycom_ser_fdx mode=ser12* iobase=0x3f8 irq=4
  alias bcsf0 baycom_ser_fdx
  alias nr0 netrom
  alias tty-ldisc-5 mkiss
- Loaded modules: hdlcdrv, baycom_ser_fdx, ax25, mkiss, af_packet
- /etc/ax25/axports
  cb     CBPORT-0  19200  255  7  CB-Funk
  axudp  AXUDP-0   19200  255  7  Netzwerk AX25UDP Link
  lokal  CH0CON-0  19200  255  7  Lokal

# cat /proc/ioports | grep baycom
03f8-03ff : baycom_ser_fdx

# cat /proc/interrupts
           CPU0
  0:     285101          XT-PIC  timer
  1:        130          XT-PIC  i8042
  2:          0          XT-PIC  cascade
  4:      39576          XT-PIC  baycom_ser_fdx
  7:          0          XT-PIC  parport0
 11:       3473          XT-PIC  Intel ICH 82801AA, eth0
 12:       9693          XT-PIC  HiSax, uhci_hcd, eth1
 14:      17651          XT-PIC  ide0
 15:         13          XT-PIC  ide1
NMI:          0
LOC:          0
ERR:          0
MIS:          0

[Testbox]

- soundmodemconfig from debian package soundmodem version 0.9-1

The complete output of soundmodemconfig with pass all enabled:

Modulator: afsk
Demodulator: afsk
Modulator: parameter bps value 1200
Modulator: parameter f0 value 1200
Modulator: parameter f1 value 2200
Modulator: parameter diffenc value 1
Demodulator: parameter bps value 1200
Demodulator: parameter f0 value 1200
Demodulator: parameter f1 value 2200
Demodulator: parameter diffdec value 1
Minimum sampling rate: 9600
Audio IO: type soundcard
sm[7286]: audio: starting "/dev/dsp"
sm[7286]: audio: forcing half duplex mode
sm[7286]: audio: sample rate 9600 input fragsz 256 numfrags 2 output fragsz 256 numfrags 256
Real sampling rate: 9600
passall: 1
passall: 0
passall: 1
Packet: fm ND0CHZ-0 to kIATE-0 via CB0M0C-8,2NOD0-8,*42'2:-7,722-4 I37^ pid=75
x), 1.79dp02 (JO60JT:ND0CHZ) Internet-Node Chemnitz-Gruena - Mail DP9BOX
Packet:
Packet:
Packet: fm D0CHZ0-3 to IATEp-14 via B0STL0-3,H2NOD0-1 I35 pid=65
NetNode (Linux), 1.79dp02 (JO60JT:ND0CHZ) Internet-Node Chemnitz-Gruena - Mail DP9BOX
Packet: fm NXHZ0C-2 to kIATE-0 via 0STL0C-8,2NOD0-8,*42'2:-7,722-4 I37 pid=75
x), 1.79dp0(JO60JT:NdHZ) Internet-Node Chemnitz-Gruena - Mail DP9BOX
Packet:
Joining TxThread
Joining RxThread
Releasing IO

I would be happy to provide further information if necessary.

Bye, Daniel.
--
JabberID: [EMAIL PROTECTED]
http://de.wikipedia.org/wiki/Jabber
Re: [PATCH?] tcp and delayed acks
On Wed, 16 Aug 2006 16:55:32 -0400 Benjamin LaHaise [EMAIL PROTECTED] wrote:

> Hello folks,
>
> In looking at a few benchmarks (especially netperf) run locally, it seems that tcp is unable to make full use of available CPU cycles as the sender is throttled waiting for ACKs to arrive. The problem is exacerbated when the sender is using a small send buffer -- running netperf -C -c -- -s 1024 shows a miserable 420Kbit/s at essentially 0% CPU usage. Tests over gige are similarly constrained to a mere 96Mbit/s.

What ethernet hardware? The defaults are often not big enough for full speed on gigabit hardware. I need to increase rmem/wmem to allow for more buffering.

> Since there is no way for the receiver to know if the sender is being blocked on transmit space, would it not make sense for the receiver to send out any delayed ACKs when it is clear that the receiving process is waiting for more data? The patch below attempts this (I make no guarantees of its correctness with respect to the rest of the delayed ack code). One point I'm still contemplating is what to do if the receiver is waiting in poll/select/epoll.

The point of delayed ACKs was to merge the response and the ack on request/response protocols like NFS or telnet. It does make sense to get it out sooner though.

> [All tests run with maxcpus=1 on a 2.67GHz Woodcrest system.]
Re: [PATCH?] tcp and delayed acks
From: Stephen Hemminger [EMAIL PROTECTED]
Date: Wed, 16 Aug 2006 12:11:12 -0700

> What ethernet hardware? The defaults are often not big enough for full speed on gigabit hardware. I need to increase rmem/wmem to allow for more buffering.

Current kernels allow the TCP send and receive socket buffers to grow up to at least 4MB in size, how much more do you need?

tcp_{w,r}mem[2] will now have a value of at least 4MB, see net/ipv4/tcp.c:tcp_init().
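For a sense of scale on the small-buffer case discussed in this thread, the throughput of a sender that must stop once its send buffer is full can be bounded with simple arithmetic: at most one buffer's worth of data per ACK interval. The sketch below is only an illustration; the function name is made up, and the 40 ms delayed-ACK interval used in the usage note is an assumption that nobody in the thread states.

```c
/*
 * Upper bound on throughput, in 10^6 bits/s, for a sender that can have at
 * most window_bytes outstanding and only gets an ACK every ack_interval_s
 * seconds. Illustrative helper, not taken from the kernel or from netperf.
 */
static double window_limited_mbps(double window_bytes, double ack_interval_s)
{
    return window_bytes * 8.0 / ack_interval_s / 1e6;
}
```

With the 2048-byte send socket size that netperf reports for -s 1024 and an assumed 40 ms ACK delay, window_limited_mbps(2048, 0.040) comes to roughly 0.41, which is in the same range as the 0.43 x 10^6 bits/s measured over loopback. That is consistent with the sender idling until a delayed ACK arrives, though it does not prove it.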
Re: New driver questions: Attansic L1 gigabit NIC
On Wed, 16 Aug 2006 13:44:43 -0500 John Haller [EMAIL PROTECTED] wrote:

> Stephen Hemminger wrote:
>> On Tue, 15 Aug 2006 18:23:19 -0500 Jay Cliburn [EMAIL PROTECTED] wrote:
>>> Stephen Hemminger wrote:
>>>> On Sun, 13 Aug 2006 19:11:42 -0500 Jay Cliburn [EMAIL PROTECTED] wrote:
>>>>> ...snip...
>>>>> I've read the LKML FAQ regarding new driver submissions, but it implies that the submitter be willing to maintain the driver, which I'm not qualified to do. I haven't contacted Attansic to request a change to the above support statement, because my past attempts to contact vendors on matters of this tenor have been greeted with silence.
>>>>
>>>> I would recommend asking the module author to see if they would GPL it.
>>>
>>> Thank you for your reply. I've contacted the author as you suggest.
>>
>> IANAL but because they used GPL code in the driver, one could argue that they created a derived work covered by GPL already. But I learned in preschool it is always better to ask than take.
>
> Not exactly. What they wrote is covered by their copyright, and there is no permission to use it in any way other than how they licensed it. Use of GPL code in their driver would allow the author of the GPL code to sue them for violating the license agreement, which would likely result in the code being released under GPL. IANAL either, but to paraphrase another preschool saying, two wrongs (copyright violations) don't make a right (legally licensed).

In this case, though, the vendor put a license file in the source that says GPL. They just forgot and put a different value in MODULE_LICENSE().
Re: [PATCH?] tcp and delayed acks
> The point of delayed ack's was to merge the response and the ack on request/response protocols like NFS or telnet. It does make sense to get it out sooner though.

Well, to a point at least - I wouldn't go so far as to suggest immediate ACKs.

However, I was always under the impression that ACKs were sent (in the mythical generic TCP stack) when:

a) there was data going the other way
b) there was a window update going the other way
c) the standalone ACK timer expired.

Does this patch then implement b? Were there perhaps holes in the logic when things were smaller than the MTU/MSS? (-v 2 on the netperf command line should show what the MSS was for the connection)

rick jones

BTW, many points scored for including CPU utilization and service demand figures with the netperf output :)

> [All tests run with maxcpus=1 on a 2.67GHz Woodcrest system.]
>
> Recv   Send   Send                         Utilization     Service Demand
> Socket Socket Message Elapsed              Send    Recv    Send    Recv
> Size   Size   Size    Time    Throughput   local   remote  local   remote
> bytes  bytes  bytes   secs.   10^6bits/s   % S     % S     us/KB   us/KB
>
> Base (2.6.17-rc4):
>
> default send buffer size
> netperf -C -c
> 87380  16384  16384   10.02   14127.79     99.90   99.90   0.579   0.579
> 87380  16384  16384   10.02   13875.28     99.90   99.90   0.590   0.590
> 87380  16384  16384   10.01   13777.25     99.90   99.90   0.594   0.594
> 87380  16384  16384   10.02   13796.31     99.90   99.90   0.593   0.593
> 87380  16384  16384   10.01   13801.97     99.90   99.90   0.593   0.593
>
> netperf -C -c -- -s 1024
> 87380  2048   2048    10.02   0.43         -0.04   -0.04   -7.105  -7.377
> 87380  2048   2048    10.02   0.43         -0.01   -0.01   -2.337  -2.620
> 87380  2048   2048    10.02   0.43         -0.03   -0.03   -5.683  -5.940
> 87380  2048   2048    10.02   0.43         -0.05   -0.05   -9.373  -9.625
> 87380  2048   2048    10.02   0.43         -0.05   -0.05   -9.373  -9.625

Hmm, those CPU numbers don't look right. I guess there must still be some holes in the procstat CPU method code in netperf :(
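The three conditions listed above, plus the extra trigger the patch proposes (the receiving process is waiting for more data), can be written down as a single predicate. This is a minimal sketch of the decision only, assuming a simplified receiver; the struct and every field name are hypothetical and do not correspond to the kernel's delayed-ACK code.

```c
#include <stdbool.h>

/* Hypothetical snapshot of receiver state; all names are illustrative. */
struct ack_state {
    bool ack_pending;         /* received data has not been acknowledged yet */
    bool data_to_send;        /* (a) data is going the other way */
    bool window_update_due;   /* (b) a window update is going the other way */
    bool delack_timer_fired;  /* (c) the standalone ACK timer expired */
    bool reader_blocked;      /* proposed: receiving process waits for data */
};

/* Returns true when an ACK should go out now rather than stay delayed. */
static bool should_send_ack(const struct ack_state *s)
{
    if (!s->ack_pending)
        return false;         /* nothing to acknowledge */
    return s->data_to_send || s->window_update_due ||
           s->delack_timer_fired || s->reader_blocked;
}
```

Keeping reader_blocked as a separate input makes the open question from the thread visible: a process sitting in poll/select/epoll would need some other way to set it.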
Re: [PATCH?] tcp and delayed acks
On Wed, Aug 16, 2006 at 12:11:12PM -0700, Stephen Hemminger wrote:

>> is throttled waiting for ACKs to arrive. The problem is exacerbated when the sender is using a small send buffer -- running netperf -C -c -- -s 1024 shows a miserable 420Kbit/s at essentially 0% CPU usage. Tests over gige are similarly constrained to a mere 96Mbit/s.
>
> What ethernet hardware? The defaults are often not big enough for full speed on gigabit hardware. I need to increase rmem/wmem to allow for more buffering.

This is for small transmit buffer sizes over either loopback or e1000. The artifact also shows up over localhost for somewhat larger buffer sizes, although it is much more difficult to get results that don't have large fluctuations because of other scheduling issues. Pinning the tasks to CPUs is on my list of things to try, but something in the multiple variants of sched_setaffinity() has resulted in it being broken in netperf.

> The point of delayed ack's was to merge the response and the ack on request/response protocols like NFS or telnet. It does make sense to get it out sooner though.

I would like to see what sort of effect this change has on higher latency. Ideally, quick ack mode should be doing the right thing, but it might need more input about the receiver's intent.

-ben
--
Time is of no importance, Mr. President, only life is important.
Don't Email: [EMAIL PROTECTED].
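The netperf tables in this thread report a send socket size of 2048 bytes for -s 1024. One way to see what the kernel actually grants for a requested send buffer is to set SO_SNDBUF and read it back. The sketch below uses only standard socket calls; the helper name is made up, and the value it returns depends on the kernel and its sysctl limits (Linux documents that it doubles the requested value to leave room for bookkeeping).

```c
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Request a send buffer of `requested` bytes on a fresh TCP socket and return
 * the size the kernel reports afterwards, or -1 on error.
 * Illustrative helper; not part of netperf.
 */
static int effective_sndbuf(int requested)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    int actual = -1;
    socklen_t len = sizeof(actual);

    if (fd < 0)
        return -1;
    if (setsockopt(fd, SOL_SOCKET, SO_SNDBUF, &requested, sizeof(requested)) != 0 ||
        getsockopt(fd, SOL_SOCKET, SO_SNDBUF, &actual, &len) != 0)
        actual = -1;
    close(fd);
    return actual;
}
```

Calling effective_sndbuf(1024) on the test machine would show whether the small-buffer runs really operate with a 2048-byte buffer or with whatever minimum that kernel enforces.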
[PATCH 07/13] e1000: Allow NVM to setup LPLU for IGP2 and IGP3
Allow NVM to setup LPLU for IGP2 and IGP3. Only IGP needs LPLU D3 disabled during init here.

Signed-off-by: Jeff Kirsher [EMAIL PROTECTED]
Signed-off-by: Auke Kok [EMAIL PROTECTED]
---

 drivers/net/e1000/e1000_hw.c |   13 ++++++++-----
 1 files changed, 8 insertions(+), 5 deletions(-)

diff --git a/drivers/net/e1000/e1000_hw.c b/drivers/net/e1000/e1000_hw.c
index 583518a..3728f33 100644
--- a/drivers/net/e1000/e1000_hw.c
+++ b/drivers/net/e1000/e1000_hw.c
@@ -1324,11 +1324,14 @@ e1000_copper_link_igp_setup(struct e1000
 		E1000_WRITE_REG(hw, LEDCTL, led_ctrl);
 	}
 
-	/* disable lplu d3 during driver init */
-	ret_val = e1000_set_d3_lplu_state(hw, FALSE);
-	if (ret_val) {
-		DEBUGOUT("Error Disabling LPLU D3\n");
-		return ret_val;
+	/* The NVM settings will configure LPLU in D3 for IGP2 and IGP3 PHYs */
+	if (hw->phy_type == e1000_phy_igp) {
+		/* disable lplu d3 during driver init */
+		ret_val = e1000_set_d3_lplu_state(hw, FALSE);
+		if (ret_val) {
+			DEBUGOUT("Error Disabling LPLU D3\n");
+			return ret_val;
+		}
 	}
 
 	/* disable lplu d0 during driver init */

--
Auke Kok [EMAIL PROTECTED]
[PATCH 03/13] e1000: Same cosmetic fix as earlier sent out for IPV4.
Signed-off-by: Auke Kok [EMAIL PROTECTED]
---

 drivers/net/e1000/e1000_main.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/drivers/net/e1000/e1000_main.c b/drivers/net/e1000/e1000_main.c
index 627f224..ea3d504 100644
--- a/drivers/net/e1000/e1000_main.c
+++ b/drivers/net/e1000/e1000_main.c
@@ -2545,7 +2545,7 @@ e1000_tso(struct e1000_adapter *adapter,
 			cmd_length = E1000_TXD_CMD_IP;
 			ipcse = skb->h.raw - skb->data - 1;
 #ifdef NETIF_F_TSO_IPV6
-		} else if (skb->protocol == ntohs(ETH_P_IPV6)) {
+		} else if (skb->protocol == htons(ETH_P_IPV6)) {
 			skb->nh.ipv6h->payload_len = 0;
 			skb->h.th->check =
 			    ~csum_ipv6_magic(&skb->nh.ipv6h->saddr,

--
Auke Kok [EMAIL PROTECTED]
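The patch above is described as cosmetic because, for a 16-bit value, htons() and ntohs() perform the same byte swap (or none at all on big-endian hosts), so the comparison yields the same result either way; htons() simply states the intent, which is to convert a host-order constant for comparison against a field kept in network order. The userspace sketch below illustrates this; EXAMPLE_ETH_P_IPV6 and is_ipv6_frame() are illustrative names, with the constant defined locally so that no kernel headers are needed.

```c
#include <arpa/inet.h>
#include <stdbool.h>
#include <stdint.h>

/* IPv6 EtherType in host byte order; local name to avoid kernel headers. */
#define EXAMPLE_ETH_P_IPV6 0x86DD

/*
 * wire_proto is in network byte order, as skb->protocol is in the driver.
 * The host-order constant is converted once and compared as-is.
 */
static bool is_ipv6_frame(uint16_t wire_proto)
{
    return wire_proto == htons(EXAMPLE_ETH_P_IPV6);
}
```

Because the two conversions are the same operation on 16-bit values, swapping one for the other does not change the generated comparison.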
[PATCH 01/13] e100: Fix MDIO/MDIO-X
MDIO/MDIO-X was broken due to a wrong errata. Removing the workaround code fixes this for affected NICs.

Signed-off-by: Jeff Kirsher [EMAIL PROTECTED]
Signed-off-by: Auke Kok [EMAIL PROTECTED]
---

 drivers/net/e100.c |   14 +++++---------
 1 files changed, 5 insertions(+), 9 deletions(-)

diff --git a/drivers/net/e100.c b/drivers/net/e100.c
index 91ef5f2..5de9843 100644
--- a/drivers/net/e100.c
+++ b/drivers/net/e100.c
@@ -1391,15 +1391,11 @@ static int e100_phy_init(struct nic *nic
 	}
 
 	if((nic->mac >= mac_82550_D102) || ((nic->flags & ich) &&
-	   (mdio_read(netdev, nic->mii.phy_id, MII_TPISTATUS) & 0x8000))) {
-		/* enable/disable MDI/MDI-X auto-switching.
-		   MDI/MDI-X auto-switching is disabled for 82551ER/QM chips */
-		if((nic->mac == mac_82551_E) || (nic->mac == mac_82551_F) ||
-		   (nic->mac == mac_82551_10) || (nic->mii.force_media) ||
-		   !(nic->eeprom[eeprom_cnfg_mdix] & eeprom_mdix_enabled))
-			mdio_write(netdev, nic->mii.phy_id, MII_NCONFIG, 0);
-		else
-			mdio_write(netdev, nic->mii.phy_id, MII_NCONFIG, NCONFIG_AUTO_SWITCH);
+	   (mdio_read(netdev, nic->mii.phy_id, MII_TPISTATUS) & 0x8000) &&
+		!(nic->eeprom[eeprom_cnfg_mdix] & eeprom_mdix_enabled))) {
+		/* enable/disable MDI/MDI-X auto-switching. */
+		mdio_write(netdev, nic->mii.phy_id, MII_NCONFIG,
+				nic->mii.force_media ? 0 : NCONFIG_AUTO_SWITCH);
 	}
 
 	return 0;

--
Auke Kok [EMAIL PROTECTED]