[RFC][UPDATED PATCH 2.6.16] [Patch 9/9] Generic netlink interface for delay accounting
On Mon, Mar 13, 2006 at 09:48:26PM -0500, jamal wrote:
> On Mon, 2006-13-03 at 18:33 -0800, Matt Helsley wrote:
> > On Mon, 2006-03-13 at 19:56 -0500, Shailabh Nagar wrote:
>
> I had a long description in an earlier email feedback; but the summary
> of it is the GET command is generic like TASKSTATS_CMD_GET; the message
> itself carries TLVs of what needs to be gotten which are
> either PID and/or TGID etc. Anyways, there's a long spill of what i am
> saying in that earlier email. Perhaps the current patch is a transition
> towards that?
>

Hi, Jamal,

Please find the updated version of delayacct-genetlink.patch. We hope this
iteration is closer to your expectation. I have copied the enums you
suggested in your previous review comments and used them.

Comments addressed (in this patch)

- Changed the code to use TLVs for data exchange between kernel and
  user space

Thanks,
Balbir

Documentation for the patch

Create a generic netlink interface (NETLINK_GENERIC family), called
"taskstats", for getting delay and CPU statistics of tasks and thread
groups during their lifetime and when they exit.

More changes expected. Following comments will go into a Documentation file:

When a task is alive, userspace can get its stats by sending a command
containing its pid. Sending a tgid returns the sum of stats of the tasks
belonging to that tgid (where such a sum makes sense). Together, the
command interface allows stats for a large number of tasks to be collected
more efficiently than would be possible through /proc or any per-pid
interface.

The netlink interface also sends the stats for each task to userspace when
the task is exiting. This permits fine-grain accounting for short-lived
tasks, which is important if userspace is doing its own aggregation of
statistics based on some grouping of tasks (e.g. CSA jobs, ELSA banks or
CKRM classes).

If the exiting task belongs to a thread group (with more members than
itself), the latter's delay stats are also sent out on the task's exit.
This allows userspace to get accurate data at a per-tgid level while the
tids of a tgid are exiting one by one.

The interface has been deliberately kept distinct from the delay accounting
code since it is potentially usable by other kernel components that need to
export per-pid/tgid data. The format of data returned to userspace is
versioned and the command interface is easily extensible to facilitate
reuse.

If reuse is not deemed useful enough, the naming, placement of functions
and config options will be modified to make this an interface for delay
accounting alone.

Signed-off-by: Shailabh Nagar <[EMAIL PROTECTED]>
Signed-off-by: Balbir Singh <[EMAIL PROTECTED]>

---

 include/linux/delayacct.h |   11 ++
 include/linux/taskstats.h |  112
 init/Kconfig              |   16 ++
 kernel/Makefile           |    1
 kernel/delayacct.c        |   44
 kernel/taskstats.c        |  251 ++
 6 files changed, 432 insertions(+), 3 deletions(-)

diff -puN include/linux/delayacct.h~delayacct-genetlink include/linux/delayacct.h
--- linux-2.6.16/include/linux/delayacct.h~delayacct-genetlink 2006-03-22 11:56:03.0 +0530
+++ linux-2.6.16-balbir/include/linux/delayacct.h 2006-03-22 11:56:03.0 +0530
@@ -15,6 +15,7 @@
 #define _LINUX_TASKDELAYS_H
 #include
+#include
 #ifdef CONFIG_TASK_DELAY_ACCT
 extern int delayacct_on; /* Delay accounting turned on/off */
@@ -25,6 +26,7 @@ extern void __delayacct_tsk_exit(struct
 extern void __delayacct_blkio_start(void);
 extern void __delayacct_blkio_end(void);
 extern unsigned long long __delayacct_blkio_ticks(struct task_struct *);
+extern int __delayacct_add_tsk(struct taskstats *, struct task_struct *);
 static inline void delayacct_tsk_init(struct task_struct *tsk)
 {
@@ -72,4 +74,13 @@ static inline unsigned long long delayac
         return 0;
 }
 #endif /* CONFIG_TASK_DELAY_ACCT */
+#ifdef CONFIG_TASKSTATS
+static inline int delayacct_add_tsk(struct taskstats *d,
+                                    struct task_struct *tsk)
+{
+        if (!tsk->delays)
+                return -EINVAL;
+        return __delayacct_add_tsk(d, tsk);
+}
+#endif
 #endif /* _LINUX_TASKDELAYS_H */
diff -puN /dev/null include/linux/taskstats.h
--- /dev/null 2004-06-24 23:34:38.0 +0530
+++ linux-2.6.16-balbir/include/linux/taskstats.h 2006-03-22 13:12:01.0 +0530
@@ -0,0 +1,112 @@
+/* taskstats.h - exporting per-task statistics
+ *
+ * Copyright (C) Shailabh Nagar, IBM Corp. 2006
+ * (C) Balbir Singh, IBM Corp. 2006
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of version 2.1 of the GNU Lesser General Public License
+ * as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it would be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS F
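
For readers following the TLV discussion above, here is a minimal userspace sketch of how a request carrying a PID and/or TGID could be packed and read back as type-length-value records. It is not the code from this patch: the attribute numbers, the header struct and the demo_* helpers are all made up for illustration, and only the general shape (16-bit length, 16-bit type, payload padded to 4 bytes) follows the usual netlink attribute layout.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical attribute types, for illustration only. */
enum { DEMO_ATTR_PID = 1, DEMO_ATTR_TGID = 2 };

/* Round a length up to the next multiple of 4 bytes. */
#define DEMO_ALIGN(len) (((size_t)(len) + 3) & ~(size_t)3)

struct demo_attr_hdr {
    uint16_t len;   /* header + payload, not counting padding */
    uint16_t type;  /* one of the DEMO_ATTR_* values */
};

/* Append one u32 TLV at *off. Returns 0 on success, -1 if it does not fit. */
static int demo_put_u32(unsigned char *buf, size_t cap, size_t *off,
                        uint16_t type, uint32_t value)
{
    struct demo_attr_hdr hdr;
    size_t total = DEMO_ALIGN(sizeof(hdr) + sizeof(value));

    if (*off > cap || cap - *off < total)
        return -1;
    hdr.len = (uint16_t)(sizeof(hdr) + sizeof(value));
    hdr.type = type;
    memset(buf + *off, 0, total);             /* zero the padding too */
    memcpy(buf + *off, &hdr, sizeof(hdr));
    memcpy(buf + *off + sizeof(hdr), &value, sizeof(value));
    *off += total;
    return 0;
}

/* Find the first u32 TLV of the given type. Returns 0 if found, else -1. */
static int demo_get_u32(const unsigned char *buf, size_t len,
                        uint16_t type, uint32_t *value)
{
    size_t off = 0;

    while (off <= len && len - off >= sizeof(struct demo_attr_hdr)) {
        struct demo_attr_hdr hdr;

        memcpy(&hdr, buf + off, sizeof(hdr));
        if (hdr.len < sizeof(hdr) || hdr.len > len - off)
            return -1;                        /* malformed record */
        if (hdr.type == type && hdr.len == sizeof(hdr) + sizeof(*value)) {
            memcpy(value, buf + off + sizeof(hdr), sizeof(*value));
            return 0;
        }
        off += DEMO_ALIGN(hdr.len);           /* skip payload and padding */
    }
    return -1;
}
```

The point of the layout is that a receiver can skip attribute types it does not know, which is what makes a single generic GET command extensible.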
Re: Writing a rate based transport protocol
On Tue, 21 Mar 2006 20:26:55 -0700 Mark Butler <[EMAIL PROTECTED]> wrote:

> On Mon, 13 Mar 2006 18:20:26 -0600, Saurabh Jain wrote:
>
> > Hi All, I am trying to write a new rate based transport protocol in
> > linux kernel (either as a module or directly within the kernel).
> > Basically it would be similar to UDP but with features like dynamic
> > rate control, connection and state management, error control like
> > TCP. Is there any established framework which i can use? I know
> > there is one for window based protocols like TCP where one can
> > dynamically register different congestion control mechanisms. I
> > would appreciate if somebody can give me some direction in this regard.
>
> I do not know what you have in mind, but a general facility to transmit
> a series of packets at spaced intervals would be very useful to
> compensate for ack compression, etc. Preferably a facility simple
> enough to be trivially offloaded to hardware. TSO/LSO hardware could
> certainly use something similar for spacing segments, so breaking sends
> over a size (c.f. sysctl_tcp_tso_win_divisor) manually would not be
> necessary.
>
> In software one might implement this as an alternative queueing
> discipline at layer two. The minimum spacing interval could be obtained
> from a route attribute similar to RTAX_ADVMSS. Alternatively, a
> transport protocol might calculate the nominal transmission spacing as
> the RTT divided by the congestion window size in packets and run or
> share a similar transmission scheduler at layer 4.
>

The bigger problem is that, to be effective, rate control needs accurate
real time. Linux is getting better at real time, but providing useful
high-speed inter-packet spacing is still beyond its current capabilities.
To get around this, I think most high-speed 10G cards provide some form of
rate control in firmware.
- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[patch 2/2] net: Node aware multipath device round robin -- device locality check
This patch checks device locality on every IP packet xmit.

In a multipath configuration, the TCP connection to route association is
done at session startup time. The TCP session process may be migrated to a
different node after this association. This would mean a remote NIC is
chosen for xmit, although a local NIC could be available. The problem shows
up as remote NIC transfers in some TCP workloads such as AB. The following
patch checks whether a local NIC is available for the destination, and
recalculates routes if so.

Downside: adds a bitmap to struct rtable, but only if
CONFIG_IP_ROUTE_MULTIPATH_NODE is enabled.

Comments, suggestions welcome.

Signed-off-by: Pravin B. Shelar <[EMAIL PROTECTED]>
Signed-off-by: Ravikiran Thirumalai <[EMAIL PROTECTED]>
Signed-off-by: Shai Fultheim <[EMAIL PROTECTED]>

Index: linux-2.6.16/include/net/route.h
===
--- linux-2.6.16.orig/include/net/route.h 2006-03-19 21:53:29.0 -0800
+++ linux-2.6.16/include/net/route.h 2006-03-20 14:52:24.0 -0800
@@ -75,6 +75,13 @@ struct rtable
         /* Miscellaneous cached information */
         __u32 rt_spec_dst; /* RFC1122 specific destination */
         struct inet_peer *peer; /* long-living peer info */
+#ifdef CONFIG_IP_ROUTE_MULTIPATH_NODE
+        /* bitmap bit is set if current node has a local multi-path device for
+         * this route.
+         */
+        DECLARE_BITMAP (mp_if_bitmap, MAX_NUMNODES);
+#endif
+
 };
 struct ip_rt_acct
@@ -201,4 +208,21 @@ static inline struct inet_peer *rt_get_p
 extern ctl_table ipv4_route_table[];
+#ifdef CONFIG_IP_ROUTE_MULTIPATH_NODE
+
+#include
+
+static inline int dst_dev_node_check(struct rtable *rt)
+{
+        int cnode = numa_node_id();
+        if (unlikely(netdev_node(rt->u.dst.dev) != cnode)) {
+                if (test_bit(cnode, rt->mp_if_bitmap))
+                        return 1;
+        }
+        return 0;
+}
+#else
+#define dst_dev_node_check(rt) 0
+#endif
+
 #endif /* _ROUTE_H */
Index: linux-2.6.16/net/ipv4/ip_output.c
===
--- linux-2.6.16.orig/net/ipv4/ip_output.c 2006-03-19 21:53:29.0 -0800
+++ linux-2.6.16/net/ipv4/ip_output.c 2006-03-20 14:52:24.0 -0800
@@ -309,7 +309,7 @@ int ip_queue_xmit(struct sk_buff *skb, i
         /* Make sure we can route this packet. */
         rt = (struct rtable *)__sk_dst_check(sk, 0);
-        if (rt == NULL) {
+        if ((rt == NULL ) || dst_dev_node_check(rt)) {
                 u32 daddr;
                 /* Use correct destination address if we have options. */
Index: linux-2.6.16/net/ipv4/route.c
===
--- linux-2.6.16.orig/net/ipv4/route.c 2006-03-19 21:53:29.0 -0800
+++ linux-2.6.16/net/ipv4/route.c 2006-03-20 14:52:24.0 -0800
@@ -2313,6 +2313,22 @@ static inline int ip_mkroute_output(stru
         if (res->fi && res->fi->fib_nhs > 1) {
                 unsigned char hopcount = res->fi->fib_nhs;
+#ifdef CONFIG_IP_ROUTE_MULTIPATH_NODE
+                DECLARE_BITMAP (mp_if_bitmap, MAX_NUMNODES);
+                bitmap_zero(mp_if_bitmap, MAX_NUMNODES);
+                /* Calculating device bitmap for this multipath route */
+                if (res->fi->fib_mp_alg == IP_MP_ALG_DRR) {
+                        for (hop = 0; hop < hopcount; hop++) {
+                                struct net_device *dev2nexthop;
+
+                                res->nh_sel = hop;
+                                dev2nexthop = FIB_RES_DEV(*res);
+                                dev_hold(dev2nexthop);
+                                set_bit(netdev_node(dev2nexthop), mp_if_bitmap);
+                                dev_put(dev2nexthop);
+                        }
+                }
+#endif
                 for (hop = 0; hop < hopcount; hop++) {
                         struct net_device *dev2nexthop;
@@ -2343,6 +2359,10 @@ static inline int ip_mkroute_output(stru
                                 FIB_RES_NETMASK(*res),
                                 res->prefixlen,
                                 &FIB_RES_NH(*res));
+
+#ifdef CONFIG_IP_ROUTE_MULTIPATH_NODE
+                        bitmap_copy(rth->mp_if_bitmap, mp_if_bitmap, MAX_NUMNODES);
+#endif
 cleanup:
                         /* release work reference to output device */
                         dev_put(dev2nexthop);
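
The decision made by dst_dev_node_check() above can be tried out in userspace with a small stand-in. This is only a sketch under stated assumptions: struct demo_route, its fields and DEMO_MAX_NODES are hypothetical, and a plain 64-bit mask takes the place of the kernel bitmap. The logic mirrors the patch: reroute only when the cached device is remote and the current node is known to have a local device for the route.

```c
#include <stdbool.h>

#define DEMO_MAX_NODES 64  /* assumption: fits in one 64-bit mask */

/* Hypothetical stand-in for the cached route state. */
struct demo_route {
    int dev_node;                  /* node of the NIC currently selected */
    unsigned long long node_mask;  /* bit n set: node n has a local NIC */
};

/* Returns true when the cached route should be recalculated. */
static bool demo_should_reroute(const struct demo_route *rt, int cur_node)
{
    if (cur_node < 0 || cur_node >= DEMO_MAX_NODES)
        return false;              /* unknown node: keep the cached route */
    if (rt->dev_node == cur_node)
        return false;              /* already transmitting on a local NIC */
    return ((rt->node_mask >> cur_node) & 1ULL) != 0;
}
```

Note that a process running on a node with no local NIC for the route keeps its cached route, so the extra lookup is only paid when it can actually help.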
[patch 1/2] net: Node aware multipath device round robin
The following patch adds node-aware device round robin IP multipathing.
It is derived from multipath_drr.c, the multipath device round robin
algorithm. This implementation maintains a per-node state table and
round-robins between interfaces on the same node. The implementation needs
to be aware of the NIC's proximity to a node, so we have added a node field
to struct net_device. NIC device drivers can initialize it with the id of
the node the NIC belongs to. This patch uses the IP_MP_ALG_DRR slot, like
the regular multipath_drr. So either SMP multipath_drr or node-aware
multipath_node_drr should be used for device round robin, depending on
whether the system has proximity information for the NICs.

Performance results:

1. Single NIC test -- 1 client targets 1 nic on the server with 300
   concurrent requests.
2. 4 NIC test -- 1 client targets 4 nics, all on different nodes on the
   server with 300 concurrent requests.

We see about 135% improvement on AB requests per second with this patch and
the device_locality_check patch on the single NIC test, on the Rackable
c5100 machine (server). We see about 64% improvement when all 4 NICs are
targeted.

Credits: This work was originally done by Justin Forbes

Comments?

Signed-off-by: Pravin B.
Shelar <[EMAIL PROTECTED]>
Signed-off-by: Shobhit Dayal <[EMAIL PROTECTED]>
Signed-off-by: Ravikiran Thirumalai <[EMAIL PROTECTED]>
Signed-off-by: Shai Fultheim <[EMAIL PROTECTED]>

Index: linux-2.6.16/drivers/net/e1000/e1000_main.c
===
--- linux-2.6.16.orig/drivers/net/e1000/e1000_main.c 2006-03-19 21:53:29.0 -0800
+++ linux-2.6.16/drivers/net/e1000/e1000_main.c 2006-03-20 14:52:23.0 -0800
@@ -692,6 +692,7 @@ e1000_probe(struct pci_dev *pdev,
         SET_MODULE_OWNER(netdev);
         SET_NETDEV_DEV(netdev, &pdev->dev);
+        SET_NETDEV_NODE(netdev, pcibus_to_node(pdev->bus));
         pci_set_drvdata(pdev, netdev);
         adapter = netdev_priv(netdev);
Index: linux-2.6.16/drivers/net/tg3.c
===
--- linux-2.6.16.orig/drivers/net/tg3.c 2006-03-19 21:53:29.0 -0800
+++ linux-2.6.16/drivers/net/tg3.c 2006-03-20 14:52:23.0 -0800
@@ -10705,6 +10705,7 @@ static int __devinit tg3_init_one(struct
         SET_MODULE_OWNER(dev);
         SET_NETDEV_DEV(dev, &pdev->dev);
+        SET_NETDEV_NODE(dev, pcibus_to_node(pdev->bus));
         dev->features |= NETIF_F_LLTX;
 #if TG3_VLAN_TAG_USED
Index: linux-2.6.16/include/linux/netdevice.h
===
--- linux-2.6.16.orig/include/linux/netdevice.h 2006-03-19 21:53:29.0 -0800
+++ linux-2.6.16/include/linux/netdevice.h 2006-03-20 14:52:23.0 -0800
@@ -315,7 +315,9 @@ struct net_device
         /* Interface index. Unique device identifier*/
         int ifindex;
         int iflink;
-
+#ifdef CONFIG_NUMA
+        int node; /* NUMA node this IF is close to */
+#endif
         struct net_device_stats* (*get_stats)(struct net_device *dev);
         struct iw_statistics* (*get_wireless_stats)(struct net_device *dev);
@@ -520,6 +522,14 @@ static inline void *netdev_priv(struct n
  */
 #define SET_NETDEV_DEV(net, pdev) ((net)->class_dev.dev = (pdev))
+#ifdef CONFIG_NUMA
+#define SET_NETDEV_NODE(dev, nodeid) ((dev)->node = (nodeid))
+#define netdev_node(dev) ((dev)->node)
+#else
+#define SET_NETDEV_NODE(dev, nodeid) do {} while (0)
+#define netdev_node(dev) (-1)
+#endif
+
 struct packet_type {
         __be16 type; /* This is really htons(ether_type). */
         struct net_device *dev; /* NULL is wildcarded here */
Index: linux-2.6.16/net/core/dev.c
===
--- linux-2.6.16.orig/net/core/dev.c 2006-03-19 21:53:29.0 -0800
+++ linux-2.6.16/net/core/dev.c 2006-03-20 14:52:23.0 -0800
@@ -3003,7 +3003,8 @@ struct net_device *alloc_netdev(int size
         if (sizeof_priv)
                 dev->priv = netdev_priv(dev);
-
+
+        SET_NETDEV_NODE(dev, -1);
         setup(dev);
         strcpy(dev->name, name);
         return dev;
Index: linux-2.6.16/net/ipv4/Kconfig
===
--- linux-2.6.16.orig/net/ipv4/Kconfig 2006-03-19 21:53:29.0 -0800
+++ linux-2.6.16/net/ipv4/Kconfig 2006-03-20 14:52:23.0 -0800
@@ -164,6 +164,15 @@ config IP_ROUTE_MULTIPATH_DRR
           available interfaces. This policy makes sense if the connections
           should be primarily distributed on interfaces and not on routes.
+config IP_ROUTE_MULTIPATH_NODE
+        tristate "MULTIPATH: interface RR algorithm with node affinity"
+        depends on IP_ROUTE_MULTIPATH_CACHED && NUMA && !IP_ROUTE_MULTIPATH_DRR
+
Re: Writing a rate based transport protocol
On Mon, 13 Mar 2006 18:20:26 -0600, Saurabh Jain wrote:

> Hi All, I am trying to write a new rate based transport protocol in
> linux kernel (either as a module or directly within the kernel).
> Basically it would be similar to UDP but with features like dynamic
> rate control, connection and state management, error control like
> TCP. Is there any established framework which i can use? I know
> there is one for window based protocols like TCP where one can
> dynamically register different congestion control mechanisms. I
> would appreciate if somebody can give me some direction in this regard.

I do not know what you have in mind, but a general facility to transmit
a series of packets at spaced intervals would be very useful to
compensate for ack compression, etc. Preferably a facility simple
enough to be trivially offloaded to hardware. TSO/LSO hardware could
certainly use something similar for spacing segments, so breaking sends
over a size (c.f. sysctl_tcp_tso_win_divisor) manually would not be
necessary.

In software one might implement this as an alternative queueing
discipline at layer two. The minimum spacing interval could be obtained
from a route attribute similar to RTAX_ADVMSS. Alternatively, a
transport protocol might calculate the nominal transmission spacing as
the RTT divided by the congestion window size in packets and run or
share a similar transmission scheduler at layer 4.

- Mark B.
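
The nominal spacing mentioned above (RTT divided by the congestion window in packets) is simple enough to write down. The helper below is only an illustration with made-up names, working in microseconds; how to treat a zero window is my own assumption, not something from the thread.

```c
#include <stdint.h>

/*
 * Nominal gap between packet transmissions, in microseconds:
 * the round-trip time divided by the congestion window in packets.
 */
static uint64_t demo_pacing_interval_us(uint64_t rtt_us, uint32_t cwnd_packets)
{
    if (cwnd_packets == 0)
        return rtt_us;  /* assumption: fall back to one packet per RTT */
    return rtt_us / cwnd_packets;
}
```

For example, a 100 ms RTT with a window of 50 packets gives a 2 ms gap, which is coarse enough for a software timer; at 10G rates with short RTTs the gap drops to a few microseconds, which is where timer accuracy becomes the limiting factor.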
Re: [PATCH 2.6.16-rc6 1/1] ipw2200: Add Kconfig entries for QOS and Monitor mode
On Sat, 2006-03-18 at 18:47 +0100, Andreas Happe wrote:
> Adds Kconfig entries for enabling Monitor mode and Quality of service
> to the ipw2200 driver. It also renames the IPW_QOS define to
> IPW2200_QOS.
>
> As Monitor mode generates lots of firmware errors it depends upon
> BROKEN. QOS is under development, so it depends upon EXPERIMENTAL.

Ack the rename and QoS description changes.

The IPW2200_MONITOR and monitor mode firmware errors are already fixed in
the wireless-2.6 GIT tree:

http://kernel.org/git/?p=linux/kernel/git/linville/wireless-2.6.git;a=summary

Wireless related development happens there. I'd suggest you create patches
against that tree.

Thanks,
-yi
[PATCH] Broadcom Sibyte SB1xxx save unaligned dma descriptor pointer fix
This patch has a fix to drivers/net/sb1250-mac.c. The DMA descriptor table
pointer is allocated and then aligned, and it is the aligned pointer that
gets freed. If the pointer was not already aligned (it usually is), the
free is passed an address that is not what kmalloc returned. A variable was
added to store the unaligned pointer so that it can be properly freed.

Tom

On Sun, 19 Mar 2006 17:08:23 -0600, Lennert Buytenhek <[EMAIL PROTECTED]> wrote:

> On Sun, Mar 19, 2006 at 05:12:32PM -0600, Tom Rix wrote:
>
> > This patch also has a fix to drivers/net/sb1250-mac.c, the dma
> > descriptor table ptr is allocated, aligned and the aligned ptr is
> > freed. If the ptr was not already aligned (usually is) then the free
> > would not work of what was returned by the kmalloc. A variable was
> > added to store the unaligned pointer so that it could be properly
> > freed.
>
> Can you submit that as a separate patch?

mips-sb1250-mac-savedmaptr-1.patch
Description: Binary data
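
Since the patch itself is attached as binary data, here is a userspace sketch of the pattern the description refers to: over-allocate, align the working pointer, and remember the raw pointer for the free. It uses malloc/free rather than the kernel allocator, and struct demo_ring and the demo_* functions are hypothetical names, not the ones in sb1250-mac.c.

```c
#include <stdint.h>
#include <stdlib.h>

struct demo_ring {
    void *raw;    /* exactly what malloc() returned; only this may be freed */
    void *table;  /* aligned view into the same allocation */
};

/* Allocate size bytes aligned to align (a power of two). 0 on success. */
static int demo_ring_alloc(struct demo_ring *r, size_t size, size_t align)
{
    if (align == 0 || (align & (align - 1)) != 0)
        return -1;                      /* not a power of two */
    if (size > SIZE_MAX - (align - 1))
        return -1;                      /* size + slack would overflow */

    r->raw = malloc(size + align - 1);  /* slack leaves room to round up */
    if (r->raw == NULL)
        return -1;
    r->table = (void *)(((uintptr_t)r->raw + align - 1) &
                        ~((uintptr_t)align - 1));
    return 0;
}

/* Free through the saved raw pointer, never through the aligned one. */
static void demo_ring_free(struct demo_ring *r)
{
    free(r->raw);
    r->raw = NULL;
    r->table = NULL;
}
```

When the allocator happens to return an already aligned block, raw and table are equal and freeing either would work, which is presumably why the original bug went unnoticed most of the time.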
Re: [PATCH] Broadcom Sibyte SB1xxx NAPI ethernet support
This patch adds NAPI support for the Broadcom Sibyte SB1xxx family. The
changes are limited to adding a new config key SBMAC_NAPI to
drivers/net/Kconfig and to adding the poll op and interrupt support to
drivers/net/sb1250-mac.c.

I have tested this patch on a BCM91250A-SWARM Pass 2 / An. Mark Mason from
Broadcom was very helpful and tested this patch on at least a 1480.

Tom

On Sun, 19 Mar 2006 17:08:23 -0600, Lennert Buytenhek <[EMAIL PROTECTED]> wrote:

> On Sun, Mar 19, 2006 at 05:12:32PM -0600, Tom Rix wrote:
>
> > This patch also has a fix to drivers/net/sb1250-mac.c, the dma
> > descriptor table ptr is allocated, aligned and the aligned ptr is
> > freed. If the ptr was not already aligned (usually is) then the free
> > would not work of what was returned by the kmalloc. A variable was
> > added to store the unaligned pointer so that it could be properly
> > freed.
>
> Can you submit that as a separate patch?

mips-sb1250-mac-NAPI-4.patch
Description: Binary data
Re: [PATCH] Broadcom Sibyte SB1xxx NAPI ethernet support
Yes. They are soon to follow.

Tom

On Sun, 19 Mar 2006 17:08:23 -0600, Lennert Buytenhek <[EMAIL PROTECTED]> wrote:

> On Sun, Mar 19, 2006 at 05:12:32PM -0600, Tom Rix wrote:
>
> > This patch also has a fix to drivers/net/sb1250-mac.c, the dma
> > descriptor table ptr is allocated, aligned and the aligned ptr is
> > freed. If the ptr was not already aligned (usually is) then the free
> > would not work of what was returned by the kmalloc. A variable was
> > added to store the unaligned pointer so that it could be properly
> > freed.
>
> Can you submit that as a separate patch?
Re: [PATCH 6/6 v2] IB: userspace support for RDMA connection manager
I added this patch to the rdma_cm branch in my git tree.

When I was doing that, I noticed that it builds rdma_ucm.ko
unconditionally. It seems that we want this to depend on
CONFIG_INFINIBAND_USER_ACCESS, since that controls ib_uverbs.ko and
ib_ucm.ko.

To do this I rejiggered the Kconfig and Makefile changes I made before. I
made CONFIG_INFINIBAND_ADDR_TRANS into a bool (instead of a tristate), so
that it's 'y' if INFINIBAND and INET are on, and made the top of the
Makefile look like:

    infiniband-$(CONFIG_INFINIBAND_ADDR_TRANS)  := ib_addr.o rdma_cm.o
    user_access-$(CONFIG_INFINIBAND_ADDR_TRANS) := rdma_ucm.o

    obj-$(CONFIG_INFINIBAND) +=             ib_core.o ib_mad.o ib_sa.o \
                                            ib_cm.o $(infiniband-y)
    obj-$(CONFIG_INFINIBAND_USER_MAD) +=    ib_umad.o
    obj-$(CONFIG_INFINIBAND_USER_ACCESS) += ib_uverbs.o ib_ucm.o $(user_access-y)

I'm pretty sure this does exactly what we want.

 - R.
Re: [openib-general] Re: [iproute2] IPoIB link layer address bug
On Tue, Mar 21, 2006 at 03:56:17PM -0800, Stephen Hemminger wrote:

> Okay, but there are number of other places in iproute2 that call ll_addr_a2n()
> with ifr.ifr_hwaddr.sa_data. And that is 14 bytes. If you want to fix those
> it will be harder since it would increase the sizeof(struct sockaddr) and
> potentially break compatibility.

Maybe the best thing is to upgrade ip (and/or netlink?) to use netlink
messages instead of ioctls for the remaining problematic operations. Since
netlink already supports an arbitrary length hwaddr there should be no
compatibility problem.

Just browsing, I see usages of SIOCSIFHWBROADCAST, SIOCSIFHWADDR,
SIOCADDMULTI, SIOCDELMULTI and SIOCGIFHWADDR that use a struct ifreq. I
know SIOCGIFHWADDR can be done over netlink, but I'm not too familiar with
the others.

Jason
Re: shared abstractions (was Writing a rate based transport protocol)
On Tue, 14 Mar 2006 11:37:38 -0300, Arnaldo Carvalho de Melo wrote:

> On 3/13/06, Saurabh Jain <[EMAIL PROTECTED]> wrote:
> > Hi All,
> >
> > I am trying to write a new rate based transport protocol in linux
> > kernel (either as a module or directly within the kernel). Basically
> > it would be similar to UDP but with features like dynamic rate
> > control, connection and state management, error control like TCP. Is
> > there any established framework which i can use? I know there is one
> > for window based protocols like TCP where one can dynamically register
> > different congestion control mechanisms. I would appreciate if
> > somebody can give me some direction in this regard.
>
> Look at how DCCP and TCP share code, using abstractions such as:
>
> struct inet_connection_sock
> struct inet_request_sock
> struct inet_timewait_sock
> struct inet_hashinfo
>
> I suggest too that you read my OLS 2004 paper:

One of the limitations of those abstractions is that they are not generic
enough for SCTP to use them. It is probably asking too much to generalize
everything, but it would be nice if everything weren't bound so tightly to
the idea of a one-to-one, single address, single path socket. XFRM, the IP
layer, the sock layer, the inet sock layer, and the inet connection sock
layer all have that assumption hard coded into them in various ways.

For example, a one-to-many style SCTP socket is equivalent to a group of
inet_connection_socks presenting a UDP style interface, one
inet_connection_sock per SCTP association. But since inet_connection_sock
is a _sock_, it cannot be used as the base implementation for an SCTP
association.

Similarly, struct sk_buff carries a destructor pointer that is typically
used to release memory to sk_wmem_alloc, but the destructor is called with
a "struct sock *" argument, from skb->sk. A more general implementation
would replace skb->sk with a pointer to an intermediate abstraction or a
void *, or add a destructor context argument. Currently, in order to work
correctly, SCTP has to consider memory reclaimed as soon as it hits the IP
layer, because flow control is done at the association level, not the
socket level.

XFRM has the same problem - it only allows security policies to be
overridden at the socket level, whereas an SCTP socket may handle thousands
of associations with independent security policies.

It would also be nice to share congestion control implementations - SCTP
does congestion control on a path ("transport") basis, not a per socket
basis, and the congestion control interface would have to be similarly
general purpose.

There are dozens of fields in struct sock, struct inet_sock, and struct
inet6_sock that are superfluous overhead in the SCTP case. It would be
better if struct sock etc. were one level higher in abstraction, rather
than carrying so much baggage from the TCP view of the world. Perhaps two
thirds of the current fields belong at lower levels of abstraction. In view
of the goal of reducing the kernel footprint, such a re-factoring might be
worth considering.

- Mark B.
RE: [RFC/PATCH 6/13] d80211: remove obsolete stuff
Yes, fully agreed - and the hardware's pre-beacon interrupt would cause the
beacon function to create a beacon frame and put it into the queue
(dev_queue_xmit on the master device). The beacon frame would then be
passed to the hardware through the normal run_queue that follows.

Simon

-Original Message-
From: Jouni Malinen
Sent: Wednesday, March 15, 2006 4:48 PM
To: Simon Barber
Cc: Jiri Benc; netdev@vger.kernel.org
Subject: Re: [RFC/PATCH 6/13] d80211: remove obsolete stuff

On Wed, Mar 15, 2006 at 04:41:56PM -0800, Simon Barber wrote:

> The more natural way for beacons to flow from the 80211.o to the low
> level driver would be for beacons to be passed down just like any
> other 802.11 frame is passed down - rather than having a special case
> for beacons and buffered MC data, where they are pulled. I would suggest
> making the qdisc aware of beacons, and then there is no special
> interface for passing beacons down - they are passed down just like
> other frames, with a special queue ID reserved for beacons and
> buffered multicast.
>
> This would simplify the 80211.o/low level interface.

Sure, but it would also require good synchronization for sending the
beacons just before they are needed for transmission. If the wlan hardware
implementation provides support for interrupts that request beacons at
proper times, being able to use them for this is quite convenient.

--
Jouni Malinen                                            PGP id EFC895FA
Re: [iproute2] IPoIB link layer address bug
On Thu, 16 Mar 2006 17:24:41 -0500 (EST) James Lentini <[EMAIL PROTECTED]> wrote:

> The ip(8) command has a bug when dealing with IPoIB link layer
> addresses. Specifically it does not correctly handle the addition of
> new entries in the neighbor/arp table. For example, this command will
> fail:
>
> ip neigh add 192.168.0.138 lladdr 00:00:04:04:fe:80:00:00:00:00:00:00:00:01:73:00:00:00:8a:91 nud permanent dev ib0
>
> An IPoIB link layer address is 20-bytes (see
> http://www.ietf.org/internet-drafts/draft-ietf-ipoib-ip-over-infiniband-09.txt,
> section 9.1.1).
>
> The command line parsing code expects link layer addresses to be a
> maximum of 16-bytes. Addresses over 16-bytes are truncated.
>
> This patch (against the iproute2 cvs repository) fixes the problem:
>

Okay, but there are a number of other places in iproute2 that call
ll_addr_a2n() with ifr.ifr_hwaddr.sa_data. And that is 14 bytes. If you
want to fix those it will be harder, since it would increase the
sizeof(struct sockaddr) and potentially break compatibility.
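
To make the size problem concrete, here is a small standalone parser for colon-separated hex addresses that refuses, rather than truncates, an address longer than the caller's buffer. It is a sketch only: demo_lladdr_parse() and demo_hexval() are hypothetical helpers, not iproute2's ll_addr_a2n(), and the strictness rules (one or two hex digits per byte, no other characters) are my own choice.

```c
/* Value of one hex digit, or -1 if c is not a hex digit. */
static int demo_hexval(char c)
{
    if (c >= '0' && c <= '9')
        return c - '0';
    if (c >= 'a' && c <= 'f')
        return c - 'a' + 10;
    if (c >= 'A' && c <= 'F')
        return c - 'A' + 10;
    return -1;
}

/*
 * Parse "aa:bb:cc..." into out[0..cap-1].
 * Returns the number of bytes parsed, or -1 on bad syntax or when the
 * address needs more than cap bytes.
 */
static int demo_lladdr_parse(const char *s, unsigned char *out, int cap)
{
    int n = 0;

    for (;;) {
        int hi = demo_hexval(s[0]);
        int lo;
        unsigned int v;

        if (hi < 0)
            return -1;                  /* each byte needs a hex digit */
        lo = demo_hexval(s[1]);
        if (lo >= 0) {
            v = (unsigned int)hi * 16 + (unsigned int)lo;
            s += 2;
        } else {
            v = (unsigned int)hi;       /* single-digit byte, e.g. "1" */
            s += 1;
        }
        if (n >= cap)
            return -1;                  /* too long: refuse, do not truncate */
        out[n++] = (unsigned char)v;
        if (*s == '\0')
            return n;
        if (*s != ':')
            return -1;
        s++;
    }
}
```

With a 20-byte buffer the IPoIB address from the report parses to 20 bytes; with a 16-byte or 14-byte buffer the call fails outright, which at least makes the limitation visible to the user.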
Re: Results WAS(Re: [PATCH] TC: bug fixes to the "sample" clause
On Tue, 2006-03-21 at 14:39 -0800, Stephen Hemminger wrote:
> Back to the original question...
>
> What should the iproute2 utilities contain?
>
> Does it have to have the utsname hack to work?

Hi Stephen,

I think the resolution was:

- No to the utsname hack. Ergo the "tc" sample clause won't work on 2.4.
  Maybe "tc" using utsname to check the kernel version and print out a
  warning / error if "sample" is used on 2.4 is acceptable? I regard
  failing silently and producing incorrect results as a terrible thing to
  do. I could produce another patch if this is OK.

- Put the 2.6 hash algorithm in "tc". That is what my previous patch did.
  Jamal didn't like the patch description though. Perhaps he would prefer
  something along the lines of "Changed u32 hashing algorithm used by the
  'sample clause' to the 2.6 kernel algorithm. Currently it uses the 2.4
  algorithm, which computes the wrong result under some circumstances on
  2.6 kernels. This means tc sample will no longer work on 2.4."
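
The version check floated above could look roughly like this. It is a sketch, not a proposed tc patch: demo_is_kernel_24() is a made-up helper that only inspects a release string of the kind uname(2) reports in its release field.

```c
#include <stdio.h>

/*
 * Returns 1 if the release string names a 2.4.x kernel, 0 if it names
 * some other version, and -1 if no "major.minor" prefix can be parsed.
 */
static int demo_is_kernel_24(const char *release)
{
    int major, minor;

    if (sscanf(release, "%d.%d", &major, &minor) != 2)
        return -1;
    return major == 2 && minor == 4;
}
```

A caller would fill a struct utsname with uname() from <sys/utsname.h>, pass its release member to the helper, and print a warning or error when "sample" is used and the result is 1.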
Re: Results WAS(Re: [PATCH] TC: bug fixes to the "sample" clause
On Tue, 2006-03-21 at 09:57 -0500, jamal wrote:
> I accessed them - unfortunately, though i am trying to, I dont
> see anything outstanding that would justify any changes to the
> hash. Lets just drop this. We can talk about other things if you want.

If you still are not convinced, then I don't see that I can convince you.
Fair enough.

Yes - I would like to discuss other things. It will take me some days to
prepare them, so you will have a little peace and quiet (from me anyway)
for a short while.

I would like to take the opportunity to thank you for giving me such a fair
hearing. You have been polite throughout, despite my persistence. And you
have always responded quickly.
Re: Results WAS(Re: [PATCH] TC: bug fixes to the "sample" clause
Back to the original question...

What should the iproute2 utilities contain?

Does it have to have the utsname hack to work?
Re: [openib-general] Re: [PATCH 4/6 v2] IB: address translation to map IP toIB addresses (GIDs)
    Sean> "This is simply an attempt to reduce/combine work queues
    Sean> used by the Infiniband code. This keeps the threading a
    Sean> little simpler in the rdma_cm, since all callbacks are
    Sean> invoked using the same work queue. (I'm also using this
    Sean> with the local SA/multicast code, but that's not ready for
    Sean> merging.)"

How does it keep the threading model simpler? Is this an inter-module
dependency?

    Sean> There's no specific ordering constraint that's required.
    Sean> We're just ending up with several Infiniband modules
    Sean> creating their own work queues (ib_mad, ib_cm, ib_addr,
    Sean> rdma_cm, plus a couple more in modules under development),
    Sean> and this is an attempt to reduce that. If having separate
    Sean> work queues would work better, there shouldn't be anything
    Sean> that prevents this.

It seems like it would be cleaner for each module to have its own workqueue
if it needs one. There's also schedule_work(), although that goes to a
multi-threaded workqueue. Michael Tsirkin has suggested creating a
system-wide single-threaded workqueue (i.e. something like
schedule_ordered_work()) for everyone that occasionally needs a
single-threaded workqueue.

 - R.
Re: ES-API
On Mon Mar 14, 2006, Christopher Hellwig wrote:
>On Mon, Mar 13, 2006 at 02:25:08PM -0800, Zach Brown wrote:
>> Hi guys,
>>
>> I'm hearing noise about the 'Extended Sockets' API in Oracle. It's an
>> extension to the socket API put together by an industry group that calls
>> itself the Interconnect Software Consortium and is working in
>> partnership with the open group. The API adds support for things like
>> memory registration, async operations completed through event queues,
>> "standard" sendfile() and async poll(), etc.
>
>It's a new bullshit standard from the crackmonkeyes at the opengroups
>interconnect working group that already tried to push idiocies like RNICPI
>onto us. I already told them that they're on crack but they don't care.
>It's never going to appear in Linux.

ES-API has relatively little to do with memory registration or the RDMA world view per se. It is primarily a generic API for performing asynchronous socket I/O with completion notifications. Considering there are no other cross platform standards for asynchronous socket operations, ES-API is rather unlikely to go away.

Of course ES-API is a user level API, not a kernel level API. "Linux" does not have to implement it at all for there to be working, generic (no hardware required) Linux ES-API implementations. All that is needed is a working syscall interface for asynchronous socket operations, such as an extension of io_submit / io_getevents to do asynchronous connect(), shutdown(), sendmsg(), recvmsg(), setsockopt(), and getsockopt() operations. A library could easily translate ES-API calls in the same manner as glibc translates POSIX API calls.

- Mark B.
[git patches] net driver updates
[just pushed upstream; patch too big to inline]

Please pull from 'upstream-linus' branch of
master.kernel.org:/pub/scm/linux/kernel/git/jgarzik/netdev-2.6.git
to receive the following updates:

 Documentation/networking/e100.txt  |  158 -
 Documentation/networking/e1000.txt |  634 +++--
 MAINTAINERS                        |   16
 drivers/net/mv643xx_eth.h          |   18
 drivers/net/pcnet32.c              | 4161 +++--
 drivers/net/skfp/fplustm.c         |   12
 drivers/net/skge.c                 |  275 +-
 drivers/net/skge.h                 |    1
 drivers/net/sky2.c                 |  583 ++---
 drivers/net/sky2.h                 |   22
 drivers/net/smc91x.c               |   53
 drivers/net/smc91x.h               |  478 ++--
 12 files changed, 3467 insertions(+), 2944 deletions(-)

Andrew Morton:
      skfp warning fixes

Dale Farnsworth:
      mv643xx_eth: Cache align skb->data if CONFIG_NOT_COHERENT_CACHE

Don Fry:
      pcnet32: support boards with multiple phys

Jeff Garzik:
      [netdrvr] pcnet32: Lindent
      [netdrvr] pcnet32: other source formatting cleanups

Jesse Brandeburg:
      e100/e1000/ixgb: update MAINTAINERS to current developers
      e100: update e100.txt
      e1000: update the readme with the latest text

Nicolas Pitre:
      smc91x: allow for dynamic bus access configs

Stephen Hemminger:
      skge: use NAPI for tx cleanup.
      skge: use auto masking of irqs
      skge: check the allocation of ring buffer
      skge: dma configuration cleanup
      skge: use kcalloc
      skge: use mmiowb
      skge: formmating and whitespace cleanup
      skge: handle pci errors better
      skge: version 1.4
      sky2: remove support for untested Yukon EC/rev 0
      sky2: drop broken wake on lan support
      sky2: rework of NAPI and IRQ management
      sky2: coalescing parameters
      sky2: add MSI support
      sky2: whitespace fixes
      sky2: transmit recovery
      sky2: handle all error irqs
      sky2 version 1.1
Re: [PATCH 2.6.16-rc6 0/3] MAINTAINERS, e100 and e1000 text file updates
Jesse Brandeburg wrote: okay, here goes... these patches are against Linus's current tree. They only update text files, no code updates. The large change to e1000.txt includes whitespace changes, and some content. They could be included with 2.6.16 as they are for the drivers that are already merged. Signed-off-by: Jesse Brandeburg <[EMAIL PROTECTED]> --- The following changes since commit a488edc914aa1d766a4e2c982b5ae03d5657ec1b: are found in the git repository at: git://198.78.49.142/~jbrandeb/linux-2.6 e1000-fixes pulled
Re: [openib-general] Re: [PATCH 4/6 v2] IB: address translation to map IP toIB addresses (GIDs)
Roland Dreier wrote:
> +struct workqueue_struct *rdma_wq;
> +EXPORT_SYMBOL(rdma_wq);

Sean, I don't think I saw an answer when I asked you this before. Why is ib_addr exporting a workqueue? Is there some sort of ordering constraint that is forcing other modules to go through the same workqueue for things? This seems like a very fragile internal thing to be exposing, and I'm wondering if there's a better way to handle it.

I responded in a different thread, but here's what I wrote:

"This is simply an attempt to reduce/combine work queues used by the Infiniband code. This keeps the threading a little simpler in the rdma_cm, since all callbacks are invoked using the same work queue. (I'm also using this with the local SA/multicast code, but that's not ready for merging.)"

There's no specific ordering constraint that's required. We're just ending up with several Infiniband modules creating their own work queues (ib_mad, ib_cm, ib_addr, rdma_cm, plus a couple more in modules under development), and this is an attempt to reduce that. If having separate work queues would work better, there shouldn't be anything that prevents this.

- Sean
[PATCH] sis900 adm7001 PHY support
this patch is required to get a SIS964 based motherboard ethernet working (FSC D1875) (picking the #1 transceiver, instead of the last one, in case no known ones were found might be a better default, and would have worked in this case too) Signed-off-by: Artur Skawina <[EMAIL PROTECTED]> --- v2.6.16/drivers/net/sis900.c2006-03-21 21:14:37.0 +0100 +++ v2.6.16-dtnode/drivers/net/sis900.c 2006-03-21 02:53:54.0 +0100 @@ -128,6 +128,7 @@ static struct mii_chip_info { { "SiS 900 Internal MII PHY", 0x001d, 0x8000, LAN }, { "SiS 7014 Physical Layer Solution", 0x0016, 0xf830, LAN }, { "Altimata AC101LF PHY", 0x0022, 0x5520, LAN }, + { "ADM 7001 LAN PHY", 0x002e, 0xcc60, LAN }, { "AMD 79C901 10BASE-T PHY",0x, 0x6B70, LAN }, { "AMD 79C901 HomePNA PHY", 0x, 0x6B90, HOME}, { "ICS LAN PHY",0x0015, 0xF440, LAN },
Re: [PATCH] smc91x: allow for dynamic bus access configs
Nicolas Pitre wrote: All accessor's different methods are now selected with C code and unused ones statically optimized away at compile time instead of being selected with #if's and #ifdef's. This has many advantages such as allowing the compiler to validate the syntax of the whole code, making it cleaner and easier to understand, and ultimately allowing people to define configuration symbols in terms of variables if they really want to dynamically support multiple bus configurations at the same time (with the unavoidable performance cost). Signed-off-by: Nicolas Pitre <[EMAIL PROTECTED]> applied
Re: [Patch] mv643xx_eth: Cache align skb->data if CONFIG_NOT_COHERENT_CACHE
Dale Farnsworth wrote: From: Dale Farnsworth <[EMAIL PROTECTED]> When I/O is non-cache-coherent, we need to ensure that the I/O buffers we use don't share cache lines with other data. Signed-off-by: Dale Farnsworth <[EMAIL PROTECTED]> applied
Re: [PATCH 1/9] skge: use NAPI for tx cleanup.
Stephen Hemminger wrote: Cleanup transmit buffers using NAPI. This allows the transmit routine to leave interrupts enabled, and that improves performance. Signed-off-by: Stephen Hemminger <[EMAIL PROTECTED]> applied 1-9
Re: [PATCH 1/9] sky2: remove support for untested Yukon EC/rev 0
Stephen Hemminger wrote: The Yukon EC/rev0 (A1) chipset requires a bunch of workarounds. I copied these from sk98lin, but they never got tested and add more cruft to the code; any attempt at using the driver as is on this version will probably fail. It looks like this was an early engineering sample chip revision. If it ever shows up on a real system, produce an error message. Signed-off-by: Stephen Hemminger <[EMAIL PROTECTED]> applied 1-9
Re: [PATCH 4/6 v2] IB: address translation to map IP toIB addresses (GIDs)
> +struct workqueue_struct *rdma_wq; > +EXPORT_SYMBOL(rdma_wq); Sean, I don't think I saw an answer when I asked you this before. Why is ib_addr exporting a workqueue? Is there some sort of ordering constraint that is forcing other modules to go through the same workqueue for things? This seems like a very fragile internal thing to be exposing, and I'm wondering if there's a better way to handle it. - R. - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] wireless.git: update acxsm to 0.4.7
On Wed, Mar 01, 2006 at 03:58:14PM +0200, Denis Vlasenko wrote: > On Tuesday 28 February 2006 03:34, John W. Linville wrote: > > On Mon, Feb 27, 2006 at 11:44:38AM +0100, Carlos Martín wrote: > > > On Monday 27 February 2006 11:20, Denis Vlasenko wrote: > > > > > Comments are welcome and I'll split the patch if needed. > > > > Denis are you applying this patch to your tree? If so, I'll rely on > > you to push it to me when you are ready. > > > > If not, then I will need Carlos to generate the diffs so that they > > can be applied to the top of the tree with -p1. > > > > http://linux.yyz.us/patch-format.html > > Changelog: Merged to softmac branch of wireless-2.6...thanks! John -- John W. Linville [EMAIL PROTECTED]
Re: Please pull bcm43xx softmac-upstream and dscape-upstream branches
On Sun, Mar 05, 2006 at 09:47:55PM +0100, Michael Buesch wrote: > Please pull branches "softmac-upstream" and "dscape-upstream" > from my repository at: > git://bu3sch.de/wireless-2.6.git Merged to softmac and dscape branches of wireless-2.6...thanks! John -- John W. Linville [EMAIL PROTECTED]
Re: [PATCH RESEND] RT2x00 update: trivial fixes
On Tue, Feb 28, 2006 at 08:46:54PM +0100, Ivo van Doorn wrote: > ieee80211_rx has been renamed __ieee80211_rx. > Use DRV_NAME as much as possible instead of a seperate name string. > Add new USB device ID. Merged to dscape branch of wireless-2.6...thanks! John -- John W. Linville [EMAIL PROTECTED]
Re: [PATCH wireless-2.6 0/2] d80211: Devicescape 802.11 update
On Fri, Mar 03, 2006 at 06:54:23PM -0800, Jouni Malinen wrote: > Here's couple of patches to the Devicescape 802.11 implementation. > Please consider applying to the dscape branch of wireless-2.6 tree. Merged to dscape branch of wireless-2.6...thanks! John -- John W. Linville [EMAIL PROTECTED]
[PATCH 6/9] skge: use mmiowb
Add mmio barriers at the appropriate places, don't have a platform that needs them, but this is where the documentation of the patch says to add them.

Signed-off-by: Stephen Hemminger <[EMAIL PROTECTED]>

--- skge-2.6.orig/drivers/net/skge.c
+++ skge-2.6/drivers/net/skge.c
@@ -2394,9 +2394,11 @@ static int skge_xmit_frame(struct sk_buf
 		netif_stop_queue(dev);
 	}
 
-	dev->trans_start = jiffies;
+	mmiowb();
 	spin_unlock(&skge->tx_lock);
 
+	dev->trans_start = jiffies;
+
 	return NETDEV_TX_OK;
 }
 
@@ -2730,6 +2732,8 @@ static int skge_poll(struct net_device *
 		return 1; /* not done */
 
 	netif_rx_complete(dev);
+	mmiowb();
+
 	hw->intr_mask |= skge->port == 0 ?
 		(IS_R1_F|IS_XA1_F) : (IS_R2_F|IS_XA2_F);
 	skge_write32(hw, B0_IMSK, hw->intr_mask);
[PATCH 5/9] skge: use kcalloc
Use kcalloc when allocating ring data structure.

Signed-off-by: Stephen Hemminger <[EMAIL PROTECTED]>

--- skge-2.6.orig/drivers/net/skge.c
+++ skge-2.6/drivers/net/skge.c
@@ -733,13 +733,12 @@ static int skge_ring_alloc(struct skge_r
 	struct skge_element *e;
 	int i;
 
-	ring->start = kmalloc(sizeof(*e)*ring->count, GFP_KERNEL);
+	ring->start = kcalloc(sizeof(*e), ring->count, GFP_KERNEL);
 	if (!ring->start)
 		return -ENOMEM;
 
 	for (i = 0, e = ring->start, d = vaddr; i < ring->count; i++, e++, d++) {
 		e->desc = d;
-		e->skb = NULL;
 		if (i == ring->count - 1) {
 			e->next = ring->start;
 			d->next_offset = base;
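For readers following along outside the kernel, here is a minimal userspace sketch of the same idea, using the standard calloc() in place of kcalloc(). It is an illustration only, not driver code: struct ring_element and ring_alloc() are hypothetical names. The zeroing allocator makes the explicit "e->skb = NULL" unnecessary and also guards the count * size multiplication against overflow. One thing worth noting: kcalloc(), like calloc(), takes the element count first and the element size second; the patch passes them the other way round, which gives the same product.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct skge_element; field names are illustrative. */
struct ring_element {
	struct ring_element *next;
	void *skb;
};

/*
 * Allocate `count` zeroed elements and chain them into a ring.
 * Returns NULL if count is 0 or the allocation fails.
 */
struct ring_element *ring_alloc(size_t count)
{
	struct ring_element *start;
	size_t i;

	if (count == 0)
		return NULL;

	/* calloc(nmemb, size): element count first, element size second. */
	start = calloc(count, sizeof(*start));
	if (!start)
		return NULL;

	/* skb needs no explicit reset: calloc returned zeroed memory. */
	for (i = 0; i < count; i++)
		start[i].next = &start[(i + 1) % count];

	return start;
}
```

The caller owns the returned array and releases it with free().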
[PATCH 2/9] skge: use auto masking of irqs
Improve performance of skge driver by not touching irq mask register as much. Since the interrupt source auto-masks, the driver can just leave it disabled until the end of the soft irq. Signed-off-by: Stephen Hemminger <[EMAIL PROTECTED]> --- skge-2.6.orig/drivers/net/skge.c +++ skge-2.6/drivers/net/skge.c @@ -104,7 +104,6 @@ static const int txqaddr[] = { Q_XA1, Q_ static const int rxqaddr[] = { Q_R1, Q_R2 }; static const u32 rxirqmask[] = { IS_R1_F, IS_R2_F }; static const u32 txirqmask[] = { IS_XA1_F, IS_XA2_F }; -static const u32 portirqmask[] = { IS_PORT_1, IS_PORT_2 }; static int skge_get_regs_len(struct net_device *dev) { @@ -2184,12 +2183,6 @@ static int skge_up(struct net_device *de skge->tx_avail = skge->tx_ring.count - 1; - /* Enable IRQ from port */ - spin_lock_irq(&hw->hw_lock); - hw->intr_mask |= portirqmask[port]; - skge_write32(hw, B0_IMSK, hw->intr_mask); - spin_unlock_irq(&hw->hw_lock); - /* Initialize MAC */ spin_lock_bh(&hw->phy_lock); if (hw->chip_id == CHIP_ID_GENESIS) @@ -2246,11 +2239,6 @@ static int skge_down(struct net_device * else yukon_stop(skge); - spin_lock_irq(&hw->hw_lock); - hw->intr_mask &= ~portirqmask[skge->port]; - skge_write32(hw, B0_IMSK, hw->intr_mask); - spin_unlock_irq(&hw->hw_lock); - /* Stop transmitter */ skge_write8(hw, Q_ADDR(txqaddr[port], Q_CSR), CSR_STOP); skge_write32(hw, RB_ADDR(txqaddr[port], RB_CTRL), @@ -2734,11 +2722,9 @@ static int skge_poll(struct net_device * if (work_done >= to_do) return 1; /* not done */ - spin_lock_irq(&hw->hw_lock); - __netif_rx_complete(dev); - hw->intr_mask |= portirqmask[skge->port]; + netif_rx_complete(dev); + hw->intr_mask |= skge->port == 0 ? 
(IS_R1_F|IS_XA1_F) : (IS_R2_F|IS_XA2_F); skge_write32(hw, B0_IMSK, hw->intr_mask); - spin_unlock_irq(&hw->hw_lock); return 0; } @@ -2850,12 +2836,11 @@ static void skge_extirq(unsigned long da int port; spin_lock(&hw->phy_lock); - for (port = 0; port < 2; port++) { + for (port = 0; port < hw->ports; port++) { struct net_device *dev = hw->dev[port]; + struct skge_port *skge = netdev_priv(dev); - if (dev && netif_running(dev)) { - struct skge_port *skge = netdev_priv(dev); - + if (netif_running(dev)) { if (hw->chip_id != CHIP_ID_GENESIS) yukon_phy_intr(skge); else @@ -2864,21 +2849,25 @@ static void skge_extirq(unsigned long da } spin_unlock(&hw->phy_lock); - spin_lock_irq(&hw->hw_lock); hw->intr_mask |= IS_EXT_REG; skge_write32(hw, B0_IMSK, hw->intr_mask); - spin_unlock_irq(&hw->hw_lock); } static irqreturn_t skge_intr(int irq, void *dev_id, struct pt_regs *regs) { struct skge_hw *hw = dev_id; - u32 status = skge_read32(hw, B0_SP_ISRC); + u32 status; - if (status == 0 || status == ~0) /* hotplug or shared irq */ + /* Reading this register masks IRQ */ + status = skge_read32(hw, B0_SP_ISRC); + if (status == 0) return IRQ_NONE; - spin_lock(&hw->hw_lock); + if (status & IS_EXT_REG) { + hw->intr_mask &= ~IS_EXT_REG; + tasklet_schedule(&hw->ext_tasklet); + } + if (status & (IS_R1_F|IS_XA1_F)) { skge_write8(hw, Q_ADDR(Q_R1, Q_CSR), CSR_IRQ_CL_F); hw->intr_mask &= ~(IS_R1_F|IS_XA1_F); @@ -2891,6 +2880,9 @@ static irqreturn_t skge_intr(int irq, vo netif_rx_schedule(hw->dev[1]); } + if (likely((status & hw->intr_mask) == 0)) + return IRQ_HANDLED; + if (status & IS_PA_TO_RX1) { struct skge_port *skge = netdev_priv(hw->dev[0]); ++skge->net_stats.rx_over_errors; @@ -2918,13 +2910,7 @@ static irqreturn_t skge_intr(int irq, vo if (status & IS_HW_ERR) skge_error_irq(hw); - if (status & IS_EXT_REG) { - hw->intr_mask &= ~IS_EXT_REG; - tasklet_schedule(&hw->ext_tasklet); - } - skge_write32(hw, B0_IMSK, hw->intr_mask); - spin_unlock(&hw->hw_lock); return IRQ_HANDLED; } @@ -3070,7 
+3056,10 @@ static int skge_reset(struct skge_hw *hw else hw->ram_size = t8 * 4096; - hw->intr_mask = IS_HW_ERR | IS_EXT_REG; + hw->intr_mask = IS_HW_ERR | IS_EXT_REG | IS_PORT_1; + if (hw->ports > 1) + hw->intr_mask |= IS_PORT_2; + if (hw->chip_id == CHIP_ID_GENESIS) genesis_init(hw); else { @@ -3293,7 +3282,6 @@ static int __devinit skge_probe(struct p hw->pdev = pdev; spin_lock_init(&hw->phy_lock); - spin_lock_init(&hw->hw_l
[PATCH 7/9] skge: formmating and whitespace cleanup
Reformat some code to make it easier to read. And whitespace fixes. Signed-off-by: Stephen Hemminger <[EMAIL PROTECTED]> --- skge-2.6.orig/drivers/net/skge.c +++ skge-2.6/drivers/net/skge.c @@ -2177,15 +2177,17 @@ static int skge_up(struct net_device *de memset(skge->mem, 0, skge->mem_size); - if ((err = skge_ring_alloc(&skge->rx_ring, skge->mem, skge->dma))) + err = skge_ring_alloc(&skge->rx_ring, skge->mem, skge->dma); + if (err) goto free_pci_mem; err = skge_rx_fill(skge); if (err) goto free_rx_ring; - if ((err = skge_ring_alloc(&skge->tx_ring, skge->mem + rx_size, - skge->dma + rx_size))) + err = skge_ring_alloc(&skge->tx_ring, skge->mem + rx_size, + skge->dma + rx_size); + if (err) goto free_rx_ring; skge->tx_avail = skge->tx_ring.count - 1; @@ -2308,9 +2310,9 @@ static int skge_xmit_frame(struct sk_buf return NETDEV_TX_OK; if (!spin_trylock(&skge->tx_lock)) { - /* Collision - tell upper layer to requeue */ - return NETDEV_TX_LOCKED; - } + /* Collision - tell upper layer to requeue */ + return NETDEV_TX_LOCKED; + } if (unlikely(skge->tx_avail < skb_shinfo(skb)->nr_frags +1)) { if (!netif_queue_stopped(dev)) { @@ -2709,8 +2711,8 @@ static int skge_poll(struct net_device * if (control & BMU_OWN) break; - skb = skge_rx_get(skge, e, control, rd->status, - le16_to_cpu(rd->csum2)); + skb = skge_rx_get(skge, e, control, rd->status, + le16_to_cpu(rd->csum2)); if (likely(skb)) { dev->last_rx = jiffies; netif_receive_skb(skb); @@ -3240,13 +3242,15 @@ static int __devinit skge_probe(struct p struct skge_hw *hw; int err, using_dac = 0; - if ((err = pci_enable_device(pdev))) { + err = pci_enable_device(pdev); + if (err) { printk(KERN_ERR PFX "%s cannot enable PCI device\n", pci_name(pdev)); goto err_out; } - if ((err = pci_request_regions(pdev, DRV_NAME))) { + err = pci_request_regions(pdev, DRV_NAME); + if (err) { printk(KERN_ERR PFX "%s cannot obtain PCI resources\n", pci_name(pdev)); goto err_out_disable_pdev; @@ -3298,7 +3302,8 @@ static int __devinit skge_probe(struct 
p goto err_out_free_hw; } - if ((err = request_irq(pdev->irq, skge_intr, SA_SHIRQ, DRV_NAME, hw))) { + err = request_irq(pdev->irq, skge_intr, SA_SHIRQ, DRV_NAME, hw); + if (err) { printk(KERN_ERR PFX "%s: cannot assign irq %d\n", pci_name(pdev), pdev->irq); goto err_out_iounmap; @@ -3316,7 +3321,8 @@ static int __devinit skge_probe(struct p if ((dev = skge_devinit(hw, 0, using_dac)) == NULL) goto err_out_led_off; - if ((err = register_netdev(dev))) { + err = register_netdev(dev); + if (err) { printk(KERN_ERR PFX "%s: cannot register net device\n", pci_name(pdev)); goto err_out_free_netdev; -- - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 3/9] skge: check the allocation of ring buffer
The SysKonnect Genesis and Yukon chip sets have restrictions on the possible control block area. The memory needs to not cross 4 Gig boundary, and it needs to be 8 byte aligned. This patch checks and fails to bring the device up if region is unacceptable. Signed-off-by: Stephen Hemminger <[EMAIL PROTECTED]> --- skge-2.6.orig/drivers/net/skge.c +++ skge-2.6/drivers/net/skge.c @@ -727,7 +727,7 @@ static struct ethtool_ops skge_ethtool_o * Allocate ring elements and chain them together * One-to-one association of board descriptors with ring elements */ -static int skge_ring_alloc(struct skge_ring *ring, void *vaddr, u64 base) +static int skge_ring_alloc(struct skge_ring *ring, void *vaddr, u32 base) { struct skge_tx_desc *d; struct skge_element *e; @@ -2168,6 +2168,14 @@ static int skge_up(struct net_device *de if (!skge->mem) return -ENOMEM; + BUG_ON(skge->dma & 7); + + if ((u64)skge->dma >> 32 != ((u64) skge->dma + skge->mem_size) >> 32) { + printk(KERN_ERR PFX "pci_alloc_consistent region crosses 4G boundary\n"); + err = -EINVAL; + goto free_pci_mem; + } + memset(skge->mem, 0, skge->mem_size); if ((err = skge_ring_alloc(&skge->rx_ring, skge->mem, skge->dma))) -- - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
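The two constraints described above can be sketched as small standalone helpers. This is a userspace illustration with hypothetical function names, not the driver's code. One deliberate difference, stated as an assumption: the sketch compares the first byte against the last byte of the region (addr + size - 1), whereas the patch compares against addr + mem_size, which is one past the end, so a region ending exactly on a 4G boundary would be refused by the patch but accepted here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True if addr is a multiple of 8 (the low three bits are clear). */
bool dma_is_8byte_aligned(uint64_t addr)
{
	return (addr & 7) == 0;
}

/*
 * True if the region [addr, addr + size) spans a 4 GiB boundary,
 * i.e. its first and last bytes differ in the upper 32 address bits.
 * Assumes size > 0 and that addr + size - 1 does not overflow.
 */
bool dma_crosses_4g(uint64_t addr, uint64_t size)
{
	uint64_t last = addr + size - 1;

	return (addr >> 32) != (last >> 32);
}
```

A caller would refuse to bring the device up when dma_crosses_4g() returns true, which is what the patch does with -EINVAL.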
[PATCH 0/9] skge version 1.4
Update to skge driver for 2.6.17. The main change is eliminating some lock roundtrips and io accesses which will improve performance and SMP stability.
[PATCH 8/9] skge: handle pci errors better
When a PCI error occurs, try and report more info. Signed-off-by: Stephen Hemminger <[EMAIL PROTECTED]> --- skge-2.6.orig/drivers/net/skge.c +++ skge-2.6/drivers/net/skge.c @@ -2764,17 +2764,6 @@ static void skge_mac_parity(struct skge_ ? GMF_CLI_TX_FC : GMF_CLI_TX_PE); } -static void skge_pci_clear(struct skge_hw *hw) -{ - u16 status; - - pci_read_config_word(hw->pdev, PCI_STATUS, &status); - skge_write8(hw, B2_TST_CTRL1, TST_CFG_WRITE_ON); - pci_write_config_word(hw->pdev, PCI_STATUS, - status | PCI_STATUS_ERROR_BITS); - skge_write8(hw, B2_TST_CTRL1, TST_CFG_WRITE_OFF); -} - static void skge_mac_intr(struct skge_hw *hw, int port) { if (hw->chip_id == CHIP_ID_GENESIS) @@ -2816,23 +2805,39 @@ static void skge_error_irq(struct skge_h if (hwstatus & IS_M2_PAR_ERR) skge_mac_parity(hw, 1); - if (hwstatus & IS_R1_PAR_ERR) + if (hwstatus & IS_R1_PAR_ERR) { + printk(KERN_ERR PFX "%s: receive queue parity error\n", + hw->dev[0]->name); skge_write32(hw, B0_R1_CSR, CSR_IRQ_CL_P); + } - if (hwstatus & IS_R2_PAR_ERR) + if (hwstatus & IS_R2_PAR_ERR) { + printk(KERN_ERR PFX "%s: receive queue parity error\n", + hw->dev[1]->name); skge_write32(hw, B0_R2_CSR, CSR_IRQ_CL_P); + } if (hwstatus & (IS_IRQ_MST_ERR|IS_IRQ_STAT)) { - printk(KERN_ERR PFX "hardware error detected (status 0x%x)\n", - hwstatus); + u16 pci_status, pci_cmd; + + pci_read_config_word(hw->pdev, PCI_COMMAND, &pci_cmd); + pci_read_config_word(hw->pdev, PCI_STATUS, &pci_status); - skge_pci_clear(hw); + printk(KERN_ERR PFX "%s: PCI error cmd=%#x status=%#x\n", + pci_name(hw->pdev), pci_cmd, pci_status); + + /* Write the error bits back to clear them. 
*/ + pci_status &= PCI_STATUS_ERROR_BITS; + skge_write8(hw, B2_TST_CTRL1, TST_CFG_WRITE_ON); + pci_write_config_word(hw->pdev, PCI_COMMAND, + pci_cmd | PCI_COMMAND_SERR | PCI_COMMAND_PARITY); + pci_write_config_word(hw->pdev, PCI_STATUS, pci_status); + skge_write8(hw, B2_TST_CTRL1, TST_CFG_WRITE_OFF); /* if error still set then just ignore it */ hwstatus = skge_read32(hw, B0_HWE_ISRC); if (hwstatus & IS_IRQ_STAT) { - pr_debug("IRQ status %x: still set ignoring hardware errors\n", - hwstatus); + printk(KERN_INFO PFX "unable to clear error (so ignoring them)\n"); hw->intr_mask &= ~IS_HW_ERR; } } @@ -2998,7 +3003,7 @@ static const char *skge_board_name(const static int skge_reset(struct skge_hw *hw) { u32 reg; - u16 ctst; + u16 ctst, pci_status; u8 t8, mac_cfg, pmd_type, phy_type; int i; @@ -3009,8 +3014,13 @@ static int skge_reset(struct skge_hw *hw skge_write8(hw, B0_CTST, CS_RST_CLR); /* clear PCI errors, if any */ - skge_pci_clear(hw); + skge_write8(hw, B2_TST_CTRL1, TST_CFG_WRITE_ON); + skge_write8(hw, B2_TST_CTRL2, 0); + pci_read_config_word(hw->pdev, PCI_STATUS, &pci_status); + pci_write_config_word(hw->pdev, PCI_STATUS, + pci_status | PCI_STATUS_ERROR_BITS); + skge_write8(hw, B2_TST_CTRL1, TST_CFG_WRITE_OFF); skge_write8(hw, B0_CTST, CS_MRST_CLR); /* restore CLK_RUN bits (for Yukon-Lite) */ @@ -3377,7 +3387,6 @@ static void __devexit skge_remove(struct skge_write32(hw, B0_IMSK, 0); skge_write16(hw, B0_LED, LED_STAT_OFF); - skge_pci_clear(hw); skge_write8(hw, B0_CTST, CS_RST_SET); tasklet_kill(&hw->ext_tasklet); -- - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 4/9] skge: dma configuration cleanup
Cleanup of the part of the code that sets up DMA configuration. Should cause no real change in operation, just clearer. Signed-off-by: Stephen Hemminger <[EMAIL PROTECTED]> --- skge-2.6.orig/drivers/net/skge.c +++ skge-2.6/drivers/net/skge.c @@ -3251,22 +3251,18 @@ static int __devinit skge_probe(struct p pci_set_master(pdev); - if (sizeof(dma_addr_t) > sizeof(u32) && - !(err = pci_set_dma_mask(pdev, DMA_64BIT_MASK))) { + if (!pci_set_dma_mask(pdev, DMA_64BIT_MASK)) { using_dac = 1; err = pci_set_consistent_dma_mask(pdev, DMA_64BIT_MASK); - if (err < 0) { - printk(KERN_ERR PFX "%s unable to obtain 64 bit DMA " - "for consistent allocations\n", pci_name(pdev)); - goto err_out_free_regions; - } - } else { - err = pci_set_dma_mask(pdev, DMA_32BIT_MASK); - if (err) { - printk(KERN_ERR PFX "%s no usable DMA configuration\n", - pci_name(pdev)); - goto err_out_free_regions; - } + } else if (!(err = pci_set_dma_mask(pdev, DMA_32BIT_MASK))) { + using_dac = 0; + err = pci_set_consistent_dma_mask(pdev, DMA_32BIT_MASK); + } + + if (err) { + printk(KERN_ERR PFX "%s no usable DMA configuration\n", + pci_name(pdev)); + goto err_out_free_regions; } #ifdef __BIG_ENDIAN -- - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 1/9] skge: use NAPI for tx cleanup.
Cleanup transmit buffers using NAPI. This allows the transmit routine to leave interrupts enabled, and that improves performance. Signed-off-by: Stephen Hemminger <[EMAIL PROTECTED]> --- skge-2.6.orig/drivers/net/skge.c +++ skge-2.6/drivers/net/skge.c @@ -2307,16 +2307,13 @@ static int skge_xmit_frame(struct sk_buf int i; u32 control, len; u64 map; - unsigned long flags; skb = skb_padto(skb, ETH_ZLEN); if (!skb) return NETDEV_TX_OK; - local_irq_save(flags); if (!spin_trylock(&skge->tx_lock)) { /* Collision - tell upper layer to requeue */ - local_irq_restore(flags); return NETDEV_TX_LOCKED; } @@ -2327,7 +2324,7 @@ static int skge_xmit_frame(struct sk_buf printk(KERN_WARNING PFX "%s: ring full when queue awake!\n", dev->name); } - spin_unlock_irqrestore(&skge->tx_lock, flags); + spin_unlock(&skge->tx_lock); return NETDEV_TX_BUSY; } @@ -2403,7 +2400,7 @@ static int skge_xmit_frame(struct sk_buf } dev->trans_start = jiffies; - spin_unlock_irqrestore(&skge->tx_lock, flags); + spin_unlock(&skge->tx_lock); return NETDEV_TX_OK; } @@ -2416,7 +2413,7 @@ static inline void skge_tx_free(struct s pci_unmap_addr(e, mapaddr), pci_unmap_len(e, maplen), PCI_DMA_TODEVICE); - dev_kfree_skb_any(e->skb); + dev_kfree_skb(e->skb); e->skb = NULL; } else { pci_unmap_page(hw->pdev, @@ -2430,15 +2427,14 @@ static void skge_tx_clean(struct skge_po { struct skge_ring *ring = &skge->tx_ring; struct skge_element *e; - unsigned long flags; - spin_lock_irqsave(&skge->tx_lock, flags); + spin_lock_bh(&skge->tx_lock); for (e = ring->to_clean; e != ring->to_use; e = e->next) { ++skge->tx_avail; skge_tx_free(skge->hw, e); } ring->to_clean = e; - spin_unlock_irqrestore(&skge->tx_lock, flags); + spin_unlock_bh(&skge->tx_lock); } static void skge_tx_timeout(struct net_device *dev) @@ -2663,6 +2659,37 @@ resubmit: return NULL; } +static void skge_tx_done(struct skge_port *skge) +{ + struct skge_ring *ring = &skge->tx_ring; + struct skge_element *e; + + spin_lock(&skge->tx_lock); + for (e = ring->to_clean; 
prefetch(e->next), e != ring->to_use; e = e->next) { + struct skge_tx_desc *td = e->desc; + u32 control; + + rmb(); + control = td->control; + if (control & BMU_OWN) + break; + + if (unlikely(netif_msg_tx_done(skge))) + printk(KERN_DEBUG PFX "%s: tx done slot %td status 0x%x\n", + skge->netdev->name, e - ring->start, td->status); + + skge_tx_free(skge->hw, e); + e->skb = NULL; + ++skge->tx_avail; + } + ring->to_clean = e; + skge_write8(skge->hw, Q_ADDR(txqaddr[skge->port], Q_CSR), CSR_IRQ_CL_F); + + if (skge->tx_avail > MAX_SKB_FRAGS + 1) + netif_wake_queue(skge->netdev); + + spin_unlock(&skge->tx_lock); +} static int skge_poll(struct net_device *dev, int *budget) { @@ -2670,8 +2697,10 @@ static int skge_poll(struct net_device * struct skge_hw *hw = skge->hw; struct skge_ring *ring = &skge->rx_ring; struct skge_element *e; - unsigned int to_do = min(dev->quota, *budget); - unsigned int work_done = 0; + int to_do = min(dev->quota, *budget); + int work_done = 0; + + skge_tx_done(skge); for (e = ring->to_clean; prefetch(e->next), work_done < to_do; e = e->next) { struct skge_rx_desc *rd = e->desc; @@ -2714,40 +2743,6 @@ static int skge_poll(struct net_device * return 0; } -static inline void skge_tx_intr(struct net_device *dev) -{ - struct skge_port *skge = netdev_priv(dev); - struct skge_hw *hw = skge->hw; - struct skge_ring *ring = &skge->tx_ring; - struct skge_element *e; - - spin_lock(&skge->tx_lock); - for (e = ring->to_clean; prefetch(e->next), e != ring->to_use; e = e->next) { - struct skge_tx_desc *td = e->desc; - u32 control; - - rmb(); - control = td->control; - if (control & BMU_OWN) - break; - - if (unlikely(netif_msg_tx_done(skge))) - printk(KERN_DEBUG PFX "%s: tx done slot %td status 0x%x\n", - dev->name, e - ring->start, td->status); - - skge_tx_free(hw, e); - e->skb = NULL; -
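The reclaim loop in skge_tx_done() can be illustrated with a small standalone model. This is a sketch under stated assumptions, not the driver: struct tx_slot, struct tx_ring and the three functions are hypothetical, the owned_by_hw flag stands in for the BMU_OWN bit in the descriptor control word, and the locking, DMA unmapping and queue wake-up that the real code performs are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define RING_SIZE 4

/* Illustrative descriptor; owned_by_hw stands in for the BMU_OWN control bit. */
struct tx_slot {
	bool owned_by_hw;
	struct tx_slot *next;
};

struct tx_ring {
	struct tx_slot slots[RING_SIZE];
	struct tx_slot *to_clean;	/* oldest slot not yet reclaimed */
	struct tx_slot *to_use;		/* next slot the transmit path fills */
	int avail;			/* slots the transmit path may still take */
};

void tx_ring_init(struct tx_ring *r)
{
	size_t i;

	for (i = 0; i < RING_SIZE; i++) {
		r->slots[i].owned_by_hw = false;
		r->slots[i].next = &r->slots[(i + 1) % RING_SIZE];
	}
	r->to_clean = r->to_use = &r->slots[0];
	/* One slot stays unused so to_clean == to_use always means "empty". */
	r->avail = RING_SIZE - 1;
}

/* Transmit path: hand one slot to the hardware. Returns false if the ring is full. */
bool tx_ring_queue(struct tx_ring *r)
{
	if (r->avail == 0)
		return false;
	r->to_use->owned_by_hw = true;
	r->to_use = r->to_use->next;
	r->avail--;
	return true;
}

/*
 * Poll path: reclaim slots the hardware has finished with, in order,
 * stopping at the first one it still owns. Returns the number reclaimed.
 */
int tx_ring_reclaim(struct tx_ring *r)
{
	struct tx_slot *e;
	int done = 0;

	for (e = r->to_clean; e != r->to_use; e = e->next) {
		if (e->owned_by_hw)
			break;
		r->avail++;
		done++;
	}
	r->to_clean = e;
	return done;
}
```

In this model the "hardware" completes a transmit by clearing owned_by_hw on a slot; calling tx_ring_reclaim() from the poll routine, as the patch does with skge_tx_done(), then returns those slots to the transmit path.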
[PATCH 9/9] skge: version 1.4
Update version number

Signed-off-by: Stephen Hemminger <[EMAIL PROTECTED]>

--- skge-2.6.orig/drivers/net/skge.c
+++ skge-2.6/drivers/net/skge.c
@@ -44,7 +44,7 @@
 #include "skge.h"
 
 #define DRV_NAME		"skge"
-#define DRV_VERSION		"1.3"
+#define DRV_VERSION		"1.4"
 #define PFX			DRV_NAME " "
 
 #define DEFAULT_TX_RING_SIZE	128
[Patch] mv643xx_eth: Cache align skb->data if CONFIG_NOT_COHERENT_CACHE
From: Dale Farnsworth <[EMAIL PROTECTED]>

When I/O is non-cache-coherent, we need to ensure that the I/O buffers
we use don't share cache lines with other data.

Signed-off-by: Dale Farnsworth <[EMAIL PROTECTED]>

---

This patch fixes red zone error messages that appear when
CONFIG_SLAB_DEBUG=y.

 drivers/net/mv643xx_eth.h | 18 ++
 1 file changed, 14 insertions(+), 4 deletions(-)

Index: linux-2.6-mv643xx_enet/drivers/net/mv643xx_eth.h
===
--- linux-2.6-mv643xx_enet.orig/drivers/net/mv643xx_eth.h
+++ linux-2.6-mv643xx_enet/drivers/net/mv643xx_eth.h
@@ -42,13 +42,23 @@
 #define MAX_DESCS_PER_SKB 1
 #endif
 
+/*
+ * The MV643XX HW requires 8-byte alignment. However, when I/O
+ * is non-cache-coherent, we need to ensure that the I/O buffers
+ * we use don't share cache lines with other data.
+ */
+#if defined(CONFIG_DMA_NONCOHERENT) || defined(CONFIG_NOT_COHERENT_CACHE)
+#define ETH_DMA_ALIGN L1_CACHE_BYTES
+#else
+#define ETH_DMA_ALIGN 8
+#endif
+
 #define ETH_VLAN_HLEN 4
 #define ETH_FCS_LEN 4
-#define ETH_DMA_ALIGN 8 /* hw requires 8-byte alignment */
-#define ETH_HW_IP_ALIGN 2 /* hw aligns IP header */
+#define ETH_HW_IP_ALIGN 2 /* hw aligns IP header */
 #define ETH_WRAPPER_LEN (ETH_HW_IP_ALIGN + ETH_HLEN + \
- ETH_VLAN_HLEN + ETH_FCS_LEN)
-#define ETH_RX_SKB_SIZE ((dev->mtu + ETH_WRAPPER_LEN + 7) & ~0x7)
+ ETH_VLAN_HLEN + ETH_FCS_LEN)
+#define ETH_RX_SKB_SIZE (dev->mtu + ETH_WRAPPER_LEN + ETH_DMA_ALIGN)
 
 #define ETH_RX_QUEUES_ENABLED (1 << 0) /* use only Q0 for receive */
 #define ETH_TX_QUEUES_ENABLED (1 << 0) /* use only Q0 for transmit */
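The sizing change in this patch can be illustrated with a small user-space sketch. It compares the old ETH_RX_SKB_SIZE formula (round up to a multiple of 8) with the new one (add a full ETH_DMA_ALIGN of slack), and shows how many bytes a buffer start would have to be advanced to reach the next aligned boundary. The names DEMO_CACHE_LINE, align_gap, rx_size_old and rx_size_new are hypothetical, the cache line size of 32 is only an assumed stand-in for L1_CACHE_BYTES, and the idea that the extra slack is consumed by advancing the data pointer is an assumption, since the driver .c changes are not part of this message.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed stand-in for L1_CACHE_BYTES; the real value is per-architecture. */
#define DEMO_CACHE_LINE 32u

/* Bytes to skip so that addr becomes a multiple of align (a power of two). */
size_t align_gap(uintptr_t addr, size_t align)
{
    return (size_t)((align - (addr & (align - 1))) & (align - 1));
}

/* Old formula from the patch: round mtu + wrapper up to a multiple of 8. */
size_t rx_size_old(size_t mtu, size_t wrapper)
{
    return (mtu + wrapper + 7) & ~(size_t)7;
}

/* New formula from the patch: reserve one full alignment unit of slack. */
size_t rx_size_new(size_t mtu, size_t wrapper, size_t align)
{
    return mtu + wrapper + align;
}
```

With an MTU of 1500 and a wrapper length of 24 (2 + 14 + 4 + 4 from the macros in the patch), the old formula gives 1528 bytes and the new one gives 1556 bytes for a 32-byte alignment. The slack is always at least as large as the worst-case gap of align - 1 bytes.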
MPLS extension for pktgen
Steven Whitehouse writes:

 > I've been looking into MPLS recently and so one of the first things that
 > would be useful is a testbed to generate test traffic, and hence the
 > attached patch to pktgen.
 >
 > If you have a moment to look over it, then please let me know if you
 > would give it your blessing. The patch is against davem's current
 > net-2.6.17 tree,

Nice. Well, I never thought about mpls but it seems possible too.

With mpls enabled it seems to send something my tcpdump does not
understand, so I trust you there. And it does not seem to break standard
ipv4 sending, so it should be OK.

But I'd guess the mpls result code is not what you expected...

echo "mpls 0001000a,0002000a,0000000a" > /proc/net/pktgen/eth1
cat /proc/net/pktgen/eth1 | grep Res
Result: 0000000a

	sprintf(pg_result, "OK: mpls=");
	for(n = 0; n < pkt_dev->nr_labels; n++)
		sprintf(pg_result, "%08x%s", ntohl(pkt_dev->labels[n]),
			n == pkt_dev->nr_labels-1 ? "" : ",");

Cheers.
					--ro
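The result string above ends up holding only the last label because every sprintf() call writes at the start of pg_result. One way to avoid that, sketched here in user space, is to track an offset and append. This is only an illustration under assumptions: format_mpls_result is a hypothetical helper, the labels are taken in host byte order (the kernel code converts with ntohl() first), and nothing here claims to be how the patch was eventually fixed.

```c
#include <stdio.h>
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: write "OK: mpls=<l0>,<l1>,..." into buf, appending
 * each label after the previous output instead of overwriting it.
 * labels[] are assumed to be in host byte order.
 * Returns the string length, or -1 if buf is too small.
 */
int format_mpls_result(char *buf, size_t size,
                       const uint32_t *labels, unsigned nr_labels)
{
    size_t off;
    int n = snprintf(buf, size, "OK: mpls=");

    if (n < 0 || (size_t)n >= size)
        return -1;
    off = (size_t)n;

    for (unsigned i = 0; i < nr_labels; i++) {
        /* Write at buf + off so earlier labels are preserved. */
        n = snprintf(buf + off, size - off, "%08x%s",
                     (unsigned)labels[i],
                     i == nr_labels - 1 ? "" : ",");
        if (n < 0 || (size_t)n >= size - off)
            return -1;
        off += (size_t)n;
    }
    return (int)off;
}
```

For the three labels used in the echo command above, this produces "OK: mpls=0001000a,0002000a,0000000a". In kernel style the same effect is often written as a pointer that is advanced by the return value of sprintf().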
Question about TCP behavior
Hello all,

I am trying to figure out what is causing a change in behavior of the TCP
stack on Linux.

I have a very simple test setup:

1) Windows machine running a test app to request data from the server
2) Linux (2.6.10 - yeah, I know... upgrade...) machine running test server
3) Gigabit ethernet between the two machines via a Cisco switch

The Windows machine sends a request asking the Linux machine to send a
block of data containing 502,132 bytes. The server on Linux makes a
single send() call with the entire buffer (this is to reduce the
user-to-kernel mode overhead of multiple calls).

If the Linux machine has just recently been booted, the transfer takes
around 8 or 9 milliseconds. If the Linux machine has been up for a while
(but still primarily idle), the transfer starts to take anywhere from 32
to 70 milliseconds. Both the Windows machine and the Linux machine are
for all practical purposes idle and dedicated to this test process.

It seems the Linux TCP stack is getting into a state where it decides to
slow down the pace of the transfer to the Windows machine?!?

When the transfer is fast, the time between frame sends is usually about
8 to 40 microseconds (with some variation). When the transfer is slow,
the time between frame sends starts off at a high 130 microseconds, then
tapers down to 1/2 and/or 1/4 of that in a pattern that looks too
consistent to be random. Here's the basic pattern:

(time between packet sends in microseconds)
130, 130, 130, 130, 130, 130, 130, 130, 68, 32, 32, 32, 32, 32

[ it's this pattern I'm hoping someone recognizes! :o) ]

After a group of packets is sent, the pattern starts again with a large
number, then tapers down again and again until the entire transfer is
done.

Questions going through my head:

1) Is some metric on the interface being used to determine the initial
   TCP transfer rate?
2) Is this some form of "slow start"? (doesn't sound like it to me, but
   who knows?). If so, can I verify that? And then turn it off (or avoid
   whatever is triggering it)??
3) What mechanism of TCP might account for such a pattern of behavior?

I have a dump of the delta times between packets for the fast and slow
case with some packet information (frame size, TCP flags, start of TCP
data). Not to take advantage of this mailing list, I've put the verbose
information on the following web page:

http://www.klos.com/~patrick/TCPQuestion.html

Thanks for looking!

Patrick

= For LAN/WAN Protocol Analysis, check out PacketView Pro! =
Patrick Klos                         Email: [EMAIL PROTECTED]
Network/Embedded Software Engineer   Web: http://www.klos.com/
Klos Technologies, Inc.              Phone: 603-471-2547
http://www.loving-long-island.com/
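On the "can I verify that?" part of question 2, one possible approach is to have the server read the sender-side congestion state from the connected socket with the Linux TCP_INFO socket option, before and after the send() call. The sketch below is only an illustration under assumptions: it assumes the kernel in use exposes TCP_INFO with the tcpi_snd_cwnd, tcpi_snd_ssthresh and tcpi_rtt fields, the names cwnd_sample, sample_cwnd and in_slow_start are hypothetical, and the "cwnd below ssthresh" test follows the usual textbook definition rather than any particular kernel version.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* make struct tcp_info visible in glibc */
#endif
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>

/* Hypothetical snapshot of the sender state relevant to slow start. */
struct cwnd_sample {
    unsigned int snd_cwnd;      /* congestion window, in segments */
    unsigned int snd_ssthresh;  /* slow-start threshold, in segments */
    unsigned int rtt_us;        /* smoothed round-trip time, microseconds */
};

/* Fill *out from the TCP socket fd. Returns 0 on success, -1 on error. */
int sample_cwnd(int fd, struct cwnd_sample *out)
{
    struct tcp_info info;
    socklen_t len = sizeof(info);

    memset(&info, 0, sizeof(info));
    if (getsockopt(fd, IPPROTO_TCP, TCP_INFO, &info, &len) != 0)
        return -1;

    out->snd_cwnd = info.tcpi_snd_cwnd;
    out->snd_ssthresh = info.tcpi_snd_ssthresh;
    out->rtt_us = info.tcpi_rtt;
    return 0;
}

/* Textbook test: the sender is in slow start while cwnd < ssthresh. */
int in_slow_start(const struct cwnd_sample *s)
{
    return s->snd_cwnd < s->snd_ssthresh;
}
```

If the slow transfers start with a much smaller congestion window than the fast ones, that would point toward congestion control state rather than the interface. The sketch does not by itself say what resets that state.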
MPLS extension for pktgen
Hi, I've been looking into MPLS recently and so one of the first things that would be useful is a testbed to generate test traffic, and hence the attached patch to pktgen. If you have a moment to look over it, then please let me know if you would give it your blessing. The patch is against davem's current net-2.6.17 tree, Steve. -- diff --git a/Documentation/networking/pktgen.txt b/Documentation/networking/pktgen.txt --- a/Documentation/networking/pktgen.txt +++ b/Documentation/networking/pktgen.txt @@ -109,6 +109,22 @@ Examples: cycle through the port range. pgset "udp_dst_max 9" set UDP destination port max. + pgset "mpls 0001000a,0002000a,0000000a" set MPLS labels (in this example + outer label=16,middle label=32, +inner label=0 (IPv4 NULL)) Note that +there must be no spaces between the +arguments. Leading zeros are required. +Do not set the bottom of stack bit, +thats done automatically. If you do +set the bottom of stack bit, that +indicates that you want to randomly +generate that address and the flag +MPLS_RND will be turned on. You +can have any mix of random and fixed +labels in the label stack. + + pgset "mpls 0" turn off mpls (or any invalid argument works too!) + pgset stop aborts injection. Also, ^C aborts generator. @@ -167,6 +183,8 @@ pkt_size min_pkt_size max_pkt_size +mpls + udp_src_min udp_src_max @@ -211,4 +229,4 @@ Grant Grundler for testing on IA-64 and Stephen Hemminger, Andi Kleen, Dave Miller and many others. -Good luck with the linux net-development. \ No newline at end of file +Good luck with the linux net-development. 
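The hex words in the pktgen.txt example follow the standard MPLS label stack entry layout: a 20-bit label, 3 experimental bits, 1 bottom-of-stack bit and an 8-bit TTL. A small user-space sketch of that packing may help when composing arguments for pgset. The helper names mpls_entry, mpls_label, mpls_ttl and mpls_is_bottom are hypothetical and not part of the patch, and values are shown in host byte order (on the wire the word is big-endian).

```c
#include <stdint.h>

#define DEMO_MPLS_LABEL_SHIFT 12            /* label occupies bits 31..12 */
#define DEMO_MPLS_EXP_SHIFT   9             /* experimental bits 11..9 */
#define DEMO_MPLS_BOS         0x00000100u   /* bottom-of-stack bit (bit 8) */

/* Pack one label stack entry; all inputs are masked to their field width. */
uint32_t mpls_entry(uint32_t label, unsigned exp, int bottom, unsigned ttl)
{
    return ((label & 0xfffffu) << DEMO_MPLS_LABEL_SHIFT) |
           ((uint32_t)(exp & 0x7u) << DEMO_MPLS_EXP_SHIFT) |
           (bottom ? DEMO_MPLS_BOS : 0u) |
           (uint32_t)(ttl & 0xffu);
}

/* Extract the 20-bit label from an entry. */
uint32_t mpls_label(uint32_t entry)
{
    return entry >> DEMO_MPLS_LABEL_SHIFT;
}

/* Extract the 8-bit TTL from an entry. */
unsigned mpls_ttl(uint32_t entry)
{
    return (unsigned)(entry & 0xffu);
}

/* Non-zero if the bottom-of-stack bit is set. */
int mpls_is_bottom(uint32_t entry)
{
    return (entry & DEMO_MPLS_BOS) != 0;
}
```

With this layout, label 16 with TTL 10 packs to 0001000a and label 32 with TTL 10 packs to 0002000a, which matches the outer and middle labels described in the documentation hunk.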
diff --git a/net/core/pktgen.c b/net/core/pktgen.c --- a/net/core/pktgen.c +++ b/net/core/pktgen.c @@ -106,6 +106,9 @@ * * interruptible_sleep_on_timeout() replaced Nishanth Aravamudan <[EMAIL PROTECTED]> * 050103 + * + * MPLS support by Steven Whitehouse <[EMAIL PROTECTED]> + * */ #include #include @@ -154,7 +157,7 @@ #include /* do_div */ #include -#define VERSION "pktgen v2.66: Packet Generator for packet performance testing.\n" +#define VERSION "pktgen v2.67: Packet Generator for packet performance testing.\n" /* #define PG_DEBUG(a) a */ #define PG_DEBUG(a) @@ -162,6 +165,8 @@ /* The buckets are exponential in 'width' */ #define LAT_BUCKETS_MAX 32 #define IP_NAME_SZ 32 +#define MAX_MPLS_LABELS 16 /* This is the max label stack depth */ +#define MPLS_STACK_BOTTOM __constant_htonl(0x0100) /* Device flag bits */ #define F_IPSRC_RND (1<<0) /* IP-Src Random */ @@ -172,6 +177,7 @@ #define F_MACDST_RND (1<<5) /* MAC-Dst Random */ #define F_TXSIZE_RND (1<<6) /* Transmit size is random */ #define F_IPV6(1<<7) /* Interface in IPV6 Mode */ +#define F_MPLS_RND(1<<8) /* Random MPLS labels */ /* Thread control flag bits */ #define T_TERMINATE (1<<0) @@ -278,6 +284,10 @@ struct pktgen_dev { __u16 udp_dst_min; /* inclusive, dest UDP port */ __u16 udp_dst_max; /* exclusive, dest UDP port */ + /* MPLS */ + unsigned nr_labels; /* Depth of stack, 0 = no MPLS */ + __be32 labels[MAX_MPLS_LABELS]; + __u32 src_mac_count;/* How many MACs to iterate through */ __u32 dst_mac_count;/* How many MACs to iterate through */ @@ -623,9 +633,19 @@ static int pktgen_if_show(struct seq_fil pkt_dev->udp_dst_min, pkt_dev->udp_dst_max); seq_printf(seq, - " src_mac_count: %d dst_mac_count: %d \n Flags: ", + " src_mac_count: %d dst_mac_count: %d\n", pkt_dev->src_mac_count, pkt_dev->dst_mac_count); + if (pkt_dev->nr_labels) { + unsigned i; + seq_printf(seq, " mpls: "); + for(i = 0; i < pkt_dev->nr_labels; i++) + seq_printf(seq, "%08x%s", ntohl(pkt_dev->labels[i]), + i == pkt_dev->nr_labels-1 ? 
"\n" : ", "); + } + + seq_printf(seq, " Flags: "); + if (pkt_dev->flags & F_IPV6) seq_printf(seq, "IPV6 "); @@ -644,6 +664,9 @@ static int pktgen_if_show(struct seq_fil if (pkt_dev->flags & F_UDPDST_RND) seq_printf(seq, "UDPDST_RND "); + if (pkt_dev->flags & F_MPLS_RND) + seq_printf(seq, "MPLS_RND "); + if (pkt_dev-
[Comment] sizeof("struct tcp_sock") is above 1024 on x86 since linux-2.6.15
Hi all

I would like to point out that struct tcp_sock was enlarged in 2.6.15,
and the 'TCP' kmem_cache now needs order-1 allocations instead of
order-0.

In 2.6.14 :

# grep "^TCP" /proc/slabinfo
TCP 64 76 960 4 1 : tunables 54 27 0 : slabdata 19 19 0

In 2.6.16 / 2.6.15 :

grep "^TCP" /proc/slabinfo
TCP 16 28 1152 7 2 : tunables 24 12 8 : slabdata 4 4 0

This is a new point of failure for x86 machines that use a lot of tcp
sockets. I learnt it the hard way and had to revert some servers to
2.6.14, because they cannot run stock 2.6.15/2.6.16 for long with this
problem.

Of course, we might argue the problem comes from linux memory
management... Oh well...

Thank you
Eric Dumazet
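The slabinfo columns after the cache name are active objects, total objects, object size, objects per slab and pages per slab. The jump from one page to two pages per slab can be reproduced with simple arithmetic, sketched below. This is a simplification under assumptions: a 4096-byte page, no slab management overhead inside the slab, and an order-selection rule (accept the first order whose leftover is at most one eighth of the slab) that is my recollection of the slab allocator's heuristic rather than something stated in this thread. All names are hypothetical.

```c
#include <stddef.h>

#define DEMO_PAGE_SIZE 4096u    /* assumed x86 page size */

/* How many objects of obj_size fit in a slab of 2^order pages. */
unsigned objs_per_slab(size_t obj_size, unsigned order)
{
    return (unsigned)(((size_t)DEMO_PAGE_SIZE << order) / obj_size);
}

/* Bytes left unused at the end of such a slab. */
size_t slab_waste(size_t obj_size, unsigned order)
{
    return ((size_t)DEMO_PAGE_SIZE << order) % obj_size;
}

/*
 * Assumed heuristic: take the smallest order whose leftover is no more
 * than 1/8 of the slab, falling back to max_order if none qualifies.
 */
unsigned pick_order(size_t obj_size, unsigned max_order)
{
    for (unsigned order = 0; order <= max_order; order++) {
        size_t slab_bytes = (size_t)DEMO_PAGE_SIZE << order;

        if (slab_waste(obj_size, order) * 8 <= slab_bytes)
            return order;
    }
    return max_order;
}
```

With 960-byte objects, four fit in one page with 256 bytes left, so order 0 is acceptable. With 1152-byte objects, only three fit in one page with 640 bytes left, while seven fit in two pages with 128 bytes left, which matches the "7 2" columns in the 2.6.15/2.6.16 output above.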
Re: [Comment] sizeof("struct tcp_sock") is above 1024 on x86 since linux-2.6.15
Ismail Donmez wrote:
>> This is a new point of failure for x86 machines that use lot of tcp
>> sockets, I learnt it the bad way and had to revert to 2.6.14 some
>> servers that cannot run stock 2.6.15/2.6.16 for long because of this
>> problem.
>
> It seems to be 1024 here, maybe its your config ?
>
> [EMAIL PROTECTED] ~ $ grep "^TCP" /proc/slabinfo
> TCP 9 12 1024 4 1 : tunables 54 27 0 : slabdata 3 3 0
> [EMAIL PROTECTED] ~ $ uname -a
> Linux southpark 2.6.16 #7 Mon Mar 20 21:16:42 EET 2006 i686 i686 i386 GNU/Linux

Yes, I should have mentioned my servers were SMP :)

Your kernel build is uniprocessor.

Eric
Re: [Comment] sizeof("struct tcp_sock") is above 1024 on x86 since linux-2.6.15
On Tuesday 21 March 2006 16:40, Eric Dumazet wrote:
> Ismail Donmez wrote:
> >> This is a new point of failure for x86 machines that use lot of tcp
> >> sockets, I learnt it the bad way and had to revert to 2.6.14 some
> >> servers that cannot run stock 2.6.15/2.6.16 for long because of this
> >> problem.
> >
> > It seems to be 1024 here, maybe its your config ?
> >
> > [EMAIL PROTECTED] ~ $ grep "^TCP" /proc/slabinfo
> > TCP 9 12 1024 4 1 : tunables 54 27 0 : slabdata 3 3 0
> > [EMAIL PROTECTED] ~ $ uname -a
> > Linux southpark 2.6.16 #7 Mon Mar 20 21:16:42 EET 2006 i686 i686 i386
> > GNU/Linux
>
> Yes, I should have mentioned my servers were SMP :)

Ah, ok then.

Regards,
ismail

--
An eye for eye will make the whole world blind -- Gandhi
Re: [Comment] sizeof("struct tcp_sock") is above 1024 on x86 since linux-2.6.15
On Tuesday 21 March 2006 16:17, you wrote:
> Hi all
>
> I would like to point out that struct tcp_sock was enlarged in 2.6.15, and
> the 'TCP' kmem_cache now needs order-1 allocations instead of order-0
>
>
> In 2.6.14 :
>
> # grep "^TCP" /proc/slabinfo
> TCP 64 76 960 4 1 : tunables 54 27 0 : slabdata 19 19 0
>
> In 2.6.16 / 2.6.15 :
>
> grep "^TCP" /proc/slabinfo
> TCP 16 28 1152 7 2 : tunables 24 12 8 : slabdata 4 4 0
>
>
> This is a new point of failure for x86 machines that use lot of tcp
> sockets, I learnt it the bad way and had to revert to 2.6.14 some servers
> that cannot run stock 2.6.15/2.6.16 for long because of this problem.

It seems to be 1024 here, maybe it's your config?

[EMAIL PROTECTED] ~ $ grep "^TCP" /proc/slabinfo
TCP 9 12 1024 4 1 : tunables 54 27 0 : slabdata 3 3 0
[EMAIL PROTECTED] ~ $ uname -a
Linux southpark 2.6.16 #7 Mon Mar 20 21:16:42 EET 2006 i686 i686 i386 GNU/Linux

Regards,
ismail

--
An eye for eye will make the whole world blind -- Gandhi
Re: Results WAS(Re: [PATCH] TC: bug fixes to the "sample" clause
On Tue, 2006-21-03 at 09:35 +1000, Russell Stuart wrote:
> Jeezz, that pisses me off. What is it with the bloody
> internet? This isn't the first time this has happened.
> The page you are accessing is in the US for gods sake.
> It seems like the internet has walled off islands on
> occasions. I have mirrored it:

Thanks - I accessed it.

[..]
> By the way, with the analysis I didn't go out of my
> way to find a dataset where 2.4 ran faster - it was
> just the second one I tried. There are pathological
> "fake" datasets that perform much worse than the one
> in the analysis, and presumably real ones too.

sorry Russell - that still doesn't cut it.

When you design something like a route lookup algorithm, for example,
you don't pick one over another based on a set of IP addresses you have.
A worst case scenario is acceptable, always.
An observation of "this is going to run on the edge/core of a network
therefore i will optimize for that case" is also acceptable.
Yours doesn't fit this.

I haven't run your test data but i am willing to bet (unenthusiastic to
try for sure since we've spent too much time on this), the "better"
results you are getting are due to biasing so that the "better"
algorithm gets things in some buckets more than others, i.e. it has
nothing to do with the hash selection. The environment changes and such
results will change as well.

Nothing is ever gonna save you from 25-75% of your buckets never ever
being used in the case of 2.4; and at the expense of sounding like a
broken record: i don't see anything the 2.4 algorithm brings of value
other than in the case of 256 buckets with masks which ensure all 256
buckets get used - so as a performance bigot i equally don't value
adding those extra computations; trust me, if i was semi-convinced i
would have supported the change.

The impression i have is you are an energetic, resourceful person -
let's move on (drop this) to that other thing you said you wanted to
talk about. I could look at the way you have arranged your tables and
offer an opinion.

cheers,
jamal
Re: [PATCH] scm: fold __scm_send() into scm_send()
On Tue, 2006-03-21 at 08:32 -0500, Stephen Smalley wrote:
> > I don't expect security_sk_sid() to be terribly expensive. It's not
> > an AVC check, it's just propagating a label. But I've not done any
> > benchmarking on that.
>
> No permission check there, but it looks like it does read lock
> sk_callback_lock. Not sure if that is truly justified here.

Ah, that is because it is also called from the xfrm code, introduced by
Trent's patches. But that locking shouldn't be necessary from scm_send,
right? So she likely wants a separate hook for it to avoid that
overhead, or even just a direct SELinux interface?

--
Stephen Smalley
National Security Agency
Re: [PATCH] scm: fold __scm_send() into scm_send()
On Mon, 2006-03-20 at 15:15 -0800, Chris Wright wrote:
> * Andrew Morton ([EMAIL PROTECTED]) wrote:
> > Chris Wright <[EMAIL PROTECTED]> wrote:
> > > Catherine, the security_sid_to_context() is a raw SELinux function which
> > > crept into core code and should not have been there. The fallout fixes
> > > included conditionally exporting security_sid_to_context, and finally
> > > scm_send/recv unlining.
> >
> > Yes. So we're OK up the uninlining, right?
>
> Yes, although sid_to_context is meant to be analog to the other
> get_peersec calls, and should really be made a proper part of the
> interface (can be done later, correctness is the issue at hand).

Yes, Catherine was told that she shouldn't be directly exporting
security_sid_to_context, and was allegedly working on a fix. Note
however that the expected solution is not a LSM interface but a set of
properly encapsulated interfaces exported directly from SELinux, based
on the iptables context matching patches by James. The same style of
interface is being put forth for the audit LSPP work. The indirection of
LSM serves no purpose here, as these users are specifically looking for
functionality provided only by SELinux.

> I don't expect security_sk_sid() to be terribly expensive. It's not
> an AVC check, it's just propagating a label. But I've not done any
> benchmarking on that.

No permission check there, but it looks like it does read lock
sk_callback_lock. Not sure if that is truly justified here.

--
Stephen Smalley
National Security Agency