Re: [2.4 PATCH] ethernet (net/ethernet/eth.c): eth_header() may produce invalid packets (with dest. addr. = 00:00:00:00:00:00)
On Wed, Jan 10, 2007 at 12:53:31PM +0100, Christian Praehauser wrote: Hello, and sorry for bothering you with a patch you've already seen ;-). I had not seen it yet, so I'm fine with it :-) From: Christian Praehauser, Department of Computer Sciences, University of Salzburg This patch fixes a problem which has already been corrected in Linux-2.6.16 but was not back-ported to the 2.4 series. It is essentially the same as the patch for 2.6.16. An excerpt from the ChangeLog for Linux 2.6.16 is included below. David, I think that Christian and Alexey's descriptions are fine, and that this patch should be merged. Are you OK ? What is not described in the patch description for 2.6.16 is that this problem also arises when transmitting IP multicast packets. If you send an IP multicast stream over an ethernet network interface ethX and turn off ARP on ethX then Linux will produce an ethernet frame with a dest. addresses of 00:00:00:00:00:00 (which is invalid). As IP multicast addresses are directly mapped to HW (MAC) addresses without invoking any ARP protocol mechanisms - for IP4 this mapping is performed by the function ip_eth_mc_map - it makes perfect sense to do this even if ARP is disabled. Further, this problem may occur periodically, everytime the corresponding struct dst_entry is garbage-collected (e.g. ~ every 10 minutes). Christian, would you please produce a properly formated description starting from the 2.6 one and appending your comments to it ? I'd also like to get the same subject as for the 2.6 patch. It's important that we keep patches subjects and descriptions for backports, it helps further patch tracking. Thanks in advance, Willy - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
planned driver updates for 2.4.35
Hi David, As stated with the 2.4.34 announcement, I planned to perform a few updates in two network drivers for 2.4.35 : - e1000: the cards equipped with the not-so old 82546EB chips have completely disappeared from the earth surface, and people replacing hardware are experiencing trouble with the new chip (82546GB) which is not supported by the old driver. I know that Red Hat merged support for this chip in RHEL3 recently too (U8) by upgrading from v6 to v7. Jesse Brandeburg from Intel offered to update the driver to the more recent branch (7.3), which will make further maintenance easier for him. I personnally use 7.3 on my own kernels, so I just know that it works well for the NICs I'm used to find on servers, but for the rest, I'll have to rely on Intel people's support. - sk98lin: when Stephen decided to rewrite that driver from scratch because the one from Marvell's site has too many bugs, I thought he was exagerating, till the day I noticed NFS servers taking a lot of time to respond and DNS servers timing out because of UDP packets suddenly not leaving the host anymore. I recently encountered a worst case at a customer's with losses, duplicates and truncated packets. It's clear that the driver is too buggy. Each time I replaced the official driver with Stephen's backport and it definitely fixed the problems. I proposed Stephen to merge his work into 2.4, and he agreed, offering to update it once it gets in, which I'm fine with. This means I would start with skge-1.4 and sky2-0.5 which I could not get to fault on various intensive tests nor on the one which trigger sk98lin's bugs. Since those drivers support cards that are currently only supported by the out-of-tree driver from Marvell's site, no compatibility is lost and users can still use their old driver if they prefer it (as I did till I got those bugs). Aside that, I rejected a user's request for an update of the tg3 driver which does not support one recent chip in a notebook. I think that a notebook is not a big enough argument to touch this driver which works quite well IMHO. A notebook is clearly where he should use 2.6 by default, or take the time to download and install the out-of-tree code from broadcom's site if he absolutely wants 2.4. Maybe I'm wrong, but I have not tested the unofficial tg3 enough to have an opinion, and I don't want to break things that work just for this. I'd like to get your opinion on those updates before I open 2.4.35. I think that integrating them early is the best way to get more testing, since people often try the first and the latest versions only. Best regards, Willy - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] sky2: flow control off
Hi Stephen, On Fri, Feb 02, 2007 at 03:34:25PM -0800, Stephen Hemminger wrote: Turn flow control off for sky2. When flow control is on, the transmitter may get randomly stuck. Perhaps there is hardware problem, but until Marvell provides errata information for workaround, it should default to off. Are you aware of any way to reproduce or at least to increases chances of occurrences of this problem ? It looks much like a problem I've been experiencing with Marvell's driver with a 88E8053 chip on 2.4. Doing ethtool -r proved useful to reset the transceiver in production. When trying to reproduce the problem, I noticed corrupted and duplicated frames in tcpdump captures when packets between 147 and 496 bytes were received then re-routed via the same NIC towards a 100 Mbps machine on the same switch, a Gig Dlink doing flow control. Yes, I know, there are a lot of variables ! But changing the target machine, or the packet sizes was enough to stop the corruption. I could not get the transmitter stuck as I sometimes have in production, but I think that both problems are related. The fact that a comparable problem is encountered with your driver really makes me think about a hardware bug :-/ BTW, switching to your driver (sky2-1.5), I could not reproduce the corruption problem anymore, but I've not yet put it in production to check if it fixes the transceiver problem. Anyway, it might be interesting that users who encounter this problem try an old version of the driver just in case they detect a difference. regards, Willy - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH-2.4] forcedeth update to 0.50
On Wed, May 31, 2006 at 07:50:32AM +0200, Manfred Spraul wrote: Hi Willy, Willy Tarreau wrote: I started from the latest backport you sent in september (0.42) and incrementally applied 2.6 updates. I stopped at 0.50 which provides VLAN support, because after this one, there are some 2.4-incompatible changes (64bit consistent memory allocation for rings, and MSI/MSIX support). I agree, 2.4 needs a backport. Either a full backport as you did, or a minimal one-liner fix. Right now, the driver is not usable due to an incorrect initialization. Or to be more accurate: # modprobe # ifup works. But # modprobe # ifup # ifdown # ifup causes a misconfiguration, and the nic hangs hard after a few MB. And recent distros do the equivalent of ifup/ifdown/ifup somewhere in the initialization. That's what I read in one of the changelogs, but I'm not sure at all that it's what happened, because I had the problem after an ifup only. What I was doing with this box was pure performance tests which drew me to compare the broadcom and nforce performance. My tests measured 3 creteria : - number of HTTP/1.0 hits/s - maximum data rate - maximum packets/s on tg3, I got around 45 khits/s, 949 Mbps (TCP, =1.0 Gbps on wire) and 1.05 Mpps receive (I want to build a high speed load-balancer and a sniffer). This was stable. On the nforce, I tried with the hits/s first because it's a good indication of hardware-based and driver-based optimizations. It reached 18 khits/s with a lot of difficulty and the machine was stuck at 100% of one CPU. But it ran for a few minutes like this. Then I tried data rate (which is the same test with 1MB objects), and it failed after about 2 seconds and few megabytes (or hundreds of megabytes) transferred. I had to reboot to get it to work again. And I'm fairly sure that I did not do down/up this time as well, but the test came to the same end. That's why I'm not sure at all that the one-liner will be enough. Moreover, after the update, I reached the same performance as with the broadcom, with a slight improvement on packet reception (1.09 Mpps), and low CPU usage (15%). So basically, the upgrade rendered the driver from barely usable for SSH to very performant. Marcelo: Do you need a one-liner, or could you apply a large backport patch? I would really vote for the full backport, and I can break it into pieces if needed (I have them at hand, just have to re-inject the changelogs). However, I have separate changes from 0.42 to 0.50, because I started with your 0.30-0.42 backport patch. I have this machine till the end of the week, so I can perform other tests if you're interested in trying specific things. -- Manfred Cheers, Willy - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH-2.4] forcedeth update to 0.50
On Wed, May 31, 2006 at 09:50:38PM +0200, Manfred Spraul wrote: Marcelo Tosatti wrote: Since v2.4.33 should be out RSN, my opinion is that applying the one-liner to fix the bringup problem for now is more prudent.. It's attached. Untested, but it should work. Just the relevant hunk from the 0.42 patch. I will test it tomorrow morning. John might be interested in merging it too, as I have checked today that RHEL3 was affected by the same problem (rmmod followed by modprobe). But I would disagree with waiting for 2.3.34 for a full backport: 0.30 basically doesn't work, thus the update to 0.50 would be a big step forward - it can't be worse that 0.30. Seconded ! Manfred, if you have some corner cases in mind, are aware of anything which might sometimes break, or have a few experimental patches to try, I'm OK for a few tests while I have the machine (it's SMP BTW). -- Manfred Cheers, Willy - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Patch] 2.4.32 - Neighbour Cache (ARP) State machine bug Fixed
Hi, On Tue, Feb 07, 2006 at 12:57:43AM -0700, Pradeep Vincent wrote: In 2.4.21, arp code uses gc_timer to check for stale arp cache entries. In 2.6, each entry has its own timer to check for stale arp cache. 2.4.29 to 2.4.32 kernels (atleast) use neither of these timers. This causes problems in environments where IPs or MACs are reassigned - saw this problem on load balancing router based networks that use VMACs. Tested this code on load balancing router based networks as well as peer-linux systems. Thanks, Signed off by: Pradeep Vincent [EMAIL PROTECTED] diff -Naur old/net/core/neighbour.c new/net/core/neighbour.c --- old/net/core/neighbour.cWed Nov 23 17:15:30 2005 +++ new/net/core/neighbour.cWed Nov 23 17:26:01 2005 @@ -14,6 +14,7 @@ * Vitaly E. Lavrovreleasing NULL neighbor in neigh_add. * Harald WelteAdd neighbour cache statistics like rtstat * Harald Welteport neighbour cache rework from 2.6.9-rcX + * Pradeep Vincent Move neighbour cache entry to stale state */ As you can see above, your mailer is still broken. Leading spaces get removed and it seems like tabs are replaced with spaces. This makes it really annoying to fix by hand because we all have to do your work again. You should try to fix your mailer options, possibly by sending a few mails to yourself or someone else (if you send *a few* mails to me, I can confirm which one looks OK). If your mailer is definitely broken, then you may send it as plain text first (for review), with a text attachment for people to apply it without trouble. Thanks, Willy - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
SKGE backport to 2.4 : success
Hi Stephen, In my own kernels, I've added your backport of SKGE to 2.4 that I found here : http://developer.osdl.org/shemminger/releases/skge-sky2-backport.tar.bz2 It seems to work pretty well compared to the original syskonnect driver (up to and including 8.36). Several people around me have reported very slow NFS operations with the official driver, which I finally attributed to a strange effect of UDP packets not going out after a while until they get pushed by a TCP packet. I even noticed the problem at the company and we turned the NFS server to an unused 100 Mbps card to workaround the problem before being able to fully ananlyze the problem. It seems your driver is getting mature and its performance is very close to the official one, while its code is smaller and apparently more reliable. I was thinking about merging it in mainline 2.4 as a fix for people having trouble with the syskonnect driver. It might also be easier to backport fixes from 2.6 to 2.4 when the driver is the same. I don't think we risk any regression because it won't replace an existing driver, but will provide one to people who are used to download new versions from an external tree. Also, I'm not yet sure whether I would also backport the sky2 driver, because I know about a handful boxes running in production with the official one with 88E8053 chips at high packet rates with no trouble at all. Anyway, as long as the backport does not prevent them from using the external driver, there should be no problem. I'd like to get your opinion on this matter, and of course, Jeff's and Davem's. Thanks in advance, Willy - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: SKGE backport to 2.4 : success
On Mon, Nov 06, 2006 at 10:56:09AM -0800, Stephen Hemminger wrote: On Sat, 4 Nov 2006 22:08:55 +0100 Willy Tarreau [EMAIL PROTECTED] wrote: Hi Stephen, I don't know if you received my mail since I got no reply. Thanks in advance for your comments, Willy On Sat, Oct 28, 2006 at 10:57:07PM +0200, Willy Tarreau wrote: Hi Stephen, In my own kernels, I've added your backport of SKGE to 2.4 that I found here : http://developer.osdl.org/shemminger/releases/skge-sky2-backport.tar.bz2 It seems to work pretty well compared to the original syskonnect driver (up to and including 8.36). Several people around me have reported very slow NFS operations with the official driver, which I finally attributed to a strange effect of UDP packets not going out after a while until they get pushed by a TCP packet. I even noticed the problem at the company and we turned the NFS server to an unused 100 Mbps card to workaround the problem before being able to fully ananlyze the problem. It seems your driver is getting mature and its performance is very close to the official one, while its code is smaller and apparently more reliable. I was thinking about merging it in mainline 2.4 as a fix for people having trouble with the syskonnect driver. It might also be easier to backport fixes from 2.6 to 2.4 when the driver is the same. I don't think we risk any regression because it won't replace an existing driver, but will provide one to people who are used to download new versions from an external tree. Also, I'm not yet sure whether I would also backport the sky2 driver, because I know about a handful boxes running in production with the official one with 88E8053 chips at high packet rates with no trouble at all. Anyway, as long as the backport does not prevent them from using the external driver, there should be no problem. I'd like to get your opinion on this matter, and of course, Jeff's and Davem's. Thanks in advance, Willy The backport needs to be updated. It is of older code. I plan to do a new backport this week. The backport version doesn't use NAPI, because of issues with not wanting to change netdevice.h. For a good 2.4 version, I would make a version that was closer to 2.6 code (using NAPI). That would be perfect, it would make backport of fixes even easier. I have turned last version into a patch against 2.4.33 for in-tree inclusion, so if you're interested in getting it for the Config.in, Makefiles and Configure.help, do not hesitate. I did the backport because one of the equipment donors gave a VPN box whose base OS is RHEL based on 2.4. It's amazing how having the hardware stimulates development, isn't it? :-) Tbanks, Willy - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [IPv4] fib: Fix out of bound access of fib_props[]
On Sat, Mar 24, 2007 at 08:33:39PM -0700, David Miller wrote: From: Thomas Graf [EMAIL PROTECTED] Date: Sat, 24 Mar 2007 16:34:36 +0100 Fixes a typo which caused fib_props[] to have the wrong size and makes sure the value used to index the array which is provided by userspace via netlink is checked to avoid out of bound access. Signed-off-by: Thomas Graf [EMAIL PROTECTED] Applied, thanks. Thomas, David, it seems to me that 2.4 needs the same fix for the typo, but I see no place where we may add the check for RTN_MAX. Maybe this last one is not needed. If someone could enlighten me, or at least provide a means to test if the bug it present, it would help me. Thanks in advance, Willy - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [IPv4] fib: Fix out of bound access of fib_props[]
On Sat, Apr 14, 2007 at 04:26:55PM +0200, Willy Tarreau wrote: On Sat, Mar 24, 2007 at 08:33:39PM -0700, David Miller wrote: From: Thomas Graf [EMAIL PROTECTED] Date: Sat, 24 Mar 2007 16:34:36 +0100 Fixes a typo which caused fib_props[] to have the wrong size and makes sure the value used to index the array which is provided by userspace via netlink is checked to avoid out of bound access. Signed-off-by: Thomas Graf [EMAIL PROTECTED] Applied, thanks. Thomas, David, it seems to me that 2.4 needs the same fix for the typo, but I see no place where we may add the check for RTN_MAX. Maybe this last one is not needed. If someone could enlighten me, or at least provide a means to test if the bug it present, it would help me. Finally, I think I have the correct fix below. Please someone confirm or tell me I'm nuts. Thanks in advance, Willy From 230c62b9e7000cfb407a079a21ad0f077f164b21 Mon Sep 17 00:00:00 2001 From: Willy Tarreau [EMAIL PROTECTED] Date: Sat, 14 Apr 2007 17:44:03 +0200 Subject: [IPv4] fib: Fix out of bound access of fib_props[] Backported from 2.6. Bug found and fixed by Thomas Graf : Fixes a typo which caused fib_props[] to have the wrong size and makes sure the value used to index the array which is provided by userspace via netlink is checked to avoid out of bound access. --- net/ipv4/fib_semantics.c |7 ++- 1 files changed, 6 insertions(+), 1 deletions(-) diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c index afdf4bb..b930371 100644 --- a/net/ipv4/fib_semantics.c +++ b/net/ipv4/fib_semantics.c @@ -83,7 +83,7 @@ static struct { int error; u8 scope; -} fib_props[RTA_MAX+1] = { +} fib_props[RTN_MAX+1] = { { 0, RT_SCOPE_NOWHERE},/* RTN_UNSPEC */ { 0, RT_SCOPE_UNIVERSE},/* RTN_UNICAST */ { 0, RT_SCOPE_HOST},/* RTN_LOCAL */ @@ -431,6 +431,11 @@ fib_create_info(const struct rtmsg *r, struct kern_rta *rta, const int nhs = 1; #endif + if (r-rtm_type RTN_MAX) { + err = -EINVAL; + goto errout; + } + /* Fast check to catch the most weird cases */ if (fib_props[r-rtm_type].scope r-rtm_scope) goto err_inval; -- 1.4.4.4 - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [IPv4] fib: Fix out of bound access of fib_props[]
On Mon, Apr 16, 2007 at 10:14:17AM +0200, Thomas Graf wrote: * Willy Tarreau [EMAIL PROTECTED] 2007-04-14 17:48 Finally, I think I have the correct fix below. Please someone confirm or tell me I'm nuts. Looks good, same is needed for DECnet Thank you Thomas. The DECnet stuff was already merged, it was (expectedly) the same code. Regards, Willy - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: 2.6.16, sk98lin out of date
On Mon, Feb 13, 2006 at 02:03:14PM -0500, Lee Revell wrote: On Mon, 2006-02-13 at 12:06 +0100, Mws wrote: hi, as i do have the same problem i may help you out. at first, syskonnect did send their kernel diffs/patches but they we're rejected caused by coding style, indention and some people thinking that things can be done better. Haha, they didn't like the LKML code review so they just stopped sending patches? Classic. Remind me not to buy their gear. Lee, it's not always that simple. When you submit one driver, sometimes reviewers tell you that for whatever reason your driver's structure is wrong and it has to be changed a lot (and sometimes they're right of course). But when you don't have enough ressource to do the job twice, the best you can do is to maintain it out of tree, which is already a pain. I'm not saying that it is what happened with their driver, I don't know the history. However, I found your reaction somewhat hasty. I personally would prefer to offer time and help before deciding that I don't want anyone's products on this basis. It's not as if they did not release their driver's source ! Cheers, Willy - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Linux v2.6.16-rc6
On Sat, Mar 11, 2006 at 06:39:04PM -0800, David S. Miller wrote: From: Michal Piotrowski [EMAIL PROTECTED] Date: Sun, 12 Mar 2006 02:51:40 +0100 I have noticed this warnings TCP: Treason uncloaked! Peer 82.113.55.2:11759/50967 shrinks window 148470938:148470943. Repaired. TCP: Treason uncloaked! Peer 82.113.55.2:11759/50967 shrinks window 148470938:148470943. Repaired. TCP: Treason uncloaked! Peer 82.113.55.2:11759/59768 shrinks window 1124211698:1124211703. Repaired. TCP: Treason uncloaked! Peer 82.113.55.2:11759/59768 shrinks window 1124211698:1124211703. Repaired. It maybe problem with ktorrent. It is a problem with the remote TCP implementation, it is illegally advertising a smaller window that it previously did. on 2005/10/27, Herbert Xu provided a patch merged in 2.6.14 to fix some erroneous occurences of this message (some of them appeared with Linux on the other side). It would be interesting to know whether the peer above is Linux or not, because it might be possible that Herbert's fix needs to be applied to other places ? Here comes his patch with his interesting analysis for reference, in case it might give ideas to anybody. Cheers, Willy --- From: Herbert Xu [EMAIL PROTECTED] Date: Thu, 27 Oct 2005 08:47:46 + (+1000) Subject: [TCP]: Clear stale pred_flags when snd_wnd changes X-Git-Tag: v2.6.14 X-Git-Url: http://kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=2ad41065d9fe518759b695fc2640cf9c07261dd2 [TCP]: Clear stale pred_flags when snd_wnd changes This bug is responsible for causing the infamous Treason uncloaked messages that's been popping up everywhere since the printk was added. It has usually been blamed on foreign operating systems. However, some of those reports implicate Linux as both systems are running Linux or the TCP connection is going across the loopback interface. In fact, there really is a bug in the Linux TCP header prediction code that's been there since at least 2.1.8. This bug was tracked down with help from Dale Blount. The effect of this bug ranges from harmless Treason uncloaked messages to hung/aborted TCP connections. The details of the bug and fix is as follows. When snd_wnd is updated, we only update pred_flags if tcp_fast_path_check succeeds. When it fails (for example, when our rcvbuf is used up), we will leave pred_flags with an out-of-date snd_wnd value. When the out-of-date pred_flags happens to match the next incoming packet we will again hit the fast path and use the current snd_wnd which will be wrong. In the case of the treason messages, it just happens that the snd_wnd cached in pred_flags is zero while tp-snd_wnd is non-zero. Therefore when a zero-window packet comes in we incorrectly conclude that the window is non-zero. In fact if the peer continues to send us zero-window pure ACKs we will continue making the same mistake. It's only when the peer transmits a zero-window packet with data attached that we get a chance to snap out of it. This is what triggers the treason message at the next retransmit timeout. Signed-off-by: Herbert Xu [EMAIL PROTECTED] Signed-off-by: Arnaldo Carvalho de Melo [EMAIL PROTECTED] --- --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -2239,6 +2239,7 @@ static int tcp_ack_update_window(struct /* Note, it is the only place, where * fast path is recovered for sending TCP. */ + tp-pred_flags = 0; tcp_fast_path_check(sk, tp); if (nwin tp-max_window) { - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: 2.4.32: unresolved symbol unregister_qdisc
On Sun, Apr 09, 2006 at 08:29:12PM -0700, Randy.Dunlap wrote: On Sun, 9 Apr 2006 23:20:01 -0400 (EDT) George P Nychis wrote: On Sun, 9 Apr 2006 22:49:50 -0400 (EDT) George P Nychis wrote: On Sun, 9 Apr 2006 22:05:33 -0400 (EDT) George P Nychis wrote: On Sun, 9 Apr 2006 13:37:25 -0400 (EDT) George P Nychis wrote: Thanks for the help. Here is the makefile: http://rafb.net/paste/results/auchPH75.html And here is the full errors I receive: http://rafb.net/paste/results/Qplpqw74.html Greatly appreciate it - George [repeat: please don't top-post] I don't know how much I can help you. It's been a long time since I've built external modules on 2.4.x. Problems that I see: - the Makefile does not use the expected 2.4 kernel build infrastructure; - kernel Makefile uses -nostdinc to prevent use of userspace headers; - Makefile is trying to include userspace headers instead of kernel headers, e.g.: In file included from /usr/include/linux/if_ether.h:107, from /usr/include/linux/netdevice.h:29, from sch_xcp.c:8: - this specified include directory is only in 2.6.x, not 2.4.x: -I/lib/modules/`uname -r`/build/include/asm/mach-default It's not clear to me how this makefile could work with 2.4.x at all. Is it supposed to, or that's just what you want to see it do? You could try to fix the Makefile based on makefile-changes articles at lwn.net. E.g.: http://lwn.net/Articles/151784/ http://lwn.net/Articles/79984/ http://lwn.net/Articles/74767/ http://lwn.net/Articles/69148/ http://lwn.net/Articles/21823/ On Sat, 8 Apr 2006 19:18:47 -0400 (EDT) George P Nychis wrote: Yeah, this module is unfortunately not under the GPL, it was made for research and i am not the author, I was only given the code for my own research. I enabled that support in the kernel, and then tried to recompile and get tons of errors/warnings... so maybe I am missing something else to be enabled in the kernel... here are a few examples of errors: /usr/include/linux/skbuff.h:30:26: net/checksum.h: No such file or directory /usr/include/asm/irq.h:16:25: irq_vectors.h: No such file or directory /usr/include/linux/irq.h:72: error: `NR_IRQS' undeclared here (not in a function) /usr/include/asm/hw_irq.h:28: error: `NR_IRQ_VECTORS' undeclared here (not in a function) I think those are the top most errors, so if i can fix those hopefully the rest shall vanish! Looks like a Makefile problem then. Can you post the Makefile? Hopefully it is using a Makefile and not just an elaborate gcc command line. [and please don't top-post] - George From: George P Nychis [EMAIL PROTECTED] Date: Sat, 8 Apr 2006 18:47:34 -0400 (EDT) Hey, I have a kernel module that uses unregister_qdisc and register_qdisc, whenever i try to insert the module I get: /lib/modules/2.4.32/kernel/net/sched/sch_xcp.o: /lib/modules/2.4.32/kernel/net/sched/sch_xcp.o: unresolved symbol unregister_qdisc /lib/modules/2.4.32/kernel/net/sched/sch_xcp.o: /lib/modules/2.4.32/kernel/net/sched/sch_xcp.o: unresolved symbol register_qdisc Am i missing some sort of support in the kernel? Make sure CONFIG_NET_SCHED is enabled and that you compiled your module against that kernel. Where does this sch_xcp come from? It's not in the vanilla sources. Also, please direct networking questions to the netdev@vger.kernel.org mailing list which I have added to the CC:. --- ~Randy By the way, if I add -I/usr/src/linux/include to the compile line, it successfully compiles, however, i am back to the start: lanthanum-ini src-1.0.1 # insmod sch_xcp Using /lib/modules/2.4.32/kernel/net/sched/sch_xcp.o /lib/modules/2.4.32/kernel/net/sched/sch_xcp.o: /lib/modules/2.4.32/kernel/net/sched/sch_xcp.o: unresolved symbol unregister_qdisc /lib/modules/2.4.32/kernel/net/sched/sch_xcp.o: /lib/modules/2.4.32/kernel/net/sched/sch_xcp.o: unresolved symbol register_qdisc Yet your 2.4.32 kernel image file does have those symbols in it? Can you verify that by using 'nm' on the kernel image file? If so, then I suppose that you'll need to make a small module test case that exhibits this behavior, or just tell us where to get the sch_xcp files... (re-added cc: for netdev) --- ~Randy By kernel image, do you mean /usr/src/linux/vmlinux ? if so, lanthanum-ini linux # nm vmlinux | grep register_qdisc c0399200 R __kstrtab_register_qdisc c0399240 R __kstrtab_unregister_qdisc c039ebc8 R __ksymtab_register_qdisc c039ebd0 R __ksymtab_unregister_qdisc c02eda40 T register_qdisc c02edaf0 T unregister_qdisc Yes. That's good, then. --- ~Randy *sigh* ... still getting the
[PATCH-2.6] e1000: fix media_type - phy_type thinko
Recent patch cb764326dff0ee51aca0d450e1a292de65661055 introduced a thinko in e1000_main.c : e1000_media_type_copper is compared to hw.phy_type instead of hw.media_type. Original patch proposed by Jesse Brandeburg was correct, but what has been merged is not. --- drivers/net/e1000/e1000_main.c |2 +- 1 files changed, 1 insertions(+), 1 deletions(-) 3df8a180d50c89a72c28abf37151e38ffda75f39 diff --git a/drivers/net/e1000/e1000_main.c b/drivers/net/e1000/e1000_main.c index add8dc4..590a456 100644 --- a/drivers/net/e1000/e1000_main.c +++ b/drivers/net/e1000/e1000_main.c @@ -4156,7 +4156,7 @@ e1000_mii_ioctl(struct net_device *netde spin_unlock_irqrestore(adapter-stats_lock, flags); return -EIO; } - if (adapter-hw.phy_type == e1000_media_type_copper) { + if (adapter-hw.media_type == e1000_media_type_copper) { switch (data-reg_num) { case PHY_CTRL: if (mii_reg MII_CR_POWER_DOWN) -- 1.2.4 - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] sky2: flow control off
Hi Stephen, First, thanks for this detailed explanation. On Mon, Feb 05, 2007 at 09:22:53AM -0800, Stephen Hemminger wrote: Here is what I saw. The transmitter on the Marvell Yukon II (88e8053) hangs when doing transmit flow control under load. There appears to be a bug or race condition that causes the MAC to stop transmitting data. There are two drivers for the Yukon II device on Linux. SysKonnect/Marvell has one called sk98lin it is downloadable from syskonnect.def, and I wrote one called sky2 that is part of the standard Linux kernel. This problem is reproducible with the sky2 driver only; the sk98lin driver has a watchdog routine that resets the hardware perodically, so it masks the problem. When the failure mode occurs only after several minutes of sustained activity and a situation where PAUSE frames would be received. In my testing I used server == 1000mbit === switch --- 100mbit --- client Server was Mac Mini (88E8053) running Linux 2.6.20-rc7 and client was a Sony Vaio (88e8036) laptop. The server was running NFS in kernel and client was doing a large copy. The server was using UDP to cause large amounts of 802 pause frames. The problem is not as reproducible with TCP tests because TCP congestion control avoids over running the switch. I encountered *exactly* this problem with a one-leg firewall equipped with a 88E8053 attached to a 1000 Mbps switch, itself hosting 100 Mbps stations, but with sk98lin (2.4). Running tcpdump on the firewall, I noticed duplicated and corrupted frames. I could only reproduce the duplicated and corrupted frames on a lab setup, not the Tx hangs, by sending high UDP traffic on the port to a 100 Mbps host. Sending to 1000 Mbps hosts never triggered the problem, hence my conclusions about flow control too. What I found interesting is that using a very old version of the sky2 driver which I had with me (sky2 v0.5), I could not trigger the problem anymore. But right now, I realize that this version of the driver did not support flow control yet, which might converge with your observations : # ethtool -i eth0 driver: sky2 version: 0.5 firmware-version: N/A bus-info: 01:00.0 # ethtool -a eth0 Pause parameters for eth0: Autonegotiate: on RX: off TX: off When failure occurs: * packets continue to be received and passed up the stack * GMAC status register is the pause state * transmit packets continue transferred by the DMA into the RAM buffer * when the the RAM buffer fills no more packets are DMA'd * when transmit queue in driver fills, it gets a watch dog timeout * switch appears to get confused and other ports hang as well. During development of the sky2 driver a similar problem was observed on receive if the receive DMA buffer was not 8 byte aligned. For performance reasons, Linux drivers usually offset the Rx buffer by 2 bytes so that the TCP/IP headers are aligned for faster CPU access. If the sky2 Rx buffer was offset, then the receiver DMA would occasionally hung. The workaround for receive was to align the receive buffer on a quad word boundary. This problem appears to be flow control related because after disabling flow control, no errors occurred in a 48 hour test run. No problem here with the old driver without flow control either. I can try to disable it right here on my setup with sk98lin, and test again. I did not know that the sk98lin had a watchdog, it could explain why sometimes the system entered a strange state (packets taking *seconds* to be forwarded). Anyway, I'm more and more convinced that there are hardware bugs. It is not normal at all that both the original syskonnect driver and your fresh new code show such similar problems ! There probably are other races and hangs that are related. I don't consider all the hangs eliminated yet. Well, at least you have a more maintainable driver than what was the previous one, so you will eventually manage to fix all problems ;-) Best regards, Willy - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2.4 0/3] e1000: update to 7.3.20
Hi Auke, On Mon, Feb 05, 2007 at 05:01:02PM -0800, Kok, Auke wrote: Willy, Please pull: git-pull git://lost.foo-projects.org/~ahkok/git/linux-2.4 e1000 to receive an update for the e1000 driver. This updates the e1000 driver in the 2.4 kernel to version 7.3.20-k4, roughly the equivalent of what is in 2.6.20 and the latest of our out-of-tree driver. This adds new hardware support and many fixes. Many customers are asking us for these updates and the current 2.4 kernel ships a very outdated version of the e1000 adapter. This patch includes small compatibility headers and code to minimize the changes we need to make to our driver and keep it in sync easier with the 2.6 kernel version so we can possibly continue to ship updates and fixes to this driver with more ease. Pulled, thanks very much. I'll produce -pre1 shortly so that we can ensure there's no regression. Cheers, Auke Best Regards, Willy - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] tcp_cubic: use 32 bit math
On Wed, Mar 07, 2007 at 07:10:47PM -0800, Stephen Hemminger wrote: David Miller wrote: From: Stephen Hemminger [EMAIL PROTECTED] Date: Wed, 7 Mar 2007 17:07:31 -0800 The basic calculation has to be done in 32 bits to avoid doing 64 bit divide by 3. The value x is only 22bits max so only need full 64 bits only for x^2. Signed-off-by: Stephen Hemminger [EMAIL PROTECTED] Applied, thanks Stephen. What about Willy Tarreau's supposedly even faster variant? Or does this incorporate that set of improvements? That's what this is: x = (2 * x + (uint32_t)div64_64(a, (uint64_t)x*(uint64_t)x)) / 3; Confirmed, it's the same. BTW, has someone tested on a 64bit system if it brings any difference ? Willy - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] tcp_cubic: use 32 bit math
On Wed, Mar 07, 2007 at 07:51:35PM -0800, David Miller wrote: From: Stephen Hemminger [EMAIL PROTECTED] Date: Wed, 07 Mar 2007 19:10:47 -0800 David Miller wrote: What about Willy Tarreau's supposedly even faster variant? Or does this incorporate that set of improvements? That's what this is: x = (2 * x + (uint32_t)div64_64(a, (uint64_t)x*(uint64_t)x)) / 3; Great, thanks for the clarification. Oh BTW, I have a newer version with a first approximation of the cbrt() before the div64_64, which allows us to reduce from 3 div64 to only 2 div64. This results in a version which is twice as fast as the initial one (ncubic), but with slightly less accuracy (0.286% compared to 0.247). But I see that other functions such as hcbrt() had a 1.5% avg error, so I think this is not dramatic. Also, I managed to remove all other divides, to be kind with CPUs having a slow divide instruction or no divide at all. Since we compute on limited range (22 bits), we can multiply then shift right. It shows me even slightly better time on pentium-m and athlon, with a slightly higher avg error (0.297% compared to 0.286%), and slightly smaller code. I just have to clean experiments from my code to provide a patch. David, Stephen, are you interested ? $ ./bictcp fls(0)=0, fls(1)=1, fls(256)=9 Calibrating Function clocks mean(us) max(us) std(us) Avg error bictcp 936 0.6124.28 1.99 0.172% ocubic 886 0.5723.51 3.18 0.274% ncubic 644 0.4216.59 2.18 0.247% ncubic32444 0.2911.47 1.50 0.247% ncubic32_1 444 0.2911.56 1.88 0.238% ncubic32b3 337 0.22 8.67 0.88 0.286% ncubic_ndiv3329 0.21 8.46 0.69 0.297% acbrt 707 0.4618.05 0.80 0.275% hcbrt 644 0.4216.44 0.51 1.580% Regards, Willy - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] tcp_cubic: use 32 bit math
; /* a in [0..63] */ uint32_t bits; bits = (uint8_t)a 1; bits = ~1; return 1 + ((0xFFEAAA54UL bits) 0x03); } /* We want (b - 1) / 3 for b between 7 and 64 so that we have a * a bit shift count between 2 and 21 inclusive. */ b = (b * 84) 8; /* We want the highest 3bit block from 'a' */ x = ((/*cbrt_3msb_4lsb*/0xDCBA9863UL VAL_3BIT_TO_SHIFT4(a, b)) 0x0F) + 3; x = (x b) 3; /* one divide is enough to get a value within 0.4% */ x = (2 * x + (uint32_t)div64_64(a, (uint64_t)x * (uint64_t)(x - 1))); x = ((x * 341) 10); return x; } static uint32_t end_ncubic_1div() { } /* calculate the cubic root of x using only an approximation. * Avg err ~= 4.521% */ static uint32_t ncubic_0div(uint64_t a) { uint32_t x; uint32_t b; /* * For large values, 3 bits inputs are enough. * * We store 4*cbrt(8*x)-1 for x in [0..7] * * x | cbrt | 4*cbrt | int | int | * range | range | approx | val.| -3 | * ---+-++-+-+ * 56..63 | 3.83 - 3.98 | 15.66 | 16 | 13 | * 48..55 | 3.63 - 3.80 | 14.93 | 15 | 12 | * 40..47 | 3.42 - 3.61 | 14.12 | 14 | 11 | * 32..39 | 3.17 - 3.39 | 13.21 | 13 | 10 | * 24..31 | 2.88 - 3.14 | 12.15 | 12 | 9 | * 16..23 | 2.51 - 2.84 | 10.86 | 11 | 8 | * 08..15 | 2.00 - 2.46 | 9.16 | 9 | 6 | * 00..07 | 1.00 - 1.91 | 6.35 | 6 | 3 | * * We can store (4*cbrt(x)-3) in 4 bits for x in [0..63]. * So we use the following map : 0xDCBA9863UL */ ALIGN64 static uint32_t cbrt_3msb_4lsb = 0xDCBA9863UL; b = fls64(a); if (b 7) { if (b = 1) return b; /* a in [0..63] */ uint32_t bits; bits = (uint8_t)a 1; bits = ~1; return 1 + ((0xFFEAAA54UL bits) 0x03); } /* We want (b - 1) / 3 for b between 7 and 64 so that we have a * a bit shift count between 2 and 21 inclusive. */ b = (b * 84) 8; /* We want the highest 3bit block from 'a' */ x = ((/*cbrt_3msb_4lsb*/0xDCBA9863UL VAL_3BIT_TO_SHIFT4(a, b)) 0x0F) + 3; x = (x b) 3; return x; } static uint32_t end_ncubic_0div() { } /* calculate the cubic root of x using table lookups only. * Avg err ~= 0.613% */ static uint32_t ncubic_tab0(uint64_t a) { uint32_t b; uint32_t shift; /* * cbrt(x) MSB values for x MSB values in [0..63]. * Precomputed then refined by hand - Willy Tarreau * * For x in [0..63], * v = cbrt(x 18) - 1 * cbrt(x) = (v[x] + 1) 6 */ static uint8_t v[] = { /* 0x00 */0, 63, 63, 63, 127, 127, 127, 127, /* 0x08 */ 129, 135, 139, 143, 147, 152, 155, 160, /* 0x10 */ 161, 165, 168, 172, 174, 177, 179, 182, /* 0x18 */ 184, 188, 190, 193, 195, 197, 199, 202, /* 0x20 */ 203, 205, 207, 209, 212, 213, 215, 217, /* 0x28 */ 219, 221, 222, 224, 226, 227, 229, 231, /* 0x30 */ 232, 234, 236, 237, 239, 240, 241, 243, /* 0x38 */ 244, 246, 247, 249, 251, 252, 253, 255, }; b = fls64(a); if (b 7) /* a in [0..63] */ return (v[(uint32_t)a] + 31) 6; b = ((b * 84) 8) - 1; shift = (a (b * 3)); return ((uint32_t)(v[shift] + 1) b) 6; } static uint32_t end_ncubic_tab0() { } /* calculate the cubic root of x using a table lookup followed by one * Newton-Raphson iteration. * Avg err ~= 0.195% */ static uint32_t ncubic_tab1(uint64_t a) { uint32_t x; uint32_t b; uint32_t z; uint32_t shift; /* * cbrt(x) MSB values for x MSB values in [0..63]. * Precomputed then refined by hand - Willy Tarreau * * For x in [0..63], * v = cbrt(x 18) - 1 * cbrt(x) = (v[x] + 10) 6 */ static uint8_t v[] = { /* 0x00 */0, 54, 54, 54, 118, 118, 118, 118, /* 0x08 */ 123, 129, 134, 138, 143, 147, 151, 156, /* 0x10 */ 157, 161, 164, 168, 170, 173, 176, 179, /* 0x18 */ 181, 185, 187, 190, 192, 194, 197, 199, /* 0x20 */ 200, 202, 204, 206, 209, 211, 213, 215, /* 0x28 */ 217, 219, 221, 222, 224, 225, 227, 229, /* 0x30 */ 231, 232, 234, 236, 237, 239, 240, 242, /* 0x38 */ 244, 245, 246
Re: [PATCH] tcp_cubic: use 32 bit math
Hi Stephen, On Wed, Mar 21, 2007 at 11:54:19AM -0700, Stephen Hemminger wrote: On Tue, 13 Mar 2007 21:50:20 +0100 Willy Tarreau [EMAIL PROTECTED] wrote: [...] ( cut my boring part ) Here are the results classed by speed : /* Sample output on a Pentium-M 600 MHz : Function clocks mean(us) max(us) std(us) Avg err size ncubic_tab0 79 0.66 7.20 1.04 0.613% 160 ncubic_0div 84 0.70 7.64 1.57 4.521% 192 ncubic_1div 178 1.4816.27 1.81 0.443% 336 ncubic_tab1 179 1.4916.34 1.85 0.195% 320 ncubic_ndiv3 263 2.1824.04 3.59 0.250% 512 ncubic_2div 270 2.2424.70 2.77 0.187% 512 ncubic32_1 359 2.9832.81 3.59 0.238% 544 ncubic_3div 361 2.9933.08 3.79 0.170% 656 ncubic32 364 3.0233.29 3.51 0.247% 544 ncubic 529 4.3948.39 4.92 0.247% 720 hcbrt539 4.4749.25 5.98 1.580% 96 ocubic 732 4.9361.83 7.22 0.274% 320 acbrt842 6.9876.73 8.55 0.275% 192 bictcp 1032 6.9586.30 9.04 0.172% 768 [...] The following version of div64_64 is faster because do_div already optimized for the 32 bit case.. Cool, this is interesting because I first wanted to optimize it but did not find how to start with this. You seem to get very good results. BTW, you did not append your changes. However, one thing I do not understand is why your avg error is about 1/3 below the original one. Was there a precision bug in the original div_64_64 or did you extend the values used in the test ? Or perhaps you used -fast-math to build and the original cbrt() is less precise in this case ? I get the following results on ULV Core Solo (ie slow current processor) and the following on 64bit Core Duo. ncubic_tab1 seems like the best (no additional error and about as fast) OK. It was the one I preferred too unless tab0's avg error was acceptable. ULV Core Solo Function clocks mean(us) max(us) std(us) Avg err size ncubic_tab0 19211.2445.1015.28 0.450% -2262 ncubic_0div 20111.7747.2327.40 3.357% -2404 ncubic_1div 32419.0276.3225.82 0.189% -2567 ncubic_tab1 32619.1376.7323.71 0.043% -2059 ncubic_2div 45626.72 108.92 493.16 0.028% -2790 ncubic_ndiv3 46327.15 133.37 1889.39 0.104% -3344 ncubic32 54932.18 130.59 508.97 0.041% -3794 ncubic32_1 57433.66 138.32 548.48 0.029% -3604 ncubic_3div 58134.04 140.24 608.55 0.018% -3050 ncubic 73342.92 173.35 523.19 0.041% 299 ocubic 104661.25 283.68 3305.65 0.027% -2232 acbrt 114967.32 284.91 1941.55 0.029% 168 bictcp 166397.41 394.29 604.86 0.017% 628 Core 2 Duo Function clocks mean(us) max(us) std(us) Avg err size ncubic_0div 74 0.03 1.60 0.07 3.357% -2101 ncubic_tab0 74 0.03 1.60 0.04 0.450% -2029 ncubic_1div 142 0.07 3.11 1.05 0.189% -2195 ncubic_tab1 144 0.07 3.18 1.02 0.043% -1638 ncubic_2div 216 0.10 4.74 1.07 0.028% -2326 ncubic_ndiv3 219 0.10 4.76 1.04 0.104% -2709 ncubic32 269 0.13 5.87 1.13 0.041% -1500 ncubic32_1 272 0.13 5.92 1.10 0.029% -2881 ncubic 273 0.13 5.96 1.13 0.041% -1763 ncubic_3div 290 0.14 6.32 1.01 0.018% -2499 acbrt430 0.20 9.42 1.18 0.029% 77 ocubic 444 0.21 9.82 1.82 0.027% -1924 bictcp 549 0.2612.06 1.68 0.017% 236 Thanks, Willy - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Network performance degradation from 2.6.11.12 to 2.6.16.20
Hi Andi, On Mon, Jun 19, 2006 at 05:24:31PM +0200, Andi Kleen wrote: If you use pmtmr try to reboot with kernel option clock=tsc. That's dangerous advice - when the system choses not to use TSC it often has a reason. On my Opteron AMD system i normally can route 400 kpps, but with timesource pmtmr i could only route around 83 kpps. (I found the timer to be the issue by using oprofile). Unless you're using packet sniffing or any other application that requests time stamps on a socket then the timer shouldn't make much difference. Incoming packets are only time stamped when someone asks for the timestamps. I encountered the same problem on a dual core opteron equipped with a broadcom NIC (tg3) under 2.4. It could receive 1 Mpps when using TSC as the clock source, but the time jumped back and forth, so I changed it to 'notsc', then the performance dropped dramatically to around the same value as above with one CPU saturated. I suspect that the clock precision is needed by the tg3 driver to correctly decide to switch to polling mode, but unfortunately, the performance drop rendered the solution so much unusable that I finally decided to use it only in uniprocessor with TSC enabled. -Andi Regards, Willy - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] getsockopt() early argument sanity checking
On Sun, Aug 20, 2006 at 03:05:32AM +0400, Solar Designer wrote: Willy, I propose the attached patch (extracted from 2.4.33-ow1) for inclusion into 2.4.34-pre. (2.6 kernels could benefit from the same change, too, but at the moment I am dealing with proper submission of generic changes like this that are a part of 2.4.33-ow1.) The patch makes getsockopt(2) sanity-check the value pointed to by the optlen argument early on. This is a security hardening measure intended to prevent exploitation of certain potential vulnerabilities in socket type specific getsockopt() code on UP systems. This change has been a part of -ow patches for some years. looks valid to me, merged. Thanks Alexander ! Willy Thanks, -- Alexander Peslyak solar at openwall.com GPG key ID: B35D3598 fp: 6429 0D7E F130 C13E C929 6447 73C3 A290 B35D 3598 http://www.openwall.com - bringing security into open computing environments diff -urpPX nopatch linux-2.4.33/net/socket.c linux/net/socket.c --- linux-2.4.33/net/socket.c Wed Jan 19 17:10:14 2005 +++ linux/net/socket.cSat Aug 12 08:51:47 2006 @@ -1307,10 +1307,18 @@ asmlinkage long sys_setsockopt(int fd, i asmlinkage long sys_getsockopt(int fd, int level, int optname, char *optval, int *optlen) { int err; + int len; struct socket *sock; if ((sock = sockfd_lookup(fd, err))!=NULL) { + /* XXX: insufficient for SMP, but should be redundant anyway */ + if (get_user(len, optlen)) + err = -EFAULT; + else + if (len 0) + err = -EINVAL; + else if (level == SOL_SOCKET) err=sock_getsockopt(sock,level,optname,optval,optlen); else - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] getsockopt() early argument sanity checking
On Sun, Aug 20, 2006 at 02:05:20AM +0200, Michael Buesch wrote: On Sunday 20 August 2006 01:48, Willy Tarreau wrote: On Sun, Aug 20, 2006 at 03:05:32AM +0400, Solar Designer wrote: Willy, I propose the attached patch (extracted from 2.4.33-ow1) for inclusion into 2.4.34-pre. (2.6 kernels could benefit from the same change, too, but at the moment I am dealing with proper submission of generic changes like this that are a part of 2.4.33-ow1.) The patch makes getsockopt(2) sanity-check the value pointed to by the optlen argument early on. This is a security hardening measure intended to prevent exploitation of certain potential vulnerabilities in socket type specific getsockopt() code on UP systems. This change has been a part of -ow patches for some years. looks valid to me, merged. Not to me. It heavily violates codingstyle and screws brains ^^^ little exageration detected here. with the non-indented else branches. while they surprized me first, they make the *patch* more readable by clearly showing what has been inserted and where. However, I have joined the lines for the merge. Learn about goto. definitely not here. The if() expressions are all one-liners. Adding a goto would mean two instructions, to which you add 2 braces. It will not make the code more readable. Patch below is OK. If you have a hard time understanding it, then it's because it's bedtime for you too :-) Regards, Willy diff --git a/net/socket.c b/net/socket.c index ac45b13..910ef88 100644 --- a/net/socket.c +++ b/net/socket.c @@ -1307,11 +1307,17 @@ asmlinkage long sys_setsockopt(int fd, i asmlinkage long sys_getsockopt(int fd, int level, int optname, char *optval, int *optlen) { int err; + int len; struct socket *sock; if ((sock = sockfd_lookup(fd, err))!=NULL) { - if (level == SOL_SOCKET) + /* XXX: insufficient for SMP, but should be redundant anyway */ + if (get_user(len, optlen)) + err = -EFAULT; + else if (len 0) + err = -EINVAL; + else if (level == SOL_SOCKET) err=sock_getsockopt(sock,level,optname,optval,optlen); else err=sock-ops-getsockopt(sock, level, optname, optval, optlen); -- 1.4.1 - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] getsockopt() early argument sanity checking
On Sun, Aug 20, 2006 at 10:34:43AM +0200, Andi Kleen wrote: On Sunday 20 August 2006 01:05, Solar Designer wrote: I propose the attached patch (extracted from 2.4.33-ow1) for inclusion into 2.4.34-pre. (2.6 kernels could benefit from the same change, too, but at the moment I am dealing with proper submission of generic changes like this that are a part of 2.4.33-ow1.) In general I don't think it makes sense to submit stuff for 2.4 that isn't in 2.6. I generally agree with you on this, but I think that when they are just preventive measures, they can be applied in whatever order, provided that they don't get lost. The patch makes getsockopt(2) sanity-check the value pointed to by the optlen argument early on. This is a security hardening measure intended to prevent exploitation of certain potential vulnerabilities in socket type specific getsockopt() code on UP systems. It's not only insufficient on SMP, but even on UP where a thread can sleep in get_user and another one can run in this time. Valid point. Doing a check that is inherently racy everywhere doesn't seem like a security improvement to me. If there is really a length checking bug somewhere it needs to be fixed in a race-free way. The only race-free solution would be to add an argument to all getsockopt() functions and pass them the decoded value. There are other places where multiple get_user() are performed on optlen, with some useless tests (eg in net/ipv4/tcp.c, len is checked for 0 after having been compared to sizeof(int) as unsigned int) and all those places would benefit from this. But I don't want to induce such large changes in this kernel. The goal of this test is a preventive measure to catch easily exploitable errors that might have remained undetected. For instance, a quick glance shows this portion of code in net/ipv4/raw.c (both 2.4 and 2.6) : static int raw_seticmpfilter(struct sock *sk, char *optval, int optlen) { if (optlen sizeof(struct icmp_filter)) optlen = sizeof(struct icmp_filter); if (copy_from_user(sk-tp_pinfo.tp_raw4.filter, optval, optlen)) return -EFAULT; return 0; } It only relies on sock_setsockopt() refusing optlen values sizeof(int), and this is not documented. Having part of this code being copied for use in another code path would open a breach for optlen 0. If not then there is no need for a change. There are two tests in this patch : - one on the validity of the optlen address. This one is race-free and should be conserved anyway. - one on the optlen range which is valid for most cases but which is subject to a race condition and which might be circumvented by carefully written code and with some luck as in all race conditions issues. Some will say that this last one offers a first level of protection by transforming determinist attacks into racy attacks which make potential vulnerabilities much harder to exploit by code injection. I'm one of them. Others will consider it totally useless because it does not cover all cases, but I think it is against the general principle of precaution we try to apply in security. -Andi Regards, Willy - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] getsockopt() early argument sanity checking
Hi David, On Sun, Aug 20, 2006 at 12:44:27PM -0700, David Miller wrote: From: Willy Tarreau [EMAIL PROTECTED] Date: Sun, 20 Aug 2006 02:43:07 +0200 On Sun, Aug 20, 2006 at 02:05:20AM +0200, Michael Buesch wrote: Not to me. It heavily violates codingstyle and screws brains ^^^ little exageration detected here. with the non-indented else branches. while they surprized me first, they make the *patch* more readable by clearly showing what has been inserted and where. However, I have joined the lines for the merge. Thanks for consulting the networking maintainer before merging this. :-/ I'm sorry, will try to do better next time :-( I only queue the patches locally while they're being discussed anyway. What if some sockopt treats a negative length specially? Maybe some setsockopt() doesn't care about the optlen pointer? Which should not be a reason for userspace to respect the calling convention. From man 2 getsockopt, it's clear that optlen is the size of the buffer : For getsockopt, optlen is a value-result parameter, initially containing the size of the buffer pointed to by optval, and modified on return to indicate the actual size of the value returned. Where I can agree with you is on this part of the man : If no option value is to be supplied or returned, optval may be NULL. Nothing is said about optlen's validity in such a case. This toplevel code has no buisness interpreting the arguments when the downcall and argument interpretation is by definition protocol specific. Generally, the advantage of doing checks early is that they help enforcing conventions and sometimes protect against a stupid bug in one of the multiple leaves. It should in no way be an excuse to keep the remaining stupid bugs when we spot them. It also means we'll touch userspace twice for this value which is really dumb. If this is that dumb, then why is it done twice in tcp_getsockopt() for TCP_INFO ? would you accept a fix which copies it in a local variable instead ? 2.6 would also benefit from this for TCP_CONGESTION. The only nice part about this change is that it allows us to be lazy about auditing the individual setsockopt() implementations. This is absolutely not the intent. When we merged Julien Tinnes's fixes for NULL pointer checks, it looked like the callers already performed the tests, but that was not a reason not to complete the test in the leaf functions. I'd rather fix the broken cases than add a patch which just assumes they are broken and not worth fixing We're not assuming they're broken. When some code is maintained by many people and when conventions differ between similar functions (eg: setsockopt does the check at top level and getsockopt in the leaves), it should be expected that we populate the CVE lists from time to time when we decide that there will always be one single check for every argument. And this is true for all system parts. and also imposes a convention for the optlen argument. I would better understand this one considering the fact that the man is vague on this matter. No thanks. OK, fine, I drop it. People who seek better security know where to find hardening patches anyway. Willy - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: TCP SACK issue, hung connection, tcpdump included
Hi, [CCing netdev as it's where people competent on the subject are] On Sun, Jul 29, 2007 at 06:59:26AM +0100, Darryl L. Miles wrote: CLIENT = Linux 2.6.20.1-smp [Customer build] SERVER = Linux 2.6.9-55.ELsmp [Red Hat Enterprise Linux AS release 4 (Nahant Update 5)] The problems start around time index 09:21:39.860302 when the CLIENT issues a TCP packet with SACK option set (seemingly for a data segment which has already been seen) from that point on the connection hangs. Where was the capture taken ? on CLIENT or on SERVER (I suspect client from the timers) ? A possible, but very unlikely reason would be an MTU limitation somewhere, because the segment which never gets correctly ACKed is also the largest one in this trace. It would be strange as the SACKs are send immediately after, but it should be something to consider anyway. Also, if everything is right with the packets on the server side, then it would means that it's the RHEL kernel which is buggy (which does not mean that the same bug does not exist in mainline). Regards, Willy Full dump available via email. 09:12:13.444999 IP CLIENT.50727 SERVER.ssh: S 2919512080:2919512080(0) win 5840 mss 1460,sackOK,timestamp 799860259 0,nop,wscale 7 09:12:13.535643 IP SERVER.ssh CLIENT.50727: S 492909547:492909547(0) ack 2919512081 win 5792 mss 1460,sackOK,timestamp 7126976 799860259,nop,wscale 2 09:12:13.535717 IP CLIENT.50727 SERVER.ssh: . ack 1 win 46 nop,nop,timestamp 799860282 7126976 09:12:13.665481 IP SERVER.ssh CLIENT.50727: P 1:24(23) ack 1 win 1448 nop,nop,timestamp 7127074 799860282 09:12:13.665895 IP CLIENT.50727 SERVER.ssh: . ack 24 win 46 nop,nop,timestamp 799860314 7127074 09:12:13.666044 IP CLIENT.50727 SERVER.ssh: P 1:23(22) ack 24 win 46 nop,nop,timestamp 799860314 7127074 09:12:13.776318 IP SERVER.ssh CLIENT.50727: . ack 23 win 1448 nop,nop,timestamp 7127216 799860314 ...SNIPPED SUCCESSFUL TRAFFIC... 09:21:39.490740 IP SERVER.ssh CLIENT.50727: P 18200:18464(264) ack 2991 win 2728 nop,nop,timestamp 7692910 81727 09:21:39.490775 IP CLIENT.50727 SERVER.ssh: . ack 18464 win 378 nop,nop,timestamp 81755 7692910 09:21:39.860245 IP SERVER.ssh CLIENT.50727: . 12408:13856(1448) ack 2991 win 2728 nop,nop,timestamp 7693293 81749 09:21:39.860302 IP CLIENT.50727 SERVER.ssh: . ack 18464 win 378 nop,nop,timestamp 81847 7692910,nop,nop,sack sack 1 {12408:13856} 09:21:40.453440 IP SERVER.ssh CLIENT.50727: . 12408:13856(1448) ack 2991 win 2728 nop,nop,timestamp 7693887 81749 09:21:40.453495 IP CLIENT.50727 SERVER.ssh: . ack 18464 win 378 nop,nop,timestamp 81996 7692910,nop,nop,sack sack 1 {12408:13856} 09:21:41.641821 IP SERVER.ssh CLIENT.50727: . 12408:13856(1448) ack 2991 win 2728 nop,nop,timestamp 7695075 81749 09:21:41.641938 IP CLIENT.50727 SERVER.ssh: . ack 18464 win 378 nop,nop,timestamp 82293 7692910,nop,nop,sack sack 1 {12408:13856} 09:21:44.017552 IP SERVER.ssh CLIENT.50727: . 12408:13856(1448) ack 2991 win 2728 nop,nop,timestamp 7697451 81749 09:21:44.017622 IP CLIENT.50727 SERVER.ssh: . ack 18464 win 378 nop,nop,timestamp 82887 7692910,nop,nop,sack sack 1 {12408:13856} 09:21:48.770051 IP SERVER.ssh CLIENT.50727: . 12408:13856(1448) ack 2991 win 2728 nop,nop,timestamp 7702203 81749 09:21:48.770099 IP CLIENT.50727 SERVER.ssh: . ack 18464 win 378 nop,nop,timestamp 84075 7692910,nop,nop,sack sack 1 {12408:13856} 09:21:58.274061 IP SERVER.ssh CLIENT.50727: . 12408:13856(1448) ack 2991 win 2728 nop,nop,timestamp 7711707 81749 09:21:58.274180 IP CLIENT.50727 SERVER.ssh: . ack 18464 win 378 nop,nop,timestamp 86450 7692910,nop,nop,sack sack 1 {12408:13856} 09:22:17.282035 IP SERVER.ssh CLIENT.50727: . 12408:13856(1448) ack 2991 win 2728 nop,nop,timestamp 7730715 81749 09:22:17.282153 IP CLIENT.50727 SERVER.ssh: . ack 18464 win 378 nop,nop,timestamp 800011202 7692910,nop,nop,sack sack 1 {12408:13856} 09:22:55.298955 IP SERVER.ssh CLIENT.50727: . 12408:13856(1448) ack 2991 win 2728 nop,nop,timestamp 7768731 81749 09:22:55.299023 IP CLIENT.50727 SERVER.ssh: . ack 18464 win 378 nop,nop,timestamp 800020705 7692910,nop,nop,sack sack 1 {12408:13856} 09:24:11.329853 IP SERVER.ssh CLIENT.50727: . 12408:13856(1448) ack 2991 win 2728 nop,nop,timestamp 7844763 81749 09:24:11.329923 IP CLIENT.50727 SERVER.ssh: . ack 18464 win 378 nop,nop,timestamp 800039711 7692910,nop,nop,sack sack 1 {12408:13856} 09:26:11.331578 IP SERVER.ssh CLIENT.50727: . 12408:13856(1448) ack 2991 win 2728 nop,nop,timestamp 7964763 81749 09:26:11.331699 IP CLIENT.50727 SERVER.ssh: . ack 18464 win 378 nop,nop,timestamp 800069708 7692910,nop,nop,sack sack 1 {12408:13856} 09:26:13.011885 IP CLIENT.50727 SERVER.ssh: P 2991:3039(48) ack 18464 win 378 nop,nop,timestamp 800070128 7692910 09:26:13.309032 IP CLIENT.50727 SERVER.ssh: P 2991:3039(48) ack 18464 win 378
Re: TCP SACK issue, hung connection, tcpdump included
On Sun, Jul 29, 2007 at 11:26:00AM +0300, Ilpo Järvinen wrote: On Sun, 29 Jul 2007, Willy Tarreau wrote: On Sun, Jul 29, 2007 at 06:59:26AM +0100, Darryl L. Miles wrote: CLIENT = Linux 2.6.20.1-smp [Customer build] SERVER = Linux 2.6.9-55.ELsmp [Red Hat Enterprise Linux AS release 4 (Nahant Update 5)] The problems start around time index 09:21:39.860302 when the CLIENT issues a TCP packet with SACK option set (seemingly for a data segment which has already been seen) from that point on the connection hangs. ...That's DSACK and it's being correctly sent. To me, it seems unlikely to be the cause for this breakage... Where was the capture taken ? on CLIENT or on SERVER (I suspect client from the timers) ? ...I would guess the same based on SYN timestamps (and from the DSACK timestamps)... A possible, but very unlikely reason would be an MTU limitation somewhere, because the segment which never gets correctly ACKed is also the largest one in this trace. Limitation for 48 byte segments? You have to be kidding... :-) But yes, it seems that one of the directions is dropping packets for some reason though I would not assume MTU limitation... Or did you mean some other segment? No, I was talking about the 1448 bytes segments. But in fact I don't believe it much because the SACKs are always retransmitted just afterwards. BTW, some information are missing. It would have been better if the trace had been read with tcpdump -Svv. We would have got seq numbers and ttl. Also, we do not know if there's a firewall between both sides. Sometimes, some IDS identify attacks in crypted traffic and kill connections. It might have been the case here, with the connection closed one way on an intermediate firewall. Regards, Willy - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: TCP SACK issue, hung connection, tcpdump included
On Sun, Jul 29, 2007 at 12:28:04PM +0300, Ilpo Järvinen wrote: (...) Limitation for 48 byte segments? You have to be kidding... :-) But yes, it seems that one of the directions is dropping packets for some reason though I would not assume MTU limitation... Or did you mean some other segment? No, I was talking about the 1448 bytes segments. But in fact I don't believe it much because the SACKs are always retransmitted just afterwards. Ah, but it's ACKed correctly right below it...: [...snip...] 09:21:39.490740 IP SERVER.ssh CLIENT.50727: P 18200:18464(264) ack 2991 win 2728 nop,nop,timestamp 7692910 81727 09:21:39.490775 IP CLIENT.50727 SERVER.ssh: . ack 18464 win 378 nop,nop,timestamp 81755 7692910 09:21:39.860245 IP SERVER.ssh CLIENT.50727: . 12408:13856(1448) ack 2991 win 2728 nop,nop,timestamp 7693293 81749 ...segment below snd_una arrived = snd_una remains 18464, receiver generates a duplicate ACK: 09:21:39.860302 IP CLIENT.50727 SERVER.ssh: . ack 18464 win 378 nop,nop,timestamp 81847 7692910,nop,nop,sack sack 1 {12408:13856} The cumulative ACK field of it covers _everything_ below 18464 (i.e., it ACKs them), including the 1448 bytes in 12408:13856... In addition, the SACK block is DSACK information [RFC2883] telling explicitly the address of the received duplicate block. However, if this ACK doesn't reach the SERVER TCP, RTO is triggered and the first not yet cumulatively ACKed segment is retransmitted (I guess cumulative ACKs up to 12408 arrived without problems to the SERVER): Oh yes, you're damn right. I did not notice that the ACK was higher than the SACK, I'm more used to read traces with absolute rather than relative seq/acks. So I agree, it is this ACK which is lost between client and server, reinforcing the supposition about the location of the capture (client side). [...snip...] BTW, some information are missing. It would have been better if the trace had been read with tcpdump -Svv. We would have got seq numbers and ttl. Also, we do not know if there's a firewall between both sides. Sometimes, some IDS identify attacks in crypted traffic and kill connections. It might have been the case here, with the connection closed one way on an intermediate firewall. Yeah, firewall or some other issue, I'd say it's quite unlikely a bug in TCP because behavior to both directions indicate client - sender blackhole independently of each other... It would also be possible that something stupid between both ends simply drops packets with the SACK option set. Also something which sometimes happen is that some firewalls automatically translate sequence numbers but not necessarily SACK values, which could pretty well lead to this packet being received but ignored on the server side. I'm pretty sure that the same trace taken on the server side will reveal the reason for the problem. Willy - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] e100 module loads 1/2 times
Hi Auke, Jesse, for a long time, I've been annoyed by version 3.5.17 of the e100 driver which refuses to load on first time and only loads on second time. Since I always had the original 2.3.43 driver in kernel 2.4, I did not care that much. Recently, I encountered real troubles with 2.3.43 in a 802.1q setup (basically it did not untag incoming frames). So I decided to give 3.5.17 a second try. Same problem, I had to load it twice. I finally found the problem in e100_init_module(). Up to 3.5.14, the return of pci_module_init() was returned. This one equals zero if everything went fine, 0 otherwise, which is compatible with init_module(). With 3.5.17, the result comes from pci_register_driver(), which returns the number of devices registered. So the problem now makes sense : - first call: the driver registers itself and returns non-zero, which is an error for insmod - second call: the driver cannot register again and returns zero new drivers, which is good for insmod. Note that e1000 has a related bug : it uses the return from pci_module_init() to decide whether or not to register a reboot notifier. Fortunately, the test is performed with ret=0, which matches ==0 and not 0 (errors), so this works as a side effect. The obvious fix for e100 reusing pci_module_init is below. Also, since e100-2.3.43 does not work at all with vlans in 2.4, I was thinking about upgrading it to 3.5.17. It would also be the same version as in 2.6, simplifying its long-term maintenance. What do you think about this ? Best regards, Willy --- e100-3.5.17/src/e100.c.orig 2007-08-13 08:53:18 +0200 +++ e100-3.5.17/src/e100.c 2007-08-13 09:24:56 +0200 @@ -2934,13 +2934,13 @@ printk(KERN_INFO PFX %s\n, DRV_COPYRIGHT); } #ifdef E100_USE_REBOOT_NOTIFIER - retval = pci_register_driver(e100_driver); - if (retval = 0) + retval = pci_module_init(e100_driver); + if (retval == 0) register_reboot_notifier(e100_notifier_reboot); return retval; #else - return pci_register_driver(e100_driver); + return pci_module_init(e100_driver); #endif } - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Marvell 88E8056 gigabit ethernet controller
On Sat, Aug 18, 2007 at 04:45:26AM -0700, Kevin E wrote: --- Willy Tarreau [EMAIL PROTECTED] wrote: No Stephen, look again, he says that moving the video card into the broken system does not change anything. Correct, I've used three different video cards in the broken machine. I've used an old PCI vid card, the PCI-X vid card from the working machine, and now PCI-X card I just bought yesterday (nvidia based). None have affected whether the Marvell chipset works or not. Also, the broken machine is a server so I don't start X on it. It just sits in console mode all the time. The CPU (Core2 E4400) used to be in the working machine, then upgraded it to the Q6600 and put the old E4400 in the new MB that became the broken machine. Memory was brand new out of the package. So I've used the E4400 in a MB that worked fine. I don't understand why the working one is on PCI bus 3 while the other is on PCI bus 4. It's just as if the chip embedded a PCI bridge. Maybe those chips are just cheaper dual-channel controllers with one faulty controller disabled. It would also explain why the PCI ID is different. I've attached the lspci -vvv output for the two machines, if it doesn't come through just let me know how I can get it to you. OK, in this trace, both controllers are on the same bus. The broken one has 'Capabilities: [100] Advanced Error Reporting' the other does not have, and the bridge to this bus has two more capabilities : 'Capabilities: [100] Virtual Channel' and 'Capabilities: [180] Unknown (5)'. I don't know whether it can jutify a different behaviour. Also, maybe this is caused by a minuscule difference in the BIOS setup ? Regards, Willy - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [2.6.20.17 review 35/58] forcedeth bug fix: realtek phy
On Wed, Aug 22, 2007 at 11:56:42AM -0400, Chuck Ebbert wrote: On 08/22/2007 05:39 AM, Willy Tarreau wrote: This patch contains errata fixes for the realtek phy. It only renamed the defines to be phy specific. Signed-off-by: Ayaz Abdulla [EMAIL PROTECTED] Signed-off-by: Greg Kroah-Hartman [EMAIL PROTECTED] Signed-off-by: Willy Tarreau [EMAIL PROTECTED] --- drivers/net/forcedeth.c | 54 +++ 1 files changed, 54 insertions(+), 0 deletions(-) diff --git a/drivers/net/forcedeth.c b/drivers/net/forcedeth.c index c383dc3..dbfdbed 100644 --- a/drivers/net/forcedeth.c +++ b/drivers/net/forcedeth.c @@ -554,6 +554,7 @@ union ring_type { #define PHY_OUI_MARVELL0x5043 #define PHY_OUI_CICADA 0x03f1 #define PHY_OUI_VITESSE0x01c1 +#define PHY_OUI_REALTEK0x01c1 #define PHYID1_OUI_MASK0x03ff #define PHYID1_OUI_SHFT6 #define PHYID2_OUI_MASK0xfc00 Realtek is 0x0732 This is still wrong upstream -- what happened to the patch to fix it? Good catch, thanks Chuck! I've already seen the fix somewhere, I believe it was on netdev, though I'm not sure. I'm fixing the patch in place right now. I can add your signoff if you want. Cheers, Willy - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] fix realtek phy id in forcedeth
Hi Greg, On Thu, Aug 23, 2007 at 09:55:13AM -0700, Greg KH wrote: It might help if someone sends a real patch that can be applied :) This is getting really silly now :-) We're all wasting more time wondering who will send the patch than posting it. I've lost, I got fed up first, so here it is. Please apply to mainline then stable. Thanks, Willy -- From a0e2922b99eedd9863232368ea2afe072c52783e Mon Sep 17 00:00:00 2001 From: Willy Tarreau [EMAIL PROTECTED] Date: Thu, 23 Aug 2007 21:35:41 +0200 Subject: [PATCH] fix realtek phy id in forcedeth As noticed by Chuck Ebbert, commit c5e3ae8823693b260ce1f217adca8add1bc0b3de introduced a copy-paste typo, as realtek phy is 0x732 and not 0x1c1. Obvious fix below suggested by Ayaz Abdulla. Signed-off-by: Willy Tarreau [EMAIL PROTECTED] Cc: Ayaz Abdulla [EMAIL PROTECTED] Cc: Chuck Ebbert [EMAIL PROTECTED] --- drivers/net/forcedeth.c |2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/drivers/net/forcedeth.c b/drivers/net/forcedeth.c index 10f4e3b..1938d6d 100644 --- a/drivers/net/forcedeth.c +++ b/drivers/net/forcedeth.c @@ -552,7 +552,7 @@ union ring_type { #define PHY_OUI_MARVELL0x5043 #define PHY_OUI_CICADA 0x03f1 #define PHY_OUI_VITESSE0x01c1 -#define PHY_OUI_REALTEK0x01c1 +#define PHY_OUI_REALTEK0x0732 #define PHYID1_OUI_MASK0x03ff #define PHYID1_OUI_SHFT6 #define PHYID2_OUI_MASK0xfc00 -- 1.5.2.5 - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: 2.6.22.5 forcedeth timeout hang
On Thu, Aug 23, 2007 at 06:48:23PM -0500, Mr. Berkley Shands wrote: 100% reproducible hang on xmit timeout. Just do a make -j4 modules on an nfs mounted kernel source. Most likely you also had the problem with 2.6.22.2 (maybe you have not tested this one, though). There were bug fixes for forcedeth introduced in this version, one of them being buggy. The patch below fixes it. Can you please give it a try ? If it does not fix the problem, please try 2.6.22.1 which does not include those changes. I'm interested because I have those changes pending for 2.6.20.17 too. diff --git a/drivers/net/forcedeth.c b/drivers/net/forcedeth.c index 10f4e3b..1938d6d 100644 --- a/drivers/net/forcedeth.c +++ b/drivers/net/forcedeth.c @@ -552,7 +552,7 @@ union ring_type { #define PHY_OUI_MARVELL0x5043 #define PHY_OUI_CICADA 0x03f1 #define PHY_OUI_VITESSE0x01c1 -#define PHY_OUI_REALTEK0x01c1 +#define PHY_OUI_REALTEK0x0732 #define PHYID1_OUI_MASK0x03ff #define PHYID1_OUI_SHFT6 #define PHYID2_OUI_MASK0xfc00 Thanks, Willy - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] e100 module loads 1/2 times
On Tue, Aug 28, 2007 at 02:43:23PM -0700, Brandeburg, Jesse wrote: Willy Tarreau wrote: --- e100-3.5.17/src/e100.c.orig 2007-08-13 08:53:18 +0200 +++ e100-3.5.17/src/e100.c 2007-08-13 09:24:56 +0200 @@ -2934,13 +2934,13 @@ printk(KERN_INFO PFX %s\n, DRV_COPYRIGHT); } #ifdef E100_USE_REBOOT_NOTIFIER - retval = pci_register_driver(e100_driver); - if (retval = 0) + retval = pci_module_init(e100_driver); + if (retval == 0) register_reboot_notifier(e100_notifier_reboot); return retval; #else - return pci_register_driver(e100_driver); + return pci_module_init(e100_driver); #endif } excellent catch willy! Auke will make sure this gets upstream. Thanks for your investigative work! So strange that we had never gotten other reports of it. I think that many places where e100 is found, there are at least 2 NICs, and probably that the startup scripts load the module twice, resulting in it finally working. Could we update the e100 driver in 2.4 to 3.5.17+this fix ? The current e100 driver (2.3.43) is plain broken regarding VLAN support. I already have it working in my own kernels, and it definitely put an end to all the problems I was facing before. Thanks, Willy - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: sk98lin for 2.6.23-rc1
On Tue, Sep 11, 2007 at 05:03:57PM +0200, Adrian Bunk wrote: On Tue, Sep 11, 2007 at 10:29:47AM -0400, Bill Davidsen wrote: So if you want people to try a new driver, I think it really has to have some benefits to the users, in terms of performance, reliability, or features. Cleaner design doesn't motivate, and it does raise the question of why the old driver wasn't just cleaned up. I've been doing software for decades, I appreciate why, but users in general just want to use their system. Which raises the question of why to delete drivers which work for many or even most users? As I already explained, there is a long term advantage for all users if there is only one driver in the kernel. Not only that. You have to place the switch in its context with history. Stephen, please correct me if I'm wrong, but sk98lin has been randomly working for a very long time. Not 100% the driver's fault, because it has had to workaround a lot of chips bugs. The fact that this driver supports *all* chips in the family makes it harder to identify whether problems are caused by the hardware or by the driver because it is bloated with tons of if/else. I've personally encountered random data corruption on the receive path with PCI-E hardware with sk98lin, as well as random TX stops. Sometimes it would require one terabyte of data, sometimes just a few hundreds megs. On other hardware (skge now), UDP would simply stop being sent and some TCP traffic was necessary to restart UDP! One guy at Marvell once asked me for more information, but it was not easy to provide much more, given the randomness of the problems! Stephen has done an excellent (and thankless) job at restarting from scratch, and the idea to separate the two chips was a good one IMHO. The problem is that he might have thought that most of the bugs were in the driver, while most of them are in the hardware, and this requires a lot of workarounds, which do not always work the same for everybody (I remember having tried to disable flow control with sk98lin because it helped with sky2). In parallel, sk98lin has improved on the vendor's site. v8 exhibited all the problems I explained above, but v10 has fixed a lot of them, making the new sk98lin more reliable. In parallel, sky2 and skge had got wider acceptance and testing. The nastiest hardware bugs will slowly surface, a good deal of driver bugs have been detected too (and that's expected from any new driver). It is possible that after 2 or 3 patches, a lot of the remaining problems will suddenly vanish. But it's also possible that the driver will still not work for 1% of people for 1 or 2 years because of some obscure hardware combinations which trigger some obscure hardware bugs. Therefore all users should switch away from obsolete drivers to the replacement drivers, and the obsolete driver will be removed at some point in time. The only question is how to do it. Desktop users genreally have no problem experimenting with multiple kernels or drivers. They can report feedback too, but generally, they're not very good at downloading alternative drivers and patching their kernel with those. Server users cannot experiment for a long time. After 2 or 3 losses of service, they *have* to provide a definitive solution. For some of them when sky2 fails, it may very well be to switch over to sk98lin. Downloading from the vendor's site and patching is not a problem for those users, but it causes them the trouble of updating the kernel for security fixes, so the old driver must be shipped with the kernel. However, I remember something which might constitute a solution. In 2.4, there's a small bug in the kbuild process on alpha. One question is always asked during make oldconfig. Its saved value is ignored because of the way it is computed. I don't know if we could do this with 2.6 kbuild. It would then be nice to always set sk98lin to unset if it was set to Y or M, so that at each build, the user has to explicitly state he wants it. It's annoying enough to give the other one a try once in a while, without causing too much trouble to people who really have no other choice right now. What we need with this driver is people being fed up with it, not them being unable to use it as a last resort. Also, given that it has improved over the last years (probably due to competition pressure from sky2/skge), users will even less understand why there is such incentive to remove it. Another trick for obsolete drivers would be to simply remove them from the usual build system, but have them being available for explicit build. Eg: make modules will not build them, but make obsolete-modules would do. Testing a new kernel is no longer a drop in a boot operation if modprobe.conf must be edited to get the network up, and the typical user isn't going to write that shell script to try one or the other driver. The typical user will let his distribution handle this. And MODULE_ALIAS can also
Re: bnx2 dirver's firmware images
On Tue, Sep 18, 2007 at 12:21:50PM -0700, David Miller wrote: From: Michael Chan [EMAIL PROTECTED] Date: Tue, 18 Sep 2007 13:05:51 -0700 The bnx2 firmware changes quite frequently. A new driver quite often requires new firmware to work correctly. Splitting them up makes things difficult for the user. The firmware in tg3 is a lot more mature and I don't expect it to change. I think tg3 is better suited for using request_firmware(). Like I said, I think neither should change and the driver should be fully functional when built statically into the kernel. Michael, doesn't a functional-yet-suboptimal firmware exist ? I mean, just the same principle as we all have kernels, boot CDs, statically built tools, etc... which run everywhere. If you have such a beast, maybe it would be a good start to have it in the kernel, and provide the users with the ability to upgrade the firmware once the system is able to do more complex things. Just a thought... Regards, Willy - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: bnx2 dirver's firmware images
On Tue, Sep 18, 2007 at 02:31:34PM -0700, David Miller wrote: From: Willy Tarreau [EMAIL PROTECTED] Date: Tue, 18 Sep 2007 23:30:25 +0200 On Tue, Sep 18, 2007 at 12:21:50PM -0700, David Miller wrote: From: Michael Chan [EMAIL PROTECTED] Date: Tue, 18 Sep 2007 13:05:51 -0700 The bnx2 firmware changes quite frequently. A new driver quite often requires new firmware to work correctly. Splitting them up makes things difficult for the user. The firmware in tg3 is a lot more mature and I don't expect it to change. I think tg3 is better suited for using request_firmware(). Like I said, I think neither should change and the driver should be fully functional when built statically into the kernel. Michael, doesn't a functional-yet-suboptimal firmware exist ? I mean, just the same principle as we all have kernels, boot CDs, statically built tools, etc... which run everywhere. If you have such a beast, maybe it would be a good start to have it in the kernel, and provide the users with the ability to upgrade the firmware once the system is able to do more complex things. Just a thought... So let's save 60K instead of 80K. That's not for this reason I said this. Michael said the firmware needs to be updated somewhat often. What I was wondering was if it was not possible to stick to a stable one (and hopefully small) so that the driver could require less frequent updates. Sorry if it's not the main point of the discussion, but I grepped on this :-) I mean, the entire discussion is just plain silly :) yes, possibly :-) Cheers, Willy - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
FWD: 2.4.35 e1000 problems
Hi, Dmitry, I've forwarded your mail to the netdev mailing list and to the e1000 maintainers. Auke, does it sound like a known problem ? Maybe someone has seen it in 2.6 ? Just for the record, e1000 in 2.4.35 is 7.3.20-k4. Dmitry, if you have time for a test, I think it would be good if you could test the most recent driver from e1000.sf.net. It might give us some more information about the problem's cause. Otherwise, let's wait for Auke's advices and/or questions. Regards, Willy - Forwarded message from dmpesl [EMAIL PROTECTED] - Date: Mon, 15 Oct 2007 17:54:01 +0400 (MSD) From: dmpesl [EMAIL PROTECTED] Subject: 2.4.35 e1000 problems To: [EMAIL PROTECTED] X-X-Sender: [EMAIL PROTECTED] Hello, Willy on recent month i've changed kernel from 2.4.31-2.4.35 on my Owl-based web server (purchased SATA hdd, onboard SATA 20378), Asus PCH-DR mb, 2*Xeon, and noticed that following error occurs on network card: - Sep 29 21:29:51 host kernel: NETDEV WATCHDOG: eth0: transmit timed out Sep 29 21:29:53 host kernel: e1000: eth0: e1000_watchdog: NIC Link is Up 100 Mbps Full Duplex, Flow Control: RX/TX Sep 29 22:28:03 host kernel: NETDEV WATCHDOG: eth0: transmit timed out Sep 29 22:28:05 host kernel: e1000: eth0: e1000_watchdog: NIC Link is Up 100 Mbps Full Duplex, Flow Control: RX/TX Sep 30 00:51:35 host kernel: NETDEV WATCHDOG: eth0: transmit timed out Sep 30 00:51:37 host kernel: e1000: eth0: e1000_watchdog: NIC Link is Up 100 Mbps Full Duplex, Flow Control: RX/TX - After some conversation with my collocation provider (changing cables and switch port) i've searched google to same errors and found some advises to use ethtool -k tso off, but this is not work with my kernel/distribution. Then i swapped just e1000 driver in 2.4.35 to older, from 2.4.31. Now everything works, without loosing link. Regards, Dmitry - End forwarded message - - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: ISNs and 2.6.22, Was: Re: haproxy linux firewall (netfilter)
On Sat, Oct 20, 2007 at 07:23:25PM +0200, Krzysztof Oledzki wrote: (...) So it seems that ISNs are not randomly incremented but rather randomly generated. Adding netdev@vger.kernel.org to the CC list. Eh, I was little to hurry this time. There were not randomly generated but incremented with to big value. This patch fixes my problem: http://git.kernel.org/?p=linux/kernel/git/stable/stable-queue.git;a=blob;f=queue-2.6.22/fix-tcp-initial-sequence-number-selection.patch;h=05b9167d68ecde1e6088f58c55e2906b768420ed;hb=HEAD Good catch Krzysztof ! I've already noticed that one on LKML but did not make the connection with your problem! Please also tell Jozsef so that if he gets other reports, he knows where to point the reporters. Regards, Willy - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/2] [e1000 VLAN] Disable vlan hw accel when promiscuous mode
On Mon, Nov 12, 2007 at 02:57:16PM -0800, David Miller wrote: From: Chris Friesen [EMAIL PROTECTED] Date: Mon, 12 Nov 2007 16:43:24 -0600 David Miller wrote: When you select VLAN, you by definition are asking for non-VLAN traffic to be elided. It is like plugging the ethernet cable into one switch or another. For max functionality it seems like the raw eth device should show everything on the wire in promiscuous mode. If we want to sniff only the traffic for a specific vlan, we can sniff the vlan device. VLAN settings are a filter of sorts, much like plugging into one switch or another filters traffic physically. If you don't want that filter, turn the VLAN settings off. I don't really agree with that view. Having spent a lot of time with tcpdump on production systems, I can say that sometimes you'd like to be aware that one of your VLANs is wrong and you'd simply like to sniff the wire to guess the correct tag. And on production, you simply cannot remove other VLANs, otherwise you disrupt the service. Basically, what generally happens is that the guy responsible for the switch tells you it's OK now, but for you it isn't and you cannot access the switch. If the solution is to disable VLAN hardware acceleration, I agree that it is very risky to do that without the user being aware of it. But at least we should be able to do this by any means (eg: ethtool) without disabling what's running. And since you made the parallel with a switch, when you receive tagged traffic on a switch port, you generally can mirror that port to another one and catch all VLANs at once. A new feature that is starting to appear is the ability to mirror tagged traffic to a VLAN on another port (which means you get a double 802.1q tag). This is useful for inter-site links between data-centers for instance. Regards, Willy - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/2] [e1000 VLAN] Disable vlan hw accel when promiscuous mode
On Mon, Nov 12, 2007 at 03:19:23PM -0800, David Miller wrote: From: Willy Tarreau [EMAIL PROTECTED] Date: Tue, 13 Nov 2007 00:15:16 +0100 I can say that sometimes you'd like to be aware that one of your VLANs is wrong and you'd simply like to sniff the wire to guess the correct tag. And on production, you simply cannot remove other VLANs, otherwise you disrupt the service. If you were plugged into the wrong physical switch, how might you debug that problem in production? :-) tcpdump on an unfiltered RJ45 tells you a lot of things. Some switches for instance send keepalive packets (ether proto 9000) with their own MAC as source+dest, others send CDP packets, etc... I've very rarely stayed plugged to the wrong switch for too long before discovering it. But that requires the ability to sniff. Really, it's the same issue, just virtualized, as in the name for the feature, VLAN. No, it's not the same, because as soon as you pass through a switch which is not configured for VLANs, it does not make any distinction, and the same DMAC is considered on a single port whatever the VLAN. This is not true with multiple switches. Also, sometimes, being able to see that your port gets flooded with traffic for a VLAN on which you're not configured helps you understand why you cannot fill the port. I'd like that we can use the hardware correctly without having to buy TAPs. It reminds me the discussions about TOE. You claim yourself that TOE is bad in part because you have little if any control on the bugs on the TCP stack on the chip. This is the same here. If I know that the chip does not send me what it receives when configured in promiscuous mode, I can I swear it never received the missing packet ? There's always a doubt. Maybe sometimes the filter is buggy, maybe it only passes tags when the remaining bits are all zero, etc... Promiscuous mode generally means that you're the closest possible to the wire. We already do not receive pause frames and sometimes that's annoying. But having no control over what we want to see is frustrating. At least, being able to disable the feature at module load time would be acceptable. Many people who often need to sniff on decent machines would always keep it disabled. Regards, Willy - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: questions on NAPI processing latency and dropped network packets
On Wed, Jan 16, 2008 at 07:58:36AM +0100, Jarek Poplawski wrote: On Wed, Jan 16, 2008 at 11:17:08AM +1100, Herbert Xu wrote: ... Well people are always going to operate on this model for commercial reasons. FWIW I used to work for a company that stuck to a specific version of the Linux kernel, and I suppose I still do even now :) But the important thing is that if you're going to do that, then the cost that comes with it should be borne by the company and not the community. Sure. But the most sad thing is there seems to be not so much savings in this (unless a company isn't sure of its near future). Trying to upgrade and test current products with current kernels, even if not necessary, should be always useful and make developing of new products faster and better fit (and of course, BTW, make the kernel better on time). you can work with latest release provided that you always have a fallback to an earlier one. That way, you don't bet too much on something you don't completely control. If it works, it tells you you'll be able to completely exploit its new possibilities in next product release, and if it breaks, it's easy to issue a fix to get back to earlier, well-tested version. Regards, Willy -- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] Add IPv6 support to TCP SYN cookies
Hi Andi, Alan, I've run extensive tests with/without syn cookies recently. On Tue, Feb 05, 2008 at 05:39:12PM +0100, Andi Kleen wrote: On Tue, Feb 05, 2008 at 03:42:13PM +, Alan Cox wrote: Syncookies are discouraged these days. They disable too many valuable TCP features (window scaling, SACK) and even without them the kernel is usually strong enough to defend against syn floods and systems have much more memory than they used to be. Somewhat untrue. Network speeds have risen dramatically, the number of With strong I meant Linux has much better algorithms to handle the standard syn queue (syncookies was originally added when it had only dumb head drop) and there are minisocks which also require significantly less overhead to manage than full sockets (less memory etc.) That's true, but not enough, see below. When I wrote syncookies originally that all was not true. appliances running Linux that are not PC class means memory has fallen not risen and CPU has been pretty level. So I don't think it makes much sense to add more code to it, sorry. I think it makes a lot of sense I have my doubts. It would be probably better to recheck everything and then remove syncookies. Also your sub PC class appliances rarely run LISTEN servers anyways that are open to the world. In my tests, I discovered that in fact SYN cookies more benefit high end machines than low-end ones. Let me explain. I noticed that computing the cookie consumes a lot of CPU, which is a real problem on low-end machines. But on the other end, it helps the system continue to respond when otherwise it would not. My tests on an AMD LX800 with max_syn_backlog at 63000 on an HTTP reverse proxy consisted in injecting 250 hits/s of legitimate traffic with 8000 SYN/s of noise. Without SYN cookies, the average response time was about 1.5 second and unstable (due to retransmits), and the CPU was set to 60%. With SYN cookies enabled, the response time dropped to 12-15ms only, but CPU usage jumped to 70%. The difference appears at a higher legitimate traffic rate. At 500 hits/s + 7800 SYN/s, the CPU is just saturated with correct response time (SYN backlog almost full but never full), and the performance slightly goes down with SYN cookies enabled, inducing a drop of the hit rate due to the increased CPU consumption. Till there, one would conclude that SYN cookies are bad. BUT! this was with tcp_synack_retries = 1, which is the optimal situation without SYN cookies under an attack and which is pretty bad for normal usage. The real problem without SYN cookies is that you are forced to support a huge SYN backlog (eg: 2 million entries to sustain 100 Mbps of SYN). And what happens with a large backlog ? You send a lot of retries for each SYN. 5 by default, meaning 6 SYN-ACKs for 1 SYN. Thus, you become a SYN amplifier and the guy in front of you just has to send you 20 Mbps of traffic for you to saturate your 100 Mbps uplink. Also, sending all those SYN-ACKs takes a huge amount of CPU time. With tcp_synack_retries at 0, my machine received 26600 SYN/s, and returned 26600 SYN-ACK/s at 100% CPU. With tcp_synack_retries set to 4, it could only accept 12900 SYN/s, replying with 51200 SYN-ACK/s. So the input load was halfed and the output was doubled. I did not bother going higher. The only solution against this is then to reduce tcp_synack_retries to very low values (0 ideally, to match SYN cookies behaviour), but in this case, you degrade normal traffic 100% of the time, while SYN cookies would only trigger while you're already under attack. My conclusions after those tests was to set tcp_synack_retries to a reasonable value (1 to 3), and set the backlog to the number of half-open sessions your machine can accumulate under a SYN attack without collapsing. You then enable SYN cookies, and they will only trigger when you know that your machine will not be able to sustain the increased load. This solution permits you to accept normal connections when not under attack, with an acceptable number of retransmits and with TCP options working well. Under a moderate attack, the large backlog will still help you accept legitimate connections with all comfort (sack, wscale, ...). Under a massive attack, you will not send more than tcp_synack_retries*backlog packets per tcp_synack_retries period, thus limiting the outbound traffic, plus 1 SYN-ACK per incoming SYN, legitimate or not. At this stage, if your users have a castrated TCP stack in front of them, that's not a problem because you know that otherwise they would not even have been able to connect. So in this regard, SYN cookies are really needed. Last, I've read on DJB's page that SYN cookies do not break TCP beahaviour. Yes they do. If the client waits for the server to talk first, you'd better not lose the first ACK from the client because it will not get retransmitted, and the client will see an ESTABLISHED connection while the server will have
Re: [PATCH] Add IPv6 support to TCP SYN cookies
On Wed, Feb 06, 2008 at 12:52:17AM +0300, Evgeniy Polyakov wrote: Hi Alan. On Tue, Feb 05, 2008 at 09:20:17PM +, Alan Cox ([EMAIL PROTECTED]) wrote: Most (if not all) distributions have them enabled and window growing works just fine. Actually I do not see any reason why connection establishment handshake should prevent any run-time operations at all, even if it was setup during handshake. Syncookies are only triggered if the system is under a load where it would begin to lose connections otherwise. So they merely turn a DoS into a working if slightly slower setup (and 64K windows don't matter for most normal users, especially on mobile devices). SACK is actually a good idea for mobile devices, so preventing syncookies from not getting into account some options (btw, does it work with timestamps and PAWS?) is not a solution. All TCP options negociated during session setup are lost. In fact, some bits (3) are still reserved for the best known value of the MSS, but that's all. The principle of SYN cookies is that the server does not create any session upon the SYN, but builds a sequence number constitued from a hash and the values it absolutely needs to know when the client validates the session with an ACK. I've seen some firewalls acting as SYN gateways which send the options from the server to the client in the first ACK packet from the server. This is normally not allowed, but it seems to work with some TCP stacks (at least for the MSS). One solution would be to extend TCP to officially support this behaviour and to optionally use it along with SYN cookies, but there will always be old clients not compatible with the extension. Regards, Willy -- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC] [PATCH] net: socket: Fix the wrong returns for recvmsg and sendmsg
On Tue, Jun 02, 2015 at 02:43:54PM +0800, Junling Zheng wrote: On 2015/6/2 14:27, Greg KH wrote: On Mon, Jun 01, 2015 at 10:23:57PM -0700, David Miller wrote: From: Junling Zheng zhengjunl...@huawei.com Date: Tue, 2 Jun 2015 12:05:32 +0800 So, the problem commit is 281c9c36 (net: compat: Update get_compat_msghdr() to match copy_msghdr_from_user() behaviour), which fixes db31c55a6fb2 and brings the get_compat_msghdr() in line with copy_msghdr_from_user(). Upstream this got fixed by: 08adb7dabd4874cc5666b4490653b26534702ce0 So the part that makes us not unconditionally return -EFAULT needs to be backported, and that's probably equivalent to the patch your proposed which therefore should be applied. Ok, thanks, now applied. Maybe other stable version also needs this fix:) Yes, from what I'm seeing, at least 3.2 and 2.6.32 need it as well. Thanks, Willy -- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Fw: [Bug 99161] New: 2.6.32.66 PPC Oops in tcp_send_fin
Hi Eric, On Fri, May 29, 2015 at 08:52:11AM -0700, Eric Dumazet wrote: On Fri, 2015-05-29 at 08:12 -0700, Stephen Hemminger wrote: I think 2.6.32 is so old no one will care. A few will still, but at least we must ensure the old guy finishes his days nicely :-) (...) I guess a backport went wrong. Ah crap, sorry about that :-( Willy, please add following to your tree : diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c index 5339f066234b..d1e2895bb63c 100644 --- a/net/ipv4/tcp_output.c +++ b/net/ipv4/tcp_output.c @@ -2136,7 +2136,7 @@ void tcp_send_fin(struct sock *sk) */ if (tskb (tcp_send_head(sk) || tcp_memory_pressure)) { coalesce: - TCP_SKB_CB(skb)-flags |= TCPCB_FLAG_FIN; + TCP_SKB_CB(tskb)-flags |= TCPCB_FLAG_FIN; TCP_SKB_CB(tskb)-end_seq++; tp-write_seq++; if (!tcp_send_head(sk)) { Thanks Eric, I'm queuing this now. Best regards, Willy -- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: 2.6.32.66 tcp regression OOPs
Hi, On Mon, Jun 01, 2015 at 09:00:21AM +0200, Frans Klaver wrote: [cc: Willy Tarreau] On Mon, Jun 1, 2015 at 3:26 AM, starlight.201...@binnacle.cx wrote: Hello, Apoligies if I have submitted to the wrong lists. Encountered a regression in 2.6.32.66 relative to 2.6.32.65. Crash eight minutes after boot. Will responded with additional details if the OOPS is not sufficent. Best Regards Did you bisect it? Eric Dumazet notified me that of something possibly similar due to a mistake I made when backporting a fix by hand. Please apply the following patch to see if it fixes the problem : diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c index 5339f066234b..d1e2895bb63c 100644 --- a/net/ipv4/tcp_output.c +++ b/net/ipv4/tcp_output.c @@ -2136,7 +2136,7 @@ void tcp_send_fin(struct sock *sk) */ if (tskb (tcp_send_head(sk) || tcp_memory_pressure)) { coalesce: - TCP_SKB_CB(skb)-flags |= TCPCB_FLAG_FIN; + TCP_SKB_CB(tskb)-flags |= TCPCB_FLAG_FIN; TCP_SKB_CB(tskb)-end_seq++; tp-write_seq++; if (!tcp_send_head(sk)) { Thanks, Willy -- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: 2.6.32.66 tcp regression OOPs
On Mon, Jun 01, 2015 at 11:32:19AM -0400, starlight.201...@binnacle.cx wrote: Hi, I found the patch late yesterday and applied it. Running fine now for 12 hours under active load. Thank you. Recommend the patch be rolled into the tarball, or a notation added to the release page as this one has severe consequences. I'll emit 2.6.32.67 with it. I didn't know it was that easy to trigger it, and since feedback comes slowly on 2.6.32, I was waiting a bit for more feedback before doing another one. Thank you! Willy -- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/6] net: mvneta: Handle per-cpu interrupts
Hi Maxime, On Fri, Jul 03, 2015 at 04:25:49PM +0200, Maxime Ripard wrote: Now that our interrupt controller is allowing us to use per-CPU interrupts, actually use it in the mvneta driver. This involves obviously reworking the driver to have a CPU-local NAPI structure, and report for incoming packet using that structure. Signed-off-by: Maxime Ripard maxime.rip...@free-electrons.com This patch breaks module build of mvneta unless you export request_percpu_irq : diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c index ec31697..1440a92 100644 --- a/kernel/irq/manage.c +++ b/kernel/irq/manage.c @@ -1799,6 +1799,7 @@ int request_percpu_irq(unsigned int irq, irq_handler_t handler, return retval; } +EXPORT_SYMBOL_GPL(request_percpu_irq); Regards, Willy -- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 6/6] net: mvneta: Statically assign queues to CPUs
Hi Thomas, On Fri, Jul 03, 2015 at 04:46:24PM +0200, Thomas Petazzoni wrote: Maxime, On Fri, 3 Jul 2015 16:25:51 +0200, Maxime Ripard wrote: +static void mvneta_percpu_enable(void *arg) +{ + struct mvneta_port *pp = arg; + + enable_percpu_irq(pp-dev-irq, IRQ_TYPE_NONE); +} + static int mvneta_open(struct net_device *dev) { struct mvneta_port *pp = netdev_priv(dev); @@ -2655,6 +2662,19 @@ static int mvneta_open(struct net_device *dev) goto err_cleanup_txqs; } + /* +* Even though the documentation says that request_percpu_irq +* doesn't enable the interrupts automatically, it actually +* does so on the local CPU. +* +* Make sure it's disabled. +*/ + disable_percpu_irq(pp-dev-irq); + + /* Enable per-CPU interrupt on the one CPU we care about */ + smp_call_function_single(rxq_def % num_online_cpus(), +mvneta_percpu_enable, pp, true); What happens if that CPU goes offline through CPU hotplug? I just tried : if I start mvneta with rxq_def=1, then my irq runs on CPU1. Then I offline CPU1 and the irqs are automatically handled by CPU0. Then I online CPU1 and irqs stay on CPU0. More or less related, I found that if I enable a queue number larger than the CPU count it does work, but then the system complains during rmmod : [ 877.146203] [ cut here ] [ 877.146227] WARNING: CPU: 1 PID: 1731 at fs/proc/generic.c:552 remove_proc_entry+0x144/0x15c() [ 877.146233] remove_proc_entry: removing non-empty directory 'irq/29', leaking at least 'mvneta' [ 877.146238] Modules linked in: mvneta(-) [last unloaded: mvneta] [ 877.146254] CPU: 1 PID: 1731 Comm: rmmod Tainted: GW 4.1.1-mvebu-6-g3d317ed-dirty #5 [ 877.146260] Hardware name: Marvell Armada 370/XP (Device Tree) [ 877.146281] [c0017194] (unwind_backtrace) from [c00126d4] (show_stack+0x10/0x14) [ 877.146293] [c00126d4] (show_stack) from [c04d32e4] (dump_stack+0x74/0x90) [ 877.146305] [c04d32e4] (dump_stack) from [c0025200] (warn_slowpath_common+0x74/0xb0) [ 877.146315] [c0025200] (warn_slowpath_common) from [c00252d0] (warn_slowpath_fmt+0x30/0x40) [ 877.146325] [c00252d0] (warn_slowpath_fmt) from [c0115694] (remove_proc_entry+0x144/0x15c) [ 877.146336] [c0115694] (remove_proc_entry) from [c0060fd4] (unregister_irq_proc+0x8c/0xb0) [ 877.146347] [c0060fd4] (unregister_irq_proc) from [c0059f84] (free_desc+0x28/0x58) [ 877.146356] [c0059f84] (free_desc) from [c005a028] (irq_free_descs+0x44/0x80) [ 877.146368] [c005a028] (irq_free_descs) from [bf015748] (mvneta_remove+0x3c/0x4c [mvneta]) [ 877.146382] [bf015748] (mvneta_remove [mvneta]) from [c024d6e8] (platform_drv_remove+0x18/0x30) [ 877.146393] [c024d6e8] (platform_drv_remove) from [c024bd50] (__device_release_driver+0x70/0xe4) [ 877.146402] [c024bd50] (__device_release_driver) from [c024c5b8] (driver_detach+0xcc/0xd0) [ 877.146411] [c024c5b8] (driver_detach) from [c024bbe0] (bus_remove_driver+0x4c/0x90) [ 877.146425] [c024bbe0] (bus_remove_driver) from [c007d3f0] (SyS_delete_module+0x164/0x1b4) [ 877.146437] [c007d3f0] (SyS_delete_module) from [c000f4c0] (ret_fast_syscall+0x0/0x3c) [ 877.146443] ---[ end trace 48713a9ae31204b1 ]--- This was on the AX3 (dual-proc) with rxq_def=2. Hoping this helps, Willy -- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH][kernel 2.6.32] Bond interface can't send gratuitous ARP
Hi Qingjie, On Mon, Jul 27, 2015 at 09:05:29AM +0800, ? wrote: Hi, Bond interface worked as Active-Backup mode. If the bond interface was added in bridge, then it's just a port of bridge. It doesn' have IP address. When bond slave changing, the current code bond_send_gratuitous_arp didn't take effect. It couldn't send gratuitous ARP. So I made a patch to fix this issue. It's my first patch for kernel. Thanks a lot for your *understanding* for any *inconvenience* caused. Patches may only be sent for mainline kernel. Since you're trying to fix 2.6.32, you first need to check if a more recent version works properly. If it's the case, then instead you should ask for a backport of the patch that did the related change (it's easier for tracking bug fixes and regressions). If the problem is also present in mainline, then you should propose your patch for mainline. Since it's a fix, it will be backported. BTW, please keep in mind that 2.6.32 is reaching EOL in a few months, so it would be a good time to try newer ones (3.2 or 3.4 are rock solid now). Hoping this helps, Willy -- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Please backport commit to 3.12+
Hi, We recently faced the issue described in the patch below on 3.14.56. This fix was merged in 4.2-rc7. I checked Davem's queue and stable queue and it's not there yet. Could we please have it in 3.12 and above ? (feature was introduced in 3.11). I can confirm that it properly fixes the problem for us. Mainline commit is : commit 3c16241c445303a90529565e7437e1f240acfef2 Author: Phil SutterDate: Tue Jul 28 00:53:26 2015 +0200 netfilter: SYNPROXY: fix sending window update to client Upon receipt of SYNACK from the server, ipt_SYNPROXY first sends back an ACK to finish the server handshake, then calls nf_ct_seqadj_init() to initiate sequence number adjustment of forwarded packets to the client and finally sends a window update to the client to unblock it's TX queue. Since synproxy_send_client_ack() does not set synproxy_send_tcp()'s nfct parameter, no sequence number adjustment happens and the client receives the window update with incorrect sequence number. Depending on client TCP implementation, this leads to a significant delay (until a window probe is being sent). Signed-off-by: Phil Sutter Signed-off-by: Pablo Neira Ayuso Thanks, Willy -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: "ss -p" segfaults (updated to 4.2)
On Mon, Oct 12, 2015 at 09:50:19AM -0700, Stephen Hemminger wrote: > Applied, and did some editing on commit msg Thank you Stephen! Willy -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: "ss -p" segfaults (updated to 4.2)
Hi guys, I've updated Jose's patch to make it slightly simpler (eg: calloc instead of malloc+memset), and ported it to 4.2.0 which requires it as well, and attached it to this e-mail. I can confirm that with this patch 4.1.1 doesn't segfault on me anymore. The commit message should be reworked I guess though everything's in it and I didn't want to modify his description. Can it be merged as-is or should I reword the commit message and reference Jose as the fix reporter ? We should not let this bug live forever. Thanks, Willy >From 618028d6c5bfa4fb9898f2ec1ab5483e6f5392d4 Mon Sep 17 00:00:00 2001 From: "j...@openmailbox.org" <j...@openmailbox.org> Date: Tue, 21 Jul 2015 10:54:05 +0100 Subject: "ss -p" segfaults Patch for 4.2.0 Essentially all that is needed to get rid of this issue is the addition of: memset(u, 0, sizeof(*u)); after: if (!(u = malloc(sizeof(*u break; Also patched some other situations (strcpy and sprintf uses) that potentially produce the same results. Signed-off-by: Jose P Santos <j...@openmailbox.org> [ wt: made Jose's patch slightly simpler, all credits to him for the diag ] Signed-off-by: Willy Tarreau <w...@1wt.eu> --- misc/ss.c | 16 ++-- 1 file changed, 10 insertions(+), 6 deletions(-) diff --git a/misc/ss.c b/misc/ss.c index 2f34962..8b0d606 100644 --- a/misc/ss.c +++ b/misc/ss.c @@ -457,7 +457,9 @@ static void user_ent_hash_build(void) user_ent_hash_build_init = 1; - strcpy(name, root); + strncpy(name, root, sizeof(name)-1); + name[sizeof(name)-1] = 0; + if (strlen(name) == 0 || name[strlen(name)-1] != '/') strcat(name, "/"); @@ -481,7 +483,7 @@ static void user_ent_hash_build(void) if (getpidcon(pid, _context) != 0) pid_context = strdup(no_ctx); - sprintf(name + nameoff, "%d/fd/", pid); + snprintf(name + nameoff, sizeof(name) - nameoff, "%d/fd/", pid); pos = strlen(name); if ((dir1 = opendir(name)) == NULL) { free(pid_context); @@ -502,7 +504,7 @@ static void user_ent_hash_build(void) if (sscanf(d1->d_name, "%d%c", , ) != 1) continue; - sprintf(name+pos, "%d", fd); + snprintf(name+pos, sizeof(name) - pos, "%d", fd); link_len = readlink(name, lnk, sizeof(lnk)-1); if (link_len == -1) @@ -2736,7 +2738,7 @@ static int unix_show(struct filter *f) struct sockstat *u, **insp; int flags; - if (!(u = malloc(sizeof(*u + if (!(u = calloc(1, sizeof(*u break; u->name = NULL; u->peer_name = NULL; @@ -3086,11 +3088,13 @@ static int netlink_show_one(struct filter *f, strncpy(procname, "kernel", 6); } else if (pid > 0) { FILE *fp; - sprintf(procname, "%s/%d/stat", + snprintf(procname, sizeof(procname), "%s/%d/stat", getenv("PROC_ROOT") ? : "/proc", pid); if ((fp = fopen(procname, "r")) != NULL) { if (fscanf(fp, "%*d (%[^)])", procname) == 1) { - sprintf(procname+strlen(procname), "/%d", pid); + snprintf(procname+strlen(procname), + sizeof(procname)-strlen(procname), + "/%d", pid); done = 1; } fclose(fp); -- 1.7.12.1
Re: NFS/TCP/IPv6 acting strangely in 4.2
Hi, On Wed, Sep 16, 2015 at 06:53:57AM +, Damien Thébault wrote: > On Fri, 2015-09-11 at 12:38 +0100, Russell King - ARM Linux wrote: > > I have a recent Marvell Armada 388 board here which uses the mvneta > > driver. I'm seeing some weird effects with NFS with it acting as a > > client. > > Hello, > > I'm upgrading a Marvelle Armada 370 board using the mvneta driver from > 4.0 to 4.2 and noticed issues with NFS booting. > Basically, most of the time init returns with an error code, or > programs segfault or throw illegal instructions. > > Since it worked fine on 4.0 I bisected until I found commit > a84e32894191cfcbffa54180d78d7d4654d56c20 "net: mvneta: fix refilling > for Rx DMA buffers". > > If I revert this commit, everything seems to get back to normal. > Could you try it ? The two issues look very similar. I'm not sure but I'm seeing that the accounting was changed by this patch without being certain of the implications; if the revert above works, it would be nice to try to only apply this just to see if that's indeed an accounting error or not : diff --git a/drivers/net/ethernet/marvell/mvneta.c b/drivers/net/ethernet/marvell/mvneta.c index 62e48bc..4205867 100644 --- a/drivers/net/ethernet/marvell/mvneta.c +++ b/drivers/net/ethernet/marvell/mvneta.c @@ -1463,6 +1463,7 @@ static int mvneta_rx(struct mvneta_port *pp, int rx_todo, { struct net_device *dev = pp->dev; int rx_done; + int missed = 0; u32 rcvd_pkts = 0; u32 rcvd_bytes = 0; @@ -1527,6 +1528,7 @@ static int mvneta_rx(struct mvneta_port *pp, int rx_todo, if (err) { netdev_err(dev, "Linux processing - Can't refill\n"); rxq->missed++; + missed++; goto err_drop_frame; } @@ -1561,7 +1563,7 @@ static int mvneta_rx(struct mvneta_port *pp, int rx_todo, } /* Update rxq management counters */ - mvneta_rxq_desc_num_update(pp, rxq, rx_done, rx_done); + mvneta_rxq_desc_num_update(pp, rxq, rx_done, rx_done - missed); return rx_done; } Regards, Willy -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/1] net: Add SO_REUSEPORT_LISTEN_OFF socket option as drain mode
Hi Eric, On Wed, Nov 11, 2015 at 05:09:01PM -0800, Eric Dumazet wrote: > On Wed, 2015-11-11 at 10:43 -0800, Eric Dumazet wrote: > > On Wed, 2015-11-11 at 10:23 -0800, Tom Herbert wrote: > > > > > How about doing this in shutdown called for a listener? > > > > Seems a good idea, I will try it, thanks ! > > > > Arg, I forgot about this shutdown() discussion we had recently > with Oracle guys. > > It is currently used in linux to unblock potential threads in accept() > system call. > > This would prevent syn_recv sockets to be finally accepted. I had a conversation with an haproxy user who's concerned with the connection drops during the reload operation and we stumbled upon this thread. I was considering improving shutdown() as well for this as haproxy already performs a shutdown(RD) during a "pause" operation (ie: workaround for kernels missing SO_REUSEPORT). And I found that the code clearly doesn't make this possible since shutdown(RD) flushes the queue and stops the listening. However I found what I consider an elegant solution which works pretty well : by simply adding a test in compute_score(), we can ensure that a previous socket ranks lower than the current ones, and is never considered as long as the new ones are present. Here I achieved this using setsockopt(SO_LINGER). The old process just has to do this with a non-zero value on the socket it wants to put into lingering mode and that's all. I find this elegant since it keeps the same semantics as for a connected socket in that it avoids killing the queue, and that it doesn't change the behaviour for existing applications. It just turns out that listening sockets are set up without any lingering by default so we don't need to add any new socket options nor anything. Please let me know what you think about it (patch attached), if it's accepted it's trivial to adapt haproxy to this new behaviour. Thanks! Willy >From 7b79e362479fa7084798e6aa41da2a2045f0d6bb Mon Sep 17 00:00:00 2001 From: Willy Tarreau <w...@1wt.eu> Date: Tue, 15 Dec 2015 16:40:00 +0100 Subject: net: make lingering sockets score less in compute_score() When multiple processes use SO_REUSEPORT for a seamless restart operation, there's a tiny window during which both the old and the new process are bound to the same port, and there's no way for the old process to gracefully stop receiving connections without dropping the few that are left in the queue between the last poll() and the shutdown() or close() operation. Incoming connections are distributed between multiple listening sockets in inet_lookup_listener() according to multiple criteria. The first criterion is a score based on a number of attributes for each socket, then a hash computation ensures that the connections are evenly distributed between sockets of equal weight. This patch provides an elegant approach by which the old process can simply decrease its score by setting the lingering time to non-zero on its listening socket. Thus, the sockets from the new process (which start without any lingering) always score higher and are always preferred. The old process can then safely drain incoming connections and stop after meeting the -1 EAGAIN condition, as shown in the example below : process A (old one)| process B (new one) | listen() >= 0 | ... | accept() | ... | ... | listen() From now on, both processes receive incoming connections ... | kill(process A, go_away) setsockopt(SO_LINGER) | accept() >= 0 Here process A stops receiving new connections accept() >= 0 | accept() >= 0 ... | accept() = -1 EAGAIN | accept() >= 0 close() | exit()| Signed-off-by: Willy Tarreau <w...@1wt.eu> --- net/ipv4/inet_hashtables.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c index c6fb80b..473b16f 100644 --- a/net/ipv4/inet_hashtables.c +++ b/net/ipv4/inet_hashtables.c @@ -191,6 +191,8 @@ static inline int compute_score(struct sock *sk, struct net *net, score += 4; } } + if (!sock_flag(sk, SOCK_LINGER)) + score++; return score; } -- 1.7.12.1
Re: [PATCH 1/1] net: Add SO_REUSEPORT_LISTEN_OFF socket option as drain mode
On Tue, Dec 15, 2015 at 10:21:52AM -0800, Eric Dumazet wrote: > On Tue, 2015-12-15 at 18:43 +0100, Willy Tarreau wrote: > > > Ah ? but what does it bring in this case ? I'm not seeing it used > > anywhere on a listening socket. The code took care of not breaking > > them though (ie they still accept if no other socket shows up with > > a higher score). Otherwise we'll have to switch to Tolga's patch, > > unless we find another socket option that can safely be combined > > and which makes sense (I often find it better not to make userland > > depend on new updates of includes when possible). > > Socket options set on the listener before the accept() are inherited. I completely forgot about this use case, stupid me! > Applications wanting SO_LINGER special settings on all their sockets can > use a single system call right before listen(). > > Some servers having to deal with TIME_WAIT proliferation very often use > SO_LINGER with timeout 0 Yes definitely, it's just that I was focused on the listening socket not taking into account the fact that it would be inherited to the (rare) few sockets that are accepted from the queue afterwards. And indeed it's a perfectly legitimate usage to save a syscall per incoming connection. Thus do you think it's worth adding a new option as Tolga proposed ? Another solution I considered (but found a bit dirty) was to make use of the unimplemented shutdown(WR) for this. While it's easy to do, I don't like it simply because it looks like a hack and not logical at all from the users perspective. Thanks, Willy -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/1] net: Add SO_REUSEPORT_LISTEN_OFF socket option as drain mode
On Tue, Dec 15, 2015 at 09:10:24AM -0800, Eric Dumazet wrote: > On Tue, 2015-12-15 at 17:14 +0100, Willy Tarreau wrote: > > Hi Eric, > > > > On Wed, Nov 11, 2015 at 05:09:01PM -0800, Eric Dumazet wrote: > > > On Wed, 2015-11-11 at 10:43 -0800, Eric Dumazet wrote: > > > > On Wed, 2015-11-11 at 10:23 -0800, Tom Herbert wrote: > > > > > > > > > How about doing this in shutdown called for a listener? > > > > > > > > Seems a good idea, I will try it, thanks ! > > > > > > > > > > Arg, I forgot about this shutdown() discussion we had recently > > > with Oracle guys. > > > > > > It is currently used in linux to unblock potential threads in accept() > > > system call. > > > > > > This would prevent syn_recv sockets to be finally accepted. > > > > I had a conversation with an haproxy user who's concerned with the > > connection drops during the reload operation and we stumbled upon > > this thread. I was considering improving shutdown() as well for this > > as haproxy already performs a shutdown(RD) during a "pause" operation > > (ie: workaround for kernels missing SO_REUSEPORT). > > > > And I found that the code clearly doesn't make this possible since > > shutdown(RD) flushes the queue and stops the listening. > > > > However I found what I consider an elegant solution which works > > pretty well : by simply adding a test in compute_score(), we can > > ensure that a previous socket ranks lower than the current ones, > > and is never considered as long as the new ones are present. Here I > > achieved this using setsockopt(SO_LINGER). The old process just has > > to do this with a non-zero value on the socket it wants to put into > > lingering mode and that's all. > > > > I find this elegant since it keeps the same semantics as for a > > connected socket in that it avoids killing the queue, and that it > > doesn't change the behaviour for existing applications. It just > > turns out that listening sockets are set up without any lingering > > by default so we don't need to add any new socket options nor > > anything. > > > > Please let me know what you think about it (patch attached), if > > it's accepted it's trivial to adapt haproxy to this new behaviour. > > Well, problem is : some applications use LINGER on the listener, > you can not really hijack this flag. Ah ? but what does it bring in this case ? I'm not seeing it used anywhere on a listening socket. The code took care of not breaking them though (ie they still accept if no other socket shows up with a higher score). Otherwise we'll have to switch to Tolga's patch, unless we find another socket option that can safely be combined and which makes sense (I often find it better not to make userland depend on new updates of includes when possible). Cheers, Willy -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/1] net: Add SO_REUSEPORT_LISTEN_OFF socket option as drain mode
Hi Eric, On Wed, Dec 16, 2015 at 08:38:14AM +0100, Willy Tarreau wrote: > On Tue, Dec 15, 2015 at 01:21:15PM -0800, Eric Dumazet wrote: > > On Tue, 2015-12-15 at 20:44 +0100, Willy Tarreau wrote: > > > > > Thus do you think it's worth adding a new option as Tolga proposed ? > > > > > > I thought we tried hard to avoid adding the option but determined > > we could not avoid it ;) > > Not yet, your other proposal of disabling SO_REUSEPORT makes sense if > we combine it with the proposal to change the score in my patch. If > we say that a socket which has SO_REUSEPORT scores higher, then the > connections which don't want to accept new connections anymore will > simply have to drop it an not be elected. I find this even cleaner > since the sole purpose of the loop is to find the best socket in case > of SO_REUSEPORT. So I tried this and am pretty satisfied with the results, as I couldn't see any single reset on 4.4-rc5 with it. On 4.1 I got a few very rare resets at the exact moment the new process binds to the socket, because I suspect some ACKs end up in the wrong queue exactly there. But apparently the changes you did in 4.4 totally got rid of this, which is great! I suspected that I could enter a situation where a new process could fail to bind if generations n-1 and n-2 were still present, because n-2 would be running without SO_REUSEPORT and that should make this test fail in inet_csk_bind_conflict(), but it never failed for me : if ((!reuse || !sk2->sk_reuse || sk2->sk_state == TCP_LISTEN) && (!reuseport || !sk2->sk_reuseport || (sk2->sk_state != TCP_TIME_WAIT && !uid_eq(uid, sock_i_uid(sk2) { ... So I'm clearly missing something and can't spot what. I mean, I'd prefer to see my patch occasionally fail than not understanding why it always works! If anyone has an suggestion I'm interested. Here's the updated patch. Best regards, Willy >From c060a5db92274402a0178d7c777a1e37c15eadb9 Mon Sep 17 00:00:00 2001 From: Willy Tarreau <w...@1wt.eu> Date: Tue, 15 Dec 2015 16:40:00 +0100 Subject: net: make lingering sockets score less in compute_score() When multiple processes use SO_REUSEPORT for a seamless restart operation, there's a tiny window during which both the old and the new process are bound to the same port, and there's no way for the old process to gracefully stop receiving connections without dropping the few that are left in the queue between the last poll() and the shutdown() or close() operation. Incoming connections are distributed between multiple listening sockets in inet_lookup_listener() according to multiple criteria. The first criterion is a score based on a number of attributes for each socket, then a hash computation ensures that the connections are evenly distributed between sockets of equal weight. This patch provides a simple approach by which the old process can simply decrease its score by disabling SO_REUSEPORT on its listening sockets. Thus, the sockets from the new process always score higher and are always preferred. The old process can then safely drain incoming connections and stop after meeting the -1 EAGAIN condition, as shown in the example below : process A (old one) | process B (new one) | setsockopt(SO_REUSEPORT, 1) | listen() >= 0 | ... | accept()| ... | setsockopt(SO_REUSEPORT, 1) ... | listen() From now on, both processes receive incoming connections ... | kill(process A, go_away) setsockopt(SO_REUSEPORT, 0) | accept() >= 0 Here process A stops receiving new connections accept() >= 0 | accept() >= 0 ... | accept() = -1 EAGAIN | accept() >= 0 close() | exit() | Signed-off-by: Willy Tarreau <w...@1wt.eu> --- net/ipv4/inet_hashtables.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c index ccc5980..1c950ba 100644 --- a/net/ipv4/inet_hashtables.c +++ b/net/ipv4/inet_hashtables.c @@ -189,6 +189,8 @@ static inline int compute_score(struct sock *sk, struct net *net, return -1; score += 4; } + if (sk->sk_reuseport) + score++; if (sk->sk_incoming_cpu == raw_smp_processor_id()) score++; } -- 1.7.12.1
Re: [PATCH 1/1] net: Add SO_REUSEPORT_LISTEN_OFF socket option as drain mode
On Tue, Dec 15, 2015 at 01:21:15PM -0800, Eric Dumazet wrote: > On Tue, 2015-12-15 at 20:44 +0100, Willy Tarreau wrote: > > > Thus do you think it's worth adding a new option as Tolga proposed ? > > > I thought we tried hard to avoid adding the option but determined > we could not avoid it ;) Not yet, your other proposal of disabling SO_REUSEPORT makes sense if we combine it with the proposal to change the score in my patch. If we say that a socket which has SO_REUSEPORT scores higher, then the connections which don't want to accept new connections anymore will simply have to drop it an not be elected. I find this even cleaner since the sole purpose of the loop is to find the best socket in case of SO_REUSEPORT. I'll give this a try and will submit such a proposal if that works for me. Cheers! Willy -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] unix: properly account for FDs passed over unix sockets
On Thu, Dec 31, 2015 at 03:08:53PM +0900, Tetsuo Handa wrote: > Willy Tarreau wrote: > > On Wed, Dec 30, 2015 at 09:58:42AM +0100, Hannes Frederic Sowa wrote: > > > The MSG_PEEK code should not be harmful and the patch is good as is. I > > > first understood from the published private thread, that it is possible > > > for a program to exceed the rlimit of fds. But the DoS is only by > > > keeping the fds in flight and not attaching them to any program. > > > > Exactly. The real issue is when these FDs become very expensive such as > > pipes full of data. > > > > As you wrote how to abuse this vulnerability which exists in Linux 2.0 > and later kernel, I quote a short description from private thread. > > "an unprivileged user consumes all file descriptors so that other > unprivileged user cannot work" and "an unprivileged user consumes all > kernel memory so that the OOM killer kills almost all processes before > the culprit process is killed (CVE-2013-4312)". > > Reported-by: Tetsuo Handa <penguin-ker...@i-love.sakura.ne.jp> > Mitigates: CVE-2013-4312 (Linux 2.0+) Well I didn't reveal any secret as it was publicly reported first in 2010, it's only that Mark sent us the proof of concept exploit on the security list recently :-) https://bugzilla.kernel.org/show_bug.cgi?id=20402 Anyway I'll resend the patch with your reported-by, the CVE and Hannes' ACK. Thanks! Willy -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] unix: properly account for FDs passed over unix sockets
On Tue, Dec 29, 2015 at 03:48:45PM +0100, Hannes Frederic Sowa wrote: > On 28.12.2015 15:14, Willy Tarreau wrote: > >It is possible for a process to allocate and accumulate far more FDs than > >the process' limit by sending them over a unix socket then closing them > >to keep the process' fd count low. > > > >This change addresses this problem by keeping track of the number of FDs > >in flight per user and preventing non-privileged processes from having > >more FDs in flight than their configured FD limit. > > > >Reported-by: socketp...@gmail.com > >Suggested-by: Linus Torvalds <torva...@linux-foundation.org> > >Signed-off-by: Willy Tarreau <w...@1wt.eu> > > Thanks for the patch! > > I think this does not close the DoS attack completely as we duplicate > fds if the reader uses MSG_PEEK on the unix domain socket and thus > clones the fd. Have I overlooked something? I didn't know this behaviour. However, then the fd remains in flight, right ? So as long as it's not removed from the queue, the sender cannot add more than its FD limit. I may be missing something obvious though :-/ Thanks, Willy -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/1] net: Add SO_REUSEPORT_LISTEN_OFF socket option as drain mode
On Mon, Dec 21, 2015 at 12:38:27PM -0800, Tom Herbert wrote: > On Fri, Dec 18, 2015 at 11:00 PM, Willy Tarreau <w...@1wt.eu> wrote: > > On Fri, Dec 18, 2015 at 06:38:03PM -0800, Eric Dumazet wrote: > >> On Fri, 2015-12-18 at 19:58 +0100, Willy Tarreau wrote: > >> > Hi Josh, > >> > > >> > On Fri, Dec 18, 2015 at 08:33:45AM -0800, Josh Snyder wrote: > >> > > I was also puzzled that binding succeeded. Looking into the code paths > >> > > involved, in inet_csk_get_port, we quickly goto have_snum. From there, > >> > > we end > >> > > up dropping into tb_found. Since !hlist_empty(>owners), we end up > >> > > checking > >> > > that (tb->fastreuseport > 0 && sk->sk_reuseport && uid_eq(tb->fastuid, > >> > > uid)). > >> > > This test passes, so we goto success and bind. > >> > > > >> > > Crucially, we are checking the fastreuseport field on the > >> > > inet_bind_bucket, and > >> > > not the sk_reuseport variable on the other sockets in the bucket. > >> > > Since this > >> > > bit is set based on sk_reuseport at the time the first socket binds > >> > > (see > >> > > tb_not_found), I can see no reason why sockets need to keep > >> > > SO_REUSEPORT set > >> > > beyond initial binding. > >> > > > >> > > Given this, I believe Willy's patch elegantly solves the problem at > >> > > hand. > >> > > >> > Great, thanks for your in-depth explanation. > >> > > >> > Eric, do you think that this patch may be acceptable material for next > >> > merge window (given that it's not a fix per-se) ? If so I'll resubmit > >> > later. > >> > >> I need to check with Craig Gallek, because he was about to upstream a > >> change to make SO_REUSEPORT more scalable & sexy (like having an [e]BPF > >> filter to perform the selection in an array of sockets) > > > > OK fine. Please note that I also considered using a new value instead of > > zero there but I preferred to avoid it since the man talked about zero/ > > non-zero so I wanted to limit any API change. If Craig adds new values > > there then this is something we can reconsider. > > > Is there any reason why this turning off a soreuseport socket should > not apply to UDP also? (seems like we have a need to turn off RX but > not TX for a UDP socket). I didn't know it was supported for UDP :-) I guess that's the only reason. willy -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/1] net: Add SO_REUSEPORT_LISTEN_OFF socket option as drain mode
Hi Josh, On Fri, Dec 18, 2015 at 08:33:45AM -0800, Josh Snyder wrote: > I was also puzzled that binding succeeded. Looking into the code paths > involved, in inet_csk_get_port, we quickly goto have_snum. From there, we end > up dropping into tb_found. Since !hlist_empty(>owners), we end up checking > that (tb->fastreuseport > 0 && sk->sk_reuseport && uid_eq(tb->fastuid, uid)). > This test passes, so we goto success and bind. > > Crucially, we are checking the fastreuseport field on the inet_bind_bucket, > and > not the sk_reuseport variable on the other sockets in the bucket. Since this > bit is set based on sk_reuseport at the time the first socket binds (see > tb_not_found), I can see no reason why sockets need to keep SO_REUSEPORT set > beyond initial binding. > > Given this, I believe Willy's patch elegantly solves the problem at hand. Great, thanks for your in-depth explanation. Eric, do you think that this patch may be acceptable material for next merge window (given that it's not a fix per-se) ? If so I'll resubmit later. Thanks, Willy -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/1] net: Add SO_REUSEPORT_LISTEN_OFF socket option as drain mode
On Fri, Dec 18, 2015 at 06:38:03PM -0800, Eric Dumazet wrote: > On Fri, 2015-12-18 at 19:58 +0100, Willy Tarreau wrote: > > Hi Josh, > > > > On Fri, Dec 18, 2015 at 08:33:45AM -0800, Josh Snyder wrote: > > > I was also puzzled that binding succeeded. Looking into the code paths > > > involved, in inet_csk_get_port, we quickly goto have_snum. From there, we > > > end > > > up dropping into tb_found. Since !hlist_empty(>owners), we end up > > > checking > > > that (tb->fastreuseport > 0 && sk->sk_reuseport && uid_eq(tb->fastuid, > > > uid)). > > > This test passes, so we goto success and bind. > > > > > > Crucially, we are checking the fastreuseport field on the > > > inet_bind_bucket, and > > > not the sk_reuseport variable on the other sockets in the bucket. Since > > > this > > > bit is set based on sk_reuseport at the time the first socket binds (see > > > tb_not_found), I can see no reason why sockets need to keep SO_REUSEPORT > > > set > > > beyond initial binding. > > > > > > Given this, I believe Willy's patch elegantly solves the problem at hand. > > > > Great, thanks for your in-depth explanation. > > > > Eric, do you think that this patch may be acceptable material for next > > merge window (given that it's not a fix per-se) ? If so I'll resubmit > > later. > > I need to check with Craig Gallek, because he was about to upstream a > change to make SO_REUSEPORT more scalable & sexy (like having an [e]BPF > filter to perform the selection in an array of sockets) OK fine. Please note that I also considered using a new value instead of zero there but I preferred to avoid it since the man talked about zero/ non-zero so I wanted to limit any API change. If Craig adds new values there then this is something we can reconsider. Have a nice week-end, Willy -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] unix: properly account for FDs passed over unix sockets
It is possible for a process to allocate and accumulate far more FDs than the process' limit by sending them over a unix socket then closing them to keep the process' fd count low. This change addresses this problem by keeping track of the number of FDs in flight per user and preventing non-privileged processes from having more FDs in flight than their configured FD limit. Reported-by: socketp...@gmail.com Suggested-by: Linus Torvalds <torva...@linux-foundation.org> Signed-off-by: Willy Tarreau <w...@1wt.eu> --- It would be nice if (if accepted) it would be backported to -stable as the issue is currently exploitable. --- include/linux/sched.h | 1 + net/unix/af_unix.c| 24 net/unix/garbage.c| 13 - 3 files changed, 29 insertions(+), 9 deletions(-) diff --git a/include/linux/sched.h b/include/linux/sched.h index edad7a4..fbf25f1 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -830,6 +830,7 @@ struct user_struct { unsigned long mq_bytes; /* How many bytes can be allocated to mqueue? */ #endif unsigned long locked_shm; /* How many pages of mlocked shm ? */ + unsigned long unix_inflight;/* How many files in flight in unix sockets */ #ifdef CONFIG_KEYS struct key *uid_keyring;/* UID specific keyring */ diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c index 45aebd9..d6d7b43 100644 --- a/net/unix/af_unix.c +++ b/net/unix/af_unix.c @@ -1499,6 +1499,21 @@ static void unix_destruct_scm(struct sk_buff *skb) sock_wfree(skb); } +/* + * The "user->unix_inflight" variable is protected by the garbage + * collection lock, and we just read it locklessly here. If you go + * over the limit, there might be a tiny race in actually noticing + * it across threads. Tough. + */ +static inline bool too_many_unix_fds(struct task_struct *p) +{ + struct user_struct *user = current_user(); + + if (unlikely(user->unix_inflight > task_rlimit(p, RLIMIT_NOFILE))) + return !capable(CAP_SYS_RESOURCE) && !capable(CAP_SYS_ADMIN); + return false; +} + #define MAX_RECURSION_LEVEL 4 static int unix_attach_fds(struct scm_cookie *scm, struct sk_buff *skb) @@ -1507,6 +1522,9 @@ static int unix_attach_fds(struct scm_cookie *scm, struct sk_buff *skb) unsigned char max_level = 0; int unix_sock_count = 0; + if (too_many_unix_fds(current)) + return -ETOOMANYREFS; + for (i = scm->fp->count - 1; i >= 0; i--) { struct sock *sk = unix_get_socket(scm->fp->fp[i]); @@ -1528,10 +1546,8 @@ static int unix_attach_fds(struct scm_cookie *scm, struct sk_buff *skb) if (!UNIXCB(skb).fp) return -ENOMEM; - if (unix_sock_count) { - for (i = scm->fp->count - 1; i >= 0; i--) - unix_inflight(scm->fp->fp[i]); - } + for (i = scm->fp->count - 1; i >= 0; i--) + unix_inflight(scm->fp->fp[i]); return max_level; } diff --git a/net/unix/garbage.c b/net/unix/garbage.c index a73a226..8fcdc22 100644 --- a/net/unix/garbage.c +++ b/net/unix/garbage.c @@ -120,11 +120,11 @@ void unix_inflight(struct file *fp) { struct sock *s = unix_get_socket(fp); + spin_lock(_gc_lock); + if (s) { struct unix_sock *u = unix_sk(s); - spin_lock(_gc_lock); - if (atomic_long_inc_return(>inflight) == 1) { BUG_ON(!list_empty(>link)); list_add_tail(>link, _inflight_list); @@ -132,25 +132,28 @@ void unix_inflight(struct file *fp) BUG_ON(list_empty(>link)); } unix_tot_inflight++; - spin_unlock(_gc_lock); } + fp->f_cred->user->unix_inflight++; + spin_unlock(_gc_lock); } void unix_notinflight(struct file *fp) { struct sock *s = unix_get_socket(fp); + spin_lock(_gc_lock); + if (s) { struct unix_sock *u = unix_sk(s); - spin_lock(_gc_lock); BUG_ON(list_empty(>link)); if (atomic_long_dec_and_test(>inflight)) list_del_init(>link); unix_tot_inflight--; - spin_unlock(_gc_lock); } + fp->f_cred->user->unix_inflight--; + spin_unlock(_gc_lock); } static void scan_inflight(struct sock *x, void (*func)(struct unix_sock *), -- 1.7.12.1 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] unix: properly account for FDs passed over unix sockets
On Wed, Dec 30, 2015 at 09:58:42AM +0100, Hannes Frederic Sowa wrote: > The MSG_PEEK code should not be harmful and the patch is good as is. I > first understood from the published private thread, that it is possible > for a program to exceed the rlimit of fds. But the DoS is only by > keeping the fds in flight and not attaching them to any program. Exactly. The real issue is when these FDs become very expensive such as pipes full of data. > __alloc_fd, called on the receiver side, does check for the rlimit > maximum anyway, so I don't see a loophole anymore: > > Acked-by: Hannes Frederic SowaThanks! > Another idea would be to add the amount of memory used to manage the fds > to sock_rmem/wmem but I don't see any advantages or disadvantages. Compared to the impact of the pending data in pipes themselves in flight, this would remain fairly minimal. Thanks, Willy -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 3.10 128/143] VSOCK: do not disconnect socket when peer has shutdown SEND only
From: Ian Campbell <ian.campb...@docker.com> commit dedc58e067d8c379a15a8a183c5db318201295bb upstream. The peer may be expecting a reply having sent a request and then done a shutdown(SHUT_WR), so tearing down the whole socket at this point seems wrong and breaks for me with a client which does a SHUT_WR. Looking at other socket family's stream_recvmsg callbacks doing a shutdown here does not seem to be the norm and removing it does not seem to have had any adverse effects that I can see. I'm using Stefan's RFC virtio transport patches, I'm unsure of the impact on the vmci transport. Signed-off-by: Ian Campbell <ian.campb...@docker.com> Cc: "David S. Miller" <da...@davemloft.net> Cc: Stefan Hajnoczi <stefa...@redhat.com> Cc: Claudio Imbrenda <imbre...@linux.vnet.ibm.com> Cc: Andy King <ack...@vmware.com> Cc: Dmitry Torokhov <d...@vmware.com> Cc: Jorgen Hansen <jhan...@vmware.com> Cc: Adit Ranadive <ad...@vmware.com> Cc: netdev@vger.kernel.org Signed-off-by: David S. Miller <da...@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gre...@linuxfoundation.org> Signed-off-by: Willy Tarreau <w...@1wt.eu> --- net/vmw_vsock/af_vsock.c | 21 + 1 file changed, 1 insertion(+), 20 deletions(-) diff --git a/net/vmw_vsock/af_vsock.c b/net/vmw_vsock/af_vsock.c index 9b88693..66a9bf5 100644 --- a/net/vmw_vsock/af_vsock.c +++ b/net/vmw_vsock/af_vsock.c @@ -1804,27 +1804,8 @@ vsock_stream_recvmsg(struct kiocb *kiocb, else if (sk->sk_shutdown & RCV_SHUTDOWN) err = 0; - if (copied > 0) { - /* We only do these additional bookkeeping/notification steps -* if we actually copied something out of the queue pair -* instead of just peeking ahead. -*/ - - if (!(flags & MSG_PEEK)) { - /* If the other side has shutdown for sending and there -* is nothing more to read, then modify the socket -* state. -*/ - if (vsk->peer_shutdown & SEND_SHUTDOWN) { - if (vsock_stream_has_data(vsk) <= 0) { - sk->sk_state = SS_UNCONNECTED; - sock_set_flag(sk, SOCK_DONE); - sk->sk_state_change(sk); - } - } - } + if (copied > 0) err = copied; - } out_wait: finish_wait(sk_sleep(sk), ); -- 2.8.0.rc2.1.gbe9624a
Re: [PATCH v2] unix: properly account for FDs passed over unix sockets
On Tue, Feb 02, 2016 at 09:32:56PM +0100, Hannes Frederic Sowa wrote: > But "struct pid *" in unix_skb_parms should be enough to get us to > corresponding "struct cred *" so we can decrement the correct counter > during skb destruction. > > So: > > We increment current task's unix_inflight and also check the current > task's limit during attaching fds to skbs and decrement the inflight > counter via "struct pid *". This looks like it should work. I like it as well, the principle sounds sane. > >That way it's always the person who actually does the send (rather > >than the opener of the socket _or_ the opener of the file that gets > >passed around) that gets credited, and thanks to the cred pointer we > >can then de-credit them properly. > > Exactly, I try to implement that. Thanks a lot! Thanks to you Hannes, I appreciate that you work on it, it would take much more time to me to dig into this. Willy
Re: [PATCH v2] unix: properly account for FDs passed over unix sockets
On Tue, Feb 02, 2016 at 12:53:20PM -0800, Linus Torvalds wrote: > On Tue, Feb 2, 2016 at 12:49 PM, Willy Tarreau <w...@1wt.eu> wrote: > > On Tue, Feb 02, 2016 at 12:44:54PM -0800, Linus Torvalds wrote: > >> > >> Umm. I think the "struct cred" may change in between, can't it? > > > > You mean for example in case of setuid() or something like this ? > > Yeah. I'd be worried about looking up the creds or user structure > later, and possibly getting a different one. > > I'd much rather look it up at attach time, and just carry an extra > pointer around. That seems to be an inherently safer model where > there's no worry about "what happens if the user does X in the > meantime". Normally we can only change from root to non-root, and we don't apply the limits to root, so if we have the ability to only store one bit indicating "not tracked" or to simply nullify one pointer to avoid counting in flight FDs for root, we don't take the risk to recredit them to the target user after a change. I just don't know if we can do that though :-/ Willy
Re: [PATCH v2] unix: properly account for FDs passed over unix sockets
On Tue, Feb 02, 2016 at 12:44:54PM -0800, Linus Torvalds wrote: > On Tue, Feb 2, 2016 at 12:32 PM, Hannes Frederic Sowa >wrote: > > But "struct pid *" in unix_skb_parms should be enough to get us to > > corresponding "struct cred *" so we can decrement the correct counter during > > skb destruction. > > Umm. I think the "struct cred" may change in between, can't it? You mean for example in case of setuid() or something like this ? willy
Re: struct pid memory leak
Hi Eric, Dmitry, On Fri, Jan 22, 2016 at 08:50:01AM -0800, Eric Dumazet wrote: > CC netdev, as it looks some af_unix issue ... > > On Fri, 2016-01-22 at 16:08 +0100, Dmitry Vyukov wrote: > > Hello, > > > > The following program causes struct pid memory leak: > > > > // autogenerated by syzkaller (http://github.com/google/syzkaller) (...) > > unreferenced object 0x8800324af200 (size 112): > > comm "syz-executor", pid 18413, jiffies 4295500287 (age 14.321s) > > hex dump (first 32 bytes): > > 01 00 00 00 01 00 00 00 00 00 00 00 00 00 00 00 > > 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 > > backtrace: > > [] kmemleak_alloc+0x63/0xa0 mm/kmemleak.c:916 > > [< inline >] kmemleak_alloc_recursive > > include/linux/kmemleak.h:47 (...) > > On commit 30f05309bde49295e02e45c7e615f73aa4e0ccc2 (Jan 20). I can't reproduce this with the indicated commit. I'm unsure how/what I'm supposed to see. Is a certain config needed ? I've enabled kmemleak in my .config but there are too few information here to go further unfortunately. Regards, Willy
Re: struct pid memory leak
On Sat, Jan 23, 2016 at 07:14:33PM +0100, Dmitry Vyukov wrote: > I've attached my .config. > Also run this program in a parallel loop. I think it's leaking not > every time, probably some race is involved. Thank you. Just in order to confirm, am I supposed to see the messages you quoted in dmesg ? Thanks, Willy
Re: struct pid memory leak
On Sat, Jan 23, 2016 at 07:46:45PM +0100, Dmitry Vyukov wrote: > On Sat, Jan 23, 2016 at 7:40 PM, Willy Tarreau <w...@1wt.eu> wrote: > > On Sat, Jan 23, 2016 at 07:14:33PM +0100, Dmitry Vyukov wrote: > >> I've attached my .config. > >> Also run this program in a parallel loop. I think it's leaking not > >> every time, probably some race is involved. > > > > Thank you. Just in order to confirm, am I supposed to see the > > messages you quoted in dmesg ? > > > I think the simplest way to confirm that you can reproduce it locally > is to check /proc/slabinfo. When I run this program in a parallel > loop, number of objects in pid cache was constantly growing: > > # cat /proc/slabinfo | grep pid > pid 297532576 284 : tunables00 > 0 : slabdata 19 19 0 > ... > pid 412532576 284 : tunables00 > 0 : slabdata 19 19 0 > ... > pid 1107 1176576 284 : tunables00 > 0 : slabdata 42 42 0 > ... > pid 1545 1652576 284 : tunables00 > 0 : slabdata 59 59 0 OK got it and indeed I can see it grow. In fact, the active column grows and once it reaches the num objects, this one grows in turn, which makes sense. All I can say now is that it doesn't need to run over multiple processes to leak, though that makes it easier. SMP is not needed either. > If you want to use kmemleak, then you need to run this program in a > parallel loop for some time, then stop it and then: > > $ echo scan > /sys/kernel/debug/kmemleak > $ cat /sys/kernel/debug/kmemleak > > If kmemleak has detected any leaks, cat will show them. I noticed that > kmemleak can delay leaks with significant delay, so usually I do scan > at least 5 times. Thank you for these information. I've tested on an older (3.14) kernel and I can see the effect there as well. I don't have "pid" in slabinfo, but launching 1000 processes at a time uses a few tens to hundreds kB of RAM on each round. 3.10 doesn't seem affected, I'm seeing the memory grow to a fixed point if I increase the number of parallel processes but then even after a few tens of thousands of processes, the reported used memory doesn't seem to increase (remember no "pid" entry here). kmemleak indeed reports me something on 3.14 which seems to match your trace as I'm seeing bash as the process (instead of syz-executor in your case) and alloc_pid() calls kmem_cache_alloc() : Unreferenced object 0x88003facd000 (size 128): comm "bash", pid 1822, jiffies 4294951223 (age 15.280s) hex dump (first 32 bytes): 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 backtrace: [] kmem_cache_alloc+0x92/0xe0 [] alloc_pid+0x24/0x4a0 [] cpumask_any_but+0x23/0x40 [] copy_process.part.66+0x1068/0x16e0 [] n_tty_write+0x37b/0x4f0 [] tty_write+0x1c1/0x2a0 [] do_fork+0xe0/0x340 [] __set_task_blocked+0x30/0x80 [] __set_current_blocked+0x38/0x60 [] stub_clone+0x69/0x90 [] system_call_fastpath+0x16/0x1b [] 0x It doesn't report this on 3.10. Unfortunately I feel totally incompetent on the subject :-/ Willy
Re: struct pid memory leak
On Sat, Jan 23, 2016 at 06:50:11PM -0800, Eric Dumazet wrote: > On Sat, Jan 23, 2016 at 6:38 PM, Willy Tarreau <w...@1wt.eu> wrote: > > On Sun, Jan 24, 2016 at 03:11:45AM +0100, Willy Tarreau wrote: > >> It doesn't report this on 3.10. > > > > To be more precise, kmemleak reports the issue on 3.13 and not on 3.12. > > I'm not sure if it's reliable enough to run a bisect though. > > > > Willy > > > > I have the leak on linux-3.11. > > I believe even linux-3.3 gets the leak, although I had to wait about > one hour to be confident the leak was there. OK so I'm stopping my bisect. It's possible it's affected with some option which changed along the various "make oldconfig" at each step. Thanks for letting me know. Willy
Re: struct pid memory leak
On Sun, Jan 24, 2016 at 03:11:45AM +0100, Willy Tarreau wrote: > It doesn't report this on 3.10. To be more precise, kmemleak reports the issue on 3.13 and not on 3.12. I'm not sure if it's reliable enough to run a bisect though. Willy
Re: Mis-backport in af_unix patch for Linux 3.10.95
Hello, On Sun, Jan 24, 2016 at 12:10:35AM -0500, Sultan Qasim wrote: > Hello all, > > I'm an outsider to the Linux kernel community, so I apologize if this > is not the right channel to mention this. The simple fact that you participate, inspect the code and report bugs makes you part of this community :-) It's indeed the right place. Usually when reporting an issue with a commit, we also CC the whole signed-off-by / CC chain of that commit (which I'm doing now). For bugs related to networking, we usually CC the netdev list as well. > I noticed that the > backported version of the patch "af_unix: Revert 'lock_interruptible' > in stream receive code" in Linux 3.10.95 seems to have removed the > mutex_lock_interruptible from the wrong function. > > Here is the backported patch: > https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/commit/?id=3a57e783016bf43ab9326172217f564941b85b17 > > Here is the original: > https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/commit/net/unix/af_unix.c?id=3822b5c2fc62e3de8a0f33806ff279fb7df92432 > > Was it not meant to be removed from unix_stream_recvmsg instead of > unix_dgram_recvmsg? You're absolutely right, good catch! Similar controls were added to both functions resulting in the same code appearing there, which confused the patch process, causing the change to be applied to the wrong location. This happens from time to time in such circumstances when backporting to older kernels. > Also, the variable called "noblock" needs to be > removed from the function being changed to prevent unused variable > warnings. If you mean this variable in function unix_dgram_recvmsg(), it would indeed report a warning but only due to the patch being mis-applied. In unix_stream_recvmsg(), it's still used as well. Does the attached patch seem better to you (not compile-tested) ? Greg/Ben, both 3.2.76 and 3.14.59 are OK regarding this, it seems like only 3.10.95 was affected. Thanks, Willy >From 77f6e82adf349cbccf7e2ec7601b25c994e0b483 Mon Sep 17 00:00:00 2001 From: Willy Tarreau <w...@1wt.eu> Date: Sun, 24 Jan 2016 09:19:57 +0100 Subject: af_unix: fix incorrect revert of 'lock_interruptible' in stream receive code As reported by Sultan Qasim, commit 3822b5c ("af_unix: Revert 'lock_interruptible' in stream receive code") was accidently applied at the wrong place in the backport that appeared in 3.10.95, it affected unix_dgram_recvmsg() instead of unix_stream_recvmsg() due to now similar code sections there. The dgram part needs to remain but the stream part needs to be removed. Reported-By: Sultan Qasim <sultanqa...@gmail.com> Fixes: 3a57e78 (3.10.95) Signed-off-by: Willy Tarreau <w...@1wt.eu> --- net/unix/af_unix.c | 18 +- 1 file changed, 9 insertions(+), 9 deletions(-) diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c index f934e7b..0061d00 100644 --- a/net/unix/af_unix.c +++ b/net/unix/af_unix.c @@ -1934,7 +1934,14 @@ static int unix_dgram_recvmsg(struct kiocb *iocb, struct socket *sock, if (flags_OOB) goto out; - mutex_lock(>readlock); + err = mutex_lock_interruptible(>readlock); + if (unlikely(err)) { + /* recvmsg() in non blocking mode is supposed to return -EAGAIN +* sk_rcvtimeo is not honored by mutex_lock_interruptible() +*/ + err = noblock ? -EAGAIN : -ERESTARTSYS; + goto out; + } skip = sk_peek_offset(sk, flags); @@ -2083,14 +2090,7 @@ static int unix_stream_recvmsg(struct kiocb *iocb, struct socket *sock, memset(_scm, 0, sizeof(tmp_scm)); } - err = mutex_lock_interruptible(>readlock); - if (unlikely(err)) { - /* recvmsg() in non blocking mode is supposed to return -EAGAIN -* sk_rcvtimeo is not honored by mutex_lock_interruptible() -*/ - err = noblock ? -EAGAIN : -ERESTARTSYS; - goto out; - } + mutex_lock(>readlock); do { int chunk; -- 1.7.12.2.21.g234cd45.dirty
Re: [PATCH net] af_unix: fix struct pid memory leak
Hi Eric, On Sun, Jan 24, 2016 at 01:53:50PM -0800, Eric Dumazet wrote: > From: Eric Dumazet> > Dmitry reported a struct pid leak detected by a syzkaller program. > > Bug happens in unix_stream_recvmsg() when we break the loop when a > signal is pending, without properly releasing scm. > > Fixes: b3ca9b02b007 ("net: fix multithreaded signal handling in unix recv > routines") > Reported-by: Dmitry Vyukov > Signed-off-by: Eric Dumazet > Cc: Rainer Weikusat > --- > net/unix/af_unix.c |1 + > 1 file changed, 1 insertion(+) > > diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c > index c5bf5ef2bf89..49d5093eb055 100644 > --- a/net/unix/af_unix.c > +++ b/net/unix/af_unix.c > @@ -2339,6 +2339,7 @@ again: > > if (signal_pending(current)) { > err = sock_intr_errno(timeo); > + scm_destroy(); > goto out; > } Good job on this one! FWIW, I managed to test it on 3.14 and I confirm it completely fixes the leak there as well. I had to modify it a little bit however since there's no scm local variable there : - scm_destroy(); + scm_destroy(siocb->scm); Cheers, Willy
Re: [PATCH v2 net-next 0/8] API set for HW Buffer management
Hi Gregory, On Tue, Feb 16, 2016 at 04:33:35PM +0100, Gregory CLEMENT wrote: > Hello, > > A few weeks ago I sent a proposal for a API set for HW Buffer > management, to have a better view of the motivation for this API see > the cover letter of this proposal: > http://thread.gmane.org/gmane.linux.kernel/2125152 > > Since this version I took into account the review from Florian: > - The hardware buffer management helpers are no more built by default > and now depend on a hidden config symbol which has to be selected > by the driver if needed > - The hwbm_pool_refill() and hwbm_pool_add() now receive a gfp_t as > argument allowing the caller to specify the flag it needs. > - buf_num is now tested to ensure there is no wrapping > - A spinlock has been added to protect the hwbm_pool_add() function in > SMP or irq context. > > I also used pr_warn instead of pr_debug in case of errors. > > I fixed the mvneta implementation by returning the buffer to the pool > at various place instead of ignoring it. > > About the series itself I tried to make this series easier to merge: > - Squashed "bus: mvenus-mbus: Fix size test for >mvebu_mbus_get_dram_win_info" into bus: mvebu-mbus: provide api for >obtaining IO and DRAM window information. > - Added my signed-otf-by on all the patches as submitter of the series. > - Renamed the dts patches with the pattern "ARM: dts: platform:" > - Removed the patch "ARM: mvebu: enable SRAM support in > mvebu_v7_defconfig" of this series and already applied it > - Rodified the order of the patches. > > In order to ease the test the branch mvneta-BM-framework-v2 is > available at g...@github.com:MISL-EBU-System-SW/mainline-public.git. Well, I tested this patch series on top of latest master (from today) on my fresh new clearfog board. I compared carefully with and without the patchset. My workload was haproxy receiving connections and forwarding them to my PC via the same port. I tested both with short connections (HTTP GET of an empty file) and long ones (1 MB or more). No trouble was detected at all, which is pretty good. I noticed a very tiny performance drop which is more noticeable on short connections (high packet rates), my forwarded connection rate went down from 17500/s to 17300/s. But I have not checked yet what can be tuned when using the BM, nor did I compare CPU usage. I remember having run some tests in the past, I guess it was on the XP-GP board, and noticed that the BM could save a significant amount of CPU and improve cache efficiency, so if this is the case here, we don't really care about a possible 1% performance drop. I'll try to provide more results as time permits. In the mean time if you want (or plan to submit a next batch), feel free to add a Tested-by: Willy Tarreau <w...@1wt.eu>. cheers, Willy
Re: [PATCH 1/1] net: Add SO_REUSEPORT_LISTEN_OFF socket option as drain mode
Hi Eric, On Thu, Mar 24, 2016 at 07:13:33AM -0700, Eric Dumazet wrote: > On Thu, 2016-03-24 at 07:12 +0100, Willy Tarreau wrote: > > Hi, > > > > On Wed, Mar 23, 2016 at 10:10:06PM -0700, Tolga Ceylan wrote: > > > I apologize for not properly following up on this. I had the > > > impression that we did not want to merge my original patch and then I > > > also noticed that it fails to keep the hash consistent. Recently, I > > > read the follow ups on it as well as Willy's patch/proposals. > > > > > > Is there any update on Willy's SO_REUSEPORT patch? IMHO, it solves the > > > problem and it is simpler than adding new sock option. > > > > no, Craig's changes were merged, and I haven't checked yet if my patch > > needs to be rebased or still applies. Feel free to check it and resubmit > > if you have time. > > No need for a patch AFAIK. > > BPF solution is generic enough. > > All user space needs to do is to update the BPF filter so that the > listener needing to be dismantled does not receive any new packet. But that means that any software making use of SO_REUSEPORT needs to also implement BPF on Linux to achieve the same as what it does on other OSes ? Also I found a case where a dying process would still cause trouble in the accept queue, maybe it's not redistributed, I don't remember, all I remember is that my traffic stopped after a segfault of only one of them :-/ I'll have to dig a bit regarding this. Thanks, Willy
Re: [PATCH 1/1] net: Add SO_REUSEPORT_LISTEN_OFF socket option as drain mode
On Thu, Mar 24, 2016 at 04:54:03PM -0700, Tom Herbert wrote: > On Thu, Mar 24, 2016 at 4:40 PM, Yann Ylavicwrote: > > I'll learn how to do this to get the best performances from the > > server, but having to do so to work around what looks like a defect > > (for simple/default SMP configurations at least, no NUMA or clever > > CPU-affinity or queuing policy involved) seems odd in the first place. > > > I disagree with your assessment that there is a defect. SO_REUSEPORT > is designed to spread packets amongst _equivalent_ connections. In the > server draining case sockets are no longer equivalent, but that is a > special case. I partially disagree with you here Tom. Initially SO_REUSEPORT was not used to spread packets but to allow soft restart in some applications. I've been using it since 2001 in haproxy on *BSD and linux 2.2. It was removed during 2.3 and I used to keep a patch to reimplement it in 2.4 (basically 2 or 3 lines, the infrastructure was still present), but the patch was not accepted. The same patch worked for 2.6 and 3.x, allowing me to continue to perform soft-restarts on Linux just like I used to do on *BSD. When SO_REUSEPORT was reimplemented in 3.9 with load balancing, I was happy because it at last allowed me to drop my patch and I got the extra benefit of better load balancing of incoming connections. But the main use we have for it (at least historically) is for soft restarts, where one process replaces another one. Very few people use more than one process in our case. However given the benefits of the load spreading for extreme loads, I'm willing to find how to achieve the same with BPF, but it's pretty clear that at this point I have no idea where to start from and that for a single process replacing a single one, it looks quite complicated. For me quite frankly the special case is the load balancing which is a side benefit (and a nice one, don't get me wrong). That's why I would have found it nice to "fix" the process replacement to avoid dropping incoming connections, though I don't want it to become a problem for future improvements on BPF. I don't think the two lines I proposed could become an issue but I'll live without them (or continue to apply this patch). BTW, I have no problem with having to write a little bit of assembly for fast interfaces if it remains untouched for years, we do already have a bit in haproxy. It's just a longterm investment. Best regards, Willy
Re: [PATCH 1/1] net: Add SO_REUSEPORT_LISTEN_OFF socket option as drain mode
Hi Eric, On Thu, Mar 24, 2016 at 11:49:41PM -0700, Eric Dumazet wrote: > Everything is possible, but do not complain because BPF went in the > kernel before your changes. Don't get me wrong, I'm not complaining, I'm more asking for help to try to elaborate the alternate solution. I understood well what my proposal did because it was pretty straightforward, and the new one I'll have to do is of an immense complexity for me by now, since it will require learning a new language and finding doc on how all this works together. But as I said I'm totally sold to the benefits it can provide for large scale deployments and I'm well aware of the ugly socket scan there was in the previous one. > Just rework your patch. > > Supporting multiple SO_REUSEPORT groups on the same port should not be > too hard really. Making sure BPF still work for them is feasible. > > But the semantic of the socket option would be really different. I don't care much about the socket options themselves as long as I can continue to support seamless reloads. I could even get rid of SO_REUSEPORT on Linux to use something else instead if I have a reliable way to detect that the alternative will work. > You need to not control an individual listener, but a group of listener. > > Your dying haproxy would issue a single system call to tell the kernel : > My SO_REUSEPORT group should not accept new SYN packets, so that the new > haproxy can setup a working new SO_REUSEPORT group. Normally it's the other way around :-) The new process first grabs the socket, there's a tiny window where both are attached, and only then the old process leaves. That's the only way to ensure there's no loss nor added latency in the processing. If you could share a pointer to some good starter documentation for this, I would appreciate it. I really have no idea where to start from and the only elements I found on the net didn't give a single hint regarding all this :-/ Thanks, Willy
Re: [PATCH 1/1] net: Add SO_REUSEPORT_LISTEN_OFF socket option as drain mode
Hi Eric, (just lost my e-mail, trying not to forget some points) On Thu, Mar 24, 2016 at 07:45:44AM -0700, Eric Dumazet wrote: > On Thu, 2016-03-24 at 15:22 +0100, Willy Tarreau wrote: > > Hi Eric, > > > But that means that any software making use of SO_REUSEPORT needs to > > also implement BPF on Linux to achieve the same as what it does on > > other OSes ? Also I found a case where a dying process would still > > cause trouble in the accept queue, maybe it's not redistributed, I > > don't remember, all I remember is that my traffic stopped after a > > segfault of only one of them :-/ I'll have to dig a bit regarding > > this. > > Hi Willy > > Problem is : If we add a SO_REUSEPORT_LISTEN_OFF, this wont work with > BPF. I wasn't for adding SO_REUSEPORT_LISTEN_OFF either. Instead the idea was just to modify the score in compute_score() so that a socket which disables SO_REUSEPORT scores less than one which still has it. The application wishing to terminate just has to clear the SO_REUSEPORT flag and wait for accept() reporting EAGAIN. The patch simply looked like this (copy-pasted hence space-mangled) : --- a/net/ipv4/inet_hashtables.c +++ b/net/ipv4/inet_hashtables.c @@ -189,6 +189,8 @@ static inline int compute_score(struct sock *sk, struct net *net, return -1; score += 4; } + if (sk->sk_reuseport) + score++; if (sk->sk_incoming_cpu == raw_smp_processor_id()) score++; } > BPF makes a decision without knowing individual listeners states. But is the decision taken without considering compute_score() ? The point really was to be the least possibly intrusive and quite logical for the application : "disable SO_REUSEPORT when you don't want to participate to incoming load balancing anymore". > Or we would need to extend BPF to access these kind of states. > Doable certainly, but we need to be convinced it is necessary. You know that I don't like complex designs to address simple issues if possible :-) > And yes, if a listener is closed while children are still in accept > queue, we drop all the children, we do not care of redistributing them > to another listeners. Really too complex to be worth it. Forget this, I mixed two issues here. Yes I know that redistributing is almost impossible, I've read that code a year ago or so and realized how complex this would be, without providing even 100% success rate. I wasn't speaking about redistribution of incoming queue but about an issue I've met where when I have 4 processes bound to the same port, if one dies, its share of incoming traffic is definitely lost. The only fix was to restart the processes to create new listeners. But I don't remember the conditions where I met this case, I don't even remember if it was on an -rc kernel or a final one, so I'd prefer to discuss this only once I have more elements. Cheers, Willy >From c060a5db92274402a0178d7c777a1e37c15eadb9 Mon Sep 17 00:00:00 2001 From: Willy Tarreau <w...@1wt.eu> Date: Tue, 15 Dec 2015 16:40:00 +0100 Subject: net: make lingering sockets score less in compute_score() When multiple processes use SO_REUSEPORT for a seamless restart operation, there's a tiny window during which both the old and the new process are bound to the same port, and there's no way for the old process to gracefully stop receiving connections without dropping the few that are left in the queue between the last poll() and the shutdown() or close() operation. Incoming connections are distributed between multiple listening sockets in inet_lookup_listener() according to multiple criteria. The first criterion is a score based on a number of attributes for each socket, then a hash computation ensures that the connections are evenly distributed between sockets of equal weight. This patch provides a simple approach by which the old process can simply decrease its score by disabling SO_REUSEPORT on its listening sockets. Thus, the sockets from the new process always score higher and are always preferred. The old process can then safely drain incoming connections and stop after meeting the -1 EAGAIN condition, as shown in the example below : process A (old one) | process B (new one) | setsockopt(SO_REUSEPORT, 1) | listen() >= 0 | ... | accept()| ... | setsockopt(SO_REUSEPORT, 1) ... | listen() From now on, both processes receive incoming connections ... | kill(process A, go_away) setsockopt(SO_REUSEPORT, 0) | accept() >= 0 Here process A stops receiving new connections
Re: [PATCH 1/1] net: Add SO_REUSEPORT_LISTEN_OFF socket option as drain mode
On Thu, Mar 24, 2016 at 09:33:11AM -0700, Eric Dumazet wrote: > > --- a/net/ipv4/inet_hashtables.c > > +++ b/net/ipv4/inet_hashtables.c > > @@ -189,6 +189,8 @@ static inline int compute_score(struct sock *sk, struct > > net *net, > > return -1; > > score += 4; > > } > > + if (sk->sk_reuseport) > > + score++; > > This wont work with BPF > > > if (sk->sk_incoming_cpu == raw_smp_processor_id()) > > score++; > > This one does not work either with BPF But this *is* in 4.5. Does this mean that this part doesn't work anymore or just that it's not usable in conjunction with BPF ? In this case I'm less worried, because it would mean that we have a solution for non-BPF aware applications and that BPF-aware applications can simply use BPF. > Whole point of BPF was to avoid iterate through all sockets [1], > and let user space use whatever selection logic it needs. > > [1] This was okay with up to 16 sockets. But with 128 it does not scale. Indeed. > If you really look at how BPF works, implementing another 'per listener' flag > would break the BPF selection. OK. > You can certainly implement the SO_REUSEPORT_LISTEN_OFF by loading an > updated BPF, why should we add another way in the kernel to do the same, > in a way that would not work in some cases ? I don't try to reimplement something already available, but I'm confused by a few points : - the code above already exists and you mention it cannot be used with BPF - for the vast majority of applications not using BPF, would the above *still* work (it worked in 4.4-rc at least) - it seems to me that for BPF to be usable on process shutting down, we'd need to have some form of central knowledge if the goal is to redefine how to distribute the load. In my case there are multiple independant processes forked on startup, so it's unclear to me how each of them could reconfigure BPF when shutting down without risking to break the other ones. - the doc makes me believe that BPF would require privileges to be unset, so that would not be compatible with a process shutting down which has already dropped its privileges after startup, but I could be wrong. Thanks for your help on this, Willy
Re: [PATCH 1/1] net: Add SO_REUSEPORT_LISTEN_OFF socket option as drain mode
On Thu, Mar 24, 2016 at 10:01:37AM -0700, Eric Dumazet wrote: > On Thu, 2016-03-24 at 17:50 +0100, Willy Tarreau wrote: > > On Thu, Mar 24, 2016 at 09:33:11AM -0700, Eric Dumazet wrote: > > > > --- a/net/ipv4/inet_hashtables.c > > > > +++ b/net/ipv4/inet_hashtables.c > > > > @@ -189,6 +189,8 @@ static inline int compute_score(struct sock *sk, > > > > struct net *net, > > > > return -1; > > > > score += 4; > > > > } > > > > + if (sk->sk_reuseport) > > > > + score++; > > > > > > This wont work with BPF > > > > > > > if (sk->sk_incoming_cpu == raw_smp_processor_id()) > > > > score++; > > > > > > This one does not work either with BPF > > > > But this *is* in 4.5. Does this mean that this part doesn't work anymore or > > just that it's not usable in conjunction with BPF ? In this case I'm less > > worried, because it would mean that we have a solution for non-BPF aware > > applications and that BPF-aware applications can simply use BPF. > > > > BPF can implement the CPU choice/pref itself. It has everything needed. Well I don't need the CPU choice, it was already there, it's not my code, I only need the ability for an independant process to stop receiving new connections without altering the other processes nor dropping some of these connections. In fact initially I didn't even need anything related to incoming connection load-balancing, just the ability to start a new process without stopping the old one, as it used to work in 2.2 and for which I used to keep a patch in 2.4 and 2.6. When SO_REUSEPORT was reintroduced in 3.9, that solved the issue and some users started to complain that between the old and the new processes, some connections were lost. Hence the proposal above. Since it's not about load distribution and that processes are totally independant, I don't see well how to (ab)use BPF to achieve this. The pattern is : t0 : unprivileged processes 1 and 2 are listening to the same port (sock1@pid1) (sock2@pid2) <-- listening --> t1 : new processes are started to replace the old ones (sock1@pid1) (sock2@pid2) (sock3@pid3) (sock4@pid4) <-- listening --> <-- listening --> t2 : new processes signal the old ones they must stop (sock1@pid1) (sock2@pid2) (sock3@pid3) (sock4@pid4) <--- draining --> <-- listening --> t3 : pids 1 and 2 have finished, they go away (sock3@pid3) (sock4@pid4) <-- gone -> <-- listening --> > > - it seems to me that for BPF to be usable on process shutting down, we'd > > need to have some form of central knowledge if the goal is to redefine > > how to distribute the load. In my case there are multiple independant > > processes forked on startup, so it's unclear to me how each of them > > could > > reconfigure BPF when shutting down without risking to break the other > > ones. > > - the doc makes me believe that BPF would require privileges to be unset, > > so > > that would not be compatible with a process shutting down which has > > already > > dropped its privileges after startup, but I could be wrong. > > > > Thanks for your help on this, > > Willy > > > > The point is : BPF is the way to go, because it is expandable. OK so this means we have to find a way to expand it to allow an individual non-privileged process to change the distribution algorithm without impacting other processes. I need to discover it better to find what can be done, but unfortunately at this point the sole principle makes me think of a level of complexity that doesn't seem obvious to solve at all :-/ Regards, Willy
Re: [PATCH 1/1] net: Add SO_REUSEPORT_LISTEN_OFF socket option as drain mode
On Thu, Mar 24, 2016 at 07:00:11PM +0100, Willy Tarreau wrote: > Since it's not about > load distribution and that processes are totally independant, I don't see > well how to (ab)use BPF to achieve this. > > The pattern is : > > t0 : unprivileged processes 1 and 2 are listening to the same port >(sock1@pid1) (sock2@pid2) ><-- listening --> > > t1 : new processes are started to replace the old ones >(sock1@pid1) (sock2@pid2) (sock3@pid3) (sock4@pid4) ><-- listening --> <-- listening --> > > t2 : new processes signal the old ones they must stop >(sock1@pid1) (sock2@pid2) (sock3@pid3) (sock4@pid4) ><--- draining --> <-- listening --> > > t3 : pids 1 and 2 have finished, they go away > (sock3@pid3) (sock4@pid4) > <-- gone -> <-- listening --> > Thinking a bit more about it, would it make sense to consider that in order to address such a scenario, the only the new (still privileged) process reconfigures the BPF to deliver traffic only to its own sockets and that by doing so it will result in the old one not to receive any of it anymore ? If so that could possibly be reasonably doable then. Ie: the old processes don't have to do anything to stop receiving traffic. Thanks, Willy
Re: [PATCH 1/1] net: Add SO_REUSEPORT_LISTEN_OFF socket option as drain mode
On Thu, Mar 24, 2016 at 11:20:49AM -0700, Tolga Ceylan wrote: > I would appreciate a conceptual description on how this would work > especially for a common scenario > as described by Willy. My initial impression was that a coordinator > (master) process takes this > responsibility to adjust BPF filters as children come and go. Indeed that would help, I don't know where to start from for now. > Two popular software has similar use cases: nginx and haproxy. Another > concern is with the > introduction of BPF itself, should we expect a performance drop in > these applications? Knowing how picky Eric is about performance in such areas, I'm not worried a single second about adopting something he recommends :-) I just need to ensure it covers our users' needs. And maybe the solution I mentionned in the other e-mail could work. Willy
Re: [PATCH 1/1] net: Add SO_REUSEPORT_LISTEN_OFF socket option as drain mode
Hi, On Wed, Mar 23, 2016 at 10:10:06PM -0700, Tolga Ceylan wrote: > I apologize for not properly following up on this. I had the > impression that we did not want to merge my original patch and then I > also noticed that it fails to keep the hash consistent. Recently, I > read the follow ups on it as well as Willy's patch/proposals. > > Is there any update on Willy's SO_REUSEPORT patch? IMHO, it solves the > problem and it is simpler than adding new sock option. no, Craig's changes were merged, and I haven't checked yet if my patch needs to be rebased or still applies. Feel free to check it and resubmit if you have time. Best regards, Willy
Re: [PATCH 3.2 085/115] veth: don???t modify ip_summed; doing so treats packets with bad checksums as good.
On Sat, Apr 30, 2016 at 03:43:51PM -0700, Ben Greear wrote: > On 04/30/2016 03:01 PM, Vijay Pandurangan wrote: > > Consider: > > > > - App A sends out corrupt packets 50% of the time and discards inbound > > data. (...) > How can you make a generic app C know how to do this? The path could be, > for instance: > > eth0 <-> user-space-A <-> vethA <-> vethB <-> { kernel routing logic } <-> > vethC <-> vethD <-> appC > > There are no sockets on vethB, but it does need to have special behaviour to > elide > csums. Even if appC is hacked to know how to twiddle some thing on it's veth > port, > mucking with vethD will have no effect on vethB. > > With regard to your example above, why would A corrupt packets? My guess: > > 1) It has bugs (so, fix the bugs, it could equally create incorrect data > with proper checksums, > so just enabling checksumming adds no useful protection.) I agree with Ben here, what he needs is the ability for userspace to be trusted when *forwarding* a packet. Ideally you'd only want to receive the csum status per packet on the packet socket and pass the same value on the vethA interface, with this status being kept when the packet reaches vethB. If A purposely corrupts packet, it's A's problem. It's similar to designing a NIC which intentionally corrupts packets and reports "checksum good". The real issue is that in order to do things right, the userspace bridge (here, "A") would really need to pass this status. In Ben's case as he says, bad checksum packets are dropped before reaching A, so that simplifies the process quite a bit and that might be what causes some confusion, but ideally we'd rather have recvmsg() and sendmsg() with these flags. I faced the exact same issue 3 years ago when playing with netmap, it was slow as hell because it would lose all checksum information when packets were passing through userland, resulting in GRO/GSO etc being disabled, and had to modify it to let userland preserve it. That's especially important when you have to deal with possibly corrupted packets not yet detected in the chain because the NIC did not validate their checksums. Willy
Re: [PATCH] nf_conntrack: avoid kernel pointer value leak in slab name
On Sat, May 14, 2016 at 02:31:04PM -0700, Linus Torvalds wrote: > On Sat, May 14, 2016 at 11:24 AM, Linus Torvalds >wrote: > > > > > > - net->ct.slabname = kasprintf(GFP_KERNEL, "nf_conntrack_%p", net); > > + net->ct.slabname = kasprintf(GFP_KERNEL, "nf_conntrack_%llu", > > + (u64)atomic64_inc_return(_id)); > > Oh well. I suspect this is going to cause a new warning on alpha and > ia64 and possibly others. > > "u64" is indeed "unsigned long long" on x86 and many other > architectures, but on alpga and ia64 it's just "unsigned long". > > So that case should have been to "long long". I detest how there isn't > a "64-bit size" printf specifier. Why simply not cast the atomic to (unsigned long long) instead of (u64) so that %llu always matches ? Willy
Re: [PATCH] nf_conntrack: avoid kernel pointer value leak in slab name
On Sat, May 14, 2016 at 03:21:31PM -0700, Linus Torvalds wrote: > On Sat, May 14, 2016 at 2:33 PM, Willy Tarreau <w...@1wt.eu> wrote: > > > > Why simply not cast the atomic to (unsigned long long) instead of (u64) > > so that %llu always matches ? > > Yes, that fixes the problem. It's just more typing, and annoying. The > fact that MS got it right while posix and gcc screwed it up is a bit > embarrassing.. Well on the other hand, because of this MS still has problems porting code from 32 to 64 bit. The real problem is that on both sides they imagined that you needed only one way to specify your types. In practice users generally want either the most optimal types for the architecture because they don't care about the size (char, int, size_t, void *...) or a specific size. This last one is annoying to use with printf format. > If we ever start using __uint128_t, we'll have even more problems in > this area. Oh well. Definitely. Willy
Re: [PATCH net] net: mvneta: set real interrupt per packet for tx_done
Hi, On Wed, Jul 06, 2016 at 04:18:58AM +0200, Marcin Wojtas wrote: > From: Dmitri Epshtein <d...@marvell.com> > > Commit aebea2ba0f74 ("net: mvneta: fix Tx interrupt delay") intended to > set coalescing threshold to a value guaranteeing interrupt generation > per each sent packet, so that buffers can be released with no delay. > > In fact setting threshold to '1' was wrong, because it causes interrupt > every two packets. According to the documentation a reason behind it is > following - interrupt occurs once sent buffers counter reaches a value, > which is higher than one specified in MVNETA_TXQ_SIZE_REG(q). This > behavior was confirmed during tests. Also when testing the SoC working > as a NAS device, better performance was observed with int-per-packet, > as it strongly depends on the fact that all transmitted packets are > released immediately. > > This commit enables NETA controller work in interrupt per sent packet mode > by setting coalescing threshold to 0. We had a discussion about this in January 2015 and I thought I sent a patch to change it but I can't find it, so I think it was only proposed to some users for testing. I also remember that on more recent kernels by then (>=3.13) we observed a slightly better performance with this value set to zero. Acked-by: Willy Tarreau <w...@1wt.eu> Willy
[PATCH 3.10 257/319] net: sctp, forbid negative length
From: Jiri Slaby <jsl...@suse.cz> commit a4b8e71b05c27bae6bad3bdecddbc6b68a3ad8cf upstream. Most of getsockopt handlers in net/sctp/socket.c check len against sizeof some structure like: if (len < sizeof(int)) return -EINVAL; On the first look, the check seems to be correct. But since len is int and sizeof returns size_t, int gets promoted to unsigned size_t too. So the test returns false for negative lengths. Yes, (-1 < sizeof(long)) is false. Fix this in sctp by explicitly checking len < 0 before any getsockopt handler is called. Note that sctp_getsockopt_events already handled the negative case. Since we added the < 0 check elsewhere, this one can be removed. If not checked, this is the result: UBSAN: Undefined behaviour in ../mm/page_alloc.c:2722:19 shift exponent 52 is too large for 32-bit type 'int' CPU: 1 PID: 24535 Comm: syz-executor Not tainted 4.8.1-0-syzkaller #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.1-0-gb3ef39f-prebuilt.qemu-project.org 04/01/2014 88006d99f2a8 b2f7bdea 41b58ab3 b4363c14 b2f7bcde 88006d99f2d0 88006d99f270 0034 b5096422 Call Trace: [] ? __ubsan_handle_shift_out_of_bounds+0x29c/0x300 ... [] ? kmalloc_order+0x24/0x90 [] ? kmalloc_order_trace+0x24/0x220 [] ? __kmalloc+0x330/0x540 [] ? sctp_getsockopt_local_addrs+0x174/0xca0 [sctp] [] ? sctp_getsockopt+0x10d/0x1b0 [sctp] [] ? sock_common_getsockopt+0xb9/0x150 [] ? SyS_getsockopt+0x1a5/0x270 Signed-off-by: Jiri Slaby <jsl...@suse.cz> Cc: Vlad Yasevich <vyasev...@gmail.com> Cc: Neil Horman <nhor...@tuxdriver.com> Cc: "David S. Miller" <da...@davemloft.net> Cc: linux-s...@vger.kernel.org Cc: netdev@vger.kernel.org Acked-by: Neil Horman <nhor...@tuxdriver.com> Signed-off-by: David S. Miller <da...@davemloft.net> Signed-off-by: Willy Tarreau <w...@1wt.eu> --- net/sctp/socket.c | 5 - 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/net/sctp/socket.c b/net/sctp/socket.c index bdc3fb6..86e7352 100644 --- a/net/sctp/socket.c +++ b/net/sctp/socket.c @@ -4259,7 +4259,7 @@ static int sctp_getsockopt_disable_fragments(struct sock *sk, int len, static int sctp_getsockopt_events(struct sock *sk, int len, char __user *optval, int __user *optlen) { - if (len <= 0) + if (len == 0) return -EINVAL; if (len > sizeof(struct sctp_event_subscribe)) len = sizeof(struct sctp_event_subscribe); @@ -5770,6 +5770,9 @@ SCTP_STATIC int sctp_getsockopt(struct sock *sk, int level, int optname, if (get_user(len, optlen)) return -EFAULT; + if (len < 0) + return -EINVAL; + sctp_lock_sock(sk); switch (optname) { -- 2.8.0.rc2.1.gbe9624a
Re: [PATCH net-next 3/3] net/tcp-fastopen: Add new API support
On Mon, Jan 23, 2017 at 11:01:21PM +0100, Willy Tarreau wrote: > On Mon, Jan 23, 2017 at 10:37:32PM +0100, Willy Tarreau wrote: > > On Mon, Jan 23, 2017 at 01:28:53PM -0800, Wei Wang wrote: > > > Hi Willy, > > > > > > True. If you call connect() multiple times on a socket which already has > > > cookie without a write(), the second and onward connect() call will return > > > EINPROGRESS. > > > It is basically because the following code block in > > > __inet_stream_connect() > > > can't distinguish if it is the first time connect() is called or not: > > > > > > case SS_CONNECTING: > > > if (inet_sk(sk)->defer_connect) <- defer_connect will > > > be 0 only after a write() is called > > > err = -EINPROGRESS; > > > else > > > err = -EALREADY; > > > /* Fall out of switch with err, set for this state */ > > > break; > > > > Ah OK that totally makes sense, thanks for the explanation! > > > > > I guess we can add some extra logic here to address this issue. So the > > > second connect() and onwards will return EALREADY. > > Thinking about it a bit more, I really think it would make more > sense to return -EISCONN here if we want to match the semantics > of a connect() returning zero on the first call. This way the > caller knows it can write whenever it wants and can disable > write polling until needed. > > I'm currently rebuilding a kernel with this change to see if it > behaves any better : > > -err = -EINPROGRESS; > +err = -EISCONN; OK so obviously it didn't work since sendmsg() goes there as well. But that made me realize that there really are 3 states, not 2 : - after connect() and before sendmsg() : defer_accept = 1, we want to lie to userland and pretend we're connected so that userland can call send(). A connect() must return either zero or -EISCONN. - during first sendmsg(), still connecting : the connection is in progress, EINPROGRESS must be returned to the first sendmsg(). - after the first sendmsg() : defer_accept = 0 ; connect() must return -EALREADY. We want to return real socket states from now on. Thus I modified defer_accept to take two bits to represent the extra state we need to indicate the transition. Now sendmsg() upgrades defer_accept from 1 to 2 before calling __inet_stream_connect(), which then knows it must return EINPROGRESS to sendmsg(). This way we correctly cover all these situations. Even if we call connect() again after the first connect() attempt it still matches the first result : accept4(7, {sa_family=AF_INET, sin_port=htons(36860), sin_addr=inet_addr("192.168.0.176")}, [128->16], SOCK_NONBLOCK) = 1 setsockopt(1, SOL_TCP, TCP_NODELAY, [1], 4) = 0 accept4(7, 0x7ffc2282fcb0, [128], SOCK_NONBLOCK) = -1 EAGAIN (Resource temporarily unavailable) recvfrom(1, 0x7b53a4, 7006, 0, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable) socket(AF_INET, SOCK_STREAM, IPPROTO_TCP) = 2 fcntl(2, F_SETFL, O_RDONLY|O_NONBLOCK) = 0 setsockopt(2, SOL_TCP, TCP_NODELAY, [1], 4) = 0 setsockopt(2, SOL_TCP, 0x1e /* TCP_??? */, [1], 4) = 0 connect(2, {sa_family=AF_INET, sin_port=htons(8001), sin_addr=inet_addr("127.0.0.1")}, 16) = 0 epoll_ctl(0, EPOLL_CTL_ADD, 1, {EPOLLIN|EPOLLRDHUP, {u32=1, u64=1}}) = 0 epoll_wait(0, [], 200, 0) = 0 connect(2, {sa_family=AF_INET, sin_port=htons(8001), sin_addr=inet_addr("127.0.0.1")}, 16) = -1 EISCONN (Transport endpoint is already connected) epoll_wait(0, [], 200, 0) = 0 recvfrom(2, 0x7b53a4, 8030, 0, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable) epoll_ctl(0, EPOLL_CTL_ADD, 2, {EPOLLIN|EPOLLRDHUP, {u32=2, u64=2}}) = 0 epoll_wait(0, [], 200, 1000)= 0 epoll_wait(0, [], 200, 1000)= 0 epoll_wait(0, [], 200, 1000)= 0 epoll_wait(0, [{EPOLLIN, {u32=1, u64=1}}], 200, 1000) = 1 recvfrom(1, "GET / HTTP/1.1\r\n", 8030, 0, NULL, NULL) = 16 sendto(2, "GET / HTTP/1.1\r\n", 16, MSG_DONTWAIT|MSG_NOSIGNAL, NULL, 0) = 16 epoll_wait(0, [{EPOLLIN, {u32=1, u64=1}}], 200, 1000) = 1 recvfrom(1, "\r\n", 8030, 0, NULL, NULL) = 2 sendto(2, "\r\n", 2, MSG_DONTWAIT|MSG_NOSIGNAL, NULL, 0) = 2 epoll_wait(0, [{EPOLLIN|EPOLLRDHUP, {u32=2, u64=2}}], 200, 1000) = 1 recvfrom(2, "HTTP/1.1 302 Found\r\nCache-Contro"..., 8030, 0, NULL, NULL) = 98 sendto(1, "HTTP/1.1 302 Found\r\nCache-Contro"..., 98, MSG_DONTWAIT|MSG_NOSIGNAL|MSG_MORE, NULL, 0) = 98 shutdown(1, SHUT_WR)= 0 epoll_ctl(0, EPOLL_CTL_DEL, 2, 0x6ff55c) = 0 epoll_wait(0, [{EPOLLIN|EPOLLHUP|
Re: [PATCH net-next 3/3] net/tcp-fastopen: Add new API support
On Mon, Jan 23, 2017 at 01:28:53PM -0800, Wei Wang wrote: > Hi Willy, > > True. If you call connect() multiple times on a socket which already has > cookie without a write(), the second and onward connect() call will return > EINPROGRESS. > It is basically because the following code block in __inet_stream_connect() > can't distinguish if it is the first time connect() is called or not: > > case SS_CONNECTING: > if (inet_sk(sk)->defer_connect) <- defer_connect will > be 0 only after a write() is called > err = -EINPROGRESS; > else > err = -EALREADY; > /* Fall out of switch with err, set for this state */ > break; Ah OK that totally makes sense, thanks for the explanation! > I guess we can add some extra logic here to address this issue. So the > second connect() and onwards will return EALREADY. If that's possible at little cost it would be nice, because your patch makes it so easy to enable TFO on outgoing connections now that I expect many people will blindly run the setsockopt() before connect(). Do not hesitate to ask me to run some tests. While 4 years ago it was not easy, here it's very simple for me. By the way I'm seeing an ~10% performance increase on haproxy by enabling this, it's really cool! Thanks, Willy
Re: [PATCH net-next 3/3] net/tcp-fastopen: Add new API support
On Mon, Jan 23, 2017 at 10:37:32PM +0100, Willy Tarreau wrote: > On Mon, Jan 23, 2017 at 01:28:53PM -0800, Wei Wang wrote: > > Hi Willy, > > > > True. If you call connect() multiple times on a socket which already has > > cookie without a write(), the second and onward connect() call will return > > EINPROGRESS. > > It is basically because the following code block in __inet_stream_connect() > > can't distinguish if it is the first time connect() is called or not: > > > > case SS_CONNECTING: > > if (inet_sk(sk)->defer_connect) <- defer_connect will > > be 0 only after a write() is called > > err = -EINPROGRESS; > > else > > err = -EALREADY; > > /* Fall out of switch with err, set for this state */ > > break; > > Ah OK that totally makes sense, thanks for the explanation! > > > I guess we can add some extra logic here to address this issue. So the > > second connect() and onwards will return EALREADY. Thinking about it a bit more, I really think it would make more sense to return -EISCONN here if we want to match the semantics of a connect() returning zero on the first call. This way the caller knows it can write whenever it wants and can disable write polling until needed. I'm currently rebuilding a kernel with this change to see if it behaves any better : -err = -EINPROGRESS; +err = -EISCONN; I'll keep you updated. Thanks, Willy
Re: [PATCH net-next 3/3] net/tcp-fastopen: Add new API support
Hi Wei, first, thanks a lot for doing this, it's really awesome! I'm testing it on 4.9 on haproxy and I met a corner case : when I perform a connect() to a server and I have nothing to send, upon POLLOUT notification since I have nothing to send I simply probe the connection using connect() again to see if it returns EISCONN or anything else. But here now I'm seeing EINPROGRESS loops. To illustrate this, here's what I'm doing : :8000 :8001 [ client ] ---> [ proxy ] ---> [ server ] The proxy is configured to enable TFO to the server and the server supports TFO as well. The proxy and the server are in fact two proxy instances in haproxy running in the same process for convenience. When I already have data to send here's what I'm seeing (so it works fine) : 06:29:16.861190 accept4(7, {sa_family=AF_INET, sin_port=htons(33986), sin_addr=inet_addr("192.168.0.176")}, [128->16], SOCK _NONBLOCK) = 9 06:29:16.861277 setsockopt(9, SOL_TCP, TCP_NODELAY, [1], 4) = 0 06:29:16.861342 accept4(7, 0x7ffd0d794430, [128], SOCK_NONBLOCK) = -1 EAGAIN (Resource temporarily unavailable) 06:29:16.861417 recvfrom(9, "BLAH\n", 7006, 0, NULL, NULL) = 5 06:29:16.861509 recvfrom(9, 0x2619329, 7001, 0, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable) 06:29:16.861657 socket(AF_INET, SOCK_STREAM, IPPROTO_TCP) = 10 06:29:16.861730 fcntl(10, F_SETFL, O_RDONLY|O_NONBLOCK) = 0 06:29:16.861779 setsockopt(10, SOL_TCP, TCP_NODELAY, [1], 4) = 0 06:29:16.861882 setsockopt(10, SOL_TCP, 0x1e /* TCP_??? */, [1], 4) = 0 06:29:16.861942 connect(10, {sa_family=AF_INET, sin_port=htons(8001), sin_addr=inet_addr("127.0.0.1")}, 16) = 0 06:29:16.862015 epoll_ctl(3, EPOLL_CTL_ADD, 9, {EPOLLIN|EPOLLRDHUP, {u32=9, u64=9}}) = 0 06:29:16.862072 epoll_wait(3, [], 200, 0) = 0 06:29:16.862126 sendto(10, "BLAH\n", 5, MSG_DONTWAIT|MSG_NOSIGNAL, NULL, 0) = 5 06:29:16.862281 epoll_wait(3, [{EPOLLIN, {u32=8, u64=8}}], 200, 0) = 1 06:29:16.862334 recvfrom(10, 0x26173a4, 8030, 0, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable) 06:29:16.862385 accept4(8, {sa_family=AF_INET, sin_port=htons(46760), sin_addr=inet_addr("127.0.0.1")}, [128->16], SOCK_NON BLOCK) = 11 06:29:16.862450 setsockopt(11, SOL_TCP, TCP_NODELAY, [1], 4) = 0 06:29:16.862504 accept4(8, 0x7ffd0d794430, [128], SOCK_NONBLOCK) = -1 EAGAIN (Resource temporarily unavailable) 06:29:16.862564 recvfrom(11, "BLAH\n", 7006, 0, NULL, NULL) = 5 When I don't have data, here's what I'm seeing : 06:29:24.047801 accept4(7, {sa_family=AF_INET, sin_port=htons(33988), sin_addr=inet_addr("192.168.0.176")}, [128->16], SOCK _NONBLOCK) = 9 06:29:24.047899 setsockopt(9, SOL_TCP, TCP_NODELAY, [1], 4) = 0 06:29:24.047966 accept4(7, 0x7ffdedb2c7f0, [128], SOCK_NONBLOCK) = -1 EAGAIN (Resource temporarily unavailable) 06:29:24.048043 recvfrom(9, 0xd31324, 7006, 0, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable) 06:29:24.048281 socket(AF_INET, SOCK_STREAM, IPPROTO_TCP) = 10 06:29:24.048342 fcntl(10, F_SETFL, O_RDONLY|O_NONBLOCK) = 0 06:29:24.048392 setsockopt(10, SOL_TCP, TCP_NODELAY, [1], 4) = 0 06:29:24.048447 setsockopt(10, SOL_TCP, 0x1e /* TCP_??? */, [1], 4) = 0 06:29:24.048508 connect(10, {sa_family=AF_INET, sin_port=htons(8001), sin_addr=inet_addr("127.0.0.1")}, 16) = 0 06:29:24.048593 epoll_ctl(3, EPOLL_CTL_ADD, 9, {EPOLLIN|EPOLLRDHUP, {u32=9, u64=9}}) = 0 06:29:24.048651 epoll_wait(3, [], 200, 0) = 0 06:29:24.048699 getsockopt(10, SOL_SOCKET, SO_ERROR, [0], [4]) = 0 06:29:24.048751 connect(10, {sa_family=AF_INET, sin_port=htons(8001), sin_addr=inet_addr("127.0.0.1")}, 16) = -1 EINPROGRES S (Operation now in progress) 06:29:24.048808 epoll_ctl(3, EPOLL_CTL_ADD, 10, {EPOLLOUT, {u32=10, u64=10}}) = 0 06:29:24.048860 epoll_wait(3, [{EPOLLOUT, {u32=10, u64=10}}], 200, 1000) = 1 06:29:24.048912 getsockopt(10, SOL_SOCKET, SO_ERROR, [0], [4]) = 0 06:29:24.048963 connect(10, {sa_family=AF_INET, sin_port=htons(8001), sin_addr=inet_addr("127.0.0.1")}, 16) = -1 EINPROGRES S (Operation now in progress) 06:29:24.049018 epoll_wait(3, [{EPOLLOUT, {u32=10, u64=10}}], 200, 1000) = 1 06:29:24.049072 getsockopt(10, SOL_SOCKET, SO_ERROR, [0], [4]) = 0 06:29:24.049122 connect(10, {sa_family=AF_INET, sin_port=htons(8001), sin_addr=inet_addr("127.0.0.1")}, 16) = -1 EINPROGRES S (Operation now in progress) I theorically understand why but I think we have something wrong here and instead we should have -1 EISCONN (to pretend the connection is established) or return EALREADY (to mention that a previous request was already made and that we're waiting for the next step). While I can instrument my connect() *not* to use TFO when connecting without any pending data, I don't always know this (eg when I use openssl and cross fingers so that it decides to quickly send something on the next round). I think it's easy to fall into this tricky corner case and am wondering what can be done about it. Does the EINPROGRESS happen only because there is no cookie yet ? If so,
Re: [PATCH net-next 3/3] net/tcp-fastopen: Add new API support
On Mon, Jan 23, 2017 at 02:57:31PM -0800, Wei Wang wrote: > Yes. That seems to be a valid fix to it. > Let me try it with my existing test cases as well to see if it works for > all scenarios I have. Perfect. Note that since the state 2 is transient I initially thought about abusing the flags passed to __inet_stream_connect() to say "hey I'm sendmsg() and not connect()" but that would have been a ugly hack while here we really have the 3 socket states represented eventhough one changes twice around a call. Thanks, Willy
Re: [PATCH net-next 3/3] net/tcp-fastopen: Add new API support
On Mon, Jan 23, 2017 at 10:59:22AM -0800, Wei Wang wrote: > This patch adds a new socket option, TCP_FASTOPEN_CONNECT, as an > alternative way to perform Fast Open on the active side (client). Wei, I think that nothing prevents from reusin the original TCP_FASTOPEN sockopt instead of adding a new one. The original one does this : case TCP_FASTOPEN: if (val >= 0 && ((1 << sk->sk_state) & (TCPF_CLOSE | TCPF_LISTEN))) { tcp_fastopen_init_key_once(true); fastopen_queue_tune(sk, val); } else { err = -EINVAL; } break; and your new option does this : case TCP_FASTOPEN_CONNECT: if (val > 1 || val < 0) { err = -EINVAL; } else if (sysctl_tcp_fastopen & TFO_CLIENT_ENABLE) { if (sk->sk_state == TCP_CLOSE) tp->fastopen_connect = val; else err = -EINVAL; } else { err = -EOPNOTSUPP; } break; Now if we compare : - the value ranges are the same (0,1) - tcp_fastopen_init_key_once() only performs an initialization once - fastopen_queue_tune() only sets sk->max_qlen based on the backlog, this has no effect on an outgoing connection ; - tp->fastopen_connect can be applied to a listening socket without side effect. Thus I think we can merge them this way : case TCP_FASTOPEN: if (val >= 0) { if ((sysctl_tcp_fastopen & TFO_CLIENT_ENABLE) && (sk->sk_state == TCP_CLOSE) tp->fastopen_connect = val; if ((1 << sk->sk_state) & (TCPF_CLOSE | TCPF_LISTEN))) { tcp_fastopen_init_key_once(true); fastopen_queue_tune(sk, val); } } else { err = -EINVAL; } break; And for the userland, the API is even simpler because we can use the same TCP_FASTOPEN sockopt regardless of the socket direction. Also, I don't know if TCP_FASTOPEN is supported on simultaneous connect, but at least if it works it would be easier to understand this way. Do you think there's a compelling reason for adding a new option or are you interested in a small patch to perform the change above ? Regards, Willy