Re: [2.4 PATCH] ethernet (net/ethernet/eth.c): eth_header() may produce invalid packets (with dest. addr. = 00:00:00:00:00:00)

2007-01-11 Thread Willy Tarreau
On Wed, Jan 10, 2007 at 12:53:31PM +0100, Christian Praehauser wrote:
 Hello, and sorry for bothering you with a patch you've already seen ;-).

I had not seen it yet, so I'm fine with it :-)

 From: Christian Praehauser, Department of Computer Sciences, University of 
 Salzburg
 
 This patch fixes a problem which has already been corrected in Linux-2.6.16 
 but was not back-ported to the 2.4 series. It is essentially the same as 
 the patch for 2.6.16. An excerpt from the ChangeLog for Linux 2.6.16 is 
 included below.

David, I think that Christian and Alexey's descriptions are fine, and that
this patch should be merged. Are you OK ?

 What is not described in the patch description for 2.6.16 is that this 
 problem also arises when transmitting IP multicast packets. If you send an 
 IP multicast stream over an ethernet network interface ethX and turn off 
 ARP on ethX then Linux will produce an ethernet frame with a dest. 
 addresses of 00:00:00:00:00:00 (which is invalid). As IP multicast 
 addresses are directly mapped to HW (MAC) addresses without invoking any 
 ARP protocol mechanisms - for IP4 this mapping is performed by the function 
 ip_eth_mc_map - it makes perfect sense to do this even if ARP is disabled. 
 Further, this problem may occur periodically, everytime the corresponding 
 struct dst_entry is garbage-collected (e.g. ~ every 10 minutes).

Christian, would you please produce a properly formated description starting
from the 2.6 one and appending your comments to it ? I'd also like to get
the same subject as for the 2.6 patch. It's important that we keep patches
subjects and descriptions for backports, it helps further patch tracking.

Thanks in advance,
Willy

-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


planned driver updates for 2.4.35

2007-01-29 Thread Willy Tarreau
Hi David,

As stated with the 2.4.34 announcement, I planned to perform a few
updates in two network drivers for 2.4.35 :

  - e1000: the cards equipped with the not-so old 82546EB chips
have completely disappeared from the earth surface, and
people replacing hardware are experiencing trouble with the
new chip (82546GB) which is not supported by the old driver.
I know that Red Hat merged support for this chip in RHEL3
recently too (U8) by upgrading from v6 to v7.

Jesse Brandeburg from Intel offered to update the driver to
the more recent branch (7.3), which will make further
maintenance easier for him. I personnally use 7.3 on my own
kernels, so I just know that it works well for the NICs I'm
used to find on servers, but for the rest, I'll have to rely
on Intel people's support.

  - sk98lin: when Stephen decided to rewrite that driver from scratch
because the one from Marvell's site has too many bugs, I thought
he was exagerating, till the day I noticed NFS servers taking a
lot of time to respond and DNS servers timing out because of UDP
packets suddenly not leaving the host anymore. I recently
encountered a worst case at a customer's with losses, duplicates
and truncated packets. It's clear that the driver is too buggy.
Each time I replaced the official driver with Stephen's backport
and it definitely fixed the problems. I proposed Stephen to merge
his work into 2.4, and he agreed, offering to update it once it
gets in, which I'm fine with.

This means I would start with skge-1.4 and sky2-0.5 which I could
not get to fault on various intensive tests nor on the one which
trigger sk98lin's bugs. Since those drivers support cards that are
currently only supported by the out-of-tree driver from Marvell's
site, no compatibility is lost and users can still use their old
driver if they prefer it (as I did till I got those bugs).

Aside that, I rejected a user's request for an update of the tg3
driver which does not support one recent chip in a notebook. I think
that a notebook is not a big enough argument to touch this driver which
works quite well IMHO. A notebook is clearly where he should use 2.6 by
default, or take the time to download and install the out-of-tree code
from broadcom's site if he absolutely wants 2.4. Maybe I'm wrong, but I
have not tested the unofficial tg3 enough to have an opinion, and I don't
want to break things that work just for this.

I'd like to get your opinion on those updates before I open 2.4.35. I
think that integrating them early is the best way to get more testing,
since people often try the first and the latest versions only.

Best regards,
Willy

-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] sky2: flow control off

2007-02-03 Thread Willy Tarreau
Hi Stephen,

On Fri, Feb 02, 2007 at 03:34:25PM -0800, Stephen Hemminger wrote:
 Turn flow control off for sky2. When flow control is on, the transmitter
 may get randomly stuck. Perhaps there is hardware problem, but until
 Marvell provides errata information for workaround, it should default to off.

Are you aware of any way to reproduce or at least to increases chances
of occurrences of this problem ? It looks much like a problem I've been
experiencing with Marvell's driver with a 88E8053 chip on 2.4. Doing
ethtool -r proved useful to reset the transceiver in production.

When trying to reproduce the problem, I noticed corrupted and duplicated
frames in tcpdump captures when packets between 147 and 496 bytes were
received then re-routed via the same NIC towards a 100 Mbps machine on
the same switch, a Gig Dlink doing flow control. Yes, I know, there are
a lot of variables ! But changing the target machine, or the packet
sizes was enough to stop the corruption. I could not get the transmitter
stuck as I sometimes have in production, but I think that both problems
are related. The fact that a comparable problem is encountered with your
driver really makes me think about a hardware bug :-/

BTW, switching to your driver (sky2-1.5), I could not reproduce the
corruption problem anymore, but I've not yet put it in production
to check if it fixes the transceiver problem. Anyway, it might be
interesting that users who encounter this problem try an old version
of the driver just in case they detect a difference.

regards,
Willy

-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH-2.4] forcedeth update to 0.50

2006-05-31 Thread Willy Tarreau
On Wed, May 31, 2006 at 07:50:32AM +0200, Manfred Spraul wrote:
 Hi Willy,
 
 Willy Tarreau wrote:
 
 I started from the latest backport you sent in september (0.42) and
 incrementally applied 2.6 updates. I stopped at 0.50 which provides
 VLAN support, because after this one, there are some 2.4-incompatible
 changes (64bit consistent memory allocation for rings, and MSI/MSIX
 support).
 
  
 
 I agree, 2.4 needs a backport. Either a full backport as you did, or a 
 minimal one-liner fix.
 Right now, the driver is not usable due to an incorrect initialization.
 Or to be more accurate:
# modprobe
# ifup
 works.
 But
# modprobe
# ifup
# ifdown
# ifup
 causes a misconfiguration, and the nic hangs hard after a few MB. And 
 recent distros do the equivalent of ifup/ifdown/ifup somewhere in the 
 initialization.

That's what I read in one of the changelogs, but I'm not sure at all that
it's what happened, because I had the problem after an ifup only. What I
was doing with this box was pure performance tests which drew me to compare
the broadcom and nforce performance. My tests measured 3 creteria :

  - number of HTTP/1.0 hits/s
  - maximum data rate
  - maximum packets/s

on tg3, I got around 45 khits/s, 949 Mbps (TCP, =1.0 Gbps on wire) and
1.05 Mpps receive (I want to build a high speed load-balancer and a sniffer).
This was stable.

On the nforce, I tried with the hits/s first because it's a good indication
of hardware-based and driver-based optimizations. It reached 18 khits/s with
a lot of difficulty and the machine was stuck at 100% of one CPU. But it ran
for a few minutes like this. Then I tried data rate (which is the same test
with 1MB objects), and it failed after about 2 seconds and few megabytes (or
hundreds of megabytes) transferred.

I had to reboot to get it to work again. And I'm fairly sure that I did not
do down/up this time as well, but the test came to the same end.

That's why I'm not sure at all that the one-liner will be enough.

Moreover, after the update, I reached the same performance as with the
broadcom, with a slight improvement on packet reception (1.09 Mpps), and
low CPU usage (15%). So basically, the upgrade rendered the driver from
barely usable for SSH to very performant.

 Marcelo: Do you need a one-liner, or could you apply a large backport 
 patch?

I would really vote for the full backport, and I can break it into pieces
if needed (I have them at hand, just have to re-inject the changelogs).
However, I have separate changes from 0.42 to 0.50, because I started
with your 0.30-0.42 backport patch.

I have this machine till the end of the week, so I can perform other tests
if you're interested in trying specific things.

 --
Manfred

Cheers,
Willy

-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH-2.4] forcedeth update to 0.50

2006-05-31 Thread Willy Tarreau
On Wed, May 31, 2006 at 09:50:38PM +0200, Manfred Spraul wrote:
 Marcelo Tosatti wrote:
 
 Since v2.4.33 should be out RSN, my opinion is that applying the 
 one-liner to fix the bringup problem for now is more prudent..
 
  
 
 It's attached. Untested, but it should work. Just the relevant hunk from 
 the 0.42 patch.

I will test it tomorrow morning. John might be interested in merging it too,
as I have checked today that RHEL3 was affected by the same problem (rmmod
followed by modprobe).

 But I would disagree with waiting for 2.3.34 for a full backport:
 0.30 basically doesn't work, thus the update to 0.50 would be a big step 
 forward - it can't be worse that 0.30.

Seconded !
Manfred, if you have some corner cases in mind, are aware of anything which
might sometimes break, or have a few experimental patches to try, I'm OK for
a few tests while I have the machine (it's SMP BTW).

 --
Manfred

Cheers,
Willy

-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Patch] 2.4.32 - Neighbour Cache (ARP) State machine bug Fixed

2006-02-07 Thread Willy Tarreau
Hi,

On Tue, Feb 07, 2006 at 12:57:43AM -0700, Pradeep Vincent wrote:
 In 2.4.21, arp code uses gc_timer to check for stale arp cache
 entries. In 2.6, each entry has its own timer to check for stale arp
 cache. 2.4.29 to 2.4.32 kernels (atleast) use neither of these timers.
 This causes problems in environments where IPs or MACs are reassigned
 - saw this problem on load balancing router based networks that use
 VMACs. Tested this code on load balancing router based networks as
 well as peer-linux systems.
 
 
 Thanks,
 
 
 Signed off by: Pradeep Vincent [EMAIL PROTECTED]
 
 diff -Naur old/net/core/neighbour.c new/net/core/neighbour.c
 --- old/net/core/neighbour.cWed Nov 23 17:15:30 2005
 +++ new/net/core/neighbour.cWed Nov 23 17:26:01 2005
 @@ -14,6 +14,7 @@
 * Vitaly E. Lavrovreleasing NULL neighbor in neigh_add.
 * Harald WelteAdd neighbour cache statistics like rtstat
 * Harald Welteport neighbour cache rework from 2.6.9-rcX
 + *  Pradeep Vincent Move neighbour cache entry to stale state
 */

As you can see above, your mailer is still broken. Leading spaces get
removed and it seems like tabs are replaced with spaces. This makes it
really annoying to fix by hand because we all have to do your work again.
You should try to fix your mailer options, possibly by sending a few
mails to yourself or someone else (if you send *a few* mails to me, I
can confirm which one looks OK). If your mailer is definitely broken,
then you may send it as plain text first (for review), with a text
attachment for people to apply it without trouble.

Thanks,
Willy

-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


SKGE backport to 2.4 : success

2006-10-28 Thread Willy Tarreau
Hi Stephen,

In my own kernels, I've added your backport of SKGE to 2.4 that I found
here :

   http://developer.osdl.org/shemminger/releases/skge-sky2-backport.tar.bz2

It seems to work pretty well compared to the original syskonnect driver
(up to and including 8.36). Several people around me have reported very
slow NFS operations with the official driver, which I finally attributed
to a strange effect of UDP packets not going out after a while until they
get pushed by a TCP packet. I even noticed the problem at the company
and we turned the NFS server to an unused 100 Mbps card to workaround the
problem before being able to fully ananlyze the problem.

It seems your driver is getting mature and its performance is very close to
the official one, while its code is smaller and apparently more reliable. I
was thinking about merging it in mainline 2.4 as a fix for people having
trouble with the syskonnect driver. It might also be easier to backport fixes
from 2.6 to 2.4 when the driver is the same.

I don't think we risk any regression because it won't replace an existing
driver, but will provide one to people who are used to download new versions
from an external tree.

Also, I'm not yet sure whether I would also backport the sky2 driver, because
I know about a handful boxes running in production with the official one with
88E8053 chips at high packet rates with no trouble at all. Anyway, as long as
the backport does not prevent them from using the external driver, there
should be no problem.

I'd like to get your opinion on this matter, and of course, Jeff's and Davem's.

Thanks in advance,
Willy

-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: SKGE backport to 2.4 : success

2006-11-06 Thread Willy Tarreau
On Mon, Nov 06, 2006 at 10:56:09AM -0800, Stephen Hemminger wrote:
 On Sat, 4 Nov 2006 22:08:55 +0100
 Willy Tarreau [EMAIL PROTECTED] wrote:
 
  Hi Stephen,
  
  I don't know if you received my mail since I got no reply.
  
  Thanks in advance for your comments,
  Willy
  
  On Sat, Oct 28, 2006 at 10:57:07PM +0200, Willy Tarreau wrote:
   Hi Stephen,
   
   In my own kernels, I've added your backport of SKGE to 2.4 that I found
   here :
   
  
   http://developer.osdl.org/shemminger/releases/skge-sky2-backport.tar.bz2
   
   It seems to work pretty well compared to the original syskonnect driver
   (up to and including 8.36). Several people around me have reported very
   slow NFS operations with the official driver, which I finally attributed
   to a strange effect of UDP packets not going out after a while until they
   get pushed by a TCP packet. I even noticed the problem at the company
   and we turned the NFS server to an unused 100 Mbps card to workaround the
   problem before being able to fully ananlyze the problem.
   
   It seems your driver is getting mature and its performance is very close 
   to
   the official one, while its code is smaller and apparently more reliable. 
   I
   was thinking about merging it in mainline 2.4 as a fix for people having
   trouble with the syskonnect driver. It might also be easier to backport 
   fixes
   from 2.6 to 2.4 when the driver is the same.
   
   I don't think we risk any regression because it won't replace an existing
   driver, but will provide one to people who are used to download new 
   versions
   from an external tree.
   
   Also, I'm not yet sure whether I would also backport the sky2 driver, 
   because
   I know about a handful boxes running in production with the official one 
   with
   88E8053 chips at high packet rates with no trouble at all. Anyway, as 
   long as
   the backport does not prevent them from using the external driver, there
   should be no problem.
   
   I'd like to get your opinion on this matter, and of course, Jeff's and 
   Davem's.
   
   Thanks in advance,
   Willy
   
 
 
 The backport needs to be updated. It is of older code.  I plan to do a new
 backport this week. The backport version doesn't use NAPI, because of issues
 with not wanting to change netdevice.h. For a good 2.4 version, I would
 make a version that was closer to 2.6 code (using NAPI).

That would be perfect, it would make backport of fixes even easier. I have
turned last version into a patch against 2.4.33 for in-tree inclusion, so
if you're interested in getting it for the Config.in, Makefiles and
Configure.help, do not hesitate.

 I did the backport because one of the equipment donors gave a VPN box whose
 base OS is RHEL based on 2.4.

It's amazing how having the hardware stimulates development, isn't it? :-)

Tbanks,
Willy

-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [IPv4] fib: Fix out of bound access of fib_props[]

2007-04-14 Thread Willy Tarreau
On Sat, Mar 24, 2007 at 08:33:39PM -0700, David Miller wrote:
 From: Thomas Graf [EMAIL PROTECTED]
 Date: Sat, 24 Mar 2007 16:34:36 +0100
 
  Fixes a typo which caused fib_props[] to have the wrong size
  and makes sure the value used to index the array which is
  provided by userspace via netlink is checked to avoid out of
  bound access.
  
  Signed-off-by: Thomas Graf [EMAIL PROTECTED]
 
 Applied, thanks.

Thomas, David, it seems to me that 2.4 needs the same fix for the
typo, but I see no place where we may add the check for  RTN_MAX.
Maybe this last one is not needed. If someone could enlighten me,
or at least provide a means to test if the bug it present, it would
help me.

Thanks in advance,
Willy

-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [IPv4] fib: Fix out of bound access of fib_props[]

2007-04-14 Thread Willy Tarreau
On Sat, Apr 14, 2007 at 04:26:55PM +0200, Willy Tarreau wrote:
 On Sat, Mar 24, 2007 at 08:33:39PM -0700, David Miller wrote:
  From: Thomas Graf [EMAIL PROTECTED]
  Date: Sat, 24 Mar 2007 16:34:36 +0100
  
   Fixes a typo which caused fib_props[] to have the wrong size
   and makes sure the value used to index the array which is
   provided by userspace via netlink is checked to avoid out of
   bound access.
   
   Signed-off-by: Thomas Graf [EMAIL PROTECTED]
  
  Applied, thanks.
 
 Thomas, David, it seems to me that 2.4 needs the same fix for the
 typo, but I see no place where we may add the check for  RTN_MAX.
 Maybe this last one is not needed. If someone could enlighten me,
 or at least provide a means to test if the bug it present, it would
 help me.

Finally, I think I have the correct fix below. Please someone confirm
or tell me I'm nuts.

Thanks in advance,
Willy


From 230c62b9e7000cfb407a079a21ad0f077f164b21 Mon Sep 17 00:00:00 2001
From: Willy Tarreau [EMAIL PROTECTED]
Date: Sat, 14 Apr 2007 17:44:03 +0200
Subject: [IPv4] fib: Fix out of bound access of fib_props[]

Backported from 2.6. Bug found and fixed by Thomas Graf :

Fixes a typo which caused fib_props[] to have the wrong size
and makes sure the value used to index the array which is
provided by userspace via netlink is checked to avoid out of
bound access.
---
 net/ipv4/fib_semantics.c |7 ++-
 1 files changed, 6 insertions(+), 1 deletions(-)

diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c
index afdf4bb..b930371 100644
--- a/net/ipv4/fib_semantics.c
+++ b/net/ipv4/fib_semantics.c
@@ -83,7 +83,7 @@ static struct
 {
int error;
u8  scope;
-} fib_props[RTA_MAX+1] = {
+} fib_props[RTN_MAX+1] = {
 { 0, RT_SCOPE_NOWHERE},/* RTN_UNSPEC */
{ 0, RT_SCOPE_UNIVERSE},/* RTN_UNICAST */
{ 0, RT_SCOPE_HOST},/* RTN_LOCAL */
@@ -431,6 +431,11 @@ fib_create_info(const struct rtmsg *r, struct kern_rta 
*rta,
const int nhs = 1;
 #endif
 
+   if (r-rtm_type  RTN_MAX) {
+   err = -EINVAL;
+   goto errout;
+   }
+
/* Fast check to catch the most weird cases */
if (fib_props[r-rtm_type].scope  r-rtm_scope)
goto err_inval;
-- 
1.4.4.4

-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [IPv4] fib: Fix out of bound access of fib_props[]

2007-04-16 Thread Willy Tarreau
On Mon, Apr 16, 2007 at 10:14:17AM +0200, Thomas Graf wrote:
 * Willy Tarreau [EMAIL PROTECTED] 2007-04-14 17:48
  Finally, I think I have the correct fix below. Please someone confirm
  or tell me I'm nuts.
 
 Looks good, same is needed for DECnet

Thank you Thomas.
The DECnet stuff was already merged, it was (expectedly) the same code.

Regards,
Willy

-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: 2.6.16, sk98lin out of date

2006-02-13 Thread Willy Tarreau
On Mon, Feb 13, 2006 at 02:03:14PM -0500, Lee Revell wrote:
 On Mon, 2006-02-13 at 12:06 +0100, Mws wrote:
  hi,
  as i do have the same problem i may help you out.
  
  at first, syskonnect did send their kernel diffs/patches but they
  we're rejected caused
  by coding style, indention and some people thinking that things can be
  done better. 
 
 Haha, they didn't like the LKML code review so they just stopped sending
 patches?  Classic.  Remind me not to buy their gear.

Lee, it's not always that simple. When you submit one driver, sometimes
reviewers tell you that for whatever reason your driver's structure is
wrong and it has to be changed a lot (and sometimes they're right of
course). But when you don't have enough ressource to do the job twice,
the best you can do is to maintain it out of tree, which is already a
pain. I'm not saying that it is what happened with their driver, I don't
know the history. However, I found your reaction somewhat hasty. I
personally would prefer to offer time and help before deciding that
I don't want anyone's products on this basis. It's not as if they
did not release their driver's source !

Cheers,
Willy

-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Linux v2.6.16-rc6

2006-03-12 Thread Willy Tarreau
On Sat, Mar 11, 2006 at 06:39:04PM -0800, David S. Miller wrote:
 From: Michal Piotrowski [EMAIL PROTECTED]
 Date: Sun, 12 Mar 2006 02:51:40 +0100
 
  I have noticed this warnings
  TCP: Treason uncloaked! Peer 82.113.55.2:11759/50967 shrinks window
  148470938:148470943. Repaired.
  TCP: Treason uncloaked! Peer 82.113.55.2:11759/50967 shrinks window
  148470938:148470943. Repaired.
  TCP: Treason uncloaked! Peer 82.113.55.2:11759/59768 shrinks window
  1124211698:1124211703. Repaired.
  TCP: Treason uncloaked! Peer 82.113.55.2:11759/59768 shrinks window
  1124211698:1124211703. Repaired.
  
  It maybe problem with ktorrent.
 
 It is a problem with the remote TCP implementation, it is
 illegally advertising a smaller window that it previously
 did.

on 2005/10/27, Herbert Xu provided a patch merged in 2.6.14 to fix some
erroneous occurences of this message (some of them appeared with Linux
on the other side). It would be interesting to know whether the peer
above is Linux or not, because it might be possible that Herbert's fix
needs to be applied to other places ?

Here comes his patch with his interesting analysis for reference, in
case it might give ideas to anybody.

Cheers,
Willy

---
From: Herbert Xu [EMAIL PROTECTED]
Date: Thu, 27 Oct 2005 08:47:46 + (+1000)
Subject: [TCP]: Clear stale pred_flags when snd_wnd changes
X-Git-Tag: v2.6.14
X-Git-Url: 
http://kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=2ad41065d9fe518759b695fc2640cf9c07261dd2

[TCP]: Clear stale pred_flags when snd_wnd changes

This bug is responsible for causing the infamous Treason uncloaked
messages that's been popping up everywhere since the printk was added.
It has usually been blamed on foreign operating systems.  However,
some of those reports implicate Linux as both systems are running
Linux or the TCP connection is going across the loopback interface.

In fact, there really is a bug in the Linux TCP header prediction code
that's been there since at least 2.1.8.  This bug was tracked down with
help from Dale Blount.

The effect of this bug ranges from harmless Treason uncloaked
messages to hung/aborted TCP connections.  The details of the bug
and fix is as follows.

When snd_wnd is updated, we only update pred_flags if
tcp_fast_path_check succeeds.  When it fails (for example,
when our rcvbuf is used up), we will leave pred_flags with
an out-of-date snd_wnd value.

When the out-of-date pred_flags happens to match the next incoming
packet we will again hit the fast path and use the current snd_wnd
which will be wrong.

In the case of the treason messages, it just happens that the snd_wnd
cached in pred_flags is zero while tp-snd_wnd is non-zero.  Therefore
when a zero-window packet comes in we incorrectly conclude that the
window is non-zero.

In fact if the peer continues to send us zero-window pure ACKs we
will continue making the same mistake.  It's only when the peer
transmits a zero-window packet with data attached that we get a
chance to snap out of it.  This is what triggers the treason
message at the next retransmit timeout.

Signed-off-by: Herbert Xu [EMAIL PROTECTED]
Signed-off-by: Arnaldo Carvalho de Melo [EMAIL PROTECTED]
---

--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -2239,6 +2239,7 @@ static int tcp_ack_update_window(struct 
/* Note, it is the only place, where
 * fast path is recovered for sending TCP.
 */
+   tp-pred_flags = 0;
tcp_fast_path_check(sk, tp);
 
if (nwin  tp-max_window) {




-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: 2.4.32: unresolved symbol unregister_qdisc

2006-04-09 Thread Willy Tarreau
On Sun, Apr 09, 2006 at 08:29:12PM -0700, Randy.Dunlap wrote:
 On Sun, 9 Apr 2006 23:20:01 -0400 (EDT) George P Nychis wrote:
 
  
   On Sun, 9 Apr 2006 22:49:50 -0400 (EDT) George P Nychis wrote:
   
   
   On Sun, 9 Apr 2006 22:05:33 -0400 (EDT) George P Nychis wrote:
   
   
   On Sun, 9 Apr 2006 13:37:25 -0400 (EDT) George P Nychis wrote:
   
   Thanks for the help.
   
   Here is the makefile:
   http://rafb.net/paste/results/auchPH75.html
   
   And here is the full errors I receive: 
   http://rafb.net/paste/results/Qplpqw74.html
   
   Greatly appreciate it
   
   - George
   
   [repeat: please don't top-post]
   
   I don't know how much I can help you.  It's been a long time
   since I've built external modules on 2.4.x.
   
   Problems that I see: - the Makefile does not use the expected 2.4
kernel build infrastructure; - kernel Makefile uses -nostdinc to
prevent use of userspace headers; - Makefile is trying to
   include userspace headers instead of kernel headers, e.g.: In file
   included from /usr/include/linux/if_ether.h:107, from 
   /usr/include/linux/netdevice.h:29, from sch_xcp.c:8: - this
   specified include directory is only in 2.6.x, not 2.4.x:
   -I/lib/modules/`uname -r`/build/include/asm/mach-default
   
   It's not clear to me how this makefile could work with 2.4.x at
   all. Is it supposed to, or that's just what you want to see it do?
   
   
   You could try to fix the Makefile based on makefile-changes
   articles at lwn.net. E.g.: http://lwn.net/Articles/151784/ 
   http://lwn.net/Articles/79984/ http://lwn.net/Articles/74767/ 
   http://lwn.net/Articles/69148/ http://lwn.net/Articles/21823/
   
   
   
   On Sat, 8 Apr 2006 19:18:47 -0400 (EDT) George P Nychis
   wrote:
   
   Yeah, this module is unfortunately not under the GPL, it
   was made for research and i am not the author, I was only
   given the code for my own research.
   
   I enabled that support in the kernel, and then tried to 
   recompile and get tons of errors/warnings... so maybe I am 
   missing something else to be enabled in the kernel... here
   are a few examples of errors:
   /usr/include/linux/skbuff.h:30:26: net/checksum.h: No such
   file or directory /usr/include/asm/irq.h:16:25:
   irq_vectors.h: No such file or directory
   /usr/include/linux/irq.h:72: error: `NR_IRQS' undeclared
   here (not in a function) /usr/include/asm/hw_irq.h:28:
   error: `NR_IRQ_VECTORS' undeclared here (not in a function)
   
   I think those are the top most errors, so if i can fix
   those hopefully the rest shall vanish!
   
   Looks like a Makefile problem then.  Can you post the
   Makefile? Hopefully it is using a Makefile and not just an
   elaborate gcc command line.
   
   [and please don't top-post]
   
   - George
   
   
   From: George P Nychis [EMAIL PROTECTED] Date: Sat, 8
   Apr 2006 18:47:34 -0400 (EDT)
   
   Hey,
   
   I have a kernel module that uses unregister_qdisc and 
   register_qdisc, whenever i try to insert the module I
   get: /lib/modules/2.4.32/kernel/net/sched/sch_xcp.o: 
   /lib/modules/2.4.32/kernel/net/sched/sch_xcp.o:
   unresolved symbol unregister_qdisc 
   /lib/modules/2.4.32/kernel/net/sched/sch_xcp.o: 
   /lib/modules/2.4.32/kernel/net/sched/sch_xcp.o:
   unresolved symbol register_qdisc
   
   Am i missing some sort of support in the kernel?
   
   Make sure CONFIG_NET_SCHED is enabled and that you
   compiled your module against that kernel.
   
   Where does this sch_xcp come from?  It's not in the
   vanilla sources.
   
   Also, please direct networking questions to the 
   netdev@vger.kernel.org mailing list which I have added to
   the CC:.
   
   --- ~Randy
   
   
   
   By the way, if I add -I/usr/src/linux/include to the compile line,
   it successfully compiles, however, i am back to the start:
   lanthanum-ini src-1.0.1 # insmod sch_xcp Using 
   /lib/modules/2.4.32/kernel/net/sched/sch_xcp.o 
   /lib/modules/2.4.32/kernel/net/sched/sch_xcp.o: 
   /lib/modules/2.4.32/kernel/net/sched/sch_xcp.o: unresolved symbol 
   unregister_qdisc /lib/modules/2.4.32/kernel/net/sched/sch_xcp.o: 
   /lib/modules/2.4.32/kernel/net/sched/sch_xcp.o: unresolved symbol 
   register_qdisc
   
   Yet your 2.4.32 kernel image file does have those symbols in it? Can
   you verify that by using 'nm' on the kernel image file?
   
   If so, then I suppose that you'll need to make a small module test
   case that exhibits this behavior, or just tell us where to get the
   sch_xcp files...
   
   (re-added cc: for netdev)
   
   --- ~Randy
   
   
   
   By kernel image, do you mean /usr/src/linux/vmlinux ? if so, 
   lanthanum-ini linux # nm vmlinux | grep register_qdisc c0399200 R
   __kstrtab_register_qdisc c0399240 R __kstrtab_unregister_qdisc c039ebc8 R
   __ksymtab_register_qdisc c039ebd0 R __ksymtab_unregister_qdisc c02eda40 T
   register_qdisc c02edaf0 T unregister_qdisc
   
   Yes.  That's good, then.
   
   --- ~Randy
   
   
  
  *sigh* ... still getting the 

[PATCH-2.6] e1000: fix media_type - phy_type thinko

2006-04-15 Thread Willy TARREAU

Recent patch cb764326dff0ee51aca0d450e1a292de65661055 introduced
a thinko in e1000_main.c : e1000_media_type_copper is compared
to hw.phy_type instead of hw.media_type. Original patch proposed
by Jesse Brandeburg was correct, but what has been merged is not.

---

 drivers/net/e1000/e1000_main.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

3df8a180d50c89a72c28abf37151e38ffda75f39
diff --git a/drivers/net/e1000/e1000_main.c b/drivers/net/e1000/e1000_main.c
index add8dc4..590a456 100644
--- a/drivers/net/e1000/e1000_main.c
+++ b/drivers/net/e1000/e1000_main.c
@@ -4156,7 +4156,7 @@ e1000_mii_ioctl(struct net_device *netde
spin_unlock_irqrestore(adapter-stats_lock, flags);
return -EIO;
}
-   if (adapter-hw.phy_type == e1000_media_type_copper) {
+   if (adapter-hw.media_type == e1000_media_type_copper) {
switch (data-reg_num) {
case PHY_CTRL:
if (mii_reg  MII_CR_POWER_DOWN)
-- 
1.2.4

-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] sky2: flow control off

2007-02-05 Thread Willy Tarreau
Hi Stephen,

First, thanks for this detailed explanation.

On Mon, Feb 05, 2007 at 09:22:53AM -0800, Stephen Hemminger wrote:
 Here is what I saw.
 
 The transmitter on the Marvell Yukon II (88e8053) hangs when doing transmit 
 flow
 control under load.  There appears to be a bug or race condition that 
 causes the MAC to stop transmitting data.
 
 There are two drivers for the Yukon II device on Linux. SysKonnect/Marvell
 has one called sk98lin it is downloadable from syskonnect.def, and I wrote
 one called sky2 that is part of the standard Linux kernel. This problem
 is reproducible with the sky2 driver only; the sk98lin driver has a watchdog
 routine that resets the hardware perodically, so it masks the problem.
 
 When the failure mode occurs only after several minutes of sustained activity
 and a situation where PAUSE frames would be received. In my testing I used
 
   server == 1000mbit  === switch --- 100mbit --- client
 
 Server was Mac Mini (88E8053) running Linux 2.6.20-rc7 and client was a 
 Sony Vaio (88e8036) laptop.  The server was running NFS in kernel
 and client was doing a large copy. The server was using UDP to cause
 large amounts of 802 pause frames. The problem is not as reproducible with
 TCP tests because TCP congestion control avoids over running the switch.

I encountered *exactly* this problem with a one-leg firewall equipped with a
88E8053 attached to a 1000 Mbps switch, itself hosting 100 Mbps stations,
but with sk98lin (2.4). Running tcpdump on the firewall, I noticed duplicated
and corrupted frames. I could only reproduce the duplicated and corrupted
frames on a lab setup, not the Tx hangs, by sending high UDP traffic on the
port to a 100 Mbps host. Sending to 1000 Mbps hosts never triggered the
problem, hence my conclusions about flow control too. What I found interesting
is that using a very old version of the sky2 driver which I had with me
(sky2 v0.5), I could not trigger the problem anymore. But right now, I realize
that this version of the driver did not support flow control yet, which might
converge with your observations :

# ethtool -i eth0
driver: sky2
version: 0.5
firmware-version: N/A
bus-info: 01:00.0

# ethtool -a eth0
Pause parameters for eth0:
Autonegotiate:  on
RX: off
TX: off

 When failure occurs:
  * packets continue to be received and passed up the stack
 
  * GMAC status register is the pause state
  * transmit packets continue transferred by the DMA into the RAM buffer
  * when the the RAM buffer fills no more packets are DMA'd
  * when transmit queue in driver fills, it gets a watch dog timeout
 
  * switch appears to get confused and other ports hang as well.
 
 During development of the sky2 driver a similar problem was observed on
 receive if the receive DMA buffer was not 8 byte aligned.  For performance
 reasons, Linux drivers usually offset the Rx buffer by 2 bytes so that
 the TCP/IP headers are aligned for faster CPU access.  If the sky2 Rx
 buffer was offset, then the receiver DMA would occasionally hung. The
 workaround for receive was to align the receive buffer on a quad word
 boundary.
 
 This problem appears to be flow control related because after disabling
 flow control, no errors occurred in a 48 hour test run.

No problem here with the old driver without flow control either. I can try
to disable it right here on my setup with sk98lin, and test again. I did not
know that the sk98lin had a watchdog, it could explain why sometimes the
system entered a strange state (packets taking *seconds* to be forwarded).

Anyway, I'm more and more convinced that there are hardware bugs. It is
not normal at all that both the original syskonnect driver and your fresh
new code show such similar problems !

 There probably are other races and hangs that are related. I don't
 consider all the hangs eliminated yet.

Well, at least you have a more maintainable driver than what was the
previous one, so you will eventually manage to fix all problems ;-)

Best regards,
Willy

-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2.4 0/3] e1000: update to 7.3.20

2007-02-06 Thread Willy Tarreau
Hi Auke,

On Mon, Feb 05, 2007 at 05:01:02PM -0800, Kok, Auke wrote:
 Willy,
 
 Please pull:
 
 git-pull git://lost.foo-projects.org/~ahkok/git/linux-2.4 e1000
 
 to receive an update for the e1000 driver. This updates the e1000
 driver in the 2.4 kernel to version 7.3.20-k4, roughly the equivalent
 of what is in 2.6.20 and the latest of our out-of-tree driver.
 
 This adds new hardware support and many fixes. Many customers are
 asking us for these updates and the current 2.4 kernel ships a very
 outdated version of the e1000 adapter.
 
 This patch includes small compatibility headers and code to minimize
 the changes we need to make to our driver and keep it in sync easier
 with the 2.6 kernel version so we can possibly continue to ship
 updates and fixes to this driver with more ease.

Pulled, thanks very much. I'll produce -pre1 shortly so that we can
ensure there's no regression.

 Cheers,
 
 Auke

Best Regards,
Willy

-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] tcp_cubic: use 32 bit math

2007-03-07 Thread Willy Tarreau
On Wed, Mar 07, 2007 at 07:10:47PM -0800, Stephen Hemminger wrote:
 David Miller wrote:
 From: Stephen Hemminger [EMAIL PROTECTED]
 Date: Wed, 7 Mar 2007 17:07:31 -0800
 
   
 The basic calculation has to be done in 32 bits to avoid
 doing 64 bit divide by 3. The value x is only 22bits max
 so only need full 64 bits only for x^2.
 
 Signed-off-by: Stephen Hemminger [EMAIL PROTECTED]
 
 
 Applied, thanks Stephen.
 
 What about Willy Tarreau's supposedly even faster variant?
 Or does this incorporate that set of improvements?
   
 That's what this is:
x = (2 * x + (uint32_t)div64_64(a, (uint64_t)x*(uint64_t)x)) / 3;

Confirmed, it's the same. BTW, has someone tested on a 64bit system if
it brings any difference ?

Willy

-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] tcp_cubic: use 32 bit math

2007-03-10 Thread Willy Tarreau
On Wed, Mar 07, 2007 at 07:51:35PM -0800, David Miller wrote:
 From: Stephen Hemminger [EMAIL PROTECTED]
 Date: Wed, 07 Mar 2007 19:10:47 -0800
 
  David Miller wrote:
   What about Willy Tarreau's supposedly even faster variant?
   Or does this incorporate that set of improvements?
 
  That's what this is:
  x = (2 * x + (uint32_t)div64_64(a, (uint64_t)x*(uint64_t)x)) / 3;
 
 Great, thanks for the clarification.

Oh BTW, I have a newer version with a first approximation of the
cbrt() before the div64_64, which allows us to reduce from 3 div64
to only 2 div64. This results in a version which is twice as fast
as the initial one (ncubic), but with slightly less accuracy (0.286%
compared to 0.247). But I see that other functions such as hcbrt()
had a 1.5% avg error, so I think this is not dramatic.

Also, I managed to remove all other divides, to be kind with CPUs
having a slow divide instruction or no divide at all. Since we compute
on limited range (22 bits), we can multiply then shift right. It shows
me even slightly better time on pentium-m and athlon, with a slightly
higher avg error (0.297% compared to 0.286%), and slightly smaller
code.

I just have to clean experiments from my code to provide a patch.
David, Stephen, are you interested ?

$ ./bictcp
fls(0)=0, fls(1)=1, fls(256)=9
Calibrating
Function clocks  mean(us) max(us)  std(us)  Avg error
bictcp  936 0.6124.28 1.99 0.172%
ocubic  886 0.5723.51 3.18 0.274%
ncubic  644 0.4216.59 2.18 0.247%
ncubic32444 0.2911.47 1.50 0.247%
ncubic32_1  444 0.2911.56 1.88 0.238%
ncubic32b3  337 0.22 8.67 0.88 0.286%
ncubic_ndiv3329 0.21 8.46 0.69 0.297%
acbrt   707 0.4618.05 0.80 0.275%
hcbrt   644 0.4216.44 0.51 1.580%


Regards,
Willy

-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] tcp_cubic: use 32 bit math

2007-03-13 Thread Willy Tarreau
;
/* a in [0..63] */
uint32_t bits;
bits = (uint8_t)a  1;
bits = ~1;
return 1 + ((0xFFEAAA54UL  bits)  0x03);
}

/* We want (b - 1) / 3 for b between 7 and 64 so that we have a
 * a bit shift count between 2 and 21 inclusive.
 */
b = (b * 84)  8;

/* We want the highest 3bit block from 'a'  */
x = ((/*cbrt_3msb_4lsb*/0xDCBA9863UL  VAL_3BIT_TO_SHIFT4(a, b))  
0x0F) + 3;
x = (x  b)  3;

/* one divide is enough to get a value within 0.4% */
x = (2 * x + (uint32_t)div64_64(a, (uint64_t)x * (uint64_t)(x - 1)));
x = ((x * 341)  10);
return x;
}
static uint32_t end_ncubic_1div() { }


/* calculate the cubic root of x using only an approximation.
 * Avg err ~= 4.521%
 */
static uint32_t ncubic_0div(uint64_t a)
{
uint32_t x;
uint32_t b;

/*
 * For large values, 3 bits inputs are enough.
 *
 * We store 4*cbrt(8*x)-1 for x in [0..7]
 *
 *  x |   cbrt  | 4*cbrt | int | int |
 *  range |   range | approx | val.|  -3 |
 * ---+-++-+-+
 * 56..63 | 3.83 - 3.98 | 15.66  | 16  |  13 |
 * 48..55 | 3.63 - 3.80 | 14.93  | 15  |  12 |
 * 40..47 | 3.42 - 3.61 | 14.12  | 14  |  11 |
 * 32..39 | 3.17 - 3.39 | 13.21  | 13  |  10 |
 * 24..31 | 2.88 - 3.14 | 12.15  | 12  |   9 |
 * 16..23 | 2.51 - 2.84 | 10.86  | 11  |   8 |
 * 08..15 | 2.00 - 2.46 |  9.16  |  9  |   6 |
 * 00..07 | 1.00 - 1.91 |  6.35  |  6  |   3 |
 *
 * We can store (4*cbrt(x)-3) in 4 bits for x in [0..63].
 * So we use the following map : 0xDCBA9863UL
 */
ALIGN64 static uint32_t cbrt_3msb_4lsb = 0xDCBA9863UL;

b = fls64(a);

if (b  7) {
if (b = 1)
return b;
/* a in [0..63] */
uint32_t bits;
bits = (uint8_t)a  1;
bits = ~1;
return 1 + ((0xFFEAAA54UL  bits)  0x03);
}

/* We want (b - 1) / 3 for b between 7 and 64 so that we have a
 * a bit shift count between 2 and 21 inclusive.
 */
b = (b * 84)  8;

/* We want the highest 3bit block from 'a'  */
x = ((/*cbrt_3msb_4lsb*/0xDCBA9863UL  VAL_3BIT_TO_SHIFT4(a, b))  
0x0F) + 3;
x = (x  b)  3;
return x;
}
static uint32_t end_ncubic_0div() { }



/* calculate the cubic root of x using table lookups only.
 * Avg err ~= 0.613%
 */
static uint32_t ncubic_tab0(uint64_t a)
{
uint32_t b;
uint32_t shift;

/*
 * cbrt(x) MSB values for x MSB values in [0..63].
 * Precomputed then refined by hand - Willy Tarreau
 *
 * For x in [0..63],
 *   v = cbrt(x  18) - 1
 *   cbrt(x) = (v[x] + 1)  6
 */
static uint8_t v[] = {
/* 0x00 */0,   63,   63,   63,  127,  127,  127,  127,
/* 0x08 */  129,  135,  139,  143,  147,  152,  155,  160,
/* 0x10 */  161,  165,  168,  172,  174,  177,  179,  182,
/* 0x18 */  184,  188,  190,  193,  195,  197,  199,  202,
/* 0x20 */  203,  205,  207,  209,  212,  213,  215,  217,
/* 0x28 */  219,  221,  222,  224,  226,  227,  229,  231,
/* 0x30 */  232,  234,  236,  237,  239,  240,  241,  243,
/* 0x38 */  244,  246,  247,  249,  251,  252,  253,  255,
};

b = fls64(a);

if (b  7)
/* a in [0..63] */
return (v[(uint32_t)a] + 31)  6;

b = ((b * 84)  8) - 1;
shift = (a  (b * 3));
return ((uint32_t)(v[shift] + 1)  b)  6;
}
static uint32_t end_ncubic_tab0() { }


/* calculate the cubic root of x using a table lookup followed by one
 * Newton-Raphson iteration.
 * Avg err ~= 0.195%
 */
static uint32_t ncubic_tab1(uint64_t a)
{
uint32_t x;
uint32_t b;
uint32_t z;
uint32_t shift;

/*
 * cbrt(x) MSB values for x MSB values in [0..63].
 * Precomputed then refined by hand - Willy Tarreau
 *
 * For x in [0..63],
 *   v = cbrt(x  18) - 1
 *   cbrt(x) = (v[x] + 10)  6
 */
static uint8_t v[] = {
/* 0x00 */0,   54,   54,   54,  118,  118,  118,  118,
/* 0x08 */  123,  129,  134,  138,  143,  147,  151,  156,
/* 0x10 */  157,  161,  164,  168,  170,  173,  176,  179,
/* 0x18 */  181,  185,  187,  190,  192,  194,  197,  199,
/* 0x20 */  200,  202,  204,  206,  209,  211,  213,  215,
/* 0x28 */  217,  219,  221,  222,  224,  225,  227,  229,
/* 0x30 */  231,  232,  234,  236,  237,  239,  240,  242,
/* 0x38 */  244,  245,  246

Re: [PATCH] tcp_cubic: use 32 bit math

2007-03-21 Thread Willy Tarreau
Hi Stephen,

On Wed, Mar 21, 2007 at 11:54:19AM -0700, Stephen Hemminger wrote:
 On Tue, 13 Mar 2007 21:50:20 +0100
 Willy Tarreau [EMAIL PROTECTED] wrote:

[...] ( cut my boring part )

  Here are the results classed by speed :
  
  /* Sample output on a Pentium-M 600 MHz :
  
  Function  clocks mean(us)  max(us)  std(us) Avg err size
  ncubic_tab0   79 0.66 7.20 1.04  0.613%  160
  ncubic_0div   84 0.70 7.64 1.57  4.521%  192
  ncubic_1div  178 1.4816.27 1.81  0.443%  336
  ncubic_tab1  179 1.4916.34 1.85  0.195%  320
  ncubic_ndiv3 263 2.1824.04 3.59  0.250%  512
  ncubic_2div  270 2.2424.70 2.77  0.187%  512
  ncubic32_1   359 2.9832.81 3.59  0.238%  544
  ncubic_3div  361 2.9933.08 3.79  0.170%  656
  ncubic32 364 3.0233.29 3.51  0.247%  544
  ncubic   529 4.3948.39 4.92  0.247%  720
  hcbrt539 4.4749.25 5.98  1.580%   96
  ocubic   732 4.9361.83 7.22  0.274%  320
  acbrt842 6.9876.73 8.55  0.275%  192
  bictcp  1032 6.9586.30 9.04  0.172%  768
  

[...]

 The following version of div64_64 is faster because do_div already
 optimized for the 32 bit case..

Cool, this is interesting because I first wanted to optimize it but did
not find how to start with this. You seem to get very good results. BTW,
you did not append your changes.

However, one thing I do not understand is why your avg error is about 1/3
below the original one. Was there a precision bug in the original div_64_64
or did you extend the values used in the test ?

Or perhaps you used -fast-math to build and the original cbrt() is less
precise in this case ?

 I get the following results on ULV Core Solo (ie slow current processor)
 and the following on 64bit Core Duo. ncubic_tab1 seems like
 the best (no additional error and about as fast)

OK. It was the one I preferred too unless tab0's avg error was acceptable.

 ULV Core Solo
 
 Function  clocks mean(us)  max(us)  std(us) Avg err size
 ncubic_tab0  19211.2445.1015.28  0.450% -2262
 ncubic_0div  20111.7747.2327.40  3.357% -2404
 ncubic_1div  32419.0276.3225.82  0.189% -2567
 ncubic_tab1  32619.1376.7323.71  0.043% -2059
 ncubic_2div  45626.72   108.92   493.16  0.028% -2790
 ncubic_ndiv3 46327.15   133.37  1889.39  0.104% -3344
 ncubic32 54932.18   130.59   508.97  0.041% -3794
 ncubic32_1   57433.66   138.32   548.48  0.029% -3604
 ncubic_3div  58134.04   140.24   608.55  0.018% -3050
 ncubic   73342.92   173.35   523.19  0.041%  299
 ocubic  104661.25   283.68  3305.65  0.027% -2232
 acbrt   114967.32   284.91  1941.55  0.029%  168
 bictcp  166397.41   394.29   604.86  0.017%  628
 
 Core 2 Duo
 
 Function  clocks mean(us)  max(us)  std(us) Avg err size
 ncubic_0div   74 0.03 1.60 0.07  3.357% -2101
 ncubic_tab0   74 0.03 1.60 0.04  0.450% -2029
 ncubic_1div  142 0.07 3.11 1.05  0.189% -2195
 ncubic_tab1  144 0.07 3.18 1.02  0.043% -1638
 ncubic_2div  216 0.10 4.74 1.07  0.028% -2326
 ncubic_ndiv3 219 0.10 4.76 1.04  0.104% -2709
 ncubic32 269 0.13 5.87 1.13  0.041% -1500
 ncubic32_1   272 0.13 5.92 1.10  0.029% -2881
 ncubic   273 0.13 5.96 1.13  0.041% -1763
 ncubic_3div  290 0.14 6.32 1.01  0.018% -2499
 acbrt430 0.20 9.42 1.18  0.029%   77
 ocubic   444 0.21 9.82 1.82  0.027% -1924
 bictcp   549 0.2612.06 1.68  0.017%  236

Thanks,
Willy

-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Network performance degradation from 2.6.11.12 to 2.6.16.20

2006-06-25 Thread Willy Tarreau
Hi Andi,

On Mon, Jun 19, 2006 at 05:24:31PM +0200, Andi Kleen wrote:
 
  If you use pmtmr try to reboot with kernel option clock=tsc.
 
 That's dangerous advice - when the system choses not to use
 TSC it often has a reason.
 
  
  On my Opteron AMD system i normally can route 400 kpps, but with 
  timesource pmtmr i could only route around 83 kpps.  (I found the timer 
  to be the issue by using oprofile).
 
 Unless you're using packet sniffing or any other application
 that requests time stamps on a socket then the timer shouldn't 
 make much difference. Incoming packets are only time stamped
 when someone asks for the timestamps.

I encountered the same problem on a dual core opteron equipped with a
broadcom NIC (tg3) under 2.4. It could receive 1 Mpps when using TSC
as the clock source, but the time jumped back and forth, so I changed
it to 'notsc', then the performance dropped dramatically to around the
same value as above with one CPU saturated. I suspect that the clock
precision is needed by the tg3 driver to correctly decide to switch to
polling mode, but unfortunately, the performance drop rendered the
solution so much unusable that I finally decided to use it only in
uniprocessor with TSC enabled.

 -Andi

Regards,
Willy

-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] getsockopt() early argument sanity checking

2006-08-19 Thread Willy Tarreau
On Sun, Aug 20, 2006 at 03:05:32AM +0400, Solar Designer wrote:
 Willy,
 
 I propose the attached patch (extracted from 2.4.33-ow1) for inclusion
 into 2.4.34-pre.
 
 (2.6 kernels could benefit from the same change, too, but at the moment
 I am dealing with proper submission of generic changes like this that
 are a part of 2.4.33-ow1.)
 
 The patch makes getsockopt(2) sanity-check the value pointed to by
 the optlen argument early on.  This is a security hardening measure
 intended to prevent exploitation of certain potential vulnerabilities in
 socket type specific getsockopt() code on UP systems.
 
 This change has been a part of -ow patches for some years.

looks valid to me, merged.

Thanks Alexander !
Willy


 Thanks,
 
 -- 
 Alexander Peslyak solar at openwall.com
 GPG key ID: B35D3598  fp: 6429 0D7E F130 C13E C929  6447 73C3 A290 B35D 3598
 http://www.openwall.com - bringing security into open computing environments

 diff -urpPX nopatch linux-2.4.33/net/socket.c linux/net/socket.c
 --- linux-2.4.33/net/socket.c Wed Jan 19 17:10:14 2005
 +++ linux/net/socket.cSat Aug 12 08:51:47 2006
 @@ -1307,10 +1307,18 @@ asmlinkage long sys_setsockopt(int fd, i
  asmlinkage long sys_getsockopt(int fd, int level, int optname, char *optval, 
 int *optlen)
  {
   int err;
 + int len;
   struct socket *sock;
  
   if ((sock = sockfd_lookup(fd, err))!=NULL)
   {
 + /* XXX: insufficient for SMP, but should be redundant anyway */
 + if (get_user(len, optlen))
 + err = -EFAULT;
 + else
 + if (len  0)
 + err = -EINVAL;
 + else
   if (level == SOL_SOCKET)
   err=sock_getsockopt(sock,level,optname,optval,optlen);
   else

-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] getsockopt() early argument sanity checking

2006-08-19 Thread Willy Tarreau
On Sun, Aug 20, 2006 at 02:05:20AM +0200, Michael Buesch wrote:
 On Sunday 20 August 2006 01:48, Willy Tarreau wrote:
  On Sun, Aug 20, 2006 at 03:05:32AM +0400, Solar Designer wrote:
   Willy,
   
   I propose the attached patch (extracted from 2.4.33-ow1) for inclusion
   into 2.4.34-pre.
   
   (2.6 kernels could benefit from the same change, too, but at the moment
   I am dealing with proper submission of generic changes like this that
   are a part of 2.4.33-ow1.)
   
   The patch makes getsockopt(2) sanity-check the value pointed to by
   the optlen argument early on.  This is a security hardening measure
   intended to prevent exploitation of certain potential vulnerabilities in
   socket type specific getsockopt() code on UP systems.
   
   This change has been a part of -ow patches for some years.
  
  looks valid to me, merged.
 
 Not to me. It heavily violates codingstyle and screws brains
^^^
little exageration detected here.

 with the non-indented else branches.

while they surprized me first, they make the *patch* more readable
by clearly showing what has been inserted and where. However, I have
joined the lines for the merge.

 Learn about goto.

definitely not here. The if() expressions are all one-liners. Adding
a goto would mean two instructions, to which you add 2 braces. It will
not make the code more readable. Patch below is OK. If you have a hard
time understanding it, then it's because it's bedtime for you too :-)

Regards,
Willy


diff --git a/net/socket.c b/net/socket.c
index ac45b13..910ef88 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -1307,11 +1307,17 @@ asmlinkage long sys_setsockopt(int fd, i
 asmlinkage long sys_getsockopt(int fd, int level, int optname, char *optval, 
int *optlen)
 {
int err;
+   int len;
struct socket *sock;
 
if ((sock = sockfd_lookup(fd, err))!=NULL)
{
-   if (level == SOL_SOCKET)
+   /* XXX: insufficient for SMP, but should be redundant anyway */
+   if (get_user(len, optlen))
+   err = -EFAULT;
+   else if (len  0)
+   err = -EINVAL;
+   else if (level == SOL_SOCKET)
err=sock_getsockopt(sock,level,optname,optval,optlen);
else
err=sock-ops-getsockopt(sock, level, optname, optval, 
optlen);
-- 
1.4.1

-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] getsockopt() early argument sanity checking

2006-08-20 Thread Willy Tarreau
On Sun, Aug 20, 2006 at 10:34:43AM +0200, Andi Kleen wrote:
 On Sunday 20 August 2006 01:05, Solar Designer wrote:
  I propose the attached patch (extracted from 2.4.33-ow1) for inclusion
  into 2.4.34-pre.
  
  (2.6 kernels could benefit from the same change, too, but at the moment
  I am dealing with proper submission of generic changes like this that
  are a part of 2.4.33-ow1.)
 
 In general I don't think it makes sense to submit stuff for 2.4 
 that isn't in 2.6.

I generally agree with you on this, but I think that when they are just
preventive measures, they can be applied in whatever order, provided that
they don't get lost.

  The patch makes getsockopt(2) sanity-check the value pointed to by
  the optlen argument early on.  This is a security hardening measure
  intended to prevent exploitation of certain potential vulnerabilities in
  socket type specific getsockopt() code on UP systems.
 
 It's not only insufficient on SMP, but even on UP where a thread
 can sleep in get_user and another one can run in this time.

Valid point.

 Doing a check that is inherently racy everywhere doesn't seem like
 a security improvement to me. If there is really a length checking bug 
 somewhere 
 it needs to be fixed in a race-free way.

The only race-free solution would be to add an argument to all getsockopt()
functions and pass them the decoded value. There are other places where
multiple get_user() are performed on optlen, with some useless tests (eg
in net/ipv4/tcp.c, len is checked for 0 after having been compared to
sizeof(int) as unsigned int) and all those places would benefit from this.

But I don't want to induce such large changes in this kernel. The goal of
this test is a preventive measure to catch easily exploitable errors that
might have remained undetected. For instance, a quick glance shows this
portion of code in net/ipv4/raw.c (both 2.4 and 2.6) :

static int raw_seticmpfilter(struct sock *sk, char *optval, int optlen)
{
if (optlen  sizeof(struct icmp_filter))
optlen = sizeof(struct icmp_filter);
if (copy_from_user(sk-tp_pinfo.tp_raw4.filter, optval, optlen))
return -EFAULT;
return 0;
}

It only relies on sock_setsockopt() refusing optlen values  sizeof(int),
and this is not documented. Having part of this code being copied for use
in another code path would open a breach for optlen  0.

 If not then there is no need for a change.

There are two tests in this patch :

  - one on the validity of the optlen address. This one is race-free and
should be conserved anyway.

  - one on the optlen range which is valid for most cases but which is
subject to a race condition and which might be circumvented by
carefully written code and with some luck as in all race conditions
issues.

Some will say that this last one offers a first level of protection by
transforming determinist attacks into racy attacks which make potential
vulnerabilities much harder to exploit by code injection. I'm one of them.

Others will consider it totally useless because it does not cover all cases,
but I think it is against the general principle of precaution we try to
apply in security.

 -Andi

Regards,
Willy

-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] getsockopt() early argument sanity checking

2006-08-20 Thread Willy Tarreau
Hi David,

On Sun, Aug 20, 2006 at 12:44:27PM -0700, David Miller wrote:
 From: Willy Tarreau [EMAIL PROTECTED]
 Date: Sun, 20 Aug 2006 02:43:07 +0200
 
  On Sun, Aug 20, 2006 at 02:05:20AM +0200, Michael Buesch wrote:
   Not to me. It heavily violates codingstyle and screws brains
  ^^^
  little exageration detected here.
  
   with the non-indented else branches.
  
  while they surprized me first, they make the *patch* more readable
  by clearly showing what has been inserted and where. However, I have
  joined the lines for the merge.
 
 Thanks for consulting the networking maintainer before merging
 this. :-/

I'm sorry, will try to do better next time :-(
I only queue the patches locally while they're being discussed anyway.

 What if some sockopt treats a negative length specially?
 Maybe some setsockopt() doesn't care about the optlen pointer?

Which should not be a reason for userspace to respect the calling convention.
From man 2 getsockopt, it's clear that optlen is the size of the buffer :

   For getsockopt, optlen is a value-result
   parameter, initially containing the  size  of  the  buffer
   pointed  to  by optval, and modified on return to indicate
   the actual size of the value returned.

Where I can agree with you is on this part of the man :

   If no option value is to be supplied or returned, optval may be NULL.

Nothing is said about optlen's validity in such a case.

 This toplevel code has no buisness interpreting the arguments when the
 downcall and argument interpretation is by definition protocol
 specific.

Generally, the advantage of doing checks early is that they help enforcing
conventions and sometimes protect against a stupid bug in one of the multiple
leaves. It should in no way be an excuse to keep the remaining stupid bugs
when we spot them.

 It also means we'll touch userspace twice for this value which is really
 dumb.

If this is that dumb, then why is it done twice in tcp_getsockopt() for
TCP_INFO ? would you accept a fix which copies it in a local variable
instead ? 2.6 would also benefit from this for TCP_CONGESTION.

 The only nice part about this change is that it allows us to be lazy
 about auditing the individual setsockopt() implementations.

This is absolutely not the intent. When we merged Julien Tinnes's fixes
for NULL pointer checks, it looked like the callers already performed
the tests, but that was not a reason not to complete the test in the leaf
functions.

 I'd rather fix the broken cases than add a patch which just assumes they
 are broken and not worth fixing

We're not assuming they're broken. When some code is maintained by many people
and when conventions differ between similar functions (eg: setsockopt does
the check at top level and getsockopt in the leaves), it should be expected
that we populate the CVE lists from time to time when we decide that there
will always be one single check for every argument. And this is true for all
system parts.

 and also imposes a convention for the optlen argument.

I would better understand this one considering the fact that the man is
vague on this matter.

 No thanks.

OK, fine, I drop it. People who seek better security know where to find
hardening patches anyway.

Willy

-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: TCP SACK issue, hung connection, tcpdump included

2007-07-29 Thread Willy Tarreau
Hi,

[CCing netdev as it's where people competent on the subject are]

On Sun, Jul 29, 2007 at 06:59:26AM +0100, Darryl L. Miles wrote:
 CLIENT = Linux 2.6.20.1-smp [Customer build]
 SERVER = Linux 2.6.9-55.ELsmp [Red Hat Enterprise Linux AS release 4 
 (Nahant Update 5)]
 
 The problems start around time index 09:21:39.860302 when the CLIENT issues 
 a TCP packet with SACK option set (seemingly for a data segment which has 
 already been seen) from that point on the connection hangs.

Where was the capture taken ? on CLIENT or on SERVER (I suspect client from
the timers) ? A possible, but very unlikely reason would be an MTU limitation
somewhere, because the segment which never gets correctly ACKed is also the
largest one in this trace. It would be strange as the SACKs are send
immediately after, but it should be something to consider anyway.

Also, if everything is right with the packets on the server side, then
it would means that it's the RHEL kernel which is buggy (which does not
mean that the same bug does not exist in mainline).

Regards,
Willy


 Full dump available via email.
 
 
 09:12:13.444999 IP CLIENT.50727  SERVER.ssh: S 2919512080:2919512080(0) 
 win 5840 mss 1460,sackOK,timestamp 799860259 0,nop,wscale 7
 09:12:13.535643 IP SERVER.ssh  CLIENT.50727: S 492909547:492909547(0) ack 
 2919512081 win 5792 mss 1460,sackOK,timestamp 7126976 799860259,nop,wscale 
 2
 09:12:13.535717 IP CLIENT.50727  SERVER.ssh: . ack 1 win 46 
 nop,nop,timestamp 799860282 7126976
 09:12:13.665481 IP SERVER.ssh  CLIENT.50727: P 1:24(23) ack 1 win 1448 
 nop,nop,timestamp 7127074 799860282
 09:12:13.665895 IP CLIENT.50727  SERVER.ssh: . ack 24 win 46 
 nop,nop,timestamp 799860314 7127074
 09:12:13.666044 IP CLIENT.50727  SERVER.ssh: P 1:23(22) ack 24 win 46 
 nop,nop,timestamp 799860314 7127074
 09:12:13.776318 IP SERVER.ssh  CLIENT.50727: . ack 23 win 1448 
 nop,nop,timestamp 7127216 799860314
 
 ...SNIPPED SUCCESSFUL TRAFFIC...
 
 09:21:39.490740 IP SERVER.ssh  CLIENT.50727: P 18200:18464(264) ack 2991 
 win 2728 nop,nop,timestamp 7692910 81727
 09:21:39.490775 IP CLIENT.50727  SERVER.ssh: . ack 18464 win 378 
 nop,nop,timestamp 81755 7692910
 09:21:39.860245 IP SERVER.ssh  CLIENT.50727: . 12408:13856(1448) ack 2991 
 win 2728 nop,nop,timestamp 7693293 81749
 09:21:39.860302 IP CLIENT.50727  SERVER.ssh: . ack 18464 win 378 
 nop,nop,timestamp 81847 7692910,nop,nop,sack sack 1 {12408:13856} 
 09:21:40.453440 IP SERVER.ssh  CLIENT.50727: . 12408:13856(1448) ack 2991 
 win 2728 nop,nop,timestamp 7693887 81749
 09:21:40.453495 IP CLIENT.50727  SERVER.ssh: . ack 18464 win 378 
 nop,nop,timestamp 81996 7692910,nop,nop,sack sack 1 {12408:13856} 
 09:21:41.641821 IP SERVER.ssh  CLIENT.50727: . 12408:13856(1448) ack 2991 
 win 2728 nop,nop,timestamp 7695075 81749
 09:21:41.641938 IP CLIENT.50727  SERVER.ssh: . ack 18464 win 378 
 nop,nop,timestamp 82293 7692910,nop,nop,sack sack 1 {12408:13856} 
 09:21:44.017552 IP SERVER.ssh  CLIENT.50727: . 12408:13856(1448) ack 2991 
 win 2728 nop,nop,timestamp 7697451 81749
 09:21:44.017622 IP CLIENT.50727  SERVER.ssh: . ack 18464 win 378 
 nop,nop,timestamp 82887 7692910,nop,nop,sack sack 1 {12408:13856} 
 09:21:48.770051 IP SERVER.ssh  CLIENT.50727: . 12408:13856(1448) ack 2991 
 win 2728 nop,nop,timestamp 7702203 81749
 09:21:48.770099 IP CLIENT.50727  SERVER.ssh: . ack 18464 win 378 
 nop,nop,timestamp 84075 7692910,nop,nop,sack sack 1 {12408:13856} 
 09:21:58.274061 IP SERVER.ssh  CLIENT.50727: . 12408:13856(1448) ack 2991 
 win 2728 nop,nop,timestamp 7711707 81749
 09:21:58.274180 IP CLIENT.50727  SERVER.ssh: . ack 18464 win 378 
 nop,nop,timestamp 86450 7692910,nop,nop,sack sack 1 {12408:13856} 
 09:22:17.282035 IP SERVER.ssh  CLIENT.50727: . 12408:13856(1448) ack 2991 
 win 2728 nop,nop,timestamp 7730715 81749
 09:22:17.282153 IP CLIENT.50727  SERVER.ssh: . ack 18464 win 378 
 nop,nop,timestamp 800011202 7692910,nop,nop,sack sack 1 {12408:13856} 
 09:22:55.298955 IP SERVER.ssh  CLIENT.50727: . 12408:13856(1448) ack 2991 
 win 2728 nop,nop,timestamp 7768731 81749
 09:22:55.299023 IP CLIENT.50727  SERVER.ssh: . ack 18464 win 378 
 nop,nop,timestamp 800020705 7692910,nop,nop,sack sack 1 {12408:13856} 
 09:24:11.329853 IP SERVER.ssh  CLIENT.50727: . 12408:13856(1448) ack 2991 
 win 2728 nop,nop,timestamp 7844763 81749
 09:24:11.329923 IP CLIENT.50727  SERVER.ssh: . ack 18464 win 378 
 nop,nop,timestamp 800039711 7692910,nop,nop,sack sack 1 {12408:13856} 
 09:26:11.331578 IP SERVER.ssh  CLIENT.50727: . 12408:13856(1448) ack 2991 
 win 2728 nop,nop,timestamp 7964763 81749
 09:26:11.331699 IP CLIENT.50727  SERVER.ssh: . ack 18464 win 378 
 nop,nop,timestamp 800069708 7692910,nop,nop,sack sack 1 {12408:13856} 
 09:26:13.011885 IP CLIENT.50727  SERVER.ssh: P 2991:3039(48) ack 18464 win 
 378 nop,nop,timestamp 800070128 7692910
 09:26:13.309032 IP CLIENT.50727  SERVER.ssh: P 2991:3039(48) ack 18464 win 
 378 

Re: TCP SACK issue, hung connection, tcpdump included

2007-07-29 Thread Willy Tarreau
On Sun, Jul 29, 2007 at 11:26:00AM +0300, Ilpo Järvinen wrote:
 On Sun, 29 Jul 2007, Willy Tarreau wrote:
 
  On Sun, Jul 29, 2007 at 06:59:26AM +0100, Darryl L. Miles wrote:
   CLIENT = Linux 2.6.20.1-smp [Customer build]
   SERVER = Linux 2.6.9-55.ELsmp [Red Hat Enterprise Linux AS release 4 
   (Nahant Update 5)]
   
   The problems start around time index 09:21:39.860302 when the CLIENT 
   issues 
   a TCP packet with SACK option set (seemingly for a data segment which has 
   already been seen) from that point on the connection hangs.
 
 ...That's DSACK and it's being correctly sent. To me, it seems unlikely to 
 be the cause for this breakage...
 
  Where was the capture taken ? on CLIENT or on SERVER (I suspect client from
  the timers) ? 
 
 ...I would guess the same based on SYN timestamps (and from the DSACK 
 timestamps)...
 
  A possible, but very unlikely reason would be an MTU limitation
  somewhere, because the segment which never gets correctly ACKed is also the
  largest one in this trace.
 
 Limitation for 48 byte segments? You have to be kidding... :-) But yes,
 it seems that one of the directions is dropping packets for some reason 
 though I would not assume MTU limitation... Or did you mean some other 
 segment?

No, I was talking about the 1448 bytes segments. But in fact I don't
believe it much because the SACKs are always retransmitted just afterwards.

BTW, some information are missing. It would have been better if the trace
had been read with tcpdump -Svv. We would have got seq numbers and ttl.
Also, we do not know if there's a firewall between both sides. Sometimes,
some IDS identify attacks in crypted traffic and kill connections. It
might have been the case here, with the connection closed one way on an
intermediate firewall.

Regards,
Willy

-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: TCP SACK issue, hung connection, tcpdump included

2007-07-29 Thread Willy Tarreau
On Sun, Jul 29, 2007 at 12:28:04PM +0300, Ilpo Järvinen wrote:
(...)
   Limitation for 48 byte segments? You have to be kidding... :-) But yes,
   it seems that one of the directions is dropping packets for some reason 
   though I would not assume MTU limitation... Or did you mean some other 
   segment?
  
  No, I was talking about the 1448 bytes segments. But in fact I don't
  believe it much because the SACKs are always retransmitted just afterwards.
 
 Ah, but it's ACKed correctly right below it...:
 
 [...snip...]
 09:21:39.490740 IP SERVER.ssh  CLIENT.50727: P 18200:18464(264) ack 
 2991 
 win 2728 nop,nop,timestamp 7692910 81727
 09:21:39.490775 IP CLIENT.50727  SERVER.ssh: . ack 18464 win 378 
 nop,nop,timestamp 81755 7692910
 09:21:39.860245 IP SERVER.ssh  CLIENT.50727: . 12408:13856(1448) ack 
 2991 
 win 2728 nop,nop,timestamp 7693293 81749
 
 ...segment below snd_una arrived = snd_una remains 18464, receiver 
 generates a duplicate ACK:
  
 09:21:39.860302 IP CLIENT.50727  SERVER.ssh: . ack 18464 win 378 
 nop,nop,timestamp 81847 7692910,nop,nop,sack sack 1 
 {12408:13856} 
 
 The cumulative ACK field of it covers _everything_ below 18464 (i.e., it 
 ACKs them), including the 1448 bytes in 12408:13856... In addition, the 
 SACK block is DSACK information [RFC2883] telling explicitly the address 
 of the received duplicate block. However, if this ACK doesn't reach the 
 SERVER TCP, RTO is triggered and the first not yet cumulatively ACKed 
 segment is retransmitted (I guess cumulative ACKs up to 12408 arrived 
 without problems to the SERVER):

Oh yes, you're damn right. I did not notice that the ACK was higher than
the SACK, I'm more used to read traces with absolute rather than relative
seq/acks.

So I agree, it is this ACK which is lost between client and server,
reinforcing the supposition about the location of the capture (client side).

 [...snip...]
 
  BTW, some information are missing. It would have been better if the trace
  had been read with tcpdump -Svv. We would have got seq numbers and ttl.
  Also, we do not know if there's a firewall between both sides. Sometimes,
  some IDS identify attacks in crypted traffic and kill connections. It
  might have been the case here, with the connection closed one way on an
  intermediate firewall.
 
 Yeah, firewall or some other issue, I'd say it's quite unlikely a bug in 
 TCP because behavior to both directions indicate client - sender 
 blackhole independently of each other...

It would also be possible that something stupid between both ends simply
drops packets with the SACK option set. Also something which sometimes
happen is that some firewalls automatically translate sequence numbers
but not necessarily SACK values, which could pretty well lead to this
packet being received but ignored on the server side.

I'm pretty sure that the same trace taken on the server side will reveal
the reason for the problem.

Willy

-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] e100 module loads 1/2 times

2007-08-13 Thread Willy Tarreau
Hi Auke, Jesse,

for a long time, I've been annoyed by version 3.5.17 of the e100 driver
which refuses to load on first time and only loads on second time. Since
I always had the original 2.3.43 driver in kernel 2.4, I did not care
that much. Recently, I encountered real troubles with 2.3.43 in a 802.1q
setup (basically it did not untag incoming frames). So I decided to give
3.5.17 a second try. Same problem, I had to load it twice.

I finally found the problem in e100_init_module(). Up to 3.5.14, the
return of pci_module_init() was returned. This one equals zero if
everything went fine, 0 otherwise, which is compatible with init_module().

With 3.5.17, the result comes from pci_register_driver(), which returns
the number of devices registered. So the problem now makes sense :

- first call: the driver registers itself and returns non-zero, which
  is an error for insmod

- second call: the driver cannot register again and returns zero new
  drivers, which is good for insmod.

Note that e1000 has a related bug : it uses the return from pci_module_init()
to decide whether or not to register a reboot notifier. Fortunately, the
test is performed with ret=0, which matches ==0 and not 0 (errors), so
this works as a side effect.

The obvious fix for e100 reusing pci_module_init is below. Also, since
e100-2.3.43 does not work at all with vlans in 2.4, I was thinking about
upgrading it to 3.5.17. It would also be the same version as in 2.6,
simplifying its long-term maintenance. What do you think about this ?

Best regards,
Willy


--- e100-3.5.17/src/e100.c.orig 2007-08-13 08:53:18 +0200
+++ e100-3.5.17/src/e100.c  2007-08-13 09:24:56 +0200
@@ -2934,13 +2934,13 @@
printk(KERN_INFO PFX %s\n, DRV_COPYRIGHT);
}
 #ifdef E100_USE_REBOOT_NOTIFIER
-   retval = pci_register_driver(e100_driver);
-   if (retval = 0)
+   retval = pci_module_init(e100_driver);
+   if (retval == 0)
register_reboot_notifier(e100_notifier_reboot);
 
return retval;
 #else
-   return pci_register_driver(e100_driver);
+   return pci_module_init(e100_driver);
 #endif
 }
 


-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Marvell 88E8056 gigabit ethernet controller

2007-08-18 Thread Willy Tarreau
On Sat, Aug 18, 2007 at 04:45:26AM -0700, Kevin E wrote:
 
 --- Willy Tarreau [EMAIL PROTECTED] wrote:
 
  No Stephen, look again, he says that moving the
  video card into the broken
  system does not change anything.
 
 Correct, I've used three different video cards in the
 broken machine.  I've used an old PCI vid card, the
 PCI-X vid card from the working machine, and now PCI-X
 card I just bought yesterday (nvidia based).  None
 have affected whether the Marvell chipset works or
 not.  
 
 Also, the broken machine is a server so I don't start
 X on it.  It just sits in console mode all the time. 
 The CPU (Core2 E4400) used to be in the working
 machine, then upgraded it to the Q6600 and put the old
 E4400 in the new MB that became the broken machine. 
 Memory was brand new out of the package.  So I've used
 the E4400 in a MB that worked fine.
 
 
  I don't understand why the working one is on PCI bus
  3 while the other
  is on PCI bus 4. It's just as if the chip embedded a
  PCI bridge. Maybe
  those chips are just cheaper dual-channel
  controllers with one faulty
  controller disabled. It would also explain why the
  PCI ID is different.
 
 I've attached the lspci -vvv output for the two
 machines, if it doesn't come through just let me know
 how I can get it to you.

OK, in this trace, both controllers are on the same bus. The broken
one has 'Capabilities: [100] Advanced Error Reporting' the other
does not have, and the bridge to this bus has two more capabilities :
'Capabilities: [100] Virtual Channel' and 'Capabilities: [180] Unknown (5)'.

I don't know whether it can jutify a different behaviour. Also, maybe this
is caused by a minuscule difference in the BIOS setup ?

Regards,
Willy

-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [2.6.20.17 review 35/58] forcedeth bug fix: realtek phy

2007-08-22 Thread Willy Tarreau
On Wed, Aug 22, 2007 at 11:56:42AM -0400, Chuck Ebbert wrote:
 On 08/22/2007 05:39 AM, Willy Tarreau wrote:
  This patch contains errata fixes for the realtek phy. It only renamed the
  defines to be phy specific.
  
  Signed-off-by: Ayaz Abdulla [EMAIL PROTECTED]
  Signed-off-by: Greg Kroah-Hartman [EMAIL PROTECTED]
  Signed-off-by: Willy Tarreau [EMAIL PROTECTED]
  ---
   drivers/net/forcedeth.c |   54 
  +++
   1 files changed, 54 insertions(+), 0 deletions(-)
  
  diff --git a/drivers/net/forcedeth.c b/drivers/net/forcedeth.c
  index c383dc3..dbfdbed 100644
  --- a/drivers/net/forcedeth.c
  +++ b/drivers/net/forcedeth.c
  @@ -554,6 +554,7 @@ union ring_type {
   #define PHY_OUI_MARVELL0x5043
   #define PHY_OUI_CICADA 0x03f1
   #define PHY_OUI_VITESSE0x01c1
  +#define PHY_OUI_REALTEK0x01c1
   #define PHYID1_OUI_MASK0x03ff
   #define PHYID1_OUI_SHFT6
   #define PHYID2_OUI_MASK0xfc00
 
 Realtek is 0x0732
 
 This is still wrong upstream -- what happened to the patch to fix it?

Good catch, thanks Chuck! I've already seen the fix somewhere, I believe it
was on netdev, though I'm not sure. I'm fixing the patch in place right now.
I can add your signoff if you want.

Cheers,
Willy

-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] fix realtek phy id in forcedeth

2007-08-23 Thread Willy Tarreau
Hi Greg,

On Thu, Aug 23, 2007 at 09:55:13AM -0700, Greg KH wrote:
 It might help if someone sends a real patch that can be applied :)

This is getting really silly now :-) We're all wasting more time
wondering who will send the patch than posting it. I've lost, I got
fed up first, so here it is. Please apply to mainline then stable.

Thanks,
Willy

--

From a0e2922b99eedd9863232368ea2afe072c52783e Mon Sep 17 00:00:00 2001
From: Willy Tarreau [EMAIL PROTECTED]
Date: Thu, 23 Aug 2007 21:35:41 +0200
Subject: [PATCH] fix realtek phy id in forcedeth

As noticed by Chuck Ebbert, commit c5e3ae8823693b260ce1f217adca8add1bc0b3de
introduced a copy-paste typo, as realtek phy is 0x732 and not 0x1c1. Obvious
fix below suggested by Ayaz Abdulla.

Signed-off-by: Willy Tarreau [EMAIL PROTECTED]
Cc: Ayaz Abdulla [EMAIL PROTECTED]
Cc: Chuck Ebbert [EMAIL PROTECTED]
---
 drivers/net/forcedeth.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/drivers/net/forcedeth.c b/drivers/net/forcedeth.c
index 10f4e3b..1938d6d 100644
--- a/drivers/net/forcedeth.c
+++ b/drivers/net/forcedeth.c
@@ -552,7 +552,7 @@ union ring_type {
 #define PHY_OUI_MARVELL0x5043
 #define PHY_OUI_CICADA 0x03f1
 #define PHY_OUI_VITESSE0x01c1
-#define PHY_OUI_REALTEK0x01c1
+#define PHY_OUI_REALTEK0x0732
 #define PHYID1_OUI_MASK0x03ff
 #define PHYID1_OUI_SHFT6
 #define PHYID2_OUI_MASK0xfc00
-- 
1.5.2.5

-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: 2.6.22.5 forcedeth timeout hang

2007-08-23 Thread Willy Tarreau
On Thu, Aug 23, 2007 at 06:48:23PM -0500, Mr. Berkley Shands wrote:
 100% reproducible hang on xmit timeout.
 Just do a make -j4 modules on an nfs mounted kernel source.

Most likely you also had the problem with 2.6.22.2 (maybe you have not
tested this one, though). There were bug fixes for forcedeth introduced
in this version, one of them being buggy. The patch below fixes it. Can
you please give it a try ? If it does not fix the problem, please try
2.6.22.1 which does not include those changes. I'm interested because
I have those changes pending for 2.6.20.17 too.

diff --git a/drivers/net/forcedeth.c b/drivers/net/forcedeth.c
index 10f4e3b..1938d6d 100644
--- a/drivers/net/forcedeth.c
+++ b/drivers/net/forcedeth.c
@@ -552,7 +552,7 @@ union ring_type {
 #define PHY_OUI_MARVELL0x5043
 #define PHY_OUI_CICADA 0x03f1
 #define PHY_OUI_VITESSE0x01c1
-#define PHY_OUI_REALTEK0x01c1
+#define PHY_OUI_REALTEK0x0732
 #define PHYID1_OUI_MASK0x03ff
 #define PHYID1_OUI_SHFT6
 #define PHYID2_OUI_MASK0xfc00

Thanks,
Willy

-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] e100 module loads 1/2 times

2007-08-28 Thread Willy Tarreau
On Tue, Aug 28, 2007 at 02:43:23PM -0700, Brandeburg, Jesse wrote:
 Willy Tarreau wrote:
  --- e100-3.5.17/src/e100.c.orig 2007-08-13 08:53:18 +0200
  +++ e100-3.5.17/src/e100.c  2007-08-13 09:24:56 +0200
  @@ -2934,13 +2934,13 @@
  printk(KERN_INFO PFX %s\n, DRV_COPYRIGHT);
  }
   #ifdef E100_USE_REBOOT_NOTIFIER
  -   retval = pci_register_driver(e100_driver);
  -   if (retval = 0)
  +   retval = pci_module_init(e100_driver);
  +   if (retval == 0)
  register_reboot_notifier(e100_notifier_reboot);
  
  return retval;
   #else
  -   return pci_register_driver(e100_driver);
  +   return pci_module_init(e100_driver);
   #endif
   }
  
 
 excellent catch willy!  Auke will make sure this gets upstream.
 
 Thanks for your investigative work!  So strange that we had never gotten
 other reports of it.

I think that many places where e100 is found, there are at least 2 NICs,
and probably that the startup scripts load the module twice, resulting
in it finally working.

Could we update the e100 driver in 2.4 to 3.5.17+this fix ? The current
e100 driver (2.3.43) is plain broken regarding VLAN support. I already
have it working in my own kernels, and it definitely put an end to all
the problems I was facing before.

Thanks,
Willy

-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: sk98lin for 2.6.23-rc1

2007-09-11 Thread Willy Tarreau
On Tue, Sep 11, 2007 at 05:03:57PM +0200, Adrian Bunk wrote:
 On Tue, Sep 11, 2007 at 10:29:47AM -0400, Bill Davidsen wrote:
  So if you want people to try a new driver, I think it really has to have 
  some benefits to the users, in terms of performance, reliability, or 
  features. Cleaner design doesn't motivate, and it does raise the question 
  of why the old driver wasn't just cleaned up. I've been doing software for 
  decades, I appreciate why, but users in general just want to use their 
  system. Which raises the question of why to delete drivers which work for 
  many or even most users?
 
 As I already explained, there is a long term advantage for all users if 
 there is only one driver in the kernel.

Not only that. You have to place the switch in its context with history.
Stephen, please correct me if I'm wrong, but sk98lin has been randomly
working for a very long time. Not 100% the driver's fault, because it
has had to workaround a lot of chips bugs. The fact that this driver
supports *all* chips in the family makes it harder to identify whether
problems are caused by the hardware or by the driver because it is
bloated with tons of if/else.

I've personally encountered random data corruption on the receive path
with PCI-E hardware with sk98lin, as well as random TX stops. Sometimes
it would require one terabyte of data, sometimes just a few hundreds
megs. On other hardware (skge now), UDP would simply stop being sent
and some TCP traffic was necessary to restart UDP! One guy at Marvell
once asked me for more information, but it was not easy to provide
much more, given the randomness of the problems!

Stephen has done an excellent (and thankless) job at restarting from
scratch, and the idea to separate the two chips was a good one IMHO.
The problem is that he might have thought that most of the bugs were
in the driver, while most of them are in the hardware, and this requires
a lot of workarounds, which do not always work the same for everybody
(I remember having tried to disable flow control with sk98lin because
it helped with sky2).

In parallel, sk98lin has improved on the vendor's site. v8 exhibited
all the problems I explained above, but v10 has fixed a lot of them,
making the new sk98lin more reliable. In parallel, sky2 and skge had
got wider acceptance and testing. The nastiest hardware bugs will
slowly surface, a good deal of driver bugs have been detected too
(and that's expected from any new driver).

It is possible that after 2 or 3 patches, a lot of the remaining
problems will suddenly vanish. But it's also possible that the driver
will still not work for 1% of people for 1 or 2 years because of some
obscure hardware combinations which trigger some obscure hardware bugs.

 Therefore all users should 
 switch away from obsolete drivers to the replacement drivers, and the 
 obsolete driver will be removed at some point in time. The only question 
 is how to do it.

Desktop users genreally have no problem experimenting with multiple kernels
or drivers. They can report feedback too, but generally, they're not very
good at downloading alternative drivers and patching their kernel with those.

Server users cannot experiment for a long time. After 2 or 3 losses of
service, they *have* to provide a definitive solution. For some of them
when sky2 fails, it may very well be to switch over to sk98lin. Downloading
from the vendor's site and patching is not a problem for those users, but
it causes them the trouble of updating the kernel for security fixes, so
the old driver must be shipped with the kernel.

However, I remember something which might constitute a solution. In 2.4,
there's a small bug in the kbuild process on alpha. One question is always
asked during make oldconfig. Its saved value is ignored because of the way
it is computed. I don't know if we could do this with 2.6 kbuild. It would
then be nice to always set sk98lin to unset if it was set to Y or M,
so that at each build, the user has to explicitly state he wants it. It's
annoying enough to give the other one a try once in a while, without causing
too much trouble to people who really have no other choice right now.

What we need with this driver is people being fed up with it, not them
being unable to use it as a last resort. Also, given that it has improved
over the last years (probably due to competition pressure from sky2/skge),
users will even less understand why there is such incentive to remove it.

Another trick for obsolete drivers would be to simply remove them from
the usual build system, but have them being available for explicit build.
Eg: make modules will not build them, but make obsolete-modules would do.

  Testing a new kernel is no longer a drop in a boot 
  operation if modprobe.conf must be edited to get the network up, and the 
  typical user isn't going to write that shell script to try one or the other 
  driver.
 
 The typical user will let his distribution handle this.
 
 And MODULE_ALIAS can also 

Re: bnx2 dirver's firmware images

2007-09-18 Thread Willy Tarreau
On Tue, Sep 18, 2007 at 12:21:50PM -0700, David Miller wrote:
 From: Michael Chan [EMAIL PROTECTED]
 Date: Tue, 18 Sep 2007 13:05:51 -0700
 
  The bnx2 firmware changes quite frequently.  A new driver quite often
  requires new firmware to work correctly.  Splitting them up makes things
  difficult for the user.
  
  The firmware in tg3 is a lot more mature and I don't expect it to
  change.  I think tg3 is better suited for using request_firmware().
 
 Like I said, I think neither should change and the driver should
 be fully functional when built statically into the kernel.

Michael, doesn't a functional-yet-suboptimal firmware exist ? I mean,
just the same principle as we all have kernels, boot CDs, statically
built tools, etc... which run everywhere. If you have such a beast,
maybe it would be a good start to have it in the kernel, and provide
the users with the ability to upgrade the firmware once the system
is able to do more complex things.

Just a thought...

Regards,
Willy

-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: bnx2 dirver's firmware images

2007-09-18 Thread Willy Tarreau
On Tue, Sep 18, 2007 at 02:31:34PM -0700, David Miller wrote:
 From: Willy Tarreau [EMAIL PROTECTED]
 Date: Tue, 18 Sep 2007 23:30:25 +0200
 
  On Tue, Sep 18, 2007 at 12:21:50PM -0700, David Miller wrote:
   From: Michael Chan [EMAIL PROTECTED]
   Date: Tue, 18 Sep 2007 13:05:51 -0700
   
The bnx2 firmware changes quite frequently.  A new driver quite often
requires new firmware to work correctly.  Splitting them up makes things
difficult for the user.

The firmware in tg3 is a lot more mature and I don't expect it to
change.  I think tg3 is better suited for using request_firmware().
   
   Like I said, I think neither should change and the driver should
   be fully functional when built statically into the kernel.
  
  Michael, doesn't a functional-yet-suboptimal firmware exist ? I mean,
  just the same principle as we all have kernels, boot CDs, statically
  built tools, etc... which run everywhere. If you have such a beast,
  maybe it would be a good start to have it in the kernel, and provide
  the users with the ability to upgrade the firmware once the system
  is able to do more complex things.
  
  Just a thought...
 
 So let's save 60K instead of 80K.

That's not for this reason I said this. Michael said the firmware needs
to be updated somewhat often. What I was wondering was if it was not
possible to stick to a stable one (and hopefully small) so that the
driver could require less frequent updates. Sorry if it's not the main
point of the discussion, but I grepped on this :-)

 I mean, the entire discussion is just plain silly :)

yes, possibly :-)

Cheers,
Willy

-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


FWD: 2.4.35 e1000 problems

2007-10-15 Thread Willy Tarreau
Hi,

Dmitry, I've forwarded your mail to the netdev mailing list and to
the e1000 maintainers.

Auke, does it sound like a known problem ? Maybe someone has seen it
in 2.6 ? Just for the record, e1000 in 2.4.35 is 7.3.20-k4.

Dmitry, if you have time for a test, I think it would be good if you
could test the most recent driver from e1000.sf.net. It might give
us some more information about the problem's cause. Otherwise, let's
wait for Auke's advices and/or questions.

Regards,
Willy

- Forwarded message from dmpesl [EMAIL PROTECTED] -

Date: Mon, 15 Oct 2007 17:54:01 +0400 (MSD)
From: dmpesl [EMAIL PROTECTED]
Subject: 2.4.35 e1000 problems
To: [EMAIL PROTECTED]
X-X-Sender: [EMAIL PROTECTED]

Hello, Willy

on recent month i've changed kernel from 2.4.31-2.4.35 on my Owl-based 
web server (purchased SATA hdd, onboard SATA 20378), Asus PCH-DR mb, 
2*Xeon, and noticed that following error occurs on network card:
-
Sep 29 21:29:51 host kernel: NETDEV WATCHDOG: eth0: transmit timed out
Sep 29 21:29:53 host kernel: e1000: eth0: e1000_watchdog: NIC Link is Up 
100 Mbps Full Duplex, Flow Control: RX/TX
Sep 29 22:28:03 host kernel: NETDEV WATCHDOG: eth0: transmit timed out
Sep 29 22:28:05 host kernel: e1000: eth0: e1000_watchdog: NIC Link is Up 
100 Mbps Full Duplex, Flow Control: RX/TX
Sep 30 00:51:35 host kernel: NETDEV WATCHDOG: eth0: transmit timed out
Sep 30 00:51:37 host kernel: e1000: eth0: e1000_watchdog: NIC Link is Up 
100 Mbps Full Duplex, Flow Control: RX/TX
-
After some conversation with my collocation provider (changing cables and 
switch port) i've searched google to same errors and found some advises to 
use ethtool -k tso off, but this is not work with my kernel/distribution.
Then i swapped just e1000 driver in 2.4.35 to older, from 2.4.31.
Now everything works, without loosing link.

Regards, Dmitry

- End forwarded message -
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: ISNs and 2.6.22, Was: Re: haproxy linux firewall (netfilter)

2007-10-20 Thread Willy Tarreau
On Sat, Oct 20, 2007 at 07:23:25PM +0200, Krzysztof Oledzki wrote:
(...)
 So it seems that ISNs are not randomly incremented but rather randomly 
 generated. Adding netdev@vger.kernel.org to the CC list.
 
 Eh, I was little to hurry this time. There were not randomly generated but 
 incremented with to big value. This patch fixes my problem:
 
 http://git.kernel.org/?p=linux/kernel/git/stable/stable-queue.git;a=blob;f=queue-2.6.22/fix-tcp-initial-sequence-number-selection.patch;h=05b9167d68ecde1e6088f58c55e2906b768420ed;hb=HEAD

Good catch Krzysztof ! I've already noticed that one on LKML but did not
make the connection with your problem!

Please also tell Jozsef so that if he gets other reports, he knows where
to point the reporters.

Regards,
Willy

-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/2] [e1000 VLAN] Disable vlan hw accel when promiscuous mode

2007-11-12 Thread Willy Tarreau
On Mon, Nov 12, 2007 at 02:57:16PM -0800, David Miller wrote:
 From: Chris Friesen [EMAIL PROTECTED]
 Date: Mon, 12 Nov 2007 16:43:24 -0600
 
  David Miller wrote:
  
   When you select VLAN, you by definition are asking for non-VLAN
   traffic to be elided.  It is like plugging the ethernet cable
   into one switch or another.
  
  For max functionality it seems like the raw eth device should show 
  everything on the wire in promiscuous mode.
  
  If we want to sniff only the traffic for a specific vlan, we can sniff 
  the vlan device.
 
 VLAN settings are a filter of sorts, much like plugging into
 one switch or another filters traffic physically.
 
 If you don't want that filter, turn the VLAN settings off.

I don't really agree with that view. Having spent a lot of time with
tcpdump on production systems, I can say that sometimes you'd like to
be aware that one of your VLANs is wrong and you'd simply like to
sniff the wire to guess the correct tag. And on production, you simply
cannot remove other VLANs, otherwise you disrupt the service.

Basically, what generally happens is that the guy responsible for the
switch tells you it's OK now, but for you it isn't and you cannot
access the switch.

If the solution is to disable VLAN hardware acceleration, I agree that
it is very risky to do that without the user being aware of it. But at
least we should be able to do this by any means (eg: ethtool) without
disabling what's running.

And since you made the parallel with a switch, when you receive tagged
traffic on a switch port, you generally can mirror that port to another
one and catch all VLANs at once. A new feature that is starting to appear
is the ability to mirror tagged traffic to a VLAN on another port (which
means you get a double 802.1q tag). This is useful for inter-site links
between data-centers for instance.

Regards,
Willy

-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/2] [e1000 VLAN] Disable vlan hw accel when promiscuous mode

2007-11-12 Thread Willy Tarreau
On Mon, Nov 12, 2007 at 03:19:23PM -0800, David Miller wrote:
 From: Willy Tarreau [EMAIL PROTECTED]
 Date: Tue, 13 Nov 2007 00:15:16 +0100
 
  I can say that sometimes you'd like to be aware that one of your
  VLANs is wrong and you'd simply like to sniff the wire to guess the
  correct tag. And on production, you simply cannot remove other
  VLANs, otherwise you disrupt the service.
 
 If you were plugged into the wrong physical switch, how might
 you debug that problem in production? :-)

tcpdump on an unfiltered RJ45 tells you a lot of things. Some switches
for instance send keepalive packets (ether proto 9000) with their own
MAC as source+dest, others send CDP packets, etc... I've very rarely
stayed plugged to the wrong switch for too long before discovering it.
But that requires the ability to sniff.

 Really, it's the same issue, just virtualized, as in the name
 for the feature, VLAN.

No, it's not the same, because as soon as you pass through a switch
which is not configured for VLANs, it does not make any distinction,
and the same DMAC is considered on a single port whatever the VLAN.
This is not true with multiple switches. Also, sometimes, being able
to see that your port gets flooded with traffic for a VLAN on which
you're not configured helps you understand why you cannot fill the
port.

I'd like that we can use the hardware correctly without having to
buy TAPs. It reminds me the discussions about TOE. You claim
yourself that TOE is bad in part because you have little if any
control on the bugs on the TCP stack on the chip. This is the same
here. If I know that the chip does not send me what it receives
when configured in promiscuous mode, I can I swear it never received
the missing packet ? There's always a doubt. Maybe sometimes the
filter is buggy, maybe it only passes tags when the remaining bits
are all zero, etc... Promiscuous mode generally means that you're
the closest possible to the wire. We already do not receive pause
frames and sometimes that's annoying. But having no control over
what we want to see is frustrating.

At least, being able to disable the feature at module load time
would be acceptable. Many people who often need to sniff on decent
machines would always keep it disabled.

Regards,
Willy

-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: questions on NAPI processing latency and dropped network packets

2008-01-16 Thread Willy Tarreau
On Wed, Jan 16, 2008 at 07:58:36AM +0100, Jarek Poplawski wrote:
 On Wed, Jan 16, 2008 at 11:17:08AM +1100, Herbert Xu wrote:
 ...
  Well people are always going to operate on this model for commercial
  reasons.  FWIW I used to work for a company that stuck to a specific
  version of the Linux kernel, and I suppose I still do even now :)
  
  But the important thing is that if you're going to do that, then the
  cost that comes with it should be borne by the company and not the
  community.
 
 Sure. But the most sad thing is there seems to be not so much savings
 in this (unless a company isn't sure of its near future). Trying to
 upgrade and test current products with current kernels, even if not
 necessary, should be always useful and make developing of new products
 faster and better fit (and of course, BTW, make the kernel better on
 time).

you can work with latest release provided that you always have a fallback
to an earlier one. That way, you don't bet too much on something you don't
completely control. If it works, it tells you you'll be able to completely
exploit its new possibilities in next product release, and if it breaks,
it's easy to issue a fix to get back to earlier, well-tested version.

Regards,
Willy

--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Add IPv6 support to TCP SYN cookies

2008-02-05 Thread Willy Tarreau
Hi Andi, Alan,

I've run extensive tests with/without syn cookies recently.

On Tue, Feb 05, 2008 at 05:39:12PM +0100, Andi Kleen wrote:
 On Tue, Feb 05, 2008 at 03:42:13PM +, Alan Cox wrote:
   Syncookies are discouraged these days. They disable too many 
   valuable TCP features (window scaling, SACK) and even without them 
   the kernel is usually strong enough to defend against syn floods
   and systems have much more memory than they used to be.
  
  Somewhat untrue. Network speeds have risen dramatically, the number of
 
 With strong I meant Linux has much better algorithms to handle the standard
 syn queue (syncookies was originally added when it had only dumb head drop) 
 and there are minisocks which also require significantly less overhead
 to manage than full sockets (less memory etc.) 

That's true, but not enough, see below.

 When I wrote syncookies originally that all was not true.
 
  appliances running Linux that are not PC class means memory has fallen
  not risen and CPU has been pretty level.
  
   So I don't think it makes much sense to add more code to it, sorry.
  
  I think it makes a lot of sense 
 
 I have my doubts. It would be probably better to recheck everything
 and then remove syncookies.
 
 Also your sub PC class appliances rarely run LISTEN servers anyways
 that are open to the world.

In my tests, I discovered that in fact SYN cookies more benefit high
end machines than low-end ones. Let me explain.

I noticed that computing the cookie consumes a lot of CPU, which is a
real problem on low-end machines. But on the other end, it helps the
system continue to respond when otherwise it would not. My tests on
an AMD LX800 with max_syn_backlog at 63000 on an HTTP reverse proxy
consisted in injecting 250 hits/s of legitimate traffic with 8000 SYN/s
of noise.

Without SYN cookies, the average response time was about 1.5 second and
unstable (due to retransmits), and the CPU was set to 60%. With SYN
cookies enabled, the response time dropped to 12-15ms only, but CPU
usage jumped to 70%. The difference appears at a higher legitimate
traffic rate. At 500 hits/s + 7800 SYN/s, the CPU is just saturated
with correct response time (SYN backlog almost full but never full),
and the performance slightly goes down with SYN cookies enabled, inducing
a drop of the hit rate due to the increased CPU consumption.

Till there, one would conclude that SYN cookies are bad. BUT! this was
with tcp_synack_retries = 1, which is the optimal situation without
SYN cookies under an attack and which is pretty bad for normal usage.

The real problem without SYN cookies is that you are forced to support
a huge SYN backlog (eg: 2 million entries to sustain 100 Mbps of SYN).
And what happens with a large backlog ? You send a lot of retries for
each SYN. 5 by default, meaning 6 SYN-ACKs for 1 SYN. Thus, you become
a SYN amplifier and the guy in front of you just has to send you 20 Mbps
of traffic for you to saturate your 100 Mbps uplink.

Also, sending all those SYN-ACKs takes a huge amount of CPU time. With
tcp_synack_retries at 0, my machine received 26600 SYN/s, and returned
26600 SYN-ACK/s at 100% CPU. With tcp_synack_retries set to 4, it could
only accept 12900 SYN/s, replying with 51200 SYN-ACK/s. So the input
load was halfed and the output was doubled. I did not bother going higher.

The only solution against this is then to reduce tcp_synack_retries to
very low values (0 ideally, to match SYN cookies behaviour), but in this
case, you degrade normal traffic 100% of the time, while SYN cookies would
only trigger while you're already under attack.

My conclusions after those tests was to set tcp_synack_retries to a reasonable
value (1 to 3), and set the backlog to the number of half-open sessions your
machine can accumulate under a SYN attack without collapsing. You then enable
SYN cookies, and they will only trigger when you know that your machine will
not be able to sustain the increased load.

This solution permits you to accept normal connections when not under attack,
with an acceptable number of retransmits and with TCP options working well.
Under a moderate attack, the large backlog will still help you accept
legitimate connections with all comfort (sack, wscale, ...). Under a massive
attack, you will not send more than tcp_synack_retries*backlog packets per
tcp_synack_retries period, thus limiting the outbound traffic, plus 1 SYN-ACK
per incoming SYN, legitimate or not. At this stage, if your users have a
castrated TCP stack in front of them, that's not a problem because you know
that otherwise they would not even have been able to connect.

So in this regard, SYN cookies are really needed.

Last, I've read on DJB's page that SYN cookies do not break TCP beahaviour.
Yes they do. If the client waits for the server to talk first, you'd better
not lose the first ACK from the client because it will not get retransmitted,
and the client will see an ESTABLISHED connection while the server will have

Re: [PATCH] Add IPv6 support to TCP SYN cookies

2008-02-05 Thread Willy Tarreau
On Wed, Feb 06, 2008 at 12:52:17AM +0300, Evgeniy Polyakov wrote:
 Hi Alan.
 
 On Tue, Feb 05, 2008 at 09:20:17PM +, Alan Cox ([EMAIL PROTECTED]) wrote:
   Most (if not all) distributions have them enabled and window growing
   works just fine. Actually I do not see any reason why connection
   establishment handshake should prevent any run-time operations at all,
   even if it was setup during handshake.
  
  Syncookies are only triggered if the system is under a load where it
  would begin to lose connections otherwise. So they merely turn a DoS into
  a working if slightly slower setup (and  64K windows don't matter for
  most normal users, especially on mobile devices).
 
 SACK is actually a good idea for mobile devices, so preventing
 syncookies from not getting into account some options (btw, does it work
 with timestamps and PAWS?) is not a solution.

All TCP options negociated during session setup are lost. In fact, some
bits (3) are still reserved for the best known value of the MSS, but
that's all. The principle of SYN cookies is that the server does not
create any session upon the SYN, but builds a sequence number constitued
from a hash and the values it absolutely needs to know when the client
validates the session with an ACK.

I've seen some firewalls acting as SYN gateways which send the options
from the server to the client in the first ACK packet from the server.
This is normally not allowed, but it seems to work with some TCP stacks
(at least for the MSS). One solution would be to extend TCP to officially
support this behaviour and to optionally use it along with SYN cookies,
but there will always be old clients not compatible with the extension.

Regards,
Willy

--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] [PATCH] net: socket: Fix the wrong returns for recvmsg and sendmsg

2015-06-02 Thread Willy Tarreau
On Tue, Jun 02, 2015 at 02:43:54PM +0800, Junling Zheng wrote:
 On 2015/6/2 14:27, Greg KH wrote:
  On Mon, Jun 01, 2015 at 10:23:57PM -0700, David Miller wrote:
  From: Junling Zheng zhengjunl...@huawei.com
  Date: Tue, 2 Jun 2015 12:05:32 +0800
 
  So, the problem commit is 281c9c36 (net: compat: Update
  get_compat_msghdr() to match copy_msghdr_from_user() behaviour),
  which fixes db31c55a6fb2 and brings the get_compat_msghdr() in line
  with copy_msghdr_from_user().
 
  Upstream this got fixed by:
 
  08adb7dabd4874cc5666b4490653b26534702ce0
 
  So the part that makes us not unconditionally return -EFAULT needs
  to be backported, and that's probably equivalent to the patch
  your proposed which therefore should be applied.
  
  Ok, thanks, now applied.
  
 
 Maybe other stable version also needs this fix:)

Yes, from what I'm seeing, at least 3.2 and 2.6.32 need it as well.

Thanks,
Willy

--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Fw: [Bug 99161] New: 2.6.32.66 PPC Oops in tcp_send_fin

2015-05-29 Thread Willy Tarreau
Hi Eric,

On Fri, May 29, 2015 at 08:52:11AM -0700, Eric Dumazet wrote:
 On Fri, 2015-05-29 at 08:12 -0700, Stephen Hemminger wrote:
  I think 2.6.32 is so old no one will care.

A few will still, but at least we must ensure the old guy finishes
his days nicely :-)

(...)
 I guess a backport went wrong.

Ah crap, sorry about that :-(

 Willy, please add following to your tree :
 
 diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
 index 5339f066234b..d1e2895bb63c 100644
 --- a/net/ipv4/tcp_output.c
 +++ b/net/ipv4/tcp_output.c
 @@ -2136,7 +2136,7 @@ void tcp_send_fin(struct sock *sk)
*/
   if (tskb  (tcp_send_head(sk) || tcp_memory_pressure)) {
  coalesce:
 - TCP_SKB_CB(skb)-flags |= TCPCB_FLAG_FIN;
 + TCP_SKB_CB(tskb)-flags |= TCPCB_FLAG_FIN;
   TCP_SKB_CB(tskb)-end_seq++;
   tp-write_seq++;
   if (!tcp_send_head(sk)) {


Thanks Eric, I'm queuing this now.

Best regards,
Willy

--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: 2.6.32.66 tcp regression OOPs

2015-06-01 Thread Willy Tarreau
Hi,

On Mon, Jun 01, 2015 at 09:00:21AM +0200, Frans Klaver wrote:
 [cc: Willy Tarreau]
 
 On Mon, Jun 1, 2015 at 3:26 AM,  starlight.201...@binnacle.cx wrote:
  Hello,
 
  Apoligies if I have submitted to the wrong lists.
 
  Encountered a regression in
  2.6.32.66 relative to 2.6.32.65.
 
  Crash eight minutes after boot.
 
  Will responded with additional details
  if the OOPS is not sufficent.
 
  Best Regards
 
 
 Did you bisect it?

Eric Dumazet notified me that of something possibly similar due to
a mistake I made when backporting a fix by hand.

Please apply the following patch to see if it fixes the problem :

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c  
index 5339f066234b..d1e2895bb63c 100644 
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -2136,7 +2136,7 @@ void tcp_send_fin(struct sock *sk)
 */
if (tskb  (tcp_send_head(sk) || tcp_memory_pressure)) {
 coalesce:
-   TCP_SKB_CB(skb)-flags |= TCPCB_FLAG_FIN;
+   TCP_SKB_CB(tskb)-flags |= TCPCB_FLAG_FIN;
TCP_SKB_CB(tskb)-end_seq++;
tp-write_seq++;
if (!tcp_send_head(sk)) {

Thanks,
Willy

--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: 2.6.32.66 tcp regression OOPs

2015-06-01 Thread Willy Tarreau
On Mon, Jun 01, 2015 at 11:32:19AM -0400, starlight.201...@binnacle.cx wrote:
 Hi,
 
 I found the patch late yesterday and applied it.
 
 Running fine now for 12 hours under active load.

Thank you.

 Recommend the patch be rolled into the tarball,
 or a notation added to the release page as this
 one has severe consequences.

I'll emit 2.6.32.67 with it. I didn't know it was that easy to trigger
it, and since feedback comes slowly on 2.6.32, I was waiting a bit for
more feedback before doing another one.

Thank you!
Willy

--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 4/6] net: mvneta: Handle per-cpu interrupts

2015-07-05 Thread Willy Tarreau
Hi Maxime,

On Fri, Jul 03, 2015 at 04:25:49PM +0200, Maxime Ripard wrote:
 Now that our interrupt controller is allowing us to use per-CPU interrupts,
 actually use it in the mvneta driver.
 
 This involves obviously reworking the driver to have a CPU-local NAPI
 structure, and report for incoming packet using that structure.
 
 Signed-off-by: Maxime Ripard maxime.rip...@free-electrons.com

This patch breaks module build of mvneta unless you export request_percpu_irq :

diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
index ec31697..1440a92 100644
--- a/kernel/irq/manage.c
+++ b/kernel/irq/manage.c
@@ -1799,6 +1799,7 @@ int request_percpu_irq(unsigned int irq, irq_handler_t 
handler,
 
return retval;
 }
+EXPORT_SYMBOL_GPL(request_percpu_irq);
 
Regards,
Willy

--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 6/6] net: mvneta: Statically assign queues to CPUs

2015-07-05 Thread Willy Tarreau
Hi Thomas,

On Fri, Jul 03, 2015 at 04:46:24PM +0200, Thomas Petazzoni wrote:
 Maxime,
 
 On Fri,  3 Jul 2015 16:25:51 +0200, Maxime Ripard wrote:
 
  +static void mvneta_percpu_enable(void *arg)
  +{
  +   struct mvneta_port *pp = arg;
  +
  +   enable_percpu_irq(pp-dev-irq, IRQ_TYPE_NONE);
  +}
  +
   static int mvneta_open(struct net_device *dev)
   {
  struct mvneta_port *pp = netdev_priv(dev);
  @@ -2655,6 +2662,19 @@ static int mvneta_open(struct net_device *dev)
  goto err_cleanup_txqs;
  }
   
  +   /*
  +* Even though the documentation says that request_percpu_irq
  +* doesn't enable the interrupts automatically, it actually
  +* does so on the local CPU.
  +*
  +* Make sure it's disabled.
  +*/
  +   disable_percpu_irq(pp-dev-irq);
  +
  +   /* Enable per-CPU interrupt on the one CPU we care about */
  +   smp_call_function_single(rxq_def % num_online_cpus(),
  +mvneta_percpu_enable, pp, true);
 
 What happens if that CPU goes offline through CPU hotplug?

I just tried : if I start mvneta with rxq_def=1, then my irq runs on
CPU1. Then I offline CPU1 and the irqs are automatically handled by CPU0.
Then I online CPU1 and irqs stay on CPU0.

More or less related, I found that if I enable a queue number larger than
the CPU count it does work, but then the system complains during rmmod :

[  877.146203] [ cut here ]
[  877.146227] WARNING: CPU: 1 PID: 1731 at fs/proc/generic.c:552 
remove_proc_entry+0x144/0x15c()
[  877.146233] remove_proc_entry: removing non-empty directory 'irq/29', 
leaking at least 'mvneta'
[  877.146238] Modules linked in: mvneta(-) [last unloaded: mvneta]
[  877.146254] CPU: 1 PID: 1731 Comm: rmmod Tainted: GW   
4.1.1-mvebu-6-g3d317ed-dirty #5
[  877.146260] Hardware name: Marvell Armada 370/XP (Device Tree)
[  877.146281] [c0017194] (unwind_backtrace) from [c00126d4] 
(show_stack+0x10/0x14)
[  877.146293] [c00126d4] (show_stack) from [c04d32e4] 
(dump_stack+0x74/0x90)
[  877.146305] [c04d32e4] (dump_stack) from [c0025200] 
(warn_slowpath_common+0x74/0xb0)
[  877.146315] [c0025200] (warn_slowpath_common) from [c00252d0] 
(warn_slowpath_fmt+0x30/0x40)
[  877.146325] [c00252d0] (warn_slowpath_fmt) from [c0115694] 
(remove_proc_entry+0x144/0x15c)
[  877.146336] [c0115694] (remove_proc_entry) from [c0060fd4] 
(unregister_irq_proc+0x8c/0xb0)
[  877.146347] [c0060fd4] (unregister_irq_proc) from [c0059f84] 
(free_desc+0x28/0x58)
[  877.146356] [c0059f84] (free_desc) from [c005a028] 
(irq_free_descs+0x44/0x80)
[  877.146368] [c005a028] (irq_free_descs) from [bf015748] 
(mvneta_remove+0x3c/0x4c [mvneta])
[  877.146382] [bf015748] (mvneta_remove [mvneta]) from [c024d6e8] 
(platform_drv_remove+0x18/0x30)
[  877.146393] [c024d6e8] (platform_drv_remove) from [c024bd50] 
(__device_release_driver+0x70/0xe4)
[  877.146402] [c024bd50] (__device_release_driver) from [c024c5b8] 
(driver_detach+0xcc/0xd0)
[  877.146411] [c024c5b8] (driver_detach) from [c024bbe0] 
(bus_remove_driver+0x4c/0x90)
[  877.146425] [c024bbe0] (bus_remove_driver) from [c007d3f0] 
(SyS_delete_module+0x164/0x1b4)
[  877.146437] [c007d3f0] (SyS_delete_module) from [c000f4c0] 
(ret_fast_syscall+0x0/0x3c)
[  877.146443] ---[ end trace 48713a9ae31204b1 ]---

This was on the AX3 (dual-proc) with rxq_def=2.

Hoping this helps,
Willy

--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH][kernel 2.6.32] Bond interface can't send gratuitous ARP

2015-07-26 Thread Willy Tarreau
Hi Qingjie,

On Mon, Jul 27, 2015 at 09:05:29AM +0800, ? wrote:
 Hi,
 
 Bond interface worked as Active-Backup mode.
 If the bond interface was added in bridge, then it's just a port of bridge.
 It doesn' have IP address.
 When bond slave changing, the current code bond_send_gratuitous_arp
 didn't take effect. It couldn't send gratuitous ARP.
 So I made a patch to fix this issue.
 
 It's my first patch for kernel.
 
 Thanks a lot for your *understanding* for any *inconvenience* caused.

Patches may only be sent for mainline kernel. Since you're trying to fix
2.6.32, you first need to check if a more recent version works properly.
If it's the case, then instead you should ask for a backport of the patch
that did the related change (it's easier for tracking bug fixes and
regressions). If the problem is also present in mainline, then you should
propose your patch for mainline. Since it's a fix, it will be backported.

BTW, please keep in mind that 2.6.32 is reaching EOL in a few months, so
it would be a good time to try newer ones (3.2 or 3.4 are rock solid now).

Hoping this helps,
Willy

--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Please backport commit to 3.12+

2015-11-06 Thread Willy Tarreau
Hi,

We recently faced the issue described in the patch below on 3.14.56. This fix
was merged in 4.2-rc7. I checked Davem's queue and stable queue and it's not
there yet. Could we please have it in 3.12 and above ? (feature was introduced
in 3.11). I can confirm that it properly fixes the problem for us.

Mainline commit is :

  commit 3c16241c445303a90529565e7437e1f240acfef2
  Author: Phil Sutter 
  Date:   Tue Jul 28 00:53:26 2015 +0200

netfilter: SYNPROXY: fix sending window update to client

Upon receipt of SYNACK from the server, ipt_SYNPROXY first sends back an 
ACK to
finish the server handshake, then calls nf_ct_seqadj_init() to initiate
sequence number adjustment of forwarded packets to the client and finally 
sends
a window update to the client to unblock it's TX queue.

Since synproxy_send_client_ack() does not set synproxy_send_tcp()'s nfct
parameter, no sequence number adjustment happens and the client receives the
window update with incorrect sequence number. Depending on client TCP
implementation, this leads to a significant delay (until a window probe is
being sent).

Signed-off-by: Phil Sutter 
Signed-off-by: Pablo Neira Ayuso 

Thanks,
Willy

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: "ss -p" segfaults (updated to 4.2)

2015-10-12 Thread Willy Tarreau
On Mon, Oct 12, 2015 at 09:50:19AM -0700, Stephen Hemminger wrote:
> Applied, and did some editing on commit msg

Thank you Stephen!

Willy

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: "ss -p" segfaults (updated to 4.2)

2015-10-06 Thread Willy Tarreau
Hi guys,

I've updated Jose's patch to make it slightly simpler (eg: calloc instead
of malloc+memset), and ported it to 4.2.0 which requires it as well, and
attached it to this e-mail.

I can confirm that with this patch 4.1.1 doesn't segfault on me anymore.
The commit message should be reworked I guess though everything's in it
and I didn't want to modify his description.

Can it be merged as-is or should I reword the commit message and reference
Jose as the fix reporter ? We should not let this bug live forever.

Thanks,
Willy

>From 618028d6c5bfa4fb9898f2ec1ab5483e6f5392d4 Mon Sep 17 00:00:00 2001
From: "j...@openmailbox.org" <j...@openmailbox.org>
Date: Tue, 21 Jul 2015 10:54:05 +0100
Subject: "ss -p" segfaults

Patch for 4.2.0

Essentially all that is needed to get rid of this issue is the
addition of:

memset(u, 0, sizeof(*u));

after:

if (!(u = malloc(sizeof(*u
break;

Also patched some other situations (strcpy and sprintf uses) that
potentially produce the same results.

Signed-off-by: Jose P Santos <j...@openmailbox.org>

[ wt: made Jose's patch slightly simpler, all credits to him for the diag ]
Signed-off-by: Willy Tarreau <w...@1wt.eu>
---
 misc/ss.c | 16 ++--
 1 file changed, 10 insertions(+), 6 deletions(-)

diff --git a/misc/ss.c b/misc/ss.c
index 2f34962..8b0d606 100644
--- a/misc/ss.c
+++ b/misc/ss.c
@@ -457,7 +457,9 @@ static void user_ent_hash_build(void)
 
user_ent_hash_build_init = 1;
 
-   strcpy(name, root);
+   strncpy(name, root, sizeof(name)-1);
+   name[sizeof(name)-1] = 0;
+
if (strlen(name) == 0 || name[strlen(name)-1] != '/')
strcat(name, "/");
 
@@ -481,7 +483,7 @@ static void user_ent_hash_build(void)
if (getpidcon(pid, _context) != 0)
pid_context = strdup(no_ctx);
 
-   sprintf(name + nameoff, "%d/fd/", pid);
+   snprintf(name + nameoff, sizeof(name) - nameoff, "%d/fd/", pid);
pos = strlen(name);
if ((dir1 = opendir(name)) == NULL) {
free(pid_context);
@@ -502,7 +504,7 @@ static void user_ent_hash_build(void)
if (sscanf(d1->d_name, "%d%c", , ) != 1)
continue;
 
-   sprintf(name+pos, "%d", fd);
+   snprintf(name+pos, sizeof(name) - pos, "%d", fd);
 
link_len = readlink(name, lnk, sizeof(lnk)-1);
if (link_len == -1)
@@ -2736,7 +2738,7 @@ static int unix_show(struct filter *f)
struct sockstat *u, **insp;
int flags;
 
-   if (!(u = malloc(sizeof(*u
+   if (!(u = calloc(1, sizeof(*u
break;
u->name = NULL;
u->peer_name = NULL;
@@ -3086,11 +3088,13 @@ static int netlink_show_one(struct filter *f,
strncpy(procname, "kernel", 6);
} else if (pid > 0) {
FILE *fp;
-   sprintf(procname, "%s/%d/stat",
+   snprintf(procname, sizeof(procname), "%s/%d/stat",
getenv("PROC_ROOT") ? : "/proc", pid);
if ((fp = fopen(procname, "r")) != NULL) {
if (fscanf(fp, "%*d (%[^)])", procname) == 1) {
-   sprintf(procname+strlen(procname), 
"/%d", pid);
+   snprintf(procname+strlen(procname),
+   
sizeof(procname)-strlen(procname),
+   "/%d", pid);
done = 1;
}
fclose(fp);
-- 
1.7.12.1



Re: NFS/TCP/IPv6 acting strangely in 4.2

2015-09-16 Thread Willy Tarreau
Hi,

On Wed, Sep 16, 2015 at 06:53:57AM +, Damien Thébault wrote:
> On Fri, 2015-09-11 at 12:38 +0100, Russell King - ARM Linux wrote:
> > I have a recent Marvell Armada 388 board here which uses the mvneta
> > driver.  I'm seeing some weird effects with NFS with it acting as a
> > client.
> 
> Hello,
> 
> I'm upgrading a Marvelle Armada 370 board using the mvneta driver from
> 4.0 to 4.2 and noticed issues with NFS booting.
> Basically, most of the time init returns with an error code, or
> programs segfault or throw illegal instructions.
> 
> Since it worked fine on 4.0 I bisected until I found commit
> a84e32894191cfcbffa54180d78d7d4654d56c20 "net: mvneta: fix refilling
> for Rx DMA buffers".
> 
> If I revert this commit, everything seems to get back to normal.
> Could you try it ? The two issues look very similar.

I'm not sure but I'm seeing that the accounting was changed by this
patch without being certain of the implications; if the revert above
works, it would be nice to try to only apply this just to see if
that's indeed an accounting error or not :

diff --git a/drivers/net/ethernet/marvell/mvneta.c 
b/drivers/net/ethernet/marvell/mvneta.c
index 62e48bc..4205867 100644
--- a/drivers/net/ethernet/marvell/mvneta.c
+++ b/drivers/net/ethernet/marvell/mvneta.c
@@ -1463,6 +1463,7 @@ static int mvneta_rx(struct mvneta_port *pp, int rx_todo,
 {
struct net_device *dev = pp->dev;
int rx_done;
+   int missed = 0;
u32 rcvd_pkts = 0;
u32 rcvd_bytes = 0;
 
@@ -1527,6 +1528,7 @@ static int mvneta_rx(struct mvneta_port *pp, int rx_todo,
if (err) {
netdev_err(dev, "Linux processing - Can't refill\n");
rxq->missed++;
+   missed++;
goto err_drop_frame;
}
 
@@ -1561,7 +1563,7 @@ static int mvneta_rx(struct mvneta_port *pp, int rx_todo,
}
 
/* Update rxq management counters */
-   mvneta_rxq_desc_num_update(pp, rxq, rx_done, rx_done);
+   mvneta_rxq_desc_num_update(pp, rxq, rx_done, rx_done - missed);
 
return rx_done;
 }

Regards,
Willy

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/1] net: Add SO_REUSEPORT_LISTEN_OFF socket option as drain mode

2015-12-15 Thread Willy Tarreau
Hi Eric,

On Wed, Nov 11, 2015 at 05:09:01PM -0800, Eric Dumazet wrote:
> On Wed, 2015-11-11 at 10:43 -0800, Eric Dumazet wrote:
> > On Wed, 2015-11-11 at 10:23 -0800, Tom Herbert wrote:
> > 
> > > How about doing this in shutdown called for a listener?
> > 
> > Seems a good idea, I will try it, thanks !
> > 
> 
> Arg, I forgot about this shutdown() discussion we had recently
> with Oracle guys.
> 
> It is currently used in linux to unblock potential threads in accept()
> system call.
> 
> This would prevent syn_recv sockets to be finally accepted.

I had a conversation with an haproxy user who's concerned with the
connection drops during the reload operation and we stumbled upon
this thread. I was considering improving shutdown() as well for this
as haproxy already performs a shutdown(RD) during a "pause" operation
(ie: workaround for kernels missing SO_REUSEPORT).

And I found that the code clearly doesn't make this possible since
shutdown(RD) flushes the queue and stops the listening.

However I found what I consider an elegant solution which works
pretty well : by simply adding a test in compute_score(), we can
ensure that a previous socket ranks lower than the current ones,
and is never considered as long as the new ones are present. Here I
achieved this using setsockopt(SO_LINGER). The old process just has
to do this with a non-zero value on the socket it wants to put into
lingering mode and that's all.

I find this elegant since it keeps the same semantics as for a
connected socket in that it avoids killing the queue, and that it
doesn't change the behaviour for existing applications. It just
turns out that listening sockets are set up without any lingering
by default so we don't need to add any new socket options nor
anything.

Please let me know what you think about it (patch attached), if
it's accepted it's trivial to adapt haproxy to this new behaviour.

Thanks!
Willy

>From 7b79e362479fa7084798e6aa41da2a2045f0d6bb Mon Sep 17 00:00:00 2001
From: Willy Tarreau <w...@1wt.eu>
Date: Tue, 15 Dec 2015 16:40:00 +0100
Subject: net: make lingering sockets score less in compute_score()

When multiple processes use SO_REUSEPORT for a seamless restart
operation, there's a tiny window during which both the old and the new
process are bound to the same port, and there's no way for the old
process to gracefully stop receiving connections without dropping
the few that are left in the queue between the last poll() and the
shutdown() or close() operation.

Incoming connections are distributed between multiple listening sockets
in inet_lookup_listener() according to multiple criteria. The first
criterion is a score based on a number of attributes for each socket,
then a hash computation ensures that the connections are evenly
distributed between sockets of equal weight.

This patch provides an elegant approach by which the old process can
simply decrease its score by setting the lingering time to non-zero
on its listening socket. Thus, the sockets from the new process
(which start without any lingering) always score higher and are always
preferred.

The old process can then safely drain incoming connections and stop
after meeting the -1 EAGAIN condition, as shown in the example below :

 process A (old one)|  process B (new one)
|
  listen() >= 0 |
  ...   |
  accept()  |
  ...   |
  ...   |  listen()

   From now on, both processes receive incoming connections

  ...   |  kill(process A, go_away)
  setsockopt(SO_LINGER) |  accept() >= 0

   Here process A stops receiving new connections

  accept() >= 0 |  accept() >= 0
  ...   |
  accept() = -1 EAGAIN  |  accept() >= 0
  close()   |
  exit()|

Signed-off-by: Willy Tarreau <w...@1wt.eu>
---
 net/ipv4/inet_hashtables.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
index c6fb80b..473b16f 100644
--- a/net/ipv4/inet_hashtables.c
+++ b/net/ipv4/inet_hashtables.c
@@ -191,6 +191,8 @@ static inline int compute_score(struct sock *sk, struct net 
*net,
score += 4;
}
}
+   if (!sock_flag(sk, SOCK_LINGER))
+   score++;
return score;
 }
 
-- 
1.7.12.1



Re: [PATCH 1/1] net: Add SO_REUSEPORT_LISTEN_OFF socket option as drain mode

2015-12-15 Thread Willy Tarreau
On Tue, Dec 15, 2015 at 10:21:52AM -0800, Eric Dumazet wrote:
> On Tue, 2015-12-15 at 18:43 +0100, Willy Tarreau wrote:
> 
> > Ah ? but what does it bring in this case ? I'm not seeing it used
> > anywhere on a listening socket. The code took care of not breaking
> > them though (ie they still accept if no other socket shows up with
> > a higher score). Otherwise we'll have to switch to Tolga's patch,
> > unless we find another socket option that can safely be combined
> > and which makes sense (I often find it better not to make userland
> > depend on new updates of includes when possible).
> 
> Socket options set on the listener before the accept() are inherited.

I completely forgot about this use case, stupid me!

> Applications wanting SO_LINGER special settings on all their sockets can
> use a single system call right before listen().
> 
> Some servers having to deal with TIME_WAIT proliferation very often use
> SO_LINGER with timeout 0

Yes definitely, it's just that I was focused on the listening socket not
taking into account the fact that it would be inherited to the (rare) few
sockets that are accepted from the queue afterwards. And indeed it's a
perfectly legitimate usage to save a syscall per incoming connection.

Thus do you think it's worth adding a new option as Tolga proposed ?

Another solution I considered (but found a bit dirty) was to make use of
the unimplemented shutdown(WR) for this. While it's easy to do, I don't
like it simply because it looks like a hack and not logical at all from
the users perspective.

Thanks,
Willy
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/1] net: Add SO_REUSEPORT_LISTEN_OFF socket option as drain mode

2015-12-15 Thread Willy Tarreau
On Tue, Dec 15, 2015 at 09:10:24AM -0800, Eric Dumazet wrote:
> On Tue, 2015-12-15 at 17:14 +0100, Willy Tarreau wrote:
> > Hi Eric,
> > 
> > On Wed, Nov 11, 2015 at 05:09:01PM -0800, Eric Dumazet wrote:
> > > On Wed, 2015-11-11 at 10:43 -0800, Eric Dumazet wrote:
> > > > On Wed, 2015-11-11 at 10:23 -0800, Tom Herbert wrote:
> > > > 
> > > > > How about doing this in shutdown called for a listener?
> > > > 
> > > > Seems a good idea, I will try it, thanks !
> > > > 
> > > 
> > > Arg, I forgot about this shutdown() discussion we had recently
> > > with Oracle guys.
> > > 
> > > It is currently used in linux to unblock potential threads in accept()
> > > system call.
> > > 
> > > This would prevent syn_recv sockets to be finally accepted.
> > 
> > I had a conversation with an haproxy user who's concerned with the
> > connection drops during the reload operation and we stumbled upon
> > this thread. I was considering improving shutdown() as well for this
> > as haproxy already performs a shutdown(RD) during a "pause" operation
> > (ie: workaround for kernels missing SO_REUSEPORT).
> > 
> > And I found that the code clearly doesn't make this possible since
> > shutdown(RD) flushes the queue and stops the listening.
> > 
> > However I found what I consider an elegant solution which works
> > pretty well : by simply adding a test in compute_score(), we can
> > ensure that a previous socket ranks lower than the current ones,
> > and is never considered as long as the new ones are present. Here I
> > achieved this using setsockopt(SO_LINGER). The old process just has
> > to do this with a non-zero value on the socket it wants to put into
> > lingering mode and that's all.
> > 
> > I find this elegant since it keeps the same semantics as for a
> > connected socket in that it avoids killing the queue, and that it
> > doesn't change the behaviour for existing applications. It just
> > turns out that listening sockets are set up without any lingering
> > by default so we don't need to add any new socket options nor
> > anything.
> > 
> > Please let me know what you think about it (patch attached), if
> > it's accepted it's trivial to adapt haproxy to this new behaviour.
> 
> Well, problem is : some applications use LINGER on the listener,
> you can not really hijack this flag.

Ah ? but what does it bring in this case ? I'm not seeing it used
anywhere on a listening socket. The code took care of not breaking
them though (ie they still accept if no other socket shows up with
a higher score). Otherwise we'll have to switch to Tolga's patch,
unless we find another socket option that can safely be combined
and which makes sense (I often find it better not to make userland
depend on new updates of includes when possible).

Cheers,
Willy

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/1] net: Add SO_REUSEPORT_LISTEN_OFF socket option as drain mode

2015-12-16 Thread Willy Tarreau
Hi Eric,

On Wed, Dec 16, 2015 at 08:38:14AM +0100, Willy Tarreau wrote:
> On Tue, Dec 15, 2015 at 01:21:15PM -0800, Eric Dumazet wrote:
> > On Tue, 2015-12-15 at 20:44 +0100, Willy Tarreau wrote:
> > 
> > > Thus do you think it's worth adding a new option as Tolga proposed ?
> > 
> > 
> > I thought we tried hard to avoid adding the option but determined
> > we could not avoid it ;)
> 
> Not yet, your other proposal of disabling SO_REUSEPORT makes sense if
> we combine it with the proposal to change the score in my patch. If
> we say that a socket which has SO_REUSEPORT scores higher, then the
> connections which don't want to accept new connections anymore will
> simply have to drop it an not be elected. I find this even cleaner
> since the sole purpose of the loop is to find the best socket in case
> of SO_REUSEPORT.

So I tried this and am pretty satisfied with the results, as I couldn't
see any single reset on 4.4-rc5 with it. On 4.1 I got a few very rare
resets at the exact moment the new process binds to the socket, because
I suspect some ACKs end up in the wrong queue exactly there. But
apparently the changes you did in 4.4 totally got rid of this, which is
great!

I suspected that I could enter a situation where a new process could
fail to bind if generations n-1 and n-2 were still present, because
n-2 would be running without SO_REUSEPORT and that should make this
test fail in inet_csk_bind_conflict(), but it never failed for me :

if ((!reuse || !sk2->sk_reuse ||
sk2->sk_state == TCP_LISTEN) &&
(!reuseport || !sk2->sk_reuseport ||
(sk2->sk_state != TCP_TIME_WAIT &&
 !uid_eq(uid, sock_i_uid(sk2) {
...

So I'm clearly missing something and can't spot what. I mean, I'd
prefer to see my patch occasionally fail than not understanding why
it always works! If anyone has an suggestion I'm interested.

Here's the updated patch.

Best regards,
Willy

>From c060a5db92274402a0178d7c777a1e37c15eadb9 Mon Sep 17 00:00:00 2001
From: Willy Tarreau <w...@1wt.eu>
Date: Tue, 15 Dec 2015 16:40:00 +0100
Subject: net: make lingering sockets score less in compute_score()

When multiple processes use SO_REUSEPORT for a seamless restart
operation, there's a tiny window during which both the old and the new
process are bound to the same port, and there's no way for the old
process to gracefully stop receiving connections without dropping
the few that are left in the queue between the last poll() and the
shutdown() or close() operation.

Incoming connections are distributed between multiple listening sockets
in inet_lookup_listener() according to multiple criteria. The first
criterion is a score based on a number of attributes for each socket,
then a hash computation ensures that the connections are evenly
distributed between sockets of equal weight.

This patch provides a simple approach by which the old process can
simply decrease its score by disabling SO_REUSEPORT on its listening
sockets. Thus, the sockets from the new process always score higher
and are always preferred.

The old process can then safely drain incoming connections and stop
after meeting the -1 EAGAIN condition, as shown in the example below :

 process A (old one)  |  process B (new one)
  |
  setsockopt(SO_REUSEPORT, 1) |
  listen() >= 0   |
  ... |
  accept()|
  ... |  setsockopt(SO_REUSEPORT, 1)
  ... |  listen()

   From now on, both processes receive incoming connections

  ... |  kill(process A, go_away)
  setsockopt(SO_REUSEPORT, 0) |  accept() >= 0

   Here process A stops receiving new connections

  accept() >= 0   |  accept() >= 0
  ... |
  accept() = -1 EAGAIN    |  accept() >= 0
  close() |
  exit()  |

Signed-off-by: Willy Tarreau <w...@1wt.eu>
---
 net/ipv4/inet_hashtables.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
index ccc5980..1c950ba 100644
--- a/net/ipv4/inet_hashtables.c
+++ b/net/ipv4/inet_hashtables.c
@@ -189,6 +189,8 @@ static inline int compute_score(struct sock *sk, struct net 
*net,
return -1;
score += 4;
}
+   if (sk->sk_reuseport)
+   score++;
if (sk->sk_incoming_cpu == raw_smp_processor_id())
score++;
}
-- 
1.7.12.1



Re: [PATCH 1/1] net: Add SO_REUSEPORT_LISTEN_OFF socket option as drain mode

2015-12-15 Thread Willy Tarreau
On Tue, Dec 15, 2015 at 01:21:15PM -0800, Eric Dumazet wrote:
> On Tue, 2015-12-15 at 20:44 +0100, Willy Tarreau wrote:
> 
> > Thus do you think it's worth adding a new option as Tolga proposed ?
> 
> 
> I thought we tried hard to avoid adding the option but determined
> we could not avoid it ;)

Not yet, your other proposal of disabling SO_REUSEPORT makes sense if
we combine it with the proposal to change the score in my patch. If
we say that a socket which has SO_REUSEPORT scores higher, then the
connections which don't want to accept new connections anymore will
simply have to drop it an not be elected. I find this even cleaner
since the sole purpose of the loop is to find the best socket in case
of SO_REUSEPORT.

I'll give this a try and will submit such a proposal if that works
for me.

Cheers!
Willy

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] unix: properly account for FDs passed over unix sockets

2015-12-30 Thread Willy Tarreau
On Thu, Dec 31, 2015 at 03:08:53PM +0900, Tetsuo Handa wrote:
> Willy Tarreau wrote:
> > On Wed, Dec 30, 2015 at 09:58:42AM +0100, Hannes Frederic Sowa wrote:
> > > The MSG_PEEK code should not be harmful and the patch is good as is. I 
> > > first understood from the published private thread, that it is possible 
> > > for a program to exceed the rlimit of fds. But the DoS is only by 
> > > keeping the fds in flight and not attaching them to any program.
> > 
> > Exactly. The real issue is when these FDs become very expensive such as
> > pipes full of data.
> > 
> 
> As you wrote how to abuse this vulnerability which exists in Linux 2.0
> and later kernel, I quote a short description from private thread.
> 
>   "an unprivileged user consumes all file descriptors so that other
>   unprivileged user cannot work" and "an unprivileged user consumes all
>   kernel memory so that the OOM killer kills almost all processes before
>   the culprit process is killed (CVE-2013-4312)".
> 
> Reported-by: Tetsuo Handa <penguin-ker...@i-love.sakura.ne.jp>
> Mitigates: CVE-2013-4312 (Linux 2.0+)

Well I didn't reveal any secret as it was publicly reported first
in 2010, it's only that Mark sent us the proof of concept exploit
on the security list recently :-)

https://bugzilla.kernel.org/show_bug.cgi?id=20402

Anyway I'll resend the patch with your reported-by, the CVE and
Hannes' ACK.

Thanks!
Willy

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] unix: properly account for FDs passed over unix sockets

2015-12-29 Thread Willy Tarreau
On Tue, Dec 29, 2015 at 03:48:45PM +0100, Hannes Frederic Sowa wrote:
> On 28.12.2015 15:14, Willy Tarreau wrote:
> >It is possible for a process to allocate and accumulate far more FDs than
> >the process' limit by sending them over a unix socket then closing them
> >to keep the process' fd count low.
> >
> >This change addresses this problem by keeping track of the number of FDs
> >in flight per user and preventing non-privileged processes from having
> >more FDs in flight than their configured FD limit.
> >
> >Reported-by: socketp...@gmail.com
> >Suggested-by: Linus Torvalds <torva...@linux-foundation.org>
> >Signed-off-by: Willy Tarreau <w...@1wt.eu>
> 
> Thanks for the patch!
> 
> I think this does not close the DoS attack completely as we duplicate 
> fds if the reader uses MSG_PEEK on the unix domain socket and thus 
> clones the fd. Have I overlooked something?

I didn't know this behaviour. However, then the fd remains in flight, right ?
So as long as it's not removed from the queue, the sender cannot add more
than its FD limit. I may be missing something obvious though :-/

Thanks,
Willy

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/1] net: Add SO_REUSEPORT_LISTEN_OFF socket option as drain mode

2015-12-21 Thread Willy Tarreau
On Mon, Dec 21, 2015 at 12:38:27PM -0800, Tom Herbert wrote:
> On Fri, Dec 18, 2015 at 11:00 PM, Willy Tarreau <w...@1wt.eu> wrote:
> > On Fri, Dec 18, 2015 at 06:38:03PM -0800, Eric Dumazet wrote:
> >> On Fri, 2015-12-18 at 19:58 +0100, Willy Tarreau wrote:
> >> > Hi Josh,
> >> >
> >> > On Fri, Dec 18, 2015 at 08:33:45AM -0800, Josh Snyder wrote:
> >> > > I was also puzzled that binding succeeded. Looking into the code paths
> >> > > involved, in inet_csk_get_port, we quickly goto have_snum. From there, 
> >> > > we end
> >> > > up dropping into tb_found. Since !hlist_empty(>owners), we end up 
> >> > > checking
> >> > > that (tb->fastreuseport > 0 && sk->sk_reuseport && uid_eq(tb->fastuid, 
> >> > > uid)).
> >> > > This test passes, so we goto success and bind.
> >> > >
> >> > > Crucially, we are checking the fastreuseport field on the 
> >> > > inet_bind_bucket, and
> >> > > not the sk_reuseport variable on the other sockets in the bucket. 
> >> > > Since this
> >> > > bit is set based on sk_reuseport at the time the first socket binds 
> >> > > (see
> >> > > tb_not_found), I can see no reason why sockets need to keep 
> >> > > SO_REUSEPORT set
> >> > > beyond initial binding.
> >> > >
> >> > > Given this, I believe Willy's patch elegantly solves the problem at 
> >> > > hand.
> >> >
> >> > Great, thanks for your in-depth explanation.
> >> >
> >> > Eric, do you think that this patch may be acceptable material for next
> >> > merge window (given that it's not a fix per-se) ? If so I'll resubmit
> >> > later.
> >>
> >> I need to check with Craig Gallek, because he was about to upstream a
> >> change to make SO_REUSEPORT more scalable & sexy (like having an [e]BPF
> >> filter to perform the selection in an array of sockets)
> >
> > OK fine. Please note that I also considered using a new value instead of
> > zero there but I preferred to avoid it since the man talked about zero/
> > non-zero so I wanted to limit any API change. If Craig adds new values
> > there then this is something we can reconsider.
> >
> Is there any reason why this turning off a soreuseport socket should
> not apply to UDP also? (seems like we have a need to turn off RX but
> not TX for a UDP socket).

I didn't know it was supported for UDP :-) I guess that's the only reason.

willy

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/1] net: Add SO_REUSEPORT_LISTEN_OFF socket option as drain mode

2015-12-18 Thread Willy Tarreau
Hi Josh,

On Fri, Dec 18, 2015 at 08:33:45AM -0800, Josh Snyder wrote:
> I was also puzzled that binding succeeded. Looking into the code paths
> involved, in inet_csk_get_port, we quickly goto have_snum. From there, we end
> up dropping into tb_found. Since !hlist_empty(>owners), we end up checking
> that (tb->fastreuseport > 0 && sk->sk_reuseport && uid_eq(tb->fastuid, uid)).
> This test passes, so we goto success and bind.
> 
> Crucially, we are checking the fastreuseport field on the inet_bind_bucket, 
> and
> not the sk_reuseport variable on the other sockets in the bucket. Since this
> bit is set based on sk_reuseport at the time the first socket binds (see
> tb_not_found), I can see no reason why sockets need to keep SO_REUSEPORT set
> beyond initial binding.
> 
> Given this, I believe Willy's patch elegantly solves the problem at hand.

Great, thanks for your in-depth explanation.

Eric, do you think that this patch may be acceptable material for next
merge window (given that it's not a fix per-se) ? If so I'll resubmit
later.

Thanks,
Willy

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/1] net: Add SO_REUSEPORT_LISTEN_OFF socket option as drain mode

2015-12-18 Thread Willy Tarreau
On Fri, Dec 18, 2015 at 06:38:03PM -0800, Eric Dumazet wrote:
> On Fri, 2015-12-18 at 19:58 +0100, Willy Tarreau wrote:
> > Hi Josh,
> > 
> > On Fri, Dec 18, 2015 at 08:33:45AM -0800, Josh Snyder wrote:
> > > I was also puzzled that binding succeeded. Looking into the code paths
> > > involved, in inet_csk_get_port, we quickly goto have_snum. From there, we 
> > > end
> > > up dropping into tb_found. Since !hlist_empty(>owners), we end up 
> > > checking
> > > that (tb->fastreuseport > 0 && sk->sk_reuseport && uid_eq(tb->fastuid, 
> > > uid)).
> > > This test passes, so we goto success and bind.
> > > 
> > > Crucially, we are checking the fastreuseport field on the 
> > > inet_bind_bucket, and
> > > not the sk_reuseport variable on the other sockets in the bucket. Since 
> > > this
> > > bit is set based on sk_reuseport at the time the first socket binds (see
> > > tb_not_found), I can see no reason why sockets need to keep SO_REUSEPORT 
> > > set
> > > beyond initial binding.
> > > 
> > > Given this, I believe Willy's patch elegantly solves the problem at hand.
> > 
> > Great, thanks for your in-depth explanation.
> > 
> > Eric, do you think that this patch may be acceptable material for next
> > merge window (given that it's not a fix per-se) ? If so I'll resubmit
> > later.
> 
> I need to check with Craig Gallek, because he was about to upstream a
> change to make SO_REUSEPORT more scalable & sexy (like having an [e]BPF
> filter to perform the selection in an array of sockets)

OK fine. Please note that I also considered using a new value instead of
zero there but I preferred to avoid it since the man talked about zero/
non-zero so I wanted to limit any API change. If Craig adds new values
there then this is something we can reconsider.

Have a nice week-end,
Willy

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] unix: properly account for FDs passed over unix sockets

2015-12-28 Thread Willy Tarreau
It is possible for a process to allocate and accumulate far more FDs than
the process' limit by sending them over a unix socket then closing them
to keep the process' fd count low.

This change addresses this problem by keeping track of the number of FDs
in flight per user and preventing non-privileged processes from having
more FDs in flight than their configured FD limit.

Reported-by: socketp...@gmail.com
Suggested-by: Linus Torvalds <torva...@linux-foundation.org>
Signed-off-by: Willy Tarreau <w...@1wt.eu>
---
It would be nice if (if accepted) it would be backported to -stable as the
issue is currently exploitable.
---
 include/linux/sched.h |  1 +
 net/unix/af_unix.c| 24 
 net/unix/garbage.c| 13 -
 3 files changed, 29 insertions(+), 9 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index edad7a4..fbf25f1 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -830,6 +830,7 @@ struct user_struct {
unsigned long mq_bytes; /* How many bytes can be allocated to mqueue? */
 #endif
unsigned long locked_shm; /* How many pages of mlocked shm ? */
+   unsigned long unix_inflight;/* How many files in flight in unix 
sockets */
 
 #ifdef CONFIG_KEYS
struct key *uid_keyring;/* UID specific keyring */
diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 45aebd9..d6d7b43 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -1499,6 +1499,21 @@ static void unix_destruct_scm(struct sk_buff *skb)
sock_wfree(skb);
 }
 
+/*
+ * The "user->unix_inflight" variable is protected by the garbage
+ * collection lock, and we just read it locklessly here. If you go
+ * over the limit, there might be a tiny race in actually noticing
+ * it across threads. Tough.
+ */
+static inline bool too_many_unix_fds(struct task_struct *p)
+{
+   struct user_struct *user = current_user();
+
+   if (unlikely(user->unix_inflight > task_rlimit(p, RLIMIT_NOFILE)))
+   return !capable(CAP_SYS_RESOURCE) && !capable(CAP_SYS_ADMIN);
+   return false;
+}
+
 #define MAX_RECURSION_LEVEL 4
 
 static int unix_attach_fds(struct scm_cookie *scm, struct sk_buff *skb)
@@ -1507,6 +1522,9 @@ static int unix_attach_fds(struct scm_cookie *scm, struct 
sk_buff *skb)
unsigned char max_level = 0;
int unix_sock_count = 0;
 
+   if (too_many_unix_fds(current))
+   return -ETOOMANYREFS;
+
for (i = scm->fp->count - 1; i >= 0; i--) {
struct sock *sk = unix_get_socket(scm->fp->fp[i]);
 
@@ -1528,10 +1546,8 @@ static int unix_attach_fds(struct scm_cookie *scm, 
struct sk_buff *skb)
if (!UNIXCB(skb).fp)
return -ENOMEM;
 
-   if (unix_sock_count) {
-   for (i = scm->fp->count - 1; i >= 0; i--)
-   unix_inflight(scm->fp->fp[i]);
-   }
+   for (i = scm->fp->count - 1; i >= 0; i--)
+   unix_inflight(scm->fp->fp[i]);
return max_level;
 }
 
diff --git a/net/unix/garbage.c b/net/unix/garbage.c
index a73a226..8fcdc22 100644
--- a/net/unix/garbage.c
+++ b/net/unix/garbage.c
@@ -120,11 +120,11 @@ void unix_inflight(struct file *fp)
 {
struct sock *s = unix_get_socket(fp);
 
+   spin_lock(_gc_lock);
+
if (s) {
struct unix_sock *u = unix_sk(s);
 
-   spin_lock(_gc_lock);
-
if (atomic_long_inc_return(>inflight) == 1) {
BUG_ON(!list_empty(>link));
list_add_tail(>link, _inflight_list);
@@ -132,25 +132,28 @@ void unix_inflight(struct file *fp)
BUG_ON(list_empty(>link));
}
unix_tot_inflight++;
-   spin_unlock(_gc_lock);
}
+   fp->f_cred->user->unix_inflight++;
+   spin_unlock(_gc_lock);
 }
 
 void unix_notinflight(struct file *fp)
 {
struct sock *s = unix_get_socket(fp);
 
+   spin_lock(_gc_lock);
+
if (s) {
struct unix_sock *u = unix_sk(s);
 
-   spin_lock(_gc_lock);
BUG_ON(list_empty(>link));
 
if (atomic_long_dec_and_test(>inflight))
list_del_init(>link);
unix_tot_inflight--;
-   spin_unlock(_gc_lock);
}
+   fp->f_cred->user->unix_inflight--;
+   spin_unlock(_gc_lock);
 }
 
 static void scan_inflight(struct sock *x, void (*func)(struct unix_sock *),
-- 
1.7.12.1
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] unix: properly account for FDs passed over unix sockets

2015-12-30 Thread Willy Tarreau
On Wed, Dec 30, 2015 at 09:58:42AM +0100, Hannes Frederic Sowa wrote:
> The MSG_PEEK code should not be harmful and the patch is good as is. I 
> first understood from the published private thread, that it is possible 
> for a program to exceed the rlimit of fds. But the DoS is only by 
> keeping the fds in flight and not attaching them to any program.

Exactly. The real issue is when these FDs become very expensive such as
pipes full of data.

> __alloc_fd, called on the receiver side, does check for the rlimit 
> maximum anyway, so I don't see a loophole anymore:
> 
> Acked-by: Hannes Frederic Sowa 

Thanks!

> Another idea would be to add the amount of memory used to manage the fds 
> to sock_rmem/wmem but I don't see any advantages or disadvantages.

Compared to the impact of the pending data in pipes themselves in flight,
this would remain fairly minimal.

Thanks,
Willy

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 3.10 128/143] VSOCK: do not disconnect socket when peer has shutdown SEND only

2016-06-05 Thread Willy Tarreau
From: Ian Campbell <ian.campb...@docker.com>

commit dedc58e067d8c379a15a8a183c5db318201295bb upstream.

The peer may be expecting a reply having sent a request and then done a
shutdown(SHUT_WR), so tearing down the whole socket at this point seems
wrong and breaks for me with a client which does a SHUT_WR.

Looking at other socket family's stream_recvmsg callbacks doing a shutdown
here does not seem to be the norm and removing it does not seem to have
had any adverse effects that I can see.

I'm using Stefan's RFC virtio transport patches, I'm unsure of the impact
on the vmci transport.

Signed-off-by: Ian Campbell <ian.campb...@docker.com>
Cc: "David S. Miller" <da...@davemloft.net>
Cc: Stefan Hajnoczi <stefa...@redhat.com>
Cc: Claudio Imbrenda <imbre...@linux.vnet.ibm.com>
Cc: Andy King <ack...@vmware.com>
Cc: Dmitry Torokhov <d...@vmware.com>
Cc: Jorgen Hansen <jhan...@vmware.com>
Cc: Adit Ranadive <ad...@vmware.com>
Cc: netdev@vger.kernel.org
Signed-off-by: David S. Miller <da...@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gre...@linuxfoundation.org>
Signed-off-by: Willy Tarreau <w...@1wt.eu>
---
 net/vmw_vsock/af_vsock.c | 21 +
 1 file changed, 1 insertion(+), 20 deletions(-)

diff --git a/net/vmw_vsock/af_vsock.c b/net/vmw_vsock/af_vsock.c
index 9b88693..66a9bf5 100644
--- a/net/vmw_vsock/af_vsock.c
+++ b/net/vmw_vsock/af_vsock.c
@@ -1804,27 +1804,8 @@ vsock_stream_recvmsg(struct kiocb *kiocb,
else if (sk->sk_shutdown & RCV_SHUTDOWN)
err = 0;
 
-   if (copied > 0) {
-   /* We only do these additional bookkeeping/notification steps
-* if we actually copied something out of the queue pair
-* instead of just peeking ahead.
-*/
-
-   if (!(flags & MSG_PEEK)) {
-   /* If the other side has shutdown for sending and there
-* is nothing more to read, then modify the socket
-* state.
-*/
-   if (vsk->peer_shutdown & SEND_SHUTDOWN) {
-   if (vsock_stream_has_data(vsk) <= 0) {
-   sk->sk_state = SS_UNCONNECTED;
-   sock_set_flag(sk, SOCK_DONE);
-   sk->sk_state_change(sk);
-   }
-   }
-   }
+   if (copied > 0)
err = copied;
-   }
 
 out_wait:
finish_wait(sk_sleep(sk), );
-- 
2.8.0.rc2.1.gbe9624a



Re: [PATCH v2] unix: properly account for FDs passed over unix sockets

2016-02-02 Thread Willy Tarreau
On Tue, Feb 02, 2016 at 09:32:56PM +0100, Hannes Frederic Sowa wrote:
> But "struct pid *" in unix_skb_parms should be enough to get us to 
> corresponding "struct cred *" so we can decrement the correct counter 
> during skb destruction.
> 
> So:
> 
> We increment current task's unix_inflight and also check the current 
> task's limit during attaching fds to skbs and decrement the inflight 
> counter via "struct pid *". This looks like it should work.

I like it as well, the principle sounds sane.

> >That way it's always the person who actually does the send (rather
> >than the opener of the socket _or_ the opener of the file that gets
> >passed around) that gets credited, and thanks to the cred pointer we
> >can then de-credit them properly.
> 
> Exactly, I try to implement that. Thanks a lot!

Thanks to you Hannes, I appreciate that you work on it, it would take
much more time to me to dig into this.

Willy



Re: [PATCH v2] unix: properly account for FDs passed over unix sockets

2016-02-02 Thread Willy Tarreau
On Tue, Feb 02, 2016 at 12:53:20PM -0800, Linus Torvalds wrote:
> On Tue, Feb 2, 2016 at 12:49 PM, Willy Tarreau <w...@1wt.eu> wrote:
> > On Tue, Feb 02, 2016 at 12:44:54PM -0800, Linus Torvalds wrote:
> >>
> >> Umm. I think the "struct cred" may change in between, can't it?
> >
> > You mean for example in case of setuid() or something like this ?
> 
> Yeah. I'd be worried about looking up the creds or user structure
> later, and possibly getting a different one.
> 
> I'd much rather look it up at attach time, and just carry an extra
> pointer around. That seems to be an inherently safer model where
> there's no worry about "what happens if the user does X in the
> meantime".

Normally we can only change from root to non-root, and we don't apply
the limits to root, so if we have the ability to only store one bit
indicating "not tracked" or to simply nullify one pointer to avoid
counting in flight FDs for root, we don't take the risk to recredit
them to the target user after a change.

I just don't know if we can do that though :-/

Willy



Re: [PATCH v2] unix: properly account for FDs passed over unix sockets

2016-02-02 Thread Willy Tarreau
On Tue, Feb 02, 2016 at 12:44:54PM -0800, Linus Torvalds wrote:
> On Tue, Feb 2, 2016 at 12:32 PM, Hannes Frederic Sowa
>  wrote:
> > But "struct pid *" in unix_skb_parms should be enough to get us to
> > corresponding "struct cred *" so we can decrement the correct counter during
> > skb destruction.
> 
> Umm. I think the "struct cred" may change in between, can't it?

You mean for example in case of setuid() or something like this ?

willy



Re: struct pid memory leak

2016-01-23 Thread Willy Tarreau
Hi Eric, Dmitry,

On Fri, Jan 22, 2016 at 08:50:01AM -0800, Eric Dumazet wrote:
> CC netdev, as it looks some af_unix issue ...
> 
> On Fri, 2016-01-22 at 16:08 +0100, Dmitry Vyukov wrote:
> > Hello,
> > 
> > The following program causes struct pid memory leak:
> > 
> > // autogenerated by syzkaller (http://github.com/google/syzkaller)
(...)
> > unreferenced object 0x8800324af200 (size 112):
> >   comm "syz-executor", pid 18413, jiffies 4295500287 (age 14.321s)
> >   hex dump (first 32 bytes):
> > 01 00 00 00 01 00 00 00 00 00 00 00 00 00 00 00  
> > 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  
> >   backtrace:
> > [] kmemleak_alloc+0x63/0xa0 mm/kmemleak.c:916
> > [< inline >] kmemleak_alloc_recursive 
> > include/linux/kmemleak.h:47
(...)
> > On commit 30f05309bde49295e02e45c7e615f73aa4e0ccc2 (Jan 20).

I can't reproduce this with the indicated commit. I'm unsure how/what
I'm supposed to see. Is a certain config needed ? I've enabled kmemleak
in my .config but there are too few information here to go further
unfortunately.

Regards,
Willy


Re: struct pid memory leak

2016-01-23 Thread Willy Tarreau
On Sat, Jan 23, 2016 at 07:14:33PM +0100, Dmitry Vyukov wrote:
> I've attached my .config.
> Also run this program in a parallel loop. I think it's leaking not
> every time, probably some race is involved.

Thank you. Just in order to confirm, am I supposed to see the
messages you quoted in dmesg ?

Thanks,
Willy



Re: struct pid memory leak

2016-01-23 Thread Willy Tarreau
On Sat, Jan 23, 2016 at 07:46:45PM +0100, Dmitry Vyukov wrote:
> On Sat, Jan 23, 2016 at 7:40 PM, Willy Tarreau <w...@1wt.eu> wrote:
> > On Sat, Jan 23, 2016 at 07:14:33PM +0100, Dmitry Vyukov wrote:
> >> I've attached my .config.
> >> Also run this program in a parallel loop. I think it's leaking not
> >> every time, probably some race is involved.
> >
> > Thank you. Just in order to confirm, am I supposed to see the
> > messages you quoted in dmesg ?
> 
> 
> I think the simplest way to confirm that you can reproduce it locally
> is to check /proc/slabinfo. When I run this program in a parallel
> loop, number of objects in pid cache was constantly growing:
> 
> # cat /proc/slabinfo | grep pid
> pid  297532576   284 : tunables00
>   0 : slabdata 19 19  0
> ...
> pid  412532576   284 : tunables00
>   0 : slabdata 19 19  0
> ...
> pid 1107   1176576   284 : tunables00
>   0 : slabdata 42 42  0
> ...
> pid 1545   1652576   284 : tunables00
>   0 : slabdata 59 59  0

OK got it and indeed I can see it grow. In fact, the active column grows and
once it reaches the num objects, this one grows in turn, which makes sense.

All I can say now is that it doesn't need to run over multiple processes
to leak, though that makes it easier. SMP is not needed either.

> If you want to use kmemleak, then you need to run this program in a
> parallel loop for some time, then stop it and then:
> 
> $ echo scan > /sys/kernel/debug/kmemleak
> $ cat /sys/kernel/debug/kmemleak
> 
> If kmemleak has detected any leaks, cat will show them. I noticed that
> kmemleak can delay leaks with significant delay, so usually I do scan
> at least 5 times.

Thank you for these information.

I've tested on an older (3.14) kernel and I can see the effect there as well.
I don't have "pid" in slabinfo, but launching 1000 processes at a time uses
a few tens to hundreds kB of RAM on each round. 3.10 doesn't seem affected,
I'm seeing the memory grow to a fixed point if I increase the number of
parallel processes but then even after a few tens of thousands of processes,
the reported used memory doesn't seem to increase (remember no "pid" entry
here).

kmemleak indeed reports me something on 3.14 which seems to match your
trace as I'm seeing bash as the process (instead of syz-executor in your
case) and alloc_pid() calls kmem_cache_alloc() :

Unreferenced object 0x88003facd000 (size 128):
  comm "bash", pid 1822, jiffies 4294951223 (age 15.280s)
  hex dump (first 32 bytes):
01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  
  backtrace:
[] kmem_cache_alloc+0x92/0xe0
[] alloc_pid+0x24/0x4a0
[] cpumask_any_but+0x23/0x40
[] copy_process.part.66+0x1068/0x16e0
[] n_tty_write+0x37b/0x4f0
[] tty_write+0x1c1/0x2a0
[] do_fork+0xe0/0x340
[] __set_task_blocked+0x30/0x80
[] __set_current_blocked+0x38/0x60
[] stub_clone+0x69/0x90
[] system_call_fastpath+0x16/0x1b
[] 0x

It doesn't report this on 3.10.

Unfortunately I feel totally incompetent on the subject :-/

Willy



Re: struct pid memory leak

2016-01-23 Thread Willy Tarreau
On Sat, Jan 23, 2016 at 06:50:11PM -0800, Eric Dumazet wrote:
> On Sat, Jan 23, 2016 at 6:38 PM, Willy Tarreau <w...@1wt.eu> wrote:
> > On Sun, Jan 24, 2016 at 03:11:45AM +0100, Willy Tarreau wrote:
> >> It doesn't report this on 3.10.
> >
> > To be more precise, kmemleak reports the issue on 3.13 and not on 3.12.
> > I'm not sure if it's reliable enough to run a bisect though.
> >
> > Willy
> >
> 
> I have the leak on linux-3.11.
> 
> I believe even linux-3.3 gets the leak, although I had to wait about
> one hour to be confident the leak was there.

OK so I'm stopping my bisect. It's possible it's affected with some option
which changed along the various "make oldconfig" at each step. Thanks for
letting me know.

Willy



Re: struct pid memory leak

2016-01-23 Thread Willy Tarreau
On Sun, Jan 24, 2016 at 03:11:45AM +0100, Willy Tarreau wrote:
> It doesn't report this on 3.10.

To be more precise, kmemleak reports the issue on 3.13 and not on 3.12.
I'm not sure if it's reliable enough to run a bisect though.

Willy



Re: Mis-backport in af_unix patch for Linux 3.10.95

2016-01-24 Thread Willy Tarreau
Hello,

On Sun, Jan 24, 2016 at 12:10:35AM -0500, Sultan Qasim wrote:
> Hello all,
> 
> I'm an outsider to the Linux kernel community, so I apologize if this
> is not the right channel to mention this.

The simple fact that you participate, inspect the code and report bugs
makes you part of this community :-)  It's indeed the right place.
Usually when reporting an issue with a commit, we also CC the whole
signed-off-by / CC chain of that commit (which I'm doing now). For
bugs related to networking, we usually CC the netdev list as well.

> I noticed that the
> backported version of the patch "af_unix: Revert 'lock_interruptible'
> in stream receive code" in Linux 3.10.95 seems to have removed the
> mutex_lock_interruptible from the wrong function.
>
> Here is the backported patch:
> https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/commit/?id=3a57e783016bf43ab9326172217f564941b85b17
> 
> Here is the original:
> https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/commit/net/unix/af_unix.c?id=3822b5c2fc62e3de8a0f33806ff279fb7df92432
> 
> Was it not meant to be removed from unix_stream_recvmsg instead of
> unix_dgram_recvmsg?

You're absolutely right, good catch! Similar controls were added to
both functions resulting in the same code appearing there, which
confused the patch process, causing the change to be applied to the
wrong location. This happens from time to time in such circumstances
when backporting to older kernels.

> Also, the variable called "noblock" needs to be
> removed from the function being changed to prevent unused variable
> warnings.

If you mean this variable in function unix_dgram_recvmsg(), it would
indeed report a warning but only due to the patch being mis-applied.
In unix_stream_recvmsg(), it's still used as well.

Does the attached patch seem better to you (not compile-tested) ?

Greg/Ben, both 3.2.76 and 3.14.59 are OK regarding this, it seems
like only 3.10.95 was affected.

Thanks,
Willy

>From 77f6e82adf349cbccf7e2ec7601b25c994e0b483 Mon Sep 17 00:00:00 2001
From: Willy Tarreau <w...@1wt.eu>
Date: Sun, 24 Jan 2016 09:19:57 +0100
Subject: af_unix: fix incorrect revert of 'lock_interruptible' in stream
 receive code

As reported by Sultan Qasim, commit 3822b5c ("af_unix: Revert
'lock_interruptible' in stream receive code") was accidently applied
at the wrong place in the backport that appeared in 3.10.95, it
affected unix_dgram_recvmsg() instead of unix_stream_recvmsg() due
to now similar code sections there. The dgram part needs to remain
but the stream part needs to be removed.

Reported-By: Sultan Qasim <sultanqa...@gmail.com>
Fixes: 3a57e78 (3.10.95)
Signed-off-by: Willy Tarreau <w...@1wt.eu>
---
 net/unix/af_unix.c | 18 +-
 1 file changed, 9 insertions(+), 9 deletions(-)

diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index f934e7b..0061d00 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -1934,7 +1934,14 @@ static int unix_dgram_recvmsg(struct kiocb *iocb, struct 
socket *sock,
if (flags_OOB)
goto out;
 
-   mutex_lock(>readlock);
+   err = mutex_lock_interruptible(>readlock);
+   if (unlikely(err)) {
+   /* recvmsg() in non blocking mode is supposed to return -EAGAIN
+* sk_rcvtimeo is not honored by mutex_lock_interruptible()
+*/
+   err = noblock ? -EAGAIN : -ERESTARTSYS;
+   goto out;
+   }
 
skip = sk_peek_offset(sk, flags);
 
@@ -2083,14 +2090,7 @@ static int unix_stream_recvmsg(struct kiocb *iocb, 
struct socket *sock,
memset(_scm, 0, sizeof(tmp_scm));
}
 
-   err = mutex_lock_interruptible(>readlock);
-   if (unlikely(err)) {
-   /* recvmsg() in non blocking mode is supposed to return -EAGAIN
-* sk_rcvtimeo is not honored by mutex_lock_interruptible()
-*/
-   err = noblock ? -EAGAIN : -ERESTARTSYS;
-   goto out;
-   }
+   mutex_lock(>readlock);
 
do {
int chunk;
-- 
1.7.12.2.21.g234cd45.dirty



Re: [PATCH net] af_unix: fix struct pid memory leak

2016-01-24 Thread Willy Tarreau
Hi Eric,

On Sun, Jan 24, 2016 at 01:53:50PM -0800, Eric Dumazet wrote:
> From: Eric Dumazet 
> 
> Dmitry reported a struct pid leak detected by a syzkaller program.
> 
> Bug happens in unix_stream_recvmsg() when we break the loop when a
> signal is pending, without properly releasing scm.
> 
> Fixes: b3ca9b02b007 ("net: fix multithreaded signal handling in unix recv 
> routines")
> Reported-by: Dmitry Vyukov 
> Signed-off-by: Eric Dumazet 
> Cc: Rainer Weikusat 
> ---
>  net/unix/af_unix.c |1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
> index c5bf5ef2bf89..49d5093eb055 100644
> --- a/net/unix/af_unix.c
> +++ b/net/unix/af_unix.c
> @@ -2339,6 +2339,7 @@ again:
>  
>   if (signal_pending(current)) {
>   err = sock_intr_errno(timeo);
> + scm_destroy();
>   goto out;
>   }

Good job on this one! FWIW, I managed to test it on 3.14 and I confirm it
completely fixes the leak there as well. I had to modify it a little bit
however since there's no scm local variable there :

-   scm_destroy();
+   scm_destroy(siocb->scm);

Cheers,
Willy



Re: [PATCH v2 net-next 0/8] API set for HW Buffer management

2016-02-17 Thread Willy Tarreau
Hi Gregory,

On Tue, Feb 16, 2016 at 04:33:35PM +0100, Gregory CLEMENT wrote:
> Hello,
> 
> A few weeks ago I sent a proposal for a API set for HW Buffer
> management, to have a better view of the motivation for this API see
> the cover letter of this proposal:
> http://thread.gmane.org/gmane.linux.kernel/2125152
> 
> Since this version I took into account the review from Florian:
> - The hardware buffer management helpers are no more built by default
>   and now depend on a hidden config symbol which has to be selected
>   by the driver if needed
> - The hwbm_pool_refill() and hwbm_pool_add() now receive a gfp_t as
>   argument allowing the caller to specify the flag it needs.
> - buf_num is now tested to ensure there is no wrapping
> - A spinlock has been added to protect the hwbm_pool_add() function in
>   SMP or irq context.
> 
> I also used pr_warn instead of pr_debug in case of errors.
> 
> I fixed the mvneta implementation by returning the buffer to the pool
> at various place instead of ignoring it.
> 
> About the series itself I tried to make this series easier to merge:
> - Squashed "bus: mvenus-mbus: Fix size test for
>mvebu_mbus_get_dram_win_info" into bus: mvebu-mbus: provide api for
>obtaining IO and DRAM window information.
> - Added my signed-otf-by on all the patches as submitter of the series.
> - Renamed the dts patches with the pattern "ARM: dts: platform:"
> - Removed the patch "ARM: mvebu: enable SRAM support in
>   mvebu_v7_defconfig" of this series and already applied it
> - Rodified the order of the patches.
> 
> In order to ease the test the branch mvneta-BM-framework-v2 is
> available at g...@github.com:MISL-EBU-System-SW/mainline-public.git.

Well, I tested this patch series on top of latest master (from today)
on my fresh new clearfog board. I compared carefully with and without
the patchset. My workload was haproxy receiving connections and forwarding
them to my PC via the same port. I tested both with short connections
(HTTP GET of an empty file) and long ones (1 MB or more). No trouble
was detected at all, which is pretty good. I noticed a very tiny
performance drop which is more noticeable on short connections (high
packet rates), my forwarded connection rate went down from 17500/s to
17300/s. But I have not checked yet what can be tuned when using the
BM, nor did I compare CPU usage. I remember having run some tests in
the past, I guess it was on the XP-GP board, and noticed that the BM
could save a significant amount of CPU and improve cache efficiency,
so if this is the case here, we don't really care about a possible 1%
performance drop.

I'll try to provide more results as time permits.

In the mean time if you want (or plan to submit a next batch), feel
free to add a Tested-by: Willy Tarreau <w...@1wt.eu>.

cheers,
Willy



Re: [PATCH 1/1] net: Add SO_REUSEPORT_LISTEN_OFF socket option as drain mode

2016-03-24 Thread Willy Tarreau
Hi Eric,

On Thu, Mar 24, 2016 at 07:13:33AM -0700, Eric Dumazet wrote:
> On Thu, 2016-03-24 at 07:12 +0100, Willy Tarreau wrote:
> > Hi,
> > 
> > On Wed, Mar 23, 2016 at 10:10:06PM -0700, Tolga Ceylan wrote:
> > > I apologize for not properly following up on this. I had the
> > > impression that we did not want to merge my original patch and then I
> > > also noticed that it fails to keep the hash consistent. Recently, I
> > > read the follow ups on it as well as Willy's patch/proposals.
> > > 
> > > Is there any update on Willy's SO_REUSEPORT patch? IMHO, it solves the
> > > problem and it is simpler than adding new sock option.
> > 
> > no, Craig's changes were merged, and I haven't checked yet if my patch
> > needs to be rebased or still applies. Feel free to check it and resubmit
> > if you have time.
> 
> No need for a patch AFAIK.
> 
> BPF solution is generic enough.
> 
> All user space needs to do is to update the BPF filter so that the
> listener needing to be dismantled does not receive any new packet.

But that means that any software making use of SO_REUSEPORT needs to
also implement BPF on Linux to achieve the same as what it does on
other OSes ? Also I found a case where a dying process would still
cause trouble in the accept queue, maybe it's not redistributed, I
don't remember, all I remember is that my traffic stopped after a
segfault of only one of them :-/ I'll have to dig a bit regarding
this.

Thanks,
Willy



Re: [PATCH 1/1] net: Add SO_REUSEPORT_LISTEN_OFF socket option as drain mode

2016-03-24 Thread Willy Tarreau
On Thu, Mar 24, 2016 at 04:54:03PM -0700, Tom Herbert wrote:
> On Thu, Mar 24, 2016 at 4:40 PM, Yann Ylavic  wrote:
> > I'll learn how to do this to get the best performances from the
> > server, but having to do so to work around what looks like a defect
> > (for simple/default SMP configurations at least, no NUMA or clever
> > CPU-affinity or queuing policy involved) seems odd in the first place.
> >
> I disagree with your assessment that there is a defect. SO_REUSEPORT
> is designed to spread packets amongst _equivalent_ connections. In the
> server draining case sockets are no longer equivalent, but that is a
> special case.

I partially disagree with you here Tom. Initially SO_REUSEPORT was not
used to spread packets but to allow soft restart in some applications.
I've been using it since 2001 in haproxy on *BSD and linux 2.2. It was
removed during 2.3 and I used to keep a patch to reimplement it in 2.4
(basically 2 or 3 lines, the infrastructure was still present), but the
patch was not accepted. The same patch worked for 2.6 and 3.x, allowing
me to continue to perform soft-restarts on Linux just like I used to do
on *BSD. When SO_REUSEPORT was reimplemented in 3.9 with load balancing,
I was happy because it at last allowed me to drop my patch and I got
the extra benefit of better load balancing of incoming connections.

But the main use we have for it (at least historically) is for soft
restarts, where one process replaces another one. Very few people use
more than one process in our case.

However given the benefits of the load spreading for extreme loads,
I'm willing to find how to achieve the same with BPF, but it's pretty
clear that at this point I have no idea where to start from and that
for a single process replacing a single one, it looks quite complicated.

For me quite frankly the special case is the load balancing which is
a side benefit (and a nice one, don't get me wrong).

That's why I would have found it nice to "fix" the process replacement
to avoid dropping incoming connections, though I don't want it to become
a problem for future improvements on BPF. I don't think the two lines I
proposed could become an issue but I'll live without them (or continue
to apply this patch).

BTW, I have no problem with having to write a little bit of assembly for
fast interfaces if it remains untouched for years, we do already have a
bit in haproxy. It's just a longterm investment.

Best regards,
Willy



Re: [PATCH 1/1] net: Add SO_REUSEPORT_LISTEN_OFF socket option as drain mode

2016-03-25 Thread Willy Tarreau
Hi Eric,

On Thu, Mar 24, 2016 at 11:49:41PM -0700, Eric Dumazet wrote:
> Everything is possible, but do not complain because BPF went in the
> kernel before your changes.

Don't get me wrong, I'm not complaining, I'm more asking for help to
try to elaborate the alternate solution. I understood well what my
proposal did because it was pretty straightforward, and the new one
I'll have to do is of an immense complexity for me by now, since it
will require learning a new language and finding doc on how all this
works together. But as I said I'm totally sold to the benefits it can
provide for large scale deployments and I'm well aware of the ugly
socket scan there was in the previous one.

> Just rework your patch.
> 
> Supporting multiple SO_REUSEPORT groups on the same port should not be
> too hard really. Making sure BPF still work for them is feasible.
> 
> But the semantic of the socket option would be really different.

I don't care much about the socket options themselves as long as I can
continue to support seamless reloads. I could even get rid of SO_REUSEPORT
on Linux to use something else instead if I have a reliable way to detect
that the alternative will work.

> You need to not control an individual listener, but a group of listener.
> 
> Your dying haproxy would issue a single system call to tell the kernel :
> My SO_REUSEPORT group should not accept new SYN packets, so that the new
> haproxy can setup a working new SO_REUSEPORT group.

Normally it's the other way around :-) The new process first grabs the
socket, there's a tiny window where both are attached, and only then the
old process leaves. That's the only way to ensure there's no loss nor
added latency in the processing.

If you could share a pointer to some good starter documentation for this,
I would appreciate it. I really have no idea where to start from and the
only elements I found on the net didn't give a single hint regarding all
this :-/

Thanks,
Willy



Re: [PATCH 1/1] net: Add SO_REUSEPORT_LISTEN_OFF socket option as drain mode

2016-03-24 Thread Willy Tarreau
Hi Eric,

(just lost my e-mail, trying not to forget some points)

On Thu, Mar 24, 2016 at 07:45:44AM -0700, Eric Dumazet wrote:
> On Thu, 2016-03-24 at 15:22 +0100, Willy Tarreau wrote:
> > Hi Eric,
> 
> > But that means that any software making use of SO_REUSEPORT needs to
> > also implement BPF on Linux to achieve the same as what it does on
> > other OSes ? Also I found a case where a dying process would still
> > cause trouble in the accept queue, maybe it's not redistributed, I
> > don't remember, all I remember is that my traffic stopped after a
> > segfault of only one of them :-/ I'll have to dig a bit regarding
> > this.
> 
> Hi Willy
> 
> Problem is : If we add a SO_REUSEPORT_LISTEN_OFF, this wont work with
> BPF. 

I wasn't for adding SO_REUSEPORT_LISTEN_OFF either. Instead the idea was
just to modify the score in compute_score() so that a socket which disables
SO_REUSEPORT scores less than one which still has it. The application
wishing to terminate just has to clear the SO_REUSEPORT flag and wait for
accept() reporting EAGAIN. The patch simply looked like this (copy-pasted
hence space-mangled) :

--- a/net/ipv4/inet_hashtables.c
+++ b/net/ipv4/inet_hashtables.c
@@ -189,6 +189,8 @@ static inline int compute_score(struct sock *sk, struct net 
*net,
return -1;
score += 4;
}
+   if (sk->sk_reuseport)
+   score++;
if (sk->sk_incoming_cpu == raw_smp_processor_id())
score++;
}

> BPF makes a decision without knowing individual listeners states.

But is the decision taken without considering compute_score() ? The point
really was to be the least possibly intrusive and quite logical for the
application : "disable SO_REUSEPORT when you don't want to participate to
incoming load balancing anymore".

> Or we would need to extend BPF to access these kind of states.
> Doable certainly, but we need to be convinced it is necessary.

You know that I don't like complex designs to address simple issues if
possible :-)

> And yes, if a listener is closed while children are still in accept
> queue, we drop all the children, we do not care of redistributing them
> to another listeners. Really too complex to be worth it.

Forget this, I mixed two issues here. Yes I know that redistributing is
almost impossible, I've read that code a year ago or so and realized how
complex this would be, without providing even 100% success rate. I wasn't
speaking about redistribution of incoming queue but about an issue I've
met where when I have 4 processes bound to the same port, if one dies,
its share of incoming traffic is definitely lost. The only fix was to
restart the processes to create new listeners. But I don't remember the
conditions where I met this case, I don't even remember if it was on an
-rc kernel or a final one, so I'd prefer to discuss this only once I have
more elements.

Cheers,
Willy
>From c060a5db92274402a0178d7c777a1e37c15eadb9 Mon Sep 17 00:00:00 2001
From: Willy Tarreau <w...@1wt.eu>
Date: Tue, 15 Dec 2015 16:40:00 +0100
Subject: net: make lingering sockets score less in compute_score()

When multiple processes use SO_REUSEPORT for a seamless restart
operation, there's a tiny window during which both the old and the new
process are bound to the same port, and there's no way for the old
process to gracefully stop receiving connections without dropping
the few that are left in the queue between the last poll() and the
shutdown() or close() operation.

Incoming connections are distributed between multiple listening sockets
in inet_lookup_listener() according to multiple criteria. The first
criterion is a score based on a number of attributes for each socket,
then a hash computation ensures that the connections are evenly
distributed between sockets of equal weight.

This patch provides a simple approach by which the old process can
simply decrease its score by disabling SO_REUSEPORT on its listening
sockets. Thus, the sockets from the new process always score higher
and are always preferred.

The old process can then safely drain incoming connections and stop
after meeting the -1 EAGAIN condition, as shown in the example below :

 process A (old one)  |  process B (new one)
  |
  setsockopt(SO_REUSEPORT, 1) |
  listen() >= 0   |
  ... |
  accept()|
  ... |  setsockopt(SO_REUSEPORT, 1)
  ... |  listen()

   From now on, both processes receive incoming connections

  ... |  kill(process A, go_away)
  setsockopt(SO_REUSEPORT, 0) |  accept() >= 0

   Here process A stops receiving new connections


Re: [PATCH 1/1] net: Add SO_REUSEPORT_LISTEN_OFF socket option as drain mode

2016-03-24 Thread Willy Tarreau
On Thu, Mar 24, 2016 at 09:33:11AM -0700, Eric Dumazet wrote:
> > --- a/net/ipv4/inet_hashtables.c
> > +++ b/net/ipv4/inet_hashtables.c
> > @@ -189,6 +189,8 @@ static inline int compute_score(struct sock *sk, struct 
> > net *net,
> > return -1;
> > score += 4;
> > }
> > +   if (sk->sk_reuseport)
> > +   score++;
> 
> This wont work with BPF
> 
> > if (sk->sk_incoming_cpu == raw_smp_processor_id())
> > score++;
> 
> This one does not work either with BPF

But this *is* in 4.5. Does this mean that this part doesn't work anymore or
just that it's not usable in conjunction with BPF ? In this case I'm less
worried, because it would mean that we have a solution for non-BPF aware
applications and that BPF-aware applications can simply use BPF.

> Whole point of BPF was to avoid iterate through all sockets [1],
> and let user space use whatever selection logic it needs.
> 
> [1] This was okay with up to 16 sockets. But with 128 it does not scale.

Indeed.

> If you really look at how BPF works, implementing another 'per listener' flag
> would break the BPF selection.

OK.

> You can certainly implement the SO_REUSEPORT_LISTEN_OFF by loading an
> updated BPF, why should we add another way in the kernel to do the same,
> in a way that would not work in some cases ?

I don't try to reimplement something already available, but I'm confused
by a few points :
  - the code above already exists and you mention it cannot be used with BPF
  - for the vast majority of applications not using BPF, would the above *still*
work (it worked in 4.4-rc at least)
  - it seems to me that for BPF to be usable on process shutting down, we'd
need to have some form of central knowledge if the goal is to redefine
how to distribute the load. In my case there are multiple independant
processes forked on startup, so it's unclear to me how each of them could
reconfigure BPF when shutting down without risking to break the other ones.
  - the doc makes me believe that BPF would require privileges to be unset, so
that would not be compatible with a process shutting down which has already
dropped its privileges after startup, but I could be wrong.

Thanks for your help on this,
Willy



Re: [PATCH 1/1] net: Add SO_REUSEPORT_LISTEN_OFF socket option as drain mode

2016-03-24 Thread Willy Tarreau
On Thu, Mar 24, 2016 at 10:01:37AM -0700, Eric Dumazet wrote:
> On Thu, 2016-03-24 at 17:50 +0100, Willy Tarreau wrote:
> > On Thu, Mar 24, 2016 at 09:33:11AM -0700, Eric Dumazet wrote:
> > > > --- a/net/ipv4/inet_hashtables.c
> > > > +++ b/net/ipv4/inet_hashtables.c
> > > > @@ -189,6 +189,8 @@ static inline int compute_score(struct sock *sk, 
> > > > struct net *net,
> > > > return -1;
> > > > score += 4;
> > > > }
> > > > +   if (sk->sk_reuseport)
> > > > +   score++;
> > > 
> > > This wont work with BPF
> > > 
> > > > if (sk->sk_incoming_cpu == raw_smp_processor_id())
> > > > score++;
> > > 
> > > This one does not work either with BPF
> > 
> > But this *is* in 4.5. Does this mean that this part doesn't work anymore or
> > just that it's not usable in conjunction with BPF ? In this case I'm less
> > worried, because it would mean that we have a solution for non-BPF aware
> > applications and that BPF-aware applications can simply use BPF.
> > 
> 
> BPF can implement the CPU choice/pref itself. It has everything needed.

Well I don't need the CPU choice, it was already there, it's not my code,
I only need the ability for an independant process to stop receiving new
connections without altering the other processes nor dropping some of these
connections.

In fact initially I didn't even need anything related to incoming connection
load-balancing, just the ability to start a new process without stopping the
old one, as it used to work in 2.2 and for which I used to keep a patch in
2.4 and 2.6. When SO_REUSEPORT was reintroduced in 3.9, that solved the issue
and some users started to complain that between the old and the new processes,
some connections were lost. Hence the proposal above. Since it's not about
load distribution and that processes are totally independant, I don't see
well how to (ab)use BPF to achieve this.

The pattern is :

  t0 : unprivileged processes 1 and 2 are listening to the same port
   (sock1@pid1) (sock2@pid2)
   <-- listening -->

  t1 : new processes are started to replace the old ones
   (sock1@pid1) (sock2@pid2) (sock3@pid3) (sock4@pid4)
   <-- listening --> <-- listening -->

  t2 : new processes signal the old ones they must stop
   (sock1@pid1) (sock2@pid2) (sock3@pid3) (sock4@pid4)
   <--- draining --> <-- listening -->

  t3 : pids 1 and 2 have finished, they go away
 (sock3@pid3) (sock4@pid4)
<-- gone ->  <-- listening -->

> >   - it seems to me that for BPF to be usable on process shutting down, we'd
> > need to have some form of central knowledge if the goal is to redefine
> > how to distribute the load. In my case there are multiple independant
> > processes forked on startup, so it's unclear to me how each of them 
> > could
> > reconfigure BPF when shutting down without risking to break the other 
> > ones.
> >   - the doc makes me believe that BPF would require privileges to be unset, 
> > so
> > that would not be compatible with a process shutting down which has 
> > already
> > dropped its privileges after startup, but I could be wrong.
> > 
> > Thanks for your help on this,
> > Willy
> > 
> 
> The point is : BPF is the way to go, because it is expandable.

OK so this means we have to find a way to expand it to allow an individual
non-privileged process to change the distribution algorithm without impacting
other processes.

I need to discover it better to find what can be done, but unfortunately at
this point the sole principle makes me think of a level of complexity that
doesn't seem obvious to solve at all :-/

Regards,
Willy


Re: [PATCH 1/1] net: Add SO_REUSEPORT_LISTEN_OFF socket option as drain mode

2016-03-24 Thread Willy Tarreau
On Thu, Mar 24, 2016 at 07:00:11PM +0100, Willy Tarreau wrote:
> Since it's not about
> load distribution and that processes are totally independant, I don't see
> well how to (ab)use BPF to achieve this.
> 
> The pattern is :
> 
>   t0 : unprivileged processes 1 and 2 are listening to the same port
>(sock1@pid1) (sock2@pid2)
><-- listening -->
> 
>   t1 : new processes are started to replace the old ones
>(sock1@pid1) (sock2@pid2) (sock3@pid3) (sock4@pid4)
><-- listening --> <-- listening -->
> 
>   t2 : new processes signal the old ones they must stop
>(sock1@pid1) (sock2@pid2) (sock3@pid3) (sock4@pid4)
><--- draining --> <-- listening -->
> 
>   t3 : pids 1 and 2 have finished, they go away
>  (sock3@pid3) (sock4@pid4)
> <-- gone ->  <-- listening -->
> 

Thinking a bit more about it, would it make sense to consider that in order
to address such a scenario, the only the new (still privileged) process
reconfigures the BPF to deliver traffic only to its own sockets and that
by doing so it will result in the old one not to receive any of it anymore ?
If so that could possibly be reasonably doable then. Ie: the old processes
don't have to do anything to stop receiving traffic.

Thanks,
Willy



Re: [PATCH 1/1] net: Add SO_REUSEPORT_LISTEN_OFF socket option as drain mode

2016-03-24 Thread Willy Tarreau
On Thu, Mar 24, 2016 at 11:20:49AM -0700, Tolga Ceylan wrote:
> I would appreciate a conceptual description on how this would work
> especially for a common scenario
> as described by Willy. My initial impression was that a coordinator
> (master) process takes this
> responsibility to adjust BPF filters as children come and go.

Indeed that would help, I don't know where to start from for now.

> Two popular software has similar use cases: nginx and haproxy. Another
> concern is with the
> introduction of BPF itself, should we expect a performance drop in
> these applications?

Knowing how picky Eric is about performance in such areas, I'm not
worried a single second about adopting something he recommends :-)
I just need to ensure it covers our users' needs. And maybe the
solution I mentionned in the other e-mail could work.

Willy



Re: [PATCH 1/1] net: Add SO_REUSEPORT_LISTEN_OFF socket option as drain mode

2016-03-24 Thread Willy Tarreau
Hi,

On Wed, Mar 23, 2016 at 10:10:06PM -0700, Tolga Ceylan wrote:
> I apologize for not properly following up on this. I had the
> impression that we did not want to merge my original patch and then I
> also noticed that it fails to keep the hash consistent. Recently, I
> read the follow ups on it as well as Willy's patch/proposals.
> 
> Is there any update on Willy's SO_REUSEPORT patch? IMHO, it solves the
> problem and it is simpler than adding new sock option.

no, Craig's changes were merged, and I haven't checked yet if my patch
needs to be rebased or still applies. Feel free to check it and resubmit
if you have time.

Best regards,
Willy



Re: [PATCH 3.2 085/115] veth: don???t modify ip_summed; doing so treats packets with bad checksums as good.

2016-04-30 Thread Willy Tarreau
On Sat, Apr 30, 2016 at 03:43:51PM -0700, Ben Greear wrote:
> On 04/30/2016 03:01 PM, Vijay Pandurangan wrote:
> > Consider:
> > 
> > - App A  sends out corrupt packets 50% of the time and discards inbound 
> > data.
(...)
> How can you make a generic app C know how to do this?  The path could be,
> for instance:
> 
> eth0 <-> user-space-A <-> vethA <-> vethB <-> { kernel routing logic } <-> 
> vethC <-> vethD <-> appC
> 
> There are no sockets on vethB, but it does need to have special behaviour to 
> elide
> csums.  Even if appC is hacked to know how to twiddle some thing on it's veth 
> port,
> mucking with vethD will have no effect on vethB.
> 
> With regard to your example above, why would A corrupt packets?  My guess:
> 
> 1)  It has bugs (so, fix the bugs, it could equally create incorrect data 
> with proper checksums,
> so just enabling checksumming adds no useful protection.)

I agree with Ben here, what he needs is the ability for userspace to be
trusted when *forwarding* a packet. Ideally you'd only want to receive
the csum status per packet on the packet socket and pass the same value
on the vethA interface, with this status being kept when the packet
reaches vethB.

If A purposely corrupts packet, it's A's problem. It's similar to designing
a NIC which intentionally corrupts packets and reports "checksum good".

The real issue is that in order to do things right, the userspace bridge
(here, "A") would really need to pass this status. In Ben's case as he
says, bad checksum packets are dropped before reaching A, so that
simplifies the process quite a bit and that might be what causes some
confusion, but ideally we'd rather have recvmsg() and sendmsg() with
these flags.

I faced the exact same issue 3 years ago when playing with netmap, it was
slow as hell because it would lose all checksum information when packets
were passing through userland, resulting in GRO/GSO etc being disabled,
and had to modify it to let userland preserve it. That's especially
important when you have to deal with possibly corrupted packets not yet
detected in the chain because the NIC did not validate their checksums.

Willy



Re: [PATCH] nf_conntrack: avoid kernel pointer value leak in slab name

2016-05-14 Thread Willy Tarreau
On Sat, May 14, 2016 at 02:31:04PM -0700, Linus Torvalds wrote:
> On Sat, May 14, 2016 at 11:24 AM, Linus Torvalds
>  wrote:
> >
> >
> > -   net->ct.slabname = kasprintf(GFP_KERNEL, "nf_conntrack_%p", net);
> > +   net->ct.slabname = kasprintf(GFP_KERNEL, "nf_conntrack_%llu",
> > +   (u64)atomic64_inc_return(_id));
> 
> Oh well. I suspect this is going to cause a new warning on alpha and
> ia64 and possibly others.
> 
> "u64" is indeed "unsigned long long" on x86 and many other
> architectures, but on alpga and ia64 it's just "unsigned long".
> 
> So that case should have been to "long long". I detest how there isn't
> a "64-bit size" printf specifier.

Why simply not cast the atomic to (unsigned long long) instead of (u64)
so that %llu always matches ?

Willy



Re: [PATCH] nf_conntrack: avoid kernel pointer value leak in slab name

2016-05-14 Thread Willy Tarreau
On Sat, May 14, 2016 at 03:21:31PM -0700, Linus Torvalds wrote:
> On Sat, May 14, 2016 at 2:33 PM, Willy Tarreau <w...@1wt.eu> wrote:
> >
> > Why simply not cast the atomic to (unsigned long long) instead of (u64)
> > so that %llu always matches ?
> 
> Yes, that fixes the problem. It's just more typing, and annoying. The
> fact that MS got it right while posix and gcc screwed it up is a bit
> embarrassing..

Well on the other hand, because of this MS still has problems porting
code from 32 to 64 bit. The real problem is that on both sides they
imagined that you needed only one way to specify your types. In practice
users generally want either the most optimal types for the architecture
because they don't care about the size (char, int, size_t, void *...)
or a specific size. This last one is annoying to use with printf format.

> If we ever start using __uint128_t, we'll have even more problems in
> this area. Oh well.

Definitely.

Willy



Re: [PATCH net] net: mvneta: set real interrupt per packet for tx_done

2016-07-05 Thread Willy Tarreau
Hi,

On Wed, Jul 06, 2016 at 04:18:58AM +0200, Marcin Wojtas wrote:
> From: Dmitri Epshtein <d...@marvell.com>
> 
> Commit aebea2ba0f74 ("net: mvneta: fix Tx interrupt delay") intended to
> set coalescing threshold to a value guaranteeing interrupt generation
> per each sent packet, so that buffers can be released with no delay.
> 
> In fact setting threshold to '1' was wrong, because it causes interrupt
> every two packets. According to the documentation a reason behind it is
> following - interrupt occurs once sent buffers counter reaches a value,
> which is higher than one specified in MVNETA_TXQ_SIZE_REG(q). This
> behavior was confirmed during tests. Also when testing the SoC working
> as a NAS device, better performance was observed with int-per-packet,
> as it strongly depends on the fact that all transmitted packets are
> released immediately.
> 
> This commit enables NETA controller work in interrupt per sent packet mode
> by setting coalescing threshold to 0.

We had a discussion about this in January 2015 and I thought I sent a patch
to change it but I can't find it, so I think it was only proposed to some
users for testing. I also remember that on more recent kernels by then
(>=3.13) we observed a slightly better performance with this value set to
zero.

Acked-by: Willy Tarreau <w...@1wt.eu>

Willy


[PATCH 3.10 257/319] net: sctp, forbid negative length

2017-02-05 Thread Willy Tarreau
From: Jiri Slaby <jsl...@suse.cz>

commit a4b8e71b05c27bae6bad3bdecddbc6b68a3ad8cf upstream.

Most of getsockopt handlers in net/sctp/socket.c check len against
sizeof some structure like:
if (len < sizeof(int))
return -EINVAL;

On the first look, the check seems to be correct. But since len is int
and sizeof returns size_t, int gets promoted to unsigned size_t too. So
the test returns false for negative lengths. Yes, (-1 < sizeof(long)) is
false.

Fix this in sctp by explicitly checking len < 0 before any getsockopt
handler is called.

Note that sctp_getsockopt_events already handled the negative case.
Since we added the < 0 check elsewhere, this one can be removed.

If not checked, this is the result:
UBSAN: Undefined behaviour in ../mm/page_alloc.c:2722:19
shift exponent 52 is too large for 32-bit type 'int'
CPU: 1 PID: 24535 Comm: syz-executor Not tainted 4.8.1-0-syzkaller #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
rel-1.9.1-0-gb3ef39f-prebuilt.qemu-project.org 04/01/2014
  88006d99f2a8 b2f7bdea 41b58ab3
 b4363c14 b2f7bcde 88006d99f2d0 88006d99f270
   0034 b5096422
Call Trace:
 [] ? __ubsan_handle_shift_out_of_bounds+0x29c/0x300
...
 [] ? kmalloc_order+0x24/0x90
 [] ? kmalloc_order_trace+0x24/0x220
 [] ? __kmalloc+0x330/0x540
 [] ? sctp_getsockopt_local_addrs+0x174/0xca0 [sctp]
 [] ? sctp_getsockopt+0x10d/0x1b0 [sctp]
 [] ? sock_common_getsockopt+0xb9/0x150
 [] ? SyS_getsockopt+0x1a5/0x270

Signed-off-by: Jiri Slaby <jsl...@suse.cz>
Cc: Vlad Yasevich <vyasev...@gmail.com>
Cc: Neil Horman <nhor...@tuxdriver.com>
Cc: "David S. Miller" <da...@davemloft.net>
Cc: linux-s...@vger.kernel.org
Cc: netdev@vger.kernel.org
Acked-by: Neil Horman <nhor...@tuxdriver.com>
Signed-off-by: David S. Miller <da...@davemloft.net>
Signed-off-by: Willy Tarreau <w...@1wt.eu>
---
 net/sctp/socket.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/net/sctp/socket.c b/net/sctp/socket.c
index bdc3fb6..86e7352 100644
--- a/net/sctp/socket.c
+++ b/net/sctp/socket.c
@@ -4259,7 +4259,7 @@ static int sctp_getsockopt_disable_fragments(struct sock 
*sk, int len,
 static int sctp_getsockopt_events(struct sock *sk, int len, char __user 
*optval,
  int __user *optlen)
 {
-   if (len <= 0)
+   if (len == 0)
return -EINVAL;
if (len > sizeof(struct sctp_event_subscribe))
len = sizeof(struct sctp_event_subscribe);
@@ -5770,6 +5770,9 @@ SCTP_STATIC int sctp_getsockopt(struct sock *sk, int 
level, int optname,
if (get_user(len, optlen))
return -EFAULT;
 
+   if (len < 0)
+   return -EINVAL;
+
sctp_lock_sock(sk);
 
switch (optname) {
-- 
2.8.0.rc2.1.gbe9624a



Re: [PATCH net-next 3/3] net/tcp-fastopen: Add new API support

2017-01-23 Thread Willy Tarreau
On Mon, Jan 23, 2017 at 11:01:21PM +0100, Willy Tarreau wrote:
> On Mon, Jan 23, 2017 at 10:37:32PM +0100, Willy Tarreau wrote:
> > On Mon, Jan 23, 2017 at 01:28:53PM -0800, Wei Wang wrote:
> > > Hi Willy,
> > > 
> > > True. If you call connect() multiple times on a socket which already has
> > > cookie without a write(), the second and onward connect() call will return
> > > EINPROGRESS.
> > > It is basically because the following code block in 
> > > __inet_stream_connect()
> > > can't distinguish if it is the first time connect() is called or not:
> > > 
> > > case SS_CONNECTING:
> > > if (inet_sk(sk)->defer_connect)  <- defer_connect will
> > > be 0 only after a write() is called
> > > err = -EINPROGRESS;
> > > else
> > > err = -EALREADY;
> > > /* Fall out of switch with err, set for this state */
> > > break;
> > 
> > Ah OK that totally makes sense, thanks for the explanation!
> > 
> > > I guess we can add some extra logic here to address this issue. So the
> > > second connect() and onwards will return EALREADY.
> 
> Thinking about it a bit more, I really think it would make more
> sense to return -EISCONN here if we want to match the semantics
> of a connect() returning zero on the first call. This way the
> caller knows it can write whenever it wants and can disable
> write polling until needed.
> 
> I'm currently rebuilding a kernel with this change to see if it
> behaves any better :
> 
> -err = -EINPROGRESS;
> +err = -EISCONN;

OK so obviously it didn't work since sendmsg() goes there as well.

But that made me realize that there really are 3 states, not 2 :

  - after connect() and before sendmsg() :
 defer_accept = 1, we want to lie to userland and pretend we're
 connected so that userland can call send(). A connect() must
 return either zero or -EISCONN.

  - during first sendmsg(), still connecting :
 the connection is in progress, EINPROGRESS must be returned to
 the first sendmsg().

  - after the first sendmsg() :
 defer_accept = 0 ; connect() must return -EALREADY. We want to
 return real socket states from now on.

Thus I modified defer_accept to take two bits to represent the extra
state we need to indicate the transition. Now sendmsg() upgrades
defer_accept from 1 to 2 before calling __inet_stream_connect(), which
then knows it must return EINPROGRESS to sendmsg().

This way we correctly cover all these situations. Even if we call
connect() again after the first connect() attempt it still matches
the first result :

accept4(7, {sa_family=AF_INET, sin_port=htons(36860), 
sin_addr=inet_addr("192.168.0.176")}, [128->16], SOCK_NONBLOCK) = 1
setsockopt(1, SOL_TCP, TCP_NODELAY, [1], 4) = 0
accept4(7, 0x7ffc2282fcb0, [128], SOCK_NONBLOCK) = -1 EAGAIN (Resource 
temporarily unavailable)
recvfrom(1, 0x7b53a4, 7006, 0, NULL, NULL) = -1 EAGAIN (Resource temporarily 
unavailable)
socket(AF_INET, SOCK_STREAM, IPPROTO_TCP) = 2
fcntl(2, F_SETFL, O_RDONLY|O_NONBLOCK)  = 0
setsockopt(2, SOL_TCP, TCP_NODELAY, [1], 4) = 0
setsockopt(2, SOL_TCP, 0x1e /* TCP_??? */, [1], 4) = 0
connect(2, {sa_family=AF_INET, sin_port=htons(8001), 
sin_addr=inet_addr("127.0.0.1")}, 16) = 0
epoll_ctl(0, EPOLL_CTL_ADD, 1, {EPOLLIN|EPOLLRDHUP, {u32=1, u64=1}}) = 0
epoll_wait(0, [], 200, 0)   = 0
connect(2, {sa_family=AF_INET, sin_port=htons(8001), 
sin_addr=inet_addr("127.0.0.1")}, 16) = -1 EISCONN (Transport endpoint is 
already connected)
epoll_wait(0, [], 200, 0)   = 0
recvfrom(2, 0x7b53a4, 8030, 0, NULL, NULL) = -1 EAGAIN (Resource temporarily 
unavailable)
epoll_ctl(0, EPOLL_CTL_ADD, 2, {EPOLLIN|EPOLLRDHUP, {u32=2, u64=2}}) = 0
epoll_wait(0, [], 200, 1000)= 0
epoll_wait(0, [], 200, 1000)= 0
epoll_wait(0, [], 200, 1000)= 0
epoll_wait(0, [{EPOLLIN, {u32=1, u64=1}}], 200, 1000) = 1
recvfrom(1, "GET / HTTP/1.1\r\n", 8030, 0, NULL, NULL) = 16
sendto(2, "GET / HTTP/1.1\r\n", 16, MSG_DONTWAIT|MSG_NOSIGNAL, NULL, 0) = 16
epoll_wait(0, [{EPOLLIN, {u32=1, u64=1}}], 200, 1000) = 1
recvfrom(1, "\r\n", 8030, 0, NULL, NULL) = 2
sendto(2, "\r\n", 2, MSG_DONTWAIT|MSG_NOSIGNAL, NULL, 0) = 2
epoll_wait(0, [{EPOLLIN|EPOLLRDHUP, {u32=2, u64=2}}], 200, 1000) = 1
recvfrom(2, "HTTP/1.1 302 Found\r\nCache-Contro"..., 8030, 0, NULL, NULL) = 98
sendto(1, "HTTP/1.1 302 Found\r\nCache-Contro"..., 98, 
MSG_DONTWAIT|MSG_NOSIGNAL|MSG_MORE, NULL, 0) = 98
shutdown(1, SHUT_WR)= 0
epoll_ctl(0, EPOLL_CTL_DEL, 2, 0x6ff55c) = 0
epoll_wait(0, [{EPOLLIN|EPOLLHUP|

Re: [PATCH net-next 3/3] net/tcp-fastopen: Add new API support

2017-01-23 Thread Willy Tarreau
On Mon, Jan 23, 2017 at 01:28:53PM -0800, Wei Wang wrote:
> Hi Willy,
> 
> True. If you call connect() multiple times on a socket which already has
> cookie without a write(), the second and onward connect() call will return
> EINPROGRESS.
> It is basically because the following code block in __inet_stream_connect()
> can't distinguish if it is the first time connect() is called or not:
> 
> case SS_CONNECTING:
> if (inet_sk(sk)->defer_connect)  <- defer_connect will
> be 0 only after a write() is called
> err = -EINPROGRESS;
> else
> err = -EALREADY;
> /* Fall out of switch with err, set for this state */
> break;

Ah OK that totally makes sense, thanks for the explanation!

> I guess we can add some extra logic here to address this issue. So the
> second connect() and onwards will return EALREADY.

If that's possible at little cost it would be nice, because your patch
makes it so easy to enable TFO on outgoing connections now that I
expect many people will blindly run the setsockopt() before connect().

Do not hesitate to ask me to run some tests. While 4 years ago it was
not easy, here it's very simple for me. By the way I'm seeing an ~10%
performance increase on haproxy by enabling this, it's really cool!

Thanks,
Willy


Re: [PATCH net-next 3/3] net/tcp-fastopen: Add new API support

2017-01-23 Thread Willy Tarreau
On Mon, Jan 23, 2017 at 10:37:32PM +0100, Willy Tarreau wrote:
> On Mon, Jan 23, 2017 at 01:28:53PM -0800, Wei Wang wrote:
> > Hi Willy,
> > 
> > True. If you call connect() multiple times on a socket which already has
> > cookie without a write(), the second and onward connect() call will return
> > EINPROGRESS.
> > It is basically because the following code block in __inet_stream_connect()
> > can't distinguish if it is the first time connect() is called or not:
> > 
> > case SS_CONNECTING:
> > if (inet_sk(sk)->defer_connect)  <- defer_connect will
> > be 0 only after a write() is called
> > err = -EINPROGRESS;
> > else
> > err = -EALREADY;
> > /* Fall out of switch with err, set for this state */
> > break;
> 
> Ah OK that totally makes sense, thanks for the explanation!
> 
> > I guess we can add some extra logic here to address this issue. So the
> > second connect() and onwards will return EALREADY.

Thinking about it a bit more, I really think it would make more
sense to return -EISCONN here if we want to match the semantics
of a connect() returning zero on the first call. This way the
caller knows it can write whenever it wants and can disable
write polling until needed.

I'm currently rebuilding a kernel with this change to see if it
behaves any better :

-err = -EINPROGRESS;
+err = -EISCONN;

I'll keep you updated.

Thanks,
Willy


Re: [PATCH net-next 3/3] net/tcp-fastopen: Add new API support

2017-01-23 Thread Willy Tarreau
Hi Wei,

first, thanks a lot for doing this, it's really awesome!

I'm testing it on 4.9 on haproxy and I met a corner case : when I
perform a connect() to a server and I have nothing to send, upon
POLLOUT notification since I have nothing to send I simply probe the
connection using connect() again to see if it returns EISCONN or
anything else. But here now I'm seeing EINPROGRESS loops.

To illustrate this, here's what I'm doing :

:8000  :8001
  [ client ] ---> [ proxy ] ---> [ server ]

The proxy is configured to enable TFO to the server and the server
supports TFO as well. The proxy and the server are in fact two proxy
instances in haproxy running in the same process for convenience.

When I already have data to send here's what I'm seeing (so it works fine) :

06:29:16.861190 accept4(7, {sa_family=AF_INET, sin_port=htons(33986), 
sin_addr=inet_addr("192.168.0.176")}, [128->16], SOCK
_NONBLOCK) = 9
06:29:16.861277 setsockopt(9, SOL_TCP, TCP_NODELAY, [1], 4) = 0
06:29:16.861342 accept4(7, 0x7ffd0d794430, [128], SOCK_NONBLOCK) = -1 EAGAIN 
(Resource temporarily unavailable)
06:29:16.861417 recvfrom(9, "BLAH\n", 7006, 0, NULL, NULL) = 5
06:29:16.861509 recvfrom(9, 0x2619329, 7001, 0, NULL, NULL) = -1 EAGAIN 
(Resource temporarily unavailable)
06:29:16.861657 socket(AF_INET, SOCK_STREAM, IPPROTO_TCP) = 10
06:29:16.861730 fcntl(10, F_SETFL, O_RDONLY|O_NONBLOCK) = 0
06:29:16.861779 setsockopt(10, SOL_TCP, TCP_NODELAY, [1], 4) = 0
06:29:16.861882 setsockopt(10, SOL_TCP, 0x1e /* TCP_??? */, [1], 4) = 0
06:29:16.861942 connect(10, {sa_family=AF_INET, sin_port=htons(8001), 
sin_addr=inet_addr("127.0.0.1")}, 16) = 0
06:29:16.862015 epoll_ctl(3, EPOLL_CTL_ADD, 9, {EPOLLIN|EPOLLRDHUP, {u32=9, 
u64=9}}) = 0
06:29:16.862072 epoll_wait(3, [], 200, 0) = 0
06:29:16.862126 sendto(10, "BLAH\n", 5, MSG_DONTWAIT|MSG_NOSIGNAL, NULL, 0) = 5
06:29:16.862281 epoll_wait(3, [{EPOLLIN, {u32=8, u64=8}}], 200, 0) = 1
06:29:16.862334 recvfrom(10, 0x26173a4, 8030, 0, NULL, NULL) = -1 EAGAIN 
(Resource temporarily unavailable)
06:29:16.862385 accept4(8, {sa_family=AF_INET, sin_port=htons(46760), 
sin_addr=inet_addr("127.0.0.1")}, [128->16], SOCK_NON
BLOCK) = 11
06:29:16.862450 setsockopt(11, SOL_TCP, TCP_NODELAY, [1], 4) = 0
06:29:16.862504 accept4(8, 0x7ffd0d794430, [128], SOCK_NONBLOCK) = -1 EAGAIN 
(Resource temporarily unavailable)
06:29:16.862564 recvfrom(11, "BLAH\n", 7006, 0, NULL, NULL) = 5


When I don't have data, here's what I'm seeing :

06:29:24.047801 accept4(7, {sa_family=AF_INET, sin_port=htons(33988), 
sin_addr=inet_addr("192.168.0.176")}, [128->16], SOCK
_NONBLOCK) = 9
06:29:24.047899 setsockopt(9, SOL_TCP, TCP_NODELAY, [1], 4) = 0
06:29:24.047966 accept4(7, 0x7ffdedb2c7f0, [128], SOCK_NONBLOCK) = -1 EAGAIN 
(Resource temporarily unavailable)
06:29:24.048043 recvfrom(9, 0xd31324, 7006, 0, NULL, NULL) = -1 EAGAIN 
(Resource temporarily unavailable)
06:29:24.048281 socket(AF_INET, SOCK_STREAM, IPPROTO_TCP) = 10
06:29:24.048342 fcntl(10, F_SETFL, O_RDONLY|O_NONBLOCK) = 0
06:29:24.048392 setsockopt(10, SOL_TCP, TCP_NODELAY, [1], 4) = 0
06:29:24.048447 setsockopt(10, SOL_TCP, 0x1e /* TCP_??? */, [1], 4) = 0
06:29:24.048508 connect(10, {sa_family=AF_INET, sin_port=htons(8001), 
sin_addr=inet_addr("127.0.0.1")}, 16) = 0
06:29:24.048593 epoll_ctl(3, EPOLL_CTL_ADD, 9, {EPOLLIN|EPOLLRDHUP, {u32=9, 
u64=9}}) = 0
06:29:24.048651 epoll_wait(3, [], 200, 0) = 0
06:29:24.048699 getsockopt(10, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
06:29:24.048751 connect(10, {sa_family=AF_INET, sin_port=htons(8001), 
sin_addr=inet_addr("127.0.0.1")}, 16) = -1 EINPROGRES
S (Operation now in progress)
06:29:24.048808 epoll_ctl(3, EPOLL_CTL_ADD, 10, {EPOLLOUT, {u32=10, u64=10}}) = 0
06:29:24.048860 epoll_wait(3, [{EPOLLOUT, {u32=10, u64=10}}], 200, 1000) = 1
06:29:24.048912 getsockopt(10, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
06:29:24.048963 connect(10, {sa_family=AF_INET, sin_port=htons(8001), 
sin_addr=inet_addr("127.0.0.1")}, 16) = -1 EINPROGRES
S (Operation now in progress)
06:29:24.049018 epoll_wait(3, [{EPOLLOUT, {u32=10, u64=10}}], 200, 1000) = 1
06:29:24.049072 getsockopt(10, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
06:29:24.049122 connect(10, {sa_family=AF_INET, sin_port=htons(8001), 
sin_addr=inet_addr("127.0.0.1")}, 16) = -1 EINPROGRES
S (Operation now in progress)


I theorically understand why but I think we have something wrong here
and instead we should have -1 EISCONN (to pretend the connection is
established) or return EALREADY (to mention that a previous request was
already made and that we're waiting for the next step).

While I can instrument my connect() *not* to use TFO when connecting
without any pending data, I don't always know this (eg when I use
openssl and cross fingers so that it decides to quickly send something
on the next round).

I think it's easy to fall into this tricky corner case and am wondering
what can be done about it. Does the EINPROGRESS happen only because there
is no cookie yet ? If so, 

Re: [PATCH net-next 3/3] net/tcp-fastopen: Add new API support

2017-01-23 Thread Willy Tarreau
On Mon, Jan 23, 2017 at 02:57:31PM -0800, Wei Wang wrote:
> Yes. That seems to be a valid fix to it.
> Let me try it with my existing test cases as well to see if it works for
> all scenarios I have.

Perfect. Note that since the state 2 is transient I initially thought
about abusing the flags passed to __inet_stream_connect() to say "hey
I'm sendmsg() and not connect()" but that would have been a ugly hack
while here we really have the 3 socket states represented eventhough
one changes twice around a call.

Thanks,
Willy


Re: [PATCH net-next 3/3] net/tcp-fastopen: Add new API support

2017-01-23 Thread Willy Tarreau
On Mon, Jan 23, 2017 at 10:59:22AM -0800, Wei Wang wrote:
> This patch adds a new socket option, TCP_FASTOPEN_CONNECT, as an
> alternative way to perform Fast Open on the active side (client).

Wei, I think that nothing prevents from reusin the original TCP_FASTOPEN
sockopt instead of adding a new one. The original one does this :

case TCP_FASTOPEN:
if (val >= 0 && ((1 << sk->sk_state) & (TCPF_CLOSE |
TCPF_LISTEN))) {
tcp_fastopen_init_key_once(true);

fastopen_queue_tune(sk, val);
} else {
err = -EINVAL;
}
break;

and your new option does this :

case TCP_FASTOPEN_CONNECT:
if (val > 1 || val < 0) {
err = -EINVAL;
} else if (sysctl_tcp_fastopen & TFO_CLIENT_ENABLE) {
if (sk->sk_state == TCP_CLOSE)
tp->fastopen_connect = val;
else
err = -EINVAL;
} else {
err = -EOPNOTSUPP;
}
break;

Now if we compare :
  - the value ranges are the same (0,1)
  - tcp_fastopen_init_key_once() only performs an initialization once
  - fastopen_queue_tune() only sets sk->max_qlen based on the backlog,
this has no effect on an outgoing connection ;
  - tp->fastopen_connect can be applied to a listening socket without
side effect.

Thus I think we can merge them this way :

case TCP_FASTOPEN:
if (val >= 0) {
if ((sysctl_tcp_fastopen & TFO_CLIENT_ENABLE) &&
(sk->sk_state == TCP_CLOSE)
tp->fastopen_connect = val;

if ((1 << sk->sk_state) & (TCPF_CLOSE | TCPF_LISTEN))) {
tcp_fastopen_init_key_once(true);
fastopen_queue_tune(sk, val);
}
} else {
err = -EINVAL;
}
break;

And for the userland, the API is even simpler because we can use the
same TCP_FASTOPEN sockopt regardless of the socket direction. Also,
I don't know if TCP_FASTOPEN is supported on simultaneous connect,
but at least if it works it would be easier to understand this way.

Do you think there's a compelling reason for adding a new option or
are you interested in a small patch to perform the change above ?

Regards,
Willy


  1   2   >