Re: PROBLEM: DoS Attack on Fragment Cache

2021-04-17 Thread Willy Tarreau
On Sat, Apr 17, 2021 at 10:26:30PM -0400, Matt Corallo wrote:
> Sure, there are better ways to handle the reassembly cache overflowing, but
> that is pretty unrelated to the fact that waiting 30 full seconds for a
> fragment to come in doesn't really make sense in today's networks (the 30
> second delay that is used today appears to even be higher than RFC 791
> suggested in 1981!).

Not exactly, because you're forgetting the TTL here. With most hosts
sending an initial TTL around 64, after crossing 10-15 hops it's still
around 50, so that would result in ~50 seconds by default, even according
to the 40-year-old RFC 791. The 15s there was the absolute minimum. While
I do agree that we shouldn't keep them that long nowadays, we can't go
too low without risking breaking some slow transmission stacks (SLIP/PPP
over modems for example). In addition, even dividing that by 3 still leaves
it trivially DoSable.
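
For reference, the RFC 791 example reassembly procedure keeps the timer at
the maximum of its current value and the TTL of each arriving fragment,
starting from a 15s lower bound. A minimal sketch of that rule (illustrative
only, not the kernel's code):

#define REASM_TIMER_LB	15	/* seconds, RFC 791 lower bound */

/* raise the reassembly timer to the fragment's TTL if that TTL is larger */
static unsigned int reasm_timer_update(unsigned int timer, unsigned int ttl)
{
	return ttl > timer ? ttl : timer;
}

/* e.g. starting from REASM_TIMER_LB, a fragment arriving with TTL=50
 * pushes the timer to ~50s, hence the figure above.
 */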

> You get a lot more bang for your buck if you don't wait
> around so long (or we could restructure things to kick out the oldest
> fragments, but that is a lot more work, and probably extra indexes that just
> aren't worth it).

Kicking out the oldest ones is a bad approach in a system made only of
independent elements, because it tends to result in a lot of damage once
all of them behave similarly. I.e. if you need to kick out an old entry
in valid traffic, it's because you do need to wait that long, and if all
datagrams need to wait that long, then new datagrams systematically
prevent the oldest one from being reassembled, and none gets reassembled.
With a random approach, at least your success ratio converges towards 1/e
(i.e. ~36%), which is better.
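
To illustrate the difference between the two policies, here is a rough
userspace sketch of "evict a random entry when the cache is full"
(hypothetical structures, nothing to do with the kernel's actual frag code):

#include <stdlib.h>

struct frag_entry;			/* opaque here */

struct frag_cache {
	struct frag_entry *slot[64];	/* arbitrary capacity */
	unsigned int nr;		/* slots currently in use */
};

/* Pick a uniformly random victim among the in-use slots (caller ensures
 * nr > 0, i.e. the cache is full). Any pending datagram then survives one
 * insertion with probability (1 - 1/nr), which is what makes the success
 * ratio converge towards 1/e under sustained pressure, whereas always
 * evicting the oldest entry kills precisely the one that still needed time.
 */
static unsigned int pick_random_victim(const struct frag_cache *fc)
{
	return (unsigned int)rand() % fc->nr;
}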

Willy


Re: PROBLEM: DoS Attack on Fragment Cache

2021-04-17 Thread Willy Tarreau
On Sat, Apr 17, 2021 at 12:42:39AM -0700, Keyu Man wrote:
> How about at least allow the existing queue to finish? Currently a tiny new
> fragment would potentially invalid all previous fragments by letting them
> timeout without allowing the fragments to come in to finish the assembly.

Because this is exactly the principle on which such attacks are built: reserve
resources by claiming that you'll send everything, so that others can't make
use of the resources that are reserved for you. The best solution is precisely
*not* to wait for anyone to finish, hence *not* to reserve valuable
resources that become unusable by others.

Willy


Re: PROBLEM: DoS Attack on Fragment Cache

2021-04-17 Thread Willy Tarreau
On Sat, Apr 17, 2021 at 06:44:40AM +0200, Eric Dumazet wrote:
> On Sat, Apr 17, 2021 at 2:31 AM David Ahern  wrote:
> >
> > [ cc author of 648700f76b03b7e8149d13cc2bdb3355035258a9 ]
> 
> I think this has been discussed already. There is no strategy that
> makes IP reassembly units immune to DDOS attacks.

Having tried to deal with this in the past as well, I agree with
this conclusion, which is also another good example of why fragments
should really be avoided as much as possible over hostile networks.

However, I also found that randomly dropping previous entries is the
approach which seems to offer legitimate traffic the best statistical
chance of still working under attack (albeit really poorly,
considering that any lost fragment requires retransmission of the
whole series). In this case the chance for a packet to be successfully
reassembled would vary in proportion to the inverse of its number of
fragments, which reasonably limits the impact of attacks (without being
an ultimate solution, of course).

> We added rb-tree and sysctls to let admins choose to use GB of RAM if
> they really care.

I agree that for those who care, the real solution is to make sure they
can store all the traffic they receive during a reassembly period.
Legitimate traffic mostly reassembles quickly so keeping 1 second of
traffic at 10 Gbps is only 1.25 GB of RAM after all...
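(For the arithmetic: 10 Gbit/s divided by 8 bits per byte is 1.25 GB per
second, so each second of reassembly window costs at most 1.25 GB of
buffering.)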

Willy


Re: [PATCH] Revert "macb: support the two tx descriptors on at91rm9200"

2021-04-07 Thread Willy Tarreau
Hi Daniel,

On Tue, Apr 06, 2021 at 07:04:58PM +0900, Daniel Palmer wrote:
> Hi Willy,
> 
> I've been messing with the SSD202D (sibling of the MSC313E) recently
> and the ethernet performance was awful.
> I remembered this revert and reverted it and it makes the ethernet
> work pretty well.

OK, that's reassuring, at least they use the same IP blocks.

> [   15.213358] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
> [   15.245235] pre0 41001fb8, post0 30001fb0 -> IDLETSR clear
> [   15.249291] pre1 30001f90, post1 20001d90 -> FIFOIDLE1 clear
> [   15.253331] pre2 20001d90, post2 10001990 -> FIFOIDLE2 clear
> [   15.257385] pre3 10001990, post3 1190 -> FIFOIDLE3 clear
> [   15.261435] pre4 1190, post4 f191 -> FIFOIDLE4 clear, OVR set
> [   15.265485] pre5 f190, post5 f191 -> OVR set
> [   15.269535] pre6 f190, post6 e181 -> OVR set, BNQ clear
> 
> There seems to be a FIFO empty counter in the top of the register but
> this is totally undocumented.

Yes that's extremely visible. I did notice some apparently random
values at the top of the register as well without being able to
determine what they were for, because I was pretty much convinced I
was facing a 2-deep queue only. You had a good idea to force it to
duplicate packets :-)

> There are two new status bits TBNQ and FBNQ at bits 7 and 8. I have no
> idea what they mean.

Maybe they're related to tx queue empty / tx queue full. Just guessing.
Since all these bits tend not to be reset until written to, I got confused
a few times trying to understand what they indicated.

> Bits 9 through 12 are some new status bits that seem to show if a FIFO
> slot is inuse.

Maybe indeed.

> I can't find any mention of these bits anywhere except the header of
> the vendor driver so I think these are specific to MStar's macb.
> The interesting part though is that BNQ does not get cleared until
> multiple frames have been pushed in and after OVR is set.
> I think this is what breaks your code when it runs on the MSC313E
> version of MACB.

Yes, that's very possible, because the behavior was inconsistent with
what was documented for older chips. Also, BNQ changed its meaning on
more recent chips, so it may very well have yet another meaning here.

> Anyhow. I'm working on a version of your patch that should work with
> both the at91rm9200 and the MSC313E.

Cool! Thanks for letting me know. If you need me to run some tests, let
me know (just don't expect a quick turnaround these days, but I will
definitely test).

Cheers,
Willy


Re: [PATCH] atl1c: optimize rx loop

2021-03-18 Thread Willy Tarreau
On Fri, Mar 19, 2021 at 12:04:47PM +0800, Sieng Piaw Liew wrote:
> Remove this trivial bit of inefficiency from the rx receive loop,
> results in increase of a few Mbps in iperf3. Tested on Intel Core2
> platform.
> 
> Signed-off-by: Sieng Piaw Liew 
> ---
>  drivers/net/ethernet/atheros/atl1c/atl1c_main.c | 4 +---
>  1 file changed, 1 insertion(+), 3 deletions(-)
> 
> diff --git a/drivers/net/ethernet/atheros/atl1c/atl1c_main.c 
> b/drivers/net/ethernet/atheros/atl1c/atl1c_main.c
> index 3f65f2b370c5..b995f9a0479c 100644
> --- a/drivers/net/ethernet/atheros/atl1c/atl1c_main.c
> +++ b/drivers/net/ethernet/atheros/atl1c/atl1c_main.c
> @@ -1796,9 +1796,7 @@ static void atl1c_clean_rx_irq(struct atl1c_adapter 
> *adapter,
>   struct atl1c_recv_ret_status *rrs;
>   struct atl1c_buffer *buffer_info;
>  
> - while (1) {
> - if (*work_done >= work_to_do)
> - break;
> + while (*work_done < work_to_do) {

It should not change anything, or only depending on the compiler's
optimizations, and should not result in a measurable difference because
what it does is exactly the same. Have you really compared the compiled
output code to explain the difference? I strongly suspect you'll find no
difference at all.

Thus for me it's certainly not an optimization; it could however be
qualified as a cleanup that improves code readability.

Willy


Re: TIOCOUTQ implementation for sockets vs. tty

2021-03-10 Thread Willy Tarreau
Hi,

On Wed, Mar 10, 2021 at 07:16:34PM +0100, Henneberg - Systemdesign wrote:
> Hi,
> 
> I have a question regarding the implementation of ioctl TIOCOUTQ for
> various sockets compared to the tty implementation.
> 
> For several sockets, e. g. AF_BLUETOOTH it is done like this
> 
> af_bluetooth.c:
> case TIOCOUTQ:
>   if (sk->sk_state == BT_LISTEN)
>   return -EINVAL;
> 
>   amount = sk->sk_sndbuf - sk_wmem_alloc_get(sk);
>   if (amount < 0)
>   amount = 0;
>   err = put_user(amount, (int __user *)arg);
>   break;
> 
> so the ioctl returns the available space in the send queue if I
> understand the code correctly (this is also what I observed from tests).
> 
> The tty does this:
> 
> n_tty.c:
> case TIOCOUTQ:
>   return put_user(tty_chars_in_buffer(tty), (int __user *) arg);
> 
> so it returns the used space in the send queue. This is also what I
> would expect from the manpage description.
> 
> Is this mismatch intentional?

At least both man pages (tty_ioctl and tcp(7)) mention that TIOCOUTQ
should return the number of bytes in the queue.

What I suspect for sockets is that sk_sndbuf grows with allocations
and that sk_wmem_alloc_get() in fact returns the number of unused
allocations, thus the difference would be the amount queued. But I
could be wrong, and I would tend to read the code the same way as you
did.
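
For anyone wanting to observe the difference from userspace, a trivial test
could look like this (illustrative sketch only; it assumes an already-open
descriptor for the socket or tty you want to inspect):

#include <stdio.h>
#include <sys/ioctl.h>

/* Query TIOCOUTQ on any fd and print the result. On a tty the value is
 * the number of bytes still queued for transmission; with the
 * AF_BLUETOOTH code quoted above it is the free space in the send
 * buffer instead.
 */
static void show_outq(const char *what, int fd)
{
	int amount = 0;

	if (ioctl(fd, TIOCOUTQ, &amount) == 0)
		printf("%s: TIOCOUTQ = %d\n", what, amount);
	else
		perror(what);
}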

Willy


Re: macb broken on HiFive Unleashed

2021-03-10 Thread Willy Tarreau
Hi,

On Tue, Mar 09, 2021 at 08:55:10AM +, claudiu.bez...@microchip.com wrote:
> Hi Andreas,
> 
> On 08.03.2021 21:30, Andreas Schwab wrote:
> > 
> > One of the changes to the macb driver between 5.10 and 5.11 has broken
> > the SiFive HiFive Unleashed.  These are the last messages before the
> > system hangs:
> > 
> > [   12.468674] libphy: Fixed MDIO Bus: probed
> > [   12.746518] macb 1009.ethernet: Registered clk switch 
> > 'sifive-gemgxl-mgmt'
> > [   12.753119] macb 1009.ethernet: GEM doesn't support hardware ptp.
> > [   12.760178] libphy: MACB_mii_bus: probed
> > [   12.881792] MACsec IEEE 802.1AE
> > [   12.944426] macb 1009.ethernet eth0: Cadence GEM rev 0x10070109 at 
> > 0x1009 irq 16 (70:b3:d5:92:f1:07)
> > 
> 
> I don't have a SiFive HiFive Unleashed to investigate this. Can you check
> if reverting commits on macb driver b/w 5.10 and 5.11 solves your issues:
> 
> git log --oneline v5.10..v5.11 -- drivers/net/ethernet/cadence/
> 1d0d561ad1d7 net: macb: Correct usage of MACB_CAPS_CLK_HW_CHG flag
> 1d608d2e0d51 Revert "macb: support the two tx descriptors on at91rm9200"
> 700d566e8171 net: macb: add support for sama7g5 emac interface
> ec771de654e4 net: macb: add support for sama7g5 gem interface
> f4de93f03ed8 net: macb: unprepare clocks in case of failure
> 38493da4e6a8 net: macb: add function to disable all macb clocks
> daafa1d33cc9 net: macb: add capability to not set the clock rate
> edac63861db7 net: macb: add userio bits as platform configuration
> 9e6cad531c9d net: macb: Fix passing zero to 'PTR_ERR'
> 0012eeb370f8 net: macb: fix NULL dereference due to no pcs_config method
> e4e143e26ce8 net: macb: add support for high speed interface

In addition, it's worth mentioning that the driver has multiple rx/tx/irq
functions depending on the platforms or chip variants, and that based on
this it should be easy to further reduce this list.

Just my two cents,
Willy


[PATCH] Revert "macb: support the two tx descriptors on at91rm9200"

2020-12-09 Thread Willy Tarreau
This reverts commit 0a4e9ce17ba77847e5a9f87eed3c0ba46e3f82eb.

The code was developed and tested on an MSC313E SoC, which seems to be
half-way between the AT91RM9200 and the AT91SAM9260 in that it supports
both the 2-descriptors mode and a Tx ring.

It turns out that after the code was merged I noticed that the
controller would sometimes lock up, and only when dealing with sustained
bidirectional transfers, in which case it would report a Tx overrun
condition right after having reported being ready, and would stop sending
even after the status is cleared (a down/up cycle fixes it though).

After adding lots of traces I couldn't spot a sequence pattern allowing
me to predict when this situation would happen. The chip comes with no
documentation, and other bits are often reported with no conclusive
pattern either.

It is possible that my change is wrong, just like it is possible that
the controller on the chip is bogus or at least unpredictable based on
existing docs from other chips. I do not have an RM9200 at hand to test
at the moment, and a few tests run on a more recent 9G20 indicate that
this code path cannot be used there to test the code on a third platform.

Since the MSC313E works fine in the single-descriptor mode, and since
people using the old RM9200 very likely favor stability over performance,
it is better to revert this patch until we can test it on the original
platform this part of the driver was written for. Note that the reverted
patch was actually tested on the MSC313E.

Cc: Nicolas Ferre 
Cc: Claudiu Beznea 
Cc: Daniel Palmer 
Cc: Alexandre Belloni 
Link: https://lore.kernel.org/netdev/20201206092041.ga10...@1wt.eu/
Signed-off-by: Willy Tarreau 
---
 drivers/net/ethernet/cadence/macb.h  |  2 --
 drivers/net/ethernet/cadence/macb_main.c | 46 ++--
 2 files changed, 8 insertions(+), 40 deletions(-)

diff --git a/drivers/net/ethernet/cadence/macb.h 
b/drivers/net/ethernet/cadence/macb.h
index 85be045..9fbdf53 100644
--- a/drivers/net/ethernet/cadence/macb.h
+++ b/drivers/net/ethernet/cadence/macb.h
@@ -1216,8 +1216,6 @@ struct macb {
 
/* AT91RM9200 transmit queue (1 on wire + 1 queued) */
struct macb_tx_skb  rm9200_txq[2];
-   unsigned intrm9200_tx_tail;
-   unsigned intrm9200_tx_len;
unsigned intmax_tx_length;
 
u64 ethtool_stats[GEM_STATS_LEN + QUEUE_STATS_LEN * 
MACB_MAX_QUEUES];
diff --git a/drivers/net/ethernet/cadence/macb_main.c 
b/drivers/net/ethernet/cadence/macb_main.c
index 817c7b0..352ae7f 100644
--- a/drivers/net/ethernet/cadence/macb_main.c
+++ b/drivers/net/ethernet/cadence/macb_main.c
@@ -3936,7 +3936,6 @@ static int at91ether_start(struct macb *lp)
 MACB_BIT(ISR_TUND) |
 MACB_BIT(ISR_RLE)  |
 MACB_BIT(TCOMP)|
-MACB_BIT(RM9200_TBRE)  |
 MACB_BIT(ISR_ROVR) |
 MACB_BIT(HRESP));
 
@@ -3953,7 +3952,6 @@ static void at91ether_stop(struct macb *lp)
 MACB_BIT(ISR_TUND) |
 MACB_BIT(ISR_RLE)  |
 MACB_BIT(TCOMP)|
-MACB_BIT(RM9200_TBRE)  |
 MACB_BIT(ISR_ROVR) |
 MACB_BIT(HRESP));
 
@@ -4023,10 +4021,11 @@ static netdev_tx_t at91ether_start_xmit(struct sk_buff 
*skb,
struct net_device *dev)
 {
struct macb *lp = netdev_priv(dev);
-   unsigned long flags;
 
-   if (lp->rm9200_tx_len < 2) {
-   int desc = lp->rm9200_tx_tail;
+   if (macb_readl(lp, TSR) & MACB_BIT(RM9200_BNQ)) {
+   int desc = 0;
+
+   netif_stop_queue(dev);
 
/* Store packet information (to free when Tx completed) */
lp->rm9200_txq[desc].skb = skb;
@@ -4040,15 +4039,6 @@ static netdev_tx_t at91ether_start_xmit(struct sk_buff 
*skb,
return NETDEV_TX_OK;
}
 
-   spin_lock_irqsave(&lp->lock, flags);
-
-   lp->rm9200_tx_tail = (desc + 1) & 1;
-   lp->rm9200_tx_len++;
-   if (lp->rm9200_tx_len > 1)
-   netif_stop_queue(dev);
-
-   spin_unlock_irqrestore(&lp->lock, flags);
-
/* Set address of the data in the Transmit Address register */
macb_writel(lp, TAR, lp->rm9200_txq[desc].mapping);
/* Set length of the packet in the Transmit Control register */
@@ -4114,8 +4104,6 @@ static irqreturn_t at91ether_interrupt(int irq, void 
*dev_id)
struct macb *lp = netdev_priv(dev);
u32 intstatus, ctl;
unsigned int desc;
-   unsigned int qlen;
-   u32 tsr;
 
/* MAC Interrupt Status register i

Re: macb: should we revert 0a4e9ce17ba7 ("macb: support the two tx descriptors on at91rm9200") ?

2020-12-07 Thread Willy Tarreau
On Mon, Dec 07, 2020 at 03:40:42PM -0800, Jakub Kicinski wrote:
> Thanks for the report, I remember that one. In hindsight maybe we
> should have punted it to 5.11...

Well, not necessarily, as it was simple, well documented, and *appeared*
to work fine.

> Let's revert ASAP, 5.10 is going to be LTS, people will definitely
> notice.

It could take some time as we're speaking about crazy people running
5.10 on an old 180 MHz MCU :-)

> Would you mind sending a revert patch with the explanation in the
> commit message?

Sure, will do.

Thanks!
Willy


macb: should we revert 0a4e9ce17ba7 ("macb: support the two tx descriptors on at91rm9200") ?

2020-12-06 Thread Willy Tarreau
Hi Jakub,

Two months ago I implemented a small change in the macb driver to
support the two Tx descriptors that AT91RM9200 supports. I implemented
this using the only compatible device I had, which is the MSC313E-based
Breadbee board. Since then I've met situations where the chip would stop
sending, mostly when doing bidirectional traffic. I've already spent a few
week-ends on this trying various things, including blocking interrupts
over a larger section, switching to 32-bit register accesses on the MSC313E
(the previous code was using two 16-bit accesses and I thought that was the
cause), and tracing status registers along the various operations. Each
time the pattern looks similar: a write when the chip declares having
room results in an overrun, but the preceding condition doesn't leave
any info suggesting this may happen.

Sadly we don't have the datasheet for this SoC; what is visible is that it
supports AT91RM9200's tx mode and that it works well when there's a single
descriptor in flight. In this case it complies with AT91RM9200's datasheet.
The chip reports other undocumented bits in its status registers, which
cannot even be correlated to the issue, by the way. I couldn't spot anything
wrong in my changes, even after making heavy changes to progress on
debugging, and the SoC's behavior of reporting an overrun after a single
write when there's room contradicts the RM9200's datasheet. In addition we
know the chip also supports other descriptor-based modes, so it's very
possible it doesn't perfectly implement the RM9200's 2-desc mode and that
my change is still fine.

Yesterday I hoped to get my old AT91SAM9G20 board to execute this code and
test it, but it clearly does not work when I remap the init and tx functions,
which indicates that it really does not implement a hidden compatibility
mode with the old chip.

Thus at this point I have no way to make sure whether the original SoC for
which I changed the code still works fine or whether I broke it. As such,
I'd feel more comfortable if we reverted my patch until someone has the
opportunity to test it on the original hardware (Alexandre suggested he
might, but later).

The commit in question is the following one:

  0a4e9ce17ba7 ("macb: support the two tx descriptors on at91rm9200")

If the maintainers prefer that we wait for an issue to be reported before
reverting it, that's fine for me as well. What's important to me is that
this potential issue is identified so as not to waste someone else's time
on it.

Thanks!
Willy


Re: [PATCH net-next 0/3] macb: support the 2-deep Tx queue on at91

2020-11-13 Thread Willy Tarreau
Hi Alexandre!

On Fri, Nov 13, 2020 at 11:03:59AM +0100, Alexandre Belloni wrote:
> I think I'm the only one booting recent linux kernels on at91rm9200 and
> I'm currently stuck home while the board is at the office. I'll try to
> test as soon as possible, which may not be before 2021... At least I'll
> know who is the culprit ;)

Oh, that's great. I have a SAMG20-based one, which uses slightly different
registers and supports a tx ring, so initially I thought that I couldn't
test it there. But a friend of mine who wrote the drivers for FreeBSD
told me that the original driver still worked for the SAMG20, and I
suspect that despite not being mentioned in the datasheet, the more
recent chip still supports the old behavior, at least to ease the
transition for their customers. So eventually I'll try it too.

In all transparency, I must tell you that I recently noticed an issue
when facing intense bidirectional traffic: eventually the chip would
report a TX overrun and would stop sending. I finally attributed this
to the unmerged changes needed for the MStar chip, which uses two
16-bit I/O accesses instead of a single 32-bit one, and which apparently
doesn't like being interrupted between the two when writing to the TAR
or TLEN registers. It took me the whole week-end to figure out the root
cause, so now I need to remove all my debugging code, address the issue
and test again. If I can't manage to fix this and you can't find a
moment to test on your board, I'll propose reverting my patch to stay
on the safe side, as I want at least one implementation to be 100%
reliable with it.

Cheers,
Willy


Re: [PATCH net-next 3/3] macb: support the two tx descriptors on at91rm9200

2020-10-14 Thread Willy Tarreau
Hi Claudiu,

first, thanks for your feedback!

On Wed, Oct 14, 2020 at 04:08:00PM +, claudiu.bez...@microchip.com wrote:
> > @@ -3994,11 +3996,10 @@ static netdev_tx_t at91ether_start_xmit(struct 
> > sk_buff *skb,
> > struct net_device *dev)
> >  {
> > struct macb *lp = netdev_priv(dev);
> > +   unsigned long flags;
> > 
> > -   if (macb_readl(lp, TSR) & MACB_BIT(RM9200_BNQ)) {
> > -   int desc = 0;
> > -
> > -   netif_stop_queue(dev);
> > +   if (lp->rm9200_tx_len < 2) {
> > +   int desc = lp->rm9200_tx_tail;
> 
> I think you also want to protect these reads with spin_lock() to avoid
> concurrency with the interrupt handler.

I don't think it's needed because the condition doesn't change under
us, as the interrupt handler only decrements. However, I could use a
READ_ONCE to make things cleaner. And in practice this test was kept
as a sanity check but it never fails: if the queue length
reaches 2, the queue is stopped (and I never got the "device busy" message
either before or after the patch).
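
As a sketch of what that READ_ONCE variant could look like (illustrative
only, this is not the patch as submitted):

	/* Read the queue length once so the compiler cannot re-read it under
	 * us. The interrupt handler only ever decrements rm9200_tx_len, so
	 * acting on a slightly stale value is harmless: at worst we consider
	 * the queue full a little longer than strictly needed.
	 */
	if (READ_ONCE(lp->rm9200_tx_len) < 2) {
		int desc = lp->rm9200_tx_tail;

		/* queue the skb in lp->rm9200_txq[desc] as in the patch */
	}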

> > /* Store packet information (to free when Tx completed) */
> > lp->rm9200_txq[desc].skb = skb;
> > @@ -4012,6 +4013,15 @@ static netdev_tx_t at91ether_start_xmit(struct 
> > sk_buff *skb,
> > return NETDEV_TX_OK;
> > }
> > 
> > +   spin_lock_irqsave(&lp->lock, flags);
> > +
> > +   lp->rm9200_tx_tail = (desc + 1) & 1;
> > +   lp->rm9200_tx_len++;
> > +   if (lp->rm9200_tx_len > 1)
> > +   netif_stop_queue(dev);

This is where we guarantee that we won't call start_xmit() again with
rm9200_tx_len >= 2.

> > @@ -4088,21 +4100,39 @@ static irqreturn_t at91ether_interrupt(int irq, 
> > void *dev_id)
> > at91ether_rx(dev);
> > 
> > /* Transmit complete */
> > -   if (intstatus & MACB_BIT(TCOMP)) {
> > +   if (intstatus & (MACB_BIT(TCOMP) | MACB_BIT(RM9200_TBRE))) {
> > /* The TCOM bit is set even if the transmission failed */
> > if (intstatus & (MACB_BIT(ISR_TUND) | MACB_BIT(ISR_RLE)))
> > dev->stats.tx_errors++;
> > 
> > -   desc = 0;
> > -   if (lp->rm9200_txq[desc].skb) {
> > +   spin_lock(&lp->lock);
> 
> Also, this lock could be moved before while, below, as you want to protect
> the rm9200_tx_len and rm9200_tx_tails members of lp as I understand.

Sure, but it actually *is* before the while(). I'm sorry if that was not
visible from the context of the diff. The while is just a few lines below,
thus rm9200_tx_len and rm9200_tx_tail are properly protected. Do not
hesitate to tell me if something is not clear or if I'm wrong!

Thanks!
Willy


Re: [PATCH net-next 0/3] macb: support the 2-deep Tx queue on at91

2020-10-13 Thread Willy Tarreau
On Tue, Oct 13, 2020 at 05:03:58PM -0700, Jakub Kicinski wrote:
> On Sun, 11 Oct 2020 11:09:41 +0200 Willy Tarreau wrote:
> > while running some tests on my Breadbee board, I noticed poor network
> > Tx performance. I had a look at the driver (macb, at91ether variant)
> > and noticed that at91ether_start_xmit() immediately stops the queue
> > after sending a frame and waits for the interrupt to restart the queue,
> > causing a dead time after each packet is sent.
> > 
> > The AT91RM9200 datasheet states that the controller supports two frames,
> > one being sent and the other one being queued, so I performed minimal
> > changes to support this. The transmit performance on my board has
> > increased by 50% on medium-sized packets (HTTP traffic), and with large
> > packets I can now reach line rate.
> > 
> > Since this driver is shared by various platforms, I tried my best to
> > isolate and limit the changes as much as possible and I think it's pretty
> > reasonable as-is. I've run extensive tests and couldn't meet any
> > unexpected situation (no stall, overflow nor lockup).
> > 
> > There are 3 patches in this series. The first one adds the missing
> > interrupt flag for RM9200 (TBRE, indicating the tx buffer is willing
> > to take a new packet). The second one replaces the single skb with a
> > 2-array and uses only index 0. It does no other change, this is just
> > to prepare the code for the third one. The third one implements the
> > queue. Packets are added at the tail of the queue, the queue is
> > stopped at 2 packets and the interrupt releases 0, 1 or 2 depending
> > on what the transmit status register reports.
> 
> LGTM. There's always a chance that this will make other 
> designs explode, but short of someone from Cadence giving 
> us a timely review we have only one way to find that out.. :)

Not that much in fact, given that the at91ether_* functions are only
used by the AT91RM9200 (whose datasheet I used to do this) and the MStar
chip which I used for the tests. I initially wanted to get my old SAM9G20
board to boot until I noticed that it doesn't even use the same set of
functions, so the potential victims are extremely limited :-)

> Applied, thanks!

Thank you!
Willy


[PATCH net-next 0/3] macb: support the 2-deep Tx queue on at91

2020-10-11 Thread Willy Tarreau
Hi,

while running some tests on my Breadbee board, I noticed poor network
Tx performance. I had a look at the driver (macb, at91ether variant)
and noticed that at91ether_start_xmit() immediately stops the queue
after sending a frame and waits for the interrupt to restart the queue,
causing a dead time after each packet is sent.

The AT91RM9200 datasheet states that the controller supports two frames,
one being sent and the other one being queued, so I performed minimal
changes to support this. The transmit performance on my board has
increased by 50% on medium-sized packets (HTTP traffic), and with large
packets I can now reach line rate.

Since this driver is shared by various platforms, I tried my best to
isolate and limit the changes as much as possible, and I think it's pretty
reasonable as-is. I've run extensive tests and couldn't trigger any
unexpected situation (no stall, overflow nor lockup).

There are 3 patches in this series. The first one adds the missing
interrupt flag for RM9200 (TBRE, indicating the tx buffer is ready
to take a new packet). The second one replaces the single skb with a
2-entry array and uses only index 0. It does no other change; this is just
to prepare the code for the third one. The third one implements the
queue. Packets are added at the tail of the queue, the queue is
stopped at 2 packets, and the interrupt releases 0, 1 or 2 depending
on what the transmit status register reports.

Thanks,
Willy

Willy Tarreau (3):
  macb: add RM9200's interrupt flag TBRE
  macb: prepare at91 to use a 2-frame TX queue
  macb: support the two tx descriptors on at91rm9200

 drivers/net/ethernet/cadence/macb.h  | 10 ++--
 drivers/net/ethernet/cadence/macb_main.c | 66 ++--
 2 files changed, 56 insertions(+), 20 deletions(-)

-- 
2.28.0



[PATCH net-next 2/3] macb: prepare at91 to use a 2-frame TX queue

2020-10-11 Thread Willy Tarreau
The RM9200 supports one frame being sent while another one is waiting in
queue. This avoids the dead time that follows the emission of a frame
and which prevents one from reaching line speed.

Right now the driver supports only a single skb, so we'll first replace
the rm9200-specific skb info with an array of two macb_tx_skb (already
used by other drivers). This patch only moves skb_length to
txq[0].size and skb_physaddr to txq[0].mapping but doesn't perform any
other change. It already uses [desc] in order to minimize future changes.

Cc: Nicolas Ferre 
Cc: Claudiu Beznea 
Cc: Daniel Palmer 
Signed-off-by: Willy Tarreau 
---
 drivers/net/ethernet/cadence/macb.h  |  6 ++---
 drivers/net/ethernet/cadence/macb_main.c | 28 ++--
 2 files changed, 18 insertions(+), 16 deletions(-)

diff --git a/drivers/net/ethernet/cadence/macb.h 
b/drivers/net/ethernet/cadence/macb.h
index 49d347429de8..e87db95fb0f6 100644
--- a/drivers/net/ethernet/cadence/macb.h
+++ b/drivers/net/ethernet/cadence/macb.h
@@ -1206,10 +1206,8 @@ struct macb {
 
phy_interface_t phy_interface;
 
-   /* AT91RM9200 transmit */
-   struct sk_buff *skb;/* holds skb until xmit 
interrupt completes */
-   dma_addr_t skb_physaddr;/* phys addr from 
pci_map_single */
-   int skb_length; /* saved skb length for 
pci_unmap_single */
+   /* AT91RM9200 transmit queue (1 on wire + 1 queued) */
+   struct macb_tx_skb  rm9200_txq[2];
unsigned intmax_tx_length;
 
u64 ethtool_stats[GEM_STATS_LEN + QUEUE_STATS_LEN * 
MACB_MAX_QUEUES];
diff --git a/drivers/net/ethernet/cadence/macb_main.c 
b/drivers/net/ethernet/cadence/macb_main.c
index 9179f7b0b900..ca6e5456906a 100644
--- a/drivers/net/ethernet/cadence/macb_main.c
+++ b/drivers/net/ethernet/cadence/macb_main.c
@@ -3996,14 +3996,16 @@ static netdev_tx_t at91ether_start_xmit(struct sk_buff 
*skb,
struct macb *lp = netdev_priv(dev);
 
if (macb_readl(lp, TSR) & MACB_BIT(RM9200_BNQ)) {
+   int desc = 0;
+
netif_stop_queue(dev);
 
/* Store packet information (to free when Tx completed) */
-   lp->skb = skb;
-   lp->skb_length = skb->len;
-   lp->skb_physaddr = dma_map_single(&lp->pdev->dev, skb->data,
- skb->len, DMA_TO_DEVICE);
-   if (dma_mapping_error(&lp->pdev->dev, lp->skb_physaddr)) {
+   lp->rm9200_txq[desc].skb = skb;
+   lp->rm9200_txq[desc].size = skb->len;
+   lp->rm9200_txq[desc].mapping = dma_map_single(&lp->pdev->dev, 
skb->data,
+ skb->len, 
DMA_TO_DEVICE);
+   if (dma_mapping_error(&lp->pdev->dev, 
lp->rm9200_txq[desc].mapping)) {
dev_kfree_skb_any(skb);
dev->stats.tx_dropped++;
netdev_err(dev, "%s: DMA mapping error\n", __func__);
@@ -4011,7 +4013,7 @@ static netdev_tx_t at91ether_start_xmit(struct sk_buff 
*skb,
}
 
/* Set address of the data in the Transmit Address register */
-   macb_writel(lp, TAR, lp->skb_physaddr);
+   macb_writel(lp, TAR, lp->rm9200_txq[desc].mapping);
/* Set length of the packet in the Transmit Control register */
macb_writel(lp, TCR, skb->len);
 
@@ -4074,6 +4076,7 @@ static irqreturn_t at91ether_interrupt(int irq, void 
*dev_id)
struct net_device *dev = dev_id;
struct macb *lp = netdev_priv(dev);
u32 intstatus, ctl;
+   unsigned int desc;
 
/* MAC Interrupt Status register indicates what interrupts are pending.
 * It is automatically cleared once read.
@@ -4090,13 +4093,14 @@ static irqreturn_t at91ether_interrupt(int irq, void 
*dev_id)
if (intstatus & (MACB_BIT(ISR_TUND) | MACB_BIT(ISR_RLE)))
dev->stats.tx_errors++;
 
-   if (lp->skb) {
-   dev_consume_skb_irq(lp->skb);
-   lp->skb = NULL;
-   dma_unmap_single(&lp->pdev->dev, lp->skb_physaddr,
-lp->skb_length, DMA_TO_DEVICE);
+   desc = 0;
+   if (lp->rm9200_txq[desc].skb) {
+   dev_consume_skb_irq(lp->rm9200_txq[desc].skb);
+   lp->rm9200_txq[desc].skb = NULL;
+   dma_unmap_single(&lp->pdev->dev, 
lp->rm9200_txq[desc].mapping,
+lp->rm9200_txq[desc].size, 
DMA_TO_DEVICE);
dev->stats.tx_packets++;
-

[PATCH net-next 1/3] macb: add RM9200's interrupt flag TBRE

2020-10-11 Thread Willy Tarreau
Transmit Buffer Register Empty replaces TXERR on the RM9200 and signals that
the sender may try to send again because the last queued frame is no longer
in the queue (being transmitted or already transmitted).

Cc: Nicolas Ferre 
Cc: Claudiu Beznea 
Cc: Daniel Palmer 
Signed-off-by: Willy Tarreau 
---
 drivers/net/ethernet/cadence/macb.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/net/ethernet/cadence/macb.h 
b/drivers/net/ethernet/cadence/macb.h
index 4f1b41569260..49d347429de8 100644
--- a/drivers/net/ethernet/cadence/macb.h
+++ b/drivers/net/ethernet/cadence/macb.h
@@ -365,6 +365,8 @@
 #define MACB_ISR_RLE_SIZE  1
 #define MACB_TXERR_OFFSET  6 /* EN TX frame corrupt from error interrupt */
 #define MACB_TXERR_SIZE1
+#define MACB_RM9200_TBRE_OFFSET6 /* EN may send new frame interrupt 
(RM9200) */
+#define MACB_RM9200_TBRE_SIZE  1
 #define MACB_TCOMP_OFFSET  7 /* Enable transmit complete interrupt */
 #define MACB_TCOMP_SIZE1
 #define MACB_ISR_LINK_OFFSET   9 /* Enable link change interrupt */
-- 
2.28.0



[PATCH net-next 3/3] macb: support the two tx descriptors on at91rm9200

2020-10-11 Thread Willy Tarreau
The at91rm9200 variant used by a few chips including the MSC313 supports
two Tx descriptors (one frame being serialized and another one queued).
However the driver only implemented a single one, which adds a dead time
after each transfer to receive and process the interrupt and wake the
queue up, preventing it from reaching line rate.

This patch implements a very basic 2-deep queue to address this limitation.
The tests run on a Breadbee board equipped with an MSC313E show that at
1 GHz, HTTP traffic on medium-sized objects (45kB) was limited to exactly
50 Mbps before this patch, and jumped to 76 Mbps with it. Tests on a single
TCP stream with an MTU of 576 jumped from 10kpps to 15kpps. With 1500-byte
packets it's now possible to reach line rate, versus 75 Mbps before.

Cc: Nicolas Ferre 
Cc: Claudiu Beznea 
Cc: Daniel Palmer 
Signed-off-by: Willy Tarreau 
---
 drivers/net/ethernet/cadence/macb.h  |  2 ++
 drivers/net/ethernet/cadence/macb_main.c | 46 +++-
 2 files changed, 40 insertions(+), 8 deletions(-)

diff --git a/drivers/net/ethernet/cadence/macb.h 
b/drivers/net/ethernet/cadence/macb.h
index e87db95fb0f6..f8133003981f 100644
--- a/drivers/net/ethernet/cadence/macb.h
+++ b/drivers/net/ethernet/cadence/macb.h
@@ -1208,6 +1208,8 @@ struct macb {
 
/* AT91RM9200 transmit queue (1 on wire + 1 queued) */
struct macb_tx_skb  rm9200_txq[2];
+   unsigned intrm9200_tx_tail;
+   unsigned intrm9200_tx_len;
unsigned intmax_tx_length;
 
u64 ethtool_stats[GEM_STATS_LEN + QUEUE_STATS_LEN * 
MACB_MAX_QUEUES];
diff --git a/drivers/net/ethernet/cadence/macb_main.c 
b/drivers/net/ethernet/cadence/macb_main.c
index ca6e5456906a..6ff8e4b0b95d 100644
--- a/drivers/net/ethernet/cadence/macb_main.c
+++ b/drivers/net/ethernet/cadence/macb_main.c
@@ -3909,6 +3909,7 @@ static int at91ether_start(struct macb *lp)
 MACB_BIT(ISR_TUND) |
 MACB_BIT(ISR_RLE)  |
 MACB_BIT(TCOMP)|
+MACB_BIT(RM9200_TBRE)  |
 MACB_BIT(ISR_ROVR) |
 MACB_BIT(HRESP));
 
@@ -3925,6 +3926,7 @@ static void at91ether_stop(struct macb *lp)
 MACB_BIT(ISR_TUND) |
 MACB_BIT(ISR_RLE)  |
 MACB_BIT(TCOMP)|
+MACB_BIT(RM9200_TBRE)  |
 MACB_BIT(ISR_ROVR) |
 MACB_BIT(HRESP));
 
@@ -3994,11 +3996,10 @@ static netdev_tx_t at91ether_start_xmit(struct sk_buff 
*skb,
struct net_device *dev)
 {
struct macb *lp = netdev_priv(dev);
+   unsigned long flags;
 
-   if (macb_readl(lp, TSR) & MACB_BIT(RM9200_BNQ)) {
-   int desc = 0;
-
-   netif_stop_queue(dev);
+   if (lp->rm9200_tx_len < 2) {
+   int desc = lp->rm9200_tx_tail;
 
/* Store packet information (to free when Tx completed) */
lp->rm9200_txq[desc].skb = skb;
@@ -4012,6 +4013,15 @@ static netdev_tx_t at91ether_start_xmit(struct sk_buff 
*skb,
return NETDEV_TX_OK;
}
 
+   spin_lock_irqsave(&lp->lock, flags);
+
+   lp->rm9200_tx_tail = (desc + 1) & 1;
+   lp->rm9200_tx_len++;
+   if (lp->rm9200_tx_len > 1)
+   netif_stop_queue(dev);
+
+   spin_unlock_irqrestore(&lp->lock, flags);
+
/* Set address of the data in the Transmit Address register */
macb_writel(lp, TAR, lp->rm9200_txq[desc].mapping);
/* Set length of the packet in the Transmit Control register */
@@ -4077,6 +4087,8 @@ static irqreturn_t at91ether_interrupt(int irq, void 
*dev_id)
struct macb *lp = netdev_priv(dev);
u32 intstatus, ctl;
unsigned int desc;
+   unsigned int qlen;
+   u32 tsr;
 
/* MAC Interrupt Status register indicates what interrupts are pending.
 * It is automatically cleared once read.
@@ -4088,21 +4100,39 @@ static irqreturn_t at91ether_interrupt(int irq, void 
*dev_id)
at91ether_rx(dev);
 
/* Transmit complete */
-   if (intstatus & MACB_BIT(TCOMP)) {
+   if (intstatus & (MACB_BIT(TCOMP) | MACB_BIT(RM9200_TBRE))) {
/* The TCOM bit is set even if the transmission failed */
if (intstatus & (MACB_BIT(ISR_TUND) | MACB_BIT(ISR_RLE)))
dev->stats.tx_errors++;
 
-   desc = 0;
-   if (lp->rm9200_txq[desc].skb) {
+   spin_lock(&lp->lock);
+
+   tsr = macb_readl(lp, TSR);
+
+   /* we have three possibilities

Re: [PATCH] random32: Restore __latent_entropy attribute on net_rand_state

2020-10-05 Thread Willy Tarreau
Hi Kees,

On Mon, Oct 05, 2020 at 07:12:29PM -0700, Kees Cook wrote:
> On Fri, Oct 02, 2020 at 05:16:11PM +0200, Thibaut Sautereau wrote:
> > From: Thibaut Sautereau 
> > 
> > Commit f227e3ec3b5c ("random32: update the net random state on interrupt
> > and activity") broke compilation and was temporarily fixed by Linus in
> > 83bdc7275e62 ("random32: remove net_rand_state from the latent entropy
> > gcc plugin") by entirely moving net_rand_state out of the things handled
> > by the latent_entropy GCC plugin.
> > 
> > From what I understand when reading the plugin code, using the
> > __latent_entropy attribute on a declaration was the wrong part and
> > simply keeping the __latent_entropy attribute on the variable definition
> > was the correct fix.
> > 
> > Fixes: 83bdc7275e62 ("random32: remove net_rand_state from the latent 
> > entropy gcc plugin")
> > Cc: Linus Torvalds 
> > Cc: Willy Tarreau 
> > Cc: Emese Revfy 
> > Signed-off-by: Thibaut Sautereau 
> 
> Yes, that looks correct. Thank you!
> 
> Acked-by: Kees Cook 
> 
> I'm not sure the best tree for this. Ted, Andrew, Linus? I'll take it
> via my gcc plugin tree if no one else takes it. :)

It was already merged as commit 09a6b0bc3be79 and queued for -stable.

Cheers,
Willy


Re: [PATCH] random32: Restore __latent_entropy attribute on net_rand_state

2020-10-02 Thread Willy Tarreau
On Fri, Oct 02, 2020 at 05:16:11PM +0200, Thibaut Sautereau wrote:
> From: Thibaut Sautereau 
> 
> Commit f227e3ec3b5c ("random32: update the net random state on interrupt
> and activity") broke compilation and was temporarily fixed by Linus in
> 83bdc7275e62 ("random32: remove net_rand_state from the latent entropy
> gcc plugin") by entirely moving net_rand_state out of the things handled
> by the latent_entropy GCC plugin.
> 
> From what I understand when reading the plugin code, using the
> __latent_entropy attribute on a declaration was the wrong part and
> simply keeping the __latent_entropy attribute on the variable definition
> was the correct fix.

Ah, thank you, this is what I tried to figure out and failed to! Spender
mentioned that a one-liner was all that was needed to fix this but
never responded to my request asking about it.

Willy


Re: Why ping latency is smaller with shorter send interval?

2020-10-02 Thread Willy Tarreau
On Fri, Oct 02, 2020 at 11:26:00AM +0200, Eric Dumazet wrote:
> 2) Idle cpus can be put in a power conserving state.
>   It takes time to exit from these states, as you noticed.
>   These delays can typically be around 50 usec, or more.

This is often the case in my experience. I'm even used to starting a
busy loop in SCHED_IDLE prio on certain machines or in certain VMs
just to keep them busy, and to see the ping latency cut in half.
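
For reference, such a busy-looper is trivial; a userspace sketch (illustrative
only, error handling omitted):

#define _GNU_SOURCE
#include <sched.h>

int main(void)
{
	struct sched_param sp = { .sched_priority = 0 };

	/* SCHED_IDLE: this loop only consumes cycles nobody else wants,
	 * but it keeps the core out of deep power-saving states.
	 */
	sched_setscheduler(0, SCHED_IDLE, &sp);

	for (;;)
		;	/* spin forever */

	return 0;
}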

Willy


Re: [PATCH 1/2] random32: make prandom_u32() output unpredictable

2020-09-14 Thread Willy Tarreau
On Mon, Sep 14, 2020 at 06:16:40PM +0200, Sedat Dilek wrote:
> On Mon, Sep 14, 2020 at 4:53 PM Amit Klein  wrote:
> >
> > Hi
> >
> > Is this patch being pushed to any branch? I don't see it deployed anywhere 
> > (unless I'm missing something...).
> >
> 
> It's here:
> 
> [1] 
> https://git.kernel.org/pub/scm/linux/kernel/git/wtarreau/prandom.git/log/?h=20200901-siphash-noise

By the way, I didn't get any feedback from those who initially disagreed
with the one that was merged, so for now I'm not doing anything on it
anymore. I can propose it again for 5.10-rc1 but will not push for it
anymore if there's no interest behind it.

Cheers,
Willy


Re: [PATCH 0/2] prandom_u32: make output less predictable

2020-09-01 Thread Willy Tarreau
On Tue, Sep 01, 2020 at 04:41:13PM +0200, Sedat Dilek wrote:
> I have tested with the patchset from [1].
> ( Later I saw, you dropped "WIP: tcp: reuse incoming skb hash in
> tcp_conn_request()". )

Yes, because it's a bit out of the scope of this series and makes sense
even without these patches, thus I assume Eric will take care of it
separately.

Willy


Re: [PATCH 1/2] random32: make prandom_u32() output unpredictable

2020-09-01 Thread Willy Tarreau
On Tue, Sep 01, 2020 at 01:10:18PM +, David Laight wrote:
> From: Willy Tarreau
> > Sent: 01 September 2020 07:43
> ...
> > +/*
> > + * Generate some initially weak seeding values to allow
> > + * the prandom_u32() engine to be started.
> > + */
> > +static int __init prandom_init_early(void)
> > +{
> > +   int i;
> > +   unsigned long v0, v1, v2, v3;
> > +
> > +   if (!arch_get_random_long(&v0))
> > +   v0 = jiffies;
> 
> Isn't jiffies likely to be zero here?

I don't know. But do we really care? I'd personally have been fine
with not even assigning it in this case and leaving whatever was on
the stack, though it could make some static code analyzer
unhappy.

Willy


Re: [PATCH 2/2] random32: add noise from network and scheduling activity

2020-09-01 Thread Willy Tarreau
Hi Eric,

On Tue, Sep 01, 2020 at 12:24:38PM +0200, Eric Dumazet wrote:
> There is not much entropy here really :
> 
> 1) dev & txq are mostly constant on a typical host (at least the kind of 
> hosts that is targeted by 
> Amit Klein and others in their attacks.
> 
> 2) len is also known by the attacker, attacking an idle host.
> 
> 3) skb are also allocations from slab cache, which tend to recycle always the 
> same pointers (on idle hosts)
> 
> 
> 4) jiffies might be incremented every 4 ms (if HZ=250)

I know. The point is essentially that someone "remote", or with rare access
to the host's memory (i.e. in a VM on the same CPU, sharing L1, with some
CPU vulnerabilities), cannot synchronize with the PRNG and easily stay
synchronized forever. Otherwise I totally agree that these are pretty
weak. But in my opinion they are sufficient to turn a 100% success rate into
way less. I try not to forget that we're just trying to make guessing a
~15-bit port require ~2^14 attempts on average. Oh, and by the way, the
number of calls also counts here.

> Maybe we could feed percpu prandom noise with samples of ns resolution 
> timestamps,
> lazily cached from ktime_get() or similar functions.
>
> This would use one instruction on x86 to update the cache, with maybe more 
> generic noise.

Sure! I think the principle here can easily be extended to various
places, and the more the better. Maybe we'll actually figure out that there
are plenty of sources of randomness that were not considered secure enough
to feed /dev/random while they're perfectly fine for such use cases.

> diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
> index 
> 4c47f388a83f17860fdafa3229bba0cc605ec25a..a3e026cbbb6e8c5499ed780e57de5fa09bc010b6
>  100644
> --- a/kernel/time/timekeeping.c
> +++ b/kernel/time/timekeeping.c
> @@ -751,7 +751,7 @@ ktime_t ktime_get(void)
>  {
> struct timekeeper *tk = &tk_core.timekeeper;
> unsigned int seq;
> -   ktime_t base;
> +   ktime_t res, base;
> u64 nsecs;
>  
> WARN_ON(timekeeping_suspended);
> @@ -763,7 +763,9 @@ ktime_t ktime_get(void)
>  
> } while (read_seqcount_retry(&tk_core.seq, seq));
>  
> -   return ktime_add_ns(base, nsecs);
> +   res = ktime_add_ns(base, nsecs);
> +   __this_cpu_add(prandom_noise, (unsigned long)ktime_to_ns(res));
> +   return res;
>  }
>  EXPORT_SYMBOL_GPL(ktime_get);

Actually it could even be nice to combine it with __builtin_return_address(0)
given the large number of callers this one has! But I generally agree with
your proposal.
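
As a rough sketch of that idea on top of Eric's ktime_get() change above
(illustrative only, not a submitted patch):

	/* Fold the caller's address into the per-cpu noise together with
	 * the timestamp, so the many different callers of ktime_get()
	 * each contribute slightly different data.
	 */
	res = ktime_add_ns(base, nsecs);
	__this_cpu_add(prandom_noise,
		       (unsigned long)ktime_to_ns(res) ^
		       (unsigned long)__builtin_return_address(0));
	return res;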

Thanks,
Willy


Re: [PATCH 1/2] random32: make prandom_u32() output unpredictable

2020-09-01 Thread Willy Tarreau
On Tue, Sep 01, 2020 at 10:46:16AM +0200, Sedat Dilek wrote:
> Will you push the updated patchset to your prandom Git - for easy fetching?

Yeah if you want, feel free to use this one :
   
https://git.kernel.org/pub/scm/linux/kernel/git/wtarreau/prandom.git/log/?h=20200901-siphash-noise

Willy


Re: [PATCH 1/2] random32: make prandom_u32() output unpredictable

2020-09-01 Thread Willy Tarreau
On Tue, Sep 01, 2020 at 10:33:40AM +0200, Yann Ylavic wrote:
> On Tue, Sep 1, 2020 at 8:45 AM Willy Tarreau  wrote:
> >
> > +/*
> > + * Generate some initially weak seeding values to allow
> > + * the prandom_u32() engine to be started.
> > + */
> > +static int __init prandom_init_early(void)
> > +{
> > +   int i;
> > +   unsigned long v0, v1, v2, v3;
> > +
> > +   if (!arch_get_random_long(&v0))
> > +   v0 = jiffies;
> > +   if (!arch_get_random_long(&v1))
> > +   v0 = random_get_entropy();
> 
> Shouldn't the above be:
>   v1 = random_get_entropy();
> ?

Very good catch, many thanks Yann! Now fixed in my local tree.

Willy


[PATCH 1/2] random32: make prandom_u32() output unpredictable

2020-08-31 Thread Willy Tarreau
From: George Spelvin 

Non-cryptographic PRNGs may have great statistical properties, but
are usually trivially predictable to someone who knows the algorithm,
given a small sample of their output.  An LFSR like prandom_u32() is
particularly simple, even if the sample is widely scattered bits.

It turns out the network stack uses prandom_u32() for some things like
random port numbers which it would prefer are *not* trivially predictable.
Predictability led to a practical DNS spoofing attack.  Oops.

This patch replaces the LFSR with a homebrew cryptographic PRNG based
on the SipHash round function, which is in turn seeded with 128 bits
of strong random key.  (The authors of SipHash have *not* been consulted
about this abuse of their algorithm.)  Speed is prioritized over security;
attacks are rare, while performance is always wanted.

Replacing all callers of prandom_u32() is the quick fix.
Whether to reinstate a weaker PRNG for uses which can tolerate it
is an open question.

Commit f227e3ec3b5c ("random32: update the net random state on interrupt
and activity") was an earlier attempt at a solution.  This patch replaces
it.

Reported-by: Amit Klein 
Cc: Willy Tarreau 
Cc: Eric Dumazet 
Cc: "Jason A. Donenfeld" 
Cc: Andy Lutomirski 
Cc: Kees Cook 
Cc: Thomas Gleixner 
Cc: Peter Zijlstra 
Cc: Linus Torvalds 
Cc: ty...@mit.edu
Cc: Florian Westphal 
Cc: Marc Plumb 
Fixes: f227e3ec3b5c ("random32: update the net random state on interrupt and 
activity")
Signed-off-by: George Spelvin 
Link: https://lore.kernel.org/netdev/20200808152628.ga27...@sdf.org/
[ willy: partial reversal of f227e3ec3b5c; moved SIPROUND definitions
  to prandom.h for later use; merged George's prandom_seed() proposal;
  inlined siprand_u32(); replaced the net_rand_state[] array with 4
  members to fix a build issue; cosmetic cleanups to make checkpatch
  happy ]
Signed-off-by: Willy Tarreau 
---
 drivers/char/random.c   |   1 -
 include/linux/prandom.h |  36 +++-
 kernel/time/timer.c |   7 -
 lib/random32.c  | 433 
 4 files changed, 296 insertions(+), 181 deletions(-)

diff --git a/drivers/char/random.c b/drivers/char/random.c
index d20ba1b104ca..2a41b21623ae 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -1277,7 +1277,6 @@ void add_interrupt_randomness(int irq, int irq_flags)
 
fast_mix(fast_pool);
add_interrupt_bench(cycles);
-   this_cpu_add(net_rand_state.s1, fast_pool->pool[cycles & 3]);
 
if (unlikely(crng_init == 0)) {
if ((fast_pool->count >= 64) &&
diff --git a/include/linux/prandom.h b/include/linux/prandom.h
index aa16e6468f91..cc1e71334e53 100644
--- a/include/linux/prandom.h
+++ b/include/linux/prandom.h
@@ -16,12 +16,44 @@ void prandom_bytes(void *buf, size_t nbytes);
 void prandom_seed(u32 seed);
 void prandom_reseed_late(void);
 
+#if BITS_PER_LONG == 64
+/*
+ * The core SipHash round function.  Each line can be executed in
+ * parallel given enough CPU resources.
+ */
+#define PRND_SIPROUND(v0, v1, v2, v3) ( \
+   v0 += v1, v1 = rol64(v1, 13),  v2 += v3, v3 = rol64(v3, 16), \
+   v1 ^= v0, v0 = rol64(v0, 32),  v3 ^= v2, \
+   v0 += v3, v3 = rol64(v3, 21),  v2 += v1, v1 = rol64(v1, 17), \
+   v3 ^= v0,  v1 ^= v2, v2 = rol64(v2, 32)  \
+)
+
+#define PRND_K0 (0x736f6d6570736575 ^ 0x6c7967656e657261)
+#define PRND_K1 (0x646f72616e646f6d ^ 0x7465646279746573)
+
+#elif BITS_PER_LONG == 32
+/*
+ * On 32-bit machines, we use HSipHash, a reduced-width version of SipHash.
+ * This is weaker, but 32-bit machines are not used for high-traffic
+ * applications, so there is less output for an attacker to analyze.
+ */
+#define PRND_SIPROUND(v0, v1, v2, v3) ( \
+   v0 += v1, v1 = rol32(v1,  5),  v2 += v3, v3 = rol32(v3,  8), \
+   v1 ^= v0, v0 = rol32(v0, 16),  v3 ^= v2, \
+   v0 += v3, v3 = rol32(v3,  7),  v2 += v1, v1 = rol32(v1, 13), \
+   v3 ^= v0,  v1 ^= v2, v2 = rol32(v2, 16)  \
+)
+#define PRND_K0 0x6c796765
+#define PRND_K1 0x74656462
+
+#else
+#error Unsupported BITS_PER_LONG
+#endif
+
 struct rnd_state {
__u32 s1, s2, s3, s4;
 };
 
-DECLARE_PER_CPU(struct rnd_state, net_rand_state);
-
 u32 prandom_u32_state(struct rnd_state *state);
 void prandom_bytes_state(struct rnd_state *state, void *buf, size_t nbytes);
 void prandom_seed_full_state(struct rnd_state __percpu *pcpu_state);
diff --git a/kernel/time/timer.c b/kernel/time/timer.c
index a50364df1054..401fcb9d7388 100644
--- a/kernel/time/timer.c
+++ b/kernel/time/timer.c
@@ -1715,13 +1715,6 @@ void update_process_times(int user_tick)
scheduler_tick();
if (IS_ENABLED(CONFIG_POSIX_TIMERS))
run_posix_cpu_timers();
-
-   /* The current CPU might make use of net randoms without receiving IRQs
-* to renew them often enough. Let's update the n

[PATCH 0/2] prandom_u32: make output less predictable

2020-08-31 Thread Willy Tarreau
This is the cleanup of the latest series of prandom_u32 experiments,
consisting in using SipHash instead of Tausworthe to produce the random
numbers used by the network stack. The changes to the files were kept minimal,
and the controversial commit that used to take noise from the fast_pool
(f227e3ec3b5c) was reverted. Instead, a dedicated "net_rand_noise" per_cpu
variable is fed from various sources of activity (networking, scheduling)
to perturb the SipHash state using fast, non-trivially predictable data,
instead of keeping it fully deterministic. The goal is essentially to make
any occasional memory leakage or brute-force attempt useless.

The resulting code was verified to be very slightly faster on x86_64 than
what it was with the controversial commit above, though this remains barely
above measurement noise. It was only build-tested on arm & arm64.

George Spelvin (1):
  random32: make prandom_u32() output unpredictable

Willy Tarreau (1):
  random32: add noise from network and scheduling activity

 drivers/char/random.c   |   1 -
 include/linux/prandom.h |  55 -
 kernel/time/timer.c |   9 +-
 lib/random32.c  | 438 
 net/core/dev.c  |   4 +
 5 files changed, 326 insertions(+), 181 deletions(-)

Cc: George Spelvin 
Cc: Amit Klein 
Cc: Eric Dumazet 
Cc: "Jason A. Donenfeld" 
Cc: Andy Lutomirski 
Cc: Kees Cook 
Cc: Thomas Gleixner 
Cc: Peter Zijlstra 
Cc: Linus Torvalds 
Cc: ty...@mit.edu
Cc: Florian Westphal 
Cc: Marc Plumb 
Cc: Sedat Dilek 

-- 
2.28.0


[PATCH 2/2] random32: add noise from network and scheduling activity

2020-08-31 Thread Willy Tarreau
With the removal of the interrupt perturbations in previous random32
change (random32: make prandom_u32() output unpredictable), the PRNG
has become 100% deterministic again. While SipHash is expected to be
way more robust against brute force than the previous Tausworthe LFSR,
there's still the risk that whoever has even one temporary access to
the PRNG's internal state is able to predict all subsequent draws till
the next reseed (roughly every minute). This may happen through a side
channel attack or any data leak.

This patch restores the spirit of commit f227e3ec3b5c ("random32: update
the net random state on interrupt and activity") in that it will perturb
the internal PRNG's state using externally collected noise, except that
it will not pick that noise from the random pool's bits nor upon
interrupt, but will rather combine a few elements along the Tx path
that are collectively hard to predict, such as dev, skb and txq
pointers, packet length and jiffies values. These ones are combined
using a single round of SipHash into a single long variable that is
mixed with the net_rand_state upon each invocation.

The operation was inlined because it produces very small and efficient
code, typically 3 xor, 2 add and 2 rol. The performance was measured
to be the same as (even very slightly better than) before the switch to
SipHash; on a 6-core 12-thread Core i7-8700k equipped with a 40G NIC
(i40e), the connection rate dropped from 556k/s to 555k/s while the
SYN cookie rate grew from 5.38 Mpps to 5.45 Mpps.

Link: https://lore.kernel.org/netdev/20200808152628.ga27...@sdf.org/
Cc: George Spelvin 
Cc: Amit Klein 
Cc: Eric Dumazet 
Cc: "Jason A. Donenfeld" 
Cc: Andy Lutomirski 
Cc: Kees Cook 
Cc: Thomas Gleixner 
Cc: Peter Zijlstra 
Cc: Linus Torvalds 
Cc: ty...@mit.edu
Cc: Florian Westphal 
Cc: Marc Plumb 
Tested-by: Sedat Dilek 
Signed-off-by: Willy Tarreau 
---
 include/linux/prandom.h | 19 +++
 kernel/time/timer.c |  2 ++
 lib/random32.c  |  5 +
 net/core/dev.c  |  4 
 4 files changed, 30 insertions(+)

diff --git a/include/linux/prandom.h b/include/linux/prandom.h
index cc1e71334e53..aa7de3432e0f 100644
--- a/include/linux/prandom.h
+++ b/include/linux/prandom.h
@@ -16,6 +16,12 @@ void prandom_bytes(void *buf, size_t nbytes);
 void prandom_seed(u32 seed);
 void prandom_reseed_late(void);
 
+DECLARE_PER_CPU(unsigned long, net_rand_noise);
+
+#define PRANDOM_ADD_NOISE(a, b, c, d) \
+   prandom_u32_add_noise((unsigned long)(a), (unsigned long)(b), \
+ (unsigned long)(c), (unsigned long)(d))
+
 #if BITS_PER_LONG == 64
 /*
  * The core SipHash round function.  Each line can be executed in
@@ -50,6 +56,18 @@ void prandom_reseed_late(void);
 #error Unsupported BITS_PER_LONG
 #endif
 
+static inline void prandom_u32_add_noise(unsigned long a, unsigned long b,
+unsigned long c, unsigned long d)
+{
+   /*
+* This is not used cryptographically; it's just
+* a convenient 4-word hash function. (3 xor, 2 add, 2 rol)
+*/
+   a ^= __this_cpu_read(net_rand_noise);
+   PRND_SIPROUND(a, b, c, d);
+   __this_cpu_write(net_rand_noise, d);
+}
+
 struct rnd_state {
__u32 s1, s2, s3, s4;
 };
@@ -99,6 +117,7 @@ static inline void prandom_seed_state(struct rnd_state 
*state, u64 seed)
state->s2 = __seed(i,   8U);
state->s3 = __seed(i,  16U);
state->s4 = __seed(i, 128U);
+   PRANDOM_ADD_NOISE(state, i, 0, 0);
 }
 
 /* Pseudo random number generator from numerical recipes. */
diff --git a/kernel/time/timer.c b/kernel/time/timer.c
index 401fcb9d7388..bebcf2fc1226 100644
--- a/kernel/time/timer.c
+++ b/kernel/time/timer.c
@@ -1704,6 +1704,8 @@ void update_process_times(int user_tick)
 {
struct task_struct *p = current;
 
+   PRANDOM_ADD_NOISE(jiffies, user_tick, p, 0);
+
/* Note: this timer irq context must be accounted for as well. */
account_process_tick(p, user_tick);
run_local_timers();
diff --git a/lib/random32.c b/lib/random32.c
index 00fa925a4487..38db382a8cf5 100644
--- a/lib/random32.c
+++ b/lib/random32.c
@@ -324,6 +324,8 @@ struct siprand_state {
 };
 
 static DEFINE_PER_CPU(struct siprand_state, net_rand_state) __latent_entropy;
+DEFINE_PER_CPU(unsigned long, net_rand_noise);
+EXPORT_PER_CPU_SYMBOL(net_rand_noise);
 
 /*
  * This is the core CPRNG function.  As "pseudorandom", this is not used
@@ -347,9 +349,12 @@ static DEFINE_PER_CPU(struct siprand_state, 
net_rand_state) __latent_entropy;
 static inline u32 siprand_u32(struct siprand_state *s)
 {
unsigned long v0 = s->v0, v1 = s->v1, v2 = s->v2, v3 = s->v3;
+   unsigned long n = __this_cpu_read(net_rand_noise);
 
+   v3 ^= n;
PRND_SIPROUND(v0, v1, v2, v3);
PRND_SIPROUND(v0, v1, v2, v3);
+   v0 ^= n;
s->v0 = v0;  s->v1 = v1;  s-&

Re: [DRAFT PATCH] random32: make prandom_u32() output unpredictable

2020-08-26 Thread Willy Tarreau
Hi Amit,

On Thu, Aug 27, 2020 at 02:06:39AM +0300, Amit Klein wrote:
> Hi
> 
> Is there an ETA for this patch then?

No particular ETA on my side, I was waiting for potential criticism
before going further. I suspect that if nobody complains anymore, it's
an implicit approval and I'll have to clean up and post the series then.

Thanks,
Willy


Re: [DRAFT PATCH] random32: make prandom_u32() output unpredictable

2020-08-20 Thread Willy Tarreau
On Thu, Aug 20, 2020 at 08:58:38AM +0200, Willy Tarreau wrote:
> I've just pushed a new branch "20200820-siphash-noise" that I also
> rebased onto latest master. It's currently running make allmodconfig
> here, so that will take a while, but I think it's OK as random32.o is
> already OK. I've also addressed a strange build issue caused by having
> an array instead of 4 ints in siprand_state.
> 
> Please just let me know if that's OK for you now.

At least it worked for me now (built with no errors on x86_64):

$ time make -j 8 bzImage modules
(...)
real    65m7.986s
user    477m22.477s
sys     38m0.545s
$ find . -name '*.ko' |wc -l
7983

Willy


Re: [DRAFT PATCH] random32: make prandom_u32() output unpredictable

2020-08-19 Thread Willy Tarreau
On Thu, Aug 20, 2020 at 08:08:43AM +0200, Willy Tarreau wrote:
> On Thu, Aug 20, 2020 at 06:42:23AM +0200, Sedat Dilek wrote:
> > On Thu, Aug 20, 2020 at 6:33 AM Willy Tarreau  wrote:
> > >
> > > On Thu, Aug 20, 2020 at 05:05:49AM +0200, Sedat Dilek wrote:
> > > > We have the same defines for K0 and K1 in include/linux/prandom.h and
> > > > lib/random32.c?
> > > > More room for simplifications?
> > >
> > > Definitely, I'm not surprised at all. As I said, the purpose was to
> > > discuss around the proposal, not much more. If we think it's the way
> > > to go, some major lifting is required. I just don't want to invest
> > > significant time on this if nobody cares.
> > >
> > 
> > OK.
> > 
> > Right now, I will try with the attached diff.
> 
> No, don't waste your time this way, it's not the right way to address it,
> you're still facing competition between defines. I'll do another one if
> you want to go further in the tests.

I've just pushed a new branch "20200820-siphash-noise" that I also
rebased onto latest master. It's currently running make allmodconfig
here, so that will take a while, but I think it's OK as random32.o is
already OK. I've also addressed a strange build issue caused by having
an array instead of 4 ints in siprand_state.

Please just let me know if that's OK for you now.

Thanks,
Willy


Re: [DRAFT PATCH] random32: make prandom_u32() output unpredictable

2020-08-19 Thread Willy Tarreau
On Thu, Aug 20, 2020 at 06:42:23AM +0200, Sedat Dilek wrote:
> On Thu, Aug 20, 2020 at 6:33 AM Willy Tarreau  wrote:
> >
> > On Thu, Aug 20, 2020 at 05:05:49AM +0200, Sedat Dilek wrote:
> > > We have the same defines for K0 and K1 in include/linux/prandom.h and
> > > lib/random32.c?
> > > More room for simplifications?
> >
> > Definitely, I'm not surprised at all. As I said, the purpose was to
> > discuss around the proposal, not much more. If we think it's the way
> > to go, some major lifting is required. I just don't want to invest
> > significant time on this if nobody cares.
> >
> 
> OK.
> 
> Right now, I will try with the attached diff.

No, don't waste your time this way, it's not the right way to address it,
you're still facing competition between defines. I'll do another one if
you want to go further in the tests.

> Unclear to me where this modpost "net_rand_noise" undefined! comes from.
> Any hints?

Sure, the symbol is not exported. I'll address it as well for you.

Willy


Re: [DRAFT PATCH] random32: make prandom_u32() output unpredictable

2020-08-19 Thread Willy Tarreau
On Thu, Aug 20, 2020 at 05:05:49AM +0200, Sedat Dilek wrote:
> We have the same defines for K0 and K1 in include/linux/prandom.h and
> lib/random32.c?
> More room for simplifications?

Definitely, I'm not surprised at all. As I said, the purpose was to
discuss around the proposal, not much more. If we think it's the way
to go, some major lifting is required. I just don't want to invest
significant time on this if nobody cares.

Thanks!
Willy


Re: [DRAFT PATCH] random32: make prandom_u32() output unpredictable

2020-08-16 Thread Willy Tarreau
Hi,

so as I mentioned, I could run several tests in our lab with variations
around the various proposals, and came to quite positive conclusions.

Synthetic observations: the connection rate and the SYN cookie rate do not
seem to be affected the same way by the prandom changes. One explanation
is that the connection rates are less stable across reboots. Another
possible explanation is that the larger state update is more sensitive
to cache misses that increase when calling userland. I noticed that the
compiler didn't inline siprand_u32() for me, resulting in one extra
function call and noticeable register clobbering that mostly vanish
once siprand_u32() is inlined, getting back to the original performance.

The noise generation was placed in the xmit calls as discussed, but the
extra function call and state update had a negative effect on performance,
and the noise function alone appeared for up to 0.23% of the CPU usage.
Simplifying the mixing by keeping only one long for the noise and using
one siphash round on the 4 input words to keep only the last one allowed
the code to use very few instructions and to be inlined, making the noise
collection imperceptible in microbenchmarks. The noise is now collected
as follows (I verified that all inputs are used); this performs 3 xor,
2 add and 2 rol, which is amply sufficient and already better than my
initial attempt with a bare add :

  static inline
  void prandom_u32_add_noise(unsigned long a, unsigned long b,
 unsigned long c, unsigned long d)
  { 
/*
 * This is not used cryptographically; it's just
 * a convenient 4-word hash function. (3 xor, 2 add, 2 rol)
 */
a ^= __this_cpu_read(net_rand_noise);
PRND_SIPROUND(a, b, c, d);
__this_cpu_write(net_rand_noise, d);
  }
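
For readers who don't have the kernel header in front of them: PRND_SIPROUND
above is modeled on the standard SipHash round from the SipHash paper
(Aumasson & Bernstein). A self-contained sketch of that permutation, using
a local rol64 helper rather than the kernel's (illustrative only, not the
kernel macro itself), looks roughly like this:

  #include <stdint.h>

  /* Sketch of the standard 64-bit SipHash round that PRND_SIPROUND is
   * modeled on; purely illustrative, not the kernel's definition.
   */
  static inline uint64_t rol64(uint64_t w, unsigned int s)
  {
          return (w << s) | (w >> (64 - s));
  }

  #define SIPROUND_SKETCH(v0, v1, v2, v3) do {                          \
          v0 += v1; v1 = rol64(v1, 13); v1 ^= v0; v0 = rol64(v0, 32);   \
          v2 += v3; v3 = rol64(v3, 16); v3 ^= v2;                       \
          v0 += v3; v3 = rol64(v3, 21); v3 ^= v0;                       \
          v2 += v1; v1 = rol64(v1, 17); v1 ^= v2; v2 = rol64(v2, 32);   \
  } while (0)

One such round per collection point is all the noise function performs,
which is why its cost becomes imperceptible once it is inlined.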

My tests were run on a 6-core 12-thread Core i7-8700k equipped with a 40G
NIC (i40e). I've mainly run two types of tests:

  - connections per second: the machine runs a server which accepts and
closes incoming connections. The load generators aim at it and the
connection rate is measured once it's stabilized.

  - SYN cookie rate: the load generators flood the machine with enough
SYNs to saturate the CPU and the rate of response SYN-ACK is measured.

Both correspond to real world use cases (DDoS protection against SYN flood
and connection flood).

The base kernel was fc80c51f + Eric's patch to add a tracepoint in
prandom_u32(). Another test was made by adding George's changes to use
siphash. Then another test was made with the siprand_u32() function
inlined and with noise stored as a full siphash state. Then one test
was run with the noise reduced to a single long. And a final test was
run with the noise function inlined.

                                  connections   SYN cookies
                                  per second    emitted/s     Notes

  base:                           556k          5.38M

  siphash:                        535k          5.33M

  siphash inlined + noise:        548k          5.40M         add_noise=0.23%

  siphash + single-word noise:    555k          5.45M         add_noise=0.10%

  siphash + single-word
  & inlined noise:                559k          5.38M

Actually the last one is better than the previous one because it also
swallows more packets. There were 10.9M pps in and 5.38M pps out versus
10.77M in and 5.45M out for the previous one. I didn't report the incoming
traffic for the other ones as it was mostly irrelevant and always within
these bounds.

Finally I've added Eric's patch to reuse the skb hash when known in
tcp_conn_request(), and was happy to see the SYN cookies reach 5.45 Mpps
again and the connection rate remain unaffected. A perf record during
the SYN flood showed almost no call to prandom_u32() anymore (just a few
in tcp_rtx_synack()) so this looks like a desirable optimization.

At the moment the code is ugly and in an experimental state (I've pushed all
of it at https://git.kernel.org/pub/scm/linux/kernel/git/wtarreau/prandom.git/).

My impression on this is that given that it's possible to maintain the
same level of performance as we currently have while making the PRNG much
better, there's no more reason for not doing it.

If there's enough interest at this point, I'm OK with restarting from
George's patches and doing the adjustments there. There's still this
prandom_seed() which looks very close to prandom_reseed() and which we
could probably merge back together, but I'd vote for not changing everything
at once, it's ugly enough already. Also I suspect we can have an infinite
loop in prandom_seed() if entropy is 0 and the state is zero as well.
We'd be unlucky but I'd just make sure entropy is not all zeroes. And
running tests on 32-bit would be desirable as well.

Finally one can wonder whether it makes sense to keep Tausworthe for
other cases (basic statistical sampling) or drop it. We could definitely
drop it and simplify everything given that we now have the same level of
performance. But if we do it, what should we do with the test patterns ?
I personally don't t

Re: [DRAFT PATCH] random32: make prandom_u32() output unpredictable

2020-08-14 Thread Willy Tarreau
On Fri, Aug 14, 2020 at 05:32:32PM +0200, Sedat Dilek wrote:
> commit 94c7eb54c4b8e81618ec79f414fe1ca5767f9720
> "random32: add a tracepoint for prandom_u32()"
> 
> ...I gave Willy's patches a try and used the Linux Test Project (LTP)
> for testing.

Just FWIW today I could run several relevant tests with a 40 Gbps NIC
at high connection rates and under SYN flood to stress SYN cookies.
I couldn't post earlier due to a net outage but will post the results
here. In short, what I'm seeing is very good. The only thing is that
the noise collection as-is with the 4 longs takes a bit too much CPU
(0.2% measured), but when keeping only one word we're back to Tausworthe
performance, while using siphash all along.

The noise generation function is so small that we're wasting cycles
calling it and renaming registers. I'll run one more test by inlining
it and exporting the noise.

So, provided some cleanup is done now, I really think we're about to
reach a solution which will satisfy everyone. More on this after I
extract the results.

Willy


Re: [DRAFT PATCH] random32: make prandom_u32() output unpredictable

2020-08-13 Thread Willy Tarreau
On Thu, Aug 13, 2020 at 09:53:11AM +0200, Sedat Dilek wrote:
> On Wed, Aug 12, 2020 at 5:21 AM Willy Tarreau  wrote:
> >
> > On Tue, Aug 11, 2020 at 12:51:43PM +0200, Sedat Dilek wrote:
> > > Can you share this "rebased to mainline" version of George's patch?
> >
> > You can pick it from there if that helps, but keep in mind that
> > it's just experimental code that we use to explain our ideas and
> > that we really don't care a single second what kernel it's applied
> > to:
> >
> >
> > https://git.kernel.org/pub/scm/linux/kernel/git/wtarreau/cleanups.git/log/?h=20200811-prandom-1
> >
> 
> Thanks Willy.
> 
> I disagree: the base for testing should be clear(ly communicated).

It is. As you can see on the log above, this was applied on top of
fc80c51fd4b2, there's nothing special here. In addition we're not even
talking about testing nor calling for testers, just trying to find a
reasonable solution. Maybe today I'll be able to re-run a few tests by
the way.

> There are two diffs from Eric to #1: add a trace event for
> prandom_u32() and #2: a removal of prandom_u32() call in
> tcp_conn_request().
> In case you have not seen.

I've seen, just not had the time to test yet.

Willy


Re: [DRAFT PATCH] random32: make prandom_u32() output unpredictable

2020-08-11 Thread Willy Tarreau
On Tue, Aug 11, 2020 at 12:51:43PM +0200, Sedat Dilek wrote:
> Can you share this "rebased to mainline" version of George's patch?

You can pick it from there if that helps, but keep in mind that
it's just experimental code that we use to explain our ideas and
that we really don't care a single second what kernel it's applied
to:

   
https://git.kernel.org/pub/scm/linux/kernel/git/wtarreau/cleanups.git/log/?h=20200811-prandom-1

Willy


Re: [DRAFT PATCH] random32: make prandom_u32() output unpredictable

2020-08-10 Thread Willy Tarreau
On Tue, Aug 11, 2020 at 05:26:49AM +, George Spelvin wrote:
> On Mon, Aug 10, 2020 at 11:04:55PM +0200, Willy Tarreau wrote:
> > What could be improved is the way the input values are mixed (just
> > added hence commutative for now). I didn't want to call a siphash
> > round on the hot paths, but just shifting the previous noise value
> > before adding would work, such as the following for example:
> > 
> >   void prandom_u32_add_noise(a, b, c, d)
> >   {
> > unsigned long *noise = get_cpu_ptr(&net_rand_noise);
> > 
> >   #if BITS_PER_LONG == 64
> > *noise = rol64(*noise, 7) + a + b + c + d;
> >   #else
> > *noise = rol32(*noise, 7) + a + b + c + d;
> >   #endif
> > put_cpu_ptr(&net_rand_noise);
> > 
> >   }
> 
> If you think this is enough seed material, I'm fine with it.
> 
> I don't hugely like the fact that you sum all the inputs, since
> entropy tends to be concentrated in the low-order words, and summing
> risks cancellation.

Yes, I figured this. But I thought it was still better than
a pure xor, which would cancel the high bits from pointers.

> You can't afford even one SIPROUND as a non-cryptographic hash?  E.g.

That's what I mentioned above, I'm still hesitating. I need to test.

> 
> DEFINE_PER_CPU(unsigned long[4], net_rand_noise);
> EXPORT_SYMBOL(net_rand_noise);
> 
> void prandom_u32_add_noise(a, b, c, d)
> {
>   unsigned long *noise = get_cpu_ptr(&net_rand_noise);
> 
>   a ^= noise[0];
>   b ^= noise[1];
>   c ^= noise[2];
>   d ^= noise[3];
>   /*
>* This is not used cryptographically; it's just
>* a convenient 4-word hash function.
>*/
>   SIPROUND(a, b, c, d);
>   noise[0] = a;
>   noise[1] = b;
>   noise[2] = c;
>   put_cpu_ptr(&net_rand_noise);
> }
> 
> (And then you mix in net_rand_noise[0].)
> 
> Other options are HASH_MIX() from fs/namei.c, but that's
> more sequential.
> 
> There's also a simple Xorshift generator.

I think a xorshift on each value will have roughly the same cost
as a single SIPROUND. But I haven't completely ruled out these
options yet; I still need to test. If we lose a few cycles per packet,
that might be OK.

Willy


Re: [DRAFT PATCH] random32: make prandom_u32() output unpredictable

2020-08-10 Thread Willy Tarreau
On Tue, Aug 11, 2020 at 03:47:24AM +, George Spelvin wrote:
> On Mon, Aug 10, 2020 at 01:47:00PM +0200, Willy Tarreau wrote:
> > except that I retrieve it only on 1/8 calls
> > and use the previous noise in this case.
> 
> Er... that's quite different.  I was saying you measure them all, and do:

That was my first approach and it resulted in a significant performance
loss, hence the change (and the resulting ugly construct with the counter).

> > +   if (++s->count >= 8) {
> > +   v3 ^= s->noise;
> > +   s->noise += random_get_entropy();
> > +   s->count = 0;
> > +   }
> > +
> 
> - Can you explain why you save the "noise" until next time?  Is this meant to
>   make it harder for an attacker to observe the time?

Just to make the observable call not depend on immediately measurable TSC
values. It's weak, but the point was that when mixing attack traffic with
regular traffic, if you can measure the TSC timing variations to know when
it was sampled but don't see its resulting effect at the same time, it's
harder to analyse than when you have both at once.

> - How about doing away with s->count and making it statistical:
> 
> + if ((v3 & 7) == 0)
> + v3 ^= random_get_entropy();

Absolutely. I just kept the counter from the previous attempt. But Linus
prefers that we completely remove TSC reads from these direct call paths
because of old VMs that were slow at this. We could collect the noise
somewhere else once in a while instead.

Willy


Re: [DRAFT PATCH] random32: make prandom_u32() output unpredictable

2020-08-10 Thread Willy Tarreau
Linus, George, Florian,

would something in this vein be OK in your opinion ?

- update_process_times still updates the noise
- we don't touch the fast_pool anymore
- we don't read any TSC on hot paths
- we update the noise in xmit from jiffies and a few pointer values instead

I've applied it on top of George's patch rebased to mainline for simplicity.
I've used a separate per_cpu noise variable to keep the net_rand_state static
with its __latent_entropy.

With this I'm even getting very slightly better performance than with
the previous patch (195.7 - 197.8k cps).

What could be improved is the way the input values are mixed (just
added hence commutative for now). I didn't want to call a siphash
round on the hot paths, but just shifting the previous noise value
before adding would work, such as the following for example:

  void prandom_u32_add_noise(a, b, c, d)
  {
unsigned long *noise = get_cpu_ptr(&net_rand_noise);

  #if BITS_PER_LONG == 64
*noise = rol64(*noise, 7) + a + b + c + d;
  #else
*noise = rol32(*noise, 7) + a + b + c + d;
  #endif
put_cpu_ptr(&net_rand_noise);

  }

Thanks,
Willy

---


diff --git a/drivers/char/random.c b/drivers/char/random.c
index d20ba1b104ca..2a41b21623ae 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -1277,7 +1277,6 @@ void add_interrupt_randomness(int irq, int irq_flags)
 
fast_mix(fast_pool);
add_interrupt_bench(cycles);
-   this_cpu_add(net_rand_state.s1, fast_pool->pool[cycles & 3]);
 
if (unlikely(crng_init == 0)) {
if ((fast_pool->count >= 64) &&
diff --git a/include/linux/prandom.h b/include/linux/prandom.h
index aa16e6468f91..e2b4990f2126 100644
--- a/include/linux/prandom.h
+++ b/include/linux/prandom.h
@@ -20,7 +20,7 @@ struct rnd_state {
__u32 s1, s2, s3, s4;
 };
 
-DECLARE_PER_CPU(struct rnd_state, net_rand_state);
+DECLARE_PER_CPU(unsigned long, net_rand_noise);
 
 u32 prandom_u32_state(struct rnd_state *state);
 void prandom_bytes_state(struct rnd_state *state, void *buf, size_t nbytes);
@@ -67,6 +67,7 @@ static inline void prandom_seed_state(struct rnd_state *state, u64 seed)
state->s2 = __seed(i,   8U);
state->s3 = __seed(i,  16U);
state->s4 = __seed(i, 128U);
+   __this_cpu_add(net_rand_noise, i);
 }
 
 /* Pseudo random number generator from numerical recipes. */
diff --git a/kernel/time/timer.c b/kernel/time/timer.c
index ae5029f984a8..2f07c569af77 100644
--- a/kernel/time/timer.c
+++ b/kernel/time/timer.c
@@ -1721,7 +1721,7 @@ void update_process_times(int user_tick)
 * non-constant value that's not affine to the number of calls to make
 * sure it's updated when there's some activity (we don't care in idle).
 */
-   this_cpu_add(net_rand_state.s1, rol32(jiffies, 24) + user_tick);
+   __this_cpu_add(net_rand_noise, rol32(jiffies, 24) + user_tick);
 }
 
 /**
diff --git a/lib/random32.c b/lib/random32.c
index 2b048e2ea99f..d74cf1db8cc9 100644
--- a/lib/random32.c
+++ b/lib/random32.c
@@ -320,6 +320,8 @@ struct siprand_state {
 };
 
 static DEFINE_PER_CPU(struct siprand_state, net_rand_state) __latent_entropy;
+DEFINE_PER_CPU(unsigned long, net_rand_noise);
+EXPORT_SYMBOL(net_rand_noise);
 
 #if BITS_PER_LONG == 64
 /*
@@ -334,7 +336,7 @@ static DEFINE_PER_CPU(struct siprand_state, net_rand_state) __latent_entropy;
 #define K0 (0x736f6d6570736575 ^ 0x6c7967656e657261 )
 #define K1 (0x646f72616e646f6d ^ 0x7465646279746573 )
 
-#elif BITS_PER_LONG == 23
+#elif BITS_PER_LONG == 32
 /*
  * On 32-bit machines, we use HSipHash, a reduced-width version of SipHash.
  * This is weaker, but 32-bit machines are not used for high-traffic
@@ -374,9 +376,12 @@ static DEFINE_PER_CPU(struct siprand_state, net_rand_state) __latent_entropy;
 static u32 siprand_u32(struct siprand_state *s)
 {
unsigned long v0 = s->v[0], v1 = s->v[1], v2 = s->v[2], v3 = s->v[3];
+   unsigned long n = __this_cpu_read(net_rand_noise);
 
+   v3 ^= n;
SIPROUND(v0, v1, v2, v3);
SIPROUND(v0, v1, v2, v3);
+   v0 ^= n;
s->v[0] = v0;  s->v[1] = v1;  s->v[2] = v2;  s->v[3] = v3;
return v1 + v3;
 }
diff --git a/net/core/dev.c b/net/core/dev.c
index 7df6c9617321..55a2471cd75b 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -144,6 +144,7 @@
 #include 
 #include 
 #include 
+#include <linux/prandom.h>
 
 #include "net-sysfs.h"
 
@@ -3557,6 +3558,7 @@ static int xmit_one(struct sk_buff *skb, struct net_device *dev,
dev_queue_xmit_nit(skb, dev);
 
len = skb->len;
+   __this_cpu_add(net_rand_noise, (long)skb + (long)dev + (long)txq + len + jiffies);
trace_net_dev_start_xmit(skb, dev);
rc = netdev_start_xmit(skb, dev, txq, more);
trace_net_dev_xmit(skb, rc, dev, len);
@@ -4129,6 +4131,7 @@ static int __dev_queue_xmit(struct sk_buff *skb, struct net_device *sb_dev)
if (!skb)
goto out;
 
+ 

Re: [DRAFT PATCH] random32: make prandom_u32() output unpredictable

2020-08-10 Thread Willy Tarreau
On Mon, Aug 10, 2020 at 10:45:26AM -0700, Linus Torvalds wrote:
> On Mon, Aug 10, 2020 at 9:59 AM Willy Tarreau  wrote:
> >
> > I took what we were already using in add_interrupt_randomness() since
> > I considered that if it was acceptable there, it probably was elsewhere.
> 
> Once you've taken an interrupt, you're doing IO anyway, and the
> interrupt costs will dominate anything you do.
> 
> But the prandom_u32() interface is potentially done many times per
> interrupt. For all I know it's done inside fairly critical locks etc
> too.
> 
> So I don't think one usage translates to another very well.

Possible, hence the better solution of just feeding noise into the hot paths.
Using jiffies and the skb pointer in xmit is better than nothing anyway.
I'm not seeking anything extremist, I just want to make sure we don't
get yet-another-report on this area next summer just because some
researcher using two VMs discovers that attacking a 100% idle VM from
another one running on the same CPU core is trivial after having stolen
some memory data.  If at least such an attack is boring and rarely
works, the rest of the job is provided by siphash and we should be fine.

Willy


Re: [DRAFT PATCH] random32: make prandom_u32() output unpredictable

2020-08-10 Thread Willy Tarreau
On Mon, Aug 10, 2020 at 09:31:48AM -0700, Linus Torvalds wrote:
> On Mon, Aug 10, 2020 at 4:47 AM Willy Tarreau  wrote:
> >
> > Doing testing on real hardware showed that retrieving the TSC on every
> > call had a non negligible cost, causing a loss of 2.5% on the accept()
> > rate and 4% on packet rate when using iptables -m statistics.
> 
> And by "real hardware" I assume you mean x86, with a fairly fast and
> high-performance TSC for get_random_entropy().

Yep.

> Reading the TSC takes on the order of 20-50 cycles, iirc.
> 
> But it can actually be *much* more expensive. On non-x86, it can be an
> IO cycle to external chips.

I took what we were already using in add_interrupt_randomness() since
I considered that if it was acceptable there, it probably was elsewhere.

> And on older hardware VM's in x86, it can be a vm exit etc, so
> thousands of cycles. I hope nobody uses those VM's any more, but it
> would be a reasonable test-case for some non-x86 implementations, so..

Yes, I remember these ones, they were not fun at all.

> IOW, no. You guys are - once again - ignoring reality.

I'm not ignoring reality, quite the opposite, trying to take all knowledge
into account. If I'm missing some points, fine. But if we were already
calling that in the interrupt handler I expected that this would be OK.

The alternative Florian talked about is quite interesting as well,
which is to collect some cheap noise in the network rx/tx paths since
these are the areas we care about.

Willy


Re: [DRAFT PATCH] random32: make prandom_u32() output unpredictable

2020-08-10 Thread Willy Tarreau
On Mon, Aug 10, 2020 at 02:03:02PM +0200, Florian Westphal wrote:
> As this relates to networking, you could also hook perturbation into rx/tx
> softirq processing.  E.g. once for each new napi poll round or only once
> for each softnet invocation, depending on cost.

I was wondering what/where to add some processing. I was thinking about
having a per_cpu "noise" variable that would get mixed with the randoms
when generated. This "noise" would need to be global so that we can easily
append to it. For example on the network path it would be nice to use
checksums, but nowadays they're usually not calculated by the CPU anymore
since checksumming is offloaded to the NIC.

> IIRC the proposed draft left a unused prandom_seed() stub around, you could
> re-use that to place extra data to include in the hash in percpu data.

Probably. What I saw was that prandom_seed() expected to perform a full
(hence slow) reseed. Instead I'd like to do something cheap. I like the
principle of "noise" in that it doesn't promise to bring any security,
only to perturb a little bit.

Willy


Re: [DRAFT PATCH] random32: make prandom_u32() output unpredictable

2020-08-10 Thread Willy Tarreau
On Mon, Aug 10, 2020 at 12:01:11PM +, David Laight wrote:
> >  /*
> >   * On 32-bit machines, we use HSipHash, a reduced-width version of SipHash.
> >   * This is weaker, but 32-bit machines are not used for high-traffic
> > @@ -375,6 +377,12 @@ static u32 siprand_u32(struct siprand_state *s)
> >  {
> > unsigned long v0 = s->v[0], v1 = s->v[1], v2 = s->v[2], v3 = s->v[3];
> > 
> > +   if (++s->count >= 8) {
> > +   v3 ^= s->noise;
> > +   s->noise += random_get_entropy();
> > +   s->count = 0;
> > +   }
> > +
> 
> Using:
>   if (s->count-- <= 0) {
>   ...
>   s->count = 8;
>   }
> probably generates better code.
> Although you may want to use a 'signed int' instead of 'unsigned long'.

Yeah I know, it's just because I only slightly changed the previous code
there. I had an earlier version that kept the rand state fully padded
when storing intermediate values. That's among the final cleanups I'll
bring if we go down that route.

Thanks!

Willy


Re: [DRAFT PATCH] random32: make prandom_u32() output unpredictable

2020-08-10 Thread Willy Tarreau
Hi George,

On Sun, Aug 09, 2020 at 06:30:17PM +, George Spelvin wrote:
> Even something simple like buffering 8 TSC samples, and adding them
> at 32-bit offsets across the state every 8th call, would make a huge
> difference.

Doing testing on real hardware showed that retrieving the TSC on every
call had a non-negligible cost, causing a loss of 2.5% on the accept()
rate and 4% on the packet rate when using iptables -m statistics. However
I reused your idea of accumulating old TSCs to increase the uncertainty
about their exact value, except that I retrieve it only on 1/8 calls
and use the previous noise in this case. With this I observe the same
performance as plain 5.8. Below are the connection rates accepted on
a single core :

       5.8               5.8+patch          5.8+patch+tsc
       192900-197900     188800->192200     194500-197500    (conn/s)

This was on a core i7-8700K. I looked at the asm code for the function
and it remains reasonably light, in the same order of complexity as the
original one, so I think we could go with that.

My proposed change is below, in case you have any improvements to suggest.

Regards,
Willy


diff --git a/lib/random32.c b/lib/random32.c
index 2b048e2ea99f..a12d63028106 100644
--- a/lib/random32.c
+++ b/lib/random32.c
@@ -317,6 +317,8 @@ static void __init prandom_state_selftest(void)
 
 struct siprand_state {
unsigned long v[4];
+   unsigned long noise;
+   unsigned long count;
 };
 
 static DEFINE_PER_CPU(struct siprand_state, net_rand_state) __latent_entropy;
@@ -334,7 +336,7 @@ static DEFINE_PER_CPU(struct siprand_state, net_rand_state) __latent_entropy;
 #define K0 (0x736f6d6570736575 ^ 0x6c7967656e657261 )
 #define K1 (0x646f72616e646f6d ^ 0x7465646279746573 )
 
-#elif BITS_PER_LONG == 23
+#elif BITS_PER_LONG == 32
 /*
  * On 32-bit machines, we use HSipHash, a reduced-width version of SipHash.
  * This is weaker, but 32-bit machines are not used for high-traffic
@@ -375,6 +377,12 @@ static u32 siprand_u32(struct siprand_state *s)
 {
unsigned long v0 = s->v[0], v1 = s->v[1], v2 = s->v[2], v3 = s->v[3];
 
+   if (++s->count >= 8) {
+   v3 ^= s->noise;
+   s->noise += random_get_entropy();
+   s->count = 0;
+   }
+
SIPROUND(v0, v1, v2, v3);
SIPROUND(v0, v1, v2, v3);
s->v[0] = v0;  s->v[1] = v1;  s->v[2] = v2;  s->v[3] = v3;


Re: [DRAFT PATCH] random32: make prandom_u32() output unpredictable

2020-08-09 Thread Willy Tarreau
On Sun, Aug 09, 2020 at 06:30:17PM +, George Spelvin wrote:
> I'm trying to understand your attack scenario.  I'm assuming that an
> attacker can call prandom_u32() locally.

Well, first, I'm mostly focusing on remote attacks, because if the
attacker has permanent local access, he can just as well bind()
a source port instead of having to guess the next one that will be
used. Second, I'm not aware of direct access from userland, so the
calls to prandom_u32() are expected to be done only via specific
code parts. The worst case (in my opinion) is when the attacker
runs on the same physical CPU and exploits some memory leakage to
find the internal state and uses the same TSC, but can only trigger
prandom_u32() via the network, hence with much less precision.

(...)
> Even something simple like buffering 8 TSC samples, and adding them
> at 32-bit offsets across the state every 8th call, would make a huge
> difference.
> 
> Even if 7 of the timestamps are from attacker calls (4 bits
> uncertainty), the time of the target call is 8x less known
> (so it goes from 15 to 18 bits), and the total becomes
> 46 bits.  *So* much better.

Maybe. I'm just suggesting to add *some* noise to keep things not
exploitable if the state is stolen once in a while (which would
already be a big problem, admittedly).

> > I can run some tests on this as
> > well. I'd really need to try on a cleaner setup, I have remote machines
> > at the office but I don't feel safe enough to remotely reboot them and
> > risk to lose them :-/
> 
> Yeah, test kernels are nervous-making that way.

In fact I haven't booted a 5.8 kernel yet on the machine I'm thinking
about, and I can't remote-control its power supply, so I'm worried about
a dirty hang at boot by missing a config entry. The risk is low but
that's annoying when it happens.

> > I'll also test on arm/arm64 to make sure we don't introduce a significant
> > cost there.
> 
> I don't expect a problem, but SipHash is optimized for 4-issue processors,
> and on narrower machines fewer instructions are "free".

OK.

Willy


Re: [DRAFT PATCH] random32: make prandom_u32() output unpredictable

2020-08-09 Thread Willy Tarreau
On Sun, Aug 09, 2020 at 05:06:39PM +, George Spelvin wrote:
> On Sun, Aug 09, 2020 at 11:38:05AM +0200, Willy Tarreau wrote:
> > So I gave it a quick test under Qemu and it didn't show any obvious
> > performance difference compared to Tausworthe, which is a good thing,
> > even though there's a significant amount of measurement noise in each
> > case.
> 
> Thank you very much!  I'm not quite sure how to benchmark this.
> The whole idea is that it's *not* used in a tight cache-hot loop.
> Hopefully someone already has a test setup so I don't have to invent
> one.

Due to limited access to representative hardware, the two main tests
I've been running were on my laptop in qemu, and consisted of:

   - a connect-accept-close test to stress the accept() code and
 verify we don't observe a significant drop. The thing is that
 connect() usually is much slower and running the two on the
 same machine tends to significantly soften the differences
 compared to what a real machine would see when handling a
 DDoS for example.

   - a packet rate test through this rule (which uses prandom_u32()
 for each packet and which matches what can be done in packet
 schedulers or just by users having to deal with random drop) :

 iptables -I INPUT -m statistic --probability 0.5 -j ACCEPT

While these ones are not very relevant, especially in a VM, not
seeing significant variations tends to indicate we should not see
a catastrophic loss.
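
For illustration, a minimal sketch of what the connect-accept-close loop
described above could look like (purely hypothetical code, not the actual
test tool: a single process over loopback, arbitrary port 8080, error
handling omitted for brevity):

  #include <arpa/inet.h>
  #include <stdio.h>
  #include <string.h>
  #include <sys/socket.h>
  #include <time.h>
  #include <unistd.h>

  int main(void)
  {
          const long total = 10000;  /* stay below the ephemeral port range */
          struct sockaddr_in addr;
          struct timespec t0, t1;
          int lfd, cfd, afd;
          double secs;
          long i;

          memset(&addr, 0, sizeof(addr));
          addr.sin_family = AF_INET;
          addr.sin_port = htons(8080);               /* arbitrary example port */
          addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);

          lfd = socket(AF_INET, SOCK_STREAM, 0);
          bind(lfd, (struct sockaddr *)&addr, sizeof(addr));
          listen(lfd, 128);

          clock_gettime(CLOCK_MONOTONIC, &t0);
          for (i = 0; i < total; i++) {
                  /* one connect-accept-close round trip over loopback */
                  cfd = socket(AF_INET, SOCK_STREAM, 0);
                  connect(cfd, (struct sockaddr *)&addr, sizeof(addr));
                  afd = accept(lfd, NULL, NULL);
                  close(afd);
                  close(cfd);
          }
          clock_gettime(CLOCK_MONOTONIC, &t1);

          secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
          printf("%ld connections in %.3fs (%.0f conn/s)\n",
                 total, secs, total / secs);
          return 0;
  }

A real setup would of course run the load generators on separate machines,
precisely because, as noted above, running connect() and accept() on the
same host softens the differences.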

> > However it keeps the problem that the whole sequence is entirely
> > determined at the moment of reseeding, so if one were to be able to
> > access the state, e.g. using l1tf/spectre/meltdown/whatever, then
> > this state could be used to predict the whole ongoing sequence for
> > the next minute. What some view as a security feature, others will
> > see as a backdoor :-/  That's why I really like the noise approach.
> > Even just the below would significantly harm that capability because
> > that state alone isn't sufficient anymore to pre-compute all future
> > values:
> > 
> > --- a/lib/random32.c
> > +++ b/lib/random32.c
> > @@ -375,6 +375,7 @@ static u32 siprand_u32(struct siprand_state *s)
> >  {
> > unsigned long v0 = s->v[0], v1 = s->v[1], v2 = s->v[2], v3 = 
> > s->v[3];
> >  
> > +   v0 += get_cycles();
> > SIPROUND(v0, v1, v2, v3);
> > SIPROUND(v0, v1, v2, v3);
> > s->v[0] = v0;  s->v[1] = v1;  s->v[2] = v2;  s->v[3] = v3;
> 
> As long as:
> 1) The periodic catastrophic reseeding remains, and
> 2) You use fresh measurements, not the exact same bits
>that add_*_randomness feeds into /dev/random
> then it doesn't do any real harm, so if it makes you feel better...
> 
> But I really want to stress how weak a design drip-reseeding is.
> 
> If an attacker has enough local machine access to do a meltdown-style attack,
> then they can calibrate the TSC used in get_cycles very accurately,

Absolutely.

> so the
> remaining timing uncertainty is very low.

Not that low in fact, because they don't know precisely when the call is
made. I mean, let's say we're in the worst case, with two VMs running on
two siblings of the same core, with the same TSC, on a 3 GHz machine. The
attacker can stress the victim at 100k probes per second, which still leaves
about 15 bits of uncertainty on the TSC value estimation added to each call
(3 GHz / 100k probes/s = 30,000 cycles between two probes, i.e. close to
2^15). Even on the first call this is enough to make a source port
unguessable, and it prevents the attacker from staying synchronized with
its victim. And I'm only speaking about an idle remote machine, not even
one taking unobservable traffic, which further adds to the difficulty.

> This makes a brute-force attack on
> one or two reseedings quite easy.  I.e. if you can see every other output,
> It's straightforward to figure out the ones in between.

But they already become useless because you're only observing stuff of
the past.

> I wonder if, on general principles, it would be better to use a more
> SipHash style mixing in of the sample:
>   m = get_cycles();
>   v3 ^= m;
>   SIPROUND(v0, v1, v2, v3);
>   SIPROUND(v0, v1, v2, v3);
>   v0 ^= m;

Probably, if it's the recommended way to mix in other values, yes.

> Not sure if it's worth the extra register (and associated spill/fill).

If this makes the hash better, maybe. I can run some tests on this as
well. I'd really need to try on a cleaner setup; I have remote machines
at the office but I don't feel safe enough to remotely reboot them and
risk losing them :-/

I'll also test on arm/arm64 to make sure we don't introduce a significant
cost there.

Willy


Re: [DRAFT PATCH] random32: make prandom_u32() output unpredictable

2020-08-09 Thread Willy Tarreau
Hi George,

On Sun, Aug 09, 2020 at 06:57:44AM +, George Spelvin wrote:
> +/*
> + * This is the core CPRNG function.  As "pseudorandom", this is not used
> + * for truly valuable things, just intended to be a PITA to guess.
> + * For maximum speed, we do just two SipHash rounds per word.  This is
> + * the same rate as 4 rounds per 64 bits that SipHash normally uses,
> + * so hopefully it's reasonably secure.
> + *
> + * There are two changes from the official SipHash finalization:
> + * - We omit some constants XORed with v2 in the SipHash spec as irrelevant;
> + *   they are there only to make the output rounds distinct from the input
> + *   rounds, and this application has no input rounds.
> + * - Rather than returning v0^v1^v2^v3, return v1+v3.
> + *   If you look at the SipHash round, the last operation on v3 is
> + *   "v3 ^= v0", so "v0 ^ v3" just undoes that, a waste of time.
> + *   Likewise "v1 ^= v2".  (The rotate of v2 makes a difference, but
> + *   it still cancels out half of the bits in v2 for no benefit.)
> + *   Second, since the last combining operation was xor, continue the
> + *   pattern of alternating xor/add for a tiny bit of extra non-linearity.
> + */
> +static u32 siprand_u32(struct siprand_state *s)
> +{
> + unsigned long v0 = s->v[0], v1 = s->v[1], v2 = s->v[2], v3 = s->v[3];
> +
> + SIPROUND(v0, v1, v2, v3);
> + SIPROUND(v0, v1, v2, v3);
> + s->v[0] = v0;  s->v[1] = v1;  s->v[2] = v2;  s->v[3] = v3;
> + return v1 + v3;
> +}
> +
> +
> +/**
> + *   prandom_u32 - pseudo random number generator
> + *
> + *   A 32 bit pseudo-random number is generated using a fast
> + *   algorithm suitable for simulation. This algorithm is NOT
> + *   considered safe for cryptographic use.
> + */
> +u32 prandom_u32(void)
> +{
> + struct siprand_state *state = get_cpu_ptr(&net_rand_state);
> + u32 res = siprand_u32(state);
> +
> + put_cpu_ptr(&net_rand_state);
> + return res;
> +}

So I gave it a quick test under Qemu and it didn't show any obvious
performance difference compared to Tausworthe, which is a good thing,
even though there's a significant amount of measurement noise in each
case.

However it keeps the problem that the whole sequence is entirely
determined at the moment of reseeding, so if one were to be able to
access the state, e.g. using l1tf/spectre/meltdown/whatever, then
this state could be used to predict the whole ongoing sequence for
the next minute. What some view as a security feature, others will
see as a backdoor :-/  That's why I really like the noise approach.
Even just the below would significantly harm that capability because
that state alone isn't sufficient anymore to pre-compute all future
values:

--- a/lib/random32.c
+++ b/lib/random32.c
@@ -375,6 +375,7 @@ static u32 siprand_u32(struct siprand_state *s)
 {
unsigned long v0 = s->v[0], v1 = s->v[1], v2 = s->v[2], v3 = s->v[3];
 
+   v0 += get_cycles();
SIPROUND(v0, v1, v2, v3);
SIPROUND(v0, v1, v2, v3);
s->v[0] = v0;  s->v[1] = v1;  s->v[2] = v2;  s->v[3] = v3;

Regards,
Willy


Re: Flaw in "random32: update the net random state on interrupt and activity"

2020-08-08 Thread Willy Tarreau
Hi Florian,

On Sat, Aug 08, 2020 at 09:18:27PM +0200, Florian Westphal wrote:
> >  struct rnd_state {
> > -   __u32 s1, s2, s3, s4;
> > +   siphash_key_t key;
> > +   uint64_t counter;
> >  };
> 
> Does the siphash_key really need to be percpu?

I don't think so, I was really exploring a quick and easy change, and
given that it was convenient to put that into rnd_state to ease
adaptation to existing code, I left it here.

> The counter is different of course.
> Alternative would be to siphash a few prandom_u32 results
> if the extra u64 is too much storage.

I don't think we need to make it complicated. If we can implement a
partial siphash that's fast enough and that makes everyone happy, it
will also not take too much space so that could become an overall
improvement (and I'd say that reconciling everyone would also be a
huge improvement).

> >  DECLARE_PER_CPU(struct rnd_state, net_rand_state);
> > @@ -161,12 +163,14 @@ static inline u32 __seed(u32 x, u32 m)
> >   */
> >  static inline void prandom_seed_state(struct rnd_state *state, u64 seed)
> >  {
> > +#if 0
> > u32 i = (seed >> 32) ^ (seed << 10) ^ seed;
> >  
> > state->s1 = __seed(i,   2U);
> > state->s2 = __seed(i,   8U);
> > state->s3 = __seed(i,  16U);
> > state->s4 = __seed(i, 128U);
> > +#endif
> >  }
> [..]
> 
> Can't we keep prandom_u32 as-is...?  Most of the usage, esp. in the
> packet schedulers, is fine.
>
> I'd much rather have a prandom_u32_hashed() or whatever for
> those cases where some bits might leak to the outside and then convert
> those prandom_u32 users over to the siphashed version.

In fact I even thought about having a different name, such as "weak_u32"
or something like this for the parts where we need slightly more than a
purely predictable random, and keeping the existing function as-is. But
I discovered there's also another one which is sufficient for stats, it
just doesn't have the same interface. But I totally agree on the benefit
of keeping the fastest version available for packet schedulers. In fact
I've run some of my tests with iptables -m statistics --probability ...
However if it turns out we can end up on a very fast PRNG (Tausworthe
was not *that* fast), it would be 100% benefit to turn to it. That's why
I don't want to rule out the possibility of a simple drop-in replacement.

By the way, if we end up keeping a different function for simpler cases,
we should probably find a better name for them like "predictable random"
or "trivial random" that makes users think twice. Doing that in a packet
scheduler is fine. The problem with the "pseudo random" term is that it
evolved over time from "totally predictable" to "may be strongly secure"
depending on implementations and that various people understand it
differently and have different expectations on it :-/

Cheers,
Willy


Re: Flaw in "random32: update the net random state on interrupt and activity"

2020-08-08 Thread Willy Tarreau
On Sat, Aug 08, 2020 at 11:19:01AM -0700, Linus Torvalds wrote:
> On Sat, Aug 8, 2020 at 10:45 AM Willy Tarreau  wrote:
> >
> >
> > WIP: random32: use siphash with a counter for prandom_u32
> 
> siphash is good.
> 
> But no:
> 
> > --- a/drivers/char/random.c
> > +++ b/drivers/char/random.c
> > @@ -1277,7 +1277,6 @@ void add_interrupt_randomness(int irq, int irq_flags)
> >
> > fast_mix(fast_pool);
> > add_interrupt_bench(cycles);
> > -   this_cpu_add(net_rand_state.s1, fast_pool->pool[cycles & 3]);
> >
> > if (unlikely(crng_init == 0)) {
> > if ((fast_pool->count >= 64) &&
> > --- a/include/linux/random.h
> > +++ b/include/linux/random.h
> > diff --git a/kernel/time/timer.c b/kernel/time/timer.c
> > index 026ac01af9da..c9d839c2b179 100644
> > --- a/kernel/time/timer.c
> > +++ b/kernel/time/timer.c
> > @@ -1743,13 +1743,6 @@ void update_process_times(int user_tick)
> > scheduler_tick();
> > if (IS_ENABLED(CONFIG_POSIX_TIMERS))
> > run_posix_cpu_timers();
> > -
> > -   /* The current CPU might make use of net randoms without receiving IRQs
> > -* to renew them often enough. Let's update the net_rand_state from a
> > -* non-constant value that's not affine to the number of calls to make
> > -* sure it's updated when there's some activity (we don't care in idle).
> > -*/
> > -   this_cpu_add(net_rand_state.s1, rol32(jiffies, 24) + user_tick);
> >  }
> 
> We're not going back to "don't add noise, just make a stronger
> analyzable function".
> 
> I simply won't take it. See my previous email. I'm 100% fed up with
> security people screwing up real security by trying to make things
> overly analyzable.
> 
> If siphash is a good enough hash to make the pseudo-random state hard
> to guess, then it's also a good enough hash to hide the small part of
> the fast-pool we mix in.

I'm totally fine with that. In fact, my secret goal there was to put
net_rand_state back to random32.c as a static and reinstall the
__latent_entropy that we had to remove :-)

I'll need to re-run more rigorous tests though. My measurements were
really erratic, with some of them showing half the HTTP request rate in
a test, which makes absolutely no sense and thus disqualifies my
measurements.

But if the results are correct enough I'm fine with continuing on this
one and forgetting MSWS.

> And while security researchers may hate it because it's hard to
> analyze, that's the POINT, dammit.

Actually I think there are two approaches, which explains the constant
disagreements in this area. Some people want something provably
difficult, and others want something that cannot be proven to be easy. 
Sadly by trying to please everyone we probably got something between
provably easy and not provably difficult :-/

Willy


Re: Flaw in "random32: update the net random state on interrupt and activity"

2020-08-08 Thread Willy Tarreau
On Sat, Aug 08, 2020 at 10:07:51AM -0700, Andy Lutomirski wrote:
> 
> > On Aug 8, 2020, at 8:29 AM, George Spelvin  wrote:
> > 
> 
> > And apparently switching to the fastest secure PRNG currently
> > in the kernel (get_random_u32() using ChaCha + per-CPU buffers)
> > would cause too much performance penalty.
> 
> Can someone explain *why* the slow path latency is particularly relevant
> here?  What workload has the net code generating random numbers in a place
> where even a whole microsecond is a problem as long as the amortized cost is
> low?  (I'm not saying I won't believe this matters, but it's not obvious to
> me that it matters.)

It really depends on what is being done and how we want that code to
continue to be used. Often it will be a matter of queue sizes or
concurrency in certain contexts under a lock. For illustration let's
say you want to use randoms to choose a sequence number for a SYN
cookie (it's not what is done today): a 40G NIC can deliver 60 Mpps,
i.e. one packet every ~16 ns. If you periodically add 1 us, you end up
queuing about 60 extra packets when this happens. This might
or might not be acceptable. Regarding concurrency, you could imagine
some code picking a random in a small function running under a
spinlock. Frequently adding 1us there can be expensive.

To be clear, I'm not *that* worried about *some* extra latency;
however I'm extremely worried about the risk of it increasing once
some security researcher considers the PRNG not secure enough again
(especially once it starts to be used for security purposes after
having been claimed secure) and we replace it with another one
showing a higher cost and a longer tail to amortize that cost. And given
that we're speaking about replacing a non-secure PRNG with *something*,
it doesn't seem really wise to start to introduce new constraints on
the data path when there are probably a large enough number of
possibilities which do not require these constraints.

Last point, we must not rule out the possibility that some clever
researcher will later come up with new time-based attacks consisting
of observing latencies of packet processing, explaining how they can
count the number of connections reaching a given host and reveal
certain useful information, and that we're then forced to slow down
all of them to proceed in constant time.

Just my two cents,
Willy


Re: Flaw in "random32: update the net random state on interrupt and activity"

2020-08-08 Thread Willy Tarreau
On Sat, Aug 08, 2020 at 03:26:28PM +, George Spelvin wrote:
> So any discussion of reseeding begins by *assuming* an attacker has
> captured the PRNG state.  If we weren't worried about this possibility,
> we wouldn't need to reseed in the first place!

Sure, but what matters is the time it takes to capture the state.
In the previous case it used to take Amit a few packets only. If it
takes longer than the reseeding period, then it's not captured anymore.
That was exactly the point there: break the certainty of the observations
so that the state cannot be reliably guessed anymore, hence cancelling
the concerns about the input bits being guessed. For sure if it is
assumed that the state is guessed nevertheless, this doesn't work
anymore, but that wasn't the assumption.

> Just like a buffer overflow, a working attack is plausible using a
> combination of well-understood techniques.  It's ridiculous to go to
> the effort of developing a full exploit when it's less work to fix the
> problem in the first place.

I totally agree with this principle. However, systematically starting
with assumptions like "the PRNG's state is known to the attacker", when
obtaining that state is precisely the purpose of the attack, is something
that does warrant a bit more explanation. Otherwise I could as well say
in other contexts "let's assume I know the private key of this site's
certificate, after all it's only 2048 bits". And the reason why we're
looping here is that all explanations about the risk of losing entropy
are based on the assumption that the attacker has already won. The patch
was made to keep that assumption sufficiently away, otherwise it's
useless.

> A SOLUTION:
> 
> Now, how to fix it?
> 
> First, what is the problem?  "Non-cryptographic PRNGs are predictable"
> fits in the cateogry of Not A Bug.  There are may applications for
> a non-cryptographic PRNG in the kernel.  Self-tests, spreading wear
> across flash memory, randomized backoff among cooperating processes,
> and assigning IDs in protocols which aren't designed for DoS resistance
> in the first place.
> 
> But apparently the network code is (ab)using lib/random32.c for choosing
> TCP/UDP port numbers and someone has found a (not publicly disclosed)
> attack which exploits the predictability to commit some mischief.

Note, it's beyond predictability, it's reversibility at this point.

> And apparently switching to the fastest secure PRNG currently
> in the kernel (get_random_u32() using ChaCha + per-CPU buffers)
> would cause too much performance penalty.

Based on some quick userland tests, apparently yes.

> So the request is for a less predictable PRNG which is still extremely
> fast.

That was the goal behind the MSWS PRNG, as it didn't leak its bits
and would require a certain amount of brute force.

> Well, the best crypto primitive I know of for such an application is
> SipHash.  Its 4x 64-bit words of state is only twice the size of the
> current struct rnd_state.  Converting it from a a pseudo-random function
> to a CRNG is some half-assed amateur cryptography, but at least it's a
> robust and well-studied primitive.

If it's considered cryptographically secure, it's OK to feed it with
just a counter starting from a random value, since the only way to guess
the current counter state is to brute-force it, right ? I've done this in
the appended patch. Note that for now I've commented out all the reseeding
code because I just wanted to test the impact. 
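
For readers who don't have the appended patch (it is not fully visible in
this archive), here is a minimal sketch of the construction being described:
a counter hashed with siphash under a randomly seeded key (one instance per
CPU in the real thing). The names and the seeding are hypothetical; this is
not the actual WIP patch:

  #include <linux/random.h>
  #include <linux/siphash.h>
  #include <linux/types.h>

  /* Hypothetical sketch, not the actual patch: counter-mode siphash.
   * Predicting future outputs requires recovering both the 128-bit key
   * and the 64-bit counter, i.e. brute force, as discussed above.
   */
  struct siphash_rnd_state {
          siphash_key_t key;      /* seeded once from get_random_bytes() */
          u64 counter;
  };

  static void siphash_rnd_seed(struct siphash_rnd_state *st)
  {
          get_random_bytes(&st->key, sizeof(st->key));
          get_random_bytes(&st->counter, sizeof(st->counter));
  }

  static u32 siphash_rnd_u32(struct siphash_rnd_state *st)
  {
          /* siphash_1u64() hashes one 64-bit word under the 128-bit key */
          return (u32)siphash_1u64(st->counter++, &st->key);
  }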

In my perf measurements I've had some erratic behaviors, one test showing
almost no loss, and another one showing a lot, which didn't make sense, so
I'll need to measure again under better conditions (i.e. not my laptop with
a variable frequency CPU).

> Willy Tarreau: Middle square + Weil sequence isn't even *close* to
>   crypto-quality.

Absolutely, I've even said this myself.

>  And you're badly misunderstanding what the
>   fast_pool is doing.  Please stop trying to invent your own crypto
>   primitives; see
>   
> https://www.schneier.com/crypto-gram/archives/1998/1015.html#cipherdesign

I know this pretty well and am NOT trying to invent crypto and am not
qualified for this. I'm not even trying to invent a PRNG myself because
there are some aspects I can't judge (such as measuring the sequence).
This is why I proposed to reuse some published work.

There seems to be some general thinking among many crypto-savvy people
that everything in the world needs strong crypto, including PRNGs, and
that doesn't always help. As you mentioned above, there are plenty of
valid use cases for weaker PRNGs, and my opinion (which may or may not
be shared) is that if guessing an 

Re: Flaw in "random32: update the net random state on interrupt and activity"

2020-08-07 Thread Willy Tarreau
On Fri, Aug 07, 2020 at 03:45:48PM -0700, Marc Plumb wrote:
> Willy,
> 
> 
> On 2020-08-07 3:19 p.m., Willy Tarreau wrote:
> > On Fri, Aug 07, 2020 at 12:59:48PM -0700, Marc Plumb wrote:
> > > 
> > > If I can figure the state out once,
> > Yes but how do you take that as granted ? This state doesn't appear
> > without its noise counterpart,
> 
> Amit has shown attacks that can deduce the full internal state from 4-5
> packets with a weak PRNG. If the noise is added less often than that, an
> attacker can figure out the entire state at which point the partial
> reseeding doesn't help. If the noise is added more often than that, and it's
> raw timing events, then it's only adding a few bits of entropy so its easy
> to guess (and it weakens dev/random). If the noise is added more often, and
> it's from the output of a CPRNG, then we have all the performance/latency
> problems from the CPRNG itself, so we might as well use it directly.

His attack is based on the fact that internal state bits leak as-is.
So it is possible to collect them and perform the inverse operation to
reconstruct the input.

Here the output is made by mixing two input bits with two noise bits
to produce one single output bit. So for each output bit you see,
you don't have just one possible input bit, but 16 possibilities.
That's 128 bits of internal changing states that once combined result
in 32 bits. For each 32-bit output you still have 2^96 possible
internal (x,noise) states producing it. And that's without counting
on the 64-bit seed that you still don't know but can be deduced from
two guessed 128 bit states (assuming that can be brute-forced at all,
of course).

For sure there are plenty of number-theoretic tricks allowing you to
significantly reduce the space you need to work on to brute-force
this but it will definitely remain above 2^32 attempts for each
iteration, which is the floor you have without the noise part,
while the whole internal state will be reseeded every minute
anyway.

I mean, I'm not trying to sell this thing, I'm just trying to defend
that we use a reasonable tool for a reasonable protection level. And
yes, probably that 15 years from now, someone will say "hey look, I
can brute-force this thing in less than a minute on my 1024-core 39th
gen core i7 machine with 2^60 operations per second, why don't we use
our CPU's native 10 picosecond AES1024 instruction instead ?" and we'll
say "well, it's an old story and it's probably about time to change it
again".

I don't see anything wrong with evolving this way, matching concrete
needs more than pure theory.

Regards,
Willy


Re: Flaw in "random32: update the net random state on interrupt and activity"

2020-08-07 Thread Willy Tarreau
On Fri, Aug 07, 2020 at 12:59:48PM -0700, Marc Plumb wrote:
> 
> On 2020-08-07 10:43 a.m., Willy Tarreau wrote:
> > 
> > > Which means that it's 2^32 effort to brute force this (which Amit called 
> > > "no
> > > biggie for modern machines"). If the noise is the raw sample data with 
> > > only
> > > a few bits of entropy, then it's even easier to brute force.
> > Don't you forget to multiply by another 2^32 for X being folded onto itself 
> > ?
> > Because you have 2^32 possible values of X which will give you a single 
> > 32-bit
> > output value for a given noise value.
> 
> If I can figure the state out once,

Yes but how do you take that as granted ? This state doesn't appear
without its noise counterpart, so taking as a prerequisite that you may
guess one separately obviously indicates that you then just have to
deduce the other, but the point of mixing precisely is that we do not
expose individual parts.

This way of thinking is often what results in extreme solutions being
designed, which are far away from the reality of the field of application,
and which result in unacceptable costs that make people turn to other solutions.
Do you think it makes me happy to see people waste their time reimplementing
alternate userland TCP stacks that are supposedly "way faster" by getting
rid of all the useless (to them) stuff that was forced on them at the cost
of performance ? And it makes me even less happy when they ask me why I'm
not spending more time trying to adopt them. The reality is that this time
could be better spent optimizing our stack to be sure that costs are added
where they are relevant, and not just to make sure that when we align 7
conditions starting with "imagine that I could guess that", the 8th could
be guessed as well, except that none of these can really be guessed outside
of a lab :-/

> then the only new input is the noise, so
> that's the only part I have to brute force. Throwing the noise in makes it
> more difficult to get that state once, but once I have it then this type of
> reseeding doesn't help.

> I think it might be possible to do a decent CPRNG (that's at
> least had some cryptanalys of it) with ~20 instructions per word, but if
> that's not fast enough then I'll think about other options.

I think that around 20 instructions for a hash would definitely be nice
(but please be aware that we're speaking about RISC-like instructions,
not SIMD instructions). And also please be careful not to count only
with amortized performance that's only good for showing nice openssl
benchmarks, because if it's really 1280 instructions producing 256 bytes
(64 words) that amortize to 20 instructions per 32-bit word, it's not the
same thing anymore at all!

Regards,
Willy


Re: Flaw in "random32: update the net random state on interrupt and activity"

2020-08-07 Thread Willy Tarreau
Hi Andy,

On Fri, Aug 07, 2020 at 10:55:11AM -0700, Andy Lutomirski wrote:
> >> This is still another non-cryptographic PRNG.
> > 
> > Absolutely. During some discussions regarding the possibility of using
> > CSPRNGs, costs on the order of hundreds of CPU cycles were mentioned for them,
> > which can definitely be a huge waste of precious resources for some
> > workloads, possibly causing the addition of a few percent extra machines
> > in certain environments just to keep the average load under a certain
> > threshold.
> 
> I think the real random.c can run plenty fast. It's ChaCha20 plus ludicrous
> overhead right now. I'm working (slowly) on making the overhead go away.  I'm
> hoping to have something testable in a few days.  As it stands, there is a
> ton of indirection, a pile of locks, multiple time comparisons, per-node and
> percpu buffers (why both?), wasted bits due to alignment, and probably other
> things that can be cleaned up.  I'm trying to come up with something that is
> fast and has easy-to-understand semantics.
> 
> You can follow along at:
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/luto/linux.git/log/?h=random/fast

Thanks, we'll see. I developed a quick test tool that's meant to be easy
to use to measure the performance impact on connect/accept. I have not
yet run it on a modified PRNG to verify if it works. I'll send it once
I've tested. I'd definitely like to see no measurable performance
drop, and ideally even a small performance increase (as Tausworthe isn't
the lightest thing around either, so we do have a little margin).
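To give an idea of what is being measured, the core of such a test is
essentially a tight connect()/close() loop against a local listener.
Below is only an illustrative sketch (not the actual tool), with an
arbitrary port and iteration count, and it assumes something is already
listening on 127.0.0.1:8000:

/* illustrative sketch only: time a burst of connect()s to a local listener
 * so that each connection draws a fresh source port from the PRNG. The
 * port (8000) and the count are arbitrary. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <time.h>
#include <unistd.h>

int main(void)
{
    struct sockaddr_in sa;
    struct timespec t0, t1;
    int i, fd, n = 10000;
    double dt;

    memset(&sa, 0, sizeof(sa));
    sa.sin_family = AF_INET;
    sa.sin_port = htons(8000);
    sa.sin_addr.s_addr = htonl(INADDR_LOOPBACK);

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (i = 0; i < n; i++) {
        fd = socket(AF_INET, SOCK_STREAM, 0);
        if (fd < 0 || connect(fd, (struct sockaddr *)&sa, sizeof(sa)) < 0) {
            perror("connect");
            return 1;
        }
        close(fd);
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    dt = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("%d connections in %.3fs (%.0f conn/s)\n", n, dt, n / dt);
    return 0;
}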

Willy


Re: Flaw in "random32: update the net random state on interrupt and activity"

2020-08-07 Thread Willy Tarreau
On Fri, Aug 07, 2020 at 09:52:14AM -0700, Marc Plumb wrote:
> On 2020-08-07 12:03 a.m., Willy Tarreau wrote:
> 
> > Just to give a heads up on this, here's what I'm having pending regarding
> > MSWS:
> > 
> >struct rnd_state {
> >  uint64_t x, w;
> >  uint64_t seed;
> >  uint64_t noise;
> >};
> > 
> >uint32_t msws32(struct rnd_state *state)
> >{
> >  uint64_t x;
> > 
> >  x  = state->w += state->seed;
> >  x += state->x * state->x;
> >  x  = state->x = (x >> 32) | (x << 32);
> >  x -= state->noise++;
> >  return x ^ (x >> 32);
> >}
> 
> A few comments:
> 
> This is still another non-cryptographic PRNG.

Absolutely. During some discussions regarding the possibility of using
CSPRNGs, orders around hundreds of CPU cycles were mentioned for them,
which can definitely be a huge waste of precious resources for some
workloads, possibly causing the addition of a few percent extra machines
in certain environments just to keep the average load under a certain
threshold. And the worst there is that such workloads would exactly be
the ones absolutely not affected by the theoretical attacks regarding
predictability :-/

> An LFSR can pass PractRand (if
> you do a tweak to get around the specific linear complexity test for LFSRs).

OK.

> On a 64-bit machine it should be fast: 4 adds, 1 multiply, 1 rotate, 1
> shift, 1 xor
> 
> This will be much slower on 32-bit machines, if that's still a concern

Yep, that's something I'm totally aware of and one reason I wanted to
have a public discussion about this. My personal view on this is that
a 64-bit multiply will always be cheaper than a crypto operation and
that the environments where picking a source port or accepting a
connection matters that much are not those running on such low-end
machines, so that's very likely an acceptable tradeoff.

> As long as the noise is the output of a CPRNG, this doesn't hurt the
> security of /dev/random.
> 
> The noise is more like effective 32-bits since you're xoring the low and
> high half of the noise together (ignoring the minor details of carry bits).

Definitely. The purpose of this noise in fact is more to complicate the
reconstruction of the internal state in case a large number of samples
is used, precisely due to these carry bits that propagate solely depending
on the noise value itself, and the uncertainty about when it changes and
from what to what.

> Which means that it's 2^32 effort to brute force this (which Amit called "no
> biggie for modern machines"). If the noise is the raw sample data with only
> a few bits of entropy, then it's even easier to brute force.

Don't you forget to multiply by another 2^32 for X being folded onto itself ?
Because you have 2^32 possible values of X which will give you a single 32-bit
output value for a given noise value.

> Given the uses of this, I think we really should look into a CPRNG for this
> and then completely reseed it periodically. The problem is finding one
> that's fast enough.

That was precisely the problem that drove me to propose something like
this. And all these cycles wasted everywhere just to protect a 16-bit
source port on a mostly idle machine seem quite overkill to me.

> Is there a hard instruction budget for this, or it is
> just "fast enough to not hurt the network benchmarks" (i.e. if Dave Miller
> screams)?

It's not just Davem. I too am concerned about wasting CPU cycles in the fast
path, especially in the network code. In this area, half-percent gains are
hard-won and only come once in a while, and in some infrastructures they
matter. Not much, but they do.

Willy


Re: Flaw in "random32: update the net random state on interrupt and activity"

2020-08-07 Thread Willy Tarreau
On Thu, Aug 06, 2020 at 10:18:55AM -0700, Marc Plumb wrote:
> Willy,
> 
> 
> On 2020-08-05 11:30 p.m., Willy Tarreau wrote:
> > On Wed, Aug 05, 2020 at 03:21:11PM -0700, Marc Plumb wrote:
> > > There is nothing wrong with perturbing net_rand_state, the sin is doing it
> > > with the raw entropy that is also the seed for your CPRNG. Use the output 
> > > of
> > > a CPRNG to perturb the pool all you want, but don't do things that bit by
> > > bit reveal the entropy that is being fed into the CPRNG.
> > This is interesting because I think some of us considered it exactly the
> > other way around, i.e. we're not copying exact bits but just taking a
> > pseudo-random part of such bits at one point in time, to serve as an
> > increment among other ones. And given that these bits were collected
> > over time from not very secret sources, they appeared to be of lower
> > risk than the output.
> 
> No. The output of a CPRNG can't be used to determine the internal state. The
> input can. The input entropy is the one thing that cannot be produced by a
> deterministic computer, so they are the crown jewels of this. It's much much
> safer to use the output.

OK, noted.

> > I didn't know about SFC32, it looks like a variation of the large family
> > of xorshift generators, which is thus probably quite suitable as well
> > for this task. Having used xoroshiro128** myself in another project, I
> > found it overkill for this task compared to MSWS but I definitely agree
> > that any of them is more suited to the task than the current one.
> > 
> It's actually a chaotic generator (not a linear one like an xorshift
> generator), which gives it weaker period guarantees which makes it more
> difficult to reverse. With a counter added to help the period length.
> 
> I'll trust Amit that SFC32 isn't strong enough and look at other options --
> I just thought of it as better, and faster than the existing one with the
> same state size. Maybe a larger state is needed.

Just to give a heads up on this, here's what I'm having pending regarding
MSWS:

  struct rnd_state {
uint64_t x, w;
uint64_t seed;
uint64_t noise;
  };

  uint32_t msws32(struct rnd_state *state)
  {
uint64_t x;

x  = state->w += state->seed;
x += state->x * state->x;
x  = state->x = (x >> 32) | (x << 32);
x -= state->noise++;
return x ^ (x >> 32);
  }

It passes PractRand without any warning after 1 TB of data:

  rng=RNG_stdin, seed=unknown
  length= 512 megabytes (2^29 bytes), time= 2.0 seconds
no anomalies in 229 test result(s)
  
  rng=RNG_stdin, seed=unknown
  length= 1 gigabyte (2^30 bytes), time= 4.3 seconds
no anomalies in 248 test result(s)
  
  rng=RNG_stdin, seed=unknown
  length= 2 gigabytes (2^31 bytes), time= 8.3 seconds
no anomalies in 266 test result(s)
  
  rng=RNG_stdin, seed=unknown
  length= 4 gigabytes (2^32 bytes), time= 15.8 seconds
no anomalies in 282 test result(s)
  
  rng=RNG_stdin, seed=unknown
  length= 8 gigabytes (2^33 bytes), time= 31.3 seconds
no anomalies in 299 test result(s)
  
  rng=RNG_stdin, seed=unknown
  length= 16 gigabytes (2^34 bytes), time= 61.9 seconds
no anomalies in 315 test result(s)
  
  rng=RNG_stdin, seed=unknown
  length= 32 gigabytes (2^35 bytes), time= 119 seconds
no anomalies in 328 test result(s)
  
  rng=RNG_stdin, seed=unknown
  length= 64 gigabytes (2^36 bytes), time= 242 seconds
no anomalies in 344 test result(s)
  
  rng=RNG_stdin, seed=unknown
  length= 128 gigabytes (2^37 bytes), time= 483 seconds
no anomalies in 359 test result(s)
  
  rng=RNG_stdin, seed=unknown
  length= 256 gigabytes (2^38 bytes), time= 940 seconds
no anomalies in 372 test result(s)
  
  rng=RNG_stdin, seed=unknown
  length= 512 gigabytes (2^39 bytes), time= 1906 seconds
no anomalies in 387 test result(s)
  
  rng=RNG_stdin, seed=unknown
  length= 1 terabyte (2^40 bytes), time= 3826 seconds
no anomalies in 401 test result(s)

The two modifications compared to the original msws are:

  - mix bits on output so that we don't reveal the internal
state upon each call ;

  - combine the output with an independent noise
    variable whose purpose was to be updated upon IRQ
    and/or CPU usage and/or invocations. But on this point,
    while implementing it I figured that updating it on each
    invocation already provided the frequent updates we
    were missing in Tausworthe and that required the interrupt
    updates. I'd still update it in update_process_times()
    so that it's not reduced to a pure counter, but the
    results, speed and simplicity look encouraging.

I'll try to work on finishing the patch proposal this week-end.

Regards,
Willy


Re: Flaw in "random32: update the net random state on interrupt and activity"

2020-08-05 Thread Willy Tarreau
On Wed, Aug 05, 2020 at 03:21:11PM -0700, Marc Plumb wrote:
> There is nothing wrong with perturbing net_rand_state, the sin is doing it
> with the raw entropy that is also the seed for your CPRNG. Use the output of
> a CPRNG to perturb the pool all you want, but don't do things that bit by
> bit reveal the entropy that is being fed into the CPRNG.

This is interesting because I think some of us considered it exactly the
other way around, i.e. we're not copying exact bits but just taking a
pseudo-random part of such bits at one point in time, to serve as an
increment among other ones. And given that these bits were collected
over time from not very secret sources, they appeared to be of lower
risk than the output.

I mean, if we reimplemented something in parallel just mixing the IRQ
return pointer and TSC, some people could possibly say "is this really
strong enough?" but it wouldn't seem very shocking in terms of disclosure.
But if by doing so we ended up accidentally reproducing the same contents
as the fast_pool, it could be problematic.

Would you think that using only the input info used to update the
fast_pool would be cleaner ? I mean, instead of :

fast_pool->pool[0] ^= cycles ^ j_high ^ irq;
fast_pool->pool[1] ^= now ^ c_high;
ip = regs ? instruction_pointer(regs) : _RET_IP_;
fast_pool->pool[2] ^= ip;
fast_pool->pool[3] ^= (sizeof(ip) > 4) ? ip >> 32 :
get_reg(fast_pool, regs);

we'd do:

x0 = cycles ^ j_high ^ irq;
x1 = now ^ c_high;
x2 = regs ? instruction_pointer(regs) : _RET_IP_;
x3 = (sizeof(ip) > 4) ? ip >> 32 : get_reg(fast_pool, regs);

fast_pool->pool[0] ^= x0;
fast_pool->pool[1] ^= x1;
fast_pool->pool[2] ^= x2;
fast_pool->pool[3] ^= x3;

this_cpu_add(net_rand_state.s1, x0^x1^x2^x3);

Because that's something easy to do and it provides the same level
of perturbation to net_rand_state.

> > Another approach involving the replacement of the algorithm was considered
> > but we were working with -stable in mind, trying to limit the backporting
> > difficulty (and it revealed a circular dependency nightmare that had been
> > sleeping there for years), and making the changes easier to check (which
> > is precisely what you're doing).
> 
> Really? You can replace the LFSR and only change lib/random32.c. That might
> fix the problem without the need to use the raw fast_pool data for seed
> material.

Definitely. I had been working on a variant of MSWS
(https://arxiv.org/pdf/1704.00358.pdf) that I discussed a little bit with
Amit, but I didn't publish it yet since it needs to be cleaned up and to
replace all the test patterns in random32.c. And when starting to rebase
it I figured that updating it from the interrupt handler didn't really
look like it brought anything. I mean that when you're starting to wonder
what to update "to make it better", it likely means that you don't need it.

> As Linus said, speed is a concern but SFC32 is faster than
> existing Tausworthe generator, and it's a drop-in replacement with the same
> state size if that makes your life easier. If you're willing to expand the
> state there are even better options (like SFC64 or some of chaotic
> generators like Jenkins' Frog).

I didn't know about SFC32, it looks like a variation of the large family
of xorshift generators, which is thus probably quite suitable as well
for this task. Having used xoroshiro128** myself in another project, I
found it overkill for this task compared to MSWS but I definitely agree
that any of them is more suited to the task than the current one.

> > We're not trying to invent any stream cipher or whatever, just making
> > use of a few bits that are never exposed alone as-is to internal nor
> > external states, to slightly disturb another state that otherwise only
> > changes once a minute so that there's no more a 100% chance of guessing
> > a 16-bit port after seeing a few packets. I mean, I'm pretty sure that
> > even stealing three or four bits only from there would be quite enough
> > to defeat the attack given that Amit only recovers a few bits per packet.
> 
> If Amit's attack can recover that much per packet (in practice not just in
> theory) and there is even one packet per interrupt, then it's a major
> problem. There are at most 2 bits of entropy added per call to
> add_interrupt_randomness, so it you're leaking "a few bits per packet" then
> that's everything. Over 64 interrupts you've lost the 128 bits of entropy
> that the fast_pool has spilled into the input_pool.

But how do you *figure* the value of these bits and their position in the
fast_pool ? I mean that the attack relies on net_rand_state not being
perturbed. Let's assume you'd be able to brute-force them in linear time
by drawing the list of possible candidates for 32-bit increment on each
packet and you end up with 64 sets of candidates. You'll probably have
quite a number of candidates there since yo

Re: Flaw in "random32: update the net random state on interrupt and activity"

2020-08-05 Thread Willy Tarreau
Hi Mark,

On Wed, Aug 05, 2020 at 09:06:40AM -0700, Marc Plumb wrote:
> Just because you or I don't have a working exploit doesn't mean that someone
> else isn't more clever.

I agree on the principle, but it can be said from many things, including
our respective inability to factor large numbers for example. But for
sure we do need to be careful, and actually picking only some limited
parts of the fast pool (which are only used to update the input pool
and are only made of low-difficulty stuff like instruction pointers,
jiffies and TSC values) is probably not going to disclose an extremely
well guarded secret.

> The fundamental question is: Why is this attack on net_rand_state a problem?
> It's Working as Designed. Why is it a major enough problem to risk harming
> cryptographically important functions?

It's not *that* major an issue (in my personal opinion) but the current
net_rand_state is easy enough to guess so that an observer may reduce
the difficulty to build certain attacks (using known source ports for
example). The goal of this change (and the one in update_process_times())
is to disturb the net_rand_state a little bit so that external observations
turn from "this must be that" to "this may be this or maybe that", which
is sufficient to limit the ability to reliably guess a state and reduce
the cost of an attack.

Another approach involving the replacement of the algorithm was considered
but we were working with -stable in mind, trying to limit the backporting
difficulty (and it revealed a circular dependency nightmare that had been
sleeping there for years), and making the changes easier to check (which
is precisely what you're doing).

> Do you remember how you resisted making dev/urandom fast for large reads for
> a long time to punish stupid uses of the interface? In this case anyone who
> is using net_rand_state assuming it is a CPRNG should stop doing that. Don't
> enable stupidity in the kernel.
> 
> This whole thing is making the fundamental mistake of all amateur
> cryptographers of trying to create your own cryptographic primitive. You're
> trying to invent a secure stream cipher. Either don't try to make
> net_rand_state secure, or use a known secure primitive.

We're not trying to invent any stream cipher or whatever, just making
use of a few bits that are never exposed alone as-is to internal nor
external states, to slightly disturb another state that otherwise only
changes once a minute so that there's no more a 100% chance of guessing
a 16-bit port after seeing a few packets. I mean, I'm pretty sure that
even stealing three or four bits only from there would be quite enough
to defeat the attack given that Amit only recovers a few bits per packet.

For me the right long-term solution will be to replace the easily guessable
LFSR. But given the build breakage we got by just adding one include, I
can only guess what we'll see when trying to do more in this area :-/

Regards,
Willy


Re: Flaw in "random32: update the net random state on interrupt and activity"

2020-08-05 Thread Willy Tarreau
Hi Ted,

On Wed, Aug 05, 2020 at 11:34:32AM -0400, ty...@mit.edu wrote:
> That being said, it certainly is a certificational / theoretical
> weakness, and if the bright boys and girls at Fort Meade did figure
> out a way to exploit this, they are very much unlikely to share it at
> an open Crypto conference.  So replacing LFSR-based PRnG with
> something stronger which didn't release any bits from the fast_pool
> would certainly be desireable, and I look forward to seeing what Willy
> has in mind.

I'll post a proposal patch shortly about this, hopefully this week-end
(got diverted by work lately :-)). Just to give you a few pointers,
it's a small modification of MSWS. It passes the PractRand test suite
on 256 GB of data with zero warning (something that Tausworthe is
supposed to fail at).

By default, MSWS *does* leak its internal state, as Amit showed us (and
seeing that the paper on it suggests it's safe as-is for crypto use is
a bit shocking), but once slightly adjusted, it doesn't reveal its state
anymore and that would constitute a much more future-proof solution for
quite some time. Tausworthe was created something like 20 years ago or
so, hence it's not surprising that it's a bit dated by now, but if we
can upgrade once every 2 decades I guess it's not that bad.

Cheers,
Willy


Re: Flaw in "random32: update the net random state on interrupt and activity"

2020-08-04 Thread Willy Tarreau
Hi Marc,

On Tue, Aug 04, 2020 at 05:52:36PM -0700, Marc Plumb wrote:
> Seeding two PRNGs with the same entropy causes two problems. The minor one
> is that you're double counting entropy. The major one is that anyone who can
> determine the state of one PRNG can determine the state of the other.
> 
> The net_rand_state PRNG is effectively a 113 bit LFSR, so anyone who can see
> any 113 bits of output can determine the complete internal state.
> 
> The output of the net_rand_state PRNG is used to determine how data is sent
> to the network, so the output is effectively broadcast to anyone watching
> network traffic. Therefore anyone watching the network traffic can determine
> the seed data being fed to the net_rand_state PRNG.

The problem this patch is trying to work around is that the reporter
(Amit) was able to determine the entire net_rand_state after observing
a certain number of packets due to this trivial LFSR and the fact that
its internal state between two reseedings only depends on the number
of calls to read it. (please note that regarding this point I'll
propose a patch to replace that PRNG to stop directly exposing the
internal state to the network).

If you look closer at the patch, you'll see that in one interrupt
the patch only uses any 32 out of the 128 bits of fast_pool to
update only 32 bits of the net_rand_state. As such, the sequence
observed on the network also depends on the remaining bits of
net_rand_state, while the 96 other bits of the fast_pool are not
exposed there.
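For those who haven't read the patch, the shape of the operation is
roughly the following; this is only a simplified userspace model of the
idea, not the committed kernel code:

/* simplified model: one 32-bit word picked from the 128-bit fast_pool is
 * added to one 32-bit word of the PRNG state, so the 96 other bits are
 * never exposed through the network-visible sequence. */
#include <stdint.h>
#include <stdio.h>

struct model_rnd_state { uint32_t s1, s2, s3, s4; };

static void perturb(struct model_rnd_state *st, const uint32_t pool[4],
                    unsigned int count)
{
    /* pick one of the four pool words depending on the interrupt count */
    st->s1 += pool[count & 3];
}

int main(void)
{
    struct model_rnd_state st = { 0x12345678, 0x9abcdef0, 0x0fedcba9, 0x87654321 };
    uint32_t pool[4] = { 0x11111111, 0x22222222, 0x33333333, 0x44444444 };

    perturb(&st, pool, 5); /* 5 & 3 == 1, so only pool[1] is mixed in */
    printf("s1 is now %08x\n", (unsigned int)st.s1);
    return 0;
}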

> Since this is the same
> seed data being fed to get_random_bytes, it allows an attacker to determine
> the state and thus the output of /dev/random. I sincerely hope that this was
> not the intended goal. :)

Not only was this obviously not the goal, but I'd be particularly
interested in seeing this reality demonstrated, considering that
the whole 128 bits of fast_pool together count as a single bit of
entropy, and that as such, even if you were able to figure the
value of the 32 bits leaked to net_rand_state, you'd still have to
guess the 96 other bits for each single entropy bit :-/

Regards,
Willy


Re: newbie question on networking kernel panics.

2020-07-16 Thread Willy Tarreau
Hi Jeroen,

On Thu, Jul 16, 2020 at 05:38:57PM +0200, Jeroen Baten wrote:
> Hi,
> 
> I have been working on a Linux desktop for the last 20 odd years.
> Currently running Ubuntu 20.04.
> 
> Yesterday I enabled the option "Flow control" on my TP-Link TL-SG1024DE.
> 
> Subsequently I was forced to enjoy 3 kernel panics in the timespan of 18
> hours.
> 
> After disabling the "Flow control" option my system seems to be stable
> again.
> 
> I do have 3 sets of "cut here" text if somebody would be interested.
> 
> Please let me know if this information is of interest to someone or if I
> am barking up the wrong majordomo tree.
> 
> Kind regards and thanks to all here for your immensely valuable work,

Since distro kernels contain some extra patches, may miss a significant
number of fixes, or even rely on different driver sources, you really
need to report this issue to your distro, who will ask for your exact
kernel version. It may be that this issue also affects the vanilla
kernel in the end, but in this case it will be reported by the distro's
maintainers once they've verified that it's not related to their kernel's
differences.

Hoping this helps,
Willy


Re: Verify ACK packets in handshake in kernel module (Access TCP state table)

2019-09-20 Thread Willy Tarreau
On Fri, Sep 20, 2019 at 11:43:50PM +, Swarm wrote:
> First time emailing to this mailing list so please let me know if I made a
> mistake in how I sent it. I'm trying to receive a notification from the
> kernel once it verifies an ACK packet in a handshake. Problem is, there is no
> API or kernel resource I've seen that supports this feature for both
> syncookies and normal handshakes. Where exactly in the kernel does the ACK
> get verified? If there isn't a way to be notified of it, where should I start
> adding that feature into the kernel?

Just searching for TCP_ESTABLISHED immediately brought me to tcp_input.c
(tcp_rcv_state_process() to be precise), so I'm not sure you've searched
that much. As you've noticed there's nothing specifically called in this
case, but in practice a caller of accept() on a listening socket will be
woken up.
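In userspace terms the notification already exists: accept() (or
poll()/epoll on the listening socket) only returns once that ACK has been
validated and the connection moved to ESTABLISHED. A trivial sketch, with
an arbitrary port:

/* sketch: accept() blocks until the kernel has verified the final ACK of
 * the 3-way handshake (syncookie or not), so returning from accept() is
 * the notification that the handshake completed. Port 9000 is arbitrary. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    struct sockaddr_in sa;
    int lfd = socket(AF_INET, SOCK_STREAM, 0), cfd;

    memset(&sa, 0, sizeof(sa));
    sa.sin_family = AF_INET;
    sa.sin_port = htons(9000);
    sa.sin_addr.s_addr = htonl(INADDR_ANY);

    if (lfd < 0 || bind(lfd, (struct sockaddr *)&sa, sizeof(sa)) < 0 ||
        listen(lfd, 128) < 0) {
        perror("listen");
        return 1;
    }

    cfd = accept(lfd, NULL, NULL); /* wakes up once the handshake is complete */
    if (cfd >= 0)
        printf("connection established, handshake ACK was accepted\n");
    close(cfd);
    close(lfd);
    return 0;
}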

Hoping this helps,
Willy


Re: [NETDEV]: getsockopt(fd, SOL_IP, SO_ORIGINAL_DST, sa, &salen) is in fact sometimes returning the source IP instead the destination IP

2019-01-23 Thread Willy Tarreau
Hi Roobesh,

On Wed, Jan 23, 2019 at 08:07:48AM +, Mohandass, Roobesh wrote:
> Hi Willy/Florian/Lukas,
> 
>  disabling nf_conntrack_tcp_loose solved the problem, and for the last 24 hours
>  we have not seen a single piece of wrong connection data from send-proxy.
> 
> Worth a note in your documentation related to this, so users can be aware
> of it.
> 
> Thanks very much for your help and support, I will report back if I see
> anything again related to this.

Thank you for your feedback on this issue, and kudos to Florian
for spotting the root cause!

Willy


Re: [NETDEV]: getsockopt(fd, SOL_IP, SO_ORIGINAL_DST, sa, &salen) is in fact sometimes returning the source IP instead the destination IP

2019-01-12 Thread Willy Tarreau
Hi Lukas,

On Sat, Jan 12, 2019 at 07:01:34PM +0100, Lukas Tribus wrote:
> > Roobesh, do you use the destination address only for logging or
> > anywhere else in the request path ? And could you check if you have
> > nf_conntrack_tcp_loose set as Florian suggests ? I really think he
> > figured it right.
> 
> It's about what we send with the PROXY protocol to the backend server,
> Roobesh reported things like that (src and dst is the same):
> 
> PROXY TCP4 192.220.26.39 192.220.26.39 45066 45066
> PROXY TCP4 192.220.26.39 192.220.26.39 45075 45075
> 
> So the call would actually happen at the beginning of the TCP connection.

That sounds quite shocking to me then. Maybe we're facing such a sequence:

 1) first session comes from this port, and client closes first (FIN)
 2) haproxy then closes with (FIN)
 3) client doesn't respond with the final ACK (e.g. blocked by another
firewall in between or the client's own conntrack)
 4) the socket on the haproxy side remains in LAST_ACK state and ACKs
are periodically resent
 5) local conntrack is in TIME_WAIT and expires faster than the largest
interval between two such ACKs
 6) one of these retransmitted ACKs reopens the connection in reverse
direction due to nf_conntrack_tcp_loose. The connection is then
seen in ESTABLISHED state and might be kept a bit longer.
 7) the connection finally expires in the local TCP stack but not yet
    in conntrack.
 8) later the client reuses the same source port while the connection
    is still present in the conntrack table.
 9) assuming tcp_be_liberal is also set, the packets can pass through
    conntrack and establish a new connection to haproxy.
 10) haproxy calls getsockopt(SO_ORIGINAL_DST) and gets the other end
    point since the connection was created at step 6 above in the
    other direction.

I could be wrong on certain specific points above but it looks plausible. 

> Initial report is here:
> https://discourse.haproxy.org/t/send-proxy-not-modifying-some-traffic-with-proxy-ip-port-details/3336

Ah cool, I didn't have all this, thank you!

> Let's see if disabling nf_conntrack_tcp_loose changes things.

Yes this really is the only thing I can think of, and in this case
no one is wrong in this chain (neither the kernel nor haproxy). We'll
need to document it in my opinion.

Thanks,
Willy


Re: [NETDEV]: getsockopt(fd, SOL_IP, SO_ORIGINAL_DST, sa, &salen) is in fact sometimes returning the source IP instead the destination IP

2019-01-12 Thread Willy Tarreau
Hi Florian!

Sorry for the lag, I was busy with other issues.

On Sat, Jan 12, 2019 at 05:04:00PM +0100, Florian Westphal wrote:
> Lukas Tribus  wrote:
> > The application (haproxy) needs to know the original destination IP
> > address, however it does not know whether -j REDIRECT was used or not.
> > Because of this the application always queries SO_ORIGINAL_DST, and
> > this includes configurations without -j REDIRECT.
> > 
> > Are you saying the behavior of SO_ORIGINAL_DST is undefined when not
> > used with -j REDIRECT and that this issue does not happen when -j
> > REDIRECT is actually used?
> 
> No, thats not what I said.  Because OP provided a link that mentions
> TPROXY, I concluded OP was using TPROXY, so I pointed out that the
> error source can be completely avoided by not using SO_ORIGINAL_DST.
> 
> As I said, SO_ORIGINAL_DST returns the dst address of
> the original direction *as seen by conntrack*.
> 
> In case REDIRECT or DNAT was used, the address returned is the on-wire
> one, before DNAT rewrite took place.
>
> Therefore, SO_ORIGINAL_DST is only needed when REDIRECT or DNAT was
> used. If no DNAT rewrite takes place, sockaddr returned by accept or
> getsockname can be used directly and SO_ORIGINAL_DST isn't needed.
> The returned address should be identical to the one given by accept().

That's also the way we're using it :

 
http://git.haproxy.org/?p=haproxy.git;a=blob;f=src/proto_tcp.c;h=52a4691d2ba93aec93a3e8a0edd1e90d93de968d;hb=HEAD#l600

More precisely if getsockopt() fails we only use getsockname(). It
happens that this code is very old (2002) and used to work with kernel
2.4's TPROXY which did rely on conntrack, hence the ifdef including
TPROXY. But it's irrelevant here, with a modern TPROXY the getsockopt()
here will usually fail, and the result will come from getsockname()
only.
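For reference, the pattern boils down to something like this (a
simplified sketch of the approach, not the actual haproxy code):

/* sketch: ask conntrack for the pre-DNAT/REDIRECT destination first, and
 * fall back to getsockname() when the getsockopt() fails (no NAT, no
 * conntrack entry, conntrack not loaded, ...). */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <linux/netfilter_ipv4.h>   /* SO_ORIGINAL_DST */

static int get_dst_addr(int fd, struct sockaddr_in *dst)
{
    socklen_t len = sizeof(*dst);

    memset(dst, 0, sizeof(*dst));
    if (getsockopt(fd, SOL_IP, SO_ORIGINAL_DST, dst, &len) == 0)
        return 0;                   /* address before any DNAT rewrite */
    len = sizeof(*dst);
    return getsockname(fd, (struct sockaddr *)dst, &len);
}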

When seeing the (old) haproxy code there, I've been thinking about
another possibility. Right now haproxy does getsockname() then *tries*
getsockopt(). In the unlikely case getsockopt() would modify part of
the address already returned by getsockname(), it could leave an
incorrect address there. But such an issue would require partial uses
of copy_to_user() which doesn't make much sense, and getorigdst()
doesn't do this either. So this one was ruled out.

I've been wondering if it was possible that from time to time, this
getsockopt() accidentally succeeds and returns something wrong, maybe
due to a race with another thread, or maybe because it would return
an uninitialized field which happened to see the source address
previously. The very low occurrence rate makes me think it could
possibly happen only upon certain error conditions. But looking at
getorigdst(), I hardly imagine how this would happen.

> If SO_ORIGINAL_DST returns the reply, then conntrack picked up
> a reply packet as the first packet of the connection, so it believes
> originator is the responder and vice versa.
> 
> One case where this can happen is when nf_conntrack_tcp_loose
> (mid-stream pickup) is enabled.

Very interesting case indeed, I hadn't thought about it! I think we
don't have enough info from the original reporter's setup but it
would definitely make sense and explain why it's the other end which
is retrieved!

I'm seeing one possibility to explain this : Let's say the OP's setup
has a short conntrack timeout and a longer haproxy timeout. If the
address is only retrieved for logging, it will be retrieved at the
end of the connection. Let's assume haproxy receives a request from
a client, a conntrack entry is created and haproxy forwards the request
to a very slow server. Before the server responds, the conntrack entry
expires, then the server responds and haproxy forwards to the client,
re-creating the entry and hitting this case before the address is
picked up for logging.

Roobesh, do you use the destination address only for logging or
anywhere else in the request path ? And could you check if you have
nf_conntrack_tcp_loose set as Florian suggests ? I really think he
figured it right.

If that's the problem, I think we can address it with documentation :
first, warn about the use case, second, explain how to store the
address into a variable once the connection is accepted since it will
not have expired and will still be valid at this moment.

> This is not a haproxy bug.
> 
> Only thing that haproxy could is to provide a knob to make it only
> use addresses returned by accept, rather than relying on
> SO_ORIGINAL_DST for those that use TPROXY to do MITM interception.

Since it's the destination we get it using getsockname(), not accept(),
but yeah maybe we'd benefit from having a tunable to disable the use of
getsockopt(SO_ORIGINAL_DST). Actually I'd prefer to avoid doing this
until the issue is confirmed because I don't feel comfortable with having
an option saying "disable the use of getsockopt(SO_ORIGINAL_DST) which
may fail once per million due to some obscure reasons". However if it's
confirmed that it 

Re: MSG_ZEROCOPY doesn't work on half-open TCP sockets

2019-01-09 Thread Willy Tarreau
On Wed, Jan 09, 2019 at 08:55:14AM -0500, Willem de Bruijn wrote:
> > In other words: because the socket needs to be ESTABLISHED for
> > MSG_ZEROCOPY to work, and because remote party can send FIN and move
> > the socket to CLOSE_WAIT, a sending party must implement a fallback
> > from EINVAL return code on the transmission code. An adversarial
> > client who does shutdown(SHUT_WR), will trigger EINVAL in the sender..
> 
> An adversarial client only affects its own stream, so the impact is limited.

Sure but it doesn't necessarily do it on purpose either :-) The typical

   echo -ne "GET /file HTTP/1.1\r\nHost: foo.example.com\r\n\r\n" | nc host port

is perfectly valid and will not work in this case, possibly forcing the newly
deployed component to roll back.
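So until/unless the check is relaxed, the sender has to be ready to retry
without the flag. A minimal sketch of such a fallback, assuming SO_ZEROCOPY
was already enabled on the socket and leaving the error-queue completions
out of the picture:

/* sketch: try a zerocopy send and fall back to a regular copying send()
 * if the kernel refuses it (e.g. EINVAL because the peer already sent a
 * FIN and the socket is in CLOSE_WAIT). Completion notifications on the
 * error queue are not handled here. */
#include <errno.h>
#include <sys/types.h>
#include <sys/socket.h>

#ifndef MSG_ZEROCOPY
#define MSG_ZEROCOPY 0x4000000
#endif

static ssize_t send_maybe_zerocopy(int fd, const void *buf, size_t len)
{
    ssize_t ret = send(fd, buf, len, MSG_ZEROCOPY);

    if (ret < 0 && errno == EINVAL)
        ret = send(fd, buf, len, 0); /* zerocopy refused, copy instead */
    return ret;
}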

> Thanks for the report. At first blush it seems like extending the
> check to include state CLOSE_WAIT would resolve the issue
> 
> if (flags & MSG_ZEROCOPY && size && sock_flag(sk, SOCK_ZEROCOPY)) {
> -   if (sk->sk_state != TCP_ESTABLISHED) {
> +   if ((1 << sk->sk_state) & ~(TCPF_ESTABLISHED |
> TCPF_CLOSE_WAIT)) {
> 
> err = -EINVAL;
> goto out_err;
> }

At first glance I think it should do the job.

Willy


Re: splice() performance for TCP socket forwarding

2018-12-13 Thread Willy Tarreau
On Thu, Dec 13, 2018 at 02:17:20PM +0100, Marek Majkowski wrote:
> > splice code will be expensive if less than 1MB is present in receive queue.
> 
> I'm not sure what you are suggesting. I'm just shuffling data between
> two sockets. Is there a better buffer size value? Is it possible to
> keep splice() blocked until it succeeds to forward N bytes of data? (I
> tried this unsuccessfully with SO_RCVLOWAT).

I've personally observed performance decrease once the pipe is configured
larger than 512 kB. I think that it's related to the fact that you're
moving 256 pages around on each call and that it might even start to
have some effect on L1 caches when touching lots of data, though that could
be completely unrelated.
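So one simple thing to try is to cap the pipe at 512 kB instead of 1 MB
(values indicative only):

/* sketch: shrink the splice pipe so that fewer pages are moved per call.
 * F_SETPIPE_SZ may round the value up or be capped by
 * /proc/sys/fs/pipe-max-size. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int pfd[2];

    if (pipe(pfd) < 0)
        return 1;
    if (fcntl(pfd[0], F_SETPIPE_SZ, 512 * 1024) < 0)
        perror("F_SETPIPE_SZ");
    printf("pipe size is now %d bytes\n", fcntl(pfd[0], F_GETPIPE_SZ));
    return 0;
}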

> Here is a snippet from strace:
> 
> splice(4, NULL, 11, NULL, 1048576, 0) = 373760 <0.48>
> splice(10, NULL, 5, NULL, 373760, 0) = 373760 <0.000108>
> splice(4, NULL, 11, NULL, 1048576, 0) = 335800 <0.65>
> splice(10, NULL, 5, NULL, 335800, 0) = 335800 <0.000202>
> splice(4, NULL, 11, NULL, 1048576, 0) = 227760 <0.29>
> splice(10, NULL, 5, NULL, 227760, 0) = 227760 <0.000106>
> splice(4, NULL, 11, NULL, 1048576, 0) = 16060 <0.19>
> splice(10, NULL, 5, NULL, 16060, 0) = 16060 <0.28>
> splice(4, NULL, 11, NULL, 1048576, 0) = 7300 <0.13>
> splice(10, NULL, 5, NULL, 7300, 0) = 7300 <0.21>

I think your driver is returning one segment per page. Let's do some
rough maths : assuming you're having an MSS of 1448 (timestamps enabled),
you'll retrieve 256*1448 = 370688 at once per call, which closely matches
what you're seeing. Hmmm checking closer, you're in fact exactly running
at 1460 (256*1460=373760) so you have timestamps disabled. Your numbers
seem normal to me (just the CPU usage doesn't, but maybe it improves
when using a smaller pipe).

Willy


Re: splice() performance for TCP socket forwarding

2018-12-13 Thread Willy Tarreau
Hi Eric!

On Thu, Dec 13, 2018 at 05:37:11AM -0800, Eric Dumazet wrote:
> Maybe mlx5 driver is in LRO mode, packing TCP payload in 4K pages ?

I could be wrong but I don't think so : I remember using LRO on myri10ge
a decade ago; it gave me good performance which would degrade with
concurrent connections, up to the point where LRO got deprecated once
GRO started to work quite well. That got me used to always disabling
LRO to be sure to measure something durable ;-)

> bnx2x GRO/LRO has this mode, meaning that around 8 pages are used for a GRO 
> packets of ~32 KB,
> while mlx4 for instance would use one page frag for every ~1428 bytes of 
> payload.

I remember that it was the same on myri10ge (1 segment per page), making
splice() return roughly 21 or 22kB per call for a 64kB pipe. BTW, I think
I got that wrong earlier: 3 years ago it was mlx4 and not mlx5 that I had
been using.

Willy


Re: splice() performance for TCP socket forwarding

2018-12-13 Thread Willy Tarreau
Hi Marek,

On Thu, Dec 13, 2018 at 12:25:20PM +0100, Marek Majkowski wrote:
> Hi!
> 
> I'm basically trying to do TCP splicing in Linux. I'm focusing on
> performance of the simplest case: receive data from one TCP socket,
> write data to another TCP socket. I get poor performance with splice.
> 
> First, the naive code, pretty much:
> 
> while(1){
>  n = read(rs, buf);
>  write(ws, buf, n);
> }
> 
> With GRO enabled, this code does roughly line-rate of 10Gbps, hovering
> ~50% of CPU in application (sys mostly).
> 
> When replaced with splice version:
> 
> pipe(pfd);
> fcntl(pfd[0], F_SETPIPE_SZ, 1024 * 1024);
> while(1) {
>  n = splice(rd, NULL, pfd[1], NULL, 1024*1024,
>SPLICE_F_MOVE);
>   splice(pfd[0], NULL, wd, NULL, n, SPLICE_F_MOVE);
> }
> 
> Full code:
> https://gist.github.com/majek/c58a97b9be7d9217fe3ebd6c1328faaa#file-proxy-splice-c-L59
> 
> I get 100% cpu (sys) and dramatically worse performance (1.5x slower).

It's quite strange, it doesn't match at all what I'm used to. In haproxy
we're using splicing as well between sockets, and for medium to large
objects we always get much better performance with splicing than without.
3 years ago during a test, we reached 60 Gbps on a 4-core machine using
2 40G NICs, which is not an exceptional sizing. And between processes on
the loopback, numbers around 100G are totally possible. By the way this
is one test you should start with, to verify if the issue is more on the
splice side or on the NIC's side. It might be that your network driver is
totally inefficient when used with GRO/GSO. In my case, multi-10G using
ixgbe and 40G using mlx5 have always shown excellent results.

Regards,
Willy


Re: [net-next, RFC, 4/8] net: core: add recycle capabilities on skbs via page_pool API

2018-12-08 Thread Willy Tarreau
On Sat, Dec 08, 2018 at 10:14:47PM +0200, Ilias Apalodimas wrote:
> On Sat, Dec 08, 2018 at 09:11:53PM +0100, Jesper Dangaard Brouer wrote:
> > >  
> > > I want to make sure you guys thought about splice() stuff, and
> > > skb_try_coalesce(), and GRO, and skb cloning, and ...
> > 
> > Thanks for the pointers. To Ilias, we need to double check 
> > skb_try_coalesce()
> > code path, as it does look like we don't handle this correctly.
> > 
> 
> Noted. We did check skb cloning, but not this one indeed

I'll try to run some tests on my clearfog later, but for now I'm still
100% busy until haproxy 1.9 gets released.

Cheers,
Willy


Re: bring back IPX and NCPFS, please!

2018-11-09 Thread Willy Tarreau
On Fri, Nov 09, 2018 at 06:30:14PM +0100, Johannes C. Schulz wrote:
> Hello Willy, hello Stephen
> 
> Thankyou for your reply.
> But I'm not able to maintain or code these modules. I'm just a bloody
> user/webdev.

That's what we've all claimed before taking over something many years
ago, you know :-)  The most important thing is time and the willingness
to try to do it.

You could first look at the latest kernel supporting those, check if
they still work fine in your environment (not everyone has
access to these ones anymore), and if so, then try to copy that code
over newer kernels. Sometimes it will not build with an obvious error
that you'll be able to fix by yourself, sometimes it will be harder
and you'll have to ask for help and/or figure API changes in "git log".
After working many hours on this you'll be much more at ease with this
code and you'll possibly be able to make it work on your kernel version.
This is already a huge step because even if you don't consider it as
being in a mergeable state (too hackish, dirty etc), you have the
option to run it as your own patch for a while.

After this you'll seek some more help about the process needed to get
these merged back and to maintain them as long as you estimate you can
(possibly mark it deprecated and keep it as long as you can). And who
knows, given nothing changes in this area these days, maybe it will be
trivial to maintain this FS for another decade and you'll have learned
something fun and useful.

> It would be really nice if these modules will find a good
> maintainer!

Just think again about the advantages you have over many other people :
  - access to the environment
  - real use case for the feature

There's nothing wrong with trying and failing multiple times, even giving
up if you find the task too hard. But giving up before trying is quite
sad in your situation.

Cheers,
Willy


Re: bring back IPX and NCPFS, please!

2018-11-09 Thread Willy Tarreau
On Fri, Nov 09, 2018 at 02:23:27PM +0100, Johannes C. Schulz wrote:
> Hello all!
> 
> I would like to ask you to bring back the IPX and NCPFS modules to the kernel.
> For whatever reason my admins are using Novell shares on our network, which I'm
> no longer able to use - I'm forced to use cifs instead (and the admins
> will kill the cifs shares at some point), because my kernel (4.18) does
> not have support for ncpfs anymore.
> Maybe we at my work are not enough people for these modules to come
> back just for us, but maybe there are other people out there.
> Thank you.

Well, like any code, it requires time and skills. If nobody with the
required skills is available for this anymore, there's no way you'll
get a feature back. However you could always step up to maintain it
yourself if you have the time and are willing to develop your own
skills at it. It's how maintainers change over time for certain parts
of the system, so you have an opportunity here.

Just my two cents,
Willy


Re: Kernel Panic on high bandwidth transfer over wifi

2018-08-29 Thread Willy Tarreau
On Wed, Aug 29, 2018 at 11:42:44AM +, Nathaniel Munk wrote:
> As you can see from the attached log

You apparently forgot to attach the log.

Willy


Re: how to (cross)connect two (physical) eth ports for ping test?

2018-08-18 Thread Willy Tarreau
On Sat, Aug 18, 2018 at 09:10:25PM +0200, Andrew Lunn wrote:
> On Sat, Aug 18, 2018 at 01:39:50PM -0400, Robert P. J. Day wrote:
> > 
> >   (i'm sure this has been explained many times before, so a link
> > covering this will almost certainly do just fine.)
> > 
> >   i want to loop one physical ethernet port into another, and just
> > ping the daylights from one to the other for stress testing. my fedora
> > laptop doesn't actually have two unused ethernet ports, so i just want
> > to emulate this by slapping a couple startech USB/net adapters into
> > two empty USB ports, setting this up, then doing it all over again
> > monday morning on the actual target system, which does have multiple
> > ethernet ports.
> > 
> >   so if someone can point me to the recipe, that would be great and
> > you can stop reading.
> > 
> >   as far as my tentative solution goes, i assume i need to put at
> > least one of the physical ports in a network namespace via "ip netns",
> > then ping from the netns to the root namespace. or, going one step
> > further, perhaps putting both interfaces into two new namespaces, and
> > setting up forwarding.
> 
> Namespaces is a good solution. Something like this should work:
> 
> ip netns add namespace1
> ip netns add namespace2
> 
> ip link set eth1 netns namespace1
> ip link set eth2 netns namespace2
> 
> ip netns exec namespace1 \
> ip addr add 10.42.42.42/24 dev eth1
> 
> ip netns exec namespace1 \
> ip link set eth1 up
> 
> ip netns exec namespace2 \
> ip addr add 10.42.42.24/24 dev eth2
> 
> ip netns exec namespace2 \
> ip link set eth2 up
> 
> ip netns exec namespace1 \
> ping 10.42.42.24
> 
> You might also want to consider iperf3 for stress testing, depending
> on the sort of stress you need.

FWIW I have a setup somewhere involving ip rule + ip route which achieves
the same without involving namespaces. It's a bit hackish but sometimes
convenient. I can dig it up if someone is interested.

Regards,
Willy


Re: ANNOUNCE: Enhanced IP v1.4

2018-06-05 Thread Willy Tarreau
On Tue, Jun 05, 2018 at 02:33:03PM +0200, Bjørn Mork wrote:
> > I do have IPv6 at home (a /48, waste of addressing space, I'd be fine
> > with less),
> 
> Any reason you would want less?  Any reason the ISP should give you
> less?

What I mean is that *if* the availability of /48 networks was an issue
for some ISPs, I'd be fine with less because I don't plan to deploy 64k
networks at home, though I already have ~9 around the firewall and don't
expect to go much further.

> > Maybe setting up a public list of ISPs where users don't have at least
> > a /60 by default could help, but I suspect that most of them will
> > consider that as long as their competitors are on the list there's no
> > emergency.
> 
> Exactly.  And the number of users using the list as the primary
> parameter for selecting an ISP would be close to 0.  The critical part
> is not the list, but making large enough groups of users consider IPv6
> an important parameter when selecting ISPs.

In fact the IoT trend could play a role here by letting users know that
they can remotely access their fridge and whatever stupid device they've
deployed. But the reality is the opposite : some gateway services are/will
be offered at a paid price to make these devices remotely accessible, and
the claimed security provided by this gateway will be presented as a real
benefit compared to the risks of anyone directly accessing your private
life over IPv6. So I'm not getting much hopes for the future in this area
either.

Willy


Re: ANNOUNCE: Enhanced IP v1.4

2018-06-03 Thread Willy Tarreau
On Sun, Jun 03, 2018 at 03:41:08PM -0700, Eric Dumazet wrote:
> 
> 
> On 06/03/2018 01:37 PM, Tom Herbert wrote:
> 
> > This is not an inconsequential mechanism that is being proposed. It's
> > a modification to IP protocol that is intended to work on the
> > Internet, but it looks like the draft hasn't been updated for two
> > years and it is not adopted by any IETF working group. I don't see how
> > this can go anywhere without IETF support. Also, I suggest that you
> > look at the IPv10 proposal since that was very similar in intent. One
> > of the reasons that IPv10 shot down was because protocol transition
> > mechanisms were more interesting ten years ago than today. IPv6 has
> > good traction now. In fact, it's probably the case that it's now
> > easier to bring up IPv6 than to try to make IPv4 options work over the
> > Internet.
> 
> +1
> 
> Many hosts do not use IPv4 anymore.
> 
> We even have the project making IPv4 support in linux optional.

I agree on these points, but I'd like to figure what can be done to put
a bit more pressure on ISPs to *always* provide IPv6. It's still very
hard to have decent connectivity at home and without this it will continue
to be marginalized.

I do have IPv6 at home (a /48, waste of addressing space, I'd be fine
with less), there's none at work (I don't even know if the ISP supports
it, at least it was never ever mentioned so probably they don't know
about this), and some ISPs only provide a /64 which is as ridiculous
as providing a single address as it forces the end user to NAT thus
breaking the end-to-end principle. Ideally with IoT at the door, every
home connection should have at least a /60 and enterprises should have
a /56, and this by default, without having to request anything.

Maybe setting up a public list of ISPs where users don't have at least
a /60 by default could help, but I suspect that most of them will
consider that as long as their competitors are on the list there's no
emergency.

Willy


Re: ANNOUNCE: Enhanced IP v1.4

2018-06-02 Thread Willy Tarreau
On Sat, Jun 02, 2018 at 12:17:12PM -0400, Sam Patton wrote:
> As far as application examples, check out this simple netcat-like
> program I use for testing:
> 
> https://github.com/EnIP/enhancedip/blob/master/userspace/netcat/netcat.c
> 
> Lines 61-67 show how to connect directly via an EnIP address.  The
> netcat-like application uses
> 
> a header file called eip.h.

OK so basically we need to use a new address structure (which makes sense),
however I'm not sure I understand how the system is supposed to know it has
to use EnIP instead of IPv4 if the sin_family remains the same :

...
   memset(&serv_addr, 0, sizeof(struct sockaddr_ein));
   serv_addr.sin_family= AF_INET;
   serv_addr.sin_port = htons(portno);
   serv_addr.sin_addr1=inet_addr(argv[1]);
   serv_addr.sin_addr2=inet_addr(argv[2]);

   if (connect(sockfd,(struct sockaddr *) &serv_addr,sizeof(serv_addr)) < 0) 
error("ERROR connecting");
...

Does it solely rely on the 3rd argument to connect() to know that it
may read sin_addr2 ? If so, that sounds a bit fragile to me because I
*think* (but could be wrong) that right now any size at least as large
as sockaddr_in work for IPv4 (typically I think if you pass a struct
sockaddr_storage with its size it should work). So there would be a
risk of breaking applications if this is the case.
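To illustrate why it looks fragile: something like the following is
perfectly valid IPv4 code today (just a sketch, not taken from any real
application), and it already passes an addrlen larger than
sizeof(struct sockaddr_in):

/* sketch: a plain AF_INET connect() whose address happens to live in a
 * sockaddr_storage and whose full size is passed as addrlen. If EnIP
 * keyed on "addrlen > sizeof(sockaddr_in)" to read sin_addr2, this kind
 * of existing code could be misinterpreted. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int connect_v4(const char *ip, unsigned short port)
{
    struct sockaddr_storage ss;
    struct sockaddr_in *sin = (struct sockaddr_in *)&ss;
    int fd = socket(AF_INET, SOCK_STREAM, 0);

    memset(&ss, 0, sizeof(ss));
    sin->sin_family = AF_INET;
    sin->sin_port = htons(port);
    sin->sin_addr.s_addr = inet_addr(ip);

    if (fd < 0 || connect(fd, (struct sockaddr *)&ss, sizeof(ss)) < 0) {
        if (fd >= 0)
            close(fd);
        return -1;
    }
    return fd;
}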

I suspect that using a different family would make things more robust
and more friendly to applications. But that's just a first opinion,
you have worked longer than me on this.

>  You can look at it here:
> 
> https://github.com/EnIP/enhancedip/blob/master/userspace/include/eip.h

Thank you. There's a problem with your #pragma pack there, it will
propagate to all structs declared after this file is included and
will force all subsequent structs to be packed. It would be preferable
to use __attribute__((packed)) on the struct or to use pragma pack()
safely (I've just seen it supports push/pop options).
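For example, either of these two forms keeps the packing local to the
structure instead of leaking it to everything declared after the include
(field names below are just made up for illustration):

/* sketch: two ways to pack a single struct without affecting the rest of
 * the translation unit. */
#include <stdint.h>

/* option 1: attribute on the struct itself */
struct eip_example_a {
    uint32_t addr1;
    uint32_t addr2;
    uint16_t port;
} __attribute__((packed));

/* option 2: push the packing, declare, then restore the previous setting */
#pragma pack(push, 1)
struct eip_example_b {
    uint32_t addr1;
    uint32_t addr2;
    uint16_t port;
};
#pragma pack(pop)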

> EnIP makes use of IPv6  records for DNS lookup.  We simply put
> 2001:0101 (which is an IPv6 experimental prefix) and
> 
> then we put the 64-bit EnIP address into the next 8 bytes of the
> address.  The remaining bytes are set to zero.

Does this require a significant amount of changes to the libc ?

> In the kernel, if you want to see how we convert the IPv6 DNS lookup
> into something connect() can manage,
> 
> check out the add_enhanced_ip() routine found here:
> 
> https://github.com/EnIP/enhancedip/blob/master/kernel/4.9.28/socket.c

Hmmm so both IPv4 and IPv6 addresses are supported and used together.
So that just makes me think that if in the end you need a little bit
of IPv6 to handle this, why not instead define a new way to map IPv6 to
Enhanced IPv4 ? I mean, you'd exclusively use AF_INET6 in applications,
within a dedicated address range, allowing you to naturally use all of
the IPv6 ecosystem in place in userland, but use IPv4 over the wire.
This way applications would possibly require even less modififications,
if any at all because they'd accept/connect/send/receive data for IPv6
address families that they all deal with nowadays, but this would rely
on an omni-present IPv4 infrastructure.

> The reason we had to do changes for openssh and not other applications
> (that use DNS) is openssh has a check to
> 
> see if the socket is using IP options.  If the socket does, sshd drops
> the connection.  I had to work around that to get openssh working

OK. They possibly do this to avoid source-routing.

> with EnIP.  The result: if you want to connect the netcat-like program
> with IP addresses you'll end up doing something like the
> 
> example above.  If you're using DNS (getaddrinfo) to connect(), it
> should just work (except for sshd as noted).
> 
> Here's the draft experimental RFC:
> https://tools.ietf.org/html/draft-chimiak-enhanced-ipv4-03

Yep I've found it on the site.

> I'll also note that I am doing this code part time as a hobby for a long
> time so I appreciate your help and support.  It would be really
> 
> great if the kernel community decided to pick this up, but if it's not a
> reality please let me know soonest so I can move on to a
> 
> different hobby.  :)

I *personally* think there is some value in what you're experimenting
with, and doing it as a hobby can leave you with the time needed to
collect pros and cons from various people. I'm just thinking that in
order to get some adoption, you need to be extremely careful not to
break any of the IPv4 applications developed in the last 38 years,
and by this, AF_INET+sizeof() scares me a little bit.

Regards,
Willy


Re: ANNOUNCE: Enhanced IP v1.4

2018-06-01 Thread Willy Tarreau
Hello Sam,

On Fri, Jun 01, 2018 at 09:48:28PM -0400, Sam Patton wrote:
> Hello!
> 
> If you do not know what Enhanced IP is, read this post on netdev first:
> 
> https://www.spinics.net/lists/netdev/msg327242.html
> 
> 
> The Enhanced IP project presents:
> 
>  Enhanced IP v1.4
> 
> The Enhanced IP (EnIP) code has been updated.  It now builds with OpenWRT 
> barrier breaker (for 148 different devices). We've been testing with the 
> Western Digital N600 and N750 wireless home routers.
(...) First note, please think about breaking your lines if you want your
mails to be read by the widest audience, as for some of us here, reading
lines wider than a terminal is really annoying, and such mails are often
not considered worth the time given that there are so many easier ones
left to read.

> Interested in seeing Enhanced IP in the Linux kernel, read on.  Not
> interested in seeing Enhanced IP in the Linux kernel read on.
(...)

So I personally find the concept quite interesting. It reminds me of the
previous IPv5/IPv7/IPv8 initiatives, which in my opinion were a bit hopeless.
Here the fact that you decide to consider the IPv4 address as a network opens
new perspectives. For containerized environments it could be considered that
each server, with one IPv4, can host 2^32 guests and that NAT is not needed
anymore for example. It could also open the possibility that enthusiasts
can more easily host some services at home behind their ADSL line without
having to run on strange ports.

However I think your approach is not the most efficient way to encourage adoption.
It's important to understand that there will be little incentive for people
to patch their kernels to run some code if they don't have the applications
on top of it. The kernel is not the end goal for most users, the kernel is
just the lower layer needed to run applications on top. I looked at your site
and the github repo, and all I could find was a pre-patched openssh, no simple
explanation of what to change in an application.

What you need to do first is to *explain* how to modify userland applications
to support En-IP, provide an echo server and show the parts which have to be
changed. Write a simple client and do the same. Provide your changes to
existing programs as patches, not as pre-patched code. This way anyone can
use your patches on top of other versions, and can use these patches to
understand what has to be modified in their applications.

Once applications are easy to patch, the incentive to install patched kernels
everywhere will be higher. For many enthusiasts, knowing that they only have
to modify the ADSL router to automatically make their internal IoT stuff
accessible from outside indeed becomes appealing.

Then you'll need to provide patches for well known applications like curl,
wget, DNS servers (bind...), then browsers.

In my case I could be interested in adding support for EnIP into haproxy,
and once I don't see any showstopper in doing this, I'd be willing to
patch my kernel to support it.

One last piece of advice: provide links to your drafts in future e-mails, they are not
easy to find on your site, we have to navigate through various pages to
finally find them.

Regards,
Willy


Re: Request for -stable inclusion: time stamping fix for nfp

2018-05-17 Thread Willy Tarreau
Adding Greg here.

Greg, apparently a backport of 46f1c52e66db is needed in 4.9 according
to the thread below. It was merged in 4.13 so 4.14 already has it.

Willy

On Thu, May 17, 2018 at 02:09:03PM -0400, David Miller wrote:
> From: Guillaume Nault 
> Date: Thu, 17 May 2018 19:41:47 +0200
> 
> > On Thu, Nov 16, 2017 at 10:13:28AM +0900, David Miller wrote:
> >> From: Guillaume Nault 
> >> Date: Wed, 15 Nov 2017 17:20:46 +0100
> >> 
> >> > Can you please queue commit 46f1c52e66db
> >> > ("nfp: TX time stamp packets before HW doorbell is rung") for -stable?
> >> > We got hit but this bug in the late summer. We run this fix internally
> >> > since a couple of months, but that'd be better to have it officially
> >> > backported so everyone can benefit of it.
> >> 
> >> Queued up.
> > 
> > I guess this one got lost somewhere as it doesn't appear in linux-4.9.y
> > (other trees aren't relevant).
> > If that's unintentional, than can you please re-queue
> > 46f1c52e66db ("nfp: TX time stamp packets before HW doorbell is rung")
> > to -stable?
> 
> I only submit patches to -stable for the two most recent active branches
> which right now consists of 4.16 and 4.14 as per www.kernel.org



Re: [PATCH linux-stable-4.14] tcp: reset sk_send_head in tcp_write_queue_purge

2018-03-20 Thread Willy Tarreau
Hi David,

regarding the patch below, I'm not certain whether you planned to take
it since it's marked "not applicable" on patchwork, but I suspect it's
only because it doesn't apply to mainline.

However, please note that there are two typos in commit IDs referenced in
the commit message that might be nice to fix before merging :

> > For example, after 27fid7a8ed38 (tcp: purge write queue upon RST),

- here it's "a27fd7a8ed38" (missing 'a' and extra 'i' in the middle)

- and lower in the fixes tag there's the extra 'i' as well :

> > Fixes: a27fid7a8ed38 (tcp: purge write queue upon RST)

There was another report today on the stable list for a similar crash
on 4.14.28 and I suspect it's the one I saw this week-end on my firewall
after upgrading from 4.14.10 to 4.14.27 though I didn't have the symbols
to confirm.

Thanks,
Willy


Re: [PATCH stable 4.9 1/8] x86: bpf_jit: small optimization in emit_bpf_tail_call()

2018-01-29 Thread Willy Tarreau
Hi Eric,

On Mon, Jan 29, 2018 at 06:04:30AM -0800, Eric Dumazet wrote:
> > If these 4 bytes matter, why not use
> > cmpq with an immediate value instead, which saves 2 extra bytes ? :
> >
> >   - the mov above is 11 bytes total :
> >
> >    0:   48 8b 84 d6 78 56 34    mov    0x12345678(%rsi,%rdx,8),%rax
> >    7:   12
> >    8:   48 85 c0                test   %rax,%rax
> >
> >   - the equivalent cmp is only 9 bytes :
> >
> >    0:   48 83 bc d6 78 56 34    cmpq   $0x0,0x12345678(%rsi,%rdx,8)
> >    7:   12 00
> >
> > And as a bonus, it doesn't even clobber rax.
> >
> > Just my two cents,
> 
> 
> Hi Willy
> 
> Please look more closely at following instructions.
> 
> We need the value later, not only testing it being zero :)

Ah OK that makes total sense then ;-)

Thanks,
willy


Re: [PATCH stable 4.9 1/8] x86: bpf_jit: small optimization in emit_bpf_tail_call()

2018-01-28 Thread Willy Tarreau
Hi,

[ replaced stable@ and greg@ by netdev@ as my question below is not
  relevant to stable ]

On Mon, Jan 29, 2018 at 02:48:54AM +0100, Daniel Borkmann wrote:
> From: Eric Dumazet 
> 
> [ upstream commit 84ccac6e7854ebbfb56d2fc6d5bef9be49bb304c ]
> 
> Saves 4 bytes replacing following instructions :
> 
> lea rax, [rsi + rdx * 8 + offsetof(...)]
> mov rax, qword ptr [rax]
> cmp rax, 0
> 
> by :
> 
> mov rax, [rsi + rdx * 8 + offsetof(...)]
> test rax, rax

I've just noticed this on stable@. If these 4 bytes matter, why not use
cmpq with an immediate value instead, which saves 2 extra bytes ? :

  - the mov above is 11 bytes total :

   0:   48 8b 84 d6 78 56 34    mov    0x12345678(%rsi,%rdx,8),%rax
   7:   12
   8:   48 85 c0                test   %rax,%rax

  - the equivalent cmp is only 9 bytes :

   0:   48 83 bc d6 78 56 34    cmpq   $0x0,0x12345678(%rsi,%rdx,8)
   7:   12 00

And as a bonus, it doesn't even clobber rax.

Just my two cents,
Willy


Re: TCP many-connection regression between 4.7 and 4.13 kernels.

2018-01-22 Thread Willy Tarreau
Hi Eric,

On Mon, Jan 22, 2018 at 10:16:06AM -0800, Eric Dumazet wrote:
> On Mon, 2018-01-22 at 09:28 -0800, Ben Greear wrote:
> > My test case is to have 6 processes each create 5000 TCP IPv4 connections 
> > to each other
> > on a system with 16GB RAM and send slow-speed data.  This works fine on a 
> > 4.7 kernel, but
> > will not work at all on a 4.13.  The 4.13 first complains about running out 
> > of tcp memory,
> > but even after forcing those values higher, the max connections we can get 
> > is around 15k.
> > 
> > Both kernels have my out-of-tree patches applied, so it is possible it is 
> > my fault
> > at this point.
> > 
> > Any suggestions as to what this might be caused by, or if it is fixed in 
> > more recent kernels?
> > 
> > I will start bisecting in the meantime...
> > 
> 
> Hi Ben
> 
> Unfortunately I have no idea.
> 
> Are you using loopback flows, or have I misunderstood you ?
> 
> How loopback connections can be slow-speed ?

A few quick points : I have not noticed this on 4.9, which we use with
pretty satisfying performance (typically around 100k conn/s). However
during some recent tests I did around the meltdown fixes on 4.15, I
noticed a high connect() or bind() cost to find a local port when
running on the loopback, that I didn't have the time to compare to
older kernels. However, strace clearly showed that bind() (or connect()
if bind was not used) could take as much as 2-3 ms as source ports were
filling up.

To be clear, it was just a quick observation and anything could be wrong
there, including my tests. I'm just saying this in case it matches anything
Ben has observed. I can try to get more info if that helps, but it could be
a different case.
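
If it helps, something along the lines of the loop below (a simplified
sketch, not the exact test I ran) is enough to reproduce the kind of
measurement I'm talking about; it assumes something is already listening
on 127.0.0.1:8000 and just times each connect():

/*
 * Rough sketch: time each connect() to a local listener and report the
 * slow ones as the ephemeral port range fills up. Assumes something is
 * already listening on 127.0.0.1:8000; the port and the 1 ms threshold
 * are arbitrary.
 */
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

int main(void)
{
    struct sockaddr_in dst;
    struct timespec t0, t1;
    long us;
    int i, fd;

    memset(&dst, 0, sizeof(dst));
    dst.sin_family = AF_INET;
    dst.sin_port = htons(8000);
    inet_pton(AF_INET, "127.0.0.1", &dst.sin_addr);

    for (i = 0; i < 100000; i++) {
        fd = socket(AF_INET, SOCK_STREAM, 0);
        if (fd < 0)
            break;

        clock_gettime(CLOCK_MONOTONIC, &t0);
        if (connect(fd, (struct sockaddr *)&dst, sizeof(dst)) < 0) {
            perror("connect");   /* typically EADDRNOTAVAIL once exhausted */
            close(fd);
            break;
        }
        clock_gettime(CLOCK_MONOTONIC, &t1);

        us = (t1.tv_sec - t0.tv_sec) * 1000000L +
             (t1.tv_nsec - t0.tv_nsec) / 1000;
        if (us > 1000)  /* report connect() calls slower than 1 ms */
            printf("conn %d: connect() took %ld us\n", i, us);

        /* closing leaves the source port in TIME_WAIT, so the local
         * range keeps filling up as the loop progresses */
        close(fd);
    }
    return 0;
}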

Cheers,
Willy


Re: [PATCH 06/18] x86, barrier: stop speculation for failed access_ok

2018-01-07 Thread Willy Tarreau
On Sun, Jan 07, 2018 at 12:17:11PM -0800, Linus Torvalds wrote:
> We need to fix the security problem, but we need to do it *without*
> these braindead arguments that performance is somehow secondary.

OK OK. At least we should have security by default and let people trade
it against performance if they want to and understand the risk. People
never know when they're secure enough (security can't really be measured)
but they know fairly well when they're not fast enough, and that's when
they take action (most often by changing the machine).

Willy


Re: [PATCH 06/18] x86, barrier: stop speculation for failed access_ok

2018-01-07 Thread Willy Tarreau
On Sun, Jan 07, 2018 at 11:47:07AM -0800, Linus Torvalds wrote:
> And the whole "normal people won't even notice" is pure garbage too.
> Don't spread that bullshit when you see actual normal people
> complaining.
> 
> Performance matters. A *LOT*.

Linus, no need to explain that to me, I'm precisely trying to see how
to disable PTI for a specific process because I face up to 45% loss in
certain circumstances, making it a no-go. But while a few of us have
very specific workloads emphasizing this impact, others have very
different ones and will not notice. For example my laptop booted just
fine and I didn't notice anything until I fired up a network benchmark.

Willy


Re: [PATCH 06/18] x86, barrier: stop speculation for failed access_ok

2018-01-06 Thread Willy Tarreau
On Sat, Jan 06, 2018 at 07:38:14PM -0800, Alexei Starovoitov wrote:
> yep. plenty of unknowns and what's happening now is an overreaction.

To be fair there's overreaction on both sides. The vast majority of
users need a 100% safe system and will never notice any difference. A
few of us are more concerned about the risk of performance loss brought
by these fixes. We all need to run tests on the patchsets to bring
numbers to the table.

> What is the rush to push half baked patches into upstream
> that don't even address the variant 1 ?

You probably need to trust the CPU vendors a bit for having had all
the details about the risks for a few months now and accept that if
they're destroying their product's performance compared to their main
competitor, they probably studied a number of other alternatives first.
It doesn't mean they thought about everything of course, and maybe they
need to study your proposal as a better solution to reduce criticism.

> which clearly states that bpf_tail_call() was used in the attack.
> Yet none of the intel nor arm patches address speculation in
> this bpf helper!
> It means that:
> - gpz didn't share neither exploit nor the detailed description
>   of the POC with cpu vendors until now

Or it was considered less urgent to fix compared to the rest. It's
also possible that the scariest details were not written in the GPZ
article.

> Now the attack is well described, yet cpu vendors still pushing
> for lfence patches that don't make sense. Why?

Imagine if you were in their position and were pushing a solution which
would later be found to be inefficient and vulnerable again. Your name
would appear twice in the press within a few months, which would be
terrible. It makes sense in their position to find the safest fix first,
given that those like you or me who are really concerned about the
performance impact know how to add an option to a boot loader or revert
a patch that causes trouble.

> What I think is important is to understand vulnerability first.
> I don't think it was done.

I suspect it was clearly done by those who had no other option but to
work on this painful series over the last few months :-/

> > The differences involved on the "lfence" versus "and" versus before are
> > not likely to be anywhere in that order of magnitude.
> 
> we clearly disagree here. Both intel and arm patches proposed
> to add lfence in bpf_map_lookup() which is the hottest function
> we have and we do run it at 40+Gbps speeds where every nanosecond
> counts, so no, lfence is not a solution.

Please help here by testing the patch series and reporting numbers
before, with the fix, and with your proposal. That's the best way to
catch their attention and to get your proposal considered as a viable
alternative (or as a partial one for certain environments). I did the
same when I believed it could be sufficient to add noise to RDTSC and
found it totally ineffective after testing. But it's better for
everyone that research and testing is done rather than criticizing the
proposed fixes (which is not fair to the people who work hard on them
for everyone else).

Cheers,
Willy



Re: [PATCH 06/18] x86, barrier: stop speculation for failed access_ok

2018-01-06 Thread Willy Tarreau
On Sat, Jan 06, 2018 at 06:38:59PM +, Alan Cox wrote:
> Normally people who propose security fixes don't have to argue about the
> fact they added 30 clocks to avoid your box being 0wned.

In fact it depends, because if a fix makes the system unusable for its
initial purpose, this fix will simply not be deployed at all, which is
the worst that can happen. Especially when it cannot be disabled by
config and people stop updating their systems to stay on the last
"known good" version. Fortunately in Linux we often have the choice so
that users rarely have a valid reason for not upgrading!

Willy


Re: BUG warnings in 4.14.9

2017-12-26 Thread Willy Tarreau
Guys,

Chris reported the bug below and confirmed that reverting commit
9704f81 (ipv6: grab rt->rt6i_ref before allocating pcpu rt) seems to
have fixed the issue for him. This patch is a94b9367 in mainline.

I personally have no opinion on the patch, just found it because it
was the only one touching this area between 4.14.8 and 4.14.9 :-)

Should this be reverted or maybe fixed differently ?

thanks,
Willy

On Tue, Dec 26, 2017 at 01:49:59AM +, Chris Rankin wrote:
> Hi,
> 
> I've just raised https://bugzilla.kernel.org/show_bug.cgi?id=198271
> because the new 4.14.9 kernel is generating lots of BUG warnings, e.g.
> 
> [   35.069924] BUG: using smp_processor_id() in preemptible []
> code: avahi-daemon/761
> [   35.078187] caller is ip6_pol_route+0x46b/0x509 [ipv6]
> [   35.083340] CPU: 5 PID: 761 Comm: avahi-daemon Tainted: G
> I 4.14.9 #1
> [   35.091176] Hardware name: Gigabyte Technology Co., Ltd.
> EX58-UD3R/EX58-UD3R, BIOS FB  05/04/2009
> [   35.100181] Call Trace:
> [   35.102709]  dump_stack+0x46/0x59
> [   35.106095]  check_preemption_disabled+0xca/0xda
> [   35.110786]  ip6_pol_route+0x46b/0x509 [ipv6]
> [   35.115224]  fib6_rule_lookup+0x15/0x4b [ipv6]
> [   35.119745]  ip6_dst_lookup_tail+0x11d/0x18f [ipv6]
> [   35.124702]  ip6_dst_lookup_flow+0x30/0x6f [ipv6]
> [   35.129481]  udpv6_sendmsg+0x5c5/0xa10 [ipv6]
> [   35.133914]  ? ip_reply_glue_bits+0x48/0x48
> [   35.138187]  ? rw_copy_check_uvector+0x6d/0xf9
> [   35.142701]  ? sock_sendmsg+0x28/0x34
> [   35.146393]  sock_sendmsg+0x28/0x34
> [   35.149912]  ___sys_sendmsg+0x1b4/0x246
> [   35.153821]  ? poll_select_copy_remaining+0x104/0x104
> [   35.158995]  ? current_time+0x11/0x52
> [   35.162738]  ? pipe_write+0x353/0x365
> [   35.166483]  ? __sys_sendmsg+0x3c/0x5d
> [   35.170304]  __sys_sendmsg+0x3c/0x5d
> [   35.173909]  do_syscall_64+0x4a/0xe1
> [   35.177482]  entry_SYSCALL64_slow_path+0x25/0x25
> [   35.177484] RIP: 0033:0x7f963d7097b7
> [   35.177485] RSP: 002b:77ffbd18 EFLAGS: 0246 ORIG_RAX:
> 002e
> [   35.177486] RAX: ffda RBX: 000d RCX: 
> 7f963d7097b7
> [   35.177487] RDX:  RSI: 77ffbde0 RDI: 
> 000d
> [   35.177487] RBP:  R08: 0001 R09: 
> 77ffbd42
> [   35.177488] R10: 0002 R11: 0246 R12: 
> 77ffbde0
> [   35.177489] R13: 55e4e1af29cc R14: 000d R15: 
> 0002
> [   35.177565] IN=eth0 OUT= MAC=
> SRC=fe80::::0224:1dff:fecd:0f3a
> DST=ff02:::::::00fb LEN=245 TC=0 HOPLIMIT=255
> FLOWLBL=147873 PROTO=UDP SPT=5353 DPT=5353 LEN=205
> 
> Cheers,
> Chris


Re: [PATCH net 0/3] Few mvneta fixes

2017-12-19 Thread Willy Tarreau
Hi Arnd,

On Tue, Dec 19, 2017 at 09:18:35PM +0100, Arnd Bergmann wrote:
> On Tue, Dec 19, 2017 at 5:59 PM, Gregory CLEMENT
>  wrote:
> > Hello,
> >
> > here it is a small series of fixes found on the mvneta driver. They
> > had been already used in the vendor kernel and are now ported to
> > mainline.
> 
> Does one of the patches look like it addresses the rare Oops we discussed on
> #kernelci this morning?
> 
> https://storage.kernelci.org/stable/linux-4.9.y/v4.9.70/arm/mvebu_v7_defconfig/lab-free-electrons/boot-armada-375-db.html

I could be wrong, but as far as I know the Armada 375 uses mvpp2, not
mvneta, so this series should have no effect there.

Willy


Re: [PATCH] net: bridge: add max_fdb_count

2017-11-17 Thread Willy Tarreau
Hi Andrew,

On Fri, Nov 17, 2017 at 03:06:23PM +0100, Andrew Lunn wrote:
> > Usually it's better to apply LRU or random here in my opinion, as the
> > new entry is much more likely to be needed than older ones by definition.
> 
> Hi Willy
> 
> I think this depends on why you need to discard. If it is normal
> operation and the limits are simply too low, i would agree.
> 
> If however it is a DoS, throwing away the new entries makes sense,
> leaving the old ones which are more likely to be useful.
> 
> Most of the talk in this thread has been about limits for DoS
> prevention...

Sure, but my point is that it can kick in on regular traffic, and in
that case it can be catastrophic. That's the only thing that bothers me.
If the default value is unlimited with this algorithm, I'm fine, because
nobody will get caught by accident with a bridge suddenly flooding high
traffic on all ports just because an unknown limit was reached. That's
the principle of least surprise.

I know that when fighting DoSes there's never any universally good
solutions and one has to make tradeoffs. I'm perfectly fine with this.

Cheers,
Willy


Re: [PATCH] net: bridge: add max_fdb_count

2017-11-16 Thread Willy Tarreau
Hi Stephen,

On Thu, Nov 16, 2017 at 04:27:18PM -0800, Stephen Hemminger wrote:
> On Thu, 16 Nov 2017 21:21:55 +0100
> Vincent Bernat  wrote:
> 
> > On 16 November 2017 20:23 +0100, Andrew Lunn  wrote:
> > 
> > > struct net_bridge_fdb_entry is 40 bytes.
> > >
> > > My WiFi access point which is also a 5 port bridge, currently has 97MB
> > > free RAM. That is space for about 2.5M FDB entries. So even Roopa's
> > > 128K is not really a problem, in terms of memory.  
> > 
> > I am also interested in Sarah's patch because we can now have bridge
> > with many ports through VXLAN. The FDB can be replicated to an external
> > daemon with BGP and the cost of each additional MAC address is therefore
> > higher than just a few bytes. It seems simpler to implement a limiting
> > policy early (at the port or bridge level).
> > 
> > Also, this is a pretty standard limit to have for a bridge (switchport
> > port-security maximum on Cisco, set interface X mac-limit on
> > Juniper). And it's not something easy to do with ebtables.
> 
> I want an optional limit per port, it makes a lot of sense.
> If for no other reason than that huge hash tables are a performance problem.

Except it's not really a limit, in that it doesn't prevent new traffic from
going through; it only prevents new MACs from being learned, so you suddenly
start flooding all ports with that traffic once the limit is reached, which
is not trivial to detect nor diagnose.

> There is a bigger question about which fdb to evict but just dropping the
> new one seems to be easiest and as good as any other solution.

Usually it's better to apply LRU or random eviction here in my opinion, as
the new entry is by definition much more likely to be needed than older ones.
In terms of CPU usage it may even be better to kill an entire series of
entries in the hash table (e.g. all nodes chained in the same bucket), as
the operation is almost as cheap and will not need to be repeated for a
while.
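
To illustrate what I mean by killing an entire series at once, here is a
toy model in plain user-space C (not bridge code): once the table is full,
one randomly picked bucket is emptied in a single pass, which frees a batch
of slots at once instead of evicting one entry per insertion:

/*
 * Toy model of the "evict a whole bucket" idea -- user-space C, not
 * bridge code. When the table is full, pick one bucket (pseudo-)randomly
 * and drop every entry chained in it.
 */
#include <stdlib.h>
#include <string.h>

#define HASH_SIZE   256
#define MAX_ENTRIES 1024

struct fdb_entry {
    unsigned char mac[6];
    struct fdb_entry *next;
};

static struct fdb_entry *table[HASH_SIZE];
static unsigned int nr_entries;

static unsigned int hash_mac(const unsigned char *mac)
{
    unsigned int h = 0, i;

    for (i = 0; i < 6; i++)
        h = h * 31 + mac[i];
    return h % HASH_SIZE;
}

/* drop every entry chained in one bucket */
static void evict_bucket(unsigned int idx)
{
    struct fdb_entry *e, *next;

    for (e = table[idx]; e; e = next) {
        next = e->next;
        free(e);
        nr_entries--;
    }
    table[idx] = NULL;
}

int fdb_learn(const unsigned char *mac)
{
    unsigned int idx = hash_mac(mac);
    struct fdb_entry *e;

    if (nr_entries >= MAX_ENTRIES)
        evict_bucket(rand() % HASH_SIZE);  /* random victim bucket */

    e = malloc(sizeof(*e));
    if (!e)
        return -1;
    memcpy(e->mac, mac, 6);
    e->next = table[idx];
    table[idx] = e;
    nr_entries++;
    return 0;
}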

Willy


Re: [PATCH] net: bridge: add max_fdb_count

2017-11-16 Thread Willy Tarreau
Hi Sarah,

On Thu, Nov 16, 2017 at 01:20:18AM -0800, Sarah Newman wrote:
> I note that anyone who would run up against a too-low limit on the maximum
> number of fdb entries would also be savvy enough to fix it in a matter of
> minutes.

I disagree on this point. There's a huge difference between experiencing
sudden breakage under normal conditions due to arbitrary limits being set
and being down because of an attack. While the latter is not desirable,
it's much more easily accepted and most often requires operator intervention
anyway. The former is never an option.

And I continue to think that the default behaviour once the limit is reached
must not be to prevent new entries from being learned but to purge older
ones. At least it preserves normal operations.

But given the high CPU impact you reported for a very low load, something
definitely needs to be done.

> They could also default the limit to U32_MAX in their particular
> distribution if it was a configuration option.

Well, I'd say that we don't have a default limit on the socket number either
and that it happens to be the expected behaviour. It's almost impossible to
find a suitable limit for everyone. People dealing with regular loads never
read docs and get caught. People working in hostile environments are always
more careful and will ensure that their limits are properly set.

> At the moment there is not even a single log message if the problem doesn't
> result in memory exhaustion.

This probably needs to be addressed as well!

Regards,
Willy


Re: [PATCH] net: recvmsg: Unconditionally zero struct sockaddr_storage

2017-10-31 Thread Willy Tarreau
On Tue, Oct 31, 2017 at 09:14:45AM -0700, Kees Cook wrote:
> diff --git a/net/socket.c b/net/socket.c
> index c729625eb5d3..34183f4fbdf8 100644
> --- a/net/socket.c
> +++ b/net/socket.c
> @@ -2188,6 +2188,7 @@ static int ___sys_recvmsg(struct socket *sock, struct 
> user_msghdr __user *msg,
>   struct sockaddr __user *uaddr;
>   int __user *uaddr_len = COMPAT_NAMELEN(msg);
>  
> + memset(&addr, 0, sizeof(addr));
>   msg_sys->msg_name = &addr;

Isn't this going to cause a performance hit in the fast path ? Just
checking, I have not read the whole code with the patch in its context.
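
For what it's worth, the raw cost of zeroing a 128-byte sockaddr_storage is
easy to ballpark from userland; the sketch below is obviously not the kernel
fast path, it only gives an order of magnitude per call:

/*
 * User-space order-of-magnitude check for the cost of zeroing a
 * struct sockaddr_storage (128 bytes) per call. This is not the kernel
 * fast path, it only gives a rough per-call figure.
 */
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <sys/socket.h>

int main(void)
{
    struct sockaddr_storage addr;
    struct timespec t0, t1;
    long i, iters = 100000000L;
    double ns;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (i = 0; i < iters; i++) {
        memset(&addr, 0, sizeof(addr));
        /* keep the compiler from optimizing the store away */
        __asm__ __volatile__("" : : "m"(addr) : "memory");
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    printf("%.2f ns per memset of %zu bytes\n", ns / iters, sizeof(addr));
    return 0;
}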

Willy


Re: net.ipv4.tcp_max_syn_backlog implementation

2017-08-28 Thread Willy Tarreau
On Mon, Aug 28, 2017 at 11:47:41PM -0400, Harsha Chenji wrote:
> So I have ubuntu 12.04 x32 in a VM with syncookies turned off. I tried
> to do a syn flood (with netwox) on 3 different processes. Each of them
> returns a different value with netstat -na | grep -c RECV :
> 
> nc -l  returns 16 (netcat-traditional)
> apache2 port 80 returns 256
> vsftpd on 21 returns 64.
> net.ipv4.tcp_max_syn_backlog is 512.
> 
> Why do these different processes on different ports have different
> queue lengths for incomplete connections? Where exactly in the kernel
> is this decided?

The listening socket's backlog (second argument to the listen() syscall)
is considered as well. The code path that determines the queue size and
whether or not to start sending SYN cookies is far from trivial, but it
makes sense once you write it down completely. I never remember it
perfectly; I regularly have to recheck it when I have a doubt.
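
From memory, the sizing went roughly like the sketch below on kernels of
that era (the authoritative code is in sys_listen() and reqsk_queue_alloc();
the vsftpd and apache backlog values are only my assumptions about their
usual defaults): the application backlog is capped by net.core.somaxconn
and tcp_max_syn_backlog, floored at 8, then rounded up to the next power
of two.

/*
 * Rough reconstruction of how kernels of that era sized the SYN
 * (request_sock) queue -- see sys_listen() and reqsk_queue_alloc() for
 * the authoritative code, this only reproduces the arithmetic. The
 * vsftpd and apache backlog values are assumptions about their defaults.
 */
#include <stdio.h>

static unsigned int roundup_pow_of_two(unsigned int n)
{
    unsigned int r = 1;

    while (r < n)
        r <<= 1;
    return r;
}

static unsigned int syn_queue_size(unsigned int app_backlog,
                                   unsigned int somaxconn,
                                   unsigned int max_syn_backlog)
{
    unsigned int n = app_backlog;

    if (n > somaxconn)          /* sys_listen() caps at net.core.somaxconn */
        n = somaxconn;
    if (n > max_syn_backlog)    /* then capped at tcp_max_syn_backlog */
        n = max_syn_backlog;
    if (n < 8)                  /* floored at 8 ... */
        n = 8;
    return roundup_pow_of_two(n + 1);  /* ... and rounded up to a power of 2 */
}

int main(void)
{
    /* 128 = usual default somaxconn, 512 = your tcp_max_syn_backlog */
    printf("nc      (backlog 1)   -> %u\n", syn_queue_size(1, 128, 512));
    printf("vsftpd  (backlog 32)  -> %u\n", syn_queue_size(32, 128, 512));
    printf("apache2 (backlog 511) -> %u\n", syn_queue_size(511, 128, 512));
    return 0;
}

If those assumed backlogs are right, this prints 16, 64 and 256, which lines
up with the counts you saw in netstat.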

Willy


Re: [PATCH net 3/3] tcp: fix xmit timer to only be reset if data ACKed/SACKed

2017-08-06 Thread Willy Tarreau
On Sun, Aug 06, 2017 at 07:39:57AM +, maowenan wrote:
> 
> 
> > -Original Message-
> > From: Willy Tarreau [mailto:w...@1wt.eu]
> > Sent: Saturday, August 05, 2017 2:19 AM
> > To: Neal Cardwell
> > Cc: maowenan; David Miller; netdev@vger.kernel.org; Yuchung Cheng; Nandita
> > Dukkipati; Eric Dumazet
> > Subject: Re: [PATCH net 3/3] tcp: fix xmit timer to only be reset if data
> > ACKed/SACKed
> > 
> > On Fri, Aug 04, 2017 at 02:01:34PM -0400, Neal Cardwell wrote:
> > > On Fri, Aug 4, 2017 at 1:10 PM, Willy Tarreau  wrote:
> > > > Hi Neal,
> > > >
> > > > On Fri, Aug 04, 2017 at 12:59:51PM -0400, Neal Cardwell wrote:
> > > >> I have attached patches for this fix rebased on to v3.10.107, the
> > > >> latest stable release for 3.10. That's pretty far back in history,
> > > >> so there were substantial conflict resolutions and adjustments 
> > > >> required.
> > > >> :-) Hope that helps.
> > > >
> > > > At least it will help me :-)
> > > >
> > > > Do you suggest that I queue them for 3.10.108, that I wait for
> > > > Maowenan to test them more broadly first or anything else ?
> > >
> > > Let's wait for Maowenan to test them first.
> [Mao Wenan] It works well with these patches on v3.10, and the
> retransmission packet is sent about 250ms after the TLP probe.
> Thank you very much for these patches under 3.10.

Thanks, I'll take them then.

Willy


Re: [PATCH net 3/3] tcp: fix xmit timer to only be reset if data ACKed/SACKed

2017-08-04 Thread Willy Tarreau
On Fri, Aug 04, 2017 at 02:01:34PM -0400, Neal Cardwell wrote:
> On Fri, Aug 4, 2017 at 1:10 PM, Willy Tarreau  wrote:
> > Hi Neal,
> >
> > On Fri, Aug 04, 2017 at 12:59:51PM -0400, Neal Cardwell wrote:
> >> I have attached patches for this fix rebased on to v3.10.107, the
> >> latest stable release for 3.10. That's pretty far back in history, so
> >> there were substantial conflict resolutions and adjustments required.
> >> :-) Hope that helps.
> >
> > At least it will help me :-)
> >
> > Do you suggest that I queue them for 3.10.108, that I wait for Maowenan
> > to test them more broadly first or anything else ?
> 
> Let's wait for Maowenan to test them first.

Fine, thanks.
Willy


Re: [PATCH net 3/3] tcp: fix xmit timer to only be reset if data ACKed/SACKed

2017-08-04 Thread Willy Tarreau
Hi Neal,

On Fri, Aug 04, 2017 at 12:59:51PM -0400, Neal Cardwell wrote:
> I have attached patches for this fix rebased on to v3.10.107, the
> latest stable release for 3.10. That's pretty far back in history, so
> there were substantial conflict resolutions and adjustments required.
> :-) Hope that helps.

At least it will help me :-)

Do you suggest that I queue them for 3.10.108, that I wait for Maowenan
to test them more broadly first or anything else ?

I'm fine with any option.
Thanks!
Willy


Re: STABLE: net: reduce skb_warn_bad_offload() noise

2017-07-28 Thread Willy Tarreau
On Fri, Jul 28, 2017 at 10:22:52PM -0700, Eric Dumazet wrote:
> On Fri, 2017-07-28 at 12:30 -0700, David Miller wrote:
> > From: Mark Salyzyn 
> > Date: Fri, 28 Jul 2017 10:29:57 -0700
> > 
> > > Please backport the upstream patch to the stable trees (including
> > > 3.10.y, 3.18.y, 4.4.y and 4.9.y):
> > > 
> > > b2504a5dbef3305ef41988ad270b0e8ec289331c net: reduce
> > > skb_warn_bad_offload() noise
> > > 
> > > Impacting performance or creating unnecessary alarm, and will result
> > > in kernel panic for panic_on_warn configuration.
> > 
> > Yeah this is fine.
> 
> If you do so, also backport 6e7bc478c9a006c701c14476ec9d389a484b4864
> ("net: skb_needs_check() accepts CHECKSUM_NONE for tx")
 
OK, noted for next 3.10 as well.

thanks!
Willy

