Re: PROBLEM: DoS Attack on Fragment Cache
On Sat, Apr 17, 2021 at 10:26:30PM -0400, Matt Corallo wrote:
> Sure, there are better ways to handle the reassembly cache overflowing, but
> that is pretty unrelated to the fact that waiting 30 full seconds for a
> fragment to come in doesn't really make sense in today's networks (the 30
> second delay that is used today appears to even be higher than RFC 791
> suggested in 1981!).

Not exactly, because you're forgetting the TTL here. With most hosts
sending an initial TTL around 64, after crossing 10-15 hops it's still
around 50, so that would result in ~50 seconds by default, even according
to the 40-year-old RFC 791. The 15s there was the absolute minimum. While
I do agree that we shouldn't keep them that long nowadays, we can't go too
low without risking breaking some slow transmission stacks (SLIP/PPP over
modems for example). In addition, even cutting that in 3 will remain
trivially DoSable.

> You get a lot more bang for your buck if you don't wait
> around so long (or we could restructure things to kick out the oldest
> fragments, but that is a lot more work, and probably extra indexes that just
> aren't worth it).

Kicking out the oldest ones is a bad approach in a system made only of
independent elements, because it tends to result in a lot of damage once
all of them behave similarly. I.e. if you need to kick out an old entry
in valid traffic, it's because you do need to wait that long; and if all
datagrams need to wait that long, then new datagrams systematically
prevent the oldest one from being reassembled, and none gets reassembled.
With a random approach, at least your success ratio converges towards
1/e (i.e. ~36.8%), which is better.

Willy
Re: PROBLEM: DoS Attack on Fragment Cache
On Sat, Apr 17, 2021 at 12:42:39AM -0700, Keyu Man wrote:
> How about at least allow the existing queue to finish? Currently a tiny new
> fragment would potentially invalid all previous fragments by letting them
> timeout without allowing the fragments to come in to finish the assembly.

Because this is exactly the principle of how attacks are built: reserve
resources claiming that you'll send everything, so that others can't make
use of the resources that are reserved to you. The best solution precisely
is *not* to wait for anyone to finish, hence *not* to reserve valuable
resources that are unusable by others.

Willy
Re: PROBLEM: DoS Attack on Fragment Cache
On Sat, Apr 17, 2021 at 06:44:40AM +0200, Eric Dumazet wrote:
> On Sat, Apr 17, 2021 at 2:31 AM David Ahern wrote:
> >
> > [ cc author of 648700f76b03b7e8149d13cc2bdb3355035258a9 ]
>
> I think this has been discussed already. There is no strategy that
> makes IP reassembly units immune to DDOS attacks.

Having tried to deal with this in the past as well, I agree with this
conclusion, which is also another good example of why fragments should
really be avoided as much as possible over hostile networks.

However, I also found that randomly dropping previous entries is the
approach which seems to offer the best statistical opportunity for
legitimate traffic to still work under attack (albeit really poorly,
considering that any lost fragment requires retransmission of the whole
series). In this case the chance for a packet to be successfully
reassembled would vary proportionally to the inverse of its number of
fragments, which reasonably limits the impact of attacks (without being
an ultimate solution, of course).

> We added rb-tree and sysctls to let admins choose to use GB of RAM if
> they really care.

I agree that for those who care, the real solution is to make sure they
can store all the traffic they receive during a reassembly period.
Legitimate traffic mostly reassembles quickly, so keeping 1 second of
traffic at 10 Gbps is only 1.25 GB of RAM after all...

Willy
Re: [PATCH] Revert "macb: support the two tx descriptors on at91rm9200"
Hi Daniel,

On Tue, Apr 06, 2021 at 07:04:58PM +0900, Daniel Palmer wrote:
> Hi Willy,
>
> I've been messing with the SSD202D (sibling of the MSC313E) recently
> and the ethernet performance was awful.
> I remembered this revert and reverted it and it makes the ethernet
> work pretty well.

OK, that's reassuring; at least they use the same IP blocks.

> [ 15.213358] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
> [ 15.245235] pre0 41001fb8, post0 30001fb0 -> IDLETSR clear
> [ 15.249291] pre1 30001f90, post1 20001d90 -> FIFOIDLE1 clear
> [ 15.253331] pre2 20001d90, post2 10001990 -> FIFOIDLE2 clear
> [ 15.257385] pre3 10001990, post3 1190 -> FIFOIDLE3 clear
> [ 15.261435] pre4 1190, post4 f191 -> FIFOIDLE4 clear, OVR set
> [ 15.265485] pre5 f190, post5 f191 -> OVR set
> [ 15.269535] pre6 f190, post6 e181 -> OVR set, BNQ clear
>
> There seems to be a FIFO empty counter in the top of the register but
> this is totally undocumented.

Yes, that's extremely visible. I did notice some apparently random values
at the top of the register as well, without being able to determine what
they were for, because I was pretty much convinced I was facing a 2-deep
queue only. You had a good idea to force it to duplicate packets :-)

> There are two new status bits TBNQ and FBNQ at bits 7 and 8. I have no
> idea what they mean.

Maybe they're related to tx queue empty / tx queue full. Just guessing.
Since all these bits tend not to be reset until written to, I got
confused a few times trying to understand what they indicated.

> Bits 9 through 12 are some new status bits that seem to show if a FIFO
> slot is inuse.

Maybe, indeed.

> I can't find any mention of these bits anywhere except the header of
> the vendor driver so I think these are specific to MStar's macb.
> The interesting part though is that BNQ does not get cleared until
> multiple frames have been pushed in and after OVR is set.
> I think this is what breaks your code when it runs on the MSC313E
> version of MACB.
Yes, that's very possible, because the behavior was inconsistent with
what was documented for older chips. Also, BNQ changed its meaning on
more recent chips, so it may very well have yet another one here.

> Anyhow. I'm working on a version of your patch that should work with
> both the at91rm9200 and the MSC313E.

Cool! Thanks for letting me know. If you need me to run some tests, let
me know (just don't expect too short a latency these days, but I
definitely will test).

Cheers,
Willy
Re: [PATCH] atl1c: optimize rx loop
On Fri, Mar 19, 2021 at 12:04:47PM +0800, Sieng Piaw Liew wrote:
> Remove this trivial bit of inefficiency from the rx receive loop,
> results in increase of a few Mbps in iperf3. Tested on Intel Core2
> platform.
>
> Signed-off-by: Sieng Piaw Liew
> ---
>  drivers/net/ethernet/atheros/atl1c/atl1c_main.c | 4 +---
>  1 file changed, 1 insertion(+), 3 deletions(-)
>
> diff --git a/drivers/net/ethernet/atheros/atl1c/atl1c_main.c b/drivers/net/ethernet/atheros/atl1c/atl1c_main.c
> index 3f65f2b370c5..b995f9a0479c 100644
> --- a/drivers/net/ethernet/atheros/atl1c/atl1c_main.c
> +++ b/drivers/net/ethernet/atheros/atl1c/atl1c_main.c
> @@ -1796,9 +1796,7 @@ static void atl1c_clean_rx_irq(struct atl1c_adapter *adapter,
>  	struct atl1c_recv_ret_status *rrs;
>  	struct atl1c_buffer *buffer_info;
>
> -	while (1) {
> -		if (*work_done >= work_to_do)
> -			break;
> +	while (*work_done < work_to_do) {

It should not change anything, or only depending on the compiler's
optimizations, and it should not result in a measurable difference,
because what it does is exactly the same. Have you really compared the
compiled output code to explain the difference? I strongly suspect you'll
find no difference at all. Thus for me it's certainly not an optimization;
it could be qualified as a cleanup to improve code readability, however.

Willy
Re: TIOCOUTQ implementation for sockets vs. tty
Hi,

On Wed, Mar 10, 2021 at 07:16:34PM +0100, Henneberg - Systemdesign wrote:
> Hi,
>
> I have a question regarding the implementation of ioctl TIOCOUTQ for
> various sockets compared to the tty implementation.
>
> For several sockets, e.g. AF_BLUETOOTH, it is done like this
>
> af_bluetooth.c:
> 	case TIOCOUTQ:
> 		if (sk->sk_state == BT_LISTEN)
> 			return -EINVAL;
>
> 		amount = sk->sk_sndbuf - sk_wmem_alloc_get(sk);
> 		if (amount < 0)
> 			amount = 0;
> 		err = put_user(amount, (int __user *)arg);
> 		break;
>
> so the ioctl returns the available space in the send queue, if I
> understand the code correctly (this is also what I observed from tests).
>
> The tty does this:
>
> n_tty.c:
> 	case TIOCOUTQ:
> 		return put_user(tty_chars_in_buffer(tty), (int __user *) arg);
>
> so it returns the used space in the send queue. This is also what I
> would expect from the manpage description.
>
> Is this mismatch intentional?

At least both man pages (tty_ioctl and tcp(7)) mention that TIOCOUTQ
should return the number of bytes in the queue. What I suspect for
sockets is that sk_sndbuf grows with allocations and that
sk_wmem_alloc_get() in fact returns the number of unused allocations,
thus the difference would be the amount queued. But I could be wrong,
and I would tend to read the code the same way as you did.

Willy
Re: macb broken on HiFive Unleashed
Hi,

On Tue, Mar 09, 2021 at 08:55:10AM +, claudiu.bez...@microchip.com wrote:
> Hi Andreas,
>
> On 08.03.2021 21:30, Andreas Schwab wrote:
> > One of the changes to the macb driver between 5.10 and 5.11 has broken
> > the SiFive HiFive Unleashed. These are the last messages before the
> > system hangs:
> >
> > [ 12.468674] libphy: Fixed MDIO Bus: probed
> > [ 12.746518] macb 1009.ethernet: Registered clk switch 'sifive-gemgxl-mgmt'
> > [ 12.753119] macb 1009.ethernet: GEM doesn't support hardware ptp.
> > [ 12.760178] libphy: MACB_mii_bus: probed
> > [ 12.881792] MACsec IEEE 802.1AE
> > [ 12.944426] macb 1009.ethernet eth0: Cadence GEM rev 0x10070109 at 0x1009 irq 16 (70:b3:d5:92:f1:07)
>
> I don't have a SiFive HiFive Unleashed to investigate this. Can you check
> if reverting commits on macb driver b/w 5.10 and 5.11 solves your issues:
>
> git log --oneline v5.10..v5.11 -- drivers/net/ethernet/cadence/
> 1d0d561ad1d7 net: macb: Correct usage of MACB_CAPS_CLK_HW_CHG flag
> 1d608d2e0d51 Revert "macb: support the two tx descriptors on at91rm9200"
> 700d566e8171 net: macb: add support for sama7g5 emac interface
> ec771de654e4 net: macb: add support for sama7g5 gem interface
> f4de93f03ed8 net: macb: unprepare clocks in case of failure
> 38493da4e6a8 net: macb: add function to disable all macb clocks
> daafa1d33cc9 net: macb: add capability to not set the clock rate
> edac63861db7 net: macb: add userio bits as platform configuration
> 9e6cad531c9d net: macb: Fix passing zero to 'PTR_ERR'
> 0012eeb370f8 net: macb: fix NULL dereference due to no pcs_config method
> e4e143e26ce8 net: macb: add support for high speed interface

In addition, it's worth mentioning that the driver has multiple rx/tx/irq
functions depending on the platform or chip variant, and that based on
this it should be easy to further reduce this list.

Just my two cents,
Willy
[PATCH] Revert "macb: support the two tx descriptors on at91rm9200"
This reverts commit 0a4e9ce17ba77847e5a9f87eed3c0ba46e3f82eb.

The code was developed and tested on an MSC313E SoC, which seems to be
half-way between the AT91RM9200 and the AT91SAM9260, in that it supports
both the 2-descriptor mode and a Tx ring.

It turns out that after the code was merged, I noticed that the controller
would sometimes lock up, and only when dealing with sustained
bidirectional transfers, in which case it would report a Tx overrun
condition right after having reported being ready, and would stop sending
even after the status is cleared (a down/up cycle fixes it, though).
After adding lots of traces, I couldn't spot a sequence pattern that
would allow predicting that this situation was about to happen. The chip
comes with no documentation, and other bits are often reported, with no
conclusive pattern either.

It is possible that my change is wrong, just like it is possible that the
controller on the chip is bogus or at least unpredictable based on
existing docs from other chips. I do not have an RM9200 at hand to test
at the moment, and a few tests run on a more recent 9G20 indicate that
this code path cannot be used there to test the code on a third platform.

Since the MSC313E works fine in the single-descriptor mode, and since
people using the old RM9200 very likely favor stability over performance,
better revert this patch until we can test it on the original platform
this part of the driver was written for. Note that the reverted patch was
actually tested on the MSC313E.
Cc: Nicolas Ferre
Cc: Claudiu Beznea
Cc: Daniel Palmer
Cc: Alexandre Belloni
Link: https://lore.kernel.org/netdev/20201206092041.ga10...@1wt.eu/
Signed-off-by: Willy Tarreau
---
 drivers/net/ethernet/cadence/macb.h      |  2 --
 drivers/net/ethernet/cadence/macb_main.c | 46 ++--
 2 files changed, 8 insertions(+), 40 deletions(-)

diff --git a/drivers/net/ethernet/cadence/macb.h b/drivers/net/ethernet/cadence/macb.h
index 85be045..9fbdf53 100644
--- a/drivers/net/ethernet/cadence/macb.h
+++ b/drivers/net/ethernet/cadence/macb.h
@@ -1216,8 +1216,6 @@ struct macb {
 	/* AT91RM9200 transmit queue (1 on wire + 1 queued) */
 	struct macb_tx_skb	rm9200_txq[2];
-	unsigned int		rm9200_tx_tail;
-	unsigned int		rm9200_tx_len;
 	unsigned int		max_tx_length;

 	u64			ethtool_stats[GEM_STATS_LEN + QUEUE_STATS_LEN * MACB_MAX_QUEUES];
diff --git a/drivers/net/ethernet/cadence/macb_main.c b/drivers/net/ethernet/cadence/macb_main.c
index 817c7b0..352ae7f 100644
--- a/drivers/net/ethernet/cadence/macb_main.c
+++ b/drivers/net/ethernet/cadence/macb_main.c
@@ -3936,7 +3936,6 @@ static int at91ether_start(struct macb *lp)
 			     MACB_BIT(ISR_TUND)	|
 			     MACB_BIT(ISR_RLE)	|
 			     MACB_BIT(TCOMP)	|
-			     MACB_BIT(RM9200_TBRE)	|
 			     MACB_BIT(ISR_ROVR) |
 			     MACB_BIT(HRESP));
@@ -3953,7 +3952,6 @@ static void at91ether_stop(struct macb *lp)
 			     MACB_BIT(ISR_TUND)	|
 			     MACB_BIT(ISR_RLE)	|
 			     MACB_BIT(TCOMP)	|
-			     MACB_BIT(RM9200_TBRE)	|
 			     MACB_BIT(ISR_ROVR) |
 			     MACB_BIT(HRESP));
@@ -4023,10 +4021,11 @@ static netdev_tx_t at91ether_start_xmit(struct sk_buff *skb,
 				       struct net_device *dev)
 {
 	struct macb *lp = netdev_priv(dev);
-	unsigned long flags;

-	if (lp->rm9200_tx_len < 2) {
-		int desc = lp->rm9200_tx_tail;
+	if (macb_readl(lp, TSR) & MACB_BIT(RM9200_BNQ)) {
+		int desc = 0;
+
+		netif_stop_queue(dev);

 		/* Store packet information (to free when Tx completed) */
 		lp->rm9200_txq[desc].skb = skb;
@@ -4040,15 +4039,6 @@ static netdev_tx_t at91ether_start_xmit(struct sk_buff *skb,
 		return NETDEV_TX_OK;
 	}

-	spin_lock_irqsave(&lp->lock, flags);
-
-	lp->rm9200_tx_tail = (desc + 1) & 1;
-	lp->rm9200_tx_len++;
-	if (lp->rm9200_tx_len > 1)
-		netif_stop_queue(dev);
-
-	spin_unlock_irqrestore(&lp->lock, flags);
-
 	/* Set address of the data in the Transmit Address register */
 	macb_writel(lp, TAR, lp->rm9200_txq[desc].mapping);
 	/* Set length of the packet in the Transmit Control register */
@@ -4114,8 +4104,6 @@ static irqreturn_t at91ether_interrupt(int irq, void *dev_id)
 	struct macb *lp = netdev_priv(dev);
 	u32 intstatus, ctl;
 	unsigned int desc;
-	unsigned int qlen;
-	u32 tsr;

 	/* MAC Interrupt Status register i
Re: macb: should we revert 0a4e9ce17ba7 ("macb: support the two tx descriptors on at91rm9200") ?
On Mon, Dec 07, 2020 at 03:40:42PM -0800, Jakub Kicinski wrote:
> Thanks for the report, I remember that one. In hindsight maybe we
> should have punted it to 5.11...

Well, not necessarily, as it was simple, well documented and *appeared*
to work fine.

> Let's revert ASAP, 5.10 is going to be LTS, people will definitely
> notice.

It could take some time, as we're speaking about crazy people running
5.10 on an old 180 MHz MCU :-)

> Would you mind sending a revert patch with the explanation in the
> commit message?

Sure, will do. Thanks!

Willy
macb: should we revert 0a4e9ce17ba7 ("macb: support the two tx descriptors on at91rm9200") ?
Hi Jakub,

Two months ago I implemented a small change in the macb driver to support
the two Tx descriptors that the AT91RM9200 supports. I implemented this
using the only compatible device I had, which is the MSC313E-based
Breadbee board.

Since then I've met situations where the chip would stop sending, mostly
when doing bidirectional traffic. I've spent a few weekends on this
already, trying various things including blocking interrupts over a
larger section, switching to 32-bit register accesses on the MSC313E
(the previous code was using two 16-bit accesses and I thought that was
the cause), and tracing status registers along the various operations.
Each time the pattern looks similar: a write when the chip declares
having room results in an overrun, but the preceding condition doesn't
leave any info suggesting this may happen.

Sadly we don't have the datasheet for this SoC; what is visible is that
it supports the AT91RM9200's tx mode and that it works well when there's
a single descriptor in flight. In this case it complies with the
AT91RM9200's datasheet. The chip reports other undocumented bits in its
status registers, which cannot even be correlated to the issue, by the
way.

I couldn't spot anything wrong in my changes, even after making heavy
changes to progress on debugging, and the SoC's behavior of reporting an
overrun after a single write when there's room contradicts the RM9200's
datasheet. In addition, we know the chip also supports other
descriptor-based modes, so it's very possible that it doesn't perfectly
implement the RM9200's 2-desc mode and that my change is still fine.

Yesterday I hoped to get my old AT91SAM9G20 board to execute this code
and test it, but it clearly does not work when I remap the init and tx
functions, which indicates that it really does not implement a hidden
compatibility mode with the old chip. Thus at this point I have no way
to make sure that the original SoC for which I changed the code still
works fine, or whether I broke it.

As such, I'd feel more comfortable if we reverted my patch until someone
has the opportunity to test it on the original hardware (Alexandre
suggested he might, but later). The commit in question is the following
one:

  0a4e9ce17ba7 ("macb: support the two tx descriptors on at91rm9200")

If the maintainers prefer that we wait for an issue to be reported before
reverting it, that's fine for me as well. What's important to me is that
this potential issue is identified so as not to waste someone else's time
on it.

Thanks!
Willy
Re: [PATCH net-next 0/3] macb: support the 2-deep Tx queue on at91
Hi Alexandre!

On Fri, Nov 13, 2020 at 11:03:59AM +0100, Alexandre Belloni wrote:
> I think I'm the only one booting recent linux kernels on at91rm9200 and
> I'm currently stuck home while the board is at the office. I'll try to
> test as soon as possible, which may not be before 2021... At least I'll
> know who is the culprit ;)

Oh, that's great. I have a SAM9G20-based one, which uses slightly
different registers and supports a tx ring, so initially I thought that
I couldn't test it there. But a friend of mine who wrote the drivers for
FreeBSD told me that the original driver still worked for the SAM9G20,
and I suspect that despite not being mentioned in the datasheet, the more
recent chip still supports the old behavior, at least to ease the
transition for their customers. So eventually I'll try it too.

In all transparency, I must tell you that I recently noticed an issue
when facing intense bidirectional traffic: eventually the chip would
report a Tx overrun and would stop sending. I finally attributed this to
the unmerged changes needed for the MStar chip, which uses two 16-bit
I/O accesses instead of a single 32-bit one, and which apparently doesn't
like being interrupted between the two when writing to the TAR or TLEN
registers. It took me the whole weekend to figure out the root cause, so
now I need to remove all my debugging code, address the issue and test
again.

If I can't manage to fix this and you can't find a moment to test on
your board, I'll propose to revert my patch to stay on the safe side, as
I want at least one implementation to be 100% reliable with it.

Cheers,
Willy
Re: [PATCH net-next 3/3] macb: support the two tx descriptors on at91rm9200
Hi Claudiu,

first, thanks for your feedback!

On Wed, Oct 14, 2020 at 04:08:00PM +, claudiu.bez...@microchip.com wrote:
> > @@ -3994,11 +3996,10 @@ static netdev_tx_t at91ether_start_xmit(struct sk_buff *skb,
> >  				       struct net_device *dev)
> >  {
> >  	struct macb *lp = netdev_priv(dev);
> > +	unsigned long flags;
> >
> > -	if (macb_readl(lp, TSR) & MACB_BIT(RM9200_BNQ)) {
> > -		int desc = 0;
> > -
> > -		netif_stop_queue(dev);
> > +	if (lp->rm9200_tx_len < 2) {
> > +		int desc = lp->rm9200_tx_tail;
>
> I think you also want to protect these reads with spin_lock() to avoid
> concurrency with the interrupt handler.

I don't think it's needed, because the condition doesn't change below
us, as the interrupt handler only decrements. However, I could use a
READ_ONCE() to make things cleaner. And in practice this test was kept
only as a sanity check: it never fails, because once the queue length
reaches 2, the queue is stopped (and I never got the "device busy"
message either before or after the patch).

> >  		/* Store packet information (to free when Tx completed) */
> >  		lp->rm9200_txq[desc].skb = skb;
> > @@ -4012,6 +4013,15 @@ static netdev_tx_t at91ether_start_xmit(struct sk_buff *skb,
> >  		return NETDEV_TX_OK;
> >  	}
> >
> > +	spin_lock_irqsave(&lp->lock, flags);
> > +
> > +	lp->rm9200_tx_tail = (desc + 1) & 1;
> > +	lp->rm9200_tx_len++;
> > +	if (lp->rm9200_tx_len > 1)
> > +		netif_stop_queue(dev);

This is where we guarantee that we won't call start_xmit() again with
rm9200_tx_len >= 2.

> > @@ -4088,21 +4100,39 @@ static irqreturn_t at91ether_interrupt(int irq, void *dev_id)
> >  		at91ether_rx(dev);
> >
> >  	/* Transmit complete */
> > -	if (intstatus & MACB_BIT(TCOMP)) {
> > +	if (intstatus & (MACB_BIT(TCOMP) | MACB_BIT(RM9200_TBRE))) {
> >  		/* The TCOM bit is set even if the transmission failed */
> >  		if (intstatus & (MACB_BIT(ISR_TUND) | MACB_BIT(ISR_RLE)))
> >  			dev->stats.tx_errors++;
> >
> > -		desc = 0;
> > -		if (lp->rm9200_txq[desc].skb) {
> > +		spin_lock(&lp->lock);
>
> Also, this lock could be moved before the while, below, as you want to
> protect the rm9200_tx_len and rm9200_tx_tail members of lp, as I
> understand.

Sure, but it actually *is* before the while(). I'm sorry if that was not
visible from the context of the diff. The while is just a few lines
below, thus rm9200_tx_len and rm9200_tx_tail are properly protected.

Do not hesitate to tell me if something is not clear or if I'm wrong!

Thanks!
Willy
Re: [PATCH net-next 0/3] macb: support the 2-deep Tx queue on at91
On Tue, Oct 13, 2020 at 05:03:58PM -0700, Jakub Kicinski wrote:
> On Sun, 11 Oct 2020 11:09:41 +0200 Willy Tarreau wrote:
> > while running some tests on my Breadbee board, I noticed poor network
> > Tx performance. I had a look at the driver (macb, at91ether variant)
> > and noticed that at91ether_start_xmit() immediately stops the queue
> > after sending a frame and waits for the interrupt to restart the queue,
> > causing a dead time after each packet is sent.
> >
> > The AT91RM9200 datasheet states that the controller supports two frames,
> > one being sent and the other one being queued, so I performed minimal
> > changes to support this. The transmit performance on my board has
> > increased by 50% on medium-sized packets (HTTP traffic), and with large
> > packets I can now reach line rate.
> >
> > Since this driver is shared by various platforms, I tried my best to
> > isolate and limit the changes as much as possible and I think it's pretty
> > reasonable as-is. I've run extensive tests and couldn't meet any
> > unexpected situation (no stall, overflow nor lockup).
> >
> > There are 3 patches in this series. The first one adds the missing
> > interrupt flag for RM9200 (TBRE, indicating the tx buffer is willing
> > to take a new packet). The second one replaces the single skb with a
> > 2-array and uses only index 0. It does no other change, this is just
> > to prepare the code for the third one. The third one implements the
> > queue. Packets are added at the tail of the queue, the queue is
> > stopped at 2 packets and the interrupt releases 0, 1 or 2 depending
> > on what the transmit status register reports.
>
> LGTM. There's always a chance that this will make other
> designs explode, but short of someone from Cadence giving
> us a timely review we have only one way to find that out.. :)

Not that much in fact, given that the at91ether_* functions are only
used by the AT91RM9200 (whose datasheet I used to do this) and the MStar
chip which I used for the tests. I initially wanted to get my old
SAM9G20 board to boot until I noticed that it doesn't even use the same
set of functions, so the potential victims are extremely limited :-)

> Applied, thanks!

Thank you!
Willy
[PATCH net-next 0/3] macb: support the 2-deep Tx queue on at91
Hi,

while running some tests on my Breadbee board, I noticed poor network
Tx performance. I had a look at the driver (macb, at91ether variant)
and noticed that at91ether_start_xmit() immediately stops the queue
after sending a frame and waits for the interrupt to restart the queue,
causing a dead time after each packet is sent.

The AT91RM9200 datasheet states that the controller supports two frames,
one being sent and the other one being queued, so I performed minimal
changes to support this. The transmit performance on my board has
increased by 50% on medium-sized packets (HTTP traffic), and with large
packets I can now reach line rate.

Since this driver is shared by various platforms, I tried my best to
isolate and limit the changes as much as possible and I think it's pretty
reasonable as-is. I've run extensive tests and couldn't meet any
unexpected situation (no stall, overflow nor lockup).

There are 3 patches in this series. The first one adds the missing
interrupt flag for RM9200 (TBRE, indicating the tx buffer is willing
to take a new packet). The second one replaces the single skb with a
2-array and uses only index 0. It does no other change, this is just
to prepare the code for the third one. The third one implements the
queue. Packets are added at the tail of the queue, the queue is
stopped at 2 packets and the interrupt releases 0, 1 or 2 depending
on what the transmit status register reports.

Thanks,
Willy

Willy Tarreau (3):
  macb: add RM9200's interrupt flag TBRE
  macb: prepare at91 to use a 2-frame TX queue
  macb: support the two tx descriptors on at91rm9200

 drivers/net/ethernet/cadence/macb.h      | 10 ++--
 drivers/net/ethernet/cadence/macb_main.c | 66 ++--
 2 files changed, 56 insertions(+), 20 deletions(-)

--
2.28.0
[PATCH net-next 2/3] macb: prepare at91 to use a 2-frame TX queue
The RM9200 supports one frame being sent while another one is waiting in
the queue. This avoids the dead time that follows the emission of a frame
and which prevents one from reaching line rate.

Right now the driver supports only a single skb, so we'll first replace
the rm9200-specific skb info with an array of two macb_tx_skb (already
used by other drivers). This patch only moves skb_length to txq[0].size
and skb_physaddr to txq[0].mapping, but doesn't perform any other change.
It already uses [desc] in order to minimize future changes.

Cc: Nicolas Ferre
Cc: Claudiu Beznea
Cc: Daniel Palmer
Signed-off-by: Willy Tarreau
---
 drivers/net/ethernet/cadence/macb.h      |  6 ++---
 drivers/net/ethernet/cadence/macb_main.c | 28 ++--
 2 files changed, 18 insertions(+), 16 deletions(-)

diff --git a/drivers/net/ethernet/cadence/macb.h b/drivers/net/ethernet/cadence/macb.h
index 49d347429de8..e87db95fb0f6 100644
--- a/drivers/net/ethernet/cadence/macb.h
+++ b/drivers/net/ethernet/cadence/macb.h
@@ -1206,10 +1206,8 @@ struct macb {

 	phy_interface_t		phy_interface;

-	/* AT91RM9200 transmit */
-	struct sk_buff *skb;		/* holds skb until xmit interrupt completes */
-	dma_addr_t skb_physaddr;	/* phys addr from pci_map_single */
-	int skb_length;			/* saved skb length for pci_unmap_single */
+	/* AT91RM9200 transmit queue (1 on wire + 1 queued) */
+	struct macb_tx_skb	rm9200_txq[2];
 	unsigned int		max_tx_length;

 	u64			ethtool_stats[GEM_STATS_LEN + QUEUE_STATS_LEN * MACB_MAX_QUEUES];
diff --git a/drivers/net/ethernet/cadence/macb_main.c b/drivers/net/ethernet/cadence/macb_main.c
index 9179f7b0b900..ca6e5456906a 100644
--- a/drivers/net/ethernet/cadence/macb_main.c
+++ b/drivers/net/ethernet/cadence/macb_main.c
@@ -3996,14 +3996,16 @@ static netdev_tx_t at91ether_start_xmit(struct sk_buff *skb,
 	struct macb *lp = netdev_priv(dev);

 	if (macb_readl(lp, TSR) & MACB_BIT(RM9200_BNQ)) {
+		int desc = 0;
+
 		netif_stop_queue(dev);

 		/* Store packet information (to free when Tx completed) */
-		lp->skb = skb;
-		lp->skb_length = skb->len;
-		lp->skb_physaddr = dma_map_single(&lp->pdev->dev, skb->data,
-						  skb->len, DMA_TO_DEVICE);
-		if (dma_mapping_error(&lp->pdev->dev, lp->skb_physaddr)) {
+		lp->rm9200_txq[desc].skb = skb;
+		lp->rm9200_txq[desc].size = skb->len;
+		lp->rm9200_txq[desc].mapping = dma_map_single(&lp->pdev->dev, skb->data,
+							      skb->len, DMA_TO_DEVICE);
+		if (dma_mapping_error(&lp->pdev->dev, lp->rm9200_txq[desc].mapping)) {
 			dev_kfree_skb_any(skb);
 			dev->stats.tx_dropped++;
 			netdev_err(dev, "%s: DMA mapping error\n", __func__);
@@ -4011,7 +4013,7 @@ static netdev_tx_t at91ether_start_xmit(struct sk_buff *skb,
 		}

 		/* Set address of the data in the Transmit Address register */
-		macb_writel(lp, TAR, lp->skb_physaddr);
+		macb_writel(lp, TAR, lp->rm9200_txq[desc].mapping);
 		/* Set length of the packet in the Transmit Control register */
 		macb_writel(lp, TCR, skb->len);
@@ -4074,6 +4076,7 @@ static irqreturn_t at91ether_interrupt(int irq, void *dev_id)
 	struct net_device *dev = dev_id;
 	struct macb *lp = netdev_priv(dev);
 	u32 intstatus, ctl;
+	unsigned int desc;

 	/* MAC Interrupt Status register indicates what interrupts are pending.
 	 * It is automatically cleared once read.
@@ -4090,13 +4093,14 @@ static irqreturn_t at91ether_interrupt(int irq, void *dev_id)
 		if (intstatus & (MACB_BIT(ISR_TUND) | MACB_BIT(ISR_RLE)))
 			dev->stats.tx_errors++;

-		if (lp->skb) {
-			dev_consume_skb_irq(lp->skb);
-			lp->skb = NULL;
-			dma_unmap_single(&lp->pdev->dev, lp->skb_physaddr,
-					 lp->skb_length, DMA_TO_DEVICE);
+		desc = 0;
+		if (lp->rm9200_txq[desc].skb) {
+			dev_consume_skb_irq(lp->rm9200_txq[desc].skb);
+			lp->rm9200_txq[desc].skb = NULL;
+			dma_unmap_single(&lp->pdev->dev, lp->rm9200_txq[desc].mapping,
+					 lp->rm9200_txq[desc].size, DMA_TO_DEVICE);
 			dev->stats.tx_packets++;
-
[PATCH net-next 1/3] macb: add RM9200's interrupt flag TBRE
Transmit Buffer Register Empty replaces TXERR on the RM9200 and signals
that the sender may try to send again, because the last queued frame is
no longer in the queue (being transmitted or already transmitted).

Cc: Nicolas Ferre
Cc: Claudiu Beznea
Cc: Daniel Palmer
Signed-off-by: Willy Tarreau
---
 drivers/net/ethernet/cadence/macb.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/net/ethernet/cadence/macb.h b/drivers/net/ethernet/cadence/macb.h
index 4f1b41569260..49d347429de8 100644
--- a/drivers/net/ethernet/cadence/macb.h
+++ b/drivers/net/ethernet/cadence/macb.h
@@ -365,6 +365,8 @@
 #define MACB_ISR_RLE_SIZE	1
 #define MACB_TXERR_OFFSET	6 /* EN TX frame corrupt from error interrupt */
 #define MACB_TXERR_SIZE		1
+#define MACB_RM9200_TBRE_OFFSET	6 /* EN may send new frame interrupt (RM9200) */
+#define MACB_RM9200_TBRE_SIZE	1
 #define MACB_TCOMP_OFFSET	7 /* Enable transmit complete interrupt */
 #define MACB_TCOMP_SIZE		1
 #define MACB_ISR_LINK_OFFSET	9 /* Enable link change interrupt */
--
2.28.0
[PATCH net-next 3/3] macb: support the two tx descriptors on at91rm9200
The at91rm9200 variant used by a few chips including the MSC313 supports two Tx descriptors (one frame being serialized and another one queued). However the driver only implemented a single one, which adds a dead time after each transfer to receive and process the interrupt and wake the queue up, preventing it from reaching line rate. This patch implements a very basic 2-deep queue to address this limitation.

The tests run on a Breadbee board equipped with an MSC313E show that at 1 GHz, HTTP traffic on medium-sized objects (45kB) was limited to exactly 50 Mbps before this patch, and jumped to 76 Mbps with this patch. And tests on a single TCP stream with an MTU of 576 jump from 10kpps to 15kpps. With 1500 byte packets it's now possible to reach line rate versus 75 Mbps before.

Cc: Nicolas Ferre
Cc: Claudiu Beznea
Cc: Daniel Palmer
Signed-off-by: Willy Tarreau
---
 drivers/net/ethernet/cadence/macb.h      |  2 ++
 drivers/net/ethernet/cadence/macb_main.c | 46 +++-
 2 files changed, 40 insertions(+), 8 deletions(-)

diff --git a/drivers/net/ethernet/cadence/macb.h b/drivers/net/ethernet/cadence/macb.h
index e87db95fb0f6..f8133003981f 100644
--- a/drivers/net/ethernet/cadence/macb.h
+++ b/drivers/net/ethernet/cadence/macb.h
@@ -1208,6 +1208,8 @@ struct macb {
 	/* AT91RM9200 transmit queue (1 on wire + 1 queued) */
 	struct macb_tx_skb	rm9200_txq[2];
+	unsigned int		rm9200_tx_tail;
+	unsigned int		rm9200_tx_len;
 	unsigned int		max_tx_length;
 
 	u64			ethtool_stats[GEM_STATS_LEN + QUEUE_STATS_LEN * MACB_MAX_QUEUES];
diff --git a/drivers/net/ethernet/cadence/macb_main.c b/drivers/net/ethernet/cadence/macb_main.c
index ca6e5456906a..6ff8e4b0b95d 100644
--- a/drivers/net/ethernet/cadence/macb_main.c
+++ b/drivers/net/ethernet/cadence/macb_main.c
@@ -3909,6 +3909,7 @@ static int at91ether_start(struct macb *lp)
 			     MACB_BIT(ISR_TUND)	|
 			     MACB_BIT(ISR_RLE)	|
 			     MACB_BIT(TCOMP)	|
+			     MACB_BIT(RM9200_TBRE)	|
 			     MACB_BIT(ISR_ROVR) |
 			     MACB_BIT(HRESP));

@@ -3925,6 +3926,7 @@ static void at91ether_stop(struct macb *lp)
 			     MACB_BIT(ISR_TUND)	|
 			     MACB_BIT(ISR_RLE)	|
 			     MACB_BIT(TCOMP)	|
+			     MACB_BIT(RM9200_TBRE)	|
 			     MACB_BIT(ISR_ROVR) |
 			     MACB_BIT(HRESP));

@@ -3994,11 +3996,10 @@ static netdev_tx_t at91ether_start_xmit(struct sk_buff *skb,
 					struct net_device *dev)
 {
 	struct macb *lp = netdev_priv(dev);
+	unsigned long flags;

-	if (macb_readl(lp, TSR) & MACB_BIT(RM9200_BNQ)) {
-		int desc = 0;
-
-		netif_stop_queue(dev);
+	if (lp->rm9200_tx_len < 2) {
+		int desc = lp->rm9200_tx_tail;

 		/* Store packet information (to free when Tx completed) */
 		lp->rm9200_txq[desc].skb = skb;
@@ -4012,6 +4013,15 @@ static netdev_tx_t at91ether_start_xmit(struct sk_buff *skb,
 			return NETDEV_TX_OK;
 		}

+		spin_lock_irqsave(&lp->lock, flags);
+
+		lp->rm9200_tx_tail = (desc + 1) & 1;
+		lp->rm9200_tx_len++;
+		if (lp->rm9200_tx_len > 1)
+			netif_stop_queue(dev);
+
+		spin_unlock_irqrestore(&lp->lock, flags);
+
 		/* Set address of the data in the Transmit Address register */
 		macb_writel(lp, TAR, lp->rm9200_txq[desc].mapping);
 		/* Set length of the packet in the Transmit Control register */
@@ -4077,6 +4087,8 @@ static irqreturn_t at91ether_interrupt(int irq, void *dev_id)
 	struct macb *lp = netdev_priv(dev);
 	u32 intstatus, ctl;
 	unsigned int desc;
+	unsigned int qlen;
+	u32 tsr;

 	/* MAC Interrupt Status register indicates what interrupts are pending.
 	 * It is automatically cleared once read.
@@ -4088,21 +4100,39 @@ static irqreturn_t at91ether_interrupt(int irq, void *dev_id)
 		at91ether_rx(dev);

 	/* Transmit complete */
-	if (intstatus & MACB_BIT(TCOMP)) {
+	if (intstatus & (MACB_BIT(TCOMP) | MACB_BIT(RM9200_TBRE))) {
 		/* The TCOMP bit is set even if the transmission failed */
 		if (intstatus & (MACB_BIT(ISR_TUND) | MACB_BIT(ISR_RLE)))
 			dev->stats.tx_errors++;

-		desc = 0;
-		if (lp->rm9200_txq[desc].skb) {
+		spin_lock(&lp->lock);
+
+		tsr = macb_readl(lp, TSR);
+
+		/* we have three possibilities
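For readers less familiar with the driver, the queueing logic in the patch can be modeled as a tiny standalone sketch: one descriptor on the wire, one queued, and the queue is stopped whenever both slots are in use. The names below mirror the driver fields, but this is an illustrative userspace model, not the actual driver code.

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified model of the 2-deep TX queue from the patch above:
 * one frame on the wire, one queued. Illustrative, not driver code. */
struct txq_model {
	int skb[2];          /* stand-in for the stored skb pointers (0 = free) */
	unsigned int tail;   /* next descriptor to fill */
	unsigned int len;    /* number of in-flight frames (0..2) */
	bool stopped;        /* models netif_stop_queue()/netif_wake_queue() */
};

/* Try to enqueue a frame; returns false when both descriptors are busy. */
static bool txq_xmit(struct txq_model *q, int skb)
{
	if (q->len >= 2)
		return false;            /* no free descriptor */
	q->skb[q->tail] = skb;
	q->tail = (q->tail + 1) & 1;     /* two descriptors -> wrap with & 1 */
	q->len++;
	if (q->len > 1)
		q->stopped = true;       /* second slot used: stop the queue */
	return true;
}

/* Completion interrupt: the oldest frame has left the wire. */
static void txq_complete(struct txq_model *q)
{
	unsigned int head = (q->tail + 2 - q->len) & 1; /* oldest slot */
	q->skb[head] = 0;
	q->len--;
	q->stopped = false;              /* room again: wake the queue */
}
```

The point of the second slot is that the MAC can start serializing the queued frame immediately after the first completes, instead of idling until the interrupt has been processed and the queue woken up.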
Re: [PATCH] random32: Restore __latent_entropy attribute on net_rand_state
Hi Kees, On Mon, Oct 05, 2020 at 07:12:29PM -0700, Kees Cook wrote: > On Fri, Oct 02, 2020 at 05:16:11PM +0200, Thibaut Sautereau wrote: > > From: Thibaut Sautereau > > > > Commit f227e3ec3b5c ("random32: update the net random state on interrupt > > and activity") broke compilation and was temporarily fixed by Linus in > > 83bdc7275e62 ("random32: remove net_rand_state from the latent entropy > > gcc plugin") by entirely moving net_rand_state out of the things handled > > by the latent_entropy GCC plugin. > > > > From what I understand when reading the plugin code, using the > > __latent_entropy attribute on a declaration was the wrong part and > > simply keeping the __latent_entropy attribute on the variable definition > > was the correct fix. > > > > Fixes: 83bdc7275e62 ("random32: remove net_rand_state from the latent > > entropy gcc plugin") > > Cc: Linus Torvalds > > Cc: Willy Tarreau > > Cc: Emese Revfy > > Signed-off-by: Thibaut Sautereau > > Yes, that looks correct. Thank you! > > Acked-by: Kees Cook > > I'm not sure the best tree for this. Ted, Andrew, Linus? I'll take it > via my gcc plugin tree if no one else takes it. :) It was already merged as commit 09a6b0bc3be79 and queued for -stable. Cheers, Willy
Re: [PATCH] random32: Restore __latent_entropy attribute on net_rand_state
On Fri, Oct 02, 2020 at 05:16:11PM +0200, Thibaut Sautereau wrote: > From: Thibaut Sautereau > > Commit f227e3ec3b5c ("random32: update the net random state on interrupt > and activity") broke compilation and was temporarily fixed by Linus in > 83bdc7275e62 ("random32: remove net_rand_state from the latent entropy > gcc plugin") by entirely moving net_rand_state out of the things handled > by the latent_entropy GCC plugin. > > From what I understand when reading the plugin code, using the > __latent_entropy attribute on a declaration was the wrong part and > simply keeping the __latent_entropy attribute on the variable definition > was the correct fix. Ah thank you, this is what I tried to figure and failed to! Spender mentioned that a one-liner was all that was needed to fix this but never responded to my request asking about it. Willy
Re: Why ping latency is smaller with shorter send interval?
On Fri, Oct 02, 2020 at 11:26:00AM +0200, Eric Dumazet wrote: > 2) Idle cpus can be put in a power conserving state. > It takes time to exit from these states, as you noticed. > These delays can typically be around 50 usec, or more. This is often the case in my experience. I'm even used to starting a busy loop in SCHED_IDLE prio on certain machines or in certain VMs just to keep them busy, and to see the ping latency cut in half. Willy
Re: [PATCH 1/2] random32: make prandom_u32() output unpredictable
On Mon, Sep 14, 2020 at 06:16:40PM +0200, Sedat Dilek wrote: > On Mon, Sep 14, 2020 at 4:53 PM Amit Klein wrote: > > > > Hi > > > > Is this patch being pushed to any branch? I don't see it deployed anywhere > > (unless I'm missing something...). > > > > It's here: > > [1] > https://git.kernel.org/pub/scm/linux/kernel/git/wtarreau/prandom.git/log/?h=20200901-siphash-noise By the way I didn't get any feedback from those who initially disagreed with the one that was merged, so for now I'm not doing anything on it anymore. I can propose it again for 5.10-rc1 but will not push anymore if there's no interest behind it. Cheers, Willy
Re: [PATCH 0/2] prandom_u32: make output less predictable
On Tue, Sep 01, 2020 at 04:41:13PM +0200, Sedat Dilek wrote: > I have tested with the patchset from [1]. > ( Later I saw, you dropped "WIP: tcp: reuse incoming skb hash in > tcp_conn_request()". ) Yes because it's a bit out of the scope of this series and makes sense even without these patches, thus I assume Eric will take care of it separately. Willy
Re: [PATCH 1/2] random32: make prandom_u32() output unpredictable
On Tue, Sep 01, 2020 at 01:10:18PM +0000, David Laight wrote:
> From: Willy Tarreau
> > Sent: 01 September 2020 07:43
> ...
> > +/*
> > + * Generate some initially weak seeding values to allow
> > + * the prandom_u32() engine to be started.
> > + */
> > +static int __init prandom_init_early(void)
> > +{
> > +	int i;
> > +	unsigned long v0, v1, v2, v3;
> > +
> > +	if (!arch_get_random_long(&v0))
> > +		v0 = jiffies;
>
> Isn't jiffies likely to be zero here?

I don't know. But do we really care ? I'd personally have been fine with not even assigning it in this case and leaving whatever was in the stack in this case, though it could make some static code analyzer unhappy.

Willy
Re: [PATCH 2/2] random32: add noise from network and scheduling activity
Hi Eric,

On Tue, Sep 01, 2020 at 12:24:38PM +0200, Eric Dumazet wrote:
> There is not much entropy here really :
>
> 1) dev & txq are mostly constant on a typical host (at least the kind of
>    hosts that is targeted by Amit Klein and others in their attacks.
>
> 2) len is also known by the attacker, attacking an idle host.
>
> 3) skb are also allocations from slab cache, which tend to recycle always
>    the same pointers (on idle hosts)
>
> 4) jiffies might be incremented every 4 ms (if HZ=250)

I know. The point is essentially that someone "remote" or with rare access to the host's memory (i.e. in a VM on the same CPU sharing L1 with some CPU vulnerabilities) cannot synchronize with the PRNG and easily stay synchronized forever. Otherwise I totally agree that these are pretty weak. But in my opinion they are sufficient to turn a 100% success into way less. I try not to forget that we're just trying to make a ~15-bit port require ~2^14 attempts on average. Oh and by the way the number of calls also counts here.

> Maybe we could feed percpu prandom noise with samples of ns resolution
> timestamps, lazily cached from ktime_get() or similar functions.
>
> This would use one instruction on x86 to update the cache, with maybe
> more generic noise.

Sure! I think the principle here allows to easily extend it to various places, and the more the better. Maybe actually we'll figure that there are plenty of sources of randomness that were not considered secure enough to feed /dev/random while they're perfectly fine for such use cases.

> diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
> index 4c47f388a83f17860fdafa3229bba0cc605ec25a..a3e026cbbb6e8c5499ed780e57de5fa09bc010b6 100644
> --- a/kernel/time/timekeeping.c
> +++ b/kernel/time/timekeeping.c
> @@ -751,7 +751,7 @@ ktime_t ktime_get(void)
>  {
>  	struct timekeeper *tk = &tk_core.timekeeper;
>  	unsigned int seq;
> -	ktime_t base;
> +	ktime_t res, base;
>  	u64 nsecs;
>
>  	WARN_ON(timekeeping_suspended);
> @@ -763,7 +763,9 @@ ktime_t ktime_get(void)
>
>  	} while (read_seqcount_retry(&tk_core.seq, seq));
>
> -	return ktime_add_ns(base, nsecs);
> +	res = ktime_add_ns(base, nsecs);
> +	__this_cpu_add(prandom_noise, (unsigned long)ktime_to_ns(res));
> +	return res;
>  }
>  EXPORT_SYMBOL_GPL(ktime_get);

Actually it could even be nice to combine it with __builtin_return_address(0) given the large number of callers this one has! But I generally agree with your proposal.

Thanks,
Willy
Re: [PATCH 1/2] random32: make prandom_u32() output unpredictable
On Tue, Sep 01, 2020 at 10:46:16AM +0200, Sedat Dilek wrote: > Will you push the updated patchset to your prandom Git - for easy fetching? Yeah if you want, feel free to use this one : https://git.kernel.org/pub/scm/linux/kernel/git/wtarreau/prandom.git/log/?h=20200901-siphash-noise Willy
Re: [PATCH 1/2] random32: make prandom_u32() output unpredictable
On Tue, Sep 01, 2020 at 10:33:40AM +0200, Yann Ylavic wrote:
> On Tue, Sep 1, 2020 at 8:45 AM Willy Tarreau wrote:
> >
> > +/*
> > + * Generate some initially weak seeding values to allow
> > + * the prandom_u32() engine to be started.
> > + */
> > +static int __init prandom_init_early(void)
> > +{
> > +	int i;
> > +	unsigned long v0, v1, v2, v3;
> > +
> > +	if (!arch_get_random_long(&v0))
> > +		v0 = jiffies;
> > +	if (!arch_get_random_long(&v1))
> > +		v0 = random_get_entropy();
>
> Shouldn't the above be:
>     v1 = random_get_entropy();
> ?

Very good catch, many thanks Yann! Now fixed in my local tree.

Willy
[PATCH 1/2] random32: make prandom_u32() output unpredictable
From: George Spelvin

Non-cryptographic PRNGs may have great statistical properties, but are usually trivially predictable to someone who knows the algorithm, given a small sample of their output. An LFSR like prandom_u32() is particularly simple, even if the sample is widely scattered bits.

It turns out the network stack uses prandom_u32() for some things like random port numbers which it would prefer are *not* trivially predictable. Predictability led to a practical DNS spoofing attack. Oops.

This patch replaces the LFSR with a homebrew cryptographic PRNG based on the SipHash round function, which is in turn seeded with 128 bits of strong random key. (The authors of SipHash have *not* been consulted about this abuse of their algorithm.) Speed is prioritized over security; attacks are rare, while performance is always wanted. Replacing all callers of prandom_u32() is the quick fix. Whether to reinstate a weaker PRNG for uses which can tolerate it is an open question.

Commit f227e3ec3b5c ("random32: update the net random state on interrupt and activity") was an earlier attempt at a solution. This patch replaces it.

Reported-by: Amit Klein
Cc: Willy Tarreau
Cc: Eric Dumazet
Cc: "Jason A. Donenfeld"
Cc: Andy Lutomirski
Cc: Kees Cook
Cc: Thomas Gleixner
Cc: Peter Zijlstra
Cc: Linus Torvalds
Cc: ty...@mit.edu
Cc: Florian Westphal
Cc: Marc Plumb
Fixes: f227e3ec3b5c ("random32: update the net random state on interrupt and activity")
Signed-off-by: George Spelvin
Link: https://lore.kernel.org/netdev/20200808152628.ga27...@sdf.org/
[ willy: partial reversal of f227e3ec3b5c; moved SIPROUND definitions
  to prandom.h for later use; merged George's prandom_seed() proposal;
  inlined siprand_u32(); replaced the net_rand_state[] array with 4
  members to fix a build issue; cosmetic cleanups to make checkpatch
  happy ]
Signed-off-by: Willy Tarreau
---
 drivers/char/random.c   |   1 -
 include/linux/prandom.h |  36 +++-
 kernel/time/timer.c     |   7 -
 lib/random32.c          | 433
 4 files changed, 296 insertions(+), 181 deletions(-)

diff --git a/drivers/char/random.c b/drivers/char/random.c
index d20ba1b104ca..2a41b21623ae 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -1277,7 +1277,6 @@ void add_interrupt_randomness(int irq, int irq_flags)
 	fast_mix(fast_pool);
 	add_interrupt_bench(cycles);
-	this_cpu_add(net_rand_state.s1, fast_pool->pool[cycles & 3]);

 	if (unlikely(crng_init == 0)) {
 		if ((fast_pool->count >= 64) &&
diff --git a/include/linux/prandom.h b/include/linux/prandom.h
index aa16e6468f91..cc1e71334e53 100644
--- a/include/linux/prandom.h
+++ b/include/linux/prandom.h
@@ -16,12 +16,44 @@ void prandom_bytes(void *buf, size_t nbytes);
 void prandom_seed(u32 seed);
 void prandom_reseed_late(void);

+#if BITS_PER_LONG == 64
+/*
+ * The core SipHash round function. Each line can be executed in
+ * parallel given enough CPU resources.
+ */
+#define PRND_SIPROUND(v0, v1, v2, v3) ( \
+	v0 += v1, v1 = rol64(v1, 13), v2 += v3, v3 = rol64(v3, 16), \
+	v1 ^= v0, v0 = rol64(v0, 32), v3 ^= v2, \
+	v0 += v3, v3 = rol64(v3, 21), v2 += v1, v1 = rol64(v1, 17), \
+	v3 ^= v0, v1 ^= v2, v2 = rol64(v2, 32) \
+)
+
+#define PRND_K0 (0x736f6d6570736575 ^ 0x6c7967656e657261)
+#define PRND_K1 (0x646f72616e646f6d ^ 0x7465646279746573)
+
+#elif BITS_PER_LONG == 32
+/*
+ * On 32-bit machines, we use HSipHash, a reduced-width version of SipHash.
+ * This is weaker, but 32-bit machines are not used for high-traffic
+ * applications, so there is less output for an attacker to analyze.
+ */
+#define PRND_SIPROUND(v0, v1, v2, v3) ( \
+	v0 += v1, v1 = rol32(v1,  5), v2 += v3, v3 = rol32(v3,  8), \
+	v1 ^= v0, v0 = rol32(v0, 16), v3 ^= v2, \
+	v0 += v3, v3 = rol32(v3,  7), v2 += v1, v1 = rol32(v1, 13), \
+	v3 ^= v0, v1 ^= v2, v2 = rol32(v2, 16) \
+)
+#define PRND_K0 0x6c796765
+#define PRND_K1 0x74656462
+
+#else
+#error Unsupported BITS_PER_LONG
+#endif
+
 struct rnd_state {
 	__u32 s1, s2, s3, s4;
 };

-DECLARE_PER_CPU(struct rnd_state, net_rand_state);
-
 u32 prandom_u32_state(struct rnd_state *state);
 void prandom_bytes_state(struct rnd_state *state, void *buf, size_t nbytes);
 void prandom_seed_full_state(struct rnd_state __percpu *pcpu_state);
diff --git a/kernel/time/timer.c b/kernel/time/timer.c
index a50364df1054..401fcb9d7388 100644
--- a/kernel/time/timer.c
+++ b/kernel/time/timer.c
@@ -1715,13 +1715,6 @@ void update_process_times(int user_tick)
 	scheduler_tick();
 	if (IS_ENABLED(CONFIG_POSIX_TIMERS))
 		run_posix_cpu_timers();
-
-	/* The current CPU might make use of net randoms without receiving IRQs
-	 * to renew them often enough. Let's update the n
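As a standalone illustration, the 64-bit round above can be exercised from userspace. This sketch reimplements rol64() (the kernel helper isn't available there), adds ULL suffixes to the key constants, and models the two-round draw; the output combination `v1 + v3` is an assumption for illustration, not necessarily the patch's exact output mixing:

```c
#include <assert.h>
#include <stdint.h>

/* Userspace stand-in for the kernel's rol64() helper. */
static inline uint64_t rol64(uint64_t w, unsigned int s)
{
	return (w << s) | (w >> (64 - s));
}

/* Same round as PRND_SIPROUND in the patch above. */
#define PRND_SIPROUND(v0, v1, v2, v3) ( \
	v0 += v1, v1 = rol64(v1, 13), v2 += v3, v3 = rol64(v3, 16), \
	v1 ^= v0, v0 = rol64(v0, 32), v3 ^= v2, \
	v0 += v3, v3 = rol64(v3, 21), v2 += v1, v1 = rol64(v1, 17), \
	v3 ^= v0, v1 ^= v2, v2 = rol64(v2, 32) \
)

#define PRND_K0 (0x736f6d6570736575ULL ^ 0x6c7967656e657261ULL)
#define PRND_K1 (0x646f72616e646f6dULL ^ 0x7465646279746573ULL)

/* Minimal model of a siprand-style draw: two rounds per output. */
struct sipstate { uint64_t v0, v1, v2, v3; };

static uint32_t siprand_u32(struct sipstate *s)
{
	uint64_t v0 = s->v0, v1 = s->v1, v2 = s->v2, v3 = s->v3;

	PRND_SIPROUND(v0, v1, v2, v3);
	PRND_SIPROUND(v0, v1, v2, v3);
	s->v0 = v0; s->v1 = v1; s->v2 = v2; s->v3 = v3;
	return (uint32_t)(v1 + v3); /* illustrative output mix */
}
```

Each draw advances the full 256-bit state through two bijective rounds, so unlike an LFSR there is no small linear relation between successive outputs for an observer to solve.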
[PATCH 0/2] prandom_u32: make output less predictable
This is the cleanup of the latest series of prandom_u32 experimentations consisting in using SipHash instead of Tausworthe to produce the randoms used by the network stack.

The changes to the files were kept minimal, and the controversial commit that used to take noise from the fast_pool (f227e3ec3b5c) was reverted. Instead, a dedicated "net_rand_noise" per_cpu variable is fed from various sources of activities (networking, scheduling) to perturb the SipHash state using fast, non-trivially predictable data, instead of keeping it fully deterministic. The goal is essentially to make any occasional memory leakage or brute-force attempt useless.

The resulting code was verified to be very slightly faster on x86_64 than what it was with the controversial commit above, though this remains barely above measurement noise. It was only build-tested on arm & arm64.

George Spelvin (1):
  random32: make prandom_u32() output unpredictable

Willy Tarreau (1):
  random32: add noise from network and scheduling activity

 drivers/char/random.c   |   1 -
 include/linux/prandom.h |  55 -
 kernel/time/timer.c     |   9 +-
 lib/random32.c          | 438
 net/core/dev.c          |   4 +
 5 files changed, 326 insertions(+), 181 deletions(-)

Cc: George Spelvin
Cc: Amit Klein
Cc: Eric Dumazet
Cc: "Jason A. Donenfeld"
Cc: Andy Lutomirski
Cc: Kees Cook
Cc: Thomas Gleixner
Cc: Peter Zijlstra
Cc: Linus Torvalds
Cc: ty...@mit.edu
Cc: Florian Westphal
Cc: Marc Plumb
Cc: Sedat Dilek
--
2.28.0
[PATCH 2/2] random32: add noise from network and scheduling activity
With the removal of the interrupt perturbations in previous random32 change (random32: make prandom_u32() output unpredictable), the PRNG has become 100% deterministic again. While SipHash is expected to be way more robust against brute force than the previous Tausworthe LFSR, there's still the risk that whoever has even one temporary access to the PRNG's internal state is able to predict all subsequent draws till the next reseed (roughly every minute). This may happen through a side channel attack or any data leak.

This patch restores the spirit of commit f227e3ec3b5c ("random32: update the net random state on interrupt and activity") in that it will perturb the internal PRNG's state using externally collected noise, except that it will not pick that noise from the random pool's bits nor upon interrupt, but will rather combine a few elements along the Tx path that are collectively hard to predict, such as dev, skb and txq pointers, packet length and jiffies values. These ones are combined using a single round of SipHash into a single long variable that is mixed with the net_rand_state upon each invocation.

The operation was inlined because it produces very small and efficient code, typically 3 xor, 2 add and 2 rol. The performance was measured to be the same (even very slightly better) than before the switch to SipHash; on a 6-core 12-thread Core i7-8700k equipped with a 40G NIC (i40e), the connection rate dropped from 556k/s to 555k/s while the SYN cookie rate grew from 5.38 Mpps to 5.45 Mpps.

Link: https://lore.kernel.org/netdev/20200808152628.ga27...@sdf.org/
Cc: George Spelvin
Cc: Amit Klein
Cc: Eric Dumazet
Cc: "Jason A. Donenfeld"
Cc: Andy Lutomirski
Cc: Kees Cook
Cc: Thomas Gleixner
Cc: Peter Zijlstra
Cc: Linus Torvalds
Cc: ty...@mit.edu
Cc: Florian Westphal
Cc: Marc Plumb
Tested-by: Sedat Dilek
Signed-off-by: Willy Tarreau
---
 include/linux/prandom.h | 19 +++
 kernel/time/timer.c     |  2 ++
 lib/random32.c          |  5 +
 net/core/dev.c          |  4
 4 files changed, 30 insertions(+)

diff --git a/include/linux/prandom.h b/include/linux/prandom.h
index cc1e71334e53..aa7de3432e0f 100644
--- a/include/linux/prandom.h
+++ b/include/linux/prandom.h
@@ -16,6 +16,12 @@ void prandom_bytes(void *buf, size_t nbytes);
 void prandom_seed(u32 seed);
 void prandom_reseed_late(void);

+DECLARE_PER_CPU(unsigned long, net_rand_noise);
+
+#define PRANDOM_ADD_NOISE(a, b, c, d) \
+	prandom_u32_add_noise((unsigned long)(a), (unsigned long)(b), \
+			      (unsigned long)(c), (unsigned long)(d))
+
 #if BITS_PER_LONG == 64
 /*
  * The core SipHash round function. Each line can be executed in
@@ -50,6 +56,18 @@ void prandom_reseed_late(void);
 #error Unsupported BITS_PER_LONG
 #endif

+static inline void prandom_u32_add_noise(unsigned long a, unsigned long b,
+					 unsigned long c, unsigned long d)
+{
+	/*
+	 * This is not used cryptographically; it's just
+	 * a convenient 4-word hash function. (3 xor, 2 add, 2 rol)
+	 */
+	a ^= __this_cpu_read(net_rand_noise);
+	PRND_SIPROUND(a, b, c, d);
+	__this_cpu_write(net_rand_noise, d);
+}
+
 struct rnd_state {
 	__u32 s1, s2, s3, s4;
 };
@@ -99,6 +117,7 @@ static inline void prandom_seed_state(struct rnd_state *state, u64 seed)
 	state->s2 = __seed(i,   8U);
 	state->s3 = __seed(i,  16U);
 	state->s4 = __seed(i, 128U);
+	PRANDOM_ADD_NOISE(state, i, 0, 0);
 }

 /* Pseudo random number generator from numerical recipes. */
diff --git a/kernel/time/timer.c b/kernel/time/timer.c
index 401fcb9d7388..bebcf2fc1226 100644
--- a/kernel/time/timer.c
+++ b/kernel/time/timer.c
@@ -1704,6 +1704,8 @@ void update_process_times(int user_tick)
 {
 	struct task_struct *p = current;

+	PRANDOM_ADD_NOISE(jiffies, user_tick, p, 0);
+
 	/* Note: this timer irq context must be accounted for as well. */
 	account_process_tick(p, user_tick);
 	run_local_timers();
diff --git a/lib/random32.c b/lib/random32.c
index 00fa925a4487..38db382a8cf5 100644
--- a/lib/random32.c
+++ b/lib/random32.c
@@ -324,6 +324,8 @@ struct siprand_state {
 };

 static DEFINE_PER_CPU(struct siprand_state, net_rand_state) __latent_entropy;
+DEFINE_PER_CPU(unsigned long, net_rand_noise);
+EXPORT_PER_CPU_SYMBOL(net_rand_noise);

 /*
  * This is the core CPRNG function. As "pseudorandom", this is not used
@@ -347,9 +349,12 @@ static DEFINE_PER_CPU(struct siprand_state, net_rand_state) __latent_entropy;
 static inline u32 siprand_u32(struct siprand_state *s)
 {
 	unsigned long v0 = s->v0, v1 = s->v1, v2 = s->v2, v3 = s->v3;
+	unsigned long n = __this_cpu_read(net_rand_noise);

+	v3 ^= n;
 	PRND_SIPROUND(v0, v1, v2, v3);
 	PRND_SIPROUND(v0, v1, v2, v3);
+	v0 ^= n;
 	s->v0 = v0;
 	s->v1 = v1;
 	s-&
Re: [DRAFT PATCH] random32: make prandom_u32() output unpredictable
Hi Amit, On Thu, Aug 27, 2020 at 02:06:39AM +0300, Amit Klein wrote: > Hi > > Is there an ETA for this patch then? No particular ETA on my side, I was waiting for potential criticisms before going further. I suspect that if nobody complains anymore, it's an implicit voucher and I'll have to clean up and post the series then. Thanks, Willy
Re: [DRAFT PATCH] random32: make prandom_u32() output unpredictable
On Thu, Aug 20, 2020 at 08:58:38AM +0200, Willy Tarreau wrote:
> I've just pushed a new branch "20200820-siphash-noise" that I also
> rebased onto latest master. It's currently running make allmodconfig
> here, so that will take a while, but I think it's OK as random32.o is
> already OK. I've also addressed a strange build issue caused by having
> an array instead of 4 ints in siprand_state.
>
> Please just let me know if that's OK for you now.

At least it worked for me now (built with no errors on x86_64):

  $ time make -j 8 bzImage modules
  (...)
  real    65m7.986s
  user    477m22.477s
  sys     38m0.545s

  $ find . -name '*.ko' |wc -l
  7983

Willy
Re: [DRAFT PATCH] random32: make prandom_u32() output unpredictable
On Thu, Aug 20, 2020 at 08:08:43AM +0200, Willy Tarreau wrote: > On Thu, Aug 20, 2020 at 06:42:23AM +0200, Sedat Dilek wrote: > > On Thu, Aug 20, 2020 at 6:33 AM Willy Tarreau wrote: > > > > > > On Thu, Aug 20, 2020 at 05:05:49AM +0200, Sedat Dilek wrote: > > > > We have the same defines for K0 and K1 in include/linux/prandom.h and > > > > lib/random32.c? > > > > More room for simplifications? > > > > > > Definitely, I'm not surprized at all. As I said, the purpose was to > > > discuss around the proposal, not much more. If we think it's the way > > > to go, some major lifting is required. I just don't want to invest > > > significant time on this if nobody cares. > > > > > > > OK. > > > > Right now, I will try with the attached diff. > > No, don't waste your time this way, it's not the right way to address it, > you're still facing competition between defines. I'll do another one if > you want to go further in the tests. I've just pushed a new branch "20200820-siphash-noise" that I also rebased onto latest master. It's currently running make allmodconfig here, so that will take a while, but I think it's OK as random32.o is already OK. I've also addressed a strange build issue caused by having an array instead of 4 ints in siprand_state. Please just let me know if that's OK for you now. Thanks, Willy
Re: [DRAFT PATCH] random32: make prandom_u32() output unpredictable
On Thu, Aug 20, 2020 at 06:42:23AM +0200, Sedat Dilek wrote: > On Thu, Aug 20, 2020 at 6:33 AM Willy Tarreau wrote: > > > > On Thu, Aug 20, 2020 at 05:05:49AM +0200, Sedat Dilek wrote: > > > We have the same defines for K0 and K1 in include/linux/prandom.h and > > > lib/random32.c? > > > More room for simplifications? > > > > Definitely, I'm not surprized at all. As I said, the purpose was to > > discuss around the proposal, not much more. If we think it's the way > > to go, some major lifting is required. I just don't want to invest > > significant time on this if nobody cares. > > > > OK. > > Right now, I will try with the attached diff. No, don't waste your time this way, it's not the right way to address it, you're still facing competition between defines. I'll do another one if you want to go further in the tests. > Unclear to me where this modpost "net_rand_noise" undefined! comes from. > Any hints? Sure, the symbol is not exported. I'll address it as well for you. Willy
Re: [DRAFT PATCH] random32: make prandom_u32() output unpredictable
On Thu, Aug 20, 2020 at 05:05:49AM +0200, Sedat Dilek wrote: > We have the same defines for K0 and K1 in include/linux/prandom.h and > lib/random32.c? > More room for simplifications? Definitely, I'm not surprized at all. As I said, the purpose was to discuss around the proposal, not much more. If we think it's the way to go, some major lifting is required. I just don't want to invest significant time on this if nobody cares. Thanks! Willy
Re: [DRAFT PATCH] random32: make prandom_u32() output unpredictable
Hi,

so as I mentioned, I could run several tests in our lab with variations around the various proposals and come to quite positive conclusions.

Synthetic observations: the connection rate and the SYN cookie rate do not seem to be affected the same way by the prandom changes. One explanation is that the connection rates are less stable across reboots. Another possible explanation is that the larger state update is more sensitive to cache misses that increase when calling userland.

I noticed that the compiler didn't inline siprand_u32() for me, resulting in one extra function call and noticeable register clobbering that mostly vanish once siprand_u32() is inlined, getting back to the original performance.

The noise generation was placed as discussed in the xmit calls, however the extra function call and state update had a negative effect on performance and the noise function alone appeared for up to 0.23% of the CPU usage. Simplifying the mix of data by keeping only one long for the noise and using one siphash round on 4 input words to keep only the last word allowed to use very few instructions and to inline them, making the noise collection imperceptible in microbenchmarks. The noise is now collected this way (I verified that all inputs are used), this performs 3 xor, 2 add and 2 rol, which is way sufficient and already better than my initial attempt with a bare add:

  static inline void prandom_u32_add_noise(unsigned long a, unsigned long b,
					   unsigned long c, unsigned long d)
  {
	/*
	 * This is not used cryptographically; it's just
	 * a convenient 4-word hash function. (3 xor, 2 add, 2 rol)
	 */
	a ^= __this_cpu_read(net_rand_noise);
	PRND_SIPROUND(a, b, c, d);
	__this_cpu_write(net_rand_noise, d);
  }

My tests were run on a 6-core 12-thread Core i7-8700k equipped with a 40G NIC (i40e). I've mainly run two types of tests:

  - connections per second: the machine runs a server which accepts and
    closes incoming connections. The load generators aim at it and the
    connection rate is measured once it's stabilized.

  - SYN cookie rate: the load generators flood the machine with enough
    SYNs to saturate the CPU and the rate of response SYN-ACK is measured.

Both correspond to real world use cases (DDoS protection against SYN flood and connection flood).

The base kernel was fc80c51f + Eric's patch to add a tracepoint in prandom_u32(). Another test was made by adding George's changes to use siphash. Then another test was made with the siprand_u32() function inlined and with noise stored as a full siphash state. Then one test was run with the noise reduced to a single long. And a final test was run with the noise function inlined.

                                        connections   SYN cookies   Notes
                                        per second    emitted/s
  base:                                 556k          5.38M
  siphash:                              535k          5.33M
  siphash inlined + noise:              548k          5.40M         add_noise=0.23%
  siphash + single-word noise:          555k          5.45M         add_noise=0.10%
  siphash + single-word&inlined noise:  559k          5.38M

Actually the last one is better than the previous one because it also swallows more packets. There were 10.9M pps in and 5.38M pps out versus 10.77M in and 5.45M out for the previous one. I didn't report the incoming traffic for the other ones as it was mostly irrelevant and always within these bounds.

Finally I've added Eric's patch to reuse the skb hash when known in tcp_conn_request(), and was happy to see the SYN cookies reach 5.45 Mpps again and the connection rate remain unaffected. A perf record during the SYN flood showed almost no call to prandom_u32() anymore (just a few in tcp_rtx_synack()) so this looks like a desirable optimization.

At the moment the code is ugly, in experimental state (I've pushed all of it at https://git.kernel.org/pub/scm/linux/kernel/git/wtarreau/prandom.git/). My impression on this is that given that it's possible to maintain the same level of performance as we currently have while making the PRNG much better, there's no more reason for not doing it.
If there's enough interest at this point, I'm OK with restarting from George's patches and doing the adjustments there. There's still this prandom_seed() which looks very close to prandom_reseed() and that we might possibly better remerge, but I'd vote for not changing everything at once, it's ugly enough already. Also I suspect we can have an infinite loop in prandom_seed() if entropy is 0 and the state is zero as well. We'd be unlucky but I'd just make sure entropy is not all zeroes. And running tests on 32-bit would be desirable as well. Finally one can wonder whether it makes sense to keep Tausworthe for other cases (basic statistical sampling) or drop it. We could definitely drop it and simplify everything given that we now have the same level of performance. But if we do it, what should we do with the test patterns ? I personally don't t
Re: [DRAFT PATCH] random32: make prandom_u32() output unpredictable
On Fri, Aug 14, 2020 at 05:32:32PM +0200, Sedat Dilek wrote: > commit 94c7eb54c4b8e81618ec79f414fe1ca5767f9720 > "random32: add a tracepoint for prandom_u32()" > > ...I gave Willy's patches a try and used the Linux Test Project (LTP) > for testing. Just FWIW today I could run several relevant tests with a 40 Gbps NIC at high connection rates and under SYN flood to stress SYN cookies. I couldn't post earlier due to a net outage but will post the results here. In short, what I'm seeing is very good. The only thing is that the noise collection as-is with the 4 longs takes a bit too much CPU (0.2% measured) but if keeping only one word we're back to tausworthe performance, while using siphash all along. The noise generation function is so small that we're wasting cycles calling it and renaming registers. I'll run one more test by inlining it and exporting the noise. So provided quite some cleanup now, I really think we're about to reach a solution which will satisfy everyone. More on this after I extract the results. Willy
Re: [DRAFT PATCH] random32: make prandom_u32() output unpredictable
On Thu, Aug 13, 2020 at 09:53:11AM +0200, Sedat Dilek wrote: > On Wed, Aug 12, 2020 at 5:21 AM Willy Tarreau wrote: > > > > On Tue, Aug 11, 2020 at 12:51:43PM +0200, Sedat Dilek wrote: > > > Can you share this "rebased to mainline" version of George's patch? > > > > You can pick it from there if that helps, but keep in mind that > > it's just experimental code that we use to explain our ideas and > > that we really don't care a single second what kernel it's applied > > to: > > > > > > https://git.kernel.org/pub/scm/linux/kernel/git/wtarreau/cleanups.git/log/?h=20200811-prandom-1 > > > > Thanks Willy. > > I disagree: the base for testing should be clear(ly communicated). It is. As you can see on the log above, this was applied on top of fc80c51fd4b2, there's nothing special here. In addition we're not even talking about testing nor calling for testers, just trying to find a reasonable solution. Maybe today I'll be able to re-run a few tests by the way. > There are two diffs from Eric to #1: add a trace event for > prandom_u32() and #2: a removal of prandom_u32() call in > tcp_conn_request(). > In case you have not seen. I've seen, just not had the time to test yet. Willy
Re: [DRAFT PATCH] random32: make prandom_u32() output unpredictable
On Tue, Aug 11, 2020 at 12:51:43PM +0200, Sedat Dilek wrote: > Can you share this "rebased to mainline" version of George's patch? You can pick it from there if that helps, but keep in mind that it's just experimental code that we use to explain our ideas and that we really don't care a single second what kernel it's applied to: https://git.kernel.org/pub/scm/linux/kernel/git/wtarreau/cleanups.git/log/?h=20200811-prandom-1 Willy
Re: [DRAFT PATCH] random32: make prandom_u32() output unpredictable
On Tue, Aug 11, 2020 at 05:26:49AM +, George Spelvin wrote: > On Mon, Aug 10, 2020 at 11:04:55PM +0200, Willy Tarreau wrote: > > What could be improved is the way the input values are mixed (just > > added hence commutative for now). I didn't want to call a siphash > > round on the hot paths, but just shifting the previous noise value > > before adding would work, such as the following for example: > > > > void prandom_u32_add_noise(a, b, c, d) > > { > > unsigned long *noise = get_cpu_ptr(&net_rand_noise); > > > > #if BITS_PER_LONG == 64 > > *noise = rol64(*noise, 7) + a + b + c + d; > > #else > > *noise = rol32(*noise, 7) + a + b + c + d; > > #endif > > put_cpu_ptr(&net_rand_noise); > > > > } > > If you think this is enough seed material, I'm fine with it. > > I don't hugely like the fact that you sum all the inputs, since > entropy tends to be concentrated in the low-order words, and summing > risks cancellation. Yes I've figured this. But I thought it was still better than a pure xor which would cancel the high bits from pointers. > You can't afford even one SIPROUND as a non-cryptographic hash? E.g. > > DEFINE_PER_CPU(unsigned long[4], net_rand_noise); > EXPORT_SYMBOL(net_rand_noise); > > void prandom_u32_add_noise(a, b, c, d) > { > unsigned long *noise = get_cpu_ptr(&net_rand_noise); > > a ^= noise[0]; > b ^= noise[1]; > c ^= noise[2]; > d ^= noise[3]; > /* >* This is not used cryptographically; it's just >* a convenient 4-word hash function. >*/ > SIPROUND(a, b, c, d); > noise[0] = a; > noise[1] = b; > noise[2] = c; > put_cpu_ptr(&net_rand_noise); > } > > (And then you mix in net_rand_noise[0].) > > Other options are HASH_MIX() from fs/namei.c, but that's > more sequential. > > There's also a simple Xorshift generator. I think a xorshift on each value will have roughly the same cost as a single SIPROUND. But I haven't completely ruled out these options yet; I still need to test. 
If we lose a few cycles per packet, that might be OK. Willy
Re: [DRAFT PATCH] random32: make prandom_u32() output unpredictable
On Tue, Aug 11, 2020 at 03:47:24AM +, George Spelvin wrote: > On Mon, Aug 10, 2020 at 01:47:00PM +0200, Willy Tarreau wrote: > > except that I retrieve it only on 1/8 calls > > and use the previous noise in this case. > > Er... that's quite different. I was saying you measure them all, and do: That was my first approach and it resulted in a significant performance loss, hence the change (and the resulting ugly construct with the counter). > > + if (++s->count >= 8) { > > + v3 ^= s->noise; > > + s->noise += random_get_entropy(); > > + s->count = 0; > > + } > > + > > - Can you explain why you save the "noise" until next time? Is this meant to > make it harder for an attacker to observe the time? Just to make the observable call not depend on immediately measurable TSC values. It's weak, but the point was that when mixing attack traffic with regular one, if you can measure the time variations on TSC to know when it was used and don't have its resulting effect at the same time, it's harder to analyse than when you have both at once. > - How about doing away with s->count and making it statistical: > > + if ((v3 & 7) == 0) > + v3 ^= random_get_entropy(); Absolutely. I just kept the counter from previous attempt. But Linus prefers that we completely remove TSC calls from direct calls due to old VMs that were slow at this. We could collect it anywhere else once in a while instead. Willy
Re: [DRAFT PATCH] random32: make prandom_u32() output unpredictable
Linus, George, Florian, would something in this vein be OK in your opinion ? - update_process_times still updates the noise - we don't touch the fast_pool anymore - we don't read any TSC on hot paths - we update the noise in xmit from jiffies and a few pointer values instead I've applied it on top of George's patch rebased to mainline for simplicity. I've used a separate per_cpu noise variable to keep the net_rand_state static with its __latent_entropy. With this I'm even getting very slightly better performance than with the previous patch (195.7 - 197.8k cps). What could be improved is the way the input values are mixed (just added hence commutative for now). I didn't want to call a siphash round on the hot paths, but just shifting the previous noise value before adding would work, such as the following for example: void prandom_u32_add_noise(a, b, c, d) { unsigned long *noise = get_cpu_ptr(&net_rand_noise); #if BITS_PER_LONG == 64 *noise = rol64(*noise, 7) + a + b + c + d; #else *noise = rol32(*noise, 7) + a + b + c + d; #endif put_cpu_ptr(&net_rand_noise); } Thanks, Willy --- diff --git a/drivers/char/random.c b/drivers/char/random.c index d20ba1b104ca..2a41b21623ae 100644 --- a/drivers/char/random.c +++ b/drivers/char/random.c @@ -1277,7 +1277,6 @@ void add_interrupt_randomness(int irq, int irq_flags) fast_mix(fast_pool); add_interrupt_bench(cycles); - this_cpu_add(net_rand_state.s1, fast_pool->pool[cycles & 3]); if (unlikely(crng_init == 0)) { if ((fast_pool->count >= 64) && diff --git a/include/linux/prandom.h b/include/linux/prandom.h index aa16e6468f91..e2b4990f2126 100644 --- a/include/linux/prandom.h +++ b/include/linux/prandom.h @@ -20,7 +20,7 @@ struct rnd_state { __u32 s1, s2, s3, s4; }; -DECLARE_PER_CPU(struct rnd_state, net_rand_state); +DECLARE_PER_CPU(unsigned long, net_rand_noise); u32 prandom_u32_state(struct rnd_state *state); void prandom_bytes_state(struct rnd_state *state, void *buf, size_t nbytes); @@ -67,6 +67,7 @@ static inline void 
prandom_seed_state(struct rnd_state *state, u64 seed) state->s2 = __seed(i, 8U); state->s3 = __seed(i, 16U); state->s4 = __seed(i, 128U); + __this_cpu_add(net_rand_noise, i); } /* Pseudo random number generator from numerical recipes. */ diff --git a/kernel/time/timer.c b/kernel/time/timer.c index ae5029f984a8..2f07c569af77 100644 --- a/kernel/time/timer.c +++ b/kernel/time/timer.c @@ -1721,7 +1721,7 @@ void update_process_times(int user_tick) * non-constant value that's not affine to the number of calls to make * sure it's updated when there's some activity (we don't care in idle). */ - this_cpu_add(net_rand_state.s1, rol32(jiffies, 24) + user_tick); + __this_cpu_add(net_rand_noise, rol32(jiffies, 24) + user_tick); } /** diff --git a/lib/random32.c b/lib/random32.c index 2b048e2ea99f..d74cf1db8cc9 100644 --- a/lib/random32.c +++ b/lib/random32.c @@ -320,6 +320,8 @@ struct siprand_state { }; static DEFINE_PER_CPU(struct siprand_state, net_rand_state) __latent_entropy; +DEFINE_PER_CPU(unsigned long, net_rand_noise); +EXPORT_SYMBOL(net_rand_noise); #if BITS_PER_LONG == 64 /* @@ -334,7 +336,7 @@ static DEFINE_PER_CPU(struct siprand_state, net_rand_state) __latent_entropy; #define K0 (0x736f6d6570736575 ^ 0x6c7967656e657261 ) #define K1 (0x646f72616e646f6d ^ 0x7465646279746573 ) -#elif BITS_PER_LONG == 23 +#elif BITS_PER_LONG == 32 /* * On 32-bit machines, we use HSipHash, a reduced-width version of SipHash. 
* This is weaker, but 32-bit machines are not used for high-traffic @@ -374,9 +376,12 @@ static DEFINE_PER_CPU(struct siprand_state, net_rand_state) __latent_entropy; static u32 siprand_u32(struct siprand_state *s) { unsigned long v0 = s->v[0], v1 = s->v[1], v2 = s->v[2], v3 = s->v[3]; + unsigned long n = __this_cpu_read(net_rand_noise); + v3 ^= n; SIPROUND(v0, v1, v2, v3); SIPROUND(v0, v1, v2, v3); + v0 ^= n; s->v[0] = v0; s->v[1] = v1; s->v[2] = v2; s->v[3] = v3; return v1 + v3; } diff --git a/net/core/dev.c b/net/core/dev.c index 7df6c9617321..55a2471cd75b 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -144,6 +144,7 @@ #include #include #include +#include #include "net-sysfs.h" @@ -3557,6 +3558,7 @@ static int xmit_one(struct sk_buff *skb, struct net_device *dev, dev_queue_xmit_nit(skb, dev); len = skb->len; + __this_cpu_add(net_rand_noise, (long)skb + (long)dev + (long)txq + len + jiffies); trace_net_dev_start_xmit(skb, dev); rc = netdev_start_xmit(skb, dev, txq, more); trace_net_dev_xmit(skb, rc, dev, len); @@ -4129,6 +4131,7 @@ static int __dev_queue_xmit(struct sk_buff *skb, struct net_device *sb_dev) if (!skb) goto out; +
Re: [DRAFT PATCH] random32: make prandom_u32() output unpredictable
On Mon, Aug 10, 2020 at 10:45:26AM -0700, Linus Torvalds wrote: > On Mon, Aug 10, 2020 at 9:59 AM Willy Tarreau wrote: > > > > I took what we were already using in add_interrupt_randomness() since > > I considered that if it was acceptable there, it probably was elsewhere. > > Once you've taken an interrupt, you're doing IO anyway, and the > interrupt costs will dominate anything you do. > > But the prandom_u32() interface is potentially done many times per > interrupt. For all I know it's done inside fairly critical locks etc > too. > > So I don't think one usage translates to another very well. Possible, hence the better solution to just feed noise in hot paths. Using jiffies and skb pointer in xmit is better than nothing anyway. I'm not seeking anything extremist, I just want to make sure we don't get yet-another-report on this area next summer just because some researcher using two VMs discovers that attacking a 100% idle VM from another one running on the same CPU core is trivial after having stolen some memory data. If at least such an attack is boring and rarely works, the rest of the job is provided by siphash and we should be fine. Willy
Re: [DRAFT PATCH] random32: make prandom_u32() output unpredictable
On Mon, Aug 10, 2020 at 09:31:48AM -0700, Linus Torvalds wrote: > On Mon, Aug 10, 2020 at 4:47 AM Willy Tarreau wrote: > > > > Doing testing on real hardware showed that retrieving the TSC on every > > call had a non negligible cost, causing a loss of 2.5% on the accept() > > rate and 4% on packet rate when using iptables -m statistics. > > And by "real hardware" I assume you mean x86, with a fairly fast and > high-performance TSC for get_random_entropy(). Yep. > Reading the TSC takes on the order of 20-50 cycles, iirc. > > But it can actually be *much* more expensive. On non-x86, it can be an > IO cycle to external chips. I took what we were already using in add_interrupt_randomness() since I considered that if it was acceptable there, it probably was elsewhere. > And on older hardware VM's in x86, it can be a vm exit etc, so > thousands of cycles. I hope nobody uses those VM's any more, but it > would be a reasonable test-case for some non-x86 implementations, so.. Yes, I remember these ones, they were not fun at all. > IOW, no. You guys are - once again - ignoring reality. I'm not ignoring reality, quite the opposite, trying to take all knowledge into account. If I'm missing some points, fine. But if we were already calling that in the interrupt handler I expected that this would be OK. The alternative Florian talked about is quite interesting as well, which is to collect some cheap noise in the network rx/tx paths since these are the areas we care about. Willy
Re: [DRAFT PATCH] random32: make prandom_u32() output unpredictable
On Mon, Aug 10, 2020 at 02:03:02PM +0200, Florian Westphal wrote: > As this relates to networking, you could also hook perturbation into rx/tx > softirq processing. E.g. once for each new napi poll round or only once > for each softnet invocation, depending on cost. I was wondering what/where to add some processing. I was thinking about having a per_cpu "noise" variable that would get mixed with the randoms when generated. This "noise" would need to be global so that we can easily append to it. For example on the network path it would be nice to use checksums but nowadays they're not calculated anymore. > IIRC the proposed draft left a unused prandom_seed() stub around, you could > re-use that to place extra data to include in the hash in percpu data. Probably. What I saw was that prandom_seed() expected to perform a full (hence slow) reseed. Instead I'd like to do something cheap. I like the principle of "noise" in that it doesn't promise to bring any security, only to perturb a little bit. Willy
Re: [DRAFT PATCH] random32: make prandom_u32() output unpredictable
On Mon, Aug 10, 2020 at 12:01:11PM +, David Laight wrote: > > /* > > * On 32-bit machines, we use HSipHash, a reduced-width version of SipHash. > > * This is weaker, but 32-bit machines are not used for high-traffic > > @@ -375,6 +377,12 @@ static u32 siprand_u32(struct siprand_state *s) > > { > > unsigned long v0 = s->v[0], v1 = s->v[1], v2 = s->v[2], v3 = s->v[3]; > > > > + if (++s->count >= 8) { > > + v3 ^= s->noise; > > + s->noise += random_get_entropy(); > > + s->count = 0; > > + } > > + > > Using: > if (s->count-- <= 0) { > ... > s->count = 8; > } > probably generates better code. > Although you may want to use a 'signed int' instead of 'unsigned long'. Yeah I know, it's just because I only slightly changed the previous code there. I had an earlier version that kept the rand state fully padded when storing intermediate values. That's among the final cleanups I'll bring if we go down that route. Thanks! Willy
Re: [DRAFT PATCH] random32: make prandom_u32() output unpredictable
Hi George, On Sun, Aug 09, 2020 at 06:30:17PM +, George Spelvin wrote: > Even something simple like buffering 8 TSC samples, and adding them > at 32-bit offsets across the state every 8th call, would make a huge > difference. Doing testing on real hardware showed that retrieving the TSC on every call had a non-negligible cost, causing a loss of 2.5% on the accept() rate and 4% on packet rate when using iptables -m statistics. However I reused your idea of accumulating old TSCs to increase the uncertainty about their exact value, except that I retrieve it only on 1/8 calls and use the previous noise in this case. With this I observe the same performance as plain 5.8. Below are the connection rates accepted on a single core:

  5.8              5.8+patch        5.8+patch+tsc
  192900-197900    188800-192200    194500-197500   (conn/s)

This was on a Core i7-8700K. I looked at the asm code for the function and it remains reasonably light, in the same order of complexity as the original one, so I think we could go with that. My proposed change is below, in case you have any improvements to suggest. Regards, Willy diff --git a/lib/random32.c b/lib/random32.c index 2b048e2ea99f..a12d63028106 100644 --- a/lib/random32.c +++ b/lib/random32.c @@ -317,6 +317,8 @@ static void __init prandom_state_selftest(void) struct siprand_state { unsigned long v[4]; + unsigned long noise; + unsigned long count; }; static DEFINE_PER_CPU(struct siprand_state, net_rand_state) __latent_entropy; @@ -334,7 +336,7 @@ static DEFINE_PER_CPU(struct siprand_state, net_rand_state) __latent_entropy; #define K0 (0x736f6d6570736575 ^ 0x6c7967656e657261 ) #define K1 (0x646f72616e646f6d ^ 0x7465646279746573 ) -#elif BITS_PER_LONG == 23 +#elif BITS_PER_LONG == 32 /* * On 32-bit machines, we use HSipHash, a reduced-width version of SipHash. 
* This is weaker, but 32-bit machines are not used for high-traffic @@ -375,6 +377,12 @@ static u32 siprand_u32(struct siprand_state *s) { unsigned long v0 = s->v[0], v1 = s->v[1], v2 = s->v[2], v3 = s->v[3]; + if (++s->count >= 8) { + v3 ^= s->noise; + s->noise += random_get_entropy(); + s->count = 0; + } + SIPROUND(v0, v1, v2, v3); SIPROUND(v0, v1, v2, v3); s->v[0] = v0; s->v[1] = v1; s->v[2] = v2; s->v[3] = v3;
Re: [DRAFT PATCH] random32: make prandom_u32() output unpredictable
On Sun, Aug 09, 2020 at 06:30:17PM +, George Spelvin wrote: > I'm trying to understand your attack scenario. I'm assuming that an > attacker can call prandom_u32() locally. Well, first, I'm mostly focusing on remote attacks, because if the attacker has a permanent local access, he can as well just bind() a source port instead of having to guess the next one that will be used. Second, I'm not aware of direct access from userland, so the calls to prandom_u32() are expected to be done only via specific code parts. The worst case (in my opinion) is when the attacker runs on the same physical CPU and exploits some memory leakage to find the internal state and uses the same TSC, but can only trigger prandom_u32() via the network, hence with much less precision. (...) > Even something simple like buffering 8 TSC samples, and adding them > at 32-bit offsets across the state every 8th call, would make a huge > difference. > > Even if 7 of the timestamps are from attacker calls (4 bits > uncertainty), the time of the target call is 8x less known > (so it goes from 15 to 18 bits), and the total becomes > 46 bits. *So* much better. Maybe. I'm just suggesting to add *some* noise to keep things not exploitable if the state is stolen once in a while (which would already be a big problem, admittedly). > > I can run some tests on this as > > well. I'd really need to try on a cleaner setup, I have remote machines > > at the office but I don't feel safe enough to remotely reboot them and > > risk to lose them :-/ > > Yeah, test kernels are nervous-making that way. In fact I never booted a 5.8 kernel on the machine I'm thinking about yet and can't remote-control its power supply, so I'm worried about a dirty hang at boot by missing a config entry. The risk is low but that's annoying when it happens. > > I'll also test on arm/arm64 to make sure we don't introduce a significant > > cost there. 
> > I don't expect a problem, but SipHash is optimized for 4-issue processors, > and on narrower machines fewer instructions are "free". OK. Willy
Re: [DRAFT PATCH] random32: make prandom_u32() output unpredictable
On Sun, Aug 09, 2020 at 05:06:39PM +, George Spelvin wrote: > On Sun, Aug 09, 2020 at 11:38:05AM +0200, Willy Tarreau wrote: > > So I gave it a quick test under Qemu and it didn't show any obvious > > performance difference compared to Tausworthe, which is a good thing, > > even though there's a significant amount of measurement noise in each > > case. > > Thank you very much! I'm not quite sure how to benchmark this. > The whole idea is that it's *not* used in a tight cache-hot loop. > Hopefully someone already has a test setup so I don't have to invent > one. Due to limited access to representative hardware, the two main tests I've been running were on my laptop in qemu, and consisted in:

- a connect-accept-close test to stress the accept() code and verify we don't observe a significant drop. The thing is that connect() usually is much slower, and running the two on the same machine tends to significantly soften the differences compared to what a real machine would see when handling a DDoS for example.

- a packet rate test through this rule (which uses prandom_u32() for each packet and which matches what can be done in packet schedulers or just by users having to deal with random drop): iptables -I INPUT -m statistic --probability 0.5 -j ACCEPT

While these ones are not very relevant, especially in a VM, not seeing significant variations tends to indicate we should not see a catastrophic loss. > > However it keeps the problem that the whole sequence is entirely > > determined at the moment of reseeding, so if one were to be able to > > access the state, e.g. using l1tf/spectre/meltdown/whatever, then > > this state could be used to predict the whole ongoing sequence for > > the next minute. What some view as a security feature, others will > > see as a backdoor :-/ That's why I really like the noise approach. 
> > Even just the below would significantly harm that capability because > > that state alone isn't sufficient anymore to pre-compute all future > > values: > > > > --- a/lib/random32.c > > +++ b/lib/random32.c > > @@ -375,6 +375,7 @@ static u32 siprand_u32(struct siprand_state *s) > > { > > unsigned long v0 = s->v[0], v1 = s->v[1], v2 = s->v[2], v3 = > > s->v[3]; > > > > + v0 += get_cycles(); > > SIPROUND(v0, v1, v2, v3); > > SIPROUND(v0, v1, v2, v3); > > s->v[0] = v0; s->v[1] = v1; s->v[2] = v2; s->v[3] = v3; > > As long as: > 1) The periodic catastrophic reseeding remains, and > 2) You use fresh measurements, not the exact same bits >that add_*_randomness feeds into /dev/random > then it doesn't do any real harm, so if it makes you feel better... > > But I really want to stress how weak a design drip-reseeding is. > > If an attacker has enough local machine access to do a meltdown-style attack, > then they can calibrate the TSC used in get_cycles very accurately, Absolutely. > so the > remaining timing uncertainty is very low. Not that low in fact because they don't know precisely when the call is made. I mean, let's say we're in the worst case, with two VMs running on two siblings of the same core, with the same TSC, on a 3 GHz machine. The attacker can stress the victim at 100k probes per second. That's still 15 bits of uncertainty on the TSC value estimation which is added to each call. Even on the first call this is enough to make a source port unguessable, and preventing the attacker from staying synchronized with its victim. And I'm only speaking about an idle remote machine, not even one taking unobservable traffic, which further adds to the difficulty. > This makes a brute-force attack on > one or two reseedings quite easy. I.e. if you can see every other output, > It's straightforward to figure out the ones in between. But they already become useless because you're only observing stuff of the past. 
> I wonder if, on general principles, it would be better to use a more > SipHash style mixing in of the sample: > m = get_cycles(); > v3 ^= m; > SIPROUND(v0, v1, v2, v3); > SIPROUND(v0, v1, v2, v3); > v0 ^= m; Probably, if it's the recommended way to mix in other values, yes. > Not sure if it's worth the extra register (and associated spill/fill). If this makes the hash better, maybe. I can run some tests on this as well. I'd really need to try on a cleaner setup, I have remote machines at the office but I don't feel safe enough to remotely reboot them and risk to lose them :-/ I'll also test on arm/arm64 to make sure we don't introduce a significant cost there. Willy
Re: [DRAFT PATCH] random32: make prandom_u32() output unpredictable
Hi George, On Sun, Aug 09, 2020 at 06:57:44AM +, George Spelvin wrote: > +/* > + * This is the core CPRNG function. As "pseudorandom", this is not used > + * for truly valuable things, just intended to be a PITA to guess. > + * For maximum speed, we do just two SipHash rounds per word. This is > + * the same rate as 4 rounds per 64 bits that SipHash normally uses, > + * so hopefully it's reasonably secure. > + * > + * There are two changes from the official SipHash finalization: > + * - We omit some constants XORed with v2 in the SipHash spec as irrelevant; > + * they are there only to make the output rounds distinct from the input > + * rounds, and this application has no input rounds. > + * - Rather than returning v0^v1^v2^v3, return v1+v3. > + * If you look at the SipHash round, the last operation on v3 is > + * "v3 ^= v0", so "v0 ^ v3" just undoes that, a waste of time. > + * Likewise "v1 ^= v2". (The rotate of v2 makes a difference, but > + * it still cancels out half of the bits in v2 for no benefit.) > + * Second, since the last combining operation was xor, continue the > + * pattern of alternating xor/add for a tiny bit of extra non-linearity. > + */ > +static u32 siprand_u32(struct siprand_state *s) > +{ > + unsigned long v0 = s->v[0], v1 = s->v[1], v2 = s->v[2], v3 = s->v[3]; > + > + SIPROUND(v0, v1, v2, v3); > + SIPROUND(v0, v1, v2, v3); > + s->v[0] = v0; s->v[1] = v1; s->v[2] = v2; s->v[3] = v3; > + return v1 + v3; > +} > + > + > +/** > + * prandom_u32 - pseudo random number generator > + * > + * A 32 bit pseudo-random number is generated using a fast > + * algorithm suitable for simulation. This algorithm is NOT > + * considered safe for cryptographic use. 
> + */ > +u32 prandom_u32(void) > +{ > + struct siprand_state *state = get_cpu_ptr(&net_rand_state); > + u32 res = siprand_u32(state); > + > + put_cpu_ptr(&net_rand_state); > + return res; > +} So I gave it a quick test under Qemu and it didn't show any obvious performance difference compared to Tausworthe, which is a good thing, even though there's a significant amount of measurement noise in each case. However it keeps the problem that the whole sequence is entirely determined at the moment of reseeding, so if one were to be able to access the state, e.g. using l1tf/spectre/meltdown/whatever, then this state could be used to predict the whole ongoing sequence for the next minute. What some view as a security feature, others will see as a backdoor :-/ That's why I really like the noise approach. Even just the below would significantly harm that capability because that state alone isn't sufficient anymore to pre-compute all future values: --- a/lib/random32.c +++ b/lib/random32.c @@ -375,6 +375,7 @@ static u32 siprand_u32(struct siprand_state *s) { unsigned long v0 = s->v[0], v1 = s->v[1], v2 = s->v[2], v3 = s->v[3]; + v0 += get_cycles(); SIPROUND(v0, v1, v2, v3); SIPROUND(v0, v1, v2, v3); s->v[0] = v0; s->v[1] = v1; s->v[2] = v2; s->v[3] = v3; Regards, Willy
Re: Flaw in "random32: update the net random state on interrupt and activity"
Hi Florian, On Sat, Aug 08, 2020 at 09:18:27PM +0200, Florian Westphal wrote: > > struct rnd_state { > > - __u32 s1, s2, s3, s4; > > + siphash_key_t key; > > + uint64_t counter; > > }; > > Does the siphash_key really need to be percpu? I don't think so, I was really exploring a quick and easy change, and given that it was convenient to put that into rnd_state to ease adaptation to existing code, I left it here. > The counter is different of course. > Alternative would be to siphash a few prandom_u32 results > if the extra u64 is too much storage. I don't think we need to make it complicated. If we can implement a partial siphash that's fast enough and that makes everyone happy, it will also not take too much space so that could become an overall improvement (and I'd say that reconciling everyone would also be a huge improvement). > > DECLARE_PER_CPU(struct rnd_state, net_rand_state); > > @@ -161,12 +163,14 @@ static inline u32 __seed(u32 x, u32 m) > > */ > > static inline void prandom_seed_state(struct rnd_state *state, u64 seed) > > { > > +#if 0 > > u32 i = (seed >> 32) ^ (seed << 10) ^ seed; > > > > state->s1 = __seed(i, 2U); > > state->s2 = __seed(i, 8U); > > state->s3 = __seed(i, 16U); > > state->s4 = __seed(i, 128U); > > +#endif > > } > [..] > > Can't we keep prandom_u32 as-is...? Most of the usage, esp. in the > packet schedulers, is fine. > > I'd much rather have a prandom_u32_hashed() or whatever for > those cases where some bits might leak to the outside and then convert > those prandom_u32 users over to the siphashed version. In fact I even thought about having a different name, such as "weak_u32" or something like this for the parts where we need slightly more than a purely predictable random, and keeping the existing function as-is. But I discovered there's also another one which is sufficient for stats, it just doesn't have the same interface. But I totally agree on the benefit of keeping the fastest version available for packet schedulers. 
In fact I've run some of my tests with iptables -m statistics --probability ... However if it turns out we can end up with a very fast PRNG (Tausworthe was not *that* fast), it would be a pure benefit to switch to it. That's why I don't want to rule out the possibility of a simple drop-in replacement. By the way, if we end up keeping a different function for simpler cases, we should probably find a better name for it, like "predictable random" or "trivial random", that makes users think twice. Doing that in a packet scheduler is fine. The problem with the "pseudo random" term is that it evolved over time from "totally predictable" to "may be strongly secure" depending on implementations, and that various people understand it differently and have different expectations of it :-/ Cheers, Willy
Re: Flaw in "random32: update the net random state on interrupt and activity"
On Sat, Aug 08, 2020 at 11:19:01AM -0700, Linus Torvalds wrote: > On Sat, Aug 8, 2020 at 10:45 AM Willy Tarreau wrote: > > > > > > WIP: random32: use siphash with a counter for prandom_u32 > > siphash is good. > > But no: > > > --- a/drivers/char/random.c > > +++ b/drivers/char/random.c > > @@ -1277,7 +1277,6 @@ void add_interrupt_randomness(int irq, int irq_flags) > > > > fast_mix(fast_pool); > > add_interrupt_bench(cycles); > > - this_cpu_add(net_rand_state.s1, fast_pool->pool[cycles & 3]); > > > > if (unlikely(crng_init == 0)) { > > if ((fast_pool->count >= 64) && > > --- a/include/linux/random.h > > +++ b/include/linux/random.h > > diff --git a/kernel/time/timer.c b/kernel/time/timer.c > > index 026ac01af9da..c9d839c2b179 100644 > > --- a/kernel/time/timer.c > > +++ b/kernel/time/timer.c > > @@ -1743,13 +1743,6 @@ void update_process_times(int user_tick) > > scheduler_tick(); > > if (IS_ENABLED(CONFIG_POSIX_TIMERS)) > > run_posix_cpu_timers(); > > - > > - /* The current CPU might make use of net randoms without receiving > > IRQs > > -* to renew them often enough. Let's update the net_rand_state from > > a > > -* non-constant value that's not affine to the number of calls to > > make > > -* sure it's updated when there's some activity (we don't care in > > idle). > > -*/ > > - this_cpu_add(net_rand_state.s1, rol32(jiffies, 24) + user_tick); > > } > > We're not going back to "don't add noise, just make a stronger > analyzable function". > > I simply won't take it. See my previous email. I'm 100% fed up with > security people screwing up real security by trying to make things > overly analyzable. > > If siphash is a good enough hash to make the pseudo-random state hard > to guess, then it's also a good enough hash to hide the small part of > the fast-pool we mix in. I'm totally fine with that. 
In fact, my secret goal there was to put net_rand_state back to random32.c as a static and reinstall the __latent_entropy that we had to remove :-) I'll need to re-run more correct tests though. My measurements were really erratic with some of them showing half an HTTP request rate in a test, which makes absolutely no sense and thus disqualifies my measurements. But if the results are correct enough I'm fine with continuing on this one and forgetting MSWS. > And while security researchers may hate it because it's hard to > analyze, that's the POINT, dammit. Actually I think there are two approaches which explains the constant disagreements in this area. Some people want something provably difficult, and others want something that cannot be proven to be easy. Sadly by trying to please everyone we probably got something between provably easy and not provably difficult :-/ Willy
Re: Flaw in "random32: update the net random state on interrupt and activity"
On Sat, Aug 08, 2020 at 10:07:51AM -0700, Andy Lutomirski wrote: > > > On Aug 8, 2020, at 8:29 AM, George Spelvin wrote: > > > > > And apparently switching to the fastest secure PRNG currently > > in the kernel (get_random_u32() using ChaCha + per-CPU buffers) > > would cause too much performance penalty. > > Can someone explain *why* the slow path latency is particularly relevant > here? What workload has the net code generating random numbers in a place > where even a whole microsecond is a problem as long as the amortized cost is > low? (I'm not saying I won't believe this matters, but it's not obvious to > me that it matters.) It really depends on what is being done and how we want that code to continue to be used. Often it will be a matter of queue sizes or concurrency in certain contexts under a lock. For illustration let's say you want to use randoms to choose a sequence number for a SYN cookie (it's not what is done today), a 40G NIC can deliver 60Mpps or one packet every 16 ns. If you periodically add 1us you end up queuing 60 extra packets when this happens. This might or might not be acceptable. Regarding concurrency, you could imagine some code picking a random number in a small function running under a spinlock. Frequently adding 1us there can be expensive. To be clear, I'm not *that* much worried about *some* extra latency, however I'm extremely worried about the risks of increase once some security researcher considers the PRNG not secure enough again (especially once it starts to be used for security purposes after having been claimed secure) and we replace it with another one showing a higher cost and longer tail to amortize the cost. And given that we're speaking about replacing a non-secure PRNG with *something*, it doesn't seem really wise to start to introduce new constraints on the data path when there are probably a large enough number of possibilities which do not require these constraints. 
Last point: we must not rule out the possibility that some clever researcher will later come up with new time-based attacks consisting of observing latencies of packet processing, explaining how they can count the number of connections reaching a given host and reveal certain useful information, and that we'll then be forced to slow all of them down to proceed in constant time. Just my two cents, Willy
Re: Flaw in "random32: update the net random state on interrupt and activity"
On Sat, Aug 08, 2020 at 03:26:28PM +, George Spelvin wrote: > So any discussion of reseeding begins by *assuming* an attacker has > captured the PRNG state. If we weren't worried about this possibility, > we wouldn't need to reseed in the first place!

Sure, but what matters is the time it takes to capture the state. In the previous case it used to take Amit a few packets only. If it takes longer than the reseeding period, then it's not captured anymore. That was exactly the point there: break the certainty of the observations so that the state cannot be reliably guessed anymore, hence cancelling the concerns about the input bits being guessed. For sure if it is assumed that the state is guessed nevertheless, this doesn't work anymore, but that wasn't the assumption.

> Just like a buffer overflow, a working attack is plausible using a > combination of well-understood techniques. It's ridiculous to go to > the effort of developing a full exploit when it's less work to fix the > problem in the first place.

I totally agree with this principle. However, systematically starting with assumptions like "the PRNG's state is known to the attacker", when obtaining that state is the whole purpose of the attack, is the kind of assumption that warrants a bit more explanation. Otherwise I could as well say in other contexts "let's assume I know the private key of this site's certificate, after all it's only 2048 bits". And the reason why we're looping here is that all explanations about the risk of losing entropy are based on the assumption that the attacker has already won. The patch was made to keep that assumption sufficiently away, otherwise it's useless.

> A SOLUTION: > > Now, how to fix it? > > First, what is the problem? "Non-cryptographic PRNGs are predictable" > fits in the category of Not A Bug. There are many applications for > a non-cryptographic PRNG in the kernel.
Self-tests, spreading wear > across flash memory, randomized backoff among cooperating processes, > and assigning IDs in protocols which aren't designed for DoS resistance > in the first place. > > But apparently the network code is (ab)using lib/random32.c for choosing > TCP/UDP port numbers and someone has found a (not publicly disclosed) > attack which exploits the predictability to commit some mischief.

Note, it's beyond predictability, it's reversibility at this point.

> And apparently switching to the fastest secure PRNG currently > in the kernel (get_random_u32() using ChaCha + per-CPU buffers) > would cause too much performance penalty.

Based on some quick userland tests, apparently yes.

> So the request is for a less predictable PRNG which is still extremely > fast.

That was the goal around the MSWS PRNG, as it didn't leak its bits and would require a certain amount of brute force.

> Well, the best crypto primitive I know of for such an application is > SipHash. Its 4x 64-bit words of state is only twice the size of the > current struct rnd_state. Converting it from a pseudo-random function > to a CRNG is some half-assed amateur cryptography, but at least it's a > robust and well-studied primitive.

If it's considered cryptographically secure, it's OK to feed it with just a counter starting from a random value, since the only way to guess the current counter state is to brute-force it, right ? I've done this in the appended patch. Note that for now I've commented out all the reseeding code because I just wanted to test the impact. In my perf measurements I've had some erratic behaviors, one test showing almost no loss, and another one showing a lot, which didn't make sense, so I'll need to measure again under better conditions (i.e. not my laptop with a variable frequency CPU).

> Willy Tarreau: Middle square + Weyl sequence isn't even *close* to > crypto-quality.

Absolutely, I've even said this myself.
> And you're badly misunderstanding what the > fast_pool is doing. Please stop trying to invent your own crypto > primitives; see > > https://www.schneier.com/crypto-gram/archives/1998/1015.html#cipherdesign I know this pretty well and am NOT trying to invent crypto and am not qualified for this. I'm not even trying to invent a PRNG myself because there are some aspects I can't judge (such as measuring the sequence). This is why I proposed to reuse some published work. There seems to be some general thinking among many crypto-savvy people that everything in the world needs strong crypto, including PRNGs, and that doesn't always help. As you mentioned above, there are plenty of valid use cases for weaker PRNGs, and my opinion (which may or may not be shared) is that if guessing an
Re: Flaw in "random32: update the net random state on interrupt and activity"
On Fri, Aug 07, 2020 at 03:45:48PM -0700, Marc Plumb wrote: > Willy, > > > On 2020-08-07 3:19 p.m., Willy Tarreau wrote: > > On Fri, Aug 07, 2020 at 12:59:48PM -0700, Marc Plumb wrote: > > > > > > If I can figure the state out once, > > Yes but how do you take that as granted ? This state doesn't appear > > without its noise counterpart, > > Amit has shown attacks that can deduce the full internal state from 4-5 > packets with a weak PRNG. If the noise is added less often than that, an > attacker can figure out the entire state at which point the partial > reseeding doesn't help. If the noise is added more often than that, and it's > raw timing events, then it's only adding a few bits of entropy so it's easy > to guess (and it weakens /dev/random). If the noise is added more often, and > it's from the output of a CPRNG, then we have all the performance/latency > problems from the CPRNG itself, so we might as well use it directly.

His attack is based on the fact that internal state bits leak as-is, so it is possible to collect them and perform the inverse operation to reconstruct the input. Here the output is made by mixing two input bits with two noise bits to produce a single output bit. So for each output bit you see, you don't have just one possible input bit, but 16 possibilities. That's 128 bits of internal changing state that once combined result in 32 bits: for each 32-bit output you still have 2^96 possible internal (x,noise) states producing it. And that's without counting the 64-bit seed that you still don't know but could be deduced from two guessed 128-bit states (assuming that can be brute-forced at all, of course). For sure there is plenty of number theory allowing you to significantly reduce the space you need to work on to brute-force this, but it will definitely remain above the 2^32 attempts per iteration which is the floor you have without the noise part, while the whole internal state will be reseeded every minute anyway.
I mean, I'm not trying to sell this thing, I'm just trying to defend that we use a reasonable tool for a reasonable protection level. And yes, probably that 15 years from now, someone will say "hey look, I can brute-force this thing in less than a minute on my 1024-core 39th gen core i7 machine with 2^60 operations per second, why don't we use our CPU's native 10 picosecond AES1024 instruction instead ?" and we'll say "well, it's an old story and it's probably about time to change it again". I don't see anything wrong with evolving this way, matching concrete needs more than pure theory. Regards, Willy
Re: Flaw in "random32: update the net random state on interrupt and activity"
On Fri, Aug 07, 2020 at 12:59:48PM -0700, Marc Plumb wrote: > > On 2020-08-07 10:43 a.m., Willy Tarreau wrote: > > > > > Which means that it's 2^32 effort to brute force this (which Amit called > > > "no > > > biggie for modern machines"). If the noise is the raw sample data with > > > only > > > a few bits of entropy, then it's even easier to brute force. > > Don't you forget to multiply by another 2^32 for X being folded onto itself > > ? > > Because you have 2^32 possible values of X which will give you a single > > 32-bit > > output value for a given noise value. > > If I can figure the state out once,

Yes but how do you take that as granted ? This state doesn't appear without its noise counterpart, so taking as a prerequisite that you may guess one separately obviously implies that you then just have to deduce the other, but the point of mixing precisely is that we do not expose the individual parts.

This way of thinking is often what results in extreme solutions being designed, which are far away from the reality of the field of application, and result in unacceptable costs that make people turn to other solutions. Do you think it makes me happy to see people waste their time reimplementing alternate userland TCP stacks that are supposedly "way faster" by getting rid of all the useless (to them) stuff that was forced on them at the cost of performance ? And it makes me even less happy when they ask me why I'm not spending more time trying to adopt them. The reality is that this time could be better spent optimizing our stack to be sure that costs are added where they are relevant, and not just to make sure that when we align 7 conditions starting with "imagine that I could guess that", the 8th could be guessed as well, except that none of these can really be guessed outside of a lab :-/

> then the only new input is the noise, so > that's the only part I have to brute force.
Throwing the noise in makes it > more difficult to get that state once, but once I have it then this type of > reseeding doesn't help. > I think it might be possible to do a decent CPRNG (that's at > least had some cryptanalys of it) with ~20 instructions per word, but if > that's not fast enough then I'll think about other options. I think that around 20 instructions for a hash would definitely be nice (but please be aware that we're speaking about RISC-like instructions, not SIMD instructions). And also please be careful not to count only with amortized performance that's only good to show nice openssl benchmarks, because if that's 1280 instructions for 256 bits that result in 20 instructions per 32-bit word, it's not the same anymore at all! Regards, Willy
Re: Flaw in "random32: update the net random state on interrupt and activity"
Hi Andy, On Fri, Aug 07, 2020 at 10:55:11AM -0700, Andy Lutomirski wrote: > >> This is still another non-cryptographic PRNG. > > > > Absolutely. During some discussions regarding the possibility of using > > CSPRNGs, orders around hundreds of CPU cycles were mentioned for them, > > which can definitely be a huge waste of precious resources for some > > workloads, possibly causing the addition of a few percent extra machines > > in certain environments just to keep the average load under a certain > > threshold. > > I think the real random.c can run plenty fast. It's ChaCha20 plus ludicrous > overhead right now. I'm working (slowly) on making the overhead go away. I'm > hoping to have something testable in a few days. As it stands, there is a > ton of indirection, a pile of locks, multiple time comparisons, per-node and > percpu buffers (why both?), wasted bits due to alignment, and probably other > things that can be cleaned up. I'm trying to come up with something that is > fast and has easy-to-understand semantics. > > You can follow along at: > > https://git.kernel.org/pub/scm/linux/kernel/git/luto/linux.git/log/?h=random/fast

Thanks, we'll see. I developed a quick test tool that's meant to be easy to use to measure the performance impact on connect/accept. I have not yet run it on a modified PRNG to verify that it works; I'll send it once I've tested. I'd definitely like to see no measurable performance drop, and ideally even a small performance increase (as Tausworthe isn't the lightest thing around either, so we do have some little margin). Willy
Re: Flaw in "random32: update the net random state on interrupt and activity"
On Fri, Aug 07, 2020 at 09:52:14AM -0700, Marc Plumb wrote: > On 2020-08-07 12:03 a.m., Willy Tarreau wrote: > > Just to give a heads up on this, here's what I'm having pending regarding > > MSWS: > >
> > struct rnd_state {
> >         uint64_t x, w;
> >         uint64_t seed;
> >         uint64_t noise;
> > };
> >
> > uint32_t msws32(struct rnd_state *state)
> > {
> >         uint64_t x;
> >
> >         x = state->w += state->seed;
> >         x += state->x * state->x;
> >         x = state->x = (x >> 32) | (x << 32);
> >         x -= state->noise++;
> >         return x ^ (x >> 32);
> > }
>
> A few comments:
>
> This is still another non-cryptographic PRNG.

Absolutely. During some discussions regarding the possibility of using CSPRNGs, figures on the order of hundreds of CPU cycles were mentioned for them, which can definitely be a huge waste of precious resources for some workloads, possibly causing the addition of a few percent extra machines in certain environments just to keep the average load under a certain threshold. And the worst part is that such workloads would be exactly the ones absolutely not affected by the theoretical attacks regarding predictability :-/

> An LFSR can pass PractRand (if > you do a tweak to get around the specific linear complexity test for LFSRs).

OK.

> On a 64-bit machine it should be fast: 4 adds, 1 multiply, 1 rotate, 1 > shift, 1 xor > > This will be much slower on 32-bit machines, if that's still a concern

Yep, that's something I'm totally aware of and one reason I wanted to have a public discussion about this. My personal view on this is that a 64-bit multiply will always be cheaper than a crypto operation, and that the environments where picking a source port or accepting a connection matters that much are not those running on such low-end machines, so that's very likely an acceptable tradeoff.

> As long as the noise is the output of a CPRNG, this doesn't hurt the > security of /dev/random.
> > The noise is more like effective 32-bits since you're xoring the low and > high half of the noise together (ignoring the minor details of carry bits). Definitely. The purpose of this noise in fact is more to complicate the reconstruction of the internal state in case a large number of samples is used, precisely due to these carry bits that propagate solely depending on the noise value itself, and the uncertainty about when it changes and from what to what. > Which means that it's 2^32 effort to brute force this (which Amit called "no > biggie for modern machines"). If the noise is the raw sample data with only > a few bits of entropy, then it's even easier to brute force. Don't you forget to multiply by another 2^32 for X being folded onto itself ? Because you have 2^32 possible values of X which will give you a single 32-bit output value for a given noise value. > Given the uses of this, I think we really should look into a CPRNG for this > and then completely reseed it periodically. The problem is finding one > that's fast enough. That was precisely the problem that drove me to propose something like this. And all these cycles wasted everywhere just to protect a 16-bit source port on a mostly idle machine seem quite overkill to me. > Is there a hard instruction budget for this, or it is > just "fast enough to not hurt the network benchmarks" (i.e. if Dave Miller > screams)? It's not just Davem. I too am concerned about wasting CPU cycles in fast path especially in the network code. A few half-percent gains are hardly won once in a while in this area and in some infrastructures they matter. Not much but they do. Willy
Re: Flaw in "random32: update the net random state on interrupt and activity"
On Thu, Aug 06, 2020 at 10:18:55AM -0700, Marc Plumb wrote: > Willy, > > > On 2020-08-05 11:30 p.m., Willy Tarreau wrote: > > On Wed, Aug 05, 2020 at 03:21:11PM -0700, Marc Plumb wrote: > > > There is nothing wrong with perturbing net_rand_state, the sin is doing it > > > with the raw entropy that is also the seed for your CPRNG. Use the output > > > of > > > a CPRNG to perturb the pool all you want, but don't do things that bit by > > > bit reveal the entropy that is being fed into the CPRNG. > > This is interesting because I think some of us considered it exactly the > > other way around, i.e. we're not copying exact bits but just taking a > > pseudo-random part of such bits at one point in time, to serve as an > > increment among other ones. And given that these bits were collected > > over time from not very secret sources, they appeared to be of lower > > risk than the output. > > No. The output of a CPRNG can't be used to determine the internal state. The > input can. The input entropy is the one thing that cannot be produced by a > deterministic computer, so they are the crown jewels of this. It's much much > safer to use the output. OK, noted. > > I didn't know about SFC32, it looks like a variation of the large family > > of xorshift generators, which is thus probably quite suitable as well > > for this task. Having used xoroshiro128** myself in another project, I > > found it overkill for this task compared to MSWS but I definitely agree > > that any of them is more suited to the task than the current one. > > > It's actually a chaotic generator (not a linear one like an xorshift > generator), which gives it weaker period guarantees which makes it more > difficult to reverse. With a counter added to help the period length. > > I'll trust Amit that SFC32 isn't strong enough and look at other options -- > I just thought of it as better, and faster than the existing one with the > same state size. Maybe a larger state is needed. 
Just to give a heads up on this, here's what I'm having pending regarding MSWS:

    struct rnd_state {
            uint64_t x, w;
            uint64_t seed;
            uint64_t noise;
    };

    uint32_t msws32(struct rnd_state *state)
    {
            uint64_t x;

            x = state->w += state->seed;
            x += state->x * state->x;
            x = state->x = (x >> 32) | (x << 32);
            x -= state->noise++;
            return x ^ (x >> 32);
    }

It passes PractRand without any warning after 1 TB of data:

    rng=RNG_stdin, seed=unknown  length= 512 megabytes (2^29 bytes), time= 2.0 seconds   no anomalies in 229 test result(s)
    rng=RNG_stdin, seed=unknown  length= 1 gigabyte (2^30 bytes), time= 4.3 seconds      no anomalies in 248 test result(s)
    rng=RNG_stdin, seed=unknown  length= 2 gigabytes (2^31 bytes), time= 8.3 seconds     no anomalies in 266 test result(s)
    rng=RNG_stdin, seed=unknown  length= 4 gigabytes (2^32 bytes), time= 15.8 seconds    no anomalies in 282 test result(s)
    rng=RNG_stdin, seed=unknown  length= 8 gigabytes (2^33 bytes), time= 31.3 seconds    no anomalies in 299 test result(s)
    rng=RNG_stdin, seed=unknown  length= 16 gigabytes (2^34 bytes), time= 61.9 seconds   no anomalies in 315 test result(s)
    rng=RNG_stdin, seed=unknown  length= 32 gigabytes (2^35 bytes), time= 119 seconds    no anomalies in 328 test result(s)
    rng=RNG_stdin, seed=unknown  length= 64 gigabytes (2^36 bytes), time= 242 seconds    no anomalies in 344 test result(s)
    rng=RNG_stdin, seed=unknown  length= 128 gigabytes (2^37 bytes), time= 483 seconds   no anomalies in 359 test result(s)
    rng=RNG_stdin, seed=unknown  length= 256 gigabytes (2^38 bytes), time= 940 seconds   no anomalies in 372 test result(s)
    rng=RNG_stdin, seed=unknown  length= 512 gigabytes (2^39 bytes), time= 1906 seconds  no anomalies in 387 test result(s)
    rng=RNG_stdin, seed=unknown  length= 1 terabyte (2^40 bytes), time= 3826 seconds     no anomalies in 401 test result(s)

The two modifications compared to the original msws are:

  - mix bits on output so that we don't reveal the internal state upon each call ;

  - combination of the output with an independent noise variable
whose purpose was to be updated upon IRQ and/or CPU usage and/or invocations. But on this point, while implementing it I figured that updating it on each invocation did already provide the frequent updates we were missing in Tausworthe that required the interrupt updates. I'd definitely update in update_process_times() so that it's not reduced to a pure counter, but the results, speed and simplicity look encouraging. I'll try to work on finishing the patch proposal this week-end. Regards, Willy
Re: Flaw in "random32: update the net random state on interrupt and activity"
On Wed, Aug 05, 2020 at 03:21:11PM -0700, Marc Plumb wrote: > There is nothing wrong with perturbing net_rand_state, the sin is doing it > with the raw entropy that is also the seed for your CPRNG. Use the output of > a CPRNG to perturb the pool all you want, but don't do things that bit by > bit reveal the entropy that is being fed into the CPRNG.

This is interesting because I think some of us considered it exactly the other way around, i.e. we're not copying exact bits but just taking a pseudo-random part of such bits at one point in time, to serve as an increment among other ones. And given that these bits were collected over time from not very secret sources, they appeared to be of lower risk than the output.

I mean, if we reimplemented something in parallel just mixing the IRQ return pointer and TSC, some people could possibly say "is this really strong enough?" but it wouldn't seem very shocking in terms of disclosure. But if by doing so we ended up by accident reproducing the same contents as the fast_pool it could be problematic.

Would you think that using only the input info used to update the fast_pool would be cleaner ? I mean, instead of :

    fast_pool->pool[0] ^= cycles ^ j_high ^ irq;
    fast_pool->pool[1] ^= now ^ c_high;
    ip = regs ? instruction_pointer(regs) : _RET_IP_;
    fast_pool->pool[2] ^= ip;
    fast_pool->pool[3] ^= (sizeof(ip) > 4) ? ip >> 32 :
            get_reg(fast_pool, regs);

we'd do:

    x0 = cycles ^ j_high ^ irq;
    x1 = now ^ c_high;
    x2 = regs ? instruction_pointer(regs) : _RET_IP_;
    x3 = (sizeof(x2) > 4) ? x2 >> 32 : get_reg(fast_pool, regs);
    fast_pool->pool[0] ^= x0;
    fast_pool->pool[1] ^= x1;
    fast_pool->pool[2] ^= x2;
    fast_pool->pool[3] ^= x3;
    this_cpu_add(net_rand_state.s1, x0^x1^x2^x3);

Because that's something easy to do and providing the same level of perturbation to net_rand_state.
> > Another approach involving the replacement of the algorithm was considered > > but we were working with -stable in mind, trying to limit the backporting > > difficulty (and it revealed a circular dependency nightmare that had been > > sleeping there for years), and making the changes easier to check (which > > is precisely what you're doing). > > Really? You can replace the LFSR and only change lib/random32.c. That might > fix the problem without the need to use the raw fast_pool data for seed > material. Definitely. I had been working on a variant of MSWS (https://arxiv.org/pdf/1704.00358.pdf) that I discussed a little bit with Amit, but I didn't publish it yet since it needs to be cleaned up and to replace all the test patterns in random32.c. And when starting to rebase it I figured that updating it from the interrupt handler didn't really look like it brought anything. I mean that when you're starting to wonder what to update "to make it better", it likely means that you don't need it. > As Linus said, speed is a concern but SFC32 is faster than > existing Tausworthe generator, and it's a drop-in replacement with the same > state size if that makes your life easier. If you're willing to expand the > state there are even better options (like SFC64 or some of chaotic > generators like Jenkins' Frog). I didn't know about SFC32, it looks like a variation of the large family of xorshift generators, which is thus probably quite suitable as well for this task. Having used xoroshiro128** myself in another project, I found it overkill for this task compared to MSWS but I definitely agree that any of them is more suited to the task than the current one. 
> > We're not trying to invent any stream cipher or whatever, just making > > use of a few bits that are never exposed alone as-is to internal nor > > external states, to slightly disturb another state that otherwise only > > changes once a minute so that there's no more a 100% chance of guessing > > a 16-bit port after seeing a few packets. I mean, I'm pretty sure that > > even stealing three or four bits only from there would be quite enough > > to defeat the attack given that Amit only recovers a few bits per packet. > > If Amit's attack can recover that much per packet (in practice not just in > theory) and there is even one packet per interrupt, then it's a major > problem. There are at most 2 bits of entropy added per call to > add_interrupt_randomness, so it you're leaking "a few bits per packet" then > that's everything. Over 64 interrupts you've lost the 128 bits of entropy > that the fast_pool has spilled into the input_pool. But how do you *figure* the value of these bits and their position in the fast_pool ? I mean that the attack relies on net_rand_state not being perturbed. Let's assume you'd be able to brute-force them in linear time by drawing the list of possible candidates for 32-bit increment on each packet and you end up with 64 sets of candidates. You'll probably have quite a number of candidates there since yo
Re: Flaw in "random32: update the net random state on interrupt and activity"
Hi Marc, On Wed, Aug 05, 2020 at 09:06:40AM -0700, Marc Plumb wrote: > Just because you or I don't have a working exploit doesn't mean that someone > else isn't more clever.

I agree on the principle, but the same can be said of many things, including our respective inability to factor large numbers for example. But for sure we do need to be careful, and actually picking only some limited parts of the fast pool (which are only used to update the input pool and are only made of low-difficulty stuff like instruction pointers, jiffies and TSC values) is probably not going to disclose an extremely well guarded secret.

> The fundamental question is: Why is this attack on net_rand_state a problem? > It's Working as Designed. Why is it a major enough problem to risk harming > cryptographically important functions?

It's not *that* major an issue (in my personal opinion) but the current net_rand_state is easy enough to guess that an observer may reduce the difficulty of building certain attacks (using known source ports for example). The goal of this change (and the one in update_process_times()) is to disturb the net_rand_state a little bit so that external observations turn from "this must be that" to "this may be this or maybe that", which is sufficient to limit the ability to reliably guess a state and reduce the cost of an attack. Another approach involving the replacement of the algorithm was considered, but we were working with -stable in mind, trying to limit the backporting difficulty (and it revealed a circular dependency nightmare that had been sleeping there for years), and making the changes easier to check (which is precisely what you're doing).

> Do you remember how you resisted making /dev/urandom fast for large reads for > a long time to punish stupid uses of the interface? In this case anyone who > is using net_rand_state assuming it is a CPRNG should stop doing that. Don't > enable stupidity in the kernel.
> > This whole thing is making the fundamental mistake of all amateur > cryptographers of trying to create your own cryptographic primitive. You're > trying to invent a secure stream cipher. Either don't try to make > net_rand_state secure, or use a known secure primitive. We're not trying to invent any stream cipher or whatever, just making use of a few bits that are never exposed alone as-is to internal nor external states, to slightly disturb another state that otherwise only changes once a minute so that there's no more a 100% chance of guessing a 16-bit port after seeing a few packets. I mean, I'm pretty sure that even stealing three or four bits only from there would be quite enough to defeat the attack given that Amit only recovers a few bits per packet. For me the right longterm solution will be to replace the easily guessable LFSR. But given the build breakage we got by just adding one include, I can only guess what we'll see when trying to do more in this area :-/ Regards, Willy
Re: Flaw in "random32: update the net random state on interrupt and activity"
Hi Ted, On Wed, Aug 05, 2020 at 11:34:32AM -0400, ty...@mit.edu wrote: > That being said, it certainly is a certificational / theoretical > weakness, and if the bright boys and girls at Fort Meade did figure > out a way to exploit this, they are very much unlikely to share it at > an open Crypto conference. So replacing the LFSR-based PRNG with > something stronger which didn't release any bits from the fast_pool > would certainly be desirable, and I look forward to seeing what Willy > has in mind.

I'll post a proposal patch shortly about this, hopefully this week-end (got diverted by work lately :-)). Just to give you a few pointers, it's a small modification of MSWS. It passes the PractRand test suite on 256 GB of data with zero warnings (something that Tausworthe is supposed to fail at). By default, MSWS *does* leak its internal state, as Amit showed us (and seeing that the paper on it suggests it's safe as-is for crypto use is a bit shocking), but once slightly adjusted it doesn't reveal its state anymore, and that would constitute a much more future-proof solution for quite some time. Tausworthe was created something like 20 years ago or so, hence it's not surprising that it's a bit dated by now, but if we can upgrade once every 2 decades I guess it's not that bad. Cheers, Willy
Re: Flaw in "random32: update the net random state on interrupt and activity"
Hi Marc, On Tue, Aug 04, 2020 at 05:52:36PM -0700, Marc Plumb wrote: > Seeding two PRNGs with the same entropy causes two problems. The minor one > is that you're double counting entropy. The major one is that anyone who can > determine the state of one PRNG can determine the state of the other. > > The net_rand_state PRNG is effectively a 113-bit LFSR, so anyone who can see > any 113 bits of output can determine the complete internal state. > > The output of the net_rand_state PRNG is used to determine how data is sent > to the network, so the output is effectively broadcast to anyone watching > network traffic. Therefore anyone watching the network traffic can determine > the seed data being fed to the net_rand_state PRNG.

The problem this patch is trying to work around is that the reporter (Amit) was able to determine the entire net_rand_state after observing a certain number of packets, due to this trivial LFSR and the fact that its internal state between two reseedings only depends on the number of calls to read it. (Please note that regarding this point I'll propose a patch to replace that PRNG to stop directly exposing the internal state to the network.) If you look closer at the patch, you'll see that in one interrupt the patch only uses any 32 out of the 128 bits of fast_pool to update only 32 bits of the net_rand_state. As such, the sequence observed on the network also depends on the remaining bits of net_rand_state, while the 96 other bits of the fast_pool are not exposed there.

> Since this is the same > seed data being fed to get_random_bytes, it allows an attacker to determine > the state and the output of /dev/random. I sincerely hope that this was > not the intended goal.
:) Not only was this obviously not the goal, but I'd be particularly interested in seeing this reality demonstrated, considering that the whole 128 bits of fast_pool together count as a single bit of entropy, and that as such, even if you were able to figure the value of the 32 bits leaked to net_rand_state, you'd still have to guess the 96 other bits for each single entropy bit :-/ Regards, Willy
Re: newbie question on networking kernel panics.
Hi Jeroen, On Thu, Jul 16, 2020 at 05:38:57PM +0200, Jeroen Baten wrote: > Hi, > > I have been working on a Linux desktop for the last 20 odd years. > Currently running Ubuntu 20.04. > > Yesterday I enabled the option "Flow control" on my TP-Link TL-SG1024DE. > > Subsequently I was forced to enjoy 3 kernel panics in the timespan of 18 > hours. > > After disabling the "Flow control" option my system seems to be stable > again. > > I do have 3 sets of "cut here" text if somebody would be interested. > > Please let me know if this information is of interest to someone or if I > am barking up the wrong majordomo tree. > > Kind regards and thanks to all here for your immensely valuable work, Since distro kernels contain some extra patches, may miss a significant number of fixes, or even rely on different driver sources, you really need to report this issue to your distro, who will ask for your exact kernel version. It may be that this issue also affects the vanilla kernel in the end, but in this case it will be reported by the distro's maintainers once they've verified that it's not related to their kernel's differences. Hoping this helps, Willy
Re: Verify ACK packets in handshake in kernel module (Access TCP state table)
On Fri, Sep 20, 2019 at 11:43:50PM +, Swarm wrote: > First time emailing to this mailing list so please let me know if I made a > mistake in how I sent it. I'm trying to receive a notification from the > kernel once it verifies an ACK packet in a handshake. Problem is, there is no > API or kernel resource I've seen that supports this feature for both > syncookies and normal handshakes. Where exactly in the kernel does the ACK > get verified? If there isn't a way to be notified of it, where should I start > adding that feature into the kernel? Just searching for TCP_ESTABLISHED immediately brought me to tcp_input.c (tcp_rcv_state_process() to be precise), so I'm not sure you've searched that much. As you've noticed there's nothing specifically called in this case, but in practice a caller of accept() on a listening socket will be woken up. Hoping this helps, Willy
Re: [NETDEV]: getsockopt(fd, SOL_IP, SO_ORIGINAL_DST, sa, &salen) is in fact sometimes returning the source IP instead the destination IP
Hi Roobesh, On Wed, Jan 23, 2019 at 08:07:48AM +, Mohandass, Roobesh wrote: > Hi Willy/Florian/Lukas, > > disabling nf_conntrack_tcp_loose (solved the problem) and for last 24 hours, > we are not evening seeing a single wrong connection data by send-proxy. > > Worth a note in your documentation related to this, as users might be aware > off. > > Thanks very much for your help and support, I will report back if I see > anything again related to this. Thank you for your feedback on this issue, and kudos to Florian for spotting the root cause! Willy
Re: [NETDEV]: getsockopt(fd, SOL_IP, SO_ORIGINAL_DST, sa, &salen) is in fact sometimes returning the source IP instead the destination IP
Hi Lukas, On Sat, Jan 12, 2019 at 07:01:34PM +0100, Lukas Tribus wrote: > > Roobesh, do you use the destination address only for logging or > > anywhere else in the request path ? And could you check if you have > > nf_conntrack_tcp_loose set as Florian suggests ? I really think he > > figured it right. > > It's about what we send with the PROXY protocol to the backend server, > Roobesh reported things like that (src and dst is the same): > > PROXY TCP4 192.220.26.39 192.220.26.39 45066 45066 > PROXY TCP4 192.220.26.39 192.220.26.39 45075 45075 > > So the call would actually happen at the beginning of the TCP connection. That sounds quite shocking to me then. Maybe we're facing a sequence like this:
1) first session comes from this port, and the client closes first (FIN)
2) haproxy then closes as well (FIN)
3) client doesn't respond with the final ACK (e.g. blocked by another firewall in between or the client's own conntrack)
4) the socket on the haproxy side remains in LAST_ACK state and ACKs are periodically resent
5) local conntrack is in TIME_WAIT and expires faster than the largest interval between two such ACKs
6) one of these retransmitted ACKs reopens the connection in the reverse direction due to nf_conntrack_tcp_loose. The connection is then seen in ESTABLISHED state and might be kept a bit longer.
7) the connection finally expires in the local TCP stack but not yet in conntrack.
8) later the client reuses the same source port while the connection is still present in the conntrack table.
9) assuming tcp_be_liberal is also set, the packets can pass through conntrack and establish a new connection to haproxy.
10) haproxy calls getsockopt(SO_ORIGINAL_DST) and gets the other end point since the connection was created at step 6 above in the other direction.
I could be wrong on certain specific points above but it looks plausible.
> Initial report is here: > https://discourse.haproxy.org/t/send-proxy-not-modifying-some-traffic-with-proxy-ip-port-details/3336 Ah cool, I didn't have all this, thank you! > Let's see if disabling nf_conntrack_tcp_loose changes things. Yes this really is the only thing I can think of, and in this case noone is wrong in this chain (neither kernel nor haproxy). We'll need to document it in my opinion. Thanks, Willy
Re: [NETDEV]: getsockopt(fd, SOL_IP, SO_ORIGINAL_DST, sa, &salen) is in fact sometimes returning the source IP instead the destination IP
Hi Florian! Sorry for the lag, I was busy with other issues. On Sat, Jan 12, 2019 at 05:04:00PM +0100, Florian Westphal wrote: > Lukas Tribus wrote: > > The application (haproxy) needs to know the original destination IP > > address, however it does not know whether -j REDIRECT was used or not. > > Because of this the application always queries SO_ORIGINAL_DST, and > > this includes configurations without -j REDIRECT. > > > > Are you saying the behavior of SO_ORIGINAL_DST is undefined when not > > used with -j REDIRECT and that this issue does not happen when -j > > REDIRECT is actually used? > > No, thats not what I said. Because OP provided a link that mentions > TPROXY, I concluded OP was using TPROXY, so I pointed out that the > error source can be completely avoided by not using SO_ORIGINAL_DST. > > As I said, SO_ORIGINAL_DST returns the dst address of > the original direction *as seen by conntrack*. > > In case REDIRECT or DNAT was used, the address returned is the on-wire > one, before DNAT rewrite took place. > > Therefore, SO_ORIGINAL_DST is only needed when REDIRECT or DNAT was > used. If no DNAT rewrite takes place, sockaddr returned by accept or > getsockname can be used directly and SO_ORIGINAL_DST isn't needed. > The returned address should be identical to the one given by accept(). That's also the way we're using it : http://git.haproxy.org/?p=haproxy.git;a=blob;f=src/proto_tcp.c;h=52a4691d2ba93aec93a3e8a0edd1e90d93de968d;hb=HEAD#l600 More precisely if getsockopt() fails we only use getsockname(). It happens that this code is very old (2002) and used to work with kernel 2.4's TPROXY which did rely on conntrack, hence the ifdef including TPROXY. But it's irrelevant here, with a modern TPROXY the getsockopt() here will usually fail, and the result will come from getsockname() only. When seeing the (old) haproxy code there, I've been thinking about another possibility. Right now haproxy does getsockname() then *tries* getsockopt(). 
In the unlikely case getsockopt() would modify part of the address already returned by getsockname(), it could leave an incorrect address there. But such an issue would require partial uses of copy_to_user() which doesn't make much sense, and getorigdst() doesn't do this either. So this one was ruled out. I've been wondering if it was possible that from time to time, this getsockopt() accidentally succeeds and returns something wrong, maybe due to a race with another thread, or maybe because it would return an uninitialized field which happened to see the source address previously. The very low occurrence rate makes me think it could possibly happen only upon certain error conditions. But looking at getorigdst(), I can hardly imagine how this would happen. > If SO_ORIGINAL_DST returns the reply, then conntrack picked up > a reply packet as the first packet of the connection, so it believes > originator is the responder and vice versa. > > One case where this can happen is when nf_conntrack_tcp_loose > (mid-stream pickup) is enabled. Very interesting case indeed, I hadn't thought about it! I think we don't have enough info from the original reporter's setup but it would definitely make sense and explain why it's the other end which is retrieved! I'm seeing one possibility to explain this : Let's say the OP's setup has a short conntrack timeout and a longer haproxy timeout. If the address is only retrieved for logging, it will be retrieved at the end of the connection. Let's assume haproxy receives a request from a client, a conntrack entry is created and haproxy forwards the request to a very slow server. Before the server responds, the conntrack entry expires, then the server responds and haproxy forwards to the client, re-creating the entry and hitting this case before the address is picked up for logging. Roobesh, do you use the destination address only for logging or anywhere else in the request path ? 
And could you check if you have nf_conntrack_tcp_loose set as Florian suggests ? I really think he figured it right. If that's the problem, I think we can address it with documentation : first, warn about the use case, second, explain how to store the address into a variable once the connection is accepted since it will not have expired and will still be valid at this moment. > This is not a haproxy bug. > > Only thing that haproxy could is to provide a knob to make it only > use addresses returned by accept, rather than relying on > SO_ORIGINAL_DST for those that use TPROXY to do MITM interception. Since it's the destination we get it using getsockname(), not accept(), but yeah maybe we'd benefit from having a tunable to disable the use of getsockopt(SO_ORIGINAL_DST). Actually I'd prefer to avoid doing this until the issue is confirmed because I don't feel comfortable with having an option saying "disable the use of getsockopt(SO_ORIGINAL_DST) which may fail once per million due to some obscure reasons". However if it's confirmed that it
Re: MSG_ZEROCOPY doesn't work on half-open TCP sockets
On Wed, Jan 09, 2019 at 08:55:14AM -0500, Willem de Bruijn wrote: > > In other words: because the socket needs to be ESTABLISHED for > > MSG_ZEROCOPY to work, and because remote party can send FIN and move > > the socket to CLOSE_WAIT, a sending party must implement a fallback > > from EINVAL return code on the transmission code. An adversarial > > client who does shutdown(SHUT_WR), will trigger EINVAL in the sender.. > > An adversarial client only affects its own stream, so the impact is limited. Sure, but it doesn't necessarily do it on purpose either :-) The typical

  echo -ne "GET /file HTTP/1.1\r\nHost: foo.example.com\r\n\r\n" | nc host port

is perfectly valid and will not work in this case, possibly forcing the newly deployed component to roll back. > Thanks for the report. At first blush it seems like extending the > check to include state CLOSE_WAIT would resolve the issue
>
>     if (flags & MSG_ZEROCOPY && size && sock_flag(sk, SOCK_ZEROCOPY)) {
> -           if (sk->sk_state != TCP_ESTABLISHED) {
> +           if ((1 << sk->sk_state) & ~(TCPF_ESTABLISHED | TCPF_CLOSE_WAIT)) {
>                     err = -EINVAL;
>                     goto out_err;
>             }
At first glance I think it should do the job. Willy
Re: splice() performance for TCP socket forwarding
On Thu, Dec 13, 2018 at 02:17:20PM +0100, Marek Majkowski wrote: > > splice code will be expensive if less than 1MB is present in receive queue. > > I'm not sure what you are suggesting. I'm just shuffling data between > two sockets. Is there a better buffer size value? Is it possible to > keep splice() blocked until it succeeds to forward N bytes of data? (I > tried this unsuccessfully with SO_RCVLOWAT). I've personally observed a performance decrease once the pipe is configured larger than 512 kB. I think it's related to the fact that you're moving 256 pages around on each call, and that it might even start to have some effect on L1 caches when touching lots of data, though that could be completely unrelated. > Here is a snippet from strace:
>
> splice(4, NULL, 11, NULL, 1048576, 0) = 373760 <0.48>
> splice(10, NULL, 5, NULL, 373760, 0) = 373760 <0.000108>
> splice(4, NULL, 11, NULL, 1048576, 0) = 335800 <0.65>
> splice(10, NULL, 5, NULL, 335800, 0) = 335800 <0.000202>
> splice(4, NULL, 11, NULL, 1048576, 0) = 227760 <0.29>
> splice(10, NULL, 5, NULL, 227760, 0) = 227760 <0.000106>
> splice(4, NULL, 11, NULL, 1048576, 0) = 16060 <0.19>
> splice(10, NULL, 5, NULL, 16060, 0) = 16060 <0.28>
> splice(4, NULL, 11, NULL, 1048576, 0) = 7300 <0.13>
> splice(10, NULL, 5, NULL, 7300, 0) = 7300 <0.21>
I think your driver is returning one segment per page. Let's do some rough maths : assuming you have an MSS of 1448 (timestamps enabled), you'll retrieve 256*1448 = 370688 bytes at once per call, which closely matches what you're seeing. Hmmm, checking closer, you're in fact running at exactly 1460 (256*1460 = 373760), so you have timestamps disabled. Your numbers seem normal to me (just the CPU usage doesn't, but maybe it improves when using a smaller pipe). Willy
Re: splice() performance for TCP socket forwarding
Hi Eric! On Thu, Dec 13, 2018 at 05:37:11AM -0800, Eric Dumazet wrote: > Maybe mlx5 driver is in LRO mode, packing TCP payload in 4K pages ? I could be wrong but I don't think so : I remember LRO on myri10ge a decade ago giving me good performance which would degrade with concurrent connections, till the point LRO got deprecated when GRO started to work quite well. Thus this got me used to always disabling LRO to be sure to measure something durable ;-) > bnx2x GRO/LRO has this mode, meaning that around 8 pages are used for a GRO > packets of ~32 KB, > while mlx4 for instance would use one page frag for every ~1428 bytes of > payload. I remember that it was the same on myri10ge (1 segment per page), making splice() return roughly 21 or 22kB per call for a 64kB pipe. BTW, I think I said bullshit earlier: 3 years ago it was mlx4 and not mlx5 that I was using. Willy
Re: splice() performance for TCP socket forwarding
Hi Marek, On Thu, Dec 13, 2018 at 12:25:20PM +0100, Marek Majkowski wrote: > Hi! > > I'm basically trying to do TCP splicing in Linux. I'm focusing on > performance of the simplest case: receive data from one TCP socket, > write data to another TCP socket. I get poor performance with splice. > > First, the naive code, pretty much: > > while(1){ > n = read(rs, buf); > write(ws, buf, n); > } > > With GRO enabled, this code does roughly line-rate of 10Gbps, hovering > ~50% of CPU in application (sys mostly). > > When replaced with splice version: > > pipe(pfd); > fcntl(pfd[0], F_SETPIPE_SZ, 1024 * 1024); > while(1) { > n = splice(rd, NULL, pfd[1], NULL, 1024*1024, >SPLICE_F_MOVE); > splice(pfd[0], NULL, wd, NULL, n, SPLICE_F_MOVE); > } > > Full code: > https://gist.github.com/majek/c58a97b9be7d9217fe3ebd6c1328faaa#file-proxy-splice-c-L59 > > I get 100% cpu (sys) and dramatically worse performance (1.5x slower). It's quite strange, it doesn't match at all what I'm used to. In haproxy we're using splicing as well between sockets, and for medium to large objects we always get much better performance with splicing than without. 3 years ago during a test, we reached 60 Gbps on a 4-core machine using 2 40G NICs, which is not an exceptional sizing. And between processes on the loopback, numbers around 100G are totally possible. By the way this is one test you should start with, to verify if the issue is more on the splice side or on the NIC's side. It might be that your network driver is totally inefficient when used with GRO/GSO. In my case, multi-10G using ixgbe and 40G using mlx5 have always shown excellent results. Regards, Willy
Re: [net-next, RFC, 4/8] net: core: add recycle capabilities on skbs via page_pool API
On Sat, Dec 08, 2018 at 10:14:47PM +0200, Ilias Apalodimas wrote: > On Sat, Dec 08, 2018 at 09:11:53PM +0100, Jesper Dangaard Brouer wrote: > > > > > > I want to make sure you guys thought about splice() stuff, and > > > skb_try_coalesce(), and GRO, and skb cloning, and ... > > > > Thanks for the pointers. To Ilias, we need to double check > > skb_try_coalesce() > > code path, as it does look like we don't handle this correctly. > > > > Noted. We did check skb cloning, but not this one indeed I'll try to run some tests on my clearfog later, but for now I'm still 100% busy until haproxy 1.9 gets released. Cheers, Willy
Re: bring back IPX and NCPFS, please!
On Fri, Nov 09, 2018 at 06:30:14PM +0100, Johannes C. Schulz wrote: > Hello Willy, hello Stephen > > Thankyou for your reply. > But I'm not able to maintain or code these modules. I'm just a bloody > user/webdev. That's what we've all claimed before taking over something many years ago, you know :-) The most important thing is time and willingness to try to do it. You could first look at the latest kernel supporting those, check whether they still work fine in your environment (not everyone has access to such environments anymore), and if so, then try to copy that code over to newer kernels. Sometimes it will not build, with an obvious error that you'll be able to fix by yourself; sometimes it will be harder and you'll have to ask for help and/or figure out API changes from "git log". After working many hours on this you'll be much more at ease with this code and you'll possibly be able to make it work on your kernel version. This is already a huge step, because even if you don't consider it to be in a mergeable state (too hackish, dirty etc), you have the option to run it as your own patch for a while. After this you'll seek some more help about the process needed to get these merged back and to maintain them as long as you estimate you can (possibly mark them deprecated and keep them as long as you can). And who knows, given that nothing changes in this area these days, maybe it will be trivial to maintain this FS for another decade and you'll have learned something fun and useful. > It would be really nice if these modules will find a good > maintainer! Just think again about the advantages you have over many other people :
- access to the environment
- a real use case for the feature
There's nothing wrong with trying and failing multiple times, even giving up if you find the task too hard. But giving up before trying is quite sad in your situation. Cheers, Willy
Re: bring back IPX and NCPFS, please!
On Fri, Nov 09, 2018 at 02:23:27PM +0100, Johannes C. Schulz wrote: > Hello all! > > I like to please you to bring back IPX and NCPFS modules to the kernel. > Whyever my admins using Novell-shares on our network which I'm not be > able to use anymore - I'm forced to use cifs instead (and the admins > will kill the cifs-shares in some time), because my kernel (4.18) does > not have support for ncpfs anymore. > Maybe we at my work are not enough people that just for us this > modules will come back, but maybe out there are other people. > Thank you. Well, like any code, it requires time and skills. If nobody with the required skills is available for this anymore, there's no way you'll get a feature back. However you could always step up to maintain it yourself if you have the time and are willing to develop your own skills at it. It's how maintainers change over time for certain parts of the system, so you have an opportunity here. Just my two cents, Willy
Re: Kernel Panic on high bandwidth transfer over wifi
On Wed, Aug 29, 2018 at 11:42:44AM +, Nathaniel Munk wrote: > As you can see from the attached log You apparently forgot to attach the log. Willy
Re: how to (cross)connect two (physical) eth ports for ping test?
On Sat, Aug 18, 2018 at 09:10:25PM +0200, Andrew Lunn wrote: > On Sat, Aug 18, 2018 at 01:39:50PM -0400, Robert P. J. Day wrote: > > > > (i'm sure this has been explained many times before, so a link > > covering this will almost certainly do just fine.) > > > > i want to loop one physical ethernet port into another, and just > > ping the daylights from one to the other for stress testing. my fedora > > laptop doesn't actually have two unused ethernet ports, so i just want > > to emulate this by slapping a couple startech USB/net adapters into > > two empty USB ports, setting this up, then doing it all over again > > monday morning on the actual target system, which does have multiple > > ethernet ports. > > > > so if someone can point me to the recipe, that would be great and > > you can stop reading. > > > > as far as my tentative solution goes, i assume i need to put at > > least one of the physical ports in a network namespace via "ip netns", > > then ping from the netns to the root namespace. or, going one step > > further, perhaps putting both interfaces into two new namespaces, and > > setting up forwarding. > > Namespaces is a good solution. Something like this should work: > > ip netns add namespace1 > ip netns add namespace2 > > ip link set eth1 netns namespace1 > ip link set eth2 netns namespace2 > > ip netns exec namespace1 \ > ip addr add 10.42.42.42/24 dev eth1 > > ip netns exec namespace1 \ > ip link set eth1 up > > ip netns exec namespace2 \ > ip addr add 10.42.42.24/24 dev eth2 > > ip netns exec namespace2 \ > ip link set eth2 up > > ip netns exec namespace1 \ > ping 10.42.42.24 > > You might also want to consider iperf3 for stress testing, depending > on the sort of stress you need. FWIW I have a setup somewhere involving ip rule + ip route which achieves the same without involving namespaces. It's a bit hackish but sometimes convenient. I can dig if someone is interested. Regards, Willy
Re: ANNOUNCE: Enhanced IP v1.4
On Tue, Jun 05, 2018 at 02:33:03PM +0200, Bjørn Mork wrote: > > I do have IPv6 at home (a /48, waste of addressing space, I'd be fine > > with less), > > Any reason you would want less? Any reason the ISP should give you > less? What I mean is that *if* the availability of /48 networks was an issue for some ISPs, I'd be fine with less because I don't plan to deploy 64k networks at home, though I already have ~9 around the firewall and don't expect to go much further. > > Maybe setting up a public list of ISPs where users don't have at least > > a /60 by default could help, but I suspect that most of them will > > consider that as long as their competitors are on the list there's no > > emergency. > > Exactly. And the number of users using the list as the primary > parameter for selecting an ISP would be close to 0. The critical part > is not the list, but making large enough groups of users consider IPv6 > an important parameter when selecting ISPs. In fact the IoT trend could play a role here by letting users know that they can remotely access their fridge and whatever stupid device they've deployed. But the reality is the opposite : some gateway services are or will be offered for a fee to make these devices remotely accessible, and the claimed security provided by this gateway will be presented as a real benefit compared to the risks of anyone directly accessing your private life over IPv6. So I'm not holding out much hope for the future in this area either. Willy
Re: ANNOUNCE: Enhanced IP v1.4
On Sun, Jun 03, 2018 at 03:41:08PM -0700, Eric Dumazet wrote: > > > On 06/03/2018 01:37 PM, Tom Herbert wrote: > > > This is not an inconsequential mechanism that is being proposed. It's > > a modification to IP protocol that is intended to work on the > > Internet, but it looks like the draft hasn't been updated for two > > years and it is not adopted by any IETF working group. I don't see how > > this can go anywhere without IETF support. Also, I suggest that you > > look at the IPv10 proposal since that was very similar in intent. One > > of the reasons that IPv10 shot down was because protocol transition > > mechanisms were more interesting ten years ago than today. IPv6 has > > good traction now. In fact, it's probably the case that it's now > > easier to bring up IPv6 than to try to make IPv4 options work over the > > Internet. > > +1 > > Many hosts do not use IPv4 anymore. > > We even have the project making IPv4 support in linux optional. I agree on these points, but I'd like to figure out what can be done to put a bit more pressure on ISPs to *always* provide IPv6. It's still very hard to have decent connectivity at home, and without this it will continue to be marginalized. I do have IPv6 at home (a /48, waste of addressing space, I'd be fine with less), there's none at work (I don't even know if the ISP supports it, at least it was never ever mentioned so probably they don't know about this), and some ISPs only provide a /64 which is as ridiculous as providing a single address as it forces the end user to NAT thus breaking the end-to-end principle. Ideally with IoT at the door, every home connection should have at least a /60 and enterprises should have a /56, and this by default, without having to request anything. Maybe setting up a public list of ISPs where users don't have at least a /60 by default could help, but I suspect that most of them will consider that as long as their competitors are on the list there's no emergency. Willy
Re: ANNOUNCE: Enhanced IP v1.4
On Sat, Jun 02, 2018 at 12:17:12PM -0400, Sam Patton wrote: > As far as application examples, check out this simple netcat-like > program I use for testing: > > https://github.com/EnIP/enhancedip/blob/master/userspace/netcat/netcat.c > > Lines 61-67 show how to connect directly via an EnIP address. The > netcat-like application uses > > a header file called eip.h. OK so basically we need to use a new address structure (which makes sense), however I'm not sure I understand how the system is supposed to know it has to use EnIP instead of IPv4 if the sin_family remains the same :

    ...
    memset(&serv_addr, 0, sizeof(struct sockaddr_ein));
    serv_addr.sin_family = AF_INET;
    serv_addr.sin_port = htons(portno);
    serv_addr.sin_addr1 = inet_addr(argv[1]);
    serv_addr.sin_addr2 = inet_addr(argv[2]);
    if (connect(sockfd, (struct sockaddr *)&serv_addr, sizeof(serv_addr)) < 0)
        error("ERROR connecting");
    ...

Does it solely rely on the 3rd argument to connect() to know that it may read sin_addr2 ? If so, that sounds a bit fragile to me because I *think* (but could be wrong) that right now any size at least as large as sockaddr_in works for IPv4 (typically I think if you pass a struct sockaddr_storage with its size it should work). So there would be a risk of breaking applications if this is the case. I suspect that using a different family would make things more robust and more friendly to applications. But that's just a first opinion, you have worked on this longer than me. > You can look at it here: > > https://github.com/EnIP/enhancedip/blob/master/userspace/include/eip.h Thank you. There's a problem with your #pragma pack there: it will propagate to all structs declared after this file is included and will force all subsequent structs to be packed. It would be preferable to use __attribute__((packed)) on the struct or to use pragma pack() safely (I've just seen it supports push/pop options). > EnIP makes use of IPv6 records for DNS lookup. 
We simply put > 2001:0101 (which is an IPv6 experimental prefix) and > > then we put the 64-bit EnIP address into the next 8 bytes of the > address. The remaining bytes are set to zero. Does this require a significant amount of changes to the libc ? > In the kernel, if you want to see how we convert the IPv6 DNS lookup > into something connect() can manage, > > check out the add_enhanced_ip() routine found here: > > https://github.com/EnIP/enhancedip/blob/master/kernel/4.9.28/socket.c Hmmm, so both IPv4 and IPv6 addresses are supported and used together. That makes me think that if in the end you need a little bit of IPv6 to handle this, why not instead define a new way to map IPv6 to Enhanced IPv4 ? I mean, you'd exclusively use AF_INET6 in applications, within a dedicated address range, allowing you to naturally use all of the IPv6 ecosystem in place in userland, but use IPv4 over the wire. This way applications would possibly require even fewer modifications, if any at all, because they'd accept/connect/send/receive data for IPv6 address families that they all deal with nowadays, but this would rely on an omnipresent IPv4 infrastructure. > The reason we had to do changes for openssh and not other applications > (that use DNS) is openssh has a check to > > see if the socket is using IP options. If the socket does, sshd drops > the connection. I had to work around that to get openssh working OK. 
It would be really > > great if the kernel community decided to pick this up, but if it's not a > reality please let me know soonest so I can move on to a > > different hobby. :) I *personally* think there is some value in what you're experimenting with, and doing it as a hobby can leave you with the time needed to collect pros and cons from various people. I'm just thinking that in order to get some adoption, you need to be extremely careful not to break any of the IPv4 applications developed in the last 38 years, and by this, AF_INET+sizeof() scares me a little bit. Regards, Willy
Re: ANNOUNCE: Enhanced IP v1.4
Hello Sam, On Fri, Jun 01, 2018 at 09:48:28PM -0400, Sam Patton wrote: > Hello! > > If you do not know what Enhanced IP is, read this post on netdev first: > > https://www.spinics.net/lists/netdev/msg327242.html > > > The Enhanced IP project presents: > > Enhanced IP v1.4 > > The Enhanced IP (EnIP) code has been updated. It now builds with OpenWRT > barrier breaker (for 148 different devices). We've been testing with the > Western Digital N600 and N750 wireless home routers. (...) First note, please think about breaking your lines if you want your mails to be read by the widest audience: for some of us here, reading lines wider than a terminal is really annoying, and such mails are often not considered worth spending time on when there are so many easier ones left to read. > Interested in seeing Enhanced IP in the Linux kernel, read on. Not > interested in seeing Enhanced IP in the Linux kernel read on. (...) So I personally find the concept quite interesting. It reminds me of the previous IPv5/IPv7/IPv8 initiatives, which in my opinion were a bit hopeless. Here the fact that you decide to consider the IPv4 address as a network opens new perspectives. For containerized environments it could be considered that each server, with one IPv4 address, can host 2^32 guests and that NAT is not needed anymore, for example. It could also open the possibility that enthusiasts can more easily host some services at home behind their ADSL line without having to run on strange ports. However I think your approach is not the most efficient one to encourage adoption. It's important to understand that there will be little incentive for people to patch their kernels to run some code if they don't have the applications on top of it. The kernel is not the end goal for most users; the kernel is just the lower layer needed to run applications on top. I looked at your site and the github repo, and all I could find was a pre-patched openssh, with no simple explanation of what to change in an application. 
What you need to do first is to *explain* how to modify userland applications to support En-IP: provide an echo server and show the parts which have to be changed. Write a simple client and do the same. Provide your changes to existing programs as patches, not as pre-patched code. This way anyone can use your patches on top of other versions, and can use these patches to understand what has to be modified in their applications. Once applications are easy to patch, the incentive to install patched kernels everywhere will be higher. For many enthusiasts, knowing that they only have to modify the ADSL router to automatically make their internal IoT stuff accessible from outside indeed becomes appealing. Then you'll need to provide patches for well known applications like curl, wget, DNS servers (bind...), then browsers. In my case I could be interested in adding support for En-IP into haproxy, and only once I don't see any showstopper in doing this would I be willing to patch my kernel to support it. Last advice: provide links to your drafts in future e-mails; they are not easy to find on your site, as we have to navigate through various pages to finally find them. Regards, Willy
Re: Request for -stable inclusion: time stamping fix for nfp
Adding Greg here. Greg, apparently a backport of 46f1c52e66db is needed in 4.9 according to the thread below. It was merged in 4.13 so 4.14 already has it. Willy On Thu, May 17, 2018 at 02:09:03PM -0400, David Miller wrote: > From: Guillaume Nault > Date: Thu, 17 May 2018 19:41:47 +0200 > > > On Thu, Nov 16, 2017 at 10:13:28AM +0900, David Miller wrote: > >> From: Guillaume Nault > >> Date: Wed, 15 Nov 2017 17:20:46 +0100 > >> > >> > Can you please queue commit 46f1c52e66db > >> > ("nfp: TX time stamp packets before HW doorbell is rung") for -stable? > >> > We got hit but this bug in the late summer. We run this fix internally > >> > since a couple of months, but that'd be better to have it officially > >> > backported so everyone can benefit of it. > >> > >> Queued up. > > > > I guess this one got lost somewhere as it doesn't appear in linux-4.9.y > > (other trees aren't relevant). > > If that's unintentional, than can you please re-queue > > 46f1c52e66db ("nfp: TX time stamp packets before HW doorbell is rung") > > to -stable? > > I only submit patches to -stable for the two most recent active branches > which right now consists of 4.16 and 4.14 as per www.kernel.org
Re: [PATCH linux-stable-4.14] tcp: reset sk_send_head in tcp_write_queue_purge
Hi David, regarding the patch below, I'm not certain whether you planned to take it since it's marked "not applicable" on patchwork, but I suspect it's only because it doesn't apply to mainline. However, please note that there are two typos in commit IDs referenced in the commit message that might be nice to fix before merging : > > For example, after 27fid7a8ed38 (tcp: purge write queue upon RST), - here it's "a27fd7a8ed38" (missing 'a' and extra 'i' in the middle) - and lower in the fixes tag there's the extra 'i' as well : > > Fixes: a27fid7a8ed38 (tcp: purge write queue upon RST) There was another report today on the stable list for a similar crash on 4.14.28 and I suspect it's the one I saw this week-end on my firewall after upgrading from 4.14.10 to 4.14.27 though I didn't have the symbols to confirm. Thanks, Willy
Re: [PATCH stable 4.9 1/8] x86: bpf_jit: small optimization in emit_bpf_tail_call()
Hi Eric, On Mon, Jan 29, 2018 at 06:04:30AM -0800, Eric Dumazet wrote: > > If these 4 bytes matter, why not use > > cmpq with an immediate value instead, which saves 2 extra bytes ? : > > > > - the mov above is 11 bytes total : > > > >0: 48 8b 84 d6 78 56 34  mov    0x12345678(%rsi,%rdx,8),%rax > >7: 12 > >8: 48 85 c0              test   %rax,%rax > > > > - the equivalent cmp is only 9 bytes : > > > >0: 48 83 bc d6 78 56 34  cmpq   $0x0,0x12345678(%rsi,%rdx,8) > >7: 12 00 > > > > And as a bonus, it doesn't even clobber rax. > > > > Just my two cents, > > > Hi Willy > > Please look more closely at following instructions. > > We need the value later, not only testing it being zero :) Ah OK that makes total sense then ;-) Thanks, willy
Re: [PATCH stable 4.9 1/8] x86: bpf_jit: small optimization in emit_bpf_tail_call()
Hi, [ replaced stable@ and greg@ by netdev@ as my question below is not relevant to stable ] On Mon, Jan 29, 2018 at 02:48:54AM +0100, Daniel Borkmann wrote: > From: Eric Dumazet > > [ upstream commit 84ccac6e7854ebbfb56d2fc6d5bef9be49bb304c ] > > Saves 4 bytes replacing following instructions : > > lea rax, [rsi + rdx * 8 + offsetof(...)] > mov rax, qword ptr [rax] > cmp rax, 0 > > by : > > mov rax, [rsi + rdx * 8 + offsetof(...)] > test rax, rax I've just noticed this on stable@. If these 4 bytes matter, why not use cmpq with an immediate value instead, which saves 2 extra bytes ? : - the mov above is 11 bytes total : 0: 48 8b 84 d6 78 56 34  mov    0x12345678(%rsi,%rdx,8),%rax 7: 12 8: 48 85 c0              test   %rax,%rax - the equivalent cmp is only 9 bytes : 0: 48 83 bc d6 78 56 34  cmpq   $0x0,0x12345678(%rsi,%rdx,8) 7: 12 00 And as a bonus, it doesn't even clobber rax. Just my two cents, Willy
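For the curious, the byte counting in the mail above can be checked directly. The opcode bytes below are taken verbatim from the two listings; this only sanity-checks the arithmetic (11 vs 9 bytes), it does not re-assemble anything:

```python
# Opcode bytes copied from the disassembly in the mail above.
mov_load = bytes.fromhex("488b84d678563412")    # mov  0x12345678(%rsi,%rdx,8),%rax
test_rax = bytes.fromhex("4885c0")              # test %rax,%rax
cmpq_mem = bytes.fromhex("4883bcd67856341200")  # cmpq $0x0,0x12345678(%rsi,%rdx,8)

# The mov+test pair costs 8 + 3 = 11 bytes; the single cmpq costs 9.
mov_total = len(mov_load) + len(test_rax)
print(mov_total, len(cmpq_mem))  # 11 9
```

The 2-byte difference comes from the cmpq encoding carrying a 1-byte immediate (00) instead of a second 3-byte instruction, at the price of not loading the value into a register -- which, as Eric points out in the follow-up, is exactly why it can't be used here.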
Re: TCP many-connection regression between 4.7 and 4.13 kernels.
Hi Eric, On Mon, Jan 22, 2018 at 10:16:06AM -0800, Eric Dumazet wrote: > On Mon, 2018-01-22 at 09:28 -0800, Ben Greear wrote: > > My test case is to have 6 processes each create 5000 TCP IPv4 connections > > to each other > > on a system with 16GB RAM and send slow-speed data. This works fine on a > > 4.7 kernel, but > > will not work at all on a 4.13. The 4.13 first complains about running out > > of tcp memory, > > but even after forcing those values higher, the max connections we can get > > is around 15k. > > > > Both kernels have my out-of-tree patches applied, so it is possible it is > > my fault > > at this point. > > > > Any suggestions as to what this might be caused by, or if it is fixed in > > more recent kernels? > > > > I will start bisecting in the meantime... > > > > Hi Ben > > Unfortunately I have no idea. > > Are you using loopback flows, or have I misunderstood you ? > > How loopback connections can be slow-speed ? A few quick points : I have not noticed this on 4.9, which we use with pretty satisfying performance (typically around 100k conn/s). However during some recent tests I did around the meltdown fixes on 4.15, I noticed a high connect() or bind() cost to find a local port when running on the loopback, that I didn't have the time to compare to older kernels. However, strace clearly showed that bind() (or connect() if bind was not used) could take as much as 2-3 ms as source ports were filling up. To be clear, it was just a quick observation and anything could be wrong there, including my tests. I'm just saying this in case it matches anything Ben has observed. I can try to get more info if that helps, but it could be a different case. Cheers, Willy
Re: [PATCH 06/18] x86, barrier: stop speculation for failed access_ok
On Sun, Jan 07, 2018 at 12:17:11PM -0800, Linus Torvalds wrote: > We need to fix the security problem, but we need to do it *without* > these braindead arguments that performance is somehow secondary. OK OK. At least we should have security by default and let people trade it against performance if they want and understand the risk. People never know when they're secure enough (security can't be measured) but they know fairly well when they're not performant enough to take action (most often by changing the machine). Willy
Re: [PATCH 06/18] x86, barrier: stop speculation for failed access_ok
On Sun, Jan 07, 2018 at 11:47:07AM -0800, Linus Torvalds wrote: > And the whole "normal people won't even notice" is pure garbage too. > Don't spread that bullshit when you see actual normal people > complaining. > > Performance matters. A *LOT*. Linus, no need to explain that to me, I'm precisely trying to see how to disable PTI for a specific process because I face up to 45% loss in certain circumstances, making it a no-go. But while a few of us have very specific workloads emphasizing this impact, others have very different ones and will not notice. For example my laptop did boot pretty fine and I didn't notice anything until I fired a network benchmark. Willy
Re: [PATCH 06/18] x86, barrier: stop speculation for failed access_ok
On Sat, Jan 06, 2018 at 07:38:14PM -0800, Alexei Starovoitov wrote: > yep. plenty of unknowns and what's happening now is an overreaction. To be fair there's overreaction on both sides. The vast majority of users need to get a 100% safe system and will never notice any difference. A few of us are more concerned by the risk of performance loss brought by these fixes. We do all need to run tests on the patchsets to bring numbers on the table. > What is the rush to push half baked patches into upstream > that don't even address the variant 1 ? You probably need to trust the CPU vendors a bit for having had all the details about the risks for a few months now and accept that if they're destroying their product's performance compared to their main competitor, they probably studied a number of other alternatives first. It doesn't mean they thought about everything of course, and maybe they need to study your proposal as a better solution to reduce criticism. > which clearly states that bpf_tail_call() was used in the attack. > Yet none of the intel nor arm patches address speculation in > this bpf helper! > It means that: > - gpz didn't share neither exploit nor the detailed description > of the POC with cpu vendors until now Or it was considered less urgent to fix compared to the rest. It's also possible that the scariest details were not written in the GPZ article. > Now the attack is well described, yet cpu vendors still pushing > for lfence patches that don't make sense. Why? Imagine if you were in their position and were pushing a solution which would later be found to be inefficient and to be vulnerable again. Your name would appear twice in the press in a few months, this would be terrible. It makes sense in their position to find the safest fix first given that those like you or me really concerned about the performance impact know how to add an option to a boot loader or revert a patch that causes trouble. 
> What I think is important is to understand vulnerability first. > I don't think it was done. I suspect it was clearly done by those who had no other option but working on this painful series over the last few months :-/ > > The differences involved on the "lfence" versus "and" versus before are > > not likely to be anywhere in that order of magnitude. > > we clearly disagree here. Both intel and arm patches proposed > to add lfence in bpf_map_lookup() which is the hottest function > we have and we do run it at 40+Gbps speeds where every nanosecond > counts, so no, lfence is not a solution. Please help here by testing the patch series and reporting numbers before, with the fix, and with your proposal. That's the best way to catch their attention and to get your proposal considered as a viable alternative (or as a partial one for certain environments). I did the same when I believed it could be sufficient to add noise to RDTSC and found it totally inefficient after testing. But it's better for everyone that research and testing is done rather than criticizing the proposed fixes (which is not fair to the people who work hard on them for everyone else). Cheers, Willy
Re: [PATCH 06/18] x86, barrier: stop speculation for failed access_ok
On Sat, Jan 06, 2018 at 06:38:59PM +, Alan Cox wrote: > Normally people who propose security fixes don't have to argue about the > fact they added 30 clocks to avoid your box being 0wned. In fact it depends, because if a fix makes the system unusable for its initial purpose, this fix will simply not be deployed at all, which is the worst that can happen. Especially when it cannot be disabled by config and people stop updating their systems to stay on the last "known good" version. Fortunately in Linux we often have the choice so that users rarely have a valid reason for not upgrading! Willy
Re: BUG warnings in 4.14.9
Guys, Chris reported the bug below and confirmed that reverting commit 9704f81 (ipv6: grab rt->rt6i_ref before allocating pcpu rt) seems to have fixed the issue for him. This patch is a94b9367 in mainline. I personally have no opinion on the patch, just found it because it was the only one touching this area between 4.14.8 and 4.14.9 :-) Should this be reverted or maybe fixed differently ? thanks, Willy On Tue, Dec 26, 2017 at 01:49:59AM +, Chris Rankin wrote: > Hi, > > I've just raised https://bugzilla.kernel.org/show_bug.cgi?id=198271 > because the new 4.14.9 kernel is generating lots of BUG warnings, e.g. > > [ 35.069924] BUG: using smp_processor_id() in preemptible [] > code: avahi-daemon/761 > [ 35.078187] caller is ip6_pol_route+0x46b/0x509 [ipv6] > [ 35.083340] CPU: 5 PID: 761 Comm: avahi-daemon Tainted: G > I 4.14.9 #1 > [ 35.091176] Hardware name: Gigabyte Technology Co., Ltd. > EX58-UD3R/EX58-UD3R, BIOS FB 05/04/2009 > [ 35.100181] Call Trace: > [ 35.102709] dump_stack+0x46/0x59 > [ 35.106095] check_preemption_disabled+0xca/0xda > [ 35.110786] ip6_pol_route+0x46b/0x509 [ipv6] > [ 35.115224] fib6_rule_lookup+0x15/0x4b [ipv6] > [ 35.119745] ip6_dst_lookup_tail+0x11d/0x18f [ipv6] > [ 35.124702] ip6_dst_lookup_flow+0x30/0x6f [ipv6] > [ 35.129481] udpv6_sendmsg+0x5c5/0xa10 [ipv6] > [ 35.133914] ? ip_reply_glue_bits+0x48/0x48 > [ 35.138187] ? rw_copy_check_uvector+0x6d/0xf9 > [ 35.142701] ? sock_sendmsg+0x28/0x34 > [ 35.146393] sock_sendmsg+0x28/0x34 > [ 35.149912] ___sys_sendmsg+0x1b4/0x246 > [ 35.153821] ? poll_select_copy_remaining+0x104/0x104 > [ 35.158995] ? current_time+0x11/0x52 > [ 35.162738] ? pipe_write+0x353/0x365 > [ 35.166483] ? 
__sys_sendmsg+0x3c/0x5d > [ 35.170304] __sys_sendmsg+0x3c/0x5d > [ 35.173909] do_syscall_64+0x4a/0xe1 > [ 35.177482] entry_SYSCALL64_slow_path+0x25/0x25 > [ 35.177484] RIP: 0033:0x7f963d7097b7 > [ 35.177485] RSP: 002b:77ffbd18 EFLAGS: 0246 ORIG_RAX: > 002e > [ 35.177486] RAX: ffda RBX: 000d RCX: > 7f963d7097b7 > [ 35.177487] RDX: RSI: 77ffbde0 RDI: > 000d > [ 35.177487] RBP: R08: 0001 R09: > 77ffbd42 > [ 35.177488] R10: 0002 R11: 0246 R12: > 77ffbde0 > [ 35.177489] R13: 55e4e1af29cc R14: 000d R15: > 0002 > [ 35.177565] IN=eth0 OUT= MAC= > SRC=fe80::::0224:1dff:fecd:0f3a > DST=ff02:::::::00fb LEN=245 TC=0 HOPLIMIT=255 > FLOWLBL=147873 PROTO=UDP SPT=5353 DPT=5353 LEN=205 > > Cheers, > Chris
Re: [PATCH net 0/3] Few mvneta fixes
Hi Arnd, On Tue, Dec 19, 2017 at 09:18:35PM +0100, Arnd Bergmann wrote: > On Tue, Dec 19, 2017 at 5:59 PM, Gregory CLEMENT > wrote: > > Hello, > > > > here it is a small series of fixes found on the mvneta driver. They > > had been already used in the vendor kernel and are now ported to > > mainline. > > Does one of the patches look like it addresses the rare Oops we discussed on > #kernelci this morning? > > https://storage.kernelci.org/stable/linux-4.9.y/v4.9.70/arm/mvebu_v7_defconfig/lab-free-electrons/boot-armada-375-db.html I could be wrong but for me the 375 uses mvpp2, not mvneta, so this should have no effect there. Willy
Re: [PATCH] net: bridge: add max_fdb_count
Hi Andrew, On Fri, Nov 17, 2017 at 03:06:23PM +0100, Andrew Lunn wrote: > > Usually it's better to apply LRU or random here in my opinion, as the > > new entry is much more likely to be needed than older ones by definition. > > Hi Willy > > I think this depends on why you need to discard. If it is normal > operation and the limits are simply too low, i would agree. > > If however it is a DoS, throwing away the new entries makes sense, > leaving the old ones which are more likely to be useful. > > Most of the talk in this thread has been about limits for DoS > prevention... Sure but my point is that it can kick in on regular traffic and in this case it can be catastrophic. That's the only thing that bothers me. If we have an unlimited default value with this algorithm I'm fine, because nobody will get caught by accident with a bridge suddenly replicating high traffic on all ports because an unknown limit was reached. That's the principle of least surprise. I know that when fighting DoSes there are never any universally good solutions and one has to make tradeoffs. I'm perfectly fine with this. Cheers, Willy
Re: [PATCH] net: bridge: add max_fdb_count
Hi Stephen, On Thu, Nov 16, 2017 at 04:27:18PM -0800, Stephen Hemminger wrote: > On Thu, 16 Nov 2017 21:21:55 +0100 > Vincent Bernat wrote: > > > ? 16 novembre 2017 20:23 +0100, Andrew Lunn : > > > > > struct net_bridge_fdb_entry is 40 bytes. > > > > > > My WiFi access point which is also a 5 port bridge, currently has 97MB > > > free RAM. That is space for about 2.5M FDB entries. So even Roopa's > > > 128K is not really a problem, in terms of memory. > > > > I am also interested in Sarah's patch because we can now have bridge > > with many ports through VXLAN. The FDB can be replicated to an external > > daemon with BGP and the cost of each additional MAC address is therefore > > higher than just a few bytes. It seems simpler to implement a limiting > > policy early (at the port or bridge level). > > > > Also, this is a pretty standard limit to have for a bridge (switchport > > port-security maximum on Cisco, set interface X mac-limit on > > Juniper). And it's not something easy to do with ebtables. > > I want an optional limit per port, it makes a lot of sense. > If for no other reason that huge hash tables are a performance problems. Except it's not a limit in that it doesn't prevent new traffic from going in, it only prevents new MACs from being learned, so suddenly you start flooding all ports with new traffic once the limit is reached, which is neither trivial to detect nor to diagnose. > There is a bigger question about which fdb to evict but just dropping the > new one seems to be easiest and as good as any other solution. Usually it's better to apply LRU or random here in my opinion, as the new entry is much more likely to be needed than older ones by definition. In terms of CPU usage it may even be better to kill an entire series in the hash table (eg: all nodes in the same table entry for example), as the operation can be almost as cheap and result in not being needed for a while again. Willy
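The eviction policies being debated here can be sketched with a toy learning table. This is a pure simulation, nothing bridge-specific: the limit and the traffic pattern are made up for illustration, and it only shows the qualitative difference between "stop learning when full" and the two eviction schemes:

```python
import random
from collections import OrderedDict

LIMIT = 4  # arbitrary toy FDB size

def learn_drop_new(table, mac):
    # "Drop the new one": once full, new MACs are simply not learned,
    # so their traffic gets flooded to all ports.
    if mac not in table and len(table) >= LIMIT:
        return
    table[mac] = True

def learn_random_evict(table, mac):
    # Random eviction: kick out an arbitrary victim to admit the newcomer.
    if mac not in table and len(table) >= LIMIT:
        del table[random.choice(list(table))]
    table[mac] = True

def learn_lru(table, mac):
    # LRU: refresh an entry on hit, evict the least recently seen when full.
    if mac in table:
        table.move_to_end(mac)
    elif len(table) >= LIMIT:
        table.popitem(last=False)
    table[mac] = True

drop_new, rnd, lru = {}, {}, OrderedDict()
for mac in ["m%d" % i for i in range(10)]:
    learn_drop_new(drop_new, mac)
    learn_random_evict(rnd, mac)
    learn_lru(lru, mac)

print(sorted(drop_new))  # only the first LIMIT MACs: ['m0', 'm1', 'm2', 'm3']
print(sorted(lru))       # the most recent LIMIT MACs: ['m6', 'm7', 'm8', 'm9']
```

With drop-new, every station that appears after the table fills is flooded forever; with LRU or random, the table keeps tracking currently-active stations, which is exactly the point being argued above.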
Re: [PATCH] net: bridge: add max_fdb_count
Hi Sarah, On Thu, Nov 16, 2017 at 01:20:18AM -0800, Sarah Newman wrote: > I note that anyone who would run up against a too-low limit on the maximum > number of fdb entries would also be savvy enough to fix it in a matter of > minutes. I disagree on this point. There's a huge difference between experiencing sudden breakage under normal conditions due to arbitrary limits being set and being down because of an attack. While the latter is not desirable, it's much more easily accepted and most often requires operations anyway. The former is never an option. And I continue to think that the default behaviour once the limit is reached must not be to prevent new entries from being learned but to purge older ones. At least it preserves normal operations. But given the high CPU impact you reported for a very low load, definitely something needs to be done. > They could also default the limit to U32_MAX in their particular > distribution if it was a configuration option. Well, I'd say that we don't have a default limit on the socket number either and that it happens to be the expected behaviour. It's almost impossible to find a suitable limit for everyone. People dealing with regular loads never read docs and get caught. People working in hostile environments are always more careful and will ensure that their limits are properly set. > At the moment there is not even a single log message if the problem doesn't > result in memory exhaustion. This probably needs to be addressed as well! Regards, Willy
Re: [PATCH] net: recvmsg: Unconditionally zero struct sockaddr_storage
On Tue, Oct 31, 2017 at 09:14:45AM -0700, Kees Cook wrote: > diff --git a/net/socket.c b/net/socket.c > index c729625eb5d3..34183f4fbdf8 100644 > --- a/net/socket.c > +++ b/net/socket.c > @@ -2188,6 +2188,7 @@ static int ___sys_recvmsg(struct socket *sock, struct > user_msghdr __user *msg, > struct sockaddr __user *uaddr; > int __user *uaddr_len = COMPAT_NAMELEN(msg); > > + memset(&addr, 0, sizeof(addr)); > msg_sys->msg_name = &addr; Isn't this going to cause a performance hit in the fast path ? Just checking, I have not read the whole code with the patch in its context. Willy
Re: net.ipv4.tcp_max_syn_backlog implementation
On Mon, Aug 28, 2017 at 11:47:41PM -0400, Harsha Chenji wrote: > So I have ubuntu 12.04 x32 in a VM with syncookies turned off. I tried > to do a syn flood (with netwox) on 3 different processes. Each of them > returns a different value with netstat -na | grep -c RECV : > > nc -l returns 16 (netcat-traditional) > apache2 port 80 returns 256 > vsftpd on 21 returns 64. > net.ipv4.tcp_max_syn_backlog is 512. > > Why do these different processes on different ports have different > queue lengths for incomplete connections? Where exactly in the kernel > is this decided? The listening socket's backlog (second argument to the listen() syscall) is considered as well. The code path to determine whether or not to start to send SYN cookies is far from being trivial but makes sense once you write it down completely. I never perfectly remember it, I regularly have to recheck when I have a doubt. Willy
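As a very rough model of why each listener sees a different count: in kernels of that era, the SYN queue was sized from both the listen() backlog and the sysctl, roughly "clamp to tcp_max_syn_backlog, enforce a floor of 8, round (n + 1) up to a power of two" (this mirrors what reqsk_queue_alloc() did, simplified; the real decision to send cookies involves more state, as said above):

```python
def syn_queue_slots(listen_backlog, max_syn_backlog=512):
    """Coarse model of the old reqsk_queue_alloc() sizing logic:
    min with the sysctl, floor of 8, round (n + 1) up to a power
    of two.  A simplification for illustration, not the full code."""
    n = min(listen_backlog, max_syn_backlog)
    n = max(n, 8)
    size = 1
    while size < n + 1:
        size *= 2
    return size

# nc-style listen(fd, 1) ends up with 16 slots, matching the
# observation above; the other daemons differ because they pass
# different (application-chosen) backlog values to listen().
print(syn_queue_slots(1))   # 16
print(syn_queue_slots(32))  # 64
```

So the per-port differences come from each daemon's own listen() argument, not from anything port-specific in the kernel.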
Re: [PATCH net 3/3] tcp: fix xmit timer to only be reset if data ACKed/SACKed
On Sun, Aug 06, 2017 at 07:39:57AM +, maowenan wrote: > > > > -Original Message- > > From: Willy Tarreau [mailto:w...@1wt.eu] > > Sent: Saturday, August 05, 2017 2:19 AM > > To: Neal Cardwell > > Cc: maowenan; David Miller; netdev@vger.kernel.org; Yuchung Cheng; Nandita > > Dukkipati; Eric Dumazet > > Subject: Re: [PATCH net 3/3] tcp: fix xmit timer to only be reset if data > > ACKed/SACKed > > > > On Fri, Aug 04, 2017 at 02:01:34PM -0400, Neal Cardwell wrote: > > > On Fri, Aug 4, 2017 at 1:10 PM, Willy Tarreau wrote: > > > > Hi Neal, > > > > > > > > On Fri, Aug 04, 2017 at 12:59:51PM -0400, Neal Cardwell wrote: > > > >> I have attached patches for this fix rebased on to v3.10.107, the > > > >> latest stable release for 3.10. That's pretty far back in history, > > > >> so there were substantial conflict resolutions and adjustments > > > >> required. > > > >> :-) Hope that helps. > > > > > > > > At least it will help me :-) > > > > > > > > Do you suggest that I queue them for 3.10.108, that I wait for > > > > Maowenan to test them more broadly first or anything else ? > > > > > > Let's wait for Maowenan to test them first. > [Mao Wenan]It works well with these patches of v3.10, and the > retransmission packet is about 250ms after TLP probe. > Thank you very much for these patches under 3.10. Thanks, I'll take them then. Willy
Re: [PATCH net 3/3] tcp: fix xmit timer to only be reset if data ACKed/SACKed
On Fri, Aug 04, 2017 at 02:01:34PM -0400, Neal Cardwell wrote: > On Fri, Aug 4, 2017 at 1:10 PM, Willy Tarreau wrote: > > Hi Neal, > > > > On Fri, Aug 04, 2017 at 12:59:51PM -0400, Neal Cardwell wrote: > >> I have attached patches for this fix rebased on to v3.10.107, the > >> latest stable release for 3.10. That's pretty far back in history, so > >> there were substantial conflict resolutions and adjustments required. > >> :-) Hope that helps. > > > > At least it will help me :-) > > > > Do you suggest that I queue them for 3.10.108, that I wait for Maowenan > > to test them more broadly first or anything else ? > > Let's wait for Maowenan to test them first. Fine, thanks. Willy
Re: [PATCH net 3/3] tcp: fix xmit timer to only be reset if data ACKed/SACKed
Hi Neal, On Fri, Aug 04, 2017 at 12:59:51PM -0400, Neal Cardwell wrote: > I have attached patches for this fix rebased on to v3.10.107, the > latest stable release for 3.10. That's pretty far back in history, so > there were substantial conflict resolutions and adjustments required. > :-) Hope that helps. At least it will help me :-) Do you suggest that I queue them for 3.10.108, that I wait for Maowenan to test them more broadly first or anything else ? I'm fine with any option. Thanks! Willy
Re: STABLE: net: reduce skb_warn_bad_offload() noise
On Fri, Jul 28, 2017 at 10:22:52PM -0700, Eric Dumazet wrote: > On Fri, 2017-07-28 at 12:30 -0700, David Miller wrote: > > From: Mark Salyzyn > > Date: Fri, 28 Jul 2017 10:29:57 -0700 > > > > > Please backport the upstream patch to the stable trees (including > > > 3.10.y, 3.18.y, 4.4.y and 4.9.y): > > > > > > b2504a5dbef3305ef41988ad270b0e8ec289331c net: reduce > > > skb_warn_bad_offload() noise > > > > > > Impacting performance or creating unnecessary alarm, and will result > > > in kernel panic for panic_on_warn configuration. > > > > Yeah this is fine. > > If you do so, also backport 6e7bc478c9a006c701c14476ec9d389a484b4864 > ("net: skb_needs_check() accepts CHECKSUM_NONE for tx") OK, noted for next 3.10 as well. thanks! Willy