Re: [Cerowrt-devel] [Bloat] Linux network is damn fast, need more use XDP (Was: DC behaviors today)

2017-12-07 Thread Jesper Dangaard Brouer
(Removed netdev list)

On Mon, 4 Dec 2017 09:00:41 -0800 Dave Taht  wrote:

> > If you have not heard, the netdev community has worked on something
> > called XDP (eXpress Data Path).  This is a new layer in the network
> > stack that basically operates at the same "layer"/level as DPDK.
> > Thus, surprise, we get the same performance numbers as DPDK. E.g. I can
> > do 13.4 Mpps forwarding with ixgbe on a single CPU (more CPUs = 14.6Mpps).
> >
> > We can actually use XDP for (software) offloading the Linux routing
> > table.  There are two methods we are experimenting with:
> >
> > (1) externally monitor route changes from userspace and update BPF-maps
> > to reflect this. That approach is already accepted upstream[4][5].  I'm
> > measuring 9,513,746 pps per CPU with that approach.
> >
> > (2) add a bpf helper to simply call fib_table_lookup() from the XDP hook.
> > These are still experimental patches (credit to David Ahern), and I've
> > measured 9,350,160 pps with this approach on a single CPU.  Using more
> > CPUs we hit 14.6Mpps (only 3 CPUs were used in that test)
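For readers who have not seen XDP code, a minimal sketch of approach (2) is
shown below. It is illustrative only: it uses the bpf_fib_lookup() helper as
it exists in later mainline kernels, which may not match the interface of the
experimental patches discussed here, and a real forwarder would also have to
check and decrement the TTL.

/* Illustrative sketch only, not the experimental patches discussed above. */
#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/ip.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

#ifndef AF_INET
#define AF_INET 2   /* not defined by the uapi headers included here */
#endif

SEC("xdp")
int xdp_fib_fwd(struct xdp_md *ctx)
{
    void *data = (void *)(long)ctx->data;
    void *data_end = (void *)(long)ctx->data_end;
    struct ethhdr *eth = data;
    struct iphdr *iph = data + sizeof(*eth);
    struct bpf_fib_lookup fib = {};

    /* Bounds checks required by the BPF verifier. */
    if ((void *)(iph + 1) > data_end)
        return XDP_PASS;
    if (eth->h_proto != bpf_htons(ETH_P_IP))
        return XDP_PASS;

    fib.family      = AF_INET;
    fib.tos         = iph->tos;
    fib.l4_protocol = iph->protocol;
    fib.tot_len     = bpf_ntohs(iph->tot_len);
    fib.ipv4_src    = iph->saddr;
    fib.ipv4_dst    = iph->daddr;
    fib.ifindex     = ctx->ingress_ifindex;

    /* Ask the kernel FIB for a route; on success the helper fills in the
     * egress ifindex and the new source/destination MAC addresses. */
    if (bpf_fib_lookup(ctx, &fib, sizeof(fib), 0) != BPF_FIB_LKUP_RET_SUCCESS)
        return XDP_PASS;            /* let the normal stack handle it */

    __builtin_memcpy(eth->h_dest,   fib.dmac, ETH_ALEN);
    __builtin_memcpy(eth->h_source, fib.smac, ETH_ALEN);
    return bpf_redirect(fib.ifindex, 0);
}

char _license[] SEC("license") = "GPL";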
> 
> Neat. Perhaps trying xdp on the itty bitty routers I usually work on
> would be a win.

Definitely. It will be a huge win for small routers. This is part of my
grand scheme.  We/I just need to implement XDP in the driver of one of
these small routers.

That said, XDP skips many layers and features of the network stack that
you likely need on these small routers, e.g. NAT... 
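To make that caveat concrete: an XDP program is a small BPF function attached
at the driver's receive hook, and it returns a verdict before netfilter/conntrack
(and therefore NAT), routing, qdiscs or sockets ever see the frame. A minimal,
purely illustrative program:

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

/* Runs in the driver for every received frame, before netfilter/NAT,
 * routing, qdiscs or sockets are involved.  Anything those layers would
 * normally provide must be reimplemented here, or left to the stack by
 * returning XDP_PASS (other verdicts: XDP_DROP, XDP_TX, XDP_REDIRECT). */
SEC("xdp")
int xdp_pass_all(struct xdp_md *ctx)
{
    return XDP_PASS;   /* hand the frame to the normal network stack */
}

char _license[] SEC("license") = "GPL";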


> > [4] 
> > https://github.com/torvalds/linux/blob/master/samples/bpf/xdp_router_ipv4_user.c
> > [5] 
> > https://github.com/torvalds/linux/blob/master/samples/bpf/xdp_router_ipv4_kern.c
> >   
> 
> thx very much for the update.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


Re: [Cerowrt-devel] [Bloat] Linux network is damn fast, need more use XDP (Was: DC behaviors today)

2017-12-07 Thread Jesper Dangaard Brouer
On Mon, 4 Dec 2017 09:00:41 -0800
Dave Taht  wrote:

> Jesper:
> 
> I have a tendency to deal with netdev by itself and never cross post
> there, as the bufferbloat.net servers (primarily to combat spam)
> mandate starttls and vger doesn't support it at all, thus leading to
> raising davem's blood pressure, which I'd rather not do.

Sorry, I didn't know.  I've removed the bloat-lists from the reply I
just gave to Matthias on netdev:

 http://lkml.kernel.org/r/20171207093343.07108...@redhat.com

And I'll refrain from cross-posting between these lists in the future.
-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


Re: [Cerowrt-devel] [Bloat] Linux network is damn fast, need more use XDP (Was: DC behaviors today)

2017-12-04 Thread Joel Wirāmu Pauling
On 5 December 2017 at 06:00, Dave Taht  wrote:

>>> The route table lookup is also really expensive on the main CPU.
>
> To clarify the context here, I was asking specifically if the X5 Mellanox card
> did routing-table offload or only switching.
>
To clarify what I know: the X5, using its smart offload engine, CAN do
L3 offload in the NIC - the X4 can't.

So for the Nuage OVS -> eSwitch (what Mellanox calls the flow
programming) magic to happen and be useful, we are going to need the X5.

Mark Iskra gave a talk at the OpenStack Summit, which can be found here:

https://www.openstack.org/videos/sydney-2017/warp-speed-openvswitch-turbo-charge-vnfs-to-100gbps-in-nextgen-sdnnfv-datacenter

Slides here:

https://www.openstack.org/assets/presentation-media/OSS-Nov-2017-Warp-speed-Openvswitch-v6.pptx

Mark is local to you (Mountain View) and is a nice guy; he is probably
the better person to answer specifics.

-Joel


Re: [Cerowrt-devel] [Bloat] Linux network is damn fast, need more use XDP (Was: DC behaviors today)

2017-12-04 Thread Dave Taht
Jesper:

I have a tendency to deal with netdev by itself and never cross post
there, as the bufferbloat.net servers (primarily to combat spam)
mandate starttls and vger doesn't support it at all, thus leading to
raising davem's blood pressure, which I'd rather not do.

But moving on...

On Mon, Dec 4, 2017 at 2:56 AM, Jesper Dangaard Brouer
 wrote:
>
> On Sun, 03 Dec 2017 20:19:33 -0800 Dave Taht  wrote:
>
>> Changing the topic, adding bloat.
>
> Adding netdev, and also adjusting the topic to a rant about how the Linux
> kernel network stack is actually damn fast, and if you need something
> faster, XDP can solve your needs...
>
>> Joel Wirāmu Pauling  writes:
>>
>> > Just from a Telco/Industry perspective slant.
>> >
>> > Everything in the DC has moved to SFP28 interfaces at 25Gbit as the server
>> > interconnect port. Everything ToR-wise is now QSFP28 - 100Gbit.
>> > Mellanox X5 cards are the current hotness, and their offload
>> > enhancements (ASAP2 - which is sorta like DPDK on steroids) allow for
>> > programming OVS flow rules into the card. We have a lot of customers
>> > champing at the bit for that feature (disclaimer: I work for Nuage
>> > Networks, and we are working on enhanced OVS to do just that) for NFV
>> > workloads.
>>
>> What Jesper's been working on for ages has been to try and get Linux's
>> PPS up for small packets, which last I heard was hovering at about
>> 4Gbit/s.
>
> I hope you made a typo here, Dave: the normal Linux kernel is definitely
> way beyond 4Gbit/s.  You must have misunderstood something; maybe you
> meant 40Gbit/s? (which is also too low)

The context here was PPS for *non-GRO'd* TCP ACK packets, in the further
context of the increasingly epic "benefits of ack filtering" thread on the
bloat list, where for 50x1 end-user asymmetry we were seeing 90% fewer ACKs
with the new sch_cake ack-filter code, and double the throughput...

The kind of return traffic you see from data sent outside the DC, with
tons of flows.

What's that number?

>
> Scaling up to more CPUs and TCP streams, Tariq[1] and I have shown that the
> Linux kernel network stack scales to 94Gbit/s (line rate minus overhead).
> But when the driver's page-recycler fails, we hit bottlenecks in the
> page allocator that cause negative scaling down to around 43Gbit/s.

So do I divide 94 by 22 and get ~4Gbit/s for ACKs? Or look at PPS * 66? Or?
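One possible reading of the arithmetic here (a guess, not a measurement): an
MTU-sized frame is about 1514 bytes against roughly 66 bytes for a bare ACK, a
ratio of about 23, and packets-per-second times 66 bytes gives an ACK bit rate
directly, using the ~2Mpps per-CPU figure Jesper quotes below.

#include <stdio.h>

/* Back-of-envelope only; the two readings of the question above. */
int main(void)
{
    double ratio = 1514.0 / 66.0;                       /* ~22.9 */
    printf("94 Gbit/s / %.1f  = %.1f Gbit/s of ACKs\n", ratio, 94.0 / ratio);
    printf("2 Mpps * 66 B * 8 = %.2f Gbit/s of ACKs\n", 2e6 * 66 * 8 / 1e9);
    return 0;
}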

> [1] http://lkml.kernel.org/r/cef85936-10b2-5d76-9f97-cb03b418f...@mellanox.com
>
> Linux has for a _long_ time been doing 10Gbit/s TCP streams easily, on
> a SINGLE CPU.  This is mostly thanks to TSO/GRO aggregating packets,
> but over the last couple of years the network stack has been optimized
> (with UDP workloads), and as a result we can do 10G without TSO/GRO on
> a single CPU.  This is "only" 812Kpps with MTU-size frames.

acks.

> It is important to NOTICE that I'm mostly talking about SINGLE-CPU
> performance.  But the Linux kernel scales very well to more CPUs, and
> you can scale this up, although we are starting to hit scalability
> issues in MM-land[1].
>
> I've also demonstrated that the netdev community has optimized the kernel's
> per-CPU processing power to around 2Mpps.  What does this really
> mean... well, with MTU-size packets 812Kpps was 10Gbit/s, thus 25Gbit/s
> should be around 2Mpps.  That implies Linux can do 25Gbit/s on a
> single CPU without GRO (MTU-size frames).  Do you need more, I ask?
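The frame-size arithmetic behind those figures checks out if one counts 1538
bytes per MTU-sized frame on the wire (1500 byte MTU + 14 byte Ethernet header
+ 4 byte FCS + 20 bytes of preamble and inter-frame gap):

#include <stdio.h>

/* Sanity check of the quoted packet rates against line rate,
 * assuming 1538 bytes on the wire per MTU-sized frame. */
int main(void)
{
    const double wire_bits = 1538.0 * 8;
    printf("10 Gbit/s -> %.0f pps\n", 10e9 / wire_bits);   /* ~812,744   */
    printf("25 Gbit/s -> %.0f pps\n", 25e9 / wire_bits);   /* ~2,031,860 */
    return 0;
}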

The benchmark I had in mind was, say, 100k flows going out over the internet,
and the characteristics of the ack flows on the return path.

>
>
>> The route table lookup is also really expensive on the main CPU.

To clarify the context here, I was asking specifically if the X5 Mellanox card
did routing-table offload or only switching.

> Well, it used to be very expensive. Vincent Bernat wrote some excellent
> blog posts[2][3] on the recent improvements across kernel versions, and
> gave due credit to the people involved.
>
> [2] 
> https://vincent.bernat.im/en/blog/2017-performance-progression-ipv4-route-lookup-linux
> [3] 
> https://vincent.bernat.im/en/blog/2017-performance-progression-ipv6-route-lookup-linux
>
> He measured a cost of around 25 to 35 nanoseconds per route lookup.  My own
> recent measurements showed a 36.9 ns cost for fib_table_lookup().

On Intel hw.
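Putting that lookup cost into the per-packet budgets discussed above: at
roughly 37 ns per lookup, the FIB lookup alone would allow on the order of 27
million lookups per second on one CPU, a small slice of the ~500 ns per packet
that a 2 Mpps per-CPU rate implies.

#include <stdio.h>

/* 36.9 ns per fib_table_lookup() as an upper bound on lookups per second,
 * compared with the per-packet budget at 2 Mpps. */
int main(void)
{
    printf("lookups/sec on one CPU: %.1f M\n", 1.0 / 36.9e-9 / 1e6); /* ~27.1 */
    printf("budget per packet at 2 Mpps: %.0f ns\n", 1e9 / 2e6);     /* 500   */
    return 0;
}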

>
>> Does this stuff offload the route table lookup also?
>
> If you have not heard, the netdev community has worked on something
> called XDP (eXpress Data Path).  This is a new layer in the network
> stack that basically operates at the same "layer"/level as DPDK.
> Thus, surprise, we get the same performance numbers as DPDK. E.g. I can
> do 13.4 Mpps forwarding with ixgbe on a single CPU (more CPUs = 14.6Mpps).
>
> We can actually use XDP for (software) offloading the Linux routing
> table.  There are two methods we are experimenting with:
>
> (1) externally monitor route changes from userspace and update BPF-maps
> to reflect this. That approach is already accepted upstream[4][5].
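A minimal sketch of what the kernel half of approach (1) can look like is
shown below. It is illustrative only: the map layout and names are
hypothetical, it uses current libbpf map syntax rather than the 2017-era style
of the samples in [4][5], and it assumes a userspace daemon listening for
rtnetlink route updates keeps the longest-prefix-match map in sync.

/* Kernel-side half of approach (1), as an illustration.  The map is
 * populated from userspace by a daemon watching rtnetlink route changes;
 * names and layout are hypothetical. */
#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/ip.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

struct route_key {
    __u32 prefixlen;       /* required first field for LPM_TRIE keys */
    __u32 daddr;           /* destination address, network byte order */
};

struct route_val {
    __u32 egress_ifindex;  /* device to forward out of */
    __u8  dmac[ETH_ALEN];  /* next-hop MAC */
    __u8  smac[ETH_ALEN];  /* our MAC on the egress device */
};

struct {
    __uint(type, BPF_MAP_TYPE_LPM_TRIE);
    __uint(max_entries, 65536);
    __type(key, struct route_key);
    __type(value, struct route_val);
    __uint(map_flags, BPF_F_NO_PREALLOC);
} routes SEC(".maps");

SEC("xdp")
int xdp_lpm_fwd(struct xdp_md *ctx)
{
    void *data = (void *)(long)ctx->data;
    void *data_end = (void *)(long)ctx->data_end;
    struct ethhdr *eth = data;
    struct iphdr *iph = data + sizeof(*eth);
    struct route_key key;
    struct route_val *rt;

    if ((void *)(iph + 1) > data_end)
        return XDP_PASS;
    if (eth->h_proto != bpf_htons(ETH_P_IP))
        return XDP_PASS;

    key.prefixlen = 32;            /* longest-prefix match on the full address */
    key.daddr = iph->daddr;

    rt = bpf_map_lookup_elem(&routes, &key);
    if (!rt)
        return XDP_PASS;           /* no route in the map: fall back to the stack */

    __builtin_memcpy(eth->h_dest,   rt->dmac, ETH_ALEN);
    __builtin_memcpy(eth->h_source, rt->smac, ETH_ALEN);
    return bpf_redirect(rt->egress_ifindex, 0);
}

char _license[] SEC("license") = "GPL";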