On Mon, May 18, 2020 at 4:34 AM Van Haaren, Harry
<[email protected]> wrote:
>
> > -----Original Message-----
> > From: William Tu <[email protected]>
> > Sent: Saturday, May 16, 2020 5:01 AM
> > To: Van Haaren, Harry <[email protected]>
> > Cc: ovs-dev <[email protected]>; Ilya Maximets <[email protected]>
> > Subject: Re: [ovs-dev] [PATCH v2 0/5] DPCLS Subtable ISA Optimization
> >
> > Hi Harry,
>
> Hey William,
>
> > Thanks for the patches, I learned a lot from them.
>
> Cool, yeah it's been fun for me learning about the OVS datapath at this level.
>
> > On Wed, May 6, 2020 at 6:05 AM Harry van Haaren
> > <[email protected]> wrote:
> > >
> > > This patchset implements the changes as proposed during the
> > > OVS Conf '19, in the talk "Next steps for SW Datapath".
> > > Youtube link: https://youtu.be/x0bOpojnpmU
> <snip>
> > > Patch 5/5:
> > > Actual AVX-512 implementation for DPCLS subtable search. This is the
> > > actual SIMD vector code, which performs DPCLS miniflow iteration in
> > > parallel.
> > >
> > From your previous slides and patch 5, I roughly understand the AVX code logic.
>
> Any questions, feel free to ask! The SIMD design & implementation can be difficult
> to understand; I'd be happy to help if you're curious about specific aspects.
>
> > I'm also thinking about a very rough idea.
> > I wonder if it is possible to use the AVX scatter instructions to implement
> > miniflow_expand.
>
> Does miniflow expand take a significant number of cycles in your use-case? I know it's used to
> decompress a miniflow as required for OF updates etc., but on the datapath it shouldn't matter?
> If there's a benchmark that shows mf expand to be a hotspot, that would be very interesting!
>
> You're right that AVX scatter could be used to perform the writes from a single AVX register.
>
> > And to look up a subtable, we can expand to the original "struct flow" memory
> > layout for both packets and subtable->mf.
> > So each field for each packet is at a fixed offset from the mf values.
> > This wastes some memory due to the expansion but makes matching rule keys easier?
>
> My concern here is that "miniflow" has this very nice attribute that it is compressed, and
> hence requires fewer cache lines than the full "struct flow". In particular, the miniflow
> is contiguous, meaning utilization of the cache lines is 100%. Typical miniflow sizes for
> outer packets are ~6 or so miniflow blocks, so ~6 * 8 bytes (uint64_t) + 2 bytes for "bits".
> That means simple packets are resident in a single cache line, and many tunneled packets
> can be represented by 2 cache lines.
>
> Matching on "struct flow" would imply a sparsely populated region of 672 bytes and, depending
> on the exact contents being matched on, could touch anywhere from 2-X cache lines? Generally
> compute is more performant than memory accesses that aren't cache local, so I'm not sure this
> is really going to give performance benefits in the bigger picture.
>
Hi Harry,
Thanks for your explanation! And yes, the cache line miss overhead is definitely
more important. Now I understand the design.
William
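
For reference, the trade-off discussed above can be sketched in a few lines of
C. This is only a rough illustration, not OVS code: struct toy_miniflow,
toy_expand() and cache_lines_for() are hypothetical names, the 64-byte cache
line is an assumption, and the sizes (672 bytes for "struct flow", ~6 blocks
plus 2 bytes for a simple packet) are the figures quoted in this thread. The
scalar loop does the writes that an AVX scatter could issue from a single
register.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CACHE_LINE_SIZE 64  /* Assumption: 64-byte cache lines. */
#define FLOW_U64S 84        /* 672 bytes / 8, using the size quoted above. */

/* Hypothetical, simplified stand-in for a miniflow (not the OVS struct):
 * a bitmap of which 64-bit blocks are present, then those blocks, packed. */
struct toy_miniflow {
    uint64_t map[2];             /* Bit i set => block i is present. */
    uint64_t values[FLOW_U64S];  /* Only the first popcount(map) are used. */
};

/* Scalar expand: store each packed value at its fixed offset in the
 * full layout, and zero every block that is absent from the map. */
static void toy_expand(const struct toy_miniflow *mf, uint64_t *full)
{
    size_t next = 0;

    memset(full, 0, FLOW_U64S * sizeof *full);
    for (size_t i = 0; i < FLOW_U64S; i++) {
        if (mf->map[i / 64] & (UINT64_C(1) << (i % 64))) {
            assert(next < FLOW_U64S);
            full[i] = mf->values[next++];
        }
    }
}

/* Cache lines spanned by a contiguous, cache-line-aligned region. */
static size_t cache_lines_for(size_t bytes)
{
    return (bytes + CACHE_LINE_SIZE - 1) / CACHE_LINE_SIZE;
}
```

With these assumptions, cache_lines_for(6 * 8 + 2) is 1 while
cache_lines_for(672) is 11. The second number is only an upper bound for the
expanded layout, since a lookup touches just the lines that hold matched fields.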
