> -----Original Message-----
> From: William Tu <[email protected]>
> Sent: Saturday, May 16, 2020 5:01 AM
> To: Van Haaren, Harry <[email protected]>
> Cc: ovs-dev <[email protected]>; Ilya Maximets <[email protected]>
> Subject: Re: [ovs-dev] [PATCH v2 0/5] DPCLS Subtable ISA Optimization
> 
> Hi Harry,

Hey William,

> Thanks for the patch, I learn a lot from them.

Cool, yeah it's been fun for me learning about the OVS datapath at this level.

> On Wed, May 6, 2020 at 6:05 AM Harry van Haaren
> <[email protected]> wrote:
> >
> > This patchset implements the changes as proposed during the
> > OVS Conf '19, in the talk "Next steps for SW Datapath".
> > Youtube link: https://youtu.be/x0bOpojnpmU
<snip>
> > Patch 5/5:
> > Actual AVX-512 implementation for DPCLS subtable search. This is the
> > actual SIMD vector code, which performs DPCLS miniflow iteration in
> > parallel.
> >
> From your previous slides and patch5, I roughly understand the avx code logic.

Feel free to ask any questions! The SIMD design & implementation can be
difficult to understand, and I'd be happy to help if you're curious about
specific aspects.

> I'm also thinking about a very rough idea.
> I wonder if it is possible to use avx scatter function to implement 
> miniflow_expand.

Does miniflow expand take a significant number of cycles in your use-case?
I know it's used to decompress a miniflow as required for OF updates etc,
but on the datapath it shouldn't matter? If there's a benchmark that shows
mf expand to be a hotspot, that would be very interesting!

You're right that AVX scatter could be used to perform the writes from a
single AVX register.
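
To make the idea concrete, here is a rough scalar sketch of what an expand
does: walk the set bits of the map and store each packed value at its block
index. The struct and function names (struct mini, mini_expand) are made up
for illustration and are not the OVS definitions; the real struct miniflow
uses a wider map. FLOW_U64S is just the 672-byte figure mentioned in this
thread divided by 8. A scatter such as _mm512_i64scatter_epi64 could, in
principle, replace up to eight of these stores at a time, given a vector of
the block indexes.

```c
#include <stdint.h>
#include <string.h>

/* 672 bytes / 8 bytes per block: an assumption taken from this thread. */
#define FLOW_U64S 84

/* Hypothetical simplified miniflow, NOT the OVS struct miniflow. */
struct mini {
    uint64_t map;         /* Bit i set => block i of the flow is present. */
    uint64_t values[64];  /* Present blocks, packed in ascending bit order. */
};

/* Expand the packed values into a full-size, zero-filled block array. */
static void
mini_expand(const struct mini *m, uint64_t flow[FLOW_U64S])
{
    uint64_t map = m->map;
    const uint64_t *v = m->values;

    memset(flow, 0, FLOW_U64S * sizeof flow[0]);
    while (map) {
        int idx = __builtin_ctzll(map);  /* GCC/Clang builtin: lowest set bit. */

        flow[idx] = *v++;                /* The store a scatter would batch. */
        map &= map - 1;                  /* Clear the lowest set bit. */
    }
}
```

Each loop iteration is one dependent store, so the scatter version would
mainly save instructions, not memory traffic.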

> And for lookup a subtable, we can expand to the origin "struct flow" memory
> layouts for both packets and subtable->mf.
> So each field for each packet is at a fixed offset from the mf values.
> This wastes some memory due to expand but makes rule match keys easier?

My concern here is that "miniflow" has this very nice attribute that it is
compressed, and hence requires fewer cache lines than the full "struct flow".
In particular, the miniflow is contiguous, meaning utilization of the cache
lines is 100%. Typical miniflow sizes for outer packets are ~6 or so miniflow
blocks, so ~6*8 bytes (uint64_t) + 2 bytes for "bits". That means simple
packets are resident in a single cache line, and many tunneled packets can be
represented by 2 cache lines.

Matching on "struct flow" would imply a sparsely populated region of 672
bytes and, depending on the exact contents being matched on, could touch
anywhere from 2-X cache lines? Generally compute is more performant than
memory accesses that aren't cache local, so I'm not sure this is really going
to give performance benefits in the bigger picture.
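
As a back-of-the-envelope sketch of that cache-line argument, the two helpers
below count lines for a contiguous object and for sparse 8-byte reads. They
assume 64-byte cache lines and line-aligned data, and the helper names
(lines_contiguous, lines_sparse) are made up for this example.

```c
#include <stddef.h>

#define CACHE_LINE 64  /* Assumed cache-line size in bytes. */

/* Cache lines covered by a contiguous, line-aligned object of 'bytes' bytes. */
static size_t
lines_contiguous(size_t bytes)
{
    return (bytes + CACHE_LINE - 1) / CACHE_LINE;
}

/* Distinct cache lines touched when reading 8-byte blocks at the given block
 * indexes of a line-aligned array. 'block_idx' must be in ascending order. */
static size_t
lines_sparse(const size_t *block_idx, size_t n)
{
    size_t count = 0;
    size_t last = (size_t) -1;  /* Sentinel: no line seen yet. */

    for (size_t i = 0; i < n; i++) {
        size_t line = block_idx[i] * 8 / CACHE_LINE;

        if (line != last) {
            count++;
            last = line;
        }
    }
    return count;
}
```

With these, lines_contiguous(6 * 8 + 2) is 1 for the typical miniflow above,
while lines_contiguous(672) is 11 for the whole expanded region. Six blocks at
spread-out indexes such as {0, 9, 20, 41, 60, 83} (invented for illustration,
not real struct flow offsets) give lines_sparse() == 6, i.e. one cache line
per field matched.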

> Regards,
> William

Cheers for having a look at the patchset! -Harry
_______________________________________________
dev mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-dev
