On Tue, Aug 8, 2023 at 4:17 PM Jeff Law via Gcc-patches
<gcc-patches@gcc.gnu.org> wrote:
>
>
>
> On 8/8/23 10:38, Alexander Monakov wrote:
> >
> > On Tue, 8 Aug 2023, Jeff Law wrote:
> >
> >> That was my thinking at one time.  Then we started looking at the distros
> >> and found enough crc implementations in there to change my mind about the
> >> overall utility.
> >
> > The ones I'm familiar with are all table-based and look impossible to
> > pattern-match (and hence are already fairly efficient compared to the
> > bitwise loop in Coremark).
> We found dozens that were the usual looking loops and, IIRC ~200 table
> lookups after analyzing about half of the packages in Fedora.

I will note that we already handle table lookups to detect count trailing
zeros; see check_ctz_array in tree-ssa-forwprop.cc for that detection
(that was even done to improve a SPEC benchmark).
So if the tables are statically defined at compile time, there is
already an example of how it can be detected too.
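
For anyone following along, the kind of table-based idiom I have in mind
looks roughly like the sketch below. It is the well-known de Bruijn
multiply-and-lookup trick written from memory, not code taken from
forwprop, and the names ctz32 and debruijn_ctz32 are made up for
illustration. The result is only meaningful for a nonzero argument.

```c
#include <stdint.h>

/* Lookup table for the 32-bit de Bruijn sequence 0x077CB531.
   Hypothetical name; any statically initialized const array works. */
static const unsigned char debruijn_ctz32[32] = {
     0,  1, 28,  2, 29, 14, 24,  3, 30, 22, 20, 15, 25, 17,  4,  8,
    31, 27, 13, 23, 21, 19, 16,  7, 26, 12, 18,  6, 11,  5, 10,  9
};

/* Count trailing zeros of a nonzero 32-bit value via table lookup. */
int ctz32(uint32_t v)
{
    /* v & -v isolates the lowest set bit; multiplying by the de Bruijn
       constant moves a unique 5-bit pattern into the top bits. */
    uint32_t lowest = v & (0u - v);
    return debruijn_ctz32[(uint32_t)(lowest * 0x077CB531u) >> 27];
}
```

Because the table is a compile-time constant, a compiler can in principle
check every entry against the expected answer and replace the whole
expression with a single ctz operation.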

Thanks,
Andrew Pinski

>
>
> >
> > So... just provide a library? Library code is easier to develop and audit,
> > it can be released independently, people can use it with their compiler of
> > choice. Not everything needs to be in libgcc.
> If the compiler can identify a CRC and collapse it down to a table or
> clmul, that's a major win and such code does exist in the real world.
> That was the whole point behind the Fedora experiment -- to determine if
> these things are showing up in the real world or if this is just a
> benchmarking exercise.
>
> And just to be clear, we're not proposing anything for libgcc.
>
> >
> > I'm talking about factoring a long chain into multiple independent chains
> > for latency hiding.
> And that could potentially be an extension.  But even without this a
> standard looking CRC loop will be much faster using table lookups or
> simple generation with clmul.
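
To make the comparison concrete, here is a minimal sketch of the two
forms being discussed: a "standard looking" bit-at-a-time CRC loop and
the equivalent byte-at-a-time table lookup. It assumes the common
reflected CRC-32 (polynomial 0xEDB88320, initial value and final xor of
all ones); the function names are hypothetical and this is not what the
proposed builtin or any GCC expander would emit.

```c
#include <stddef.h>
#include <stdint.h>

/* Reflected CRC-32 polynomial (assumed here for illustration). */
#define CRC32_POLY 0xEDB88320u

/* Bit-at-a-time form: eight dependent shift/xor steps per input byte. */
uint32_t crc32_bitwise(const unsigned char *buf, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++) {
        crc ^= buf[i];
        for (int k = 0; k < 8; k++)
            crc = (crc >> 1) ^ (CRC32_POLY & (0u - (crc & 1u)));
    }
    return ~crc;
}

/* Table-driven form: the eight steps precomputed for every byte value. */
static uint32_t crc32_table[256];

void crc32_init_table(void)
{
    for (uint32_t n = 0; n < 256; n++) {
        uint32_t c = n;
        for (int k = 0; k < 8; k++)
            c = (c >> 1) ^ (CRC32_POLY & (0u - (c & 1u)));
        crc32_table[n] = c;
    }
}

uint32_t crc32_tabular(const unsigned char *buf, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++)
        crc = crc32_table[(crc ^ buf[i]) & 0xFFu] ^ (crc >> 8);
    return ~crc;
}
```

Both functions compute the same value; the table version trades a 1 KiB
table for one lookup per byte instead of eight loop iterations.
crc32_init_table must be called once before crc32_tabular.
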
>
> Also note that latency of clmuls is improving on modern hardware.  4c
> isn't hard to achieve and I wouldn't be surprised to see 2c clmuls in
> the near future.
>
>
> >
> > Useful to whom? The Linux kernel? zlib, bzip2, xz-utils? ffmpeg?
> > These consumers need high-performance blockwise CRC, offering them
> > a latency-bound elementwise CRC primitive is a disservice. And what
> > should they use as a fallback when __builtin_crc is unavailable?
> The point is builtin_crc would always be available.  If there is no
> clmul, then the RTL backend can expand to a table lookup version.
>
> >
> >> while at the same time putting one side of the infrastructure we need for
> >> automatic detection of CRC loops and turning them into table lookups or
> >> CLMULs.
> >>
> >> With that in mind I'm certain Mariam & I would love feedback on a builtin
> >> API that would be more useful.
> >
> > I think offering a conventional library for CRC has substantial advantages.
> That's not what I asked.  If you think there's room for improvement to a
> builtin API, I'd love to hear it.
>
> But it seems you don't think this is worth the effort at all.  That's
> unfortunate, but if that's the consensus, then so be it.
>
> I'll note LLVM is likely going forward with CRC detection and
> optimization at some point in the next ~6 months (effectively moving the
> implementation from the hexagon port into the generic parts of their
> loop optimizer).
>
>
>
> Jeff
