On Tue, Aug 8, 2023 at 4:17 PM Jeff Law via Gcc-patches
<gcc-patches@gcc.gnu.org> wrote:
>
> On 8/8/23 10:38, Alexander Monakov wrote:
> >
> > On Tue, 8 Aug 2023, Jeff Law wrote:
> >
> >> That was my thinking at one time. Then we started looking at the
> >> distros and found enough crc implementations in there to change my
> >> mind about the overall utility.
> >
> > The ones I'm familiar with are all table-based and look impossible to
> > pattern-match (and hence already fairly efficient comparable to bitwise
> > loop in Coremark).
> We found dozens that were the usual looking loops and, IIRC ~200 table
> lookups after analyzing about half of the packages in Fedora.
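For reference, the two shapes of CRC code being contrasted here can be
sketched as follows. This is only an illustrative sketch, not code taken
from Coremark, Fedora, or the proposed patch: it assumes the common
reflected CRC-32 polynomial 0xEDB88320, and the function names are
hypothetical. The first function is the kind of bit-at-a-time loop a
detection pass would have to recognize; the second is the table-driven
form it could be collapsed into.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption: reflected CRC-32 polynomial, chosen only for illustration. */
#define CRC32_POLY 0xEDB88320u

/* Bitwise form: one shift and conditional XOR per input bit. */
uint32_t crc32_bitwise(const uint8_t *buf, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++) {
        crc ^= buf[i];
        for (int k = 0; k < 8; k++)
            /* (0u - bit) is all-ones when the low bit is set, else zero. */
            crc = (crc >> 1) ^ (CRC32_POLY & (0u - (crc & 1u)));
    }
    return ~crc;
}

/* 256-entry table: the CRC of every possible byte value. */
static uint32_t crc32_table[256];

/* Fill the table by running the bitwise step on each byte value. */
void crc32_init_table(void)
{
    for (uint32_t n = 0; n < 256; n++) {
        uint32_t c = n;
        for (int k = 0; k < 8; k++)
            c = (c >> 1) ^ (CRC32_POLY & (0u - (c & 1u)));
        crc32_table[n] = c;
    }
}

/* Table-driven form: one lookup per input byte instead of eight steps. */
uint32_t crc32_table_driven(const uint8_t *buf, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++)
        crc = crc32_table[(crc ^ buf[i]) & 0xFFu] ^ (crc >> 8);
    return ~crc;
}
```

Both functions return the same value for the same input once
crc32_init_table() has been called; for the ASCII string "123456789"
that value is the standard CRC-32 check value 0xCBF43926.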

I will note that we already handle table lookups to detect count
trailing zeros; see check_ctz_array in tree-ssa-forwprop.cc for that
detection (it was even done to improve a SPEC benchmark).
So if the tables are statically defined at compile time, there is
already an example of how they can be detected too.

Thanks,
Andrew Pinski

> >
> > So... just provide a library? A library code is easier to develop and audit,
> > it can be released independently, people can use it with their compiler of
> > choice. Not everything needs to be in libgcc.
> If the compiler can identify a CRC and collapse it down to a table or
> clmul, that's a major win and such code does exist in the real world.
> That was the whole point behind the Fedora experiment -- to determine if
> these things are showing up in the real world or if this is just a
> benchmarking exercise.
>
> And just to be clear, we're not proposing anything for libgcc.
>
> >
> > I'm talking about factoring a long chain into multiple independent chains
> > for latency hiding.
> And that could potentially be an extension. But even without this a
> standard looking CRC loop will be much faster using table lookups or
> simple generation with clmul.
>
> Also note that latency of clmuls is improving on modern hardware. 4c
> isn't hard to achieve and I wouldn't be surprised to see 2c clmuls in
> the near future.
>
> >
> > Useful to whom? The Linux kernel? zlib, bzip2, xz-utils? ffmpeg?
> > These consumers need high-performance blockwise CRC, offering them
> > a latency-bound elementwise CRC primitive is a disservice. And what
> > should they use as a fallback when __builtin_crc is unavailable?
> The point is builtin_crc would always be available. If there is no
> clmul, then the RTL backend can expand to a table lookup version.
>
> >
> >> while at the same time putting one side of the infrastructure we need for
> >> automatic detection of CRC loops and turning them into table lookups or
> >> CLMULs.
> >>
> >> With that in mind I'm certain Mariam & I would love feedback on a builtin
> >> API that would be more useful.
> >
> > I think offering a conventional library for CRC has substantial advantages.
> That's not what I asked. If you think there's room for improvement to a
> builtin API, I'd love to hear it.
>
> But it seems you don't think this is worth the effort at all. That's
> unfortunate, but if that's the consensus, then so be it.
>
> I'll note LLVM is likely going forward with CRC detection and
> optimization at some point in the next ~6 months (effectively moving the
> implementation from the hexagon port into the generic parts of their
> loop optimizer).
>
> Jeff
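
As an illustration of the kind of statically defined table lookup that
check_ctz_array is mentioned as detecting, here is a minimal sketch of
the well-known de Bruijn count-trailing-zeros idiom. It is not copied
from GCC or from any benchmark; the function and table names are
hypothetical, and whether a given GCC version recognizes this exact
spelling is an assumption. The lookup isolates the lowest set bit,
multiplies by a de Bruijn constant so that each bit position produces a
distinct value in the top 5 bits, and uses those bits as the index.

```c
#include <stdint.h>

/* Position table for the 32-bit de Bruijn constant 0x077CB531. */
static const int debruijn_ctz32[32] = {
     0,  1, 28,  2, 29, 14, 24,  3, 30, 22, 20, 15, 25, 17,  4,  8,
    31, 27, 13, 23, 21, 19, 16,  7, 26, 12, 18,  6, 11,  5, 10,  9
};

/* Count trailing zeros of a nonzero 32-bit value via table lookup.
   The result for v == 0 is not meaningful (this sketch returns 0). */
int ctz32_table(uint32_t v)
{
    uint32_t lowest = v & (0u - v);          /* isolate the lowest set bit */
    return debruijn_ctz32[(uint32_t)(lowest * 0x077CB531u) >> 27];
}
```

Because the table is a compile-time constant, a compiler can check that
every entry matches the expected bit position and replace the whole
expression with a single count-trailing-zeros instruction; statically
defined CRC tables could in principle be validated in the same way.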