On 9/26/23 12:56, Joern Rennecke wrote:


What ultimately pushed us to keep moving forward on this effort was
discovering numerous CRC loop implementations out in the wild, including
4 implementations (IIRC) in the kernel itself.

I have always assumed that such implementations must exist (CRCs are useful
for a number of problems, and besides, they wouldn't have been included in
coremark as part of the workload if they were not relevant), but it is good
to have confirmation, and even better to have code that can detect and
analyse an entire class of idioms that is in such widespread use.
I was personally surprised at how many we found. While there were a bunch of table generation routines, which obviously aren't at all interesting, there were enough real CRC loops among the cases we analyzed to convince me this wasn't just catering to a benchmark.
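
For anyone who hasn't stared at these, the shape of loop we're talking about is the classic bit-at-a-time update, roughly as below (reflected CRC-32 with a fixed polynomial, purely as an example; real-world variants differ in width, reflection and init/final xor):

  #include <stdint.h>

  /* Classic bit-serial CRC update, one byte at a time.  This is the kind
     of loop the recognizer looks for; the polynomial here is just the
     usual reflected CRC-32 one, used as an example.  */
  uint32_t
  crc32_update_bitwise (uint32_t crc, uint8_t byte)
  {
    crc ^= byte;
    for (int i = 0; i < 8; i++)
      {
        if (crc & 1)
          crc = (crc >> 1) ^ 0xEDB88320;
        else
          crc >>= 1;
      }
    return crc;
  }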


This still leaves room for further improvements, like detecting fully-unrolled
code, table lookup, or slice-by-N, and replacing them with better
target-optimized code where this is indicated by the optimization flags to
save execution time or code/rodata size.  Not something we have to tackle
now, but just because we don't do it now, doesn't mean we couldn't address
these in the future if that appears worthwhile.
Absolutely. In fact, I have encouraged Mariam to capture some of the cases we can't currently handle in the testsuite, essentially building a bit of a TODO list.
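
For example, a fully-unrolled per-byte step like the sketch below is exactly that kind of TODO case -- it computes the same thing as the bitwise loop, but there's no loop structure left for the recognizer to latch onto (again with the reflected CRC-32 polynomial purely as an example):

  #include <stdint.h>

  /* Same per-byte update as the bitwise loop, manually unrolled eight
     times and written branchlessly; semantically identical, but no loop
     remains for the current recognizer to match.  */
  uint32_t
  crc32_update_unrolled (uint32_t crc, uint8_t byte)
  {
    crc ^= byte;
    crc = (crc >> 1) ^ (-(crc & 1) & 0xEDB88320);   /* bit 0 */
    crc = (crc >> 1) ^ (-(crc & 1) & 0xEDB88320);   /* bit 1 */
    crc = (crc >> 1) ^ (-(crc & 1) & 0xEDB88320);   /* bit 2 */
    crc = (crc >> 1) ^ (-(crc & 1) & 0xEDB88320);   /* bit 3 */
    crc = (crc >> 1) ^ (-(crc & 1) & 0xEDB88320);   /* bit 4 */
    crc = (crc >> 1) ^ (-(crc & 1) & 0xEDB88320);   /* bit 5 */
    crc = (crc >> 1) ^ (-(crc & 1) & 0xEDB88320);   /* bit 6 */
    crc = (crc >> 1) ^ (-(crc & 1) & 0xEDB88320);   /* bit 7 */
    return crc;
  }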



I can easily see creating a clmul RTL opcode for targets which support
it and hoisting the clmul vs lookup table selection into generic code.
I'm still pondering if we're likely to ever see cases where we want a
vector clmul intrinsic or support in the autovectorizer for clmul.
We've certainly got targets with vector clmul in the ISA; the question
is actually putting it to use.

If we aim for optimal code, I think it more likely that we want to detect a
block CRC computation, and have a target expander decide to do that
with inline code or a library call that uses vectorized clmul.  At the time
we add such block-CRC expansion code, it also makes sense to add a
builtin for block CRC so that new GNU C programs can directly request
that functionality without having to go through the cargo cult of matching
a supported idiom.
And I think we've got a structure which largely allows further improvements, both in the recognition/validation step and in the code expansion step.
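
For concreteness, the whole-buffer ("block") form that such an expander or libcall would target is roughly the scalar reference below -- a real library routine would of course use a folded/vectorized clmul sequence rather than this byte/bit loop:

  #include <stddef.h>
  #include <stdint.h>

  /* Whole-buffer ("block") CRC, reflected CRC-32 again as the example.
     A clmul-based or vectorized implementation would process many bytes
     per step instead of this bit-serial loop.  */
  uint32_t
  crc32_block (uint32_t crc, const uint8_t *buf, size_t len)
  {
    for (size_t i = 0; i < len; i++)
      {
        crc ^= buf[i];
        for (int k = 0; k < 8; k++)
          crc = (crc >> 1) ^ (-(crc & 1) & 0xEDB88320);
      }
    return crc;
  }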


Now, the library might be written in GNU C, and for that it might be useful
to have a vector clmul intrinsic so that we can express this algorithm more
easily.
Agreed. It's also worth noting that LLVM has a clmul in their IL and I suspect they expose it via a builtin/intrinsic. I'd expect we'll ultimately end up in the same boat.
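
For anyone following along, the operation in question is carry-less (GF(2) polynomial) multiplication; a throwaway bit-serial reference is below just to pin down the semantics an RTL opcode or intrinsic would expose -- hardware versions (RISC-V Zbc/Zbkc clmul, x86 pclmulqdq, etc.) do this in a single instruction:

  #include <stdint.h>

  /* Carry-less multiply: like an integer multiply, but the partial
     products are combined with XOR, so no carries propagate.
     32x32 -> 64 bit here, purely as a reference definition.  */
  uint64_t
  clmul32 (uint32_t a, uint32_t b)
  {
    uint64_t acc = 0;
    for (int i = 0; i < 32; i++)
      if (b & (1u << i))
        acc ^= (uint64_t) a << i;
    return acc;
  }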


Probably the biggest task in that space right now is to see if we can
avoid the symbolic execution engine by re-using ranger.

I'll be interested to see what you come up with, but if you do fall back to
the symbolic execution engine, the compile time cost isn't much if you only
use it for a proper match.  So whatever heuristics are used before deciding
to use the engine matter.  Can all the cases detected by the engine be
recognized as a loop with a reduction?  We might use different heuristics
for different optimization levels, i.e. allow more false negatives at -O1,
and more false positives at -O2 / -fexpensive-optimizations.
It's mostly a desire not to add (yet another) symbolic execution engine to GCC. We've already got the engine for CCP as well as symbolic execution capabilities in Ranger. I'd like to avoid adding another if we can do so.

For the LFSR validation step we need to track four potential states for each bit in an object: 0, 1, x, and !x, where "x" is the state of a bit from a different object. If it were just tracking 0, 1, x, !x for an entire object, Ranger could probably already do that. But we need to do it for each bit within an object.
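
Roughly, the per-bit state we have to carry around looks like this (just a sketch of the idea, not the actual representation in Mariam's patch; names are illustrative):

  /* Four-valued state per bit for the LFSR check: constant 0, constant 1,
     equal to some other bit "x", or the complement of that bit.  */
  enum bit_state { BIT_ZERO, BIT_ONE, BIT_SYM, BIT_SYM_NOT };

  struct bit_value
  {
    enum bit_state state;    /* 0, 1, x or !x */
    unsigned origin_object;  /* which object "x" came from ...       */
    unsigned origin_bit;     /* ... and which bit position within it */
  };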

We haven't seen compile-time cost be a real issue. But we also haven't looked too closely at that problem.


While I concur that we want existing code to be able to take advantage of
this optimization by gcc recognizing whatever method the code uses to
compute CRC (within reason, we are still bound by the laws of
computability and resource constraints for the compilation), I would
like to stress that I don't think the builtin will lose its value over time.
It can be used in tests to make sure we exercise the code generation for the
internal function.  It can be used in new GNU C code to make it easier to
do a CRC computation without worrying about the implementation.  If
an accord with other major compiler suppliers (e.g. the LLVM community)
is reached, it might even see more widespread use.
Which somewhat dovetails with Alex's position -- namely that it's not that valuable. Hence the de-emphasis on this approach. We'll continue to focus on the end-to-end solution.

We may still want a clmul as an RTL opcode and builtins to utilize it.


Which is basically what I was suggesting by moving to the various expand
interfaces.  That code does not need to be RISC-V specific and in fact
it's much better if it is not and can be shared across architectures.

For the RISC-V implementation, I started out with writing some variants in
C and assembler optimized for latency, register use, and code size, picking
the one that's best for speed for constant polynomials (crcu16_1 is still
viable for other cases, but I didn't have time to implement *and test*
-Os and variable-polynomial paths), and then I crafted an expander so that
I'd get the desired assembler at the end of compilation.  I had this idea that
every port wants its own implementation (with the previous port going a
completely different route than RISC-V).  But I suppose you are right that
a lot of GCC's ports that are in mainline are just like RISC-V in that they
will work well with a lookup table, and to get an initial version that works,
it makes sense to start with a single or a handful of generic
implementation(s), and do target-specific refinements later.
The thing about a table is it'll work just about everywhere as it doesn't require anything special from a hardware standpoint. So we ought to be able to verify the table-based version quite thoroughly without relying on other port maintainers to wire up a carryless multiply in their port. And it'll be dramatically faster than a bitwise implementation.
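
To be concrete, the table-based version is the usual one-table, byte-at-a-time scheme, roughly as below (reflected CRC-32 once more, purely as a reference for what a generic table expansion computes):

  #include <stddef.h>
  #include <stdint.h>

  static uint32_t crc32_table[256];

  /* Build the 256-entry table for the example (reflected CRC-32)
     polynomial.  The compiler would presumably emit an equivalent table
     into rodata for whatever polynomial it actually detected.  */
  static void
  crc32_init_table (void)
  {
    for (uint32_t i = 0; i < 256; i++)
      {
        uint32_t c = i;
        for (int k = 0; k < 8; k++)
          c = (c >> 1) ^ (-(c & 1) & 0xEDB88320);
        crc32_table[i] = c;
      }
  }

  /* One load, one shift and one XOR per byte, instead of eight
     conditional shift/XOR steps.  */
  static uint32_t
  crc32_by_table (uint32_t crc, const uint8_t *buf, size_t len)
  {
    for (size_t i = 0; i < len; i++)
      crc = (crc >> 8) ^ crc32_table[(crc ^ buf[i]) & 0xff];
    return crc;
  }

(The table is 256 * 4 = 1K of rodata for a 32-bit CRC, which is also where the size concern for small targets further down comes from.)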





I suppose for such an initial implementation, RISC and x86* 32/64-bit
targets will be fine with the table lookup, while small-address-space
targets like avr would be better off with a library function along the
lines of crcu16_1, tailored to their ALU and register sizes.  The 'For
base instruction set' variant should actually work with int or short
just as well as long; long just happens to be faster for RISC-V as it
doesn't need extra extensions, while avr will likely be much better off
with short.
Right. And there's nothing preventing someone from putting a target-specific custom routine in libgcc, but I doubt we'll see a lot of that. I suspect that 99% of the time clmul or a table is going to be the desired codegen for a detected CRC. The remaining cases would be space-sensitive and likely using -Os/-Oz -- where we'd disable the CRC optimization if there isn't a clmul capability in the backend.



FWIW, for the table lookup code, there are probably some places where
force_operand is convenient to make the code portable across targets.
Absolutely. But I think we're doing too much direct RTL generation. I'd like to squash that out and hopefully what's left is just calls to expand_* and we don't have to worry about force_operand and such.
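
Something along these lines is the shape I have in mind -- completely untested, the function name and arguments are invented, and it glosses over modes and MEM attributes -- but the point is that expand_binop/force_operand let us get legitimate insns for whatever target we're on instead of hand-built SETs:

  /* Sketch of one table-lookup step, (crc >> 8) ^ table[(crc ^ data) & 0xff],
     written against the generic expand interfaces.  Untested; the caller
     would supply the crc/data rtxes and the table's symbol_ref.  */
  static rtx
  expand_crc_table_step (rtx crc, rtx data, rtx table_sym)
  {
    /* ix = (crc ^ data) & 0xff  */
    rtx ix = expand_binop (SImode, xor_optab, crc, data,
                           NULL_RTX, 1, OPTAB_LIB_WIDEN);
    ix = expand_binop (SImode, and_optab, ix, GEN_INT (0xff),
                       NULL_RTX, 1, OPTAB_LIB_WIDEN);

    /* addr = table + ix * 4; force_operand legitimizes the address
       arithmetic on targets without fancy addressing modes.  */
    rtx off = expand_binop (SImode, ashl_optab, ix, GEN_INT (2),
                            NULL_RTX, 1, OPTAB_LIB_WIDEN);
    rtx addr = force_operand (gen_rtx_PLUS (Pmode, table_sym,
                                            convert_to_mode (Pmode, off, 1)),
                              NULL_RTX);
    rtx entry = gen_rtx_MEM (SImode, addr);   /* no MEM attrs set here */

    /* crc' = (crc >> 8) ^ table entry  */
    rtx shifted = expand_binop (SImode, lshr_optab, crc, GEN_INT (8),
                                NULL_RTX, 1, OPTAB_LIB_WIDEN);
    return expand_binop (SImode, xor_optab, shifted, entry,
                         NULL_RTX, 1, OPTAB_LIB_WIDEN);
  }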


jeff
