Re: RISC-V: Added support for CRC.

2023-09-27 Thread Jeff Law




On 9/26/23 12:56, Joern Rennecke wrote:




What ultimately pushed us to keep moving forward on this effort was
discovering numerous CRC loop implementations out in the wild, including
4 implementations (IIRC) in the kernel itself.


I have always assumed that such must exist (CRCs are useful for a number
of problems, and besides, they wouldn't have been included in coremark as
a part of the workload if they were not relevant), but it is good to have
confirmation, and even better to have code that can detect and analyse a
entire class of idioms that is in such widespread use.
I was personally surprised at how many we found.   While there were a 
bunch of table generation routines which obviously aren't at all 
interesting, there were enough in the cases we analyzed that it 
convinced me this wasn't just catering to a benchmark.




This still leaves room for further improvements, like detecting fully-unrolled
code, table lookup, or slice-by-N, and replacing them with better
target-optimized code where this is indicated by the optimization flags to
save execution time or code/rodata size.  Not something we have to tackle
now, but just because we don't do it now, doesn't mean we couldn't address
these in the future if that appears worthwhile.
Absolutely.  In fact, I have encouraged Mariam to capture some of the 
cases we can't currently handle in the testsuite, essentially building a 
bit of a TODO list.






I can easily see creating a clmul RTL opcode for targets which support
it and hoisting the clmul vs lookup table selection into generic code.
I'm still pondering if we're likely to ever see cases where we want a
vector clmul intrinsic or support in the autovectorizer for clmul.
We've certainly got targets with vector clmul in the ISA, the question
is using it.


If we aim for optimal code, I think it more likely that we want to detect a
block CRC computation, and have a target expander decide to do that
with inline code or a library call that uses vectorized clmul.  At the time
we add such block-CRC expansion code, it also makes sense to add a
builtin for block CRC so that new GNU C programs can directly request
that functionality without having to go through the cargo cult of matching
a supported idiom.
And I think we've got a structure which largely allows further 
improvements, both in the recognition/validation step and in the code 
expansion step.




Now, the library might be written in GNU C, and for that it might be useful
to have a vector clmul intrinsic so that we can express this algorithm more
easily.
Agreed.  It's also worth nothing that LLVM has a clmul in their IL and I 
suspect they expose it via a builtin/intrinsic.  I'd expect we'll 
ultimately end up in the same boat.





Probably the biggest task in that space right now is to see if we can
avoid the symbolic execution engine by re-using ranger.


I'll be interested to see what you'll come up with, but if reverting to the
symbolic execution engine, the compile time cost isn't much if you only
use it for a proper match.  So whatever heuristics are used before deciding
to use the engine matter.  Can all the cases detected by the engine be
recognized as a loop with a reduction?  We might use different heuristics
for different optimization levels, i.e. allow more false negatives at -O1,
and more false positives at -O2 / -fexpensive-optimizations.
It's mostly a desire not to add (yet another) symbolic execution engine 
to GCC.  We've already got the engine for CCP as well as symbolic 
execution capabilities in Ranger.  I'd like to avoid adding another if 
we can do so.


For a LFSR validation step we need to track 4 potential states for each 
bit in an object.  0, 1, x, !x where "x" is the state of the bit from a 
different object.  If it was just tracking 0, 1, x, !x for an entire 
object, Ranger is probably already do that.  But we need to do it for 
each bit within an object.


We haven't seen compile-time cost be a real issue.  But we also haven't 
looked too closely at that problem.




While I concur that we want existing code to be able to take advantage of
this optimization by gcc recognizing whatever method the code uses to
compute CRC (within reason, we are still bound by the laws of
computability and resource constraints for the compilation), I would
like to stress that I don't think the builtin will use its value over time.
It can be used in tests to make sure we exercise the code generation for the
internal function.  It can be used in new GNU C code to make it easier to
do a CRC computation without worrying about the implementation.  If
an accord with other major compiler suppliers (e.g. the LLVM community)
is reached, it might even see more widespread use.
Which somewhat dovetails with Alex's position -- namely that it's not 
that value.  Hence the de-emphasis on this approach.  We'll continue to 
focus on the end-to-end solution.


We may still want a clmul as an RTL opcode and builtins to utilize it.




Which is 

Re: RISC-V: Added support for CRC.

2023-09-26 Thread Joern Rennecke
On Tue, 26 Sept 2023 at 14:18, Jeff Law  wrote:

>  But the Coremark code is what it is.  This isn't a whole lot
> different than the work in the 90s which rewrote loops and compromised
> some of the spec benchmarks, or the eqntott hack to simplify that one
> key loop in eqntott.

I think the stated purpose of the benchmark matters.  If dhrystone had been
pushed as an abstraction-penalty benchmark, it would have been fine to
present results with WPA, inlining and dead code elimination as ordinary
dhrystone results.  But since it's supposed to exercise specific hardware
features, and not have the tests for these optimized away, that's not
appropriate.

So, first, we make the compiled program perform the work that the benchmark
was supposed to include in the measurement, just more efficiently.
Second, we not only optimize the benchmark, but also make the target-optimized
code generation available for other programs.  For new programs targeted at
GNU C, that is minimally archived by providing a built-in function,
and in general
for new code, by being able to replicate the idiom from coremark that
is recognized
by GCC.  The mere existence of a C idiom in a popular benchmark also makes this
idiom a common idiom, if it hasn't already been that before.
As we find new patterns that are used to implement CRC which would
be better replaced with a target-specific implementation, we can add these.

This is similar to rotate operations, which are supported directly by
some processors,
and even for other targets, there are generally preferred ways to
expand the code,
but there are a couple of different variants depending on the
available instruction set,
registers, and the microarchitecture (pipeline, latency etc).  We
started out with
one patterns that was recognized, and as new patterns were identified
in C code, we
improved GCC to recognize these other patterns.

> What ultimately pushed us to keep moving forward on this effort was
> discovering numerous CRC loop implementations out in the wild, including
> 4 implementations (IIRC) in the kernel itself.

I have always assumed that such must exist (CRCs are useful for a number
of problems, and besides, they wouldn't have been included in coremark as
a part of the workload if they were not relevant), but it is good to have
confirmation, and even better to have code that can detect and analyse a
entire class of idioms that is in such widespread use.

This still leaves room for further improvements, like detecting fully-unrolled
code, table lookup, or slice-by-N, and replacing them with better
target-optimized code where this is indicated by the optimization flags to
save execution time or code/rodata size.  Not something we have to tackle
now, but just because we don't do it now, doesn't mean we couldn't address
these in the future if that appears worthwhile.

> I can easily see creating a clmul RTL opcode for targets which support
> it and hoisting the clmul vs lookup table selection into generic code.
> I'm still pondering if we're likely to ever see cases where we want a
> vector clmul intrinsic or support in the autovectorizer for clmul.
> We've certainly got targets with vector clmul in the ISA, the question
> is using it.

If we aim for optimal code, I think it more likely that we want to detect a
block CRC computation, and have a target expander decide to do that
with inline code or a library call that uses vectorized clmul.  At the time
we add such block-CRC expansion code, it also makes sense to add a
builtin for block CRC so that new GNU C programs can directly request
that functionality without having to go through the cargo cult of matching
a supported idiom.

Now, the library might be written in GNU C, and for that it might be useful
to have a vector clmul intrinsic so that we can express this algorithm more
easily.

> Probably the biggest task in that space right now is to see if we can
> avoid the symbolic execution engine by re-using ranger.

I'll be interested to see what you'll come up with, but if reverting to the
symbolic execution engine, the compile time cost isn't much if you only
use it for a proper match.  So whatever heuristics are used before deciding
to use the engine matter.  Can all the cases detected by the engine be
recognized as a loop with a reduction?  We might use different heuristics
for different optimization levels, i.e. allow more false negatives at -O1,
and more false positives at -O2 / -fexpensive-optimizations.

> To reiterate the real goal here is to take code as-is and make it
> significantly faster.  While the original target was Coremark, we've
> found similar bitwise implementations of CRCs all over the place.
> There's no good reason that code should have to change.
>
> The idea of exposing a CRC builtin was an intermediate step that would
> allow those willing to change their code or writing new code to write
> their CRC in a trivial way and let the compiler figure out a sensible
> implementation while we clean up the 

Re: RISC-V: Added support for CRC.

2023-09-26 Thread Alexander Monakov


On Tue, 26 Sep 2023, Jeff Law wrote:

> What ultimately pushed us to keep moving forward on this effort was
> discovering numerous CRC loop implementations out in the wild, including 4
> implementations (IIRC) in the kernel itself.

The kernel employs bitwise CRC only in look-up table generators.
Which run at build time. You are quite literally slowing down the compiler
in order to speed up generators that don't account for even one millisecond
of kernel build time, and have no relation to its run-time performance.

(incidentally you can't detect the actual CRC impls using those tables)

> And as I've stated before, the latency of clmuls is dropping.   I wouldn't be
> terribly surprised to see single cycle clmul implmementations showing up
> within the next 18-24 months.  It's really just a matter of gate budget vs
> expected value.

In a commercial implementation? I'll take that bet. You spend gates budget
like that after better avenues for raising ILP are exhausted (like adding
more ALUs that can do clmul at a reasonable 3c-4c latency).

> To reiterate the real goal here is to take code as-is and make it
> significantly faster.

Which code? Table generators in the kernel and xz-utils? 

> While the original target was Coremark, we've found similar bitwise
> implementations of CRCs all over the place. There's no good reason that code
> should have to change.

But did you look at them? There's no point to optimize table generators either.

Alexander


Re: RISC-V: Added support for CRC.

2023-09-26 Thread Jeff Law




On 9/23/23 17:05, Joern Rennecke wrote:

Mariam Harutyunyan:
+++ b/gcc/ChangeLog
@@ -1,3 +1,45 @@
+2023-08-03  Mariam Arutunian  
+

It is common courtesy to include all authors in the list of authors
for the ChangeLog; also,
this may help people in the future understand the history of the code better.
While must of your patch is new, it still contains non-trivial parts of mine
( https://gcc.gnu.org/pipermail/gcc-patches/2022-March/591744.html )
.And stripping out the comment why, currently,  we can't use linkonce
for crc tables on the the RISC-V target is
not helpful to someone who wants to understand the code.
Thanks for pointing this out Joern.  Neither Mariam nor I were aware 
that some of your code was used to seed this work.  We are happy to 
include you as a co-author.



Mariam -- the way to do this is to add a "Co-Author:" line to your 
commit message.  If you look at upstream commit:


5ff4431675c0d0c800d4a983254e94a6b401c14d

Shows the right format.




See also the discussion to put this into loop distribution:
https://gcc.gnu.org/pipermail/gcc-patches/2022-March/591821.html > 
https://gcc.gnu.org/pipermail/gcc-patches/2022-March/591866.html
Yes, this was the gist of the discussion I had with Richi.  CRC 
optimization has a lot in common with distribution, reduction and final 
value replacement.


For early testing and development we focused on detecting a CRC function 
and replacing it at a function level.  THat allowed Mariam to avoid a 
class a problems for a while and focus on an end-to-end solution.  Once 
that was working reasonably well Mariam went back and enhanced the first 
pass filter and validation so that it would work if the CRC loop was 
embedded inside a larger function either at the source level or due to 
inlining.


The placement we chose was just after loop distribution.  I think 
there's some wiggle room in terms of pass placement.


The one thing I'm hesitant to do is make this part of an existing pass. 
I dont see many benefits of integrating inside loop dist and I very much 
like its clean separation.





Mariam Harutyunyan:

It adds internal
functions and built-ins specifically designed to handle CRC computations
efficiently.


This sounds like this is a finished work, although defining built-in
functions to calculate the CRC of single data elements and recognizers
for some C idioms that do these calculations,
is just a starting point.
More precisely it was carved out of a larger piece of work in the hopes 
that it would be useful independently and allow upstreaming of the 
backend work.  Alexander was fairly dismissive of the utility of doing 
that.  There was some interest from Cauldron attendees to potentially 
reviving that idea.





Alexander Monakov :


Jeff, as I understand this all is happening only because Coremark contains
use of bitwise CRC that affects benchmark scores. In another universe where
- Coremark was careful to checksum outputs outside of timed sections, or
- implemented CRC in a manner that is not transparent to the compiler, or
- did not use CRC at all
we would not be spending effort on this, correct?


It is a stated goal of coremark to test performance for CRC.  They do
not use a library call
to implement CRC, but a specific bit-banging algorithm they have
chosen.  That algorithm is,
for the vast majority of processors, not representative of the targets
performance potential in calculating CRCs, thus if a compiler fails to
translate this into the CRC implementation that
would be used for performance code, the compiler frustrates this goal
of coremark to give a measure of CRC calculation performance.All true.  But the Coremark code is what it is.  This isn't a whole lot 
different than the work in the 90s which rewrote loops and compromised 
some of the spec benchmarks, or the eqntott hack to simplify that one 
key loop in eqntott.


What ultimately pushed us to keep moving forward on this effort was 
discovering numerous CRC loop implementations out in the wild, including 
4 implementations (IIRC) in the kernel itself.






At best we might have
a discussion on providing a __builtin_clmul for carry-less multiplication
(which _is_ a fundamental primitive, unlike __builtin_crc), and move on.


Some processors have specialized instructions for CRC computations.
They do.  But I think the discussion about providing intrinsic access to 
a clmul instruction is orthogonal to the discussion about whether or not 
to provide a builtin for crc.


I can easily see creating a clmul RTL opcode for targets which support 
it and hoisting the clmul vs lookup table selection into generic code. 
I'm still pondering if we're likely to ever see cases where we want a 
vector clmul intrinsic or support in the autovectorizer for clmul. 
We've certainly got targets with vector clmul in the ISA, the question 
is using it.






Instead, efficient CRC loops have the following structure:
- they carry an unreduced remainder in the loop, performing final reduction
  

Re: RISC-V: Added support for CRC.

2023-09-26 Thread Oleg Endo
On Sun, 2023-09-24 at 00:05 +0100, Joern Rennecke wrote:
> 
> Although maybe Oleg Endo's library, as mentioned in
> https://gcc.gnu.org/pipermail/gcc-patches/2022-March/591748.html ,
> might be suitable?  What is the license for that?
> 
> 

I haven't published the library, but I think I could do that.

It's a C++-14 header-only thing and uses templates + constexpr to generate
the .rodata lookup tables.  It's convenient for an application project, as
it doesn't require any generator tool in the build.  This might be not a big
advantage in the context of GCC.

Since the tables are computed during compile-time, there is no particular
optimization implemented.  The run-time function is also nothing fancy:

static constexpr uint8_t table_index (value_type rem, uint8_t x)
{
  if (ReflectInput)
return x ^ rem;
  else
return x ^ (BitCount > 8 ? (rem >> (BitCount - 8))
 : (rem << (8 - BitCount)));
}

static constexpr value_type shift (value_type rem)
{
  return ReflectInput ? rem >> 8 : rem << 8;
}

static value_type
default_process_bytes (value_type rem, const uint8_t* in, const uint8_t* in_end)
{
  for (; in != in_end; ++in)
  {
auto i = table_index (rem, *in);
rem = table[i] ^ shift (rem);
  }
  return rem;
}

Anyway, let me know if anyone is interested.

Cheers,
Oleg


Re: RISC-V: Added support for CRC.

2023-09-26 Thread Mariam Harutyunyan
On Sun, Sep 24, 2023, 00:05 Joern Rennecke 
wrote:

> Mariam Harutyunyan:
> +++ b/gcc/ChangeLog
> @@ -1,3 +1,45 @@
> +2023-08-03  Mariam Arutunian  
> +
>
> It is common courtesy to include all authors in the list of authors
> for the ChangeLog; also,
> this may help people in the future understand the history of the code
> better.
> While must of your patch is new, it still contains non-trivial parts of
> mine


Sorry, I didn't know that when using others' code snippets, we should
include their names. I'll add your name.
Although, the code has been removed from the new patch version, but your
work was really helpful.

Thank you,
Mariam


Re: RISC-V: Added support for CRC.

2023-09-24 Thread Joern Rennecke
On Sun, 24 Sept 2023 at 12:41, Alexander Monakov  wrote:
>
>
> On Sun, 24 Sep 2023, Joern Rennecke wrote:
>
> > It is a stated goal of coremark to test performance for CRC.
>
> I would expect a good CRC benchmark to print CRC throughput in
> bytes per cycle or megabytes per second.
>
> I don't see where Coremark states that goal. In the readme at
> https://github.com/eembc/coremark/blob/main/README.md
> it enumerates the three subcategories (linked list, matrix ops,
> state machine) and indicates that CRC is used for validation.

At https://www.eembc.org/coremark/index.php , they state under the
Details heading:

...
Replacing the antiquated Dhrystone benchmark, Coremark contains
implementations of the following algorithms: list processing (find and
sort), matrix manipulation (common matrix operations), state machine
(determine if an input stream contains valid numbers), and CRC (cyclic
redundancy check).
...
The CRC algorithm serves a dual function; it provides a workload
commonly seen in embedded applications and ensures correct operation
of the CoreMark benchmark, essentially providing a self-checking
mechanism.
...

They also point to a whitepaper there, which states:

Since CRC is also a commonly used function in embedded applications, this
calculation is included in the timed portion of the CoreMark.

> If it claims that elsewhere, the way its code employs CRC does not
> correspond to real-world use patterns, like in the Linux kernel for
> protocol and filesystem checksumming, or decompression libraries.

That may be so, but we should still strive to optimize the code to
obtain the intended purpose.

> It is, however, representative of the target CPU's ability to run
> those basic bitwise ops with good overlap with the rest of computation,
> which is far more relevant for the real-world performance of the CPU.

That depends on how much CRC calculation your application does.  You can
disable specific compiler optimizations in GCC for specialized testing.
> > thus if a compiler fails to translate this into the CRC implementation
> > that would be used for performance code, the compiler frustrates this
> > goal of coremark to give a measure of CRC calculation performance.
>
> Are you seriously saying that if a customer chooses CPU A over CPU B
> based on Coremark scores, and then discovers that actual performance
> in, say, zlib (which uses slice-by-N for CRC) is better on CPU B, that's
> entirely fair and the benchmarks scores they saw were not misleading?

Using coremark as a yardstick for any one application is always going to be
likely to give an inaccurate assessment - unless your application is
identical to
coremark.  I don't see why whatever implementation is chosen for the
short-length
CRC in coremark should be closer or farther from slice-by-N CRC, I would expect
it to be pseudo-random.  Unless CPU B has worse GCC support or
available hardware
instruction for short-range CRC, in which case the manufacturer might
considering
improving support (particularily if it's about GCC support ;-)
Actually, if CRC optimization is implemented via table lookup, on both
CPU A and B,
it gets a bit closer to slice-by-N, since both do table lookups,
although for slice-by-N you
trade latency for register pressure.

Any single benchmark can't be a good performance predictor for all applications.
If you care a lot about performance for a particular load, you should
benchmark that load,
or something that is known to be a close proxy.

> > > At best we might have
> > > a discussion on providing a __builtin_clmul for carry-less multiplication
> > > (which _is_ a fundamental primitive, unlike __builtin_crc), and move on.
> >
> > Some processors have specialized instructions for CRC computations.
>
> Only for one or two fixed polynomials. For that matter, some processors
> have instructions for AES and SHA, but that doesn't change that clmul is
> a more fundamental and flexible primitive than "CRC".

So it is, but when analyzing user programs that haven't been written by experts
with a focus on performance, CRC is more likely to come up than clmul .
I agree that it would make sense to have a builtin for clmul that can be used
uniformly across architectures that support this operation, but I'm
not volunteering
to write a patch for that.

> If only the "walk before you run" logic was applied in favor of
> implementing a portable clmul builtin prior to all this.

I started writing the CRC patch for an architecture that didn't have
clmul as a basecase
instruction, so a clmul builtin would not have helped.
>
> > A library can be used to implement built-ins in gcc (we still need to
> > define one for block operations, one step at a time...).  However,
> > someone or something needs to rewrite the existing code to use the
> > library.  It is commonly accepted that an efficient way to do this is
> > to make a compiler do the necessary transformations, as long as it can
> > be made to churn out good enough code.
>
> How does 

Re: RISC-V: Added support for CRC.

2023-09-24 Thread Alexander Monakov


On Sun, 24 Sep 2023, Joern Rennecke wrote:

> It is a stated goal of coremark to test performance for CRC.

I would expect a good CRC benchmark to print CRC throughput in
bytes per cycle or megabytes per second.

I don't see where Coremark states that goal. In the readme at
https://github.com/eembc/coremark/blob/main/README.md
it enumerates the three subcategories (linked list, matrix ops,
state machine) and indicates that CRC is used for validation.

If it claims that elsewhere, the way its code employs CRC does not
correspond to real-world use patterns, like in the Linux kernel for
protocol and filesystem checksumming, or decompression libraries.

> They do not use a library call to implement CRC, but a specific
> bit-banging algorithm they have chosen.  That algorithm is, for the
> vast majority of processors, not representative of the targets
> performance potential in calculating CRCs,

It is, however, representative of the target CPU's ability to run
those basic bitwise ops with good overlap with the rest of computation,
which is far more relevant for the real-world performance of the CPU.

> thus if a compiler fails to translate this into the CRC implementation
> that would be used for performance code, the compiler frustrates this
> goal of coremark to give a measure of CRC calculation performance.

Are you seriously saying that if a customer chooses CPU A over CPU B
based on Coremark scores, and then discovers that actual performance
in, say, zlib (which uses slice-by-N for CRC) is better on CPU B, that's
entirely fair and the benchmarks scores they saw were not misleading?

> > At best we might have
> > a discussion on providing a __builtin_clmul for carry-less multiplication
> > (which _is_ a fundamental primitive, unlike __builtin_crc), and move on.
> 
> Some processors have specialized instructions for CRC computations.

Only for one or two fixed polynomials. For that matter, some processors
have instructions for AES and SHA, but that doesn't change that clmul is
a more fundamental and flexible primitive than "CRC".

> If you want to recognize a loop that does a CRC on a block, it makes
> sense to start with recognizing the CRC computation for single array
> elements first.  We have to learn to walk before we can run.

If only the "walk before you run" logic was applied in favor of
implementing a portable clmul builtin prior to all this.

> A library can be used to implement built-ins in gcc (we still need to
> define one for block operations, one step at a time...).  However,
> someone or something needs to rewrite the existing code to use the
> library.  It is commonly accepted that an efficient way to do this is
> to make a compiler do the necessary transformations, as long as it can
> be made to churn out good enough code.

How does this apply to the real world? Among CRC implementations in the
Linux kernel, ffmpeg, zlib, bzip2, xz-utils, and zstd I'm aware of only
a single instance where bitwise CRC is used. It's in the table
initialization function in xz-utils. The compiler would transform that
to copying one table into another. Is that a valuable transform?

> Alexander Monakov:
> > Useful to whom? The Linux kernel? zlib, bzip2, xz-utils? ffmpeg?
> > These consumers need high-performance blockwise CRC, offering them
> > a latency-bound elementwise CRC primitive is a disservice. And what
> > should they use as a fallback when __builtin_crc is unavailable?
> 
> We can provide a fallback implementation for all targets with table
> lookup and/or shifts .

How would it help when they are compiled with LLVM, or GCC version
earlier than 14?

Alexander


Re: RISC-V: Added support for CRC.

2023-09-23 Thread Joern Rennecke
Mariam Harutyunyan:
+++ b/gcc/ChangeLog
@@ -1,3 +1,45 @@
+2023-08-03  Mariam Arutunian  
+

It is common courtesy to include all authors in the list of authors
for the ChangeLog; also,
this may help people in the future understand the history of the code better.
While must of your patch is new, it still contains non-trivial parts of mine
( https://gcc.gnu.org/pipermail/gcc-patches/2022-March/591744.html )
.And stripping out the comment why, currently,  we can't use linkonce
for crc tables on the the RISC-V target is
not helpful to someone who wants to understand the code.

See also the discussion to put this into loop distribution:
https://gcc.gnu.org/pipermail/gcc-patches/2022-March/591821.html
https://gcc.gnu.org/pipermail/gcc-patches/2022-March/591866.html

Mariam Harutyunyan:
> It adds internal
> functions and built-ins specifically designed to handle CRC computations
> efficiently.

This sounds like this is a finished work, although defining built-in
functions to calculate the CRC of single data elements and recognizers
for some C idioms that do these calculations,
is just a starting point.

Alexander Monakov :

> Jeff, as I understand this all is happening only because Coremark contains
> use of bitwise CRC that affects benchmark scores. In another universe where
> - Coremark was careful to checksum outputs outside of timed sections, or
> - implemented CRC in a manner that is not transparent to the compiler, or
> - did not use CRC at all
> we would not be spending effort on this, correct?

It is a stated goal of coremark to test performance for CRC.  They do
not use a library call
to implement CRC, but a specific bit-banging algorithm they have
chosen.  That algorithm is,
for the vast majority of processors, not representative of the targets
performance potential in calculating CRCs, thus if a compiler fails to
translate this into the CRC implementation that
would be used for performance code, the compiler frustrates this goal
of coremark to give a measure of CRC calculation performance.

> At best we might have
> a discussion on providing a __builtin_clmul for carry-less multiplication
> (which _is_ a fundamental primitive, unlike __builtin_crc), and move on.

Some processors have specialized instructions for CRC computations.

> Instead, efficient CRC loops have the following structure:
> - they carry an unreduced remainder in the loop, performing final reduction
>  modulo polynomial only once after the loop — this halves the CLMUL count;
> - they overlap multiple CLMUL chains to make the loop throughput-bound
> rather than latency-bound. The typical unroll factor is about 4x-8x.

If you want to recognize a loop that does a CRC on a block, it makes
sense to start with recognizing the CRC computation for single array
elements first.  We have to learn to
walk before we can run.

Nore that my initial patch already contained a match.pd stanza to
recognize two chained single-element CRC calculations.

Jeff Law:
> The intention is to provide a useful
> builtin_crc while at the same time putting one side of the
> infrastructure we need for automatic detection of CRC loops and turning
> them into table lookups or CLMULs.

Note that when optimizing for size, for a target with tiny memory, or
when using a non-constant (or constant but undiscoverable by the
compiler) polynom, we can't use the table lookup.  But still, even if
we don't have CLmul, we can generally speed up CRC computation over
the coremark algorithm by using something more suited to the target,
like the crcu16_1 function I
put into comments in my patch.

Alexander Monakov :
> So... just provide a library? A library code is easier to develop and audit,
> it can be released independently, people can use it with their compiler of
> choice. Not everything needs to be in libgcc.

A library can be used to implement built-ins in gcc (we still need to
define one for block operations, one step at a time...).
However, someone or something needs to rewrite the existing code to
use the library.
It is commonly accepted that an efficient way to do this is to make a
compiler do the
necessary transformations, as long as it can be made to churn out good
enough code.

Alexander Monakov:
> Useful to whom? The Linux kernel? zlib, bzip2, xz-utils? ffmpeg?
> These consumers need high-performance blockwise CRC, offering them
> a latency-bound elementwise CRC primitive is a disservice. And what
> should they use as a fallback when __builtin_crc is unavailable?

We can provide a fallback implementation for all targets with table
lookup and/or shifts .

Alexander Monakov:
> I think offering a conventional library for CRC has substantial advantages.
Are you volunteering?  It would make our work to emit code for block
CRC easier if we could
just use a library call when we recognize a block CRC (although making
that recognizer is likely still considerable work if we want to get
good coverage over different coding styles).

Although maybe Oleg Endo's library, as mentioned 

Re: RISC-V: Added support for CRC.

2023-08-21 Thread Mariam Harutyunyan via Gcc-patches
Thank you for the review.
I'm already working on suggested changes.
The answers to the few questions are attached here.

Thanks,
Mariam

On Wed, Aug 16, 2023 at 8:59 AM Jeff Law  wrote:

>
>
> On 8/3/23 13:37, Mariam Harutyunyan via Gcc-patches wrote:
> > This patch adds CRC support for the RISC-V architecture. It adds internal
> > functions and built-ins specifically designed to handle CRC computations
> > efficiently.
> >
> > If the target is ZBC, the clmul instruction is used for the CRC code
> > generation; otherwise, table-based CRC is generated.  A table with 256
> > elements is used to store precomputed CRCs.
> >
> > These CRC calculation algorithms have higher performance than the naive
> CRC
> > calculation algorithm.
> [ ... ]
> Various comments attached.
>


reply-to-comments
Description: Binary data


Re: RISC-V: Added support for CRC.

2023-08-17 Thread Alexander Monakov


On Wed, 16 Aug 2023, Philipp Tomsich wrote:

> > > I fully expect that latency to drop within the next 12-18 months.  In that
> > > world, there's not going to be much benefit to using hand-coded libraries 
> > > vs
> > > just letting the compiler do it.
> 
> I would also hope that the hand-coded libraries would eventually have
> a code path for compilers that support the built-in.

You seem to be working with the false assumption that the interface of the
proposed builtin matches how high-performance CRC computation is structured.
It is not. State-of-the-art CRC keeps unreduced intermediate residual, split
over multiple temporaries to allow overlapping CLMULs in the CPU. The
intermediate residuals are reduced only once, when the final CRC value is
needed. In constrast, the proposed builtin has data dependencies between
all adjacent instructions, and cannot allow the CPU to work at IPC > 1.0.

Shame how little you apparently understand of the "mindbending math".

Alexander


Re: RISC-V: Added support for CRC.

2023-08-16 Thread Jeff Law via Gcc-patches




On 8/16/23 13:10, Alexander Monakov wrote:


On Tue, 15 Aug 2023, Jeff Law wrote:


Because if the compiler can optimize it automatically, then the projects have
to do literally nothing to take advantage of it.  They just compile normally
and their bitwise CRC gets optimized down to either a table lookup or a clmul
variant.  That's the real goal here.


The only high-profile FOSS project that carries a bitwise CRC implementation
I'm aware of is the 'xz' compression library. There bitwise CRC is used for
populating the lookup table under './configure --enable-small':

https://github.com/tukaani-project/xz/blob/2b871f4dbffe3801d0da3f89806b5935f758d5f3/src/liblzma/check/crc64_small.c

It's a well-reasoned choice and your compiler would be undoing it
(reintroducing the table when the bitwise CRC is employed specifically
to avoid carrying the table).
If they don't want the table variant, there would obviously be ways to 
turn that off.   It's essentially no different than any speed improving 
optimization that makes things larger.






One final note.  Elsewhere in this thread you described performance concerns.
Right now clmuls can be implemented in 4c, fully piped.


Pipelining doesn't matter in the implementation being proposed here, because
the builtin is expanded to

li  a4,quotient
li  a5,polynomial
xor a0,a1,a0
clmul   a0,a0,a4
srlia0,a0,crc_size
clmul   a0,a0,a5
sllia0,a0,GET_MODE_BITSIZE (word_mode) - crc_size
srlia0,a0,GET_MODE_BITSIZE (word_mode) - crc_size




making CLMULs data-dependent, so the second can only be started one cycle
after the first finishes, and consecutive invocations of __builtin_crc
are likewise data-dependent (with three cycles between CLMUL). So even
when you get CLMUL down to 3c latency, you'll have two CLMULs and 10 cycles
per input block, while state of the art is one widening CLMUL per input block
(one CLMUL per 32-bit block on a 64-bit CPU) limited by throughput, not latency.

I expect it'll actually be 2c latency.   We're approaching the point 
where it just won't make that much sense to call out to a library when 
you can emit the pair of clmuls and a couple shifts.


jeff


Re: RISC-V: Added support for CRC.

2023-08-16 Thread Paul Koning via Gcc-patches



> On Aug 16, 2023, at 3:42 PM, Philipp Tomsich  wrote:
> 
> On Wed, 16 Aug 2023 at 21:10, Alexander Monakov  wrote:
>> 
>> 
>> On Tue, 15 Aug 2023, Jeff Law wrote:
>> 
>>> Because if the compiler can optimize it automatically, then the projects 
>>> have
>>> to do literally nothing to take advantage of it.  They just compile normally
>>> and their bitwise CRC gets optimized down to either a table lookup or a 
>>> clmul
>>> variant.  That's the real goal here.
>> 
>> The only high-profile FOSS project that carries a bitwise CRC implementation
>> I'm aware of is the 'xz' compression library. There bitwise CRC is used for
>> populating the lookup table under './configure --enable-small':
>> 
>> https://github.com/tukaani-project/xz/blob/2b871f4dbffe3801d0da3f89806b5935f758d5f3/src/liblzma/check/crc64_small.c
>> 
>> It's a well-reasoned choice and your compiler would be undoing it
>> (reintroducing the table when the bitwise CRC is employed specifically
>> to avoid carrying the table).

Is that compiled with -Os?  It would seem sensible for that to be the case, and 
for the table optimization to be suppressed if that switch is used.

paul




Re: RISC-V: Added support for CRC.

2023-08-16 Thread Philipp Tomsich
On Wed, 16 Aug 2023 at 21:10, Alexander Monakov  wrote:
>
>
> On Tue, 15 Aug 2023, Jeff Law wrote:
>
> > Because if the compiler can optimize it automatically, then the projects 
> > have
> > to do literally nothing to take advantage of it.  They just compile normally
> > and their bitwise CRC gets optimized down to either a table lookup or a 
> > clmul
> > variant.  That's the real goal here.
>
> The only high-profile FOSS project that carries a bitwise CRC implementation
> I'm aware of is the 'xz' compression library. There bitwise CRC is used for
> populating the lookup table under './configure --enable-small':
>
> https://github.com/tukaani-project/xz/blob/2b871f4dbffe3801d0da3f89806b5935f758d5f3/src/liblzma/check/crc64_small.c
>
> It's a well-reasoned choice and your compiler would be undoing it
> (reintroducing the table when the bitwise CRC is employed specifically
> to avoid carrying the table).
>
> > One final note.  Elsewhere in this thread you described performance 
> > concerns.
> > Right now clmuls can be implemented in 4c, fully piped.
>
> Pipelining doesn't matter in the implementation being proposed here, because
> the builtin is expanded to
>
>li  a4,quotient
>li  a5,polynomial
>xor a0,a1,a0
>clmul   a0,a0,a4
>srlia0,a0,crc_size
>clmul   a0,a0,a5
>sllia0,a0,GET_MODE_BITSIZE (word_mode) - crc_size
>srlia0,a0,GET_MODE_BITSIZE (word_mode) - crc_size
>
> making CLMULs data-dependent, so the second can only be started one cycle
> after the first finishes, and consecutive invocations of __builtin_crc
> are likewise data-dependent (with three cycles between CLMUL). So even
> when you get CLMUL down to 3c latency, you'll have two CLMULs and 10 cycles
> per input block, while state of the art is one widening CLMUL per input block
> (one CLMUL per 32-bit block on a 64-bit CPU) limited by throughput, not 
> latency.
>
> > I fully expect that latency to drop within the next 12-18 months.  In that
> > world, there's not going to be much benefit to using hand-coded libraries vs
> > just letting the compiler do it.

I would also hope that the hand-coded libraries would eventually have
a code path for compilers that support the built-in.
For what it's worth, there now is CRC in Boost:
https://www.boost.org/doc/libs/1_83_0/doc/html/crc.html

Cheers,
philipp.


Re: RISC-V: Added support for CRC.

2023-08-16 Thread Alexander Monakov


On Tue, 15 Aug 2023, Jeff Law wrote:

> Because if the compiler can optimize it automatically, then the projects have
> to do literally nothing to take advantage of it.  They just compile normally
> and their bitwise CRC gets optimized down to either a table lookup or a clmul
> variant.  That's the real goal here.

The only high-profile FOSS project that carries a bitwise CRC implementation
I'm aware of is the 'xz' compression library. There bitwise CRC is used for
populating the lookup table under './configure --enable-small':

https://github.com/tukaani-project/xz/blob/2b871f4dbffe3801d0da3f89806b5935f758d5f3/src/liblzma/check/crc64_small.c

It's a well-reasoned choice and your compiler would be undoing it
(reintroducing the table when the bitwise CRC is employed specifically
to avoid carrying the table).

> One final note.  Elsewhere in this thread you described performance concerns.
> Right now clmuls can be implemented in 4c, fully piped.

Pipelining doesn't matter in the implementation being proposed here, because
the builtin is expanded to

   li  a4,quotient
   li  a5,polynomial
   xor a0,a1,a0
   clmul   a0,a0,a4
   srlia0,a0,crc_size
   clmul   a0,a0,a5
   sllia0,a0,GET_MODE_BITSIZE (word_mode) - crc_size
   srlia0,a0,GET_MODE_BITSIZE (word_mode) - crc_size

making CLMULs data-dependent, so the second can only be started one cycle
after the first finishes, and consecutive invocations of __builtin_crc
are likewise data-dependent (with three cycles between CLMUL). So even
when you get CLMUL down to 3c latency, you'll have two CLMULs and 10 cycles
per input block, while state of the art is one widening CLMUL per input block
(one CLMUL per 32-bit block on a 64-bit CPU) limited by throughput, not latency.

> I fully expect that latency to drop within the next 12-18 months.  In that
> world, there's not going to be much benefit to using hand-coded libraries vs
> just letting the compiler do it.

...

Alexander


Re: RISC-V: Added support for CRC.

2023-08-15 Thread Jeff Law via Gcc-patches



On 8/3/23 13:37, Mariam Harutyunyan via Gcc-patches wrote:

This patch adds CRC support for the RISC-V architecture. It adds internal
functions and built-ins specifically designed to handle CRC computations
efficiently.

If the target is ZBC, the clmul instruction is used for the CRC code
generation; otherwise, table-based CRC is generated.  A table with 256
elements is used to store precomputed CRCs.

These CRC calculation algorithms have higher performance than the naive CRC
calculation algorithm.

[ ... ]
Various comments attached.
From 9d2e9023c222501a1d9519bea3d5cdbd32b5a91e Mon Sep 17 00:00:00 2001
From: Mariam Arutunian 
Date: Thu, 3 Aug 2023 15:59:57 +0400
Subject: [PATCH] RISC-V: Added support for CRC.

  If the target is ZBC, then the clmul instruction is used for the CRC code generation;
  otherwise, table-based CRC is generated. A table with 256 elements is used to store precomputed CRCs.

  gcc/ChangeLog:
	*builtin-types.def (BT_FN_UINT8_UINT8_UINT8_CONST_SIZE): Define.
	(BT_FN_UINT16_UINT16_UINT8_CONST_SIZE): Likewise.
	(BT_FN_UINT16_UINT16_UINT16_CONST_SIZE): Likewise.
	(BT_FN_UINT32_UINT32_UINT8_CONST_SIZE): Likewise.
	(BT_FN_UINT32_UINT32_UINT16_CONST_SIZE): Likewise.
	(BT_FN_UINT32_UINT32_UINT32_CONST_SIZE): Likewise.
	(BT_FN_UINT64_UINT64_UINT8_CONST_SIZE): Likewise.
	(BT_FN_UINT64_UINT64_UINT16_CONST_SIZE): Likewise.
	(BT_FN_UINT64_UINT64_UINT32_CONST_SIZE): Likewise.
	(BT_FN_UINT64_UINT64_UINT32_CONST_SIZE): Likewise.
	* builtins.cc (associated_internal_fn): Handle BUILT_IN_CRC8_DATA8,
	BUILT_IN_CRC16_DATA8, BUILT_IN_CRC16_DATA16,
	BUILT_IN_CRC32_DATA8, BUILT_IN_CRC32_DATA16, BUILT_IN_CRC32_DATA32,
	BUILT_IN_CRC64_DATA8, BUILT_IN_CRC64_DATA16, BUILT_IN_CRC64_DATA32,
	BUILT_IN_CRC64_DATA64.
	* builtins.def (BUILT_IN_CRC8_DATA8): New builtin.
	(BUILT_IN_CRC16_DATA8): Likewise.
	(BUILT_IN_CRC16_DATA16): Likewise.
	(BUILT_IN_CRC32_DATA8): Likewise.
	(BUILT_IN_CRC32_DATA16): Likewise.
	(BUILT_IN_CRC32_DATA32): Likewise.
	(BUILT_IN_CRC64_DATA8): Likewise.
	(BUILT_IN_CRC64_DATA16): Likewise.
	(BUILT_IN_CRC64_DATA32): Likewise.
	(BUILT_IN_CRC64_DATA64): Likewise.
	* config/riscv/bitmanip.md (crc4): New expander.
	* config/riscv/riscv-protos.h (expand_crc_table_based): Declare.
	(expand_crc_using_clmul): Likewise.
	* config/riscv/riscv.cc (gf2n_poly_long_div_quotient): New function.
	(generate_crc): Likewise.
	(generate_crc_table): Likewise.
	(expand_crc_table_based): Likewise.
	(expand_crc_using_clmul): Likewise.
	* config/riscv/riscv.md (UNSPEC_CRC): New unspec for CRC.
	* internal-fn.cc (crc_direct): Define.
	(expand_crc_optab_fn): New function.
	(direct_crc_optab_supported_p): Define.
	* internal-fn.def (CRC): New internal optab function.
	* optabs.def (crc_optab): New optab.

  gcc/testsuite/ChangeLog:
	* gcc.target/riscv/crc-builtin-table-target32.c: New test.
	* gcc.target/riscv/crc-builtin-table-target64.c: New test.
	* gcc.target/riscv/crc-builtin-zbc32.c: New test.
	* gcc.target/riscv/crc-builtin-zbc64.c: New test.
---
diff --git a/gcc/ChangeLog b/gcc/ChangeLog
index 2eab466a9f8..748d8be384b 100644
--- a/gcc/ChangeLog
+++ b/gcc/ChangeLog
In general, the ChangeLog entry is just sent like you've done above and
not included as a diff.


diff --git a/gcc/builtin-types.def b/gcc/builtin-types.def
index 43381bc8949..e33837c27d0 100644
--- a/gcc/builtin-types.def
+++ b/gcc/builtin-types.def
@@ -829,6 +829,26 @@ DEF_FUNCTION_TYPE_3 (BT_FN_PTR_SIZE_SIZE_PTRMODE,
 		 BT_PTR, BT_SIZE, BT_SIZE, BT_PTRMODE)
 DEF_FUNCTION_TYPE_3 (BT_FN_VOID_PTR_UINT8_PTRMODE, BT_VOID, BT_PTR, BT_UINT8,
 		 BT_PTRMODE)
+DEF_FUNCTION_TYPE_3 (BT_FN_UINT8_UINT8_UINT8_CONST_SIZE, BT_UINT8, BT_UINT8,
+		 BT_UINT8, BT_CONST_SIZE)
[ ... ]
Presumably the reason we need to many variants is due to the desire to
support various types for the inputs and outputs?


diff --git a/gcc/config/riscv/bitmanip.md b/gcc/config/riscv/bitmanip.md
index c42e7b890db..4c896303242 100644
--- a/gcc/config/riscv/bitmanip.md
+++ b/gcc/config/riscv/bitmanip.md
@@ -856,3 +856,38 @@
   "TARGET_ZBC"
   "clmulr\t%0,%1,%2"
   [(set_attr "type" "clmul")])
+
+;; Iterator for hardware-supported integer modes, same as ANYI
+(define_mode_iterator ANYI2 [QI HI SI (DI "TARGET_64BIT")])
+
+;; CRC 8, 16, 32, (64 for TARGET_64)
+(define_expand "crc4"
+	;; return value (calculated CRC)
+  [(set (match_operand:ANYI 0 "register_operand" "=r")
+		  ;; initial CRC
+	(unspec:ANYI [(match_operand:ANYI 1 "register_operand" "r")
+		  ;; data
+		  (match_operand:ANYI2 2 "register_operand" "r")
+		  ;; polynomial
+		  (match_operand:ANYI 3)]
+		  UNSPEC_CRC))]
I'm not real comfortable with directly supporting sub-word sizes for the
output operand.  The vast majority of insns on the risc-v port have
outputs that use either the X iterator (which maps to either SI or DI
mode for rv32 and rv64 respectively) or they

Re: RISC-V: Added support for CRC.

2023-08-15 Thread Jeff Law via Gcc-patches




On 8/9/23 07:02, Paul Koning wrote:




On Aug 9, 2023, at 2:32 AM, Alexander Monakov  wrote:


On Tue, 8 Aug 2023, Jeff Law wrote:


If the compiler can identify a CRC and collapse it down to a table or clmul,
that's a major win and such code does exist in the real world. That was the
whole point behind the Fedora experiment -- to determine if these things are
showing up in the real world or if this is just a benchmarking exercise.


Can you share the results of the experiment and give your estimate of what
sort of real-world improvement is expected? I already listed the popular
FOSS projects where CRC performance is important: the Linux kernel and
a few compression libraries. Those projects do not use a bitwise CRC loop,
except sometimes for table generation on startup (which needs less time
than a page fault that may be necessary to bring in a hardcoded table).

For those projects that need a better CRC, why is the chosen solution is
to optimize it in the compiler instead of offering them a library they
could use with any compiler?

Was there any thought given to embedded projects that use bitwise CRC
exactly because they little space for a hardcoded table to spare?


Or those that use smaller tables -- for example, the classic VAX microcode 
approach with a 16-entry table, doing CRC 4 bits at a time.
Yup.  I think we settled on 8 bits as a time for the table variant.  It 
seemed like a good tradeoff between size of the tables and speed.




I agree that this seems an odd thing to optimize.  CRC is a well known CPU hog 
with well established efficient solutions, and it's hard to see  why anyone who 
needs good performance would fail to understand and apply that knowledge.
As I've said, what started us down this path was Coremark. But what 
convinced me that this was useful beyond juicing benchmark data was 
finding the various implementations in the wild.


Jeff


Re: RISC-V: Added support for CRC.

2023-08-15 Thread Jeff Law via Gcc-patches




On 8/9/23 00:32, Alexander Monakov wrote:


On Tue, 8 Aug 2023, Jeff Law wrote:


If the compiler can identify a CRC and collapse it down to a table or clmul,
that's a major win and such code does exist in the real world. That was the
whole point behind the Fedora experiment -- to determine if these things are
showing up in the real world or if this is just a benchmarking exercise.


Can you share the results of the experiment and give your estimate of what
sort of real-world improvement is expected? I already listed the popular
FOSS projects where CRC performance is important: the Linux kernel and
a few compression libraries. Those projects do not use a bitwise CRC loop,
except sometimes for table generation on startup (which needs less time
than a page fault that may be necessary to bring in a hardcoded table).
That experiment was ~7 months ago.  I don't think any of the data is 
still around except for some extracted testcases.




For those projects that need a better CRC, why is the chosen solution is
to optimize it in the compiler instead of offering them a library they
could use with any compiler?
Because if the compiler can optimize it automatically, then the projects 
have to do literally nothing to take advantage of it.  They just compile 
normally and their bitwise CRC gets optimized down to either a table 
lookup or a clmul variant.  That's the real goal here.


If a step where we provide the backend bits hooked up to a builtin isn't 
useful, then we won't pursue it.  The thinking was it would provide 
value for those willing to make a slight change to their sources and at 
the same time we get real world exposure for the backend work of the CRC 
optimization effort while we polish the gimple detection bits.






Was there any thought given to embedded projects that use bitwise CRC
exactly because they little space for a hardcoded table to spare?
It wasn't an explicit goal, but the ability to select between a table 
implementation and a clmul implementation in the backend seemed useful, 
so we wired up both.





No, not if the compiler is not GCC, or its version is less than 14. And
those projects are not going to sacrifice their portability just for
__builtin_crc.

You may be right.   I don't think it's so clear cut. though.





I think offering a conventional library for CRC has substantial advantages.

That's not what I asked.  If you think there's room for improvement to a
builtin API, I'd love to hear it.

But it seems you don't think this is worth the effort at all.  That's
unfortunate, but if that's the consensus, then so be it.


I think it's a strange application of development effort. You'd get more
done coding a library.
Not if the end goal is to detect the CRC and optimize it into a table or 
clmul without the user having to do anything special.


Again, what we've proposed in this patch is a piece of that larger body 
of work, specifically the backend bits that we thought would have value 
independently.  If the community doesn't see that carved out chunk as 
helpful we'll table it until the whole end-to-end path is ready for 
submission.






I'll note LLVM is likely going forward with CRC detection and optimization at
some point in the next ~6 months (effectively moving the implementation from
the hexagon port into the generic parts of their loop optimizer).


I don't see CRC detection in the Hexagon port. There is a recognizer for
polynomial multiplication (CRC is division, not multiplication).
Yes, you need to the recognizer so that you can detect a CRC loop, then 
with a bit of math you turn that into a carryless multiply sequence.  I 
find the math here mindbending, but the Hexagon bits are precisely to 
optimize CRC loops.  Sadly the Hexagon bits are fairly specific to the 
CRC implementation inside coremark.  The GCC bits we've been working on 
are much more general.


One final note.  Elsewhere in this thread you described performance 
concerns.  Right now clmuls can be implemented in 4c, fully piped.  I 
fully expect that latency to drop within the next 12-18 months.  In that 
world, there's not going to be much benefit to using hand-coded 
libraries vs just letting the compiler do it.


Jeff


Re: RISC-V: Added support for CRC.

2023-08-09 Thread Paul Koning via Gcc-patches



> On Aug 9, 2023, at 2:32 AM, Alexander Monakov  wrote:
> 
> 
> On Tue, 8 Aug 2023, Jeff Law wrote:
> 
>> If the compiler can identify a CRC and collapse it down to a table or clmul,
>> that's a major win and such code does exist in the real world. That was the
>> whole point behind the Fedora experiment -- to determine if these things are
>> showing up in the real world or if this is just a benchmarking exercise.
> 
> Can you share the results of the experiment and give your estimate of what
> sort of real-world improvement is expected? I already listed the popular
> FOSS projects where CRC performance is important: the Linux kernel and
> a few compression libraries. Those projects do not use a bitwise CRC loop,
> except sometimes for table generation on startup (which needs less time
> than a page fault that may be necessary to bring in a hardcoded table).
> 
> For those projects that need a better CRC, why is the chosen solution is
> to optimize it in the compiler instead of offering them a library they
> could use with any compiler?
> 
> Was there any thought given to embedded projects that use bitwise CRC
> exactly because they little space for a hardcoded table to spare?

Or those that use smaller tables -- for example, the classic VAX microcode 
approach with a 16-entry table, doing CRC 4 bits at a time.

I agree that this seems an odd thing to optimize.  CRC is a well known CPU hog 
with well established efficient solutions, and it's hard to see  why anyone who 
needs good performance would fail to understand and apply that knowledge.

paul




Re: RISC-V: Added support for CRC.

2023-08-09 Thread Alexander Monakov


On Tue, 8 Aug 2023, Jeff Law wrote:

> If the compiler can identify a CRC and collapse it down to a table or clmul,
> that's a major win and such code does exist in the real world. That was the
> whole point behind the Fedora experiment -- to determine if these things are
> showing up in the real world or if this is just a benchmarking exercise.

Can you share the results of the experiment and give your estimate of what
sort of real-world improvement is expected? I already listed the popular
FOSS projects where CRC performance is important: the Linux kernel and
a few compression libraries. Those projects do not use a bitwise CRC loop,
except sometimes for table generation on startup (which needs less time
than a page fault that may be necessary to bring in a hardcoded table).

For those projects that need a better CRC, why is the chosen solution is
to optimize it in the compiler instead of offering them a library they
could use with any compiler?

Was there any thought given to embedded projects that use bitwise CRC
exactly because they little space for a hardcoded table to spare?

> > Useful to whom? The Linux kernel? zlib, bzip2, xz-utils? ffmpeg?
> > These consumers need high-performance blockwise CRC, offering them
> > a latency-bound elementwise CRC primitive is a disservice. And what
> > should they use as a fallback when __builtin_crc is unavailable?
> THe point is builtin_crc would always be available.  If there is no clmul,
> then the RTL backend can expand to a table lookup version.

No, not if the compiler is not GCC, or its version is less than 14. And
those projects are not going to sacrifice their portability just for
__builtin_crc.

> > I think offering a conventional library for CRC has substantial advantages.
> That's not what I asked.  If you think there's room for improvement to a
> builtin API, I'd love to hear it.
> 
> But it seems you don't think this is worth the effort at all.  That's
> unfortunate, but if that's the consensus, then so be it.

I think it's a strange application of development effort. You'd get more
done coding a library.

> I'll note LLVM is likely going forward with CRC detection and optimization at
> some point in the next ~6 months (effectively moving the implementation from
> the hexagon port into the generic parts of their loop optimizer).

I don't see CRC detection in the Hexagon port. There is a recognizer for
polynomial multiplication (CRC is division, not multiplication).

Alexander


Re: RISC-V: Added support for CRC.

2023-08-08 Thread Jeff Law via Gcc-patches




On 8/8/23 17:31, Andrew Pinski wrote:

On Tue, Aug 8, 2023 at 4:17 PM Jeff Law via Gcc-patches
 wrote:




On 8/8/23 10:38, Alexander Monakov wrote:


On Tue, 8 Aug 2023, Jeff Law wrote:


That was my thinking at one time.  Then we started looking at the distros and
found enough crc implementations in there to change my mind about the overall
utility.


The ones I'm familiar with are all table-based and look impossible to
pattern-match (and hence already fairly efficient comparable to bitwise
loop in Coremark).

We found dozens that were the usual looking loops and, IIRC ~200 table
lookups after analyzing about half of the packages in Fedora.


I will make a note we do handle table lookups to detect count leading
zeros, see check_ctz_array in tree-ssa-forwprop.cc for that detection.
(that was done to improve a SPEC benchmark even).
So if the tables are statically defined at compile time, there is
already an example of how it can be detected too.
Detecting CRC is a bit more complex than ctz/popcount and friends 
because of the field polynomial's impact on the table.


But there is some value in detecting a table version, then converting it 
to clmul.  I don't remember the data offhand, but clmul was noticeably 
better than a table.


Jeff


Re: RISC-V: Added support for CRC.

2023-08-08 Thread Andrew Pinski via Gcc-patches
On Tue, Aug 8, 2023 at 4:17 PM Jeff Law via Gcc-patches
 wrote:
>
>
>
> On 8/8/23 10:38, Alexander Monakov wrote:
> >
> > On Tue, 8 Aug 2023, Jeff Law wrote:
> >
> >> That was my thinking at one time.  Then we started looking at the distros 
> >> and
> >> found enough crc implementations in there to change my mind about the 
> >> overall
> >> utility.
> >
> > The ones I'm familiar with are all table-based and look impossible to
> > pattern-match (and hence already fairly efficient comparable to bitwise
> > loop in Coremark).
> We found dozens that were the usual looking loops and, IIRC ~200 table
> lookups after analyzing about half of the packages in Fedora.

I will make a note we do handle table lookups to detect count leading
zeros, see check_ctz_array in tree-ssa-forwprop.cc for that detection.
(that was done to improve a SPEC benchmark even).
So if the tables are statically defined at compile time, there is
already an example of how it can be detected too.

Thanks,
Andrew Pinski

>
>
> >
> > So... just provide a library? A library code is easier to develop and audit,
> > it can be released independently, people can use it with their compiler of
> > choice. Not everything needs to be in libgcc.
> If the compiler can identify a CRC and collapse it down to a table or
> clmul, that's a major win and such code does exist in the real world.
> That was the whole point behind the Fedora experiment -- to determine if
> these things are showing up in the real world or if this is just a
> benchmarking exercise.
>
> And just to be clear, we're not proposing anything for libgcc.
>
> >
> > I'm talking about factoring a long chain into multiple independent chains
> > for latency hiding.
> And that could potentially be an extension.  But even without this a
> standard looking CRC loop will be much faster using table lookups or
> simple generation with clmul.
>
> Also note that latency of clmuls is improving on modern hardware.  4c
> isn't hard to achieve and I wouldn't be surprised to see 2c clmuls in
> the near future.
>
>
> >
> > Useful to whom? The Linux kernel? zlib, bzip2, xz-utils? ffmpeg?
> > These consumers need high-performance blockwise CRC, offering them
> > a latency-bound elementwise CRC primitive is a disservice. And what
> > should they use as a fallback when __builtin_crc is unavailable?
> THe point is builtin_crc would always be available.  If there is no
> clmul, then the RTL backend can expand to a table lookup version.
>
> >
> >> while at the same time putting one side of the infrastructure we need for
> >> automatic detection of CRC loops and turning them into table lookups or
> >> CLMULs.
> >>
> >> With that in mind I'm certain Mariam & I would love feedback on a builtin 
> >> API
> >> that would be more useful.
> >
> > I think offering a conventional library for CRC has substantial advantages.
> That's not what I asked.  If you think there's room for improvement to a
> builtin API, I'd love to hear it.
>
> But it seems you don't think this is worth the effort at all.  That's
> unfortunate, but if that's the consensus, then so be it.
>
> I'll note LLVM is likely going forward with CRC detection and
> optimization at some point in the next ~6 months (effectively moving the
> implementation from the hexagon port into the generic parts of their
> loop optimizer).
>
>
>
> Jeff


Re: RISC-V: Added support for CRC.

2023-08-08 Thread Jeff Law via Gcc-patches




On 8/8/23 10:38, Alexander Monakov wrote:


On Tue, 8 Aug 2023, Jeff Law wrote:


That was my thinking at one time.  Then we started looking at the distros and
found enough crc implementations in there to change my mind about the overall
utility.


The ones I'm familiar with are all table-based and look impossible to
pattern-match (and hence already fairly efficient comparable to bitwise
loop in Coremark).
We found dozens that were the usual looking loops and, IIRC ~200 table 
lookups after analyzing about half of the packages in Fedora.





So... just provide a library? A library code is easier to develop and audit,
it can be released independently, people can use it with their compiler of
choice. Not everything needs to be in libgcc.
If the compiler can identify a CRC and collapse it down to a table or 
clmul, that's a major win and such code does exist in the real world. 
That was the whole point behind the Fedora experiment -- to determine if 
these things are showing up in the real world or if this is just a 
benchmarking exercise.


And just to be clear, we're not proposing anything for libgcc.



I'm talking about factoring a long chain into multiple independent chains
for latency hiding.
And that could potentially be an extension.  But even without this a 
standard looking CRC loop will be much faster using table lookups or 
simple generation with clmul.


Also note that latency of clmuls is improving on modern hardware.  4c 
isn't hard to achieve and I wouldn't be surprised to see 2c clmuls in 
the near future.





Useful to whom? The Linux kernel? zlib, bzip2, xz-utils? ffmpeg?
These consumers need high-performance blockwise CRC, offering them
a latency-bound elementwise CRC primitive is a disservice. And what
should they use as a fallback when __builtin_crc is unavailable?
THe point is builtin_crc would always be available.  If there is no 
clmul, then the RTL backend can expand to a table lookup version.





while at the same time putting one side of the infrastructure we need for
automatic detection of CRC loops and turning them into table lookups or
CLMULs.

With that in mind I'm certain Mariam & I would love feedback on a builtin API
that would be more useful.


I think offering a conventional library for CRC has substantial advantages.
That's not what I asked.  If you think there's room for improvement to a 
builtin API, I'd love to hear it.


But it seems you don't think this is worth the effort at all.  That's 
unfortunate, but if that's the consensus, then so be it.


I'll note LLVM is likely going forward with CRC detection and 
optimization at some point in the next ~6 months (effectively moving the 
implementation from the hexagon port into the generic parts of their 
loop optimizer).




Jeff


Re: RISC-V: Added support for CRC.

2023-08-08 Thread Alexander Monakov


On Tue, 8 Aug 2023, Jeff Law wrote:

> That was my thinking at one time.  Then we started looking at the distros and
> found enough crc implementations in there to change my mind about the overall
> utility.

The ones I'm familiar with are all table-based and look impossible to
pattern-match (and hence already fairly efficient comparable to bitwise
loop in Coremark).

> If we need to do something to make it more useful, we're certainly open to
> that.

So... just provide a library? A library code is easier to develop and audit,
it can be released independently, people can use it with their compiler of
choice. Not everything needs to be in libgcc.

> > - they overlap multiple CLMUL chains to make the loop throughput-bound
> >rather than latency-bound. The typical unroll factor is about 4x-8x.
> We do have the ability to build longer chains.  We actually use that in the
> coremark benchmark where the underlying primitives are 8-bit CRCs that are
> composed into 16/32 bit CRCs.

I'm talking about factoring a long chain into multiple independent chains
for latency hiding.

> > Hence, I am concerned that proposed __builtin_crc is not useful for FOSS
> > that actually needs high-performance CRC (the Linux kernel, compression
> > and image libraries).
> > 
> > I think this crosses the line of "cheating in benchmarks" and not something
> > we should do in GCC.
> Certianly not the intention.  The intention is to provide a useful builtin_crc

Useful to whom? The Linux kernel? zlib, bzip2, xz-utils? ffmpeg?
These consumers need high-performance blockwise CRC, offering them
a latency-bound elementwise CRC primitive is a disservice. And what
should they use as a fallback when __builtin_crc is unavailable?

> while at the same time putting one side of the infrastructure we need for
> automatic detection of CRC loops and turning them into table lookups or
> CLMULs.
> 
> With that in mind I'm certain Mariam & I would love feedback on a builtin API
> that would be more useful.

I think offering a conventional library for CRC has substantial advantages.

Alexander


Re: RISC-V: Added support for CRC.

2023-08-08 Thread Jeff Law via Gcc-patches




On 8/8/23 09:22, Alexander Monakov wrote:


Jeff, as I understand this all is happening only because Coremark contains
use of bitwise CRC that affects benchmark scores. In another universe where

- Coremark was careful to checksum outputs outside of timed sections, or
- implemented CRC in a manner that is not transparent to the compiler, or
- did not use CRC at all

we would not be spending effort on this, correct? At best we might have
a discussion on providing a __builtin_clmul for carry-less multiplication
(which _is_ a fundamental primitive, unlike __builtin_crc), and move on.
That was my thinking at one time.  Then we started looking at the 
distros and found enough crc implementations in there to change my mind 
about the overall utility.




Note that proposed __builtin_crc does not match how a high-performance CRC
over a variable-size array is implemented. You don't want to do two
back-to-back CLMULs to compute a new CRC given an old CRC. That makes your
loop latency-constrained to 2*L*N where L is latency of the CLMUL instruction
and N is the number of loop iterations.
If we need to do something to make it more useful, we're certainly open 
to that.





Instead, efficient CRC loops have the following structure:

- they carry an unreduced remainder in the loop, performing final reduction
   modulo polynomial only once after the loop — this halves the CLMUL count;

- they overlap multiple CLMUL chains to make the loop throughput-bound
   rather than latency-bound. The typical unroll factor is about 4x-8x.
We do have the ability to build longer chains.  We actually use that in 
the coremark benchmark where the underlying primitives are 8-bit CRCs 
that are composed into 16/32 bit CRCs.




A relatively easy to follow explanation is provided by Pete Cawley at
https://www.corsix.org/content/alternative-exposition-crc32_4k_pclmulqdq
(there are other sources for similar optimization of table-based CRC).

Also note that in __builtin_crc care is needed regarding how the
polynomial is specified (which term is dropped, and what bit order is used).

Absolutely.



Hence, I am concerned that proposed __builtin_crc is not useful for FOSS
that actually needs high-performance CRC (the Linux kernel, compression
and image libraries).

I think this crosses the line of "cheating in benchmarks" and not something
we should do in GCC.
Certianly not the intention.  The intention is to provide a useful 
builtin_crc while at the same time putting one side of the 
infrastructure we need for automatic detection of CRC loops and turning 
them into table lookups or CLMULs.


With that in mind I'm certain Mariam & I would love feedback on a 
builtin API that would be more useful.


jeff


Re: RISC-V: Added support for CRC.

2023-08-08 Thread Alexander Monakov


On Thu, 3 Aug 2023, Jeff Law wrote:

> The end goal here is to actually detect bitwise CRC implementations in the
> gimple optimizers and turn them into table lookups or carryless multiplies in
> RTL.
> 
> Mariam has that working end-to-end and has proposed a talk for the Cauldron on
> the topic.
> 
> The idea here is to carve out the RTL side which we think provides potential
> value to end users (the ability to use the builtin to get an performant CRC
> implementation) and to get community feedback on the implementation.

Jeff, as I understand this all is happening only because Coremark contains
use of bitwise CRC that affects benchmark scores. In another universe where

- Coremark was careful to checksum outputs outside of timed sections, or
- implemented CRC in a manner that is not transparent to the compiler, or
- did not use CRC at all

we would not be spending effort on this, correct? At best we might have
a discussion on providing a __builtin_clmul for carry-less multiplication
(which _is_ a fundamental primitive, unlike __builtin_crc), and move on.

Note that proposed __builtin_crc does not match how a high-performance CRC
over a variable-size array is implemented. You don't want to do two
back-to-back CLMULs to compute a new CRC given an old CRC. That makes your
loop latency-constrained to 2*L*N where L is latency of the CLMUL instruction
and N is the number of loop iterations.

Instead, efficient CRC loops have the following structure:

- they carry an unreduced remainder in the loop, performing final reduction
  modulo polynomial only once after the loop — this halves the CLMUL count;

- they overlap multiple CLMUL chains to make the loop throughput-bound
  rather than latency-bound. The typical unroll factor is about 4x-8x.

A relatively easy to follow explanation is provided by Pete Cawley at
https://www.corsix.org/content/alternative-exposition-crc32_4k_pclmulqdq
(there are other sources for similar optimization of table-based CRC).

Also note that in __builtin_crc care is needed regarding how the
polynomial is specified (which term is dropped, and what bit order is used).

Hence, I am concerned that proposed __builtin_crc is not useful for FOSS
that actually needs high-performance CRC (the Linux kernel, compression
and image libraries).

I think this crosses the line of "cheating in benchmarks" and not something
we should do in GCC.

Alexander


Re: RISC-V: Added support for CRC.

2023-08-03 Thread Jeff Law via Gcc-patches




On 8/3/23 13:37, Mariam Harutyunyan via Gcc-patches wrote:

This patch adds CRC support for the RISC-V architecture. It adds internal
functions and built-ins specifically designed to handle CRC computations
efficiently.

If the target is ZBC, the clmul instruction is used for the CRC code
generation; otherwise, table-based CRC is generated.  A table with 256
elements is used to store precomputed CRCs.

These CRC calculation algorithms have higher performance than the naive CRC
calculation algorithm.
So just a little bit of additional information for others that might be 
looking at this thread.


The end goal here is to actually detect bitwise CRC implementations in 
the gimple optimizers and turn them into table lookups or carryless 
multiplies in RTL.


Mariam has that working end-to-end and has proposed a talk for the 
Cauldron on the topic.


The idea here is to carve out the RTL side which we think provides 
potential value to end users (the ability to use the builtin to get an 
performant CRC implementation) and to get community feedback on the 
implementation.


So this isn't the end of the line for this work, just the first carved 
out chunk.


Jeff


Re: RISC-V: Added support for CRC.

2023-08-03 Thread Mariam Harutyunyan via Gcc-patches
Hi.
Thank you. I'll add.

Best regards,
Mariam

On Thu, Aug 3, 2023, 23:56 Andrew Pinski  wrote:

> On Thu, Aug 3, 2023 at 12:38 PM Mariam Harutyunyan via Gcc-patches
>  wrote:
> >
> > This patch adds CRC support for the RISC-V architecture. It adds internal
> > functions and built-ins specifically designed to handle CRC computations
> > efficiently.
> >
> > If the target is ZBC, the clmul instruction is used for the CRC code
> > generation; otherwise, table-based CRC is generated.  A table with 256
> > elements is used to store precomputed CRCs.
> >
> > These CRC calculation algorithms have higher performance than the naive
> CRC
> > calculation algorithm.
>
> A few things about this patch:
> You created a generic (non-target specific) builtin but didn't
> document it in doc/extend.texi .
> You created a generic builtin with no fallback in libgcc.
> You created a new named (RTL) pattern, crc, and didn't document it in
> the `Standard Names` section of doc/md.texi .
>
> Thanks,
> Andrew Pinski
>
> >
> >   gcc/ChangeLog:
> >*builtin-types.def (BT_FN_UINT8_UINT8_UINT8_CONST_SIZE):
> Define.
> >(BT_FN_UINT16_UINT16_UINT8_CONST_SIZE): Likewise.
> >(BT_FN_UINT16_UINT16_UINT16_CONST_SIZE): Likewise.
> >(BT_FN_UINT32_UINT32_UINT8_CONST_SIZE): Likewise.
> >(BT_FN_UINT32_UINT32_UINT16_CONST_SIZE): Likewise.
> >(BT_FN_UINT32_UINT32_UINT32_CONST_SIZE): Likewise.
> >(BT_FN_UINT64_UINT64_UINT8_CONST_SIZE): Likewise.
> >(BT_FN_UINT64_UINT64_UINT16_CONST_SIZE): Likewise.
> >(BT_FN_UINT64_UINT64_UINT32_CONST_SIZE): Likewise.
> >(BT_FN_UINT64_UINT64_UINT32_CONST_SIZE): Likewise.
> >* builtins.cc (associated_internal_fn): Handle
> > BUILT_IN_CRC8_DATA8,
> >BUILT_IN_CRC16_DATA8, BUILT_IN_CRC16_DATA16,
> >BUILT_IN_CRC32_DATA8, BUILT_IN_CRC32_DATA16,
> > BUILT_IN_CRC32_DATA32,
> >BUILT_IN_CRC64_DATA8, BUILT_IN_CRC64_DATA16,
> > BUILT_IN_CRC64_DATA32,
> >BUILT_IN_CRC64_DATA64.
> >* builtins.def (BUILT_IN_CRC8_DATA8): New builtin.
> >(BUILT_IN_CRC16_DATA8): Likewise.
> >(BUILT_IN_CRC16_DATA16): Likewise.
> >(BUILT_IN_CRC32_DATA8): Likewise.
> >(BUILT_IN_CRC32_DATA16): Likewise.
> >(BUILT_IN_CRC32_DATA32): Likewise.
> >(BUILT_IN_CRC64_DATA8): Likewise.
> >(BUILT_IN_CRC64_DATA16): Likewise.
> >(BUILT_IN_CRC64_DATA32): Likewise.
> >(BUILT_IN_CRC64_DATA64): Likewise.
> >* config/riscv/bitmanip.md (crc4): New
> > expander.
> >* config/riscv/riscv-protos.h (expand_crc_table_based):
> Declare.
> >(expand_crc_using_clmul): Likewise.
> >* config/riscv/riscv.cc (gf2n_poly_long_div_quotient): New
> > function.
> >(generate_crc): Likewise.
> >(generate_crc_table): Likewise.
> >(expand_crc_table_based): Likewise.
> >(expand_crc_using_clmul): Likewise.
> >* config/riscv/riscv.md (UNSPEC_CRC): New unspec for CRC.
> >* internal-fn.cc (crc_direct): Define.
> >(expand_crc_optab_fn): New function.
> >(direct_crc_optab_supported_p): Define.
> >* internal-fn.def (CRC): New internal optab function.
> >* optabs.def (crc_optab): New optab.
> >
> >  gcc/testsuite/ChangeLog:
> >* gcc.target/riscv/crc-builtin-table-target32.c: New test.
> >* gcc.target/riscv/crc-builtin-table-target64.c: New test.
> >* gcc.target/riscv/crc-builtin-zbc32.c: New test.
> >* gcc.target/riscv/crc-builtin-zbc64.c: New test.
>


Re: RISC-V: Added support for CRC.

2023-08-03 Thread Andrew Pinski via Gcc-patches
On Thu, Aug 3, 2023 at 12:38 PM Mariam Harutyunyan via Gcc-patches
 wrote:
>
> This patch adds CRC support for the RISC-V architecture. It adds internal
> functions and built-ins specifically designed to handle CRC computations
> efficiently.
>
> If the target is ZBC, the clmul instruction is used for the CRC code
> generation; otherwise, table-based CRC is generated.  A table with 256
> elements is used to store precomputed CRCs.
>
> These CRC calculation algorithms have higher performance than the naive CRC
> calculation algorithm.

A few things about this patch:
You created a generic (non-target specific) builtin but didn't
document it in doc/extend.texi .
You created a generic builtin with no fallback in libgcc.
You created a new named (RTL) pattern, crc, and didn't document it in
the `Standard Names` section of doc/md.texi .

Thanks,
Andrew Pinski

>
>   gcc/ChangeLog:
>*builtin-types.def (BT_FN_UINT8_UINT8_UINT8_CONST_SIZE): Define.
>(BT_FN_UINT16_UINT16_UINT8_CONST_SIZE): Likewise.
>(BT_FN_UINT16_UINT16_UINT16_CONST_SIZE): Likewise.
>(BT_FN_UINT32_UINT32_UINT8_CONST_SIZE): Likewise.
>(BT_FN_UINT32_UINT32_UINT16_CONST_SIZE): Likewise.
>(BT_FN_UINT32_UINT32_UINT32_CONST_SIZE): Likewise.
>(BT_FN_UINT64_UINT64_UINT8_CONST_SIZE): Likewise.
>(BT_FN_UINT64_UINT64_UINT16_CONST_SIZE): Likewise.
>(BT_FN_UINT64_UINT64_UINT32_CONST_SIZE): Likewise.
>(BT_FN_UINT64_UINT64_UINT32_CONST_SIZE): Likewise.
>* builtins.cc (associated_internal_fn): Handle
> BUILT_IN_CRC8_DATA8,
>BUILT_IN_CRC16_DATA8, BUILT_IN_CRC16_DATA16,
>BUILT_IN_CRC32_DATA8, BUILT_IN_CRC32_DATA16,
> BUILT_IN_CRC32_DATA32,
>BUILT_IN_CRC64_DATA8, BUILT_IN_CRC64_DATA16,
> BUILT_IN_CRC64_DATA32,
>BUILT_IN_CRC64_DATA64.
>* builtins.def (BUILT_IN_CRC8_DATA8): New builtin.
>(BUILT_IN_CRC16_DATA8): Likewise.
>(BUILT_IN_CRC16_DATA16): Likewise.
>(BUILT_IN_CRC32_DATA8): Likewise.
>(BUILT_IN_CRC32_DATA16): Likewise.
>(BUILT_IN_CRC32_DATA32): Likewise.
>(BUILT_IN_CRC64_DATA8): Likewise.
>(BUILT_IN_CRC64_DATA16): Likewise.
>(BUILT_IN_CRC64_DATA32): Likewise.
>(BUILT_IN_CRC64_DATA64): Likewise.
>* config/riscv/bitmanip.md (crc4): New
> expander.
>* config/riscv/riscv-protos.h (expand_crc_table_based): Declare.
>(expand_crc_using_clmul): Likewise.
>* config/riscv/riscv.cc (gf2n_poly_long_div_quotient): New
> function.
>(generate_crc): Likewise.
>(generate_crc_table): Likewise.
>(expand_crc_table_based): Likewise.
>(expand_crc_using_clmul): Likewise.
>* config/riscv/riscv.md (UNSPEC_CRC): New unspec for CRC.
>* internal-fn.cc (crc_direct): Define.
>(expand_crc_optab_fn): New function.
>(direct_crc_optab_supported_p): Define.
>* internal-fn.def (CRC): New internal optab function.
>* optabs.def (crc_optab): New optab.
>
>  gcc/testsuite/ChangeLog:
>* gcc.target/riscv/crc-builtin-table-target32.c: New test.
>* gcc.target/riscv/crc-builtin-table-target64.c: New test.
>* gcc.target/riscv/crc-builtin-zbc32.c: New test.
>* gcc.target/riscv/crc-builtin-zbc64.c: New test.


RISC-V: Added support for CRC.

2023-08-03 Thread Mariam Harutyunyan via Gcc-patches
This patch adds CRC support for the RISC-V architecture. It adds internal
functions and built-ins specifically designed to handle CRC computations
efficiently.

If the target is ZBC, the clmul instruction is used for the CRC code
generation; otherwise, table-based CRC is generated.  A table with 256
elements is used to store precomputed CRCs.

These CRC calculation algorithms have higher performance than the naive CRC
calculation algorithm.

  gcc/ChangeLog:
   *builtin-types.def (BT_FN_UINT8_UINT8_UINT8_CONST_SIZE): Define.
   (BT_FN_UINT16_UINT16_UINT8_CONST_SIZE): Likewise.
   (BT_FN_UINT16_UINT16_UINT16_CONST_SIZE): Likewise.
   (BT_FN_UINT32_UINT32_UINT8_CONST_SIZE): Likewise.
   (BT_FN_UINT32_UINT32_UINT16_CONST_SIZE): Likewise.
   (BT_FN_UINT32_UINT32_UINT32_CONST_SIZE): Likewise.
   (BT_FN_UINT64_UINT64_UINT8_CONST_SIZE): Likewise.
   (BT_FN_UINT64_UINT64_UINT16_CONST_SIZE): Likewise.
   (BT_FN_UINT64_UINT64_UINT32_CONST_SIZE): Likewise.
   (BT_FN_UINT64_UINT64_UINT32_CONST_SIZE): Likewise.
   * builtins.cc (associated_internal_fn): Handle
BUILT_IN_CRC8_DATA8,
   BUILT_IN_CRC16_DATA8, BUILT_IN_CRC16_DATA16,
   BUILT_IN_CRC32_DATA8, BUILT_IN_CRC32_DATA16,
BUILT_IN_CRC32_DATA32,
   BUILT_IN_CRC64_DATA8, BUILT_IN_CRC64_DATA16,
BUILT_IN_CRC64_DATA32,
   BUILT_IN_CRC64_DATA64.
   * builtins.def (BUILT_IN_CRC8_DATA8): New builtin.
   (BUILT_IN_CRC16_DATA8): Likewise.
   (BUILT_IN_CRC16_DATA16): Likewise.
   (BUILT_IN_CRC32_DATA8): Likewise.
   (BUILT_IN_CRC32_DATA16): Likewise.
   (BUILT_IN_CRC32_DATA32): Likewise.
   (BUILT_IN_CRC64_DATA8): Likewise.
   (BUILT_IN_CRC64_DATA16): Likewise.
   (BUILT_IN_CRC64_DATA32): Likewise.
   (BUILT_IN_CRC64_DATA64): Likewise.
   * config/riscv/bitmanip.md (crc4): New
expander.
   * config/riscv/riscv-protos.h (expand_crc_table_based): Declare.
   (expand_crc_using_clmul): Likewise.
   * config/riscv/riscv.cc (gf2n_poly_long_div_quotient): New
function.
   (generate_crc): Likewise.
   (generate_crc_table): Likewise.
   (expand_crc_table_based): Likewise.
   (expand_crc_using_clmul): Likewise.
   * config/riscv/riscv.md (UNSPEC_CRC): New unspec for CRC.
   * internal-fn.cc (crc_direct): Define.
   (expand_crc_optab_fn): New function.
   (direct_crc_optab_supported_p): Define.
   * internal-fn.def (CRC): New internal optab function.
   * optabs.def (crc_optab): New optab.

 gcc/testsuite/ChangeLog:
   * gcc.target/riscv/crc-builtin-table-target32.c: New test.
   * gcc.target/riscv/crc-builtin-table-target64.c: New test.
   * gcc.target/riscv/crc-builtin-zbc32.c: New test.
   * gcc.target/riscv/crc-builtin-zbc64.c: New test.
From 9d2e9023c222501a1d9519bea3d5cdbd32b5a91e Mon Sep 17 00:00:00 2001
From: Mariam Arutunian 
Date: Thu, 3 Aug 2023 15:59:57 +0400
Subject: [PATCH] RISC-V: Added support for CRC.

  If the target is ZBC, then the clmul instruction is used for the CRC code generation;
  otherwise, table-based CRC is generated. A table with 256 elements is used to store precomputed CRCs.

  gcc/ChangeLog:
	*builtin-types.def (BT_FN_UINT8_UINT8_UINT8_CONST_SIZE): Define.
	(BT_FN_UINT16_UINT16_UINT8_CONST_SIZE): Likewise.
	(BT_FN_UINT16_UINT16_UINT16_CONST_SIZE): Likewise.
	(BT_FN_UINT32_UINT32_UINT8_CONST_SIZE): Likewise.
	(BT_FN_UINT32_UINT32_UINT16_CONST_SIZE): Likewise.
	(BT_FN_UINT32_UINT32_UINT32_CONST_SIZE): Likewise.
	(BT_FN_UINT64_UINT64_UINT8_CONST_SIZE): Likewise.
	(BT_FN_UINT64_UINT64_UINT16_CONST_SIZE): Likewise.
	(BT_FN_UINT64_UINT64_UINT32_CONST_SIZE): Likewise.
	(BT_FN_UINT64_UINT64_UINT32_CONST_SIZE): Likewise.
	* builtins.cc (associated_internal_fn): Handle BUILT_IN_CRC8_DATA8,
	BUILT_IN_CRC16_DATA8, BUILT_IN_CRC16_DATA16,
	BUILT_IN_CRC32_DATA8, BUILT_IN_CRC32_DATA16, BUILT_IN_CRC32_DATA32,
	BUILT_IN_CRC64_DATA8, BUILT_IN_CRC64_DATA16, BUILT_IN_CRC64_DATA32,
	BUILT_IN_CRC64_DATA64.
	* builtins.def (BUILT_IN_CRC8_DATA8): New builtin.
	(BUILT_IN_CRC16_DATA8): Likewise.
	(BUILT_IN_CRC16_DATA16): Likewise.
	(BUILT_IN_CRC32_DATA8): Likewise.
	(BUILT_IN_CRC32_DATA16): Likewise.
	(BUILT_IN_CRC32_DATA32): Likewise.
	(BUILT_IN_CRC64_DATA8): Likewise.
	(BUILT_IN_CRC64_DATA16): Likewise.
	(BUILT_IN_CRC64_DATA32): Likewise.
	(BUILT_IN_CRC64_DATA64): Likewise.
	* config/riscv/bitmanip.md (crc4): New expander.
	* config/riscv/riscv-protos.h (expand_crc_table_based): Declare.
	(expand_crc_using_clmul): Likewise.
	* config/riscv/riscv.cc (gf2n_poly_long_div_quotient): New function.
	(generate_crc): Likewise.
	(generate_crc_table): Likewise.
	(expand_crc_table_based): Likewise.
	(expand_crc_using_clmul): Likewise.
	* config/riscv/riscv.md (UNSPEC_CRC): New unspec for CRC.
	* internal-fn.cc (crc_direct): Define