Re: RISC-V: Added support for CRC.
On 9/26/23 12:56, Joern Rennecke wrote:

What ultimately pushed us to keep moving forward on this effort was discovering numerous CRC loop implementations out in the wild, including 4 implementations (IIRC) in the kernel itself.

I have always assumed that such must exist (CRCs are useful for a number of problems, and besides, they wouldn't have been included in coremark as a part of the workload if they were not relevant), but it is good to have confirmation, and even better to have code that can detect and analyse an entire class of idioms that is in such widespread use.

I was personally surprised at how many we found. While there were a bunch of table generation routines which obviously aren't at all interesting, there were enough in the cases we analyzed that it convinced me this wasn't just catering to a benchmark.

This still leaves room for further improvements, like detecting fully-unrolled code, table lookup, or slice-by-N, and replacing them with better target-optimized code where this is indicated by the optimization flags to save execution time or code/rodata size. Not something we have to tackle now, but just because we don't do it now, doesn't mean we couldn't address these in the future if that appears worthwhile.

Absolutely. In fact, I have encouraged Mariam to capture some of the cases we can't currently handle in the testsuite, essentially building a bit of a TODO list.

I can easily see creating a clmul RTL opcode for targets which support it and hoisting the clmul vs lookup table selection into generic code. I'm still pondering if we're likely to ever see cases where we want a vector clmul intrinsic or support in the autovectorizer for clmul. We've certainly got targets with vector clmul in the ISA, the question is using it.

If we aim for optimal code, I think it more likely that we want to detect a block CRC computation, and have a target expander decide to do that with inline code or a library call that uses vectorized clmul.
At the time we add such block-CRC expansion code, it also makes sense to add a builtin for block CRC so that new GNU C programs can directly request that functionality without having to go through the cargo cult of matching a supported idiom.

And I think we've got a structure which largely allows further improvements, both in the recognition/validation step and in the code expansion step.

Now, the library might be written in GNU C, and for that it might be useful to have a vector clmul intrinsic so that we can express this algorithm more easily.

Agreed. It's also worth noting that LLVM has a clmul in their IL and I suspect they expose it via a builtin/intrinsic. I'd expect we'll ultimately end up in the same boat.

Probably the biggest task in that space right now is to see if we can avoid the symbolic execution engine by re-using ranger.

I'll be interested to see what you'll come up with, but if reverting to the symbolic execution engine, the compile time cost isn't much if you only use it for a proper match. So whatever heuristics are used before deciding to use the engine matter. Can all the cases detected by the engine be recognized as a loop with a reduction? We might use different heuristics for different optimization levels, i.e. allow more false negatives at -O1, and more false positives at -O2 / -fexpensive-optimizations.

It's mostly a desire not to add (yet another) symbolic execution engine to GCC. We've already got the engine for CCP as well as symbolic execution capabilities in Ranger. I'd like to avoid adding another if we can do so.

For a LFSR validation step we need to track 4 potential states for each bit in an object: 0, 1, x, !x, where "x" is the state of the bit from a different object. If it were just tracking 0, 1, x, !x for an entire object, Ranger could probably already do that. But we need to do it for each bit within an object.

We haven't seen compile-time cost be a real issue. But we also haven't looked too closely at that problem.
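The per-bit state tracking described above might be sketched roughly as follows. This is purely hypothetical illustration code (the enum names and the simplifying assumption that every "x" refers to the same source bit are mine, not GCC's actual representation):

```c
#include <assert.h>

/* Hypothetical symbolic states for one bit during LFSR validation:
   known 0, known 1, equal to some other object's bit x, equal to !x,
   or unknown.  */
typedef enum { B0, B1, BX, BNX, BUNK } bitstate;

/* XOR over these symbolic states, assuming BX/BNX refer to the same
   source bit.  This is the kind of per-bit algebra a validator would
   need that whole-object range tracking cannot express.  */
static bitstate
bit_xor (bitstate a, bitstate b)
{
  if (a == BUNK || b == BUNK)
    return BUNK;
  if (a == B0)
    return b;                       /* 0 ^ b = b */
  if (b == B0)
    return a;
  if (a == B1 && b == B1)
    return B0;                      /* 1 ^ 1 = 0 */
  if (a == B1)
    return b == BX ? BNX : BX;      /* 1 ^ x = !x, 1 ^ !x = x */
  if (b == B1)
    return a == BX ? BNX : BX;
  if (a == b)
    return B0;                      /* x ^ x = 0, !x ^ !x = 0 */
  return B1;                        /* x ^ !x = 1 */
}
```

A usage example: XORing a bit with its own complement must yield a known 1, which is exactly the kind of fact the LFSR check needs to establish per bit position.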
While I concur that we want existing code to be able to take advantage of this optimization by gcc recognizing whatever method the code uses to compute CRC (within reason, we are still bound by the laws of computability and resource constraints for the compilation), I would like to stress that I don't think the builtin will lose its value over time. It can be used in tests to make sure we exercise the code generation for the internal function. It can be used in new GNU C code to make it easier to do a CRC computation without worrying about the implementation. If an accord with other major compiler suppliers (e.g. the LLVM community) is reached, it might even see more widespread use.

Which somewhat dovetails with Alex's position -- namely that it's not that valuable. Hence the de-emphasis on this approach. We'll continue to focus on the end-to-end solution. We may still want a clmul as an RTL opcode and builtins to utilize it. Which is
Re: RISC-V: Added support for CRC.
On Tue, 26 Sept 2023 at 14:18, Jeff Law wrote:
> But the Coremark code is what it is. This isn't a whole lot
> different than the work in the 90s which rewrote loops and compromised
> some of the spec benchmarks, or the eqntott hack to simplify that one
> key loop in eqntott.

I think the stated purpose of the benchmark matters. If dhrystone had been pushed as an abstraction-penalty benchmark, it would have been fine to present results with WPA, inlining and dead code elimination as ordinary dhrystone results. But since it's supposed to exercise specific hardware features, and not have the tests for these optimized away, that's not appropriate.

So, first, we make the compiled program perform the work that the benchmark was supposed to include in the measurement, just more efficiently. Second, we not only optimize the benchmark, but also make the target-optimized code generation available for other programs. For new programs targeted at GNU C, that is minimally achieved by providing a built-in function, and in general for new code, by being able to replicate the idiom from coremark that is recognized by GCC. The mere existence of a C idiom in a popular benchmark also makes this idiom a common idiom, if it hasn't already been that before.

As we find new patterns that are used to implement CRC which would be better replaced with a target-specific implementation, we can add these. This is similar to rotate operations, which are supported directly by some processors, and even for other targets, there are generally preferred ways to expand the code, but there are a couple of different variants depending on the available instruction set, registers, and the microarchitecture (pipeline, latency etc). We started out with one pattern that was recognized, and as new patterns were identified in C code, we improved GCC to recognize these other patterns.
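As an illustration of the rotate analogy above: the canonical C rotate idiom below is one such pattern that compilers (GCC included) collapse into a single rotate instruction on targets that have one. This is a minimal sketch; the masking with 31 is the variant that stays well-defined for all shift counts:

```c
#include <stdint.h>

/* Classic rotate-left idiom.  GCC pattern-matches this into a single
   rotate instruction where the target provides one; otherwise it
   expands to the preferred shift/or sequence for the target.  */
static uint32_t
rotl32 (uint32_t x, unsigned n)
{
  n &= 31;
  return (x << n) | (x >> ((32 - n) & 31));
}
```

The point of the analogy: support started with one recognized form and grew as more source-level variants of the same operation were found in real code.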
> What ultimately pushed us to keep moving forward on this effort was
> discovering numerous CRC loop implementations out in the wild, including
> 4 implementations (IIRC) in the kernel itself.

I have always assumed that such must exist (CRCs are useful for a number of problems, and besides, they wouldn't have been included in coremark as a part of the workload if they were not relevant), but it is good to have confirmation, and even better to have code that can detect and analyse an entire class of idioms that is in such widespread use. This still leaves room for further improvements, like detecting fully-unrolled code, table lookup, or slice-by-N, and replacing them with better target-optimized code where this is indicated by the optimization flags to save execution time or code/rodata size. Not something we have to tackle now, but just because we don't do it now, doesn't mean we couldn't address these in the future if that appears worthwhile.

> I can easily see creating a clmul RTL opcode for targets which support
> it and hoisting the clmul vs lookup table selection into generic code.
> I'm still pondering if we're likely to ever see cases where we want a
> vector clmul intrinsic or support in the autovectorizer for clmul.
> We've certainly got targets with vector clmul in the ISA, the question
> is using it.

If we aim for optimal code, I think it more likely that we want to detect a block CRC computation, and have a target expander decide to do that with inline code or a library call that uses vectorized clmul. At the time we add such block-CRC expansion code, it also makes sense to add a builtin for block CRC so that new GNU C programs can directly request that functionality without having to go through the cargo cult of matching a supported idiom. Now, the library might be written in GNU C, and for that it might be useful to have a vector clmul intrinsic so that we can express this algorithm more easily.
> Probably the biggest task in that space right now is to see if we can > avoid the symbolic execution engine by re-using ranger. I'll be interested to see what you'll come up with, but if reverting to the symbolic execution engine, the compile time cost isn't much if you only use it for a proper match. So whatever heuristics are used before deciding to use the engine matter. Can all the cases detected by the engine be recognized as a loop with a reduction? We might use different heuristics for different optimization levels, i.e. allow more false negatives at -O1, and more false positives at -O2 / -fexpensive-optimizations. > To reiterate the real goal here is to take code as-is and make it > significantly faster. While the original target was Coremark, we've > found similar bitwise implementations of CRCs all over the place. > There's no good reason that code should have to change. > > The idea of exposing a CRC builtin was an intermediate step that would > allow those willing to change their code or writing new code to write > their CRC in a trivial way and let the compiler figure out a sensible > implementation while we clean up the
Re: RISC-V: Added support for CRC.
On Tue, 26 Sep 2023, Jeff Law wrote:
> What ultimately pushed us to keep moving forward on this effort was
> discovering numerous CRC loop implementations out in the wild, including 4
> implementations (IIRC) in the kernel itself.

The kernel employs bitwise CRC only in look-up table generators. Which run at build time. You are quite literally slowing down the compiler in order to speed up generators that don't account for even one millisecond of kernel build time, and have no relation to its run-time performance. (incidentally you can't detect the actual CRC impls using those tables)

> And as I've stated before, the latency of clmuls is dropping. I wouldn't be
> terribly surprised to see single cycle clmul implementations showing up
> within the next 18-24 months. It's really just a matter of gate budget vs
> expected value.

In a commercial implementation? I'll take that bet. You spend gate budget like that after better avenues for raising ILP are exhausted (like adding more ALUs that can do clmul at a reasonable 3c-4c latency).

> To reiterate the real goal here is to take code as-is and make it
> significantly faster.

Which code? Table generators in the kernel and xz-utils?

> While the original target was Coremark, we've found similar bitwise
> implementations of CRCs all over the place. There's no good reason that code
> should have to change.

But did you look at them? There's no point in optimizing table generators either.

Alexander
Re: RISC-V: Added support for CRC.
On 9/23/23 17:05, Joern Rennecke wrote:

Mariam Harutyunyan: +++ b/gcc/ChangeLog @@ -1,3 +1,45 @@ +2023-08-03 Mariam Arutunian +

It is common courtesy to include all authors in the list of authors for the ChangeLog; also, this may help people in the future understand the history of the code better. While much of your patch is new, it still contains non-trivial parts of mine ( https://gcc.gnu.org/pipermail/gcc-patches/2022-March/591744.html ). And stripping out the comment why, currently, we can't use linkonce for crc tables on the RISC-V target is not helpful to someone who wants to understand the code.

Thanks for pointing this out Joern. Neither Mariam nor I were aware that some of your code was used to seed this work. We are happy to include you as a co-author. Mariam -- the way to do this is to add a "Co-Author:" line to your commit message. If you look at upstream commit 5ff4431675c0d0c800d4a983254e94a6b401c14d, it shows the right format.

See also the discussion to put this into loop distribution:
https://gcc.gnu.org/pipermail/gcc-patches/2022-March/591821.html
https://gcc.gnu.org/pipermail/gcc-patches/2022-March/591866.html

Yes, this was the gist of the discussion I had with Richi. CRC optimization has a lot in common with distribution, reduction and final value replacement. For early testing and development we focused on detecting a CRC function and replacing it at a function level. That allowed Mariam to avoid a class of problems for a while and focus on an end-to-end solution. Once that was working reasonably well Mariam went back and enhanced the first pass filter and validation so that it would work if the CRC loop was embedded inside a larger function, either at the source level or due to inlining. The placement we chose was just after loop distribution. I think there's some wiggle room in terms of pass placement. The one thing I'm hesitant to do is make this part of an existing pass.
I don't see many benefits of integrating inside loop dist and I very much like its clean separation.

Mariam Harutyunyan: It adds internal functions and built-ins specifically designed to handle CRC computations efficiently.

This sounds like this is a finished work, although defining built-in functions to calculate the CRC of single data elements and recognizers for some C idioms that do these calculations, is just a starting point.

More precisely it was carved out of a larger piece of work in the hopes that it would be useful independently and allow upstreaming of the backend work. Alexander was fairly dismissive of the utility of doing that. There was some interest from Cauldron attendees in potentially reviving that idea.

Alexander Monakov: Jeff, as I understand this all is happening only because Coremark contains use of bitwise CRC that affects benchmark scores. In another universe where - Coremark was careful to checksum outputs outside of timed sections, or - implemented CRC in a manner that is not transparent to the compiler, or - did not use CRC at all we would not be spending effort on this, correct?

It is a stated goal of coremark to test performance for CRC. They do not use a library call to implement CRC, but a specific bit-banging algorithm they have chosen. That algorithm is, for the vast majority of processors, not representative of the target's performance potential in calculating CRCs, thus if a compiler fails to translate this into the CRC implementation that would be used for performance code, the compiler frustrates this goal of coremark to give a measure of CRC calculation performance.

All true. But the Coremark code is what it is. This isn't a whole lot different than the work in the 90s which rewrote loops and compromised some of the spec benchmarks, or the eqntott hack to simplify that one key loop in eqntott.
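For concreteness, the kind of bit-banging loop under discussion looks like the following. This is a generic reflected CRC-32 using the zlib polynomial, a sketch of the idiom class rather than Coremark's exact crcu16 routine:

```c
#include <stddef.h>
#include <stdint.h>

/* Bit-at-a-time CRC-32 (reflected, polynomial 0xEDB88320) -- the style
   of loop a recognizer would replace with a table lookup or a clmul
   sequence.  Eight conditional shift/xor steps per input byte.  */
static uint32_t
crc32_bitwise (const uint8_t *buf, size_t len)
{
  uint32_t crc = 0xFFFFFFFFu;
  for (size_t i = 0; i < len; i++)
    {
      crc ^= buf[i];
      for (int k = 0; k < 8; k++)
        crc = (crc >> 1) ^ ((crc & 1) ? 0xEDB88320u : 0);
    }
  return crc ^ 0xFFFFFFFFu;
}
```

The inner loop's state update is a linear feedback shift register step, which is what makes the idiom both recognizable and replaceable by faster equivalents.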
What ultimately pushed us to keep moving forward on this effort was discovering numerous CRC loop implementations out in the wild, including 4 implementations (IIRC) in the kernel itself. At best we might have a discussion on providing a __builtin_clmul for carry-less multiplication (which _is_ a fundamental primitive, unlike __builtin_crc), and move on. Some processors have specialized instructions for CRC computations. They do. But I think the discussion about providing intrinsic access to a clmul instruction is orthogonal to the discussion about whether or not to provide a builtin for crc. I can easily see creating a clmul RTL opcode for targets which support it and hoisting the clmul vs lookup table selection into generic code. I'm still pondering if we're likely to ever see cases where we want a vector clmul intrinsic or support in the autovectorizer for clmul. We've certainly got targets with vector clmul in the ISA, the question is using it. Instead, efficient CRC loops have the following structure: - they carry an unreduced remainder in the loop, performing final reduction
Re: RISC-V: Added support for CRC.
On Sun, 2023-09-24 at 00:05 +0100, Joern Rennecke wrote:
> Although maybe Oleg Endo's library, as mentioned in
> https://gcc.gnu.org/pipermail/gcc-patches/2022-March/591748.html ,
> might be suitable? What is the license for that?

I haven't published the library, but I think I could do that. It's a C++-14 header-only thing and uses templates + constexpr to generate the .rodata lookup tables. It's convenient for an application project, as it doesn't require any generator tool in the build. This might be not a big advantage in the context of GCC. Since the tables are computed during compile-time, there is no particular optimization implemented. The run-time function is also nothing fancy:

  static constexpr uint8_t
  table_index (value_type rem, uint8_t x)
  {
    if (ReflectInput)
      return x ^ rem;
    else
      return x ^ (BitCount > 8 ? (rem >> (BitCount - 8))
                               : (rem << (8 - BitCount)));
  }

  static constexpr value_type
  shift (value_type rem)
  {
    return ReflectInput ? rem >> 8 : rem << 8;
  }

  static value_type
  default_process_bytes (value_type rem, const uint8_t* in,
                         const uint8_t* in_end)
  {
    for (; in != in_end; ++in)
      {
        auto i = table_index (rem, *in);
        rem = table[i] ^ shift (rem);
      }
    return rem;
  }

Anyway, let me know if anyone is interested.

Cheers,
Oleg
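A self-contained table-driven CRC along these lines, with the table built at run time rather than via constexpr, might look like the sketch below (my own illustration, not Oleg's actual library code):

```c
#include <stddef.h>
#include <stdint.h>

static uint32_t crc32_table[256];

/* Precompute the 256-entry table for the reflected polynomial
   0xEDB88320 -- the run-time analogue of constexpr table generation.  */
static void
crc32_init (void)
{
  for (uint32_t i = 0; i < 256; i++)
    {
      uint32_t r = i;
      for (int k = 0; k < 8; k++)
        r = (r >> 1) ^ ((r & 1) ? 0xEDB88320u : 0);
      crc32_table[i] = r;
    }
}

/* One table lookup per input byte instead of eight bit steps.  */
static uint32_t
crc32_process (uint32_t rem, const uint8_t *in, const uint8_t *in_end)
{
  for (; in != in_end; ++in)
    rem = crc32_table[(rem ^ *in) & 0xFF] ^ (rem >> 8);
  return rem;
}
```

With the usual 0xFFFFFFFF pre- and post-conditioning, this produces the same results as a bit-at-a-time loop while doing an eighth of the iterations.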
Re: RISC-V: Added support for CRC.
On Sun, Sep 24, 2023, 00:05 Joern Rennecke wrote:
> Mariam Harutyunyan:
> +++ b/gcc/ChangeLog
> @@ -1,3 +1,45 @@
> +2023-08-03 Mariam Arutunian
> +
>
> It is common courtesy to include all authors in the list of authors
> for the ChangeLog; also,
> this may help people in the future understand the history of the code
> better.
> While much of your patch is new, it still contains non-trivial parts of
> mine

Sorry, I didn't know that when using others' code snippets we should include their names. I'll add your name. Although the code has been removed in the new patch version, your work was really helpful.

Thank you,
Mariam
Re: RISC-V: Added support for CRC.
On Sun, 24 Sept 2023 at 12:41, Alexander Monakov wrote:
>
> On Sun, 24 Sep 2023, Joern Rennecke wrote:
>
> > It is a stated goal of coremark to test performance for CRC.
>
> I would expect a good CRC benchmark to print CRC throughput in
> bytes per cycle or megabytes per second.
>
> I don't see where Coremark states that goal. In the readme at
> https://github.com/eembc/coremark/blob/main/README.md
> it enumerates the three subcategories (linked list, matrix ops,
> state machine) and indicates that CRC is used for validation.

At https://www.eembc.org/coremark/index.php , they state under the Details heading: ... Replacing the antiquated Dhrystone benchmark, Coremark contains implementations of the following algorithms: list processing (find and sort), matrix manipulation (common matrix operations), state machine (determine if an input stream contains valid numbers), and CRC (cyclic redundancy check). ... The CRC algorithm serves a dual function; it provides a workload commonly seen in embedded applications and ensures correct operation of the CoreMark benchmark, essentially providing a self-checking mechanism. ...

They also point to a whitepaper there, which states: Since CRC is also a commonly used function in embedded applications, this calculation is included in the timed portion of the CoreMark.

> If it claims that elsewhere, the way its code employs CRC does not
> correspond to real-world use patterns, like in the Linux kernel for
> protocol and filesystem checksumming, or decompression libraries.

That may be so, but we should still strive to optimize the code to achieve the intended purpose.

> It is, however, representative of the target CPU's ability to run
> those basic bitwise ops with good overlap with the rest of computation,
> which is far more relevant for the real-world performance of the CPU.

That depends on how much CRC calculation your application does. You can disable specific compiler optimizations in GCC for specialized testing.
> > thus if a compiler fails to translate this into the CRC implementation
> > that would be used for performance code, the compiler frustrates this
> > goal of coremark to give a measure of CRC calculation performance.
>
> Are you seriously saying that if a customer chooses CPU A over CPU B
> based on Coremark scores, and then discovers that actual performance
> in, say, zlib (which uses slice-by-N for CRC) is better on CPU B, that's
> entirely fair and the benchmarks scores they saw were not misleading?

Using coremark as a yardstick for any one application is always going to be likely to give an inaccurate assessment - unless your application is identical to coremark. I don't see why whatever implementation is chosen for the short-length CRC in coremark should be closer or farther from slice-by-N CRC; I would expect it to be pseudo-random. Unless CPU B has worse GCC support or available hardware instruction for short-range CRC, in which case the manufacturer might consider improving support (particularly if it's about GCC support ;-)

Actually, if CRC optimization is implemented via table lookup, on both CPU A and B, it gets a bit closer to slice-by-N, since both do table lookups, although for slice-by-N you trade latency for register pressure. Any single benchmark can't be a good performance predictor for all applications. If you care a lot about performance for a particular load, you should benchmark that load, or something that is known to be a close proxy.

> > > At best we might have
> > > a discussion on providing a __builtin_clmul for carry-less multiplication
> > > (which _is_ a fundamental primitive, unlike __builtin_crc), and move on.
> >
> > Some processors have specialized instructions for CRC computations.
>
> Only for one or two fixed polynomials. For that matter, some processors
> have instructions for AES and SHA, but that doesn't change that clmul is
> a more fundamental and flexible primitive than "CRC".
So it is, but when analyzing user programs that haven't been written by experts with a focus on performance, CRC is more likely to come up than clmul. I agree that it would make sense to have a builtin for clmul that can be used uniformly across architectures that support this operation, but I'm not volunteering to write a patch for that.

> If only the "walk before you run" logic was applied in favor of
> implementing a portable clmul builtin prior to all this.

I started writing the CRC patch for an architecture that didn't have clmul as a base instruction, so a clmul builtin would not have helped.

> > A library can be used to implement built-ins in gcc (we still need to
> > define one for block operations, one step at a time...). However,
> > someone or something needs to rewrite the existing code to use the
> > library. It is commonly accepted that an efficient way to do this is
> > to make a compiler do the necessary transformations, as long as it can
> > be made to churn out good enough code.
>
> How does
Re: RISC-V: Added support for CRC.
On Sun, 24 Sep 2023, Joern Rennecke wrote:

> It is a stated goal of coremark to test performance for CRC.

I would expect a good CRC benchmark to print CRC throughput in bytes per cycle or megabytes per second. I don't see where Coremark states that goal. In the readme at https://github.com/eembc/coremark/blob/main/README.md it enumerates the three subcategories (linked list, matrix ops, state machine) and indicates that CRC is used for validation. If it claims that elsewhere, the way its code employs CRC does not correspond to real-world use patterns, like in the Linux kernel for protocol and filesystem checksumming, or decompression libraries.

> They do not use a library call to implement CRC, but a specific
> bit-banging algorithm they have chosen. That algorithm is, for the
> vast majority of processors, not representative of the target's
> performance potential in calculating CRCs,

It is, however, representative of the target CPU's ability to run those basic bitwise ops with good overlap with the rest of computation, which is far more relevant for the real-world performance of the CPU.

> thus if a compiler fails to translate this into the CRC implementation
> that would be used for performance code, the compiler frustrates this
> goal of coremark to give a measure of CRC calculation performance.

Are you seriously saying that if a customer chooses CPU A over CPU B based on Coremark scores, and then discovers that actual performance in, say, zlib (which uses slice-by-N for CRC) is better on CPU B, that's entirely fair and the benchmarks scores they saw were not misleading?

> > At best we might have
> > a discussion on providing a __builtin_clmul for carry-less multiplication
> > (which _is_ a fundamental primitive, unlike __builtin_crc), and move on.
>
> Some processors have specialized instructions for CRC computations.

Only for one or two fixed polynomials.
For that matter, some processors have instructions for AES and SHA, but that doesn't change that clmul is a more fundamental and flexible primitive than "CRC". > If you want to recognize a loop that does a CRC on a block, it makes > sense to start with recognizing the CRC computation for single array > elements first. We have to learn to walk before we can run. If only the "walk before you run" logic was applied in favor of implementing a portable clmul builtin prior to all this. > A library can be used to implement built-ins in gcc (we still need to > define one for block operations, one step at a time...). However, > someone or something needs to rewrite the existing code to use the > library. It is commonly accepted that an efficient way to do this is > to make a compiler do the necessary transformations, as long as it can > be made to churn out good enough code. How does this apply to the real world? Among CRC implementations in the Linux kernel, ffmpeg, zlib, bzip2, xz-utils, and zstd I'm aware of only a single instance where bitwise CRC is used. It's in the table initialization function in xz-utils. The compiler would transform that to copying one table into another. Is that a valuable transform? > Alexander Monakov: > > Useful to whom? The Linux kernel? zlib, bzip2, xz-utils? ffmpeg? > > These consumers need high-performance blockwise CRC, offering them > > a latency-bound elementwise CRC primitive is a disservice. And what > > should they use as a fallback when __builtin_crc is unavailable? > > We can provide a fallback implementation for all targets with table > lookup and/or shifts . How would it help when they are compiled with LLVM, or GCC version earlier than 14? Alexander
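For reference, the carry-less multiply primitive discussed above is simple to model in portable C. This is a sketch of the operation's semantics only, not a proposal for a builtin's signature:

```c
#include <stdint.h>

/* Carry-less (GF(2) polynomial) multiplication of two 32-bit values,
   the operation a __builtin_clmul would expose directly on targets
   with a clmul instruction.  */
static uint64_t
clmul32 (uint32_t a, uint32_t b)
{
  uint64_t r = 0;
  for (int i = 0; i < 32; i++)
    if (b & (1u << i))
      r ^= (uint64_t) a << i;     /* xor instead of add: no carries */
  return r;
}
```

Unlike integer multiplication there are no carries between bit positions: for example, (x+1)*(x+1) = x^2+1 over GF(2), so clmul32 (3, 3) is 5 rather than 9. This is what makes it the natural primitive for CRC folding.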
Re: RISC-V: Added support for CRC.
Mariam Harutyunyan: +++ b/gcc/ChangeLog @@ -1,3 +1,45 @@ +2023-08-03 Mariam Arutunian +

It is common courtesy to include all authors in the list of authors for the ChangeLog; also, this may help people in the future understand the history of the code better. While much of your patch is new, it still contains non-trivial parts of mine ( https://gcc.gnu.org/pipermail/gcc-patches/2022-March/591744.html ). And stripping out the comment why, currently, we can't use linkonce for crc tables on the RISC-V target is not helpful to someone who wants to understand the code.

See also the discussion to put this into loop distribution:
https://gcc.gnu.org/pipermail/gcc-patches/2022-March/591821.html
https://gcc.gnu.org/pipermail/gcc-patches/2022-March/591866.html

Mariam Harutyunyan:
> It adds internal
> functions and built-ins specifically designed to handle CRC computations
> efficiently.

This sounds like this is a finished work, although defining built-in functions to calculate the CRC of single data elements and recognizers for some C idioms that do these calculations, is just a starting point.

Alexander Monakov:
> Jeff, as I understand this all is happening only because Coremark contains
> use of bitwise CRC that affects benchmark scores. In another universe where
> - Coremark was careful to checksum outputs outside of timed sections, or
> - implemented CRC in a manner that is not transparent to the compiler, or
> - did not use CRC at all
> we would not be spending effort on this, correct?

It is a stated goal of coremark to test performance for CRC. They do not use a library call to implement CRC, but a specific bit-banging algorithm they have chosen.
That algorithm is, for the vast majority of processors, not representative of the target's performance potential in calculating CRCs, thus if a compiler fails to translate this into the CRC implementation that would be used for performance code, the compiler frustrates this goal of coremark to give a measure of CRC calculation performance.

> At best we might have
> a discussion on providing a __builtin_clmul for carry-less multiplication
> (which _is_ a fundamental primitive, unlike __builtin_crc), and move on.

Some processors have specialized instructions for CRC computations.

> Instead, efficient CRC loops have the following structure:
> - they carry an unreduced remainder in the loop, performing final reduction
> modulo polynomial only once after the loop — this halves the CLMUL count;
> - they overlap multiple CLMUL chains to make the loop throughput-bound
> rather than latency-bound. The typical unroll factor is about 4x-8x.

If you want to recognize a loop that does a CRC on a block, it makes sense to start with recognizing the CRC computation for single array elements first. We have to learn to walk before we can run. Note that my initial patch already contained a match.pd stanza to recognize two chained single-element CRC calculations.

Jeff Law:
> The intention is to provide a useful
> builtin_crc while at the same time putting one side of the
> infrastructure we need for automatic detection of CRC loops and turning
> them into table lookups or CLMULs.

Note that when optimizing for size, for a target with tiny memory, or when using a non-constant (or constant but undiscoverable by the compiler) polynomial, we can't use the table lookup. But still, even if we don't have CLMUL, we can generally speed up CRC computation over the coremark algorithm by using something more suited to the target, like the crcu16_1 function I put into comments in my patch.

Alexander Monakov:
> So... just provide a library?
> A library code is easier to develop and audit,
> it can be released independently, people can use it with their compiler of
> choice. Not everything needs to be in libgcc.

A library can be used to implement built-ins in gcc (we still need to define one for block operations, one step at a time...). However, someone or something needs to rewrite the existing code to use the library. It is commonly accepted that an efficient way to do this is to make a compiler do the necessary transformations, as long as it can be made to churn out good enough code.

Alexander Monakov:
> Useful to whom? The Linux kernel? zlib, bzip2, xz-utils? ffmpeg?
> These consumers need high-performance blockwise CRC, offering them
> a latency-bound elementwise CRC primitive is a disservice. And what
> should they use as a fallback when __builtin_crc is unavailable?

We can provide a fallback implementation for all targets with table lookup and/or shifts.

Alexander Monakov:
> I think offering a conventional library for CRC has substantial advantages.

Are you volunteering? It would make our work to emit code for block CRC easier if we could just use a library call when we recognize a block CRC (although making that recognizer is likely still considerable work if we want to get good coverage over different coding styles). Although maybe Oleg Endo's library, as mentioned
Re: RISC-V: Added support for CRC.
Thank you for the review. I'm already working on suggested changes. The answers to the few questions are attached here. Thanks, Mariam On Wed, Aug 16, 2023 at 8:59 AM Jeff Law wrote: > > > On 8/3/23 13:37, Mariam Harutyunyan via Gcc-patches wrote: > > This patch adds CRC support for the RISC-V architecture. It adds internal > > functions and built-ins specifically designed to handle CRC computations > > efficiently. > > > > If the target is ZBC, the clmul instruction is used for the CRC code > > generation; otherwise, table-based CRC is generated. A table with 256 > > elements is used to store precomputed CRCs. > > > > These CRC calculation algorithms have higher performance than the naive > CRC > > calculation algorithm. > [ ... ] > Various comments attached. > reply-to-comments Description: Binary data
Re: RISC-V: Added support for CRC.
On Wed, 16 Aug 2023, Philipp Tomsich wrote: > > > I fully expect that latency to drop within the next 12-18 months. In that > > > world, there's not going to be much benefit to using hand-coded libraries > > > vs > > > just letting the compiler do it. > > I would also hope that the hand-coded libraries would eventually have > a code path for compilers that support the built-in. You seem to be working with the false assumption that the interface of the proposed builtin matches how high-performance CRC computation is structured. It is not. State-of-the-art CRC keeps an unreduced intermediate residual, split over multiple temporaries to allow overlapping CLMULs in the CPU. The intermediate residuals are reduced only once, when the final CRC value is needed. In contrast, the proposed builtin has data dependencies between all adjacent instructions, and cannot allow the CPU to work at IPC > 1.0. Shame how little you apparently understand of the "mindbending math". Alexander
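To illustrate the "split over multiple temporaries" idea in scalar form: a buffer can be CRC'd as two independent dependency chains that the CPU may overlap, merged once at the end. All names below are mine; a plain bitwise CRC-16 (polynomial 0x1021) stands in for the unreduced CLMUL accumulators, and each chain must start from 0 so the CRC stays a linear function of the data.

```c
#include <stdint.h>

/* MSB-first CRC-16 update over a byte range (init must be 0 for linearity). */
static uint16_t crc16_update(uint16_t crc, const uint8_t *p, int n)
{
    while (n--) {
        crc ^= (uint16_t)*p++ << 8;
        for (int i = 0; i < 8; i++)
            crc = (crc & 0x8000) ? (uint16_t)((crc << 1) ^ 0x1021)
                                 : (uint16_t)(crc << 1);
    }
    return crc;
}

/* Multiply a CRC state by x^k modulo the polynomial: shifts one chain's
   contribution past the k data bits handled by the other chain. */
static uint16_t crc16_shift(uint16_t a, int k)
{
    while (k--)
        a = (a & 0x8000) ? (uint16_t)((a << 1) ^ 0x1021) : (uint16_t)(a << 1);
    return a;
}

/* CRC(A || B) == crc16_shift(CRC(A), 8*|B|) ^ CRC(B); the two halves are
   fully independent until the final merge. */
uint16_t crc16_two_chains(const uint8_t *buf, int len)
{
    int half = len / 2;
    uint16_t a = crc16_update(0, buf, half);               /* chain 1 */
    uint16_t b = crc16_update(0, buf + half, len - half);  /* chain 2 */
    return crc16_shift(a, 8 * (len - half)) ^ b;
}
```

Real implementations keep the per-chain accumulators unreduced and merge with CLMUL; the algebra is the same.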
Re: RISC-V: Added support for CRC.
On 8/16/23 13:10, Alexander Monakov wrote: On Tue, 15 Aug 2023, Jeff Law wrote: Because if the compiler can optimize it automatically, then the projects have to do literally nothing to take advantage of it. They just compile normally and their bitwise CRC gets optimized down to either a table lookup or a clmul variant. That's the real goal here. The only high-profile FOSS project that carries a bitwise CRC implementation I'm aware of is the 'xz' compression library. There bitwise CRC is used for populating the lookup table under './configure --enable-small': https://github.com/tukaani-project/xz/blob/2b871f4dbffe3801d0da3f89806b5935f758d5f3/src/liblzma/check/crc64_small.c It's a well-reasoned choice and your compiler would be undoing it (reintroducing the table when the bitwise CRC is employed specifically to avoid carrying the table). If they don't want the table variant, there would obviously be ways to turn that off. It's essentially no different than any speed improving optimization that makes things larger. One final note. Elsewhere in this thread you described performance concerns. Right now clmuls can be implemented in 4c, fully piped. Pipelining doesn't matter in the implementation being proposed here, because the builtin is expanded to li a4,quotient li a5,polynomial xor a0,a1,a0 clmul a0,a0,a4 srli a0,a0,crc_size clmul a0,a0,a5 slli a0,a0,GET_MODE_BITSIZE (word_mode) - crc_size srli a0,a0,GET_MODE_BITSIZE (word_mode) - crc_size making CLMULs data-dependent, so the second can only be started one cycle after the first finishes, and consecutive invocations of __builtin_crc are likewise data-dependent (with three cycles between CLMUL). So even when you get CLMUL down to 3c latency, you'll have two CLMULs and 10 cycles per input block, while state of the art is one widening CLMUL per input block (one CLMUL per 32-bit block on a 64-bit CPU) limited by throughput, not latency. I expect it'll actually be 2c latency. 
We're approaching the point where it just won't make that much sense to call out to a library when you can emit the pair of clmuls and a couple shifts. jeff
Re: RISC-V: Added support for CRC.
> On Aug 16, 2023, at 3:42 PM, Philipp Tomsich wrote: > > On Wed, 16 Aug 2023 at 21:10, Alexander Monakov wrote: >> >> >> On Tue, 15 Aug 2023, Jeff Law wrote: >> >>> Because if the compiler can optimize it automatically, then the projects >>> have >>> to do literally nothing to take advantage of it. They just compile normally >>> and their bitwise CRC gets optimized down to either a table lookup or a >>> clmul >>> variant. That's the real goal here. >> >> The only high-profile FOSS project that carries a bitwise CRC implementation >> I'm aware of is the 'xz' compression library. There bitwise CRC is used for >> populating the lookup table under './configure --enable-small': >> >> https://github.com/tukaani-project/xz/blob/2b871f4dbffe3801d0da3f89806b5935f758d5f3/src/liblzma/check/crc64_small.c >> >> It's a well-reasoned choice and your compiler would be undoing it >> (reintroducing the table when the bitwise CRC is employed specifically >> to avoid carrying the table). Is that compiled with -Os? It would seem sensible for that to be the case, and for the table optimization to be suppressed if that switch is used. paul
Re: RISC-V: Added support for CRC.
On Wed, 16 Aug 2023 at 21:10, Alexander Monakov wrote: > > > On Tue, 15 Aug 2023, Jeff Law wrote: > > > Because if the compiler can optimize it automatically, then the projects > > have > > to do literally nothing to take advantage of it. They just compile normally > > and their bitwise CRC gets optimized down to either a table lookup or a > > clmul > > variant. That's the real goal here. > > The only high-profile FOSS project that carries a bitwise CRC implementation > I'm aware of is the 'xz' compression library. There bitwise CRC is used for > populating the lookup table under './configure --enable-small': > > https://github.com/tukaani-project/xz/blob/2b871f4dbffe3801d0da3f89806b5935f758d5f3/src/liblzma/check/crc64_small.c > > It's a well-reasoned choice and your compiler would be undoing it > (reintroducing the table when the bitwise CRC is employed specifically > to avoid carrying the table). > > > One final note. Elsewhere in this thread you described performance > > concerns. > > Right now clmuls can be implemented in 4c, fully piped. > > Pipelining doesn't matter in the implementation being proposed here, because > the builtin is expanded to > > li a4,quotient > li a5,polynomial > xor a0,a1,a0 > clmul a0,a0,a4 > srli a0,a0,crc_size > clmul a0,a0,a5 > slli a0,a0,GET_MODE_BITSIZE (word_mode) - crc_size > srli a0,a0,GET_MODE_BITSIZE (word_mode) - crc_size > > making CLMULs data-dependent, so the second can only be started one cycle > after the first finishes, and consecutive invocations of __builtin_crc > are likewise data-dependent (with three cycles between CLMUL). So even > when you get CLMUL down to 3c latency, you'll have two CLMULs and 10 cycles > per input block, while state of the art is one widening CLMUL per input block > (one CLMUL per 32-bit block on a 64-bit CPU) limited by throughput, not > latency. > > > I fully expect that latency to drop within the next 12-18 months. 
In that > > world, there's not going to be much benefit to using hand-coded libraries vs > > just letting the compiler do it. I would also hope that the hand-coded libraries would eventually have a code path for compilers that support the built-in. For what it's worth, there now is CRC in Boost: https://www.boost.org/doc/libs/1_83_0/doc/html/crc.html Cheers, philipp.
Re: RISC-V: Added support for CRC.
On Tue, 15 Aug 2023, Jeff Law wrote: > Because if the compiler can optimize it automatically, then the projects have > to do literally nothing to take advantage of it. They just compile normally > and their bitwise CRC gets optimized down to either a table lookup or a clmul > variant. That's the real goal here. The only high-profile FOSS project that carries a bitwise CRC implementation I'm aware of is the 'xz' compression library. There bitwise CRC is used for populating the lookup table under './configure --enable-small': https://github.com/tukaani-project/xz/blob/2b871f4dbffe3801d0da3f89806b5935f758d5f3/src/liblzma/check/crc64_small.c It's a well-reasoned choice and your compiler would be undoing it (reintroducing the table when the bitwise CRC is employed specifically to avoid carrying the table). > One final note. Elsewhere in this thread you described performance concerns. > Right now clmuls can be implemented in 4c, fully piped. Pipelining doesn't matter in the implementation being proposed here, because the builtin is expanded to li a4,quotient li a5,polynomial xor a0,a1,a0 clmul a0,a0,a4 srli a0,a0,crc_size clmul a0,a0,a5 slli a0,a0,GET_MODE_BITSIZE (word_mode) - crc_size srli a0,a0,GET_MODE_BITSIZE (word_mode) - crc_size making CLMULs data-dependent, so the second can only be started one cycle after the first finishes, and consecutive invocations of __builtin_crc are likewise data-dependent (with three cycles between CLMUL). So even when you get CLMUL down to 3c latency, you'll have two CLMULs and 10 cycles per input block, while state of the art is one widening CLMUL per input block (one CLMUL per 32-bit block on a 64-bit CPU) limited by throughput, not latency. > I fully expect that latency to drop within the next 12-18 months. In that > world, there's not going to be much benefit to using hand-coded libraries vs > just letting the compiler do it. ... Alexander
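For readers following along, the expansion quoted above is a GF(2) Barrett reduction: the "quotient" constant is x^(2n) divided by the polynomial, and the result is clmul(clmul(t, quotient) >> n, poly) masked to n bits. Below is a plain-C model of that sequence for a 16-bit CRC; the helper names are mine, and a software carry-less multiply stands in for the clmul instruction.

```c
#include <stdint.h>

/* Software model of a carry-less multiply (operands stay under 32 bits). */
static uint64_t clmul(uint64_t a, uint64_t b)
{
    uint64_t r = 0;
    for (int i = 0; i < 32; i++)
        if ((b >> i) & 1)
            r ^= a << i;
    return r;
}

/* Barrett constant: x^(2n) div P in GF(2), what "li a4,quotient" loads. */
static uint32_t barrett_quotient(uint32_t poly, int n)
{
    uint64_t rem = 1ull << (2 * n);
    uint32_t q = 0;
    for (int bit = n; bit >= 0; bit--)
        if (rem & (1ull << (bit + n))) {
            rem ^= (uint64_t)poly << bit;
            q |= 1u << bit;
        }
    return q;
}

/* One builtin step for an n=16 CRC (full polynomial 0x11021), mirroring the
   xor / clmul / srli / clmul / slli / srli sequence one-for-one:
   t = crc ^ data; q = clmul(t, quotient) >> n; result = clmul(q, P) & mask.
   (t * x^n has zero low bits, so the low n bits of q*P are the remainder.) */
uint16_t crc16_step_clmul(uint16_t crc, uint16_t data)
{
    const uint32_t poly = 0x11021;
    const int n = 16;
    uint32_t quot = barrett_quotient(poly, n);
    uint32_t t = crc ^ data;
    uint32_t q = (uint32_t)(clmul(t, quot) >> n);
    return (uint16_t)(clmul(q, poly) & 0xFFFF);
}
```

The data dependence Alexander describes is visible here: q cannot be computed until t's CLMUL retires, and the next invocation's t depends on this result.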
Re: RISC-V: Added support for CRC.
On 8/3/23 13:37, Mariam Harutyunyan via Gcc-patches wrote: This patch adds CRC support for the RISC-V architecture. It adds internal functions and built-ins specifically designed to handle CRC computations efficiently. If the target is ZBC, the clmul instruction is used for the CRC code generation; otherwise, table-based CRC is generated. A table with 256 elements is used to store precomputed CRCs. These CRC calculation algorithms have higher performance than the naive CRC calculation algorithm. [ ... ] Various comments attached. From 9d2e9023c222501a1d9519bea3d5cdbd32b5a91e Mon Sep 17 00:00:00 2001 From: Mariam Arutunian Date: Thu, 3 Aug 2023 15:59:57 +0400 Subject: [PATCH] RISC-V: Added support for CRC. If the target is ZBC, then the clmul instruction is used for the CRC code generation; otherwise, table-based CRC is generated. A table with 256 elements is used to store precomputed CRCs. gcc/ChangeLog: * builtin-types.def (BT_FN_UINT8_UINT8_UINT8_CONST_SIZE): Define. (BT_FN_UINT16_UINT16_UINT8_CONST_SIZE): Likewise. (BT_FN_UINT16_UINT16_UINT16_CONST_SIZE): Likewise. (BT_FN_UINT32_UINT32_UINT8_CONST_SIZE): Likewise. (BT_FN_UINT32_UINT32_UINT16_CONST_SIZE): Likewise. (BT_FN_UINT32_UINT32_UINT32_CONST_SIZE): Likewise. (BT_FN_UINT64_UINT64_UINT8_CONST_SIZE): Likewise. (BT_FN_UINT64_UINT64_UINT16_CONST_SIZE): Likewise. (BT_FN_UINT64_UINT64_UINT32_CONST_SIZE): Likewise. (BT_FN_UINT64_UINT64_UINT64_CONST_SIZE): Likewise. * builtins.cc (associated_internal_fn): Handle BUILT_IN_CRC8_DATA8, BUILT_IN_CRC16_DATA8, BUILT_IN_CRC16_DATA16, BUILT_IN_CRC32_DATA8, BUILT_IN_CRC32_DATA16, BUILT_IN_CRC32_DATA32, BUILT_IN_CRC64_DATA8, BUILT_IN_CRC64_DATA16, BUILT_IN_CRC64_DATA32, BUILT_IN_CRC64_DATA64. * builtins.def (BUILT_IN_CRC8_DATA8): New builtin. (BUILT_IN_CRC16_DATA8): Likewise. (BUILT_IN_CRC16_DATA16): Likewise. (BUILT_IN_CRC32_DATA8): Likewise. (BUILT_IN_CRC32_DATA16): Likewise. (BUILT_IN_CRC32_DATA32): Likewise. (BUILT_IN_CRC64_DATA8): Likewise. 
(BUILT_IN_CRC64_DATA16): Likewise. (BUILT_IN_CRC64_DATA32): Likewise. (BUILT_IN_CRC64_DATA64): Likewise. * config/riscv/bitmanip.md (crc4): New expander. * config/riscv/riscv-protos.h (expand_crc_table_based): Declare. (expand_crc_using_clmul): Likewise. * config/riscv/riscv.cc (gf2n_poly_long_div_quotient): New function. (generate_crc): Likewise. (generate_crc_table): Likewise. (expand_crc_table_based): Likewise. (expand_crc_using_clmul): Likewise. * config/riscv/riscv.md (UNSPEC_CRC): New unspec for CRC. * internal-fn.cc (crc_direct): Define. (expand_crc_optab_fn): New function. (direct_crc_optab_supported_p): Define. * internal-fn.def (CRC): New internal optab function. * optabs.def (crc_optab): New optab. gcc/testsuite/ChangeLog: * gcc.target/riscv/crc-builtin-table-target32.c: New test. * gcc.target/riscv/crc-builtin-table-target64.c: New test. * gcc.target/riscv/crc-builtin-zbc32.c: New test. * gcc.target/riscv/crc-builtin-zbc64.c: New test. --- diff --git a/gcc/ChangeLog b/gcc/ChangeLog index 2eab466a9f8..748d8be384b 100644 --- a/gcc/ChangeLog +++ b/gcc/ChangeLog In general, the ChangeLog entry is just sent like you've done above and not included as a diff. diff --git a/gcc/builtin-types.def b/gcc/builtin-types.def index 43381bc8949..e33837c27d0 100644 --- a/gcc/builtin-types.def +++ b/gcc/builtin-types.def @@ -829,6 +829,26 @@ DEF_FUNCTION_TYPE_3 (BT_FN_PTR_SIZE_SIZE_PTRMODE, BT_PTR, BT_SIZE, BT_SIZE, BT_PTRMODE) DEF_FUNCTION_TYPE_3 (BT_FN_VOID_PTR_UINT8_PTRMODE, BT_VOID, BT_PTR, BT_UINT8, BT_PTRMODE) +DEF_FUNCTION_TYPE_3 (BT_FN_UINT8_UINT8_UINT8_CONST_SIZE, BT_UINT8, BT_UINT8, + BT_UINT8, BT_CONST_SIZE) [ ... ] Presumably the reason we need so many variants is due to the desire to support various types for the inputs and outputs? 
diff --git a/gcc/config/riscv/bitmanip.md b/gcc/config/riscv/bitmanip.md index c42e7b890db..4c896303242 100644 --- a/gcc/config/riscv/bitmanip.md +++ b/gcc/config/riscv/bitmanip.md @@ -856,3 +856,38 @@ "TARGET_ZBC" "clmulr\t%0,%1,%2" [(set_attr "type" "clmul")]) + +;; Iterator for hardware-supported integer modes, same as ANYI +(define_mode_iterator ANYI2 [QI HI SI (DI "TARGET_64BIT")]) + +;; CRC 8, 16, 32, (64 for TARGET_64) +(define_expand "crc4" + ;; return value (calculated CRC) + [(set (match_operand:ANYI 0 "register_operand" "=r") + ;; initial CRC + (unspec:ANYI [(match_operand:ANYI 1 "register_operand" "r") + ;; data + (match_operand:ANYI2 2 "register_operand" "r") + ;; polynomial + (match_operand:ANYI 3)] + UNSPEC_CRC))] I'm not real comfortable with directly supporting sub-word sizes for the output operand. The vast majority of insns on the risc-v port have outputs that use either the X iterator (which maps to either SI or DI mode for rv32 and rv64 respectively) or they
Re: RISC-V: Added support for CRC.
On 8/9/23 07:02, Paul Koning wrote: On Aug 9, 2023, at 2:32 AM, Alexander Monakov wrote: On Tue, 8 Aug 2023, Jeff Law wrote: If the compiler can identify a CRC and collapse it down to a table or clmul, that's a major win and such code does exist in the real world. That was the whole point behind the Fedora experiment -- to determine if these things are showing up in the real world or if this is just a benchmarking exercise. Can you share the results of the experiment and give your estimate of what sort of real-world improvement is expected? I already listed the popular FOSS projects where CRC performance is important: the Linux kernel and a few compression libraries. Those projects do not use a bitwise CRC loop, except sometimes for table generation on startup (which needs less time than a page fault that may be necessary to bring in a hardcoded table). For those projects that need a better CRC, why is the chosen solution to optimize it in the compiler instead of offering them a library they could use with any compiler? Was there any thought given to embedded projects that use bitwise CRC exactly because they have little space for a hardcoded table to spare? Or those that use smaller tables -- for example, the classic VAX microcode approach with a 16-entry table, doing CRC 4 bits at a time. Yup. I think we settled on 8 bits at a time for the table variant. It seemed like a good tradeoff between size of the tables and speed. I agree that this seems an odd thing to optimize. CRC is a well known CPU hog with well established efficient solutions, and it's hard to see why anyone who needs good performance would fail to understand and apply that knowledge. As I've said, what started us down this path was Coremark. But what convinced me that this was useful beyond juicing benchmark data was finding the various implementations in the wild. Jeff
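The size/speed tradeoff discussed above (a 16-entry nibble table versus a 256-entry byte table) can be sketched as follows, using the common reflected CRC-32 (polynomial 0xEDB88320). This is illustrative code of my own, not the patch's table generator; the helper names are hypothetical.

```c
#include <stdint.h>

static uint32_t nib_tab[16];    /* 64 bytes: 4 bits per step  */
static uint32_t byte_tab[256];  /* 1 KiB:    8 bits per step  */

/* Build both tables from the same bitwise step. */
static void init_tables(void)
{
    for (uint32_t i = 0; i < 256; i++) {
        uint32_t c = i;
        for (int k = 0; k < 8; k++)
            c = (c & 1) ? (c >> 1) ^ 0xEDB88320u : c >> 1;
        byte_tab[i] = c;
        if (i < 16) {
            uint32_t n = i;
            for (int k = 0; k < 4; k++)
                n = (n & 1) ? (n >> 1) ^ 0xEDB88320u : n >> 1;
            nib_tab[i] = n;
        }
    }
}

/* 4 bits at a time: two lookups per byte, tiny table. */
uint32_t crc32_nibble(const uint8_t *p, int len)
{
    uint32_t crc = 0xFFFFFFFFu;
    while (len--) {
        crc ^= *p++;
        crc = nib_tab[crc & 0xF] ^ (crc >> 4);   /* low nibble  */
        crc = nib_tab[crc & 0xF] ^ (crc >> 4);   /* high nibble */
    }
    return crc ^ 0xFFFFFFFFu;
}

/* 8 bits at a time: one lookup per byte, larger table. */
uint32_t crc32_byte(const uint8_t *p, int len)
{
    uint32_t crc = 0xFFFFFFFFu;
    while (len--)
        crc = byte_tab[(crc ^ *p++) & 0xFF] ^ (crc >> 8);
    return crc ^ 0xFFFFFFFFu;
}
```

Both produce identical CRCs; the choice is purely table footprint versus lookups per byte, which is the tradeoff Jeff describes.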
Re: RISC-V: Added support for CRC.
On 8/9/23 00:32, Alexander Monakov wrote: On Tue, 8 Aug 2023, Jeff Law wrote: If the compiler can identify a CRC and collapse it down to a table or clmul, that's a major win and such code does exist in the real world. That was the whole point behind the Fedora experiment -- to determine if these things are showing up in the real world or if this is just a benchmarking exercise. Can you share the results of the experiment and give your estimate of what sort of real-world improvement is expected? I already listed the popular FOSS projects where CRC performance is important: the Linux kernel and a few compression libraries. Those projects do not use a bitwise CRC loop, except sometimes for table generation on startup (which needs less time than a page fault that may be necessary to bring in a hardcoded table). That experiment was ~7 months ago. I don't think any of the data is still around except for some extracted testcases. For those projects that need a better CRC, why is the chosen solution to optimize it in the compiler instead of offering them a library they could use with any compiler? Because if the compiler can optimize it automatically, then the projects have to do literally nothing to take advantage of it. They just compile normally and their bitwise CRC gets optimized down to either a table lookup or a clmul variant. That's the real goal here. If a step where we provide the backend bits hooked up to a builtin isn't useful, then we won't pursue it. The thinking was it would provide value for those willing to make a slight change to their sources and at the same time we get real world exposure for the backend work of the CRC optimization effort while we polish the gimple detection bits. Was there any thought given to embedded projects that use bitwise CRC exactly because they have little space for a hardcoded table to spare? 
It wasn't an explicit goal, but the ability to select between a table implementation and a clmul implementation in the backend seemed useful, so we wired up both. No, not if the compiler is not GCC, or its version is less than 14. And those projects are not going to sacrifice their portability just for __builtin_crc. You may be right. I don't think it's so clear cut, though. I think offering a conventional library for CRC has substantial advantages. That's not what I asked. If you think there's room for improvement to a builtin API, I'd love to hear it. But it seems you don't think this is worth the effort at all. That's unfortunate, but if that's the consensus, then so be it. I think it's a strange application of development effort. You'd get more done coding a library. Not if the end goal is to detect the CRC and optimize it into a table or clmul without the user having to do anything special. Again, what we've proposed in this patch is a piece of that larger body of work, specifically the backend bits that we thought would have value independently. If the community doesn't see that carved out chunk as helpful we'll table it until the whole end-to-end path is ready for submission. I'll note LLVM is likely going forward with CRC detection and optimization at some point in the next ~6 months (effectively moving the implementation from the hexagon port into the generic parts of their loop optimizer). I don't see CRC detection in the Hexagon port. There is a recognizer for polynomial multiplication (CRC is division, not multiplication). Yes, you need the recognizer so that you can detect a CRC loop, then with a bit of math you turn that into a carryless multiply sequence. I find the math here mindbending, but the Hexagon bits are precisely to optimize CRC loops. Sadly the Hexagon bits are fairly specific to the CRC implementation inside coremark. The GCC bits we've been working on are much more general. One final note. 
Elsewhere in this thread you described performance concerns. Right now clmuls can be implemented in 4c, fully piped. I fully expect that latency to drop within the next 12-18 months. In that world, there's not going to be much benefit to using hand-coded libraries vs just letting the compiler do it. Jeff
Re: RISC-V: Added support for CRC.
> On Aug 9, 2023, at 2:32 AM, Alexander Monakov wrote: > > > On Tue, 8 Aug 2023, Jeff Law wrote: > >> If the compiler can identify a CRC and collapse it down to a table or clmul, >> that's a major win and such code does exist in the real world. That was the >> whole point behind the Fedora experiment -- to determine if these things are >> showing up in the real world or if this is just a benchmarking exercise. > > Can you share the results of the experiment and give your estimate of what > sort of real-world improvement is expected? I already listed the popular > FOSS projects where CRC performance is important: the Linux kernel and > a few compression libraries. Those projects do not use a bitwise CRC loop, > except sometimes for table generation on startup (which needs less time > than a page fault that may be necessary to bring in a hardcoded table). > > For those projects that need a better CRC, why is the chosen solution is > to optimize it in the compiler instead of offering them a library they > could use with any compiler? > > Was there any thought given to embedded projects that use bitwise CRC > exactly because they little space for a hardcoded table to spare? Or those that use smaller tables -- for example, the classic VAX microcode approach with a 16-entry table, doing CRC 4 bits at a time. I agree that this seems an odd thing to optimize. CRC is a well known CPU hog with well established efficient solutions, and it's hard to see why anyone who needs good performance would fail to understand and apply that knowledge. paul
Re: RISC-V: Added support for CRC.
On Tue, 8 Aug 2023, Jeff Law wrote: > If the compiler can identify a CRC and collapse it down to a table or clmul, > that's a major win and such code does exist in the real world. That was the > whole point behind the Fedora experiment -- to determine if these things are > showing up in the real world or if this is just a benchmarking exercise. Can you share the results of the experiment and give your estimate of what sort of real-world improvement is expected? I already listed the popular FOSS projects where CRC performance is important: the Linux kernel and a few compression libraries. Those projects do not use a bitwise CRC loop, except sometimes for table generation on startup (which needs less time than a page fault that may be necessary to bring in a hardcoded table). For those projects that need a better CRC, why is the chosen solution to optimize it in the compiler instead of offering them a library they could use with any compiler? Was there any thought given to embedded projects that use bitwise CRC exactly because they have little space for a hardcoded table to spare? > > Useful to whom? The Linux kernel? zlib, bzip2, xz-utils? ffmpeg? > > These consumers need high-performance blockwise CRC, offering them > > a latency-bound elementwise CRC primitive is a disservice. And what > > should they use as a fallback when __builtin_crc is unavailable? > The point is builtin_crc would always be available. If there is no clmul, > then the RTL backend can expand to a table lookup version. No, not if the compiler is not GCC, or its version is less than 14. And those projects are not going to sacrifice their portability just for __builtin_crc. > > I think offering a conventional library for CRC has substantial advantages. > That's not what I asked. If you think there's room for improvement to a > builtin API, I'd love to hear it. > > But it seems you don't think this is worth the effort at all. That's > unfortunate, but if that's the consensus, then so be it. 
I think it's a strange application of development effort. You'd get more done coding a library. > I'll note LLVM is likely going forward with CRC detection and optimization at > some point in the next ~6 months (effectively moving the implementation from > the hexagon port into the generic parts of their loop optimizer). I don't see CRC detection in the Hexagon port. There is a recognizer for polynomial multiplication (CRC is division, not multiplication). Alexander
Re: RISC-V: Added support for CRC.
On 8/8/23 17:31, Andrew Pinski wrote: On Tue, Aug 8, 2023 at 4:17 PM Jeff Law via Gcc-patches wrote: On 8/8/23 10:38, Alexander Monakov wrote: On Tue, 8 Aug 2023, Jeff Law wrote: That was my thinking at one time. Then we started looking at the distros and found enough crc implementations in there to change my mind about the overall utility. The ones I'm familiar with are all table-based and look impossible to pattern-match (and hence already fairly efficient compared to the bitwise loop in Coremark). We found dozens that were the usual looking loops and, IIRC ~200 table lookups after analyzing about half of the packages in Fedora. I will make a note that we do handle table lookups to detect count leading zeros, see check_ctz_array in tree-ssa-forwprop.cc for that detection. (that was done to improve a SPEC benchmark even). So if the tables are statically defined at compile time, there is already an example of how it can be detected too. Detecting CRC is a bit more complex than ctz/popcount and friends because of the field polynomial's impact on the table. But there is some value in detecting a table version, then converting it to clmul. I don't remember the data offhand, but clmul was noticeably better than a table. Jeff
Re: RISC-V: Added support for CRC.
On Tue, Aug 8, 2023 at 4:17 PM Jeff Law via Gcc-patches wrote: > > > > On 8/8/23 10:38, Alexander Monakov wrote: > > > > On Tue, 8 Aug 2023, Jeff Law wrote: > > > >> That was my thinking at one time. Then we started looking at the distros > >> and > >> found enough crc implementations in there to change my mind about the > >> overall > >> utility. > > > > The ones I'm familiar with are all table-based and look impossible to > > pattern-match (and hence already fairly efficient comparable to bitwise > > loop in Coremark). > We found dozens that were the usual looking loops and, IIRC ~200 table > lookups after analyzing about half of the packages in Fedora. I will make a note we do handle table lookups to detect count leading zeros, see check_ctz_array in tree-ssa-forwprop.cc for that detection. (that was done to improve a SPEC benchmark even). So if the tables are statically defined at compile time, there is already an example of how it can be detected too. Thanks, Andrew Pinski > > > > > > So... just provide a library? A library code is easier to develop and audit, > > it can be released independently, people can use it with their compiler of > > choice. Not everything needs to be in libgcc. > If the compiler can identify a CRC and collapse it down to a table or > clmul, that's a major win and such code does exist in the real world. > That was the whole point behind the Fedora experiment -- to determine if > these things are showing up in the real world or if this is just a > benchmarking exercise. > > And just to be clear, we're not proposing anything for libgcc. > > > > > I'm talking about factoring a long chain into multiple independent chains > > for latency hiding. > And that could potentially be an extension. But even without this a > standard looking CRC loop will be much faster using table lookups or > simple generation with clmul. > > Also note that latency of clmuls is improving on modern hardware. 
4c > isn't hard to achieve and I wouldn't be surprised to see 2c clmuls in > the near future. > > > > > > Useful to whom? The Linux kernel? zlib, bzip2, xz-utils? ffmpeg? > > These consumers need high-performance blockwise CRC, offering them > > a latency-bound elementwise CRC primitive is a disservice. And what > > should they use as a fallback when __builtin_crc is unavailable? > THe point is builtin_crc would always be available. If there is no > clmul, then the RTL backend can expand to a table lookup version. > > > > >> while at the same time putting one side of the infrastructure we need for > >> automatic detection of CRC loops and turning them into table lookups or > >> CLMULs. > >> > >> With that in mind I'm certain Mariam & I would love feedback on a builtin > >> API > >> that would be more useful. > > > > I think offering a conventional library for CRC has substantial advantages. > That's not what I asked. If you think there's room for improvement to a > builtin API, I'd love to hear it. > > But it seems you don't think this is worth the effort at all. That's > unfortunate, but if that's the consensus, then so be it. > > I'll note LLVM is likely going forward with CRC detection and > optimization at some point in the next ~6 months (effectively moving the > implementation from the hexagon port into the generic parts of their > loop optimizer). > > > > Jeff
Re: RISC-V: Added support for CRC.
On 8/8/23 10:38, Alexander Monakov wrote: On Tue, 8 Aug 2023, Jeff Law wrote: That was my thinking at one time. Then we started looking at the distros and found enough crc implementations in there to change my mind about the overall utility. The ones I'm familiar with are all table-based and look impossible to pattern-match (and hence already fairly efficient compared to the bitwise loop in Coremark). We found dozens that were the usual looking loops and, IIRC ~200 table lookups after analyzing about half of the packages in Fedora. So... just provide a library? Library code is easier to develop and audit, it can be released independently, people can use it with their compiler of choice. Not everything needs to be in libgcc. If the compiler can identify a CRC and collapse it down to a table or clmul, that's a major win and such code does exist in the real world. That was the whole point behind the Fedora experiment -- to determine if these things are showing up in the real world or if this is just a benchmarking exercise. And just to be clear, we're not proposing anything for libgcc. I'm talking about factoring a long chain into multiple independent chains for latency hiding. And that could potentially be an extension. But even without this a standard looking CRC loop will be much faster using table lookups or simple generation with clmul. Also note that latency of clmuls is improving on modern hardware. 4c isn't hard to achieve and I wouldn't be surprised to see 2c clmuls in the near future. Useful to whom? The Linux kernel? zlib, bzip2, xz-utils? ffmpeg? These consumers need high-performance blockwise CRC, offering them a latency-bound elementwise CRC primitive is a disservice. And what should they use as a fallback when __builtin_crc is unavailable? The point is builtin_crc would always be available. If there is no clmul, then the RTL backend can expand to a table lookup version. 
while at the same time putting one side of the infrastructure we need for automatic detection of CRC loops and turning them into table lookups or CLMULs. With that in mind I'm certain Mariam & I would love feedback on a builtin API that would be more useful. I think offering a conventional library for CRC has substantial advantages. That's not what I asked. If you think there's room for improvement to a builtin API, I'd love to hear it. But it seems you don't think this is worth the effort at all. That's unfortunate, but if that's the consensus, then so be it. I'll note LLVM is likely going forward with CRC detection and optimization at some point in the next ~6 months (effectively moving the implementation from the hexagon port into the generic parts of their loop optimizer). Jeff
Re: RISC-V: Added support for CRC.
On Tue, 8 Aug 2023, Jeff Law wrote: > That was my thinking at one time. Then we started looking at the distros and > found enough crc implementations in there to change my mind about the overall > utility. The ones I'm familiar with are all table-based and look impossible to pattern-match (and hence already fairly efficient compared to the bitwise loop in Coremark). > If we need to do something to make it more useful, we're certainly open to > that. So... just provide a library? Library code is easier to develop and audit, it can be released independently, people can use it with their compiler of choice. Not everything needs to be in libgcc. > > - they overlap multiple CLMUL chains to make the loop throughput-bound > > rather than latency-bound. The typical unroll factor is about 4x-8x. > We do have the ability to build longer chains. We actually use that in the > coremark benchmark where the underlying primitives are 8-bit CRCs that are > composed into 16/32 bit CRCs. I'm talking about factoring a long chain into multiple independent chains for latency hiding. > > Hence, I am concerned that proposed __builtin_crc is not useful for FOSS > > that actually needs high-performance CRC (the Linux kernel, compression > > and image libraries). > > > > I think this crosses the line of "cheating in benchmarks" and not something > > we should do in GCC. > Certainly not the intention. The intention is to provide a useful builtin_crc Useful to whom? The Linux kernel? zlib, bzip2, xz-utils? ffmpeg? These consumers need high-performance blockwise CRC, offering them a latency-bound elementwise CRC primitive is a disservice. And what should they use as a fallback when __builtin_crc is unavailable? > while at the same time putting one side of the infrastructure we need for > automatic detection of CRC loops and turning them into table lookups or > CLMULs. > > With that in mind I'm certain Mariam & I would love feedback on a builtin API > that would be more useful. 
I think offering a conventional library for CRC has substantial advantages.

Alexander
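Alexander's point about factoring one long dependency chain into several
independent chains rests on the linearity of CRC over GF(2). The sketch below
is my own illustration, not code from the patch or from GCC: function names
such as crc32_msb are made up, and a zero initial value with MSB-first bit
order is assumed. It splits a buffer into two chains with no data dependence
between them, then merges the partial remainders at the end.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Plain MSB-first CRC-32 remainder with zero initial value and no final
   XOR: returns M(x) * x^32 mod P(x), where P = x^32 + (bits of POLY).  */
static uint32_t crc32_msb (uint32_t crc, const uint8_t *buf, size_t len,
                           uint32_t poly)
{
  for (size_t i = 0; i < len; i++)
    {
      crc ^= (uint32_t) buf[i] << 24;
      for (int k = 0; k < 8; k++)
        crc = (crc & 0x80000000u) ? (crc << 1) ^ poly : crc << 1;
    }
  return crc;
}

/* Multiply CRC by x^NBITS modulo the polynomial, one bit per step.  A real
   implementation would precompute x^NBITS mod P and use a single CLMUL.  */
static uint32_t crc32_shift (uint32_t crc, size_t nbits, uint32_t poly)
{
  while (nbits--)
    crc = (crc & 0x80000000u) ? (crc << 1) ^ poly : crc << 1;
  return crc;
}

/* Two independent chains over the two halves of BUF, combined at the end.
   By linearity, crc(A||B) = crc(A) * x^(8*|B|) mod P, XORed with crc(B).  */
static uint32_t crc32_two_chains (const uint8_t *buf, size_t len,
                                  uint32_t poly)
{
  size_t half = len / 2;
  uint32_t a = crc32_msb (0, buf, half, poly);              /* chain 1 */
  uint32_t b = crc32_msb (0, buf + half, len - half, poly); /* chain 2 */
  return crc32_shift (a, 8 * (len - half), poly) ^ b;
}
```

The two crc32_msb calls have no dependence on each other, so on real hardware
their CLMUL (or table-lookup) steps can overlap; folding-based code generalizes
this to 4-8 chains with the merge done by one multiplication per chain.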
Re: RISC-V: Added support for CRC.
On 8/8/23 09:22, Alexander Monakov wrote:

> Jeff, as I understand it, this is all happening only because Coremark
> contains a use of bitwise CRC that affects benchmark scores. In another
> universe where
>
> - Coremark was careful to checksum outputs outside of timed sections, or
> - implemented CRC in a manner that is not transparent to the compiler, or
> - did not use CRC at all
>
> we would not be spending effort on this, correct? At best we might have a
> discussion on providing a __builtin_clmul for carry-less multiplication
> (which _is_ a fundamental primitive, unlike __builtin_crc), and move on.

That was my thinking at one time. Then we started looking at the distros and
found enough crc implementations in there to change my mind about the overall
utility.

> Note that the proposed __builtin_crc does not match how a high-performance
> CRC over a variable-size array is implemented. You don't want to do two
> back-to-back CLMULs to compute a new CRC given an old CRC. That makes your
> loop latency-constrained to 2*L*N, where L is the latency of the CLMUL
> instruction and N is the number of loop iterations.

If we need to do something to make it more useful, we're certainly open to
that.

> Instead, efficient CRC loops have the following structure:
>
> - they carry an unreduced remainder in the loop, performing final reduction
> modulo the polynomial only once after the loop — this halves the CLMUL count;
>
> - they overlap multiple CLMUL chains to make the loop throughput-bound
> rather than latency-bound. The typical unroll factor is about 4x-8x.

We do have the ability to build longer chains. We actually use that in the
coremark benchmark where the underlying primitives are 8-bit CRCs that are
composed into 16/32 bit CRCs.

> A relatively easy-to-follow explanation is provided by Pete Cawley at
> https://www.corsix.org/content/alternative-exposition-crc32_4k_pclmulqdq
> (there are other sources for similar optimization of table-based CRC).

> Also note that in __builtin_crc care is needed regarding how the polynomial
> is specified (which term is dropped, and what bit order is used).

Absolutely.

> Hence, I am concerned that the proposed __builtin_crc is not useful for FOSS
> that actually needs high-performance CRC (the Linux kernel, compression and
> image libraries).
>
> I think this crosses the line of "cheating in benchmarks" and is not
> something we should do in GCC.

Certainly not the intention. The intention is to provide a useful builtin_crc
while at the same time putting in place one side of the infrastructure we need
for automatic detection of CRC loops and turning them into table lookups or
CLMULs.

With that in mind I'm certain Mariam & I would love feedback on a builtin API
that would be more useful.

jeff
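For reference, the carry-less multiplication primitive the thread keeps coming
back to (the operation RISC-V Zbc exposes as clmul/clmulh and x86 as
pclmulqdq) can be modelled bit-serially in portable C. This is purely an
illustrative software model, not how GCC would expand a __builtin_clmul:

```c
#include <assert.h>
#include <stdint.h>

/* Carry-less multiply of two 32-bit polynomials over GF(2), giving the
   full 63-bit product: for every set bit x^i in B, XOR in A(x) * x^i.  */
static uint64_t clmul32 (uint32_t a, uint32_t b)
{
  uint64_t acc = 0;
  for (int i = 0; i < 32; i++)
    if ((b >> i) & 1)            /* coefficient of x^i in b(x) is 1 */
      acc ^= (uint64_t) a << i;  /* add (XOR, no carries) a(x) * x^i */
  return acc;
}
```

For example, clmul32 (3, 3) is 5, since (x+1)^2 = x^2 + 1 over GF(2), whereas
ordinary multiplication with carries gives 9.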
Re: RISC-V: Added support for CRC.
On Thu, 3 Aug 2023, Jeff Law wrote:

> The end goal here is to actually detect bitwise CRC implementations in the
> gimple optimizers and turn them into table lookups or carryless multiplies in
> RTL.
>
> Mariam has that working end-to-end and has proposed a talk for the Cauldron on
> the topic.
>
> The idea here is to carve out the RTL side which we think provides potential
> value to end users (the ability to use the builtin to get a performant CRC
> implementation) and to get community feedback on the implementation.

Jeff, as I understand it, this is all happening only because Coremark contains
a use of bitwise CRC that affects benchmark scores. In another universe where

- Coremark was careful to checksum outputs outside of timed sections, or
- implemented CRC in a manner that is not transparent to the compiler, or
- did not use CRC at all

we would not be spending effort on this, correct? At best we might have a
discussion on providing a __builtin_clmul for carry-less multiplication
(which _is_ a fundamental primitive, unlike __builtin_crc), and move on.

Note that the proposed __builtin_crc does not match how a high-performance CRC
over a variable-size array is implemented. You don't want to do two
back-to-back CLMULs to compute a new CRC given an old CRC. That makes your
loop latency-constrained to 2*L*N, where L is the latency of the CLMUL
instruction and N is the number of loop iterations.

Instead, efficient CRC loops have the following structure:

- they carry an unreduced remainder in the loop, performing final reduction
  modulo the polynomial only once after the loop — this halves the CLMUL count;

- they overlap multiple CLMUL chains to make the loop throughput-bound rather
  than latency-bound. The typical unroll factor is about 4x-8x.

A relatively easy-to-follow explanation is provided by Pete Cawley at
https://www.corsix.org/content/alternative-exposition-crc32_4k_pclmulqdq
(there are other sources for similar optimization of table-based CRC).

Also note that in __builtin_crc care is needed regarding how the polynomial is
specified (which term is dropped, and what bit order is used).

Hence, I am concerned that the proposed __builtin_crc is not useful for FOSS
that actually needs high-performance CRC (the Linux kernel, compression and
image libraries).

I think this crosses the line of "cheating in benchmarks" and is not something
we should do in GCC.

Alexander
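To make the convention pitfall concrete: the CRC-32 (IEEE 802.3) generator
polynomial, written with the x^32 term dropped, is 0x04C11DB7 in MSB-first
("normal") bit order and 0xEDB88320 in LSB-first ("reflected") order, and each
form requires a matching shift direction in the update step. A sketch with my
own naming; zlib-style init/final-XOR conventions are assumed for the
reflected variant:

```c
#include <assert.h>
#include <stdint.h>

#define CRC32_POLY_NORMAL    0x04C11DB7u /* bit 31 = x^31, ..., bit 0 = x^0 */
#define CRC32_POLY_REFLECTED 0xEDB88320u /* bit 0 = x^31, ..., bit 31 = x^0 */

/* Reverse the bit order of a 32-bit value; the two polynomial constants
   above are related by exactly this operation.  */
static uint32_t bitrev32 (uint32_t v)
{
  uint32_t r = 0;
  for (int i = 0; i < 32; i++)
    r |= ((v >> i) & 1u) << (31 - i);
  return r;
}

/* MSB-first update: shift left, test the top bit.  */
static uint32_t crc32_update_normal (uint32_t crc, uint8_t byte)
{
  crc ^= (uint32_t) byte << 24;
  for (int k = 0; k < 8; k++)
    crc = (crc & 0x80000000u) ? (crc << 1) ^ CRC32_POLY_NORMAL : crc << 1;
  return crc;
}

/* LSB-first (reflected) update: shift right, test the low bit.  This is
   the form zlib and the Linux kernel use for CRC-32.  */
static uint32_t crc32_update_reflected (uint32_t crc, uint8_t byte)
{
  crc ^= byte;
  for (int k = 0; k < 8; k++)
    crc = (crc & 1u) ? (crc >> 1) ^ CRC32_POLY_REFLECTED : crc >> 1;
  return crc;
}
```

A builtin taking "the polynomial" as a bare integer is ambiguous unless it
also pins down which of these two encodings (and which dropped term) it means.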
Re: RISC-V: Added support for CRC.
On 8/3/23 13:37, Mariam Harutyunyan via Gcc-patches wrote:

> This patch adds CRC support for the RISC-V architecture. It adds internal
> functions and built-ins specifically designed to handle CRC computations
> efficiently.
>
> If the target is ZBC, the clmul instruction is used for the CRC code
> generation; otherwise, table-based CRC is generated. A table with 256
> elements is used to store precomputed CRCs.
>
> These CRC calculation algorithms have higher performance than the naive CRC
> calculation algorithm.

So just a little bit of additional information for others that might be
looking at this thread.

The end goal here is to actually detect bitwise CRC implementations in the
gimple optimizers and turn them into table lookups or carryless multiplies in
RTL.

Mariam has that working end-to-end and has proposed a talk for the Cauldron on
the topic.

The idea here is to carve out the RTL side which we think provides potential
value to end users (the ability to use the builtin to get a performant CRC
implementation) and to get community feedback on the implementation. So this
isn't the end of the line for this work, just the first carved-out chunk.

Jeff
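The bitwise idiom the gimple pass aims to recognize looks roughly like
Coremark's 8-bit CRC step. The following is a sketch in that style (reflected
CRC-16 polynomial 0x8005, i.e. 0xA001), not the verbatim benchmark source:

```c
#include <assert.h>
#include <stdint.h>

/* One byte folded into a 16-bit CRC, one bit per loop iteration: the kind
   of loop a recognizer would rewrite as a table lookup or CLMUL sequence.  */
static uint16_t crcu8 (uint8_t data, uint16_t crc)
{
  for (int i = 0; i < 8; i++)
    {
      /* Compare the incoming data bit against the low CRC bit.  */
      uint8_t x16 = (uint8_t) ((data & 1) ^ (crc & 1));
      data >>= 1;
      if (x16)
        crc = (uint16_t) ((crc >> 1) ^ 0xA001); /* reflected 0x8005 */
      else
        crc >>= 1;
    }
  return crc;
}
```

Once a loop of this shape is detected, the expander can emit either the
256-entry table walk or, on Zbc, a clmul-based reduction in its place.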
Re: RISC-V: Added support for CRC.
Hi. Thank you. I'll add.

Best regards,
Mariam

On Thu, Aug 3, 2023, 23:56 Andrew Pinski wrote:

> On Thu, Aug 3, 2023 at 12:38 PM Mariam Harutyunyan via Gcc-patches wrote:
> >
> > This patch adds CRC support for the RISC-V architecture. It adds internal
> > functions and built-ins specifically designed to handle CRC computations
> > efficiently.
> >
> > If the target is ZBC, the clmul instruction is used for the CRC code
> > generation; otherwise, table-based CRC is generated. A table with 256
> > elements is used to store precomputed CRCs.
> >
> > These CRC calculation algorithms have higher performance than the naive
> > CRC calculation algorithm.
>
> A few things about this patch:
> You created a generic (non-target specific) builtin but didn't
> document it in doc/extend.texi .
> You created a generic builtin with no fallback in libgcc.
> You created a new named (RTL) pattern, crc, and didn't document it in
> the `Standard Names` section of doc/md.texi .
>
> Thanks,
> Andrew Pinski
>
> > gcc/ChangeLog:
> >         * builtin-types.def (BT_FN_UINT8_UINT8_UINT8_CONST_SIZE): Define.
> >         (BT_FN_UINT16_UINT16_UINT8_CONST_SIZE): Likewise.
> >         (BT_FN_UINT16_UINT16_UINT16_CONST_SIZE): Likewise.
> >         (BT_FN_UINT32_UINT32_UINT8_CONST_SIZE): Likewise.
> >         (BT_FN_UINT32_UINT32_UINT16_CONST_SIZE): Likewise.
> >         (BT_FN_UINT32_UINT32_UINT32_CONST_SIZE): Likewise.
> >         (BT_FN_UINT64_UINT64_UINT8_CONST_SIZE): Likewise.
> >         (BT_FN_UINT64_UINT64_UINT16_CONST_SIZE): Likewise.
> >         (BT_FN_UINT64_UINT64_UINT32_CONST_SIZE): Likewise.
> >         (BT_FN_UINT64_UINT64_UINT64_CONST_SIZE): Likewise.
> >         * builtins.cc (associated_internal_fn): Handle BUILT_IN_CRC8_DATA8,
> >         BUILT_IN_CRC16_DATA8, BUILT_IN_CRC16_DATA16, BUILT_IN_CRC32_DATA8,
> >         BUILT_IN_CRC32_DATA16, BUILT_IN_CRC32_DATA32, BUILT_IN_CRC64_DATA8,
> >         BUILT_IN_CRC64_DATA16, BUILT_IN_CRC64_DATA32, BUILT_IN_CRC64_DATA64.
> >         * builtins.def (BUILT_IN_CRC8_DATA8): New builtin.
> >         (BUILT_IN_CRC16_DATA8): Likewise.
> >         (BUILT_IN_CRC16_DATA16): Likewise.
> >         (BUILT_IN_CRC32_DATA8): Likewise.
> >         (BUILT_IN_CRC32_DATA16): Likewise.
> >         (BUILT_IN_CRC32_DATA32): Likewise.
> >         (BUILT_IN_CRC64_DATA8): Likewise.
> >         (BUILT_IN_CRC64_DATA16): Likewise.
> >         (BUILT_IN_CRC64_DATA32): Likewise.
> >         (BUILT_IN_CRC64_DATA64): Likewise.
> >         * config/riscv/bitmanip.md (crc4): New expander.
> >         * config/riscv/riscv-protos.h (expand_crc_table_based): Declare.
> >         (expand_crc_using_clmul): Likewise.
> >         * config/riscv/riscv.cc (gf2n_poly_long_div_quotient): New function.
> >         (generate_crc): Likewise.
> >         (generate_crc_table): Likewise.
> >         (expand_crc_table_based): Likewise.
> >         (expand_crc_using_clmul): Likewise.
> >         * config/riscv/riscv.md (UNSPEC_CRC): New unspec for CRC.
> >         * internal-fn.cc (crc_direct): Define.
> >         (expand_crc_optab_fn): New function.
> >         (direct_crc_optab_supported_p): Define.
> >         * internal-fn.def (CRC): New internal optab function.
> >         * optabs.def (crc_optab): New optab.
> >
> > gcc/testsuite/ChangeLog:
> >         * gcc.target/riscv/crc-builtin-table-target32.c: New test.
> >         * gcc.target/riscv/crc-builtin-table-target64.c: New test.
> >         * gcc.target/riscv/crc-builtin-zbc32.c: New test.
> >         * gcc.target/riscv/crc-builtin-zbc64.c: New test.
Re: RISC-V: Added support for CRC.
On Thu, Aug 3, 2023 at 12:38 PM Mariam Harutyunyan via Gcc-patches wrote:
>
> This patch adds CRC support for the RISC-V architecture. It adds internal
> functions and built-ins specifically designed to handle CRC computations
> efficiently.
>
> If the target is ZBC, the clmul instruction is used for the CRC code
> generation; otherwise, table-based CRC is generated. A table with 256
> elements is used to store precomputed CRCs.
>
> These CRC calculation algorithms have higher performance than the naive CRC
> calculation algorithm.

A few things about this patch:
You created a generic (non-target specific) builtin but didn't
document it in doc/extend.texi .
You created a generic builtin with no fallback in libgcc.
You created a new named (RTL) pattern, crc, and didn't document it in
the `Standard Names` section of doc/md.texi .

Thanks,
Andrew Pinski

> gcc/ChangeLog:
>         * builtin-types.def (BT_FN_UINT8_UINT8_UINT8_CONST_SIZE): Define.
>         (BT_FN_UINT16_UINT16_UINT8_CONST_SIZE): Likewise.
>         (BT_FN_UINT16_UINT16_UINT16_CONST_SIZE): Likewise.
>         (BT_FN_UINT32_UINT32_UINT8_CONST_SIZE): Likewise.
>         (BT_FN_UINT32_UINT32_UINT16_CONST_SIZE): Likewise.
>         (BT_FN_UINT32_UINT32_UINT32_CONST_SIZE): Likewise.
>         (BT_FN_UINT64_UINT64_UINT8_CONST_SIZE): Likewise.
>         (BT_FN_UINT64_UINT64_UINT16_CONST_SIZE): Likewise.
>         (BT_FN_UINT64_UINT64_UINT32_CONST_SIZE): Likewise.
>         (BT_FN_UINT64_UINT64_UINT64_CONST_SIZE): Likewise.
>         * builtins.cc (associated_internal_fn): Handle BUILT_IN_CRC8_DATA8,
>         BUILT_IN_CRC16_DATA8, BUILT_IN_CRC16_DATA16, BUILT_IN_CRC32_DATA8,
>         BUILT_IN_CRC32_DATA16, BUILT_IN_CRC32_DATA32, BUILT_IN_CRC64_DATA8,
>         BUILT_IN_CRC64_DATA16, BUILT_IN_CRC64_DATA32, BUILT_IN_CRC64_DATA64.
>         * builtins.def (BUILT_IN_CRC8_DATA8): New builtin.
>         (BUILT_IN_CRC16_DATA8): Likewise.
>         (BUILT_IN_CRC16_DATA16): Likewise.
>         (BUILT_IN_CRC32_DATA8): Likewise.
>         (BUILT_IN_CRC32_DATA16): Likewise.
>         (BUILT_IN_CRC32_DATA32): Likewise.
>         (BUILT_IN_CRC64_DATA8): Likewise.
>         (BUILT_IN_CRC64_DATA16): Likewise.
>         (BUILT_IN_CRC64_DATA32): Likewise.
>         (BUILT_IN_CRC64_DATA64): Likewise.
>         * config/riscv/bitmanip.md (crc4): New expander.
>         * config/riscv/riscv-protos.h (expand_crc_table_based): Declare.
>         (expand_crc_using_clmul): Likewise.
>         * config/riscv/riscv.cc (gf2n_poly_long_div_quotient): New function.
>         (generate_crc): Likewise.
>         (generate_crc_table): Likewise.
>         (expand_crc_table_based): Likewise.
>         (expand_crc_using_clmul): Likewise.
>         * config/riscv/riscv.md (UNSPEC_CRC): New unspec for CRC.
>         * internal-fn.cc (crc_direct): Define.
>         (expand_crc_optab_fn): New function.
>         (direct_crc_optab_supported_p): Define.
>         * internal-fn.def (CRC): New internal optab function.
>         * optabs.def (crc_optab): New optab.
>
> gcc/testsuite/ChangeLog:
>         * gcc.target/riscv/crc-builtin-table-target32.c: New test.
>         * gcc.target/riscv/crc-builtin-table-target64.c: New test.
>         * gcc.target/riscv/crc-builtin-zbc32.c: New test.
>         * gcc.target/riscv/crc-builtin-zbc64.c: New test.
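On the missing libgcc fallback: portable fallback code would be essentially
the 256-entry table scheme the patch generates internally. A sketch of what
such library code might look like; this is my illustration (reflected CRC-32
with zlib-style initial value and final XOR assumed), not code from the patch:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

static uint32_t crc_table[256];

/* Precompute the CRC of every possible byte value: run the bit-at-a-time
   reflected update eight times for each of the 256 inputs.  */
static void init_crc_table (uint32_t poly_reflected)
{
  for (uint32_t n = 0; n < 256; n++)
    {
      uint32_t c = n;
      for (int k = 0; k < 8; k++)
        c = (c & 1u) ? (c >> 1) ^ poly_reflected : c >> 1;
      crc_table[n] = c;
    }
}

/* One table lookup per input byte instead of eight conditional shifts.  */
static uint32_t crc32_bytewise (const uint8_t *buf, size_t len)
{
  uint32_t crc = 0xFFFFFFFFu;
  for (size_t i = 0; i < len; i++)
    crc = (crc >> 8) ^ crc_table[(crc ^ buf[i]) & 0xFFu];
  return crc ^ 0xFFFFFFFFu;
}
```

A library of this shape can be shipped, audited, and used independently of the
compiler, which is the trade-off Alexander raises elsewhere in the thread.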
RISC-V: Added support for CRC.
This patch adds CRC support for the RISC-V architecture. It adds internal
functions and built-ins specifically designed to handle CRC computations
efficiently.

If the target is ZBC, the clmul instruction is used for the CRC code
generation; otherwise, table-based CRC is generated. A table with 256
elements is used to store precomputed CRCs.

These CRC calculation algorithms have higher performance than the naive CRC
calculation algorithm.

gcc/ChangeLog:
        * builtin-types.def (BT_FN_UINT8_UINT8_UINT8_CONST_SIZE): Define.
        (BT_FN_UINT16_UINT16_UINT8_CONST_SIZE): Likewise.
        (BT_FN_UINT16_UINT16_UINT16_CONST_SIZE): Likewise.
        (BT_FN_UINT32_UINT32_UINT8_CONST_SIZE): Likewise.
        (BT_FN_UINT32_UINT32_UINT16_CONST_SIZE): Likewise.
        (BT_FN_UINT32_UINT32_UINT32_CONST_SIZE): Likewise.
        (BT_FN_UINT64_UINT64_UINT8_CONST_SIZE): Likewise.
        (BT_FN_UINT64_UINT64_UINT16_CONST_SIZE): Likewise.
        (BT_FN_UINT64_UINT64_UINT32_CONST_SIZE): Likewise.
        (BT_FN_UINT64_UINT64_UINT64_CONST_SIZE): Likewise.
        * builtins.cc (associated_internal_fn): Handle BUILT_IN_CRC8_DATA8,
        BUILT_IN_CRC16_DATA8, BUILT_IN_CRC16_DATA16, BUILT_IN_CRC32_DATA8,
        BUILT_IN_CRC32_DATA16, BUILT_IN_CRC32_DATA32, BUILT_IN_CRC64_DATA8,
        BUILT_IN_CRC64_DATA16, BUILT_IN_CRC64_DATA32, BUILT_IN_CRC64_DATA64.
        * builtins.def (BUILT_IN_CRC8_DATA8): New builtin.
        (BUILT_IN_CRC16_DATA8): Likewise.
        (BUILT_IN_CRC16_DATA16): Likewise.
        (BUILT_IN_CRC32_DATA8): Likewise.
        (BUILT_IN_CRC32_DATA16): Likewise.
        (BUILT_IN_CRC32_DATA32): Likewise.
        (BUILT_IN_CRC64_DATA8): Likewise.
        (BUILT_IN_CRC64_DATA16): Likewise.
        (BUILT_IN_CRC64_DATA32): Likewise.
        (BUILT_IN_CRC64_DATA64): Likewise.
        * config/riscv/bitmanip.md (crc4): New expander.
        * config/riscv/riscv-protos.h (expand_crc_table_based): Declare.
        (expand_crc_using_clmul): Likewise.
        * config/riscv/riscv.cc (gf2n_poly_long_div_quotient): New function.
        (generate_crc): Likewise.
        (generate_crc_table): Likewise.
        (expand_crc_table_based): Likewise.
        (expand_crc_using_clmul): Likewise.
        * config/riscv/riscv.md (UNSPEC_CRC): New unspec for CRC.
        * internal-fn.cc (crc_direct): Define.
        (expand_crc_optab_fn): New function.
        (direct_crc_optab_supported_p): Define.
        * internal-fn.def (CRC): New internal optab function.
        * optabs.def (crc_optab): New optab.

gcc/testsuite/ChangeLog:
        * gcc.target/riscv/crc-builtin-table-target32.c: New test.
        * gcc.target/riscv/crc-builtin-table-target64.c: New test.
        * gcc.target/riscv/crc-builtin-zbc32.c: New test.
        * gcc.target/riscv/crc-builtin-zbc64.c: New test.

From 9d2e9023c222501a1d9519bea3d5cdbd32b5a91e Mon Sep 17 00:00:00 2001
From: Mariam Arutunian
Date: Thu, 3 Aug 2023 15:59:57 +0400
Subject: [PATCH] RISC-V: Added support for CRC.

If the target is ZBC, then the clmul instruction is used for the CRC code
generation; otherwise, table-based CRC is generated. A table with 256
elements is used to store precomputed CRCs.

gcc/ChangeLog:
        * builtin-types.def (BT_FN_UINT8_UINT8_UINT8_CONST_SIZE): Define.
        (BT_FN_UINT16_UINT16_UINT8_CONST_SIZE): Likewise.
        (BT_FN_UINT16_UINT16_UINT16_CONST_SIZE): Likewise.
        (BT_FN_UINT32_UINT32_UINT8_CONST_SIZE): Likewise.
        (BT_FN_UINT32_UINT32_UINT16_CONST_SIZE): Likewise.
        (BT_FN_UINT32_UINT32_UINT32_CONST_SIZE): Likewise.
        (BT_FN_UINT64_UINT64_UINT8_CONST_SIZE): Likewise.
        (BT_FN_UINT64_UINT64_UINT16_CONST_SIZE): Likewise.
        (BT_FN_UINT64_UINT64_UINT32_CONST_SIZE): Likewise.
        (BT_FN_UINT64_UINT64_UINT64_CONST_SIZE): Likewise.
        * builtins.cc (associated_internal_fn): Handle BUILT_IN_CRC8_DATA8,
        BUILT_IN_CRC16_DATA8, BUILT_IN_CRC16_DATA16, BUILT_IN_CRC32_DATA8,
        BUILT_IN_CRC32_DATA16, BUILT_IN_CRC32_DATA32, BUILT_IN_CRC64_DATA8,
        BUILT_IN_CRC64_DATA16, BUILT_IN_CRC64_DATA32, BUILT_IN_CRC64_DATA64.
        * builtins.def (BUILT_IN_CRC8_DATA8): New builtin.
        (BUILT_IN_CRC16_DATA8): Likewise.
        (BUILT_IN_CRC16_DATA16): Likewise.
        (BUILT_IN_CRC32_DATA8): Likewise.
        (BUILT_IN_CRC32_DATA16): Likewise.
        (BUILT_IN_CRC32_DATA32): Likewise.
        (BUILT_IN_CRC64_DATA8): Likewise.
        (BUILT_IN_CRC64_DATA16): Likewise.
        (BUILT_IN_CRC64_DATA32): Likewise.
        (BUILT_IN_CRC64_DATA64): Likewise.
        * config/riscv/bitmanip.md (crc4): New expander.
        * config/riscv/riscv-protos.h (expand_crc_table_based): Declare.
        (expand_crc_using_clmul): Likewise.
        * config/riscv/riscv.cc (gf2n_poly_long_div_quotient): New function.
        (generate_crc): Likewise.
        (generate_crc_table): Likewise.
        (expand_crc_table_based): Likewise.
        (expand_crc_using_clmul): Likewise.
        * config/riscv/riscv.md (UNSPEC_CRC): New unspec for CRC.
        * internal-fn.cc (crc_direct): Define