Re: [RFC/RFA][PATCH 00/12] CRC optimization

2024-05-27 Thread Mariam Arutunian
Yes, this pass can detect only bitwise (naive) implementations of CRC32C
and replace them with hardware crc32* instructions (if supported), or
alternatively
generate CRC using CLMUL-based or table-based methods.
Here is an example of such a CRC32C implementation, where the entire loop
will be replaced with the crc32b instruction:
uint32_t crc32c(const uint8_t *data, size_t length) { uint32_t crc =
0x; for (size_t i = 0; i < length; ++i) { crc ^= data[i]; for (int
j = 0; j < 8; ++j) { crc = (crc >> 1) ^ ((crc & 1) ? 0x82F63B78 : 0); } }
return ~crc; }
I have added support for this optimization, as detailed in this patch:
https://gcc.gnu.org/pipermail/gcc-patches/2024-May/652615.html . The patch
description mentions a polynomial of 0x1EDC6F41, but CRC32C actually uses
the reflected version of this polynomial: 0x82F63B78. It might be clearer
to use 0x82F63B78 in the description.
Unfortunately, the example from the link you provided uses a table-based
version of CRC32C, which we don't detect. Currently, we only optimize
bitwise implementations, which are slower than all other implementations. If
necessary, I can add support for detecting table-based CRC implementations.
This would require a completely different algorithm than the one used for
bitwise implementations.
Thanks,
Mariam

On Sun, May 26, 2024 at 10:23 PM NightStrike  wrote:

>
>
> On Fri, May 24, 2024, 04:42 Mariam Arutunian 
> wrote:
>
>> Hello!
>> This patch set detects bitwise CRC implementation loops (with branches)
>> in the GIMPLE optimizers and replaces them with more optimal CRC
>> implementations in RTL. These patches introduce new internal functions,
>> built-in functions, and expanders for CRC generation, leveraging hardware
>> instructions where available. Additionally, various tests are included
>> to check CRC detection and generation. Main Features:
>>
>>1.
>>
>>CRC Loop Detection and Replacement:
>>- Detection of CRC loops involves two stages: fast checks to identify
>>   potential candidates and verification using symbolic execution. The
>>   algorithm detects only CRCs (8, 16, 32, and 64 bits, both bit-forward 
>> and
>>   bit-reversed) with constant polynomials used without the leading 1. 
>> This
>>   part can be improved to detect more implementation types.
>>   - Once identified, the CRC loops are replaced with calls to newly
>>   added internal functions. These internal functions use target-specific
>>   expanders if available, otherwise generating table-based CRCs.
>>2.
>>
>>Architecture-Specific Expanders:
>>- Expanders are added for RISC-V, aarch64, and i386 architectures.
>>   - These expanders generate CRCs using either carry-less
>>   multiplication instructions or direct CRC instructions, based on the 
>> target
>>   architecture's capabilities.
>>3.
>>
>>New Internal and Built-In Functions:
>>- Introduces internal functions and built-in functions for CRC
>>   generation, supporting various CRC and data sizes (8, 16, 32, and 64 
>> bits).
>>
>> I presented this work during the GNU Tools Cauldron 2023. You can view
>> the presentation here: GCC CRC optimization presentation
>> 
>> .
>>
>> Previously, I submitted a patch to GCC upstream that included built-in
>> parts and expanders for RISC-V. However, the main component of the
>> previously sent patch has been changed. You can find the patch here:
>> https://gcc.gnu.org/pipermail/gcc-patches/2023-August/626279.html
>>
>>
>> Best regards,
>> Mariam
>>
>
> Could this detect a specific CRC32C calculation as well (vs any urge CRC)
> and replace it with optimized calls to the dedicated crc32c hardware
> instructions via __builtin_ia32_crc32*i? I'm asking because ironically I
> was just a few days ago trying to see how current GCC optimizes simpler but
> slower approaches compared to hand tuning. For example,
> https://github.com/htot/crc32c. Almost every implementation I found was
> ultimately based on Mark Adler's (of zlib and that fun Mars mission fame)
> implementation posted to stack overflow here:
> https://stackoverflow.com/a/17646775
>
> Because so many current libraries are based on that, it would be nice to
> see if GCC can improve upon it with your patch. Or if they are no longer
> necessary, because your patch optimizes the simpler software approach to
> basically yield Mark's solution.
>
>>


Re: [RFC/RFA][PATCH 00/12] CRC optimization

2024-05-26 Thread NightStrike
On Fri, May 24, 2024, 04:42 Mariam Arutunian 
wrote:

> Hello!
> This patch set detects bitwise CRC implementation loops (with branches) in
> the GIMPLE optimizers and replaces them with more optimal CRC
> implementations in RTL. These patches introduce new internal functions,
> built-in functions, and expanders for CRC generation, leveraging hardware
> instructions where available. Additionally, various tests are included to
> check CRC detection and generation. Main Features:
>
>1.
>
>CRC Loop Detection and Replacement:
>- Detection of CRC loops involves two stages: fast checks to identify
>   potential candidates and verification using symbolic execution. The
>   algorithm detects only CRCs (8, 16, 32, and 64 bits, both bit-forward 
> and
>   bit-reversed) with constant polynomials used without the leading 1. This
>   part can be improved to detect more implementation types.
>   - Once identified, the CRC loops are replaced with calls to newly
>   added internal functions. These internal functions use target-specific
>   expanders if available, otherwise generating table-based CRCs.
>2.
>
>Architecture-Specific Expanders:
>- Expanders are added for RISC-V, aarch64, and i386 architectures.
>   - These expanders generate CRCs using either carry-less
>   multiplication instructions or direct CRC instructions, based on the 
> target
>   architecture's capabilities.
>3.
>
>New Internal and Built-In Functions:
>- Introduces internal functions and built-in functions for CRC
>   generation, supporting various CRC and data sizes (8, 16, 32, and 64 
> bits).
>
> I presented this work during the GNU Tools Cauldron 2023. You can view the
> presentation here: GCC CRC optimization presentation
> 
> .
>
> Previously, I submitted a patch to GCC upstream that included built-in
> parts and expanders for RISC-V. However, the main component of the
> previously sent patch has been changed. You can find the patch here:
> https://gcc.gnu.org/pipermail/gcc-patches/2023-August/626279.html
>
>
> Best regards,
> Mariam
>

Could this detect a specific CRC32C calculation as well (vs any urge CRC)
and replace it with optimized calls to the dedicated crc32c hardware
instructions via __builtin_ia32_crc32*i? I'm asking because ironically I
was just a few days ago trying to see how current GCC optimizes simpler but
slower approaches compared to hand tuning. For example,
https://github.com/htot/crc32c. Almost every implementation I found was
ultimately based on Mark Adler's (of zlib and that fun Mars mission fame)
implementation posted to stack overflow here:
https://stackoverflow.com/a/17646775

Because so many current libraries are based on that, it would be nice to
see if GCC can improve upon it with your patch. Or if they are no longer
necessary, because your patch optimizes the simpler software approach to
basically yield Mark's solution.

>


Re: [RFC/RFA][PATCH 00/12] CRC optimization

2024-05-24 Thread Jeff Law




On 5/24/24 2:41 AM, Mariam Arutunian wrote:

Hello!

This patch set detects bitwise CRC implementation loops (with branches) 
in the GIMPLE optimizers and replaces them with more optimal CRC 
implementations in RTL. These patches introduce new internal functions, 
built-in functions, and expanders for CRC generation, leveraging 
hardware instructions where available. Additionally, various tests are 
included to check CRC detection and generation.
Thanks so much for getting this process started.  It's a bit quicker 
than I was ready, but no worries.





 2.

Architecture-Specific Expanders:

  * Expanders are added for RISC-V, aarch64, and i386 architectures.
  * These expanders generate CRCs using either carry-less
multiplication instructions or direct CRC instructions, based on
the target architecture's capabilities.
Also note for the wider audience, this work can also generate table 
lookup based CRC implementations.  This has proven exceedingly helpful 
during the testing phase as we were able to run this code on a wide 
variety of the embedded targets to shake out target dependencies.


On Ventana's V1 design the clmul variant was a small, but clear winner 
over the table lookup.  Obviously the bitwise implementation found in 
coremark was the worst performing.


On our V2 design clmul outperforms the table lookup by a wide margin, 
largely due to reduced latency of clmul.



Jeff


[RFC/RFA][PATCH 00/12] CRC optimization

2024-05-24 Thread Mariam Arutunian
Hello!
This patch set detects bitwise CRC implementation loops (with branches) in
the GIMPLE optimizers and replaces them with more optimal CRC
implementations in RTL. These patches introduce new internal functions,
built-in functions, and expanders for CRC generation, leveraging hardware
instructions where available. Additionally, various tests are included to
check CRC detection and generation. Main Features:

   1.

   CRC Loop Detection and Replacement:
   - Detection of CRC loops involves two stages: fast checks to identify
  potential candidates and verification using symbolic execution. The
  algorithm detects only CRCs (8, 16, 32, and 64 bits, both bit-forward and
  bit-reversed) with constant polynomials used without the leading 1. This
  part can be improved to detect more implementation types.
  - Once identified, the CRC loops are replaced with calls to newly
  added internal functions. These internal functions use target-specific
  expanders if available, otherwise generating table-based CRCs.
   2.

   Architecture-Specific Expanders:
   - Expanders are added for RISC-V, aarch64, and i386 architectures.
  - These expanders generate CRCs using either carry-less
  multiplication instructions or direct CRC instructions, based on
the target
  architecture's capabilities.
   3.

   New Internal and Built-In Functions:
   - Introduces internal functions and built-in functions for CRC
  generation, supporting various CRC and data sizes (8, 16, 32,
and 64 bits).

I presented this work during the GNU Tools Cauldron 2023. You can view the
presentation here: GCC CRC optimization presentation

.

Previously, I submitted a patch to GCC upstream that included built-in
parts and expanders for RISC-V. However, the main component of the
previously sent patch has been changed. You can find the patch here:
https://gcc.gnu.org/pipermail/gcc-patches/2023-August/626279.html


Best regards,
Mariam