On Fri, 18 Jun 2021 22:12:11 GMT, Scott Gibbons 
<github.com+6704669+asgibb...@openjdk.org> wrote:

>> Add the Base64 Decode intrinsic for x86 to utilize AVX-512 for acceleration. 
>> Also allows for performance improvement for non-AVX-512 enabled platforms. 
>> Due to the nature of MIME-encoded inputs, modify the intrinsic signature to 
>> accept an additional parameter (isMIME) for fast-path MIME decoding.
>> 
>> A change was made to the signature of DecodeBlock in Base64.java to provide 
>> the intrinsic with information as to whether MIME decoding is being done.  
>> This allows the intrinsic to bypass the expensive setup of zmm registers 
>> from AVX tables, knowing there may be invalid Base64 characters every 76 
>> characters or so.  A change was also made here removing the restriction that 
>> the intrinsic must return an even multiple of 3 bytes decoded.  This 
>> implementation handles the pad characters at the end of the string and will 
>> return the actual number of characters decoded.
>> 
>> The AVX portion of this code will decode in blocks of 256 bytes per loop 
>> iteration, then in chunks of 64 bytes, followed by end fixup decoding.  The 
>> non-AVX code is an assembly-optimized version of the Java DecodeBlock and 
>> behaves identically.
>> 
>> Running the Base64Decode benchmark, this change increases decode performance 
>> by an average of 2.6x, with a maximum of 19.7x for buffers > ~20k.  The numbers 
>> are given in the table below.
>> 
>> **Base Score** is without intrinsic support, **Optimized Score** is using 
>> this intrinsic, and **Gain** is **Base** / **Optimized**.
>> 
>> 
>> Benchmark Name | Base Score | Optimized Score | Gain
>> -- | -- | -- | --
>> testBase64Decode size 1 | 15.36 | 15.32 | 1.00
>> testBase64Decode size 3 | 17.00 | 16.72 | 1.02
>> testBase64Decode size 7 | 20.60 | 18.82 | 1.09
>> testBase64Decode size 32 | 34.21 | 26.77 | 1.28
>> testBase64Decode size 64 | 54.43 | 38.35 | 1.42
>> testBase64Decode size 80 | 66.40 | 48.34 | 1.37
>> testBase64Decode size 96 | 73.16 | 52.90 | 1.38
>> testBase64Decode size 112 | 84.93 | 51.82 | 1.64
>> testBase64Decode size 512 | 288.81 | 32.04 | 9.01
>> testBase64Decode size 1000 | 560.48 | 40.79 | 13.74
>> testBase64Decode size 20000 | 9530.28 | 483.37 | 19.72
>> testBase64Decode size 50000 | 24552.24 | 1735.07 | 14.15
>> testBase64MIMEDecode size 1 | 22.87 | 21.36 | 1.07
>> testBase64MIMEDecode size 3 | 27.79 | 25.32 | 1.10
>> testBase64MIMEDecode size 7 | 44.74 | 43.81 | 1.02
>> testBase64MIMEDecode size 32 | 142.69 | 129.56 | 1.10
>> testBase64MIMEDecode size 64 | 256.90 | 243.80 | 1.05
>> testBase64MIMEDecode size 80 | 311.60 | 310.80 | 1.00
>> testBase64MIMEDecode size 96 | 364.00 | 346.66 | 1.05
>> testBase64MIMEDecode size 112 | 472.88 | 394.78 | 1.20
>> testBase64MIMEDecode size 512 | 1814.96 | 1671.28 | 1.09
>> testBase64MIMEDecode size 1000 | 3623.50 | 3227.61 | 1.12
>> testBase64MIMEDecode size 20000 | 70484.09 | 64940.77 | 1.09
>> testBase64MIMEDecode size 50000 | 191732.34 | 158158.95 | 1.21
>> testBase64WithErrorInputsDecode size 1 | 1531.02 | 1185.19 | 1.29
>> testBase64WithErrorInputsDecode size 3 | 1306.59 | 1170.99 | 1.12
>> testBase64WithErrorInputsDecode size 7 | 1238.11 | 1176.62 | 1.05
>> testBase64WithErrorInputsDecode size 32 | 1346.46 | 1138.47 | 1.18
>> testBase64WithErrorInputsDecode size 64 | 1195.28 | 1172.52 | 1.02
>> testBase64WithErrorInputsDecode size 80 | 1469.00 | 1180.94 | 1.24
>> testBase64WithErrorInputsDecode size 96 | 1434.48 | 1167.74 | 1.23
>> testBase64WithErrorInputsDecode size 112 | 1440.06 | 1162.56 | 1.24
>> testBase64WithErrorInputsDecode size 512 | 1362.79 | 1193.42 | 1.14
>> testBase64WithErrorInputsDecode size 1000 | 1426.07 | 1194.44 | 1.19
>> testBase64WithErrorInputsDecode size 20000 | 1398.44 | 1138.17 | 1.23
>> testBase64WithErrorInputsDecode size 50000 | 1409.41 | 1114.16 | 1.26
>
> Scott Gibbons has updated the pull request incrementally with one additional 
> commit since the last revision:
> 
>   Added comments.  Streamlined flow for decode.

src/hotspot/cpu/x86/stubGenerator_x86_64.cpp line 6004:

> 6002:       __ BIND(L_continue);
> 6003: 
> 6004:       __ vpxor(errorvec, errorvec, errorvec, Assembler::AVX_512bit);

Why is clearing errorvec needed here?

src/hotspot/cpu/x86/stubGenerator_x86_64.cpp line 6023:

> 6021:       __ evmovdquq(tmp16_op3, pack16_op, Assembler::AVX_512bit);
> 6022:       __ evmovdquq(tmp16_op2, pack16_op, Assembler::AVX_512bit);
> 6023:       __ evmovdquq(tmp16_op1, pack16_op, Assembler::AVX_512bit);

Why do we need 3 additional copies of pack16_op?

src/hotspot/cpu/x86/stubGenerator_x86_64.cpp line 6026:

> 6024:       __ evmovdquq(tmp32_op3, pack32_op, Assembler::AVX_512bit);
> 6025:       __ evmovdquq(tmp32_op2, pack32_op, Assembler::AVX_512bit);
> 6026:       __ evmovdquq(tmp32_op1, pack32_op, Assembler::AVX_512bit);

Why do we need 3 additional copies of pack32_op?

src/hotspot/cpu/x86/stubGenerator_x86_64.cpp line 6051:

> 6049:       __ vpternlogd(t0, 0xfe, input1, input2, Assembler::AVX_512bit);
> 6050: 
> 6051:       __ vpternlogd(t1, 0xfe, translated0, translated1, 
> Assembler::AVX_512bit);

The comment here could read something like the following, for easier understanding:
// OR all of the inputs and translations together ...
// Here t0 has input0 and t1 has input3
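That is, roughly (a sketch over the two instructions quoted above):
// OR all of the inputs and translations together ...
// (t0 already holds input0, t1 already holds input3)
__ vpternlogd(t0, 0xfe, input1, input2, Assembler::AVX_512bit);
__ vpternlogd(t1, 0xfe, translated0, translated1, Assembler::AVX_512bit);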

src/hotspot/cpu/x86/stubGenerator_x86_64.cpp line 6055:

> 6053:       __ vpternlogd(t2, 0xfe, translated3, t0, Assembler::AVX_512bit);
> 6054:       __ evmovdquq(errorvec, t0, Assembler::AVX_512bit);
> 6055:       __ vpternlogd(errorvec, 0xfe, t1, t2, Assembler::AVX_512bit);

Could this be simplified as:
__ vpternlogd(t0, 0xfe, translated2, translated3, Assembler::AVX_512bit);
__ vpor(errorvec, t0, t1, Assembler::AVX_512bit);

src/hotspot/cpu/x86/stubGenerator_x86_64.cpp line 6060:

> 6058:       __ evpmovb2m(k3, errorvec, Assembler::AVX_512bit);
> 6059:       __ kortestql(k3, k3);
> 6060:       __ vpxor(errorvec, errorvec, errorvec, Assembler::AVX_512bit);

Why is clearing errorvec needed here? It seems unnecessary for the 256-byte 
processing loop.

src/hotspot/cpu/x86/stubGenerator_x86_64.cpp line 6069:

> 6067:       __ vpmaddubsw(merge_ab_bc0, translated0, tmp16_op3, 
> Assembler::AVX_512bit);
> 6068:       __ vpmaddubsw(merge_ab_bc1, translated1, tmp16_op2, 
> Assembler::AVX_512bit);
> 6069:       __ vpmaddubsw(merge_ab_bc2, translated2, tmp16_op1, 
> Assembler::AVX_512bit);

Could we not use pack16_op directly here instead of its copies tmp16_*?
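That is, something along these lines (just a sketch, assuming pack16_op is not 
clobbered inside the loop):
__ vpmaddubsw(merge_ab_bc0, translated0, pack16_op, Assembler::AVX_512bit);
__ vpmaddubsw(merge_ab_bc1, translated1, pack16_op, Assembler::AVX_512bit);
__ vpmaddubsw(merge_ab_bc2, translated2, pack16_op, Assembler::AVX_512bit);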

src/hotspot/cpu/x86/stubGenerator_x86_64.cpp line 6078:

> 6076:       __ vpmaddwd(merged0, merge_ab_bc0, tmp32_op2, 
> Assembler::AVX_512bit);
> 6077:       __ vpmaddwd(merged1, merge_ab_bc1, tmp32_op1, 
> Assembler::AVX_512bit);
> 6078:       __ vpmaddwd(merged2, merge_ab_bc2, tmp32_op3, 
> Assembler::AVX_512bit);

Could we not use pack32_op directly here instead of its copies tmp32_*?
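Similarly here, something like (again a sketch, assuming pack32_op stays live 
across the loop):
__ vpmaddwd(merged0, merge_ab_bc0, pack32_op, Assembler::AVX_512bit);
__ vpmaddwd(merged1, merge_ab_bc1, pack32_op, Assembler::AVX_512bit);
__ vpmaddwd(merged2, merge_ab_bc2, pack32_op, Assembler::AVX_512bit);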

src/hotspot/cpu/x86/stubGenerator_x86_64.cpp line 6086:

> 6084:       __ evpermt2b(arr01, join01, merged1, Assembler::AVX_512bit);
> 6085:       __ evpermt2b(arr12, join12, merged2, Assembler::AVX_512bit);
> 6086:       __ evpermt2b(arr23, join23, merged3, Assembler::AVX_512bit);

arr01 is the same as merged0, arr12 is the same as merged1, and arr23 is the 
same as merged2, so the above would be easier to understand if coded as below:
__ evpermt2b(merged0, join01, merged1, Assembler::AVX_512bit);
__ evpermt2b(merged1, join12, merged2, Assembler::AVX_512bit);
__ evpermt2b(merged2, join23, merged3, Assembler::AVX_512bit);

src/hotspot/cpu/x86/stubGenerator_x86_64.cpp line 6091:

> 6089:       __ evmovdquq(Address(dest, dp, Address::times_1, 0x00), arr01, 
> Assembler::AVX_512bit);
> 6090:       __ evmovdquq(Address(dest, dp, Address::times_1, 0x40), arr12, 
> Assembler::AVX_512bit);
> 6091:       __ evmovdquq(Address(dest, dp, Address::times_1, 0x80), arr23, 
> Assembler::AVX_512bit);

Here you can use merged0, merged1 and merged2 directly in place of arr01, 
arr12 and arr23, respectively.
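That is (a sketch, assuming the evpermt2b results are written into merged0, 
merged1 and merged2 as suggested above, and keeping the same addressing):
__ evmovdquq(Address(dest, dp, Address::times_1, 0x00), merged0, Assembler::AVX_512bit);
__ evmovdquq(Address(dest, dp, Address::times_1, 0x40), merged1, Assembler::AVX_512bit);
__ evmovdquq(Address(dest, dp, Address::times_1, 0x80), merged2, Assembler::AVX_512bit);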

src/hotspot/cpu/x86/stubGenerator_x86_64.cpp line 6122:

> 6120:       __ evpermt2b(translated0, input0, lookup_hi, 
> Assembler::AVX_512bit);
> 6121: 
> 6122:       __ vpternlogd(errorvec, 0xfe, translated0, input0, 
> Assembler::AVX_512bit);

This could be a simple vpor:
__ vpor(errorvec, translated0, input0, Assembler::AVX_512bit);

-------------

PR: https://git.openjdk.java.net/jdk/pull/4368
