Hi Roger,

Thanks for your reply and thoughts!  Comments interspersed below:

On 8/28/20 10:54 AM, Roger Riggs wrote:
Hi Corey,

A few comments on core-libs side...

The naming convention for methods that end in '0' is usually to indicate
they are the bottom-most method or a native method.
So I think you can/should rename the methods to make the most sense
as to their function.

Ok, I will fix that.


Comparing with the way that the Base64 encoder was intrinsified, the
method that is intrinsified should have a method body that performs
the same function, so that it is interchangeable.  That will likely just shift
the "fast path" code into the decodeBlock method.
Keeping the symmetry between the encoder and decoder will
make it easier to maintain the code.

Good point. I'll investigate what this looks like in terms of the actual code, and will report back (perhaps in a new webrev).


Given that the intrinsic only handles two of the three cases, and the Java code handles all three, I would add an extra arg to decodeBlock to reflect the isMIME case
and have the intrinsic take an early exit until that case is implemented.


I did consider doing that, but didn't for two reasons:

* Implementing isMIME using vector hardware would be very difficult due to the need to ignore non-base64 characters. This requires eliminating those characters from the vector, then reading and shifting in more, repeatedly, until no non-base64 characters remain. This isn't a trivial or fast thing to do, at least on the Power architecture. None of the published base64 encode/decode routines for vector processors address the MIME case. In fact, they don't address isURL=true either, but fortunately that is a relatively easy addition.

* If isMIME=true is not implemented by the intrinsic, it costs unnecessary overhead for that case, because of the need to marshal the parameters, call the intrinsic, and then do an early return. I benchmarked this approach before and saw an approximately 5% drop in performance when isMIME = true. That's why we decided to leave the isMIME=true case as a later optimization. Because of the extra complexity of the algorithm, it probably shouldn't share the same intrinsic anyway; that way only the isMIME=true case takes the performance hit.
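
For concreteness, the early-exit variant I benchmarked looked roughly like the following (illustrative only -- the parameter names and exact signature are placeholders, not the code in the webrev):

    // Illustrative sketch only, not the actual webrev code.  The intrinsic
    // candidate would take isMIME and bail out immediately when asked for a
    // mode it doesn't implement, reporting 0 bytes produced.
    private int decodeBlock(byte[] src, int sp, int sl, byte[] dst, int dp,
                            boolean isURL, boolean isMIME) {
        if (isMIME) {
            return 0;   // early exit: MIME decoding not vectorized
        }
        // ... vectorizable fast path would go here ...
        return 0;
    }

Even that immediate return still costs a call plus parameter marshalling on every decode, which is where the ~5% regression for isMIME=true came from.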


It is unfortunate that taking advantage of vectorization has to be hand coded.
If/when the Vector API is ready (JEP 338 https://openjdk.java.net/jeps/338)
the java code should be replaced to use the Vector API and then it would
work for a new hardware without specific coding for each platform.
"Just" implement the Vector API.  There's a lot more bang for the buck
going for that approach.


The kind of vector processing used in this intrinsic operates mostly on bytes within one vector rather than between two vectors (as in, say, matrix-multiply algorithms); this style is otherwise known as SWAR (https://en.wikipedia.org/wiki/SWAR). Because of that, it's very sensitive to exactly which instructions the vector processor provides. There isn't much standardization of SWAR-style instructions between different arches, so I think it would be hard to design a generic SWAR API that gives good performance across several arches. From briefly looking at the link you provided, it doesn't appear to address SWAR operations, so it doesn't seem to me that waiting for the Vector API would be worth it, and in fact it may not provide any way at all to boost the performance of base64 decode/encode.
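
Just to illustrate what I mean by SWAR (this is a textbook-style example, not code from the webrev): even a plain 64-bit scalar register can be treated as eight byte lanes, but the useful operations are ad hoc bit tricks rather than anything that maps onto a portable API:

    // Classic SWAR idiom: does any of the eight packed bytes in v equal zero?
    // For each byte lane, the high bit of (v - 0x01..01) & ~v ends up set
    // exactly when that byte of v was zero; the final mask keeps those bits.
    static boolean hasZeroByte(long v) {
        return ((v - 0x0101010101010101L) & ~v & 0x8080808080808080L) != 0;
    }

The base64 lookup and bit-repacking steps are the same flavor of trick, just expressed with vendor-specific vector instructions, which is why they don't translate cleanly into a generic API.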

Regards,

- Corey

P.S. I work only two days a week, so my updates will come more slowly than they would from other developers.


Thanks, Roger


On 8/24/20 9:21 PM, Corey Ashford wrote:
Here's a revised webrev which includes a JMH benchmark for the decode operation.

http://cr.openjdk.java.net/~mhorie/8248188/webrev.03/

The added benchmark tries to be "fair" in that it doesn't prefer a large buffer size, which would favor the intrinsic.  It pseudo-randomly (but reproducibly) chooses a buffer size between 8 and 20k+8 bytes, and fills it with random data to encode and decode.  As part of the TearDown of an invocation, it also checks the decoded output data for correctness.
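
In outline, the benchmark does something like this (a simplified sketch rather than the exact file in the webrev; the class name, seed, and size constants here are placeholders):

    import java.util.Arrays;
    import java.util.Base64;
    import java.util.Random;
    import org.openjdk.jmh.annotations.*;

    @State(Scope.Thread)
    public class Base64Decode {
        private final Random rnd = new Random(1123);   // fixed seed => reproducible sizes and data
        private byte[] original;
        private byte[] encoded;
        private byte[] decoded;

        @Setup(Level.Invocation)
        public void setup() {
            int len = 8 + rnd.nextInt(20 * 1024);      // 8 .. 20k+8 bytes, chosen per invocation
            original = new byte[len];
            rnd.nextBytes(original);
            encoded = Base64.getEncoder().encode(original);
        }

        @Benchmark
        public void decode() {
            decoded = Base64.getDecoder().decode(encoded);
        }

        @TearDown(Level.Invocation)
        public void verify() {
            if (!Arrays.equals(original, decoded))
                throw new AssertionError("decoded data does not match the original");
        }
    }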

Example runs on the Power9-based machine I use for development show a 3X average improvement across these random buffer sizes. Here's an excerpt of the output when run with -XX:-UseBASE64Intrinsics :

Iteration   1: 70795.623 ops/s
Iteration   2: 71070.607 ops/s
Iteration   3: 70867.544 ops/s
Iteration   4: 71107.992 ops/s
Iteration   5: 71048.281 ops/s

And here's the output with the intrinsic enabled:

Iteration   1: 208794.022 ops/s
Iteration   2: 208630.904 ops/s
Iteration   3: 208238.822 ops/s
Iteration   4: 208714.967 ops/s
Iteration   5: 209060.894 ops/s

Taking the best of the two runs: 209060/71048 = 2.94

From other experiments where the benchmark uses a fixed-size, larger buffer, the performance ratio rises to about 4.0.

Power10 should have a slightly higher ratio due to several factors, but I have not yet benchmarked on Power10.

Other arches ought to be able to do at least this well, if not better, because wider vector registers (> 128 bits) are available on them.  Only a Power9/10 implementation is included in this webrev, however.

Regards,

- Corey


On 8/19/20 11:20 AM, Roger Riggs wrote:
Hi Corey,

For changes that are obviously performance motivated, it is conventional to run a JMH perf test to demonstrate
the improvement and prove it is worthwhile to add code complexity.

I don't see any existing Base64 JMH tests, but they would belong in or near:
     test/micro/org/openjdk/bench/java/util/

Please contribute a JMH test and results to show the difference.

Regards, Roger



On 8/19/20 2:10 PM, Corey Ashford wrote:
Michihiro Horie posted up a new iteration of this webrev for me. This time the webrev includes a complete implementation of the intrinsic for Power9 and Power10.

You can find it here: http://cr.openjdk.java.net/~mhorie/8248188/webrev.02/

Changes in webrev.02 vs. webrev.01:

  * The method header for the intrinsic in the Base64 code has been rewritten using the Javadoc style.  The clarity of the comments has been improved and some verbosity has been removed. There are no additional functional changes to Base64.java.

  * The code needed to marshal and check the intrinsic parameters has been added, using the base64 encodeBlock intrinsic as a guideline.

  * A complete intrinsic implementation for Power9 and Power10 is included.

  * Some Power9 and Power10 assembler instructions needed by the intrinsic, which hadn't been defined before, have been added.

The intrinsic implementation in this patch accelerates the decoding of large blocks of base64 data by a factor of about 3.5X on Power9.

I'm attaching two Java test cases I am using for testing and benchmarking.  TestBase64_VB encodes and decodes randomly-sized buffers of random data and checks that the original data matches the encoded-then-decoded data.  TestBase64Errors encodes a 48K block of random bytes, then corrupts each byte of the encoded data, one at a time, checking to see whether the decoder catches the illegal byte.
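
TestBase64Errors is along these lines (a simplified sketch; the attached file differs in details such as the corrupting byte value and the seed):

    import java.util.Base64;
    import java.util.Random;

    public class TestBase64Errors {
        public static void main(String[] args) {
            byte[] data = new byte[48 * 1024];
            new Random(42).nextBytes(data);                 // arbitrary fixed seed
            byte[] encoded = Base64.getEncoder().encode(data);

            for (int i = 0; i < encoded.length; i++) {
                byte saved = encoded[i];
                encoded[i] = '$';                           // '$' is never a legal base64 character
                try {
                    Base64.getDecoder().decode(encoded);
                    throw new AssertionError("illegal byte at offset " + i + " was not detected");
                } catch (IllegalArgumentException expected) {
                    // the decoder rejected the corrupted input, as it should
                }
                encoded[i] = saved;
            }
            System.out.println("all " + encoded.length + " corrupted positions were detected");
        }
    }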

Any comments/suggestions would be appreciated.

Thanks,

- Corey

On 7/27/20 6:49 PM, Corey Ashford wrote:
Michihiro Horie uploaded a new revision of the Base64 decodeBlock intrinsic API for me:

http://cr.openjdk.java.net/~mhorie/8248188/webrev.01/

It has the following changes with respect to the original one posted:

  * In the event of encountering a non-base64 character, instead of having a separate error code of -1, the intrinsic now just returns either 0 or the number of data bytes produced up to the point where the illegal base64 character was encountered. This reduces the number of special cases, and it also gives the slower, pure-Java algorithm a head start when it has to locate the bad character (see the caller-side sketch after this list).

  * The isMIME boolean is removed from the API for the following reasons:
    - The current API is not sufficient to handle the isMIME case, because there isn't a strict relationship between the number of input bytes and the number of output bytes: there can be an arbitrary number of non-base64 characters in the source.
    - If an intrinsic only implements the (isMIME == false) case as ours does, it will always return 0 bytes processed, which will slightly slow down the normal path of processing an (isMIME == true) instantiation.
    - We considered adding a separate hotspot candidate for the (isMIME == true) case, but since we don't have an intrinsic implementation to test that with, we decided to leave it as a future optimization.
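
On the caller side, the idea is roughly the following (a hypothetical sketch, not the actual Base64.java change; decodeSlow stands in for the existing pure-Java loop, and it assumes the intrinsic only retires whole 4-character input groups):

    // Hypothetical caller-side use of the decodeBlock return convention.
    int produced = decodeBlock(src, sp, sl, dst, dp, isURL);   // 0 if nothing was processed
    if (produced > 0) {
        sp += produced / 3 * 4;   // 4 input chars were consumed per 3 output bytes
        dp += produced;
    }
    // Whatever the intrinsic did not decode -- including input that stops at an
    // illegal character -- is re-scanned by the pure-Java loop, which either
    // finishes the job or throws the proper exception at the right position.
    return decodeSlow(src, sp, sl, dst, dp, isURL, isMIME);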

Comments and suggestions are welcome.  Thanks for your consideration.

- Corey

On 6/23/20 6:23 PM, Michihiro Horie wrote:
Hi Corey,

Following is the issue I created.
https://bugs.openjdk.java.net/browse/JDK-8248188

I will upload a webrev when you're ready as we talked in private.

Best regards,
Michihiro


From: "Corey Ashford" <cjash...@linux.ibm.com>
To: "hotspot-compiler-...@openjdk.java.net" <hotspot-compiler-...@openjdk.java.net>, "ppc-aix-port-...@openjdk.java.net" <ppc-aix-port-...@openjdk.java.net> Cc: Michihiro Horie/Japan/IBM@IBMJP, Kazunori Ogata/Japan/IBM@IBMJP, jos...@br.ibm.com
Date: 2020/06/24 09:40
Subject: RFR(S): [PATCH] Add HotSpotIntrinsicCandidate and API for Base64 decoding

------------------------------------------------------------------------



Currently in java.util.Base64, there is a HotSpotIntrinsicCandidate and
API for encodeBlock, but none for decoding.  This means that only
encoding gets acceleration from the underlying CPU's vector hardware.

I'd like to propose adding a new intrinsic for decodeBlock.  The
considerations I have for this new intrinsic's API:

  * Don't make any assumptions about the underlying capability of the
hardware.  For example, do not impose any specific block size granularity.

  * Don't assume the underlying intrinsic can handle the isMIME or isURL
modes, but also let it decide whether it will process the data regardless
of the settings of the two booleans.

  * Any remaining data that is not processed by the intrinsic will be
processed by the pure Java implementation.  This allows the intrinsic to
process whatever block sizes it's good at without the complexity of
handling the end fragments.

  * If any illegal character is discovered in the decoding process, the
intrinsic will simply return -1, instead of requiring it to throw a
proper exception from the context of the intrinsic.  In the event of
getting a -1 returned from the intrinsic, the Java Base64 library code simply calls the pure Java implementation to have it find the error and
properly throw an exception.  This is a performance trade-off in the
case of an error (which I expect to be very rare).

  * One thought I have for a further optimization (not implemented in
the current patch), is that when the intrinsic decides not to process a
block because of some combination of isURL and isMIME settings it
doesn't handle, it could return extra bits in the return code, encoded
as a negative number.  For example:

Illegal_Base64_char   = 0b001;
isMIME_unsupported    = 0b010;
isURL_unsupported     = 0b100;

These can be OR'd together as needed and then negated (flip the sign).
The Base64 library code could then cache these flags, so it will know
not to call the intrinsic again when another decodeBlock is requested
but with an unsupported mode.  This will save the performance hit of
calling the intrinsic when it is guaranteed to fail.
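
A sketch of what I have in mind (purely hypothetical -- none of this is in the current patch, and the field and method names are made up for illustration):

    // Flag bits the intrinsic could OR together and return negated:
    static final int ILLEGAL_BASE64_CHAR = 0b001;
    static final int ISMIME_UNSUPPORTED  = 0b010;
    static final int ISURL_UNSUPPORTED   = 0b100;

    private int unsupportedModes;   // flags cached from earlier intrinsic calls

    private int tryIntrinsic(byte[] src, int sp, int sl, byte[] dst, int dp,
                             boolean isURL, boolean isMIME) {
        // Skip the call entirely when this mode is already known to be unsupported.
        if ((isMIME && (unsupportedModes & ISMIME_UNSUPPORTED) != 0)
                || (isURL && (unsupportedModes & ISURL_UNSUPPORTED) != 0))
            return 0;
        int ret = decodeBlock(src, sp, sl, dst, dp, isURL, isMIME);
        if (ret < 0) {
            int flags = -ret;    // flip the sign back to recover the flag bits
            unsupportedModes |= flags & (ISMIME_UNSUPPORTED | ISURL_UNSUPPORTED);
            return 0;            // nothing was decoded; caller falls back to pure Java
        }
        return ret;
    }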

I've tested the attached patch with an actual intrinsic coded up for
Power9/Power10, but those runtime intrinsics and arch-specific patches
aren't attached today.  I want to get some consensus on the
library-level intrinsic API first.

Also attached is a simple test case to test that the new intrinsic API
doesn't break anything.

I'm open to any comments about this.

Thanks for your consideration,

- Corey


Corey Ashford
IBM Systems, Linux Technology Center, OpenJDK team
cjashfor at us dot ibm dot com
[attachment "decodeBlock_api-20200623.patch" deleted by Michihiro Horie/Japan/IBM] [attachment "TestBase64.java" deleted by Michihiro Horie/Japan/IBM]







