Re: RFR: 8287903: Reduce runtime of java.math microbenchmarks

2022-06-08 Thread Andrew Haley
On Tue, 7 Jun 2022 12:34:25 GMT, Claes Redestad  wrote:

> - Reduce runtime by running fewer forks, fewer iterations, less warmup. All 
> micros tested in this group appear to stabilize very quickly.
> - Refactor BigIntegers to avoid re-running some (most) micros over and over 
> with parameter values that don't affect them.
> 
> Expected runtime down from 14 hours to 15 minutes.

Marked as reviewed by aph (Reviewer).

-

PR: https://git.openjdk.java.net/jdk/pull/9062


Re: Why do we still need StrictMath?

2022-05-09 Thread Andrew Haley

On 5/8/22 13:14, Victor Williams Stafusa da Silva wrote:

Sure, there are the x86 intrinsics. But since JEP 306 was delivered, is this still valid? 
The Motivation section of the JEP 306 seems to imply that this is not the case anymore. 
Of course, I could just be grossly misunderstanding what is/was JEP 306 and/or to which 
depth it meant by "restore always-strict floating-point semantics", but I don't 
think that I am the only one out there.


JEP 306 isn't about j.l.StrictMath. The Description section of 306
says what it's really about.

--
Andrew Haley  (he/him)
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
https://keybase.io/andrewhaley
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671


Re: Why do we still need StrictMath?

2022-05-09 Thread Andrew Haley

On 5/8/22 13:44, Martin Desruisseaux wrote:

Le 08/05/2022 à 14:15, Victor Williams Stafusa da Silva a écrit :


Was that using Java 17+, which included JEP 306 delivered?


No, but my understanding is that JEP 306 does not apply to Math versus
StrictMath behavior. In my understanding, the strictfp keyword was only
about the use of the extended exponent value set in the Pentium's 80-bit
floating-point values. It impacts the behavior of arithmetic operations + - * /,
but my (maybe wrong) understanding is that it does not impact which
processor instruction is used for implementing Math.sin, cos, etc.


You're right.

--
Andrew Haley  (he/him)
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
https://keybase.io/andrewhaley
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671


Re: Why do we still need StrictMath?

2022-05-08 Thread Andrew Haley

On 5/8/22 06:54, Victor Williams Stafusa da Silva wrote:

If StrictMath is still needed and could produce different results than
Math in some modern hardware, then by the javadocs, it seems to imply that
Math should always delegate to StrictMath and never the other way around.
Why is it not always the case? I think that the answer is simply because
the StrictMath class was largely left unchanged and that delegating in one
way or in the other could then produce a difference when the strictfp
modifier was there, but is there a better reason than that?


Some targets (x86, in particular) have intrinsics (log, trig) that
are faster than StrictMath and also more accurate. StrictMath is not
about accuracy, but cross-architecture down-to-the-last bit reproducibility.
Whether we still need that reproducibility is, I suppose, something for
debate.
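
For illustration (not from the original mail), a minimal sketch of the distinction: both calls below meet the documented accuracy bounds, but only the StrictMath result is defined by the fdlibm algorithms and therefore bit-identical on every platform.

    public class ReproducibilitySketch {
        public static void main(String[] args) {
            double x = 1e9;
            // Math.sin may be computed by a platform intrinsic and can differ in the
            // last bit between architectures; StrictMath.sin may not.
            System.out.println(Double.toHexString(Math.sin(x)));
            System.out.println(Double.toHexString(StrictMath.sin(x)));
        }
    }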

--
Andrew Haley  (he/him)
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
https://keybase.io/andrewhaley
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671


Re: RFR: 8284161: Implementation of Virtual Threads (Preview) [v8]

2022-04-28 Thread Andrew Haley
On Wed, 27 Apr 2022 19:21:58 GMT, Alan Bateman  wrote:

>> src/java.base/share/classes/jdk/internal/misc/UnsafeConstants.java line 122:
>> 
>>> 120:  */
>>> 121: 
>>> 122: public static final int SCOPED_CACHE_SHIFT;
>> 
>> I can't find this constant being used.   If added for future, maybe keep 
>> `UnsafeConstants` class and this static field package-private for now.
>
> It originally configured the number of lines in extent local cache but the 
> implementation has changed. @theRealAph would be best to comment on this, it 
> may be possible to delete it.

Yes, it's dead. Thanks.

-

PR: https://git.openjdk.java.net/jdk/pull/8166


Re: OpenJDK or SE Java Floating Point Options?

2022-04-16 Thread Andrew Haley

On 4/16/22 09:04, sminervini.prism wrote:
> It is still the case that I and others need Java floating point and StrictMath
> (the key double type calculator class) to be repaired.

If you are going to pursue this line of reasoning, you must be aware
of some things.

Firstly, it is not possible to change Java's core floating-point
arithmetic in a compatible way. We certainly will not adopt decimal
floating-point for Java's core arithmetic.

Secondly, Java should not be your focus. We will not change Java's
arithmetic in a way that is incompatible with other languages or
which makes Java slower on existing hardware.

You must read and fully understand this before you go any further. It
will require a lot of study:

https://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.html

> May the Java Community Process reconsider the present floating point
> operations and method calls situation, based on an imperfect
> standard and improper workaround, and provide corrected, defaulting
> or specifying-compatible ways for Java floating point arithmetic and
> Calculator class method results to always cohere in their ranges,
> without denormal and pronormal inclusion?

In a word, no. That possibility is so unlikely that it is not worthy
of consideration.

--
Andrew Haley  (he/him)
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
https://keybase.io/andrewhaley
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671


Re: RFR: 8279508: Auto-vectorize Math.round API [v17]

2022-03-17 Thread Andrew Haley
On Sun, 13 Mar 2022 06:36:15 GMT, Jatin Bhateja  wrote:

>> Summary of changes:
>> - Intrinsify Math.round(float) and Math.round(double) APIs.
>> - Extend auto-vectorizer to infer vector operations on encountering scalar 
>> IR nodes for above intrinsics.
>> - Test creation using new IR testing framework.
>> 
>> Following are the performance number of a JMH micro included with the patch 
>> 
>> Test System: Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz (Icelake Server)
>> 
>> 
>> Benchmark | TESTSIZE | Baseline AVX3 (ops/ms) | Withopt AVX3 (ops/ms) | Gain ratio | Baseline AVX2 (ops/ms) | Withopt AVX2 (ops/ms) | Gain ratio
>> -- | -- | -- | -- | -- | -- | -- | --
>> FpRoundingBenchmark.test_round_double | 1024.00 | 504.15 | 2209.54 | 4.38 | 510.36 | 548.39 | 1.07
>> FpRoundingBenchmark.test_round_double | 2048.00 | 293.64 | 1271.98 | 4.33 | 293.48 | 274.01 | 0.93
>> FpRoundingBenchmark.test_round_float | 1024.00 | 825.99 | 4754.66 | 5.76 | 751.83 | 2274.13 | 3.02
>> FpRoundingBenchmark.test_round_float | 2048.00 | 412.22 | 2490.09 | 6.04 | 388.52 | 1334.18 | 3.43
>> 
>> 
>> Kindly review and share your feedback.
>> 
>> Best Regards,
>> Jatin
>
> Jatin Bhateja has updated the pull request incrementally with one additional 
> commit since the last revision:
> 
>   8279508: Windows build failure fix.

src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 4143:

> 4141:   ldmxcsr(new_mxcsr);
> 4142:   // Move raw bits corresponding to double value 0.5 into scratch 
> register.
> 4143:   mov64(scratch, 4602678819172646912L);

Suggestion:

  mov64(scratch, julong_cast(0.5));
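
For reference, the magic number is just the raw bit pattern of the double value 0.5, which can be confirmed from Java (a standalone check, not part of the patch):

    public class HalfBitsCheck {
        public static void main(String[] args) {
            // prints 4602678819172646912, i.e. 0x3FE0000000000000
            System.out.println(Double.doubleToRawLongBits(0.5));
        }
    }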

-

PR: https://git.openjdk.java.net/jdk/pull/7094


Re: Should System.exit be controlled by a Scope Local?

2022-03-15 Thread Andrew Haley

On 3/1/22 18:17, Sean Mullan wrote:

Can you explain in a little more detail as to what the compatibility
issues are with preventing threads in thread pools from calling System.exit?


It is possible, I suppose, that some rather odd programmer is using
a task in a thread pool to exit their program. I know, unlikely, but
we preserve compatibility.

--
Andrew Haley  (he/him)
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
https://keybase.io/andrewhaley
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671


Re: RFR: 8279508: Auto-vectorize Math.round API [v15]

2022-03-14 Thread Andrew Haley
On Sun, 13 Mar 2022 04:27:25 GMT, Jatin Bhateja  wrote:

>> src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 4024:
>> 
>>> 4022:  * the result is equal to the value of Integer.MAX_VALUE.
>>> 4023:  */
>>> 4024: void 
>>> C2_MacroAssembler::vector_cast_float_special_cases_avx(XMMRegister dst, 
>>> XMMRegister src, XMMRegister xtmp1,
>> 
>> This special handling is really large, could we use a stub routine for it?
>
> Good suggestion, but as of now we are not using vector calling conventions 
> for stubs.

I don't understand this comment. If the stub is only to be used by you, then 
you can determine your own calling convention.

-

PR: https://git.openjdk.java.net/jdk/pull/7094


Re: RFR: 8279508: Auto-vectorize Math.round API [v12]

2022-03-10 Thread Andrew Haley
On Wed, 9 Mar 2022 11:38:34 GMT, Jatin Bhateja  wrote:

>> Summary of changes:
>> - Intrinsify Math.round(float) and Math.round(double) APIs.
>> - Extend auto-vectorizer to infer vector operations on encountering scalar 
>> IR nodes for above intrinsics.
>> - Test creation using new IR testing framework.
>> 
>> Following are the performance number of a JMH micro included with the patch 
>> 
>> Test System: Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz (Icelake Server)
>> 
>> 
>> Benchmark | TESTSIZE | Baseline AVX3 (ops/ms) | Withopt AVX3 (ops/ms) | Gain ratio | Baseline AVX2 (ops/ms) | Withopt AVX2 (ops/ms) | Gain ratio
>> -- | -- | -- | -- | -- | -- | -- | --
>> FpRoundingBenchmark.test_round_double | 1024.00 | 504.15 | 2209.54 | 4.38 | 510.36 | 548.39 | 1.07
>> FpRoundingBenchmark.test_round_double | 2048.00 | 293.64 | 1271.98 | 4.33 | 293.48 | 274.01 | 0.93
>> FpRoundingBenchmark.test_round_float | 1024.00 | 825.99 | 4754.66 | 5.76 | 751.83 | 2274.13 | 3.02
>> FpRoundingBenchmark.test_round_float | 2048.00 | 412.22 | 2490.09 | 6.04 | 388.52 | 1334.18 | 3.43
>> 
>> 
>> Kindly review and share your feedback.
>> 
>> Best Regards,
>> Jatin
>
> Jatin Bhateja has updated the pull request with a new target base due to a 
> merge or a rebase. The pull request now contains 15 commits:
> 
>  - 8279508: Preventing domain switch-over penalty for Math.round(float) and 
> constraining unrolling to prevent code bloating.
>  - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8279508
>  - 8279508: Removing +LogCompilation flag.
>  - 8279508: Review comments resolved.`
>  - 8279508: Adding descriptive comments.
>  - 8279508: Review comments resolved.
>  - 8279508: Review comments resolved.
>  - 8279508: Fixing for windows failure.
>  - 8279508: Adding few descriptive comments.
>  - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8279508
>  - ... and 5 more: 
> https://git.openjdk.java.net/jdk/compare/d07f7c76...547f4e31

test/micro/org/openjdk/bench/java/math/FpRoundingBenchmark.java line 70:

> 68:   }
> 69: 
> 70:   FargV1 = new float[TESTSIZE];

`FargV1` is not initialized.
You need to set `i = 0;` here.

test/micro/org/openjdk/bench/java/math/FpRoundingBenchmark.java line 78:

> 76: 
> 77:   for (; i < TESTSIZE; i++) {
> 78:   FargV1[i] = r.nextFloat()*TESTSIZE;

This is an unrealistically narrow range of values. I'd use 
Suggestion:

  FargV1[i] = Float.intBitsToFloat(r.nextInt());
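
For context, a standalone sketch of the difference being pointed out (the FargV1/TESTSIZE names mirror the benchmark; the seed is arbitrary): r.nextFloat()*TESTSIZE only produces values in [0, TESTSIZE), whereas reinterpreting random bits exercises the whole float domain, including negatives, huge magnitudes, NaN and infinities.

    import java.util.Random;

    public class InputRangeSketch {
        public static void main(String[] args) {
            final int TESTSIZE = 1024;
            Random r = new Random(42);
            float[] FargV1 = new float[TESTSIZE];
            for (int i = 0; i < TESTSIZE; i++) {
                // arbitrary bit patterns rather than a narrow positive slice
                FargV1[i] = Float.intBitsToFloat(r.nextInt());
            }
            System.out.println(FargV1[0]);
        }
    }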

-

PR: https://git.openjdk.java.net/jdk/pull/7094


Re: RFR: 8279508: Auto-vectorize Math.round API [v12]

2022-03-10 Thread Andrew Haley
On Wed, 9 Mar 2022 11:38:34 GMT, Jatin Bhateja  wrote:

>> Summary of changes:
>> - Intrinsify Math.round(float) and Math.round(double) APIs.
>> - Extend auto-vectorizer to infer vector operations on encountering scalar 
>> IR nodes for above intrinsics.
>> - Test creation using new IR testing framework.
>> 
>> Following are the performance number of a JMH micro included with the patch 
>> 
>> Test System: Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz (Icelake Server)
>> 
>> 
>> Benchmark | TESTSIZE | Baseline AVX3 (ops/ms) | Withopt AVX3 (ops/ms) | Gain ratio | Baseline AVX2 (ops/ms) | Withopt AVX2 (ops/ms) | Gain ratio
>> -- | -- | -- | -- | -- | -- | -- | --
>> FpRoundingBenchmark.test_round_double | 1024.00 | 504.15 | 2209.54 | 4.38 | 510.36 | 548.39 | 1.07
>> FpRoundingBenchmark.test_round_double | 2048.00 | 293.64 | 1271.98 | 4.33 | 293.48 | 274.01 | 0.93
>> FpRoundingBenchmark.test_round_float | 1024.00 | 825.99 | 4754.66 | 5.76 | 751.83 | 2274.13 | 3.02
>> FpRoundingBenchmark.test_round_float | 2048.00 | 412.22 | 2490.09 | 6.04 | 388.52 | 1334.18 | 3.43
>> 
>> 
>> Kindly review and share your feedback.
>> 
>> Best Regards,
>> Jatin
>
> Jatin Bhateja has updated the pull request with a new target base due to a 
> merge or a rebase. The pull request now contains 15 commits:
> 
>  - 8279508: Preventing domain switch-over penalty for Math.round(float) and 
> constraining unrolling to prevent code bloating.
>  - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8279508
>  - 8279508: Removing +LogCompilation flag.
>  - 8279508: Review comments resolved.`
>  - 8279508: Adding descriptive comments.
>  - 8279508: Review comments resolved.
>  - 8279508: Review comments resolved.
>  - 8279508: Fixing for windows failure.
>  - 8279508: Adding few descriptive comments.
>  - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8279508
>  - ... and 5 more: 
> https://git.openjdk.java.net/jdk/compare/d07f7c76...547f4e31

test/micro/org/openjdk/bench/java/math/FpRoundingBenchmark.java line 114:

> 112:   for (int i = 0; i < TESTSIZE; i++) {
> 113:   ResF[i] = Math.round(FargV1[i]);
> 114:   }

I think that this is wrong: you should not be storing the result into a float 
array because it requires an extra integer->float conversion, which distorts 
the timings. You need `resI` and `resL` for the results of `Math.round`.
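
A hedged sketch of the shape being asked for (the ResI/ResL/FargV1/DargV1/TESTSIZE names mirror the benchmark under review; the JMH scaffolding and seed are illustrative): the rounded results stay in int/long arrays, so no integer-to-float conversion is timed.

    import java.util.Random;
    import org.openjdk.jmh.annotations.*;
    import org.openjdk.jmh.infra.Blackhole;

    @State(Scope.Thread)
    public class RoundResultSketch {
        @Param({"1024", "2048"})
        int TESTSIZE;

        float[] FargV1;
        double[] DargV1;
        int[] ResI;
        long[] ResL;

        @Setup
        public void setup() {
            Random r = new Random(1024);
            FargV1 = new float[TESTSIZE];
            DargV1 = new double[TESTSIZE];
            ResI = new int[TESTSIZE];
            ResL = new long[TESTSIZE];
            for (int i = 0; i < TESTSIZE; i++) {
                FargV1[i] = Float.intBitsToFloat(r.nextInt());
                DargV1[i] = Double.longBitsToDouble(r.nextLong());
            }
        }

        @Benchmark
        public void test_round_float(Blackhole bh) {
            for (int i = 0; i < TESTSIZE; i++) {
                ResI[i] = Math.round(FargV1[i]);   // int result stays an int
            }
            bh.consume(ResI);
        }

        @Benchmark
        public void test_round_double(Blackhole bh) {
            for (int i = 0; i < TESTSIZE; i++) {
                ResL[i] = Math.round(DargV1[i]);   // long result stays a long
            }
            bh.consume(ResL);
        }
    }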

-

PR: https://git.openjdk.java.net/jdk/pull/7094


Re: RFR: 8279508: Auto-vectorize Math.round API [v11]

2022-03-06 Thread Andrew Haley
On Wed, 2 Mar 2022 02:44:41 GMT, Jatin Bhateja  wrote:

>> Summary of changes:
>> - Intrinsify Math.round(float) and Math.round(double) APIs.
>> - Extend auto-vectorizer to infer vector operations on encountering scalar 
>> IR nodes for above intrinsics.
>> - Test creation using new IR testing framework.
>> 
>> Following are the performance number of a JMH micro included with the patch 
>> 
>> Test System: Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz (Icelake Server)
>> 
>> 
>> Benchmark | TESTSIZE | Baseline AVX3 (ops/ms) | Withopt AVX3 (ops/ms) | Gain ratio | Baseline AVX2 (ops/ms) | Withopt AVX2 (ops/ms) | Gain ratio
>> -- | -- | -- | -- | -- | -- | -- | --
>> FpRoundingBenchmark.test_round_double | 1024.00 | 504.15 | 2209.54 | 4.38 | 510.36 | 548.39 | 1.07
>> FpRoundingBenchmark.test_round_double | 2048.00 | 293.64 | 1271.98 | 4.33 | 293.48 | 274.01 | 0.93
>> FpRoundingBenchmark.test_round_float | 1024.00 | 825.99 | 4754.66 | 5.76 | 751.83 | 2274.13 | 3.02
>> FpRoundingBenchmark.test_round_float | 2048.00 | 412.22 | 2490.09 | 6.04 | 388.52 | 1334.18 | 3.43
>> 
>> 
>> Kindly review and share your feedback.
>> 
>> Best Regards,
>> Jatin
>
> Jatin Bhateja has updated the pull request incrementally with one additional 
> commit since the last revision:
> 
>   8279508: Removing +LogCompilation flag.

src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 4157:

> 4155:   ExternalAddress mxcsr_std(StubRoutines::x86::addr_mxcsr_std());
> 4156:   ldmxcsr(new_mxcsr);
> 4157:   movl(scratch, 1056964608);

Suggestion:

  movl(scratch, 1056964608); // Raw bits corresponding to floating point value 
0.5f.
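
In the same spirit, the constant can be checked from the Java side (standalone, not part of the patch): 1056964608 is 0x3F000000, the raw bits of 0.5f.

    public class HalfFloatBitsCheck {
        public static void main(String[] args) {
            // prints 1056964608, i.e. 0x3F000000
            System.out.println(Float.floatToRawIntBits(0.5f));
        }
    }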

-

PR: https://git.openjdk.java.net/jdk/pull/7094


Re: RFR: 8279508: Auto-vectorize Math.round API [v11]

2022-03-06 Thread Andrew Haley
On Wed, 2 Mar 2022 02:44:41 GMT, Jatin Bhateja  wrote:

>> Summary of changes:
>> - Intrinsify Math.round(float) and Math.round(double) APIs.
>> - Extend auto-vectorizer to infer vector operations on encountering scalar 
>> IR nodes for above intrinsics.
>> - Test creation using new IR testing framework.
>> 
>> Following are the performance number of a JMH micro included with the patch 
>> 
>> Test System: Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz (Icelake Server)
>> 
>> 
>> Benchmark | TESTSIZE | Baseline AVX3 (ops/ms) | Withopt AVX3 (ops/ms) | Gain ratio | Baseline AVX2 (ops/ms) | Withopt AVX2 (ops/ms) | Gain ratio
>> -- | -- | -- | -- | -- | -- | -- | --
>> FpRoundingBenchmark.test_round_double | 1024.00 | 504.15 | 2209.54 | 4.38 | 510.36 | 548.39 | 1.07
>> FpRoundingBenchmark.test_round_double | 2048.00 | 293.64 | 1271.98 | 4.33 | 293.48 | 274.01 | 0.93
>> FpRoundingBenchmark.test_round_float | 1024.00 | 825.99 | 4754.66 | 5.76 | 751.83 | 2274.13 | 3.02
>> FpRoundingBenchmark.test_round_float | 2048.00 | 412.22 | 2490.09 | 6.04 | 388.52 | 1334.18 | 3.43
>> 
>> 
>> Kindly review and share your feedback.
>> 
>> Best Regards,
>> Jatin
>
> Jatin Bhateja has updated the pull request incrementally with one additional 
> commit since the last revision:
> 
>   8279508: Removing +LogCompilation flag.

src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 4157:

> 4155:   ExternalAddress mxcsr_std(StubRoutines::x86::addr_mxcsr_std());
> 4156:   ldmxcsr(new_mxcsr);
> 4157:   movl(scratch, 1056964608);

What is 1056964608 ?

-

PR: https://git.openjdk.java.net/jdk/pull/7094


Re: RFR: 8281146: Replace StringCoding.hasNegatives with countPositives [v9]

2022-03-03 Thread Andrew Haley
On Thu, 3 Mar 2022 13:31:35 GMT, Claes Redestad  wrote:

> > Well, it just didn't build. With the annotation being present, you also 
> > need an intrinsic implementation. That's what the error message is saying...
> 
> Doh, I had no idea the presence of `@IntrinsicCandidate` was mandating the VM 
> has an intrinsic implementation these days (Why?! Not much of a _candidate_ 
> if it's required..).

I don't think the intrinsic has to be implemented on every target, but AFAICR 
it does have to be declared as an intrinsic in HotSpot.

-

PR: https://git.openjdk.java.net/jdk/pull/7231


Re: RFR: 8281146: Replace StringCoding.hasNegatives with countPositives [v9]

2022-03-03 Thread Andrew Haley
On Thu, 3 Mar 2022 12:04:47 GMT, Claes Redestad  wrote:

>> I'm requesting comments and, hopefully, some help with this patch to replace 
>> `StringCoding.hasNegatives` with `countPositives`. The new method does a 
>> very similar pass, but alters the intrinsic to return the number of leading 
>> bytes in the `byte[]` range which only has positive bytes. This allows for 
>> dealing much more efficiently with those `byte[]`s that has a ASCII prefix, 
>> with no measurable cost on ASCII-only or latin1/UTF16-mostly input.
>> 
>> Microbenchmark results: 
>> https://jmh.morethan.io/?gists=428b487e92e3e47ccb7f169501600a88,3c585de7435506d3a3bdb32160fe8904
>> 
>> - Only implemented on x86 for now, but I want to verify that implementations 
>> of `countPositives` can be implemented with similar efficiency on all 
>> platforms that today implement a `hasNegatives` intrinsic (aarch64, ppc etc) 
>> before moving ahead. This pretty much means holding up this until it's 
>> implemented on all platforms, which can either contributed to this PR or as 
>> dependent follow-ups.
>> 
>> - An alternative to holding up until all platforms are on board is to allow 
>> the implementation of `StringCoding.hasNegatives` and `countPositives` to be 
>> implemented so that the non-intrinsified method calls into the intrinsified. 
>> This requires structuring the implementations differently based on which 
>> intrinsic - if any - is actually implemented. One way to do this could be to 
>> mimic how `java.nio` handles unaligned accesses and expose which intrinsic 
>> is available via `Unsafe` into a `static final` field.
>> 
>> - There are a few minor regressions (~5%) in the x86 implementation on 
>> `encode-/decodeLatin1Short`. Those regressions disappear when mixing inputs, 
>> for example `encode-/decodeShortMixed` even see a minor improvement, which 
>> makes me consider those corner case regressions with little real world 
>> implications (if you have latin1 Strings, you're likely to also have 
>> ASCII-only strings in your mix).
>
> Claes Redestad has updated the pull request incrementally with one additional 
> commit since the last revision:
> 
>   Document that it's allowed for implementations to return values less than 
> the exact count (iff there are negative bytes)

> * There are a few minor regressions (~5%) in the x86 implementation on 
> `encode-/decodeLatin1Short`. Those regressions disappear when mixing inputs, 
> for example `encode-/decodeShortMixed` even see a minor improvement, which 
> makes me consider those corner case regressions with little real world 
> implications (if you have latin1 Strings, you're likely to also have 
> ASCII-only strings in your mix).

I'm not sure that we can disregard such cases. You're probably right, but it'd 
be interesting to know the actual cause of the problem, perhaps with perfasm. 
Or do you know already?

-

PR: https://git.openjdk.java.net/jdk/pull/7231


Re: RFR: 8281146: Replace StringCoding.hasNegatives with countPositives [v6]

2022-03-02 Thread Andrew Haley
On Tue, 1 Mar 2022 19:12:17 GMT, Claes Redestad  wrote:

> > @theRealAph , @a74nh or someone familiar with aarch64 code, please review 
> > aarch64 changes.
> 
> Note that the aarch64 changes I've put in for now implements `countPositives` 
> to return `0` if there's a negative value anywhere, otherwise `len`. This way 
> we can remove the intrinsic scaffolding for `hasNegatives` once I integrate 
> s390 and ppc 

Sure. And we don't have to check which of the 8 bytes in a word has its top bit 
set, so we don't have to do any more work, just return the count when we 
stopped searching. So there's no possible performance regression on AArch64 as 
far as I can see.
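
A sketch of the relaxed contract being relied on here (my illustration, not the JDK's fallback code): an implementation may return any prefix length c such that the first c bytes are non-negative, and must return len only when the whole range contains no negative bytes.

    public class CountPositivesSketch {
        // Simplest conforming fallback: report the exact length of the non-negative
        // prefix. Returning a smaller value, even 0, once a negative byte exists
        // would also satisfy the relaxed contract discussed above.
        static int countPositives(byte[] ba, int off, int len) {
            for (int i = off; i < off + len; i++) {
                if (ba[i] < 0) {
                    return i - off;
                }
            }
            return len;
        }

        public static void main(String[] args) {
            System.out.println(countPositives(new byte[] {'a', 'b', (byte) 0xC3, 'c'}, 0, 4)); // 2
            System.out.println(countPositives("ascii".getBytes(), 0, 5));                      // 5
        }
    }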

-

PR: https://git.openjdk.java.net/jdk/pull/7231


Re: Should System.exit be controlled by a Scope Local?

2022-03-01 Thread Andrew Haley

On 3/1/22 11:45, Andrew Haley wrote:

Sure, you wouldn't
be able to use the default thread pool, but that's no big deal, I would have
thought.


I'm sorry, I'll say that again. :-)


I meant to say "you wouldn't be able to use the default thread pool if
you wanted to use threads with some restrictions (e.g. those that couldn't
invoke System.exit())."

--
Andrew Haley  (he/him)
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
https://keybase.io/andrewhaley
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671


Re: Should System.exit be controlled by a Scope Local?

2022-03-01 Thread Andrew Haley

On 2/28/22 15:32, Andrew Haley wrote:


I think all we'd need is a set of capabilities bound to a scope local
at thread startup, and I guess it'd default to "all capabilities".
Trusted code could then override any of those capabilities.

We'd have to make sure that capabilities were inherited by threads,
and we'd have to think very carefully about thread pools. The problem
there is that while it would (I guess) make sense to prevent all code
executing in thread pools from calling System.exit(), there's an
obvious compatibility problem if it can't.


Although... there certainly is some potential profit in restricted thread
pools, which have no compatibility problems because it'd be a new feature.
I think this solves the problem Alan Bateman raised too. Sure, you wouldn't
be able to use the default thread pool, but that's no big deal, I would have
thought.

--
Andrew Haley  (he/him)
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
https://keybase.io/andrewhaley
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671


Re: Should System.exit be controlled by a Scope Local?

2022-02-28 Thread Andrew Haley

On 2/28/22 15:12, Sean Mullan wrote:
>
> On 2/27/22 1:47 PM, Andrew Haley wrote:
>
>> I'd like to explore the use of scope locals as a lightweight means to
>> implement a system of permissions and capabilities for things such as
>> this.
>
> Now you have piqued my curiosity, as I have explored a capability based
> model for intercepting `System.exit`. Can you say any more about this yet?

I think all we'd need is a set of capabilities bound to a scope local
at thread startup, and I guess it'd default to "all capabilities".
Trusted code could then override any of those capabilities.

We'd have to make sure that capabilities were inherited by threads,
and we'd have to think very carefully about thread pools. The problem
there is that while it would (I guess) make sense to prevent all code
executing in thread pools from calling System.exit(), there's an
obvious compatibility problem if it can't.
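
Purely as an illustration of the idea (not from the original mail): a sketch using the ScopedValue preview API that later replaced scope locals, with an invented Capability enum and method names. It binds a capability set for a scope and has an exit helper consult it, defaulting to "all capabilities" when nothing is bound.

    import java.util.EnumSet;
    import java.util.Set;

    public class CapabilitySketch {
        enum Capability { EXIT_VM }

        static final ScopedValue<Set<Capability>> CAPS = ScopedValue.newInstance();

        static void exitIfPermitted(int status) {
            // Unbound scope means "all capabilities", preserving today's behavior.
            Set<Capability> caps = CAPS.orElse(EnumSet.allOf(Capability.class));
            if (!caps.contains(Capability.EXIT_VM)) {
                throw new SecurityException("System.exit not permitted in this scope");
            }
            System.exit(status);
        }

        public static void main(String[] args) {
            // Trusted code withholds EXIT_VM for everything run inside the scope.
            ScopedValue.where(CAPS, EnumSet.noneOf(Capability.class))
                       .run(() -> exitIfPermitted(0));   // throws SecurityException
        }
    }

(On the JDKs where ScopedValue is a preview API this needs --enable-preview.)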

--
Andrew Haley  (he/him)
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
https://keybase.io/andrewhaley
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671


Re: Should System.exit be controlled by a Scope Local?

2022-02-27 Thread Andrew Haley

On 2/26/22 22:14, Ethan McCue wrote:

As called out in JEP 411, one of the remaining legitimate uses of the
Security Manager is to intercept calls to System.exit. This seems like a
decent use case for the Scope Local mechanism.


It could well be. One problem, at least in the preview version of scope locals,
is that scope locals are only inherited in a structured concurrency context,
so it wouldn't protect against starting a new Thread which then called
System.exit().

I'd like to explore the use of scope locals as a lightweight means to
implement a system of permissions and capabilities for things such as
this.

--
Andrew Haley  (he/him)
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
https://keybase.io/andrewhaley
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671


Re: RFR: 8282047: Enhance StringDecode/Encode microbenchmarks [v2]

2022-02-25 Thread Andrew Haley
On Fri, 25 Feb 2022 10:19:17 GMT, Claes Redestad  wrote:

>> test/micro/org/openjdk/bench/java/lang/StringDecode.java line 72:
>> 
>>> 70: public void setup() {
>>> 71: charset = Charset.forName(charsetName);
>>> 72: asciiString = LOREM.substring(0, 32).getBytes(charset);
>> 
>> This is problematic IMO in that it's missing short strings such as "Claes". 
>> Average Java strings are about 32 bytes long AFAICR, and people writing 
>> (vectorized) intrinsics have a nasty habit of optimizing for long strings, 
>> to the detriment of typical-length ones.
>> Whether we like it or not, people will optimize for benchmarks, so it's 
>> important that benchmark data is realistic. The shortest here is 15 bytes, 
>> as far as I can see. I'd certainly include a short string of just a few 
>> bytes so that intrinsics don't cause regressions in important cases.
>
> All good points. I've added a number of such short variants to all(?) 
> relevant microbenchmarks. The tests should now better cover a mix of input 
> lengths and encodings.



-

PR: https://git.openjdk.java.net/jdk/pull/7516


Re: RFR: 8282047: Enhance StringDecode/Encode microbenchmarks [v2]

2022-02-25 Thread Andrew Haley
On Fri, 25 Feb 2022 09:19:33 GMT, Claes Redestad  wrote:

>> Splitting out these micro changes from #7231
>> 
>> - Clean up and simplify setup and code
>> - Add variants with different inputs with varying lengths and encoding 
>> weights, but also relevant mixes of each so that we both cover interesting 
>> corner cases but also verify that performance behave when there's a 
>> multitude of input shapes. Both simple and mixed variants are interesting 
>> diagnostic tools.
>> - Drop all charsets from the default run configuration except UTF-8. 
>> Motivation: Defaults should give good coverage but try to keep runtimes at 
>> bay. Additionally if the charset tested can't encode the higher codepoints 
>> used in these micros the results might be misleading. If you for example 
>> test using ISO-8859-1 the UTF-16 bytes in StringDecode.decodeUTF16 will have 
>> all been replaced by `?`, so the test is effectively the same as testing 
>> ASCII-only.
>
> Claes Redestad has updated the pull request incrementally with one additional 
> commit since the last revision:
> 
>   Brent review

test/micro/org/openjdk/bench/java/lang/StringDecode.java line 72:

> 70: public void setup() {
> 71: charset = Charset.forName(charsetName);
> 72: asciiString = LOREM.substring(0, 32).getBytes(charset);

This is problematic IMO in that it's missing short strings such as "Claes". 
Average Java strings are about 32 bytes long AFAICR, and people writing 
(vectorized) intrinsics have a nasty habit of optimizing for long strings, to 
the detriment of typical-length ones.
Whether we like it or not, people will optimize for benchmarks, so it's 
important that benchmark data is realistic. The shortest here is 15 bytes, as 
far as I can see. I'd certainly include a short string of just a few bytes so 
that intrinsics don't cause regressions in important cases.
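
For illustration, the kind of input mix being asked for, with a few-byte string alongside the ~32-byte and long inputs (the names and literal strings below are mine, not the benchmark's):

    import java.nio.charset.Charset;
    import java.nio.charset.StandardCharsets;

    public class InputMixSketch {
        public static void main(String[] args) {
            Charset cs = StandardCharsets.UTF_8;
            byte[] tiny   = "Claes".getBytes(cs);                            // 5 bytes
            byte[] medium = "The quick brown fox jumps over t".getBytes(cs); // 32 bytes
            byte[] longer = "The quick brown fox jumps over the lazy dog. ".repeat(40).getBytes(cs);
            System.out.println(tiny.length + " / " + medium.length + " / " + longer.length);
        }
    }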

-

PR: https://git.openjdk.java.net/jdk/pull/7516


Re: RFR: 8209784: Include hsdis in the JDK

2022-02-22 Thread Andrew Haley
On Tue, 22 Feb 2022 16:10:22 GMT, Magnus Ihse Bursie  wrote:

> This PR adds a new configure argument, `--enable-hsdis-bundling`. Setting 
> this, together with a valid backend using `--with-hsdis`, will bundle the 
> generated hsdis library with the JDK.

Maybe I missed it, but are there instructions somewhere about how to build 
this, and what should be downloaded?

-

PR: https://git.openjdk.java.net/jdk/pull/7578


Re: RFR: 8281631: HashMap.putAll can cause redundant space waste [v3]

2022-02-20 Thread Andrew Haley

On 2/11/22 19:25, XenoAmess wrote:

On Fri, 11 Feb 2022 18:24:49 GMT, Andrew Haley  wrote:


Just multiply by 0.75.

On a modern design, floating-point multiply is 4 clocks latency, 4 ops/clock 
throughput. FP max is 2 clocks latency, conversions int-float and vice versa 3 
clocks latency, 4 ops/clock throughput. Long division is 7-9 clocks, 2 ops/clock 
throughput. Shift and add 2 clocks, 2/3 ops/clock throughput. Compare is 1 
clock, 3 ops/clock throughput, conditional move is 1 clock, 3 ops/clock 
throughput.

Seems like it's a wash.


@theRealAph

no multiply but divide.


Well yes, but that doesn't look at all hard to change.


besides, did you count the cost for Math.ceil? it is the heaviest part.


Yes. 3 clocks latency, 4 ops/clock throughput. Your hardware may vary.
And that instruction does both the ceil() and the float-int conversion.

(Having said that, I don't know if we currently generate optimal code
for this operation. But of course that can be fixed.)

I don't think this is terribly important, but I don't like to see
attempts at hand optimization in the standard library.


Re: RFR: 8279508: Auto-vectorize Math.round API [v3]

2022-02-14 Thread Andrew Haley
On Sun, 13 Feb 2022 13:12:35 GMT, Jatin Bhateja  wrote:

>>> Hi, IIRC for evex encoding you can embed the RC control bit directly in the 
>>> evex prefix, removing the need to rely on global MXCSR register. Thanks.
>> 
>> Hi @merykitty ,  You are correct, we can embed RC mode in instruction 
>> encoding of round instruction (towards -inf,+inf, zero). But to match the 
>> semantics of Math.round API one needs to add 0.5[f] to input value and then 
>> perform rounding over resultant value, which is why @sviswa7 suggested to 
>> use a global rounding mode driven by MXCSR.RC so that intermediate floating 
>> inexact values are resolved as desired, but OOO execution may misplace 
>> LDMXCSR and hence may have undesired side effects.
>
>> What does this do? Comment, even pseudo code, would be nice.
> 
> Thanks @theRealAph , I shall append the comments over the routine.
> BTW, entire rounding algorithm can also be implemented using  Vector API 
> which can perform if-conversion using masked operations.
> 
> class roundf {
>    public static VectorSpecies<Integer> ISPECIES = IntVector.SPECIES_512;
>    public static VectorSpecies<Float> SPECIES = FloatVector.SPECIES_512;
> 
>    public static int round_vector(float[] a, int[] r, int ctr) {
>       IntVector shiftVBC = (IntVector) ISPECIES.broadcast(24 - 2 + 127);
>       for (int i = 0; i < a.length; i += SPECIES.length()) {
>          FloatVector fv = FloatVector.fromArray(SPECIES, a, i);
>          IntVector iv = fv.reinterpretAsInts();
>          IntVector biasedExpV = iv.lanewise(VectorOperators.AND, 0x7F800000);
>          biasedExpV = biasedExpV.lanewise(VectorOperators.ASHR, 23);
>          IntVector shiftV = shiftVBC.lanewise(VectorOperators.SUB, biasedExpV);
>          VectorMask<Integer> cond = shiftV.lanewise(VectorOperators.AND, -32)
>                                           .compare(VectorOperators.EQ, 0);
>          IntVector res = iv.lanewise(VectorOperators.AND, 0x007FFFFF)
>                            .lanewise(VectorOperators.OR, 0x007FFFFF + 1);
>          VectorMask<Integer> cond1 = iv.compare(VectorOperators.LT, 0);
>          VectorMask<Integer> cond2 = cond1.and(cond);
>          res = res.lanewise(VectorOperators.NEG, cond2);
>          res = res.lanewise(VectorOperators.ASHR, shiftV)
>                   .lanewise(VectorOperators.ADD, 1)
>                   .lanewise(VectorOperators.ASHR, 1);
>          res = fv.convert(VectorOperators.F2I, 0)
>                  .reinterpretAsInts()
>                  .blend(res, cond);
>          res.intoArray(r, i);
>       }
>       return r[ctr];
>    }
> }

That pseudocode would make a very useful comment too. This whole patch is very 
thinly commented.

-

PR: https://git.openjdk.java.net/jdk/pull/7094


Re: RFR: 8279508: Auto-vectorize Math.round API [v3]

2022-02-13 Thread Andrew Haley
On Sun, 13 Feb 2022 03:09:43 GMT, Jatin Bhateja  wrote:

>> Summary of changes:
>> - Intrinsify Math.round(float) and Math.round(double) APIs.
>> - Extend auto-vectorizer to infer vector operations on encountering scalar 
>> IR nodes for above intrinsics.
>> - Test creation using new IR testing framework.
>> 
>> Following are the performance number of a JMH micro included with the patch 
>> 
>> Test System: Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz (Icelake Server)
>> 
>> 
>> Benchmark | TESTSIZE | Baseline AVX3 (ops/ms) | Withopt AVX3 (ops/ms) | Gain ratio | Baseline AVX2 (ops/ms) | Withopt AVX2 (ops/ms) | Gain ratio
>> -- | -- | -- | -- | -- | -- | -- | --
>> FpRoundingBenchmark.test_round_double | 1024.00 | 584.99 | 1870.70 | 3.20 | 510.35 | 548.60 | 1.07
>> FpRoundingBenchmark.test_round_double | 2048.00 | 257.17 | 965.33 | 3.75 | 293.60 | 273.15 | 0.93
>> FpRoundingBenchmark.test_round_float | 1024.00 | 825.69 | 3592.54 | 4.35 | 825.32 | 1836.42 | 2.23
>> FpRoundingBenchmark.test_round_float | 2048.00 | 388.55 | 1895.77 | 4.88 | 412.31 | 945.82 | 2.29
>> 
>> 
>> Kindly review and share your feedback.
>> 
>> Best Regards,
>> Jatin
>
> Jatin Bhateja has updated the pull request with a new target base due to a 
> merge or a rebase. The incremental webrev excludes the unrelated changes 
> brought in by the merge/rebase. The pull request contains four additional 
> commits since the last revision:
> 
>  - 8279508: Adding vectorized algorithms to match the semantics of rounding 
> operations.
>  - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8279508
>  - 8279508: Adding a test for scalar intrinsification.
>  - 8279508: Auto-vectorize Math.round API

src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 4066:

> 4064: }
> 4065: 
> 4066: void 
> C2_MacroAssembler::vector_cast_double_special_cases_evex(XMMRegister dst, 
> XMMRegister src, XMMRegister xtmp1,

What does this do? Comment, even pseudo code, would be nice.

-

PR: https://git.openjdk.java.net/jdk/pull/7094


Re: RFR: 8281631: HashMap.putAll can cause redundant space waste [v3]

2022-02-11 Thread Andrew Haley
On Fri, 11 Feb 2022 18:13:36 GMT, Roger Riggs  wrote:

>> @RogerRiggs 
>> so you recommend `(int) Math.min(((m.size() * 4L + 2L) / 3L), 
>> Integer.MAX_VALUE)`? OK then, changed it.
>> please review again, thanks!
>
> Works for me.  Thanks

Just multiply by 0.75.

On a modern design, floating-point multiply is 4 clocks latency, 4 ops/clock 
throughput. FP max is 2 clocks latency, conversions int-float and vice versa 3 
clocks latency, 4 ops/clock throughput.
Long division is 7-9 clocks, 2 ops/clock throughput. Shift and add 2 clocks, 2/3 
ops/clock throughput. Compare is 1 clock, 3 ops/clock throughput, conditional 
move is 1 clock, 3 ops/clock throughput.

Seems like it's a wash.
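
For what it's worth, a standalone comparison of the two computations being weighed here (my example, not from the PR): ceil(size / 0.75) versus the integer form (size * 4 + 2) / 3 quoted above.

    public class CapacityMathSketch {
        public static void main(String[] args) {
            for (int size : new int[] {0, 1, 6, 12, 13, 1_000_000}) {
                int viaDouble = (int) Math.ceil(size / 0.75);
                int viaLong   = (int) Math.min((size * 4L + 2L) / 3L, Integer.MAX_VALUE);
                System.out.printf("size=%d  ceil(size/0.75)=%d  (4*size+2)/3=%d%n",
                                  size, viaDouble, viaLong);
            }
        }
    }

For these inputs the two agree; the question in the thread is only which form the JIT compiles more cheaply.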

-

PR: https://git.openjdk.java.net/jdk/pull/7431


Re: a quick question about String

2022-01-03 Thread Andrew Haley

On 12/30/21 16:12, Alexander Scherbatiy wrote:

Would it be useful to have some kind of an immutable array in Java language
which works in the same way as ordinary array except it is not to
possible to change
its values after creation?


Yes.  https://openjdk.java.net/jeps/8261007

--
Andrew Haley  (he/him)
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
https://keybase.io/andrewhaley
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671


Re: Adding an @Immutable annotation to Java

2021-11-25 Thread Andrew Haley

On 11/25/21 08:27, Justin Dekeyser wrote:



Quick question, out of curiosity: how would it behave with respect to
inheritance? Can a @Immutable class inherit from a non-immutable one?


And: does @Immutable mean deeply immutable? IMO it really should,
but that's harder to check, and we'd have to think about what this
means for binary compatibility.

On 11/25/21 08:39, Alberto Otero Rodríguez wrote:
> I have not thought about that. I'm not a Java expert.
>
> I just threw the idea out precisely to avoid "fake immutability", because a 
> programmer could think a record is immutable simply by being a record, while this is false.
>
> There are probably lots of problems that need to be taken into consideration 
> (like inheritance). But I just threw the idea out because I think it would be a 
> desirable feature.
>
> I would be grateful if some expert could deepen in possible problems and 
their solutions.

Everything is always harder than it looks at first.

--
Andrew Haley  (he/him)
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
https://keybase.io/andrewhaley
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671


Re: RFR: 8276217: Harmonize StrictMath intrinsics handling [v3]

2021-11-03 Thread Andrew Haley
On Tue, 2 Nov 2021 06:25:33 GMT, Aleksey Shipilev  wrote:

>> This blocks JDK-8276215: `StrictMath` intrinsics are handled peculiarly by 
>> giving failing intrinsics a second chance to match against the similar 
>> `Math` intrinsics. This has interesting consequence for matchers: we can 
>> match the native `StrictMath.sqrt` to non-native intrinsic for `Math.sqrt`. 
>> Interpreter would then have to disambiguate the two. It could be made 
>> simpler and more consistent.
>> 
>> For `min`/`max` methods, `StrictMath` already delegates to `Math` methods, 
>> so we can just drop the intrinsics for them. `sqrt` is harder to delegate, 
>> because it is `native` and a part of public API, so we can instead do the 
>> proper special intrinsic for it.
>> 
>> There seem to be no performance regressions with this patch at least on 
>> Linux x86_64:
>> 
>> 
>> $ CONF=linux-x86_64-server-release make test TEST="micro:StrictMathBench" 
>> 
>> Benchmark   Mode  Cnt   Score Error   Units
>> 
>> ### Before
>> 
>> StrictMathBench.minDouble  thrpt4  230921.558 ± 234.238  ops/ms
>> StrictMathBench.minFloat   thrpt4  230932.303 ± 126.721  ops/ms
>> StrictMathBench.minInt thrpt4  230917.256 ±  73.008  ops/ms
>> StrictMathBench.minLongthrpt4  194460.828 ± 178.079  ops/ms
>> 
>> 
>> StrictMathBench.maxDouble  thrpt4  230983.180 ± 161.211  ops/ms
>> StrictMathBench.maxFloat   thrpt4  230969.290 ± 277.500  ops/ms
>> StrictMathBench.maxInt thrpt4  231033.581 ± 200.015  ops/ms
>> StrictMathBench.maxLongthrpt4  194590.744 ± 114.295  ops/ms
>> 
>> 
>> StrictMathBench.sqrtDouble  thrpt4  230722.037 ± .080  ops/ms
>> 
>> ### After
>> 
>> StrictMathBench.minDouble  thrpt4  230976.625 ±  67.338  ops/ms
>> StrictMathBench.minFloat   thrpt4  230896.021 ± 270.434  ops/ms
>> StrictMathBench.minInt thrpt4  230859.741 ± 403.147  ops/ms
>> StrictMathBench.minLongthrpt4  194456.673 ± 111.557  ops/ms
>> 
>> StrictMathBench.maxDouble  thrpt4  230890.776 ±  89.924  ops/ms
>> StrictMathBench.maxFloat   thrpt4  230918.334 ±  63.160  ops/ms
>> StrictMathBench.maxInt thrpt4  231059.128 ±  51.224  ops/ms
>> StrictMathBench.maxLongthrpt4  194488.210 ± 495.224  ops/ms
>> 
>> StrictMathBench.sqrtDouble  thrpt4  231023.703 ± 247.330  ops/ms
>> 
>> 
>> Additional testing:
>>  - [x] `StrictMath` benchmarks
>>  - [x] Linux x86_64 fastdebug `java/lang/StrictMath`, `java/lang/Math`
>>  - [x] Linux x86_64 fastdebug `tier1`
>
> Aleksey Shipilev has updated the pull request incrementally with one 
> additional commit since the last revision:
> 
>   Keep intrinsics on StrictMath

Marked as reviewed by aph (Reviewer).

-

PR: https://git.openjdk.java.net/jdk/pull/6184


Re: RFR: 8276217: Harmonize StrictMath intrinsics handling

2021-11-01 Thread Andrew Haley
On Mon, 1 Nov 2021 11:23:16 GMT, Aleksey Shipilev  wrote:

> This blocks JDK-8276215: `StrictMath` intrinsics are handled peculiarly by 
> giving failing intrinsics a second chance to match against the similar `Math` 
> intrinsics. This has interesting consequence for matchers: we can match the 
> native `StrictMath.sqrt` to non-native intrinsic for `Math.sqrt`. Interpreter 
> would then have to disambiguate the two. It could be made simpler and more 
> consistent.
> 
> For `min`/`max` methods, `StrictMath` already delegates to `Math` methods, so 
> we can just drop the intrinsics for them. `sqrt` is harder to delegate, 
> because it is `native` and a part of public API, so we can instead do the 
> proper special intrinsic for it.
> 
> There seem to be no performance regressions with this patch at least on Linux 
> x86_64:
> 
> 
> $ CONF=linux-x86_64-server-release make test TEST="micro:StrictMathBench" 
> 
> Benchmark   Mode  Cnt   Score Error   Units
> 
> ### Before
> 
> StrictMathBench.minDouble  thrpt4  230921.558 ± 234.238  ops/ms
> StrictMathBench.minFloat   thrpt4  230932.303 ± 126.721  ops/ms
> StrictMathBench.minInt thrpt4  230917.256 ±  73.008  ops/ms
> StrictMathBench.minLongthrpt4  194460.828 ± 178.079  ops/ms
> 
> 
> StrictMathBench.maxDouble  thrpt4  230983.180 ± 161.211  ops/ms
> StrictMathBench.maxFloat   thrpt4  230969.290 ± 277.500  ops/ms
> StrictMathBench.maxInt thrpt4  231033.581 ± 200.015  ops/ms
> StrictMathBench.maxLongthrpt4  194590.744 ± 114.295  ops/ms
> 
> 
> StrictMathBench.sqrtDouble  thrpt4  230722.037 ± .080  ops/ms
> 
> ### After
> 
> StrictMathBench.minDouble  thrpt4  230976.625 ±  67.338  ops/ms
> StrictMathBench.minFloat   thrpt4  230896.021 ± 270.434  ops/ms
> StrictMathBench.minInt thrpt4  230859.741 ± 403.147  ops/ms
> StrictMathBench.minLongthrpt4  194456.673 ± 111.557  ops/ms
> 
> StrictMathBench.maxDouble  thrpt4  230890.776 ±  89.924  ops/ms
> StrictMathBench.maxFloat   thrpt4  230918.334 ±  63.160  ops/ms
> StrictMathBench.maxInt thrpt4  231059.128 ±  51.224  ops/ms
> StrictMathBench.maxLongthrpt4  194488.210 ± 495.224  ops/ms
> 
> StrictMathBench.sqrtDouble  thrpt4  231023.703 ± 247.330  ops/ms
> 
> 
> Additional testing:
>  - [x] `StrictMath` benchmarks
>  - [x] Linux x86_64 fastdebug `tier1`

So we have `_dsqrt` and `_dsqrt_strict`, which must be functionally identical, but 
we provide both names because they're part of a public API. I think this 
deserves an explanatory comment in the code.

-

PR: https://git.openjdk.java.net/jdk/pull/6184


Re: RFR: 8276096: Simplify Unsafe.{load|store}Fence fallbacks by delegating to fullFence

2021-10-29 Thread Andrew Haley
On Thu, 28 Oct 2021 08:47:31 GMT, Aleksey Shipilev  wrote:

> `Unsafe.{load|store}Fence` falls back to `unsafe.cpp` for 
> `OrderAccess::{acquire|release}Fence()`. It seems too heavy-handed (useless?) 
> to call to runtime for a single memory barrier. We can simplify the native 
> `Unsafe` interface by falling back to `fullFence` when `{load|store}Fence` 
> intrinsics are not available. This would be similar to what 
> `Unsafe.{loadLoad|storeStore}Fences` do. 
> 
> This is the behavior of these intrinsics now, on x86_64, using benchmarks 
> from JDK-8276054:
> 
> 
> Benchmark  Mode  Cnt  Score   Error  Units
> 
> # Default
> Single.acquire avgt3   0.407 ± 0.060  ns/op
> Single.fullavgt3   4.693 ± 0.005  ns/op
> Single.loadLoadavgt3   0.415 ± 0.095  ns/op
> Single.plain   avgt3   0.406 ± 0.002  ns/op
> Single.release avgt3   0.408 ± 0.047  ns/op
> Single.storeStore  avgt3   0.408 ± 0.043  ns/op
> 
> # -XX:DisableIntrinsic=_storeFence
> Single.acquire avgt3   0.408 ± 0.016  ns/op
> Single.fullavgt3   4.694 ± 0.002  ns/op
> Single.loadLoadavgt3   0.406 ± 0.002  ns/op
> Single.plain   avgt3   0.406 ± 0.001  ns/op
> Single.release avgt3   4.694 ± 0.003  ns/op <--- upgraded to full
> Single.storeStore  avgt3   4.690 ± 0.005  ns/op <--- upgraded to full
> 
> # -XX:DisableIntrinsic=_loadFence
> Single.acquire avgt3   4.691 ± 0.001  ns/op <--- upgraded to full
> Single.fullavgt3   4.693 ± 0.009  ns/op
> Single.loadLoadavgt3   4.693 ± 0.013  ns/op <--- upgraded to full
> Single.plain   avgt3   0.408 ± 0.072  ns/op
> Single.release avgt3   0.415 ± 0.016  ns/op
> Single.storeStore  avgt3   0.416 ± 0.041  ns/op
> 
> # -XX:DisableIntrinsic=_fullFence
> Single.acquire avgt3   0.406 ± 0.014  ns/op
> Single.fullavgt3  15.836 ± 0.151  ns/op <--- calls runtime
> Single.loadLoadavgt3   0.406 ± 0.001  ns/op
> Single.plain   avgt3   0.426 ± 0.361  ns/op
> Single.release avgt3   0.407 ± 0.021  ns/op
> Single.storeStore  avgt3   0.410 ± 0.061  ns/op
> 
> # -XX:DisableIntrinsic=_fullFence,_loadFence
> Single.acquire avgt3  15.822 ± 0.282  ns/op <--- upgraded, calls 
> runtime
> Single.fullavgt3  15.851 ± 0.127  ns/op <--- calls runtime
> Single.loadLoadavgt3  15.829 ± 0.045  ns/op <--- upgraded, calls 
> runtime
> Single.plain   avgt3   0.406 ± 0.001  ns/op
> Single.release avgt3   0.414 ± 0.156  ns/op
> Single.storeStore  avgt3   0.422 ± 0.452  ns/op
> 
> # -XX:DisableIntrinsic=_fullFence,_storeFence
> Single.acquire avgt3   0.407 ± 0.016  ns/op
> Single.fullavgt3  15.347 ± 6.783  ns/op <--- calls runtime
> Single.loadLoadavgt3   0.406 ± 0.001  ns/op
> Single.plain   avgt3   0.406 ± 0.002  ns/op 
> Single.release avgt3  15.828 ± 0.019  ns/op <--- upgraded, calls 
> runtime
> Single.storeStore  avgt3  15.834 ± 0.045  ns/op <--- upgraded, calls 
> runtime
> 
> # -XX:DisableIntrinsic=_fullFence,_loadFence,_storeFence
> Single.acquire avgt3  15.838 ± 0.030  ns/op <--- upgraded, calls 
> runtime
> Single.fullavgt3  15.854 ± 0.277  ns/op <--- calls runtime
> Single.loadLoadavgt3  15.826 ± 0.160  ns/op <--- upgraded, calls 
> runtime
> Single.plain   avgt3   0.406 ± 0.003  ns/op
> Single.release avgt3  15.838 ± 0.019  ns/op <--- upgraded, calls 
> runtime
> Single.storeStore  avgt3  15.844 ± 0.104  ns/op <--- upgraded, calls 
> runtime
> 
> 
> Additional testing:
>  - [x] Linux x86_64 fastdebug `tier1`

Marked as reviewed by aph (Reviewer).

-

PR: https://git.openjdk.java.net/jdk/pull/6149


Re: RFR: 8275167: x86 intrinsic for unsignedMultiplyHigh

2021-10-15 Thread Andrew Haley
On Wed, 13 Oct 2021 18:55:10 GMT, Vamsi Parasa  wrote:

> Optimize the new Math.unsignedMultiplyHigh using the x86 mul instruction. 
> This change show 1.87X improvement on a micro benchmark.

test/micro/org/openjdk/bench/java/lang/MathBench.java line 547:

> 545: return  Math.unsignedMultiplyHigh(long747, long13);
> 546: }
> 547: 

As far as I can tell, the JMH overhead dominates when trying to measure the 
latency of events in the nanosecond range. `unsignedMultiplyHigh` should have a 
latency of maybe 1.5-2ns. Is that what you saw?

-

PR: https://git.openjdk.java.net/jdk/pull/5933


Re: RFR: 8275167: x86 intrinsic for unsignedMultiplyHigh

2021-10-15 Thread Andrew Haley
On Wed, 13 Oct 2021 18:55:10 GMT, Vamsi Parasa  wrote:

> Optimize the new Math.unsignedMultiplyHigh using the x86 mul instruction. 
> This change show 1.87X improvement on a micro benchmark.

src/hotspot/share/opto/mulnode.cpp line 468:

> 466: }
> 467: 
> 468: 
> //=

MulHiLNode::Value() and UMulHiLNode::Value() seem to be identical. Perhaps some 
refactoring would be in order, maybe make a common shared routine.

-

PR: https://git.openjdk.java.net/jdk/pull/5933


Re: Optimization (?) of HashSet(Collection)

2021-10-15 Thread Andrew Haley

On 10/4/21 12:57, Сергей Цыпанов wrote:



in the code of HashSet(Collection) we have an optimization opportunity:

public HashSet(Collection<? extends E> c) {
    map = new HashMap<>(Math.max((int) (c.size()/.75f) + 1, 16));
    addAll(c);
}

instead of using addAll() inherited from j.u.Collection we can use
c.forEach(this::add):

public HashSet(Collection<? extends E> c) {
    map = new HashMap<>(Math.max((int) (c.size()/.75f) + 1, 16));
    c.forEach(this::add);
}

This allows us to get rid of the Iterator and use counting loops for e.g.
ArrayList instead of hasNext()/next().


Well, perhaps, but this sort of low-level hackery is not IMO something we
should reflexively do in the core class libraries.


We are also likely to benefit from this change in case the argument collection 
is synchronized as it's going to be locked only once.


That's a good point, but would a synchronized collection really lock
coarsely around forEach(), rather than in add()?


I've used the benchmark [1] to analyze the impact and it demonstrates 
measurable improvement [2].

What puzzles me a bit is the result for j.u.ArrayList. At warm-up time it 
demonstrates better results than at measurement time:

  length = 10

# Run progress: 4,44% complete, ETA 00:21:41
# Fork: 3 of 5
# Warmup Iteration   1: 134,699 ns/op
# Warmup Iteration   2: 115,391 ns/op
# Warmup Iteration   3: 130,110 ns/op
# Warmup Iteration   4: 114,465 ns/op  <
# Warmup Iteration   5: 114,849 ns/op
# Warmup Iteration   6: 115,073 ns/op
# Warmup Iteration   7: 113,988 ns/op
# Warmup Iteration   8: 115,301 ns/op
# Warmup Iteration   9: 115,452 ns/op
# Warmup Iteration  10: 124,493 ns/op  <
Iteration   1: 123,776 ns/op
Iteration   2: 123,719 ns/op
Iteration   3: 123,725 ns/op
Iteration   4: 126,892 ns/op
Iteration   5: 125,493 ns/op
Iteration   6: 124,661 ns/op
Iteration   7: 123,783 ns/op
Iteration   8: 123,975 ns/op
Iteration   9: 124,047 ns/op
Iteration  10: 123,899 ns/op

length = 100

# Run progress: 11,11% complete, ETA 00:20:10
# Fork: 1 of 5
# Warmup Iteration   1: 1236,695 ns/op
# Warmup Iteration   2: 1069,909 ns/op
# Warmup Iteration   3: 1243,852 ns/op
# Warmup Iteration   4: 1059,038 ns/op  <
# Warmup Iteration   5: 1057,670 ns/op
# Warmup Iteration   6: 1057,117 ns/op
# Warmup Iteration   7: 1056,932 ns/op
# Warmup Iteration   8: 1054,909 ns/op
# Warmup Iteration   9: 1058,106 ns/op
# Warmup Iteration  10: 1145,699 ns/op  <
Iteration   1: 1139,155 ns/op
Iteration   2: 1141,662 ns/op
Iteration   3: 1139,061 ns/op
Iteration   4: 1139,306 ns/op
Iteration   5: 1138,475 ns/op
Iteration   6: 1140,491 ns/op
Iteration   7: 1144,017 ns/op
Iteration   8: 1139,411 ns/op
Iteration   9: 1142,729 ns/op
Iteration  10: 1144,042 ns/op


Here I use 1 s warm-up iterations and 2 s measurement iterations. It looks like at 
iteration 4 the code is top-optimized,
and later at the end of warm-up a kind of deoptimization happens. There's still 
visible improvement however.

The benchmark is run with 5 forks, and this deoptimization is visible for 
length = {10, 100}.

So my two question are:
- What is the reason of the deoptimization and can we do something about it?


C2 does this sort of thing all the time: it's a heuristic probabilistic
optimizer. I've seen plenty of bimodal examples, where inlining changes
and each time a program is run  it's either fast or slow, quasi-randomly.

If you really want to know, use -prof perfasm in JMH.


- Whether suggested changes is worth implementations?


IMO the gain is too small. Others may disagree.

--
Andrew Haley  (he/him)
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
https://keybase.io/andrewhaley
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671


Re: RFR: 8275262: [BACKOUT] AArch64: Implement string_compare intrinsic in SVE

2021-10-14 Thread Andrew Haley
On Thu, 14 Oct 2021 10:21:20 GMT, David Holmes  wrote:

> See https://bugs.openjdk.java.net/browse/JDK-8275263 for some info on the 
> failures. It seems really bizarre.

I would understand this a lot better if the affected machines actually used 
SVE, but they don't have the hardware.
That does reduce the bug surface: we only need to look at the affected common 
code, and the only thing I can immediately see is that a couple of unused 
register arguments have been added. 
Unless, of course, the affected Macs think they have SVE... ? I don't think 
that's possible.
I'd definitely be looking at the host toolchain for differences.

-

PR: https://git.openjdk.java.net/jdk/pull/5941


Re: RFR: 8275262: [BACKOUT] AArch64: Implement string_compare intrinsic in SVE

2021-10-14 Thread Andrew Haley
On Thu, 14 Oct 2021 06:37:19 GMT, Nick Gasson  wrote:

> This reverts "8269559: AArch64: Implement string_compare intrinsic in SVE" 
> which caused some unknown failures in Oracle's CI.

It might be a spurious failure, then. I guess we need to see the test logs.

-

PR: https://git.openjdk.java.net/jdk/pull/5941


Re: RFR: 8275262: [BACKOUT] AArch64: Implement string_compare intrinsic in SVE

2021-10-14 Thread Andrew Haley
On Thu, 14 Oct 2021 07:08:20 GMT, David Holmes  wrote:

> The issue is only on (some of) our macOS Aarch64 systems. Let me know if I 
> can provide more info on hardware etc.

Any info about what failed? A reproducer would be nice.

-

PR: https://git.openjdk.java.net/jdk/pull/5941


Re: RFR: 4511638: Double.toString(double) sometimes produces incorrect results [v2]

2021-10-13 Thread Andrew Haley
On Fri, 16 Apr 2021 11:30:32 GMT, Raffaello Giulietti  
wrote:

>> Hello,
>> 
>> here's a PR for a patch submitted on March 2020 
>> [1](https://cr.openjdk.java.net/~bpb/4511638/webrev.04/) when Mercurial was 
>> a thing.
>> 
>> The patch has been edited to adhere to OpenJDK code conventions about 
>> multi-line (block) comments. Nothing in the code proper has changed, except 
>> for the addition of redundant but clarifying parentheses in some expressions.
>> 
>> 
>> Greetings
>> Raffaello
>
> Raffaello Giulietti has updated the pull request incrementally with one 
> additional commit since the last revision:
> 
>   4511638: Double.toString(double) sometimes produces incorrect results

Email from GuySteele  follows:

(my reply)

Hi Guy,

for some reason your comments still appear garbled on the GitHub PR page and 
don't make it to the core-libs-dev mailing list at all. Luckily, they appear 
intelligible in my mailbox, so I'll keep going with prepending your comments in 
my replies: not ideal but good enough.

Thanks so much for re-reading my "paper".

printf()

There are some issues to consider when trying to apply Schubfach to printf(), 
the main one being that printf() allows to specify an arbitrary length for the 
resulting decimal. This means, for example, that unlimited precision arithmetic 
is unavoidable. But it might be worthwhile to investigate using Schubfach for 
lengths up to H (9 and 17 for float and double, resp.) and fall back to 
unlimited precision beyond that.
Before that, however, I would prefer to finally push Schubfach in the OpenJDK 
codebase for the toString() cases and close this PR.

I completely agree that using Schubfach to solve only the toString() problems 
would be a _major_ improvement in the situation, and this should not wait for 
exploration of the printf problem. But I suspect that using Schubfach for 
lengths up to H would cover a very large fraction of actual usage, and would 
improve both quality and speed, and therefore would be worth exploring later.

Tests

Below, by "extensive test" I mean not only that the outcomes convert back 
without loss of information, but that they fully meet the spec about minimal 
number of digits, closeness, correct formatting (normal viz scientific), 
character set, etc.

All currently available tests are in the contributed code of this PR and will 
be part of the OpenJDK once integrated.

  *   All powers of 2 and the extreme values are already extensively tested.
  *   All values closest to powers of 10 are extensively tested.
  *   All values proposed by Paxson [1] are extensively tested.

I have now read through the Paxson paper. Does this refer to the values listed 
in his Tables 3 and 4, or to other values instead or in addition?

  *   A configurable number of random values are tested at each round 
(currently 4 * 1'000'000 random values). Should a value fail, there's enough 
diagnostic information for further investigation.

I'll add extensive tests for the values you propose in point (1) and (2), 
setting Z = Y = 1024.

I do think that would lend further confidence.

As for comparison with the current JDK behavior, there are already a bunch of 
values for which extensive tests fail on the current JDK but pass with 
Schubfach.

Yes, thanks for supplying some of those.

It would be cumbersome, if possible at all, to have both the current JDK and 
Schubfach implementations in the same OpenJDK codebase to be able to compare 
the outcomes. I performed comparisons in a different constellation, with 
Schubfach as an external library, but this is hardly a possibility in the 
core-libs. Needless to say, Schubfach outcomes are either the same as in JDK or 
better (shorter or closest to the fp value).

Okay.

I will mention here, for the record, that there is one other sort of test that 
could be performed that I think I have not yet seen you mention: a monotonicity 
test of the kind used by David Hough’s Testbase (mentioned by Paxson). However, 
a little thought reveals that such a test made unnecessary by the round-trip 
test. So a monotonicity test would be a good idea when testing printf case, but 
is not needed for the toString case.

Therefore, if you add the few tests I have suggested, I think that we can say 
with the best certainty we can have, short of separately testing every possible 
double value, that Schubfach is extremely well tested and ready for adoption 
into Java.

Peer reviewed publication

Shortening my 27-page write-up and re-formatting it to meet a journal's standards 
for publication would require investing yet another substantial amount of time. 
I'm not sure I'm prepared for that, given that I've no personal interest in a 
journal publication: I'm not an academic, I'm not pursuing a career...
But I promise I'll think about reconsidering my position on this ;-)

Please do think about reconsidering.

There are several reasons to publish an “academic” paper:
- Earning “merit badges” that lead to academic tenure
- 

Re: RFR: 4511638: Double.toString(double) sometimes produces incorrect results [v2]

2021-10-05 Thread Andrew Haley
On 9/28/21 10:30, Raffaello Giulietti wrote:
> I was pondering whether to propose adding a launch-time option on the command 
> line to choose between the current and the proposed implementation for some 
> time, so to help the community gaining confidence in the new algorithm and 
> still have the current one as a fallback, just in case. (AFAIU, a launch-time 
> option can be constant folded by the JIT compiler to help it eliminate dead 
> code, right?)
> 
> On the one hand, this adds complexity. On the other hand, it could help 
> revive this dormant PR.
> 
> What do you think? Would this be a viable option?

It's a good-enough idea in theory, but I don't think it'd be accepted.

Pinging Joe Darcy: do you (or any of your friends) have a good way
to contact Guy Steele?

-- 
Andrew Haley  (he/him)
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
https://keybase.io/andrewhaley
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671


Re: StringCoding.hasNegatives

2021-10-01 Thread Andrew Haley
On 10/1/21 1:57 PM, Brett Okken wrote:
> I know java.lang.StringCoding.hasNegatives has a
> HotSpotIntrinsicCandidate annotation/implementation, but is there
> interest/value in a faster pure java implementation?
> 
> Using Unsafe to read and compare 8 bytes at a time as a long is faster
> than the current simple implementation both in interpreter only mode
> and default hotspot compiler mode. I can provide jmh bencmark and
> results if this is worthwhile to pursue.

The current pure Java implementation does two things: it provides a fallback
for pure-interpreter JVMs and it provides the reader with a simple 
implementation.
I'm not at all sure we'd want a complex implementation.

Having said that, if I were looking at a faster pure Java version of
this logic, I'd look at MethodHandles.byteArrayViewVarHandle().
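
Something along these lines, perhaps. This is a rough sketch only, not the JDK
code; byte order doesn't matter here because we only test sign bits.

import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.nio.ByteOrder;

class HasNegativesSketch {
    private static final VarHandle LONGS =
        MethodHandles.byteArrayViewVarHandle(long[].class, ByteOrder.nativeOrder());
    private static final long SIGN_BITS = 0x8080808080808080L;

    static boolean hasNegatives(byte[] ba, int off, int len) {
        int i = off, end = off + len;
        // 8 bytes per iteration: any set sign bit means a negative byte.
        for (; i + Long.BYTES <= end; i += Long.BYTES) {
            if (((long) LONGS.get(ba, i) & SIGN_BITS) != 0) {
                return true;
            }
        }
        // Tail, one byte at a time.
        for (; i < end; i++) {
            if (ba[i] < 0) {
                return true;
            }
        }
        return false;
    }
}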

-- 
Andrew Haley  (he/him)
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
https://keybase.io/andrewhaley
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671


Re: RFR: 8269559: AArch64: Implement string_compare intrinsic in SVE [v2]

2021-09-30 Thread Andrew Haley
On Mon, 23 Aug 2021 21:48:01 GMT, TatWai Chong 
 wrote:

>> This patch implements string_compare intrinsic in SVE.
>> It supports all LL, LU, UL and UU comparisons.
>> 
>> As we haven't found an existing benchmark to measure performance impact,
>> we created a benchmark derived from the test [1] for this evaluation.
>> This benchmark is attached to this patch.
>> 
>> Besides, remove the unused temporary register `vtmp3` from the existing
>> match rules for StrCmp.
>> 
>> The result below shows all varients can be benefited largely.
>> Command: make exploded-test TEST="micro:StringCompareToDifferentLength"
>> 
>> Benchmark(size)  Mode  Cnt   Score Speedup Units
>> compareToLL  24  avgt   10  1.0x   ms/op
>> compareToLL  36  avgt   10  1.0x   ms/op
>> compareToLL  72  avgt   10  1.0x   ms/op
>> compareToLL 128  avgt   10  1.4x   ms/op
>> compareToLL 256  avgt   10  1.8x   ms/op
>> compareToLL 512  avgt   10  2.7x   ms/op
>> compareToLU  24  avgt   10  1.6x   ms/op
>> compareToLU  36  avgt   10  1.8x   ms/op
>> compareToLU  72  avgt   10  2.3x   ms/op
>> compareToLU 128  avgt   10  3.8x   ms/op
>> compareToLU 256  avgt   10  4.7x   ms/op
>> compareToLU 512  avgt   10  6.3x   ms/op
>> compareToUL  24  avgt   10  1.6x   ms/op
>> compareToUL  36  avgt   10  1.7x   ms/op
>> compareToUL  72  avgt   10  2.2x   ms/op
>> compareToUL 128  avgt   10  3.3x   ms/op
>> compareToUL 256  avgt   10  4.4x   ms/op
>> compareToUL 512  avgt   10  6.1x   ms/op
>> compareToUU  24  avgt   10  1.0x   ms/op
>> compareToUU  36  avgt   10  1.0x   ms/op
>> compareToUU  72  avgt   10  1.4x   ms/op
>> compareToUU 128  avgt   10  2.2x   ms/op
>> compareToUU 256  avgt   10  2.6x   ms/op
>> compareToUU 512  avgt   10  3.7x   ms/op
>> 
>> [1] 
>> https://github.com/openjdk/jdk/blob/master/test/hotspot/jtreg/compiler/intrinsics/string/TestStringCompareToDifferentLength.java
>
> TatWai Chong has updated the pull request incrementally with one additional 
> commit since the last revision:
> 
>   Restore the removal of vtmp3 (=V2) as it is still used by the non-SVE 
> compare-long-strings stub.
>   
>   And remove the assertion in `string_compare` since it won't help as the 
> registers
>   used in the stub are fixed.

Marked as reviewed by aph (Reviewer).

-

PR: https://git.openjdk.java.net/jdk/pull/5129


Re: RFR: 4511638: Double.toString(double) sometimes produces incorrect results [v2]

2021-09-28 Thread Andrew Haley
On Fri, 16 Apr 2021 11:30:32 GMT, Raffaello Giulietti 
 wrote:

>> Hello,
>> 
>> here's a PR for a patch submitted on March 2020 
>> [1](https://cr.openjdk.java.net/~bpb/4511638/webrev.04/) when Mercurial was 
>> a thing.
>> 
>> The patch has been edited to adhere to OpenJDK code conventions about 
>> multi-line (block) comments. Nothing in the code proper has changed, except 
>> for the addition of redundant but clarifying parentheses in some expressions.
>> 
>> 
>> Greetings
>> Raffaello
>
> Raffaello Giulietti has updated the pull request incrementally with one 
> additional commit since the last revision:
> 
>   4511638: Double.toString(double) sometimes produces incorrect results

My turn to keep this PR alive. If we'd had any actual face-to-face 
contributors' meetings this year I would have raised this stuck PR as an issue 
with our processes.

-

PR: https://git.openjdk.java.net/jdk/pull/3402


Re: RFR: 8268231: Aarch64: Use ldp in intrinsics for String.compareTo [v6]

2021-09-20 Thread Andrew Haley
On Mon, 20 Sep 2021 09:52:45 GMT, Andrew Dinn  wrote:

>> Andrew, can you help us to approve this?
>
> I agree with Andrew Haley that this patch is not going to make an improvement 
> for anything but a very small number of applications. Processing of strings 
> over a few 10s of bytes is rare. On the other hand this doesn't seem to cause 
> any performance drop for the much more common case of processing short 
> strings. so it does no harm. Also, the new and old code are much the same in 
> terms of complexity so that is no reason to prefer one over the other. The 
> only real concern I have is that any change involves the risk of error and 
> the ratio of cases that might benefit to cases that might suffer from an 
> error is very low. I don't think that's a reason to avoid pushing this patch 
> upstream but it does suggest that we should not backport it.

OK, thanks. That seems like a sensible compromise.

-

PR: https://git.openjdk.java.net/jdk/pull/4722


Re: RFR: 8268231: Aarch64: Use ldp in intrinsics for String.compareTo [v7]

2021-09-20 Thread Andrew Haley
On Mon, 30 Aug 2021 06:26:01 GMT, Wang Huang  wrote:

>> Dear all, 
>> Can you do me a favor to review this patch. This patch use `ldp` to 
>> implement String.compareTo.
>>
>> * We add a JMH test case 
>> * Here is the result of this test case
>>  
>> Benchmark   |(size)| Mode| Cnt|Score | Error  |Units 
>> -|--|-||--||-
>> StringCompare.compareLL   |  64  | avgt| 5  |7.992 | ±   0.005|us/op
>> StringCompare.compareLL   |  72  | avgt| 5  |15.029| ±   0.006|us/op
>> StringCompare.compareLL   |  80  | avgt| 5  |14.655| ±   0.011|us/op
>> StringCompare.compareLL   |  91  | avgt| 5  |16.363| ±   0.12 |us/op
>> StringCompare.compareLL   |  101 | avgt| 5  |16.966| ±   0.007|us/op
>> StringCompare.compareLL   |  121 | avgt| 5  |19.276| ±   0.006|us/op
>> StringCompare.compareLL   |  181 | avgt| 5  |19.002| ±   0.417|us/op
>> StringCompare.compareLL   |  256 | avgt| 5  |24.707| ±   0.041|us/op
>> StringCompare.compareLLWithLdp|  64  | avgt| 5  |8.001 | ±   0.121|us/op
>> StringCompare.compareLLWithLdp|  72  | avgt| 5  |11.573| ±   0.003|us/op
>> StringCompare.compareLLWithLdp|  80  | avgt| 5  |6.861 | ±   0.004|us/op
>> StringCompare.compareLLWithLdp|  91  | avgt| 5  |12.774| ±   0.201|us/op
>> StringCompare.compareLLWithLdp|  101 | avgt| 5  |8.691 | ±   0.004|us/op
>> StringCompare.compareLLWithLdp|  121 | avgt| 5  |11.091| ±   1.342|us/op
>> StringCompare.compareLLWithLdp|  181 | avgt| 5  |14.64 | ±   0.581|us/op
>> StringCompare.compareLLWithLdp|  256 | avgt| 5  |25.879| ±   1.775|us/op
>> StringCompare.compareUU   |  64  | avgt| 5  |13.476| ±   0.01 |us/op
>> StringCompare.compareUU   |  72  | avgt| 5  |15.078| ±   0.006|us/op
>> StringCompare.compareUU   |  80  | avgt| 5  |23.512| ±   0.011|us/op
>> StringCompare.compareUU   |  91  | avgt| 5  |24.284| ±   0.008|us/op
>> StringCompare.compareUU   |  101 | avgt| 5  |20.707| ±   0.017|us/op
>> StringCompare.compareUU   |  121 | avgt| 5  |29.302| ±   0.011|us/op
>> StringCompare.compareUU   |  181 | avgt| 5  |39.31 | ±   0.016|us/op
>> StringCompare.compareUU   |  256 | avgt| 5  |54.592| ±   0.392|us/op
>> StringCompare.compareUUWithLdp|  64  | avgt| 5  |16.389| ±   0.008|us/op
>> StringCompare.compareUUWithLdp|  72  | avgt| 5  |10.71 | ±   0.158|us/op
>> StringCompare.compareUUWithLdp|  80  | avgt| 5  |11.488| ±   0.024|us/op
>> StringCompare.compareUUWithLdp|  91  | avgt| 5  |13.412| ±   0.006|us/op
>> StringCompare.compareUUWithLdp|  101 | avgt| 5  |16.245| ±   0.434|us/op
>> StringCompare.compareUUWithLdp|  121 | avgt| 5  |16.597| ±   0.016|us/op
>> StringCompare.compareUUWithLdp|  181 | avgt| 5  |27.373| ±   0.017|us/op
>> StringCompare.compareUUWithLdp|  256 | avgt| 5  |41.74 | ±   3.5  |us/op
>> 
>> From this table, we can see that in most cases, our patch is better than old 
>> one.
>> 
>> Thank you for your review. Any suggestions are welcome.
>
> Wang Huang has updated the pull request incrementally with one additional 
> commit since the last revision:
> 
>   fix windows build failed

Marked as reviewed by aph (Reviewer).

-

PR: https://git.openjdk.java.net/jdk/pull/4722


Re: better random numbers

2021-09-05 Thread Andrew Haley
On 9/3/21 12:35 AM, John Rose wrote:

> The reference I’d like to give here is to Dr. Melissa O’Neill’s
> website and articles:

I'm quite sceptical. Anyone who says a (non-cryptographic) random-
number generator is "hard to predict" is either quite naive or in a
state of sin, (;-) and while O’Neill’s argument seems sound, it
doesn't seem to have convinced the academic world.

Lemire is thoughtful:
https://lemire.me/blog/2017/08/15/on-melissa-oneills-pcg-random-number-generator/

I wonder about AES, which can do (on Apple M1) 2 parallel rounds per
clock cycle. I'm quite tempted to try a reduced-round AES on the
TestU01 statistical tests. Maybe 6 rounds? However, there can be a
long latency between FPU and integer CPU, so perhaps it's not such a
great idea. Also, you have to load the key registers before you can
generate a random number, so it only really works if you want to
generate a lot of bits at a time. But it is maybe 128 randomish bits
per a few clock cycles.
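
For the shape of the idea only, a sketch using the standard javax.crypto API
follows. This is full-round AES in CTR mode, so it says nothing about the speed
or statistical quality of a reduced-round variant; it only shows the "key once,
then 128 bits per block" pattern.

import javax.crypto.Cipher;
import javax.crypto.spec.IvParameterSpec;
import javax.crypto.spec.SecretKeySpec;

class AesCtrRandomSketch {
    private final Cipher cipher;
    private final byte[] zeros = new byte[16];

    AesCtrRandomSketch(byte[] key16, byte[] iv16) throws Exception {
        cipher = Cipher.getInstance("AES/CTR/NoPadding");
        cipher.init(Cipher.ENCRYPT_MODE,
                    new SecretKeySpec(key16, "AES"),
                    new IvParameterSpec(iv16));
    }

    // Returns the next 128 bits of keystream: encrypting zeros yields it directly.
    byte[] next128() throws Exception {
        return cipher.update(zeros);
    }
}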

-- 
Andrew Haley  (he/him)
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
https://keybase.io/andrewhaley
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671



Re: RFR: 8268231: Aarch64: Use ldp in intrinsics for String.compareTo [v6]

2021-09-05 Thread Andrew Haley
On Thu, 26 Aug 2021 09:26:24 GMT, Wu Yan  wrote:

>> src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 4871:
>> 
>>> 4869: // exit from large loop when less than 64 bytes left to read or 
>>> we're about
>>> 4870: // to prefetch memory behind array border
>>> 4871: int largeLoopExitCondition = MAX(64, 
>>> SoftwarePrefetchHintDistance)/(isLL ? 1 : 2);
>> 
>> This breaks the Windows AArch64 build:
>> 
>> 
>> Creating support/modules_libs/java.base/server/jvm.dll from 1051 file(s)
>> d:\a\jdk\jdk\jdk\src\hotspot\cpu\aarch64\stubGenerator_aarch64.cpp(4871): 
>> error C3861: 'MAX': identifier not found
>> make[3]: *** [lib/CompileJvm.gmk:143: 
>> /cygdrive/d/a/jdk/jdk/jdk/build/windows-aarch64/hotspot/variant-server/libjvm
>> 
>> 
>> https://github.com/Wanghuang-Huawei/jdk/runs/3260986937
>> 
>> Should probably be left as `MAX2`.
>
> Thanks, I'll fix it.

It's fine. I don't think it'll affect any real programs, so it's rather 
pointless. I don't know if that's any reason not to approve it.

-

PR: https://git.openjdk.java.net/jdk/pull/4722


Re: RFR: 8268231: Aarch64: Use ldp in intrinsics for String.compareTo [v4]

2021-07-30 Thread Andrew Haley
On 7/30/21 7:49 AM, Wu Yan wrote:

> I aggree. This is the compromise solution that the optimization
> has no effect (or even slowdown) on some platforms.
> In addition, I found that in
> [JDK-8202326](https://bugs.openjdk.java.net/browse/JDK-8202326),
> adding prefetches is only for long strings (the rare cases),
> maybe we can further optimize longs string with LDP. So should
> I continue this optimization or close it.

IMO, we don't want to be using the vector unit unless it does
some good, and if you can do this sort of thing in the CPU core
you should, so I like that. I was (still am) tempted to approve
it, but Nick says there are still bugs in corner cases.

I think you should probably close it. Comparison of really long
Strings is so rare that I can't find any examples of where it
actually happens. Array comparisons, sure, but Strings, not so
much.

-- 
Andrew Haley  (he/him)
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
https://keybase.io/andrewhaley
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671



Re: RFR: 8268231: Aarch64: Use ldp in intrinsics for String.compareTo [v4]

2021-07-28 Thread Andrew Haley
On Wed, 28 Jul 2021 08:25:08 GMT, Nick Gasson  wrote:

> I don't think we want to keep two copies of the compareTo intrinsic. If there 
> are no cases where the LDP version is worse than the original version then we 
> should just delete the old one and replace it with this.

I agree. The trouble is, what does "worse" mean? I'm looking at SDEN-1982442, 
Neoverse N2 errata, 2001293, and I see that LDP has to be slowed down on 
streaming workloads, which will affect this.

The trouble with this patch is that it (probably) makes things better for long 
strings, which are very rare. What we actually need to care about is 
performance for a large number of typical-sized strings, which are names, 
identifiers, passwords, and so on: about 10-30 characters.

-

PR: https://git.openjdk.java.net/jdk/pull/4722


Re: RFR: 8268231: Aarch64: Use ldp in intrinsics for String.compareTo [v2]

2021-07-12 Thread Andrew Haley
On Mon, 12 Jul 2021 09:14:25 GMT, Wang Huang  wrote:

>> Dear all, 
>> Can you do me a favor to review this patch. This patch use `ldp` to 
>> implement String.compareTo.
>>
>> * We add a JMH test case 
>> * Here is the result of this test case
>>  
>> Benchmark   |(size)| Mode| Cnt|Score | Error  |Units 
>> -|--|-||--||-
>> StringCompare.compareLL   |  64  | avgt| 5  |7.992 | ±   0.005|us/op
>> StringCompare.compareLL   |  72  | avgt| 5  |15.029| ±   0.006|us/op
>> StringCompare.compareLL   |  80  | avgt| 5  |14.655| ±   0.011|us/op
>> StringCompare.compareLL   |  91  | avgt| 5  |16.363| ±   0.12 |us/op
>> StringCompare.compareLL   |  101 | avgt| 5  |16.966| ±   0.007|us/op
>> StringCompare.compareLL   |  121 | avgt| 5  |19.276| ±   0.006|us/op
>> StringCompare.compareLL   |  181 | avgt| 5  |19.002| ±   0.417|us/op
>> StringCompare.compareLL   |  256 | avgt| 5  |24.707| ±   0.041|us/op
>> StringCompare.compareLLWithLdp|  64  | avgt| 5  |8.001 | ±   0.121|us/op
>> StringCompare.compareLLWithLdp|  72  | avgt| 5  |11.573| ±   0.003|us/op
>> StringCompare.compareLLWithLdp|  80  | avgt| 5  |6.861 | ±   0.004|us/op
>> StringCompare.compareLLWithLdp|  91  | avgt| 5  |12.774| ±   0.201|us/op
>> StringCompare.compareLLWithLdp|  101 | avgt| 5  |8.691 | ±   0.004|us/op
>> StringCompare.compareLLWithLdp|  121 | avgt| 5  |11.091| ±   1.342|us/op
>> StringCompare.compareLLWithLdp|  181 | avgt| 5  |14.64 | ±   0.581|us/op
>> StringCompare.compareLLWithLdp|  256 | avgt| 5  |25.879| ±   1.775|us/op
>> StringCompare.compareUU   |  64  | avgt| 5  |13.476| ±   0.01 |us/op
>> StringCompare.compareUU   |  72  | avgt| 5  |15.078| ±   0.006|us/op
>> StringCompare.compareUU   |  80  | avgt| 5  |23.512| ±   0.011|us/op
>> StringCompare.compareUU   |  91  | avgt| 5  |24.284| ±   0.008|us/op
>> StringCompare.compareUU   |  101 | avgt| 5  |20.707| ±   0.017|us/op
>> StringCompare.compareUU   |  121 | avgt| 5  |29.302| ±   0.011|us/op
>> StringCompare.compareUU   |  181 | avgt| 5  |39.31 | ±   0.016|us/op
>> StringCompare.compareUU   |  256 | avgt| 5  |54.592| ±   0.392|us/op
>> StringCompare.compareUUWithLdp|  64  | avgt| 5  |16.389| ±   0.008|us/op
>> StringCompare.compareUUWithLdp|  72  | avgt| 5  |10.71 | ±   0.158|us/op
>> StringCompare.compareUUWithLdp|  80  | avgt| 5  |11.488| ±   0.024|us/op
>> StringCompare.compareUUWithLdp|  91  | avgt| 5  |13.412| ±   0.006|us/op
>> StringCompare.compareUUWithLdp|  101 | avgt| 5  |16.245| ±   0.434|us/op
>> StringCompare.compareUUWithLdp|  121 | avgt| 5  |16.597| ±   0.016|us/op
>> StringCompare.compareUUWithLdp|  181 | avgt| 5  |27.373| ±   0.017|us/op
>> StringCompare.compareUUWithLdp|  256 | avgt| 5  |41.74 | ±   3.5  |us/op
>> 
>> From this table, we can see that in most cases, our patch is better than old 
>> one.
>> 
>> Thank you for your review. Any suggestions are welcome.
>
> Wang Huang has updated the pull request incrementally with one additional 
> commit since the last revision:
> 
>   draft of refactor

And with longer strings, M1 and ThunderX2:


Benchmark                                       (diff_pos)  (size)  Mode  Cnt   Score   Error  Units
StringCompare.compareLLDiffStrings                    1023    1024  avgt    3  50.849 ± 0.087  us/op
StringCompare.compareLLDiffStringsWithLdp             1023    1024  avgt    3  23.676 ± 0.015  us/op
StringCompare.compareLLDiffStringsWithRefactor        1023    1024  avgt    3  28.967 ± 0.168  us/op

StringCompare.compareLLDiffStrings                    1023    1024  avgt    3  98.681 ± 0.026  us/op
StringCompare.compareLLDiffStringsWithLdp             1023    1024  avgt    3  82.576 ± 0.656  us/op
StringCompare.compareLLDiffStringsWithRefactor        1023    1024  avgt    3  98.801 ± 0.321  us/op

LDP wins on M1 here, but on ThunderX2 it makes almost no difference at all. And 
how often are we comparing such long strings?
I don't know what to think, really. It seems that we're near to a place where 
we're optimizing for micro-architecture, and I don't want to see that here. On 
the other hand, using LDP is not worse anywhere, so we should allow it.

-

PR: https://git.openjdk.java.net/jdk/pull/4722


Re: RFR: 8268231: Aarch64: Use ldp in intrinsics for String.compareTo [v2]

2021-07-12 Thread Andrew Haley
On Mon, 12 Jul 2021 09:14:25 GMT, Wang Huang  wrote:

>> Dear all, 
>> Can you do me a favor to review this patch. This patch use `ldp` to 
>> implement String.compareTo.
>>
>> * We add a JMH test case 
>> * Here is the result of this test case
>>  
>> Benchmark   |(size)| Mode| Cnt|Score | Error  |Units 
>> -|--|-||--||-
>> StringCompare.compareLL   |  64  | avgt| 5  |7.992 | ±   0.005|us/op
>> StringCompare.compareLL   |  72  | avgt| 5  |15.029| ±   0.006|us/op
>> StringCompare.compareLL   |  80  | avgt| 5  |14.655| ±   0.011|us/op
>> StringCompare.compareLL   |  91  | avgt| 5  |16.363| ±   0.12 |us/op
>> StringCompare.compareLL   |  101 | avgt| 5  |16.966| ±   0.007|us/op
>> StringCompare.compareLL   |  121 | avgt| 5  |19.276| ±   0.006|us/op
>> StringCompare.compareLL   |  181 | avgt| 5  |19.002| ±   0.417|us/op
>> StringCompare.compareLL   |  256 | avgt| 5  |24.707| ±   0.041|us/op
>> StringCompare.compareLLWithLdp|  64  | avgt| 5  |8.001 | ±   0.121|us/op
>> StringCompare.compareLLWithLdp|  72  | avgt| 5  |11.573| ±   0.003|us/op
>> StringCompare.compareLLWithLdp|  80  | avgt| 5  |6.861 | ±   0.004|us/op
>> StringCompare.compareLLWithLdp|  91  | avgt| 5  |12.774| ±   0.201|us/op
>> StringCompare.compareLLWithLdp|  101 | avgt| 5  |8.691 | ±   0.004|us/op
>> StringCompare.compareLLWithLdp|  121 | avgt| 5  |11.091| ±   1.342|us/op
>> StringCompare.compareLLWithLdp|  181 | avgt| 5  |14.64 | ±   0.581|us/op
>> StringCompare.compareLLWithLdp|  256 | avgt| 5  |25.879| ±   1.775|us/op
>> StringCompare.compareUU   |  64  | avgt| 5  |13.476| ±   0.01 |us/op
>> StringCompare.compareUU   |  72  | avgt| 5  |15.078| ±   0.006|us/op
>> StringCompare.compareUU   |  80  | avgt| 5  |23.512| ±   0.011|us/op
>> StringCompare.compareUU   |  91  | avgt| 5  |24.284| ±   0.008|us/op
>> StringCompare.compareUU   |  101 | avgt| 5  |20.707| ±   0.017|us/op
>> StringCompare.compareUU   |  121 | avgt| 5  |29.302| ±   0.011|us/op
>> StringCompare.compareUU   |  181 | avgt| 5  |39.31 | ±   0.016|us/op
>> StringCompare.compareUU   |  256 | avgt| 5  |54.592| ±   0.392|us/op
>> StringCompare.compareUUWithLdp|  64  | avgt| 5  |16.389| ±   0.008|us/op
>> StringCompare.compareUUWithLdp|  72  | avgt| 5  |10.71 | ±   0.158|us/op
>> StringCompare.compareUUWithLdp|  80  | avgt| 5  |11.488| ±   0.024|us/op
>> StringCompare.compareUUWithLdp|  91  | avgt| 5  |13.412| ±   0.006|us/op
>> StringCompare.compareUUWithLdp|  101 | avgt| 5  |16.245| ±   0.434|us/op
>> StringCompare.compareUUWithLdp|  121 | avgt| 5  |16.597| ±   0.016|us/op
>> StringCompare.compareUUWithLdp|  181 | avgt| 5  |27.373| ±   0.017|us/op
>> StringCompare.compareUUWithLdp|  256 | avgt| 5  |41.74 | ±   3.5  |us/op
>> 
>> From this table, we can see that in most cases, our patch is better than old 
>> one.
>> 
>> Thank you for your review. Any suggestions are welcome.
>
> Wang Huang has updated the pull request incrementally with one additional 
> commit since the last revision:
> 
>   draft of refactor

Two machines to compare here, Apple M1 and ThunderX2:


Benchmark                                       (diff_pos)  (size)  Mode  Cnt  Score   Error  Units
StringCompare.compareLLDiffStrings                       7     128  avgt    3  2.194 ± 0.001  us/op
StringCompare.compareLLDiffStrings                      15     128  avgt    3  2.195 ± 0.018  us/op
StringCompare.compareLLDiffStrings                      31     128  avgt    3  2.508 ± 0.003  us/op
StringCompare.compareLLDiffStrings                      47     128  avgt    3  2.821 ± 0.001  us/op
StringCompare.compareLLDiffStrings                      63     128  avgt    3  3.446 ± 0.003  us/op
StringCompare.compareLLDiffStringsWithLdp                7     128  avgt    3  2.194 ± 0.001  us/op
StringCompare.compareLLDiffStringsWithLdp               15     128  avgt    3  2.195 ± 0.001  us/op
StringCompare.compareLLDiffStringsWithLdp               31     128  avgt    3  2.508 ± 0.001  us/op
StringCompare.compareLLDiffStringsWithLdp               47     128  avgt    3  2.510 ± 0.006  us/op
StringCompare.compareLLDiffStringsWithLdp               63     128  avgt    3  2.824 ± 0.003  us/op
StringCompare.compareLLDiffStringsWithRefactor           7     128  avgt    3  1.882 ± 0.018  us/op
StringCompare.compareLLDiffStringsWithRefactor          15     128  avgt    3  2.019 ± 0.002  us/op
StringCompare.compareLLDiffStringsWithRefactor          31     128  avgt    3  2.355 ± 0.003  us/op
StringCompare.compareLLDiffStringsWithRefactor          47     128  avgt    3  2.821 ± 0.010  us/op
StringCompare.compareLLDiffStringsWithRefactor          63     128  avgt    3  3.135 ± 0.002  us/op


Benchmark                                       (diff_pos)  (size)  Mode  Cnt   Score   Error  Units
StringCompare.compareLLDiffStrings  

Re: RFR: 8268231: Aarch64: Use ldp in intrinsics for String.compareTo

2021-07-09 Thread Andrew Haley
On Thu, 8 Jul 2021 11:50:36 GMT, Wang Huang  wrote:

> Dear all, 
> Can you do me a favor to review this patch. This patch use `ldp` to 
> implement String.compareTo.
>
> * We add a JMH test case 
> * Here is the result of this test case
>  
> Benchmark|(size)| Mode| Cnt|Score | Error  |Units 
> -|--|-||--||-
> StringCompare.compareLL   |  64  | avgt| 5  |7.992 | ±0.005|us/op
> StringCompare.compareLL   |  72  | avgt| 5  |15.029| ±0.006|us/op
> StringCompare.compareLL   |  80  | avgt| 5  |14.655| ±0.011|us/op
> StringCompare.compareLL   |  91  | avgt| 5  |16.363| ±0.12 |us/op
> StringCompare.compareLL   |  101 | avgt| 5  |16.966| ±0.007|us/op
> StringCompare.compareLL   |  121 | avgt| 5  |19.276| ±0.006|us/op
> StringCompare.compareLL   |  181 | avgt| 5  |19.002| ±0.417|us/op
> StringCompare.compareLL   |  256 | avgt| 5  |24.707| ±0.041|us/op
> StringCompare.compareLLWithLdp|  64  | avgt| 5  |8.001 | ±0.121|us/op
> StringCompare.compareLLWithLdp|  72  | avgt| 5  |11.573| ±0.003|us/op
> StringCompare.compareLLWithLdp|  80  | avgt| 5  |6.861 | ±0.004|us/op
> StringCompare.compareLLWithLdp|  91  | avgt| 5  |12.774| ±0.201|us/op
> StringCompare.compareLLWithLdp|  101 | avgt| 5  |8.691 | ±0.004|us/op
> StringCompare.compareLLWithLdp|  121 | avgt| 5  |11.091| ±1.342|us/op
> StringCompare.compareLLWithLdp|  181 | avgt| 5  |14.64 | ±0.581|us/op
> StringCompare.compareLLWithLdp|  256 | avgt| 5  |25.879| ±1.775|us/op
> StringCompare.compareUU   |  64  | avgt| 5  |13.476| ±0.01 |us/op
> StringCompare.compareUU   |  72  | avgt| 5  |15.078| ±0.006|us/op
> StringCompare.compareUU   |  80  | avgt| 5  |23.512| ±0.011|us/op
> StringCompare.compareUU   |  91  | avgt| 5  |24.284| ±0.008|us/op
> StringCompare.compareUU   |  101 | avgt| 5  |20.707| ±0.017|us/op
> StringCompare.compareUU   |  121 | avgt| 5  |29.302| ±0.011|us/op
> StringCompare.compareUU   |  181 | avgt| 5  |39.31 | ±0.016|us/op
> StringCompare.compareUU   |  256 | avgt| 5  |54.592| ±0.392|us/op
> StringCompare.compareUUWithLdp|  64  | avgt| 5  |16.389| ±0.008|us/op
> StringCompare.compareUUWithLdp|  72  | avgt| 5  |10.71 | ±0.158|us/op
> StringCompare.compareUUWithLdp|  80  | avgt| 5  |11.488| ±0.024|us/op
> StringCompare.compareUUWithLdp|  91  | avgt| 5  |13.412| ±0.006|us/op
> StringCompare.compareUUWithLdp|  101 | avgt| 5  |16.245| ±0.434|us/op
> StringCompare.compareUUWithLdp|  121 | avgt| 5  |16.597| ±0.016|us/op
> StringCompare.compareUUWithLdp|  181 | avgt| 5  |27.373| ±0.017|us/op
> StringCompare.compareUUWithLdp|  256 | avgt| 5  |41.74 | ±3.5  |us/op
> 
> From this table, we can see that in most cases, our patch is better than old 
> one.
> 
> Thank you for your review. Any suggestions are welcome.

I'm quite tempted to approve this. It looks generally better, simpler, and 
easier to understand than what we have today. However, the improvement isn't 
great, and I suspect it is mostly because of the reduction in traffic between Base 
and Vector registers.
What happens if you rewrite `compare_string_16_bytes_same()` to use `ldp` ?

-

PR: https://git.openjdk.java.net/jdk/pull/4722


Re: RFR: 6506405: Math.abs(float) is slow [v2]

2021-07-08 Thread Andrew Haley
On Thu, 8 Jul 2021 09:43:35 GMT, Andrew Haley  wrote:

> Moves between GPRs and FPRs are often slow. There's a 10-cycle latency on 
> some AArch64, so we avoid it whenever we can. Mind you, we don't care about 
> this patch because we always generate FABS from an intrinsic anyway.

For avoidance of doubt, that's the round-trip latency.

-

PR: https://git.openjdk.java.net/jdk/pull/4711


Re: RFR: 6506405: Math.abs(float) is slow [v2]

2021-07-08 Thread Andrew Haley
On Thu, 8 Jul 2021 01:05:16 GMT, Brian Burkhalter  wrote:

>> Please consider this change to make the `float` and `double` versions of 
>> `java.lang.Math.abs()` branch-free.
>
> Brian Burkhalter has updated the pull request incrementally with one 
> additional commit since the last revision:
> 
>   6506405: Add comments, use new consts for masking

Moves between GPRs and FPRs are often slow. There's a 10-cycle latency on some 
AArch64, so we avoid it whenever we can. Mind you, we don't care about this 
patch because we always generate FABS from an intrinsic anyway.

-

PR: https://git.openjdk.java.net/jdk/pull/4711


Re: RFR: 8214761: Bug in parallel Kahan summation implementation

2021-07-02 Thread Andrew Haley
On Fri, 2 Jul 2021 20:12:39 GMT, Ian Graves  wrote:

> 8214761: Bug in parallel Kahan summation implementation

Crikey, how did we get that wrong?
It'd be nice if we had a regression test for this. Can you provide one, please?

-

PR: https://git.openjdk.java.net/jdk/pull/4674


Re: RFR: 8188044: We need Math.unsignedMultiplyHigh [v4]

2021-07-02 Thread Andrew Haley
On Fri, 2 Jul 2021 16:37:21 GMT, Brian Burkhalter  wrote:

>> Please consider this proposal to add a method 
>> `unsignedMultiplyHigh(long,long)` to each of `java.lang.Math` and 
>> `java.lang.StrictMath`.
>
> Brian Burkhalter has refreshed the contents of this pull request, and 
> previous commits have been removed. The incremental views will show 
> differences compared to the previous content of the PR. The pull request 
> contains one new commit since the last revision:
> 
>   8188044: Use multiplyHigh() to leverage the intrinsic

Marked as reviewed by aph (Reviewer).

-

PR: https://git.openjdk.java.net/jdk/pull/4644


Re: RFR: 8188044: We need Math.unsignedMultiplyHigh [v2]

2021-07-02 Thread Andrew Haley
On 7/2/21 4:30 PM, Raffaello Giulietti wrote:
> FWIW, adinn's branchless code together with
> PR https://git.openjdk.java.net/jdk/pull/4660
> make both methods less vulnerable to timing attacks.

I doubt it, because HotSpot generates conditional select instructions for
the popular systems, at least for C2.

I guess it might on a C1-only or pure interpreter system. That certainly
might be an argument for using the shift-and-add magic. With a comment
that would be fine, as the difference in performance doesn't seem to be
worth worrying about. We're looking at sub-nanosecond throughput.

-- 
Andrew Haley  (he/him)
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
https://keybase.io/andrewhaley
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671



Re: RFR: 8188044: We need Math.unsignedMultiplyHigh [v2]

2021-07-02 Thread Andrew Haley
On 7/2/21 12:26 PM, Raffaello Giulietti wrote:
> ... or even as a one liner, like in the test
> 
>   return Math.multiplyHigh(x, y) + ((x >> (Long.SIZE - 1)) & y) + ((y >> 
> (Long.SIZE - 1)) & x);

It was hard to write, so it should be hard to read too! :-)



Re: RFR: 8188044: We need Math.unsignedMultiplyHigh [v2]

2021-07-02 Thread Andrew Haley
On Fri, 2 Jul 2021 13:47:40 GMT, Andrew Haley  wrote:

>> You can also do that branchlessly which might prove better
>> 
>>  long result = Math.multiplyHigh(x, y);
>>  result += (y & (x >> 63));
>>  result += (x & (y >> 63));
>>  return result;
>
>> You can also do that branchlessly which might prove better
>> 
>> ```
>>  long result = Math.multiplyHigh(x, y);
>>  result += (y & (x >> 63));
>>  result += (x & (y >> 63));
>>  return result;
>> ```
> I doubt very much that it would be better, because these days branch 
> prediction is excellent, and we also have conditional select instructions. 
> Exposing the condition helps C2 to eliminate it if the range of args is 
> known. The `if` code is easier to understand.
> 
> Benchmark results, with one of the operands changing signs every iteration, 
> 1000 iterations:
> 
> 
> Benchmark  Mode  Cnt ScoreError  Units
> MulHiTest.mulHiTest1   (aph) avgt3  1570.587 ± 16.602  ns/op
> MulHiTest.mulHiTest2   (adinn)   avgt3  2237.637 ±  4.740  ns/op
> 
> In any case, note that with this optimization the unsigned mulHi is in the 
> nanosecond range, so Good Enough. IMO.

But weirdly, it's the other way around on AArch64, but there's little in it:


Benchmark Mode  Cnt Score   Error  Units
MulHiTest.mulHiTest1  avgt3  1492.108 ± 0.301  ns/op
MulHiTest.mulHiTest2  avgt3  1219.521 ± 1.516  ns/op


but this is only in the case where we have unpredictable branches. Go with 
simple and easy to understand; it doesn't much matter.

-

PR: https://git.openjdk.java.net/jdk/pull/4644


Re: RFR: 8188044: We need Math.unsignedMultiplyHigh [v2]

2021-07-02 Thread Andrew Haley
On Fri, 2 Jul 2021 11:06:06 GMT, Andrew Dinn  wrote:

> You can also do that branchlessly which might prove better
> 
> ```
>  long result = Math.multiplyHigh(x, y);
>  result += (y & (x >> 63));
>  result += (x & (y >> 63));
>  return result;
> ```
I doubt very much that it would be better, because these days branch prediction 
is excellent, and we also have conditional select instructions. Exposing the 
condition helps C2 to eliminate it if the range of args is known. The `if` code 
is easier to understand.

Benchmark results, with one of the operands changing signs every iteration, 
1000 iterations:


Benchmark  Mode  Cnt ScoreError  Units
MulHiTest.mulHiTest1   (aph) avgt3  1570.587 ± 16.602  ns/op
MulHiTest.mulHiTest2   (adinn)   avgt3  2237.637 ±  4.740  ns/op

In any case, note that with this optimization the unsigned mulHi is in the 
nanosecond range, so Good Enough. IMO.

-

PR: https://git.openjdk.java.net/jdk/pull/4644


Re: RFR: 8188044: We need Math.unsignedMultiplyHigh [v2]

2021-07-02 Thread Andrew Haley
On Fri, 2 Jul 2021 09:37:47 GMT, Andrew Haley  wrote:

>> Brian Burkhalter has updated the pull request incrementally with one 
>> additional commit since the last revision:
>> 
>>   8188044: Add @see links between multiplyHigh() and unsignedMultiplyHigh().
>
> src/java.base/share/classes/java/lang/Math.java line 1211:
> 
>> 1209: long z1 = t >>> 32;
>> 1210: 
>> 1211: return x1 * y1 + z1 + (z0 >>> 32);
> 
> Suggestion:
> 
> long result = Math.multiplyHigh(x, y);
> if (x < 0) result += y;
> if (y < 0) result += x;
> return result;

This is just subtracting the 2's-complement offset. I guess the idea, longer 
term, is that this be an intrinsic anyway, but if you do `unsignedMultiplyHigh` 
this way you'll utilize the existing `multiplyHigh` intrinsic on all platforms 
that have it.
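
A quick way to convince oneself of that identity (purely illustrative, not part
of the patch): interpreting a negative long x as unsigned adds 2^64, so the
unsigned 128-bit product gains an extra y<<64 when x < 0 and an extra x<<64
when y < 0, which is exactly the pair of conditional additions to the high
word. A BigInteger reference makes it easy to check against:

static long unsignedMultiplyHighReference(long x, long y) {
    // Exact unsigned 128-bit product, then take the high 64 bits.
    java.math.BigInteger ux = new java.math.BigInteger(Long.toUnsignedString(x));
    java.math.BigInteger uy = new java.math.BigInteger(Long.toUnsignedString(y));
    return ux.multiply(uy).shiftRight(64).longValue();
}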

-

PR: https://git.openjdk.java.net/jdk/pull/4644


Re: RFR: 8188044: We need Math.unsignedMultiplyHigh [v2]

2021-07-02 Thread Andrew Haley
On Thu, 1 Jul 2021 18:08:22 GMT, Brian Burkhalter  wrote:

>> Please consider this proposal to add a method 
>> `unsignedMultiplyHigh(long,long)` to each of `java.lang.Math` and 
>> `java.lang.StrictMath`.
>
> Brian Burkhalter has updated the pull request incrementally with one 
> additional commit since the last revision:
> 
>   8188044: Add @see links between multiplyHigh() and unsignedMultiplyHigh().

src/java.base/share/classes/java/lang/Math.java line 1211:

> 1209: long z1 = t >>> 32;
> 1210: 
> 1211: return x1 * y1 + z1 + (z0 >>> 32);

Suggestion:

long result = Math.multiplyHigh(x, y);
if (x < 0) result += y;
if (y < 0) result += x;
return result;

-

PR: https://git.openjdk.java.net/jdk/pull/4644


Re: RFR: 8265768 [aarch64] Use glibc libm impl for dlog, dlog10, dexp iff 2.29 or greater on AArch64.

2021-05-25 Thread Andrew Haley
Greg, what's your email address? Everything I try bounces.

-- 
Andrew Haley  (he/him)
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
https://keybase.io/andrewhaley
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671



Re: JEP draft: Scope Locals

2021-05-20 Thread Andrew Haley
On 5/19/21 10:51 PM, David Lloyd wrote:
> I think it would be really nice if the snapshot class could hold its
> run/call method rather than making it a static method of `ScopeLocal`. This
> reads more naturally to me (and is nicer to write):

True, but inheritance is *extremely* time-critical when creating a
ForkJoin task. In many cases there won't be any scope locals to inherit,
and by allowing null snapshots you save a significant fraction of a
nanosecond when creating one. And I know, this is hard to believe, but
such an overhead has a significant macroscopic effect on benchmarks.

However, I do intend to fix this in a different way, if I can.

-- 
Andrew Haley  (he/him)
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
https://keybase.io/andrewhaley
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671



Re: JEP draft: Scope Locals

2021-05-20 Thread Andrew Haley
On 5/19/21 9:55 PM, David Lloyd wrote:
> Turning this around, I would argue that there are few (or perhaps *no*)
> cases where it would ever be desirable to inherit scope locals across
> thread creation; in cases where this is explicitly desired, one can always
> resume the snapshot from within the thread's Runnable. Was there a
> particular use case this was meant to address?

Structured Concurrency, but also possibly anywhere that inheritable
thread locals are used now.

-- 
Andrew Haley  (he/him)
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
https://keybase.io/andrewhaley
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671



Re: JEP draft: Scope Locals

2021-05-20 Thread Andrew Haley
On 5/19/21 5:55 PM, Peter Levart wrote:
> In other words, non-inheritable bindings are 
> never transferred from thread to thread automatically or by 
> snapshot/runWithSnapshot. I can see that snapshot/runWithSnapshot was 
> meant as a mechanism to "simulate" inheritance of bindings when 
> execution is transferred from one thread to another which is not a newly 
> started child thread.

Yes. However, this part of the draft proposal is undergoing some revision,
and it looks like it will make more sense to control inheritance in a
different way, one that will permit more flexible control over what
gets inherited and when.

-- 
Andrew Haley  (he/him)
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
https://keybase.io/andrewhaley
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671



Re: JEP draft: Scope Locals

2021-05-20 Thread Andrew Haley
On 5/15/21 6:15 PM, Remi Forax wrote:
> I think the JEP should be more explicit about the shortcoming of ThreadLocal 
> and how the design of scope local fix both issues.

Yes. It's in progress.

-- 
Andrew Haley  (he/him)
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
https://keybase.io/andrewhaley
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671



Re: [External] : Re: JEP draft: Scope Locals

2021-05-19 Thread Andrew Haley
On 5/18/21 3:19 AM, Mike Rettig wrote:
> With the current proposal I can't simply flush and close at the
> end of the scope because there might be a child scope that is still active.

Yes, that's true. I think that what you have in mind is a more elaborate
mechanism than what is proposed here, which does no more than bind values
to names over a scope. There needs to be more discussion in the JEP of
what this proposal isn't intended to do, and how we might cope with
mutability of a scope local's values.

-- 
Andrew Haley  (he/him)
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
https://keybase.io/andrewhaley
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671



Re: JEP draft: Scope Locals

2021-05-19 Thread Andrew Haley
On 5/15/21 6:50 PM, Peter Levart wrote:
> What if I wanted to create and start a thread that would be "pristine" - 
> not have any ScopeLocal value bound? Is this intentionally not allowed 
> in order to make sure that inheritable ScopeLocal(s) can't be cleared by 
> code that has no access to ScopeLocal instance(s)?

That one is about to be changed by a revision to the JEP. There clearly
is a need to control whether a newly-created thread inherits scope locals
or not. For instance, an Executor might lazily create worker threads, and
we don't want them to inherit scope locals from the thread that was
running.

-- 
Andrew Haley  (he/him)
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
https://keybase.io/andrewhaley
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671



Re: JEP draft: Scope Locals

2021-05-15 Thread Andrew Haley
On 5/15/21 10:10 AM, Alex Otenko wrote:
> ScopeLocal.where(...).run(...).finally(...) would be nice.

Is that different from try ... run ... finally ?
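
Something like this, say, against the draft API (CONTEXT, ctx and doWork() are
placeholders, not names from the JEP):

ScopeLocal.where(CONTEXT, ctx)
          .run(() -> {
              try {
                  doWork();
              } finally {
                  ctx.close();   // cleanup runs when the bound scope exits
              }
          });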

-- 
Andrew Haley  (he/him)
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
https://keybase.io/andrewhaley
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671



Re: JEP draft: Scope Locals

2021-05-15 Thread Andrew Haley
On 5/14/21 8:19 PM, Mike Rettig wrote:
> I don't see a way to capture the lifecycle of the scoped variable. I think
> for framework designers it's going to be important to know the life cycle
> of the scoped variable especially with snapshotting and inheritance.  The
> lack of a defined life cycle for thread locals is one of the many reasons
> why it should be avoided.

This requirement sounds similar, but I think it's really a different idea.
Scope locals are only binding data to names and don't attempt to handle
lifetime issues. Apart from anything else, that would require a more
heavyweight mechanism than scope locals, which need to be very lightweight
for some applications.

Over at Loom we've been talking about lifetimes and Ron has some ideas, but
scope locals are not likely to be that thing.

-- 
Andrew Haley  (he/him)
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
https://keybase.io/andrewhaley
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671



Re: JEP draft: Scope Locals

2021-05-15 Thread Andrew Haley
On 5/15/21 12:28 AM, Nathan Reynolds wrote:
> What about cleanup of the scoped objects?  For example, how do I ensure
> close() or some cleanup method is called at the right time on the scoped
> object?  Do I need to use a Weak/Soft/PhantomReference and a ReferenceQueue
> to track the object?  Do I use Cleaner?  Does the object need to implement
> finalize()... hopefully not?

All a scope local does is bind a value to a name, in a scope. This isn't
about C++-style destructors, which have their merits but are a different
subject.

-- 
Andrew Haley  (he/him)
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
https://keybase.io/andrewhaley
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671



Re: JEP draft: Scope Locals

2021-05-14 Thread Andrew Haley
On 5/14/21 3:45 PM, Brian Goetz wrote:
> Where I think the JEP draft (the exposition, not the feature design) 
> could be improved is to make the relationship to TL more clear.  IMO, 
> this is a "more modern" TL design; it addresses the same primary use 
> cases, without the unconstrained mutability of TLs.  The safety that 
> comes from the immutability enables some new concurrent patterns (e.g., 
> structured concurrency) and some nice optimizations.

Thank you, that's a nice summary. I was trying to describe scope locals
in a stand-alone way, but everything that a scope local can do can also
be done by a TL, albeit with some clumsiness. I should be more explicit
about that.

-- 
Andrew Haley  (he/him)
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
https://keybase.io/andrewhaley
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671



Re: JEP draft: Scope Locals

2021-05-14 Thread Andrew Haley
On 5/13/21 4:59 PM, Alan Snyder wrote:
> 
> 
>> On May 13, 2021, at 2:03 AM, Andrew Haley  wrote:
>>
>> On 5/12/21 8:12 PM, Alan Snyder wrote:
>>> From the motivation section:
>>>
>>>> So you want to invoke a method |X| in a library that later calls back into 
>>>> your code. In your callback you need some context, perhaps a transaction 
>>>> ID or some |File| instances. However, |X| provides no way to pass a 
>>>> reference through their code into your callback. What are you to do?
>>>
>>> When I read this, my first thought was… pass a callback that is bound to 
>>> the context.
>>>
>>> I suppose the example you have in mind has a fixed callback.
>>
>> Fixed? It's a callback, of some form.
> 
> What I meant was that the callback has a larger scope than the method call. 
> Otherwise, you could bind the context to the callback object.

I don't know quite what you mean by "larger scope," but clearly, if you
can explicitly pass a callback created by a lambda, that's a more explicit
solution.

>>> Is this a common practice? Doesn’t it point to a deficiency in the library 
>>> API?
>>
>> Not necessarily. For example, say you want to give access to a particular
>> resource (a log stream, say) to trusted callees. Nobody AFAIK passes a log
>> stream as an explicit argument through business logic because that's just
>> too much clutter.
> 
> That sounds interesting, but I’m not following how scope locals would be used 
> to solve this problem.

In the outer scope you bind a logger scope local, and all transitive callees
can use that.

-- 
Andrew Haley  (he/him)
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
https://keybase.io/andrewhaley
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671



Re: JEP draft: Scope Locals

2021-05-13 Thread Andrew Haley
On 5/12/21 8:12 PM, Alan Snyder wrote:
> From the motivation section:
> 
>> So you want to invoke a method |X| in a library that later calls back into 
>> your code. In your callback you need some context, perhaps a transaction ID 
>> or some |File| instances. However, |X| provides no way to pass a reference 
>> through their code into your callback. What are you to do?
> 
> When I read this, my first thought was… pass a callback that is bound to the 
> context.
> 
> I suppose the example you have in mind has a fixed callback.

Fixed? It's a callback, of some form.

> Is this a common practice? Doesn’t it point to a deficiency in the library 
> API?

Not necessarily. For example, say you want to give access to a particular
resource (a log stream, say) to trusted callees. Nobody AFAIK passes a log
stream as an explicit argument through business logic because that's just
too much clutter.

> Is this feature just a workaround for poorly designed libraries, or are there 
> other examples?

It's not really about poor design as much as separation of concerns.
Intermediate libraries have no way to know what might need to be passed
to callees, and in many cases it's better isolation if they don't get to
find out. The latter is especially true of passing security permissions.

Also, there is evolution of libraries: with scope locals you don't need
to change library interfaces to add useful capabilities like logging.

-- 
Andrew Haley  (he/him)
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
https://keybase.io/andrewhaley
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671



JEP draft: Scope Locals

2021-05-12 Thread Andrew Haley
There's been considerable discussion about scope locals on the loom-dev list,
and it's now time to open this to a wider audience. This subject is important
because, although scope locals were motivated by the needs of Loom, they
have many potential applications outside that project.

The draft JEP is at

https://bugs.openjdk.java.net/browse/JDK-8263012

I've already received some very helpful suggestions for enhancements to
the API, and it'll take me a while to work through them all. In particular,
Paul Sandoz has suggested that I unify the classes Snapshot and Carrier,
and it will take me some time to understand the consequences of that.

In the meantime, please have a look at the JEP and comment here.


For reference, earlier discussions are at

https://mail.openjdk.java.net/pipermail/loom-dev/2021-March/002268.html
https://mail.openjdk.java.net/pipermail/loom-dev/2021-April/002287.html
https://mail.openjdk.java.net/pipermail/loom-dev/2021-May/002427.html

-- 
Andrew Haley  (he/him)
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
https://keybase.io/andrewhaley
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671




Re: RFR: 8265768 [aarch64] Use glibc libm impl for dlog, dlog10, dexp iff 2.29 or greater on AArch64.

2021-05-07 Thread Andrew Haley
On Thu, 29 Apr 2021 20:24:25 GMT, Carlos O'Donell 
 wrote:

> Where does the requirement for monotonicity come from?

Semi-monotonicity, to be precise. In the spec of java.lang.Math,

"Besides accuracy at individual arguments, maintaining proper relations between 
the method at different arguments is also important. Therefore, most methods 
with more than 0.5 ulp errors are required to be semi-monotonic: whenever the 
mathematical function is non-decreasing, so is the floating-point 
approximation, likewise, whenever the mathematical function is non-increasing, 
so is the floating-point approximation. Not all approximations that have 1 ulp 
accuracy will automatically meet the monotonicity requirements."

I wouldn't be surprised if the approximations we need in glibc meet this 
anyway. We just need to check.
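
The check itself is cheap. For a non-decreasing function such as log on
[1, +inf), something along these lines over a sample of arguments would do
(illustrative only, not an exhaustive test):

static boolean semiMonotonicAt(double x) {
    // Stepping the argument up by one ulp must not make the result decrease.
    return Math.log(Math.nextUp(x)) >= Math.log(x);
}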

-

PR: https://git.openjdk.java.net/jdk/pull/3510


Re: RFR: 8264805: Remove the experimental Ahead-of-Time Compiler [v4]

2021-04-09 Thread Andrew Haley
On Thu, 8 Apr 2021 17:24:38 GMT, Vladimir Kozlov  wrote:

>> As part of [JEP 410](http://openjdk.java.net/jeps/410) remove code related 
>> to Ahead-of-Time Compiler from JDK: 
>> 
>> - `jdk.aot` module — the `jaotc` tool 
>> - `src/hotspot/share/aot` — loads AoT compiled code into VM for execution 
>> - related HotSpot code guarded by `#if INCLUDE_AOT` 
>> 
>> Additionally, remove tests as well as code in makefiles related to AoT 
>> compilation.
>> 
>> Tested hs-tier1-4
>
> Vladimir Kozlov has updated the pull request incrementally with one 
> additional commit since the last revision:
> 
>   Remove exports from Graal module to jdk.aot

AArch64 looks fine.

-

Marked as reviewed by aph (Reviewer).

PR: https://git.openjdk.java.net/jdk/pull/3398


Re: RFR: 8256245: AArch64: Implement Base64 decoding intrinsic [v7]

2021-04-08 Thread Andrew Haley
On Thu, 8 Apr 2021 06:33:43 GMT, Dong Bo  wrote:

>> In JDK-8248188, IntrinsicCandidate and API is added for Base64 decoding.
>> Base64 decoding can be improved on aarch64 with ld4/tbl/tbx/st3, a basic 
>> idea can be found at 
>> http://0x80.pl/articles/base64-simd-neon.html#encoding-quadwords.
>> 
>> Patch passed jtreg tier1-3 tests with linux-aarch64-server-fastdebug build.
>> Tests in `test/jdk/java/util/Base64/` and 
>> `compiler/intrinsics/base64/TestBase64.java` runned specially for the 
>> correctness of the implementation.
>> 
>> There can be illegal characters at the start of the input if the data is 
>> MIME encoded.
>> It would be no benefits to use SIMD for this case, so the stub use no-simd 
>> instructions for MIME encoded data now.
>> 
>> A JMH micro, Base64Decode.java, is added for performance test.
>> With different input length (upper-bounded by parameter `maxNumBytes` in the 
>> JMH micro),
>> we witness ~2.5x improvements with long inputs and no regression with short 
>> inputs for raw base64 decodeing, minor improvements (~10.95%) for MIME on 
>> Kunpeng916.
>> 
>> The Base64Decode.java JMH micro-benchmark results:
>> 
>> Benchmark  (lineSize)  (maxNumBytes)  Mode  Cnt  
>>  Score   Error  Units
>> 
>> # Kunpeng916, intrinsic
>> Base64Decode.testBase64Decode   4  1  avgt5  
>> 48.614 ± 0.609  ns/op
>> Base64Decode.testBase64Decode   4  3  avgt5  
>> 58.199 ± 1.650  ns/op
>> Base64Decode.testBase64Decode   4  7  avgt5  
>> 69.400 ± 0.931  ns/op
>> Base64Decode.testBase64Decode   4 32  avgt5  
>> 96.818 ± 1.687  ns/op
>> Base64Decode.testBase64Decode   4 64  avgt5 
>> 122.856 ± 9.217  ns/op
>> Base64Decode.testBase64Decode   4 80  avgt5 
>> 130.935 ± 1.667  ns/op
>> Base64Decode.testBase64Decode   4 96  avgt5 
>> 143.627 ± 1.751  ns/op
>> Base64Decode.testBase64Decode   4112  avgt5 
>> 152.311 ± 1.178  ns/op
>> Base64Decode.testBase64Decode   4512  avgt5 
>> 342.631 ± 0.584  ns/op
>> Base64Decode.testBase64Decode   4   1000  avgt5 
>> 573.635 ± 1.050  ns/op
>> Base64Decode.testBase64Decode   4  2  avgt5
>> 9534.136 ±45.172  ns/op
>> Base64Decode.testBase64Decode   4  5  avgt5   
>> 22718.726 ±   192.070  ns/op
>> Base64Decode.testBase64MIMEDecode   4  1  avgt   10  
>> 63.558 ±0.336  ns/op
>> Base64Decode.testBase64MIMEDecode   4  3  avgt   10  
>> 82.504 ±0.848  ns/op
>> Base64Decode.testBase64MIMEDecode   4  7  avgt   10 
>> 120.591 ±0.608  ns/op
>> Base64Decode.testBase64MIMEDecode   4 32  avgt   10 
>> 324.314 ±6.236  ns/op
>> Base64Decode.testBase64MIMEDecode   4 64  avgt   10 
>> 532.678 ±4.670  ns/op
>> Base64Decode.testBase64MIMEDecode   4 80  avgt   10 
>> 678.126 ±4.324  ns/op
>> Base64Decode.testBase64MIMEDecode   4 96  avgt   10 
>> 771.603 ±6.393  ns/op
>> Base64Decode.testBase64MIMEDecode   4112  avgt   10 
>> 889.608 ±   0.759  ns/op
>> Base64Decode.testBase64MIMEDecode   4512  avgt   10
>> 3663.557 ±3.422  ns/op
>> Base64Decode.testBase64MIMEDecode   4   1000  avgt   10
>> 7017.784 ±9.128  ns/op
>> Base64Decode.testBase64MIMEDecode   4  2  avgt   10  
>> 128670.660 ± 7951.521  ns/op
>> Base64Decode.testBase64MIMEDecode   4  5  avgt   10  
>> 317113.667 ±  161.758  ns/op
>> 
>> # Kunpeng916, default
>> Base64Decode.testBase64Decode   4  1  avgt5  
>> 48.455 ±   0.571  ns/op
>> Base64Decode.testBase64Decode   4  3  avgt5  
>> 57.937 ±   0.505  ns/op
>> Base64Decode.testBase64Decode   4  7  avgt5  
>> 73.823 ±   1.452  ns/op
>> Base64Decode.testBase64Decode   4 32  avgt5 
>> 106.484 ±   1.243  ns/op
>> Base64Decode.testBase64Decode   4 64  avgt5 
>> 141.004 ±   1.188  ns/op
>> Base64Decode.testBase64Decode   4 80  avgt5 
>> 156.284 ±   0.572  ns/op
>> Base64Decode.testBase64Decode   4 96  avgt5 
>> 174.137 ±   0.177  ns/op
>> Base64Decode.testBase64Decode   4112  avgt5 
>> 188.445 ±   0.572  ns/op
>> Base64Decode.testBase64Decode   4512  avgt5 
>> 610.847 ±   1.559  ns/op
>> Base64Decode.testBase64Decode   4   1000  avgt  

Re: RFR: 8256245: AArch64: Implement Base64 decoding intrinsic [v6]

2021-04-07 Thread Andrew Haley
On Wed, 7 Apr 2021 05:51:02 GMT, Dong Bo  wrote:

>> In JDK-8248188, IntrinsicCandidate and API is added for Base64 decoding.
>> Base64 decoding can be improved on aarch64 with ld4/tbl/tbx/st3, a basic 
>> idea can be found at 
>> http://0x80.pl/articles/base64-simd-neon.html#encoding-quadwords.
>> 
>> Patch passed jtreg tier1-3 tests with linux-aarch64-server-fastdebug build.
>> Tests in `test/jdk/java/util/Base64/` and 
>> `compiler/intrinsics/base64/TestBase64.java` runned specially for the 
>> correctness of the implementation.
>> 
>> There can be illegal characters at the start of the input if the data is 
>> MIME encoded.
>> It would be no benefits to use SIMD for this case, so the stub use no-simd 
>> instructions for MIME encoded data now.
>> 
>> A JMH micro, Base64Decode.java, is added for performance test.
>> With different input length (upper-bounded by parameter `maxNumBytes` in the 
>> JMH micro),
>> we witness ~2.5x improvements with long inputs and no regression with short 
>> inputs for raw base64 decodeing, minor improvements (~10.95%) for MIME on 
>> Kunpeng916.
>> 
>> The Base64Decode.java JMH micro-benchmark results:
>> 
>> Benchmark  (lineSize)  (maxNumBytes)  Mode  Cnt  
>>  Score   Error  Units
>> 
>> # Kunpeng916, intrinsic
>> Base64Decode.testBase64Decode   4  1  avgt5  
>> 48.614 ± 0.609  ns/op
>> Base64Decode.testBase64Decode   4  3  avgt5  
>> 58.199 ± 1.650  ns/op
>> Base64Decode.testBase64Decode   4  7  avgt5  
>> 69.400 ± 0.931  ns/op
>> Base64Decode.testBase64Decode   4 32  avgt5  
>> 96.818 ± 1.687  ns/op
>> Base64Decode.testBase64Decode   4 64  avgt5 
>> 122.856 ± 9.217  ns/op
>> Base64Decode.testBase64Decode   4 80  avgt5 
>> 130.935 ± 1.667  ns/op
>> Base64Decode.testBase64Decode   4 96  avgt5 
>> 143.627 ± 1.751  ns/op
>> Base64Decode.testBase64Decode   4112  avgt5 
>> 152.311 ± 1.178  ns/op
>> Base64Decode.testBase64Decode   4512  avgt5 
>> 342.631 ± 0.584  ns/op
>> Base64Decode.testBase64Decode   4   1000  avgt5 
>> 573.635 ± 1.050  ns/op
>> Base64Decode.testBase64Decode   4  2  avgt5
>> 9534.136 ±45.172  ns/op
>> Base64Decode.testBase64Decode   4  5  avgt5   
>> 22718.726 ±   192.070  ns/op
>> Base64Decode.testBase64MIMEDecode   4  1  avgt   10  
>> 63.558 ±0.336  ns/op
>> Base64Decode.testBase64MIMEDecode   4  3  avgt   10  
>> 82.504 ±0.848  ns/op
>> Base64Decode.testBase64MIMEDecode   4  7  avgt   10 
>> 120.591 ±0.608  ns/op
>> Base64Decode.testBase64MIMEDecode   4 32  avgt   10 
>> 324.314 ±6.236  ns/op
>> Base64Decode.testBase64MIMEDecode   4 64  avgt   10 
>> 532.678 ±4.670  ns/op
>> Base64Decode.testBase64MIMEDecode   4 80  avgt   10 
>> 678.126 ±4.324  ns/op
>> Base64Decode.testBase64MIMEDecode   4 96  avgt   10 
>> 771.603 ±6.393  ns/op
>> Base64Decode.testBase64MIMEDecode   4112  avgt   10 
>> 889.608 ±   0.759  ns/op
>> Base64Decode.testBase64MIMEDecode   4512  avgt   10
>> 3663.557 ±3.422  ns/op
>> Base64Decode.testBase64MIMEDecode   4   1000  avgt   10
>> 7017.784 ±9.128  ns/op
>> Base64Decode.testBase64MIMEDecode   4  2  avgt   10  
>> 128670.660 ± 7951.521  ns/op
>> Base64Decode.testBase64MIMEDecode   4  5  avgt   10  
>> 317113.667 ±  161.758  ns/op
>> 
>> # Kunpeng916, default
>> Base64Decode.testBase64Decode   4  1  avgt5  
>> 48.455 ±   0.571  ns/op
>> Base64Decode.testBase64Decode   4  3  avgt5  
>> 57.937 ±   0.505  ns/op
>> Base64Decode.testBase64Decode   4  7  avgt5  
>> 73.823 ±   1.452  ns/op
>> Base64Decode.testBase64Decode   4 32  avgt5 
>> 106.484 ±   1.243  ns/op
>> Base64Decode.testBase64Decode   4 64  avgt5 
>> 141.004 ±   1.188  ns/op
>> Base64Decode.testBase64Decode   4 80  avgt5 
>> 156.284 ±   0.572  ns/op
>> Base64Decode.testBase64Decode   4 96  avgt5 
>> 174.137 ±   0.177  ns/op
>> Base64Decode.testBase64Decode   4112  avgt5 
>> 188.445 ±   0.572  ns/op
>> Base64Decode.testBase64Decode   4512  avgt5 
>> 610.847 ±   1.559  ns/op
>> Base64Decode.testBase64Decode   4   1000  avgt  

Re: RFR: 8256245: AArch64: Implement Base64 decoding intrinsic [v6]

2021-04-07 Thread Andrew Haley
On Wed, 7 Apr 2021 09:50:45 GMT, Andrew Haley  wrote:

>> Dong Bo has updated the pull request incrementally with one additional 
>> commit since the last revision:
>> 
>>   fix misleading annotations
>
> src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 5829:
> 
>> 5827: __ strb(r14, __ post(dst, 1));
>> 5828: __ strb(r15, __ post(dst, 1));
>> 5829: __ strb(r13, __ post(dst, 1));
> 
> I think this sequence should be 4 BFMs, STRW, BFM, STRW. That's the best we 
> can do, I think.

Sorry, that's not quite right, but you get the idea: let's not generate 
unnecessary memory traffic.
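
To spell out the idea (a Java-flavoured sketch of the data shape, not the stub
code itself; the helper name is just for this sketch): four 6-bit symbols
decode to 24 bits, so the three output bytes already sit in one register and
can be flushed with a bitfield move plus one wider store rather than three
separate byte stores.

    // Sketch only: how four Base64 symbols pack into 24 bits. In the stub,
    // these 24 bits can be written with a single wider store instead of the
    // three single-byte stores shown here.
    static void decodeGroup(byte[] dst, int off, int s0, int s1, int s2, int s3) {
        int bits = (s0 << 18) | (s1 << 12) | (s2 << 6) | s3;  // 24 decoded bits
        dst[off]     = (byte) (bits >>> 16);
        dst[off + 1] = (byte) (bits >>> 8);
        dst[off + 2] = (byte)  bits;
    }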

-

PR: https://git.openjdk.java.net/jdk/pull/3228


Re: RFR: 8256245: AArch64: Implement Base64 decoding intrinsic [v5]

2021-04-06 Thread Andrew Haley
On Tue, 6 Apr 2021 07:25:57 GMT, Dong Bo  wrote:

>> In JDK-8248188, IntrinsicCandidate and API is added for Base64 decoding.
>> Base64 decoding can be improved on aarch64 with ld4/tbl/tbx/st3, a basic 
>> idea can be found at 
>> http://0x80.pl/articles/base64-simd-neon.html#encoding-quadwords.
>> 
>> Patch passed jtreg tier1-3 tests with linux-aarch64-server-fastdebug build.
>> Tests in `test/jdk/java/util/Base64/` and 
>> `compiler/intrinsics/base64/TestBase64.java` runned specially for the 
>> correctness of the implementation.
>> 
>> There can be illegal characters at the start of the input if the data is 
>> MIME encoded.
>> It would be no benefits to use SIMD for this case, so the stub use no-simd 
>> instructions for MIME encoded data now.
>> 
>> A JMH micro, Base64Decode.java, is added for performance test.
>> With different input length (upper-bounded by parameter `maxNumBytes` in the 
>> JMH micro),
>> we witness ~2.5x improvements with long inputs and no regression with short 
>> inputs for raw base64 decodeing, minor improvements (~10.95%) for MIME on 
>> Kunpeng916.
>> 
>> The Base64Decode.java JMH micro-benchmark results:
>> 
>> Benchmark  (lineSize)  (maxNumBytes)  Mode  Cnt  
>>  Score   Error  Units
>> 
>> # Kunpeng916, intrinsic
>> Base64Decode.testBase64Decode   4  1  avgt5  
>> 48.614 ± 0.609  ns/op
>> Base64Decode.testBase64Decode   4  3  avgt5  
>> 58.199 ± 1.650  ns/op
>> Base64Decode.testBase64Decode   4  7  avgt5  
>> 69.400 ± 0.931  ns/op
>> Base64Decode.testBase64Decode   4 32  avgt5  
>> 96.818 ± 1.687  ns/op
>> Base64Decode.testBase64Decode   4 64  avgt5 
>> 122.856 ± 9.217  ns/op
>> Base64Decode.testBase64Decode   4 80  avgt5 
>> 130.935 ± 1.667  ns/op
>> Base64Decode.testBase64Decode   4 96  avgt5 
>> 143.627 ± 1.751  ns/op
>> Base64Decode.testBase64Decode   4112  avgt5 
>> 152.311 ± 1.178  ns/op
>> Base64Decode.testBase64Decode   4512  avgt5 
>> 342.631 ± 0.584  ns/op
>> Base64Decode.testBase64Decode   4   1000  avgt5 
>> 573.635 ± 1.050  ns/op
>> Base64Decode.testBase64Decode   4  2  avgt5
>> 9534.136 ±45.172  ns/op
>> Base64Decode.testBase64Decode   4  5  avgt5   
>> 22718.726 ±   192.070  ns/op
>> Base64Decode.testBase64MIMEDecode   4  1  avgt   10  
>> 63.558 ±0.336  ns/op
>> Base64Decode.testBase64MIMEDecode   4  3  avgt   10  
>> 82.504 ±0.848  ns/op
>> Base64Decode.testBase64MIMEDecode   4  7  avgt   10 
>> 120.591 ±0.608  ns/op
>> Base64Decode.testBase64MIMEDecode   4 32  avgt   10 
>> 324.314 ±6.236  ns/op
>> Base64Decode.testBase64MIMEDecode   4 64  avgt   10 
>> 532.678 ±4.670  ns/op
>> Base64Decode.testBase64MIMEDecode   4 80  avgt   10 
>> 678.126 ±4.324  ns/op
>> Base64Decode.testBase64MIMEDecode   4 96  avgt   10 
>> 771.603 ±6.393  ns/op
>> Base64Decode.testBase64MIMEDecode   4112  avgt   10 
>> 889.608 ±   0.759  ns/op
>> Base64Decode.testBase64MIMEDecode   4512  avgt   10
>> 3663.557 ±3.422  ns/op
>> Base64Decode.testBase64MIMEDecode   4   1000  avgt   10
>> 7017.784 ±9.128  ns/op
>> Base64Decode.testBase64MIMEDecode   4  2  avgt   10  
>> 128670.660 ± 7951.521  ns/op
>> Base64Decode.testBase64MIMEDecode   4  5  avgt   10  
>> 317113.667 ±  161.758  ns/op
>> 
>> # Kunpeng916, default
>> Base64Decode.testBase64Decode   4  1  avgt5  
>> 48.455 ±   0.571  ns/op
>> Base64Decode.testBase64Decode   4  3  avgt5  
>> 57.937 ±   0.505  ns/op
>> Base64Decode.testBase64Decode   4  7  avgt5  
>> 73.823 ±   1.452  ns/op
>> Base64Decode.testBase64Decode   4 32  avgt5 
>> 106.484 ±   1.243  ns/op
>> Base64Decode.testBase64Decode   4 64  avgt5 
>> 141.004 ±   1.188  ns/op
>> Base64Decode.testBase64Decode   4 80  avgt5 
>> 156.284 ±   0.572  ns/op
>> Base64Decode.testBase64Decode   4 96  avgt5 
>> 174.137 ±   0.177  ns/op
>> Base64Decode.testBase64Decode   4112  avgt5 
>> 188.445 ±   0.572  ns/op
>> Base64Decode.testBase64Decode   4512  avgt5 
>> 610.847 ±   1.559  ns/op
>> Base64Decode.testBase64Decode   4   1000  avgt  

Re: RFR: 8256245: AArch64: Implement Base64 decoding intrinsic [v3]

2021-04-06 Thread Andrew Haley
On Fri, 2 Apr 2021 10:01:27 GMT, Andrew Haley  wrote:

>> Dong Bo has updated the pull request with a new target base due to a merge 
>> or a rebase. The pull request now contains six commits:
>> 
>>  - Merge branch 'master' into aarch64.base64.decode
>>  - copyright
>>  - trivial fixes
>>  - Handling error in SIMD case with loops, combining two non-SIMD cases into 
>> one code blob, addressing other comments
>>  - Merge branch 'master' into aarch64.base64.decode
>>  - 8256245: AArch64: Implement Base64 decoding intrinsic
>
> src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 5802:
> 
>> 5800: // The 1st character of the input can be illegal if the data is 
>> MIME encoded.
>> 5801: // We cannot benefits from SIMD for this case. The max line size 
>> of MIME
>> 5802: // encoding is 76, with the PreProcess80B blob, we actually use 
>> no-simd
> 
> "cannot benefit"

OK, so I now understand what is actually going on here, and it has nothing to 
do with illegal first characters.
The problem is that the maximum block length the decoder will be supplied with 
is 76 bytes, which isn't enough data for SIMD to be worthwhile. So the 
comment should be "In the MIME case, the line length cannot be more than 76 
bytes (see RFC 2045.) This is too short a block for SIMD to be worthwhile, so 
we use non-SIMD here."
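
For later readers, the 76 is easy to see with nothing but the public
java.util.Base64 API (a minimal sketch; the class name is just for this
sketch):

    import java.util.Base64;

    public class MimeLineLength {
        public static void main(String[] args) {
            // The MIME encoder wraps at 76 characters (RFC 2045), so the
            // decoder never sees more than 76 symbols between line breaks.
            String mime = Base64.getMimeEncoder().encodeToString(new byte[300]);
            for (String line : mime.split("\r\n")) {
                System.out.println(line.length());   // 76, 76, 76, 76, 76, 20
            }
        }
    }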

-

PR: https://git.openjdk.java.net/jdk/pull/3228


Re: RFR: 8256245: AArch64: Implement Base64 decoding intrinsic

2021-04-06 Thread Andrew Haley
On Sat, 27 Mar 2021 08:58:03 GMT, Dong Bo  wrote:

> There can be illegal characters at the start of the input if the data is MIME 
> encoded.
> It would be no benefits to use SIMD for this case, so the stub use no-simd 
> instructions for MIME encoded data now.

What is the reasoning here? Sure, there can be illegal characters at the start, 
but what if there are not? The generic logic uses decodeBlock() even in the 
MIME case, because we don't know that there certainly will be illegal 
characters.
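
The point being that well-formed MIME input may contain nothing illegal at all
apart from the line breaks, which the decoder simply skips. A minimal sketch
using the public API (class name is just for this sketch):

    import java.util.Base64;

    public class MimeDecodeSkip {
        public static void main(String[] args) {
            // getMimeDecoder() ignores characters outside the base64
            // alphabet, here the CRLF line breaks; everything else is
            // perfectly legal data.
            String wrapped = "SGVsbG8s\r\nIHdvcmxk\r\nIQ==";
            System.out.println(new String(Base64.getMimeDecoder().decode(wrapped)));
            // prints: Hello, world!
        }
    }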

-

PR: https://git.openjdk.java.net/jdk/pull/3228


Re: Faster double parser?

2021-04-05 Thread Andrew Haley
On 4/5/21 9:08 AM, Alan Bateman wrote:
> On 04/04/2021 09:53, Andrew Haley wrote:
>> :
>> We also have a candidate for a String -> toDouble parser from Raffaello
>> Giulietti:
>>
>> https://mail.openjdk.java.net/pipermail/core-libs-dev/2018-March/052355.html
>>
>> As far as I can see there's nothing wrong with it, but it got stuck in
>> review. This was a bad failure of our review processes, I'm afraid. I
>> wonder if we will do any better with this one.
> 
> Yeah, there was a lengthy discussion about this at the OpenJDK Committer 
> Workshop just after FOSDEM 2020. Brian Burkhalter and Andrew Dinn had 
> both put time into looking at the proposal it but the conclusion was 
> that Raffaello's paper "The Schubfach way to render doubles" needed a 
> review from from an expert in this area.

I wonder. The implementation has even been proved correct for floats by
exhaustion. (Of course that's not possible for doubles.)
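
(Not Raffaello's harness, just a sketch of the sort of exhaustive check that
is feasible at float width with today's APIs; his proof is stronger, checking
that the rendering is the shortest correctly-rounded form, not merely that it
round-trips. Class name is just for this sketch.)

    public class FloatRoundTripExhaustive {
        public static void main(String[] args) {
            // Walk all 2^32 float bit patterns and check that render-then-parse
            // is the identity. Slow, but feasible for float in a way it never
            // will be for double.
            for (long bits = 0; bits <= 0xFFFF_FFFFL; bits++) {
                float f = Float.intBitsToFloat((int) bits);
                float back = Float.parseFloat(Float.toString(f));
                if (Float.floatToIntBits(back) != Float.floatToIntBits(f)) {
                    throw new AssertionError("round trip failed at bits=" + bits);
                }
            }
            System.out.println("all 4294967296 patterns round-trip");
        }
    }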

I submit, therefore, that we could at least use the float version without
reasonable fear of breaking something.

We seem to be paralysed by this one.

> Guy Steele was mentioned but I don't know if he was approached about it.

-- 
Andrew Haley  (he/him)
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
https://keybase.io/andrewhaley
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671



Re: Faster double parser?

2021-04-04 Thread Andrew Haley
On 4/4/21 9:53 AM, Andrew Haley wrote:
> On 4/3/21 12:58 PM, Carfield Yim wrote:
>> HI all, just aware of this project,
>> https://github.com/wrandelshofer/FastDoubleParser, not sure if it good idea
>> to use this as default Double Parser in JVM?
> 
> We also have a candidate for a String -> toDouble

Argh. Double -> String.

> parser from Raffaello Giulietti:
-- 
Andrew Haley  (he/him)
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
https://keybase.io/andrewhaley
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671



Re: Faster double parser?

2021-04-04 Thread Andrew Haley
On 4/3/21 12:58 PM, Carfield Yim wrote:
> HI all, just aware of this project,
> https://github.com/wrandelshofer/FastDoubleParser, not sure if it good idea
> to use this as default Double Parser in JVM?

We also have a candidate for a String -> toDouble parser from Raffaello
Giulietti:

https://mail.openjdk.java.net/pipermail/core-libs-dev/2018-March/052355.html

As far as I can see there's nothing wrong with it, but it got stuck in
review. This was a bad failure of our review processes, I'm afraid. I
wonder if we will do any better with this one.

-- 
Andrew Haley  (he/him)
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
https://keybase.io/andrewhaley
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671



Re: RFR: 8256245: AArch64: Implement Base64 decoding intrinsic [v3]

2021-04-02 Thread Andrew Haley
On Fri, 2 Apr 2021 03:10:57 GMT, Dong Bo  wrote:

>> In JDK-8248188, IntrinsicCandidate and API is added for Base64 decoding.
>> Base64 decoding can be improved on aarch64 with ld4/tbl/tbx/st3, a basic 
>> idea can be found at 
>> http://0x80.pl/articles/base64-simd-neon.html#encoding-quadwords.
>> 
>> Patch passed jtreg tier1-3 tests with linux-aarch64-server-fastdebug build.
>> Tests in `test/jdk/java/util/Base64/` and 
>> `compiler/intrinsics/base64/TestBase64.java` runned specially for the 
>> correctness of the implementation.
>> 
>> There can be illegal characters at the start of the input if the data is 
>> MIME encoded.
>> It would be no benefits to use SIMD for this case, so the stub use no-simd 
>> instructions for MIME encoded data now.
>> 
>> A JMH micro, Base64Decode.java, is added for performance test.
>> With different input length (upper-bounded by parameter `maxNumBytes` in the 
>> JMH micro),
>> we witness ~2.5x improvements with long inputs and no regression with short 
>> inputs for raw base64 decodeing, minor improvements (~10.95%) for MIME on 
>> Kunpeng916.
>> 
>> The Base64Decode.java JMH micro-benchmark results:
>> 
>> Benchmark  (lineSize)  (maxNumBytes)  Mode  Cnt  
>>  Score   Error  Units
>> 
>> # Kunpeng916, intrinsic
>> Base64Decode.testBase64Decode   4  1  avgt5  
>> 48.614 ± 0.609  ns/op
>> Base64Decode.testBase64Decode   4  3  avgt5  
>> 58.199 ± 1.650  ns/op
>> Base64Decode.testBase64Decode   4  7  avgt5  
>> 69.400 ± 0.931  ns/op
>> Base64Decode.testBase64Decode   4 32  avgt5  
>> 96.818 ± 1.687  ns/op
>> Base64Decode.testBase64Decode   4 64  avgt5 
>> 122.856 ± 9.217  ns/op
>> Base64Decode.testBase64Decode   4 80  avgt5 
>> 130.935 ± 1.667  ns/op
>> Base64Decode.testBase64Decode   4 96  avgt5 
>> 143.627 ± 1.751  ns/op
>> Base64Decode.testBase64Decode   4112  avgt5 
>> 152.311 ± 1.178  ns/op
>> Base64Decode.testBase64Decode   4512  avgt5 
>> 342.631 ± 0.584  ns/op
>> Base64Decode.testBase64Decode   4   1000  avgt5 
>> 573.635 ± 1.050  ns/op
>> Base64Decode.testBase64Decode   4  2  avgt5
>> 9534.136 ±45.172  ns/op
>> Base64Decode.testBase64Decode   4  5  avgt5   
>> 22718.726 ±   192.070  ns/op
>> Base64Decode.testBase64MIMEDecode   4  1  avgt   10  
>> 63.558 ±0.336  ns/op
>> Base64Decode.testBase64MIMEDecode   4  3  avgt   10  
>> 82.504 ±0.848  ns/op
>> Base64Decode.testBase64MIMEDecode   4  7  avgt   10 
>> 120.591 ±0.608  ns/op
>> Base64Decode.testBase64MIMEDecode   4 32  avgt   10 
>> 324.314 ±6.236  ns/op
>> Base64Decode.testBase64MIMEDecode   4 64  avgt   10 
>> 532.678 ±4.670  ns/op
>> Base64Decode.testBase64MIMEDecode   4 80  avgt   10 
>> 678.126 ±4.324  ns/op
>> Base64Decode.testBase64MIMEDecode   4 96  avgt   10 
>> 771.603 ±6.393  ns/op
>> Base64Decode.testBase64MIMEDecode   4112  avgt   10 
>> 889.608 ±   0.759  ns/op
>> Base64Decode.testBase64MIMEDecode   4512  avgt   10
>> 3663.557 ±3.422  ns/op
>> Base64Decode.testBase64MIMEDecode   4   1000  avgt   10
>> 7017.784 ±9.128  ns/op
>> Base64Decode.testBase64MIMEDecode   4  2  avgt   10  
>> 128670.660 ± 7951.521  ns/op
>> Base64Decode.testBase64MIMEDecode   4  5  avgt   10  
>> 317113.667 ±  161.758  ns/op
>> 
>> # Kunpeng916, default
>> Base64Decode.testBase64Decode   4  1  avgt5  
>> 48.455 ±   0.571  ns/op
>> Base64Decode.testBase64Decode   4  3  avgt5  
>> 57.937 ±   0.505  ns/op
>> Base64Decode.testBase64Decode   4  7  avgt5  
>> 73.823 ±   1.452  ns/op
>> Base64Decode.testBase64Decode   4 32  avgt5 
>> 106.484 ±   1.243  ns/op
>> Base64Decode.testBase64Decode   4 64  avgt5 
>> 141.004 ±   1.188  ns/op
>> Base64Decode.testBase64Decode   4 80  avgt5 
>> 156.284 ±   0.572  ns/op
>> Base64Decode.testBase64Decode   4 96  avgt5 
>> 174.137 ±   0.177  ns/op
>> Base64Decode.testBase64Decode   4112  avgt5 
>> 188.445 ±   0.572  ns/op
>> Base64Decode.testBase64Decode   4512  avgt5 
>> 610.847 ±   1.559  ns/op
>> Base64Decode.testBase64Decode   4   1000  avgt  

Re: RFR: 8256245: AArch64: Implement Base64 decoding intrinsic

2021-04-02 Thread Andrew Haley
On Fri, 2 Apr 2021 07:05:26 GMT, Dong Bo  wrote:

> PING... Any suggestions on the updated commit?

Once you reply to the comments, sure.

-

PR: https://git.openjdk.java.net/jdk/pull/3228


Re: RFR: 8256245: AArch64: Implement Base64 decoding intrinsic

2021-04-02 Thread Andrew Haley
On Mon, 29 Mar 2021 03:28:54 GMT, Dong Bo  wrote:

> > Please consider losing the non-SIMD case. It doesn't result in any 
> > significant gain.
> 
> The non-SIMD case is useful for MIME decoding performance.

Your test results suggest that it isn't useful for that, surely?

-

PR: https://git.openjdk.java.net/jdk/pull/3228


Re: RFR: 8256245: AArch64: Implement Base64 decoding intrinsic [v3]

2021-04-02 Thread Andrew Haley
On Fri, 2 Apr 2021 03:10:57 GMT, Dong Bo  wrote:

>> In JDK-8248188, IntrinsicCandidate and API is added for Base64 decoding.
>> Base64 decoding can be improved on aarch64 with ld4/tbl/tbx/st3, a basic 
>> idea can be found at 
>> http://0x80.pl/articles/base64-simd-neon.html#encoding-quadwords.
>> 
>> Patch passed jtreg tier1-3 tests with linux-aarch64-server-fastdebug build.
>> Tests in `test/jdk/java/util/Base64/` and 
>> `compiler/intrinsics/base64/TestBase64.java` runned specially for the 
>> correctness of the implementation.
>> 
>> There can be illegal characters at the start of the input if the data is 
>> MIME encoded.
>> It would be no benefits to use SIMD for this case, so the stub use no-simd 
>> instructions for MIME encoded data now.
>> 
>> A JMH micro, Base64Decode.java, is added for performance test.
>> With different input length (upper-bounded by parameter `maxNumBytes` in the 
>> JMH micro),
>> we witness ~2.5x improvements with long inputs and no regression with short 
>> inputs for raw base64 decodeing, minor improvements (~10.95%) for MIME on 
>> Kunpeng916.
>> 
>> The Base64Decode.java JMH micro-benchmark results:
>> 
>> Benchmark  (lineSize)  (maxNumBytes)  Mode  Cnt  
>>  Score   Error  Units
>> 
>> # Kunpeng916, intrinsic
>> Base64Decode.testBase64Decode   4  1  avgt5  
>> 48.614 ± 0.609  ns/op
>> Base64Decode.testBase64Decode   4  3  avgt5  
>> 58.199 ± 1.650  ns/op
>> Base64Decode.testBase64Decode   4  7  avgt5  
>> 69.400 ± 0.931  ns/op
>> Base64Decode.testBase64Decode   4 32  avgt5  
>> 96.818 ± 1.687  ns/op
>> Base64Decode.testBase64Decode   4 64  avgt5 
>> 122.856 ± 9.217  ns/op
>> Base64Decode.testBase64Decode   4 80  avgt5 
>> 130.935 ± 1.667  ns/op
>> Base64Decode.testBase64Decode   4 96  avgt5 
>> 143.627 ± 1.751  ns/op
>> Base64Decode.testBase64Decode   4112  avgt5 
>> 152.311 ± 1.178  ns/op
>> Base64Decode.testBase64Decode   4512  avgt5 
>> 342.631 ± 0.584  ns/op
>> Base64Decode.testBase64Decode   4   1000  avgt5 
>> 573.635 ± 1.050  ns/op
>> Base64Decode.testBase64Decode   4  2  avgt5
>> 9534.136 ±45.172  ns/op
>> Base64Decode.testBase64Decode   4  5  avgt5   
>> 22718.726 ±   192.070  ns/op
>> Base64Decode.testBase64MIMEDecode   4  1  avgt   10  
>> 63.558 ±0.336  ns/op
>> Base64Decode.testBase64MIMEDecode   4  3  avgt   10  
>> 82.504 ±0.848  ns/op
>> Base64Decode.testBase64MIMEDecode   4  7  avgt   10 
>> 120.591 ±0.608  ns/op
>> Base64Decode.testBase64MIMEDecode   4 32  avgt   10 
>> 324.314 ±6.236  ns/op
>> Base64Decode.testBase64MIMEDecode   4 64  avgt   10 
>> 532.678 ±4.670  ns/op
>> Base64Decode.testBase64MIMEDecode   4 80  avgt   10 
>> 678.126 ±4.324  ns/op
>> Base64Decode.testBase64MIMEDecode   4 96  avgt   10 
>> 771.603 ±6.393  ns/op
>> Base64Decode.testBase64MIMEDecode   4112  avgt   10 
>> 889.608 ±   0.759  ns/op
>> Base64Decode.testBase64MIMEDecode   4512  avgt   10
>> 3663.557 ±3.422  ns/op
>> Base64Decode.testBase64MIMEDecode   4   1000  avgt   10
>> 7017.784 ±9.128  ns/op
>> Base64Decode.testBase64MIMEDecode   4  2  avgt   10  
>> 128670.660 ± 7951.521  ns/op
>> Base64Decode.testBase64MIMEDecode   4  5  avgt   10  
>> 317113.667 ±  161.758  ns/op
>> 
>> # Kunpeng916, default
>> Base64Decode.testBase64Decode   4  1  avgt5  
>> 48.455 ±   0.571  ns/op
>> Base64Decode.testBase64Decode   4  3  avgt5  
>> 57.937 ±   0.505  ns/op
>> Base64Decode.testBase64Decode   4  7  avgt5  
>> 73.823 ±   1.452  ns/op
>> Base64Decode.testBase64Decode   4 32  avgt5 
>> 106.484 ±   1.243  ns/op
>> Base64Decode.testBase64Decode   4 64  avgt5 
>> 141.004 ±   1.188  ns/op
>> Base64Decode.testBase64Decode   4 80  avgt5 
>> 156.284 ±   0.572  ns/op
>> Base64Decode.testBase64Decode   4 96  avgt5 
>> 174.137 ±   0.177  ns/op
>> Base64Decode.testBase64Decode   4112  avgt5 
>> 188.445 ±   0.572  ns/op
>> Base64Decode.testBase64Decode   4512  avgt5 
>> 610.847 ±   1.559  ns/op
>> Base64Decode.testBase64Decode   4   1000  avgt  

Re: RFR: 8256245: AArch64: Implement Base64 decoding intrinsic [v3]

2021-04-02 Thread Andrew Haley
On Fri, 2 Apr 2021 03:10:57 GMT, Dong Bo  wrote:

>> In JDK-8248188, IntrinsicCandidate and API is added for Base64 decoding.
>> Base64 decoding can be improved on aarch64 with ld4/tbl/tbx/st3, a basic 
>> idea can be found at 
>> http://0x80.pl/articles/base64-simd-neon.html#encoding-quadwords.
>> 
>> Patch passed jtreg tier1-3 tests with linux-aarch64-server-fastdebug build.
>> Tests in `test/jdk/java/util/Base64/` and 
>> `compiler/intrinsics/base64/TestBase64.java` runned specially for the 
>> correctness of the implementation.
>> 
>> There can be illegal characters at the start of the input if the data is 
>> MIME encoded.
>> It would be no benefits to use SIMD for this case, so the stub use no-simd 
>> instructions for MIME encoded data now.
>> 
>> A JMH micro, Base64Decode.java, is added for performance test.
>> With different input length (upper-bounded by parameter `maxNumBytes` in the 
>> JMH micro),
>> we witness ~2.5x improvements with long inputs and no regression with short 
>> inputs for raw base64 decodeing, minor improvements (~10.95%) for MIME on 
>> Kunpeng916.
>> 
>> The Base64Decode.java JMH micro-benchmark results:
>> 
>> Benchmark  (lineSize)  (maxNumBytes)  Mode  Cnt  
>>  Score   Error  Units
>> 
>> # Kunpeng916, intrinsic
>> Base64Decode.testBase64Decode   4  1  avgt5  
>> 48.614 ± 0.609  ns/op
>> Base64Decode.testBase64Decode   4  3  avgt5  
>> 58.199 ± 1.650  ns/op
>> Base64Decode.testBase64Decode   4  7  avgt5  
>> 69.400 ± 0.931  ns/op
>> Base64Decode.testBase64Decode   4 32  avgt5  
>> 96.818 ± 1.687  ns/op
>> Base64Decode.testBase64Decode   4 64  avgt5 
>> 122.856 ± 9.217  ns/op
>> Base64Decode.testBase64Decode   4 80  avgt5 
>> 130.935 ± 1.667  ns/op
>> Base64Decode.testBase64Decode   4 96  avgt5 
>> 143.627 ± 1.751  ns/op
>> Base64Decode.testBase64Decode   4112  avgt5 
>> 152.311 ± 1.178  ns/op
>> Base64Decode.testBase64Decode   4512  avgt5 
>> 342.631 ± 0.584  ns/op
>> Base64Decode.testBase64Decode   4   1000  avgt5 
>> 573.635 ± 1.050  ns/op
>> Base64Decode.testBase64Decode   4  2  avgt5
>> 9534.136 ±45.172  ns/op
>> Base64Decode.testBase64Decode   4  5  avgt5   
>> 22718.726 ±   192.070  ns/op
>> Base64Decode.testBase64MIMEDecode   4  1  avgt   10  
>> 63.558 ±0.336  ns/op
>> Base64Decode.testBase64MIMEDecode   4  3  avgt   10  
>> 82.504 ±0.848  ns/op
>> Base64Decode.testBase64MIMEDecode   4  7  avgt   10 
>> 120.591 ±0.608  ns/op
>> Base64Decode.testBase64MIMEDecode   4 32  avgt   10 
>> 324.314 ±6.236  ns/op
>> Base64Decode.testBase64MIMEDecode   4 64  avgt   10 
>> 532.678 ±4.670  ns/op
>> Base64Decode.testBase64MIMEDecode   4 80  avgt   10 
>> 678.126 ±4.324  ns/op
>> Base64Decode.testBase64MIMEDecode   4 96  avgt   10 
>> 771.603 ±6.393  ns/op
>> Base64Decode.testBase64MIMEDecode   4112  avgt   10 
>> 889.608 ±   0.759  ns/op
>> Base64Decode.testBase64MIMEDecode   4512  avgt   10
>> 3663.557 ±3.422  ns/op
>> Base64Decode.testBase64MIMEDecode   4   1000  avgt   10
>> 7017.784 ±9.128  ns/op
>> Base64Decode.testBase64MIMEDecode   4  2  avgt   10  
>> 128670.660 ± 7951.521  ns/op
>> Base64Decode.testBase64MIMEDecode   4  5  avgt   10  
>> 317113.667 ±  161.758  ns/op
>> 
>> # Kunpeng916, default
>> Base64Decode.testBase64Decode   4  1  avgt5  
>> 48.455 ±   0.571  ns/op
>> Base64Decode.testBase64Decode   4  3  avgt5  
>> 57.937 ±   0.505  ns/op
>> Base64Decode.testBase64Decode   4  7  avgt5  
>> 73.823 ±   1.452  ns/op
>> Base64Decode.testBase64Decode   4 32  avgt5 
>> 106.484 ±   1.243  ns/op
>> Base64Decode.testBase64Decode   4 64  avgt5 
>> 141.004 ±   1.188  ns/op
>> Base64Decode.testBase64Decode   4 80  avgt5 
>> 156.284 ±   0.572  ns/op
>> Base64Decode.testBase64Decode   4 96  avgt5 
>> 174.137 ±   0.177  ns/op
>> Base64Decode.testBase64Decode   4112  avgt5 
>> 188.445 ±   0.572  ns/op
>> Base64Decode.testBase64Decode   4512  avgt5 
>> 610.847 ±   1.559  ns/op
>> Base64Decode.testBase64Decode   4   1000  avgt  

Re: RFR: 8256245: AArch64: Implement Base64 decoding intrinsic

2021-03-29 Thread Andrew Haley
On Mon, 29 Mar 2021 03:28:54 GMT, Dong Bo  wrote:

> I think I can rewrite this part as loops.
> With an initial implementation, we can have almost half of the code size reduced 
> (1312B -> 748B). Sounds OK to you?

Sounds great, but I'm still somewhat concerned that the non-SIMD case only 
offers a 3-12% performance gain. If it really does come down to just 748 bytes, 
and therefore is not icache-hostile, then perhaps the balance of risk and 
reward is justified.
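
(For anyone skimming, here is the rolled-vs-unrolled tradeoff in question as a
toy Java illustration rather than anything resembling the stub, method names
invented for the sketch: both forms do the same work, but the unrolled one
carries several copies of the loop body, which is what bloats the generated
code.)

    static int sumRolled(int[] a) {
        int s = 0;
        for (int i = 0; i < a.length; i++) s += a[i];     // one copy of the body
        return s;
    }

    static int sumUnrolled4(int[] a) {                    // assumes a.length % 4 == 0
        int s = 0;
        for (int i = 0; i < a.length; i += 4) {
            s += a[i] + a[i + 1] + a[i + 2] + a[i + 3];   // four copies of the body
        }
        return s;
    }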

-

PR: https://git.openjdk.java.net/jdk/pull/3228


Re: RFR: 8256245: AArch64: Implement Base64 decoding intrinsic

2021-03-29 Thread Andrew Haley
On Mon, 29 Mar 2021 03:15:57 GMT, Nick Gasson  wrote:

> > There's a lot of unrolling, particularly in the non-SIMD case. Please 
> > consider taking out some of the unrolling; I suspect it'd not increase time 
> > by very much but would greatly reduce the code cache pollution. It's very 
> > tempting to unroll everything to make a benchmark run quickly, but we have 
> > to take a balanced approach.
> 
> But there's only ever one of these generated at startup, right? It's not like 
> the string intrinsics that are expanded at every call site.

I'm talking about icache pollution. This stuff could be quite small.

-

PR: https://git.openjdk.java.net/jdk/pull/3228


Re: RFR: 8256245: AArch64: Implement Base64 decoding intrinsic

2021-03-27 Thread Andrew Haley
On Sat, 27 Mar 2021 09:53:37 GMT, Andrew Haley  wrote:

>> In JDK-8248188, IntrinsicCandidate and API is added for Base64 decoding.
>> Base64 decoding can be improved on aarch64 with ld4/tbl/tbx/st3, a basic 
>> idea can be found at 
>> http://0x80.pl/articles/base64-simd-neon.html#encoding-quadwords.
>> 
>> Patch passed jtreg tier1-3 tests with linux-aarch64-server-fastdebug build.
>> Tests in `test/jdk/java/util/Base64/` and 
>> `compiler/intrinsics/base64/TestBase64.java` runned specially for the 
>> correctness of the implementation.
>> 
>> There can be illegal characters at the start of the input if the data is 
>> MIME encoded.
>> It would be no benefits to use SIMD for this case, so the stub use no-simd 
>> instructions for MIME encoded data now.
>> 
>> A JMH micro, Base64Decode.java, is added for performance test.
>> With different input length (upper-bounded by parameter `maxNumBytes` in the 
>> JMH micro),
>> we witness ~2.5x improvements with long inputs and no regression with short 
>> inputs for raw base64 decodeing, minor improvements (~10.95%) for MIME on 
>> Kunpeng916.
>> 
>> The Base64Decode.java JMH micro-benchmark results:
>> 
>> Benchmark  (lineSize)  (maxNumBytes)  Mode  Cnt  
>>  Score   Error  Units
>> 
>> # Kunpeng916, intrinsic
>> Base64Decode.testBase64Decode   4  1  avgt5  
>> 48.614 ± 0.609  ns/op
>> Base64Decode.testBase64Decode   4  3  avgt5  
>> 58.199 ± 1.650  ns/op
>> Base64Decode.testBase64Decode   4  7  avgt5  
>> 69.400 ± 0.931  ns/op
>> Base64Decode.testBase64Decode   4 32  avgt5  
>> 96.818 ± 1.687  ns/op
>> Base64Decode.testBase64Decode   4 64  avgt5 
>> 122.856 ± 9.217  ns/op
>> Base64Decode.testBase64Decode   4 80  avgt5 
>> 130.935 ± 1.667  ns/op
>> Base64Decode.testBase64Decode   4 96  avgt5 
>> 143.627 ± 1.751  ns/op
>> Base64Decode.testBase64Decode   4112  avgt5 
>> 152.311 ± 1.178  ns/op
>> Base64Decode.testBase64Decode   4512  avgt5 
>> 342.631 ± 0.584  ns/op
>> Base64Decode.testBase64Decode   4   1000  avgt5 
>> 573.635 ± 1.050  ns/op
>> Base64Decode.testBase64Decode   4  2  avgt5
>> 9534.136 ±45.172  ns/op
>> Base64Decode.testBase64Decode   4  5  avgt5   
>> 22718.726 ±   192.070  ns/op
>> Base64Decode.testBase64MIMEDecode   4  1  avgt   10  
>> 63.558 ±0.336  ns/op
>> Base64Decode.testBase64MIMEDecode   4  3  avgt   10  
>> 82.504 ±0.848  ns/op
>> Base64Decode.testBase64MIMEDecode   4  7  avgt   10 
>> 120.591 ±0.608  ns/op
>> Base64Decode.testBase64MIMEDecode   4 32  avgt   10 
>> 324.314 ±6.236  ns/op
>> Base64Decode.testBase64MIMEDecode   4 64  avgt   10 
>> 532.678 ±4.670  ns/op
>> Base64Decode.testBase64MIMEDecode   4 80  avgt   10 
>> 678.126 ±4.324  ns/op
>> Base64Decode.testBase64MIMEDecode   4 96  avgt   10 
>> 771.603 ±6.393  ns/op
>> Base64Decode.testBase64MIMEDecode   4112  avgt   10 
>> 889.608 ±   0.759  ns/op
>> Base64Decode.testBase64MIMEDecode   4512  avgt   10
>> 3663.557 ±3.422  ns/op
>> Base64Decode.testBase64MIMEDecode   4   1000  avgt   10
>> 7017.784 ±9.128  ns/op
>> Base64Decode.testBase64MIMEDecode   4  2  avgt   10  
>> 128670.660 ± 7951.521  ns/op
>> Base64Decode.testBase64MIMEDecode   4  5  avgt   10  
>> 317113.667 ±  161.758  ns/op
>> 
>> # Kunpeng916, default
>> Base64Decode.testBase64Decode   4  1  avgt5  
>> 48.455 ±   0.571  ns/op
>> Base64Decode.testBase64Decode   4  3  avgt5  
>> 57.937 ±   0.505  ns/op
>> Base64Decode.testBase64Decode   4  7  avgt5  
>> 73.823 ±   1.452  ns/op
>> Base64Decode.testBase64Decode   4 32  avgt5 
>> 106.484 ±   1.243  ns/op
>> Base64Decode.testBase64Decode   4 64  avgt5 
&

Re: RFR: 8256245: AArch64: Implement Base64 decoding intrinsic

2021-03-27 Thread Andrew Haley
On Sat, 27 Mar 2021 08:58:03 GMT, Dong Bo  wrote:

> In JDK-8248188, IntrinsicCandidate and API is added for Base64 decoding.
> Base64 decoding can be improved on aarch64 with ld4/tbl/tbx/st3, a basic idea 
> can be found at 
> http://0x80.pl/articles/base64-simd-neon.html#encoding-quadwords.
> 
> Patch passed jtreg tier1-3 tests with linux-aarch64-server-fastdebug build.
> Tests in `test/jdk/java/util/Base64/` and 
> `compiler/intrinsics/base64/TestBase64.java` runned specially for the 
> correctness of the implementation.
> 
> There can be illegal characters at the start of the input if the data is MIME 
> encoded.
> It would be no benefits to use SIMD for this case, so the stub use no-simd 
> instructions for MIME encoded data now.
> 
> A JMH micro, Base64Decode.java, is added for performance test.
> With different input length (upper-bounded by parameter `maxNumBytes` in the 
> JMH micro),
> we witness ~2.5x improvements with long inputs and no regression with short 
> inputs for raw base64 decodeing, minor improvements (~10.95%) for MIME on 
> Kunpeng916.
> 
> The Base64Decode.java JMH micro-benchmark results:
> 
> Benchmark  (lineSize)  (maxNumBytes)  Mode  Cnt   
> Score   Error  Units
> 
> # Kunpeng916, intrinsic
> Base64Decode.testBase64Decode   4  1  avgt5  
> 48.614 ± 0.609  ns/op
> Base64Decode.testBase64Decode   4  3  avgt5  
> 58.199 ± 1.650  ns/op
> Base64Decode.testBase64Decode   4  7  avgt5  
> 69.400 ± 0.931  ns/op
> Base64Decode.testBase64Decode   4 32  avgt5  
> 96.818 ± 1.687  ns/op
> Base64Decode.testBase64Decode   4 64  avgt5 
> 122.856 ± 9.217  ns/op
> Base64Decode.testBase64Decode   4 80  avgt5 
> 130.935 ± 1.667  ns/op
> Base64Decode.testBase64Decode   4 96  avgt5 
> 143.627 ± 1.751  ns/op
> Base64Decode.testBase64Decode   4112  avgt5 
> 152.311 ± 1.178  ns/op
> Base64Decode.testBase64Decode   4512  avgt5 
> 342.631 ± 0.584  ns/op
> Base64Decode.testBase64Decode   4   1000  avgt5 
> 573.635 ± 1.050  ns/op
> Base64Decode.testBase64Decode   4  2  avgt5
> 9534.136 ±45.172  ns/op
> Base64Decode.testBase64Decode   4  5  avgt5   
> 22718.726 ±   192.070  ns/op
> Base64Decode.testBase64MIMEDecode   4  1  avgt   10  
> 63.558 ±0.336  ns/op
> Base64Decode.testBase64MIMEDecode   4  3  avgt   10  
> 82.504 ±0.848  ns/op
> Base64Decode.testBase64MIMEDecode   4  7  avgt   10 
> 120.591 ±0.608  ns/op
> Base64Decode.testBase64MIMEDecode   4 32  avgt   10 
> 324.314 ±6.236  ns/op
> Base64Decode.testBase64MIMEDecode   4 64  avgt   10 
> 532.678 ±4.670  ns/op
> Base64Decode.testBase64MIMEDecode   4 80  avgt   10 
> 678.126 ±4.324  ns/op
> Base64Decode.testBase64MIMEDecode   4 96  avgt   10 
> 771.603 ±6.393  ns/op
> Base64Decode.testBase64MIMEDecode   4112  avgt   10 
> 889.608 ±   0.759  ns/op
> Base64Decode.testBase64MIMEDecode   4512  avgt   10
> 3663.557 ±3.422  ns/op
> Base64Decode.testBase64MIMEDecode   4   1000  avgt   10
> 7017.784 ±9.128  ns/op
> Base64Decode.testBase64MIMEDecode   4  2  avgt   10  
> 128670.660 ± 7951.521  ns/op
> Base64Decode.testBase64MIMEDecode   4  5  avgt   10  
> 317113.667 ±  161.758  ns/op
> 
> # Kunpeng916, default
> Base64Decode.testBase64Decode   4  1  avgt5  
> 48.455 ±   0.571  ns/op
> Base64Decode.testBase64Decode   4  3  avgt5  
> 57.937 ±   0.505  ns/op
> Base64Decode.testBase64Decode   4  7  avgt5  
> 73.823 ±   1.452  ns/op
> Base64Decode.testBase64Decode   4 32  avgt5 
> 106.484 ±   1.243  ns/op
> Base64Decode.testBase64Decode   4 64  avgt5 
> 141.004 ±   1.188  ns/op
> Base64Decode.testBase64Decode   4 80  avgt5 
> 156.284 ±   0.572  ns/op
> Base64Decode.testBase64Decode   4 96  avgt5 
> 174.137 ±   0.177  ns/op
> Base64Decode.testBase64Decode   4112  avgt5 
> 188.445 ±   0.572  ns/op
> Base64Decode.testBase64Decode   4512  avgt5 
> 610.847 ±   1.559  ns/op
> Base64Decode.testBase64Decode   4   1000  avgt5
> 1155.368 ±   0.813  ns/op
> Base64Decode.testBase64Decode   4  

Re: RFR: 8256245: AArch64: Implement Base64 decoding intrinsic

2021-03-27 Thread Andrew Haley
On Sat, 27 Mar 2021 08:58:03 GMT, Dong Bo  wrote:

> In JDK-8248188, IntrinsicCandidate and API is added for Base64 decoding.
> Base64 decoding can be improved on aarch64 with ld4/tbl/tbx/st3, a basic idea 
> can be found at 
> http://0x80.pl/articles/base64-simd-neon.html#encoding-quadwords.
> 
> Patch passed jtreg tier1-3 tests with linux-aarch64-server-fastdebug build.
> Tests in `test/jdk/java/util/Base64/` and 
> `compiler/intrinsics/base64/TestBase64.java` runned specially for the 
> correctness of the implementation.
> 
> There can be illegal characters at the start of the input if the data is MIME 
> encoded.
> It would be no benefits to use SIMD for this case, so the stub use no-simd 
> instructions for MIME encoded data now.
> 
> A JMH micro, Base64Decode.java, is added for performance test.
> With different input length (upper-bounded by parameter `maxNumBytes` in the 
> JMH micro),
> we witness ~2.5x improvements with long inputs and no regression with short 
> inputs for raw base64 decodeing, minor improvements (~10.95%) for MIME on 
> Kunpeng916.
> 
> The Base64Decode.java JMH micro-benchmark results:
> 
> Benchmark  (lineSize)  (maxNumBytes)  Mode  Cnt   
> Score   Error  Units
> 
> # Kunpeng916, intrinsic
> Base64Decode.testBase64Decode   4  1  avgt5  
> 48.614 ± 0.609  ns/op
> Base64Decode.testBase64Decode   4  3  avgt5  
> 58.199 ± 1.650  ns/op
> Base64Decode.testBase64Decode   4  7  avgt5  
> 69.400 ± 0.931  ns/op
> Base64Decode.testBase64Decode   4 32  avgt5  
> 96.818 ± 1.687  ns/op
> Base64Decode.testBase64Decode   4 64  avgt5 
> 122.856 ± 9.217  ns/op
> Base64Decode.testBase64Decode   4 80  avgt5 
> 130.935 ± 1.667  ns/op
> Base64Decode.testBase64Decode   4 96  avgt5 
> 143.627 ± 1.751  ns/op
> Base64Decode.testBase64Decode   4112  avgt5 
> 152.311 ± 1.178  ns/op
> Base64Decode.testBase64Decode   4512  avgt5 
> 342.631 ± 0.584  ns/op
> Base64Decode.testBase64Decode   4   1000  avgt5 
> 573.635 ± 1.050  ns/op
> Base64Decode.testBase64Decode   4  2  avgt5
> 9534.136 ±45.172  ns/op
> Base64Decode.testBase64Decode   4  5  avgt5   
> 22718.726 ±   192.070  ns/op
> Base64Decode.testBase64MIMEDecode   4  1  avgt   10  
> 63.558 ±0.336  ns/op
> Base64Decode.testBase64MIMEDecode   4  3  avgt   10  
> 82.504 ±0.848  ns/op
> Base64Decode.testBase64MIMEDecode   4  7  avgt   10 
> 120.591 ±0.608  ns/op
> Base64Decode.testBase64MIMEDecode   4 32  avgt   10 
> 324.314 ±6.236  ns/op
> Base64Decode.testBase64MIMEDecode   4 64  avgt   10 
> 532.678 ±4.670  ns/op
> Base64Decode.testBase64MIMEDecode   4 80  avgt   10 
> 678.126 ±4.324  ns/op
> Base64Decode.testBase64MIMEDecode   4 96  avgt   10 
> 771.603 ±6.393  ns/op
> Base64Decode.testBase64MIMEDecode   4112  avgt   10 
> 889.608 ±   0.759  ns/op
> Base64Decode.testBase64MIMEDecode   4512  avgt   10
> 3663.557 ±3.422  ns/op
> Base64Decode.testBase64MIMEDecode   4   1000  avgt   10
> 7017.784 ±9.128  ns/op
> Base64Decode.testBase64MIMEDecode   4  2  avgt   10  
> 128670.660 ± 7951.521  ns/op
> Base64Decode.testBase64MIMEDecode   4  5  avgt   10  
> 317113.667 ±  161.758  ns/op
> 
> # Kunpeng916, default
> Base64Decode.testBase64Decode   4  1  avgt5  
> 48.455 ±   0.571  ns/op
> Base64Decode.testBase64Decode   4  3  avgt5  
> 57.937 ±   0.505  ns/op
> Base64Decode.testBase64Decode   4  7  avgt5  
> 73.823 ±   1.452  ns/op
> Base64Decode.testBase64Decode   4 32  avgt5 
> 106.484 ±   1.243  ns/op
> Base64Decode.testBase64Decode   4 64  avgt5 
> 141.004 ±   1.188  ns/op
> Base64Decode.testBase64Decode   4 80  avgt5 
> 156.284 ±   0.572  ns/op
> Base64Decode.testBase64Decode   4 96  avgt5 
> 174.137 ±   0.177  ns/op
> Base64Decode.testBase64Decode   4112  avgt5 
> 188.445 ±   0.572  ns/op
> Base64Decode.testBase64Decode   4512  avgt5 
> 610.847 ±   1.559  ns/op
> Base64Decode.testBase64Decode   4   1000  avgt5
> 1155.368 ±   0.813  ns/op
> Base64Decode.testBase64Decode   4  

Re: RFR: 8253795: Implementation of JEP 391: macOS/AArch64 Port [v29]

2021-03-23 Thread Andrew Haley
On Tue, 23 Mar 2021 16:20:47 GMT, Ludovic Henry  wrote:

>> Anton Kozlov has updated the pull request with a new target base due to a 
>> merge or a rebase. The pull request now contains 115 commits:
>> 
>>  - Merge branch 'master' into jdk-macos
>>  - JDK-8262491: bsd_aarch64 part
>>  - JDK-8263002: bsd_aarch64 part
>>  - Merge remote-tracking branch 'upstream/jdk/master' into jdk-macos
>>  - Wider #ifdef block
>>  - Fix most of issues in java/foreign/ tests
>>
>>Failures related to va_args are tracked in JDK-8263512.
>>  - Add Azul copyright
>>  - Update Oracle copyright years
>>  - Use Thread::current_or_null_safe in SafeFetch
>>  - 8262903: [macos_aarch64] Thread::current() called on detached thread
>>  - ... and 105 more: 
>> https://git.openjdk.java.net/jdk/compare/a9d2267f...5add9269
>
> Marked as reviewed by luhenry (Author).

> > > > [ Back-porting this patch to JDK 11] depends on the will of openjdk11 
> > > > maintainers to accept this (and few other, like jep-388, as we depend 
> > > > on it) contribution.
> > > 
> > > 
> > > To the extent that 11u has fixed policies :) we definitely have a policy 
> > > of accepting patches to keep 11u working on current hardware. So yes.
> > 
> > 
> > @lewurm That sounds like a green flag for you and jep-388 (with its 
> > R18_RESERVED functionality) ;)
> 
> Thanks, @theRealAph, and @VladimirKempik . We are on it!

It's going to be tricky to do in a really clean way, given some of the 
weirdnesses of the ABI. However, I think there's probably a need for it.

-

PR: https://git.openjdk.java.net/jdk/pull/2200


Re: RFR: 8253795: Implementation of JEP 391: macOS/AArch64 Port [v29]

2021-03-23 Thread Andrew Haley
On Tue, 23 Mar 2021 13:54:24 GMT, Andrew Haley  wrote:

>> Anton Kozlov has updated the pull request with a new target base due to a 
>> merge or a rebase. The pull request now contains 115 commits:
>> 
>>  - Merge branch 'master' into jdk-macos
>>  - JDK-8262491: bsd_aarch64 part
>>  - JDK-8263002: bsd_aarch64 part
>>  - Merge remote-tracking branch 'upstream/jdk/master' into jdk-macos
>>  - Wider #ifdef block
>>  - Fix most of issues in java/foreign/ tests
>>
>>Failures related to va_args are tracked in JDK-8263512.
>>  - Add Azul copyright
>>  - Update Oracle copyright years
>>  - Use Thread::current_or_null_safe in SafeFetch
>>  - 8262903: [macos_aarch64] Thread::current() called on detached thread
>>  - ... and 105 more: 
>> https://git.openjdk.java.net/jdk/compare/a9d2267f...5add9269
>
> Marked as reviewed by aph (Reviewer).

> [ Back-porting this patch to JDK 11] depends on the will of openjdk11 
> maintainers to accept this (and few other, like jep-388, as we depend on it) 
> contribution.

To the extent that 11u has fixed policies :)  we definitely have a policy of 
accepting patches to keep 11u working on current hardware. So yes.

-

PR: https://git.openjdk.java.net/jdk/pull/2200


Re: RFR: 8253795: Implementation of JEP 391: macOS/AArch64 Port [v29]

2021-03-23 Thread Andrew Haley
On Mon, 22 Mar 2021 12:50:14 GMT, Anton Kozlov  wrote:

>> Please review the implementation of JEP 391: macOS/AArch64 Port.
>> 
>> It's heavily based on existing ports to linux/aarch64, macos/x86_64, and 
>> windows/aarch64. 
>> 
>> Major changes are in:
>> * src/hotspot/cpu/aarch64: support of the new calling convention (subtasks 
>> JDK-8253817, JDK-8253818)
>> * src/hotspot/os_cpu/bsd_aarch64: copy of os_cpu/linux_aarch64 with 
>> necessary adjustments (JDK-8253819)
>> * src/hotspot/share, test/hotspot/gtest: support of write-xor-execute (W^X), 
>> required on macOS/AArch64 platform. It's implemented with 
>> pthread_jit_write_protect_np provided by Apple. The W^X mode is local to a 
>> thread, so W^X mode change relates to the java thread state change (for java 
>> threads). In most cases, JVM executes in write-only mode, except when 
>> calling a generated stub like SafeFetch, which requires a temporary switch 
>> to execute-only mode. The same execute-only mode is enabled when a java 
>> thread executes in java or native states. This approach of managing W^X mode 
>> turned out to be simple and efficient enough.
>> * src/jdk.hotspot.agent: serviceability agent implementation (JDK-8254941)
>
> Anton Kozlov has updated the pull request with a new target base due to a 
> merge or a rebase. The pull request now contains 115 commits:
> 
>  - Merge branch 'master' into jdk-macos
>  - JDK-8262491: bsd_aarch64 part
>  - JDK-8263002: bsd_aarch64 part
>  - Merge remote-tracking branch 'upstream/jdk/master' into jdk-macos
>  - Wider #ifdef block
>  - Fix most of issues in java/foreign/ tests
>
>Failures related to va_args are tracked in JDK-8263512.
>  - Add Azul copyright
>  - Update Oracle copyright years
>  - Use Thread::current_or_null_safe in SafeFetch
>  - 8262903: [macos_aarch64] Thread::current() called on detached thread
>  - ... and 105 more: 
> https://git.openjdk.java.net/jdk/compare/a9d2267f...5add9269

Marked as reviewed by aph (Reviewer).

-

PR: https://git.openjdk.java.net/jdk/pull/2200


  1   2   3   4   5   6   >