Re: RFR: 8287903: Reduce runtime of java.math microbenchmarks
On Tue, 7 Jun 2022 12:34:25 GMT, Claes Redestad wrote: > - Reduce runtime by running fewer forks, fewer iterations, less warmup. All > micros tested in this group appear to stabilize very quickly. > - Refactor BigIntegers to avoid re-running some (most) micros over and over > with parameter values that don't affect them. > > Expected runtime down from 14 hours to 15 minutes. Marked as reviewed by aph (Reviewer). - PR: https://git.openjdk.java.net/jdk/pull/9062
Re: Why do we still need StrictMath?
On 5/8/22 13:14, Victor Williams Stafusa da Silva wrote: Sure, there are the x86 intrinsics. But since JEP 306 was delivered, is this still valid? The Motivation section of the JEP 306 seems to imply that this is not the case anymore. Of course, I could just be grossly misunderstanding what is/was JEP 306 and/or to which depth it meant by "restore always-strict floating-point semantics", but I don't think that I am the only one out there. JEP 306 isn't about j.l.StrictMath. The Description section of 306 says what it's really about. -- Andrew Haley (he/him) Java Platform Lead Engineer Red Hat UK Ltd. <https://www.redhat.com> https://keybase.io/andrewhaley EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671
Re: Why do we still need StrictMath?
On 5/8/22 13:44, Martin Desruisseaux wrote: Le 08/05/2022 à 14:15, Victor Williams Stafusa da Silva a écrit : Was that using Java 17+, which included JEP 306 delivered? No, but my understanding is that JEP 306 does not apply to Math versus StrictMath behavior. In my understanding, the strictfp keyword was only about the use of extended exponent value set in Pentium 80 bits floating point values. It impacts the behavior of arithmetic operations + - * /, but my (maybe wrong) understanding is that it does not impact which processor instruction is used for implementing Math.sin, cos, etc. You're right. -- Andrew Haley (he/him) Java Platform Lead Engineer Red Hat UK Ltd. <https://www.redhat.com> https://keybase.io/andrewhaley EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671
Re: Why do we still need StrictMath?
On 5/8/22 06:54, Victor Williams Stafusa da Silva wrote: If StrictMath is still needed and could produce different results than Math in some modern hardware, then by the javadocs, it seems to imply that Math should always delegate to StrictMath and never the other way around. Why is it not always the case? I think that the answer is simply because the StrictMath class was largely left unchanged and that delegating in one way or in the other could then produce a difference when the strictfp modifier was there, but is there a better reason than that? Some targets (x86, in particular) have intrinsics (log, trig) that are faster than StrictMath and also more accurate. StrictMath is not about accuracy, but cross-architecture down-to-the-last bit reproducibility. Whether we still need that reproducibility is, I suppose, something for debate. -- Andrew Haley (he/him) Java Platform Lead Engineer Red Hat UK Ltd. <https://www.redhat.com> https://keybase.io/andrewhaley EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671
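The distinction can be made concrete with a small sketch (class and method names here are illustrative, not from the thread): Math is specified to be within 1 ulp of the exact result and may use platform intrinsics, while StrictMath pins down a single bit pattern on every platform.

```java
// Illustrates the Math vs. StrictMath contract: StrictMath results are
// bit-for-bit reproducible across platforms (fdlibm semantics), while Math
// may use faster platform intrinsics within a 1-ulp accuracy bound.
public class StrictMathDemo {
    // Raw bit patterns make "down-to-the-last-bit" comparisons explicit.
    public static boolean sameBits(double a, double b) {
        return Double.doubleToRawLongBits(a) == Double.doubleToRawLongBits(b);
    }

    public static void main(String[] args) {
        double strict = StrictMath.sin(1.0); // identical on every JVM/platform
        double fast   = Math.sin(1.0);       // may be an intrinsic; within 1 ulp of exact
        // The two are often equal, but only StrictMath *guarantees* reproducibility.
        System.out.println(sameBits(strict, fast));
        System.out.println(Long.toHexString(Double.doubleToRawLongBits(strict)));
    }
}
```

Comparing raw bits is the right test here, since the whole point of StrictMath is reproducibility rather than accuracy.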
Re: RFR: 8284161: Implementation of Virtual Threads (Preview) [v8]
On Wed, 27 Apr 2022 19:21:58 GMT, Alan Bateman wrote: >> src/java.base/share/classes/jdk/internal/misc/UnsafeConstants.java line 122: >> >>> 120: */ >>> 121: >>> 122: public static final int SCOPED_CACHE_SHIFT; >> >> I can't find this constant being used. If added for future, maybe keep >> `UnsafeConstants` class and this static field package-private for now. > > It originally configured the number of lines in extent local cache but the > implementation has changed. @theRealAph would be best to comment on this, it > may be possible to delete it. Yes, it's dead. Thanks. - PR: https://git.openjdk.java.net/jdk/pull/8166
Re: OpenJDK or SE Java Floating Point Options?
On 4/16/22 09:04, sminervini.prism wrote: > It is still the case that I and others need Java floating point and StrictMath > (the key double type calculator class) to be repaired. If you are going to pursue this line of reasoning, you must be aware of some things. Firstly, it is not possible to change Java's core floating-point arithmetic in a compatible way. We certainly will not adopt decimal floating-point for Java's core arithmetic. Secondly, Java should not be your focus. We will not change Java's arithmetic in a way that is incompatible with other languages or which makes Java slower on existing hardware. You must read and fully understand this before you go any further. It will require a lot of study: https://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.html > May the Java Community Process reconsider the present floating point > operations and method calls situation, based on an imperfect > standard and improper workaround, and provide corrected, defaulting > or specifying-compatible ways for Java floating point arithmetic and > Calculator class method results to always cohere in their ranges, > without denormal and pronormal inclusion? In a word, no. That possibility is so unlikely that it is not worthy of consideration. -- Andrew Haley (he/him) Java Platform Lead Engineer Red Hat UK Ltd. <https://www.redhat.com> https://keybase.io/andrewhaley EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671
Re: RFR: 8279508: Auto-vectorize Math.round API [v17]
On Sun, 13 Mar 2022 06:36:15 GMT, Jatin Bhateja wrote: >> Summary of changes: >> - Intrinsify Math.round(float) and Math.round(double) APIs. >> - Extend auto-vectorizer to infer vector operations on encountering scalar >> IR nodes for above intrinsics. >> - Test creation using new IR testing framework. >> >> Following are the performance number of a JMH micro included with the patch >> >> Test System: Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz (Icelake Server) >> >> >> Benchmark | TESTSIZE | Baseline AVX3 (ops/ms) | Withopt AVX3 (ops/ms) | Gain >> ratio | Baseline AVX2 (ops/ms) | Withopt AVX2 (ops/ms) | Gain ratio >> -- | -- | -- | -- | -- | -- | -- | -- >> FpRoundingBenchmark.test_round_double | 1024.00 | 504.15 | 2209.54 | 4.38 | >> 510.36 | 548.39 | 1.07 >> FpRoundingBenchmark.test_round_double | 2048.00 | 293.64 | 1271.98 | 4.33 | >> 293.48 | 274.01 | 0.93 >> FpRoundingBenchmark.test_round_float | 1024.00 | 825.99 | 4754.66 | 5.76 | >> 751.83 | 2274.13 | 3.02 >> FpRoundingBenchmark.test_round_float | 2048.00 | 412.22 | 2490.09 | 6.04 | >> 388.52 | 1334.18 | 3.43 >> >> >> Kindly review and share your feedback. >> >> Best Regards, >> Jatin > > Jatin Bhateja has updated the pull request incrementally with one additional > commit since the last revision: > > 8279508: Windows build failure fix. src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 4143: > 4141: ldmxcsr(new_mxcsr); > 4142: // Move raw bits corresponding to double value 0.5 into scratch > register. > 4143: mov64(scratch, 4602678819172646912L); Suggestion: mov64(scratch, julong_cast(0.5)); - PR: https://git.openjdk.java.net/jdk/pull/7094
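The suggested `julong_cast(0.5)` works because 4602678819172646912 is exactly the IEEE-754 bit pattern of the double 0.5, which is easy to confirm from Java (class name is illustrative):

```java
// The magic constant in the assembler is just the IEEE-754 bit pattern
// of the double value 0.5, i.e. 0x3FE0000000000000L.
public class DoubleHalfBits {
    public static long halfBits() {
        return Double.doubleToRawLongBits(0.5);
    }

    public static void main(String[] args) {
        System.out.println(halfBits());                  // 4602678819172646912
        System.out.println(Long.toHexString(halfBits())); // 3fe0000000000000
    }
}
```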
Re: Should System.exit be controlled by a Scope Local?
On 3/1/22 18:17, Sean Mullan wrote: Can you explain in a little more detail as to what the compatibility issues are with preventing threads in thread pools from calling System.exit? It is possible, I suppose, that some rather odd programmer is using a task in a thread pool to exit their program. I know, unlikely, but we preserve compatibility. -- Andrew Haley (he/him) Java Platform Lead Engineer Red Hat UK Ltd. <https://www.redhat.com> https://keybase.io/andrewhaley EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671
Re: RFR: 8279508: Auto-vectorize Math.round API [v15]
On Sun, 13 Mar 2022 04:27:25 GMT, Jatin Bhateja wrote: >> src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 4024: >> >>> 4022: * the result is equal to the value of Integer.MAX_VALUE. >>> 4023: */ >>> 4024: void >>> C2_MacroAssembler::vector_cast_float_special_cases_avx(XMMRegister dst, >>> XMMRegister src, XMMRegister xtmp1, >> >> This special handling is really large, could we use a stub routine for it? > > Good suggestion, but as of now we are not using vector calling conventions > for stubs. I don't understand this comment. If the stub is only to be used by you, then you can determine your own calling convention. - PR: https://git.openjdk.java.net/jdk/pull/7094
Re: RFR: 8279508: Auto-vectorize Math.round API [v12]
On Wed, 9 Mar 2022 11:38:34 GMT, Jatin Bhateja wrote: >> Summary of changes: >> - Intrinsify Math.round(float) and Math.round(double) APIs. >> - Extend auto-vectorizer to infer vector operations on encountering scalar >> IR nodes for above intrinsics. >> - Test creation using new IR testing framework. >> >> Following are the performance number of a JMH micro included with the patch >> >> Test System: Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz (Icelake Server) >> >> >> Benchmark | TESTSIZE | Baseline AVX3 (ops/ms) | Withopt AVX3 (ops/ms) | Gain >> ratio | Baseline AVX2 (ops/ms) | Withopt AVX2 (ops/ms) | Gain ratio >> -- | -- | -- | -- | -- | -- | -- | -- >> FpRoundingBenchmark.test_round_double | 1024.00 | 504.15 | 2209.54 | 4.38 | >> 510.36 | 548.39 | 1.07 >> FpRoundingBenchmark.test_round_double | 2048.00 | 293.64 | 1271.98 | 4.33 | >> 293.48 | 274.01 | 0.93 >> FpRoundingBenchmark.test_round_float | 1024.00 | 825.99 | 4754.66 | 5.76 | >> 751.83 | 2274.13 | 3.02 >> FpRoundingBenchmark.test_round_float | 2048.00 | 412.22 | 2490.09 | 6.04 | >> 388.52 | 1334.18 | 3.43 >> >> >> Kindly review and share your feedback. >> >> Best Regards, >> Jatin > > Jatin Bhateja has updated the pull request with a new target base due to a > merge or a rebase. The pull request now contains 15 commits: > > - 8279508: Preventing domain switch-over penalty for Math.round(float) and > constraining unrolling to prevent code bloating. > - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8279508 > - 8279508: Removing +LogCompilation flag. > - 8279508: Review comments resolved.` > - 8279508: Adding descriptive comments. > - 8279508: Review comments resolved. > - 8279508: Review comments resolved. > - 8279508: Fixing for windows failure. > - 8279508: Adding few descriptive comments. > - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8279508 > - ... 
and 5 more: > https://git.openjdk.java.net/jdk/compare/d07f7c76...547f4e31 test/micro/org/openjdk/bench/java/math/FpRoundingBenchmark.java line 70: > 68: } > 69: > 70: FargV1 = new float[TESTSIZE]; `FargV1` is not initialized. You need to set `i = 0;` here. test/micro/org/openjdk/bench/java/math/FpRoundingBenchmark.java line 78: > 76: > 77: for (; i < TESTSIZE; i++) { > 78: FargV1[i] = r.nextFloat()*TESTSIZE; This is an unrealistically narrow range of values. I'd use Suggestion: FargV1[i] = Float.intBitsToFloat(r.nextInt()); - PR: https://git.openjdk.java.net/jdk/pull/7094
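The `Float.intBitsToFloat(r.nextInt())` suggestion draws inputs from the entire 32-bit pattern space, so subnormals, infinities and NaNs get exercised as well. A hedged sketch of such a setup, using illustrative names rather than the benchmark's actual fields:

```java
import java.util.Random;

// Sketch of a benchmark setup whose inputs cover the whole float bit space,
// so the intrinsic's special-case handling (NaN, +/-Inf, subnormals) is
// exercised too, not just a narrow uniform range.
public class RoundInputSetup {
    public static float[] fullRangeFloats(int size, long seed) {
        Random r = new Random(seed);
        float[] a = new float[size];
        for (int i = 0; i < size; i++) {
            a[i] = Float.intBitsToFloat(r.nextInt()); // every bit pattern is a valid float
        }
        return a;
    }

    public static void main(String[] args) {
        float[] a = fullRangeFloats(1024, 42L);
        System.out.println(a.length);
    }
}
```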
Re: RFR: 8279508: Auto-vectorize Math.round API [v12]
On Wed, 9 Mar 2022 11:38:34 GMT, Jatin Bhateja wrote: >> Summary of changes: >> - Intrinsify Math.round(float) and Math.round(double) APIs. >> - Extend auto-vectorizer to infer vector operations on encountering scalar >> IR nodes for above intrinsics. >> - Test creation using new IR testing framework. >> >> Following are the performance number of a JMH micro included with the patch >> >> Test System: Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz (Icelake Server) >> >> >> Benchmark | TESTSIZE | Baseline AVX3 (ops/ms) | Withopt AVX3 (ops/ms) | Gain >> ratio | Baseline AVX2 (ops/ms) | Withopt AVX2 (ops/ms) | Gain ratio >> -- | -- | -- | -- | -- | -- | -- | -- >> FpRoundingBenchmark.test_round_double | 1024.00 | 504.15 | 2209.54 | 4.38 | >> 510.36 | 548.39 | 1.07 >> FpRoundingBenchmark.test_round_double | 2048.00 | 293.64 | 1271.98 | 4.33 | >> 293.48 | 274.01 | 0.93 >> FpRoundingBenchmark.test_round_float | 1024.00 | 825.99 | 4754.66 | 5.76 | >> 751.83 | 2274.13 | 3.02 >> FpRoundingBenchmark.test_round_float | 2048.00 | 412.22 | 2490.09 | 6.04 | >> 388.52 | 1334.18 | 3.43 >> >> >> Kindly review and share your feedback. >> >> Best Regards, >> Jatin > > Jatin Bhateja has updated the pull request with a new target base due to a > merge or a rebase. The pull request now contains 15 commits: > > - 8279508: Preventing domain switch-over penalty for Math.round(float) and > constraining unrolling to prevent code bloating. > - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8279508 > - 8279508: Removing +LogCompilation flag. > - 8279508: Review comments resolved.` > - 8279508: Adding descriptive comments. > - 8279508: Review comments resolved. > - 8279508: Review comments resolved. > - 8279508: Fixing for windows failure. > - 8279508: Adding few descriptive comments. > - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8279508 > - ... 
and 5 more: > https://git.openjdk.java.net/jdk/compare/d07f7c76...547f4e31 test/micro/org/openjdk/bench/java/math/FpRoundingBenchmark.java line 114: > 112: for (int i = 0; i < TESTSIZE; i++) { > 113: ResF[i] = Math.round(FargV1[i]); > 114: } I think that this is wrong: you should not be storing the result into a float array because it requires an extra integer->float conversion, which distorts the timings. You need `resI` and `resL` for the results of `Math.round`. - PR: https://git.openjdk.java.net/jdk/pull/7094
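In other words, since Math.round(float) returns int and Math.round(double) returns long, the result arrays should be typed accordingly so the loop measures only the rounding. A minimal sketch (names are illustrative):

```java
// Math.round(float) yields int and Math.round(double) yields long, so
// storing into a float/double array adds an int->float (or long->double)
// conversion to the loop that distorts the timings.
public class RoundKernels {
    public static void roundFloats(float[] in, int[] resI) {
        for (int i = 0; i < in.length; i++) {
            resI[i] = Math.round(in[i]); // no conversion back to float
        }
    }

    public static void roundDoubles(double[] in, long[] resL) {
        for (int i = 0; i < in.length; i++) {
            resL[i] = Math.round(in[i]);
        }
    }

    public static void main(String[] args) {
        int[] r = new int[1];
        roundFloats(new float[]{2.5f}, r);
        System.out.println(r[0]);
    }
}
```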
Re: RFR: 8279508: Auto-vectorize Math.round API [v11]
On Wed, 2 Mar 2022 02:44:41 GMT, Jatin Bhateja wrote: >> Summary of changes: >> - Intrinsify Math.round(float) and Math.round(double) APIs. >> - Extend auto-vectorizer to infer vector operations on encountering scalar >> IR nodes for above intrinsics. >> - Test creation using new IR testing framework. >> >> Following are the performance number of a JMH micro included with the patch >> >> Test System: Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz (Icelake Server) >> >> >> Benchmark | TESTSIZE | Baseline AVX3 (ops/ms) | Withopt AVX3 (ops/ms) | Gain >> ratio | Baseline AVX2 (ops/ms) | Withopt AVX2 (ops/ms) | Gain ratio >> -- | -- | -- | -- | -- | -- | -- | -- >> FpRoundingBenchmark.test_round_double | 1024.00 | 504.15 | 2209.54 | 4.38 | >> 510.36 | 548.39 | 1.07 >> FpRoundingBenchmark.test_round_double | 2048.00 | 293.64 | 1271.98 | 4.33 | >> 293.48 | 274.01 | 0.93 >> FpRoundingBenchmark.test_round_float | 1024.00 | 825.99 | 4754.66 | 5.76 | >> 751.83 | 2274.13 | 3.02 >> FpRoundingBenchmark.test_round_float | 2048.00 | 412.22 | 2490.09 | 6.04 | >> 388.52 | 1334.18 | 3.43 >> >> >> Kindly review and share your feedback. >> >> Best Regards, >> Jatin > > Jatin Bhateja has updated the pull request incrementally with one additional > commit since the last revision: > > 8279508: Removing +LogCompilation flag. src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 4157: > 4155: ExternalAddress mxcsr_std(StubRoutines::x86::addr_mxcsr_std()); > 4156: ldmxcsr(new_mxcsr); > 4157: movl(scratch, 1056964608); Suggestion: movl(scratch, 1056964608); // Raw bits corresponding to floating point value 0.5f. - PR: https://git.openjdk.java.net/jdk/pull/7094
Re: RFR: 8279508: Auto-vectorize Math.round API [v11]
On Wed, 2 Mar 2022 02:44:41 GMT, Jatin Bhateja wrote: >> Summary of changes: >> - Intrinsify Math.round(float) and Math.round(double) APIs. >> - Extend auto-vectorizer to infer vector operations on encountering scalar >> IR nodes for above intrinsics. >> - Test creation using new IR testing framework. >> >> Following are the performance number of a JMH micro included with the patch >> >> Test System: Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz (Icelake Server) >> >> >> Benchmark | TESTSIZE | Baseline AVX3 (ops/ms) | Withopt AVX3 (ops/ms) | Gain >> ratio | Baseline AVX2 (ops/ms) | Withopt AVX2 (ops/ms) | Gain ratio >> -- | -- | -- | -- | -- | -- | -- | -- >> FpRoundingBenchmark.test_round_double | 1024.00 | 504.15 | 2209.54 | 4.38 | >> 510.36 | 548.39 | 1.07 >> FpRoundingBenchmark.test_round_double | 2048.00 | 293.64 | 1271.98 | 4.33 | >> 293.48 | 274.01 | 0.93 >> FpRoundingBenchmark.test_round_float | 1024.00 | 825.99 | 4754.66 | 5.76 | >> 751.83 | 2274.13 | 3.02 >> FpRoundingBenchmark.test_round_float | 2048.00 | 412.22 | 2490.09 | 6.04 | >> 388.52 | 1334.18 | 3.43 >> >> >> Kindly review and share your feedback. >> >> Best Regards, >> Jatin > > Jatin Bhateja has updated the pull request incrementally with one additional > commit since the last revision: > > 8279508: Removing +LogCompilation flag. src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 4157: > 4155: ExternalAddress mxcsr_std(StubRoutines::x86::addr_mxcsr_std()); > 4156: ldmxcsr(new_mxcsr); > 4157: movl(scratch, 1056964608); What is 1056964608 ? - PR: https://git.openjdk.java.net/jdk/pull/7094
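For the record, 1056964608 is the raw bit pattern of the float 0.5f (0x3F000000), which can be confirmed directly:

```java
// 1056964608 == 0x3F000000: sign 0, biased exponent 126, mantissa 0,
// i.e. the IEEE-754 encoding of 0.5f.
public class FloatHalfBits {
    public static void main(String[] args) {
        System.out.println(Float.floatToRawIntBits(0.5f));   // 1056964608
        System.out.println(Integer.toHexString(1056964608)); // 3f000000
    }
}
```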
Re: RFR: 8281146: Replace StringCoding.hasNegatives with countPositives [v9]
On Thu, 3 Mar 2022 13:31:35 GMT, Claes Redestad wrote: > > Well, it just didn't build. With the annotation being present, you also > > need an intrinsic implementation. That's what the error message is saying... > > Doh, I had no idea the presence of `@IntrinsicCandidate` was mandating the VM > has an intrinsic implementation these days (Why?! Not much of a _candidate_ > if it's required..). I don't think the intrinsic has to be implemented on every target, but AFAICR it does have to be declared as an intrinsic in HotSpot. - PR: https://git.openjdk.java.net/jdk/pull/7231
Re: RFR: 8281146: Replace StringCoding.hasNegatives with countPositives [v9]
On Thu, 3 Mar 2022 12:04:47 GMT, Claes Redestad wrote: >> I'm requesting comments and, hopefully, some help with this patch to replace >> `StringCoding.hasNegatives` with `countPositives`. The new method does a >> very similar pass, but alters the intrinsic to return the number of leading >> bytes in the `byte[]` range which only has positive bytes. This allows for >> dealing much more efficiently with those `byte[]`s that have an ASCII prefix, >> with no measurable cost on ASCII-only or latin1/UTF16-mostly input. >> >> Microbenchmark results: >> https://jmh.morethan.io/?gists=428b487e92e3e47ccb7f169501600a88,3c585de7435506d3a3bdb32160fe8904 >> >> - Only implemented on x86 for now, but I want to verify that implementations >> of `countPositives` can be implemented with similar efficiency on all >> platforms that today implement a `hasNegatives` intrinsic (aarch64, ppc etc) >> before moving ahead. This pretty much means holding this up until it's >> implemented on all platforms, which can either be contributed to this PR or as >> dependent follow-ups. >> >> - An alternative to holding up until all platforms are on board is to allow >> the implementation of `StringCoding.hasNegatives` and `countPositives` to be >> implemented so that the non-intrinsified method calls into the intrinsified. >> This requires structuring the implementations differently based on which >> intrinsic - if any - is actually implemented. One way to do this could be to >> mimic how `java.nio` handles unaligned accesses and expose which intrinsic >> is available via `Unsafe` into a `static final` field. >> >> - There are a few minor regressions (~5%) in the x86 implementation on >> `encode-/decodeLatin1Short`. 
Those regressions disappear when mixing inputs, >> for example `encode-/decodeShortMixed` even see a minor improvement, which >> makes me consider those corner case regressions with little real world >> implications (if you have latin1 Strings, you're likely to also have >> ASCII-only strings in your mix). > > Claes Redestad has updated the pull request incrementally with one additional > commit since the last revision: > > Document that it's allowed for implementations to return values less than > the exact count (iff there are negative bytes) > * There are a few minor regressions (~5%) in the x86 implementation on > `encode-/decodeLatin1Short`. Those regressions disappear when mixing inputs, > for example `encode-/decodeShortMixed` even see a minor improvement, which > makes me consider those corner case regressions with little real world > implications (if you have latin1 Strings, you're likely to also have > ASCII-only strings in your mix). I'm not sure that we can disregard such cases. You're probably right, but it'd be interesting to know the actual cause of the problem, perhaps with perfasm. Or do you know already? - PR: https://git.openjdk.java.net/jdk/pull/7231
Re: RFR: 8281146: Replace StringCoding.hasNegatives with countPositives [v6]
On Tue, 1 Mar 2022 19:12:17 GMT, Claes Redestad wrote: > > @theRealAph , @a74nh or someone familiar with aarch64 code, please review > > aarch64 changes. > > Note that the aarch64 changes I've put in for now implements `countPositives` > to return `0` if there's a negative value anywhere, otherwise `len`. This way > we can remove the intrinsic scaffolding for `hasNegatives` once I integrate > s390 and ppc Sure. And we don't have to check which of the 8 bytes in a word has its top bit set, so we don't have to do any more work, just return the count when we stopped searching. So there's no possible performance regression on AArch64 as far as I can see. - PR: https://git.openjdk.java.net/jdk/pull/7231
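A scalar reference for the countPositives contract under discussion might look as follows (a sketch, not the JDK implementation; per the spec being reviewed, an intrinsic may return any value less than or equal to the exact count when a negative byte is present):

```java
// Reference semantics: the number of leading non-negative ("positive"
// ASCII) bytes in the given range. An intrinsified implementation is
// permitted to return less than the exact count whenever a negative byte
// exists -- e.g. 0 on aarch64, as described above.
public class CountPositives {
    public static int countPositives(byte[] ba, int off, int len) {
        for (int i = off; i < off + len; i++) {
            if (ba[i] < 0) {
                return i - off; // exact prefix length; intrinsics may return less
            }
        }
        return len;             // no negative byte in the range
    }

    public static void main(String[] args) {
        System.out.println(countPositives(new byte[]{'a', 'b', (byte) 0x80}, 0, 3));
    }
}
```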
Re: Should System.exit be controlled by a Scope Local?
On 3/1/22 11:45, Andrew Haley wrote: Sure, you wouldn't be able to use the default thread pool, but that's no big deal, I would have thought. I'm sorry, I'll say that again. :-) I meant to say "you wouldn't be able to use the default thread pool if you wanted to use threads with some restrictions (e.g. those that couldn't invoke System.exit())". -- Andrew Haley (he/him) Java Platform Lead Engineer Red Hat UK Ltd. <https://www.redhat.com> https://keybase.io/andrewhaley EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671
Re: Should System.exit be controlled by a Scope Local?
On 2/28/22 15:32, Andrew Haley wrote: I think all we'd need is a set of capabilities bound to a scope local at thread startup, and I guess it'd default to "all capabilities". Trusted code could then override any of those capabilities. We'd have to make sure that capabilities were inherited by threads, and we'd have to think very carefully about thread pools. The problem there is that while it would (I guess) make sense to prevent all code executing in thread pools from calling System.exit(), there's an obvious compatibility problem if it can't. Although... there certainly is some potential profit in restricted thread pools, which have no compatibility problems because it'd be a new feature. I think this solves the problem Alan Bateman raised too. Sure, you wouldn't be able to use the default thread pool, but that's no big deal, I would have thought. -- Andrew Haley (he/him) Java Platform Lead Engineer Red Hat UK Ltd. <https://www.redhat.com> https://keybase.io/andrewhaley EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671
Re: Should System.exit be controlled by a Scope Local?
On 2/28/22 15:12, Sean Mullan wrote: > > On 2/27/22 1:47 PM, Andrew Haley wrote: > >> I'd like to explore the use of scope locals as a lightweight means to >> implement a system of permissions and capabilities for things such as >> this. > > Now you have piqued my curiosity, as I have explored a capability based > model for intercepting `System.exit`. Can you say any more about this yet? I think all we'd need is a set of capabilities bound to a scope local at thread startup, and I guess it'd default to "all capabilities". Trusted code could then override any of those capabilities. We'd have to make sure that capabilities were inherited by threads, and we'd have to think very carefully about thread pools. The problem there is that while it would (I guess) make sense to prevent all code executing in thread pools from calling System.exit(), there's an obvious compatibility problem if it can't. -- Andrew Haley (he/him) Java Platform Lead Engineer Red Hat UK Ltd. <https://www.redhat.com> https://keybase.io/andrewhaley EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671
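As a rough illustration of the capability idea (not a proposed API — scope locals were still in preview at the time), an inheritable thread-local can stand in for the scope local; every name below is hypothetical:

```java
import java.util.Set;

// Hypothetical sketch: a per-thread capability set, inherited by child
// threads, consulted before a privileged operation such as System.exit().
// A real design would bind this to a scope local rather than a
// ThreadLocal, and would have to address thread pools, where worker
// threads outlive the task that restricted them.
public class Capabilities {
    public static final String EXIT_VM = "exitVM";

    private static final InheritableThreadLocal<Set<String>> CAPS =
        new InheritableThreadLocal<>() {
            @Override protected Set<String> initialValue() {
                return Set.of(EXIT_VM); // default: "all capabilities"
            }
        };

    public static boolean has(String cap) {
        return CAPS.get().contains(cap);
    }

    // Trusted code narrows the capability set for the current thread
    // (and, via inheritance, for threads it subsequently starts).
    public static void restrictTo(Set<String> caps) {
        CAPS.set(Set.copyOf(caps));
    }

    public static void checkedExit(int status) {
        if (!has(EXIT_VM)) {
            throw new SecurityException("exitVM capability not granted");
        }
        System.exit(status);
    }

    public static void main(String[] args) {
        System.out.println(has(EXIT_VM)); // true by default
    }
}
```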
Re: Should System.exit be controlled by a Scope Local?
On 2/26/22 22:14, Ethan McCue wrote: As called out in JEP 411, one of the remaining legitimate uses of the Security Manager is to intercept calls to System.exit. This seems like a decent use case for the Scope Local mechanism. It could well be. One problem, at least in the preview version of scope locals, is that scope locals are only inherited in a structured concurrency context, so it wouldn't protect against starting a new Thread which then called System.exit(). I'd like to explore the use of scope locals as a lightweight means to implement a system of permissions and capabilities for things such as this. -- Andrew Haley (he/him) Java Platform Lead Engineer Red Hat UK Ltd. <https://www.redhat.com> https://keybase.io/andrewhaley EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671
Re: RFR: 8282047: Enhance StringDecode/Encode microbenchmarks [v2]
On Fri, 25 Feb 2022 10:19:17 GMT, Claes Redestad wrote: >> test/micro/org/openjdk/bench/java/lang/StringDecode.java line 72: >> >>> 70: public void setup() { >>> 71: charset = Charset.forName(charsetName); >>> 72: asciiString = LOREM.substring(0, 32).getBytes(charset); >> >> This is problematic IMO in that it's missing short strings such as "Claes". >> Average Java strings are about 32 bytes long AFAICR, and people writing >> (vectorized) intrinsics have a nasty habit of optimizing for long strings, >> to the detriment of typical-length ones. >> Whether we like it or not, people will optimize for benchmarks, so it's >> important that benchmark data is realistic. The shortest here is 15 bytes, >> as far as I can see. I'd certainly include a short string of just a few >> bytes so that intrinsics don't cause regressions in important cases. > > All good points. I've added a number of such short variants to all(?) > relevant microbenchmarks. The tests should now better cover a mix of input > lengths and encodings. - PR: https://git.openjdk.java.net/jdk/pull/7516
Re: RFR: 8282047: Enhance StringDecode/Encode microbenchmarks [v2]
On Fri, 25 Feb 2022 09:19:33 GMT, Claes Redestad wrote: >> Splitting out these micro changes from #7231 >> >> - Clean up and simplify setup and code >> - Add variants with different inputs with varying lengths and encoding >> weights, but also relevant mixes of each so that we both cover interesting >> corner cases but also verify that performance behaves when there's a >> multitude of input shapes. Both simple and mixed variants are interesting >> diagnostic tools. >> - Drop all charsets from the default run configuration except UTF-8. >> Motivation: Defaults should give good coverage but try to keep runtimes at >> bay. Additionally if the charset tested can't encode the higher codepoints >> used in these micros the results might be misleading. If you for example >> test using ISO-8859-1 the UTF-16 bytes in StringDecode.decodeUTF16 will have >> all been replaced by `?`, so the test is effectively the same as testing >> ASCII-only. > > Claes Redestad has updated the pull request incrementally with one additional > commit since the last revision: > > Brent review test/micro/org/openjdk/bench/java/lang/StringDecode.java line 72: > 70: public void setup() { > 71: charset = Charset.forName(charsetName); > 72: asciiString = LOREM.substring(0, 32).getBytes(charset); This is problematic IMO in that it's missing short strings such as "Claes". Average Java strings are about 32 bytes long AFAICR, and people writing (vectorized) intrinsics have a nasty habit of optimizing for long strings, to the detriment of typical-length ones. Whether we like it or not, people will optimize for benchmarks, so it's important that benchmark data is realistic. The shortest here is 15 bytes, as far as I can see. I'd certainly include a short string of just a few bytes so that intrinsics don't cause regressions in important cases. - PR: https://git.openjdk.java.net/jdk/pull/7516
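A hedged sketch of the kind of input mix being asked for, with illustrative strings and names rather than the benchmark's actual data:

```java
import java.nio.charset.StandardCharsets;

// Sketch: cover very short, typical (~32 byte) and long inputs, so a
// vectorized intrinsic cannot win the benchmark on long strings alone.
public class EncodeInputs {
    public static byte[][] asciiInputs(String lorem) {
        return new byte[][] {
            "Claes".getBytes(StandardCharsets.UTF_8),                 // a few bytes
            lorem.substring(0, 32).getBytes(StandardCharsets.UTF_8),  // typical length
            lorem.getBytes(StandardCharsets.UTF_8)                    // longer input
        };
    }

    public static void main(String[] args) {
        String lorem = "Lorem ipsum dolor sit amet, consectetur adipiscing elit.";
        for (byte[] b : asciiInputs(lorem)) {
            System.out.println(b.length);
        }
    }
}
```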
Re: RFR: 8209784: Include hsdis in the JDK
On Tue, 22 Feb 2022 16:10:22 GMT, Magnus Ihse Bursie wrote: > This PR adds a new configure argument, `--enable-hsdis-bundling`. Setting > this, together with a valid backend using `--with-hsdis`, will bundle the > generated hsdis library with the JDK. Maybe I missed it, but are there instructions somewhere about how to build this, and what should be downloaded? - PR: https://git.openjdk.java.net/jdk/pull/7578
Re: RFR: 8281631: HashMap.putAll can cause redundant space waste [v3]
On 2/11/22 19:25, XenoAmess wrote: On Fri, 11 Feb 2022 18:24:49 GMT, Andrew Haley wrote: Just multiply by 0.75. On a modern design, floating-point multiply is 4 clocks latency, 4 ops/clock throughput. FP max is 2 clocks latency, conversions int-float and vice versa 3 clocks latency, 4 ops/clock throughput. Long division is 7-9 clocks, 2 ops/clock throughput. Shift and add 2 clocks, 2/3 ops/clock throughput. Compare is 1 clock, 3 ops/clock throughput, conditional move is 1 clock, 3 ops/clock throughput. Seems like it's a wash. @theRealAph no multiply but divide. Well yes, but that doesn't look at all hard to change. Besides, did you count the cost for Math.ceil? It is the heaviest part. Yes. 3 clocks latency, 4 ops/clock throughput. Your hardware may vary. And that instruction does both the ceil() and the float-int conversion. (Having said that, I don't know if we currently generate optimal code for this operation. But of course that can be fixed.) I don't think this is terribly important, but I don't like to see attempts at hand optimization in the standard library.
Re: RFR: 8279508: Auto-vectorize Math.round API [v3]
On Sun, 13 Feb 2022 13:12:35 GMT, Jatin Bhateja wrote: >>> Hi, IIRC for evex encoding you can embed the RC control bit directly in the >>> evex prefix, removing the need to rely on the global MXCSR register. Thanks. >> >> Hi @merykitty , You are correct, we can embed the RC mode in the instruction >> encoding of the round instruction (towards -inf, +inf, zero). But to match the >> semantics of the Math.round API one needs to add 0.5[f] to the input value and then >> perform rounding over the resultant value, which is why @sviswa7 suggested to >> use a global rounding mode driven by MXCSR.RC so that intermediate floating-point >> inexact values are resolved as desired, but OOO execution may misplace >> LDMXCSR and hence may have undesired side effects. > >> What does this do? Comment, even pseudo code, would be nice. > > Thanks @theRealAph , I shall append the comments over the routine. > BTW, the entire rounding algorithm can also be implemented using the Vector API, > which can perform if-conversion using masked operations:
> 
> class roundf {
>     public static VectorSpecies<Integer> ISPECIES = IntVector.SPECIES_512;
>     public static VectorSpecies<Float> SPECIES = FloatVector.SPECIES_512;
> 
>     public static int round_vector(float[] a, int[] r, int ctr) {
>         IntVector shiftVBC = (IntVector) ISPECIES.broadcast(24 - 2 + 127);
>         for (int i = 0; i < a.length; i += SPECIES.length()) {
>             FloatVector fv = FloatVector.fromArray(SPECIES, a, i);
>             IntVector iv = fv.reinterpretAsInts();
>             IntVector biasedExpV = iv.lanewise(VectorOperators.AND, 0x7F800000);
>             biasedExpV = biasedExpV.lanewise(VectorOperators.ASHR, 23);
>             IntVector shiftV = shiftVBC.lanewise(VectorOperators.SUB, biasedExpV);
>             VectorMask<Integer> cond = shiftV.lanewise(VectorOperators.AND, -32)
>                                              .compare(VectorOperators.EQ, 0);
>             IntVector res = iv.lanewise(VectorOperators.AND, 0x007FFFFF)
>                               .lanewise(VectorOperators.OR, 0x007FFFFF + 1);
>             VectorMask<Integer> cond1 = iv.compare(VectorOperators.LT, 0);
>             VectorMask<Integer> cond2 = cond1.and(cond);
>             res = res.lanewise(VectorOperators.NEG, cond2);
>             res = res.lanewise(VectorOperators.ASHR, shiftV)
>                      .lanewise(VectorOperators.ADD, 1)
>                      .lanewise(VectorOperators.ASHR, 1);
>             res = fv.convert(VectorOperators.F2I, 0)
>                     .reinterpretAsInts()
>                     .blend(res, cond);
>             res.intoArray(r, i);
>         }
>         return r[ctr];
>     }
> }
That pseudocode would make a very useful comment too. This whole patch is very thinly commented. - PR: https://git.openjdk.java.net/jdk/pull/7094
Re: RFR: 8279508: Auto-vectorize Math.round API [v3]
On Sun, 13 Feb 2022 03:09:43 GMT, Jatin Bhateja wrote: >> Summary of changes: >> - Intrinsify Math.round(float) and Math.round(double) APIs. >> - Extend auto-vectorizer to infer vector operations on encountering scalar >> IR nodes for above intrinsics. >> - Test creation using new IR testing framework. >> >> Following are the performance number of a JMH micro included with the patch >> >> Test System: Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz (Icelake Server) >> >> >> Benchmark | TESTSIZE | Baseline AVX3 (ops/ms) | Withopt AVX3 (ops/ms) | Gain >> ratio | Baseline AVX2 (ops/ms) | Withopt AVX2 (ops/ms) | Gain ratio >> -- | -- | -- | -- | -- | -- | -- | -- >> FpRoundingBenchmark.test_round_double | 1024.00 | 584.99 | 1870.70 | 3.20 | >> 510.35 | 548.60 | 1.07 >> FpRoundingBenchmark.test_round_double | 2048.00 | 257.17 | 965.33 | 3.75 | >> 293.60 | 273.15 | 0.93 >> FpRoundingBenchmark.test_round_float | 1024.00 | 825.69 | 3592.54 | 4.35 | >> 825.32 | 1836.42 | 2.23 >> FpRoundingBenchmark.test_round_float | 2048.00 | 388.55 | 1895.77 | 4.88 | >> 412.31 | 945.82 | 2.29 >> >> >> Kindly review and share your feedback. >> >> Best Regards, >> Jatin > > Jatin Bhateja has updated the pull request with a new target base due to a > merge or a rebase. The incremental webrev excludes the unrelated changes > brought in by the merge/rebase. The pull request contains four additional > commits since the last revision: > > - 8279508: Adding vectorized algorithms to match the semantics of rounding > operations. > - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8279508 > - 8279508: Adding a test for scalar intrinsification. > - 8279508: Auto-vectorize Math.round API src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 4066: > 4064: } > 4065: > 4066: void > C2_MacroAssembler::vector_cast_double_special_cases_evex(XMMRegister dst, > XMMRegister src, XMMRegister xtmp1, What does this do? Comment, even pseudo code, would be nice. 
- PR: https://git.openjdk.java.net/jdk/pull/7094
Re: RFR: 8281631: HashMap.putAll can cause redundant space waste [v3]
On Fri, 11 Feb 2022 18:13:36 GMT, Roger Riggs wrote: >> @RogerRiggs >> so you recommend `(int) Math.min(((m.size() * 4L + 2L) / 3L), >> Integer.MAX_VALUE)`? OK then, changed it. >> please review again, thanks! > > Works for me. Thanks Just multiply by 0.75. On a modern design, floating-point multiply is 4 clocks latency, 4 ops/clock throughput. FP max is 2 clocks latency, conversions int-float and vice versa 3 clocks latency, 4 ops/clock throughput. Long division is 7-9 clocks, 2 ops/clock throughput. Shift and add 2 clocks, 2/3 ops/clock throughput. Compare is 1 clock, 3 ops/clock throughput, conditional move is 1 clock, 3 ops/clock throughput. Seems like it's a wash. - PR: https://git.openjdk.java.net/jdk/pull/7431
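For reference, the integer formula under discussion and a floating-point equivalent can be compared side by side. A minimal sketch with hypothetical helper names (not JDK code); the claim that the two agree is an observation about these two formulas at the default 0.75 load factor, not something stated in the thread:

```java
// Two ways to size a HashMap so that n mappings fit without a resize at the
// default 0.75 load factor. Helper names are illustrative, not JDK code.
public class CapacityCalc {
    // Integer arithmetic: ceil(n / 0.75) computed as (4n + 2) / 3, clamped.
    static int capacityByLongMath(int n) {
        return (int) Math.min((n * 4L + 2L) / 3L, Integer.MAX_VALUE);
    }

    // Floating-point: size directly by n / 0.75; 0.75 is exactly
    // representable, and 4n/3 fits well within double precision for int n,
    // so ceil() here matches the integer formula.
    static int capacityByFpMath(int n) {
        return (int) Math.min((long) Math.ceil(n / 0.75), Integer.MAX_VALUE);
    }
}
```

Both expressions yield the same capacity for non-negative `int` inputs; the choice between them is the latency/throughput trade-off Haley describes above.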
Re: a quick question about String
On 12/30/21 16:12, Alexander Scherbatiy wrote: Would it be useful to have some kind of an immutable array in Java language which works in the same way as ordinary array except it is not to possible to change its values after creation? Yes. https://openjdk.java.net/jeps/8261007 -- Andrew Haley (he/him) Java Platform Lead Engineer Red Hat UK Ltd. <https://www.redhat.com> https://keybase.io/andrewhaley EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671
Re: Adding an @Immutable annotation to Java
On 11/25/21 08:27, Justin Dekeyser wrote: Quick question, out of curiosity: how would it behave with respect to inheritance? Can an @Immutable class inherit from a non-immutable one? And: does @Immutable mean deeply immutable? IMO it really should, but that's harder to check, and we'd have to think about what this means for binary compatibility. On 11/25/21 08:39, Alberto Otero Rodríguez wrote: > I have not thought about that. I'm not a Java expert. > > I just threw the idea out precisely to avoid "fake immutability", because a programmer could think a record is immutable simply by being a record, while this is false. > > There are probably lots of problems that need to be taken into consideration (like inheritance). But I just threw the idea out because I think it would be a desirable feature. > > I would be grateful if some expert could dig into possible problems and their solutions. Everything is always harder than it looks at first. -- Andrew Haley (he/him) Java Platform Lead Engineer Red Hat UK Ltd. <https://www.redhat.com> https://keybase.io/andrewhaley EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671
Re: RFR: 8276217: Harmonize StrictMath intrinsics handling [v3]
On Tue, 2 Nov 2021 06:25:33 GMT, Aleksey Shipilev wrote: >> This blocks JDK-8276215: `StrictMath` intrinsics are handled peculiarly by >> giving failing intrinsics a second chance to match against the similar >> `Math` intrinsics. This has interesting consequence for matchers: we can >> match the native `StrictMath.sqrt` to non-native intrinsic for `Math.sqrt`. >> Interpreter would then have to disambiguate the two. It could be made >> simpler and more consistent. >> >> For `min`/`max` methods, `StrictMath` already delegates to `Math` methods, >> so we can just drop the intrinsics for them. `sqrt` is harder to delegate, >> because it is `native` and a part of public API, so we can instead do the >> proper special intrinsic for it. >> >> There seem to be no performance regressions with this patch at least on >> Linux x86_64: >> >> >> $ CONF=linux-x86_64-server-release make test TEST="micro:StrictMathBench" >> >> Benchmark Mode Cnt Score Error Units >> >> ### Before >> >> StrictMathBench.minDouble thrpt4 230921.558 ± 234.238 ops/ms >> StrictMathBench.minFloat thrpt4 230932.303 ± 126.721 ops/ms >> StrictMathBench.minInt thrpt4 230917.256 ± 73.008 ops/ms >> StrictMathBench.minLongthrpt4 194460.828 ± 178.079 ops/ms >> >> >> StrictMathBench.maxDouble thrpt4 230983.180 ± 161.211 ops/ms >> StrictMathBench.maxFloat thrpt4 230969.290 ± 277.500 ops/ms >> StrictMathBench.maxInt thrpt4 231033.581 ± 200.015 ops/ms >> StrictMathBench.maxLongthrpt4 194590.744 ± 114.295 ops/ms >> >> >> StrictMathBench.sqrtDouble thrpt4 230722.037 ± .080 ops/ms >> >> ### After >> >> StrictMathBench.minDouble thrpt4 230976.625 ± 67.338 ops/ms >> StrictMathBench.minFloat thrpt4 230896.021 ± 270.434 ops/ms >> StrictMathBench.minInt thrpt4 230859.741 ± 403.147 ops/ms >> StrictMathBench.minLongthrpt4 194456.673 ± 111.557 ops/ms >> >> StrictMathBench.maxDouble thrpt4 230890.776 ± 89.924 ops/ms >> StrictMathBench.maxFloat thrpt4 230918.334 ± 63.160 ops/ms >> StrictMathBench.maxInt thrpt4 231059.128 ± 
51.224 ops/ms >> StrictMathBench.maxLongthrpt4 194488.210 ± 495.224 ops/ms >> >> StrictMathBench.sqrtDouble thrpt4 231023.703 ± 247.330 ops/ms >> >> >> Additional testing: >> - [x] `StrictMath` benchmarks >> - [x] Linux x86_64 fastdebug `java/lang/StrictMath`, `java/lang/Math` >> - [x] Linux x86_64 fastdebug `tier1` > > Aleksey Shipilev has updated the pull request incrementally with one > additional commit since the last revision: > > Keep intrinsics on StrictMath Marked as reviewed by aph (Reviewer). - PR: https://git.openjdk.java.net/jdk/pull/6184
Re: RFR: 8276217: Harmonize StrictMath intrinsics handling
On Mon, 1 Nov 2021 11:23:16 GMT, Aleksey Shipilev wrote: > This blocks JDK-8276215: `StrictMath` intrinsics are handled peculiarly by > giving failing intrinsics a second chance to match against the similar `Math` > intrinsics. This has interesting consequence for matchers: we can match the > native `StrictMath.sqrt` to non-native intrinsic for `Math.sqrt`. Interpreter > would then have to disambiguate the two. It could be made simpler and more > consistent. > > For `min`/`max` methods, `StrictMath` already delegates to `Math` methods, so > we can just drop the intrinsics for them. `sqrt` is harder to delegate, > because it is `native` and a part of public API, so we can instead do the > proper special intrinsic for it. > > There seem to be no performance regressions with this patch at least on Linux > x86_64: > > > $ CONF=linux-x86_64-server-release make test TEST="micro:StrictMathBench" > > Benchmark Mode Cnt Score Error Units > > ### Before > > StrictMathBench.minDouble thrpt4 230921.558 ± 234.238 ops/ms > StrictMathBench.minFloat thrpt4 230932.303 ± 126.721 ops/ms > StrictMathBench.minInt thrpt4 230917.256 ± 73.008 ops/ms > StrictMathBench.minLongthrpt4 194460.828 ± 178.079 ops/ms > > > StrictMathBench.maxDouble thrpt4 230983.180 ± 161.211 ops/ms > StrictMathBench.maxFloat thrpt4 230969.290 ± 277.500 ops/ms > StrictMathBench.maxInt thrpt4 231033.581 ± 200.015 ops/ms > StrictMathBench.maxLongthrpt4 194590.744 ± 114.295 ops/ms > > > StrictMathBench.sqrtDouble thrpt4 230722.037 ± .080 ops/ms > > ### After > > StrictMathBench.minDouble thrpt4 230976.625 ± 67.338 ops/ms > StrictMathBench.minFloat thrpt4 230896.021 ± 270.434 ops/ms > StrictMathBench.minInt thrpt4 230859.741 ± 403.147 ops/ms > StrictMathBench.minLongthrpt4 194456.673 ± 111.557 ops/ms > > StrictMathBench.maxDouble thrpt4 230890.776 ± 89.924 ops/ms > StrictMathBench.maxFloat thrpt4 230918.334 ± 63.160 ops/ms > StrictMathBench.maxInt thrpt4 231059.128 ± 51.224 ops/ms > StrictMathBench.maxLongthrpt4 
194488.210 ± 495.224 ops/ms > > StrictMathBench.sqrtDouble thrpt4 231023.703 ± 247.330 ops/ms > > > Additional testing: > - [x] `StrictMath` benchmarks > - [x] Linux x86_64 fastdebug `tier1` So we have `_dsqrt` and `_dsqrt_strict`, which must be functionally identical, but we provide both names because they're part of a public API. I think this deserves an explanatory comment in the code. - PR: https://git.openjdk.java.net/jdk/pull/6184
Re: RFR: 8276096: Simplify Unsafe.{load|store}Fence fallbacks by delegating to fullFence
On Thu, 28 Oct 2021 08:47:31 GMT, Aleksey Shipilev wrote: > `Unsafe.{load|store}Fence` falls back to `unsafe.cpp` for > `OrderAccess::{acquire|release}Fence()`. It seems too heavy-handed (useless?) > to call to runtime for a single memory barrier. We can simplify the native > `Unsafe` interface by falling back to `fullFence` when `{load|store}Fence` > intrinsics are not available. This would be similar to what > `Unsafe.{loadLoad|storeStore}Fences` do. > > This is the behavior of these intrinsics now, on x86_64, using benchmarks > from JDK-8276054: > > > Benchmark Mode Cnt Score Error Units > > # Default > Single.acquire avgt3 0.407 ± 0.060 ns/op > Single.fullavgt3 4.693 ± 0.005 ns/op > Single.loadLoadavgt3 0.415 ± 0.095 ns/op > Single.plain avgt3 0.406 ± 0.002 ns/op > Single.release avgt3 0.408 ± 0.047 ns/op > Single.storeStore avgt3 0.408 ± 0.043 ns/op > > # -XX:DisableIntrinsic=_storeFence > Single.acquire avgt3 0.408 ± 0.016 ns/op > Single.fullavgt3 4.694 ± 0.002 ns/op > Single.loadLoadavgt3 0.406 ± 0.002 ns/op > Single.plain avgt3 0.406 ± 0.001 ns/op > Single.release avgt3 4.694 ± 0.003 ns/op <--- upgraded to full > Single.storeStore avgt3 4.690 ± 0.005 ns/op <--- upgraded to full > > # -XX:DisableIntrinsic=_loadFence > Single.acquire avgt3 4.691 ± 0.001 ns/op <--- upgraded to full > Single.fullavgt3 4.693 ± 0.009 ns/op > Single.loadLoadavgt3 4.693 ± 0.013 ns/op <--- upgraded to full > Single.plain avgt3 0.408 ± 0.072 ns/op > Single.release avgt3 0.415 ± 0.016 ns/op > Single.storeStore avgt3 0.416 ± 0.041 ns/op > > # -XX:DisableIntrinsic=_fullFence > Single.acquire avgt3 0.406 ± 0.014 ns/op > Single.fullavgt3 15.836 ± 0.151 ns/op <--- calls runtime > Single.loadLoadavgt3 0.406 ± 0.001 ns/op > Single.plain avgt3 0.426 ± 0.361 ns/op > Single.release avgt3 0.407 ± 0.021 ns/op > Single.storeStore avgt3 0.410 ± 0.061 ns/op > > # -XX:DisableIntrinsic=_fullFence,_loadFence > Single.acquire avgt3 15.822 ± 0.282 ns/op <--- upgraded, calls > runtime > Single.fullavgt3 
15.851 ± 0.127 ns/op <--- calls runtime > Single.loadLoadavgt3 15.829 ± 0.045 ns/op <--- upgraded, calls > runtime > Single.plain avgt3 0.406 ± 0.001 ns/op > Single.release avgt3 0.414 ± 0.156 ns/op > Single.storeStore avgt3 0.422 ± 0.452 ns/op > > # -XX:DisableIntrinsic=_fullFence,_storeFence > Single.acquire avgt3 0.407 ± 0.016 ns/op > Single.fullavgt3 15.347 ± 6.783 ns/op <--- calls runtime > Single.loadLoadavgt3 0.406 ± 0.001 ns/op > Single.plain avgt3 0.406 ± 0.002 ns/op > Single.release avgt3 15.828 ± 0.019 ns/op <--- upgraded, calls > runtime > Single.storeStore avgt3 15.834 ± 0.045 ns/op <--- upgraded, calls > runtime > > # -XX:DisableIntrinsic=_fullFence,_loadFence,_storeFence > Single.acquire avgt3 15.838 ± 0.030 ns/op <--- upgraded, calls > runtime > Single.fullavgt3 15.854 ± 0.277 ns/op <--- calls runtime > Single.loadLoadavgt3 15.826 ± 0.160 ns/op <--- upgraded, calls > runtime > Single.plain avgt3 0.406 ± 0.003 ns/op > Single.release avgt3 15.838 ± 0.019 ns/op <--- upgraded, calls > runtime > Single.storeStore avgt3 15.844 ± 0.104 ns/op <--- upgraded, calls > runtime > > > Additional testing: > - [x] Linux x86_64 fastdebug `tier1` Marked as reviewed by aph (Reviewer). - PR: https://git.openjdk.java.net/jdk/pull/6149
Re: RFR: 8275167: x86 intrinsic for unsignedMultiplyHigh
On Wed, 13 Oct 2021 18:55:10 GMT, Vamsi Parasa wrote: > Optimize the new Math.unsignedMultiplyHigh using the x86 mul instruction. > This change shows a 1.87X improvement on a micro benchmark. test/micro/org/openjdk/bench/java/lang/MathBench.java line 547: > 545: return Math.unsignedMultiplyHigh(long747, long13); > 546: } > 547: As far as I can tell, the JMH overhead dominates when trying to measure the latency of events in the nanosecond range. `unsignedMultiplyHigh` should have a latency of maybe 1.5-2ns. Is that what you saw? - PR: https://git.openjdk.java.net/jdk/pull/5933
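For readers unfamiliar with the operation being intrinsified: the unsigned high product can be derived from the signed `Math.multiplyHigh` with a well-known correction term. A sketch for illustration (not the intrinsic under review, and the class name is hypothetical):

```java
// High 64 bits of the unsigned 64x64-bit product, derived from the signed
// high product: when an operand's sign bit is set, its unsigned value is
// 2^64 larger, which adds the other operand into the high word.
public class UMulHigh {
    static long unsignedMultiplyHigh(long x, long y) {
        // (x >> 63) is all-ones when x < 0, so the masks add back y and x
        // exactly when the corresponding sign bit was set.
        return Math.multiplyHigh(x, y) + ((x >> 63) & y) + ((y >> 63) & x);
    }
}
```

An x86 `mul` computes this high word directly, which is why a single-instruction intrinsic beats the correction-term expansion.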
Re: RFR: 8275167: x86 intrinsic for unsignedMultiplyHigh
On Wed, 13 Oct 2021 18:55:10 GMT, Vamsi Parasa wrote: > Optimize the new Math.unsignedMultiplyHigh using the x86 mul instruction. > This change shows a 1.87X improvement on a micro benchmark. src/hotspot/share/opto/mulnode.cpp line 468: > 466: } > 467: > 468: > //= MulHiLNode::Value() and UMulHiLNode::Value() seem to be identical. Perhaps some refactoring would be in order, maybe make a common shared routine. - PR: https://git.openjdk.java.net/jdk/pull/5933
Re: Optimization (?) of HashSet(Collection)
On 10/4/21 12:57, Сергей Цыпанов wrote: in the code of HashSet(Collection) we have an optimization opportunity: public HashSet(Collection c) { map = new HashMap<>(Math.max((int) (c.size()/.75f) + 1, 16)); addAll(c); } instead of using addAll() inherited from j.u.Collection we can use c.forEach(this::add): > public HashSet(Collection c) { map = new HashMap<>(Math.max((int) (c.size()/.75f) + 1, 16)); c.forEach(this::add); } This allows us to get rid of the Iterator and use counting loops for e.g. ArrayList instead of hasNext()/next(). Well, perhaps, but this sort of low-level hackery is not IMO something we should reflexively do in the core class libraries. We are also likely to benefit from this change in case the argument collection is synchronized as it's going to be locked only once. That's a good point, but would a synchronized collection really lock coarsely around forEach(), rather than in add()? I've used the benchmark [1] to analyze the impact and it demonstrates measurable improvement [2]. What puzzles me a bit is the result for j.u.ArrayList.
At warm-up time it demonstrates better results than at measurement time: length = 10 # Run progress: 4,44% complete, ETA 00:21:41 # Fork: 3 of 5 # Warmup Iteration 1: 134,699 ns/op # Warmup Iteration 2: 115,391 ns/op # Warmup Iteration 3: 130,110 ns/op # Warmup Iteration 4: 114,465 ns/op < # Warmup Iteration 5: 114,849 ns/op # Warmup Iteration 6: 115,073 ns/op # Warmup Iteration 7: 113,988 ns/op # Warmup Iteration 8: 115,301 ns/op # Warmup Iteration 9: 115,452 ns/op # Warmup Iteration 10: 124,493 ns/op < Iteration 1: 123,776 ns/op Iteration 2: 123,719 ns/op Iteration 3: 123,725 ns/op Iteration 4: 126,892 ns/op Iteration 5: 125,493 ns/op Iteration 6: 124,661 ns/op Iteration 7: 123,783 ns/op Iteration 8: 123,975 ns/op Iteration 9: 124,047 ns/op Iteration 10: 123,899 ns/op length = 100 # Run progress: 11,11% complete, ETA 00:20:10 # Fork: 1 of 5 # Warmup Iteration 1: 1236,695 ns/op # Warmup Iteration 2: 1069,909 ns/op # Warmup Iteration 3: 1243,852 ns/op # Warmup Iteration 4: 1059,038 ns/op < # Warmup Iteration 5: 1057,670 ns/op # Warmup Iteration 6: 1057,117 ns/op # Warmup Iteration 7: 1056,932 ns/op # Warmup Iteration 8: 1054,909 ns/op # Warmup Iteration 9: 1058,106 ns/op # Warmup Iteration 10: 1145,699 ns/op < Iteration 1: 1139,155 ns/op Iteration 2: 1141,662 ns/op Iteration 3: 1139,061 ns/op Iteration 4: 1139,306 ns/op Iteration 5: 1138,475 ns/op Iteration 6: 1140,491 ns/op Iteration 7: 1144,017 ns/op Iteration 8: 1139,411 ns/op Iteration 9: 1142,729 ns/op Iteration 10: 1144,042 ns/op Here I use 1 s warm-up iteration on 2s measurement iteration. It looks like at iteration 4 the code is top-optimized, and later at the end of warm-up a kind of deoptimization happens. There's still visible improvement however. The benchmark is run with 5 forks, and this deoptimization is visible for length = {10, 100}. So my two question are: - What is the reason of the deoptimization and can we do something about it? 
C2 does this sort of thing all the time: it's a heuristic probabilistic optimizer. I've seen plenty of bimodal examples, where inlining changes and each time a program is run it's either fast or slow, quasi-randomly. If you really want to know, use -prof perfasm in JMH. - Are the suggested changes worth implementing? IMO the gain is too small. Others may disagree. -- Andrew Haley (he/him) Java Platform Lead Engineer Red Hat UK Ltd. <https://www.redhat.com> https://keybase.io/andrewhaley EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671
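The constructor change discussed in this thread can be sketched as a standalone subclass. This is only an illustration of the proposal (the real change would be inside java.util.HashSet itself, not a subclass):

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.HashSet;
import java.util.Set;

// Sketch of the proposed HashSet(Collection) body: size the backing map as
// before, but populate it with a single forEach instead of the inherited,
// Iterator-driven addAll. For ArrayList, forEach runs an index-based loop.
class ForEachHashSet<E> extends HashSet<E> {
    ForEachHashSet(Collection<? extends E> c) {
        super(Math.max((int) (c.size() / .75f) + 1, 16));
        c.forEach(this::add);   // one traversal, no Iterator allocation
    }
}
```

For a synchronized wrapper collection, `forEach` holds the wrapper's lock for the whole traversal, which is the coarse-locking benefit (and the question) raised above.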
Re: RFR: 8275262: [BACKOUT] AArch64: Implement string_compare intrinsic in SVE
On Thu, 14 Oct 2021 10:21:20 GMT, David Holmes wrote: > See https://bugs.openjdk.java.net/browse/JDK-8275263 for some info on the > failures. It seems really bizarre. I would understand this a lot better if the affected machines actually used SVE, but they don't have the hardware. That does reduce the bug surface: we only need to look at the affected common code, and the only thing I can immediately see is that a couple of unused register arguments have been added. Unless, of course, the affected Macs think they have SVE... ? I don't think that's possible. I'd definitely be looking at the host toolchain for differences. - PR: https://git.openjdk.java.net/jdk/pull/5941
Re: RFR: 8275262: [BACKOUT] AArch64: Implement string_compare intrinsic in SVE
On Thu, 14 Oct 2021 06:37:19 GMT, Nick Gasson wrote: > This reverts "8269559: AArch64: Implement string_compare intrinsic in SVE" > which caused some unknown failures in Oracle's CI. It might be a spurious failure, then. I guess we need to see the test logs. - PR: https://git.openjdk.java.net/jdk/pull/5941
Re: RFR: 8275262: [BACKOUT] AArch64: Implement string_compare intrinsic in SVE
On Thu, 14 Oct 2021 07:08:20 GMT, David Holmes wrote: > The issue is only on (some of) our macOS Aarch64 systems. Let me know if I > can provide more info on hardware etc. Any info about what failed? A reproducer would be nice. - PR: https://git.openjdk.java.net/jdk/pull/5941
Re: RFR: 4511638: Double.toString(double) sometimes produces incorrect results [v2]
On Fri, 16 Apr 2021 11:30:32 GMT, Raffaello Giulietti wrote: >> Hello, >> >> here's a PR for a patch submitted on March 2020 >> [1](https://cr.openjdk.java.net/~bpb/4511638/webrev.04/) when Mercurial was >> a thing. >> >> The patch has been edited to adhere to OpenJDK code conventions about >> multi-line (block) comments. Nothing in the code proper has changed, except >> for the addition of redundant but clarifying parentheses in some expressions. >> >> >> Greetings >> Raffaello > > Raffaello Giulietti has updated the pull request incrementally with one > additional commit since the last revision: > > 4511638: Double.toString(double) sometimes produces incorrect results Email from GuySteele follows: (my reply) Hi Guy, for some reason your comments still appear garbled on the GitHub PR page and don't make it to the core-libs-dev mailing list at all. Luckily, they appear intelligible in my mailbox, so I'll keep going with prepending your comments in my replies: not ideal but good enough. Thanks so much for re-reading my "paper". printf() There are some issues to consider when trying to apply Schubfach to printf(), the main one being that printf() allows to specify an arbitrary length for the resulting decimal. This means, for example, that unlimited precision arithmetic is unavoidable. But it might be worthwhile to investigate using Schubfach for lengths up to H (9 and 17 for float and double, resp.) and fall back to unlimited precision beyond that. Before that, however, I would prefer to finally push Schubfach in the OpenJDK codebase for the toString() cases and close this PR. I completely agree that using Schubfach to solve only the toString() problems would be a _major_ improvement in the situation, and this should not wait for exploration of the printf problem. But I suspect that using Schubfach for lengths up to H would cover a very large fraction of actual usage, and would improve both quality and speed, and therefore would be worth exploring later. 
Tests Below, by "extensive test" I mean not only that the outcomes convert back without loss of information, but that they fully meet the spec about minimal number of digits, closeness, correct formatting (normal vs. scientific), character set, etc. All currently available tests are in the contributed code of this PR and will be part of the OpenJDK once integrated. * All powers of 2 and the extreme values are already extensively tested. * All values closest to powers of 10 are extensively tested. * All values proposed by Paxson [1] are extensively tested. I have now read through the Paxson paper. Does this refer to the values listed in his Tables 3 and 4, or to other values instead or in addition? * A configurable number of random values are tested at each round (currently 4 * 1'000'000 random values). Should a value fail, there's enough diagnostic information for further investigation. I'll add extensive tests for the values you propose in points (1) and (2), setting Z = Y = 1024. I do think that would lend further confidence. As for comparison with the current JDK behavior, there are already a bunch of values for which extensive tests fail on the current JDK but pass with Schubfach. Yes, thanks for supplying some of those. It would be cumbersome, if possible at all, to have both the current JDK and Schubfach implementations in the same OpenJDK codebase to be able to compare the outcomes. I performed comparisons in a different constellation, with Schubfach as an external library, but this is hardly a possibility in the core-libs. Needless to say, Schubfach outcomes are either the same as in JDK or better (shorter or closest to the fp value). Okay. I will mention here, for the record, that there is one other sort of test that could be performed that I think I have not yet seen you mention: a monotonicity test of the kind used by David Hough’s Testbase (mentioned by Paxson). However, a little thought reveals that such a test is made unnecessary by the round-trip test.
So a monotonicity test would be a good idea when testing the printf case, but is not needed for the toString case. Therefore, if you add the few tests I have suggested, I think that we can say with the best certainty we can have, short of separately testing every possible double value, that Schubfach is extremely well tested and ready for adoption into Java. Peer reviewed publication Shortening my 27-page write-up and re-formatting it to meet a journal's standards for publication would require investing yet another substantial amount of time. I'm not sure I'm prepared for that, given that I've no personal interest in a journal publication: I'm not an academic, I'm not pursuing a career... But I promise I'll think about reconsidering my position on this ;-) Please do think about reconsidering. There are several reasons to publish an “academic” paper: - Earning “merit badges” that lead to academic tenure -
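The round-trip test that anchors this discussion is easy to state in code. A minimal sketch (the class name is hypothetical; the real test suite in the PR is far more thorough):

```java
import java.util.Random;

// Round-trip property: every finite double must parse back from its decimal
// string to the identical bit pattern. Double.toString is specified to emit
// enough digits to uniquely distinguish adjacent doubles, so this must hold.
public class RoundTrip {
    static boolean roundTrips(double d) {
        return Double.doubleToLongBits(Double.parseDouble(Double.toString(d)))
                == Double.doubleToLongBits(d);
    }

    public static void main(String[] args) {
        Random r = new Random(42);      // fixed seed for reproducibility
        for (int i = 0; i < 100_000; i++) {
            double d = Double.longBitsToDouble(r.nextLong());
            if (Double.isFinite(d) && !roundTrips(d)) {
                throw new AssertionError("round trip failed for " + d);
            }
        }
    }
}
```

Note that round-tripping alone is the weakest of the properties listed above: it says nothing about shortest-digit output or closeness, which are what the bug report is actually about.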
Re: RFR: 4511638: Double.toString(double) sometimes produces incorrect results [v2]
On 9/28/21 10:30, Raffaello Giulietti wrote: > I was pondering whether to propose adding a launch-time option on the command > line to choose between the current and the proposed implementation for some > time, so to help the community gaining confidence in the new algorithm and > still have the current one as a fallback, just in case. (AFAIU, a launch-time > option can be constant folded by the JIT compiler to help it eliminate dead > code, right?) > > On the one hand, this adds complexity. On the other hand, it could help > revive this dormant PR. > > What do you think? Would this be a viable option? It's a good-enough idea in theory, but I don't think it'd be accepted. Pinging Joe Darcy: do you (or any of your friends) have a good way to contact Guy Steele? -- Andrew Haley (he/him) Java Platform Lead Engineer Red Hat UK Ltd. <https://www.redhat.com> https://keybase.io/andrewhaley EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671
Re: StringCoding.hasNegatives
On 10/1/21 1:57 PM, Brett Okken wrote: > I know java.lang.StringCoding.hasNegatives has a > HotSpotIntrinsicCandidate annotation/implementation, but is there > interest/value in a faster pure java implementation? > > Using Unsafe to read and compare 8 bytes at a time as a long is faster > than the current simple implementation both in interpreter only mode > and default hotspot compiler mode. I can provide jmh bencmark and > results if this is worthwhile to pursue. The current pure Java implementation does two things: it provides a fallback for pure-interpreter JVMs and it provides the reader with a simple implementation. I'm not at all sure we'd want a complex implementation. Having said that, if I were looking at a faster pure Java version of this logic, I'd look at MethodHandles.byteArrayViewVarHandle(). -- Andrew Haley (he/him) Java Platform Lead Engineer Red Hat UK Ltd. <https://www.redhat.com> https://keybase.io/andrewhaley EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671
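The VarHandle approach Haley mentions can be sketched as follows. This is illustrative only, not the JDK's actual StringCoding code, and the class and method names are hypothetical:

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.nio.ByteOrder;

// Read 8 bytes at a time as a long and test all eight sign bits with a
// single mask, falling back to a byte-at-a-time tail loop.
public class HasNegatives {
    private static final VarHandle LONGS =
        MethodHandles.byteArrayViewVarHandle(long[].class, ByteOrder.LITTLE_ENDIAN);

    static boolean hasNegatives(byte[] ba, int off, int len) {
        int i = off;
        for (; i + 8 <= off + len; i += 8) {
            long v = (long) LONGS.get(ba, i);          // unaligned plain read is allowed
            if ((v & 0x8080808080808080L) != 0) {      // any byte with its high bit set?
                return true;
            }
        }
        for (; i < off + len; i++) {                   // remaining 0-7 bytes
            if (ba[i] < 0) {
                return true;
            }
        }
        return false;
    }
}
```

Byte order does not matter for this particular check, since the mask is symmetric across all eight byte lanes.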
Re: RFR: 8269559: AArch64: Implement string_compare intrinsic in SVE [v2]
On Mon, 23 Aug 2021 21:48:01 GMT, TatWai Chong wrote: >> This patch implements string_compare intrinsic in SVE. >> It supports all LL, LU, UL and UU comparisons. >> >> As we haven't found an existing benchmark to measure performance impact, >> we created a benchmark derived from the test [1] for this evaluation. >> This benchmark is attached to this patch. >> >> Besides, remove the unused temporary register `vtmp3` from the existing >> match rules for StrCmp. >> >> The result below shows all varients can be benefited largely. >> Command: make exploded-test TEST="micro:StringCompareToDifferentLength" >> >> Benchmark(size) Mode Cnt Score Speedup Units >> compareToLL 24 avgt 10 1.0x ms/op >> compareToLL 36 avgt 10 1.0x ms/op >> compareToLL 72 avgt 10 1.0x ms/op >> compareToLL 128 avgt 10 1.4x ms/op >> compareToLL 256 avgt 10 1.8x ms/op >> compareToLL 512 avgt 10 2.7x ms/op >> compareToLU 24 avgt 10 1.6x ms/op >> compareToLU 36 avgt 10 1.8x ms/op >> compareToLU 72 avgt 10 2.3x ms/op >> compareToLU 128 avgt 10 3.8x ms/op >> compareToLU 256 avgt 10 4.7x ms/op >> compareToLU 512 avgt 10 6.3x ms/op >> compareToUL 24 avgt 10 1.6x ms/op >> compareToUL 36 avgt 10 1.7x ms/op >> compareToUL 72 avgt 10 2.2x ms/op >> compareToUL 128 avgt 10 3.3x ms/op >> compareToUL 256 avgt 10 4.4x ms/op >> compareToUL 512 avgt 10 6.1x ms/op >> compareToUU 24 avgt 10 1.0x ms/op >> compareToUU 36 avgt 10 1.0x ms/op >> compareToUU 72 avgt 10 1.4x ms/op >> compareToUU 128 avgt 10 2.2x ms/op >> compareToUU 256 avgt 10 2.6x ms/op >> compareToUU 512 avgt 10 3.7x ms/op >> >> [1] >> https://github.com/openjdk/jdk/blob/master/test/hotspot/jtreg/compiler/intrinsics/string/TestStringCompareToDifferentLength.java > > TatWai Chong has updated the pull request incrementally with one additional > commit since the last revision: > > Restore the removal of vtmp3 (=V2) as it is still used by the non-SVE > compare-long-strings stub. 
> > And remove the assertion in `string_compare` since it won't help as the > registers > used in the stub are fixed. Marked as reviewed by aph (Reviewer). - PR: https://git.openjdk.java.net/jdk/pull/5129
Re: RFR: 4511638: Double.toString(double) sometimes produces incorrect results [v2]
On Fri, 16 Apr 2021 11:30:32 GMT, Raffaello Giulietti wrote: >> Hello, >> >> here's a PR for a patch submitted on March 2020 >> [1](https://cr.openjdk.java.net/~bpb/4511638/webrev.04/) when Mercurial was >> a thing. >> >> The patch has been edited to adhere to OpenJDK code conventions about >> multi-line (block) comments. Nothing in the code proper has changed, except >> for the addition of redundant but clarifying parentheses in some expressions. >> >> >> Greetings >> Raffaello > > Raffaello Giulietti has updated the pull request incrementally with one > additional commit since the last revision: > > 4511638: Double.toString(double) sometimes produces incorrect results My turn to keep this PR alive. If we'd had any actual face-to-face contributors' meetings this year I would have raised this stuck PR as an issue with our processes. - PR: https://git.openjdk.java.net/jdk/pull/3402
Re: RFR: 8268231: Aarch64: Use ldp in intrinsics for String.compareTo [v6]
On Mon, 20 Sep 2021 09:52:45 GMT, Andrew Dinn wrote: >> Andrew, can you help us to approve this? > > I agree with Andrew Haley that this patch is not going to make an improvement > for anything but a very small number of applications. Processing of strings > over a few 10s of bytes is rare. On the other hand the doesn't seem to cause > any performance drop for the much more common case of processing short > strings. so it does no harm. Also, the new and old code are much the same in > terms of complexity so that is no reason to prefer one over the other. The > only real concern I have is that any change involves the risk of error and > the ratio of cases that might benefit to cases that might suffer from an > error is very low. I don't think that's a reason to avoid pushing this patch > upstream but it does suggest that we should not backport it. OK, thanks. That seems like a sensible compromise. - PR: https://git.openjdk.java.net/jdk/pull/4722
Re: RFR: 8268231: Aarch64: Use ldp in intrinsics for String.compareTo [v7]
On Mon, 30 Aug 2021 06:26:01 GMT, Wang Huang wrote: >> Dear all, >> Can you do me a favor to review this patch. This patch use `ldp` to >> implement String.compareTo. >> >> * We add a JMH test case >> * Here is the result of this test case >> >> Benchmark |(size)| Mode| Cnt|Score | Error |Units >> -|--|-||--||- >> StringCompare.compareLL | 64 | avgt| 5 |7.992 | ± 0.005|us/op >> StringCompare.compareLL | 72 | avgt| 5 |15.029| ± 0.006|us/op >> StringCompare.compareLL | 80 | avgt| 5 |14.655| ± 0.011|us/op >> StringCompare.compareLL | 91 | avgt| 5 |16.363| ± 0.12 |us/op >> StringCompare.compareLL | 101 | avgt| 5 |16.966| ± 0.007|us/op >> StringCompare.compareLL | 121 | avgt| 5 |19.276| ± 0.006|us/op >> StringCompare.compareLL | 181 | avgt| 5 |19.002| ± 0.417|us/op >> StringCompare.compareLL | 256 | avgt| 5 |24.707| ± 0.041|us/op >> StringCompare.compareLLWithLdp| 64 | avgt| 5 |8.001 | ± >> 0.121|us/op >> StringCompare.compareLLWithLdp| 72 | avgt| 5 |11.573| ± 0.003|us/op >> StringCompare.compareLLWithLdp| 80 | avgt| 5 |6.861 | ± 0.004|us/op >> StringCompare.compareLLWithLdp| 91 | avgt| 5 |12.774| ± 0.201|us/op >> StringCompare.compareLLWithLdp| 101 | avgt| 5 |8.691 | ± 0.004|us/op >> StringCompare.compareLLWithLdp| 121 | avgt| 5 |11.091| ± 1.342|us/op >> StringCompare.compareLLWithLdp| 181 | avgt| 5 |14.64 | ± 0.581|us/op >> StringCompare.compareLLWithLdp| 256 | avgt| 5 |25.879| ± 1.775|us/op >> StringCompare.compareUU | 64 | avgt| 5 |13.476| ± 0.01 |us/op >> StringCompare.compareUU | 72 | avgt| 5 |15.078| ± 0.006|us/op >> StringCompare.compareUU | 80 | avgt| 5 |23.512| ± 0.011|us/op >> StringCompare.compareUU | 91 | avgt| 5 |24.284| ± 0.008|us/op >> StringCompare.compareUU | 101 | avgt| 5 |20.707| ± 0.017|us/op >> StringCompare.compareUU | 121 | avgt| 5 |29.302| ± 0.011|us/op >> StringCompare.compareUU | 181 | avgt| 5 |39.31 | ± >> 0.016|us/op >> StringCompare.compareUU | 256 | avgt| 5 |54.592| ± 0.392|us/op >> StringCompare.compareUUWithLdp| 64 | avgt| 5 |16.389| ± 
0.008|us/op >> StringCompare.compareUUWithLdp| 72 | avgt| 5 |10.71 | ± 0.158|us/op >> StringCompare.compareUUWithLdp| 80 | avgt| 5 |11.488| ± 0.024|us/op >> StringCompare.compareUUWithLdp| 91 | avgt| 5 |13.412| ± 0.006|us/op >> StringCompare.compareUUWithLdp| 101 | avgt| 5 |16.245| ± 0.434|us/op >> StringCompare.compareUUWithLdp| 121 | avgt| 5 |16.597| ± 0.016|us/op >> StringCompare.compareUUWithLdp| 181 | avgt| 5 |27.373| ± 0.017|us/op >> StringCompare.compareUUWithLdp| 256 | avgt| 5 |41.74 | ± 3.5 |us/op >> >> From this table, we can see that in most cases, our patch is better than old >> one. >> >> Thank you for your review. Any suggestions are welcome. > > Wang Huang has updated the pull request incrementally with one additional > commit since the last revision: > > fix windows build failed Marked as reviewed by aph (Reviewer). - PR: https://git.openjdk.java.net/jdk/pull/4722
Re: better random numbers
On 9/3/21 12:35 AM, John Rose wrote: > The reference I’d like to give here is to Dr. Melissa O’Neill’s > website and articles: I'm quite sceptical. Anyone who says a (non-cryptographic) random-number generator is "hard to predict" is either quite naive or in a state of sin, (;-) and while O’Neill’s argument seems sound, it doesn't seem to have convinced the academic world. Lemire is thoughtful: https://lemire.me/blog/2017/08/15/on-melissa-oneills-pcg-random-number-generator/ I wonder about AES, which can do (on Apple M1) 2 parallel rounds per clock cycle. I'm quite tempted to try a reduced-round AES on the TestU01 statistical tests. Maybe 6 rounds? However, there can be a long latency between FPU and integer CPU, so perhaps it's not such a great idea. Also, you have to load the key registers before you can generate a random number, so it only really works if you want to generate a lot of bits at a time. But it is maybe 128 randomish bits per a few clock cycles. -- Andrew Haley (he/him) Java Platform Lead Engineer Red Hat UK Ltd. <https://www.redhat.com> https://keybase.io/andrewhaley EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671
Re: RFR: 8268231: Aarch64: Use ldp in intrinsics for String.compareTo [v6]
On Thu, 26 Aug 2021 09:26:24 GMT, Wu Yan wrote: >> src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 4871: >> >>> 4869: // exit from large loop when less than 64 bytes left to read or >>> we're about >>> 4870: // to prefetch memory behind array border >>> 4871: int largeLoopExitCondition = MAX(64, >>> SoftwarePrefetchHintDistance)/(isLL ? 1 : 2); >> >> This breaks the Windows AArch64 build: >> >> >> Creating support/modules_libs/java.base/server/jvm.dll from 1051 file(s) >> d:\a\jdk\jdk\jdk\src\hotspot\cpu\aarch64\stubGenerator_aarch64.cpp(4871): >> error C3861: 'MAX': identifier not found >> make[3]: *** [lib/CompileJvm.gmk:143: >> /cygdrive/d/a/jdk/jdk/jdk/build/windows-aarch64/hotspot/variant-server/libjvm >> >> >> https://github.com/Wanghuang-Huawei/jdk/runs/3260986937 >> >> Should probably be left as `MAX2`. > > Thanks, I'll fix it. It's fine. I don't think it'll affect any real programs, so it's rather pointless. I don't know if that's any reason not to approve it. - PR: https://git.openjdk.java.net/jdk/pull/4722
Re: RFR: 8268231: Aarch64: Use ldp in intrinsics for String.compareTo [v4]
On 7/30/21 7:49 AM, Wu Yan wrote: > I agree. This is the compromise solution that the optimization > has no effect (or even slowdown) on some platforms. > In addition, I found that in > [JDK-8202326](https://bugs.openjdk.java.net/browse/JDK-8202326), > adding prefetches is only for long strings (the rare cases), > maybe we can further optimize long strings with LDP. So should > I continue this optimization or close it? IMO, we don't want to be using the vector unit unless it does some good, and if you can do this sort of thing in the CPU core you should, so I like that. I was (still am) tempted to approve it, but Nick says there are still bugs in corner cases. I think you should probably close it. Comparison of really long Strings is so rare that I can't find any examples of where it actually happens. Array comparisons, sure, but Strings, not so much. -- Andrew Haley (he/him) Java Platform Lead Engineer Red Hat UK Ltd. <https://www.redhat.com> https://keybase.io/andrewhaley EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671
Re: RFR: 8268231: Aarch64: Use ldp in intrinsics for String.compareTo [v4]
On Wed, 28 Jul 2021 08:25:08 GMT, Nick Gasson wrote: > I don't think we want to keep two copies of the compareTo intrinsic. If there > are no cases where the LDP version is worse than the original version then we > should just delete the old one and replace it with this. I agree. The trouble is, what does "worse" mean? I'm looking at SDEN-1982442, Neoverse N2 errata, 2001293, and I see that LDP has to be slowed down on streaming workloads, which will affect this. The trouble with this patch is that it (probably) makes things better for long strings, which are very rare. What we actually need to care about is performance for a large number of typical-sized strings, which are names, identifiers, passwords, and so on: about 10-30 characters. - PR: https://git.openjdk.java.net/jdk/pull/4722
Re: RFR: 8268231: Aarch64: Use ldp in intrinsics for String.compareTo [v2]
On Mon, 12 Jul 2021 09:14:25 GMT, Wang Huang wrote: >> Dear all, >> Can you do me a favor to review this patch. This patch use `ldp` to >> implement String.compareTo. >> >> * We add a JMH test case >> * Here is the result of this test case >> >> Benchmark |(size)| Mode| Cnt|Score | Error |Units >> -|--|-||--||- >> StringCompare.compareLL | 64 | avgt| 5 |7.992 | ± 0.005|us/op >> StringCompare.compareLL | 72 | avgt| 5 |15.029| ± 0.006|us/op >> StringCompare.compareLL | 80 | avgt| 5 |14.655| ± 0.011|us/op >> StringCompare.compareLL | 91 | avgt| 5 |16.363| ± 0.12 |us/op >> StringCompare.compareLL | 101 | avgt| 5 |16.966| ± 0.007|us/op >> StringCompare.compareLL | 121 | avgt| 5 |19.276| ± 0.006|us/op >> StringCompare.compareLL | 181 | avgt| 5 |19.002| ± 0.417|us/op >> StringCompare.compareLL | 256 | avgt| 5 |24.707| ± 0.041|us/op >> StringCompare.compareLLWithLdp| 64 | avgt| 5 |8.001 | ± >> 0.121|us/op >> StringCompare.compareLLWithLdp| 72 | avgt| 5 |11.573| ± 0.003|us/op >> StringCompare.compareLLWithLdp| 80 | avgt| 5 |6.861 | ± 0.004|us/op >> StringCompare.compareLLWithLdp| 91 | avgt| 5 |12.774| ± 0.201|us/op >> StringCompare.compareLLWithLdp| 101 | avgt| 5 |8.691 | ± 0.004|us/op >> StringCompare.compareLLWithLdp| 121 | avgt| 5 |11.091| ± 1.342|us/op >> StringCompare.compareLLWithLdp| 181 | avgt| 5 |14.64 | ± 0.581|us/op >> StringCompare.compareLLWithLdp| 256 | avgt| 5 |25.879| ± 1.775|us/op >> StringCompare.compareUU | 64 | avgt| 5 |13.476| ± 0.01 |us/op >> StringCompare.compareUU | 72 | avgt| 5 |15.078| ± 0.006|us/op >> StringCompare.compareUU | 80 | avgt| 5 |23.512| ± 0.011|us/op >> StringCompare.compareUU | 91 | avgt| 5 |24.284| ± 0.008|us/op >> StringCompare.compareUU | 101 | avgt| 5 |20.707| ± 0.017|us/op >> StringCompare.compareUU | 121 | avgt| 5 |29.302| ± 0.011|us/op >> StringCompare.compareUU | 181 | avgt| 5 |39.31 | ± >> 0.016|us/op >> StringCompare.compareUU | 256 | avgt| 5 |54.592| ± 0.392|us/op >> StringCompare.compareUUWithLdp| 64 | avgt| 5 |16.389| ± 
0.008|us/op >> StringCompare.compareUUWithLdp| 72 | avgt| 5 |10.71 | ± 0.158|us/op >> StringCompare.compareUUWithLdp| 80 | avgt| 5 |11.488| ± 0.024|us/op >> StringCompare.compareUUWithLdp| 91 | avgt| 5 |13.412| ± 0.006|us/op >> StringCompare.compareUUWithLdp| 101 | avgt| 5 |16.245| ± 0.434|us/op >> StringCompare.compareUUWithLdp| 121 | avgt| 5 |16.597| ± 0.016|us/op >> StringCompare.compareUUWithLdp| 181 | avgt| 5 |27.373| ± 0.017|us/op >> StringCompare.compareUUWithLdp| 256 | avgt| 5 |41.74 | ± 3.5 |us/op >> >> From this table, we can see that in most cases, our patch is better than old >> one. >> >> Thank you for your review. Any suggestions are welcome. > > Wang Huang has updated the pull request incrementally with one additional > commit since the last revision: > > draft of refactor And with longer strings, M1 and ThunderX2:

Benchmark                                       (diff_pos) (size)  Mode  Cnt   Score   Error  Units
StringCompare.compareLLDiffStrings                    1023   1024  avgt    3  50.849 ± 0.087  us/op
StringCompare.compareLLDiffStringsWithLdp             1023   1024  avgt    3  23.676 ± 0.015  us/op
StringCompare.compareLLDiffStringsWithRefactor        1023   1024  avgt    3  28.967 ± 0.168  us/op

StringCompare.compareLLDiffStrings                    1023   1024  avgt    3  98.681 ± 0.026  us/op
StringCompare.compareLLDiffStringsWithLdp             1023   1024  avgt    3  82.576 ± 0.656  us/op
StringCompare.compareLLDiffStringsWithRefactor        1023   1024  avgt    3  98.801 ± 0.321  us/op

LDP wins on M1 here, but on ThunderX2 it makes almost no difference at all. And how often are we comparing such long strings? I don't know what to think, really. It seems that we're near to a place where we're optimizing for micro-architecture, and I don't want to see that here. On the other hand, using LDP is not worse anywhere, so we should allow it. - PR: https://git.openjdk.java.net/jdk/pull/4722
Re: RFR: 8268231: Aarch64: Use ldp in intrinsics for String.compareTo [v2]
On Mon, 12 Jul 2021 09:14:25 GMT, Wang Huang wrote: >> Dear all, >> Can you do me a favor to review this patch. This patch use `ldp` to >> implement String.compareTo. >> >> * We add a JMH test case >> * Here is the result of this test case >> >> Benchmark |(size)| Mode| Cnt|Score | Error |Units >> -|--|-||--||- >> StringCompare.compareLL | 64 | avgt| 5 |7.992 | ± 0.005|us/op >> StringCompare.compareLL | 72 | avgt| 5 |15.029| ± 0.006|us/op >> StringCompare.compareLL | 80 | avgt| 5 |14.655| ± 0.011|us/op >> StringCompare.compareLL | 91 | avgt| 5 |16.363| ± 0.12 |us/op >> StringCompare.compareLL | 101 | avgt| 5 |16.966| ± 0.007|us/op >> StringCompare.compareLL | 121 | avgt| 5 |19.276| ± 0.006|us/op >> StringCompare.compareLL | 181 | avgt| 5 |19.002| ± 0.417|us/op >> StringCompare.compareLL | 256 | avgt| 5 |24.707| ± 0.041|us/op >> StringCompare.compareLLWithLdp| 64 | avgt| 5 |8.001 | ± >> 0.121|us/op >> StringCompare.compareLLWithLdp| 72 | avgt| 5 |11.573| ± 0.003|us/op >> StringCompare.compareLLWithLdp| 80 | avgt| 5 |6.861 | ± 0.004|us/op >> StringCompare.compareLLWithLdp| 91 | avgt| 5 |12.774| ± 0.201|us/op >> StringCompare.compareLLWithLdp| 101 | avgt| 5 |8.691 | ± 0.004|us/op >> StringCompare.compareLLWithLdp| 121 | avgt| 5 |11.091| ± 1.342|us/op >> StringCompare.compareLLWithLdp| 181 | avgt| 5 |14.64 | ± 0.581|us/op >> StringCompare.compareLLWithLdp| 256 | avgt| 5 |25.879| ± 1.775|us/op >> StringCompare.compareUU | 64 | avgt| 5 |13.476| ± 0.01 |us/op >> StringCompare.compareUU | 72 | avgt| 5 |15.078| ± 0.006|us/op >> StringCompare.compareUU | 80 | avgt| 5 |23.512| ± 0.011|us/op >> StringCompare.compareUU | 91 | avgt| 5 |24.284| ± 0.008|us/op >> StringCompare.compareUU | 101 | avgt| 5 |20.707| ± 0.017|us/op >> StringCompare.compareUU | 121 | avgt| 5 |29.302| ± 0.011|us/op >> StringCompare.compareUU | 181 | avgt| 5 |39.31 | ± >> 0.016|us/op >> StringCompare.compareUU | 256 | avgt| 5 |54.592| ± 0.392|us/op >> StringCompare.compareUUWithLdp| 64 | avgt| 5 |16.389| ± 
0.008|us/op >> StringCompare.compareUUWithLdp| 72 | avgt| 5 |10.71 | ± 0.158|us/op >> StringCompare.compareUUWithLdp| 80 | avgt| 5 |11.488| ± 0.024|us/op >> StringCompare.compareUUWithLdp| 91 | avgt| 5 |13.412| ± 0.006|us/op >> StringCompare.compareUUWithLdp| 101 | avgt| 5 |16.245| ± 0.434|us/op >> StringCompare.compareUUWithLdp| 121 | avgt| 5 |16.597| ± 0.016|us/op >> StringCompare.compareUUWithLdp| 181 | avgt| 5 |27.373| ± 0.017|us/op >> StringCompare.compareUUWithLdp| 256 | avgt| 5 |41.74 | ± 3.5 |us/op >> >> From this table, we can see that in most cases, our patch is better than old >> one. >> >> Thank you for your review. Any suggestions are welcome. > > Wang Huang has updated the pull request incrementally with one additional > commit since the last revision: > > draft of refactor Two machines to compare here, Apple M1 and ThunderX2:

Benchmark                                       (diff_pos) (size)  Mode  Cnt  Score   Error  Units
StringCompare.compareLLDiffStrings                       7    128  avgt    3  2.194 ± 0.001  us/op
StringCompare.compareLLDiffStrings                      15    128  avgt    3  2.195 ± 0.018  us/op
StringCompare.compareLLDiffStrings                      31    128  avgt    3  2.508 ± 0.003  us/op
StringCompare.compareLLDiffStrings                      47    128  avgt    3  2.821 ± 0.001  us/op
StringCompare.compareLLDiffStrings                      63    128  avgt    3  3.446 ± 0.003  us/op
StringCompare.compareLLDiffStringsWithLdp                7    128  avgt    3  2.194 ± 0.001  us/op
StringCompare.compareLLDiffStringsWithLdp               15    128  avgt    3  2.195 ± 0.001  us/op
StringCompare.compareLLDiffStringsWithLdp               31    128  avgt    3  2.508 ± 0.001  us/op
StringCompare.compareLLDiffStringsWithLdp               47    128  avgt    3  2.510 ± 0.006  us/op
StringCompare.compareLLDiffStringsWithLdp               63    128  avgt    3  2.824 ± 0.003  us/op
StringCompare.compareLLDiffStringsWithRefactor           7    128  avgt    3  1.882 ± 0.018  us/op
StringCompare.compareLLDiffStringsWithRefactor          15    128  avgt    3  2.019 ± 0.002  us/op
StringCompare.compareLLDiffStringsWithRefactor          31    128  avgt    3  2.355 ± 0.003  us/op
StringCompare.compareLLDiffStringsWithRefactor          47    128  avgt    3  2.821 ± 0.010  us/op
StringCompare.compareLLDiffStringsWithRefactor          63    128  avgt    3  3.135 ± 0.002  us/op

Benchmark                                       (diff_pos) (size)  Mode  Cnt  Score   Error  Units
StringCompare.compareLLDiffStrings
Re: RFR: 8268231: Aarch64: Use ldp in intrinsics for String.compareTo
On Thu, 8 Jul 2021 11:50:36 GMT, Wang Huang wrote: > Dear all, > Can you do me a favor to review this patch. This patch use `ldp` to > implement String.compareTo. > > * We add a JMH test case > * Here is the result of this test case > > Benchmark|(size)| Mode| Cnt|Score | Error |Units > -|--|-||--||- > StringCompare.compareLL | 64 | avgt| 5 |7.992 | ±0.005|us/op > StringCompare.compareLL | 72 | avgt| 5 |15.029| ±0.006|us/op > StringCompare.compareLL | 80 | avgt| 5 |14.655| ±0.011|us/op > StringCompare.compareLL | 91 | avgt| 5 |16.363| ±0.12 |us/op > StringCompare.compareLL | 101 | avgt| 5 |16.966| ±0.007|us/op > StringCompare.compareLL | 121 | avgt| 5 |19.276| ±0.006|us/op > StringCompare.compareLL | 181 | avgt| 5 |19.002| ±0.417|us/op > StringCompare.compareLL | 256 | avgt| 5 |24.707| ±0.041|us/op > StringCompare.compareLLWithLdp| 64 | avgt| 5 |8.001| ± > 0.121|us/op > StringCompare.compareLLWithLdp| 72 | avgt| 5 |11.573| ±0.003|us/op > StringCompare.compareLLWithLdp| 80 | avgt| 5 |6.861 | ±0.004|us/op > StringCompare.compareLLWithLdp| 91 | avgt| 5 |12.774| ±0.201|us/op > StringCompare.compareLLWithLdp| 101 | avgt| 5 |8.691 | ±0.004|us/op > StringCompare.compareLLWithLdp| 121 | avgt| 5 |11.091| ±1.342|us/op > StringCompare.compareLLWithLdp| 181 | avgt| 5 |14.64 | ±0.581|us/op > StringCompare.compareLLWithLdp| 256 | avgt| 5 |25.879| ±1.775|us/op > StringCompare.compareUU | 64 | avgt| 5 |13.476| ±0.01 |us/op > StringCompare.compareUU | 72 | avgt| 5 |15.078| ±0.006|us/op > StringCompare.compareUU | 80 | avgt| 5 |23.512| ±0.011|us/op > StringCompare.compareUU | 91 | avgt| 5 |24.284| ±0.008|us/op > StringCompare.compareUU | 101 | avgt| 5 |20.707| ±0.017|us/op > StringCompare.compareUU | 121 | avgt| 5 |29.302| ±0.011|us/op > StringCompare.compareUU | 181 | avgt| 5 |39.31| ± > 0.016|us/op > StringCompare.compareUU | 256 | avgt| 5 |54.592| ±0.392|us/op > StringCompare.compareUUWithLdp| 64 | avgt| 5 |16.389| ±0.008|us/op > StringCompare.compareUUWithLdp| 72 | avgt| 5 
|10.71 | ±0.158|us/op > StringCompare.compareUUWithLdp| 80 | avgt| 5 |11.488| ±0.024|us/op > StringCompare.compareUUWithLdp| 91 | avgt| 5 |13.412| ±0.006|us/op > StringCompare.compareUUWithLdp| 101 | avgt| 5 |16.245| ±0.434|us/op > StringCompare.compareUUWithLdp| 121 | avgt| 5 |16.597| ±0.016|us/op > StringCompare.compareUUWithLdp| 181 | avgt| 5 |27.373| ±0.017|us/op > StringCompare.compareUUWithLdp| 256 | avgt| 5 |41.74 | ±3.5 |us/op > > From this table, we can see that in most cases, our patch is better than old > one. > > Thank you for your review. Any suggestions are welcome. I'm quite tempted to approve this. It looks generally better, simpler, and easier to understand than what we have today. However, the improvement isn't great, and I suspect is mostly because of the reduction in traffic between Base and Vector registers. What happens if you rewrite `compare_string_16_bytes_same()` to use `ldp` ? - PR: https://git.openjdk.java.net/jdk/pull/4722
Re: RFR: 6506405: Math.abs(float) is slow [v2]
On Thu, 8 Jul 2021 09:43:35 GMT, Andrew Haley wrote: > Moves between GPRs and FPRs are often slow. There's a 10-cycle latency on > some AArch64, so we avoid it whenever we can. Mind you, we don't care about > this patch because we always generate FABS from an intrinsic anyway. For the avoidance of doubt, that's the round-trip latency. - PR: https://git.openjdk.java.net/jdk/pull/4711
Re: RFR: 6506405: Math.abs(float) is slow [v2]
On Thu, 8 Jul 2021 01:05:16 GMT, Brian Burkhalter wrote: >> Please consider this change to make the `float` and `double` versions of >> `java.lang.Math.abs()` branch-free. > > Brian Burkhalter has updated the pull request incrementally with one > additional commit since the last revision: > > 6506405: Add comments, use new consts for masking Moves between GPRs and FPRs are often slow. There's a 10-cycle latency on some AArch64, so we avoid it whenever we can. Mind you, we don't care about this patch because we always generate FABS from an intrinsic anyway. - PR: https://git.openjdk.java.net/jdk/pull/4711
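[Editor's sketch] The branch-free approach under review amounts to clearing the IEEE 754 sign bit directly, with no compare-and-branch and no GPR/FPR round trip required at the Java level — a minimal standalone sketch, not the exact JDK source:

```java
// Branch-free absolute value: mask off the sign bit of the raw bit
// pattern, instead of the old (a <= 0.0) ? -a : a branch.
final class BranchFreeAbs {
    static double abs(double a) {
        return Double.longBitsToDouble(
                Double.doubleToRawLongBits(a) & 0x7fffffffffffffffL);
    }

    static float abs(float a) {
        return Float.intBitsToFloat(
                Float.floatToRawIntBits(a) & 0x7fffffff);
    }
}
```

Note this also gives the right answer for -0.0 (the sign bit is simply cleared) and leaves NaNs as NaNs, which the branchy formulation had to handle carefully.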
Re: RFR: 8214761: Bug in parallel Kahan summation implementation
On Fri, 2 Jul 2021 20:12:39 GMT, Ian Graves wrote: > 8214761: Bug in parallel Kahan summation implementation Crikey, how did we get that wrong? It'd be nice if we had a regression test for this. Can you provide one, please? - PR: https://git.openjdk.java.net/jdk/pull/4674
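[Editor's sketch] A regression test wants an input on which naive and compensated summation visibly diverge. For context, here is sequential Kahan summation — the technique at issue, not the JDK's actual implementation:

```java
// Kahan (compensated) summation: c carries the rounding error of the
// previous addition so it can be folded back into the next one.
final class KahanSum {
    static double sum(double[] values) {
        double sum = 0.0, c = 0.0;
        for (double v : values) {
            double y = v - c;    // re-inject the error lost last time
            double t = sum + y;  // low-order bits of y may be lost here...
            c = (t - sum) - y;   // ...and are recovered algebraically
            sum = t;
        }
        return sum;
    }
}
```

For example, summing {1e16, 1.0, 1.0, -1e16} naively loses both 1.0 terms (they fall below one ulp of 1e16), while the compensated sum recovers them.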
Re: RFR: 8188044: We need Math.unsignedMultiplyHigh [v4]
On Fri, 2 Jul 2021 16:37:21 GMT, Brian Burkhalter wrote: >> Please consider this proposal to add a method >> `unsignedMultiplyHigh(long,long)` to each of `java.lang.Math` and >> `java.lang.StrictMath`. > > Brian Burkhalter has refreshed the contents of this pull request, and > previous commits have been removed. The incremental views will show > differences compared to the previous content of the PR. The pull request > contains one new commit since the last revision: > > 8188044: Use multiplyHigh() to leverage the intrinsic Marked as reviewed by aph (Reviewer). - PR: https://git.openjdk.java.net/jdk/pull/4644
Re: RFR: 8188044: We need Math.unsignedMultiplyHigh [v2]
On 7/2/21 4:30 PM, Raffaello Giulietti wrote: > FWIW, adinn's branchless code together with > PR https://git.openjdk.java.net/jdk/pull/4660 > make both methods less vulnerable to timing attacks. I doubt it, because HotSpot generates conditional select instructions for the popular systems, at least for C2. I guess it might on a C1-only or pure interpreter system. That certainly might be an argument for using the shift-and-add magic. With a comment that would be fine, as the difference in performance doesn't seem to be worth worrying about. We're looking at sub-nanosecond throughput. -- Andrew Haley (he/him) Java Platform Lead Engineer Red Hat UK Ltd. <https://www.redhat.com> https://keybase.io/andrewhaley EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671
Re: RFR: 8188044: We need Math.unsignedMultiplyHigh [v2]
On 7/2/21 12:26 PM, Raffaello Giulietti wrote: > ... or even as a one liner, like in the test > > return Math.multiplyHigh(x, y) + ((x >> (Long.SIZE - 1)) & y) + ((y >> > (Long.SIZE - 1)) & x); It was hard to write, so it should be hard to read too! :-)
Re: RFR: 8188044: We need Math.unsignedMultiplyHigh [v2]
On Fri, 2 Jul 2021 13:47:40 GMT, Andrew Haley wrote: >> You can also do that branchlessly which might prove better >> >> long result = Math.multiplyHigh(x, y); >> result += (y & (x >> 63)); >> result += (x & (y >> 63)); >> return result; > >> You can also do that branchlessly which might prove better >> >> ``` >> long result = Math.multiplyHigh(x, y); >> result += (y & (x >> 63)); >> result += (x & (y >> 63)); >> return result; >> ``` > I doubt very much that it would be better, because these days branch > prediction is excellent, and we also have conditional select instructions. > Exposing the condition helps C2 to eliminate it if the range of args is > known. The `if` code is easier to understand. > > Benchmark results, with one of the operands changing signs every iteration, > 1000 iterations: > > > Benchmark Mode Cnt Score Error Units > MulHiTest.mulHiTest1 (aph) avgt 3 1570.587 ± 16.602 ns/op > MulHiTest.mulHiTest2 (adinn) avgt 3 2237.637 ± 4.740 ns/op > > In any case, note that with this optimization the unsigned mulHi is in the > nanosecond range, so Good Enough. IMO. Weirdly, it's the other way around on AArch64, though there's little in it:

Benchmark             Mode  Cnt     Score   Error  Units
MulHiTest.mulHiTest1  avgt    3  1492.108 ± 0.301  ns/op
MulHiTest.mulHiTest2  avgt    3  1219.521 ± 1.516  ns/op

but this is only in the case where we have unpredictable branches. Go with simple and easy to understand; it doesn't much matter. - PR: https://git.openjdk.java.net/jdk/pull/4644
Re: RFR: 8188044: We need Math.unsignedMultiplyHigh [v2]
On Fri, 2 Jul 2021 11:06:06 GMT, Andrew Dinn wrote: > You can also do that branchlessly which might prove better > > ``` > long result = Math.multiplyHigh(x, y); > result += (y & (x >> 63)); > result += (x & (y >> 63)); > return result; > ``` I doubt very much that it would be better, because these days branch prediction is excellent, and we also have conditional select instructions. Exposing the condition helps C2 to eliminate it if the range of args is known. The `if` code is easier to understand. Benchmark results, with one of the operands changing signs every iteration, 1000 iterations:

Benchmark                     Mode  Cnt     Score    Error  Units
MulHiTest.mulHiTest1 (aph)    avgt    3  1570.587 ± 16.602  ns/op
MulHiTest.mulHiTest2 (adinn)  avgt    3  2237.637 ±  4.740  ns/op

In any case, note that with this optimization the unsigned mulHi is in the nanosecond range, so Good Enough. IMO. - PR: https://git.openjdk.java.net/jdk/pull/4644
Re: RFR: 8188044: We need Math.unsignedMultiplyHigh [v2]
On Fri, 2 Jul 2021 09:37:47 GMT, Andrew Haley wrote: >> Brian Burkhalter has updated the pull request incrementally with one >> additional commit since the last revision: >> >> 8188044: Add @see links between multiplyHigh() and unsignedMultiplyHigh(). > > src/java.base/share/classes/java/lang/Math.java line 1211: > >> 1209: long z1 = t >>> 32; >> 1210: >> 1211: return x1 * y1 + z1 + (z0 >>> 32); > > Suggestion: > > long result = Math.multiplyHigh(x, y); > if (x < 0) result += y; > if (y < 0) result += x; > return result; This is just subtracting the 2's-complement offset. I guess the idea, longer term, is that this be an intrinsic anyway, but if you do `unsignedMultiplyHigh` this way you'll utilize the existing `multiplyHigh` intrinsic on all platforms that have it. - PR: https://git.openjdk.java.net/jdk/pull/4644
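[Editor's sketch] The branchy suggestion above and the branchless one-liner from elsewhere in the thread, side by side as a runnable sketch (method names are mine), cross-checked against a 128-bit reference:

```java
import java.math.BigInteger;

// Two ways to derive unsignedMultiplyHigh from the signed multiplyHigh
// intrinsic, as discussed in the thread.
final class UMulHigh {
    // Branchy: subtract the 2's-complement offset for each negative operand.
    static long branchy(long x, long y) {
        long result = Math.multiplyHigh(x, y);
        if (x < 0) result += y;
        if (y < 0) result += x;
        return result;
    }

    // Branchless: (x >> 63) is all-ones exactly when x < 0, so the
    // masked add performs the same correction without a branch.
    static long branchless(long x, long y) {
        return Math.multiplyHigh(x, y)
                + ((x >> 63) & y)
                + ((y >> 63) & x);
    }

    // Reference: full 128-bit unsigned product, high 64 bits.
    static long reference(long x, long y) {
        BigInteger p = new BigInteger(Long.toUnsignedString(x))
                .multiply(new BigInteger(Long.toUnsignedString(y)));
        return p.shiftRight(64).longValue();
    }
}
```

Both corrections work because all arithmetic wraps mod 2^64, so adding y (when x's sign bit represents a "missing" 2^64 · y in the unsigned product) lands on the right high word.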
Re: RFR: 8188044: We need Math.unsignedMultiplyHigh [v2]
On Thu, 1 Jul 2021 18:08:22 GMT, Brian Burkhalter wrote: >> Please consider this proposal to add a method >> `unsignedMultiplyHigh(long,long)` to each of `java.lang.Math` and >> `java.lang.StrictMath`. > > Brian Burkhalter has updated the pull request incrementally with one > additional commit since the last revision: > > 8188044: Add @see links between multiplyHigh() and unsignedMultiplyHigh(). src/java.base/share/classes/java/lang/Math.java line 1211: > 1209: long z1 = t >>> 32; > 1210: > 1211: return x1 * y1 + z1 + (z0 >>> 32); Suggestion: long result = Math.multiplyHigh(x, y); if (x < 0) result += y; if (y < 0) result += x; return result; - PR: https://git.openjdk.java.net/jdk/pull/4644
Re: RFR: 8265768 [aarch64] Use glibc libm impl for dlog, dlog10, dexp iff 2.29 or greater on AArch64.
Greg, what's your email address? Everything I try bounces. -- Andrew Haley (he/him) Java Platform Lead Engineer Red Hat UK Ltd. <https://www.redhat.com> https://keybase.io/andrewhaley EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671
Re: JEP draft: Scope Locals
On 5/19/21 10:51 PM, David Lloyd wrote: > I think it would be really nice if the snapshot class could hold its > run/call method rather than making it a static method of `ScopeLocal`. This > reads more naturally to me (and is nicer to write): True, but inheritance is *extremely* time-critical when creating a ForkJoin task. In many cases there won't be any scope locals to inherit, and by allowing null snapshots you save a significant fraction of a nanosecond when creating one. And I know, this is hard to believe, but such an overhead has a significant macroscopic effect on benchmarks. However, I do intend to fix this in a different way, if I can. -- Andrew Haley (he/him) Java Platform Lead Engineer Red Hat UK Ltd. <https://www.redhat.com> https://keybase.io/andrewhaley EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671
Re: JEP draft: Scope Locals
On 5/19/21 9:55 PM, David Lloyd wrote: > Turning this around, I would argue that there are few (or perhaps *no*) > cases where it would ever be desirable to inherit scope locals across > thread creation; in cases where this is explicitly desired, one can always > resume the snapshot from within the thread's Runnable. Was there a > particular use case this was meant to address? Structured Concurrency, but also possibly anywhere that inheritable thread locals are used now. -- Andrew Haley (he/him) Java Platform Lead Engineer Red Hat UK Ltd. <https://www.redhat.com> https://keybase.io/andrewhaley EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671
Re: JEP draft: Scope Locals
On 5/19/21 5:55 PM, Peter Levart wrote: > In other words, non-inheritable bindings are > never transferred from thread to thread automatically or by > snapshot/runWithSnapshot. I can see that snapshot/runWithSnapshot was > meant as a mechanism to "simulate" inheritance of bindings when > execution is transferred from one thread to another which is not a newly > started child thread. Yes. However, this part of the draft proposal is undergoing some revision, and it looks like it will make more sense to control inheritance in a different way, one that will permit more flexible control over what gets inherited and when. -- Andrew Haley (he/him) Java Platform Lead Engineer Red Hat UK Ltd. <https://www.redhat.com> https://keybase.io/andrewhaley EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671
Re: JEP draft: Scope Locals
On 5/15/21 6:15 PM, Remi Forax wrote: > I think the JEP should be more explicit about the shortcoming of ThreadLocal > and how the design of scope local fix both issues. Yes. It's in progress. -- Andrew Haley (he/him) Java Platform Lead Engineer Red Hat UK Ltd. <https://www.redhat.com> https://keybase.io/andrewhaley EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671
Re: [External] : Re: JEP draft: Scope Locals
On 5/18/21 3:19 AM, Mike Rettig wrote: > With the current proposal I can't simply flush and close at the > end of the scope because there might be a child scope that is still active. Yes, that's true. I think that what you have in mind is a more elaborate mechanism than what is proposed here, which does no more than bind values to names over a scope. There needs to be more discussion in the JEP of what this proposal isn't intended to do, and how we might cope with mutability of a scope local's values. -- Andrew Haley (he/him) Java Platform Lead Engineer Red Hat UK Ltd. <https://www.redhat.com> https://keybase.io/andrewhaley EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671
Re: JEP draft: Scope Locals
On 5/15/21 6:50 PM, Peter Levart wrote: > What if I wanted to create and start a thread that would be "pristine" - > not have any ScopeLocal value bound? Is this intentionally not allowed > in order to make sure that inheritable ScopeLocal(s) can't be cleared by > code that has no access to ScopeLocal instance(s)? That one is about to be changed by a revision to the JEP. There clearly is a need to control whether a newly-created thread inherits scope locals or not. For instance, an Executor might lazily create worker threads, and we don't want them to inherit scope locals from the thread that was running. -- Andrew Haley (he/him) Java Platform Lead Engineer Red Hat UK Ltd. <https://www.redhat.com> https://keybase.io/andrewhaley EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671
Re: JEP draft: Scope Locals
On 5/15/21 10:10 AM, Alex Otenko wrote: > ScopeLocal.where(...).run(...).finally(...) would be nice. Is that different from try ... run ... finally ? -- Andrew Haley (he/him) Java Platform Lead Engineer Red Hat UK Ltd. <https://www.redhat.com> https://keybase.io/andrewhaley EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671
Re: JEP draft: Scope Locals
On 5/14/21 8:19 PM, Mike Rettig wrote: > I don't see a way to capture the lifecycle of the scoped variable. I think > for framework designers it's going to be important to know the life cycle > of the scoped variable especially with snapshotting and inheritance. The > lack of a defined life cycle for thread locals is one of the many reasons > why it should be avoided. This requirement sounds similar, but I think it's really a different idea. Scope locals are only binding data to names and don't attempt to handle lifetime issues. Apart from anything else, that would require a more heavyweight mechanism than scope locals, which need to be very lightweight for some applications. Over at Loom we've been talking about lifetimes and Ron has some ideas, but scope locals are not likely to be that thing. -- Andrew Haley (he/him) Java Platform Lead Engineer Red Hat UK Ltd. <https://www.redhat.com> https://keybase.io/andrewhaley EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671
Re: JEP draft: Scope Locals
On 5/15/21 12:28 AM, Nathan Reynolds wrote: > What about cleanup of the scoped objects? For example, how do I ensure > close() or some cleanup method is called at the right time on the scoped > object? Do I need to use a Weak/Soft/PhantomReference and a ReferenceQueue > to track the object? Do I use Cleaner? Does the object need to implement > finalize()... hopefully not? All a scope local does is bind a value to a name, in a scope. This isn't about C++-style destructors, which have their merits but are a different subject. -- Andrew Haley (he/him) Java Platform Lead Engineer Red Hat UK Ltd. <https://www.redhat.com> https://keybase.io/andrewhaley EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671
Re: JEP draft: Scope Locals
On 5/14/21 3:45 PM, Brian Goetz wrote: > Where I think the JEP draft (the exposition, not the feature design) > could be improved is to make the relationship to TL more clear. IMO, > this is a "more modern" TL design; it addresses the same primary use > cases, without the unconstrained mutability of TLs. The safety that > comes from the immutability enables some new concurrent patterns (e.g., > structured concurrency) and some nice optimizations. Thank you, that's a nice summary. I was trying to describe scope locals in a stand-alone way, but everything that a scope local can do can also be done by a TL, albeit with some clumsiness. I should be more explicit about that. -- Andrew Haley (he/him) Java Platform Lead Engineer Red Hat UK Ltd. <https://www.redhat.com> https://keybase.io/andrewhaley EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671
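[Editor's sketch] To make "everything that a scope local can do can also be done by a TL, albeit with some clumsiness" concrete, here is a deliberately clumsy toy that emulates bind-a-value-over-a-scope semantics on top of ThreadLocal — nothing like the real implementation, and without inheritance or the performance properties the JEP is after:

```java
// Toy "scope local": the binding exists only for the dynamic extent of
// runWhere(), and nesting restores the outer binding on exit.
final class ToyScopeLocal<T> {
    private final ThreadLocal<T> slot = new ThreadLocal<>();

    T get() {
        T v = slot.get();
        if (v == null) throw new IllegalStateException("not bound");
        return v;
    }

    void runWhere(T value, Runnable body) {
        T outer = slot.get();
        slot.set(value);
        try {
            body.run();
        } finally {
            // Unwind to the enclosing binding, or to unbound.
            if (outer == null) slot.remove(); else slot.set(outer);
        }
    }
}
```

The clumsiness is exactly the point: the try/finally discipline that makes the binding well-scoped is enforced by convention here, whereas the real construct enforces it by design and can therefore treat bindings as immutable.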
Re: JEP draft: Scope Locals
On 5/13/21 4:59 PM, Alan Snyder wrote: > > >> On May 13, 2021, at 2:03 AM, Andrew Haley wrote: >> >> On 5/12/21 8:12 PM, Alan Snyder wrote: >>> From the motivation section: >>> >>>> So you want to invoke a method |X| in a library that later calls back into >>>> your code. In your callback you need some context, perhaps a transaction >>>> ID or some |File| instances. However, |X| provides no way to pass a >>>> reference through their code into your callback. What are you to do? >>> >>> When I read this, my first thought was… pass a callback that is bound to >>> the context. >>> >>> I suppose the example you have in mind has a fixed callback. >> >> Fixed? It's a callback, of some form. > > What I meant was that the callback has a larger scope than the method call. > Otherwise, you could bind the context to the callback object. I don't know quite what you mean by "larger scope," but clearly, if you can explicitly pass a callback created by a lambda, that's a more explicit solution. >>> Is this a common practice? Doesn’t it point to a deficiency in the library >>> API? >> >> Not necessarily. For example, say you want to give access to a particular >> resource (a log stream, say) to trusted callees. Nobody AFAIK passes a log >> stream as an explicit argument through business logic because that's just >> too much clutter. > > That sounds interesting, but I’m not following how scope locals would be used > to solve this problem. In the outer scope you bind a logger scope local, and all transitive callees can use that. -- Andrew Haley (he/him) Java Platform Lead Engineer Red Hat UK Ltd. <https://www.redhat.com> https://keybase.io/andrewhaley EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671
Re: JEP draft: Scope Locals
On 5/12/21 8:12 PM, Alan Snyder wrote: > From the motivation section: > >> So you want to invoke a method |X| in a library that later calls back into >> your code. In your callback you need some context, perhaps a transaction ID >> or some |File| instances. However, |X| provides no way to pass a reference >> through their code into your callback. What are you to do? > > When I read this, my first thought was… pass a callback that is bound to the > context. > > I suppose the example you have in mind has a fixed callback. Fixed? It's a callback, of some form. > Is this a common practice? Doesn’t it point to a deficiency in the library > API? Not necessarily. For example, say you want to give access to a particular resource (a log stream, say) to trusted callees. Nobody AFAIK passes a log stream as an explicit argument through business logic because that's just too much clutter. > Is this feature just a workaround for poorly designed libraries, or are there > other examples? It's not really about poor design as much as separation of concerns. Intermediate libraries have no way to know what might need to be passed to callees, and in many cases it's better isolation if they don't get to find out. The latter is especially true of passing security permissions. Also, there is evolution of libraries: with scope locals you don't need to change library interfaces to add useful capabilities like logging. -- Andrew Haley (he/him) Java Platform Lead Engineer Red Hat UK Ltd. <https://www.redhat.com> https://keybase.io/andrewhaley EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671
JEP draft: Scope Locals
There's been considerable discussion about scope locals on the loom-dev list, and it's now time to open this to a wider audience. This subject is important because, although scope locals were motivated by the needs of Loom, they have many potential applications outside that project. The draft JEP is at https://bugs.openjdk.java.net/browse/JDK-8263012 I've already received some very helpful suggestions for enhancements to the API, and it'll take me a while to work through them all. In particular, Paul Sandoz has suggested that I unify the classes Snapshot and Carrier, and it will take me some time to understand the consequences of that. In the meantime, please have a look at the JEP and comment here. For reference, earlier discussions are at https://mail.openjdk.java.net/pipermail/loom-dev/2021-March/002268.html https://mail.openjdk.java.net/pipermail/loom-dev/2021-April/002287.html https://mail.openjdk.java.net/pipermail/loom-dev/2021-May/002427.html -- Andrew Haley (he/him) Java Platform Lead Engineer Red Hat UK Ltd. <https://www.redhat.com> https://keybase.io/andrewhaley EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671
Re: RFR: 8265768 [aarch64] Use glibc libm impl for dlog, dlog10, dexp iff 2.29 or greater on AArch64.
On Thu, 29 Apr 2021 20:24:25 GMT, Carlos O'Donell wrote: > Where does the requirement for monotonicity come from? Semi-monotonicity, to be precise. In the spec of java.lang.Math, "Besides accuracy at individual arguments, maintaining proper relations between the method at different arguments is also important. Therefore, most methods with more than 0.5 ulp errors are required to be semi-monotonic: whenever the mathematical function is non-decreasing, so is the floating-point approximation, likewise, whenever the mathematical function is non-increasing, so is the floating-point approximation. Not all approximations that have 1 ulp accuracy will automatically meet the monotonicity requirements." I wouldn't be surprised if the approximations we need in glibc meet this anyway. We just need to check. - PR: https://git.openjdk.java.net/jdk/pull/3510
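The semi-monotonicity requirement quoted above can be spot-checked mechanically. This is only an illustrative sketch (the class and method names are invented, and sampling is evidence, not the exhaustive check the spec would require): on an interval where the mathematical function is non-decreasing, the floating-point approximation must never decrease.

```java
import java.util.function.DoubleUnaryOperator;

public class SemiMonotonicCheck {
    // Returns true if f never decreases over `samples` evenly spaced points
    // in [lo, hi] -- the property the java.lang.Math spec demands wherever
    // the exact mathematical function is non-decreasing.
    static boolean nonDecreasingOnSamples(DoubleUnaryOperator f,
                                          double lo, double hi, int samples) {
        double prev = f.applyAsDouble(lo);
        for (int i = 1; i <= samples; i++) {
            double y = f.applyAsDouble(lo + (hi - lo) * i / samples);
            if (y < prev) return false;   // the approximation dipped
            prev = y;
        }
        return true;
    }

    public static void main(String[] args) {
        // log is non-decreasing on (0, inf); a conforming implementation
        // must therefore be semi-monotonic there.
        System.out.println(nonDecreasingOnSamples(Math::log, 1.0, 1.0e6, 1_000_000));
    }
}
```

A glibc libm routine would pass the same kind of check on each interval where the true function is monotone, which is the verification being asked for here.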
Re: RFR: 8264805: Remove the experimental Ahead-of-Time Compiler [v4]
On Thu, 8 Apr 2021 17:24:38 GMT, Vladimir Kozlov wrote: >> As part of [JEP 410](http://openjdk.java.net/jeps/410) remove code related >> to Ahead-of-Time Compiler from JDK: >> >> - `jdk.aot` module — the `jaotc` tool >> - `src/hotspot/share/aot` — loads AoT compiled code into VM for execution >> - related HotSpot code guarded by `#if INCLUDE_AOT` >> >> Additionally, remove tests as well as code in makefiles related to AoT >> compilation. >> >> Tested hs-tier1-4 > > Vladimir Kozlov has updated the pull request incrementally with one > additional commit since the last revision: > > Remove exports from Graal module to jdk.aot AArch64 looks fine. - Marked as reviewed by aph (Reviewer). PR: https://git.openjdk.java.net/jdk/pull/3398
Re: RFR: 8256245: AArch64: Implement Base64 decoding intrinsic [v7]
On Thu, 8 Apr 2021 06:33:43 GMT, Dong Bo wrote: >> In JDK-8248188, IntrinsicCandidate and API is added for Base64 decoding. >> Base64 decoding can be improved on aarch64 with ld4/tbl/tbx/st3, a basic >> idea can be found at >> http://0x80.pl/articles/base64-simd-neon.html#encoding-quadwords. >> >> Patch passed jtreg tier1-3 tests with linux-aarch64-server-fastdebug build. >> Tests in `test/jdk/java/util/Base64/` and >> `compiler/intrinsics/base64/TestBase64.java` runned specially for the >> correctness of the implementation. >> >> There can be illegal characters at the start of the input if the data is >> MIME encoded. >> It would be no benefits to use SIMD for this case, so the stub use no-simd >> instructions for MIME encoded data now. >> >> A JMH micro, Base64Decode.java, is added for performance test. >> With different input length (upper-bounded by parameter `maxNumBytes` in the >> JMH micro), >> we witness ~2.5x improvements with long inputs and no regression with short >> inputs for raw base64 decodeing, minor improvements (~10.95%) for MIME on >> Kunpeng916. 
>>
>> The Base64Decode.java JMH micro-benchmark results:
>>
>> Benchmark                          (lineSize)  (maxNumBytes)  Mode  Cnt       Score      Error  Units
>>
>> # Kunpeng916, intrinsic
>> Base64Decode.testBase64Decode              4              1  avgt    5      48.614 ±    0.609  ns/op
>> Base64Decode.testBase64Decode              4              3  avgt    5      58.199 ±    1.650  ns/op
>> Base64Decode.testBase64Decode              4              7  avgt    5      69.400 ±    0.931  ns/op
>> Base64Decode.testBase64Decode              4             32  avgt    5      96.818 ±    1.687  ns/op
>> Base64Decode.testBase64Decode              4             64  avgt    5     122.856 ±    9.217  ns/op
>> Base64Decode.testBase64Decode              4             80  avgt    5     130.935 ±    1.667  ns/op
>> Base64Decode.testBase64Decode              4             96  avgt    5     143.627 ±    1.751  ns/op
>> Base64Decode.testBase64Decode              4            112  avgt    5     152.311 ±    1.178  ns/op
>> Base64Decode.testBase64Decode              4            512  avgt    5     342.631 ±    0.584  ns/op
>> Base64Decode.testBase64Decode              4           1000  avgt    5     573.635 ±    1.050  ns/op
>> Base64Decode.testBase64Decode              4              2  avgt    5    9534.136 ±   45.172  ns/op
>> Base64Decode.testBase64Decode              4              5  avgt    5   22718.726 ±  192.070  ns/op
>> Base64Decode.testBase64MIMEDecode          4              1  avgt   10      63.558 ±    0.336  ns/op
>> Base64Decode.testBase64MIMEDecode          4              3  avgt   10      82.504 ±    0.848  ns/op
>> Base64Decode.testBase64MIMEDecode          4              7  avgt   10     120.591 ±    0.608  ns/op
>> Base64Decode.testBase64MIMEDecode          4             32  avgt   10     324.314 ±    6.236  ns/op
>> Base64Decode.testBase64MIMEDecode          4             64  avgt   10     532.678 ±    4.670  ns/op
>> Base64Decode.testBase64MIMEDecode          4             80  avgt   10     678.126 ±    4.324  ns/op
>> Base64Decode.testBase64MIMEDecode          4             96  avgt   10     771.603 ±    6.393  ns/op
>> Base64Decode.testBase64MIMEDecode          4            112  avgt   10     889.608 ±    0.759  ns/op
>> Base64Decode.testBase64MIMEDecode          4            512  avgt   10    3663.557 ±    3.422  ns/op
>> Base64Decode.testBase64MIMEDecode          4           1000  avgt   10    7017.784 ±    9.128  ns/op
>> Base64Decode.testBase64MIMEDecode          4              2  avgt   10  128670.660 ± 7951.521  ns/op
>> Base64Decode.testBase64MIMEDecode          4              5  avgt   10  317113.667 ±  161.758  ns/op
>>
>> # Kunpeng916, default
>> Base64Decode.testBase64Decode              4              1  avgt    5      48.455 ±    0.571  ns/op
>> Base64Decode.testBase64Decode              4              3  avgt    5      57.937 ±    0.505  ns/op
>> Base64Decode.testBase64Decode              4              7  avgt    5      73.823 ±    1.452  ns/op
>> Base64Decode.testBase64Decode              4             32  avgt    5     106.484 ±    1.243  ns/op
>> Base64Decode.testBase64Decode              4             64  avgt    5     141.004 ±    1.188  ns/op
>> Base64Decode.testBase64Decode              4             80  avgt    5     156.284 ±    0.572  ns/op
>> Base64Decode.testBase64Decode              4             96  avgt    5     174.137 ±    0.177  ns/op
>> Base64Decode.testBase64Decode              4            112  avgt    5     188.445 ±    0.572  ns/op
>> Base64Decode.testBase64Decode              4            512  avgt    5     610.847 ±    1.559  ns/op
>> Base64Decode.testBase64Decode              4           1000  avgt
Re: RFR: 8256245: AArch64: Implement Base64 decoding intrinsic [v6]
On Wed, 7 Apr 2021 05:51:02 GMT, Dong Bo wrote: >> In JDK-8248188, IntrinsicCandidate and API is added for Base64 decoding. >> Base64 decoding can be improved on aarch64 with ld4/tbl/tbx/st3, a basic >> idea can be found at >> http://0x80.pl/articles/base64-simd-neon.html#encoding-quadwords. >> >> Patch passed jtreg tier1-3 tests with linux-aarch64-server-fastdebug build. >> Tests in `test/jdk/java/util/Base64/` and >> `compiler/intrinsics/base64/TestBase64.java` runned specially for the >> correctness of the implementation. >> >> There can be illegal characters at the start of the input if the data is >> MIME encoded. >> It would be no benefits to use SIMD for this case, so the stub use no-simd >> instructions for MIME encoded data now. >> >> A JMH micro, Base64Decode.java, is added for performance test. >> With different input length (upper-bounded by parameter `maxNumBytes` in the >> JMH micro), >> we witness ~2.5x improvements with long inputs and no regression with short >> inputs for raw base64 decodeing, minor improvements (~10.95%) for MIME on >> Kunpeng916. 
Re: RFR: 8256245: AArch64: Implement Base64 decoding intrinsic [v6]
On Wed, 7 Apr 2021 09:50:45 GMT, Andrew Haley wrote: >> Dong Bo has updated the pull request incrementally with one additional >> commit since the last revision: >> >> fix misleading annotations > > src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 5829: > >> 5827: __ strb(r14, __ post(dst, 1)); >> 5828: __ strb(r15, __ post(dst, 1)); >> 5829: __ strb(r13, __ post(dst, 1)); > > I think this sequence should be 4 BFMs, STRW, BFM, STRW. That's the best we > can do, I think. Sorry, that's not quite right, but you get the idea: let's not generate unnecessary memory traffic. - PR: https://git.openjdk.java.net/jdk/pull/3228
Re: RFR: 8256245: AArch64: Implement Base64 decoding intrinsic [v5]
On Tue, 6 Apr 2021 07:25:57 GMT, Dong Bo wrote: >> In JDK-8248188, IntrinsicCandidate and API is added for Base64 decoding. >> Base64 decoding can be improved on aarch64 with ld4/tbl/tbx/st3, a basic >> idea can be found at >> http://0x80.pl/articles/base64-simd-neon.html#encoding-quadwords. >> >> Patch passed jtreg tier1-3 tests with linux-aarch64-server-fastdebug build. >> Tests in `test/jdk/java/util/Base64/` and >> `compiler/intrinsics/base64/TestBase64.java` runned specially for the >> correctness of the implementation. >> >> There can be illegal characters at the start of the input if the data is >> MIME encoded. >> It would be no benefits to use SIMD for this case, so the stub use no-simd >> instructions for MIME encoded data now. >> >> A JMH micro, Base64Decode.java, is added for performance test. >> With different input length (upper-bounded by parameter `maxNumBytes` in the >> JMH micro), >> we witness ~2.5x improvements with long inputs and no regression with short >> inputs for raw base64 decodeing, minor improvements (~10.95%) for MIME on >> Kunpeng916. 
Re: RFR: 8256245: AArch64: Implement Base64 decoding intrinsic [v3]
On Fri, 2 Apr 2021 10:01:27 GMT, Andrew Haley wrote: >> Dong Bo has updated the pull request with a new target base due to a merge >> or a rebase. The pull request now contains six commits: >> >> - Merge branch 'master' into aarch64.base64.decode >> - copyright >> - trivial fixes >> - Handling error in SIMD case with loops, combining two non-SIMD cases into >> one code blob, addressing other comments >> - Merge branch 'master' into aarch64.base64.decode >> - 8256245: AArch64: Implement Base64 decoding intrinsic > > src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 5802: > >> 5800: // The 1st character of the input can be illegal if the data is >> MIME encoded. >> 5801: // We cannot benefits from SIMD for this case. The max line size >> of MIME >> 5802: // encoding is 76, with the PreProcess80B blob, we actually use >> no-simd > > "cannot benefit" OK, so I now understand what is actually going on here, and it has nothing to do with illegal first characters. The problem is that the maximum block length the decode will be supplied with is 76 bytes, and there isn't enough time for the SIMD to be worthwhile. So the comment should be "In the MIME case, the line length cannot be more than 76 bytes (see RFC 2045.) This is too short a block for SIMD to be worthwhile, so we use non-SIMD here." - PR: https://git.openjdk.java.net/jdk/pull/3228
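The 76-byte line limit cited above can be observed directly with the standard `java.util.Base64` MIME encoder (the class name in this sketch is hypothetical): the output is broken into CRLF-separated lines of at most 76 characters, so a MIME decoder never receives more than 76 bytes between line breaks.

```java
import java.util.Base64;

public class MimeLineLength {
    public static void main(String[] args) {
        // Encode 200 arbitrary bytes with the MIME encoder. Per RFC 2045 it
        // emits lines of at most 76 base64 characters separated by CRLF.
        String encoded = Base64.getMimeEncoder().encodeToString(new byte[200]);
        for (String line : encoded.split("\r\n"))
            System.out.println(line.length());   // 76, 76, 76, then the remainder
    }
}
```

That cap is what makes per-line SIMD decoding unprofitable here: the block is too short to amortize the vector setup cost.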
Re: RFR: 8256245: AArch64: Implement Base64 decoding intrinsic
On Sat, 27 Mar 2021 08:58:03 GMT, Dong Bo wrote: > There can be illegal characters at the start of the input if the data is MIME > encoded. > There would be no benefit to using SIMD for this case, so the stub uses non-SIMD > instructions for MIME-encoded data now. What is the reasoning here? Sure, there can be illegal characters at the start, but what if there are not? The generic logic uses decodeBlock() even in the MIME case, because we don't know that there certainly will be illegal characters. - PR: https://git.openjdk.java.net/jdk/pull/3228
Re: Faster double parser?
On 4/5/21 9:08 AM, Alan Bateman wrote: > On 04/04/2021 09:53, Andrew Haley wrote: >> : >> We also have a candidate for a String -> toDouble parser from Raffaello >> Giulietti: >> >> https://mail.openjdk.java.net/pipermail/core-libs-dev/2018-March/052355.html >> >> As far as I can see there's nothing wrong with it, but it got stuck in >> review. This was a bad failure of our review processes, I'm afraid. I >> wonder if we will do any better with this one. > > Yeah, there was a lengthy discussion about this at the OpenJDK Committer > Workshop just after FOSDEM 2020. Brian Burkhalter and Andrew Dinn had > both put time into looking at the proposal, but the conclusion was > that Raffaello's paper "The Schubfach way to render doubles" needed a > review from an expert in this area. I wonder. The implementation has even been proved correct for floats by exhaustion. (Of course that's not possible for doubles.) I submit, therefore, that we could at least use the float version without reasonable fear of breaking something. We seem to be paralysed by this one. > Guy Steele was mentioned but I don't know if he was approached about it. -- Andrew Haley (he/him) Java Platform Lead Engineer Red Hat UK Ltd. <https://www.redhat.com> https://keybase.io/andrewhaley EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671
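The correctness criterion behind "proved correct for floats by exhaustion" is the round-trip property: parsing the printed decimal form must recover the exact same bits. For float this can be checked over all 2^32 bit patterns; the sketch below (class name invented, sampling rather than exhaustive) illustrates the shape of such a test.

```java
import java.util.Random;

public class RoundTripCheck {
    public static void main(String[] args) {
        Random r = new Random(42);
        for (int i = 0; i < 1_000_000; i++) {
            float f = Float.intBitsToFloat(r.nextInt());
            if (Float.isNaN(f)) continue;        // NaN payloads need not round-trip
            // Float.toString must emit enough digits that parseFloat, which
            // returns the nearest float, lands back on the identical value.
            float g = Float.parseFloat(Float.toString(f));
            if (Float.floatToRawIntBits(f) != Float.floatToRawIntBits(g))
                throw new AssertionError("round-trip failed for bits "
                                         + Float.floatToIntBits(f));
        }
        System.out.println("ok");
    }
}
```

An exhaustive version simply iterates `i` over every int bit pattern instead of sampling; for double the space is 2^64, which is why exhaustion is off the table there.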
Re: Faster double parser?
On 4/4/21 9:53 AM, Andrew Haley wrote: > On 4/3/21 12:58 PM, Carfield Yim wrote: >> HI all, just aware of this project, >> https://github.com/wrandelshofer/FastDoubleParser, not sure if it good idea >> to use this as default Double Parser in JVM? > > We also have a candidate for a String -> toDouble Argh. Double -> String. > parser from Raffaello Giulietti: -- Andrew Haley (he/him) Java Platform Lead Engineer Red Hat UK Ltd. <https://www.redhat.com> https://keybase.io/andrewhaley EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671
Re: Faster double parser?
On 4/3/21 12:58 PM, Carfield Yim wrote: > HI all, just aware of this project, > https://github.com/wrandelshofer/FastDoubleParser, not sure if it good idea > to use this as default Double Parser in JVM? We also have a candidate for a String -> toDouble parser from Raffaello Giulietti: https://mail.openjdk.java.net/pipermail/core-libs-dev/2018-March/052355.html As far as I can see there's nothing wrong with it, but it got stuck in review. This was a bad failure of our review processes, I'm afraid. I wonder if we will do any better with this one. -- Andrew Haley (he/him) Java Platform Lead Engineer Red Hat UK Ltd. <https://www.redhat.com> https://keybase.io/andrewhaley EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671
Re: RFR: 8256245: AArch64: Implement Base64 decoding intrinsic [v3]
On Fri, 2 Apr 2021 03:10:57 GMT, Dong Bo wrote: >> In JDK-8248188, IntrinsicCandidate and API is added for Base64 decoding. >> Base64 decoding can be improved on aarch64 with ld4/tbl/tbx/st3, a basic >> idea can be found at >> http://0x80.pl/articles/base64-simd-neon.html#encoding-quadwords. >> >> Patch passed jtreg tier1-3 tests with linux-aarch64-server-fastdebug build. >> Tests in `test/jdk/java/util/Base64/` and >> `compiler/intrinsics/base64/TestBase64.java` runned specially for the >> correctness of the implementation. >> >> There can be illegal characters at the start of the input if the data is >> MIME encoded. >> It would be no benefits to use SIMD for this case, so the stub use no-simd >> instructions for MIME encoded data now. >> >> A JMH micro, Base64Decode.java, is added for performance test. >> With different input length (upper-bounded by parameter `maxNumBytes` in the >> JMH micro), >> we witness ~2.5x improvements with long inputs and no regression with short >> inputs for raw base64 decodeing, minor improvements (~10.95%) for MIME on >> Kunpeng916. 
Re: RFR: 8256245: AArch64: Implement Base64 decoding intrinsic
On Fri, 2 Apr 2021 07:05:26 GMT, Dong Bo wrote: > PING... Any suggestions on the updated commit? Once you reply to the comments, sure. - PR: https://git.openjdk.java.net/jdk/pull/3228
Re: RFR: 8256245: AArch64: Implement Base64 decoding intrinsic
On Mon, 29 Mar 2021 03:28:54 GMT, Dong Bo wrote: > > Please consider losing the non-SIMD case. It doesn't result in any > > significant gain. > > The non-SIMD case is useful for MIME decoding performance. Your test results suggest that it isn't useful for that, surely? - PR: https://git.openjdk.java.net/jdk/pull/3228
Re: RFR: 8256245: AArch64: Implement Base64 decoding intrinsic
On Mon, 29 Mar 2021 03:28:54 GMT, Dong Bo wrote: > I think I can rewrite this part as loops. > With an initial implementation, we can have almost half of the code size reduced > (1312B -> 748B). Sounds OK to you? Sounds great, but I'm still somewhat concerned that the non-SIMD case only offers a 3-12% performance gain. Make it just 748 bytes, and therefore not icache-hostile, and then perhaps the balance of risk and reward is justified. - PR: https://git.openjdk.java.net/jdk/pull/3228
Re: RFR: 8256245: AArch64: Implement Base64 decoding intrinsic
On Mon, 29 Mar 2021 03:15:57 GMT, Nick Gasson wrote: > > There's a lot of unrolling, particularly in the non-SIMD case. Please > > consider taking out some of the unrolling; I suspect it'd not increase time > > by very much but would greatly reduce the code cache pollution. It's very > > tempting to unroll everything to make a benchmark run quickly, but we have > > to take a balanced approach. > > But there's only ever one of these generated at startup, right? It's not like > the string intrinsics that are expanded at every call site. I'm talking about icache pollution. This stuff could be quite small. - PR: https://git.openjdk.java.net/jdk/pull/3228
Re: RFR: 8256245: AArch64: Implement Base64 decoding intrinsic
On Sat, 27 Mar 2021 09:53:37 GMT, Andrew Haley wrote:

>> In JDK-8248188, an IntrinsicCandidate and API were added for Base64 decoding. Base64 decoding can be improved on aarch64 with ld4/tbl/tbx/st3; the basic idea can be found at http://0x80.pl/articles/base64-simd-neon.html#encoding-quadwords.
>>
>> The patch passed jtreg tier1-3 tests with a linux-aarch64-server-fastdebug build. Tests in `test/jdk/java/util/Base64/` and `compiler/intrinsics/base64/TestBase64.java` were run specifically to verify the correctness of the implementation.
>>
>> There can be illegal characters at the start of the input if the data is MIME-encoded. There would be no benefit in using SIMD for this case, so the stub uses non-SIMD instructions for MIME-encoded data for now.
>>
>> A JMH micro, Base64Decode.java, is added for performance testing. With different input lengths (upper-bounded by the parameter `maxNumBytes` in the JMH micro), we see ~2.5x improvements with long inputs and no regression with short inputs for raw base64 decoding, and minor improvements (~10.95%) for MIME on Kunpeng916.
>>
>> The Base64Decode.java JMH micro-benchmark results:
>>
>> Benchmark                          (lineSize) (maxNumBytes) Mode Cnt       Score     Error Units
>>
>> # Kunpeng916, intrinsic
>> Base64Decode.testBase64Decode               4             1 avgt   5      48.614 ±   0.609 ns/op
>> Base64Decode.testBase64Decode               4             3 avgt   5      58.199 ±   1.650 ns/op
>> Base64Decode.testBase64Decode               4             7 avgt   5      69.400 ±   0.931 ns/op
>> Base64Decode.testBase64Decode               4            32 avgt   5      96.818 ±   1.687 ns/op
>> Base64Decode.testBase64Decode               4            64 avgt   5     122.856 ±   9.217 ns/op
>> Base64Decode.testBase64Decode               4            80 avgt   5     130.935 ±   1.667 ns/op
>> Base64Decode.testBase64Decode               4            96 avgt   5     143.627 ±   1.751 ns/op
>> Base64Decode.testBase64Decode               4           112 avgt   5     152.311 ±   1.178 ns/op
>> Base64Decode.testBase64Decode               4           512 avgt   5     342.631 ±   0.584 ns/op
>> Base64Decode.testBase64Decode               4          1000 avgt   5     573.635 ±   1.050 ns/op
>> Base64Decode.testBase64Decode               4             2 avgt   5    9534.136 ±  45.172 ns/op
>> Base64Decode.testBase64Decode               4             5 avgt   5   22718.726 ± 192.070 ns/op
>> Base64Decode.testBase64MIMEDecode           4             1 avgt  10      63.558 ±   0.336 ns/op
>> Base64Decode.testBase64MIMEDecode           4             3 avgt  10      82.504 ±   0.848 ns/op
>> Base64Decode.testBase64MIMEDecode           4             7 avgt  10     120.591 ±   0.608 ns/op
>> Base64Decode.testBase64MIMEDecode           4            32 avgt  10     324.314 ±   6.236 ns/op
>> Base64Decode.testBase64MIMEDecode           4            64 avgt  10     532.678 ±   4.670 ns/op
>> Base64Decode.testBase64MIMEDecode           4            80 avgt  10     678.126 ±   4.324 ns/op
>> Base64Decode.testBase64MIMEDecode           4            96 avgt  10     771.603 ±   6.393 ns/op
>> Base64Decode.testBase64MIMEDecode           4           112 avgt  10     889.608 ±   0.759 ns/op
>> Base64Decode.testBase64MIMEDecode           4           512 avgt  10    3663.557 ±   3.422 ns/op
>> Base64Decode.testBase64MIMEDecode           4          1000 avgt  10    7017.784 ±   9.128 ns/op
>> Base64Decode.testBase64MIMEDecode           4             2 avgt  10  128670.660 ± 7951.521 ns/op
>> Base64Decode.testBase64MIMEDecode           4             5 avgt  10  317113.667 ±  161.758 ns/op
>>
>> # Kunpeng916, default
>> Base64Decode.testBase64Decode               4             1 avgt   5      48.455 ±   0.571 ns/op
>> Base64Decode.testBase64Decode               4             3 avgt   5      57.937 ±   0.505 ns/op
>> Base64Decode.testBase64Decode               4             7 avgt   5      73.823 ±   1.452 ns/op
>> Base64Decode.testBase64Decode               4            32 avgt   5     106.484 ±   1.243 ns/op
>> Base64Decode.testBase64Decode               4            64 avgt
Re: RFR: 8256245: AArch64: Implement Base64 decoding intrinsic
On Sat, 27 Mar 2021 08:58:03 GMT, Dong Bo wrote:

> In JDK-8248188, an IntrinsicCandidate and API were added for Base64 decoding. Base64 decoding can be improved on aarch64 with ld4/tbl/tbx/st3; the basic idea can be found at http://0x80.pl/articles/base64-simd-neon.html#encoding-quadwords.
>
> The patch passed jtreg tier1-3 tests with a linux-aarch64-server-fastdebug build. Tests in `test/jdk/java/util/Base64/` and `compiler/intrinsics/base64/TestBase64.java` were run specifically to verify the correctness of the implementation.
>
> There can be illegal characters at the start of the input if the data is MIME-encoded. There would be no benefit in using SIMD for this case, so the stub uses non-SIMD instructions for MIME-encoded data for now.
>
> A JMH micro, Base64Decode.java, is added for performance testing. With different input lengths (upper-bounded by the parameter `maxNumBytes` in the JMH micro), we see ~2.5x improvements with long inputs and no regression with short inputs for raw base64 decoding, and minor improvements (~10.95%) for MIME on Kunpeng916.
>
> The Base64Decode.java JMH micro-benchmark results:
>
> Benchmark                          (lineSize) (maxNumBytes) Mode Cnt       Score     Error Units
>
> # Kunpeng916, intrinsic
> Base64Decode.testBase64Decode               4             1 avgt   5      48.614 ±   0.609 ns/op
> Base64Decode.testBase64Decode               4             3 avgt   5      58.199 ±   1.650 ns/op
> Base64Decode.testBase64Decode               4             7 avgt   5      69.400 ±   0.931 ns/op
> Base64Decode.testBase64Decode               4            32 avgt   5      96.818 ±   1.687 ns/op
> Base64Decode.testBase64Decode               4            64 avgt   5     122.856 ±   9.217 ns/op
> Base64Decode.testBase64Decode               4            80 avgt   5     130.935 ±   1.667 ns/op
> Base64Decode.testBase64Decode               4            96 avgt   5     143.627 ±   1.751 ns/op
> Base64Decode.testBase64Decode               4           112 avgt   5     152.311 ±   1.178 ns/op
> Base64Decode.testBase64Decode               4           512 avgt   5     342.631 ±   0.584 ns/op
> Base64Decode.testBase64Decode               4          1000 avgt   5     573.635 ±   1.050 ns/op
> Base64Decode.testBase64Decode               4             2 avgt   5    9534.136 ±  45.172 ns/op
> Base64Decode.testBase64Decode               4             5 avgt   5   22718.726 ± 192.070 ns/op
> Base64Decode.testBase64MIMEDecode           4             1 avgt  10      63.558 ±   0.336 ns/op
> Base64Decode.testBase64MIMEDecode           4             3 avgt  10      82.504 ±   0.848 ns/op
> Base64Decode.testBase64MIMEDecode           4             7 avgt  10     120.591 ±   0.608 ns/op
> Base64Decode.testBase64MIMEDecode           4            32 avgt  10     324.314 ±   6.236 ns/op
> Base64Decode.testBase64MIMEDecode           4            64 avgt  10     532.678 ±   4.670 ns/op
> Base64Decode.testBase64MIMEDecode           4            80 avgt  10     678.126 ±   4.324 ns/op
> Base64Decode.testBase64MIMEDecode           4            96 avgt  10     771.603 ±   6.393 ns/op
> Base64Decode.testBase64MIMEDecode           4           112 avgt  10     889.608 ±   0.759 ns/op
> Base64Decode.testBase64MIMEDecode           4           512 avgt  10    3663.557 ±   3.422 ns/op
> Base64Decode.testBase64MIMEDecode           4          1000 avgt  10    7017.784 ±   9.128 ns/op
> Base64Decode.testBase64MIMEDecode           4             2 avgt  10  128670.660 ± 7951.521 ns/op
> Base64Decode.testBase64MIMEDecode           4             5 avgt  10  317113.667 ±  161.758 ns/op
>
> # Kunpeng916, default
> Base64Decode.testBase64Decode               4             1 avgt   5      48.455 ±   0.571 ns/op
> Base64Decode.testBase64Decode               4             3 avgt   5      57.937 ±   0.505 ns/op
> Base64Decode.testBase64Decode               4             7 avgt   5      73.823 ±   1.452 ns/op
> Base64Decode.testBase64Decode               4            32 avgt   5     106.484 ±   1.243 ns/op
> Base64Decode.testBase64Decode               4            64 avgt   5     141.004 ±   1.188 ns/op
> Base64Decode.testBase64Decode               4            80 avgt   5     156.284 ±   0.572 ns/op
> Base64Decode.testBase64Decode               4            96 avgt   5     174.137 ±   0.177 ns/op
> Base64Decode.testBase64Decode               4           112 avgt   5     188.445 ±   0.572 ns/op
> Base64Decode.testBase64Decode               4           512 avgt   5     610.847 ±   1.559 ns/op
> Base64Decode.testBase64Decode               4          1000 avgt   5    1155.368 ±   0.813 ns/op
> Base64Decode.testBase64Decode               4
Re: RFR: 8253795: Implementation of JEP 391: macOS/AArch64 Port [v29]
On Tue, 23 Mar 2021 16:20:47 GMT, Ludovic Henry wrote:

>> Anton Kozlov has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 115 commits:
>>
>>  - Merge branch 'master' into jdk-macos
>>  - JDK-8262491: bsd_aarch64 part
>>  - JDK-8263002: bsd_aarch64 part
>>  - Merge remote-tracking branch 'upstream/jdk/master' into jdk-macos
>>  - Wider #ifdef block
>>  - Fix most of issues in java/foreign/ tests
>>
>>    Failures related to va_args are tracked in JDK-8263512.
>>  - Add Azul copyright
>>  - Update Oracle copyright years
>>  - Use Thread::current_or_null_safe in SafeFetch
>>  - 8262903: [macos_aarch64] Thread::current() called on detached thread
>>  - ... and 105 more: https://git.openjdk.java.net/jdk/compare/a9d2267f...5add9269
>
> Marked as reviewed by luhenry (Author).

> > > > [Back-porting this patch to JDK 11] depends on the will of openjdk11 maintainers to accept this (and a few others, like jep-388, as we depend on it) contribution.
> > >
> > > To the extent that 11u has fixed policies :) we definitely have a policy of accepting patches to keep 11u working on current hardware. So yes.
> >
> > @lewurm That sounds like a green flag for you and jep-388 (with its R18_RESERVED functionality) ;)
>
> Thanks, @theRealAph and @VladimirKempik. We are on it!

It's going to be tricky to do in a really clean way, given some of the weirdnesses of the ABI. However, I think there's probably a need for it.

- PR: https://git.openjdk.java.net/jdk/pull/2200
Re: RFR: 8253795: Implementation of JEP 391: macOS/AArch64 Port [v29]
On Tue, 23 Mar 2021 13:54:24 GMT, Andrew Haley wrote:

>> Anton Kozlov has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 115 commits:
>>
>>  - Merge branch 'master' into jdk-macos
>>  - JDK-8262491: bsd_aarch64 part
>>  - JDK-8263002: bsd_aarch64 part
>>  - Merge remote-tracking branch 'upstream/jdk/master' into jdk-macos
>>  - Wider #ifdef block
>>  - Fix most of issues in java/foreign/ tests
>>
>>    Failures related to va_args are tracked in JDK-8263512.
>>  - Add Azul copyright
>>  - Update Oracle copyright years
>>  - Use Thread::current_or_null_safe in SafeFetch
>>  - 8262903: [macos_aarch64] Thread::current() called on detached thread
>>  - ... and 105 more: https://git.openjdk.java.net/jdk/compare/a9d2267f...5add9269
>
> Marked as reviewed by aph (Reviewer).

> [Back-porting this patch to JDK 11] depends on the will of openjdk11 maintainers to accept this (and a few others, like jep-388, as we depend on it) contribution.

To the extent that 11u has fixed policies :) we definitely have a policy of accepting patches to keep 11u working on current hardware. So yes.

- PR: https://git.openjdk.java.net/jdk/pull/2200
Re: RFR: 8253795: Implementation of JEP 391: macOS/AArch64 Port [v29]
On Mon, 22 Mar 2021 12:50:14 GMT, Anton Kozlov wrote:

>> Please review the implementation of JEP 391: macOS/AArch64 Port.
>>
>> It's heavily based on existing ports to linux/aarch64, macos/x86_64, and windows/aarch64.
>>
>> Major changes are in:
>> * src/hotspot/cpu/aarch64: support of the new calling convention (subtasks JDK-8253817, JDK-8253818)
>> * src/hotspot/os_cpu/bsd_aarch64: copy of os_cpu/linux_aarch64 with necessary adjustments (JDK-8253819)
>> * src/hotspot/share, test/hotspot/gtest: support of write-xor-execute (W^X), required on the macOS/AArch64 platform. It's implemented with pthread_jit_write_protect_np provided by Apple. The W^X mode is local to a thread, so the W^X mode change relates to the java thread state change (for java threads). In most cases, the JVM executes in write-only mode, except when calling a generated stub like SafeFetch, which requires a temporary switch to execute-only mode. The same execute-only mode is enabled when a java thread executes in java or native states. This approach to managing the W^X mode turned out to be simple and efficient enough.
>> * src/jdk.hotspot.agent: serviceability agent implementation (JDK-8254941)
>
> Anton Kozlov has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 115 commits:
>
>  - Merge branch 'master' into jdk-macos
>  - JDK-8262491: bsd_aarch64 part
>  - JDK-8263002: bsd_aarch64 part
>  - Merge remote-tracking branch 'upstream/jdk/master' into jdk-macos
>  - Wider #ifdef block
>  - Fix most of issues in java/foreign/ tests
>
>    Failures related to va_args are tracked in JDK-8263512.
>  - Add Azul copyright
>  - Update Oracle copyright years
>  - Use Thread::current_or_null_safe in SafeFetch
>  - 8262903: [macos_aarch64] Thread::current() called on detached thread
>  - ... and 105 more: https://git.openjdk.java.net/jdk/compare/a9d2267f...5add9269

Marked as reviewed by aph (Reviewer).

- PR: https://git.openjdk.java.net/jdk/pull/2200