RE: Howto replicate failure of 8254790?

2020-10-15 Thread Tatton, Jason
Thanks Vladimir and David, I have access to a new macbook with an Intel 
i7-9750H (supports AVX2) so I will try on that.

-Original Message-
From: Vladimir Kozlov  
Sent: 15 October 2020 20:25
To: David Holmes ; Tatton, Jason 
; hotspot-compiler-...@openjdk.java.net; 
core-libs-dev@openjdk.java.net
Subject: RE: [EXTERNAL] Howto replicate failure of 8254790?

CAUTION: This email originated from outside of the organization. Do not click 
links or open attachments unless you can confirm the sender and know the 
content is safe.



Note, we have old Mac machines in our testing env:
cx8, cmov, fxsr, ht, mmx, 3dnowpref, sse, sse2, sse3, ssse3, sse4.1, sse4.2, 
popcnt, lzcnt, tsc, tscinvbit, avx, avx2, aes, erms, clmul, bmi1, bmi2, rtm, 
adx, fma, vzeroupper, clflush, clflushopt

Use -XX:UseAVX=2

But I was not able reproduce failure on my Skylake Linux machine even with 
-XX:UseAVX=2. Maybe there are other factors on MacOS.

Regards,
Vladimir K

On 10/14/20 5:48 PM, David Holmes wrote:
> Hi Jason,
>
> On 15/10/2020 10:42 am, Tatton, Jason wrote:
>> Hi all,
>>
>>
>>
>> I am trying to replicate the failure of the tier2 test mentioned in 
>> 8254790<https://bugs.openjdk.java.net/browse/JDK-8254790> but I am 
>> only seeing it pass under an x86 linux machine. Are there any specific 
>> architectural constraints under which this test should be run in order to 
>> make it fail?
>
> It failed on a Mac, not Linux.
>
> Cheers,
> David
>
>>
>>
>> I am running the test via: make test 
>> TEST="test/jdk/javax/xml/crypto/dsig/GenerationTests.java"
>>
>>
>>
>> Note that I am running the test against master without the commit: 
>> "8254792: Disable intrinsic StringLatin1.indexOf until 8254790 is fixed" 
>> which disables the intrinsic that is causing the test to fail.
>>
>>
>>
>> Thanks
>> --
>> Jason
>>


Howto replicate failure of 8254790?

2020-10-14 Thread Tatton, Jason
Hi all,



I am trying to replicate the failure of the tier2 test mentioned in 
8254790 but I am only seeing 
it pass under an x86 linux machine. Are there any specific architectural 
constraints under which this test should be run in order to make it fail?



I am running the test via: make test 
TEST="test/jdk/javax/xml/crypto/dsig/GenerationTests.java"



Note that I am running the test against master without the commit: "8254792: 
Disable intrinsic StringLatin1.indexOf until 8254790 is fixed" which disables 
the intrinsic that is causing the test to fail.



Thanks
--
Jason



RE: 8173585: Intrinsify StringLatin1.indexOf(char)

2020-09-11 Thread Tatton, Jason
Hi Everyone,

Please could some reviewers volunteer their time to have a look at this patch?

It was marked as a starter bug so I don't expect it will consume a lot of time.

Thanks,
Jason

-Original Message-
From: hotspot-dev  On Behalf Of Jason Tatton
Sent: 11 September 2020 11:24
To: core-libs-dev@openjdk.java.net; hotspot-...@openjdk.java.net
Subject: RFR: 8173585: Intrinsify StringLatin1.indexOf(char)

This is an implementation of the indexOf(char) intrinsic for StringLatin1 (1 
byte encoded Strings). It is provided for
x86 and ARM64. The implementation is greatly inspired by the indexOf(char) 
intrinsic for StringUTF16. To incorporate it I had to make a small change to 
StringLatin1.java (refactor of functionality to intrisified private method) as 
well as code for C2.

Submitted to: hotspot-compiler-dev and core-libs-dev as this patch contains a 
change to hotspot and java/lang/StringLatin1.java

https://bugs.openjdk.java.net/browse/JDK-8173585

Details of testing:

I have created a jtreg test 
“compiler/intrinsics/string/TestStringLatin1IndexOfChar” to cover this new 
intrinsic. Note that, particularly for the x86 implementation of the intrinsic, 
the code path taken is dependent upon the length of the input String. Hence the 
test has been designed to cover all these cases. In summary they are:

- A “short” string of < 16 characters.
- A SIMD String of 16 – 31 characters.
- A AVX2 SIMD String of 32 characters+.

Hardware used for testing:
-

- Intel Xeon CPU E5-2680 (JVM did not recognize this as having AVX2 support) • 
Intel i7 processor (with AVX2 support).
- AWS Graviton 2 (ARM 64 processor).

I also ran; ‘run-test-tier1’ and ‘run-test-tier2’ for: x86_64 and aarch64.

Possible future enhancements:

For the x86 implementation there may be two further improvements we can make in 
order to improve performance of both the StringUTF16 and StringLatin1 
indexOf(char) intrinsics:

1. Make use of AVX-512 instructions.
2. For “short” Strings (see below), I think it may be possible to modify the 
existing algorithm to still use SSE SIMD instructions instead of a loop.

Benchmark results:

**Without** the new StringLatin1 indexOf(char) intrinsic:

| Benchmark  | Mode | Cnt | Score | Error | Units |
| - | - |- |- 
| |- |- | IndexOfBenchmark.latin1_mixed_char | 
| avgt | 5 | **26,389.129** | ± 182.581 | ns/op | 
| IndexOfBenchmark.utf16_mixed_char | avgt | 5 | 17,885.383 | ± 435.933  
| | ns/op |


**With** the new StringLatin1 indexOf(char) intrinsic:

| Benchmark  | Mode | Cnt | Score | Error | Units |
| - | - |- |- 
| |- |- | IndexOfBenchmark.latin1_mixed_char | 
| avgt | 5 | **17,875.185** | ± 407.716  | ns/op | 
| IndexOfBenchmark.utf16_mixed_char | avgt | 5 | 18,292.802 | ± 167.306  
| | ns/op |


The objective of the patch is to bring the performance of StringLatin1 
indexOf(char) in line with StringUTF16
indexOf(char) for x86 and ARM64. We can see above that this has been achieved. 
Similar results were obtained when running on ARM.

-

Commit messages:
 - 8173585: further whitespace changes required by jcheck
 - JDK-8173585 - whitespace changes required by jcheck
 - JDK-8173585

Changes: https://git.openjdk.java.net/jdk/pull/71/files
 Webrev: https://webrevs.openjdk.java.net/?repo=jdk=71=00
  Issue: https://bugs.openjdk.java.net/browse/JDK-8173585
  Stats: 522 lines in 15 files changed: 506 ins; 0 del; 16 mod
  Patch: https://git.openjdk.java.net/jdk/pull/71.diff
  Fetch: git fetch https://git.openjdk.java.net/jdk pull/71/head:pull/71

PR: https://git.openjdk.java.net/jdk/pull/71


RE: JDK-8173585: Intrinsify StringLatin1.indexOf(char)

2020-09-09 Thread Tatton, Jason
Hi Roger,

Thanks for your question. The code path for UTF16 has hasn't been interacted 
with in a meaningful way in this patch so I think this is just noise. To 
validate this hypothesis I re-ran the benchmark with 10 forks (default is 5), 
the results indicate that the performance of the UTF16 implementation of 
indexOf(char) has not been materially changed:

Without the new StringLatin1 indexOf(char) intrinsic:
Benchmark   Mode  Cnt  Score Error  Units
IndexOfBenchmark.latin1_mixed_char  avgt   25  27550.832 ± 347.570  ns/op
IndexOfBenchmark.utf16_mixed_char   avgt   25  18472.190 ± 219.185  ns/op


With the new StringLatin1 indexOf(char) intrinsic:
Benchmark   Mode  Cnt  Score Error  Units
IndexOfBenchmark.latin1_mixed_char  avgt5  17806.338 ± 217.399  ns/op
IndexOfBenchmark.utf16_mixed_char   avgt5  18276.366 ± 470.528  ns/op

In fact, on this run the performance of utf16 was better than without this 
patch, this is however a 1% improvement which fits within the error range seen 
on these tests of between 1-2%. So I think it's fair to say that this patch has 
no effect on the performance of the existing StringUTF16 indexOf(char) method.

-Original Message-
From: core-libs-dev  On Behalf Of Roger 
Riggs
Sent: 08 September 2020 15:54
To: core-libs-dev@openjdk.java.net
Subject: RE: [EXTERNAL] JDK-8173585: Intrinsify StringLatin1.indexOf(char)

CAUTION: This email originated from outside of the organization. Do not click 
links or open attachments unless you can confirm the sender and know the 
content is safe.



Hi Jason,

With respect to the increased ns/op in the utf16_mixed_char benchmark, how 
should we understand the lower performance?

Thanks, Roger


On 9/8/20 8:02 AM, Tatton, Jason wrote:
> Hi Andrew, thank you for taking the time to review this.
>
> Since we have now moved to git, I have raised a new PR for this RFR:
>
> https://github.com/openjdk/jdk/pull/71
> https://bugs.openjdk.java.net/browse/JDK-8173585
>
> I have improved the micro benchmark in the ways which you and others have 
> requested, namely:
> + The benchmark is now included in test/micro/org/openjdk/bench/java/lang as 
> StringIndexOfChar (as advised by my colleagues here at AWS; Xin Liu and 
> Volker Simonis).
> + Times are now in nanoseconds.
> + Terminating characters ('a') are in 66.666% of tested strings.
> + I have added four new benchmarks which operate on a random length strings 
> (32 characters being the average) of type either StringLatin1 of StringUTF16 
> and call indexOf(char) or indexOf(String).
>
> I have included below the output of these four tests below:
>
> Without the new StringLatin1 indexOf(char) intrinsic:
>
> Benchmark   Mode  Cnt  Score Error  Units
> IndexOfBenchmark.latin1_mixed_char  avgt5  26389.129 ± 182.581  ns/op
> IndexOfBenchmark.utf16_mixed_char   avgt5  17885.383 ± 435.933  ns/op
>
>
> With the new StringLatin1 indexOf(char) intrinsic:
>
> Benchmark   Mode  Cnt  Score Error  Units
> IndexOfBenchmark.latin1_mixed_char  avgt5  17875.185 ± 407.716  ns/op
> IndexOfBenchmark.utf16_mixed_char   avgt5  18292.802 ± 167.306  ns/op
>
> The objective of the patch is to bring the performance of StringLatin1 
> indexOf(char) in line with StringUTF16 indexOf(char) for x86 and ARM64. We 
> can see above that this has been achieved. Similar results were obtained when 
> running on ARM.
>
> Regards,
> Jason



RE: JDK-8173585: Intrinsify StringLatin1.indexOf(char)

2020-09-08 Thread Tatton, Jason
Hi Andrew, thank you for taking the time to review this.

Since we have now moved to git, I have raised a new PR for this RFR:

https://github.com/openjdk/jdk/pull/71
https://bugs.openjdk.java.net/browse/JDK-8173585

I have improved the micro benchmark in the ways which you and others have 
requested, namely:
+ The benchmark is now included in test/micro/org/openjdk/bench/java/lang as 
StringIndexOfChar (as advised by my colleagues here at AWS; Xin Liu and Volker 
Simonis).
+ Times are now in nanoseconds.
+ Terminating characters ('a') are in 66.666% of tested strings.
+ I have added four new benchmarks which operate on a random length strings (32 
characters being the average) of type either StringLatin1 of StringUTF16 and 
call indexOf(char) or indexOf(String).

I have included below the output of these four tests below:

Without the new StringLatin1 indexOf(char) intrinsic:

Benchmark   Mode  Cnt  Score Error  Units
IndexOfBenchmark.latin1_mixed_char  avgt5  26389.129 ± 182.581  ns/op
IndexOfBenchmark.utf16_mixed_char   avgt5  17885.383 ± 435.933  ns/op


With the new StringLatin1 indexOf(char) intrinsic:

Benchmark   Mode  Cnt  Score Error  Units
IndexOfBenchmark.latin1_mixed_char  avgt5  17875.185 ± 407.716  ns/op
IndexOfBenchmark.utf16_mixed_char   avgt5  18292.802 ± 167.306  ns/op

The objective of the patch is to bring the performance of StringLatin1 
indexOf(char) in line with StringUTF16 indexOf(char) for x86 and ARM64. We can 
see above that this has been achieved. Similar results were obtained when 
running on ARM.

Regards,
Jason

-Original Message-
From: Andrew Haley  
Sent: 05 September 2020 15:47
To: Tatton, Jason ; hotspot-compiler-...@openjdk.java.net; 
core-libs-dev@openjdk.java.net
Subject: RE: [EXTERNAL] JDK-8173585: Intrinsify StringLatin1.indexOf(char)

CAUTION: This email originated from outside of the organization. Do not click 
links or open attachments unless you can confirm the sender and know the 
content is safe.



On 03/09/2020 22:28, Tatton, Jason wrote:

>
> JMH Benchmark results:
> 
> The benchmarks examine the 3 codepaths for StringLatin1 and 
> StringUTF16. Here are the results for Intel x86 (ARM is similar):
>
> FYI, String lengths in characters (1byte for Latin1, 2bytes for UTF16):
>Latin1  UTF16
> Short: 15   7
> SSE4:  16   8
> AVX2:  32   16
>
> Without StringLatin1 indexofchar intrinsic:
> Benchmark  Mode  Cnt   Score  Error  Units
> IndexOfBenchmark.latin1_AVX2_String   thrpt5  121781.424 ± 355.085  ops/s
> IndexOfBenchmark.latin1_AVX2_char thrpt5   46060.612 ± 151.274  ops/s
> IndexOfBenchmark.latin1_SSE4_String   thrpt5  197339.146 ±  90.333  ops/s
> IndexOfBenchmark.latin1_SSE4_char thrpt5   61401.204 ± 426.761  ops/s
> IndexOfBenchmark.latin1_Short_String  thrpt5  175389.355 ± 294.976  ops/s
> IndexOfBenchmark.latin1_Short_charthrpt5   60759.868 ± 124.349  ops/s
> IndexOfBenchmark.utf16_AVX2_Stringthrpt5  123601.020 ± 111.981  ops/s
> IndexOfBenchmark.utf16_AVX2_char  thrpt5  141116.832 ± 380.489  ops/s
> IndexOfBenchmark.utf16_SSE4_Stringthrpt5  178136.762 ± 143.227  ops/s
> IndexOfBenchmark.utf16_SSE4_char  thrpt5  181430.649 ± 120.097  ops/s
> IndexOfBenchmark.utf16_Short_String   thrpt5  158301.361 ± 182.738  ops/s
> IndexOfBenchmark.utf16_Short_char thrpt5   84876.919 ± 247.769  ops/s
>
> With StringLatin1 indexofchar intrinsic:
> Benchmark  Mode  Cnt   Score  Error  Units
> IndexOfBenchmark.latin1_AVX2_String   thrpt5  113621.676 ±   68.235  ops/s
> IndexOfBenchmark.latin1_AVX2_char thrpt5  177757.909 ±  727.308  ops/s
> IndexOfBenchmark.latin1_SSE4_String   thrpt5  180529.049 ±   57.356  ops/s
> IndexOfBenchmark.latin1_SSE4_char thrpt5  235087.776 ±  457.024  ops/s
> IndexOfBenchmark.latin1_Short_String  thrpt5  165914.990 ±  329.024  ops/s
> IndexOfBenchmark.latin1_Short_charthrpt5   53989.544 ±   65.393  ops/s
> IndexOfBenchmark.utf16_AVX2_Stringthrpt5  107632.783 ±  446.272  ops/s
> IndexOfBenchmark.utf16_AVX2_char  thrpt5  143131.734 ±  159.944  ops/s
> IndexOfBenchmark.utf16_SSE4_Stringthrpt5  169882.703 ± 1024.367  ops/s
> IndexOfBenchmark.utf16_SSE4_char  thrpt5  175693.972 ±  775.423  ops/s
> IndexOfBenchmark.utf16_Short_String   thrpt5  163595.993 ±  225.089  ops/s
> IndexOfBenchmark.utf16_Short_char thrpt5   90126.154 ±  365.642  ops/s
>
> We can see above that indexOf(char) now behaves similarly between
> StringUTF16 and StringLatin1.

This is confusing. Can you please make the times nanoseconds?  It's quite a 
struggle trying to think in reciprocal units fo

JDK-8173585: Intrinsify StringLatin1.indexOf(char)

2020-09-03 Thread Tatton, Jason
Hi All,

Please review the following patch:

https://bugs.openjdk.java.net/browse/JDK-8173585
http://cr.openjdk.java.net/~phh/8173585/webrev.00/

This is an implementation of the indexOf(char) intrinsic for StringLatin1 (1 
byte encoded Strings). It is provided for x86 and ARM64. The implementation is 
greatly inspired by the indexOf(char) intrinsic for StringUTF16. To incorporate 
it I had to make a small change to StringLatin1.java (refactor of functionality 
to intrisified private method) as well as code for C2.

Submitted to: hotspot-compiler-dev and core-libs-dev as this patch contains a 
change to hotspot and java/lang/StringLatin1.java

Details of testing: 
==
I have created a jtreg test 
“compiler/intrinsics/string/TestStringLatin1IndexOfChar” to cover this new 
intrinsic. Note that, particularly for the x86 implementation of the intrinsic, 
the code path taken is dependent upon the length of the input String. Hence the 
test has been designed to cover all these cases. In summary they are:
• A “short” string of < 16 characters.
• A SIMD String of 16 – 31 characters.
• A AVX2 SIMD String of 32 characters+.

Hardware used for testing: 

• Intel Xeon CPU E5-2680 (JVM did not recognize this as having AVX2 support)
• Intel i7 processor (with AVX2 support).
• AWS Graviton 2 (ARM 64 processor).

I also ran; ‘run-test-tier1’ and ‘run-test-tier2’ for: x86_64 and aarch64.

Possible future enhancements: 
=
For the x86 implementation there may be two further improvements we can make in 
order to improve performance of both the StringUTF16 and StringLatin1 
indexOf(char) intrinsics:
1. Make use of AVX-512 instructions.
2. For “short” Strings (see below), I think it may be possible to modify the 
existing algorithm to still use SSE SIMD instructions instead of a loop.


JMH Benchmark results:

The benchmarks examine the 3 codepaths for StringLatin1 and StringUTF16. Here 
are the results for Intel x86 (ARM is similar):

FYI, String lengths in characters (1byte for Latin1, 2bytes for UTF16):
   Latin1  UTF16
Short: 15   7
SSE4:  16   8 
AVX2:  32   16

Without StringLatin1 indexofchar intrinsic:
Benchmark  Mode  Cnt   Score  Error  Units
IndexOfBenchmark.latin1_AVX2_String   thrpt    5  121781.424 ± 355.085  ops/s
IndexOfBenchmark.latin1_AVX2_char thrpt    5   46060.612 ± 151.274  ops/s
IndexOfBenchmark.latin1_SSE4_String   thrpt    5  197339.146 ±  90.333  ops/s
IndexOfBenchmark.latin1_SSE4_char thrpt    5   61401.204 ± 426.761  ops/s
IndexOfBenchmark.latin1_Short_String  thrpt    5  175389.355 ± 294.976  ops/s
IndexOfBenchmark.latin1_Short_char    thrpt    5   60759.868 ± 124.349  ops/s
IndexOfBenchmark.utf16_AVX2_String    thrpt    5  123601.020 ± 111.981  ops/s
IndexOfBenchmark.utf16_AVX2_char  thrpt    5  141116.832 ± 380.489  ops/s
IndexOfBenchmark.utf16_SSE4_String    thrpt    5  178136.762 ± 143.227  ops/s
IndexOfBenchmark.utf16_SSE4_char  thrpt    5  181430.649 ± 120.097  ops/s
IndexOfBenchmark.utf16_Short_String   thrpt    5  158301.361 ± 182.738  ops/s
IndexOfBenchmark.utf16_Short_char thrpt    5   84876.919 ± 247.769  ops/s

With StringLatin1 indexofchar intrinsic:
Benchmark  Mode  Cnt   Score  Error  Units
IndexOfBenchmark.latin1_AVX2_String   thrpt    5  113621.676 ±   68.235  ops/s
IndexOfBenchmark.latin1_AVX2_char thrpt    5  177757.909 ±  727.308  ops/s
IndexOfBenchmark.latin1_SSE4_String   thrpt    5  180529.049 ±   57.356  ops/s
IndexOfBenchmark.latin1_SSE4_char thrpt    5  235087.776 ±  457.024  ops/s
IndexOfBenchmark.latin1_Short_String  thrpt    5  165914.990 ±  329.024  ops/s
IndexOfBenchmark.latin1_Short_char    thrpt    5   53989.544 ±   65.393  ops/s
IndexOfBenchmark.utf16_AVX2_String    thrpt    5  107632.783 ±  446.272  ops/s
IndexOfBenchmark.utf16_AVX2_char  thrpt    5  143131.734 ±  159.944  ops/s
IndexOfBenchmark.utf16_SSE4_String    thrpt    5  169882.703 ± 1024.367  ops/s
IndexOfBenchmark.utf16_SSE4_char  thrpt    5  175693.972 ±  775.423  ops/s
IndexOfBenchmark.utf16_Short_String   thrpt    5  163595.993 ±  225.089  ops/s
IndexOfBenchmark.utf16_Short_char thrpt    5   90126.154 ±  365.642  ops/s

We can see above that indexOf(char) now behaves similarly between StringUTF16 
and StringLatin1. 

JMH benchmark code:
--

package org.sample;

import java.util.Random;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;

@State(Scope.Thread)
public class IndexOfBenchmark {
private static final int loops = 10;
private static final Random rng = new Random(1999);
private static final int pathCnt = 1000;
private static final String [] latn1_short = new String[pathCnt];
private static final String [] latn1_sse4  = new String[pathCnt];
private