mccullocht commented on issue #15697:
URL: https://github.com/apache/lucene/issues/15697#issuecomment-3931478158
Running luceneutil would be interesting, but in my experience microbenchmarks typically show much better results than larger-scale tests. It does seem that this is aarch64-specific. It's weird that loading twice is faster, but since you are loading the same address twice, the second load is almost free (a minimal sketch of the two access patterns follows below).

AMD Ryzen AI 395; AVX-512:

```
Benchmark                                                        (size)   Mode  Cnt   Score   Error  Units
VectorUtilBenchmark.binaryHalfByteDotProductSinglePackedScalar     1024  thrpt   15   2.523 ± 0.030  ops/us
VectorUtilBenchmark.binaryHalfByteDotProductSinglePackedVector     1024  thrpt   15  11.828 ± 0.169  ops/us
VectorUtilBenchmark.binaryHalfByteSquareSinglePackedScalar         1024  thrpt   15   2.303 ± 0.011  ops/us
VectorUtilBenchmark.binaryHalfByteSquareSinglePackedVector         1024  thrpt   15  12.519 ± 0.489  ops/us
```

Strangely enough, I see the same kind of weird performance falloff in the warmup iterations:

```
INFO: Java vector incubator API enabled; uses preferredBitSize=512; FMA enabled
# Warmup Iteration   1: 35.595 ops/us
# Warmup Iteration   2: 41.108 ops/us
# Warmup Iteration   3: 12.517 ops/us
# Warmup Iteration   4: 12.766 ops/us
Iteration   1: 12.801 ops/us
Iteration   2: 12.781 ops/us
Iteration   3: 12.763 ops/us
Iteration   4: 12.843 ops/us
Iteration   5: 12.810 ops/us
```
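For anyone who wants to poke at the double-load pattern in isolation, here is a minimal, self-contained sketch. This is **not** Lucene's actual kernel: the class and method names (`NibbleUnpack`, `unpackTwice`, `unpackOnce`) are hypothetical, and the lane arithmetic is deliberately trivial (an XOR reduction, so the byte lanes cannot overflow). The only point of interest is the memory access pattern: `unpackTwice` issues two loads of the same address, `unpackOnce` loads once and reuses the register.

```
import jdk.incubator.vector.ByteVector;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

// Hypothetical sketch, not Lucene's implementation.
// Compile/run with: --add-modules jdk.incubator.vector
public class NibbleUnpack {
  private static final VectorSpecies<Byte> SPECIES = ByteVector.SPECIES_PREFERRED;

  /** Loads packed[i..] twice; the second load hits the same (already hot) L1 line. */
  static byte unpackTwice(byte[] packed) {
    byte checksum = 0;
    int bound = SPECIES.loopBound(packed.length); // tail ignored; this is a sketch
    for (int i = 0; i < bound; i += SPECIES.length()) {
      ByteVector lo = ByteVector.fromArray(SPECIES, packed, i).and((byte) 0x0F);
      ByteVector hi = ByteVector.fromArray(SPECIES, packed, i) // second load, same address
          .lanewise(VectorOperators.LSHR, 4)
          .and((byte) 0x0F);
      checksum ^= lo.lanewise(VectorOperators.XOR, hi).reduceLanes(VectorOperators.XOR);
    }
    return checksum;
  }

  /** Loads packed[i..] once and reuses the register for both nibbles. */
  static byte unpackOnce(byte[] packed) {
    byte checksum = 0;
    int bound = SPECIES.loopBound(packed.length);
    for (int i = 0; i < bound; i += SPECIES.length()) {
      ByteVector v = ByteVector.fromArray(SPECIES, packed, i);
      ByteVector lo = v.and((byte) 0x0F);
      ByteVector hi = v.lanewise(VectorOperators.LSHR, 4).and((byte) 0x0F);
      checksum ^= lo.lanewise(VectorOperators.XOR, hi).reduceLanes(VectorOperators.XOR);
    }
    return checksum;
  }

  public static void main(String[] args) {
    byte[] packed = new byte[1024];
    new java.util.Random(42).nextBytes(packed);
    System.out.println(unpackTwice(packed) == unpackOnce(packed)); // prints: true
  }
}
```

On an out-of-order core the second load hits a line that is already in L1, so the extra instruction costs very little; whether it ever beats plain register reuse plausibly comes down to how the JIT schedules the surrounding shift/mask ops, which would also be consistent with the behavior being arch-specific.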

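The falloff after warmup iteration 2 looks more like C2 recompiling the method (or a deoptimization) than measurement noise. One way to investigate is to rerun just the affected benchmark with extra warmup iterations and an assembly profiler, and compare the generated code from the fast and slow windows. The jar path below is hypothetical (adjust to wherever your build puts the JMH jar); the flags are standard JMH options:

```
# -wi = warmup iterations, -i = measurement iterations, -f = forks
java --add-modules jdk.incubator.vector -jar benchmarks.jar \
  'VectorUtilBenchmark.binaryHalfByteSquareSinglePackedVector' \
  -p size=1024 -wi 10 -i 5 -f 1 -prof perfasm
```

Note that `-prof perfasm` needs Linux perf; on macOS `-prof dtraceasm` is the analogue, and `-prof perfnorm` is a lighter-weight fallback for counter-level data.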