gsmiller commented on a change in pull request #69: URL: https://github.com/apache/lucene/pull/69#discussion_r609697557
########## File path: lucene/core/src/java/org/apache/lucene/codecs/lucene90/PForUtil.java ########## @@ -121,4 +167,146 @@ void skip(DataInput in) throws IOException { in.skipBytes(forUtil.numBytes(bitsPerValue) + (numExceptions << 1)); } } + + /** + * Fill {@code longs} with the final values for the case of all deltas being 1. Note this assumes + * there are no exceptions to apply. + */ + private static void prefixSumOfOnes(long[] longs, long base) { + System.arraycopy(IDENTITY_PLUS_ONE, 0, longs, 0, ForUtil.BLOCK_SIZE); + // This loop gets auto-vectorized + for (int i = 0; i < ForUtil.BLOCK_SIZE; ++i) { + longs[i] += base; + } + } + + /** + * Fill {@code longs} with the final values for the case of all deltas being {@code val}. Note + * this assumes there are no exceptions to apply. + */ + private static void prefixSumOf(long[] longs, long base, long val) { + for (int i = 0; i < ForUtil.BLOCK_SIZE; i++) { + longs[i] = (i + 1) * val + base; Review comment: Here's what I've found with `perfasm` on [this microbenchmark branch](https://github.com/gsmiller/decode-128-ints-benchmark/tree/pfor-is-it-vectorizing): 1. The `prefixSumOf` method in question [1] is _not_ auto-vectorizing. The assembly loop is below [2]. 2. If I change the implementation of `prefixSumOf` to use two loops [3], the second "add" loop is auto-vectoring in the same way that `prefixSumOfOnes` does [4], but the first "multiply" loop does not [5]. 3. Even though the second approach [3] gets partially vectorized, it's significantly less performant than the vanilla, single-loop approach (7.1 throughput vs. 6.3) [6]. 4. The full output of the jmh benchmark runs with `perfasm` are here (note that the first run in each of these is `prefixSumOfOnes` as a baseline, controlled by `sameVal == 1` instead of `sameVal == 2`; `sameVal == 1` triggers the special-case handling using `prefixSumOfOnes`): [single-loop.log](https://github.com/gsmiller/decode-128-ints-benchmark/blob/pfor-is-it-vectorizing/single-loop.log), [two-loops.log](https://github.com/gsmiller/decode-128-ints-benchmark/blob/pfor-is-it-vectorizing/two-loops.log) [1] ``` private static void prefixSumOf(long val, long[] arr, long base) { for (int i = 0; i < ForUtil.BLOCK_SIZE; i++) { arr[i] = IDENTITY_PLUS_ONE[i] * val + base; } } ``` [2] ``` 0.45% ↗ 0x00007f37dfa02c52: mov %r9,%r8 0.30% │ 0x00007f37dfa02c55: movabs $0xd2816340,%rdi ; {oop([J{0x00000000d2816340})} 3.16% │ 0x00007f37dfa02c5f: imul 0x10(%rdi,%r10,8),%r8 1.12% │ 0x00007f37dfa02c65: add %rcx,%r8 1.42% │ 0x00007f37dfa02c68: mov %r8,0x10(%rbx,%r10,8) 2.97% │ 0x00007f37dfa02c6d: mov %r9,%r8 2.62% │ 0x00007f37dfa02c70: imul 0x18(%rdi,%r10,8),%r8 1.37% │ 0x00007f37dfa02c76: add %rcx,%r8 1.40% │ 0x00007f37dfa02c79: mov %r8,0x18(%rbx,%r10,8) 5.71% │ 0x00007f37dfa02c7e: mov %r9,%r8 2.02% │ 0x00007f37dfa02c81: imul 0x20(%rdi,%r10,8),%r8 1.08% │ 0x00007f37dfa02c87: add %rcx,%r8 1.91% │ 0x00007f37dfa02c8a: mov %r8,0x20(%rbx,%r10,8) 4.96% │ 0x00007f37dfa02c8f: mov %r9,%r8 1.57% │ 0x00007f37dfa02c92: imul 0x28(%rdi,%r10,8),%r8 0.71% │ 0x00007f37dfa02c98: add %rcx,%r8 0.56% │ 0x00007f37dfa02c9b: mov %r8,0x28(%rbx,%r10,8) ;*lastore {reexecute=0 rethrow=0 return_oop=0} │ ; - jpountz.PForDeltaDecoder::prefixSumOf@24 (line 29) │ ; - jpountz.PForDeltaDecoder::decodeAndPrefixSum@32 (line 59) │ ; - jpountz.PackedIntsDeltaDecodeBenchmark::pForDeltaDecoder@42 (line 29) │ ; - jpountz.generated.PackedIntsDeltaDecodeBenchmark_pForDeltaDecoder_jmhTest::pForDeltaDecoder_thrpt_jmhStub@151 (line 240) 4.79% │ 0x00007f37dfa02ca0: add $0x4,%r10d ;*iinc {reexecute=0 rethrow=0 return_oop=0} │ ; - jpountz.PForDeltaDecoder::prefixSumOf@25 (line 28) │ ; - jpountz.PForDeltaDecoder::decodeAndPrefixSum@32 (line 59) │ ; - jpountz.PackedIntsDeltaDecodeBenchmark::pForDeltaDecoder@42 (line 29) │ ; - jpountz.generated.PackedIntsDeltaDecodeBenchmark_pForDeltaDecoder_jmhTest::pForDeltaDecoder_thrpt_jmhStub@151 (line 240) 1.22% │ 0x00007f37dfa02ca4: cmp $0x7d,%r10d ╰ 0x00007f37dfa02ca8: jl 0x00007f37dfa02c52 ;*if_icmpge {reexecute=0 rethrow=0 return_oop=0} ``` [3] ``` private static void prefixSumOfTwoLoops(long val, long[] arr, long base) { System.arraycopy(IDENTITY_PLUS_ONE, 0, arr, 0, ForUtil.BLOCK_SIZE); for (int i = 0; i < ForUtil.BLOCK_SIZE; i++) { arr[i] *= val; } for (int i = 0; i < ForUtil.BLOCK_SIZE; i++) { arr[i] += base; } } ``` [4] ``` 0.11% ↗ 0x00007f1607a05810: vpaddq 0x10(%rbp,%r11,8),%ymm0,%ymm1 0.17% │ 0x00007f1607a05817: vmovdqu %ymm1,0x10(%rbp,%r11,8) 0.24% │ 0x00007f1607a0581e: vpaddq 0x30(%rbp,%r11,8),%ymm0,%ymm1 0.20% │ 0x00007f1607a05825: vmovdqu %ymm1,0x30(%rbp,%r11,8) 0.15% │ 0x00007f1607a0582c: vpaddq 0x50(%rbp,%r11,8),%ymm0,%ymm1 0.30% │ 0x00007f1607a05833: vmovdqu %ymm1,0x50(%rbp,%r11,8) 0.09% │ 0x00007f1607a0583a: vpaddq 0x70(%rbp,%r11,8),%ymm0,%ymm1 0.30% │ 0x00007f1607a05841: vmovdqu %ymm1,0x70(%rbp,%r11,8) │ 0x00007f1607a05848: vpaddq 0x90(%rbp,%r11,8),%ymm0,%ymm1 0.15% │ 0x00007f1607a05852: vmovdqu %ymm1,0x90(%rbp,%r11,8) 0.20% │ 0x00007f1607a0585c: vpaddq 0xb0(%rbp,%r11,8),%ymm0,%ymm1 0.11% │ 0x00007f1607a05866: vmovdqu %ymm1,0xb0(%rbp,%r11,8) 0.09% │ 0x00007f1607a05870: vpaddq 0xd0(%rbp,%r11,8),%ymm0,%ymm1 0.37% │ 0x00007f1607a0587a: vmovdqu %ymm1,0xd0(%rbp,%r11,8) 0.07% │ 0x00007f1607a05884: vpaddq 0xf0(%rbp,%r11,8),%ymm0,%ymm1 0.26% │ 0x00007f1607a0588e: vmovdqu %ymm1,0xf0(%rbp,%r11,8) ;*lastore {reexecute=0 rethrow=0 return_oop=0} │ ; - jpountz.PForDeltaDecoder::prefixSumOfTwoLoops@55 (line 39) │ ; - jpountz.PForDeltaDecoder::decodeAndPrefixSum@32 (line 59) │ ; - jpountz.PackedIntsDeltaDecodeBenchmark::pForDeltaDecoder@42 (line 29) │ ; - jpountz.generated.PackedIntsDeltaDecodeBenchmark_pForDeltaDecoder_jmhTest::pForDeltaDecoder_thrpt_jmhStub@151 (line 240) 0.15% │ 0x00007f1607a05898: add $0x20,%r11d ;*iinc {reexecute=0 rethrow=0 return_oop=0} │ ; - jpountz.PForDeltaDecoder::prefixSumOfTwoLoops@56 (line 38) │ ; - jpountz.PForDeltaDecoder::decodeAndPrefixSum@32 (line 59) │ ; - jpountz.PackedIntsDeltaDecodeBenchmark::pForDeltaDecoder@42 (line 29) │ ; - jpountz.generated.PackedIntsDeltaDecodeBenchmark_pForDeltaDecoder_jmhTest::pForDeltaDecoder_thrpt_jmhStub@151 (line 240) 0.13% │ 0x00007f1607a0589c: cmp $0x61,%r11d ╰ 0x00007f1607a058a0: jl 0x00007f1607a05810 ;*if_icmpge {reexecute=0 rethrow=0 return_oop=0} ``` [5] ``` 0.31% ↗ 0x00007f1607a05750: mov %r13,%r11 1.06% │ 0x00007f1607a05753: imul 0x10(%rbp,%r10,8),%r11 1.70% │ 0x00007f1607a05759: mov %r11,0x10(%rbp,%r10,8) 3.32% │ 0x00007f1607a0575e: mov %r13,%r11 0.28% │ 0x00007f1607a05761: imul 0x18(%rbp,%r10,8),%r11 2.72% │ 0x00007f1607a05767: mov %r11,0x18(%rbp,%r10,8) 2.24% │ 0x00007f1607a0576c: mov %r13,%r11 1.67% │ 0x00007f1607a0576f: imul 0x20(%rbp,%r10,8),%r11 0.26% │ 0x00007f1607a05775: mov %r11,0x20(%rbp,%r10,8) 6.17% │ 0x00007f1607a0577a: mov %r13,%r11 0.69% │ 0x00007f1607a0577d: imul 0x28(%rbp,%r10,8),%r11 1.54% │ 0x00007f1607a05783: mov %r11,0x28(%rbp,%r10,8) ;*lastore {reexecute=0 rethrow=0 return_oop=0} │ ; - jpountz.PForDeltaDecoder::prefixSumOfTwoLoops@30 (line 36) │ ; - jpountz.PForDeltaDecoder::decodeAndPrefixSum@32 (line 59) │ ; - jpountz.PackedIntsDeltaDecodeBenchmark::pForDeltaDecoder@42 (line 29) │ ; - jpountz.generated.PackedIntsDeltaDecodeBenchmark_pForDeltaDecoder_jmhTest::pForDeltaDecoder_thrpt_jmhStub@151 (line 240) 1.80% │ 0x00007f1607a05788: add $0x4,%r10d ;*iinc {reexecute=0 rethrow=0 return_oop=0} │ ; - jpountz.PForDeltaDecoder::prefixSumOfTwoLoops@31 (line 35) │ ; - jpountz.PForDeltaDecoder::decodeAndPrefixSum@32 (line 59) │ ; - jpountz.PackedIntsDeltaDecodeBenchmark::pForDeltaDecoder@42 (line 29) │ ; - jpountz.generated.PackedIntsDeltaDecodeBenchmark_pForDeltaDecoder_jmhTest::pForDeltaDecoder_thrpt_jmhStub@151 (line 240) 1.59% │ 0x00007f1607a0578c: cmp $0x7d,%r10d ╰ 0x00007f1607a05790: jl 0x00007f1607a05750 ;*if_icmpge {reexecute=0 rethrow=0 return_oop=0} ``` [6] First run with a single loop: ``` Benchmark (bitsPerValue) (exceptionCount) (sameVal) Mode Cnt Score Error Units PackedIntsDeltaDecodeBenchmark.pForDeltaDecoder 0 0 1 thrpt 5 8.829 ± 0.045 ops/us PackedIntsDeltaDecodeBenchmark.pForDeltaDecoder:·asm 0 0 1 thrpt NaN --- PackedIntsDeltaDecodeBenchmark.pForDeltaDecoder 0 0 2 thrpt 5 7.165 ± 0.217 ops/us PackedIntsDeltaDecodeBenchmark.pForDeltaDecoder:·asm 0 0 2 thrpt NaN --- ``` Second run with two loops: ``` Benchmark (bitsPerValue) (exceptionCount) (sameVal) Mode Cnt Score Error Units PackedIntsDeltaDecodeBenchmark.pForDeltaDecoder 0 0 1 thrpt 5 9.294 ± 0.041 ops/us PackedIntsDeltaDecodeBenchmark.pForDeltaDecoder:·asm 0 0 1 thrpt NaN --- PackedIntsDeltaDecodeBenchmark.pForDeltaDecoder 0 0 2 thrpt 5 6.635 ± 0.015 ops/us PackedIntsDeltaDecodeBenchmark.pForDeltaDecoder:·asm 0 0 2 thrpt NaN --- ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org