gsmiller commented on a change in pull request #69:
URL: https://github.com/apache/lucene/pull/69#discussion_r609697557



##########
File path: lucene/core/src/java/org/apache/lucene/codecs/lucene90/PForUtil.java
##########
@@ -121,4 +167,146 @@ void skip(DataInput in) throws IOException {
       in.skipBytes(forUtil.numBytes(bitsPerValue) + (numExceptions << 1));
     }
   }
+
+  /**
+   * Fill {@code longs} with the final values for the case of all deltas being 
1. Note this assumes
+   * there are no exceptions to apply.
+   */
+  private static void prefixSumOfOnes(long[] longs, long base) {
+    System.arraycopy(IDENTITY_PLUS_ONE, 0, longs, 0, ForUtil.BLOCK_SIZE);
+    // This loop gets auto-vectorized
+    for (int i = 0; i < ForUtil.BLOCK_SIZE; ++i) {
+      longs[i] += base;
+    }
+  }
+
+  /**
+   * Fill {@code longs} with the final values for the case of all deltas being 
{@code val}. Note
+   * this assumes there are no exceptions to apply.
+   */
+  private static void prefixSumOf(long[] longs, long base, long val) {
+    for (int i = 0; i < ForUtil.BLOCK_SIZE; i++) {
+      longs[i] = (i + 1) * val + base;

Review comment:
       Here's what I've found with `perfasm` on [this microbenchmark 
branch](https://github.com/gsmiller/decode-128-ints-benchmark/tree/pfor-is-it-vectorizing):
   1. The `prefixSumOf` method in question [1] is _not_ auto-vectorizing. The 
assembly loop is below [2].
   2. If I change the implementation of `prefixSumOf` to use two loops [3], the 
second "add" loop is auto-vectoring in the same way that `prefixSumOfOnes` does 
[4], but the first "multiply" loop does not [5].
   3. Even though the second approach [3] gets partially vectorized, it's 
significantly less performant than the vanilla, single-loop approach (7.1 
throughput vs. 6.3) [6].
   4. The full output of the jmh benchmark runs with `perfasm` are here (note 
that the first run in each of these is `prefixSumOfOnes` as a baseline, 
controlled by `sameVal == 1` instead of `sameVal == 2`; `sameVal == 1` triggers 
the special-case handling using `prefixSumOfOnes`): 
[single-loop.log](https://github.com/gsmiller/decode-128-ints-benchmark/blob/pfor-is-it-vectorizing/single-loop.log),
 
[two-loops.log](https://github.com/gsmiller/decode-128-ints-benchmark/blob/pfor-is-it-vectorizing/two-loops.log)
   
   [1]
   ```
       private static void prefixSumOf(long val, long[] arr, long base) {
           for (int i = 0; i < ForUtil.BLOCK_SIZE; i++) {
               arr[i] = IDENTITY_PLUS_ONE[i] * val + base;
           }
       }
   ```
   [2]
   ```
     0.45%    ↗  0x00007f37dfa02c52: mov    %r9,%r8
     0.30%    │  0x00007f37dfa02c55: movabs $0xd2816340,%rdi   ;   
{oop([J{0x00000000d2816340})}
     3.16%    │  0x00007f37dfa02c5f: imul   0x10(%rdi,%r10,8),%r8
     1.12%    │  0x00007f37dfa02c65: add    %rcx,%r8
     1.42%    │  0x00007f37dfa02c68: mov    %r8,0x10(%rbx,%r10,8)
     2.97%    │  0x00007f37dfa02c6d: mov    %r9,%r8
     2.62%    │  0x00007f37dfa02c70: imul   0x18(%rdi,%r10,8),%r8
     1.37%    │  0x00007f37dfa02c76: add    %rcx,%r8
     1.40%    │  0x00007f37dfa02c79: mov    %r8,0x18(%rbx,%r10,8)
     5.71%    │  0x00007f37dfa02c7e: mov    %r9,%r8
     2.02%    │  0x00007f37dfa02c81: imul   0x20(%rdi,%r10,8),%r8
     1.08%    │  0x00007f37dfa02c87: add    %rcx,%r8
     1.91%    │  0x00007f37dfa02c8a: mov    %r8,0x20(%rbx,%r10,8)
     4.96%    │  0x00007f37dfa02c8f: mov    %r9,%r8
     1.57%    │  0x00007f37dfa02c92: imul   0x28(%rdi,%r10,8),%r8
     0.71%    │  0x00007f37dfa02c98: add    %rcx,%r8
     0.56%    │  0x00007f37dfa02c9b: mov    %r8,0x28(%rbx,%r10,8)  ;*lastore 
{reexecute=0 rethrow=0 return_oop=0}
              │                                                ; - 
jpountz.PForDeltaDecoder::prefixSumOf@24 (line 29)
              │                                                ; - 
jpountz.PForDeltaDecoder::decodeAndPrefixSum@32 (line 59)
              │                                                ; - 
jpountz.PackedIntsDeltaDecodeBenchmark::pForDeltaDecoder@42 (line 29)
              │                                                ; - 
jpountz.generated.PackedIntsDeltaDecodeBenchmark_pForDeltaDecoder_jmhTest::pForDeltaDecoder_thrpt_jmhStub@151
 (line 240)
     4.79%    │  0x00007f37dfa02ca0: add    $0x4,%r10d         ;*iinc 
{reexecute=0 rethrow=0 return_oop=0}
              │                                                ; - 
jpountz.PForDeltaDecoder::prefixSumOf@25 (line 28)
              │                                                ; - 
jpountz.PForDeltaDecoder::decodeAndPrefixSum@32 (line 59)
              │                                                ; - 
jpountz.PackedIntsDeltaDecodeBenchmark::pForDeltaDecoder@42 (line 29)
              │                                                ; - 
jpountz.generated.PackedIntsDeltaDecodeBenchmark_pForDeltaDecoder_jmhTest::pForDeltaDecoder_thrpt_jmhStub@151
 (line 240)
     1.22%    │  0x00007f37dfa02ca4: cmp    $0x7d,%r10d
              ╰  0x00007f37dfa02ca8: jl     0x00007f37dfa02c52  ;*if_icmpge 
{reexecute=0 rethrow=0 return_oop=0}
   ```
   [3]
   ```
       private static void prefixSumOfTwoLoops(long val, long[] arr, long base) 
{
           System.arraycopy(IDENTITY_PLUS_ONE, 0, arr, 0, ForUtil.BLOCK_SIZE);
           for (int i = 0; i < ForUtil.BLOCK_SIZE; i++) {
               arr[i] *= val;
           }
           for (int i = 0; i < ForUtil.BLOCK_SIZE; i++) {
               arr[i] += base;
           }
       }
   ```
   [4]
   ```
     0.11%       ↗    0x00007f1607a05810: vpaddq 0x10(%rbp,%r11,8),%ymm0,%ymm1
     0.17%       │    0x00007f1607a05817: vmovdqu %ymm1,0x10(%rbp,%r11,8)
     0.24%       │    0x00007f1607a0581e: vpaddq 0x30(%rbp,%r11,8),%ymm0,%ymm1
     0.20%       │    0x00007f1607a05825: vmovdqu %ymm1,0x30(%rbp,%r11,8)
     0.15%       │    0x00007f1607a0582c: vpaddq 0x50(%rbp,%r11,8),%ymm0,%ymm1
     0.30%       │    0x00007f1607a05833: vmovdqu %ymm1,0x50(%rbp,%r11,8)
     0.09%       │    0x00007f1607a0583a: vpaddq 0x70(%rbp,%r11,8),%ymm0,%ymm1
     0.30%       │    0x00007f1607a05841: vmovdqu %ymm1,0x70(%rbp,%r11,8)
                 │    0x00007f1607a05848: vpaddq 0x90(%rbp,%r11,8),%ymm0,%ymm1
     0.15%       │    0x00007f1607a05852: vmovdqu %ymm1,0x90(%rbp,%r11,8)
     0.20%       │    0x00007f1607a0585c: vpaddq 0xb0(%rbp,%r11,8),%ymm0,%ymm1
     0.11%       │    0x00007f1607a05866: vmovdqu %ymm1,0xb0(%rbp,%r11,8)
     0.09%       │    0x00007f1607a05870: vpaddq 0xd0(%rbp,%r11,8),%ymm0,%ymm1
     0.37%       │    0x00007f1607a0587a: vmovdqu %ymm1,0xd0(%rbp,%r11,8)
     0.07%       │    0x00007f1607a05884: vpaddq 0xf0(%rbp,%r11,8),%ymm0,%ymm1
     0.26%       │    0x00007f1607a0588e: vmovdqu %ymm1,0xf0(%rbp,%r11,8)  
;*lastore {reexecute=0 rethrow=0 return_oop=0}
                 │                                                  ; - 
jpountz.PForDeltaDecoder::prefixSumOfTwoLoops@55 (line 39)
                 │                                                  ; - 
jpountz.PForDeltaDecoder::decodeAndPrefixSum@32 (line 59)
                 │                                                  ; - 
jpountz.PackedIntsDeltaDecodeBenchmark::pForDeltaDecoder@42 (line 29)
                 │                                                  ; - 
jpountz.generated.PackedIntsDeltaDecodeBenchmark_pForDeltaDecoder_jmhTest::pForDeltaDecoder_thrpt_jmhStub@151
 (line 240)
     0.15%       │    0x00007f1607a05898: add    $0x20,%r11d        ;*iinc 
{reexecute=0 rethrow=0 return_oop=0}
                 │                                                  ; - 
jpountz.PForDeltaDecoder::prefixSumOfTwoLoops@56 (line 38)
                 │                                                  ; - 
jpountz.PForDeltaDecoder::decodeAndPrefixSum@32 (line 59)
                 │                                                  ; - 
jpountz.PackedIntsDeltaDecodeBenchmark::pForDeltaDecoder@42 (line 29)
                 │                                                  ; - 
jpountz.generated.PackedIntsDeltaDecodeBenchmark_pForDeltaDecoder_jmhTest::pForDeltaDecoder_thrpt_jmhStub@151
 (line 240)
     0.13%       │    0x00007f1607a0589c: cmp    $0x61,%r11d
                 ╰    0x00007f1607a058a0: jl     0x00007f1607a05810  
;*if_icmpge {reexecute=0 rethrow=0 return_oop=0}
   ```
   [5]
   ```
     0.31%   ↗        0x00007f1607a05750: mov    %r13,%r11
     1.06%   │        0x00007f1607a05753: imul   0x10(%rbp,%r10,8),%r11
     1.70%   │        0x00007f1607a05759: mov    %r11,0x10(%rbp,%r10,8)
     3.32%   │        0x00007f1607a0575e: mov    %r13,%r11
     0.28%   │        0x00007f1607a05761: imul   0x18(%rbp,%r10,8),%r11
     2.72%   │        0x00007f1607a05767: mov    %r11,0x18(%rbp,%r10,8)
     2.24%   │        0x00007f1607a0576c: mov    %r13,%r11
     1.67%   │        0x00007f1607a0576f: imul   0x20(%rbp,%r10,8),%r11
     0.26%   │        0x00007f1607a05775: mov    %r11,0x20(%rbp,%r10,8)
     6.17%   │        0x00007f1607a0577a: mov    %r13,%r11
     0.69%   │        0x00007f1607a0577d: imul   0x28(%rbp,%r10,8),%r11
     1.54%   │        0x00007f1607a05783: mov    %r11,0x28(%rbp,%r10,8)  
;*lastore {reexecute=0 rethrow=0 return_oop=0}
             │                                                      ; - 
jpountz.PForDeltaDecoder::prefixSumOfTwoLoops@30 (line 36)
             │                                                      ; - 
jpountz.PForDeltaDecoder::decodeAndPrefixSum@32 (line 59)
             │                                                      ; - 
jpountz.PackedIntsDeltaDecodeBenchmark::pForDeltaDecoder@42 (line 29)
             │                                                      ; - 
jpountz.generated.PackedIntsDeltaDecodeBenchmark_pForDeltaDecoder_jmhTest::pForDeltaDecoder_thrpt_jmhStub@151
 (line 240)
     1.80%   │        0x00007f1607a05788: add    $0x4,%r10d         ;*iinc 
{reexecute=0 rethrow=0 return_oop=0}
             │                                                      ; - 
jpountz.PForDeltaDecoder::prefixSumOfTwoLoops@31 (line 35)
             │                                                      ; - 
jpountz.PForDeltaDecoder::decodeAndPrefixSum@32 (line 59)
             │                                                      ; - 
jpountz.PackedIntsDeltaDecodeBenchmark::pForDeltaDecoder@42 (line 29)
             │                                                      ; - 
jpountz.generated.PackedIntsDeltaDecodeBenchmark_pForDeltaDecoder_jmhTest::pForDeltaDecoder_thrpt_jmhStub@151
 (line 240)
     1.59%   │        0x00007f1607a0578c: cmp    $0x7d,%r10d
             ╰        0x00007f1607a05790: jl     0x00007f1607a05750  
;*if_icmpge {reexecute=0 rethrow=0 return_oop=0}
   ```
   [6]
   First run with a single loop:
   ```
   Benchmark                                             (bitsPerValue)  
(exceptionCount)  (sameVal)   Mode  Cnt  Score   Error   Units
   PackedIntsDeltaDecodeBenchmark.pForDeltaDecoder                    0         
        0          1  thrpt    5  8.829 ± 0.045  ops/us
   PackedIntsDeltaDecodeBenchmark.pForDeltaDecoder:·asm               0         
        0          1  thrpt         NaN             ---
   PackedIntsDeltaDecodeBenchmark.pForDeltaDecoder                    0         
        0          2  thrpt    5  7.165 ± 0.217  ops/us
   PackedIntsDeltaDecodeBenchmark.pForDeltaDecoder:·asm               0         
        0          2  thrpt         NaN             ---
   ```
   Second run with two loops:
   ```
   Benchmark                                             (bitsPerValue)  
(exceptionCount)  (sameVal)   Mode  Cnt  Score   Error   Units
   PackedIntsDeltaDecodeBenchmark.pForDeltaDecoder                    0         
        0          1  thrpt    5  9.294 ± 0.041  ops/us
   PackedIntsDeltaDecodeBenchmark.pForDeltaDecoder:·asm               0         
        0          1  thrpt         NaN             ---
   PackedIntsDeltaDecodeBenchmark.pForDeltaDecoder                    0         
        0          2  thrpt    5  6.635 ± 0.015  ops/us
   PackedIntsDeltaDecodeBenchmark.pForDeltaDecoder:·asm               0         
        0          2  thrpt         NaN             ---
   ```




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

Reply via email to