Re: RFR: 8322768: Optimize non-subword vector compress and expand APIs for AVX2 target. [v2]

Jatin Bhateja Mon, 08 Jan 2024 22:28:48 -0800

On Mon, 8 Jan 2024 07:55:00 GMT, Emanuel Peter <epe...@openjdk.org> wrote:


>>> You are using `VectorMask<Integer> pred = VectorMask.fromLong(ispecies, 
>>> maskctr++);`. That basically systematically iterates over all masks, which 
>>> is nice for a correctness test. But that would use different density inside 
>>> one test run, right? The average over the loop is still at `50%`, correct?
>>> 
>>> I was thinking more a run where the percentage over the whole loop is lower 
>>> than maybe `1%`. That would get us to a point where maybe the branch 
>>> prediction of non-vectorized code might be faster, what do you think?
>> 
>> An imperative loop for compression will check each mask bit to select 
>> compressible lane. Therefore mask with low or high density of set bits 
>> should show similar performance.
>
> Yes, IF it is vectorized, then there is no difference between high and low 
> density. My concern was more if vectorization is preferable over the scalar 
> alternative in the low-density case, where branch prediction is more stable.

At runtime we do need to scan the entire mask to pick the compressible lanes
corresponding to the set mask bits. Thus the loop overhead of the mask compare
(BTW, masks are held in a vector register on AVX2 targets) and the jump will be
incurred anyway. In addition, for a sparsely populated mask we may incur an
additional misprediction penalty for not taking the if block, which extracts an
element from the appropriate source vector lane and inserts it into the
destination vector lane. Overall, the vector solution will win in the most
common cases with varying masks, and also for very sparsely populated masks.
Here is the result of setting just a single mask bit. I am in the process of
updating the benchmark for 128-bit species and will update the patch.


    @Benchmark
    public void fuzzyFilterIntColumn() {
        int i = 0;
        int j = 0;
        int endIndex = ispecies.loopBound(size);
        for (; i < endIndex; i += ispecies.length()) {
            IntVector vec = IntVector.fromArray(ispecies, intinCol, i);
            // Constant mask with a single set bit: only lane 0 is selected.
            VectorMask<Integer> pred = VectorMask.fromLong(ispecies, 1);
            vec.compress(pred).intoArray(intoutCol, j);
            j += pred.trueCount();
        }
    }


Baseline:
Benchmark                                   (size)   Mode  Cnt     Score  Error   Units
ColumnFilterBenchmark.fuzzyFilterIntColumn    1024  thrpt    2   379.059         ops/ms
ColumnFilterBenchmark.fuzzyFilterIntColumn    2047  thrpt    2   188.355         ops/ms
ColumnFilterBenchmark.fuzzyFilterIntColumn    4096  thrpt    2    95.315         ops/ms


Withopt:
Benchmark                                   (size)   Mode  Cnt     Score  Error   Units
ColumnFilterBenchmark.fuzzyFilterIntColumn    1024  thrpt    2  7390.074         ops/ms
ColumnFilterBenchmark.fuzzyFilterIntColumn    2047  thrpt    2  3483.247         ops/ms
ColumnFilterBenchmark.fuzzyFilterIntColumn    4096  thrpt    2  1823.817         ops/ms
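
For reference, the imperative compress loop discussed in this thread can be
sketched roughly as below. This is only an illustration under my own
assumptions: the class and method names (ScalarCompressSketch, compress) are
hypothetical and are not part of the patch or of ColumnFilterBenchmark. The
point it shows is that every mask bit is tested on each pass, regardless of
how many bits are set.

```java
public class ScalarCompressSketch {
    // Hypothetical helper, not taken from the patch or the benchmark.
    // Copies src[srcOff + lane] into dst for each lane whose mask bit is set,
    // packing the selected elements contiguously from dstOff onward.
    // Returns the number of elements written.
    static int compress(int[] src, int srcOff, long mask, int lanes,
                        int[] dst, int dstOff) {
        int written = 0;
        for (int lane = 0; lane < lanes; lane++) {
            // The compare and branch run for every lane, set or not.
            if ((mask & (1L << lane)) != 0) {
                dst[dstOff + written] = src[srcOff + lane];
                written++;
            }
        }
        return written;
    }

    public static void main(String[] args) {
        int[] in = {10, 20, 30, 40, 50, 60, 70, 80};
        int[] out = new int[in.length];
        // Single set mask bit, mirroring the sparse case measured above.
        int n = compress(in, 0, 1L, 8, out, 0);
        System.out.println(n + " element(s) selected, first = " + out[0]);
    }
}
```

With a single set bit the if block is taken once per eight lanes here, but the
loop still performs eight mask tests per group of lanes.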

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/17261#discussion_r1445666305
