Micro-benchmarks for ntz and pop (BitUtils) operations.
-------------------------------------------------------

                 Key: LUCENE-2221
                 URL: https://issues.apache.org/jira/browse/LUCENE-2221
             Project: Lucene - Java
          Issue Type: Task
          Components: Other
            Reporter: Dawid Weiss
            Priority: Trivial


As suggested by Yonik, I performed a suite of micro-benchmarks to investigate 
the following:
* pop() (bitCount) seems to be implemented in the same way ("hacker's delight") 
as in the BitUtils class (SUN's standard library from version 1.5).
* native intrinsics have been recently added to the HotSpot that should speed 
up bitCount significantly.

I have tried to run the code on various VMs and architectures, but of course 
the results may vary depending on the setting.

h2. Micro-benchmark code, evaluation

* The micro-benchmarks were written as JUnit tests with a custom set of rules 
that repeats each test measuring execution time, gc use, etc.
* There were 5 warmup runs before each test, followed by 15 benchmarked runs. 
The result contain overall times, round times and standard deviations where 
applicable.
* There were several tests for isolated performance of {{BitUtil.pop()}}, JDK's 
{{bitCount}}, {{BitUtil.ntz()}}, {{BitUtil.ntz2()}}, {{BitUtil.ntz3()}} and 
JDK's {{numberOfTrailingZeros}}, the test code had the following loop:
{code}
final long [] longs = NtzPopBenchmark.random.bits;
int cnt = 0;
for (int i = longs.length; --i >= 0;)
{
   cnt += Long.bitCount(longs[i]);
}
volatileVariable = cnt; // to prevent dead code removal.
{code}
* I also added another version of pop() based on a precomputed bit counts. This 
version was called {{pop2}}.

* The input array of long values was initialized to a memory taking 200MB. 
There were two different sets: {{random}} (random values) and {{single}} 
(single bit rotated to the right in each long).

* There were tests that called {{BitUtil.pop_xor}} between the two input 
bitsets (random, single).
* Additional tests that used iterators and and other BitUtil operations showed 
similar performance to isolated loops, so I omit them here.

h2. Evaluation environment

I tested on three different machines:
* Pentium 4, 32-bit, 3GHZ, 2GB RAM (Windows)
* AMD Athlon(tm) 64 X2 Dual Core Processor 5200+, 64-bit, 4GB RAM (Ubuntu)
* Intel(R) Core(TM)2 Quad CPU    Q9650  @ 3.00GHz, 64-bit, 4GB RAM (Ubuntu)

and on various VMs:
* 1.6.0_17, Java HotSpot(TM) Server VM, 14.3-b01, Sun Microsystems Inc., 
* 1.5.0_18, Java HotSpot(TM) Server VM, 1.5.0_18-b02, Sun Microsystems Inc., 
* 1.7.0-ea, Java HotSpot(TM) Server VM, 17.0-b06, Sun Microsystems Inc., 
* 1.6.0, IBM J9 VM, 2.4, IBM Corporation,
* BEA JRockit.
* (ant other minor versions of the VMs above, depending on the computer).

h2. Results overview

h3. {{pop}}

The times between {{BitUtil}} and JDK were mostly identical. However, on 32-bit 
systems, precached {{pop2}} performed
*much* better. Examples:

{noformat}
# 1.6.0_17, Java HotSpot(TM) Server VM, 14.3-b01, Sun Microsystems Inc., 
test_POP_JDK_random       : 15/20 rounds, time.total: 15.61, time.warmup: 4.31, 
time.bench: 11.30, round: 0.75 [+- 0.02]
test_POP_JDK_single       : 15/20 rounds, time.total: 15.67, time.warmup: 4.31, 
time.bench: 11.36, round: 0.76 [+- 0.02]
test_POP_BitUtil_random   : 15/20 rounds, time.total: 15.55, time.warmup: 4.33, 
time.bench: 11.22, round: 0.75 [+- 0.01]
test_POP_BitUtil_single   : 15/20 rounds, time.total: 15.55, time.warmup: 4.31, 
time.bench: 11.23, round: 0.75 [+- 0.01]
test_POP2_random          : 15/20 rounds, time.total:  6.69, time.warmup: 1.75, 
time.bench:  4.94, round: 0.33 [+- 0.00]
test_POP2_single          : 15/20 rounds, time.total:  4.66, time.warmup: 1.22, 
time.bench:  3.44, round: 0.23 [+- 0.01]
{noformat}

Note the difference between random and single distributions -- most probably 
due to more cache hits when referring to the
lookup table. Other VMs on this 32-bit machine:

{noformat}
# 1.5.0_18, Java HotSpot(TM) Server VM, 1.5.0_18-b02, Sun Microsystems Inc., 
test_POP_JDK_random       : 15/20 rounds, time.total: 20.67, time.warmup: 5.19, 
time.bench: 15.48, round: 1.03 [+- 0.01]
test_POP_JDK_single       : 15/20 rounds, time.total: 22.70, time.warmup: 5.63, 
time.bench: 17.08, round: 1.14 [+- 0.01]
test_POP_BitUtil_random   : 15/20 rounds, time.total: 22.69, time.warmup: 5.63, 
time.bench: 17.06, round: 1.14 [+- 0.01]
test_POP_BitUtil_single   : 15/20 rounds, time.total: 20.67, time.warmup: 5.19, 
time.bench: 15.48, round: 1.03 [+- 0.01]
test_POP2_random          : 15/20 rounds, time.total:  6.30, time.warmup: 1.63, 
time.bench:  4.67, round: 0.31 [+- 0.01]
test_POP2_single          : 15/20 rounds, time.total:  4.33, time.warmup: 1.16, 
time.bench:  3.17, round: 0.21 [+- 0.01]

# 1.7.0-ea, Java HotSpot(TM) Server VM, 17.0-b06, Sun Microsystems Inc., 
test_POP_JDK_random       : 15/20 rounds, time.total: 15.28, time.warmup: 4.25, 
time.bench: 11.03, round: 0.74 [+- 0.03]
test_POP_JDK_single       : 15/20 rounds, time.total: 15.16, time.warmup: 4.20, 
time.bench: 10.95, round: 0.73 [+- 0.01]
test_POP_BitUtil_random   : 15/20 rounds, time.total: 15.12, time.warmup: 4.20, 
time.bench: 10.92, round: 0.73 [+- 0.01]
test_POP_BitUtil_single   : 15/20 rounds, time.total: 15.13, time.warmup: 4.25, 
time.bench: 10.88, round: 0.73 [+- 0.01]
test_POP2_random          : 15/20 rounds, time.total:  6.78, time.warmup: 1.72, 
time.bench:  5.06, round: 0.34 [+- 0.01]
test_POP2_single          : 15/20 rounds, time.total:  4.72, time.warmup: 1.20, 
time.bench:  3.52, round: 0.23 [+- 0.02]
{noformat}

On 64-bit machines, the results are nearly equal, with pop2 performing slightly 
worse on SUN's 1.6 compared to JDK and BitUtil 
(but this difference is really tiny and not present on all VMs; see IBM J9 and 
SUN's 1.5 below).

{noformat}
# 1.6.0_16, Java HotSpot(TM) 64-Bit Server VM, 14.2-b01, Sun Microsystems Inc.,
test_POP_JDK_random       : 15/20 rounds, time.total: 3.27, time.warmup: 0.81, 
time.bench: 2.46, round: 0.16 [+- 0.00]
test_POP_JDK_single       : 15/20 rounds, time.total: 3.11, time.warmup: 0.76, 
time.bench: 2.34, round: 0.16 [+- 0.02]
test_POP_BitUtil_random   : 15/20 rounds, time.total: 3.27, time.warmup: 0.81, 
time.bench: 2.46, round: 0.16 [+- 0.00]
test_POP_BitUtil_single   : 15/20 rounds, time.total: 3.03, time.warmup: 0.77, 
time.bench: 2.26, round: 0.15 [+- 0.00]
test_POP2_random          : 15/20 rounds, time.total: 3.63, time.warmup: 0.93, 
time.bench: 2.70, round: 0.18 [+- 0.00]
test_POP2_single          : 15/20 rounds, time.total: 3.51, time.warmup: 0.89, 
time.bench: 2.62, round: 0.17 [+- 0.00]

# 1.6.0, IBM J9 VM, 2.4, IBM Corporation,
test_POP_JDK_random       : 15/20 rounds, time.total: 4.80, time.warmup: 1.24, 
time.bench: 3.57, round: 0.24 [+- 0.01]
test_POP_JDK_single       : 15/20 rounds, time.total: 5.00, time.warmup: 1.44, 
time.bench: 3.56, round: 0.24 [+- 0.01]
test_POP_BitUtil_random   : 15/20 rounds, time.total: 4.81, time.warmup: 1.24, 
time.bench: 3.56, round: 0.24 [+- 0.01]
test_POP_BitUtil_single   : 15/20 rounds, time.total: 4.75, time.warmup: 1.19, 
time.bench: 3.56, round: 0.24 [+- 0.01]
test_POP2_random          : 15/20 rounds, time.total: 3.65, time.warmup: 0.90, 
time.bench: 2.76, round: 0.18 [+- 0.00]
test_POP2_single          : 15/20 rounds, time.total: 3.82, time.warmup: 0.93, 
time.bench: 2.89, round: 0.19 [+- 0.01]

# 1.5.0_18, Java HotSpot(TM) 64-Bit Server VM, 1.5.0_18-b02, Sun Microsystems 
Inc.,
test_POP_JDK_random       : 15/20 rounds, time.total: 3.72, time.warmup: 0.94, 
time.bench: 2.78, round: 0.19 [+- 0.00]
test_POP_JDK_single       : 15/20 rounds, time.total: 5.96, time.warmup: 1.40, 
time.bench: 4.56, round: 0.30 [+- 0.00]
test_POP_BitUtil_random   : 15/20 rounds, time.total: 6.16, time.warmup: 1.43, 
time.bench: 4.73, round: 0.31 [+- 0.00]
test_POP_BitUtil_single   : 15/20 rounds, time.total: 3.62, time.warmup: 0.92, 
time.bench: 2.70, round: 0.18 [+- 0.00]
test_POP2_random          : 15/20 rounds, time.total: 3.70, time.warmup: 0.96, 
time.bench: 2.74, round: 0.18 [+- 0.00]
test_POP2_single          : 15/20 rounds, time.total: 3.57, time.warmup: 0.93, 
time.bench: 2.65, round: 0.18 [+- 0.00]
{noformat}

The other 64-bit machine (quad-core):

{noformat}
# 1.7.0-ea, Java HotSpot(TM) 64-Bit Server VM, 17.0-b06, Sun Microsystems Inc.,
test_POP_JDK_random       : 15/20 rounds, time.total: 2.46, time.warmup: 0.62, 
time.bench: 1.84, round: 0.12 [+- 0.00]
test_POP_JDK_single       : 15/20 rounds, time.total: 2.49, time.warmup: 0.62, 
time.bench: 1.87, round: 0.12 [+- 0.01]
test_POP_BitUtil_random   : 15/20 rounds, time.total: 2.47, time.warmup: 0.62, 
time.bench: 1.84, round: 0.12 [+- 0.00]
test_POP_BitUtil_single   : 15/20 rounds, time.total: 2.46, time.warmup: 0.62, 
time.bench: 1.84, round: 0.12 [+- 0.00]
test_POP2_random          : 15/20 rounds, time.total: 2.82, time.warmup: 0.71, 
time.bench: 2.11, round: 0.14 [+- 0.00]
test_POP2_single          : 15/20 rounds, time.total: 2.15, time.warmup: 0.55, 
time.bench: 1.61, round: 0.11 [+- 0.00]
{noformat}

I then replaced {{BitUtil.pop}} with {{BitUtil.pop2}} in bit-counting methods 
like xor/and/or. The results are intriguing.
On 32-bit systems, there is a measureable gain, like here:

{noformat}
# 1.6.0_17, Java HotSpot(TM) Server VM, 14.3-b01, Sun Microsystems Inc., 
test_pop_xor              : 15/20 rounds, time.total:  9.78, time.warmup: 2.59, 
time.bench:  7.19, round: 0.48 [+- 0.01]
test_pop2_hd_xor          : 15/20 rounds, time.total:  8.27, time.warmup: 2.22, 
time.bench:  6.05, round: 0.40 [+- 0.01]

# 1.7.0-ea, Java HotSpot(TM) Server VM, 17.0-b06, Sun Microsystems Inc., 
test_pop_xor              : 15/20 rounds, time.total:  9.89, time.warmup: 2.59, 
time.bench: 7.30, round: 0.49 [+- 0.02]
test_pop2_hd_xor          : 15/20 rounds, time.total:  8.20, time.warmup: 2.24, 
time.bench: 5.97, round: 0.40 [+- 0.01]
{noformat}

On 64-bit systems, when 64-bit values can be manipulated directly in registers, 
there was nearly no speedup or even
a small performance penalty like in here:

{noformat}
# 1.7.0-ea, Java HotSpot(TM) 64-Bit Server VM, 17.0-b06, Sun Microsystems Inc.,
test_pop_xor              : 15/20 rounds, time.total: 1.76, time.warmup: 0.49, 
time.bench: 1.27, round: 0.09 [+- 0.00]
test_pop2_hd_xor          : 15/20 rounds, time.total: 2.06, time.warmup: 0.55, 
time.bench: 1.51, round: 0.10 [+- 0.00]
{noformat}

I'm guessing referencing memory on this fast processors is slower than 
manipulating registers.

h3. {{ntz}}

On JVMs prior to 1.7, the {{ntz} version from Lucene was much faster in my 
tests than the one from JDK, 
but it also has a greater variance depending on the input bits' distribution 
(compare the same routine 
for random and single below).

{noformat}
# 32-bit system;
# 1.6.0_17, Java HotSpot(TM) Server VM, 14.3-b01, Sun Microsystems Inc., 
test_NTZ_JDK_random       : 15/20 rounds, time.total:  6.69, time.warmup: 1.73, 
time.bench: 4.95, round: 0.33 [+- 0.01]
test_NTZ_JDK_single       : 15/20 rounds, time.total:  7.59, time.warmup: 1.94, 
time.bench: 5.66, round: 0.38 [+- 0.01]
test_NTZ_BitUtil_random   : 15/20 rounds, time.total:  2.72, time.warmup: 0.73, 
time.bench: 1.98, round: 0.13 [+- 0.02]
test_NTZ_BitUtil_single   : 15/20 rounds, time.total:  5.28, time.warmup: 1.34, 
time.bench: 3.94, round: 0.26 [+- 0.02]
test_NTZ2_BitUtil_random  : 15/20 rounds, time.total:  3.06, time.warmup: 0.81, 
time.bench: 2.25, round: 0.15 [+- 0.01]
test_NTZ2_BitUtil_single  : 15/20 rounds, time.total:  5.36, time.warmup: 1.34, 
time.bench: 4.02, round: 0.27 [+- 0.01]
test_NTZ3_BitUtil_random  : 15/20 rounds, time.total:  5.80, time.warmup: 1.48, 
time.bench: 4.31, round: 0.29 [+- 0.01]
test_NTZ3_BitUtil_single  : 15/20 rounds, time.total:  6.98, time.warmup: 1.81, 
time.bench: 5.17, round: 0.34 [+- 0.01]

# 64-bit Athlon
# 1.6.0_16, Java HotSpot(TM) 64-Bit Server VM, 14.2-b01, Sun Microsystems Inc.,
test_NTZ_JDK_random       : 15/20 rounds, time.total: 4.59, time.warmup: 1.16, 
time.bench: 3.44, round: 0.23 [+- 0.00]
test_NTZ_JDK_single       : 15/20 rounds, time.total: 6.64, time.warmup: 1.59, 
time.bench: 5.04, round: 0.34 [+- 0.01]
test_NTZ_BitUtil_random   : 15/20 rounds, time.total: 2.09, time.warmup: 0.53, 
time.bench: 1.56, round: 0.10 [+- 0.00]
test_NTZ_BitUtil_single   : 15/20 rounds, time.total: 3.87, time.warmup: 0.98, 
time.bench: 2.90, round: 0.19 [+- 0.00]
test_NTZ2_BitUtil_random  : 15/20 rounds, time.total: 2.09, time.warmup: 0.52, 
time.bench: 1.57, round: 0.10 [+- 0.00]
test_NTZ2_BitUtil_single  : 15/20 rounds, time.total: 3.31, time.warmup: 0.84, 
time.bench: 2.47, round: 0.16 [+- 0.00]
test_NTZ3_BitUtil_random  : 15/20 rounds, time.total: 3.31, time.warmup: 0.83, 
time.bench: 2.48, round: 0.17 [+- 0.00]
test_NTZ3_BitUtil_single  : 15/20 rounds, time.total: 5.71, time.warmup: 1.39, 
time.bench: 4.32, round: 0.29 [+- 0.00]
{noformat}

But then comes the 1.7 HotSport and things change radically, on 32-bit system 
the JDK's version is much faster for nearly-empty
{{long}} values:

{noformat}
# 1.7.0-ea, Java HotSpot(TM) Server VM, 17.0-b06, Sun Microsystems Inc., 
test_NTZ_JDK_random       : 15/20 rounds, time.total:  1.97, time.warmup: 0.61, 
time.bench: 1.36, round: 0.09 [+- 0.01]
test_NTZ_JDK_single       : 15/20 rounds, time.total:  2.53, time.warmup: 0.77, 
time.bench: 1.77, round: 0.12 [+- 0.01]
test_NTZ_BitUtil_random   : 15/20 rounds, time.total:  2.36, time.warmup: 0.66, 
time.bench: 1.70, round: 0.11 [+- 0.01]
test_NTZ_BitUtil_single   : 15/20 rounds, time.total:  4.50, time.warmup: 1.19, 
time.bench: 3.31, round: 0.22 [+- 0.01]
test_NTZ2_BitUtil_random  : 15/20 rounds, time.total:  3.08, time.warmup: 0.81, 
time.bench: 2.27, round: 0.15 [+- 0.01]
test_NTZ2_BitUtil_single  : 15/20 rounds, time.total:  4.97, time.warmup: 1.28, 
time.bench: 3.69, round: 0.25 [+- 0.01]
test_NTZ3_BitUtil_random  : 15/20 rounds, time.total:  5.78, time.warmup: 1.48, 
time.bench: 4.30, round: 0.29 [+- 0.01]
test_NTZ3_BitUtil_single  : 15/20 rounds, time.total:  7.77, time.warmup: 1.91, 
time.bench: 5.86, round: 0.39 [+- 0.01]
{noformat}

On the 64-bit quad core:

{noformat}
# 1.6.0_13, Java HotSpot(TM) 64-Bit Server VM, 11.3-b02, Sun Microsystems Inc.,
test_NTZ_JDK_random       : 15/20 rounds, time.total: 3.92, time.warmup: 0.97, 
time.bench: 2.94, round: 0.20 [+- 0.00]
test_NTZ_JDK_single       : 15/20 rounds, time.total: 3.80, time.warmup: 0.97, 
time.bench: 2.82, round: 0.19 [+- 0.00]
test_NTZ_BitUtil_random   : 15/20 rounds, time.total: 0.96, time.warmup: 0.25, 
time.bench: 0.71, round: 0.05 [+- 0.00]
test_NTZ_BitUtil_single   : 15/20 rounds, time.total: 2.74, time.warmup: 0.69, 
time.bench: 2.04, round: 0.14 [+- 0.00]
test_NTZ2_BitUtil_random  : 15/20 rounds, time.total: 1.22, time.warmup: 0.31, 
time.bench: 0.91, round: 0.06 [+- 0.00]
test_NTZ2_BitUtil_single  : 15/20 rounds, time.total: 2.18, time.warmup: 0.56, 
time.bench: 1.62, round: 0.11 [+- 0.00]
test_NTZ3_BitUtil_random  : 15/20 rounds, time.total: 2.76, time.warmup: 0.71, 
time.bench: 2.06, round: 0.14 [+- 0.00]
test_NTZ3_BitUtil_single  : 15/20 rounds, time.total: 3.47, time.warmup: 0.91, 
time.bench: 2.56, round: 0.17 [+- 0.01]
{noformat}

And then comes the 1.7, compare JDK's implementation with anything else 
(especially the {{time.bench}} for
the {{single}} input data. Looks like this is hardware-accelerated.

{noformat}
# -server -Xbatch -Xmx1024m
# 1.7.0-ea, Java HotSpot(TM) 64-Bit Server VM, 17.0-b06, Sun Microsystems Inc.,
test_NTZ_JDK_random       : 15/20 rounds, time.total: 0.79, time.warmup: 0.21, 
time.bench: 0.58, round: 0.04 [+- 0.00]
test_NTZ_JDK_single       : 15/20 rounds, time.total: 0.75, time.warmup: 0.20, 
time.bench: 0.55, round: 0.04 [+- 0.00]
test_NTZ_BitUtil_random   : 15/20 rounds, time.total: 0.98, time.warmup: 0.25, 
time.bench: 0.72, round: 0.05 [+- 0.00]
test_NTZ_BitUtil_single   : 15/20 rounds, time.total: 2.61, time.warmup: 0.66, 
time.bench: 1.95, round: 0.13 [+- 0.00]
test_NTZ2_BitUtil_random  : 15/20 rounds, time.total: 1.30, time.warmup: 0.33, 
time.bench: 0.97, round: 0.06 [+- 0.00]
test_NTZ2_BitUtil_single  : 15/20 rounds, time.total: 2.48, time.warmup: 0.61, 
time.bench: 1.88, round: 0.13 [+- 0.00]
test_NTZ3_BitUtil_random  : 15/20 rounds, time.total: 2.81, time.warmup: 0.70, 
time.bench: 2.11, round: 0.14 [+- 0.00]
test_NTZ3_BitUtil_single  : 15/20 rounds, time.total: 4.07, time.warmup: 1.02, 
time.bench: 3.05, round: 0.20 [+- 0.00]
{noformat}

h2. Conclusions

It seems that any change introduced to these routines will hurt somebody in 
some configuration, so it's really hard
for me to make choices. I would definitely opt for the precached {{pop2}} 
version on 32-bit systems as it seems to 
be always faster or equally fast compared to other bit counting options. 
{{pop2}} looked like this:

{code}
   private static byte [] bcounts = new byte [0x10000];
   static
   {
       for (int i = 0x10000; --i >= 0;)
           bcounts[i] = (byte) Integer.bitCount(i);
   }

   public static int pop2(long v)
   {
       int t;
       return 
             bcounts[(t = (int) v) & 0xffff]
           + bcounts[t >>> 16]
           + bcounts[(t = ((int) (v >>> 32))) & 0xffff]
           + bcounts[t >>> 16];
   }
{code}

As for the hardware-accelerated {{ntz}}, if one can detect 1.7, then using the 
JDK's version is speeding up things
significantly. But I have not checked how this detection would affect speed if 
done at run-time (I assume a final
static flag wouldn't cause any performance penalty) and it is definitely not 
worth replacing for folks with older
VMs.

h2. Raw results data.

I will attach raw results as part of the issue if you want to draw your own 
conclusions. Didn't have access to sparc-machine
or to any machine with the newest Intels.


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org

Reply via email to