Re: RFR 8209171 : Simplify Java implementation of Integer/Long.numberOfTrailingZeros()

2018-08-13 Thread Martin Buchholz
Approved!

There was a lot of benchmarking work, so I'd like to see the jmh benchmark
checked in.
Especially since the exact variant we eventually settled on is not in
Hacker's Delight.

On Mon, Aug 13, 2018 at 12:09 PM, Ivan Gerasimov 
wrote:

> Okay, your last variant is the winner :)
>
> Here's the updated webrev, benchmark and the graphs:
>
> http://cr.openjdk.java.net/~igerasim/8209171/02/webrev/
> http://cr.openjdk.java.net/~igerasim/8209171/02/bench/int/MyBenchmark.java
> http://cr.openjdk.java.net/~igerasim/8209171/02/bench/int/be
> nch-int-04-client.png
> http://cr.openjdk.java.net/~igerasim/8209171/02/bench/int/be
> nch-int-04-server.png
>
> With kind regards,
> Ivan
>
>
> On 8/13/18 10:10 AM, Ivan Gerasimov wrote:
>
>> Hi Martin!
>>
>> Good point about the command line flags, thanks!
>>
>> These variants are close to numberOfTrailingZeros_07 that I've already
>> tested, though you did better by saving one arithmetic operation at the
>> return line!
>>
>> I'll rerun the benchmarks.
>>
>> With kind regards,
>>
>> Ivan
>>
>>
>> On 8/13/18 7:56 AM, Martin Buchholz wrote:
>>
>>> The number of plausible variants is astonishing!
>>>
>>> ---
>>>
>>> Your use of -client and -server is outdated, which explains why you get
>>> the same results for both (-client is ignored).
>>>
>>> I'm not sure what's blessed by hotspot team, but for C1 I use
>>> -XX:+TieredCompilation -XX:TieredStopAtLevel=1 and for C2 I use
>>> -XX:-TieredCompilation -server
>>>
>>> ---
>>>
>>> Now I understand the advantage of using ~i & (i - 1): the subsequent
>>> zero check is a short-circuit for all odd numbers, better than i & -i,
>>> which explains your results - they depend on being able to short-circuit.
>>>
>>> So just use a more faithful inlining of nlz without trying to improve on
>>> it.
>>>
>>> static int ntz_inlineNlz5(int i) {
>>> i = ~i & (i - 1);
>>> if (i <= 0)
>>> return (i == 0) ? 0 : 32;
>>> int n = 1;
>>> if (i >= 1 << 16) { n += 16; i >>>= 16; }
>>> if (i >= 1 <<  8) { n +=  8; i >>>= 8; }
>>> if (i >= 1 <<  4) { n +=  4; i >>>= 4; }
>>> if (i >= 1 <<  2) { n +=  2; i >>>= 2; }
>>> return n + (i >>> 1);
>>> }
>>>
>>> But it's hard to resist the urge to optimize out a branch:
>>>
>>> static int ntz_inlineNlz6(int i) {
>>> i = ~i & (i - 1);
>>> if (i <= 0) return i & 32;
>>> int n = 1;
>>> if (i >= 1 << 16) { n += 16; i >>>= 16; }
>>> if (i >= 1 <<  8) { n +=  8; i >>>= 8; }
>>> if (i >= 1 <<  4) { n +=  4; i >>>= 4; }
>>> if (i >= 1 <<  2) { n +=  2; i >>>= 2; }
>>> return n + (i >>> 1);
>>> }
>>>
>>>
>>
> --
> With kind regards,
> Ivan Gerasimov
>
>


Re: RFR 8209171 : Simplify Java implementation of Integer/Long.numberOfTrailingZeros()

2018-08-13 Thread Ivan Gerasimov

Okay, your last variant is the winner :)

Here's the updated webrev, benchmark and the graphs:

http://cr.openjdk.java.net/~igerasim/8209171/02/webrev/
http://cr.openjdk.java.net/~igerasim/8209171/02/bench/int/MyBenchmark.java
http://cr.openjdk.java.net/~igerasim/8209171/02/bench/int/bench-int-04-client.png
http://cr.openjdk.java.net/~igerasim/8209171/02/bench/int/bench-int-04-server.png

With kind regards,
Ivan

On 8/13/18 10:10 AM, Ivan Gerasimov wrote:

Hi Martin!

Good point about the command line flags, thanks!

These variants are close to numberOfTrailingZeros_07 that I've already 
tested, though you did better by saving one arithmetic operation at 
the return line!


I'll rerun the benchmarks.

With kind regards,

Ivan


On 8/13/18 7:56 AM, Martin Buchholz wrote:

The number of plausible variants is astonishing!

---

Your use of -client and -server is outdated, which explains why you 
get the same results for both (-client is ignored).


I'm not sure what's blessed by hotspot team, but for C1 I use 
-XX:+TieredCompilation -XX:TieredStopAtLevel=1 and for C2 I use 
-XX:-TieredCompilation -server


---

Now I understand the advantage of using ~i & (i - 1): the subsequent 
zero check is a short-circuit for all odd numbers, better than i & 
-i, which explains your results - they depend on being able to 
short-circuit.


So just use a more faithful inlining of nlz without trying to improve 
on it.


static int ntz_inlineNlz5(int i) {
i = ~i & (i - 1);
if (i <= 0)
return (i == 0) ? 0 : 32;
int n = 1;
if (i >= 1 << 16) { n += 16; i >>>= 16; }
if (i >= 1 <<  8) { n +=  8; i >>>= 8; }
if (i >= 1 <<  4) { n +=  4; i >>>= 4; }
if (i >= 1 <<  2) { n +=  2; i >>>= 2; }
return n + (i >>> 1);
}

But it's hard to resist the urge to optimize out a branch:

static int ntz_inlineNlz6(int i) {
i = ~i & (i - 1);
if (i <= 0) return i & 32;
int n = 1;
if (i >= 1 << 16) { n += 16; i >>>= 16; }
if (i >= 1 <<  8) { n +=  8; i >>>= 8; }
if (i >= 1 <<  4) { n +=  4; i >>>= 4; }
if (i >= 1 <<  2) { n +=  2; i >>>= 2; }
return n + (i >>> 1);
}





--
With kind regards,
Ivan Gerasimov



Re: RFR 8209171 : Simplify Java implementation of Integer/Long.numberOfTrailingZeros()

2018-08-13 Thread Martin Buchholz
On Mon, Aug 13, 2018 at 10:10 AM, Ivan Gerasimov 
wrote:

> Hi Martin!
>
> Good point about the command line flags, thanks!
>
> These variants are close to numberOfTrailingZeros_07 that I've already
> tested, though you did better by saving one arithmetic operation at the
> return line!
>
>
Right. At this point

return n + i - (i >> 1);


i is either 1 or 3, and

i - (i >> 1)

is equivalent to

1 + (i >>> 1)

so just initialize n to 1 instead.  Very much like
Integer.numberOfLeadingZeros does.


Re: RFR 8209171 : Simplify Java implementation of Integer/Long.numberOfTrailingZeros()

2018-08-13 Thread Ivan Gerasimov

Hi Martin!

Good point about the command line flags, thanks!

These variants are close to numberOfTrailingZeros_07 that I've already 
tested, though you did better by saving one arithmetic operation at the 
return line!


I'll rerun the benchmarks.

With kind regards,

Ivan


On 8/13/18 7:56 AM, Martin Buchholz wrote:

The number of plausible variants is astonishing!

---

Your use of -client and -server is outdated, which explains why you 
get the same results for both (-client is ignored).


I'm not sure what's blessed by hotspot team, but for C1 I 
use -XX:+TieredCompilation -XX:TieredStopAtLevel=1 and for C2 I 
use -XX:-TieredCompilation -server


---

Now I understand the advantage of using ~i & (i - 1): the subsequent 
zero check is a short-circuit for all odd numbers, better than i & -i, 
which explains your results - they depend on being able to short-circuit.


So just use a more faithful inlining of nlz without trying to improve 
on it.


static int ntz_inlineNlz5(int i) {
i = ~i & (i - 1);
if (i <= 0)
return (i == 0) ? 0 : 32;
int n = 1;
if (i >= 1 << 16) { n += 16; i >>>= 16; }
if (i >= 1 <<  8) { n +=  8; i >>>=  8; }
if (i >= 1 <<  4) { n +=  4; i >>>=  4; }
if (i >= 1 <<  2) { n +=  2; i >>>=  2; }
return n + (i >>> 1);
}

But it's hard to resist the urge to optimize out a branch:

static int ntz_inlineNlz6(int i) {
i = ~i & (i - 1);
if (i <= 0) return i & 32;
int n = 1;
if (i >= 1 << 16) { n += 16; i >>>= 16; }
if (i >= 1 <<  8) { n +=  8; i >>>=  8; }
if (i >= 1 <<  4) { n +=  4; i >>>=  4; }
if (i >= 1 <<  2) { n +=  2; i >>>=  2; }
return n + (i >>> 1);
}



--
With kind regards,
Ivan Gerasimov



Re: RFR 8209171 : Simplify Java implementation of Integer/Long.numberOfTrailingZeros()

2018-08-13 Thread Martin Buchholz
The number of plausible variants is astonishing!

---

Your use of -client and -server is outdated, which explains why you get the
same results for both (-client is ignored).

I'm not sure what's blessed by hotspot team, but for C1 I
use -XX:+TieredCompilation  -XX:TieredStopAtLevel=1 and for C2 I
use -XX:-TieredCompilation -server

---

Now I understand the advantage of using ~i & (i - 1): the subsequent zero
check is a short-circuit for all odd numbers, better than i & -i, which
explains your results - they depend on being able to short-circuit.

So just use a more faithful inlining of nlz without trying to improve on it.

static int ntz_inlineNlz5(int i) {
i = ~i & (i - 1);
if (i <= 0)
return (i == 0) ? 0 : 32;
int n = 1;
if (i >= 1 << 16) { n += 16; i >>>= 16; }
if (i >= 1 <<  8) { n +=  8; i >>>=  8; }
if (i >= 1 <<  4) { n +=  4; i >>>=  4; }
if (i >= 1 <<  2) { n +=  2; i >>>=  2; }
return n + (i >>> 1);
}

But it's hard to resist the urge to optimize out a branch:

static int ntz_inlineNlz6(int i) {
i = ~i & (i - 1);
if (i <= 0) return i & 32;
int n = 1;
if (i >= 1 << 16) { n += 16; i >>>= 16; }
if (i >= 1 <<  8) { n +=  8; i >>>=  8; }
if (i >= 1 <<  4) { n +=  4; i >>>=  4; }
if (i >= 1 <<  2) { n +=  2; i >>>=  2; }
return n + (i >>> 1);
}


Re: RFR 8209171 : Simplify Java implementation of Integer/Long.numberOfTrailingZeros()

2018-08-13 Thread Ivan Gerasimov

On 8/12/18 10:57 AM, Martin Buchholz wrote:
If delegating to nlz is the winner so far, we should be able to do at 
least as well by inlining nlz into ntz and then looking for more 
optimizations.  Following this strategy leads naturally to


static int ntz_inlineNlz2(int i) {
i &= -i;
if (i <= 0) return 32 - (i >>> 31);
int n = 0;
if (i >= 1 << 16) { n += 16; i >>>= 16; }
if (i >= 1 <<  8) { n +=  8; i >>>=  8; }
if (i >= 1 <<  4) { n +=  4; i >>>=  4; }
if (i >= 1 <<  2) { n +=  2; i >>>=  2; }
return n + (i >>> 1);
}

Right.  The variant numberOfLeadingZeros_01() from the benchmark is very 
close to this, though you've got better handling of (i <= 0) case, so I 
added it as numberOfLeadingZeros_01a() with a minor modification.



which should save a branch and so should be a benchmark winner.

A reason why delegating to nlz may have beat my previous attempt is 
because direct comparison with a constant is an operation the hardware 
tries hard to optimize, e.g. branch predict.


Most of the comparisons should be false in practice because "most ints 
are small".


I also see that our nlz implementation favors small integers, which 
helps with ntz.


It's possible that benchmarking may cause branches to be very highly 
predictable.  It should be more real-world for each benchmark method 
to see a variety of inputs, perhaps in an array.


Okay.  Now I tried to combine calculating of several results in one 
iteration of benchmark to make it harder to predict branches :)
Surprisingly, this made the variant 05 (reducing to nlz) the leader, for 
which I don't have a good explanation, as it does strictly more 
calculations than 01 or 01a even after inlining.


Anyways, please find the updated benchmark here:
http://cr.openjdk.java.net/~igerasim/8209171/03/bench/int/MyBenchmark.java

The graphs for -client and -server are here:
http://cr.openjdk.java.net/~igerasim/8209171/03/bench/int/bench-int-03-client.png
http://cr.openjdk.java.net/~igerasim/8209171/03/bench/int/bench-int-03-server.png

It took almost an hour to generate the results, so they seem to be quite 
accurate.


So, I'm still inclined to prefer the variant 05 (which is to reduce ntz 
to nlz)  :)


With kind regards,
Ivan



On Sun, Aug 12, 2018 at 7:22 AM, Ivan Gerasimov 
mailto:ivan.gerasi...@oracle.com>> wrote:


Hi Martin!


On 8/11/18 5:54 PM, Martin Buchholz wrote:

Hi Ivan,

Oh the allure of bit-twiddling!


Yes :)


I'm skeptical that ntz should ever delegate to nlz, and not only
because of the overhead of a wrapper, but because small numbers
are more common, and we can take that into account when
implementing ntz.

I was under impression that the more frequently a function is
called the faster it gets compiled, so all the callers of this
function will benefit.
For example, if numberOfTrailingZeros is reduced to
numberOfLeadingZeros then when the later is compiled while the
former is still not, it will still be running faster than the
variant with independent implementations.


  At least add "1" to the set of numbers to benchmark.

In the last proposed patch, all odd numbers will be processed via
a fast path (because for any odd i, ~i & (i - 1) == 0).
So, I added 1, 16 and 42 -- small numbers with different number of
trailing zeros.

Here's the updated benchmark:
http://cr.openjdk.java.net/~igerasim/8209171/02/bench/int/MyBenchmark.java


(I only executed four implementations to keep the picture clear.)


  Here's my own entry in this race:

static int ntz(int i) {
if (i == 0) return 32;
int n = 0;
if ((i << 16)  == 0) { n += 16; i >>>= 16; }
if ((i & 0xFF) == 0) { n +=  8; i >>>=  8; }
if ((i & 0xF)  == 0) { n +=  4; i >>>=  4; }
if ((i & 0x3)  == 0) { n +=  2; i >>>=  2; }
return n + (~i & 1);
}


Interesting!
You might also avoid inversion at the end, if n is initialized
with 1, and then the last line may be written as return n - (i & 1).

Still this variant appears to be slower in most tried cases.
Here's the graph of the latest benchmark:

http://cr.openjdk.java.net/~igerasim/8209171/02/bench/int/bench-int-02-client.png



http://cr.openjdk.java.net/~igerasim/8209171/02/bench/int/bench-int-02-server.png



The variant from the test01 is the fastest in most cases, but I
would prefer to proceed with the variant from test05:
It's only slightly slower than 01, but contains less bytecode and
helps to warm up numberOfLeadingZeros().


Whatever happens, we ought to check in 

Re: RFR 8209171 : Simplify Java implementation of Integer/Long.numberOfTrailingZeros()

2018-08-12 Thread Martin Buchholz
Here's an example of a microbenchmark that uses multiple random inputs
simulating various typical populations:

https://github.com/google/caliper/blob/master/caliper-examples/src/main/java/examples/Utf8Benchmark.java


Re: RFR 8209171 : Simplify Java implementation of Integer/Long.numberOfTrailingZeros()

2018-08-12 Thread Martin Buchholz
If delegating to nlz is the winner so far, we should be able to do at least
as well by inlining nlz into ntz and then looking for more optimizations.
Following this strategy leads naturally to

static int ntz_inlineNlz2(int i) {
i &= -i;
if (i <= 0) return 32 - (i >>> 31);
int n = 0;
if (i >= 1 << 16) { n += 16; i >>>= 16; }
if (i >= 1 <<  8) { n +=  8; i >>>=  8; }
if (i >= 1 <<  4) { n +=  4; i >>>=  4; }
if (i >= 1 <<  2) { n +=  2; i >>>=  2; }
return n + (i >>> 1);
}

which should save a branch and so should be a benchmark winner.

A reason why delegating to nlz may have beat my previous attempt is because
direct comparison with a constant is an operation the hardware tries hard
to optimize, e.g. branch predict.

Most of the comparisons should be false in practice because "most ints are
small".

I also see that our nlz implementation favors small integers, which helps
with ntz.

It's possible that benchmarking may cause branches to be very highly
predictable.  It should be more real-world for each benchmark method to see
a variety of inputs, perhaps in an array.


On Sun, Aug 12, 2018 at 7:22 AM, Ivan Gerasimov 
wrote:

> Hi Martin!
>
> On 8/11/18 5:54 PM, Martin Buchholz wrote:
>
> Hi Ivan,
>
> Oh the allure of bit-twiddling!
>
> Yes :)
>
> I'm skeptical that ntz should ever delegate to nlz, and not only because
> of the overhead of a wrapper, but because small numbers are more common,
> and we can take that into account when implementing ntz.
>
> I was under impression that the more frequently a function is called the
> faster it gets compiled, so all the callers of this function will benefit.
> For example, if numberOfTrailingZeros is reduced to numberOfLeadingZeros
> then when the later is compiled while the former is still not, it will
> still be running faster than the variant with independent implementations.
>
>   At least add "1" to the set of numbers to benchmark.
>
> In the last proposed patch, all odd numbers will be processed via a fast
> path (because for any odd i, ~i & (i - 1) == 0).
> So, I added 1, 16 and 42 -- small numbers with different number of
> trailing zeros.
>
> Here's the updated benchmark:
> http://cr.openjdk.java.net/~igerasim/8209171/02/bench/int/MyBenchmark.java
> (I only executed four implementations to keep the picture clear.)
>
>   Here's my own entry in this race:
>
> static int ntz(int i) {
> if (i == 0) return 32;
> int n = 0;
> if ((i << 16)  == 0) { n += 16; i >>>= 16; }
> if ((i & 0xFF) == 0) { n +=  8; i >>>=  8; }
> if ((i & 0xF)  == 0) { n +=  4; i >>>=  4; }
> if ((i & 0x3)  == 0) { n +=  2; i >>>=  2; }
> return n + (~i & 1);
> }
>
> Interesting!
> You might also avoid inversion at the end, if n is initialized with 1, and
> then the last line may be written as return n - (i & 1).
>
> Still this variant appears to be slower in most tried cases.
> Here's the graph of the latest benchmark:
> http://cr.openjdk.java.net/~igerasim/8209171/02/bench/int/
> bench-int-02-client.png
> http://cr.openjdk.java.net/~igerasim/8209171/02/bench/int/
> bench-int-02-server.png
>
> The variant from the test01 is the fastest in most cases, but I would
> prefer to proceed with the variant from test05:
> It's only slightly slower than 01, but contains less bytecode and helps to
> warm up numberOfLeadingZeros().
>
> Whatever happens, we ought to check in the microbenchmarks somewhere.  It
> looks like the new jmh-jdk-microbenchmarks is the place.
> I also suspect that tests for these methods could be improved (but there
> are existing hotspot tests).
>
> To make sure the new code is correct I ran an exhaustive loop from
> Integer.MIN_VALUE to MAX_VALUE inclusive and checked all the tested
> variants of implementation.
>
> nlz seems harder than ntz in the sense that for nlz "small ints" and
> random bits may have different optimal implementations.
>
>
> On Fri, Aug 10, 2018 at 7:03 PM, Ivan Gerasimov  > wrote:
>
>> Thanks Martin!
>>
>> On 8/9/18 5:42 PM, Martin Buchholz wrote:
>>
>>
>>
>> On Thu, Aug 9, 2018 at 5:27 PM, Ivan Gerasimov > > wrote:
>>
>>> I did not use the intrinsified variants of numberOfLeadingZeros in the
>>> benchmark.
>>>
>>
>> Oops! Should have looked more closely!
>>
>> Did you know about
>> http://www.hackersdelight.org/hdcodetxt/ntz.c.txt
>>
>>
>> Ah, right, ntz1() is even better because it has less branches.  How could
>> I miss that?
>>
>> Here's the updated webrev and benchmarks:
>>
>> http://cr.openjdk.java.net/~igerasim/8209171/01/webrev/
>> http://cr.openjdk.java.net/~igerasim/8209171/01/bench/int/My
>> Benchmark.java
>> http://cr.openjdk.java.net/~igerasim/8209171/01/bench/long/
>> MyBenchmark.java
>>
>> --
>> With kind regards,
>> Ivan Gerasimov
>>
>>
>
> --
> With kind regards,
> Ivan Gerasimov
>
>


Re: RFR 8209171 : Simplify Java implementation of Integer/Long.numberOfTrailingZeros()

2018-08-12 Thread Ivan Gerasimov

Hi Martin!


On 8/11/18 5:54 PM, Martin Buchholz wrote:

Hi Ivan,

Oh the allure of bit-twiddling!


Yes :)

I'm skeptical that ntz should ever delegate to nlz, and not only 
because of the overhead of a wrapper, but because small numbers are 
more common, and we can take that into account when implementing ntz.
I was under impression that the more frequently a function is called the 
faster it gets compiled, so all the callers of this function will benefit.
For example, if numberOfTrailingZeros is reduced to numberOfLeadingZeros 
then when the later is compiled while the former is still not, it will 
still be running faster than the variant with independent implementations.



  At least add "1" to the set of numbers to benchmark.
In the last proposed patch, all odd numbers will be processed via a fast 
path (because for any odd i, ~i & (i - 1) == 0).
So, I added 1, 16 and 42 -- small numbers with different number of 
trailing zeros.


Here's the updated benchmark:
http://cr.openjdk.java.net/~igerasim/8209171/02/bench/int/MyBenchmark.java
(I only executed four implementations to keep the picture clear.)


  Here's my own entry in this race:

static int ntz(int i) {
if (i == 0) return 32;
int n = 0;
if ((i << 16)  == 0) { n += 16; i >>>= 16; }
if ((i & 0xFF) == 0) { n +=  8; i >>>=  8; }
if ((i & 0xF)  == 0) { n +=  4; i >>>=  4; }
if ((i & 0x3)  == 0) { n +=  2; i >>>=  2; }
return n + (~i & 1);
}


Interesting!
You might also avoid inversion at the end, if n is initialized with 1, 
and then the last line may be written as return n - (i & 1).


Still this variant appears to be slower in most tried cases.
Here's the graph of the latest benchmark:
http://cr.openjdk.java.net/~igerasim/8209171/02/bench/int/bench-int-02-client.png
http://cr.openjdk.java.net/~igerasim/8209171/02/bench/int/bench-int-02-server.png

The variant from the test01 is the fastest in most cases, but I would 
prefer to proceed with the variant from test05:
It's only slightly slower than 01, but contains less bytecode and helps 
to warm up numberOfLeadingZeros().


Whatever happens, we ought to check in the microbenchmarks somewhere.  
It looks like the new jmh-jdk-microbenchmarks is the place.
I also suspect that tests for these methods could be improved (but 
there are existing hotspot tests).


To make sure the new code is correct I ran an exhaustive loop from 
Integer.MIN_VALUE to MAX_VALUE inclusive and checked all the tested 
variants of implementation.


nlz seems harder than ntz in the sense that for nlz "small ints" and 
random bits may have different optimal implementations.



On Fri, Aug 10, 2018 at 7:03 PM, Ivan Gerasimov 
mailto:ivan.gerasi...@oracle.com>> wrote:


Thanks Martin!


On 8/9/18 5:42 PM, Martin Buchholz wrote:



On Thu, Aug 9, 2018 at 5:27 PM, Ivan Gerasimov
mailto:ivan.gerasi...@oracle.com>> wrote:

I did not use the intrinsified variants of
numberOfLeadingZeros in the benchmark.


Oops! Should have looked more closely!
Did you know about
http://www.hackersdelight.org/hdcodetxt/ntz.c.txt



Ah, right, ntz1() is even better because it has less branches. 
How could I miss that?


Here's the updated webrev and benchmarks:

http://cr.openjdk.java.net/~igerasim/8209171/01/webrev/

http://cr.openjdk.java.net/~igerasim/8209171/01/bench/int/MyBenchmark.java


http://cr.openjdk.java.net/~igerasim/8209171/01/bench/long/MyBenchmark.java



-- 
With kind regards,

Ivan Gerasimov




--
With kind regards,
Ivan Gerasimov



Re: RFR 8209171 : Simplify Java implementation of Integer/Long.numberOfTrailingZeros()

2018-08-11 Thread Martin Buchholz
Hi Ivan,

Oh the allure of bit-twiddling!

I'm skeptical that ntz should ever delegate to nlz, and not only because of
the overhead of a wrapper, but because small numbers are more common, and
we can take that into account when implementing ntz.  At least add "1" to
the set of numbers to benchmark.  Here's my own entry in this race:

static int ntz(int i) {
if (i == 0) return 32;
int n = 0;
if ((i << 16)  == 0) { n += 16; i >>>= 16; }
if ((i & 0xFF) == 0) { n +=  8; i >>>=  8; }
if ((i & 0xF)  == 0) { n +=  4; i >>>=  4; }
if ((i & 0x3)  == 0) { n +=  2; i >>>=  2; }
return n + (~i & 1);
}

Whatever happens, we ought to check in the microbenchmarks somewhere.  It
looks like the new jmh-jdk-microbenchmarks is the place.
I also suspect that tests for these methods could be improved (but there
are existing hotspot tests).

nlz seems harder than ntz in the sense that for nlz "small ints" and random
bits may have different optimal implementations.


On Fri, Aug 10, 2018 at 7:03 PM, Ivan Gerasimov 
wrote:

> Thanks Martin!
>
> On 8/9/18 5:42 PM, Martin Buchholz wrote:
>
>
>
> On Thu, Aug 9, 2018 at 5:27 PM, Ivan Gerasimov 
> wrote:
>
>> I did not use the intrinsified variants of numberOfLeadingZeros in the
>> benchmark.
>>
>
> Oops! Should have looked more closely!
>
> Did you know about
> http://www.hackersdelight.org/hdcodetxt/ntz.c.txt
>
>
> Ah, right, ntz1() is even better because it has less branches.  How could
> I miss that?
>
> Here's the updated webrev and benchmarks:
>
> http://cr.openjdk.java.net/~igerasim/8209171/01/webrev/
> http://cr.openjdk.java.net/~igerasim/8209171/01/bench/int/MyBenchmark.java
> http://cr.openjdk.java.net/~igerasim/8209171/01/bench/
> long/MyBenchmark.java
>
> --
> With kind regards,
> Ivan Gerasimov
>
>


Re: RFR 8209171 : Simplify Java implementation of Integer/Long.numberOfTrailingZeros()

2018-08-10 Thread Ivan Gerasimov

Thanks Martin!


On 8/9/18 5:42 PM, Martin Buchholz wrote:



On Thu, Aug 9, 2018 at 5:27 PM, Ivan Gerasimov 
mailto:ivan.gerasi...@oracle.com>> wrote:


I did not use the intrinsified variants of numberOfLeadingZeros in
the benchmark.


Oops! Should have looked more closely!
Did you know about
http://www.hackersdelight.org/hdcodetxt/ntz.c.txt


Ah, right, ntz1() is even better because it has less branches.  How 
could I miss that?


Here's the updated webrev and benchmarks:

http://cr.openjdk.java.net/~igerasim/8209171/01/webrev/
http://cr.openjdk.java.net/~igerasim/8209171/01/bench/int/MyBenchmark.java
http://cr.openjdk.java.net/~igerasim/8209171/01/bench/long/MyBenchmark.java

--
With kind regards,
Ivan Gerasimov



Re: RFR 8209171 : Simplify Java implementation of Integer/Long.numberOfTrailingZeros()

2018-08-09 Thread Martin Buchholz
On Thu, Aug 9, 2018 at 5:27 PM, Ivan Gerasimov 
wrote:

> I did not use the intrinsified variants of numberOfLeadingZeros in the
> benchmark.
>

Oops! Should have looked more closely!

Did you know about
http://www.hackersdelight.org/hdcodetxt/ntz.c.txt


Re: RFR 8209171 : Simplify Java implementation of Integer/Long.numberOfTrailingZeros()

2018-08-09 Thread Ivan Gerasimov

Hi Martin!

Thanks for taking a look!


On 8/9/18 5:13 PM, Martin Buchholz wrote:



On Thu, Aug 9, 2018 at 4:15 PM, Ivan Gerasimov 
mailto:ivan.gerasi...@oracle.com>> wrote:


Hello!

Integer.numberOfTrailingZeros() and Long.numberOfTrailingZeros()
are intrinsified by Hotspot, so the Java implementation of these
is not too important.
However, they still can be improved by a tiny bit :)

Could you please help review the fix?

Bug: https://bugs.openjdk.java.net/browse/JDK-8209171

Webrev: http://cr.openjdk.java.net/~igerasim/8209171/00/webrev/

Benchmark for Integer:
http://cr.openjdk.java.net/~igerasim/8209171/00/bench/int/MyBenchmark.java


Benchmark for Long:
http://cr.openjdk.java.net/~igerasim/8209171/00/bench/long/MyBenchmark.java



The resulting code is both smaller and faster.  It may also help
to warm up Integer.numberOfLeadingZeros() quicker.

On average, the new Integer.numberOfTrailingZeros() has got +11%
to performance for -client and +1% for -server.
Long.numberOfTrailingZeros() is faster on 17% for -client and +20%
for -server.


It seems to me the benchmark is not entirely fair to the old java 
implementation, since the new one gets the benefit of the 
intrinsification of  numberOfLeadingZeros.  A fairer comparison would 
use turn off intrinsification of both.
I did not use the intrinsified variants of numberOfLeadingZeros in the 
benchmark.


In the first (Integer) benchmark I copied/pasted the Java implementation 
of both Integer.numberOfLeadingZeros and Integer.bitCount.
In the second (Long) one, Long.numberOfLeadingZeros and Long.bitCount 
were also copied into the test with their original names, and Integer 
versions of the functions were copied as Integer_numberOfLeadingZeros 
and Integer_numberOfTrailingZeros.


The later one was not the original function 
Integer.numberOfTrailingZeros, but the new one, so the numbers of the 
second benchmark reflect changes in both classes: Integer and Long.


I guess benchmarking on 32-bit platforms is  no longer relevant, given 
that they are all legacy now.

I ran them on x64 (Intel Core i7 to be precise).

--
With kind regards,
Ivan Gerasimov



Re: RFR 8209171 : Simplify Java implementation of Integer/Long.numberOfTrailingZeros()

2018-08-09 Thread Martin Buchholz
On Thu, Aug 9, 2018 at 4:15 PM, Ivan Gerasimov 
wrote:

> Hello!
>
> Integer.numberOfTrailingZeros() and Long.numberOfTrailingZeros() are
> intrinsified by Hotspot, so the Java implementation of these is not too
> important.
> However, they still can be improved by a tiny bit :)
>
> Could you please help review the fix?
>
> Bug: https://bugs.openjdk.java.net/browse/JDK-8209171
> Webrev: http://cr.openjdk.java.net/~igerasim/8209171/00/webrev/
> Benchmark for Integer: http://cr.openjdk.java.net/~ig
> erasim/8209171/00/bench/int/MyBenchmark.java
> Benchmark for Long: http://cr.openjdk.java.net/~ig
> erasim/8209171/00/bench/long/MyBenchmark.java
>
> The resulting code is both smaller and faster.  It may also help to warm
> up Integer.numberOfLeadingZeros() quicker.
>
> On average, the new Integer.numberOfTrailingZeros() has got +11% to
> performance for -client and +1% for -server.
> Long.numberOfTrailingZeros() is faster on 17% for -client and +20% for
> -server.


It seems to me the benchmark is not entirely fair to the old java
implementation, since the new one gets the benefit of the intrinsification
of  numberOfLeadingZeros.  A fairer comparison would use turn off
intrinsification of both.

I guess benchmarking on 32-bit platforms is  no longer relevant, given that
they are all legacy now.


Re: RFR 8209171 : Simplify Java implementation of Integer/Long.numberOfTrailingZeros()

2018-08-09 Thread Ivan Gerasimov

Hi Joe!


On 8/9/18 4:20 PM, Joseph D. Darcy wrote:

Hi Ivan,

On which platforms were the benchmark numbers collected?


Ah, yes, should have mentioned that.
macOS 10.13.5 High Sierra, processor 3.1 GHz Intel Core i7

With kind regards,
Ivan


Thanks,

-Joe

On 8/9/2018 4:15 PM, Ivan Gerasimov wrote:

Hello!

Integer.numberOfTrailingZeros() and Long.numberOfTrailingZeros() are 
intrinsified by Hotspot, so the Java implementation of these is not 
too important.

However, they still can be improved by a tiny bit :)

Could you please help review the fix?

Bug: https://bugs.openjdk.java.net/browse/JDK-8209171
Webrev: http://cr.openjdk.java.net/~igerasim/8209171/00/webrev/
Benchmark for Integer: 
http://cr.openjdk.java.net/~igerasim/8209171/00/bench/int/MyBenchmark.java
Benchmark for Long: 
http://cr.openjdk.java.net/~igerasim/8209171/00/bench/long/MyBenchmark.java


The resulting code is both smaller and faster.  It may also help to 
warm up Integer.numberOfLeadingZeros() quicker.


On average, the new Integer.numberOfTrailingZeros() has got +11% to 
performance for -client and +1% for -server.
Long.numberOfTrailingZeros() is faster on 17% for -client and +20% 
for -server.







--
With kind regards,
Ivan Gerasimov



Re: RFR 8209171 : Simplify Java implementation of Integer/Long.numberOfTrailingZeros()

2018-08-09 Thread Joseph D. Darcy

Hi Ivan,

On which platforms were the benchmark numbers collected?

Thanks,

-Joe

On 8/9/2018 4:15 PM, Ivan Gerasimov wrote:

Hello!

Integer.numberOfTrailingZeros() and Long.numberOfTrailingZeros() are 
intrinsified by Hotspot, so the Java implementation of these is not 
too important.

However, they still can be improved by a tiny bit :)

Could you please help review the fix?

Bug: https://bugs.openjdk.java.net/browse/JDK-8209171
Webrev: http://cr.openjdk.java.net/~igerasim/8209171/00/webrev/
Benchmark for Integer: 
http://cr.openjdk.java.net/~igerasim/8209171/00/bench/int/MyBenchmark.java
Benchmark for Long: 
http://cr.openjdk.java.net/~igerasim/8209171/00/bench/long/MyBenchmark.java


The resulting code is both smaller and faster.  It may also help to 
warm up Integer.numberOfLeadingZeros() quicker.


On average, the new Integer.numberOfTrailingZeros() has got +11% to 
performance for -client and +1% for -server.
Long.numberOfTrailingZeros() is faster on 17% for -client and +20% for 
-server.