What you should do is run the thing through a profiler. That will
tell you how many times poll() is actually being called. My guess is
that contention on the shared counter is prolonging the time it takes
to increment count: threads 1-3 all load count from memory (i.e. 0),
all add 1 to it in a register, and then all write it back out to
memory, so after the three threads have "written" to count, it still
only contains 1. This seems to be the case - tested with Linux JDK
1.6 on an Intel Q6600 (quad-core):
$ java -server Trouble 1
time: 1995
fired: 3906250
time: 1966
fired: 3906250
time: 930
fired: 3906250
time: 916
fired: 3906250
time: 894
fired: 3906250
$ java -server Trouble 2
time: 734
fired: 3654450
time: 769
fired: 3802532
time: 821
fired: 3755766
time: 829
fired: 3775890
time: 800
fired: 3779483
$ java -server Trouble 3
time: 1836
fired: 3934611
time: 4518
fired: 3566843
time: 3102
fired: 3618666
time: 2886
fired: 3648195
time: 3844
fired: 3667866
On a side note, it's way faster to just optimize the algorithm away
(at least in your benchmark):
int numCalls = total / 0x100;  // (count & 0xFF) == 0 fires once every 256 increments
i += total;                    // bump the shared counter once instead of 'total' times
for (int k = 0; k < numCalls; ++k) doNothing();
Here's the same thing using an AtomicInteger (i is an AtomicInteger here):
int start = i.getAndAdd(total);   // reserve 'total' increments in one atomic op
for (int k = 0; k < total; ++k)
{
    if (((start + k) & 0xFF) == 0) doNothing();   // fire on every 256th counter value
}
However, none of this is related to the actual code, just the
benchmark. The way to optimize the actual code would be to push count
into the ThreadContext and, instead of doing a bitwise AND on a shared
counter, reset a per-thread counter when it reaches roughly
256 / numThreads. Or you can keep the mask if you want every thread to
poll once every 256 of its own calls, rather than one thread getting
polled once every 256 calls overall.
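Something along these lines - just a sketch, assuming ThreadContext is
under your control; pollCount is a name I'm making up for illustration:

public class ThreadContext {
    // Per-thread counter: each thread has its own ThreadContext, so there is
    // no shared state and no contention on the hot path.
    private int pollCount = 0;

    public void pollEvents() {
        // Fires once every 256 calls made by *this* thread.
        // Alternative: reset and compare against roughly 256 / numThreads
        // if you want the overall poll rate to stay the same as before.
        if ((++pollCount & 0xFF) == 0) poll();
    }

    private void poll() {
        // existing poll logic (locking etc.) goes here
    }
}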
Other ways to speed it up: don't count at all - call pollEvents at
longer intervals, and have pollEvents call doNothing (or poll)
directly.
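In code, that would look something like this (just a sketch; the
callers become responsible for spacing the calls out):

// No counter at all: pollEvents does the work unconditionally, and it becomes
// the caller's job to invoke it at suitably long intervals.
public void pollEvents(ThreadContext context) {
    context.poll();   // in the benchmark this would call doNothing() directly
}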
Summary: The problem is that multiple threads are overwriting the same
global variable, so it takes longer (and more attempts) to actually
increment it.
Think of it this way:
count = 1
Threads A, B, and C are running, each on its own core.
Each core loads count (with a value of 1) into a register.
Each core increments the value in its register to 2.
Each core then writes the value in its register back out to count. So
after all 3 threads have incremented count, its value is 2 instead of
4 - two of the increments were lost.
Add in the cost of bouncing the counter's cache line between cores and
of switching between threads, and you can see why you get a slowdown.
Adding synchronization will help, provided you do it intelligently -
synchronize too often, and the synchronization penalty takes over.
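If you do want to keep a single shared counter, java.util.concurrent's
AtomicInteger is the usual intelligent way to do it - a minimal sketch
mirroring the original pollEvents (the class name is mine):

import java.util.concurrent.atomic.AtomicInteger;

public class Events {
    // One shared counter; getAndIncrement() is an atomic read-modify-write,
    // so no increments are lost - but every thread still contends for the
    // same cache line, so the per-thread counter above is the faster fix.
    private static final AtomicInteger count = new AtomicInteger(0);

    public void pollEvents(ThreadContext context) {
        if ((count.getAndIncrement() & 0xFF) == 0) context.poll();
    }
}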
On Apr 2, 1:48 am, Charles Oliver Nutter <[EMAIL PROTECTED]>
wrote:
> I ran into a very strange effect when some Sun folks tried to benchmark
> JRuby's multi-thread scalability. In short, adding more threads actually
> caused the benchmarks to take longer.
>
> The source of the problem (at least the source that, when fixed, allowed
> normal thread scaling), was an increment, mask, and test of a static int
> field. The code in question looked like this:
>
> private static int count = 0;
>
> public void pollEvents(ThreadContext context) {
>     if ((count++ & 0xFF) == 0) context.poll();
> }
>
> So the basic idea was that this would call poll() every 256 hits,
> incrementing a counter all the while. My first attempt to improve
> performance was to comment out the body of poll() in case it was causing
> a threading bottleneck (it does some locking and such), but that had no
> effect. Then, as a total shot in the dark, I commented out the entire
> line above. Thread scaling went to normal.
>
> So I'm rather confused here. Is a ++ operation on a static int doing
> some kind of atomic update that causes multiple threads to contend? I
> never would have expected this, so I wrote up a small Java benchmark:
>
> http://pastie.org/173993
>
> The benchmark does basically the same thing, with a single main counter
> and another "fired" counter to prevent hotspot from optimizing things
> completely away. I've been running this on a dual-core MacBook Pro with
> both Apple's Java 5 and the soylatte Java 6 release. The results are
> very confusing:
>
> First on Apple's Java 5
>
> ~/NetBeansProjects/jruby ➔ java -server Trouble 1
> time: 3924
> fired: 3906250
> time: 3945
> fired: 3906250
> time: 1841
> fired: 3906250
> time: 1882
> fired: 3906250
> time: 1896
> fired: 3906250
> ~/NetBeansProjects/jruby ➔ java -server Trouble 2
> time: 3243
> fired: 4090645
> time: 3245
> fired: 4100505
> time: 1173
> fired: 3906049
> time: 1233
> fired: 3906188
> time: 1173
> fired: 3906134
>
> Normal scaling here...1 thread on my system uses about 60-65% CPU, so
> the extra thread uses up the remaining 35-40% and the numbers show it.
> Then there's soylatte Java 6:
>
> ~/NetBeansProjects/jruby ➔ java -server Trouble 1
> time: 1772
> fired: 3906250
> time: 1973
> fired: 3906250
> time: 2748
> fired: 3906250
> time: 2114
> fired: 3906250
> time: 2294
> fired: 3906250
> ~/NetBeansProjects/jruby ➔ java -server Trouble 2
> time: 3402
> fired: 3848648
> time: 3805
> fired: 3885471
> time: 4145
> fired: 3866850
> time: 4140
> fired: 3839130
> time: 3658
> fired: 3880202
>
> Don't compare the times directly, since these are two pretty different
> codebases and they each have different general performance
> characteristics. Instead pay attention to the trend...the soylatte Java
> 6 run with two threads is significantly slower than the run with a
> single thread. This mirrors the results with JRuby when there was a
> single static counter being incremented.
>
> So what's up here?
>
> - Charlie