Hi Alan, David, Nils,

I just want to clear something up regarding the PerfCounter implementation.

Access to the 64bit value in native memory goes through a direct buffer, which uses normal reads/writes (non-volatile, Unsafe.[get|put]Long). So on processors that don't support atomic 64bit loads/stores, each access results in two separate 32bit load/store accesses, right?

The PerfCounter methods that access the 64bit value are synchronized, using PerfCounter instance as a lock. But how is this 64bit value accessed for example in the jstat utility? Is it possible that jstat can see one half of the 64bit value before the update and the other half after the update?
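
To make the half-half read concrete, here is a minimal single-threaded sketch that simulates it. It assumes (hypothetically) that the writer stores the low 32 bits first and that the observer reads between the two stores; the class and method names are made up for illustration and are not part of PerfCounter or jstat.

```java
final class TornReadSketch {
    // Reassemble a 64-bit value from two 32-bit halves.
    static long combine(int high, int low) {
        return ((long) high << 32) | (low & 0xFFFFFFFFL);
    }

    // What an observer would see if it read after the low half of the
    // update was stored but before the high half was stored.
    // Assumption: the writer stores the low half first.
    static long tornValue(long before, long after) {
        int lowAfter = (int) after;               // already updated
        int highBefore = (int) (before >>> 32);   // not yet updated
        return combine(highBefore, lowAfter);
    }

    public static void main(String[] args) {
        long before = 0x00000000FFFFFFFFL;
        long after = before + 1;                  // carries into the high half
        long torn = tornValue(before, after);
        System.out.println("before = " + before);
        System.out.println("after  = " + after);
        System.out.println("torn   = " + torn);   // neither before nor after
    }
}
```

The torn value only differs from both the old and the new value when an update carries into (or borrows from) the high half, so for a slowly growing counter the window is small, but it is not zero.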

If this is true and it is not that important, then instead of a synchronized update of 64bit counter, a 32bit CAS could be used, optionally (rarely) followed by a second 32bit CAS, like for example:

http://dl.dropbox.com/u/101777488/jdk8-tl/PerfCounter/webrev.01/index.html

I tried this on ARM v6 and it works much better than synchronized access, but I don't know if it's acceptable. It guarantees eventual correctness of summed value if the only operation performed is add() (no set() intermingled) and has the same possibility of incorrect half-half reads by observers as current PerfCounter has for unsynchronized observers.
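
For readers without access to the webrev, the idea can be sketched roughly as follows. This is not the code from the webrev: it uses two AtomicInteger fields as stand-ins for the two 32bit halves in native memory, and the class name is hypothetical. The low half is updated with a CAS loop, and the high half is touched only when there is a carry (or when the added value itself has a non-zero high part).

```java
import java.util.concurrent.atomic.AtomicInteger;

final class SplitCounterSketch {
    // Stand-ins for the two 32-bit halves of the 64-bit counter.
    private final AtomicInteger low = new AtomicInteger();
    private final AtomicInteger high = new AtomicInteger();

    void add(long value) {
        int addLow = (int) value;
        int addHigh = (int) (value >>> 32);

        // First CAS loop: update the low half.
        int oldLow, newLow;
        do {
            oldLow = low.get();
            newLow = oldLow + addLow;
        } while (!low.compareAndSet(oldLow, newLow));

        // Unsigned wrap-around of the low half means a carry into the high half.
        int carry = Integer.compareUnsigned(newLow, oldLow) < 0 ? 1 : 0;
        int highDelta = addHigh + carry;

        // Second CAS loop: only needed when the high half actually changes.
        if (highDelta != 0) {
            int oldHigh;
            do {
                oldHigh = high.get();
            } while (!high.compareAndSet(oldHigh, oldHigh + highDelta));
        }
    }

    // Not atomic across the two halves: an observer can still see a
    // half-updated value while an add() is between its two CAS steps.
    long get() {
        return ((long) high.get() << 32) | (low.get() & 0xFFFFFFFFL);
    }
}
```

Because every add() eventually applies both its low-half delta and its carry, the sum converges to the correct value once all adders have finished, which is the "eventual correctness" meant above. A set() in between would break this, since it cannot replace both halves in one step.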

Here's the comparison of the unpatched/patched PerfCounter.increment() micro-benchmark on a single-core ARM v6 (Raspberry Pi):

*** Original PerfCounter, ARM v6

#
# PerfCounter_increment: run duration:  5,000 ms, #of logical CPUS: 1
#
           1 threads, Tavg =    269.34 ns/op (σ =     0.00 ns/op) [   269.34]
           2 threads, Tavg =  7,170.48 ns/op (σ =   410.77 ns/op) [ 6,783.73, 7,603.95]
           3 threads, Tavg = 12,034.82 ns/op (σ =   418.99 ns/op) [11,792.33, 11,714.67, 12,639.26]
           4 threads, Tavg = 16,029.76 ns/op (σ = 1,411.44 ns/op) [15,592.04, 18,511.52, 15,642.52, 14,818.16]


*** Patched PerfCounter, ARM v6

#
# PerfCounter_increment: run duration:  5,000 ms, #of logical CPUS: 1
#
           1 threads, Tavg =    166.21 ns/op (σ =   0.00 ns/op) [   166.21]
           2 threads, Tavg =    332.58 ns/op (σ =   0.12 ns/op) [ 332.45, 332.70]
           3 threads, Tavg =    500.30 ns/op (σ =   0.22 ns/op) [ 500.04, 500.29, 500.58]
           4 threads, Tavg =    667.95 ns/op (σ =   2.11 ns/op) [ 665.22, 667.18, 668.40, 671.04]


Regards, Peter


On 02/24/2013 11:31 AM, David Holmes wrote:
On 24/02/2013 6:50 PM, Peter Levart wrote:
Hi David,

I thought it was ok to pass null, but I don't know the "portability"
issues in-depth. The javadoc for Unsafe says:

/"This method refers to a variable by means of two parameters, and so it
provides (in effect) a double-register addressing mode for Java
variables. When the object reference is null, this method uses its
offset as an absolute address. This is similar in operation to methods
such as getInt(long), which provide (in effect) a single-register
addressing mode for non-Java variables. However, because Java variables
may have a different layout in memory from non-Java variables,
programmers should not assume that these two addressing modes are ever
equivalent. Also, programmers should remember that offsets from the
double-register addressing mode cannot be portably confused with longs
used in the single-register addressing mode."/

That is the doc for getXXX but not for getAndAddXXX or compareAndSwapXXX. You can't have null here:

UNSAFE_ENTRY(jboolean, Unsafe_CompareAndSwapLong(JNIEnv *env, jobject unsafe, jobject obj, jlong offset, jlong e, jlong x))
  UnsafeWrapper("Unsafe_CompareAndSwapLong");
  Handle p (THREAD, JNIHandles::resolve(obj));
  jlong* addr = (jlong*)(index_oop_from_field_offset_long(p(), offset));
  if (VM_Version::supports_cx8())
    return (jlong)(Atomic::cmpxchg(x, addr, e)) == e;
  else {
    jboolean success = false;
    ObjectLocker ol(p, THREAD);
    if (*addr == e) { *addr = x; success = true; }
    return success;
  }
UNSAFE_END

David
-----


Does anybody know the in-depth interpretation of the above? Is it only
the particular Java/native type differences (for example, endianness of
variables) that these two addressing modes might interpret differently
or something else too?

Regards, Peter


On 02/24/2013 12:39 AM, David Holmes wrote:
Peter,

In your use of Unsafe you pass "null" as the object. I'm pretty
certain you can't pass null here. Unsafe operates on fields or array
elements.

David

On 24/02/2013 5:39 AM, Peter Levart wrote:
Hi Nils,

If the counters are updated frequently from multiple threads, there
might be contention/scalability issues. Instead of synchronization on
updates, you might consider using atomic updates provided by
sun.misc.Unsafe, like for example:


Index: jdk/src/share/classes/sun/misc/PerfCounter.java
===================================================================
--- jdk/src/share/classes/sun/misc/PerfCounter.java
+++ jdk/src/share/classes/sun/misc/PerfCounter.java
@@ -25,6 +25,8 @@

  package sun.misc;

+import sun.nio.ch.DirectBuffer;
+
  import java.nio.ByteBuffer;
  import java.nio.ByteOrder;
  import java.nio.LongBuffer;
@@ -50,6 +52,8 @@
  public class PerfCounter {
      private static final Perf perf =
          AccessController.doPrivileged(new Perf.GetPerfAction());
+    private static final Unsafe unsafe =
+        Unsafe.getUnsafe();

      // Must match values defined in hotspot/src/share/vm/runtime/perfdata.hpp
      private final static int V_Constant  = 1;
@@ -59,12 +63,14 @@

      private final String name;
      private final LongBuffer lb;
+    private final DirectBuffer db;

      private PerfCounter(String name, int type) {
          this.name = name;
          ByteBuffer bb = perf.createLong(name, U_None, type, 0L);
          bb.order(ByteOrder.nativeOrder());
          this.lb = bb.asLongBuffer();
+        this.db = bb instanceof DirectBuffer ? (DirectBuffer) bb : null;
      }

      static PerfCounter newPerfCounter(String name) {
@@ -79,23 +85,44 @@
      /**
       * Returns the current value of the perf counter.
       */
-    public synchronized long get() {
+    public long get() {
+        if (db != null) {
+            return unsafe.getLongVolatile(null, db.address());
+        }
+        else {
+            synchronized (this) {
-        return lb.get(0);
-    }
+                return lb.get(0);
+            }
+        }
+    }

      /**
       * Sets the value of the perf counter to the given newValue.
       */
-    public synchronized void set(long newValue) {
+    public void set(long newValue) {
+        if (db != null) {
+            unsafe.putOrderedLong(null, db.address(), newValue);
+        }
+        else {
+            synchronized (this) {
-        lb.put(0, newValue);
-    }
+                lb.put(0, newValue);
+            }
+        }
+    }

      /**
       * Adds the given value to the perf counter.
       */
-    public synchronized void add(long value) {
-        long res = get() + value;
+    public void add(long value) {
+        if (db != null) {
+            unsafe.getAndAddLong(null, db.address(), value);
+        }
+        else {
+            synchronized (this) {
+                long res = lb.get(0) + value;
-        lb.put(0, res);
+                lb.put(0, res);
+            }
+        }
      }

      /**



Testing the PerfCounter.increment() method in a loop on multiple threads
sharing the same PerfCounter instance, for example on a 4-core Intel i7
machine, produces the following results:

#
# PerfCounter_increment: run duration:  5,000 ms, #of logical CPUS: 8
#
            1 threads, Tavg =     19.02 ns/op (σ =   0.00 ns/op)
            2 threads, Tavg =    109.93 ns/op (σ =   6.17 ns/op)
            3 threads, Tavg =    136.64 ns/op (σ =   2.99 ns/op)
            4 threads, Tavg =    293.26 ns/op (σ =   5.30 ns/op)
            5 threads, Tavg =    316.94 ns/op (σ =   6.28 ns/op)
            6 threads, Tavg =    686.96 ns/op (σ =   7.09 ns/op)
            7 threads, Tavg =    793.28 ns/op (σ =  10.57 ns/op)
            8 threads, Tavg =    898.15 ns/op (σ =  14.63 ns/op)


With the presented patch, the results are a little better:

#
# PerfCounter_increment: run duration:  5,000 ms, #of logical CPUS: 8
#
# Measure:
            1 threads, Tavg =      5.22 ns/op (σ =   0.00 ns/op)
            2 threads, Tavg =     34.51 ns/op (σ =   0.60 ns/op)
            3 threads, Tavg =     54.85 ns/op (σ =   1.42 ns/op)
            4 threads, Tavg =     74.67 ns/op (σ =   1.71 ns/op)
            5 threads, Tavg =     94.71 ns/op (σ =  41.68 ns/op)
            6 threads, Tavg =    114.80 ns/op (σ =  32.10 ns/op)
            7 threads, Tavg =    136.70 ns/op (σ =  26.80 ns/op)
            8 threads, Tavg =    158.48 ns/op (σ =   9.93 ns/op)


The scalability is not much better, but the raw speed is, so it might
present less contention when used in real-world code. If you wanted even
better scalability, there is a new class in JDK8, the
java.util.concurrent.LongAdder. But that doesn't buy atomic "set()" -
only "add()". And it can't update native-memory variables, so it could
only be used for add-only counters and in conjunction with a background
thread that would periodically flush the sum to the native memory....
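
A rough sketch of that LongAdder arrangement, under stated assumptions: the LongBuffer passed in stands in for the counter slot in native memory, and the class name, the flush period parameter and the daemon flusher thread are all hypothetical choices for illustration, not anything that exists in PerfCounter.

```java
import java.nio.LongBuffer;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.LongAdder;

final class FlushedAddOnlyCounter implements AutoCloseable {
    private final LongAdder adder = new LongAdder();
    private final LongBuffer nativeSlot;            // stand-in for the perf memory slot
    private final ScheduledExecutorService flusher;

    FlushedAddOnlyCounter(LongBuffer nativeSlot, long periodMillis) {
        this.nativeSlot = nativeSlot;
        this.flusher = Executors.newSingleThreadScheduledExecutor(r -> {
            Thread t = new Thread(r, "counter-flusher");
            t.setDaemon(true);                      // don't keep the JVM alive
            return t;
        });
        flusher.scheduleAtFixedRate(this::flush, periodMillis, periodMillis,
                                    TimeUnit.MILLISECONDS);
    }

    // Hot path: scalable, no shared lock, no native memory access.
    void add(long value) {
        adder.add(value);
    }

    // Cold path: publish the current sum to the native slot.
    // synchronized so a background flush and a manual flush don't interleave.
    synchronized void flush() {
        nativeSlot.put(0, adder.sum());
    }

    @Override
    public void close() {
        flusher.shutdown();
        flush();                                    // publish the final sum
    }
}
```

The trade-off is that external observers such as jstat would see a value that lags behind by up to one flush period, and there is still no atomic set().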

Regards, Peter


On 02/08/2013 06:10 PM, Nils Loodin wrote:
It would be interesting to know the number of thrown throwables in the
JVM, to be able to do some high-level application diagnostics /
statistics. A good place to expose this number would be a performance
counter, since it is accessible both from Java and from the VM.

http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=8007806
http://cr.openjdk.java.net/~nloodin/8007806/webrev.00/

Regards,
Nils Loodin


