Hi Ogata,

On 09/18/2017 12:28 PM, Kazunori Ogata wrote:
Hi Hans and Peter,

Thank you for your comments.

Regarding the code Hans showed, I don't yet understand what it the
problem.  Since the load at 1204b is a speculative one, dereferencing
slots[17] should not raise any exception.  If the confirmation at 1204.5
succeeds, the value of tmp must also be correct because we put full fence
and we see a non-NULL reference that was stored after the full fence.

I don't know much, but I can imagine that speculative read may see the value and guess it correctly based on let's say some CPU state of half-processed write instruction in the pipeline, which is established even before the fence instruction flushes writes to array slots. So I can accept that such outcome is possible and doesn't violate JMM.


Also note that even the original code doesn't assume that a ClassDataSlot
array is singleton for a given class.  So even if another method modifies
dataLayout between 1204b and 1204.5, the current thread can keep using the
reference loaded earlier.  If a thread reads a non-NULL reference, the
ClassDataSlot[] entries reachable through the reference must be correct
because we put full fence.

It is not the problem of reading from different arrays since they are all equal in content. The problem is reading uninitialized array slots.



@Peter,

Thank you for the fix.  I'll measure the performance.

ClassDataLayout object (instead of ClassDataSlot[] array) is also an opportunity to put into it some pre-computed cached state which is re-computed frequently in ObjectInputStream.readSerialData and maybe also ObjectOutputStream.writeSerialData.

I'll show the variant which pre-computes the 'hasSpecialReadMethod' value which is always recomputed in ObjectInputStream.readSerialData if your benchmark shows this is a promising direction...

Regards, Peter



Regards,
Ogata


Peter Levart <[email protected]> wrote on 2017/09/18 05:51:07:

From: Peter Levart <[email protected]>
To: Hans Boehm <[email protected]>
Cc: Kazunori Ogata <[email protected]>, core-libs-dev <core-libs-
[email protected]>
Date: 2017/09/18 05:51
Subject: Re: RFR: 8187033: [PPC] Imporve performance of
ObjectStreamClass.getClassDataLayout()

Hi,
On 09/15/17 19:54, Hans Boehm wrote:
The problem occurs if this is transformed (by hardware or compiler) to

1196     ClassDataSlot[] getClassDataLayout() throws
InvalidClassException {
1197         // REMIND: synchronize instead of relying on fullFence()?
                  <prefetch dataLayout>
1198         ClassDataSlot[] slots = DATA_LAYOUT_GUESS;
1199         if (slots == null) {
                      if (dataLayout != DATA_LAYOUT_GUESS) <recover>
1200             slots = getClassDataLayout0();
1201             VarHandle.fullFence();
1202             dataLayout = slots;
1203         }
1204b       tmp = slots[17];
1204.5      if (dataLayout != DATA_LAYOUT_GUESS) <recover>
1205     ...

(This is only an illustration. If the problem were to occur in real
life,
it would probably occur as a
result of a different optimization. DEC Alpha allowed this sort of thing
for entirely different reasons.)

Observe that

(1) This transformation is allowed by the Java memory model, since
dataLayout is not a final field.
(2) This code breaks if another thread runs all of the initialization
code, including the code that sets
slots[17] and the code that sets dataLayout, between 1204b and 1204.5,
but
the check in
1204.5 still succeeds (because we guessed well). tmp will contain the
pre-
initialization value of slots[17].

The fence is not executed by the reading thread, and has no impact on
ordering within the reading thread.

C++ fences have no effect unless they are paired with another fence or
ordered atomic operation in
the other thread involved in the communication. I think that is the
current intent for Java as well.

Well, in that case, it's better to stick with final fields...
@Ogata

You said you implemented 4 variants:
On 09/04/17 07:20, Kazunori Ogata wrote:
1) Put VarHandle.fullFence() between initialization of ClassDataSlot[]
and
writing the reference to non-volatile dataLayout.
   Webrev: http://cr.openjdk.java.net/~horii/8187033/webrev.01-fence/

2) Use Unsafe.getObjectAcquire() and Unsafe.putObjectRelease() for
reading/writing dataLayout.
   Webrev: http://cr.openjdk.java.net/~horii/8187033/webrev.01-unsafe/

3) Put reference to ClassDataSlot[] into a final field of an object and
store the object to non-volatile dataLayout.  Every invocation of
getDataLayout() reads the final field needs to deference the object
pointer.
   Webrev: http://cr.openjdk.java.net/~horii/8187033/webrev.01-final/

4) Put reference to ClassDataSlot[] into a final field of an object,
read
the final field immediately after the object creation, and store it to
non-volatile dataLayout.  I think this is also correct based on the
semantics of final fields and data dependency.
   Webrev: http://cr.openjdk.java.net/~horii/8187033/webrev.01-final2/


The performance improvements were:

1) +3.5%
2) +1.1%
3) +2.2%
4) +3.4%

The 1st and 4th are not correct as we have established. The 3rd is
promising, but does not have the most speed improvement. Perhaps because
of extra de-referencing.

What if 'dataLayout' was not an array of ClassDataSlot records, each of
them containing a reference to an ObjectStreamClass and a boolean, but
an
object containing two arrays:

     static class ClassDataLayout {
         final ObjectStreamClass[] descs;
         final boolean[] hasDatas;
     }

Such object could be "unsafely" published. By eliminating the
intermediate
ClassDataSlot object, number of de-references should be kept down.

Here's a prototype:

     http://cr.openjdk.java.net/~plevart/jdk10-dev/
8187033_ObjectStreamClass.dataLayout/webrev.01/


Could you give it a try in your benchmark and compare it with your last
approach (with fullFence)?

Regards, Peter


Reply via email to