Re: GetPrimitiveArrayCritical vs GetByteArrayRegion: 140x slow-down using -Xcheck:jni and java.util.zip.DeflaterOutputStream

2018-03-19 Thread Thomas Schatzl
Hi,

On Fri, 2018-03-16 at 17:19 +, Ian Rogers wrote:
> Thanks Paul, very interesting.
> 
> On Fri, Mar 16, 2018 at 9:21 AM Paul Sandoz 
> wrote:
> > Hi Ian, Thomas,
> > 
> > [...]
> > (This is also something we need to consider if we modify buffers to
> > support capacities larger than Integer.MAX_VALUE. Also connects
> > with Project Panama.)
> > 
> > If Thomas has not done so or does not plan to i can log an issue
> > for you.
> > 
> 
> That'd be great. I wonder if identifying more TTSP issues should also
> be a bug. It's interesting to observe that overlooking TTSP in C2
> motivated the Unsafe.copyMemory change permitting a fresh TTSP issue.
> If TTSP is a 1st class issue then maybe we can deprecate JNI critical
> regions to support that effort :-)

Please log an issue. I am still a bit unsure what and how many issues
should be filed.

@Ian: at bugreports.oracle.com anyone can file bug reports without
needing an account.
It will take some time until they show up in JIRA due to vetting, but
if you have a good case and can e.g. link to the mailing list, this
should be painless.

Thanks,
  Thomas



Re: GetPrimitiveArrayCritical vs GetByteArrayRegion: 140x slow-down using -Xcheck:jni and java.util.zip.DeflaterOutputStream

2018-03-16 Thread Paul Sandoz
Hi Ian, Thomas,

Some background on the bulk copying for byte buffers after talking with Mikael 
who worked on these changes a few years ago.

Please correct the following if needed as our memory is hazy :-)

IIRC at the time we felt this was a reasonable thing to do because C2 did not 
strip mine the loops for the bulk copying of large Java arrays, i.e. the issue 
was there anyway for more common cases. However, I believe that may no longer 
be so in some cases after Roland implemented loop strip mining in C2 [1]. 
So we should go back and revisit/check the current support in buffers and Java 
arrays (System.arraycopy etc.).

(This is also something we need to consider if we modify buffers to support 
capacities larger than Integer.MAX_VALUE. Also connects with Project Panama.)

If Thomas has not done so or does not plan to, I can log an issue for you.

Paul.


[1] https://bugs.openjdk.java.net/browse/JDK-8186027 


> On Mar 15, 2018, at 10:49 AM, Ian Rogers  wrote:
> 
> +hotspot-gc-dev
> 
> On Thu, Mar 15, 2018 at 1:25 AM Thomas Schatzl 
> wrote:
> 
>> Hi,
>> 
>> On Thu, 2018-03-15 at 01:00 +, Ian Rogers wrote:
>>> An old data point on how large a critical region should be comes from
>>> java.nio.Bits. In JDK 9 the code migrated into unsafe, but in JDK 8
>>> the copies within a critical region were bound at most copying 1MB:
>>> http://hg.openjdk.java.net/jdk8/jdk8/jdk/file/687fd7c7986d/src/share/
>>> native/java/nio/Bits.c#l88 This is inconsistent with Deflater and
>>> ObjectOutputStream which both allow unlimited arrays and thereby
>>> critical region sizes.
>>> 
>>> In JDK 9 the copies starve the garbage collector in nio Bits too as
>>> there is no 1MB limit to the copy sizes:
>>> http://hg.openjdk.java.net/jdk/jdk/rev/f70e100d3195
>>> which came from:
>>> https://bugs.openjdk.java.net/browse/JDK-8149596
>>> 
>>> Perhaps this is a regression not demonstrated due to the testing
>>> challenge.
>>> [...]
>>> It doesn't seem unreasonable to have the loops for the copies occur
>>> in 1MB chunks but JDK-8149596 lost this and so I'm confused on what
>>> the HotSpot standpoint is.
>> 
>> Please file a bug (seems to be a core-libs/java.nio regression?),
>> preferably with some kind of regression test. Also file enhancements (I
>> would guess) for the other cases allowing unlimited arrays.
>> 
> 
> I don't have perms to file bugs; there's some catch-22 scenario in getting
> the permissions. Happy to have a bug filed, or to file one myself were that
> not an issue. Happy to create a test case, but I can't see any others for
> TTSP issues. This feels like a potential use case for JMH, perhaps running
> the benchmark while having a separate thread run a GC benchmark.
> 
> Should there be a bug to add, in debug mode, a TTSP watcher thread whose
> job it is to bring "random" threads into safepoints and report on tardy
> ones?
> Should there be a bug to warn on being in a JNI critical for more than just
> a short period?
> Seems like there should be a bug on Unsafe.copyMemory and
> Unsafe.copySwapMemory having TTSP issues.
> Seems like there should be a bug on all uses of critical that don't chunk
> their critical region work based on some bound (like 1MB chunks for nio
> Bits)? How are these bounds set? A past reference that I've lost had the
> bound be the equivalent of 65535 bytecodes, due to the expectation of GC
> work at least once per method or loop backedge - I thought this was in a
> spec somewhere but now I can't find it. The bytecode size feels as
> arbitrary as 1MB; a time period would be better, but that can depend on
> the kind of GC you want, as delays with a concurrent GC mean more than
> with a non-concurrent one. Clearly the chunk size shouldn't just be 0, but
> this appears to currently be the norm in the JDK.
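
For illustration, a minimal sketch of the bounded-chunk pattern described
above, modeled loosely on the 1MB limit in the JDK 8 java.nio.Bits native
code linked earlier in the thread; the function and constant names are made
up for the example and this is not actual JDK source. The point is that the
array is acquired and released per chunk, so the thread can reach a
safepoint between chunks instead of holding one long critical region:

#include <jni.h>
#include <string.h>

#define CHUNK_BYTES (1024 * 1024)   /* 1MB bound per critical section */

/* Copy 'length' bytes from a Java byte[] into native memory in chunks. */
static void copy_from_byte_array(JNIEnv *env, jbyteArray src, jlong srcPos,
                                 void *dst, jlong length) {
    jlong done = 0;
    while (done < length) {
        jlong size = length - done;
        if (size > CHUNK_BYTES) {
            size = CHUNK_BYTES;
        }
        /* Acquire the array only for this chunk... */
        jbyte *bytes = (*env)->GetPrimitiveArrayCritical(env, src, NULL);
        if (bytes == NULL) {
            return;  /* OutOfMemoryError is already pending */
        }
        memcpy((char *)dst + done, bytes + srcPos + done, (size_t)size);
        /* ...and release it again, bounding the time spent in the
           critical region to roughly one chunk's worth of copying. */
        (*env)->ReleasePrimitiveArrayCritical(env, src, bytes, JNI_ABORT);
        done += size;
    }
}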
> 
> The original reason for coming here was a 140x slow-down in -Xcheck:jni in
> Deflater.deflate. There are a few options there that it's useful to enumerate:
> 1) rewrite in Java but there are correctness and open source related issues
> 2) remove underflow/overflow protection from critical arrays (revert
> JDK-6311046
> or perhaps bound protection to arrays of a particular small size) - this
> removes checking and doesn't deal with TTSP
> 3) add a critical array slice API to JNI so that copies with -Xcheck:jni
> aren't unbounded (martinrb@ proposed this) - keeps checking but doesn't
> deal with TTSP
> 4) rewrite primitive array criticals with GetArrayRegion as O(n) beats the
> "silent killer" TTSP (effectively deprecate the critical APIs)
> 
> In general (i.e. not just the deflate case) I think (1) is the most
> preferable. (2) and (3) both have TTSP issues. (4) isn't great
> performance-wise, which motivates more use of approach (1), but I think
> deprecating criticals may just be the easiest and sanest way forward. I
> think that discussion is worth having on an e-mail thread rather than a bug.
> 
> 
>> Long TTSP is 

Re: GetPrimitiveArrayCritical vs GetByteArrayRegion: 140x slow-down using -Xcheck:jni and java.util.zip.DeflaterOutputStream

2018-03-15 Thread Thomas Schatzl
Hi,

On Thu, 2018-03-15 at 01:00 +, Ian Rogers wrote:
> An old data point on how large a critical region should be comes from
> java.nio.Bits. In JDK 9 the code migrated into unsafe, but in JDK 8
> the copies within a critical region were bound at most copying 1MB:
> http://hg.openjdk.java.net/jdk8/jdk8/jdk/file/687fd7c7986d/src/share/
> native/java/nio/Bits.c#l88 This is inconsistent with Deflater and
> ObjectOutputStream which both allow unlimited arrays and thereby
> critical region sizes.
> 
> In JDK 9 the copies starve the garbage collector in nio Bits too as
> there is no 1MB limit to the copy sizes:
> http://hg.openjdk.java.net/jdk/jdk/rev/f70e100d3195
> which came from:
> https://bugs.openjdk.java.net/browse/JDK-8149596
> 
> Perhaps this is a regression not demonstrated due to the testing
> challenge.
> [...]
> It doesn't seem unreasonable to have the loops for the copies occur
> in 1MB chunks but JDK-8149596 lost this and so I'm confused on what
> the HotSpot standpoint is.

Please file a bug (seems to be a core-libs/java.nio regression?),
preferably with some kind of regression test. Also file enhancements (I
would guess) for the other cases allowing unlimited arrays.

Long TTSP is a performance bug like any other.

> In a way criticals are better than unsafe as they may
> pin the memory and not starve GC, which Shenandoah does.

(Region-based) Object pinning has its own share of problems:

 - it is only (relatively) easily implemented in region-based collectors

 - it may slow down pauses a bit in the presence of pinned regions/objects
(for non-concurrent copying collectors)

 - excessive use of pinning may cause OOME and VM exit, probably earlier
than the GC locker. The GC locker seems to provide a more gradual
degradation, e.g. pinning regions typically makes these regions
unavailable for allocation.
I.e. you still should not use it for many, very long-lived objects.
Of course this somewhat depends on the sophistication of the
implementation.

I think region-based pinning would be a good addition to collectors other
than Shenandoah too. It has been on our minds for a long time, but there
are so many other more important issues :), so of course we are eager to
see contributions in this area. ;)

If you are interested in working on this, please ping us on
hotspot-gc-dev for implementation ideas to get you jump-started.

Thanks,
  Thomas



Re: GetPrimitiveArrayCritical vs GetByteArrayRegion: 140x slow-down using -Xcheck:jni and java.util.zip.DeflaterOutputStream

2018-03-06 Thread Martin Buchholz
Thanks Ian and Sherman for the excellent presentation and memories of
ancient efforts.

Yes, Sherman, I still have a vague memory that attempts to touch any
implementation detail in this area were asking for trouble and someone would
complain.  I was happy to let you deal with those problems!

There's a continual struggle in the industry to enable more checking at
test time, and -Xcheck:jni does look like it should be possible to
routinely turn on for running all tests.  (Google tests run with a time
limit, and so any low-level performance regression immediately causes test
failures, for better or worse)

Our problem reduces to accessing a primitive array slice from native code.
The only way to get O(1) access is via GetPrimitiveArrayCritical, BUT when
it fails you have to pay for a copy of the entire array.  An obvious
solution is to introduce a slice variant GetPrimitiveArrayRegionCritical
that would only degrade to a copy of the slice.  Offhand that seems
relatively easy to implement though we would hold our noses at adding yet
more *Critical* functions to the JNI spec.  In spirit though it's a
straightforward generalization.
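
For illustration only, the slice variant might look something like the
following. This is hypothetical - it is not part of the JNI specification -
and the names and signatures are just a sketch modeled on the existing
GetPrimitiveArrayCritical and Get<Type>ArrayRegion functions:

#include <jni.h>

/* HYPOTHETICAL - a sketch of the proposed slice variant, not a real JNI
 * function. It would give direct access to elements [start, start+len)
 * when the VM can provide it, and degrade to copying only that slice
 * (rather than the whole array) when it cannot. */
void *GetPrimitiveArrayRegionCritical(JNIEnv *env, jarray array,
                                      jsize start, jsize len,
                                      jboolean *isCopy);

/* HYPOTHETICAL release counterpart, mirroring the existing
 * ReleasePrimitiveArrayCritical contract. */
void ReleasePrimitiveArrayRegionCritical(JNIEnv *env, jarray array,
                                         jsize start, jsize len,
                                         void *carray, jint mode);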

Implementing Deflater in pure Java seems very reasonable and we've had good
success with "nearby" code, but we likely cannot reuse the GNU Classpath
code.

Thanks for pointing out
JDK-6311046: -Xcheck:jni should support checking of
GetPrimitiveArrayCritical
which went into jdk8 in u40.

We can probably be smarter about choosing a better buffer size, e.g. in
ZipOutputStream.

Here's an idea:  In code like this
  try (DeflaterOutputStream dout = new DeflaterOutputStream(deflated)) {
    dout.write(inflated, 0, inflated.length);
  }
when the DeflaterOutputStream is given an input that is clearly too large
for the current buffer size, reorganize internals dynamically to use a much
bigger buffer size.

It's possible (but hard work!) to adjust algorithms based on whether
critical array access is available.  It would be nice if we could get the
JVM to tell us (but it might depend, e.g. on the size of the array).


Re: GetPrimitiveArrayCritical vs GetByteArrayRegion: 140x slow-down using -Xcheck:jni and java.util.zip.DeflaterOutputStream

2018-03-05 Thread Xueming Shen

On 3/5/18, 12:45 PM, Xueming Shen wrote:

On 3/5/18, 11:15 AM, Ian Rogers wrote:
Thanks! Changing the DeflaterOutputStream buffer size to be something
other than the default reduces the number of JNI native calls and is a
possible work-around here; as this is an implementation detail, could
it be made in the JDK? Unfortunately larger input sizes will also
regress the issue, as the number of calls is "input size / buffer
size". The JNI critical may give direct access to the array but,
depending on the GC, may require a lock, and so lock contention may be
a significant issue with the code and contribute to tail latencies. In
my original post I mention this is difficult to measure, and I think
good practice is to avoid JNI critical regions.


We do have a history on the usage of
GetPrimitiveArrayCritical/Elements() here regarding the potential
"lock contention", copy overhead... and we went back and forth on
which JNI call is the appropriate one to use. Martin might still
remember that :-)

And for the record, the direct root trigger of this issue is probably
the fix for JDK-6311046 that went into JDK 9.


https://bugs.openjdk.java.net/browse/JDK-6311046

-Sherman









Re: GetPrimitiveArrayCritical vs GetByteArrayRegion: 140x slow-down using -Xcheck:jni and java.util.zip.DeflaterOutputStream

2018-03-05 Thread Xueming Shen

On 3/5/18, 11:15 AM, Ian Rogers wrote:
Thanks! Changing the DeflaterOutputStream buffer size to be something
other than the default reduces the number of JNI native calls and is a
possible work-around here; as this is an implementation detail, could
it be made in the JDK? Unfortunately larger input sizes will also
regress the issue, as the number of calls is "input size / buffer
size". The JNI critical may give direct access to the array but,
depending on the GC, may require a lock, and so lock contention may be
a significant issue with the code and contribute to tail latencies. In
my original post I mention this is difficult to measure, and I think
good practice is to avoid JNI critical regions.


We do have a history on the usage of
GetPrimitiveArrayCritical/Elements() here regarding the potential
"lock contention", copy overhead... and we went back and forth on
which JNI call is the appropriate one to use. Martin might still
remember that :-)

Some related bugids:

https://bugs.openjdk.java.net/browse/JDK-6206933
https://bugs.openjdk.java.net/browse/JDK-6348045
https://bugs.openjdk.java.net/browse/JDK-5043044
https://bugs.openjdk.java.net/browse/JDK-6356456

There was once a "lock contention" bug that affected the performance of
GetPrimitiveArrayCritical. But that bug has since been fixed (in HotSpot).
With various use scenarios, the "copy overhead" of using
GetPrimitiveArrayElements() was concluded to be unacceptable, at least
back then.

It appears to be an easy-to-do/must-do (nice to have?) change to increase
the "default buf size" used in DeflaterOutputStream. But every time we
tried to touch those default settings/configuration values that have been
there for decades, some "regression" complaints would come back and hurt
us :-) For this particular case a potential "regression" is that the
footprint increase from raising the default buffer size from 512 might not
be desirable for some use scenarios. For example, if hundreds of thousands
of DeflaterOutputStreams are being opened/closed on hundreds of thousands
of compressed jar or zip files, the increase might be huge. And the API
does provide a constructor with which you can customize the buffer size.
It might be desirable to keep the default size as is. That said, I agree
that 512 appears to be a "too-small" default size if the majority of the
use scenarios for DeflaterOutputStream are to open a jar/zip file entry.


Sherman






Re: GetPrimitiveArrayCritical vs GetByteArrayRegion: 140x slow-down using -Xcheck:jni and java.util.zip.DeflaterOutputStream

2018-03-05 Thread Xueming Shen

On 03/05/2018 10:28 AM, Xueming Shen wrote:

On 03/05/2018 08:34 AM, Ian Rogers wrote:

Firstly, we're not running -Xcheck:jni in production code :-) During
development and testing it doesn't seem an unreasonable flag to enable, but
a 140x regression is too much to get developers to swallow.

There are 2 performance considerations:
1) the performance of -Xcheck:jni, which probably shouldn't be orders of
magnitude worse than without the flag.
2) the problems associated with JNI criticals, for which GetByteArrayRegion
is a panacea, but one that introduces a copying overhead.




The reason GetPrimitiveArrayCritical was/is being used here is exactly to
avoid the copy overhead, which was an issue escalated in the past. Though
the "copy overhead" appears to be much bigger for it when -Xcheck:jni is
used here.

Another issue with the DeflaterOutputStream is that the default buf size is
relatively too small, for historical reasons. So with a
DeflaterOutputStream(deflated, new Deflater(), 8192*64), is which a bigger
buf/8192*64, the performance is close to the run with the -Xcheck:jni



typo: that should have read

"in which a bigger buf/8192*64 is used, run without the -Xcheck:jni
specified."

-Sherman



Re: GetPrimitiveArrayCritical vs GetByteArrayRegion: 140x slow-down using -Xcheck:jni and java.util.zip.DeflaterOutputStream

2018-03-05 Thread Xueming Shen

On 03/05/2018 08:34 AM, Ian Rogers wrote:

Firstly, we're not running -Xcheck:jni in production code :-) During
development and testing it doesn't seem an unreasonable flag to enable, but
a 140x regression is too much to get developers to swallow.

There are 2 performance considerations:
1) the performance of -Xcheck:jni, which probably shouldn't be orders of
magnitude worse than without the flag.
2) the problems associated with JNI criticals, for which GetByteArrayRegion
is a panacea, but one that introduces a copying overhead.




The reason GetPrimitiveArrayCritical was/is being used here is exactly to
avoid the copy overhead, which was an issue escalated in the past. Though
the "copy overhead" appears to be much bigger for it when -Xcheck:jni is
used here.

Another issue with the DeflaterOutputStream is that the default buf size is
relatively too small, for historical reasons. So with a
DeflaterOutputStream(deflated, new Deflater(), 8192*64), in which a bigger
buf (8192*64) is used and run without -Xcheck:jni specified, the
performance is close to the run with -Xcheck:jni, for the byte[1 << 23]
input.

Understood, the test case is to show the issue for the
GetPrimitiveArrayCritical + -Xcheck:jni use scenario.

-Sherman


Re: GetPrimitiveArrayCritical vs GetByteArrayRegion: 140x slow-down using -Xcheck:jni and java.util.zip.DeflaterOutputStream

2018-03-05 Thread Alan Bateman



On 05/03/2018 06:33, David Holmes wrote:



Hi Ian,

Do you run with -Xcheck:jni in production mode because you load 
unknown native code libraries? It's mainly intended as a diagnostic 
option to turn on if you encounter a possible JNI problem.
It does seem unusual to be running with -Xcheck:jni in a performance
critical environment. I would expect to see -Xcheck:jni when developing or
maintaining a library that uses JNI, and to drop the option once the code
has been fully tested.


Lots of good work was done in JDK 9 to replace the ZipFile 
implementation with a Java implementation and it would be good to get 
some results with a re-write of Inflater and Deflater. It would need 
lots of testing, including startup. We would need to have a dependency on
libz of course, as it is needed by the VM to deflate entries in the
jimage container (when they are compressed) or access JAR files on the 
boot class path.


-Alan


Re: GetPrimitiveArrayCritical vs GetByteArrayRegion: 140x slow-down using -Xcheck:jni and java.util.zip.DeflaterOutputStream

2018-03-04 Thread David Holmes



Hi Ian,

Do you run with -Xcheck:jni in production mode because you load unknown 
native code libraries? It's mainly intended as a diagnostic option to 
turn on if you encounter a possible JNI problem.


I'll leave the debate on your actual patch proposal to others more 
familiar with the library code.


Thanks,
David

On 5/03/2018 5:24 AM, Ian Rogers wrote:

Hi,

we've been encountering poor performance with -Xcheck:jni; for the
following example the performance is 140x to 510x slower with the flag
enabled:




import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.Random;
import java.util.zip.DeflaterOutputStream;

public final class CheckJniTest {
  static void deflateBytesPerformance() throws IOException {
    byte[] inflated = new byte[1 << 23];
    new Random(71917367).nextBytes(inflated);
    ByteArrayOutputStream deflated = new ByteArrayOutputStream();
    try (DeflaterOutputStream dout = new DeflaterOutputStream(deflated)) {
      dout.write(inflated, 0, inflated.length);
    }
    if (8391174 != deflated.size()) {
      throw new AssertionError();
    }
  }

  public static void main(String args[]) throws IOException {
    int n = 5;
    if (args.length > 0) {
      n = Integer.parseInt(args[0]);
    }
    for (int i = 0; i < n; i++) {
      long startTime = System.currentTimeMillis();
      deflateBytesPerformance();
      long endTime = System.currentTimeMillis();
      System.out.println("Round " + i + " took " + (endTime - startTime) +
                         "ms");
    }
  }
}


The issue is in the libzip Deflater.c JNI code:
http://hg.openjdk.java.net/jdk/jdk/file/c5eb27eed365/src/java.base/share/native/libzip/Deflater.c#l131

The test has an 8MB array to deflate/compress. The DeflaterOutputStream has
a buffer of size 512 bytes:
http://hg.openjdk.java.net/jdk/jdk/file/c5eb27eed365/src/java.base/share/classes/java/util/zip/DeflaterOutputStream.java#l128

To compress the array, 16,384 native calls are made that use the 8MB input
array and the 512-byte output array. These arrays are accessed using
GetPrimitiveArrayCritical, which with -Xcheck:jni copies the array:
http://hg.openjdk.java.net/jdk/jdk/file/c5eb27eed365/src/hotspot/share/prims/jniCheck.cpp#l1862
The copying of the arrays leads to 128GB of copying, which dominates
execution time.
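
(Roughly: the 8MB input is 2^23 bytes, so 2^23 / 512 = 16,384 deflate
calls, and if the checked GetPrimitiveArrayCritical copies the full 8MB
input array on each call, that is 16,384 * 8MB = 128GB copied.)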

One approach to fix the problem is to rewrite libzip in Java. GNU Classpath
has such an implementation:
http://cvs.savannah.gnu.org/viewvc/classpath/classpath/java/util/zip/Deflater.java?view=markup#l417

A different approach is to use Get/SetByteArrayRegion (using
Get/SetByteArrayElements would be no different than the current approach
except potentially causing more copies). I've included a patch and
performance data for this approach below, where regions of the arrays are
copied onto a 512-byte buffer on the stack. The non -Xcheck:jni performance
is roughly equivalent before and after the patch; the -Xcheck:jni
performance is now similar to the non -Xcheck:jni performance.
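
To make the shape of that approach concrete, here is a minimal sketch (not
the actual patch; the function name and structure are illustrative only) of
feeding the input through a fixed 512-byte stack buffer with
GetByteArrayRegion instead of pinning the whole array:

#include <jni.h>

#define STACK_BUF_LEN 512

/* Feed 'len' bytes of a Java byte[] to native code through a small
 * stack buffer, copying one bounded slice at a time. */
static void consume_input(JNIEnv *env, jbyteArray input, jint off, jint len) {
    jbyte buf[STACK_BUF_LEN];
    while (len > 0) {
        jint chunk = (len < STACK_BUF_LEN) ? len : STACK_BUF_LEN;
        /* Copies only this slice; no critical region is held, so there is
         * no GC-locker interaction, and -Xcheck:jni has nothing extra to
         * copy beyond these 512 bytes. */
        (*env)->GetByteArrayRegion(env, input, off, chunk, buf);
        if ((*env)->ExceptionCheck(env)) {
            return;  /* e.g. ArrayIndexOutOfBoundsException */
        }
        /* ...hand buf/chunk to the compressor here... */
        off += chunk;
        len -= chunk;
    }
}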

The choice to go from using GetPrimitiveArrayCritical to
GetByteArrayRegion is a controversial one, as engineers have many different
expectations about what critical means and does. GetPrimitiveArrayCritical
may have similar overhead to GetByteArrayElements if primitive arrays
(possibly of a certain size) are known to be non-moving. There may be a
cost to pin critical arrays or the regions they exist within. There may be
a global or region lock in play that can cause interactions with the
garbage collector - such interactions may cause tail latency issues in
production environments. GetByteArrayRegion loses compared to
GetPrimitiveArrayCritical as it must always copy a region of the array for
access. Given these issues it is hard to develop a benchmark of
GetPrimitiveArrayCritical vs GetByteArrayRegion that can take into account
the GC interactions. Most benchmarks will see that avoiding a copy can be
good for performance.

For more background, Aleksey has a good post on GetPrimitiveArrayCritical
here:
https://shipilev.net/jvm-anatomy-park/9-jni-critical-gclocker/

A different solution to the performance problem presented here is to change
the checked JNI implementation to do less copying of arrays. This would be
motivated if GetPrimitiveArrayCritical were expected to be used more widely
than GetByteArrayRegion in performance-sensitive, commonly used code. Given
the range of problems possible with GetPrimitiveArrayCritical I'd
personally prefer GetByteArrayRegion to be more commonly used, as I've yet
to see a performance problem that made GetPrimitiveArrayCritical so
compelling. For example, ObjectOutputStream has burnt me previously:
http://hg.openjdk.java.net/jdk/jdk/file/c5eb27eed365/src/java.base/share/native/libjava/ObjectOutputStream.c#l69
and these trivial copy operations should really be a call to fix the
JIT/AOT compilers.

Next steps: it'd be great to get this turned into a bug, although it's not
clear to me whether this is a JDK issue