For what it's worth, I fully agree with David and Kirk that finalization does not necessarily need this treatment.
However, I was hoping this would have the effect of improving
(non-finalizable) reference handling. We've seen serious issues in
WeakReference handling and have had to write some twisted code to deal
with this.

So I guess the question I have for Kirk and David is: do you feel a GC
load of 10K WeakReferences per cycle is also "doing something else wrong"?

Sorry if this is going off-topic.

Thanks
Moh

>-----Original Message-----
>From: core-libs-dev [mailto:core-libs-dev-boun...@openjdk.java.net] On Behalf
>Of Kirk Pepperdine
>Sent: Thursday, May 28, 2015 11:58 PM
>To: david.hol...@oracle.com Holmes
>Cc: hotspot-gc-...@openjdk.java.net openjdk.java.net; core-libs-
>d...@openjdk.java.net
>Subject: Re: JEP 132: More-prompt finalization
>
>Hi Peter,
>
>It is a very interesting proposal but, to further David's comments, the
>life-cycle cost of reference objects is horrendous, and the actual process
>of finalizing an object is only a fraction of that total cost.
>Unfortunately your micro-benchmark only focuses on one aspect of that
>cost. In other words, it isn't very representative of a real concern. In
>the real world the finalizer *must* compete with mutator threads, and
>since F-J is an "all threads on deck" implementation, it doesn't play well
>with others. It creates a "tragedy of the commons", that is, a situation
>where everyone behaves rationally with a common resource but to the
>detriment of the whole group. In short, parallelizing (F-Jing)
>*everything* in an application is simply not a good idea. We do not live
>in an infinite compute environment, which means we have to consider the
>impact of our actions on the entire group.
>
>This was one of the points of my recent article in Java Magazine, which I
>wrote to try to counter some of the rhetoric I was hearing at conferences
>about the universal benefits of being able to easily parallelize streams
>in Java 8. Yes, I agree it's a great feature but it must be used with
>discretion. Case in point.
>After I finished writing the article, I started running into a couple of
>early adopters that had swallowed the parallel message whole,
>indiscriminately parallelizing all of their streams. As you can imagine,
>they were quite surprised by the results and quickly worked to
>de-parallelize *all* of the streams in the application.
>
>To add some ability to parallelize the handling of reference objects seems
>like a good idea if you are collecting large numbers of reference objects
>(>10,000 per GC cycle). However if you are collecting large numbers of
>reference objects you're most likely doing something else wrong. IME,
>finalization is extremely useful but really only for a limited number of
>use cases and none of them (to date) have resulted in the app burning
>through 1000s of final objects / sec.
>
>It would be interesting to know why you picked on this particular issue.
>
>Kind regards,
>Kirk
>
>On May 29, 2015, at 5:18 AM, David Holmes <david.hol...@oracle.com> wrote:
>
>> Hi Peter,
>>
>> I guess I'm very concerned about the premise that finalization should
>> scale to millions of objects and be performed highly concurrently. To me
>> that's sending the wrong message about finalization. It also isn't the
>> most effective use of CPU resources - most people would want to do
>> useful work on most CPUs most of the time.
>>
>> Cheers,
>> David
>>
>> On 29/05/2015 3:12 AM, Peter Levart wrote:
>>> Hi,
>>>
>>> Did you know that the following simple loop:
>>>
>>> public class FinalizableBottleneck {
>>>     static boolean no;
>>>
>>>     @Override
>>>     protected void finalize() throws Throwable {
>>>         // empty finalize() method does not make the object finalizable
>>>         // (it is not even registered on the finalizer's list)
>>>         if (no) {
>>>             throw new AssertionError();
>>>         }
>>>     }
>>>
>>>     public static void main(String[] args) {
>>>         while (true) {
>>>             new FinalizableBottleneck();
>>>         }
>>>     }
>>> }
>>>
>>> ...quickly fills the entire heap with FinalizableBottleneck and internal
>>> Finalizer objects and brings the JVM to a halt? After a few seconds of
>>> running the above program, jmap -histo:live reports:
>>>
>>>  num     #instances         #bytes  class name
>>> ----------------------------------------------
>>>    1:      50048325     2001933000  java.lang.ref.Finalizer
>>>    2:      50048278      800772448  FinalizableBottleneck
>>>
>>> There are a couple of bottlenecks that make this happen:
>>>
>>> - ReferenceHandler thread synchronizes with VM to unhook Reference(s)
>>> from the pending chain one by one and dispatches them to their
>>> respective ReferenceQueue(s) which also use synchronization for
>>> enqueueing each Reference.
>>> - Enqueueing synchronizes with the finalization thread which removes the
>>> Finalizer(s) (FinalReferences) from the finalization queue and executes
>>> them.
>>> - Executing the Finalizer(s) removes them from the doubly-linked list of
>>> all Finalizer(s) which is used to retain them until they are needed and
>>> this synchronizes with the threads that link new Finalizer(s) into the
>>> doubly-linked list as new finalizable objects get registered.
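
To make the third bottleneck concrete, here is a minimal sketch of a doubly-linked registry guarded by one lock. It is a simplified model under my own assumptions, not the actual java.lang.ref.Finalizer source; LockedRegistry, Node, register() and unregister() are hypothetical names.

```java
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical model of a single, globally locked doubly-linked registry.
// Illustration only; this is NOT the java.lang.ref.Finalizer implementation.
final class LockedRegistry {
    static final class Node {
        Node prev;
        Node next;
        boolean linked;
    }

    private final ReentrantLock lock = new ReentrantLock();
    private Node head;
    private int size;

    // Called by every allocating thread when a finalizable object is created.
    Node register() {
        Node n = new Node();
        lock.lock();
        try {
            n.next = head;
            if (head != null) {
                head.prev = n;
            }
            head = n;
            n.linked = true;
            size++;
        } finally {
            lock.unlock();
        }
        return n;
    }

    // Called by the finalizing side once the entry has been processed.
    // Unlinking an already unlinked node is a no-op.
    void unregister(Node n) {
        lock.lock();
        try {
            if (!n.linked) {
                return;
            }
            if (n.prev != null) {
                n.prev.next = n.next;
            } else {
                head = n.next;
            }
            if (n.next != null) {
                n.next.prev = n.prev;
            }
            n.prev = null;
            n.next = null;
            n.linked = false;
            size--;
        } finally {
            lock.unlock();
        }
    }

    int size() {
        lock.lock();
        try {
            return size;
        } finally {
            lock.unlock();
        }
    }
}
```

In this model every allocating thread calls register() and the finalizing side calls unregister(), so both ends serialize on the same lock. That is the kind of contention the last bullet describes.
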
>>>
>>> We see that the creation of a finalizable object only takes one
>>> synchronization (registering into the doubly-linked list) and is
>>> performed synchronously, while finalization takes 4 synchronizations
>>> among 4 different threads (in pairs) and happens when the Finalizer
>>> instance "travels" over from VM thread to ReferenceHandler thread and
>>> then to finalization thread. No wonder that finalization cannot keep up
>>> with allocation in a single thread. The situation is even worse when
>>> finalize() methods do some actual work.
>>>
>>> I have experimented with various approaches to widen these bottlenecks
>>> and found out that I cannot beat the ForkJoinPool when combined with
>>> some improvements to internal data structures used in reference
>>> processing. Here's a prototype I came up with:
>>>
>>> http://cr.openjdk.java.net/~plevart/misc/JEP132/ReferenceHandling/webrev.01/
>>>
>>> And this is the benchmark I use for measuring the throughput:
>>>
>>> http://cr.openjdk.java.net/~plevart/misc/JEP132/ReferenceHandling/FinalizerThroughput.java
>>>
>>> The benchmark shows (results inline in source) that using unpatched JDK,
>>> on my PC (i7-2700K, Linux, JDK8) I cannot construct more than 1500
>>> finalizable objects per ms in a single thread and that while doing so,
>>> finalization only manages to process approx. 100 - 120 objects in the
>>> same time. Objects "in-flight" quickly accumulate and bring the VM to a
>>> halt, where it is not doing anything but full GC cycles.
>>>
>>> When constructing in 4 threads, there's not much difference.
>>> Construction of finalizable objects simply doesn't scale.
>>>
>>> Patched JDK shows something completely different. Single thread
>>> construction achieves a rate of 3600 objects / ms. Number of "in-flight"
>>> objects is kept constant at about 5-6M instances which amounts to approx
>>> 1.5 s of allocation.
>>> I think this is about the rate of GC cycles during
>>> which VM also processes the references. The benchmark also shows the
>>> ForkJoinPool statistics which show that the number of queued tasks is
>>> also kept low.
>>>
>>> Increasing the allocation threads to 4 increases allocation rate to
>>> about 4300 objects / ms and finalization keeps up. Increasing allocation
>>> threads to 8 further increases allocation rate to about 4600 objects /
>>> ms and finalization still keeps up. The increase in rate is not linear,
>>> but keep in mind that i7 is a 4-core CPU.
>>>
>>> About the implementation...
>>>
>>> 1st improvement I did was for the doubly-linked list of Finalizer
>>> instances that is used to keep them alive until they are needed. I
>>> ripped off the wonderful ConcurrentLinkedDeque by Doug Lea and Martin
>>> Buchholz and just kept the internal link/unlink methods while
>>> specializing them to Finalizer entries (very straightforward). I
>>> experimented with throughput and got some improvement, but throughput
>>> increased much more when I used several instances of independent
>>> lists and distributed registrations among them randomly (unlinking
>>> consequently is also distributed randomly).
>>>
>>> I found out that no matter how hard I try to optimize ReferenceQueue
>>> while keeping the API unchanged, I can only do so much and that was not
>>> enough. I have been surprised by how well ForkJoinPool distributes tasks
>>> among threads, so I concluded that leveraging it is the best choice. I
>>> re-designed the pending-list unhooking loop to unhook pending references
>>> in chunks which greatly improves the throughput. Since unhooking can be
>>> performed by a single thread while holding a lock which is mandated by
>>> the interface between VM and Java, I didn't employ multiple threads, but
>>> a single eternal ForkJoinTask that unhooks in chunks and forks off other
>>> processing tasks that process chunks.
>>> When there are just a couple of
>>> References pending at one time and a not-full chunk is unhooked, then
>>> the processing is performed by the same thread that unhooked the
>>> references, but when there are more, worker tasks are forked off and the
>>> unhooking thread continues with full peace. This processing includes
>>> execution of Cleaners, forking the finalizer tasks and enqueueing other
>>> references. Finalizer(s) are always executed as separate ForkJoinTask(s).
>>>
>>> It's interesting how Runtime.runFinalization() is implemented in this
>>> patch - it basically amounts to ForkJoinPool.awaitQuiescence() ...
>>>
>>> I also tweaked the ReferenceQueue implementation a bit (it is still used
>>> for other kinds of references) so that it avoids synchronization with a
>>> monitor lock when there are no blocking waiters and uses CAS to
>>> enqueue/dequeue. This improves throughput when the queue is not empty.
>>> Since in the prototype multiple threads can enqueue into the same queue,
>>> I thought this would improve throughput in such situations.
>>>
>>> Comments, suggestions, criticism are welcome.
>>>
>>> Regards, Peter
>>>
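
The chunked hand-off described in Peter's message can be sketched roughly as follows. This is not the code from the webrev: ChunkedDrain, drain(), process() and CHUNK_SIZE are hypothetical names, the chunk size is an arbitrary assumption, and a plain Queue<Runnable> stands in for the VM's pending list.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch: one thread unhooks pending work in chunks and
// forks full chunks off to a ForkJoinPool. Not the patch's implementation.
final class ChunkedDrain {
    static final int CHUNK_SIZE = 64; // assumed value, chosen for illustration

    // Drains 'pending' and returns the number of tasks that were run.
    static long drain(Queue<Runnable> pending, ForkJoinPool pool) {
        AtomicLong processed = new AtomicLong();
        while (true) {
            // Unhook up to CHUNK_SIZE entries in one go.
            List<Runnable> chunk = new ArrayList<>(CHUNK_SIZE);
            Runnable r;
            while (chunk.size() < CHUNK_SIZE && (r = pending.poll()) != null) {
                chunk.add(r);
            }
            if (chunk.isEmpty()) {
                break;
            }
            if (chunk.size() < CHUNK_SIZE) {
                // Not-full chunk: little is pending, process it inline.
                process(chunk, processed);
            } else {
                // Full chunk: hand it to the pool and keep unhooking.
                pool.execute(() -> process(chunk, processed));
            }
        }
        // Wait until the forked chunks are done.
        pool.awaitQuiescence(1, TimeUnit.MINUTES);
        return processed.get();
    }

    private static void process(List<Runnable> chunk, AtomicLong processed) {
        for (Runnable task : chunk) {
            task.run();
            processed.incrementAndGet();
        }
    }

    public static void main(String[] args) {
        Queue<Runnable> pending = new ConcurrentLinkedQueue<>();
        for (int i = 0; i < 1000; i++) {
            pending.add(() -> { });
        }
        ForkJoinPool pool = new ForkJoinPool(4);
        System.out.println("processed: " + drain(pending, pool));
        pool.shutdown();
    }
}
```

The single drainer mirrors the constraint that unhooking happens under one lock. Only the per-chunk processing is spread across worker threads, and awaitQuiescence() plays the role the message attributes to it in the patched runFinalization path.
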