On Tue, 4 Mar 2025 11:56:56 GMT, Thomas Schatzl <tscha...@openjdk.org> wrote:
>> Hi all, >> >> please review this change that implements (currently Draft) JEP: G1: >> Improve Application Throughput with a More Efficient Write-Barrier. >> >> The reason for posting this early is that this is a large change, and the >> JEP process is already taking very long with no end in sight but we would >> like to have this ready by JDK 25. >> >> ### Current situation >> >> With this change, G1 will reduce the post write barrier to much more >> resemble Parallel GC's as described in the JEP. The reason is that G1 lacks >> in throughput compared to Parallel/Serial GC due to larger barrier. >> >> The main reason for the current barrier is how g1 implements concurrent >> refinement: >> * g1 tracks dirtied cards using sets (dirty card queue set - dcqs) of >> buffers (dirty card queues - dcq) containing the location of dirtied cards. >> Refinement threads pick up their contents to re-refine. The barrier needs to >> enqueue card locations. >> * For correctness dirty card updates requires fine-grained synchronization >> between mutator and refinement threads, >> * Finally there is generic code to avoid dirtying cards altogether >> (filters), to avoid executing the synchronization and the enqueuing as much >> as possible. >> >> These tasks require the current barrier to look as follows for an assignment >> `x.a = y` in pseudo code: >> >> >> // Filtering >> if (region(@x.a) == region(y)) goto done; // same region check >> if (y == null) goto done; // null value check >> if (card(@x.a) == young_card) goto done; // write to young gen check >> StoreLoad; // synchronize >> if (card(@x.a) == dirty_card) goto done; >> >> *card(@x.a) = dirty >> >> // Card tracking >> enqueue(card-address(@x.a)) into thread-local-dcq; >> if (thread-local-dcq is not full) goto done; >> >> call runtime to move thread-local-dcq into dcqs >> >> done: >> >> >> Overall this post-write barrier alone is in the range of 40-50 total >> instructions, compared to three or four(!) for parallel and serial gc. >> >> The large size of the inlined barrier not only has a large code footprint, >> but also prevents some compiler optimizations like loop unrolling or >> inlining. >> >> There are several papers showing that this barrier alone can decrease >> throughput by 10-20% >> ([Yang12](https://dl.acm.org/doi/10.1145/2426642.2259004)), which is >> corroborated by some benchmarks (see links). >> >> The main idea for this change is to not use fine-grained synchronization >> between refinement and mutator threads, but coarse grained based on >> atomically switching c... > > Thomas Schatzl has updated the pull request incrementally with one additional > commit since the last revision: > > iwalulya review > * comments for variables tracking to-collection-set and just dirtied > cards after GC/refinement > * predicate for determining whether the refinement has been disabled > * some other typos/comment improvements > * renamed _has_xxx_ref to _has_ref_to_xxx to be more consistent with > naming src/hotspot/share/gc/g1/g1ConcurrentRefine.cpp line 356: > 354: bool do_heap_region(G1HeapRegion* r) override { > 355: if (!r->is_free()) { > 356: // Need to scan all parts of non-free regions, so reset the > claim. Why is the condition "is_free"? I thought we scan only old-or-humongous regions? src/hotspot/share/gc/g1/g1ConcurrentRefine.hpp line 116: > 114: SwapGlobalCT, // Swap global card table. > 115: SwapJavaThreadsCT, // Swap java thread's card tables. > 116: SwapGCThreadsCT, // Swap GC thread's card tables. Do GC threads have card-table? src/hotspot/share/gc/g1/g1ConcurrentRefineThread.cpp line 219: > 217: // The young gen revising mechanism reads the predictor and the > values set > 218: // here. Avoid inconsistencies by locking. > 219: MutexLocker x(G1RareEvent_lock, Mutex::_no_safepoint_check_flag); Who else can be in this critical-section? I don't get what this lock is protecting us from. src/hotspot/share/gc/g1/g1ConcurrentRefineThread.hpp line 83: > 81: > 82: public: > 83: static G1ConcurrentRefineThread* create(G1ConcurrentRefine* cr); I wonder if the comment for this class "One or more G1 Concurrent Refinement Threads..." has become obsolete. (AFAICS, this class is a singleton.) src/hotspot/share/gc/g1/g1ConcurrentRefineWorkTask.cpp line 69: > 67: } else if (res == G1RemSet::NoInteresting) { > 68: _refine_stats.inc_cards_clean_again(); > 69: } A `switch` is probably cleaner. src/hotspot/share/gc/g1/g1ConcurrentRefineWorkTask.cpp line 78: > 76: do_dirty_card(source, dest_card); > 77: } > 78: return pointer_delta(dirty_r, dirty_l, sizeof(CardValue)); I feel the `pointer_delta` line belongs to the caller. After that, even the entire method can be inlined to the caller. YMMV. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1979666477 PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1979678325 PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1979699376 PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1979695999 PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1979705019 PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1979709682