On Mon, 3 Mar 2025 08:42:05 GMT, Thomas Schatzl <tscha...@openjdk.org> wrote:
>> Hi all, >> >> please review this change that implements (currently Draft) JEP: G1: >> Improve Application Throughput with a More Efficient Write-Barrier. >> >> The reason for posting this early is that this is a large change, and the >> JEP process is already taking very long with no end in sight but we would >> like to have this ready by JDK 25. >> >> ### Current situation >> >> With this change, G1 will reduce the post write barrier to much more >> resemble Parallel GC's as described in the JEP. The reason is that G1 lacks >> in throughput compared to Parallel/Serial GC due to larger barrier. >> >> The main reason for the current barrier is how g1 implements concurrent >> refinement: >> * g1 tracks dirtied cards using sets (dirty card queue set - dcqs) of >> buffers (dirty card queues - dcq) containing the location of dirtied cards. >> Refinement threads pick up their contents to re-refine. The barrier needs to >> enqueue card locations. >> * For correctness dirty card updates requires fine-grained synchronization >> between mutator and refinement threads, >> * Finally there is generic code to avoid dirtying cards altogether >> (filters), to avoid executing the synchronization and the enqueuing as much >> as possible. >> >> These tasks require the current barrier to look as follows for an assignment >> `x.a = y` in pseudo code: >> >> >> // Filtering >> if (region(@x.a) == region(y)) goto done; // same region check >> if (y == null) goto done; // null value check >> if (card(@x.a) == young_card) goto done; // write to young gen check >> StoreLoad; // synchronize >> if (card(@x.a) == dirty_card) goto done; >> >> *card(@x.a) = dirty >> >> // Card tracking >> enqueue(card-address(@x.a)) into thread-local-dcq; >> if (thread-local-dcq is not full) goto done; >> >> call runtime to move thread-local-dcq into dcqs >> >> done: >> >> >> Overall this post-write barrier alone is in the range of 40-50 total >> instructions, compared to three or four(!) for parallel and serial gc. >> >> The large size of the inlined barrier not only has a large code footprint, >> but also prevents some compiler optimizations like loop unrolling or >> inlining. >> >> There are several papers showing that this barrier alone can decrease >> throughput by 10-20% >> ([Yang12](https://dl.acm.org/doi/10.1145/2426642.2259004)), which is >> corroborated by some benchmarks (see links). >> >> The main idea for this change is to not use fine-grained synchronization >> between refinement and mutator threads, but coarse grained based on >> atomically switching c... > > Thomas Schatzl has updated the pull request incrementally with one additional > commit since the last revision: > > * fix comment (trailing whitespace) > * another assert when snapshotting at a safepoint. src/hotspot/cpu/x86/gc/g1/g1BarrierSetAssembler_x86.cpp line 106: > 104: > 105: __ testptr(count, count); > 106: __ jcc(Assembler::equal, done); I wonder if we can use "zero" instead of "equal" here; they have the same underlying value, but the semantic is to checking for "zero". src/hotspot/cpu/x86/gc/g1/g1BarrierSetAssembler_x86.cpp line 133: > 131: Label is_clean_card; > 132: __ cmpb(Address(addr, 0), G1CardTable::clean_card_val()); > 133: __ jcc(Assembler::equal, is_clean_card); Should this checking be guarded by `if (UseCondCardMark)`? I see that aarch64 does that. src/hotspot/cpu/x86/gc/g1/g1BarrierSetAssembler_x86.cpp line 143: > 141: > 142: __ bind(is_clean_card); > 143: // Card was not clean. Dirty card and go to next.. Why "not clean"? I thought this path is for dirtying clean card? src/hotspot/cpu/x86/gc/g1/g1BarrierSetAssembler_x86.cpp line 323: > 321: assert(thread == r15_thread, "must be"); > 322: #endif // _LP64 > 323: assert_different_registers(store_addr, new_val, thread, tmp1 /*, tmp2 > unused */, noreg); Seems that `tmp2` is unused in this method. It is used in aarch64, but it's not obvious to me whether that is indeed necessary. If so, can you add a comment saying sth like "this unused var is needed for other archs..."? src/hotspot/share/gc/g1/g1CardTable.inline.hpp line 54: > 52: // result = 0xBBAABBAA > 53: inline size_t blend(size_t a, size_t b, size_t mask) { > 54: return a ^ ((a ^ b) & mask); The example makes it much clearer; I wonder if `return (a & ~mask) | (b & mask);` is more readable. src/hotspot/share/gc/g1/g1CardTableClaimTable.cpp line 59: > 57: > 58: void G1CardTableClaimTable::reset_all_claims_to_claimed() { > 59: for (size_t i = 0; i < _max_reserved_regions; i++) { `uint` for `i`? src/hotspot/share/gc/g1/g1CardTableClaimTable.hpp line 64: > 62: void reset_all_claims_to_unclaimed(); > 63: void reset_all_claims_to_claimed(); > 64: I wonder if these two APIs can be renamed to "reset_all_to_x", which is more aligned with its single-region counterpart, `reset_to_unclaimed`, IMO. src/hotspot/share/gc/g1/g1ConcurrentRefine.cpp line 348: > 346: void > G1ConcurrentRefineWorkState::snapshot_heap_into(G1CardTableClaimTable* > sweep_table) { > 347: // G1CollectedHeap::heap_region_iterate() below will only visit > committed regions. Initialize > 348: // all entries in the state table here to not require special handling > when iterating over it. Can you elaborate on what the "special handling" would be, if we don's set "claimed" for non-committed regions? src/hotspot/share/gc/g1/g1RemSet.cpp line 837: > 835: for (; refinement_cur_card < refinement_end_card; > ++refinement_cur_card, ++card_cur_word) { > 836: size_t value = *refinement_cur_card; > 837: *refinement_cur_card = G1CardTable::WordAllClean; Similarly, this is a "word", not "card", also. src/hotspot/share/gc/g1/g1YoungGCPostEvacuateTasks.cpp line 857: > 855: // We do not expect too many non-Java threads compared to Java > threads, so just > 856: // let one worker claim that work. > 857: if (!_non_java_threads_claim && > !Atomic::cmpxchg(&_non_java_threads_claim, false, true, > memory_order_relaxed)) { Do non-java threads have card-table-base? src/hotspot/share/gc/g1/g1YoungGCPostEvacuateTasks.cpp line 862: > 860: > 861: class ResizeAndSwapCardTableClosure : public ThreadClosure { > 862: SwapCardTableClosure _cl; Field indentation. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1977586579 PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1977594184 PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1977583002 PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1977601907 PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1977645576 PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1977571306 PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1977573354 PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1977704351 PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1977575441 PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1977701293 PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1977679688