Re: RFR: 8342382: Implementation of JEP G1: Improve Application Throughput with a More Efficient Write-Barrier [v5]

Albert Mingkun Yang Mon, 03 Mar 2025 07:22:39 -0800

On Mon, 3 Mar 2025 08:42:05 GMT, Thomas Schatzl <tscha...@openjdk.org> wrote:


>> Hi all,
>> 
>>   please review this change that implements (currently Draft) JEP: G1: 
>> Improve Application Throughput with a More Efficient Write-Barrier.
>> 
>> The reason for posting this early is that this is a large change, and the 
>> JEP process is already taking very long with no end in sight but we would 
>> like to have this ready by JDK 25.
>> 
>> ### Current situation
>> 
>> With this change, G1 will reduce the post write barrier to much more 
>> resemble Parallel GC's as described in the JEP. The reason is that G1 lacks 
>> in throughput compared to Parallel/Serial GC due to larger barrier.
>> 
>> The main reason for the current barrier is how g1 implements concurrent 
>> refinement:
>> * g1 tracks dirtied cards using sets (dirty card queue set - dcqs) of 
>> buffers (dirty card queues - dcq) containing the location of dirtied cards. 
>> Refinement threads pick up their contents to re-refine. The barrier needs to 
>> enqueue card locations.
>> * For correctness dirty card updates requires fine-grained synchronization 
>> between mutator and refinement threads,
>> * Finally there is generic code to avoid dirtying cards altogether 
>> (filters), to avoid executing the synchronization and the enqueuing as much 
>> as possible.
>> 
>> These tasks require the current barrier to look as follows for an assignment 
>> `x.a = y` in pseudo code:
>> 
>> 
>> // Filtering
>> if (region(@x.a) == region(y)) goto done; // same region check
>> if (y == null) goto done;     // null value check
>> if (card(@x.a) == young_card) goto done;  // write to young gen check
>> StoreLoad;                // synchronize
>> if (card(@x.a) == dirty_card) goto done;
>> 
>> *card(@x.a) = dirty
>> 
>> // Card tracking
>> enqueue(card-address(@x.a)) into thread-local-dcq;
>> if (thread-local-dcq is not full) goto done;
>> 
>> call runtime to move thread-local-dcq into dcqs
>> 
>> done:
>> 
>> 
>> Overall this post-write barrier alone is in the range of 40-50 total 
>> instructions, compared to three or four(!) for parallel and serial gc.
>> 
>> The large size of the inlined barrier not only has a large code footprint, 
>> but also prevents some compiler optimizations like loop unrolling or 
>> inlining.
>> 
>> There are several papers showing that this barrier alone can decrease 
>> throughput by 10-20% 
>> ([Yang12](https://dl.acm.org/doi/10.1145/2426642.2259004)), which is 
>> corroborated by some benchmarks (see links).
>> 
>> The main idea for this change is to not use fine-grained synchronization 
>> between refinement and mutator threads, but coarse grained based on 
>> atomically switching c...
>
> Thomas Schatzl has updated the pull request incrementally with one additional 
> commit since the last revision:
> 
>   * fix comment (trailing whitespace)
>   * another assert when snapshotting at a safepoint.

src/hotspot/cpu/x86/gc/g1/g1BarrierSetAssembler_x86.cpp line 106:

> 104: 
> 105:   __ testptr(count, count);
> 106:   __ jcc(Assembler::equal, done);

I wonder if we can use "zero" instead of "equal" here; they have the same 
underlying value, but the semantic is to checking for "zero".

src/hotspot/cpu/x86/gc/g1/g1BarrierSetAssembler_x86.cpp line 133:

> 131:   Label is_clean_card;
> 132:   __ cmpb(Address(addr, 0), G1CardTable::clean_card_val());
> 133:   __ jcc(Assembler::equal, is_clean_card);

Should this checking be guarded by `if (UseCondCardMark)`? I see that aarch64 
does that.

src/hotspot/cpu/x86/gc/g1/g1BarrierSetAssembler_x86.cpp line 143:

> 141: 
> 142:   __ bind(is_clean_card);
> 143:   // Card was not clean. Dirty card and go to next..

Why "not clean"? I thought this path is for dirtying clean card?

src/hotspot/cpu/x86/gc/g1/g1BarrierSetAssembler_x86.cpp line 323:

> 321:   assert(thread == r15_thread, "must be");
> 322: #endif // _LP64
> 323:   assert_different_registers(store_addr, new_val, thread, tmp1 /*, tmp2 
> unused */, noreg);

Seems that `tmp2` is unused in this method. It is used in aarch64, but it's not 
obvious to me whether that is indeed necessary. If so, can you add a comment 
saying sth like "this unused var is needed for other archs..."?

src/hotspot/share/gc/g1/g1CardTable.inline.hpp line 54:

> 52: // result = 0xBBAABBAA
> 53: inline size_t blend(size_t a, size_t b, size_t mask) {
> 54:   return a ^ ((a ^ b) & mask);

The example makes it much clearer; I wonder if `return (a & ~mask) | (b & 
mask);` is more readable.

src/hotspot/share/gc/g1/g1CardTableClaimTable.cpp line 59:

> 57: 
> 58: void G1CardTableClaimTable::reset_all_claims_to_claimed() {
> 59:   for (size_t i = 0; i < _max_reserved_regions; i++) {

`uint` for `i`?

src/hotspot/share/gc/g1/g1CardTableClaimTable.hpp line 64:

> 62:   void reset_all_claims_to_unclaimed();
> 63:   void reset_all_claims_to_claimed();
> 64: 

I wonder if these two APIs can be renamed to "reset_all_to_x", which is more 
aligned with its single-region counterpart, `reset_to_unclaimed`, IMO.

src/hotspot/share/gc/g1/g1ConcurrentRefine.cpp line 348:

> 346: void 
> G1ConcurrentRefineWorkState::snapshot_heap_into(G1CardTableClaimTable* 
> sweep_table) {
> 347:   // G1CollectedHeap::heap_region_iterate() below will only visit 
> committed regions. Initialize
> 348:   // all entries in the state table here to not require special handling 
> when iterating over it.

Can you elaborate on what the "special handling" would be, if we don's set 
"claimed" for non-committed regions?

src/hotspot/share/gc/g1/g1RemSet.cpp line 837:

> 835:         for (; refinement_cur_card < refinement_end_card; 
> ++refinement_cur_card, ++card_cur_word) {
> 836:           size_t value = *refinement_cur_card;
> 837:           *refinement_cur_card = G1CardTable::WordAllClean;

Similarly, this is a "word", not "card", also.

src/hotspot/share/gc/g1/g1YoungGCPostEvacuateTasks.cpp line 857:

> 855:     // We do not expect too many non-Java threads compared to Java 
> threads, so just
> 856:     // let one worker claim that work.
> 857:     if (!_non_java_threads_claim && 
> !Atomic::cmpxchg(&_non_java_threads_claim, false, true, 
> memory_order_relaxed)) {

Do non-java threads have card-table-base?

src/hotspot/share/gc/g1/g1YoungGCPostEvacuateTasks.cpp line 862:

> 860: 
> 861:     class ResizeAndSwapCardTableClosure : public ThreadClosure {
> 862:     SwapCardTableClosure _cl;

Field indentation.

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1977586579
PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1977594184
PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1977583002
PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1977601907
PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1977645576
PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1977571306
PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1977573354
PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1977704351
PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1977575441
PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1977701293
PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1977679688

Re: RFR: 8342382: Implementation of JEP G1: Improve Application Throughput with a More Efficient Write-Barrier [v5]

Reply via email to