Re: RFR: 8342382: Implementation of JEP G1: Improve Application Throughput with a More Efficient Write-Barrier [v2]

Albert Mingkun Yang Thu, 27 Feb 2025 11:12:05 -0800

On Tue, 25 Feb 2025 15:13:43 GMT, Thomas Schatzl <tscha...@openjdk.org> wrote:


>> Hi all,
>> 
>>   please review this change that implements (currently Draft) JEP: G1: 
>> Improve Application Throughput with a More Efficient Write-Barrier.
>> 
>> The reason for posting this early is that this is a large change, and the 
>> JEP process is already taking very long with no end in sight but we would 
>> like to have this ready by JDK 25.
>> 
>> ### Current situation
>> 
>> With this change, G1 will reduce the post write barrier to much more 
>> resemble Parallel GC's as described in the JEP. The reason is that G1 lacks 
>> in throughput compared to Parallel/Serial GC due to larger barrier.
>> 
>> The main reason for the current barrier is how g1 implements concurrent 
>> refinement:
>> * g1 tracks dirtied cards using sets (dirty card queue set - dcqs) of 
>> buffers (dirty card queues - dcq) containing the location of dirtied cards. 
>> Refinement threads pick up their contents to re-refine. The barrier needs to 
>> enqueue card locations.
>> * For correctness dirty card updates requires fine-grained synchronization 
>> between mutator and refinement threads,
>> * Finally there is generic code to avoid dirtying cards altogether 
>> (filters), to avoid executing the synchronization and the enqueuing as much 
>> as possible.
>> 
>> These tasks require the current barrier to look as follows for an assignment 
>> `x.a = y` in pseudo code:
>> 
>> 
>> // Filtering
>> if (region(@x.a) == region(y)) goto done; // same region check
>> if (y == null) goto done;     // null value check
>> if (card(@x.a) == young_card) goto done;  // write to young gen check
>> StoreLoad;                // synchronize
>> if (card(@x.a) == dirty_card) goto done;
>> 
>> *card(@x.a) = dirty
>> 
>> // Card tracking
>> enqueue(card-address(@x.a)) into thread-local-dcq;
>> if (thread-local-dcq is not full) goto done;
>> 
>> call runtime to move thread-local-dcq into dcqs
>> 
>> done:
>> 
>> 
>> Overall this post-write barrier alone is in the range of 40-50 total 
>> instructions, compared to three or four(!) for parallel and serial gc.
>> 
>> The large size of the inlined barrier not only has a large code footprint, 
>> but also prevents some compiler optimizations like loop unrolling or 
>> inlining.
>> 
>> There are several papers showing that this barrier alone can decrease 
>> throughput by 10-20% 
>> ([Yang12](https://dl.acm.org/doi/10.1145/2426642.2259004)), which is 
>> corroborated by some benchmarks (see links).
>> 
>> The main idea for this change is to not use fine-grained synchronization 
>> between refinement and mutator threads, but coarse grained based on 
>> atomically switching c...
>
> Thomas Schatzl has updated the pull request incrementally with one additional 
> commit since the last revision:
> 
>   * remove unnecessarily added logging

src/hotspot/share/gc/g1/g1BarrierSet.hpp line 54:

> 52: // them, keeping the write barrier simple.
> 53: //
> 54: // The refinement threads mark cards in the the current collection set 
> specially on the

"the the" typo.

src/hotspot/share/gc/g1/g1CardTable.inline.hpp line 47:

> 45: 
> 46: // Returns bits from a where mask is 0, and bits from b where mask is 1.
> 47: inline size_t blend(size_t a, size_t b, size_t mask) {

Can you provide some input/output examples in the doc?

src/hotspot/share/gc/g1/g1CardTableClaimTable.cpp line 45:

> 43: }
> 44: 
> 45: void G1CardTableClaimTable::initialize(size_t max_reserved_regions) {

Should the arg be `uint`?

src/hotspot/share/gc/g1/g1ConcurrentRefine.cpp line 280:

> 278:   assert_state(State::SweepRT);
> 279: 
> 280:   set_state_start_time();

This method is called in a loop; would that skew the state-starting time?

src/hotspot/share/gc/g1/g1ConcurrentRefine.cpp line 344:

> 342:     size_t _num_clean;
> 343:     size_t _num_dirty;
> 344:     size_t _num_to_cset;

Seem never read.

src/hotspot/share/gc/g1/g1ConcurrentRefine.cpp line 349:

> 347: 
> 348:     bool do_heap_region(G1HeapRegion* r) override {
> 349:       if (!r->is_free()) {

I am a bit lost on this closure; the intention seems to set unclaimed to all 
non-free regions, why can't this be done in one go, instead of first setting 
all regions to claimed (`reset_all_claims_to_claimed`), then set non-free ones 
unclaimed?

src/hotspot/share/gc/g1/g1ConcurrentRefine.hpp line 116:

> 114: 
> 115:   // Current heap snapshot.
> 116:   G1CardTableClaimTable* _sweep_state;

Since this is a table, I wonder if we can name it "x_table" instead of 
"x_state".

src/hotspot/share/gc/g1/g1RemSet.cpp line 147:

> 145:     if (_contains[region]) {
> 146:       return;
> 147:     }

Indentation seems broken.

src/hotspot/share/gc/g1/g1RemSet.cpp line 830:

> 828:         size_t const start_idx = region_card_base_idx + claim.value();
> 829: 
> 830:         size_t* card_cur_card = 
> (size_t*)card_table->byte_for_index(start_idx);

This var name should end with "_word", instead of "_card".

src/hotspot/share/gc/g1/g1RemSet.cpp line 1252:

> 1250:     G1ConcurrentRefineWorkState::snapshot_heap_into(&constructed);
> 1251:     claim = &constructed;
> 1252:   }

It's not super obvious to me why the "has_sweep_claims" checking needs to be on 
this level. Can `G1ConcurrentRefineWorkState` return a valid 
`G1CardTableClaimTable*` directly?

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1974124792
PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1971426039
PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1973435950
PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1974083760
PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1973447654
PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1973452168
PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1974056492
PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1973423400
PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1974108760
PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1974134441

Re: RFR: 8342382: Implementation of JEP G1: Improve Application Throughput with a More Efficient Write-Barrier [v2]

Reply via email to