On Tue, 4 Mar 2025 11:56:56 GMT, Thomas Schatzl <tscha...@openjdk.org> wrote:

>> Hi all,
>> 
>>   please review this change that implements (currently Draft) JEP: G1: 
>> Improve Application Throughput with a More Efficient Write-Barrier.
>> 
>> The reason for posting this early is that this is a large change, and the 
>> JEP process is already taking very long with no end in sight but we would 
>> like to have this ready by JDK 25.
>> 
>> ### Current situation
>> 
>> With this change, G1 will reduce the post write barrier to much more 
>> resemble Parallel GC's as described in the JEP. The reason is that G1 lacks 
>> in throughput compared to Parallel/Serial GC due to larger barrier.
>> 
>> The main reason for the current barrier is how g1 implements concurrent 
>> refinement:
>> * g1 tracks dirtied cards using sets (dirty card queue set - dcqs) of 
>> buffers (dirty card queues - dcq) containing the location of dirtied cards. 
>> Refinement threads pick up their contents to re-refine. The barrier needs to 
>> enqueue card locations.
>> * For correctness dirty card updates requires fine-grained synchronization 
>> between mutator and refinement threads,
>> * Finally there is generic code to avoid dirtying cards altogether 
>> (filters), to avoid executing the synchronization and the enqueuing as much 
>> as possible.
>> 
>> These tasks require the current barrier to look as follows for an assignment 
>> `x.a = y` in pseudo code:
>> 
>> 
>> // Filtering
>> if (region(@x.a) == region(y)) goto done; // same region check
>> if (y == null) goto done;     // null value check
>> if (card(@x.a) == young_card) goto done;  // write to young gen check
>> StoreLoad;                // synchronize
>> if (card(@x.a) == dirty_card) goto done;
>> 
>> *card(@x.a) = dirty
>> 
>> // Card tracking
>> enqueue(card-address(@x.a)) into thread-local-dcq;
>> if (thread-local-dcq is not full) goto done;
>> 
>> call runtime to move thread-local-dcq into dcqs
>> 
>> done:
>> 
>> 
>> Overall this post-write barrier alone is in the range of 40-50 total 
>> instructions, compared to three or four(!) for parallel and serial gc.
>> 
>> The large size of the inlined barrier not only has a large code footprint, 
>> but also prevents some compiler optimizations like loop unrolling or 
>> inlining.
>> 
>> There are several papers showing that this barrier alone can decrease 
>> throughput by 10-20% 
>> ([Yang12](https://dl.acm.org/doi/10.1145/2426642.2259004)), which is 
>> corroborated by some benchmarks (see links).
>> 
>> The main idea for this change is to not use fine-grained synchronization 
>> between refinement and mutator threads, but coarse grained based on 
>> atomically switching c...
>
> Thomas Schatzl has updated the pull request incrementally with one additional 
> commit since the last revision:
> 
>   iwalulya review
>     * comments for variables tracking to-collection-set and just dirtied 
> cards after GC/refinement
>     * predicate for determining whether the refinement has been disabled
>     * some other typos/comment improvements
>     * renamed _has_xxx_ref to _has_ref_to_xxx to be more consistent with 
> naming

src/hotspot/share/gc/g1/g1ConcurrentRefine.cpp line 356:

> 354:     bool do_heap_region(G1HeapRegion* r) override {
> 355:       if (!r->is_free()) {
> 356:         // Need to scan all parts of non-free regions, so reset the 
> claim.

Why is the condition "is_free"? I thought we scan only old-or-humongous regions?

src/hotspot/share/gc/g1/g1ConcurrentRefine.hpp line 116:

> 114:     SwapGlobalCT,                // Swap global card table.
> 115:     SwapJavaThreadsCT,           // Swap java thread's card tables.
> 116:     SwapGCThreadsCT,             // Swap GC thread's card tables.

Do GC threads have card-table?

src/hotspot/share/gc/g1/g1ConcurrentRefineThread.cpp line 219:

> 217:       // The young gen revising mechanism reads the predictor and the 
> values set
> 218:       // here. Avoid inconsistencies by locking.
> 219:       MutexLocker x(G1RareEvent_lock, Mutex::_no_safepoint_check_flag);

Who else can be in this critical-section? I don't get what this lock is 
protecting us from.

src/hotspot/share/gc/g1/g1ConcurrentRefineThread.hpp line 83:

> 81: 
> 82: public:
> 83:   static G1ConcurrentRefineThread* create(G1ConcurrentRefine* cr);

I wonder if the comment for this class "One or more G1 Concurrent Refinement 
Threads..."  has become obsolete. (AFAICS, this class is a singleton.)

src/hotspot/share/gc/g1/g1ConcurrentRefineWorkTask.cpp line 69:

> 67:     } else if (res == G1RemSet::NoInteresting) {
> 68:       _refine_stats.inc_cards_clean_again();
> 69:     }

A `switch` is probably cleaner.

src/hotspot/share/gc/g1/g1ConcurrentRefineWorkTask.cpp line 78:

> 76:       do_dirty_card(source, dest_card);
> 77:     }
> 78:     return pointer_delta(dirty_r, dirty_l, sizeof(CardValue));

I feel the `pointer_delta` line belongs to the caller. After that, even the 
entire method can be inlined to the caller.

YMMV.

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1979666477
PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1979678325
PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1979699376
PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1979695999
PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1979705019
PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1979709682

Reply via email to