Re: RFR: 8342382: Implementation of JEP G1: Improve Application Throughput with a More Efficient Write-Barrier [v9]

Martin Doerr Tue, 04 Mar 2025 02:41:35 -0800

On Tue, 4 Mar 2025 09:57:56 GMT, Thomas Schatzl <[email protected]> wrote:


>> Hi all,
>> 
>>   please review this change that implements (currently Draft) JEP: G1: 
>> Improve Application Throughput with a More Efficient Write-Barrier.
>> 
>> The reason for posting this early is that this is a large change, and the 
>> JEP process is already taking very long with no end in sight but we would 
>> like to have this ready by JDK 25.
>> 
>> ### Current situation
>> 
>> With this change, G1 reduces its post-write barrier so that it much more 
>> closely resembles Parallel GC's, as described in the JEP. The reason is that 
>> G1 lags behind Parallel/Serial GC in throughput because of its larger barrier.
>> 
>> The main reason for the current barrier is how G1 implements concurrent 
>> refinement:
>> * G1 tracks dirtied cards using sets (dirty card queue set - dcqs) of 
>> buffers (dirty card queues - dcq) containing the location of dirtied cards. 
>> Refinement threads pick up their contents to re-refine. The barrier needs to 
>> enqueue card locations.
>> * For correctness, dirty card updates require fine-grained synchronization 
>> between mutator and refinement threads.
>> * Finally, there is generic code to avoid dirtying cards altogether 
>> (filters), to avoid executing the synchronization and the enqueuing as much 
>> as possible.
>> 
>> These tasks require the current barrier to look as follows for an assignment 
>> `x.a = y` in pseudo code:
>> 
>> 
>> // Filtering
>> if (region(@x.a) == region(y)) goto done; // same region check
>> if (y == null) goto done;     // null value check
>> if (card(@x.a) == young_card) goto done;  // write to young gen check
>> StoreLoad;                // synchronize
>> if (card(@x.a) == dirty_card) goto done;
>> 
>> *card(@x.a) = dirty_card;
>> 
>> // Card tracking
>> enqueue(card-address(@x.a)) into thread-local-dcq;
>> if (thread-local-dcq is not full) goto done;
>> 
>> call runtime to move thread-local-dcq into dcqs
>> 
>> done:
>> 
>> 
>> Overall this post-write barrier alone is in the range of 40-50 total 
>> instructions, compared to three or four(!) for Parallel and Serial GC.
>> 
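For readers who want to step through the quoted pseudo code, here is a minimal single-threaded sketch of the same control flow in Python. It is only a model, not HotSpot code: the card and region sizes, the card values, the queue capacity, and all names (`BarrierState`, `post_write_barrier`, the returned labels) are assumptions made for illustration, and the StoreLoad fence is reduced to a comment because there is no concurrent refinement thread in the model.

```python
CARD_SHIFT = 9        # assumption: 512-byte cards
REGION_SHIFT = 20     # assumption: 1 MiB regions
CLEAN, DIRTY, YOUNG = 0, 1, 2   # illustrative card values, not HotSpot's encodings
DCQ_CAPACITY = 4      # hypothetical thread-local queue capacity


class BarrierState:
    """Holds the card table, one thread-local dcq, and the shared dcqs."""

    def __init__(self):
        self.card_table = {}   # card index -> card value; missing means CLEAN
        self.local_dcq = []    # thread-local dirty card queue
        self.dcqs = []         # dirty card queue set: list of full buffers


def post_write_barrier(state, field_addr, new_value_addr):
    """Model the barrier for `x.a = y`; null is modelled as address 0.

    Returns a label naming the step at which the barrier finished.
    """
    card = field_addr >> CARD_SHIFT

    # Filtering
    if (field_addr >> REGION_SHIFT) == (new_value_addr >> REGION_SHIFT):
        return "same-region"
    if new_value_addr == 0:
        return "null"
    if state.card_table.get(card, CLEAN) == YOUNG:
        return "young"
    # A StoreLoad fence would go here in the real barrier.
    if state.card_table.get(card, CLEAN) == DIRTY:
        return "already-dirty"

    state.card_table[card] = DIRTY

    # Card tracking
    state.local_dcq.append(card)
    if len(state.local_dcq) < DCQ_CAPACITY:
        return "enqueued"

    # Stand-in for the runtime call that hands the full buffer to the dcqs.
    state.dcqs.append(state.local_dcq)
    state.local_dcq = []
    return "flushed"
```

Calling `post_write_barrier` twice with the same cross-region store shows the filter at work: the first call dirties and enqueues the card, the second one exits at the dirty-card check without touching the queue.
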
>> The large inlined barrier not only has a large code footprint, 
>> but also prevents some compiler optimizations like loop unrolling or 
>> inlining.
>> 
>> There are several papers showing that this barrier alone can decrease 
>> throughput by 10-20% 
>> ([Yang12](https://dl.acm.org/doi/10.1145/2426642.2259004)), which is 
>> corroborated by some benchmarks (see links).
>> 
>> The main idea for this change is to not use fine-grained synchronization 
>> between refinement and mutator threads, but coarse grained based on 
>> atomically switching c...
>
> Thomas Schatzl has updated the pull request incrementally with one additional 
> commit since the last revision:
> 
>   * iwalulya review 2
>     * G1ConcurrentRefineWorkState -> G1ConcurrentRefineSweepState
>     * some additional documentation

I got an error while testing java/foreign/TestUpcallStress.java on 
linux-aarch64 with this PR:

#  Internal Error (/openjdk-jdk-linux_aarch64-dbg/jdk/src/hotspot/share/gc/g1/g1CardTable.cpp:56), pid=19044, tid=19159
#  guarantee(!failures) failed: there should not have been any failures
...
V  [libjvm.so+0xb6e988]  G1CardTable::verify_region(MemRegion, unsigned char, bool)+0x3b8  (g1CardTable.cpp:56)
V  [libjvm.so+0xc3a10c]  G1MergeHeapRootsTask::G1ClearBitmapClosure::do_heap_region(G1HeapRegion*)+0x13c  (g1RemSet.cpp:1048)
V  [libjvm.so+0xb7a80c]  G1CollectedHeap::par_iterate_regions_array(G1HeapRegionClosure*, G1HeapRegionClaimer*, unsigned int const*, unsigned long, unsigned int) const+0x9c  (g1CollectedHeap.cpp:2059)
V  [libjvm.so+0xc49fe8]  G1MergeHeapRootsTask::work(unsigned int)+0x708  (g1RemSet.cpp:1225)
V  [libjvm.so+0x19597bc]  WorkerThread::run()+0x98  (workerThread.cpp:69)
V  [libjvm.so+0x1824510]  Thread::call_run()+0xac  (thread.cpp:231)
V  [libjvm.so+0x13b3994]  thread_native_entry(Thread*)+0x130  (os_linux.cpp:877)
C  [libpthread.so.0+0x875c]  start_thread+0x18c

-------------

PR Comment: https://git.openjdk.org/jdk/pull/23739#issuecomment-2697024679
