On Tue, 25 Feb 2025 15:13:43 GMT, Thomas Schatzl <tscha...@openjdk.org> wrote:
>> Hi all, >> >> please review this change that implements (currently Draft) JEP: G1: >> Improve Application Throughput with a More Efficient Write-Barrier. >> >> The reason for posting this early is that this is a large change, and the >> JEP process is already taking very long with no end in sight but we would >> like to have this ready by JDK 25. >> >> ### Current situation >> >> With this change, G1 will reduce the post write barrier to much more >> resemble Parallel GC's as described in the JEP. The reason is that G1 lacks >> in throughput compared to Parallel/Serial GC due to larger barrier. >> >> The main reason for the current barrier is how g1 implements concurrent >> refinement: >> * g1 tracks dirtied cards using sets (dirty card queue set - dcqs) of >> buffers (dirty card queues - dcq) containing the location of dirtied cards. >> Refinement threads pick up their contents to re-refine. The barrier needs to >> enqueue card locations. >> * For correctness dirty card updates requires fine-grained synchronization >> between mutator and refinement threads, >> * Finally there is generic code to avoid dirtying cards altogether >> (filters), to avoid executing the synchronization and the enqueuing as much >> as possible. >> >> These tasks require the current barrier to look as follows for an assignment >> `x.a = y` in pseudo code: >> >> >> // Filtering >> if (region(@x.a) == region(y)) goto done; // same region check >> if (y == null) goto done; // null value check >> if (card(@x.a) == young_card) goto done; // write to young gen check >> StoreLoad; // synchronize >> if (card(@x.a) == dirty_card) goto done; >> >> *card(@x.a) = dirty >> >> // Card tracking >> enqueue(card-address(@x.a)) into thread-local-dcq; >> if (thread-local-dcq is not full) goto done; >> >> call runtime to move thread-local-dcq into dcqs >> >> done: >> >> >> Overall this post-write barrier alone is in the range of 40-50 total >> instructions, compared to three or four(!) for parallel and serial gc. >> >> The large size of the inlined barrier not only has a large code footprint, >> but also prevents some compiler optimizations like loop unrolling or >> inlining. >> >> There are several papers showing that this barrier alone can decrease >> throughput by 10-20% >> ([Yang12](https://dl.acm.org/doi/10.1145/2426642.2259004)), which is >> corroborated by some benchmarks (see links). >> >> The main idea for this change is to not use fine-grained synchronization >> between refinement and mutator threads, but coarse grained based on >> atomically switching c... > > Thomas Schatzl has updated the pull request incrementally with one additional > commit since the last revision: > > * remove unnecessarily added logging src/hotspot/share/gc/g1/g1BarrierSet.hpp line 54: > 52: // them, keeping the write barrier simple. > 53: // > 54: // The refinement threads mark cards in the the current collection set > specially on the "the the" typo. src/hotspot/share/gc/g1/g1CardTable.inline.hpp line 47: > 45: > 46: // Returns bits from a where mask is 0, and bits from b where mask is 1. > 47: inline size_t blend(size_t a, size_t b, size_t mask) { Can you provide some input/output examples in the doc? src/hotspot/share/gc/g1/g1CardTableClaimTable.cpp line 45: > 43: } > 44: > 45: void G1CardTableClaimTable::initialize(size_t max_reserved_regions) { Should the arg be `uint`? src/hotspot/share/gc/g1/g1ConcurrentRefine.cpp line 280: > 278: assert_state(State::SweepRT); > 279: > 280: set_state_start_time(); This method is called in a loop; would that skew the state-starting time? src/hotspot/share/gc/g1/g1ConcurrentRefine.cpp line 344: > 342: size_t _num_clean; > 343: size_t _num_dirty; > 344: size_t _num_to_cset; Seem never read. src/hotspot/share/gc/g1/g1ConcurrentRefine.cpp line 349: > 347: > 348: bool do_heap_region(G1HeapRegion* r) override { > 349: if (!r->is_free()) { I am a bit lost on this closure; the intention seems to set unclaimed to all non-free regions, why can't this be done in one go, instead of first setting all regions to claimed (`reset_all_claims_to_claimed`), then set non-free ones unclaimed? src/hotspot/share/gc/g1/g1ConcurrentRefine.hpp line 116: > 114: > 115: // Current heap snapshot. > 116: G1CardTableClaimTable* _sweep_state; Since this is a table, I wonder if we can name it "x_table" instead of "x_state". src/hotspot/share/gc/g1/g1RemSet.cpp line 147: > 145: if (_contains[region]) { > 146: return; > 147: } Indentation seems broken. src/hotspot/share/gc/g1/g1RemSet.cpp line 830: > 828: size_t const start_idx = region_card_base_idx + claim.value(); > 829: > 830: size_t* card_cur_card = > (size_t*)card_table->byte_for_index(start_idx); This var name should end with "_word", instead of "_card". src/hotspot/share/gc/g1/g1RemSet.cpp line 1252: > 1250: G1ConcurrentRefineWorkState::snapshot_heap_into(&constructed); > 1251: claim = &constructed; > 1252: } It's not super obvious to me why the "has_sweep_claims" checking needs to be on this level. Can `G1ConcurrentRefineWorkState` return a valid `G1CardTableClaimTable*` directly? ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1974124792 PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1971426039 PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1973435950 PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1974083760 PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1973447654 PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1973452168 PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1974056492 PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1973423400 PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1974108760 PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1974134441