Hi Liang,

Thanks for your interest! Before addressing your specific questions, let me 
summarize our view and strategy regarding performance work. Since the late 
barrier expansion model does not expose all details of the barrier operations 
to C2, there will always be cases where our proposal leads to slight 
inefficiencies at the micro level compared to the current model. Our view is 
that these small inefficiencies are tolerable as long as they do not translate 
into regressions for interesting applications, and will be outweighed anyway by 
barrier optimization work that this JEP enables G1 engineers to perform. With 
this in mind, our strategy is to take the opportunity to re-evaluate the 
optimizations that C2 currently applies to G1 barriers, and re-implement in the 
late barrier expansion model only those which have a demonstrable performance 
effect at the application level (see "Optimizations" subsection in the JEP). 
Many of these optimizations can also be performed in the late barrier expansion 
model, albeit in a more explicit way.

Regarding question a), our current prototype inlines all barrier checks 
together with the corresponding memory access operation ("fast path"), but 
places the runtime calls together with their prologue/epilogue code out-of-line 
in assembly stubs ("slow path"). We plan to re-evaluate, in the context of this 
JEP, the performance effect of moving more parts of the post-barrier to the 
stub, similarly to JDK-8225776.
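
As a rough illustration of this split, here is a minimal Java simulation. It is 
not code from the prototype: the class name, the region and card sizes, the 
queue capacity, and the exact set of checks are assumptions made for the 
example. The checks and the card update stay inline with the store, and only 
the (simulated) runtime call sits in a separate method standing in for the 
out-of-line stub.

```java
import java.util.ArrayList;
import java.util.List;

public class BarrierPathSketch {
    // All constants are assumptions chosen for illustration only.
    static final int REGION_SHIFT = 20;   // 1 MiB regions
    static final int CARD_SHIFT = 9;      // 512-byte cards
    static final int QUEUE_CAPACITY = 2;  // tiny, so the slow path triggers quickly
    static final byte CLEAN = 0, DIRTY = 1;

    final byte[] cardTable = new byte[1 << 12]; // covers a 2 MiB simulated heap
    final List<Integer> queue = new ArrayList<>();
    int runtimeCalls = 0;

    // "Fast path": checks emitted inline, right after the store.
    void postBarrier(long fieldAddr, long newValAddr) {
        if (newValAddr == 0) return;                                   // null store
        if (((fieldAddr ^ newValAddr) >>> REGION_SHIFT) == 0) return;  // same region
        int card = (int) (fieldAddr >>> CARD_SHIFT);
        if (cardTable[card] == DIRTY) return;                          // already marked
        cardTable[card] = DIRTY;
        if (queue.size() == QUEUE_CAPACITY) {
            flushSlowPath();                                           // rare case
        }
        queue.add(card);
    }

    // "Slow path": stands in for the out-of-line stub that saves registers
    // and calls into the runtime.
    void flushSlowPath() {
        runtimeCalls++;
        queue.clear();
    }

    public static void main(String[] args) {
        BarrierPathSketch b = new BarrierPathSketch();
        long otherRegion = 0x100040L;      // lives in region 1
        b.postBarrier(0x1000L, 0L);        // filtered: null
        b.postBarrier(0x1000L, 0x2000L);   // filtered: same region
        b.postBarrier(0x1000L, otherRegion);
        b.postBarrier(0x1200L, otherRegion);
        b.postBarrier(0x1400L, otherRegion);
        System.out.println("runtime calls: " + b.runtimeCalls);
    }
}
```

Moving more of the post-barrier to the stub would correspond, in this sketch, 
to shifting some of the later checks from postBarrier into flushSlowPath.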

Regarding question b), yes, our current prototype always performs precise card 
marking. We have not yet found empirical evidence that the potential 
inefficiencies in the generated code translate into regressions at the 
application level, but are happy to reconsider this if someone knows of an 
interesting application-level benchmark where imprecise card marking makes a 
significant difference.

Hope that answers your questions!

Thanks,

Roberto

________________________________________
From: Liang Mao <maoliang...@alibaba-inc.com>
Sent: Sunday, February 4, 2024 7:43 AM
To: porters-dev; Roberto Castaneda Lozano; adinn
Subject: [External] : Re: Heads-up: Late G1 Barrier Expansion (Draft JEP)

Hi Roberto,

Excited to hear the news about improving the G1 barrier! I have a few 
questions about this proposal:

    a)  ZGC uses late expansion because it has a clear fast path and a 
medium/slow path. The fast path contains only one or two simple instructions, 
so it does not need optimization from C2. The G1 post-barrier has several 
branch checks and does not have clear boundaries between fast and slow paths. 
There could also be optimization opportunities such as JDK-8225776. 
Permanently avoiding C2 optimization might lose performance.

    b)  G1 (like other GCs with a card-table remembered set) uses imprecise 
card marking, which marks the card of the object address instead of the field 
address. If we use late expansion, we only have the field address there and 
therefore have to recompute the object address, which needs additional 
instructions or registers. BTW, I didn't see the details in the prototype 
implementation. We can always use precise card marking in G1 anyway. Imprecise 
card marking has the advantage of eliminating redundant card marks when 
writing into different fields of an object, because the card mark addresses 
are the same. Parallel GC can perform this optimization.

Late expansion could benefit from dominance analysis to remove redundant 
barriers, and traditional ideal optimization can barely help the G1 barrier. 
Looking forward to your reply and progress!

Thanks,
Liang
________________________________________
From: porters-dev <porters-dev-r...@openjdk.org> on behalf of Roberto Castaneda Lozano
Sent: Friday, February 2, 2024 10:37 PM
To: Andrew Dinn <ad...@redhat.com>; porters-dev@openjdk.org
Subject: Re: [External] : Re: Heads-up: Late G1 Barrier Expansion (Draft JEP)

Hi Andrew,

Thanks for your interest! I am unfortunately not very familiar with Shenandoah 
and its barrier model, but in principle late barrier expansion should be 
applicable to any collector where barriers are tightly coupled to individual 
memory access operations and performance does not depend too much on exposing 
barrier operation details to the JIT compiler.

If it helps, our prototype is available here: 
https://github.com/robcasloz/jdk/tree/g1-late-barrier-expansion. 
Please note that this is early, experimental work and might change 
significantly as the JEP evolves.

Thanks,

Roberto

________________________________________
From: Andrew Dinn <ad...@redhat.com>
Sent: Friday, February 2, 2024 2:33 PM
To: Roberto Castaneda Lozano; porters-dev@openjdk.org
Subject: [External] : Re: Heads-up: Late G1 Barrier Expansion (Draft JEP)

Hi Roberto,

On 02/02/2024 13:18, Roberto Castaneda Lozano wrote:
> I have written (together with Erik Österlund) a draft JEP for
> simplifying C2's handling of G1 barriers, see
> https://bugs.openjdk.org/browse/JDK-8322295. This is a heads-up that
> the implementation of this JEP requires platform-specific support from
> all OpenJDK ports. While interpreter G1 barrier implementations are
> available for all ports and can be largely reused, the JEP
> additionally requires 1) defining G1-specific ADL instructions and 2)
> implementing platform-specific logic to support runtime calls from the
> barrier code. For ports that already support ZGC, the effort should be
> smaller, as the logic for 2) can be shared between ZGC and G1.
>
> To give a rough estimation of the required effort, the x86-64 changes
> in our prototype involve approximately 900 line insertions and 300
> line deletions over 9 files, among which approximately 300 deleted and
> inserted lines correspond to logic factored out from ZGC.

I looked at the proposal and was interested in the approach, not least because 
ZGC appears to have traversed the path that this JEP recommends G1 to follow.

Have you considered whether this same approach might be taken with the 
Shenandoah GC? Alternatively, can you declare any basic assumptions regarding 
how G1 operates that are needed to enable this change and which might 
therefore need to be met by Shenandoah?

Of course, access to the prototype code might help answer those questions (at 
least it would help someone better versed in Shenandoah than me) but a 
high-level summary of what in the design of G1 and ZGC makes this approach work 
or, conversely, might make it fail would, if available, be a great help.

regards,


Andrew Dinn
-----------
