Thanks a lot for your answer and for the confirmation that my understanding is correct.
On Wed, Feb 5, 2025 at 12:30 PM Aleksey Shipilev <[email protected]> wrote:

> On 2/3/25 12:06, Peter Veentjer wrote:
> > Imagine the following code:
> >
> >     ... lots of writes to the buffer
> >     buffer.putInt(a_offset, a_value)       (1)
> >     buffer.putRelease(b_offset, b_value)   (2)
> >     releaseFence()                         (3)
> >     buffer.putInt(c_offset, c_value)       (4)
> >
> > Buffer is a chunk of memory that is shared with another process, and the writes need to be seen
> > in order. So when 'b' is seen, 'a' should be seen. And when 'c' is seen, 'b' should be seen.
> > There is no other synchronization.
> >
> > All offsets are guaranteed to be naturally aligned. All the putInts are plain puts (using Unsafe).
> >
> > The putRelease (2) will ensure that 'a' is seen before 'b', and it will ensure atomicity and
> > visibility of 'b' (so the appropriate compiler and memory fences where needed).
> >
> > The releaseFence (3) will ensure that 'b' is seen before 'c'.
>
> Looks to me this fence can be replaced with a releasing store of 'c':
>
>     buffer.putInt(a_offset, a_value)
>     buffer.putRelease(b_offset, b_value)
>     buffer.putRelease(c_offset, c_value)
>
> My preference is almost always to avoid explicit fences if you can control the memory ordering of
> the actual accesses. Using putRelease instead of an explicit fence also forces you to think about
> the symmetries: should all loads of 'c' be performed with getAcquire to match the putRelease?
>
> > My question is about (4). Since it is a plain store, the compiler can do a ton of trickery,
> > including delaying the visibility of (4). Is my understanding correct, and is there anything
> > else that could go wrong?
>
> The common wisdom is indeed "let's use a non-plain memory access mode, so the access is hopefully
> more prompt", but I have not seen any of these effects thoroughly quantified beyond "let's forbid
> the compiler to yank our access out of the loop". Maybe I have not looked hard enough.
> I suspect the delays introduced by the compiler moving code around in sequential code streams are
> on a scale where they do not matter all that much for end-to-end latency. The only (?) place where
> code movement impact could be multiplied into a macro-effect is when the memory ops shift
> in/out/around loops. I would not be overly concerned about the latency impact of reordering within
> a short straight code stream.
>
> You can try to measure it with producer-consumer / ping-pong style benchmarks: put more memory ops
> around (4), turn on the instruction scheduler randomizers (-XX:+StressLCM should be useful here,
> maybe -XX:+StressGCM), and see if there is an impact. I suspect the effect is too fine-grained to
> be accurately measured with direct timing measurements, so you'll need to get creative about how
> to measure "promptness".
>
> > What would be the lowest memory access mode that would resolve this problem? My guess is that
> > the last putInt should be a putIntOpaque.
>
> Yes, in current HotSpot, opaque would effectively pin the access in place, so it would be exposed
> to hardware in an order closer to the original source code order. Then it is up to the hardware to
> decide when to perform the store. But as I said above, I'll be surprised if it actually matters.
>
> Thanks,
> -Aleksey

--
You received this message because you are subscribed to the Google Groups "mechanical-sympathy" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
To view this discussion, visit https://groups.google.com/d/msgid/mechanical-sympathy/CAGuAWdAsWprk9BK46iJdZ_w1wPBcM4OCkDgCLTAP98B4VCPscw%40mail.gmail.com.
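[Editor's note] Both shapes discussed in the thread can be sketched with VarHandles instead of Unsafe. This is only an illustration, not the original code: the `int[]` buffer, the slot indices, and the class/method names are hypothetical stand-ins for the shared inter-process buffer in the question.

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;

public class ReleaseChain {
    // Element VarHandle over an int[] — a stand-in for the Unsafe-backed buffer.
    static final VarHandle INT = MethodHandles.arrayElementVarHandle(int[].class);

    static final int A = 0, B = 1, C = 2; // hypothetical slot indices

    // Aleksey's suggestion: replace the explicit fence with a releasing store of 'c'.
    static void writer(int[] buf, int a, int b, int c) {
        INT.set(buf, A, a);         // plain store of 'a'
        INT.setRelease(buf, B, b);  // 'a' becomes visible no later than 'b'
        INT.setRelease(buf, C, c);  // 'b' becomes visible no later than 'c'
    }

    // The original shape, with the last store made opaque so the compiler
    // keeps it pinned close to program order.
    static void writerOpaque(int[] buf, int a, int b, int c) {
        INT.set(buf, A, a);
        INT.setRelease(buf, B, b);
        VarHandle.releaseFence();   // orders 'b' before 'c'
        INT.setOpaque(buf, C, c);   // opaque store: not reordered by the compiler
    }

    // The symmetric reader Aleksey alludes to: acquire loads matching the
    // releasing stores, so seeing 'c' implies seeing 'b', which implies 'a'.
    static int[] reader(int[] buf) {
        int c = (int) INT.getAcquire(buf, C);
        int b = (int) INT.getAcquire(buf, B);
        int a = (int) INT.get(buf, A);
        return new int[]{a, b, c};
    }
}
```

Note the design point from the thread: expressing the ordering on the accesses themselves (`setRelease`/`getAcquire`) rather than via standalone fences makes the writer/reader symmetry explicit in the code.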
