felipecrv commented on PR #40647:
URL: https://github.com/apache/arrow/pull/40647#issuecomment-2008561783
I pushed some re-ordering of loads and stores that I believe can work better
on CPUs with higher latency in the memory system [1]. Note that my code now updates
`max_memory_` correctly (I removed the comment about "no need to be rigorous").
The `lock cmpxchg` that I introduced never shows up in profiles, so I'm keeping
it. I also suspect it would be beneficial on contended workloads because we can
give up on updating `max_memory_` if another thread increases it before the
current thread does.
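For reference, the `max_memory_` update I'm describing boils down to a compare-exchange loop along these lines (a sketch, not the literal code in the PR; I'm assuming `max_memory_` is a `std::atomic<int64_t>`):
```cpp
#include <atomic>
#include <cstdint>

// Sketch: publish a new high-water mark, giving up as soon as another thread
// has already published a value that is at least as large.
void UpdateMaxMemory(std::atomic<int64_t>& max_memory, int64_t new_total) {
  int64_t observed = max_memory.load(std::memory_order_relaxed);
  while (observed < new_total) {
    // `lock cmpxchg` on x86: either we install new_total, or `observed` is
    // refreshed with the value some other thread installed concurrently.
    if (max_memory.compare_exchange_weak(observed, new_total)) {
      break;  // we raised the maximum
    }
    // If the competing value already covers new_total, the loop condition
    // fails and we give up without writing anything.
  }
}
```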
When benchmarking on an Intel CPU, this version is no cheaper than the baseline,
but it should improve the situation on CPUs that spend more time in the load
issued by `bytes_allocated_.fetch_add(diff)` (the `lock xadd` instruction on
x86) [1]. My hypothesis is that by not immediately using the result of the
`xadd`, the CPU can wait for the memory system in the background while
performing other operations. It also clusters all the `lock`-prefixed
instructions together, which is a recommended practice.
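Put differently, the ordering I'm aiming for looks roughly like this (a sketch, not the exact code in the PR; `DidAllocate`, `total_allocated_bytes_`, and `num_allocations_` are guesses based on the offsets in the disassembly below, memory orderings are illustrative, and `UpdateMaxMemory` is the loop sketched above):
```cpp
#include <atomic>
#include <cstdint>

void UpdateMaxMemory(std::atomic<int64_t>&, int64_t);  // CAS loop sketched above

struct PoolStats {
  std::atomic<int64_t> max_memory_{0};
  std::atomic<int64_t> bytes_allocated_{0};
  std::atomic<int64_t> total_allocated_bytes_{0};
  std::atomic<int64_t> num_allocations_{0};

  void DidAllocate(int64_t diff) {
    // Load the current maximum early: its value is only consumed at the very
    // end, so the load can overlap with the atomic RMWs below.
    const int64_t old_max = max_memory_.load(std::memory_order_relaxed);
    // `lock xadd`: issue the increment but don't consume its result yet.
    const int64_t old_bytes = bytes_allocated_.fetch_add(diff);
    // Cluster the remaining `lock`-prefixed RMWs; none of them feeds the next.
    total_allocated_bytes_.fetch_add(diff);
    num_allocations_.fetch_add(1);
    // Only now does the xadd result get used.
    const int64_t new_total = old_bytes + diff;
    if (old_max < new_total) {
      UpdateMaxMemory(max_memory_, new_total);
    }
  }
};
```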
Annotated `perf report` of
`arrow::memory_pool::internal::JemallocAllocator::AllocateAligned` after these
changes:
```asm
# Percent│ push %rbp
│ push %r15
│ push %r14
│ push %r13
│ push %r12
│ push %rbx
│ push %rax
│ mov %r8,%r13
│ mov %rcx,%rbx
│ mov %rdx,%r12
│ mov %rsi,%r15
│ mov %rdi,%r14
│ incb 0x13f6583(%rip) # 61bc853 <__TMC_END__+0x13935b>
1 5.31 │ xor %edi,%edi
│ mov %rdx,%rsi
│ → call __sanitizer_cov_trace_const_cmp8@plt
│ test %r12,%r12
│ ↓ js 6c
1 5.56 │ incb 0x13f6570(%rip) # 61bc855 <__TMC_END__+0x13935d>
│ mov %rsp,%rdi
│ mov %r12,%rsi
│ mov %rbx,%rdx
│ mov %r13,%rcx
│ → call arrow::memory_pool::internal::JemallocAllocator::AllocateAligned@plt
│ test %r14,%r14
│ ↓ je 136
│ incb 0x13f6552(%rip) # 61bc857 <__TMC_END__+0x13935f>
│ mov (%rsp),%rax
│ mov %rax,(%r14)
│ test %rax,%rax
│ ↓ je 8b
│ incb 0x13f6541(%rip) # 61bc858 <__TMC_END__+0x139360>
│ ↓ jmp 11e
│ 6c: incb 0x13f6532(%rip) # 61bc854 <__TMC_END__+0x13935c>
│ lea typeinfo name for arrow::json::TableReader+0x22a,%rdx
│ mov %r14,%rdi
│ mov $0x4,%esi
│ → call arrow::Status::FromArgs<char const (&) [21]>@plt
│ ↓ jmp 11e
│ 8b: incb 0x13f6518(%rip) # 61bc859 <__TMC_END__+0x139361>
1 5.54 │ mov 0x40(%r15),%rbx
│ mov %r12,%rsi
4 22.34 │ lock xadd %rsi,0x48(%r15)
4 22.47 │ lock add %r12,0x50(%r15)
6 33.09 │ lock incq 0x58(%r15)
│ mov %rsi,%r13
│ add %r12,%r13
│ ↓ jo 14a
│ incb 0x13f64f1(%rip) # 61bc85b <__TMC_END__+0x139363>
│ mov %rbx,%rdi
│ mov %r13,%rsi
│ → call __sanitizer_cov_trace_cmp8@plt
│ cmp %r13,%rbx
1 5.69 │ ↓ jge 103
│ incb 0x13f64dd(%rip) # 61bc85d <__TMC_END__+0x139365>
│ d0: incb 0x13f64d8(%rip) # 61bc85e <__TMC_END__+0x139366>
│ mov %rbx,%rax
│ lock cmpxchg %r13,0x40(%r15)
│ sete %bpl
│ mov %rax,%rbx
│ mov %rax,%rdi
│ mov %r13,%rsi
│ → call __sanitizer_cov_trace_cmp8@plt
│ test %bpl,%bpl
│ ↓ jne 10b
│ cmp %r13,%rbx
│ ↓ jge 10b
│ incb 0x13f64af(%rip) # 61bc860 <__TMC_END__+0x139368>
│ ↑ jmp d0
│103: incb 0x13f64a3(%rip) # 61bc85c <__TMC_END__+0x139364>
│ ↓ jmp 111
│10b: incb 0x13f649e(%rip) # 61bc85f <__TMC_END__+0x139367>
│111: incb 0x13f649a(%rip) # 61bc861 <__TMC_END__+0x139369>
│ movq $0x0,(%r14)
│11e: incb 0x13f648e(%rip) # 61bc862 <__TMC_END__+0x13936a>
│ mov %r14,%rax
│ add $0x8,%rsp
│ pop %rbx
│ pop %r12
│ pop %r13
│ pop %r14
│ pop %r15
│ pop %rbp
│ ← ret
```
Benchmarks (based on the old code) on a Zen 3 CPU show that the CPU can get
stuck waiting for the value produced by `lock xadd` instead of progressing:
```asm
0.18 | lock xadd %rax,(%rdi)
80.73 | add %rsi,%rax
```
Doing useful work that doesn't depend on `%rax` (the result of `lock xadd`)
should hide the latency of that atomic operation's round trip through the memory system.
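In source terms, the difference is roughly between consuming the `fetch_add` result on the very next instruction versus deferring that use (a sketch of the two patterns, not the actual diff):
```cpp
// Before: the `add` depends on the value returned by `lock xadd`, so the
// core stalls until the atomic RMW has completed its round trip.
int64_t new_total_before = bytes_allocated_.fetch_add(diff) + diff;

// After: issue the RMW, do independent work, and only then form the total.
int64_t old_bytes = bytes_allocated_.fetch_add(diff);
// ... other independent updates here ...
int64_t new_total_after = old_bytes + diff;
```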
From [1]:
> In Zen 3, a single 32MB L3 cache pool is shared among all 8 cores in a chiplet, vs. Zen 2's two 16MB pools each shared among 4 cores in a core complex, of which there were two per chiplet. This new arrangement improves the cache hit rate as well as performance in situations that require cache data to be exchanged among cores, but increases cache latency from 39 cycles in Zen 2 to 46 clock cycles and halves per-core cache bandwidth, although both problems are partially mitigated by higher clock speeds.
[1] https://en.wikipedia.org/wiki/Zen_3#Features