felipecrv commented on PR #40647:
URL: https://github.com/apache/arrow/pull/40647#issuecomment-2008561783
I pushed some re-ordering of loads and stores that I believe can work better
on CPUs with higher latency in the memory system [1]. Note that my code now updates
`max_memory_` correctly (I removed the comment about "no need to be rigorous").
The `lock cmpxchg` that I introduced never shows up in profiles, so I'm keeping
it. I also suspect it would be beneficial on contended workloads because we can
give up on updating `max_memory_` if another thread increases it before the
current thread does.
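For reference, the `max_memory_` update I'm describing boils down to a compare-exchange loop along these lines (a sketch, not the literal code in the PR; I'm assuming `max_memory_` is a `std::atomic<int64_t>`):
```cpp
#include <atomic>
#include <cstdint>

// Sketch: publish a new high-water mark, giving up as soon as another thread
// has already published a value that is at least as large.
void UpdateMaxMemory(std::atomic<int64_t>& max_memory, int64_t new_total) {
  int64_t observed = max_memory.load(std::memory_order_relaxed);
  while (observed < new_total) {
    // `lock cmpxchg` on x86: either we install new_total, or `observed` is
    // refreshed with the value some other thread installed concurrently.
    if (max_memory.compare_exchange_weak(observed, new_total)) {
      break;  // we raised the maximum
    }
    // If the competing value already covers new_total, the loop condition
    // fails and we give up without writing anything.
  }
}
```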
When benchmarking on an Intel CPU, this version is no cheaper than the baseline,
but it should improve the situation on CPUs that spend more time in the load
issued by `bytes_allocated_.fetch_add(diff)` (the `lock xadd` instruction on
x86) [1]. My hypothesis is that by not immediately using the result of the
`xadd`, the CPU can wait for the memory system in the background while
performing other operations. It also clusters all the `lock`-prefixed
instructions together, which is a recommended practice.
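Put differently, the ordering I'm aiming for looks roughly like this (a sketch, not the exact code in the PR; `DidAllocate`, `total_allocated_bytes_`, and `num_allocations_` are guesses based on the offsets in the disassembly below, memory orderings are illustrative, and `UpdateMaxMemory` is the loop sketched above):
```cpp
#include <atomic>
#include <cstdint>

void UpdateMaxMemory(std::atomic<int64_t>&, int64_t);  // CAS loop sketched above

struct PoolStats {
  std::atomic<int64_t> max_memory_{0};
  std::atomic<int64_t> bytes_allocated_{0};
  std::atomic<int64_t> total_allocated_bytes_{0};
  std::atomic<int64_t> num_allocations_{0};

  void DidAllocate(int64_t diff) {
    // Load the current maximum early: its value is only consumed at the very
    // end, so the load can overlap with the atomic RMWs below.
    const int64_t old_max = max_memory_.load(std::memory_order_relaxed);
    // `lock xadd`: issue the increment but don't consume its result yet.
    const int64_t old_bytes = bytes_allocated_.fetch_add(diff);
    // Cluster the remaining `lock`-prefixed RMWs; none of them feeds the next.
    total_allocated_bytes_.fetch_add(diff);
    num_allocations_.fetch_add(1);
    // Only now does the xadd result get used.
    const int64_t new_total = old_bytes + diff;
    if (old_max < new_total) {
      UpdateMaxMemory(max_memory_, new_total);
    }
  }
};
```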
Annotated `perf report` of
`arrow::memory_pool::internal::JemallocAllocator::AllocateAligned` after these
changes:
```asm
# Percent│ push %rbp
│ push %r15
│ push %r14
│ push %r13
│ push %r12
│ push %rbx
│ push %rax
│ mov %r8,%r13
│ mov %rcx,%rbx
│ mov %rdx,%r12
│ mov %rsi,%r15
│ mov %rdi,%r14
│ incb 0x13f6583(%rip) # 61bc853 <__TMC_END__+0x13935b>
1 5.31 │ xor %edi,%edi
│ mov %rdx,%rsi
│ → call __sanitizer_cov_trace_const_cmp8@plt
│ test %r12,%r12
│ ↓ js 6c
1 5.56 │ incb 0x13f6570(%rip) # 61bc855 <__TMC_END__+0x13935d>
│ mov %rsp,%rdi
│ mov %r12,%rsi
│ mov %rbx,%rdx
│ mov %r13,%rcx
│ → call arrow::memory_pool::internal::JemallocAllocator::AllocateAligned@plt
│ test %r14,%r14
│ ↓ je 136
│ incb 0x13f6552(%rip) # 61bc857 <__TMC_END__+0x13935f>
│ mov (%rsp),%rax
│ mov %rax,(%r14)
│ test %rax,%rax
│ ↓ je 8b
│ incb 0x13f6541(%rip) # 61bc858 <__TMC_END__+0x139360>
│ ↓ jmp 11e
│ 6c: incb 0x13f6532(%rip) # 61bc854 <__TMC_END__+0x13935c>
│ lea typeinfo name for arrow::json::TableReader+0x22a,%rdx
│ mov %r14,%rdi
│ mov $0x4,%esi
│ → call arrow::Status::FromArgs<char const (&) [21]>@plt
│ ↓ jmp 11e
│ 8b: incb 0x13f6518(%rip) # 61bc859 <__TMC_END__+0x139361>
1 5.54 │ mov 0x40(%r15),%rbx
│ mov %r12,%rsi
4 22.34 │ lock xadd %rsi,0x48(%r15)
4 22.47 │ lock add %r12,0x50(%r15)
6 33.09 │ lock incq 0x58(%r15)
│ mov %rsi,%r13
│ add %r12,%r13
│ ↓ jo 14a
│ incb 0x13f64f1(%rip) # 61bc85b <__TMC_END__+0x139363>
│ mov %rbx,%rdi
│ mov %r13,%rsi
│ → call __sanitizer_cov_trace_cmp8@plt
│ cmp %r13,%rbx
1 5.69 │ ↓ jge 103
│ incb 0x13f64dd(%rip) # 61bc85d <__TMC_END__+0x139365>
│ d0: incb 0x13f64d8(%rip) # 61bc85e <__TMC_END__+0x139366>
│ mov %rbx,%rax
│ lock cmpxchg %r13,0x40(%r15)
│ sete %bpl
│ mov %rax,%rbx
│ mov %rax,%rdi
│ mov %r13,%rsi
│ → call __sanitizer_cov_trace_cmp8@plt
│ test %bpl,%bpl
│ ↓ jne 10b
│ cmp %r13,%rbx
│ ↓ jge 10b
│ incb 0x13f64af(%rip) # 61bc860 <__TMC_END__+0x139368>
│ ↑ jmp d0
│103: incb 0x13f64a3(%rip) # 61bc85c <__TMC_END__+0x139364>
│ ↓ jmp 111
│10b: incb 0x13f649e(%rip) # 61bc85f <__TMC_END__+0x139367>
│111: incb 0x13f649a(%rip) # 61bc861 <__TMC_END__+0x139369>
│ movq $0x0,(%r14)
│11e: incb 0x13f648e(%rip) # 61bc862 <__TMC_END__+0x13936a>
│ mov %r14,%rax
│ add $0x8,%rsp
│ pop %rbx
│ pop %r12
│ pop %r13
│ pop %r14
│ pop %r15
│ pop %rbp
│ ← ret
```
Benchmarks (based on the old code) on a Zen 3 CPU show that the CPU can get
stuck waiting for the value produced by `lock xadd` instead of progressing:
```asm
0.18 | lock xadd %rax,(%rdi)
80.73 | add %rsi,%rax
```
Doing useful work that doesn't depend on `%rax` (the result of `lock xadd`)
should hide the latency of that atomic operation's round trip through the memory system.
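In source terms, the difference is roughly between consuming the `fetch_add` result on the very next instruction versus deferring that use (a sketch of the two patterns, not the actual diff):
```cpp
// Before: the `add` depends on the value returned by `lock xadd`, so the
// core stalls until the atomic RMW has completed its round trip.
int64_t new_total_before = bytes_allocated_.fetch_add(diff) + diff;

// After: issue the RMW, do independent work, and only then form the total.
int64_t old_bytes = bytes_allocated_.fetch_add(diff);
// ... other independent updates here ...
int64_t new_total_after = old_bytes + diff;
```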
From [1]:
> In Zen 3, a single 32MB L3 cache pool is shared among all 8 cores in a chiplet, vs. Zen 2's two 16MB pools each shared among 4 cores in a core complex, of which there were two per chiplet. This new arrangement improves the cache hit rate as well as performance in situations that require cache data to be exchanged among cores, but increases cache latency from 39 cycles in Zen 2 to 46 clock cycles and halves per-core cache bandwidth, although both problems are partially mitigated by higher clock speeds.
[1] https://en.wikipedia.org/wiki/Zen_3#Features