rluvaton commented on issue #9083:
URL: https://github.com/apache/arrow-rs/issues/9083#issuecomment-3703789536
## Bottleneck 1: Store bound for 50 columns of non-null u64 array
> Note: using u64 as it has no sign bit to deal with, and no nulls, to focus on a specific code path (`fixed::encode_not_null`)
<details><summary>Benchmark code:</summary>
<p>
```rust
// Imports assumed from the arrow-rs bench setup (`--features=test_utils`, rand 0.9):
use std::{hint, sync::Arc, time::Duration};

use arrow::array::ArrayRef;
use arrow::datatypes::{ArrowPrimitiveType, UInt64Type};
use arrow::row::{RowConverter, SortField};
use arrow::util::bench_util::create_primitive_array_with_seed;
use criterion::{criterion_group, criterion_main, Criterion};
use rand::distr::{Distribution, StandardUniform};

fn do_bench(c: &mut Criterion, name: &str, cols: Vec<ArrayRef>) {
    let fields: Vec<_> = cols
        .iter()
        .map(|x| SortField::new(x.data_type().clone()))
        .collect();

    let mut group = c.benchmark_group("row_format");
    // throughput counted in columns converted per iteration
    group.throughput(criterion::Throughput::Elements(cols.len() as u64));
    group.bench_function(format!("convert_columns {name}"), |b| {
        b.iter(|| {
            let converter = RowConverter::new(fields.clone()).unwrap();
            hint::black_box(converter.convert_columns(&cols).unwrap())
        });
    });
}

fn run_bench_on_multi_column_with_same_fixed_size_data_type_with_no_nulls<T>(
    c: &mut Criterion,
    num_cols: usize,
    batch_size: usize,
) where
    T: ArrowPrimitiveType,
    StandardUniform: Distribution<T::Native>,
{
    let mut seed = 0;
    let mut cols: Vec<ArrayRef> = vec![];
    for _ in 0..num_cols {
        seed += 1;
        // 0.0 null density: every value is valid
        cols.push(
            Arc::new(create_primitive_array_with_seed::<T>(batch_size, 0.0, seed)) as ArrayRef,
        );
    }

    do_bench(
        c,
        format!("{} with {num_cols} columns at size {batch_size}", T::DATA_TYPE).as_str(),
        cols,
    );
}

fn row_bench(c: &mut Criterion) {
    run_bench_on_multi_column_with_same_fixed_size_data_type_with_no_nulls::<UInt64Type>(
        c, 50, 8192,
    );
}

criterion_group! {
    name = benches;
    config = Criterion::default().measurement_time(Duration::from_secs(10));
    targets = row_bench
}
criterion_main!(benches);
```
</p>
</details>
Results on main:
```bash
$ cargo bench --features=test_utils --bench row_format --profile=profiling -- --save-baseline=main
    Finished `profiling` profile [optimized + debuginfo] target(s) in 0.15s
     Running benches/row_format.rs (target/profiling/deps/row_format-7489cf224545c7d6)
Benchmarking row_format/convert_columns UInt64 with 50 columns at size 8192: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 10.0s. You may wish to increase target time to 10.6s, enable flat sampling, or reduce sample count to 60.
Benchmarking row_format/convert_columns UInt64 with 50 columns at size 8192: Collecting 100 samples in estimated 10.585 s (50
row_format/convert_columns UInt64 with 50 columns at size 8192
                        time:   [2.0778 ms 2.0787 ms 2.0799 ms]
                        thrpt:  [24.040 Kelem/s 24.054 Kelem/s 24.063 Kelem/s]
Found 16 outliers among 100 measurements (16.00%)
  6 (6.00%) high mild
  10 (10.00%) high severe
```
> [!NOTE]
> TL;DR: the problem is that we are store bound, and a lot of the store instructions are split across a cacheline boundary.
>
> Bitpacking the nulls from all columns at the start of the row, and allowing fields to be reordered and padded, would let us improve the performance.
<details><summary>Finding the problem step by step:</summary>
<details><summary>Running Top-Down analysis - Level 1</summary>
<p>
```bash
$ ~/pmu-tools/toplev.py --core S0-C0 -l1 --single-thread -v --no-desc taskset -c 0 target/profiling/deps/row_format-7489cf224545c7d6 --bench --warm-up-time 1 --noplot
Benchmarking row_format/convert_columns UInt64 with 50 columns at size 8192: Warming up for 1.0000 s
Warning: Unable to complete 100 samples in 10.0s. You may wish to increase target time to 10.9s, enable flat sampling, or reduce sample count to 60.
Benchmarking row_format/convert_columns UInt64 with 50 columns at size 8192: Collecting 100 samples in estimated 10.932 s (50
row_format/convert_columns UInt64 with 50 columns at size 8192
                        time:   [2.1303 ms 2.1308 ms 2.1313 ms]
                        thrpt:  [23.459 Kelem/s 23.466 Kelem/s 23.471 Kelem/s]
Found 12 outliers among 100 measurements (12.00%)
  5 (5.00%) high mild
  7 (7.00%) high severe
# 5.01-full-perf on Intel(R) Xeon(R) Platinum 8275CL CPU @ 3.00GHz [clx/skylake]
FE       Frontend_Bound   % Slots    0.3  <
BAD      Bad_Speculation  % Slots    0.4  <
BE       Backend_Bound    % Slots   80.3    <==
RET      Retiring         % Slots   19.0  <
MUX                       %        100.00
Run toplev --describe Backend_Bound^ to get more information on bottleneck
Add --run-sample to find locations
Add --nodes '!+Backend_Bound*/2,+MUX' for breakdown.
```
</p>
</details>
It looks like the problem is `Backend_Bound`.
<details><summary>Running Top-Down analysis - Level 2</summary>
<p>
```bash
$ ~/pmu-tools/toplev.py --core S0-C0 -l2 --single-thread -v --no-desc taskset -c 0 target/profiling/deps/row_format-7489cf224545c7d6 --bench --warm-up-time 1 --noplot
Benchmarking row_format/convert_columns UInt64 with 50 columns at size 8192: Warming up for 1.0000 s
Warning: Unable to complete 100 samples in 10.0s. You may wish to increase target time to 11.1s, enable flat sampling, or reduce sample count to 60.
Benchmarking row_format/convert_columns UInt64 with 50 columns at size 8192: Collecting 100 samples in estimated 11.130 s (50
row_format/convert_columns UInt64 with 50 columns at size 8192
                        time:   [2.1779 ms 2.1784 ms 2.1789 ms]
                        thrpt:  [22.947 Kelem/s 22.953 Kelem/s 22.958 Kelem/s]
                 change:
                        time:   [+2.0777% +2.2031% +2.3211%] (p = 0.00 < 0.05)
                        thrpt:  [−2.2684% −2.1556% −2.0354%]
                        Performance has regressed.
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) high mild
  1 (1.00%) high severe
# 5.01-full-perf on Intel(R) Xeon(R) Platinum 8275CL CPU @ 3.00GHz [clx/skylake]
FE       Frontend_Bound                      % Slots    0.5  <  [14.0%]
BAD      Bad_Speculation                     % Slots    0.5  <  [14.0%]
BE       Backend_Bound                       % Slots   78.9     [14.0%]<==
RET      Retiring                            % Slots   19.9  <  [14.0%]
FE       Frontend_Bound.Fetch_Latency        % Slots    0.2  <  [14.0%]
FE       Frontend_Bound.Fetch_Bandwidth      % Slots    0.3  <  [14.0%]
BAD      Bad_Speculation.Branch_Mispredicts  % Slots    0.5  <  [14.0%]
BAD      Bad_Speculation.Machine_Clears      % Slots    0.0  <  [14.0%]
BE/Mem   Backend_Bound.Memory_Bound          % Slots   40.5     [14.0%]
BE/Core  Backend_Bound.Core_Bound            % Slots   38.4     [14.0%]
RET      Retiring.Light_Operations           % Slots   18.2  <  [14.0%]
RET      Retiring.Heavy_Operations           % Slots    1.7  <  [14.0%]
MUX                                          %        14.00
Run toplev --describe Backend_Bound^ to get more information on bottleneck
Add --run-sample to find locations
Add --nodes '!+Backend_Bound*/2,+MUX' for breakdown.
```
</p>
</details>
<details><summary>Running Top-Down analysis - Level 3</summary>
<p>
```bash
$ ~/pmu-tools/toplev.py --core S0-C0 -l3 --single-thread -v --no-desc taskset -c 0 target/profiling/deps/row_format-7489cf224545c7d6 --bench --warm-up-time 1 --noplot
Benchmarking row_format/convert_columns UInt64 with 50 columns at size 8192: Warming up for 1.0000 s
Warning: Unable to complete 100 samples in 10.0s. You may wish to increase target time to 10.9s, enable flat sampling, or reduce sample count to 60.
row_format/convert_columns UInt64 with 50 columns at size 8192
                        time:   [2.1342 ms 2.1347 ms 2.1353 ms]
                        thrpt:  [23.416 Kelem/s 23.422 Kelem/s 23.428 Kelem/s]
                 change:
                        time:   [+0.5527% +0.6704% +0.7973%] (p = 0.00 < 0.05)
                        thrpt:  [−0.7910% −0.6659% −0.5497%]
                        Change within noise threshold.
Found 7 outliers among 100 measurements (7.00%)
  3 (3.00%) high mild
  4 (4.00%) high severe
# 5.01-full-perf on Intel(R) Xeon(R) Platinum 8275CL CPU @ 3.00GHz [clx/skylake]
FE       Frontend_Bound                                        % Slots        0.6  <  [ 3.0%]
BAD      Bad_Speculation.Branch_Mispredicts.Other_Mispredicts  % Slots        0.0  <  [ 3.0%]
BAD      Bad_Speculation.Machine_Clears.Other_Nukes            % Slots        0.0  <
BE/Core  Backend_Bound.Core_Bound.Serializing_Operation        % Clocks       0.0  <  [ 3.0%]
BAD      Bad_Speculation                                       % Slots        0.4  <  [ 3.0%]
BE       Backend_Bound                                         % Slots       77.8     [ 3.0%]<==
RET      Retiring                                              % Slots       19.7  <  [ 3.0%]
FE       Frontend_Bound.Fetch_Latency                          % Slots        0.3  <  [ 3.0%]
FE       Frontend_Bound.Fetch_Bandwidth                        % Slots        0.3  <  [ 3.0%]
BAD      Bad_Speculation.Branch_Mispredicts                    % Slots        0.4  <  [ 3.0%]
BAD      Bad_Speculation.Machine_Clears                        % Slots        0.0     [ 3.0%]
BE/Mem   Backend_Bound.Memory_Bound                            % Slots       39.9     [ 3.0%]
BE/Core  Backend_Bound.Core_Bound                              % Slots       37.9     [ 3.0%]
RET      Retiring.Light_Operations                             % Slots       18.0  <  [ 3.0%]
RET      Retiring.Heavy_Operations                             % Slots        1.7  <  [ 3.0%]
FE       Frontend_Bound.Fetch_Latency.ICache_Misses            % Clocks       0.1  <  [ 3.0%]
FE       Frontend_Bound.Fetch_Latency.ITLB_Misses              % Clocks       0.2  <  [ 3.0%]
FE       Frontend_Bound.Fetch_Latency.Branch_Resteers          % Clocks       0.2  <  [ 3.0%]
FE       Frontend_Bound.Fetch_Latency.MS_Switches              % Clocks_est   0.3     [ 3.0%]
FE       Frontend_Bound.Fetch_Latency.LCP                      % Clocks       0.0  <  [ 3.0%]
FE       Frontend_Bound.Fetch_Latency.DSB_Switches             % Clocks       0.3  <  [ 3.0%]
FE       Frontend_Bound.Fetch_Bandwidth.MITE                   % Slots_est    0.5  <  [ 3.0%]
FE       Frontend_Bound.Fetch_Bandwidth.DSB                    % Slots_est    2.8  <  [ 3.0%]
BE/Mem   Backend_Bound.Memory_Bound.L1_Bound                   % Stalls      12.7     [ 3.0%]
BE/Mem   Backend_Bound.Memory_Bound.L2_Bound                   % Stalls       0.1  <  [ 3.0%]
BE/Mem   Backend_Bound.Memory_Bound.L3_Bound                   % Stalls       0.4  <  [ 3.0%]
BE/Mem   Backend_Bound.Memory_Bound.DRAM_Bound                 % Stalls       0.5  <  [ 3.0%]
BE/Mem   Backend_Bound.Memory_Bound.Store_Bound                % Stalls      52.2     [ 3.0%]
BE/Core  Backend_Bound.Core_Bound.Divider                      % Clocks       0.0  <  [ 3.0%]
BE/Core  Backend_Bound.Core_Bound.Ports_Utilization            % Clocks       9.2  <  [ 3.0%]
RET      Retiring.Light_Operations.FP_Arith                    % Uops         0.9  <  [ 3.0%]
RET      Retiring.Light_Operations.Memory_Operations           % Slots        5.5  <  [ 3.0%]
RET      Retiring.Light_Operations.Fused_Instructions          % Slots        4.5  <  [ 3.0%]
RET      Retiring.Light_Operations.Non_Fused_Branches          % Slots        0.1  <  [ 3.0%]
RET      Retiring.Light_Operations.Other_Light_Ops             % Slots        7.1  <  [ 3.0%]
RET      Retiring.Heavy_Operations.Few_Uops_Instructions       % Slots        1.5  <  [ 3.0%]
RET      Retiring.Heavy_Operations.Microcode_Sequencer         % Slots        0.3     [ 3.0%]
MUX                                                            %              3.00
Run toplev --describe Backend_Bound^ to get more information on bottleneck
Add --run-sample to find locations
Add --nodes '!+Backend_Bound*/2,+MUX' for breakdown.
```
</p>
</details>
Level 3 shows `Backend_Bound.Memory_Bound.Store_Bound` at 52.2% of stalls, so the next step is to record only store-related events.
<details><summary>Recording only store related events</summary>
<p>
```bash
$ perf record -e mem_inst_retired.all_stores:pp -g target/profiling/deps/row_format-7489cf224545c7d6 --bench --warm-up-time 1 --noplot
Benchmarking row_format/convert_columns UInt64 with 50 columns at size 8192: Warming up for 1.0000 s
Warning: Unable to complete 100 samples in 10.0s. You may wish to increase target time to 10.7s, enable flat sampling, or reduce sample count to 60.
row_format/convert_columns UInt64 with 50 columns at size 8192
                        time:   [2.1135 ms 2.1139 ms 2.1145 ms]
                        thrpt:  [23.647 Kelem/s 23.652 Kelem/s 23.658 Kelem/s]
                 change:
                        time:   [−2.3692% −2.2443% −2.1307%] (p = 0.00 < 0.05)
                        thrpt:  [+2.1770% +2.2958% +2.4266%]
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  3 (3.00%) high mild
  5 (5.00%) high severe
[ perf record: Woken up 3 times to write data ]
[ perf record: Captured and wrote 0.676 MB perf.data (3924 samples) ]
```
Then report:
```bash
perf report -g --percent-limit 0.01 --no-children
```
</p>
</details>
The most expensive function is `fixed::encode_not_null` (as this is the only encode function in use), with this info in the annotation:
<details><summary>annotated info</summary>
<p>
```asm
Percent│       add    $0xfffffffffffffff8,%r9
       │     <core::slice::iter::Iter<T> as core::iter::traits::iterator::Iterator>::next:
       │     ↑ jne    30
       │     ↓ jmp    dc
       │     arrow_row::fixed::encode_not_null:
       │             let offset = &mut offsets[value_idx + 1];
       │ 7d:   cmp    $0x1,%rcx
       │       mov    $0x1,%eax
       │       mov    $0x1,%ebx
       │       cmovb  %rcx,%rbx
       │       sub    %rcx,%rbx
       │     core::ptr::non_null::NonNull<T>::add:
       │       data16 data16 data16 data16 cs nopw 0x0(%rax,%rax,1)
       │ a0:   lea    (%rbx,%rax,1),%r10
       │     arrow_row::fixed::encode_not_null:
       │       cmp    $0x1,%r10
       │     ↓ je     f7
       │             let end_offset = *offset + T::ENCODED_LEN;
       │       mov    (%rdx,%rax,8),%r11
       │       lea    0x9(%r11),%r10
       │     core::num::<impl usize>::checked_sub:
       │       cmp    $0xfffffffffffffff6,%r11
       │     ↓ ja     e1
       │       cmp    %rsi,%r10
       │     ↓ ja     e1
       │     arrow_row::fixed::encode_not_null:
       │             to_write[0] = 1;
 49.57 │       movb   $0x1,(%rdi,%r11,1)
       │             let mut encoded = val.encode();
       │       mov    -0x8(%r8,%rax,8),%r14
       │     core::num::<impl u64>::swap_bytes:
       │       bswap  %r14
       │     core::ptr::copy_nonoverlapping:
 49.22 │       mov    %r14,0x1(%rdi,%r11,1)
       │     arrow_row::fixed::encode_not_null:
       │             *offset = end_offset;
  1.22 │       mov    %r10,(%rdx,%rax,8)
       │     <core::ptr::non_null::NonNull<T> as core::cmp::PartialEq>::eq:
       │       inc    %rax
       │       add    $0xfffffffffffffff8,%r9
       │     <core::slice::iter::Iter<T> as core::iter::traits::iterator::Iterator>::next:
       │     ↑ jne    a0
       │     arrow_row::fixed::encode_not_null:
       │         }
       │     }
       │ dc:   pop    %rbx
       │       pop    %r14
       │       pop    %rbp
       │     ← ret
       │ e1:   mov    %rsi,%rdx
       │     <core::ops::range::Range<usize> as core::slice::index::SliceIndex<[T]>>::index_mut:
       │       lea    anon.674ea2c4ec16cc99febf090ac1512894.22.llvm.1276781284273354020,%rcx
       │       mov    %r11,%rdi
       │       mov    %r10,%rsi
```
</p>
</details>
We see that there are two expensive instructions: the first, `movb`, writes `to_write[0] = 1;`, and the second, `mov`, writes the encoded value.
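For reference, here is a minimal sketch of what this hot loop does, reconstructed from the annotated source lines above (the names and signature are my approximation, not the exact `arrow-row` internals):

```rust
/// Sketch only: encode one non-null u64 column into the shared row buffer.
/// `offsets[i + 1]` is the current write position of row `i`, advanced as
/// each field is appended.
fn encode_not_null_sketch(data: &mut [u8], offsets: &mut [usize], values: &[u64]) {
    for (value_idx, val) in values.iter().enumerate() {
        let offset = &mut offsets[value_idx + 1];
        let end_offset = *offset + 9; // T::ENCODED_LEN for u64: 1 null byte + 8 value bytes
        let to_write = &mut data[*offset..end_offset];
        to_write[0] = 1; // "non-null" marker: the hot `movb`
        // big-endian encoding preserves the unsigned ordering byte-wise (the `bswap`):
        to_write[1..].copy_from_slice(&val.to_be_bytes()); // the hot 8-byte `mov`
        *offset = end_offset;
    }
}
```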
So, testing with the `mem_inst_retired.split_stores` event:
<details><summary>recording:</summary>
<p>
```bash
$ perf record -e mem_inst_retired.split_stores:pp -g target/profiling/deps/row_format-7489cf224545c7d6 --bench --warm-up-time 1 --noplot
Benchmarking row_format/convert_columns UInt64 with 50 columns at size 8192: Warming up for 1.0000 s
Warning: Unable to complete 100 samples in 10.0s. You may wish to increase target time to 10.9s, enable flat sampling, or reduce sample count to 60.
row_format/convert_columns UInt64 with 50 columns at size 8192
                        time:   [2.1362 ms 2.1367 ms 2.1373 ms]
                        thrpt:  [23.394 Kelem/s 23.400 Kelem/s 23.406 Kelem/s]
                 change:
                        time:   [+0.9806% +1.0905% +1.2080%] (p = 0.00 < 0.05)
                        thrpt:  [−1.1936% −1.0787% −0.9711%]
                        Change within noise threshold.
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) high mild
  1 (1.00%) high severe
[ perf record: Woken up 2 times to write data ]
[ perf record: Captured and wrote 0.460 MB perf.data (2612 samples) ]
```
Viewing the data:
```bash
perf report -g --percent-limit 0.01 --no-children
```
</p>
</details>
which showed, for the same hot function `fixed::encode_not_null`:
<details><summary>annotated info</summary>
<p>
```asm
Percent│ 7d:   cmp    $0x1,%rcx
       │       mov    $0x1,%eax
       │       mov    $0x1,%ebx
       │       cmovb  %rcx,%rbx
       │       sub    %rcx,%rbx
       │     core::ptr::non_null::NonNull<T>::add:
       │       data16 data16 data16 data16 cs nopw 0x0(%rax,%rax,1)
       │ a0:   lea    (%rbx,%rax,1),%r10
       │     arrow_row::fixed::encode_not_null:
       │       cmp    $0x1,%r10
       │     ↓ je     f7
       │             let end_offset = *offset + T::ENCODED_LEN;
       │       mov    (%rdx,%rax,8),%r11
       │       lea    0x9(%r11),%r10
       │     core::num::<impl usize>::checked_sub:
       │       cmp    $0xfffffffffffffff6,%r11
       │     ↓ ja     e1
       │       cmp    %rsi,%r10
       │     ↓ ja     e1
       │     arrow_row::fixed::encode_not_null:
       │             to_write[0] = 1;
       │       movb   $0x1,(%rdi,%r11,1)
       │             let mut encoded = val.encode();
       │       mov    -0x8(%r8,%rax,8),%r14
       │     core::num::<impl u64>::swap_bytes:
       │       bswap  %r14
       │     core::ptr::copy_nonoverlapping:
100.00 │       mov    %r14,0x1(%rdi,%r11,1)
       │     arrow_row::fixed::encode_not_null:
       │             *offset = end_offset;
       │       mov    %r10,(%rdx,%rax,8)
       │     <core::ptr::non_null::NonNull<T> as core::cmp::PartialEq>::eq:
       │       inc    %rax
       │       add    $0xfffffffffffffff8,%r9
       │     <core::slice::iter::Iter<T> as core::iter::traits::iterator::Iterator>::next:
       │     ↑ jne    a0
       │     arrow_row::fixed::encode_not_null:
       │         }
       │     }
       │ dc:   pop    %rbx
       │       pop    %r14
       │       pop    %rbp
       │     ← ret
       │ e1:   mov    %rsi,%rdx
       │     <core::ops::range::Range<usize> as core::slice::index::SliceIndex<[T]>>::index_mut:
       │       lea    anon.674ea2c4ec16cc99febf090ac1512894.22.llvm.1276781284273354020,%rcx
       │       mov    %r11,%rdi
       │       mov    %r10,%rsi
       │     → call   *0x31ec69(%rip)        # 52de90 <_DYNAMIC+0x4a8>
       │     arrow_row::fixed::encode_not_null:
       │             let offset = &mut offsets[value_idx + 1];
       │ f7:   lea    anon.674ea2c4ec16cc99febf090ac1512894.21.llvm.1276781284273354020,%rdx
       │       mov    %rax,%rdi
       │       mov    %rcx,%rsi
```
</p>
</details>
The `mov` instruction that writes the encoded value now takes `100%` of the split-store samples in this function, i.e. it accounts for all the stores that are split across a cacheline boundary. That `100%` is relative to this function; pressing `p` toggles the percent type to global, which shows `92.99%` of all split-store samples, so it is very expensive.
</details>
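To see why the splits are so frequent: each non-null u64 field is 9 bytes (1 null byte + 8 value bytes), so the 8-byte value store for column `c` lands at byte `9*c + 1` of the row. A back-of-the-envelope check (my illustration, not from the profile; real rows are 450 bytes and not cacheline-aligned, so the exact count shifts per row, but the average stays in this range):

```rust
fn main() {
    let split = (0..50u64)
        .map(|c| 9 * c + 1) // byte offset of column c's 8-byte value store
        .filter(|o| o % 64 > 56) // an 8-byte store starting past byte 56 crosses the line
        .count();
    println!("{split} of 50 value stores are split"); // prints: 6 of 50 value stores are split
}
```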
I tried multiple things that keep the ordering guarantees while still improving the performance.
One of them was to use 2 bytes for the null info instead of 1 byte.
This improved performance by 18%:
```bash
    Finished `profiling` profile [optimized + debuginfo] target(s) in 0.15s
     Running benches/row_format.rs (target/profiling/deps/row_format-7489cf224545c7d6)
row_format/convert_columns UInt64 with 50 columns at size 8192
                        time:   [1.6916 ms 1.6919 ms 1.6922 ms]
                        thrpt:  [29.546 Kelem/s 29.553 Kelem/s 29.558 Kelem/s]
                 change:
                        time:   [−18.727% −18.626% −18.535%] (p = 0.00 < 0.05)
                        thrpt:  [+22.752% +22.890% +23.042%]
                        Performance has improved.
```
The problem, however, is that this doesn't really solve the issue, it just delays it: with the right combination of columns we hit the same situation again.
Also, running perf record/report again shows that even for the 50 columns of u64 with 2 null bytes, the split-store count is still high (and even higher than before).
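Repeating the same illustrative arithmetic for the 2-null-byte layout (assuming the 8-byte value store then lands at byte `10*c + 2` of the row) shows why: the wider stride only moves the splits around rather than eliminating them:

```rust
fn main() {
    let split = (0..50u64)
        .map(|c| 10 * c + 2) // assumed offset of the value store with 2 null bytes
        .filter(|o| o % 64 > 56) // same cacheline-crossing condition as before
        .count();
    println!("{split} of 50 value stores are still split"); // prints: 5 of 50 ...
}
```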
Changing the code to not write the null byte at all made it 37% faster:
```bash
    Finished `profiling` profile [optimized + debuginfo] target(s) in 4.74s
     Running benches/row_format.rs (target/profiling/deps/row_format-7489cf224545c7d6)
row_format/convert_columns UInt64 with 50 columns at size 8192
                        time:   [1.3429 ms 1.3433 ms 1.3436 ms]
                        thrpt:  [37.213 Kelem/s 37.223 Kelem/s 37.232 Kelem/s]
                 change:
                        time:   [−37.193% −37.114% −37.040%] (p = 0.00 < 0.05)
                        thrpt:  [+58.832% +59.018% +59.218%]
                        Performance has improved.
Found 13 outliers among 100 measurements (13.00%)
  5 (5.00%) high mild
  8 (8.00%) high severe
```
> [!IMPORTANT]
> I did not add the null info byte back, so this removes a single instruction from the hot path. But if we move the null info to the start of the row, bitpack it together with the other columns' nulls, and then pad the row, we could get that performance improvement while keeping the null information.
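To make that concrete, here is a hypothetical sketch (my illustration, not an implemented layout) of the row width for N fixed-size u64 columns with the nulls bitpacked up front and the prefix padded so every value write stays 8-byte aligned; for mixed-width columns, reordering fields would serve the same alignment purpose:

```rust
/// Hypothetical row width for `num_cols` u64 columns under the proposed layout.
fn proposed_row_width(num_cols: usize) -> usize {
    let null_bitmap = num_cols.div_ceil(8); // 1 null bit per column, bitpacked
    let padded_prefix = null_bitmap.next_multiple_of(8); // pad so values stay 8-byte aligned
    padded_prefix + num_cols * 8 // aligned 8-byte value slots
}

fn main() {
    // 50 columns: 7-byte bitmap padded to 8, plus 400 value bytes = 408 bytes per row
    // (down from 450). Since 408 is a multiple of 8, every value store in every row
    // stays 8-byte aligned (assuming the buffer itself starts 8-byte aligned) and
    // can never straddle a 64-byte cacheline.
    println!("{}", proposed_row_width(50));
}
```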