rluvaton commented on issue #9083:
URL: https://github.com/apache/arrow-rs/issues/9083#issuecomment-3703789536

   
   ## Bottleneck 1: Store bound for 50 columns of non-null u64 array
   
   > Note: using u64 as it doesn't have some sign that we need to deal with
   > and no nulls to focus on a specific code path (`fixed::encode_not_null`)
   
   <details><summary>Benchmark code:</summary>
   <p>
   
   ```rust
   fn do_bench(c: &mut Criterion, name: &str, cols: Vec<ArrayRef>) {
     let fields: Vec<_> = cols
       .iter()
       .map(|x| SortField::new(x.data_type().clone()))
       .collect();
     let mut group = c.benchmark_group("row_format");
     group.throughput(criterion::Throughput::Elements(cols.len() as u64));
   
     group.bench_function(&format!("convert_columns {name}"), |b| {
       b.iter(|| {
         let converter = RowConverter::new(fields.clone()).unwrap();
         hint::black_box(converter.convert_columns(&cols).unwrap())
       });
     });
   }
   
   fn run_bench_on_multi_column_with_same_fixed_size_data_type_with_no_nulls<T>(
     c: &mut Criterion,
     num_cols: usize,
     batch_size: usize,
   ) where
     T: ArrowPrimitiveType,
     StandardUniform: Distribution<T::Native>,
   {
     let mut seed = 0;
   
     let mut cols: Vec<ArrayRef> = vec![];
   
     for _ in 0..num_cols {
       seed += 1;
       cols.push(
         Arc::new(create_primitive_array_with_seed::<T>(batch_size, 0.0, seed)) 
as ArrayRef,
       );
     }
   
     do_bench(
       c,
       format!(
         "{} with {} columns at size {batch_size}",
         T::DATA_TYPE,
         num_cols
       )
         .as_str(),
       cols,
     );
   }
   
   fn row_bench(c: &mut Criterion) {
     
run_bench_on_multi_column_with_same_fixed_size_data_type_with_no_nulls::<UInt64Type>(
       c, 50, 8192,
     );
   }
   
   criterion_group! {
       name = benches;
       config = Criterion::default().measurement_time(Duration::from_secs(10));
       targets = row_bench
   }
   ```
   
   </p>
   </details> 
   
   Results on main:
   ```bash
   $ cargo bench --features=test_utils --bench row_format --profile=profiling 
-- --save-baseline=main
   
       Finished `profiling` profile [optimized + debuginfo] target(s) in 0.15s
        Running benches/row_format.rs 
(target/profiling/deps/row_format-7489cf224545c7d6)
   Benchmarking row_format/convert_columns UInt64 with 50 columns at size 8192: 
Warming up for 3.0000 s
   Warning: Unable to complete 100 samples in 10.0s. You may wish to increase 
target time to 10.6s, enable flat sampling, or reduce sample count to 60.
   Benchmarking row_format/convert_columns UInt64 with 50 columns at size 8192: 
Collecting 100 samples in estimated 10.585 s (50
   row_format/convert_columns UInt64 with 50 columns at size 8192
                           time:   [2.0778 ms 2.0787 ms 2.0799 ms]
                           thrpt:  [24.040 Kelem/s 24.054 Kelem/s 24.063 
Kelem/s]
   Found 16 outliers among 100 measurements (16.00%)
     6 (6.00%) high mild
     10 (10.00%) high severe
   ```
   
   > [!NOTE]  
   > TL;DR: the problem is that we are bottleneck by store, and there are a lot 
of store instructions that split across a cacheline boundary.
   >
   > Allowing us to bitpack nulls from all columns at the start of the row and 
to reorder fields and add padding will allow us to improve the performance.
   
   
   <details><summary>Step by step finding the problem:</summary>
   
   
   <details><summary>Running Top-Down analysis - Level 1</summary>
   <p>
   
   ```bash
   $ ~/pmu-tools/toplev.py --core S0-C0 -l1 --single-thread -v --no-desc 
taskset -c 0 target/profiling/deps/row_format-7489cf224545c7d6 --bench 
--warm-up-time 1 --noplot
   Benchmarking row_format/convert_columns UInt64 with 50 columns at size 8192: 
Warming up for 1.0000 s
   Warning: Unable to complete 100 samples in 10.0s. You may wish to increase 
target time to 10.9s, enable flat sampling, or reduce sample count to 60.
   Benchmarking row_format/convert_columns UInt64 with 50 columns at size 8192: 
Collecting 100 samples in estimated 10.932 s (50
   row_format/convert_columns UInt64 with 50 columns at size 8192
                           time:   [2.1303 ms 2.1308 ms 2.1313 ms]
                           thrpt:  [23.459 Kelem/s 23.466 Kelem/s 23.471 
Kelem/s]
   Found 12 outliers among 100 measurements (12.00%)
     5 (5.00%) high mild
     7 (7.00%) high severe
   
   # 5.01-full-perf on Intel(R) Xeon(R) Platinum 8275CL CPU @ 3.00GHz 
[clx/skylake]
   FE               Frontend_Bound   % Slots                        0.3  <
   BAD              Bad_Speculation  % Slots                        0.4  <
   BE               Backend_Bound    % Slots                       80.3   <==
   RET              Retiring         % Slots                       19.0  <
   MUX                               %                            100.00
   Run toplev --describe Backend_Bound^ to get more information on bottleneck
   Add --run-sample to find locations
   Add --nodes '!+Backend_Bound*/2,+MUX' for breakdown.
   ```
   
   </p>
   </details> 
   
   It looks like the problem is `Backend_Bound`.
   
   
   <details><summary>Running Top-Down analysis - Level 2</summary>
   <p>
   
   ```bash
   $ ~/pmu-tools/toplev.py --core S0-C0 -l2 --single-thread -v --no-desc 
taskset -c 0 target/profiling/deps/row_format-7489cf224545c7d6 --bench 
--warm-up-time 1 --noplot
   Benchmarking row_format/convert_columns UInt64 with 50 columns at size 8192: 
Warming up for 1.0000 s
   Warning: Unable to complete 100 samples in 10.0s. You may wish to increase 
target time to 11.1s, enable flat sampling, or reduce sample count to 60.
   Benchmarking row_format/convert_columns UInt64 with 50 columns at size 8192: 
Collecting 100 samples in estimated 11.130 s (50
   row_format/convert_columns UInt64 with 50 columns at size 8192
                           time:   [2.1779 ms 2.1784 ms 2.1789 ms]
                           thrpt:  [22.947 Kelem/s 22.953 Kelem/s 22.958 
Kelem/s]
                    change:
                           time:   [+2.0777% +2.2031% +2.3211%] (p = 0.00 < 
0.05)
                           thrpt:  [−2.2684% −2.1556% −2.0354%]
                           Performance has regressed.
   Found 3 outliers among 100 measurements (3.00%)
     2 (2.00%) high mild
     1 (1.00%) high severe
   
   # 5.01-full-perf on Intel(R) Xeon(R) Platinum 8275CL CPU @ 3.00GHz 
[clx/skylake]
   FE               Frontend_Bound                      % Slots                 
       0.5  < [14.0%]
   BAD              Bad_Speculation                     % Slots                 
       0.5  < [14.0%]
   BE               Backend_Bound                       % Slots                 
      78.9    [14.0%]<==
   RET              Retiring                            % Slots                 
      19.9  < [14.0%]
   FE               Frontend_Bound.Fetch_Latency        % Slots                 
       0.2  < [14.0%]
   FE               Frontend_Bound.Fetch_Bandwidth      % Slots                 
       0.3  < [14.0%]
   BAD              Bad_Speculation.Branch_Mispredicts  % Slots                 
       0.5  < [14.0%]
   BAD              Bad_Speculation.Machine_Clears      % Slots                 
       0.0  < [14.0%]
   BE/Mem           Backend_Bound.Memory_Bound          % Slots                 
      40.5    [14.0%]
   BE/Core          Backend_Bound.Core_Bound            % Slots                 
      38.4    [14.0%]
   RET              Retiring.Light_Operations           % Slots                 
      18.2  < [14.0%]
   RET              Retiring.Heavy_Operations           % Slots                 
       1.7  < [14.0%]
   MUX                                                  %                       
      14.00
   Run toplev --describe Backend_Bound^ to get more information on bottleneck
   Add --run-sample to find locations
   Add --nodes '!+Backend_Bound*/2,+MUX' for breakdown.
   ```
   
   </p>
   </details> 
   
   
   <details><summary>Running Top-Down analysis - Level 3</summary>
   <p>
   
   ```bash
   $ ~/pmu-tools/toplev.py --core S0-C0 -l3 --single-thread -v --no-desc 
taskset -c 0 target/profiling/deps/row_format-7489cf224545c7d6 --bench 
--warm-up-time 1 --noplot
   Benchmarking row_format/convert_columns UInt64 with 50 columns at size 8192: 
Warming up for 1.0000 s
   Warning: Unable to complete 100 samples in 10.0s. You may wish to increase 
target time to 10.9s, enable flat sampling, or reduce sample count to 60.
   row_format/convert_columns UInt64 with 50 columns at size 8192
                           time:   [2.1342 ms 2.1347 ms 2.1353 ms]
                           thrpt:  [23.416 Kelem/s 23.422 Kelem/s 23.428 
Kelem/s]
                    change:
                           time:   [+0.5527% +0.6704% +0.7973%] (p = 0.00 < 
0.05)
                           thrpt:  [−0.7910% −0.6659% −0.5497%]
                           Change within noise threshold.
   Found 7 outliers among 100 measurements (7.00%)
     3 (3.00%) high mild
     4 (4.00%) high severe
   
   # 5.01-full-perf on Intel(R) Xeon(R) Platinum 8275CL CPU @ 3.00GHz 
[clx/skylake]
   FE               Frontend_Bound                                        % 
Slots                        0.6  < [ 3.0%]
   BAD              Bad_Speculation.Branch_Mispredicts.Other_Mispredicts  % 
Slots                        0.0  < [ 3.0%]
   BAD              Bad_Speculation.Machine_Clears.Other_Nukes            % 
Slots                        0.0  <
   BE/Core          Backend_Bound.Core_Bound.Serializing_Operation        % 
Clocks                       0.0  < [ 3.0%]
   BAD              Bad_Speculation                                       % 
Slots                        0.4  < [ 3.0%]
   BE               Backend_Bound                                         % 
Slots                       77.8    [ 3.0%]<==
   RET              Retiring                                              % 
Slots                       19.7  < [ 3.0%]
   FE               Frontend_Bound.Fetch_Latency                          % 
Slots                        0.3  < [ 3.0%]
   FE               Frontend_Bound.Fetch_Bandwidth                        % 
Slots                        0.3  < [ 3.0%]
   BAD              Bad_Speculation.Branch_Mispredicts                    % 
Slots                        0.4  < [ 3.0%]
   BAD              Bad_Speculation.Machine_Clears                        % 
Slots                        0.0    [ 3.0%]
   BE/Mem           Backend_Bound.Memory_Bound                            % 
Slots                       39.9    [ 3.0%]
   BE/Core          Backend_Bound.Core_Bound                              % 
Slots                       37.9    [ 3.0%]
   RET              Retiring.Light_Operations                             % 
Slots                       18.0  < [ 3.0%]
   RET              Retiring.Heavy_Operations                             % 
Slots                        1.7  < [ 3.0%]
   FE               Frontend_Bound.Fetch_Latency.ICache_Misses            % 
Clocks                       0.1  < [ 3.0%]
   FE               Frontend_Bound.Fetch_Latency.ITLB_Misses              % 
Clocks                       0.2  < [ 3.0%]
   FE               Frontend_Bound.Fetch_Latency.Branch_Resteers          % 
Clocks                       0.2  < [ 3.0%]
   FE               Frontend_Bound.Fetch_Latency.MS_Switches              % 
Clocks_est                   0.3    [ 3.0%]
   FE               Frontend_Bound.Fetch_Latency.LCP                      % 
Clocks                       0.0  < [ 3.0%]
   FE               Frontend_Bound.Fetch_Latency.DSB_Switches             % 
Clocks                       0.3  < [ 3.0%]
   FE               Frontend_Bound.Fetch_Bandwidth.MITE                   % 
Slots_est                    0.5  < [ 3.0%]
   FE               Frontend_Bound.Fetch_Bandwidth.DSB                    % 
Slots_est                    2.8  < [ 3.0%]
   BE/Mem           Backend_Bound.Memory_Bound.L1_Bound                   % 
Stalls                      12.7    [ 3.0%]
   BE/Mem           Backend_Bound.Memory_Bound.L2_Bound                   % 
Stalls                       0.1  < [ 3.0%]
   BE/Mem           Backend_Bound.Memory_Bound.L3_Bound                   % 
Stalls                       0.4  < [ 3.0%]
   BE/Mem           Backend_Bound.Memory_Bound.DRAM_Bound                 % 
Stalls                       0.5  < [ 3.0%]
   BE/Mem           Backend_Bound.Memory_Bound.Store_Bound                % 
Stalls                      52.2    [ 3.0%]
   BE/Core          Backend_Bound.Core_Bound.Divider                      % 
Clocks                       0.0  < [ 3.0%]
   BE/Core          Backend_Bound.Core_Bound.Ports_Utilization            % 
Clocks                       9.2  < [ 3.0%]
   RET              Retiring.Light_Operations.FP_Arith                    % 
Uops                         0.9  < [ 3.0%]
   RET              Retiring.Light_Operations.Memory_Operations           % 
Slots                        5.5  < [ 3.0%]
   RET              Retiring.Light_Operations.Fused_Instructions          % 
Slots                        4.5  < [ 3.0%]
   RET              Retiring.Light_Operations.Non_Fused_Branches          % 
Slots                        0.1  < [ 3.0%]
   RET              Retiring.Light_Operations.Other_Light_Ops             % 
Slots                        7.1  < [ 3.0%]
   RET              Retiring.Heavy_Operations.Few_Uops_Instructions       % 
Slots                        1.5  < [ 3.0%]
   RET              Retiring.Heavy_Operations.Microcode_Sequencer         % 
Slots                        0.3    [ 3.0%]
   MUX                                                                    %     
                         3.00
   Run toplev --describe Backend_Bound^ to get more information on bottleneck
   Add --run-sample to find locations
   Add --nodes '!+Backend_Bound*/2,+MUX' for breakdown.
   ```
   
   </p>
   </details> 
   
   
   
   <details><summary>Recording only store related events</summary>
   <p>
   
   ```bash
   $ perf record -e mem_inst_retired.all_stores:pp -g 
target/profiling/deps/row_format-7489cf224545c7d6 --bench --warm-up-time 1 
--noplot
   Benchmarking row_format/convert_columns UInt64 with 50 columns at size 8192: 
Warming up for 1.0000 s
   Warning: Unable to complete 100 samples in 10.0s. You may wish to increase 
target time to 10.7s, enable flat sampling, or reduce sample count to 60.
   row_format/convert_columns UInt64 with 50 columns at size 8192
                           time:   [2.1135 ms 2.1139 ms 2.1145 ms]
                           thrpt:  [23.647 Kelem/s 23.652 Kelem/s 23.658 
Kelem/s]
                    change:
                           time:   [−2.3692% −2.2443% −2.1307%] (p = 0.00 < 
0.05)
                           thrpt:  [+2.1770% +2.2958% +2.4266%]
                           Performance has improved.
   Found 8 outliers among 100 measurements (8.00%)
     3 (3.00%) high mild
     5 (5.00%) high severe
   
   [ perf record: Woken up 3 times to write data ]
   [ perf record: Captured and wrote 0.676 MB perf.data (3924 samples) ]
   ```
   
   then report:
   ```bash
   perf report -g --percent-limit 0.01 --no-children
   ```
   
   </p>
   </details> 
   
   the most expensive function is `fixed::encode_not_null` (as this is the only 
function that is being used) with this info in the annotation:
   
   
   <details><summary>annotated info</summary>
   <p>
   
   ```asm
   Percent│       add    $0xfffffffffffffff8,%r9                                
                                                                                
  ▒
          │     <core::slice::iter::Iter<T> as 
core::iter::traits::iterator::Iterator>::next:                                  
                                   ▒
          │     ↑ jne    30                                                     
                                                                                
  ▒
          │     ↓ jmp    dc                                                     
                                                                                
  ▒
          │     arrow_row::fixed::encode_not_null:                              
                                                                                
  ▒
          │     let offset = &mut offsets[value_idx + 1];                       
                                                                                
  ▒
          │ 7d:   cmp    $0x1,%rcx                                              
                                                                                
  ▒
          │       mov    $0x1,%eax                                              
                                                                                
  ▒
          │       mov    $0x1,%ebx                                              
                                                                                
  ▒
          │       cmovb  %rcx,%rbx                                              
                                                                                
  ▒
          │       sub    %rcx,%rbx                                              
                                                                                
  ▒
          │     core::ptr::non_null::NonNull<T>::add:                           
                                                                                
  ▒
          │       data16 data16 data16 data16 cs nopw 0x0(%rax,%rax,1)          
                                                                                
  ▒
          │ a0:   lea    (%rbx,%rax,1),%r10                                     
                                                                                
  ▒
          │     arrow_row::fixed::encode_not_null:                              
                                                                                
  ▒
          │       cmp    $0x1,%r10                                              
                                                                                
  ▒
          │     ↓ je     f7                                                     
                                                                                
  ▒
          │     let end_offset = *offset + T::ENCODED_LEN;                      
                                                                                
  ▒
          │       mov    (%rdx,%rax,8),%r11                                     
                                                                                
  ▒
          │       lea    0x9(%r11),%r10                                         
                                                                                
  ▒
          │     core::num::<impl usize>::checked_sub:                           
                                                                                
  ▒
          │       cmp    $0xfffffffffffffff6,%r11                               
                                                                                
  ▒
          │     ↓ ja     e1                                                     
                                                                                
  ▒
          │       cmp    %rsi,%r10                                              
                                                                                
  ▒
          │     ↓ ja     e1                                                     
                                                                                
  ▒
          │     arrow_row::fixed::encode_not_null:                              
                                                                                
  ▒
          │     to_write[0] = 1;                                                
                                                                                
  ▒
    49.57 │       movb   $0x1,(%rdi,%r11,1)                                     
                                                                                
  ▒
          │     let mut encoded = val.encode();                                 
                                                                                
  ▒
          │       mov    -0x8(%r8,%rax,8),%r14                                  
                                                                                
  ▒
          │     core::num::<impl u64>::swap_bytes:                              
                                                                                
  ▒
          │       bswap  %r14                                                   
                                                                                
  ▒
          │     core::ptr::copy_nonoverlapping:                                 
                                                                                
  ▒
    49.22 │       mov    %r14,0x1(%rdi,%r11,1)                                  
                                                                                
  ▒
          │     arrow_row::fixed::encode_not_null:                              
                                                                                
  ▒
          │     *offset = end_offset;                                           
                                                                                
  ▒
     1.22 │       mov    %r10,(%rdx,%rax,8)                                     
                                                                                
  ▒
          │     <core::ptr::non_null::NonNull<T> as core::cmp::PartialEq>::eq:  
                                                                                
  ▒
          │       inc    %rax                                                   
                                                                                
  ▒
          │       add    $0xfffffffffffffff8,%r9                                
                                                                                
  ▒
          │     <core::slice::iter::Iter<T> as 
core::iter::traits::iterator::Iterator>::next:                                  
                                   ◆
          │     ↑ jne    a0                                                     
                                                                                
  ▒
          │     arrow_row::fixed::encode_not_null:                              
                                                                                
  ▒
          │     }                                                               
                                                                                
  ▒
          │     }                                                               
                                                                                
  ▒
          │ dc:   pop    %rbx                                                   
                                                                                
  ▒
          │       pop    %r14                                                   
                                                                                
  ▒
          │       pop    %rbp                                                   
                                                                                
  ▒
          │     ← ret                                                           
                                                                                
  ▒
          │ e1:   mov    %rsi,%rdx                                              
                                                                                
  ▒
          │     <core::ops::range::Range<usize> as 
core::slice::index::SliceIndex<[T]>>::index_mut:                                
                               ▒
          │       lea    
anon.674ea2c4ec16cc99febf090ac1512894.22.llvm.1276781284273354020,%rcx          
                                                         ▒
          │       mov    %r11,%rdi                                              
                                                                                
  ▒
          │       mov    %r10,%rsi
   ```
   
   </p>
   </details> 
   
   we see that there are 2 expensive instructions, `movb` and `mov`, the first 
`movb` is writing `to_write[0] = 1;`
   and the second `mov` is writing the encoded value.
   
   
   So testing with `mem_inst_retired.split_stores`
   
   <details><summary>recording:</summary>
   <p>
   
   
   ```bash
   $ perf record -e mem_inst_retired.split_stores:pp -g 
target/profiling/deps/row_format-7489cf224545c7d6 --bench --warm-up-time 1 
--noplot
   Benchmarking row_format/convert_columns UInt64 with 50 columns at size 8192: 
Warming up for 1.0000 s
   Warning: Unable to complete 100 samples in 10.0s. You may wish to increase 
target time to 10.9s, enable flat sampling, or reduce sample count to 60.
   row_format/convert_columns UInt64 with 50 columns at size 8192
                           time:   [2.1362 ms 2.1367 ms 2.1373 ms]
                           thrpt:  [23.394 Kelem/s 23.400 Kelem/s 23.406 
Kelem/s]
                    change:
                           time:   [+0.9806% +1.0905% +1.2080%] (p = 0.00 < 
0.05)
                           thrpt:  [−1.1936% −1.0787% −0.9711%]
                           Change within noise threshold.
   Found 3 outliers among 100 measurements (3.00%)
     2 (2.00%) high mild
     1 (1.00%) high severe
   
   [ perf record: Woken up 2 times to write data ]
   [ perf record: Captured and wrote 0.460 MB perf.data (2612 samples) ]
   ```
   
   Viewing the data:
   ```bash
   perf report -g --percent-limit 0.01 --no-children
   ```
   
   </p>
   </details> 
   
   
   showed for the same hot function `fixed::encode_not_null`:
   
   
   <details><summary>annotated info</summary>
   <p>
   
   ```asm
   Percent│ 7d:   cmp    $0x1,%rcx
          │       mov    $0x1,%eax
          │       mov    $0x1,%ebx
          │       cmovb  %rcx,%rbx
          │       sub    %rcx,%rbx
          │     core::ptr::non_null::NonNull<T>::add:
          │       data16 data16 data16 data16 cs nopw 0x0(%rax,%rax,1)
          │ a0:   lea    (%rbx,%rax,1),%r10
          │     arrow_row::fixed::encode_not_null:
          │       cmp    $0x1,%r10
          │     ↓ je     f7
          │     let end_offset = *offset + T::ENCODED_LEN;
          │       mov    (%rdx,%rax,8),%r11
          │       lea    0x9(%r11),%r10
          │     core::num::<impl usize>::checked_sub:
          │       cmp    $0xfffffffffffffff6,%r11
          │     ↓ ja     e1
          │       cmp    %rsi,%r10
          │     ↓ ja     e1
          │     arrow_row::fixed::encode_not_null:
          │     to_write[0] = 1;
          │       movb   $0x1,(%rdi,%r11,1)
          │     let mut encoded = val.encode();
          │       mov    -0x8(%r8,%rax,8),%r14
          │     core::num::<impl u64>::swap_bytes:
          │       bswap  %r14
          │     core::ptr::copy_nonoverlapping:
   100.00 │       mov    %r14,0x1(%rdi,%r11,1)
          │     arrow_row::fixed::encode_not_null:
          │     *offset = end_offset;
          │       mov    %r10,(%rdx,%rax,8)
          │     <core::ptr::non_null::NonNull<T> as core::cmp::PartialEq>::eq:
          │       inc    %rax
          │       add    $0xfffffffffffffff8,%r9
          │     <core::slice::iter::Iter<T> as 
core::iter::traits::iterator::Iterator>::next:
          │     ↑ jne    a0
          │     arrow_row::fixed::encode_not_null:
          │     }
          │     }
          │ dc:   pop    %rbx
          │       pop    %r14
          │       pop    %rbp
          │     ← ret
          │ e1:   mov    %rsi,%rdx
          │     <core::ops::range::Range<usize> as 
core::slice::index::SliceIndex<[T]>>::index_mut:
          │       lea    
anon.674ea2c4ec16cc99febf090ac1512894.22.llvm.1276781284273354020,%rcx
          │       mov    %r11,%rdi
          │       mov    %r10,%rsi
          │     → call   *0x31ec69(%rip)        # 52de90 <_DYNAMIC+0x4a8>
          │     arrow_row::fixed::encode_not_null:
          │     let offset = &mut offsets[value_idx + 1];
          │ f7:   lea    
anon.674ea2c4ec16cc99febf090ac1512894.21.llvm.1276781284273354020,%rdx
          │       mov    %rax,%rdi
          │       mov    %rcx,%rsi
   ```
   
   </p>
   </details>
   
   
   that the `mov` instruction that writes the encoded value is now taking 
`100%` of the samples in this function for writing across boundries.
   
   the `100%` is relative to this function, but when pressing `p` it toggle 
percent type to global
   and then the percent is `92.99%`, so very expensive.
   
   </details>
   
   I tried multiple things that keep the ordering gurrentees but still 
improving the performance.
   one of them was to use 2 bytes for null info instead of 1 byte.
   
   this improved performance by 18%:
   
   ```bash
       Finished `profiling` profile [optimized + debuginfo] target(s) in 0.15s
        Running benches/row_format.rs 
(target/profiling/deps/row_format-7489cf224545c7d6)
   row_format/convert_columns UInt64 with 50 columns at size 8192
                           time:   [1.6916 ms 1.6919 ms 1.6922 ms]
                           thrpt:  [29.546 Kelem/s 29.553 Kelem/s 29.558 
Kelem/s]
                    change:
                           time:   [−18.727% −18.626% −18.535%] (p = 0.00 < 
0.05)
                           thrpt:  [+22.752% +22.890% +23.042%]
                           Performance has improved.
   ```
   
   The problem, however, is that it doesn't really solve the problem, it just 
delay it, with the right combination of columns we can get to that problem 
again.
   and also doing perf and report shows that even for the 50 columns of u64 
with 2 bytes for null the is still high (and even higher than before):
   
   
   Changing the code to not add the null byte made the code 37% faster:
   
   ```bash
       Finished `profiling` profile [optimized + debuginfo] target(s) in 4.74s
        Running benches/row_format.rs 
(target/profiling/deps/row_format-7489cf224545c7d6)
   row_format/convert_columns UInt64 with 50 columns at size 8192
                           time:   [1.3429 ms 1.3433 ms 1.3436 ms]
                           thrpt:  [37.213 Kelem/s 37.223 Kelem/s 37.232 
Kelem/s]
                    change:
                           time:   [−37.193% −37.114% −37.040%] (p = 0.00 < 
0.05)
                           thrpt:  [+58.832% +59.018% +59.218%]
                           Performance has improved.
   Found 13 outliers among 100 measurements (13.00%)
     5 (5.00%) high mild
     8 (8.00%) high severe
   ```
   
   > [!IMPORTANT]  
   > I did not add the null info byte back so this remove a single instruction 
from the hot path. but if we move it to the start of the row and bitpack it 
with other columns
   > and then use padding we could have that performance improvement.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to