klion26 commented on PR #9689:
URL: https://github.com/apache/arrow-rs/pull/9689#issuecomment-4450609206

   @alamb @scovich After some experiments, it seems the regression comes from 
compilation settings and CPU architecture. Setting the rustc codegen option 
[`-C codegen-units=1`](https://doc.rust-lang.org/rustc/codegen-options/index.html#codegen-units)
 or 
[`-C target-cpu=native`](https://doc.rust-lang.org/rustc/codegen-options/index.html#target-cpu)
 improves performance, and with `codegen-units=1` on both main and the 
current branch, the two perform the same.
   
   - Graviton can't report the `l1i_cache_refill,stall_frontend` stats through perf; 
I verified this with the commands in [1].
   - I ran the benchmarks with different settings on main and the current branch. The 
command looks like `RUSTFLAGS='-C codegen-units=1 -C target-cpu=native' cargo bench --features=arrow,async,test_common,experimental,object_store --bench cast_kernels "cast decimal32 to" -- --save-baseline variant-codegen-units-1-native`
   
     - main branch with the default settings (`main-no-feature` in the benchmark group below)
     - main branch with `-C codegen-units=1` (`main-codegen-units-1` in the benchmark group below)
     - current branch with `-C codegen-units=1` (`variant-codegen-units-1` in the benchmark group below)
     - current branch with `-C target-cpu=native` (`variant-codegen-native` in the benchmark group below; native = `neoverse-n1` on the machine used)
     - current branch with `-C codegen-units=1 -C target-cpu=native` (`variant-codegen-units-1-native` in the benchmark group below)
     - current branch with the default settings (`variant-no-feature` in the benchmark group below)
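
The configuration matrix above can be scripted so that every baseline is produced the same way. The following is only a sketch: `rustflags_for` and `run_config` are hypothetical helper names, the labels are the baseline names from the list, and the `cargo bench` arguments are copied from the command quoted above. It assumes it is run once on each branch with the labels that apply to that branch.

```shell
# Sketch: save one criterion baseline per compiler configuration.
# rustflags_for and run_config are hypothetical helpers, not part of arrow-rs.

FEATURES="arrow,async,test_common,experimental,object_store"
FILTER="cast decimal32 to"

# Map a baseline label to the RUSTFLAGS used for it.
rustflags_for() {
  case "$1" in
    *-no-feature)             echo "" ;;
    *-codegen-units-1-native) echo "-C codegen-units=1 -C target-cpu=native" ;;
    *-codegen-units-1)        echo "-C codegen-units=1" ;;
    *-codegen-native)         echo "-C target-cpu=native" ;;
    *) echo "unknown configuration: $1" >&2; return 1 ;;
  esac
}

# Run (or, with DRY_RUN=1, only print) the benchmark for one label.
run_config() (
  label="$1"
  flags="$(rustflags_for "$label")" || exit 1
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf "RUSTFLAGS='%s' cargo bench --features=%s --bench cast_kernels '%s' -- --save-baseline %s\n" \
      "$flags" "$FEATURES" "$FILTER" "$label"
  else
    # An empty RUSTFLAGS adds no extra flags for the default configuration.
    RUSTFLAGS="$flags" cargo bench --features="$FEATURES" \
      --bench cast_kernels "$FILTER" -- --save-baseline "$label"
  fi
)

# Example, on the current branch:
#   for c in variant-no-feature variant-codegen-units-1 \
#            variant-codegen-native variant-codegen-units-1-native; do
#     run_config "$c"
#   done
```

Setting `DRY_RUN=1` prints the commands without building anything, which makes it easy to check the flags before a long benchmark run.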
   
   
   <details><summary>result of different benchmarks</summary>
   <p>
   
   
   ```
   [ec2-user@ip-172-31-35-37 arrow-rs]$ critcmp codegen-main-no-feature codegen-main-codegen-units-1 codegen-variant-codegen-units-1 codegen-variant-codegen-native codegen-variant-codegen-codegen-units-1-native codegen-variant-no-feature
   group                                             main-codegen-units-1            main-no-feature                 variant-codegen-units-1-native  variant-codegen-native          variant-codegen-units-1         variant-no-feature
   -----                                             --------------------            ---------------                 ------------------------------  ----------------------          -----------------------         ------------------
   "cast decimal32 to float32"                       1.00  16.2±0.04µs  ? ?/sec      1.00  16.2±0.03µs  ? ?/sec      1.00  16.2±0.04µs  ? ?/sec      1.00  16.2±0.02µs  ? ?/sec      1.00  16.2±0.01µs  ? ?/sec      1.00  16.3±0.09µs  ? ?/sec
   "cast decimal32 to float64"                       1.00  16.4±0.09µs  ? ?/sec      1.00  16.4±0.07µs  ? ?/sec      1.01  16.6±0.05µs  ? ?/sec      1.01  16.6±0.06µs  ? ?/sec      1.00  16.4±0.02µs  ? ?/sec      1.00  16.4±0.08µs  ? ?/sec
   "cast decimal32 to int16"                         1.00  29.3±0.15µs  ? ?/sec      1.17  34.3±0.16µs  ? ?/sec      1.12  32.8±1.49µs  ? ?/sec      1.13  33.1±0.16µs  ? ?/sec      1.00  29.3±0.03µs  ? ?/sec      1.13  33.2±0.17µs  ? ?/sec
   "cast decimal32 to int32"                         1.26  29.3±0.10µs  ? ?/sec      1.20  27.9±0.41µs  ? ?/sec      1.00  23.3±0.12µs  ? ?/sec      1.47  34.3±0.12µs  ? ?/sec      1.26  29.3±0.05µs  ? ?/sec      1.48  34.6±0.20µs  ? ?/sec
   "cast decimal32 to int64"                         1.00  23.6±0.09µs  ? ?/sec      1.25  29.4±0.14µs  ? ?/sec      1.11  26.1±0.09µs  ? ?/sec      1.51  35.7±0.24µs  ? ?/sec      1.00  23.6±0.04µs  ? ?/sec      1.53  36.0±0.32µs  ? ?/sec
   "cast decimal32 to int8"                          1.00  70.0±1.71µs  ? ?/sec      1.05  73.4±1.62µs  ? ?/sec      1.01  70.6±1.47µs  ? ?/sec      1.09  76.5±1.15µs  ? ?/sec      1.10  77.2±1.12µs  ? ?/sec      1.11  78.0±1.35µs  ? ?/sec
   "cast decimal32 to uint16"                        1.00  25.4±0.42µs  ? ?/sec      1.50  38.1±0.99µs  ? ?/sec      1.00  25.4±0.36µs  ? ?/sec      1.30  33.1±0.21µs  ? ?/sec      1.00  25.5±0.41µs  ? ?/sec      1.31  33.2±0.14µs  ? ?/sec
   "cast decimal32 to uint32"                        1.00  26.1±0.05µs  ? ?/sec      1.05  27.4±0.72µs  ? ?/sec      1.00  26.2±0.09µs  ? ?/sec      1.31  34.1±0.13µs  ? ?/sec      1.00  26.2±0.03µs  ? ?/sec      1.32  34.4±0.19µs  ? ?/sec
   "cast decimal32 to uint64"                        1.15  30.2±0.10µs  ? ?/sec      1.12  29.5±0.16µs  ? ?/sec      1.00  26.3±0.17µs  ? ?/sec      1.35  35.5±0.29µs  ? ?/sec      1.00  26.3±0.03µs  ? ?/sec      1.37  35.9±0.30µs  ? ?/sec
   "cast decimal32 to uint8"                         1.00  81.6±1.70µs  ? ?/sec      1.05  85.0±1.76µs  ? ?/sec      1.00  81.3±1.42µs  ? ?/sec      1.06  86.0±1.59µs  ? ?/sec      1.09  88.8±1.34µs  ? ?/sec      1.10  89.1±1.33µs  ? ?/sec
   cast decimal32 to decimal32 512                   1.00  14.0±0.05µs  ? ?/sec      1.00  14.1±0.08µs  ? ?/sec      1.14  16.0±0.43µs  ? ?/sec      1.00  14.0±0.03µs  ? ?/sec      1.03  14.5±0.27µs  ? ?/sec      1.01  14.3±0.12µs  ? ?/sec
   cast decimal32 to decimal32 512 lower precision  1.00  21.9±0.05µs  ? ?/sec      1.01  22.1±0.12µs  ? ?/sec      1.08  23.5±0.08µs  ? ?/sec      1.00  22.0±0.26µs  ? ?/sec      1.04  22.7±0.41µs  ? ?/sec      1.08  23.7±0.17µs  ? ?/sec
   cast decimal32 to decimal64 512                   1.00   9.9±0.02µs  ? ?/sec      1.00   9.9±0.02µs  ? ?/sec      1.00   9.8±0.03µs  ? ?/sec      1.00   9.8±0.02µs  ? ?/sec      1.04  10.3±0.16µs  ? ?/sec      1.02  10.0±0.07µs  ? ?/sec
   ```
   
   Another benchmark run for `decimal32 to int8/uint8` with `-C codegen-units=1` on 
both main and the current branch:
   ```
   group                         codegen-unit1-main              codegen-unit1-variant
   -----                         ------------------              ---------------------
   "cast decimal32 to int8"      1.01  70.9±1.70µs  ? ?/sec      1.02  71.8±1.84µs
   "cast decimal32 to uint8"     1.00  81.5±1.74µs  ? ?/sec      1.00  81.6±1.54µs
   ```
   
   </p>
   </details>
   
   The default setting uses `codegen-units = 16` on both EC2 and Mac. I checked 
this by running `cargo clean`, then `CARGO_INCREMENTAL=0 RUSTFLAGS="-C codegen-units=1 -C save-temps" cargo build -p arrow-cast --release -vv`, 
and counting the object files with `find target/release/deps -name '*cast*rcgu.o' | wc -l`.
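
The check above can be wrapped into a small script so the count is easy to repeat under different flags. This is a sketch under assumptions: `count_rcgu_objects` and `build_and_count` are hypothetical helper names, and passing an empty flag string to see the default number of codegen units is my assumption (the command quoted above passes `-C codegen-units=1`).

```shell
# Sketch: count the per-codegen-unit object files kept by `-C save-temps`.
# count_rcgu_objects and build_and_count are hypothetical helpers.

# Count files matching *<fragment>*rcgu.o under a deps directory.
# $1 = deps directory, $2 = crate name fragment (e.g. cast)
count_rcgu_objects() {
  find "$1" -name "*$2*rcgu.o" | wc -l | tr -d ' '
}

# Clean, rebuild arrow-cast with the given extra rustc flags, then count.
# $1 = extra rustc flags, e.g. "-C codegen-units=1"; may be empty.
build_and_count() {
  cargo clean 1>&2
  CARGO_INCREMENTAL=0 RUSTFLAGS="$1 -C save-temps" \
    cargo build -p arrow-cast --release -vv 1>&2
  count_rcgu_objects target/release/deps cast
}

# Example:
#   build_and_count ""                      # default settings (assumed)
#   build_and_count "-C codegen-units=1"
```

Build output is sent to stderr so that only the count lands on stdout and can be captured with `$(...)`.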
   
   [1]
   <details>
   <summary>commands that show ec2 did not have specified perf stat</summary>
   <p>
   
   1. Run a perf command in verbose mode and look at how the events are mapped:
   `perf stat -v -e cycles,l1i_cache_refill,stall_frontend:u 
./main_cast_kernels-589283a85a93d289 --bench 'cast decimal32 to uint8'`
     Using CPUID 0x00000000410fd0c0
     l1i_cache_refill -> armv8_pmuv3_0/event=0x1/
     stall_frontend -> armv8_pmuv3_0/event=0x23/
   
   2. Run `perf stat -vv -e armv8_pmuv3_0/event=0x1/ -- sleep 1` to read the 
counter. The result is always 0, and the same holds for `event=0x23`:
   
   ```
    Performance counter stats for 'sleep 1':
   
                    0      armv8_pmuv3_0/event=0x1/u
   
          1.001426993 seconds time elapsed
   
          0.001400000 seconds user
          0.000000000 seconds sys
   
   
   Performance counter stats for 'sleep 1':
   
                      0      armv8_pmuv3_0/event=0x23/u
   
            1.001548493 seconds time elapsed
   
            0.001492000 seconds user
            0.000000000 seconds sys
   ```
   </p>
   </details> 
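
The zero-count check in [1] could also be automated. The sketch below assumes `perf stat -x,` CSV output, where the first field is the counter value and the third is the event name, written to stderr. `perf_count_field` and `event_is_silent` are hypothetical helper names. A zero count only suggests that the event is not wired up on this machine; it does not prove it.

```shell
# Sketch: read one counter value from `perf stat -x,` CSV output.
# perf_count_field and event_is_silent are hypothetical helpers.

# Read CSV lines on stdin and print the count (field 1) of the first
# line whose event name (field 3) starts with $1.
perf_count_field() {
  awk -F, -v ev="$1" 'index($3, ev) == 1 { print $1; exit }'
}

# Exit 0 if the event reported exactly 0 while running the given command.
# Note: anything the command itself writes to stderr is mixed into the CSV.
# $1 = event, remaining arguments = command to run
event_is_silent() {
  ev="$1"; shift
  count="$(perf stat -x, -e "$ev" -- "$@" 2>&1 >/dev/null | perf_count_field "$ev")"
  [ "$count" = "0" ]
}

# Example:
#   event_is_silent 'armv8_pmuv3_0/event=0x1/' sleep 1 && echo "always 0"
```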

