FrozenGene edited a comment on pull request #5914: URL: https://github.com/apache/incubator-tvm/pull/5914#issuecomment-650891863
I want to share my test results on my Skylake machine. The CPU is an Intel(R) Xeon(R) Silver 4114 CPU @ 2.20GHz, and I use one core to tune. The tuning setup is the `tune_relay_x86.py` tutorial (for "without clflush") and this PR's modification of `tune_relay_x86.py` (for "with clflush").

Without cache flush:
```
[Task  1/12]  Current/Best:   43.31/  44.04 GFLOPS | Progress: (800/800) | 1325.94 s Done.
[Task  2/12]  Current/Best:    4.33/  44.02 GFLOPS | Progress: (720/720) | 1152.81 s Done.
[Task  3/12]  Current/Best:   44.19/  65.08 GFLOPS | Progress: (972/972) | 1492.48 s Done.
[Task  4/12]  Current/Best:   39.14/  57.74 GFLOPS | Progress: (864/864) | 1292.03 s Done.
[Task  5/12]  Current/Best:   53.04/  66.45 GFLOPS | Progress: (1024/1024) | 1584.45 s Done.
[Task  6/12]  Current/Best:   42.93/  54.29 GFLOPS | Progress: (896/896) | 1382.46 s Done.
[Task  7/12]  Current/Best:    8.67/  63.83 GFLOPS | Progress: (980/980) | 1501.93 s Done.
[Task  8/12]  Current/Best:   32.04/  57.19 GFLOPS | Progress: (308/308) | 484.76 s Done.
[Task  9/12]  Current/Best:   12.25/  46.87 GFLOPS | Progress: (980/980) | 2161.56 s Done.
[Task 10/12]  Current/Best:   41.17/  49.36 GFLOPS | Progress: (896/896) | 2174.74 s Done.
[Task 11/12]  Current/Best:   17.24/  49.36 GFLOPS | Progress: (864/864) | 2075.64 s Done.
[Task 12/12]  Current/Best:   23.43/  51.69 GFLOPS | Progress: (720/720) | 1708.31 s Done.
```

With cache flush:
```
[Task  1/12]  Current/Best:   41.26/  42.29 GFLOPS | Progress: (800/800) | 543.79 s Done.
[Task  2/12]  Current/Best:    4.30/  41.93 GFLOPS | Progress: (720/720) | 338.14 s Done.
[Task  3/12]  Current/Best:   43.09/  64.36 GFLOPS | Progress: (972/972) | 503.03 s Done.
[Task  4/12]  Current/Best:   41.95/  56.23 GFLOPS | Progress: (864/864) | 350.40 s Done.
[Task  5/12]  Current/Best:   52.39/  66.52 GFLOPS | Progress: (1024/1024) | 505.65 s Done.
[Task  6/12]  Current/Best:   42.34/  53.17 GFLOPS | Progress: (896/896) | 353.18 s Done.
[Task  7/12]  Current/Best:    8.38/  62.88 GFLOPS | Progress: (980/980) | 492.13 s Done.
[Task  8/12]  Current/Best:   31.29/  57.12 GFLOPS | Progress: (308/308) | 166.95 s Done.
[Task  9/12]  Current/Best:   12.36/  40.97 GFLOPS | Progress: (980/980) | 302.91 s Done.
[Task 10/12]  Current/Best:   36.14/  41.85 GFLOPS | Progress: (896/896) | 264.56 s Done.
[Task 11/12]  Current/Best:   16.24/  48.53 GFLOPS | Progress: (864/864) | 257.15 s Done.
[Task 12/12]  Current/Best:   19.53/  47.11 GFLOPS | Progress: (720/720) | 212.18 s Done.
```

The end-to-end execution time: 88.36 ms (w/ clflush) vs. 87.26 ms (w/o clflush). As you can see, with clflush almost every single layer's reported tuning GFLOPS is slower than without clflush, but when we run the model end to end we still get an essentially equivalent result. And with clflush, tuning time drops dramatically, as we only need to measure 10 times (or even fewer). As said before, once we have Winograd for CPU, this becomes even more important.
