FrozenGene commented on pull request #5914:
URL: https://github.com/apache/incubator-tvm/pull/5914#issuecomment-650891863


   I want to share the test results from my Skylake machine. The CPU is an Intel(R) Xeon(R) Silver 4114 CPU @ 2.20GHz, and I use one core to tune. The tuning method follows the `tune_relay_x86.py` tutorial (for the without-clflush case) and this PR's modification of `tune_relay_x86.py` (for the with-clflush case).
   
   Without cache flush:
   
   ```
   [Task  1/12]  Current/Best:   43.31/  44.04 GFLOPS | Progress: (800/800) | 1325.94 s Done.
   [Task  2/12]  Current/Best:    4.33/  44.02 GFLOPS | Progress: (720/720) | 1152.81 s Done.
   [Task  3/12]  Current/Best:   44.19/  65.08 GFLOPS | Progress: (972/972) | 1492.48 s Done.
   [Task  4/12]  Current/Best:   39.14/  57.74 GFLOPS | Progress: (864/864) | 1292.03 s Done.
   [Task  5/12]  Current/Best:   53.04/  66.45 GFLOPS | Progress: (1024/1024) | 1584.45 s Done.
   [Task  6/12]  Current/Best:   42.93/  54.29 GFLOPS | Progress: (896/896) | 1382.46 s Done.
   [Task  7/12]  Current/Best:    8.67/  63.83 GFLOPS | Progress: (980/980) | 1501.93 s Done.
   [Task  8/12]  Current/Best:   32.04/  57.19 GFLOPS | Progress: (308/308) | 484.76 s Done.
   [Task  9/12]  Current/Best:   12.25/  46.87 GFLOPS | Progress: (980/980) | 2161.56 s Done.
   [Task 10/12]  Current/Best:   41.17/  49.36 GFLOPS | Progress: (896/896) | 2174.74 s Done.
   [Task 11/12]  Current/Best:   17.24/  49.36 GFLOPS | Progress: (864/864) | 2075.64 s Done.
   [Task 12/12]  Current/Best:   23.43/  51.69 GFLOPS | Progress: (720/720) | 1708.31 s Done.
   ```
   
   With cache flush:
   ```
   [Task  1/12]  Current/Best:   41.26/  42.29 GFLOPS | Progress: (800/800) | 543.79 s Done.
   [Task  2/12]  Current/Best:    4.30/  41.93 GFLOPS | Progress: (720/720) | 338.14 s Done.
   [Task  3/12]  Current/Best:   43.09/  64.36 GFLOPS | Progress: (972/972) | 503.03 s Done.
   [Task  4/12]  Current/Best:   41.95/  56.23 GFLOPS | Progress: (864/864) | 350.40 s Done.
   [Task  5/12]  Current/Best:   52.39/  66.52 GFLOPS | Progress: (1024/1024) | 505.65 s Done.
   [Task  6/12]  Current/Best:   42.34/  53.17 GFLOPS | Progress: (896/896) | 353.18 s Done.
   [Task  7/12]  Current/Best:    8.38/  62.88 GFLOPS | Progress: (980/980) | 492.13 s Done.
   [Task  8/12]  Current/Best:   31.29/  57.12 GFLOPS | Progress: (308/308) | 166.95 s Done.
   [Task  9/12]  Current/Best:   12.36/  40.97 GFLOPS | Progress: (980/980) | 302.91 s Done.
   [Task 10/12]  Current/Best:   36.14/  41.85 GFLOPS | Progress: (896/896) | 264.56 s Done.
   [Task 11/12]  Current/Best:   16.24/  48.53 GFLOPS | Progress: (864/864) | 257.15 s Done.
   [Task 12/12]  Current/Best:   19.53/  47.11 GFLOPS | Progress: (720/720) | 212.18 s Done.
   ```
   
   The end-to-end execution time:
   88.36 ms (w/ clflush) vs. 87.26 ms (w/o clflush).
   
   As you can see, with clflush almost every layer's reported tuning GFLOPS is lower than without clflush, but when we run the model end-to-end we get a better result. With clflush we also spend much less total tuning time, since each candidate only needs to be measured 10 times (or even fewer).
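   To illustrate why flushing matters for measurement, here is a minimal standalone C sketch (not the TVM implementation; buffer size and names are my own choices) that times the same loop cache-hot and then after evicting the buffer with the `_mm_clflush` intrinsic, which is the same mechanism this PR uses between runs:

   ```c
   #include <stdio.h>
   #include <stdint.h>
   #include <string.h>
   #include <time.h>
   #include <immintrin.h>

   #define N (1 << 20)          /* 4 MiB of ints, larger than a typical L2 */

   static int buf[N];

   /* Evict the buffer from all cache levels, as the tuner does between runs. */
   static void flush_buffer(void *p, size_t bytes) {
       char *c = (char *)p;
       for (size_t i = 0; i < bytes; i += 64)   /* 64-byte cache lines */
           _mm_clflush(c + i);
       _mm_mfence();                            /* ensure flushes retire before timing */
   }

   /* Time one pass over the buffer in milliseconds. */
   static double sum_ms(void) {
       struct timespec t0, t1;
       clock_gettime(CLOCK_MONOTONIC, &t0);
       volatile long s = 0;                     /* volatile: keep the loop alive */
       for (int i = 0; i < N; i++) s += buf[i];
       clock_gettime(CLOCK_MONOTONIC, &t1);
       return (t1.tv_sec - t0.tv_sec) * 1e3 + (t1.tv_nsec - t0.tv_nsec) / 1e6;
   }

   int main(void) {
       memset(buf, 1, sizeof(buf));
       sum_ms();                                /* warm the cache */
       double hot = sum_ms();                   /* cache-hot measurement */
       flush_buffer(buf, sizeof(buf));
       double cold = sum_ms();                  /* cache-cold measurement */
       printf("hot: %.3f ms, cold: %.3f ms\n", hot, cold);
       return 0;
   }
   ```

   The cache-hot number is the one repeated measurement without flushing converges to, while the cache-cold number is closer to what a layer sees inside a real end-to-end run, where the previous layer's output has displaced its weights and inputs from cache.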
   
   As mentioned before, once we have Winograd for CPU this becomes even more important.
   
   
   
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]

