yzh119 opened a new pull request, #13726:
URL: https://github.com/apache/tvm/pull/13726

   # Motivation
   Currently, our default profiler (`time_evaluator`) does not flush the L2 cache between executions. This can lead to inaccurate time measurements, because input data from the previous run may still reside in the L2 cache and reduce data-fetching time in the next run. Both [Triton](https://github.com/openai/triton/blob/ff399fbc2059a6f35cb93534dc29398f7b82dbc7/python/triton/testing.py#L156-L181) and [nvbench](https://github.com/NVIDIA/nvbench/blob/1a13a2e724b8aa8aee27649ac6878babb63862a6/nvbench/detail/measure_cold.cuh#L123) account for this effect and thus report more accurate measurements.
   
   # Solution
   `time_evaluator` has an argument `f_preproc` that lets the user specify a preprocessing function to run before each execution of the kernel being evaluated. TVM currently supports `cache_flush_cpu_non_first_arg`, which flushes the CPU cache, but the equivalent functionality for GPUs is missing.
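   A minimal sketch of how a user might pick the right `f_preproc` name per device when calling `time_evaluator` (the string names follow this PR's description; the commented-out TVM calls are illustrative only and assume a built module `mod` and device `dev`):

   ```python
   def pick_f_preproc(device_type):
       # CPU cache flushing already exists; this PR adds the CUDA L2 flush.
       if device_type == "cpu":
           return "cache_flush_cpu_non_first_arg"
       elif device_type == "cuda":
           return "l2_cache_flush_cuda"
       return ""  # no preprocessing for other devices

   # Illustrative usage (requires TVM):
   # timer = mod.time_evaluator("main", dev, number=10, repeat=1,
   #                            f_preproc=pick_f_preproc("cuda"))
   # print(timer(*args))
   ```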
   
   This PR borrows the design of nvbench's [l2flush](https://github.com/NVIDIA/nvbench/blob/1a13a2e724b8aa8aee27649ac6878babb63862a6/nvbench/detail/l2flush.cuh) struct and allows the user to specify `"l2_cache_flush_cuda"` as a preprocessing function that flushes the NVIDIA GPU's L2 cache.
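   The idea behind nvbench's flush, as I understand it, is to keep a scratch buffer at least as large as the GPU's L2 cache and overwrite it before each timed run, evicting the previous run's data. Here is a host-side Python analogy of that pattern (a `bytearray` stands in for device memory; on the device this would be a `cudaMemsetAsync` on a buffer sized via the queried L2 cache size):

   ```python
   # Hypothetical L2 size; the real implementation queries it at runtime.
   L2_CACHE_BYTES = 4 * 1024 * 1024

   class L2Flush:
       def __init__(self, l2_size=L2_CACHE_BYTES):
           # Scratch buffer at least as large as the L2 cache.
           self.buffer = bytearray(l2_size)

       def flush(self):
           # Overwriting the whole buffer evicts prior data from the cache
           # (device-side: a memset over the scratch buffer).
           self.buffer[:] = bytes(len(self.buffer))
   ```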
   
   Note that this PR also changes where `f_preproc` is triggered: previously it ran once per repeat, which seems incorrect to me, because most users specify `repeat=1` while `f_preproc` needs to run once per kernel execution.
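   The difference can be illustrated with a simplified measurement loop (this is a sketch, not TVM's actual timing code; `kernel` is assumed to return its own elapsed time):

   ```python
   def measure(kernel, number, repeat, f_preproc):
       # `repeat` outer trials, each averaging `number` kernel executions.
       results = []
       for _ in range(repeat):
           total = 0.0
           for _ in range(number):
               f_preproc()        # new behavior: flush before every run,
               total += kernel()  # not just once per repeat
           results.append(total / number)
       return results
   ```

   With the old per-repeat placement and `repeat=1`, the cache would be flushed only once while `number` executions reuse warm L2 data.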
   

