Thanks for sharing your thoughts.

Let me share some more background. To get close to hand-written kernels such as 
MKLDNN or ACL for compute-heavy ops, we need to perform vector register tiling, 
which sits one level below cache tiling. Here, the TVM schedule has to carefully 
manage data reuse in vector registers, the number of vector registers used, the 
number of vector FMA operations in the innermost loop, the number of vector 
memory accesses, and prefetcher-friendly access patterns. There are many factors 
to consider, and a developer has to craft the loop optimization schedule to find 
a suitable balance. @kevinthesun can back me up here.
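
To make this concrete, here is a minimal register-tiling sketch for a matmul in 
TVM's te schedule language. The 4x16 micro-kernel and all split factors are 
assumptions for illustration, not tuned values:

```python
import tvm
from tvm import te

# Illustrative matmul with a register-resident 4x16 accumulator tile.
# All tile factors are assumptions for this sketch, not tuned numbers.
M, N, K = 1024, 1024, 1024
A = te.placeholder((M, K), name="A")
B = te.placeholder((K, N), name="B")
k = te.reduce_axis((0, K), name="k")
C = te.compute((M, N), lambda i, j: te.sum(A[i, k] * B[k, j], axis=k), name="C")

s = te.create_schedule(C.op)
CC = s.cache_write(C, "global")      # accumulator tile, intended to stay in registers
i, j = s[C].op.axis
io, ii = s[C].split(i, factor=4)     # micro-kernel rows
jo, ji = s[C].split(j, factor=16)    # micro-kernel columns (vector lanes)
s[C].reorder(io, jo, ii, ji)
s[CC].compute_at(s[C], jo)           # the 4x16 tile stays live across the k loop
ic, jc = s[CC].op.axis
(kc,) = s[CC].op.reduce_axis
ko, ki = s[CC].split(kc, factor=4)
s[CC].reorder(ko, ki, ic, jc)
s[CC].vectorize(jc)                  # vector FMAs in the innermost loop
s[CC].unroll(ic)                     # the only unrolling the schedule intends
```

The number of live vector values in the innermost body is chosen deliberately 
here; a compiler-driven unroll on top of this changes that count behind the 
schedule's back.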
 
Now, a simple optimization like loop unrolling can completely upset this 
balance. For example, my TVM schedule might keep the total vector register 
count below 32 (the number of ARM vector registers), but LLVM unrolling by even 
a factor of 2 doubles the vfma operations and live values in the loop body, 
defeating the whole purpose of the loop tiling. I have dabbled in writing x86 
assembly for SGEMM and have run into all of these issues.
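
As a rough, hypothetical count on AArch64 NEON (32 vector registers, 4 fp32 
lanes each): a 4x16 fp32 accumulator tile alone occupies 4*16/4 = 16 vector 
registers, a 16-wide slice of B takes 4 more, plus a couple of registers for 
the A values, so the micro-kernel already sits around 22 of the 32 registers. 
Any extra unrolling by LLVM raises the number of simultaneously live vectors in 
the body, and once that count crosses 32 the accumulators spill to the stack, 
which is exactly what the hand-crafted tile shape was meant to avoid.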

### What about rerolling, unroll-and-jam and strip-mining?
I think loop rerolling is disabled by default in LLVM; I am not sure about 
unroll-and-jam. Strip-mining is TVM's responsibility (it is just tiling in 1D 
for vectorization, which is common in TVM; a minimal sketch is below). But I 
understand your overarching point, and yes, I am suggesting even more strongly 
that we give TVM more control over these loop optimizations. I also believe 
that different loop optimizations have different impact; in my observations, 
LLVM unrolling has a big impact.
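
For reference, strip-mining in TVM is just a split by the vector width followed 
by vectorize on the inner axis; a minimal sketch, with the factor of 8 being an 
arbitrary assumption:

```python
import tvm
from tvm import te

# Strip-mining: split a 1-D loop by the vector width and vectorize the inner
# part. The factor of 8 is an assumption for this sketch.
n = te.var("n")
A = te.placeholder((n,), name="A")
B = te.compute((n,), lambda i: A[i] * 2.0, name="B")

s = te.create_schedule(B.op)
xo, xi = s[B].split(B.op.axis[0], factor=8)
s[B].vectorize(xi)
```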

### Default schedules to use LLVM optimizations?
I was thinking about this as well, and I completely agree. I want more control 
for compute-intensive ops, but I want LLVM to optimize the default schedules. 
Going even further, if I could embed something in the TVM IR to disable a loop 
optimization for a specific section of the LLVM IR, that might be the best design.
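
To sketch the idea (the pragma key below is purely hypothetical and does not 
exist in TVM today): the schedule would annotate one specific loop, and the 
LLVM codegen would lower that annotation to per-loop metadata such as 
llvm.loop.unroll.disable, leaving every other loop under the default LLVM pipeline.

```python
import tvm
from tvm import te

n = 1024
A = te.placeholder((n,), name="A")
B = te.compute((n,), lambda i: A[i] + 1.0, name="B")

s = te.create_schedule(B.op)
xo, xi = s[B].split(B.op.axis[0], factor=64)
# Hypothetical pragma key: codegen would lower it to LLVM loop metadata
# (e.g. !{!"llvm.loop.unroll.disable"}) attached only to this loop.
s[B].pragma(xo, "llvm_loop_unroll_disable", 1)
s[B].vectorize(xi)
```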

### Mix and match of TVM optimization and LLVM optimization
Yes, this is the same as the previous point.


## Summary
### Why should we disable LLVM unrolling?
* TVM schedules perform as expected, so a developer can trust his/her schedule's 
performance.
* It also helps AutoTVM, whose tuning can be painfully long today. By carefully 
analyzing the loop structure, we can reason about how good the register tiling 
is and discard bad configurations quickly.
* Disabling LLVM unrolling does not mean we will miss a configuration. Our 
schedules are templated, so AutoTVM can include configurations where the axis 
that LLVM was unrolling is instead unrolled by TVM (a minimal knob sketch 
follows this list). But I understand we need data.
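
A minimal sketch of what that templated knob could look like; the template 
name, knob name, and factors are assumptions, not an existing TOPI template:

```python
from tvm import autotvm, te

@autotvm.template("example/vecadd_unroll")
def vecadd(n):
    A = te.placeholder((n,), name="A")
    B = te.compute((n,), lambda i: A[i] + 1.0, name="B")
    s = te.create_schedule(B.op)

    xo, xi = s[B].split(B.op.axis[0], factor=16)
    s[B].vectorize(xi)

    # Let AutoTVM, not LLVM, decide how much to unroll the outer axis.
    cfg = autotvm.get_config()
    cfg.define_knob("unroll_factor", [1, 2, 4])
    if cfg["unroll_factor"].val > 1:
        xoo, xoi = s[B].split(xo, factor=cfg["unroll_factor"].val)
        s[B].unroll(xoi)
    return s, [A, B]
```

With this, the unroll factor becomes part of the AutoTVM search space instead 
of an opaque LLVM decision.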

### Why should we keep LLVM unrolling?
* Default schedules might see performance degradation.
* In the short term, the top-hub logs might not be optimal anymore, and we 
might need to re-tune.


If all of us see the theoretical benefits and agree that performance data is 
the only deciding factor, I can start collecting data for both x86 and ARM. 
Data collection will take time, so it is better if we agree on the idea first :)




