altanh commented on PR #11531:
URL: https://github.com/apache/tvm/pull/11531#issuecomment-1146362724
> This is a great first set of steps towards improving LSTM performance.
> Could you comment on what this unscheduled performance looks like vs what we
> currently have in TVM?
It's pretty terrible with the naive dense loops, even compared to untuned
TVM with the default schedules. For example (on a Ryzen 9 5900X):
```
seq_len = 80
batch_size = 1
in_dim = 512
hidden_dim = 256
compiling TE LSTM...
took 0.04480266571044922 seconds.
TOPI mean (ms): 48.967991919999996
compiling Relay unrolled LSTM...
One or more operators have not been tuned. Please tune your model for better
performance. Use DEBUG logging level to see more details.
took 42.43188190460205 seconds.
Relay mean (ms): 14.4790252
```
At least the TE version compiles quickly, haha. The Relay baseline comparison
uses the `lstm_cell` from `relay/frontend/common.py`. Note that for this
benchmark I did super basic scheduling by inlining the gate and activation
computations.
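For reference, the gate and activation computations being inlined are the standard LSTM cell equations. Here's a NumPy sketch of one cell step (illustrative names, not the actual TE/Relay implementation from the PR):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_cell_step(x, h, c, w_ih, w_hh, b):
    """One LSTM step.

    x: (batch, in_dim); h, c: (batch, hidden_dim);
    w_ih: (4*hidden_dim, in_dim); w_hh: (4*hidden_dim, hidden_dim).
    """
    # the two dense ops that dominate the runtime
    gates = x @ w_ih.T + h @ w_hh.T + b           # (batch, 4*hidden_dim)
    i, f, g, o = np.split(gates, 4, axis=-1)      # input, forget, cell, output
    # cheap elementwise activations -- the part worth inlining into the dense
    i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)
    g = np.tanh(g)
    c_next = f * c + i * g                        # new cell state
    h_next = o * np.tanh(c_next)                  # new hidden state
    return h_next, c_next

# smoke test with the small benchmark shapes below
rng = np.random.default_rng(0)
batch, in_dim, hidden = 1, 16, 16
x = rng.standard_normal((batch, in_dim))
h = np.zeros((batch, hidden))
c = np.zeros((batch, hidden))
w_ih = rng.standard_normal((4 * hidden, in_dim))
w_hh = rng.standard_normal((4 * hidden, hidden))
b = np.zeros(4 * hidden)
h, c = lstm_cell_step(x, h, c, w_ih, w_hh, b)
print(h.shape, c.shape)  # (1, 16) (1, 16)
```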
Reducing the input and hidden dimensions shows some gains, which I think come
from reduced per-kernel launch overhead (I increased the sequence length to
exaggerate the effect):
```
seq_len = 256
batch_size = 1
in_dim = 16
hidden_dim = 16
compiling TE LSTM...
took 0.057991743087768555 seconds.
TOPI mean (ms): 0.14541639
compiling Relay unrolled LSTM...
One or more operators have not been tuned. Please tune your model for better
performance. Use DEBUG logging level to see more details.
took 708.0528562068939 seconds.
Relay mean (ms): 2.62690786
```
(the compile time is pretty ridiculous on Relay, almost 12 minutes)
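Putting the two runs side by side (plain arithmetic on the mean times reported above):

```python
# mean runtimes (ms) copied from the two benchmark runs above
topi_large, relay_large = 48.967991919999996, 14.4790252  # in=512, hidden=256
topi_small, relay_small = 0.14541639, 2.62690786          # in=16, hidden=16

# large dims: unscheduled TE is ~3.4x slower than unrolled Relay,
# since the naive dense loops dominate
print(f"large: TE/Relay = {topi_large / relay_large:.2f}x")

# small dims: TE is ~18x faster, consistent with per-op kernel
# overhead dominating the unrolled Relay graph at tiny shapes
print(f"small: Relay/TE = {relay_small / topi_small:.2f}x")
```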
Here's the script I used for benchmarking:
https://gist.github.com/altanh/a6dc8bf633028eaca5fbedbb591064f2
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]