sbodenstein commented on issue #11796: Batch_dot does not support FP16 well
URL: https://github.com/apache/incubator-mxnet/issues/11796#issuecomment-436414711

@szha: can we reopen this? For some reason, the fix in https://github.com/dmlc/mshadow/pull/353 was reverted by [this](https://github.com/eric-haibin-lin/mshadow/commit/c879f3b7a877b8838f7b64c8e72b4ac3cc82e9d0) commit by @eric-haibin-lin.

This code, run on version `1.3.0` (latest EC2 Deep Learning AMI):

```python
import mxnet as mx
import time

a = mx.nd.ones((100, 100, 100), ctx=mx.gpu(), dtype='float16')
b = mx.nd.ones((100, 100, 100), ctx=mx.gpu(), dtype='float16')

# warm-up iterations so lazy initialization doesn't skew the timing
for i in range(10):
    c = mx.nd.batch_dot(a, b)
mx.nd.waitall()

begin = time.time()
for i in range(500):
    c = mx.nd.batch_dot(a, b)
mx.nd.waitall()  # block until all asynchronous GPU work has finished
end = time.time()
print(end - begin)
```

takes 0.9s on a V100 (versus 0.0318s when using float32 instead: roughly a 30x slowdown!).

We want to implement transformers using Tensor Cores for training, but there is currently no way of doing this in MXNet (`linalg_gemm` and `linalg_gemm2` unfortunately don't support float16 either, despite it seemingly being implemented [here](https://github.com/apache/incubator-mxnet/blob/49e6a7e40691936e533f7cf16848b10c025e4e75/src/operator/linalg_impl.h#L244)). What is the plan for exposing any form of GEMM to users with float16 and Tensor Core support? @szhengac
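For readers unfamiliar with the operator being benchmarked: `batch_dot` on inputs of shape `(B, M, K)` and `(B, K, N)` computes `B` independent matrix products. A minimal NumPy sketch of the same semantics (no MXNet or GPU required; `batch_dot_ref` is a hypothetical helper name, not an MXNet API):

```python
import numpy as np

def batch_dot_ref(a, b):
    # For a of shape (B, M, K) and b of shape (B, K, N), compute B
    # independent (M, K) x (K, N) matrix products, giving (B, M, N).
    assert a.shape[0] == b.shape[0] and a.shape[2] == b.shape[1]
    return np.einsum('bmk,bkn->bmn', a, b)

a = np.ones((100, 100, 100), dtype=np.float16)
b = np.ones((100, 100, 100), dtype=np.float16)
c = batch_dot_ref(a, b)
print(c.shape)       # (100, 100, 100)
print(c[0, 0, 0])    # 100.0: each output entry sums 100 products of ones
```

This is only a reference for the computation's shape and result; the issue above is about the GPU kernel MXNet dispatches to for float16, not about the semantics.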
