If it is buggy, why does it matter whether it is performant? I do not see the rationale for making the correct version opt-in only.
On Mon, Jul 23, 2018 at 6:47 PM, Leonard Lausen <[email protected]> wrote:
> Currently the default kernel of nn.Embedding backward is known to be
> buggy on P3 instances or using Cuda 9.2 (though the issue also occurs on
> other instances with earlier version of Cuda, but less often).
>
> https://github.com/apache/incubator-mxnet/issues/11314
>
> There is currently an opt-in for using a bug-free kernel, but it is not
> the default. However, the bug-free kernel is used by default for shape
> smaller 16384.
>
> Should MXNet ship a more efficient but buggy kernel in v1.3 or use a
> correct but less efficient kernel by default? As MXNet v1.3 is likely to
> be used a lot with Cuda 9.2 I believe the default behavior should be
> changed to use the bug-free but less efficient Kernel. Correctness and
> providing a good user experience should be No. 1 here (?). Then users
> that want a faster but buggy backward kernel can still select to do so.
> Note this only affects the backward pass.
>
> Hao did related work on improving the take operator
> https://github.com/apache/incubator-mxnet/pull/11326
> https://github.com/apache/incubator-mxnet/pull/11795 which also fixes
> the issue, but he found it to be only "slightly faster" compared to the
> bug-free kernel that is currently under opt-in while leading to CI
> failures on Windows.
>
> In my experience, there is no speed difference between the current buggy
> and opt-in bug-free kernel, but the GPU utilization of the latter is 100%
> compared to 60% of the former (benchmark script:
> https://github.com/apache/incubator-mxnet/pull/11795#issuecomment-405808567 )
