Currently the default kernel for the nn.Embedding backward pass is known to be
buggy on P3 instances or when using CUDA 9.2 (the issue also occurs on
other instances with earlier versions of CUDA, but less often).

There is currently an opt-in for using a bug-free kernel, but it is not
the default. However, the bug-free kernel is used by default for shapes
smaller than 16384.

Should MXNet ship a more efficient but buggy kernel in v1.3 or use a
correct but less efficient kernel by default? As MXNet v1.3 is likely to
be used a lot with CUDA 9.2, I believe the default behavior should be
changed to use the bug-free but less efficient kernel. Correctness and
a good user experience should, in my view, be the top priority here. Users
who want the faster but buggy backward kernel can still opt in to it.
Note that this only affects the backward pass.
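
For reference, here is a minimal sketch of what any correct embedding
backward kernel has to compute: the gradient for each weight row is the sum
of the output-gradient rows of all positions that looked up that index. It
is plain Python and does not call MXNet; the names
embedding_backward_reference and max_abs_diff are hypothetical helpers for
illustration only. Something like this could serve as a ground truth to
compare a GPU kernel's output against.

```python
def embedding_backward_reference(indices, grad_output, vocab_size):
    """Reference gradient w.r.t. the embedding weight (hypothetical helper).

    indices:     list of int, one looked-up row per position
    grad_output: list of rows (lists of float), one per position
    vocab_size:  number of rows in the embedding weight
    """
    dim = len(grad_output[0]) if grad_output else 0
    grad_weight = [[0.0] * dim for _ in range(vocab_size)]
    # Scatter-add: repeated indices must accumulate, not overwrite.
    for idx, row in zip(indices, grad_output):
        for j, g in enumerate(row):
            grad_weight[idx][j] += g
    return grad_weight


def max_abs_diff(a, b):
    """Largest element-wise absolute difference between two 2-D lists."""
    return max(
        (abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb)),
        default=0.0,
    )


if __name__ == "__main__":
    # Index 2 appears twice, so its two gradient rows are summed.
    ref = embedding_backward_reference(
        [2, 0, 2], [[1.0, 2.0], [0.5, 0.5], [3.0, 4.0]], vocab_size=3
    )
    print(ref)
```

The repeated index in the example is deliberate: rows that are looked up
more than once must have their gradients accumulated, so a comparison
against a kernel's output should include such inputs.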

Hao did related work on improving the take operator, which also fixes
the issue. However, he found it to be only "slightly faster" than the
bug-free kernel that is currently opt-in, and it leads to CI
failures on Windows.

In my experience, there is no speed difference between the current buggy kernel
and the opt-in bug-free kernel, but the GPU utilization of the latter is 100%
compared to 60% for the former (benchmark script: ).
