pioy commented on pull request #19185: URL: https://github.com/apache/incubator-mxnet/pull/19185#issuecomment-700603056
The bug occurs in oneDNN GEMM calculations with small dimensions (1 < n < 16). It is triggered when one of the zmm registers (zmm24, zmm25, zmm26, or zmm27) holds values that can be interpreted as NaN just before the GEMM kernel is called. Those values can be left over from other calculations, most likely integer operations (as is the case in this PR). I assume that float kernels do not return NaN in a properly configured pipeline. Unless other AVX-512 kernels overwrite them, the NaN values may persist for a long time, so they may come from operations that were executed much earlier in the pipeline. As a result of the bug, NaNs propagate to the result array, which may terminate the execution of operators.

The fix has been merged into master/1.8/1.7 and is ready for customer testing. See https://github.com/oneapi-src/oneDNN/commit/5ce95efe6f5e86cddbf704b637063cd8dc914125.

Some other fixes are still to be merged into the 1.6 branch. The tag v1.6.4 will be added after those fixes are merged.
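
To illustrate why a single stale NaN is enough to poison a whole GEMM result, here is a minimal pure-Python sketch. It is not the oneDNN kernel and I am not claiming this is the exact mechanism fixed in the commit above. It only assumes, for illustration, that an accumulator is derived from leftover register contents (scaled by zero) instead of being explicitly cleared. The names `gemm_with_stale_accumulator`, `has_nan`, and `stale` are hypothetical.

```python
import math


def gemm_with_stale_accumulator(a, b, stale):
    """Naive C = A @ B where each accumulator starts from `stale * 0.0`.

    `stale` stands in for whatever was left in a register by an earlier,
    unrelated operation (an assumption made for this illustration only).
    """
    m, k, n = len(a), len(b), len(b[0])
    c = [[0.0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            # Under IEEE 754, NaN * 0.0 is NaN, so scaling by zero does
            # not clear a stale NaN the way an explicit zeroing would.
            acc = stale * 0.0
            for p in range(k):
                acc += a[i][p] * b[p][j]
            c[i][j] = acc
    return c


def has_nan(matrix):
    """Return True if any element of a nested-list matrix is NaN."""
    return any(math.isnan(x) for row in matrix for x in row)


a = [[1.0, 2.0], [3.0, 4.0]]
b = [[5.0, 6.0], [7.0, 8.0]]

# Leftover value is an ordinary number: the result is the normal product.
clean = gemm_with_stale_accumulator(a, b, stale=0.0)

# Leftover value is NaN: every output element becomes NaN.
poisoned = gemm_with_stale_accumulator(a, b, stale=float("nan"))

print(clean)
print(has_nan(clean), has_nan(poisoned))
```

A check like `has_nan` on operator outputs is one cheap way to confirm whether a build still shows the problem for small n, although it only detects the symptom and says nothing about which earlier operation left the NaN behind.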
