sxjscience opened a new issue #19045:
URL: https://github.com/apache/incubator-mxnet/issues/19045
FYI @szha @sandeep-krishnamurthy @pengzhao-intel @ptrendx @yzhliu @leezu
@xidulu @CassiniXu
Implementing operators is cumbersome and can seem to involve more *"engineering
than science"*. That is not actually the case, especially when it comes to
correctly testing the implementation of an operator. Designing practical
numerical computation systems involves lots of stability concerns, and the
following are my initial thoughts on the *science* behind testing operators.
Feel free to add more.
Let's consider the following example, in which you have implemented a new
operator and decide to test it.
```python
def new_op(*args):
    # CODES
    return out
```
## How to test gradient
### The general testing function
We can use the following way to obtain the gradient of this operator in
MXNet.
```python
# Generate the inputs; we use two inputs as an example
a = mx.np.random.normal(0, 1, a_shape)
b = mx.np.random.normal(0, 1, b_shape)
# Test for different gradient-request flags.
# Here, we use `add` to cover the case in which the gradient accumulation
# can be sparse, e.g., in the embedding layer.
# In this example, we attach one input with "add" and the other with "write".
a.attach_grad('add')
# When the flag is 'add', we randomly initialize the existing gradient
original_a_grad = mx.np.random.normal(0, 1, a_shape)
a.grad[:] = original_a_grad
b.attach_grad('write')
# Generate some random head gradients
out1_grad = mx.np.random.normal(0, 1, out1_shape)
out2_grad = mx.np.random.normal(0, 1, out2_shape)
with mx.autograd.record():
    out1, out2 = new_op(a, b)
    loss = (out1 * out1_grad + out2 * out2_grad).sum()
loss.backward()
mx_a_grad = a.grad.asnumpy()
mx_b_grad = b.grad.asnumpy()
```
Then, in order to test it, you obtain some *ground-truth* gradient and compare
it with the gradient produced by the autograd engine (the same workflow applies
even if you are using PyTorch).
```python
import numpy.testing as npt
npt.assert_allclose(mx_a_grad, gt_a_grad + original_a_grad, 1E-3, 1E-3)
npt.assert_allclose(mx_b_grad, gt_b_grad, 1E-3, 1E-3)
```
### How to obtain the ground-truth gradient?
There are multiple ways to obtain this *ground-truth* gradient, and this is
where science plays an important role. For some simpler operators, you can
directly implement the gradient in NumPy and test against that.
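As a sketch of this approach (the `sigmoid` operator and its hand-derived gradient below are just illustrative choices, not part of any MXNet API):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x, head_grad):
    """Hand-derived ground-truth gradient: d sigmoid(x) / dx = s * (1 - s)."""
    s = sigmoid(x)
    return head_grad * s * (1.0 - s)

np.random.seed(0)
x = np.random.normal(0, 1, (3, 4))
head = np.random.normal(0, 1, (3, 4))
# Compare this ground truth against `a.grad` from the autograd engine
gt_grad = sigmoid_grad(x, head)
```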
However, another choice is to test with [Finite
Difference](https://en.wikipedia.org/wiki/Finite_difference), which works even
if the inner operator is a *black box*. In MXNet, we used the "central
difference" variant as shown in
https://github.com/apache/incubator-mxnet/blob/b0c39f7ea983639093c63d7d2486bbef083a55d6/python/mxnet/test_utils.py#L1013-L1023
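A minimal NumPy sketch of the central-difference idea (the `numeric_grad` helper below is my own simplified version, not MXNet's `check_numeric_gradient`):

```python
import numpy as np

def numeric_grad(f, x, eps=1e-4):
    """Central-difference approximation of the gradient of a scalar-valued f."""
    grad = np.zeros_like(x)
    it = np.nditer(x, flags=['multi_index'])
    while not it.finished:
        idx = it.multi_index
        orig = x[idx]
        x[idx] = orig + eps          # f(x + eps * e_i)
        f_plus = f(x)
        x[idx] = orig - eps          # f(x - eps * e_i)
        f_minus = f(x)
        x[idx] = orig                # restore the entry
        grad[idx] = (f_plus - f_minus) / (2 * eps)
        it.iternext()
    return grad

# Check the analytical gradient of f(x) = sum(x ** 2), which is 2 * x
np.random.seed(0)
x = np.random.normal(0, 1, (3, 4))
approx = numeric_grad(lambda v: (v ** 2).sum(), x)
np.testing.assert_allclose(approx, 2 * x, rtol=1e-3, atol=1e-3)
```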
In addition, you can use the same finite-difference technique to test
Jacobians, Hessians and general *higher-order* gradients. In fact, using finite
differences to approximate the Hessian is the key idea behind [Hessian-free
optimization](http://www.cs.toronto.edu/~jmartens/docs/Deep_HessianFree.pdf).
Also, you may refer to Chapter 8.1 of the book [Numerical
Optimization](https://link.springer.com/book/10.1007/978-0-387-40065-5).
Since random tests may cause flakiness, I think one solution is to always
fix the seed when running any randomized test.
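For example, with NumPy's global RNG (the same idea applies to MXNet via `mx.random.seed`):

```python
import numpy as np

np.random.seed(42)      # fix the seed so every CI run sees the same inputs
a = np.random.normal(0, 1, (2, 3))
np.random.seed(42)
b = np.random.normal(0, 1, (2, 3))
# Identical draws: the randomized test inputs are fully reproducible
assert (a == b).all()
```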
### Mixed Data Types
Things are more complicated when your operator needs to support
**mixed precision**. For example, assume that you need to test with **fp16**.
You cannot run the *finite-difference* test in *fp16* itself, because it won't
be accurate enough. Thus, I think the appropriate way is to always generate the
ground-truth gradient in *fp32*, cast it to fp16, and then compare the results.
(For this part, we may need to have some further discussion.)
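One way to sketch that recipe in NumPy (the helper name and the tolerances are my own assumptions, pending the discussion above):

```python
import numpy as np

def check_fp16_against_fp32(f, x_fp32, rtol=1e-2, atol=1e-2):
    """Run f in fp16 and compare against an fp32 'ground truth' cast to fp16."""
    gt = f(x_fp32).astype(np.float16)        # fp32 ground truth, cast down
    out = f(x_fp32.astype(np.float16))       # the fp16 run under test
    # Compare in fp32 with tolerances far looser than an fp32-vs-fp64 test
    np.testing.assert_allclose(out.astype(np.float32),
                               gt.astype(np.float32),
                               rtol=rtol, atol=atol)

np.random.seed(0)
x = np.random.normal(0, 1, (4, 5)).astype(np.float32)
check_fp16_against_fp32(lambda v: (v * v).sum(axis=1), x)
```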
## Test the correctness of **random** operators
Things become more complicated when your operator involves *randomness*.
For example, suppose you implemented an operator that generates Gaussian random
variables, and your task is to convince others that it really does generate
Gaussian samples. To do that, you can conduct **statistical testing**. For
example, you can compare the moments of the generated samples with those of the
ground-truth distribution (mean + variance for a Gaussian). A more principled
way is to run standard statistical tests such as the [Kolmogorov-Smirnov
Test](https://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test)
(https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.kstest.html)
or the [Chi-Square
Test](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.chisquare.html),
which help you verify whether the empirical distribution matches the
ground-truth distribution.
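As a sketch with SciPy (I test a plain NumPy sampler here; you would substitute the MXNet operator under test):

```python
import numpy as np
from scipy import stats

np.random.seed(0)
# Pretend these samples came from the operator under test
samples = np.random.normal(loc=0.0, scale=1.0, size=10000)

# Moment check: mean and variance should match the target N(0, 1)
assert abs(samples.mean() - 0.0) < 0.05
assert abs(samples.var() - 1.0) < 0.05

# Kolmogorov-Smirnov test against the standard normal CDF;
# a small statistic (large p-value) means we cannot reject N(0, 1)
statistic, p_value = stats.kstest(samples, 'norm')
assert statistic < 0.05
```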
In the current MXNet, we already provide a tool for Chi-Square testing:
https://github.com/apache/incubator-mxnet/blob/b0c39f7ea983639093c63d7d2486bbef083a55d6/python/mxnet/test_utils.py#L2104