access2rohit edited a comment on issue #17331:
URL:
https://github.com/apache/incubator-mxnet/issues/17331#issuecomment-622665471
@szha @eric-haibin-lin @apeforest
Benchmarks comparing current master against the new broadcast_axis changes, measured on a single-GPU training run on a p3.16xlarge instance.
Bert Run Command:
```
python3 run_pretraining.py --data='./part-0000.train' \
    --data_eval='./part-0000.train' --num_steps 100 --lr 1e-4 --optimizer lamb \
    --accumulate 1 --raw --gpus 0 --num_dataset_workers 2 --num_batch_workers 1 \
    --circle_length 1 --total_batch_size 4 --total_batch_size_eval 4 \
    --log_interval 10
```
Results:
| Code Version | avg throughput (samples/sec) | p50 | p90 | total time (training only, ignoring evaluation steps) |
|--------------|------------------------------|--------|--------|-------------------------------------------------------|
| master LT    | 24.38k                       | 25.50k | 28.47k | 134.8 sec                                             |
| master       | 25.90k                       | 25.90k | 27.82k | 131.9 sec                                             |
| new LT       | 25.87k                       | 25.80k | 28.00k | 127.3 sec                                             |
| new          | 25.92k                       | 25.80k | 27.80k | 131.5 sec                                             |
"new" refers to mxnet code with optimized broadcast_axis.
"master" refers to mxnet master branch code
"LT" refers to of the build was done after enabling large tensor.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]