ptrendx commented on issue #9774: does not respect dtype 
   I don't think it will perform better than producing fp32 batches and then casting them to
fp16 at the beginning of the training step:
   1) You need double-buffering of the data at the beginning of training anyway in
order to hide the CPU->GPU transfers.
   2) You would then be doing fp16 computation on the CPU inside the data iterator.
That is slow, because CPUs do not have native fp16 support.
