canteen-man commented on issue #14822: When using the imglist file to load the data, if the number of samples in the dataset isn't an integer multiple of the batch size, the loss becomes nan
URL: https://github.com/apache/incubator-mxnet/issues/14822#issuecomment-488964610

@lanking520, the lst file looks like this:

    0 0.2083763 picture_root/1.png
    1 0.1911981 picture_root/2.png
    2 0.3475569 picture_root/3.png

I load the images and train with:

    import mxnet as mx

    training_iter = mx.image.ImageIter(batch_size=batch_size, data_shape=(3, 256, 256), path_root=xxx, path_imglist=xxx)
    mod = mx.mod.Module(symbol=xxx, context=xxx, data_names=['xxx'], label_names=['xxx'])
    mod.bind(data_shapes=training_iter.provide_data, label_shapes=training_iter.provide_label)
    mod.init_params(mx.initializer.Xavier())
    lr_sch = mx.lr_scheduler.FactorScheduler(step=2000, factor=0.5)
    mod.fit(train_data=training_iter, optimizer='sgd',
            optimizer_params=(('learning_rate', 0.1), ('lr_scheduler', lr_sch)),
            eval_metric='mse', num_epoch=500, epoch_end_callback=checkpoint)

I then printed the labels in the fit function of base_module.py, adding this line inside the "while not end_of_batch" loop:

    print("label", data_batch.label)

The printed labels are abnormal in the last batch of an epoch whenever the number of samples isn't an integer multiple of the batch size. For example, with 10 samples and a batch size of 6, the loss is nan. With 10 samples and a batch size of 5, the printed labels are normal again.
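In both examples the dataset has 10 samples; what differs is whether the last batch contains slots with no sample behind them. A minimal sketch of that arithmetic, using a hypothetical helper `batch_layout` (not an MXNet function):

```python
def batch_layout(num_samples, batch_size):
    """Return (number of batches per epoch, slots in the last batch with no sample)."""
    num_batches = -(-num_samples // batch_size)  # ceiling division
    unfilled = num_batches * batch_size - num_samples
    return num_batches, unfilled


for n, bs in [(10, 6), (10, 5)]:
    print(n, bs, batch_layout(n, bs))
```

With batch size 6 the second batch has 2 slots that the iterator must fill somehow; with batch size 5 it has none, which lines up with only the first case going wrong.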
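###########################################################################
Forget it, I can't take this anymore, I'm switching to Chinese. You're Chinese too, right?
When I load data with this imglist, I found that if the total number of samples isn't an integer multiple of the batch size, all the earlier batches in an epoch are fine, but the loss becomes nan as soon as the epoch ends. It turns nan exactly at the end of an epoch. So I printed variables in the source code and found that the labels printed for the last batch of the epoch are strange numbers, definitely not values from my lst file, and they only appear in that last batch. Then I read the dataset in its original order, without shuffling, and found that the first few labels of the last batch are normal; the abnormal labels in the last batch start right after the dataset's last label.
Shouldn't the iterator, after loading the last label, wrap around and keep loading from the beginning until the last batch is full? Instead, those slots hold very strange numbers: either huge (exponents in the e+10s) or tiny (exponents in the e-10s).
In the source code I only changed how many epochs pass between model checkpoints and added some prints of variables; I didn't change anything important. The training code I used is exactly what's shown above.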
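The wrap-around behaviour expected above, plus one way to keep padded slots out of the loss, can be sketched in plain Python. This is an illustration under assumptions, not ImageIter's actual implementation: `wraparound_batches` and `masked_mse` are hypothetical helpers. For reference, an MXNet `DataBatch` reports how many trailing samples are padding in its `pad` attribute, and iterators such as `mx.io.NDArrayIter` accept a `last_batch_handle` argument ('pad', 'discard', 'roll_over'); whether your ImageIter version supports the same option should be checked in its documentation. Labels on the order of 1e+15 feeding an MSE loss at learning rate 0.1 are one plausible route to nan, which is why the sketch masks by the pad count rather than trusting the label values.

```python
def wraparound_batches(labels, batch_size):
    """Yield (batch_labels, pad) pairs. An incomplete last batch is topped up
    with samples taken from the start of the dataset; pad records how many of
    its trailing entries are such repeats."""
    n = len(labels)
    for start in range(0, n, batch_size):
        batch = list(labels[start:start + batch_size])
        pad = batch_size - len(batch)
        # Wrap around to the beginning (modulo n in case batch_size > n).
        batch += [labels[i % n] for i in range(pad)]
        yield batch, pad


def masked_mse(preds, labels, pad):
    """Mean squared error over the real samples only; the last `pad` entries
    of the batch are ignored."""
    valid = len(labels) - pad
    if valid <= 0:
        raise ValueError("batch contains no real samples")
    return sum((p - y) ** 2 for p, y in zip(preds[:valid], labels[:valid])) / valid


labels = [0.1 * i for i in range(10)]  # 10 made-up labels
batches = list(wraparound_batches(labels, 6))
last_batch, pad = batches[-1]
print(len(batches), pad)                  # 2 2
print(last_batch[-pad:] == labels[:pad])  # True

# Even if the padded slots held garbage such as 1e15, the masked loss ignores them.
garbage = last_batch[:-pad] + [1e15] * pad
print(masked_mse(last_batch, garbage, pad))  # 0.0
```

Masking by the pad count keeps the metric correct whether the iterator fills the last batch by wrapping around or leaves arbitrary values in those slots.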
