CoinCheung opened a new issue #9343: memory leakage?
URL: https://github.com/apache/incubator-mxnet/issues/9343

## Description

Following the examples, I created a ResNet model and trained it on CIFAR-10. After around 6 epochs with a batch size of 256, my memory usage rose to 100% and the program terminated.

## Environment info (Required)

```
----------Python Info----------
Version      : 3.6.3
Compiler     : GCC 7.2.0
Build        : ('default', 'Oct 24 2017 14:48:20')
Arch         : ('64bit', '')
------------Pip Info-----------
Version      : 9.0.1
Directory    : /usr/lib/python3.6/site-packages/pip
----------MXNet Info-----------
Version      : 1.0.0
Directory    : /home/coin/.local/lib/python3.6/site-packages/mxnet
Commit Hash  : 7a83d67ecd3e49a3b342cc896598cc0d44de60e0
----------System Info----------
Platform     : Linux-4.13.12-1-ARCH-x86_64-with-arch
system       : Linux
node         : Arch-R720
release      : 4.13.12-1-ARCH
version      : #1 SMP PREEMPT Wed Nov 8 11:54:06 CET 2017
----------Hardware Info----------
machine      : x86_64
processor    :
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              4
On-line CPU(s) list: 0-3
Thread(s) per core:  1
Core(s) per socket:  4
Socket(s):           1
NUMA node(s):        1
Vendor ID:           GenuineIntel
CPU family:          6
Model:               158
Model name:          Intel(R) Core(TM) i5-7300HQ CPU @ 2.50GHz
Stepping:            9
CPU MHz:             3143.299
CPU max MHz:         3500.0000
CPU min MHz:         800.0000
BogoMIPS:            4993.00
Virtualization:      VT-x
L1d cache:           32K
L1i cache:           32K
L2 cache:            256K
L3 cache:            6144K
NUMA node0 CPU(s):   0-3
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault intel_pt tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx rdseed adx smap clflushopt xsaveopt xsavec xgetbv1 xsaves dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp
----------Network Test----------
Setting timeout: 10
Timing for MXNet: https://github.com/apache/incubator-mxnet, DNS: 0.0082 sec, LOAD: 4.7975 sec.
Timing for Gluon Tutorial(en): http://gluon.mxnet.io, DNS: 0.0199 sec, LOAD: 0.2834 sec.
Timing for Gluon Tutorial(cn): https://zh.gluon.ai, DNS: 0.0315 sec, LOAD: 0.2753 sec.
Timing for FashionMNIST: https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/dataset/fashion-mnist/train-labels-idx1-ubyte.gz, DNS: 0.0268 sec, LOAD: 0.7383 sec.
Timing for PYPI: https://pypi.python.org/pypi/pip, DNS: 0.0072 sec, LOAD: 0.6306 sec.
Timing for Conda: https://repo.continuum.io/pkgs/free/, DNS: 0.0071 sec, LOAD: 0.8748 sec.
```

Package used (Python/R/Scala/Julia): python

## Error Message:

```
....
epoch 5, iteration 52
epoch 5, iteration 53
terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc
Command terminated
```

## Minimum reproducible example

```
#!/usr/bin/python
# filename: net.py
import mxnet as mx
from mxnet import nd, autograd, gluon
import numpy as np

mx.random.seed(1)

# running context
ctx_cpu = mx.cpu()

# dataset parameters
batch_size = 256
epoch = 10

def data_pre_deal(data, label):
    return nd.transpose(data.astype(np.float32), (2,0,1)), label.astype(np.float32)

cifar10_train = gluon.data.vision.CIFAR10('./cifar10/train', train=True, transform=data_pre_deal)
cifar10_test = gluon.data.vision.CIFAR10('./cifar10/test', train=False, transform=data_pre_deal)

# define data iterators
train_data = gluon.data.DataLoader(cifar10_train, batch_size=batch_size, shuffle=True)
test_data = gluon.data.DataLoader(cifar10_test, shuffle=True, batch_size=batch_size)

# construct the bottleneck resnet unit
class bottleneck(gluon.HybridBlock):
    def __init__(self, filter_num, in_channels, stride=(1,1), prefix=None, params=None):
        super(bottleneck, self).__init__(prefix=prefix, params=params)
        self.stride = stride
        self.filter_num = filter_num
        self.in_channels = in_channels
        # create a sequential for the residual branch
        self.res = gluon.nn.HybridSequential(prefix='')
        with self.res.name_scope():
            # add the first convolution layer
            self.res.add(gluon.nn.BatchNorm(axis=1))
            self.res.add(gluon.nn.Activation(activation='relu'))
            self.res.add(gluon.nn.Conv2D(channels=filter_num//4, in_channels=in_channels,
                                         kernel_size=1, strides=(1,1), layout='NCHW'))
            # add the second convolution layer
            self.res.add(gluon.nn.BatchNorm(axis=1))
            self.res.add(gluon.nn.Activation(activation='relu'))
            self.res.add(gluon.nn.Conv2D(channels=filter_num//4, in_channels=filter_num//4,
                                         kernel_size=3, strides=stride, padding=(1,1), layout='NCHW'))
            # add the third convolution layer
            self.res.add(gluon.nn.BatchNorm(axis=1))
            self.res.add(gluon.nn.Activation(activation='relu'))
            self.res.add(gluon.nn.Conv2D(channels=filter_num, in_channels=filter_num//4,
                                         kernel_size=1, strides=(1,1), layout='NCHW'))
        # create a sequential for the shortcut branch
        if self.stride != (1,1) or self.filter_num != in_channels:
            self.shortcut = gluon.nn.HybridSequential(prefix='')
            self.shortcut.add(gluon.nn.Conv2D(channels=filter_num, in_channels=in_channels,
                                              kernel_size=1, strides=stride, layout='NCHW'))

    def hybrid_forward(self, F, x):
        res = self.res(x)
        if self.stride != (1,1) or self.filter_num != self.in_channels:
            sc = self.shortcut(x)
        else:
            sc = x
        out = sc + res
        return out

# construct the last few layers
class lastDense(gluon.HybridBlock):
    def __init__(self, units, prefix=None, params=None):
        super(lastDense, self).__init__(prefix=prefix, params=params)
        # create a sequential
        self.last = gluon.nn.HybridSequential(prefix='')
        with self.last.name_scope():
            self.last.add(gluon.nn.BatchNorm(axis=1))
            self.last.add(gluon.nn.Activation('relu'))
            self.last.add(gluon.nn.GlobalAvgPool2D(layout='NCHW'))
            self.last.add(gluon.nn.Dense(units))

    def hybrid_forward(self, F, x):
        return self.last(x)

# construct the model
resnet = gluon.nn.Sequential()
with resnet.name_scope():
    resnet.add(bottleneck(64, in_channels=3))                   # 28x28x64
    resnet.add(bottleneck(128, in_channels=64, stride=(2,2)))   # 14x14x128
    resnet.add(bottleneck(128, in_channels=128))                # 14x14x128
    resnet.add(bottleneck(256, in_channels=128, stride=(2,2)))  # 7x7x256
    resnet.add(bottleneck(256, in_channels=256))                # 7x7x256
    resnet.add(lastDense(10))

# model initialization
resnet.collect_params().initialize(mx.init.Xavier(magnitude=2.24), ctx=ctx_cpu)

# definition of loss function and trainer
loss_fun = gluon.loss.SoftmaxCrossEntropyLoss()
trainer = gluon.Trainer(resnet.collect_params(), 'adam',
                        {'learning_rate': 1e-3, 'beta1': 0.9, 'beta2': 0.999, 'epsilon': 1e-7})

# train the model
for e in range(epoch):
    for i, (data, label) in enumerate(train_data):
        data = data.as_in_context(ctx_cpu)
        label = label.as_in_context(ctx_cpu)
        with autograd.record():
            output = resnet(data)
            loss = loss_fun(output, label)
        print("epoch {}, iteration {}".format(e, i))
```

## Steps to reproduce

(Paste the commands you ran that produced the error.)

1. Download the CIFAR-10 data into the directories `cifar10/train` and `cifar10/test` next to the script.
2. Run the script:

```
$ python net.py
```

## What have you tried to solve it?

I have no idea why the allocated memory is not freed in time. However, if I do not use the self-defined blocks, there is no such memory leakage problem.
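One detail worth noting: the training loop as posted records the forward pass but never calls `loss.backward()` or `trainer.step()`, and nothing in the loop blocks on MXNet's asynchronous engine, so the Python frontend can enqueue batches faster than the backend finishes them. Below is a hedged sketch of the loop with those calls added, intended as a drop-in for the loop at the end of `net.py` (it assumes the script's `train_data`, `resnet`, `loss_fun`, `trainer`, `batch_size`, `epoch`, `ctx_cpu`, `nd`, and `autograd` are in scope; that this fully explains the memory growth is an assumption, not a confirmed diagnosis):

```python
for e in range(epoch):
    for i, (data, label) in enumerate(train_data):
        data = data.as_in_context(ctx_cpu)
        label = label.as_in_context(ctx_cpu)
        with autograd.record():
            output = resnet(data)
            loss = loss_fun(output, label)
        loss.backward()
        trainer.step(batch_size)
        # asscalar() is a blocking call, so each iteration waits for the
        # backend to finish this batch before the next one is enqueued
        print("epoch {}, iteration {}, loss {}".format(e, i, nd.mean(loss).asscalar()))
```

Alternatively, a `mx.nd.waitall()` at the end of each iteration should have a similar synchronizing effect.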
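As a first debugging step, the standard-library `tracemalloc` module can show whether the growth comes from Python-level objects; native (C++) allocations such as NDArray buffers do not appear in these snapshots, so memory that grows while the snapshots stay flat points at the engine rather than the script. A generic, MXNet-free sketch (the `leaky_step` helper is hypothetical, a stand-in for one training iteration):

```python
import tracemalloc

def leaky_step(history):
    # stand-in for a training step that accidentally keeps references alive
    history.append([0] * 10000)

tracemalloc.start()
history = []

before = tracemalloc.take_snapshot()
for _ in range(5):
    leaky_step(history)
after = tracemalloc.take_snapshot()

# a positive total size_diff means Python objects are accumulating between
# the two snapshots; per-line stats show where they were allocated
stats = after.compare_to(before, 'lineno')
growth = sum(stat.size_diff for stat in stats)
print(growth > 0)  # prints True for this deliberately leaky helper
```

Wrapping a few real iterations of the training loop this way would at least separate a Python-side reference leak from backend memory pressure.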
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]

With regards,
Apache Git Services
