[GitHub] kirk86 opened a new issue #13885: imagenet example failure to run properly

GitBox Tue, 15 Jan 2019 05:37:34 -0800

kirk86 opened a new issue #13885: imagenet example failure to run properly
URL: https://github.com/apache/incubator-mxnet/issues/13885
 
 
   ## Description
   I am following the tutorial at 
https://mxnet.incubator.apache.org/tutorials/vision/large_scale_classification.html
 to run ImageNet training on a single node with multiple GPUs, but it fails 
with the error shown under "Error Message". The only step I omitted from the 
tutorial is the optional one.
   
   ## Environment info (Required)
   
   ```
   ----------Python Info----------
   Version      : 3.6.8
   Compiler     : GCC 7.3.0
   Build        : ('default', 'Dec 30 2018 01:22:34')
   Arch         : ('64bit', '')
   ------------Pip Info-----------
   Version      : 18.1
   Directory    : /home/user/miniconda3/envs/mxnet/lib/python3.6/site-packages/pip
   ----------MXNet Info-----------
   Version      : 1.2.1
   Directory    : /home/user/miniconda3/envs/mxnet/lib/python3.6/site-packages/mxnet
   Hashtag not found. Not installed from pre-built package.
   ----------System Info----------
   Platform     : Linux-4.15.0-36-generic-x86_64-with-debian-buster-sid
   system       : Linux
   node         : theengine
   release      : 4.15.0-36-generic
   version      : #39-Ubuntu SMP Mon Sep 24 16:19:09 UTC 2018
   ----------Hardware Info----------
   machine      : x86_64
   processor    : x86_64
   Architecture:        x86_64
   CPU op-mode(s):      32-bit, 64-bit
   Byte Order:          Little Endian
   CPU(s):              72
   On-line CPU(s) list: 0-71
   Thread(s) per core:  2
   Core(s) per socket:  18
   Socket(s):           2
   NUMA node(s):        2
   Vendor ID:           GenuineIntel
   CPU family:          6
   Model:               79
   Model name:          Intel(R) Xeon(R) CPU E5-2695 v4 @ 2.10GHz
   Stepping:            1
   CPU MHz:             1990.164
   CPU max MHz:         3300.0000
   CPU min MHz:         1200.0000
   BogoMIPS:            4190.74
   Virtualization:      VT-x
   L1d cache:           32K
   L1i cache:           32K
   L2 cache:            256K
   L3 cache:            46080K
   NUMA node0 CPU(s):   0-17,36-53
   NUMA node1 CPU(s):   18-35,54-71
   Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single pti intel_ppin ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm rdt_a rdseed adx smap intel_pt xsaveopt cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts flush_l1d
   ----------Network Test----------
   Setting timeout: 10
   Timing for MXNet: https://github.com/apache/incubator-mxnet, DNS: 0.0038 sec, LOAD: 0.7055 sec.
   Timing for Gluon Tutorial(en): http://gluon.mxnet.io, DNS: 0.0036 sec, LOAD: 0.5789 sec.
   Timing for Gluon Tutorial(cn): https://zh.gluon.ai, DNS: 0.0077 sec, LOAD: 0.6072 sec.
   Timing for FashionMNIST: https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/dataset/fashion-mnist/train-labels-idx1-ubyte.gz, DNS: 0.0063 sec, LOAD: 0.9507 sec.
   Timing for PYPI: https://pypi.python.org/pypi/pip, DNS: 0.0141 sec, LOAD: 0.5458 sec.
   Timing for Conda: https://repo.continuum.io/pkgs/free/, DNS: 0.0032 sec, LOAD: 0.0308 sec.
   
   ```
   
   Package used (Python/R/Scala/Julia):
   Python
   
   ## Build info (Required if built from source)
   Installed through conda: mxnet-cu92
   
   MXNet commit hash:
   4fe5461eb98bdede589c511f486c1b934bfa6393
   
   Build config:
   ```
   # Licensed to the Apache Software Foundation (ASF) under one
   # or more contributor license agreements.  See the NOTICE file
   # distributed with this work for additional information
   # regarding copyright ownership.  The ASF licenses this file
   # to you under the Apache License, Version 2.0 (the
   # "License"); you may not use this file except in compliance
   # with the License.  You may obtain a copy of the License at
   #
   #   http://www.apache.org/licenses/LICENSE-2.0
   #
   # Unless required by applicable law or agreed to in writing,
   # software distributed under the License is distributed on an
   # "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
   # KIND, either express or implied.  See the License for the
   # specific language governing permissions and limitations
   # under the License.
   
   
   #-------------------------------------------------------------------------------
   #  Template configuration for compiling mxnet
   #
   #  If you want to change the configuration, please use the following
   #  steps. Assume you are on the root directory of mxnet. First copy this
   #  file so that any local changes will be ignored by git
   #
   #  $ cp make/config.mk .
   #
   #  Next modify the according entries, and then compile by
   #
   #  $ make
   #
   #  or build in parallel with 8 threads
   #
   #  $ make -j8
   
   #-------------------------------------------------------------------------------
   
   #---------------------
   # choice of compiler
   #--------------------
   
   ifndef CC
   export CC = gcc
   endif
   ifndef CXX
   export CXX = g++
   endif
   ifndef NVCC
   export NVCC = nvcc
   endif
   
   # whether compile with options for MXNet developer
   DEV = 0
   
   # whether compile with debug
   DEBUG = 0
   
   # whether to turn on segfault signal handler to log the stack trace
   USE_SIGNAL_HANDLER =
   
   # the additional link flags you want to add
   ADD_LDFLAGS =
   
   # the additional compile flags you want to add
   ADD_CFLAGS =
   
   #---------------------------------------------
   # matrix computation libraries for CPU/GPU
   #---------------------------------------------
   
   # whether use CUDA during compile
   USE_CUDA = 0
   
   # add the path to CUDA library to link and compile flag
   # if you have already added them to environment variable, leave it as NONE
   # USE_CUDA_PATH = /usr/local/cuda
   USE_CUDA_PATH = NONE
   
   # whether to enable CUDA runtime compilation
   ENABLE_CUDA_RTC = 1
   
   # whether use CuDNN R3 library
   USE_CUDNN = 0
   
   #whether to use NCCL library
   USE_NCCL = 0
   #add the path to NCCL library
   USE_NCCL_PATH = NONE
   
   # whether use opencv during compilation
   # you can disable it, however, you will not be able to use
   # imbin iterator
   USE_OPENCV = 1
   
   #whether use libjpeg-turbo for image decode without OpenCV wrapper
   USE_LIBJPEG_TURBO = 0
   #add the path to libjpeg-turbo library
   USE_LIBJPEG_TURBO_PATH = NONE
   
   # use openmp for parallelization
   USE_OPENMP = 1
   
   # whether use MKL-DNN library: 0 = disabled, 1 = enabled
   # if USE_MKLDNN is not defined, MKL-DNN will be enabled by default on x86 Linux.
   # you can disable it explicitly with USE_MKLDNN = 0
   USE_MKLDNN =
   
   # whether use NNPACK library
   USE_NNPACK = 0
   
   # choose the version of blas you want to use
   # can be: mkl, blas, atlas, openblas
   # in default use atlas for linux while apple for osx
   UNAME_S := $(shell uname -s)
   ifeq ($(UNAME_S), Darwin)
   USE_BLAS = apple
   else
   USE_BLAS = atlas
   endif
   
   # whether use lapack during compilation
   # only effective when compiled with blas versions openblas/apple/atlas/mkl
   USE_LAPACK = 1
   
   # path to lapack library in case of a non-standard installation
   USE_LAPACK_PATH =
   
   # add path to intel library, you may need it for MKL, if you did not add the path
   # to environment variable
   USE_INTEL_PATH = NONE
   
   # If use MKL only for BLAS, choose static link automatically to allow python wrapper
   ifeq ($(USE_BLAS), mkl)
   USE_STATIC_MKL = 1
   else
   USE_STATIC_MKL = NONE
   endif
   
   #----------------------------
   # Settings for power and arm arch
   #----------------------------
   ARCH := $(shell uname -a)
   ifneq (,$(filter $(ARCH), armv6l armv7l powerpc64le ppc64le aarch64))
           USE_SSE=0
           USE_F16C=0
   else
           USE_SSE=1
   endif
   
   #----------------------------
   # F16C instruction support for faster arithmetic of fp16 on CPU
   #----------------------------
   # For distributed training with fp16, this helps even if training on GPUs
   # If left empty, checks CPU support and turns it on.
   # For cross compilation, please check support for F16C on target device and turn off if necessary.
   USE_F16C =
   
   #----------------------------
   # distributed computing
   #----------------------------
   
   # whether or not to enable multi-machine supporting
   USE_DIST_KVSTORE = 0
   
   # whether or not allow to read and write HDFS directly. If yes, then hadoop is
   # required
   USE_HDFS = 0
   
   # path to libjvm.so. required if USE_HDFS=1
   LIBJVM=$(JAVA_HOME)/jre/lib/amd64/server
   
   # whether or not allow to read and write AWS S3 directly. If yes, then
   # libcurl4-openssl-dev is required, it can be installed on Ubuntu by
   # sudo apt-get install -y libcurl4-openssl-dev
   USE_S3 = 0
   
   #----------------------------
   # performance settings
   #----------------------------
   # Use operator tuning
   USE_OPERATOR_TUNING = 1
   
   # Use gperftools if found
   USE_GPERFTOOLS = 1
   
   # path to gperftools (tcmalloc) library in case of a non-standard installation
   USE_GPERFTOOLS_PATH =
   
   # Link gperftools statically
   USE_GPERFTOOLS_STATIC =
   
   # Use JEMalloc if found, and not using gperftools
   USE_JEMALLOC = 1
   
   # path to jemalloc library in case of a non-standard installation
   USE_JEMALLOC_PATH =
   
   # Link jemalloc statically
   USE_JEMALLOC_STATIC =
   
   #----------------------------
   # additional operators
   #----------------------------
   
   # path to folders containing projects specific operators that you don't want to put in src/operators
   EXTRA_OPERATORS =
   
   #----------------------------
   # other features
   #----------------------------
   
   # Create C++ interface package
   USE_CPP_PACKAGE = 0
   
   #----------------------------
   # plugins
   #----------------------------
   
   # whether to use caffe integration. This requires installing caffe.
   # You also need to add CAFFE_PATH/build/lib to your LD_LIBRARY_PATH
   # CAFFE_PATH = $(HOME)/caffe
   # MXNET_PLUGINS += plugin/caffe/caffe.mk
   
   # WARPCTC_PATH = $(HOME)/warp-ctc
   # MXNET_PLUGINS += plugin/warpctc/warpctc.mk
   
   # whether to use sframe integration. This requires build sframe
   # [email protected]:dato-code/SFrame.git
   # SFRAME_PATH = $(HOME)/SFrame
   # MXNET_PLUGINS += plugin/sframe/plugin.mk
   ```
   
   ## Error Message:
   ```
   Traceback (most recent call last):
     File "./incubator-mxnet/example/image-classification/train_imagenet.py", line 66, in <module>
       fit.fit(args, sym, data.get_rec_iter)
     File "/home/john/.kaggle/competitions/imagenet-object-localization-challenge/ILSVRC/Data/CLS-LOC/incubator-mxnet/example/image-classification/common/fit.py", line 180, in fit
       (train, val) = data_loader(args, kv)
     File "/home/john/.kaggle/competitions/imagenet-object-localization-challenge/ILSVRC/Data/CLS-LOC/incubator-mxnet/example/image-classification/common/data.py", line 184, in get_rec_iter
       part_index          = rank)
     File "/home/john/miniconda3/envs/mxnet/lib/python3.6/site-packages/mxnet/io.py", line 936, in creator
       ctypes.byref(iter_handle)))
     File "/home/john/miniconda3/envs/mxnet/lib/python3.6/site-packages/mxnet/base.py", line 149, in check_call
       raise MXNetError(py_str(_LIB.MXGetLastError()))
   mxnet.base.MXNetError: [13:34:40] src/io/input_split_base.cc:24: Check failed: files_[i].size % align_bytes == 0 file do not align by 4 bytes
   
   Stack trace returned 10 entries:
   [bt] (0) /home/john/miniconda3/envs/mxnet/lib/python3.6/site-packages/mxnet/libmxnet.so(dmlc::StackTrace[abi:cxx11]()+0x5a) [0x7f2d9e96712a]
   [bt] (1) /home/john/miniconda3/envs/mxnet/lib/python3.6/site-packages/mxnet/libmxnet.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x28) [0x7f2d9e967ba8]
   [bt] (2) /home/john/miniconda3/envs/mxnet/lib/python3.6/site-packages/mxnet/libmxnet.so(dmlc::io::InputSplitBase::Init(dmlc::io::FileSystem*, char const*, unsigned long, bool)+0x416) [0x7f2da176c546]
   [bt] (3) /home/john/miniconda3/envs/mxnet/lib/python3.6/site-packages/mxnet/libmxnet.so(dmlc::InputSplit::Create(char const*, char const*, unsigned int, unsigned int, char const*, bool, int, unsigned long, bool)+0x429) [0x7f2da1733e39]
   [bt] (4) /home/john/miniconda3/envs/mxnet/lib/python3.6/site-packages/mxnet/libmxnet.so(mxnet::io::ImageRecordIOParser2<float>::Init(std::vector<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > > const&)+0xe4e) [0x7f2da10fe5de]
   [bt] (5) /home/john/miniconda3/envs/mxnet/lib/python3.6/site-packages/mxnet/libmxnet.so(mxnet::io::ImageRecordIter2<float>::Init(std::vector<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > > const&)+0x8a) [0x7f2da10ff17a]
   [bt] (6) /home/john/miniconda3/envs/mxnet/lib/python3.6/site-packages/mxnet/libmxnet.so(MXDataIterCreateIter+0x3c1) [0x7f2da16be9a1]
   [bt] (7) /home/john/miniconda3/envs/mxnet/lib/python3.6/lib-dynload/../../libffi.so.6(ffi_call_unix64+0x4c) [0x7f2dba662ec0]
   [bt] (8) /home/john/miniconda3/envs/mxnet/lib/python3.6/lib-dynload/../../libffi.so.6(ffi_call+0x22d) [0x7f2dba66287d]
   [bt] (9) /home/john/miniconda3/envs/mxnet/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so(_ctypes_callproc+0x2ce) [0x7f2dbbbeaede]
   ```
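
The failing check, `files_[i].size % align_bytes == 0`, is applied to the size of each input file, so one quick diagnostic is to test the `.rec` files directly. The following is a minimal sketch and not part of the tutorial: the helper name `misaligned_files` is hypothetical, and the default `align_bytes=4` is taken from the "align by 4 bytes" text in the error.

```python
import os
import sys


def misaligned_files(paths, align_bytes=4):
    """Return (path, size, remainder) for each file whose size is not a
    multiple of align_bytes. The helper name is hypothetical; the value 4
    comes from the error text, not from reading the MXNet source."""
    bad = []
    for path in paths:
        size = os.path.getsize(path)
        remainder = size % align_bytes
        if remainder != 0:
            bad.append((path, size, remainder))
    return bad


if __name__ == "__main__":
    # Pass the .rec files used for training and validation as arguments.
    for path, size, remainder in misaligned_files(sys.argv[1:]):
        print(f"{path}: {size} bytes, {remainder} byte(s) past a 4-byte boundary")
```

A file reported here might be truncated or might not be a RecordIO file at all; that is a guess to investigate, not a confirmed cause of this issue.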
   
   ## Minimum reproducible example
   
   https://mxnet.incubator.apache.org/tutorials/vision/large_scale_classification.html
   
   ## Steps to reproduce
   Followed the steps in the tutorial, skipping only the optional one: 
https://mxnet.incubator.apache.org/tutorials/vision/large_scale_classification.html


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services
