## Description

### Short Version

`mxnet.image.imdecode` crashes and then hangs when the Gluon data loader reads 
certain corrupt images. An obvious workaround is to wrap `imdecode` in a 
try/except block, but the try/except does not catch the resulting `MXNetError`.

### Long Version

I am working with a dataset so large that cleaning every image beforehand is 
impractical. Currently, when the Gluon data loader loads a corrupt image, 
`imdecode` fails with an `MXNetError` exception (see Error Message below) and 
the process then hangs. *Ultimately, I would like the Gluon data loader to 
ignore corrupt images instead of crashing.*

My idea for working around this issue is as follows: wrap the `imdecode` call 
in a try/except block and, whenever an exception occurs, return a dummy image 
(and label). During training, I can then recognize the dummy image/label and 
skip backpropagation for that sample. I've tried that (see What have you tried 
to solve it? below), but it does not work because the Python try/except does 
not catch the `MXNetError`.
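
The "ignore the dummy sample" half of this idea can be sketched independently of the decoding problem. The snippet below is a minimal illustration, not part of MXNet: `drop_missing` and `make_batchify_fn` are hypothetical helper names, and `stack_fn` stands for whatever batching function the pipeline already uses. It assumes labels are still plain Python integers when the batch is assembled, and that the resulting function is passed to the Gluon `DataLoader` through its `batchify_fn` argument.

```python
MISSING_LABEL = -1234  # same value as DEFAULT_MISSING_LABELS_SENTINEL further down


def drop_missing(samples, missing_label=MISSING_LABEL):
    """Return only the (image, label) pairs whose label is not the sentinel."""
    return [(img, lbl) for img, lbl in samples if lbl != missing_label]


def make_batchify_fn(stack_fn, missing_label=MISSING_LABEL):
    """Wrap a batching function so that dummy samples never reach it.

    stack_fn is an assumption: any callable that turns a list of
    (image, label) pairs into a batch.
    """
    def batchify(samples):
        kept = drop_missing(samples, missing_label)
        if not kept:
            # Every sample in this batch was a dummy; the training loop
            # has to skip a None batch explicitly.
            return None
        return stack_fn(kept)
    return batchify
```

Filtering at batching time keeps the dataset's length and indexing unchanged, at the cost of occasionally producing a batch smaller than the requested size.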

I think there should be a mechanism to capture an error from `imdecode` or 
`imread` (both from `mxnet.image`) rather than crashing, unless I am missing 
something.

## Environment info (Required)

```
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                32
On-line CPU(s) list:   0-31
Thread(s) per core:    2
Core(s) per socket:    16
Socket(s):             1
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 79
Model name:            Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
Stepping:              1
CPU MHz:               2699.804
CPU max MHz:           3000.0000
CPU min MHz:           1200.0000
BogoMIPS:              4600.11
Hypervisor vendor:     Xen
Virtualization type:   full
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              46080K
NUMA node0 CPU(s):     0-31
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single kaiser fsgsbase bmi1 hle avx2 smep bmi2 erms invpcid rtm rdseed adx xsaveopt
----------Python Info----------
Version      : 3.7.0
Compiler     : GCC 7.2.0
Build        : ('default', 'Jun 28 2018 13:15:42')
Arch         : ('64bit', '')
------------Pip Info-----------
Version      : 18.0
Directory    : /home/ubuntu/anaconda3/envs/mxnet_latest/lib/python3.7/site-packages/pip
----------MXNet Info-----------
Version      : 1.3.0
Directory    : /home/ubuntu/incubator-mxnet/python/mxnet
Hashtag not found. Not installed from pre-built package.
----------System Info----------
Platform     : Linux-4.4.0-1062-aws-x86_64-with-debian-stretch-sid
system       : Linux
node         : ip-172-31-35-198
release      : 4.4.0-1062-aws
version      : #71-Ubuntu SMP Fri Jun 15 10:07:39 UTC 2018
----------Hardware Info----------
machine      : x86_64
processor    : x86_64
----------Network Test----------
Setting timeout: 10
Timing for MXNet: https://github.com/apache/incubator-mxnet, DNS: 0.0022 sec, LOAD: 0.4228 sec.
Timing for Gluon Tutorial(en): http://gluon.mxnet.io, DNS: 0.0008 sec, LOAD: 0.3465 sec.
Timing for Gluon Tutorial(cn): https://zh.gluon.ai, DNS: 0.0006 sec, LOAD: 0.3446 sec.
Timing for FashionMNIST: https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/dataset/fashion-mnist/train-labels-idx1-ubyte.gz, DNS: 0.0005 sec, LOAD: 0.1155 sec.
Timing for PYPI: https://pypi.python.org/pypi/pip, DNS: 0.0005 sec, LOAD: 0.0623 sec.
Timing for Conda: https://repo.continuum.io/pkgs/free/, DNS: 0.0004 sec, LOAD: 0.0202 sec.
```

Package used (Python/R/Scala/Julia): Python

## Build info (Required if built from source)

Compiler (gcc/clang/mingw/visual studio): gcc

MXNet commit hash: a6ecb5919d867e8c01acbaaadad2a3cc24638530

Build config:
```
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements.  See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership.  The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License.  You may obtain a copy of the License at
#
#   http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied.  See the License for the
# specific language governing permissions and limitations
# under the License.

#-------------------------------------------------------------------------------
#  Template configuration for compiling mxnet
#
#  If you want to change the configuration, please use the following
#  steps. Assume you are on the root directory of mxnet. First copy the this
#  file so that any local changes will be ignored by git
#
#  $ cp make/config.mk .
#
#  Next modify the according entries, and then compile by
#
#  $ make
#
#  or build in parallel with 8 threads
#
#  $ make -j8
#-------------------------------------------------------------------------------

#---------------------
# choice of compiler
#--------------------

ifndef CC
export CC = gcc
endif
ifndef CXX
export CXX = g++
endif
ifndef NVCC
export NVCC = nvcc
endif

# whether compile with options for MXNet developer
DEV = 0

# whether compile with debug
DEBUG = 0

# whether to turn on segfault signal handler to log the stack trace
USE_SIGNAL_HANDLER =

# the additional link flags you want to add
ADD_LDFLAGS =

# the additional compile flags you want to add
ADD_CFLAGS =

#---------------------------------------------
# matrix computation libraries for CPU/GPU
#---------------------------------------------

# whether use CUDA during compile
USE_CUDA = 0

# add the path to CUDA library to link and compile flag
# if you have already add them to environment variable, leave it as NONE
# USE_CUDA_PATH = /usr/local/cuda
USE_CUDA_PATH = NONE

# whether to enable CUDA runtime compilation
ENABLE_CUDA_RTC = 1

# whether use CuDNN R3 library
USE_CUDNN = 0

#whether to use NCCL library
USE_NCCL = 0
#add the path to NCCL library
USE_NCCL_PATH = NONE

# whether use opencv during compilation
# you can disable it, however, you will not able to use
# imbin iterator
USE_OPENCV = 1

#whether use libjpeg-turbo for image decode without OpenCV wrapper
USE_LIBJPEG_TURBO = 0
#add the path to libjpeg-turbo library
USE_LIBJPEG_TURBO_PATH = NONE

# use openmp for parallelization
USE_OPENMP = 1

# whether use MKL-DNN library
USE_MKLDNN = 0

# whether use NNPACK library
USE_NNPACK = 0

# choose the version of blas you want to use
# can be: mkl, blas, atlas, openblas
# in default use atlas for linux while apple for osx
UNAME_S := $(shell uname -s)
ifeq ($(UNAME_S), Darwin)
USE_BLAS = apple
else
USE_BLAS = atlas
endif

# whether use lapack during compilation
# only effective when compiled with blas versions openblas/apple/atlas/mkl
USE_LAPACK = 1

# path to lapack library in case of a non-standard installation
USE_LAPACK_PATH =

# add path to intel library, you may need it for MKL, if you did not add the path
# to environment variable
USE_INTEL_PATH = NONE

# If use MKL only for BLAS, choose static link automatically to allow python wrapper
ifeq ($(USE_BLAS), mkl)
USE_STATIC_MKL = 1
else
USE_STATIC_MKL = NONE
endif

#----------------------------
# Settings for power and arm arch
#----------------------------
ARCH := $(shell uname -a)
ifneq (,$(filter $(ARCH), armv6l armv7l powerpc64le ppc64le aarch64))
        USE_SSE=0
        USE_F16C=0
else
        USE_SSE=1
endif

#----------------------------
# F16C instruction support for faster arithmetic of fp16 on CPU
#----------------------------
# For distributed training with fp16, this helps even if training on GPUs
# If left empty, checks CPU support and turns it on.
# For cross compilation, please check support for F16C on target device and turn off if necessary.
USE_F16C =

#----------------------------
# distributed computing
#----------------------------

# whether or not to enable multi-machine supporting
USE_DIST_KVSTORE = 0

# whether or not allow to read and write HDFS directly. If yes, then hadoop is
# required
USE_HDFS = 0

# path to libjvm.so. required if USE_HDFS=1
LIBJVM=$(JAVA_HOME)/jre/lib/amd64/server

# whether or not allow to read and write AWS S3 directly. If yes, then
# libcurl4-openssl-dev is required, it can be installed on Ubuntu by
# sudo apt-get install -y libcurl4-openssl-dev
USE_S3 = 0

#----------------------------
# performance settings
#----------------------------
# Use operator tuning
USE_OPERATOR_TUNING = 1

# Use gperftools if found
USE_GPERFTOOLS = 1

# Use JEMalloc if found, and not using gperftools
USE_JEMALLOC = 1

#----------------------------
# additional operators
#----------------------------

# path to folders containing projects specific operators that you don't want to put in src/operators
EXTRA_OPERATORS =

#----------------------------
# other features
#----------------------------

# Create C++ interface package
USE_CPP_PACKAGE = 0

#----------------------------
# plugins
#----------------------------

# whether to use caffe integration. This requires installing caffe.
# You also need to add CAFFE_PATH/build/lib to your LD_LIBRARY_PATH
# CAFFE_PATH = $(HOME)/caffe
# MXNET_PLUGINS += plugin/caffe/caffe.mk

# WARPCTC_PATH = $(HOME)/warp-ctc
# MXNET_PLUGINS += plugin/warpctc/warpctc.mk

# whether to use sframe integration. This requires build sframe
# git@github.com:dato-code/SFrame.git
# SFRAME_PATH = $(HOME)/SFrame
# MXNET_PLUGINS += plugin/sframe/plugin.mk
```

## Error Message

```
Process Process-2:
Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/envs/mxnet_latest/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
    self.run()
  File "/home/ubuntu/anaconda3/envs/mxnet_latest/lib/python3.7/multiprocessing/process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
  File "/home/ubuntu/incubator-mxnet/python/mxnet/gluon/data/dataloader.py", line 170, in worker_loop
    data_queue.put((idx, batch))
  File "/home/ubuntu/anaconda3/envs/mxnet_latest/lib/python3.7/multiprocessing/queues.py", line 358, in put
    obj = _ForkingPickler.dumps(obj)
  File "/home/ubuntu/anaconda3/envs/mxnet_latest/lib/python3.7/multiprocessing/reduction.py", line 51, in dumps
    cls(buf, protocol).dump(obj)
  File "/home/ubuntu/incubator-mxnet/python/mxnet/gluon/data/dataloader.py", line 63, in reduce_ndarray
    pid, fd, shape, dtype = data._to_shared_mem()
  File "/home/ubuntu/incubator-mxnet/python/mxnet/ndarray/ndarray.py", line 200, in _to_shared_mem
    self.handle, ctypes.byref(shared_pid), ctypes.byref(shared_id)))
  File "/home/ubuntu/incubator-mxnet/python/mxnet/base.py", line 255, in check_call
    raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: [19:31:59] src/io/image_io.cc:162: Check failed: !dst.empty() Decoding failed. Invalid image file.
 
Stack trace returned 10 entries:
[bt] (0) /home/ubuntu/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(dmlc::StackTrace[abi:cxx11]()+0x5b) [0x7ff3b5cb5a0b]
[bt] (1) /home/ubuntu/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x28) [0x7ff3b5cb6578]
[bt] (2) /home/ubuntu/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::io::ImdecodeImpl(int, bool, void*, unsigned long, mxnet::NDArray*)+0x4c6) [0x7ff3b83e6fd6]
[bt] (3) /home/ubuntu/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(+0x3a279db) [0x7ff3b8a4b9db]
[bt] (4) /home/ubuntu/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::engine::ThreadedEngine::ExecuteOprBlock(mxnet::RunContext, mxnet::engine::OprBlock*)+0x8e5) [0x7ff3b8a45e35]
[bt] (5) /home/ubuntu/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(std::_Function_handler<void (std::shared_ptr<dmlc::ManualEvent>), mxnet::engine::ThreadedEnginePerDevice::PushToExecute(mxnet::engine::OprBlock*, bool)::{lambda()#1}::operator()() const::{lambda(std::shared_ptr<dmlc::ManualEvent>)#1}>::_M_invoke(std::_Any_data const&, std::shared_ptr<dmlc::ManualEvent>&&)+0xe2) [0x7ff3b8a5c642]
[bt] (6) /home/ubuntu/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(std::thread::_Impl<std::_Bind_simple<std::function<void (std::shared_ptr<dmlc::ManualEvent>)> (std::shared_ptr<dmlc::ManualEvent>)> >::_M_run()+0x4a) [0x7ff3b8a4543a]
[bt] (7) /home/ubuntu/anaconda3/envs/mxnet_latest/bin/../lib/libstdc++.so.6(+0xafc5c) [0x7ff40d4cfc5c]
[bt] (8) /lib/x86_64-linux-gnu/libpthread.so.0(+0x76ba) [0x7ff41b5f66ba]
[bt] (9) /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7ff41b32c41d]
```

## Minimum reproducible example

Run `train_imagenet.py` from GluonCV 
(https://github.com/dmlc/gluon-cv/blob/master/scripts/classification/imagenet/train_imagenet.py 
at commit 863f19bc86cda0f785b97c39a360fbd8cbd1b0e1) on a training dataset that 
contains corrupt images (e.g., an image file with 0 bytes).
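
For the 0-byte case in particular, a cheap pre-check can reject a file before it ever reaches the decoder. The sketch below is only an illustration: `looks_like_image` is a hypothetical helper, and it tests nothing more than the file size and the leading signature bytes of JPEG and PNG files, so a file that is truncated or damaged after the header would still pass.

```python
import os

# Leading signature ("magic") bytes of the two formats the dataset accepts.
JPEG_MAGIC = b"\xff\xd8\xff"
PNG_MAGIC = b"\x89PNG\r\n\x1a\n"


def looks_like_image(path):
    """Return True if path is a non-empty file that starts with a JPEG or PNG signature."""
    try:
        if not os.path.isfile(path) or os.path.getsize(path) == 0:
            return False
        with open(path, "rb") as f:
            header = f.read(8)  # the PNG signature is 8 bytes long
    except OSError:
        # Unreadable files are treated the same way as corrupt ones.
        return False
    return header.startswith(JPEG_MAGIC) or header.startswith(PNG_MAGIC)
```

Because the check reads at most 8 bytes per file, it could run while the file list is built or on each sample access without a separate cleaning pass over the dataset.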

## What have you tried to solve it?

1. I modified `ImageFolderDataset` as shown below so that, in theory, it could 
handle corrupt images. However, the try/except does not catch the `MXNetError`.

```
import os
import warnings

import mxnet as mx
from mxnet import gluon, image

DEFAULT_IMAGE_SIZE = 224
DEFAULT_MISSING_LABELS_SENTINEL = -1234

class ImageFolderDataset(gluon.data.Dataset):
    """A dataset for loading image files stored in a folder structure like::
        root/car/0001.jpg
        root/car/xxxa.jpg
        root/car/yyyb.jpg
        root/bus/123.jpg
        root/bus/023.jpg
        root/bus/wwww.jpg
    Parameters
    ----------
    root : str
        Path to root directory.
    flag : {0, 1}, default 1
        If 0, always convert loaded images to greyscale (1 channel).
        If 1, always convert loaded images to colored (3 channels).
    transform : callable, default None
        A function that takes data and label and transforms them:
    ::
        transform = lambda data, label: (data.astype(np.float32)/255, label)
    Attributes
    ----------
    synsets : list
        List of class names. `synsets[i]` is the name for the integer label `i`
    items : list of tuples
        List of all images in (filename, label) pairs.
    """
    def __init__(self, root, flag=1, transform=None,
                 missing_sentinel=DEFAULT_MISSING_LABELS_SENTINEL):
        self._root = os.path.expanduser(root)
        self._flag = flag
        self._transform = transform
        self._missing_sentinel = missing_sentinel
        self._exts = tuple(['.jpg', '.jpeg', '.png'])
        self._list_images(self._root)
 
    def _list_images(self, root):
        self.synsets = []
        self.items = []
 
        for folder in sorted(os.listdir(root)):
            path = os.path.join(root, folder)
            if not os.path.isdir(path):
                warnings.warn('Ignoring %s, which is not a directory.'%path,
                              stacklevel=3)
                continue
            label = len(self.synsets)
            self.synsets.append(folder)
            for filename in sorted(os.listdir(path)):
                filename = os.path.join(path, filename)
                ext = os.path.splitext(filename)[1]
                if ext.lower() not in self._exts:
                    warnings.warn('Ignoring %s of type %s. Only support %s'%(
                        filename, ext, ', '.join(self._exts)))
                    continue
                self.items.append((filename, label))
 
    def __getitem__(self, idx):
        file_name = self.items[idx][0]
        if os.path.exists(file_name) and file_name.endswith(self._exts):
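            try:
                img = image.imread(file_name, self._flag)
                label = self.items[idx][1]
            except Exception:
                # Fall back to a dummy sample if reading raises here.
                # NOTE: imread returns an HWC array, whereas this dummy is CHW;
                # adjust the shape to whatever the transform expects.
                img = mx.nd.zeros((3, DEFAULT_IMAGE_SIZE, DEFAULT_IMAGE_SIZE))
                label = self._missing_sentinel
        else:
            # Missing file or unsupported extension: return the dummy sample.
            img = mx.nd.zeros((3, DEFAULT_IMAGE_SIZE, DEFAULT_IMAGE_SIZE))
            label = self._missing_sentinel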
 
        if self._transform is not None:
            return self._transform(img, label)
        return img, label
 
    def __len__(self):
        return len(self.items)
```

2. I replaced `image.imread` with `cv2.imread` (using OpenCV directly) in the 
above code. It seemed to work on some images but still crashed eventually, 
which may mean that `mxnet.image.imdecode` is also being called somewhere else. 
I have not explored this yet.
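
One reading of the traceback above is that the `MXNetError` is raised from `_to_shared_mem` in the data loader worker, not from `imread` itself, which is consistent with MXNet executing the decode asynchronously. The sketch below is an unverified illustration under that assumption: it forces synchronization inside the `try` block by calling `wait_to_read()` when the returned object has such a method, on the assumption that the decoding error then propagates at that point. It also treats a `None` result as a failure, because `cv2.imread` returns `None` instead of raising. `safe_load`, `read_fn`, and `make_dummy` are hypothetical names, and the reader is injected so that the same wrapper covers both attempts.

```python
def safe_load(path, label, read_fn, make_dummy, missing_label=-1234):
    """Load one sample, falling back to a dummy sample if decoding fails.

    read_fn(path) returns the decoded image; make_dummy() returns the
    placeholder image. Both are injected (assumptions of this sketch).
    """
    try:
        img = read_fn(path)
        # cv2.imread reports failure by returning None rather than raising.
        if img is None:
            raise ValueError("decoder returned None for %s" % path)
        # MXNet NDArrays are computed lazily: block here so that a decoding
        # error (if it propagates at this point) is raised inside the try.
        wait = getattr(img, "wait_to_read", None)
        if callable(wait):
            wait()
        return img, label
    except Exception:
        return make_dummy(), missing_label
```

Inside `__getitem__`, this might be called as `safe_load(file_name, label, lambda p: image.imread(p, self._flag), make_dummy)`, where `make_dummy` builds the zero image. Whether the error is actually catchable this way on this particular build has not been checked.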


[ Full content available at: 
https://github.com/apache/incubator-mxnet/issues/12280 ]