pengzhao-intel commented on issue #8532: mxnet-mkl (v0.12.0) crash when using (conda-installed) numpy with MKL
URL: https://github.com/apache/incubator-mxnet/issues/8532#issuecomment-372910970
 
 
   @fhieber  @rsdubtso is correct: the crash is caused by multiple copies of the OpenMP runtime library being loaded into the same process. The workaround is to set KMP_DUPLICATE_LIB_OK=TRUE.
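   
   For example, a minimal sketch of the workaround in Python (the variable has to be set before numpy/mxnet initialize their copies of libiomp5.so; the exact import order below is an assumption for illustration, not from the original report):
   
       # Unsafe, unsupported workaround per the OMP hint in the error log below:
       # tell libiomp5 to tolerate a second copy of itself in the process.
       import os
       os.environ["KMP_DUPLICATE_LIB_OK"] = "TRUE"
       
       import numpy as np   # conda numpy, linked against MKL's libiomp5.so
       import mxnet as mx   # mxnet-mkl, brings its own OpenMP runtime
   
   The same thing can be done from the shell with `export KMP_DUPLICATE_LIB_OK=TRUE` before launching the training script.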
   
   I am working on the final solution now and will come back soon :)
   
   Error log:
   
   > [INFO:sockeye.training] Using bucketing. Default max_seq_len=(60, 60)
   > [INFO:__main__] Optimizer: adam
   > [INFO:__main__] Optimizer Parameters: {'wd': 0.0, 'learning_rate': 0.0003, 
'lr_scheduler': LearningRateSchedulerPlateauReduce(reduce_factor=0.50, 
reduce_num_not_improved=0), 'clip_gradient': 1.0, 'rescale_grad': 1.0}
   > [INFO:__main__] kvstore: device
   > [INFO:__main__] Gradient Compression: None
   > [INFO:__main__] Decode and Evaluate Device(s): cpu(0)
   > [INFO:sockeye.model] Saved config to 
"/home/chenxiny/newstest/wmt_model/config"
   > OMP: Error #15: Initializing libiomp5.so, but found libiomp5.so already 
initialized.
   > OMP: Hint: This means that multiple copies of the OpenMP runtime have been 
linked into the program. That is dangerous, since it can degrade performance or 
cause incorrect results. The best thing to do is to ensure that only a single 
OpenMP runtime is linked into the process, e.g. by avoiding static linking of 
the OpenMP runtime in any library. As an unsafe, unsupported, undocumented 
workaround you can set the environment variable KMP_DUPLICATE_LIB_OK=TRUE to 
allow the program to continue to execute, but that may cause crashes or 
silently produce incorrect results. For more information, please see 
http://www.intel.com/software/products/support/.
   > Aborted
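   
   The hint above points at two copies of the OpenMP runtime mapped into one process (mxnet-mkl's and conda numpy's). A hedged, Linux-only diagnostic I find useful is to list the mapped OpenMP libraries from /proc/self/maps; run it with KMP_DUPLICATE_LIB_OK=TRUE set so the process survives the duplicate initialization (this sketch is mine, not part of the original report):
   
       # Linux-only diagnostic: list every OpenMP runtime mapped into this process.
       import os
       os.environ["KMP_DUPLICATE_LIB_OK"] = "TRUE"  # keep the process alive despite the duplicate
       
       import numpy as np
       import mxnet as mx
       
       runtimes = set()
       with open("/proc/self/maps") as maps:
           for line in maps:
               parts = line.split()
               # file-backed mappings carry the library path in the last field
               if parts and any(name in parts[-1] for name in ("libiomp", "libgomp", "libomp")):
                   runtimes.add(parts[-1])
       
       for path in sorted(runtimes):
           print(path)  # two distinct libiomp5.so paths match OMP Error #15 above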
   
   After setting the environment variable, the problem is gone.
   New log:
   
   > [INFO:sockeye.utils] Sockeye version 1.16.2 commit 
762ce78e4e49b9ba5d14eb0a48d97f19c8807707
   > [INFO:sockeye.utils] MXNet version 1.1.0
   > [INFO:sockeye.utils] Command: 
/home/chenxiny/anaconda3/lib/python3.6/site-packages/sockeye/train.py -s 
corpus.tc.BPE.de -t corpus.tc.BPE.en -vs newstest2016.tc.BPE.de -vt 
newstest2016.tc.BPE.en --num-embed 256 --rnn-num-hidden 512 
--rnn-attention-type dot --max-seq-len 60 --decode-and-evaluate 500 -o wmt_model
   > [INFO:sockeye.utils] Arguments: Namespace(allow_missing_params=False, 
batch_size=64, batch_type='sentence', bucket_width=10, 
checkpoint_frequency=1000, cnn_activation_type='glu', cnn_hidden_dropout=0.0, 
cnn_kernel_width=(3, 5), cnn_num_hidden=512, 
cnn_positional_embedding_type='learned', cnn_project_qkv=False, 
conv_embed_add_positional_encodings=False, conv_embed_dropout=0.0, 
conv_embed_max_filter_width=8, conv_embed_num_filters=(200, 200, 250, 250, 300, 
300, 300, 300), conv_embed_num_highway_layers=4, conv_embed_output_dim=None, 
conv_embed_pool_stride=5, decode_and_evaluate=500, 
decode_and_evaluate_device_id=None, decode_and_evaluate_use_cpu=False, 
decoder='rnn', device_ids=[-1], disable_device_locking=False, 
embed_dropout=(0.0, 0.0), embed_weight_init='default', encoder='rnn', 
fill_up='replicate', gradient_clipping_threshold=1.0, 
gradient_clipping_type='abs', gradient_compression_threshold=0.5, 
gradient_compression_type=None, initial_learning_rate=0.0003, 
keep_last_params=-1, kvstore='device', label_smoothing=0.0, 
layer_normalization=False, learning_rate_decay_optimizer_states_reset='off', 
learning_rate_decay_param_reset=False, learning_rate_half_life=10, 
learning_rate_reduce_factor=0.5, learning_rate_reduce_num_not_improved=3, 
learning_rate_schedule=None, learning_rate_scheduler_type='plateau-reduce', 
learning_rate_warmup=0, lock_dir='/tmp', loss='cross-entropy', 
loss_normalization_type='valid', max_num_checkpoint_not_improved=8, 
max_num_epochs=None, max_seq_len=(60, 60), max_updates=None, 
metrics=['perplexity'], min_num_epochs=None, momentum=None, 
monitor_pattern=None, monitor_stat_func='mx_default', no_bucketing=False, 
num_embed=(256, 256), num_layers=(1, 1), num_words=(50000, 50000), 
optimized_metric='perplexity', optimizer='adam', optimizer_params=None, 
output='wmt_model', overwrite_output=False, params=None, prepared_data=None, 
quiet=False, rnn_attention_coverage_num_hidden=1, 
rnn_attention_coverage_type='count', rnn_attention_in_upper_layers=False, 
rnn_attention_mhdot_heads=None, rnn_attention_num_hidden=None, 
rnn_attention_type='dot', rnn_attention_use_prev_word=False, 
rnn_cell_type='lstm', rnn_context_gating=False, rnn_decoder_hidden_dropout=0.0, 
rnn_decoder_state_init='last', rnn_dropout_inputs=(0.0, 0.0), 
rnn_dropout_recurrent=(0.0, 0.0), rnn_dropout_states=(0.0, 0.0), 
rnn_encoder_reverse_input=False, rnn_first_residual_layer=2, 
rnn_forget_bias=0.0, rnn_h2h_init='orthogonal', rnn_num_hidden=512, 
rnn_residual_connections=False, seed=13, shared_vocab=False, 
source='corpus.tc.BPE.de', source_vocab=None, target='corpus.tc.BPE.en', 
target_vocab=None, transformer_activation_type='relu', 
transformer_attention_heads=8, transformer_dropout_act=0.0, 
transformer_dropout_attention=0.0, transformer_dropout_prepost=0.0, 
transformer_feed_forward_num_hidden=2048, transformer_model_size=512, 
transformer_positional_embedding_type='fixed', transformer_postprocess=('drn', 
'drn'), transformer_preprocess=('', ''), use_cpu=False, use_tensorboard=False, 
validation_source='newstest2016.tc.BPE.de', 
validation_target='newstest2016.tc.BPE.en', weight_decay=0.0, 
weight_init='xavier', weight_init_scale=2.34, 
weight_init_xavier_factor_type='in', weight_init_xavier_rand_type='uniform', 
weight_normalization=False, weight_tying=False, 
weight_tying_type='trg_softmax', word_min_count=(1, 1))
   > [INFO:sockeye.utils] Attempting to acquire 1 GPUs of 1 GPUs. The requested 
devices are: [-1]
   > [INFO:sockeye.utils] Acquired GPU 0.
   > [INFO:__main__] Training Device(s): GPU [0]
   > [INFO:sockeye.vocab] Building vocabulary from dataset(s): 
['corpus.tc.BPE.de']
   > [INFO:sockeye.vocab] Vocabulary: types: 22309/22309/22309/22313 
(initial/min_pruned/max_pruned/+special) [min_frequency=1, max_num_types=50000]
   > [INFO:sockeye.vocab] Building vocabulary from dataset(s): 
['corpus.tc.BPE.en']
   > [INFO:sockeye.vocab] Vocabulary: types: 16757/16757/16757/16761 
(initial/min_pruned/max_pruned/+special) [min_frequency=1, max_num_types=50000]
   > [INFO:sockeye.data_io] ===============================
   > [INFO:sockeye.data_io] Creating training data iterator
   > [INFO:sockeye.data_io] ===============================
   > [INFO:sockeye.data_io] 193369 sequences of maximum length (60, 60) in 
'/home/chenxiny/newstest/corpus.tc.BPE.de' and 
'/home/chenxiny/newstest/corpus.tc.BPE.en'.
   > [INFO:sockeye.data_io] Mean training target/source length ratio: 0.99 
(+-0.45)
   > [INFO:sockeye.data_io] Tokens: source 4563838 target 4466824
   > [INFO:sockeye.data_io] Vocabulary coverage: source 100% target 100%
   > [INFO:sockeye.data_io] 193369 sequences across 6 buckets
   > [INFO:sockeye.data_io] 6631 sequences did not fit into buckets and were 
discarded
   > [INFO:sockeye.data_io] Bucket (10, 10): 32318 samples in 505 batches of 
64, ~362.2 tokens/batch.
   > [INFO:sockeye.data_io] Bucket (20, 20): 46870 samples in 733 batches of 
64, ~949.5 tokens/batch.
   > [INFO:sockeye.data_io] Bucket (30, 30): 49430 samples in 773 batches of 
64, ~1511.9 tokens/batch.
   > [INFO:sockeye.data_io] Bucket (40, 40): 35885 samples in 561 batches of 
64, ~2070.4 tokens/batch.
   > [INFO:sockeye.data_io] Bucket (50, 50): 19636 samples in 307 batches of 
64, ~2618.3 tokens/batch.
   > 
