[GitHub] piiswrong commented on a change in pull request #9681: Better Exception Handling for Operators
piiswrong commented on a change in pull request #9681: Better Exception Handling for Operators URL: https://github.com/apache/incubator-mxnet/pull/9681#discussion_r166079802 ## File path: src/engine/threaded_engine.h ## @@ -338,33 +346,46 @@ class ThreadedEngine : public Engine { #endif CallbackOnComplete callback = this->CreateCallback( ThreadedEngine::OnCompleteStatic, opr_block); +CallbackOnComplete on_start_callback = this->CreateCallback( Review comment: I think this is unnecessary overhead. Call OnStart directly if possible This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] piiswrong commented on a change in pull request #9681: Better Exception Handling for Operators
piiswrong commented on a change in pull request #9681: Better Exception Handling for Operators URL: https://github.com/apache/incubator-mxnet/pull/9681#discussion_r165731530 ## File path: src/engine/threaded_engine.h ## @@ -476,6 +507,9 @@ class ThreadedEngine : public Engine { */ std::mutex finished_m_; std::condition_variable finished_cv_; + /*! \brief exception_ptr associated with the engine, + * which is used to throw exception in waitall */ + std::exception_ptr global_ex_ptr; Review comment: ex_ptr is a bad name. class members should end with `_`. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] piiswrong commented on a change in pull request #9681: Better Exception Handling for Operators
piiswrong commented on a change in pull request #9681: Better Exception Handling for Operators URL: https://github.com/apache/incubator-mxnet/pull/9681#discussion_r165731112 ## File path: src/storage/gpu_device_storage.h ## @@ -62,7 +62,7 @@ inline void* GPUDeviceStorage::Alloc(size_t size) { #endif // MXNET_USE_NCCL cudaError_t e = cudaMalloc(, size); if (e != cudaSuccess && e != cudaErrorCudartUnloading) -throw std::bad_alloc(); +LOG(FATAL) << cudaGetLastError(); Review comment: Does cudaGetLastError return a string or an error code? This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] piiswrong commented on a change in pull request #9681: Better Exception Handling for Operators
piiswrong commented on a change in pull request #9681: Better Exception Handling for Operators URL: https://github.com/apache/incubator-mxnet/pull/9681#discussion_r165731019 ## File path: src/storage/cpu_device_storage.h ## @@ -61,10 +61,10 @@ inline void* CPUDeviceStorage::Alloc(size_t size) { void* ptr; #if _MSC_VER ptr = _aligned_malloc(size, alignment_); - if (ptr == NULL) throw std::bad_alloc(); + if (ptr == NULL) LOG(FATAL) << "Malloc failure"; Review comment: Failed to allocate CPU memory This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] piiswrong commented on a change in pull request #9681: Better Exception Handling for Operators
piiswrong commented on a change in pull request #9681: Better Exception Handling for Operators URL: https://github.com/apache/incubator-mxnet/pull/9681#discussion_r165730863 ## File path: src/engine/threaded_engine.h ## @@ -338,33 +346,46 @@ class ThreadedEngine : public Engine { #endif CallbackOnComplete callback = this->CreateCallback( ThreadedEngine::OnCompleteStatic, opr_block); +CallbackOnComplete on_start_callback = this->CreateCallback( +ThreadedEngine::OnStartStatic, opr_block); bool debug_info = (engine_info_ && debug_push_opr_ == opr_block); if (debug_info) { LOG(INFO) << "ExecuteOprBlock " << opr_block << "shutdown_phase=" << shutdown_phase_; } if (!shutdown_phase_) { try { +on_start_callback(); if (debug_info) { LOG(INFO) << "ExecuteOprFn "; } -threaded_opr->fn(run_ctx, callback); +try { + if (!threaded_opr->ex_ptr || threaded_opr->wait) { +threaded_opr->fn(run_ctx, callback); + } else { +callback(); + } +} catch (dmlc::Error& e) { + threaded_opr->ex_ptr = std::current_exception(); + callback(); +} if (debug_info) { LOG(INFO) << "Fin ExecuteOprFn "; } - } catch(dmlc::Error ) { + } catch (dmlc::Error& e) { Review comment: what would this catch now? There is already a try block inside This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] piiswrong commented on a change in pull request #9681: Better Exception Handling for Operators
piiswrong commented on a change in pull request #9681: Better Exception Handling for Operators URL: https://github.com/apache/incubator-mxnet/pull/9681#discussion_r165728531 ## File path: src/engine/threaded_engine.h ## @@ -338,33 +346,46 @@ class ThreadedEngine : public Engine { #endif CallbackOnComplete callback = this->CreateCallback( ThreadedEngine::OnCompleteStatic, opr_block); +CallbackOnComplete on_start_callback = this->CreateCallback( +ThreadedEngine::OnStartStatic, opr_block); bool debug_info = (engine_info_ && debug_push_opr_ == opr_block); if (debug_info) { LOG(INFO) << "ExecuteOprBlock " << opr_block << "shutdown_phase=" << shutdown_phase_; } if (!shutdown_phase_) { try { +on_start_callback(); if (debug_info) { LOG(INFO) << "ExecuteOprFn "; } -threaded_opr->fn(run_ctx, callback); +try { + if (!threaded_opr->ex_ptr || threaded_opr->wait) { +threaded_opr->fn(run_ctx, callback); + } else { +callback(); + } +} catch (dmlc::Error& e) { Review comment: why only catch dmlc error? This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] piiswrong commented on a change in pull request #9681: Better Exception Handling for Operators
piiswrong commented on a change in pull request #9681: Better Exception Handling for Operators URL: https://github.com/apache/incubator-mxnet/pull/9681#discussion_r165728359 ## File path: src/engine/threaded_engine.h ## @@ -338,33 +346,46 @@ class ThreadedEngine : public Engine { #endif CallbackOnComplete callback = this->CreateCallback( ThreadedEngine::OnCompleteStatic, opr_block); +CallbackOnComplete on_start_callback = this->CreateCallback( Review comment: what's the point of creating a call back here? This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] piiswrong commented on a change in pull request #9681: Better Exception Handling for Operators
piiswrong commented on a change in pull request #9681: Better Exception Handling for Operators URL: https://github.com/apache/incubator-mxnet/pull/9681#discussion_r165727816 ## File path: src/engine/threaded_engine.h ## @@ -175,6 +175,8 @@ class ThreadedVar final static std::atomic counter; ~ThreadedVar() { LOG(INFO) << __func__ << " " << --counter; } #endif // ENGINE_DEBUG + /*! \brief exception_ptr associated with the ThreadedVar */ + std::exception_ptr ex_ptr; Review comment: ex_ptr is a bad variable name This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] piiswrong commented on a change in pull request #9681: Better Exception Handling for Operators
piiswrong commented on a change in pull request #9681: Better Exception Handling for Operators URL: https://github.com/apache/incubator-mxnet/pull/9681#discussion_r165727071 ## File path: src/engine/threaded_engine.cc ## @@ -403,7 +417,11 @@ inline void ThreadedEngine::OnComplete(ThreadedOpr* threaded_opr) { } // Mark complete for write variables. for (auto&& i : threaded_opr->mutable_vars) { -const bool debug_info = (engine_info_ && debug_wait_var_ == i); +if (threaded_opr->ex_ptr) { + i->ex_ptr = threaded_opr->ex_ptr; + if (!global_ex_ptr) global_ex_ptr = i->ex_ptr; +} Review comment: suppose and operator has three outputs x, y and z and it raises an exception. then x.asnumpy() would raise an error. Then y.asnumpy() would raise the same error again. and if I do z += 1 and it succeeds, z.asnumpy() would still raise the error This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] piiswrong commented on a change in pull request #9681: Better Exception Handling for Operators
piiswrong commented on a change in pull request #9681: Better Exception Handling for Operators URL: https://github.com/apache/incubator-mxnet/pull/9681#discussion_r165727071 ## File path: src/engine/threaded_engine.cc ## @@ -403,7 +417,11 @@ inline void ThreadedEngine::OnComplete(ThreadedOpr* threaded_opr) { } // Mark complete for write variables. for (auto&& i : threaded_opr->mutable_vars) { -const bool debug_info = (engine_info_ && debug_wait_var_ == i); +if (threaded_opr->ex_ptr) { + i->ex_ptr = threaded_opr->ex_ptr; + if (!global_ex_ptr) global_ex_ptr = i->ex_ptr; +} Review comment: suppose and operator has two outputs x and y and it raises an exception. then x.asnumpy() would raise an error. Then y.asnumpy() would raise the same error again. and if I do x += 1 and it succeeds, x.asnumpy() would still raise the error This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] piiswrong commented on a change in pull request #9681: Better Exception Handling for Operators
piiswrong commented on a change in pull request #9681: Better Exception Handling for Operators URL: https://github.com/apache/incubator-mxnet/pull/9681#discussion_r165727071 ## File path: src/engine/threaded_engine.cc ## @@ -403,7 +417,11 @@ inline void ThreadedEngine::OnComplete(ThreadedOpr* threaded_opr) { } // Mark complete for write variables. for (auto&& i : threaded_opr->mutable_vars) { -const bool debug_info = (engine_info_ && debug_wait_var_ == i); +if (threaded_opr->ex_ptr) { + i->ex_ptr = threaded_opr->ex_ptr; + if (!global_ex_ptr) global_ex_ptr = i->ex_ptr; +} Review comment: suppose and operator has two outputs x and y and it raises an exception. then x.asnumpy() would raise an error. Then y.asnumpy() would raise the same error again. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] piiswrong commented on a change in pull request #9681: Better Exception Handling for Operators
piiswrong commented on a change in pull request #9681: Better Exception Handling for Operators URL: https://github.com/apache/incubator-mxnet/pull/9681#discussion_r165726239 ## File path: src/engine/threaded_engine.cc ## @@ -403,7 +417,11 @@ inline void ThreadedEngine::OnComplete(ThreadedOpr* threaded_opr) { } // Mark complete for write variables. for (auto&& i : threaded_opr->mutable_vars) { -const bool debug_info = (engine_info_ && debug_wait_var_ == i); +if (threaded_opr->ex_ptr) { + i->ex_ptr = threaded_opr->ex_ptr; + if (!global_ex_ptr) global_ex_ptr = i->ex_ptr; +} +bool debug_info = (engine_info_ && debug_wait_var_ == i); Review comment: why remove const? This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] piiswrong commented on a change in pull request #9681: Better Exception Handling for Operators
piiswrong commented on a change in pull request #9681: Better Exception Handling for Operators URL: https://github.com/apache/incubator-mxnet/pull/9681#discussion_r165726063 ## File path: src/engine/threaded_engine.cc ## @@ -391,6 +400,11 @@ void ThreadedEngine::WaitForAll() { finished_cv_.wait(lock, [this]() { return pending_.load() == 0 || kill_.load(); }); + if (global_ex_ptr) { +std::exception_ptr ex_ptr = global_ex_ptr; Review comment: use std::rethrow_exception(std::move(global_ex_ptr)) This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] piiswrong commented on a change in pull request #9681: Better Exception Handling for Operators
piiswrong commented on a change in pull request #9681: Better Exception Handling for Operators URL: https://github.com/apache/incubator-mxnet/pull/9681#discussion_r165726063 ## File path: src/engine/threaded_engine.cc ## @@ -391,6 +400,11 @@ void ThreadedEngine::WaitForAll() { finished_cv_.wait(lock, [this]() { return pending_.load() == 0 || kill_.load(); }); + if (global_ex_ptr) { +std::exception_ptr ex_ptr = global_ex_ptr; Review comment: use std::move This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] piiswrong commented on a change in pull request #9681: Better Exception Handling for Operators
piiswrong commented on a change in pull request #9681: Better Exception Handling for Operators URL: https://github.com/apache/incubator-mxnet/pull/9681#discussion_r165725747 ## File path: src/engine/threaded_engine.cc ## @@ -356,7 +357,11 @@ void ThreadedEngine::DeleteVariable(SyncFn delete_fn, void ThreadedEngine::WaitForVar(VarHandle var) { BulkFlush(); ThreadedVar* threaded_var = ThreadedVar::CastFromBase(var); - if (threaded_var->ready_to_read()) return; + if (threaded_var->ready_to_read()) { +if (threaded_var->ex_ptr) { + std::rethrow_exception(threaded_var->ex_ptr); +} + } Review comment: why are you not returning? This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] piiswrong commented on a change in pull request #9681: Better Exception Handling for Operators
piiswrong commented on a change in pull request #9681: Better Exception Handling for Operators URL: https://github.com/apache/incubator-mxnet/pull/9681#discussion_r165725474 ## File path: include/mxnet/engine.h ## @@ -182,7 +182,7 @@ class MXNET_API Engine { std::vector const& mutable_vars, FnProperty prop = FnProperty::kNormal, int priority = 0, - const char* opr_name = nullptr) = 0; + const char* opr_name = nullptr, bool wait = false) = 0; Review comment: what's wait? Why do you need it? Please document arguments This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services