Paul Guo created HAWQ-1335:
------------------------------
Summary: Need to refactor some QD error handling code.
Key: HAWQ-1335
URL: https://issues.apache.org/jira/browse/HAWQ-1335
Project: Apache HAWQ
Issue Type: Bug
Components: Dispatcher
Reporter: Paul Guo
Assignee: Ed Espino
While recently I worked on QD related issues I found the QD error handling code
is really confusing. At least:
1) dispmgt_thread_func_run()
/*
* Cleanup rules:
* 1. query cancel, result error, and poll error: mark the executor stop.
* 2. connection error: mark the gang error. Set by
workermgr_mark_executor_error().
*/
I do not think the code is handling like this. Maybe I did not understand
related details. Besides workermgr_mark_executor_error() does not exist also.
2) In executormgr_cancel()
#if 0
if (success)
{
executor->state = QES_STOP;
executor->health = QEH_CANCEL;
}
else
{
/* TODO: log error? how to deal with connection error. */
executormgr_catch_error(executor);
}
#endif
{
write_log("function executormgr_cancel calling
executormgr_catch_error");
executormgr_catch_error(executor);
}
Why executormgr_catch_error() is called for all cases? and whether "success" is
not enough to judge whether executormgr_catch_error() should be called or not.
3) cdbdisp_seterrcode()
if (!dispatchResult->errcode)
{
dispatchResult->errcode =
(errcode == 0) ? ERRCODE_INTERNAL_ERROR : errcode;
if (resultIndex >= 0)
dispatchResult->errindex = resultIndex;
}
Why need to set ERRCODE_INTERNAL_ERROR while leave meleeResults->err code
alone. This piece of code seems to be totally redundant.
4) It is splitting general errors with normal cancel kinda of interrupts.
However I still see error code below related to cancellation.
if (errcode == ERRCODE_GP_OPERATION_CANCELED ||
errcode == ERRCODE_QUERY_CANCELED)
Is it possible to combine them to just error code only.
5) dispmgt_thread_func_run() should really set error code for each
error_cleanup cases.
6) dispmgt_thread_func_run() could quit earlier not just because QE, e.g. lack
of memory on QD so QD fails in some system calls with ENOMEM, while
cdbdisp_seterrcode() seems to need a related executor.
7) Probably need to rewrite some of the logs and the comments.
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)