Paul Guo created HAWQ-1335:
------------------------------

             Summary: Need to refactor some QD error handling code.
                 Key: HAWQ-1335
                 URL: https://issues.apache.org/jira/browse/HAWQ-1335
             Project: Apache HAWQ
          Issue Type: Bug
          Components: Dispatcher
            Reporter: Paul Guo
            Assignee: Ed Espino


While recently I worked on QD related issues I found the QD error handling code 
is really confusing. At least:

1) dispmgt_thread_func_run()
    /*
     * Cleanup rules:
     * 1. query cancel, result error, and poll error: mark the executor stop.
     * 2. connection error: mark the gang error. Set by 
workermgr_mark_executor_error().
     */

    I do not think the code is handling like this. Maybe I did not understand 
related details. Besides workermgr_mark_executor_error() does not exist also.

2) In executormgr_cancel()

#if 0
    if (success)
    {
        executor->state = QES_STOP;
        executor->health = QEH_CANCEL;
    }
    else
    {
        /* TODO: log error? how to deal with connection error. */
        executormgr_catch_error(executor);
    }
#endif 

    {
        write_log("function executormgr_cancel calling 
executormgr_catch_error");
        executormgr_catch_error(executor);
    }

Why executormgr_catch_error() is called for all cases? and whether "success" is 
not enough to judge whether executormgr_catch_error() should be called or not.

3) cdbdisp_seterrcode()

    if (!dispatchResult->errcode)
    {
        dispatchResult->errcode =
            (errcode == 0) ? ERRCODE_INTERNAL_ERROR : errcode;
        if (resultIndex >= 0)
            dispatchResult->errindex = resultIndex;
    }

    Why need to set ERRCODE_INTERNAL_ERROR while leave meleeResults->err code 
alone. This piece of code seems to be totally redundant. 

4) It is splitting general errors with normal cancel kinda of interrupts.
    However I still see error code below related to cancellation.
    if (errcode == ERRCODE_GP_OPERATION_CANCELED ||
        errcode == ERRCODE_QUERY_CANCELED)
    Is it possible to combine them to just error code only.

5) dispmgt_thread_func_run() should really set error code for each 
error_cleanup cases.

6) dispmgt_thread_func_run() could quit earlier not just because QE, e.g. lack 
of memory on QD so QD fails in some system calls with ENOMEM, while 
cdbdisp_seterrcode() seems to need a related executor.

7) Probably need to rewrite some of the logs and the comments.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

Reply via email to