[ 
https://issues.apache.org/jira/browse/HAWQ-252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15130193#comment-15130193
 ] 

ASF GitHub Bot commented on HAWQ-252:
-------------------------------------

Github user huor commented on a diff in the pull request:

    https://github.com/apache/incubator-hawq/pull/330#discussion_r51702311
  
    --- Diff: depends/libyarn/src/libyarnclient/LibYarnClient.cpp ---
    @@ -106,7 +106,7 @@ list<ResourceRequest> LibYarnClient::getAskRequests() {
     
     void* heartbeatFunc(void* args) {
    --- End diff --
    
    1. Do we need to retry here, given that "client->dummyAllocate();" already retries internally?
    
    2. If we do need to retry, the code can be refined as below. The main points are:
    
    1) use a macro instead of a magic number for the maximum retry count
    2) include the retry count in the log message
    3) use /* ... */ for multi-line comments
    
    {noformat}
        #define HEARTBEAT_MAX_RETRY_NUM 2
    
        int failcounter = 0;
        LibYarnClient *client = (LibYarnClient*)args;
    
        while (client->keepRun) {
                try {
                        client->dummyAllocate();
                        failcounter = 0;
                }
                catch(const YarnException &e) {
                        failcounter++;
    
                        LOG(WARNING, "LibYarnClient::heartbeatFunc, dummy allocation "
                                     "is not correctly executed in try %d with exception raised. %s",
                                     failcounter,
                                     e.msg());
    
                        if ( failcounter >= HEARTBEAT_MAX_RETRY_NUM ) {
                                /**
                                 * If there are too many consecutive errors/exceptions,
                                 * this thread returns. LibYarn has to re-register the
                                 * application and start the heartbeat thread again.
                                 */
                                LOG(WARNING, "LibYarnClient::heartbeatFunc, there are too many "
                                             "failures raised after %d maximum retries. "
                                             "This heart-beat thread exits now.",
                                             HEARTBEAT_MAX_RETRY_NUM);
                                client->keepRun = false;
                                break;
                        }
                }
                usleep((client->heartbeatInterval) * 1000);
        }
    {noformat}
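
For reference, the retry pattern suggested above can be tried outside libyarn. This is only a minimal sketch under stated assumptions: `FakeClient`, `kHeartbeatMaxRetryNum`, and `heartbeatLoop` are hypothetical stand-ins rather than libyarn types, `std::exception` replaces `YarnException`, and the sleep between heartbeats is left out so that the loop can be exercised quickly.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <stdexcept>
#include <utility>
#include <vector>

// Hypothetical stand-ins for illustration only; these are not libyarn types.
constexpr int kHeartbeatMaxRetryNum = 2;

struct FakeClient {
    std::vector<bool> outcomes;   // true = heartbeat succeeds, false = it throws
    std::size_t next = 0;         // index of the next scripted outcome
    bool keepRun = true;

    explicit FakeClient(std::vector<bool> o) : outcomes(std::move(o)) {}

    void dummyAllocate() {
        if (next >= outcomes.size()) {   // script exhausted: ask the loop to stop
            keepRun = false;
            return;
        }
        if (!outcomes[next++]) {
            throw std::runtime_error("simulated allocate failure");
        }
    }
};

// Runs the heartbeat loop and returns the number of consecutive
// failures recorded when the loop ended.
int heartbeatLoop(FakeClient &client) {
    int failcounter = 0;
    while (client.keepRun) {
        try {
            client.dummyAllocate();
            failcounter = 0;             // any success resets the streak
        } catch (const std::exception &e) {
            failcounter++;
            std::fprintf(stderr, "heartbeat failed in try %d: %s\n",
                         failcounter, e.what());
            if (failcounter >= kHeartbeatMaxRetryNum) {
                client.keepRun = false;  // give up; the caller must re-register
                break;
            }
        }
        // A real loop would sleep for the heartbeat interval here.
    }
    return failcounter;
}
```

Because the counter is reset after every successful call, only consecutive failures count toward the limit; isolated failures separated by a success never stop the thread.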


> Coredump When RM Reconnect libyarn
> ----------------------------------
>
>                 Key: HAWQ-252
>                 URL: https://issues.apache.org/jira/browse/HAWQ-252
>             Project: Apache HAWQ
>          Issue Type: Bug
>          Components: Resource Manager
>            Reporter: Lin Wen
>            Assignee: Lin Wen
>             Fix For: 2.0.0
>
>
> Coredump When RM Reconnect libyarn
> Missing separate debuginfos, use: debuginfo-install 
> hawq-2.0.0.0_beta-19011.x86_64
> (gdb) bt
> #0  0x0000000000e661f8 in std::string::_Rep::_S_empty_rep_storage ()
> #1  0x00007f7f1f20947c in libyarn::LibYarnClient::dummyAllocate (this=<value 
> optimized out>)
>     at 
> /data1/pulse2-agent/agents/agent1/work/LIBYARN-main-opt/rhel5_x86_64/src/libyarnclient/LibYarnClient.cpp:330
> #2  0x00007f7f1f209988 in libyarn::heartbeatFunc (args=<value optimized out>)
>     at 
> /data1/pulse2-agent/agents/agent1/work/LIBYARN-main-opt/rhel5_x86_64/src/libyarnclient/LibYarnClient.cpp:114
> #3  0x000000350b4079d1 in start_thread () from /lib64/libpthread.so.0
> #4  0x000000350b0e8b6d in clone () from /lib64/libc.so.6
> (gdb) info thread
>   4 Thread 0x7f7efc239700 (LWP 760442)  0x000000350b40b98e in 
> pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
>   3 Thread 0x7f7f1a1758c0 (LWP 760441)  0x000000350b0accdd in nanosleep () 
> from /lib64/libc.so.6
>   2 Thread 0x7f7efae37700 (LWP 760797)  0x000000350b0accdd in nanosleep () 
> from /lib64/libc.so.6
> * 1 Thread 0x7f7efb838700 (LWP 760443)  0x0000000000e661f8 in 
> std::string::_Rep::_S_empty_rep_storage ()
> (gdb) thread 2
> [Switching to thread 2 (Thread 0x7f7efae37700 (LWP 760797))]#0  
> 0x000000350b0accdd in nanosleep () from /lib64/libc.so.6
> (gdb) bt
> #0  0x000000350b0accdd in nanosleep () from /lib64/libc.so.6
> #1  0x000000350b0e1e54 in usleep () from /lib64/libc.so.6
> #2  0x00007f7f1f209999 in libyarn::heartbeatFunc (args=<value optimized out>)
>     at 
> /data1/pulse2-agent/agents/agent1/work/LIBYARN-main-opt/rhel5_x86_64/src/libyarnclient/LibYarnClient.cpp:131
> #3  0x000000350b4079d1 in start_thread () from /lib64/libpthread.so.0
> #4  0x000000350b0e8b6d in clone () from /lib64/libc.so.6
> (gdb) thread 3
> [Switching to thread 3 (Thread 0x7f7f1a1758c0 (LWP 760441))]#0  
> 0x000000350b0accdd in nanosleep () from /lib64/libc.so.6
> (gdb) bt
> #0  0x000000350b0accdd in nanosleep () from /lib64/libc.so.6
> #1  0x000000350b0e1e54 in usleep () from /lib64/libc.so.6
> #2  0x00000000008dd8b9 in RB2YARN_registerYARNApplication () at 
> resourcebroker_LIBYARN_proc.c:1354
> #3  0x00000000008df8ad in RB2YARN_initializeConnection () at 
> resourcebroker_LIBYARN_proc.c:1270
> #4  0x00000000008dfc93 in ResBrokerMainInternal () at 
> resourcebroker_LIBYARN_proc.c:202
> #5  0x00000000008dff79 in ResBrokerMain () at 
> resourcebroker_LIBYARN_proc.c:157
> #6  0x00000000008dc246 in RB_LIBYARN_start (isforked=<value optimized out>) 
> at resourcebroker_LIBYARN.c:153
> #7  0x0000000000903bda in MainHandlerLoop () at resourcemanager.c:531
> #8  0x00000000009041f1 in ResManagerMainServer2ndPhase () at 
> resourcemanager.c:508
> #9  0x0000000000904624 in ResManagerMain (argc=<value optimized out>, 
> argv=<value optimized out>) at resourcemanager.c:330
> #10 0x00000000009049b1 in ResManagerProcessStartup () at resourcemanager.c:402
> #11 0x0000000000764b08 in CommenceNormalOperations () at postmaster.c:3616
> #12 0x00000000007659c2 in do_reaper () at postmaster.c:3964
> #13 0x000000000076a01d in ServerLoop () at postmaster.c:2102
> #14 0x000000000076bb5e in PostmasterMain (argc=9, argv=0x32a15b0) at 
> postmaster.c:1421
> #15 0x00000000006c691a in main (argc=9, argv=0x32a1570) at main.c:226
> There are two heartbeat threads at this moment, which means one heartbeat 
> thread was not canceled when the RM reconnected to libyarn.
> In function ResBrokerMainInternal(), from line 270, the heartbeat thread 
> should be canceled before calling RB2YARN_disconnectFromYARN.
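
The ordering described in the issue (stop the old heartbeat thread before disconnecting and reconnecting) can be illustrated with a small standalone sketch. It is not the HAWQ resource broker code: `HeartbeatOwner` and its methods are hypothetical, and it uses a cooperative stop flag with `std::thread::join` in place of thread cancellation, so that at most one heartbeat thread exists at any time.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <thread>

// Hypothetical wrapper for illustration; not the HAWQ resource broker API.
class HeartbeatOwner {
public:
    ~HeartbeatOwner() { stop(); }

    void start() {
        stop();                       // never leave an older thread running
        keepRun_ = true;
        live_.fetch_add(1);           // counted before launch, so it is exact
        worker_ = std::thread([this] {
            while (keepRun_) {
                // Stand-in for one heartbeat plus the heartbeat interval.
                std::this_thread::sleep_for(std::chrono::milliseconds(1));
            }
            live_.fetch_sub(1);
        });
    }

    // Asks the thread to finish and waits until it has really exited.
    void stop() {
        keepRun_ = false;
        if (worker_.joinable()) {
            worker_.join();
        }
    }

    // Stop first, then (in a real client) disconnect and re-register,
    // and only then start a fresh heartbeat thread.
    void reconnect() {
        stop();
        start();
    }

    int liveThreads() const { return live_.load(); }

private:
    std::atomic<bool> keepRun_{false};
    std::atomic<int> live_{0};
    std::thread worker_;
};
```

Joining before the reconnect guarantees that the old thread can no longer touch client state that the disconnect is about to tear down, which is the situation the two-heartbeat backtrace above points to.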



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
