[ https://issues.apache.org/jira/browse/TRAFODION-2692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16108179#comment-16108179 ]

ASF GitHub Bot commented on TRAFODION-2692:
-------------------------------------------

Github user DaveBirdsall commented on a diff in the pull request:

    
    https://github.com/apache/incubator-trafodion/pull/1192#discussion_r130491826
  
    --- Diff: core/sqf/monitor/linux/commaccept.cxx ---
    @@ -33,6 +33,7 @@ using namespace std;
     #include <signal.h>
     #include <unistd.h>
     
    +extern CCommAccept CommAccept;
    --- End diff --
    
    Global objects cause race conditions at exit time. Their destructors are
    called in non-deterministic order. Does this object depend on any other
    objects? Does its destructor traverse to any other objects? If so, then this
    will probably break at some time in the future.
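
    One common way to sidestep the exit-time ordering concern raised above is
    the construct-on-first-use idiom. The following is a minimal sketch only:
    Acceptor and acceptor() are hypothetical stand-ins, not the real
    CCommAccept class or anything in the Trafodion monitor, and whether this
    pattern suits the monitor is an open question.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for a class such as CCommAccept (not the real class).
class Acceptor {
public:
    void record(const std::string& event) { events_.push_back(event); }
    std::size_t count() const { return events_.size(); }

private:
    std::vector<std::string> events_;
};

// Construct-on-first-use: the object is created the first time it is
// requested and is deliberately never destroyed, so no destructor runs at
// exit and there is no exit-time ordering between globals to get wrong.
Acceptor& acceptor() {
    static Acceptor* instance = new Acceptor();
    return *instance;
}
```

    Callers would write acceptor().record("...") instead of touching a global
    directly. The trade-off is that the object is intentionally leaked at
    process exit, which is usually acceptable only if its destructor has no
    required side effects.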


> Monitor fails to start when node names are not of the right form
> ----------------------------------------------------------------
>
>                 Key: TRAFODION-2692
>                 URL: https://issues.apache.org/jira/browse/TRAFODION-2692
>             Project: Apache Trafodion
>          Issue Type: Bug
>          Components: foundation
>    Affects Versions: 2.2-incubating
>         Environment: I tried this on an OpenStack cluster, using Hortonworks 
> HDP 5.4. This is the code with the new elasticity feature.
>            Reporter: Hans Zeller
>            Assignee: Gonzalo E Correa
>             Fix For: 2.2-incubating
>
>
> When trying to install Trafodion on a cluster, I ran into various situations 
> where the monitor failed to start, based on how host names were configured 
> and specified. I used three kinds of names:
> NN - a "nickname", a name I made up and put into /etc/hosts. Note: I made the 
> mistake of just adding the nickname, not the actual name in the /etc/hosts 
> line.
> LN - a local, non-qualified name that is also the OpenStack instance name and 
> the host name.
> FQDN - the fully qualified domain host name
> {noformat}
> Case  Name specified  hostname command  sqconfig  What happened
>       in HDP          returns           contains
> ----  --------------  ----------------  --------  --------------------------
>   1   nickname        local name        nickname  sqstart returned an error,
>                                                   saying that sqstart must
>                                                   be executed on one of the
>                                                   nodes of the cluster
>   2   local name      local name        FQDN?     monitor core dump (1)
>   3   local name      FQDN              FQDN      monitor abends (2)
>   4   FQDN            FQDN              FQDN      install succeeds
> {noformat}
> Notes: (1) The core dump happened because of the following code in file 
> core/sqf/monitor/linux/cluster.cxx:
> {noformat}
>     // Build the monitor's configured view of the cluster
>     if ( IsRealCluster )
>     {   // Map node name to physical node id
>         // (for virtual nodes physical node equals "rank" (previously set))
>         MyPNID = clusterConfig->GetPNid( Node_name );
>     }
>     Nodes->AddNodes( );
>     MyNode = Nodes->GetNode(MyPNID);
>     Nodes->SetupCluster( &Node, &LNode, &indexToPnid_ );
> {noformat}
> Node_name is a local name, but the nodes in the "Nodes" list are named by 
> their FQDN, so the lookup fails and MyPNID is set to -1. Nodes->GetNode(MyPNID) 
> then returns NULL, and dereferencing MyNode causes the core dump.
> (2) The third case is the same as the second, with two modifications: Use 
> the "hostname" command to set the host name to the FQDN, and edit /etc/hosts 
> to put the FQDN first in the line and the local name second (case 2 had it 
> the other way round). This time, we get past the problem described in case 2, 
> but we get an error from MPI, which is unable to communicate with all the 
> nodes (sorry, didn't record the exact error message).
> This is how the lines in /etc/hosts look (same layout for all nodes of 
> the cluster):
> {noformat}
> # case 1
> 1.2.3.4 nickname1
> 1.2.3.5 nickname2
> # case 2
> 1.2.3.4 mynode1 mynode1.novalocal
> 1.2.3.5 mynode2 mynode2.novalocal
> # cases 3 and 4
> 1.2.3.4 mynode1.novalocal mynode1
> 1.2.3.5 mynode2.novalocal mynode2
> {noformat}
> My suggestion would be to identify the places where we read node names that 
> are provided by the user and where such node names are compared, and to 
> provide a comparison method that tolerates equivalent forms of names.
> There are related JIRAs: TRAFODION-2480 and TRAFODION-2391.
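
The tolerant name comparison suggested in the description could look roughly like the sketch below. It is an illustration under stated assumptions, not the monitor's code: sameHostName, shortName, and toLower are hypothetical helpers, and the rule used (compare case-insensitively, and let an unqualified name match an FQDN whose first label is the same) is only one possible definition of "equivalent forms".

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Lower-case copy of a name; host names are compared case-insensitively.
std::string toLower(std::string s) {
    std::transform(s.begin(), s.end(), s.begin(), [](unsigned char c) {
        return static_cast<char>(std::tolower(c));
    });
    return s;
}

// The part of a name before the first dot (the whole name if there is none).
std::string shortName(const std::string& name) {
    return name.substr(0, name.find('.'));
}

// Hypothetical helper: true if a and b plausibly name the same host.
bool sameHostName(const std::string& a, const std::string& b) {
    const std::string la = toLower(a);
    const std::string lb = toLower(b);
    if (la == lb) return true;

    const bool aQualified = la.find('.') != std::string::npos;
    const bool bQualified = lb.find('.') != std::string::npos;
    // Two different fully qualified names are treated as different hosts.
    if (aQualified && bQualified) return false;

    // Otherwise at least one side is unqualified: compare the first labels.
    return shortName(la) == shortName(lb);
}
```

With this rule, "mynode1" and "mynode1.novalocal" compare equal (cases 2 and 3), while "mynode1" and "mynode2.novalocal" do not. It assumes the inputs are names rather than IP addresses, and it does not cover the nickname in case 1; that would need a lookup through the resolver or /etc/hosts rather than a string comparison.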



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
