[ https://issues.apache.org/jira/browse/TRAFODION-2692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16108105#comment-16108105 ]
ASF GitHub Bot commented on TRAFODION-2692:
-------------------------------------------
GitHub user zcorrea opened a pull request:
https://github.com/apache/incubator-trafodion/pull/1192
[TRAFODION-2692] Fixed monitor startup when hostname is in various forms
[TRAFODION-2001] Updated 'sqgen' to not generate script now supported by
'persist' commands.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/zcorrea/incubator-trafodion TRAFODION-2692
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/incubator-trafodion/pull/1192.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #1192
----
commit f3293e4feea0d655d1a1560689627706cdc53e89
Author: Zalo Correa <[email protected]>
Date: 2017-07-31T20:49:17Z
[TRAFODION-2692] Fixed monitor startup when hostname is in various forms
[TRAFODION-2001] Updated 'sqgen' to not generate script now supported by
'persist' commands.
----
> Monitor fails to start when node names are not of the right form
> ----------------------------------------------------------------
>
> Key: TRAFODION-2692
> URL: https://issues.apache.org/jira/browse/TRAFODION-2692
> Project: Apache Trafodion
> Issue Type: Bug
> Components: foundation
> Affects Versions: 2.2-incubating
> Environment: I tried this on an OpenStack cluster, using Hortonworks
> HDP 5.4. This is the code with the new elasticity feature.
> Reporter: Hans Zeller
> Assignee: Gonzalo E Correa
> Fix For: 2.2-incubating
>
>
> When trying to install Trafodion on a cluster, I ran into various situations
> where the monitor failed to start, based on how host names were configured
> and specified. I used three kinds of names:
> NN - a "nickname", a name I made up and put into /etc/hosts. Note: I made the
> mistake of just adding the nickname, not the actual name in the /etc/hosts
> line.
> LN - a local, non-qualified name that is also the OpenStack instance name and
> the host name.
> FQDN - the fully qualified domain host name
> {noformat}
> Case  Name specified  hostname command  sqconfig  What happened
>       in HDP          returns           contains
> ----  --------------  ----------------  --------  --------------------------
> 1     nickname        local name        nickname  sqstart returned an error,
>                                                   saying that sqstart must
>                                                   be executed on one of the
>                                                   nodes of the cluster
> 2     local name      local name        FQDN?     monitor core dump (1)
> 3     local name      FQDN              FQDN      monitor abends (2)
> 4     FQDN            FQDN              FQDN      install succeeds
> {noformat}
> Note 1: The core dump happened because of the following code in file
> core/sqf/monitor/linux/cluster.cxx:
> {noformat}
> // Build the monitor's configured view of the cluster
> if ( IsRealCluster )
> {   // Map node name to physical node id
>     // (for virtual nodes physical node equals "rank" (previously set))
>     MyPNID = clusterConfig->GetPNid( Node_name );
> }
> Nodes->AddNodes( );
> MyNode = Nodes->GetNode(MyPNID);
> Nodes->SetupCluster( &Node, &LNode, &indexToPnid_ );
> {noformat}
> Node_name is a local name. The name of the nodes in the "Nodes" list is the
> FQDN, so we don't find the node and MyPNID is set to -1. This leads to
> dereferencing MyNode, which is a NULL pointer.
> Note 2: The third case is the same as the second, with two modifications: Use
> the "hostname" command to set the host name to the FQDN, and edit /etc/hosts
> to put the FQDN first in the line and the local name second (case 2 had it
> the other way round). This time, we get past the problem described in case 2,
> but we get an error from MPI, which is unable to communicate with all the
> nodes (sorry, didn't record the exact error message).
> This is how the lines in /etc/hosts look (same layout for all nodes of
> the cluster):
> {noformat}
> # case 1
> 1.2.3.4 nickname1
> 1.2.3.5 nickname2
> # case 2
> 1.2.3.4 mynode1 mynode1.novalocal
> 1.2.3.5 mynode2 mynode2.novalocal
> # cases 3 and 4
> 1.2.3.4 mynode1.novalocal mynode1
> 1.2.3.5 mynode2.novalocal mynode2
> {noformat}
> My suggestion would be to identify the places where we read node names that
> are provided by the user and where such node names are compared, and to
> provide a comparison method that tolerates equivalent forms of names.
> There are related JIRAs: TRAFODION-2480 and TRAFODION-2391.
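
For illustration only, the comparison suggested at the end of the report could be
sketched as below. This is not the code from pull request #1192. The names
normalizeHostName, shortHostName, sameHost, findNodeId and checkReportedCases are
hypothetical, and the matching rule (an unqualified name matches an FQDN whose
first label is the same) is an assumption. It would cover the local name versus
FQDN mismatch of case 2, but not the made-up nickname of case 1, which would
need a resolver lookup.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Lowercase a host name and drop one trailing dot, if present.
std::string normalizeHostName(std::string name)
{
    std::transform(name.begin(), name.end(), name.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    if (!name.empty() && name.back() == '.')
        name.pop_back();
    return name;
}

// Return the part of the name before the first dot (the whole name if none).
std::string shortHostName(const std::string &name)
{
    return name.substr(0, name.find('.'));
}

// Hypothetical tolerant comparison: identical names match, and an
// unqualified name matches an FQDN with the same first label.
// Two different FQDNs never match, even if their first labels agree.
bool sameHost(const std::string &a, const std::string &b)
{
    const std::string na = normalizeHostName(a);
    const std::string nb = normalizeHostName(b);
    if (na.empty() || nb.empty())
        return false;
    if (na == nb)
        return true;
    const bool aQualified = na.find('.') != std::string::npos;
    const bool bQualified = nb.find('.') != std::string::npos;
    if (aQualified && bQualified)
        return false;
    return shortHostName(na) == shortHostName(nb);
}

// Hypothetical stand-in for a configuration lookup: returns the index of
// the matching configured name, or -1 if there is none.
int findNodeId(const std::vector<std::string> &configuredNames,
               const std::string &nodeName)
{
    for (std::size_t i = 0; i < configuredNames.size(); ++i)
        if (sameHost(configuredNames[i], nodeName))
            return static_cast<int>(i);
    return -1;
}

// Example usage with the names from the /etc/hosts listing in the report.
void checkReportedCases()
{
    const std::vector<std::string> configured = {"mynode1.novalocal",
                                                 "mynode2.novalocal"};
    assert(findNodeId(configured, "mynode2") == 1);    // local name vs FQDN
    assert(findNodeId(configured, "nickname1") == -1); // nickname is unknown
}
```

Even with a tolerant comparison, a caller would still need to check for the -1
result before using the id, since the core dump in Note 1 came from using the
result of a failed lookup.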