[
https://issues.apache.org/jira/browse/TRAFODION-2692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16108268#comment-16108268
]
ASF GitHub Bot commented on TRAFODION-2692:
-------------------------------------------
Github user zcorrea commented on a diff in the pull request:
https://github.com/apache/incubator-trafodion/pull/1192#discussion_r130503892
--- Diff: core/sqf/monitor/linux/clio.cxx ---
@@ -377,6 +379,21 @@ Local_IO_To_Monitor::Local_IO_To_Monitor(int pv_pid)
tmpptr++;
}
+ // Remove the domain portion of the name if any
+ char str1[MPI_MAX_PROCESSOR_NAME];
--- End diff --
It's a larger buffer than needed, but I should look at all the uses of
MPI_MAX_PROCESSOR_NAME. This is targeted for cleanup, as MPI usage is
quite minimal in the current code base.
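
As an illustration of the "remove the domain portion" step discussed in this diff, here is a minimal sketch that avoids a fixed-size buffer altogether by working on std::string. The helper name stripDomain is hypothetical and does not appear in the pull request; the actual change in clio.cxx uses a char array sized by MPI_MAX_PROCESSOR_NAME.

```cpp
#include <string>

// Hypothetical helper (not part of the Trafodion code base): return the
// host portion of a node name, i.e. everything before the first '.'.
// A name without a '.' is returned unchanged.
// Assumption: the input is a host name, not a dotted IPv4 address such
// as "1.2.3.4", which this sketch does not special-case.
std::string stripDomain(const std::string &nodeName)
{
    const std::string::size_type dot = nodeName.find('.');
    return (dot == std::string::npos) ? nodeName : nodeName.substr(0, dot);
}
```

With this approach the buffer-size question goes away, at the cost of a small allocation per call, which is likely acceptable for start-up code.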
> Monitor fails to start when node names are not of the right form
> ----------------------------------------------------------------
>
> Key: TRAFODION-2692
> URL: https://issues.apache.org/jira/browse/TRAFODION-2692
> Project: Apache Trafodion
> Issue Type: Bug
> Components: foundation
> Affects Versions: 2.2-incubating
> Environment: I tried this on an OpenStack cluster, using Hortonworks
> HDP 5.4. This is the code with the new elasticity feature.
> Reporter: Hans Zeller
> Assignee: Gonzalo E Correa
> Fix For: 2.2-incubating
>
>
> When trying to install Trafodion on a cluster, I ran into various situations
> where the monitor failed to start, based on how host names were configured
> and specified. I used three kinds of names:
> NN - a "nickname", a name I made up and put into /etc/hosts. Note: I made the
> mistake of just adding the nickname, not the actual name in the /etc/hosts
> line.
> LN - a local, non-qualified name that is also the OpenStack instance name and
> the host name.
> FQDN - the fully qualified domain host name
> {noformat}
> Case  Name specified  hostname command  sqconfig  What happened
>       in HDP          returns           contains
> ----  --------------  ----------------  --------  --------------------------
> 1     nickname        local name        nickname  sqstart returned an error,
>                                                   saying that sqstart must
>                                                   be executed on one of the
>                                                   nodes of the cluster
> 2     local name      local name        FQDN?     monitor core dump (1)
> 3     local name      FQDN              FQDN      monitor abends (2)
> 4     FQDN            FQDN              FQDN      install succeeds
> {noformat}
> Note 1: The core dump happened because of the following code in file
> core/sqf/monitor/linux/cluster.cxx:
> {noformat}
> // Build the monitor's configured view of the cluster
> if ( IsRealCluster )
> {   // Map node name to physical node id
>     // (for virtual nodes physical node equals "rank" (previously set))
>     MyPNID = clusterConfig->GetPNid( Node_name );
> }
> Nodes->AddNodes( );
> MyNode = Nodes->GetNode(MyPNID);
> Nodes->SetupCluster( &Node, &LNode, &indexToPnid_ );
> {noformat}
> Node_name is a local name. The name of the nodes in the "Nodes" list is the
> FQDN, so we don't find the node and MyPNID is set to -1. This leads to
> dereferencing MyNode, which is a NULL pointer.
> Note 2: The third case is the same as the second, with two modifications: Use
> the "hostname" command to set the host name to the FQDN, and edit /etc/hosts
> to put the FQDN first in the line and the local name second (case 2 had it
> the other way round). This time, we get past the problem described in case 2,
> but we get an error from MPI, which is unable to communicate with all the
> nodes (sorry, didn't record the exact error message).
> This is how the lines in /etc/hosts look (same layout for all nodes of
> the cluster):
> {noformat}
> # case 1
> 1.2.3.4 nickname1
> 1.2.3.5 nickname2
> # case 2
> 1.2.3.4 mynode1 mynode1.novalocal
> 1.2.3.5 mynode2 mynode2.novalocal
> # cases 3 and 4
> 1.2.3.4 mynode1.novalocal mynode1
> 1.2.3.5 mynode2.novalocal mynode2
> {noformat}
> My suggestion would be to identify the places where we read node names that
> are provided by the user and where such node names are compared, and to
> provide a comparison method that tolerates equivalent forms of names.
> There are related JIRAs: TRAFODION-2480 and TRAFODION-2391.
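
The suggestion in the issue description, a comparison method that tolerates equivalent forms of node names, could look roughly like the sketch below. The function sameNodeName and its helpers are hypothetical and not taken from the Trafodion sources. The sketch assumes that a short name and an FQDN refer to the same node when their first labels match, and that host names compare case-insensitively.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Return a lower-case copy of s (host names are case-insensitive).
static std::string toLower(std::string s)
{
    std::transform(s.begin(), s.end(), s.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return s;
}

// Return everything before the first '.', or the whole name if there is none.
static std::string shortName(const std::string &name)
{
    return name.substr(0, name.find('.'));
}

// Hypothetical comparison: true if a and b plausibly name the same node.
// - Two qualified names must match in full.
// - If at least one name is unqualified, only the short names are compared.
bool sameNodeName(const std::string &a, const std::string &b)
{
    const std::string la = toLower(a);
    const std::string lb = toLower(b);
    const bool aQualified = la.find('.') != std::string::npos;
    const bool bQualified = lb.find('.') != std::string::npos;
    if (aQualified && bQualified)
        return la == lb;
    return shortName(la) == shortName(lb);
}
```

A string comparison like this would cover cases 2 and 3 from the table (local name versus FQDN), but not case 1: a nickname that exists only in /etc/hosts can be matched to a host name only by going through the resolver.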