Re: [gpfsug-discuss] CES doesn't assign addresses to nodes

2017-01-31 Thread Jonathon A Anderson
Simon,

This is what I’d usually do, and I’m pretty sure it’d fix the problem; but we 
only have two protocol nodes, so no good way to do quorum in a separate cluster 
of just those two.

Plus, I’d just like to see the bug fixed.

I suppose we could move the compute nodes to a separate cluster, and keep the 
protocol nodes together with the NSD servers; but then I’m back to the age-old 
question of “do I technically violate the GPFS license in order to do the right 
thing architecturally?” (Since you have to nominate GPFS servers in the 
client-only cluster to manage quorum, for nodes that only have client licenses.)

So far, we’re 100% legit, and it’d be better to stay that way.

~jonathon


On 1/31/17, 1:07 PM, "gpfsug-discuss-boun...@spectrumscale.org on behalf of 
Simon Thompson (Research Computing - IT Services)" wrote:

We use multicluster for our environment: the storage systems are in one 
cluster, the HPC nodes in a second, and the protocol nodes in a third.

According to the docs, this isn't supported, but we haven't seen any 
issues. Note: unsupported, as opposed to broken.

Simon


Re: [gpfsug-discuss] CES doesn't assign addresses to nodes

2017-01-31 Thread Simon Thompson (Research Computing - IT Services)
We use multicluster for our environment: the storage systems are in one 
cluster, the HPC nodes in a second, and the protocol nodes in a third.

According to the docs, this isn't supported, but we haven't seen any issues. 
Note: unsupported, as opposed to broken.

Simon

From: gpfsug-discuss-boun...@spectrumscale.org 
[gpfsug-discuss-boun...@spectrumscale.org] on behalf of Jonathon A Anderson 
[jonathon.ander...@colorado.edu]
Sent: 31 January 2017 17:47
To: gpfsug main discussion list
Subject: Re: [gpfsug-discuss] CES doesn't assign addresses to nodes

Yeah, I searched around for places where `tsctl shownodes up` appears in the 
GPFS code I have access to (i.e., the ksh and Python stuff), but it’s only in 
CES. I suspect there just haven’t been that many people exporting CES out of an 
HPC cluster environment.
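
A search like the one described could be sketched as below. This is only an illustration: the helper name find_callers is made up, and the -I option (skip binary files) is a GNU/BSD grep extension rather than POSIX. The directory in the usage comment is the one the mm* scripts are quoted from elsewhere in this thread.

```shell
# Sketch: list text files under directory $1 that contain the literal string $2.
# -r recurses, -l prints only file names, -I skips binary files,
# -F treats the pattern as a fixed string rather than a regex.
find_callers() {
    grep -rlIF -- "$2" "$1"
}

# Possible use on a GPFS node (not executed here):
#   find_callers /usr/lpp/mmfs/bin 'tsctl shownodes up'
```

Because the match is a plain substring, it also catches calls written with a variable prefix, such as `$tsctl shownodes up`.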

~jonathon


From:  on behalf of Olaf Weiser 

Reply-To: gpfsug main discussion list 
Date: Tuesday, January 31, 2017 at 10:45 AM
To: gpfsug main discussion list 
Subject: Re: [gpfsug-discuss] CES doesn't assign addresses to nodes

I'll open a PMR here for my environment. The issue may only hurt you in a CES 
environment, but I think it needs to be fixed in core gpfs.base.


Re: [gpfsug-discuss] CES doesn't assign addresses to nodes

2017-01-31 Thread Jonathon A Anderson
No, I’m having trouble getting this through DDN support because, while we have 
a GPFS server license and GRIDScaler support, apparently we don’t have 
“protocol node” support, so they’ve pushed back on supporting this as an 
overall CES-rooted effort.

I do have a DDN case open, though: 78804. If you are (as I suspect) a GPFS 
developer, do you mind if I cite your info from here in my DDN case to get them 
to open a PMR?

Thanks.

~jonathon


From:  on behalf of Olaf Weiser 

Reply-To: gpfsug main discussion list 
Date: Tuesday, January 31, 2017 at 8:42 AM
To: gpfsug main discussion list 
Subject: Re: [gpfsug-discuss] CES doesn't assign addresses to nodes

OK, so it seems that we have several issues.
The 3983-character limit is obviously a defect.
Have you already raised a PMR? If so, can you send me the number?




From: Jonathon A Anderson 
To: gpfsug main discussion list 
Date: 01/31/2017 04:14 PM
Subject: Re: [gpfsug-discuss] CES doesn't assign addresses to nodes
Sent by: gpfsug-discuss-boun...@spectrumscale.org




The tail isn’t the issue; that’s my addition, so that I didn’t have to paste 
the hundred-or-so-line node list into the thread.

The actual command is

tsctl shownodes up | $tr ',' '\n' | $sort -o $upnodefile

But you can see in my tailed output that the last hostname listed is cut off 
halfway through the hostname. Less obvious in the example, but true, is the 
fact that it’s only showing the first 120 hosts, when we have 403 nodes in our 
GPFS cluster.

[root@sgate2 ~]# tsctl shownodes up | tr ',' '\n' | wc -l
120

[root@sgate2 ~]# mmlscluster | grep '\-opa' | wc -l
403
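
Why a short list matters for CES can be sketched as follows. getDownCesNodeList, quoted elsewhere in this thread, passes the sorted CES node list and the sorted `tsctl shownodes up` output to `comm -23`, which prints only the lines unique to the first file (strictly a set difference). The function name down_nodes and the host names below are made up for illustration, and no GPFS command is run.

```shell
# Sketch: print entries of the comma-separated list $1 (CES nodes)
# that are missing from the comma-separated list $2 (nodes reported up).
down_nodes() {
    ces_file=$(mktemp) || return 1
    up_file=$(mktemp) || return 1
    # comm needs both inputs sorted, one entry per line
    printf '%s\n' "$1" | tr ',' '\n' | sort > "$ces_file"
    printf '%s\n' "$2" | tr ',' '\n' | sort > "$up_file"
    # -23 suppresses lines unique to file 2 and lines common to both
    comm -23 "$ces_file" "$up_file"
    rm -f "$ces_file" "$up_file"
}

# Complete up-list: both CES nodes are present, so nothing is printed.
down_nodes 'ces1,ces2' 'node1,node2,ces1,ces2'

# Truncated up-list (cut off mid-name): prints ces1 and ces2 as "down".
down_nodes 'ces1,ces2' 'node1,node2,ce'
```

Under this reading, any CES node whose name falls past the truncation point would be treated as down even though GPFS is active on it.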

Perhaps more explicitly, it looks like `tsctl shownodes up` can only transmit 
3983 characters.

[root@sgate2 ~]# tsctl shownodes up | wc -c
3983

Again, I’m convinced this is a bug not only because the command doesn’t 
actually produce a list of all of the up nodes in our cluster; but because the 
last name listed is incomplete.

[root@sgate2 ~]# tsctl shownodes up | tr ',' '\n' | tail -n 1
shas0260-opa.rc.int.col[root@sgate2 ~]#
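
The two symptoms above (too few entries, and a final entry that is not a complete hostname) can be checked mechanically. The following is a minimal sketch, assuming every complete hostname ends with a known domain suffix; check_nodelist and the example names are hypothetical.

```shell
# Sketch: summarize a comma-separated node list read from stdin.
# $1 is the suffix every complete host name is assumed to end with.
check_nodelist() {
    suffix=$1
    list=$(cat)
    # count non-empty entries, whether or not the input ends in a newline
    count=$(printf '%s' "$list" | tr ',' '\n' | grep -c .)
    # everything after the last comma is the final entry
    last=${list##*,}
    case $last in
        *"$suffix") status=ok ;;
        *) status=truncated ;;
    esac
    printf 'count=%s last=%s status=%s\n' "$count" "$last" "$status"
}

# Two made-up names, the second cut off as in the output above:
printf 'n1.example.edu,n2.exam' | check_nodelist .example.edu
# prints: count=2 last=n2.exam status=truncated

# On a GPFS node one might instead run (not executed here):
#   tsctl shownodes up | check_nodelist .colorado.edu
```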

I’d continue my investigation within tsctl itself but, alas, it’s a binary with 
no source code available to me. :)

I’m trying to get this opened as a bug / PMR; but I’m still working through the 
DDN support infrastructure. Thanks for reporting it, though.

For the record:

[root@sgate2 ~]# rpm -qa | grep -i gpfs
gpfs.base-4.2.1-2.x86_64
gpfs.msg.en_US-4.2.1-2.noarch
gpfs.gplbin-3.10.0-327.el7.x86_64-4.2.1-0.x86_64
gpfs.gskit-8.0.50-57.x86_64
gpfs.gpl-4.2.1-2.noarch
nfs-ganesha-gpfs-2.3.2-0.ibm24.el7.x86_64
gpfs.ext-4.2.1-2.x86_64
gpfs.gplbin-3.10.0-327.36.3.el7.x86_64-4.2.1-2.x86_64
gpfs.docs-4.2.1-2.noarch

~jonathon


From:  on behalf of Olaf Weiser 

Reply-To: gpfsug main discussion list 
Date: Tuesday, January 31, 2017 at 1:30 AM
To: gpfsug main discussion list 
Subject: Re: [gpfsug-discuss] CES doesn't assign addresses to nodes

Hi, same thing here: everything after 10 nodes is truncated, though I 
don't have an issue with it. I'll open a PMR, and I recommend you do the 
same. ;-)

The reason seems simple: it is the "| tail" at the end of the command, 
which truncates the output to the last 10 items.

Should be easy to fix.
cheers
olaf





From: Jonathon A Anderson 
To: "gpfsug-discuss@spectrumscale.org" 
Date: 01/30/2017 11:11 PM
Subject: Re: [gpfsug-discuss] CES doesn't assign addresses to nodes
Sent by: gpfsug-discuss-boun...@spectrumscale.org





In trying to figure this out on my own, I’m relatively certain I’ve found a bug 
in GPFS related to the truncation of output from `tsctl shownodes up`. Any 
chance someone in development can confirm?


Here are the details of my investigation:


## GPFS is up on sgate2

[root@sgate2 ~]# mmgetstate

 Node number  Node name        GPFS state
------------------------------------------
     414      sgate2-opa       active


## but if I tell ces to explicitly put one of our ces addresses on that node, 
it says that GPFS is down

[root@sgate2 ~]# mmces address move --ces-ip 10.225.71.102 --ces-node sgate2-opa
mmces address move: GPFS is down on this node.
mmces address move: Command failed. Examine previous error messages to determine cause.


## the “GPFS is down on this node” message is defined as code 109 in mmglobfuncs

[root@sgate2 ~]# grep --before-context=1 "GPFS is down on this node." /usr/lpp/mmfs/bin/mmglobfuncs
  109 ) msgTxt=\
"%s: GPFS is down on this node."



Re: [gpfsug-discuss] CES doesn't assign addresses to nodes

2017-01-31 Thread Olaf Weiser
Hi, same thing here: everything after 10 nodes is truncated, though I 
don't have an issue with it. I'll open a PMR, and I recommend you do the 
same. ;-)

The reason seems simple: it is the "| tail" at the end of the command, 
which truncates the output to the last 10 items.

Should be easy to fix.
cheers
olaf


From: Jonathon A Anderson 
To: "gpfsug-discuss@spectrumscale.org" 
Date: 01/30/2017 11:11 PM
Subject: Re: [gpfsug-discuss] CES doesn't assign addresses to nodes
Sent by: gpfsug-discuss-boun...@spectrumscale.org


In trying to figure this out on my own, I’m relatively certain I’ve found a bug 
in GPFS related to the truncation of output from `tsctl shownodes up`. Any 
chance someone in development can confirm?


Here are the details of my investigation:


## GPFS is up on sgate2

[root@sgate2 ~]# mmgetstate

 Node number  Node name        GPFS state
------------------------------------------
     414      sgate2-opa       active


## but if I tell ces to explicitly put one of our ces addresses on that node, 
it says that GPFS is down

[root@sgate2 ~]# mmces address move --ces-ip 10.225.71.102 --ces-node sgate2-opa
mmces address move: GPFS is down on this node.
mmces address move: Command failed. Examine previous error messages to determine cause.


## the “GPFS is down on this node” message is defined as code 109 in mmglobfuncs

[root@sgate2 ~]# grep --before-context=1 "GPFS is down on this node." /usr/lpp/mmfs/bin/mmglobfuncs
  109 ) msgTxt=\
"%s: GPFS is down on this node."


## and is generated by printErrorMsg in mmcesnetmvaddress when it detects that 
the current node is identified as “down” by getDownCesNodeList

[root@sgate2 ~]# grep --before-context=5 'printErrorMsg 109' /usr/lpp/mmfs/bin/mmcesnetmvaddress
  downNodeList=$(getDownCesNodeList)
  for downNode in $downNodeList
  do
    if [[ $toNodeName == $downNode ]]
    then
      printErrorMsg 109 "$mmcmd"


## getDownCesNodeList is the intersection of all ces nodes with GPFS cluster 
nodes listed in `tsctl shownodes up`

[root@sgate2 ~]# grep --after-context=16 '^function getDownCesNodeList' /usr/lpp/mmfs/bin/mmcesfuncs
function getDownCesNodeList
{
  typeset sourceFile="mmcesfuncs.sh"
  [[ -n $DEBUG || -n $DEBUGgetDownCesNodeList ]] && set -x
  $mmTRACE_ENTER "$*"

  typeset upnodefile=${cmdTmpDir}upnodefile
  typeset downNodeList

  # get all CES nodes
  $sort -o $nodefile $mmfsCesNodes.dae

  $tsctl shownodes up | $tr ',' '\n' | $sort -o $upnodefile

  downNodeList=$($comm -23 $nodefile $upnodefile)
  print -- $downNodeList
}  #- end of function getDownCesNodeList


## but not only are the sgate nodes not listed by `tsctl shownodes up`; its 
output is obviously and erroneously truncated

[root@sgate2 ~]# tsctl shownodes up | tr ',' '\n' | tail
shas0251-opa.rc.int.colorado.edu
shas0252-opa.rc.int.colorado.edu
shas0253-opa.rc.int.colorado.edu
shas0254-opa.rc.int.colorado.edu
shas0255-opa.rc.int.colorado.edu
shas0256-opa.rc.int.colorado.edu
shas0257-opa.rc.int.colorado.edu
shas0258-opa.rc.int.colorado.edu
shas0259-opa.rc.int.colorado.edu
shas0260-opa.rc.int.col[root@sgate2 ~]#


## I expect that this is a bug in GPFS, likely related to a maximum output 
buffer for `tsctl shownodes up`.


On 1/24/17, 12:48 PM, "Jonathon A Anderson" wrote:

    I think I'm having the same issue described here:

    http://www.spectrumscale.org/pipermail/gpfsug-discuss/2016-October/002288.html

    Any advice or further troubleshooting steps would be much appreciated. 
    Full disclosure: I also have a DDN case open. (78804)

    We've got a four-node (snsd{1..4}) DDN gridscaler system. I'm trying to 
    add two CES protocol nodes (sgate{1,2}) to serve NFS.

    Here are the steps I took:

    ---
    mmcrnodeclass protocol -N sgate1-opa,sgate2-opa
    mmcrnodeclass nfs -N sgate1-opa,sgate2-opa
    mmchconfig cesSharedRoot=/gpfs/summit/ces
    mmchcluster --ccr-enable
    mmchnode --ces-enable -N protocol
    mmces service enable NFS
    mmces service start NFS -N nfs
    mmces address add --ces-ip 10.225.71.104,10.225.71.105
    mmces address policy even-coverage
    mmces address move --rebalance
    ---

    This worked the very first time I ran it, but the CES addresses weren't 
    re-distributed after restarting GPFS or a node reboot.

    Things I've tried:

    * disabling ces on the sgate nodes and re-running the above procedure
    * moving the cluster and filesystem managers to different snsd nodes
    * deleting and re-creating the cesSharedRoot directory

    Meanwhile, the following log entry appears in mmfs.log.latest every ~30s:

    ---
    Mon Jan 23 20:31:20 MST 2017: mmcesnetworkmonitor: Found unassigned address 10.225.71.104
    Mon Jan 23 20:31:20 MST 2017: mmcesnetworkmonitor: Found unassigned address 10.225.71.105
    Mon Jan 23 20:31:20 MST 2017: mmcesnetworkmonitor: handleNetworkProblem with lock held: assignIP