Has any headway been made on this issue?  I just ran into it as well.  The CES 
IP addresses just disappeared from my two protocol nodes (4.2.2.0).

From: <[email protected]> on behalf of Olaf Weiser 
<[email protected]>
Reply-To: gpfsug main discussion list <[email protected]>
Date: Thursday, February 2, 2017 at 12:02 PM
To: gpfsug main discussion list <[email protected]>
Subject: Re: [gpfsug-discuss] CES doesn't assign addresses to nodes

pls contact me directly
[email protected]
Mit freundlichen Grüßen / Kind regards


Olaf Weiser

EMEA Storage Competence Center Mainz, Germany / IBM Systems, Storage Platform,
-------------------------------------------------------------------------------------------------------------------------------------------
IBM Deutschland
IBM Allee 1
71139 Ehningen
Phone: +49-170-579-44-66
E-Mail: [email protected]
-------------------------------------------------------------------------------------------------------------------------------------------
IBM Deutschland GmbH / Vorsitzender des Aufsichtsrats: Martin Jetter
Geschäftsführung: Martina Koederitz (Vorsitzende), Susanne Peter, Norbert 
Janzen, Dr. Christian Keller, Ivo Koerner, Markus Koerner
Sitz der Gesellschaft: Ehningen / Registergericht: Amtsgericht Stuttgart, HRB 
14562 / WEEE-Reg.-Nr. DE 99369940



From:        Jonathon A Anderson <[email protected]>
To:        gpfsug main discussion list <[email protected]>
Date:        02/02/2017 06:45 PM
Subject:        Re: [gpfsug-discuss] CES doesn't assign addresses to nodes
Sent by:        [email protected]
________________________________



Any chance I can get that PMR# also, so I can reference it in my DDN case?

~jonathon


From: <[email protected]> on behalf of Olaf Weiser 
<[email protected]>
Reply-To: gpfsug main discussion list <[email protected]>
Date: Wednesday, February 1, 2017 at 2:28 AM
To: gpfsug main discussion list <[email protected]>
Subject: Re: [gpfsug-discuss] CES doesn't assign addresses to nodes

PMR opened... will send the # directly to you

Sent from IBM Verse
Mathias Dietz --- Re: [gpfsug-discuss] CES doesn't assign addresses to nodes ---

From:

"Mathias Dietz" <[email protected]>

To:

"gpfsug main discussion list" <[email protected]>

Date:

Wed. 01.02.2017 10:05

Subject:

Re: [gpfsug-discuss] CES doesn't assign addresses to nodes




________________________________


>I'll open a pmr here for my env ... the issue may hurt you in a ces env. 
>only... but needs to be fixed in core gpfs.base  i think

Thanks for opening the PMR.
The problem is inside the gpfs base code and we are working on a fix right now.
In the meantime until the fix is available we will use the PMR to 
propose/discuss potential work arounds.


Mit freundlichen Grüßen / Kind regards

Mathias Dietz

Spectrum Scale - Release Lead Architect (4.2.X Release)
System Health and Problem Determination Architect
IBM Certified Software Engineer

----------------------------------------------------------------------------------------------------------
IBM Deutschland
Hechtsheimer Str. 2
55131 Mainz
Phone: +49-6131-84-2027
Mobile: +49-15152801035
E-Mail: [email protected]
----------------------------------------------------------------------------------------------------------
IBM Deutschland Research & Development GmbH
Vorsitzender des Aufsichtsrats: Martina Koederitz, Geschäftsführung: Dirk 
Wittkopp
Sitz der Gesellschaft: Böblingen / Registergericht: Amtsgericht Stuttgart, HRB 
243294





From:        Olaf Weiser/Germany/IBM@IBMDE
To:        "gpfsug main discussion list" <[email protected]>
Cc:        "gpfsug main discussion list" <[email protected]>
Date:        01/31/2017 11:47 PM
Subject:        Re: [gpfsug-discuss] CES doesn't assign addresses to nodes
Sent by:        [email protected]
________________________________




Yeah... depending on the #nodes you're affected or not.....
So if your remote CES cluster is small enough in terms of the #nodes ... 
you'll never hit this issue

Sent from IBM Verse

Simon Thompson (Research Computing - IT Services) --- Re: [gpfsug-discuss] CES 
doesn't assign addresses to nodes ---
From:

"Simon Thompson (Research Computing - IT Services)" <[email protected]>

To:

"gpfsug main discussion list" <[email protected]>

Date:

Tue. 31.01.2017 21:07

Subject:

Re: [gpfsug-discuss] CES doesn't assign addresses to nodes




________________________________


We use multicluster for our environment: storage systems in one cluster, hpc 
nodes in another, and protocol nodes in a third.

According to the docs, this isn't supported, but we haven't seen any issues. 
Note unsupported as opposed to broken.

Simon
________________________________________
From: [email protected] 
[[email protected]] on behalf of Jonathon A Anderson 
[[email protected]]
Sent: 31 January 2017 17:47
To: gpfsug main discussion list
Subject: Re: [gpfsug-discuss] CES doesn't assign addresses to nodes

Yeah, I searched around for places where `tsctl shownodes up` appears in the 
GPFS code I have access to (i.e., the ksh and python stuff); but it’s only in 
CES. I suspect there just haven’t been that many people exporting CES out of an 
HPC cluster environment.

~jonathon


From: <[email protected]> on behalf of Olaf Weiser 
<[email protected]>
Reply-To: gpfsug main discussion list <[email protected]>
Date: Tuesday, January 31, 2017 at 10:45 AM
To: gpfsug main discussion list <[email protected]>
Subject: Re: [gpfsug-discuss] CES doesn't assign addresses to nodes

I'll open a pmr here for my env ... the issue may hurt you in a ces env. 
only... but needs to be fixed in core gpfs.base  i think

Sent from IBM Verse
Jonathon A Anderson --- Re: [gpfsug-discuss] CES doesn't assign addresses to 
nodes ---

From:

"Jonathon A Anderson" <[email protected]>

To:

"gpfsug main discussion list" <[email protected]>

Date:

Tue. 31.01.2017 17:32

Subject:

Re: [gpfsug-discuss] CES doesn't assign addresses to nodes

________________________________

No, I’m having trouble getting this through DDN support because, while we have 
a GPFS server license and GRIDScaler support, apparently we don’t have 
“protocol node” support, so they’ve pushed back on supporting this as an 
overall CES-rooted effort.

I do have a DDN case open, though: 78804. If you are (as I suspect) a GPFS 
developer, do you mind if I cite your info from here in my DDN case to get them 
to open a PMR?

Thanks.

~jonathon


From: <[email protected]> on behalf of Olaf Weiser 
<[email protected]>
Reply-To: gpfsug main discussion list <[email protected]>
Date: Tuesday, January 31, 2017 at 8:42 AM
To: gpfsug main discussion list <[email protected]>
Subject: Re: [gpfsug-discuss] CES doesn't assign addresses to nodes

ok.. so obviously... it seems that we have several issues..
the 3983-character truncation is obviously a defect
have you already raised a PMR? if so, can you send me the number?




From:        Jonathon A Anderson <[email protected]>
To:        gpfsug main discussion list <[email protected]>
Date:        01/31/2017 04:14 PM
Subject:        Re: [gpfsug-discuss] CES doesn't assign addresses to nodes
Sent by:        [email protected]
________________________________



The tail isn’t the issue; that’s my addition, so that I didn’t have to paste the 
hundred-or-so-line nodelist into the thread.

The actual command is

tsctl shownodes up | $tr ',' '\n' | $sort -o $upnodefile
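(For anyone following along: `$tr` and `$sort` are just the usual utilities as resolved in mmcesfuncs. A plain-shell sketch of the splitting step, with invented node names, not real tsctl output:)

```shell
# tsctl emits one comma-separated line of hostnames; tr splits it
# into one hostname per line, then sort orders them for comm:
printf 'node1,node2,node3' | tr ',' '\n' | sort
# -> node1
#    node2
#    node3
```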

But you can see in my tailed output that the last hostname listed is cut-off 
halfway through the hostname. Less obvious in the example, but true, is the 
fact that it’s only showing the first 120 hosts, when we have 403 nodes in our 
gpfs cluster.

[root@sgate2 ~]# tsctl shownodes up | tr ',' '\n' | wc -l
120

[root@sgate2 ~]# mmlscluster | grep '\-opa' | wc -l
403

Perhaps more explicitly, it looks like `tsctl shownodes up` can only transmit 
3983 characters.

[root@sgate2 ~]# tsctl shownodes up | wc -c
3983

Again, I’m convinced this is a bug not only because the command doesn’t 
actually produce a list of all of the up nodes in our cluster; but because the 
last name listed is incomplete.

[root@sgate2 ~]# tsctl shownodes up | tr ',' '\n' | tail -n 1
shas0260-opa.rc.int.col[root@sgate2 ~]#

I’d continue my investigation within tsctl itself but, alas, it’s a binary with 
no source code available to me. :)

I’m trying to get this opened as a bug / PMR; but I’m still working through the 
DDN support infrastructure. Thanks for reporting it, though.

For the record:

[root@sgate2 ~]# rpm -qa | grep -i gpfs
gpfs.base-4.2.1-2.x86_64
gpfs.msg.en_US-4.2.1-2.noarch
gpfs.gplbin-3.10.0-327.el7.x86_64-4.2.1-0.x86_64
gpfs.gskit-8.0.50-57.x86_64
gpfs.gpl-4.2.1-2.noarch
nfs-ganesha-gpfs-2.3.2-0.ibm24.el7.x86_64
gpfs.ext-4.2.1-2.x86_64
gpfs.gplbin-3.10.0-327.36.3.el7.x86_64-4.2.1-2.x86_64
gpfs.docs-4.2.1-2.noarch

~jonathon


From: <[email protected]> on behalf of Olaf Weiser 
<[email protected]>
Reply-To: gpfsug main discussion list <[email protected]>
Date: Tuesday, January 31, 2017 at 1:30 AM
To: gpfsug main discussion list <[email protected]>
Subject: Re: [gpfsug-discuss] CES doesn't assign addresses to nodes

Hi... same thing here.. everything after 10 nodes will be truncated..
though I don't have an issue with it ... I'll open a PMR .. and I recommend 
you do the same thing.. ;-)

the reason seems simple.. it is the "| tail" at the end of the command.. 
which truncates the output to the last 10 items...
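(A quick plain-shell illustration of that behavior — this is just what tail does by default, not the GPFS code itself:)

```shell
# "| tail" with no arguments keeps only the last 10 lines:
seq 1 15 | tail | wc -l
# -> 10
```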

should be easy to fix..
cheers
olaf





From:        Jonathon A Anderson <[email protected]>
To:        "[email protected]" <[email protected]>
Date:        01/30/2017 11:11 PM
Subject:        Re: [gpfsug-discuss] CES doesn't assign addresses to nodes
Sent by:        [email protected]
________________________________




In trying to figure this out on my own, I’m relatively certain I’ve found a bug 
in GPFS related to the truncation of output from `tsctl shownodes up`. Any 
chance someone in development can confirm?


Here are the details of my investigation:


## GPFS is up on sgate2

[root@sgate2 ~]# mmgetstate

Node number  Node name        GPFS state
------------------------------------------
 414      sgate2-opa       active


## but if I tell ces to explicitly put one of our ces addresses on that node, 
it says that GPFS is down

[root@sgate2 ~]# mmces address move --ces-ip 10.225.71.102 --ces-node sgate2-opa
mmces address move: GPFS is down on this node.
mmces address move: Command failed. Examine previous error messages to 
determine cause.


## the “GPFS is down on this node” message is defined as code 109 in mmglobfuncs

[root@sgate2 ~]# grep --before-context=1 "GPFS is down on this node." 
/usr/lpp/mmfs/bin/mmglobfuncs
109 ) msgTxt=\
"%s: GPFS is down on this node."


## and is generated by printErrorMsg in mmcesnetmvaddress when it detects that 
the current node is identified as “down” by getDownCesNodeList

[root@sgate2 ~]# grep --before-context=5 'printErrorMsg 109' 
/usr/lpp/mmfs/bin/mmcesnetmvaddress
downNodeList=$(getDownCesNodeList)
for downNode in $downNodeList
do
if [[ $toNodeName == $downNode ]]
then
  printErrorMsg 109 "$mmcmd"


## getDownCesNodeList is the intersection of all ces nodes with GPFS cluster 
nodes listed in `tsctl shownodes up`

[root@sgate2 ~]# grep --after-context=16 '^function getDownCesNodeList' 
/usr/lpp/mmfs/bin/mmcesfuncs
function getDownCesNodeList
{
typeset sourceFile="mmcesfuncs.sh"
[[ -n $DEBUG || -n $DEBUGgetDownCesNodeList ]] &&set -x
$mmTRACE_ENTER "$*"

typeset upnodefile=${cmdTmpDir}upnodefile
typeset downNodeList

# get all CES nodes
$sort -o $nodefile $mmfsCesNodes.dae

$tsctl shownodes up | $tr ',' '\n' | $sort -o $upnodefile

downNodeList=$($comm -23 $nodefile $upnodefile)
print -- $downNodeList
}  #----- end of function getDownCesNodeList --------------------
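To make the failure mode concrete, here is a hypothetical illustration (invented file paths and node names, not the actual mmcesfuncs variables) of how a truncated "up" list makes `comm -23` report a live CES node as down:

```shell
# All CES nodes, sorted (stand-in for $nodefile):
printf 'sgate1-opa\nsgate2-opa\n' | sort > /tmp/cesnodes
# The "up" list from tsctl, truncated so sgate2-opa got cut off:
printf 'sgate1-opa\n' | sort > /tmp/upnodes
# comm -23 prints lines only in the first file, i.e. nodes
# considered down -- so the live sgate2-opa is reported down:
comm -23 /tmp/cesnodes /tmp/upnodes
# -> sgate2-opa
```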


## but not only are the sgate nodes not listed by `tsctl shownodes up`; its 
output is obviously and erroneously truncated

[root@sgate2 ~]# tsctl shownodes up | tr ',' '\n' | tail
shas0251-opa.rc.int.colorado.edu
shas0252-opa.rc.int.colorado.edu
shas0253-opa.rc.int.colorado.edu
shas0254-opa.rc.int.colorado.edu
shas0255-opa.rc.int.colorado.edu
shas0256-opa.rc.int.colorado.edu
shas0257-opa.rc.int.colorado.edu
shas0258-opa.rc.int.colorado.edu
shas0259-opa.rc.int.colorado.edu
shas0260-opa.rc.int.col[root@sgate2 ~]#


## I expect that this is a bug in GPFS, likely related to a maximum output 
buffer for `tsctl shownodes up`.
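A fixed-size output buffer would explain both symptoms: the same byte count every run, and a cut mid-hostname. `head -c` simulates that failure mode (hypothetical names, not actual tsctl output):

```shell
# Truncating a comma-separated node list at a fixed byte offset
# cuts mid-token, just like the tsctl output above:
printf 'node001-opa,node002-opa,node003-opa' | head -c 30
# -> node001-opa,node002-opa,node00
```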



On 1/24/17, 12:48 PM, "Jonathon A Anderson" <[email protected]> 
wrote:

I think I'm having the same issue described here:

http://www.spectrumscale.org/pipermail/gpfsug-discuss/2016-October/002288.html

Any advice or further troubleshooting steps would be much appreciated. Full 
disclosure: I also have a DDN case open. (78804)

We've got a four-node (snsd{1..4}) DDN gridscaler system. I'm trying to add two 
CES protocol nodes (sgate{1,2}) to serve NFS.

Here's the steps I took:

---
mmcrnodeclass protocol -N sgate1-opa,sgate2-opa
mmcrnodeclass nfs -N sgate1-opa,sgate2-opa
mmchconfig cesSharedRoot=/gpfs/summit/ces
mmchcluster --ccr-enable
mmchnode --ces-enable -N protocol
mmces service enable NFS
mmces service start NFS -N nfs
mmces address add --ces-ip 10.225.71.104,10.225.71.105
mmces address policy even-coverage
mmces address move --rebalance
---

This worked the very first time I ran it, but the CES addresses weren't 
re-distributed after restarting GPFS or a node reboot.

Things I've tried:

* disabling ces on the sgate nodes and re-running the above procedure
* moving the cluster and filesystem managers to different snsd nodes
* deleting and re-creating the cesSharedRoot directory

Meanwhile, the following log entry appears in mmfs.log.latest every ~30s:

---
Mon Jan 23 20:31:20 MST 2017: mmcesnetworkmonitor: Found unassigned address 
10.225.71.104
Mon Jan 23 20:31:20 MST 2017: mmcesnetworkmonitor: Found unassigned address 
10.225.71.105
Mon Jan 23 20:31:20 MST 2017: mmcesnetworkmonitor: handleNetworkProblem with 
lock held: assignIP 10.225.71.104_0-_+,10.225.71.105_0-_+ 1
Mon Jan 23 20:31:20 MST 2017: mmcesnetworkmonitor: Assigning addresses: 
10.225.71.104_0-_+,10.225.71.105_0-_+
Mon Jan 23 20:31:20 MST 2017: mmcesnetworkmonitor: moveCesIPs: 
10.225.71.104_0-_+,10.225.71.105_0-_+
---

Also notable, whenever I add or remove addresses now, I see this in 
mmsysmonitor.log (among a lot of other entries):

---
2017-01-23T20:40:56.363 sgate1 D ET_cesnetwork Entity state without 
requireUnique: ces_network_ips_down WARNING No CES relevant NICs detected - 
Service.calculateAndUpdateState:275
2017-01-23T20:40:11.364 sgate1 D ET_cesnetwork Update multiple entities at once 
{'p2p2': 1, 'bond0': 1, 'p2p1': 1} - Service.setLocalState:333
---

For the record, here's the interface I expect to get the address on sgate1:

---
11: bond0: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 9000 qdisc noqueue 
state UP
link/ether 3c:fd:fe:08:a7:c0 brd ff:ff:ff:ff:ff:ff
inet 10.225.71.107/20 brd 10.225.79.255 scope global bond0
valid_lft forever preferred_lft forever
inet6 fe80::3efd:feff:fe08:a7c0/64 scope link
valid_lft forever preferred_lft forever
---

which is a bond of p2p1 and p2p2.

---
6: p2p1: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 9000 qdisc mq master bond0 
state UP qlen 1000
link/ether 3c:fd:fe:08:a7:c0 brd ff:ff:ff:ff:ff:ff
7: p2p2: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 9000 qdisc mq master bond0 
state UP qlen 1000
link/ether 3c:fd:fe:08:a7:c0 brd ff:ff:ff:ff:ff:ff
---

A similar bond0 exists on sgate2.

I crawled around in /usr/lpp/mmfs/lib/mmsysmon/CESNetworkService.py for a while 
trying to figure it out, but have been unsuccessful so far.



_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss
















