Yeah... whether you're affected or not depends on the number of nodes.
So if your remote CES cluster is small enough in terms of the number of nodes,
you'll never hit this issue.
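
A rough way to judge whether a cluster is large enough to be affected: with fully qualified names of about 32 characters plus a comma (like the shas*-opa names quoted further down), the 3983 characters reported in this thread hold roughly 120 names. The sketch below compares the number of names in a node list against the number expected. `check_node_list` is a hypothetical helper, not a GPFS command, and a shortfall can also simply mean that some nodes really are down.

```shell
#!/bin/sh
# Hypothetical helper (not part of GPFS): compare the number of names in a
# comma-separated node list, such as the output of `tsctl shownodes up`,
# with the number of nodes expected to be up.
# $1 = node list, $2 = expected node count (site-specific assumption).
check_node_list() {
    list=$1
    expected=$2
    # Count entries the same way the thread does: one name per line.
    seen=$(printf '%s\n' "$list" | tr ',' '\n' | grep -c . || true)
    if [ "$seen" -lt "$expected" ]; then
        echo "short: $seen of $expected nodes listed"
    else
        echo "ok: $seen nodes listed"
    fi
}
```

On a cluster node it might be called as `check_node_list "$(tsctl shownodes up)" 403`, with 403 replaced by the local node count.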

Sent from IBM Verse

Simon Thompson (Research Computing - IT Services) --- Re: [gpfsug-discuss] CES doesn't assign addresses to nodes ---

From: "Simon Thompson (Research Computing - IT Services)" <[email protected]>
To: "gpfsug main discussion list" <[email protected]>
Date: Tue. 31.01.2017 21:07
Subject: Re: [gpfsug-discuss] CES doesn't assign addresses to nodes

We use multicluster for our environment: storage systems in a separate
cluster to HPC nodes, on a separate cluster from protocol nodes.

According to the docs, this isn't supported, but we haven't seen any issues.
Note unsupported as opposed to broken.

Simon

________________________________________
From:
[email protected] [[email protected]] on behalf of Jonathon A Anderson [[email protected]]
Sent: 31 January 2017 17:47
To: gpfsug main discussion list
Subject: Re: [gpfsug-discuss] CES doesn't assign addresses to nodes

Yeah, I searched around for places where `tsctl shownodes up` appears in
the GPFS code I have access to (i.e., the ksh and python stuff); but it’s only
in CES. I suspect there just haven’t been that many people exporting CES out of
an HPC cluster environment.

~jonathon

From:
<[email protected]> on behalf of Olaf Weiser <[email protected]>
Reply-To: gpfsug main discussion list <[email protected]>
Date: Tuesday, January 31, 2017 at 10:45 AM
To: gpfsug main discussion list <[email protected]>
Subject: Re: [gpfsug-discuss] CES doesn't assign addresses to nodes

I'll open a PMR here for my env... the issue may hurt you in a CES env only...
but it needs to be fixed in core gpfs.base, I think.

Sent from IBM Verse

Jonathon A Anderson ---
Re: [gpfsug-discuss] CES doesn't assign addresses to nodes ---

From: "Jonathon A Anderson" <[email protected]>
To: "gpfsug main discussion list" <[email protected]>
Date: Tue. 31.01.2017 17:32
Subject: Re: [gpfsug-discuss] CES doesn't assign addresses to nodes

________________________________

No, I’m having trouble getting this through DDN support because, while we
have a GPFS server license and GRIDScaler support, apparently we don’t have
“protocol node” support, so they’ve pushed back on supporting this as an
overall CES-rooted effort.

I do have a DDN case open, though: 78804. If you are (as I suspect) a GPFS
developer, do you mind if I cite your info from here in my DDN case to get
them to open a PMR?

Thanks.

~jonathon

From: <[email protected]> on behalf
of Olaf Weiser <[email protected]>
Reply-To: gpfsug main discussion list <[email protected]>
Date: Tuesday, January 31, 2017 at 8:42 AM
To: gpfsug main discussion list <[email protected]>
Subject: Re: [gpfsug-discuss] CES doesn't assign addresses to nodes

OK.. so obviously... it seems that we have several issues.
The 3983 characters is obviously a defect.
Have you already raised a PMR? If so, can you send me the number?

From:
Jonathon A Anderson <[email protected]>
To: gpfsug main discussion list <[email protected]>
Date: 01/31/2017 04:14 PM
Subject: Re: [gpfsug-discuss] CES doesn't assign addresses to nodes
Sent by: [email protected]

________________________________

The tail isn’t the issue; that’s my addition, so that I didn’t have to paste
the hundred or so line nodelist into the thread.

The actual command is

tsctl shownodes up | $tr ',' '\n' | $sort -o $upnodefile

But you can see in my tailed output that the last hostname listed is cut off
halfway through the hostname.
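
To illustrate why a cut-off list matters, here is a minimal sketch of the `comm -23` comparison that `getDownCesNodeList` (quoted further down in this thread) performs, run on made-up node names. `down_nodes` is a hypothetical stand-in written for this example, not the GPFS function, and the temporary-file handling is an assumption of the sketch.

```shell
#!/bin/sh
# Sketch of the comparison used by getDownCesNodeList, on made-up names.
# down_nodes is a hypothetical stand-in, not the GPFS function itself.
# $1 = newline-separated CES nodes, $2 = comma-separated "up" node list.
down_nodes() {
    tmp=$(mktemp -d) || return 1
    printf '%s\n' "$1" | sort > "$tmp/ces"
    printf '%s\n' "$2" | tr ',' '\n' | sort > "$tmp/up"
    # comm -23 keeps only the lines that appear in the first file alone,
    # i.e. CES nodes that are missing from the "up" list.
    comm -23 "$tmp/ces" "$tmp/up"
    rm -rf "$tmp"
}

ces='gate1
gate2'
down_nodes "$ces" 'node1,node2,gate1,gate2'   # complete list: prints nothing
down_nodes "$ces" 'node1,node2,gate1,ga'      # cut-off list: prints gate2
```

A node whose name is missing or cut off in the list is reported as down even though GPFS is active on it.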

Less obvious in the example, but true, is the fact that it’s only showing the
first 120 hosts, when we have 403 nodes in our gpfs cluster.

[root@sgate2 ~]# tsctl shownodes up | tr ',' '\n' | wc -l
120

[root@sgate2 ~]# mmlscluster | grep '\-opa' | wc -l
403

Perhaps more explicitly, it looks like `tsctl shownodes up` can only transmit
3983 characters.

[root@sgate2 ~]# tsctl shownodes up | wc -c
3983

Again, I’m convinced this is a bug not only because the command doesn’t
actually produce a list of all of the up nodes in our cluster; but because the
last name listed is incomplete.

[root@sgate2 ~]# tsctl shownodes up | tr ',' '\n' | tail -n 1
shas0260-opa.rc.int.col[root@sgate2 ~]#

I’d continue my investigation within tsctl itself but, alas, it’s a binary
with no source code available to me. :)

I’m trying to get this opened as a bug / PMR; but I’m still working through
the DDN support infrastructure. Thanks for reporting it, though.

For the record:

[root@sgate2 ~]# rpm -qa | grep -i gpfs
gpfs.base-4.2.1-2.x86_64
gpfs.msg.en_US-4.2.1-2.noarch
gpfs.gplbin-3.10.0-327.el7.x86_64-4.2.1-0.x86_64
gpfs.gskit-8.0.50-57.x86_64
gpfs.gpl-4.2.1-2.noarch
nfs-ganesha-gpfs-2.3.2-0.ibm24.el7.x86_64
gpfs.ext-4.2.1-2.x86_64
gpfs.gplbin-3.10.0-327.36.3.el7.x86_64-4.2.1-2.x86_64
gpfs.docs-4.2.1-2.noarch

~jonathon

From:
<[email protected]> on behalf of Olaf Weiser <[email protected]>
Reply-To: gpfsug main discussion list <[email protected]>
Date: Tuesday, January 31, 2017 at 1:30 AM
To: gpfsug main discussion list <[email protected]>
Subject: Re: [gpfsug-discuss] CES doesn't assign addresses to nodes

Hi... same thing here.. everything after 10 nodes will be truncated.. though I
don't have an issue with it... I'll open a PMR.. and I recommend you do the
same thing.. ;-)

The reason seems simple.. it is the "| tail" at the end of the command.. which
truncates the output to the last 10 items...

Should be easy to fix..

cheers
olaf

From: Jonathon A Anderson
<[email protected]>
To: "[email protected]" <[email protected]>
Date: 01/30/2017 11:11 PM
Subject: Re: [gpfsug-discuss] CES doesn't assign addresses to nodes
Sent by: [email protected]

________________________________

In trying to figure this out on my own, I’m relatively certain I’ve found a
bug in GPFS related to the truncation of output from `tsctl shownodes up`. Any
chance someone in development can confirm?

Here are the details of my investigation:

## GPFS is up on sgate2

[root@sgate2 ~]# mmgetstate

 Node number  Node name        GPFS state
------------------------------------------
     414      sgate2-opa       active

## but if I tell ces to explicitly put one of our ces addresses on that
node, it says that GPFS is down

[root@sgate2 ~]# mmces address move --ces-ip 10.225.71.102 --ces-node sgate2-opa
mmces address move: GPFS is down on this node.
mmces address move: Command failed. Examine previous error messages to determine cause.

## the “GPFS is down on this node” message is defined as code 109 in mmglobfuncs

[root@sgate2 ~]# grep --before-context=1 "GPFS is down on this node." /usr/lpp/mmfs/bin/mmglobfuncs
    109 ) msgTxt=\
"%s: GPFS is down on this node."

## and is generated by printErrorMsg in mmcesnetmvaddress when it detects that the current node is identified as “down” by getDownCesNodeList

[root@sgate2 ~]# grep --before-context=5 'printErrorMsg 109' /usr/lpp/mmfs/bin/mmcesnetmvaddress
downNodeList=$(getDownCesNodeList)
for downNode in $downNodeList
do
  if [[ $toNodeName == $downNode ]]
  then
    printErrorMsg 109 "$mmcmd"

## getDownCesNodeList is the set of ces nodes that are not among the GPFS cluster nodes listed in `tsctl shownodes up`

[root@sgate2 ~]# grep --after-context=16 '^function getDownCesNodeList' /usr/lpp/mmfs/bin/mmcesfuncs
function getDownCesNodeList
{
  typeset sourceFile="mmcesfuncs.sh"
  [[ -n $DEBUG || -n $DEBUGgetDownCesNodeList ]] &&set -x
  $mmTRACE_ENTER "$*"

  typeset upnodefile=${cmdTmpDir}upnodefile
  typeset downNodeList

  # get all CES nodes
  $sort -o $nodefile $mmfsCesNodes.dae

  $tsctl shownodes up | $tr ',' '\n' | $sort -o $upnodefile

  downNodeList=$($comm -23 $nodefile $upnodefile)
  print -- $downNodeList
}  #----- end of function getDownCesNodeList --------------------

## but not only are the sgate nodes not listed by `tsctl shownodes up`; its output is obviously and erroneously truncated

[root@sgate2 ~]# tsctl shownodes up | tr ',' '\n' | tail
shas0251-opa.rc.int.colorado.edu
shas0252-opa.rc.int.colorado.edu
shas0253-opa.rc.int.colorado.edu
shas0254-opa.rc.int.colorado.edu
shas0255-opa.rc.int.colorado.edu
shas0256-opa.rc.int.colorado.edu
shas0257-opa.rc.int.colorado.edu
shas0258-opa.rc.int.colorado.edu
shas0259-opa.rc.int.colorado.edu
shas0260-opa.rc.int.col[root@sgate2 ~]#

## I expect that this is a bug in GPFS, likely related to a maximum output buffer for `tsctl shownodes up`.

On 1/24/17, 12:48 PM, "Jonathon A Anderson" <[email protected]> wrote:

I think I'm having the same issue described here:

http://www.spectrumscale.org/pipermail/gpfsug-discuss/2016-October/002288.html

Any advice or further troubleshooting steps would be much appreciated. Full
disclosure: I also have a DDN case open. (78804)

We've got a four-node (snsd{1..4}) DDN gridscaler system. I'm trying to add
two CES protocol nodes (sgate{1,2}) to serve NFS.

Here's the steps I took:

---
mmcrnodeclass protocol -N sgate1-opa,sgate2-opa
mmcrnodeclass nfs -N sgate1-opa,sgate2-opa
mmchconfig cesSharedRoot=/gpfs/summit/ces
mmchcluster --ccr-enable
mmchnode --ces-enable -N protocol
mmces service enable NFS
mmces service start NFS -N nfs
mmces address add --ces-ip 10.225.71.104,10.225.71.105
mmces address policy even-coverage
mmces address move --rebalance
---

This worked the very first time I ran it, but the CES addresses weren't
re-distributed after restarting GPFS or a node reboot.

Things I've tried:

* disabling ces on the sgate nodes and re-running the above procedure
* moving the cluster and filesystem managers to different snsd nodes
* deleting and re-creating the cesSharedRoot directory

Meanwhile, the following log entry appears in mmfs.log.latest every ~30s:

---
Mon Jan 23 20:31:20 MST 2017: mmcesnetworkmonitor: Found unassigned address 10.225.71.104
Mon Jan 23 20:31:20 MST 2017: mmcesnetworkmonitor: Found unassigned address 10.225.71.105
Mon Jan 23 20:31:20 MST 2017: mmcesnetworkmonitor: handleNetworkProblem with lock held: assignIP 10.225.71.104_0-_+,10.225.71.105_0-_+ 1
Mon Jan 23 20:31:20 MST 2017: mmcesnetworkmonitor: Assigning addresses: 10.225.71.104_0-_+,10.225.71.105_0-_+
Mon Jan 23 20:31:20 MST 2017: mmcesnetworkmonitor: moveCesIPs: 10.225.71.104_0-_+,10.225.71.105_0-_+
---
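
For keeping an eye on this over time, a small filter can list the distinct addresses the monitor reports as unassigned. This is only a sketch that assumes the log line format in the excerpt above, with the address as the last field; `unassigned_addresses` is a hypothetical helper name, not a GPFS tool.

```shell
#!/bin/sh
# Hypothetical helper: read mmfs.log-style lines on stdin and print each
# address reported as unassigned exactly once.
# Assumes the address is the last whitespace-separated field on the line.
unassigned_addresses() {
    awk '/Found unassigned address/ { print $NF }' | sort -u
}

# Possible use (the log path is an assumption; adjust for the local install):
#   unassigned_addresses < /var/adm/ras/mmfs.log.latest
```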

Also notable, whenever I add or remove addresses now, I see this in
mmsysmonitor.log (among a lot of other entries):

---
2017-01-23T20:40:56.363 sgate1 D ET_cesnetwork Entity state without requireUnique: ces_network_ips_down WARNING No CES relevant NICs detected - Service.calculateAndUpdateState:275
2017-01-23T20:40:11.364 sgate1 D ET_cesnetwork Update multiple entities at once {'p2p2': 1, 'bond0': 1, 'p2p1': 1} - Service.setLocalState:333
---

For the record, here's the interface I expect to get the address on sgate1:

---
11: bond0: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 9000 qdisc noqueue state UP
    link/ether 3c:fd:fe:08:a7:c0 brd ff:ff:ff:ff:ff:ff
    inet 10.225.71.107/20 brd 10.225.79.255 scope global bond0
       valid_lft forever preferred_lft forever
    inet6 fe80::3efd:feff:fe08:a7c0/64 scope link
       valid_lft forever preferred_lft forever
---

which is a bond of p2p1 and p2p2.

---
6: p2p1: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 9000 qdisc mq master bond0 state UP qlen 1000
    link/ether 3c:fd:fe:08:a7:c0 brd ff:ff:ff:ff:ff:ff
7: p2p2: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 9000 qdisc mq master bond0 state UP qlen 1000
    link/ether 3c:fd:fe:08:a7:c0 brd ff:ff:ff:ff:ff:ff
---

A similar bond0 exists on sgate2.

I crawled around in /usr/lpp/mmfs/lib/mmsysmon/CESNetworkService.py for a
while trying to figure it out, but have been unsuccessful so
far.

_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss