Thanks, that’s interesting. I thought the commands must be doing a remote call 
like that somehow. There don’t appear to be any pending outgoing connections 
in either direction in mmdiag --network or netstat.

Admin and daemon names match.

Thanks for the ticket numbers – I will mention them in mine.

Many thanks,

Luke

--
Luke Sudbery
Principal Engineer (HPC and Storage).
Architecture, Infrastructure and Systems
Advanced Research Computing, IT Services
Room 132, Computer Centre G5, Elms Road

Please note I don’t work on Monday.

From: gpfsug-discuss <[email protected]> On Behalf Of Ryan 
Novosielski
Sent: 17 March 2023 16:36
To: gpfsug main discussion list <[email protected]>
Subject: Re: [gpfsug-discuss] mmvdisk version/communication issues?

We had a very similar problem running mmlsfs all_remote, and the short version 
is that in some cases communication goes in the opposite direction to the one 
you might expect (e.g. you think something will connect from A->B to get a 
response, but what really happens is that A contacts B and asks B to run a 
companion process that contacts A, which didn’t work).
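
If it helps to rule that in or out outside of GPFS, a crude netcat test of the 
reverse path is one option (the node names and port below are only 
placeholders; substitute a port from your tscCmdPortRange and your own nodes):

# nodeA/nodeB and port 60000 are placeholders, not a real setup
# On the node you run the admin command from, listen on a tscCmdPortRange port:
[root@nodeA ~]# nc -l 60000
# From the node hosting the recovery group, check it can connect back:
[root@nodeB ~]# nc -zv nodeA 60000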

There was also recently a bug where strange things would happen if you had 
different host names in the cluster for the admin and daemon name (I think we 
might have had dss[01-02]-ib0 and dss[01-02], respectively). I think this was 
supposed to be fixed in GPFS 5.1.6-0, which isn’t available yet for DSS-G.

I’m not sure that either of these things is actually what’s getting you, but 
ours was also roughly a 5-minute timeout, so it may be a hint.

Our tickets for these, respectively, are TS008145078 and TS010747847 (if 
someone notices that the ticket is about an upgrade and not this problem: we 
noticed it while upgrading from 2.4b to 2.10b, because the output of the 
commands didn’t match the documentation).

HTH,
--
#BlackLivesMatter
 ____
|| \\UTGERS,     |---------------------------*O*---------------------------
||_// the State  |         Ryan Novosielski - [email protected]
|| \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
||  \\    of NJ  | Office of Advanced Research Computing - MSB A555B, Newark
     `'


On Mar 17, 2023, at 12:03, Luke Sudbery <[email protected]> wrote:

On further investigation the command does eventually complete, after 11 minutes 
rather than a couple of seconds.

[root@rds-pg-dssg01 ~]# time mmvdisk pdisk list --rg rds_er_dssg02 --not-ok
mmvdisk: All pdisks of recovery group 'rds_er_dssg02' are ok.

real    11m14.106s
user    0m1.430s
sys     0m0.555s
[root@rds-pg-dssg01 ~]#

Looking at the process tree, the bits that hang are:
[root@rds-pg-dssg01 ~]# time tslsrecgroup rds_er_dssg02 -Y --v2 --failure-domain
Failed to connect to file system daemon: Connection timed out

real    5m30.181s
user    0m0.001s
sys     0m0.003s

and then
[root@rds-pg-dssg01 ~]# time tslspdisk --recovery-group rds_er_dssg02 --notOK
Failed to connect to file system daemon: Connection timed out

real    5m30.247s
user    0m0.003s
sys     0m0.002s
[root@rds-pg-dssg01 ~]#

Which adds up to the 11 minutes.... then it does something else and just works. 
Or maybe it doesn't work and just wouldn't report any failed disks if there 
were any….

While hanging, the ts commands appear to be LISTENing, not attempting to make 
connections:

[root@rds-pg-dssg01 ~]# pidof tslspdisk
2156809
[root@rds-pg-dssg01 ~]# netstat -apt | grep 2156809
tcp        0      0 0.0.0.0:60000           0.0.0.0:*               LISTEN      
2156809/tslspdisk
[root@rds-pg-dssg01 ~]#

Port 60000 is the lowest of our tscCmdPortRange.
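
Assuming the daemon on the RG server is meant to connect back to that listening 
port (I haven’t confirmed that’s exactly what it does), a rough check while the 
ts command is hanging might be:

# From the remote recovery group server, see if it can reach the listener:
[root@rds-er-dssg02 ~]# nc -zv rds-pg-dssg01 60000
# And confirm both sides agree on the configured command port range:
[root@rds-pg-dssg01 ~]# mmlsconfig tscCmdPortRange
[root@rds-er-dssg02 ~]# mmlsconfig tscCmdPortRange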

Don’t know if that helps anyone….

Cheers,

Luke
--
Luke Sudbery
Principal Engineer (HPC and Storage).
Architecture, Infrastructure and Systems
Advanced Research Computing, IT Services
Room 132, Computer Centre G5, Elms Road

Please note I don’t work on Monday.

From: gpfsug-discuss <[email protected]> On Behalf Of Luke Sudbery
Sent: 17 March 2023 15:11
To: [email protected]
Subject: [gpfsug-discuss] mmvdisk version/communication issues?

Hello,

We have 3 Lenovo DSS-G “Building Blocks”, as they call them – 2x GNR server pairs.

We’ve just upgraded the 1st of them from 3.2a (GPFS 5.1.1.0) to 4.3a (5.1.5.1 
efix 20).

Now the older systems can’t communicate with the newer ones in certain 
circumstances, specifically when querying recovery groups hosted on other 
servers.

It works old->old, new->old and new->new but not old->new.

I’m fairly sure it is not a TCP comms problem. I can ssh between the nodes as 
root and as the GPFS sudoUser. Port 1191 and the tscCmdPortRange are open and 
accessible in both directions between the nodes. There are connections present 
between the nodes in netstat and in mmfsd.latest.log. No pending messages (to 
that node) in mmdiag --network.
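
For example, the sort of manual checks I mean (node names as in the examples 
that follow; the tscCmdPortRange ports typically only have a listener while a 
command is in flight, so those need a listen-and-connect test from both ends 
rather than a simple port probe):

[root@rds-pg-dssg01 ~]# nc -zv rds-er-dssg02 1191             # mmfsd daemon port
[root@rds-pg-dssg01 ~]# mmdiag --network | grep rds-er-dssg02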

In these examples rds-er-dssg01/2 are upgraded, rds-pg-dssg01/2 are downlevel:

[root@rds-er-dssg01 ~]# mmvdisk pdisk list --rg rds_er_dssg02 --not-ok  # New to new
mmvdisk: All pdisks of recovery group 'rds_er_dssg02' are ok.
[root@rds-er-dssg01 ~]# mmvdisk pdisk list --rg rds_pg_dssg02 --not-ok  # New to old
mmvdisk: All pdisks of recovery group 'rds_pg_dssg02' are ok.
[root@rds-er-dssg01 ~]#

[root@rds-pg-dssg01 ~]# mmvdisk pdisk list --rg rds_pg_dssg02 --not-ok # Old to old
mmvdisk: All pdisks of recovery group 'rds_pg_dssg02' are ok.
[root@rds-pg-dssg01 ~]# mmvdisk pdisk list --rg rds_er_dssg02 --not-ok # Old to new [HANGS]
^Cmmvdisk: Command failed. Examine previous error messages to determine cause.
[root@rds-pg-dssg01 ~]#

Has anyone come across this? mmvdisk should work across slightly different 
versions of 5.1, right? No recovery group, cluster or filesystem versions have 
been changed yet.

We will also log a ticket with snaps and more info but wondered if anyone had 
seen this.

And while this particular command is not a major issue, we don’t know what else 
it may affect before we proceed with the rest of the cluster.

Many thanks,

Luke

--
Luke Sudbery
Principal Engineer (HPC and Storage).
Architecture, Infrastructure and Systems
Advanced Research Computing, IT Services
Room 132, Computer Centre G5, Elms Road

Please note I don’t work on Monday.

_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at gpfsug.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss_gpfsug.org
