Thanks, that’s interesting. I thought the commands must be doing a remote call like that somehow. There don’t appear to be any pending outgoing connections in either direction in mmdiag --network or netstat.
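For completeness, this is roughly what I’m checking on each node (1191 is the mmfsd daemon port; the grep pattern and the ss state filter are just illustrative ways of spotting anything half-open):

# GPFS's own view of node-to-node connections and any pending messages
mmdiag --network

# Established daemon connections on port 1191
netstat -tnp | grep ':1191'

# Any outgoing connection attempts stuck waiting for a response
ss -tan state syn-sent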
Admin and daemon names match. Thanks for the ticket numbers – I will mention them in mine.

Many thanks,

Luke

--
Luke Sudbery
Principal Engineer (HPC and Storage)
Architecture, Infrastructure and Systems
Advanced Research Computing, IT Services
Room 132, Computer Centre G5, Elms Road

Please note I don’t work on Monday.

From: gpfsug-discuss <[email protected]> On Behalf Of Ryan Novosielski
Sent: 17 March 2023 16:36
To: gpfsug main discussion list <[email protected]>
Subject: Re: [gpfsug-discuss] mmvdisk version/communication issues?

We had a very similar problem running mmlsfs all_remote, and the short version is that in some cases communication goes in the opposite direction to the one you might expect (e.g. you think something will connect from A->B to get a response, but what really happens is A contacts B and asks B to run a companion process to contact A, which didn’t work). There was also recently a bug where strange things would happen if you had different host names in the cluster for the admin and daemon name (I think we might have had dss[01-02]-ib0 and dss[01-02] respectively). I think this was supposed to be fixed in GPFS 5.1.6-0, which isn’t available yet for DSS-G.

I’m not sure that either of these things is actually what’s getting you, but it was also roughly a 5-minute timeout, so it may be a hint. Our tickets for these, respectively, are TS008145078 and TS010747847 (if someone notices the ticket is about an upgrade and not this problem, we noticed it while upgrading from 2.4b to 2.10b, since the output of commands didn’t match the documentation).

HTH,

--
#BlackLivesMatter
  ____
 || \\UTGERS,     |---------------------------*O*---------------------------
 ||_// the State  |         Ryan Novosielski - [email protected]
 || \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
 ||  \\    of NJ  | Office of Advanced Research Computing - MSB A555B, Newark
      `'

On Mar 17, 2023, at 12:03, Luke Sudbery <[email protected]> wrote:

On further investigation the command does eventually complete, after 11 minutes rather than a couple of seconds.

[root@rds-pg-dssg01 ~]# time mmvdisk pdisk list --rg rds_er_dssg02 --not-ok
mmvdisk: All pdisks of recovery group 'rds_er_dssg02' are ok.

real    11m14.106s
user    0m1.430s
sys     0m0.555s
[root@rds-pg-dssg01 ~]#

Looking at the process tree, the bits that hang are:

[root@rds-pg-dssg01 ~]# time tslsrecgroup rds_er_dssg02 -Y --v2 --failure-domain
Failed to connect to file system daemon: Connection timed out

real    5m30.181s
user    0m0.001s
sys     0m0.003s

and then:

[root@rds-pg-dssg01 ~]# time tslspdisk --recovery-group rds_er_dssg02 --notOK
Failed to connect to file system daemon: Connection timed out

real    5m30.247s
user    0m0.003s
sys     0m0.002s
[root@rds-pg-dssg01 ~]#

Which adds up to the 11 minutes... then it does something else and just works. Or maybe it doesn’t work and just wouldn’t report any failed disks if there were any.

While hanging, the ts commands appear to be LISTENing, not attempting to make connections:

[root@rds-pg-dssg01 ~]# pidof tslspdisk
2156809
[root@rds-pg-dssg01 ~]# netstat -apt | grep 2156809
tcp        0      0 0.0.0.0:60000    0.0.0.0:*    LISTEN    2156809/tslspdisk
[root@rds-pg-dssg01 ~]#

Port 60000 is the lowest of our tscCmdPortRange. Don’t know if that helps anyone…
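For what it’s worth, this is roughly how we’ve been confirming the configured range and testing whether the other side can reach that listening port back on this node (the hostname below is just an example; run the nc test from whichever node is serving the recovery group):

# Show the ephemeral command port range configured on this node
mmlsconfig tscCmdPortRange

# While tslspdisk sits LISTENing on 60000 here, check from the remote
# node (e.g. rds-er-dssg02) that it can open a connection back to us:
nc -zv rds-pg-dssg01 60000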
Cheers,

Luke

--
Luke Sudbery
Principal Engineer (HPC and Storage)
Architecture, Infrastructure and Systems
Advanced Research Computing, IT Services
Room 132, Computer Centre G5, Elms Road

Please note I don’t work on Monday.

From: gpfsug-discuss <[email protected]> On Behalf Of Luke Sudbery
Sent: 17 March 2023 15:11
To: [email protected]
Subject: [gpfsug-discuss] mmvdisk version/communication issues?

Hello,

We have 3 Lenovo DSSG “Building Blocks”, as they call them – 2x GNR server pairs. We’ve just upgraded the 1st of them from 3.2a (GPFS 5.1.1.0) to 4.3a (5.1.5.1 efix 20). Now the older systems can’t communicate with the newer in certain circumstances, specifically when querying recovery groups hosted on other servers. It works old->old, new->old and new->new, but not old->new.

I’m fairly sure it is not a TCP comms problem. I can ssh between the nodes as root and as the GPFS sudoUser. Port 1191 and the tscCmdPortRange are open and accessible in both directions between the nodes. There are connections present between the nodes in netstat and in mmfsd.latest.log. No pending messages (to that node) in mmdiag --network.

In these examples rds-er-dssg01/2 are upgraded, rds-pg-dssg01/2 are downlevel (a quick loop for reproducing this matrix without waiting out the full hang is sketched at the end of this message):

[root@rds-er-dssg01 ~]# mmvdisk pdisk list --rg rds_er_dssg02 --not-ok   # New to new
mmvdisk: All pdisks of recovery group 'rds_er_dssg02' are ok.
[root@rds-er-dssg01 ~]# mmvdisk pdisk list --rg rds_pg_dssg02 --not-ok   # New to old
mmvdisk: All pdisks of recovery group 'rds_pg_dssg02' are ok.
[root@rds-er-dssg01 ~]#

[root@rds-pg-dssg01 ~]# mmvdisk pdisk list --rg rds_pg_dssg02 --not-ok   # Old to old
mmvdisk: All pdisks of recovery group 'rds_pg_dssg02' are ok.
[root@rds-pg-dssg01 ~]# mmvdisk pdisk list --rg rds_er_dssg02 --not-ok   # Old to new [HANGS]
^Cmmvdisk: Command failed. Examine previous error messages to determine cause.
[root@rds-pg-dssg01 ~]#

Has anyone come across this? mmvdisk should work across slightly different versions of 5.1, right? No recovery group, cluster or filesystem versions have been changed yet.

We will also log a ticket with snaps and more info, but wondered if anyone had seen this. And while this particular command is not a major issue, we don’t know what else it may affect before we proceed with the rest of the cluster.

Many thanks,

Luke

--
Luke Sudbery
Principal Engineer (HPC and Storage)
Architecture, Infrastructure and Systems
Advanced Research Computing, IT Services
Room 132, Computer Centre G5, Elms Road

Please note I don’t work on Monday.
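As mentioned above, a quick loop for reproducing the old->new / old->old matrix from a downlevel node without waiting out the full hang (the recovery group names are the ones from the examples above; the 30-second timeout is an arbitrary choice):

for rg in rds_pg_dssg02 rds_er_dssg02; do
    echo "== ${rg} =="
    timeout 30 mmvdisk pdisk list --rg "${rg}" --not-ok \
        || echo "mmvdisk timed out or failed for ${rg}"
done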
_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at gpfsug.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss_gpfsug.org
