Luke,

Yes, that's it. I didn't know how to find the APAR number from a case or defect number before.
Regards,
Gang Qiu

====================================================
Gang Qiu (邱钢)
Spectrum Scale Development - ECE
IBM China Systems Lab
Mobile: +86-18612867902
====================================================

From: Luke Sudbery <[email protected]>
Date: Tuesday, 21 March 2023 20:37
To: Gang Qiu <[email protected]>, gpfsug main discussion list <[email protected]>
Subject: [EXTERNAL] RE: mmvdisk version/communication issues?

Thank you, I'll pass that on to Lenovo. I think I've also identified it in the fixes in 5.1.6.1 too: https://www.ibm.com/support/pages/apar/IJ44607

IJ44607: SPECTRUM SCALE V5.1.3 AND GREATER FAILS TO INTEROPERATE WITH SPECTRUM SCALE VERSIONS PRIOR TO 5.1.3.

Problem summary:
· GNR RPCs fail when received by a GPFS daemon 5.1.3 or later from a GPFS daemon older than version 5.1.3.
· Kernel assert going off: privVfsP != NULL

Symptom:
· Hang in the command

As such, AFAICT it will affect all Lenovo customers going from DSSG 3.x (GPFS 5.1.1.0) to 4.x (GPFS 5.1.5.1 efix20), and I'm a bit annoyed that this wasn't picked up before release.

Many thanks,

Luke

--
Luke Sudbery
Principal Engineer (HPC and Storage)
Architecture, Infrastructure and Systems
Advanced Research Computing, IT Services
Room 132, Computer Centre G5, Elms Road

Please note I don't work on Monday.

From: Gang Qiu <[email protected]>
Sent: 21 March 2023 12:31
To: Luke Sudbery (Advanced Research Computing) <[email protected]>; gpfsug main discussion list <[email protected]>
Subject: RE: mmvdisk version/communication issues?

Luke,

1. The case number is TS011014198 (the internal defect number is 1163600). 5.1.3.1 efix47 includes this fix. (I didn't find an efix for 5.1.4 or 5.1.5.)
2. The impact is that any GNR-related command from a 5.1.2 node to a 5.1.5 node will hang.

Regards,
Gang Qiu

====================================================
Gang Qiu (邱钢)
Spectrum Scale Development - ECE
IBM China Systems Lab
Mobile: +86-18612867902
====================================================

From: Luke Sudbery <[email protected]>
Date: Tuesday, 21 March 2023 17:31
To: gpfsug main discussion list <[email protected]>, Gang Qiu <[email protected]>
Subject: [EXTERNAL] RE: mmvdisk version/communication issues?

Thank you, that's far more useful than Lenovo support have been so far! Unfortunately Lenovo only support particular versions of Scale on their DSSG, and they are our only source of the required GNR packages. So a couple more questions, if you don't mind!

· Can you provide a link to the APAR or similar that I can share with Lenovo?
· Do you know of any workaround or other impact of this issue? (I think I've seen mmhealth show false errors because of it.) This may help us decide whether to just press ahead with upgrading to 5.1.5 everywhere, if the issue is not present going 5.1.5 -> 5.1.5.

Many thanks,

Luke

--
Luke Sudbery
Principal Engineer (HPC and Storage)
Architecture, Infrastructure and Systems
Advanced Research Computing, IT Services
Room 132, Computer Centre G5, Elms Road

Please note I don't work on Monday.
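A quick way to see which nodes sit on either side of the 5.1.3 RPC boundary described in IJ44607 is to list the daemon build on every node before deciding on an upgrade path. A minimal sketch, assuming passwordless root ssh, the standard /usr/lpp/mmfs/bin path, and that the output of "mmdiag --version" contains a line with the word "build"; the node names are the ones used in this thread:

    #!/bin/bash
    # List the GPFS daemon build on each node to spot pre-5.1.3 nodes.
    # Node names are the ones from this thread; substitute your own.
    nodes="rds-pg-dssg01 rds-pg-dssg02 rds-er-dssg01 rds-er-dssg02"
    for node in $nodes; do
        printf '%-16s ' "$node"
        ssh "$node" '/usr/lpp/mmfs/bin/mmdiag --version 2>/dev/null | grep -i build' \
            || echo 'unreachable'
    done

Any node reporting a build below 5.1.3 will trip the interop failure whenever a 5.1.3-or-later daemon receives its GNR RPCs.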
From: gpfsug-discuss <[email protected]> On Behalf Of Gang Qiu
Sent: 21 March 2023 03:49
To: gpfsug main discussion list <[email protected]>
Subject: [gpfsug-discuss] RE: mmvdisk version/communication issues?

It is not an mmvdisk issue. mmvdisk hangs because the 'tslsrecgroup' command it executes hangs. I recreated it in my environment (5.1.3, 5.1.4 or 5.1.5 coexisting with 5.1.2 or an earlier version).

After further investigation, it appears to be a known issue which is fixed in 5.1.6 PTF1. To resolve it, we need to upgrade to at least 5.1.6.1.

Regards,
Gang Qiu

====================================================
Gang Qiu (邱钢)
Spectrum Scale Development - ECE
IBM China Systems Lab
Mobile: +86-18612867902
====================================================

From: gpfsug-discuss <[email protected]> On Behalf Of Luke Sudbery
Date: Saturday, 18 March 2023 00:06
To: gpfsug main discussion list <[email protected]>
Subject: [EXTERNAL] Re: [gpfsug-discuss] mmvdisk version/communication issues?

On further investigation the command does eventually complete, after 11 minutes rather than a couple of seconds:

    [root@rds-pg-dssg01 ~]# time mmvdisk pdisk list --rg rds_er_dssg02 --not-ok
    mmvdisk: All pdisks of recovery group 'rds_er_dssg02' are ok.

    real    11m14.106s
    user    0m1.430s
    sys     0m0.555s
    [root@rds-pg-dssg01 ~]#

Looking at the process tree, the bits that hang are:

    [root@rds-pg-dssg01 ~]# time tslsrecgroup rds_er_dssg02 -Y --v2 --failure-domain
    Failed to connect to file system daemon: Connection timed out

    real    5m30.181s
    user    0m0.001s
    sys     0m0.003s

and then:

    [root@rds-pg-dssg01 ~]# time tslspdisk --recovery-group rds_er_dssg02 --notOK
    Failed to connect to file system daemon: Connection timed out

    real    5m30.247s
    user    0m0.003s
    sys     0m0.002s
    [root@rds-pg-dssg01 ~]#

Which adds up to the 11 minutes... then it does something else and just works. Or maybe it doesn't work, and just wouldn't report any failed disks if there were any.

While hanging, the ts commands appear to be LISTENing, not attempting to make connections:

    [root@rds-pg-dssg01 ~]# pidof tslspdisk
    2156809
    [root@rds-pg-dssg01 ~]# netstat -apt | grep 2156809
    tcp    0    0 0.0.0.0:60000    0.0.0.0:*    LISTEN    2156809/tslspdisk
    [root@rds-pg-dssg01 ~]#

Port 60000 is the lowest of our tscCmdPortRange. Don't know if that helps anyone...

Cheers,

Luke

--
Luke Sudbery
Principal Engineer (HPC and Storage)
Architecture, Infrastructure and Systems
Advanced Research Computing, IT Services
Room 132, Computer Centre G5, Elms Road

Please note I don't work on Monday.
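For anyone wanting to rule the network in or out while a ts* command is stuck LISTENing like this, the tscCmdPortRange ports can be probed from the remote node. A minimal sketch: the range and node name below are placeholders, and the real range should be read with "mmlsconfig tscCmdPortRange":

    #!/bin/bash
    # Probe the command port range on the node where the ts* command is
    # listening. low/high are placeholders; read the configured range with
    # "mmlsconfig tscCmdPortRange" and substitute your own node name.
    remote=rds-pg-dssg01
    low=60000
    high=60010
    for port in $(seq "$low" "$high"); do
        nc -z -w 2 "$remote" "$port" && echo "port $port on $remote: open"
    done

If the ports probe open but the command still times out, the problem sits in the daemon RPC layer (as it turned out to here) rather than in TCP connectivity.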
From: gpfsug-discuss <[email protected]> On Behalf Of Luke Sudbery
Sent: 17 March 2023 15:11
To: [email protected]
Subject: [gpfsug-discuss] mmvdisk version/communication issues?

Hello,

We have 3 Lenovo DSSG "Building Blocks", as they call them – 2x GNR server pairs. We've just upgraded the 1st of them from 3.2a (GPFS 5.1.1.0) to 4.3a (5.1.5.1 efix 20). Now the older systems can't communicate with the newer ones in certain circumstances, specifically when querying recovery groups hosted on other servers. It works old->old, new->old and new->new, but not old->new.

I'm fairly sure it is not a TCP comms problem. I can ssh between the nodes as root and as the GPFS sudoUser. Port 1191 and the tscCmdPortRange are open and accessible in both directions between the nodes. There are connections present between the nodes in netstat and in mmfsd.latest.log. No pending messages (to that node) in mmdiag --network.

In these examples rds-er-dssg01/2 are upgraded, rds-pg-dssg01/2 are downlevel:

    [root@rds-er-dssg01 ~]# mmvdisk pdisk list --rg rds_er_dssg02 --not-ok    # New to new
    mmvdisk: All pdisks of recovery group 'rds_er_dssg02' are ok.
    [root@rds-er-dssg01 ~]# mmvdisk pdisk list --rg rds_pg_dssg02 --not-ok    # New to old
    mmvdisk: All pdisks of recovery group 'rds_pg_dssg02' are ok.
    [root@rds-er-dssg01 ~]#

    [root@rds-pg-dssg01 ~]# mmvdisk pdisk list --rg rds_pg_dssg02 --not-ok    # Old to old
    mmvdisk: All pdisks of recovery group 'rds_pg_dssg02' are ok.
    [root@rds-pg-dssg01 ~]# mmvdisk pdisk list --rg rds_er_dssg02 --not-ok    # Old to new [HANGS]
    ^Cmmvdisk: Command failed. Examine previous error messages to determine cause.
    [root@rds-pg-dssg01 ~]#

Has anyone come across this? mmvdisk should work across slightly different versions of 5.1, right? No recovery group, cluster or filesystem versions have been changed yet.

We will also log a ticket with snaps and more info, but wondered if anyone had seen this. And while this particular command is not a major issue, we don't know what else it may affect before we proceed with the rest of the cluster.

Many thanks,

Luke

--
Luke Sudbery
Principal Engineer (HPC and Storage)
Architecture, Infrastructure and Systems
Advanced Research Computing, IT Services
Room 132, Computer Centre G5, Elms Road

Please note I don't work on Monday.
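One way to make the four-way test above repeatable without leaving shells hung is to wrap each query in a timeout. A minimal sketch, using the recovery-group names from this thread; a healthy pair answers in seconds, so anything hitting the timeout marks a suspect direction:

    #!/bin/bash
    # Query each recovery group with a hard timeout so the old->new case
    # fails fast instead of hanging. RG names are the ones from this thread;
    # substitute your own.
    for rg in rds_pg_dssg02 rds_er_dssg02; do
        echo "== $rg =="
        timeout 30 mmvdisk pdisk list --rg "$rg" --not-ok \
            || echo "query of $rg timed out or failed"
    done

Run it once from an upgraded node and once from a downlevel node to cover all four directions.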
_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at gpfsug.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss
