Hi Damir,

It was related to specific user jobs and mmap, if I remember correctly. We opened a PMR with IBM and received a patch from them; since then we have not seen the issue.
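In case it helps with triage on your side: ps usually blocks reading /proc/<pid>/cmdline, which takes the target process's mmap semaphore, so one process wedged in an mmap page fault can hang every ps on the node. Below is a minimal sketch of a /proc walk that avoids cmdline (generic Linux, nothing specific to the patch we got; reading kernel stacks needs root):

  # Find processes in uninterruptible sleep (D state) by reading
  # /proc/<pid>/status, which does not take the mmap semaphore,
  # then dump each one's kernel stack to see where it is blocked.
  for p in /proc/[0-9]*; do
      state=$(awk '$1 == "State:" {print $2}' "$p/status" 2>/dev/null)
      if [ "$state" = "D" ]; then
          echo "== ${p#/proc/} is stuck in D state =="
          cat "$p/stack" 2>/dev/null
      fi
  done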
Regards,
Sveta.

> On Nov 2, 2018, at 11:55 AM, Damir Krstic <[email protected]> wrote:
>
> Hi,
>
> Did you ever figure out the root cause of the issue? We recently (end of
> June) upgraded our storage to: gpfs.base-5.0.0-1.1.3.ppc64
>
> In the last few weeks we have seen an increasing number of ps hangs
> across compute and login nodes on our cluster. The filesystem version (of
> all filesystems on our cluster) is:
> -V 15.01 (4.2.0.0) File system version
>
> I am just wondering if anyone has seen this type of issue since you first
> reported it, and if there is a known fix for it.
>
> Damir
>
> On Tue, May 22, 2018 at 10:43 AM <[email protected]> wrote:
> Hello All,
>
> We upgraded from GPFS 4.2.3.2 to GPFS 5.0.0-2 about a month ago. We have
> not yet converted the 4.2.2.2 filesystem version to 5 (that is, we have
> not run the mmchconfig release=LATEST command).
> Right after the upgrade, we started seeing many "ps hangs" across the
> cluster. All of the "ps hangs" happen when jobs run that involve a Java
> process or many Java threads (for example, GATK).
> The hangs are fairly random and follow no particular pattern, except that
> they involve Java or jobs reading from directories with about 600,000
> files.
>
> About a month ago I raised an IBM critical service request about this -
> PMR: 24090,L6Q,000. According to the ticket, IBM seemed to feel that it
> might not be related to GPFS, although we are sure these hangs started to
> appear only after we upgraded from GPFS 4.2.3.2 to 5.0.0.2.
>
> One reason we are unable to prove that it is GPFS is that we cannot
> capture any logs or traces from GPFS once the hang happens. Even the GPFS
> trace commands hang once "ps hangs", which makes it difficult to get any
> dumps from GPFS.
>
> Also, according to the IBM ticket, they seem to have seen a "ps hang"
> issue before, and their advice is that running the mmchconfig
> release=LATEST command will resolve it. However, we are not comfortable
> making the permanent change to filesystem version 5, and since we see no
> near-term solution to these hangs, we are thinking of downgrading to GPFS
> 4.2.3.2, the previous state in which we know the cluster was stable.
>
> Can downgrading GPFS take us back to exactly the previous GPFS config
> state? With respect to downgrading from 5 to 4.2.3.2: is it just a matter
> of reinstalling all rpms at the previous version, or is there anything
> else I need to check with respect to the GPFS configuration? I think GPFS
> 5.0 may have updated internal default GPFS configuration parameters, and
> I am not sure whether downgrading GPFS will change them back to what they
> were in GPFS 4.2.3.2.
>
> Our previous state:
>
> 2 storage clusters - 4.2.3.2
> 1 compute cluster - 4.2.3.2 (remote mounts the above 2 storage clusters)
>
> Our current state:
>
> 2 storage clusters - 5.0.0.2 (filesystem version - 4.2.2.2)
> 1 compute cluster - 5.0.0.2
>
> Do I need to downgrade all the clusters to get back to the previous
> state, or is it OK to downgrade just the compute cluster?
>
> Any advice on the best steps forward would greatly help.
>
> Thanks,
> Lohit
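For anyone weighing the same downgrade question, a minimal sketch of the checks involved, assuming a filesystem device named gpfs0 (a placeholder); note that the daemon RPM level, the cluster release level, and the on-disk filesystem format are three separate things:

  # Running daemon build on this node.
  mmdiag --version

  # Cluster release level; this is what mmchconfig release=LATEST
  # raises, and it cannot be lowered again afterwards.
  mmlsconfig minReleaseLevel

  # On-disk filesystem format version (upgraded separately, by
  # mmchfs -V, and also not reversible).
  mmlsfs gpfs0 -V

  # Before downgrading RPMs, snapshot the effective configuration so
  # the before/after states can be diffed. mmfsadm is an unsupported
  # debug command, but its dump also shows built-in defaults.
  mmlsconfig > /root/mmlsconfig.5.0.0.2.txt
  mmfsadm dump config > /root/mmfsadm-config.5.0.0.2.txt

  # During a hang, when full tracing wedges, the lighter waiter dump
  # sometimes still responds and shows what mmfsd is waiting on.
  mmdiag --waiters

As long as neither minReleaseLevel nor the filesystem format has been raised to 5.x, reinstalling the 4.2.3.x RPMs is generally possible: settings changed explicitly with mmchconfig live in the cluster configuration file (mmsdrfs) and survive an RPM swap, while compiled-in defaults come from whichever daemon build is running.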
_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss
