Hi Damir,

It was related to specific user jobs and mmap, if I recall correctly. We 
opened a PMR with IBM and received a patch from them; since then we have not 
seen the issue.

Regards,

Sveta.

> On Nov 2, 2018, at 11:55 AM, Damir Krstic <[email protected]> wrote:
> 
> Hi,
> 
> Did you ever figure out the root cause of the issue? We recently (end of 
> June) upgraded our storage to gpfs.base-5.0.0-1.1.3.ppc64.
> 
> In the last few weeks we have seen an increasing number of ps hangs across 
> compute and login nodes on our cluster. The filesystem version (of all 
> filesystems on our cluster) is:
>  -V                 15.01 (4.2.0.0)          File system version
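(That version string is what mmlsfs reports; it can be checked per filesystem 
with something along these lines, where "gpfs0" is just a placeholder device 
name:

    # Show the on-disk format version of one filesystem
    mmlsfs gpfs0 -V
    # Or report it for every filesystem defined in the cluster
    mmlsfs all -V
)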
> 
> I am just wondering if anyone has seen this type of issue since you first 
> reported it and if there is a known fix for it.
> 
> Damir
> 
> On Tue, May 22, 2018 at 10:43 AM <[email protected]> wrote:
> Hello All,
> 
> We upgraded from GPFS 4.2.3.2 to GPFS 5.0.0-2 about a month ago. We have not 
> yet converted the filesystems from version 4.2.2.2 to 5 (that is, we have 
> not run the mmchconfig release=LATEST command).
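(For anyone following along: the cluster release level and the on-disk 
filesystem format are committed by separate commands, roughly as below. Both 
steps are one-way, so check the docs for your exact level before running them; 
"gpfs0" is just a placeholder device name:

    # Commit the cluster configuration to the installed code level (one-way)
    mmchconfig release=LATEST
    # Separately, upgrade a filesystem's on-disk format (also one-way)
    mmchfs gpfs0 -V full
)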
> Right after the upgrade, we started seeing many "ps hangs" across the 
> cluster. All of the hangs happen when jobs run a Java process or many Java 
> threads (for example, GATK). The hangs are fairly random and have no 
> particular pattern, except that we know they are related to Java or to jobs 
> reading from directories with about 600,000 files.
> 
> I raised an IBM critical service request about a month ago related to this - 
> PMR: 24090,L6Q,000.
> However, according to the ticket, they seem to feel that it might not be 
> related to GPFS, although we are sure that these hangs started to appear 
> only after we upgraded from GPFS 4.2.3.2 to 5.0.0-2.
> 
> One of the other reasons we are not able to prove that it is GPFS is that we 
> are unable to capture any logs/traces from GPFS once the hang happens. Even 
> the GPFS trace commands hang once "ps" hangs, so it is getting difficult to 
> get any dumps from GPFS.
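(The usual data-collection sequence would be roughly the one below, though as 
noted it may itself hang once "ps" hangs; exact options can differ between 
levels, so treat it as a sketch:

    # Start low-level tracing on the affected node
    mmtracectl --start
    # ... reproduce the "ps" hang ...
    # Stop tracing and cut the trace files
    mmtracectl --stop
    # Collect a support snapshot to attach to the PMR
    gpfs.snap
)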
> 
> Also, according to the IBM ticket, they seem to have seen a "ps hang" issue 
> before, and their suggestion is that running the mmchconfig release=LATEST 
> command will resolve it.
> However, we are not comfortable making the permanent change to filesystem 
> version 5, and since we don't see any near-term solution to these hangs, we 
> are thinking of downgrading to GPFS 4.2.3.2, the previous state in which we 
> know the cluster was stable.
> 
> Can downgrading GPFS take us back to exactly the previous GPFS config state? 
> With respect to downgrading from 5 to 4.2.3.2, is it just a matter of 
> reinstalling all RPMs at the previous version, or is there anything else I 
> need to take care of with respect to the GPFS configuration?
> I think GPFS 5.0 might have updated internal default GPFS configuration 
> parameters, and I am not sure whether downgrading will change them back to 
> what they were in GPFS 4.2.3.2.
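(Whatever route is taken, it is worth snapshotting the current configuration 
before touching the RPMs so there is something to compare against afterwards. 
Roughly, with the output paths just examples:

    # Save the current cluster configuration and filesystem attributes
    mmlsconfig  > /root/mmlsconfig.before-downgrade
    mmlsfs all  > /root/mmlsfs-all.before-downgrade
    mmlscluster > /root/mmlscluster.before-downgrade
)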
> 
> Our previous state:
> 
> 2 Storage clusters - 4.2.3.2
> 1 Compute cluster - 4.2.3.2 (remote mounts the above 2 storage clusters)
> 
> Our current state:
> 
> 2 Storage clusters - 5.0.0.2 (filesystem version 4.2.2.2)
> 1 Compute cluster - 5.0.0.2
> 
> Do I need to downgrade all the clusters to go back to the previous state, or 
> is it OK to downgrade just the compute cluster to the previous version?
> 
> Any advice on the best steps forward would greatly help.
> 
> Thanks,
> 
> Lohit

_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss
