Yes,

We have upgraded to 5.0.1-0.5, which has the patch for the issue.
The related IBM case number was TS001010674.
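
(In case it helps anyone comparing levels: a minimal way to confirm the installed build, assuming an RPM-based install; the grep pattern is only an example.)

    mmdiag --version        # build level reported by the running GPFS daemon
    rpm -qa | grep gpfs     # installed GPFS packages and their versions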

Regards,
Lohit

On Nov 2, 2018, 12:27 PM -0400, Mazurkova, Svetlana/Information Systems 
<[email protected]>, wrote:
> Hi Damir,
>
> It was related to specific user jobs and mmap (?). We opened a PMR with IBM and 
> received a patch from them; since then we have not seen the issue.
>
> Regards,
>
> Sveta.
>
> > On Nov 2, 2018, at 11:55 AM, Damir Krstic <[email protected]> wrote:
> >
> > Hi,
> >
> > Did you ever figure out the root cause of the issue? We recently (at the end 
> > of June) upgraded our storage to gpfs.base-5.0.0-1.1.3.ppc64.
> >
> > In the last few weeks we have seen an increasing number of ps hangs across 
> > compute and login nodes on our cluster. The filesystem version (of all 
> > filesystems on our cluster) is:
> >  -V                 15.01 (4.2.0.0)          File system version
> >
> > I am just wondering if anyone has seen this type of issue since you first 
> > reported it and if there is a known fix for it.
> >
> > Damir
> >
> > > On Tue, May 22, 2018 at 10:43 AM <[email protected]> wrote:
> > > > Hello All,
> > > >
> > > > We upgraded from GPFS 4.2.3.2 to GPFS 5.0.0-2 about a month ago. We have 
> > > > not yet converted the filesystem version from 4.2.2.2 to 5 (that is, we 
> > > > have not run the mmchconfig release=LATEST command).
> > > > Right after the upgrade, we started seeing many “ps hangs” across the 
> > > > cluster. All of the “ps hangs” happen when jobs involving a Java process 
> > > > or many Java threads are running (for example, GATK).
> > > > The hangs are fairly random and have no particular pattern, except that 
> > > > they seem to be related either to Java or to jobs reading from 
> > > > directories with about 600,000 files.
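> > > >
> > > > (A read-only way to confirm where the cluster currently stands before any 
> > > > conversion; a minimal sketch, with "gpfs0" as a placeholder device name.)
> > > >
> > > >     mmlsconfig minReleaseLevel    # release level the cluster is committed to
> > > >     mmlsfs gpfs0 -V               # on-disk format version of one filesystem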
> > > >
> > > > I raised an IBM critical service request about a month ago related to 
> > > > this: PMR 24090,L6Q,000.
> > > > However, according to the ticket, they seem to feel that it might not be 
> > > > related to GPFS.
> > > > We are sure, though, that these hangs started to appear only after we 
> > > > upgraded from GPFS 4.2.3.2 to 5.0.0.2.
> > > >
> > > > One of the other reasons we are not able to prove that it is GPFS is that 
> > > > we are unable to capture any logs or traces from GPFS once the hang 
> > > > happens.
> > > > Even the GPFS trace commands hang once “ps hangs”, which makes it 
> > > > difficult to get any dumps from GPFS.
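> > > >
> > > > (For what it is worth, the OS side can sometimes still be captured when 
> > > > the GPFS trace tools themselves hang; a sketch only, with <pid> standing 
> > > > in for one of the hung ps processes and sysrq assumed to be enabled.)
> > > >
> > > >     cat /proc/<pid>/stack           # kernel stack of the hung process
> > > >     echo w > /proc/sysrq-trigger    # log all blocked (D-state) tasks to dmesg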
> > > >
> > > > Also, according to the IBM ticket, they seem to have seen a “ps hang” 
> > > > issue before, and their advice is that running the mmchconfig 
> > > > release=LATEST command will resolve it.
> > > > However, we are not comfortable making the permanent change to filesystem 
> > > > version 5, and since we do not see any near-term solution to these hangs, 
> > > > we are thinking of downgrading to GPFS 4.2.3.2, the previous state in 
> > > > which we know the cluster was stable.
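> > > >
> > > > (For context, my understanding of the change being asked for is roughly 
> > > > the two steps below; <fsname> is a placeholder, and note that neither 
> > > > step is easily reversed.)
> > > >
> > > >     mmchconfig release=LATEST    # raise the cluster's committed release level
> > > >     mmchfs <fsname> -V full      # upgrade the on-disk filesystem format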
> > > >
> > > > Can downgrading GPFS take us back to exactly the previous GPFS 
> > > > configuration state?
> > > > With respect to downgrading from 5 to 4.2.3.2: is it just a matter of 
> > > > reinstalling all the RPMs at the previous version, or is there anything 
> > > > else I need to check with respect to the GPFS configuration?
> > > > I ask because I think GPFS 5.0 might have updated internal default GPFS 
> > > > configuration parameters, and I am not sure whether downgrading would 
> > > > change them back to what they were in GPFS 4.2.3.2.
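> > > >
> > > > (Whatever we decide, saving the current configuration first would at 
> > > > least make a before/after comparison possible; the output file names 
> > > > below are just examples.)
> > > >
> > > >     mmlsconfig > mmlsconfig.before-downgrade      # full cluster configuration
> > > >     mmlsfs all -V > mmlsfs-V.before-downgrade     # format version of every filesystem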
> > > >
> > > > Our previous state:
> > > >
> > > > 2 Storage clusters - 4.2.3.2
> > > > 1 Compute cluster - 4.2.3.2 (remote mounts the above 2 storage clusters)
> > > >
> > > > Our current state:
> > > >
> > > > 2 Storage clusters - 5.0.0.2 (filesystem version 4.2.2.2)
> > > > 1 Compute cluster - 5.0.0.2
> > > >
> > > > Do I need to downgrade all the clusters to return to the previous state, 
> > > > or is it OK to downgrade just the compute cluster to the previous 
> > > > version?
> > > >
> > > > Any advice on the best steps forward would greatly help.
> > > >
> > > > Thanks,
> > > >
> > > > Lohit
>
_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss
