I *think* this was a known bug in the Power firmware included with ESS 5.3.4, and that it was fixed in FW860.70. Something hanging/crashing in IPMI.
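
For anyone who wants to compare rinv runtimes before and after a firmware update, here is a minimal Python sketch. It is not anything from IBM or xCAT. It assumes it is run on the xCAT/EMS server and that `rinv <node> all` is a suitable invocation there; the helper names time_command and rinv_command are made up for illustration.

```python
import subprocess
import sys
import time


def time_command(cmd, timeout_s):
    """Run cmd and return (elapsed_seconds, status).

    status is the integer exit code, or the string "timeout" if the
    command did not finish within timeout_s seconds.
    """
    start = time.monotonic()
    try:
        completed = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
        status = completed.returncode
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising this exception.
        status = "timeout"
    return time.monotonic() - start, status


def rinv_command(node):
    """Build the inventory command for one node.

    Assumption: "rinv <node> all" matches how the site queries
    inventory; adjust the arguments if it does not.
    """
    return ["rinv", node, "all"]
```

A call such as time_command(rinv_command("essio1"), timeout_s=300) would then give the wall-clock time for one node, where "essio1" is a hypothetical node name. Running it once per node on an otherwise idle system should show whether the little-endian servers really are the slow ones.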
-jf

On Thu, 30 Jan 2020 at 17:10, Wahl, Edward <[email protected]> wrote:

> Interesting. We just deployed an ESS here and are running into a very
> similar problem with the GUI refresh, it appears. It takes my ppc64le
> nodes about 45 seconds to run rinv when they are idle.
> I had just opened a support case on this last evening. We're on ESS
> 5.3.4 as well. I will wait to see what support says.
>
> Ed Wahl
> Ohio Supercomputer Center
>
>
> -----Original Message-----
> From: [email protected] <[email protected]> On Behalf Of Ulrich Sibiller
> Sent: Thursday, January 30, 2020 9:44 AM
> To: [email protected]
> Subject: Re: [gpfsug-discuss] gui_refresh_task_failed for HW_INVENTORY with two active GUI nodes
>
> On 1/29/20 2:05 PM, Billich Heinrich Rainer (ID SD) wrote:
> > Hello,
> >
> > Can I change the times at which the GUI runs HW_INVENTORY and
> > related tasks?
> >
> > We frequently get messages like
> >
> >    gui_refresh_task_failed     GUI     WARNING     12 hours ago
> >    The following GUI refresh task(s) failed: HW_INVENTORY
> >
> > The tasks fail due to timeouts. Running the task manually succeeds
> > most of the time. We run two GUI nodes per cluster, and I noted
> > that both servers seem to run HW_INVENTORY at exactly the same
> > time, which may lead to locking or congestion issues. In fact, the
> > logs show messages like
> >
> >    EFSSA0194I Waiting for concurrent operation to complete.
> >
> > The GUI calls 'rinv' on the xCAT servers. rinv for a single
> > little-endian server takes a long time, about 2-3 minutes, while it
> > finishes in about 15 s for a big-endian server.
> >
> > Hence the long runtime of rinv on little-endian systems may be an
> > issue, too.
> >
> > We run 5.0.4-1 efix9 on the GUI and ESS 5.3.4.1 on the GNR systems
> > (5.0.3.2 efix4). We run a mix of ppc64 and ppc64le systems, with a
> > separate xCAT/EMS server for each type. The GUI nodes are ppc64le.
> >
> > We did see this issue with several GPFS versions on the GUI and
> > with at least two ESS/xCAT versions.
> >
> > Just to be sure, I did purge the PostgreSQL tables.
> >
> > I did try
> >
> >    /usr/lpp/mmfs/gui/cli/lstasklog HW_INVENTORY
> >
> >    /usr/lpp/mmfs/gui/cli/runtask HW_INVENTORY --debug
> >
> > and also tried to read the logs in /var/log/cnlog/mgtsrv/, but they
> > are difficult.
>
> I have seen the same on ppc64le. From time to time it recovers, but
> then it starts again. The timeouts are okay; it is the hardware. I
> have opened a call at IBM, and they suggested upgrading to ESS 5.3.5
> because of the new firmware, which I am currently doing. I can dig
> out more details if you want.
>
> Uli
> --
> Science + Computing AG
> Vorstandsvorsitzender/Chairman of the board of management:
> Dr. Martin Matzke
> Vorstand/Board of Management:
> Matthias Schempp, Sabine Hohenstein
> Vorsitzender des Aufsichtsrats/Chairman of the Supervisory Board:
> Philippe Miltin
> Aufsichtsrat/Supervisory Board:
> Martin Wibbe, Ursula Morgenstern
> Sitz/Registered Office: Tuebingen
> Registergericht/Registration Court: Stuttgart
> Registernummer/Commercial Register No.: HRB 382196
> _______________________________________________
> gpfsug-discuss mailing list
> gpfsug-discuss at spectrumscale.org
> https://urldefense.com/v3/__http://gpfsug.org/mailman/listinfo/gpfsug-discuss__;!!KGKeukY!gqw1FGbrK5S4LZwnuFxwJtT6l9bm5S5mMjul3tadYbXRwk0eq6nesPhvndYl$
