I saw the same behavior on my RH 9.0 test cluster; however, it only manifested itself AFTER I ran up2date and then start_over and built the cluster again. The symptoms were MPICH tests failing - your pam.d hack caused all tests to run successfully to completion.
I also saw a similar thing with RH 8.0, but didn't uncover the pam delay before trying it with 9.0.
I suspect this update: https://rhn.redhat.com/errata/RHSA-2003-222.html did something to OpenSSH to induce this delay - maybe even on purpose, so there is ALWAYS a set delay when using pam, to prevent timing attacks.
Cheers,
Sean Slattery
Spectral Sciences Inc.
Message: 1
Date: Thu, 04 Sep 2003 19:07:27 -0500
To: Jeff Johnson <[EMAIL PROTECTED]>, [EMAIL PROTECTED]
From: Jeremy Enos <[EMAIL PROTECTED]>
Subject: Re: [Oscar-users] Re: Two install issues OSC2.3/RH80
I'm out of ideas (and cycles) for this problem at the moment. I don't think I've ever seen it before. If you do find a solution, please post it. Sorry I couldn't be of more help-
Jeremy
At 04:59 PM 9/4/2003 -0700, Jeff Johnson wrote:
On Thu, 2003-09-04 at 16:12, Jeremy Enos wrote:
> IPv6 issues come to mind... quick test:
>
> On your head node:
>
> time ssh NODE_X hostname
> time ssh -4 NODE_X hostname
>
> Let me know if the times differ-
>
> Jeremy
time ssh node01 hostname

real    2.620s
user    0.060s
sys     0.000s

time ssh -4 node01 hostname

real    2.603s
user    0.060s
sys     0.000s
If I change /etc/pam.d/sshd:
auth required /lib/security/pam_stack.so service=system-auth
to read:
auth required /lib/security/pam_stack.so shadow nodelay
And retest...
time ssh node01 hostname

real    0.282s
user    0.070s
sys     0.010s

time ssh -4 node01 hostname

real    0.280s
user    0.070s
sys     0.010s
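For what it's worth, the delay being bypassed here behaves like pam_unix's deliberate failure delay (FAIL_DELAY). If that's what it is, a narrower workaround than replacing the pam_stack line in pam.d/sshd would be to add the documented `nodelay` option to the pam_unix entry in /etc/pam.d/system-auth, so sshd keeps using the shared system-auth stack. A sketch only - the surrounding lines assume the stock Red Hat 8/9 system-auth layout; check your own file before editing:

```
# /etc/pam.d/system-auth (sketch; stock RH layout assumed)
auth        required      /lib/security/pam_env.so
# 'nodelay' suppresses the forced delay after a failed auth attempt
auth        sufficient    /lib/security/pam_unix.so likeauth nullok nodelay
auth        required      /lib/security/pam_deny.so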
On a running RH7.3/Oscar2.1 cluster, the same ssh test comes in at real 0.179s.
I am pulling my hair out trying to find the cause of the pam/ssh 10x slowness. It really gags when starting jobs across 16-30 nodes and is so bad that it took a time factor of 12 (from the default of 3) to get the cluster tests to pass.
If I leave that pam.d/sshd edit in place I can start and run the Pallas2 benchmark across all nodes without problems, and it only takes 15 seconds or so to start the actual i/o and post results. Without the change to pam.d/sshd it takes several minutes.
I don't want to leave the band-aid edit in place. I want to find out what pam is lagging on and provide it. I am assuming that this is due to some config file being referenced by pam that it is not finding, but I cannot figure out what it is.
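One quick way to test the missing-config theory: pam_stack.so dispatches to whatever stack its service= argument names, so a dangling reference would mean a file missing under /etc/pam.d. A rough sketch of a check (assumes the Red Hat pam_stack convention; adjust paths for your layout):

```shell
#!/bin/sh
# Rough sketch: for every service=NAME referenced via pam_stack.so in
# /etc/pam.d, verify that /etc/pam.d/NAME actually exists.
# No output on stdout means every reference resolves.
for f in /etc/pam.d/*; do
  [ -f "$f" ] || continue
  for svc in $(grep -ho 'service=[^ ]*' "$f" 2>/dev/null | cut -d= -f2); do
    [ -e "/etc/pam.d/$svc" ] || echo "$f references missing stack: $svc"
  done
done
```

If that comes up clean, the delay is probably deliberate (a built-in pam fail delay) rather than a lookup stalling on a missing file.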
Jeff
> > At 03:08 PM 9/4/2003 -0700, Jeff Johnson wrote:
> > On Thu, 2003-09-04 at 14:25, Terrence Fleury wrote:
> > > On 04 Sep 2003, Jeff Johnson <[EMAIL PROTECTED]> wrote:
> > > > Greetings,
> > > >
> > > > I have run into strange behavior on two separate installs of Oscar 2.3
> > > > on top of Redhat 8.0. In both cases RH8 was updated current as of Aug
> > > > 29th. The same behaviors were noted on both installs, which occurred on
> > > > two separate clusters.
> > > >
> > > > The first was during step 1, download additional packages. After
> > > > selecting this step a progress bar is displayed and the install gui
> > > > becomes unresponsive. This condition lasts for over a half hour during
> > > > which perl (according to top) runs as high as 90%, takes 2GB of RAM and
> > > > dips into swap before the gui dies. Running the gui again runs fine as
> > > > long as step 1 is bypassed.
> > >
> > > There are two possible issues here. One is the 'opd' program and the other
> > > is the Opder GUI. The GUI simply calls the 'opd' script (which is found in
> > > $OSCAR_HOME/scripts/). It could be that 'opd' is not working properly OR it
> > > could be that the files you are trying to download are REALLY big and it's
> > > just taking a long time to transfer the files. Right now, there's no way to
> > > display the file download status within the Opder GUI (because opd itself
> > > doesn't output that info when called from another process). This is
> > > something that we will definitely address in the future.
> >
> > No file transfer takes place. A menu of additional packages to select
> > does not even appear. Selection of download additional packages from the
> > main oscar install gui causes a blank grey window to appear that hangs
> > and dies in the manner I mentioned above in the original message. From
> > your comments I assume it must be something with the opd script
> > initially called by the gui when the initial selection is made.
> >
> > > So, my suggestion is to run the $OSCAR_HOME/scripts/opd program from the
> > > command line and see if you can download the files that way. It should show
> > > you a progress bar on a per-file basis so you can see if the problem is opd
> > > failing, or just huge files taking a long time to download.
> > >
> > > If running opd from the command line seems to run fine (and quickly), you
> > > can try the Opder GUI again and look in the /var/cache/oscar/opd directory
> > > while getting files to see if they are actually coming in. The files are
> > > given an .opd extension while downloading. Any files that were successfully
> > > downloaded get put in /var/cache/oscar/downloads.
> > >
> > > If the problem is in fact opd failing, please let us know. Thanks.
> > >
> > > Terry Fleury
> > > [EMAIL PROTECTED]
> >
> > The other, more crucial issue in my opinion, is the drastic slowdown in
> > job starting and ssh transactions involving PAM. This slowdown is
> > causing a simple cexec or ckill command to take 60-90 seconds to
> > complete. The starting of a mpich job whether by pbs or manually started
> > (ie: mpirun -nolocal -np 34 ./PMB2 -npmin 32) takes a very long time. To
> > give you an idea, to make the test_cluster script pass I had to up the
> > time factor in all of the test scripts to 12 so it had 210+ seconds to
> > complete. This case is 17 nodes over a gigabit network running dual 3Ghz
> > Xeons. This is a test that normally completes in under 30 seconds.
> >
> > What is it about RH8 over RH73 or Oscar2.3 over previous versions with
> > regard to PAM that causes such a severe lag?
> >
> > I appreciate your advice.
> >
> > Jeff
> > --
> > Jeff Johnson <[EMAIL PROTECTED]>
> > Western Scientific, Inc
> >
> > "Rome did not create a great Empire by holding meetings. They did it by
> > killing all those who opposed them."
-- Jeff Johnson <[EMAIL PROTECTED]> Western Scientific, Inc
"Rome did not create a great Empire by holding meetings. They did it by killing all those who opposed them."
------------------------------------------------------- This sf.net email is sponsored by:ThinkGeek Welcome to geek heaven. http://thinkgeek.com/sf _______________________________________________ Oscar-users mailing list [EMAIL PROTECTED] https://lists.sourceforge.net/lists/listinfo/oscar-users
