Hi Jeff,

I saw the same behavior on my RH 9.0 test cluster; however, it only manifested itself AFTER I ran up2date, then ran start_over and built the cluster again. The symptom was MPICH tests failing - your pam.d hack got all tests to run successfully to completion.

I also saw a similar thing with RH 8.0, but didn't uncover the pam delay before trying it with 9.0.

I suspect this update: https://rhn.redhat.com/errata/RHSA-2003-222.html did something to OpenSSH to introduce this delay - maybe even on purpose, so there is ALWAYS a set delay when using pam, to prevent timing attacks.
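If you want to confirm whether a node actually picked up that errata, something like this should show it (just a sketch using stock rpm commands; I haven't verified the exact changelog wording):

    rpm -q openssh openssh-server openssh-clients
    rpm -q --changelog openssh-server | head -20    # look for an RHSA-2003:222 entry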

Cheers,

Sean Slattery
Spectral Sciences Inc.



Message: 1
Date: Thu, 04 Sep 2003 19:07:27 -0500
To: Jeff Johnson <[EMAIL PROTECTED]>, [EMAIL PROTECTED]
From: Jeremy Enos <[EMAIL PROTECTED]>
Subject: Re: [Oscar-users] Re: Two install issues OSC2.3/RH80

I'm out of ideas (and cycles) for this problem at the moment. I don't think I've ever seen it before. If you do find a solution, please post it. Sorry I couldn't be of more help-

Jeremy

At 04:59 PM 9/4/2003 -0700, Jeff Johnson wrote:

On Thu, 2003-09-04 at 16:12, Jeremy Enos wrote:


> IPv6 issues come to mind...  quick test:
>
> On your head node:
>
> time ssh NODE_X hostname
> time ssh -4 NODE_X hostname
>
> Let me know if the times differ-
>
>          Jeremy



time ssh node01 hostname
        real    2.620s
        user    0.060s
        sys     0.000s

time ssh -4 node01 hostname
        real    2.603s
        user    0.060s
        sys     0.000s

If I change /etc/pam.d/sshd:

auth required /lib/security/pam_stack.so service=system-auth

to read:

auth required /lib/security/pam_stack.so shadow nodelay

And retest...

time ssh node01 hostname
        real    0.282s
        user    0.070s
        sys     0.010s

time ssh -4 node01 hostname
        real    0.280s
        user    0.070s
        sys     0.010s

On a running RH7.3/Oscar2.1 cluster the same ssh test comes in at
real: 0.179s.

I am pulling my hair out trying to find the cause of the pam/ssh 10x
slowness. It really gags when starting jobs across 16-30 nodes and is so
bad that it took a time factor of 12 (from the default of 3) to get the
cluster tests to pass.

If I leave that pam.d/sshd edit in place I can start and run the Pallas2
benchmark across all nodes without problems, and it only takes 15 seconds
or so to start the actual i/o and post results. Without the change to
pam.d/sshd it takes several minutes.

I don't want to leave the band-aid edit in place. I want to find out
what pam is lagging on and provide it. I am assuming that this is due to
some config file being referenced by pam that it is not finding, but I
cannot figure out what it is.
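One way to narrow that down (just a sketch - it assumes strace is installed
on the nodes and that port 2222 is free): trace a throwaway sshd in debug
mode and look for where the stall happens and for any files pam fails to open:

    # on node01: run a one-shot sshd on a spare port, traced with timestamps
    strace -f -tt -o /tmp/sshd.trace /usr/sbin/sshd -d -p 2222

    # from the head node: trigger a single login against it
    time ssh -p 2222 node01 hostname

    # then scan the trace for the long gap between timestamps and for
    # config files that pam tries to open but cannot find
    grep ENOENT /tmp/sshd.trace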

Jeff

>
> At 03:08 PM 9/4/2003 -0700, Jeff Johnson wrote:
>
> >On Thu, 2003-09-04 at 14:25, Terrence Fleury wrote:
> >
> > > >> On 04 Sep 2003, Jeff Johnson <[EMAIL PROTECTED]> wrote:
> > > >
> > > > Greetings,
> > > >
> > > >     I have run into strange behavior on two separate installs of Oscar 2.3
> > > > on top of Redhat 8.0. In both cases RH8 was updated current as of Aug
> > > > 29th. The same behaviors were noted on both installs, which occurred on
> > > > two separate clusters.
> > > >
> > > > The first was during step 1, download additional packages. After
> > > > selecting this step a progress bar is displayed and the install gui
> > > > becomes unresponsive. This condition lasts for over a half hour, during
> > > > which perl (according to top) runs as high as 90%, takes 2GB of RAM and
> > > > dips into swap before the gui dies. Running the gui again runs fine as
> > > > long as step 1 is bypassed.


> > >
> > > There are two possible issues here. One is the 'opd' program and the other
> > > is the Opder GUI. The GUI simply calls the 'opd' script (which is found in
> > > $OSCAR_HOME/scripts/). It could be that 'opd' is not working properly OR it
> > > could be that the files you are trying to download are REALLY big and it's
> > > just taking a long time to transfer the files. Right now, there's no way to
> > > display the file download status within the Opder GUI (because opd itself
> > > doesn't output that info when called from another process). This is
> > > something that we will definitely address in the future.


> >
> >No file transfer takes place. A menu of additional packages to select
> >does not even appear. Selection of download additional packages from the
> >main oscar install gui causes a blank grey window to appear that hangs
> >and dies in the manner I mentioned above in the original message. From
> >your comments I assume it must be something with the opd script
> >initially called by the gui when the initial selection is made.
> >


> > > So, my suggestion is to run the $OSCAR_HOME/scripts/opd program from the
> > > command line and see if you can download the files that way. It should show
> > > you a progress bar on a per-file basis so you can see if the problem is opd
> > > failing, or just huge files taking a long time to download.
> > >
> > > If running opd from the command line seems to run fine (and quickly), you
> > > can try the Opder GUI again and look in the /var/cache/oscar/opd directory
> > > while getting files to see if they are actually coming in. The files are
> > > given an .opd extension while downloading. Any files that were successfully
> > > downloaded get put in /var/cache/oscar/downloads.
> > >
> > > If the problem is in fact opd failing, please let us know. Thanks.
> > >
> > > Terry Fleury


> > > [EMAIL PROTECTED]
> >
> >The other, more crucial issue in my opinion, is the drastic slowdown in
> >job starting and ssh transactions involving PAM. This slowdown is
> >causing a simple cexec or ckill command to take 60-90 seconds to
> >complete. The starting of an mpich job, whether by pbs or started manually
> >(i.e. mpirun -nolocal -np 34 ./PMB2 -npmin 32), takes a very long time. To
> >give you an idea, to make the test_cluster script pass I had to up the
> >time factor in all of the test scripts to 12 so it had 210+ seconds to
> >complete. This case is 17 nodes over a gigabit network running dual 3GHz
> >Xeons. This is a test that normally completes in under 30 seconds.
> >
> >What is it about RH8 over RH73 or Oscar2.3 over previous versions with
> >regard to PAM that causes such a severe lag?
> >
> >I appreciate your advice.
> >
> >Jeff
> >--
> >Jeff Johnson <[EMAIL PROTECTED]>
> >Western Scientific, Inc
> >
> >"Rome did not create a great Empire by holding meetings. They did it by
> >killing all those who opposed them."
> >




--
Jeff Johnson <[EMAIL PROTECTED]>
Western Scientific, Inc

"Rome did not create a great Empire by holding meetings. They did it by
killing all those who opposed them."


