On Thu, 2003-09-04 at 16:12, Jeremy Enos wrote:
> IPv6 issues come to mind...  quick test:
> 
> On your head node:
> 
> time ssh NODE_X hostname
> time ssh -4 NODE_X hostname
> 
> Let me know if the times differ-
> 
>          Jeremy

time ssh node01 hostname
        real    2.620s
        user    0.060s
        sys     0.000s

time ssh -4 node01 hostname
        real    2.603s
        user    0.060s
        sys     0.000s
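(Aside: single-shot `time` readings bounce around a little; a throwaway helper like this averages a few runs per node. The `avg_ms` name and GNU `date +%s%N` are my own sketch, nothing OSCAR-specific.)

```shell
#!/bin/sh
# Average several runs of a command, in milliseconds.
# Assumes GNU date with nanosecond %N, as shipped on Red Hat.
avg_ms() {
    runs=$1; shift
    start=$(date +%s%N)
    i=0
    while [ "$i" -lt "$runs" ]; do
        "$@" > /dev/null 2>&1
        i=$((i + 1))
    done
    end=$(date +%s%N)
    # total elapsed ns / runs / 1e6 -> average ms per run
    echo $(( (end - start) / runs / 1000000 ))
}

# e.g.  avg_ms 5 ssh -o BatchMode=yes node01 hostname
avg_ms 3 true   # prints the average run time in ms
```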

If I change /etc/pam.d/sshd:
        
        auth required /lib/security/pam_stack.so service=system-auth

        to read:

        auth required /lib/security/pam_stack.so shadow nodelay

And retest...

time ssh node01 hostname
        real    0.282s
        user    0.070s
        sys     0.010s

time ssh -4 node01 hostname
        real    0.280s
        user    0.070s
        sys     0.010s

For comparison, on a running RH7.3/Oscar2.1 cluster the same ssh test comes
in at real 0.179s.

I am pulling my hair out trying to find the cause of this 10x pam/ssh
slowdown. It really gags when starting jobs across 16-30 nodes, and it is
so bad that I had to raise the time factor to 12 (from the default of 3) to
get the cluster tests to pass.

If I leave that pam.d/sshd edit in place I can start and run the Pallas2
benchmark across all nodes without problems, and it only takes 15 seconds
or so to start the actual I/O and post results. Without the change to
pam.d/sshd it takes several minutes.

I don't want to leave the band-aid edit in place; I want to find out what
PAM is lagging on and fix it. I assume some config file referenced by PAM
is not being found, but I cannot figure out which one.
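One way to pin down what PAM is lagging on: run sshd in debug mode on a spare port under strace with timestamps and look for a big gap between consecutive calls (or an ENOENT on a config file). The port, paths, and the little awk gap-finder below are just a sketch, not anything OSCAR ships:

```shell
# On a node (spare port 2222 and the paths are assumptions):
#   strace -f -tt -e trace=open,stat,connect -o /tmp/sshd.trace \
#       /usr/sbin/sshd -d -p 2222
# then `ssh -p 2222 node01 hostname` from the head node.  A gap of
# seconds between consecutive timestamps marks the stalling call.
# Demonstrated here on a fabricated two-line trace:
cat > /tmp/sample.trace <<'EOF'
16:12:01.000100 open("/etc/pam.d/sshd", O_RDONLY) = 3
16:12:03.500200 read(3, ...) = 512
EOF
# Convert each HH:MM:SS.micros timestamp to seconds and flag lines
# that follow a gap of more than one second:
awk '{ split($1, t, /[:.]/); s = t[1]*3600 + t[2]*60 + t[3] + t[4]/1e6
       if (prev != "" && s - prev > 1) print "gap before:", $0
       prev = s }' /tmp/sample.trace
# prints: gap before: 16:12:03.500200 read(3, ...) = 512
```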

Jeff 

> 
> At 03:08 PM 9/4/2003 -0700, Jeff Johnson wrote:
> >On Thu, 2003-09-04 at 14:25, Terrence Fleury wrote:
> > > >> On 04 Sep 2003, Jeff Johnson <[EMAIL PROTECTED]> wrote:
> > > > Greetings,
> > > >
> > > >     I have run into strange behavior on two separate installs of Oscar 2.3
> > > > on top of Redhat 8.0. In both cases RH8 was updated current as of Aug
> > > > 29th. The same behaviors were noted on both installs which occurred on
> > > > two separate clusters.
> > > >
> > > > The first was during step 1, download additional packages. After
> > > > selecting this step a progress bar is displayed and the install gui
> > > > becomes unresponsive. This condition lasts for over a half hour during
> > > > which perl (according to top) runs as high as 90%, takes 2GB of RAM and
> > > > dips into swap before the gui dies. Running the gui again runs fine as
> > > > long as step 1 is bypassed.
> > >
> > > There are two possible issues here.  One is the 'opd' program and the other
> > > is the Opder GUI.  The GUI simply calls the 'opd' script (which is found in
> > > $OSCAR_HOME/scripts/).  It could be that 'opd' is not working properly OR it
> > > could be that the files you are trying to download are REALLY big and it's
> > > just taking a long time to transfer the files.  Right now, there's no way to
> > > display the file download status within the Opder GUI (because opd itself
> > > doesn't output that info when called from another process).  This is
> > > something that we will definitely address in the future.
> >
> >No file transfer takes place. A menu of additional packages to select
> >does not even appear. Selection of download additional packages from the
> >main oscar install gui causes a blank grey window to appear that hangs
> >and dies in the manner I mentioned above in the original message. From
> >your comments I assume it must be something with the opd script
> >initially called by the gui when the initial selection is made.
> >
> > > So, my suggestion is to run the $OSCAR_HOME/scripts/opd program from the
> > > command line and see if you can download the files that way.  It should show
> > > you a progress bar on a per-file basis so you can see if the problem is opd
> > > failing, or just huge files taking a long time to download.
> > >
> > > If running opd from the command line seems to run fine (and quickly), you
> > > can try the Opder GUI again and look in the /var/cache/oscar/opd directory
> > > while getting files to see if they are actually coming in.  The files are
> > > given an .opd extension while downloading.  Any files that were successfully
> > > downloaded get put in /var/cache/oscar/downloads.
> > >
> > > If the problem is in fact opd failing, please let us know.  Thanks.
> > >
> > > Terry Fleury
> >[EMAIL PROTECTED]
> >
> >The other, more crucial issue in my opinion, is the drastic slowdown in
> >job starting and ssh transactions involving PAM. This slowdown is
> >causing a simple cexec or ckill command to take 60-90 seconds to
> >complete. The starting of an mpich job, whether by pbs or manually
> >(i.e. mpirun -nolocal -np 34 ./PMB2 -npmin 32), takes a very long time. To
> >give you an idea: to make the test_cluster script pass I had to up the
> >time factor in all of the test scripts to 12 so it had 210+ seconds to
> >complete. This case is 17 nodes over a gigabit network running dual 3GHz
> >Xeons. This is a test that normally completes in under 30 seconds.
> >
> >What is it about RH8 over RH73 or Oscar2.3 over previous versions with
> >regard to PAM that causes such a severe lag?
> >
> >I appreciate your advice.
> >
> >Jeff
> >--
> >Jeff Johnson <[EMAIL PROTECTED]>
> >Western Scientific, Inc
> >
> >"Rome did not create a great Empire by holding meetings. They did it by
> >killing all those who opposed them."
-- 
Jeff Johnson <[EMAIL PROTECTED]>
Western Scientific, Inc

"Rome did not create a great Empire by holding meetings. They did it by
killing all those who opposed them."



-------------------------------------------------------
This sf.net email is sponsored by:ThinkGeek
Welcome to geek heaven.
http://thinkgeek.com/sf
_______________________________________________
Oscar-users mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/oscar-users