Mike, 
        Great work, and most excellent suggestions...
although you clearly don't have enough to do at your
day job -- wouldn't you love to become a 'highly paid'
OSCAR developer in your spare time?
 
I have individual comments below... Rich

Richard Ferri
IBM Linux Technology Center
[EMAIL PROTECTED]
845.433.7920 

--- Mike Mettke <[EMAIL PROTECTED]> wrote:
> 
> All,
> 
> I successfully installed RH7.3 and 1.3b3 and it has
> been working for a couple 
> of days now. All cluster tests pass successfully and
> all of my own 
> LAM/MPI programs run.
> 
> I took notes wrt what I did during the install and I
> hope they might be 
> useful for others:
> 
> hardware:
> master: P3 800MHz 256MB, Intel + 3com NIC
> 
> 3 nodes: dual P3 800MHz EB 256 MB RAM on ASUS
> CUR-DLS motherboard, 
> ServerWorks LE 3.0 chipset, dual onboard Intel 82559
> NIC
> 
> 1 node: dual P3 800MHz EB 256 MB RAM on Tyan
> Tiger 200 motherboard,
> VIA Apollo Pro 133A chipset, onboard Intel 82559 NIC
> 
> All BIOSes on all nodes were flashed with the latest
> BIOS version.
> 
> 1. Master configuration:
>     Full (everything) RH7.3 install + all updates
> from Redhat.
> 
> 2. cp all rpms from cdrom into /tftpboot/rpm
> 
> 3. cp all update rpms into /tftpboot/rpm
> 
> 4. rm /tftpboot/rpm/c3-2.71-2*.rpm
>     since 1.3b3 comes with its own version of c3
> 
> 5. rpm -e tftp
>     rpm -e tftp-server
>     since 1.3b3 brings its own tftp packages
> 
> 6. start install_cluster eth0
> 
> 7. put in full path (/usr/bin/rsync) to rsync in
> /etc/init.d/systemimager
> 
> 8. restart install_cluster eth0
> 
> 9. follow the installation wizard
> 
> 10. I used my own rpm list which I created from
> scratch.
> I added hdparm, gcc, lm-sensors and the ntp packages
> and dependencies 
> beyond what was strictly needed. see attachment.
> 
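
Steps 2 through 4 could be scripted roughly as below. This is only a sketch: prep_rpm_pool is a made-up helper name, not an OSCAR tool, and the source directories you pass to it are up to you. Only /tftpboot/rpm and the c3-2.71-2*.rpm pattern come from the notes above.

```shell
# Sketch only: prep_rpm_pool is a hypothetical helper, not part of OSCAR.
# Usage: prep_rpm_pool DEST SRC [SRC...]
prep_rpm_pool() {
    dest=$1
    shift
    mkdir -p "$dest" || return 1
    for src in "$@"; do
        # Copy every rpm from this source directory into the pool (steps 2 and 3).
        for f in "$src"/*.rpm; do
            [ -e "$f" ] || continue   # skip directories that contain no rpms
            cp "$f" "$dest"/ || return 1
        done
    done
    # Drop the distribution's c3 package, since 1.3b3 ships its own (step 4).
    rm -f "$dest"/c3-2.71-2*.rpm
}
```

It would be called as something like "prep_rpm_pool /tftpboot/rpm /mnt/cdrom/RedHat/RPMS /path/to/updates", where both source paths are assumptions about where the CD and the update rpms live. Step 5 (rpm -e tftp; rpm -e tftp-server) is left out on purpose, since removing packages is better done by hand.
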
> 
> Some suggestions/wishlist:
> 0. Starting of dhcpd on the master only on the
> private interface (edit 
> /etc/sysconfig/dhcpd accordingly). This prevents the
> dhcpd on the master 
> node from responding to dhcp requests on the company
> network.
> 
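
For suggestion 0, a minimal sketch of the /etc/sysconfig/dhcpd edit, assuming the Red Hat init script passes DHCPDARGS through to dhcpd and that eth0 is the private interface (as in step 6). The helper name dhcpd_sysconfig_line is made up for illustration.

```shell
# Sketch only: dhcpd_sysconfig_line is a hypothetical helper.
# It prints the DHCPDARGS setting that restricts dhcpd to one interface.
dhcpd_sysconfig_line() {
    iface=$1
    printf 'DHCPDARGS=%s\n' "$iface"
}

# Example (as root, followed by a restart of dhcpd):
#   dhcpd_sysconfig_line eth0 > /etc/sysconfig/dhcpd
```
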
> 1. centralized logging at the head node, at least
> for critical stuff.

We opened a bug for this, planned for the 1.4 release
of oscar -- agreed that it's essential for debugging.
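
Until then, one way to do this by hand with the stock sysklogd is sketched below. This is not necessarily what 1.4 will do. The helper name syslog_forward_line and the host name "headnode" are placeholders. The head node's syslogd also has to accept remote messages, which with sysklogd means starting it with -r.

```shell
# Sketch only: syslog_forward_line is a hypothetical helper.
# It prints a syslog.conf rule that forwards messages to a remote host.
# Usage: syslog_forward_line HEAD_NODE [SELECTOR]   (selector defaults to *.crit)
syslog_forward_line() {
    head_node=$1
    selector=${2:-*.crit}
    # Selector and action are separated by a tab, as in a stock syslog.conf.
    printf '%s\t@%s\n' "$selector" "$head_node"
}

# Example (on each compute node, then restart syslogd):
#   syslog_forward_line headnode >> /etc/syslog.conf
```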

> 
> 2. ntp installation. My nodes have a drift of ~100
> ppm/day without it !
>     multicast or broadcast mode should be
> sufficient.
> 

NTP support is on the list for 1.4 also (505572)
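
In the meantime, broadcast mode as Mike suggests comes down to one ntp.conf directive on each side. The sketch below just prints those directives. The helper name ntp_conf_fragment and the example broadcast address are assumptions, and depending on the ntpd version the authentication settings may need attention as well.

```shell
# Sketch only: ntp_conf_fragment is a hypothetical helper.
# Usage: ntp_conf_fragment server BROADCAST_ADDR
#        ntp_conf_fragment client
ntp_conf_fragment() {
    role=$1
    case $role in
        server)
            # Head node: send time broadcasts onto the private cluster subnet.
            printf 'broadcast %s\n' "$2"
            ;;
        client)
            # Compute nodes: listen for broadcasts from the head node.
            printf 'broadcastclient\n'
            ;;
        *)
            echo "usage: ntp_conf_fragment server ADDR | client" >&2
            return 1
            ;;
    esac
}

# Example: ntp_conf_fragment server 192.168.1.255 >> /etc/ntp.conf
```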

> 3. /usr/spool/PBS/server_priv/nodes does not get
> correctly updated with 
> the cpu count when install_cluster is restarted.
> This prevents the tests 
> from successfully executing in the case of SMP
> nodes. I had to fix this 
> file manually.

This sounds like a bug that should go into the bug
tracking system. For 1.3, it's a serious bug.
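
As a stopgap, the manual fix can be generated rather than typed. A minimal sketch, assuming the nodes file takes one "hostname np=N" line per node; the helper name pbs_nodes_lines and the node names in the example are made up.

```shell
# Sketch only: pbs_nodes_lines is a hypothetical helper.
# Reads "hostname cpu_count" pairs on stdin and prints one nodes-file
# line per host, defaulting to a single CPU when no count is given.
pbs_nodes_lines() {
    while read -r host cpus; do
        [ -n "$host" ] || continue        # skip blank lines
        printf '%s np=%s\n' "$host" "${cpus:-1}"
    done
}

# Example:
#   printf 'oscarnode1 2\noscarnode2 2\n' | pbs_nodes_lines \
#       > /usr/spool/PBS/server_priv/nodes
```

pbs_server most likely has to be restarted before it notices a hand-edited nodes file.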
> 
> 4. ScaLAPACK installation. Nice for benchmarking.

You're the second person to suggest ScaLAPACK today;
if we don't already have it in the tracking system
I'll add it.

> 
> 5. Maybe: lm-sensors installation and configuration.
>   xpbsmon (or gmond?) could then display the
> temperature and fanspeed of 
> the individual nodes. For large, unsupervised
> clusters this could be 
> critical to prevent costly mishaps when the air
> conditioning fails.

Could someone from ganglia land please respond?

> 
> 6. direct PXE netbooting all the time. The kernel
> could then either 
> install the node, or if the node already has the
> correct image, use 
> pivot_root to simply use the existing image. Doing
> this means:
> a) potential for greater fault-tolerance in case the
> local hd gets 
> corrupted.
> b) less sysadmin overhead since the master and node
> setup stays 
> constant. No bios parameter switching, no messing
> with dhcpd.conf on the 
> head node.
> This idea bears fruit especially for larger
> clusters.
> 

Interesting.  We used this approach (always attempting
to network boot) on the RS/6000 SP.  Agreed, the
advantages are less manipulation of the device boot
list, and centralized management of the nodes -- the
master tells the slave what mode to boot into.  The
problem I saw was that some nodes never recover from a
failed network boot.  They are supposed to percolate
to the next item on the bootlist, but some just
happily start net boot all over again -- I think these
nodes are in the minority, and I'd like to say it's
actually a deficiency  on the part of the firmware.

Other than these 'wacky' nodes, I agree that
centralized management of booting is more scalable...
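
To make "the master tells the slave what mode to boot into" concrete, here is a sketch that assumes PXELINUX is the network boot loader. PXELINUX looks for a per-node file under pxelinux.cfg/ named after the node's IP address in uppercase hex, and its LOCALBOOT directive hands control back to the local disk. The helper names and the example address are made up.

```shell
# Sketch only: both helpers are hypothetical.

# Print the pxelinux.cfg file name for a dotted-quad IP address.
pxe_cfg_name() {
    old_ifs=$IFS
    IFS=.
    set -- $1            # split the address into its four octets
    IFS=$old_ifs
    printf '%02X%02X%02X%02X\n' "$1" "$2" "$3" "$4"
}

# Print a config that makes the node fall through to its local disk.
pxe_localboot_cfg() {
    printf 'DEFAULT local\nLABEL local\n  LOCALBOOT 0\n'
}

# Example (node already installed, so boot it from disk):
#   pxe_localboot_cfg > "/tftpboot/pxelinux.cfg/$(pxe_cfg_name 192.168.1.10)"
```

Switching a node back to install mode would then mean replacing that one file on the master, with no change to the node's BIOS boot order.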

> 7. Option of using IP ranges for dhcpd.conf instead
> of MAC addresses.
> This avoids the whole MAC address snooping process
> during installation, 
> but makes nodes unidentifiable (unless you ssh to
> the node and issue a 
> "beep" command). In my opinion, I'd be happy not
> knowing which node is 
> which, as long as I can positively identify any node
> when the need (hw 
> problems) arises. But this is only my opinion ...
> 
> 
This is cool.  We opted to be able to identify the
node over ease of MAC address collection, but you may
be aware that Scyld, for example, takes the opposite
approach -- they collect all the MACs easily, but you
don't know which node is which. The debate here rages
on... 
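
For comparison, the two styles look roughly like this in dhcpd.conf. The sketch only prints the fragments. The helper names, MAC address, host name and IP addresses are all made up, and the range line belongs inside the subnet declaration for the cluster network.

```shell
# Sketch only: both helpers are hypothetical.

# Per-node entry: the node is identified by its MAC address.
# Usage: dhcpd_host_stanza NAME MAC IP
dhcpd_host_stanza() {
    printf 'host %s {\n  hardware ethernet %s;\n  fixed-address %s;\n}\n' \
        "$1" "$2" "$3"
}

# Pool entry: any node that asks gets the next free address.
# Usage: dhcpd_range_line FIRST_IP LAST_IP
dhcpd_range_line() {
    printf '  range %s %s;\n' "$1" "$2"
}

# Examples:
#   dhcpd_host_stanza oscarnode1 00:02:b3:00:00:01 192.168.1.10
#   dhcpd_range_line 192.168.1.10 192.168.1.50
```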
> 
> Comments and suggestions are always appreciated.
> 
> regards,
> Mike
> 
> Wireless Advanced Technology Lab
> Bell Labs - Lucent Technologies
> 
> 
> 
> 
> 
> 
> filesystem
> glibc-common
> glib
> chkconfig
> popt
> pcre
> bzip2-libs
> mingetty
> termcap
> bash
> bzip2
> openssl
> cracklib
> perl
> libappconfig-perl
> which
> cracklib-dicts
> systemimager-client
> info
> gawk
> ed
> grep
> procps
> pam
> modutils
> cyrus-sasl-md5
> libuser
> tcsh
> shadow-utils
> openpbs-oscar
> openpbs-oscar-mom
> bdflush
> modules
> env-switcher
> lam-module
> sysklogd
> initscripts
> dev
> gzip
> losetup
> mkinitrd
> kernel-smp
> openssh
> openssh-clients
> lm_sensors
> nfs-utils
> ntp
> binutils
> kernel-headers
> gcc
> setup
> basesystem
> glibc
> gdbm
> zlib
> e2fsprogs
> rsync
> mktemp
> net-tools
> libtermcap
> ganglia-monitor-core
> textutils
> glib2
> db3
> c3-ckillnode
> systemimager-common
> words
> ncurses
> systemconfigurator
> iputils
> make
> sed
> diffutils
> fileutils0
> lam
> sh-utils
> cyrus-sasl
> openldap
> mount
> SysVinit
> rpm
> mpich-oscar
> pvm
> tcl
> pvm-modules
> mpich-oscar-module
> iproute
> util-linux
> usermode
> less
> tar
> findutils
> lilo
> hdparm
> openssh-server
> pump
> portmap
> libcap
> pfilter
> cpp
> glibc-devel
> psmisc
> 


_______________________________________________
Oscar-users mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/oscar-users
