Hi Michael,

Ok, I downloaded all FC5 updates and put them on the server node, updated the local fedora-5 repository, regenerated oscarimage and reimaged the nodes. This time it worked! Not a single rsync timeout anymore. So it was an rsync bug, perhaps triggered by a particular combination of hardware/kernel/drivers.

Now I am having another problem, with the SGE configuration on slave nodes picking up the wrong server name. My server node has two network interfaces, a private one and a public one, a pretty standard configuration. The machine hostname is set to reflect the public interface, not the private one. Now, the SGE install scripts set the qmaster hostname to that public name, and the SGE execd daemons on slave nodes obviously fail to contact qmaster by this name since they have no access to the public network. Easy to fix manually, but still I think this is a bug...

On Monday 16 April 2007 06:16:16 pm Michael Edwards wrote:
> When you update rsync on the head node also update it on the image.
>
> copy the rpm into the image directory (probably something like
> /var/lib/systemimager/images/oscarimage/tmp)
> chroot /var/lib/systemimager/images/oscarimage
> install the rpm from /tmp in the new environment
> exit the chroot and try imaging the nodes
>
> I think there are some scripts to do this as well, but I haven't gotten
> around to learning how to use them yet :)

There are indeed, and reading the documentation can sometimes save you a lot of time ;-). That said, the OSCAR documentation is horrible. Not only is it incomplete and fragmented, it also has a good number of typographic errors in the examples, which makes it look more like a puzzle. It was not especially difficult to solve, but I wonder why these outstanding errors are still in there? It would take 5 minutes to fix them and regenerate the PDF file...

Cheers,
Ivan

> On 4/16/07, Ivan Adzhubey <[EMAIL PROTECTED]> wrote:
> > On Monday 16 April 2007 02:52:28 pm Michael Edwards wrote:
> > > Have you tried the Use your own kernel (UYOK) function?
> >
> > Yes I did, to no avail. In fact, networking works perfectly well as far
> > as anything else but rsync is concerned. I ran rsync manually from client
> > (slave node) side with -vvv option and specifying just one oscarimage
> > subdirectory to simplify things a bit. It starts the protocol,
> > successfully obtains the list of remote files, creates proper sym/hard
> > links in the target (local) directory and then hangs on phase_1 (AFAIK
> > that's when the actual file copying should start).
> >
> > It looks like a bug in rsync and searching through rsync mailing list I
> > can see several similar bugs submitted. They are all unconfirmed because
> > of highly random and unreproducible nature.
> >
> > I found that in my case the only cure is a complete fresh reinstall of
> > the server node OS+OSCAR, including disk reformatting (!). After that, I
> > sometimes am able to make image push work once or twice. Then it fails
> > again and after that nothing can make it push a single byte anymore.
> >
> > Weird, isn't it?
> >
> > I'll try to upgrade rsync on the server and see if it helps.
> >
> > Cheers,
> > Ivan
> >
> > > It is possible that SIS is mis-detecting your network card drivers and
> > > therefore having problems...
> > >
> > > On 4/16/07, Ivan Adzhubey <[EMAIL PROTECTED]> wrote:
> > > > On Monday 16 April 2007 01:35:04 pm Michael Edwards wrote:
> > > > > That sounds like a problem we used to have with pfilter. Did you
> > > > > select pfilter to be installed, it is not selected by default in
> > > > > 5.0.
> > > >
> > > > pfilter neither selected nor installed (just checked to be sure). I
> > > > have iptables configured on the server node with eth0 (private LAN)
> > > > interface set as trusted. I also tried disabling iptables altogether,
> > > > as I wrote, no effect. SELinux is disabled. Frankly, I am lost. In
> > > > the process of debugging this issue and I will post updates as soon
> > > > as I get something.
> > > >
> > > > > On 4/16/07, Ivan Adzhubey <[EMAIL PROTECTED]> wrote:
> > > > > > Hi,
> > > > > >
> > > > > > I am installing OSCAR 5.0, with Fedora Core 5 fresh install on a
> > > > > > server node. Everything goes smooth until I try to push OSCAR
> > > > > > image to slave nodes via network boot (PXE). If I select rsync
> > > > > > as a transport, oscar_wizard proceeds to network-booting a node
> > > > > > just fine, partitions hard drive, creates local filesystem and
> > > > > > initiates rsync image transfer without any errors or warnings
> > > > > > reported. Then the transfer will abort at about 50% and time
> > > > > > out. There is no errors whatsoever reported, it just stops
> > > > > > abruptly. I can still ping slave node from the server node and
> > > > > > after install script fails on the node upon rsync timeout and
> > > > > > throws a shell prompt I can also ping server node from the
> > > > > > slave, but alas - no rsync traffic at all. I initially suspected
> > > > > > a hardware problem and tried replacing network switch, cables,
> > > > > > power cycling both head node and slave, etc. Nothing helped. I
> > > > > > also tried disabling firewall, no effect. However, now every
> > > > > > next attempt after the initial failure the image transfer will
> > > > > > not even start, just sits at 0% and eventually times out. Which
> > > > > > looks more like a software problem to me. All other transport
> > > > > > options work fine, e.g. BitTorrent, multicast, but I'd really
> > > > > > like to be able to use rsync since it is by far the fastest
> > > > > > transport supported.
> > > > > >
> > > > > > On a related issue: should I try upgrading server node OS (with
> > > > > > yume) before creating slave images? Will this put updated rsync
> > > > > > into the image or is it a fixed part of SystemImager network
> > > > > > boot image (together with the kernel)?
> > > > > >
> > > > > > Cheers,
> > > > > > Ivan Adzhubey

_______________________________________________
Oscar-users mailing list
Oscar-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/oscar-users