> > Somewhere in the middle of the restore, everything began
> > to crawl, and we had to stop restoring volumes to it.
> > It took 2 days until the server was working alright again,
> > and since then it jumps up to >96% iowait as soon as we
> > try to 'vos move' or restore more accounts to this partition.
> > 
> ...
> > 
> > Experimenting a bit with SIGSTOP/SIGCONT on 'volserver' during a
> > freeze shows us it is that process that causes the iowaits:
> > as soon as we SIGCONT it, it immediately goes back to waiting
> > on something we haven't been able to identify (sketch below).
> > 
> > Is there anyone else who has large partitions and/or has seen
> > this behaviour before?
> > 
> > It worked alright up to the point when the partition had used up
> > ~12-13GB of the ~16GB available.
> > 
> > 
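
For anyone who wants to reproduce the experiment described above, it
amounts to something like the following.  The volume, server, and
partition names are placeholders, and 'pgrep' may need replacing with
'ps -e | grep volserver' on older systems:

    # Kick off a move onto the suspect partition, in the background:
    vos move user.example fs1.example.com /vicepa fs2.example.com /vicepb &

    # Once iowait climbs, freeze the volserver; iowait should drop:
    kill -STOP `pgrep volserver`

    # Resume it; iowait should immediately climb back up:
    kill -CONT `pgrep volserver`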
> 
> We have seen a similar problem on a server using 6GB SDS RAIDs: during the
> 'clone' of a biggish volume with thousands of files, the volserver would
> effectively stop the machine, so that even clients connecting to the
> fileserver timed out.
> 
> We suspected a problem with too many I/O requests queuing up - individual
> requests are typically slower on a RAID than on a normal disk.
> 
> We never found the real cause. In the end what helped was that Transarc gave
> us a volserver supporting a '-sleep' parameter: we set it up so that the
> volserver sleeps 2 seconds every 30 seconds of sustained activity. This way
> the RAID gets a chance to 'cool down' and the fileserver to say hello to all
> its clients. 


It's probably the same cause as what I've seen here on AIX 3.2.5 -
the fileserver process gets starved out because it sits on the 'system'
or 'user' processing queue.  The volserver gets woken up first and
given a chance to run because it's on the 'iowait' queue, and it takes
over the cpu on the single-threaded kernel.  The sleep patch allows the
volserver to voluntarily leave the 'iowait' queue and go to sleep for a
bit (letting other things run).  It's mostly an internal OS queuing
problem and, from what I've been told, this is not a problem on threaded
kernels (like AIX 4.1.x).
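
For reference, on a volserver binary that supports it, the
2-seconds-every-30 setup described above would be configured roughly
like this.  I'm assuming the 'sleep_time/run_time' argument format
documented in later releases, and the machine name is a placeholder;
the patched binary's exact syntax may differ:

    # Recreate the fs instance with the volserver sleep option:
    # sleep 2 seconds after every 30 seconds of sustained activity.
    bos create fs1.example.com fs fs \
        -cmd /usr/afs/bin/fileserver \
        -cmd "/usr/afs/bin/volserver -sleep 2/30" \
        -cmd /usr/afs/bin/salvager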

We ran across it while doing vos releases of ~1GB volumes to multiple
sites.  The clients were able to get "I'm alive" messages from the
server, but no data (since the fileserver was not being allowed onto the
'run' queue), and the keepalives prevented the clients from failing over
to one of the other replication sites.  While the volserver is asleep,
the fileserver catches up with the outstanding requests and the clients
are happy.
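
For what it's worth, the trigger was nothing exotic - just ordinary
multi-site replication along these lines (server, partition, and volume
names are placeholders):

    # Define read-only sites on two other servers, then push the data:
    vos addsite fs2.example.com /vicepa sw.tools
    vos addsite fs3.example.com /vicepa sw.tools
    vos release sw.tools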

This may not be exactly the same as the OS versions you're using, but
it's probably close...

        -Dan

---
Dan Shea                                |  phone: (602) 554-1494
5000 W. Chandler Blvd. MS: CH3-70       |  fax:   (602) 554-6067
Chandler, AZ 85226                      |  [EMAIL PROTECTED]
