Do you notice any problems with such a large cache?  AFS cache
performance is going to be degraded by unhashed directory lookups on
standard UFS filesystems if the directory is very large (i.e. the number
of files > 10,000 or so).
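
To illustrate the point, here is a toy Python model (illustration only, not UFS or AFS code): a classic UFS directory is scanned entry by entry, so each lookup in a 10,000-entry cache directory walks the list, while a hashed index answers in roughly constant time.

```python
# Toy model of directory lookup cost (illustration only, not UFS code).
# Classic UFS scans directory entries linearly, so lookup time grows
# with the number of entries; a hashed index stays roughly constant.

def linear_lookup(entries, name):
    """O(n) scan, like an unhashed UFS directory."""
    for entry in entries:
        if entry == name:
            return True
    return False

def hashed_lookup(index, name):
    """O(1) average case, like a hashed/indexed directory."""
    return name in index

entries = ["file%05d" % i for i in range(10000)]
index = set(entries)

# Worst case for the linear scan: the name is the last entry.
print(linear_lookup(entries, "file09999"))  # found after 10,000 compares
print(hashed_lookup(index, "file09999"))    # found after ~1 probe
```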

  -b.

On Thu, 19 Sep 1996, Mickey Beddingfield wrote:

> Brian,
> 
> I can't write much right now, but we are running our http server on AFS
> with no problems.  I am running on a Sun SPARC 5 with a 2GB cache under
> Solaris 2.5 and AFS 3.4.  Again, no problems at this time...Mic
> 
> ----------
> > From: Brian W. Spolarich <[EMAIL PROTECTED]>
> > To: [EMAIL PROTECTED]
> > Subject: AFS/HTTP Server Performance
> > Date: Tuesday, September 17, 1996 8:44 AM
> > 
> > 
> >   I'm attaching a writeup of some tests I did a few months ago using AFS
> > to serve data to a heavily-loaded HTTP server.  At that time I did not
> > have an opportunity to follow up very much on some of the unanswered
> > questions, and didn't feel like I had characterized the situation
> > clearly enough.
> > 
> >   The problems that we saw happened when we started the test scenario
> > against a "cold" cache with a moderate (60Mb) amount of data to retrieve.
> > In this scenario, the tests would proceed for a few minutes, and then the
> > AFS client/HTTP server would lose contact with the AFS server.  The
> > addition of a second database server did not solve the problem.
> > 
> >   If anyone has any thoughts on this, I'd appreciate hearing them.  Some
> > differences between these somewhat informal tests that I ran back in May
> > and what we're going to do now include a change in operating system
> > (Solaris 2.5.1 instead of 2.4), and Ultras (on 10Mb/sec ethernet) instead
> > of Sparc 20s.  If we can get it, we'll try and run the enhanced release
> > of 2.5.1 (for Web servers) and see what happens.
> > 
> >   I believe tcp_max_conn_req was set to 128 during this test, but as I
> > said, I did this somewhat informally and didn't collect all of the data
> > that I should have. :-]
> > 
> >   What I'm looking for is responses like "Yeah, we saw similar behaviour
> > when we did something like this and fixed it by <blah>", or "You might
> > try bumping up <bleh>".
> > 
> >   Transarc didn't really provide much help (although they tried to be
> > helpful) as the support guy didn't feel like he had enough info to really
> > understand the problem. 
> > 
> >   -brian
> > 
> >
> > ---------------------------------------------------------------------------
> > | APPENDIX A.  AFS/HTTP SERVER PERFORMANCE EVALUATION RESULTS
> > |
> > ---------------------------------------------------------------------------
> > 
> > 
> > Overview
> > --------
> > This test scenario is designed to do some stress testing of an HTTP
> > (Web) server reading its data out of AFS versus local disk.  The HTTP
> > client is running some home-grown software (webbash) which allows us
> > to fork off multiple simultaneous threads which act as HTTP clients,
> > requesting documents from a list.  The document testbed is a set of
> > files containing random ASCII characters ranging in size from 0k to
> > 1Mb, and totals initially 2.2Mb.  Copying this testbed into
> > subdirectories [a..z] yields a testbed of 60Mb.
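
webbash itself is home-grown and not shown in the post; purely as a sketch of the shape of such a load generator (every name below is invented for illustration), a threaded client might look like this.  Each worker shuffles the document list and fetches each URL with the ten-second timeout described below.

```python
# Sketch of a "webbash"-style load generator.  The real webbash is not
# shown in the post; all names here are invented for illustration.
import random
import threading
import urllib.request

def worker(base_url, paths, results):
    docs = list(paths)
    random.shuffle(docs)               # each thread randomizes the list
    ok = 0
    for path in docs:
        try:
            with urllib.request.urlopen(base_url + path, timeout=10) as resp:
                resp.read()
                ok += 1
        except OSError:
            pass                       # timeout or connection failure
    results.append(ok)

def run(base_url, paths, n_threads):
    """Fork off n_threads simultaneous HTTP clients; return total ops."""
    results = []
    threads = [threading.Thread(target=worker, args=(base_url, paths, results))
               for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(results)
```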
> > 
> > Test Environment 
> > ---------------- 
> > The test environment consists of three machines isolated from the rest of
> > the network via an ethernet hub.  I connect to the machines via a Cisco
> > terminal server which is not isolated from the local network.  All
> > machines are running Solaris 2.4 w/o the recommended patches. :-]
> > 
> >         [prod-1b]       AFS Fileserver  Sparc20/128Mb
> >                         AFS Client
> > 
> >         [  log  ]       AFS Client /    Sparc20/64Mb
> >                         HTTP Server
> > 
> >         [prod-2a]       HTTP Client     Sparc20/128Mb
> >                         ("bash" program)
> >                         AFS Client
> >                         (later becomes an additional AFS file/dbserver)
> > 
> > 
> > Explanation of the Fields Below
> > -------------------------------
> > Data Source - local disk or AFS
> > Client Threads - number of simultaneous threads created by "webbash".
> >         Each thread reads the file list and randomizes it.  Each thread
> >         will time out after ten seconds if it does not receive data from
> >         the HTTP server.
> > Iterations - number of times the test suite iterates over the list of
> >         files to retrieve.
> > Cache/Data Ratio - the ratio of AFS cache size to data testbed size.
> > Daemons (afsd) - number of extra afsd processes to run to handle service
> >         requests.
> > Volume Type - ReadWrite or ReadOnly.  Lack of per-file callbacks on
> >         ReadOnly volumes should reduce AFS client-server traffic.
> > Throughput (Mb/Hr) - As reported by webbash, in megabytes per hour.
> > HTTP Ops/Sec - Number of HTTP operations serviced per second.  This is
> >         probably the real "throughput benchmark".
> > Comments - Describes various events that may have happened during the
> >         test.  See the key below the table for details.
> > 
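> > As a sanity check on the Cache/Data Ratio column below: both the 34/1
> > and 1.25 figures are consistent with an AFS cache of roughly 75Mb -- a
> > value inferred here from the ratios themselves, not stated in the post.

```python
# Sanity check on the Cache/Data Ratio column.  The ~75Mb cache size is
# inferred from the ratios; the post never states it directly.
cache_mb = 75.0                       # inferred, not from the post

print(round(cache_mb / 2.2))          # 34: the "34/1" rows (2.2Mb testbed)
print(round(cache_mb / 60.0, 2))      # 1.25: the 60Mb-testbed rows
```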
> >                         Cache/                          Throug-
> > Data    Client  Itera-  Data    Data    Daemons Volume  hput    HTTP    Com-
> > Source  Threads tions   Ratio   Size    (afsd)  Type    (Mb/Hr) Ops/Sec ments
> > -------------------------------------------------------------------------------
> > local   10      10      n/a     2.2Mb   n/a     n/a     4048    15.01
> > afs     10      10      34/1    2.2Mb   3       RW      4053    15.04
> > afs     20      10      34/1    2.2Mb   3       RW      3959    14.69   #
> > afs     50      10      34/1    2.2Mb   3       RW      3795    13.98   #+
> > afs     100     10      34/1    2.2Mb   3       RW      4306    23.06   #+
> > afs     200     10      34/1    2.2Mb   3       RW      5208    31.31   #+
> > afs     10      5       1.25    60Mb    3       RW      3793    37.59   !@
> > afs     1       5       1.25    60Mb    3       RW      1504     5.57
> > afs     3       3       1.25    60Mb    3       RW      3155    11.74   $%
> > afs     5       2       1.25    60Mb    5       RO      2927    12.53   !@
> >
> > # A second file/db server was added to the AFS cell at this point.
> > (these were warm cache reads)
> > afs     10      1       1.25    60Mb    5       RO      3957    14.68   #$
> > # Then I flushed the volume from the cache (i.e. cold reads)
> > afs     10      2       1.25    60Mb    5       RO      1524     8.70   !+#+@+
> > 
> > Comments Key:
> > ! - Cell "Thrashing":  AFS client loses contact with fs and volserver.
> >         AFS client freezes.
> > @ - HTTP Server returned "404 Not Found" errors (i.e. file did not
> >         exist).
> > # - Large files (500K+) took more than 10 seconds to transfer.
> > $ - Timeout (>10 sec) trying to read data.
> > % - HTTP Server returned "504" Error.
> > 
> > + - Modifier to above.  This event occurred many (>10, generally) times.
> > 
> > Problems Observed
> > -----------------
> > 
> > Interestingly, reading data from a small to moderate local AFS cache
> > appears to be slightly faster than local disk.  This seems to differ from
> > work done by Michael Stolarchuck at U-M which suggested ways to improve
> > AFS cache read performance.  The difference is small, and may be a
> > statistical anomaly.
> > 
> > The biggest problem we've seen with this situation has been timeouts
> > between the AFS client and server(s) as the data is being fetched into
> > the cache.  Generally what happens is that I can observe an initial
> > burst of traffic between the AFS client and server, during which I see
> > a large number of network collisions.  After a period of time, the
> > traffic will come to a halt (I'm observing the lights on the hub).  The
> > AFS client/HTTP server will at this point freeze for a while, and will
> > eventually report that the server(s) for the cell are unavailable.
> > This will cause the HTTP server to report that the files the various
> > threads are trying to fetch do not exist, which generates the "404 Not
> > Found" errors.
> > 
> > AFS timeouts generally look like this:
> > 
> > afs: Lost contact with file server 198.83.22.104 in cell test.ans.net
> > (all multi-homed ip addresses down for the server)
> > afs: Lost contact with file server 198.83.22.104 in cell test.ans.net
> > (all multi-homed ip addresses down for the server)
> > afs: file server 198.83.22.104 in cell test.ans.net is back up
> > (multi-homed address; other same-host interfaces may still be down)
> > afs: file server 198.83.22.104 in cell test.ans.net is back up
> > (multi-homed address; other same-host interfaces may still be down)
> > 
> > Once the data is in the cache, the timeouts do not happen.  With a
> > larger number of clients (threads), the timeouts seem to happen more
> > frequently.  When the timeouts do happen, the AFS client machine
> > freezes (as observed via the serial console).  After a second
> > file/dbserver was added to the cell, timeouts occurred for both AFS
> > servers, although the initial timeouts were only to the new fileserver,
> > which held one of the ReadOnly copies of the data I was trying to
> > fetch.
> > 
> > Only the AFS client itself seems to pause during this time.  The main AFS
> > server and HTTP client do not seem to be affected.
> > 
> > During one test, the AFS client became unpingable and had to be rebooted
> > by sending a break to the console.
> > 
> > Conclusions
> > -----------
> > The artificiality of the test scenario makes it hard to determine
> > whether or not these problems would occur in Real Life.  Multiple,
> > parallel HTTP GET streams coming at this rate may not be a realistic
> > scenario.  I don't know "how much" traffic each thread really
> > represents in real terms.
> > 
> > Still, we did observe some definite problems with the AFS client that
> > may or may not be tunable.  Adding a second file/dbserver initially
> > appeared to help the situation, but after the volume was flushed from
> > the cache it turned out to have done nothing.  Increasing the number of
> > afsd processes helped somewhat (it seemed to take longer for the client
> > to lose contact with the server), but didn't improve the situation very
> > much.
> > 
> > I know that AFS is in use under some relatively heavy usage conditions
> > at U-M and NCSA.  I know that U-M's WWW servers are considered "slow"
> > by their user populace (I helped administer them for a short time), but
> > this is probably due to a number of factors, including network
> > congestion, overburdened fileservers, and the growing pains of a very
> > large AFS cell.  On the other hand, I don't generally consider NCSA's
> > site to be particularly slow.
> > 
> > Perhaps a more rational way of getting the cache seeded beforehand would
> > help.
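
One crude way to seed the cache ahead of the run would be to walk the document tree and read every file once before starting the load, so the test hits a warm cache.  A generic sketch ("warm_cache" and its argument are invented names, not part of webbash or AFS):

```python
# Walk a document tree and read every file once, pulling each through
# the AFS cache before the load test starts.  Generic sketch; the
# function name and argument are invented for illustration.
import os

def warm_cache(root):
    """Read every file under root once; return the number of files touched."""
    touched = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            with os.fdopen(os.open(os.path.join(dirpath, name),
                                   os.O_RDONLY), "rb") as f:
                while f.read(64 * 1024):   # drain the whole file
                    pass
            touched += 1
    return touched
```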
> > 
> > I'd like to see some response from Transarc on this. 
> > 
> > 
> > --
> >        Brian W. Spolarich - ANS - [EMAIL PROTECTED] - (313)677-7311
> >                 Look both ways before crossing the Net.
> > 
> > 
> 

--
       Brian W. Spolarich - ANS - [EMAIL PROTECTED] - (313)677-7311
                 And if I die before I learn to speak,
     Will money pay for all the days I lived awake but half asleep?
