Brian,
I can't write much right now, but we are running our HTTP server on AFS
with no problems. I am running on a Sun SPARC 5 with a 2GB cache under
Solaris 2.5 and AFS 3.4. Again, no problems at this time...Mic
----------
> From: Brian W. Spolarich <[EMAIL PROTECTED]>
> To: [EMAIL PROTECTED]
> Subject: AFS/HTTP Server Performance
> Date: Tuesday, September 17, 1996 8:44 AM
>
>
> I'm attaching a writeup of some tests I did a few months ago using AFS
> to serve data to a heavily-loaded HTTP server. At that time I did not
> have an opportunity to follow up on some of the unanswered questions,
> and didn't feel I had characterized the situation clearly enough.
>
> The problems that we saw happened when we started the test scenario
> against a "cold" cache with a moderate (60Mb) amount of data to retrieve.
> In this scenario, the tests would proceed for a few minutes, and then the
> AFS client/HTTP server would lose contact with the AFS server. Adding a
> second database server did not solve the problem.
>
> If anyone has any thoughts on this, I'd appreciate hearing them. Some
> differences between these somewhat informal tests that I ran back in May
> and what we're going to do now include a change in operating system
> (Solaris 2.5.1 instead of 2.4), and Ultras (on 10Mb/sec ethernet) instead
> of Sparc 20s. If we can get it, we'll try to run the enhanced release of
> 2.5.1 (for Web servers) and see what happens.
>
> I believe tcp_max_conn_req was set to 128 during this test, but as I
> said, I did this somewhat informally and didn't collect all of the data
> that I should have. :-]
>
> What I'm looking for is responses like "Yeah, we saw similar behaviour
> when we did something like this and fixed it by <blah>", or "You might
> try bumping up <bleh>".
>
> Transarc didn't really provide much help (although they tried to be
> helpful) as the support guy didn't feel like he had enough info to really
> understand the problem.
>
> -brian
>
>
> ---------------------------------------------------------------------------
> | APPENDIX A. AFS/HTTP SERVER PERFORMANCE EVALUATION RESULTS
> |
> ---------------------------------------------------------------------------
>
>
> Overview
> --------
> This test scenario is designed to do some stress testing of an HTTP
> (Web) server reading its data out of AFS versus local disk.
> The HTTP client is running some home-grown software (webbash) which
> allows us to fork off multiple simultaneous threads which act as HTTP
> clients, requesting documents from a list. The document testbed is a
> set of files containing random ASCII characters ranging in size from
> 0k to 1Mb, initially totalling 2.2Mb. Copying this testbed into
> subdirectories [a..z] yields a testbed of 60Mb.
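The generation script for the testbed isn't included in the message; a minimal sketch of the idea it describes (random ASCII files of assorted sizes, replicated into subdirectories a..z), with the file names, directory layout, and helper name all invented for illustration:

```python
# Hypothetical re-creation of the testbed described above: a "master" set of
# files filled with random ASCII, copied into subdirectories a..z to
# multiply the data 26-fold (2.2Mb -> roughly 60Mb in the setup described).
import os
import random
import shutil

def make_testbed(root, sizes, subdirs="abcdefghijklmnopqrstuvwxyz"):
    master = os.path.join(root, "master")
    os.makedirs(master, exist_ok=True)
    for i, size in enumerate(sizes):
        # random printable ASCII bytes; each file's size comes from `sizes`
        data = bytes(random.choices(range(32, 127), k=size))
        with open(os.path.join(master, "file%03d.txt" % i), "wb") as f:
            f.write(data)
    # replicate the master set into each single-letter subdirectory
    for d in subdirs:
        shutil.copytree(master, os.path.join(root, d), dirs_exist_ok=True)
```

For the actual test the size list would run from 0 bytes up to 1Mb so that the master set totals about 2.2Mb.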
>
> Test Environment
> ----------------
> The test environment consists of three machines isolated from the rest of
> the network via an ethernet hub. I connect to the machines via a Cisco
> terminal server which is not isolated from the local network. All
> machines are running Solaris 2.4 w/o the recommended patches. :-]
>
> [prod-1b] AFS Fileserver Sparc20/128Mb
> AFS Client
>
> [ log ] AFS Client / Sparc20/64Mb
> HTTP Server
>
> [prod-2a] HTTP Client Sparc20/128Mb
> ("bash" program)
> AFS Client
> (later becomes an additional AFS file/dbserver)
>
>
> Explanation of the Fields Below
> -------------------------------
> Data Source - local disk or AFS.
> Client Threads - number of simultaneous threads created by "webbash".
>         Each thread reads the file list and randomizes it. Each thread
>         will time out after ten seconds if it does not receive data
>         from the HTTP server.
> Iterations - number of times the test suite iterates over the list of
>         files to retrieve.
> Cache/Data Ratio - the ratio of AFS cache size to data testbed size.
> Daemons (afsd) - number of extra afsd processes run to handle service
>         requests.
> Volume Type - ReadWrite or ReadOnly. The lack of per-file callbacks on
>         ReadOnly volumes should reduce AFS client-server traffic.
> Throughput (Mb/Hr) - as reported by webbash, in megabytes per hour.
> HTTP Ops/Sec - number of HTTP operations serviced per second. This is
>         probably the real "throughput benchmark".
> Comments - describes various events that may have happened during the
>         test. See the key below the table for details.
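The fields above map onto a simple threaded fetch loop. Since webbash itself isn't included in the message, here is a rough sketch of the same idea (function names, the stats layout, and error handling are all guesses): each thread shuffles the document list, fetches every entry with a ten-second timeout, and the driver derives the Mb/Hr and Ops/Sec figures at the end.

```python
# A guess at the shape of the "webbash" test driver described above.
import random
import threading
import time
import urllib.request

def worker(base_url, doc_list, iterations, stats, lock, timeout=10.0):
    docs = list(doc_list)
    for _ in range(iterations):
        random.shuffle(docs)                # each thread randomizes the list
        for doc in docs:
            try:
                with urllib.request.urlopen(base_url + doc,
                                            timeout=timeout) as resp:
                    body = resp.read()
                with lock:
                    stats["ops"] += 1
                    stats["bytes"] += len(body)
            except Exception:               # timeout or HTTP error
                with lock:
                    stats["errors"] += 1

def run_test(base_url, doc_list, threads=10, iterations=10):
    stats = {"ops": 0, "bytes": 0, "errors": 0}
    lock = threading.Lock()
    start = time.time()
    pool = [threading.Thread(target=worker,
                             args=(base_url, doc_list, iterations, stats, lock))
            for _ in range(threads)]
    for t in pool:
        t.start()
    for t in pool:
        t.join()
    elapsed = time.time() - start
    # the two throughput figures reported in the table below
    stats["ops_per_sec"] = stats["ops"] / elapsed if elapsed else 0.0
    stats["mb_per_hr"] = stats["bytes"] / 1e6 * 3600 / elapsed if elapsed else 0.0
    return stats
```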
>
>                        Cache/                       Through-
> Data    Client  Itera- Data   Data   Daemons Volume put       HTTP     Com-
> Source  Threads tions  Ratio  Size   (afsd)  Type   (Mb/Hr)   Ops/Sec  ments
> ------------------------------------------------------------------------------
> local   10      10     n/a    2.2Mb  n/a     n/a    4048      15.01
> afs     10      10     34/1   2.2Mb  3       RW     4053      15.04
> afs     20      10     34/1   2.2Mb  3       RW     3959      14.69    #
> afs     50      10     34/1   2.2Mb  3       RW     3795      13.98    #+
> afs     100     10     34/1   2.2Mb  3       RW     4306      23.06    #+
> afs     200     10     34/1   2.2Mb  3       RW     5208      31.31    #+
> afs     10      5      1.25   60Mb   3       RW     3793      37.59    !@
> afs     1       5      1.25   60Mb   3       RW     1504      5.57
> afs     3       3      1.25   60Mb   3       RW     3155      11.74    $%
> afs     5       2      1.25   60Mb   5       RO     2927      12.53    !@
>
> (A second file/db server was added to the AFS cell at this point;
> the following were warm-cache reads.)
> afs     10      1      1.25   60Mb   5       RO     3957      14.68    #$
> (Then I flushed the volume from the cache, i.e. cold reads.)
> afs     10      2      1.25   60Mb   5       RO     1524      8.70     !+#+@+
>
> Comments Key:
> ! - Cell "thrashing": AFS client loses contact with fs and volserver.
>     AFS client freezes.
> @ - HTTP server returned "404 Not Found" errors (i.e. file did not
>     exist).
> # - Large files (500K+) took more than 10 seconds to transfer.
> $ - Timeout (>10 sec) trying to read data.
> % - HTTP server returned "504" error.
>
> + - Modifier to the above: the event occurred many (generally >10) times.
>
> Problems Observed
> -----------------
>
> Interestingly, reading data from a small-to-moderate local AFS cache
> appears to be slightly faster than reading from local disk. This seems
> at odds with work done by Michael Stolarchuck at U-M, which suggested
> ways to improve AFS cache read performance. The difference is small
> and may be a statistical anomaly.
>
> The biggest problem we've seen has been timeouts between the AFS client
> and server(s) as the data is being fetched into the cache. Generally
> what happens is that I observe an initial burst of traffic between the
> AFS client and server, during which I see a large number of network
> collisions. After a period of time, the traffic comes to a halt (I'm
> watching the lights on the hub). The AFS client/HTTP server will at
> this point freeze for a while, and will eventually report that the
> server(s) for the cell are unavailable. This causes the HTTP server to
> report that the files the various threads are trying to fetch do not
> exist, which generates the "404 Not Found" errors.
>
> AFS timeouts generally look like this:
>
> afs: Lost contact with file server 198.83.22.104 in cell test.ans.net
>      (all multi-homed ip addresses down for the server)
> afs: Lost contact with file server 198.83.22.104 in cell test.ans.net
>      (all multi-homed ip addresses down for the server)
> afs: file server 198.83.22.104 in cell test.ans.net is back up
>      (multi-homed address; other same-host interfaces may still be down)
> afs: file server 198.83.22.104 in cell test.ans.net is back up
>      (multi-homed address; other same-host interfaces may still be down)
>
> Once the data is in the cache, the timeouts do not happen. With a
> larger number of client threads, the timeouts seem to happen more
> frequently. When they do, the AFS client machine appears frozen from
> the serial console. After a second file/dbserver was added to the
> cell, timeouts occurred for both AFS servers, although the initial
> timeouts were only to the new fileserver, which held one of the
> ReadOnly copies of the data I was trying to fetch.
>
> Only the AFS client itself seems to pause during this time. The main AFS
> server and HTTP client do not seem to be affected.
>
> During one test, the AFS client became unpingable and had to be rebooted
> by sending a break to the console.
>
> Conclusions
> -----------
> The artificiality of the test scenario makes it hard to determine
> whether these problems would occur in Real Life. Multiple parallel
> HTTP GET streams arriving at this rate may not be a realistic
> scenario. I don't know "how much" traffic each thread really
> represents in real terms.
>
> Still, we did observe some definite problems with the AFS client that
> may or may not be tunable. Adding a second file/dbserver initially
> appeared to help, but after flushing the volume from the cache it
> turned out to have changed nothing. Increasing the number of afsd
> processes helped somewhat (it seemed to take longer for the client to
> lose contact with the server), but didn't improve the situation much.
>
> I know that AFS is in use under some relatively heavy usage conditions
> at U-M and NCSA. I know that U-M's WWW servers are considered "slow"
> by their user populace (I helped administer them for a short time),
> but this is probably due to a number of factors, including network
> congestion, overburdened fileservers, and the growing pains of a very
> large AFS cell. On the other hand, I don't generally consider NCSA's
> site to be particularly slow.
>
> Perhaps a more rational way of getting the cache seeded beforehand would
> help.
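One way to seed the cache would be to walk the testbed and read every file once before starting the load, so the AFS client fetches the data sequentially at its own pace rather than under the full parallel HTTP load. A minimal sketch (the helper name and chunk size are arbitrary; any sequential read of the tree would do):

```python
# Warm the AFS cache by reading every file under `root` once. Reading the
# data pulls each chunk through the client cache ahead of the stress test.
import os

def warm_cache(root, chunk=64 * 1024):
    files = bytes_read = 0
    for dirpath, _, names in os.walk(root):
        for name in names:
            with open(os.path.join(dirpath, name), "rb") as f:
                while True:
                    block = f.read(chunk)   # discard; the read is the point
                    if not block:
                        break
                    bytes_read += len(block)
            files += 1
    return files, bytes_read
```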
>
> I'd like to see some response from Transarc on this.
>
>
> --
> Brian W. Spolarich - ANS - [EMAIL PROTECTED] - (313)677-7311
> Look both ways before crossing the Net.
>
>