Everyone,

I know we've discussed this already, but I'd like to bring some new information to the table that I believe shows a definite bug rather than simply a "that's the way it works for now" issue.

(Jeffrey, thanks for adding an incident report for me. I'm just adding new information here.)

At the end of our previous conversation in the thread...

"Windows cache problem revisited..."
https://lists.openafs.org/pipermail/openafs-info/2003-December/011370.html

...I had resigned myself to using a small AFS cache of 48 Meg instead of 256 Meg. This seemed to solve some of the issues I was having. But now, after further testing, I find that I was wrong.

Here is our situation. In our IT group, we actually run applications directly from AFS. We have some applications installed locally, but others (many others) are installed in and run from AFS. Some of these applications are quite large and on execution require multiple megabytes of download (from AFS) to run. One application in particular is causing us grief because of its size: "ProEngineer", or ProE. Over the course of the last year we've had reports from the professors teaching the classes that it can take as long as 10 to 15 minutes to start up. We found it odd, but assumed it was because of network loading and the effect of everyone trying to run ProE at the same time.

To try to eliminate the problem we've thoroughly replicated the application to multiple servers in the same building, and set our AFS server preferences so that loading would be minimized. This hasn't had any effect. Now the end of the semester has arrived and we need to fix this problem, because the professor is saying we should install ProE locally. We don't like installing applications locally when they are runnable from the net.

Since Friday, three members of our IT staff (including me) have been testing various scenarios of starting times for the ProE application. We've tested both large and small AFS cache sizes.

Here is what we've found...

* We tested on new Dell OptiPlex GX 270 P4 3.0 GHz machines with 1 Gig RAM and 100 MBit connections to our file servers.

* We tested AFS cache sizes of both 256 Meg and 48 Meg.

* With a fresh restart of AFS and an empty cache, fs getcacheparms returns...

AFS using 100 of the cache's available 256000 1K byte blocks.

* We started ProE. The load time on average was 30 seconds. This is on par with the load time of our Sun Solaris 9 Blade 150s. The cache setting had little to no effect (as expected on first load).

* The resultant fs getcacheparms after ProE is loaded is (for 256 Meg cache)...

AFS using 57685 of the cache's available 256000 1K byte blocks.

* Starting ProE again resulted in a load time of 10-15 seconds...excellent, the cache works.

* Even with a 48 Meg cache, the load time was a decent average of 25 to 30 seconds.
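For anyone who wants to correlate the slowdowns with how full the cache is between runs, here is a small Python sketch (a hypothetical helper, not part of AFS) that parses the fs getcacheparms line quoted above. In practice you would feed it the output of the real command, e.g. via subprocess.check_output(["fs", "getcacheparms"]).

```python
import re

def parse_getcacheparms(line):
    """Parse the one-line output of 'fs getcacheparms', e.g.
       AFS using 57685 of the cache's available 256000 1K byte blocks.
    Returns (used_blocks, total_blocks, percent_used)."""
    m = re.search(r"using (\d+) of the cache's available (\d+) 1K byte blocks",
                  line)
    if m is None:
        raise ValueError("unrecognized fs getcacheparms output: %r" % line)
    used = int(m.group(1))
    total = int(m.group(2))
    return used, total, 100.0 * used / total

if __name__ == "__main__":
    line = "AFS using 57685 of the cache's available 256000 1K byte blocks."
    used, total, pct = parse_getcacheparms(line)
    print("%d/%d blocks (%.1f%% full)" % (used, total, pct))
```

Logging this number before and after each ProE start makes it easy to spot the point where the cache fills up.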

Now this is not what we observed when we walked into our labs and started our testing. When we first sat down at the machines and ran ProE cold, it loaded in about 2 to 5 minutes. So we thought something must be happening to the AFS client during the day that would cause it to go into "slow mode". We always restart our AFS service at 4:00am and delete the cache (via a scheduled script), so we are assured of a fresh cache in the morning. The problem is that various students log on to the lab machines during the day, and something about that usage is triggering the anomaly.

So we immediately thought to check what would happen if we overflowed the cache. What I mean here is simply to set the cache to some size, then load enough files from AFS to more than fully utilize the cache. When we did this our load time for ProE went down the drain. Instead of loading quickly, or even in the usual average of 30 seconds, it started to take upwards of a minute. This was even after we had stopped ProE and restarted it. It is almost as if there is a "leak" somewhere in the service that causes it to slow to a crawl, using up all the CPU.
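To make the overflow step reproducible for others, here is a rough Python sketch of the procedure. It assumes scratch_dir is a directory (in AFS, for the real test) you can write to; in our actual tests the "fill" traffic came from reading real application files through the cache, but the effect on cache recycling should be the same. Time read_all() on a known file before and after overflowing and compare.

```python
import os
import time

def read_all(path, bufsize=64 * 1024):
    """Read a file end to end (pulling it through the cache);
    return elapsed seconds."""
    start = time.time()
    with open(path, "rb") as f:
        while f.read(bufsize):
            pass
    return time.time() - start

def overflow_cache(scratch_dir, cache_bytes, file_bytes=8 * 1024 * 1024):
    """Write throwaway files under scratch_dir until their total size
    exceeds cache_bytes, forcing the cache manager to recycle blocks.
    Returns (bytes_written, list_of_file_names)."""
    chunk = b"\0" * (64 * 1024)
    written = 0
    names = []
    i = 0
    while written <= cache_bytes:
        name = os.path.join(scratch_dir, "fill%04d.dat" % i)
        with open(name, "wb") as f:
            remaining = file_bytes
            while remaining > 0:
                n = min(len(chunk), remaining)
                f.write(chunk[:n])
                remaining -= n
        written += file_bytes
        names.append(name)
        i += 1
    return written, names
```

With cache_bytes set to the configured cache size (e.g. 256 * 1024 * 1024), this reliably pushes the cache past full on our machines.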

At this point I don't believe it has anything to do with the number of handles or the amount of RAM in the machine. The problem appears to be totally within the AFS service itself.

As I had previously stated in a recent thread, I also see the problem when copying very large single files (greater than 256 Meg) to and from AFS.
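For that large-file case, a trivial timing wrapper like this (plain Python; the paths are placeholders for, e.g., a >256 Meg file in AFS and a local destination) makes the throughput drop easy to demonstrate and to report in consistent units:

```python
import os
import shutil
import time

def timed_copy(src, dst):
    """Copy src to dst; return (bytes_copied, seconds, MB_per_sec)."""
    start = time.time()
    shutil.copyfile(src, dst)
    elapsed = time.time() - start
    size = os.path.getsize(dst)
    rate = (size / (1024.0 * 1024.0)) / elapsed if elapsed > 0 else float("inf")
    return size, elapsed, rate
```

Running it once with the file larger than the cache and once with it smaller shows the difference immediately.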

I'm not sure if this is a cache problem, or a problem somewhere else in the AFS code, but it sure seems to be losing track of some important buffered information somewhere.

If anyone needs any more data, I'll be happy to provide it.

Help is appreciated,

Thanks again,

Rodney

Rodney M. Dyer
Windows Systems Programmer
Mosaic Computing Group
William States Lee College of Engineering
University of North Carolina at Charlotte
Email: [EMAIL PROTECTED]
Web: http://www.coe.uncc.edu/~rmdyer
Phone (704)687-3518
Help Desk Line (704)687-3150
FAX (704)687-2352
Office  267 Smith Building

_______________________________________________
OpenAFS-info mailing list
[EMAIL PROTECTED]
https://lists.openafs.org/mailman/listinfo/openafs-info
