Re: [OpenAFS] Puzzler: lack of access to AFS files

Steve Simmons Thu, 13 Dec 2007 11:25:15 -0800


On Dec 12, 2007, at 11:06 PM, Jeffrey Altman wrote:

...Stupid things like re-using objects that were recently accessed
because the queues did not track objects in the order of most recent
use. Being forced to read data or directory entries from the file
server that was just written by the client because data buffer version
numbers weren't incremented when merging the updated status data
received as a result of the write or the failure to locally update the
directory entries when possible. Re-issuing FetchStatus calls on
.readonly volumes prematurely because the volume callback expirations
were not tracked by each object in the volume.  Some of the changes
result in improved performance of the client when measured by
throughput. Other changes reduced the CPU time required by the client
but most of all, the improvements have reduced network traffic and load
on the file servers.
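
(For anyone not familiar with the first item in the quoted text: a cache that evicts in least-recently-used order has to move an object to the "most recent" end of its queue every time the object is touched, otherwise objects that are still in active use get recycled. The following is a toy sketch of that idea only. It is not OpenAFS code; the class name LruCache and its methods are made up for illustration.)

```python
from collections import OrderedDict


class LruCache:
    """Toy cache that evicts the least recently used entry.

    Hypothetical illustration; not taken from the OpenAFS client.
    """

    def __init__(self, capacity):
        self.capacity = capacity
        self._entries = OrderedDict()  # oldest entry first, newest last

    def get(self, key):
        if key not in self._entries:
            return None
        # The step the quoted text says was missing: record the access
        # so this entry is not the next one to be recycled.
        self._entries.move_to_end(key)
        return self._entries[key]

    def put(self, key, value):
        if key in self._entries:
            self._entries.move_to_end(key)
        self._entries[key] = value
        if len(self._entries) > self.capacity:
            self._entries.popitem(last=False)  # evict least recently used

    def keys(self):
        """Return keys ordered from least to most recently used."""
        return list(self._entries)
```

If the move_to_end() call in get() were left out, eviction would follow insertion order, and an object that was read a moment ago could be the next one thrown away.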


Putting on my old software guy hat for a moment . . .

With help from Dan Hyde, I've made a few brief trips through the afs source code, though mostly on the server side. There are some wonderful things in there, but there are also several categories of dog. I'm getting incredibly itchy to dive into the vos copy/move/shadow code and refactor it. When finishing up the shadow work we saw lots of opportunities for improvement, but we didn't make those changes because we want the code to be accepted. Doing an unsolicited mass rewrite is no way to get code accepted. So we're going slow and careful. Once we've convinced the Elders of our general competence, then we might refactor that code.

Some of the changes have unfortunately triggered bugs in the file servers
that in turn have to be fixed.

Sooo true, and not just in client-server relations. During the shadow work, Dan found a condition where an interrupted volume operation would cause the original to be deleted. But once one started using shadows, it was possible to start making a shadow, interrupt, and the cleanup would blow away the original rather than cleaning up the incomplete shadow. Oops. Yes, the bug is fixed in production AFS and in the 1.5 line. But it's been there latent for years.

Similarly, we're convinced some of the issues that we have been working on lately are present because recent fixes to the locking code uncovered bugs that had been there since Transarc days. Sort of a microcosm of what FreeBSD went through in implementing the removal of giant() so multiprocessing really worked right. Once we figured out what the problem class was, Dan spent a great deal of time poring over other code sections that might have similar issues. He verified some code is clean, found and fixed some others. Did we get everything? Good question.

All of which is a roundabout way of saying that as active work on AFS keeps ramping up, we'll keep finding, fixing, and unfortunately revealing bugs. Some of these would be best done by major refactoring of the code. We will not attempt that regional refactoring until we have a solid enough understanding of the code as a whole and we've convinced the elders that we're competent to do it. Why? Because reading a four-line bug fix is easy; verifying it doesn't break anything is easy. Reading 2,000 lines of replacement code is damned hard. Writing it is hard. Verifying that you're introducing fewer new bugs than you're fixing is even harder. So start small.

At this point, the topic has drifted pretty far from the original. I'll write a separate note on other things relating to this.


Steve
_______________________________________________
OpenAFS-info mailing list
[email protected]
https://lists.openafs.org/mailman/listinfo/openafs-info
