On Dec 12, 2007, at 11:06 PM, Jeffrey Altman wrote:

...Stupid things like re-using objects that were recently accessed because the queues did not track objects in the order of most recent use. Being
forced to read data or directory entries from the file server that was
just written by the client because data buffer version numbers weren't
incremented when merging the updated status data received as a result of
the write or the failure to locally update the directory entries when
possible. Re-issuing FetchStatus calls on .readonly volumes prematurely because the volume callback expirations were not tracked by each object
in the volume.  Some of the changes result in improved performance of
the client when measured by throughput. Other changes reduced the CPU
time required by the client but most of all, the improvements have
reduced network traffic and load on the file servers.

Putting on my old software guy hat for a moment . . .

With help from Dan Hyde, I've made a few brief trips through the afs source code, though mostly on the server side. There are some wonderful things in there, but there are also several categories of dog. I'm getting incredibly itchy to dive into the vos copy/move/ shadow code and refactor it. When finishing up the shadow work we saw lots of opportunities for improvement, but we didn't make those changes because we want the code to be accepted. Doing an unsolicited mass rewrite is no way to get code accepted. So we're going slow and careful. Once we've convinced the Elders of our general competence, then we might refactor that code.

Some of the changes have unfortunately triggered bug in the file servers
that in turn have to be fixed.

Sooo true, and not just in client-server relations. During the shadow work, Dan found a condition where an interrupted volume operation would cause the original to be deleted. But once one started using shadows, it was possible to start making a shadow, interrupt, and the cleanup would blow away the original rather than cleaning up the incomplete shadow. Oops. Yes, the bug is fixed in production AFS and in the 1.5 line. But it's been there latent for years.

Similarly, we're convinced some of the issues that we have been working lately are present because recent fixes to the locking code uncovered bugs that had been there since Transarc days. Sort of a microcosm of what FreeBSD went through in implementing the removal of giant() so multiprocessing really worked right. Once we figured out what the problem class was, Dan spent a great deal of time poring over other code sections that might have similar issues. He verified some code is clean, found and fixed some others. Did we get everything? Good question.

All of which is a roundabout way of saying that as active work on AFS keeps ramping up, we'll keep finding, fixing, and unfortunately revealing bugs. Some of these would be best done by major refactoring of the code. We will not attempt that regional refactoring until we have a solid enough understanding of the code as a whole and we've convinced the elders that we're competent to do it. Why? Because reading a four-line bug fix is easy; verifying it doesn't break anything is easy. Reading 2,000 lines of replacement code is damned hard. Writing it is hard. Verifying that you're introducing fewer new bugs than you're fixing is even harder. So start small.

At this point, the topic has drifted pretty far from the original. I'll write a separate note on other things relating to this.

Steve
_______________________________________________
OpenAFS-info mailing list
[email protected]
https://lists.openafs.org/mailman/listinfo/openafs-info

Reply via email to