Rodney M. Dyer wrote:
> I understand this, however you need to realize where I'm coming from.
> We support professors who have research projects that run into the
> millions of dollars. Many times these people don't know anything about
> where their data files are being saved when they choose "File->Save"
> from an application. They expect it to work. We need to be in a
> position to provide the "works" part. If they save a valuable data file
> from an application one day, then return the next and the application
> won't load it because of some random network change updated a few bytes
> here or there when the file was saved, what do we tell them? "Oh btw,
> maybe you should keep a local copy on your USB keychain unless the AFS
> network fails?" Most professors don't spend the extra time to run
> checksums on their files after the save. This kind of thing doesn't cut
> it. I'm the type of "professional" sysadmin who's willing to give up 10
> percent of my speed for guaranteed delivery. I'm not some young post
> high school geek who's got a job running a smallish home network and
> constantly boasts product x is faster than product y, and that's just
> uber cool because product y sux'ors!
The data corruption error that was discovered in January and reported by David Bolt to OpenAFS RT was fixed in the 15 February 2008 release. At the time of the announcement I stressed the importance of upgrading because of the seriousness of the error. For those who are unfamiliar: if the network dropped out for any reason during a background write to the file server, the daemon thread would drop all of the dirty buffers that were in progress on the floor and mark them as clean. The end result would be a hole in the file on the file server, containing either the previous data or a page full of zeros. This error was present in the original OpenAFS 1.0 release. IBM fixed the problem in the 3.6.2.59 release of IBM AFS for Windows. OpenAFS fixed it in 1.5.15.

As for the performance improvements, I'm not on a performance kick for the hell of it. I'm on a performance kick because large OpenAFS users have repeatedly mentioned the performance of the Windows client as one reason why they are moving away from AFS to CIFS. In addition, the file servers are experiencing serious scalability issues, and a large part of the problem is that the Windows clients have not been as smart as they could be, re-requesting data from the file servers that should have been served from the cache. Stupid things like:

 - re-using objects that were recently accessed, because the queues did not track objects in order of most recent use (see the sketch below);

 - re-reading data or directory entries from the file server that the client itself had just written, because data buffer version numbers weren't incremented when merging the updated status data returned by the write, or because directory entries weren't updated locally when possible;

 - re-issuing FetchStatus calls on .readonly volumes prematurely, because volume callback expirations were not tracked by each object in the volume.

Some of the changes improve the throughput of the client. Others reduce the CPU time the client requires. Most of all, though, the improvements reduce network traffic and load on the file servers.

Some of the changes have unfortunately triggered bugs in the file servers that in turn have to be fixed. That is the case with the GiveUpAllCallBacks RPC bug, which exists in all file servers from 1.3.50 to 1.4.5. Our attempt to be a good citizen, giving up callbacks when we know the server will be unable to contact us because we are suspended or shut down, resulted in corruption of the file server's state data and the possibility of eventual file server crashes.

I am very thankful for the efforts you put into helping track down the thread safety issues in 1.5.26, as well as the issues with the infinite-loop detection code added in 1.5.21 that resulted in client crashes. As you are well aware, the thread safety issues were particularly challenging to reproduce and identify. It is both fortunate and unfortunate that your use case was the perfect one for triggering the race condition; thanks to your efforts, it was finally fixed in 1.5.27. 1.5.28 in turn fixes additional crashes reported through the Windows Error Reporting service. Nothing significant; the crash conditions are so rare that I doubt anyone who experienced them could reproduce them.
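To make the queue complaint above concrete, here is a rough sketch of what tracking objects in order of most recent use looks like. This is illustrative C only; the structures and function names are invented for the example and are not the actual cache manager code.

    /* Illustrative only -- not the OpenAFS cache manager. */
    struct cache_obj {
        struct cache_obj *prev, *next;
        /* ... object data ... */
    };

    struct lru_queue {
        struct cache_obj *head;   /* most recently used  */
        struct cache_obj *tail;   /* least recently used */
    };

    /* Detach obj from its current position in the queue. */
    static void lru_unlink(struct lru_queue *q, struct cache_obj *obj)
    {
        if (obj->prev) obj->prev->next = obj->next;
        else           q->head = obj->next;
        if (obj->next) obj->next->prev = obj->prev;
        else           q->tail = obj->prev;
        obj->prev = obj->next = NULL;
    }

    /* Called on every access: promote obj to most recently used. */
    static void lru_touch(struct lru_queue *q, struct cache_obj *obj)
    {
        if (q->head == obj)
            return;                     /* already at the head */
        lru_unlink(q, obj);
        obj->next = q->head;
        if (q->head) q->head->prev = obj;
        q->head = obj;
        if (!q->tail) q->tail = obj;
    }

    /* When an object must be recycled, take the least recently used. */
    static struct cache_obj *lru_victim(struct lru_queue *q)
    {
        return q->tail;
    }

Recycling from the tail of such a queue evicts the object that has gone unused the longest instead of one that was touched a moment ago, which is what could happen when the queues did not preserve recency order.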
As I said over the summer, I was truly embarrassed by the quality issues in the releases from 1.5.21 to 1.5.25. I do my best to test things given the tools at my disposal. Unfortunately, I do not have a test environment that can replicate all of the possible multi-client interactions.

> I am happy with the speed improvements, and I hope we can continue to
> use AFS. However I need to be able to look at people with a straight
> face when they ask about how well AFS works.
>
> Speed? Check
> Scale? Check
> Functionality? Check
> Reliability? hrm...

You see, I would actually give us less credit than that:

Speed? Not so much, but you can get decent performance for specific classes of use cases.

Scale? Well, we have global access, but what Transarc advertised in the mid-90s as infinite scalability has not lived up to the claims. The file servers are capable of handling approximately 100 simultaneous requests, and when those requests require network traffic to query the client's identity, obtain protection data, or communicate with the volume database server, the threads sit idle, blocked on the I/O. The actual throughput of a given file server is far below what it needs to be if we are truly going to serve petabytes of data to tens of thousands of clients from each file server. (The postscript below shows the arithmetic.)

Functionality? Hmm. Much of the complexity that was added this summer for directory searching was necessary because of missing functionality in the AFS3 protocols. The locking issues that everyone runs into are also a lack of functionality. Shall we discuss Unicode object names in profile directories and the data corruption that produces? What about the inability to maintain data connectivity due to CIFS client timeouts? Do you like having Office apps crash on you? I sure don't.

Reliability? Given everything else, I actually mark reliability on the high side. At least when there is an issue, we get a fix out ASAP.

The funny thing is that even with all of the negatives I have mentioned, I actually think OpenAFS is the best it has ever been. I am finally at the point where I am willing to say to people that I think you should consider OpenAFS for new deployments. Do we still have issues? Absolutely. But we also have plans, and we have a growing number of skilled developers who are actively contributing to make OpenAFS better across a broad range of platforms. If you need a file system that is going to provide good WAN performance with federated authentication and high availability, you really can't find anything else out there.

Jeffrey Altman
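P.S. To put the scale point in concrete terms, here is a toy sketch of why a fixed pool of blocking worker threads caps throughput. This is not fileserver code; the pool size, latency, and names are invented for illustration.

    /* Toy model: POOL_SIZE worker threads, each of which blocks on a
     * slow upstream lookup (think identity, protection, or VLDB round
     * trips) while serving a request. */
    #include <pthread.h>
    #include <stdio.h>
    #include <unistd.h>

    #define POOL_SIZE 100   /* roughly the fileserver's thread limit */
    #define REQUESTS   20   /* requests served per thread in this demo */

    /* Stand-in for the upstream traffic: the thread sits idle,
     * blocked on the wire, doing no useful work. */
    static void slow_upstream_lookup(void)
    {
        usleep(50 * 1000);              /* 50 ms on the network */
    }

    static void *worker(void *arg)
    {
        int i;
        (void)arg;
        for (i = 0; i < REQUESTS; i++) {
            slow_upstream_lookup();     /* thread blocked */
            /* ... actually serve the request here ... */
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t pool[POOL_SIZE];
        int i;

        for (i = 0; i < POOL_SIZE; i++)
            pthread_create(&pool[i], NULL, worker, NULL);
        for (i = 0; i < POOL_SIZE; i++)
            pthread_join(pool[i], NULL);

        /* 100 threads each blocked 50 ms per request tops out at
         * 100 / 0.05 = 2000 requests/sec, no matter how fast the
         * disks or the network interfaces are. */
        printf("served %d requests\n", POOL_SIZE * REQUESTS);
        return 0;
    }

With every thread spending most of its time blocked, the ceiling is the pool size divided by the blocking time, and faster disks or fatter pipes do nothing to raise it.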
