Simon Wilkinson wrote:
I'm in the process of updating the Michigan disconnected operation code for the Unix tree, so here are my thoughts on what I'm doing there. Bear in mind that none of this has been accepted into the tree yet! Sorry for polluting the Windows list with Unix comments (I've set followups to openafs-devel).

Jeffrey Altman wrote:

Disconnected operation should not be a global setting. That is acceptable for a research project that demonstrates the capability, but it is not acceptable for real-world environments in which some servers or cells may not be accessible while others remain accessible.

I guess this depends on what you're trying to achieve by providing disconnected operation, and the quality of the user experience you can provide when performing reintegration. Looking at other disconnected systems, one of the usability challenges of Coda is that clients can go disconnected without the user's knowledge, so the user can end up having to resolve integration conflicts which aren't of their making, and of which they were completely unaware. This tends to score badly for usability, as it violates some of the user's fundamental assumptions. Providing a system which requires an explicit 'go disconnected' step has the advantage that the user is aware both of when they disconnected from the network and when they reconnected. This allows them to rationalise any conflict resolution steps that they have to perform.

That's not to say that 'opportunistic' disconnection (as I'm christening the solution you outline - where the cache manager continues to serve files for which it had a valid callback when the file server disappeared, without any user interaction) doesn't have real uses - I just think that the usability challenges are far higher.
But you are forgetting the "must intentionally configure what you want to be able to use offline" step. What the Windows cache manager does today is optimistic disconnection: if the data was actively in use when connectivity to the server was lost, do not fail a request immediately if it can be served by the cache manager without the help of the file server. However, the minute an operation that does require the file server takes place (close file, ACL check, ...), the error is immediate.
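Roughly - and this is purely an illustrative sketch, with every name below invented rather than taken from the actual cache manager code - the decision looks something like this:

#include <errno.h>

/* Purely illustrative sketch of the decision, not the actual Windows cache
 * manager code -- all names here are invented for the example. */

struct cached_file {
    int have_valid_callback;   /* callback was valid when the server vanished */
    int data_in_cache;         /* the requested range is resident in the cache */
};

enum op_kind { OP_READ, OP_CLOSE, OP_ACL_CHECK };

static int server_reachable;   /* cleared when connectivity to the server is lost */

int handle_request(struct cached_file *f, enum op_kind op)
{
    if (server_reachable)
        return 0;                      /* normal path: go to the file server */

    /* Anything that requires the file server fails immediately. */
    if (op == OP_CLOSE || op == OP_ACL_CHECK)
        return -ETIMEDOUT;

    /* Otherwise, serve the request from the cache if we safely can. */
    if (f->have_valid_callback && f->data_in_cache)
        return 0;                      /* "optimistic" disconnection */

    return -ETIMEDOUT;
}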
Usable offline systems are those in which the user specifies up front what portions of the network file space are required for offline use, and what the policies for those objects are. There is a huge difference between the offline behavior of read-only objects and that of read-write objects.
For data which is pre-configured for offline use there is no harm in switching to disconnected operation. It's a lot better than having the application crash because one of its DLLs can no longer be read, or having its data be lost because its file handle is no longer valid.
(1) How do you ensure that you have all of the data for all of the files and directories that the user wishes to access in the cache? AFS caches arbitrary blocks, not whole files or directories.

I'll add to this:

1a) How do you ensure that the data you have in the cache is sufficiently recent to be of use to the client?

The naive mechanism, as implemented by the Michigan code, just serves whatever happens to be in the cache back to the user. The problem is that, depending on the size of your cache relative to your normal working set, it's possible that you might get files that are months out of date. The normal AFS way of resolving this is to hold callbacks for these files - you could extend this to disconnected operation by adding 'pinning' functionality, where a user indicates to the cache manager that they want a particular file to be available offline, and the cache manager ensures that it's always up to date on the client. However, if you attempt to hold callbacks for every file in a user's offline set, then you're likely to cause severe callback storms with the fileserver (multiple clients hold more than the fileserver's maximum number of callbacks, the fileserver starts breaking older callbacks, clients see the callback breaks and attempt to update their pinned files, the fileserver creates new callbacks for these, and round and round we go).

The question of how we ensure acceptable recency, without making fileserver changes, is a tricky one.
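Purely as an illustration of the kind of thing I mean, a pinning pass could poll the data version of each pinned file on a timer rather than holding a callback for it. All of the names below are invented, and the stubs stand in for the relevant RPCs:

#include <time.h>

/* Illustrative only: a pinning pass that polls the data version of each
 * pinned file on a timer instead of holding a callback for it.  None of
 * these names exist in the tree. */

struct pinned_file {
    const char *path;
    long cached_data_version;   /* version of the copy in the local cache */
    time_t last_checked;
};

static long server_data_version(const char *path) { (void)path; return 0; }  /* stub */
static int  refetch_into_cache(const char *path)  { (void)path; return 0; }  /* stub */

#define RECHECK_INTERVAL (15 * 60)    /* policy knob: how stale is acceptable */

void refresh_pinned(struct pinned_file *files, int nfiles)
{
    time_t now = time(NULL);

    for (int i = 0; i < nfiles; i++) {
        if (now - files[i].last_checked < RECHECK_INTERVAL)
            continue;

        long dv = server_data_version(files[i].path);
        if (dv != files[i].cached_data_version &&
            refetch_into_cache(files[i].path) == 0)
            files[i].cached_data_version = dv;

        files[i].last_checked = now;
    }
}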
This is less of an issue if you are using an offline model that is stored outside of AFS and uses redirection. It is really important to understand the usage model for which offline access is required.
There are just a handful of use cases that seem to be of critical importance:
(1) Distribution of read-only data. Applications and documents. These objects do not change often and only need to be synchronized on a periodic basis. What is important is that a consistent set of the data be available when required.
(2) User Profiles. Read-write. The synchronization rule used by Windows is "last writer wins," so the rule for a profile is "if a conflict is detected, use the local copy." User profiles are synchronized only at login and logout. Intermediate changes do not matter - or at least Windows doesn't check.
(3) Home directories and Shared Project directories. Read-write. These are frequently used files which change often. Periodic checks must be made to ensure that local copies are up to date, and file sets must be maintained consistently. Policy determines how often the synchronization checks occur when a file set is not in use. As soon as any object in a file set is touched, the entire file set must be synchronized and kept current until the file set is no longer in use. Write-backs to the file server can result in collisions. Policies can be assigned to file sets: "server copy wins, local copy wins, prompt user," etc.
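As an illustration only (none of these names, nor the example path, correspond to an existing interface), per-file-set policy might be expressed along these lines:

/* Sketch of how per-file-set policy might be expressed -- the names and the
 * example path are invented for illustration, not an existing interface. */

enum conflict_policy {
    POLICY_SERVER_WINS,
    POLICY_LOCAL_WINS,
    POLICY_PROMPT_USER
};

struct fileset_policy {
    const char *root;              /* root of the file set in the AFS namespace */
    int read_only;                 /* case (1): read-only distribution sets */
    unsigned idle_check_seconds;   /* how often to re-check while not in use */
    enum conflict_policy on_conflict;
};

/* Example: a shared project directory, re-checked every ten minutes while
 * idle, with the server copy winning on conflict. */
static const struct fileset_policy example_project = {
    "/afs/example.org/project/widgets",   /* hypothetical path */
    0,
    600,
    POLICY_SERVER_WINS
};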
File server callback storms will not take place if the synchronization logic is not primarily dependent on the existing callback mechanism.
(2) How do you synchronize read and write locks when the file server is not accessible?

It's relatively easy to maintain a list of the locks granted by the cache manager whilst in disconnected mode, and you can ensure that the locking protects processes running on the same machine from each other. The issue is what you do when reconnecting. The cache manager replays the list of locally granted locks to the fileserver, and all is well if it grants them. But what happens if the fileserver refuses a lock? You can't recall locks which have already been issued, so you can end up with a process happily writing to a file under what it believes is a write lock, whilst it actually has no lock at all on the server. As I see it, there are three options: 1) ignore the problem; 2) fail reads and writes to that file descriptor as soon as the lock fails; 3) 'defer' reintegration of that file until it is closed, and deal with the problem then.

This is a much bigger issue on Windows than Unix, though.
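To make the three options concrete, here's a rough sketch of what the replay step could look like; all of the names are invented, and the stub stands in for the lock RPC:

#include <errno.h>

/* Illustrative sketch of replaying locally granted locks at reconnection
 * time, with the three options above as a policy switch. */

enum lock_failure_policy {
    IGNORE_FAILURE,       /* option 1: pretend the lock was granted */
    FAIL_DESCRIPTOR,      /* option 2: fail further I/O on that descriptor */
    DEFER_REINTEGRATION   /* option 3: keep the file disconnected until close */
};

struct local_lock {
    int fd;
    int is_write_lock;
    int server_granted;   /* the fileserver accepted the lock on replay */
    int io_failed;        /* option 2: subsequent reads/writes return an error */
    int deferred;         /* option 3: reintegrate this file at close time */
};

static int server_set_lock(const struct local_lock *l) { (void)l; return -EAGAIN; }  /* stub */

void replay_locks(struct local_lock *locks, int n, enum lock_failure_policy policy)
{
    for (int i = 0; i < n; i++) {
        if (server_set_lock(&locks[i]) == 0) {
            locks[i].server_granted = 1;
            continue;
        }
        switch (policy) {
        case IGNORE_FAILURE:
            break;
        case FAIL_DESCRIPTOR:
            locks[i].io_failed = 1;
            break;
        case DEFER_REINTEGRATION:
            locks[i].deferred = 1;
            break;
        }
    }
}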
There are two components here. The lock and the data version. If a file is currently open and the application is accessing the file in disconnected mode, then continue to treat that file set as disconnected until all the file handles are closed. Then perform whatever synchronization policy applies for the file set.
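As a rough sketch (invented names again), that amounts to gating synchronization on the last close:

/* Illustrative: defer synchronization of a file set until the last open
 * handle on it is closed, then apply whatever policy the set carries. */

struct file_set {
    int open_handles;   /* handles currently open anywhere in the set */
    int dirty;          /* modified while disconnected */
};

static void run_sync_policy(struct file_set *fs) { (void)fs; }  /* stub */

void on_handle_close(struct file_set *fs)
{
    if (--fs->open_handles == 0 && fs->dirty) {
        run_sync_policy(fs);   /* server wins / local wins / prompt, per policy */
        fs->dirty = 0;
    }
}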
I should point out that this is becoming a bigger issue for UNIX as applications get used to CIFS semantics. Notice the behavior of Open Office, for example.
(3) How do you interact with the end user to notify them of collisions, and what do you do when there are collisions?

I'm currently implementing a collision resolution policy of "last closer wins". Whilst this does have the potential to cause significant data loss, it has the big advantage over more complex resolution policies that it's explainable to, and understandable by, the user. At the moment collisions get logged in the system log. It would be possible to take advantage of some of the new desktop technologies appearing for Unix to get those messages closer to the user (although, on multi-user machines, desktop-based notifications break down).
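A minimal sketch of what "last closer wins" reintegration amounts to, with invented names and stubs in place of the real store path:

#include <stdio.h>

/* Sketch of "last closer wins": the local copy is simply stored over
 * whatever is on the server, and a detected collision is logged rather
 * than resolved. */

struct replay_entry {
    const char *path;
    long dv_at_disconnect;   /* server data version when we went offline */
};

static long server_data_version(const char *p) { (void)p; return 1; }  /* stub */
static int  store_local_copy(const char *p)    { (void)p; return 0; }  /* stub */

void reintegrate(const struct replay_entry *e)
{
    if (server_data_version(e->path) != e->dv_at_disconnect)
        /* Someone else changed the file while we were disconnected. */
        fprintf(stderr, "collision on %s: local copy wins\n", e->path);

    store_local_copy(e->path);   /* last closer wins */
}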
I do not believe this model is deployable in the Windows world.
(5) How do you address access control issues for files that are offline?

The Michigan code simply disables access control when a machine goes offline. With the Unix model this is more acceptable - machines only go offline with an explicit command, which can only be issued by the super user, and the super user has access to the cache contents anyway. However, this doesn't help people who have implemented access controls to protect themselves from silly mistakes.

I've got a provisional implementation of 'local' tokens which can be used to convey CPS information from userland to the cache manager, but which won't be usable in a connected environment. My eventual plan is that it will be possible to 'stash' access data for a particular userid to a file, from where it can be reloaded while the cache manager is offline. However, as soon as you start using these you run into ...
By pulling offline operations out of the cache manager and implementing them with a redirector model, I believe that all of these issues can be avoided. Synchronization requires AFS credentials. Offline access requires local machine credentials and is enforced by the local file system, based upon the user rights granted at the time the synchronization was configured.
(6) How do you ensure that the files are synchronized back to the file server with the same user credentials that were intended to be used when the files were modified?

This is tricky. I don't (yet) have a good answer to this one. At the moment, all replays have to come from a single identity (and their token had better be valid when reintegration starts).
Yep.