On Tue, 27 Nov 2012, Sam Lang wrote:
> Hi Noah,
>
> I was able to reproduce your issue with a similar test using the fuse client
> and the clock_offset option for the mds. This is what I see happening:
>
> clientA's clock is a few seconds behind the mds clock
>
> clientA creates the file
> - the mds sets the mtime from its current time
> - clientA acquires the exclusive capability (cap) for the file
>
> clientA writes to the file
> - the mtime is updated locally (at clientA with its current time)
>
> clientA closes the file
> - the exclusive cap is flushed to the mds, but the mtime is less
> than the create mtime because of the clock skew, so the mds
> doesn't update it to the mtime from clientA's write
>
> clientA stats the file
> - the mtime from the write (still cached) gets returned. I saw a
> race in my tests, where sometimes the mtime was from the cache
> (if the flush hadn't completed I assume), and sometimes it was
> from the mds.
>
> clientB stats the file
> - the exclusive cap is revoked at clientA, but the mtime returned
> to clientB is from the mds
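Right -- to make that drop concrete, the rule applied on cap flush amounts to
something like this (a sketch with made-up names, not the actual mds code):

```c
#include <time.h>

/* Sketch of the non-decreasing mtime rule described above.  The names
 * are illustrative only; the real mds code is structured differently. */
static time_t apply_cap_flush_mtime(time_t stored_mtime, time_t client_mtime)
{
    /* Only move mtime forward: a client mtime that is behind the stored
     * (create) mtime, e.g. because of clock skew, is silently dropped. */
    return client_mtime > stored_mtime ? client_mtime : stored_mtime;
}
```

With clientA's clock a few seconds behind, the write's mtime loses to the
create mtime the mds already stored, which is exactly the behavior above.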
>
> The goal of the current implementation is to provide an mtime that is
> non-decreasing, but that conflicts with using mtime as a version in this case.
> Using mtime as a version has its own set of problems, but I won't go into that
> here. I think there are a few alternatives if we want to try to have a more
> consistent mtime value across clients.
>
> 1. Let the client set the create mtime. This avoids the issue that the mds
> and client clocks are out of sync, but in other cases where the client has a
> clock a few seconds ahead of other clients, we run into a similar problem.
> This might be reasonable, considering that clients sharing state are more
> likely to have synchronized clocks with each other than with the mds.
I like this option the best. It will clearly break when client clocks are
out of sync and multiple clients write to the file, but I think that is
the price you pay for client-driven writeback.
Noah, is that sufficient to resolve the hadoop race? Is there a single
client writer?
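For what it's worth, 1 is also the least invasive: it just means the client
stamps the create with its own clock, roughly like this (sketch only, with
hypothetical names, not the actual Client.cc code):

```c
#include <time.h>
#include <sys/time.h>
#include <stddef.h>

/* Sketch of option 1 (hypothetical names): the client fills in the create
 * mtime from its own clock instead of letting the mds pick one, so later
 * writes from the same client can never appear to move mtime backwards. */
struct create_req {
    time_t mtime; /* client-chosen create mtime */
};

static void fill_create_mtime(struct create_req *req)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    req->mtime = tv.tv_sec; /* same clock used for later write updates */
}
```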
> 2. Provide a config option to always set the mtime on cap flush/revoke, even
> if it's less than the current mtime. This breaks the non-decreasing behavior,
> and requires the user to set a config option across the cluster if they want
> this.
We could also do this... but if the above is sufficient for hadoop I'd
rather not. :/
> 3. When a client acquires the cap for a file, have the mds provide its current
> time as well. As the client updates the mtime, it uses the timestamp provided
> by the mds and the time since the cap was acquired.
> Except for the skew caused by the message latency, this approach allows the
> mtime to be based off the mds time, so it will be consistent across clients
> and the mds. It does, however, allow a client to set an mtime in the future
> (based off of its local time), which might be undesirable, but that is more
> like how NFS behaves. Message latency probably won't be much of an issue
> either, as the granularity of mtime is a second. Also, the client can set its
> cap acquired timestamp to the time at which the cap was requested, ensuring
> that the relative increment includes the round trip latency so that the mtime
> will always be set further ahead. Of course, this approach would be a lot more
> intrusive to implement. :-)
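For concreteness, 3 amounts to something like this (a sketch with
illustrative names, not a real implementation):

```c
#include <time.h>

/* Sketch of option 3: derive mtime from the mds timestamp delivered with
 * the cap plus local elapsed time, so skew between the client and mds
 * clocks cancels out.  All names here are made up for illustration. */
struct cap_clock {
    time_t mds_time_at_grant;     /* mds time sent along with the cap */
    time_t local_time_at_request; /* client time when the cap was requested */
};

static time_t mtime_from_mds_clock(const struct cap_clock *cc, time_t now_local)
{
    /* Anchoring at the request time (not the grant time) folds the round
     * trip into the increment, so the result is never behind the mds. */
    return cc->mds_time_at_grant + (now_local - cc->local_time_at_request);
}
```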
Yeah, I'm less excited about this one.
I think that giving consistent behavior from a single client despite clock
skew is a good goal. That will make things like pjd's test behave
consistently, for example.
sage
>
> -sam
>
>
> On 11/20/2012 01:44 PM, Noah Watkins wrote:
> > This is a description of the clock synchronization issue we are facing
> > in Hadoop:
> >
> > Components of Hadoop use mtime as a versioning mechanism. Here is an
> > example where Client B tests the expected 'version' of a file created
> > by Client A:
> >
> > Client A: create file, write data into file.
> > Client A: expected_mtime <-- lstat(file)
> > Client A: broadcast expected_mtime to client B
> > ...
> > Client B: mtime <-- lstat(file)
> > Client B: test expected_mtime == mtime
> >
> > Since mtime may be set in Ceph by both the client and the MDS, an
> > inconsistent mtime view is possible when clocks are not adequately
> > synchronized.
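(The check in question boils down to something like the following -- a sketch
only; the real check lives in Hadoop's Java code:)

```c
#include <time.h>

/* Sketch of the Hadoop-style version check described above: clientB
 * compares the mtime it observes against the one clientA broadcast
 * after writing.  Illustrative only. */
static int version_matches(time_t expected_mtime, time_t observed_mtime)
{
    return expected_mtime == observed_mtime;
}
```

With skewed clocks, the write's cached mtime and the mds's stored mtime can
differ, so the same file version fails this comparison.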
> >
> > Here is a test that reproduces the problem. In the following output,
> > issdm-18 has the MDS, and issdm-22 is a non-Ceph node with its time
> > set to an hour earlier than the MDS node.
> >
> > nwatkins@issdm-22:~$ ssh issdm-18 date && ./test
> > Tue Nov 20 11:40:28 PST 2012           // MDS time
> > local time: Tue Nov 20 10:42:47 2012   // client time
> > fstat time: Tue Nov 20 11:40:28 2012   // mtime after file creation (MDS time)
> > lstat time: Tue Nov 20 10:42:47 2012   // mtime after file write (client time)
> >
> > Here is the code used to produce that output.
> >
> > #include <errno.h>
> > #include <sys/fcntl.h>
> > #include <sys/time.h>
> > #include <unistd.h>
> > #include <sys/types.h>
> > #include <sys/stat.h>
> > #include <dirent.h>
> > #include <sys/xattr.h>
> > #include <stdio.h>
> > #include <string.h>
> > #include <assert.h>
> > #include <cephfs/libcephfs.h>
> > #include <time.h>
> >
> > int main(int argc, char **argv)
> > {
> >   struct stat st;
> >   struct ceph_mount_info *cmount;
> >   struct timeval tv;
> >
> >   /* setup */
> >   ceph_create(&cmount, "admin");
> >   ceph_conf_read_file(cmount, "/users/nwatkins/Projects/ceph.conf");
> >   ceph_mount(cmount, "/");
> >
> >   /* print local time for reference */
> >   gettimeofday(&tv, NULL);
> >   printf("local time: %s", ctime(&tv.tv_sec));
> >
> >   /* create a file */
> >   char buf[256];
> >   sprintf(buf, "/somefile.%d", getpid());
> >   int fd = ceph_open(cmount, buf, O_WRONLY|O_CREAT, 0);
> >   assert(fd >= 0);
> >
> >   /* get mtime for this new file */
> >   memset(&st, 0, sizeof(st));
> >   int ret = ceph_fstat(cmount, fd, &st);
> >   assert(ret == 0);
> >   printf("fstat time: %s", ctime(&st.st_mtime));
> >
> >   /* write some data into the file */
> >   ret = ceph_write(cmount, fd, buf, sizeof(buf), -1);
> >   assert(ret == sizeof(buf));
> >   ceph_close(cmount, fd);
> >
> >   /* get mtime again after the write */
> >   memset(&st, 0, sizeof(st));
> >   ret = ceph_lstat(cmount, buf, &st);
> >   assert(ret == 0);
> >   printf("lstat time: %s", ctime(&st.st_mtime));
> >
> >   ceph_shutdown(cmount);
> >   return 0;
> > }
> >
> > Note that this output is currently using the short patch from
> > http://marc.info/?l=ceph-devel&m=133178637520337&w=2 which forces
> > getattr to always go to the MDS.
> >
> > diff --git a/src/client/Client.cc b/src/client/Client.cc
> > index 4a9ae3c..2bb24b7 100644
> > --- a/src/client/Client.cc
> > +++ b/src/client/Client.cc
> > @@ -3858,7 +3858,7 @@ int Client::readlink(const char *relpath, char *buf, loff_t size)
> >  int Client::_getattr(Inode *in, int mask, int uid, int gid)
> >  {
> > -  bool yes = in->caps_issued_mask(mask);
> > +  bool yes = false; //in->caps_issued_mask(mask);
> >
> >    ldout(cct, 10) << "_getattr mask " << ccap_string(mask) << " issued=" << yes << dendl;
> >    if (yes)
> > --
> > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> > the body of a message to [email protected]
> > More majordomo info at http://vger.kernel.org/majordomo-info.html
> >
>
>