Re: Solution for blocking fsync in 0.8

Jay Kreps Fri, 25 May 2012 10:22:49 -0700

It depends a great deal on the hw and the flush interval. I think for our
older-generation hw we saw an avg flush time of 40 ms. For the newer stuff we
just got it is much less, but I think that might be because the disks
themselves have some kind of NVRAM or something.


-Jay

On Fri, May 25, 2012 at 7:09 AM, S Ahmed <sahmed1...@gmail.com> wrote:

> In practice (at LinkedIn), how long do you see the calls blocked for during
> fsyncs?
>
> On Thu, May 24, 2012 at 1:40 PM, Jay Kreps <jay.kr...@gmail.com> wrote:
>
> > One issue with using the filesystem for persistence is that the
> > synchronization in the filesystem is not great. In particular, the fsync
> > and fdatasync system calls block appends to the file, apparently for the
> > entire duration of the fsync (which can be quite long). This is documented
> > in some detail here:
> >  http://antirez.com/post/fsync-different-thread-useless.html
> >
> > This is a problem in 0.7 because our definition of a committed message is
> > one written prior to calling fsync(). This is the only way to guarantee
> > the message is on disk. We do not hand out any messages to consumers until
> > an fsync call occurs. The problem is that regardless of whether the fsync
> > is in a background thread or not, it will block any produce requests to
> > the file. This is buffered a bit in the client since our produce request
> > is effectively async in 0.7, but it can lead to weird latency spikes
> > nonetheless as this buffering gets filled.
> >
> > In 0.8 with replication the definition of a committed message changes to
> > one that is replicated to multiple machines, not necessarily committed to
> > disk. This is a different kind of guarantee with different strengths and
> > weaknesses (pro: data can survive destruction of the file system on one
> > machine, con: you will lose a few messages if you haven't sync'd and the
> > power goes out). We will likely retain the flush interval and time
> > settings for those who want fine-grained control over flushing, but they
> > are less relevant.
> >
> > Unfortunately *any* call to fsync will block appends, even in a
> > background thread, so how can we give control over physical disk
> > persistence without introducing high latency for the producer? The answer
> > is that the Linux pdflush daemon actually does a very similar thing to our
> > flush parameters. pdflush is a daemon running on every Linux machine that
> > controls the writing of buffered/cached data back to disk. It lets you
> > bound the amount of dirty data held in memory by setting either a
> > percentage of memory, a timeout after which any dirty page is written, or
> > a fixed number of dirty bytes.
> >
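As a rough illustration of the three controls described above, here is a
minimal C sketch that reads the corresponding kernel writeback settings from
/proc/sys/vm. The function names read_vm_tunable and print_writeback_tunables
are hypothetical (they are not part of the test program in this thread), and
the mapping of each file to "percentage", "timeout" and "bytes" is one reading
of the paragraph above rather than something the original message spells out.

```c
#include <assert.h>
#include <stdio.h>

/* Read one integer setting from a file such as
 * /proc/sys/vm/dirty_expire_centisecs.
 * Returns 0 on success, -1 if the file is missing or not numeric. */
int read_vm_tunable(const char *path, long long *value) {
    FILE *f;
    int ok;

    assert(path != NULL && value != NULL);
    f = fopen(path, "r");
    if (f == NULL)
        return -1;
    ok = (fscanf(f, "%lld", value) == 1);
    fclose(f);
    return ok ? 0 : -1;
}

/* Print the writeback settings (hypothetical helper, Linux only). */
void print_writeback_tunables(void) {
    static const char *names[] = {
        "dirty_background_ratio",    /* percentage of memory */
        "dirty_expire_centisecs",    /* age after which a dirty page is written */
        "dirty_writeback_centisecs", /* how often the flusher wakes up */
        "dirty_background_bytes",    /* fixed number of dirty bytes (0 = unused) */
    };
    char path[128];
    long long v;
    size_t i;

    for (i = 0; i < sizeof(names) / sizeof(names[0]); i++) {
        snprintf(path, sizeof(path), "/proc/sys/vm/%s", names[i]);
        if (read_vm_tunable(path, &v) == 0)
            printf("%s = %lld\n", names[i], v);
        else
            printf("%s: not readable\n", names[i]);
    }
}
```

Calling print_writeback_tunables() from a small main() prints the current
values. Changing them requires root and is usually done through sysctl rather
than from application code.
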
> > The question is, does pdflush block appends? The answer seems to be
> > mostly no. It locks the page being flushed but not the whole file. The
> > time to flush one page is actually usually pretty quick (plus I think it
> > may not be flushing just-written pages anyway). I wrote some test code for
> > this and here are the results:
> >
> > I modified the code from the link above. Here are the results from my
> > desktop (CentOS, Linux 2.6.32).
> >
> > We run the test writing 1024 bytes every 100 us and flushing every 500 us:
> >
> > $ ./pdflush-test 1024 100 500
> > 21
> > 4
> > 3
> > 3
> > 9
> > 6
> > Sync in 20277 us (0), sleeping for 500 us
> > 19819
> > 7
> > 7
> > 8
> > 38
> > Sync in 19470 us (0), sleeping for 500 us
> > 19048
> > 7
> > 4
> > 3
> > 8
> > 4
> > Sync in 19405 us (0), sleeping for 500 us
> > 19017
> > 6
> > 6
> > 10
> > 6
> > Sync in 19410 us (0), sleeping for 500 us
> > 19025
> > 7
> > 7
> > 11
> > 6
> >
> > $ cat /proc/sys/vm/dirty_writeback_centisecs
> > 100
> > $ cat /proc/sys/vm/dirty_expire_centisecs
> > 500
> >
> > Now run the test with the background flush disabled (rarely running):
> > $ ./pdflush-test 1024 100 5000000000000 > times.txt
> >
> > I ran this for 298,028 writes. The 99.9th percentile for this test is 17
> > us and the max time was 2043 us (2 ms).
> >
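The message does not say how the 99.9th percentile was computed from
times.txt. One plausible way, sketched here in C, is the nearest-rank method:
sort the latencies and take the sample at rank ceil(n * 0.999). The names
percentile_per_mille and load_samples are hypothetical, and the choice of
nearest-rank (rather than an interpolating definition) is an assumption. The
percentile is passed in tenths of a percent (999 for 99.9) to keep the rank
arithmetic in integers.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

static int cmp_ll(const void *a, const void *b) {
    long long x = *(const long long *)a;
    long long y = *(const long long *)b;
    return (x > y) - (x < y);
}

/* Nearest-rank percentile. Sorts samples in place and returns the value at
 * rank ceil(n * per_mille / 1000), counting ranks from 1. */
long long percentile_per_mille(long long *samples, size_t n, unsigned per_mille) {
    size_t rank;

    assert(samples != NULL && n > 0);
    assert(per_mille >= 1 && per_mille <= 1000);
    qsort(samples, n, sizeof(samples[0]), cmp_ll);
    rank = (n * per_mille + 999) / 1000; /* integer ceiling */
    return samples[rank - 1];
}

/* Load one latency per line, skipping non-numeric lines such as
 * "Sync in ...". Returns the sample count; the caller frees *out.
 * On allocation failure returns 0 and sets *out to NULL. */
size_t load_samples(FILE *in, long long **out) {
    size_t n = 0, cap = 1024;
    char line[256];
    long long v;
    long long *buf = malloc(cap * sizeof(*buf));

    assert(in != NULL && out != NULL);
    while (buf != NULL && fgets(line, sizeof(line), in) != NULL) {
        if (sscanf(line, "%lld", &v) != 1)
            continue;
        if (n == cap) {
            long long *grown = realloc(buf, 2 * cap * sizeof(*buf));
            if (grown == NULL) {
                free(buf);
                buf = NULL;
                n = 0;
                break;
            }
            buf = grown;
            cap *= 2;
        }
        buf[n++] = v;
    }
    *out = buf;
    return buf != NULL ? n : 0;
}
```

To use it on the output of the run above, open times.txt, pass the stream to
load_samples, and call percentile_per_mille(values, n, 999). After that call
the array is sorted, so the last element is the maximum.
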
> > Here is the test code:
> >
> > #include <stdio.h>
> > #include <unistd.h>
> > #include <string.h>
> > #include <sys/types.h>
> > #include <pthread.h>
> > #include <sys/stat.h>
> > #include <fcntl.h>
> > #include <sys/time.h>
> > #include <stdlib.h>
> >
> > static long long microseconds(void) {
> >    struct timeval tv;
> >    long long mst;
> >
> >    gettimeofday(&tv, NULL);
> >    mst = ((long long)tv.tv_sec)*1000000;
> >    mst += tv.tv_usec;
> >    return mst;
> > }
> >
> > /* Background thread: periodically fsync the file that main() appends to. */
> > void *IOThreadEntryPoint(void *arg) {
> >    int fd, retval;
> >    long long start;
> >    long sleep = (long) arg;
> >
> >    while(1) {
> >        usleep(sleep);
> >        start = microseconds();
> >        fd = open("/tmp/foo.txt",O_RDONLY);
> >        if (fd == -1) {
> >            perror("open");
> >            exit(1);
> >        }
> >        retval = fsync(fd);
> >        close(fd);
> >        printf("Sync in %lld us (%d), sleeping for %ld us\n",
> >               microseconds()-start, retval, sleep);
> >    }
> >    return NULL;
> > }
> >
> > int main(int argc, char* argv[]) {
> >    if(argc != 4) {
> >      printf("USAGE: %s size write_sleep fsync_sleep\n", argv[0]);
> >      exit(1);
> >    }
> >
> >    pthread_t thread;
> >    int fd = open("/tmp/foo.txt",O_WRONLY|O_CREAT,0644);
> >    long long start;
> >    long long ellapsed;
> >    int size = atoi(argv[1]);
> >    long write_sleep = atol(argv[2]);
> >    long fsync_sleep = atol(argv[3]);
> >    char buff[size];
> >    memset(buff, 'x', size); /* do not write uninitialized stack bytes */
> >
> >    pthread_create(&thread,NULL,IOThreadEntryPoint, (void*) fsync_sleep);
> >
> >    while(1) {
> >        start = microseconds();
> >        if (write(fd,buff,size) == -1) {
> >            perror("write");
> >            exit(1);
> >        }
> >        ellapsed = microseconds()-start;
> >        printf("%lld\n", ellapsed);
> >        usleep(write_sleep);
> >    }
> >    close(fd);
> >    exit(0);
> > }
> >
> > Cheers,
> >
> > -Jay
> >
>
