It depends a great deal on the hardware and the flush interval. On our older-generation hardware I think we saw an average flush time of 40 ms. On the newer machines we just got it is much lower, but I suspect that is because the disks themselves have some kind of NVRAM.
-Jay

On Fri, May 25, 2012 at 7:09 AM, S Ahmed <sahmed1...@gmail.com> wrote:
> In practice (at LinkedIn), how long do you see the calls blocked for during
> fsyncs?
>
> On Thu, May 24, 2012 at 1:40 PM, Jay Kreps <jay.kr...@gmail.com> wrote:
>
> > One issue with using the filesystem for persistence is that the
> > synchronization in the filesystem is not great. In particular, the fsync
> > and fdatasync system calls block appends to the file, apparently for the
> > entire duration of the fsync (which can be quite long). This is
> > documented in some detail here:
> > http://antirez.com/post/fsync-different-thread-useless.html
> >
> > This is a problem in 0.7 because our definition of a committed message is
> > one written prior to calling fsync(). This is the only way to guarantee
> > the message is on disk. We do not hand out any messages to consumers
> > until an fsync call occurs. The problem is that, regardless of whether
> > the fsync runs in a background thread or not, it will block any produce
> > requests to the file. This is buffered a bit in the client, since our
> > produce request is effectively async in 0.7, but it can lead to weird
> > latency spikes nonetheless as this buffer fills up.
> >
> > In 0.8 with replication the definition of a committed message changes to
> > one that is replicated to multiple machines, not necessarily committed to
> > disk. This is a different kind of guarantee with different strengths and
> > weaknesses (pro: data can survive destruction of the file system on one
> > machine; con: you will lose a few messages if you haven't sync'd and the
> > power goes out). We will likely retain the flush interval and time
> > settings for those who want fine-grained control over flushing, but they
> > are less relevant.
> >
> > Unfortunately *any* call to fsync will block appends, even in a
> > background thread, so how can we give control over physical disk
> > persistence without introducing high latency for the producer?
> > The answer is that the Linux pdflush daemon actually does something very
> > similar to our flush parameters. pdflush is a daemon running on every
> > Linux machine that controls the writing of buffered/cached data back to
> > disk. It lets you bound the amount of dirty data held in memory by giving
> > it a percentage of memory, a timeout after which any dirty page must be
> > written, or a fixed number of dirty bytes.
> >
> > The question is, does pdflush block appends? The answer seems to be
> > mostly no. It locks the page being flushed but not the whole file. The
> > time to flush one page is usually pretty quick (plus I think it may not
> > be flushing just-written pages anyway). I wrote some test code for this
> > and here are the results:
> >
> > I modified the code from the link above. Here are the results from my
> > desktop (CentOS Linux 2.6.32).
> >
> > We run the test writing 1024 bytes every 100 us and calling fsync every
> > 500 us:
> >
> > $ ./pdflush-test 1024 100 500
> > 21
> > 4
> > 3
> > 3
> > 9
> > 6
> > Sync in 20277 us (0), sleeping for 500 us
> > 19819
> > 7
> > 7
> > 8
> > 38
> > Sync in 19470 us (0), sleeping for 500 us
> > 19048
> > 7
> > 4
> > 3
> > 8
> > 4
> > Sync in 19405 us (0), sleeping for 500 us
> > 19017
> > 6
> > 6
> > 10
> > 6
> > Sync in 19410 us (0), sleeping for 500 us
> > 19025
> > 7
> > 7
> > 11
> > 6
> >
> > Note the write that overlaps each fsync: it takes about 19 ms, roughly
> > the duration of the sync itself, while every other write completes in
> > tens of microseconds or less.
> >
> > $ cat /proc/sys/vm/dirty_writeback_centisecs
> > 100
> > $ cat /proc/sys/vm/dirty_expire_centisecs
> > 500
> >
> > Now run the test with the background fsync effectively disabled (it runs
> > only rarely), leaving the flushing to pdflush:
> >
> > $ ./pdflush-test 1024 100 5000000000000 > times.txt
> >
> > I ran this for 298,028 writes. The 99.9th percentile for this test is
> > 17 us and the max time was 2043 us (2 ms).
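For anyone wanting to reproduce the percentile and max figures from times.txt, here is a minimal sketch. It is not necessarily how the numbers above were produced. It assumes one latency per line and skips the "Sync in ..." lines; the function names load_samples and percentile_permille are hypothetical, and the nearest-rank definition of a percentile is my assumption.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>

/* qsort comparator for long long values, ascending. */
static int cmp_ll(const void *a, const void *b) {
    long long x = *(const long long *)a;
    long long y = *(const long long *)b;
    return (x > y) - (x < y);
}

/*
 * Reads one latency (in us) per line. Lines that do not start with a digit,
 * such as the "Sync in ..." lines, are skipped. Returns a malloc'd array
 * (caller frees) and stores the count in *n_out, or returns NULL on
 * allocation failure. (Hypothetical helper, not part of the original test.)
 */
long long *load_samples(FILE *in, size_t *n_out) {
    size_t n = 0, cap = 1024;
    long long *v = malloc(cap * sizeof *v);
    char line[256];

    if (v == NULL) return NULL;
    while (fgets(line, sizeof line, in) != NULL) {
        if (!isdigit((unsigned char)line[0])) continue;
        if (n == cap) {
            long long *tmp;
            cap *= 2;
            tmp = realloc(v, cap * sizeof *v);
            if (tmp == NULL) { free(v); return NULL; }
            v = tmp;
        }
        v[n++] = strtoll(line, NULL, 10);
    }
    *n_out = n;
    return v;
}

/*
 * Nearest-rank percentile, expressed in tenths of a percent so that the
 * arithmetic stays in integers (999 means the 99.9th percentile, 1000 the
 * max). Sorts the samples in place.
 */
long long percentile_permille(long long *samples, size_t n, unsigned per_mille) {
    size_t rank;

    assert(n > 0 && per_mille >= 1 && per_mille <= 1000);
    qsort(samples, n, sizeof samples[0], cmp_ll);
    rank = (n * (size_t)per_mille + 999) / 1000;  /* ceil(n * p / 1000) */
    return samples[rank - 1];
}
```

After loading the file with load_samples(fopen("times.txt", "r"), &n), percentile_permille(v, n, 999) gives the 99.9th percentile and percentile_permille(v, n, 1000) gives the max.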
> >
> > Here is the test code:
> >
> > /* Link with pthreads, e.g. gcc -pthread. */
> > #include <stdio.h>
> > #include <unistd.h>
> > #include <string.h>
> > #include <sys/types.h>
> > #include <pthread.h>
> > #include <sys/stat.h>
> > #include <fcntl.h>
> > #include <sys/time.h>
> > #include <stdlib.h>
> >
> > /* Current wall-clock time in microseconds. */
> > static long long microseconds(void) {
> >     struct timeval tv;
> >     long long mst;
> >
> >     gettimeofday(&tv, NULL);
> >     mst = ((long long)tv.tv_sec)*1000000;
> >     mst += tv.tv_usec;
> >     return mst;
> > }
> >
> > /* Background thread: fsync the file every sleep_us microseconds and
> >  * print how long each open/fsync/close took. */
> > void *IOThreadEntryPoint(void *arg) {
> >     int fd, retval;
> >     long long start;
> >     long sleep_us = (long) arg;
> >
> >     while(1) {
> >         /* usleep() takes a useconds_t, so very large values may be
> >          * truncated. */
> >         usleep(sleep_us);
> >         start = microseconds();
> >         fd = open("/tmp/foo.txt",O_RDONLY);
> >         if (fd == -1) {
> >             perror("open");
> >             exit(1);
> >         }
> >         retval = fsync(fd);
> >         close(fd);
> >         printf("Sync in %lld us (%d), sleeping for %ld us\n",
> >                microseconds()-start, retval, sleep_us);
> >     }
> >     return NULL;
> > }
> >
> > int main(int argc, char* argv[]) {
> >     if(argc != 4) {
> >         printf("USAGE: %s size write_sleep fsync_sleep\n", argv[0]);
> >         exit(1);
> >     }
> >
> >     pthread_t thread;
> >     int fd = open("/tmp/foo.txt",O_WRONLY|O_CREAT,0644);
> >     long long start;
> >     long long elapsed;
> >     int size = atoi(argv[1]);
> >     long write_sleep = atol(argv[2]);
> >     long fsync_sleep = atol(argv[3]);
> >
> >     if (fd == -1) {
> >         perror("open");
> >         exit(1);
> >     }
> >     if (size <= 0) {
> >         fprintf(stderr, "size must be positive\n");
> >         exit(1);
> >     }
> >
> >     char buff[size];
> >     memset(buff, 0, size);  /* write defined bytes, not stack garbage */
> >
> >     if (pthread_create(&thread,NULL,IOThreadEntryPoint,
> >                        (void*) fsync_sleep) != 0) {
> >         fprintf(stderr, "pthread_create failed\n");
> >         exit(1);
> >     }
> >
> >     /* Main thread: time each write() and print the latency in us. */
> >     while(1) {
> >         start = microseconds();
> >         if (write(fd,buff,size) == -1) {
> >             perror("write");
> >             exit(1);
> >         }
> >         elapsed = microseconds()-start;
> >         printf("%lld\n", elapsed);
> >         usleep(write_sleep);
> >     }
> >     close(fd);
> >     exit(0);
> > }
> >
> > Cheers,
> >
> > -Jay
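The writeback knobs described in the quoted message (a percentage of memory, a timeout for dirty pages, a fixed number of dirty bytes) show up on Linux as files under /proc/sys/vm, two of which are cat'ed above. As a rough sketch, assuming a Linux box with /proc mounted, they could be dumped before a test run like this. The function names read_long_from_file and print_writeback_settings are hypothetical, and the mapping of the message's three options onto these particular files is my reading, not something stated in the thread.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Reads a single integer from a file such as
 * /proc/sys/vm/dirty_expire_centisecs.
 * Returns 0 on success, -1 if the file cannot be opened or parsed.
 */
int read_long_from_file(const char *path, long *out) {
    FILE *f;
    int ok;

    assert(path != NULL && out != NULL);
    f = fopen(path, "r");
    if (f == NULL) return -1;
    ok = (fscanf(f, "%ld", out) == 1);
    fclose(f);
    return ok ? 0 : -1;
}

/* Prints the kernel writeback settings that are readable on this machine. */
void print_writeback_settings(void) {
    static const char *names[] = {
        "dirty_background_ratio",     /* percentage of memory */
        "dirty_ratio",
        "dirty_background_bytes",     /* fixed number of dirty bytes */
        "dirty_bytes",
        "dirty_writeback_centisecs",  /* how often the flusher wakes up */
        "dirty_expire_centisecs",     /* age at which a dirty page is written */
    };
    char path[128];
    long value;
    size_t i;

    for (i = 0; i < sizeof names / sizeof names[0]; i++) {
        snprintf(path, sizeof path, "/proc/sys/vm/%s", names[i]);
        if (read_long_from_file(path, &value) == 0)
            printf("%s = %ld\n", names[i], value);
        else
            printf("%s: not readable\n", names[i]);
    }
}
```

Calling print_writeback_settings() at the start of a run records the settings next to the latency numbers, which makes results from different machines easier to compare.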