Re: [pgsql-hackers-win32] [PATCHES] SRA Win32 sync() code
Shridhar Daithankar wrote:

> Does 30% difference above count as significant?

No. It's Linux, we can look at the sources: there is no per-fd cache, the page cache is global. Thus fsync() syncs the whole cache to disk. A problem could only occur if the file cache is not global - perhaps a per-node file cache on NUMA systems - IRIX on an Origin 2000 cluster or something similar.

But as I read the unix spec, fsync is guaranteed to sync all data to disk. Draft 6 of the posix-200x spec:

    SIO If _POSIX_SYNCHRONIZED_IO is defined, the fsync( ) function
    shall force all currently queued I/O operations associated with the
    file indicated by file descriptor fildes to the synchronized I/O
    completion state. All I/O operations shall be completed as defined
    for synchronized I/O file integrity completion.

"All I/O operations associated with the file", not all operations associated with the file descriptor.

--
    Manfred

---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ?
       http://www.postgresql.org/docs/faqs/FAQ.html
Re: [pgsql-hackers-win32] [PATCHES] SRA Win32 sync() code
On Mon, Nov 17, 2003 at 12:46:34AM -0500, Bruce Momjian wrote:
> Tom Lane wrote:
> > > Do we know that having the background writer fsync a file that was
> > > written by a backend cause all the data to fsync? I think I could write
> > > a program to test this by timing each of these tests:
> >
> > That might prove something about the particular platform you tested it
> > on; but it would not speak to the real problem, which is what we can
> > assume is true on every platform...
>
> The attached program does test if fsync can be used on a file descriptor
> after the file is closed and then reopened. I see:
>
>     write                  0.000613
>     write & fsync          0.001727
>     write, close & fsync   0.001633

> Does anyone have a platform where the last duration is significantly
> different from the middle timing?

    write                  0.002807
    write & fsync          0.015248
    write, close & fsync   0.004696

This is Linux 2.6.0-test5 on an old IDE disk. The results change a lot. Another run shows:

    write                  0.002737
    write & fsync          0.006658
    write, close & fsync   0.008431

The first time is stable, the other two aren't. On average, write & fsync takes about twice as long as write, close & fsync.

PS: Please specify some modes when creating files.

Kurt
Re: [pgsql-hackers-win32] [PATCHES] SRA Win32 sync() code
On Monday 17 November 2003 11:16, Bruce Momjian wrote:
> Tom Lane wrote:
> > > Do we know that having the background writer fsync a file that was
> > > written by a backend cause all the data to fsync? I think I could
> > > write a program to test this by timing each of these tests:
> >
> > That might prove something about the particular platform you tested it
> > on; but it would not speak to the real problem, which is what we can
> > assume is true on every platform...
>
> The attached program does test if fsync can be used on a file descriptor
> after the file is closed and then reopened. I see:
>
>     write                  0.000613
>     write & fsync          0.001727
>     write, close & fsync   0.001633

ArchLinux, Maxtor IDE HDD, write cache enabled.

[EMAIL PROTECTED] tmp]$ gcc -o test_fsync test_fsync.c
[EMAIL PROTECTED] tmp]$ ./test_fsync
write                  0.002403
write & fsync          0.009423
write, close & fsync   0.006457
[EMAIL PROTECTED] tmp]$ uname -a
Linux daithan 2.4.21 #1 SMP Tue Jul 8 19:41:52 PDT 2003 i686 unknown

> Anyway, if we find all our platforms can pass this test, we might be
> able to allow backends to do their own writes and just record the file
> name somewhere for the checkpointer to fsync. It also shows write/fsync
> was 3x slower than simple write.
>
> Does anyone have a platform where the last duration is significantly
> different from the middle timing?

Does 30% difference above count as significant?

Assuming fsync on a file descriptor flushes the dirty buffers of that file, from all processes, would the following be sufficient?

1. Open WAL with O_SYNC|O_DIRECT (the latter wherever possible), and issue fsync on WAL files whenever required.
2. Use regular writes for data files and fsync them in the background.

Maybe if the background process is the only one that issues any fsync on data files, that could maximize overall system throughput. Say, all backends write to a data file and signal the background writer that they are blocked waiting for this write to complete. BGWriter could batch all such requests and flush/fsync them when there is enough disk activity. Hopefully none of them would be stalled for too long. That way the slowest part of the system, i.e. the disk, is kept fully loaded.

Besides, since WAL writes are synchronous, backends can safely push a write and move on to further business most of the time.

I guess BGWriter has to fsync the data files anyway to recycle a WAL segment. In idle conditions, this mechanism should not be a problem.

Just a thought. Does this take care of sync?

> I am keeping this discussion on patches because of the C program
> attachment.

I dropped the win32 list. I am not subscribed to it. Just getting the thread out of it.

I will write a short program which writes to a file in different processes and attempts to fsync them from only one. Let's see what that turns out.

Shridhar
Re: [pgsql-hackers-win32] [PATCHES] SRA Win32 sync() code
Hannu Krosing wrote:
> Bruce Momjian wrote on Mon, 17.11.2003 at 03:58:
> >
> > OK, let me give you my logic and you can tell me where I am wrong.
> >
> > First, how many backend can a single write process support if all the
> > backends are doing insert/update/deletes? 5? 10? Let's assume 10.
> > Second, once we change write to write/fsync, how much slower will that
> > be? 100x, 1000x? Let's say 10x.
> >
> > So, by my logic, if we have 100 backends all doing updates, we will need
> > 10 * 100 or 1000 writer processes or threads to keep up with that load.
> > That seems quite excessive to me from a context switching and process
> > overhead perspective.
> >
> > Where am I wrong?
>
> Maybe you meant 100/10 instead of 100*10 ;)

I figured 10 backends, but using fsync, they are not 100x slower (10 * 100). However, testing shows fsync is only 3x slower.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  [EMAIL PROTECTED]                    |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
Re: [pgsql-hackers-win32] [PATCHES] SRA Win32 sync() code
Bruce Momjian wrote on Mon, 17.11.2003 at 03:58:
>
> OK, let me give you my logic and you can tell me where I am wrong.
>
> First, how many backend can a single write process support if all the
> backends are doing insert/update/deletes? 5? 10? Let's assume 10.
> Second, once we change write to write/fsync, how much slower will that
> be? 100x, 1000x? Let's say 10x.
>
> So, by my logic, if we have 100 backends all doing updates, we will need
> 10 * 100 or 1000 writer processes or threads to keep up with that load.
> That seems quite excessive to me from a context switching and process
> overhead perspective.
>
> Where am I wrong?

Maybe you meant 100/10 instead of 100*10 ;)

Hannu
Re: [pgsql-hackers-win32] [PATCHES] SRA Win32 sync() code
Tom Lane wrote:
> > Do we know that having the background writer fsync a file that was
> > written by a backend cause all the data to fsync? I think I could write
> > a program to test this by timing each of these tests:
>
> That might prove something about the particular platform you tested it
> on; but it would not speak to the real problem, which is what we can
> assume is true on every platform...

The attached program does test if fsync can be used on a file descriptor after the file is closed and then reopened. I see:

    write                  0.000613
    write & fsync          0.001727
    write, close & fsync   0.001633

This shows that fsync works even after the file is closed and reopened. I could test by writing using a subprocess, but I don't see how that would be different, and it would mess up my timings.

Anyway, if we find all our platforms can pass this test, we might be able to allow backends to do their own writes and just record the file name somewhere for the checkpointer to fsync. It also shows write/fsync was 3x slower than simple write.

Does anyone have a platform where the last duration is significantly different from the middle timing?

I am keeping this discussion on patches because of the C program attachment.

--
  Bruce Momjian

/*
 *  test_fsync.c
 *      tests if fsync can be done from another process than the original write
 */

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#include <unistd.h>

void die(char *str);
void print_elapse(struct timeval start_t, struct timeval elapse_t);

int main(int argc, char *argv[])
{
    struct timeval start_t;
    struct timeval elapse_t;
    int tmpfile;
    int i;
    char charout = 44;

    /* write only */
    gettimeofday(&start_t, NULL);
    /* O_CREAT requires a mode argument; 0600 is used here */
    if ((tmpfile = open("/var/tmp/test_fsync.out", O_RDWR | O_CREAT, 0600)) == -1)
        die("can't open /var/tmp/test_fsync.out");
    for (i = 0; i < 200; i++)
        write(tmpfile, &charout, 1);
    close(tmpfile);
    gettimeofday(&elapse_t, NULL);
    unlink("/var/tmp/test_fsync.out");
    printf("write                  ");
    print_elapse(start_t, elapse_t);
    printf("\n");

    /* write & fsync */
    gettimeofday(&start_t, NULL);
    if ((tmpfile = open("/var/tmp/test_fsync.out", O_RDWR | O_CREAT, 0600)) == -1)
        die("can't open /var/tmp/test_fsync.out");
    for (i = 0; i < 200; i++)
        write(tmpfile, &charout, 1);
    fsync(tmpfile);
    close(tmpfile);
    gettimeofday(&elapse_t, NULL);
    unlink("/var/tmp/test_fsync.out");
    printf("write & fsync          ");
    print_elapse(start_t, elapse_t);
    printf("\n");

    /* write, close & fsync */
    gettimeofday(&start_t, NULL);
    if ((tmpfile = open("/var/tmp/test_fsync.out", O_RDWR | O_CREAT, 0600)) == -1)
        die("can't open /var/tmp/test_fsync.out");
    for (i = 0; i < 200; i++)
        write(tmpfile, &charout, 1);
    close(tmpfile);
    /* reopen file */
    if ((tmpfile = open("/var/tmp/test_fsync.out", O_RDWR | O_CREAT, 0600)) == -1)
        die("can't open /var/tmp/test_fsync.out");
    fsync(tmpfile);
    close(tmpfile);
    gettimeofday(&elapse_t, NULL);
    unlink("/var/tmp/test_fsync.out");
    printf("write, close & fsync   ");
    print_elapse(start_t, elapse_t);
    printf("\n");

    return 0;
}

void print_elapse(struct timeval start_t, struct timeval elapse_t)
{
    /* borrow one second (1000000 microseconds) if the usec part would go negative */
    if (elapse_t.tv_usec < start_t.tv_usec)
    {
        elapse_t.tv_sec--;
        elapse_t.tv_usec += 1000000;
    }

    printf("%ld.%06ld", (long) (elapse_t.tv_sec - start_t.tv_sec),
           (long) (elapse_t.tv_usec - start_t.tv_usec));
}

void die(char *str)
{
    fprintf(stderr, "%s", str);
    exit(1);
}
Re: [pgsql-hackers-win32] [PATCHES] SRA Win32 sync() code
Tom Lane wrote:
> Bruce Momjian <[EMAIL PROTECTED]> writes:
> > Where am I wrong?
>
> I don't think any of this is relevant. There are a certain number of
> blocks we have to get down to disk before we can declare a transaction
> committed, and there are a certain number that we have to get down to
> disk before we can declare a checkpoint complete. You are focusing too
> much on the question of whether a particular process performs an fsync
> operation, and ignoring the fact that ultimately it's got to wait for
> I/O to complete --- directly or indirectly. If it blocks waiting for
> some other process to declare a buffer clean, rather than writing for
> itself, what's the difference?

The difference is two-fold. First, there might be 10 other backends asking for writes, so it isn't clear that asking someone else to do the write is as fast. Second, that other writer is doing fsync, so it is 100x or 1000x slower than our current setup.

> Sure, fsync serializes the particular process that's doing it, but we
> can deal with that by spreading the fsyncs across multiple processes,
> and trying to ensure that they are mostly background processes rather
> than foreground ones.

How many? That was my point, that it might require 1000 backend processes _and_ it would be slower because we are doing write/fsync rather than write. However, I think we could fix that by doing the write and returning OK to the backend, then doing the fsync whenever we want --- perhaps that was already your plan.

> I don't claim that immediate-fsync-on-write is the only answer, but
> I cannot follow your reasoning for dismissing it out of hand ... and I
> certainly cannot buy *any* logic that says that sync() is a good answer
> to any of these issues. AFAICS sync() means that we abandon
> responsibility.

sync() means we group the fsyncs into one massive one that syncs all other processes' I/O too --- clearly bad, but I am hoping for something as good as what we currently have, because that sync hopefully happens only every few minutes.

> > Do we know that having the background writer fsync a file that was
> > written by a backend cause all the data to fsync? I think I could write
> > a program to test this by timing each of these tests:
>
> That might prove something about the particular platform you tested it
> on; but it would not speak to the real problem, which is what we can
> assume is true on every platform...

Yes, it would only be per platform. I was thinking we could have a platform test and enable this behavior on platforms that support it (all?) and use sync on the others.

Also, let me say I am glad we are delving into this. Our buffer system has needed an overhaul for a while, and right now we already have some major improvements for 7.5, and this discussion is just a continuation of those improvements.

--
  Bruce Momjian
Re: [pgsql-hackers-win32] [PATCHES] SRA Win32 sync() code
Bruce Momjian <[EMAIL PROTECTED]> writes:
> Where am I wrong?

I don't think any of this is relevant. There are a certain number of blocks we have to get down to disk before we can declare a transaction committed, and there are a certain number that we have to get down to disk before we can declare a checkpoint complete. You are focusing too much on the question of whether a particular process performs an fsync operation, and ignoring the fact that ultimately it's got to wait for I/O to complete --- directly or indirectly. If it blocks waiting for some other process to declare a buffer clean, rather than writing for itself, what's the difference?

Sure, fsync serializes the particular process that's doing it, but we can deal with that by spreading the fsyncs across multiple processes, and trying to ensure that they are mostly background processes rather than foreground ones.

I don't claim that immediate-fsync-on-write is the only answer, but I cannot follow your reasoning for dismissing it out of hand ... and I certainly cannot buy *any* logic that says that sync() is a good answer to any of these issues. AFAICS sync() means that we abandon responsibility.

> Do we know that having the background writer fsync a file that was
> written by a backend cause all the data to fsync? I think I could write
> a program to test this by timing each of these tests:

That might prove something about the particular platform you tested it on; but it would not speak to the real problem, which is what we can assume is true on every platform...

            regards, tom lane
Re: [pgsql-hackers-win32] [PATCHES] SRA Win32 sync() code
Tom Lane wrote:
> Bruce Momjian <[EMAIL PROTECTED]> writes:
> > Tom Lane wrote:
> >> Seriously though, if we can move the bulk of the writing work into
> >> background processes then I don't believe that there will be any
> >> significant penalty for regular backends.
>
> > If the background writer starts using fsync(), we can have normal
> > backends that do a write() set a shared memory boolean. We can then
> > test that boolean and do sync() only if other backends had to do their
> > own writes.
>
> That seems like the worst of both worlds --- you still are depending on
> sync() for correctness.
>
> Also, as long as backends only *seldom* do writes, making them fsync a
> write when they do make one will be less of an impact on overall system
> performance than having a sync() ensue shortly afterwards. I think you
> are focusing too narrowly on the idea that backends shouldn't ever wait
> for writes, and failing to see the bigger picture. What we need to
> optimize is overall system performance, not an arbitrary restriction
> that certain processes never wait for certain things.

OK, let me give you my logic and you can tell me where I am wrong.

First, how many backends can a single write process support if all the backends are doing insert/update/deletes? 5? 10? Let's assume 10. Second, once we change write to write/fsync, how much slower will that be? 100x, 1000x? Let's say 10x.

So, by my logic, if we have 100 backends all doing updates, we will need 10 * 100 or 1000 writer processes or threads to keep up with that load. That seems quite excessive to me from a context switching and process overhead perspective.

Where am I wrong?

Also, if we go with the fsync only at checkpoint, we are doing fsyncs once every minute (at checkpoint time) rather than potentially several times a second.

Do we know that having the background writer fsync a file that was written by a backend causes all the data to fsync? I think I could write a program to test this by timing each of these tests:

    create an empty file
    open file
    time fsync
    close

    open file
    write 2mb into the file
    time fsync
    close

    open file
    write 2mb into the file
    close
    open file
    time fsync
    close

If I do the write via system(), I am doing it in a separate process so the test should work.

Should I try this?

--
  Bruce Momjian