Re: Fsyncgate: errors on fsync are unrecovarable
All, I spent a couple of hours with this last week -- it's well worth the read to understand some of the reasons for this behaviour. > tl;dr: running fsync() after an fsync() that reported EIO clears that error > state with no way of recovery on Linux. The important thing that Jan didn't call out, which was a key point for me, was the mechanism. That specifically the dirty bits on the pages in the file cache are cleared on Linux after the fsync which returns EIO. This means that the kernel is now free to throw away those pages (with unwritten data) from the page cache, which results in the writes being lost. I just thought it worth expanding on that. This is why you have to ensure you redo _every_ write to the file between the last successful fsync and the failing fsync, such that you end up with the pages ending up dirty again so the OS knows it needs to write them to durable storage before throwing them out of memory. >From what I can see, this means that _somewhere_ in the stack needs to keep >track of the writes which happened since the last fsync, whether the low level >or the higher level. That is, Jan's (2) is a case where a higher level in the >system is either doing (1) or reporting failure direct back to the client so >they know their commit didn't make it to disk (probably). I think the >correctness of the latter might depend on whether we only write one new header >per result reported back to the client (i.e., not per request as _bulk_docs >has multiple results reported and I don't know if that's one header write and >one fsync). One comment from the PR: > As for the "fsync will forget about errors" issue... well ain't that a kick > in the pants? I agree there's not a whole lot we can do. Although I will have > to say that given the length of time multiple popular databases have run on > Linux without even a whisper of "fsync is totally broken" makes me think that > actually encountering that particular issue in the wild is likely fairly > miniscule. In the chain, the postgres developers do come to the opinion that quite a few more common (though not super-common) errors seen by postgres users could be traced back down to this behaviour. No one really thought to delve into the fsync question until the OP in that thread started poking. -- Mike. On Sun, 21 Jul 2019, at 21:26, Joan Touzet wrote: > On 2019-07-21 1:23 p.m., Paul Davis wrote: > > I’m browsing on my phone but I’m pretty sure we should add an `ok =` to > > this line so that we force a bad match: > > > > https://github.com/apache/couchdb/blob/9d098787a71d1c7f7f6adea05da15b0da3ecc7ef/src/couch/src/couch_file.erl#L223 > > > > Unless I’m missing somewhere else that we’re making that assertion. > > This looks good to me, and I've asked Paul a question in the PR that he > answers extremely well...if you're curious about how this would play out > should an entire filesystem go read-only or missing from underneath > CouchDB, I recommend reading it. (It's also on the > notifications@couchdb.a.o list if you prefer.) > > +1 to the fix. > > -Joan > > > > > > On Sun, Jul 21, 2019 at 4:21 AM Jan Lehnardt wrote: > > > >> Hey all, > >> > >> > >> Joan sent this along in IRC and it reads bad enough[tm] that we should at > >> least have a few pairs of eyes looking at if we have to do anything: > >> > >> http://danluu.com/fsyncgate/ > >> > >> (It is long and dense, you’ll need to read 10-15% of that page to get the > >> main picture. > >> > >> tl;dr: running fsync() after an fsync() that reported EIO clears that > >> error state with no way of recovery on Linux. > >> > >> There are two ways of handling this correctly: > >> > >> 1. whatever you wrote() between the last successful fsync() and the > >> fsync() that raised the error, keep around until after the second fsync(), > >> so you can write() it again. > >> > >> 2. if any one fsync() returns EIO, report this back up immediately, so > >> whoever calls you can retry. > >> > >> * * * > >> > >> We seem to be doing 2. as per my reading. > >> > >> Erlang looks like it correctly just raises whatever error fsync() might > >> return: > >> > >> 1. > >> https://github.com/erlang/otp/blob/maint-r14/erts/emulator/drivers/unix/unix_efile.c#L792-L809 > >> 2. > >> https://github.com/erlang/otp/blob/maint-r14/erts/emulator/drivers/unix/unix_efile.c#L151-L163 > >> > >> couch_file too: > >> > >> 1. > >> https://github.com/apache/couchdb/blob/master/src/couch/src/couch_file.erl#L215-L223 > >> > >> I glanced at a few paths going up this chain and couldn’t spot a catch > >> where we’d hide that error, but it’d be great to get some confirmation on > >> this. > >> > >> * * * > >> > >> Please double-check my understanding of the issue, the correct ways > >> forward and the findings in Erlang and CouchDB. > >> > >> Best > >> Jan > >> -- > >> Professional Support for Apache CouchDB: > >> https://neighbourhood.ie/couchdb-support/ > >> > >> > > >
Re: Fsyncgate: errors on fsync are unrecovarable
On 2019-07-21 1:23 p.m., Paul Davis wrote: I’m browsing on my phone but I’m pretty sure we should add an `ok =` to this line so that we force a bad match: https://github.com/apache/couchdb/blob/9d098787a71d1c7f7f6adea05da15b0da3ecc7ef/src/couch/src/couch_file.erl#L223 Unless I’m missing somewhere else that we’re making that assertion. This looks good to me, and I've asked Paul a question in the PR that he answers extremely well...if you're curious about how this would play out should an entire filesystem go read-only or missing from underneath CouchDB, I recommend reading it. (It's also on the notifications@couchdb.a.o list if you prefer.) +1 to the fix. -Joan On Sun, Jul 21, 2019 at 4:21 AM Jan Lehnardt wrote: Hey all, Joan sent this along in IRC and it reads bad enough[tm] that we should at least have a few pairs of eyes looking at if we have to do anything: http://danluu.com/fsyncgate/ (It is long and dense, you’ll need to read 10-15% of that page to get the main picture. tl;dr: running fsync() after an fsync() that reported EIO clears that error state with no way of recovery on Linux. There are two ways of handling this correctly: 1. whatever you wrote() between the last successful fsync() and the fsync() that raised the error, keep around until after the second fsync(), so you can write() it again. 2. if any one fsync() returns EIO, report this back up immediately, so whoever calls you can retry. * * * We seem to be doing 2. as per my reading. Erlang looks like it correctly just raises whatever error fsync() might return: 1. https://github.com/erlang/otp/blob/maint-r14/erts/emulator/drivers/unix/unix_efile.c#L792-L809 2. https://github.com/erlang/otp/blob/maint-r14/erts/emulator/drivers/unix/unix_efile.c#L151-L163 couch_file too: 1. https://github.com/apache/couchdb/blob/master/src/couch/src/couch_file.erl#L215-L223 I glanced at a few paths going up this chain and couldn’t spot a catch where we’d hide that error, but it’d be great to get some confirmation on this. * * * Please double-check my understanding of the issue, the correct ways forward and the findings in Erlang and CouchDB. Best Jan -- Professional Support for Apache CouchDB: https://neighbourhood.ie/couchdb-support/
Re: Fsyncgate: errors on fsync are unrecovarable
I’m browsing on my phone but I’m pretty sure we should add an `ok =` to this line so that we force a bad match: https://github.com/apache/couchdb/blob/9d098787a71d1c7f7f6adea05da15b0da3ecc7ef/src/couch/src/couch_file.erl#L223 Unless I’m missing somewhere else that we’re making that assertion. On Sun, Jul 21, 2019 at 4:21 AM Jan Lehnardt wrote: > Hey all, > > > Joan sent this along in IRC and it reads bad enough[tm] that we should at > least have a few pairs of eyes looking at if we have to do anything: > > http://danluu.com/fsyncgate/ > > (It is long and dense, you’ll need to read 10-15% of that page to get the > main picture. > > tl;dr: running fsync() after an fsync() that reported EIO clears that > error state with no way of recovery on Linux. > > There are two ways of handling this correctly: > > 1. whatever you wrote() between the last successful fsync() and the > fsync() that raised the error, keep around until after the second fsync(), > so you can write() it again. > > 2. if any one fsync() returns EIO, report this back up immediately, so > whoever calls you can retry. > > * * * > > We seem to be doing 2. as per my reading. > > Erlang looks like it correctly just raises whatever error fsync() might > return: > > 1. > https://github.com/erlang/otp/blob/maint-r14/erts/emulator/drivers/unix/unix_efile.c#L792-L809 > 2. > https://github.com/erlang/otp/blob/maint-r14/erts/emulator/drivers/unix/unix_efile.c#L151-L163 > > couch_file too: > > 1. > https://github.com/apache/couchdb/blob/master/src/couch/src/couch_file.erl#L215-L223 > > I glanced at a few paths going up this chain and couldn’t spot a catch > where we’d hide that error, but it’d be great to get some confirmation on > this. > > * * * > > Please double-check my understanding of the issue, the correct ways > forward and the findings in Erlang and CouchDB. > > Best > Jan > -- > Professional Support for Apache CouchDB: > https://neighbourhood.ie/couchdb-support/ > >
Fsyncgate: errors on fsync are unrecovarable
Hey all, Joan sent this along in IRC and it reads bad enough[tm] that we should at least have a few pairs of eyes looking at if we have to do anything: http://danluu.com/fsyncgate/ (It is long and dense, you’ll need to read 10-15% of that page to get the main picture. tl;dr: running fsync() after an fsync() that reported EIO clears that error state with no way of recovery on Linux. There are two ways of handling this correctly: 1. whatever you wrote() between the last successful fsync() and the fsync() that raised the error, keep around until after the second fsync(), so you can write() it again. 2. if any one fsync() returns EIO, report this back up immediately, so whoever calls you can retry. * * * We seem to be doing 2. as per my reading. Erlang looks like it correctly just raises whatever error fsync() might return: 1. https://github.com/erlang/otp/blob/maint-r14/erts/emulator/drivers/unix/unix_efile.c#L792-L809 2. https://github.com/erlang/otp/blob/maint-r14/erts/emulator/drivers/unix/unix_efile.c#L151-L163 couch_file too: 1. https://github.com/apache/couchdb/blob/master/src/couch/src/couch_file.erl#L215-L223 I glanced at a few paths going up this chain and couldn’t spot a catch where we’d hide that error, but it’d be great to get some confirmation on this. * * * Please double-check my understanding of the issue, the correct ways forward and the findings in Erlang and CouchDB. Best Jan -- Professional Support for Apache CouchDB: https://neighbourhood.ie/couchdb-support/