On Wed, May 11, 2016 at 09:10:12AM -0700, Josef Bacik wrote: > On 05/11/2016 09:05 AM, Chris Mason wrote: > >On Wed, May 11, 2016 at 04:56:56PM +0100, Filipe Manana wrote: > >>On Wed, May 11, 2016 at 4:41 PM, Josef Bacik <[email protected]> wrote: > >>>On 05/09/2016 07:01 AM, [email protected] wrote: > >>>> > >>>>From: Filipe Manana <[email protected]> > >>>> > >>>>When we do a direct IO write against a preallocated extent (fallocate) > >>>>that does not go beyond the i_size of the inode, we do the write operation > >>>>without holding the inode's i_mutex (an optimization that landed in > >>>>commit 38851cc19adb ("Btrfs: implement unlocked dio write")). This allows > >>>>for a very tiny time window where a race can happen with a concurrent > >>>>fsync using the fast code path, as the direct IO write path creates first > >>>>a new extent map (no longer flagged as a prealloc extent) and then it > >>>>creates the ordered extent, while the fast fsync path first collects > >>>>ordered extents and then it collects extent maps. This allows for the > >>>>possibility of the fast fsync path to collect the new extent map without > >>>>collecting the new ordered extent, and therefore logging an extent item > >>>>based on the extent map without waiting for the ordered extent to be > >>>>created and complete. This can result in a situation where after a log > >>>>replay we end up with an extent not marked anymore as prealloc but it was > >>>>only partially written (or not written at all), exposing random, stale or > >>>>garbage data corresponding to the unwritten pages and without any > >>>>checksums in the csum tree covering the extent's range. > >>>> > >>>>This is an extension of what was done in commit de0ee0edb21f ("Btrfs: fix > >>>>race between fsync and lockless direct IO writes"). > >>>> > >>>>So fix this by creating first the ordered extent and then the extent > >>>>map, so that this way if the fast fsync patch collects the new extent > >>>>map it also collects the corresponding ordered extent. > >>>> > >>> > >>>So this seems to just be trading one problem for another bad behavior. Now > >>>instead we'll end up with an ordered extent but possibly not the extent > >>>map, > >>>which is fine, but seems ugly, and is a subtle and undocumented behavior > >>>that is going to bite us in the ass in the future. > >>> > >>>I know everybody loves lockless writes, but we need to protect for this > >>>particular case in an explicit way so I don't go change something in the > >>>future and forget about this dependency. What if we add a rwsem around > >>>adding the extent map and the ordered extent. In fact we pull this dance > >>>out into a common helper function so everybody who wants to do this just > >>>calls the helper function that is easy to audit and understand, then we > >>>just > >>>put a > >>> > >>>down_read(&BTRFS_I(inode)->lets_not_fuck_fsync)); > >>>/* Do our extent map + ordered extent dance here */ > >>>up_read(&BTRFS_I(inode)->lets_not_fuck_fsync)); > >>> > >>>and then where we gather all of this stuff in fsync we do a > >>>down_write/up_write. This won't affect the buffered write case currently > >>>as > >>>we're still protected by the i_mutex, and then makes the direct io case > >>>much > >>>cleaner and safer. Then later on when we want to remove the i_mutex from > >>>the buffered write case we have one less gotcha to worry about. Thanks, > >> > >>Agreed. I though about something similar initially but I didn't want > >>to increase the size of struct btrfs_inode. > >>Would you mind that if instead of changing this patch I do one on top > >>that introduces that helper function, and makes this and the other > >>place in the dio path that does the same logic use it too? > > > >Once we go down the rwsem path, we might want to look at overall DIO > >locking and see where else it fits in. We can probably get rid of a few > >warts with it. > > > > I hesitate to suggest this, but long term I'd like to look at just using our > range locking for all of this, and do something like (u64)-1 for the end > when we are talking the whole file. That would remove the need for a lot of > this other locking stuff. But I imagine that'll cause more problems than it > solves atm. Thanks,
The last time I tried, I had regrets. But it's definitely worth picking up again. -chris -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to [email protected] More majordomo info at http://vger.kernel.org/majordomo-info.html
