On Wed, Feb 21, 2018 at 1:10 PM, Nikolay Borisov <nbori...@suse.com> wrote: > > > On 21.02.2018 15:06, Filipe Manana wrote: >> On Wed, Feb 21, 2018 at 11:41 AM, Nikolay Borisov <nbori...@suse.com> wrote: >>> Currently the DIO read cases uses a botched idea from ext4 to ensure >>> that DIO reads don't race with truncate. The idea is that if we have a >>> pending truncate we set BTRFS_INODE_READDIO_NEED_LOCK which in turn >>> forces the dio read case to fallback to inode_locking to prevent >>> read/truncate races. Unfortunately this is subtly broken for at least >>> 2 reasons: >>> >>> 1. inode_dio_begin in btrfs_direct_IO is called outside of inode_lock >>> (for the read case). This means that there is no ordering guarantee >>> between the invocation of inode_dio_wait and the increment of >>> i_dio_count in btrfs_direct_IO in the tread case. >>> >>> 2. The memory barriers used in btrfs_inode_(block|resume)_unlocked_dio >>> are not really paired with the reader side - the test_bit in >>> btrfs_direct_IO, since the latter is missing a memory barrier. Furthermore, >>> the actual sleeping condition that needs ordering to prevent live-locks/ >>> missed wakeups is the modification/read of i_dio_count. So in this case >>> the waker(T2) needs to make the condition false _BEFORE_ doing a test. >>> >>> The interraction between the two threads roughly looks like: >>> >>> T1(truncate): T2(btrfs_direct_IO): >>> set_bit(BTRFS_INODE_READDIO_NEED_LOCK) if >>> (test_bit(BTRFS_INODE_READDIO_NEED_LOCK)) >>> if (atomic_read()) if >>> (atomic_dec_and_test(&inode->i_dio_count) >>> schedule() wake_up_bit >>> clear_bit(BTRFS_INODE_READDIO_NEED_LOCK) >>> >>> Without the ordering between the test_bit in T2 and setting the bit in >>> T1 (due to a missing pairing barrier in T2) it's possible that T1 goes >>> to sleep in schedule and T2 misses the bit set, resulting in missing the >>> wake up. >>> >>> In any case all of this is VERY subtle. So fix it by simply making >>> the DIO READ case take inode_lock_shared. This ensure that we can have >>> DIO reads in parallel at the same time we are protected against >>> concurrent modification of the target file. >> >> And that prevents writes and reads against different (i.e. not >> overlapping) ranges from happening in parallel. >> That has a big impact on applications (databases for e.g.) that >> operate on large files serving multiple requests. >> Now all reads are serialized against all writes and vice versa. >> > > Correct, but I'd prefer correctness over performance! And I assume other > people as well, since as is the code atm it's not providing full > protection between racing reads and truncate.
And even more people prefer correctness without significant impact on performance. So fix it differently please. > >> Unless I missed something, a big NAK to this change as it is. > > > >> >> >>> This way we closely mimic >>> what ext4 codes does and simplify this mess. >>> >>> Multiple xfstest runs didn't show any regressions. >>> >>> Signed-off-by: Nikolay Borisov <nbori...@suse.com> >>> --- >>> fs/btrfs/btrfs_inode.h | 17 ----------------- >>> fs/btrfs/inode.c | 34 ++++++++++++++++++++-------------- >>> 2 files changed, 20 insertions(+), 31 deletions(-) >>> >>> diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h >>> index f527e99c9f8d..3519e49d4ef0 100644 >>> --- a/fs/btrfs/btrfs_inode.h >>> +++ b/fs/btrfs/btrfs_inode.h >>> @@ -329,23 +329,6 @@ struct btrfs_dio_private { >>> blk_status_t); >>> }; >>> >>> -/* >>> - * Disable DIO read nolock optimization, so new dio readers will be forced >>> - * to grab i_mutex. It is used to avoid the endless truncate due to >>> - * nonlocked dio read. >>> - */ >>> -static inline void btrfs_inode_block_unlocked_dio(struct btrfs_inode >>> *inode) >>> -{ >>> - set_bit(BTRFS_INODE_READDIO_NEED_LOCK, &inode->runtime_flags); >>> - smp_mb(); >>> -} >>> - >>> -static inline void btrfs_inode_resume_unlocked_dio(struct btrfs_inode >>> *inode) >>> -{ >>> - smp_mb__before_atomic(); >>> - clear_bit(BTRFS_INODE_READDIO_NEED_LOCK, &inode->runtime_flags); >>> -} >>> - >>> static inline void btrfs_print_data_csum_error(struct btrfs_inode *inode, >>> u64 logical_start, u32 csum, u32 csum_expected, int >>> mirror_num) >>> { >>> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c >>> index 491a7397f6fa..9c43257e6e11 100644 >>> --- a/fs/btrfs/inode.c >>> +++ b/fs/btrfs/inode.c >>> @@ -5149,10 +5149,13 @@ static int btrfs_setsize(struct inode *inode, >>> struct iattr *attr) >>> /* we don't support swapfiles, so vmtruncate shouldn't fail >>> */ >>> truncate_setsize(inode, newsize); >>> >>> - /* Disable nonlocked read DIO to avoid the end less >>> truncate */ >>> - btrfs_inode_block_unlocked_dio(BTRFS_I(inode)); >>> + /* >>> + * Truncate after all in-flight dios are finished, new ones >>> + * will block on inode_lock. This only matters for AIO >>> requests >>> + * since DIO READ is performed under inode_shared_lock and >>> + * write under exclusive lock. >>> + */ >>> inode_dio_wait(inode); >>> - btrfs_inode_resume_unlocked_dio(BTRFS_I(inode)); >>> >>> ret = btrfs_truncate(inode); >>> if (ret && inode->i_nlink) { >>> @@ -8669,15 +8672,12 @@ static ssize_t btrfs_direct_IO(struct kiocb *iocb, >>> struct iov_iter *iter) >>> loff_t offset = iocb->ki_pos; >>> size_t count = 0; >>> int flags = 0; >>> - bool wakeup = true; >>> bool relock = false; >>> ssize_t ret; >>> >>> if (check_direct_IO(fs_info, iter, offset)) >>> return 0; >>> >>> - inode_dio_begin(inode); >>> - >>> /* >>> * The generic stuff only does filemap_write_and_wait_range, which >>> * isn't enough if we've written compressed pages to this area, so >>> @@ -8691,6 +8691,9 @@ static ssize_t btrfs_direct_IO(struct kiocb *iocb, >>> struct iov_iter *iter) >>> offset + count - 1); >>> >>> if (iov_iter_rw(iter) == WRITE) { >>> + >>> + inode_dio_begin(inode); >>> + >>> /* >>> * If the write DIO is beyond the EOF, we need update >>> * the isize, but it is protected by i_mutex. So we can >>> @@ -8720,11 +8723,13 @@ static ssize_t btrfs_direct_IO(struct kiocb *iocb, >>> struct iov_iter *iter) >>> dio_data.unsubmitted_oe_range_end = (u64)offset; >>> current->journal_info = &dio_data; >>> down_read(&BTRFS_I(inode)->dio_sem); >>> - } else if (test_bit(BTRFS_INODE_READDIO_NEED_LOCK, >>> - &BTRFS_I(inode)->runtime_flags)) { >>> - inode_dio_end(inode); >>> - flags = DIO_LOCKING | DIO_SKIP_HOLES; >>> - wakeup = false; >>> + } else { >>> + /* >>> + * In DIO READ case locking the inode in shared mode ensures >>> + * we are protected against parallel writes/truncates >>> + */ >>> + inode_lock_shared(inode); >>> + inode_dio_begin(inode); >>> } >>> >>> ret = __blockdev_direct_IO(iocb, inode, >>> @@ -8755,10 +8760,11 @@ static ssize_t btrfs_direct_IO(struct kiocb *iocb, >>> struct iov_iter *iter) >>> btrfs_delalloc_release_space(inode, data_reserved, >>> offset, count - (size_t)ret); >>> btrfs_delalloc_release_extents(BTRFS_I(inode), count); >>> - } >>> + } else >>> + inode_unlock_shared(inode); >>> out: >>> - if (wakeup) >>> - inode_dio_end(inode); >>> + inode_dio_end(inode); >>> + >>> if (relock) >>> inode_lock(inode); >>> >>> -- >>> 2.7.4 >>> >>> -- >>> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in >>> the body of a message to majord...@vger.kernel.org >>> More majordomo info at http://vger.kernel.org/majordomo-info.html >> >> >> -- Filipe David Manana, “Whether you think you can, or you think you can't — you're right.” -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html