On Wed, Apr 25, 2018 at 6:31 PM, Andrey Borodin <x4...@yandex-team.ru> wrote:
> 4. Using O_DIRECT while writing data files

One interesting thing about O_DIRECT that I haven't seen mentioned in
these discussions: POSIX effectively requires writes to the same file
to be serialised, meaning in practice that [p]write[v]() acquires a
mutex on the inode. I've heard of people working on other databases
with multiple write-back threads who considered this to be a serious
problem, and I guess it becomes worse as you concentrate more stuff
into bigger files (to cut down on file descriptors and file system
metadata operations) and have more concurrent writers. Some ways
around that:

1. On Linux, if you use O_DIRECT then XFS forgets about that POSIX
requirement (O_DIRECT is outside the spec anyway). It just doesn't
acquire inode->i_mutex, so you get parallel writes to the same file
(except in some corner cases). AFAIK other Linux filesystems don't do
that, so XFS might be your only choice if you want parallel direct IO
on that OS. Anyone know more about that? Relevant kernel code: the
inode_lock() call that appears in ext4_file_write_iter() but not in
xfs_file_write_iter() in the IOCB_DIRECT case.

2. Ditto for Solaris's UFS, and for AIX's JFS2 with "O_CIO" enabled
(but apparently not the JFS port on Linux). I don't know the answer
for FreeBSD's UFS off-hand.

3. I've heard that ZFS achieves parallel writes to the same file while
magically adhering to the POSIX serialisation rules, and that's
independent of direct IO (which it doesn't even support).

-- 
Thomas Munro
http://www.enterprisedb.com
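
For illustration only, the access pattern discussed above (several
writers issuing pwrite() against one file opened with O_DIRECT) can be
sketched in Python roughly as follows. This is a minimal sketch under
stated assumptions, not code from any database: the names BLOCK,
_write_blocks and parallel_pwrite are hypothetical, the 4096-byte block
is assumed to satisfy the device's alignment requirement, and the code
falls back to buffered IO when the platform or filesystem rejects
O_DIRECT. It shows only how the concurrent writes are issued; whether
the kernel actually runs them in parallel depends on the filesystem, as
described above, and the sketch does not measure that.

```python
import errno
import mmap
import os
from concurrent.futures import ThreadPoolExecutor

BLOCK = 4096  # assumption: a multiple of the device's logical block size


def _write_blocks(path, n_writers, flags):
    """Have n_writers threads each pwrite() one block to the same fd."""
    fd = os.open(path, flags, 0o600)
    try:
        def write_one(i):
            # An anonymous mmap is page-aligned, which O_DIRECT requires
            # of the user buffer on Linux.
            buf = mmap.mmap(-1, BLOCK)
            try:
                buf.write(bytes([i % 256]) * BLOCK)
                # pwrite() takes an explicit offset, so the writers can
                # share one descriptor without racing on the file position.
                return os.pwrite(fd, buf, i * BLOCK)
            finally:
                buf.close()

        with ThreadPoolExecutor(max_workers=n_writers) as pool:
            return sum(pool.map(write_one, range(n_writers)))
    finally:
        os.close(fd)


def parallel_pwrite(path, n_writers=4):
    """Return (bytes_written, used_o_direct)."""
    base = os.O_WRONLY | os.O_CREAT
    direct = getattr(os, "O_DIRECT", 0)  # not defined on every platform
    if direct:
        try:
            return _write_blocks(path, n_writers, base | direct), True
        except OSError as e:
            # EINVAL is what filesystems without direct IO support
            # typically return; anything else is a real error.
            if e.errno != errno.EINVAL:
                raise
    return _write_blocks(path, n_writers, base), False
```

Each writer targets a disjoint, block-aligned region, which is the case
where skipping the per-inode lock is most useful. Python threads are
adequate here because os.pwrite() releases the interpreter lock for the
duration of the system call.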