On Wed, Apr 25, 2018 at 6:31 PM, Andrey Borodin <x4...@yandex-team.ru> wrote:
> 4. Using O_DIRECT while writing data files

One interesting thing about O_DIRECT that I haven't seen mentioned in
these discussions:

POSIX effectively requires writes to the same file to be serialised,
meaning in practice that [p]write[v]() acquires a mutex on the inode.
I've heard of people working on other databases with multiple
write-back threads who considered this a serious problem, and I
guess it gets worse as you concentrate more data into bigger files
(to cut down on file descriptors and file system metadata operations)
and have more concurrent writers.  Some ways around that:

1.  On Linux if you use O_DIRECT then XFS forgets about that POSIX
requirement (O_DIRECT is outside the spec anyway).  It just doesn't
acquire inode->i_mutex, so you get parallel writes to the same file
(except in some corner cases).  AFAIK other Linux filesystems don't do
that, so XFS might be your only choice if you want parallel direct IO
on that OS.  Anyone know more about that?  Relevant kernel code: the
inode_lock() call that appears in ext4_file_write_iter() but not in
xfs_file_write_iter() in the IOCB_DIRECT case.

2.  Ditto for Solaris's UFS, AIX's JFS2 with "O_CIO" enabled (but
apparently not the JFS port on Linux).  I don't know the answer for
FreeBSD's UFS off-hand.

3.  I've heard that ZFS achieves parallel writes to the same file
while magically adhering to the POSIX serialisation rules, and that's
independent of direct IO (which it doesn't even support).
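
To make the access pattern concrete, here is a rough sketch (in Python for
brevity, not anything from PostgreSQL) of several write-back threads calling
pwrite() on disjoint block ranges of one shared file descriptor.  The name
parallel_pwrite, the 4096-byte BLOCK size and the direct flag are my own
assumptions for illustration.  It only shows the pattern; it does not
measure whether the filesystem actually runs the writes in parallel.

```python
import mmap
import os
import tempfile
import threading

BLOCK = 4096  # assumed alignment; real code should query the device/filesystem


def parallel_pwrite(path, nwriters=4, blocks_per_writer=8, direct=False):
    """Write disjoint block ranges of one file from several threads.

    Hypothetical helper: returns the total number of bytes written.
    """
    flags = os.O_RDWR | os.O_CREAT
    if direct:
        # O_DIRECT is Linux-specific; some filesystems reject it at open().
        flags |= getattr(os, "O_DIRECT", 0)
    fd = os.open(path, flags, 0o600)
    written = [0] * nwriters
    errors = []

    def writer(i):
        # An anonymous mmap is page-aligned, which O_DIRECT needs.
        buf = mmap.mmap(-1, BLOCK)
        buf.write(bytes([ord("A") + i]) * BLOCK)
        try:
            for b in range(blocks_per_writer):
                offset = (i * blocks_per_writer + b) * BLOCK
                # pwrite() takes an explicit offset, so the threads can
                # share one descriptor without racing on a file position.
                written[i] += os.pwrite(fd, buf, offset)
        except OSError as exc:
            errors.append(exc)
        finally:
            buf.close()

    threads = [threading.Thread(target=writer, args=(i,))
               for i in range(nwriters)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    os.close(fd)
    if errors:
        raise errors[0]
    return sum(written)


if __name__ == "__main__":
    target = os.path.join(tempfile.mkdtemp(), "datafile")
    print(parallel_pwrite(target), "bytes written to", target)
```

With direct=False every pwrite() goes through the page cache and, per the
POSIX rule above, is expected to serialise on the inode.  With direct=True
on XFS the same calls should be able to overlap, going by the kernel code
mentioned in point 1.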

-- 
Thomas Munro
http://www.enterprisedb.com
