On 09/21/2016 08:58 PM, Jeff Darcy wrote:
However, my understanding is that filesystems need not maintain the relative
order of writes (as received from the VFS/kernel) on two different fds. Also,
maintaining that order might come with increased latency, because "newer"
writes would have to wait on "older" ones. This wait can eventually fill up
the write-behind cache and hence leave it unable to "write-back" newer writes.
IEEE 1003.1, 2013 edition
After a write() to a regular file has successfully returned:
Any successful read() from each byte position in the file that was
modified by that write shall return the data specified by the write()
for that position until such byte positions are again modified.
Any subsequent successful write() to the same byte position in the
file shall overwrite that file data.
Note that the reference is to a *file*, not to a file *descriptor*.
It's an application of the general POSIX assumption that time is
simple, locking is cheap (if it's even necessary), and therefore
time-based requirements like linearizability - what this is - are
easy to satisfy. I know that's not very realistic nowadays, but
it's pretty clear: according to the standard as it's still written,
P2's write *is* required to overwrite P1's. Same vs. different fd
or process/thread doesn't even come into play.
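To make the point above concrete, here is a minimal sketch, assuming a local
POSIX filesystem and Python's os module. The helper name last_write_wins is
hypothetical and not from any library. It issues two sequential writes to the
same byte range through two different descriptors and reads the range back
through a third; per the quoted text, the read should see the later write.

```python
import os
import tempfile


def last_write_wins(first: bytes, second: bytes) -> bytes:
    """Write `first`, then `second`, at offset 0 via separate fds; read back.

    Hypothetical helper for illustration only. The writes are sequential
    (the first has returned before the second starts), not concurrent.
    """
    fd, path = tempfile.mkstemp()
    os.close(fd)
    # Three independent descriptors on the same file.
    fd1 = os.open(path, os.O_WRONLY)
    fd2 = os.open(path, os.O_WRONLY)
    fd3 = os.open(path, os.O_RDONLY)
    try:
        os.pwrite(fd1, first, 0)   # "P1" writes first
        os.pwrite(fd2, second, 0)  # "P2" overwrites the same bytes
        return os.pread(fd3, len(second), 0)
    finally:
        for d in (fd1, fd2, fd3):
            os.close(d)
        os.unlink(path)
```

This only exercises the sequential case on one machine; it says nothing about
what a distributed filesystem with write-behind caching will do.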
Just for fun, I'll point out that the standard snippet above
doesn't say anything about *non overlapping* writes. Does POSIX
allow the following?
read B, get new value
read A, get *old* value
This is a non-linearizable result, which would surely violate
some people's (notably POSIX authors') expectations, but good
luck finding anything in that standard which actually precludes it.
I will reply to both comments here.
First, I think that all file systems will behave this way, since this is really
a function of how the page cache and O_DIRECT work.
More broadly, this is not a promise or a hard and fast rule. The traditional
approach for applications that do concurrent writes is to use either whole-file
or byte-range locking whenever one or more threads/processes are doing IO to
the same file concurrently.
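As a rough illustration of byte-range locking, here is a minimal sketch using
Python's fcntl.lockf, which wraps POSIX fcntl() record locks. The names
locked_pwrite and demo are hypothetical, chosen for this example only. The
lock is advisory, so it only helps if every writer takes it.

```python
import fcntl
import os
import tempfile


def locked_pwrite(fd: int, data: bytes, offset: int) -> int:
    """Write `data` at `offset` while holding an exclusive lock on that range.

    Hypothetical helper: locks only [offset, offset + len(data)), so writers
    to other ranges of the same file are not blocked.
    """
    fcntl.lockf(fd, fcntl.LOCK_EX, len(data), offset, os.SEEK_SET)
    try:
        return os.pwrite(fd, data, offset)  # returns bytes actually written
    finally:
        fcntl.lockf(fd, fcntl.LOCK_UN, len(data), offset, os.SEEK_SET)


def demo() -> bytes:
    """Write five bytes at offset 3 under the lock and read them back."""
    fd, path = tempfile.mkstemp()
    try:
        locked_pwrite(fd, b"hello", 3)
        return os.pread(fd, 5, 3)
    finally:
        os.close(fd)
        os.unlink(path)
```

Note that these record locks are owned by the process, so they serialize
writers in different processes; threads inside one process would still need
their own mutex.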
I don't understand Jeff's snippet above: if they are non-overlapping writes
to different offsets, this would never happen.
If the writes are to the same offset and happened at different times, it would
not happen either.
If they are to the same offset and at the same time, then you can get undefined
results, where you might get fragments of A and fragments of B (and you might
be able to see some odd things if the write spans pages/blocks).
This last case is where the normal best practice comes in to suggest using
Gluster-devel mailing list