On 10/03/2016 07:34 AM, Alex Bligh wrote:
>
>> On 3 Oct 2016, at 08:57, Christoph Hellwig <[email protected]> wrote:
>>
>>> Can you clarify what you mean by that? Why is it an "odd flush
>>> definition", and how would you "properly" define it?
>>
>> E.g. take the defintion from NVMe which also supports multiple queues:
>>
>> "The Flush command shall commit data and metadata associated with the
>> specified namespace(s) to non-volatile media. The flush applies to all
>> commands completed prior to the submission of the Flush command.
>> The controller may also flush additional data and/or metadata from any
>> namespace."
>>
>> The focus is completed - we need to get a reply to the host first
>> before we can send the flush command, so anything that we require
>> to be flushed needs to explicitly be completed first.
>
> I think there are two separate issues here:
>
> a) What's described as the "HOL blocking issue".
>
> This comes down to what Wouter said here:
>
>> Well, when I asked earlier, Christoph said[1] that blk-mq assumes that
>> when a FLUSH is sent over one channel, and the reply comes in, that all
>> commands which have been received, regardless of which channel they were
>> received over, have reched disk.
>>
>> [1] Message-ID: <[email protected]>
>>
>> It is impossible for nbd to make such a guarantee, due to head-of-line
>> blocking on TCP.
>
> this is perfectly accurate as far as it goes, but this isn't the current
> NBD definition of 'flush'.
>
> That is (from the docs):
>
>> All write commands (that includes NBD_CMD_WRITE, and NBD_CMD_TRIM) that the 
>> server completes (i.e. replies to) prior to processing to a NBD_CMD_FLUSH 
>> MUST be written to non-volatile storage prior to replying to 
>> thatNBD_CMD_FLUSH. This paragraph only applies if NBD_FLAG_SEND_FLUSH is set 
>> within the transmission flags, as otherwise NBD_CMD_FLUSH will never be sent 
>> by the client to the server.
>
> I don't think HOL blocking is an issue here by that definition, because all 
> FLUSH requires is that commands that are actually completed are flushed to 
> disk. If there is head of line blocking which delays the arrival of a write 
> issued before a flush, then the sender cannot be relying on whether that 
> write is actually completed or not (or it would have waited for the result). 
> The flush requires only that those commands COMPLETED are flushed to disk, 
> not that those commands RECEIVED have been flushed to disk (and a fortiori 
> not that those commands SENT FIRST) have been flushed to disk. From the point 
> of view of the client, the flush can therefore only guarantee that the data 
> associated with those commands for which it's actually received a reply prior 
> to issuing the flush will be flushed, because the replies can be disordered 
> too.
>
> I don't think there is actually a problem here - Wouter if I'm wrong about 
> this, I'd like to understand your argument better.
>
>
>
> b) What I'm describing - which is the lack of synchronisation between 
> channels.
>
> Suppose you have a simple forking NBD server which uses (e.g.) a Ceph 
> backend. Each process (i.e. each NBD channel) will have a separate connection 
> to something with its own cache and buffering. Issuing a flush in Ceph 
> requires waiting until a quorum of backends (OSDs) has been written to, and 
> with a number of backends substantially greater than the quorum, it is not 
> unlikely that a flush on one channel will not wait for writes on what Ceph 
> considers a completely independent channel to have fully written (assuming 
> the write completes before the flush is done).
>
> The same would happen pretty trivially with a forking server that uses a 
> process-space write-back cache.
>
> This is because the spec when the spec says: "All write commands (that 
> includes NBD_CMD_WRITE, and NBD_CMD_TRIM) that the server completes (i.e. 
> replies to) prior to processing to a NBD_CMD_FLUSH MUST be written to 
> non-volatile storage prior to replying to that NBD_CMD_FLUSH." what it 
> currently means is actually "All write commands (that includes NBD_CMD_WRITE, 
> and NBD_CMD_TRIM) ***ASSOCIATED WITH THAT CLIENT*** that the server completes 
> (i.e. replies to) prior to processing to a NBD_CMD_FLUSH MUST be written to 
> non-volatile storage prior to replying to that NBD_CMD_FLUSH".
>
> So what we would need the spec to mean is "All write commands (that includes 
> NBD_CMD_WRITE, and NBD_CMD_TRIM) ***ASSOCIATED WITH ANY CHANNEL OF THAT 
> CLIENT*** that the server completes (i.e. replies to) prior to processing to 
> a NBD_CMD_FLUSH MUST be written to non-volatile storage prior to replying to 
> that NBD_CMD_FLUSH". And as we have no way to associate different channels of 
> the same client, for servers that can't rely on the OS to synchronise 
> flushing across different clients relating to the same file, in practice that 
> means "All write commands (that includes NBD_CMD_WRITE, and NBD_CMD_TRIM) 
> ***ASSOCIATED WITH ANY CLIENT AT ALL*** that the server completes (i.e. 
> replies to) prior to processing to a NBD_CMD_FLUSH MUST be written to 
> non-volatile storage prior to replying to that NBD_CMD_FLUSH" - i.e. a flush 
> on any channel of any client must flush every channel of every client, 
> because we have no easy way to tell which clients are in fact two channels. I 
> have concerns over the sc
 alability of this.
>
> Now, in the reference server, NBD_CMD_FLUSH is implemented through an 
> fdatasync(). Each client (and therefore each channel) runs in a different 
> process.
>
> Earlier in this thread, someone suggested that if this happens:
>
>       Process A                Process B
>       =========                =========
>
>       fd1=open("file123")
>                                fd2=open("file123")
>
>       write(fd1, ...)
>                                fdatasync("fd2")
>
> then the fdatasync() is guaranteed to sync the write that Process A has 
> written. This may or may not be the case under Linux (wiser minds than me 
> will know). Is it guaranteed to be the case with (e.g.) the file on NFS? On 
> all POSIX platforms?
>
> Looking at 
> https://urldefense.proofpoint.com/v2/url?u=http-3A__pubs.opengroup.org_onlinepubs_009695399_functions_fdatasync.html&d=DQIFAg&c=5VD0RTtNlTh3ycd41b3MUw&r=sDzg6MvHymKOUgI8SFIm4Q&m=CrKTswZ5fz5tdtZvA9rerZHXb8O8O57LSOjNJN1ejms&s=bkOpj64mHN60JXapJ62GJe0Qtzp-ZWwVn91kXmJ247M&e=
>   I'd say it was a little ambiguous as to whether it will ALWAYS flush all 
> data associated with the file even if it is being written by a different 
> process (and a different FD).
>
> If fdatasync() is always guaranteed to flush data associated with writes by a 
> different process (and separately opened fd), then as it happens there won't 
> be a problem on the reference server, just on servers that don't happen to 
> use fdatasync() or similar to implement flushes, and which don't maintain 
> their own caches. If fdatasync() is not so guaranteed, we have a problem with 
> the reference server too, at least on some platforms and fling systems.
>
> What I'm therefore asking for is either:
> a) that the server can say 'if you are multichannel, you will need to send 
> flush on each channel' (best); OR
> b) that the server can say 'don't go multichannel'
>
> as part of the negotiation stage. Of course as this is dependent on the 
> backend, this is going to be something that is per-target (i.e. needs to come 
> as a transmission flag or similar).
>

Ok I understand your objections now.  You aren't arguing that we are unsafe by 
default, only that we are unsafe with servers that do something special beyond 
simply writing to a single disk or file.  I agree this is problematic, but you 
simply don't use this feature if your server can't deal with it well.  Thanks,

Josef
------------------------------------------------------------------------------
Check out the vibrant tech community on one of the world's most 
engaging tech sites, SlashDot.org! http://sdm.link/slashdot
_______________________________________________
Nbd-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nbd-general

Reply via email to