Goswin,

--On 29 May 2011 14:53:01 +0200 Goswin von Brederlow <[email protected]> wrote:
> That really sucks documentation wise. Because then you have to start a
> hunt for further documentation which probably doesn't even exist other
> than the source.
>
> It is ok to say we implement what the linux block layer expects but it
> should be spelled out in the text or at least name a file to look at for
> the details. It should not be left this vague.

Point taken. However, it does mean we would end up documenting in practice
how the linux block layer behaves. Which itself is not static.

>>> True. And a read reply takes time (lots of data to send). In case there
>>> are multiple replies pending it would make sense to order them so that
>>> FUA/FLUSH get priority I think. After that I think all read replies
>>> should go out in order of their request (oldest first) and write replies
>>> last. Reason being that something will be waiting for the read while the
>>> writes are likely cached. On the other hand write replies are tiny and
>>> sending them first gets them out of the way and clears up dirty pages on
>>> the client side faster. That might be beneficial too.
>>>
>>> What do you think?
>>
>> There's no need to specify that in the protocol. It may be how you choose
>> to implement it in your server; but it might not be how I choose to
>> implement it in mine. A good example is a RAID0/JBOD server where you
>> might choose to split the incoming request queue by underlying physical
>> device (and split requests spanning multiple devices into multiple
>> requests). Each device's queue could be handled by a separate thread.
>> This is perfectly permissible, and there needs to be no ordering between
>> the queues.
>
> Obviously. That was purely an implementation question.
>
> I think you also misunderstood me. I didn't mean that incoming requests
> should be ordered in this way but that pending outgoing replies should
> be.

I don't think I misunderstood. What I meant was that in such a JBOD
situation, one disk X might be replying more slowly than disk Y (say
because it happens to have a more seeky load). So a read request to disk Y
issued after a read request to disk X might result in a disk Y reply
coming before the reply for disk X.

> But I just thought about something else you wrote that makes this a bad
> idea. You said that a FLUSH only ensures completed requests have been
> flushed.

Yes.

> So if a FLUSH is ACKed before a WRITE then the client should
> assume that WRITE wasn't yet flushed and issue another FLUSH.

If a client wants to flush a particular write, it should just not issue
the flush until the write is ACK'd. No need to issue two flushes. This is
the way the request system works, not my design!

> To prevent
> this a FLUSH ACK should come after any WRITE ACK that it flushed to
> disk.

No, it need only come after it has flushed any acknowledged writes, i.e.
it can send the FLUSH ACK before the ACKs of any other writes (for
instance unacknowledged ones issued before the flush) it happened to flush
to disk at the same time.

> So there should be some limits on how much a FUA/FLUSH ACK can
> skip other replies.

I don't see why it has to be any different to the current linux request
model. It's a bit odd in some ways, but it is a working system. And this
has /nothing/ to do with FUA. There are no ordering constraints at all on
FUA.

> Maybe it is best to simply send out replies in the order they happen to
> finish. Or send them in the order they came in (only those waiting to be
> sent, no waiting).

I think you are overcomplicating it :-) You may process and send replies
in any order.
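To make that concrete, here is a rough sketch. It is purely my
illustration, not how nbd-server is actually structured, and the helper
send_reply() is made up. It ACKs each write as soon as pwrite() returns
and services a flush with fdatasync(); because fdatasync() pushes
everything already written through that descriptor to stable storage,
every write we have already ACK'd is covered, which is all the kernel's
ordering contract asks for. No ordering of the replies themselves is
needed.

#include <errno.h>
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: send an NBD reply for 'handle' with 'error' (0 = OK). */
extern void send_reply(int sock, uint64_t handle, uint32_t error);

void handle_write(int sock, int backing_fd, uint64_t handle,
                  const void *buf, size_t len, off_t offset)
{
        ssize_t done = pwrite(backing_fd, buf, len, offset);
        /* ACK as soon as the data is in the page cache; it is not yet
         * guaranteed to be on non-volatile storage. */
        send_reply(sock, handle, done == (ssize_t)len ? 0 : EIO);
}

void handle_flush(int sock, int backing_fd, uint64_t handle)
{
        /* fdatasync() does not return until data already written to the
         * file has reached non-volatile storage, so every previously
         * ACK'd write is flushed before the flush itself is ACK'd. */
        send_reply(sock, handle, fdatasync(backing_fd) == 0 ? 0 : EIO);
}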
However, to process a flush, you need to ensure all completed writes go to
non-volatile storage before acking. That could be as simple as an fsync(),
or (e.g.) flushing your own volatile cache to disk and asking the
component disks to flush their volatile caches. FUA is normally handled
simply by keeping the bit in the request and ensuring it writes through.

> It should be extended to have a FLUSH option then. For a simple case
> that would be the same as fsync(). On a striped raid or multiple device
> LV it could be reduced to only flush the required physical devices and
> not all devices.

I agree it isn't particularly useful, as do certain people on linux-kernel
(see the author of the original text - Christoph Hellwig). However, Linus
himself put the syscall in. You'll need to debate the point on
linux-kernel.

>> Yes. But it's more than that. If you write to a CoW based filing system,
>> fsync (and even fdatasync) will ensure the CoW metadata is also flushed
>> to the device, whereas sync_file_range won't. Without the CoW metadata
>> being written, the data itself is not really written.
>
> Which just means that a CoW based filing system or sparse files don't
> support FUA.

No, CoW based filing systems *do* support FUA in that they send them out.
Go trace what (e.g.) btrfs does. I think you are confusing block layer
semantics (REQ_FLUSH and REQ_FUA) with VFS semantics. There is no VFS
equivalent of either REQ_FLUSH or REQ_FUA. fsync() on a file does roughly
what REQ_FLUSH does. Opening a second file with O_DATASYNC set and writing
the blocks to that does roughly what REQ_FUA does (an fdatasync() does
rather more). nbd-server currently does "more than it needs" for REQ_FUA,
but given that almost all REQ_FUA are immediately followed by a REQ_FLUSH,
and 2 x fdatasync in a row are no more work than one, this doesn't matter.

> The idea of a FUA is that it is cheaper than a FLUSH. But
> if nbd-server does fsync() in both cases then it is pointless to
> announce FUA support.

Well, FUA could (and will if I have a minute) be implemented using a
shadow file and O_DATASYNC - see the sketch in the PS at the end of this
mail. I think there is a comment to that effect. However, nbd-server is
not the only server in existence.

>>> Why not return EIO on the next FLUSH? If I return success on the next
>>> FLUSH that would make the client think the write has successfully
>>> migrated to the physical medium. Which would not be true.
>>
>> Because
>> a) there may not be a next FLUSH at all
>
> Then I will never know the write did have an error. I only see that on
> fsync().

As I said, I am not saying don't error the flush. I am saying don't only
error the flush.

> Say you are running "mkfs -t ext4 -c -c /dev/nbd0". Now you hit one bad
> block and the device turns itself into read-only mode. Not the behaviour
> you want.

Well, assuming you want the mkfs to carry on, and it wants to know where
block errors are, it should be opening the disk with O_SYNC (or O_DATASYNC
these days), which will translate into REQ_FUA as I understand it.

Consider a normal SATA disk with a write-behind cache (forget nbd for a
minute). From memory mkfs just does normal block writes. It may do an
fsync() on the block device which results in a sync at the end. It has no
way of knowing where bad blocks are anyway. (In practice SATA devices do
their own bad block management, but you probably know that.)

>>> What happens if the client mounts a filesystem with sync option? Does
>>> nbd then get every request with FUA set?
>>
>> No. The client sends FUA only when there is FUA on the request.
>> The sync
>> option is a server option, and the server merely syncs after every
>> request.
>
> It is also a mount option. Mounting a filesystem with sync on a local
> disk and on nbd should give the same behaviour.

A mount option is something different. That will (as I understand it)
cause the block layer to work synchronously, and you will get
FUA/FLUSH/whatever. No client negotiation (beyond advertising support of
these) is needed. That's just how a sync mount option would work with any
other block device.

Note nbd devices don't have to be mounted. What you are proposing would
affect raw I/O (not mounted I/O) to these block devices.

>> And what I am saying is that this is not current behaviour. Even with
>> today's nbd release and my patched kernel (none of which you are
>> guaranteed) you will not get one single REQ_FLUSH before dismounting an
>> ext2 or (on Ubuntu and some other distros) ext3 filing system with
>> default options. With an older kernel or client you will never get a
>> REQ_FLUSH *ever*. So if you throw away data because it is not flushed
>> when you get an NBD_CMD_DISC you *will corrupt the filing system*. Do
>> not do this. You should treat NBD_CMD_DISC as containing an implicit
>> flush (by which I mean buffer flush, not necessarily write to disk),
>> which it always has done (it closes the files).
>
> I'm not talking about throwing away any data. The data will be written
> or the write requests wouldn't have been ACKed.

I mean "written to non-volatile storage". You can ACK data if it's been
written to (e.g.) a volatile cache. If you do, you must commit that cache
to non-volatile storage after NBD_CMD_DISC at some stage rather than
discard it.

>>>>> * NBD_CMD_FLUSH: Wait for all pending requests to finish, flush data
>>>>> (sync_file_range() on the whole file, fdatasync() or fsync()
>>>>> returned).
>>>>
>>>> You only need to wait until any writes that you have sent replies
>>>> for have been flushed to disk. It may be easier to write more than
>>>> that (which is fine). Whilst you do not *have* to flush any commands
>>>> issued (but not completed) before the REQ_FLUSH, Jan Kara says
>>>> "*please* don't do that".
>>>
>>> Urgs. Assuming one doesn't flush commands that were issued but not yet
>>> completed. How then should the client force those to disk? Sleep until
>>> they happen to be ACKed and only then flush?
>>
>> Easy. The client issues a REQ_FLUSH *after* anything that needs to
>> be flushed to disk has been ACK'd.
>
> This would make ACKing writes only when they reach the physical medium
> (using libaio, not fsync() every write) a total no go with a file backed
> device. Is that really how Linux currently works? If that's true then I
> really need to switch to ACKing requests as soon as the write is issued
> and not when it completes.

There's nothing to prevent you doing *more* than Linux requires. Linux
only issues a REQ_FLUSH after the write of the data it wants to go to disk
has been ACK'ed (so Jan Kara / Christoph H say, anyway). But yes, if you
want speed, you should consider ACK'ing before you have actually done so,
which will mean that (by default) you have become a write-behind cache.

> Then we need to spell out what that behaviour exactly is:
>
> a) A FLUSH affects at least all completed requests; a client must wait
> for request completion before sending a FLUSH.

Yes. Except the client need only wait for completion of those requests it
wants to ensure are flushed (not every request).

> b) A FLUSH might affect other requests.
> (Normally those issued but not
> yet completed before the flush is issued.)

Yes. You can always flush more than is required.

> c) Requests should be ACKed as soon as possible to minimize the delay
> until a client can safely issue a FLUSH.

That's probably true performance-wise as a general point, but there is a
complexity / safety / memory use tradeoff. If you ACK every request as
soon as it comes in, you will use a lot of memory.

>> A request driven driver simply errors the request, in which case it
>> is passed up and errors the relevant bios (there may be more than one).
>> The errored block number is largely irrelevant as there might not
>> be one (REQ_FLUSH) or it might be outside the bio due to merging
>> (that's my understanding anyway).
>
> How? I mean that just pushes the issue down a layer. The physical disk
> gets a write request, dumps the data into its cache and ACKs the
> write. The driver passes up the ACK to the bio and the bio
> completes. Then some time later the driver gets a REQ_FLUSH and the disk
> returns a write error when it finds out it can't actually write the
> block.
>
> Color me ignorant but isn't that roughly how it will go with a disk with
> write caching enabled?

I am not sure what the "How?" is in relation to. Remember a request based
driver doesn't deal with bios, it deals with requests. There is not a 1:1
relationship, due to merging, and due to generation of additional requests
to do flushes, etc. I think the bit that is missing is "the elevator
algorithm".

All I'm saying is that there isn't an "errored block" that is passed up
the chain - there's just an error on a particular request, which might be
a 20MB request, formed by the merger of 10 bios. There's no indication
where the error occurred as far as I know (or if there is, it is lost
between layers).

-- 
Alex Bligh
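PS: the shadow-file idea for FUA mentioned above, as a sketch. To be clear
about the assumptions: this is my illustration only, not what nbd-server
currently does; it assumes a plain-file backend; and the flag is spelled
O_DSYNC in current headers (what I called O_DATASYNC above). Writes
carrying FUA go through a second descriptor opened with O_DSYNC, so each
of them reaches stable storage before being ACK'd, which is roughly the
REQ_FUA semantic; non-FUA writes still use the ordinary descriptor and
rely on the flush handling discussed earlier.

#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Open the backing file twice: 'fd' for ordinary writes, 'fua_fd' (O_DSYNC)
 * for writes that carry the FUA flag. Returns 0 on success, -1 on error. */
int open_backing(const char *path, int *fd, int *fua_fd)
{
        *fd = open(path, O_RDWR);
        if (*fd < 0)
                return -1;
        *fua_fd = open(path, O_RDWR | O_DSYNC);
        if (*fua_fd < 0) {
                close(*fd);
                return -1;
        }
        return 0;
}

/* Write one request's data, honouring FUA. Returns 0 or an errno value. */
int do_write(int fd, int fua_fd, int fua, const void *buf, size_t len, off_t off)
{
        ssize_t done = pwrite(fua ? fua_fd : fd, buf, len, off);
        if (done < 0)
                return errno;
        return done == (ssize_t)len ? 0 : EIO;
}

NBD_CMD_FLUSH is unchanged in this scheme: fdatasync() on either
descriptor still covers all completed writes to the file.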
