On Mon, May 16, 2011 at 01:46:09PM +0100, Alex Bligh wrote: > Enclosed is a patch (COMPILE TESTED ONLY - DO NOT RUN ON LIVE SYSTEMS) which > supports FUA (force unit access) and FLUSH. > > FUA means "force the current write to hit the media service" and FLUSH > means "empty the current write queue to disk". Broadly they have the same > semantics as the linux kernel REQ_FLUSH and REQ_FUA. FUA is implemented > through sync_file_range() (or if that doesn't exist, fdatasync() on the > file handle concerned), FLUSH through fdatasync() on all files. FUA and > FLUSH are selected in the config file, and set new flags bits which will > cause the client to sent FUA and FLUSH requests. The way these are > implemented is further explained in doc/proto.txt. > > The purpose of this is reasonably obvious: without supporting either FUA > or FLUSH (and it's relatively easy to support both), filesystems on the > client have no way to ensure the relevant sectors have hit the disk. > The patch is implement such that the default behaviour is unchanged. > > Additionally, it introduces an F_ROTATIONAL flag. This will turn off > the use of QUEUE_FLAG_NONROT in the client. QUEUE_FLAG_NONROT effectively > disables the elevator algorithm, making the algorithm merge only. That is > unhelpful where the server does not have its own elevator algorithm > or where the client elevator algorithm is neutered (e.g. writing to a > raw partition with F_SYNC with nbd-server). It's not going to be used > often where the backing store is a file. > > It also incidentally fixes a bug where F_SYNC is ignored if F_COPYONWRITE > is set (not that this currently has much utility). > > I have not written the client side of this yet (see comment re "compile > tested only"). What I'd like to know (before I spend too much time on it) > is whether this is a useful course to pursue, and whether I'm doing this > in a vaguely useful way. I repeat the point about this code not being > tested at all beyond compilation. > > Clean version of patch at > http://www.alex.org.uk/nbd-flush-fua-01.patch > in case my mailer mangles newlines > > -- > Alex Bligh > > Signed-Off-By: Alex Bligh <[email protected]> > > diff --git a/cliserv.h b/cliserv.h > index 51c8bd1..63c2743 100644 > --- a/cliserv.h > +++ b/cliserv.h > @@ -40,7 +40,12 @@ typedef unsigned long long u64; > #include "nbd.h" > > #if NBD_LFS==1 > +/* /usr/include/features.h (included from /usr/include/sys/types.h) > + defines this when _GNU_SOURCE is defined > + */ > +#ifndef _LARGEFILE_SOURCE > #define _LARGEFILE_SOURCE > +#endif > #define _FILE_OFFSET_BITS 64 > #endif > > @@ -135,6 +140,9 @@ u64 ntohll(u64 a) { > /* Flags used between the client and server */ > #define NBD_FLAG_HAS_FLAGS (1 << 0) /* Flags are there */ > #define NBD_FLAG_READ_ONLY (1 << 1) /* Device is read-only */ > +#define NBD_FLAG_SEND_FLUSH (1 << 2) /* Send FLUSH */ > +#define NBD_FLAG_SEND_FUA (1 << 3) /* Send FLUSH */
"Send FUA" rather than "Send FLUSH"? Also, please expand what 'FUA' stands for in this comment. > +#define NBD_FLAG_ROTATIONAL (1 << 3) /* Use elevator algorithm - > rotational media */ That's the same value as the _SEND_FUA thing here; should that not be different? > #define NBD_DEFAULT_PORT "10809" /* Port on which named exports are > * served */ > diff --git a/configure.ac b/configure.ac > index f50193f..c78177b 100644 > --- a/configure.ac > +++ b/configure.ac > @@ -94,6 +94,9 @@ AC_CHECK_SIZEOF(unsigned int) > AC_CHECK_SIZEOF(unsigned long int) > AC_CHECK_SIZEOF(unsigned long long int) > AC_CHECK_FUNCS([llseek alarm gethostbyname inet_ntoa memset socket > strerror strstr]) > +AC_CHECK_FUNC([sync_file_range], > + [AC_DEFINE([HAVE_SYNC_FILE_RANGE], [sync_file_range(2) is not > supported], > [sync_file_range(2) is supported])], > + []) > AC_FUNC_FORK > AC_FUNC_SETVBUF_REVERSED > AC_MSG_CHECKING(whether client should be built) > diff --git a/doc/proto.txt b/doc/proto.txt > index fe5e819..d9a269b 100644 > --- a/doc/proto.txt > +++ b/doc/proto.txt > @@ -26,8 +26,9 @@ server during the handshake. > There are two message types in the data pushing phase: the request, and > the response. > > -There are three request types in the data pushing phase: NBD_CMD_READ, > -NBD_CMD_WRITE, and NBD_CMD_DISC (disconnect). > +There are four request types in the data pushing phase: NBD_CMD_READ, > +NBD_CMD_WRITE, NBD_CMD_WRITE_FUA, NBD_CMD_DISC (disconnect), and > +NBD_CMD_FLUSH. > > The request is sent by the client; the response by the server. A request > header consists a 32 bit magic number (magic), a 32 bit field denoting > @@ -50,6 +51,15 @@ we change that to asynchronous handling, handling the > disconnect request > will probably be postponed until there are no other outstanding > requests. > > +NBD_CMD_WRITE_FUA is the equivalent of NBD_WRITE, with an implied > +FUA (force unit access). This is implementing using sync_file_range > +if present, else by fdatasync() of that file (note not all files > +in a multifile environment). NBD_CMD_WRITE_FUA will not be sent > +unless NBD_FLAG_SEND_FUA is set. > + > +A flush request will not be sent unless NBD_FLAG_SEND_FLUSH is set, > +and indicates the backing file should be fdatasync()'d to disk. > + > There are two versions of the negotiation: the 'old' style (nbd <= > 2.9.16) and the 'new' style (nbd >= 2.9.17, though due to a bug it does > not work with anything below 2.9.18). What follows is a description of > diff --git a/nbd-server.c b/nbd-server.c > index 4b9a083..88e4615 100644 > --- a/nbd-server.c > +++ b/nbd-server.c > @@ -55,6 +55,9 @@ > * Suriya Soutmun <[email protected]> > */ > > +/* For sync_file_range */ > +#define _GNU_SOURCE > + This should perhaps be inside an #ifdef HAVE_SYNC_FILE_RANGE, too. > /* Includes LFS defines, which defines behaviours of some of the following > * headers, so must come before those */ > #include "lfs.h" > @@ -134,11 +137,13 @@ gboolean do_oldstyle=FALSE; > #define DEBUG2( a,b ) printf( a,b ) > #define DEBUG3( a,b,c ) printf( a,b,c ) > #define DEBUG4( a,b,c,d ) printf( a,b,c,d ) > +#define DEBUG5( a,b,c,d,e ) printf( a,b,c,d,e ) > #else > #define DEBUG( a ) > #define DEBUG2( a,b ) > #define DEBUG3( a,b,c ) > #define DEBUG4( a,b,c,d ) > +#define DEBUG5( a,b,c,d,e ) I should probably change these to use C99 varargs macros (we require C99 these days anyway). Oh well, -- some other time, I guess :-) > #endif > #ifndef PACKAGE_VERSION > #define PACKAGE_VERSION "" > @@ -160,6 +165,9 @@ gboolean do_oldstyle=FALSE; > #define F_SPARSE 16 /**< flag to tell us copyronwrite should use a > sparse file */ > #define F_SDP 32 /**< flag to tell us the export should be done using > the Socket Direct Protocol for RDMA */ > #define F_SYNC 64 /**< Whether to fsync() after a write */ > +#define F_FLUSH 128 /**< Whether server wants FLUSH to be sent by the > client */ > +#define F_FUA 128 /**< Whether server wants FUA to be sent by the > client These should probably not be the same values? > */ > +#define F_ROTATIONAL 128 /**< Whether server wants the client to > implement the elevator algorithm */ > GHashTable *children; > char pidfname[256]; /**< name of our PID file */ > char pidftemplate[256]; /**< template to be used for the filename of the > PID file */ > @@ -713,6 +721,9 @@ GArray* parse_cfile(gchar* f, GError** e) { > { "sparse_cow", FALSE, PARAM_BOOL, NULL, F_SPARSE }, > { "sdp", FALSE, PARAM_BOOL, NULL, F_SDP }, > { "sync", FALSE, PARAM_BOOL, NULL, F_SYNC }, > + { "flush", FALSE, PARAM_BOOL, NULL, F_FLUSH }, > + { "fua", FALSE, PARAM_BOOL, NULL, F_FUA }, > + { "rotational", FALSE, PARAM_BOOL, NULL, F_ROTATIONAL }, > { "listenaddr", FALSE, PARAM_STRING, NULL, 0 }, > { "maxconnections", FALSE, PARAM_INT, NULL, 0 }, > }; > @@ -763,9 +774,10 @@ GArray* parse_cfile(gchar* f, GError** e) { > lp[6].target=&(s.postrun); > lp[7].target=lp[8].target=lp[9].target= > lp[10].target=lp[11].target= > - lp[12].target=&(s.flags); > - lp[13].target=&(s.listenaddr); > - lp[14].target=&(s.max_connections); > + lp[12].target=lp[13].target= > + lp[14].target=lp[15].target=&(s.flags); > + lp[16].target=&(s.listenaddr); > + lp[17].target=&(s.max_connections); > > /* After the [generic] group, start parsing exports */ > if(i==1) { > @@ -1060,7 +1072,7 @@ void myseek(int handle,off_t a) { > * @param client The client we're serving for > * @return The number of bytes actually written, or -1 in case of an error > **/ > -ssize_t rawexpwrite(off_t a, char *buf, size_t len, CLIENT *client) { > +ssize_t rawexpwrite(off_t a, char *buf, size_t len, CLIENT *client, int > fua) { > int fhandle; > off_t foffset; > size_t maxbytes; > @@ -1071,12 +1083,20 @@ ssize_t rawexpwrite(off_t a, char *buf, size_t len, > CLIENT *client) { > if(maxbytes && len > maxbytes) > len = maxbytes; > > - DEBUG4("(WRITE to fd %d offset %llu len %u), ", fhandle, foffset, len); > + DEBUG5("(WRITE to fd %d offset %llu len %u fua %d), ", fhandle, > foffset, > len, fua); > > myseek(fhandle, foffset); > retval = write(fhandle, buf, len); > if(client->server->flags & F_SYNC) { > fsync(fhandle); > + } else if (fua) { > +#ifdef HAVE_SYNC_FILE_RANGE > + sync_file_range(fhandle, foffset, len, > + SYNC_FILE_RANGE_WAIT_BEFORE | > SYNC_FILE_RANGE_WRITE | > + SYNC_FILE_RANGE_WAIT_AFTER); > +#else > + fdatasync(fhandle); > +#endif > } > return retval; > } > @@ -1085,10 +1105,10 @@ ssize_t rawexpwrite(off_t a, char *buf, size_t len, > CLIENT *client) { > * Call rawexpwrite repeatedly until all data has been written. > * @return 0 on success, nonzero on failure > **/ > -int rawexpwrite_fully(off_t a, char *buf, size_t len, CLIENT *client) { > +int rawexpwrite_fully(off_t a, char *buf, size_t len, CLIENT *client, int > fua) { > ssize_t ret=0; > > - while(len > 0 && (ret=rawexpwrite(a, buf, len, client)) > 0 ) { > + while(len > 0 && (ret=rawexpwrite(a, buf, len, client, fua)) > 0 ) { > a += ret; > buf += ret; > len -= ret; > @@ -1189,7 +1209,7 @@ int expread(off_t a, char *buf, size_t len, CLIENT > *client) { > * @param client The client we're going to write for. > * @return 0 on success, nonzero on failure > **/ > -int expwrite(off_t a, char *buf, size_t len, CLIENT *client) { > +int expwrite(off_t a, char *buf, size_t len, CLIENT *client, int fua) { > char pagebuf[DIFFPAGESIZE]; > off_t mapcnt,mapl,maph; > off_t wrlen,rdlen; > @@ -1197,7 +1217,7 @@ int expwrite(off_t a, char *buf, size_t len, CLIENT > *client) { > off_t offset; > > if (!(client->server->flags & F_COPYONWRITE)) > - return(rawexpwrite_fully(a, buf, len, client)); > + return(rawexpwrite_fully(a, buf, len, client, fua)); Gratuitous whitespace change? > DEBUG3("Asked to write %d bytes at %llu.\n", len, (unsigned long > long)a); > > mapl=a/DIFFPAGESIZE ; maph=(a+len-1)/DIFFPAGESIZE ; > @@ -1230,6 +1250,33 @@ int expwrite(off_t a, char *buf, size_t len, CLIENT > *client) { > } > len-=wrlen ; a+=wrlen ; buf+=wrlen ; > } > + if (client->server->flags & F_SYNC) { > + fsync(client->difffile); > + } else if (fua) { > + /* open question: would it be cheaper to do multiple > sync_file_ranges? > + as we iterate through the above? > + */ Good question... it might be a good idea to test this. > + fdatasync(client->difffile); > + } > + return 0; > +} > + > +int expflush(CLIENT *client) { > + int fhandle; > + off_t foffset; > + size_t maxbytes; > + gint i; > + > + if (!(client->server->flags & F_COPYONWRITE)) { > + return fsync(client->difffile); > + } > + > + for (i = 0; i < client->export->len; i++) { > + FILE_INFO fi = g_array_index(client->export, FILE_INFO, i); > + if (fsync(fi.fhandle) < 0) > + return -1; > + } > + > return 0; > } > > @@ -1322,6 +1369,12 @@ CLIENT* negotiate(int net, CLIENT *client, GArray* > servers) { > err("Negotiation failed: %m"); > if (client->server->flags & F_READONLY) > flags |= NBD_FLAG_READ_ONLY; > + if (client->server->flags & F_FLUSH) > + flags |= NBD_FLAG_SEND_FLUSH; > + if (client->server->flags & F_FUA) > + flags |= NBD_FLAG_SEND_FUA; > + if (client->server->flags & F_ROTATIONAL) > + flags |= NBD_FLAG_ROTATIONAL; > if (!client->modern) { > /* oldstyle */ > flags = htonl(flags); > @@ -1419,7 +1472,7 @@ int mainloop(CLIENT *client) { > continue; > } > > - if (request.type==NBD_CMD_WRITE) { > + if ((request.type==NBD_CMD_WRITE) || > (request.type==NBD_CMD_WRITE_FUA)) { This makes me think that the _FUA thing should be something that can be masked away; that way, we can do stuff like int type = request.type & ~NBD_CMD_FLAG_FUA; if (type == NBD_CMD_WRITE) { ... } or something similar, which seems cleaner. It would also allow it to be applied to other request types without having to special-case too much; e.g., there's been talk of a truncate command, which could sparsify the backend. I guess that could also use this FUA thing. > DEBUG("wr: net->buf, "); > while(len > 0) { > readit(client->net, buf, currlen); > @@ -1430,11 +1483,12 @@ int mainloop(CLIENT *client) { > ERROR(client, reply, EPERM); > continue; > } > - if (expwrite(request.from, buf, len, client)) { > + if (expwrite(request.from, buf, len, client, > + > (request.type==NBD_CMD_WRITE_FUA)?1:0)) { ... and here you could then use request.type & NBD_CMD_FLAG_FUA rather than the tristate. > DEBUG("Write failed: %m" ); > ERROR(client, reply, errno); > continue; > - } > + } Another gratuitous whitespace change. > SEND(client->net, reply); > DEBUG("OK!\n"); > len -= currlen; > @@ -1442,6 +1496,19 @@ int mainloop(CLIENT *client) { > } > continue; > } > + > + if (request.type==NBD_CMD_FLUSH) { > + DEBUG("fl: "); > + if (expflush(client)) { > + DEBUG("Flush failed: %m"); > + ERROR(client, reply, errno); > + continue; > + } > + SEND(client->net, reply); > + DEBUG("OK!\n"); > + continue; > + } > + > /* READ */ > > DEBUG("exp->buf, "); > diff --git a/nbd.h b/nbd.h > index 451f50c..05488b0 100644 > --- a/nbd.h > +++ b/nbd.h > @@ -31,7 +31,9 @@ > enum { > NBD_CMD_READ = 0, > NBD_CMD_WRITE = 1, > - NBD_CMD_DISC = 2 > + NBD_CMD_DISC = 2, > + NBD_CMD_FLUSH = 3, > + NBD_CMD_WRITE_FUA = 4 > }; This file is taken from the kernel, so should probably be patched there rather than here (though of course it would be hard to compile-test without these bits). Please also document the new config file options in nbd-server.5.in.sgml. Other than that, seems fine to me. -- The volume of a pizza of thickness a and radius z can be described by the following formula: pi zz a ------------------------------------------------------------------------------ Achieve unprecedented app performance and reliability What every C/C++ and Fortran developer should know. Learn how Intel has extended the reach of its next-generation tools to help boost performance applications - inlcuding clusters. http://p.sf.net/sfu/intel-dev2devmay _______________________________________________ Nbd-general mailing list [email protected] https://lists.sourceforge.net/lists/listinfo/nbd-general
