On Mon, May 16, 2011 at 01:46:09PM +0100, Alex Bligh wrote:
> Enclosed is a patch (COMPILE TESTED ONLY - DO NOT RUN ON LIVE SYSTEMS) which
> supports FUA (force unit access) and FLUSH.
> 
> FUA means "force the current write to hit the media service" and FLUSH
> means "empty the current write queue to disk". Broadly they have the same
> semantics as the linux kernel REQ_FLUSH and REQ_FUA. FUA is implemented
> through sync_file_range() (or if that doesn't exist, fdatasync() on the
> file handle concerned), FLUSH through fdatasync() on all files. FUA and
> FLUSH are selected in the config file, and set new flags bits which will
> cause the client to sent FUA and FLUSH requests. The way these are
> implemented is further explained in doc/proto.txt.
> 
> The purpose of this is reasonably obvious: without supporting either FUA
> or FLUSH (and it's relatively easy to support both), filesystems on the
> client have no way to ensure the relevant sectors have hit the disk.
> The patch is implement such that the default behaviour is unchanged.
> 
> Additionally, it introduces an F_ROTATIONAL flag. This will turn off
> the use of QUEUE_FLAG_NONROT in the client. QUEUE_FLAG_NONROT effectively
> disables the elevator algorithm, making the algorithm merge only. That is
> unhelpful where the server does not have its own elevator algorithm
> or where the client elevator algorithm is neutered (e.g. writing to a
> raw partition with F_SYNC with nbd-server). It's not going to be used
> often where the backing store is a file.
> 
> It also incidentally fixes a bug where F_SYNC is ignored if F_COPYONWRITE
> is set (not that this currently has much utility).
> 
> I have not written the client side of this yet (see comment re "compile
> tested only"). What I'd like to know (before I spend too much time on it)
> is whether this is a useful course to pursue, and whether I'm doing this
> in a vaguely useful way. I repeat the point about this code not being
> tested at all beyond compilation.
> 
> Clean version of patch at
>   http://www.alex.org.uk/nbd-flush-fua-01.patch
> in case my mailer mangles newlines
> 
> -- 
> Alex Bligh
> 
> Signed-Off-By: Alex Bligh <[email protected]>
> 
> diff --git a/cliserv.h b/cliserv.h
> index 51c8bd1..63c2743 100644
> --- a/cliserv.h
> +++ b/cliserv.h
> @@ -40,7 +40,12 @@ typedef unsigned long long u64;
>  #include "nbd.h"
> 
>  #if NBD_LFS==1
> +/* /usr/include/features.h (included from /usr/include/sys/types.h)
> +   defines this when _GNU_SOURCE is defined
> + */
> +#ifndef _LARGEFILE_SOURCE
>  #define _LARGEFILE_SOURCE
> +#endif
>  #define _FILE_OFFSET_BITS 64
>  #endif
> 
> @@ -135,6 +140,9 @@ u64 ntohll(u64 a) {
>  /* Flags used between the client and server */
>  #define NBD_FLAG_HAS_FLAGS   (1 << 0)        /* Flags are there */
>  #define NBD_FLAG_READ_ONLY   (1 << 1)        /* Device is read-only */
> +#define NBD_FLAG_SEND_FLUSH  (1 << 2)        /* Send FLUSH */
> +#define NBD_FLAG_SEND_FUA    (1 << 3)        /* Send FLUSH */

"Send FUA" rather than "Send FLUSH"? Also, please expand what 'FUA'
stands for in this comment.

> +#define NBD_FLAG_ROTATIONAL  (1 << 3)        /* Use elevator algorithm - 
> rotational media */

That's the same value as the _SEND_FUA thing here; should that not be
different?

>  #define NBD_DEFAULT_PORT     "10809" /* Port on which named exports are
>                                        * served */
> diff --git a/configure.ac b/configure.ac
> index f50193f..c78177b 100644
> --- a/configure.ac
> +++ b/configure.ac
> @@ -94,6 +94,9 @@ AC_CHECK_SIZEOF(unsigned int)
>  AC_CHECK_SIZEOF(unsigned long int)
>  AC_CHECK_SIZEOF(unsigned long long int)
>  AC_CHECK_FUNCS([llseek alarm gethostbyname inet_ntoa memset socket 
> strerror strstr])
> +AC_CHECK_FUNC([sync_file_range],
> +     [AC_DEFINE([HAVE_SYNC_FILE_RANGE], [sync_file_range(2) is not 
> supported], 
> [sync_file_range(2) is supported])],
> +        [])
>  AC_FUNC_FORK
>  AC_FUNC_SETVBUF_REVERSED
>  AC_MSG_CHECKING(whether client should be built)
> diff --git a/doc/proto.txt b/doc/proto.txt
> index fe5e819..d9a269b 100644
> --- a/doc/proto.txt
> +++ b/doc/proto.txt
> @@ -26,8 +26,9 @@ server during the handshake.
>  There are two message types in the data pushing phase: the request, and
>  the response.
> 
> -There are three request types in the data pushing phase: NBD_CMD_READ,
> -NBD_CMD_WRITE, and NBD_CMD_DISC (disconnect).
> +There are four request types in the data pushing phase: NBD_CMD_READ,
> +NBD_CMD_WRITE, NBD_CMD_WRITE_FUA, NBD_CMD_DISC (disconnect), and
> +NBD_CMD_FLUSH.
> 
>  The request is sent by the client; the response by the server. A request
>  header consists a 32 bit magic number (magic), a 32 bit field denoting
> @@ -50,6 +51,15 @@ we change that to asynchronous handling, handling the 
> disconnect request
>  will probably be postponed until there are no other outstanding
>  requests.
> 
> +NBD_CMD_WRITE_FUA is the equivalent of NBD_WRITE, with an implied
> +FUA (force unit access). This is implementing using sync_file_range
> +if present, else by fdatasync() of that file (note not all files
> +in a multifile environment). NBD_CMD_WRITE_FUA will not be sent
> +unless NBD_FLAG_SEND_FUA is set.
> +
> +A flush request will not be sent unless NBD_FLAG_SEND_FLUSH is set,
> +and indicates the backing file should be fdatasync()'d to disk.
> +
>  There are two versions of the negotiation: the 'old' style (nbd <=
>  2.9.16) and the 'new' style (nbd >= 2.9.17, though due to a bug it does
>  not work with anything below 2.9.18). What follows is a description of
> diff --git a/nbd-server.c b/nbd-server.c
> index 4b9a083..88e4615 100644
> --- a/nbd-server.c
> +++ b/nbd-server.c
> @@ -55,6 +55,9 @@
>   *   Suriya Soutmun <[email protected]>
>   */
> 
> +/* For sync_file_range */
> +#define _GNU_SOURCE
> +

This should perhaps be inside an #ifdef HAVE_SYNC_FILE_RANGE, too.

>  /* Includes LFS defines, which defines behaviours of some of the following
>   * headers, so must come before those */
>  #include "lfs.h"
> @@ -134,11 +137,13 @@ gboolean do_oldstyle=FALSE;
>  #define DEBUG2( a,b ) printf( a,b )
>  #define DEBUG3( a,b,c ) printf( a,b,c )
>  #define DEBUG4( a,b,c,d ) printf( a,b,c,d )
> +#define DEBUG5( a,b,c,d,e ) printf( a,b,c,d,e )
>  #else
>  #define DEBUG( a )
>  #define DEBUG2( a,b )
>  #define DEBUG3( a,b,c )
>  #define DEBUG4( a,b,c,d )
> +#define DEBUG5( a,b,c,d,e )

I should probably change these to use C99 varargs macros (we require C99
these days anyway). Oh well, -- some other time, I guess :-)

>  #endif
>  #ifndef PACKAGE_VERSION
>  #define PACKAGE_VERSION ""
> @@ -160,6 +165,9 @@ gboolean do_oldstyle=FALSE;
>  #define F_SPARSE 16    /**< flag to tell us copyronwrite should use a 
> sparse file */
>  #define F_SDP 32       /**< flag to tell us the export should be done using 
> the Socket Direct Protocol for RDMA */
>  #define F_SYNC 64      /**< Whether to fsync() after a write */
> +#define F_FLUSH 128    /**< Whether server wants FLUSH to be sent by the 
> client */
> +#define F_FUA 128      /**< Whether server wants FUA to be sent by the 
> client 
These should probably not be the same values?

> */
> +#define F_ROTATIONAL 128  /**< Whether server wants the client to 
> implement the elevator algorithm */
>  GHashTable *children;
>  char pidfname[256]; /**< name of our PID file */
>  char pidftemplate[256]; /**< template to be used for the filename of the 
> PID file */
> @@ -713,6 +721,9 @@ GArray* parse_cfile(gchar* f, GError** e) {
>               { "sparse_cow", FALSE,  PARAM_BOOL,     NULL, F_SPARSE },
>               { "sdp",        FALSE,  PARAM_BOOL,     NULL, F_SDP },
>               { "sync",       FALSE,  PARAM_BOOL,     NULL, F_SYNC },
> +             { "flush",      FALSE,  PARAM_BOOL,     NULL, F_FLUSH },
> +             { "fua",        FALSE,  PARAM_BOOL,     NULL, F_FUA },
> +             { "rotational", FALSE,  PARAM_BOOL,     NULL, F_ROTATIONAL },
>               { "listenaddr", FALSE,  PARAM_STRING,   NULL, 0 },
>               { "maxconnections", FALSE, PARAM_INT,   NULL, 0 },
>       };
> @@ -763,9 +774,10 @@ GArray* parse_cfile(gchar* f, GError** e) {
>               lp[6].target=&(s.postrun);
>               lp[7].target=lp[8].target=lp[9].target=
>                               lp[10].target=lp[11].target=
> -                             lp[12].target=&(s.flags);
> -             lp[13].target=&(s.listenaddr);
> -             lp[14].target=&(s.max_connections);
> +                             lp[12].target=lp[13].target=
> +                             lp[14].target=lp[15].target=&(s.flags);
> +             lp[16].target=&(s.listenaddr);
> +             lp[17].target=&(s.max_connections);
> 
>               /* After the [generic] group, start parsing exports */
>               if(i==1) {
> @@ -1060,7 +1072,7 @@ void myseek(int handle,off_t a) {
>   * @param client The client we're serving for
>   * @return The number of bytes actually written, or -1 in case of an error
>   **/
> -ssize_t rawexpwrite(off_t a, char *buf, size_t len, CLIENT *client) {
> +ssize_t rawexpwrite(off_t a, char *buf, size_t len, CLIENT *client, int 
> fua) {
>       int fhandle;
>       off_t foffset;
>       size_t maxbytes;
> @@ -1071,12 +1083,20 @@ ssize_t rawexpwrite(off_t a, char *buf, size_t len, 
> CLIENT *client) {
>       if(maxbytes && len > maxbytes)
>               len = maxbytes;
> 
> -     DEBUG4("(WRITE to fd %d offset %llu len %u), ", fhandle, foffset, len);
> +     DEBUG5("(WRITE to fd %d offset %llu len %u fua %d), ", fhandle, 
> foffset, 
> len, fua);
> 
>       myseek(fhandle, foffset);
>       retval = write(fhandle, buf, len);
>       if(client->server->flags & F_SYNC) {
>               fsync(fhandle);
> +     } else if (fua) {
> +#ifdef HAVE_SYNC_FILE_RANGE
> +             sync_file_range(fhandle, foffset, len,
> +                             SYNC_FILE_RANGE_WAIT_BEFORE | 
> SYNC_FILE_RANGE_WRITE |
> +                             SYNC_FILE_RANGE_WAIT_AFTER);
> +#else
> +             fdatasync(fhandle);
> +#endif
>       }
>       return retval;
>  }
> @@ -1085,10 +1105,10 @@ ssize_t rawexpwrite(off_t a, char *buf, size_t len, 
> CLIENT *client) {
>   * Call rawexpwrite repeatedly until all data has been written.
>   * @return 0 on success, nonzero on failure
>   **/
> -int rawexpwrite_fully(off_t a, char *buf, size_t len, CLIENT *client) {
> +int rawexpwrite_fully(off_t a, char *buf, size_t len, CLIENT *client, int 
> fua) {
>       ssize_t ret=0;
> 
> -     while(len > 0 && (ret=rawexpwrite(a, buf, len, client)) > 0 ) {
> +     while(len > 0 && (ret=rawexpwrite(a, buf, len, client, fua)) > 0 ) {
>               a += ret;
>               buf += ret;
>               len -= ret;
> @@ -1189,7 +1209,7 @@ int expread(off_t a, char *buf, size_t len, CLIENT 
> *client) {
>   * @param client The client we're going to write for.
>   * @return 0 on success, nonzero on failure
>   **/
> -int expwrite(off_t a, char *buf, size_t len, CLIENT *client) {
> +int expwrite(off_t a, char *buf, size_t len, CLIENT *client, int fua) {
>       char pagebuf[DIFFPAGESIZE];
>       off_t mapcnt,mapl,maph;
>       off_t wrlen,rdlen;
> @@ -1197,7 +1217,7 @@ int expwrite(off_t a, char *buf, size_t len, CLIENT 
> *client) {
>       off_t offset;
> 
>       if (!(client->server->flags & F_COPYONWRITE))
> -             return(rawexpwrite_fully(a, buf, len, client));
> +       return(rawexpwrite_fully(a, buf, len, client, fua));

Gratuitous whitespace change?

>       DEBUG3("Asked to write %d bytes at %llu.\n", len, (unsigned long 
> long)a);
> 
>       mapl=a/DIFFPAGESIZE ; maph=(a+len-1)/DIFFPAGESIZE ;
> @@ -1230,6 +1250,33 @@ int expwrite(off_t a, char *buf, size_t len, CLIENT 
> *client) {
>               }                                               
>               len-=wrlen ; a+=wrlen ; buf+=wrlen ;
>       }
> +     if (client->server->flags & F_SYNC) {
> +             fsync(client->difffile);
> +     } else if (fua) {
> +             /* open question: would it be cheaper to do multiple 
> sync_file_ranges?
> +                as we iterate through the above?
> +              */

Good question... it might be a good idea to test this.

> +             fdatasync(client->difffile);
> +     }
> +     return 0;
> +}
> +
> +int expflush(CLIENT *client) {
> +     int fhandle;
> +     off_t foffset;
> +     size_t maxbytes;
> +     gint i;
> +
> +        if (!(client->server->flags & F_COPYONWRITE)) {
> +             return fsync(client->difffile);
> +     }
> +     
> +     for (i = 0; i < client->export->len; i++) {
> +             FILE_INFO fi = g_array_index(client->export, FILE_INFO, i);
> +             if (fsync(fi.fhandle) < 0)
> +                     return -1;
> +     }
> +     
>       return 0;
>  }
> 
> @@ -1322,6 +1369,12 @@ CLIENT* negotiate(int net, CLIENT *client, GArray* 
> servers) {
>               err("Negotiation failed: %m");
>       if (client->server->flags & F_READONLY)
>               flags |= NBD_FLAG_READ_ONLY;
> +     if (client->server->flags & F_FLUSH)
> +             flags |= NBD_FLAG_SEND_FLUSH;
> +     if (client->server->flags & F_FUA)
> +             flags |= NBD_FLAG_SEND_FUA;
> +     if (client->server->flags & F_ROTATIONAL)
> +             flags |= NBD_FLAG_ROTATIONAL;
>       if (!client->modern) {
>               /* oldstyle */
>               flags = htonl(flags);
> @@ -1419,7 +1472,7 @@ int mainloop(CLIENT *client) {
>                       continue;
>               }
> 
> -             if (request.type==NBD_CMD_WRITE) {
> +             if ((request.type==NBD_CMD_WRITE) || 
> (request.type==NBD_CMD_WRITE_FUA)) {

This makes me think that the _FUA thing should be something that can be
masked away; that way, we can do stuff like

int type = request.type & ~NBD_CMD_FLAG_FUA;
if (type == NBD_CMD_WRITE) {
        ...
}

or something similar, which seems cleaner. It would also allow it to be
applied to other request types without having to special-case too much;
e.g., there's been talk of a truncate command, which could sparsify the
backend. I guess that could also use this FUA thing.

>                       DEBUG("wr: net->buf, ");
>                       while(len > 0) {
>                               readit(client->net, buf, currlen);
> @@ -1430,11 +1483,12 @@ int mainloop(CLIENT *client) {
>                                       ERROR(client, reply, EPERM);
>                                       continue;
>                               }
> -                             if (expwrite(request.from, buf, len, client)) {
> +                             if (expwrite(request.from, buf, len, client,
> +                                          
> (request.type==NBD_CMD_WRITE_FUA)?1:0)) {

... and here you could then use

request.type & NBD_CMD_FLAG_FUA

rather than the tristate.

>                                       DEBUG("Write failed: %m" );
>                                       ERROR(client, reply, errno);
>                                       continue;
> -                             }
> +                             }       

Another gratuitous whitespace change.

>                               SEND(client->net, reply);
>                               DEBUG("OK!\n");
>                               len -= currlen;
> @@ -1442,6 +1496,19 @@ int mainloop(CLIENT *client) {
>                       }
>                       continue;
>               }
> +
> +             if (request.type==NBD_CMD_FLUSH) {
> +                     DEBUG("fl: ");
> +                     if (expflush(client)) {
> +                             DEBUG("Flush failed: %m");
> +                             ERROR(client, reply, errno);
> +                             continue;
> +                     }
> +                     SEND(client->net, reply);
> +                     DEBUG("OK!\n");
> +                     continue;
> +             }
> +
>               /* READ */
> 
>               DEBUG("exp->buf, ");
> diff --git a/nbd.h b/nbd.h
> index 451f50c..05488b0 100644
> --- a/nbd.h
> +++ b/nbd.h
> @@ -31,7 +31,9 @@
>  enum {
>       NBD_CMD_READ = 0,
>       NBD_CMD_WRITE = 1,
> -     NBD_CMD_DISC = 2
> +     NBD_CMD_DISC = 2,
> +     NBD_CMD_FLUSH = 3,
> +     NBD_CMD_WRITE_FUA = 4
>  };

This file is taken from the kernel, so should probably be patched there
rather than here (though of course it would be hard to compile-test
without these bits).

Please also document the new config file options in nbd-server.5.in.sgml.

Other than that, seems fine to me.

-- 
The volume of a pizza of thickness a and radius z can be described by
the following formula:

pi zz a

------------------------------------------------------------------------------
Achieve unprecedented app performance and reliability
What every C/C++ and Fortran developer should know.
Learn how Intel has extended the reach of its next-generation tools
to help boost performance applications - inlcuding clusters.
http://p.sf.net/sfu/intel-dev2devmay
_______________________________________________
Nbd-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nbd-general

Reply via email to