Re: [Cluster-devel] always fall back to buffered I/O after invalidation failures, was: Re: [PATCH 2/6] iomap: IOMAP_DIO_RWF_NO_STALE_PAGECACHE return if page invalidation fails

2020-07-09 Thread Dave Chinner
On Thu, Jul 09, 2020 at 10:10:38AM -0700, Darrick J. Wong wrote:
> On Thu, Jul 09, 2020 at 06:05:19PM +0100, Matthew Wilcox wrote:
> > On Thu, Jul 09, 2020 at 09:09:26AM -0700, Darrick J. Wong wrote:
> > > On Thu, Jul 09, 2020 at 12:25:27PM +1000, Dave Chinner wrote:
> > > > -*/
> > > > -   ret = invalidate_inode_pages2_range(mapping,
> > > > -   pos >> PAGE_SHIFT, end >> PAGE_SHIFT);
> > > > -   if (ret)
> > > > -   dio_warn_stale_pagecache(iocb->ki_filp);
> > > > -   ret = 0;
> > > > +   if (iov_iter_rw(iter) == WRITE) {
> > > > +   /*
> > > > +* Try to invalidate cache pages for the range we're direct
> > > > +* writing.  If this invalidation fails, tough, the write will
> > > > +* still work, but racing two incompatible write paths is a
> > > > +* pretty crazy thing to do, so we don't support it 100%.
> > > > +*/
> > > > +   ret = invalidate_inode_pages2_range(mapping,
> > > > +   pos >> PAGE_SHIFT, end >> PAGE_SHIFT);
> > > > +   if (ret)
> > > > +   dio_warn_stale_pagecache(iocb->ki_filp);
> > > > +   ret = 0;
> > > >  
> > > > -   if (iov_iter_rw(iter) == WRITE && !wait_for_completion &&
> > > > -   !inode->i_sb->s_dio_done_wq) {
> > > > -   ret = sb_init_dio_done_wq(inode->i_sb);
> > > > -   if (ret < 0)
> > > > -   goto out_free_dio;
> > > > +   if (!wait_for_completion &&
> > > > +   !inode->i_sb->s_dio_done_wq) {
> > > > +   ret = sb_init_dio_done_wq(inode->i_sb);
> > > > +   if (ret < 0)
> > > > +   goto out_free_dio;
> 
> ...and yes I did add in the closing brace here. :P

Doh! I forgot to refresh the patch after fixing that. :/

Thanks!

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com



[Cluster-devel] [PATCH dlm-tool 1/3] dlm_controld: don't abort node configuration

2020-07-09 Thread Alexander Aring
If the mark value file is not present, we don't abort the whole node
configuration when open() fails. Instead we skip this specific setting
and jump to the next one, since a correct mark setting is not mandatory.
The failure will still appear in the logs.
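
For illustration, a minimal standalone sketch of the skip-on-failure
pattern this patch introduces (the path and helper below are
placeholders for illustration, not the actual dlm_controld code):

    #include <errno.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    /* Write an optional configfs setting; failing to open is not fatal. */
    static void write_optional_setting(const char *path, const char *val)
    {
            int fd = open(path, O_WRONLY);

            if (fd < 0) {
                    /* Older kernels may lack this file; log and carry on. */
                    fprintf(stderr, "%s: open failed: %d\n", path, errno);
                    return;
            }
            if (write(fd, val, strlen(val)) < 0)
                    fprintf(stderr, "%s: write failed: %d\n", path, errno);
            close(fd);
    }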
---
 dlm_controld/action.c | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/dlm_controld/action.c b/dlm_controld/action.c
index 63040227..bc9c44f2 100644
--- a/dlm_controld/action.c
+++ b/dlm_controld/action.c
@@ -642,6 +642,9 @@ int add_configfs_node(int nodeid, char *addr, int addrlen, int local,
 
/*
 * set skb mark for nodeid
+*
+* If open() fails we skip it because the kernel doesn't support it.
+* It's not a required configuration. It will show up in the log.
 */
 
memset(path, 0, PATH_MAX);
@@ -650,7 +653,7 @@ int add_configfs_node(int nodeid, char *addr, int addrlen, int local,
fd = open(path, O_WRONLY);
if (fd < 0) {
log_error("%s: open failed: %d", path, errno);
-   return -1;
+   goto skip_non_required;
}
 
memset(buf, 0, sizeof(buf));
@@ -664,6 +667,8 @@ int add_configfs_node(int nodeid, char *addr, int addrlen, int local,
}
close(fd);
 
+skip_non_required:
+
/*
 * set local
 */
-- 
2.26.2



[Cluster-devel] [PATCH dlm-tool 2/3] dlm_controld: change enable_waitplock_recovery default

2020-07-09 Thread Alexander Aring
This patch sets the default value of enable_waitplock_recovery to false,
matching the kernel configuration default. Note that in this specific
case dlm_controld will not set the default value on startup; however,
this patch syncs the kernel and user-space default behaviour.
---
 dlm_controld/main.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/dlm_controld/main.c b/dlm_controld/main.c
index 3ec318c2..95107d09 100644
--- a/dlm_controld/main.c
+++ b/dlm_controld/main.c
@@ -1754,7 +1754,7 @@ static void set_opt_defaults(void)
 
set_opt_default(enable_waitplock_recovery_ind,
"enable_waitplock_recovery", '\0', req_arg_bool,
-   1, NULL,
+   0, NULL,
"enable/disable posix lock to wait for dlm recovery 
after lock acquire");
 
set_opt_default(plock_debug_ind,
-- 
2.26.2



[Cluster-devel] [PATCH dlm-tool 3/3] dlm_controld: add default value handling for unsigned int

2020-07-09 Thread Alexander Aring
This patch adds support for handling default values for unsigned int.
---
 dlm_controld/dlm_daemon.h |  1 +
 dlm_controld/main.c   | 61 +--
 2 files changed, 34 insertions(+), 28 deletions(-)

diff --git a/dlm_controld/dlm_daemon.h b/dlm_controld/dlm_daemon.h
index 979aab7a..ee21c256 100644
--- a/dlm_controld/dlm_daemon.h
+++ b/dlm_controld/dlm_daemon.h
@@ -134,6 +134,7 @@ struct dlm_option {
 
int default_int;
const char *default_str;
+   unsigned int default_uint;
 
int cli_set;
int cli_int;
diff --git a/dlm_controld/main.c b/dlm_controld/main.c
index 95107d09..8023f4b0 100644
--- a/dlm_controld/main.c
+++ b/dlm_controld/main.c
@@ -1678,6 +1678,8 @@ static void print_usage(void)
printf(" [%d]\n", o->default_int);
else if (o->req_arg == req_arg_bool)
printf(" [%d]\n", o->default_int);
+   else if (o->req_arg == req_arg_uint)
+   printf(" [%u]\n", o->default_uint);
else if (o->req_arg == no_arg && !o->default_int)
printf(" [0]\n");
else
@@ -1688,7 +1690,8 @@ static void print_usage(void)
 }
 
 static void set_opt_default(int ind, const char *name, char letter, int arg_type,
-   int default_int, const char *default_str, const char *desc)
+   int default_int, const char *default_str,
+   unsigned int default_uint, const char *desc)
 {
dlm_options[ind].name = name;
dlm_options[ind].letter = letter;
@@ -1696,145 +1699,147 @@ static void set_opt_default(int ind, const char *name, char letter, int arg_type
dlm_options[ind].desc = desc;
dlm_options[ind].default_int = default_int;
dlm_options[ind].default_str = default_str;
+   dlm_options[ind].default_uint = default_uint;
dlm_options[ind].use_int = default_int;
dlm_options[ind].use_str = (char *)default_str;
+   dlm_options[ind].use_uint = default_uint;
 }
 
 static void set_opt_defaults(void)
 {
set_opt_default(daemon_debug_ind,
"daemon_debug", 'D', no_arg,
-   0, NULL,
+   0, NULL, 0,
"enable debugging to stderr and don't fork");
 
set_opt_default(foreground_ind,
"foreground", '\0', no_arg,
-   0, NULL,
+   0, NULL, 0,
"don't fork");
 
set_opt_default(log_debug_ind,
"log_debug", 'K', no_arg,
-   0, NULL,
+   0, NULL, 0,
"enable kernel dlm debugging messages");
 
set_opt_default(timewarn_ind,
"timewarn", '\0', req_arg_int,
-   0, NULL,
+   0, NULL, 0,
""); /* do not advertise */
 
set_opt_default(protocol_ind,
"protocol", 'r', req_arg_str,
-   -1, "detect",
+   -1, "detect", 0,
"dlm kernel lowcomms protocol: tcp, sctp, detect");
 
set_opt_default(bind_all_ind,
"bind_all", '\0', req_arg_int,
-   0, NULL,
+   0, NULL, 0,
""); /* do not advertise */
 
set_opt_default(mark_ind,
"mark", '\0', req_arg_uint,
-   0, NULL,
+   0, NULL, 0,
"set mark value for the DLM in-kernel listen socket");
 
set_opt_default(debug_logfile_ind,
"debug_logfile", 'L', no_arg,
-   0, NULL,
+   0, NULL, 0,
"write debugging to log file");
 
set_opt_default(enable_fscontrol_ind,
"enable_fscontrol", '\0', req_arg_bool,
-   0, NULL,
+   0, NULL, 0,
""); /* do not advertise */
 
set_opt_default(enable_plock_ind,
"enable_plock", 'p', req_arg_bool,
-   1, NULL,
+   1, NULL, 0,
"enable/disable posix lock support for cluster fs");
 
set_opt_default(enable_waitplock_recovery_ind,
"enable_waitplock_recovery", '\0', req_arg_bool,
-   0, NULL,
+   0, NULL, 0,
"enable/disable posix lock to wait for dlm recovery 
after lock acquire");
 
set_opt_default(plock_debug_ind,
"plock_debug", 'P', no_arg,
-   0, NULL,
+   0, NULL, 0,
"enable plock debugging");
 
set_opt_default(plock_rate_limit_ind,
 

[Cluster-devel] [PATCH dlm-tool 0/3] dlm_controld: changes from v2 to v3

2020-07-09 Thread Alexander Aring
Hi,

series v2 got applied and my changes for series v3 got dropped. I
created this patch series to sync upstream with the previous v3 series.
Every change is now split into a separate patch.

Thank you.

- Alex

Alexander Aring (3):
  dlm_controld: don't abort node configuration
  dlm_controld: change enable_waitplock_recovery default
  dlm_controld: add default value handling for unsigned int

 dlm_controld/action.c |  7 -
 dlm_controld/dlm_daemon.h |  1 +
 dlm_controld/main.c   | 61 +--
 3 files changed, 40 insertions(+), 29 deletions(-)

-- 
2.26.2



[Cluster-devel] [PATCHv3 dlm-tool 4/4] dlm_controld: add support for waitplock_recovery switch

2020-07-09 Thread Alexander Aring
This patch adds support to set the cluster attribute waitplock_recovery
via the enable_waitplock_recovery argument or config file attribute.
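
Assuming the usual dlm.conf key=value syntax (the value here is only an
example), the switch could then be enabled with a config line such as:

    enable_waitplock_recovery=1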
---
 dlm_controld/action.c | 5 +
 dlm_controld/dlm.conf.5   | 2 ++
 dlm_controld/dlm_daemon.h | 1 +
 dlm_controld/main.c   | 5 +
 4 files changed, 13 insertions(+)

diff --git a/dlm_controld/action.c b/dlm_controld/action.c
index 9e18d286..bc9c44f2 100644
--- a/dlm_controld/action.c
+++ b/dlm_controld/action.c
@@ -881,6 +881,11 @@ int setup_configfs_options(void)
dlm_options[timewarn_ind].file_set)
set_configfs_cluster("timewarn_cs", NULL, opt(timewarn_ind));
 
+   if (dlm_options[enable_waitplock_recovery_ind].cli_set ||
+   dlm_options[enable_waitplock_recovery_ind].file_set)
+   set_configfs_cluster("waitplock_recovery", NULL,
+opt(enable_waitplock_recovery_ind));
+
set_configfs_cluster("mark", NULL, optu(mark_ind));
 
proto_name = opts(protocol_ind);
diff --git a/dlm_controld/dlm.conf.5 b/dlm_controld/dlm.conf.5
index 1ce0c644..e92dfc8e 100644
--- a/dlm_controld/dlm.conf.5
+++ b/dlm_controld/dlm.conf.5
@@ -46,6 +46,8 @@ debug_logfile
 .br
 enable_plock
 .br
+enable_waitplock_recovery
+.br
 plock_debug
 .br
 plock_rate_limit
diff --git a/dlm_controld/dlm_daemon.h b/dlm_controld/dlm_daemon.h
index 0b4ae5f2..ee21c256 100644
--- a/dlm_controld/dlm_daemon.h
+++ b/dlm_controld/dlm_daemon.h
@@ -102,6 +102,7 @@ enum {
 mark_ind,
 enable_fscontrol_ind,
 enable_plock_ind,
+enable_waitplock_recovery_ind,
 plock_debug_ind,
 plock_rate_limit_ind,
 plock_ownership_ind,
diff --git a/dlm_controld/main.c b/dlm_controld/main.c
index 8a5a2ad1..8023f4b0 100644
--- a/dlm_controld/main.c
+++ b/dlm_controld/main.c
@@ -1757,6 +1757,11 @@ static void set_opt_defaults(void)
1, NULL, 0,
"enable/disable posix lock support for cluster fs");
 
+   set_opt_default(enable_waitplock_recovery_ind,
+   "enable_waitplock_recovery", '\0', req_arg_bool,
+   0, NULL, 0,
+   "enable/disable posix lock to wait for dlm recovery 
after lock acquire");
+
set_opt_default(plock_debug_ind,
"plock_debug", 'P', no_arg,
0, NULL, 0,
-- 
2.26.2



[Cluster-devel] [PATCHv3 dlm-tool 2/4] dlm_controld: set listen skb mark setting

2020-07-09 Thread Alexander Aring
This patch adds support to set the skb mark value for the in-kernel DLM
listen socket.
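
As a rough usage sketch, assuming the same dlm.conf key=value syntax as
the other daemon options (the value is only an example), the listen
socket mark could then be set with:

    mark=42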
---
 dlm_controld/action.c | 2 ++
 dlm_controld/dlm.conf.5   | 2 ++
 dlm_controld/dlm_daemon.h | 1 +
 dlm_controld/main.c   | 5 +
 4 files changed, 10 insertions(+)

diff --git a/dlm_controld/action.c b/dlm_controld/action.c
index ecd0d022..e901d555 100644
--- a/dlm_controld/action.c
+++ b/dlm_controld/action.c
@@ -851,6 +851,8 @@ int setup_configfs_options(void)
dlm_options[timewarn_ind].file_set)
set_configfs_cluster("timewarn_cs", NULL, opt(timewarn_ind));
 
+   set_configfs_cluster("mark", NULL, optu(mark_ind));
+
proto_name = opts(protocol_ind);
proto_num = -1;
 
diff --git a/dlm_controld/dlm.conf.5 b/dlm_controld/dlm.conf.5
index 09492176..771951d4 100644
--- a/dlm_controld/dlm.conf.5
+++ b/dlm_controld/dlm.conf.5
@@ -40,6 +40,8 @@ protocol
 .br
 bind_all
 .br
+mark
+.br
 debug_logfile
 .br
 enable_plock
diff --git a/dlm_controld/dlm_daemon.h b/dlm_controld/dlm_daemon.h
index 47557a7c..4aa2ff82 100644
--- a/dlm_controld/dlm_daemon.h
+++ b/dlm_controld/dlm_daemon.h
@@ -97,6 +97,7 @@ enum {
 protocol_ind,
 debug_logfile_ind,
bind_all_ind,
+mark_ind,
 enable_fscontrol_ind,
 enable_plock_ind,
 plock_debug_ind,
diff --git a/dlm_controld/main.c b/dlm_controld/main.c
index 6129b8a6..22faa3d5 100644
--- a/dlm_controld/main.c
+++ b/dlm_controld/main.c
@@ -1737,6 +1737,11 @@ static void set_opt_defaults(void)
0, NULL, 0,
""); /* do not advertise */
 
+   set_opt_default(mark_ind,
+   "mark", '\0', req_arg_uint,
+   0, NULL, 0,
+   "set mark value for the DLM in-kernel listen socket");
+
set_opt_default(debug_logfile_ind,
"debug_logfile", 'L', no_arg,
0, NULL, 0,
-- 
2.26.2



[Cluster-devel] [PATCHv3 dlm-tool 1/4] dlm_controld: add support for unsigned int values

2020-07-09 Thread Alexander Aring
This patch adds support for setting an unsigned integer value.
---
 dlm_controld/config.c | 25 
 dlm_controld/dlm_daemon.h |  6 
 dlm_controld/main.c   | 60 ++-
 3 files changed, 65 insertions(+), 26 deletions(-)

diff --git a/dlm_controld/config.c b/dlm_controld/config.c
index 3ba8a53b..ec269168 100644
--- a/dlm_controld/config.c
+++ b/dlm_controld/config.c
@@ -156,6 +156,19 @@ static void get_val_int(char *line, int *val_out)
*val_out = atoi(val);
 }
 
+static void get_val_uint(char *line, unsigned int *val_out)
+{
+   char key[MAX_LINE];
+   char val[MAX_LINE];
+   int rv;
+
+   rv = sscanf(line, "%[^=]=%s", key, val);
+   if (rv != 2)
+   return;
+
+   *val_out = strtoul(val, NULL, 0);
+}
+
 static void get_val_str(char *line, char *val_out)
 {
char key[MAX_LINE];
@@ -171,6 +184,7 @@ static void get_val_str(char *line, char *val_out)
 
 void set_opt_file(int update)
 {
+   unsigned int uval = 0;
struct dlm_option *o;
FILE *file;
char line[MAX_LINE];
@@ -238,6 +252,17 @@ void set_opt_file(int update)
log_debug("config file %s = %d cli_set %d use %d",
  o->name, o->file_int, o->cli_set, o->use_int);
 
+   } else if (o->req_arg == req_arg_uint) {
+   get_val_uint(line, &uval);
+
+   o->file_uint = uval;
+
+   if (!o->cli_set)
+   o->use_uint = o->file_uint;
+
+   log_debug("config file %s = %u cli_set %d use %u",
+ o->name, o->file_uint, o->cli_set, o->use_uint);
+
} else if (o->req_arg == req_arg_bool) {
get_val_int(line, &val);
 
diff --git a/dlm_controld/dlm_daemon.h b/dlm_controld/dlm_daemon.h
index 5b9a52da..47557a7c 100644
--- a/dlm_controld/dlm_daemon.h
+++ b/dlm_controld/dlm_daemon.h
@@ -86,6 +86,7 @@ enum {
 req_arg_bool = 1,
 req_arg_int = 2,
 req_arg_str = 3,
+req_arg_uint = 4,
 };
 
 enum {
@@ -125,22 +126,27 @@ struct dlm_option {
 
int use_int;
char *use_str;
+   unsigned int use_uint;
 
int default_int;
const char *default_str;
+   unsigned int default_uint;
 
int cli_set;
int cli_int;
char *cli_str;
+   unsigned int cli_uint;
 
int file_set;
int file_int;
char *file_str;
+   unsigned int file_uint;
 };
 
 EXTERN struct dlm_option dlm_options[dlm_options_max];
 #define opt(x) dlm_options[x].use_int
 #define opts(x) dlm_options[x].use_str
+#define optu(x) dlm_options[x].use_uint
 
 
 /* DLM_LOCKSPACE_LEN: maximum lockspace name length, from linux/dlmconstants.h.
diff --git a/dlm_controld/main.c b/dlm_controld/main.c
index 8be6a4bc..6129b8a6 100644
--- a/dlm_controld/main.c
+++ b/dlm_controld/main.c
@@ -1678,6 +1678,8 @@ static void print_usage(void)
printf(" [%d]\n", o->default_int);
else if (o->req_arg == req_arg_bool)
printf(" [%d]\n", o->default_int);
+   else if (o->req_arg == req_arg_uint)
+   printf(" [%u]\n", o->default_uint);
else if (o->req_arg == no_arg && !o->default_int)
printf(" [0]\n");
else
@@ -1688,7 +1690,8 @@ static void print_usage(void)
 }
 
 static void set_opt_default(int ind, const char *name, char letter, int arg_type,
-   int default_int, const char *default_str, const char *desc)
+   int default_int, const char *default_str,
+   unsigned int default_uint, const char *desc)
 {
dlm_options[ind].name = name;
dlm_options[ind].letter = letter;
@@ -1696,135 +1699,137 @@ static void set_opt_default(int ind, const char *name, char letter, int arg_type
dlm_options[ind].desc = desc;
dlm_options[ind].default_int = default_int;
dlm_options[ind].default_str = default_str;
+   dlm_options[ind].default_uint = default_uint;
dlm_options[ind].use_int = default_int;
dlm_options[ind].use_str = (char *)default_str;
+   dlm_options[ind].use_uint = default_uint;
 }
 
 static void set_opt_defaults(void)
 {
set_opt_default(daemon_debug_ind,
"daemon_debug", 'D', no_arg,
-   0, NULL,
+   0, NULL, 0,
"enable debugging to stderr and don't fork");
 
set_opt_default(foreground_ind,
"foreground", '\0', no_arg,
-   0, NULL,
+   0, NULL, 0,
"don't fork");
 
set_opt_default(log_debug_ind,
"log_debug", 'K', no_arg,
-   0, NULL,
+   0, NULL, 0,

[Cluster-devel] [PATCHv3 dlm-tool 3/4] dlm_controld: add support for per nodeid configuration

2020-07-09 Thread Alexander Aring
This patch adds support for per-nodeid configuration via key-value
pairs. As an example, it introduces the mark key, which sets the
SO_MARK socket option per nodeid via the configfs comms directory.
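
For a feel of the line format this enables, a minimal parsing sketch
(the helper name and fixed key set are invented for illustration; this
is not the actual node_config.c implementation):

    #include <stdio.h>

    /* Parse a dlm.conf line such as "node id=1 mark=42" (sketch only). */
    static int parse_node_line(const char *line, int *nodeid, unsigned int *mark)
    {
            if (sscanf(line, "node id=%d mark=%u", nodeid, mark) != 2)
                    return -1;
            return 0;
    }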
---
 dlm_controld/Makefile  |  3 +-
 dlm_controld/action.c  | 36 +++--
 dlm_controld/dlm.conf.5| 19 +
 dlm_controld/dlm_daemon.h  |  5 ++-
 dlm_controld/main.c|  4 ++
 dlm_controld/member.c  |  6 ++-
 dlm_controld/node_config.c | 82 ++
 dlm_controld/node_config.h | 31 ++
 8 files changed, 180 insertions(+), 6 deletions(-)
 create mode 100644 dlm_controld/node_config.c
 create mode 100644 dlm_controld/node_config.h

diff --git a/dlm_controld/Makefile b/dlm_controld/Makefile
index 6081cf8b..fbc8926c 100644
--- a/dlm_controld/Makefile
+++ b/dlm_controld/Makefile
@@ -32,7 +32,8 @@ BIN_SOURCE = action.c \
  config.c \
  member.c \
  logging.c \
- rbtree.c
+ rbtree.c \
+ node_config.c
 LIB_SOURCE = lib.c
 
 CFLAGS += -D_GNU_SOURCE -O2 -ggdb \
diff --git a/dlm_controld/action.c b/dlm_controld/action.c
index e901d555..9e18d286 100644
--- a/dlm_controld/action.c
+++ b/dlm_controld/action.c
@@ -570,15 +570,16 @@ static int add_configfs_base(void)
return rv;
 }
 
-int add_configfs_node(int nodeid, char *addr, int addrlen, int local)
+int add_configfs_node(int nodeid, char *addr, int addrlen, int local,
+ uint32_t mark)
 {
char path[PATH_MAX];
char padded_addr[sizeof(struct sockaddr_storage)];
char buf[32];
int rv, fd;
 
-   log_debug("set_configfs_node %d %s local %d",
- nodeid, str_ip(addr), local);
+   log_debug("set_configfs_node %d %s local %d mark %" PRIu32,
+ nodeid, str_ip(addr), local, mark);
 
/*
 * create comm dir for this node
@@ -639,6 +640,35 @@ int add_configfs_node(int nodeid, char *addr, int addrlen, int local)
}
close(fd);
 
+   /*
+* set skb mark for nodeid
+*
+* If open() fails we skip it because the kernel doesn't support it.
+* It's not a required configuration. It will show up in the log.
+*/
+
+   memset(path, 0, PATH_MAX);
+   snprintf(path, PATH_MAX, "%s/%d/mark", COMMS_DIR, nodeid);
+
+   fd = open(path, O_WRONLY);
+   if (fd < 0) {
+   log_error("%s: open failed: %d", path, errno);
+   goto skip_non_required;
+   }
+
+   memset(buf, 0, sizeof(buf));
+   snprintf(buf, 32, "%" PRIu32, mark);
+
+   rv = do_write(fd, buf, strlen(buf));
+   if (rv < 0) {
+   log_error("%s: write failed: %d, %s", path, errno, buf);
+   close(fd);
+   return -1;
+   }
+   close(fd);
+
+skip_non_required:
+
/*
 * set local
 */
diff --git a/dlm_controld/dlm.conf.5 b/dlm_controld/dlm.conf.5
index 771951d4..1ce0c644 100644
--- a/dlm_controld/dlm.conf.5
+++ b/dlm_controld/dlm.conf.5
@@ -392,6 +392,25 @@ masterfoo node=2 weight=1
 In which case node 1 will master 2/3 of the total resources and node 2
 will master the other 1/3.
 
+.SS Node configuration
+
+Node configurations can be set by the node keyword followed by key-value
+pairs.
+
+.B Keys:
+
+.B mark
+The mark key can be used to set a specific mark value which is then used
+by the in-kernel DLM socket creation. This can be used to match
+DLM-specific packets, e.g. for routing.
+
+Example of setting a per-socket value for nodeid 1 with a mark value
+of 42:
+
+node id=1 mark=42
+
+For local nodes this value doesn't have any effect.
+
 .SH SEE ALSO
 .BR dlm_controld (8),
 .BR dlm_tool (8)
diff --git a/dlm_controld/dlm_daemon.h b/dlm_controld/dlm_daemon.h
index 4aa2ff82..0b4ae5f2 100644
--- a/dlm_controld/dlm_daemon.h
+++ b/dlm_controld/dlm_daemon.h
@@ -40,6 +40,7 @@
 #include 
 #include 
 #include 
+#include <inttypes.h>
 #include 
 
 #include 
@@ -48,6 +49,7 @@
 #include "libdlmcontrol.h"
 #include "dlm_controld.h"
 #include "fence_config.h"
+#include "node_config.h"
 #include "list.h"
 #include "rbtree.h"
 #include "linux_endian.h"
@@ -365,7 +367,8 @@ int set_sysfs_nodir(char *name, int val);
 int set_configfs_members(struct lockspace *ls, char *name,
 int new_count, int *new_members,
 int renew_count, int *renew_members);
-int add_configfs_node(int nodeid, char *addr, int addrlen, int local);
+int add_configfs_node(int nodeid, char *addr, int addrlen, int local,
+ uint32_t mark);
 void del_configfs_node(int nodeid);
 void clear_configfs(void);
 int setup_configfs_options(void);
diff --git a/dlm_controld/main.c b/dlm_controld/main.c
index 22faa3d5..8a5a2ad1 100644
--- a/dlm_controld/main.c
+++ b/dlm_controld/main.c
@@ -2052,6 +2052,10 @@ int main(int argc, char **argv)
set_opt_cli(argc, argv);
set_opt_file(0);
 
+   rv = 

Re: [Cluster-devel] always fall back to buffered I/O after invalidation failures, was: Re: [PATCH 2/6] iomap: IOMAP_DIO_RWF_NO_STALE_PAGECACHE return if page invalidation fails

2020-07-09 Thread Darrick J. Wong
On Thu, Jul 09, 2020 at 06:05:19PM +0100, Matthew Wilcox wrote:
> On Thu, Jul 09, 2020 at 09:09:26AM -0700, Darrick J. Wong wrote:
> > On Thu, Jul 09, 2020 at 12:25:27PM +1000, Dave Chinner wrote:
> > > iomap: Only invalidate page cache pages on direct IO writes
> > > 
> > > From: Dave Chinner 
> > > 
> > > The historic requirement for XFS to invalidate cached pages on
> > > direct IO reads has been lost in the twisty pages of history - it was
> > > inherited from Irix, which implemented page cache invalidation on
> > > read as a method of working around problems synchronising page
> > > cache state with uncached IO.
> > 
> > Urk.
> > 
> > > XFS has carried this ever since. In the initial linux ports it was
> > > necessary to get mmap and DIO to play "ok" together and not
> > > immediately corrupt data. This was the state of play until the linux
> > > kernel had infrastructure to track unwritten extents and synchronise
> > > page faults with allocations and unwritten extent conversions
> > > (->page_mkwrite infrastructure). IOWs, the page cache invalidation
> > > on DIO read was necessary to prevent trivial data corruptions. This
> > > didn't solve all the problems, though.
> > > 
> > > There were performance problems if we didn't invalidate the entire
> > > page cache over the file on read - we couldn't easily determine if
> > > the cached pages were over the range of the IO, and invalidation
> > > required taking a serialising lock (i_mutex) on the inode. This
> > > serialising lock was an issue for XFS, as it was the only exclusive
> > > lock in the direct IO read path.
> > > 
> > > Hence if there were any cached pages, we'd just invalidate the
> > > entire file in one go so that subsequent IOs didn't need to take the
> > > serialising lock. This was a problem that prevented ranged
> > > invalidation from being particularly useful for avoiding the
> > > remaining coherency issues. This was solved with the conversion of
> > > i_mutex to i_rwsem and the conversion of the XFS inode IO lock to
> > > use i_rwsem. Hence we could now just do ranged invalidation and the
> > > performance problem went away.
> > > 
> > > However, page cache invalidation was still needed to serialise
> > > sub-page/sub-block zeroing via direct IO against buffered IO because
> > > bufferhead state attached to the cached page could get out of whack
> > > when direct IOs were issued.  We've removed bufferheads from the
> > > XFS code, and we don't carry any extent state on the cached pages
> > > anymore, and so this problem has gone away, too.
> > > 
> > > IOWs, it would appear that we don't have any good reason to be
> > > invalidating the page cache on DIO reads anymore. Hence remove the
> > > invalidation on read because it is unnecessary overhead,
> > > not needed to maintain coherency between mmap/buffered access and
> > > direct IO anymore, and prevents anyone from using direct IO reads
> > > from intentionally invalidating the page cache of a file.
> > > 
> > > Signed-off-by: Dave Chinner 
> > > ---
> > >  fs/iomap/direct-io.c | 33 +
> > >  1 file changed, 17 insertions(+), 16 deletions(-)
> > > 
> > > diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> > > index ec7b78e6feca..ef0059eb34b5 100644
> > > --- a/fs/iomap/direct-io.c
> > > +++ b/fs/iomap/direct-io.c
> > > @@ -475,23 +475,24 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
> > >   if (ret)
> > >   goto out_free_dio;
> > >  
> > > - /*
> > > -  * Try to invalidate cache pages for the range we're direct
> > > -  * writing.  If this invalidation fails, tough, the write will
> > > -  * still work, but racing two incompatible write paths is a
> > > -  * pretty crazy thing to do, so we don't support it 100%.
> > 
> > I always wondered about the repeated use of 'write' in this comment
> > despite the lack of any sort of WRITE check logic.  Seems fine to me,
> > let's throw it on the fstests pile and see what happens.
> > 
> > Reviewed-by: Darrick J. Wong 
> 
> Reviewed-by: Matthew Wilcox (Oracle) 
> 
> > --D
> > 
> > > -  */
> > > - ret = invalidate_inode_pages2_range(mapping,
> > > - pos >> PAGE_SHIFT, end >> PAGE_SHIFT);
> > > - if (ret)
> > > - dio_warn_stale_pagecache(iocb->ki_filp);
> > > - ret = 0;
> > > + if (iov_iter_rw(iter) == WRITE) {
> > > + /*
> > > +  * Try to invalidate cache pages for the range we're direct
> > > +  * writing.  If this invalidation fails, tough, the write will
> > > +  * still work, but racing two incompatible write paths is a
> > > +  * pretty crazy thing to do, so we don't support it 100%.
> > > +  */
> > > + ret = invalidate_inode_pages2_range(mapping,
> > > + pos >> PAGE_SHIFT, end >> PAGE_SHIFT);
> > > + if (ret)
> > > + dio_warn_stale_pagecache(iocb->ki_filp);
> > > + ret = 0;
> > >  
> > > - if (iov_iter_rw(iter) == WRITE && 

Re: [Cluster-devel] always fall back to buffered I/O after invalidation failures, was: Re: [PATCH 2/6] iomap: IOMAP_DIO_RWF_NO_STALE_PAGECACHE return if page invalidation fails

2020-07-09 Thread Matthew Wilcox
On Thu, Jul 09, 2020 at 09:09:26AM -0700, Darrick J. Wong wrote:
> On Thu, Jul 09, 2020 at 12:25:27PM +1000, Dave Chinner wrote:
> > iomap: Only invalidate page cache pages on direct IO writes
> > 
> > From: Dave Chinner 
> > 
> > The historic requirement for XFS to invalidate cached pages on
> > direct IO reads has been lost in the twisty pages of history - it was
> > inherited from Irix, which implemented page cache invalidation on
> > read as a method of working around problems synchronising page
> > cache state with uncached IO.
> 
> Urk.
> 
> > XFS has carried this ever since. In the initial linux ports it was
> > necessary to get mmap and DIO to play "ok" together and not
> > immediately corrupt data. This was the state of play until the linux
> > kernel had infrastructure to track unwritten extents and synchronise
> > page faults with allocations and unwritten extent conversions
> > (->page_mkwrite infrastructure). IOWs, the page cache invalidation
> > on DIO read was necessary to prevent trivial data corruptions. This
> > didn't solve all the problems, though.
> > 
> > There were performance problems if we didn't invalidate the entire
> > page cache over the file on read - we couldn't easily determine if
> > the cached pages were over the range of the IO, and invalidation
> > required taking a serialising lock (i_mutex) on the inode. This
> > serialising lock was an issue for XFS, as it was the only exclusive
> > lock in the direct IO read path.
> > 
> > Hence if there were any cached pages, we'd just invalidate the
> > entire file in one go so that subsequent IOs didn't need to take the
> > serialising lock. This was a problem that prevented ranged
> > invalidation from being particularly useful for avoiding the
> > remaining coherency issues. This was solved with the conversion of
> > i_mutex to i_rwsem and the conversion of the XFS inode IO lock to
> > use i_rwsem. Hence we could now just do ranged invalidation and the
> > performance problem went away.
> > 
> > However, page cache invalidation was still needed to serialise
> > sub-page/sub-block zeroing via direct IO against buffered IO because
> > bufferhead state attached to the cached page could get out of whack
> > when direct IOs were issued.  We've removed bufferheads from the
> > XFS code, and we don't carry any extent state on the cached pages
> > anymore, and so this problem has gone away, too.
> > 
> > IOWs, it would appear that we don't have any good reason to be
> > invalidating the page cache on DIO reads anymore. Hence remove the
> > invalidation on read because it is unnecessary overhead,
> > not needed to maintain coherency between mmap/buffered access and
> > direct IO anymore, and prevents anyone from using direct IO reads
> > from intentionally invalidating the page cache of a file.
> > 
> > Signed-off-by: Dave Chinner 
> > ---
> >  fs/iomap/direct-io.c | 33 +
> >  1 file changed, 17 insertions(+), 16 deletions(-)
> > 
> > diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> > index ec7b78e6feca..ef0059eb34b5 100644
> > --- a/fs/iomap/direct-io.c
> > +++ b/fs/iomap/direct-io.c
> > @@ -475,23 +475,24 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
> > if (ret)
> > goto out_free_dio;
> >  
> > -   /*
> > -* Try to invalidate cache pages for the range we're direct
> > -* writing.  If this invalidation fails, tough, the write will
> > -* still work, but racing two incompatible write paths is a
> > -* pretty crazy thing to do, so we don't support it 100%.
> 
> I always wondered about the repeated use of 'write' in this comment
> despite the lack of any sort of WRITE check logic.  Seems fine to me,
> let's throw it on the fstests pile and see what happens.
> 
> Reviewed-by: Darrick J. Wong 

Reviewed-by: Matthew Wilcox (Oracle) 

> --D
> 
> > -*/
> > -   ret = invalidate_inode_pages2_range(mapping,
> > -   pos >> PAGE_SHIFT, end >> PAGE_SHIFT);
> > -   if (ret)
> > -   dio_warn_stale_pagecache(iocb->ki_filp);
> > -   ret = 0;
> > +   if (iov_iter_rw(iter) == WRITE) {
> > +   /*
> > +* Try to invalidate cache pages for the range we're direct
> > +* writing.  If this invalidation fails, tough, the write will
> > +* still work, but racing two incompatible write paths is a
> > +* pretty crazy thing to do, so we don't support it 100%.
> > +*/
> > +   ret = invalidate_inode_pages2_range(mapping,
> > +   pos >> PAGE_SHIFT, end >> PAGE_SHIFT);
> > +   if (ret)
> > +   dio_warn_stale_pagecache(iocb->ki_filp);
> > +   ret = 0;
> >  
> > -   if (iov_iter_rw(iter) == WRITE && !wait_for_completion &&
> > -   !inode->i_sb->s_dio_done_wq) {
> > -   ret = sb_init_dio_done_wq(inode->i_sb);
> > -   if (ret < 0)
> > -   goto out_free_dio;
> > +   if 

Re: [Cluster-devel] always fall back to buffered I/O after invalidation failures, was: Re: [PATCH 2/6] iomap: IOMAP_DIO_RWF_NO_STALE_PAGECACHE return if page invalidation fails

2020-07-09 Thread Darrick J. Wong
On Thu, Jul 09, 2020 at 12:25:27PM +1000, Dave Chinner wrote:
> On Wed, Jul 08, 2020 at 02:54:37PM +0100, Matthew Wilcox wrote:
> > On Wed, Jul 08, 2020 at 04:51:27PM +1000, Dave Chinner wrote:
> > > On Tue, Jul 07, 2020 at 03:00:30PM +0200, Christoph Hellwig wrote:
> > > > On Tue, Jul 07, 2020 at 01:57:05PM +0100, Matthew Wilcox wrote:
> > > > > Indeed, I'm in favour of not invalidating
> > > > > the page cache at all for direct I/O.  For reads, I think the page 
> > > > > cache
> > > > > should be used to satisfy any portion of the read which is currently
> > > > > cached.  For writes, I think we should write into the page cache pages
> > > > > which currently exist, and then force those pages to be written back,
> > > > > but left in cache.
> > > > 
> > > > Something like that, yes.
> > > 
> > > So are we really willing to take the performance regression that
> > > occurs from copying out of the page cache consuming lots more CPU
> > > than an actual direct IO read? Or that direct IO writes suddenly
> > > serialise because there are page cache pages and now we have to do
> > > buffered IO?
> > > 
> > > Direct IO should be a deterministic, zero-copy IO path to/from
> > > storage. Using the CPU to copy data during direct IO is the complete
> > > opposite of the intended functionality, not to mention the behaviour
> > > that many applications have been careful designed and tuned for.
> > 
> > Direct I/O isn't deterministic though.
> 
> When all the application is doing is direct IO, it is as
> deterministic as the underlying storage behaviour. This is the best
> we can possibly do from the perspective of the filesystem, and this
> is largely what Direct IO requires from the filesystem.
> 
> Direct IO starts from delegating all responsibility for IO
> synchronisation, data coherency and integrity to userspace, and then
> follows up by requiring the filesystem and kernel to stay out of the
> IO path except where it is absolutely necessary to read or write
> data to/from the underlying storage hardware. Serving Direct IO from
> the page cache violates the second of these requirements.
> 
> > If the file isn't shared, then
> > it works great, but as soon as you get mixed buffered and direct I/O,
> > everything is already terrible.
> 
> Right, but that's *the rare exception* for applications using direct
> IO, not the common fast path. It is the slow path where -shit has
> already gone wrong on the production machine-, and it most
> definitely does not change the DIO requirements that the filesystem
> needs to provide userspace via the direct IO path.
> 
> Optimising the slow exception path is fine if it does not affect the
> guarantees we try to provide through the common/fast path. If it is
> does affect behaviour of the fast path, then we must be able to
> either turn it off or provide our own alternative implementation
> that does not have that cost.
> 
> > Direct I/Os perform pagecache lookups
> > already, but instead of using the data that we found in the cache, we
> > (if it's dirty) write it back, wait for the write to complete, remove
> > the page from the pagecache and then perform another I/O to get the data
> > that we just wrote out!  And then the app that's using buffered I/O has
> > to read it back in again.
> 
> Yup, that's because we have a history going back 20+ years of people
> finding performance regressions in applications using direct IO when
> we incorrectly left pages in the page cache rather than
> invalidating them and continuing to do direct IO.
> 
> 
> > Nobody's proposing changing Direct I/O to exclusively work through the
> > pagecache.  The proposal is to behave less weirdly when there's already
> > data in the pagecache.
> 
> No, the proposal is to make direct IO behave *less
> deterministically* if there is data in the page cache.
> 
> e.g. Instead of having a predictable submission CPU overhead and
> read latency of 100us for your data, this proposal makes the claim
> that it is always better to burn 10x the IO submission CPU for a
> single IO to copy the data and give that specific IO 10x lower
> latency than it is to submit 10 async IOs to keep the IO pipeline
> full.
> 
> What it fails to take into account is that in spending that CPU time
> to copy the data, we haven't submitted 10 other IOs and so the
> actual in-flight IO for the application has decreased. If
> performance comes from keeping the IO pipeline as close to 100% full
> as possible, then copying the data out of the page cache will cause
> performance regressions.
> 
> i.e. Hit 5 page cache pages in 5 IOs in a row, and the IO queue
> depth craters because we've only fulfilled 5 complete IOs instead of
> submitting 50 entire IOs. This is the hidden cost of synchronous IO
> via CPU data copying vs async IO via hardware offload, and if we
> take that into account we must look at future hardware performance
> trends to determine if this cost is going to increase or decrease in
> future.
> 
> That is: CPUs 

[Cluster-devel] [PATCH] lseek.2: List gfs2 support for SEEK_HOLE/SEEK_DATA

2020-07-09 Thread Andrew Price
Signed-off-by: Andrew Price 
---
 man2/lseek.2 | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/man2/lseek.2 b/man2/lseek.2
index a20b10431..22de446fc 100644
--- a/man2/lseek.2
+++ b/man2/lseek.2
@@ -182,6 +182,9 @@ NFS (since Linux 3.18)
 .IP *
 FUSE (since Linux 4.5)
 .\" commit 0b5da8db145bfd44266ac964a2636a0cf8d7c286
+.IP *
+GFS2 (since Linux 4.15)
+.\" commit 3a27411cb4bc3ce31db228e3569ad01b462a4310
 .SH RETURN VALUE
 Upon successful completion,
 .BR lseek ()
-- 
2.26.2
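
For readers unfamiliar with the interface being documented, a minimal
sketch of walking a file's data segments with SEEK_DATA/SEEK_HOLE
(error handling abbreviated; the file name is a placeholder):

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
            off_t data, hole = 0;
            int fd = open("/mnt/gfs2/somefile", O_RDONLY);

            if (fd < 0)
                    return 1;
            /* Alternate between the next data offset and the hole after it. */
            while ((data = lseek(fd, hole, SEEK_DATA)) >= 0) {
                    hole = lseek(fd, data, SEEK_HOLE);
                    if (hole < 0)
                            break;
                    printf("data: %lld..%lld\n", (long long)data, (long long)hole);
            }
            close(fd);
            return 0;
    }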



Re: [Cluster-devel] always fall back to buffered I/O after invalidation failures, was: Re: [PATCH 2/6] iomap: IOMAP_DIO_RWF_NO_STALE_PAGECACHE return if page invalidation fails

2020-07-09 Thread Steven Whitehouse

Hi,

On 08/07/2020 17:54, Christoph Hellwig wrote:

On Wed, Jul 08, 2020 at 02:54:37PM +0100, Matthew Wilcox wrote:

Direct I/O isn't deterministic though.  If the file isn't shared, then
it works great, but as soon as you get mixed buffered and direct I/O,
everything is already terrible.  Direct I/Os perform pagecache lookups
already, but instead of using the data that we found in the cache, we
(if it's dirty) write it back, wait for the write to complete, remove
the page from the pagecache and then perform another I/O to get the data
that we just wrote out!  And then the app that's using buffered I/O has
to read it back in again.

Mostly agreed.  That being said I suspect invalidating clean cache
might still be a good idea.  The original idea was mostly on how
to deal with invalidation failures of any kind, but falling back for
any kind of dirty cache also makes at least some sense.


I have had an objection raised off-list.  In a scenario with a block
device shared between two systems and an application which does direct
I/O, everything is normally fine.  If one of the systems uses tar to
back up the contents of the block device then the application on that
system will no longer see the writes from the other system because
there's nothing to invalidate the pagecache on the first system.

Err, WTF?  If someone accesses shared block devices with random
applications all bets are off anyway.


On GFS2 the locking should take care of that. Not 100% sure about OCFS2 
without looking, but I'm fairly sure that they have a similar 
arrangement. So this shouldn't be a problem unless there is an 
additional cluster fs that I'm not aware of that they are using in this 
case. It would be good to confirm which fs they are using.


Steve.