from:"NeilBrown"

Re: [PATCH v3 00/47] filelock: split file leases out of struct file_lock

2024-02-01 Thread NeilBrown

On Thu, 01 Feb 2024, Jeff Layton wrote:
> I'm not sure this is much prettier than the last, but contracting
> "fl_core" to "c", as Neil suggested is a bit easier on the eyes.
> 
> I also added a few small helpers and converted several users over to
> them. That reduces the size of the per-fs conversion patches later in
> the series. I played with some others too, but they were too awkward
> or not frequently used enough to make it worthwhile.
> 
> Many thanks to Chuck and Neil for the earlier R-b's and comments. I've
> dropped those for now since this set is a bit different from the last.
> 
> I'd like to get this into linux-next soon and we can see about merging
> it for v6.9, unless anyone has major objections.

For all patches:
  Reviewed-by: NeilBrown 

Thanks Jeff - I think this is a good and useful change and while it
might not all be as pretty as I might like, I don't see any way to
improve it and think it is certainly good enough to merge.

I think the conversion from "fl_core" to "c" does work well enough.
I particularly like how the removal of the IS_* macros (patch 16) turned
out.  The inline expansion of IS_LEASE() in particular makes the code
clearer to me.

Thanks,
NeilBrown


> 
> Thanks!
> 
> Signed-off-by: Jeff Layton 
> ---
> Changes in v3:
> - Rename "flc_core" fields in file_lock and file_lease to "c"
> - new helpers: locks_wake_up, for_each_file_lock, and 
> lock_is_{unlock,read,write}
> - Link to v2: 
> https://lore.kernel.org/r/20240125-flsplit-v2-0-7485322b6...@kernel.org
> 
> Changes in v2:
> - renamed file_lock_core fields to have "flc_" prefix
> - used macros to more easily do the change piecemeal
> - broke up patches into per-subsystem ones
> - Link to v1: 
> https://lore.kernel.org/r/20240116-flsplit-v1-0-c9d0f4370...@kernel.org
> 
> ---
> Jeff Layton (47):
>   filelock: fl_pid field should be signed int
>   filelock: rename some fields in tracepoints
>   filelock: rename fl_pid variable in lock_get_status
>   filelock: add some new helper functions
>   9p: rename fl_type variable in v9fs_file_do_lock
>   afs: convert to using new filelock helpers
>   ceph: convert to using new filelock helpers
>   dlm: convert to using new filelock helpers
>   gfs2: convert to using new filelock helpers
>   lockd: convert to using new filelock helpers
>   nfs: convert to using new filelock helpers
>   nfsd: convert to using new filelock helpers
>   ocfs2: convert to using new filelock helpers
>   smb/client: convert to using new filelock helpers
>   smb/server: convert to using new filelock helpers
>   filelock: drop the IS_* macros
>   filelock: split common fields into struct file_lock_core
>   filelock: have fs/locks.c deal with file_lock_core directly
>   filelock: convert more internal functions to use file_lock_core
>   filelock: make posix_same_owner take file_lock_core pointers
>   filelock: convert posix_owner_key to take file_lock_core arg
>   filelock: make locks_{insert,delete}_global_locks take file_lock_core 
> arg
>   filelock: convert locks_{insert,delete}_global_blocked
>   filelock: make __locks_delete_block and __locks_wake_up_blocks take 
> file_lock_core
>   filelock: convert __locks_insert_block, conflict and deadlock checks to 
> use file_lock_core
>   filelock: convert fl_blocker to file_lock_core
>   filelock: clean up locks_delete_block internals
>   filelock: reorganize locks_delete_block and __locks_insert_block
>   filelock: make assign_type helper take a file_lock_core pointer
>   filelock: convert locks_wake_up_blocks to take a file_lock_core pointer
>   filelock: convert locks_insert_lock_ctx and locks_delete_lock_ctx
>   filelock: convert locks_translate_pid to take file_lock_core
>   filelock: convert seqfile handling to use file_lock_core
>   9p: adapt to breakup of struct file_lock
>   afs: adapt to breakup of struct file_lock
>   ceph: adapt to breakup of struct file_lock
>   dlm: adapt to breakup of struct file_lock
>   gfs2: adapt to breakup of struct file_lock
>   fuse: adapt to breakup of struct file_lock
>   lockd: adapt to breakup of struct file_lock
>   nfs: adapt to breakup of struct file_lock
>   nfsd: adapt to breakup of struct file_lock
>   ocfs2: adapt to breakup of struct file_lock
>   smb/client: adapt to breakup of struct file_lock
>   smb/server: adapt to breakup of struct file_lock
>   filelock: remove temporary compatibility macros
>   filelock: split leases out of struct file_lock
> 
>  fs/9p/vfs_file.c|  4

Re: [PATCH v3 34/47] 9p: adapt to breakup of struct file_lock

2024-02-01 Thread NeilBrown

On Thu, 01 Feb 2024, Jeff Layton wrote:
> Most of the existing APIs have remained the same, but subsystems that
> access file_lock fields directly need to reach into struct
> file_lock_core now.
> 
> Signed-off-by: Jeff Layton 
> ---
>  fs/9p/vfs_file.c | 39 +++
>  1 file changed, 19 insertions(+), 20 deletions(-)
> 
> diff --git a/fs/9p/vfs_file.c b/fs/9p/vfs_file.c
> index a1dabcf73380..abdbbaee5184 100644
> --- a/fs/9p/vfs_file.c
> +++ b/fs/9p/vfs_file.c
> @@ -9,7 +9,6 @@
>  #include 
>  #include 
>  #include 
> -#define _NEED_FILE_LOCK_FIELD_MACROS
>  #include 
>  #include 
>  #include 
> @@ -108,7 +107,7 @@ static int v9fs_file_lock(struct file *filp, int cmd, 
> struct file_lock *fl)
>  
>   p9_debug(P9_DEBUG_VFS, "filp: %p lock: %p\n", filp, fl);
>  
> - if ((IS_SETLK(cmd) || IS_SETLKW(cmd)) && fl->fl_type != F_UNLCK) {
> + if ((IS_SETLK(cmd) || IS_SETLKW(cmd)) && fl->c.flc_type != F_UNLCK) {

Should this have already been changed to lock_is_unlock() ???

NeilBrown


>   filemap_write_and_wait(inode->i_mapping);
>   invalidate_mapping_pages(>i_data, 0, -1);
>   }
> @@ -127,7 +126,7 @@ static int v9fs_file_do_lock(struct file *filp, int cmd, 
> struct file_lock *fl)
>   fid = filp->private_data;
>   BUG_ON(fid == NULL);
>  
> - BUG_ON((fl->fl_flags & FL_POSIX) != FL_POSIX);
> + BUG_ON((fl->c.flc_flags & FL_POSIX) != FL_POSIX);
>  
>   res = locks_lock_file_wait(filp, fl);
>   if (res < 0)
> @@ -136,7 +135,7 @@ static int v9fs_file_do_lock(struct file *filp, int cmd, 
> struct file_lock *fl)
>   /* convert posix lock to p9 tlock args */
>   memset(, 0, sizeof(flock));
>   /* map the lock type */
> - switch (fl->fl_type) {
> + switch (fl->c.flc_type) {
>   case F_RDLCK:
>   flock.type = P9_LOCK_TYPE_RDLCK;
>   break;
> @@ -152,7 +151,7 @@ static int v9fs_file_do_lock(struct file *filp, int cmd, 
> struct file_lock *fl)
>   flock.length = 0;
>   else
>   flock.length = fl->fl_end - fl->fl_start + 1;
> - flock.proc_id = fl->fl_pid;
> + flock.proc_id = fl->c.flc_pid;
>   flock.client_id = fid->clnt->name;
>   if (IS_SETLKW(cmd))
>   flock.flags = P9_LOCK_FLAGS_BLOCK;
> @@ -207,13 +206,13 @@ static int v9fs_file_do_lock(struct file *filp, int 
> cmd, struct file_lock *fl)
>* incase server returned error for lock request, revert
>* it locally
>*/
> - if (res < 0 && fl->fl_type != F_UNLCK) {
> - unsigned char type = fl->fl_type;
> + if (res < 0 && fl->c.flc_type != F_UNLCK) {
> + unsigned char type = fl->c.flc_type;
>  
> - fl->fl_type = F_UNLCK;
> + fl->c.flc_type = F_UNLCK;
>   /* Even if this fails we want to return the remote error */
>   locks_lock_file_wait(filp, fl);
> - fl->fl_type = type;
> + fl->c.flc_type = type;
>   }
>   if (flock.client_id != fid->clnt->name)
>   kfree(flock.client_id);
> @@ -235,7 +234,7 @@ static int v9fs_file_getlock(struct file *filp, struct 
> file_lock *fl)
>* if we have a conflicting lock locally, no need to validate
>* with server
>*/
> - if (fl->fl_type != F_UNLCK)
> + if (fl->c.flc_type != F_UNLCK)
>   return res;
>  
>   /* convert posix lock to p9 tgetlock args */
> @@ -246,7 +245,7 @@ static int v9fs_file_getlock(struct file *filp, struct 
> file_lock *fl)
>   glock.length = 0;
>   else
>   glock.length = fl->fl_end - fl->fl_start + 1;
> - glock.proc_id = fl->fl_pid;
> + glock.proc_id = fl->c.flc_pid;
>   glock.client_id = fid->clnt->name;
>  
>   res = p9_client_getlock_dotl(fid, );
> @@ -255,13 +254,13 @@ static int v9fs_file_getlock(struct file *filp, struct 
> file_lock *fl)
>   /* map 9p lock type to os lock type */
>   switch (glock.type) {
>   case P9_LOCK_TYPE_RDLCK:
> - fl->fl_type = F_RDLCK;
> + fl->c.flc_type = F_RDLCK;
>   break;
>   case P9_LOCK_TYPE_WRLCK:
> - fl->fl_type = F_WRLCK;
> + fl->c.flc_type = F_WRLCK;
>   break;
>   case P9_LOCK_TYPE_UNLCK:
> - fl->fl_type = F_UNLCK;
> + fl->c.flc_type = F_UNLCK;
>   break;
>   }
>   if (glock.type != P9_LOCK

Re: [PATCH v2 00/41] filelock: split struct file_lock into file_lock and file_lease structs

2024-01-25 Thread NeilBrown

On Fri, 26 Jan 2024, Chuck Lever wrote:
> On Thu, Jan 25, 2024 at 05:42:41AM -0500, Jeff Layton wrote:
> > Long ago, file locks used to hang off of a singly-linked list in struct
> > inode. Because of this, when leases were added, they were added to the
> > same list and so they had to be tracked using the same sort of
> > structure.
> > 
> > Several years ago, we added struct file_lock_context, which allowed us
> > to use separate lists to track different types of file locks. Given
> > that, leases no longer need to be tracked using struct file_lock.
> > 
> > That said, a lot of the underlying infrastructure _is_ the same between
> > file leases and locks, so we can't completely separate everything.
> > 
> > This patchset first splits a group of fields used by both file locks and
> > leases into a new struct file_lock_core, that is then embedded in struct
> > file_lock. Coccinelle was then used to convert a lot of the callers to
> > deal with the move, with the remaining 25% or so converted by hand.
> > 
> > It then converts several internal functions in fs/locks.c to work
> > with struct file_lock_core. Lastly, struct file_lock is split into
> > struct file_lock and file_lease, and the lease-related APIs converted to
> > take struct file_lease.
> > 
> > After the first few patches (which I left split up for easier review),
> > the set should be bisectable. I'll plan to squash the first few
> > together to make sure the resulting set is bisectable before merge.
> > 
> > Finally, I left the coccinelle scripts I used in tree. I had heard it
> > was preferable to merge those along with the patches that they
> > generate, but I wasn't sure where they go. I can either move those to a
> > more appropriate location or we can just drop that commit if it's not
> > needed.
> > 
> > Signed-off-by: Jeff Layton 
> 
> v2 looks nicer.
> 
> I would add a few list handling primitives, as I see enough
> instances of list_for_each_entry, list_for_each_entry_safe,
> list_first_entry, and list_first_entry_or_null on fl_core.flc_list
> to make it worth having those.
> 
> Also, there doesn't seem to be benefit for API consumers to have to
> understand the internal structure of struct file_lock/lease to reach
> into fl_core. Having accessor functions for common fields like
> fl_type and fl_flags could be cleaner.

I'm not a big fan of accessor functions.  They don't *look* like normal
field access, so a casual reader has to go find out what the function
does, just to find the it doesn't really do anything.

But neither am I a fan have requiring filesystems to use
"fl_core.flc_foo".  As you say, reaching into fl_core isn't ideal.

It would be nice if we could make fl_core and anonymous structure, but
that really requires -fplan9-extensions which Linus is on-record as not
liking.
Unless...

How horrible would it be to use

   union {
   struct file_lock_core flc_core;
   struct file_lock_core;
   };

I think that only requires -fms-extensions, which Linus was less
negative towards.  That would allow access to the members of
file_lock_core without the "flc_core." prefix, but would still allow
getting the address of 'flc_core'.
Maybe it's too ugly.

While fl_type and fl_flags are most common, fl_pid, fl_owner, fl_file
and even fl_wait are also used.  Having accessor functions for all of those
would be too much I think.

Maybe higher-level functions which meet the real need of the filesystem
might be a useful approach:

 locks_wakeup(lock)
 locks_wait_interruptible(lock, condition)
 locks_posix_init(lock, type, pid, ...) ??
 locks_is_unlock() - fl_type is compared with F_UNLCK 22 times.

While those are probably a good idea, through don't really help much
with reducing the need for accessor functions.

I don't suppose we could just leave the #defines in place?  Probably not
a good idea.

Maybe spell "fl_core" as "c"?  lk->c.flc_flags ???

And I wonder if we could have a new fl_flag for 'FOREIGN' locks rather
than encoding that flag in the sign of the pid.  That seems a bit ...
clunky?

NeilBrown

Re: [PATCH 00/20] filelock: split struct file_lock into file_lock and file_lease structs

2024-01-17 Thread NeilBrown

On Thu, 18 Jan 2024, Chuck Lever wrote:
> On Tue, Jan 16, 2024 at 02:45:56PM -0500, Jeff Layton wrote:
> > Long ago, file locks used to hang off of a singly-linked list in struct
> > inode. Because of this, when leases were added, they were added to the
> > same list and so they had to be tracked using the same sort of
> > structure.
> > 
> > Several years ago, we added struct file_lock_context, which allowed us
> > to use separate lists to track different types of file locks. Given
> > that, leases no longer need to be tracked using struct file_lock.
> > 
> > That said, a lot of the underlying infrastructure _is_ the same between
> > file leases and locks, so we can't completely separate everything.
> 
> Naive question: locks and leases are similar. Why do they need to be
> split apart? The cover letter doesn't address that, and I'm new
> enough at this that I don't have that context.

For me, the big win was in the last patch where we got separate
lock_manager_operations and lease_manager_operations.  There is zero
overlap between the two.  This highlights that one one level these are
different things with different behaviours.

NeilBrown

Re: [PATCH 20/20] filelock: split leases out of struct file_lock

2024-01-16 Thread NeilBrown

On Wed, 17 Jan 2024, Jeff Layton wrote:
> Add a new struct file_lease and move the lease-specific fields from
> struct file_lock to it. Convert the appropriate API calls to take
> struct file_lease instead, and convert the callers to use them.

I think that splitting of struct lease_manager_operations out from
lock_manager_operations should be mentioned here too.

>  
> +struct file_lease {
> + struct file_lock_core fl_core;
> + struct fasync_struct *  fl_fasync; /* for lease break notifications */
> + /* for lease breaks: */
> + unsigned long fl_break_time;
> + unsigned long fl_downgrade_time;
> + const struct lease_manager_operations *fl_lmops;/* Callbacks 
> for lockmanagers */

comment should be "Callbacks for leasemanagers".  Or maybe 
"lease managers". 

It is unfortunate that "lock" and "lease" both start with 'l' as we now
have two quite different fields in different structures with the same
name - fl_lmops.

NeilBrown

Re: [PATCH 13/20] filelock: convert __locks_insert_block, conflict and deadlock checks to use file_lock_core

2024-01-16 Thread NeilBrown

On Wed, 17 Jan 2024, Jeff Layton wrote:
> Have both __locks_insert_block and the deadlock and conflict checking
> functions take a struct file_lock_core pointer instead of a struct
> file_lock one. Also, change posix_locks_deadlock to return bool.
> 
> Signed-off-by: Jeff Layton 
> ---
>  fs/locks.c | 132 
> -
>  1 file changed, 70 insertions(+), 62 deletions(-)
> 


>  
>  /* Must be called with the blocked_lock_lock held! */
> -static int posix_locks_deadlock(struct file_lock *caller_fl,
> - struct file_lock *block_fl)
> +static bool posix_locks_deadlock(struct file_lock *caller_fl,
> +  struct file_lock *block_fl)
>  {
> + struct file_lock_core *caller = _fl->fl_core;
> + struct file_lock_core *blocker = _fl->fl_core;
>   int i = 0;
> - struct file_lock_core *flc = _fl->fl_core;
>  
>   lockdep_assert_held(_lock_lock);
>  
> @@ -1034,16 +1040,16 @@ static int posix_locks_deadlock(struct file_lock 
> *caller_fl,
>* This deadlock detector can't reasonably detect deadlocks with
>* FL_OFDLCK locks, since they aren't owned by a process, per-se.
>*/
> - if (IS_OFDLCK(flc))
> + if (IS_OFDLCK(caller))
>   return 0;

  return false;

Thanks,
NeilBrown

Re: [PATCH 12/20] filelock: make __locks_delete_block and __locks_wake_up_blocks take file_lock_core

2024-01-16 Thread NeilBrown

On Wed, 17 Jan 2024, Jeff Layton wrote:
> Convert __locks_delete_block and __locks_wake_up_blocks to take a struct
> file_lock_core pointer. Note that to accomodate this, we need to add a
> new file_lock() wrapper to go from file_lock_core to file_lock.

Actually we don't need it see below.

> 
> Signed-off-by: Jeff Layton 
> ---
>  fs/locks.c | 43 ++-
>  1 file changed, 26 insertions(+), 17 deletions(-)
> 
> diff --git a/fs/locks.c b/fs/locks.c
> index eddf4d767d5d..6b8e8820dec9 100644
> --- a/fs/locks.c
> +++ b/fs/locks.c
> @@ -92,6 +92,11 @@ static inline bool IS_LEASE(struct file_lock_core *flc)
>  
>  #define IS_REMOTELCK(fl) (fl->fl_core.fl_pid <= 0)
>  
> +struct file_lock *file_lock(struct file_lock_core *flc)
> +{
> + return container_of(flc, struct file_lock, fl_core);
> +}
> +
>  static bool lease_breaking(struct file_lock *fl)
>  {
>   return fl->fl_core.fl_flags & (FL_UNLOCK_PENDING | 
> FL_DOWNGRADE_PENDING);
> @@ -677,31 +682,35 @@ static void locks_delete_global_blocked(struct 
> file_lock_core *waiter)
>   *
>   * Must be called with blocked_lock_lock held.
>   */
> -static void __locks_delete_block(struct file_lock *waiter)
> +static void __locks_delete_block(struct file_lock_core *waiter)
>  {
> - locks_delete_global_blocked(>fl_core);
> - list_del_init(>fl_core.fl_blocked_member);
> + locks_delete_global_blocked(waiter);
> + list_del_init(>fl_blocked_member);
>  }
>  
> -static void __locks_wake_up_blocks(struct file_lock *blocker)
> +static void __locks_wake_up_blocks(struct file_lock_core *blocker)
>  {
> - while (!list_empty(>fl_core.fl_blocked_requests)) {
> - struct file_lock *waiter;
> + while (!list_empty(>fl_blocked_requests)) {
> + struct file_lock_core *waiter;
> + struct file_lock *fl;
> +
> + waiter = list_first_entry(>fl_blocked_requests,
> +   struct file_lock_core, 
> fl_blocked_member);
>  
> - waiter = list_first_entry(>fl_core.fl_blocked_requests,
> -   struct file_lock, 
> fl_core.fl_blocked_member);

> + fl = file_lock(waiter);

fl = list_first_entry(>fl_core.fl_blocked_requests,
  struct file_lock, 
fl_core.fl_blocked_member);

waiter = >fl_core;

achieves the same result without needing file_lock().

If you really want to add file_lock() then do so, but you need a better
justification :-)

NeilBrown



>   __locks_delete_block(waiter);
> - if (waiter->fl_lmops && waiter->fl_lmops->lm_notify)
> - waiter->fl_lmops->lm_notify(waiter);
> + if ((IS_POSIX(waiter) || IS_FLOCK(waiter)) &&
> + fl->fl_lmops && fl->fl_lmops->lm_notify)
> + fl->fl_lmops->lm_notify(fl);
>   else
> - wake_up(>fl_core.fl_wait);
> + wake_up(>fl_wait);
>  
>   /*
>* The setting of fl_blocker to NULL marks the "done"
>* point in deleting a block. Paired with acquire at the top
>* of locks_delete_block().
>*/
> - smp_store_release(>fl_core.fl_blocker, NULL);
> + smp_store_release(>fl_blocker, NULL);
>   }
>  }
>  
> @@ -743,8 +752,8 @@ int locks_delete_block(struct file_lock *waiter)
>   spin_lock(_lock_lock);
>   if (waiter->fl_core.fl_blocker)
>   status = 0;
> - __locks_wake_up_blocks(waiter);
> - __locks_delete_block(waiter);
> + __locks_wake_up_blocks(>fl_core);
> + __locks_delete_block(>fl_core);
>  
>   /*
>* The setting of fl_blocker to NULL marks the "done" point in deleting
> @@ -799,7 +808,7 @@ static void __locks_insert_block(struct file_lock 
> *blocker,
>* waiter, but might not conflict with blocker, or the requests
>* and lock which block it.  So they all need to be woken.
>*/
> - __locks_wake_up_blocks(waiter);
> + __locks_wake_up_blocks(>fl_core);
>  }
>  
>  /* Must be called with flc_lock held. */
> @@ -831,7 +840,7 @@ static void locks_wake_up_blocks(struct file_lock 
> *blocker)
>   return;
>  
>   spin_lock(_lock_lock);
> - __locks_wake_up_blocks(blocker);
> + __locks_wake_up_blocks(>fl_core);
>   spin_unlock(_lock_lock);
>  }
>  
> @@ -1186,7 +1195,7 @@ static int posix_lock_inode

Re: [PATCH 11/20] filelock: convert the IS_* macros to take file_lock_core

2024-01-16 Thread NeilBrown

On Wed, 17 Jan 2024, Jeff Layton wrote:
> I couldn't get them to work properly as macros, so convert them
> to static inlines instead (which is probably better for the type safety
> anyway).
> 
> Signed-off-by: Jeff Layton 
> ---
>  fs/locks.c | 46 +-
>  1 file changed, 33 insertions(+), 13 deletions(-)
> 
> diff --git a/fs/locks.c b/fs/locks.c
> index 770aaa5809ba..eddf4d767d5d 100644
> --- a/fs/locks.c
> +++ b/fs/locks.c
> @@ -70,10 +70,26 @@
>  
>  #include 
>  
> -#define IS_POSIX(fl) (fl->fl_core.fl_flags & FL_POSIX)
Used 3 times... once as
if (IS_POSIX(blocker) && !IS_OFDLCK(blocker))
Can an IS_POSIX lock also be IS_OFDLCK ??


> -#define IS_FLOCK(fl) (fl->fl_core.fl_flags & FL_FLOCK)
Used once.

> -#define IS_LEASE(fl) (fl->fl_core.fl_flags & (FL_LEASE|FL_DELEG|FL_LAYOUT))
Used twice.  Either "IS_LEASE" approves things that aren't leases, or
FL_LEASE is not set on all leases Names could be improved.

> -#define IS_OFDLCK(fl)(fl->fl_core.fl_flags & FL_OFDLCK)

used 4 times - a clear winner :-)

If it would me, I would simply discard these macros and open-code the
tests.  I don't think IS_FLOCK() is easier to read for someone who knows
the code, and I think IS_LEASE() is actually harder to read for someone
who doesn't know the code, as that it does it not really obvious.

But this is just a suggestion, I won't push it.

Thanks,
NeilBrown


> +static inline bool IS_POSIX(struct file_lock_core *flc)
> +{
> + return flc->fl_flags & FL_POSIX;
> +}
> +
> +static inline bool IS_FLOCK(struct file_lock_core *flc)
> +{
> + return flc->fl_flags & FL_FLOCK;
> +}
> +
> +static inline bool IS_OFDLCK(struct file_lock_core *flc)
> +{
> + return flc->fl_flags & FL_OFDLCK;
> +}
> +
> +static inline bool IS_LEASE(struct file_lock_core *flc)
> +{
> + return flc->fl_flags & (FL_LEASE|FL_DELEG|FL_LAYOUT);
> +}
> +
>  #define IS_REMOTELCK(fl) (fl->fl_core.fl_pid <= 0)
>  
>  static bool lease_breaking(struct file_lock *fl)
> @@ -761,6 +777,7 @@ static void __locks_insert_block(struct file_lock 
> *blocker,
>  struct file_lock *))
>  {
>   struct file_lock *fl;
> + struct file_lock_core *bflc;
>   BUG_ON(!list_empty(>fl_core.fl_blocked_member));
>  
>  new_blocker:
> @@ -773,7 +790,9 @@ static void __locks_insert_block(struct file_lock 
> *blocker,
>   waiter->fl_core.fl_blocker = blocker;
>   list_add_tail(>fl_core.fl_blocked_member,
> >fl_core.fl_blocked_requests);
> - if (IS_POSIX(blocker) && !IS_OFDLCK(blocker))
> +
> + bflc = >fl_core;
> + if (IS_POSIX(bflc) && !IS_OFDLCK(bflc))
>   locks_insert_global_blocked(>fl_core);
>  
>   /* The requests in waiter->fl_blocked are known to conflict with
> @@ -998,6 +1017,7 @@ static int posix_locks_deadlock(struct file_lock 
> *caller_fl,
>   struct file_lock *block_fl)
>  {
>   int i = 0;
> + struct file_lock_core *flc = _fl->fl_core;
>  
>   lockdep_assert_held(_lock_lock);
>  
> @@ -1005,7 +1025,7 @@ static int posix_locks_deadlock(struct file_lock 
> *caller_fl,
>* This deadlock detector can't reasonably detect deadlocks with
>* FL_OFDLCK locks, since they aren't owned by a process, per-se.
>*/
> - if (IS_OFDLCK(caller_fl))
> + if (IS_OFDLCK(flc))
>   return 0;
>  
>   while ((block_fl = what_owner_is_waiting_for(block_fl))) {
> @@ -2157,7 +2177,7 @@ static pid_t locks_translate_pid(struct file_lock *fl, 
> struct pid_namespace *ns)
>   pid_t vnr;
>   struct pid *pid;
>  
> - if (IS_OFDLCK(fl))
> + if (IS_OFDLCK(>fl_core))
>   return -1;
>   if (IS_REMOTELCK(fl))
>   return fl->fl_core.fl_pid;
> @@ -2721,19 +2741,19 @@ static void lock_get_status(struct seq_file *f, 
> struct file_lock *fl,
>   if (repeat)
>   seq_printf(f, "%*s", repeat - 1 + (int)strlen(pfx), pfx);
>  
> - if (IS_POSIX(fl)) {
> + if (IS_POSIX(>fl_core)) {
>   if (fl->fl_core.fl_flags & FL_ACCESS)
>   seq_puts(f, "ACCESS");
> - else if (IS_OFDLCK(fl))
> + else if (IS_OFDLCK(>fl_core))
>   seq_puts(f, "OFDLCK");
>   else
>   seq_puts(f, "POSIX ");
>  
>   seq_printf(f, " %s ",
>(inode

Re: [PATCH 01/20] filelock: split common fields into struct file_lock_core

2024-01-16 Thread NeilBrown

On Wed, 17 Jan 2024, Jeff Layton wrote:
> In a future patch, we're going to split file leases into their own
> structure. Since a lot of the underlying machinery uses the same fields
> move those into a new file_lock_core, and embed that inside struct
> file_lock.
> 
> Signed-off-by: Jeff Layton 
> ---
>  include/linux/filelock.h | 9 +++--
>  1 file changed, 7 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/filelock.h b/include/linux/filelock.h
> index 95e868e09e29..7825511c1c11 100644
> --- a/include/linux/filelock.h
> +++ b/include/linux/filelock.h
> @@ -85,8 +85,9 @@ bool opens_in_grace(struct net *);
>   *
>   * Obviously, the last two criteria only matter for POSIX locks.
>   */
> -struct file_lock {
> - struct file_lock *fl_blocker;   /* The lock, that is blocking us */
> +
> +struct file_lock_core {
> + struct file_lock *fl_blocker;   /* The lock that is blocking us */
>   struct list_head fl_list;   /* link into file_lock_context */
>   struct hlist_node fl_link;  /* node in global lists */
>   struct list_head fl_blocked_requests;   /* list of requests with
> @@ -102,6 +103,10 @@ struct file_lock {
>   int fl_link_cpu;/* what cpu's list is this on? */
>   wait_queue_head_t fl_wait;
>   struct file *fl_file;
> +};
> +
> +struct file_lock {
> + struct file_lock_core fl_core;
>   loff_t fl_start;
>   loff_t fl_end;
>  

If I we doing this, I would rename all the fields in file_lock_core to
have an "flc_" prefix, and add some #defines like

 #define fl_list fl_core.flc_list

so there would be no need to squash this with later patches to achieve
bisectability.

The #defines would be removed after the coccinelle scripts etc are
applied.

I would also do the "convert some internal functions" patches *before*
the bulk conversion of fl_foo to fl_code.flc_foo so that those functions
don't get patched twice.

But this is all personal preference.  If you prefer your approach,
please leave it that way.  The only clear benefit of my approach is that
you don't need to squash patches together, and that is probably not a
big deal.

Thanks,
NeilBrown

Re: [PATCH v2 12/12] docs: path-lookup: update symlink description

2021-04-18 Thread NeilBrown

On Tue, Mar 16 2021, Fox Chen wrote:

> instead of lookup_real()/vfs_create(), i_op->lookup() and
> i_op->create() will be called directly.
>
> update vfs_open() logic
>
> should_follow_link is merged into lookup_last() or open_last_lookup()
> which returns symlink name instead of an integer.
>
> Signed-off-by: Fox Chen 
> ---
>  Documentation/filesystems/path-lookup.rst | 13 ++---
>  1 file changed, 6 insertions(+), 7 deletions(-)
>
> diff --git a/Documentation/filesystems/path-lookup.rst 
> b/Documentation/filesystems/path-lookup.rst
> index eef6e9f68fba..adbc714740c2 100644
> --- a/Documentation/filesystems/path-lookup.rst
> +++ b/Documentation/filesystems/path-lookup.rst
> @@ -1202,16 +1202,15 @@ the code.
> it.  If the file was found in the dcache, then ``vfs_open()`` is used for
> this.  If not, then ``lookup_open()`` will either call ``atomic_open()`` 
> (if
> the filesystem provides it) to combine the final lookup with the open, or
> -   will perform the separate ``lookup_real()`` and ``vfs_create()`` steps
> +   will perform the separate ``i_op->lookup()`` and ``i_op->create()`` steps
> directly.  In the later case the actual "open" of this newly found or
> created file will be performed by ``vfs_open()``, just as if the name
> were found in the dcache.
>  
>  2. ``vfs_open()`` can fail with ``-EOPENSTALE`` if the cached information
> -   wasn't quite current enough.  Rather than restarting the lookup from
> -   the top with ``LOOKUP_REVAL`` set, ``lookup_open()`` is called instead,
> -   giving the filesystem a chance to resolve small inconsistencies.
> -   If that doesn't work, only then is the lookup restarted from the top.
> +   wasn't quite current enough.  If it's in RCU-walk -ECHILD will be returned
> +   otherwise will return -ESTALE.  When -ESTALE is returned, the caller may

"otherwise -ESTALE is returned".
If you don't like repeating "is returned", then maybe:
  "... -ECHILD will be returned, otherwise the result is -ESTALE".


> +   retry with LOOKUP_REVAL flag set.
>  
>  3. An open with O_CREAT **does** follow a symlink in the final component,
> unlike other creation system calls (like ``mkdir``).  So the sequence::
> @@ -1221,8 +1220,8 @@ the code.
>  
> will create a file called ``/tmp/bar``.  This is not permitted if
> ``O_EXCL`` is set but otherwise is handled for an O_CREAT open much
> -   like for a non-creating open: ``should_follow_link()`` returns ``1``, and
> -   so does ``do_last()`` so that ``trailing_symlink()`` gets called and the
> +   like for a non-creating open: ``lookup_last()`` or ``open_last_lookup()``
> +   returns a non ``Null`` value, and ``link_path_walk()`` gets called and the

"NULL", not "Null".

This those changes,
 Reviewed-by: NeilBrown 

Thanks for a lot of all these improvements!! and apologies for the delay
in the review.

Thanks,
NeilBrown


> open process continues on the symlink that was found.
>  
>  Updating the access time
> -- 
> 2.30.2


signature.asc
Description: PGP signature

Re: [PATCH v2 11/12] docs: path-lookup: update get_link() ->follow_link description

2021-04-18 Thread NeilBrown

On Tue, Mar 16 2021, Fox Chen wrote:

> get_link() is merged into pick_link(). i_op->follow_link is
> replaced with i_op->get_link(). get_link() can return ERR_PTR(0)
> which equals NULL.
>
> Signed-off-by: Fox Chen 
> ---
>  Documentation/filesystems/path-lookup.rst | 13 ++---
>  1 file changed, 6 insertions(+), 7 deletions(-)
>
> diff --git a/Documentation/filesystems/path-lookup.rst 
> b/Documentation/filesystems/path-lookup.rst
> index abd0153e2415..eef6e9f68fba 100644
> --- a/Documentation/filesystems/path-lookup.rst
> +++ b/Documentation/filesystems/path-lookup.rst
> @@ -1134,10 +1134,10 @@ Symlinks with no final component
>  
>  A pair of special-case symlinks deserve a little further explanation.
>  Both result in a new ``struct path`` (with mount and dentry) being set
> -up in the ``nameidata``, and result in ``get_link()`` returning ``NULL``.
> +up in the ``nameidata``, and result in ``pick_link()`` returning ``NULL``.
>  
>  The more obvious case is a symlink to "``/``".  All symlinks starting
> -with "``/``" are detected in ``get_link()`` which resets the ``nameidata``
> +with "``/``" are detected in ``pick_link()`` which resets the ``nameidata``
>  to point to the effective filesystem root.  If the symlink only
>  contains "``/``" then there is nothing more to do, no components at all,
>  so ``NULL`` is returned to indicate that the symlink can be released and
> @@ -1154,12 +1154,11 @@ something that looks like a symlink.  It is really a 
> reference to the
>  target file, not just the name of it.  When you ``readlink`` these
>  objects you get a name that might refer to the same file - unless it
>  has been unlinked or mounted over.  When ``walk_component()`` follows
> -one of these, the ``->follow_link()`` method in "procfs" doesn't return
> +one of these, the ``->get_link()`` method in "procfs" doesn't return
>  a string name, but instead calls ``nd_jump_link()`` which updates the
> -``nameidata`` in place to point to that target.  ``->follow_link()`` then
> -returns ``NULL``.  Again there is no final component and ``get_link()``
> -reports this by leaving the ``last_type`` field of ``nameidata`` as
> -``LAST_BIND``.
> +``nameidata`` in place to point to that target.  ``->get_link()`` then
> +returns ``0``.  Again there is no final component and ``pick_link()``

Why did you change NULL to 0?  ->get_link returns a pointer.

Without that change:
  Reviewed-by: NeilBrown 

Thanks,
NeilBrown


> +returns NULL.
>  
>  Following the symlink in the final component
>  
> -- 
> 2.30.2


signature.asc
Description: PGP signature

Re: [PATCH v2 10/12] docs: path-lookup: update WALK_GET, WALK_PUT desc

2021-04-18 Thread NeilBrown

On Tue, Mar 16 2021, Fox Chen wrote:

> WALK_GET is changed to WALK_TRAILING with a different meaning.
> Here it should be WALK_NOFOLLOW. WALK_PUT dosn't exist, we have
> WALK_MORE.
>
> WALK_PUT == !WALK_MORE
>
> And there is not should_follow_link().
>
> Related commits:
> commit 8c4efe22e7c4 ("namei: invert the meaning of WALK_FOLLOW")
> commit 1c4ff1a87e46 ("namei: invert WALK_PUT logics")
>
> Signed-off-by: Fox Chen 
> ---
>  Documentation/filesystems/path-lookup.rst | 12 +---
>  1 file changed, 5 insertions(+), 7 deletions(-)
>
> diff --git a/Documentation/filesystems/path-lookup.rst 
> b/Documentation/filesystems/path-lookup.rst
> index 0d41c61f7e4f..abd0153e2415 100644
> --- a/Documentation/filesystems/path-lookup.rst
> +++ b/Documentation/filesystems/path-lookup.rst
> @@ -1123,13 +1123,11 @@ stack in ``walk_component()`` immediately when the 
> symlink is found;
>  old symlink as it walks that last component.  So it is quite
>  convenient for ``walk_component()`` to release the old symlink and pop
>  the references just before pushing the reference information for the
> -new symlink.  It is guided in this by two flags; ``WALK_GET``, which
> -gives it permission to follow a symlink if it finds one, and
> -``WALK_PUT``, which tells it to release the current symlink after it has been
> -followed.  ``WALK_PUT`` is tested first, leading to a call to
> -``put_link()``.  ``WALK_GET`` is tested subsequently (by
> -``should_follow_link()``) leading to a call to ``pick_link()`` which sets
> -up the stack frame.
> +new symlink.  It is guided in this by two flags; ``WALK_NOFOLLOW``, which

There are 3 flags now.  You haven't documented WALK_TRAIlING.


> +suggests whether to follow a symlink if it finds one, and

I don't think it is a suggestion.

.. which forbits it from following a symlink if it finds one, and
WALK_MORE which indicates that it is yet too early to release the
current symlink.

> +``WALK_MORE``, which tells whether to release the current symlink after it 
> has
> +been followed.  ``WALK_MORE`` is tested first, leading to a call to
> +``put_link()``.

I don't think that "tested first" sentence is relevant any more.

Thanks,
NeilBrown

>  
>  Symlinks with no final component
>  
> -- 
> 2.30.2


signature.asc
Description: PGP signature

Re: [PATCH v2 09/12] docs: path-lookup: no get_link()

2021-04-18 Thread NeilBrown

On Tue, Mar 16 2021, Fox Chen wrote:

> no get_link() anymore. we have step_into() and pick_link().
>
> walk_component() will call step_into(), in turn call pick_link,
> and return symlink name.
>
> Signed-off-by: Fox Chen 
> ---
>  Documentation/filesystems/path-lookup.rst | 10 --
>  1 file changed, 4 insertions(+), 6 deletions(-)
>
> diff --git a/Documentation/filesystems/path-lookup.rst 
> b/Documentation/filesystems/path-lookup.rst
> index 8ab95dd9046e..0d41c61f7e4f 100644
> --- a/Documentation/filesystems/path-lookup.rst
> +++ b/Documentation/filesystems/path-lookup.rst
> @@ -1103,12 +1103,10 @@ doesn't need to notice.  Getting this ``name`` 
> variable on and off the
>  stack is very straightforward; pushing and popping the references is
>  a little more complex.
>  
> -When a symlink is found, ``walk_component()`` returns the value ``1``
> -(``0`` is returned for any other sort of success, and a negative number
> -is, as usual, an error indicator).  This causes ``get_link()`` to be
> -called; it then gets the link from the filesystem.  Providing that
> -operation is successful, the old path ``name`` is placed on the stack,
> -and the new value is used as the ``name`` for a while.  When the end of
> +When a symlink is found, ``walk_component()`` calls ``pick_link()``,

walk_component() calls pick_link() via step_into()
??

> +it then gets the link from the filesystem returning new path ``name``.

"which returns the link from the filesystem."

With those changes (assuming you agree with them)

 Reviewed-by: NeilBrown 

Thanks,
NeilBrown


> +Providing that operation is successful, the old path ``name`` is placed on 
> the
> +stack, and the new value is used as the ``name`` for a while.  When the end 
> of
>  the path is found (i.e. ``*name`` is ``'\0'``) the old ``name`` is restored
>  off the stack and path walking continues.
>  
> -- 
> 2.30.2


signature.asc
Description: PGP signature

Re: [PATCH v2 08/12] docs: path-lookup: update i_op->put_link and cookie description

2021-04-18 Thread NeilBrown

On Tue, Mar 16 2021, Fox Chen wrote:

> No inode->put_link operation anymore. We use delayed_call to
> deal with link destruction. Cookie has been replaced with
> struct delayed_call.
>
> Related commit: commit fceef393a538 ("switch ->get_link() to
> delayed_call, kill ->put_link()")
>
> Signed-off-by: Fox Chen 
> ---
>  Documentation/filesystems/path-lookup.rst | 30 ++-
>  1 file changed, 8 insertions(+), 22 deletions(-)
>
> diff --git a/Documentation/filesystems/path-lookup.rst 
> b/Documentation/filesystems/path-lookup.rst
> index e6b6c43ff0f6..8ab95dd9046e 100644
> --- a/Documentation/filesystems/path-lookup.rst
> +++ b/Documentation/filesystems/path-lookup.rst
> @@ -1066,34 +1066,20 @@ method. This is called both in RCU-walk and REF-walk. 
> In RCU-walk the
>  RCU-walk.  Much like the ``i_op->permission()`` method we
>  looked at previously, ``->get_link()`` would need to be careful that
>  all the data structures it references are safe to be accessed while
> -holding no counted reference, only the RCU lock.  Though getting a
> -reference with ``->follow_link()`` is not yet done in RCU-walk mode, the
> -code is ready to release the reference when that does happen.
> -
> -This need to drop the reference to a symlink adds significant
> -complexity.  It requires a reference to the inode so that the
> -``i_op->put_link()`` inode operation can be called.  In REF-walk, that
> -reference is kept implicitly through a reference to the dentry, so
> -keeping the ``struct path`` of the symlink is easiest.  For RCU-walk,
> -the pointer to the inode is kept separately.  To allow switching from
> -RCU-walk back to REF-walk in the middle of processing nested symlinks
> -we also need the seq number for the dentry so we can confirm that
> -switching back was safe.
> -
> -Finally, when providing a reference to a symlink, the filesystem also
> -provides an opaque "cookie" that must be passed to ``->put_link()`` so that 
> it
> -knows what to free.  This might be the allocated memory area, or a
> -pointer to the ``struct page`` in the page cache, or something else
> -completely.  Only the filesystem knows what it is.
> +holding no counted reference, only the RCU lock. A callback
> +``struct delayed_called`` will be passed to get_link,

I'd put a ":", not "," at the end of above line.

> +file systems can set their own put_link function and argument through
> +``set_delayed_call``. Later on, when vfs wants to put link, it will call

() after function names please, both above and below.

Also:  "when VFS want to put the link"

With these changes:
 Reviewed-by: NeilBrown 

Thanks,
NeilBrown


> +``do_delayed_call`` to invoke that callback function with the argument.
>  
>  In order for the reference to each symlink to be dropped when the walk 
> completes,
>  whether in RCU-walk or REF-walk, the symlink stack needs to contain,
>  along with the path remnants:
>  
> -- the ``struct path`` to provide a reference to the inode in REF-walk
> -- the ``struct inode *`` to provide a reference to the inode in RCU-walk
> +- the ``struct path`` to provide a reference to the previous path
> +- the ``const char *`` to provide a reference to the to previous name
>  - the ``seq`` to allow the path to be safely switched from RCU-walk to 
> REF-walk
> -- the ``cookie`` that tells ``->put_path()`` what to put.
> +- the ``struct delayed_call`` for later invocation.
>  
>  This means that each entry in the symlink stack needs to hold five
>  pointers and an integer instead of just one pointer (the path
> -- 
> 2.30.2


signature.asc
Description: PGP signature

Re: [PATCH v2 07/12] docs: path-lookup: i_op->follow_link replaced with i_op->get_link

2021-04-18 Thread NeilBrown

On Tue, Mar 16 2021, Fox Chen wrote:

> follow_link has been replaced by get_link() which can be
> called in RCU mode.
>
> see commit: commit 6b2553918d8b ("replace ->follow_link() with
> new method that could stay in RCU mode")
>
> Signed-off-by: Fox Chen 

Reviewed-By: NeilBrown 

Thanks,
NeilBrown


> ---
>  Documentation/filesystems/path-lookup.rst | 12 +---
>  1 file changed, 5 insertions(+), 7 deletions(-)
>
> diff --git a/Documentation/filesystems/path-lookup.rst 
> b/Documentation/filesystems/path-lookup.rst
> index af5c20fecfef..e6b6c43ff0f6 100644
> --- a/Documentation/filesystems/path-lookup.rst
> +++ b/Documentation/filesystems/path-lookup.rst
> @@ -1060,13 +1060,11 @@ filesystem cannot successfully get a reference in 
> RCU-walk mode, it
>  must return ``-ECHILD`` and ``unlazy_walk()`` will be called to return to
>  REF-walk mode in which the filesystem is allowed to sleep.
>  
> -The place for all this to happen is the ``i_op->follow_link()`` inode
> -method.  In the present mainline code this is never actually called in
> -RCU-walk mode as the rewrite is not quite complete.  It is likely that
> -in a future release this method will be passed an ``inode`` pointer when
> -called in RCU-walk mode so it both (1) knows to be careful, and (2) has the
> -validated pointer.  Much like the ``i_op->permission()`` method we
> -looked at previously, ``->follow_link()`` would need to be careful that
> +The place for all this to happen is the ``i_op->get_link()`` inode
> +method. This is called both in RCU-walk and REF-walk. In RCU-walk the
> +``dentry*`` argument is NULL, ``->get_link()`` can return -ECHILD to drop 
> out of
> +RCU-walk.  Much like the ``i_op->permission()`` method we
> +looked at previously, ``->get_link()`` would need to be careful that
>  all the data structures it references are safe to be accessed while
>  holding no counted reference, only the RCU lock.  Though getting a
>  reference with ``->follow_link()`` is not yet done in RCU-walk mode, the
> -- 
> 2.30.2


signature.asc
Description: PGP signature

Re: [PATCH v2 06/12] docs: path-lookup: Add macro name to symlink limit description

2021-04-18 Thread NeilBrown

On Tue, Mar 16 2021, Fox Chen wrote:

> Add macro name MAXSYMLINKS to the symlink limit description, so
> that it is consistent with path name length description above.
>
> Signed-off-by: Fox Chen 

Reviewed-by: NeilBrown 

Thanks,
NeilBrown

> ---
>  Documentation/filesystems/path-lookup.rst | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/Documentation/filesystems/path-lookup.rst 
> b/Documentation/filesystems/path-lookup.rst
> index 66697db74955..af5c20fecfef 100644
> --- a/Documentation/filesystems/path-lookup.rst
> +++ b/Documentation/filesystems/path-lookup.rst
> @@ -992,8 +992,8 @@ is 4096.  There are a number of reasons for this limit; 
> not letting the
>  kernel spend too much time on just one path is one of them.  With
>  symbolic links you can effectively generate much longer paths so some
>  sort of limit is needed for the same reason.  Linux imposes a limit of
> -at most 40 symlinks in any one path lookup.  It previously imposed a
> -further limit of eight on the maximum depth of recursion, but that was
> +at most 40 (MAXSYMLINKS) symlinks in any one path lookup.  It previously 
> imposed
> +a further limit of eight on the maximum depth of recursion, but that was
>  raised to 40 when a separate stack was implemented, so there is now
>  just the one limit.
>  
> -- 
> 2.30.2


signature.asc
Description: PGP signature

Re: [PATCH v2 05/12] docs: path-lookup: remove filename_mountpoint

2021-04-18 Thread NeilBrown

On Tue, Mar 16 2021, Fox Chen wrote:

> No filename_mountpoint any more
> see commit: commit 161aff1d93ab ("LOOKUP_MOUNTPOINT:
> fold path_mountpointat() into path_lookupat()")
>
> Without filename_mountpoint and path_mountpoint(), the
> numbers should be four & three:
>
> "These four correspond roughly to the three path_*() functions"
>
> Signed-off-by: Fox Chen 

Reviewed-by: NeilBrown 

Thanks,
NeilBrown


> ---
>  Documentation/filesystems/path-lookup.rst | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/Documentation/filesystems/path-lookup.rst 
> b/Documentation/filesystems/path-lookup.rst
> index a65cb477d524..66697db74955 100644
> --- a/Documentation/filesystems/path-lookup.rst
> +++ b/Documentation/filesystems/path-lookup.rst
> @@ -652,9 +652,9 @@ restarts from the top with REF-walk.
>  
>  This pattern of "try RCU-walk, if that fails try REF-walk" can be
>  clearly seen in functions like ``filename_lookup()``,
> -``filename_parentat()``, ``filename_mountpoint()``,
> -``do_filp_open()``, and ``do_file_open_root()``.  These five
> -correspond roughly to the four ``path_*()`` functions we met earlier,
> +``filename_parentat()``,
> +``do_filp_open()``, and ``do_file_open_root()``.  These four
> +correspond roughly to the three ``path_*()`` functions we met earlier,
>  each of which calls ``link_path_walk()``.  The ``path_*()`` functions are
>  called using different mode flags until a mode is found which works.
>  They are first called with ``LOOKUP_RCU`` set to request "RCU-walk".  If
> -- 
> 2.30.2


signature.asc
Description: PGP signature

Re: [PATCH v2 04/12] docs: path-lookup: update do_last() part

2021-04-18 Thread NeilBrown

On Tue, Mar 16 2021, Fox Chen wrote:

> traling_symlink() was merged into lookup_last, do_last().
>
> do_last() has later been split into open_last_lookups()
> and do_open().
>
> see related commit: commit c5971b8c6354 ("take post-lookup
> part of do_last() out of loop")
>
> Signed-off-by: Fox Chen 
> ---
>  Documentation/filesystems/path-lookup.rst | 35 ---
>  1 file changed, 18 insertions(+), 17 deletions(-)
>
> diff --git a/Documentation/filesystems/path-lookup.rst 
> b/Documentation/filesystems/path-lookup.rst
> index b6a301b78121..a65cb477d524 100644
> --- a/Documentation/filesystems/path-lookup.rst
> +++ b/Documentation/filesystems/path-lookup.rst
> @@ -495,11 +495,11 @@ This is important when unmounting a filesystem that is 
> inaccessible, such as
>  one provided by a dead NFS server.
>  
>  Finally ``path_openat()`` is used for the ``open()`` system call; it
> -contains, in support functions starting with "``do_last()``", all the
> +contains, in support functions starting with "``open_last_lookups()``", all 
> the
>  complexity needed to handle the different subtleties of O_CREAT (with
>  or without O_EXCL), final "``/``" characters, and trailing symbolic
>  links.  We will revisit this in the final part of this series, which
> -focuses on those symbolic links.  "``do_last()``" will sometimes, but
> +focuses on those symbolic links.  "``open_last_lookups()``" will sometimes, 
> but
>  not always, take ``i_rwsem``, depending on what it finds.
>  
>  Each of these, or the functions which call them, need to be alert to
> @@ -1199,26 +1199,27 @@ symlink.
>  This case is handled by the relevant caller of ``link_path_walk()``, such as
>  ``path_lookupat()`` using a loop that calls ``link_path_walk()``, and then
>  handles the final component.  If the final component is a symlink
> -that needs to be followed, then ``trailing_symlink()`` is called to set
> -things up properly and the loop repeats, calling ``link_path_walk()``
> -again.  This could loop as many as 40 times if the last component of
> -each symlink is another symlink.
> +that needs to be followed, then ``open_last_lookups()`` is
> +called to set things up properly and the loop repeats, calling
> +``link_path_walk()`` again.  This could loop as many as 40 times if the last
> +component of each symlink is another symlink.
>  
>  The various functions that examine the final component and possibly
> -report that it is a symlink are ``lookup_last()``, ``mountpoint_last()``
> -and ``do_last()``, each of which use the same convention as
> -``walk_component()`` of returning ``1`` if a symlink was found that needs
> -to be followed.
> +report that it is a symlink are ``lookup_last()``, ``open_last_lookups()``
> +, each of which use the same convention as
> +``walk_component()`` of returning ``char *name`` if a symlink was found that
> +needs to be followed.

This para no longer makes sense.
There is only one function that examines the final compoenent:
step_into()
It is called from open_last_lookups() directly and indirectly from
lookup_last() through walk_component().
But saying that here might be duplicating earlier text.

I think the key point in the para is that convention of returning a
'char *name' if a symlink was found.  The rest might now be redundant.

I think this needs a larger revision.

Thanks,
NeilBrown


>  
> -Of these, ``do_last()`` is the most interesting as it is used for
> -opening a file.  Part of ``do_last()`` runs with ``i_rwsem`` held and this
> -part is in a separate function: ``lookup_open()``.
> +Of these, ``open_last_lookups()`` is the most interesting as it works in 
> tandem
> +with ``do_open()`` for opening a file.  Part of ``open_last_lookups()`` runs
> +with ``i_rwsem`` held and this part is in a separate function: 
> ``lookup_open()``.
>  
> -Explaining ``do_last()`` completely is beyond the scope of this article,
> -but a few highlights should help those interested in exploring the
> -code.
> +Explaining ``open_last_lookups()`` and ``do_open()`` completely is beyond 
> the scope
> +of this article, but a few highlights should help those interested in 
> exploring
> +the code.
>  
> -1. Rather than just finding the target file, ``do_last()`` needs to open
> +1. Rather than just finding the target file, ``do_open()`` is used after
> +   ``open_last_lookup()`` to open
> it.  If the file was found in the dcache, then ``vfs_open()`` is used for
> this.  If not, then ``lookup_open()`` will either call ``atomic_open()`` 
> (if
> the filesystem provides it) to combine the final lookup with the open, or
> -- 
> 2.30.2


signature.asc
Description: PGP signature

Re: [PATCH v2 03/12] docs: path-lookup: update path_mountpoint() part

2021-04-18 Thread NeilBrown

On Tue, Mar 16 2021, Fox Chen wrote:

> path_mountpoint() doesn't exist anymore. Have been folded
> into path_lookup_at when flag is set with LOOKUP_MOUNTPOINT.
> Check commit: commit 161aff1d93abf0e ("LOOKUP_MOUNTPOINT: fold
> path_mountpointat() into path_lookupat()")
>
> Signed-off-by: Fox Chen 
> ---
>  Documentation/filesystems/path-lookup.rst | 12 +---
>  1 file changed, 5 insertions(+), 7 deletions(-)
>
> diff --git a/Documentation/filesystems/path-lookup.rst 
> b/Documentation/filesystems/path-lookup.rst
> index a29d714431a3..b6a301b78121 100644
> --- a/Documentation/filesystems/path-lookup.rst
> +++ b/Documentation/filesystems/path-lookup.rst
> @@ -472,7 +472,7 @@ Handling the final component
>  ``nd->last_type`` to refer to the final component of the path.  It does
>  not call ``walk_component()`` that last time.  Handling that final
>  component remains for the caller to sort out. Those callers are
> -``path_lookupat()``, ``path_parentat()``, ``path_mountpoint()`` and
> +``path_lookupat()``, ``path_parentat()`` and
>  ``path_openat()`` each of which handles the differing requirements of
>  different system calls.
>  
> @@ -488,12 +488,10 @@ perform their operation.
>  object is wanted such as by ``stat()`` or ``chmod()``.  It essentially just
>  calls ``walk_component()`` on the final component through a call to
>  ``lookup_last()``.  ``path_lookupat()`` returns just the final dentry.
> -
> -``path_mountpoint()`` handles the special case of unmounting which must
> -not try to revalidate the mounted filesystem.  It effectively
> -contains, through a call to ``mountpoint_last()``, an alternate
> -implementation of ``lookup_slow()`` which skips that step.  This is
> -important when unmounting a filesystem that is inaccessible, such as
> +It is worth noting that when flag ``LOOKUP_MOUNTPOINT`` is set,
> +``path_lookupat()`` will unset LOOKUP_JUMPED in nameidata so that in the 
> further

I would say "subsequent" rather than "further".

Either way:
 Reviewed-by: NeilBrown 

Thanks,
NeilBrown


> +path traversal ``d_weak_revalidate()`` won't be called.
> +This is important when unmounting a filesystem that is inaccessible, such as
>  one provided by a dead NFS server.
>  
>  Finally ``path_openat()`` is used for the ``open()`` system call; it
> -- 
> 2.30.2


signature.asc
Description: PGP signature

Re: [PATCH v2 02/12] docs: path-lookup: update path_to_nameidata() part

2021-04-18 Thread NeilBrown

On Tue, Mar 16 2021, Fox Chen wrote:

> No path_to_namei() anymore, step_into() will be called.
> Related commit: commit c99687a03a78 ("fold path_to_nameidata()
> into its only remaining caller")
>
> Signed-off-by: Fox Chen 
> ---
>  Documentation/filesystems/path-lookup.rst | 7 ---
>  1 file changed, 4 insertions(+), 3 deletions(-)
>
> diff --git a/Documentation/filesystems/path-lookup.rst 
> b/Documentation/filesystems/path-lookup.rst
> index d07766375e13..a29d714431a3 100644
> --- a/Documentation/filesystems/path-lookup.rst
> +++ b/Documentation/filesystems/path-lookup.rst
> @@ -455,9 +455,10 @@ directly from walk_component() or from handle_dots().  
> It calls
>  ``handle_mount()``, to check and handle mount points, in which a new
>  ``struct path`` containing a counted reference to the new dentry and a
>  reference to the new ``vfsmount`` which is only counted if it is
> -different from the previous ``vfsmount``.  It then calls
> -``path_to_nameidata()`` to install the new ``struct path`` in the
> -``struct nameidata`` and drop the unneeded references.
> +different from the previous ``vfsmount`` will be created. Then if there is

That "will be created" messes up the sentence.
It would probably work to put it earlier:

  It calls handle_mounts() to check and handle mount points, in which a
  new struct path is created containing a counted reference to the new
  dentry and a reference to the new vfsmount, which is only counted if
  it is different from the previous vfsmount.

(I'm not sure about the comma I put in before the 'which' - Jon often
removes my commas, and sometimes changes 'which' to 'that'...)

> +symbolic link, ``step_into()`` calls ``pick_link()`` to deal with it, 
> otherwise

"a symbolic link"

> +installs the new ``struct path`` in the ``struct nameidata`` and drop the

"it installs".  Any maybe "into the".  And "drops".

> +unneeded references.

So sentence is:
   Then if there is a symbolic link, step_into() calls pick_link() to
   deal with it, otherwise it installs the new struct path into the
   struct nameidata, and drops the unneeded references.

With those changes,
  Reviewed-by: NeilBrown 

Thanks,
NeilBrown

>  
>  This "hand-over-hand" sequencing of getting a reference to the new
>  dentry before dropping the reference to the previous dentry may
> -- 
> 2.30.2


signature.asc
Description: PGP signature

Re: [PATCH v2 01/12] docs: path-lookup: update follow_managed() part

2021-04-18 Thread NeilBrown

On Tue, Mar 16 2021, Fox Chen wrote:

> No follow_managed() anymore, handle_mounts(),
> traverse_mounts(), will do the job.
> see commit 9deed3ebca24 ("new helper: traverse_mounts()")
>
> Signed-off-by: Fox Chen 
> ---
>  Documentation/filesystems/path-lookup.rst | 10 +-
>  1 file changed, 5 insertions(+), 5 deletions(-)
>
> diff --git a/Documentation/filesystems/path-lookup.rst 
> b/Documentation/filesystems/path-lookup.rst
> index c482e1619e77..d07766375e13 100644
> --- a/Documentation/filesystems/path-lookup.rst
> +++ b/Documentation/filesystems/path-lookup.rst
> @@ -448,10 +448,11 @@ described.  If it finds a ``LAST_NORM`` component it 
> first calls
>  filesystem to revalidate the result if it is that sort of filesystem.
>  If that doesn't get a good result, it calls "``lookup_slow()``" which
>  takes ``i_rwsem``, rechecks the cache, and then asks the filesystem
> -to find a definitive answer.  Each of these will call
> -``follow_managed()`` (as described below) to handle any mount points.
> +to find a definitive answer.
>  
> -In the absence of symbolic links, ``walk_component()`` creates a new
> +As the last step of ``walk_component()``, ``step_into()`` will be called 
> either
> +directly from walk_component() or from handle_dots().  It calls
> +``handle_mount()``, to check and handle mount points, in which a new

Typo - it is "handle_mounts", not "handle_mount"

With that fixed:
  Reviewed-by: NeilBrown 

Thanks,
NeilBrown


>  ``struct path`` containing a counted reference to the new dentry and a
>  reference to the new ``vfsmount`` which is only counted if it is
>  different from the previous ``vfsmount``.  It then calls
> @@ -535,8 +536,7 @@ covered in greater detail in autofs.txt in the Linux 
> documentation
>  tree, but a few notes specifically related to path lookup are in order
>  here.
>  
> -The Linux VFS has a concept of "managed" dentries which is reflected
> -in function names such as "``follow_managed()``".  There are three
> +The Linux VFS has a concept of "managed" dentries.  There are three
>  potentially interesting things about these dentries corresponding
>  to three different flags that might be set in ``dentry->d_flags``:
>  
> -- 
> 2.30.2


signature.asc
Description: PGP signature

Re: [PATCH v2 00/12] docs: path-lookup: Update pathlookup docs

2021-04-13 Thread NeilBrown

On Tue, Apr 13 2021, Jonathan Corbet wrote:

> Fox Chen  writes:
>
>> The Path lookup is a very complex subject in VFS. The path-lookup
>> document provides a very detailed guidance to help people understand
>> how path lookup works in the kernel. This document was originally
>> written based on three lwn articles five years ago. As times goes by,
>> some of the content is outdated. This patchset is intended to update
>> the document to make it more relevant to current codebase.
>
> Neil, have you had a chance to take a look at these?  I'm reluctant to
> apply them without your ack...

No I haven't, I'm sorry.  And I'm on leave at the moment so my attention
is mostly elsewhere.  However I'll try to make time to have a look
sometime in the next week or so.

Thanks for the prompt.

NeilBrown


signature.asc
Description: PGP signature

Re: [PATCH] drivers/block: Goodbye BLK_DEV_UMEM

2021-03-23 Thread NeilBrown

On Tue, Mar 23 2021, Davidlohr Bueso wrote:

> I'm also Ccing Neil, who is one of the authors.

Thanks!
I have no objection to the removal.  The driver served its purpose at
the time, but technology has moved on.
Add
  Acked-by: NeilBrown 
if you like (not necessary).

Thanks,
NeilBrown

>
> On Tue, 23 Mar 2021, Bueso wrote:
>
>>This removes the driver on the premise that it has been unused for a long
>>time. This is a better approach compared to changing untestable code nobody
>>cares about in the first place. Similarly, the umem.com website now shows a
>>mere Godaddy parking add.
>>
>>Suggested-by: Christoph Hellwig 
>>Signed-off-by: Davidlohr Bueso 
>>---
>> arch/mips/configs/malta_defconfig   |1 -
>> arch/mips/configs/malta_kvm_defconfig   |1 -
>> arch/mips/configs/maltaup_xpa_defconfig |1 -
>> drivers/block/Kconfig   |   17 -
>> drivers/block/Makefile  |1 -
>> drivers/block/umem.c| 1130 ---
>> drivers/block/umem.h|  132 ---
>> 7 files changed, 1283 deletions(-)
>> delete mode 100644 drivers/block/umem.c
>> delete mode 100644 drivers/block/umem.h


signature.asc
Description: PGP signature

Re: [PATCH v2] block: fix trace completion for chained bio

2021-03-22 Thread NeilBrown

On Wed, Mar 03 2021, edwardh wrote:

> From: Edward Hsieh 
>
> For chained bio, trace_block_bio_complete in bio_endio is currently called
> only by the parent bio once upon all chained bio completed.
> However, the sector and size for the parent bio are modified in bio_split.
> Therefore, the size and sector of the complete events might not match the
> queue events in blktrace.
>
> The original fix of bio completion trace  ("block: trace
> completion of all bios.") wants multiple complete events to correspond
> to one queue event but missed this.
>
> md/raid5 read with bio cross chunks can reproduce this issue.
>
> To fix, move trace completion into the loop for every chained bio to call.

Thanks.  I think this is correct as far as tracing goes.
However the code still looks a bit odd.

The comment for the handling of bio_chain_endio suggests that the *only*
purpose for that is to avoid deep recursion.  That suggests it should be
at the end of the function.
As it is blk_throtl_bio_endio() and bio_unint() are only called on the
last bio in a chain.
That seems wrong.

I'd be more comfortable if the patch moved the bio_chain_endio()
handling to the end, after all of that.
So the function would end.

if (bio->bi_end_io == bio_chain_endio) {
   bio = __bio_chain_endio(bio);
   goto again;
} else if (bio->bi_end_io)
   bio->bi_end_io(bio);

Jens:  can you see any reason why that functions must only be called on
the last bio in the chain?

Thanks,
NeilBrown


>
> Fixes: fbbaf700e7b1 ("block: trace completion of all bios.")
> Reviewed-by: Wade Liang 
> Reviewed-by: BingJing Chang 
> Signed-off-by: Edward Hsieh 
> ---
>  block/bio.c | 13 ++---
>  1 file changed, 6 insertions(+), 7 deletions(-)
>
> diff --git a/block/bio.c b/block/bio.c
> index a1c4d29..2ff72cb 100644
> --- a/block/bio.c
> +++ b/block/bio.c
> @@ -1397,8 +1397,7 @@ static inline bool bio_remaining_done(struct bio *bio)
>   *
>   *   bio_endio() can be called several times on a bio that has been chained
>   *   using bio_chain().  The ->bi_end_io() function will only be called the
> - *   last time.  At this point the BLK_TA_COMPLETE tracing event will be
> - *   generated if BIO_TRACE_COMPLETION is set.
> + *   last time.
>   **/
>  void bio_endio(struct bio *bio)
>  {
> @@ -1411,6 +1410,11 @@ void bio_endio(struct bio *bio)
>   if (bio->bi_bdev)
>   rq_qos_done_bio(bio->bi_bdev->bd_disk->queue, bio);
>  
> + if (bio->bi_bdev && bio_flagged(bio, BIO_TRACE_COMPLETION)) {
> + trace_block_bio_complete(bio->bi_bdev->bd_disk->queue, bio);
> + bio_clear_flag(bio, BIO_TRACE_COMPLETION);
> + }
> +
>   /*
>* Need to have a real endio function for chained bios, otherwise
>* various corner cases will break (like stacking block devices that
> @@ -1424,11 +1428,6 @@ void bio_endio(struct bio *bio)
>   goto again;
>   }
>  
> - if (bio->bi_bdev && bio_flagged(bio, BIO_TRACE_COMPLETION)) {
> - trace_block_bio_complete(bio->bi_bdev->bd_disk->queue, bio);
> - bio_clear_flag(bio, BIO_TRACE_COMPLETION);
> - }
> -
>   blk_throtl_bio_endio(bio);
>   /* release cgroup info */
>   bio_uninit(bio);
> -- 
> 2.7.4


signature.asc
Description: PGP signature

Re: [PATCH 0/3] Fix some seq_file users that were recently broken

2021-02-07 Thread NeilBrown

On Fri, Feb 05 2021, Andrew Morton wrote:

> On Fri, 05 Feb 2021 11:36:30 +1100 NeilBrown  wrote:
>
>> A recent change to seq_file broke some users which were using seq_file
>> in a non-"standard" way ...  though the "standard" isn't documented, so
>> they can be excused.  The result is a possible leak - of memory in one
>> case, of references to a 'transport' in the other.
>> 
>> These three patches:
>>  1/ document and explain the problem
>>  2/ fix the problem user in x86
>>  3/ fix the problem user in net/sctp
>> 
>
> 1f4aace60b0e ("fs/seq_file.c: simplify seq_file iteration code and
> interface") was August 2018, so I don't think "recent" applies here?

I must be getting old :-(

>
> I didn't look closely, but it appears that the sctp procfs file is
> world-readable.  So we gave unprivileged userspace the ability to leak
> kernel memory?

Not quite that bad.  The x86 problem allows arbitrary memory to be
leaked, but that is in debugfs (as I'm sure you saw) so is root-only.
The sctp one only allows an sctp_transport to be pinned.  That is not
good and would, e.g., prevent the module from being unloaded.  But
unlike a simple memory leak it won't affect anything outside of sctp.

>
> So I'm thinking that we aim for 5.12-rc1 on all three patches with a 
> cc:stable?

Thanks for looking after these!

NeilBrown

signature.asc
Description: PGP signature

[PATCH 0/3] Fix some seq_file users that were recently broken

2021-02-04 Thread NeilBrown

A recent change to seq_file broke some users which were using seq_file
in a non-"standard" way ...  though the "standard" isn't documented, so
they can be excused.  The result is a possible leak - of memory in one
case, of references to a 'transport' in the other.

These three patches:
 1/ document and explain the problem
 2/ fix the problem user in x86
 3/ fix the problem user in net/sctp

I suspect the patches should each go through the relevant subsystems,
but I'm including akpm as the original went through him.

Thanks,
NeilBrown

---

NeilBrown (3):
  seq_file: document how per-entry resources are managed.
  x86: fix seq_file iteration for pat/memtype.c
  net: fix iteration for sctp transport seq_files

 Documentation/filesystems/seq_file.rst |  6 ++
 arch/x86/mm/pat/memtype.c  |  4 ++--
 net/sctp/proc.c| 16 
 3 files changed, 20 insertions(+), 6 deletions(-)

--
Signature

[PATCH 3/3] net: fix iteration for sctp transport seq_files

2021-02-04 Thread NeilBrown

The sctp transport seq_file iterators take a reference to the transport
in the ->start and ->next functions and releases the reference in the
->show function.  The preferred handling for such resources is to
release them in the subsequent ->next or ->stop function call.

Since Commit 1f4aace60b0e ("fs/seq_file.c: simplify seq_file iteration
code and interface") there is no guarantee that ->show will be called
after ->next, so this function can now leak references.

So move the sctp_transport_put() call to ->next and ->stop.

Fixes: 1f4aace60b0e ("fs/seq_file.c: simplify seq_file iteration code and 
interface")
Reported-by: Xin Long 
Signed-off-by: NeilBrown 
---
 net/sctp/proc.c |   16 
 1 file changed, 12 insertions(+), 4 deletions(-)

diff --git a/net/sctp/proc.c b/net/sctp/proc.c
index f7da88ae20a5..982a87b3e11f 100644
--- a/net/sctp/proc.c
+++ b/net/sctp/proc.c
@@ -215,6 +215,12 @@ static void sctp_transport_seq_stop(struct seq_file *seq, 
void *v)
 {
struct sctp_ht_iter *iter = seq->private;
 
+   if (v && v != SEQ_START_TOKEN) {
+   struct sctp_transport *transport = v;
+
+   sctp_transport_put(transport);
+   }
+
sctp_transport_walk_stop(>hti);
 }
 
@@ -222,6 +228,12 @@ static void *sctp_transport_seq_next(struct seq_file *seq, 
void *v, loff_t *pos)
 {
struct sctp_ht_iter *iter = seq->private;
 
+   if (v && v != SEQ_START_TOKEN) {
+   struct sctp_transport *transport = v;
+
+   sctp_transport_put(transport);
+   }
+
++*pos;
 
return sctp_transport_get_next(seq_file_net(seq), >hti);
@@ -277,8 +289,6 @@ static int sctp_assocs_seq_show(struct seq_file *seq, void 
*v)
sk->sk_rcvbuf);
seq_printf(seq, "\n");
 
-   sctp_transport_put(transport);
-
return 0;
 }
 
@@ -354,8 +364,6 @@ static int sctp_remaddr_seq_show(struct seq_file *seq, void 
*v)
seq_printf(seq, "\n");
}
 
-   sctp_transport_put(transport);
-
return 0;
 }

[PATCH 2/3] x86: fix seq_file iteration for pat/memtype.c

2021-02-04 Thread NeilBrown

The memtype seq_file iterator allocates a buffer in the ->start and
->next functions and frees it in the ->show function.
The preferred handling for such resources is to free them in the
subsequent ->next or ->stop function call.

Since Commit 1f4aace60b0e ("fs/seq_file.c: simplify seq_file iteration
code and interface") there is no guarantee that ->show will be called
after ->next, so this function can now leak memory.

So move the freeing of the buffer to ->next and ->stop.

Fixes: 1f4aace60b0e ("fs/seq_file.c: simplify seq_file iteration code and 
interface")
Cc: Xin Long 
Signed-off-by: NeilBrown 
---
 arch/x86/mm/pat/memtype.c |4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/mm/pat/memtype.c b/arch/x86/mm/pat/memtype.c
index 8f665c352bf0..ca311aaa67b8 100644
--- a/arch/x86/mm/pat/memtype.c
+++ b/arch/x86/mm/pat/memtype.c
@@ -1164,12 +1164,14 @@ static void *memtype_seq_start(struct seq_file *seq, 
loff_t *pos)
 
 static void *memtype_seq_next(struct seq_file *seq, void *v, loff_t *pos)
 {
+   kfree(v);
++*pos;
return memtype_get_idx(*pos);
 }
 
 static void memtype_seq_stop(struct seq_file *seq, void *v)
 {
+   kfree(v);
 }
 
 static int memtype_seq_show(struct seq_file *seq, void *v)
@@ -1181,8 +1183,6 @@ static int memtype_seq_show(struct seq_file *seq, void *v)
entry_print->end,
cattr_name(entry_print->type));
 
-   kfree(entry_print);
-
return 0;
 }

[PATCH 1/3] seq_file: document how per-entry resources are managed.

2021-02-04 Thread NeilBrown

Users of seq_file will sometimes find it convenient to take a resource,
such as a lock or memory allocation, in the ->start or ->next
operations.
These are per-entry resources, distinct from per-session resources which
are taken in ->start and released in ->stop.

The preferred management of these is release the resource on the
subsequent call to ->next or ->stop.

However prior to Commit 1f4aace60b0e ("fs/seq_file.c: simplify seq_file
iteration code and interface") it happened that ->show would always be
called after ->start or ->next, and a few users chose to release the
resource in ->show.

This is no longer reliable.  Since the mentioned commit, ->next will
always come after a successful ->show (to ensure m->index is updated
correctly), so the original ordering cannot be maintained.

This patch updates the documentation to clearly state the required
behaviour.  Other patches will fix the few problematic users.

Fixes: 1f4aace60b0e ("fs/seq_file.c: simplify seq_file iteration code and 
interface")
Cc: Xin Long 
Signed-off-by: NeilBrown 
---
 Documentation/filesystems/seq_file.rst |6 ++
 1 file changed, 6 insertions(+)

diff --git a/Documentation/filesystems/seq_file.rst 
b/Documentation/filesystems/seq_file.rst
index 56856481dc8d..0e40e1532e7f 100644
--- a/Documentation/filesystems/seq_file.rst
+++ b/Documentation/filesystems/seq_file.rst
@@ -217,6 +217,12 @@ between the calls to start() and stop(), so holding a lock 
during that time
 is a reasonable thing to do. The seq_file code will also avoid taking any
 other locks while the iterator is active.
 
+The iterater value returned by start() or next() is guaranteed to be
+passed to a subsequenct next() or stop() call.  This allows resources
+such as locks that were taken to be reliably released.  There is *no*
+guarantee that the iterator will be passed to show(), though in practice
+it often will be.
+
 
 Formatted output

Re: [PATCH] list: add more extensive double add check

2021-02-01 Thread NeilBrown

On Mon, Feb 01 2021, Andy Shevchenko wrote:

> On Mon, Feb 01, 2021 at 02:52:51PM +0100, Christian König wrote:
>> Adding the same element to a linked list multiple times
>> seems to be a rather common programming mistake. To debug
>> those I've more than once written some code to check a
>> linked list for duplicates.
>> 
>> Since re-inventing the wheel over and over again is a bad
>> idea this patch tries to add some common code which allows
>> to check linked lists for duplicates while adding new
>> elements.
>> 
>> When list debugging is enabled we currently already check
>> the previous and next element if they are identical to the
>> new one. This patch now adds a configuration option to
>> check N elements before and after the desired position.
>> 
>> By default we still only test one item since testing more
>> means quite a large CPU overhead. This can be overwritten
>> on a per C file bases by defining DEBUG_LIST_DOUBLE_ADD
>> before including list.h.
>
> I'm not sure it is a good idea. Currently the implementation is *generic*.
> You are customizing it w/o letting caller know.
>
> Create a derivative implementation and name it exlist (exclusive list) and use
> whenever it makes sense.
>
> And I think if you are still pushing to modify generic one the default must be
> 0 in order not altering current behaviour.

I don't understand your complaint.
The extra checks are also completely *generic*.  It can never make sense
to add sometime to a list if it is already on the list.  All lists are
exclusive lists.
The code ALREADY tests if the inserted object is already present either
side of the insert side of the insertion point.  This patch just extends
it somewhat.

I myself have never had, or heard of, a bug due to double insertion so
I'm no strongly in favour of this patch for that reason.
But I *am* in favour of making the platform more resilient in general,
and if others have experienced this sort of bug, then I'm in favour of
make that easier to detect in future.

NeilBrown

>
>> A new kunit test is also added to the existing list tests
>> which intentionally triggers the debug functionality.
>
> -- 
> With Best Regards,
> Andy Shevchenko

signature.asc
Description: PGP signature

Re: [PATCH 00/12] docs: path-lookup: Update pathlookup docs

2021-01-27 Thread NeilBrown

On Tue, Jan 26 2021, Fox Chen wrote:

> The Path lookup is a very complex subject in VFS. The path-lookup
> document provides a very detailed guidance to help people understand
> how path lookup works in the kernel.This document was originally
> written based on three lwn articles five years ago. As times goes by,
> some of the content was outdated. This patchset is intended to update
> the document to make it more relevant to current codebase.
>
>
> Fox Chen (12):
>   docs: path-lookup: update follow_managed() part
>   docs: path-lookup: update path_to_nameidata() parth
>   docs: path-lookup: update path_mountpoint() part
>   docs: path-lookup: update do_last() part
>   docs: path-lookup: remove filename_mountpoint
>   docs: path-lookup: Add macro name to symlink limit description
>   docs: path-lookup: i_op->follow_link replaced with i_op->get_link
>   docs: path-lookup: update i_op->put_link and cookie description
>   docs: path-lookup: no get_link()
>   docs: path-lookup: update WALK_GET, WALK_PUT desc
>   docs: path-lookup: update get_link() ->follow_link description
>   docs: path-lookup: update symlink description
>

Thanks for doing this.  I've responded individually to several of the
patches.  As you can see from my comments, there is often more to it
than just changing function names.  In some places you have capture the
more general nature of the change fairly well.  In other places the
result is incoherent or confusion.
Making small updates to this sort of documentation is not easy.  You
need to step have and see the "big picture", to overall story-arc.
Sometimes you can fit changes into that arc, sometimes you might need to
restructure or re-tell the story.

This is part of why I haven't put much effort into the document -
re-telling a story can be a lot of work.

NeilBrown

signature.asc
Description: PGP signature

Re: [PATCH 07/12] docs: path-lookup: i_op->follow_link replaced with i_op->get_link

2021-01-27 Thread NeilBrown

On Tue, Jan 26 2021, Fox Chen wrote:

> follow_link has been replaced by get_link() which can be
> called in RCU mode.
>
> see commit: 6b2553918d8b4e6de9853fd6315bec7271a2e592
>
> Signed-off-by: Fox Chen 
> ---
>  Documentation/filesystems/path-lookup.rst | 12 +---
>  1 file changed, 5 insertions(+), 7 deletions(-)
>
> diff --git a/Documentation/filesystems/path-lookup.rst 
> b/Documentation/filesystems/path-lookup.rst
> index 25d2a5a59f45..0a362849b26f 100644
> --- a/Documentation/filesystems/path-lookup.rst
> +++ b/Documentation/filesystems/path-lookup.rst
> @@ -1062,13 +1062,11 @@ filesystem cannot successfully get a reference in 
> RCU-walk mode, it
>  must return ``-ECHILD`` and ``unlazy_walk()`` will be called to return to
>  REF-walk mode in which the filesystem is allowed to sleep.
>  
> -The place for all this to happen is the ``i_op->follow_link()`` inode
> -method.  In the present mainline code this is never actually called in
> -RCU-walk mode as the rewrite is not quite complete.  It is likely that
> -in a future release this method will be passed an ``inode`` pointer when
> -called in RCU-walk mode so it both (1) knows to be careful, and (2) has the
> -validated pointer.  Much like the ``i_op->permission()`` method we
> -looked at previously, ``->follow_link()`` would need to be careful that
> +The place for all this to happen is the ``i_op->get_link()`` inode
> +method. This is called both in RCU-walk and REF-walk. In RCU-walk the
> +``dentry*`` argument is NULL, ``->get_link()`` can return -ECHILD to drop
> +RCU-walk.  Much like the ``i_op->permission()`` method we

The phrase "drop RCU-walk" isn't consistent with the rest of the text.
It talks about "dropping down to REF-walk", so you could write "dropping
out of RCU-walk", but not just "dropping RCU-walk".

NeilBrown


> +looked at previously, ``->get_link()`` would need to be careful that
>  all the data structures it references are safe to be accessed while
>  holding no counted reference, only the RCU lock.  Though getting a
>  reference with ``->follow_link()`` is not yet done in RCU-walk mode, the
> -- 
> 2.30.0


signature.asc
Description: PGP signature

Re: [PATCH 04/12] docs: path-lookup: update do_last() part

2021-01-27 Thread NeilBrown

On Tue, Jan 26 2021, Fox Chen wrote:

> traling_symlink() was merged into lookup_last, do_last().
>
> do_last() has later been split into open_last_lookups()
> and do_open().
>
> see related commit: c5971b8c6354a95c9ee7eb722928af5000bac247
>
> Signed-off-by: Fox Chen 
> ---
>  Documentation/filesystems/path-lookup.rst | 34 +++
>  1 file changed, 17 insertions(+), 17 deletions(-)
>
> diff --git a/Documentation/filesystems/path-lookup.rst 
> b/Documentation/filesystems/path-lookup.rst
> index 4e77c8520fa9..1f05b1417a55 100644
> --- a/Documentation/filesystems/path-lookup.rst
> +++ b/Documentation/filesystems/path-lookup.rst
> @@ -496,11 +496,11 @@ one provided by a dead NFS server.  In the current 
> kernel, path_mountpoint
>  has been merged into ``path_lookup_at()`` with a new flag LOOKUP_MOUNTPOINT.
>  
>  Finally ``path_openat()`` is used for the ``open()`` system call; it
> -contains, in support functions starting with "``do_last()``", all the
> +contains, in support functions starting with "``open_last_lookups()``", all 
> the
>  complexity needed to handle the different subtleties of O_CREAT (with
>  or without O_EXCL), final "``/``" characters, and trailing symbolic
>  links.  We will revisit this in the final part of this series, which
> -focuses on those symbolic links.  "``do_last()``" will sometimes, but
> +focuses on those symbolic links.  "``open_last_lookups()``" will sometimes, 
> but
>  not always, take ``i_rwsem``, depending on what it finds.
>  
>  Each of these, or the functions which call them, need to be alert to
> @@ -1201,26 +1201,26 @@ symlink.
>  This case is handled by the relevant caller of ``link_path_walk()``, such as
>  ``path_lookupat()`` using a loop that calls ``link_path_walk()``, and then
>  handles the final component.  If the final component is a symlink
> -that needs to be followed, then ``trailing_symlink()`` is called to set
> -things up properly and the loop repeats, calling ``link_path_walk()``
> -again.  This could loop as many as 40 times if the last component of
> -each symlink is another symlink.
> +that needs to be followed, then ``open_last_lookups()`` and ``do_open()`` is
> +called to set things up properly and the loop repeats, calling

This implies that do_open() is inside the loop (in path_openat()).  But
it isn't, it comes after the loop.

(I haven't closely examined this rest of this patch).

NeilBrown


> +``link_path_walk()`` again.  This could loop as many as 40 times if the last
> +component of each symlink is another symlink.
>  
>  The various functions that examine the final component and possibly
> -report that it is a symlink are ``lookup_last()``, ``mountpoint_last()``
> -and ``do_last()``, each of which use the same convention as
> -``walk_component()`` of returning ``1`` if a symlink was found that needs
> -to be followed.
> +report that it is a symlink are ``lookup_last()``, ``open_last_lookups()``
> +, each of which use the same convention as
> +``walk_component()`` of returning ``char *name`` if a symlink was found that
> +needs to be followed.
>  
> -Of these, ``do_last()`` is the most interesting as it is used for
> -opening a file.  Part of ``do_last()`` runs with ``i_rwsem`` held and this
> -part is in a separate function: ``lookup_open()``.
> +Of these, ``open_last_lookups()``, ``do_open()`` is the most interesting as 
> it is
> +used for opening a file.  Part of ``open_last_lookups()`` runs with 
> ``i_rwsem``
> +held and this part is in a separate function: ``lookup_open()``.
>  
> -Explaining ``do_last()`` completely is beyond the scope of this article,
> -but a few highlights should help those interested in exploring the
> -code.
> +Explaining ``open_last_lookups()``, ``do_open()`` completely is beyond the 
> scope
> +of this article, but a few highlights should help those interested in 
> exploring
> +the code.
>  
> -1. Rather than just finding the target file, ``do_last()`` needs to open
> +1. Rather than just finding the target file, ``do_open()`` needs to open
> it.  If the file was found in the dcache, then ``vfs_open()`` is used for
> this.  If not, then ``lookup_open()`` will either call ``atomic_open()`` 
> (if
> the filesystem provides it) to combine the final lookup with the open, or
> -- 
> 2.30.0


signature.asc
Description: PGP signature

Re: [PATCH 08/12] docs: path-lookup: update i_op->put_link and cookie description

2021-01-27 Thread NeilBrown

On Tue, Jan 26 2021, Fox Chen wrote:

> No inode->put_link operation anymore. We use delayed_call to
> deal with link destruction. Cookie has been replaced with
> struct delayed_call.
>
> Related commit: fceef393a538134f03b778c5d2519e670269342f
>
> Signed-off-by: Fox Chen 
> ---
>  Documentation/filesystems/path-lookup.rst | 31 +++
>  1 file changed, 9 insertions(+), 22 deletions(-)
>
> diff --git a/Documentation/filesystems/path-lookup.rst 
> b/Documentation/filesystems/path-lookup.rst
> index 0a362849b26f..8572300b5405 100644
> --- a/Documentation/filesystems/path-lookup.rst
> +++ b/Documentation/filesystems/path-lookup.rst
> @@ -1068,34 +1068,21 @@ method. This is called both in RCU-walk and REF-walk. 
> In RCU-walk the
>  RCU-walk.  Much like the ``i_op->permission()`` method we
>  looked at previously, ``->get_link()`` would need to be careful that
>  all the data structures it references are safe to be accessed while
> -holding no counted reference, only the RCU lock.  Though getting a
> -reference with ``->follow_link()`` is not yet done in RCU-walk mode, the
> -code is ready to release the reference when that does happen.
> -
> -This need to drop the reference to a symlink adds significant
> -complexity.  It requires a reference to the inode so that the
> -``i_op->put_link()`` inode operation can be called.  In REF-walk, that
> -reference is kept implicitly through a reference to the dentry, so
> -keeping the ``struct path`` of the symlink is easiest.  For RCU-walk,
> -the pointer to the inode is kept separately.  To allow switching from
> -RCU-walk back to REF-walk in the middle of processing nested symlinks
> -we also need the seq number for the dentry so we can confirm that
> -switching back was safe.
> -
> -Finally, when providing a reference to a symlink, the filesystem also
> -provides an opaque "cookie" that must be passed to ``->put_link()`` so that 
> it
> -knows what to free.  This might be the allocated memory area, or a
> -pointer to the ``struct page`` in the page cache, or something else
> -completely.  Only the filesystem knows what it is.
> +holding no counted reference, only the RCU lock.
> +
> +Finally, a callback ``struct delayed_called`` will be passed to get_link,
> +file systems can set their own put_link function and argument through
> +``set_delayed_call``. Latter on, when vfs wants to put link, it will call 

"Later", not "Latter".

Also, I'm not sure that the "Finally" at the start of the sentence makes
sense any more.  I think it was meant to introduce the final part of the
"significant complexity", but now that significant complexity is gone.
At least, I assume it is gone.  I haven't checked to code to see if
maybe it has just been moved.

NeilBrown


> +``do_delayed_call`` to invoke that callback function with the argument.
>  
>  In order for the reference to each symlink to be dropped when the walk 
> completes,
>  whether in RCU-walk or REF-walk, the symlink stack needs to contain,
>  along with the path remnants:
>  
> -- the ``struct path`` to provide a reference to the inode in REF-walk
> -- the ``struct inode *`` to provide a reference to the inode in RCU-walk
> +- the ``struct path`` to provide a reference to the previous path
> +- the ``const char *`` to provide a reference to the to previous name
>  - the ``seq`` to allow the path to be safely switched from RCU-walk to 
> REF-walk
> -- the ``cookie`` that tells ``->put_path()`` what to put.
> +- the ``struct delayed_call`` for later invocation.
>  
>  This means that each entry in the symlink stack needs to hold five
>  pointers and an integer instead of just one pointer (the path
> -- 
> 2.30.0


signature.asc
Description: PGP signature

Re: [PATCH 03/12] docs: path-lookup: update path_mountpoint() part

2021-01-27 Thread NeilBrown

On Tue, Jan 26 2021, Fox Chen wrote:

> path_mountpoint() doesn't exist anymore. Have been folded
> into path_lookup_at when flag is set with LOOKUP_MOUNTPOINT.
> check out commit:161aff1d93abf0e5b5e9dbca88928998c155f677
>
> Signed-off-by: Fox Chen 
> ---
>  Documentation/filesystems/path-lookup.rst | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/Documentation/filesystems/path-lookup.rst 
> b/Documentation/filesystems/path-lookup.rst
> index 2ad96e1e3c49..4e77c8520fa9 100644
> --- a/Documentation/filesystems/path-lookup.rst
> +++ b/Documentation/filesystems/path-lookup.rst
> @@ -492,7 +492,8 @@ not try to revalidate the mounted filesystem.  It 
> effectively
>  contains, through a call to ``mountpoint_last()``, an alternate
>  implementation of ``lookup_slow()`` which skips that step.  This is
>  important when unmounting a filesystem that is inaccessible, such as
> -one provided by a dead NFS server.
> +one provided by a dead NFS server.  In the current kernel, path_mountpoint
> +has been merged into ``path_lookup_at()`` with a new flag LOOKUP_MOUNTPOINT.

You've taken a very different approach here.  Rather than re-telling the
story you have added a note (like a foot-note) that the details have
changed, withouy trying to re-weave the story.  The is easier to get
right, but doesn't produce as nice a result.

Maybe this is a good approach, it depends on how much effort you are
willing/able to spend on the task.

IF you do stick with this approach:  it is "path_lookupat", not
"path_lookup_at".

NeilBrown


>  
>  Finally ``path_openat()`` is used for the ``open()`` system call; it
>  contains, in support functions starting with "``do_last()``", all the
> -- 
> 2.30.0


signature.asc
Description: PGP signature

Re: [PATCH 02/12] docs: path-lookup: update path_to_nameidata() parth

2021-01-27 Thread NeilBrown

On Tue, Jan 26 2021, Fox Chen wrote:

> No path_to_namei() anymore, step_into() will be called.
> Related commit: c99687a03a78775f77d57fe9b07af4c8ec3dd03c
>
> Signed-off-by: Fox Chen 
> ---
>  Documentation/filesystems/path-lookup.rst | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/Documentation/filesystems/path-lookup.rst 
> b/Documentation/filesystems/path-lookup.rst
> index e778db767120..2ad96e1e3c49 100644
> --- a/Documentation/filesystems/path-lookup.rst
> +++ b/Documentation/filesystems/path-lookup.rst
> @@ -455,7 +455,7 @@ In the absence of symbolic links, ``walk_component()`` 
> creates a new
>  ``struct path`` containing a counted reference to the new dentry and a
>  reference to the new ``vfsmount`` which is only counted if it is
>  different from the previous ``vfsmount``.  It then calls
> -``path_to_nameidata()`` to install the new ``struct path`` in the
> +``step_into()`` to install the new ``struct path`` in the
>  ``struct nameidata`` and drop the unneeded references.

The logic describe here is now embodied by the code in step_into(), so
the change doesn't make the description any more correct.

Possibly you need to change the hero of the story from walk_component()
to step_into(), but that is just a guess.

NeilBrown


>  
>  This "hand-over-hand" sequencing of getting a reference to the new
> -- 
> 2.30.0


signature.asc
Description: PGP signature

Re: [PATCH 01/12] docs: path-lookup: update follow_managed() part

2021-01-27 Thread NeilBrown

On Tue, Jan 26 2021, Fox Chen wrote:

> No follow_managed() anymore, handle_mounts(),
> traverse_mounts(), will do the job.
> see commit: 9deed3ebca244663530782631834e706a86a8c8f
>
> Signed-off-by: Fox Chen 
> ---
>  Documentation/filesystems/path-lookup.rst | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/Documentation/filesystems/path-lookup.rst 
> b/Documentation/filesystems/path-lookup.rst
> index c482e1619e77..e778db767120 100644
> --- a/Documentation/filesystems/path-lookup.rst
> +++ b/Documentation/filesystems/path-lookup.rst
> @@ -448,8 +448,8 @@ described.  If it finds a ``LAST_NORM`` component it 
> first calls
>  filesystem to revalidate the result if it is that sort of filesystem.
>  If that doesn't get a good result, it calls "``lookup_slow()``" which
>  takes ``i_rwsem``, rechecks the cache, and then asks the filesystem
> -to find a definitive answer.  Each of these will call
> -``follow_managed()`` (as described below) to handle any mount points.
> +to find a definitive answer.  In ``step_into()``, ``handle_mount()`` will be 
> +called to handle any mount point.

The text now introduces step_into() without any hint as to why that is
relevant at this point.
It is a bit awkward to explain succinctly because while lookup_fast and
lookup_slow return a dentry which is passed to step_into(), handle_dots()
calls step_into() itself.

This is a general problem with this sort of documentation.  It weaves a
story and when the code changes, you might need to completely re-weave
the story.

I don't have a good suggestion for how to fix this text, but at the
least it needs to be made clear the walk_component() calls step_into(),
either directly or via handle_dots().

>  
>  In the absence of symbolic links, ``walk_component()`` creates a new
>  ``struct path`` containing a counted reference to the new dentry and a
> @@ -536,7 +536,7 @@ tree, but a few notes specifically related to path lookup 
> are in order
>  here.
>  
>  The Linux VFS has a concept of "managed" dentries which is reflected
> -in function names such as "``follow_managed()``".  There are three
> +in function names such as "``traverse_mounts()``".  There are three

Here you've completely broken the story.  Saying

  The VFS has a concept of "managed" dentries which is reflected in
  function names like "traverse_mounts()"

makes no sense at all.
Again, I cannot offer any quick fix.

NeilBrown


>  potentially interesting things about these dentries corresponding
>  to three different flags that might be set in ``dentry->d_flags``:
>  
> -- 
> 2.30.0


signature.asc
Description: PGP signature

Re: [PATCH 1/3] vfs: Do not ignore return code from s_op->sync_fs

2020-12-21 Thread NeilBrown

On Mon, Dec 21 2020, Vivek Goyal wrote:

> Current implementation of __sync_filesystem() ignores the
> return code from ->sync_fs(). I am not sure why that's the case.
>
> Ignoring ->sync_fs() return code is problematic for overlayfs where
> it can return error if sync_filesystem() on upper super block failed.
> That error will simply be lost and sycnfs(overlay_fd), will get
> success (despite the fact it failed).
>
> Al Viro noticed that there are other filesystems which can sometimes
> return error in ->sync_fs() and these errors will be ignored too.
>
> fs/btrfs/super.c:2412:  .sync_fs= btrfs_sync_fs,
> fs/exfat/super.c:204:   .sync_fs= exfat_sync_fs,
> fs/ext4/super.c:1674:   .sync_fs= ext4_sync_fs,
> fs/f2fs/super.c:2480:   .sync_fs= f2fs_sync_fs,
> fs/gfs2/super.c:1600:   .sync_fs= gfs2_sync_fs,
> fs/hfsplus/super.c:368: .sync_fs= hfsplus_sync_fs,
> fs/nilfs2/super.c:689:  .sync_fs= nilfs_sync_fs,
> fs/ocfs2/super.c:139:   .sync_fs= ocfs2_sync_fs,
> fs/overlayfs/super.c:399: .sync_fs= ovl_sync_fs,
> fs/ubifs/super.c:2052:  .sync_fs   = ubifs_sync_fs,
>
> Hence, this patch tries to fix it and capture error returned
> by ->sync_fs() and return to caller. I am specifically interested
> in syncfs() path and return error to user.
>
> I am assuming that we want to continue to call __sync_blockdev()
> despite the fact that there have been errors reported from
> ->sync_fs(). So this patch continues to call __sync_blockdev()
> even if ->sync_fs() returns an error.
>
> Al noticed that there are few other callsites where ->sync_fs() error
> code is being ignored.
>
> sync_fs_one_sb(): For this it seems desirable to ignore the return code.
>
> dquot_disable(): Jan Kara mentioned that ignoring return code here is fine
>because we don't want to fail dquot_disable() just beacuse
>caches might be incoherent.
>
> dquot_quota_sync(): Jan thinks that it might make some sense to capture
>   return code here. But I am leaving it untouched for
>  now. When somebody needs it, they can easily fix it.
>
> Signed-off-by: Vivek Goyal 
> ---
>  fs/sync.c | 8 ++--
>  1 file changed, 6 insertions(+), 2 deletions(-)
>
> diff --git a/fs/sync.c b/fs/sync.c
> index 1373a610dc78..b5fb83a734cd 100644
> --- a/fs/sync.c
> +++ b/fs/sync.c
> @@ -30,14 +30,18 @@
>   */
>  static int __sync_filesystem(struct super_block *sb, int wait)
>  {
> + int ret, ret2;
> +
>   if (wait)
>   sync_inodes_sb(sb);
>   else
>   writeback_inodes_sb(sb, WB_REASON_SYNC);
>  
>   if (sb->s_op->sync_fs)
> - sb->s_op->sync_fs(sb, wait);
> - return __sync_blockdev(sb->s_bdev, wait);
> + ret = sb->s_op->sync_fs(sb, wait);
> + ret2 = __sync_blockdev(sb->s_bdev, wait);
> +
> + return ret ? ret : ret2;

I'm surprised that the compiler didn't complain that 'ret' might be used
uninitialized.

NeilBrown

>  }
>  
>  /*
> -- 
> 2.25.4


signature.asc
Description: PGP signature

Re: [PATCH 3/3] overlayfs: Check writeback errors w.r.t upper in ->syncfs()

2020-12-18 Thread NeilBrown

On Fri, Dec 18 2020, Jeffrey Layton wrote:
>
> The patch we're discussing here _does_ add a f_op->syncfs, which is why
> I was suggesting to do it that way.

I haven't thought through the issues to decide what I think of adding a
new op, but I already know what I think of adding ->syncfs.  Don't Do
It.  The name is much too easily confused with ->sync_fs.

If you call it ->sync_fs_return_error() it would be MUCH better.

And having said that, the solution becomes obvious.  Add a new flag,
either as another bit in 'int wait', or as a new bool.
The new flag would be "return_error" - or whatever is appropriate.

NeilBrown

signature.asc
Description: PGP signature

Re: [PATCH rfc] workqueue: honour cond_resched() more effectively.

2020-11-26 Thread NeilBrown

On Thu, Nov 26 2020, Hillf Danton wrote:

> On Fri, 20 Nov 2020 15:33:27 +1100 NeilBrown wrote:
>>
>>My first idea was to add WQ_CPU_INTENSIVE to the nfsiod workqueue, but
>>Trond wondered what was special about NFS.  Many filesystems call iput
>>from a workqueue, so finding a solution that helps them all is best.
>
> In terms of iput, I think we can splice WQ_CPU_INTENSIVE to
> WQ_MEM_RECLAIM.

I'm actually starting to think that WQ_CPU_INTENSIVE is a mistake.  If
you really have cpu-intensive work, you should be using WQ_UNBOUND.

It is possible that there might be work that is CPU intensive and which
must be run on a particular CPU - such as clearing out per-cpu lists of
recently freed slab allocations.  But I don't WQ_CPU_INTENSIVE is currently
used that way.

I cannot find *any* users of WQ_CPU_INTENSIVE which call cond_resched()
in the relevant work items.  And if the code doesn't call cond_resched()
(or similar), then it isn't really CPU-intensive.

>
>>I then suggested getting cond_resched() to do something more useful when
>>called by a worker.  PeterZ didn't like the overhead.
>>
>>Also, TJ seemed to be against auto-adjusting for cpu-intensive code,
>>preferring the right sort of workqueue to be chosen up front.
>
> Actually WQ_EVENTS_LONG sounds better than WQ_CPU_INTENSIVE, given that
> we have two events WQs with the same attr.

There is no WQ_EVENTS_LONG

>
>   system_wq = alloc_workqueue("events", 0, 0);
>   system_long_wq = alloc_workqueue("events_long", 0, 0);
>
> Then what are the boundaries we can draw in between WQ_MEM_RECLAIM,
> WQ_CPU_INTENSIVE and WQ_EVENTS_LONG?

I think system_long_wq is a design flaw.
Some code (mistakenly) schedules work on system_wq, calls
flush_workqueue(system_wq)) and expects that to complete reasonably quickly.
To ensure this can work, system_long_wq was created and work items that
might take a long time are encouraged to be run there.
Instead, the mistaken code should create its own work queue, schedule
work on that, and flush that queue.

Thanks,
NeilBrown

>
>
> --- a/kernel/workqueue.c
> +++ b/kernel/workqueue.c
> @@ -4261,6 +4261,9 @@ struct workqueue_struct *alloc_workqueue
>   if ((flags & WQ_POWER_EFFICIENT) && wq_power_efficient)
>   flags |= WQ_UNBOUND;
>  
> + if (flags & (WQ_MEM_RECLAIM | WQ_UNBOUND) == WQ_MEM_RECLAIM)
> + flags |= WQ_CPU_INTENSIVE;
> +
>   /* allocate wq and format name */
>   if (flags & WQ_UNBOUND)
>   tbl_size = nr_node_ids * sizeof(wq->numa_pwq_tbl[0]);

signature.asc
Description: PGP signature

Re: [PATCH rfc] workqueue: honour cond_resched() more effectively.

2020-11-26 Thread NeilBrown

On Wed, Nov 25 2020, t...@kernel.org wrote:

> Hello,
>
> On Fri, Nov 20, 2020 at 10:23:44AM +1100, NeilBrown wrote:
>> On Mon, Nov 09 2020, t...@kernel.org wrote:
>> 
>> >Given that nothing on
>> > these types of workqueues can be latency sensitive
>> 
>> This caught my eye and it seems worth drilling in to.  There is no
>> mention of "latency" in workqueue.rst or workqueue.h.  But you seem to
>> be saying there is an undocumented assumption that latency-sensitive
>> work items much not be scheduled on CM-workqueues.
>> Is that correct?
>
> Yeah, correct. Because they're all sharing execution concurrency, the
> latency consistency is likely a lot worse.
>
>> NFS writes are latency sensitive to a degree as increased latency per
>> request will hurt overall throughput.  Does this mean that handling
>> write-completion in a CM-wq is a poor choice?
>> Would it be better to us WQ_HIGHPRI??  Is there any rule-of-thumb that
>> can be used to determine when WQ_HIGHPRI is appropriate?
>
> I don't think it'd need HIGHPRI but UNBOUND or CPU_INTENSIVE would make
> sense. I think the rule of the thumb is along the line of if you're worried
> about cpu consumption or latency, let the scheduler take care of it (ie. use
> unbound workqueues).

Thanks.
For nfsiod there are two contexts where it is used.

 In one context there is normally a thread waiting for the work item
 to complete.  It doesn't run the work in-line because the thread needs
 to abort if signaled, but the work needs to happen anyway so that the
 client and server remain in-sync.  In this case the fact that a
 application is waiting suggests that latency could be a problem.

 The other context is completing an async READ or WRITE.  I'm not sure
 if latency at this stage of the request will actually affect
 throughput, but we do need a WQ_MEM_RECLAIM wq for the WRITE at least.

 Keep both types of users on the same wq is simplest, so making it
  WQ_UNBOUND | WQ_MEM_RECLAIM
 is probably safest and would ensure that a cpu-intensive iput_final()
 doesn't interfere with other requests unduly.
 Quite a few other filesystems do use WQ_UNBOUND, often with
  WQ_MEM_RECLAIM, but it is not easy to do a like-for-like comparison.

 I might have a go at updating the workqueue documentation to provide
 some guidance on how to choose a workqueue and when certain flags are
 needed.

Thanks,
NeilBrown

signature.asc
Description: PGP signature

Re: [PATCH rfc] workqueue: honour cond_resched() more effectively.

2020-11-19 Thread NeilBrown

On Fri, Nov 20 2020, Hillf Danton wrote:

> On Fri, 20 Nov 2020 10:07:56 +1100 NeilBrown wrote:
>>On Wed, Nov 18 2020, Hillf Danton wrote:
>>> On Wed, 18 Nov 2020 16:11:44 +1100 NeilBrown wrote:
>>>> On Wed, Nov 18 2020, Hillf Danton wrote:
>>>> ...
>>>> I don't think this is a good idea.
>>>
>>> Let me add a few more words.
>>>
>>>> cond_resched() is expected to be called often.  Adding all this extra
>>>
>>> They are those only invoked in concurrency-managed worker contexts and
>>> are thus supposed to be less often than thought; what is more the callers
>>> know what they are doing if a schedule() follows up, needless to say it
>>> is an ant-antenna-size add-in to check WORKER_CPU_INTENSIVE given
>>> WARN_ON_ONCE(workqueue_mustnt_use_cpu())
>>> added in cond_resched().
>>
>>"supposed to be less often" is the central point here.
>
> No, it is not in any shape, see below.
>
>>Because the facts are that they sometime happen with high frequency
>>despite what is "supposed" to happen.
>
> Feel free to point me to a couple of such workers. I want to see
> how high it is and why.

The patch should suggest some.
Any work item which calls iput() might find it self in iput_final() and
then truncate_inode_pages_range() which will call cond_resched() once
for every 16 or fewer pages.  If there are millions of pages 

When a reply is received for an async NFS request (e.g. WRITE, but
several others), the processing happens in a workqueue (nfsiod), and this
will often call iput(), but rarely will that lead to iput_final().
Also, lots of non-workqueue code calls iput(), so adding code to an
inner-loop would cost everyone.

Any worker which allocates memory might find itself in
should_reclaim_retry() which calls cond_resched().  I don't know how
frequently this will fire.

The slab memory allocator uses a system_wq worker to reap a cache.  I
don't know exactly what that means but cache_reap() seems to need to
call cond_resched() periodically.  Maybe it should use be a
WQ_CPU_INTENSIVE workqueue, but there isn't a system_cpu_wq
Using system_unbound_wq() as it is doing per-CPU work.

>
>>Either the assumption that CM-workers don't call cond_resched() is
>>wrong, or the code that schedules such workers on CM-queues is wrong.
>>
>>I much prefer the perspective that the assumption is wrong.  If that is
>>agreed then we need to handle that circumstance without making
>>cond_resched() more expensive.
>
> This is the central point I think; it is a mile in between what
> you are trying to fix and what you are adding in cond_resched().

My latest patch only adds a WARNING to cond_resched(), so that we can
find problem code before it becomes a problem.  I did previously try
adding more to cond_resched(), and PeterZ didn't like that at all.

I agree that fixing the problem cannot be in cond_resched().  I think
that finding the scope of the problem is best done by instrumenting
cond_resched() (when DEBUG_KERNEL is selected).

>
>>Note that adding WARN_ON_ONCE() does not make it more expensive as it is
>>only enabled with KERNEL_DEBUG (and WQ_WATCHDOG, though the particular
>>config option could be changed). It isn't needed in production.
>
> Because cond_resched() is not the right place from the beginning
> for debugging like this, something in workqueue's backyard by
> design.  It's been there for a while, in production or not.

I don't understand your reasoning.  I don't see why one subsystem cannot
provide debugging to help some other subsystem.  Many subsystems add
"might_sleep()", not to detect bugs in themselves but to detect bugs in
their callers.  Adding a WARNING to cond_resched() helps us find bugs in
code that calls cond_resched()...

>>
>>If the workqueue maintainers are unmovable in the position that a
>
> They are open to any good thoughts, yesterday and tomorrow.
>
>>CM-workitem must not use excessive CPU ever, and so must not call
>>cond_resched(), then I can take that back to the NFS maintainers and
>>negotiate different workqueue settings. 
>
> That sounds like an easy road to go without either touching
> cond_resched() or adding a couple of lines in workqueue.  But
> the rising question is why you are branching to a new direction
> overnight if you think your thoughts are fine to fix an issue
> you observed in wq's property.

I'm branching off because I'm getting push-back and so am trying to
explore the problem space.
My first idea was to add WQ_CPU_INTENSIVE to the nfsiod workqueue, but
Trond wondered what was special about NFS.  Many filesystems call iput
from a workqueue, so finding a solution that helps them all is best.
I then sugg

Re: [PATCH rfc] workqueue: honour cond_resched() more effectively.

2020-11-19 Thread NeilBrown

On Mon, Nov 09 2020, t...@kernel.org wrote:

>Given that nothing on
> these types of workqueues can be latency sensitive

This caught my eye and it seems worth drilling in to.  There is no
mention of "latency" in workqueue.rst or workqueue.h.  But you seem to
be saying there is an undocumented assumption that latency-sensitive
work items much not be scheduled on CM-workqueues.
Is that correct?

NFS writes are latency sensitive to a degree as increased latency per
request will hurt overall throughput.  Does this mean that handling
write-completion in a CM-wq is a poor choice?
Would it be better to us WQ_HIGHPRI??  Is there any rule-of-thumb that
can be used to determine when WQ_HIGHPRI is appropriate?

Thanks,
NeilBrown

signature.asc
Description: PGP signature

Re: [PATCH rfc] workqueue: honour cond_resched() more effectively.

2020-11-19 Thread NeilBrown

On Wed, Nov 18 2020, Hillf Danton wrote:

> On Wed, 18 Nov 2020 16:11:44 +1100 NeilBrown wrote:
>> On Wed, Nov 18 2020, Hillf Danton wrote:
>> ...
>> I don't think this is a good idea.
>
> Let me add a few more words.
>
>> cond_resched() is expected to be called often.  Adding all this extra
>
> They are those only invoked in concurrency-managed worker contexts and
> are thus supposed to be less often than thought; what is more the callers
> know what they are doing if a schedule() follows up, needless to say it
> is an ant-antenna-size add-in to check WORKER_CPU_INTENSIVE given
>   WARN_ON_ONCE(workqueue_mustnt_use_cpu())
> added in cond_resched().

"supposed to be less often" is the central point here.
Because the facts are that they sometime happen with high frequency
despite what is "supposed" to happen.
Either the assumption that CM-workers don't call cond_resched() is
wrong, or the code that schedules such workers on CM-queues is wrong.

I much prefer the perspective that the assumption is wrong.  If that is
agreed then we need to handle that circumstance without making
cond_resched() more expensive.
Note that adding WARN_ON_ONCE() does not make it more expensive as it is
only enabled with KERNEL_DEBUG (and WQ_WATCHDOG, though the particular
config option could be changed). It isn't needed in production.

If the workqueue maintainers are unmovable in the position that a
CM-workitem must not use excessive CPU ever, and so must not call
cond_resched(), then I can take that back to the NFS maintainers and
negotiate different workqueue settings.  But as I've said, I think this
is requiring the decision to be made in a place that is not well
positioned to make it.

Thanks,
NeilBrown

signature.asc
Description: PGP signature

Re: [PATCH rfc] workqueue: honour cond_resched() more effectively.

2020-11-17 Thread NeilBrown

On Wed, Nov 18 2020, Hillf Danton wrote:

> On Wed, 18 Nov 2020 09:16:09 +1100
>> From: NeilBrown 
>> Date: Mon, 9 Nov 2020 13:37:17 +1100
>
> What is the brand of your wall clock? The gap between the Date tag
> above and 18 Nov is longer than a week.

I guess 'git commit --amend' doesn't update the Date: stamp.

>
>> Subject: [PATCH] workqueue: warn when cond_resched called from 
>> concurrency-managed worker
>> 
>> Several workers in concurrency-managed work queues call cond_resched().
>> This suggest that they might consume a lot of CPU time and are willing
>> to allow other code to run concurrently.
>> This *does* allow non-workqueue code to run but does *not* allow other
>> concurrency-managed work items to run on the same CPU.
>> 
>> In general such workers should not be run from a concurrency-managed
>> workqueue however:
>>  1/ there is no mechanism to alert coders to the fact that they are
>> using the wrong work queue, and
>>  2/ sometimes it is not clear, when a work item is created, whether it
>> will end up in code that might consume at lot of CPU time.  So
>> choosing the best workqueue it not always easy.
>> 
>> This patch addresses both these issues:
>>  1/ If a concurrency-managed worker calls cond_resched() a warning
>> (WARN_ON_ONCE) is generated (if CONFIG_WQ_WATCHDOG is selected).
>>  2/ A new interface - workqueue_prepare_use_cpu() - is provided which
>> may be called by low-level code which might be called in a workqueue
>> and might consume CPU.  This switches the worker to CPU_INTENSIVE
>> mode so that other work items can run on the same CPU.
>> This means there is no need to chose the required behaviour when
>> submitting a work item, as the code can switch behaviour when needed.
>
> Hm...sounds like the cure can be prepared by splicing the new interface
> above to the cond_resched() in worker's context like
>
> static void foo_work_fn(struct work_struct *work)
> {
> + bool use;
> ...
> + use = workqueue_prepare_use_cpu();
>   cond_resched();
> + workqueue_end_use_cpu(use);
> ...
> }
>
> bool workqueue_prepare_use_cpu(void)
> {
>   struct worker *worker = current_wq_worker();
>
>   if (worker && !(worker->flags & WORKER_CPU_INTENSIVE)) {
>   worker_set_flags(worker, WORKER_CPU_INTENSIVE);
>   return true;
>   } else
>   return false;
> }
>
> void workqueue_end_use_cpu(bool use)
> {
>   if (use) {
>   struct worker *worker = current_wq_worker();
>
>   worker_clear_flags(worker, WORKER_CPU_INTENSIVE);
>   }
> }
>
> And in a nutshell it looks like the below helper to avoid touching
> anything like cond_resched().
>
> void workqueue_cond_resched(void)
> {
>   struct worker *worker = current_wq_worker();
>
>   if (worker && !(worker->flags & WORKER_CPU_INTENSIVE)) {
>   worker_set_flags(worker, WORKER_CPU_INTENSIVE);
>   cond_resched();
>   worker_clear_flags(worker, WORKER_CPU_INTENSIVE);
>   } else
>   cond_resched();
> }

I don't think this is a good idea.
cond_resched() is expected to be called often.  Adding all this extra
work every time is excessive and unnecessary.

It might make sense to add a "workqueue_end_use_cpu()" call at the end
of functions that include "workqueue_prepare_use_cpu()" at the start.  I
don't think it is likely to make a useful difference, but I'm open to
the possibility if anyone can make a case for it.

Thanks,
NeilBrown


>
>> 
>> This patch also changes some code to avoid the warning:
>>  - in some cases, system_unbound_wq is used instead of system_wq,
>>when the work item will normally call cond_resched()
>>  - in other cases, calls to workqueue_prepare_use_cpu() are inserted.
>> 
>>  - in slab.c, a cond_resched() call is replaced by
>>  if (tif_need_resched())
>> break;
>>so that the worker will relinquish CPU and try again later.
>> 
>> There are doubtless more changes needed, but these allow me to boot and
>> access NFS files without warnings.
>> 
>> Signed-off-by: NeilBrown 
>> =2D--
>>  fs/dcache.c|  2 ++
>>  fs/nfs/delegation.c|  5 +
>>  fs/nfs/write.c |  6 ++
>>  include/linux/rhashtable.h |  4 ++--
>>  include/linux/sched.h  |  2 ++
>>  include/linux/workqueue.h  | 30 ++
>>  kernel/rcu/tree.c  |  2 +-
>>  kernel/workqueue.c |

Re: [PATCH rfc] workqueue: honour cond_resched() more effectively.

2020-11-17 Thread NeilBrown

On Mon, Nov 09 2020, t...@kernel.org wrote:

> Hello,
>
> On Mon, Nov 09, 2020 at 02:11:42PM +, Trond Myklebust wrote:
>> That means changing all filesystem code to use cpu-intensive queues. As
>> far as I can tell, they all use workqueues (most of them using the
>> standard system queue) for fput(), dput() and/or iput() calls.
>
> I suppose the assumption was that those operations couldn't possiby be
> expensive enough to warrant other options, which doesn't seem to be the case
> unfortunately. Switching the users to system_unbound_wq, which should be
> pretty trivial, seems to be the straight forward solution.
>
> I can definitely see benefits in making workqueue smarter about
> concurrency-managed work items taking a long time. Given that nothing on
> these types of workqueues can be latency sensitive and the problem being
> reported is on the scale of tens of seconds, I think a more palatable
> approach could be through watchdog mechanism rather than hooking into
> cond_resched(). Something like:
>
> * Run watchdog timer more frequently - e.g. 1/4 of threshold.
>
> * If a work item is occupying the local concurrency for too long, set
>   WORKER_CPU_INTENSIVE for the worker and, probably, generate a warning.
>
> I still think this should generate a warning and thus can't replace
> switching to unbound wq. The reason is that the concurrency limit isn't the
> only problem. A kthread needing to run on one particular CPU for tens of
> seconds just isn't great.

I don't think that using a timer to trigger a warning is sufficient.
As justification I point to "might_sleep()".  This doesn't wait for
atomic code to *actually* sleep, but produces a warning when atomic code
calls other code that *might* sleep.  This means we catch problems
during development rather that when in production.

For the same reasons, I think we need a warning when a CM-worker calls
code that *might* consume a lot of CPU, only one when it actually *does*
consume a lot of CPU.
So I think that a warning from cond_resched() is the right thing to do.
The patch below does this - it borrows CONFIG_WQ_WATCHDOG to desire whether
to check for the warning.

So I don't think that moving to workitems to system_unbound_wq is always
appropriate.  Sometimes it certain is, as in patch below.  However in
some cases it is not at all clear at the time the workitem is submitted,
how much work will be down.
nfs has a work queue (nfsiod) which handle the reply to an async RPC.
In some cases there is a tiny amount of work to be done.  In rare cases
there is a lot.  Rather than trying to deduce in advance, it is much
easier if the worker can switch mode only when it finds that there is a
lot of work to do.
The patch below supports this with workqueue_prepare_use_cpu() which
switches to CPU_INTENSIVE mode.  It doesn't unbind from the current
CPU.  Maybe it should but I suspect that wouldn't be as easy to code,
and I'm not at all sure of the benefit.

So: I propose the patch below.  Thoughts?

Thanks,
NeilBrown

From: NeilBrown 
Date: Mon, 9 Nov 2020 13:37:17 +1100
Subject: [PATCH] workqueue: warn when cond_resched called from
 concurrency-managed worker

Several workers in concurrency-managed work queues call cond_resched().
This suggest that they might consume a lot of CPU time and are willing
to allow other code to run concurrently.
This *does* allow non-workqueue code to run but does *not* allow other
concurrency-managed work items to run on the same CPU.

In general such workers should not be run from a concurrency-managed
workqueue however:
 1/ there is no mechanism to alert coders to the fact that they are
using the wrong work queue, and
 2/ sometimes it is not clear, when a work item is created, whether it
will end up in code that might consume at lot of CPU time.  So
choosing the best workqueue it not always easy.

This patch addresses both these issues:
 1/ If a concurrency-managed worker calls cond_resched() a warning
(WARN_ON_ONCE) is generated (if CONFIG_WQ_WATCHDOG is selected).
 2/ A new interface - workqueue_prepare_use_cpu() - is provided which
may be called by low-level code which might be called in a workqueue
and might consume CPU.  This switches the worker to CPU_INTENSIVE
mode so that other work items can run on the same CPU.
This means there is no need to chose the required behaviour when
submitting a work item, as the code can switch behaviour when needed.

This patch also changes some code to avoid the warning:
 - in some cases, system_unbound_wq is used instead of system_wq,
   when the work item will normally call cond_resched()
 - in other cases, calls to workqueue_prepare_use_cpu() are inserted.

 - in slab.c, a cond_resched() call is replaced by
 if (tif_need_resched())
break;
   so that the worker will relinquish CPU and try again later.

There are doubtless more chan

Re: [PATCH] NFS: only invalidate dentrys that are clearly invalid.

2020-11-15 Thread NeilBrown

On Mon, Nov 16 2020, Trond Myklebust wrote:

> On Mon, 2020-11-16 at 16:12 +1100, NeilBrown wrote:
>> On Mon, Nov 16 2020, Trond Myklebust wrote:
>> 
>> > On Mon, 2020-11-16 at 16:00 +1100, NeilBrown wrote:
>> > > On Mon, Nov 16 2020, Trond Myklebust wrote:
>> > > 
>> > > > On Mon, 2020-11-16 at 15:43 +1100, NeilBrown wrote:
>> > > > > On Mon, Nov 16 2020, Trond Myklebust wrote:
>> > > > > 
>> > > > > > On Mon, 2020-11-16 at 13:59 +1100, NeilBrown wrote:
>> > > > > > > 
>> > > > > > > Prior to commit 5ceb9d7fdaaf ("NFS: Refactor
>> > > > > > > nfs_lookup_revalidate()")
>> > > > > > > and error from nfs_lookup_verify_inode() other than -
>> > > > > > > ESTALE
>> > > > > > > would
>> > > > > > > result
>> > > > > > > in nfs_lookup_revalidate() returning that error code (-
>> > > > > > > ESTALE
>> > > > > > > is
>> > > > > > > mapped
>> > > > > > > to zero).
>> > > > > > > Since that commit, all errors result in zero being
>> > > > > > > returned.
>> > > > > > > 
>> > > > > > > When nfs_lookup_revalidate() returns zero, the dentry is
>> > > > > > > invalidated
>> > > > > > > and, significantly, if the dentry is a directory that is
>> > > > > > > mounted
>> > > > > > > on,
>> > > > > > > that mountpoint is lost.
>> > > > > > > 
>> > > > > > > If you:
>> > > > > > >  - mount an NFS filesystem which contains a directory
>> > > > > > >  - mount something (e.g. tmpfs) on that directory
>> > > > > > >  - use iptables (or scissors) to block traffic to the
>> > > > > > > server
>> > > > > > >  - ls -l the-mounted-on-directory
>> > > > > > >  - interrupt the 'ls -l'
>> > > > > > > you will find that the directory has been unmounted.
>> > > > > > > 
>> > > > > > > This can be fixed by returning the actual error code from
>> > > > > > > nfs_lookup_verify_inode() rather then zero (except for -
>> > > > > > > ESTALE).
>> > > > > > > 
>> > > > > > > Fixes: 5ceb9d7fdaaf ("NFS: Refactor
>> > > > > > > nfs_lookup_revalidate()")
>> > > > > > > Signed-off-by: NeilBrown 
>> > > > > > > ---
>> > > > > > >  fs/nfs/dir.c | 8 +---
>> > > > > > >  1 file changed, 5 insertions(+), 3 deletions(-)
>> > > > > > > 
>> > > > > > > diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
>> > > > > > > index cb52db9a0cfb..d24acf556e9e 100644
>> > > > > > > --- a/fs/nfs/dir.c
>> > > > > > > +++ b/fs/nfs/dir.c
>> > > > > > > @@ -1350,7 +1350,7 @@ nfs_do_lookup_revalidate(struct
>> > > > > > > inode
>> > > > > > > *dir,
>> > > > > > > struct dentry *dentry,
>> > > > > > >  unsigned int flags)
>> > > > > > >  {
>> > > > > > > struct inode *inode;
>> > > > > > > -   int error;
>> > > > > > > +   int error = 0;
>> > > > > > >  
>> > > > > > > nfs_inc_stats(dir, NFSIOS_DENTRYREVALIDATE);
>> > > > > > > inode = d_inode(dentry);
>> > > > > > > @@ -1372,8 +1372,10 @@ nfs_do_lookup_revalidate(struct
>> > > > > > > inode
>> > > > > > > *dir,
>> > > > > > > struct dentry *dentry,
>> > > > > > >     nfs_check_verifier(dir, dentry, flags &
>> > > > > > > LOOKUP_RCU))
>> > > > > > > {
>> > > > > > > error = nfs_lookup_verify_inode(inode,
>> > > > > > > flags);
>> > > > > > > if (error) {
>> > > > > > > -

Re: [PATCH] NFS: only invalidate dentrys that are clearly invalid.

2020-11-15 Thread NeilBrown

On Mon, Nov 16 2020, Trond Myklebust wrote:

> On Mon, 2020-11-16 at 16:00 +1100, NeilBrown wrote:
>> On Mon, Nov 16 2020, Trond Myklebust wrote:
>> 
>> > On Mon, 2020-11-16 at 15:43 +1100, NeilBrown wrote:
>> > > On Mon, Nov 16 2020, Trond Myklebust wrote:
>> > > 
>> > > > On Mon, 2020-11-16 at 13:59 +1100, NeilBrown wrote:
>> > > > > 
>> > > > > Prior to commit 5ceb9d7fdaaf ("NFS: Refactor
>> > > > > nfs_lookup_revalidate()")
>> > > > > and error from nfs_lookup_verify_inode() other than -ESTALE
>> > > > > would
>> > > > > result
>> > > > > in nfs_lookup_revalidate() returning that error code (-ESTALE
>> > > > > is
>> > > > > mapped
>> > > > > to zero).
>> > > > > Since that commit, all errors result in zero being returned.
>> > > > > 
>> > > > > When nfs_lookup_revalidate() returns zero, the dentry is
>> > > > > invalidated
>> > > > > and, significantly, if the dentry is a directory that is
>> > > > > mounted
>> > > > > on,
>> > > > > that mountpoint is lost.
>> > > > > 
>> > > > > If you:
>> > > > >  - mount an NFS filesystem which contains a directory
>> > > > >  - mount something (e.g. tmpfs) on that directory
>> > > > >  - use iptables (or scissors) to block traffic to the server
>> > > > >  - ls -l the-mounted-on-directory
>> > > > >  - interrupt the 'ls -l'
>> > > > > you will find that the directory has been unmounted.
>> > > > > 
>> > > > > This can be fixed by returning the actual error code from
>> > > > > nfs_lookup_verify_inode() rather then zero (except for -
>> > > > > ESTALE).
>> > > > > 
>> > > > > Fixes: 5ceb9d7fdaaf ("NFS: Refactor nfs_lookup_revalidate()")
>> > > > > Signed-off-by: NeilBrown 
>> > > > > ---
>> > > > >  fs/nfs/dir.c | 8 +---
>> > > > >  1 file changed, 5 insertions(+), 3 deletions(-)
>> > > > > 
>> > > > > diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
>> > > > > index cb52db9a0cfb..d24acf556e9e 100644
>> > > > > --- a/fs/nfs/dir.c
>> > > > > +++ b/fs/nfs/dir.c
>> > > > > @@ -1350,7 +1350,7 @@ nfs_do_lookup_revalidate(struct inode
>> > > > > *dir,
>> > > > > struct dentry *dentry,
>> > > > >  unsigned int flags)
>> > > > >  {
>> > > > > struct inode *inode;
>> > > > > -   int error;
>> > > > > +   int error = 0;
>> > > > >  
>> > > > > nfs_inc_stats(dir, NFSIOS_DENTRYREVALIDATE);
>> > > > > inode = d_inode(dentry);
>> > > > > @@ -1372,8 +1372,10 @@ nfs_do_lookup_revalidate(struct inode
>> > > > > *dir,
>> > > > > struct dentry *dentry,
>> > > > >     nfs_check_verifier(dir, dentry, flags &
>> > > > > LOOKUP_RCU))
>> > > > > {
>> > > > > error = nfs_lookup_verify_inode(inode,
>> > > > > flags);
>> > > > > if (error) {
>> > > > > -   if (error == -ESTALE)
>> > > > > +   if (error == -ESTALE) {
>> > > > > nfs_zap_caches(dir);
>> > > > > +   error = 0;
>> > > > > +   }
>> > > > > goto out_bad;
>> > > > > }
>> > > > > nfs_advise_use_readdirplus(dir);
>> > > > > @@ -1395,7 +1397,7 @@ nfs_do_lookup_revalidate(struct inode
>> > > > > *dir,
>> > > > > struct dentry *dentry,
>> > > > >  out_bad:
>> > > > > if (flags & LOOKUP_RCU)
>> > > > > return -ECHILD;
>> > > > > -   return nfs_lookup_revalidate_done(dir, dentry, inode,
>> > > > > 0);
>> > > > > +   return nfs_lookup_revalidate_done(dir, dentry, inode,
>>

Re: [PATCH] NFS: only invalidate dentrys that are clearly invalid.

2020-11-15 Thread NeilBrown

On Mon, Nov 16 2020, Trond Myklebust wrote:

> On Mon, 2020-11-16 at 15:43 +1100, NeilBrown wrote:
>> On Mon, Nov 16 2020, Trond Myklebust wrote:
>> 
>> > On Mon, 2020-11-16 at 13:59 +1100, NeilBrown wrote:
>> > > 
>> > > Prior to commit 5ceb9d7fdaaf ("NFS: Refactor
>> > > nfs_lookup_revalidate()")
>> > > and error from nfs_lookup_verify_inode() other than -ESTALE would
>> > > result
>> > > in nfs_lookup_revalidate() returning that error code (-ESTALE is
>> > > mapped
>> > > to zero).
>> > > Since that commit, all errors result in zero being returned.
>> > > 
>> > > When nfs_lookup_revalidate() returns zero, the dentry is
>> > > invalidated
>> > > and, significantly, if the dentry is a directory that is mounted
>> > > on,
>> > > that mountpoint is lost.
>> > > 
>> > > If you:
>> > >  - mount an NFS filesystem which contains a directory
>> > >  - mount something (e.g. tmpfs) on that directory
>> > >  - use iptables (or scissors) to block traffic to the server
>> > >  - ls -l the-mounted-on-directory
>> > >  - interrupt the 'ls -l'
>> > > you will find that the directory has been unmounted.
>> > > 
>> > > This can be fixed by returning the actual error code from
>> > > nfs_lookup_verify_inode() rather then zero (except for -ESTALE).
>> > > 
>> > > Fixes: 5ceb9d7fdaaf ("NFS: Refactor nfs_lookup_revalidate()")
>> > > Signed-off-by: NeilBrown 
>> > > ---
>> > >  fs/nfs/dir.c | 8 +---
>> > >  1 file changed, 5 insertions(+), 3 deletions(-)
>> > > 
>> > > diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
>> > > index cb52db9a0cfb..d24acf556e9e 100644
>> > > --- a/fs/nfs/dir.c
>> > > +++ b/fs/nfs/dir.c
>> > > @@ -1350,7 +1350,7 @@ nfs_do_lookup_revalidate(struct inode *dir,
>> > > struct dentry *dentry,
>> > >  unsigned int flags)
>> > >  {
>> > > struct inode *inode;
>> > > -   int error;
>> > > +   int error = 0;
>> > >  
>> > > nfs_inc_stats(dir, NFSIOS_DENTRYREVALIDATE);
>> > > inode = d_inode(dentry);
>> > > @@ -1372,8 +1372,10 @@ nfs_do_lookup_revalidate(struct inode
>> > > *dir,
>> > > struct dentry *dentry,
>> > >     nfs_check_verifier(dir, dentry, flags & LOOKUP_RCU))
>> > > {
>> > > error = nfs_lookup_verify_inode(inode, flags);
>> > > if (error) {
>> > > -   if (error == -ESTALE)
>> > > +   if (error == -ESTALE) {
>> > > nfs_zap_caches(dir);
>> > > +   error = 0;
>> > > +   }
>> > > goto out_bad;
>> > > }
>> > > nfs_advise_use_readdirplus(dir);
>> > > @@ -1395,7 +1397,7 @@ nfs_do_lookup_revalidate(struct inode *dir,
>> > > struct dentry *dentry,
>> > >  out_bad:
>> > > if (flags & LOOKUP_RCU)
>> > > return -ECHILD;
>> > > -   return nfs_lookup_revalidate_done(dir, dentry, inode, 0);
>> > > +   return nfs_lookup_revalidate_done(dir, dentry, inode,
>> > > error);
>> > 
>> > Which errors do we actually need to return here? As far as I can
>> > tell,
>> > the only errors that nfs_lookup_verify_inode() is supposed to
>> > return is
>> > ENOMEM, ESTALE, ECHILD, and possibly EIO or ETiMEDOUT.
>> > 
>> > Why would it be better to return those errors rather than just a 0
>> > when
>> > we need to invalidate the inode, particularly since we already have
>> > a
>> > special case in nfs_lookup_revalidate_done() when the dentry is
>> > root?
>> 
>> ERESTARTSYS is the error that easily causes problems.
>> 
>> Returning 0 causes d_invalidate() to be called which is quite heavy
>> handed in mountpoints.
>
> My point is that it shouldn't get returned for mountpoints. See
> nfs_lookup_revalidate_done().

nfs_lookup_revalidate_done() only checks IS_ROOT(), and while many
mountpoints are IS_ROOT(), not all are (--bind easily makes others).

But that isn't even really relevant here.  The dentry being revalidated
is the underlying directory - that something else is mounted on.
step_into() which follows mount points is called in walk_component()
*after* lookup_fast or lookup_slow which will have revalidated the
dentry.

NeilBrown


>
>> So it is only reasonable to return 0 when we have unambiguous
>> confirmation from the server that the object no longer exists. 
>> ESTALE
>> is unambiguous. EIO might be unambiguous.  ERESTARTSYS, ENOMEM,
>> ETIMEDOUT are transient and don't justify d_invalidate() being
>> called.
>> 
>> (BTW, Commit cc89684c9a26 ("NFS: only invalidate dentrys that are
>> clearly invalid.")
>>  fixed much the same bug 3 years ago).
>>  
>> Thanks,
>> NeilBrown
>> 
>> 
>> > 
>> > >  }
>> > >  
>> > >  static int
>> > 
>> > -- 
>> > Trond Myklebust
>> > Linux NFS client maintainer, Hammerspace
>> > trond.mykleb...@hammerspace.com
>
> -- 
> Trond Myklebust
> CTO, Hammerspace Inc
> 4984 El Camino Real, Suite 208
> Los Altos, CA 94022
> 
> www.hammer.space


signature.asc
Description: PGP signature

Re: [PATCH] NFS: only invalidate dentrys that are clearly invalid.

2020-11-15 Thread NeilBrown

On Mon, Nov 16 2020, Trond Myklebust wrote:

> On Mon, 2020-11-16 at 13:59 +1100, NeilBrown wrote:
>> 
>> Prior to commit 5ceb9d7fdaaf ("NFS: Refactor
>> nfs_lookup_revalidate()")
>> and error from nfs_lookup_verify_inode() other than -ESTALE would
>> result
>> in nfs_lookup_revalidate() returning that error code (-ESTALE is
>> mapped
>> to zero).
>> Since that commit, all errors result in zero being returned.
>> 
>> When nfs_lookup_revalidate() returns zero, the dentry is invalidated
>> and, significantly, if the dentry is a directory that is mounted on,
>> that mountpoint is lost.
>> 
>> If you:
>>  - mount an NFS filesystem which contains a directory
>>  - mount something (e.g. tmpfs) on that directory
>>  - use iptables (or scissors) to block traffic to the server
>>  - ls -l the-mounted-on-directory
>>  - interrupt the 'ls -l'
>> you will find that the directory has been unmounted.
>> 
>> This can be fixed by returning the actual error code from
>> nfs_lookup_verify_inode() rather then zero (except for -ESTALE).
>> 
>> Fixes: 5ceb9d7fdaaf ("NFS: Refactor nfs_lookup_revalidate()")
>> Signed-off-by: NeilBrown 
>> ---
>>  fs/nfs/dir.c | 8 +---
>>  1 file changed, 5 insertions(+), 3 deletions(-)
>> 
>> diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
>> index cb52db9a0cfb..d24acf556e9e 100644
>> --- a/fs/nfs/dir.c
>> +++ b/fs/nfs/dir.c
>> @@ -1350,7 +1350,7 @@ nfs_do_lookup_revalidate(struct inode *dir,
>> struct dentry *dentry,
>>  unsigned int flags)
>>  {
>> struct inode *inode;
>> -   int error;
>> +   int error = 0;
>>  
>> nfs_inc_stats(dir, NFSIOS_DENTRYREVALIDATE);
>> inode = d_inode(dentry);
>> @@ -1372,8 +1372,10 @@ nfs_do_lookup_revalidate(struct inode *dir,
>> struct dentry *dentry,
>>     nfs_check_verifier(dir, dentry, flags & LOOKUP_RCU)) {
>> error = nfs_lookup_verify_inode(inode, flags);
>> if (error) {
>> -   if (error == -ESTALE)
>> +   if (error == -ESTALE) {
>> nfs_zap_caches(dir);
>> +   error = 0;
>> +   }
>> goto out_bad;
>> }
>> nfs_advise_use_readdirplus(dir);
>> @@ -1395,7 +1397,7 @@ nfs_do_lookup_revalidate(struct inode *dir,
>> struct dentry *dentry,
>>  out_bad:
>> if (flags & LOOKUP_RCU)
>> return -ECHILD;
>> -   return nfs_lookup_revalidate_done(dir, dentry, inode, 0);
>> +   return nfs_lookup_revalidate_done(dir, dentry, inode, error);
>
> Which errors do we actually need to return here? As far as I can tell,
> the only errors that nfs_lookup_verify_inode() is supposed to return is
> ENOMEM, ESTALE, ECHILD, and possibly EIO or ETiMEDOUT.
>
> Why would it be better to return those errors rather than just a 0 when
> we need to invalidate the inode, particularly since we already have a
> special case in nfs_lookup_revalidate_done() when the dentry is root?

ERESTARTSYS is the error that easily causes problems.

Returning 0 causes d_invalidate() to be called which is quite heavy
handed in mountpoints.
So it is only reasonable to return 0 when we have unambiguous
confirmation from the server that the object no longer exists.  ESTALE
is unambiguous. EIO might be unambiguous.  ERESTARTSYS, ENOMEM,
ETIMEDOUT are transient and don't justify d_invalidate() being called.

(BTW, Commit cc89684c9a26 ("NFS: only invalidate dentrys that are clearly 
invalid.")
 fixed much the same bug 3 years ago).
 
Thanks,
NeilBrown


>
>>  }
>>  
>>  static int
>
> -- 
> Trond Myklebust
> Linux NFS client maintainer, Hammerspace
> trond.mykleb...@hammerspace.com


signature.asc
Description: PGP signature

[PATCH] NFS: only invalidate dentrys that are clearly invalid.

2020-11-15 Thread NeilBrown


Prior to commit 5ceb9d7fdaaf ("NFS: Refactor nfs_lookup_revalidate()")
and error from nfs_lookup_verify_inode() other than -ESTALE would result
in nfs_lookup_revalidate() returning that error code (-ESTALE is mapped
to zero).
Since that commit, all errors result in zero being returned.

When nfs_lookup_revalidate() returns zero, the dentry is invalidated
and, significantly, if the dentry is a directory that is mounted on,
that mountpoint is lost.

If you:
 - mount an NFS filesystem which contains a directory
 - mount something (e.g. tmpfs) on that directory
 - use iptables (or scissors) to block traffic to the server
 - ls -l the-mounted-on-directory
 - interrupt the 'ls -l'
you will find that the directory has been unmounted.

This can be fixed by returning the actual error code from
nfs_lookup_verify_inode() rather then zero (except for -ESTALE).

Fixes: 5ceb9d7fdaaf ("NFS: Refactor nfs_lookup_revalidate()")
Signed-off-by: NeilBrown 
---
 fs/nfs/dir.c | 8 +---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
index cb52db9a0cfb..d24acf556e9e 100644
--- a/fs/nfs/dir.c
+++ b/fs/nfs/dir.c
@@ -1350,7 +1350,7 @@ nfs_do_lookup_revalidate(struct inode *dir, struct dentry 
*dentry,
 unsigned int flags)
 {
struct inode *inode;
-   int error;
+   int error = 0;
 
nfs_inc_stats(dir, NFSIOS_DENTRYREVALIDATE);
inode = d_inode(dentry);
@@ -1372,8 +1372,10 @@ nfs_do_lookup_revalidate(struct inode *dir, struct 
dentry *dentry,
nfs_check_verifier(dir, dentry, flags & LOOKUP_RCU)) {
error = nfs_lookup_verify_inode(inode, flags);
if (error) {
-   if (error == -ESTALE)
+   if (error == -ESTALE) {
nfs_zap_caches(dir);
+   error = 0;
+   }
goto out_bad;
}
nfs_advise_use_readdirplus(dir);
@@ -1395,7 +1397,7 @@ nfs_do_lookup_revalidate(struct inode *dir, struct dentry 
*dentry,
 out_bad:
if (flags & LOOKUP_RCU)
return -ECHILD;
-   return nfs_lookup_revalidate_done(dir, dentry, inode, 0);
+   return nfs_lookup_revalidate_done(dir, dentry, inode, error);
 }
 
 static int
-- 
2.29.2



signature.asc
Description: PGP signature

Re: [PATCH rfc] workqueue: honour cond_resched() more effectively.

2020-11-09 Thread NeilBrown

On Mon, Nov 09 2020, Peter Zijlstra wrote:

> On Mon, Nov 09, 2020 at 01:50:40PM +, Trond Myklebust wrote:
>> On Mon, 2020-11-09 at 09:00 +0100, Peter Zijlstra wrote:
>
>> > I'm thinking the real problem is that you're abusing workqueues. Just
>> > don't stuff so much work into it that this becomes a problem. Or
>> > rather,
>> > if you do, don't lie to it about it.
>> 
>> If we can't use workqueues to call iput_final() on an inode, then what
>> is the point of having them at all?
>
> Running short stuff, apparently.

Also running stuff that sleeps.  If only does work in short bursts, and
sleeps between the works, it can run as long as it likes.
It is only sustained bursts that are currently not supported with
explicit code.

>
>> Neil's use case is simply a file that has managed to accumulate a
>> seriously large page cache, and is therefore taking a long time to
>> complete the call to truncate_inode_pages_final(). Are you saying we
>> have to allocate a dedicated thread for every case where this happens?
>
> I'm not saying anything, but you're trying to wreck the scheduler
> because of a workqueue 'feature'. The 'new' workqueues limit concurrency
> by design, if you're then relying on concurrency for things, you're
> using it wrong.
>
> I really don't know what the right answer is here, but I thoroughly hate
> the one proposed.

Oh good - plenty for room for improvement then :-)

I feel strongly that this should work transparently.  Expecting people
too choose the right option to handle cases that don't often some up in
testing is naive.
A warning whenever a bound,non-CPU-intensive worker calls cond_resched()
is trivial to implement and extremely noise.  As mentioned, I get twenty
just to boot.

One amusing example is rhashtable which schedule a worker to rehash a
table.  This is expected to be cpu-intensive because it calls
cond_resched(), but it is run with schedule_work() - clearly not
realizing that will block other scheduled work on that CPU.

An amusing example for the flip-side is crypto/cryptd.c which creates a
WQ_CPU_INTENSIVE workqueue (cryptd) but the cryptd_queue_worker() has
a comment "Only handle one request at a time to avoid hogging crypto
workqueue." !!! The whole point of WQ_CPU_INTENSIVE is that you cannot
hog the workqueue!!

Anyway, I digress  warning on ever cond_resched() generates lots of
warnings, including some from printk so any work item that might
ever print a message needs to be CPU_INTENSIVE???
I don't think that scales.

Is there some way the scheduler can help?  Does the scheduler notice
"time to check on that CPU over there" and then:
 - if it is in user-space- force it to schedule
 - if it is in kernel-space (and preempt is disabled), then leave it
 alone
 ??

If so, could there be a third case - if it is a bound,non-cpu-intensive
worker, switch it to cpu-intensive???

I wonder how long workers typically run - do many run long enough that
the scheduler might want to ask them to take a break?

Thanks,
NeilBrown

signature.asc
Description: PGP signature

[PATCH rfc] workqueue: honour cond_resched() more effectively.

2020-11-08 Thread NeilBrown


If a worker task for a normal (bound, non-CPU-intensive) calls
cond_resched(), this allows other non-workqueue processes to run, but
does *not* allow other workqueue workers to run.  This is because
workqueue will only attempt to run one task at a time on each CPU,
unless the current task actually sleeps.

This is already a problem for should_reclaim_retry() in mm/page_alloc.c,
which inserts a small sleep just to convince workqueue to allow other
workers to run.

It can also be a problem for NFS when closing a very large file (e.g.
100 million pages in memory).  NFS can call the final iput() from a
workqueue, which can then take long enough to trigger a workqueue-lockup
warning, and long enough for performance problems to be observed.

I added a WARN_ON_ONCE() in cond_resched() to report when it is run from
a workqueue, and got about 20 hits during boot, many of system_wq (aka
"events") which suggests there is a real chance that worker are being
delayed unnecessarily be tasks which are written to politely relinquish
the CPU.

I think that once a worker calls cond_resched(), it should be treated as
though it was run from a WQ_CPU_INTENSIVE queue, because only cpu-intensive
tasks need to call cond_resched().  This would allow other workers to be
scheduled.

The following patch achieves this I believe.

Signed-off-by: NeilBrown 
---
 include/linux/sched.h |  7 ++-
 include/linux/workqueue.h |  2 ++
 kernel/sched/core.c   |  4 
 kernel/workqueue.c| 16 +++-
 mm/page_alloc.c   | 12 +---
 5 files changed, 28 insertions(+), 13 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 4418f5cb8324..728870965df1 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1784,7 +1784,12 @@ static inline int test_tsk_need_resched(struct 
task_struct *tsk)
 #ifndef CONFIG_PREEMPTION
 extern int _cond_resched(void);
 #else
-static inline int _cond_resched(void) { return 0; }
+static inline int _cond_resched(void)
+{
+   if (current->flags & PF_WQ_WORKER)
+   workqueue_cond_resched();
+   return 0;
+}
 #endif
 
 #define cond_resched() ({  \
diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
index 8b505d22fc0e..7bcc5717d80f 100644
--- a/include/linux/workqueue.h
+++ b/include/linux/workqueue.h
@@ -626,6 +626,8 @@ static inline bool schedule_delayed_work(struct 
delayed_work *dwork,
return queue_delayed_work(system_wq, dwork, delay);
 }
 
+void workqueue_cond_resched(void);
+
 #ifndef CONFIG_SMP
 static inline long work_on_cpu(int cpu, long (*fn)(void *), void *arg)
 {
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 9a2fbf98fd6f..5b2e38567a0c 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5620,6 +5620,8 @@ SYSCALL_DEFINE0(sched_yield)
 #ifndef CONFIG_PREEMPTION
 int __sched _cond_resched(void)
 {
+   if (current->flags & PF_WQ_WORKER)
+   workqueue_cond_resched();
if (should_resched(0)) {
preempt_schedule_common();
return 1;
@@ -5643,6 +5645,8 @@ int __cond_resched_lock(spinlock_t *lock)
int resched = should_resched(PREEMPT_LOCK_OFFSET);
int ret = 0;
 
+   if (current->flags & PF_WQ_WORKER)
+   workqueue_cond_resched();
lockdep_assert_held(lock);
 
if (spin_needbreak(lock) || resched) {
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 891ccad5f271..fd2e9557b1a6 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -2296,7 +2296,7 @@ __acquires(>lock)
spin_lock_irq(>lock);
 
/* clear cpu intensive status */
-   if (unlikely(cpu_intensive))
+   if (unlikely(worker->flags & WORKER_CPU_INTENSIVE))
worker_clr_flags(worker, WORKER_CPU_INTENSIVE);
 
/* tag the worker for identification in schedule() */
@@ -5330,6 +5330,20 @@ int workqueue_set_unbound_cpumask(cpumask_var_t cpumask)
return ret;
 }
 
+void workqueue_cond_resched(void)
+{
+   struct worker *worker = current_wq_worker();
+
+   if (worker && !(worker->flags & WORKER_CPU_INTENSIVE)) {
+   struct worker_pool *pool = worker->pool;
+
+   worker_set_flags(worker, WORKER_CPU_INTENSIVE);
+   if (need_more_worker(pool))
+   wake_up_worker(pool);
+   }
+}
+EXPORT_SYMBOL(workqueue_cond_resched);
+
 #ifdef CONFIG_SYSFS
 /*
  * Workqueues with WQ_SYSFS flag set is visible to userland via
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 13cc653122b7..0d5720ecbfe7 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4420,17 +4420,7 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order,
}
 
 out:
-   /*
-* Memory allocation/reclaim might be called from a WQ context and the
-* current implementation of the WQ concurrency control doesn't
-* recognize that a particu

[PATCH] workqueue: export apply_workqueue_attrs()

2020-10-18 Thread NeilBrown


Lustre is a widely used cluster filesystem which is currently
out-of-tree, but work is underway to make it ready for upstream
submission.

Lustre needs apply_workqueue_attrs(), and for this reason that function
was exported in Commit 6106c0f82481 ("staging: lustre: lnet: convert
selftest to use workqueues").

Unfortuantely it was later (through an excess of caution) unexported by
Commit 2c9858ecbeb1 ("workqueue: Make alloc/apply/free_workqueue_attrs()
static"), which also marked the function static.

Subsequently in Commit 513c98d08682 ("workqueue: unconfine
alloc/apply/free_workqueue_attrs()") the "static" marking was removed
when it was realized that it is actually useful to some clients of
"workqueue", but it was not exported at this time.

Lustre has been making do with use of kallsyms_lookup_name() to get
access to this function when it isn't exported, but that loop-hole has
now been removed.

So: time to export the function again.

Signed-off-by: NeilBrown 
---
 kernel/workqueue.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 437935e7a199..ab593b20fb94 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -4068,6 +4068,7 @@ int apply_workqueue_attrs(struct workqueue_struct *wq,
 
return ret;
 }
+EXPORT_SYMBOL_GPL(apply_workqueue_attrs);
 
 /**
  * wq_update_unbound_numa - update NUMA affinity of a wq for CPU hot[un]plug
-- 
2.28.0



signature.asc
Description: PGP signature

Re: [mm/writeback] 8d92890bd6: will-it-scale.per_process_ops -15.3% regression

2020-10-14 Thread NeilBrown

On Wed, Oct 14 2020, Jan Kara wrote:

> On Wed 14-10-20 16:47:06, kernel test robot wrote:
>> Greeting,
>> 
>> FYI, we noticed a -15.3% regression of will-it-scale.per_process_ops due
>> to commit:
>> 
>> commit: 8d92890bd6b8502d6aee4b37430ae6444ade7a8c ("mm/writeback: discard
>> NR_UNSTABLE_NFS, use NR_WRITEBACK instead")
>> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
>
> Thanks for report but it doesn't quite make sense to me. If we omit
> reporting & NFS changes in that commit (which is code not excercised by
> this benchmark), what remains are changes like:
>
> nr_pages += node_page_state(pgdat, NR_FILE_DIRTY);
> -   nr_pages += node_page_state(pgdat, NR_UNSTABLE_NFS);
> nr_pages += node_page_state(pgdat, NR_WRITEBACK);
> ...
> -   nr_reclaimable = global_node_page_state(NR_FILE_DIRTY) +
> -   
> global_node_page_state(NR_UNSTABLE_NFS);
> +   nr_reclaimable = global_node_page_state(NR_FILE_DIRTY);
> ...
> -   gdtc->dirty = global_node_page_state(NR_FILE_DIRTY) +
> - global_node_page_state(NR_UNSTABLE_NFS);
> +   gdtc->dirty = global_node_page_state(NR_FILE_DIRTY);
>
> So if there's any negative performance impact of these changes, they're
> likely due to code alignment changes or something like that... So I don't
> think there's much to do here since optimal code alignment is highly specific
> to a particular CPU etc.

I agree, it seems odd.

Removing NR_UNSTABLE_NFS from enum node_stat_item would renumber all the
following value and would, I think, change NR_DIRTIED from 32 to 31.
Might that move something to a different cache line and change some
contention?

That would be easy enough to test: just re-add NR_UNSTABLE_NFS.

I have no experience reading will-it-scale results, but 15% does seem
like a lot.

NeilBrown


signature.asc
Description: PGP signature

Re: [PATCH] fsync.2: ERRORS: add EIO and ENOSPC

2020-09-16 Thread NeilBrown

On Thu, Sep 10 2020, Jeff Layton wrote:
>
>> Regarding your "NOTES" addition, I don't feel comfortable with the
>> "clean" language.  I would prefer something like:
>> 
>>  When fsync() reports a failure (EIO, ENOSPC, EDQUOT) it must be assumed
>>  that any write requests initiated since the previous successful fsync
>>  was initiated may have failed, and that any cached data may have been
>>  lost.  A future fsync() will not attempt to write out the same data
>>  again.  If recovery is possible and desired, the application must
>>  repeat all the writes that may have failed.
>> 
>>  If the regions of a file that were written to prior to a failed fsync()
>>  are read, the content reported may not reflect the stored content, and
>>  subsequent reads may revert to the stored content at any time.
>> 
>
> Much nicer.

I guess someone should turn it into a patch

>
> Should we make a distinction between usage and functional classes of
> errors in this? The "usage" errors will probably not result in the pages
> being tossed out, but the functional ones almost certainly will...

Maybe.  I think it is a useful distinction, but to be consistent it
would be best to make it in all (section 2) man pages.  Maybe not all at
once though.  It is really up to Michael if that is a direction he is
interesting in following.

NeilBrown


>
> -- 
> Jeff Layton 


signature.asc
Description: PGP signature

Re: [PATCH] fsync.2: ERRORS: add EIO and ENOSPC

2020-09-09 Thread NeilBrown

On Tue, Sep 08 2020, Jeff Layton wrote:

> On Tue, 2020-09-08 at 13:27 +0200, Jan Kara wrote:
>> Added Jeff to CC since he has written the code...
>> 
>> On Mon 07-09-20 09:11:06, Michael Kerrisk (man-pages) wrote:
>> > [Widening the CC to include Andrew and linux-fsdevel@]
>> > [Milan: thanks for the patch, but it's unclear to me from your commit
>> > message how/if you verified the details.]
>> > 
>> > Andrew, maybe you (or someone else) can comment, since long ago your
>> > 
>> > commit f79e2abb9bd452d97295f34376dedbec9686b986
>> > Author: Andrew Morton 
>> > Date:   Fri Mar 31 02:30:42 2006 -0800
>> > 
>> > included a comment that is referred to in  stackoverflow discussion
>> > about this topic (that SO discussion is in turn referred to by
>> > https://bugzilla.kernel.org/show_bug.cgi?id=194757).
>> > 
>> > The essence as I understand it, is this:
>> > (1) fsync() (and similar) may fail EIO or ENOSPC, at which point data
>> > has not been synced.
>> > (2) In this case, the EIO/ENOSPC setting is cleared so that...
>> > (3) A subsequent fsync() might return success, but...
>> > (4) That doesn't mean that the data in (1) landed on the disk.
>> 
>> Correct.
>> 
>> > The proposed manual page patch below wants to document this, but I'd
>> > be happy to have an FS-knowledgeable person comment before I apply.
>> 
>> Just a small comment below:
>> 
>> > On Sat, 29 Aug 2020 at 09:13,  wrote:
>> > > From: Milan Shah 
>> > > 
>> > > This Fix addresses Bug 194757.
>> > > Ref: https://bugzilla.kernel.org/show_bug.cgi?id=194757
>> > > ---
>> > >  man2/fsync.2 | 13 +
>> > >  1 file changed, 13 insertions(+)
>> > > 
>> > > diff --git a/man2/fsync.2 b/man2/fsync.2
>> > > index 96401cd..f38b3e4 100644
>> > > --- a/man2/fsync.2
>> > > +++ b/man2/fsync.2
>> > > @@ -186,6 +186,19 @@ In these cases disk caches need to be disabled using
>> > >  or
>> > >  .BR sdparm (8)
>> > >  to guarantee safe operation.
>> > > +
>> > > +When
>> > > +.BR fsync ()
>> > > +or
>> > > +.BR fdatasync ()
>> > > +returns
>> > > +.B EIO
>> > > +or
>> > > +.B ENOSPC
>> > > +any error flags on pages in the file mapping are cleared, so subsequent 
>> > > synchronisation attempts
>> > > +will return without error. It is
>> > > +.I not
>> > > +safe to retry synchronisation and assume that a non-error return means 
>> > > prior writes are now on disk.
>> > >  .SH SEE ALSO
>> > >  .BR sync (1),
>> > >  .BR bdflush (2),
>> 
>> So the error state isn't really stored "on pages in the file mapping".
>> Current implementation (since 4.14) is that error state is stored in struct
>> file (I think this tends to be called "file description" in manpages) and
>> so EIO / ENOSPC is reported once for each file description of the file that
>> was open before the error happened. Not sure if we want to be so precise in
>> the manpages or if it just confuses people. Anyway your takeway that no
>> error on subsequent fsync() does not mean data was written is correct.
>> 
>> 
>
> Thinking about it more, I think we ought to spell this out explicitly as
> we can in the manpage. This is a point of confusion for a lot of people
> and not understanding this can lead to data integrity bugs. Maybe
> something like this in the NOTES section?
>
> '''
> When fsync returns an error, the file is considered to be "clean". A
> subsequent call to fsync will not result in a reattempt to write out the
> data, unless that data has been rewritten. Applications that want to
> reattempt writing to the file after a transient error must re-write
> their data.
> '''
>
> To be clear:
>
> In practice, you'd only have to write enough to redirty each page in
> most cases.

Nonononono.  In practice you have to repeat the entire write because you
cannot know if the cached page is from before the write failure, or has
since been flushed and reloaded.

>
> Also, it is hard to claim that the above behavior is universally true. A
> filesystem could opt to keep the pages dirty for some errors, but the
> vast majority just toss out the data whenever there is a writeback
> problem.

...and any filesystem that doesn't behave that way is wasting effort,
because nothing else can be assumed.

Regarding your "NOTES" addition, I don't feel comfortable with the
"clean" language.  I would prefer something like:

 When fsync() reports a failure (EIO, ENOSPC, EDQUOT) it must be assumed
 that any write requests initiated since the previous successful fsync
 was initiated may have failed, and that any cached data may have been
 lost.  A future fsync() will not attempt to write out the same data
 again.  If recovery is possible and desired, the application must
 repeat all the writes that may have failed.

 If the regions of a file that were written to prior to a failed fsync()
 are read, the content reported may not reflect the stored content, and
 subsequent reads may revert to the stored content at any time.

NeilBrown


>
>
> -- 
> Jeff Layton 


signature.asc
Description: PGP signature

Re: [PATCH] fsync.2: ERRORS: add EIO and ENOSPC

2020-09-09 Thread NeilBrown

On Tue, Sep 08 2020, Jeff Layton wrote:

>
> Yep.
>
> My only comment is that there is nothing special about EIO and ENOSPC.

There are two type of errors that fsync can return.
 EBADF EROFS EINVAL  - these are usage errors.
 EIO ENOSPC EDQUOT   - these are functional failures.

So I would say there *is* something special about those errors, though
it isn't *very* special, and it isn't *just* those errors. EDQUOT should
be included in the list.

NeilBrown

> All errors are the same in this regard. Basically, issuing a new fsync
> after a failed one doesn't do any good. You need to redirty the pages
> first.
> -- 
> Jeff Layton 

signature.asc
Description: PGP signature

Re: Minor RST rant

2020-07-24 Thread NeilBrown

On Fri, Jul 24 2020, Steven Rostedt wrote:

> On Fri, 24 Jul 2020 18:41:30 +0100
> Matthew Wilcox  wrote:
>
>> Great example.  Some people definitely go too far with rst markup, and
>> we generally try to discourage it.  And I'm pretty sure we take patches
>
> I'd send patches but I suck at markup ;-) [1]

Do you read Jane Austen at all?

   "I certainly have not the talent which some people possess," said
   Darcy, "of conversing easily with those I have never seen before.
   I cannot catch their tone of conversation, or appear interested
   in their concerns, as I often see done."

   "My fingers," said Elizabeth, "do not move over this instrument
   in the masterly manner which I see so many women's do.  They
   have not the same force or rapidity, and do not produce the
   same expression.  But then I have always supposed it to be my
   own fault--because I will not take the trouble of practising."

:-)
NeilBrown


>
>> to remove excessive markup where it's gone too far [1].
>> 
>> You can see how this renders in html at
>> https://www.kernel.org/doc/html/latest/filesystems/path-lookup.html or
>> run 'make htmldocs' to build it locally.  Personally, I don't think
>> the markup style it uses works very well in the html either.
>> 
>> I'd like to see this paragraph written as:
>> 
>> > It is tempting to describe the second kind as starting with a
>> > component, but that isn't always accurate: a pathname can lack both
>> > slashes and components, it can be empty, in other words.  This is
>> > generally forbidden in POSIX, but some of the "*at()" system calls
>> > in Linux permit it when the ``AT_EMPTY_PATH`` flag is given.  For
>> > example, if you have an open file descriptor on an executable file you
>> > can execute it by calling execveat() passing the file descriptor, an
>> > empty path, and the ``AT_EMPTY_PATH`` flag.  
>> 
>> I think we're all pretty comfortable seeing function names adorned with
>> a closing pair of parens.  The ``...`` to adorn constants feels OK to me,
>> but maybe not to you?  If that feels excessive, can you suggest something
>> that would distinguish between POSIX and AT_EMPTY_PATH?
>
> Honestly, it's the context that distinguishes the two for me. I don't
> need any markup. But yeah, the double backtick still seems awkward.
> Funny thing is, markup like this:
>
>   AT_EMPTY_PATH
>
> doesn't bother me as much. Not sure why though :-/
>
> My frustration with this stood out quite a bit because I went from one
> file (with the same name) in .txt format, and went through that fast and
> quickly where everything made a lot of sense, and then jumping to this
> file, and feeling like I came to a stand-still in my understanding of
> the material.
>
>> 
>> [1] Too far being a subjective measure, of course.  My preferences
>> are on display in core-api/xarray.rst
>
> [1] I maintain trace/ftrace.rst, but the markup in that was written by
> others, and I gave a lot of pushback when I found that the markup made
> it hard to read with "less".
>
> -- Steve


signature.asc
Description: PGP signature

Re: Minor RST rant

2020-07-24 Thread NeilBrown

On Fri, Jul 24 2020, Steven Rostedt wrote:

> On Fri, 24 Jul 2020 11:33:25 -0600
> Jonathan Corbet  wrote:
>
>> Give people a tool, some of them will make more use of it than you might
>> like. I do my best to push back against excessive markup (which all of the
>> above qualifies as, as far as I'm concerned), but I can't really even do
>> that will all that goes through my tree, much less all the docs stuff
>> merged by others.
>> 
>> The markup in question was seemingly added by Neil; I've added him to CC
>> in case he wants to comment on it.
>
> I saw Neil as the author and should have Cc'd him.
>
> Neil, you can read my full email here:
>
>   https://lore.kernel.org/r/20200724132200.51fd2...@oasis.local.home
>
>> 
>> I'm not sure what to do other than to continue to push for minimal use of
>> intrusive markup.
>
> Yeah, I really didn't expect an action item to come from this. It was
> just some feedback, and perhaps you can use this as an example of "too
> much markup" when dealing with others.
>
> Looking at the web page that Matthew pointed out to, does make it much
> easier to read. But one still needs to remember that a large audience
> of this work is still those of us that will read the plain text.
>
> My viewer of choice is "less" ;-)
>
> -- Steve

Hi Steven,
 thanks for your rants - and under appreciated art form I believe.

 Short answer is: I don't much care and if someone were to remove the
 markup, I possibly wouldn't even notice and certainly wouldn't object.

 Longer answer is that I think visual appearance is important.  I
 originally wrote that document as an article for lwn.net.  I wanted the
 code fragments - even when a single word like AT_EMPTY_PATH - to be
 rendered like code (constant-width font).  The markdown markup for that
 is  `code goes here`, so that is what I sent to Jon (he graciously uses
 a markdown-to-html tool to munge what I send), and that is what I
 placed in Documentation.
 I subsequently converted this to rst (7bbfd9ad8eb2) which involved
 converting single ` to double ``.
 I agree this was not an improvement when reading the text, but maybe
 that is the price of using rst.  Or maybe the price is not being able
 to say what you mean.

 A brief look over the file suggests that most uses of `` are to
 highlight one of:
- paths
- function names
- constant names.

 All these must be used widely throughout the documentation - whatever
 is the common standard should definitely be imposed.
 Constant names stand out least effectively by themselves.  In
 kernel-doc comments they are preceded by a '%'.  Would that make the
 text more readable for you?  Does our doc infrastructure honour that in
 .rst documents?

Thanks,
NeilBrown

signature.asc
Description: PGP signature

Re: [PATCH 09/23] md: rewrite md_setup_drive to avoid ioctls

2020-07-15 Thread NeilBrown

On Tue, Jul 14 2020, Christoph Hellwig wrote:

> md_setup_drive knows it works with md devices, so it is rather pointless
> to open a file descriptor and issue ioctls.  Just call directly into the
> relevant low-level md routines after getting a handle to the device using
> blkdev_get_by_dev instead.
>
> Signed-off-by: Christoph Hellwig 
> Acked-by: Song Liu 
> ---
>  drivers/md/md-autodetect.c | 127 -
>  drivers/md/md.c|  20 +++---
>  drivers/md/md.h|   6 ++
>  3 files changed, 71 insertions(+), 82 deletions(-)
>
> diff --git a/drivers/md/md-autodetect.c b/drivers/md/md-autodetect.c
> index a43a8f1580584c..5b24b5616d3acc 100644
> --- a/drivers/md/md-autodetect.c
> +++ b/drivers/md/md-autodetect.c
> @@ -2,7 +2,6 @@
>  #include 
>  #include 
>  #include 
> -#include 
>  #include 
>  #include 
>  #include 
> @@ -120,37 +119,29 @@ static int __init md_setup(char *str)
>   return 1;
>  }
>  
> -static inline int create_dev(char *name, dev_t dev)
> -{
> - ksys_unlink(name);
> - return ksys_mknod(name, S_IFBLK|0600, new_encode_dev(dev));
> -}
> -
>  static void __init md_setup_drive(struct md_setup_args *args)
>  {
> - int minor, i, partitioned;
> - dev_t dev;
> - dev_t devices[MD_SB_DISKS+1];
> - int fd;
> - int err = 0;
> - char *devname;
> - mdu_disk_info_t dinfo;
> + char *devname = args->device_names;
> + dev_t devices[MD_SB_DISKS + 1], mdev;
> + struct mdu_array_info_s ainfo = { };
> + struct block_device *bdev;
> + struct mddev *mddev;
> + int err = 0, i;
>   char name[16];
>  
> - minor = args->minor;
> - partitioned = args->partitioned;
> - devname = args->device_names;
> + if (args->partitioned) {
> + mdev = MKDEV(mdp_major, args->minor << MdpMinorShift);
> + sprintf(name, "md_d%d", args->minor);
> + } else {
> + mdev = MKDEV(MD_MAJOR, args->minor);
> + sprintf(name, "md%d", args->minor);
> + }
>  
> - sprintf(name, "/dev/md%s%d", partitioned?"_d":"", minor);
> - if (partitioned)
> - dev = MKDEV(mdp_major, minor << MdpMinorShift);
> - else
> - dev = MKDEV(MD_MAJOR, minor);
> - create_dev(name, dev);
>   for (i = 0; i < MD_SB_DISKS && devname != NULL; i++) {
>   struct kstat stat;
>   char *p;
>   char comp_name[64];
> + dev_t dev;
>  
>   p = strchr(devname, ',');
>   if (p)
> @@ -163,7 +154,7 @@ static void __init md_setup_drive(struct md_setup_args 
> *args)
>   if (vfs_stat(comp_name, ) == 0 && S_ISBLK(stat.mode))
>   dev = new_decode_dev(stat.rdev);
>   if (!dev) {
> - printk(KERN_WARNING "md: Unknown device name: %s\n", 
> devname);
> + pr_warn("md: Unknown device name: %s\n", devname);
>   break;
>   }
>  
> @@ -175,68 +166,64 @@ static void __init md_setup_drive(struct md_setup_args 
> *args)
>   if (!i)
>   return;
>  
> - printk(KERN_INFO "md: Loading md%s%d: %s\n",
> - partitioned ? "_d" : "", minor,
> - args->device_names);
> + pr_info("md: Loading %s: %s\n", name, args->device_names);
>  
> - fd = ksys_open(name, 0, 0);
> - if (fd < 0) {
> - printk(KERN_ERR "md: open failed - cannot start "
> - "array %s\n", name);
> + bdev = blkdev_get_by_dev(mdev, FMODE_READ, NULL);
> + if (IS_ERR(bdev)) {
> + pr_err("md: open failed - cannot start array %s\n", name);
>   return;
>   }
> - if (ksys_ioctl(fd, SET_ARRAY_INFO, 0) == -EBUSY) {
> - printk(KERN_WARNING
> -"md: Ignoring md=%d, already autodetected. (Use 
> raid=noautodetect)\n",
> -minor);
> - ksys_close(fd);
> - return;

I'd be more comfortable if you added something like
if (WARN(bdev->bd_disk->fops != md_fops,
 "Opening block device %x resulted in non-md device\"))
return;
here.  However even without that

Reviewed-by: NeilBrown 

Thanks,
NeilBrown


> + mddev = bdev->bd_disk->private_data;
> +
> + err = mddev_lock(mddev);
> + if (err) {
> + pr_err("md:

Re: [PATCH 08/23] md: simplify md_setup_drive

2020-07-15 Thread NeilBrown

On Tue, Jul 14 2020, Christoph Hellwig wrote:

> Move the loop over the possible arrays into the caller to remove a level
> of indentation for the whole function.
>
> Signed-off-by: Christoph Hellwig 
> Acked-by: Song Liu 

Nice

Reviewed-by: NeilBrown 

Thanks,
NeilBrown

> ---
>  drivers/md/md-autodetect.c | 203 ++---
>  1 file changed, 101 insertions(+), 102 deletions(-)
>
> diff --git a/drivers/md/md-autodetect.c b/drivers/md/md-autodetect.c
> index 6bc9b734eee6ff..a43a8f1580584c 100644
> --- a/drivers/md/md-autodetect.c
> +++ b/drivers/md/md-autodetect.c
> @@ -27,7 +27,7 @@ static int __initdata raid_noautodetect=1;
>  #endif
>  static int __initdata raid_autopart;
>  
> -static struct {
> +static struct md_setup_args {
>   int minor;
>   int partitioned;
>   int level;
> @@ -126,122 +126,117 @@ static inline int create_dev(char *name, dev_t dev)
>   return ksys_mknod(name, S_IFBLK|0600, new_encode_dev(dev));
>  }
>  
> -static void __init md_setup_drive(void)
> +static void __init md_setup_drive(struct md_setup_args *args)
>  {
> - int minor, i, ent, partitioned;
> + int minor, i, partitioned;
>   dev_t dev;
>   dev_t devices[MD_SB_DISKS+1];
> + int fd;
> + int err = 0;
> + char *devname;
> + mdu_disk_info_t dinfo;
> + char name[16];
>  
> - for (ent = 0; ent < md_setup_ents ; ent++) {
> - int fd;
> - int err = 0;
> - char *devname;
> - mdu_disk_info_t dinfo;
> - char name[16];
> + minor = args->minor;
> + partitioned = args->partitioned;
> + devname = args->device_names;
>  
> - minor = md_setup_args[ent].minor;
> - partitioned = md_setup_args[ent].partitioned;
> - devname = md_setup_args[ent].device_names;
> + sprintf(name, "/dev/md%s%d", partitioned?"_d":"", minor);
> + if (partitioned)
> + dev = MKDEV(mdp_major, minor << MdpMinorShift);
> + else
> + dev = MKDEV(MD_MAJOR, minor);
> + create_dev(name, dev);
> + for (i = 0; i < MD_SB_DISKS && devname != NULL; i++) {
> + struct kstat stat;
> + char *p;
> + char comp_name[64];
>  
> - sprintf(name, "/dev/md%s%d", partitioned?"_d":"", minor);
> - if (partitioned)
> - dev = MKDEV(mdp_major, minor << MdpMinorShift);
> - else
> - dev = MKDEV(MD_MAJOR, minor);
> - create_dev(name, dev);
> - for (i = 0; i < MD_SB_DISKS && devname != NULL; i++) {
> - struct kstat stat;
> - char *p;
> - char comp_name[64];
> + p = strchr(devname, ',');
> + if (p)
> + *p++ = 0;
>  
> - p = strchr(devname, ',');
> - if (p)
> - *p++ = 0;
> + dev = name_to_dev_t(devname);
> + if (strncmp(devname, "/dev/", 5) == 0)
> + devname += 5;
> + snprintf(comp_name, 63, "/dev/%s", devname);
> + if (vfs_stat(comp_name, ) == 0 && S_ISBLK(stat.mode))
> + dev = new_decode_dev(stat.rdev);
> + if (!dev) {
> + printk(KERN_WARNING "md: Unknown device name: %s\n", 
> devname);
> + break;
> + }
>  
> - dev = name_to_dev_t(devname);
> - if (strncmp(devname, "/dev/", 5) == 0)
> - devname += 5;
> - snprintf(comp_name, 63, "/dev/%s", devname);
> - if (vfs_stat(comp_name, ) == 0 &&
> - S_ISBLK(stat.mode))
> - dev = new_decode_dev(stat.rdev);
> - if (!dev) {
> - printk(KERN_WARNING "md: Unknown device name: 
> %s\n", devname);
> - break;
> - }
> + devices[i] = dev;
> + devname = p;
> + }
> + devices[i] = 0;
>  
> - devices[i] = dev;
> + if (!i)
> + return;
>  
> - devname = p;
> - }
> - devices[i] = 0;
> + printk(KERN_INFO "md: Loading md%s%d: %s\n",
> + partitioned ? "_d" : "", min

Re: [PATCH 06/23] md: remove the autoscan partition re-read

2020-07-15 Thread NeilBrown

On Tue, Jul 14 2020, Christoph Hellwig wrote:

> devfs is long gone, and autoscan works just fine without this these days.
>
> Signed-off-by: Christoph Hellwig 
> Acked-by: Song Liu 

Happy to see this go!

Reviewed-by: NeilBrown 

Thanks,
NeilBrown


> ---
>  drivers/md/md-autodetect.c | 10 --
>  1 file changed, 10 deletions(-)
>
> diff --git a/drivers/md/md-autodetect.c b/drivers/md/md-autodetect.c
> index 0eb746211ed53c..6bc9b734eee6ff 100644
> --- a/drivers/md/md-autodetect.c
> +++ b/drivers/md/md-autodetect.c
> @@ -240,16 +240,6 @@ static void __init md_setup_drive(void)
>   err = ksys_ioctl(fd, RUN_ARRAY, 0);
>   if (err)
>   printk(KERN_WARNING "md: starting md%d failed\n", 
> minor);
> - else {
> - /* reread the partition table.
> -  * I (neilb) and not sure why this is needed, but I 
> cannot
> -  * boot a kernel with devfs compiled in from 
> partitioned md
> -  * array without it
> -  */
> - ksys_close(fd);
> - fd = ksys_open(name, 0, 0);
> - ksys_ioctl(fd, BLKRRPART, 0);
> - }
>   ksys_close(fd);
>   }
>  }
> -- 
> 2.27.0


signature.asc
Description: PGP signature

Re: [PATCH 05/23] md: replace the RAID_AUTORUN ioctl with a direct function call

2020-07-15 Thread NeilBrown

On Tue, Jul 14 2020, Christoph Hellwig wrote:

> Instead of using a spcial RAID_AUTORUN ioctl that only exists for
> non-modular builds and is only called from the early init code, just
> call the actual function directly.

I think there was a tool in the old 'mdutls' that would call this ioctl
from user-space, but I cannot find a copy of that online, so I cannot be
sure not that it really matters.

Reviewed-by: NeilBrown 

Thanks,
NeilBrown


>
> Signed-off-by: Christoph Hellwig 
> Acked-by: Song Liu 
> ---
>  drivers/md/md-autodetect.c | 10 ++
>  drivers/md/md.c| 14 +-
>  drivers/md/md.h|  3 +++
>  3 files changed, 6 insertions(+), 21 deletions(-)
>
> diff --git a/drivers/md/md-autodetect.c b/drivers/md/md-autodetect.c
> index fe806f7b9759a1..0eb746211ed53c 100644
> --- a/drivers/md/md-autodetect.c
> +++ b/drivers/md/md-autodetect.c
> @@ -9,6 +9,7 @@
>  #include 
>  #include 
>  #include 
> +#include "md.h"
>  
>  /*
>   * When md (and any require personalities) are compiled into the kernel
> @@ -285,8 +286,6 @@ __setup("md=", md_setup);
>  
>  static void __init autodetect_raid(void)
>  {
> - int fd;
> -
>   /*
>* Since we don't want to detect and use half a raid array, we need to
>* wait for the known devices to complete their probing
> @@ -295,12 +294,7 @@ static void __init autodetect_raid(void)
>   printk(KERN_INFO "md: If you don't use raid, use raid=noautodetect\n");
>  
>   wait_for_device_probe();
> -
> - fd = ksys_open("/dev/md0", 0, 0);
> - if (fd >= 0) {
> - ksys_ioctl(fd, RAID_AUTORUN, raid_autopart);
> - ksys_close(fd);
> - }
> + md_autostart_arrays(raid_autopart);
>  }
>  
>  void __init md_run_setup(void)
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index f567f536b529bd..6e9a48da474848 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -68,10 +68,6 @@
>  #include "md-bitmap.h"
>  #include "md-cluster.h"
>  
> -#ifndef MODULE
> -static void autostart_arrays(int part);
> -#endif
> -
>  /* pers_list is a list of registered personalities protected
>   * by pers_lock.
>   * pers_lock does extra service to protect accesses to
> @@ -7421,7 +7417,6 @@ static inline bool md_ioctl_valid(unsigned int cmd)
>   case GET_DISK_INFO:
>   case HOT_ADD_DISK:
>   case HOT_REMOVE_DISK:
> - case RAID_AUTORUN:
>   case RAID_VERSION:
>   case RESTART_ARRAY_RW:
>   case RUN_ARRAY:
> @@ -7467,13 +7462,6 @@ static int md_ioctl(struct block_device *bdev, fmode_t 
> mode,
>   case RAID_VERSION:
>   err = get_version(argp);
>   goto out;
> -
> -#ifndef MODULE
> - case RAID_AUTORUN:
> - err = 0;
> - autostart_arrays(arg);
> - goto out;
> -#endif
>   default:;
>   }
>  
> @@ -9721,7 +9709,7 @@ void md_autodetect_dev(dev_t dev)
>   }
>  }
>  
> -static void autostart_arrays(int part)
> +void md_autostart_arrays(int part)
>  {
>   struct md_rdev *rdev;
>   struct detected_devices_node *node_detected_dev;
> diff --git a/drivers/md/md.h b/drivers/md/md.h
> index 612814d07d35ab..37315a3f28e97d 100644
> --- a/drivers/md/md.h
> +++ b/drivers/md/md.h
> @@ -800,4 +800,7 @@ static inline void mddev_check_write_zeroes(struct mddev 
> *mddev, struct bio *bio
>   !bio->bi_disk->queue->limits.max_write_zeroes_sectors)
>   mddev->queue->limits.max_write_zeroes_sectors = 0;
>  }
> +
> +void md_autostart_arrays(int part);
> +
>  #endif /* _MD_MD_H */
> -- 
> 2.27.0


signature.asc
Description: PGP signature

Re: [PATCH 03/23] init: remove the bstat helper

2020-07-15 Thread NeilBrown

On Tue, Jul 14 2020, Christoph Hellwig wrote:

> The only caller of the bstat function becomes cleaner and simpler when
> open coding the function.
>
> Signed-off-by: Christoph Hellwig 
> Acked-by: Song Liu 

Reviewed-by: NeilBrown 

Nice!

NeilBrown

> ---
>  init/do_mounts.h| 10 --
>  init/do_mounts_md.c |  8 
>  2 files changed, 4 insertions(+), 14 deletions(-)
>
> diff --git a/init/do_mounts.h b/init/do_mounts.h
> index 0bb0806de4ce2c..7513d1c14d13fe 100644
> --- a/init/do_mounts.h
> +++ b/init/do_mounts.h
> @@ -20,16 +20,6 @@ static inline int create_dev(char *name, dev_t dev)
>   return ksys_mknod(name, S_IFBLK|0600, new_encode_dev(dev));
>  }
>  
> -static inline u32 bstat(char *name)
> -{
> - struct kstat stat;
> - if (vfs_stat(name, ) != 0)
> - return 0;
> - if (!S_ISBLK(stat.mode))
> - return 0;
> - return stat.rdev;
> -}
> -
>  #ifdef CONFIG_BLK_DEV_RAM
>  
>  int __init rd_load_disk(int n);
> diff --git a/init/do_mounts_md.c b/init/do_mounts_md.c
> index b84031528dd446..359363e85ccd0b 100644
> --- a/init/do_mounts_md.c
> +++ b/init/do_mounts_md.c
> @@ -138,9 +138,9 @@ static void __init md_setup_drive(void)
>   dev = MKDEV(MD_MAJOR, minor);
>   create_dev(name, dev);
>   for (i = 0; i < MD_SB_DISKS && devname != NULL; i++) {
> + struct kstat stat;
>   char *p;
>   char comp_name[64];
> - u32 rdev;
>  
>   p = strchr(devname, ',');
>   if (p)
> @@ -150,9 +150,9 @@ static void __init md_setup_drive(void)
>   if (strncmp(devname, "/dev/", 5) == 0)
>   devname += 5;
>   snprintf(comp_name, 63, "/dev/%s", devname);
> - rdev = bstat(comp_name);
> - if (rdev)
> - dev = new_decode_dev(rdev);
> + if (vfs_stat(comp_name, ) == 0 &&
> + S_ISBLK(stat.mode))
> + dev = new_decode_dev(stat.rdev);
>   if (!dev) {
>   printk(KERN_WARNING "md: Unknown device name: 
> %s\n", devname);
>   break;
> -- 
> 2.27.0


signature.asc
Description: PGP signature

Re: [Ksummit-discuss] [PATCH] CodingStyle: Inclusive Terminology

2020-07-06 Thread NeilBrown

On Sat, Jul 04 2020, Matthew Wilcox wrote:

> Another suggestion for "slave" replacement should be "device". This is in
> the context of the w1 bus which is by far the largest user of the
> master/slave terminology in the kernel.

Ugh.  Please, no.  "device" doesn't mean anything, in that you can use
it to refer to any thing.  (i.e. it is almost semantically equivalent to
"thing").
Look in /sys/devices.  Everything in there is a device, and (nearly)
every thing is in there.

NeilBrown

>
> On Sat., Jul. 4, 2020, 16:19 Dan Williams,  wrote:
>
>> Recent events have prompted a Linux position statement on inclusive
>> terminology. Given that Linux maintains a coding-style and its own
>> idiomatic set of terminology here is a proposal to answer the call to
>> replace non-inclusive terminology.
>>
>> Cc: Jonathan Corbet 
>> Cc: Kees Cook 
>> Signed-off-by: Chris Mason 
>> Signed-off-by: Greg Kroah-Hartman 
>> Signed-off-by: Dan Williams 
>> ---
>>  Documentation/process/coding-style.rst  |   12 
>>  Documentation/process/inclusive-terminology.rst |   64
>> +++
>>  Documentation/process/index.rst |1
>>  3 files changed, 77 insertions(+)
>>  create mode 100644 Documentation/process/inclusive-terminology.rst
>>
>> diff --git a/Documentation/process/coding-style.rst
>> b/Documentation/process/coding-style.rst
>> index 2657a55c6f12..4b15ab671089 100644
>> --- a/Documentation/process/coding-style.rst
>> +++ b/Documentation/process/coding-style.rst
>> @@ -319,6 +319,18 @@ If you are afraid to mix up your local variable
>> names, you have another
>>  problem, which is called the function-growth-hormone-imbalance syndrome.
>>  See chapter 6 (Functions).
>>
>> +For symbol names, avoid introducing new usage of the words 'slave' and
>> +'blacklist'. Recommended replacements for 'slave' are: 'secondary',
>> +'subordinate', 'replica', 'responder', 'follower', 'proxy', or
>> +'performer'.  Recommended replacements for blacklist are: 'blocklist' or
>> +'denylist'.
>> +
>> +Exceptions for introducing new usage is to maintain a userspace ABI, or
>> +when updating code for an existing (as of 2020) hardware or protocol
>> +specification that mandates those terms. For new specifications consider
>> +translating specification usage of the terminology to the kernel coding
>> +standard where possible. See :ref:`process/inclusive-terminology.rst
>> +` for details.
>>
>>  5) Typedefs
>>  ---
>> diff --git a/Documentation/process/inclusive-terminology.rst
>> b/Documentation/process/inclusive-terminology.rst
>> new file mode 100644
>> index ..a8eb26690eb4
>> --- /dev/null
>> +++ b/Documentation/process/inclusive-terminology.rst
>> @@ -0,0 +1,64 @@
>> +.. _inclusiveterminology:
>> +
>> +Linux kernel inclusive terminology
>> +==
>> +
>> +The Linux kernel is a global software project, and in 2020 there was a
>> +global reckoning on race relations that caused many organizations to
>> +re-evaluate their policies and practices relative to the inclusion of
>> +people of African descent. This document describes why the 'Naming'
>> +section in :ref:`process/coding-style.rst ` recommends
>> +avoiding usage of 'slave' and 'blacklist' in new additions to the Linux
>> +kernel.
>> +
>> +On the triviality of replacing words
>> +
>> +
>> +The African slave trade was a brutal system of human misery deployed at
>> +global scale. Some word choice decisions in a modern software project
>> +does next to nothing to compensate for that legacy. So why put any
>> +effort into something so trivial in comparison? Because the goal is not
>> +to repair, or erase the past. The goal is to maximize availability and
>> +efficiency of the global developer community to participate in the Linux
>> +kernel development process.
>> +
>> +Word choice and developer efficiency
>> +
>> +
>> +Why does any software project go through the trouble of developing a
>> +document like :ref:`process/coding-style.rst `? It does so
>> +because a common coding style maximizes the efficiency of both
>> +maintainers and developers. Developers learn common design patterns and
>> +idiomatic expressions while maintainers can spot deviations from those
>> +norms. Even non-compliant whitespace is considered a leading indicator
>> +to deeper problems in a patchset

Re: [Ksummit-discuss] [PATCH] CodingStyle: Inclusive Terminology

2020-07-06 Thread NeilBrown

On Sat, Jul 04 2020, Matthew Wilcox wrote:

> Another suggestion for "slave" replacement should be "device". This is in
> the context of the w1 bus which is by far the largest user of the
> master/slave terminology in the kernel.

Ugh.  Please, no.  "device" doesn't mean anything, in that you can use
it to refer to any thing.  (i.e. it is almost semantically equivalent to
"thing").
Look in /sys/devices.  Everything in there is a device, and (nearly)
every thing is in there.

NeilBrown

>
> On Sat., Jul. 4, 2020, 16:19 Dan Williams,  wrote:
>
>> Recent events have prompted a Linux position statement on inclusive
>> terminology. Given that Linux maintains a coding-style and its own
>> idiomatic set of terminology here is a proposal to answer the call to
>> replace non-inclusive terminology.
>>
>> Cc: Jonathan Corbet 
>> Cc: Kees Cook 
>> Signed-off-by: Chris Mason 
>> Signed-off-by: Greg Kroah-Hartman 
>> Signed-off-by: Dan Williams 
>> ---
>>  Documentation/process/coding-style.rst  |   12 
>>  Documentation/process/inclusive-terminology.rst |   64
>> +++
>>  Documentation/process/index.rst |1
>>  3 files changed, 77 insertions(+)
>>  create mode 100644 Documentation/process/inclusive-terminology.rst
>>
>> diff --git a/Documentation/process/coding-style.rst
>> b/Documentation/process/coding-style.rst
>> index 2657a55c6f12..4b15ab671089 100644
>> --- a/Documentation/process/coding-style.rst
>> +++ b/Documentation/process/coding-style.rst
>> @@ -319,6 +319,18 @@ If you are afraid to mix up your local variable
>> names, you have another
>>  problem, which is called the function-growth-hormone-imbalance syndrome.
>>  See chapter 6 (Functions).
>>
>> +For symbol names, avoid introducing new usage of the words 'slave' and
>> +'blacklist'. Recommended replacements for 'slave' are: 'secondary',
>> +'subordinate', 'replica', 'responder', 'follower', 'proxy', or
>> +'performer'.  Recommended replacements for blacklist are: 'blocklist' or
>> +'denylist'.
>> +
>> +Exceptions for introducing new usage is to maintain a userspace ABI, or
>> +when updating code for an existing (as of 2020) hardware or protocol
>> +specification that mandates those terms. For new specifications consider
>> +translating specification usage of the terminology to the kernel coding
>> +standard where possible. See :ref:`process/inclusive-terminology.rst
>> +` for details.
>>
>>  5) Typedefs
>>  ---
>> diff --git a/Documentation/process/inclusive-terminology.rst
>> b/Documentation/process/inclusive-terminology.rst
>> new file mode 100644
>> index ..a8eb26690eb4
>> --- /dev/null
>> +++ b/Documentation/process/inclusive-terminology.rst
>> @@ -0,0 +1,64 @@
>> +.. _inclusiveterminology:
>> +
>> +Linux kernel inclusive terminology
>> +==
>> +
>> +The Linux kernel is a global software project, and in 2020 there was a
>> +global reckoning on race relations that caused many organizations to
>> +re-evaluate their policies and practices relative to the inclusion of
>> +people of African descent. This document describes why the 'Naming'
>> +section in :ref:`process/coding-style.rst ` recommends
>> +avoiding usage of 'slave' and 'blacklist' in new additions to the Linux
>> +kernel.
>> +
>> +On the triviality of replacing words
>> +
>> +
>> +The African slave trade was a brutal system of human misery deployed at
>> +global scale. Some word choice decisions in a modern software project
>> +does next to nothing to compensate for that legacy. So why put any
>> +effort into something so trivial in comparison? Because the goal is not
>> +to repair, or erase the past. The goal is to maximize availability and
>> +efficiency of the global developer community to participate in the Linux
>> +kernel development process.
>> +
>> +Word choice and developer efficiency
>> +
>> +
>> +Why does any software project go through the trouble of developing a
>> +document like :ref:`process/coding-style.rst `? It does so
>> +because a common coding style maximizes the efficiency of both
>> +maintainers and developers. Developers learn common design patterns and
>> +idiomatic expressions while maintainers can spot deviations from those
>> +norms. Even non-compliant whitespace is considered a leading indicator
>> +to deeper problems in a patchset

[PATCH 2/2] MM: Discard NR_UNSTABLE_NFS, use NR_WRITEBACK instead.

2020-05-31 Thread NeilBrown


After an NFS page has been written it is considered "unstable" until a
COMMIT request succeeds.  If the COMMIT fails, the page will be
re-written.

These "unstable" pages are currently accounted as "reclaimable", either
in WB_RECLAIMABLE, or in NR_UNSTABLE_NFS which is included in a
'reclaimable' count.  This might have made sense when sending the COMMIT
required a separate action by the VFS/MM (e.g.  releasepage() used to
send a COMMIT).  However now that all writes generated by ->writepages()
will automatically be followed by a COMMIT (since commit 919e3bd9a875
("NFS: Ensure we commit after writeback is complete")) it makes more
sense to treat them as writeback pages.

So this patch removes NR_UNSTABLE_NFS and accounts unstable pages in
NR_WRITEBACK and WB_WRITEBACK.

A particular effect of this change is that when
wb_check_background_flush() calls wb_over_bg_threshold(), the latter
will report 'true' a lot less often as the 'unstable' pages are no
longer considered 'dirty' (as there is nothing that writeback can do
about them anyway).

Currently wb_check_background_flush() will trigger writeback to NFS even
when there are relatively few dirty pages (if there are lots of unstable
pages), this can result in small writes going to the server (10s of
Kilobytes rather than a Megabyte) which hurts throughput.
With this patch, there are fewer writes which are each larger on average.

Where the NR_UNSTABLE_NFS count was included in statistics
virtual-files, the entry is retained, but the value is hard-coded as
zero.  static trace points and warning printks which mentioned this
counter no longer report it.

Reviewed-by: Jan Kara 
Reviewed-by: Christoph Hellwig 
Acked-by: Trond Myklebust 
Acked-by: Michal Hocko  # for MM parts
Signed-off-by: NeilBrown 
---
 Documentation/filesystems/proc.rst |  4 ++--
 drivers/base/node.c|  2 +-
 fs/fs-writeback.c  |  1 -
 fs/nfs/internal.h  | 10 +++---
 fs/nfs/write.c |  4 ++--
 fs/proc/meminfo.c  |  3 +--
 include/linux/mmzone.h |  1 -
 include/trace/events/writeback.h   |  5 +
 mm/memcontrol.c|  1 -
 mm/page-writeback.c| 17 -
 mm/page_alloc.c|  5 +
 mm/vmstat.c| 11 +--
 12 files changed, 28 insertions(+), 36 deletions(-)

diff --git a/Documentation/filesystems/proc.rst 
b/Documentation/filesystems/proc.rst
index 38b606991065..092b7b44d158 100644
--- a/Documentation/filesystems/proc.rst
+++ b/Documentation/filesystems/proc.rst
@@ -1042,8 +1042,8 @@ PageTables
   amount of memory dedicated to the lowest level of page
   tables.
 NFS_Unstable
-  NFS pages sent to the server, but not yet committed to stable
- storage
+  Always zero. Previous counted pages which had been written to
+  the server, but has not been committed to stable storage.
 Bounce
   Memory used for block device "bounce buffers"
 WritebackTmp
diff --git a/drivers/base/node.c b/drivers/base/node.c
index 10d7e818e118..15f5ed6a8830 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -439,7 +439,7 @@ static ssize_t node_read_meminfo(struct device *dev,
   nid, K(i.sharedram),
   nid, sum_zone_node_page_state(nid, NR_KERNEL_STACK_KB),
   nid, K(sum_zone_node_page_state(nid, NR_PAGETABLE)),
-  nid, K(node_page_state(pgdat, NR_UNSTABLE_NFS)),
+  nid, 0,
   nid, K(sum_zone_node_page_state(nid, NR_BOUNCE)),
   nid, K(node_page_state(pgdat, NR_WRITEBACK_TEMP)),
   nid, K(sreclaimable +
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 76ac9c7d32ec..c5bdf46e3b4b 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -1070,7 +1070,6 @@ static void bdi_split_work_to_wbs(struct backing_dev_info 
*bdi,
 static unsigned long get_nr_dirty_pages(void)
 {
return global_node_page_state(NR_FILE_DIRTY) +
-   global_node_page_state(NR_UNSTABLE_NFS) +
get_nr_dirty_inodes();
 }
 
diff --git a/fs/nfs/internal.h b/fs/nfs/internal.h
index 1f32a9fbfdaf..6673a77884d9 100644
--- a/fs/nfs/internal.h
+++ b/fs/nfs/internal.h
@@ -668,7 +668,8 @@ void nfs_super_set_maxbytes(struct super_block *sb, __u64 
maxfilesize)
 }
 
 /*
- * Record the page as unstable and mark its inode as dirty.
+ * Record the page as unstable (an extra writeback period) and mark its
+ * inode as dirty.
  */
 static inline
 void nfs_mark_page_unstable(struct page *page, struct nfs_commit_info *cinfo)
@@ -676,8 +677,11 @@ void nfs_mark_page_unstable(struct page *page, struct 
nfs_commit_info *cinfo)
if (!cinfo->dreq) {
struct inode *inode = page_file_mapping(page)->host;

[PATCH 1/2] MM: replace PF_LESS_THROTTLE with PF_LOCAL_THROTTLE

2020-05-31 Thread NeilBrown


PF_LESS_THROTTLE exists for loop-back nfsd (and a similar need in the
loop block driver and callers of prctl(PR_SET_IO_FLUSHER)), where a
daemon needs to write to one bdi (the final bdi) in order to free up
writes queued to another bdi (the client bdi).

The daemon sets PF_LESS_THROTTLE and gets a larger allowance of dirty
pages, so that it can still dirty pages after other processses have been
throttled.  The purpose of this is to avoid deadlock that happen when
the PF_LESS_THROTTLE process must write for any dirty pages to be freed,
but it is being thottled and cannot write.

This approach was designed when all threads were blocked equally,
independently on which device they were writing to, or how fast it was.
Since that time the writeback algorithm has changed substantially with
different threads getting different allowances based on non-trivial
heuristics.  This means the simple "add 25%" heuristic is no longer
reliable.

The important issue is not that the daemon needs a *larger* dirty page
allowance, but that it needs a *private* dirty page allowance, so that
dirty pages for the "client" bdi that it is helping to clear (the bdi for
an NFS filesystem or loop block device etc) do not affect the throttling
of the deamon writing to the "final" bdi.

This patch changes the heuristic so that the task is not throttled when
the bdi it is writing to has a dirty page count below below (or equal
to) the free-run threshold for that bdi.  This ensures it will always be
able to have some pages in flight, and so will not deadlock.

In a steady-state, it is expected that PF_LOCAL_THROTTLE tasks might
still be throttled by global threshold, but that is acceptable as it is
only the deadlock state that is interesting for this flag.

This approach of "only throttle when target bdi is busy" is consistent
with the other use of PF_LESS_THROTTLE in current_may_throttle(), were
it causes attention to be focussed only on the target bdi.

So this patch
 - renames PF_LESS_THROTTLE to PF_LOCAL_THROTTLE,
 - removes the 25% bonus that that flag gives, and
 - If PF_LOCAL_THROTTLE is set, don't delay at all unless the
   global and the local free-run thresholds are exceeded.

Note that previously realtime threads were treated the same as
PF_LESS_THROTTLE threads.  This patch does *not* change the behvaiour for
real-time threads, so it is now different from the behaviour of nfsd and
loop tasks.  I don't know what is wanted for realtime.

Reviewed-by: Jan Kara 
Acked-by: Chuck Lever  (for nfsd change)
Signed-off-by: NeilBrown 
---
 drivers/block/loop.c  |  2 +-
 fs/nfsd/vfs.c |  9 +
 include/linux/sched.h |  3 ++-
 kernel/sys.c  |  2 +-
 mm/page-writeback.c   | 41 +
 mm/vmscan.c   |  4 ++--
 6 files changed, 44 insertions(+), 17 deletions(-)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index da693e6a834e..d89c25ba3b89 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -919,7 +919,7 @@ static void loop_unprepare_queue(struct loop_device *lo)
 
 static int loop_kthread_worker_fn(void *worker_ptr)
 {
-   current->flags |= PF_LESS_THROTTLE | PF_MEMALLOC_NOIO;
+   current->flags |= PF_LOCAL_THROTTLE | PF_MEMALLOC_NOIO;
return kthread_worker_fn(worker_ptr);
 }
 
diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index 0aa02eb18bd3..c3fbab1753ec 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -979,12 +979,13 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh 
*fhp, struct nfsd_file *nf,
 
if (test_bit(RQ_LOCAL, >rq_flags))
/*
-* We want less throttling in balance_dirty_pages()
-* and shrink_inactive_list() so that nfs to
+* We want throttling in balance_dirty_pages()
+* and shrink_inactive_list() to only consider
+* the backingdev we are writing to, so that nfs to
 * localhost doesn't cause nfsd to lock up due to all
 * the client's dirty pages or its congested queue.
 */
-   current->flags |= PF_LESS_THROTTLE;
+   current->flags |= PF_LOCAL_THROTTLE;
 
exp = fhp->fh_export;
use_wgather = (rqstp->rq_vers == 2) && EX_WGATHER(exp);
@@ -1037,7 +1038,7 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh 
*fhp, struct nfsd_file *nf,
nfserr = nfserrno(host_err);
}
if (test_bit(RQ_LOCAL, >rq_flags))
-   current_restore_flags(pflags, PF_LESS_THROTTLE);
+   current_restore_flags(pflags, PF_LOCAL_THROTTLE);
return nfserr;
 }
 
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 4418f5cb8324..5955a089df32 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1481,7 +1481,8 @@ extern struct pid *cad_pid;
 #define PF_KSWAPD  0x0002  /* I am kswapd */
 #define PF_MEMA

Writeback fixes for NFS

2020-05-31 Thread NeilBrown


Hi Andrew,
 could you please queue these two patches (following).
 I think they have sufficient review and no remaining complaints.

Thanks,
NeilBrown


signature.asc
Description: PGP signature

[PATCH 2/2 V4] MM: Discard NR_UNSTABLE_NFS, use NR_WRITEBACK instead.

2020-05-13 Thread NeilBrown


After an NFS page has been written it is considered "unstable" until a
COMMIT request succeeds.  If the COMMIT fails, the page will be
re-written.

These "unstable" pages are currently accounted as "reclaimable", either
in WB_RECLAIMABLE, or in NR_UNSTABLE_NFS which is included in a
'reclaimable' count.  This might have made sense when sending the COMMIT
required a separate action by the VFS/MM (e.g.  releasepage() used to
send a COMMIT).  However now that all writes generated by ->writepages()
will automatically be followed by a COMMIT (since commit 919e3bd9a875
("NFS: Ensure we commit after writeback is complete")) it makes more
sense to treat them as writeback pages.

So this patch removes NR_UNSTABLE_NFS and accounts unstable pages in
NR_WRITEBACK and WB_WRITEBACK.

A particular effect of this change is that when
wb_check_background_flush() calls wb_over_bg_threshold(), the latter
will report 'true' a lot less often as the 'unstable' pages are no
longer considered 'dirty' (as there is nothing that writeback can do
about them anyway).

Currently wb_check_background_flush() will trigger writeback to NFS even
when there are relatively few dirty pages (if there are lots of unstable
pages), this can result in small writes going to the server (10s of
Kilobytes rather than a Megabyte) which hurts throughput.
With this patch, there are fewer writes which are each larger on average.

Where the NR_UNSTABLE_NFS count was included in statistics
virtual-files, the entry is retained, but the value is hard-coded as
zero.  static trace points and warning printks which mentioned this
counter no longer report it.

Reviewed-by: Christoph Hellwig 
Acked-by: Trond Myklebust 
Acked-by: Michal Hocko  # for MM parts
Signed-off-by: NeilBrown 
---
 Documentation/filesystems/proc.rst |  4 ++--
 drivers/base/node.c|  2 +-
 fs/fs-writeback.c  |  1 -
 fs/nfs/internal.h  | 10 +++---
 fs/nfs/write.c |  4 ++--
 fs/proc/meminfo.c  |  3 +--
 include/linux/mmzone.h |  1 -
 include/trace/events/writeback.h   |  5 +
 mm/memcontrol.c|  1 -
 mm/page-writeback.c| 17 -
 mm/page_alloc.c|  5 +
 mm/vmstat.c| 11 +--
 12 files changed, 28 insertions(+), 36 deletions(-)

diff --git a/Documentation/filesystems/proc.rst 
b/Documentation/filesystems/proc.rst
index 38b606991065..092b7b44d158 100644
--- a/Documentation/filesystems/proc.rst
+++ b/Documentation/filesystems/proc.rst
@@ -1042,8 +1042,8 @@ PageTables
   amount of memory dedicated to the lowest level of page
   tables.
 NFS_Unstable
-  NFS pages sent to the server, but not yet committed to stable
- storage
+  Always zero. Previous counted pages which had been written to
+  the server, but has not been committed to stable storage.
 Bounce
   Memory used for block device "bounce buffers"
 WritebackTmp
diff --git a/drivers/base/node.c b/drivers/base/node.c
index 10d7e818e118..15f5ed6a8830 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -439,7 +439,7 @@ static ssize_t node_read_meminfo(struct device *dev,
   nid, K(i.sharedram),
   nid, sum_zone_node_page_state(nid, NR_KERNEL_STACK_KB),
   nid, K(sum_zone_node_page_state(nid, NR_PAGETABLE)),
-  nid, K(node_page_state(pgdat, NR_UNSTABLE_NFS)),
+  nid, 0,
   nid, K(sum_zone_node_page_state(nid, NR_BOUNCE)),
   nid, K(node_page_state(pgdat, NR_WRITEBACK_TEMP)),
   nid, K(sreclaimable +
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 76ac9c7d32ec..c5bdf46e3b4b 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -1070,7 +1070,6 @@ static void bdi_split_work_to_wbs(struct backing_dev_info 
*bdi,
 static unsigned long get_nr_dirty_pages(void)
 {
return global_node_page_state(NR_FILE_DIRTY) +
-   global_node_page_state(NR_UNSTABLE_NFS) +
get_nr_dirty_inodes();
 }
 
diff --git a/fs/nfs/internal.h b/fs/nfs/internal.h
index 1f32a9fbfdaf..6673a77884d9 100644
--- a/fs/nfs/internal.h
+++ b/fs/nfs/internal.h
@@ -668,7 +668,8 @@ void nfs_super_set_maxbytes(struct super_block *sb, __u64 
maxfilesize)
 }
 
 /*
- * Record the page as unstable and mark its inode as dirty.
+ * Record the page as unstable (an extra writeback period) and mark its
+ * inode as dirty.
  */
 static inline
 void nfs_mark_page_unstable(struct page *page, struct nfs_commit_info *cinfo)
@@ -676,8 +677,11 @@ void nfs_mark_page_unstable(struct page *page, struct 
nfs_commit_info *cinfo)
if (!cinfo->dreq) {
struct inode *inode = page_file_mapping(page)->host;
 
-   inc_node_page_state(

[PATCH 1/2 V4] MM: replace PF_LESS_THROTTLE with PF_LOCAL_THROTTLE

2020-05-13 Thread NeilBrown


PF_LESS_THROTTLE exists for loop-back nfsd (and a similar need in the
loop block driver and callers of prctl(PR_SET_IO_FLUSHER)), where a
daemon needs to write to one bdi (the final bdi) in order to free up
writes queued to another bdi (the client bdi).

The daemon sets PF_LESS_THROTTLE and gets a larger allowance of dirty
pages, so that it can still dirty pages after other processses have been
throttled.  The purpose of this is to avoid deadlock that happen when
the PF_LESS_THROTTLE process must write for any dirty pages to be freed,
but it is being thottled and cannot write.

This approach was designed when all threads were blocked equally,
independently on which device they were writing to, or how fast it was.
Since that time the writeback algorithm has changed substantially with
different threads getting different allowances based on non-trivial
heuristics.  This means the simple "add 25%" heuristic is no longer
reliable.

The important issue is not that the daemon needs a *larger* dirty page
allowance, but that it needs a *private* dirty page allowance, so that
dirty pages for the "client" bdi that it is helping to clear (the bdi for
an NFS filesystem or loop block device etc) do not affect the throttling
of the deamon writing to the "final" bdi.

This patch changes the heuristic so that the task is not throttled when
the bdi it is writing to has a dirty page count below below (or equal
to) the free-run threshold for that bdi.  This ensures it will always be
able to have some pages in flight, and so will not deadlock.

In a steady-state, it is expected that PF_LOCAL_THROTTLE tasks might
still be throttled by global threshold, but that is acceptable as it is
only the deadlock state that is interesting for this flag.

This approach of "only throttle when target bdi is busy" is consistent
with the other use of PF_LESS_THROTTLE in current_may_throttle(), were
it causes attention to be focussed only on the target bdi.

So this patch
 - renames PF_LESS_THROTTLE to PF_LOCAL_THROTTLE,
 - removes the 25% bonus that that flag gives, and
 - If PF_LOCAL_THROTTLE is set, don't delay at all unless the
   global and the local free-run thresholds are exceeded.

Note that previously realtime threads were treated the same as
PF_LESS_THROTTLE threads.  This patch does *not* change the behvaiour for
real-time threads, so it is now different from the behaviour of nfsd and
loop tasks.  I don't know what is wanted for realtime.

Acked-by: Chuck Lever  (for nfsd change)
Signed-off-by: NeilBrown 
---
 drivers/block/loop.c  |  2 +-
 fs/nfsd/vfs.c |  9 +
 include/linux/sched.h |  3 ++-
 kernel/sys.c  |  2 +-
 mm/page-writeback.c   | 41 +
 mm/vmscan.c   |  4 ++--
 6 files changed, 44 insertions(+), 17 deletions(-)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index da693e6a834e..d89c25ba3b89 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -919,7 +919,7 @@ static void loop_unprepare_queue(struct loop_device *lo)
 
 static int loop_kthread_worker_fn(void *worker_ptr)
 {
-   current->flags |= PF_LESS_THROTTLE | PF_MEMALLOC_NOIO;
+   current->flags |= PF_LOCAL_THROTTLE | PF_MEMALLOC_NOIO;
return kthread_worker_fn(worker_ptr);
 }
 
diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index 0aa02eb18bd3..c3fbab1753ec 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -979,12 +979,13 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh 
*fhp, struct nfsd_file *nf,
 
if (test_bit(RQ_LOCAL, >rq_flags))
/*
-* We want less throttling in balance_dirty_pages()
-* and shrink_inactive_list() so that nfs to
+* We want throttling in balance_dirty_pages()
+* and shrink_inactive_list() to only consider
+* the backingdev we are writing to, so that nfs to
 * localhost doesn't cause nfsd to lock up due to all
 * the client's dirty pages or its congested queue.
 */
-   current->flags |= PF_LESS_THROTTLE;
+   current->flags |= PF_LOCAL_THROTTLE;
 
exp = fhp->fh_export;
use_wgather = (rqstp->rq_vers == 2) && EX_WGATHER(exp);
@@ -1037,7 +1038,7 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh 
*fhp, struct nfsd_file *nf,
nfserr = nfserrno(host_err);
}
if (test_bit(RQ_LOCAL, >rq_flags))
-   current_restore_flags(pflags, PF_LESS_THROTTLE);
+   current_restore_flags(pflags, PF_LOCAL_THROTTLE);
return nfserr;
 }
 
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 4418f5cb8324..5955a089df32 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1481,7 +1481,8 @@ extern struct pid *cad_pid;
 #define PF_KSWAPD  0x0002  /* I am kswapd */
 #define PF_MEMALLOC_NOFS   0x000

Re: [PATCH 1/2 V3] MM: replace PF_LESS_THROTTLE with PF_LOCAL_THROTTLE

2020-05-13 Thread NeilBrown


I thought about this some more and come up with another "simple"
approach that didn't require me understanding too much code, but does -
I think - address your concerns.

I've changed the heuristic to avoid any throttling on PF_LOCAL_THROTTLE
task if:
 - the global dirty count is below the global free-run threshold.  The
   code did this already.
 - (or) the per-wb dirty count is below the per-wb free-run threshold.
   This is the change.

This means that:
 - in a steady stated, all bdis will be throttled based on their (steady
   state) throughput, which is equally appropriate for PF_LOCAL_THROTTLE
   tasks.
 - a PF_LOCAL_THROTTLE task will never be *completely* blocked by dirty
   pages queued for other devices.  This means no deadlock, and that is
   the primary purpose of PF_LOCAL_THROTTLE.
 - when writes through the PF_LOCAL_THROTTLE task start up from idle -
   when there is no current throughput estimate - the PF_LOCAL_THROTTLE
   can be expected to get a fair share of the available memory, just as
   much as any other writer.  This was the possible problem with
   treating PF_LOCAL_THROTTLE just like BDI_CAP_STRICTLIMIT.

So I think this is a good solution.  Thoughts?
Patches follow - I've address the comment formatting issue.

Thanks,
NeilBrown


signature.asc
Description: PGP signature

Re: [PATCH 0/5] cachefiles, nfs: Fixes

2020-05-11 Thread NeilBrown

On Fri, May 08 2020, David Howells wrote:

> Hi Linus, Trond, Anna,
>
> Can you pull these fixes for cachefiles and NFS's use of fscache?  Should
> they go through the NFS tree or directly upstream?  The things fixed are:

hi David,
thanks for these fscache fixes.  Here is another for your consideration.

NeilBrown

From: NeilBrown 
Date: Tue, 12 May 2020 08:32:25 +1000
Subject: [PATCH] cachefiles: fix inverted ASSERTion.

bmap() returns a negative result precisely when a_ops->bmap is NULL.

A recent patch converted

   ASSERT(inode->i_mapping->a_ops->bmap);

to an assertion that bmap(inode, ...) returns a negative number.
This inverts the sense of the assertion.
So change it back : ASSERT(ret == 0)

Fixes: 10d83e11a582 ("cachefiles: drop direct usage of ->bmap method.")
Signed-off-by: NeilBrown 
---
 fs/cachefiles/rdwr.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/cachefiles/rdwr.c b/fs/cachefiles/rdwr.c
index 1dc97f2d6201..a4573c96660c 100644
--- a/fs/cachefiles/rdwr.c
+++ b/fs/cachefiles/rdwr.c
@@ -431,7 +431,7 @@ int cachefiles_read_or_alloc_page(struct fscache_retrieval 
*op,
block <<= shift;

ret = bmap(inode, );
-   ASSERT(ret < 0);
+   ASSERT(ret == 0);

_debug("%llx -> %llx",
   (unsigned long long) (page->index << shift),
-- 
2.26.2

signature.asc
Description: PGP signature

[PATCH] SUNRPC: fix use-after-free in rpc_free_client_work()

2020-05-08 Thread NeilBrown


Parts of rpc_free_client() were recently moved to
a separate rpc_free_clent_work().  This introduced
a use-after-free as rpc_clnt_remove_pipedir() calls
rpc_net_ns(), and that uses clnt->cl_xprt which has already
been freed.
So move the call to xprt_put() after the call to
rpc_clnt_remove_pipedir().

Reported-by: syzbot+22b5ef302c7c40d94...@syzkaller.appspotmail.com
Fixes: 7c4310ff5642 ("SUNRPC: defer slow parts of rpc_free_client() to a 
workqueue.")
Signed-off-by: NeilBrown 
---
 net/sunrpc/clnt.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/sunrpc/clnt.c b/net/sunrpc/clnt.c
index 8350d3a2e9a7..8d3972ea688b 100644
--- a/net/sunrpc/clnt.c
+++ b/net/sunrpc/clnt.c
@@ -890,6 +890,7 @@ static void rpc_free_client_work(struct work_struct *work)
 */
rpc_clnt_debugfs_unregister(clnt);
rpc_clnt_remove_pipedir(clnt);
+   xprt_put(rcu_dereference_raw(clnt->cl_xprt));
 
kfree(clnt);
rpciod_down();
@@ -907,7 +908,6 @@ rpc_free_client(struct rpc_clnt *clnt)
rpc_unregister_client(clnt);
rpc_free_iostats(clnt->cl_metrics);
clnt->cl_metrics = NULL;
-   xprt_put(rcu_dereference_raw(clnt->cl_xprt));
xprt_iter_destroy(>cl_xpi);
put_cred(clnt->cl_cred);
rpc_free_clid(clnt);
-- 
2.26.2



signature.asc
Description: PGP signature

Re: [PATCH AUTOSEL 5.4 26/35] SUNRPC: defer slow parts of rpc_free_client() to a workqueue.

2020-05-07 Thread NeilBrown

On Thu, May 07 2020, Sasha Levin wrote:

> From: NeilBrown 
>
> [ Upstream commit 7c4310ff56422ea43418305d22bbc5fe19150ec4 ]

This one is buggy - it introduces a use-after-free.  Best delay it for
now.

NeilBrown

>
> The rpciod workqueue is on the write-out path for freeing dirty memory,
> so it is important that it never block waiting for memory to be
> allocated - this can lead to a deadlock.
>
> rpc_execute() - which is often called by an rpciod work item - calls
> rcp_task_release_client() which can lead to rpc_free_client().
>
> rpc_free_client() makes two calls which could potentially block wating
> for memory allocation.
>
> rpc_clnt_debugfs_unregister() calls into debugfs and will block while
> any of the debugfs files are being accessed.  In particular it can block
> while any of the 'open' methods are being called and all of these use
> malloc for one thing or another.  So this can deadlock if the memory
> allocation waits for NFS to complete some writes via rpciod.
>
> rpc_clnt_remove_pipedir() can take the inode_lock() and while it isn't
> obvious that memory allocations can happen while the lock it held, it is
> safer to assume they might and to not let rpciod call
> rpc_clnt_remove_pipedir().
>
> So this patch moves these two calls (together with the final kfree() and
> rpciod_down()) into a work-item to be run from the system work-queue.
> rpciod can continue its important work, and the final stages of the free
> can happen whenever they happen.
>
> I have seen this deadlock on a 4.12 based kernel where debugfs used
> synchronize_srcu() when removing objects.  synchronize_srcu() requires a
> workqueue and there were no free workther threads and none could be
> allocated.  While debugsfs no longer uses SRCU, I believe the deadlock
> is still possible.
>
> Signed-off-by: NeilBrown 
> Signed-off-by: Trond Myklebust 
> Signed-off-by: Sasha Levin 
> ---
>  include/linux/sunrpc/clnt.h |  8 +++-
>  net/sunrpc/clnt.c   | 21 +
>  2 files changed, 24 insertions(+), 5 deletions(-)
>
> diff --git a/include/linux/sunrpc/clnt.h b/include/linux/sunrpc/clnt.h
> index abc63bd1be2b5..d99d39d45a494 100644
> --- a/include/linux/sunrpc/clnt.h
> +++ b/include/linux/sunrpc/clnt.h
> @@ -71,7 +71,13 @@ struct rpc_clnt {
>  #if IS_ENABLED(CONFIG_SUNRPC_DEBUG)
>   struct dentry   *cl_debugfs;/* debugfs directory */
>  #endif
> - struct rpc_xprt_itercl_xpi;
> + /* cl_work is only needed after cl_xpi is no longer used,
> +  * and that are of similar size
> +  */
> + union {
> + struct rpc_xprt_itercl_xpi;
> + struct work_struct  cl_work;
> + };
>   const struct cred   *cl_cred;
>  };
>  
> diff --git a/net/sunrpc/clnt.c b/net/sunrpc/clnt.c
> index f7f78566be463..a7430b66c7389 100644
> --- a/net/sunrpc/clnt.c
> +++ b/net/sunrpc/clnt.c
> @@ -877,6 +877,20 @@ EXPORT_SYMBOL_GPL(rpc_shutdown_client);
>  /*
>   * Free an RPC client
>   */
> +static void rpc_free_client_work(struct work_struct *work)
> +{
> + struct rpc_clnt *clnt = container_of(work, struct rpc_clnt, cl_work);
> +
> + /* These might block on processes that might allocate memory,
> +  * so they cannot be called in rpciod, so they are handled separately
> +  * here.
> +  */
> + rpc_clnt_debugfs_unregister(clnt);
> + rpc_clnt_remove_pipedir(clnt);
> +
> + kfree(clnt);
> + rpciod_down();
> +}
>  static struct rpc_clnt *
>  rpc_free_client(struct rpc_clnt *clnt)
>  {
> @@ -887,17 +901,16 @@ rpc_free_client(struct rpc_clnt *clnt)
>   rcu_dereference(clnt->cl_xprt)->servername);
>   if (clnt->cl_parent != clnt)
>   parent = clnt->cl_parent;
> - rpc_clnt_debugfs_unregister(clnt);
> - rpc_clnt_remove_pipedir(clnt);
>   rpc_unregister_client(clnt);
>   rpc_free_iostats(clnt->cl_metrics);
>   clnt->cl_metrics = NULL;
>   xprt_put(rcu_dereference_raw(clnt->cl_xprt));
>   xprt_iter_destroy(>cl_xpi);
> - rpciod_down();
>   put_cred(clnt->cl_cred);
>   rpc_free_clid(clnt);
> - kfree(clnt);
> +
> + INIT_WORK(>cl_work, rpc_free_client_work);
> + schedule_work(>cl_work);
>   return parent;
>  }
>  
> -- 
> 2.20.1


signature.asc
Description: PGP signature

Re: [PATCH] sunrpc: fix crash when cache_head become valid before update

2019-10-08 Thread NeilBrown

On Tue, Oct 08 2019,  J . Bruce Fields  wrote:

> On Tue, Oct 08, 2019 at 10:02:53AM +, Pavel Tikhomirov wrote:
>> Add Neil to CC, sorry, had lost it somehow...
>
> Always happy when we can fix a bug by deleting code, and your
> explanation makes sense to me, but I'll give Neil a chance to look it
> over if he wants.

Yes, it makes sense to me.  But I'm not sure that is worth much.  The
original fix got a Reviewed-by from me but was wrong.
 Acked-by: NeilBrown 

'Acked' is weaker than 'reviewed' - isn't it? :-)

NeilBrown

signature.asc
Description: PGP signature

Re: Bisected: Kernel 4.14 + has 3 times higher write IO latency than Kernel 4.4 with raid1

2019-08-19 Thread NeilBrown

On Fri, Aug 16 2019, Jinpu Wang wrote:

> On Wed, Aug 7, 2019 at 2:35 PM Jinpu Wang  wrote:
>>
>> On Wed, Aug 7, 2019 at 8:36 AM Jinpu Wang  wrote:
>> >
>> > On Wed, Aug 7, 2019 at 1:40 AM NeilBrown  wrote:
>> > >
>> > > On Tue, Aug 06 2019, Jinpu Wang wrote:
>> > >
>> > > > On Tue, Aug 6, 2019 at 9:54 AM Jinpu Wang  
>> > > > wrote:
>> > > >>
>> > > >> On Tue, Aug 6, 2019 at 1:46 AM NeilBrown  wrote:
>> > > >> >
>> > > >> > On Mon, Aug 05 2019, Jinpu Wang wrote:
>> > > >> >
>> > > >> > > Hi Neil,
>> > > >> > >
>> > > >> > > For the md higher write IO latency problem, I bisected it to 
>> > > >> > > these commits:
>> > > >> > >
>> > > >> > > 4ad23a97 MD: use per-cpu counter for writes_pending
>> > > >> > > 210f7cd percpu-refcount: support synchronous switch to atomic 
>> > > >> > > mode.
>> > > >> > >
>> > > >> > > Do you maybe have an idea? How can we fix it?
>> > > >> >
>> > > >> > Hmmm not sure.
>> > > >> Hi Neil,
>> > > >>
>> > > >> Thanks for reply, detailed result in line.
>> > >
>> > > Thanks for the extra testing.
>> > > ...
>> > > > [  105.133299] md md0 in_sync is 0, sb_flags 2, recovery 3, external
>> > > > 0, safemode 0, recovery_cp 524288
>> > > ...
>> > >
>> > > ahh - the resync was still happening.  That explains why set_in_sync()
>> > > is being called so often.  If you wait for sync to complete (or create
>> > > the array with --assume-clean) you should see more normal behaviour.
>> > I've updated my tests accordingly, thanks for the hint.
>> > >
>> > > This patch should fix it.  I think we can do better but it would be more
>> > > complex so no suitable for backports to -stable.
>> > >
>> > > Once you confirm it works, I'll send it upstream with a
>> > > Reported-and-Tested-by from you.
>> > >
>> > > Thanks,
>> > > NeilBrown
>> >
>> > Thanks a lot, Neil, my quick test show, yes, it fixed the problem for me.
>> >
>> > I will run more tests to be sure, will report back the test result.
>> Hi Neil,
>>
>> I've run our regression tests with your patch, everything works fine
>> as expected.
>>
>> So Reported-and-Tested-by: Jack Wang 
>>
>> Thank you for your quick fix.
>>
>> The patch should go to stable 4.12+
>
> Hi Neil,
>
> I hope you're doing well, just a soft ping? do you need further
> testing from my side?

Thanks for the reminder.  I've sent out the patch now.

NeilBrown


>
> Please let me know how can we move the fix forward.
>
> Thanks,
> Jack Wang


signature.asc
Description: PGP signature

Re: Bisected: Kernel 4.14 + has 3 times higher write IO latency than Kernel 4.4 with raid1

2019-08-06 Thread NeilBrown

On Tue, Aug 06 2019, Jinpu Wang wrote:

> On Tue, Aug 6, 2019 at 9:54 AM Jinpu Wang  wrote:
>>
>> On Tue, Aug 6, 2019 at 1:46 AM NeilBrown  wrote:
>> >
>> > On Mon, Aug 05 2019, Jinpu Wang wrote:
>> >
>> > > Hi Neil,
>> > >
>> > > For the md higher write IO latency problem, I bisected it to these 
>> > > commits:
>> > >
>> > > 4ad23a97 MD: use per-cpu counter for writes_pending
>> > > 210f7cd percpu-refcount: support synchronous switch to atomic mode.
>> > >
>> > > Do you maybe have an idea? How can we fix it?
>> >
>> > Hmmm not sure.
>> Hi Neil,
>>
>> Thanks for reply, detailed result in line.

Thanks for the extra testing.
...
> [  105.133299] md md0 in_sync is 0, sb_flags 2, recovery 3, external
> 0, safemode 0, recovery_cp 524288
...

ahh - the resync was still happening.  That explains why set_in_sync()
is being called so often.  If you wait for sync to complete (or create
the array with --assume-clean) you should see more normal behaviour.

This patch should fix it.  I think we can do better but it would be more
complex so no suitable for backports to -stable.

Once you confirm it works, I'll send it upstream with a
Reported-and-Tested-by from you.

Thanks,
NeilBrown


diff --git a/drivers/md/md.c b/drivers/md/md.c
index 24638ccedce4..624cf1ac43dc 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -8900,6 +8900,7 @@ void md_check_recovery(struct mddev *mddev)
 
if (mddev_trylock(mddev)) {
int spares = 0;
+   bool try_set_sync = mddev->safemode != 0;
 
if (!mddev->external && mddev->safemode == 1)
mddev->safemode = 0;
@@ -8945,7 +8946,7 @@ void md_check_recovery(struct mddev *mddev)
}
}
 
-   if (!mddev->external && !mddev->in_sync) {
+   if (try_set_sync && !mddev->external && !mddev->in_sync) {
spin_lock(>lock);
set_in_sync(mddev);
spin_unlock(>lock);


signature.asc
Description: PGP signature

Re: Bisected: Kernel 4.14 + has 3 times higher write IO latency than Kernel 4.4 with raid1

2019-08-05 Thread NeilBrown

On Mon, Aug 05 2019, Jinpu Wang wrote:

> Hi Neil,
>
> For the md higher write IO latency problem, I bisected it to these commits:
>
> 4ad23a97 MD: use per-cpu counter for writes_pending
> 210f7cd percpu-refcount: support synchronous switch to atomic mode.
>
> Do you maybe have an idea? How can we fix it?

Hmmm not sure.

My guess is that the set_in_sync() call from md_check_recovery()
is taking a long time, and is being called too often.

Could you try two experiments please.

1/ set  /sys/block/md0/md/safe_mode_delay 
   to 20 or more.  It defaults to about 0.2.

2/ comment out the call the set_in_sync() in md_check_recovery().

Then run the least separately after each of these changes.

I the second one makes a difference, I'd like to know how often it gets
called - and why.  The test
if ( ! (
(mddev->sb_flags & ~ (1<recovery) ||
test_bit(MD_RECOVERY_DONE, >recovery) ||
(mddev->external == 0 && mddev->safemode == 1) ||
(mddev->safemode == 2
 && !mddev->in_sync && mddev->recovery_cp == MaxSector)
))
return;

should normally return when doing lots of IO - I'd like to know
which condition causes it to not return.

Thanks,
NeilBrown

signature.asc
Description: PGP signature

Re: seq_file: fix problem when seeking mid-record.

2019-08-05 Thread NeilBrown

On Mon, Aug 05 2019, Markus Elfring wrote:

>> Fixes: 1f4aace60b0e ("fs/seq_file.c: simplify seq_file iteration code
>>  and interface")
>
> Please do not split this tag across multiple lines in the final commit 
> description.

I tend to agree...
I had previously seen
 "Possible unwrapped commit description (prefer a maximum 75 chars per 
line)\n"
warnings from checkpatch, but one closer look that doesn't apply to
Fixes: lines (among other special cases).

Maybe Andrew will fix it up for me when it applies  (please!)

Thanks,
NeilBrown

signature.asc
Description: PGP signature

[PATCH] seq_file: fix problem when seeking mid-record.

2019-08-04 Thread NeilBrown


If you use lseek or similar (e.g. pread) to access
a location in a seq_file file that is within a record,
rather than at a record boundary, then the first read
will return the remainder of the record, and the second
read will return the whole of that same record (instead
of the next record).
Whnn seeking to a record boundary, the next record is
correctly returned.

This bug was introduced by a recent patch (identified below)
Before that patch, seq_read() would increment m->index when
the last of the buffer was returned (m->count == 0).
After that patch, we rely on ->next to increment m->index
after filling the buffer - but there was one place where that
didn't happen.

Link: https://lkml.kernel.org/lkml/877e7xl029@notabene.neil.brown.name/
Reported-by-tested-by: Sergei Turchanov 
Fixes: 1f4aace60b0e ("fs/seq_file.c: simplify seq_file iteration code
and interface")
Cc: sta...@vger.kernel.org (v4.19+)
Signed-off-by: NeilBrown 
---

Hi Andrew: as you applied the offending patch for me, maybe you could
queue up this fix too.
Thanks,
NeilBrown

 fs/seq_file.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/seq_file.c b/fs/seq_file.c
index 04f09689cd6d..1600034a929b 100644
--- a/fs/seq_file.c
+++ b/fs/seq_file.c
@@ -119,6 +119,7 @@ static int traverse(struct seq_file *m, loff_t offset)
}
if (seq_has_overflowed(m))
goto Eoverflow;
+   p = m->op->next(m, p, >index);
if (pos + m->count > offset) {
m->from = offset - pos;
m->count -= m->from;
@@ -126,7 +127,6 @@ static int traverse(struct seq_file *m, loff_t offset)
}
pos += m->count;
m->count = 0;
-   p = m->op->next(m, p, >index);
if (pos == offset)
break;
}
-- 
2.14.0.rc0.dirty



signature.asc
Description: PGP signature

Re: [BUG] lseek on /proc/meminfo is broken in 4.19.59 maybe due to commit 1f4aace60b0e ("fs/seq_file.c: simplify seq_file iteration code and interface")

2019-08-01 Thread NeilBrown

On Thu, Aug 01 2019, Sergei Turchanov wrote:

> Hello!
>
> [
>   As suggested in previous discussion this behavior may be caused by your
>   commit 1f4aace60b0e ("fs/seq_file.c: simplify seq_file iteration code and 
> interface")
> ]

Yes I think I can see what happened.
 removing:
-   if (!m->count) {
-   m->from = 0;
-   m->index++;
-   }

from seq_read meant that ->index didn't get updated in a case that it
needs to be.

Please confirm that the following patch fixes the problem.
I think it is correct, but I need to look it over more carefully in the
morning, and see if I can explain why it is correct.

Thanks for the report.
NeilBrown

diff --git a/fs/seq_file.c b/fs/seq_file.c
index 04f09689cd6d..1600034a929b 100644
--- a/fs/seq_file.c
+++ b/fs/seq_file.c
@@ -119,6 +119,7 @@ static int traverse(struct seq_file *m, loff_t offset)
}
if (seq_has_overflowed(m))
goto Eoverflow;
+   p = m->op->next(m, p, >index);
if (pos + m->count > offset) {
m->from = offset - pos;
m->count -= m->from;
@@ -126,7 +127,6 @@ static int traverse(struct seq_file *m, loff_t offset)
}
pos += m->count;
m->count = 0;
-   p = m->op->next(m, p, >index);
if (pos == offset)
break;
}


>
> Original bug report:
>
> Seeking (to an offset within file size) in /proc/meminfo is broken in 
> 4.19.59. It does seek to a desired position, but reading from that position 
> returns the remainder of file and then a whole copy of file. This doesn't 
> happen with /proc/vmstat or /proc/self/maps for example.
>
> Seeking did work correctly in kernel 4.14.47. So it seems something broke in 
> the way.
>
> Background: this kind of access pattern (seeking to /proc/meminfo) is used by 
> libvirt-lxc fuse driver for virtualized view of /proc/meminfo. So that 
> /proc/meminfo is broken in guests when running kernel 4.19.x.
>
>  > On 01.08.2019 17:11, Gao Xiang wrote:
>> Hi,
>>
>> I just took a glance, maybe due to
>> commit 1f4aace60b0e ("fs/seq_file.c: simplify seq_file iteration code and 
>> interface")
>>
>> I simply reverted it just now and it seems fine... but I haven't digged into 
>> this commit.
>>
>> Maybe you could Cc NeilBrown  for some more advice and
>> I have no idea whether it's an expected behavior or not...
>>
>> Thanks,
>> Gao Xiang
>>
>> On 2019/8/1 14:16, Sergei Turchanov wrote:
>
>
> $ ./test /proc/meminfo 0    # Works as expected
>
> MemTotal:   394907728 kB
> MemFree:    173738328 kB
> ...
> DirectMap2M:    13062144 kB
> DirectMap1G:    390070272 kB
>
> ---
>
> $ ./test /proc/meminfo 1024 # returns a copy of file after the remainder
>
> Will seek to 1024
>
>
> Data read at offset 1024
> gePages: 0 kB
> ShmemHugePages:    0 kB
> ShmemPmdMapped:    0 kB
> HugePages_Total:   0
> HugePages_Free:    0
> HugePages_Rsvd:    0
> HugePages_Surp:    0
> Hugepagesize:   2048 kB
> Hugetlb:   0 kB
> DirectMap4k:  245204 kB
> DirectMap2M:    13062144 kB
> DirectMap1G:    390070272 kB
> MemTotal:   394907728 kB
> MemFree:    173738328 kB
> MemAvailable:   379989680 kB
> Buffers:  355812 kB
> Cached: 207216224 kB
> ...
> DirectMap2M:    13062144 kB
> DirectMap1G:    390070272 kB
>
> As you see, after "DirectMap1G:" line, a whole copy of /proc/meminfo returned 
> by "read".
>
> Test program:
>
> #include 
> #include 
> #include 
> #include 
> #include 
> #include 
>
> #define SIZE 1024
> char buf[SIZE + 1];
>
> int main(int argc, char *argv[]) {
>      int fd;
>      ssize_t rd;
>      off_t   ofs = 0;
>
>      if (argc < 2) {
>      printf("Usage: test  []\n");
>      exit(1);
>      }
>
>      if (-1 == (fd = open(argv[1], O_RDONLY))) {
>      perror("open failed");
>      exit(1);
>      }
>
>      if (argc > 2) {
>      ofs = atol(argv[2]);
>      }
>      printf("Will seek to %ld\n", ofs);
>
>      if (-1 == (lseek(fd, ofs, SEEK_SET))) {
>      perror("lseek failed");
>      exit(1);
>      }
>
>      for (;; ofs += rd) {
>      printf("\n\nData read at offset %ld\n", ofs);
>      if (-1 == (rd = read(fd, buf, SIZE))) {
>      perror("read failed");
>      exit(1);
>      }
>      buf[rd] = '\0';
>      printf(buf);
>      if (rd < SIZE) {
>      break;
>      }
>      }
>
>      return 0;
> }


signature.asc
Description: PGP signature

[PATCH 1/2] staging: mt7621-dts: update sdhci config.

2019-06-30 Thread NeilBrown

The mtk-sd driver has been updated to support
the IP in the mt7621, so update our configuration
to work with it.

Signed-off-by: NeilBrown 
---
 drivers/staging/mt7621-dts/mt7621.dtsi |   41 +++-
 1 file changed, 40 insertions(+), 1 deletion(-)

diff --git a/drivers/staging/mt7621-dts/mt7621.dtsi 
b/drivers/staging/mt7621-dts/mt7621.dtsi
index 9c90cac82efc..549ff5a0699e 100644
--- a/drivers/staging/mt7621-dts/mt7621.dtsi
+++ b/drivers/staging/mt7621-dts/mt7621.dtsi
@@ -43,6 +43,30 @@
clock-frequency = <22000>;
};
 
+   mmc_clock: mmc_clock@0 {
+   #clock-cells = <0>;
+   compatible = "fixed-clock";
+   clock-frequency = <4800>;
+   };
+
+   mmc_fixed_3v3: fixedregulator@0 {
+   compatible = "regulator-fixed";
+   regulator-name = "mmc_power";
+   regulator-min-microvolt = <330>;
+   regulator-max-microvolt = <330>;
+   enable-active-high;
+   regulator-always-on;
+ };
+
+ mmc_fixed_1v8_io: fixedregulator@1 {
+   compatible = "regulator-fixed";
+   regulator-name = "mmc_io";
+   regulator-min-microvolt = <180>;
+   regulator-max-microvolt = <180>;
+   enable-active-high;
+   regulator-always-on;
+   };
+
palmbus: palmbus@1E00 {
compatible = "palmbus";
reg = <0x1E00 0x10>;
@@ -299,9 +323,24 @@
sdhci: sdhci@1E13 {
status = "disabled";
 
-   compatible = "ralink,mt7620-sdhci";
+   compatible = "mediatek,mt7620-mmc";
reg = <0x1E13 0x4000>;
 
+   bus-width = <4>;
+   max-frequency = <4800>;
+   cap-sd-highspeed;
+   cap-mmc-highspeed;
+   vmmc-supply = <_fixed_3v3>;
+   vqmmc-supply = <_fixed_1v8_io>;
+   disable-wp;
+
+   pinctrl-names = "default", "state_uhs";
+   pinctrl-0 = <_pins>;
+   pinctrl-1 = <_pins>;
+
+   clocks = <_clock _clock>;
+   clock-names = "source", "hclk";
+
interrupt-parent = <>;
interrupts = ;
};

[PATCH 0/2] staging: update mt7621 dts for some recent driver changes

2019-06-30 Thread NeilBrown

The mt7621 MMC driver was recently removed from staging due to
copyright concerns.  Since then drivers/mmc/host/mtk-sd.c has been
enhanced to work with the mt7621 IP.  The first patch updates
the dts file to match this driver.

Earlier, the drivers/net/ethernet/mediatek/ driver was enhanced
to work with mt7621 hardware and the mt7621-eth driver was removed
from staging.  The second patch enhances the mt7621.dtsi to better
support this driver and particularly to allow the second network port
to be used in at least one of its possible configurations.

Thanks,
NeilBrown

---

NeilBrown (2):
  staging: mt7621-dts: update sdhci config.
  staging: mt7621-dts: add support for second network interface


 drivers/staging/mt7621-dts/Kconfig |7 
 drivers/staging/mt7621-dts/Makefile|1 +
 drivers/staging/mt7621-dts/gbpc1.dts   |2 +
 drivers/staging/mt7621-dts/gbpc2.dts   |   21 +
 drivers/staging/mt7621-dts/mt7621.dtsi |   53 +---
 5 files changed, 77 insertions(+), 7 deletions(-)
 create mode 100644 drivers/staging/mt7621-dts/gbpc2.dts

--
Signature

[PATCH 2/2] staging: mt7621-dts: add support for second network interface

2019-06-30 Thread NeilBrown

The mt7621 has two network interfaces, one that connects to an
internal switch, and one that can connect to either that switch
or an external phy, or possibly an internal phy.

The Gnubee-PC2 has an external phy for use with the second interface.

This patch add some support for the second interface to mt7621.dtsi
and add a gbpc2.dts which makes use of this.  This allows the second
interface to be used.

I don't fully understand how to configure this interface - the
documentation is thin - so there could well be room for improvement
here.

Signed-off-by: NeilBrown 
---
 drivers/staging/mt7621-dts/Kconfig |7 ++-
 drivers/staging/mt7621-dts/Makefile|1 +
 drivers/staging/mt7621-dts/gbpc1.dts   |2 +-
 drivers/staging/mt7621-dts/gbpc2.dts   |   21 +
 drivers/staging/mt7621-dts/mt7621.dtsi |   12 
 5 files changed, 37 insertions(+), 6 deletions(-)
 create mode 100644 drivers/staging/mt7621-dts/gbpc2.dts

diff --git a/drivers/staging/mt7621-dts/Kconfig 
b/drivers/staging/mt7621-dts/Kconfig
index 3ea08ab9d0d3..6932ab7acadf 100644
--- a/drivers/staging/mt7621-dts/Kconfig
+++ b/drivers/staging/mt7621-dts/Kconfig
@@ -1,6 +1,11 @@
 # SPDX-License-Identifier: GPL-2.0
 config DTB_GNUBEE1
-   bool "GnuBee1 NAS"
+   bool "GnuBee1 2.5inch NAS"
+   depends on SOC_MT7621 && DTB_RT_NONE
+   select BUILTIN_DTB
+
+config DTB_GNUBEE2
+   bool "GnuBee2 3.5inch NAS"
depends on SOC_MT7621 && DTB_RT_NONE
select BUILTIN_DTB
 
diff --git a/drivers/staging/mt7621-dts/Makefile 
b/drivers/staging/mt7621-dts/Makefile
index aeec48a4edc7..b4ab99fed932 100644
--- a/drivers/staging/mt7621-dts/Makefile
+++ b/drivers/staging/mt7621-dts/Makefile
@@ -1,4 +1,5 @@
 # SPDX-License-Identifier: GPL-2.0
 dtb-$(CONFIG_DTB_GNUBEE1)  += gbpc1.dtb
+dtb-$(CONFIG_DTB_GNUBEE2)  += gbpc2.dtb
 
 obj-y  += $(patsubst %.dtb, %.dtb.o, $(dtb-y))
diff --git a/drivers/staging/mt7621-dts/gbpc1.dts 
b/drivers/staging/mt7621-dts/gbpc1.dts
index 250c15ace2a7..1fb560ff059c 100644
--- a/drivers/staging/mt7621-dts/gbpc1.dts
+++ b/drivers/staging/mt7621-dts/gbpc1.dts
@@ -119,7 +119,7 @@
 
  {
state_default: pinctrl0 {
-   gpio {
+   default_gpio: gpio {
groups = "wdt", "rgmii2", "uart3";
function = "gpio";
};
diff --git a/drivers/staging/mt7621-dts/gbpc2.dts 
b/drivers/staging/mt7621-dts/gbpc2.dts
new file mode 100644
index ..52760e7351f6
--- /dev/null
+++ b/drivers/staging/mt7621-dts/gbpc2.dts
@@ -0,0 +1,21 @@
+/dts-v1/;
+
+#include "gbpc1.dts"
+
+/ {
+   compatible = "gnubee,gb-pc2", "mediatek,mt7621-soc";
+   model = "GB-PC2";
+};
+
+_gpio {
+   groups = "wdt", "uart3";
+   function = "gpio";
+};
+
+ {
+   status = "ok";
+};
+
+_external {
+   status = "ok";
+};
diff --git a/drivers/staging/mt7621-dts/mt7621.dtsi 
b/drivers/staging/mt7621-dts/mt7621.dtsi
index 549ff5a0699e..a4c08110094b 100644
--- a/drivers/staging/mt7621-dts/mt7621.dtsi
+++ b/drivers/staging/mt7621-dts/mt7621.dtsi
@@ -427,16 +427,20 @@
compatible = "mediatek,eth-mac";
reg = <1>;
status = "off";
-   phy-mode = "rgmii";
-   phy-handle = <>;
+   phy-mode = "rgmii-rxid";
+   phy-handle = <_external>;
};
mdio-bus {
#address-cells = <1>;
#size-cells = <0>;
 
-   phy5: ethernet-phy@5 {
+   phy_external: ethernet-phy@5 {
+   status = "off";
reg = <5>;
-   phy-mode = "rgmii";
+   phy-mode = "rgmii-rxid";
+
+   pinctrl-names = "default";
+   pinctrl-0 = <_pins>;
};
 
switch0: switch0@0 {

Re: [PATCH] overlayfs: ignore empty NFSv4 ACLs in ext4 upperdir

2019-05-06 Thread NeilBrown

On Fri, May 03 2019, J. Bruce Fields wrote:

> On Thu, May 02, 2019 at 12:02:33PM +1000, NeilBrown wrote:
>> On Tue, Dec 06 2016, J. Bruce Fields wrote:
>> 
>> > On Tue, Dec 06, 2016 at 02:18:31PM +0100, Andreas Gruenbacher wrote:
>> >> On Tue, Dec 6, 2016 at 11:08 AM, Miklos Szeredi  wrote:
>> >> > On Tue, Dec 6, 2016 at 12:24 AM, Andreas Grünbacher
>> >> >  wrote:
>> >> >> 2016-12-06 0:19 GMT+01:00 Andreas Grünbacher 
>> >> >> :
>> >> >
>> >> >>> It's not hard to come up with a heuristic that determines if a
>> >> >>> system.nfs4_acl value is equivalent to a file mode, and to ignore the
>> >> >>> attribute in that case. (The file mode is transmitted in its own
>> >> >>> attribute already, so actually converting .) That way, overlayfs could
>> >> >>> still fail copying up files that have an actual ACL. It's still an
>> >> >>> ugly hack ...
>> >> >>
>> >> >> Actually, that kind of heuristic would make sense in the NFS client
>> >> >> which could then hide the "system.nfs4_acl" attribute.
>> >> >
>> >> > Even simpler would be if knfsd didn't send the attribute if not
>> >> > necessary.  Looks like there's code actively creating the nfs4_acl on
>> >> > the wire even if the filesystem had none:
>> >> >
>> >> > pacl = get_acl(inode, ACL_TYPE_ACCESS);
>> >> > if (!pacl)
>> >> > pacl = posix_acl_from_mode(inode->i_mode, GFP_KERNEL);
>> >> >
>> >> > What's the point?
>> >> 
>> >> That's how the protocol is specified.
>> >
>> > Yep, even if we could make that change to nfsd it wouldn't help the
>> > client with the large number of other servers that are out there
>> > (including older knfsd's).
>> >
>> > --b.
>> >
>> >> (I'm not saying that that's very helpful.)
>> >> 
>> >> Andreas
>> 
>> Hi everyone.
>>  I have a customer facing this problem, and so stumbled onto the email
>>  thread.
>>  Unfortunately it didn't resolve anything.  Maybe I can help kick things
>>  along???
>> 
>>  The core problem here is that NFSv4 and ext4 use different and largely
>>  incompatible ACL implementations.  There is no way to accurately
>>  translate from one to the other in general (common specific examples
>>  can be converted).
>> 
>>  This means that either:
>>1/ overlayfs cannot use ext4 for upper and NFS for lower (or vice
>>   versa) or
>>2/ overlayfs need to accept that sometimes it cannot copy ACLs, and
>>   that is OK.
>> 
>>  Silently not copying the ACLs is probably not a good idea as it might
>>  result in inappropriate permissions being given away.  So if the
>>  sysadmin wants this (and some clearly do), they need a way to
>>  explicitly say "I accept the risk".
>
> So, I feel like silently copying ACLs up *also* carries a risk, if that
> means switching from server-enforcement to client-enforcement of those
> permissions.

Interesting perspective  though doesn't NFSv4 explicitly allow
client-side ACL enforcement in the case of delegations?
Not sure how relevant that is

It seems to me we have two options:
 1/ declare the NFSv4 doesn't work as a lower layer for overlayfs and
recommend people use NFSv3, or
 2/ Modify overlayfs to work with NFSv4 by ignoring nfsv4 ACLs either
 2a/ always - and ignore all other acls and probably all system. xattrs,
 or
 2b/ based on a mount option that might be
  2bi/ general "noacl" or might be
  2bii/ explicit "noxattr=system.nfs4acl"
 
I think that continuing to discuss the miniature of the options isn't
going to help.  No solution is perfect - we just need to clearly
document the implications of whatever we come up with.

I lean towards 2a, but I be happy with with any '2' and '1' won't kill
me.

Do we have a vote?  Or does someone make an executive decision??

NeilBrown


signature.asc
Description: PGP signature

[PATCH 3/4] mmc: mtk-sd: enable internal card-detect logic.

2019-05-04 Thread NeilBrown

The mtk-sd silicon has integrated card-detect logic that is
enabled on the MT7621.  The circuit is phased out on newer hardware so
we should be careful to only enabled it on hardware known to support
it.  This a new "use_internal_cd" flag in struct mtk_mmc_compatible.

If the sdhci isn't marked non-removable and doesn't have a
cd-gpio configured, and if use_internal_cd is set, then assume the
internal cd logic should be used as recommended by
 Documentation/devicetree/bindings/mmc/mmc.txt

Signed-off-by: NeilBrown 
---
 drivers/mmc/host/mtk-sd.c |   64 ++---
 1 file changed, 60 insertions(+), 4 deletions(-)

diff --git a/drivers/mmc/host/mtk-sd.c b/drivers/mmc/host/mtk-sd.c
index 0c2be4f54b1f..c518cc208a1f 100644
--- a/drivers/mmc/host/mtk-sd.c
+++ b/drivers/mmc/host/mtk-sd.c
@@ -300,6 +300,8 @@
 #define CMD_TIMEOUT (HZ/10 * 5)/* 100ms x5 */
 #define DAT_TIMEOUT (HZ* 5)/* 1000ms x5 */
 
+#define DEFAULT_DEBOUNCE   (8) /* 8 cycles CD debounce */
+
 #define PAD_DELAY_MAX  32 /* PAD delay cells */
 /*--*/
 /* Descriptor Structure */
@@ -372,6 +374,7 @@ struct mtk_mmc_compatible {
bool stop_clk_fix;
bool enhance_rx;
bool support_64g;
+   bool use_internal_cd;
 };
 
 struct msdc_tune_para {
@@ -430,6 +433,7 @@ struct msdc_host {
bool hs400_cmd_resp_sel_rising;
 /* cmd response sample selection for HS400 */
bool hs400_mode;/* current eMMC will run at hs400 mode */
+   bool internal_cd;   /* Use internal card-detect logic */
struct msdc_save_para save_para; /* used when gate HCLK */
struct msdc_tune_para def_tune_para; /* default tune setting */
struct msdc_tune_para saved_tune_para; /* tune result of CMD21/CMD19 */
@@ -526,6 +530,7 @@ static const struct mtk_mmc_compatible mt7620_compat = {
.busy_check = false,
.stop_clk_fix = false,
.enhance_rx = false,
+   .use_internal_cd = true,
 };
 
 static const struct of_device_id msdc_of_ids[] = {
@@ -1430,6 +1435,12 @@ static irqreturn_t msdc_irq(int irq, void *dev_id)
sdio_signal_irq(host->mmc);
}
 
+   if ((events & event_mask) & MSDC_INT_CDSC) {
+   if (host->internal_cd)
+   mmc_detect_change(host->mmc, 
msecs_to_jiffies(20));
+   events &= ~MSDC_INT_CDSC;
+   }
+
if (!(events & (event_mask & ~MSDC_INT_SDIOIRQ)))
break;
 
@@ -1463,14 +1474,24 @@ static void msdc_init_hw(struct msdc_host *host)
/* Reset */
msdc_reset_hw(host);
 
-   /* Disable card detection */
-   sdr_clr_bits(host->base + MSDC_PS, MSDC_PS_CDEN);
-
/* Disable and clear all interrupts */
writel(0, host->base + MSDC_INTEN);
val = readl(host->base + MSDC_INT);
writel(val, host->base + MSDC_INT);
 
+   /* Configure card detection */
+   if (host->internal_cd) {
+   sdr_set_field(host->base + MSDC_PS, MSDC_PS_CDDEBOUNCE,
+ DEFAULT_DEBOUNCE);
+   sdr_set_bits(host->base + MSDC_PS, MSDC_PS_CDEN);
+   sdr_set_bits(host->base + MSDC_INTEN, MSDC_INTEN_CDSC);
+   sdr_set_bits(host->base + SDC_CFG, SDC_CFG_INSWKUP);
+   } else {
+   sdr_clr_bits(host->base + SDC_CFG, SDC_CFG_INSWKUP);
+   sdr_clr_bits(host->base + MSDC_PS, MSDC_PS_CDEN);
+   sdr_clr_bits(host->base + MSDC_INTEN, MSDC_INTEN_CDSC);
+   }
+
if (host->top_base) {
writel(0, host->top_base + EMMC_TOP_CONTROL);
writel(0, host->top_base + EMMC_TOP_CMD);
@@ -1580,6 +1601,13 @@ static void msdc_init_hw(struct msdc_host *host)
 static void msdc_deinit_hw(struct msdc_host *host)
 {
u32 val;
+
+   if (host->internal_cd) {
+   /* Disabled card-detect */
+   sdr_clr_bits(host->base + MSDC_PS, MSDC_PS_CDEN);
+   sdr_clr_bits(host->base + SDC_CFG, SDC_CFG_INSWKUP);
+   }
+
/* Disable and clear all interrupts */
writel(0, host->base + MSDC_INTEN);
 
@@ -2078,13 +2106,31 @@ static void msdc_ack_sdio_irq(struct mmc_host *mmc)
__msdc_enable_sdio_irq(mmc, 1);
 }
 
+static int msdc_get_cd(struct mmc_host *mmc)
+{
+   struct msdc_host *host = mmc_priv(mmc);
+   int val;
+
+   if (mmc->caps & MMC_CAP_NONREMOVABLE)
+   return 1;
+
+   if (!host->internal_cd)
+   return mmc_gpio_get_cd(mmc);
+
+   val = readl(host->base + MSDC_PS) & MSDC_PS_CDSTS;
+   if (mmc->caps2 & MMC_CAP2_CD_ACTIVE_HIGH)
+

[PATCH 1/4] mmc: mtk-sd: don't hard-code interrupt trigger type

2019-05-04 Thread NeilBrown

When using devicetree for configuration, interrupt trigger type
should be described in the dts file, not hard-coded in the C code.

The mtk-sd silicon in the mt7621 soc uses an active-high interrupt
and so cannot be used with the current code.

So replace IRQF_TRIGGER_LOW with IRQF_TRIGGER_NONE.

Also IRQF_ONESHOT is not needed - it is used for threaded interrupt
handlers, and this driver does not used a threaded interrupt handler.
So remove that setting.

Signed-off-by: NeilBrown 
---
 drivers/mmc/host/mtk-sd.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/mmc/host/mtk-sd.c b/drivers/mmc/host/mtk-sd.c
index 0798f0ba6d34..469d4a717175 100644
--- a/drivers/mmc/host/mtk-sd.c
+++ b/drivers/mmc/host/mtk-sd.c
@@ -2240,7 +2240,7 @@ static int msdc_drv_probe(struct platform_device *pdev)
msdc_init_hw(host);
 
ret = devm_request_irq(>dev, host->irq, msdc_irq,
-   IRQF_TRIGGER_LOW | IRQF_ONESHOT, pdev->name, host);
+  IRQF_TRIGGER_NONE, pdev->name, host);
if (ret)
goto release;

[PATCH 2/4] mmc: mtk-sd: add support for config found in mt7620 family SOCs.

2019-05-04 Thread NeilBrown

mt7620 family MIPS SOCs contain the mtk-sd silicon.
Add support for this.

Signed-off-by: NeilBrown 
---
 Documentation/devicetree/bindings/mmc/mtk-sd.txt |1 +
 drivers/mmc/host/mtk-sd.c|   12 
 2 files changed, 13 insertions(+)

diff --git a/Documentation/devicetree/bindings/mmc/mtk-sd.txt 
b/Documentation/devicetree/bindings/mmc/mtk-sd.txt
index 91a2ec59e497..8a532f4453f2 100644
--- a/Documentation/devicetree/bindings/mmc/mtk-sd.txt
+++ b/Documentation/devicetree/bindings/mmc/mtk-sd.txt
@@ -16,6 +16,7 @@ Required properties:
"mediatek,mt2712-mmc": for mmc host ip compatible with mt2712
"mediatek,mt7622-mmc": for MT7622 SoC
"mediatek,mt7623-mmc", "mediatek,mt2701-mmc": for MT7623 SoC
+   "mediatek,mt7620-mmc", for MT7621 SoC (and others)
 
 - reg: physical base address of the controller and length
 - interrupts: Should contain MSDC interrupt number
diff --git a/drivers/mmc/host/mtk-sd.c b/drivers/mmc/host/mtk-sd.c
index 469d4a717175..0c2be4f54b1f 100644
--- a/drivers/mmc/host/mtk-sd.c
+++ b/drivers/mmc/host/mtk-sd.c
@@ -517,6 +517,17 @@ static const struct mtk_mmc_compatible mt8516_compat = {
.stop_clk_fix = true,
 };
 
+static const struct mtk_mmc_compatible mt7620_compat = {
+   .clk_div_bits = 8,
+   .hs400_tune = false,
+   .pad_tune_reg = MSDC_PAD_TUNE,
+   .async_fifo = false,
+   .data_tune = false,
+   .busy_check = false,
+   .stop_clk_fix = false,
+   .enhance_rx = false,
+};
+
 static const struct of_device_id msdc_of_ids[] = {
{ .compatible = "mediatek,mt8135-mmc", .data = _compat},
{ .compatible = "mediatek,mt8173-mmc", .data = _compat},
@@ -525,6 +536,7 @@ static const struct of_device_id msdc_of_ids[] = {
{ .compatible = "mediatek,mt2712-mmc", .data = _compat},
{ .compatible = "mediatek,mt7622-mmc", .data = _compat},
{ .compatible = "mediatek,mt8516-mmc", .data = _compat},
+   { .compatible = "mediatek,mt7620-mmc", .data = _compat},
{}
 };
 MODULE_DEVICE_TABLE(of, msdc_of_ids);

[PATCH 0/4] mtk-sd enhancement to support MT7621 - V2

2019-05-04 Thread NeilBrown

The MT7621 MIPS-based SOC contains an sdhci unit that is
much the same as the units supported by mtk-sd.c.

These patches enhance the driver so that I can use it on my MT7621
board (gnubee.org).

This series have been revised based on feedback from Chaotian.

Thanks,
NeilBrown

---

NeilBrown (4):
  mmc: mtk-sd: don't hard-code interrupt trigger type
  mmc: mtk-sd: add support for config found in mt7620 family SOCs.
  mmc: mtk-sd: enable internal card-detect logic.
  mmc: mtk-sd: select REGULATOR


 Documentation/devicetree/bindings/mmc/mtk-sd.txt |1 
 drivers/mmc/host/Kconfig |1 
 drivers/mmc/host/mtk-sd.c|   78 +-
 3 files changed, 75 insertions(+), 5 deletions(-)

--
Signature

[PATCH 4/4] mmc: mtk-sd: select REGULATOR

2019-05-04 Thread NeilBrown

The mtk-sd driver requires a regulator to be present, even if it is
the "fixed" regulator.  So select REGULATOR to make it hard to build
unusable configurations.

Signed-off-by: NeilBrown 
---
 drivers/mmc/host/Kconfig |1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/mmc/host/Kconfig b/drivers/mmc/host/Kconfig
index 9c01310a0d2e..1249cde7004d 100644
--- a/drivers/mmc/host/Kconfig
+++ b/drivers/mmc/host/Kconfig
@@ -941,6 +941,7 @@ config MMC_BCM2835
 config MMC_MTK
tristate "MediaTek SD/MMC Card Interface support"
depends on HAS_DMA
+   select REGULATOR
help
  This selects the MediaTek(R) Secure digital and Multimedia card 
Interface.
  If you have a machine with a integrated SD/MMC card reader, say Y or 
M here.

Re: [PATCH 4/5] mmc: mtk-sd: enable internal card-detect logic.

2019-05-04 Thread NeilBrown

On Thu, Apr 18 2019, Chaotian Jing wrote:

> On Tue, 2019-04-16 at 14:47 +1000, NeilBrown wrote:
>> The mtk-sd silicon has integrated card-detect logic that is
>> enabled, at least, on the MT7621 as used in the GNUBEE NAS.
>> 
>> If the sdhci isn't marked non-removable and doesn't have a
>> cd-gpio configured, assume the internal cd logic should be used.
>> 
>> Signed-off-by: NeilBrown 
...

>> @@ -2206,6 +2247,15 @@ static int msdc_drv_probe(struct platform_device 
>> *pdev)
>>  goto host_free;
>>  }
>>  
>> +if (!(mmc->caps & MMC_CAP_NONREMOVABLE) &&
>> +!mmc_can_gpio_cd(mmc)) {
>
> Should not do this assume!
> better to add "mediatek,internal-cd" in your DTS, then no impact to
> other Soc.

(Sorry for the delay).

Documentation/devicetree/bindings/mmc/mmc.txt

says:
   If no property below is supplied, host native card detect is used.

So this assumption is *exactly* what the documentation said we should
do.

How about I limit this assumption to mt7621 using a flag in the
mtk_mmc_compatible structure?

Thanks,
NeilBrown


signature.asc
Description: PGP signature

Re: [PATCH] overlayfs: ignore empty NFSv4 ACLs in ext4 upperdir

2019-05-02 Thread NeilBrown

On Thu, May 02 2019, Andreas Gruenbacher wrote:

> On Thu, 2 May 2019 at 05:57, NeilBrown  wrote:
>> On Wed, May 01 2019, Amir Goldstein wrote:
>> > On Wed, May 1, 2019 at 10:03 PM NeilBrown  wrote:
>> >> On Tue, Dec 06 2016, J. Bruce Fields wrote:
>> >> > On Tue, Dec 06, 2016 at 02:18:31PM +0100, Andreas Gruenbacher wrote:
>> >> >> On Tue, Dec 6, 2016 at 11:08 AM, Miklos Szeredi  
>> >> >> wrote:
>> >> >> > On Tue, Dec 6, 2016 at 12:24 AM, Andreas Grünbacher
>> >> >> >  wrote:
>> >> >> >> 2016-12-06 0:19 GMT+01:00 Andreas Grünbacher 
>> >> >> >> :
>> >> >> >
>> >> >> >>> It's not hard to come up with a heuristic that determines if a
>> >> >> >>> system.nfs4_acl value is equivalent to a file mode, and to ignore 
>> >> >> >>> the
>> >> >> >>> attribute in that case. (The file mode is transmitted in its own
>> >> >> >>> attribute already, so actually converting .) That way, overlayfs 
>> >> >> >>> could
>> >> >> >>> still fail copying up files that have an actual ACL. It's still an
>> >> >> >>> ugly hack ...
>> >> >> >>
>> >> >> >> Actually, that kind of heuristic would make sense in the NFS client
>> >> >> >> which could then hide the "system.nfs4_acl" attribute.
>
> I still think the nfs client could make this problem mostly go away by
> not exposing "system.nfs4_acl" xattrs when the acl is equivalent to
> the file mode.

Maybe ... but this feels a bit like "sweeping it under the carpet".
What happens if some file on the lower layer does have a more complex
ACL?
Do we just fail any attempt to modify that object?  Doesn't that violate
the law of least surprise?

Maybe if the lower-layer has an i_op->permission method, then overlayfs
should *always* call that for permission checking - unless a
chmod/chown/etc has happened on the file.  That way, we wouldn't need to
copy-up the ACL, but would still get correct ACL testing.

Thanks,
NeilBrown


> The richacl patches contain a workable abgorithm for
> that. The problem would remain for files that have an actual NFS4 ACL,
> which just cannot be mapped to a file mode or to POSIX ACLs in the
> general case, as well as for files that have a POSIX ACL. Mapping NFS4
> ACL that used to be a POSIX ACL back to POSIX ACLs could be achieved
> in many cases as well, but the code would be quite messy. A better way
> seems to be to using a filesystem that doesn't support POSIX ACLs in
> the first place. Unfortunately, xfs doesn't allow turning off POSIX
> ACLs, for example.
>
> Andreas
>
>> >> >> > Even simpler would be if knfsd didn't send the attribute if not
>> >> >> > necessary.  Looks like there's code actively creating the nfs4_acl on
>> >> >> > the wire even if the filesystem had none:
>> >> >> >
>> >> >> > pacl = get_acl(inode, ACL_TYPE_ACCESS);
>> >> >> > if (!pacl)
>> >> >> > pacl = posix_acl_from_mode(inode->i_mode, GFP_KERNEL);
>> >> >> >
>> >> >> > What's the point?
>> >> >>
>> >> >> That's how the protocol is specified.
>> >> >
>> >> > Yep, even if we could make that change to nfsd it wouldn't help the
>> >> > client with the large number of other servers that are out there
>> >> > (including older knfsd's).
>> >> >
>> >> > --b.
>> >> >
>> >> >> (I'm not saying that that's very helpful.)
>> >> >>
>> >> >> Andreas
>> >>
>> >> Hi everyone.
>> >>  I have a customer facing this problem, and so stumbled onto the email
>> >>  thread.
>> >>  Unfortunately it didn't resolve anything.  Maybe I can help kick things
>> >>  along???
>> >>
>> >>  The core problem here is that NFSv4 and ext4 use different and largely
>> >>  incompatible ACL implementations.  There is no way to accurately
>> >>  translate from one to the other in general (common specific examples
>> >>  can be converted).
>> >>
>> >>  This means that either:
>> >>1/ overlayfs cannot use ext4 for upper and NFS for lower (or vice
>> >>   versa) or
>> >&g

Re: [PATCH] OVL: add honoracl=off mount option.

2019-05-02 Thread NeilBrown

On Thu, May 02 2019, Amir Goldstein wrote:

> On Thu, May 2, 2019 at 12:35 AM NeilBrown  wrote:
>>
>>
>> If the upper and lower layers use incompatible ACL formats, it is not
>> possible to copy the ACL xttr from one to the other, so overlayfs
>> cannot work with them.
>> This happens particularly with NFSv4 which uses system.nfs4_acl, and
>> ext4 which uses system.posix_acl_access.
>>
>> If all ACLs actually make to Unix permissions, then there is no need
>> to copy up the ACLs, but overlayfs cannot determine this.
>>
>> So allow the sysadmin it assert that ACLs are not needed with a mount
>> option
>>   honoracl=off
>> This causes the ACLs to not be copied, so filesystems with different
>> ACL formats can be overlaid together.
>>
>> Signed-off-by: NeilBrown 
>> ---
>>  Documentation/filesystems/overlayfs.txt | 24 
>>  fs/overlayfs/copy_up.c  |  9 +++--
>>  fs/overlayfs/dir.c  |  2 +-
>>  fs/overlayfs/overlayfs.h|  2 +-
>>  fs/overlayfs/ovl_entry.h|  1 +
>>  fs/overlayfs/super.c| 15 +++
>>  6 files changed, 49 insertions(+), 4 deletions(-)
>>
>> diff --git a/Documentation/filesystems/overlayfs.txt 
>> b/Documentation/filesystems/overlayfs.txt
>> index eef7d9d259e8..7ad675940c93 100644
>> --- a/Documentation/filesystems/overlayfs.txt
>> +++ b/Documentation/filesystems/overlayfs.txt
>> @@ -245,6 +245,30 @@ filesystem - future operations on the file are barely 
>> noticed by the
>>  overlay filesystem (though an operation on the name of the file such as
>>  rename or unlink will of course be noticed and handled).
>>
>> +ACL copy-up
>> +---
>> +
>> +When a file that only exists on the lower layer is modified it needs
>> +to be copied up to the upper layer.  This means copying the metadata
>> +and (usually) the data (though see "Metadata only copy up" below).
>> +One part of the metadata can be problematic: the ACLs.
>> +
>> +Now all filesystems support ACLs, and when they do they don't all use
>> +the same format.  A significant conflict appears between POSIX acls
>> +used on many local filesystems, and NFSv4 ACLs used with NFSv4.  There
>> +two formats are, in general, not inter-convertible.
>> +
>> +If a site only uses regular Unix permissions (Read, Write, eXecute by
>> +User, Group and Other), then as these permissions are compatible with
>> +all ACLs, there is no need to copy ACLs.  overlayfs cannot determine
>> +if this is the case itself.
>> +
>> +For this reason, overlayfs supports a mount option "honoracl=off"
>> +which causes ACLs, any "system." extended attribute, on the lower
>> +layer to be ignored and, particularly, not copied to the upper later.
>> +This allows NFSv4 to be overlaid with a local filesystem, but should
>> +only be used if the only access controls used on the filesystem are
>> +Unix permission bits.
>>
>
> I don't know. On the one hand "system." is not only ACLs.

Isn't it?  What else goes in "system." "??

"man xattr" says:

   Extended system attributes
   Extended system attributes are used by the kernel to store system
   objects  such  as  Access Control Lists.  Read and write access
   permissions to system attributes depend on the policy implemented
   for each system attribute implemented by filesystems in the
   kernel.

so it *allows* things other than ACLs, but doesn't confirm that there
are any.

In the kernel source, "XATTR_SYSTEM_PREFIX" is only used with POSIX acls
and "system.sockprotoname" - which is socket specific and no likely to
be found on a filesystem.

"system.
also appears in
   CIFS_XATTR_CIFS_ACL
   SMB3_XATTR_CIFS_ACL
   F2FS_SYSTEM_ADVISE_NAME
   XATTR_NAME_NFSV4_ACL
   SYSTEM_ORANGEFS_KEY

which should all use XATTR_SYSTEM_PREFIX ...

So yes,  I guess they aren't (quite) all ACLs.  Bother.


> On the other hand, "honoracl=off" is not the same as -o noacl,
> but it sure sounds the same.
>
> I'd be a lot more comfortable with "ignore_xattrs=system.nfs4_acl"
> argument takes a comma separated list of xattr prefixes to ignore.

That requires the sysadmin to know a lot more about the internals of the
relevant filesystems Maybe that is a good idea, but it feels rather
clunky.

In each of these cases, except maybe POSIX_ACLs, it doesn't make sense
to copy-up the "system." xattr unless it is the exact same filesystem
type.

So if given a "noacl" flag (or similar), ignoring copy-up failure for
all "system." attributes is probably the right thing to do, as ACLs are
the only system. attribute for which it can make any sense at all to
copy them.

Thanks,
NeilBrown


>
> ovl_is_private_xattr() can be generalized to ovl_is_ignored_xattr(),
> going over a blacklist of N>=1 which will also be called from
> ovl_can_list(), because there is no point in listing the ACLs that
> are ignored. right?
>
> Thanks,
> Amir.


signature.asc
Description: PGP signature

Re: [PATCH] overlayfs: ignore empty NFSv4 ACLs in ext4 upperdir

2019-05-02 Thread NeilBrown

On Thu, May 02 2019, Miklos Szeredi wrote:

> On Thu, May 2, 2019 at 10:05 AM Andreas Gruenbacher  
> wrote:
>>
>> On Thu, 2 May 2019 at 05:57, NeilBrown  wrote:
>> > On Wed, May 01 2019, Amir Goldstein wrote:
>> > > On Wed, May 1, 2019 at 10:03 PM NeilBrown  wrote:
>> > >> On Tue, Dec 06 2016, J. Bruce Fields wrote:
>> > >> > On Tue, Dec 06, 2016 at 02:18:31PM +0100, Andreas Gruenbacher wrote:
>> > >> >> On Tue, Dec 6, 2016 at 11:08 AM, Miklos Szeredi  
>> > >> >> wrote:
>> > >> >> > On Tue, Dec 6, 2016 at 12:24 AM, Andreas Grünbacher
>> > >> >> >  wrote:
>> > >> >> >> 2016-12-06 0:19 GMT+01:00 Andreas Grünbacher 
>> > >> >> >> :
>> > >> >> >
>> > >> >> >>> It's not hard to come up with a heuristic that determines if a
>> > >> >> >>> system.nfs4_acl value is equivalent to a file mode, and to 
>> > >> >> >>> ignore the
>> > >> >> >>> attribute in that case. (The file mode is transmitted in its own
>> > >> >> >>> attribute already, so actually converting .) That way, overlayfs 
>> > >> >> >>> could
>> > >> >> >>> still fail copying up files that have an actual ACL. It's still 
>> > >> >> >>> an
>> > >> >> >>> ugly hack ...
>> > >> >> >>
>> > >> >> >> Actually, that kind of heuristic would make sense in the NFS 
>> > >> >> >> client
>> > >> >> >> which could then hide the "system.nfs4_acl" attribute.
>>
>> I still think the nfs client could make this problem mostly go away by
>> not exposing "system.nfs4_acl" xattrs when the acl is equivalent to
>> the file mode. The richacl patches contain a workable abgorithm for
>> that. The problem would remain for files that have an actual NFS4 ACL,
>> which just cannot be mapped to a file mode or to POSIX ACLs in the
>> general case, as well as for files that have a POSIX ACL. Mapping NFS4
>> ACL that used to be a POSIX ACL back to POSIX ACLs could be achieved
>> in many cases as well, but the code would be quite messy. A better way
>> seems to be to using a filesystem that doesn't support POSIX ACLs in
>> the first place. Unfortunately, xfs doesn't allow turning off POSIX
>> ACLs, for example.
>
> How about mounting NFSv4 with noacl?  That should fix this issue, right?

No.
"noacl" only affect NFSv3 (and maybe v2) and it disables use of the
NFSACL side-protocol.
"noacl" has no effect on an NFSv4 mount.

NeilBrown


signature.asc
Description: PGP signature

[PATCH] OVL: add honoracl=off mount option.

2019-05-01 Thread NeilBrown


If the upper and lower layers use incompatible ACL formats, it is not
possible to copy the ACL xttr from one to the other, so overlayfs
cannot work with them.
This happens particularly with NFSv4 which uses system.nfs4_acl, and
ext4 which uses system.posix_acl_access.

If all ACLs actually make to Unix permissions, then there is no need
to copy up the ACLs, but overlayfs cannot determine this.

So allow the sysadmin it assert that ACLs are not needed with a mount
option
  honoracl=off
This causes the ACLs to not be copied, so filesystems with different
ACL formats can be overlaid together.

Signed-off-by: NeilBrown 
---
 Documentation/filesystems/overlayfs.txt | 24 
 fs/overlayfs/copy_up.c  |  9 +++--
 fs/overlayfs/dir.c  |  2 +-
 fs/overlayfs/overlayfs.h|  2 +-
 fs/overlayfs/ovl_entry.h|  1 +
 fs/overlayfs/super.c| 15 +++
 6 files changed, 49 insertions(+), 4 deletions(-)

diff --git a/Documentation/filesystems/overlayfs.txt 
b/Documentation/filesystems/overlayfs.txt
index eef7d9d259e8..7ad675940c93 100644
--- a/Documentation/filesystems/overlayfs.txt
+++ b/Documentation/filesystems/overlayfs.txt
@@ -245,6 +245,30 @@ filesystem - future operations on the file are barely 
noticed by the
 overlay filesystem (though an operation on the name of the file such as
 rename or unlink will of course be noticed and handled).
 
+ACL copy-up
+---
+
+When a file that only exists on the lower layer is modified it needs
+to be copied up to the upper layer.  This means copying the metadata
+and (usually) the data (though see "Metadata only copy up" below).
+One part of the metadata can be problematic: the ACLs.
+
+Now all filesystems support ACLs, and when they do they don't all use
+the same format.  A significant conflict appears between POSIX acls
+used on many local filesystems, and NFSv4 ACLs used with NFSv4.  There
+two formats are, in general, not inter-convertible.
+
+If a site only uses regular Unix permissions (Read, Write, eXecute by
+User, Group and Other), then as these permissions are compatible with
+all ACLs, there is no need to copy ACLs.  overlayfs cannot determine
+if this is the case itself.
+
+For this reason, overlayfs supports a mount option "honoracl=off"
+which causes ACLs, any "system." extended attribute, on the lower
+layer to be ignored and, particularly, not copied to the upper later.
+This allows NFSv4 to be overlaid with a local filesystem, but should
+only be used if the only access controls used on the filesystem are
+Unix permission bits.
 
 Multiple lower layers
 -
diff --git a/fs/overlayfs/copy_up.c b/fs/overlayfs/copy_up.c
index 68b3303e4b46..032aa88f21c1 100644
--- a/fs/overlayfs/copy_up.c
+++ b/fs/overlayfs/copy_up.c
@@ -39,7 +39,7 @@ static int ovl_ccup_get(char *buf, const struct kernel_param 
*param)
 module_param_call(check_copy_up, ovl_ccup_set, ovl_ccup_get, NULL, 0644);
 MODULE_PARM_DESC(ovl_check_copy_up, "Obsolete; does nothing");
 
-int ovl_copy_xattr(struct dentry *old, struct dentry *new)
+int ovl_copy_xattr(struct dentry *old, struct dentry *new, struct ovl_fs *ofs)
 {
ssize_t list_size, size, value_size = 0;
char *buf, *name, *value = NULL;
@@ -77,6 +77,10 @@ int ovl_copy_xattr(struct dentry *old, struct dentry *new)
}
list_size -= slen;
 
+   if (strncmp(name, XATTR_SYSTEM_PREFIX, XATTR_SYSTEM_PREFIX_LEN) 
== 0 &&
+   !ofs->config.honoracl)
+   continue;
+
if (ovl_is_private_xattr(name))
continue;
 retry:
@@ -461,7 +465,8 @@ static int ovl_copy_up_inode(struct ovl_copy_up_ctx *c, 
struct dentry *temp)
return err;
}
 
-   err = ovl_copy_xattr(c->lowerpath.dentry, temp);
+   err = ovl_copy_xattr(c->lowerpath.dentry, temp,
+c->dentry->d_sb->s_fs_info);
if (err)
return err;
 
diff --git a/fs/overlayfs/dir.c b/fs/overlayfs/dir.c
index 82c129bfe58d..cc8fb9eeb7df 100644
--- a/fs/overlayfs/dir.c
+++ b/fs/overlayfs/dir.c
@@ -368,7 +368,7 @@ static struct dentry *ovl_clear_empty(struct dentry *dentry,
if (IS_ERR(opaquedir))
goto out_unlock;
 
-   err = ovl_copy_xattr(upper, opaquedir);
+   err = ovl_copy_xattr(upper, opaquedir, upper->d_sb->s_fs_info);
if (err)
goto out_cleanup;
 
diff --git a/fs/overlayfs/overlayfs.h b/fs/overlayfs/overlayfs.h
index 9c6018287d57..4a104a4732af 100644
--- a/fs/overlayfs/overlayfs.h
+++ b/fs/overlayfs/overlayfs.h
@@ -422,7 +422,7 @@ int ovl_copy_up(struct dentry *dentry);
 int ovl_copy_up_with_data(struct dentry *dentry);
 int ovl_copy_up_flags(struct dentry *dentry, int flags);
 int ovl_open_maybe_copy_up(struct dentry *dentry, unsigned int

Re: [PATCH] overlayfs: ignore empty NFSv4 ACLs in ext4 upperdir

2019-05-01 Thread NeilBrown

On Wed, May 01 2019, Amir Goldstein wrote:

> On Wed, May 1, 2019 at 10:03 PM NeilBrown  wrote:
>>
>> On Tue, Dec 06 2016, J. Bruce Fields wrote:
>>
>> > On Tue, Dec 06, 2016 at 02:18:31PM +0100, Andreas Gruenbacher wrote:
>> >> On Tue, Dec 6, 2016 at 11:08 AM, Miklos Szeredi  wrote:
>> >> > On Tue, Dec 6, 2016 at 12:24 AM, Andreas Grünbacher
>> >> >  wrote:
>> >> >> 2016-12-06 0:19 GMT+01:00 Andreas Grünbacher 
>> >> >> :
>> >> >
>> >> >>> It's not hard to come up with a heuristic that determines if a
>> >> >>> system.nfs4_acl value is equivalent to a file mode, and to ignore the
>> >> >>> attribute in that case. (The file mode is transmitted in its own
>> >> >>> attribute already, so actually converting .) That way, overlayfs could
>> >> >>> still fail copying up files that have an actual ACL. It's still an
>> >> >>> ugly hack ...
>> >> >>
>> >> >> Actually, that kind of heuristic would make sense in the NFS client
>> >> >> which could then hide the "system.nfs4_acl" attribute.
>> >> >
>> >> > Even simpler would be if knfsd didn't send the attribute if not
>> >> > necessary.  Looks like there's code actively creating the nfs4_acl on
>> >> > the wire even if the filesystem had none:
>> >> >
>> >> > pacl = get_acl(inode, ACL_TYPE_ACCESS);
>> >> > if (!pacl)
>> >> > pacl = posix_acl_from_mode(inode->i_mode, GFP_KERNEL);
>> >> >
>> >> > What's the point?
>> >>
>> >> That's how the protocol is specified.
>> >
>> > Yep, even if we could make that change to nfsd it wouldn't help the
>> > client with the large number of other servers that are out there
>> > (including older knfsd's).
>> >
>> > --b.
>> >
>> >> (I'm not saying that that's very helpful.)
>> >>
>> >> Andreas
>>
>> Hi everyone.
>>  I have a customer facing this problem, and so stumbled onto the email
>>  thread.
>>  Unfortunately it didn't resolve anything.  Maybe I can help kick things
>>  along???
>>
>>  The core problem here is that NFSv4 and ext4 use different and largely
>>  incompatible ACL implementations.  There is no way to accurately
>>  translate from one to the other in general (common specific examples
>>  can be converted).
>>
>>  This means that either:
>>1/ overlayfs cannot use ext4 for upper and NFS for lower (or vice
>>   versa) or
>>2/ overlayfs need to accept that sometimes it cannot copy ACLs, and
>>   that is OK.
>>
>>  Silently not copying the ACLs is probably not a good idea as it might
>>  result in inappropriate permissions being given away.
>
> For example? permissions given away to do what?
> Note that ovl_permission() only check permissions of *mounter*
> to read the lower NFS file and ovl_open()/ovl_read_iter() access
> the lower file with *mounter* credentials.
>
> I might be wrong, but seems to me that once admin mounted
> overlayfs with lower NFS, NFS ACLs are not being enforced at all
> even before copy up.

I guess it is just as well that copy-up fails then - if the lower-level
permission check is being ignored.

>
>> So if the
>>  sysadmin wants this (and some clearly do), they need a way to
>>  explicitly say "I accept the risk".  If only standard Unix permissions
>>  are used, there is no risk, so this seems reasonable.
>>
>>  So I would like to propose a new option for overlayfs
>> nocopyupacl:   when overlayfs is copying a file (or directory etc)
>> from the lower filesystem to the upper filesystem, it does not
>> copy extended attributes with the "system." prefix.  These are
>> used for storing ACL information and this is sometimes not
>> compatible between different filesystem types (e.g. ext4 and
>> NFSv4).  Standard Unix ownership permission flags (rwx) *are*
>> copied so this option does not risk giving away inappropriate
>> permissions unless the lowerfs uses unusual ACLs.
>>
>>
>
> I am wondering if it would make more sense for nfs to register a
> security_inode_copy_up_xattr() hook.
> That is the mechanism that prevents copying up other security.*
> xattrs?

No, I don't think that would make sense.
Support some day support for nfs4 acls were added to ext4 (not a totally
ridiculous suggestion).  We would then want NFS to allow it's ACLs to be
copied up.

Thanks,
NeilBrown


>
> Thanks,
> Amir.


signature.asc
Description: PGP signature

Re: [PATCH] overlayfs: ignore empty NFSv4 ACLs in ext4 upperdir

2019-05-01 Thread NeilBrown

On Tue, Dec 06 2016, J. Bruce Fields wrote:

> On Tue, Dec 06, 2016 at 02:18:31PM +0100, Andreas Gruenbacher wrote:
>> On Tue, Dec 6, 2016 at 11:08 AM, Miklos Szeredi  wrote:
>> > On Tue, Dec 6, 2016 at 12:24 AM, Andreas Grünbacher
>> >  wrote:
>> >> 2016-12-06 0:19 GMT+01:00 Andreas Grünbacher 
>> >> :
>> >
>> >>> It's not hard to come up with a heuristic that determines if a
>> >>> system.nfs4_acl value is equivalent to a file mode, and to ignore the
>> >>> attribute in that case. (The file mode is transmitted in its own
>> >>> attribute already, so actually converting .) That way, overlayfs could
>> >>> still fail copying up files that have an actual ACL. It's still an
>> >>> ugly hack ...
>> >>
>> >> Actually, that kind of heuristic would make sense in the NFS client
>> >> which could then hide the "system.nfs4_acl" attribute.
>> >
>> > Even simpler would be if knfsd didn't send the attribute if not
>> > necessary.  Looks like there's code actively creating the nfs4_acl on
>> > the wire even if the filesystem had none:
>> >
>> > pacl = get_acl(inode, ACL_TYPE_ACCESS);
>> > if (!pacl)
>> > pacl = posix_acl_from_mode(inode->i_mode, GFP_KERNEL);
>> >
>> > What's the point?
>> 
>> That's how the protocol is specified.
>
> Yep, even if we could make that change to nfsd it wouldn't help the
> client with the large number of other servers that are out there
> (including older knfsd's).
>
> --b.
>
>> (I'm not saying that that's very helpful.)
>> 
>> Andreas

Hi everyone.
 I have a customer facing this problem, and so stumbled onto the email
 thread.
 Unfortunately it didn't resolve anything.  Maybe I can help kick things
 along???

 The core problem here is that NFSv4 and ext4 use different and largely
 incompatible ACL implementations.  There is no way to accurately
 translate from one to the other in general (common specific examples
 can be converted).

 This means that either:
   1/ overlayfs cannot use ext4 for upper and NFS for lower (or vice
  versa) or
   2/ overlayfs need to accept that sometimes it cannot copy ACLs, and
  that is OK.

 Silently not copying the ACLs is probably not a good idea as it might
 result in inappropriate permissions being given away.  So if the
 sysadmin wants this (and some clearly do), they need a way to
 explicitly say "I accept the risk".  If only standard Unix permissions
 are used, there is no risk, so this seems reasonable.

 So I would like to propose a new option for overlayfs
nocopyupacl:   when overlayfs is copying a file (or directory etc)
from the lower filesystem to the upper filesystem, it does not
copy extended attributes with the "system." prefix.  These are
used for storing ACL information and this is sometimes not
compatible between different filesystem types (e.g. ext4 and
NFSv4).  Standard Unix ownership permission flags (rwx) *are*
copied so this option does not risk giving away inappropriate
permissions unless the lowerfs uses unusual ACLs.

 Miklos: would you find that acceptable?

Thanks,
NeilBrown

signature.asc
Description: PGP signature

Re: [PATCH] md: properly lock and unlock in rdev_attr_store()

2019-04-28 Thread NeilBrown

On Sun, Apr 28 2019, Lukas Bulwahn wrote:

> rdev_attr_store() should lock and unlock mddev->reconfig_mutex in a
> balanced way with mddev_lock() and mddev_unlock().

It does.

>
> But when rdev->mddev is NULL, rdev_attr_store() would try to unlock
> without locking before. Resolve this locking issue..

This is incorrect.

>
> This locking issue was detected with Clang Thread Safety Analyser:

Either the Clang Thread Safety Analyser is broken, or you used it
incorrectly.

>
> drivers/md/md.c:3393:3: warning: releasing mutex 'mddev->reconfig_mutex' that 
> was not held [-Wthread-safety-analysis]
> mddev_unlock(mddev);
> ^
>
> This warning was reported after annotating mutex functions and
> mddev_lock() and mddev_unlock().
>
> Fixes: 27c529bb8e90 ("md: lock access to rdev attributes properly")
> Link: 
> https://groups.google.com/d/topic/clang-built-linux/CvBiiQLB0H4/discussion
> Signed-off-by: Lukas Bulwahn 
> ---
> Arnd, Neil, here a proposal to fix lock and unlocking asymmetry.
>
> I quite sure that if mddev is NULL, it should just return.

If mddev is NULL, the code does return (with -EBUSY).  All you've done
is change things so it returns from a different part of the code.  You
haven't changed the behaviour at all.

>
> I am still puzzled if the return value from mddev_lock() should be really
> return by rdev_attr_store() when it is not 0. But that was the behaviour
> before, so I will keep it that way.

Certainly it should. mddev_lock() either returns 0 to indicate success
or -EINTR if it received a signal.
If it was interrupted by a signal, then rdev_attr_store() should return
-EINTR as well.

As Arnd tried to explain, the only possible problem here is that the C
compiler is allowed to assume that rdev->mddev never changes value, so
in
   rv = mddev ? mddev_lock(mddev) : =EBUSY

it could load rdev->mddev, test if it is NULL, then load it again and
pass that value to mddev_lock() - the new value might be NULL which
would cause problems.

This could be fixed by changing

struct mddev *mddev = rdev->mddev;
to
struct mddev *mddev = READ_ONCE(rdev->mddev);

That is the only change that might be useful here.

NeilBrown


>
>  drivers/md/md.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index 05b8b769..a9735d8f1e70 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -3384,7 +3384,9 @@ rdev_attr_store(struct kobject *kobj, struct attribute 
> *attr,
>   return -EIO;
>   if (!capable(CAP_SYS_ADMIN))
>   return -EACCES;
> - rv = mddev ? mddev_lock(mddev): -EBUSY;
> + if (!mddev)
> + return -EBUSY;
> + rv = mddev_lock(mddev);
>   if (!rv) {
>   if (rdev->mddev == NULL)
>   rv = -EBUSY;
> -- 
> 2.17.1


signature.asc
Description: PGP signature

1 2 3 4 5 6 7 8 9 10 >

1 - 100 of 4892 matches

Mail list logo