Re: [Cluster-devel] [PATCH 0/2] scop GFP_NOFS api

2016-05-03 Thread Dave Chinner
On Sun, May 01, 2016 at 08:19:44AM +1000, NeilBrown wrote:
> On Sat, Apr 30 2016, Dave Chinner wrote:
> > Indeed, blocking the superblock shrinker in reclaim is a key part of
> > balancing inode cache pressure in XFS. If the shrinker starts
> > hitting dirty inodes, it blocks on cleaning them, thereby slowing
> > the rate of allocation to that which inodes can be cleaned and
> > reclaimed. There are also background threads that walk ahead freeing
> > clean inodes, but we have to throttle direct reclaim in this manner
> > otherwise the allocation pressure vastly outweighs the ability to
> reclaim inodes. If we don't balance this, inode allocation triggers
> > the OOM killer because reclaim keeps reporting "no progress being
> > made" because dirty inodes are skipped. BY blocking on such inodes,
> > the shrinker makes progress (slowly) and reclaim sees that memory is
> > being freed and so continues without invoking the OOM killer...
> 
> I'm very aware of the need to throttle allocation based on IO.  I
> remember when NFS didn't quite get this right and filled up memory :-)
> 
> balance_dirty_pages() used to force threads to wait on the write-out of
> one page for every page that they dirtied (or wait on 128 pages for every 128
> dirtied or whatever).  This was exactly to provide the sort of
> throttling you are talking about.
> 
> We don't do that any more.  It was problematic.  I don't recall all the
> reasons but I think that different backing devices having different
> clearance rates was part of the problem.

As the original architect of those changes (the IO-less dirty page
throttling was an evolution of the algorithm I developed for Irix
years before it was done in Linux) I remember the reasons very well.

Mostly it was to prevent what we termed "IO breakdown" where so many
different foreground threads were dispatching writeback that it
turned dirty page writeback into random IO instead of being nice,
large sequential IOs.

> So now we monitor clearance rates and wait for some number of blocks to
> be written, rather than waiting for some specific blocks to be written.

Right - we have a limited pool of flusher threads dispatching IO as
efficiently as possible, and foreground dirtying processes wait for
them to do the work of cleaning pages.

> We should be able to do the same thing to balance dirty inodes as
> we do to balance dirty pages.

No, we can't. Dirty inode state is deeply tied into filesystem
implementations - it's intertwined with the journal operation in
many filesystems and can't be separated easily.  Indeed, some
filesystems don't even use the VFS for tracking dirty inode state,
nor do they implement the ->write_inode method. Hence neither the VFS
nor the inode cache shrinkers are able to determine whether an inode is
dirty or, if it is, to trigger writeback of it.

IOWs, inode caches are not unified in allocation, behaviour, design
or implementation like the page cache is, and so balancing dirty
inodes is unlikely to be possible.

> >>If it could be changed
> >>to just schedule the IO without waiting for it then I think this
> >>would be safe to be called in any FS allocation context.  It already
> >>uses a 'trylock' in xfs_dqlock_nowait() to avoid deadlocking
> >>if the lock is held.
> >
> > We could, but then we have the same problem as the inode cache -
> > there's no indication of progress going back to the memory reclaim
> > subsystem, nor is reclaim able to throttle memory allocation back to
> > the rate at which reclaim is making progress.
> >
> > There's feedback loops all throughout the XFS reclaim code - it's
> > designed specifically that way - I made changes to the shrinker
> > infrastructure years ago to enable this. It's no different to the
> > dirty page throttling that was done at roughly the same time -
> > that's also one big feedback loop controlled by the rate at which
> > pages can be cleaned. Indeed, it was designed was based on the same
> > premise as all the XFS shrinker code: in steady state conditions
> > we can't allocate a resource faster than we can reclaim it, so we
> > need to make reclaim as efficient at possible...
> 
> You seem to be referring here to the same change that I was referred to
> above, but seem to be seeing it from a different perspective.

Sure, not many people have the same viewpoint as the person who had
to convince everyone else that it was a sound idea even before
prototypes were written...

> Waiting for inodes to be freed is important.  Waiting for any one
> specific inode to be freed is dangerous.

Sure. But there's a difference between waiting on an inode that you
have no idea when you'll be able to reclaim, and waiting on IO
completion for an inode that you've already guaranteed can complete
reclaim and to which you are the only context with access.
Not to mention that XFS also triggers an async inode reclaim worker
thread to do non-blocking reclaim of inodes in the background, so
while direct reclaim throttles (and 

Re: [Cluster-devel] [PATCH 0/2] scop GFP_NOFS api

2016-05-03 Thread Michal Hocko
Hi,

On Sun 01-05-16 07:55:31, NeilBrown wrote:
[...]
> One particular problem with your process-context idea is that it isn't
> inherited across threads.
> Steve Whitehouse's example in gfs shows how allocation dependencies can
> even cross into user space.

Hmm, I am still not sure I understand that example completely, but
making direct reclaim depend on userspace can hardly work, especially
when direct reclaim might be sitting on top of a hard-to-guess pile of
locks. So unless I've missed something, what Steve has described is a
clear NOFS context.

> A more localized one that I have seen is that NFSv4 sometimes needs to
> start up a state-management thread (particularly if the server
> restarted).
> It uses kthread_run(), which doesn't actually create the thread but asks
> kthreadd to do it.  If NFS writeout is waiting for state management it
> would need to make sure that kthreadd runs in allocation context to
> avoid deadlock.
> I feel that I've forgotten some important detail here and this might
> have been fixed somehow, but the point still stands that the allocation
> context can cross from thread to thread and can effectively become
> anything and everything.

I am not sure I see your point here: relying on kthread_run from a
GFP_NOFS context has always been deadlock prone, with or without the
scope GFP_NOFS semantics. Similarly, relying on a work item which
doesn't run on a dedicated WQ_MEM_RECLAIM workqueue is deadlock prone.
You simply shouldn't do that.
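
For illustration, a minimal sketch of that rule, assuming a hypothetical
filesystem ("myfs"; the names are made up): any work item that memory
reclaim may end up waiting on should be queued on a dedicated
WQ_MEM_RECLAIM workqueue, so a rescuer thread guarantees forward
progress even when no new worker threads can be created.

struct workqueue_struct *myfs_reclaim_wq;

static int __init myfs_init_workqueue(void)
{
        /*
         * WQ_MEM_RECLAIM gives the workqueue a rescuer thread, so work
         * that reclaim depends on still makes progress under memory
         * pressure.
         */
        myfs_reclaim_wq = alloc_workqueue("myfs-reclaim", WQ_MEM_RECLAIM, 0);
        if (!myfs_reclaim_wq)
                return -ENOMEM;
        return 0;
}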

> It is OK to wait for memory to be freed.  It is not OK to wait for any
> particular piece of memory to be freed because you don't always know who
> is waiting for you, or who you really are waiting on to free that
> memory.
> 
> Whenever trying to free memory I think you need to do best-effort
> without blocking.

I agree with that. Or at least you have to wait on something that is
_guaranteed_ to make forward progress. I am not really sure this
is easy to achieve with the current code base.

-- 
Michal Hocko
SUSE Labs



Re: [Cluster-devel] [PATCH 0/2] scop GFP_NOFS api

2016-04-30 Thread NeilBrown
On Sat, Apr 30 2016, Dave Chinner wrote:

> On Fri, Apr 29, 2016 at 03:35:42PM +1000, NeilBrown wrote:
>> On Tue, Apr 26 2016, Michal Hocko wrote:
>> 
>> > Hi,
>> > we have discussed this topic at LSF/MM this year. There was a general
>> > interest in the scope GFP_NOFS allocation context among some FS
>> > developers. For those who are not aware of the discussion or the issue
>> > I am trying to sort out (or at least start in that direction) please
>> > have a look at patch 1 which adds memalloc_nofs_{save,restore} api
>> > which basically copies what we have for the scope GFP_NOIO allocation
>> > context. I haven't converted any of the FS myself because that is way
>> > beyond my area of expertise but I would be happy to help with further
>> > changes on the MM front as well as in some more generic code paths.
>> >
>> > Dave had an idea on how to further improve the reclaim context to be
>> > less all-or-nothing wrt. GFP_NOFS. In short he was suggesting an opaque
>> > and FS specific cookie set in the FS allocation context and consumed
>> > by the FS reclaim context to allow doing some provably safe actions
>> > that would be skipped due to GFP_NOFS normally.  I like this idea and
>> > I believe we can go that direction regardless of the approach taken here.
>> > Many filesystems simply need to cleanup their NOFS usage first before
>> > diving into more complex changes.
>> 
>> This strikes me as over-engineering to work around an unnecessarily
>> burdensome interface but without details it is hard to be certain.
>> 
>> Exactly what things happen in "FS reclaim context" which may, or may
>> not, be safe depending on the specific FS allocation context?  Do they
>> need to happen at all?
>> 
>> My research suggests that for most filesystems the only thing that
>> happens in reclaim context that is at all troublesome is the final
>> 'evict()' on an inode.  This needs to flush out dirty pages and sync the
>> inode to storage.  Some time ago we moved most dirty-page writeout out
>> of the reclaim context and into kswapd.  I think this was an excellent
>> advance in simplicity.
>
> No, we didn't move dirty page writeout to kswapd - we moved it back
> to the background writeback threads where it can be done
> efficiently.  kswapd should almost never do single page writeback
> because of how inefficient it is from an IO perspective, even though
> it can. i.e. if we are doing any significant amount of dirty page
> writeback from memory reclaim (direct, kswapd or otherwise) then
> we've screwed something up.
>
>> If we could similarly move evict() into kswapd (and I believe we can)
>> then most file systems would do nothing in reclaim context that
>> interferes with allocation context.
>
> When lots of GFP_NOFS allocation is being done, this already
> happens. The shrinkers that can't run due to context accumulate the
> work on the shrinker structure, and when the shrinker can next run
> (e.g. run from kswapd) it runs all the deferred work from GFP_NOFS
> reclaim contexts.
>
> IOWs, we already move shrinker work from direct reclaim to kswapd
> when appropriate.
>
>> The exceptions include:
>>  - nfs and any filesystem using fscache can block for up to 1 second
>>in ->releasepage().  They used to block waiting for some IO, but that
>>caused deadlocks and wasn't really needed.  I left the timeout because
>>it seemed likely that some throttling would help.  I suspect that a
>>careful analysis will show that there is sufficient throttling
>>elsewhere.
>> 
>>  - xfs_qm_shrink_scan is nearly unique among shrinkers in that it waits
>>for IO so it can free some quotainfo things. 
>
> No it's not. evict() can block on IO - waiting for data or inode
> writeback to complete, or even for filesystems to run transactions
> on the inode. Hence the superblock shrinker can and does block in
> inode cache reclaim.

That is why I said "nearly" :-)

>
> Indeed, blocking the superblock shrinker in reclaim is a key part of
> balancing inode cache pressure in XFS. If the shrinker starts
> hitting dirty inodes, it blocks on cleaning them, thereby slowing
> the rate of allocation to that which inodes can be cleaned and
> reclaimed. There are also background threads that walk ahead freeing
> clean inodes, but we have to throttle direct reclaim in this manner
> otherwise the allocation pressure vastly outweighs the ability to
> reclaim inodes. If we don't balance this, inode allocation triggers
> the OOM killer because reclaim keeps reporting "no progress being
> made" because dirty inodes are skipped. BY blocking on such inodes,
> the shrinker makes progress (slowly) and reclaim sees that memory is
> being freed and so continues without invoking the OOM killer...

I'm very aware of the need to throttle allocation based on IO.  I
remember when NFS didn't quite get this right and filled up memory :-)

balance_dirty_pages() used to force threads to wait on the write-out of
one page for every page that they 

Re: [Cluster-devel] [PATCH 0/2] scop GFP_NOFS api

2016-04-29 Thread Dave Chinner
On Fri, Apr 29, 2016 at 02:04:18PM +0200, Michal Hocko wrote:
> I would also like to revisit generic inode/dentry shrinker and see
> whether it could be more __GFP_FS friendly. As you say many FS might
> even not depend on some FS internal locks so pushing GFP_FS check down
> the layers might make a lot of sense and allow to clean some [id]cache
> even for __GFP_FS context.

That's precisely my point about passing a context to the shrinker.
It's recursion within a single superblock context that makes up the
majority of cases GFP_NOFS is used for, so passing the superblock
immediately allows for reclaim to run the superblock shrinker on
every other superblock.

We can refine it further from there, but I strongly suspect that
further refinement is going to require filesystems to specifically
configure the superblock shrinker.

e.g. in XFS, we can't allow evict() even on clean VFS inodes in a
PF_FSTRANS context, because we may run a transaction on a clean
VFS inode to prepare it for reclaim.  We can, however,
allow the fs-specific shrinker callouts to run (i.e. call into
.free_cached_objects) so that it can reclaim clean XFS inodes
because that doesn't require transactions.
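
For illustration only, a rough sketch of what such a context-aware
superblock shrinker callout might look like. It assumes struct
shrink_control grew a "reclaim_context" cookie pointing at the
allocating superblock (no such field exists today), and
do_full_sb_reclaim() stands in for the existing dentry/inode pruning:

static unsigned long sb_cache_scan_sketch(struct shrinker *shrink,
                                          struct shrink_control *sc)
{
        struct super_block *sb = container_of(shrink, struct super_block,
                                              s_shrink);

        if (sc->reclaim_context == sb) {
                /*
                 * Reclaim was entered from this filesystem: evict()
                 * might recurse into the journal, so skip VFS inode
                 * eviction and only run the fs-specific callout the
                 * filesystem has declared safe (e.g. reclaiming clean
                 * XFS inodes that need no transaction).
                 */
                if (sb->s_op->free_cached_objects)
                        return sb->s_op->free_cached_objects(sb, sc);
                return SHRINK_STOP;
        }

        /* A different superblock: full dentry/inode reclaim is fine. */
        return do_full_sb_reclaim(sb, sc);      /* placeholder */
}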

i.e. the infrastructure I suggested we use is aimed directly at
providing the mechanism required for finer-grained inode/dentry
cache reclaim in contexts where it is currently completely disallowed.
I was also implying that once the infrastructure to pass
contexts is in place, I'd then make the changes to the generic
superblock shrinker code to enable finer grained reclaim and
optimise the XFS shrinkers to make use of it...

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com



Re: [Cluster-devel] [PATCH 0/2] scop GFP_NOFS api

2016-04-29 Thread Dave Chinner
On Fri, Apr 29, 2016 at 03:35:42PM +1000, NeilBrown wrote:
> On Tue, Apr 26 2016, Michal Hocko wrote:
> 
> > Hi,
> > we have discussed this topic at LSF/MM this year. There was a general
> > interest in the scope GFP_NOFS allocation context among some FS
> > developers. For those who are not aware of the discussion or the issue
> > I am trying to sort out (or at least start in that direction) please
> > have a look at patch 1 which adds memalloc_nofs_{save,restore} api
> > which basically copies what we have for the scope GFP_NOIO allocation
> > context. I haven't converted any of the FS myself because that is way
> > beyond my area of expertise but I would be happy to help with further
> > changes on the MM front as well as in some more generic code paths.
> >
> > Dave had an idea on how to further improve the reclaim context to be
> > less all-or-nothing wrt. GFP_NOFS. In short he was suggesting an opaque
> > and FS specific cookie set in the FS allocation context and consumed
> > by the FS reclaim context to allow doing some provably safe actions
> > that would be skipped due to GFP_NOFS normally.  I like this idea and
> > I believe we can go that direction regardless of the approach taken here.
> > Many filesystems simply need to cleanup their NOFS usage first before
> > diving into more complex changes.
> 
> This strikes me as over-engineering to work around an unnecessarily
> burdensome interface but without details it is hard to be certain.
> 
> Exactly what things happen in "FS reclaim context" which may, or may
> not, be safe depending on the specific FS allocation context?  Do they
> need to happen at all?
> 
> My research suggests that for most filesystems the only thing that
> happens in reclaim context that is at all troublesome is the final
> 'evict()' on an inode.  This needs to flush out dirty pages and sync the
> inode to storage.  Some time ago we moved most dirty-page writeout out
> of the reclaim context and into kswapd.  I think this was an excellent
> advance in simplicity.

No, we didn't move dirty page writeout to kswapd - we moved it back
to the background writeback threads where it can be done
efficiently.  kswapd should almost never do single page writeback
because of how inefficient it is from an IO perspective, even though
it can. i.e. if we are doing any significant amount of dirty page
writeback from memory reclaim (direct, kswapd or otherwise) then
we've screwed something up.

> If we could similarly move evict() into kswapd (and I believe we can)
> then most file systems would do nothing in reclaim context that
> interferes with allocation context.

When lots of GFP_NOFS allocation is being done, this already
happens. The shrinkers that can't run due to context accumulate the
work on the shrinker structure, and when the shrinker can next run
(e.g. run from kswapd) it runs all the deferred work from GFP_NOFS
reclaim contexts.

IOWs, we already move shrinker work from direct reclaim to kswapd
when appropriate.
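
As a sketch of that pattern (illustrative only; the example_* names are
made up): a shrinker that cannot run safely without __GFP_FS simply
bails out, and the core shrinker code carries the skipped work forward
as deferred work for the next compatible pass, typically from kswapd.

static unsigned long example_scan_objects(struct shrinker *shrink,
                                          struct shrink_control *sc)
{
        /* Called from a GFP_NOFS allocation: don't recurse into the fs. */
        if (!(sc->gfp_mask & __GFP_FS))
                return SHRINK_STOP;     /* work is deferred, not lost */

        return example_reclaim_objects(sc->nr_to_scan);
}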

> The exceptions include:
>  - nfs and any filesystem using fscache can block for up to 1 second
>in ->releasepage().  They used to block waiting for some IO, but that
>caused deadlocks and wasn't really needed.  I left the timeout because
>it seemed likely that some throttling would help.  I suspect that a
>careful analysis will show that there is sufficient throttling
>elsewhere.
> 
>  - xfs_qm_shrink_scan is nearly unique among shrinkers in that it waits
>for IO so it can free some quotainfo things. 

No it's not. evict() can block on IO - waiting for data or inode
writeback to complete, or even for filesystems to run transactions
on the inode. Hence the superblock shrinker can and does block in
inode cache reclaim.

Indeed, blocking the superblock shrinker in reclaim is a key part of
balancing inode cache pressure in XFS. If the shrinker starts
hitting dirty inodes, it blocks on cleaning them, thereby slowing
the rate of allocation to that which inodes can be cleaned and
reclaimed. There are also background threads that walk ahead freeing
clean inodes, but we have to throttle direct reclaim in this manner
otherwise the allocation pressure vastly outweighs the ability to
reclaim inodes. If we don't balance this, inode allocation triggers
the OOM killer because reclaim keeps reporting "no progress being
made" because dirty inodes are skipped. BY blocking on such inodes,
the shrinker makes progress (slowly) and reclaim sees that memory is
being freed and so continues without invoking the OOM killer...

>If it could be changed
>to just schedule the IO without waiting for it then I think this
>would be safe to be called in any FS allocation context.  It already
>uses a 'trylock' in xfs_dqlock_nowait() to avoid deadlocking
>if the lock is held.

We could, but then we have the same problem as the inode cache -
there's no indication of progress going back to the memory reclaim
subsystem, nor is reclaim able to throttle 

Re: [Cluster-devel] [PATCH 0/2] scop GFP_NOFS api

2016-04-29 Thread Michal Hocko
On Fri 29-04-16 15:35:42, NeilBrown wrote:
> On Tue, Apr 26 2016, Michal Hocko wrote:
> 
> > Hi,
> > we have discussed this topic at LSF/MM this year. There was a general
> > interest in the scope GFP_NOFS allocation context among some FS
> > developers. For those who are not aware of the discussion or the issue
> > I am trying to sort out (or at least start in that direction) please
> > have a look at patch 1 which adds memalloc_nofs_{save,restore} api
> > which basically copies what we have for the scope GFP_NOIO allocation
> > context. I haven't converted any of the FS myself because that is way
> > beyond my area of expertise but I would be happy to help with further
> > changes on the MM front as well as in some more generic code paths.
> >
> > Dave had an idea on how to further improve the reclaim context to be
> > less all-or-nothing wrt. GFP_NOFS. In short he was suggesting an opaque
> > and FS specific cookie set in the FS allocation context and consumed
> > by the FS reclaim context to allow doing some provably safe actions
> > that would be skipped due to GFP_NOFS normally.  I like this idea and
> > I believe we can go that direction regardless of the approach taken here.
> > Many filesystems simply need to cleanup their NOFS usage first before
> > diving into more complex changes.
> 
> This strikes me as over-engineering to work around an unnecessarily
> burdensome interface but without details it is hard to be certain.
>
> Exactly what things happen in "FS reclaim context" which may, or may
> not, be safe depending on the specific FS allocation context?  Do they
> need to happen at all?

Let me quote Dave Chinner from one of the emails discussed at LSFMM
mailing list:
: IMO, making GFP_NOFS "better" cannot be done with context-less flags
: being passed through reclaim. If we want to prevent the recursive
: self-deadlock case in an optimal manner, then we need to be able to
: pass state down to reclaim so that page writeback and the shrinkers
: can determine if they are likely to deadlock.
: 
: IOWs, I think we should stop thinking of GFP_NOFS as a *global*
: directive to avoid recursion under any circumstance and instead
: start thinking about it as a mechanism to avoid recursion in
: specific reclaim contexts.
: 
: Something as simple as adding an opaque cookie (e.g. can hold a
: superblock or inode pointer) to check against in writeback and
: subsystem shrinkers would result in the vast majority of GFP_NOFS
: contexts being able to reclaim from everything but the one context
: that we *might* deadlock against.
: 
: e.g., if we then also check the PF_FSTRANS flag in XFS, we'll
: still be able to reclaim clean inodes, buffers and write back
: dirty pages that don't require transactions to complete under "don't
: recurse" situations because we know it's transactions that we could
: deadlock on in the direct reclaim context.
: 
: Note that this information could be added to the writeback_control
: for page writeback, and it could be passed directly to shrinkers
: in the shrink_control structures. The allocation paths might be a
: little harder, but I suspect using the task struct for passing this
: information into direct reclaim might be the easiest approach...

> My research suggests that for most filesystems the only thing that
> happens in reclaim context that is at all troublesome is the final
> 'evict()' on an inode.  This needs to flush out dirty pages and sync the
> inode to storage.  Some time ago we moved most dirty-page writeout out
> of the reclaim context and into kswapd.  I think this was an excellent
> advance in simplicity.
> If we could similarly move evict() into kswapd (and I believe we can)
> then most file systems would do nothing in reclaim context that
> interferes with allocation context.
> 
> The exceptions include:
>  - nfs and any filesystem using fscache can block for up to 1 second
>in ->releasepage().  They used to block waiting for some IO, but that
>caused deadlocks and wasn't really needed.  I left the timeout because
>it seemed likely that some throttling would help.  I suspect that a
>careful analysis will show that there is sufficient throttling
>elsewhere.
> 
>  - xfs_qm_shrink_scan is nearly unique among shrinkers in that it waits
>for IO so it can free some quotainfo things.  If it could be changed
>to just schedule the IO without waiting for it then I think this
>would be safe to be called in any FS allocation context.  It already
>uses a 'trylock' in xfs_dqlock_nowait() to avoid deadlocking
>if the lock is held.
> 
> I think you/we would end up with a much simpler system if instead of
> focussing on the places where GFP_NOFS is used, we focus on places where
> __GFP_FS is tested, and try to remove them.

One thing I have learned is that shrinkers can be really complex and
getting rid of GFP_NOFS will be really hard so I would really like to
start the easiest way possible and remove the direct usage and 

Re: [Cluster-devel] [PATCH 0/2] scop GFP_NOFS api

2016-04-29 Thread Steven Whitehouse

Hi,

On 29/04/16 06:35, NeilBrown wrote:

On Tue, Apr 26 2016, Michal Hocko wrote:


Hi,
we have discussed this topic at LSF/MM this year. There was a general
interest in the scope GFP_NOFS allocation context among some FS
developers. For those who are not aware of the discussion or the issue
I am trying to sort out (or at least start in that direction) please
have a look at patch 1 which adds memalloc_nofs_{save,restore} api
which basically copies what we have for the scope GFP_NOIO allocation
context. I haven't converted any of the FS myself because that is way
beyond my area of expertise but I would be happy to help with further
changes on the MM front as well as in some more generic code paths.

Dave had an idea on how to further improve the reclaim context to be
less all-or-nothing wrt. GFP_NOFS. In short he was suggesting an opaque
and FS specific cookie set in the FS allocation context and consumed
by the FS reclaim context to allow doing some provably safe actions
that would be skipped due to GFP_NOFS normally.  I like this idea and
I believe we can go that direction regardless of the approach taken here.
Many filesystems simply need to cleanup their NOFS usage first before
diving into more complex changes.

This strikes me as over-engineering to work around an unnecessarily
burdensome interface but without details it is hard to be certain.

Exactly what things happen in "FS reclaim context" which may, or may
not, be safe depending on the specific FS allocation context?  Do they
need to happen at all?

My research suggests that for most filesystems the only thing that
happens in reclaim context that is at all troublesome is the final
'evict()' on an inode.  This needs to flush out dirty pages and sync the
inode to storage.  Some time ago we moved most dirty-page writeout out
of the reclaim context and into kswapd.  I think this was an excellent
advance in simplicity.
If we could similarly move evict() into kswapd (and I believe we can)
then most file systems would do nothing in reclaim context that
interferes with allocation context.

evict() is an issue, but moving it into kswapd would be a potential 
problem for GFS2. We already have a memory allocation issue when 
recovery is taking place and memory is short. The code path is as follows:


 1. Inode is scheduled for eviction (which requires deallocation)
 2. The glock is required in order to perform the deallocation, which
    implies getting a DLM lock
 3. Another node in the cluster fails, so needs recovery
 4. When the DLM lock is requested, it gets blocked until recovery is
    complete (for the failed node)
 5. Recovery is performed using a userland fencing utility
 6. Fencing requires memory and then blocks on the eviction
 7. Deadlock (fencing waiting on memory allocation, memory allocation
    waiting on the DLM lock, the DLM lock waiting on fencing)


It doesn't happen often, but we've been looking at the best place to 
break that cycle, and one of the things we've been wondering is whether 
we could avoid deallocation evictions from memory-related contexts, or 
at least make them async somehow.


The issue is that it is not possible to know in advance whether an 
eviction will result in merely writing things back to disk (because the 
inode is being dropped from cache, but still resides on disk), which is 
easy to do, or whether it requires a full deallocation (n_link==0), which 
may require significant resources and time.
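
As an illustration of that distinction (a sketch only, not GFS2's actual
code; the deallocation helper is made up), the two cases show up in a
filesystem's ->evict_inode() roughly like this:

static void example_evict_inode(struct inode *inode)
{
        truncate_inode_pages_final(&inode->i_data);

        if (inode->i_nlink == 0) {
                /*
                 * Unlinked: full on-disk deallocation. For a cluster
                 * filesystem this can mean cluster locks, journal
                 * transactions and, as above, waiting on recovery.
                 */
                example_deallocate_on_disk(inode);
        }
        /* Otherwise the inode stays on disk; dropping the cache copy is cheap. */

        clear_inode(inode);
}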


Steve.



Re: [Cluster-devel] [PATCH 0/2] scop GFP_NOFS api

2016-04-29 Thread NeilBrown
On Tue, Apr 26 2016, Michal Hocko wrote:

> Hi,
> we have discussed this topic at LSF/MM this year. There was a general
> interest in the scope GFP_NOFS allocation context among some FS
> developers. For those who are not aware of the discussion or the issue
> I am trying to sort out (or at least start in that direction) please
> have a look at patch 1 which adds memalloc_nofs_{save,restore} api
> which basically copies what we have for the scope GFP_NOIO allocation
> context. I haven't converted any of the FS myself because that is way
> beyond my area of expertise but I would be happy to help with further
> changes on the MM front as well as in some more generic code paths.
>
> Dave had an idea on how to further improve the reclaim context to be
> less all-or-nothing wrt. GFP_NOFS. In short he was suggesting an opaque
> and FS specific cookie set in the FS allocation context and consumed
> by the FS reclaim context to allow doing some provably safe actions
> that would be skipped due to GFP_NOFS normally.  I like this idea and
> I believe we can go that direction regardless of the approach taken here.
> Many filesystems simply need to cleanup their NOFS usage first before
> diving into more complex changes.

This strikes me as over-engineering to work around an unnecessarily
burdensome interface but without details it is hard to be certain.

Exactly what things happen in "FS reclaim context" which may, or may
not, be safe depending on the specific FS allocation context?  Do they
need to happen at all?

My research suggests that for most filesystems the only thing that
happens in reclaim context that is at all troublesome is the final
'evict()' on an inode.  This needs to flush out dirty pages and sync the
inode to storage.  Some time ago we moved most dirty-page writeout out
of the reclaim context and into kswapd.  I think this was an excellent
advance in simplicity.
If we could similarly move evict() into kswapd (and I believe we can)
then most file systems would do nothing in reclaim context that
interferes with allocation context.

The exceptions include:
 - nfs and any filesystem using fscache can block for up to 1 second
   in ->releasepage().  They used to block waiting for some IO, but that
   caused deadlocks and wasn't really needed.  I left the timeout because
   it seemed likely that some throttling would help.  I suspect that a
   careful analysis will show that there is sufficient throttling
   elsewhere.

 - xfs_qm_shrink_scan is nearly unique among shrinkers in that it waits
   for IO so it can free some quotainfo things.  If it could be changed
   to just schedule the IO without waiting for it then I think this
   would be safe to be called in any FS allocation context.  It already
   uses a 'trylock' in xfs_dqlock_nowait() to avoid deadlocking
   if the lock is held.

I think you/we would end up with a much simpler system if instead of
focussing on the places where GFP_NOFS is used, we focus on places where
__GFP_FS is tested, and try to remove them.  If we get rid of enough of
them the remainder could just use __GFP_IO.
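
For illustration, the kind of test being referred to (a sketch, not any
particular filesystem's code; the helper is made up) as it typically
appears in a ->releasepage() implementation:

static int example_releasepage(struct page *page, gfp_t gfp)
{
        /* Caller may hold fs locks or be inside a transaction: bail out. */
        if (!(gfp & __GFP_FS))
                return 0;

        return example_try_to_free_page_private(page);
}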

> Patch 2 is a debugging aid which warns about explicit allocation
> requests from the scope context. This should help to reduce the
> direct usage of the NOFS flags to a bare minimum in favor of the scope
> API. It is not aimed to be merged upstream; I would hope Andrew takes it
> into the mmotm tree to give it linux-next exposure and allow developers to
> do further cleanups.  There is a new kernel command line parameter which
> has to be used for the debugging to be enabled.
>
> I think GFP_NOIO should get the same cleanup.

I think you are suggesting that use of GFP_NOIO should (largely) be
deprecated in favour of memalloc_noio_save().  I think I agree.
Could we go a step further and deprecate GFP_ATOMIC in favour of some
in_atomic() test?  Maybe that is going too far.

Thanks,
NeilBrown

>
> Any feedback is highly appreciated.
>




[Cluster-devel] [PATCH 0/2] scop GFP_NOFS api

2016-04-26 Thread Michal Hocko
Hi,
we have discussed this topic at LSF/MM this year. There was a general
interest in the scope GFP_NOFS allocation context among some FS
developers. For those who are not aware of the discussion or the issue
I am trying to sort out (or at least start in that direction) please
have a look at patch 1 which adds memalloc_nofs_{save,restore} api
which basically copies what we have for the scope GFP_NOIO allocation
context. I haven't converted any of the FS myself because that is way
beyond my area of expertise but I would be happy to help with further
changes on the MM front as well as in some more generic code paths.

Dave had an idea on how to further improve the reclaim context to be
less all-or-nothing wrt. GFP_NOFS. In short he was suggesting an opaque
and FS specific cookie set in the FS allocation context and consumed
by the FS reclaim context to allow doing some provably safe actions
that would be skipped due to GFP_NOFS normally.  I like this idea and
I believe we can go that direction regardless of the approach taken here.
Many filesystems simply need to cleanup their NOFS usage first before
diving into more complex changes.
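
For illustration, a minimal sketch of how a filesystem might use the
proposed scope API, assuming it mirrors the existing
memalloc_noio_{save,restore} helpers as described above (the example_*
names are made up):

static int example_run_transaction(struct super_block *sb)
{
        unsigned int flags;
        int error;

        /*
         * Everything allocated between save and restore is implicitly
         * GFP_NOFS, so callees can simply use GFP_KERNEL.
         */
        flags = memalloc_nofs_save();
        error = example_do_transactional_work(sb);
        memalloc_nofs_restore(flags);

        return error;
}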

Patch 2 is a debugging aid which warns about explicit allocation
requests from the scope context. This should help to reduce the direct
usage of the NOFS flags to a bare minimum in favor of the scope API. It
is not aimed to be merged upstream; I would hope Andrew takes it into
the mmotm tree to give it linux-next exposure and allow developers to
do further cleanups.  There is a new kernel command line parameter which
has to be used for the debugging to be enabled.

I think GFP_NOIO should get the same cleanup.

Any feedback is highly appreciated.