Re: [Cluster-devel] [PATCH 0/2] scop GFP_NOFS api
On Sun, May 01, 2016 at 08:19:44AM +1000, NeilBrown wrote: > On Sat, Apr 30 2016, Dave Chinner wrote: > > Indeed, blocking the superblock shrinker in reclaim is a key part of > > balancing inode cache pressure in XFS. If the shrinker starts > > hitting dirty inodes, it blocks on cleaning them, thereby slowing > > the rate of allocation to that which inodes can be cleaned and > > reclaimed. There are also background threads that walk ahead freeing > > clean inodes, but we have to throttle direct reclaim in this manner > > otherwise the allocation pressure vastly outweighs the ability to > > reclaim inodes. If we don't balance this, inode allocation triggers > > the OOM killer because reclaim keeps reporting "no progress being > > made" because dirty inodes are skipped. By blocking on such inodes, > > the shrinker makes progress (slowly) and reclaim sees that memory is > > being freed and so continues without invoking the OOM killer... > > I'm very aware of the need to throttle allocation based on IO. I > remember when NFS didn't quite get this right and filled up memory :-) > > balance_dirty_pages() used to force threads to wait on the write-out of > one page for every page that they dirtied (or wait on 128 pages for every 128 > dirtied or whatever). This was exactly to provide the sort of > throttling you are talking about. > > We don't do that any more. It was problematic. I don't recall all the > reasons but I think that different backing devices having different > clearance rates was part of the problem. As the original architect of those changes (the IO-less dirty page throttling was an evolution of the algorithm I developed for Irix years before it was done in Linux) I remember the reasons very well. Mostly it was to prevent what we termed "IO breakdown", where so many different foreground threads were dispatching writeback that it turned dirty page writeback into random IO instead of nice, large sequential IOs. 
> So now we monitor clearance rates and wait for some number of blocks to > be written, rather than waiting for some specific blocks to be written. Right - we have a limited pool of flusher threads dispatching IO as efficiently as possible, and foreground dirtying processes wait for that to do the work of cleaning pages. > We should be able to do the same thing to balance dirty inodes as > we do to balance dirty pages. No, we can't. Dirty inode state is deeply tied into filesystem implementations - it's intertwined with the journal operation in many filesystems and can't be separated easily. Indeed, some filesystems don't even use the VFS for tracking dirty inode state, nor do they implement the ->write_inode method. Hence neither the VFS nor the inode cache shrinkers are able to determine if an inode is dirty, or if it is, trigger writeback of it. IOWs, inode caches are not unified in allocation, behaviour, design or implementation like the page cache is, and so balancing dirty inodes is likely not to be possible. > >> If it could be changed > >> to just schedule the IO without waiting for it then I think this > >> would be safe to be called in any FS allocation context. It already > >> uses a 'trylock' in xfs_dqlock_nowait() to avoid deadlocking > >> if the lock is held. > > > > We could, but then we have the same problem as the inode cache - > > there's no indication of progress going back to the memory reclaim > > subsystem, nor is reclaim able to throttle memory allocation back to > > the rate at which reclaim is making progress. > > > > There's feedback loops all throughout the XFS reclaim code - it's > > designed specifically that way - I made changes to the shrinker > > infrastructure years ago to enable this. It's no different to the > > dirty page throttling that was done at roughly the same time - > > that's also one big feedback loop controlled by the rate at which > > pages can be cleaned. 
Indeed, it was designed based on the same > > premise as all the XFS shrinker code: in steady state conditions > > we can't allocate a resource faster than we can reclaim it, so we > > need to make reclaim as efficient as possible... > > You seem to be referring here to the same change that I was referred to > above, but seem to be seeing it from a different perspective. Sure, not many people have the same viewpoint as the person who had to convince everyone else that it was a sound idea even before prototypes were written... > Waiting for inodes to be freed is important. Waiting for any one > specific inode to be freed is dangerous. Sure. But there's a difference between waiting on an inode when you have no idea when you'll be able to reclaim it, versus waiting on IO completion for an inode you've already guaranteed can complete reclaim and that you are the only context with access to. Not to mention that XFS also triggers an async inode reclaim worker thread to do non-blocking reclaim of inodes in the background, so while direct reclaim throttles (and
Re: [Cluster-devel] [PATCH 0/2] scop GFP_NOFS api
Hi, On Sun 01-05-16 07:55:31, NeilBrown wrote: [...] > One particular problem with your process-context idea is that it isn't > inherited across threads. > Steve Whitehouse's example in gfs shows how allocation dependencies can > even cross into user space. Hmm, I am still not sure I understand that example completely, but making a dependency between direct reclaim and userspace can hardly work, especially when the direct reclaim might be sitting on top of a hard-to-guess pile of locks. So unless I've missed something, what Steve has described is a clear NOFS context. > A more localized one that I have seen is that NFSv4 sometimes needs to > start up a state-management thread (particularly if the server > restarted). > It uses kthread_run(), which doesn't actually create the thread but asks > kthreadd to do it. If NFS writeout is waiting for state management it > would need to make sure that kthreadd runs in allocation context to > avoid deadlock. > I feel that I've forgotten some important detail here and this might > have been fixed somehow, but the point still stands that the allocation > context can cross from thread to thread and can effectively become > anything and everything. Relying on kthread_run from a GFP_NOFS context has always been deadlock prone, with or without the scope GFP_NOFS semantic, so I am not sure I see your point here. Similarly, relying on a work item which doesn't have a dedicated WQ_MEM_RECLAIM workqueue is deadlock prone. You simply shouldn't do that. > It is OK to wait for memory to be freed. It is not OK to wait for any > particular piece of memory to be freed because you don't always know who > is waiting for you, or who you really are waiting on to free that > memory. > > Whenever trying to free memory I think you need to do best-effort > without blocking. I agree with that. Or at least you have to wait on something that is _guaranteed_ to make forward progress. 
I am not really that sure this is easy to achieve with the current code base. -- Michal Hocko SUSE Labs
Re: [Cluster-devel] [PATCH 0/2] scop GFP_NOFS api
On Sat, Apr 30 2016, Dave Chinner wrote: > On Fri, Apr 29, 2016 at 03:35:42PM +1000, NeilBrown wrote: >> On Tue, Apr 26 2016, Michal Hocko wrote: >> >> > Hi, >> > we have discussed this topic at LSF/MM this year. There was a general >> > interest in the scope GFP_NOFS allocation context among some FS >> > developers. For those who are not aware of the discussion or the issue >> > I am trying to sort out (or at least start in that direction) please >> > have a look at patch 1 which adds memalloc_nofs_{save,restore} api >> > which basically copies what we have for the scope GFP_NOIO allocation >> > context. I haven't converted any of the FS myself because that is way >> > beyond my area of expertise but I would be happy to help with further >> > changes on the MM front as well as in some more generic code paths. >> > >> > Dave had an idea on how to further improve the reclaim context to be >> > less all-or-nothing wrt. GFP_NOFS. In short he was suggesting an opaque >> > and FS specific cookie set in the FS allocation context and consumed >> > by the FS reclaim context to allow doing some provably safe actions >> > that would be skipped due to GFP_NOFS normally. I like this idea and >> > I believe we can go that direction regardless of the approach taken here. >> > Many filesystems simply need to clean up their NOFS usage first before >> > diving into more complex changes.> >> >> This strikes me as over-engineering to work around an unnecessarily >> burdensome interface but without details it is hard to be certain. >> >> Exactly what things happen in "FS reclaim context" which may, or may >> not, be safe depending on the specific FS allocation context? Do they >> need to happen at all? >> >> My research suggests that for most filesystems the only thing that >> happens in reclaim context that is at all troublesome is the final >> 'evict()' on an inode. This needs to flush out dirty pages and sync the >> inode to storage. 
Some time ago we moved most dirty-page writeout out >> of the reclaim context and into kswapd. I think this was an excellent >> advance in simplicity. > > No, we didn't move dirty page writeout to kswapd - we moved it back > to the background writeback threads where it can be done > efficiently. kswapd should almost never do single page writeback > because of how inefficient it is from an IO perspective, even though > it can. i.e. if we are doing any significant amount of dirty page > writeback from memory reclaim (direct, kswapd or otherwise) then > we've screwed something up. > >> If we could similarly move evict() into kswapd (and I believe we can) >> then most file systems would do nothing in reclaim context that >> interferes with allocation context. > > When lots of GFP_NOFS allocation is being done, this already > happens. The shrinkers that can't run due to context accumulate the > work on the shrinker structure, and when the shrinker can next run > (e.g. run from kswapd) it runs all the deferred work from GFP_NOFS > reclaim contexts. > > IOWs, we already move shrinker work from direct reclaim to kswapd > when appropriate. > >> The exceptions include: >> - nfs and any filesystem using fscache can block for up to 1 second >> in ->releasepage(). They used to block waiting for some IO, but that >> caused deadlocks and wasn't really needed. I left the timeout because >> it seemed likely that some throttling would help. I suspect that a >> careful analysis will show that there is sufficient throttling >> elsewhere. >> >> - xfs_qm_shrink_scan is nearly unique among shrinkers in that it waits >> for IO so it can free some quotainfo things. > > No it's not. evict() can block on IO - waiting for data or inode > writeback to complete, or even for filesystems to run transactions > on the inode. Hence the superblock shrinker can and does block in > inode cache reclaim. 
That is why I said "nearly" :-) > > Indeed, blocking the superblock shrinker in reclaim is a key part of > balancing inode cache pressure in XFS. If the shrinker starts > hitting dirty inodes, it blocks on cleaning them, thereby slowing > the rate of allocation to that which inodes can be cleaned and > reclaimed. There are also background threads that walk ahead freeing > clean inodes, but we have to throttle direct reclaim in this manner > otherwise the allocation pressure vastly outweighs the ability to > reclaim inodes. If we don't balance this, inode allocation triggers > the OOM killer because reclaim keeps reporting "no progress being > made" because dirty inodes are skipped. By blocking on such inodes, > the shrinker makes progress (slowly) and reclaim sees that memory is > being freed and so continues without invoking the OOM killer... I'm very aware of the need to throttle allocation based on IO. I remember when NFS didn't quite get this right and filled up memory :-) balance_dirty_pages() used to force threads to wait on the write-out of one page for every page that they
Re: [Cluster-devel] [PATCH 0/2] scop GFP_NOFS api
On Fri, Apr 29, 2016 at 02:04:18PM +0200, Michal Hocko wrote: > I would also like to revisit generic inode/dentry shrinker and see > whether it could be more __GFP_FS friendly. As you say many FS might > even not depend on some FS internal locks so pushing GFP_FS check down > the layers might make a lot of sense and allow to clean some [id]cache > even for __GFP_FS context. That's precisely my point about passing a context to the shrinker. It's recursion within a single superblock context that makes up the majority of cases GFP_NOFS is used for, so passing the superblock immediately allows reclaim to run the superblock shrinker on every other superblock. We can refine it further from there, but I strongly suspect that further refinement is going to require filesystems to specifically configure the superblock shrinker. e.g. in XFS, we can't allow evict() even on clean VFS inodes in a PF_FSTRANS context, because we may run a transaction on a clean VFS inode to prepare it for reclaim. We can, however, allow the fs-specific shrinker callouts to run (i.e. call into .free_cached_objects) so that it can reclaim clean XFS inodes, because that doesn't require transactions. That is, the infrastructure I suggested we use is aimed directly at providing the mechanism required for finer-grained inode/dentry cache reclaim in contexts where it is currently disallowed completely. I was also implying that once the infrastructure to pass contexts is in place, I'd then make the changes to the generic superblock shrinker code to enable finer grained reclaim and optimise the XFS shrinkers to make use of it... Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [Cluster-devel] [PATCH 0/2] scop GFP_NOFS api
On Fri, Apr 29, 2016 at 03:35:42PM +1000, NeilBrown wrote: > On Tue, Apr 26 2016, Michal Hocko wrote: > > > Hi, > > we have discussed this topic at LSF/MM this year. There was a general > > interest in the scope GFP_NOFS allocation context among some FS > > developers. For those who are not aware of the discussion or the issue > > I am trying to sort out (or at least start in that direction) please > > have a look at patch 1 which adds memalloc_nofs_{save,restore} api > > which basically copies what we have for the scope GFP_NOIO allocation > > context. I haven't converted any of the FS myself because that is way > > beyond my area of expertise but I would be happy to help with further > > changes on the MM front as well as in some more generic code paths. > > > > Dave had an idea on how to further improve the reclaim context to be > > less all-or-nothing wrt. GFP_NOFS. In short he was suggesting an opaque > > and FS specific cookie set in the FS allocation context and consumed > > by the FS reclaim context to allow doing some provably safe actions > > that would be skipped due to GFP_NOFS normally. I like this idea and > > I believe we can go that direction regardless of the approach taken here. > > Many filesystems simply need to clean up their NOFS usage first before > > diving into more complex changes.> > > This strikes me as over-engineering to work around an unnecessarily > burdensome interface but without details it is hard to be certain. > > Exactly what things happen in "FS reclaim context" which may, or may > not, be safe depending on the specific FS allocation context? Do they > need to happen at all? > > My research suggests that for most filesystems the only thing that > happens in reclaim context that is at all troublesome is the final > 'evict()' on an inode. This needs to flush out dirty pages and sync the > inode to storage. Some time ago we moved most dirty-page writeout out > of the reclaim context and into kswapd. 
I think this was an excellent > advance in simplicity. No, we didn't move dirty page writeout to kswapd - we moved it back to the background writeback threads where it can be done efficiently. kswapd should almost never do single page writeback because of how inefficient it is from an IO perspective, even though it can. i.e. if we are doing any significant amount of dirty page writeback from memory reclaim (direct, kswapd or otherwise) then we've screwed something up. > If we could similarly move evict() into kswapd (and I believe we can) > then most file systems would do nothing in reclaim context that > interferes with allocation context. When lots of GFP_NOFS allocation is being done, this already happens. The shrinkers that can't run due to context accumulate the work on the shrinker structure, and when the shrinker can next run (e.g. run from kswapd) it runs all the deferred work from GFP_NOFS reclaim contexts. IOWs, we already move shrinker work from direct reclaim to kswapd when appropriate. > The exceptions include: > - nfs and any filesystem using fscache can block for up to 1 second > in ->releasepage(). They used to block waiting for some IO, but that > caused deadlocks and wasn't really needed. I left the timeout because > it seemed likely that some throttling would help. I suspect that a > careful analysis will show that there is sufficient throttling > elsewhere. > > - xfs_qm_shrink_scan is nearly unique among shrinkers in that it waits > for IO so it can free some quotainfo things. No it's not. evict() can block on IO - waiting for data or inode writeback to complete, or even for filesystems to run transactions on the inode. Hence the superblock shrinker can and does block in inode cache reclaim. Indeed, blocking the superblock shrinker in reclaim is a key part of balancing inode cache pressure in XFS. 
If the shrinker starts hitting dirty inodes, it blocks on cleaning them, thereby slowing the rate of allocation to that which inodes can be cleaned and reclaimed. There are also background threads that walk ahead freeing clean inodes, but we have to throttle direct reclaim in this manner otherwise the allocation pressure vastly outweighs the ability to reclaim inodes. If we don't balance this, inode allocation triggers the OOM killer because reclaim keeps reporting "no progress being made" because dirty inodes are skipped. By blocking on such inodes, the shrinker makes progress (slowly) and reclaim sees that memory is being freed and so continues without invoking the OOM killer... > If it could be changed > to just schedule the IO without waiting for it then I think this > would be safe to be called in any FS allocation context. It already > uses a 'trylock' in xfs_dqlock_nowait() to avoid deadlocking > if the lock is held. We could, but then we have the same problem as the inode cache - there's no indication of progress going back to the memory reclaim subsystem, nor is reclaim able to throttle
Re: [Cluster-devel] [PATCH 0/2] scop GFP_NOFS api
On Fri 29-04-16 15:35:42, NeilBrown wrote: > On Tue, Apr 26 2016, Michal Hocko wrote: > > > Hi, > > we have discussed this topic at LSF/MM this year. There was a general > > interest in the scope GFP_NOFS allocation context among some FS > > developers. For those who are not aware of the discussion or the issue > > I am trying to sort out (or at least start in that direction) please > > have a look at patch 1 which adds memalloc_nofs_{save,restore} api > > which basically copies what we have for the scope GFP_NOIO allocation > > context. I haven't converted any of the FS myself because that is way > > beyond my area of expertise but I would be happy to help with further > > changes on the MM front as well as in some more generic code paths. > > > > Dave had an idea on how to further improve the reclaim context to be > > less all-or-nothing wrt. GFP_NOFS. In short he was suggesting an opaque > > and FS specific cookie set in the FS allocation context and consumed > > by the FS reclaim context to allow doing some provably safe actions > > that would be skipped due to GFP_NOFS normally. I like this idea and > > I believe we can go that direction regardless of the approach taken here. > > Many filesystems simply need to clean up their NOFS usage first before > > diving into more complex changes.> > > This strikes me as over-engineering to work around an unnecessarily > burdensome interface but without details it is hard to be certain. > > Exactly what things happen in "FS reclaim context" which may, or may > not, be safe depending on the specific FS allocation context? Do they > need to happen at all? Let me quote Dave Chinner from one of the emails discussed on the LSF/MM mailing list: : IMO, making GFP_NOFS "better" cannot be done with context-less flags : being passed through reclaim. 
If we want to prevent the recursive : self-deadlock case in an optimal manner, then we need to be able to : pass state down to reclaim so that page writeback and the shrinkers : can determine if they are likely to deadlock. : : IOWs, I think we should stop thinking of GFP_NOFS as a *global* : directive to avoid recursion under any circumstance and instead : start thinking about it as a mechanism to avoid recursion in : specific reclaim contexts. : : Something as simple as adding an opaque cookie (e.g. can hold a : superblock or inode pointer) to check against in writeback and : subsystem shrinkers would result in the vast majority of GFP_NOFS : contexts being able to reclaim from everything but the one context : that we *might* deadlock against. : : e.g., if we then also check the PF_FSTRANS flag in XFS, we'll : still be able to reclaim clean inodes, buffers and write back : dirty pages that don't require transactions to complete under "don't : recurse" situations because we know it's transactions that we could : deadlock on in the direct reclaim context. : : Note that this information could be added to the writeback_control : for page writeback, and it could be passed directly to shrinkers : in the shrink_control structures. The allocation paths might be a : little harder, but I suspect using the task struct for passing this : information into direct reclaim might be the easiest approach... > My research suggests that for most filesystems the only thing that > happens in reclaim context that is at all troublesome is the final > 'evict()' on an inode. This needs to flush out dirty pages and sync the > inode to storage. Some time ago we moved most dirty-page writeout out > of the reclaim context and into kswapd. I think this was an excellent > advance in simplicity. > If we could similarly move evict() into kswapd (and I believe we can) > then most file systems would do nothing in reclaim context that > interferes with allocation context. 
> > The exceptions include: > - nfs and any filesystem using fscache can block for up to 1 second > in ->releasepage(). They used to block waiting for some IO, but that > caused deadlocks and wasn't really needed. I left the timeout because > it seemed likely that some throttling would help. I suspect that a > careful analysis will show that there is sufficient throttling > elsewhere. > > - xfs_qm_shrink_scan is nearly unique among shrinkers in that it waits > for IO so it can free some quotainfo things. If it could be changed > to just schedule the IO without waiting for it then I think this > would be safe to be called in any FS allocation context. It already > uses a 'trylock' in xfs_dqlock_nowait() to avoid deadlocking > if the lock is held. > > I think you/we would end up with a much simpler system if instead of > focussing on the places where GFP_NOFS is used, we focus on places where > __GFP_FS is tested, and try to remove them. One thing I have learned is that shrinkers can be really complex and getting rid of GFP_NOFS will be really hard so I would really like to start the easiest way possible and remove the direct usage and
Re: [Cluster-devel] [PATCH 0/2] scop GFP_NOFS api
Hi, On 29/04/16 06:35, NeilBrown wrote: On Tue, Apr 26 2016, Michal Hocko wrote: Hi, we have discussed this topic at LSF/MM this year. There was a general interest in the scope GFP_NOFS allocation context among some FS developers. For those who are not aware of the discussion or the issue I am trying to sort out (or at least start in that direction) please have a look at patch 1 which adds memalloc_nofs_{save,restore} api which basically copies what we have for the scope GFP_NOIO allocation context. I haven't converted any of the FS myself because that is way beyond my area of expertise but I would be happy to help with further changes on the MM front as well as in some more generic code paths. Dave had an idea on how to further improve the reclaim context to be less all-or-nothing wrt. GFP_NOFS. In short he was suggesting an opaque and FS specific cookie set in the FS allocation context and consumed by the FS reclaim context to allow doing some provably safe actions that would be skipped due to GFP_NOFS normally. I like this idea and I believe we can go that direction regardless of the approach taken here. Many filesystems simply need to clean up their NOFS usage first before diving into more complex changes. This strikes me as over-engineering to work around an unnecessarily burdensome interface but without details it is hard to be certain. Exactly what things happen in "FS reclaim context" which may, or may not, be safe depending on the specific FS allocation context? Do they need to happen at all? My research suggests that for most filesystems the only thing that happens in reclaim context that is at all troublesome is the final 'evict()' on an inode. This needs to flush out dirty pages and sync the inode to storage. Some time ago we moved most dirty-page writeout out of the reclaim context and into kswapd. I think this was an excellent advance in simplicity. 
If we could similarly move evict() into kswapd (and I believe we can) then most file systems would do nothing in reclaim context that interferes with allocation context. evict() is an issue, but moving it into kswapd would be a potential problem for GFS2. We already have a memory allocation issue when recovery is taking place and memory is short. The code path is as follows:

1. Inode is scheduled for eviction (which requires deallocation)
2. The glock is required in order to perform the deallocation, which implies getting a DLM lock
3. Another node in the cluster fails, so needs recovery
4. When the DLM lock is requested, it gets blocked until recovery is complete (for the failed node)
5. Recovery is performed using a userland fencing utility
6. Fencing requires memory and then blocks on the eviction
7. Deadlock (fencing waiting on memory alloc, memory alloc waiting on DLM lock, DLM lock waiting on fencing)

It doesn't happen often, but we've been looking at the best place to break that cycle, and one of the things we've been wondering is whether we could avoid deallocation evictions from memory-related contexts, or at least make them async somehow. The issue is that it is not possible to know in advance whether an eviction will merely write things back to disk (because the inode is being dropped from cache, but still resides on disk), which is easy to do, or whether it requires a full deallocation (n_link==0), which may require significant resources and time.

Steve.
Re: [Cluster-devel] [PATCH 0/2] scop GFP_NOFS api
On Tue, Apr 26 2016, Michal Hocko wrote: > Hi, > we have discussed this topic at LSF/MM this year. There was a general > interest in the scope GFP_NOFS allocation context among some FS > developers. For those who are not aware of the discussion or the issue > I am trying to sort out (or at least start in that direction) please > have a look at patch 1 which adds memalloc_nofs_{save,restore} api > which basically copies what we have for the scope GFP_NOIO allocation > context. I haven't converted any of the FS myself because that is way > beyond my area of expertise but I would be happy to help with further > changes on the MM front as well as in some more generic code paths. > > Dave had an idea on how to further improve the reclaim context to be > less all-or-nothing wrt. GFP_NOFS. In short he was suggesting an opaque > and FS specific cookie set in the FS allocation context and consumed > by the FS reclaim context to allow doing some provably safe actions > that would be skipped due to GFP_NOFS normally. I like this idea and > I believe we can go that direction regardless of the approach taken here. > Many filesystems simply need to clean up their NOFS usage first before > diving into more complex changes. This strikes me as over-engineering to work around an unnecessarily burdensome interface but without details it is hard to be certain. Exactly what things happen in "FS reclaim context" which may, or may not, be safe depending on the specific FS allocation context? Do they need to happen at all? My research suggests that for most filesystems the only thing that happens in reclaim context that is at all troublesome is the final 'evict()' on an inode. This needs to flush out dirty pages and sync the inode to storage. Some time ago we moved most dirty-page writeout out of the reclaim context and into kswapd. I think this was an excellent advance in simplicity. 
If we could similarly move evict() into kswapd (and I believe we can) then most file systems would do nothing in reclaim context that interferes with allocation context. The exceptions include: - nfs and any filesystem using fscache can block for up to 1 second in ->releasepage(). They used to block waiting for some IO, but that caused deadlocks and wasn't really needed. I left the timeout because it seemed likely that some throttling would help. I suspect that a careful analysis will show that there is sufficient throttling elsewhere. - xfs_qm_shrink_scan is nearly unique among shrinkers in that it waits for IO so it can free some quotainfo things. If it could be changed to just schedule the IO without waiting for it then I think this would be safe to be called in any FS allocation context. It already uses a 'trylock' in xfs_dqlock_nowait() to avoid deadlocking if the lock is held. I think you/we would end up with a much simpler system if instead of focussing on the places where GFP_NOFS is used, we focus on places where __GFP_FS is tested, and try to remove them. If we get rid of enough of them the remainder could just use __GFP_IO. > Patch 2 is a debugging aid which warns about explicit allocation > requests from the scope context. This should help to reduce the > direct usage of the NOFS flags to a bare minimum in favor of the scope > API. It is not aimed to be merged upstream. I would hope Andrew would take it > into the mmotm tree to give it linux-next exposure and allow developers to > do further cleanups. There is a new kernel command line parameter which > has to be used for the debugging to be enabled. > > I think GFP_NOIO should see the same cleanup. I think you are suggesting that use of GFP_NOIO should (largely) be deprecated in favour of memalloc_noio_save(). I think I agree. Could we go a step further and deprecate GFP_ATOMIC in favour of some in_atomic() test? Maybe that is going too far. 
Thanks, NeilBrown > > Any feedback is highly appreciated.
[Cluster-devel] [PATCH 0/2] scop GFP_NOFS api
Hi, we have discussed this topic at LSF/MM this year. There was a general interest in the scope GFP_NOFS allocation context among some FS developers. For those who are not aware of the discussion or the issue I am trying to sort out (or at least start in that direction) please have a look at patch 1, which adds the memalloc_nofs_{save,restore} api, which basically copies what we have for the scope GFP_NOIO allocation context. I haven't converted any of the FS myself because that is way beyond my area of expertise but I would be happy to help with further changes on the MM front as well as in some more generic code paths. Dave had an idea on how to further improve the reclaim context to be less all-or-nothing wrt. GFP_NOFS. In short he was suggesting an opaque and FS specific cookie set in the FS allocation context and consumed by the FS reclaim context to allow doing some provably safe actions that would be skipped due to GFP_NOFS normally. I like this idea and I believe we can go that direction regardless of the approach taken here. Many filesystems simply need to clean up their NOFS usage first before diving into more complex changes. Patch 2 is a debugging aid which warns about explicit allocation requests from the scope context. This should help to reduce the direct usage of the NOFS flags to a bare minimum in favor of the scope API. It is not aimed to be merged upstream. I would hope Andrew would take it into the mmotm tree to give it linux-next exposure and allow developers to do further cleanups. There is a new kernel command line parameter which has to be used for the debugging to be enabled. I think GFP_NOIO should see the same cleanup. Any feedback is highly appreciated.