On Tue, Jul 4, 2017 at 2:11 PM, Steven Whitehouse <[email protected]> wrote:
> Hi,
>
>
> On 03/07/17 15:56, Andreas Gruenbacher wrote:
>>
>> These are the remaining patches for fixing a cluster-wide GFS2 and DLM
>> deadlock.
>>
>> As explained in the previous posting of this patch queue, when inodes
>> are evicted, GFS2 currently calls into DLM.  Inode eviction can be
>> triggered by memory pressure, in the context of a random user-space
>> process.  If DLM happens to block in the process in question (for
>> example, if that process is a fence agent), GFS2 and DLM will deadlock.
>>
>> This patch queue stops GFS2 from calling into DLM on the inode evict
>> path when under memory pressure.  It does so by first decoupling
>> destroying inodes and putting their associated glocks, which is what
>> ends up calling into DLM.  Second, when under memory pressure, it moves
>> putting glocks into workqueue context where it cannot block DLM.
>> Third, when gfs2_drop_inode determines that an inode's link count has
>> hit zero under memory pressure, it puts that inode on the delete
>> workqueue (and keeps the inode in the icache) instead of causing
>> gfs2_evict_inode to delete the inode immediately.  The delete workqueue
>> will not be processed under memory pressure, so deleting inodes from
>> there is safe.
>
> Does this mean that all the corner cases are now covered and that this is
> now passing all the tests?

It did look like that for a while, but unfortunately we did get
directory corruption again ("Number of entries corrupt in dir"). All
signs are pointing at patch "Put glocks asynchronously" at this point;
I'm trying to figure out how to analyze this further.

Andreas
