On Tue, Jun 12, 2018 at 11:15 AM, Ross Zwisler
<[email protected]> wrote:
> On Fri, Jun 08, 2018 at 04:51:14PM -0700, Dan Williams wrote:
>> In preparation for implementing support for memory poison (media error)
>> handling via dax mappings, implement a lock_page() equivalent. Poison
>> error handling requires rmap and needs guarantees that the page->mapping
>> association is maintained / valid (inode not freed) for the duration of
>> the lookup.
>>
>> In the device-dax case it is sufficient to simply hold a dev_pagemap
>> reference. In the filesystem-dax case we need to use the entry lock.
>>
>> Export the entry lock via dax_lock_page() that uses rcu_read_lock() to
>> protect against the inode being freed, and revalidates the page->mapping
>> association under xa_lock().
>>
>> Signed-off-by: Dan Williams <[email protected]>
>> ---
>> fs/dax.c | 76 +++++++++++++++++++++++++++++++++++++++++++++++++++
>> include/linux/dax.h | 15 ++++++++++
>> 2 files changed, 91 insertions(+)
>>
>> diff --git a/fs/dax.c b/fs/dax.c
>> index cccf6cad1a7a..b7e71b108fcf 100644
>> --- a/fs/dax.c
>> +++ b/fs/dax.c
>> @@ -361,6 +361,82 @@ static void dax_disassociate_entry(void *entry, struct address_space *mapping,
>> }
>> }
>>
>> +struct page *dax_lock_page(unsigned long pfn)
>> +{
>> + pgoff_t index;
>> + struct inode *inode;
>> + wait_queue_head_t *wq;
>> + void *entry = NULL, **slot;
>> + struct address_space *mapping;
>> + struct wait_exceptional_entry_queue ewait;
>> + struct page *ret = NULL, *page = pfn_to_page(pfn);
>> +
>> + rcu_read_lock();
>> + for (;;) {
>> + mapping = READ_ONCE(page->mapping);
>
> Why the READ_ONCE()?
We're potentially racing inode teardown, so the READ_ONCE() forces a
single load of page->mapping. Without it the compiler is free to
re-read the field, and the checks that follow could see inconsistent
values.
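
To illustrate the single-load point outside the kernel, here is a
minimal userspace sketch. READ_ONCE_SKETCH, struct mapping_sketch,
struct page_sketch and page_is_dax_sketch() are made-up names for
illustration only. The macro is a simplified stand-in for the kernel's
READ_ONCE() and assumes the GCC/Clang __typeof__ extension.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Simplified stand-in for the kernel's READ_ONCE(): the volatile
 * access tells the compiler to perform exactly one load of x.
 */
#define READ_ONCE_SKETCH(x) (*(const volatile __typeof__(x) *)&(x))

/* Hypothetical, cut-down versions of the kernel structures. */
struct mapping_sketch {
	int is_dax;
};

struct page_sketch {
	struct mapping_sketch *mapping;	/* may be cleared concurrently */
};

/*
 * Snapshot page->mapping once, then make every decision on the local
 * copy so the NULL check and the later use agree with each other.
 * Returns 1 if the snapshot refers to a DAX mapping, 0 otherwise.
 */
static int page_is_dax_sketch(struct page_sketch *page)
{
	struct mapping_sketch *mapping = READ_ONCE_SKETCH(page->mapping);

	if (!mapping)
		return 0;
	return mapping->is_dax;
}
```

The snapshot only keeps the pointer value stable. In the patch it is
rcu_read_lock() plus the recheck under xa_lock that protect against the
inode actually going away.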
>
>> +
>> + if (!mapping || !IS_DAX(mapping->host))
>
> Might read better using the dax_mapping() helper.
Sure.
>
> Also, forgive my ignorance, but this implies that dev dax has page->mapping
> set up and that that inode will have IS_DAX set, right? This will let us get
> past this point for device DAX, and we'll bail out at the S_ISCHR() check?
Yes.
>
>> + break;
>> +
>> + /*
>> + * In the device-dax case there's no need to lock, a
>> + * struct dev_pagemap pin is sufficient to keep the
>> + * inode alive.
>> + */
>> + inode = mapping->host;
>> + if (S_ISCHR(inode->i_mode)) {
>> + ret = page;
>
> 'ret' isn't actually used for anything in this function, we just
> unconditionally return 'page'.
>
Yes, that's a bug. The function should return 'ret', not 'page'.
>> + break;
>> + }
>> +
>> + xa_lock_irq(&mapping->i_pages);
>> + if (mapping != page->mapping) {
>> + xa_unlock_irq(&mapping->i_pages);
>> + continue;
>> + }
>> + index = page->index;
>> +
>> + init_wait(&ewait.wait);
>> + ewait.wait.func = wake_exceptional_entry_func;
>> +
>> + entry = __radix_tree_lookup(&mapping->i_pages, index, NULL,
>> + &slot);
>> + if (!entry ||
>
> So if we do a lookup and there is no entry in the tree, we won't add an empty
> entry and lock it, we'll just return with no entry in the tree and nothing
> locked.
>
> Then, when we call dax_unlock_page(), we'll eventually hit a WARN_ON_ONCE() in
> dax_unlock_mapping_entry() when we see entry is 0. And, in that gap we've got
> nothing locked so page faults could have happened, etc... (which would mean
> that instead of WARN_ON_ONCE() for an empty entry, we'd hit it instead for an
> unlocked entry).
>
> Is that okay? Or do we need to insert a locked empty entry here?
No, the intent was to return NULL and fail the lock, but I messed up
and unconditionally returned the page.
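
The intended return behaviour can be sketched in userspace as follows.
This is not the kernel code: struct page_sketch, its fields and
lock_page_sketch() are hypothetical stand-ins, and the retry loop,
xa_lock and wait queue handling are left out. It only shows 'ret'
staying NULL unless one of the success paths is taken.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical flags standing in for the checks in dax_lock_page(). */
struct page_sketch {
	int has_dax_mapping;	/* page->mapping set and IS_DAX() */
	int is_chardev;		/* S_ISCHR(inode->i_mode), device-dax */
	int has_entry;		/* radix tree lookup found an entry */
	int locked;		/* set when the entry lock is taken */
};

/*
 * Return the page on success, NULL when the lock attempt fails.
 * 'ret' starts out NULL and is only set on the two success paths.
 */
static struct page_sketch *lock_page_sketch(struct page_sketch *page)
{
	struct page_sketch *ret = NULL;

	if (!page->has_dax_mapping)
		goto out;

	if (page->is_chardev) {
		/* device-dax: no entry lock needed */
		ret = page;
		goto out;
	}

	if (!page->has_entry)
		goto out;	/* nothing to lock, fail the lock */

	page->locked = 1;
	ret = page;
out:
	return ret;
}
```

With this shape a caller only calls the unlock side when the return
value is non-NULL, so the unlock path never sees an empty entry.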
_______________________________________________
Linux-nvdimm mailing list
[email protected]
https://lists.01.org/mailman/listinfo/linux-nvdimm