Re: [PATCH v11 1/8] dax: Introduce holder for dax_device

2022-04-08 Thread Shiyang Ruan




在 2022/4/8 9:38, Dan Williams 写道:

[ add Mauro and Tony for RAS discussion ]

On Wed, Apr 6, 2022 at 1:39 PM Darrick J. Wong  wrote:


On Tue, Apr 05, 2022 at 06:22:48PM -0700, Dan Williams wrote:

On Tue, Apr 5, 2022 at 5:55 PM Jane Chu  wrote:


On 3/30/2022 9:18 AM, Darrick J. Wong wrote:

On Wed, Mar 30, 2022 at 08:49:29AM -0700, Christoph Hellwig wrote:

On Wed, Mar 30, 2022 at 06:58:21PM +0800, Shiyang Ruan wrote:

As the code I pasted before, pmem driver will subtract its ->data_offset,
which is byte-based. And the filesystem who implements ->notify_failure()
will calculate the offset in unit of byte again.

So, leave its function signature byte-based, to avoid repeated conversions.


I'm actually fine either way, so I'll wait for Dan to comment.


FWIW I'd convinced myself that the reason for using byte units is to
make it possible to reduce the pmem failure blast radius to subpage
units... but then I've also been distracted for months. :/



Yes, thanks Darrick!  I recall that.
Maybe just add a comment about why byte unit is used?


I think we start with page failure notification and then figure out
how to get finer grained through the dax interface in follow-on
changes. Otherwise, for finer grained error handling support,
memory_failure() would also need to be converted to stop upcasting
cache-line granularity to page granularity failures. The native MCE
notification communicates a 'struct mce' that can be in terms of
sub-page bytes, but the memory management implications are all page
based. I assume the FS implications are all FS-block-size based?


I wouldn't necessarily make that assumption -- for regular files, the
user program is in a better position to figure out how to reset the file
contents.

For fs metadata, it really depends.  In principle, if (say) we could get
byte granularity poison info, we could look up the space usage within
the block to decide if the poisoned part was actually free space, in
which case we can correct the problem by (re)zeroing the affected bytes
to clear the poison.

Obviously, if the blast radius hits the internal space info or something
that was storing useful data, then you'd have to rebuild the whole block
(or the whole data structure), but that's not necessarily a given.


tl;dr: dax_holder_notify_failure() != fs->notify_failure()

So I think I see some confusion between what DAX->notify_failure()
needs, memory_failure() needs, the raw information provided by the
hardware, and the failure granularity the filesystem can make use of.

DAX and memory_failure() need to make immediate page granularity
decisions. They both need to map out whole pages (in the direct map
and userspace respectively) to prevent future poison consumption, at
least until the poison is repaired.

The event that leads to a page being failed can be triggered by a
hardware error as small as an individual cacheline. While that is
interesting to a filesystem it isn't information that memory_failure()
and DAX can utilize.

The reason DAX needs to have a callback into filesystem code is to map
the page failure back to all the processes that might have that page
mapped because reflink means that page->mapping is not sufficient to
find all the affected 'struct address_space' instances. So it's more
of an address-translation / "help me kill processes" service than a
general failure notification service.

Currently when raw hardware event happens there are mechanisms like
arch-specific notifier chains, like powerpc::mce_register_notifier()
and x86::mce_register_decode_chain(), or other platform firmware code
like ghes_edac_report_mem_error() that uplevel the error to a coarse
page granularity failure, while emitting the fine granularity error
event to userspace.

All of this to say that the interface to ask the fs to do the bottom
half of memory_failure() (walking affected 'struct address_space'
instances and killing processes (mf_dax_kill_procs())) is different
than the general interface to tell the filesystem that memory has gone
bad relative to a device. So if the only caller of
fs->notify_failure() handler is this code:

+   if (pgmap->ops->memory_failure) {
+   rc = pgmap->ops->memory_failure(pgmap, PFN_PHYS(pfn), PAGE_SIZE,
+   flags);

...then you'll never get fine-grained reports. So, I still think the
DAX, pgmap and memory_failure() interface should be pfn based. The
interface to the *filesystem* ->notify_failure() can still be
byte-based, but the trigger for that byte based interface will likely
need to be something driven by another agent. Perhaps like rasdaemon
in userspace translating all the arch specific physical address events
back into device-relative offsets and then calling a new ABI that is
serviced by fs->notify_failure() on the backend.


Understood.  I'll do as your advise.  Thanks!


--
Ruan.





Re: [PATCH v11 1/8] dax: Introduce holder for dax_device

2022-04-07 Thread Dan Williams
[ add Mauro and Tony for RAS discussion ]

On Wed, Apr 6, 2022 at 1:39 PM Darrick J. Wong  wrote:
>
> On Tue, Apr 05, 2022 at 06:22:48PM -0700, Dan Williams wrote:
> > On Tue, Apr 5, 2022 at 5:55 PM Jane Chu  wrote:
> > >
> > > On 3/30/2022 9:18 AM, Darrick J. Wong wrote:
> > > > On Wed, Mar 30, 2022 at 08:49:29AM -0700, Christoph Hellwig wrote:
> > > >> On Wed, Mar 30, 2022 at 06:58:21PM +0800, Shiyang Ruan wrote:
> > > >>> As the code I pasted before, pmem driver will subtract its 
> > > >>> ->data_offset,
> > > >>> which is byte-based. And the filesystem who implements 
> > > >>> ->notify_failure()
> > > >>> will calculate the offset in unit of byte again.
> > > >>>
> > > >>> So, leave its function signature byte-based, to avoid repeated 
> > > >>> conversions.
> > > >>
> > > >> I'm actually fine either way, so I'll wait for Dan to comment.
> > > >
> > > > FWIW I'd convinced myself that the reason for using byte units is to
> > > > make it possible to reduce the pmem failure blast radius to subpage
> > > > units... but then I've also been distracted for months. :/
> > > >
> > >
> > > Yes, thanks Darrick!  I recall that.
> > > Maybe just add a comment about why byte unit is used?
> >
> > I think we start with page failure notification and then figure out
> > how to get finer grained through the dax interface in follow-on
> > changes. Otherwise, for finer grained error handling support,
> > memory_failure() would also need to be converted to stop upcasting
> > cache-line granularity to page granularity failures. The native MCE
> > notification communicates a 'struct mce' that can be in terms of
> > sub-page bytes, but the memory management implications are all page
> > based. I assume the FS implications are all FS-block-size based?
>
> I wouldn't necessarily make that assumption -- for regular files, the
> user program is in a better position to figure out how to reset the file
> contents.
>
> For fs metadata, it really depends.  In principle, if (say) we could get
> byte granularity poison info, we could look up the space usage within
> the block to decide if the poisoned part was actually free space, in
> which case we can correct the problem by (re)zeroing the affected bytes
> to clear the poison.
>
> Obviously, if the blast radius hits the internal space info or something
> that was storing useful data, then you'd have to rebuild the whole block
> (or the whole data structure), but that's not necessarily a given.

tl;dr: dax_holder_notify_failure() != fs->notify_failure()

So I think I see some confusion between what DAX->notify_failure()
needs, memory_failure() needs, the raw information provided by the
hardware, and the failure granularity the filesystem can make use of.

DAX and memory_failure() need to make immediate page granularity
decisions. They both need to map out whole pages (in the direct map
and userspace respectively) to prevent future poison consumption, at
least until the poison is repaired.

The event that leads to a page being failed can be triggered by a
hardware error as small as an individual cacheline. While that is
interesting to a filesystem it isn't information that memory_failure()
and DAX can utilize.

The reason DAX needs to have a callback into filesystem code is to map
the page failure back to all the processes that might have that page
mapped because reflink means that page->mapping is not sufficient to
find all the affected 'struct address_space' instances. So it's more
of an address-translation / "help me kill processes" service than a
general failure notification service.

Currently when raw hardware event happens there are mechanisms like
arch-specific notifier chains, like powerpc::mce_register_notifier()
and x86::mce_register_decode_chain(), or other platform firmware code
like ghes_edac_report_mem_error() that uplevel the error to a coarse
page granularity failure, while emitting the fine granularity error
event to userspace.

All of this to say that the interface to ask the fs to do the bottom
half of memory_failure() (walking affected 'struct address_space'
instances and killing processes (mf_dax_kill_procs())) is different
than the general interface to tell the filesystem that memory has gone
bad relative to a device. So if the only caller of
fs->notify_failure() handler is this code:

+   if (pgmap->ops->memory_failure) {
+   rc = pgmap->ops->memory_failure(pgmap, PFN_PHYS(pfn), PAGE_SIZE,
+   flags);

...then you'll never get fine-grained reports. So, I still think the
DAX, pgmap and memory_failure() interface should be pfn based. The
interface to the *filesystem* ->notify_failure() can still be
byte-based, but the trigger for that byte based interface will likely
need to be something driven by another agent. Perhaps like rasdaemon
in userspace translating all the arch specific physical address events
back into device-relative offsets and then calling a new ABI that is
serviced by fs->notify_failure() on the 

Re: [PATCH v11 1/8] dax: Introduce holder for dax_device

2022-04-06 Thread Darrick J. Wong
On Tue, Apr 05, 2022 at 06:22:48PM -0700, Dan Williams wrote:
> On Tue, Apr 5, 2022 at 5:55 PM Jane Chu  wrote:
> >
> > On 3/30/2022 9:18 AM, Darrick J. Wong wrote:
> > > On Wed, Mar 30, 2022 at 08:49:29AM -0700, Christoph Hellwig wrote:
> > >> On Wed, Mar 30, 2022 at 06:58:21PM +0800, Shiyang Ruan wrote:
> > >>> As the code I pasted before, pmem driver will subtract its 
> > >>> ->data_offset,
> > >>> which is byte-based. And the filesystem who implements 
> > >>> ->notify_failure()
> > >>> will calculate the offset in unit of byte again.
> > >>>
> > >>> So, leave its function signature byte-based, to avoid repeated 
> > >>> conversions.
> > >>
> > >> I'm actually fine either way, so I'll wait for Dan to comment.
> > >
> > > FWIW I'd convinced myself that the reason for using byte units is to
> > > make it possible to reduce the pmem failure blast radius to subpage
> > > units... but then I've also been distracted for months. :/
> > >
> >
> > Yes, thanks Darrick!  I recall that.
> > Maybe just add a comment about why byte unit is used?
> 
> I think we start with page failure notification and then figure out
> how to get finer grained through the dax interface in follow-on
> changes. Otherwise, for finer grained error handling support,
> memory_failure() would also need to be converted to stop upcasting
> cache-line granularity to page granularity failures. The native MCE
> notification communicates a 'struct mce' that can be in terms of
> sub-page bytes, but the memory management implications are all page
> based. I assume the FS implications are all FS-block-size based?

I wouldn't necessarily make that assumption -- for regular files, the
user program is in a better position to figure out how to reset the file
contents.

For fs metadata, it really depends.  In principle, if (say) we could get
byte granularity poison info, we could look up the space usage within
the block to decide if the poisoned part was actually free space, in
which case we can correct the problem by (re)zeroing the affected bytes
to clear the poison.

Obviously, if the blast radius hits the internal space info or something
that was storing useful data, then you'd have to rebuild the whole block
(or the whole data structure), but that's not necessarily a given.

--D




Re: [PATCH v11 1/8] dax: Introduce holder for dax_device

2022-04-05 Thread Dan Williams
On Tue, Apr 5, 2022 at 5:55 PM Jane Chu  wrote:
>
> On 3/30/2022 9:18 AM, Darrick J. Wong wrote:
> > On Wed, Mar 30, 2022 at 08:49:29AM -0700, Christoph Hellwig wrote:
> >> On Wed, Mar 30, 2022 at 06:58:21PM +0800, Shiyang Ruan wrote:
> >>> As the code I pasted before, pmem driver will subtract its ->data_offset,
> >>> which is byte-based. And the filesystem who implements ->notify_failure()
> >>> will calculate the offset in unit of byte again.
> >>>
> >>> So, leave its function signature byte-based, to avoid repeated 
> >>> conversions.
> >>
> >> I'm actually fine either way, so I'll wait for Dan to comment.
> >
> > FWIW I'd convinced myself that the reason for using byte units is to
> > make it possible to reduce the pmem failure blast radius to subpage
> > units... but then I've also been distracted for months. :/
> >
>
> Yes, thanks Darrick!  I recall that.
> Maybe just add a comment about why byte unit is used?

I think we start with page failure notification and then figure out
how to get finer grained through the dax interface in follow-on
changes. Otherwise, for finer grained error handling support,
memory_failure() would also need to be converted to stop upcasting
cache-line granularity to page granularity failures. The native MCE
notification communicates a 'struct mce' that can be in terms of
sub-page bytes, but the memory management implications are all page
based. I assume the FS implications are all FS-block-size based?



Re: [PATCH v11 1/8] dax: Introduce holder for dax_device

2022-04-05 Thread Jane Chu
On 3/30/2022 9:18 AM, Darrick J. Wong wrote:
> On Wed, Mar 30, 2022 at 08:49:29AM -0700, Christoph Hellwig wrote:
>> On Wed, Mar 30, 2022 at 06:58:21PM +0800, Shiyang Ruan wrote:
>>> As the code I pasted before, pmem driver will subtract its ->data_offset,
>>> which is byte-based. And the filesystem who implements ->notify_failure()
>>> will calculate the offset in unit of byte again.
>>>
>>> So, leave its function signature byte-based, to avoid repeated conversions.
>>
>> I'm actually fine either way, so I'll wait for Dan to comment.
> 
> FWIW I'd convinced myself that the reason for using byte units is to
> make it possible to reduce the pmem failure blast radius to subpage
> units... but then I've also been distracted for months. :/
> 

Yes, thanks Darrick!  I recall that.
Maybe just add a comment about why byte unit is used?

thanks!
-jane

> --D



Re: [PATCH v11 1/8] dax: Introduce holder for dax_device

2022-03-30 Thread Darrick J. Wong
On Wed, Mar 30, 2022 at 08:49:29AM -0700, Christoph Hellwig wrote:
> On Wed, Mar 30, 2022 at 06:58:21PM +0800, Shiyang Ruan wrote:
> > As the code I pasted before, pmem driver will subtract its ->data_offset,
> > which is byte-based. And the filesystem who implements ->notify_failure()
> > will calculate the offset in unit of byte again.
> > 
> > So, leave its function signature byte-based, to avoid repeated conversions.
> 
> I'm actually fine either way, so I'll wait for Dan to comment.

FWIW I'd convinced myself that the reason for using byte units is to
make it possible to reduce the pmem failure blast radius to subpage
units... but then I've also been distracted for months. :/

--D



Re: [PATCH v11 1/8] dax: Introduce holder for dax_device

2022-03-30 Thread Christoph Hellwig
On Wed, Mar 30, 2022 at 06:58:21PM +0800, Shiyang Ruan wrote:
> As the code I pasted before, pmem driver will subtract its ->data_offset,
> which is byte-based. And the filesystem who implements ->notify_failure()
> will calculate the offset in unit of byte again.
> 
> So, leave its function signature byte-based, to avoid repeated conversions.

I'm actually fine either way, so I'll wait for Dan to comment.



Re: [PATCH v11 1/8] dax: Introduce holder for dax_device

2022-03-30 Thread Shiyang Ruan




在 2022/3/30 18:13, Christoph Hellwig 写道:

On Wed, Mar 30, 2022 at 06:03:01PM +0800, Shiyang Ruan wrote:


Because I am not sure if the offset between each layer is page aligned.  For
example, when pmem dirver handles ->memory_failure(), it should subtract its
->data_offset when it calls dax_holder_notify_failure().


If they aren't, none of the DAX machinery would work.


OK. Got it.

So, use page-based function signature for ->memory_failure():

int (*memory_failure)(struct dev_pagemap *pgmap, unsigned long pfn,
  unsigned long nr_pfns, int flags);


As the code I pasted before, pmem driver will subtract its 
->data_offset, which is byte-based. And the filesystem who implements 
->notify_failure() will calculate the offset in unit of byte again.


So, leave its function signature byte-based, to avoid repeated conversions.

int (*notify_failure)(struct dax_device *dax_dev, u64 offset,
  u64 len, int mf_flags);

What do you think?


--
Thanks,
Ruan.





Re: [PATCH v11 1/8] dax: Introduce holder for dax_device

2022-03-30 Thread Christoph Hellwig
On Wed, Mar 30, 2022 at 06:03:01PM +0800, Shiyang Ruan wrote:
> 
> Because I am not sure if the offset between each layer is page aligned.  For
> example, when pmem dirver handles ->memory_failure(), it should subtract its
> ->data_offset when it calls dax_holder_notify_failure().

If they aren't, none of the DAX machinery would work.



Re: [PATCH v11 1/8] dax: Introduce holder for dax_device

2022-03-30 Thread Shiyang Ruan




在 2022/3/30 13:41, Christoph Hellwig 写道:

On Wed, Mar 16, 2022 at 09:46:07PM +0800, Shiyang Ruan wrote:

Forgive me if this has been discussed before, but since dax_operations
are in terms of pgoff and nr pages and memory_failure() is in terms of
pfns what was the rationale for making the function signature byte
based?


Maybe I didn't describe it clearly...  The @offset and @len here are
byte-based.  And so is ->memory_failure().


Yes, but is there a good reason for that when the rest of the DAX code
tends to work in page chunks?


Because I am not sure if the offset between each layer is page aligned. 
 For example, when pmem dirver handles ->memory_failure(), it should 
subtract its ->data_offset when it calls dax_holder_notify_failure().


The implementation of ->memory_failure() by pmem driver:
+static int pmem_pagemap_memory_failure(struct dev_pagemap *pgmap,
+   phys_addr_t addr, u64 len, int mf_flags)
+{
+   struct pmem_device *pmem =
+   container_of(pgmap, struct pmem_device, pgmap);
+   u64 offset = addr - pmem->phys_addr - pmem->data_offset;
+
+   return dax_holder_notify_failure(pmem->dax_dev, offset, len, mf_flags);
+}

So, I choose u64 as the type of @len.  And for consistency, the @addr is 
using byte-based type as well.


> memory_failure()
> |* fsdax case
> |
> |pgmap->ops->memory_failure()  => pmem_pgmap_memory_failure()
> | dax_holder_notify_failure()  =>

the offset from 'pmem driver' to 'dax holder'

> |  dax_device->holder_ops->notify_failure() =>
> | - xfs_dax_notify_failure()
> |  |* xfs_dax_notify_failure()
> |  |--
> |  |   xfs_rmap_query_range()
> |  |xfs_dax_failure_fn()
> |  |* corrupted on metadata
> |  |   try to recover data, call xfs_force_shutdown()
> |  |* corrupted on file data
> |  |   try to recover data, call mf_dax_kill_procs()
> |* normal case
> |-
> |mf_generic_kill_procs()


--
Thanks,
Ruan.





Re: [PATCH v11 1/8] dax: Introduce holder for dax_device

2022-03-29 Thread Christoph Hellwig
On Wed, Mar 16, 2022 at 09:46:07PM +0800, Shiyang Ruan wrote:
> > Forgive me if this has been discussed before, but since dax_operations
> > are in terms of pgoff and nr pages and memory_failure() is in terms of
> > pfns what was the rationale for making the function signature byte
> > based?
> 
> Maybe I didn't describe it clearly...  The @offset and @len here are
> byte-based.  And so is ->memory_failure().

Yes, but is there a good reason for that when the rest of the DAX code
tends to work in page chunks?



Re: [PATCH v11 1/8] dax: Introduce holder for dax_device

2022-03-29 Thread Christoph Hellwig
On Fri, Mar 11, 2022 at 03:35:13PM -0800, Dan Williams wrote:
> > +   if (!dax_dev->holder_ops) {
> > +   rc = -EOPNOTSUPP;
> 
> I think it is ok to return success (0) for this case. All the caller
> of dax_holder_notify_failure() wants to know is if the notification
> was successfully delivered to the holder. If there is no holder
> present then there is nothing to report. This is minor enough for me
> to fix up locally if nothing else needs to be changed.

The caller needs to know there are no holder ops to fall back to
different path.

> Isn't this another failure scenario? If kill_dax() is called while a
> holder is still holding the dax_device that seems to be another
> ->notify_failure scenario to tell the holder that the device is going
> away and the holder has not released the device yet.

Yes.



Re: [PATCH v11 1/8] dax: Introduce holder for dax_device

2022-03-16 Thread Shiyang Ruan




在 2022/3/12 7:35, Dan Williams 写道:

On Sun, Feb 27, 2022 at 4:08 AM Shiyang Ruan  wrote:


To easily track filesystem from a pmem device, we introduce a holder for
dax_device structure, and also its operation.  This holder is used to
remember who is using this dax_device:
  - When it is the backend of a filesystem, the holder will be the
instance of this filesystem.
  - When this pmem device is one of the targets in a mapped device, the
holder will be this mapped device.  In this case, the mapped device
has its own dax_device and it will follow the first rule.  So that we
can finally track to the filesystem we needed.

The holder and holder_ops will be set when filesystem is being mounted,
or an target device is being activated.

Signed-off-by: Shiyang Ruan 
---
  drivers/dax/super.c | 89 +
  include/linux/dax.h | 32 
  2 files changed, 121 insertions(+)

diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index e3029389d809..da5798e19d57 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -21,6 +21,9 @@
   * @cdev: optional character interface for "device dax"
   * @private: dax driver private data
   * @flags: state and boolean properties
+ * @ops: operations for dax_device
+ * @holder_data: holder of a dax_device: could be filesystem or mapped device
+ * @holder_ops: operations for the inner holder
   */
  struct dax_device {
 struct inode inode;
@@ -28,6 +31,8 @@ struct dax_device {
 void *private;
 unsigned long flags;
 const struct dax_operations *ops;
+   void *holder_data;
+   const struct dax_holder_operations *holder_ops;
  };

  static dev_t dax_devt;
@@ -193,6 +198,29 @@ int dax_zero_page_range(struct dax_device *dax_dev, 
pgoff_t pgoff,
  }
  EXPORT_SYMBOL_GPL(dax_zero_page_range);

+int dax_holder_notify_failure(struct dax_device *dax_dev, u64 off,
+ u64 len, int mf_flags)
+{
+   int rc, id;
+
+   id = dax_read_lock();
+   if (!dax_alive(dax_dev)) {
+   rc = -ENXIO;
+   goto out;
+   }
+
+   if (!dax_dev->holder_ops) {
+   rc = -EOPNOTSUPP;


I think it is ok to return success (0) for this case. All the caller
of dax_holder_notify_failure() wants to know is if the notification
was successfully delivered to the holder. If there is no holder
present then there is nothing to report. This is minor enough for me
to fix up locally if nothing else needs to be changed.


I thought it could fall back to generic memory failure handler: 
mf_generic_kill_procs(), if holder_ops not exists.





+   goto out;
+   }
+
+   rc = dax_dev->holder_ops->notify_failure(dax_dev, off, len, mf_flags);
+out:
+   dax_read_unlock(id);
+   return rc;
+}
+EXPORT_SYMBOL_GPL(dax_holder_notify_failure);
+
  #ifdef CONFIG_ARCH_HAS_PMEM_API
  void arch_wb_cache_pmem(void *addr, size_t size);
  void dax_flush(struct dax_device *dax_dev, void *addr, size_t size)
@@ -268,6 +296,10 @@ void kill_dax(struct dax_device *dax_dev)

 clear_bit(DAXDEV_ALIVE, _dev->flags);
 synchronize_srcu(_srcu);
+
+   /* clear holder data */
+   dax_dev->holder_ops = NULL;
+   dax_dev->holder_data = NULL;


Isn't this another failure scenario? If kill_dax() is called while a
holder is still holding the dax_device that seems to be another
->notify_failure scenario to tell the holder that the device is going
away and the holder has not released the device yet.


Yes.  I should call dax_holder_notify_failure() and then unregister the 
holder.





  }
  EXPORT_SYMBOL_GPL(kill_dax);

@@ -409,6 +441,63 @@ void put_dax(struct dax_device *dax_dev)
  }
  EXPORT_SYMBOL_GPL(put_dax);

+/**
+ * dax_holder() - obtain the holder of a dax device
+ * @dax_dev: a dax_device instance
+
+ * Return: the holder's data which represents the holder if registered,
+ * otherwize NULL.
+ */
+void *dax_holder(struct dax_device *dax_dev)
+{
+   if (!dax_alive(dax_dev))
+   return NULL;


It's safe for the holder to assume that it can de-reference
->holder_data freely in its notify_handler callback because
dax_holder_notify_failure() arranges for the callback to run in
dax_read_lock() context.

This is another minor detail that I can fixup locally.


+
+   return dax_dev->holder_data;
+}
+EXPORT_SYMBOL_GPL(dax_holder);
+
+/**
+ * dax_register_holder() - register a holder to a dax device
+ * @dax_dev: a dax_device instance
+ * @holder: a pointer to a holder's data which represents the holder
+ * @ops: operations of this holder
+
+ * Return: negative errno if an error occurs, otherwise 0.
+ */
+int dax_register_holder(struct dax_device *dax_dev, void *holder,
+   const struct dax_holder_operations *ops)
+{
+   if (!dax_alive(dax_dev))
+   return -ENXIO;
+
+   if (cmpxchg(_dev->holder_data, NULL, holder))
+   return -EBUSY;
+
+   

Re: [PATCH v11 1/8] dax: Introduce holder for dax_device

2022-03-11 Thread Dan Williams
On Sun, Feb 27, 2022 at 4:08 AM Shiyang Ruan  wrote:
>
> To easily track filesystem from a pmem device, we introduce a holder for
> dax_device structure, and also its operation.  This holder is used to
> remember who is using this dax_device:
>  - When it is the backend of a filesystem, the holder will be the
>instance of this filesystem.
>  - When this pmem device is one of the targets in a mapped device, the
>holder will be this mapped device.  In this case, the mapped device
>has its own dax_device and it will follow the first rule.  So that we
>can finally track to the filesystem we needed.
>
> The holder and holder_ops will be set when filesystem is being mounted,
> or an target device is being activated.
>
> Signed-off-by: Shiyang Ruan 
> ---
>  drivers/dax/super.c | 89 +
>  include/linux/dax.h | 32 
>  2 files changed, 121 insertions(+)
>
> diff --git a/drivers/dax/super.c b/drivers/dax/super.c
> index e3029389d809..da5798e19d57 100644
> --- a/drivers/dax/super.c
> +++ b/drivers/dax/super.c
> @@ -21,6 +21,9 @@
>   * @cdev: optional character interface for "device dax"
>   * @private: dax driver private data
>   * @flags: state and boolean properties
> + * @ops: operations for dax_device
> + * @holder_data: holder of a dax_device: could be filesystem or mapped device
> + * @holder_ops: operations for the inner holder
>   */
>  struct dax_device {
> struct inode inode;
> @@ -28,6 +31,8 @@ struct dax_device {
> void *private;
> unsigned long flags;
> const struct dax_operations *ops;
> +   void *holder_data;
> +   const struct dax_holder_operations *holder_ops;
>  };
>
>  static dev_t dax_devt;
> @@ -193,6 +198,29 @@ int dax_zero_page_range(struct dax_device *dax_dev, 
> pgoff_t pgoff,
>  }
>  EXPORT_SYMBOL_GPL(dax_zero_page_range);
>
> +int dax_holder_notify_failure(struct dax_device *dax_dev, u64 off,
> + u64 len, int mf_flags)
> +{
> +   int rc, id;
> +
> +   id = dax_read_lock();
> +   if (!dax_alive(dax_dev)) {
> +   rc = -ENXIO;
> +   goto out;
> +   }
> +
> +   if (!dax_dev->holder_ops) {
> +   rc = -EOPNOTSUPP;

I think it is ok to return success (0) for this case. All the caller
of dax_holder_notify_failure() wants to know is if the notification
was successfully delivered to the holder. If there is no holder
present then there is nothing to report. This is minor enough for me
to fix up locally if nothing else needs to be changed.

> +   goto out;
> +   }
> +
> +   rc = dax_dev->holder_ops->notify_failure(dax_dev, off, len, mf_flags);
> +out:
> +   dax_read_unlock(id);
> +   return rc;
> +}
> +EXPORT_SYMBOL_GPL(dax_holder_notify_failure);
> +
>  #ifdef CONFIG_ARCH_HAS_PMEM_API
>  void arch_wb_cache_pmem(void *addr, size_t size);
>  void dax_flush(struct dax_device *dax_dev, void *addr, size_t size)
> @@ -268,6 +296,10 @@ void kill_dax(struct dax_device *dax_dev)
>
> clear_bit(DAXDEV_ALIVE, _dev->flags);
> synchronize_srcu(_srcu);
> +
> +   /* clear holder data */
> +   dax_dev->holder_ops = NULL;
> +   dax_dev->holder_data = NULL;

Isn't this another failure scenario? If kill_dax() is called while a
holder is still holding the dax_device that seems to be another
->notify_failure scenario to tell the holder that the device is going
away and the holder has not released the device yet.

>  }
>  EXPORT_SYMBOL_GPL(kill_dax);
>
> @@ -409,6 +441,63 @@ void put_dax(struct dax_device *dax_dev)
>  }
>  EXPORT_SYMBOL_GPL(put_dax);
>
> +/**
> + * dax_holder() - obtain the holder of a dax device
> + * @dax_dev: a dax_device instance
> +
> + * Return: the holder's data which represents the holder if registered,
> + * otherwize NULL.
> + */
> +void *dax_holder(struct dax_device *dax_dev)
> +{
> +   if (!dax_alive(dax_dev))
> +   return NULL;

It's safe for the holder to assume that it can de-reference
->holder_data freely in its notify_handler callback because
dax_holder_notify_failure() arranges for the callback to run in
dax_read_lock() context.

This is another minor detail that I can fixup locally.

> +
> +   return dax_dev->holder_data;
> +}
> +EXPORT_SYMBOL_GPL(dax_holder);
> +
> +/**
> + * dax_register_holder() - register a holder to a dax device
> + * @dax_dev: a dax_device instance
> + * @holder: a pointer to a holder's data which represents the holder
> + * @ops: operations of this holder
> +
> + * Return: negative errno if an error occurs, otherwise 0.
> + */
> +int dax_register_holder(struct dax_device *dax_dev, void *holder,
> +   const struct dax_holder_operations *ops)
> +{
> +   if (!dax_alive(dax_dev))
> +   return -ENXIO;
> +
> +   if (cmpxchg(_dev->holder_data, NULL, holder))
> +   return -EBUSY;
> +
> +   dax_dev->holder_ops = ops;
> +   return 0;
> +}
>