Re: [dm-devel] [PATCH v5 2/8] drivers/pmem: Allow pmem_clear_poison() to accept arbitrary offset and len

2020-02-28 Thread Dan Williams
On Fri, Feb 28, 2020 at 6:05 AM Christoph Hellwig  wrote:
>
> On Thu, Feb 27, 2020 at 07:28:41PM -0800, Dan Williams wrote:
> > "don't perform generic memory-error-handling, there's an fs that owns
> > this daxdev and wants to disposition errors". The fs/dax.c
> > infrastructure that sets up ->index and ->mapping would still need to
> > be there for ext4 until it's ready to take on the same responsibility.
> > Last I checked the ext4 reverse mapping implementation was not as
> > capable as XFS. This goes back to why the initial design avoided
> > relying on not universally available / stable reverse-mapping
> > facilities and opted for extending the generic mm/memory-failure.c
> > implementation.
>
> The important bit is that we stop using ->index and ->mapping when XFS
> is used as that enables using DAX+reflinks, which actually is the most
> requested DAX feature on XFS (way more than silly runtime flag switches
> for example).

Understood. To be clear, the plan we are marching toward is to knock down
all the known objections to the removal of the "experimental" designation.
reflink is on that list and so is per-file dax. The thought was that
per-file dax was a nearer-term goal than reflink.





Re: [dm-devel] [PATCH v5 2/8] drivers/pmem: Allow pmem_clear_poison() to accept arbitrary offset and len

2020-02-28 Thread Christoph Hellwig
On Thu, Feb 27, 2020 at 07:28:41PM -0800, Dan Williams wrote:
> "don't perform generic memory-error-handling, there's an fs that owns
> this daxdev and wants to disposition errors". The fs/dax.c
> infrastructure that sets up ->index and ->mapping would still need to
> be there for ext4 until it's ready to take on the same responsibility.
> Last I checked the ext4 reverse mapping implementation was not as
> capable as XFS. This goes back to why the initial design avoided
> relying on not universally available / stable reverse-mapping
> facilities and opted for extending the generic mm/memory-failure.c
> implementation.

The important bit is that we stop using ->index and ->mapping when XFS
is used as that enables using DAX+reflinks, which actually is the most
requested DAX feature on XFS (way more than silly runtime flag switches
for example).





Re: [dm-devel] [PATCH v5 2/8] drivers/pmem: Allow pmem_clear_poison() to accept arbitrary offset and len

2020-02-27 Thread Dan Williams
On Thu, Feb 27, 2020 at 5:31 PM Dave Chinner  wrote:
> On Wed, Feb 26, 2020 at 08:19:37PM -0800, Dan Williams wrote:
[..]
> > So you want the FS to have error handling for just pmem errors or all
> > memory errors?
>
> Just pmem errors in the specific range the filesystem manages - we
> really only care about storage errors because those are the only errors
> the filesystem is responsible for handling correctly.
>
> Somebody else can worry about errors that hit page cache pages -
> page cache pages require mapping/index pointers on each page anyway,
> so a generic mechanism for handling those errors can be built into
> the page cache. And if the error is in general kernel memory, then
> it's game over for the entire kernel at that point, not just the
> filesystem.
>
> > And you want this to be done without the mm core using
> > page->index to identify what to unmap when the error happens?
>
> Isn't that exactly what I just said? We get the page address that
> failed, the daxdev can turn that into a sector address and call into
> the filesystem with a {sector, len, errno} tuple. We then do a
> reverse mapping lookup on {sector, len} to find all the owners of
> that range in the filesystem. If it's file data, that owner record
> gives us the inode and the offset into the file, which then gives us
> a {mapping, index} tuple.
>
> Further, the filesystem reverse map is aware that its blocks can be
> shared by multiple owners, so it will have a record for every inode
> and file offset that maps to that page. Hence we can simply iterate
> the reverse map and do that whacky collect_procs/kill_procs dance
> for every {mapping, index} pair that references the bad range.
>
> Nothing ever needs to be stored on the struct page...

Ok, so fs/dax.c needs to coordinate with mm/memory-failure.c to say
"don't perform generic memory-error-handling, there's an fs that owns
this daxdev and wants to disposition errors". The fs/dax.c
infrastructure that sets up ->index and ->mapping would still need to
be there for ext4 until it's ready to take on the same responsibility.
Last I checked the ext4 reverse mapping implementation was not as
capable as XFS. This goes back to why the initial design avoided
relying on not universally available / stable reverse-mapping
facilities and opted for extending the generic mm/memory-failure.c
implementation.

If XFS optionally supplants mm/memory-failure.c I would expect we'd
want to do better than the "whacky collect_procs/kill_procs"
implementation and let applications register for a notification format
better than BUS_MCEERR_AO signals.
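
(For context, this is roughly what an application has to do today to see
those signals -- an illustrative userspace sketch, not from any posted
series:)

#define _GNU_SOURCE
#include <signal.h>
#include <stddef.h>

/* Record the last reported memory error; illustrative only. */
static void *volatile poisoned_addr;
static volatile int poisoned_lsb;

static void sigbus_handler(int sig, siginfo_t *si, void *ucontext)
{
	(void)sig;
	(void)ucontext;

	if (si->si_code == BUS_MCEERR_AO || si->si_code == BUS_MCEERR_AR) {
		/* si_addr is the poisoned address, si_addr_lsb the log2
		 * of the blast radius in bytes (typically PAGE_SHIFT). */
		poisoned_addr = si->si_addr;
		poisoned_lsb = si->si_addr_lsb;
	}
}

static void register_mce_sigbus_handler(void)
{
	struct sigaction sa = {
		.sa_sigaction = sigbus_handler,
		.sa_flags = SA_SIGINFO,
	};

	sigaction(SIGBUS, &sa, NULL);
}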

> > Memory
> > error scanning is not universally available on all pmem
> > implementations, so FS will need to subscribe for memory-error
> > handling events.
>
> No. Filesystems interact with the underlying device abstraction, not
> the physical storage that lies behind that device abstraction.  The
> filesystem should not interface with pmem directly in any way (all
> direct accesses are hidden inside fs/dax.c!), nor should it care
> about the quirks of the pmem implementation it is sitting on. That's
> what the daxdev abstraction is supposed to hide from the users of
> the pmem.

I wasn't proposing that XFS have a machine-check handler; I was trying
to get to the next level of detail of the async notification interface to
the fs.

>
> IOWs, the daxdev infrastructure subscribes to memory-error event
> subsystem, calls out to the filesystem when an error in a page in
> the daxdev is reported. The filesystem only needs to know the
> {sector, len, errno} tuple related to the error; it is the device's
> responsibility to handle the physical mechanics of listening,
> filtering and translating MCEs to a format the filesystem
> understands
>
> Another reason it should be provided by the daxdev as a {sector,
> len, errno} tuple is that it also allows non-dax block devices to
> implement the similar error notifications and provide filesystems
> with exactly the same information so the filesystem can start
> auto-recovery processes

The driver layer does already have a 'struct badblocks' that both pmem
and md use, there's just no current way for the FS to get a reference to
it. However, my hope was to get away from a sector-based interface
because the error granularity is already smaller than a sector in the
daxdev case as compared to a bdev. A daxdev-native error record should
be a daxdev-relative byte offset, not a sector. If the fs wants to
align the blast radius of the error to sectors or fs-blocks, that's its
choice, but I don't think the driver interface should preclude
sub-sector error handling. Essentially I don't want to add more bdev
semantics to fs/dax.c while Vivek is busy removing them.
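
To make that concrete, a daxdev-native record could be as simple as
something like this (hypothetical, names made up for illustration):

/* Hypothetical daxdev-relative error record: byte granularity rather
 * than 512-byte sectors. Nothing like this exists in the tree today. */
struct dax_error_record {
	u64 offset;	/* byte offset from the start of the daxdev */
	u64 len;	/* length of the poisoned range, in bytes */
	int error;	/* e.g. -EIO */
};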





Re: [dm-devel] [PATCH v5 2/8] drivers/pmem: Allow pmem_clear_poison() to accept arbitrary offset and len

2020-02-27 Thread Dave Chinner
On Thu, Feb 27, 2020 at 10:25:17AM -0500, Vivek Goyal wrote:
> On Thu, Feb 27, 2020 at 02:11:43PM +1100, Dave Chinner wrote:
> > On Wed, Feb 26, 2020 at 11:57:56AM -0500, Vivek Goyal wrote:
> > > On Tue, Feb 25, 2020 at 02:49:30PM -0800, Dan Williams wrote:
> > > [..]
> > > > > > I'm ok with replacing blkdev_issue_zeroout() with a dax operation
> > > > > > callback that deals with page aligned entries. That change at least
> > > > > > makes the error boundary symmetric across copy_from_iter() and the
> > > > > > zeroing path.
> > > > >
> > > > > IIUC, you are suggesting that modify dax_zero_page_range() to take 
> > > > > page
> > > > > aligned start and size and call this interface from
> > > > > __dax_zero_page_range() and get rid of blkdev_issue_zeroout() in that
> > > > > path?
> > > > >
> > > > > Something like.
> > > > >
> > > > > __dax_zero_page_range() {
> > > > >   if(page_aligned_io)
> > > > > call_dax_page_zero_range()
> > > > >   else
> > > > > use_direct_access_and_memcpy;
> > > > > }
> > > > >
> > > > > And other callers of blkdev_issue_zeroout() in filesystems can migrate
> > > > > to calling dax_zero_page_range() instead.
> > > > >
> > > > > If yes, I am not seeing what advantage do we get by this change.
> > > > >
> > > > > - __dax_zero_page_range() seems to be called by only partial block
> > > > >   zeroing code. So dax_zero_page_range() call will remain unused.
> > > > >
> > > > >
> > > > > - dax_zero_page_range() will be exact replacement of
> > > > >   blkdev_issue_zeroout() so filesystems will not gain anything. Just 
> > > > > that
> > > > >   it will create a dax specific hook.
> > > > >
> > > > > In that case it might be simpler to just get rid of 
> > > > > blkdev_issue_zeroout()
> > > > > call from __dax_zero_page_range() and make sure there are no callers 
> > > > > of
> > > > > full block zeroing from this path.
> > > > 
> > > > I think you're right. The path I'm concerned about not regressing is
> > > > the error clearing on new block allocation and we get that already via
> > > > xfs_zero_extent() and sb_issue_zeroout().
> > > 
> > > Well I was wrong. I found at least one user that uses
> > > __dax_zero_page_range() to zero full PAGE_SIZE blocks.
> > > 
> > > xfs_io -c "allocsp 32K 0" foo.txt
> > 
> > That ioctl interface is deprecated and likely unused by any new
> > application written since 1999. It predates unwritten extents (1998)
> > and I don't think any native linux applications have ever used it. A
> > newly written DAX aware application would almost certainly not use
> > this interface.
> > 
> > IOWs, I wouldn't use it as justification for some special case
> > behaviour; I'm more likely to say "rip that ancient ioctl out" than
> > to jump through hoops because it's implementation behaviour.
> 
> Hi Dave,
> 
> Do you see any other path in xfs using iomap_zero_range() and zeroing
> full block.

Yes:

- xfs_file_aio_write_checks() for zeroing blocks between the
  existing EOF and the start of the incoming write beyond EOF
- xfs_setattr_size() on truncate up for zeroing blocks between the
  existing EOF and the new EOF.
- xfs_reflink_zero_posteof() for zeroing blocks between the old EOF
  and where the new reflinked extents are going to land beyond EOF

And don't think that blocks beyond EOF can't exist when DAX is
enabled. We can turn DAX on and off, we can crash between allocation
and file size extension, etc. Hence this code must be able to handle
zeroing large ranges of blocks beyond EOF...

> iomap_zero_range() already skips IOMAP_HOLE and
> IOMAP_UNWRITTEN. So it has to be full block zeroing of a type that is
> neither IOMAP_HOLE nor IOMAP_UNWRITTEN.
> 
> My understanding is that ext4 and xfs both are initializing full blocks
> using blkdev_issue_zeroout(). Only partial blocks are being zeroed using
> this dax zeroing path.

Look at the API, not the callers: iomap_zero_range() takes a 64-bit
length parameter. It can be asked to zero blocks across petabytes of
a file. If there's a single block somewhere in that range, it will
only zero that block. If the entire range is allocated, it will zero
that entire range (yes, it will take forever!) as that is what it
is intended to do.

It should be pretty clear that it needs to be able to zero entire
pages, regardless of how it is currently used/called by filesystems.
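
For reference, the interface in question is roughly:

/* Zero [pos, pos + len) in the file, skipping holes and unwritten
 * extents; *did_zero reports whether anything was actually zeroed. */
int iomap_zero_range(struct inode *inode, loff_t pos, loff_t len,
		     bool *did_zero, const struct iomap_ops *ops);

i.e. nothing about the API restricts callers to partial blocks; any
caller can legitimately hand it an arbitrarily large range.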

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com





Re: [dm-devel] [PATCH v5 2/8] drivers/pmem: Allow pmem_clear_poison() to accept arbitrary offset and len

2020-02-27 Thread Dave Chinner
On Wed, Feb 26, 2020 at 08:19:37PM -0800, Dan Williams wrote:
> On Wed, Feb 26, 2020 at 7:03 PM Dave Chinner  wrote:
> > On Mon, Feb 24, 2020 at 10:38:44AM -0500, Vivek Goyal wrote:
> > > Anyway, partial page truncate can't ensure that data in the rest of the
> > > page can be read back successfully. Memory can get poisoned after the
> > > write and hence a read after truncate will still fail.
> >
> > Which is where the notification requirement comes in. Yes, we may
> > still get errors on read or write, but if memory at rest goes bad,
> > we want to handle that and correct it ASAP, not wait days or months
> > for something to trip over the poisoned page before we find out
> > about it.
> >
> > > Hence, all we are trying to ensure is that if a poison is known at the
> > > time of writing partial page, then we should return error to user space.
> >
> > I think within FS-DAX infrastructure, any access to the data (read
> > or write) within a poisoned page or a page marked with PageError()
> > should return EIO to the caller, unless it's the specific command to
> > clear the error/poison state on the page. What happens with that
> > error state is then up to the caller.
> >
> 
> I agree with most of the above if you replace "device-dax error
> handling" with "System RAM error handling". It's typical memory error
> handling that injects the page->index and page->mapping dependency.

I disagree, but that's beside the point and not worth arguing.

> So you want the FS to have error handling for just pmem errors or all
> memory errors?

Just pmem errors in the specific range the filesystem manages - we
really only care about storage errors because those are the only errors
the filesystem is responsible for handling correctly.

Somebody else can worry about errors that hit page cache pages -
page cache pages require mapping/index pointers on each page anyway,
so a generic mechanism for handling those errors can be built into
the page cache. And if the error is in general kernel memory, then
it's game over for the entire kernel at that point, not just the
filesystem.

> And you want this to be done without the mm core using
> page->index to identify what to unmap when the error happens?

Isn't that exactly what I just said? We get the page address that
failed, the daxdev can turn that into a sector address and call into
the filesystem with a {sector, len, errno} tuple. We then do a
reverse mapping lookup on {sector, len} to find all the owners of
that range in the filesystem. If it's file data, that owner record
gives us the inode and the offset into the file, which then gives us
a {mapping, index} tuple.

Further, the filesystem reverse map is aware that its blocks can be
shared by multiple owners, so it will have a record for every inode
and file offset that maps to that page. Hence we can simply iterate
the reverse map and do that whacky collect_procs/kill_procs dance
for every {mapping, index} pair that references the bad range.

Nothing ever needs to be stored on the struct page...

> Memory
> error scanning is not universally available on all pmem
> implementations, so FS will need to subscribe for memory-error
> handling events.

No. Filesystems interact with the underlying device abstraction, not
the physical storage that lies behind that device abstraction.  The
filesystem should not interface with pmem directly in any way (all
direct accesses are hidden inside fs/dax.c!), nor should it care
about the quirks of the pmem implementation it is sitting on. That's
what the daxdev abstraction is supposed to hide from the users of
the pmem.

IOWs, the daxdev infrastructure subscribes to memory-error event
subsystem, calls out to the filesystem when an error in a page in
the daxdev is reported. The filesystem only needs to know the
{sector, len, errno} tuple related to the error; it is the device's
responsibility to handle the physical mechanics of listening,
filtering and translating MCEs to a format the filesystem
understands

Another reason it should be provided by the daxdev as a {sector,
len, errno} tuple is that it also allows non-dax block devices to
implement the similar error notifications and provide filesystems
with exactly the same information so the filesystem can start
auto-recovery processes
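
If it helps to visualise it, the callout could be as simple as something
like this (an entirely hypothetical sketch, not an existing interface;
all names are made up):

/* Hypothetical notification op a filesystem (or any other holder of
 * the device) registers with the daxdev / block device. */
struct dax_holder_ops {
	/*
	 * Called when media errors are reported against the range
	 * [sector, sector + len) of the device. The filesystem does a
	 * reverse-map lookup, finds every owner of that range (file
	 * data, metadata, free space) and takes the appropriate
	 * action, e.g. the collect_procs/kill_procs dance for each
	 * {mapping, index} pair it finds.
	 */
	int (*notify_failure)(struct super_block *sb, sector_t sector,
			      u64 len, int error);
};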

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com





Re: [dm-devel] [PATCH v5 2/8] drivers/pmem: Allow pmem_clear_poison() to accept arbitrary offset and len

2020-02-27 Thread Vivek Goyal
On Thu, Feb 27, 2020 at 02:11:43PM +1100, Dave Chinner wrote:
> On Wed, Feb 26, 2020 at 11:57:56AM -0500, Vivek Goyal wrote:
> > On Tue, Feb 25, 2020 at 02:49:30PM -0800, Dan Williams wrote:
> > [..]
> > > > > I'm ok with replacing blkdev_issue_zeroout() with a dax operation
> > > > > callback that deals with page aligned entries. That change at least
> > > > > makes the error boundary symmetric across copy_from_iter() and the
> > > > > zeroing path.
> > > >
> > > > IIUC, you are suggesting that modify dax_zero_page_range() to take page
> > > > aligned start and size and call this interface from
> > > > __dax_zero_page_range() and get rid of blkdev_issue_zeroout() in that
> > > > path?
> > > >
> > > > Something like.
> > > >
> > > > __dax_zero_page_range() {
> > > >   if(page_aligned_io)
> > > > call_dax_page_zero_range()
> > > >   else
> > > > use_direct_access_and_memcpy;
> > > > }
> > > >
> > > > And other callers of blkdev_issue_zeroout() in filesystems can migrate
> > > > to calling dax_zero_page_range() instead.
> > > >
> > > > If yes, I am not seeing what advantage do we get by this change.
> > > >
> > > > - __dax_zero_page_range() seems to be called by only partial block
> > > >   zeroing code. So dax_zero_page_range() call will remain unused.
> > > >
> > > >
> > > > - dax_zero_page_range() will be exact replacement of
> > > >   blkdev_issue_zeroout() so filesystems will not gain anything. Just 
> > > > that
> > > >   it will create a dax specific hook.
> > > >
> > > > In that case it might be simpler to just get rid of 
> > > > blkdev_issue_zeroout()
> > > > call from __dax_zero_page_range() and make sure there are no callers of
> > > > full block zeroing from this path.
> > > 
> > > I think you're right. The path I'm concerned about not regressing is
> > > the error clearing on new block allocation and we get that already via
> > > xfs_zero_extent() and sb_issue_zeroout().
> > 
> > Well I was wrong. I found at least one user that uses
> > __dax_zero_page_range() to zero full PAGE_SIZE blocks.
> > 
> > xfs_io -c "allocsp 32K 0" foo.txt
> 
> That ioctl interface is deprecated and likely unused by any new
> application written since 1999. It predates unwritten extents (1998)
> and I don't think any native linux applications have ever used it. A
> newly written DAX aware application would almost certainly not use
> this interface.
> 
> IOWs, I wouldn't use it as justification for some special case
> behaviour; I'm more likely to say "rip that ancient ioctl out" than
> to jump through hoops because it's implementation behaviour.

Hi Dave,

Do you see any other path in xfs using iomap_zero_range() and zeroing a
full block? iomap_zero_range() already skips IOMAP_HOLE and
IOMAP_UNWRITTEN. So it has to be full block zeroing of a type that is
neither IOMAP_HOLE nor IOMAP_UNWRITTEN.

My understanding is that both ext4 and xfs initialize full blocks
using blkdev_issue_zeroout(); only partial blocks are zeroed using
this dax zeroing path.

If there are no callers of full block zeroing through
__dax_zero_page_range(), then I can simply get rid of the
blkdev_issue_zeroout() call from __dax_zero_page_range().

Thanks
Vivek




Re: [dm-devel] [PATCH v5 2/8] drivers/pmem: Allow pmem_clear_poison() to accept arbitrary offset and len

2020-02-26 Thread Dan Williams
On Wed, Feb 26, 2020 at 7:03 PM Dave Chinner  wrote:
>
> On Mon, Feb 24, 2020 at 10:38:44AM -0500, Vivek Goyal wrote:
> > On Mon, Feb 24, 2020 at 10:03:30AM +1100, Dave Chinner wrote:
> >
> > [..]
> > > > > > Hi Jeff,
> > > > > >
> > > > > > New dax zeroing interface (dax_zero_page_range()) can technically 
> > > > > > pass
> > > > > > a range which is less than a sector. Or which is bigger than a 
> > > > > > sector
> > > > > > but start and end are not aligned on sector boundaries.
> > > > >
> > > > > Sure, but who will call it with misaligned ranges?
> > > >
> > > > create a file foo.txt of size 4K and then truncate it.
> > > >
> > > > "truncate -s 23 foo.txt". Filesystems try to zero the bytes from 24 to
> > > > 4095.
> > >
> > > This should fail with EIO. Only full page writes should clear the
> > > bad page state, and partial writes should therefore fail because
> > > they do not guarantee the data in the filesystem block is all good.
> > >
> > > If this zeroing was a buffered write to an address with a bad
> > > sector, then the writeback will fail and the user will (eventually)
> > > get an EIO on the file.
> > >
> > > DAX should do the same thing, except because the zeroing is
> > > synchronous (i.e. done directly by the truncate syscall) we can -
> > > and should - return EIO immediately.
> > >
> > > Indeed, with your code, if we then extend the file by truncating up
> > > back to 4k, then the range between 23 and 512 is still bad, even
> > > though we've successfully zeroed it and the user knows it. An
> > > attempt to read anywhere in this range (e.g. 10 bytes at offset 100)
> > > will fail with EIO, but reading 10 bytes at offset 2000 will
> > > succeed.
> >
> > Hi Dave,
> >
> > What is expected if I do "truncate -s 512 foo.txt". Say first sector (0 to
> > 511) is poisoned and rest don't have poison. Should this fail with -EIO.
>
> Yes - the filesystem block still contains bad data.
>
> > In current implementation it does not.
>
> I'm not surprised - the whole hardware error handling architecture
> for FS-DAX is fundamentally broken. It was designed for device-dax,
> and it just doesn't work for FS-DAX.
>
> For example, to get the hardware error handling to be able to kill
> userspace applications, a 1:1 physical-to-logical association
> constraint was added to fs/dax.c:
>
> /*
>  * TODO: for reflink+dax we need a way to associate a single page with
>  * multiple address_space instances at different linear_page_index()
>  * offsets.
>  */
> static void dax_associate_entry(void *entry, struct address_space *mapping,
> struct vm_area_struct *vma, unsigned long address)
>
> because device-dax only has *linear mappings* and so has a fixed
> reverse mapping architecture.
>
> i.e. the whole hardware error handling infrastructure was designed
> around the constraints of device-dax. device-dax does not have any
> structure to serialise access to the physical storage, so locking
> was added to the mapping tree. The mapping tree locking is accessed
> on hardware error via the reverse mapping association in the struct
> page and that's how device-dax serialises direct physical storage
> access against hardware error processing.  And while the page index
> is locked in the mapping tree, it can walk the process vmas that
> have the page mapped to kill them so that they don't try to access
> the bad page.
>
> That bad physical storage state is carried in a volatile struct page
> flag, hence requiring some mechanism to make it persistent (the
> device bad blocks register) and some other mechanism to clear the
> poison state (direct IO, IIRC).
>
> It's also a big, nasty, solid roadblock to implementing shared
> data extents in FS-DAX. We basically have to completely re-architect
> the hardware error handling for FS-DAX so that such errors are
> reported to the filesystem first and then the filesystem does what
> is needed to handle the error.
>
> None of this works for filesystems because they need to perform
> different operations depending on what the page that went bad
> contains. FS-DAX should never trip over an unexpected poisoned page;
> we do so now because such physical storage errors are completely
> hidden from the filesystem.
>
> What you are trying to do is slap a band-aid over what to do when we
> hit an unexpected page containing bad data. Filesystems expect to
> find out about bad data in storage when they marshall the data into
> or out of memory. They make the assumption that once it is in memory
> it remains valid on the physical storage. Hence if an in-memory
> error occurs, we can just toss it away and re-read it from storage,
> and all is good.
>
> FS-DAX changes that - we are no longer marshalling data into and out
> of memory so we don't have a mechanism to get EIO when reading the
> page into the page cache or writing it back to disk. We also don't
> have an in-memory copy of the data - the physical storage is the
> in-memory copy, and so we can't just toss it away 

Re: [dm-devel] [PATCH v5 2/8] drivers/pmem: Allow pmem_clear_poison() to accept arbitrary offset and len

2020-02-26 Thread Dave Chinner
On Wed, Feb 26, 2020 at 11:57:56AM -0500, Vivek Goyal wrote:
> On Tue, Feb 25, 2020 at 02:49:30PM -0800, Dan Williams wrote:
> [..]
> > > > I'm ok with replacing blkdev_issue_zeroout() with a dax operation
> > > > callback that deals with page aligned entries. That change at least
> > > > makes the error boundary symmetric across copy_from_iter() and the
> > > > zeroing path.
> > >
> > > IIUC, you are suggesting that modify dax_zero_page_range() to take page
> > > aligned start and size and call this interface from
> > > __dax_zero_page_range() and get rid of blkdev_issue_zeroout() in that
> > > path?
> > >
> > > Something like.
> > >
> > > __dax_zero_page_range() {
> > >   if(page_aligned_io)
> > > call_dax_page_zero_range()
> > >   else
> > > use_direct_access_and_memcpy;
> > > }
> > >
> > > And other callers of blkdev_issue_zeroout() in filesystems can migrate
> > > to calling dax_zero_page_range() instead.
> > >
> > > If yes, I am not seeing what advantage do we get by this change.
> > >
> > > - __dax_zero_page_range() seems to be called by only partial block
> > >   zeroing code. So dax_zero_page_range() call will remain unused.
> > >
> > >
> > > - dax_zero_page_range() will be exact replacement of
> > >   blkdev_issue_zeroout() so filesystems will not gain anything. Just that
> > >   it will create a dax specific hook.
> > >
> > > In that case it might be simpler to just get rid of blkdev_issue_zeroout()
> > > call from __dax_zero_page_range() and make sure there are no callers of
> > > full block zeroing from this path.
> > 
> > I think you're right. The path I'm concerned about not regressing is
> > the error clearing on new block allocation and we get that already via
> > xfs_zero_extent() and sb_issue_zeroout().
> 
> Well I was wrong. I found at least one user that uses __dax_zero_page_range()
> to zero full PAGE_SIZE blocks.
> 
> xfs_io -c "allocsp 32K 0" foo.txt

That ioctl interface is deprecated and likely unused by any new
application written since 1999. It predates unwritten extents (1998)
and I don't think any native linux applications have ever used it. A
newly written DAX aware application would almost certainly not use
this interface.

IOWs, I wouldn't use it as justification for some special case
behaviour; I'm more likely to say "rip that ancient ioctl out" than
to jump through hoops because it's implementation behaviour.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com





Re: [dm-devel] [PATCH v5 2/8] drivers/pmem: Allow pmem_clear_poison() to accept arbitrary offset and len

2020-02-26 Thread Dave Chinner
On Mon, Feb 24, 2020 at 10:38:44AM -0500, Vivek Goyal wrote:
> On Mon, Feb 24, 2020 at 10:03:30AM +1100, Dave Chinner wrote:
> 
> [..]
> > > > > Hi Jeff,
> > > > >
> > > > > New dax zeroing interface (dax_zero_page_range()) can technically pass
> > > > > a range which is less than a sector. Or which is bigger than a sector
> > > > > but start and end are not aligned on sector boundaries.
> > > > 
> > > > Sure, but who will call it with misaligned ranges?
> > > 
> > > create a file foo.txt of size 4K and then truncate it.
> > > 
> > > "truncate -s 23 foo.txt". Filesystems try to zero the bytes from 24 to
> > > 4095.
> > 
> > This should fail with EIO. Only full page writes should clear the
> > bad page state, and partial writes should therefore fail because
> > they do not guarantee the data in the filesystem block is all good.
> > 
> > If this zeroing was a buffered write to an address with a bad
> > sector, then the writeback will fail and the user will (eventually)
> > get an EIO on the file.
> > 
> > DAX should do the same thing, except because the zeroing is
> > synchronous (i.e. done directly by the truncate syscall) we can -
> > and should - return EIO immediately.
> > 
> > Indeed, with your code, if we then extend the file by truncating up
> > back to 4k, then the range between 23 and 512 is still bad, even
> > though we've successfully zeroed it and the user knows it. An
> > attempt to read anywhere in this range (e.g. 10 bytes at offset 100)
> > will fail with EIO, but reading 10 bytes at offset 2000 will
> > succeed.
> 
> Hi Dave,
> 
> What is expected if I do "truncate -s 512 foo.txt". Say first sector (0 to
> 511) is poisoned and rest don't have poison. Should this fail with -EIO.

Yes - the filesystem block still contains bad data.

> In current implementation it does not.

I'm not surprised - the whole hardware error handling architecture
for FS-DAX is fundamentally broken. It was designed for device-dax,
and it just doesn't work for FS-DAX.

For example, to get the hardware error handling to be able to kill
userspace applications, a 1:1 physical-to-logical association
constraint was added to fs/dax.c:

/*
 * TODO: for reflink+dax we need a way to associate a single page with
 * multiple address_space instances at different linear_page_index()
 * offsets.
 */
static void dax_associate_entry(void *entry, struct address_space *mapping,
		struct vm_area_struct *vma, unsigned long address)

because device-dax only has *linear mappings* and so has a fixed
reverse mapping architecture.

i.e. the whole hardware error handling infrastructure was designed
around the constraints of device-dax. device-dax does not have any
structure to serialise access to the physical storage, so locking
was added to the mapping tree. The mapping tree locking is accessed
on hardware error via the reverse mapping association in the struct
page and that's how device-dax serialises direct physical storage
access against hardware error processing.  And while the page index
is locked in the mapping tree, it can walk the process vmas that
have the page mapped to kill them so that they don't try to access
the bad page.

That bad physical storage state is carried in a volatile struct page
flag, hence requiring some mechanism to make it persistent (the
device bad blocks register) and some other mechanism to clear the
poison state (direct IO, IIRC).

It's also a big, nasty, solid roadblock to implementing shared
data extents in FS-DAX. We basically have to completely re-architect
the hardware error handling for FS-DAX so that such errors are
reported to the filesystem first and then the filesystem does what
is needed to handle the error.

None of this works for filesystems because they need to perform
different operations depending on what the page that went bad
contains. FS-DAX should never trip over an unexpected poisoned page;
we do so now because such physical storage errors are completely
hidden from the filesystem.

What you are trying to do is slap a band-aid over what to do when we
hit an unexpected page containing bad data. Filesystems expect to
find out about bad data in storage when they marshall the data into
or out of memory. They make the assumption that once it is in memory
it remains valid on the physical storage. Hence if an in-memory
error occurs, we can just toss it away and re-read it from storage,
and all is good.

FS-DAX changes that - we are no longer marshalling data into and out
of memory so we don't have a mechanism to get EIO when reading the
page into the page cache or writing it back to disk. We also don't
have an in-memory copy of the data - the physical storage is the
in-memory copy, and so we can't just toss it away when an error
occurs.

What we actually require is instantaneous notification of physical
storage errors so we can handle the error immediately. And that, in
turn, means we should never poison or see poisoned pages during
direct access operations because the filesystem 

Re: [dm-devel] [PATCH v5 2/8] drivers/pmem: Allow pmem_clear_poison() to accept arbitrary offset and len

2020-02-26 Thread Vivek Goyal
On Tue, Feb 25, 2020 at 02:49:30PM -0800, Dan Williams wrote:
[..]
> > > I'm ok with replacing blkdev_issue_zeroout() with a dax operation
> > > callback that deals with page aligned entries. That change at least
> > > makes the error boundary symmetric across copy_from_iter() and the
> > > zeroing path.
> >
> > IIUC, you are suggesting that modify dax_zero_page_range() to take page
> > aligned start and size and call this interface from
> > __dax_zero_page_range() and get rid of blkdev_issue_zeroout() in that
> > path?
> >
> > Something like.
> >
> > __dax_zero_page_range() {
> >   if(page_aligned_io)
> > call_dax_page_zero_range()
> >   else
> > use_direct_access_and_memcpy;
> > }
> >
> > And other callers of blkdev_issue_zeroout() in filesystems can migrate
> > to calling dax_zero_page_range() instead.
> >
> > If yes, I am not seeing what advantage do we get by this change.
> >
> > - __dax_zero_page_range() seems to be called by only partial block
> >   zeroing code. So dax_zero_page_range() call will remain unused.
> >
> >
> > - dax_zero_page_range() will be exact replacement of
> >   blkdev_issue_zeroout() so filesystems will not gain anything. Just that
> >   it will create a dax specific hook.
> >
> > In that case it might be simpler to just get rid of blkdev_issue_zeroout()
> > call from __dax_zero_page_range() and make sure there are no callers of
> > full block zeroing from this path.
> 
> I think you're right. The path I'm concerned about not regressing is
> the error clearing on new block allocation and we get that already via
> xfs_zero_extent() and sb_issue_zeroout().

Well I was wrong. I found at least one user that uses __dax_zero_page_range()
to zero full PAGE_SIZE blocks.

xfs_io -c "allocsp 32K 0" foo.txt

In that case, I will add a new dax method, say dax_zero_page_range(), which
will take only a PAGE_SIZE aligned range, will clear known poison, and will
be called from __dax_zero_page_range() if the I/O is PAGE_SIZE aligned.

For now I will limit this interface to a single page at a time, because
otherwise the implementation becomes complex in the dm/md stack, where the
range has to be broken up across multiple devices, and there are no users
anyway because the current iomap API passes one page at a time.

Once we have grown users, one can then also tackle the complexity of
modifying dm/md to break a page range across multiple devices.
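
Roughly, what I have in mind is along these lines (a sketch only,
glossing over cache flushing and the dm/md stacking; final names and
signature may differ):

/* New dax_operations method, limited to whole, PAGE_SIZE aligned pages. */
struct dax_operations {
	/* ... existing ops ... */
	int (*zero_page_range)(struct dax_device *dax_dev, pgoff_t pgoff,
			       size_t nr_pages);
};

/* pmem's implementation would conceptually boil down to: */
static int pmem_dax_zero_page_range(struct dax_device *dax_dev,
				    pgoff_t pgoff, size_t nr_pages)
{
	struct pmem_device *pmem = dax_get_private(dax_dev);
	phys_addr_t off = PFN_PHYS(pgoff) + pmem->data_offset;
	size_t len = nr_pages * PAGE_SIZE;
	blk_status_t rc;

	/* clear any known poison in the range before writing zeros */
	rc = pmem_clear_poison(pmem, off, len);
	if (rc != BLK_STS_OK)
		return blk_status_to_errno(rc);

	/* real code would use a flushcache variant rather than memset */
	memset(pmem->virt_addr + off, 0, len);
	return 0;
}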

Will post a patch series for this.

Thanks
Vivek




Re: [dm-devel] [PATCH v5 2/8] drivers/pmem: Allow pmem_clear_poison() to accept arbitrary offset and len

2020-02-26 Thread Jane Chu

On 2/24/2020 4:26 PM, Dan Williams wrote:

On Mon, Feb 24, 2020 at 1:53 PM Jeff Moyer  wrote:


Dan Williams  writes:


Let's just focus on reporting errors when we know we have them.


That's the problem in my eyes. If software needs to contend with
latent error reporting then it should always contend otherwise
software has multiple error models to wrangle.


The only way for an application to know that the data has been written
successfully would be to issue a read after every write.  That's not a
performance hit most applications are willing to take.  And, of course,
the media can still go bad at a later time, so it only guarantees the
data is accessible immediately after having been written.

What I'm suggesting is that we should not complete a write successfully
if we know that the data will not be retrievable.  I wouldn't call this
adding an extra error model to contend with.  Applications should
already be checking for errors on write.

Does that make sense? Are we talking past each other?


The badblock list is late to update in both directions, late to add
entries that the scrub needs to find and late to delete entries that
were inadvertently cleared by cache-line writes that did not first
ingest the poison for a read-modify-write. So I see the above as being
wishful in using the error list as the hard source of truth and
unfortunate to up-level all sub-sector error entries into full
PAGE_SIZE data offline events.


Sorry, I don't mean to distract from the discussion, but I'm wondering if
anyone has noticed SIGBUS with si_code = MCEERR_AO in a single-process
poison test over a dax-xfs file?  There is only 1 poison in the file,
which has been consumed; it's the recovery code path (hole punch/
munmap/mmap/pwrite/read) that encounters the _AO. I'm confident that
a latent error isn't the scenario, per ARS scrub. Also, the _AO appears
rarely. This is unexplainable given the kernel MCE pmem handling
implementation.



I'm hoping we can find a way to make the error handling more fine
grained over time, but for the current patch, managing the blast
radius as PAGE_SIZE granularity at least matches the zero path with
the write path.


Maybe the new filesystem op for clearing pmem poison should insist on
4K alignment? Because in hwpoison_clear() the starting pfn is given by
PHYS_PFN, which rounds down to the nearest page, we might inadvertently
clear the poison bit and 'noce' bit from a page when we have only
cleared a poison in, e.g., the second half of the page.
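
(To spell out the rounding concern with made-up numbers:)

/* Say the poison is one 512-byte sector in the *second* half of a 4K
 * page, e.g. at byte offset 0x1850600200, length 0x200. Then:
 *
 *	PHYS_PFN(0x1850600200) == 0x1850600200 >> PAGE_SHIFT == 0x1850600
 *
 * which names the whole page starting at 0x1850600000, so clearing
 * per-pfn state there drops page-level poison tracking even though
 * only the one sector was actually repaired.
 */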

BTW, set_mce_nospec() doesn't seem to work in 5.5 release,
  [ 2321.209382] Could not invalidate pfn=0x1850600 from 1:1 map
I will see if I can find more information.




Setting that aside we can start with just treating zeroing the same as
the copy_from_iter() case and fail the I/O at the dax_direct_access()
step.


OK.


I'd rather have a separate op that filesystems can use to clear errors
at block allocation time that can be enforced to have the correct
alignment.


So would file systems always call that routine instead of zeroing, or
would they first check to see if there are badblocks?


The proposal is that filesystems distinguish zeroing from free-block
allocation/initialization such that the fsdax implementation directs
initialization to a driver callback. This "initialization op" would
take care to check for poison and clear it. All other dax paths would
not consult the badblocks list.


thanks!
-jane






Re: [dm-devel] [PATCH v5 2/8] drivers/pmem: Allow pmem_clear_poison() to accept arbitrary offset and len

2020-02-26 Thread Vivek Goyal
On Tue, Feb 25, 2020 at 02:49:30PM -0800, Dan Williams wrote:
[..]
> > > > Hi Dan,
> > > >
> > > > IIUC, block aligned hole punch don't go through __dax_zero_page_range()
> > > > path. Instead they call blkdev_issue_zeroout() at later point of time.
> > > >
> > > > Only partial block zeroing path is taking __dax_zero_page_range(). So
> > > > even if we remove poison clearing code from __dax_zero_page_range(),
> > > > there should not be a regression w.r.t full block zeroing. Only possible
> > > > regression will be if somebody was doing partial block zeroing on sector
> > > > boundary, then poison will not be cleared.
> > > >
> > > > We now seem to be discussing too many issues w.r.t poison clearing
> > > > and dax. Atleast 3 issues are mentioned in this thread.
> > > >
> > > > A. Get rid of dependency on block device in dax zeroing path.
> > > >(__dax_zero_page_range)
> > > >
> > > > B. Provide a way to clear latent poison. And possibly use movdir64b to
> > > >do that and make filesystems use that interface for initialization
> > > >of blocks.
> > > >
> > > > C. Dax zero operation is clearing known poison while copy_from_iter() is
> > > >not. I guess this ship has already sailed. If we change it now,
> > > >somebody will complain of some regression.
> > > >
> > > > For issue A, there are two possible ways to deal with it.
> > > >
> > > > 1. Implement a dax method to zero page. And this method will also clear
> > > >known poison. This is what my patch series is doing.
> > > >
> > > > 2. Just get rid of blkdev_issue_zeroout() from __dax_zero_page_range()
> > > >so that no poison will be cleared in __dax_zero_page_range() path. 
> > > > This
> > > >path is currently used in partial page zeroing path and full 
> > > > filesystem
> > > >block zeroing happens with blkdev_issue_zeroout(). There is a small
> > > >chance of regression here in case of sector aligned partial block
> > > >zeroing.
> > > >
> > > > My patch series takes care of issue A without any regressions. In fact 
> > > > it
> > > > improves current interface. For example, currently "truncate -s 512
> > > > foo.txt" will succeed even if first sector in the block is poisoned. My
> > > > patch series fixes it. Current implementation will return error on if 
> > > > any
> > > > non sector aligned truncate is done and any of the sector is poisoned. 
> > > > My
> > > > implementation will not return error if poisoned can be cleared as part
> > > > of zeroing. It will return only if poison is present in non-zeoring 
> > > > part.
> > >
> > > That asymmetry makes the implementation too much of a special case. If
> > > the dax mapping path forces error boundaries on PAGE_SIZE blocks then
> > > so should zeroing.
> > >
> > > >
> > > > Why don't we solve one issue A now and deal with issue B and C later in
> > > > a sepaprate patch series. This patch series gets rid of dependency on
> > > > block device in dax path and also makes current zeroing interface 
> > > > better.
> > >
> > > I'm ok with replacing blkdev_issue_zeroout() with a dax operation
> > > callback that deals with page aligned entries. That change at least
> > > makes the error boundary symmetric across copy_from_iter() and the
> > > zeroing path.
> >
> > IIUC, you are suggesting that modify dax_zero_page_range() to take page
> > aligned start and size and call this interface from
> > __dax_zero_page_range() and get rid of blkdev_issue_zeroout() in that
> > path?
> >
> > Something like.
> >
> > __dax_zero_page_range() {
> >   if(page_aligned_io)
> > call_dax_page_zero_range()
> >   else
> > use_direct_access_and_memcpy;
> > }
> >
> > And other callers of blkdev_issue_zeroout() in filesystems can migrate
> > to calling dax_zero_page_range() instead.
> >
> > If yes, I am not seeing what advantage do we get by this change.
> >
> > - __dax_zero_page_range() seems to be called by only partial block
> >   zeroing code. So dax_zero_page_range() call will remain unused.
> >
> >
> > - dax_zero_page_range() will be exact replacement of
> >   blkdev_issue_zeroout() so filesystems will not gain anything. Just that
> >   it will create a dax specific hook.
> >
> > In that case it might be simpler to just get rid of blkdev_issue_zeroout()
> > call from __dax_zero_page_range() and make sure there are no callers of
> > full block zeroing from this path.
> 
> I think you're right. The path I'm concerned about not regressing is
> the error clearing on new block allocation and we get that already via
> xfs_zero_extent() and sb_issue_zeroout(). For your fs we'll want a
> dax-device equivalent  for that path, but that does mean that
> __dax_zero_page_range() stays out of the error clearing game.

In virtiofs we do not manage our own blocks. We let the host filesystem
do that, and we are just a passthrough filesystem passing FUSE messages
around. So I have not seen a need for a block zeroing interface yet.

I just happened to carry a patch in my patch series in this 

Re: [dm-devel] [PATCH v5 2/8] drivers/pmem: Allow pmem_clear_poison() to accept arbitrary offset and len

2020-02-25 Thread Dan Williams
On Tue, Feb 25, 2020 at 12:08 PM Vivek Goyal  wrote:
>
> On Tue, Feb 25, 2020 at 08:25:27AM -0800, Dan Williams wrote:
> > On Tue, Feb 25, 2020 at 5:37 AM Vivek Goyal  wrote:
> > >
> > > On Mon, Feb 24, 2020 at 01:32:58PM -0800, Dan Williams wrote:
> > >
> > > [..]
> > > > > > > Ok, how about if I add one more patch to the series which will 
> > > > > > > check
> > > > > > > if unwritten portion of the page has known poison. If it has, then
> > > > > > > -EIO is returned.
> > > > > > >
> > > > > > >
> > > > > > > Subject: pmem: zero page range return error if poisoned memory in 
> > > > > > > unwritten area
> > > > > > >
> > > > > > > Filesystems call into pmem_dax_zero_page_range() to zero partial 
> > > > > > > page upon
> > > > > > > truncate. If partial page is being zeroed, then at the end of 
> > > > > > > operation
> > > > > > > file systems expect that there is no poison in the whole page 
> > > > > > > (atleast
> > > > > > > known poison).
> > > > > > >
> > > > > > > So make sure part of the partial page which is not being written, 
> > > > > > > does not
> > > > > > > have poison. If it does, return error. If there is poison in area 
> > > > > > > of page
> > > > > > > being written, it will be cleared.
> > > > > >
> > > > > > No, I don't like that the zero operation is special cased compared 
> > > > > > to
> > > > > > the write case. I'd say let's make them identical for now. I.e. fail
> > > > > > the I/O at dax_direct_access() time.
> > > > >
> > > > > So basically __dax_zero_page_range() will only write zeros (and not
> > > > > try to clear any poison). Right?
> > > >
> > > > Yes, the zero operation would have already failed at the
> > > > dax_direct_access() step if there was present poison.
> > > >
> > > > > > I think the error clearing
> > > > > > interface should be an explicit / separate op rather than a
> > > > > > side-effect. What about an explicit interface for initializing newly
> > > > > > allocated blocks, and the only reliable way to destroy poison 
> > > > > > through
> > > > > > the filesystem is to free the block?
> > > > >
> > > > > Effectively pmem_make_request() is already that interface filesystems
> > > > > use to initialize blocks and clear poison. So we don't really have to
> > > > > introduce a new interface?
> > > >
> > > > pmem_make_request() is shared with the I/O path and is too low in the
> > > > stack to understand intent. DAX intercepts the I/O path closer to the
> > > > filesystem and can understand zeroing vs writing today. I'm proposing
> > > > we go a step further and make DAX understand free-to-allocated-block
> > > > initialization instead of just zeroing. Inject the error clearing into
> > > > that initialization interface.
> > > >
> > > > > Or you are suggesting separate dax_zero_page_range() interface which 
> > > > > will
> > > > > always call into firmware to clear poison. And that will make sure 
> > > > > latent
> > > > > poison is cleared as well and filesystem should use that for block
> > > > > initialization instead?
> > > >
> > > > Yes, except latent poison would not be cleared until the zeroing is
> > > > implemented with movdir64b instead of callouts to firmware. It's
> > > > otherwise too slow to call out to firmware unconditionally.
> > > >
> > > > > I do like the idea of not having to differentiate
> > > > > between known poison and latent poison. Once a block has been 
> > > > > initialized
> > > > > all poison should be cleared (known/latent). I am worried though that
> > > > > on large devices this might slowdown filesystem initialization a lot
> > > > > if they are zeroing large range of blocks.
> > > > >
> > > > > If yes, this sounds like two different patch series. First patch 
> > > > > series
> > > > > takes care of removing blkdev_issue_zeroout() from
> > > > > __dax_zero_page_range() and couple of iomap related cleans christoph
> > > > > wanted.
> > > > >
> > > > > And second patch series for adding new dax operation to zero a range
> > > > > and always call info firmware to clear poison and modify filesystems
> > > > > accordingly.
> > > >
> > > > Yes, but they may need to be merged together. I don't want to regress
> > > > the ability of a block-aligned hole-punch to clear errors.
> > >
> > > Hi Dan,
> > >
> > > IIUC, block aligned hole punch don't go through __dax_zero_page_range()
> > > path. Instead they call blkdev_issue_zeroout() at later point of time.
> > >
> > > Only partial block zeroing path is taking __dax_zero_page_range(). So
> > > even if we remove poison clearing code from __dax_zero_page_range(),
> > > there should not be a regression w.r.t full block zeroing. Only possible
> > > regression will be if somebody was doing partial block zeroing on sector
> > > boundary, then poison will not be cleared.
> > >
> > > We now seem to be discussing too many issues w.r.t poison clearing
> > > and dax. Atleast 3 issues are mentioned in this thread.
> > >
> > > A. Get rid of dependency on block device in dax zeroing path.

Re: [dm-devel] [PATCH v5 2/8] drivers/pmem: Allow pmem_clear_poison() to accept arbitrary offset and len

2020-02-25 Thread Dan Williams
On Tue, Feb 25, 2020 at 12:33 PM Jeff Moyer  wrote:
>
> Dan Williams  writes:
>
> > On Mon, Feb 24, 2020 at 1:53 PM Jeff Moyer  wrote:
> >>
> >> Dan Williams  writes:
> >>
> >> >> Let's just focus on reporting errors when we know we have them.
> >> >
> >> > That's the problem in my eyes. If software needs to contend with
> >> > latent error reporting then it should always contend otherwise
> >> > software has multiple error models to wrangle.
> >>
> >> The only way for an application to know that the data has been written
> >> successfully would be to issue a read after every write.  That's not a
> >> performance hit most applications are willing to take.  And, of course,
> >> the media can still go bad at a later time, so it only guarantees the
> >> data is accessible immediately after having been written.
> >>
> >> What I'm suggesting is that we should not complete a write successfully
> >> if we know that the data will not be retrievable.  I wouldn't call this
> >> adding an extra error model to contend with.  Applications should
> >> already be checking for errors on write.
> >>
> >> Does that make sense? Are we talking past each other?
> >
> > The badblock list is late to update in both directions, late to add
> > entries that the scrub needs to find and late to delete entries that
> > were inadvertently cleared by cache-line writes that did not first
> > ingest the poison for a read-modify-write.
>
> We aren't looking for perfection.  If the badblocks list is populated,
> then report the error instead of letting the user write data that we
> know they won't be able to access later.
>
> You have confused me, though, since I thought that stores to bad media
> would not clear errors.  Perhaps you are talking about some future
> hardware implementation that doesn't yet exist?

No, today cacheline-aligned DMA, and CPU cachelines that are fully
dirtied without a demand fill, can end up clearing poison. The
movdir64b instruction provides a way to force this behavior from the
CPU, where previously it was luck.

> > So I see the above as being wishful in using the error list as the
> > hard source of truth and unfortunate to up-level all sub-sector error
> > entries into full PAGE_SIZE data offline events.
>
> The page size granularity is only an issue for mmap().  If you are using
> read/write, then 512 byte granularity can be achieved.  Even with mmap,
> if you encounter an error on a 4k page, you can query the status of each
> sector in that page to isolate the error.  So I'm not quite sure I
> understand what you're getting at.

I'm getting at the fact that we don't allow mmap on PAGE_SIZE
granularity in the presence of errors, and don't allow dax I/O to the
page when an error is present. Only non-dax direct-I/O can read /
write at sub-page granularity in the presence of errors.

The interface to query the status is not coordinated with the
filesystem, but that's a whole other discussion. Yes, we're getting a
bit off in the weeds.

> > I'm hoping we can find a way to make the error handling more fine
> > grained over time, but for the current patch, managing the blast
> > radius as PAGE_SIZE granularity at least matches the zero path with
> > the write path.
>
> I think the write path can do 512 byte granularity, not page size.
> Anyway, I think we've gone far enough off into the weeds that more
> patches will have to be posted for debate.  :)
>

It can't, see dax_iomap_actor(). That call to dax_direct_access() will
fail if it hits a known badblock.
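
(Paraphrasing the path in question -- a sketch, not verbatim kernel code:)

/* Why dax writes fail at whole-page granularity today: the range is
 * mapped before any data is copied. */
static ssize_t dax_write_sketch(struct dax_device *dax_dev, pgoff_t pgoff,
				long nr_pages, size_t len,
				struct iov_iter *iter)
{
	void *kaddr;
	long avail;

	/*
	 * The pmem driver's direct_access callback consults its
	 * badblocks list and errors out if any of the requested pages
	 * contain known poison...
	 */
	avail = dax_direct_access(dax_dev, pgoff, nr_pages, &kaddr, NULL);
	if (avail < 0)
		return avail;	/* ...so the whole I/O fails here */

	/* only after that does the actual data copy happen */
	return dax_copy_from_iter(dax_dev, pgoff, kaddr, len, iter);
}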

> >> > Setting that aside we can start with just treating zeroing the same as
> >> > the copy_from_iter() case and fail the I/O at the dax_direct_access()
> >> > step.
> >>
> >> OK.
> >>
> >> > I'd rather have a separate op that filesystems can use to clear errors
> >> > at block allocation time that can be enforced to have the correct
> >> > alignment.
> >>
> >> So would file systems always call that routine instead of zeroing, or
> >> would they first check to see if there are badblocks?
> >
> > The proposal is that filesystems distinguish zeroing from free-block
> > allocation/initialization such that the fsdax implementation directs
> > initialization to a driver callback. This "initialization op" would
> > take care to check for poison and clear it. All other dax paths would
> > not consult the badblocks list.
>
> What do you mean by "all other dax paths?"  Would that include
> mmap/direct_access?  Because that definitely should still consult the
> badblocks list.

dax_direct_access() consults the badblock list,
dax_copy_{to,from}_iter() do not, and badblocks discovered after a
page is mapped do not invalidate the page unless the poison is
consumed.

> I'm not against having a separate operation for clearing errors, but I
> guess I'm not convinced it's cleaner, either.

The idea with a separate op is to have an explicit ABI to clear errors
like page-aligned-hole-punch (FALLOC_FL_PUNCH_ERROR?) vs some implicit
side effect of direct-I/O writes.
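
i.e. from an application's point of view it would look something like
this (FALLOC_FL_PUNCH_ERROR is purely hypothetical; the flag and its
value are placeholders):

#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/falloc.h>

#ifndef FALLOC_FL_PUNCH_ERROR
#define FALLOC_FL_PUNCH_ERROR	0x100	/* hypothetical, does not exist */
#endif

/* Explicit, page-aligned "destroy this data and clear its poison"
 * request, rather than a side effect of direct-I/O writes. */
static int clear_poisoned_range(int fd, off_t offset, off_t len)
{
	return fallocate(fd, FALLOC_FL_PUNCH_ERROR | FALLOC_FL_KEEP_SIZE,
			 offset, len);
}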

I also feel 

Re: [dm-devel] [PATCH v5 2/8] drivers/pmem: Allow pmem_clear_poison() to accept arbitrary offset and len

2020-02-25 Thread Jeff Moyer
Dan Williams  writes:

> On Mon, Feb 24, 2020 at 1:53 PM Jeff Moyer  wrote:
>>
>> Dan Williams  writes:
>>
>> >> Let's just focus on reporting errors when we know we have them.
>> >
>> > That's the problem in my eyes. If software needs to contend with
>> > latent error reporting then it should always contend otherwise
>> > software has multiple error models to wrangle.
>>
>> The only way for an application to know that the data has been written
>> successfully would be to issue a read after every write.  That's not a
>> performance hit most applications are willing to take.  And, of course,
>> the media can still go bad at a later time, so it only guarantees the
>> data is accessible immediately after having been written.
>>
>> What I'm suggesting is that we should not complete a write successfully
>> if we know that the data will not be retrievable.  I wouldn't call this
>> adding an extra error model to contend with.  Applications should
>> already be checking for errors on write.
>>
>> Does that make sense? Are we talking past each other?
>
> The badblock list is late to update in both directions, late to add
> entries that the scrub needs to find and late to delete entries that
> were inadvertently cleared by cache-line writes that did not first
> ingest the poison for a read-modify-write.

We aren't looking for perfection.  If the badblocks list is populated,
then report the error instead of letting the user write data that we
know they won't be able to access later.

You have confused me, though, since I thought that stores to bad media
would not clear errors.  Perhaps you are talking about some future
hardware implementation that doesn't yet exist?

> So I see the above as being wishful in using the error list as the
> hard source of truth and unfortunate to up-level all sub-sector error
> entries into full PAGE_SIZE data offline events.

The page size granularity is only an issue for mmap().  If you are using
read/write, then 512 byte granularity can be achieved.  Even with mmap,
if you encounter an error on a 4k page, you can query the status of each
sector in that page to isolate the error.  So I'm not quite sure I
understand what you're getting at.
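
For what it's worth, the bad ranges are also exported via the pmem block
device's badblocks attribute in sysfs, so userspace can see which 512-byte
sectors inside a 4k page are affected. A minimal kernel-side sketch of the
same per-sector resolution, assuming the caller already has the region's
struct badblocks and the page's starting sector:

#include <linux/badblocks.h>
#include <linux/kernel.h>
#include <linux/mm.h>

/*
 * Sketch only: walk the 512-byte sectors of one 4K page and report which
 * of them overlap a known bad range, so an error on a mapped page can be
 * isolated to the affected sectors.
 */
static void report_bad_sectors(struct badblocks *bb, sector_t page_sector)
{
	int i;

	for (i = 0; i < PAGE_SIZE / 512; i++) {
		sector_t first_bad;
		int num_bad;

		/* a non-zero return means the sector overlaps a known bad range */
		if (badblocks_check(bb, page_sector + i, 1,
				    &first_bad, &num_bad))
			pr_info("sector %llu is poisoned\n",
				(unsigned long long)(page_sector + i));
	}
}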

> I'm hoping we can find a way to make the error handling more fine
> grained over time, but for the current patch, managing the blast
> radius as PAGE_SIZE granularity at least matches the zero path with
> the write path.

I think the write path can do 512 byte granularity, not page size.
Anyway, I think we've gone far enough off into the weeds that more
patches will have to be posted for debate.  :)

>> > Setting that aside we can start with just treating zeroing the same as
>> > the copy_from_iter() case and fail the I/O at the dax_direct_access()
>> > step.
>>
>> OK.
>>
>> > I'd rather have a separate op that filesystems can use to clear errors
>> > at block allocation time that can be enforced to have the correct
>> > alignment.
>>
>> So would file systems always call that routine instead of zeroing, or
>> would they first check to see if there are badblocks?
>
> The proposal is that filesystems distinguish zeroing from free-block
> allocation/initialization such that the fsdax implementation directs
> initialization to a driver callback. This "initialization op" would
> take care to check for poison and clear it. All other dax paths would
> not consult the badblocks list.

What do you mean by "all other dax paths?"  Would that include
mmap/direct_access?  Because that definitely should still consult the
badblocks list.

I'm not against having a separate operation for clearing errors, but I
guess I'm not convinced it's cleaner, either.

-Jeff




Re: [dm-devel] [PATCH v5 2/8] drivers/pmem: Allow pmem_clear_poison() to accept arbitrary offset and len

2020-02-25 Thread Vivek Goyal
On Tue, Feb 25, 2020 at 08:25:27AM -0800, Dan Williams wrote:
> On Tue, Feb 25, 2020 at 5:37 AM Vivek Goyal  wrote:
> >
> > On Mon, Feb 24, 2020 at 01:32:58PM -0800, Dan Williams wrote:
> >
> > [..]
> > > > > > Ok, how about if I add one more patch to the series which will check
> > > > > > if unwritten portion of the page has known poison. If it has, then
> > > > > > -EIO is returned.
> > > > > >
> > > > > >
> > > > > > Subject: pmem: zero page range return error if poisoned memory in 
> > > > > > unwritten area
> > > > > >
> > > > > > Filesystems call into pmem_dax_zero_page_range() to zero partial 
> > > > > > page upon
> > > > > > truncate. If partial page is being zeroed, then at the end of 
> > > > > > operation
> > > > > > file systems expect that there is no poison in the whole page 
> > > > > > (atleast
> > > > > > known poison).
> > > > > >
> > > > > > So make sure part of the partial page which is not being written, 
> > > > > > does not
> > > > > > have poison. If it does, return error. If there is poison in area 
> > > > > > of page
> > > > > > being written, it will be cleared.
> > > > >
> > > > > No, I don't like that the zero operation is special cased compared to
> > > > > the write case. I'd say let's make them identical for now. I.e. fail
> > > > > the I/O at dax_direct_access() time.
> > > >
> > > > So basically __dax_zero_page_range() will only write zeros (and not
> > > > try to clear any poison). Right?
> > >
> > > Yes, the zero operation would have already failed at the
> > > dax_direct_access() step if there was present poison.
> > >
> > > > > I think the error clearing
> > > > > interface should be an explicit / separate op rather than a
> > > > > side-effect. What about an explicit interface for initializing newly
> > > > > allocated blocks, and the only reliable way to destroy poison through
> > > > > the filesystem is to free the block?
> > > >
> > > > Effectively pmem_make_request() is already that interface filesystems
> > > > use to initialize blocks and clear poison. So we don't really have to
> > > > introduce a new interface?
> > >
> > > pmem_make_request() is shared with the I/O path and is too low in the
> > > stack to understand intent. DAX intercepts the I/O path closer to the
> > > filesystem and can understand zeroing vs writing today. I'm proposing
> > > we go a step further and make DAX understand free-to-allocated-block
> > > initialization instead of just zeroing. Inject the error clearing into
> > > that initialization interface.
> > >
> > > > Or you are suggesting separate dax_zero_page_range() interface which 
> > > > will
> > > > always call into firmware to clear poison. And that will make sure 
> > > > latent
> > > > poison is cleared as well and filesystem should use that for block
> > > > initialization instead?
> > >
> > > Yes, except latent poison would not be cleared until the zeroing is
> > > implemented with movdir64b instead of callouts to firmware. It's
> > > otherwise too slow to call out to firmware unconditionally.
> > >
> > > > I do like the idea of not having to differentiate
> > > > between known poison and latent poison. Once a block has been 
> > > > initialized
> > > > all poison should be cleared (known/latent). I am worried though that
> > > > on large devices this might slowdown filesystem initialization a lot
> > > > if they are zeroing large range of blocks.
> > > >
> > > > If yes, this sounds like two different patch series. First patch series
> > > > takes care of removing blkdev_issue_zeroout() from
> > > > __dax_zero_page_range() and couple of iomap related cleans christoph
> > > > wanted.
> > > >
> > > > And second patch series for adding new dax operation to zero a range
> > > > and always call info firmware to clear poison and modify filesystems
> > > > accordingly.
> > >
> > > Yes, but they may need to be merged together. I don't want to regress
> > > the ability of a block-aligned hole-punch to clear errors.
> >
> > Hi Dan,
> >
> > IIUC, block aligned hole punch don't go through __dax_zero_page_range()
> > path. Instead they call blkdev_issue_zeroout() at later point of time.
> >
> > Only partial block zeroing path is taking __dax_zero_page_range(). So
> > even if we remove poison clearing code from __dax_zero_page_range(),
> > there should not be a regression w.r.t full block zeroing. Only possible
> > regression will be if somebody was doing partial block zeroing on sector
> > boundary, then poison will not be cleared.
> >
> > We now seem to be discussing too many issues w.r.t poison clearing
> > and dax. Atleast 3 issues are mentioned in this thread.
> >
> > A. Get rid of dependency on block device in dax zeroing path.
> >(__dax_zero_page_range)
> >
> > B. Provide a way to clear latent poison. And possibly use movdir64b to
> >do that and make filesystems use that interface for initialization
> >of blocks.
> >
> > C. Dax zero operation is clearing known poison while copy_from_iter() is
> >not. I guess this 

Re: [dm-devel] [PATCH v5 2/8] drivers/pmem: Allow pmem_clear_poison() to accept arbitrary offset and len

2020-02-25 Thread Dan Williams
On Tue, Feb 25, 2020 at 5:37 AM Vivek Goyal  wrote:
>
> On Mon, Feb 24, 2020 at 01:32:58PM -0800, Dan Williams wrote:
>
> [..]
> > > > > Ok, how about if I add one more patch to the series which will check
> > > > > if unwritten portion of the page has known poison. If it has, then
> > > > > -EIO is returned.
> > > > >
> > > > >
> > > > > Subject: pmem: zero page range return error if poisoned memory in 
> > > > > unwritten area
> > > > >
> > > > > Filesystems call into pmem_dax_zero_page_range() to zero partial page 
> > > > > upon
> > > > > truncate. If partial page is being zeroed, then at the end of 
> > > > > operation
> > > > > file systems expect that there is no poison in the whole page (atleast
> > > > > known poison).
> > > > >
> > > > > So make sure part of the partial page which is not being written, 
> > > > > does not
> > > > > have poison. If it does, return error. If there is poison in area of 
> > > > > page
> > > > > being written, it will be cleared.
> > > >
> > > > No, I don't like that the zero operation is special cased compared to
> > > > the write case. I'd say let's make them identical for now. I.e. fail
> > > > the I/O at dax_direct_access() time.
> > >
> > > So basically __dax_zero_page_range() will only write zeros (and not
> > > try to clear any poison). Right?
> >
> > Yes, the zero operation would have already failed at the
> > dax_direct_access() step if there was present poison.
> >
> > > > I think the error clearing
> > > > interface should be an explicit / separate op rather than a
> > > > side-effect. What about an explicit interface for initializing newly
> > > > allocated blocks, and the only reliable way to destroy poison through
> > > > the filesystem is to free the block?
> > >
> > > Effectively pmem_make_request() is already that interface filesystems
> > > use to initialize blocks and clear poison. So we don't really have to
> > > introduce a new interface?
> >
> > pmem_make_request() is shared with the I/O path and is too low in the
> > stack to understand intent. DAX intercepts the I/O path closer to the
> > filesystem and can understand zeroing vs writing today. I'm proposing
> > we go a step further and make DAX understand free-to-allocated-block
> > initialization instead of just zeroing. Inject the error clearing into
> > that initialization interface.
> >
> > > Or you are suggesting separate dax_zero_page_range() interface which will
> > > always call into firmware to clear poison. And that will make sure latent
> > > poison is cleared as well and filesystem should use that for block
> > > initialization instead?
> >
> > Yes, except latent poison would not be cleared until the zeroing is
> > implemented with movdir64b instead of callouts to firmware. It's
> > otherwise too slow to call out to firmware unconditionally.
> >
> > > I do like the idea of not having to differentiate
> > > between known poison and latent poison. Once a block has been initialized
> > > all poison should be cleared (known/latent). I am worried though that
> > > on large devices this might slowdown filesystem initialization a lot
> > > if they are zeroing large range of blocks.
> > >
> > > If yes, this sounds like two different patch series. First patch series
> > > takes care of removing blkdev_issue_zeroout() from
> > > __dax_zero_page_range() and couple of iomap related cleans christoph
> > > wanted.
> > >
> > > And second patch series for adding new dax operation to zero a range
> > > and always call info firmware to clear poison and modify filesystems
> > > accordingly.
> >
> > Yes, but they may need to be merged together. I don't want to regress
> > the ability of a block-aligned hole-punch to clear errors.
>
> Hi Dan,
>
> IIUC, block aligned hole punch don't go through __dax_zero_page_range()
> path. Instead they call blkdev_issue_zeroout() at later point of time.
>
> Only partial block zeroing path is taking __dax_zero_page_range(). So
> even if we remove poison clearing code from __dax_zero_page_range(),
> there should not be a regression w.r.t full block zeroing. Only possible
> regression will be if somebody was doing partial block zeroing on sector
> boundary, then poison will not be cleared.
>
> We now seem to be discussing too many issues w.r.t poison clearing
> and dax. Atleast 3 issues are mentioned in this thread.
>
> A. Get rid of dependency on block device in dax zeroing path.
>(__dax_zero_page_range)
>
> B. Provide a way to clear latent poison. And possibly use movdir64b to
>do that and make filesystems use that interface for initialization
>of blocks.
>
> C. Dax zero operation is clearing known poison while copy_from_iter() is
>not. I guess this ship has already sailed. If we change it now,
>somebody will complain of some regression.
>
> For issue A, there are two possible ways to deal with it.
>
> 1. Implement a dax method to zero page. And this method will also clear
>known poison. This is what my patch series is doing.
>
> 2. Just get 

Re: [dm-devel] [PATCH v5 2/8] drivers/pmem: Allow pmem_clear_poison() to accept arbitrary offset and len

2020-02-25 Thread Vivek Goyal
On Mon, Feb 24, 2020 at 01:32:58PM -0800, Dan Williams wrote:

[..]
> > > > Ok, how about if I add one more patch to the series which will check
> > > > if unwritten portion of the page has known poison. If it has, then
> > > > -EIO is returned.
> > > >
> > > >
> > > > Subject: pmem: zero page range return error if poisoned memory in 
> > > > unwritten area
> > > >
> > > > Filesystems call into pmem_dax_zero_page_range() to zero partial page 
> > > > upon
> > > > truncate. If partial page is being zeroed, then at the end of operation
> > > > file systems expect that there is no poison in the whole page (atleast
> > > > known poison).
> > > >
> > > > So make sure part of the partial page which is not being written, does 
> > > > not
> > > > have poison. If it does, return error. If there is poison in area of 
> > > > page
> > > > being written, it will be cleared.
> > >
> > > No, I don't like that the zero operation is special cased compared to
> > > the write case. I'd say let's make them identical for now. I.e. fail
> > > the I/O at dax_direct_access() time.
> >
> > So basically __dax_zero_page_range() will only write zeros (and not
> > try to clear any poison). Right?
> 
> Yes, the zero operation would have already failed at the
> dax_direct_access() step if there was present poison.
> 
> > > I think the error clearing
> > > interface should be an explicit / separate op rather than a
> > > side-effect. What about an explicit interface for initializing newly
> > > allocated blocks, and the only reliable way to destroy poison through
> > > the filesystem is to free the block?
> >
> > Effectively pmem_make_request() is already that interface filesystems
> > use to initialize blocks and clear poison. So we don't really have to
> > introduce a new interface?
> 
> pmem_make_request() is shared with the I/O path and is too low in the
> stack to understand intent. DAX intercepts the I/O path closer to the
> filesystem and can understand zeroing vs writing today. I'm proposing
> we go a step further and make DAX understand free-to-allocated-block
> initialization instead of just zeroing. Inject the error clearing into
> that initialization interface.
> 
> > Or you are suggesting separate dax_zero_page_range() interface which will
> > always call into firmware to clear poison. And that will make sure latent
> > poison is cleared as well and filesystem should use that for block
> > initialization instead?
> 
> Yes, except latent poison would not be cleared until the zeroing is
> implemented with movdir64b instead of callouts to firmware. It's
> otherwise too slow to call out to firmware unconditionally.
> 
> > I do like the idea of not having to differentiate
> > between known poison and latent poison. Once a block has been initialized
> > all poison should be cleared (known/latent). I am worried though that
> > on large devices this might slowdown filesystem initialization a lot
> > if they are zeroing large range of blocks.
> >
> > If yes, this sounds like two different patch series. First patch series
> > takes care of removing blkdev_issue_zeroout() from
> > __dax_zero_page_range() and couple of iomap related cleans christoph
> > wanted.
> >
> > And second patch series for adding new dax operation to zero a range
> > and always call info firmware to clear poison and modify filesystems
> > accordingly.
> 
> Yes, but they may need to be merged together. I don't want to regress
> the ability of a block-aligned hole-punch to clear errors.

Hi Dan,

IIUC, a block-aligned hole punch doesn't go through the
__dax_zero_page_range() path. Instead it calls blkdev_issue_zeroout() at a
later point in time.

Only the partial-block zeroing path takes __dax_zero_page_range(). So
even if we remove the poison-clearing code from __dax_zero_page_range(),
there should not be a regression w.r.t. full block zeroing. The only
possible regression is that if somebody was doing partial-block zeroing
on a sector boundary, the poison will no longer be cleared.

We now seem to be discussing too many issues w.r.t. poison clearing
and dax. At least three issues are mentioned in this thread.

A. Get rid of the dependency on the block device in the dax zeroing path
   (__dax_zero_page_range()).

B. Provide a way to clear latent poison, possibly using movdir64b to do
   that, and make filesystems use that interface for initialization of
   blocks.

C. The dax zero operation clears known poison while copy_from_iter() does
   not. I guess this ship has already sailed; if we change it now,
   somebody will complain of some regression.

For issue A, there are two possible ways to deal with it.

1. Implement a dax method to zero a page range, and have this method also
   clear known poison. This is what my patch series is doing.

2. Just get rid of blkdev_issue_zeroout() from __dax_zero_page_range(), so
   that no poison is cleared in the __dax_zero_page_range() path. This
   path is currently used only for partial-page zeroing; full filesystem
   block zeroing happens via blkdev_issue_zeroout().
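
To make option 2 concrete, here is a simplified sketch of the split as it
stands today, modeled on __dax_zero_page_range() in fs/dax.c (error paths
trimmed, 512 hard-coded instead of the bdev logical block size); option 2
amounts to deleting the blkdev_issue_zeroout() branch so that zeroing never
clears poison:

#include <linux/blkdev.h>
#include <linux/dax.h>
#include <linux/kernel.h>
#include <linux/string.h>

static int dax_zero_page_range_sketch(struct block_device *bdev,
				      struct dax_device *dax_dev,
				      sector_t sector, unsigned int offset,
				      unsigned int size)
{
	pgoff_t pgoff;
	void *kaddr;
	long rc;
	int id;

	/* Sector-aligned: zero via the block layer, which clears known poison. */
	if (IS_ALIGNED(offset, 512) && IS_ALIGNED(size, 512))
		return blkdev_issue_zeroout(bdev, sector + (offset >> 9),
					    size >> 9, GFP_NOFS, 0);

	/* Misaligned: zero through direct access; poison is never cleared here. */
	rc = bdev_dax_pgoff(bdev, sector, PAGE_SIZE, &pgoff);
	if (rc)
		return rc;

	id = dax_read_lock();
	rc = dax_direct_access(dax_dev, pgoff, 1, &kaddr, NULL);
	if (rc < 0) {
		dax_read_unlock(id);
		return rc;	/* -EIO if the page has known poison */
	}
	memset(kaddr + offset, 0, size);
	dax_flush(dax_dev, kaddr + offset, size);
	dax_read_unlock(id);
	return 0;
}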

Re: [dm-devel] [PATCH v5 2/8] drivers/pmem: Allow pmem_clear_poison() to accept arbitrary offset and len

2020-02-24 Thread Dan Williams
On Mon, Feb 24, 2020 at 1:53 PM Jeff Moyer  wrote:
>
> Dan Williams  writes:
>
> >> Let's just focus on reporting errors when we know we have them.
> >
> > That's the problem in my eyes. If software needs to contend with
> > latent error reporting then it should always contend otherwise
> > software has multiple error models to wrangle.
>
> The only way for an application to know that the data has been written
> successfully would be to issue a read after every write.  That's not a
> performance hit most applications are willing to take.  And, of course,
> the media can still go bad at a later time, so it only guarantees the
> data is accessible immediately after having been written.
>
> What I'm suggesting is that we should not complete a write successfully
> if we know that the data will not be retrievable.  I wouldn't call this
> adding an extra error model to contend with.  Applications should
> already be checking for errors on write.
>
> Does that make sense? Are we talking past each other?

The badblock list is late to update in both directions: late to add
entries that the scrub needs to find, and late to delete entries that
were inadvertently cleared by cache-line writes that did not first
ingest the poison for a read-modify-write. So I see the above as being
wishful in using the error list as the hard source of truth, and
unfortunate in up-leveling all sub-sector error entries into full
PAGE_SIZE data-offline events.

I'm hoping we can find a way to make the error handling more
fine-grained over time, but for the current patch, managing the blast
radius at PAGE_SIZE granularity at least matches the zero path with
the write path.

> > Setting that aside we can start with just treating zeroing the same as
> > the copy_from_iter() case and fail the I/O at the dax_direct_access()
> > step.
>
> OK.
>
> > I'd rather have a separate op that filesystems can use to clear errors
> > at block allocation time that can be enforced to have the correct
> > alignment.
>
> So would file systems always call that routine instead of zeroing, or
> would they first check to see if there are badblocks?

The proposal is that filesystems distinguish zeroing from free-block
allocation/initialization such that the fsdax implementation directs
initialization to a driver callback. This "initialization op" would
take care to check for poison and clear it. All other dax paths would
not consult the badblocks list.
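
As a purely illustrative sketch of that split (the zero_page_range() op is
what this series adds to dax_operations; the init_page_range() member below
is hypothetical and only names the idea of a separate allocation-time
initialization op):

#include <linux/types.h>
#include <linux/pfn_t.h>

struct dax_device;

/*
 * Illustration only, not upstream code: zeroing and "initialize a freshly
 * allocated block" (clear known poison, then zero) kept as distinct,
 * page-aligned operations.
 */
struct dax_operations_sketch {
	long (*direct_access)(struct dax_device *dax_dev, pgoff_t pgoff,
			      long nr_pages, void **kaddr, pfn_t *pfn);
	/* plain zeroing of already-initialized blocks */
	int (*zero_page_range)(struct dax_device *dax_dev, pgoff_t pgoff,
			       size_t nr_pages);
	/* allocation-time initialization: check for poison, clear it, zero */
	int (*init_page_range)(struct dax_device *dax_dev, pgoff_t pgoff,
			       size_t nr_pages);
};

Under a scheme like this, only the initialization op would consult and clear
the badblocks list; every other dax path would stay out of it, as described
above.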





Re: [dm-devel] [PATCH v5 2/8] drivers/pmem: Allow pmem_clear_poison() to accept arbitrary offset and len

2020-02-24 Thread Jeff Moyer
Dan Williams  writes:

>> Let's just focus on reporting errors when we know we have them.
>
> That's the problem in my eyes. If software needs to contend with
> latent error reporting then it should always contend otherwise
> software has multiple error models to wrangle.

The only way for an application to know that the data has been written
successfully would be to issue a read after every write.  That's not a
performance hit most applications are willing to take.  And, of course,
the media can still go bad at a later time, so it only guarantees the
data is accessible immediately after having been written.

What I'm suggesting is that we should not complete a write successfully
if we know that the data will not be retrievable.  I wouldn't call this
adding an extra error model to contend with.  Applications should
already be checking for errors on write.

Does that make sense?  Are we talking past each other?

> Setting that aside we can start with just treating zeroing the same as
> the copy_from_iter() case and fail the I/O at the dax_direct_access()
> step.

OK.

> I'd rather have a separate op that filesystems can use to clear errors
> at block allocation time that can be enforced to have the correct
> alignment.

So would file systems always call that routine instead of zeroing, or
would they first check to see if there are badblocks?

-Jeff




Re: [dm-devel] [PATCH v5 2/8] drivers/pmem: Allow pmem_clear_poison() to accept arbitrary offset and len

2020-02-24 Thread Dan Williams
On Mon, Feb 24, 2020 at 1:17 PM Vivek Goyal  wrote:
>
> On Mon, Feb 24, 2020 at 12:52:13PM -0800, Dan Williams wrote:
> > On Mon, Feb 24, 2020 at 12:13 PM Vivek Goyal  wrote:
> > >
> > > On Mon, Feb 24, 2020 at 10:03:30AM +1100, Dave Chinner wrote:
> > > > On Fri, Feb 21, 2020 at 03:17:59PM -0500, Vivek Goyal wrote:
> > > > > On Fri, Feb 21, 2020 at 01:32:48PM -0500, Jeff Moyer wrote:
> > > > > > Vivek Goyal  writes:
> > > > > >
> > > > > > > On Thu, Feb 20, 2020 at 04:35:17PM -0500, Jeff Moyer wrote:
> > > > > > >> Vivek Goyal  writes:
> > > > > > >>
> > > > > > >> > Currently pmem_clear_poison() expects offset and len to be 
> > > > > > >> > sector aligned.
> > > > > > >> > Atleast that seems to be the assumption with which code has 
> > > > > > >> > been written.
> > > > > > >> > It is called only from pmem_do_bvec() which is called only 
> > > > > > >> > from pmem_rw_page()
> > > > > > >> > and pmem_make_request() which will only passe sector aligned 
> > > > > > >> > offset and len.
> > > > > > >> >
> > > > > > >> > Soon we want use this function from dax_zero_page_range() code 
> > > > > > >> > path which
> > > > > > >> > can try to zero arbitrary range of memory with-in a page. So 
> > > > > > >> > update this
> > > > > > >> > function to assume that offset and length can be arbitrary and 
> > > > > > >> > do the
> > > > > > >> > necessary alignments as needed.
> > > > > > >>
> > > > > > >> What caller will try to zero a range that is smaller than a 
> > > > > > >> sector?
> > > > > > >
> > > > > > > Hi Jeff,
> > > > > > >
> > > > > > > New dax zeroing interface (dax_zero_page_range()) can technically 
> > > > > > > pass
> > > > > > > a range which is less than a sector. Or which is bigger than a 
> > > > > > > sector
> > > > > > > but start and end are not aligned on sector boundaries.
> > > > > >
> > > > > > Sure, but who will call it with misaligned ranges?
> > > > >
> > > > > create a file foo.txt of size 4K and then truncate it.
> > > > >
> > > > > "truncate -s 23 foo.txt". Filesystems try to zero the bytes from 24 to
> > > > > 4095.
> > > >
> > > > This should fail with EIO. Only full page writes should clear the
> > > > bad page state, and partial writes should therefore fail because
> > > > they do not guarantee the data in the filesystem block is all good.
> > > >
> > > > If this zeroing was a buffered write to an address with a bad
> > > > sector, then the writeback will fail and the user will (eventually)
> > > > get an EIO on the file.
> > > >
> > > > DAX should do the same thing, except because the zeroing is
> > > > synchronous (i.e. done directly by the truncate syscall) we can -
> > > > and should - return EIO immediately.
> > > >
> > > > Indeed, with your code, if we then extend the file by truncating up
> > > > back to 4k, then the range between 23 and 512 is still bad, even
> > > > though we've successfully zeroed it and the user knows it. An
> > > > attempt to read anywhere in this range (e.g. 10 bytes at offset 100)
> > > > will fail with EIO, but reading 10 bytes at offset 2000 will
> > > > succeed.
> > > >
> > > > That's *awful* behaviour to expose to userspace, especially when
> > > > they look at the fs config and see that it's using both 4kB block
> > > > and sector sizes...
> > > >
> > > > The only thing that makes sense from a filesystem perspective is
> > > > clearing bad page state when entire filesystem blocks are
> > > > overwritten. The data in a filesystem block is either good or bad,
> > > > and it doesn't matter how many internal (kernel or device) sectors
> > > > it has.
> > > >
> > > > > > And what happens to the rest?  The caller is left to trip over the
> > > > > > errors?  That sounds pretty terrible.  I really think there needs 
> > > > > > to be
> > > > > > an explicit contract here.
> > > > >
> > > > > Ok, I think is is the contentious bit. Current interface
> > > > > (__dax_zero_page_range()) either clears the poison (if I/O is aligned 
> > > > > to
> > > > > sector) or expects page to be free of poison.
> > > > >
> > > > > So in above example, of "truncate -s 23 foo.txt", currently I get an 
> > > > > error
> > > > > because range being zeroed is not sector aligned. So
> > > > > __dax_zero_page_range() falls back to calling direct_access(). Which
> > > > > fails because there are poisoned sectors in the page.
> > > > >
> > > > > With my patches, dax_zero_page_range(), clears the poison from sector 
> > > > > 1 to
> > > > > 7 but leaves sector 0 untouched and just writes zeroes from byte 0 to 
> > > > > 511
> > > > > and returns success.
> > > >
> > > > Ok, kernel sectors are not the unit of granularity bad page state
> > > > should be managed at. They don't match page state granularity, and
> > > > they don't match filesystem block granularity, and the whacky
> > > > "partial writes silently succeed, reads fail unpredictably"
> > > > assymetry it leads to will just cause problems for users.
> > > >
> > > > > So question is, is this better behavior or 

Re: [dm-devel] [PATCH v5 2/8] drivers/pmem: Allow pmem_clear_poison() to accept arbitrary offset and len

2020-02-24 Thread Vivek Goyal
On Mon, Feb 24, 2020 at 12:52:13PM -0800, Dan Williams wrote:
> On Mon, Feb 24, 2020 at 12:13 PM Vivek Goyal  wrote:
> >
> > On Mon, Feb 24, 2020 at 10:03:30AM +1100, Dave Chinner wrote:
> > > On Fri, Feb 21, 2020 at 03:17:59PM -0500, Vivek Goyal wrote:
> > > > On Fri, Feb 21, 2020 at 01:32:48PM -0500, Jeff Moyer wrote:
> > > > > Vivek Goyal  writes:
> > > > >
> > > > > > On Thu, Feb 20, 2020 at 04:35:17PM -0500, Jeff Moyer wrote:
> > > > > >> Vivek Goyal  writes:
> > > > > >>
> > > > > >> > Currently pmem_clear_poison() expects offset and len to be 
> > > > > >> > sector aligned.
> > > > > >> > Atleast that seems to be the assumption with which code has been 
> > > > > >> > written.
> > > > > >> > It is called only from pmem_do_bvec() which is called only from 
> > > > > >> > pmem_rw_page()
> > > > > >> > and pmem_make_request() which will only passe sector aligned 
> > > > > >> > offset and len.
> > > > > >> >
> > > > > >> > Soon we want use this function from dax_zero_page_range() code 
> > > > > >> > path which
> > > > > >> > can try to zero arbitrary range of memory with-in a page. So 
> > > > > >> > update this
> > > > > >> > function to assume that offset and length can be arbitrary and 
> > > > > >> > do the
> > > > > >> > necessary alignments as needed.
> > > > > >>
> > > > > >> What caller will try to zero a range that is smaller than a sector?
> > > > > >
> > > > > > Hi Jeff,
> > > > > >
> > > > > > New dax zeroing interface (dax_zero_page_range()) can technically 
> > > > > > pass
> > > > > > a range which is less than a sector. Or which is bigger than a 
> > > > > > sector
> > > > > > but start and end are not aligned on sector boundaries.
> > > > >
> > > > > Sure, but who will call it with misaligned ranges?
> > > >
> > > > create a file foo.txt of size 4K and then truncate it.
> > > >
> > > > "truncate -s 23 foo.txt". Filesystems try to zero the bytes from 24 to
> > > > 4095.
> > >
> > > This should fail with EIO. Only full page writes should clear the
> > > bad page state, and partial writes should therefore fail because
> > > they do not guarantee the data in the filesystem block is all good.
> > >
> > > If this zeroing was a buffered write to an address with a bad
> > > sector, then the writeback will fail and the user will (eventually)
> > > get an EIO on the file.
> > >
> > > DAX should do the same thing, except because the zeroing is
> > > synchronous (i.e. done directly by the truncate syscall) we can -
> > > and should - return EIO immediately.
> > >
> > > Indeed, with your code, if we then extend the file by truncating up
> > > back to 4k, then the range between 23 and 512 is still bad, even
> > > though we've successfully zeroed it and the user knows it. An
> > > attempt to read anywhere in this range (e.g. 10 bytes at offset 100)
> > > will fail with EIO, but reading 10 bytes at offset 2000 will
> > > succeed.
> > >
> > > That's *awful* behaviour to expose to userspace, especially when
> > > they look at the fs config and see that it's using both 4kB block
> > > and sector sizes...
> > >
> > > The only thing that makes sense from a filesystem perspective is
> > > clearing bad page state when entire filesystem blocks are
> > > overwritten. The data in a filesystem block is either good or bad,
> > > and it doesn't matter how many internal (kernel or device) sectors
> > > it has.
> > >
> > > > > And what happens to the rest?  The caller is left to trip over the
> > > > > errors?  That sounds pretty terrible.  I really think there needs to 
> > > > > be
> > > > > an explicit contract here.
> > > >
> > > > Ok, I think is is the contentious bit. Current interface
> > > > (__dax_zero_page_range()) either clears the poison (if I/O is aligned to
> > > > sector) or expects page to be free of poison.
> > > >
> > > > So in above example, of "truncate -s 23 foo.txt", currently I get an 
> > > > error
> > > > because range being zeroed is not sector aligned. So
> > > > __dax_zero_page_range() falls back to calling direct_access(). Which
> > > > fails because there are poisoned sectors in the page.
> > > >
> > > > With my patches, dax_zero_page_range(), clears the poison from sector 1 
> > > > to
> > > > 7 but leaves sector 0 untouched and just writes zeroes from byte 0 to 
> > > > 511
> > > > and returns success.
> > >
> > > Ok, kernel sectors are not the unit of granularity bad page state
> > > should be managed at. They don't match page state granularity, and
> > > they don't match filesystem block granularity, and the whacky
> > > "partial writes silently succeed, reads fail unpredictably"
> > > assymetry it leads to will just cause problems for users.
> > >
> > > > So question is, is this better behavior or worse behavior. If sector 0
> > > > was poisoned, it will continue to remain poisoned and caller will come
> > > > to know about it on next read and then it should try to truncate file
> > > > to length 0 or unlink file or restore that file to get rid of poison.
> > >
> > > Worse, 

Re: [dm-devel] [PATCH v5 2/8] drivers/pmem: Allow pmem_clear_poison() to accept arbitrary offset and len

2020-02-24 Thread Dan Williams
On Mon, Feb 24, 2020 at 12:13 PM Vivek Goyal  wrote:
>
> On Mon, Feb 24, 2020 at 10:03:30AM +1100, Dave Chinner wrote:
> > On Fri, Feb 21, 2020 at 03:17:59PM -0500, Vivek Goyal wrote:
> > > On Fri, Feb 21, 2020 at 01:32:48PM -0500, Jeff Moyer wrote:
> > > > Vivek Goyal  writes:
> > > >
> > > > > On Thu, Feb 20, 2020 at 04:35:17PM -0500, Jeff Moyer wrote:
> > > > >> Vivek Goyal  writes:
> > > > >>
> > > > >> > Currently pmem_clear_poison() expects offset and len to be sector 
> > > > >> > aligned.
> > > > >> > Atleast that seems to be the assumption with which code has been 
> > > > >> > written.
> > > > >> > It is called only from pmem_do_bvec() which is called only from 
> > > > >> > pmem_rw_page()
> > > > >> > and pmem_make_request() which will only passe sector aligned 
> > > > >> > offset and len.
> > > > >> >
> > > > >> > Soon we want use this function from dax_zero_page_range() code 
> > > > >> > path which
> > > > >> > can try to zero arbitrary range of memory with-in a page. So 
> > > > >> > update this
> > > > >> > function to assume that offset and length can be arbitrary and do 
> > > > >> > the
> > > > >> > necessary alignments as needed.
> > > > >>
> > > > >> What caller will try to zero a range that is smaller than a sector?
> > > > >
> > > > > Hi Jeff,
> > > > >
> > > > > New dax zeroing interface (dax_zero_page_range()) can technically pass
> > > > > a range which is less than a sector. Or which is bigger than a sector
> > > > > but start and end are not aligned on sector boundaries.
> > > >
> > > > Sure, but who will call it with misaligned ranges?
> > >
> > > create a file foo.txt of size 4K and then truncate it.
> > >
> > > "truncate -s 23 foo.txt". Filesystems try to zero the bytes from 24 to
> > > 4095.
> >
> > This should fail with EIO. Only full page writes should clear the
> > bad page state, and partial writes should therefore fail because
> > they do not guarantee the data in the filesystem block is all good.
> >
> > If this zeroing was a buffered write to an address with a bad
> > sector, then the writeback will fail and the user will (eventually)
> > get an EIO on the file.
> >
> > DAX should do the same thing, except because the zeroing is
> > synchronous (i.e. done directly by the truncate syscall) we can -
> > and should - return EIO immediately.
> >
> > Indeed, with your code, if we then extend the file by truncating up
> > back to 4k, then the range between 23 and 512 is still bad, even
> > though we've successfully zeroed it and the user knows it. An
> > attempt to read anywhere in this range (e.g. 10 bytes at offset 100)
> > will fail with EIO, but reading 10 bytes at offset 2000 will
> > succeed.
> >
> > That's *awful* behaviour to expose to userspace, especially when
> > they look at the fs config and see that it's using both 4kB block
> > and sector sizes...
> >
> > The only thing that makes sense from a filesystem perspective is
> > clearing bad page state when entire filesystem blocks are
> > overwritten. The data in a filesystem block is either good or bad,
> > and it doesn't matter how many internal (kernel or device) sectors
> > it has.
> >
> > > > And what happens to the rest?  The caller is left to trip over the
> > > > errors?  That sounds pretty terrible.  I really think there needs to be
> > > > an explicit contract here.
> > >
> > > Ok, I think is is the contentious bit. Current interface
> > > (__dax_zero_page_range()) either clears the poison (if I/O is aligned to
> > > sector) or expects page to be free of poison.
> > >
> > > So in above example, of "truncate -s 23 foo.txt", currently I get an error
> > > because range being zeroed is not sector aligned. So
> > > __dax_zero_page_range() falls back to calling direct_access(). Which
> > > fails because there are poisoned sectors in the page.
> > >
> > > With my patches, dax_zero_page_range(), clears the poison from sector 1 to
> > > 7 but leaves sector 0 untouched and just writes zeroes from byte 0 to 511
> > > and returns success.
> >
> > Ok, kernel sectors are not the unit of granularity bad page state
> > should be managed at. They don't match page state granularity, and
> > they don't match filesystem block granularity, and the whacky
> > "partial writes silently succeed, reads fail unpredictably"
> > assymetry it leads to will just cause problems for users.
> >
> > > So question is, is this better behavior or worse behavior. If sector 0
> > > was poisoned, it will continue to remain poisoned and caller will come
> > > to know about it on next read and then it should try to truncate file
> > > to length 0 or unlink file or restore that file to get rid of poison.
> >
> > Worse, because the filesystem can't track what sub-parts of the
> > block are bad and that leads to inconsistent data integrity status
> > being exposed to userspace.
> >
> >
> > > IOW, if a partial block is being zeroed and if it is poisoned, caller
> > > will not be return an error and poison will not be cleared and memory
> > > will 

Re: [dm-devel] [PATCH v5 2/8] drivers/pmem: Allow pmem_clear_poison() to accept arbitrary offset and len

2020-02-24 Thread Dan Williams
On Mon, Feb 24, 2020 at 5:50 AM Jeff Moyer  wrote:
>
> Dan Williams  writes:
>
> > On Sun, Feb 23, 2020 at 3:03 PM Dave Chinner  wrote:
> >>
> >> On Fri, Feb 21, 2020 at 03:17:59PM -0500, Vivek Goyal wrote:
> >> > On Fri, Feb 21, 2020 at 01:32:48PM -0500, Jeff Moyer wrote:
> >> > > Vivek Goyal  writes:
> >> > >
> >> > > > On Thu, Feb 20, 2020 at 04:35:17PM -0500, Jeff Moyer wrote:
> >> > > >> Vivek Goyal  writes:
> >> > > >>
> >> > > >> > Currently pmem_clear_poison() expects offset and len to be sector 
> >> > > >> > aligned.
> >> > > >> > Atleast that seems to be the assumption with which code has been 
> >> > > >> > written.
> >> > > >> > It is called only from pmem_do_bvec() which is called only from 
> >> > > >> > pmem_rw_page()
> >> > > >> > and pmem_make_request() which will only passe sector aligned 
> >> > > >> > offset and len.
> >> > > >> >
> >> > > >> > Soon we want use this function from dax_zero_page_range() code 
> >> > > >> > path which
> >> > > >> > can try to zero arbitrary range of memory with-in a page. So 
> >> > > >> > update this
> >> > > >> > function to assume that offset and length can be arbitrary and do 
> >> > > >> > the
> >> > > >> > necessary alignments as needed.
> >> > > >>
> >> > > >> What caller will try to zero a range that is smaller than a sector?
> >> > > >
> >> > > > Hi Jeff,
> >> > > >
> >> > > > New dax zeroing interface (dax_zero_page_range()) can technically 
> >> > > > pass
> >> > > > a range which is less than a sector. Or which is bigger than a sector
> >> > > > but start and end are not aligned on sector boundaries.
> >> > >
> >> > > Sure, but who will call it with misaligned ranges?
> >> >
> >> > create a file foo.txt of size 4K and then truncate it.
> >> >
> >> > "truncate -s 23 foo.txt". Filesystems try to zero the bytes from 24 to
> >> > 4095.
> >>
> >> This should fail with EIO. Only full page writes should clear the
> >> bad page state, and partial writes should therefore fail because
> >> they do not guarantee the data in the filesystem block is all good.
> >>
> >> If this zeroing was a buffered write to an address with a bad
> >> sector, then the writeback will fail and the user will (eventually)
> >> get an EIO on the file.
> >>
> >> DAX should do the same thing, except because the zeroing is
> >> synchronous (i.e. done directly by the truncate syscall) we can -
> >> and should - return EIO immediately.
> >>
> >> Indeed, with your code, if we then extend the file by truncating up
> >> back to 4k, then the range between 23 and 512 is still bad, even
> >> though we've successfully zeroed it and the user knows it. An
> >> attempt to read anywhere in this range (e.g. 10 bytes at offset 100)
> >> will fail with EIO, but reading 10 bytes at offset 2000 will
> >> succeed.
> >>
> >> That's *awful* behaviour to expose to userspace, especially when
> >> they look at the fs config and see that it's using both 4kB block
> >> and sector sizes...
> >>
> >> The only thing that makes sense from a filesystem perspective is
> >> clearing bad page state when entire filesystem blocks are
> >> overwritten. The data in a filesystem block is either good or bad,
> >> and it doesn't matter how many internal (kernel or device) sectors
> >> it has.
> >>
> >> > > And what happens to the rest?  The caller is left to trip over the
> >> > > errors?  That sounds pretty terrible.  I really think there needs to be
> >> > > an explicit contract here.
> >> >
> >> > Ok, I think is is the contentious bit. Current interface
> >> > (__dax_zero_page_range()) either clears the poison (if I/O is aligned to
> >> > sector) or expects page to be free of poison.
> >> >
> >> > So in above example, of "truncate -s 23 foo.txt", currently I get an 
> >> > error
> >> > because range being zeroed is not sector aligned. So
> >> > __dax_zero_page_range() falls back to calling direct_access(). Which
> >> > fails because there are poisoned sectors in the page.
> >> >
> >> > With my patches, dax_zero_page_range(), clears the poison from sector 1 
> >> > to
> >> > 7 but leaves sector 0 untouched and just writes zeroes from byte 0 to 511
> >> > and returns success.
> >>
> >> Ok, kernel sectors are not the unit of granularity bad page state
> >> should be managed at. They don't match page state granularity, and
> >> they don't match filesystem block granularity, and the whacky
> >> "partial writes silently succeed, reads fail unpredictably"
> >> assymetry it leads to will just cause problems for users.
> >>
> >> > So question is, is this better behavior or worse behavior. If sector 0
> >> > was poisoned, it will continue to remain poisoned and caller will come
> >> > to know about it on next read and then it should try to truncate file
> >> > to length 0 or unlink file or restore that file to get rid of poison.
> >>
> >> Worse, because the filesystem can't track what sub-parts of the
> >> block are bad and that leads to inconsistent data integrity status
> >> being exposed to userspace.
> >
> > The driver 

Re: [dm-devel] [PATCH v5 2/8] drivers/pmem: Allow pmem_clear_poison() to accept arbitrary offset and len

2020-02-24 Thread Vivek Goyal
On Mon, Feb 24, 2020 at 10:03:30AM +1100, Dave Chinner wrote:
> On Fri, Feb 21, 2020 at 03:17:59PM -0500, Vivek Goyal wrote:
> > On Fri, Feb 21, 2020 at 01:32:48PM -0500, Jeff Moyer wrote:
> > > Vivek Goyal  writes:
> > > 
> > > > On Thu, Feb 20, 2020 at 04:35:17PM -0500, Jeff Moyer wrote:
> > > >> Vivek Goyal  writes:
> > > >> 
> > > >> > Currently pmem_clear_poison() expects offset and len to be sector 
> > > >> > aligned.
> > > >> > Atleast that seems to be the assumption with which code has been 
> > > >> > written.
> > > >> > It is called only from pmem_do_bvec() which is called only from 
> > > >> > pmem_rw_page()
> > > >> > and pmem_make_request() which will only passe sector aligned offset 
> > > >> > and len.
> > > >> >
> > > >> > Soon we want use this function from dax_zero_page_range() code path 
> > > >> > which
> > > >> > can try to zero arbitrary range of memory with-in a page. So update 
> > > >> > this
> > > >> > function to assume that offset and length can be arbitrary and do the
> > > >> > necessary alignments as needed.
> > > >> 
> > > >> What caller will try to zero a range that is smaller than a sector?
> > > >
> > > > Hi Jeff,
> > > >
> > > > New dax zeroing interface (dax_zero_page_range()) can technically pass
> > > > a range which is less than a sector. Or which is bigger than a sector
> > > > but start and end are not aligned on sector boundaries.
> > > 
> > > Sure, but who will call it with misaligned ranges?
> > 
> > create a file foo.txt of size 4K and then truncate it.
> > 
> > "truncate -s 23 foo.txt". Filesystems try to zero the bytes from 24 to
> > 4095.
> 
> This should fail with EIO. Only full page writes should clear the
> bad page state, and partial writes should therefore fail because
> they do not guarantee the data in the filesystem block is all good.
> 
> If this zeroing was a buffered write to an address with a bad
> sector, then the writeback will fail and the user will (eventually)
> get an EIO on the file.
> 
> DAX should do the same thing, except because the zeroing is
> synchronous (i.e. done directly by the truncate syscall) we can -
> and should - return EIO immediately.
> 
> Indeed, with your code, if we then extend the file by truncating up
> back to 4k, then the range between 23 and 512 is still bad, even
> though we've successfully zeroed it and the user knows it. An
> attempt to read anywhere in this range (e.g. 10 bytes at offset 100)
> will fail with EIO, but reading 10 bytes at offset 2000 will
> succeed.
> 
> That's *awful* behaviour to expose to userspace, especially when
> they look at the fs config and see that it's using both 4kB block
> and sector sizes...
> 
> The only thing that makes sense from a filesystem perspective is
> clearing bad page state when entire filesystem blocks are
> overwritten. The data in a filesystem block is either good or bad,
> and it doesn't matter how many internal (kernel or device) sectors
> it has.
> 
> > > And what happens to the rest?  The caller is left to trip over the
> > > errors?  That sounds pretty terrible.  I really think there needs to be
> > > an explicit contract here.
> > 
> > Ok, I think is is the contentious bit. Current interface
> > (__dax_zero_page_range()) either clears the poison (if I/O is aligned to
> > sector) or expects page to be free of poison.
> > 
> > So in above example, of "truncate -s 23 foo.txt", currently I get an error
> > because range being zeroed is not sector aligned. So
> > __dax_zero_page_range() falls back to calling direct_access(). Which
> > fails because there are poisoned sectors in the page.
> > 
> > With my patches, dax_zero_page_range(), clears the poison from sector 1 to
> > 7 but leaves sector 0 untouched and just writes zeroes from byte 0 to 511
> > and returns success.
> 
> Ok, kernel sectors are not the unit of granularity bad page state
> should be managed at. They don't match page state granularity, and
> they don't match filesystem block granularity, and the whacky
> "partial writes silently succeed, reads fail unpredictably"
> assymetry it leads to will just cause problems for users.
> 
> > So question is, is this better behavior or worse behavior. If sector 0
> > was poisoned, it will continue to remain poisoned and caller will come
> > to know about it on next read and then it should try to truncate file
> > to length 0 or unlink file or restore that file to get rid of poison.
> 
> Worse, because the filesystem can't track what sub-parts of the
> block are bad and that leads to inconsistent data integrity status
> being exposed to userspace.
> 
> 
> > IOW, if a partial block is being zeroed and if it is poisoned, caller
> > will not be return an error and poison will not be cleared and memory
> > will be zeroed. What do we expect in such cases.
> > 
> > Do we expect an interface where if there are any bad blocks in the range
> > being zeroed, then they all should be cleared (and hence all I/O should
> > be aligned) otherwise error is returned. If yes, I 

Re: [dm-devel] [PATCH v5 2/8] drivers/pmem: Allow pmem_clear_poison() to accept arbitrary offset and len

2020-02-24 Thread Vivek Goyal
On Mon, Feb 24, 2020 at 10:03:30AM +1100, Dave Chinner wrote:

[..]
> > > > Hi Jeff,
> > > >
> > > > New dax zeroing interface (dax_zero_page_range()) can technically pass
> > > > a range which is less than a sector. Or which is bigger than a sector
> > > > but start and end are not aligned on sector boundaries.
> > > 
> > > Sure, but who will call it with misaligned ranges?
> > 
> > create a file foo.txt of size 4K and then truncate it.
> > 
> > "truncate -s 23 foo.txt". Filesystems try to zero the bytes from 24 to
> > 4095.
> 
> This should fail with EIO. Only full page writes should clear the
> bad page state, and partial writes should therefore fail because
> they do not guarantee the data in the filesystem block is all good.
> 
> If this zeroing was a buffered write to an address with a bad
> sector, then the writeback will fail and the user will (eventually)
> get an EIO on the file.
> 
> DAX should do the same thing, except because the zeroing is
> synchronous (i.e. done directly by the truncate syscall) we can -
> and should - return EIO immediately.
> 
> Indeed, with your code, if we then extend the file by truncating up
> back to 4k, then the range between 23 and 512 is still bad, even
> though we've successfully zeroed it and the user knows it. An
> attempt to read anywhere in this range (e.g. 10 bytes at offset 100)
> will fail with EIO, but reading 10 bytes at offset 2000 will
> succeed.

Hi Dave,

What is expected if I do "truncate -s 512 foo.txt"? Say the first sector
(bytes 0 to 511) is poisoned and the rest have no poison. Should this fail
with -EIO?

In the current implementation it does not, because we redirect all
sector-aligned I/O through blkdev_issue_zeroout(), and that will happily
zero out sectors 2-8 without worrying about the state of sector 1. Hence a
user who tries to read 10 bytes at offset 100 will still fail. This
probably should be fixed if we want to retain the existing behavior.
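
For reference, a sketch of the sector-level rounding being discussed here,
based on the behavior described in this thread (a partially covered sector
is left alone, fully covered sectors are cleared) rather than the literal
patch; the helper name is illustrative and pmem_clear_poison() itself is
static in drivers/nvdimm/pmem.c:

#include <linux/blkdev.h>
#include <linux/kernel.h>

/* struct pmem_device and pmem_clear_poison() are private to the driver */
static void clear_poison_covered_sectors(struct pmem_device *pmem,
					 phys_addr_t off, unsigned int len)
{
	/* poison can only be cleared in whole 512-byte sectors */
	phys_addr_t start = ALIGN(off, SECTOR_SIZE);		/* round up */
	phys_addr_t end = ALIGN_DOWN(off + len, SECTOR_SIZE);	/* round down */

	if (start < end)
		pmem_clear_poison(pmem, start, end - start);
}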

Anyway, a partial-page truncate can't ensure that data in the rest of the
page can be read back successfully. Memory can get poisoned after the
write, and hence a read after truncate can still fail.

Hence, all we are trying to ensure is that if poison is known at the
time of writing the partial page, then we return an error to user space.

Thanks
Vivek




Re: [dm-devel] [PATCH v5 2/8] drivers/pmem: Allow pmem_clear_poison() to accept arbitrary offset and len

2020-02-24 Thread Jeff Moyer
Dan Williams  writes:

> On Sun, Feb 23, 2020 at 3:03 PM Dave Chinner  wrote:
>>
>> On Fri, Feb 21, 2020 at 03:17:59PM -0500, Vivek Goyal wrote:
>> > On Fri, Feb 21, 2020 at 01:32:48PM -0500, Jeff Moyer wrote:
>> > > Vivek Goyal  writes:
>> > >
>> > > > On Thu, Feb 20, 2020 at 04:35:17PM -0500, Jeff Moyer wrote:
>> > > >> Vivek Goyal  writes:
>> > > >>
>> > > >> > Currently pmem_clear_poison() expects offset and len to be sector 
>> > > >> > aligned.
>> > > >> > Atleast that seems to be the assumption with which code has been 
>> > > >> > written.
>> > > >> > It is called only from pmem_do_bvec() which is called only from 
>> > > >> > pmem_rw_page()
>> > > >> > and pmem_make_request() which will only passe sector aligned offset 
>> > > >> > and len.
>> > > >> >
>> > > >> > Soon we want use this function from dax_zero_page_range() code path 
>> > > >> > which
>> > > >> > can try to zero arbitrary range of memory with-in a page. So update 
>> > > >> > this
>> > > >> > function to assume that offset and length can be arbitrary and do 
>> > > >> > the
>> > > >> > necessary alignments as needed.
>> > > >>
>> > > >> What caller will try to zero a range that is smaller than a sector?
>> > > >
>> > > > Hi Jeff,
>> > > >
>> > > > New dax zeroing interface (dax_zero_page_range()) can technically pass
>> > > > a range which is less than a sector. Or which is bigger than a sector
>> > > > but start and end are not aligned on sector boundaries.
>> > >
>> > > Sure, but who will call it with misaligned ranges?
>> >
>> > create a file foo.txt of size 4K and then truncate it.
>> >
>> > "truncate -s 23 foo.txt". Filesystems try to zero the bytes from 24 to
>> > 4095.
>>
>> This should fail with EIO. Only full page writes should clear the
>> bad page state, and partial writes should therefore fail because
>> they do not guarantee the data in the filesystem block is all good.
>>
>> If this zeroing was a buffered write to an address with a bad
>> sector, then the writeback will fail and the user will (eventually)
>> get an EIO on the file.
>>
>> DAX should do the same thing, except because the zeroing is
>> synchronous (i.e. done directly by the truncate syscall) we can -
>> and should - return EIO immediately.
>>
>> Indeed, with your code, if we then extend the file by truncating up
>> back to 4k, then the range between 23 and 512 is still bad, even
>> though we've successfully zeroed it and the user knows it. An
>> attempt to read anywhere in this range (e.g. 10 bytes at offset 100)
>> will fail with EIO, but reading 10 bytes at offset 2000 will
>> succeed.
>>
>> That's *awful* behaviour to expose to userspace, especially when
>> they look at the fs config and see that it's using both 4kB block
>> and sector sizes...
>>
>> The only thing that makes sense from a filesystem perspective is
>> clearing bad page state when entire filesystem blocks are
>> overwritten. The data in a filesystem block is either good or bad,
>> and it doesn't matter how many internal (kernel or device) sectors
>> it has.
>>
>> > > And what happens to the rest?  The caller is left to trip over the
>> > > errors?  That sounds pretty terrible.  I really think there needs to be
>> > > an explicit contract here.
>> >
>> > Ok, I think is is the contentious bit. Current interface
>> > (__dax_zero_page_range()) either clears the poison (if I/O is aligned to
>> > sector) or expects page to be free of poison.
>> >
>> > So in above example, of "truncate -s 23 foo.txt", currently I get an error
>> > because range being zeroed is not sector aligned. So
>> > __dax_zero_page_range() falls back to calling direct_access(). Which
>> > fails because there are poisoned sectors in the page.
>> >
>> > With my patches, dax_zero_page_range(), clears the poison from sector 1 to
>> > 7 but leaves sector 0 untouched and just writes zeroes from byte 0 to 511
>> > and returns success.
>>
>> Ok, kernel sectors are not the unit of granularity bad page state
>> should be managed at. They don't match page state granularity, and
>> they don't match filesystem block granularity, and the whacky
>> "partial writes silently succeed, reads fail unpredictably"
>> assymetry it leads to will just cause problems for users.
>>
>> > So question is, is this better behavior or worse behavior. If sector 0
>> > was poisoned, it will continue to remain poisoned and caller will come
>> > to know about it on next read and then it should try to truncate file
>> > to length 0 or unlink file or restore that file to get rid of poison.
>>
>> Worse, because the filesystem can't track what sub-parts of the
>> block are bad and that leads to inconsistent data integrity status
>> being exposed to userspace.
>
> The driver can't track it either. Latent poison isn't know until it is
> consumed, and writes to latent poison are not guaranteed to clear it.

I believe we're discussing the case where we know there is a bad block.
Obviously we can't know about latent errors.

>> > IOW, if a partial block is being 

Re: [dm-devel] [PATCH v5 2/8] drivers/pmem: Allow pmem_clear_poison() to accept arbitrary offset and len

2020-02-23 Thread Dan Williams
On Sun, Feb 23, 2020 at 3:03 PM Dave Chinner  wrote:
>
> On Fri, Feb 21, 2020 at 03:17:59PM -0500, Vivek Goyal wrote:
> > On Fri, Feb 21, 2020 at 01:32:48PM -0500, Jeff Moyer wrote:
> > > Vivek Goyal  writes:
> > >
> > > > On Thu, Feb 20, 2020 at 04:35:17PM -0500, Jeff Moyer wrote:
> > > >> Vivek Goyal  writes:
> > > >>
> > > >> > Currently pmem_clear_poison() expects offset and len to be sector 
> > > >> > aligned.
> > > >> > Atleast that seems to be the assumption with which code has been 
> > > >> > written.
> > > >> > It is called only from pmem_do_bvec() which is called only from 
> > > >> > pmem_rw_page()
> > > >> > and pmem_make_request() which will only passe sector aligned offset 
> > > >> > and len.
> > > >> >
> > > >> > Soon we want use this function from dax_zero_page_range() code path 
> > > >> > which
> > > >> > can try to zero arbitrary range of memory with-in a page. So update 
> > > >> > this
> > > >> > function to assume that offset and length can be arbitrary and do the
> > > >> > necessary alignments as needed.
> > > >>
> > > >> What caller will try to zero a range that is smaller than a sector?
> > > >
> > > > Hi Jeff,
> > > >
> > > > New dax zeroing interface (dax_zero_page_range()) can technically pass
> > > > a range which is less than a sector. Or which is bigger than a sector
> > > > but start and end are not aligned on sector boundaries.
> > >
> > > Sure, but who will call it with misaligned ranges?
> >
> > create a file foo.txt of size 4K and then truncate it.
> >
> > "truncate -s 23 foo.txt". Filesystems try to zero the bytes from 24 to
> > 4095.
>
> This should fail with EIO. Only full page writes should clear the
> bad page state, and partial writes should therefore fail because
> they do not guarantee the data in the filesystem block is all good.
>
> If this zeroing was a buffered write to an address with a bad
> sector, then the writeback will fail and the user will (eventually)
> get an EIO on the file.
>
> DAX should do the same thing, except because the zeroing is
> synchronous (i.e. done directly by the truncate syscall) we can -
> and should - return EIO immediately.
>
> Indeed, with your code, if we then extend the file by truncating up
> back to 4k, then the range between 23 and 512 is still bad, even
> though we've successfully zeroed it and the user knows it. An
> attempt to read anywhere in this range (e.g. 10 bytes at offset 100)
> will fail with EIO, but reading 10 bytes at offset 2000 will
> succeed.
>
> That's *awful* behaviour to expose to userspace, especially when
> they look at the fs config and see that it's using both 4kB block
> and sector sizes...
>
> The only thing that makes sense from a filesystem perspective is
> clearing bad page state when entire filesystem blocks are
> overwritten. The data in a filesystem block is either good or bad,
> and it doesn't matter how many internal (kernel or device) sectors
> it has.
>
> > > And what happens to the rest?  The caller is left to trip over the
> > > errors?  That sounds pretty terrible.  I really think there needs to be
> > > an explicit contract here.
> >
> > Ok, I think this is the contentious bit. The current interface
> > (__dax_zero_page_range()) either clears the poison (if I/O is aligned to
> > a sector) or expects the page to be free of poison.
> >
> > So in the above example of "truncate -s 23 foo.txt", currently I get an error
> > because the range being zeroed is not sector aligned. So
> > __dax_zero_page_range() falls back to calling direct_access(), which
> > fails because there are poisoned sectors in the page.
> >
> > With my patches, dax_zero_page_range() clears the poison from sector 1 to
> > 7 but leaves sector 0 untouched and just writes zeroes from byte 0 to 511
> > and returns success.
>
> Ok, kernel sectors are not the unit of granularity bad page state
> should be managed at. They don't match page state granularity, and
> they don't match filesystem block granularity, and the whacky
> "partial writes silently succeed, reads fail unpredictably"
> asymmetry it leads to will just cause problems for users.
>
> > So the question is, is this better behavior or worse behavior? If sector 0
> > was poisoned, it will continue to remain poisoned and the caller will come
> > to know about it on the next read; then it should try to truncate the file
> > to length 0, unlink the file, or restore that file to get rid of the poison.
>
> Worse, because the filesystem can't track what sub-parts of the
> block are bad and that leads to inconsistent data integrity status
> being exposed to userspace.

The driver can't track it either. Latent poison isn't known until it is
consumed, and writes to latent poison are not guaranteed to clear it.

>
>
> > IOW, if a partial block is being zeroed and if it is poisoned, the caller
> > will not be returned an error, the poison will not be cleared, and the memory
> > will be zeroed. What do we expect in such cases?
> >
> > Do we expect an interface where if there are any bad blocks in the range
> > being 

Re: [dm-devel] [PATCH v5 2/8] drivers/pmem: Allow pmem_clear_poison() to accept arbitrary offset and len

2020-02-23 Thread Dave Chinner
On Fri, Feb 21, 2020 at 03:17:59PM -0500, Vivek Goyal wrote:
> On Fri, Feb 21, 2020 at 01:32:48PM -0500, Jeff Moyer wrote:
> > Vivek Goyal  writes:
> > 
> > > On Thu, Feb 20, 2020 at 04:35:17PM -0500, Jeff Moyer wrote:
> > >> Vivek Goyal  writes:
> > >> 
> > >> > Currently pmem_clear_poison() expects offset and len to be sector 
> > >> > aligned.
> > >> > At least that seems to be the assumption with which the code has been
> > >> > written.
> > >> > It is called only from pmem_do_bvec() which is called only from
> > >> > pmem_rw_page()
> > >> > and pmem_make_request() which will only pass sector aligned offset
> > >> > and len.
> > >> >
> > >> > Soon we want to use this function from dax_zero_page_range() code path
> > >> > which
> > >> > can try to zero an arbitrary range of memory within a page. So update
> > >> > this
> > >> > function to assume that offset and length can be arbitrary and do the
> > >> > necessary alignments as needed.
> > >> 
> > >> What caller will try to zero a range that is smaller than a sector?
> > >
> > > Hi Jeff,
> > >
> > > New dax zeroing interface (dax_zero_page_range()) can technically pass
> > > a range which is less than a sector. Or which is bigger than a sector
> > > but start and end are not aligned on sector boundaries.
> > 
> > Sure, but who will call it with misaligned ranges?
> 
> create a file foo.txt of size 4K and then truncate it.
> 
> "truncate -s 23 foo.txt". Filesystems try to zero the bytes from 24 to
> 4095.

This should fail with EIO. Only full page writes should clear the
bad page state, and partial writes should therefore fail because
they do not guarantee the data in the filesystem block is all good.

If this zeroing was a buffered write to an address with a bad
sector, then the writeback will fail and the user will (eventually)
get an EIO on the file.

DAX should do the same thing, except because the zeroing is
synchronous (i.e. done directly by the truncate syscall) we can -
and should - return EIO immediately.

Indeed, with your code, if we then extend the file by truncating up
back to 4k, then the range between 23 and 512 is still bad, even
though we've successfully zeroed it and the user knows it. An
attempt to read anywhere in this range (e.g. 10 bytes at offset 100)
will fail with EIO, but reading 10 bytes at offset 2000 will
succeed.

That's *awful* behaviour to expose to userspace, especially when
they look at the fs config and see that it's using both 4kB block
and sector sizes...

The only thing that makes sense from a filesystem perspective is
clearing bad page state when entire filesystem blocks are
overwritten. The data in a filesystem block is either good or bad,
and it doesn't matter how many internal (kernel or device) sectors
it has.
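
A minimal sketch of that policy, for illustration only: fs_block_size and the
range_has_badblock() callback are assumed names, not an existing kernel API.

#include <errno.h>
#include <stdbool.h>

/*
 * Illustrative policy check: a zeroing request may clear bad-block state
 * only when it covers whole filesystem blocks; a partial-block zero over a
 * known-bad range fails with EIO instead of leaving a half-good block
 * behind.  range_has_badblock() stands in for whatever badblocks lookup the
 * real implementation would use.
 */
int zero_range_policy(unsigned long long pos, unsigned long long len,
                      unsigned int fs_block_size,
                      bool (*range_has_badblock)(unsigned long long pos,
                                                 unsigned long long len))
{
        bool whole_blocks = (pos % fs_block_size == 0) &&
                            (len % fs_block_size == 0);

        if (whole_blocks)
                return 0;       /* zero the blocks and clear their poison */

        if (range_has_badblock(pos, len))
                return -EIO;    /* refuse a partial overwrite of a bad block */

        return 0;               /* plain zeroing, nothing to clear */
}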

> > And what happens to the rest?  The caller is left to trip over the
> > errors?  That sounds pretty terrible.  I really think there needs to be
> > an explicit contract here.
> 
> Ok, I think this is the contentious bit. The current interface
> (__dax_zero_page_range()) either clears the poison (if I/O is aligned to
> a sector) or expects the page to be free of poison.
>
> So in the above example of "truncate -s 23 foo.txt", currently I get an error
> because the range being zeroed is not sector aligned. So
> __dax_zero_page_range() falls back to calling direct_access(), which
> fails because there are poisoned sectors in the page.
> 
> With my patches, dax_zero_page_range() clears the poison from sector 1 to
> 7 but leaves sector 0 untouched and just writes zeroes from byte 0 to 511
> and returns success.

Ok, kernel sectors are not the unit of granularity bad page state
should be managed at. They don't match page state granularity, and
they don't match filesystem block granularity, and the whacky
"partial writes silently succeed, reads fail unpredictably"
asymmetry it leads to will just cause problems for users.

> So the question is, is this better behavior or worse behavior? If sector 0
> was poisoned, it will continue to remain poisoned and the caller will come
> to know about it on the next read; then it should try to truncate the file
> to length 0, unlink the file, or restore that file to get rid of the poison.

Worse, because the filesystem can't track what sub-parts of the
block are bad and that leads to inconsistent data integrity status
being exposed to userspace.


> IOW, if a partial block is being zeroed and if it is poisoned, the caller
> will not be returned an error, the poison will not be cleared, and the memory
> will be zeroed. What do we expect in such cases?
> 
> Do we expect an interface where if there are any bad blocks in the range
> being zeroed, then they all should be cleared (and hence all I/O should
> be aligned), otherwise an error is returned? If yes, I could make that
> change.
> 
> The downside of the current interface is that it will clear as many blocks as
> possible in the given range and leave the starting and ending blocks poisoned
> (if it is unaligned) and not return an error. That means a reader will
> get an error on these blocks 

Re: [dm-devel] [PATCH v5 2/8] drivers/pmem: Allow pmem_clear_poison() to accept arbitrary offset and len

2020-02-21 Thread Jeff Moyer
Dan Williams  writes:

> Oh you misunderstood my comment, the "move badblocks to filesystem"
> proposal is a long-term / down-the-road thing to consider. In the near
> term this unaligned block zeroing facility is an improvement.

I'm not sure I agree.  I'm going to think about it and get back to you.

-Jeff

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel



Re: [dm-devel] [PATCH v5 2/8] drivers/pmem: Allow pmem_clear_poison() to accept arbitrary offset and len

2020-02-21 Thread Dan Williams
On Fri, Feb 21, 2020 at 1:25 PM Vivek Goyal  wrote:
>
> On Fri, Feb 21, 2020 at 01:00:29PM -0800, Dan Williams wrote:
> > On Fri, Feb 21, 2020 at 12:18 PM Vivek Goyal  wrote:
> > >
> > > On Fri, Feb 21, 2020 at 01:32:48PM -0500, Jeff Moyer wrote:
> > > > Vivek Goyal  writes:
> > > >
> > > > > On Thu, Feb 20, 2020 at 04:35:17PM -0500, Jeff Moyer wrote:
> > > > >> Vivek Goyal  writes:
> > > > >>
> > > > >> > Currently pmem_clear_poison() expects offset and len to be sector 
> > > > >> > aligned.
> > > > >> > At least that seems to be the assumption with which the code has been
> > > > >> > written.
> > > > >> > It is called only from pmem_do_bvec() which is called only from
> > > > >> > pmem_rw_page()
> > > > >> > and pmem_make_request() which will only pass sector aligned
> > > > >> > offset and len.
> > > > >> >
> > > > >> > Soon we want to use this function from dax_zero_page_range() code
> > > > >> > path which
> > > > >> > can try to zero an arbitrary range of memory within a page. So
> > > > >> > update this
> > > > >> > function to assume that offset and length can be arbitrary and do 
> > > > >> > the
> > > > >> > necessary alignments as needed.
> > > > >>
> > > > >> What caller will try to zero a range that is smaller than a sector?
> > > > >
> > > > > Hi Jeff,
> > > > >
> > > > > New dax zeroing interface (dax_zero_page_range()) can technically pass
> > > > > a range which is less than a sector. Or which is bigger than a sector
> > > > > but start and end are not aligned on sector boundaries.
> > > >
> > > > Sure, but who will call it with misaligned ranges?
> > >
> > > create a file foo.txt of size 4K and then truncate it.
> > >
> > > "truncate -s 23 foo.txt". Filesystems try to zero the bytes from 24 to
> > > 4095.
> > >
> > > I have also written a test for this.
> > >
> > > https://github.com/rhvgoyal/misc/blob/master/pmem-tests/iomap-range-test.sh#L102
> > >
> > > >
> > > > > At this point in time, all I care about is that the case of an arbitrary
> > > > > range is handled well. So if a caller passes a range in, we figure
> > > > > out the subrange which is sector aligned in terms of start and end, and
> > > > > clear poison on those sectors and ignore the rest of the range. And
> > > > > this itself will be an improvement over the current behavior where
> > > > > nothing is cleared if I/O is not sector aligned.
> > > >
> > > > I don't think this makes sense.  The caller needs to know about the
> > > > blast radius of errors.  This is why I asked for a concrete example.
> > > > It might make more sense, for example, to return an error if not all of
> > > > the errors could be cleared.
> > > >
> > > > >> > nvdimm_clear_poison() seems to assume offset and len to be aligned 
> > > > >> > to
> > > > >> > the clear_err_unit boundary. But this is currently an internal detail
> > > > >> > and is not exported for others to use. So for now, continue to align
> > > > >> > offset and length to the SECTOR_SIZE boundary. Improving it further to
> > > > >> > align to the clear_err_unit boundary is a TODO item for the future.
> > > > >>
> > > > >> When there is a poisoned range of persistent memory, it is recorded 
> > > > >> by
> > > > >> the badblocks infrastructure, which currently operates on sectors.  
> > > > >> So,
> > > > >> no matter what the error unit is for the hardware, we currently can't
> > > > >> record/report to userspace anything smaller than a sector, and so 
> > > > >> that
> > > > >> is what we expect when clearing errors.
> > > > >>
> > > > >> Continuing on for completeness, we will currently not map a page with
> > > > >> badblocks into a process' address space.  So, let's say you have 256
> > > > >> bytes of bad pmem, we will tell you we've lost 512 bytes, and even if
> > > > >> you access a valid mmap()d address in the same page as the poisoned
> > > > >> memory, you will get a segfault.
> > > > >>
> > > > >> Userspace can fix up the error by calling write(2) and friends to
> > > > >> provide new data, or by punching a hole and writing new data to the 
> > > > >> hole
> > > > >> (which may result in getting a new block, or reallocating the old 
> > > > >> block
> > > > >> and zeroing it, which will clear the error).
> > > > >
> > > > > Fair enough. I do not need poison clearing at finer granularity. It 
> > > > > might
> > > > > be needed once dev_dax path wants to clear poison. Not sure how 
> > > > > exactly
> > > > > that works.
> > > >
> > > > It doesn't.  :)
> > > >
> > > > >> > +/*
> > > > >> > + * Callers can pass arbitrary offset and len. But 
> > > > >> > nvdimm_clear_poison()
> > > > >> > + * expects memory offset and length to meet certain 
> > > > >> > alignment
> > > > >> > + * restriction (clear_err_unit). Currently nvdimm does not 
> > > > >> > export
> > > > >>   
> > > > >> ^^
> > > > >> > + * required alignment. So align offset and length to 
> > > 

Re: [dm-devel] [PATCH v5 2/8] drivers/pmem: Allow pmem_clear_poison() to accept arbitrary offset and len

2020-02-21 Thread Vivek Goyal
On Fri, Feb 21, 2020 at 01:00:29PM -0800, Dan Williams wrote:
> On Fri, Feb 21, 2020 at 12:18 PM Vivek Goyal  wrote:
> >
> > On Fri, Feb 21, 2020 at 01:32:48PM -0500, Jeff Moyer wrote:
> > > Vivek Goyal  writes:
> > >
> > > > On Thu, Feb 20, 2020 at 04:35:17PM -0500, Jeff Moyer wrote:
> > > >> Vivek Goyal  writes:
> > > >>
> > > >> > Currently pmem_clear_poison() expects offset and len to be sector 
> > > >> > aligned.
> > > >> > At least that seems to be the assumption with which the code has been
> > > >> > written.
> > > >> > It is called only from pmem_do_bvec() which is called only from
> > > >> > pmem_rw_page()
> > > >> > and pmem_make_request() which will only pass sector aligned offset
> > > >> > and len.
> > > >> >
> > > >> > Soon we want to use this function from dax_zero_page_range() code path
> > > >> > which
> > > >> > can try to zero an arbitrary range of memory within a page. So update
> > > >> > this
> > > >> > function to assume that offset and length can be arbitrary and do the
> > > >> > necessary alignments as needed.
> > > >>
> > > >> What caller will try to zero a range that is smaller than a sector?
> > > >
> > > > Hi Jeff,
> > > >
> > > > New dax zeroing interface (dax_zero_page_range()) can technically pass
> > > > a range which is less than a sector. Or which is bigger than a sector
> > > > but start and end are not aligned on sector boundaries.
> > >
> > > Sure, but who will call it with misaligned ranges?
> >
> > create a file foo.txt of size 4K and then truncate it.
> >
> > "truncate -s 23 foo.txt". Filesystems try to zero the bytes from 24 to
> > 4095.
> >
> > I have also written a test for this.
> >
> > https://github.com/rhvgoyal/misc/blob/master/pmem-tests/iomap-range-test.sh#L102
> >
> > >
> > > > At this point in time, all I care about is that the case of an arbitrary
> > > > range is handled well. So if a caller passes a range in, we figure
> > > > out the subrange which is sector aligned in terms of start and end, and
> > > > clear poison on those sectors and ignore the rest of the range. And
> > > > this itself will be an improvement over the current behavior where
> > > > nothing is cleared if I/O is not sector aligned.
> > >
> > > I don't think this makes sense.  The caller needs to know about the
> > > blast radius of errors.  This is why I asked for a concrete example.
> > > It might make more sense, for example, to return an error if not all of
> > > the errors could be cleared.
> > >
> > > >> > nvdimm_clear_poison() seems to assume offset and len to be aligned to
> > > >> > the clear_err_unit boundary. But this is currently an internal detail and
> > > >> > is not exported for others to use. So for now, continue to align offset
> > > >> > and length to the SECTOR_SIZE boundary. Improving it further to align to
> > > >> > the clear_err_unit boundary is a TODO item for the future.
> > > >>
> > > >> When there is a poisoned range of persistent memory, it is recorded by
> > > >> the badblocks infrastructure, which currently operates on sectors.  So,
> > > >> no matter what the error unit is for the hardware, we currently can't
> > > >> record/report to userspace anything smaller than a sector, and so that
> > > >> is what we expect when clearing errors.
> > > >>
> > > >> Continuing on for completeness, we will currently not map a page with
> > > >> badblocks into a process' address space.  So, let's say you have 256
> > > >> bytes of bad pmem, we will tell you we've lost 512 bytes, and even if
> > > >> you access a valid mmap()d address in the same page as the poisoned
> > > >> memory, you will get a segfault.
> > > >>
> > > >> Userspace can fix up the error by calling write(2) and friends to
> > > >> provide new data, or by punching a hole and writing new data to the 
> > > >> hole
> > > >> (which may result in getting a new block, or reallocating the old block
> > > >> and zeroing it, which will clear the error).
> > > >
> > > > Fair enough. I do not need poison clearing at finer granularity. It 
> > > > might
> > > > be needed once dev_dax path wants to clear poison. Not sure how exactly
> > > > that works.
> > >
> > > It doesn't.  :)
> > >
> > > >> > +/*
> > > >> > + * Callers can pass arbitrary offset and len. But 
> > > >> > nvdimm_clear_poison()
> > > >> > + * expects memory offset and length to meet certain 
> > > >> > alignment
> > > >> > + * restriction (clear_err_unit). Currently nvdimm does not 
> > > >> > export
> > > >>   
> > > >> ^^
> > > >> > + * required alignment. So align offset and length to sector 
> > > >> > boundary
> > > >>
> > > >> What is "nvdimm" in that sentence?  Because the nvdimm most certainly
> > > >> does export the required alignment.  Perhaps you meant libnvdimm?
> > > >
> > > > I meant nvdimm_clear_poison() function in drivers/nvdimm/bus.c. Whatever
> > > > it is called. It first queries the alignment required (clear_err_unit) and
> > > > then 

Re: [dm-devel] [PATCH v5 2/8] drivers/pmem: Allow pmem_clear_poison() to accept arbitrary offset and len

2020-02-21 Thread Vivek Goyal
On Fri, Feb 21, 2020 at 01:32:48PM -0500, Jeff Moyer wrote:
> Vivek Goyal  writes:
> 
> > On Thu, Feb 20, 2020 at 04:35:17PM -0500, Jeff Moyer wrote:
> >> Vivek Goyal  writes:
> >> 
> >> > Currently pmem_clear_poison() expects offset and len to be sector 
> >> > aligned.
> >> > At least that seems to be the assumption with which the code has been written.
> >> > It is called only from pmem_do_bvec() which is called only from
> >> > pmem_rw_page()
> >> > and pmem_make_request() which will only pass sector aligned offset and
> >> > len.
> >> >
> >> > Soon we want to use this function from dax_zero_page_range() code path which
> >> > can try to zero an arbitrary range of memory within a page. So update this
> >> > function to assume that offset and length can be arbitrary and do the
> >> > necessary alignments as needed.
> >> 
> >> What caller will try to zero a range that is smaller than a sector?
> >
> > Hi Jeff,
> >
> > New dax zeroing interface (dax_zero_page_range()) can technically pass
> > a range which is less than a sector. Or which is bigger than a sector
> > but start and end are not aligned on sector boundaries.
> 
> Sure, but who will call it with misaligned ranges?

create a file foo.txt of size 4K and then truncate it.

"truncate -s 23 foo.txt". Filesystems try to zero the bytes from 24 to
4095.

I have also written a test for this.

https://github.com/rhvgoyal/misc/blob/master/pmem-tests/iomap-range-test.sh#L102
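
The same scenario can also be reproduced without the script; here is a minimal
standalone sketch. The mount point and file name are made up, any fsdax mount
will do.

#include <fcntl.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
        char buf[4096];
        /* "/mnt/pmem0/foo.txt" is a made-up path; point it at an fsdax mount. */
        int fd = open("/mnt/pmem0/foo.txt", O_CREAT | O_RDWR, 0644);

        if (fd < 0)
                return 1;

        /* Fill one full filesystem block. */
        memset(buf, 'a', sizeof(buf));
        if (pwrite(fd, buf, sizeof(buf), 0) != (ssize_t)sizeof(buf))
                return 1;

        /* Equivalent of "truncate -s 23 foo.txt"; the filesystem now has to
         * zero the tail of the block past the new EOF via the dax zeroing
         * path. */
        if (ftruncate(fd, 23) < 0)
                return 1;

        close(fd);
        return 0;
}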

> 
> > At this point in time, all I care about is that the case of an arbitrary
> > range is handled well. So if a caller passes a range in, we figure
> > out the subrange which is sector aligned in terms of start and end, and
> > clear poison on those sectors and ignore the rest of the range. And
> > this itself will be an improvement over the current behavior where
> > nothing is cleared if I/O is not sector aligned.
> 
> I don't think this makes sense.  The caller needs to know about the
> blast radius of errors.  This is why I asked for a concrete example.
> It might make more sense, for example, to return an error if not all of
> the errors could be cleared.
> 
> >> > nvdimm_clear_poison() seems to assume offset and len to be aligned to
> >> > the clear_err_unit boundary. But this is currently an internal detail and is
> >> > not exported for others to use. So for now, continue to align offset and
> >> > length to the SECTOR_SIZE boundary. Improving it further to align to the
> >> > clear_err_unit boundary is a TODO item for the future.
> >> 
> >> When there is a poisoned range of persistent memory, it is recorded by
> >> the badblocks infrastructure, which currently operates on sectors.  So,
> >> no matter what the error unit is for the hardware, we currently can't
> >> record/report to userspace anything smaller than a sector, and so that
> >> is what we expect when clearing errors.
> >> 
> >> Continuing on for completeness, we will currently not map a page with
> >> badblocks into a process' address space.  So, let's say you have 256
> >> bytes of bad pmem, we will tell you we've lost 512 bytes, and even if
> >> you access a valid mmap()d address in the same page as the poisoned
> >> memory, you will get a segfault.
> >> 
> >> Userspace can fix up the error by calling write(2) and friends to
> >> provide new data, or by punching a hole and writing new data to the hole
> >> (which may result in getting a new block, or reallocating the old block
> >> and zeroing it, which will clear the error).
> >
> > Fair enough. I do not need poison clearing at finer granularity. It might
> > be needed once dev_dax path wants to clear poison. Not sure how exactly
> > that works.
> 
> It doesn't.  :)
> 
> >> > +/*
> >> > + * Callers can pass arbitrary offset and len. But 
> >> > nvdimm_clear_poison()
> >> > + * expects memory offset and length to meet certain alignment
> >> > + * restriction (clear_err_unit). Currently nvdimm does not export
> >>   ^^
> >> > + * required alignment. So align offset and length to sector 
> >> > boundary
> >> 
> >> What is "nvdimm" in that sentence?  Because the nvdimm most certainly
> >> does export the required alignment.  Perhaps you meant libnvdimm?
> >
> > I meant nvdimm_clear_poison() function in drivers/nvdimm/bus.c. Whatever
> > it is called. It first queries the alignment required (clear_err_unit) and
> > then makes sure the range passed in meets that alignment requirement.
> 
> My point was your comment is misleading.
> 
> >> We could potentially support clearing less than a sector, but I'd have
> >> to understand the use cases better before offering implementation
> >> suggestions.
> >
> > I don't need clearing less than a sector. Once somebody needs it they
> > can implement it. All I am doing is making sure the current logic is not
> > broken when dax_zero_page_range() starts using this logic and passes
> > an arbitrary range. We need to make sure we internally align I/O 
> 
> An 

Re: [dm-devel] [PATCH v5 2/8] drivers/pmem: Allow pmem_clear_poison() to accept arbitrary offset and len

2020-02-20 Thread Vivek Goyal
On Thu, Feb 20, 2020 at 04:35:17PM -0500, Jeff Moyer wrote:
> Vivek Goyal  writes:
> 
> > Currently pmem_clear_poison() expects offset and len to be sector aligned.
> > At least that seems to be the assumption with which the code has been written.
> > It is called only from pmem_do_bvec() which is called only from
> > pmem_rw_page()
> > and pmem_make_request() which will only pass sector aligned offset and len.
> >
> > Soon we want to use this function from dax_zero_page_range() code path which
> > can try to zero an arbitrary range of memory within a page. So update this
> > function to assume that offset and length can be arbitrary and do the
> > necessary alignments as needed.
> 
> What caller will try to zero a range that is smaller than a sector?

Hi Jeff,

New dax zeroing interface (dax_zero_page_range()) can technically pass
a range which is less than a sector. Or which is bigger than a sector
but start and end are not aligned on sector boundaries.

At this point in time, all I care about is that the case of an arbitrary
range is handled well. So if a caller passes a range in, we figure
out the subrange which is sector aligned in terms of start and end, and
clear poison on those sectors and ignore the rest of the range. And
this itself will be an improvement over the current behavior where
nothing is cleared if I/O is not sector aligned.
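
For illustration, here is how that trimming works out for the "truncate -s 23"
case. This is a standalone userspace sketch of the same ALIGN / ALIGN_DOWN
arithmetic the patch uses, not kernel code.

#include <stdio.h>

#define SECTOR_SIZE 512UL
#define ALIGN_UP(x, a)   (((x) + (a) - 1) & ~((a) - 1))
#define ALIGN_DOWN(x, a) ((x) & ~((a) - 1))

int main(void)
{
        unsigned long offset = 23, len = 4096 - 23;    /* zero bytes 23..4095 */

        unsigned long start_aligned = ALIGN_UP(offset, SECTOR_SIZE);
        unsigned long end_aligned   = ALIGN_DOWN(offset + len, SECTOR_SIZE) - 1;
        unsigned long len_aligned   = end_aligned - start_aligned + 1;

        /* Prints bytes 512..4095 (3584 bytes, sectors 1..7); sector 0 is
         * skipped and its poison, if any, is left in place. */
        printf("clear poison on bytes %lu..%lu (%lu bytes, sectors %lu..%lu)\n",
               start_aligned, end_aligned, len_aligned,
               start_aligned / SECTOR_SIZE, end_aligned / SECTOR_SIZE);
        return 0;
}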

> 
> > nvdimm_clear_poison() seems to assume offset and len to be aligned to
> > the clear_err_unit boundary. But this is currently an internal detail and is
> > not exported for others to use. So for now, continue to align offset and
> > length to the SECTOR_SIZE boundary. Improving it further to align to the
> > clear_err_unit boundary is a TODO item for the future.
> 
> When there is a poisoned range of persistent memory, it is recorded by
> the badblocks infrastructure, which currently operates on sectors.  So,
> no matter what the error unit is for the hardware, we currently can't
> record/report to userspace anything smaller than a sector, and so that
> is what we expect when clearing errors.
> 
> Continuing on for completeness, we will currently not map a page with
> badblocks into a process' address space.  So, let's say you have 256
> bytes of bad pmem, we will tell you we've lost 512 bytes, and even if
> you access a valid mmap()d address in the same page as the poisoned
> memory, you will get a segfault.
> 
> Userspace can fix up the error by calling write(2) and friends to
> provide new data, or by punching a hole and writing new data to the hole
> (which may result in getting a new block, or reallocating the old block
> and zeroing it, which will clear the error).

Fair enough. I do not need poison clearing at finer granularity. It might
be needed once dev_dax path wants to clear poison. Not sure how exactly
that works.

> 
> More comments below...
> 
> > Signed-off-by: Vivek Goyal 
> > ---
> >  drivers/nvdimm/pmem.c | 22 ++
> >  1 file changed, 18 insertions(+), 4 deletions(-)
> >
> > diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
> > index 075b11682192..e72959203253 100644
> > --- a/drivers/nvdimm/pmem.c
> > +++ b/drivers/nvdimm/pmem.c
> > @@ -74,14 +74,28 @@ static blk_status_t pmem_clear_poison(struct 
> > pmem_device *pmem,
> > sector_t sector;
> > long cleared;
> > blk_status_t rc = BLK_STS_OK;
> > +   phys_addr_t start_aligned, end_aligned;
> > +   unsigned int len_aligned;
> >  
> > -   sector = (offset - pmem->data_offset) / 512;
> > +   /*
> > +* Callers can pass arbitrary offset and len. But nvdimm_clear_poison()
> > +* expects memory offset and length to meet certain alignment
> > +* restriction (clear_err_unit). Currently nvdimm does not export
>   ^^
> > +* required alignment. So align offset and length to sector boundary
> 
> What is "nvdimm" in that sentence?  Because the nvdimm most certainly
> does export the required alignment.  Perhaps you meant libnvdimm?

I meant nvdimm_clear_poison() function in drivers/nvdimm/bus.c. Whatever
it is called. It first queries the alignment required (clear_err_unit) and
then makes sure the range passed in meets that alignment requirement.

> 
> > +* before passing it to nvdimm_clear_poison().
> > +*/
> > +   start_aligned = ALIGN(offset, SECTOR_SIZE);
> > +   end_aligned = ALIGN_DOWN((offset + len), SECTOR_SIZE) - 1;
> > +   len_aligned = end_aligned - start_aligned + 1;
> > +
> > +   sector = (start_aligned - pmem->data_offset) / 512;
> >  
> > -   cleared = nvdimm_clear_poison(dev, pmem->phys_addr + offset, len);
> > -   if (cleared < len)
> > +   cleared = nvdimm_clear_poison(dev, pmem->phys_addr + start_aligned,
> > + len_aligned);
> > +   if (cleared < len_aligned)
> > rc = BLK_STS_IOERR;
> > if (cleared > 0 && cleared / 512) {
> > -   hwpoison_clear(pmem, pmem->phys_addr + offset, cleared);
> > +   hwpoison_clear(pmem, pmem->phys_addr + 

Re: [dm-devel] [PATCH v5 2/8] drivers/pmem: Allow pmem_clear_poison() to accept arbitrary offset and len

2020-02-20 Thread Jeff Moyer
Vivek Goyal  writes:

> Currently pmem_clear_poison() expects offset and len to be sector aligned.
> At least that seems to be the assumption with which the code has been written.
> It is called only from pmem_do_bvec() which is called only from pmem_rw_page()
> and pmem_make_request() which will only pass sector aligned offset and len.
>
> Soon we want to use this function from dax_zero_page_range() code path which
> can try to zero an arbitrary range of memory within a page. So update this
> function to assume that offset and length can be arbitrary and do the
> necessary alignments as needed.

What caller will try to zero a range that is smaller than a sector?

> nvdimm_clear_poison() seems to assume offset and len to be aligned to
> the clear_err_unit boundary. But this is currently an internal detail and is
> not exported for others to use. So for now, continue to align offset and
> length to the SECTOR_SIZE boundary. Improving it further to align to the
> clear_err_unit boundary is a TODO item for the future.

When there is a poisoned range of persistent memory, it is recorded by
the badblocks infrastructure, which currently operates on sectors.  So,
no matter what the error unit is for the hardware, we currently can't
record/report to userspace anything smaller than a sector, and so that
is what we expect when clearing errors.

Continuing on for completeness, we will currently not map a page with
badblocks into a process' address space.  So, let's say you have 256
bytes of bad pmem, we will tell you we've lost 512 bytes, and even if
you access a valid mmap()d address in the same page as the poisoned
memory, you will get a segfault.

Userspace can fix up the error by calling write(2) and friends to
provide new data, or by punching a hole and writing new data to the hole
(which may result in getting a new block, or reallocating the old block
and zeroing it, which will clear the error).
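
A minimal sketch of that repair sequence from userspace follows; the 4KiB block
size is an assumption, and note that FALLOC_FL_PUNCH_HOLE has to be combined
with FALLOC_FL_KEEP_SIZE.

#define _GNU_SOURCE
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

int repair_block(const char *path, off_t blk_off)
{
        char buf[4096];         /* assumed 4KiB filesystem block size */
        int fd = open(path, O_RDWR);

        if (fd < 0)
                return -1;

        memset(buf, 0, sizeof(buf));

        /* Punch a hole over the poisoned block, deallocating it. */
        if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                      blk_off, sizeof(buf)) < 0)
                goto err;

        /* Rewrite it so the block is reallocated (or re-zeroed) error-free. */
        if (pwrite(fd, buf, sizeof(buf), blk_off) != (ssize_t)sizeof(buf))
                goto err;

        close(fd);
        return 0;
err:
        close(fd);
        return -1;
}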

More comments below...

> Signed-off-by: Vivek Goyal 
> ---
>  drivers/nvdimm/pmem.c | 22 ++
>  1 file changed, 18 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
> index 075b11682192..e72959203253 100644
> --- a/drivers/nvdimm/pmem.c
> +++ b/drivers/nvdimm/pmem.c
> @@ -74,14 +74,28 @@ static blk_status_t pmem_clear_poison(struct pmem_device 
> *pmem,
>   sector_t sector;
>   long cleared;
>   blk_status_t rc = BLK_STS_OK;
> + phys_addr_t start_aligned, end_aligned;
> + unsigned int len_aligned;
>  
> - sector = (offset - pmem->data_offset) / 512;
> + /*
> +  * Callers can pass arbitrary offset and len. But nvdimm_clear_poison()
> +  * expects memory offset and length to meet certain alignment
> +  * restriction (clear_err_unit). Currently nvdimm does not export
  ^^
> +  * required alignment. So align offset and length to sector boundary

What is "nvdimm" in that sentence?  Because the nvdimm most certainly
does export the required alignment.  Perhaps you meant libnvdimm?

> +  * before passing it to nvdimm_clear_poison().
> +  */
> + start_aligned = ALIGN(offset, SECTOR_SIZE);
> + end_aligned = ALIGN_DOWN((offset + len), SECTOR_SIZE) - 1;
> + len_aligned = end_aligned - start_aligned + 1;
> +
> + sector = (start_aligned - pmem->data_offset) / 512;
>  
> - cleared = nvdimm_clear_poison(dev, pmem->phys_addr + offset, len);
> - if (cleared < len)
> + cleared = nvdimm_clear_poison(dev, pmem->phys_addr + start_aligned,
> +   len_aligned);
> + if (cleared < len_aligned)
>   rc = BLK_STS_IOERR;
>   if (cleared > 0 && cleared / 512) {
> - hwpoison_clear(pmem, pmem->phys_addr + offset, cleared);
> + hwpoison_clear(pmem, pmem->phys_addr + start_aligned, cleared);
>   cleared /= 512;
>   dev_dbg(dev, "%#llx clear %ld sector%s\n",
>   (unsigned long long) sector, cleared,

We could potentially support clearing less than a sector, but I'd have
to understand the use cases better before offering implementation
suggestions.

-Jeff

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel



Re: [dm-devel] [PATCH v5 2/8] drivers/pmem: Allow pmem_clear_poison() to accept arbitrary offset and len

2020-02-20 Thread Christoph Hellwig
On Tue, Feb 18, 2020 at 04:48:35PM -0500, Vivek Goyal wrote:
> Currently pmem_clear_poison() expects offset and len to be sector aligned.
> At least that seems to be the assumption with which the code has been written.
> It is called only from pmem_do_bvec() which is called only from pmem_rw_page()
> and pmem_make_request() which will only pass sector aligned offset and len.
> 
> Soon we want to use this function from dax_zero_page_range() code path which
> can try to zero an arbitrary range of memory within a page. So update this
> function to assume that offset and length can be arbitrary and do the
> necessary alignments as needed.
> 
> nvdimm_clear_poison() seems to assume offset and len to be aligned to
> the clear_err_unit boundary. But this is currently an internal detail and is
> not exported for others to use. So for now, continue to align offset and
> length to the SECTOR_SIZE boundary. Improving it further to align to the
> clear_err_unit boundary is a TODO item for the future.
> 
> Signed-off-by: Vivek Goyal 

This looks sensible to me, but I'd really like to have Dan take a look.


--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel



[dm-devel] [PATCH v5 2/8] drivers/pmem: Allow pmem_clear_poison() to accept arbitrary offset and len

2020-02-18 Thread Vivek Goyal
Currently pmem_clear_poison() expects offset and len to be sector aligned.
At least that seems to be the assumption with which the code has been written.
It is called only from pmem_do_bvec() which is called only from pmem_rw_page()
and pmem_make_request() which will only pass sector aligned offset and len.

Soon we want to use this function from dax_zero_page_range() code path which
can try to zero an arbitrary range of memory within a page. So update this
function to assume that offset and length can be arbitrary and do the
necessary alignments as needed.

nvdimm_clear_poison() seems to assume offset and len to be aligned to
the clear_err_unit boundary. But this is currently an internal detail and is
not exported for others to use. So for now, continue to align offset and
length to the SECTOR_SIZE boundary. Improving it further to align to the
clear_err_unit boundary is a TODO item for the future.

Signed-off-by: Vivek Goyal 
---
 drivers/nvdimm/pmem.c | 22 ++
 1 file changed, 18 insertions(+), 4 deletions(-)

diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 075b11682192..e72959203253 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -74,14 +74,28 @@ static blk_status_t pmem_clear_poison(struct pmem_device 
*pmem,
sector_t sector;
long cleared;
blk_status_t rc = BLK_STS_OK;
+   phys_addr_t start_aligned, end_aligned;
+   unsigned int len_aligned;
 
-   sector = (offset - pmem->data_offset) / 512;
+   /*
+* Callers can pass arbitrary offset and len. But nvdimm_clear_poison()
+* expects memory offset and length to meet certain alignment
+* restriction (clear_err_unit). Currently nvdimm does not export
+* required alignment. So align offset and length to sector boundary
+* before passing it to nvdimm_clear_poison().
+*/
+   start_aligned = ALIGN(offset, SECTOR_SIZE);
+   end_aligned = ALIGN_DOWN((offset + len), SECTOR_SIZE) - 1;
+   len_aligned = end_aligned - start_aligned + 1;
+
+   sector = (start_aligned - pmem->data_offset) / 512;
 
-   cleared = nvdimm_clear_poison(dev, pmem->phys_addr + offset, len);
-   if (cleared < len)
+   cleared = nvdimm_clear_poison(dev, pmem->phys_addr + start_aligned,
+ len_aligned);
+   if (cleared < len_aligned)
rc = BLK_STS_IOERR;
if (cleared > 0 && cleared / 512) {
-   hwpoison_clear(pmem, pmem->phys_addr + offset, cleared);
+   hwpoison_clear(pmem, pmem->phys_addr + start_aligned, cleared);
cleared /= 512;
dev_dbg(dev, "%#llx clear %ld sector%s\n",
(unsigned long long) sector, cleared,
-- 
2.20.1

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel