Re: [dm-devel] dm-clone: Request option to send discard to source device during hydration

2023-03-29 Thread Nikos Tsironis

On 3/28/23 19:20, Mike Snitzer wrote:

On Mon, Mar 27 2023 at  4:24P -0400,
Gwendal Grignou  wrote:


On ChromeOS, we are working on migrating file backed loopback devices
to thinpool logical volumes using dm-clone on the Chromebook local
SSD.
Dm-clone hydration workflow is a great fit but the design of dm-clone
assumes a read-only source device. Data present in the backing file
will be copied to the new logical volume but can be safely deleted
only when the hydration process is complete. During migration, some
data will be duplicated and usage on the Chromebook SSD will
unnecessarily increase.
Would it be reasonable to add a discard option when enabling the
hydration process, to discard data on the source device as we go?
Two implementations are possible:
a- add a state to the hydration state machine to ensure a region is
discarded before considering another region.
b- a simpler implementation where the discard is sent asynchronously
at the end of a region copy. It may not complete successfully (in case
the device crashes during the hydration, for instance), but it will
vastly reduce the amount of data left in the source device at the end
of the hydration.

I prefer b) as it is easier to implement, but a) is cleaner from a
usage point of view.


In general, discards may not complete for any number of reasons. So
while a) gives you finer-grained potential for space being
deallocated, b) would likely suffice given that a device crash is
pretty unlikely (at least I would think).  And in the case of file
backed loopback devices, independent of dm-clone, you can just issue
discard(s) to all free space after a crash?

However you elect to do it, you'd do well to make it an optional
"discard_rw_src" (or some better name) feature that is configured when
you load the dm-clone target.
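
For illustration only (no such feature exists in dm-clone today), this
could ride on the optional-feature syntax dm-clone already uses for
no_hydration and no_discard_passdown; the feature name below is
hypothetical:

  # hypothetical feature arg enabling discards to the read-write source
  # dmsetup create clonedev --table "0 1048576 clone /dev/vg/meta /dev/vg/dest /dev/vg/src 8 1 discard_rw_src"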



I agree with Mike, but I also want to note the following.

dm-clone commits its on-disk metadata periodically every second, and
every time a FLUSH or FUA bio is written. This is done to improve
performance.

This means the dm-clone device behaves like a physical disk that has a
volatile write cache. If power is lost you may lose some recent writes,
_and_ dm-clone might need to rehydrate some regions.

So, you can't discard a region on the source device after the copy
operation has finished, because then the following scenario will result
in data corruption:

1. dm-clone hydrates a region
2. dm-clone discards the region on the source device, either
   synchronously (a) or asynchronously (b)
3. The system crashes before the metadata is committed
4. The system comes up, and dm-clone rehydrates the region, because it
   thinks it has not been hydrated yet
5. The source device might contain garbage for this region, since we
   discarded it previously
6. You have data corruption

So, you can only discard hydrated regions for which the metadata have
been committed on disk.

I think you could discard hydrated regions on the source device
periodically, right after committing the metadata.

dm-clone keeps track of the regions hydrated during each metadata
transaction, so after committing the metadata for the current
transaction, you could also send an asynchronous discard for these
regions.
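
A rough sketch of that idea, assuming dm-clone-like fields (source_dev,
region_shift, region_size) and the four-argument blkdev_issue_discard()
of recent kernels; the helper itself is hypothetical, and synchronous
only for brevity:

static void discard_committed_regions(struct clone *clone,
				      unsigned long *hydrated,
				      unsigned long nr_regions)
{
	unsigned long region;

	/* Walk the regions hydrated in the just-committed transaction. */
	for_each_set_bit(region, hydrated, nr_regions) {
		sector_t start = (sector_t)region << clone->region_shift;

		/*
		 * A real implementation would submit REQ_OP_DISCARD bios
		 * asynchronously and tolerate failures, as noted above.
		 */
		blkdev_issue_discard(clone->source_dev->bdev, start,
				     clone->region_size, GFP_NOIO);
	}
}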

Nikos.




Re: [dm-devel] [PATCH 0/2] dm era: avoid deadlock when swapping table with dm-era target

2023-02-07 Thread Nikos Tsironis

On 1/31/23 22:20, Mike Snitzer wrote:

On Tue, Jan 31 2023 at  6:01P -0500,
Nikos Tsironis  wrote:


On 1/26/23 02:06, Mike Snitzer wrote:

On Wed, Jan 25 2023 at  7:37P -0500,
Nikos Tsironis  wrote:


On 1/23/23 19:34, Mike Snitzer wrote:

On Thu, Jan 19 2023 at  4:36P -0500,
Nikos Tsironis  wrote:


On 1/18/23 18:28, Mike Snitzer wrote:

On Wed, Jan 18 2023 at  7:29P -0500,
Nikos Tsironis  wrote:


Under certain conditions, swapping a table that includes a dm-era
target with a new table causes a deadlock.

This happens when a status (STATUSTYPE_INFO) or message IOCTL is blocked
in the suspended dm-era target.

dm-era executes all metadata operations in a worker thread, which stops
processing requests when the target is suspended, and resumes again when
the target is resumed.

So, running 'dmsetup status' or 'dmsetup message' for a suspended dm-era
device blocks, until the device is resumed.

If we then load a new table to the device, while the aforementioned
dmsetup command is blocked in dm-era, and resume the device, we
deadlock.

The problem is that the 'dmsetup status' and 'dmsetup message' commands
hold a reference to the live table, i.e., they hold an SRCU read lock on
md->io_barrier, while they are blocked.

When the device is resumed, the old table is replaced with the new one
by dm_swap_table(), which ends up calling synchronize_srcu() on
md->io_barrier.

Since the blocked dmsetup command is holding the SRCU read lock, and the
old table is never resumed, 'dmsetup resume' blocks too, and we have a
deadlock.

The only way to recover is by rebooting.
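
Schematically, the two sides of the deadlock look like this (a
simplified sketch of the dm.c/dm-ioctl.c paths visible in the stack
traces below; error handling omitted):

  /* Thread A: 'dmsetup status', via table_status()/retrieve_status() */
  map = dm_get_live_table(md, &srcu_idx); /* srcu_read_lock(&md->io_barrier) */
  /* era_status() queues an RPC to the suspended worker and blocks here */
  dm_put_live_table(md, srcu_idx);        /* never reached */

  /* Thread B: 'dmsetup resume', via dm_swap_table() */
  synchronize_srcu(&md->io_barrier);      /* waits forever for thread A */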

Steps to reproduce:

1. Create device with dm-era target

# dmsetup create eradev --table "0 1048576 era /dev/datavg/erameta /dev/datavg/eradata 8192"

2. Suspend the device

# dmsetup suspend eradev

3. Load new table to device, e.g., to resize the device. Note that we
   must load the new table _after_ suspending the device to ensure the
   constructor of the new target instance reads up-to-date metadata, as
   committed by the post-suspend hook of dm-era.

# dmsetup load eradev --table "0 2097152 era /dev/datavg/erameta /dev/datavg/eradata 8192"

4. Device now has LIVE and INACTIVE tables

# dmsetup info eradev
Name:              eradev
State:             SUSPENDED
Read Ahead:        16384
Tables present:    LIVE & INACTIVE
Open count:        0
Event number:      0
Major, minor:      252, 2
Number of targets: 1

5. Retrieve the status of the device. This blocks because the device is
   suspended. Equivalently, any 'dmsetup message' operation would block
   too. This command holds the SRCU read lock on md->io_barrier.

# dmsetup status eradev


I'll have a look at this flow; it seems to me we shouldn't stack up
'dmsetup status' and 'dmsetup message' commands if the table is
suspended.

I think it is unique to dm-era that you don't allow _read_ metadata
operations while a device is suspended.  But messages really shouldn't
be sent when the device is suspended.  As-is DM is pretty silently
cutthroat about that constraint.

Resulting in deadlock is obviously cutthroat...



Hi Mike,

Thanks for the quick reply.

I couldn't find this constraint documented anywhere and since the
various DM targets seem to allow message operations while the device is
suspended I drew the wrong conclusion.

Thanks for clarifying this.


6. Resume the device. The resume operation tries to swap the old table
   with the new one and deadlocks, because it synchronizes SRCU for the
   old table, while the blocked 'dmsetup status' holds the SRCU read
   lock. And the old table is never resumed again at this point.

# dmsetup resume eradev

7. The relevant dmesg logs are:

[ 7093.345486] dm-2: detected capacity change from 1048576 to 2097152
[ 7250.875665] INFO: task dmsetup:1986 blocked for more than 120 seconds.
[ 7250.875722]   Not tainted 5.16.0-rc2-release+ #16
[ 7250.875756] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 7250.875803] task:dmsetup state:D stack:0 pid: 1986 ppid:  1313 flags:0x
[ 7250.875809] Call Trace:
[ 7250.875812]  <TASK>
[ 7250.875816]  __schedule+0x330/0x8b0
[ 7250.875827]  schedule+0x4e/0xc0
[ 7250.875831]  schedule_timeout+0x20f/0x2e0
[ 7250.875836]  ? do_set_pte+0xb8/0x120
[ 7250.875843]  ? prep_new_page+0x91/0xa0
[ 7250.875847]  wait_for_completion+0x8c/0xf0
[ 7250.875854]  perform_rpc+0x95/0xb0 [dm_era]
[ 7250.875862]  in_worker1.constprop.20+0x48/0x70 [dm_era]
[ 7250.875867]  ? era_iterate_devices+0x30/0x30 [dm_era]
[ 7250.875872]  ? era_status+0x64/0x1e0 [dm_era]
[ 7250.875877]  era_status+0x64/0x1e0 [dm_era]
[ 7250.875882]  ? dm_get_live_or_inactive_table.isra.11+0x20/0x20 [dm_mod]
[ 7250.875900]  ? __mod_node_page_state+0x82/0xc0
[ 7250.875909]  retrieve_status+0xbc/0x1e0 [dm_mod]
[ 7250.875921]

Re: [dm-devel] [PATCH 0/2] dm era: avoid deadlock when swapping table with dm-era target

2023-01-31 Thread Nikos Tsironis

On 1/26/23 02:06, Mike Snitzer wrote:

On Wed, Jan 25 2023 at  7:37P -0500,
Nikos Tsironis  wrote:


On 1/23/23 19:34, Mike Snitzer wrote:

On Thu, Jan 19 2023 at  4:36P -0500,
Nikos Tsironis  wrote:


On 1/18/23 18:28, Mike Snitzer wrote:

On Wed, Jan 18 2023 at  7:29P -0500,
Nikos Tsironis  wrote:


Under certain conditions, swapping a table that includes a dm-era
target with a new table causes a deadlock.

This happens when a status (STATUSTYPE_INFO) or message IOCTL is blocked
in the suspended dm-era target.

dm-era executes all metadata operations in a worker thread, which stops
processing requests when the target is suspended, and resumes again when
the target is resumed.

So, running 'dmsetup status' or 'dmsetup message' for a suspended dm-era
device blocks, until the device is resumed.

If we then load a new table to the device, while the aforementioned
dmsetup command is blocked in dm-era, and resume the device, we
deadlock.

The problem is that the 'dmsetup status' and 'dmsetup message' commands
hold a reference to the live table, i.e., they hold an SRCU read lock on
md->io_barrier, while they are blocked.

When the device is resumed, the old table is replaced with the new one
by dm_swap_table(), which ends up calling synchronize_srcu() on
md->io_barrier.

Since the blocked dmsetup command is holding the SRCU read lock, and the
old table is never resumed, 'dmsetup resume' blocks too, and we have a
deadlock.

The only way to recover is by rebooting.

Steps to reproduce:

1. Create device with dm-era target

   # dmsetup create eradev --table "0 1048576 era /dev/datavg/erameta /dev/datavg/eradata 8192"

2. Suspend the device

   # dmsetup suspend eradev

3. Load new table to device, e.g., to resize the device. Note that we
  must load the new table _after_ suspending the device to ensure the
  constructor of the new target instance reads up-to-date metadata, as
  committed by the post-suspend hook of dm-era.

   # dmsetup load eradev --table "0 2097152 era /dev/datavg/erameta /dev/datavg/eradata 8192"

4. Device now has LIVE and INACTIVE tables

   # dmsetup info eradev
   Name:              eradev
   State:             SUSPENDED
   Read Ahead:        16384
   Tables present:    LIVE & INACTIVE
   Open count:        0
   Event number:      0
   Major, minor:      252, 2
   Number of targets: 1

5. Retrieve the status of the device. This blocks because the device is
  suspended. Equivalently, any 'dmsetup message' operation would block
  too. This command holds the SRCU read lock on md->io_barrier.

   # dmsetup status eradev


I'll have a look at this flow; it seems to me we shouldn't stack up
'dmsetup status' and 'dmsetup message' commands if the table is
suspended.

I think it is unique to dm-era that you don't allow _read_ metadata
operations while a device is suspended.  But messages really shouldn't
be sent when the device is suspended.  As-is DM is pretty silently
cutthroat about that constraint.

Resulting in deadlock is obviously cutthroat...



Hi Mike,

Thanks for the quick reply.

I couldn't find this constraint documented anywhere and since the
various DM targets seem to allow message operations while the device is
suspended I drew the wrong conclusion.

Thanks for clarifying this.


6. Resume the device. The resume operation tries to swap the old table
  with the new one and deadlocks, because it synchronizes SRCU for the
  old table, while the blocked 'dmsetup status' holds the SRCU read
  lock. And the old table is never resumed again at this point.

   # dmsetup resume eradev

7. The relevant dmesg logs are:

[ 7093.345486] dm-2: detected capacity change from 1048576 to 2097152
[ 7250.875665] INFO: task dmsetup:1986 blocked for more than 120 seconds.
[ 7250.875722]   Not tainted 5.16.0-rc2-release+ #16
[ 7250.875756] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 7250.875803] task:dmsetup state:D stack:0 pid: 1986 ppid:  1313 flags:0x
[ 7250.875809] Call Trace:
[ 7250.875812]  <TASK>
[ 7250.875816]  __schedule+0x330/0x8b0
[ 7250.875827]  schedule+0x4e/0xc0
[ 7250.875831]  schedule_timeout+0x20f/0x2e0
[ 7250.875836]  ? do_set_pte+0xb8/0x120
[ 7250.875843]  ? prep_new_page+0x91/0xa0
[ 7250.875847]  wait_for_completion+0x8c/0xf0
[ 7250.875854]  perform_rpc+0x95/0xb0 [dm_era]
[ 7250.875862]  in_worker1.constprop.20+0x48/0x70 [dm_era]
[ 7250.875867]  ? era_iterate_devices+0x30/0x30 [dm_era]
[ 7250.875872]  ? era_status+0x64/0x1e0 [dm_era]
[ 7250.875877]  era_status+0x64/0x1e0 [dm_era]
[ 7250.875882]  ? dm_get_live_or_inactive_table.isra.11+0x20/0x20 [dm_mod]
[ 7250.875900]  ? __mod_node_page_state+0x82/0xc0
[ 7250.875909]  retrieve_status+0xbc/0x1e0 [dm_mod]
[ 7250.875921]  ? dm_get_live_or_inactive_table.isra.11+0x20/0x20 [dm_mod]
[ 7250.875932]  table_status+0x61/0xa0 [dm_mod]
[ 7250.875942]

Re: [dm-devel] [PATCH 0/2] dm era: avoid deadlock when swapping table with dm-era target

2023-01-25 Thread Nikos Tsironis

On 1/23/23 19:34, Mike Snitzer wrote:

On Thu, Jan 19 2023 at  4:36P -0500,
Nikos Tsironis  wrote:


On 1/18/23 18:28, Mike Snitzer wrote:

On Wed, Jan 18 2023 at  7:29P -0500,
Nikos Tsironis  wrote:


Under certain conditions, swapping a table that includes a dm-era
target with a new table causes a deadlock.

This happens when a status (STATUSTYPE_INFO) or message IOCTL is blocked
in the suspended dm-era target.

dm-era executes all metadata operations in a worker thread, which stops
processing requests when the target is suspended, and resumes again when
the target is resumed.

So, running 'dmsetup status' or 'dmsetup message' for a suspended dm-era
device blocks, until the device is resumed.

If we then load a new table to the device, while the aforementioned
dmsetup command is blocked in dm-era, and resume the device, we
deadlock.

The problem is that the 'dmsetup status' and 'dmsetup message' commands
hold a reference to the live table, i.e., they hold an SRCU read lock on
md->io_barrier, while they are blocked.

When the device is resumed, the old table is replaced with the new one
by dm_swap_table(), which ends up calling synchronize_srcu() on
md->io_barrier.

Since the blocked dmsetup command is holding the SRCU read lock, and the
old table is never resumed, 'dmsetup resume' blocks too, and we have a
deadlock.

The only way to recover is by rebooting.

Steps to reproduce:

1. Create device with dm-era target

  # dmsetup create eradev --table "0 1048576 era /dev/datavg/erameta /dev/datavg/eradata 8192"

2. Suspend the device

  # dmsetup suspend eradev

3. Load new table to device, e.g., to resize the device. Note that we
 must load the new table _after_ suspending the device to ensure the
 constructor of the new target instance reads up-to-date metadata, as
 committed by the post-suspend hook of dm-era.

  # dmsetup load eradev --table "0 2097152 era /dev/datavg/erameta /dev/datavg/eradata 8192"

4. Device now has LIVE and INACTIVE tables

  # dmsetup info eradev
  Name:              eradev
  State:             SUSPENDED
  Read Ahead:        16384
  Tables present:    LIVE & INACTIVE
  Open count:        0
  Event number:      0
  Major, minor:      252, 2
  Number of targets: 1

5. Retrieve the status of the device. This blocks because the device is
 suspended. Equivalently, any 'dmsetup message' operation would block
 too. This command holds the SRCU read lock on md->io_barrier.

  # dmsetup status eradev


I'll have a look at this flow; it seems to me we shouldn't stack up
'dmsetup status' and 'dmsetup message' commands if the table is
suspended.

I think it is unique to dm-era that you don't allow _read_ metadata
operations while a device is suspended.  But messages really shouldn't
be sent when the device is suspended.  As-is DM is pretty silently
cutthroat about that constraint.

Resulting in deadlock is obviously cutthroat...



Hi Mike,

Thanks for the quick reply.

I couldn't find this constraint documented anywhere and since the
various DM targets seem to allow message operations while the device is
suspended I drew the wrong conclusion.

Thanks for clarifying this.


6. Resume the device. The resume operation tries to swap the old table
 with the new one and deadlocks, because it synchronizes SRCU for the
 old table, while the blocked 'dmsetup status' holds the SRCU read
 lock. And the old table is never resumed again at this point.

  # dmsetup resume eradev

7. The relevant dmesg logs are:

[ 7093.345486] dm-2: detected capacity change from 1048576 to 2097152
[ 7250.875665] INFO: task dmsetup:1986 blocked for more than 120 seconds.
[ 7250.875722]   Not tainted 5.16.0-rc2-release+ #16
[ 7250.875756] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 7250.875803] task:dmsetup state:D stack:0 pid: 1986 ppid:  1313 flags:0x
[ 7250.875809] Call Trace:
[ 7250.875812]  <TASK>
[ 7250.875816]  __schedule+0x330/0x8b0
[ 7250.875827]  schedule+0x4e/0xc0
[ 7250.875831]  schedule_timeout+0x20f/0x2e0
[ 7250.875836]  ? do_set_pte+0xb8/0x120
[ 7250.875843]  ? prep_new_page+0x91/0xa0
[ 7250.875847]  wait_for_completion+0x8c/0xf0
[ 7250.875854]  perform_rpc+0x95/0xb0 [dm_era]
[ 7250.875862]  in_worker1.constprop.20+0x48/0x70 [dm_era]
[ 7250.875867]  ? era_iterate_devices+0x30/0x30 [dm_era]
[ 7250.875872]  ? era_status+0x64/0x1e0 [dm_era]
[ 7250.875877]  era_status+0x64/0x1e0 [dm_era]
[ 7250.875882]  ? dm_get_live_or_inactive_table.isra.11+0x20/0x20 [dm_mod]
[ 7250.875900]  ? __mod_node_page_state+0x82/0xc0
[ 7250.875909]  retrieve_status+0xbc/0x1e0 [dm_mod]
[ 7250.875921]  ? dm_get_live_or_inactive_table.isra.11+0x20/0x20 [dm_mod]
[ 7250.875932]  table_status+0x61/0xa0 [dm_mod]
[ 7250.875942]  ctl_ioctl+0x1b5/0x4f0 [dm_mod]
[ 7250.875956]  dm_ctl_ioctl+0xa/0x10 [dm_mod]
[ 7250.875966]  __x64_sys_ioctl+0x8

Re: [dm-devel] [PATCH 0/2] dm era: avoid deadlock when swapping table with dm-era target

2023-01-19 Thread Nikos Tsironis

On 1/19/23 14:58, Zdenek Kabelac wrote:

On 19. 01. 23 at 10:36, Nikos Tsironis wrote:

On 1/18/23 18:28, Mike Snitzer wrote:

On Wed, Jan 18 2023 at  7:29P -0500,
Nikos Tsironis  wrote:




Hi Mike,

Thanks for the quick reply.

I couldn't find this constraint documented anywhere and since the
various DM targets seem to allow message operations while the device is
suspended I drew the wrong conclusion.


Hi  Nikos


Some notes from an lvm2 developer - we work with these constraints:

A DM target should not need to allocate a bunch of memory while suspended
(the target should preallocate a pool of memory if it really needs to do
so in this case).

A DM target should check and allocate everything it can in the 'load'
phase and error out as early as possible (so the loaded inactive table
can be cleared/dropped and the 'resume' target can continue its work).

An error in the suspend phase is typically the last moment we can handle
failure 'somehow'.

Handling failure in 'resume' is a can of worms with no good solution - so
resume really should do as little as possible and should work with
everything already prepared by load & suspend.

'DM status/info' should work while the device is suspended - but no
allocation should be needed in this case to produce the result.

Sending messages to a suspended target should not be needed - if it is,
it should most likely be solved via a 'table reload' change (a target
design change).

Loading to the inactive table slot should not break anything for the
active table slot (table clear shall resume at the suspend point).

Surely the list could go longer - but these basics are crucial.



Hi Zdenek,

That's great information! Thanks a lot for sharing it.

Regards,
Nikos



Re: [dm-devel] [PATCH 0/2] dm era: avoid deadlock when swapping table with dm-era target

2023-01-19 Thread Nikos Tsironis

On 1/18/23 18:28, Mike Snitzer wrote:

On Wed, Jan 18 2023 at  7:29P -0500,
Nikos Tsironis  wrote:


Under certain conditions, swapping a table that includes a dm-era
target with a new table causes a deadlock.

This happens when a status (STATUSTYPE_INFO) or message IOCTL is blocked
in the suspended dm-era target.

dm-era executes all metadata operations in a worker thread, which stops
processing requests when the target is suspended, and resumes again when
the target is resumed.

So, running 'dmsetup status' or 'dmsetup message' for a suspended dm-era
device blocks, until the device is resumed.

If we then load a new table to the device, while the aforementioned
dmsetup command is blocked in dm-era, and resume the device, we
deadlock.

The problem is that the 'dmsetup status' and 'dmsetup message' commands
hold a reference to the live table, i.e., they hold an SRCU read lock on
md->io_barrier, while they are blocked.

When the device is resumed, the old table is replaced with the new one
by dm_swap_table(), which ends up calling synchronize_srcu() on
md->io_barrier.

Since the blocked dmsetup command is holding the SRCU read lock, and the
old table is never resumed, 'dmsetup resume' blocks too, and we have a
deadlock.

The only way to recover is by rebooting.

Steps to reproduce:

1. Create device with dm-era target

 # dmsetup create eradev --table "0 1048576 era /dev/datavg/erameta /dev/datavg/eradata 8192"

2. Suspend the device

 # dmsetup suspend eradev

3. Load new table to device, e.g., to resize the device. Note that we
must load the new table _after_ suspending the device to ensure the
constructor of the new target instance reads up-to-date metadata, as
committed by the post-suspend hook of dm-era.

 # dmsetup load eradev --table "0 2097152 era /dev/datavg/erameta /dev/datavg/eradata 8192"

4. Device now has LIVE and INACTIVE tables

 # dmsetup info eradev
 Name:              eradev
 State:             SUSPENDED
 Read Ahead:        16384
 Tables present:    LIVE & INACTIVE
 Open count:        0
 Event number:      0
 Major, minor:      252, 2
 Number of targets: 1

5. Retrieve the status of the device. This blocks because the device is
suspended. Equivalently, any 'dmsetup message' operation would block
too. This command holds the SRCU read lock on md->io_barrier.

 # dmsetup status eradev


I'll have a look at this flow; it seems to me we shouldn't stack up
'dmsetup status' and 'dmsetup message' commands if the table is
suspended.

I think it is unique to dm-era that you don't allow _read_ metadata
operations while a device is suspended.  But messages really shouldn't
be sent when the device is suspended.  As-is DM is pretty silently
cutthroat about that constraint.

Resulting in deadlock is obviously cutthroat...



Hi Mike,

Thanks for the quick reply.

I couldn't find this constraint documented anywhere and since the
various DM targets seem to allow message operations while the device is
suspended I drew the wrong conclusion.

Thanks for clarifying this.


6. Resume the device. The resume operation tries to swap the old table
with the new one and deadlocks, because it synchronizes SRCU for the
old table, while the blocked 'dmsetup status' holds the SRCU read
lock. And the old table is never resumed again at this point.

 # dmsetup resume eradev

7. The relevant dmesg logs are:

[ 7093.345486] dm-2: detected capacity change from 1048576 to 2097152
[ 7250.875665] INFO: task dmsetup:1986 blocked for more than 120 seconds.
[ 7250.875722]   Not tainted 5.16.0-rc2-release+ #16
[ 7250.875756] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 7250.875803] task:dmsetup state:D stack:0 pid: 1986 ppid:  1313 flags:0x
[ 7250.875809] Call Trace:
[ 7250.875812]  <TASK>
[ 7250.875816]  __schedule+0x330/0x8b0
[ 7250.875827]  schedule+0x4e/0xc0
[ 7250.875831]  schedule_timeout+0x20f/0x2e0
[ 7250.875836]  ? do_set_pte+0xb8/0x120
[ 7250.875843]  ? prep_new_page+0x91/0xa0
[ 7250.875847]  wait_for_completion+0x8c/0xf0
[ 7250.875854]  perform_rpc+0x95/0xb0 [dm_era]
[ 7250.875862]  in_worker1.constprop.20+0x48/0x70 [dm_era]
[ 7250.875867]  ? era_iterate_devices+0x30/0x30 [dm_era]
[ 7250.875872]  ? era_status+0x64/0x1e0 [dm_era]
[ 7250.875877]  era_status+0x64/0x1e0 [dm_era]
[ 7250.875882]  ? dm_get_live_or_inactive_table.isra.11+0x20/0x20 [dm_mod]
[ 7250.875900]  ? __mod_node_page_state+0x82/0xc0
[ 7250.875909]  retrieve_status+0xbc/0x1e0 [dm_mod]
[ 7250.875921]  ? dm_get_live_or_inactive_table.isra.11+0x20/0x20 [dm_mod]
[ 7250.875932]  table_status+0x61/0xa0 [dm_mod]
[ 7250.875942]  ctl_ioctl+0x1b5/0x4f0 [dm_mod]
[ 7250.875956]  dm_ctl_ioctl+0xa/0x10 [dm_mod]
[ 7250.875966]  __x64_sys_ioctl+0x8e/0xd0
[ 7250.875970]  do_syscall_64+0x3a/0xd0
[ 7250.875974]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 7250.875980] RIP

[dm-devel] [PATCH 0/2] dm era: avoid deadlock when swapping table with dm-era target

2023-01-18 Thread Nikos Tsironis
pid: 1987 ppid:  1385 flags:0x
[ 7250.876134] Call Trace:
[ 7250.876136]  <TASK>
[ 7250.876138]  __schedule+0x330/0x8b0
[ 7250.876142]  schedule+0x4e/0xc0
[ 7250.876145]  schedule_timeout+0x20f/0x2e0
[ 7250.876149]  ? __queue_work+0x226/0x420
[ 7250.876156]  wait_for_completion+0x8c/0xf0
[ 7250.876160]  __synchronize_srcu.part.19+0x92/0xc0
[ 7250.876167]  ? __bpf_trace_rcu_stall_warning+0x10/0x10
[ 7250.876173]  ? dm_swap_table+0x2f4/0x310 [dm_mod]
[ 7250.876185]  dm_swap_table+0x2f4/0x310 [dm_mod]
[ 7250.876198]  ? table_load+0x360/0x360 [dm_mod]
[ 7250.876207]  dev_suspend+0x95/0x250 [dm_mod]
[ 7250.876217]  ctl_ioctl+0x1b5/0x4f0 [dm_mod]
[ 7250.876231]  dm_ctl_ioctl+0xa/0x10 [dm_mod]
[ 7250.876240]  __x64_sys_ioctl+0x8e/0xd0
[ 7250.876244]  do_syscall_64+0x3a/0xd0
[ 7250.876247]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 7250.876252] RIP: 0033:0x7f15e9254017
[ 7250.876254] RSP: 002b:7dc59458 EFLAGS: 0246 ORIG_RAX: 0010
[ 7250.876257] RAX: ffda RBX: 55d4d99560e0 RCX: 7f15e9254017
[ 7250.876260] RDX: 55d4d99560e0 RSI: c138fd06 RDI: 0003
[ 7250.876261] RBP: 000f R08: 7f15e975f648 R09: 7dc592c0
[ 7250.876263] R10: 7f15e975eb53 R11: 0246 R12: 55d4d9956110
[ 7250.876265] R13: 7f15e975eb53 R14: 55d4d9956000 R15: 0001
[ 7250.876269]  </TASK>

Fix this by allowing metadata operations triggered by user space to run
in the corresponding user space thread, instead of queueing them for
execution by the dm-era worker thread.

This way, even if the device is suspended, the metadata operations will
run and release the SRCU read lock, preventing a subsequent table reload
(dm_swap_table()) from deadlocking.

To make this possible, use a mutex to serialize access to the metadata
between the worker thread and the user space threads.

This is a follow-up on 
https://listman.redhat.com/archives/dm-devel/2021-December/msg00044.html.

Nikos Tsironis (2):
  dm era: allocate in-core writesets when loading metadata
  dm era: avoid deadlock when swapping table

 drivers/md/dm-era-target.c | 78 ++
 1 file changed, 63 insertions(+), 15 deletions(-)

-- 
2.30.2




[dm-devel] [PATCH 2/2] dm era: avoid deadlock when swapping table

2023-01-18 Thread Nikos Tsironis
Under certain conditions, swapping a table that includes a dm-era
target with a new table causes a deadlock.

This happens when a status (STATUSTYPE_INFO) or message IOCTL is blocked
in the suspended dm-era target.

dm-era executes all metadata operations in a worker thread, which stops
processing requests when the target is suspended, and resumes again when
the target is resumed.

So, running 'dmsetup status' or 'dmsetup message' for a suspended dm-era
device blocks, until the device is resumed.

If we then load a new table to the device, while the aforementioned
dmsetup command is blocked in dm-era, and resume the device, we
deadlock.

The problem is that the 'dmsetup status' and 'dmsetup message' commands
hold a reference to the live table, i.e., they hold an SRCU read lock on
md->io_barrier, while they are blocked.

When the device is resumed, the old table is replaced with the new one
by dm_swap_table(), which ends up calling synchronize_srcu() on
md->io_barrier.

Since the blocked dmsetup command is holding the SRCU read lock, and the
old table is never resumed, 'dmsetup resume' blocks too, and we have a
deadlock.

The only way to recover is by rebooting.

Fix this by allowing metadata operations triggered by user space to run
in the corresponding user space thread, instead of queueing them for
execution by the dm-era worker thread.

This way, even if the device is suspended, the metadata operations will
run and release the SRCU read lock, preventing a subsequent table reload
(dm_swap_table()) from deadlocking.

To make this possible, use a mutex to serialize access to the metadata
between the worker thread and the user space threads.

Fixes: eec40579d8487 ("dm: add era target")
Cc: sta...@vger.kernel.org # v3.15+
Signed-off-by: Nikos Tsironis 
---
 drivers/md/dm-era-target.c | 58 --
 1 file changed, 43 insertions(+), 15 deletions(-)

diff --git a/drivers/md/dm-era-target.c b/drivers/md/dm-era-target.c
index 3332bed2f412..c57a19320dbf 100644
--- a/drivers/md/dm-era-target.c
+++ b/drivers/md/dm-era-target.c
@@ -11,6 +11,7 @@
 #include <linux/module.h>
 #include <linux/device-mapper.h>
 #include <linux/vmalloc.h>
+#include <linux/mutex.h>
 
 #define DM_MSG_PREFIX "era"
 
@@ -1182,6 +1183,12 @@ struct era {
dm_block_t nr_blocks;
uint32_t sectors_per_block;
int sectors_per_block_shift;
+
+   /*
+* Serialize access to metadata between worker thread and user space
+* threads.
+*/
+   struct mutex metadata_lock;
struct era_metadata *md;
 
struct workqueue_struct *wq;
@@ -1358,10 +1365,12 @@ static void do_work(struct work_struct *ws)
 {
struct era *era = container_of(ws, struct era, worker);
 
+   mutex_lock(&era->metadata_lock);
kick_off_digest(era);
process_old_eras(era);
process_deferred_bios(era);
process_rpc_calls(era);
+   mutex_unlock(&era->metadata_lock);
 }
 
 static void defer_bio(struct era *era, struct bio *bio)
@@ -1400,17 +1409,6 @@ static int in_worker0(struct era *era, int (*fn)(struct era_metadata *))
return perform_rpc(era, &rpc);
 }
 
-static int in_worker1(struct era *era,
- int (*fn)(struct era_metadata *, void *), void *arg)
-{
-   struct rpc rpc;
-   rpc.fn0 = NULL;
-   rpc.fn1 = fn;
-   rpc.arg = arg;
-
-   return perform_rpc(era, &rpc);
-}
-
 static void start_worker(struct era *era)
 {
atomic_set(&era->suspended, 0);
@@ -1439,6 +1437,7 @@ static void era_destroy(struct era *era)
if (era->metadata_dev)
dm_put_device(era->ti, era->metadata_dev);
 
+   mutex_destroy(&era->metadata_lock);
kfree(era);
 }
 
@@ -1539,6 +1538,8 @@ static int era_ctr(struct dm_target *ti, unsigned argc, char **argv)
spin_lock_init(&era->rpc_lock);
INIT_LIST_HEAD(&era->rpc_calls);
 
+   mutex_init(&era->metadata_lock);
+
ti->private = era;
ti->num_flush_bios = 1;
ti->flush_supported = true;
@@ -1591,7 +1592,9 @@ static void era_postsuspend(struct dm_target *ti)
 
stop_worker(era);
 
+   mutex_lock(&era->metadata_lock);
r = metadata_commit(era->md);
+   mutex_unlock(&era->metadata_lock);
if (r) {
DMERR("%s: metadata_commit failed", __func__);
/* FIXME: fail mode */
@@ -1605,19 +1608,23 @@ static int era_preresume(struct dm_target *ti)
dm_block_t new_size = calc_nr_blocks(era);
 
if (era->nr_blocks != new_size) {
+   mutex_lock(&era->metadata_lock);
r = metadata_resize(era->md, &new_size);
if (r) {
DMERR("%s: metadata_resize failed", __func__);
+   mutex_unlock(&era->metadata_lock);
return r;
}
 
r = metadata_commit(era->md);
if (r) {
DMERR("%s: metadata

[dm-devel] [PATCH 1/2] dm era: allocate in-core writesets when loading metadata

2023-01-18 Thread Nikos Tsironis
Until now, the allocation of the in-core bitmaps was done in pre-resume
by metadata_resize().

In preparation for the next commit, which makes it possible for a
metadata operation, e.g. an era rollover, to run before pre-resume runs,
allocate the in-core bitmaps as part of opening the metadata.

Also, set the number of blocks of the era device in era_ctr() to the
number of blocks in the metadata.

This avoids attempting to resize the metadata every time we create a new
target instance, even if the metadata size hasn't changed.

Fixes: eec40579d8487 ("dm: add era target")
Cc: sta...@vger.kernel.org # v3.15+
Signed-off-by: Nikos Tsironis 
---
 drivers/md/dm-era-target.c | 20 
 1 file changed, 20 insertions(+)

diff --git a/drivers/md/dm-era-target.c b/drivers/md/dm-era-target.c
index e92c1afc3677..3332bed2f412 100644
--- a/drivers/md/dm-era-target.c
+++ b/drivers/md/dm-era-target.c
@@ -788,6 +788,7 @@ static int metadata_digest_start(struct era_metadata *md, struct digest *d)
  * High level metadata interface.  Target methods should use these, and not
  * the lower level ones.
  *--------------------------------------------------------------*/
+static void metadata_close(struct era_metadata *md);
 static struct era_metadata *metadata_open(struct block_device *bdev,
  sector_t block_size,
  bool may_format)
@@ -811,6 +812,24 @@ static struct era_metadata *metadata_open(struct block_device *bdev,
return ERR_PTR(r);
}
 
+   if (md->nr_blocks == 0)
+   return md;
+
+   /* Allocate in-core writesets */
+   r = writeset_alloc(&md->writesets[0], md->nr_blocks);
+   if (r) {
+   DMERR("%s: writeset_alloc failed for writeset 0", __func__);
+   metadata_close(md);
+   return ERR_PTR(r);
+   }
+
+   r = writeset_alloc(&md->writesets[1], md->nr_blocks);
+   if (r) {
+   DMERR("%s: writeset_alloc failed for writeset 1", __func__);
+   metadata_close(md);
+   return ERR_PTR(r);
+   }
+
return md;
 }
 
@@ -1504,6 +1523,7 @@ static int era_ctr(struct dm_target *ti, unsigned argc, char **argv)
return PTR_ERR(md);
}
era->md = md;
+   era->nr_blocks = md->nr_blocks;
 
era->wq = alloc_ordered_workqueue("dm-" DM_MSG_PREFIX, WQ_MEM_RECLAIM);
if (!era->wq) {
-- 
2.30.2




[dm-devel] [PATCH] dm clone: Fix typo in block_device format specifier

2022-09-29 Thread Nikos Tsironis
Use %pg for printing the block device name, instead of %pd.

Fixes: 385411ffba0c ("dm: stop using bdevname")
Cc: sta...@vger.kernel.org # v5.18+
Signed-off-by: Nikos Tsironis 
---
 drivers/md/dm-clone-target.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/md/dm-clone-target.c b/drivers/md/dm-clone-target.c
index 811b0a5379d0..2f1cc66d2641 100644
--- a/drivers/md/dm-clone-target.c
+++ b/drivers/md/dm-clone-target.c
@@ -2035,7 +2035,7 @@ static void disable_passdown_if_not_supported(struct clone *clone)
reason = "max discard sectors smaller than a region";
 
if (reason) {
-   DMWARN("Destination device (%pd) %s: Disabling discard passdown.",
+   DMWARN("Destination device (%pg) %s: Disabling discard passdown.",
   dest_dev, reason);
clear_bit(DM_CLONE_DISCARD_PASSDOWN, &clone->flags);
}
-- 
2.30.2




[dm-devel] [PATCH] dm era: commit metadata in postsuspend after worker stops

2022-06-21 Thread Nikos Tsironis
During postsuspend dm-era does the following:

1. Archives the current era
2. Commits the metadata, as part of the RPC call for archiving the
   current era
3. Stops the worker

Until the worker stops, it might write to the metadata again. Moreover,
these writes are not flushed to disk immediately, but are cached by the
dm-bufio client, which writes them back asynchronously.

As a result, the committed metadata of a suspended dm-era device might
not be consistent with the in-core metadata.

In some cases, this can result in the corruption of the on-disk
metadata. Suppose the following sequence of events:

1. Load a new table, e.g. a snapshot-origin table, to a device with a
   dm-era table
2. Suspend the device
3. dm-era commits its metadata, but the worker does a few more metadata
   writes until it stops, as part of digesting an archived writeset
4. These writes are cached by the dm-bufio client
5. Load the dm-era table to another device.
6. The new instance of the dm-era target loads the committed, on-disk
   metadata, which don't include the extra writes done by the worker
   after the metadata commit.
7. Resume the new device
8. The new dm-era target instance starts using the metadata
9. Resume the original device
10. The destructor of the old dm-era target instance is called and
destroys the dm-bufio client, which results in flushing the cached
writes to disk
11. These writes might overwrite the writes done by the new dm-era
instance, hence corrupting its metadata.

Fix this by committing the metadata after the worker stops running.

stop_worker uses flush_workqueue to flush the current work. However, the
work item may re-queue itself, and flush_workqueue doesn't wait for
re-queued work items to finish.

This could result in the worker changing the metadata after they have
been committed, or writing to the metadata concurrently with the commit
in the postsuspend thread.

Use drain_workqueue instead, which waits until the work item and all
re-queued instances of it finish.
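
To illustrate the difference, a minimal sketch of a self-re-queueing
work item like dm-era's worker (more_work_pending() is a hypothetical
stand-in for the real requeue conditions):

static void worker_fn(struct work_struct *ws)
{
	struct era *era = container_of(ws, struct era, worker);

	process_deferred_bios(era);

	/* The work item re-queues itself while there is more work. */
	if (more_work_pending(era))
		queue_work(era->wq, &era->worker);
}

flush_workqueue() waits only for instances already queued when it is
called, so a re-queued instance may still run afterwards;
drain_workqueue() keeps flushing until the queue stays empty, so
re-queued instances are waited for too.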

Fixes: eec40579d8487 ("dm: add era target")
Cc: sta...@vger.kernel.org # v3.15+
Signed-off-by: Nikos Tsironis 
---
 drivers/md/dm-era-target.c | 8 +++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/drivers/md/dm-era-target.c b/drivers/md/dm-era-target.c
index 1f6bf152b3c7..e92c1afc3677 100644
--- a/drivers/md/dm-era-target.c
+++ b/drivers/md/dm-era-target.c
@@ -1400,7 +1400,7 @@ static void start_worker(struct era *era)
 static void stop_worker(struct era *era)
 {
atomic_set(&era->suspended, 1);
-   flush_workqueue(era->wq);
+   drain_workqueue(era->wq);
 }
 
 /*
@@ -1570,6 +1570,12 @@ static void era_postsuspend(struct dm_target *ti)
}
 
stop_worker(era);
+
+   r = metadata_commit(era->md);
+   if (r) {
+   DMERR("%s: metadata_commit failed", __func__);
+   /* FIXME: fail mode */
+   }
 }
 
 static int era_preresume(struct dm_target *ti)
-- 
2.30.2




Re: [dm-devel] [LSF/MM/BFP ATTEND] [LSF/MM/BFP TOPIC] Storage: Copy Offload

2022-03-09 Thread Nikos Tsironis

On 3/9/22 10:51, Mikulas Patocka wrote:


Hi

Note that you must submit kcopyd callbacks from a single thread, otherwise
there's a race condition in snapshot.



Hi,

Thanks for the feedback. Yes, I'm aware of that.


The snapshot code doesn't take locks in the copy_callback and it expects
that the callbacks are serialized.

Maybe, adding the locks to copy_callback would solve it.



That's what I did. I used a lock to ensure that kcopyd callbacks are
serialized for persistent snapshots.

For transient snapshots we can lift this limitation, and complete
pending exceptions out-of-order and in "parallel", i.e., without
explicitly serializing kcopyd callbacks. The locks in pending_complete()
are enough in this case.
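
A minimal sketch of that approach (the callback signature is the real
dm_kcopyd_notify_fn one; 'callback_lock' is a hypothetical field, not
part of the upstream dm_snapshot struct):

static void copy_callback(int read_err, unsigned long write_err, void *context)
{
	struct dm_snap_pending_exception *pe = context;
	struct dm_snapshot *s = pe->snap;

	spin_lock(&s->callback_lock);
	/* Complete the pending exception; completions are now serialized. */
	spin_unlock(&s->callback_lock);
}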

Nikos




Re: [dm-devel] [LSF/MM/BFP ATTEND] [LSF/MM/BFP TOPIC] Storage: Copy Offload

2022-03-08 Thread Nikos Tsironis

On 3/1/22 23:32, Chaitanya Kulkarni wrote:

Nikos,


[8] https://kernel.dk/io_uring.pdf


I would like to participate in the discussion too.

The dm-clone target would also benefit from copy offload, as it heavily
employs dm-kcopyd. I have been exploring redesigning kcopyd in order to
achieve increased IOPS in dm-clone and dm-snapshot for small copies over
NVMe devices, but copy offload sounds even more promising, especially
for larger copies happening in the background (as is the case with
dm-clone's background hydration).

Thanks,
Nikos


If you can document your findings here it will be great for me to
add it to the agenda.



My work focuses mainly on improving the IOPs and latency of the
dm-snapshot target, in order to bring the performance of short-lived
snapshots as close as possible to bare-metal performance.

My initial performance evaluation of dm-snapshot had revealed a big
performance drop while the snapshot is active; a drop which is not
justified by COW alone.

Using fio with blktrace I had noticed that the per-CPU I/O distribution
was uneven. Although many threads were doing I/O, only a couple of the
CPUs ended up submitting I/O requests to the underlying device.

The same issue also affects dm-clone, when doing I/O with sizes smaller
than the target's region size, where kcopyd is used for COW.

The bottleneck here is kcopyd serializing all I/O. Users of kcopyd, such
as dm-snapshot and dm-clone, cannot take advantage of the increased I/O
parallelism that comes with using blk-mq in modern multi-core systems,
because I/Os are issued only by a single CPU at a time, the one on which
kcopyd’s thread happens to be running.

So, I experimented redesigning kcopyd to prevent I/O serialization by
respecting thread locality for I/Os and their completions. This made the
distribution of I/O processing uniform across CPUs.

My measurements had shown that scaling kcopyd, in combination with
scaling dm-snapshot itself [1] [2], can lead to an eventual performance
improvement of ~300% increase in sustained throughput and ~80% decrease
in I/O latency for transient snapshots, over the null_blk device.

The work for scaling dm-snapshot has been merged [1], but,
unfortunately, I haven't been able to send upstream my work on kcopyd
yet, because I have been really busy with other things the last couple
of years.

I haven't looked into the details of copy offload yet, but it would be
really interesting to see how it affects the performance of random and
sequential workloads, and to check how, and if, scaling kcopyd affects
the performance, in combination with copy offload.

Nikos

[1] 
https://lore.kernel.org/dm-devel/20190317122258.21760-1-ntsiro...@arrikto.com/
[2] 
https://lore.kernel.org/dm-devel/425d7efe-ab3f-67be-264e-9c3b6db22...@arrikto.com/



Re: [dm-devel] [LSF/MM/BFP ATTEND] [LSF/MM/BFP TOPIC] Storage: Copy Offload

2022-03-03 Thread Nikos Tsironis

On 3/1/22 23:32, Chaitanya Kulkarni wrote:

Nikos,


[8] https://kernel.dk/io_uring.pdf


I would like to participate in the discussion too.

The dm-clone target would also benefit from copy offload, as it heavily
employs dm-kcopyd. I have been exploring redesigning kcopyd in order to
achieve increased IOPS in dm-clone and dm-snapshot for small copies over
NVMe devices, but copy offload sounds even more promising, especially
for larger copies happening in the background (as is the case with
dm-clone's background hydration).

Thanks,
Nikos


If you can document your findings here it will be great for me to
add it to the agenda.



Hi,

Give me a few days to gather my notes, because it's been a while since
the last time I worked on this, and I will come back with a summary of
my findings.

Nikos




Re: [dm-devel] [LSF/MM/BFP ATTEND] [LSF/MM/BFP TOPIC] Storage: Copy Offload

2022-03-01 Thread Nikos Tsironis

On 1/27/22 09:14, Chaitanya Kulkarni wrote:

Hi,

* Background :-
---

Copy offload is a feature that allows file-systems or storage devices
to be instructed to copy files/logical blocks without requiring
involvement of the local CPU.

With reference to the RISC-V summit keynote [1], single-threaded
performance is limited due to Dennard scaling, and multi-threaded
performance is slowing down due to Moore's law limitations. With the rise
of SNIA Computation Technical Storage Working Group (TWG) [2],
offloading computations to the device or over the fabrics is becoming
popular as there are several solutions available [2]. One common
operation that is popular in the kernel but not merged yet is copy
offload over the fabrics or onto the device.

* Problem :-
---

The original work which is done by Martin is present here [3]. The
latest work which is posted by Mikulas [4] is not merged yet. These two
approaches are totally different from each other. Several storage
vendors discourage mixing copy offload requests with regular READ/WRITE
I/O. Also, the fact that the operation fails if a copy request ever
needs to be split as it traverses the stack has the unfortunate
side-effect of preventing copy offload from working in pretty much
every common deployment configuration out there.

* Current state of the work :-
---

[3] is hard pressed to handle arbitrary DM/MD stacking without splitting
the command in two, one for copying IN and one for copying OUT; [4] then
demonstrates why [3] is not a suitable candidate. Also, with [4] there is
an unresolved problem with the two-command approach about how to handle
changes to the DM layout between the IN and OUT operations.

We conducted a call with interested people late last year, given the
lack of LSF/MM, and we would like to share the details with the broader
community.

* Why Linux Kernel Storage System needs Copy Offload support now ?
---

With the rise of the SNIA Computational Storage TWG and solutions [2],
existing SCSI XCopy support in the protocol, recent advancement in the
Linux Kernel File System for Zoned devices (Zonefs [5]), Peer to Peer
DMA support in the Linux Kernel mainly for NVMe devices [7] and
eventually NVMe Devices and subsystem (NVMe PCIe/NVMeOF) will benefit
from Copy offload operation.

With this background we have a significant number of use-cases which are
strong candidates waiting for outstanding Linux Kernel Block Layer Copy
Offload support, so that the Linux Kernel Storage subsystem can address
the previously mentioned problems [1] and allow efficient offloading of
data related operations (such as move/copy etc.).
For reference following is the list of the use-cases/candidates waiting
for Copy Offload support :-

1. SCSI-attached storage arrays.
2. Stacking drivers supporting XCopy DM/MD.
3. Computational Storage solutions.
4. File systems :- Local, NFS and Zonefs.
5. Block devices :- Distributed, local, and Zoned devices.
6. Peer to Peer DMA support solutions.
7. Potentially NVMe subsystem both NVMe PCIe and NVMeOF.

* What we will discuss in the proposed session ?
---

I'd like to propose a session to go over this topic to understand :-

1. What are the blockers for Copy Offload implementation ?
2. Discussion about having a file system interface.
3. Discussion about having right system call for user-space.
4. What is the right way to move this work forward ?
5. How can we help to contribute and move this work forward ?

* Required Participants :-
---

I'd like to invite file system, block layer, and device drivers
developers to:-

1. Share their opinion on the topic.
2. Share their experience and any other issues with [4].
3. Uncover additional details that are missing from this proposal.

Required attendees :-

Martin K. Petersen
Jens Axboe
Christoph Hellwig
Bart Van Assche
Zach Brown
Roland Dreier
Ric Wheeler
Trond Myklebust
Mike Snitzer
Keith Busch
Sagi Grimberg
Hannes Reinecke
Frederick Knight
Mikulas Patocka

-ck

[1]https://content.riscv.org/wp-content/uploads/2018/12/A-New-Golden-Age-for-Computer-Architecture-History-Challenges-and-Opportunities-David-Patterson-.pdf
[2] https://www.snia.org/computational
https://www.napatech.com/support/resources/solution-descriptions/napatech-smartnic-solution-for-hardware-offload/
https://www.eideticom.com/products.html
https://www.xilinx.com/applications/data-center/computational-storage.html
[3] git://git.kernel.org/pub/scm/linux/kernel/git/mkp/linux.git xcopy
[4] https://www.spinics.net/lists/linux-block/msg00599.html
[5] 

Re: [dm-devel] Deadlock when swapping a table with a dm-era target

2021-12-08 Thread Nikos Tsironis

On 12/3/21 6:00 PM, Zdenek Kabelac wrote:

On 03. 12. 21 at 15:42, Nikos Tsironis wrote:

On 12/2/21 5:41 PM, Zdenek Kabelac wrote:

On 01. 12. 21 at 18:07, Nikos Tsironis wrote:

Hello,

Under certain conditions, swapping a table, that includes a dm-era
target, with a new table, causes a deadlock.

This happens when a status (STATUSTYPE_INFO) or message IOCTL is blocked
in the suspended dm-era target.

dm-era executes all metadata operations in a worker thread, which stops
processing requests when the target is suspended, and resumes again when
the target is resumed.

So, running 'dmsetup status' or 'dmsetup message' for a suspended dm-era
device blocks, until the device is resumed.


Hi Zdenek,

Thanks for the feedback. There doesn't seem to be any documentation
mentioning that loading the new table should happen before suspend, so
thanks a lot for explaining it.

Unfortunately, this isn't what causes the deadlock. The following
sequence, which loads the table before suspend, also results in a
deadlock:

1. Create device with dm-era target

   # dmsetup create eradev --table "0 1048576 era /dev/datavg/erameta /dev/datavg/eradata 8192"

2. Load new table to device, e.g., to resize the device

   # dmsetup load eradev --table "0 2097152 era /dev/datavg/erameta /dev/datavg/eradata 8192"

3. Suspend the device

   # dmsetup suspend eradev

4. Retrieve the status of the device. This blocks for the reasons I
   explained in my previous email.

   # dmsetup status eradev



Hi

Querying 'status' while the device is suspended is the next issue you need
to fix in your workflow.



Hi,

These steps are not my exact workflow. It's just a series of steps to
easily reproduce the bug.

I am not the one retrieving the status of the suspended device. LVM is.
LVM, when running commands like 'lvs' and 'vgs', retrieves the status of
the devices on the system using the DM_TABLE_STATUS ioctl.

LVM indeed uses the DM_NOFLUSH_FLAG, but this doesn't make a difference
for dm-era, since it doesn't check for this flag.

So, for example, a user or a monitoring daemon running an LVM command,
like 'vgs', at the "wrong" time triggers the bug:

1. Create device with dm-era target

   # dmsetup create eradev --table "0 1048576 era /dev/datavg/erameta /dev/datavg/eradata 8192"

2. Load new table to device, e.g., to resize the device

   # dmsetup load eradev --table "0 2097152 era /dev/datavg/erameta /dev/datavg/eradata 8192"

3. Suspend the device

   # dmsetup suspend eradev

4. Someone, e.g., a user or a monitoring daemon, runs an LVM command at
   this point, e.g. 'vgs'.

5. 'vgs' tries to retrieve the status of the dm-era device using the
   DM_TABLE_STATUS ioctl, and blocks.

6. Resume the device: This deadlocks.

   # dmsetup resume eradev

So, I can't change something in my workflow to prevent the bug. It's a
race that happens when someone runs an LVM command at the "wrong" time.

I am aware that using an appropriate LVM 'global_filter' can prevent
this, but:

1. This is just a workaround, not a proper solution.
2. This is not always an option. Imagine someone running an LVM command
   in a container, for example. Or, we may not be allowed to change the
   LVM configuration of the host at all.


Normally, a 'status' operation may need to flush queued IO operations to
get accurate data.

So you should query states before you start to mess with tables.

If you want to get 'status' without flushing - use: 'dmsetup status --noflush'.



I am aware of that, and of the '--noflush' flag.

But note that:

1. As I have already explained in my previous emails, the reason of the
   deadlock is not I/O related.
2. dm-era doesn't check for this flag, so using it doesn't make a
   difference.
3. Other targets, e.g., dm-thin and dm-cache, that check for this flag,
   also check _explicitly_ if the device is suspended, before committing
   their metadata to get accurate statistics. They don't just rely on
   the user to use the '--noflush' flag.
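
For reference, the guard in dm-thin's pool_status() looks roughly like
this (condensed from drivers/md/dm-thin.c):

	/* Commit to ensure statistics aren't out-of-date */
	if (!(status_flags & DM_STATUS_NOFLUSH_FLAG) && !dm_suspended(ti))
		(void) commit(pool);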

That said, fixing 'era_status()' to check for the 'noflush' flag and to
check if the device is suspended, could be a possible fix, which I have
already proposed in my first email.

Although, as I have already explained, it's not a simple matter of not
committing metadata when the 'noflush' flag is used, or the device is
suspended.

dm-era queues the status operation (as well as all operations that touch
the metadata) for execution by a worker thread, to avoid using locks for
accessing the metadata.

When the target is suspended this thread doesn't execute operations, so
the 'table_status()' call blocks, holding the SRCU read lock of the
device (md->io_barrier), until the target is resumed.

But, 'table_status()' _never_ unblocks if you resume the device with a
new table preloaded. Instead, the resume operation ('dm_swap_table()')
deadlocks waiting for 'table_status()' to drop the SRCU read lock.

This never happens, and we end up with a deadlock.
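
To illustrate the shape of the deadlock, here is a userspace analogue,
a sketch only: a pthread rwlock stands in for the md->io_barrier SRCU,
and a condition that is never signalled stands in for the RPC
completion the stopped worker would fire. Built with 'cc -pthread', it
hangs the same way the two dmsetup commands do:

  #include <pthread.h>
  #include <stdio.h>
  #include <unistd.h>

  static pthread_rwlock_t io_barrier = PTHREAD_RWLOCK_INITIALIZER;
  static pthread_mutex_t rpc_lock = PTHREAD_MUTEX_INITIALIZER;
  static pthread_cond_t rpc_done = PTHREAD_COND_INITIALIZER;
  static int rpc_completed;      /* never set: the worker is stopped */

  /* Models 'dmsetup status': takes the "SRCU read lock", then waits
   * for an RPC that only the suspended worker could complete. */
  static void *status_thread(void *arg)
  {
          (void)arg;
          pthread_rwlock_rdlock(&io_barrier);   /* dm_get_live_table() */
          puts("status: waiting for the suspended worker...");
          pthread_mutex_lock(&rpc_lock);
          while (!rpc_completed)
                  pthread_cond_wait(&rpc_done, &rpc_lock);
          pthread_mutex_unlock(&rpc_lock);
          pthread_rwlock_unlock(&io_barrier);   /* never reached */
          return NULL;
  }

  int main(void)
  {
          pthread_t t;

          pthread_create(&t, NULL, status_thread, NULL);
          sleep(1);
          /* Models dm_swap_table()/synchronize_srcu(): wait until all
           * readers are gone, i.e. forever. */
          puts("resume: waiting for readers to drain...");
          pthread_rwlock_wrlock(&io_barrier);
          puts("resume: done");                 /* never printed */
          return 0;
  }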

Re: [dm-devel] Deadlock when swapping a table with a dm-era target

2021-12-03 Thread Nikos Tsironis

On 12/2/21 5:41 PM, Zdenek Kabelac wrote:

Dne 01. 12. 21 v 18:07 Nikos Tsironis napsal(a):

Hello,

Under certain conditions, swapping a table, that includes a dm-era
target, with a new table, causes a deadlock.

This happens when a status (STATUSTYPE_INFO) or message IOCTL is blocked
in the suspended dm-era target.

dm-era executes all metadata operations in a worker thread, which stops
processing requests when the target is suspended, and resumes again when
the target is resumed.

So, running 'dmsetup status' or 'dmsetup message' for a suspended dm-era
device blocks, until the device is resumed.

This seems to be a problem on its own.

If we then load a new table to the device, while the aforementioned
dmsetup command is blocked in dm-era, and resume the device, we
deadlock.

The problem is that the 'dmsetup status' and 'dmsetup message' commands
hold a reference to the live table, i.e., they hold an SRCU read lock on
md->io_barrier, while they are blocked.

When the device is resumed, the old table is replaced with the new one
by dm_swap_table(), which ends up calling synchronize_srcu() on
md->io_barrier.

Since the blocked dmsetup command is holding the SRCU read lock, and the
old table is never resumed, 'dmsetup resume' blocks too, and we have a
deadlock.

Steps to reproduce:

1. Create device with dm-era target

   # dmsetup create eradev --table "0 1048576 era /dev/datavg/erameta 
/dev/datavg/eradata 8192"

2. Suspend the device

   # dmsetup suspend eradev

3. Load new table to device, e.g., to resize the device

   # dmsetup load eradev --table "0 2097152 era /dev/datavg/erameta 
/dev/datavg/eradata 8192"



Your sequence is faulty - you must always preload the new table before suspend.

Suspend should be absolutely minimal in its timing.

Also, nothing should be allocating memory during suspend; that's why suspend
has to happen after the table line is fully loaded.



Hi Zdenek,

Thanks for the feedback. There doesn't seem to be any documentation
mentioning that loading the new table should happen before suspend, so
thanks a lot for explaining it.

Unfortunately, this isn't what causes the deadlock. The following
sequence, which loads the table before suspend, also results in a
deadlock:

1. Create device with dm-era target

   # dmsetup create eradev --table "0 1048576 era /dev/datavg/erameta 
/dev/datavg/eradata 8192"

2. Load new table to device, e.g., to resize the device

   # dmsetup load eradev --table "0 2097152 era /dev/datavg/erameta 
/dev/datavg/eradata 8192"

3. Suspend the device

   # dmsetup suspend eradev

4. Retrieve the status of the device. This blocks for the reasons I
   explained in my previous email.

   # dmsetup status eradev

5. Resume the device. This deadlocks for the reasons I explained in my
   previous email.

   # dmsetup resume eradev

6. The dmesg logs are the same as the ones I included in my previous
   email.

I have explained the reasons for the deadlock in my previous email, but
I would be more than happy to discuss them more.

I would also like your feedback on the solutions I proposed there, so I
can work on a fix.

Thanks,
Nikos.

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel



[dm-devel] Deadlock when swapping a table with a dm-era target

2021-12-01 Thread Nikos Tsironis

Hello,

Under certain conditions, swapping a table, that includes a dm-era
target, with a new table, causes a deadlock.

This happens when a status (STATUSTYPE_INFO) or message IOCTL is blocked
in the suspended dm-era target.

dm-era executes all metadata operations in a worker thread, which stops
processing requests when the target is suspended, and resumes again when
the target is resumed.

So, running 'dmsetup status' or 'dmsetup message' for a suspended dm-era
device blocks, until the device is resumed.

This seems to be a problem on its own.

If we then load a new table to the device, while the aforementioned
dmsetup command is blocked in dm-era, and resume the device, we
deadlock.

The problem is that the 'dmsetup status' and 'dmsetup message' commands
hold a reference to the live table, i.e., they hold an SRCU read lock on
md->io_barrier, while they are blocked.

When the device is resumed, the old table is replaced with the new one
by dm_swap_table(), which ends up calling synchronize_srcu() on
md->io_barrier.

Since the blocked dmsetup command is holding the SRCU read lock, and the
old table is never resumed, 'dmsetup resume' blocks too, and we have a
deadlock.

Steps to reproduce:

1. Create device with dm-era target

   # dmsetup create eradev --table "0 1048576 era /dev/datavg/erameta 
/dev/datavg/eradata 8192"

2. Suspend the device

   # dmsetup suspend eradev

3. Load new table to device, e.g., to resize the device

   # dmsetup load eradev --table "0 2097152 era /dev/datavg/erameta 
/dev/datavg/eradata 8192"

4. Device now has LIVE and INACTIVE tables

   # dmsetup info eradev
   Name:  eradev
   State: SUSPENDED
   Read Ahead:16384
   Tables present:LIVE & INACTIVE
   Open count:0
   Event number:  0
   Major, minor:  252, 2
   Number of targets: 1

5. Retrieve the status of the device. This blocks because the device is
   suspended. Equivalently, any 'dmsetup message' operation would block
   too. This command holds the SRCU read lock.

   # dmsetup status eradev

6. Resume the device. The resume operation tries to swap the old table
   with the new one and deadlocks, because it synchronizes SRCU for the
   old table, while the blocked 'dmsetup status' holds the SRCU read
   lock. And the old table is never resumed again at this point.

   # dmsetup resume eradev

7. The relevant dmesg logs are:


[ 7093.345486] dm-2: detected capacity change from 1048576 to 2097152
[ 7250.875665] INFO: task dmsetup:1986 blocked for more than 120 seconds.
[ 7250.875722]   Not tainted 5.16.0-rc2-release+ #16
[ 7250.875756] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this 
message.
[ 7250.875803] task:dmsetup state:D stack:0 pid: 1986 ppid:  1313 
flags:0x
[ 7250.875809] Call Trace:
[ 7250.875812]  
[ 7250.875816]  __schedule+0x330/0x8b0
[ 7250.875827]  schedule+0x4e/0xc0
[ 7250.875831]  schedule_timeout+0x20f/0x2e0
[ 7250.875836]  ? do_set_pte+0xb8/0x120
[ 7250.875843]  ? prep_new_page+0x91/0xa0
[ 7250.875847]  wait_for_completion+0x8c/0xf0
[ 7250.875854]  perform_rpc+0x95/0xb0 [dm_era]
[ 7250.875862]  in_worker1.constprop.20+0x48/0x70 [dm_era]
[ 7250.875867]  ? era_iterate_devices+0x30/0x30 [dm_era]
[ 7250.875872]  ? era_status+0x64/0x1e0 [dm_era]
[ 7250.875877]  era_status+0x64/0x1e0 [dm_era]
[ 7250.875882]  ? dm_get_live_or_inactive_table.isra.11+0x20/0x20 [dm_mod]
[ 7250.875900]  ? __mod_node_page_state+0x82/0xc0
[ 7250.875909]  retrieve_status+0xbc/0x1e0 [dm_mod]
[ 7250.875921]  ? dm_get_live_or_inactive_table.isra.11+0x20/0x20 [dm_mod]
[ 7250.875932]  table_status+0x61/0xa0 [dm_mod]
[ 7250.875942]  ctl_ioctl+0x1b5/0x4f0 [dm_mod]
[ 7250.875956]  dm_ctl_ioctl+0xa/0x10 [dm_mod]
[ 7250.875966]  __x64_sys_ioctl+0x8e/0xd0
[ 7250.875970]  do_syscall_64+0x3a/0xd0
[ 7250.875974]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 7250.875980] RIP: 0033:0x7f20b7cd4017
[ 7250.875984] RSP: 002b:7ffd443874b8 EFLAGS: 0246 ORIG_RAX: 
0010
[ 7250.875988] RAX: ffda RBX: 55d69d6bd0e0 RCX: 7f20b7cd4017
[ 7250.875991] RDX: 55d69d6bd0e0 RSI: c138fd0c RDI: 0003
[ 7250.875993] RBP: 001e R08: 7f20b81df648 R09: 7ffd44387320
[ 7250.875996] R10: 7f20b81deb53 R11: 0246 R12: 55d69d6bd110
[ 7250.875998] R13: 7f20b81deb53 R14: 55d69d6bd000 R15: 
[ 7250.876002]  
[ 7250.876004] INFO: task dmsetup:1987 blocked for more than 120 seconds.
[ 7250.876046]   Not tainted 5.16.0-rc2-release+ #16
[ 7250.876083] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this 
message.
[ 7250.876129] task:dmsetup state:D stack:0 pid: 1987 ppid:  1385 
flags:0x
[ 7250.876134] Call Trace:
[ 7250.876136]  
[ 7250.876138]  __schedule+0x330/0x8b0
[ 7250.876142]  schedule+0x4e/0xc0
[ 7250.876145]  schedule_timeout+0x20f/0x2e0
[ 7250.876149]  ? __queue_work+0x226/0x420
[ 7250.876156]  wait_for_completion+0x8c/0xf0

Re: [dm-devel] [LSF/MM/BFP ATTEND] [LSF/MM/BFP TOPIC] Storage: Copy Offload

2021-06-11 Thread Nikos Tsironis

On 5/11/21 3:15 AM, Chaitanya Kulkarni wrote:

Hi,

* Background :-
---

Copy offload is a feature that allows file-systems or storage devices
to be instructed to copy files/logical blocks without requiring
involvement of the local CPU.

With reference to the RISC-V summit keynote [1] single threaded
performance is limited due to Dennard scaling and multi-threaded
performance is slowing down due to Moore's law limitations. With the rise
of SNIA Computation Technical Storage Working Group (TWG) [2],
offloading computations to the device or over the fabrics is becoming
popular, as there are several solutions available [2]. One common
operation that is popular in the kernel but not merged yet is copy
offload over the fabrics or onto the device.

* Problem :-
---

The original work which is done by Martin is present here [3]. The
latest work which is posted by Mikulas [4] is not merged yet. These two
approaches are totally different from each other. Several storage
vendors discourage mixing copy offload requests with regular READ/WRITE
I/O. Also, the fact that the operation fails if a copy request ever
needs to be split as it traverses the stack has the unfortunate
side-effect of preventing copy offload from working in pretty much
every common deployment configuration out there.

* Current state of the work :-
---

[3] is hard to apply to arbitrary DM/MD stacking without splitting the
command in two, one for copying IN and one for copying OUT, which is
why [4] demonstrates that [3] is not a suitable candidate. Also, with
[4] there is an unresolved problem with the two-command approach about
how to handle changes to the DM layout between the IN and OUT
operations.

* Why Linux Kernel Storage System needs Copy Offload support now ?
---

With the rise of the SNIA Computational Storage TWG and solutions [2],
existing SCSI XCopy support in the protocol, recent advancement in the
Linux Kernel File System for Zoned devices (Zonefs [5]), Peer to Peer
DMA support in the Linux Kernel mainly for NVMe devices [7] and
eventually NVMe Devices and subsystem (NVMe PCIe/NVMeOF) will benefit
from Copy offload operation.

With this background we have a significant number of use-cases which are
strong candidates waiting for outstanding Linux Kernel Block Layer Copy
Offload support, so that the Linux Kernel Storage subsystem can address
previously mentioned problems [1] and allow efficient offloading of the
data related operations. (Such as move/copy etc.)

For reference following is the list of the use-cases/candidates waiting
for Copy Offload support :-

1. SCSI-attached storage arrays.
2. Stacking drivers supporting XCopy DM/MD.
3. Computational Storage solutions.
4. File systems :- Local, NFS and Zonefs.
5. Block devices :- Distributed, local, and Zoned devices.
6. Peer to Peer DMA support solutions.
7. Potentially NVMe subsystem, both NVMe PCIe and NVMeOF.

* What we will discuss in the proposed session ?
---

I'd like to propose a session to go over this topic to understand :-

1. What are the blockers for Copy Offload implementation ?
2. Discussion about having a file system interface.
3. Discussion about having right system call for user-space.
4. What is the right way to move this work forward ?
5. How can we help to contribute and move this work forward ?

* Required Participants :-
---

I'd like to invite file system, block layer, and device drivers
developers to:-

1. Share their opinion on the topic.
2. Share their experience and any other issues with [4].
3. Uncover additional details that are missing from this proposal.

Required attendees :-

Martin K. Petersen
Jens Axboe
Christoph Hellwig
Bart Van Assche
Zach Brown
Roland Dreier
Ric Wheeler
Trond Myklebust
Mike Snitzer
Keith Busch
Sagi Grimberg
Hannes Reinecke
Frederick Knight
Mikulas Patocka
Keith Busch

Regards,
Chaitanya

[1]https://content.riscv.org/wp-content/uploads/2018/12/A-New-Golden-Age-for-Computer-Architecture-History-Challenges-and-Opportunities-David-Patterson-.pdf
[2] https://www.snia.org/computational
https://www.napatech.com/support/resources/solution-descriptions/napatech-smartnic-solution-for-hardware-offload/
   https://www.eideticom.com/products.html
https://www.xilinx.com/applications/data-center/computational-storage.html
[3] git://git.kernel.org/pub/scm/linux/kernel/git/mkp/linux.git xcopy
[4] https://www.spinics.net/lists/linux-block/msg00599.html
[5] https://lwn.net/Articles/793585/
[6] https://nvmexpress.org/new-nvmetm-specification-defines-zoned-namespaces-zns-as-go-to-industry-technology/
[7] 

[dm-devel] [PATCH v2] dm era: only resize metadata in preresume

2021-02-11 Thread Nikos Tsironis
Metadata resize shouldn't happen in the ctr. The ctr loads a temporary
(inactive) table that will only become active upon resume. That is why
resize should always be done in terms of resume. Otherwise a load (ctr)
whose inactive table never becomes active will incorrectly resize the
metadata.

Also, perform the resize directly in preresume, instead of using the
worker to do it.

The worker might run other metadata operations, e.g., it could start
digestion, before resizing the metadata. These operations will end up
using the old size.

This could lead to errors, like:

  device-mapper: era: metadata_digest_transcribe_writeset: dm_array_set_value 
failed
  device-mapper: era: process_old_eras: digest step failed, stopping digestion

The reason of the above error is that the worker started the digestion
of the archived writeset using the old, larger size.

As a result, metadata_digest_transcribe_writeset tried to write beyond
the end of the era array.

Fixes: eec40579d84873 ("dm: add era target")
Cc: sta...@vger.kernel.org # v3.15+
Signed-off-by: Nikos Tsironis 
---
 drivers/md/dm-era-target.c | 21 ++---
 1 file changed, 10 insertions(+), 11 deletions(-)

diff --git a/drivers/md/dm-era-target.c b/drivers/md/dm-era-target.c
index d0e75fd31c1e..d9ac7372108c 100644
--- a/drivers/md/dm-era-target.c
+++ b/drivers/md/dm-era-target.c
@@ -1501,15 +1501,6 @@ static int era_ctr(struct dm_target *ti, unsigned argc, 
char **argv)
}
era->md = md;
 
-   era->nr_blocks = calc_nr_blocks(era);
-
-   r = metadata_resize(era->md, &era->nr_blocks);
-   if (r) {
-   ti->error = "couldn't resize metadata";
-   era_destroy(era);
-   return -ENOMEM;
-   }
-
era->wq = alloc_ordered_workqueue("dm-" DM_MSG_PREFIX, WQ_MEM_RECLAIM);
if (!era->wq) {
ti->error = "could not create workqueue for metadata object";
@@ -1584,9 +1575,17 @@ static int era_preresume(struct dm_target *ti)
dm_block_t new_size = calc_nr_blocks(era);
 
if (era->nr_blocks != new_size) {
-   r = in_worker1(era, metadata_resize, &new_size);
-   if (r)
+   r = metadata_resize(era->md, &new_size);
+   if (r) {
+   DMERR("%s: metadata_resize failed", __func__);
return r;
+   }
+
+   r = metadata_commit(era->md);
+   if (r) {
+   DMERR("%s: metadata_commit failed", __func__);
+   return r;
+   }
 
era->nr_blocks = new_size;
}
-- 
2.11.0

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel



Re: [dm-devel] [PATCH 4/4] dm era: Remove unreachable resize operation in pre-resume function

2021-02-11 Thread Nikos Tsironis

On 2/10/21 8:48 PM, Mike Snitzer wrote:

On Wed, Feb 10 2021 at  1:12P -0500,
Mike Snitzer  wrote:


On Fri, Jan 22 2021 at 10:25am -0500,
Nikos Tsironis  wrote:


The device metadata are resized in era_ctr(), so the metadata resize
operation in era_preresume() never runs.

Also, note that if the operation did ever run it would deadlock, since
the worker has not been started at this point.


It wouldn't have deadlocked, it'd have queued the work (see wake_worker)



Hi Mike,

The resize is performed as an RPC and in_worker1() ends up calling
perform_rpc(). perform_rpc() calls wake_worker() and then waits for the
RPC to complete: wait_for_completion(&rpc->complete). So, start_worker()
is not called until after the RPC has been completed.

But, you are right, it won't deadlock. I was confused by wake_worker:

  static void wake_worker(struct era *era)
  {
  if (!atomic_read(&era->suspended))
  queue_work(era->wq, >worker);
  }

When we suspend the device we set era->suspended to 1, so I mistakenly
thought that the resize operation during preresume would deadlock,
because wake_worker wouldn't queue the work.

But, the resize is only triggered when loading a new table, which
creates a new target by calling era_ctr. There era->suspended is
indirectly initialized to 0, because of kzalloc.

So, wake_worker will indeed queue the work.



Fixes: eec40579d84873 ("dm: add era target")
Cc: sta...@vger.kernel.org # v3.15+
Signed-off-by: Nikos Tsironis 
---
  drivers/md/dm-era-target.c | 9 -
  1 file changed, 9 deletions(-)

diff --git a/drivers/md/dm-era-target.c b/drivers/md/dm-era-target.c
index 104fb110cd4e..c40e132e50cd 100644
--- a/drivers/md/dm-era-target.c
+++ b/drivers/md/dm-era-target.c
@@ -1567,15 +1567,6 @@ static int era_preresume(struct dm_target *ti)
  {
int r;
struct era *era = ti->private;
-   dm_block_t new_size = calc_nr_blocks(era);
-
-   if (era->nr_blocks != new_size) {
-   r = in_worker1(era, metadata_resize, &new_size);
-   if (r)
-   return r;
-
-   era->nr_blocks = new_size;
-   }
  
  	start_worker(era);
  
--

2.11.0



Resize shouldn't actually happen in the ctr.  The ctr loads a temporary
(inactive) table that will only become active upon resume.  That is why
resize should always be done in terms of resume.



I kept the resize in the ctr to maintain the original behavior of
dm-era.

But, I had missed what you are describing here, which indeed makes sense
and it's the correct thing to do.

Thanks a lot for explaining it.


I'll look closer but ctr shouldn't do the actual resize, and the
start_worker() should be moved above the resize code you've removed
above.


Does this work for you?  If so I'll get it staged (like I've just
staged all your other dm-era fixes for 5.12).



The patch you attached won't work as is. We can't perform the resize in
the worker, because the worker might run other metadata operations,
e.g., it could start digestion, before resizing the metadata. These
operations will end up using the old size.

This can lead to errors:

1. Create a 1GiB dm-era device

   # dmsetup create eradev --table "0 2097152 era /dev/datavg/erameta 
/dev/datavg/eradata 8192"

2. Write to a block

   # dd if=/dev/zero of=/dev/mapper/eradev oflag=direct bs=4M count=1 seek=200

3. Suspend the device

   # dmsetup suspend eradev

4. Load a new table reducing the size of the device, so it doesn't
   include the block written at step (2)

   # dmsetup load eradev --table "0 1048576 era /dev/datavg/erameta 
/dev/datavg/eradata 8192"

5. Resume the device

   # dmsetup resume eradev

In dmesg we see the following:

   device-mapper: era: metadata_digest_transcribe_writeset: dm_array_set_value 
failed
   device-mapper: era: process_old_eras: digest step failed, stopping digestion

The reason is that the worker started the digestion of the archived
writeset using the old, larger size.

As a result, metadata_digest_transcribe_writeset tried to write beyond
the end of the era array.

Instead, we have to resize the metadata directly in era_preresume, and
not use the worker to do it.

I prepared a new patch doing that, which I will send with a new mail.

Nikos.


  drivers/md/dm-era-target.c | 13 ++---
  1 file changed, 2 insertions(+), 11 deletions(-)

diff --git a/drivers/md/dm-era-target.c b/drivers/md/dm-era-target.c
index d0e75fd31c1e..ec198e9cdafb 100644
--- a/drivers/md/dm-era-target.c
+++ b/drivers/md/dm-era-target.c
@@ -1501,15 +1501,6 @@ static int era_ctr(struct dm_target *ti, unsigned argc, 
char **argv)
}
era->md = md;
  
-	era->nr_blocks = calc_nr_blocks(era);

-
-   r = metadata_resize(era->md, &era->nr_blocks);
-   if (r) {
-   ti->error = "couldn't resize metadata";
-   era_destroy(era);
-   return -ENOMEM;
-   }
-
era->wq

Re: [dm-devel] [PATCH 0/2] dm era: Fix bugs that lead to lost writes after crash

2021-02-09 Thread Nikos Tsironis

Hello,

This is a kind reminder for the dm-era fixes I have sent with this and
the rest of the relevant mails.

I'm bumping this thread to solicit your feedback. If there is anything
else I may need to do, please let me know.

Thanks,
Nikos

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel



[dm-devel] [PATCH 3/4] dm era: Use correct value size in equality function of writeset tree

2021-01-22 Thread Nikos Tsironis
Fix the writeset tree equality test function to use the right value size
when comparing two btree values.

Fixes: eec40579d84873 ("dm: add era target")
Cc: sta...@vger.kernel.org # v3.15+
Signed-off-by: Nikos Tsironis 
---
 drivers/md/dm-era-target.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/md/dm-era-target.c b/drivers/md/dm-era-target.c
index ffbbd8740253..104fb110cd4e 100644
--- a/drivers/md/dm-era-target.c
+++ b/drivers/md/dm-era-target.c
@@ -389,7 +389,7 @@ static void ws_dec(void *context, const void *value)
 
 static int ws_eq(void *context, const void *value1, const void *value2)
 {
-   return !memcmp(value1, value2, sizeof(struct writeset_metadata));
+   return !memcmp(value1, value2, sizeof(struct writeset_disk));
 }
 
 /**/
-- 
2.11.0

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel



[dm-devel] [PATCH 2/4] dm era: Fix bitset memory leaks

2021-01-22 Thread Nikos Tsironis
Deallocate the memory allocated for the in-core bitsets when destroying
the target and in error paths.

Fixes: eec40579d84873 ("dm: add era target")
Cc: sta...@vger.kernel.org # v3.15+
Signed-off-by: Nikos Tsironis 
---
 drivers/md/dm-era-target.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/drivers/md/dm-era-target.c b/drivers/md/dm-era-target.c
index 52e3f63335d3..ffbbd8740253 100644
--- a/drivers/md/dm-era-target.c
+++ b/drivers/md/dm-era-target.c
@@ -47,6 +47,7 @@ struct writeset {
 static void writeset_free(struct writeset *ws)
 {
vfree(ws->bits);
+   ws->bits = NULL;
 }
 
 static int setup_on_disk_bitset(struct dm_disk_bitset *info,
@@ -810,6 +811,8 @@ static struct era_metadata *metadata_open(struct 
block_device *bdev,
 
 static void metadata_close(struct era_metadata *md)
 {
+   writeset_free(&md->writesets[0]);
+   writeset_free(&md->writesets[1]);
destroy_persistent_data_objects(md);
kfree(md);
 }
@@ -847,6 +850,7 @@ static int metadata_resize(struct era_metadata *md, void 
*arg)
r = writeset_alloc(&md->writesets[1], *new_size);
if (r) {
DMERR("%s: writeset_alloc failed for writeset 1", __func__);
+   writeset_free(&md->writesets[0]);
return r;
}
 
@@ -857,6 +861,8 @@ static int metadata_resize(struct era_metadata *md, void 
*arg)
, &md->era_array_root);
if (r) {
DMERR("%s: dm_array_resize failed", __func__);
+   writeset_free(&md->writesets[0]);
+   writeset_free(&md->writesets[1]);
return r;
}
 
-- 
2.11.0

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel



[dm-devel] [PATCH 0/4] dm era: Various minor fixes

2021-01-22 Thread Nikos Tsironis
While working on fixing the bugs that cause lost writes, for which I
have sent separate emails, I bumped into several other minor issues that
I fix in this patch set.

In particular, this series of commits introduces the following fixes:

1. Add explicit check that the data block size hasn't changed
2. Fix bitset memory leaks. The in-core bitmaps were never freed.
3. Fix the writeset tree equality test function to use the right value
   size.
4. Remove unreachable resize operation in pre-resume function.

More information about the fixes can be found in their commit messages.

Nikos Tsironis (4):
  dm era: Verify the data block size hasn't changed
  dm era: Fix bitset memory leaks
  dm era: Use correct value size in equality function of writeset tree
  dm era: Remove unreachable resize operation in pre-resume function

 drivers/md/dm-era-target.c | 27 ---
 1 file changed, 16 insertions(+), 11 deletions(-)

-- 
2.11.0

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel



[dm-devel] [PATCH 1/4] dm era: Verify the data block size hasn't changed

2021-01-22 Thread Nikos Tsironis
dm-era doesn't support changing the data block size of existing devices,
so check explicitly that the requested block size for a new target
matches the one stored in the metadata.

Fixes: eec40579d84873 ("dm: add era target")
Cc: sta...@vger.kernel.org # v3.15+
Signed-off-by: Nikos Tsironis 
---
 drivers/md/dm-era-target.c | 10 +-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/drivers/md/dm-era-target.c b/drivers/md/dm-era-target.c
index b24e3839bb3a..52e3f63335d3 100644
--- a/drivers/md/dm-era-target.c
+++ b/drivers/md/dm-era-target.c
@@ -564,6 +564,15 @@ static int open_metadata(struct era_metadata *md)
}
 
disk = dm_block_data(sblock);
+
+   /* Verify the data block size hasn't changed */
+   if (le32_to_cpu(disk->data_block_size) != md->block_size) {
+   DMERR("changing the data block size (from %u to %llu) is not 
supported",
+ le32_to_cpu(disk->data_block_size), md->block_size);
+   r = -EINVAL;
+   goto bad;
+   }
+
r = dm_tm_open_with_sm(md->bm, SUPERBLOCK_LOCATION,
   disk->metadata_space_map_root,
   sizeof(disk->metadata_space_map_root),
@@ -575,7 +584,6 @@ static int open_metadata(struct era_metadata *md)
 
setup_infos(md);
 
-   md->block_size = le32_to_cpu(disk->data_block_size);
md->nr_blocks = le32_to_cpu(disk->nr_blocks);
md->current_era = le32_to_cpu(disk->current_era);
 
-- 
2.11.0

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel



[dm-devel] [PATCH 4/4] dm era: Remove unreachable resize operation in pre-resume function

2021-01-22 Thread Nikos Tsironis
The device metadata are resized in era_ctr(), so the metadata resize
operation in era_preresume() never runs.

Also, note that if the operation did ever run it would deadlock, since
the worker has not been started at this point.

Fixes: eec40579d84873 ("dm: add era target")
Cc: sta...@vger.kernel.org # v3.15+
Signed-off-by: Nikos Tsironis 
---
 drivers/md/dm-era-target.c | 9 -
 1 file changed, 9 deletions(-)

diff --git a/drivers/md/dm-era-target.c b/drivers/md/dm-era-target.c
index 104fb110cd4e..c40e132e50cd 100644
--- a/drivers/md/dm-era-target.c
+++ b/drivers/md/dm-era-target.c
@@ -1567,15 +1567,6 @@ static int era_preresume(struct dm_target *ti)
 {
int r;
struct era *era = ti->private;
-   dm_block_t new_size = calc_nr_blocks(era);
-
-   if (era->nr_blocks != new_size) {
-   r = in_worker1(era, metadata_resize, &new_size);
-   if (r)
-   return r;
-
-   era->nr_blocks = new_size;
-   }
 
start_worker(era);
 
-- 
2.11.0

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel



[dm-devel] [PATCH 1/1] dm era: Reinitialize bitset cache before digesting a new writeset

2021-01-22 Thread Nikos Tsironis
In case of devices with at most 64 blocks, the digestion of consecutive
eras uses the writeset of the first era as the writeset of all eras to
digest, leading to lost writes. That is, we lose the information about
what blocks were written during the affected eras.

The digestion code uses a dm_disk_bitset object to access the archived
writesets. This structure includes a one word (64-bit) cache to reduce
the number of array lookups.

This structure is initialized only once, in metadata_digest_start(),
when we kick off digestion.

But, when we insert a new writeset into the writeset tree, before the
digestion of the previous writeset is done, or equivalently when there
are multiple writesets in the writeset tree to digest, then all these
writesets are digested using the same cache and the cache is not
re-initialized when moving from one writeset to the next.

For devices with more than 64 blocks, i.e., the size of the cache, the
cache is indirectly invalidated when we move to a next set of blocks, so
we avoid the bug.

But for devices with at most 64 blocks we end up using the same cached
data for digesting all archived writesets, i.e., the cache is loaded
when digesting the first writeset and it never gets reloaded, until the
digestion is done.

As a result, the writeset of the first era to digest is used as the
writeset of all the following archived eras, leading to lost writes.

Fix this by reinitializing the dm_disk_bitset structure, and thus
invalidating the cache, every time the digestion code starts digesting a
new writeset.
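
As a userspace illustration of why only small devices are affected (a
sketch, not the kernel code): the cache holds a single 64-bit word and
block b lives in word b / 64, so with at most 64 blocks every lookup
hits word 0 and nothing ever forces a reload:

  #include <stdio.h>

  int main(void)
  {
          unsigned long nr_blocks = 64;   /* device with <= 64 blocks */
          unsigned long b;

          for (b = 0; b < nr_blocks; b += 16)
                  printf("block %2lu -> bitset word %lu\n", b, b / 64);

          /* Every block maps to word 0, so the word cached for the
           * first writeset is silently reused for all later ones. */
          return 0;
  }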

Fixes: eec40579d84873 ("dm: add era target")
Cc: sta...@vger.kernel.org # v3.15+
Signed-off-by: Nikos Tsironis 
---
 drivers/md/dm-era-target.c | 12 ++--
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/drivers/md/dm-era-target.c b/drivers/md/dm-era-target.c
index b24e3839bb3a..951e6df409d4 100644
--- a/drivers/md/dm-era-target.c
+++ b/drivers/md/dm-era-target.c
@@ -746,6 +746,12 @@ static int metadata_digest_lookup_writeset(struct 
era_metadata *md,
ws_unpack(&disk, &d->writeset);
d->value = cpu_to_le32(key);
 
+   /*
+* We initialise another bitset info to avoid any caching side effects
+* with the previous one.
+*/
+   dm_disk_bitset_init(md->tm, &d->info);
+
d->nr_bits = min(d->writeset.nr_bits, md->nr_blocks);
d->current_bit = 0;
d->step = metadata_digest_transcribe_writeset;
@@ -759,12 +765,6 @@ static int metadata_digest_start(struct era_metadata *md, 
struct digest *d)
return 0;
 
memset(d, 0, sizeof(*d));
-
-   /*
-* We initialise another bitset info to avoid any caching side
-* effects with the previous one.
-*/
-   dm_disk_bitset_init(md->tm, &d->info);
d->step = metadata_digest_lookup_writeset;
 
return 0;
-- 
2.11.0

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel



[dm-devel] [PATCH 0/1] dm era: Fix digestion bug that can lead to lost writes

2021-01-22 Thread Nikos Tsironis
In case of devices with at most 64 blocks, the digestion of consecutive
eras uses the writeset of the first era as the writeset of all eras to
digest, leading to lost writes. That is, we lose the information about
what blocks were written during the affected eras.

The root cause of the bug is a failure to reinitialize the on-disk
bitset cache when the digestion code starts digesting a new writeset.

Steps to reproduce
--

1. Create two LVs, one for data and one for metadata

   # lvcreate -n eradata -L1G datavg
   # lvcreate -n erameta -L64M datavg

2. Fill the whole data device with zeroes

   # dd if=/dev/zero of=/dev/datavg/eradata oflag=direct bs=1M

3. Create a dm-delay device, which inserts a 500 msec delay to writes:

   # dmsetup create delaymeta --table "0 `blockdev --getsz \
 /dev/datavg/erameta` delay /dev/datavg/erameta 0 0 /dev/datavg/erameta 0 
500"

4. Create a 256MiB (64 4MiB blocks) dm-era device, using the data LV for
   data and the dm-delay device for its metadata. We set the tracking
   granularity to 4MiB.

   # dmsetup create eradev --table "0 524288 era /dev/mapper/delaymeta \
 /dev/datavg/eradata 8192"

5. Run the following script:

   #!/bin/bash

   # Write to block #0 during era 1
   dd if=/dev/urandom of=/dev/mapper/eradev oflag=direct bs=4K count=1

   # Increase era to 2
   dmsetup message eradev 0 checkpoint

   # Write to block #1 during era 2
   dd if=/dev/urandom of=/dev/mapper/eradev oflag=direct bs=4K count=1 
seek=1024 &

   # Increase era to 3
   dmsetup message eradev 0 checkpoint

   # Sync the device
   sync /dev/mapper/eradev

6. Remove the device, so we can examine its metadata

   # dmsetup remove eradev

7. Examine the device's metadata with `era_dump --logical /dev/mapper/delaymeta`

   We see that:
a. Block #0 is marked as last written during era 2, whereas we wrote
   to it only during era 1
b. Block #1 is not marked as written at all, whereas we wrote to it
   during era 2

8. Examining the data device, e.g., with `hexdump /dev/datavg/eradata`,
   we can see that both blocks #0 and #1 are written, as expected.

Nikos Tsironis (1):
  dm era: Reinitialize bitset cache before digesting a new writeset

 drivers/md/dm-era-target.c | 12 ++--
 1 file changed, 6 insertions(+), 6 deletions(-)

-- 
2.11.0

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel



[dm-devel] [PATCH 2/2] dm era: Update in-core bitset after committing the metadata

2021-01-22 Thread Nikos Tsironis
In case of a system crash, dm-era might fail to mark blocks as written
in its metadata, although the corresponding writes to these blocks were
passed down to the origin device and completed successfully.

Consider the following sequence of events:

1. We write to a block that has not been yet written in the current era
2. era_map() checks the in-core bitmap for the current era and sees
   that the block is not marked as written.
3. The write is deferred for submission after the metadata have been
   updated and committed.
4. The worker thread processes the deferred write
   (process_deferred_bios()) and marks the block as written in the
   in-core bitmap, **before** committing the metadata.
5. The worker thread starts committing the metadata.
6. We do more writes that map to the same block as the write of step (1)
7. era_map() checks the in-core bitmap and sees that the block is marked
   as written, **although the metadata have not been committed yet**.
8. These writes are passed down to the origin device immediately and the
   device reports them as completed.
9. The system crashes, e.g., power failure, before the commit from step
   (5) finishes.

When the system recovers and we query the dm-era target for the list of
written blocks it doesn't report the aforementioned block as written,
although the writes of step (6) completed successfully.

The issue is that era_map() decides whether to defer or not a write
based on non committed information. The root cause of the bug is that we
update the in-core bitmap, **before** committing the metadata.

Fix this by updating the in-core bitmap **after** successfully
committing the metadata.

Fixes: eec40579d84873 ("dm: add era target")
Cc: sta...@vger.kernel.org # v3.15+
Signed-off-by: Nikos Tsironis 
---
 drivers/md/dm-era-target.c | 25 +++--
 1 file changed, 19 insertions(+), 6 deletions(-)

diff --git a/drivers/md/dm-era-target.c b/drivers/md/dm-era-target.c
index 854b1be8b452..62f679faf9e7 100644
--- a/drivers/md/dm-era-target.c
+++ b/drivers/md/dm-era-target.c
@@ -134,7 +134,7 @@ static int writeset_test_and_set(struct dm_disk_bitset 
*info,
 {
int r;
 
-   if (!test_and_set_bit(block, ws->bits)) {
+   if (!test_bit(block, ws->bits)) {
r = dm_bitset_set_bit(info, ws->md.root, block, &ws->md.root);
if (r) {
/* FIXME: fail mode */
@@ -1226,8 +1226,10 @@ static void process_deferred_bios(struct era *era)
int r;
struct bio_list deferred_bios, marked_bios;
struct bio *bio;
+   struct blk_plug plug;
bool commit_needed = false;
bool failed = false;
+   struct writeset *ws = era->md->current_writeset;
 
bio_list_init(&deferred_bios);
bio_list_init(&marked_bios);
@@ -1237,9 +1239,11 @@ static void process_deferred_bios(struct era *era)
bio_list_init(&era->deferred_bios);
spin_unlock(&era->deferred_lock);
 
+   if (bio_list_empty(&deferred_bios))
+   return;
+
while ((bio = bio_list_pop(&deferred_bios))) {
-   r = writeset_test_and_set(&era->md->bitset_info,
- era->md->current_writeset,
+   r = writeset_test_and_set(&era->md->bitset_info, ws,
  get_block(era, bio));
if (r < 0) {
/*
@@ -1247,7 +1251,6 @@ static void process_deferred_bios(struct era *era)
 * FIXME: finish.
 */
failed = true;
-
} else if (r == 0)
commit_needed = true;
 
@@ -1263,9 +1266,19 @@ static void process_deferred_bios(struct era *era)
if (failed)
while ((bio = bio_list_pop(&marked_bios)))
bio_io_error(bio);
-   else
-   while ((bio = bio_list_pop(&marked_bios)))
+   else {
+   blk_start_plug(&plug);
+   while ((bio = bio_list_pop(&marked_bios))) {
+   /*
+* Only update the in-core writeset if the on-disk one
+* was updated too.
+*/
+   if (commit_needed)
+   set_bit(get_block(era, bio), ws->bits);
submit_bio_noacct(bio);
+   }
+   blk_finish_plug(&plug);
+   }
 }
 
 static void process_rpc_calls(struct era *era)
-- 
2.11.0

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel



[dm-devel] [PATCH 1/2] dm era: Recover committed writeset after crash

2021-01-22 Thread Nikos Tsironis
Following a system crash, dm-era fails to recover the committed writeset
for the current era, leading to lost writes. That is, we lose the
information about what blocks were written during the affected era.

dm-era assumes that the writeset of the current era is archived when the
device is suspended. So, when resuming the device, it just moves on to
the next era, ignoring the committed writeset.

This assumption holds when the device is properly shut down. But, when
the system crashes, the code that suspends the target never runs, so the
writeset for the current era is not archived.

There are three issues that cause the committed writeset to get lost:

1. dm-era doesn't load the committed writeset when opening the metadata
2. The code that resizes the metadata wipes the information about the
   committed writeset (assuming it was loaded at step 1)
3. era_preresume() starts a new era, without taking into account that
   the current era might not have been archived, due to a system crash.

To fix this:

1. Load the committed writeset when opening the metadata
2. Fix the code that resizes the metadata to make sure it doesn't wipe
   the loaded writeset
3. Fix era_preresume() to check for a loaded writeset and archive it,
   before starting a new era.

Fixes: eec40579d84873 ("dm: add era target")
Cc: sta...@vger.kernel.org # v3.15+
Signed-off-by: Nikos Tsironis 
---
 drivers/md/dm-era-target.c | 17 +
 1 file changed, 9 insertions(+), 8 deletions(-)

diff --git a/drivers/md/dm-era-target.c b/drivers/md/dm-era-target.c
index b24e3839bb3a..854b1be8b452 100644
--- a/drivers/md/dm-era-target.c
+++ b/drivers/md/dm-era-target.c
@@ -71,8 +71,6 @@ static size_t bitset_size(unsigned nr_bits)
  */
 static int writeset_alloc(struct writeset *ws, dm_block_t nr_blocks)
 {
-   ws->md.nr_bits = nr_blocks;
-   ws->md.root = INVALID_WRITESET_ROOT;
ws->bits = vzalloc(bitset_size(nr_blocks));
if (!ws->bits) {
DMERR("%s: couldn't allocate in memory bitset", __func__);
@@ -85,12 +83,14 @@ static int writeset_alloc(struct writeset *ws, dm_block_t 
nr_blocks)
 /*
  * Wipes the in-core bitset, and creates a new on disk bitset.
  */
-static int writeset_init(struct dm_disk_bitset *info, struct writeset *ws)
+static int writeset_init(struct dm_disk_bitset *info, struct writeset *ws,
+dm_block_t nr_blocks)
 {
int r;
 
-   memset(ws->bits, 0, bitset_size(ws->md.nr_bits));
+   memset(ws->bits, 0, bitset_size(nr_blocks));
 
+   ws->md.nr_bits = nr_blocks;
r = setup_on_disk_bitset(info, ws->md.nr_bits, &ws->md.root);
if (r) {
DMERR("%s: setup_on_disk_bitset failed", __func__);
@@ -579,6 +579,7 @@ static int open_metadata(struct era_metadata *md)
md->nr_blocks = le32_to_cpu(disk->nr_blocks);
md->current_era = le32_to_cpu(disk->current_era);
 
+   ws_unpack(&disk->current_writeset, &md->current_writeset->md);
md->writeset_tree_root = le64_to_cpu(disk->writeset_tree_root);
md->era_array_root = le64_to_cpu(disk->era_array_root);
md->metadata_snap = le64_to_cpu(disk->metadata_snap);
@@ -870,7 +871,6 @@ static int metadata_era_archive(struct era_metadata *md)
}
 
ws_pack(&md->current_writeset->md, &value);
-   md->current_writeset->md.root = INVALID_WRITESET_ROOT;
 
keys[0] = md->current_era;
__dm_bless_for_disk();
@@ -882,6 +882,7 @@ static int metadata_era_archive(struct era_metadata *md)
return r;
}
 
+   md->current_writeset->md.root = INVALID_WRITESET_ROOT;
md->archived_writesets = true;
 
return 0;
@@ -898,7 +899,7 @@ static int metadata_new_era(struct era_metadata *md)
int r;
struct writeset *new_writeset = next_writeset(md);
 
-   r = writeset_init(&md->bitset_info, new_writeset);
+   r = writeset_init(&md->bitset_info, new_writeset, md->nr_blocks);
if (r) {
DMERR("%s: writeset_init failed", __func__);
return r;
@@ -951,7 +952,7 @@ static int metadata_commit(struct era_metadata *md)
int r;
struct dm_block *sblock;
 
-   if (md->current_writeset->md.root != SUPERBLOCK_LOCATION) {
+   if (md->current_writeset->md.root != INVALID_WRITESET_ROOT) {
r = dm_bitset_flush(&md->bitset_info,
md->current_writeset->md.root,
&md->current_writeset->md.root);
if (r) {
@@ -1565,7 +1566,7 @@ static int era_preresume(struct dm_target *ti)
 
start_worker(era);
 
-   r = in_worker0(era, metadata_new_era);
+   r = in_worker0(era, metadata_era_rollover);
if (r) {
DMERR("%s: metadata_era_rollover failed", __func__);
return r;
-- 
2.11.0

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel



[dm-devel] [PATCH 0/2] dm era: Fix bugs that lead to lost writes after crash

2021-01-22 Thread Nikos Tsironis
   # dmsetup resume delaymeta

6. Run the following script:

   #!/bin/bash

   # a. Write to the first 4KiB block of the device, which maps to era block #0
   dd if=/dev/urandom of=/dev/mapper/eradev oflag=direct bs=4K count=1 &

   # b. Write to the second 4KiB block of the device, which also maps to block 
#0
   dd if=/dev/urandom of=/dev/mapper/eradev oflag=direct bs=4K seek=1 count=1

   # c. Sync the device
   sync /dev/mapper/eradev

   # d. Forcefully reboot
   echo b > /proc/sysrq-trigger

   The command of step (6a) blocks as expected, waiting for the metadata
   commit. Meanwhile dm-era has marked block #0 as written in the in-core
   bitmap.

   We would expect the command of step (6b) to also block waiting for
   the metadata commit triggered by (6a), as they touch the same block.

   But, it doesn't.

7. After the system comes back up examine the data device, e.g., using
   `hexdump /dev/datavg/eradata`. We can see that indeed the write from
   (6a) never completed, but the write from (6b) hit the disk.

8. Recreate the device stack and ask for the list of blocks written
   since era 1, i.e., for all blocks ever written to the device.

   # dmsetup message eradev 0 take_metadata_snap
   # era_invalidate --metadata-snapshot --written-since 1 /dev/mapper/delaymeta

The list of written blocks reported by dm-era is empty, even though
block #0 was written and flushed to the device.

Nikos Tsironis (2):
  dm era: Recover committed writeset after crash
  dm era: Update in-core bitset after committing the metadata

 drivers/md/dm-era-target.c | 42 --
 1 file changed, 28 insertions(+), 14 deletions(-)

-- 
2.11.0

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel



[dm-devel] [PATCH 1/4] dm clone: Fix handling of partial region discards

2020-03-27 Thread Nikos Tsironis
There is a bug in the way dm-clone handles discards, which can lead to
discarding the wrong blocks or trying to discard blocks beyond the end
of the device.

This could lead to data corruption, if the destination device indeed
discards the underlying blocks, i.e., if the discard operation results
in the original contents of a block to be lost.

The root of the problem is the code that calculates the range of regions
covered by a discard request and decides which regions to discard.

Since dm-clone handles the device in units of regions, we don't discard
parts of a region, only whole regions.

The range is calculated as:

rs = dm_sector_div_up(bio->bi_iter.bi_sector, clone->region_size);
re = bio_end_sector(bio) >> clone->region_shift;

, where 'rs' is the first region to discard and (re - rs) is the number
of regions to discard.

The bug manifests when we try to discard part of a single region, i.e.,
when we try to discard a block with size < region_size, and the discard
request both starts at an offset with respect to the beginning of that
region and ends before the end of the region.

The root cause is the following comparison:

  if (rs == re)
// skip discard and complete original bio immediately

, which doesn't take into account that 'rs' might be greater than 're'.

Thus, we then issue a discard request for the wrong blocks, instead of
skipping the discard all together.

Fix the check to also take into account the above case, so we don't end
up discarding the wrong blocks.
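
To make the arithmetic concrete, here is a userspace sketch, assuming
a region size of 8 sectors (region_shift = 3): a 4-sector discard
starting at sector 2 of region 0 yields rs = 1 and re = 0, so the
'rs == re' test is false and 're - rs' wraps around:

  #include <stdio.h>

  int main(void)
  {
          unsigned long region_size = 8, region_shift = 3;
          unsigned long bi_sector = 2, nr_sectors = 4; /* partial discard */
          unsigned long end_sector = bi_sector + nr_sectors;

          /* rs = dm_sector_div_up(bio->bi_iter.bi_sector, region_size) */
          unsigned long rs = (bi_sector + region_size - 1) / region_size;
          unsigned long re = end_sector >> region_shift;

          printf("rs=%lu re=%lu, old nr_regions=%lu (wrapped)\n",
                 rs, re, re - rs);
          printf("fixed nr_regions=%lu\n", rs >= re ? 0UL : re - rs);
          return 0;
  }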

Also, add some range checks to dm_clone_set_region_hydrated() and
dm_clone_cond_set_range(), which update dm-clone's region bitmap.

Note that the aforementioned bug doesn't cause invalid memory accesses,
because dm_clone_is_range_hydrated() returns True for this case, so the
checks are just precautionary.

Fixes: 7431b7835f55 ("dm: add clone target")
Cc: sta...@vger.kernel.org # v5.4+
Signed-off-by: Nikos Tsironis 
---
 drivers/md/dm-clone-metadata.c | 13 +
 drivers/md/dm-clone-target.c   | 43 --
 2 files changed, 42 insertions(+), 14 deletions(-)

diff --git a/drivers/md/dm-clone-metadata.c b/drivers/md/dm-clone-metadata.c
index c05b12110456..199e7af00858 100644
--- a/drivers/md/dm-clone-metadata.c
+++ b/drivers/md/dm-clone-metadata.c
@@ -850,6 +850,12 @@ int dm_clone_set_region_hydrated(struct dm_clone_metadata 
*cmd, unsigned long re
struct dirty_map *dmap;
unsigned long word, flags;
 
+   if (unlikely(region_nr >= cmd->nr_regions)) {
+   DMERR("Region %lu out of range (total number of regions %lu)",
+ region_nr, cmd->nr_regions);
+   return -ERANGE;
+   }
+
word = region_nr / BITS_PER_LONG;
 
spin_lock_irqsave(&cmd->bitmap_lock, flags);
@@ -879,6 +885,13 @@ int dm_clone_cond_set_range(struct dm_clone_metadata *cmd, 
unsigned long start,
struct dirty_map *dmap;
unsigned long word, region_nr;
 
+   if (unlikely(start >= cmd->nr_regions || (start + nr_regions) < start ||
+(start + nr_regions) > cmd->nr_regions)) {
+   DMERR("Invalid region range: start %lu, nr_regions %lu (total 
number of regions %lu)",
+ start, nr_regions, cmd->nr_regions);
+   return -ERANGE;
+   }
+
spin_lock_irq(&cmd->bitmap_lock);
 
if (cmd->read_only) {
diff --git a/drivers/md/dm-clone-target.c b/drivers/md/dm-clone-target.c
index d1e1b5b56b1b..022dddcad647 100644
--- a/drivers/md/dm-clone-target.c
+++ b/drivers/md/dm-clone-target.c
@@ -293,10 +293,17 @@ static inline unsigned long bio_to_region(struct clone 
*clone, struct bio *bio)
 
 /* Get the region range covered by the bio */
 static void bio_region_range(struct clone *clone, struct bio *bio,
-unsigned long *rs, unsigned long *re)
+unsigned long *rs, unsigned long *nr_regions)
 {
+   unsigned long end;
+
*rs = dm_sector_div_up(bio->bi_iter.bi_sector, clone->region_size);
-   *re = bio_end_sector(bio) >> clone->region_shift;
+   end = bio_end_sector(bio) >> clone->region_shift;
+
+   if (*rs >= end)
+   *nr_regions = 0;
+   else
+   *nr_regions = end - *rs;
 }
 
 /* Check whether a bio overwrites a region */
@@ -454,7 +461,7 @@ static void trim_bio(struct bio *bio, sector_t sector, 
unsigned int len)
 
 static void complete_discard_bio(struct clone *clone, struct bio *bio, bool 
success)
 {
-   unsigned long rs, re;
+   unsigned long rs, nr_regions;
 
/*
 * If the destination device supports discards, remap and trim the
@@ -463,9 +470,9 @@ static void complete_discard_bio(struct clone *clone, 
struct bio *bio, bool succ
 */
	if (test_bit(DM_CLONE_DISCARD_PASSDOWN, &clone->flags) && success) {
 

[dm-devel] [PATCH 0/4] dm clone: Fix discard handling and overflow bugs which could cause data corruption

2020-03-27 Thread Nikos Tsironis
There is a bug in the way dm-clone handles partial region discards,
which can lead to discarding the wrong blocks or trying to discard
blocks beyond the end of the device.

This could lead to data corruption, if the destination device indeed
discards the underlying blocks, i.e., if the discard operation results
in the original contents of a block to be lost.

The bug manifests when we try to discard part of a single region, i.e.,
when we try to discard a block with size < region_size, and the discard
request both starts at an offset with respect to the beginning of that
region and ends before the end of the region.

The root of the bug is the code that calculates the range of regions
covered by a discard request and decides which regions to discard.

For more information, please see the relevant commit.

As part of fixing this bug, I also audited dm-clone for other
arithmetic/overflow related bugs and found the following:

1. Missing overflow check for the total number of regions
2. Missing casts when converting from regions to sectors
3. Wrong return type of dm_clone_nr_of_hydrated_regions(), which caused
   an unwanted sign extension to occur.

Again, more information can be found in the relevant commits.

Nikos Tsironis (4):
  dm clone: Fix handling of partial region discards
  dm clone: Add overflow check for number of regions
  dm clone: Add missing casts to prevent overflows and data corruption
  dm clone metadata: Fix return type of
dm_clone_nr_of_hydrated_regions()

 drivers/md/dm-clone-metadata.c | 15 +-
 drivers/md/dm-clone-metadata.h |  2 +-
 drivers/md/dm-clone-target.c   | 66 ++
 3 files changed, 62 insertions(+), 21 deletions(-)

-- 
2.11.0


--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel



[dm-devel] [PATCH 3/4] dm clone: Add missing casts to prevent overflows and data corruption

2020-03-27 Thread Nikos Tsironis
Add missing casts when converting from regions to sectors.

In case BITS_PER_LONG == 32, the lack of the appropriate casts can lead
to overflows and miscalculation of the device sector.

As a result, we could end up discarding and/or copying the wrong parts
of the device, thus corrupting the device's data.
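
A userspace sketch of the truncation; uint32_t models the 32-bit
unsigned long holding the region number, with region_shift = 3 (4K
regions):

  #include <stdio.h>
  #include <stdint.h>

  int main(void)
  {
          uint32_t region_nr = 1UL << 29;   /* a region on a large device */
          unsigned int region_shift = 3;    /* 4K regions = 8 sectors */

          uint32_t bad = region_nr << region_shift;            /* shifts out */
          uint64_t good = (uint64_t)region_nr << region_shift; /* the fix */

          printf("without cast: sector %u\n", bad);  /* prints 0 */
          printf("with cast:    sector %llu\n", (unsigned long long)good);
          return 0;
  }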

Fixes: 7431b7835f55 ("dm: add clone target")
Cc: sta...@vger.kernel.org # v5.4+
Signed-off-by: Nikos Tsironis 
---
 drivers/md/dm-clone-target.c | 9 ++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/drivers/md/dm-clone-target.c b/drivers/md/dm-clone-target.c
index 6ee85fb3388a..ca5020c58f7c 100644
--- a/drivers/md/dm-clone-target.c
+++ b/drivers/md/dm-clone-target.c
@@ -282,7 +282,7 @@ static bool bio_triggers_commit(struct clone *clone, struct 
bio *bio)
 /* Get the address of the region in sectors */
 static inline sector_t region_to_sector(struct clone *clone, unsigned long 
region_nr)
 {
-   return (region_nr << clone->region_shift);
+   return ((sector_t)region_nr << clone->region_shift);
 }
 
 /* Get the region number of the bio */
@@ -471,7 +471,7 @@ static void complete_discard_bio(struct clone *clone, 
struct bio *bio, bool succ
	if (test_bit(DM_CLONE_DISCARD_PASSDOWN, &clone->flags) && success) {
remap_to_dest(clone, bio);
bio_region_range(clone, bio, &rs, &nr_regions);
-   trim_bio(bio, rs << clone->region_shift,
+   trim_bio(bio, region_to_sector(clone, rs),
 nr_regions << clone->region_shift);
generic_make_request(bio);
} else
@@ -804,11 +804,14 @@ static void hydration_copy(struct 
dm_clone_region_hydration *hd, unsigned int nr
struct dm_io_region from, to;
struct clone *clone = hd->clone;
 
+   if (WARN_ON(!nr_regions))
+   return;
+
region_size = clone->region_size;
region_start = hd->region_nr;
region_end = region_start + nr_regions - 1;
 
-   total_size = (nr_regions - 1) << clone->region_shift;
+   total_size = region_to_sector(clone, nr_regions - 1);
 
if (region_end == clone->nr_regions - 1) {
/*
-- 
2.11.0


--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel



[dm-devel] [PATCH 2/4] dm clone: Add overflow check for number of regions

2020-03-27 Thread Nikos Tsironis
Add overflow check for clone->nr_regions variable, which holds the
number of regions of the target.

The overflow can occur with sufficiently large devices, if BITS_PER_LONG
== 32. E.g., if the region size is 8 sectors (4K), the overflow would
occur for device sizes > 34359738360 sectors (~16TB).

This could result in multiple device sectors wrongly mapping to the same
region number, due to the truncation from 64 bits to 32 bits, which
would lead to data corruption.
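
The arithmetic, as a userspace sketch; uint32_t models the 32-bit
unsigned long that holds clone->nr_regions:

  #include <stdio.h>
  #include <stdint.h>

  int main(void)
  {
          uint64_t dev_sectors = 34359738368ULL;  /* just past the limit */
          uint64_t region_size = 8;               /* 4K regions */
          uint64_t nr = (dev_sectors + region_size - 1) / region_size;

          uint32_t truncated = (uint32_t)nr;      /* clone->nr_regions */

          printf("nr_regions=%llu, stored as %u -> overflow: %s\n",
                 (unsigned long long)nr, truncated,
                 nr != (uint64_t)truncated ? "yes" : "no");
          return 0;
  }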

Fixes: 7431b7835f55 ("dm: add clone target")
Cc: sta...@vger.kernel.org # v5.4+
Signed-off-by: Nikos Tsironis 
---
 drivers/md/dm-clone-target.c | 12 +++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/drivers/md/dm-clone-target.c b/drivers/md/dm-clone-target.c
index 022dddcad647..6ee85fb3388a 100644
--- a/drivers/md/dm-clone-target.c
+++ b/drivers/md/dm-clone-target.c
@@ -1790,6 +1790,7 @@ static int copy_ctr_args(struct clone *clone, int argc, 
const char **argv, char
 static int clone_ctr(struct dm_target *ti, unsigned int argc, char **argv)
 {
int r;
+   sector_t nr_regions;
struct clone *clone;
struct dm_arg_set as;
 
@@ -1831,7 +1832,16 @@ static int clone_ctr(struct dm_target *ti, unsigned int 
argc, char **argv)
goto out_with_source_dev;
 
clone->region_shift = __ffs(clone->region_size);
-   clone->nr_regions = dm_sector_div_up(ti->len, clone->region_size);
+   nr_regions = dm_sector_div_up(ti->len, clone->region_size);
+
+   /* Check for overflow */
+   if (nr_regions != (unsigned long)nr_regions) {
+   ti->error = "Too many regions. Consider increasing the region 
size";
+   r = -EOVERFLOW;
+   goto out_with_source_dev;
+   }
+
+   clone->nr_regions = nr_regions;
 
r = validate_nr_regions(clone->nr_regions, >error);
if (r)
-- 
2.11.0


--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel



[dm-devel] [PATCH 4/4] dm clone metadata: Fix return type of dm_clone_nr_of_hydrated_regions()

2020-03-27 Thread Nikos Tsironis
dm_clone_nr_of_hydrated_regions() returns the number of regions that
have been hydrated so far. In order to do so it employs bitmap_weight().

Until now, the return type of dm_clone_nr_of_hydrated_regions() was
unsigned long.

Because bitmap_weight() returns an int, in case BITS_PER_LONG == 64 and
the return value of bitmap_weight() is 2^31 (the maximum allowed number
of regions for a device), the result is sign extended from 32 bits to 64
bits and an incorrect value is displayed, in the status output of
dm-clone, as the number of hydrated regions.

Fix this by having dm_clone_nr_of_hydrated_regions() return an unsigned
int.
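
A userspace sketch of the sign extension on a 64-bit machine; INT_MIN
has the bit pattern bitmap_weight() would return for 2^31 hydrated
regions:

  #include <stdio.h>
  #include <limits.h>

  int main(void)
  {
          int weight = INT_MIN;   /* bitmap_weight() result for 2^31 bits */

          unsigned long as_ulong = weight;  /* sign-extended: bogus value */
          unsigned int as_uint = weight;    /* the fix: 2147483648 */

          printf("as unsigned long: %lu\n", as_ulong);
          printf("as unsigned int:  %u\n", as_uint);
          return 0;
  }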

Fixes: 7431b7835f55 ("dm: add clone target")
Cc: sta...@vger.kernel.org # v5.4+
Signed-off-by: Nikos Tsironis 
---
 drivers/md/dm-clone-metadata.c | 2 +-
 drivers/md/dm-clone-metadata.h | 2 +-
 drivers/md/dm-clone-target.c   | 2 +-
 3 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/md/dm-clone-metadata.c b/drivers/md/dm-clone-metadata.c
index 199e7af00858..17712456fa63 100644
--- a/drivers/md/dm-clone-metadata.c
+++ b/drivers/md/dm-clone-metadata.c
@@ -656,7 +656,7 @@ bool dm_clone_is_range_hydrated(struct dm_clone_metadata 
*cmd,
return (bit >= (start + nr_regions));
 }
 
-unsigned long dm_clone_nr_of_hydrated_regions(struct dm_clone_metadata *cmd)
+unsigned int dm_clone_nr_of_hydrated_regions(struct dm_clone_metadata *cmd)
 {
return bitmap_weight(cmd->region_map, cmd->nr_regions);
 }
diff --git a/drivers/md/dm-clone-metadata.h b/drivers/md/dm-clone-metadata.h
index 14af1ebd853f..d848b8799c07 100644
--- a/drivers/md/dm-clone-metadata.h
+++ b/drivers/md/dm-clone-metadata.h
@@ -156,7 +156,7 @@ bool dm_clone_is_range_hydrated(struct dm_clone_metadata 
*cmd,
 /*
  * Returns the number of hydrated regions.
  */
-unsigned long dm_clone_nr_of_hydrated_regions(struct dm_clone_metadata *cmd);
+unsigned int dm_clone_nr_of_hydrated_regions(struct dm_clone_metadata *cmd);
 
 /*
  * Returns the first unhydrated region with region_nr >= @start
diff --git a/drivers/md/dm-clone-target.c b/drivers/md/dm-clone-target.c
index ca5020c58f7c..5ce96ddf1ce1 100644
--- a/drivers/md/dm-clone-target.c
+++ b/drivers/md/dm-clone-target.c
@@ -1473,7 +1473,7 @@ static void clone_status(struct dm_target *ti, 
status_type_t type,
goto error;
}
 
-   DMEMIT("%u %llu/%llu %llu %lu/%lu %u ",
+   DMEMIT("%u %llu/%llu %llu %u/%lu %u ",
   DM_CLONE_METADATA_BLOCK_SIZE,
   (unsigned long long)(nr_metadata_blocks - 
nr_free_metadata_blocks),
   (unsigned long long)nr_metadata_blocks,
-- 
2.11.0


--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel



Re: [dm-devel] [LSF/MM/BFP ATTEND] [LSF/MM/BFP TOPIC] Storage: Copy Offload

2020-01-24 Thread Nikos Tsironis

On 1/7/20 8:14 PM, Chaitanya Kulkarni wrote:

Hi all,

* Background :-
---

Copy offload is a feature that allows file-systems or storage devices
to be instructed to copy files/logical blocks without requiring
involvement of the local CPU.

With reference to the RISC-V summit keynote [1], single-threaded
performance is limited by the end of Dennard scaling, and multi-threaded
performance is slowing down due to the limitations of Moore's law. With
the rise of the SNIA Computational Storage Technical Working Group (TWG)
[2], offloading computations to the device or over the fabrics is
becoming popular, as there are several solutions available [2]. One
common operation which is popular in the kernel but is not merged yet is
copy offload over the fabrics or onto the device.

* Problem :-
---

The original work, done by Martin, is present here [3]. The latest work,
posted by Mikulas [4], is not merged yet. These two approaches are
totally different from each other. Several storage vendors discourage
mixing copy offload requests with regular READ/WRITE I/O. Also, the fact
that the operation fails if a copy request ever needs to be split as it
traverses the stack has the unfortunate side-effect of preventing copy
offload from working in pretty much every common deployment
configuration out there.

* Current state of the work :-
---

[3] is hard pressed to handle arbitrary DM/MD stacking without splitting
the command in two, one for copying IN and one for copying OUT, which is
why [4] demonstrates that [3] is not a suitable candidate. Also, with
[4] there is an unresolved problem with the two-command approach: how to
handle changes to the DM layout between the IN and OUT operations.

* Why Linux Kernel Storage System needs Copy Offload support now ?
---

With the rise of the SNIA Computational Storage TWG and its solutions
[2], the existing SCSI XCopy support in the protocol, the recent
advancements in the Linux Kernel File System for Zoned devices (Zonefs
[5]), and Peer to Peer DMA support in the Linux Kernel, mainly for NVMe
devices [7], the NVMe devices and subsystem (NVMe PCIe/NVMeOF) will
eventually benefit from a Copy offload operation.

With this background we have a significant number of use-cases which are
strong candidates waiting for outstanding Linux Kernel Block Layer Copy
Offload support, so that the Linux Kernel Storage subsystem can address
the previously mentioned problems [1] and allow efficient offloading of
data-related operations (such as move/copy, etc.).

For reference, the following use-cases/candidates are waiting for Copy
Offload support :-

1. SCSI-attached storage arrays.
2. Stacking drivers supporting XCopy DM/MD.
3. Computational Storage solutions.
4. File systems :- Local, NFS and Zonefs.
5. Block devices :- Distributed, local, and Zoned devices.
6. Peer to Peer DMA support solutions.
7. Potentially NVMe subsystem both NVMe PCIe and NVMeOF.

* What we will discuss in the proposed session ?
---

I'd like to propose a session to go over this topic to understand :-

1. What are the blockers for Copy Offload implementation ?
2. Discussion about having a file system interface.
3. Discussion about having right system call for user-space.
4. What is the right way to move this work forward ?
5. How can we help to contribute and move this work forward ?

* Required Participants :-
---

I'd like to invite block layer, device drivers and file system
developers to:-

1. Share their opinion on the topic.
2. Share their experience and any other issues with [4].
3. Uncover additional details that are missing from this proposal.

Required attendees :-

Martin K. Petersen
Jens Axboe
Christoph Hellwig
Bart Van Assche
Stephen Bates
Zach Brown
Roland Dreier
Ric Wheeler
Trond Myklebust
Mike Snitzer
Keith Busch
Sagi Grimberg
Hannes Reinecke
Frederick Knight
Mikulas Patocka
Matias Bjørling

[1]https://content.riscv.org/wp-content/uploads/2018/12/A-New-Golden-Age-for-Computer-Architecture-History-Challenges-and-Opportunities-David-Patterson-.pdf
[2] https://www.snia.org/computational
https://www.napatech.com/support/resources/solution-descriptions/napatech-smartnic-solution-for-hardware-offload/
   https://www.eideticom.com/products.html
https://www.xilinx.com/applications/data-center/computational-storage.html
[3] git://git.kernel.org/pub/scm/linux/kernel/git/mkp/linux.git xcopy
[4] https://www.spinics.net/lists/linux-block/msg00599.html
[5] https://lwn.net/Articles/793585/
[6] https://nvmexpress.org/new-nvmetm-specification-defines-zoned-namespaces-zns-as-go-to-industry-technology/
[7] 

Re: [dm-devel] [PATCH 0/2] dm thin: Flush data device before committing metadata to avoid data corruption

2019-12-09 Thread Nikos Tsironis

On 12/6/19 10:06 PM, Eric Wheeler wrote:

On Fri, 6 Dec 2019, Nikos Tsironis wrote:

On 12/6/19 12:34 AM, Eric Wheeler wrote:

On Thu, 5 Dec 2019, Nikos Tsironis wrote:

On 12/4/19 10:17 PM, Mike Snitzer wrote:

On Wed, Dec 04 2019 at  2:58pm -0500,
Eric Wheeler  wrote:


On Wed, 4 Dec 2019, Nikos Tsironis wrote:


The thin provisioning target maintains per thin device mappings that
map
virtual blocks to data blocks in the data device.

When we write to a shared block, in case of internal snapshots, or
provision a new block, in case of external snapshots, we copy the
shared
block to a new data block (COW), update the mapping for the relevant
virtual block and then issue the write to the new data block.

Suppose the data device has a volatile write-back cache and the
following sequence of events occur:


For those with NV caches, can the data disk flush be optional (maybe
as a
table flag)?


IIRC block core should avoid issuing the flush if not needed.  I'll have
a closer look to verify as much.



For devices without a volatile write-back cache block core strips off
the REQ_PREFLUSH and REQ_FUA bits from requests with a payload and
completes empty REQ_PREFLUSH requests before entering the driver.

This happens in generic_make_request_checks():

   /*
* Filter flush bio's early so that make_request based
* drivers without flush support don't have to worry
* about them.
*/
   if (op_is_flush(bio->bi_opf) &&
   !test_bit(QUEUE_FLAG_WC, &q->queue_flags)) {
   bio->bi_opf &= ~(REQ_PREFLUSH | REQ_FUA);
   if (!nr_sectors) {
   status = BLK_STS_OK;
   goto end_io;
   }
   }

If I am not mistaken, it all depends on whether the underlying device
reports the existence of a write back cache or not.

You could check this by looking at /sys/block//queue/write_cache
If it says "write back" then flushes will be issued.

In case the sysfs entry reports a "write back" cache for a device with a
non-volatile write cache, I think you can change the kernel's view of
the device by writing to this entry (you could also create a udev rule
for this).

This way you can set the write cache as write through. This will
eliminate the cache flushes issued by the kernel, without altering the
device state (Documentation/block/queue-sysfs.rst).


Interesting, I'll remember that. I think this is a documentation bug, isn't
this backwards:
  'This means that it might not be safe to toggle the setting from
  "write back" to "write through", since that will also eliminate
  cache flushes issued by the kernel.'
  [https://www.kernel.org/doc/Documentation/block/queue-sysfs.rst]




If a device has a volatile cache then the write_cache sysfs entry will
be "write back" and we have to issue flushes to the device. In all other
cases write_cache will be "write through".


Forgive my misunderstanding, but if I have a RAID controller with a cache
and BBU with the RAID volume set to write-back mode in the controller, are
you saying that the sysfs entry should show "write through"? I had always
understood that it was safe to disable flushes with a non-volatile cache
and a non-volatile cache is called a write-back cache.



From the device perspective, a non-volatile cache operating in
write-back mode is indeed called a write-back cache.

But, from the OS perspective, a non-volatile cache (whether it operates
in write-back or write-through mode), for all intents and purposes, is
equivalent to a write-through cache: when the device acknowledges a
write it's guaranteed that the written data won't be lost in case of
power loss.

So, in the case of a controller with a BBU and/or a non-volatile cache,
you don't care what the device does internally. All that matters is that
acked writes won't be lost in case of power failure.

I believe that the sysfs entry reports exactly that. Whether the kernel
should treat the device as having a volatile write-back cache, so we
have to issue flushes to ensure the data are properly persisted, or as
having no cache or a write-through cache, so flushes are not necessary.


It is strange to me that this terminology in the kernel would be backwards
from how it is expressed in a RAID controller. Incidentally, I have an
Avago MegaRAID 9460 with 2 volumes. The first volume (sda) is in
write-back mode and the second volume is write-through. In both cases
sysfs reports "write through":

[root@hv1-he ~]# cat /sys/block/sda/queue/write_cache
write through
[root@hv1-he ~]# cat /sys/block/sdb/queue/write_cache
write through

This is running 4.19.75, so we can at least say that the 9460 does not
support proper representation of the VD cache mode in sysfs, but which is
correct? Should it not be that the sysfs entry reports the same cache mode
of the RAID controller?



My guess is that the controller reports to the kernel that it has a
write-through cache (or no cache).

Re: [dm-devel] [PATCH 3/3] dm clone: Flush destination device before committing metadata

2019-12-06 Thread Nikos Tsironis

On 12/6/19 6:21 PM, Mike Snitzer wrote:

On Thu, Dec 05 2019 at  5:42P -0500,
Nikos Tsironis  wrote:


On 12/6/19 12:09 AM, Mike Snitzer wrote:

On Thu, Dec 05 2019 at  4:49pm -0500,
Nikos Tsironis  wrote:


For dm-thin, indeed, there is not much to gain by not using
blkdev_issue_flush(), since we still allocate a new bio, indirectly, in
the stack.


But thinp obviously could if there is actual benefit to avoiding this
flush bio allocation, via blkdev_issue_flush, every commit.



Yes, we could do the flush in thinp exactly the same way we do it in
dm-clone. Add a struct bio field in struct pool_c and use that in the
callback.

It would work since the callback is called holding a write lock on
pmd->root_lock, so it's executed only by a single thread at a time.

I didn't go for it in my implementation, because I didn't like having to
make that assumption in the callback, i.e., that it's executed under a
lock and so it's safe to have the bio in struct pool_c.

In hindsight, maybe this was a bad call, since it's technically feasible
to do it this way and we could just add a comment stating that the
callback is executed atomically.

If you want I can send a new follow-on patch tomorrow implementing the
flush in thinp the same way it's implemented in dm-clone.


I took care of it, here is the incremental:



Awesome, thanks!
 

diff --git a/drivers/md/dm-thin.c b/drivers/md/dm-thin.c
index 73d191ddbb9f..57626c27a54b 100644
--- a/drivers/md/dm-thin.c
+++ b/drivers/md/dm-thin.c
@@ -328,6 +328,7 @@ struct pool_c {
dm_block_t low_water_blocks;
struct pool_features requested_pf; /* Features requested during table load */
struct pool_features adjusted_pf;  /* Features used after adjusting for constituent devices */
+   struct bio flush_bio;
  };
  
  /*

@@ -3123,6 +3124,7 @@ static void pool_dtr(struct dm_target *ti)
__pool_dec(pt->pool);
dm_put_device(ti, pt->metadata_dev);
dm_put_device(ti, pt->data_dev);
+   bio_uninit(&pt->flush_bio);
kfree(pt);
  
mutex_unlock(&dm_thin_pool_table.mutex);

@@ -3202,8 +3204,13 @@ static void metadata_low_callback(void *context)
  static int metadata_pre_commit_callback(void *context)
  {
struct pool_c *pt = context;
+   struct bio *flush_bio = &pt->flush_bio;
  
-	return blkdev_issue_flush(pt->data_dev->bdev, GFP_NOIO, NULL);

+   bio_reset(flush_bio);
+   bio_set_dev(flush_bio, pt->data_dev->bdev);
+   flush_bio->bi_opf = REQ_OP_WRITE | REQ_PREFLUSH;
+
+   return submit_bio_wait(flush_bio);
  }
  
  static sector_t get_dev_size(struct block_device *bdev)

@@ -3374,6 +3381,7 @@ static int pool_ctr(struct dm_target *ti, unsigned argc, char **argv)
pt->data_dev = data_dev;
pt->low_water_blocks = low_water_blocks;
pt->adjusted_pf = pt->requested_pf = pf;
+   bio_init(&pt->flush_bio, NULL, 0);
ti->num_flush_bios = 1;
  
  	/*




Looks good,

Thanks Nikos

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel



Re: [dm-devel] [PATCH 0/2] dm thin: Flush data device before committing metadata to avoid data corruption

2019-12-06 Thread Nikos Tsironis

On 12/6/19 12:34 AM, Eric Wheeler wrote:

On Thu, 5 Dec 2019, Nikos Tsironis wrote:

On 12/4/19 10:17 PM, Mike Snitzer wrote:

On Wed, Dec 04 2019 at  2:58pm -0500,
Eric Wheeler  wrote:


On Wed, 4 Dec 2019, Nikos Tsironis wrote:


The thin provisioning target maintains per thin device mappings that map
virtual blocks to data blocks in the data device.

When we write to a shared block, in case of internal snapshots, or
provision a new block, in case of external snapshots, we copy the shared
block to a new data block (COW), update the mapping for the relevant
virtual block and then issue the write to the new data block.

Suppose the data device has a volatile write-back cache and the
following sequence of events occur:


For those with NV caches, can the data disk flush be optional (maybe as a
table flag)?


IIRC block core should avoid issuing the flush if not needed.  I'll have
a closer look to verify as much.



For devices without a volatile write-back cache block core strips off
the REQ_PREFLUSH and REQ_FUA bits from requests with a payload and
completes empty REQ_PREFLUSH requests before entering the driver.

This happens in generic_make_request_checks():

/*
 * Filter flush bio's early so that make_request based
 * drivers without flush support don't have to worry
 * about them.
 */
if (op_is_flush(bio->bi_opf) &&
!test_bit(QUEUE_FLAG_WC, &q->queue_flags)) {
bio->bi_opf &= ~(REQ_PREFLUSH | REQ_FUA);
if (!nr_sectors) {
status = BLK_STS_OK;
goto end_io;
}
}

If I am not mistaken, it all depends on whether the underlying device
reports the existence of a write back cache or not.

You could check this by looking at /sys/block//queue/write_cache
If it says "write back" then flushes will be issued.

In case the sysfs entry reports a "write back" cache for a device with a
non-volatile write cache, I think you can change the kernel's view of
the device by writing to this entry (you could also create a udev rule
for this).

This way you can set the write cache as write through. This will
eliminate the cache flushes issued by the kernel, without altering the
device state (Documentation/block/queue-sysfs.rst).


Interesting, I'll remember that. I think this is a documentation bug, isn't 
this backwards:
'This means that it might not be safe to toggle the setting from
"write back" to "write through", since that will also eliminate
cache flushes issued by the kernel.'
[https://www.kernel.org/doc/Documentation/block/queue-sysfs.rst]




If a device has a volatile cache then the write_cache sysfs entry will
be "write back" and we have to issue flushes to the device. In all other
cases write_cache will be "write through".

It's not safe to toggle write_cache from "write back" to "write through"
because this stops the kernel from sending flushes to the device, but
the device will continue caching the writes. So, in case something goes
wrong, you might lose your writes or end up with some kind of
corruption.


How does this work with stacking blockdevs?  Does it inherit from the
lower-level dev? If an upper-level is misconfigured, would a writeback at
higher levels would clear the flush for lower levels?



As Mike already mentioned in another reply to this thread, the device
capabilities are stacked up when each device is created and are
inherited from component devices.

The logic for device stacking is implemented in various functions in
block/blk-settings.c (blk_set_stacking_limits(), blk_stack_limits(),
etc.), which are used also by DM core in dm-table.c to set the
capabilities of DM devices.

If an upper layer device reports a "write back" cache then flushes will
be issued to it by the kernel, no matter what the capabilities of the
underlying devices are.

Normally an upper layer device would report a "write back" cache if at
least one underlying device supports flushes. But, some DM devices
report a "write back" cache irrespective of the underlying devices,
e.g., dm-thin, dm-clone, dm-cache. This is required so they can flush
their own metadata. They then pass the flush request down to the
underlying device and rely on block core to do the right thing. Either
actually send the flush to the device, if it has a volatile cache, or
complete it immediately.
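
For illustration, here is a rough sketch of how a bio-based target opts
in to receiving those flushes; the constructor line is what dm-thin's
pool_ctr sets (it appears in the patches quoted in this thread), while
example_ctr()/example_map() are otherwise a hypothetical skeleton:

static int example_ctr(struct dm_target *ti, unsigned int argc, char **argv)
{
	ti->num_flush_bios = 1;	/* DM core will pass empty REQ_PREFLUSH bios
				 * to our map function */
	return 0;
}

static int example_map(struct dm_target *ti, struct bio *bio)
{
	if (bio->bi_opf & REQ_PREFLUSH) {
		/* commit our own metadata here, then remap the flush to the
		 * underlying device; block core completes it immediately if
		 * that device has no volatile cache */
	}
	return DM_MAPIO_REMAPPED;
}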

Nikos


--
Eric Wheeler




Nikos


Mike





--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel



Re: [dm-devel] dm-thin: Several Questions on dm-thin performance.

2019-12-06 Thread Nikos Tsironis

On 11/22/19 8:55 PM, Joe Thornber wrote:

On Fri, Nov 22, 2019 at 11:14:15AM +0800, JeffleXu wrote:


The first question is what's the purpose of data cell? In thin_bio_map(),
normal bio will be packed as a virtual cell and data cell. I can understand
that virtual cell is used to prevent discard bio and non-discard bio
targeting the same block from being processed at the same time. I find it
was added in commit e8088073c9610af017fd47fddd104a2c3afb32e8 (dm thin:
fix race between simultaneous io and discards to same block), but I'm still
confused about the use of data cell.


As you are aware there are two address spaces for the locks.  The 'virtual' one
refers to cells in the logical address space of the thin devices, and the 
'data' one
refers to the underlying data device.  There are certain conditions where we
unfortunately need to hold both of these (eg, to prevent a data block being 
reprovisioned
before an io to it has completed).


The second question is the impact of virtual cell and data cell on IO
performance. If $data_block_size is large for example 1G, in multithread fio
test, most bio will be buffered in cell->bios list and then be processed by
worker thread asynchronously, even when there's no discard bio. Thus the
original parallel IO is processed by worker thread serially now. As the
number of fio test threads increase, the single worker thread can easily get
CPU 100%, and thus become the bottleneck of the performance since dm-thin
workqueue is ordered unbound.


Yep, this is a big issue.  Take a look at dm-bio-prison-v2.h, this is the
new interface that we need to move dm-thin across to use (dm-cache already uses 
it).
It allows concurrent holders of a cell (ie, read locks), so we'll be able to 
remap
much more io without handing it off to a worker thread.  Once this is done I 
want
to add an extra field to cells that will cache the mapping, this way if you 
acquire a
cell that is already held then you can avoid the expensive btree lookup.  
Together
these changes should make a huge difference to the performance.
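
A rough conceptual sketch of the idea (this is not the actual
dm-bio-prison-v2 API; the struct and helper below are made up purely to
show shared cell locks plus a cached mapping):

struct cell {
	unsigned shared_count;		/* readers currently holding the cell */
	bool exclusive;			/* a writer/discard holds it exclusively */
	bool mapping_cached;		/* result of a previous btree lookup */
	dm_block_t cached_data_block;
	struct bio_list waiters;	/* bios parked until the lock is free */
};

/* Fast path: a later reader finds the mapping already cached in the
 * cell and remaps its bio inline, instead of deferring everything to
 * the single worker thread. */
static bool try_fast_remap(struct cell *cell, struct bio *bio)
{
	if (cell->exclusive || !cell->mapping_cached)
		return false;	/* slow path: hand off to the worker */

	cell->shared_count++;
	remap_to_data_block(bio, cell->cached_data_block);	/* hypothetical */
	return true;
}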

If you've got some spare coding cycles I'd love some help with this ;)



Hi Joe,

I would be interested in helping you with this task. I can't make any
promises, but I believe I could probably spare some time to work on it.

If you think you could use the extra help, let me know.

Nikos


- Joe

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel



--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel



Re: [dm-devel] [PATCH 3/3] dm clone: Flush destination device before committing metadata

2019-12-05 Thread Nikos Tsironis

On 12/6/19 12:09 AM, Mike Snitzer wrote:

On Thu, Dec 05 2019 at  4:49pm -0500,
Nikos Tsironis  wrote:


On 12/5/19 10:07 PM, Mike Snitzer wrote:

On Thu, Dec 05 2019 at  2:46pm -0500,
Mike Snitzer  wrote:


On Wed, Dec 04 2019 at  9:06P -0500,
Nikos Tsironis  wrote:


dm-clone maintains an on-disk bitmap which records which regions are
valid in the destination device, i.e., which regions have already been
hydrated, or have been written to directly, via user I/O.

Setting a bit in the on-disk bitmap means the corresponding region is
valid in the destination device and we redirect all I/O regarding it to
the destination device.

Suppose the destination device has a volatile write-back cache and the
following sequence of events occur:

1. A region gets hydrated, either through the background hydration or
because it was written to directly, via user I/O.

2. The commit timeout expires and we commit the metadata, marking that
region as valid in the destination device.

3. The system crashes and the destination device's cache has not been
flushed, meaning the region's data are lost.

The next time we read that region we read it from the destination
device, since the metadata have been successfully committed, but the
data are lost due to the crash, so we read garbage instead of the old
data.

This has several implications:

1. In case of background hydration or of writes with size smaller than
the region size (which means we first copy the whole region and then
issue the smaller write), we corrupt data that the user never
touched.

2. In case of writes with size equal to the device's logical block size,
we fail to provide atomic sector writes. When the system recovers the
user will read garbage from the sector instead of the old data or the
new data.

3. In case of writes without the FUA flag set, after the system
recovers, the written sectors will contain garbage instead of a
random mix of sectors containing either old data or new data, thus we
fail again to provide atomic sector writes.

4. Even when the user flushes the dm-clone device, because we first
commit the metadata and then pass down the flush, the same risk for
corruption exists (if the system crashes after the metadata have been
committed but before the flush is passed down).

The only case which is unaffected is that of writes with size equal to
the region size and with the FUA flag set. But, because FUA writes
trigger metadata commits, this case can trigger the corruption
indirectly.

To solve this and avoid the potential data corruption we flush the
destination device **before** committing the metadata.

This ensures that any freshly hydrated regions, for which we commit the
metadata, are properly written to non-volatile storage and won't be lost
in case of a crash.

Fixes: 7431b7835f55 ("dm: add clone target")
Cc: sta...@vger.kernel.org # v5.4+
Signed-off-by: Nikos Tsironis 
---
  drivers/md/dm-clone-target.c | 46 ++--
  1 file changed, 40 insertions(+), 6 deletions(-)

diff --git a/drivers/md/dm-clone-target.c b/drivers/md/dm-clone-target.c
index 613c913c296c..d1e1b5b56b1b 100644
--- a/drivers/md/dm-clone-target.c
+++ b/drivers/md/dm-clone-target.c
@@ -86,6 +86,12 @@ struct clone {
struct dm_clone_metadata *cmd;
+   /*
+* bio used to flush the destination device, before committing the
+* metadata.
+*/
+   struct bio flush_bio;
+
/* Region hydration hash table */
struct hash_table_bucket *ht;
@@ -1108,10 +1114,13 @@ static bool need_commit_due_to_time(struct clone *clone)
  /*
   * A non-zero return indicates read-only or fail mode.
   */
-static int commit_metadata(struct clone *clone)
+static int commit_metadata(struct clone *clone, bool *dest_dev_flushed)
  {
int r = 0;
+   if (dest_dev_flushed)
+   *dest_dev_flushed = false;
+
mutex_lock(&clone->commit_lock);
if (!dm_clone_changed_this_transaction(clone->cmd))
@@ -1128,6 +1137,19 @@ static int commit_metadata(struct clone *clone)
goto out;
}
+   bio_reset(&clone->flush_bio);
+   bio_set_dev(&clone->flush_bio, clone->dest_dev->bdev);
+   clone->flush_bio.bi_opf = REQ_OP_WRITE | REQ_PREFLUSH;
+
+   r = submit_bio_wait(&clone->flush_bio);
+   if (unlikely(r)) {
+   __metadata_operation_failed(clone, "flush destination device", r);
+   goto out;
+   }
+
+   if (dest_dev_flushed)
+   *dest_dev_flushed = true;
+
r = dm_clone_metadata_commit(clone->cmd);
if (unlikely(r)) {
__metadata_operation_failed(clone, "dm_clone_metadata_commit", 
r);
@@ -1199,6 +1221,7 @@ static void process_deferred_bios(struct clone *clone)
  static void process_deferred_flush_bios(struct clone *clone)
  {
struct bio *bio;
+   bool dest_dev_flushed;
struct bio_list bios = BIO_EMPTY_LIST;

Re: [dm-devel] [PATCH 3/3] dm clone: Flush destination device before committing metadata

2019-12-05 Thread Nikos Tsironis

On 12/5/19 10:07 PM, Mike Snitzer wrote:

On Thu, Dec 05 2019 at  2:46pm -0500,
Mike Snitzer  wrote:


On Wed, Dec 04 2019 at  9:06P -0500,
Nikos Tsironis  wrote:


dm-clone maintains an on-disk bitmap which records which regions are
valid in the destination device, i.e., which regions have already been
hydrated, or have been written to directly, via user I/O.

Setting a bit in the on-disk bitmap means the corresponding region is
valid in the destination device and we redirect all I/O regarding it to
the destination device.

Suppose the destination device has a volatile write-back cache and the
following sequence of events occur:

1. A region gets hydrated, either through the background hydration or
because it was written to directly, via user I/O.

2. The commit timeout expires and we commit the metadata, marking that
region as valid in the destination device.

3. The system crashes and the destination device's cache has not been
flushed, meaning the region's data are lost.

The next time we read that region we read it from the destination
device, since the metadata have been successfully committed, but the
data are lost due to the crash, so we read garbage instead of the old
data.

This has several implications:

1. In case of background hydration or of writes with size smaller than
the region size (which means we first copy the whole region and then
issue the smaller write), we corrupt data that the user never
touched.

2. In case of writes with size equal to the device's logical block size,
we fail to provide atomic sector writes. When the system recovers the
user will read garbage from the sector instead of the old data or the
new data.

3. In case of writes without the FUA flag set, after the system
recovers, the written sectors will contain garbage instead of a
random mix of sectors containing either old data or new data, thus we
fail again to provide atomic sector writes.

4. Even when the user flushes the dm-clone device, because we first
commit the metadata and then pass down the flush, the same risk for
corruption exists (if the system crashes after the metadata have been
committed but before the flush is passed down).

The only case which is unaffected is that of writes with size equal to
the region size and with the FUA flag set. But, because FUA writes
trigger metadata commits, this case can trigger the corruption
indirectly.

To solve this and avoid the potential data corruption we flush the
destination device **before** committing the metadata.

This ensures that any freshly hydrated regions, for which we commit the
metadata, are properly written to non-volatile storage and won't be lost
in case of a crash.

Fixes: 7431b7835f55 ("dm: add clone target")
Cc: sta...@vger.kernel.org # v5.4+
Signed-off-by: Nikos Tsironis 
---
  drivers/md/dm-clone-target.c | 46 ++--
  1 file changed, 40 insertions(+), 6 deletions(-)

diff --git a/drivers/md/dm-clone-target.c b/drivers/md/dm-clone-target.c
index 613c913c296c..d1e1b5b56b1b 100644
--- a/drivers/md/dm-clone-target.c
+++ b/drivers/md/dm-clone-target.c
@@ -86,6 +86,12 @@ struct clone {
  
  	struct dm_clone_metadata *cmd;
  
+	/*

+* bio used to flush the destination device, before committing the
+* metadata.
+*/
+   struct bio flush_bio;
+
/* Region hydration hash table */
struct hash_table_bucket *ht;
  
@@ -1108,10 +1114,13 @@ static bool need_commit_due_to_time(struct clone *clone)

  /*
   * A non-zero return indicates read-only or fail mode.
   */
-static int commit_metadata(struct clone *clone)
+static int commit_metadata(struct clone *clone, bool *dest_dev_flushed)
  {
int r = 0;
  
+	if (dest_dev_flushed)

+   *dest_dev_flushed = false;
+
mutex_lock(&clone->commit_lock);
  
  	if (!dm_clone_changed_this_transaction(clone->cmd))

@@ -1128,6 +1137,19 @@ static int commit_metadata(struct clone *clone)
goto out;
}
  
+   bio_reset(&clone->flush_bio);
+   bio_set_dev(&clone->flush_bio, clone->dest_dev->bdev);
+   clone->flush_bio.bi_opf = REQ_OP_WRITE | REQ_PREFLUSH;
+
+   r = submit_bio_wait(&clone->flush_bio);
+   if (unlikely(r)) {
+   __metadata_operation_failed(clone, "flush destination device", r);
+   goto out;
+   }
+
+   if (dest_dev_flushed)
+   *dest_dev_flushed = true;
+
r = dm_clone_metadata_commit(clone->cmd);
if (unlikely(r)) {
__metadata_operation_failed(clone, "dm_clone_metadata_commit", 
r);
@@ -1199,6 +1221,7 @@ static void process_deferred_bios(struct clone *clone)
  static void process_deferred_flush_bios(struct clone *clone)
  {
struct bio *bio;
+   bool dest_dev_flushed;
struct bio_list bios = BIO_EMPTY_LIST;
struct bio_list bio_completions = BIO_EMPTY_LIST;
  
@@ -1218,7 +1241,7 @@ static void process_deferred_flush_bios(struct clone *clone)

Re: [dm-devel] [PATCH 1/2] dm thin metadata: Add support for a pre-commit callback

2019-12-05 Thread Nikos Tsironis

On 12/5/19 9:40 PM, Mike Snitzer wrote:

On Wed, Dec 04 2019 at  9:07P -0500,
Nikos Tsironis  wrote:


Add support for one pre-commit callback which is run right before the
metadata are committed.

This allows the thin provisioning target to run a callback before the
metadata are committed and is required by the next commit.

Cc: sta...@vger.kernel.org
Signed-off-by: Nikos Tsironis 
---
  drivers/md/dm-thin-metadata.c | 29 +
  drivers/md/dm-thin-metadata.h |  7 +++
  2 files changed, 36 insertions(+)

diff --git a/drivers/md/dm-thin-metadata.c b/drivers/md/dm-thin-metadata.c
index 4c68a7b93d5e..b88d6d701f5b 100644
--- a/drivers/md/dm-thin-metadata.c
+++ b/drivers/md/dm-thin-metadata.c
@@ -189,6 +189,15 @@ struct dm_pool_metadata {
sector_t data_block_size;
  
  	/*

+* Pre-commit callback.
+*
+* This allows the thin provisioning target to run a callback before
+* the metadata are committed.
+*/
+   dm_pool_pre_commit_fn pre_commit_fn;
+   void *pre_commit_context;
+
+   /*
 * We reserve a section of the metadata for commit overhead.
 * All reported space does *not* include this.
 */
@@ -826,6 +835,14 @@ static int __commit_transaction(struct dm_pool_metadata *pmd)
if (unlikely(!pmd->in_service))
return 0;
  
+	if (pmd->pre_commit_fn) {

+   r = pmd->pre_commit_fn(pmd->pre_commit_context);
+   if (r < 0) {
+   DMERR("pre-commit callback failed");
+   return r;
+   }
+   }
+
r = __write_changed_details(pmd);
if (r < 0)
return r;
@@ -892,6 +909,8 @@ struct dm_pool_metadata *dm_pool_metadata_open(struct block_device *bdev,
pmd->in_service = false;
pmd->bdev = bdev;
pmd->data_block_size = data_block_size;
+   pmd->pre_commit_fn = NULL;
+   pmd->pre_commit_context = NULL;
  
  	r = __create_persistent_data_objects(pmd, format_device);

if (r) {
@@ -2044,6 +2063,16 @@ int dm_pool_register_metadata_threshold(struct dm_pool_metadata *pmd,
return r;
  }
  
+void dm_pool_register_pre_commit_callback(struct dm_pool_metadata *pmd,

+ dm_pool_pre_commit_fn fn,
+ void *context)
+{
+   pmd_write_lock_in_core(pmd);
+   pmd->pre_commit_fn = fn;
+   pmd->pre_commit_context = context;
+   pmd_write_unlock(pmd);
+}
+
  int dm_pool_metadata_set_needs_check(struct dm_pool_metadata *pmd)
  {
int r = -EINVAL;
diff --git a/drivers/md/dm-thin-metadata.h b/drivers/md/dm-thin-metadata.h
index f6be0d733c20..7ef56bd2a7e3 100644
--- a/drivers/md/dm-thin-metadata.h
+++ b/drivers/md/dm-thin-metadata.h
@@ -230,6 +230,13 @@ bool dm_pool_metadata_needs_check(struct dm_pool_metadata *pmd);
   */
  void dm_pool_issue_prefetches(struct dm_pool_metadata *pmd);
  
+/* Pre-commit callback */

+typedef int (*dm_pool_pre_commit_fn)(void *context);
+
+void dm_pool_register_pre_commit_callback(struct dm_pool_metadata *pmd,
+ dm_pool_pre_commit_fn fn,
+ void *context);
+
  /**/
  
  #endif

--
2.11.0



I have this incremental, not seeing need to avoid using blkdev_issue_flush



Ack,

Nikos.
 

---
  drivers/md/dm-thin.c | 12 +---
  1 file changed, 1 insertion(+), 11 deletions(-)

diff --git a/drivers/md/dm-thin.c b/drivers/md/dm-thin.c
index 9c9a323c0c30..255a52f7bbf0 100644
--- a/drivers/md/dm-thin.c
+++ b/drivers/md/dm-thin.c
@@ -3203,18 +3203,8 @@ static void metadata_low_callback(void *context)
  static int metadata_pre_commit_callback(void *context)
  {
struct pool_c *pt = context;
-   struct bio bio;
-   int r;
-
-   bio_init(&bio, NULL, 0);
-   bio_set_dev(&bio, pt->data_dev->bdev);
-   bio.bi_opf = REQ_OP_WRITE | REQ_PREFLUSH;
-
-   r = submit_bio_wait(&bio);
-
-   bio_uninit(&bio);
-
-   return r;
+   return blkdev_issue_flush(pt->data_dev->bdev, GFP_NOIO, NULL);
  }
  
  static sector_t get_dev_size(struct block_device *bdev)




--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel



Re: [dm-devel] [PATCH 0/2] dm thin: Flush data device before committing metadata to avoid data corruption

2019-12-05 Thread Nikos Tsironis

On 12/5/19 5:42 PM, Mike Snitzer wrote:

On Thu, Dec 05 2019 at 10:31am -0500,
Nikos Tsironis  wrote:


On 12/4/19 10:17 PM, Mike Snitzer wrote:

On Wed, Dec 04 2019 at  2:58pm -0500,
Eric Wheeler  wrote:


On Wed, 4 Dec 2019, Nikos Tsironis wrote:


The thin provisioning target maintains per thin device mappings that map
virtual blocks to data blocks in the data device.

When we write to a shared block, in case of internal snapshots, or
provision a new block, in case of external snapshots, we copy the shared
block to a new data block (COW), update the mapping for the relevant
virtual block and then issue the write to the new data block.

Suppose the data device has a volatile write-back cache and the
following sequence of events occur:


For those with NV caches, can the data disk flush be optional (maybe as a
table flag)?


IIRC block core should avoid issuing the flush if not needed.  I'll have
a closer look to verify as much.



For devices without a volatile write-back cache block core strips off
the REQ_PREFLUSH and REQ_FUA bits from requests with a payload and
completes empty REQ_PREFLUSH requests before entering the driver.

This happens in generic_make_request_checks():

/*
 * Filter flush bio's early so that make_request based
 * drivers without flush support don't have to worry
 * about them.
 */
if (op_is_flush(bio->bi_opf) &&
!test_bit(QUEUE_FLAG_WC, &q->queue_flags)) {
bio->bi_opf &= ~(REQ_PREFLUSH | REQ_FUA);
if (!nr_sectors) {
status = BLK_STS_OK;
goto end_io;
}
}

If I am not mistaken, it all depends on whether the underlying device
reports the existence of a write back cache or not.


Yes, thanks for confirming my memory of the situation.


You could check this by looking at /sys/block//queue/write_cache
If it says "write back" then flushes will be issued.

In case the sysfs entry reports a "write back" cache for a device with a
non-volatile write cache, I think you can change the kernel's view of
the device by writing to this entry (you could also create a udev rule
for this).

This way you can set the write cache as write through. This will
eliminate the cache flushes issued by the kernel, without altering the
device state (Documentation/block/queue-sysfs.rst).


Not delved into this aspect of Linux's capabilities but it strikes me as
"dangerous" to twiddle device capabilities like this.  Best to fix
driver to properly expose cache (or not, as the case may be).  It should
also be noted that with DM, the capabilities are stacked up at device
creation time.  So any changes to the underlying devices will _not_ be
reflected to the high level DM device.



Yes, I agree completely. The queue-sysfs doc also mentions that it's not
safe to do that. I just mentioned it for completeness.

As far as DM is concerned, you are right. You would have to deactivate
and reactivate all DM devices for the change to propagate to upper
layers. That's why I mentioned udev, because that way the change will be
made to the lower level device when its queue is first created and it
will be properly propagated to upper layers.

But, again, I agree that this is not something safe to do and it's
better to make sure the driver properly exposes the cache capabilities,
as you said.

Nikos


Mike



--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel



Re: [dm-devel] [PATCH 0/2] dm thin: Flush data device before committing metadata to avoid data corruption

2019-12-05 Thread Nikos Tsironis

On 12/4/19 10:17 PM, Mike Snitzer wrote:

On Wed, Dec 04 2019 at  2:58pm -0500,
Eric Wheeler  wrote:


On Wed, 4 Dec 2019, Nikos Tsironis wrote:


The thin provisioning target maintains per thin device mappings that map
virtual blocks to data blocks in the data device.

When we write to a shared block, in case of internal snapshots, or
provision a new block, in case of external snapshots, we copy the shared
block to a new data block (COW), update the mapping for the relevant
virtual block and then issue the write to the new data block.

Suppose the data device has a volatile write-back cache and the
following sequence of events occur:


For those with NV caches, can the data disk flush be optional (maybe as a
table flag)?


IIRC block core should avoid issuing the flush if not needed.  I'll have
a closer look to verify as much.



For devices without a volatile write-back cache block core strips off
the REQ_PREFLUSH and REQ_FUA bits from requests with a payload and
completes empty REQ_PREFLUSH requests before entering the driver.

This happens in generic_make_request_checks():

/*
 * Filter flush bio's early so that make_request based
 * drivers without flush support don't have to worry
 * about them.
 */
if (op_is_flush(bio->bi_opf) &&
!test_bit(QUEUE_FLAG_WC, &q->queue_flags)) {
bio->bi_opf &= ~(REQ_PREFLUSH | REQ_FUA);
if (!nr_sectors) {
status = BLK_STS_OK;
goto end_io;
}
}

If I am not mistaken, it all depends on whether the underlying device
reports the existence of a write back cache or not.

You could check this by looking at /sys/block//queue/write_cache
If it says "write back" then flushes will be issued.

In case the sysfs entry reports a "write back" cache for a device with a
non-volatile write cache, I think you can change the kernel's view of
the device by writing to this entry (you could also create a udev rule
for this).

This way you can set the write cache as write through. This will
eliminate the cache flushes issued by the kernel, without altering the
device state (Documentation/block/queue-sysfs.rst).
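
For example, a udev rule along these lines (the device match is
hypothetical), dropped into e.g. /etc/udev/rules.d/, would apply the
override every time the disk's queue (re)appears:

ACTION=="add|change", KERNEL=="sdb", ATTR{queue/write_cache}="write through"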

Nikos


Mike



--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel



[dm-devel] [PATCH] dm thin: Avoid flushing the data device twice

2019-12-04 Thread Nikos Tsironis
Since we flush the data device as part of a metadata commit, it's
redundant to then submit any deferred REQ_PREFLUSH bios.

Add a check in process_deferred_bios() for deferred REQ_PREFLUSH bios
and complete them immediately.

Signed-off-by: Nikos Tsironis 
---
 drivers/md/dm-thin.c | 14 --
 1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/drivers/md/dm-thin.c b/drivers/md/dm-thin.c
index e0be545080d0..40d8a255dbc3 100644
--- a/drivers/md/dm-thin.c
+++ b/drivers/md/dm-thin.c
@@ -2383,8 +2383,18 @@ static void process_deferred_bios(struct pool *pool)
while ((bio = bio_list_pop(&bio_completions)))
bio_endio(bio);
 
-   while ((bio = bio_list_pop(&bios)))
-   generic_make_request(bio);
+   while ((bio = bio_list_pop(&bios))) {
+   if (bio->bi_opf & REQ_PREFLUSH) {
+   /*
+* We just flushed the data device as part of the
+* metadata commit, so there is no reason to send
+* another flush.
+*/
+   bio_endio(bio);
+   } else {
+   generic_make_request(bio);
+   }
+   }
 }
 
 static void do_worker(struct work_struct *ws)
-- 
2.11.0


--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel



Re: [dm-devel] [PATCH 2/2] dm thin: Flush data device before committing metadata

2019-12-04 Thread Nikos Tsironis

On 12/4/19 6:39 PM, Mike Snitzer wrote:
On Wed, Dec 04 2019 at 11:17am -0500,

Nikos Tsironis  wrote:


On 12/4/19 5:27 PM, Joe Thornber wrote:

On Wed, Dec 04, 2019 at 04:07:42PM +0200, Nikos Tsironis wrote:

The thin provisioning target maintains per thin device mappings that map
virtual blocks to data blocks in the data device.



Ack.  But I think we're issuing the FLUSH twice with your patch.  Since the
original bio is still remapped and issued at the end of process_deferred_bios?



Yes, that's correct. I thought of it and of putting a check in
process_deferred_bios() to complete FLUSH bios immediately, but I have
one concern and I preferred to be safe than sorry.

In __commit_transaction() there is the following check:

   if (unlikely(!pmd->in_service))
 return 0;

, which means we don't commit the metadata, and thus we don't flush the
data device, in case the pool is not in service.

Opening a thin device doesn't seem to put the pool in service, since
dm_pool_open_thin_device() uses pmd_write_lock_in_core().

Can I assume that the pool is in service if I/O can be mapped to a thin
device? If so, it's safe to put such a check in process_deferred_bios().


In service means upper layer has issued a write to a thin device of a
pool.  The header for commit 873f258becca87 gets into more detail.


On second thought though, in order for a flush bio to end up in
deferred_flush_bios in the first place, someone must have changed the
metadata and thus put the pool in service. Otherwise, it would have been
submitted directly to the data device. So, it's probably safe to check
for flush bios after commit() in process_deferred_bios() and complete
them immediately.


Yes, I think so, which was Joe's original point.
  

If you confirm too that this is safe, I will send a second version of
the patch adding the check.


Not seeing why we need another in_service check.  After your changes are
applied, any commit will trigger a preceeding flush.. so the deferred
flushes are redundant.



Yes, I meant add a check in process_deferred_bios(), after commit(), to
check for REQ_PREFLUSH bios and complete them immediately. I should have
clarified that.


By definition, these deferred bios imply the pool is in service.

I'd be fine with seeing a 3rd follow-on thinp patch that completes the
redundant flushes immediately.



Ack, I will send another patch fixing this.

Nikos


Thanks,
Mike



--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel



Re: [dm-devel] [PATCH 2/2] dm thin: Flush data device before committing metadata

2019-12-04 Thread Nikos Tsironis

On 12/4/19 5:27 PM, Joe Thornber wrote:

On Wed, Dec 04, 2019 at 04:07:42PM +0200, Nikos Tsironis wrote:

The thin provisioning target maintains per thin device mappings that map
virtual blocks to data blocks in the data device.



Ack.  But I think we're issuing the FLUSH twice with your patch.  Since the
original bio is still remapped and issued at the end of process_deferred_bios?



Yes, that's correct. I thought of it and of putting a check in
process_deferred_bios() to complete FLUSH bios immediately, but I have
one concern and I preferred to be safe than sorry.

In __commit_transaction() there is the following check:

  if (unlikely(!pmd->in_service))
return 0;

, which means we don't commit the metadata, and thus we don't flush the
data device, in case the pool is not in service.

Opening a thin device doesn't seem to put the pool in service, since
dm_pool_open_thin_device() uses pmd_write_lock_in_core().

Can I assume that the pool is in service if I/O can be mapped to a thin
device? If so, it's safe to put such a check in process_deferred_bios().

On second thought though, in order for a flush bio to end up in
deferred_flush_bios in the first place, someone must have changed the
metadata and thus put the pool in service. Otherwise, it would have been
submitted directly to the data device. So, it's probably safe to check
for flush bios after commit() in process_deferred_bios() and complete
them immediately.

If you confirm too that this is safe, I will send a second version of
the patch adding the check.

Thanks,
Nikos


- Joe



--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel



[dm-devel] [PATCH 0/3] dm clone: Flush destination device before committing metadata to avoid data corruption

2019-12-04 Thread Nikos Tsironis
dm-clone maintains an on-disk bitmap which records which regions are
valid in the destination device, i.e., which regions have already been
hydrated, or have been written to directly, via user I/O.

Setting a bit in the on-disk bitmap means the corresponding region is
valid in the destination device and we redirect all I/O regarding it to
the destination device.

Suppose the destination device has a volatile write-back cache and the
following sequence of events occur:

1. A region gets hydrated, either through the background hydration or
   because it was written to directly, via user I/O.

2. The commit timeout expires and we commit the metadata, marking that
   region as valid in the destination device.

3. The system crashes and the destination device's cache has not been
   flushed, meaning the region's data are lost.

The next time we read that region we read it from the destination
device, since the metadata have been successfully committed, but the
data are lost due to the crash, so we read garbage instead of the old
data.

For more information regarding the implications of this please see the
relevant commit.

To solve this and avoid the potential data corruption we have to flush
the destination device before committing the metadata.

This ensures that any freshly hydrated regions, for which we commit the
metadata, are properly written to non-volatile storage and won't be lost
in case of a crash.
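
In other words, the commit path must order things like this (a
simplified sketch; flush_destination_device() is a hypothetical helper,
the actual implementation is in patch 3/3):

static int commit(struct clone *clone)
{
	int r;

	/* 1. Make freshly hydrated data durable on the destination device. */
	r = flush_destination_device(clone);
	if (r)
		return r;

	/* 2. Only then persist the "region valid" bits for those regions. */
	return dm_clone_metadata_commit(clone->cmd);
}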

Nikos Tsironis (3):
  dm clone metadata: Track exact changes per transaction
  dm clone metadata: Use a two phase commit
  dm clone: Flush destination device before committing metadata

 drivers/md/dm-clone-metadata.c | 136 ++---
 drivers/md/dm-clone-metadata.h |  17 ++
 drivers/md/dm-clone-target.c   |  53 +---
 3 files changed, 162 insertions(+), 44 deletions(-)

-- 
2.11.0


--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel



[dm-devel] [PATCH 2/2] dm thin: Flush data device before committing metadata

2019-12-04 Thread Nikos Tsironis
The thin provisioning target maintains per thin device mappings that map
virtual blocks to data blocks in the data device.

When we write to a shared block, in case of internal snapshots, or
provision a new block, in case of external snapshots, we copy the shared
block to a new data block (COW), update the mapping for the relevant
virtual block and then issue the write to the new data block.

Suppose the data device has a volatile write-back cache and the
following sequence of events occur:

1. We write to a shared block
2. A new data block is allocated
3. We copy the shared block to the new data block using kcopyd (COW)
4. We insert the new mapping for the virtual block in the btree for that
   thin device.
5. The commit timeout expires and we commit the metadata, that now
   includes the new mapping from step (4).
6. The system crashes and the data device's cache has not been flushed,
   meaning that the COWed data are lost.

The next time we read that virtual block of the thin device we read it
from the data block allocated in step (2), since the metadata have been
successfully committed. The data are lost due to the crash, so we read
garbage instead of the old, shared data.

This has the following implications:

1. In case of writes to shared blocks, with size smaller than the pool's
   block size (which means we first copy the whole block and then issue
   the smaller write), we corrupt data that the user never touched.

2. In case of writes to shared blocks, with size equal to the device's
   logical block size, we fail to provide atomic sector writes. When the
   system recovers the user will read garbage from that sector instead
   of the old data or the new data.

3. Even for writes to shared blocks, with size equal to the pool's block
   size (overwrites), after the system recovers, the written sectors
   will contain garbage instead of a random mix of sectors containing
   either old data or new data, thus we fail again to provide atomic
   sector writes.

4. Even when the user flushes the thin device, because we first commit
   the metadata and then pass down the flush, the same risk for
   corruption exists (if the system crashes after the metadata have been
   committed but before the flush is passed down to the data device.)

The only case which is unaffected is that of writes with size equal to
the pool's block size and with the FUA flag set. But, because FUA writes
trigger metadata commits, this case can trigger the corruption
indirectly.

Moreover, apart from internal and external snapshots, the same issue
exists for newly provisioned blocks, when block zeroing is enabled.
After the system recovers the provisioned blocks might contain garbage
instead of zeroes.

To solve this and avoid the potential data corruption we flush the
pool's data device **before** committing its metadata.

This ensures that the data blocks of any newly inserted mappings are
properly written to non-volatile storage and won't be lost in case of a
crash.

Cc: sta...@vger.kernel.org
Signed-off-by: Nikos Tsironis 
---
 drivers/md/dm-thin.c | 32 
 1 file changed, 32 insertions(+)

diff --git a/drivers/md/dm-thin.c b/drivers/md/dm-thin.c
index 5a2c494cb552..e0be545080d0 100644
--- a/drivers/md/dm-thin.c
+++ b/drivers/md/dm-thin.c
@@ -3180,6 +3180,34 @@ static void metadata_low_callback(void *context)
dm_table_event(pool->ti->table);
 }
 
+/*
+ * We need to flush the data device **before** committing the metadata.
+ *
+ * This ensures that the data blocks of any newly inserted mappings are
+ * properly written to non-volatile storage and won't be lost in case of a
+ * crash.
+ *
+ * Failure to do so can result in data corruption in the case of internal or
+ * external snapshots and in the case of newly provisioned blocks, when block
+ * zeroing is enabled.
+ */
+static int metadata_pre_commit_callback(void *context)
+{
+   struct pool_c *pt = context;
+   struct bio bio;
+   int r;
+
+   bio_init(&bio, NULL, 0);
+   bio_set_dev(&bio, pt->data_dev->bdev);
+   bio.bi_opf = REQ_OP_WRITE | REQ_PREFLUSH;
+
+   r = submit_bio_wait(&bio);
+
+   bio_uninit(&bio);
+
+   return r;
+}
+
 static sector_t get_dev_size(struct block_device *bdev)
 {
return i_size_read(bdev->bd_inode) >> SECTOR_SHIFT;
@@ -3374,6 +3402,10 @@ static int pool_ctr(struct dm_target *ti, unsigned argc, char **argv)
if (r)
goto out_flags_changed;
 
+   dm_pool_register_pre_commit_callback(pt->pool->pmd,
+metadata_pre_commit_callback,
+pt);
+
pt->callbacks.congested_fn = pool_is_congested;
dm_table_add_target_callbacks(ti->table, &pt->callbacks);
 
-- 
2.11.0


--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel



[dm-devel] [PATCH 1/2] dm thin metadata: Add support for a pre-commit callback

2019-12-04 Thread Nikos Tsironis
Add support for one pre-commit callback which is run right before the
metadata are committed.

This allows the thin provisioning target to run a callback before the
metadata are committed and is required by the next commit.

Cc: sta...@vger.kernel.org
Signed-off-by: Nikos Tsironis 
---
 drivers/md/dm-thin-metadata.c | 29 +
 drivers/md/dm-thin-metadata.h |  7 +++
 2 files changed, 36 insertions(+)

diff --git a/drivers/md/dm-thin-metadata.c b/drivers/md/dm-thin-metadata.c
index 4c68a7b93d5e..b88d6d701f5b 100644
--- a/drivers/md/dm-thin-metadata.c
+++ b/drivers/md/dm-thin-metadata.c
@@ -189,6 +189,15 @@ struct dm_pool_metadata {
sector_t data_block_size;
 
/*
+* Pre-commit callback.
+*
+* This allows the thin provisioning target to run a callback before
+* the metadata are committed.
+*/
+   dm_pool_pre_commit_fn pre_commit_fn;
+   void *pre_commit_context;
+
+   /*
 * We reserve a section of the metadata for commit overhead.
 * All reported space does *not* include this.
 */
@@ -826,6 +835,14 @@ static int __commit_transaction(struct dm_pool_metadata *pmd)
if (unlikely(!pmd->in_service))
return 0;
 
+   if (pmd->pre_commit_fn) {
+   r = pmd->pre_commit_fn(pmd->pre_commit_context);
+   if (r < 0) {
+   DMERR("pre-commit callback failed");
+   return r;
+   }
+   }
+
r = __write_changed_details(pmd);
if (r < 0)
return r;
@@ -892,6 +909,8 @@ struct dm_pool_metadata *dm_pool_metadata_open(struct block_device *bdev,
pmd->in_service = false;
pmd->bdev = bdev;
pmd->data_block_size = data_block_size;
+   pmd->pre_commit_fn = NULL;
+   pmd->pre_commit_context = NULL;
 
r = __create_persistent_data_objects(pmd, format_device);
if (r) {
@@ -2044,6 +2063,16 @@ int dm_pool_register_metadata_threshold(struct dm_pool_metadata *pmd,
return r;
 }
 
+void dm_pool_register_pre_commit_callback(struct dm_pool_metadata *pmd,
+ dm_pool_pre_commit_fn fn,
+ void *context)
+{
+   pmd_write_lock_in_core(pmd);
+   pmd->pre_commit_fn = fn;
+   pmd->pre_commit_context = context;
+   pmd_write_unlock(pmd);
+}
+
 int dm_pool_metadata_set_needs_check(struct dm_pool_metadata *pmd)
 {
int r = -EINVAL;
diff --git a/drivers/md/dm-thin-metadata.h b/drivers/md/dm-thin-metadata.h
index f6be0d733c20..7ef56bd2a7e3 100644
--- a/drivers/md/dm-thin-metadata.h
+++ b/drivers/md/dm-thin-metadata.h
@@ -230,6 +230,13 @@ bool dm_pool_metadata_needs_check(struct dm_pool_metadata *pmd);
  */
 void dm_pool_issue_prefetches(struct dm_pool_metadata *pmd);
 
+/* Pre-commit callback */
+typedef int (*dm_pool_pre_commit_fn)(void *context);
+
+void dm_pool_register_pre_commit_callback(struct dm_pool_metadata *pmd,
+ dm_pool_pre_commit_fn fn,
+ void *context);
+
 /**/
 
 #endif
-- 
2.11.0


--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel



[dm-devel] [PATCH 0/2] dm thin: Flush data device before committing metadata to avoid data corruption

2019-12-04 Thread Nikos Tsironis
The thin provisioning target maintains per thin device mappings that map
virtual blocks to data blocks in the data device.

When we write to a shared block, in case of internal snapshots, or
provision a new block, in case of external snapshots, we copy the shared
block to a new data block (COW), update the mapping for the relevant
virtual block and then issue the write to the new data block.

Suppose the data device has a volatile write-back cache and the
following sequence of events occur:

1. We write to a shared block
2. A new data block is allocated
3. We copy the shared block to the new data block using kcopyd (COW)
4. We insert the new mapping for the virtual block in the btree for that
   thin device.
5. The commit timeout expires and we commit the metadata, that now
   includes the new mapping from step (4).
6. The system crashes and the data device's cache has not been flushed,
   meaning that the COWed data are lost.

The next time we read that virtual block of the thin device we read it
from the data block allocated in step (2), since the metadata have been
successfully committed. The data are lost due to the crash, so we read
garbage instead of the old, shared data.

Moreover, apart from internal and external snapshots, the same issue
exists for newly provisioned blocks, when block zeroing is enabled.
After the system recovers the provisioned blocks might contain garbage
instead of zeroes.

For more information regarding the implications of this please see the
relevant commit.

To solve this and avoid the potential data corruption we have to flush
the pool's data device before committing its metadata.

This ensures that the data blocks of any newly inserted mappings are
properly written to non-volatile storage and won't be lost in case of a
crash.

Nikos Tsironis (2):
  dm thin metadata: Add support for a pre-commit callback
  dm thin: Flush data device before committing metadata

 drivers/md/dm-thin-metadata.c | 29 +
 drivers/md/dm-thin-metadata.h |  7 +++
 drivers/md/dm-thin.c  | 32 
 3 files changed, 68 insertions(+)

-- 
2.11.0


--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel



[dm-devel] [PATCH 1/3] dm clone metadata: Track exact changes per transaction

2019-12-04 Thread Nikos Tsironis
Extend struct dirty_map with a second bitmap which tracks the exact
regions that were hydrated during the current metadata transaction.

Moreover, fix __flush_dmap() to only commit the metadata of the regions
that were hydrated during the current transaction.

This is required by the following commits to fix a data corruption bug.

Fixes: 7431b7835f55 ("dm: add clone target")
Cc: sta...@vger.kernel.org # v5.4+
Signed-off-by: Nikos Tsironis 
---
 drivers/md/dm-clone-metadata.c | 90 +-
 1 file changed, 62 insertions(+), 28 deletions(-)

diff --git a/drivers/md/dm-clone-metadata.c b/drivers/md/dm-clone-metadata.c
index 08c552e5e41b..ee870a425ab8 100644
--- a/drivers/md/dm-clone-metadata.c
+++ b/drivers/md/dm-clone-metadata.c
@@ -67,23 +67,34 @@ struct superblock_disk {
  * To save constantly doing look ups on disk we keep an in core copy of the
  * on-disk bitmap, the region_map.
  *
- * To further reduce metadata I/O overhead we use a second bitmap, the dmap
- * (dirty bitmap), which tracks the dirty words, i.e. longs, of the region_map.
+ * In order to track which regions are hydrated during a metadata transaction,
+ * we use a second set of bitmaps, the dmap (dirty bitmap), which includes two
+ * bitmaps, namely dirty_regions and dirty_words. The dirty_regions bitmap
+ * tracks the regions that got hydrated during the current metadata
+ * transaction. The dirty_words bitmap tracks the dirty words, i.e. longs, of
+ * the dirty_regions bitmap.
+ *
+ * This allows us to precisely track the regions that were hydrated during the
+ * current metadata transaction and update the metadata accordingly, when we
+ * commit the current transaction. This is important because dm-clone should
+ * only commit the metadata of regions that were properly flushed to the
+ * destination device beforehand. Otherwise, in case of a crash, we could end
+ * up with a corrupted dm-clone device.
  *
  * When a region finishes hydrating dm-clone calls
  * dm_clone_set_region_hydrated(), or for discard requests
  * dm_clone_cond_set_range(), which sets the corresponding bits in region_map
  * and dmap.
  *
- * During a metadata commit we scan the dmap for dirty region_map words (longs)
- * and update accordingly the on-disk metadata. Thus, we don't have to flush to
- * disk the whole region_map. We can just flush the dirty region_map words.
+ * During a metadata commit we scan dmap->dirty_words and dmap->dirty_regions
+ * and update the on-disk metadata accordingly. Thus, we don't have to flush to
+ * disk the whole region_map. We can just flush the dirty region_map bits.
  *
- * We use a dirty bitmap, which is smaller than the original region_map, to
- * reduce the amount of memory accesses during a metadata commit. As dm-bitset
- * accesses the on-disk bitmap in 64-bit word granularity, there is no
- * significant benefit in tracking the dirty region_map bits with a smaller
- * granularity.
+ * We use the helper dmap->dirty_words bitmap, which is smaller than the
+ * original region_map, to reduce the amount of memory accesses during a
+ * metadata commit. Moreover, as dm-bitset also accesses the on-disk bitmap in
+ * 64-bit word granularity, the dirty_words bitmap helps us avoid useless disk
+ * accesses.
  *
  * We could update directly the on-disk bitmap, when dm-clone calls either
  * dm_clone_set_region_hydrated() or dm_clone_cond_set_range(), buts this
@@ -92,12 +103,13 @@ struct superblock_disk {
  * e.g., in a hooked overwrite bio's completion routine, and further reduce the
  * I/O completion latency.
  *
- * We maintain two dirty bitmaps. During a metadata commit we atomically swap
- * the currently used dmap with the unused one. This allows the metadata update
- * functions to run concurrently with an ongoing commit.
+ * We maintain two dirty bitmap sets. During a metadata commit we atomically
+ * swap the currently used dmap with the unused one. This allows the metadata
+ * update functions to run concurrently with an ongoing commit.
  */
 struct dirty_map {
unsigned long *dirty_words;
+   unsigned long *dirty_regions;
unsigned int changed;
 };
 
@@ -461,22 +473,40 @@ static size_t bitmap_size(unsigned long nr_bits)
return BITS_TO_LONGS(nr_bits) * sizeof(long);
 }
 
-static int dirty_map_init(struct dm_clone_metadata *cmd)
+static int __dirty_map_init(struct dirty_map *dmap, unsigned long nr_words,
+   unsigned long nr_regions)
 {
-   cmd->dmap[0].changed = 0;
-   cmd->dmap[0].dirty_words = kvzalloc(bitmap_size(cmd->nr_words), GFP_KERNEL);
+   dmap->changed = 0;
 
-   if (!cmd->dmap[0].dirty_words) {
-   DMERR("Failed to allocate dirty bitmap");
+   dmap->dirty_words = kvzalloc(bitmap_size(nr_words), GFP_KERNEL);
+   if (!dmap->dirty_words)
+   return -ENOMEM;
+
+   dmap->dirty_regions = kvzalloc(bitmap_size(nr_regions), GFP_KERNEL);
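
To make the bookkeeping concrete, here is a small sketch of how a hydrated region gets recorded in both bitmaps; the helper name and the omitted locking are assumptions for illustration, based on the hunks above:

static void __set_region_hydrated_sketch(struct dm_clone_metadata *cmd,
					 unsigned long region_nr)
{
	struct dirty_map *dmap = cmd->current_dmap;

	/* Mark the region valid in the in-core copy of the on-disk bitmap. */
	__set_bit(region_nr, cmd->region_map);

	/* Remember that this region was hydrated in the current transaction... */
	__set_bit(region_nr, dmap->dirty_regions);

	/* ...and which word of region_map must be written out on commit. */
	__set_bit(BIT_WORD(region_nr), dmap->dirty_words);

	dmap->changed = 1;
}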

[dm-devel] [PATCH 3/3] dm clone: Flush destination device before committing metadata

2019-12-04 Thread Nikos Tsironis
dm-clone maintains an on-disk bitmap which records which regions are
valid in the destination device, i.e., which regions have already been
hydrated, or have been written to directly, via user I/O.

Setting a bit in the on-disk bitmap means the corresponding region is
valid in the destination device and we redirect all I/O regarding it to
the destination device.

Suppose the destination device has a volatile write-back cache and the
following sequence of events occur:

1. A region gets hydrated, either through the background hydration or
   because it was written to directly, via user I/O.

2. The commit timeout expires and we commit the metadata, marking that
   region as valid in the destination device.

3. The system crashes and the destination device's cache has not been
   flushed, meaning the region's data are lost.

The next time we read that region we read it from the destination
device, since the metadata have been successfully committed, but the
data are lost due to the crash, so we read garbage instead of the old
data.

This has several implications:

1. In case of background hydration or of writes with size smaller than
   the region size (which means we first copy the whole region and then
   issue the smaller write), we corrupt data that the user never
   touched.

2. In case of writes with size equal to the device's logical block size,
   we fail to provide atomic sector writes. When the system recovers the
   user will read garbage from the sector instead of the old data or the
   new data.

3. In case of writes without the FUA flag set, after the system
   recovers, the written sectors will contain garbage instead of a
   random mix of sectors containing either old data or new data, thus we
   fail again to provide atomic sector writes.

4. Even when the user flushes the dm-clone device, because we first
   commit the metadata and then pass down the flush, the same risk for
   corruption exists (if the system crashes after the metadata have been
   committed but before the flush is passed down).

The only case which is unaffected is that of writes with size equal to
the region size and with the FUA flag set. But, because FUA writes
trigger metadata commits, this case can trigger the corruption
indirectly.

To solve this and avoid the potential data corruption we flush the
destination device **before** committing the metadata.

This ensures that any freshly hydrated regions, for which we commit the
metadata, are properly written to non-volatile storage and won't be lost
in case of a crash.

Fixes: 7431b7835f55 ("dm: add clone target")
Cc: sta...@vger.kernel.org # v5.4+
Signed-off-by: Nikos Tsironis 
---
 drivers/md/dm-clone-target.c | 46 ++--
 1 file changed, 40 insertions(+), 6 deletions(-)

diff --git a/drivers/md/dm-clone-target.c b/drivers/md/dm-clone-target.c
index 613c913c296c..d1e1b5b56b1b 100644
--- a/drivers/md/dm-clone-target.c
+++ b/drivers/md/dm-clone-target.c
@@ -86,6 +86,12 @@ struct clone {
 
struct dm_clone_metadata *cmd;
 
+   /*
+* bio used to flush the destination device, before committing the
+* metadata.
+*/
+   struct bio flush_bio;
+
/* Region hydration hash table */
struct hash_table_bucket *ht;
 
@@ -1108,10 +1114,13 @@ static bool need_commit_due_to_time(struct clone *clone)
 /*
  * A non-zero return indicates read-only or fail mode.
  */
-static int commit_metadata(struct clone *clone)
+static int commit_metadata(struct clone *clone, bool *dest_dev_flushed)
 {
int r = 0;
 
+   if (dest_dev_flushed)
+   *dest_dev_flushed = false;
+
mutex_lock(&clone->commit_lock);
 
if (!dm_clone_changed_this_transaction(clone->cmd))
@@ -1128,6 +1137,19 @@ static int commit_metadata(struct clone *clone)
goto out;
}
 
+   bio_reset(&clone->flush_bio);
+   bio_set_dev(&clone->flush_bio, clone->dest_dev->bdev);
+   clone->flush_bio.bi_opf = REQ_OP_WRITE | REQ_PREFLUSH;
+
+   r = submit_bio_wait(&clone->flush_bio);
+   if (unlikely(r)) {
+   __metadata_operation_failed(clone, "flush destination device", r);
+   goto out;
+   }
+
+   if (dest_dev_flushed)
+   *dest_dev_flushed = true;
+
r = dm_clone_metadata_commit(clone->cmd);
if (unlikely(r)) {
__metadata_operation_failed(clone, "dm_clone_metadata_commit", r);
@@ -1199,6 +1221,7 @@ static void process_deferred_bios(struct clone *clone)
 static void process_deferred_flush_bios(struct clone *clone)
 {
struct bio *bio;
+   bool dest_dev_flushed;
struct bio_list bios = BIO_EMPTY_LIST;
struct bio_list bio_completions = BIO_EMPTY_LIST;
 
@@ -1218,7 +1241,7 @@ static void process_deferred_flush_bios(struct clone *clone)
!(dm_clone_changed_this_transaction(clone->cmd) && 
need_commit_due_to_time(
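
For completeness, a condensed sketch of how the deferred flush bios are handled once the commit, and therefore the destination flush, has happened; the variable names follow the hunks above, and the bio_completions handling is omitted:

	bool dest_dev_flushed;
	struct bio *bio;

	/* Commit the metadata; as shown above, this flushes the
	 * destination device first. */
	if (commit_metadata(clone, &dest_dev_flushed)) {
		while ((bio = bio_list_pop(&bios)))
			bio_io_error(bio);
		return;
	}

	while ((bio = bio_list_pop(&bios))) {
		if (dest_dev_flushed && (bio->bi_opf & REQ_PREFLUSH)) {
			/* The destination was just flushed as part of the
			 * commit, so this flush bio can complete at once. */
			bio_endio(bio);
		} else {
			generic_make_request(bio);
		}
	}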

[dm-devel] [PATCH 2/3] dm clone metadata: Use a two phase commit

2019-12-04 Thread Nikos Tsironis
Split the metadata commit in two parts:

1. dm_clone_metadata_pre_commit(): Prepare the current transaction for
   committing. After this is called, all subsequent metadata updates,
   done through either dm_clone_set_region_hydrated() or
   dm_clone_cond_set_range(), will be part of the next transaction.

2. dm_clone_metadata_commit(): Actually commit the current transaction
   to disk and start a new transaction.

This is required by the following commit. It allows dm-clone to flush
the destination device after step (1) to ensure that all freshly
hydrated regions, for which we are updating the metadata, are properly
written to non-volatile storage and won't be lost in case of a crash.

Fixes: 7431b7835f55 ("dm: add clone target")
Cc: sta...@vger.kernel.org # v5.4+
Signed-off-by: Nikos Tsironis 
---
 drivers/md/dm-clone-metadata.c | 46 +-
 drivers/md/dm-clone-metadata.h | 17 
 drivers/md/dm-clone-target.c   |  7 ++-
 3 files changed, 60 insertions(+), 10 deletions(-)

diff --git a/drivers/md/dm-clone-metadata.c b/drivers/md/dm-clone-metadata.c
index ee870a425ab8..c05b12110456 100644
--- a/drivers/md/dm-clone-metadata.c
+++ b/drivers/md/dm-clone-metadata.c
@@ -127,6 +127,9 @@ struct dm_clone_metadata {
struct dirty_map dmap[2];
struct dirty_map *current_dmap;
 
+   /* Protected by lock */
+   struct dirty_map *committing_dmap;
+
/*
 * In core copy of the on-disk bitmap to save constantly doing look ups
 * on disk.
@@ -511,6 +514,7 @@ static int dirty_map_init(struct dm_clone_metadata *cmd)
}
 
cmd->current_dmap = &cmd->dmap[0];
+   cmd->committing_dmap = NULL;
 
return 0;
 }
@@ -775,15 +779,17 @@ static int __flush_dmap(struct dm_clone_metadata *cmd, struct dirty_map *dmap)
return 0;
 }
 
-int dm_clone_metadata_commit(struct dm_clone_metadata *cmd)
+int dm_clone_metadata_pre_commit(struct dm_clone_metadata *cmd)
 {
-   int r = -EPERM;
+   int r = 0;
struct dirty_map *dmap, *next_dmap;
 
down_write(&cmd->lock);
 
-   if (cmd->fail_io || dm_bm_is_read_only(cmd->bm))
+   if (cmd->fail_io || dm_bm_is_read_only(cmd->bm)) {
+   r = -EPERM;
goto out;
+   }
 
/* Get current dirty bitmap */
dmap = cmd->current_dmap;
@@ -795,7 +801,7 @@ int dm_clone_metadata_commit(struct dm_clone_metadata *cmd)
 * The last commit failed, so we don't have a clean dirty-bitmap to
 * use.
 */
-   if (WARN_ON(next_dmap->changed)) {
+   if (WARN_ON(next_dmap->changed || cmd->committing_dmap)) {
r = -EINVAL;
goto out;
}
@@ -805,11 +811,33 @@ int dm_clone_metadata_commit(struct dm_clone_metadata *cmd)
cmd->current_dmap = next_dmap;
spin_unlock_irq(&cmd->bitmap_lock);
 
-   /*
-* No one is accessing the old dirty bitmap anymore, so we can flush
-* it.
-*/
-   r = __flush_dmap(cmd, dmap);
+   /* Set old dirty bitmap as currently committing */
+   cmd->committing_dmap = dmap;
+out:
+   up_write(&cmd->lock);
+
+   return r;
+}
+
+int dm_clone_metadata_commit(struct dm_clone_metadata *cmd)
+{
+   int r = -EPERM;
+
+   down_write(&cmd->lock);
+
+   if (cmd->fail_io || dm_bm_is_read_only(cmd->bm))
+   goto out;
+
+   if (WARN_ON(!cmd->committing_dmap)) {
+   r = -EINVAL;
+   goto out;
+   }
+
+   r = __flush_dmap(cmd, cmd->committing_dmap);
+   if (!r) {
+   /* Clear committing dmap */
+   cmd->committing_dmap = NULL;
+   }
 out:
up_write(&cmd->lock);
 
diff --git a/drivers/md/dm-clone-metadata.h b/drivers/md/dm-clone-metadata.h
index 3fe50a781c11..14af1ebd853f 100644
--- a/drivers/md/dm-clone-metadata.h
+++ b/drivers/md/dm-clone-metadata.h
@@ -75,7 +75,23 @@ void dm_clone_metadata_close(struct dm_clone_metadata *cmd);
 
 /*
  * Commit dm-clone metadata to disk.
+ *
+ * We use a two phase commit:
+ *
+ * 1. dm_clone_metadata_pre_commit(): Prepare the current transaction for
+ *committing. After this is called, all subsequent metadata updates, done
+ *through either dm_clone_set_region_hydrated() or
+ *dm_clone_cond_set_range(), will be part of the **next** transaction.
+ *
+ * 2. dm_clone_metadata_commit(): Actually commit the current transaction to
+ *disk and start a new transaction.
+ *
+ * This allows dm-clone to flush the destination device after step (1) to
+ * ensure that all freshly hydrated regions, for which we are updating the
+ * metadata, are properly written to non-volatile storage and won't be lost in
+ * case of a crash.
  */
+int dm_clone_metadata_pre_commit(struct dm_clone_metadata *cmd);
 int dm_clone_metadata_commit(struct dm_clone_metadata *cmd);
 
 /*
@@ -112,6 +128,7 @@ int dm_clone_metadata_abort(struc
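
Putting the two phases together, a condensed sketch of the resulting commit path in dm-clone-target.c, based on the hunks in this series (error reporting via __metadata_operation_failed() omitted):

static int commit_metadata_sketch(struct clone *clone)
{
	int r;

	/* Phase 1: freeze the current transaction's dirty bitmaps. */
	r = dm_clone_metadata_pre_commit(clone->cmd);
	if (unlikely(r))
		return r;

	/* Flush the destination device, so the freshly hydrated regions
	 * recorded in those bitmaps are durable. */
	bio_reset(&clone->flush_bio);
	bio_set_dev(&clone->flush_bio, clone->dest_dev->bdev);
	clone->flush_bio.bi_opf = REQ_OP_WRITE | REQ_PREFLUSH;

	r = submit_bio_wait(&clone->flush_bio);
	if (unlikely(r))
		return r;

	/* Phase 2: now it is safe to persist the metadata. */
	return dm_clone_metadata_commit(clone->cmd);
}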

Re: [dm-devel] dm clone: Add to the documentation index

2019-11-26 Thread Nikos Tsironis

On 11/26/19 5:40 PM, Mike Snitzer wrote:

On Tue, Nov 26 2019 at  7:00am -0500,
Nikos Tsironis  wrote:


From: Diego Calleja 

It was missing from the initial commit

Signed-off-by: Diego Calleja 

---
  Documentation/admin-guide/device-mapper/index.rst | 1 +
  1 file changed, 1 insertion(+)

diff --git a/Documentation/admin-guide/device-mapper/index.rst b/
Documentation/admin-guide/device-mapper/index.rst
index c77c58b8f67b..d8dec8911eb3 100644
--- a/Documentation/admin-guide/device-mapper/index.rst
+++ b/Documentation/admin-guide/device-mapper/index.rst
@@ -8,6 +8,7 @@ Device Mapper
  cache-policies
  cache
  delay
+dm-clone
  dm-crypt
  dm-flakey
  dm-init
--
2.24.0


I've picked this up:
https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=dm-5.5&id=484e0d2b11e1fdd0d17702b282eb2ed56148385f

Nikos, please note that if you send a patch on someone else's behalf you
should add your Signed-off-by.  I've updated the commit header
accordingly.


You are right, I am sorry. I will keep that in mind the next time.

Thanks,
Nikos



Mike



--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel



[dm-devel] [PATCH] dm clone: Add to the documentation index

2019-11-26 Thread Nikos Tsironis

From: Diego Calleja 

It was missing from the initial commit

Signed-off-by: Diego Calleja 

---
 Documentation/admin-guide/device-mapper/index.rst | 1 +
 1 file changed, 1 insertion(+)

diff --git a/Documentation/admin-guide/device-mapper/index.rst b/
Documentation/admin-guide/device-mapper/index.rst
index c77c58b8f67b..d8dec8911eb3 100644
--- a/Documentation/admin-guide/device-mapper/index.rst
+++ b/Documentation/admin-guide/device-mapper/index.rst
@@ -8,6 +8,7 @@ Device Mapper
 cache-policies
 cache
 delay
+dm-clone
 dm-crypt
 dm-flakey
 dm-init
--
2.24.0




--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel



Re: [dm-devel] [PATCH 1/2] dm-snapshot: fix crash with the realtime kernel

2019-11-19 Thread Nikos Tsironis
On 11/12/19 9:50 AM, Mikulas Patocka wrote:
> 
> 
> On Mon, 11 Nov 2019, Mike Snitzer wrote:
> 
>> On Mon, Nov 11 2019 at 11:37am -0500,
>> Nikos Tsironis  wrote:
>>
>>> On 11/11/19 3:59 PM, Mikulas Patocka wrote:
>>>> Snapshot doesn't work with realtime kernels since the commit f79ae415b64c.
>>>> hlist_bl is implemented as a raw spinlock and the code takes two non-raw
>>>> spinlocks while holding hlist_bl (non-raw spinlocks are blocking mutexes
>>>> in the realtime kernel, so they couldn't be taken inside a raw spinlock).
>>>>
>>>> This patch fixes the problem by using non-raw spinlock
>>>> exception_table_lock instead of the hlist_bl lock.
>>>>
>>>> Signed-off-by: Mikulas Patocka 
>>>> Fixes: f79ae415b64c ("dm snapshot: Make exception tables scalable")
>>>>
>>>
>>> Hi Mikulas,
>>>
>>> I wasn't aware that hlist_bl is implemented as a raw spinlock in the
>>> real time kernel. I would expect it to be a standard non-raw spinlock,
>>> so everything works as expected. But, after digging further in the real
>>> time tree, I found commit ad7675b15fd87f1 ("list_bl: Make list head
>>> locking RT safe") which suggests that such a conversion would break
>>> other parts of the kernel.
>>
>> Right, the proper fix is to update list_bl to work on realtime (which I
>> assume the referenced commit does).  I do not want to take this
>> dm-snapshot specific workaround that open-codes what should be done
>> within hlist_{bl_lock,unlock}, etc.
> 
> If we change list_bl to use non-raw spinlock, it fails in dentry lookup 
> code. The dentry code takes a seqlock (which is implemented as preempt 
> disable in the realtime kernel) and then takes a list_bl lock.
> 
> This is wrong from the real-time perspective (the chain in the hash could 
> be arbitrarily long, so using non-raw spinlock could cause unbounded 
> wait), however we can't do anything with it.
> 
> I think that fixing dm-snapshot is way easier than fixing the dentry code. 
> If you have an idea how to fix the dentry code, tell us.
> 

I too think that it would be better to fix list_bl. dm-snapshot isn't
really broken. One should be able to acquire a spinlock while holding
another spinlock.

Moreover, apart from dm-snapshot, anyone ever using list_bl is at risk
of breaking the realtime kernel, if he or she is not aware of that
particular limitation of list_bl's implementation in the realtime tree.

But, I agree that it's a lot easier "fixing" dm-snapshot than fixing the
dentry code.

>> I'm not yet sure which realtime mailing list and/or maintainers should
>> be cc'd to further the inclussion of commit ad7675b15fd87f1 -- Nikos do
>> you?

No, unfortunately, I don't know for sure either. [1] and [2] suggest
that the relevant mailing lists are LKML and linux-rt-users and the
maintainers are Sebastian Siewior, Thomas Gleixner and Steven Rostedt.

I believe they are already Cc'd in the other thread regarding Mikulas'
"realtime: avoid BUG when the list is not locked" patch (for some reason
the thread doesn't properly appear in dm-devel archives and also my
mails to dm-devel have been failing since yesterday - Could there be an
issue with the mailing list?), so maybe we should Cc them in this thread
too.

Nikos

[1] https://wiki.linuxfoundation.org/realtime/communication/mailinglists
[2] https://wiki.linuxfoundation.org/realtime/communication/send_rt_patches

>>
>> Thanks,
>> Mike
> 
> Mikulas
> 

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel



Re: [dm-devel] [PATCH 1/2] dm-snapshot: fix crash with the realtime kernel

2019-11-19 Thread Nikos Tsironis
On 11/11/19 3:59 PM, Mikulas Patocka wrote:
> Snapshot doesn't work with realtime kernels since the commit f79ae415b64c.
> hlist_bl is implemented as a raw spinlock and the code takes two non-raw
> spinlocks while holding hlist_bl (non-raw spinlocks are blocking mutexes
> in the realtime kernel, so they couldn't be taken inside a raw spinlock).
> 
> This patch fixes the problem by using non-raw spinlock
> exception_table_lock instead of the hlist_bl lock.
> 
> Signed-off-by: Mikulas Patocka 
> Fixes: f79ae415b64c ("dm snapshot: Make exception tables scalable")
> 

Hi Mikulas,

I wasn't aware that hlist_bl is implemented as a raw spinlock in the
real time kernel. I would expect it to be a standard non-raw spinlock,
so everything works as expected. But, after digging further in the real
time tree, I found commit ad7675b15fd87f1 ("list_bl: Make list head
locking RT safe") which suggests that such a conversion would break
other parts of the kernel.

That said,

  Reviewed-by: Nikos Tsironis 

> ---
>  drivers/md/dm-snap.c |   65 
> ---
>  1 file changed, 42 insertions(+), 23 deletions(-)
> 
> Index: linux-2.6/drivers/md/dm-snap.c
> ===
> --- linux-2.6.orig/drivers/md/dm-snap.c   2019-11-08 15:51:42.0 +0100
> +++ linux-2.6/drivers/md/dm-snap.c   2019-11-08 15:54:58.0 +0100
> @@ -141,6 +141,10 @@ struct dm_snapshot {
>* for them to be committed.
>*/
>   struct bio_list bios_queued_during_merge;
> +
> +#ifdef CONFIG_PREEMPT_RT_BASE
> + spinlock_t exception_table_lock;
> +#endif
>  };
>  
>  /*
> @@ -625,30 +629,42 @@ static uint32_t exception_hash(struct dm
>  
>  /* Lock to protect access to the completed and pending exception hash 
> tables. */
>  struct dm_exception_table_lock {
> +#ifndef CONFIG_PREEMPT_RT_BASE
>   struct hlist_bl_head *complete_slot;
>   struct hlist_bl_head *pending_slot;
> +#endif
>  };
>  
>  static void dm_exception_table_lock_init(struct dm_snapshot *s, chunk_t chunk,
>   struct dm_exception_table_lock *lock)
>  {
> +#ifndef CONFIG_PREEMPT_RT_BASE
>   struct dm_exception_table *complete = &s->complete;
>   struct dm_exception_table *pending = &s->pending;
>  
>   lock->complete_slot = &complete->table[exception_hash(complete, chunk)];
>   lock->pending_slot = &pending->table[exception_hash(pending, chunk)];
> +#endif
>  }
>  
> -static void dm_exception_table_lock(struct dm_exception_table_lock *lock)
> -static void dm_exception_table_lock(struct dm_exception_table_lock *lock)
> +static void dm_exception_table_lock(struct dm_snapshot *s, struct dm_exception_table_lock *lock)
>  {
> +#ifdef CONFIG_PREEMPT_RT_BASE
> + spin_lock(&s->exception_table_lock);
> +#else
>   hlist_bl_lock(lock->complete_slot);
>   hlist_bl_lock(lock->pending_slot);
> +#endif
>  }
>  
> -static void dm_exception_table_unlock(struct dm_exception_table_lock *lock)
> +static void dm_exception_table_unlock(struct dm_snapshot *s, struct dm_exception_table_lock *lock)
>  {
> +#ifdef CONFIG_PREEMPT_RT_BASE
> + spin_unlock(&s->exception_table_lock);
> +#else
>   hlist_bl_unlock(lock->pending_slot);
>   hlist_bl_unlock(lock->complete_slot);
> +#endif
>  }
>  
>  static int dm_exception_table_init(struct dm_exception_table *et,
> @@ -835,9 +851,9 @@ static int dm_add_exception(void *contex
>*/
>   dm_exception_table_lock_init(s, old, &lock);
>  
> - dm_exception_table_lock(&lock);
> + dm_exception_table_lock(s, &lock);
>   dm_insert_exception(&s->complete, e);
> - dm_exception_table_unlock(&lock);
> + dm_exception_table_unlock(s, &lock);
>  
>   return 0;
>  }
> @@ -1318,6 +1334,9 @@ static int snapshot_ctr(struct dm_target
>   s->first_merging_chunk = 0;
>   s->num_merging_chunks = 0;
>   bio_list_init(&s->bios_queued_during_merge);
> +#ifdef CONFIG_PREEMPT_RT_BASE
> + spin_lock_init(&s->exception_table_lock);
> +#endif
>  
>   /* Allocate hash table for COW data */
>   if (init_hash_tables(s)) {
> @@ -1651,7 +1670,7 @@ static void pending_complete(void *conte
>   invalidate_snapshot(s, -EIO);
>   error = 1;
>  
> - dm_exception_table_lock(&lock);
> + dm_exception_table_lock(s, &lock);
>   goto out;
>   }
>  
> @@ -1660,13 +1679,13 @@ static void pending_complete(void *conte
>   invalidate_snapshot(s, -ENOMEM);
>   error = 1;
>  
> - dm_exception_table_lock(&lock);
> + dm_exception_table_lock(s, &lock);
>  

Re: [dm-devel] [PATCH RT 1/2 v2] dm-snapshot: fix crash with the realtime kernel

2019-11-19 Thread Nikos Tsironis
On 11/12/19 6:09 PM, Mikulas Patocka wrote:
> Snapshot doesn't work with realtime kernels since the commit f79ae415b64c.
> hlist_bl is implemented as a raw spinlock and the code takes two non-raw
> spinlocks while holding hlist_bl (non-raw spinlocks are blocking mutexes
> in the realtime kernel).
> 
> We can't change hlist_bl to use non-raw spinlocks, this triggers warnings 
> in dentry lookup code, because the dentry lookup code uses hlist_bl while 
> holding a seqlock.
> 
> This patch fixes the problem by using non-raw spinlock 
> exception_table_lock instead of the hlist_bl lock.
> 
> Signed-off-by: Mikulas Patocka 
> Fixes: f79ae415b64c ("dm snapshot: Make exception tables scalable")
> 

Reviewed-by: Nikos Tsironis 

> ---
>  drivers/md/dm-snap.c |   23 +++
>  1 file changed, 23 insertions(+)
> 
> Index: linux-2.6/drivers/md/dm-snap.c
> ===
> --- linux-2.6.orig/drivers/md/dm-snap.c   2019-11-12 16:44:36.0 +0100
> +++ linux-2.6/drivers/md/dm-snap.c   2019-11-12 17:01:46.0 +0100
> @@ -141,6 +141,10 @@ struct dm_snapshot {
>* for them to be committed.
>*/
>   struct bio_list bios_queued_during_merge;
> +
> +#ifdef CONFIG_PREEMPT_RT_BASE
> + spinlock_t exception_table_lock;
> +#endif
>  };
>  
>  /*
> @@ -625,30 +629,46 @@ static uint32_t exception_hash(struct dm
>  
>  /* Lock to protect access to the completed and pending exception hash 
> tables. */
>  struct dm_exception_table_lock {
> +#ifndef CONFIG_PREEMPT_RT_BASE
>   struct hlist_bl_head *complete_slot;
>   struct hlist_bl_head *pending_slot;
> +#else
> + spinlock_t *lock;
> +#endif
>  };
>  
>  static void dm_exception_table_lock_init(struct dm_snapshot *s, chunk_t chunk,
>   struct dm_exception_table_lock *lock)
>  {
> +#ifndef CONFIG_PREEMPT_RT_BASE
>   struct dm_exception_table *complete = &s->complete;
>   struct dm_exception_table *pending = &s->pending;
>  
>   lock->complete_slot = &complete->table[exception_hash(complete, chunk)];
>   lock->pending_slot = &pending->table[exception_hash(pending, chunk)];
> +#else
> + lock->lock = &s->exception_table_lock;
> +#endif
>  }
>  
>  static void dm_exception_table_lock(struct dm_exception_table_lock *lock)
>  {
> +#ifndef CONFIG_PREEMPT_RT_BASE
>   hlist_bl_lock(lock->complete_slot);
>   hlist_bl_lock(lock->pending_slot);
> +#else
> + spin_lock(lock->lock);
> +#endif
>  }
>  
>  static void dm_exception_table_unlock(struct dm_exception_table_lock *lock)
>  {
> +#ifndef CONFIG_PREEMPT_RT_BASE
>   hlist_bl_unlock(lock->pending_slot);
>   hlist_bl_unlock(lock->complete_slot);
> +#else
> + spin_unlock(lock->lock);
> +#endif
>  }
>  
>  static int dm_exception_table_init(struct dm_exception_table *et,
> @@ -1318,6 +1338,9 @@ static int snapshot_ctr(struct dm_target
>   s->first_merging_chunk = 0;
>   s->num_merging_chunks = 0;
>   bio_list_init(&s->bios_queued_during_merge);
> +#ifdef CONFIG_PREEMPT_RT_BASE
> + spin_lock_init(&s->exception_table_lock);
> +#endif
>  
>   /* Allocate hash table for COW data */
>   if (init_hash_tables(s)) {
> 

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel



Re: [dm-devel] [PATCH RT 2/2 v2] list_bl: avoid BUG when the list is not locked

2019-11-19 Thread Nikos Tsironis
On 11/13/19 1:16 PM, Mikulas Patocka wrote:
> 
> 
> On Wed, 13 Nov 2019, Nikos Tsironis wrote:
> 
>> On 11/12/19 6:16 PM, Mikulas Patocka wrote:
>>> list_bl would crash with BUG() if we used it without locking. dm-snapshot 
>>> uses its own locking on realtime kernels (it can't use list_bl because 
>>> list_bl uses raw spinlock and dm-snapshot takes other non-raw spinlocks 
>>> while holding bl_lock).
>>>
>>> To avoid this BUG, we must set LIST_BL_LOCKMASK = 0.
>>>
>>> This patch is intended only for the realtime kernel patchset, not for the 
>>> upstream kernel.
>>>
>>> Signed-off-by: Mikulas Patocka 
>>>
>>> Index: linux-rt-devel/include/linux/list_bl.h
>>> ===
>>> --- linux-rt-devel.orig/include/linux/list_bl.h 2019-11-07 14:01:51.0 +0100
>>> +++ linux-rt-devel/include/linux/list_bl.h  2019-11-08 10:12:49.0 +0100
>>> @@ -19,7 +19,7 @@
>>>   * some fast and compact auxiliary data.
>>>   */
>>>  
>>> -#if defined(CONFIG_SMP) || defined(CONFIG_DEBUG_SPINLOCK)
>>> +#if (defined(CONFIG_SMP) || defined(CONFIG_DEBUG_SPINLOCK)) && !defined(CONFIG_PREEMPT_RT_BASE)
>>>  #define LIST_BL_LOCKMASK   1UL
>>>  #else
>>>  #define LIST_BL_LOCKMASK   0UL
>>> @@ -161,9 +161,6 @@ static inline void hlist_bl_lock(struct
>>> bit_spin_lock(0, (unsigned long *)b);
>>>  #else
>>> raw_spin_lock(&b->lock);
>>> -#if defined(CONFIG_SMP) || defined(CONFIG_DEBUG_SPINLOCK)
>>> -   __set_bit(0, (unsigned long *)b);
>>> -#endif
>>>  #endif
>>>  }
>>>  
>>
>> Hi Mikulas,
>>
>> I think removing __set_bit()/__clear_bit() breaks hlist_bl_is_locked(),
>> which is used by the RCU variant of list_bl.
>>
>> Nikos
> 
> OK. so I can remove this part of the patch.
> 

I think this causes another problem. LIST_BL_LOCKMASK is used in various
functions to set/clear the lock bit, e.g. in hlist_bl_first(). So, if we
lock the list through hlist_bl_lock(), thus setting the lock bit with
__set_bit(), and then call hlist_bl_first() to get the first element,
the returned pointer will be invalid. As LIST_BL_LOCKMASK is zero the
least significant bit of the pointer will be 1.
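
For reference, hlist_bl_first() strips the lock bit with LIST_BL_LOCKMASK, roughly as follows (paraphrased from include/linux/list_bl.h):

static inline struct hlist_bl_node *hlist_bl_first(struct hlist_bl_head *h)
{
	/* With LIST_BL_LOCKMASK == 0UL this mask clears nothing, so a lock
	 * bit set manually via __set_bit() leaks into the returned pointer. */
	return (struct hlist_bl_node *)
		((unsigned long)h->first & ~LIST_BL_LOCKMASK);
}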

I think for dm-snapshot to work using its own locking, and without
list_bl complaining, the following is sufficient:

--- a/include/linux/list_bl.h
+++ b/include/linux/list_bl.h
@@ -25,7 +25,7 @@
 #define LIST_BL_LOCKMASK   0UL
 #endif

-#ifdef CONFIG_DEBUG_LIST
+#if defined(CONFIG_DEBUG_LIST) && !defined(CONFIG_PREEMPT_RT_BASE)
 #define LIST_BL_BUG_ON(x) BUG_ON(x)
 #else
 #define LIST_BL_BUG_ON(x)

Nikos

> Mikulas
> 
>>> @@ -172,9 +169,6 @@ static inline void hlist_bl_unlock(struc
>>>  #ifndef CONFIG_PREEMPT_RT_BASE
>>> __bit_spin_unlock(0, (unsigned long *)b);
>>>  #else
>>> -#if defined(CONFIG_SMP) || defined(CONFIG_DEBUG_SPINLOCK)
>>> -   __clear_bit(0, (unsigned long *)b);
>>> -#endif
>>> raw_spin_unlock(&b->lock);
>>>  #endif
>>>  }
>>>
>>
> 

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel



[dm-devel] [PATCH RT 2/2 v4] list_bl: avoid BUG when the list is not locked

2019-11-19 Thread Nikos Tsironis
list_bl would crash with BUG() if we used it without locking.
dm-snapshot uses its own locking on realtime kernels (it can't use
list_bl because list_bl uses raw spinlock and dm-snapshot takes other
non-raw spinlocks while holding bl_lock).

To avoid this BUG we deactivate the list debug checks for list_bl on
realtime kernels.

This patch is intended only for the realtime kernel patchset, not for
the upstream kernel.

Signed-off-by: Nikos Tsironis 
Reviewed-by: Mikulas Patocka 
---
 include/linux/list_bl.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/linux/list_bl.h b/include/linux/list_bl.h
index da38433240f5..3585b2f6b948 100644
--- a/include/linux/list_bl.h
+++ b/include/linux/list_bl.h
@@ -25,7 +25,7 @@
 #define LIST_BL_LOCKMASK   0UL
 #endif
 
-#ifdef CONFIG_DEBUG_LIST
+#if defined(CONFIG_DEBUG_LIST) && !defined(CONFIG_PREEMPT_RT_BASE)
 #define LIST_BL_BUG_ON(x) BUG_ON(x)
 #else
 #define LIST_BL_BUG_ON(x)
-- 
2.11.0


--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel



Re: [dm-devel] Fix "dm kcopyd: Fix bug causing workqueue stalls" causes dead lock

2019-10-17 Thread Nikos Tsironis
On 10/17/19 8:58 AM, Guruswamy Basavaiah wrote:
>Hello Nikos,
>  Tested with your new patches. Issue is resolved. Thank you.

Hi Guru,

That's great. Thanks for testing the patches.

>  In second patch "struct wait_queue_head" to "wait_queue_head_t" for
> variable in_progress_wait, else compilation is failing with error
>  "error: field 'in_progress_wait' has incomplete type
>   struct wait_queue_head in_progress_wait;"

"struct wait_queue_head" was introduced by commit 9d9d676f595b50
("sched/wait: Standardize internal naming of wait-queue heads"), which
is included in kernels starting from v4.13.

So, the patch works fine with the latest kernel, but needs adapting for
older kernels, which I missed when rebasing the patches for the 4.4.x
kernel series.
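
Concretely, on pre-4.13 kernels the field has to be declared with the old
typedef, i.e.:

	wait_queue_head_t in_progress_wait;	/* pre-4.13 spelling */

instead of:

	struct wait_queue_head in_progress_wait;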

Nikos.

>  Attached the changed patch.
> 
> Guru
> 
> On Sat, 12 Oct 2019 at 14:16, Guruswamy Basavaiah  wrote:
>>
>> Hello Nikos,
>>  I am having some issues in our set-up, I will try to get the results ASAP.
>> Guru
>>
>>
>> On Fri, 11 Oct 2019 at 17:47, Nikos Tsironis  wrote:
>>>
>>> On 10/11/19 2:39 PM, Nikos Tsironis wrote:
>>>> On 10/11/19 1:17 PM, Guruswamy Basavaiah wrote:
>>>>> Hello Nikos,
>>>>>  Applied these patches and tested.
>>>>>  We still see hung_task_timeout back traces and the drbd Resync is 
>>>>> blocked.
>>>>>  Attached the back trace, please let me know if you need any other 
>>>>> information.
>>>>>
>>>>
>>>> Hi Guru,
>>>>
>>>> Can you provide more information about your setup? The output of
>>>> 'dmsetup table', 'dmsetup ls --tree' and the DRBD configuration would
>>>> help to get a better picture of your I/O stack.
>>>>
>>>> Also, is it possible to describe the test case you are running and
>>>> exactly what it does?
>>>>
>>>> Thanks,
>>>> Nikos
>>>>
>>>
>>> Hi Guru,
>>>
>>> I believe I found the mistake. The in_progress variable was never
>>> initialized to zero.
>>>
>>> I attach a new version of the second patch correcting this.
>>>
>>> Can you please test again with this patch?
>>>
>>> Thanks,
>>> Nikos
>>>
>>>>>  In patch "0002-dm-snapshot-rework-COW-throttling-to-fix-deadlock.patch"
>>>>> I change "struct wait_queue_head" to "wait_queue_head_t" as i was
>>>>> getting compilation error with former one.
>>>>>
>>>>> On Thu, 10 Oct 2019 at 17:33, Nikos Tsironis  
>>>>> wrote:
>>>>>>
>>>>>> On 10/10/19 9:34 AM, Guruswamy Basavaiah wrote:
>>>>>>> Hello,
>>>>>>> We use 4.4.184 in our builds and the patch fails to apply.
>>>>>>> Is it possible to give a patch for 4.4.x branch ?
>>>>>> Hi Guru,
>>>>>>
>>>>>> I attach the two patches fixing the deadlock rebased on the 4.4.x branch.
>>>>>>
>>>>>> Nikos
>>>>>>
>>>>>>>
>>>>>>> patching Logs.
>>>>>>> patching file drivers/md/dm-snap.c
>>>>>>> Hunk #1 succeeded at 19 (offset 1 line).
>>>>>>> Hunk #2 succeeded at 105 (offset -1 lines).
>>>>>>> Hunk #3 succeeded at 157 (offset -4 lines).
>>>>>>> Hunk #4 succeeded at 1206 (offset -120 lines).
>>>>>>> Hunk #5 FAILED at 1508.
>>>>>>> Hunk #6 succeeded at 1412 (offset -124 lines).
>>>>>>> Hunk #7 succeeded at 1425 (offset -124 lines).
>>>>>>> Hunk #8 FAILED at 1925.
>>>>>>> Hunk #9 succeeded at 1866 with fuzz 2 (offset -255 lines).
>>>>>>> Hunk #10 succeeded at 2202 (offset -294 lines).
>>>>>>> Hunk #11 succeeded at 2332 (offset -294 lines).
>>>>>>> 2 out of 11 hunks FAILED -- saving rejects to file 
>>>>>>> drivers/md/dm-snap.c.rej
>>>>>>>
>>>>>>> Guru
>>>>>>>
>>>>>>> On Thu, 10 Oct 2019 at 01:33, Guruswamy Basavaiah  
>>>>>>> wrote:
>>>>>>>>
>>>>>>>> Hello Mike,
>>>>>>>>  I will get the testing result before end of Thursday.
>>>>>>>> Guru

Re: [dm-devel] [PATCH 2/2] dm-snapshot: Reimplement the cow limit.

2019-10-11 Thread Nikos Tsironis
On 10/2/19 1:15 PM, Mikulas Patocka wrote:
> Commit 721b1d98fb517a ("dm snapshot: Fix excessive memory usage and
> workqueue stalls") introduced a semaphore to limit the maximum number of
> in-flight kcopyd (COW) jobs.
> 
> The implementation of this throttling mechanism is prone to a deadlock:
> 
> 1. One or more threads write to the origin device causing COW, which is
>performed by kcopyd.
> 
> 2. At some point some of these threads might reach the s->cow_count
>semaphore limit and block in down(>cow_count), holding a read lock
>on _origins_lock.
> 
> 3. Someone tries to acquire a write lock on _origins_lock, e.g.,
>snapshot_ctr(), which blocks because the threads at step (2) already
>hold a read lock on it.
> 
> 4. A COW operation completes and kcopyd runs dm-snapshot's completion
>callback, which ends up calling pending_complete().
>pending_complete() tries to resubmit any deferred origin bios. This
>requires acquiring a read lock on _origins_lock, which blocks.
> 
>This happens because the read-write semaphore implementation gives
>priority to writers, meaning that as soon as a writer tries to enter
>the critical section, no readers will be allowed in, until all
>writers have completed their work.
> 
>So, pending_complete() waits for the writer at step (3) to acquire
>and release the lock. This writer waits for the readers at step (2)
>to release the read lock and those readers wait for
>pending_complete() (the kcopyd thread) to signal the s->cow_count
>semaphore: DEADLOCK.
> 
> In order to fix the bug, I reworked limiting, so that it waits without 
> holding any locks. The patch adds a variable in_progress that counts how 
> many kcopyd jobs are running. A function wait_for_in_progress will sleep 
> if the variable in_progress is over the limit. It drops _origins_lock in 
> order to avoid the deadlock.
> 
> Signed-off-by: Mikulas Patocka 
> Cc: sta...@vger.kernel.org# v5.0+
> Fixes: 721b1d98fb51 ("dm snapshot: Fix excessive memory usage and workqueue 
> stalls")
> 
> ---
>  drivers/md/dm-snap.c |   69 
> ---
>  1 file changed, 55 insertions(+), 14 deletions(-)
> 
> Index: linux-2.6/drivers/md/dm-snap.c
> ===
> --- linux-2.6.orig/drivers/md/dm-snap.c   2019-10-01 15:23:42.0 +0200
> +++ linux-2.6/drivers/md/dm-snap.c   2019-10-02 12:01:23.0 +0200
> @@ -18,7 +18,6 @@
>  #include 
>  #include 
>  #include 
> -#include 
>  
>  #include "dm.h"
>  
> @@ -107,8 +106,8 @@ struct dm_snapshot {
>   /* The on disk metadata handler */
>   struct dm_exception_store *store;
>  
> - /* Maximum number of in-flight COW jobs. */
> - struct semaphore cow_count;
> + unsigned in_progress;
> + struct wait_queue_head in_progress_wait;
>  
>   struct dm_kcopyd_client *kcopyd_client;
>  
> @@ -162,8 +161,8 @@ struct dm_snapshot {
>   */
>  #define DEFAULT_COW_THRESHOLD 2048
>  
> -static int cow_threshold = DEFAULT_COW_THRESHOLD;
> -module_param_named(snapshot_cow_threshold, cow_threshold, int, 0644);
> +static unsigned cow_threshold = DEFAULT_COW_THRESHOLD;
> +module_param_named(snapshot_cow_threshold, cow_threshold, uint, 0644);
>  MODULE_PARM_DESC(snapshot_cow_threshold, "Maximum number of chunks being 
> copied on write");
>  
>  DECLARE_DM_KCOPYD_THROTTLE_WITH_MODULE_PARM(snapshot_copy_throttle,
> @@ -1327,7 +1326,7 @@ static int snapshot_ctr(struct dm_target
>   goto bad_hash_tables;
>   }
>  
> - sema_init(&s->cow_count, (cow_threshold > 0) ? cow_threshold : INT_MAX);
> + init_waitqueue_head(&s->in_progress_wait);
>  

's->in_progress = 0' is missing here.

I totally missed that during the review and d3775354 ("dm: Use kzalloc
for all structs with embedded biosets/mempools") changed the allocation
of 's' to using kzalloc(), so 'in_progress' was implicitly initialized
to zero and the tests ran fine.

Nikos

>   s->kcopyd_client = dm_kcopyd_client_create(&dm_kcopyd_throttle);
>   if (IS_ERR(s->kcopyd_client)) {
> @@ -1509,17 +1508,46 @@ static void snapshot_dtr(struct dm_targe
>  
>   dm_put_device(ti, s->origin);
>  
> + WARN_ON(s->in_progress);
> +
>   kfree(s);
>  }
>  
>  static void account_start_copy(struct dm_snapshot *s)
>  {
> - down(&s->cow_count);
> + spin_lock(&s->in_progress_wait.lock);
> + s->in_progress++;
> + spin_unlock(&s->in_progress_wait.lock);
>  }
>  
>  static void account_end_copy(struct dm_snapshot *s)
>  {
> - up(&s->cow_count);
> + spin_lock(&s->in_progress_wait.lock);
> + BUG_ON(!s->in_progress);
> + s->in_progress--;
> + if (likely(s->in_progress <= cow_threshold) && unlikely(waitqueue_active(&s->in_progress_wait)))
> + wake_up_locked(&s->in_progress_wait);
> + spin_unlock(&s->in_progress_wait.lock);
> +}
> +
> +static bool wait_for_in_progress(struct dm_snapshot *s, bool unlock_origins)
> +{
> 
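
For context, here is a condensed sketch of the logic this helper implements; it is an approximation, not the verbatim patch:

static bool wait_for_in_progress(struct dm_snapshot *s, bool unlock_origins)
{
	if (unlikely(s->in_progress > cow_threshold)) {
		spin_lock(&s->in_progress_wait.lock);
		if (likely(s->in_progress > cow_threshold)) {
			DECLARE_WAITQUEUE(wait, current);

			__add_wait_queue(&s->in_progress_wait, &wait);
			__set_current_state(TASK_UNINTERRUPTIBLE);
			spin_unlock(&s->in_progress_wait.lock);
			/* Sleep without holding _origins_lock, which is what
			 * breaks the deadlock described above. */
			if (unlock_origins)
				up_read(&_origins_lock);
			io_schedule();
			remove_wait_queue(&s->in_progress_wait, &wait);
			return false;	/* caller must retake locks and retry */
		}
		spin_unlock(&s->in_progress_wait.lock);
	}
	return true;
}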

Re: [dm-devel] Fix "dm kcopyd: Fix bug causing workqueue stalls" causes dead lock

2019-10-11 Thread Nikos Tsironis
On 10/11/19 2:39 PM, Nikos Tsironis wrote:
> On 10/11/19 1:17 PM, Guruswamy Basavaiah wrote:
>> Hello Nikos,
>>  Applied these patches and tested.
>>  We still see hung_task_timeout back traces and the drbd Resync is blocked.
>>  Attached the back trace, please let me know if you need any other 
>> information.
>>
> 
> Hi Guru,
> 
> Can you provide more information about your setup? The output of
> 'dmsetup table', 'dmsetup ls --tree' and the DRBD configuration would
> help to get a better picture of your I/O stack.
> 
> Also, is it possible to describe the test case you are running and
> exactly what it does?
> 
> Thanks,
> Nikos
> 

Hi Guru,

I believe I found the mistake. The in_progress variable was never
initialized to zero.

I attach a new version of the second patch correcting this.

Can you please test again with this patch?

Thanks,
Nikos

>>  In patch "0002-dm-snapshot-rework-COW-throttling-to-fix-deadlock.patch"
>> I change "struct wait_queue_head" to "wait_queue_head_t" as i was
>> getting compilation error with former one.
>>
>> On Thu, 10 Oct 2019 at 17:33, Nikos Tsironis  wrote:
>>>
>>> On 10/10/19 9:34 AM, Guruswamy Basavaiah wrote:
>>>> Hello,
>>>> We use 4.4.184 in our builds and the patch fails to apply.
>>>> Is it possible to give a patch for 4.4.x branch ?
>>> Hi Guru,
>>>
>>> I attach the two patches fixing the deadlock rebased on the 4.4.x branch.
>>>
>>> Nikos
>>>
>>>>
>>>> patching Logs.
>>>> patching file drivers/md/dm-snap.c
>>>> Hunk #1 succeeded at 19 (offset 1 line).
>>>> Hunk #2 succeeded at 105 (offset -1 lines).
>>>> Hunk #3 succeeded at 157 (offset -4 lines).
>>>> Hunk #4 succeeded at 1206 (offset -120 lines).
>>>> Hunk #5 FAILED at 1508.
>>>> Hunk #6 succeeded at 1412 (offset -124 lines).
>>>> Hunk #7 succeeded at 1425 (offset -124 lines).
>>>> Hunk #8 FAILED at 1925.
>>>> Hunk #9 succeeded at 1866 with fuzz 2 (offset -255 lines).
>>>> Hunk #10 succeeded at 2202 (offset -294 lines).
>>>> Hunk #11 succeeded at 2332 (offset -294 lines).
>>>> 2 out of 11 hunks FAILED -- saving rejects to file drivers/md/dm-snap.c.rej
>>>>
>>>> Guru
>>>>
>>>> On Thu, 10 Oct 2019 at 01:33, Guruswamy Basavaiah  
>>>> wrote:
>>>>>
>>>>> Hello Mike,
>>>>>  I will get the testing result before end of Thursday.
>>>>> Guru
>>>>>
>>>>> On Wed, 9 Oct 2019 at 21:34, Mike Snitzer  wrote:
>>>>>>
>>>>>> On Wed, Oct 09 2019 at 11:44am -0400,
>>>>>> Nikos Tsironis  wrote:
>>>>>>
>>>>>>> On 10/9/19 5:13 PM, Mike Snitzer wrote:> On Tue, Oct 01 2019 at  8:43am 
>>>>>>> -0400,
>>>>>>>> Nikos Tsironis  wrote:
>>>>>>>>
>>>>>>>>> On 10/1/19 3:27 PM, Guruswamy Basavaiah wrote:
>>>>>>>>>> Hello Nikos,
>>>>>>>>>>  Yes, issue is consistently reproducible with us, in a particular
>>>>>>>>>> set-up and test case.
>>>>>>>>>>  I will get the access to set-up next week, will try to test and let
>>>>>>>>>> you know the results before end of next week.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> That sounds great!
>>>>>>>>>
>>>>>>>>> Thanks a lot,
>>>>>>>>> Nikos
>>>>>>>>
>>>>>>>> Hi Guru,
>>>>>>>>
>>>>>>>> Any chance you could try this fix that I've staged to send to Linus?
>>>>>>>> https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=dm-5.4&id=633b1613b2a49304743c18314bb6e6465c21fd8a
>>>>>>>>
>>>>>>>> Short of that, Nikos: do you happen to have a test scenario that teases
>>>>>>>> out this deadlock?
>>>>>>>>
>>>>>>>
>>>>>>> Hi Mike,
>>>>>>>
>>>>>>> Yes,
>>>>>>>
>>>>>>> I created a 50G LV and took a snapshot of the same size:
>>>>>>>
>>>>>

Re: [dm-devel] Fix "dm kcopyd: Fix bug causing workqueue stalls" causes dead lock

2019-10-11 Thread Nikos Tsironis
On 10/11/19 1:17 PM, Guruswamy Basavaiah wrote:
> Hello Nikos,
>  Applied these patches and tested.
>  We still see hung_task_timeout back traces and the drbd Resync is blocked.
>  Attached the back trace, please let me know if you need any other 
> information.
> 

Hi Guru,

Can you provide more information about your setup? The output of
'dmsetup table', 'dmsetup ls --tree' and the DRBD configuration would
help to get a better picture of your I/O stack.

Also, is it possible to describe the test case you are running and
exactly what it does?

Thanks,
Nikos

>  In patch "0002-dm-snapshot-rework-COW-throttling-to-fix-deadlock.patch"
> I change "struct wait_queue_head" to "wait_queue_head_t" as i was
> getting compilation error with former one.
> 
> On Thu, 10 Oct 2019 at 17:33, Nikos Tsironis  wrote:
>>
>> On 10/10/19 9:34 AM, Guruswamy Basavaiah wrote:
>>> Hello,
>>> We use 4.4.184 in our builds and the patch fails to apply.
>>> Is it possible to give a patch for 4.4.x branch ?
>> Hi Guru,
>>
>> I attach the two patches fixing the deadlock rebased on the 4.4.x branch.
>>
>> Nikos
>>
>>>
>>> patching Logs.
>>> patching file drivers/md/dm-snap.c
>>> Hunk #1 succeeded at 19 (offset 1 line).
>>> Hunk #2 succeeded at 105 (offset -1 lines).
>>> Hunk #3 succeeded at 157 (offset -4 lines).
>>> Hunk #4 succeeded at 1206 (offset -120 lines).
>>> Hunk #5 FAILED at 1508.
>>> Hunk #6 succeeded at 1412 (offset -124 lines).
>>> Hunk #7 succeeded at 1425 (offset -124 lines).
>>> Hunk #8 FAILED at 1925.
>>> Hunk #9 succeeded at 1866 with fuzz 2 (offset -255 lines).
>>> Hunk #10 succeeded at 2202 (offset -294 lines).
>>> Hunk #11 succeeded at 2332 (offset -294 lines).
>>> 2 out of 11 hunks FAILED -- saving rejects to file drivers/md/dm-snap.c.rej
>>>
>>> Guru
>>>
>>> On Thu, 10 Oct 2019 at 01:33, Guruswamy Basavaiah  
>>> wrote:
>>>>
>>>> Hello Mike,
>>>>  I will get the testing result before end of Thursday.
>>>> Guru
>>>>
>>>> On Wed, 9 Oct 2019 at 21:34, Mike Snitzer  wrote:
>>>>>
>>>>> On Wed, Oct 09 2019 at 11:44am -0400,
>>>>> Nikos Tsironis  wrote:
>>>>>
>>>>>> On 10/9/19 5:13 PM, Mike Snitzer wrote:> On Tue, Oct 01 2019 at  8:43am 
>>>>>> -0400,
>>>>>>> Nikos Tsironis  wrote:
>>>>>>>
>>>>>>>> On 10/1/19 3:27 PM, Guruswamy Basavaiah wrote:
>>>>>>>>> Hello Nikos,
>>>>>>>>>  Yes, issue is consistently reproducible with us, in a particular
>>>>>>>>> set-up and test case.
>>>>>>>>>  I will get the access to set-up next week, will try to test and let
>>>>>>>>> you know the results before end of next week.
>>>>>>>>>
>>>>>>>>
>>>>>>>> That sounds great!
>>>>>>>>
>>>>>>>> Thanks a lot,
>>>>>>>> Nikos
>>>>>>>
>>>>>>> Hi Guru,
>>>>>>>
>>>>>>> Any chance you could try this fix that I've staged to send to Linus?
>>>>>>> https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=dm-5.4&id=633b1613b2a49304743c18314bb6e6465c21fd8a
>>>>>>>
>>>>>>> Short of that, Nikos: do you happen to have a test scenario that teases
>>>>>>> out this deadlock?
>>>>>>>
>>>>>>
>>>>>> Hi Mike,
>>>>>>
>>>>>> Yes,
>>>>>>
>>>>>> I created a 50G LV and took a snapshot of the same size:
>>>>>>
>>>>>>   lvcreate -n data-lv -L50G testvg
>>>>>>   lvcreate -n snap-lv -L50G -s testvg/data-lv
>>>>>>
>>>>>> Then I ran the following fio job:
>>>>>>
>>>>>> [global]
>>>>>> randrepeat=1
>>>>>> ioengine=libaio
>>>>>> bs=1M
>>>>>> size=6G
>>>>>> offset_increment=6G
>>>>>> numjobs=8
>>>>>> direct=1
>>>>>> iodepth=32
>>>>>> group_reporting
>>>>>> filename=/dev/testvg/data-lv
>>>>>>
>>>>>> [test]
>>>>>> rw=write
>>>>>> timeout=180
>>>>>>
>>>>>> , concurrently with the following script:
>>>>>>
>>>>>> lvcreate -n dummy-lv -L1G testvg
>>>>>>
>>>>>> while true
>>>>>> do
>>>>>>  lvcreate -n dummy-snap -L1M -s testvg/dummy-lv
>>>>>>  lvremove -f testvg/dummy-snap
>>>>>> done
>>>>>>
>>>>>> This reproduced the deadlock for me. I also ran 'echo 30 >
>>>>>> /proc/sys/kernel/hung_task_timeout_secs', to reduce the hung task
>>>>>> timeout.
>>>>>>
>>>>>> Nikos.
>>>>>
>>>>> Very nice, well done.  Curious if you've tested with the fix I've staged
>>>>> (see above)?  If so, does it resolve the deadlock?  If you've had
>>>>> success I'd be happy to update the tags in the commit header to include
>>>>> your Tested-by before sending it to Linus.  Also, any review of the
>>>>> patch that you can do would be appreciated and with your formal
>>>>> Reviewed-by reply would be welcomed and folded in too.
>>>>>
>>>>> Mike
>>>>
>>>>
>>>>
>>>> --
>>>> Guruswamy Basavaiah
>>>
>>>
>>>
> 
> 
> 

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel


Re: [dm-devel] Fix "dm kcopyd: Fix bug causing workqueue stalls" causes dead lock

2019-10-10 Thread Nikos Tsironis
On 10/10/19 9:34 AM, Guruswamy Basavaiah wrote:
> Hello,
> We use 4.4.184 in our builds and the patch fails to apply.
> Is it possible to give a patch for 4.4.x branch ?
Hi Guru,

I attach the two patches fixing the deadlock rebased on the 4.4.x branch.

Nikos

> 
> patching Logs.
> patching file drivers/md/dm-snap.c
> Hunk #1 succeeded at 19 (offset 1 line).
> Hunk #2 succeeded at 105 (offset -1 lines).
> Hunk #3 succeeded at 157 (offset -4 lines).
> Hunk #4 succeeded at 1206 (offset -120 lines).
> Hunk #5 FAILED at 1508.
> Hunk #6 succeeded at 1412 (offset -124 lines).
> Hunk #7 succeeded at 1425 (offset -124 lines).
> Hunk #8 FAILED at 1925.
> Hunk #9 succeeded at 1866 with fuzz 2 (offset -255 lines).
> Hunk #10 succeeded at 2202 (offset -294 lines).
> Hunk #11 succeeded at 2332 (offset -294 lines).
> 2 out of 11 hunks FAILED -- saving rejects to file drivers/md/dm-snap.c.rej
> 
> Guru
> 
> On Thu, 10 Oct 2019 at 01:33, Guruswamy Basavaiah  wrote:
>>
>> Hello Mike,
>>  I will get the testing result before end of Thursday.
>> Guru
>>
>> On Wed, 9 Oct 2019 at 21:34, Mike Snitzer  wrote:
>>>
>>> On Wed, Oct 09 2019 at 11:44am -0400,
>>> Nikos Tsironis  wrote:
>>>
>>>> On 10/9/19 5:13 PM, Mike Snitzer wrote:> On Tue, Oct 01 2019 at  8:43am 
>>>> -0400,
>>>>> Nikos Tsironis  wrote:
>>>>>
>>>>>> On 10/1/19 3:27 PM, Guruswamy Basavaiah wrote:
>>>>>>> Hello Nikos,
>>>>>>>  Yes, issue is consistently reproducible with us, in a particular
>>>>>>> set-up and test case.
>>>>>>>  I will get the access to set-up next week, will try to test and let
>>>>>>> you know the results before end of next week.
>>>>>>>
>>>>>>
>>>>>> That sounds great!
>>>>>>
>>>>>> Thanks a lot,
>>>>>> Nikos
>>>>>
>>>>> Hi Guru,
>>>>>
>>>>> Any chance you could try this fix that I've staged to send to Linus?
>>>>> https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=dm-5.4&id=633b1613b2a49304743c18314bb6e6465c21fd8a
>>>>>
>>>>> Short of that, Nikos: do you happen to have a test scenario that teases
>>>>> out this deadlock?
>>>>>
>>>>
>>>> Hi Mike,
>>>>
>>>> Yes,
>>>>
>>>> I created a 50G LV and took a snapshot of the same size:
>>>>
>>>>   lvcreate -n data-lv -L50G testvg
>>>>   lvcreate -n snap-lv -L50G -s testvg/data-lv
>>>>
>>>> Then I ran the following fio job:
>>>>
>>>> [global]
>>>> randrepeat=1
>>>> ioengine=libaio
>>>> bs=1M
>>>> size=6G
>>>> offset_increment=6G
>>>> numjobs=8
>>>> direct=1
>>>> iodepth=32
>>>> group_reporting
>>>> filename=/dev/testvg/data-lv
>>>>
>>>> [test]
>>>> rw=write
>>>> timeout=180
>>>>
>>>> , concurrently with the following script:
>>>>
>>>> lvcreate -n dummy-lv -L1G testvg
>>>>
>>>> while true
>>>> do
>>>>  lvcreate -n dummy-snap -L1M -s testvg/dummy-lv
>>>>  lvremove -f testvg/dummy-snap
>>>> done
>>>>
>>>> This reproduced the deadlock for me. I also ran 'echo 30 >
>>>> /proc/sys/kernel/hung_task_timeout_secs', to reduce the hung task
>>>> timeout.
>>>>
>>>> Nikos.
>>>
>>> Very nice, well done.  Curious if you've tested with the fix I've staged
>>> (see above)?  If so, does it resolve the deadlock?  If you've had
>>> success I'd be happy to update the tags in the commit header to include
>>> your Tested-by before sending it to Linus.  Also, any review of the
>>> patch that you can do would be appreciated and with your formal
>>> Reviewed-by reply would be welcomed and folded in too.
>>>
>>> Mike
>>
>>
>>
>> --
>> Guruswamy Basavaiah
> 
> 
> 
>From 5b1ae3cfc07e53e6e6e37f9f40b074dd7a8536b9 Mon Sep 17 00:00:00 2001
From: Mikulas Patocka 
Date: Wed, 2 Oct 2019 06:14:17 -0400
Subject: [PATCH 1/2] dm snapshot: introduce account_start_copy() and
 account_end_copy()

This simple refactoring moves code for modifying the semaphore cow_count
into separate functions to prepare for change

Re: [dm-devel] Fix "dm kcopyd: Fix bug causing workqueue stalls" causes dead lock

2019-10-10 Thread Nikos Tsironis
On 10/9/19 7:04 PM, Mike Snitzer wrote:
> On Wed, Oct 09 2019 at 11:44am -0400,
> Nikos Tsironis  wrote:
> 
>> On 10/9/19 5:13 PM, Mike Snitzer wrote:> On Tue, Oct 01 2019 at  8:43am 
>> -0400,
>>> Nikos Tsironis  wrote:
>>>
>>>> On 10/1/19 3:27 PM, Guruswamy Basavaiah wrote:
>>>>> Hello Nikos,
>>>>>  Yes, issue is consistently reproducible with us, in a particular
>>>>> set-up and test case.
>>>>>  I will get the access to set-up next week, will try to test and let
>>>>> you know the results before end of next week.
>>>>>
>>>>
>>>> That sounds great!
>>>>
>>>> Thanks a lot,
>>>> Nikos
>>>
>>> Hi Guru,
>>>
>>> Any chance you could try this fix that I've staged to send to Linus?
>>> https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=dm-5.4&id=633b1613b2a49304743c18314bb6e6465c21fd8a
>>>
>>> Short of that, Nikos: do you happen to have a test scenario that teases
>>> out this deadlock?
>>>
>>
>> Hi Mike,
>>
>> Yes,
>>
>> I created a 50G LV and took a snapshot of the same size:
>>
>>   lvcreate -n data-lv -L50G testvg
>>   lvcreate -n snap-lv -L50G -s testvg/data-lv
>>
>> Then I ran the following fio job:
>>
>> [global]
>> randrepeat=1
>> ioengine=libaio
>> bs=1M
>> size=6G
>> offset_increment=6G
>> numjobs=8
>> direct=1
>> iodepth=32
>> group_reporting
>> filename=/dev/testvg/data-lv
>>
>> [test]
>> rw=write
>> timeout=180
>>
>> , concurrently with the following script:
>>
>> lvcreate -n dummy-lv -L1G testvg
>>
>> while true
>> do
>>  lvcreate -n dummy-snap -L1M -s testvg/dummy-lv
>>  lvremove -f testvg/dummy-snap
>> done
>>
>> This reproduced the deadlock for me. I also ran 'echo 30 >
>> /proc/sys/kernel/hung_task_timeout_secs', to reduce the hung task
>> timeout.
>>
>> Nikos.
> 
> Very nice, well done.  Curious if you've tested with the fix I've staged
> (see above)?  If so, does it resolve the deadlock?  If you've had
> success I'd be happy to update the tags in the commit header to include
> your Tested-by before sending it to Linus.  Also, any review of the
> patch that you can do would be appreciated and with your formal
> Reviewed-by reply would be welcomed and folded in too.
> 

Yes, I have tested the staged fix. I forgot to mention it in my previous
mail.

I ran the test for the default 'snapshot_cow_threshold' value of 2048
and I also ran it for a value of 1, to stress it a little more.

In both cases everything went fine, the deadlock was gone.

Nikos

> Mike
> 

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel


Re: [dm-devel] [PATCH 2/2] dm-snapshot: Reimplement the cow limit.

2019-10-10 Thread Nikos Tsironis
On 10/2/19 1:15 PM, Mikulas Patocka wrote:
> Commit 721b1d98fb517a ("dm snapshot: Fix excessive memory usage and
> workqueue stalls") introduced a semaphore to limit the maximum number of
> in-flight kcopyd (COW) jobs.
> 
> The implementation of this throttling mechanism is prone to a deadlock:
> 
> 1. One or more threads write to the origin device causing COW, which is
>performed by kcopyd.
> 
> 2. At some point some of these threads might reach the s->cow_count
>semaphore limit and block in down(>cow_count), holding a read lock
>on _origins_lock.
> 
> 3. Someone tries to acquire a write lock on _origins_lock, e.g.,
>snapshot_ctr(), which blocks because the threads at step (2) already
>hold a read lock on it.
> 
> 4. A COW operation completes and kcopyd runs dm-snapshot's completion
>callback, which ends up calling pending_complete().
>pending_complete() tries to resubmit any deferred origin bios. This
>requires acquiring a read lock on _origins_lock, which blocks.
> 
>This happens because the read-write semaphore implementation gives
>priority to writers, meaning that as soon as a writer tries to enter
>the critical section, no readers will be allowed in, until all
>writers have completed their work.
> 
>So, pending_complete() waits for the writer at step (3) to acquire
>and release the lock. This writer waits for the readers at step (2)
>to release the read lock and those readers wait for
>pending_complete() (the kcopyd thread) to signal the s->cow_count
>semaphore: DEADLOCK.
> 
> In order to fix the bug, I reworked limiting, so that it waits without 
> holding any locks. The patch adds a variable in_progress that counts how 
> many kcopyd jobs are running. A function wait_for_in_progress will sleep 
> if the variable in_progress is over the limit. It drops _origins_lock in 
> order to avoid the deadlock.
> 
> Signed-off-by: Mikulas Patocka 
> Cc: sta...@vger.kernel.org# v5.0+
> Fixes: 721b1d98fb51 ("dm snapshot: Fix excessive memory usage and workqueue 
> stalls")
> 

Reviewed-by: Nikos Tsironis 

> ---
>  drivers/md/dm-snap.c |   69 
> ---
>  1 file changed, 55 insertions(+), 14 deletions(-)
> 
> Index: linux-2.6/drivers/md/dm-snap.c
> ===
> --- linux-2.6.orig/drivers/md/dm-snap.c   2019-10-01 15:23:42.0 
> +0200
> +++ linux-2.6/drivers/md/dm-snap.c2019-10-02 12:01:23.0 +0200
> @@ -18,7 +18,6 @@
>  #include <linux/vmalloc.h>
>  #include <linux/log2.h>
>  #include <linux/dm-kcopyd.h>
> -#include <linux/semaphore.h>
>  
>  #include "dm.h"
>  
> @@ -107,8 +106,8 @@ struct dm_snapshot {
>   /* The on disk metadata handler */
>   struct dm_exception_store *store;
>  
> - /* Maximum number of in-flight COW jobs. */
> - struct semaphore cow_count;
> + unsigned in_progress;
> + struct wait_queue_head in_progress_wait;
>  
>   struct dm_kcopyd_client *kcopyd_client;
>  
> @@ -162,8 +161,8 @@ struct dm_snapshot {
>   */
>  #define DEFAULT_COW_THRESHOLD 2048
>  
> -static int cow_threshold = DEFAULT_COW_THRESHOLD;
> -module_param_named(snapshot_cow_threshold, cow_threshold, int, 0644);
> +static unsigned cow_threshold = DEFAULT_COW_THRESHOLD;
> +module_param_named(snapshot_cow_threshold, cow_threshold, uint, 0644);
>  MODULE_PARM_DESC(snapshot_cow_threshold, "Maximum number of chunks being 
> copied on write");
>  
>  DECLARE_DM_KCOPYD_THROTTLE_WITH_MODULE_PARM(snapshot_copy_throttle,
> @@ -1327,7 +1326,7 @@ static int snapshot_ctr(struct dm_target
>   goto bad_hash_tables;
>   }
>  
> - sema_init(&s->cow_count, (cow_threshold > 0) ? cow_threshold : INT_MAX);
> + init_waitqueue_head(&s->in_progress_wait);
>  
>   s->kcopyd_client = dm_kcopyd_client_create(&dm_kcopyd_throttle);
>   if (IS_ERR(s->kcopyd_client)) {
> @@ -1509,17 +1508,46 @@ static void snapshot_dtr(struct dm_targe
>  
>   dm_put_device(ti, s->origin);
>  
> + WARN_ON(s->in_progress);
> +
>   kfree(s);
>  }
>  
>  static void account_start_copy(struct dm_snapshot *s)
>  {
> - down(&s->cow_count);
> + spin_lock(&s->in_progress_wait.lock);
> + s->in_progress++;
> + spin_unlock(&s->in_progress_wait.lock);
>  }
>  
>  static void account_end_copy(struct dm_snapshot *s)
>  {
> - up(&s->cow_count);
> + spin_lock(&s->in_progress_wait.lock);
> + BUG_ON(!s->in_progress);
> + s->in_progress--;
> + if (likely(s->in_progress <= cow_threshold) && 
> unlikely(waitqueue_active(&s->in_progress_wait)))
> + wake_up_locked(&s->in_progress_wait);
> + spin_unlock(&s->in_progress_wait.lock);

Re: [dm-devel] [PATCH 1/2] dm-snapshot: introduce account_start_copy and account_end_copy

2019-10-10 Thread Nikos Tsironis
On 10/2/19 1:14 PM, Mikulas Patocka wrote:
> This is simple refactoring that moves code for modifying the semaphore
> cow_count into separate functions. It is needed by the following patch.
> 
> Signed-off-by: Mikulas Patocka 
> Cc: sta...@vger.kernel.org # v5.0+
> Fixes: 721b1d98fb51 ("dm snapshot: Fix excessive memory usage and workqueue 
> stalls")
> 

Reviewed-by: Nikos Tsironis 

> ---
>  drivers/md/dm-snap.c |   20 +++-
>  1 file changed, 15 insertions(+), 5 deletions(-)
> 
> Index: linux-2.6/drivers/md/dm-snap.c
> ===
> --- linux-2.6.orig/drivers/md/dm-snap.c   2019-10-01 15:19:20.0 
> +0200
> +++ linux-2.6/drivers/md/dm-snap.c2019-10-01 15:23:10.0 +0200
> @@ -1512,6 +1512,16 @@ static void snapshot_dtr(struct dm_targe
>   kfree(s);
>  }
>  
> +static void account_start_copy(struct dm_snapshot *s)
> +{
> + down(&s->cow_count);
> +}
> +
> +static void account_end_copy(struct dm_snapshot *s)
> +{
> + up(&s->cow_count);
> +}
> +
>  /*
>   * Flush a list of buffers.
>   */
> @@ -1732,7 +1742,7 @@ static void copy_callback(int read_err,
>   rb_link_node(>out_of_order_node, parent, p);
>   rb_insert_color(>out_of_order_node, >out_of_order_tree);
>   }
> - up(&s->cow_count);
> + account_end_copy(s);
>  }
>  
>  /*
> @@ -1756,7 +1766,7 @@ static void start_copy(struct dm_snap_pe
>   dest.count = src.count;
>  
>   /* Hand over to kcopyd */
> - down(&s->cow_count);
> + account_start_copy(s);
>   dm_kcopyd_copy(s->kcopyd_client, &src, 1, &dest, 0, copy_callback, pe);
>  }
>  
> @@ -1776,7 +1786,7 @@ static void start_full_bio(struct dm_sna
>   pe->full_bio = bio;
>   pe->full_bio_end_io = bio->bi_end_io;
>  
> - down(&s->cow_count);
> + account_start_copy(s);
>   callback_data = dm_kcopyd_prepare_callback(s->kcopyd_client,
>  copy_callback, pe);
>  
> @@ -1866,7 +1876,7 @@ static void zero_callback(int read_err,
>   struct bio *bio = context;
>   struct dm_snapshot *s = bio->bi_private;
>  
> - up(&s->cow_count);
> + account_end_copy(s);
>   bio->bi_status = write_err ? BLK_STS_IOERR : 0;
>   bio_endio(bio);
>  }
> @@ -1880,7 +1890,7 @@ static void zero_exception(struct dm_sna
>   dest.sector = bio->bi_iter.bi_sector;
>   dest.count = s->store->chunk_size;
>  
> - down(&s->cow_count);
> + account_start_copy(s);
>   WARN_ON_ONCE(bio->bi_private);
>   bio->bi_private = s;
>   dm_kcopyd_zero(s->kcopyd_client, 1, &dest, 0, zero_callback, bio);
> 

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel


Re: [dm-devel] Fix "dm kcopyd: Fix bug causing workqueue stalls" causes dead lock

2019-10-09 Thread Nikos Tsironis
On 10/9/19 5:13 PM, Mike Snitzer wrote:
> On Tue, Oct 01 2019 at  8:43am -0400,
> Nikos Tsironis  wrote:
> 
>> On 10/1/19 3:27 PM, Guruswamy Basavaiah wrote:
>>> Hello Nikos,
>>>  Yes, issue is consistently reproducible with us, in a particular
>>> set-up and test case.
>>>  I will get the access to set-up next week, will try to test and let
>>> you know the results before end of next week.
>>>
>>
>> That sounds great!
>>
>> Thanks a lot,
>> Nikos
> 
> Hi Guru,
> 
> Any chance you could try this fix that I've staged to send to Linus?
> https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=dm-5.4=633b1613b2a49304743c18314bb6e6465c21fd8a
> 
> Short of that, Nikos: do you happen to have a test scenario that teases
> out this deadlock?
> 

Hi Mike,

Yes,

I created a 50G LV and took a snapshot of the same size:

  lvcreate -n data-lv -L50G testvg
  lvcreate -n snap-lv -L50G -s testvg/data-lv

Then I ran the following fio job:

[global]
randrepeat=1
ioengine=libaio
bs=1M
size=6G
offset_increment=6G
numjobs=8
direct=1
iodepth=32
group_reporting
filename=/dev/testvg/data-lv

[test]
rw=write
timeout=180

, concurrently with the following script:

lvcreate -n dummy-lv -L1G testvg

while true
do
 lvcreate -n dummy-snap -L1M -s testvg/dummy-lv
 lvremove -f testvg/dummy-snap
done

This reproduced the deadlock for me. I also ran 'echo 30 >
/proc/sys/kernel/hung_task_timeout_secs', to reduce the hung task
timeout.

Nikos.

> Thanks,
> Mike
> 

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel


Re: [dm-devel] [PATCH] dm-clone: replace spin_lock_irqsave with spin_lock_irq

2019-10-07 Thread Nikos Tsironis
On 10/4/19 5:17 PM, Mikulas Patocka wrote:
> If we are in a place where it is known that interrupts are enabled,
> functions spin_lock_irq/spin_unlock_irq should be used instead of
> spin_lock_irqsave/spin_unlock_irqrestore.
> 
> spin_lock_irq and spin_unlock_irq are faster because they don't need to
> push and pop the flags register.
> 
> Signed-off-by: Mikulas Patocka 
> 

I reviewed the patch and it looks good. As a minor addition, I attach a
patch which updates the dm_clone_cond_set_range() comment.

Moreover, I will send a complementary patch converting a few more uses
of spin_lock_irqsave/spin_unlock_irqrestore to
spin_lock_irq/spin_unlock_irq.
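
For intuition, a minimal sketch of the two idioms with a hypothetical lock
(not code from either patch):

	spinlock_t lock;
	unsigned long flags;

	/* Safe in any context, including with interrupts already disabled:
	 * the current flags register is saved and restored around the section. */
	spin_lock_irqsave(&lock, flags);
	/* ... critical section ... */
	spin_unlock_irqrestore(&lock, flags);

	/* Cheaper, but only correct where interrupts are known to be enabled:
	 * unlock unconditionally re-enables them. */
	spin_lock_irq(&lock);
	/* ... critical section ... */
	spin_unlock_irq(&lock);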

Thanks,
Nikos


>From 097517d594cc127d2f21ca976f1e7df304e1ed10 Mon Sep 17 00:00:00 2001
From: Nikos Tsironis 
Date: Mon, 7 Oct 2019 14:07:19 +0300
Subject: [PATCH] dm clone: Fix dm_clone_cond_set_range() comment

Signed-off-by: Nikos Tsironis 
---
 drivers/md/dm-clone-metadata.h | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/md/dm-clone-metadata.h b/drivers/md/dm-clone-metadata.h
index 434bff08508b..9d3d29e6a838 100644
--- a/drivers/md/dm-clone-metadata.h
+++ b/drivers/md/dm-clone-metadata.h
@@ -44,7 +44,9 @@ int dm_clone_set_region_hydrated(struct dm_clone_metadata 
*cmd, unsigned long re
  * @start: Starting region number
  * @nr_regions: Number of regions in the range
  *
- * This function doesn't block, so it's safe to call it from interrupt context.
+ * This function doesn't block, but since it uses
+ * spin_lock_irq()/spin_unlock_irq() it's NOT safe to call it from any context
+ * where interrupts are disabled, e.g., from interrupt context.
  */
 int dm_clone_cond_set_range(struct dm_clone_metadata *cmd, unsigned long start,
unsigned long nr_regions);
-- 
2.11.0

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel


Re: [dm-devel] [PATCH 2/2] dm-snapshot: Reimplement the cow limit.

2019-10-02 Thread Nikos Tsironis
Hi Mikulas,

I agree that it's better to avoid holding any locks while waiting for
some pending kcopyd jobs to finish, but please see the comments below.

On 10/2/19 1:15 PM, Mikulas Patocka wrote:
> Commit 721b1d98fb517a ("dm snapshot: Fix excessive memory usage and
> workqueue stalls") introduced a semaphore to limit the maximum number of
> in-flight kcopyd (COW) jobs.
> 
> The implementation of this throttling mechanism is prone to a deadlock:
> 
> 1. One or more threads write to the origin device causing COW, which is
>performed by kcopyd.
> 
> 2. At some point some of these threads might reach the s->cow_count
>semaphore limit and block in down(&s->cow_count), holding a read lock
>on _origins_lock.
> 
> 3. Someone tries to acquire a write lock on _origins_lock, e.g.,
>snapshot_ctr(), which blocks because the threads at step (2) already
>hold a read lock on it.
> 
> 4. A COW operation completes and kcopyd runs dm-snapshot's completion
>callback, which ends up calling pending_complete().
>pending_complete() tries to resubmit any deferred origin bios. This
>requires acquiring a read lock on _origins_lock, which blocks.
> 
>This happens because the read-write semaphore implementation gives
>priority to writers, meaning that as soon as a writer tries to enter
>the critical section, no readers will be allowed in, until all
>writers have completed their work.
> 
>So, pending_complete() waits for the writer at step (3) to acquire
>and release the lock. This writer waits for the readers at step (2)
>to release the read lock and those readers wait for
>pending_complete() (the kcopyd thread) to signal the s->cow_count
>semaphore: DEADLOCK.
> 
> In order to fix the bug, I reworked limiting, so that it waits without 
> holding any locks. The patch adds a variable in_progress that counts how 
> many kcopyd jobs are running. A function wait_for_in_progress will sleep 
> if the variable in_progress is over the limit. It drops _origins_lock in 
> order to avoid the deadlock.
> 
> Signed-off-by: Mikulas Patocka 
> Cc: sta...@vger.kernel.org # v5.0+
> Fixes: 721b1d98fb51 ("dm snapshot: Fix excessive memory usage and workqueue 
> stalls")
> 
> ---
>  drivers/md/dm-snap.c |   69 
> ---
>  1 file changed, 55 insertions(+), 14 deletions(-)
> 
> Index: linux-2.6/drivers/md/dm-snap.c
> ===
> --- linux-2.6.orig/drivers/md/dm-snap.c   2019-10-01 15:23:42.0 
> +0200
> +++ linux-2.6/drivers/md/dm-snap.c2019-10-02 12:01:23.0 +0200
> @@ -18,7 +18,6 @@
>  #include <linux/vmalloc.h>
>  #include <linux/log2.h>
>  #include <linux/dm-kcopyd.h>
> -#include <linux/semaphore.h>
>  
>  #include "dm.h"
>  
> @@ -107,8 +106,8 @@ struct dm_snapshot {
>   /* The on disk metadata handler */
>   struct dm_exception_store *store;
>  
> - /* Maximum number of in-flight COW jobs. */
> - struct semaphore cow_count;
> + unsigned in_progress;
> + struct wait_queue_head in_progress_wait;
>  
>   struct dm_kcopyd_client *kcopyd_client;
>  
> @@ -162,8 +161,8 @@ struct dm_snapshot {
>   */
>  #define DEFAULT_COW_THRESHOLD 2048
>  
> -static int cow_threshold = DEFAULT_COW_THRESHOLD;
> -module_param_named(snapshot_cow_threshold, cow_threshold, int, 0644);
> +static unsigned cow_threshold = DEFAULT_COW_THRESHOLD;
> +module_param_named(snapshot_cow_threshold, cow_threshold, uint, 0644);
>  MODULE_PARM_DESC(snapshot_cow_threshold, "Maximum number of chunks being 
> copied on write");
>  
>  DECLARE_DM_KCOPYD_THROTTLE_WITH_MODULE_PARM(snapshot_copy_throttle,
> @@ -1327,7 +1326,7 @@ static int snapshot_ctr(struct dm_target
>   goto bad_hash_tables;
>   }
>  
> - sema_init(&s->cow_count, (cow_threshold > 0) ? cow_threshold : INT_MAX);
> + init_waitqueue_head(&s->in_progress_wait);
>  
>   s->kcopyd_client = dm_kcopyd_client_create(&dm_kcopyd_throttle);
>   if (IS_ERR(s->kcopyd_client)) {
> @@ -1509,17 +1508,46 @@ static void snapshot_dtr(struct dm_targe
>  
>   dm_put_device(ti, s->origin);
>  
> + WARN_ON(s->in_progress);
> +
>   kfree(s);
>  }
>  
>  static void account_start_copy(struct dm_snapshot *s)
>  {
> - down(&s->cow_count);
> + spin_lock(&s->in_progress_wait.lock);
> + s->in_progress++;
> + spin_unlock(&s->in_progress_wait.lock);
>  }
>  
>  static void account_end_copy(struct dm_snapshot *s)
>  {
> - up(&s->cow_count);
> + spin_lock(&s->in_progress_wait.lock);
> + BUG_ON(!s->in_progress);
> + s->in_progress--;
> + if (likely(s->in_progress <= cow_threshold) && 
> unlikely(waitqueue_active(&s->in_progress_wait)))
> + wake_up_locked(&s->in_progress_wait);
> + spin_unlock(&s->in_progress_wait.lock);
> +}
> +
> +static bool wait_for_in_progress(struct dm_snapshot *s, bool unlock_origins)
> +{
> + if (unlikely(s->in_progress > cow_threshold)) {
> + spin_lock(&s->in_progress_wait.lock);
> + if 
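
For readers following along, a sketch of how the waiting part works, based on
the description above (an illustration, not the verbatim patch; the caller is
assumed to retry its locking when false is returned, since _origins_lock was
dropped):

static bool wait_for_in_progress(struct dm_snapshot *s, bool unlock_origins)
{
	if (unlikely(s->in_progress > cow_threshold)) {
		spin_lock(&s->in_progress_wait.lock);
		if (likely(s->in_progress > cow_threshold)) {
			DECLARE_WAITQUEUE(wait, current);

			__add_wait_queue(&s->in_progress_wait, &wait);
			__set_current_state(TASK_UNINTERRUPTIBLE);
			spin_unlock(&s->in_progress_wait.lock);
			/* Drop _origins_lock to avoid the deadlock described above. */
			if (unlock_origins)
				up_read(&_origins_lock);
			io_schedule();
			remove_wait_queue(&s->in_progress_wait, &wait);
			return false;
		}
		spin_unlock(&s->in_progress_wait.lock);
	}
	return true;
}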

Re: [dm-devel] Fix "dm kcopyd: Fix bug causing workqueue stalls" causes dead lock

2019-10-01 Thread Nikos Tsironis
On 10/1/19 3:27 PM, Guruswamy Basavaiah wrote:
> Hello Nikos,
>  Yes, issue is consistently reproducible with us, in a particular
> set-up and test case.
>  I will get the access to set-up next week, will try to test and let
> you know the results before end of next week.
> 

That sounds great!

Thanks a lot,
Nikos

> Guru
> 
> On Tue, 1 Oct 2019 at 17:42, Nikos Tsironis  wrote:
>>
>> On 9/29/19 5:36 PM, Guruswamy Basavaiah wrote:
>>> Hello Nikos,
>>>  Thanks for pointing out the lvcreate write lock.
>>>
>>> Guru
>>>
>>
>> Hi Guru,
>>
>> I have sent a fix for this and I have Cc-ed you.
>>
>> Is this something you are able to consistently reproduce? If so, it
>> would be great if you could also test the fix.
>>
>> Thanks,
>> Nikos
>>
>>>
>>> On Sat, 28 Sep 2019 at 01:03, Nikos Tsironis  wrote:
>>>>
>>>> On 9/27/19 4:19 PM, Guruswamy Basavaiah wrote:
>>>>> Hello,
>>>>>  We have a drbd partition on top of an lvm partition. When the node having
>>>>> the secondary drbd partition is coming up, a large amount of data will be
>>>>> synced from the primary to the secondary drbd partition.
>>>>>
>>>>> During this time, we see the drbd sync (resync) stop at some point.
>>>>> After 120 seconds we see hung-task-timeout warnings in the logs (see
>>>>> the end of this email).
>>>>>
>>>>> If I increase the cow_count semaphore value from 2048 to 8192 or
>>>>> remove the patch below, the drbd sync works seamlessly.
>>>>>
>>>>> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/patch/?id=721b1d98fb517ae99ab3b757021cf81db41e67be
>>>>>
>>>>> I am not familiar with the dm code, but from the hung task back traces what I
>>>>> understand is: a thread is trying to queue work to kcopyd, holding
>>>>> "&_origins_lock" and blocked on the cow_count lock, while jobs from kcopyd
>>>>> are trying to queue work to the same kcopyd and are blocked on
>>>>> "&_origins_lock": a deadlock.
>>>>>
>>>>
>>>> Hello Guruswamy,
>>>>
>>>> I am Cc-ing the maintainers, so they can be in the loop.
>>>>
>>>> I examined the attached logs and I believe the following happens:
>>>>
>>>> 1. DRBD issues a number of writes to the snapshot origin device. These
>>>>writes cause COW, which is performed by kcopyd.
>>>>
>>>> 2. At some point DRBD reaches the cow_count semaphore limit (2048) and
>>blocks in down(&s->cow_count), holding a read lock on _origins_lock.
>>>>
>>>> 3. Someone tries to create a new snapshot. This involves taking a write
>>>>lock on _origins_lock, which blocks because DRBD at step (2) already
>>>>holds a read lock on it. That's the blocked lvcreate at the end of
>>>>the trace.
>>>>
>>>> 4. A COW operation, issued by step (1), completes and kcopyd runs
>>>>dm-snapshot's completion callback, which tries to take a read lock on
>>>>_origins_lock, before signaling the cow_count semaphore. This read
>>>>lock blocks, the semaphore is never signaled and we have the deadlock
>>>>you experienced.
>>>>
>>>> At first glance this seemed strange, because DRBD at step (2) holds a
>>>> read lock on _origins_lock, so taking another read lock should be
>>>> possible.
>>>>
>>>> But, if I am not missing something, the read-write semaphore
>>>> implementation gives priority to writers, meaning that as soon as a
>>>> writer tries to enter the critical section, the lvcreate in our case, no
>>>> readers will be allowed in until all writers have completed their work.
>>>>
>>>> That's what I believe is causing the deadlock you are experiencing.
>>>>
>>>> I will send a patch fixing this and I will let you know.
>>>>
>>>> Thanks,
>>>> Nikos
>>>>
>>>>> Below is the hung task back traces.
>>>>> Sep 24 12:08:48.974658 err CFPU-1 kernel: [  279.991760] INFO: task
>>>>> kworker/1:1:170 blocked for more than 120 seconds.
>>>>> Sep 24 12:08:48.974658 err CFPU-1 kernel: [  279.998569]
>>>>> Tainted: P   O4.4.184-octeon-distro.git-v2.96-4-rc-wnd #1
>>>>> Sep 24 12:08:48.974658 err C

Re: [dm-devel] Fix "dm kcopyd: Fix bug causing workqueue stalls" causes dead lock

2019-10-01 Thread Nikos Tsironis
On 9/29/19 5:36 PM, Guruswamy Basavaiah wrote:
> Hello Nikos,
>  Thanks for pointing out the lvcreate write lock.
> 
> Guru
> 

Hi Guru,

I have sent a fix for this and I have Cc-ed you.

Is this something you are able to consistently reproduce? If so, it
would be great if you could also test the fix.

Thanks,
Nikos

> 
> On Sat, 28 Sep 2019 at 01:03, Nikos Tsironis  wrote:
>>
>> On 9/27/19 4:19 PM, Guruswamy Basavaiah wrote:
>>> Hello,
>>>  We have a drbd partition on top of an lvm partition. When the node having
>>> the secondary drbd partition is coming up, a large amount of data will be
>>> synced from the primary to the secondary drbd partition.
>>>
>>> During this time, we see the drbd sync (resync) stop at some point.
>>> After 120 seconds we see hung-task-timeout warnings in the logs (see
>>> the end of this email).
>>>
>>> If I increase the cow_count semaphore value from 2048 to 8192 or
>>> remove the patch below, the drbd sync works seamlessly.
>>>
>>> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/patch/?id=721b1d98fb517ae99ab3b757021cf81db41e67be
>>>
>>> I am not familiar with the dm code, but from the hung task back traces what I
>>> understand is: a thread is trying to queue work to kcopyd, holding
>>> "&_origins_lock" and blocked on the cow_count lock, while jobs from kcopyd
>>> are trying to queue work to the same kcopyd and are blocked on
>>> "&_origins_lock": a deadlock.
>>>
>>
>> Hello Guruswamy,
>>
>> I am Cc-ing the maintainers, so they can be in the loop.
>>
>> I examined the attached logs and I believe the following happens:
>>
>> 1. DRBD issues a number of writes to the snapshot origin device. These
>>writes cause COW, which is performed by kcopyd.
>>
>> 2. At some point DRBD reaches the cow_count semaphore limit (2048) and
>>blocks in down(&s->cow_count), holding a read lock on _origins_lock.
>>
>> 3. Someone tries to create a new snapshot. This involves taking a write
>>lock on _origins_lock, which blocks because DRBD at step (2) already
>>holds a read lock on it. That's the blocked lvcreate at the end of
>>the trace.
>>
>> 4. A COW operation, issued by step (1), completes and kcopyd runs
>>dm-snapshot's completion callback, which tries to take a read lock on
>>_origins_lock, before signaling the cow_count semaphore. This read
>>lock blocks, the semaphore is never signaled and we have the deadlock
>>you experienced.
>>
>> At first glance this seemed strange, because DRBD at step (2) holds a
>> read lock on _origins_lock, so taking another read lock should be
>> possible.
>>
>> But, if I am not missing something, the read-write semaphore
>> implementation gives priority to writers, meaning that as soon as a
>> writer tries to enter the critical section, the lvcreate in our case, no
>> readers will be allowed in until all writers have completed their work.
>>
>> That's what I believe is causing the deadlock you are experiencing.
>>
>> I will send a patch fixing this and I will let you know.
>>
>> Thanks,
>> Nikos
>>
>>> Below is the hung task back traces.
>>> Sep 24 12:08:48.974658 err CFPU-1 kernel: [  279.991760] INFO: task
>>> kworker/1:1:170 blocked for more than 120 seconds.
>>> Sep 24 12:08:48.974658 err CFPU-1 kernel: [  279.998569]
>>> Tainted: P   O4.4.184-octeon-distro.git-v2.96-4-rc-wnd #1
>>> Sep 24 12:08:48.974658 err CFPU-1 kernel: [  280.006593] "echo 0 >
>>> /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>>> Sep 24 12:08:48.974658 info CFPU-1 kernel: [  280.014435] kworker/1:1
>>>D 80e1db78 0   170  2 0x0010
>>> Sep 24 12:08:48.974658 info CFPU-1 kernel: [  280.014482] Workqueue:
>>> kcopyd do_work [dm_mod]
>>> Sep 24 12:08:48.974658 warn CFPU-1 kernel: [  280.014487] Stack :
>>>  0001 00030003 8007fde8bac8
>>> Sep 24 12:08:48.974658 warn CFPU-1 kernel: [  280.014487]
>>> 8007fe759b00 0002 c0285294 8007f8d1ca00
>>> Sep 24 12:08:48.974658 warn CFPU-1 kernel: [  280.014487]
>>> c027eda8 0001 80b3 0100
>>> Sep 24 12:08:48.974658 warn CFPU-1 kernel: [  280.014487]
>>> 800784c098c8 80e1db78 8007fe759b00 80e204b8
>>> Sep 24 12:08:48.974658 warn CFPU-1 kernel: [  280.014487]
>>> 800788ef79c0 

[dm-devel] [PATCH 1/1] dm snapshot: Fix bug in COW throttling mechanism causing deadlocks

2019-10-01 Thread Nikos Tsironis
Commit 721b1d98fb517a ("dm snapshot: Fix excessive memory usage and
workqueue stalls") introduced a semaphore to limit the maximum number of
in-flight kcopyd (COW) jobs.

The implementation of this throttling mechanism is prone to a deadlock:

1. One or more threads write to the origin device causing COW, which is
   performed by kcopyd.

2. At some point some of these threads might reach the s->cow_count
   semaphore limit and block in down(&s->cow_count), holding a read lock
   on _origins_lock.

3. Someone tries to acquire a write lock on _origins_lock, e.g.,
   snapshot_ctr(), which blocks because the threads at step (2) already
   hold a read lock on it.

4. A COW operation completes and kcopyd runs dm-snapshot's completion
   callback, which ends up calling pending_complete().
   pending_complete() tries to resubmit any deferred origin bios. This
   requires acquiring a read lock on _origins_lock, which blocks.

   This happens because the read-write semaphore implementation gives
   priority to writers, meaning that as soon as a writer tries to enter
   the critical section, no readers will be allowed in, until all
   writers have completed their work.

   So, pending_complete() waits for the writer at step (3) to acquire
   and release the lock. This writer waits for the readers at step (2)
   to release the read lock and those readers wait for
   pending_complete() (the kcopyd thread) to signal the s->cow_count
   semaphore: DEADLOCK.

Fix this by delegating the resubmission of any deferred origin bios to
another thread, so the kcopyd thread never tries to acquire
_origins_lock and it's free to continue its work and signal the
s->cow_count semaphore.

Cc: sta...@vger.kernel.org
Fixes: 721b1d98fb517a ("dm snapshot: Fix excessive memory usage and workqueue 
stalls")
Reported-by: Guruswamy Basavaiah 
Signed-off-by: Nikos Tsironis 
---
 drivers/md/dm-snap.c | 99 +++-
 1 file changed, 90 insertions(+), 9 deletions(-)

diff --git a/drivers/md/dm-snap.c b/drivers/md/dm-snap.c
index f150f5c5492b..d701fe53bc96 100644
--- a/drivers/md/dm-snap.c
+++ b/drivers/md/dm-snap.c
@@ -4,6 +4,7 @@
  * This file is released under the GPL.
  */
 
+#include <linux/bio.h>
 #include <linux/blkdev.h>
 #include <linux/device-mapper.h>
 #include <linux/delay.h>
@@ -19,6 +20,7 @@
 #include <linux/log2.h>
 #include <linux/dm-kcopyd.h>
 #include <linux/semaphore.h>
+#include <linux/workqueue.h>
 
 #include "dm.h"
 
@@ -94,11 +96,12 @@ struct dm_snapshot {
struct dm_exception_table pending;
struct dm_exception_table complete;
 
-   /*
-* pe_lock protects all pending_exception operations and access
-* as well as the snapshot_bios list.
-*/
-   spinlock_t pe_lock;
+   /* Origin bios queued for resubmission to the origin device. */
+   spinlock_t deferred_bios_lock;
+   struct bio_list deferred_origin_bios;
+
+   struct workqueue_struct *wq;
+   struct work_struct deferred_bios_work;
 
/* Chunks with outstanding reads */
spinlock_t tracked_chunk_lock;
@@ -1224,6 +1227,8 @@ static int parse_snapshot_features(struct dm_arg_set *as, 
struct dm_snapshot *s,
return r;
 }
 
+static void process_deferred_bios(struct work_struct *work);
+
 /*
  * Construct a snapshot mapping:
  * <origin_dev> <COW-dev> <p|po|n> <chunk-size> [<# feature args> [<arg>]*]
@@ -1313,7 +1318,8 @@ static int snapshot_ctr(struct dm_target *ti, unsigned 
int argc, char **argv)
s->out_of_order_tree = RB_ROOT;
init_rwsem(&s->lock);
INIT_LIST_HEAD(&s->list);
-   spin_lock_init(&s->pe_lock);
+   spin_lock_init(&s->deferred_bios_lock);
+   bio_list_init(&s->deferred_origin_bios);
s->state_bits = 0;
s->merge_failed = 0;
s->first_merging_chunk = 0;
@@ -1336,6 +1342,15 @@ static int snapshot_ctr(struct dm_target *ti, unsigned 
int argc, char **argv)
goto bad_kcopyd;
}
 
+   s->wq = alloc_workqueue("dm-" DM_MSG_PREFIX, WQ_MEM_RECLAIM, 0);
+   if (!s->wq) {
+   ti->error = "Could not allocate workqueue";
+   r = -ENOMEM;
+   goto bad_workqueue;
+   }
+
+   INIT_WORK(&s->deferred_bios_work, process_deferred_bios);
+
r = mempool_init_slab_pool(&s->pending_pool, MIN_IOS, pending_cache);
if (r) {
ti->error = "Could not allocate mempool for pending exceptions";
@@ -1401,6 +1416,8 @@ static int snapshot_ctr(struct dm_target *ti, unsigned 
int argc, char **argv)
 bad_load_and_register:
mempool_exit(&s->pending_pool);
 bad_pending_pool:
+   destroy_workqueue(s->wq);
+bad_workqueue:
dm_kcopyd_client_destroy(s->kcopyd_client);
 bad_kcopyd:
dm_exception_table_exit(&s->pending, pending_cache);
@@ -1423,6 +1440,12 @@ static void __free_exceptions(struct dm_snapshot *s)
dm_kcopyd_client_destroy(s->kcopyd_client);
s->kcopyd_client = NULL;
 
+   /*
+* destroy_workqueue() drains the workqueue so any pendin
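
For context, the delegated worker would look roughly like this, given the
fields introduced above (a sketch, not the verbatim patch; the bio
resubmission call is assumed to be generic_make_request(), as in kernels of
that era):

static void process_deferred_bios(struct work_struct *work)
{
	struct dm_snapshot *s = container_of(work, struct dm_snapshot,
					     deferred_bios_work);
	struct bio_list bios;
	struct bio *bio;

	bio_list_init(&bios);

	/* Steal the list of deferred origin bios under the spinlock. */
	spin_lock_irq(&s->deferred_bios_lock);
	bios = s->deferred_origin_bios;
	bio_list_init(&s->deferred_origin_bios);
	spin_unlock_irq(&s->deferred_bios_lock);

	/*
	 * Resubmit from worker context: taking _origins_lock here cannot
	 * block the kcopyd thread, which is the whole point of the patch.
	 */
	while ((bio = bio_list_pop(&bios)))
		generic_make_request(bio);
}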

[dm-devel] [PATCH 0/1] dm snapshot: Fix bug in COW throttling mechanism causing deadlocks

2019-10-01 Thread Nikos Tsironis
Hello,

This patch fixes the deadlock issue reported in this thread:
https://www.redhat.com/archives/dm-devel/2019-September/msg00168.html.

Although I have been really careful preparing this patch, in order to
avoid any further issues, any extra review would be greatly appreciated.

Thanks,
Nikos

Nikos Tsironis (1):
  dm snapshot: Fix bug in COW throttling mechanism causing deadlocks

 drivers/md/dm-snap.c | 99 +++-
 1 file changed, 90 insertions(+), 9 deletions(-)

-- 
2.11.0

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel


Re: [dm-devel] Fix "dm kcopyd: Fix bug causing workqueue stalls" causes dead lock

2019-09-27 Thread Nikos Tsironis
On 9/27/19 4:19 PM, Guruswamy Basavaiah wrote:
> Hello,
>  We have a drbd partition on top of an lvm partition. When the node having
> the secondary drbd partition is coming up, a large amount of data will be
> synced from the primary to the secondary drbd partition.
> 
> During this time, we see the drbd sync (resync) stop at some point.
> After 120 seconds we see hung-task-timeout warnings in the logs (see
> the end of this email).
> 
> If I increase the cow_count semaphore value from 2048 to 8192 or
> remove the patch below, the drbd sync works seamlessly.
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/patch/?id=721b1d98fb517ae99ab3b757021cf81db41e67be
> 
> I am not familiar with the dm code, but from the hung task back traces what I
> understand is: a thread is trying to queue work to kcopyd, holding
> "&_origins_lock" and blocked on the cow_count lock, while jobs from kcopyd
> are trying to queue work to the same kcopyd and are blocked on
> "&_origins_lock": a deadlock.
> 

Hello Guruswamy,

I am Cc-ing the maintainers, so they can be in the loop.

I examined the attached logs and I believe the following happens:

1. DRBD issues a number of writes to the snapshot origin device. These
   writes cause COW, which is performed by kcopyd.

2. At some point DRBD reaches the cow_count semaphore limit (2048) and
   blocks in down(&s->cow_count), holding a read lock on _origins_lock.

3. Someone tries to create a new snapshot. This involves taking a write
   lock on _origins_lock, which blocks because DRBD at step (2) already
   holds a read lock on it. That's the blocked lvcreate at the end of
   the trace.

4. A COW operation, issued by step (1), completes and kcopyd runs
   dm-snapshot's completion callback, which tries to take a read lock on
   _origins_lock, before signaling the cow_count semaphore. This read
   lock blocks, the semaphore is never signaled and we have the deadlock
   you experienced.

At first glance this seemed strange, because DRBD at step (2) holds a
read lock on _origins_lock, so taking another read lock should be
possible.

But, if I am not missing something, the read-write semaphore
implementation gives priority to writers, meaning that as soon as a
writer tries to enter the critical section, the lvcreate in our case, no
readers will be allowed in until all writers have completed their work.

That's what I believe is causing the deadlock you are experiencing.

I will send a patch fixing this and I will let you know.

Thanks,
Nikos

> Below is the hung task back traces.
> Sep 24 12:08:48.974658 err CFPU-1 kernel: [  279.991760] INFO: task
> kworker/1:1:170 blocked for more than 120 seconds.
> Sep 24 12:08:48.974658 err CFPU-1 kernel: [  279.998569]
> Tainted: P   O4.4.184-octeon-distro.git-v2.96-4-rc-wnd #1
> Sep 24 12:08:48.974658 err CFPU-1 kernel: [  280.006593] "echo 0 >
> /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> Sep 24 12:08:48.974658 info CFPU-1 kernel: [  280.014435] kworker/1:1
>D 80e1db78 0   170  2 0x0010
> Sep 24 12:08:48.974658 info CFPU-1 kernel: [  280.014482] Workqueue:
> kcopyd do_work [dm_mod]
> Sep 24 12:08:48.974658 warn CFPU-1 kernel: [  280.014487] Stack :
>  0001 00030003 8007fde8bac8
> Sep 24 12:08:48.974658 warn CFPU-1 kernel: [  280.014487]
> 8007fe759b00 0002 c0285294 8007f8d1ca00
> Sep 24 12:08:48.974658 warn CFPU-1 kernel: [  280.014487]
> c027eda8 0001 80b3 0100
> Sep 24 12:08:48.974658 warn CFPU-1 kernel: [  280.014487]
> 800784c098c8 80e1db78 8007fe759b00 80e204b8
> Sep 24 12:08:48.974658 warn CFPU-1 kernel: [  280.014487]
> 800788ef79c0 80078505ba70 8007fe759b00 0001852b4620
> Sep 24 12:08:48.974658 warn CFPU-1 kernel: [  280.014487]
> c028 8007852b4620 8007eebf5758 c027edec
> Sep 24 12:08:48.974658 warn CFPU-1 kernel: [  280.014487]
>  8007852b4620 8007835d8e80 c027f38c
> Sep 24 12:08:48.974658 warn CFPU-1 kernel: [  280.014487]
> 800787ac0580 0001 8007f8d1ca60 800785aeb080
> Sep 24 12:08:48.974658 warn CFPU-1 kernel: [  280.014487]
>   0200 c0282488
> Sep 24 12:08:48.974658 warn CFPU-1 kernel: [  280.014487]
> 0200 8007f8d1ca00 c028 c027db90
> Sep 24 12:08:48.974658 warn CFPU-1 kernel: [  280.014487]   ...
> Sep 24 12:08:48.974658 warn CFPU-1 kernel: [  280.014558] Call Trace:
> Sep 24 12:08:48.974658 warn CFPU-1 kernel: [  280.014570]
> [] __schedule+0x3c0/0xa58
> Sep 24 12:08:48.974658 warn CFPU-1 kernel: [  280.014580]
> [] schedule+0x38/0x98
> Sep 24 12:08:48.974658 warn CFPU-1 kernel: [  280.014590]
> [] __down_read+0xa8/0xf0
> Sep 24 12:08:48.974658 warn CFPU-1 kernel: [  280.014609]
> [] do_origin.isra.13+0x44/0x110 [dm_snapshot]
> Sep 24 12:08:48.974658 warn CFPU-1 kernel: [  280.014625]
> 

[dm-devel] [PATCH 1/2] dm clone metadata: Rename md to cmd

2019-09-12 Thread Nikos Tsironis
Rename md to cmd to be consistent with dm-clone-metadata.c

Signed-off-by: Nikos Tsironis 
---
 drivers/md/dm-clone-metadata.h | 36 ++--
 1 file changed, 18 insertions(+), 18 deletions(-)

diff --git a/drivers/md/dm-clone-metadata.h b/drivers/md/dm-clone-metadata.h
index 7b8063ea70c3..434bff08508b 100644
--- a/drivers/md/dm-clone-metadata.h
+++ b/drivers/md/dm-clone-metadata.h
@@ -29,24 +29,24 @@ struct dm_clone_metadata;
 /*
  * Set region status to hydrated.
  *
- * @md: The dm-clone metadata
+ * @cmd: The dm-clone metadata
  * @region_nr: The region number
  *
  * This function doesn't block, so it's safe to call it from interrupt context.
  */
-int dm_clone_set_region_hydrated(struct dm_clone_metadata *md, unsigned long 
region_nr);
+int dm_clone_set_region_hydrated(struct dm_clone_metadata *cmd, unsigned long 
region_nr);
 
 /*
  * Set status of all regions in the provided range to hydrated, if not already
  * hydrated.
  *
- * @md: The dm-clone metadata
+ * @cmd: The dm-clone metadata
  * @start: Starting region number
  * @nr_regions: Number of regions in the range
  *
  * This function doesn't block, so it's safe to call it from interrupt context.
  */
-int dm_clone_cond_set_range(struct dm_clone_metadata *md, unsigned long start,
+int dm_clone_cond_set_range(struct dm_clone_metadata *cmd, unsigned long start,
unsigned long nr_regions);
 
 /*
@@ -69,12 +69,12 @@ struct dm_clone_metadata *dm_clone_metadata_open(struct 
block_device *bdev,
 /*
  * Free the resources related to metadata management.
  */
-void dm_clone_metadata_close(struct dm_clone_metadata *md);
+void dm_clone_metadata_close(struct dm_clone_metadata *cmd);
 
 /*
  * Commit dm-clone metadata to disk.
  */
-int dm_clone_metadata_commit(struct dm_clone_metadata *md);
+int dm_clone_metadata_commit(struct dm_clone_metadata *cmd);
 
 /*
  * Reload the in core copy of the on-disk bitmap.
@@ -93,18 +93,18 @@ int dm_clone_metadata_commit(struct dm_clone_metadata *md);
  * dm_clone_set_region_hydrated() and dm_clone_cond_set_range() refuse to touch
  * the region bitmap, after calling dm_clone_metadata_set_read_only().
  */
-int dm_clone_reload_in_core_bitset(struct dm_clone_metadata *md);
+int dm_clone_reload_in_core_bitset(struct dm_clone_metadata *cmd);
 
 /*
  * Check whether dm-clone's metadata changed this transaction.
  */
-bool dm_clone_changed_this_transaction(struct dm_clone_metadata *md);
+bool dm_clone_changed_this_transaction(struct dm_clone_metadata *cmd);
 
 /*
  * Abort current metadata transaction and rollback metadata to the last
  * committed transaction.
  */
-int dm_clone_metadata_abort(struct dm_clone_metadata *md);
+int dm_clone_metadata_abort(struct dm_clone_metadata *cmd);
 
 /*
  * Switches metadata to a read only mode. Once read-only mode has been entered
@@ -115,44 +115,44 @@ int dm_clone_metadata_abort(struct dm_clone_metadata *md);
  *   dm_clone_cond_set_range()
  *   dm_clone_metadata_abort()
  */
-void dm_clone_metadata_set_read_only(struct dm_clone_metadata *md);
-void dm_clone_metadata_set_read_write(struct dm_clone_metadata *md);
+void dm_clone_metadata_set_read_only(struct dm_clone_metadata *cmd);
+void dm_clone_metadata_set_read_write(struct dm_clone_metadata *cmd);
 
 /*
  * Returns true if the hydration of the destination device is finished.
  */
-bool dm_clone_is_hydration_done(struct dm_clone_metadata *md);
+bool dm_clone_is_hydration_done(struct dm_clone_metadata *cmd);
 
 /*
  * Returns true if region @region_nr is hydrated.
  */
-bool dm_clone_is_region_hydrated(struct dm_clone_metadata *md, unsigned long 
region_nr);
+bool dm_clone_is_region_hydrated(struct dm_clone_metadata *cmd, unsigned long 
region_nr);
 
 /*
  * Returns true if all the regions in the range are hydrated.
  */
-bool dm_clone_is_range_hydrated(struct dm_clone_metadata *md,
+bool dm_clone_is_range_hydrated(struct dm_clone_metadata *cmd,
unsigned long start, unsigned long nr_regions);
 
 /*
  * Returns the number of hydrated regions.
  */
-unsigned long dm_clone_nr_of_hydrated_regions(struct dm_clone_metadata *md);
+unsigned long dm_clone_nr_of_hydrated_regions(struct dm_clone_metadata *cmd);
 
 /*
  * Returns the first unhydrated region with region_nr >= @start
  */
-unsigned long dm_clone_find_next_unhydrated_region(struct dm_clone_metadata 
*md,
+unsigned long dm_clone_find_next_unhydrated_region(struct dm_clone_metadata 
*cmd,
   unsigned long start);
 
 /*
  * Get the number of free metadata blocks.
  */
-int dm_clone_get_free_metadata_block_count(struct dm_clone_metadata *md, 
dm_block_t *result);
+int dm_clone_get_free_metadata_block_count(struct dm_clone_metadata *cmd, 
dm_block_t *result);
 
 /*
  * Get the total number of metadata blocks.
  */
-int dm_clone_get_metadata_dev_size(struct dm_clone_metadata *md, dm_block_t 
*result);
+int dm_clone_get_metadata_dev_size(str

[dm-devel] [PATCH 0/2] dm clone: Minor fixes

2019-09-12 Thread Nikos Tsironis
Hi Mike,

I examined the diff between v3 of dm-clone and the staged version and it
looks fine.

This patch set includes some minor fixes to fold in:

  - Rename 'md' to 'cmd' also in dm-clone-metadata.h, to be consistent
with the changes in dm-clone-metadata.c

  - Explicitly include the header file for kvmalloc(). This is not
strictly required, as the header file is included indirectly by
other header files, but I think it's safer to include it anyway.

Thanks,
Nikos

Nikos Tsironis (2):
  dm clone metadata: Rename md to cmd
  dm clone: Explicitly include header file for kvmalloc()

 drivers/md/dm-clone-metadata.c |  1 +
 drivers/md/dm-clone-metadata.h | 36 ++--
 drivers/md/dm-clone-target.c   |  1 +
 3 files changed, 20 insertions(+), 18 deletions(-)

-- 
2.11.0

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel


[dm-devel] [PATCH 2/2] dm clone: Explicitly include header file for kvmalloc()

2019-09-12 Thread Nikos Tsironis
Signed-off-by: Nikos Tsironis 
---
 drivers/md/dm-clone-metadata.c | 1 +
 drivers/md/dm-clone-target.c   | 1 +
 2 files changed, 2 insertions(+)

diff --git a/drivers/md/dm-clone-metadata.c b/drivers/md/dm-clone-metadata.c
index 50abc2fb4c7a..6bc8c1d1c351 100644
--- a/drivers/md/dm-clone-metadata.c
+++ b/drivers/md/dm-clone-metadata.c
@@ -3,6 +3,7 @@
  * Copyright (C) 2019 Arrikto, Inc. All Rights Reserved.
  */
 
+#include <linux/mm.h>
 #include 
 #include 
 #include 
diff --git a/drivers/md/dm-clone-target.c b/drivers/md/dm-clone-target.c
index f80250c3103e..cd6f9e9fc98e 100644
--- a/drivers/md/dm-clone-target.c
+++ b/drivers/md/dm-clone-target.c
@@ -3,6 +3,7 @@
  * Copyright (C) 2019 Arrikto, Inc. All Rights Reserved.
  */
 
+#include <linux/mm.h>
 #include 
 #include 
 #include 
-- 
2.11.0

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel


Re: [dm-devel] [dm:for-next 29/30] drivers//md/dm-clone-target.c:563:14: error: implicit declaration of function 'vmalloc'; did you mean 'kmalloc'?

2019-09-11 Thread Nikos Tsironis
On 9/11/19 9:22 PM, Mike Snitzer wrote:
> 
> I resolved this and pushed new code, thanks!
> 

Hi Mike,

I just saw the report and was about to fix it, but I noticed you have
already fixed it. Thanks a lot.

I had forgotten to include the header file for vmalloc(), but I saw you
used kvmalloc(), which is even better.

I took a quick look at the diff and there are a few places that still
need fixing:

drivers/md/dm-clone-target.c:563: clone->ht = vmalloc(sz * sizeof(struct hash_table_bucket));
drivers/md/dm-clone-target.c:579: vfree(clone->ht);

Also, the allocation of cmd->region_map is done with kvmalloc(), but the
deallocation is still done with vfree():

drivers/md/dm-clone-metadata.c:597: vfree(cmd->region_map);
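
The matching pattern should pair kvmalloc() with kvfree(), since the
allocation may be backed by either kmalloc() or vmalloc(). Roughly, reusing
the names from the diff:

	cmd->region_map = kvmalloc(bitmap_size(cmd->nr_regions), GFP_KERNEL);
	if (!cmd->region_map)
		return -ENOMEM;
	...
	kvfree(cmd->region_map);	/* correct for both backing allocators */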

I will be away from keyboard for the rest of the day, but I will take a
closer look at the diff tomorrow and I will send a new version fixing
these and any other issues I might find.

Thanks,
Nikos.

> On Wed, Sep 11 2019 at 12:03pm -0400,
> kbuild test robot  wrote:
> 
>> tree:   
>> https://kernel.googlesource.com/pub/scm/linux/kernel/git/device-mapper/linux-dm.git
>>  for-next
>> head:   509818079bf1fefff4ed02d6a1b994e20efc0480
>> commit: 1529a543debdf75fb26e7ecd732da0cc36f78a36 [29/30] dm: add clone target
>> config: sparc64-allmodconfig (attached as .config)
>> compiler: sparc64-linux-gcc (GCC) 7.4.0
>> reproduce:
>> wget 
>> https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O 
>> ~/bin/make.cross
>> chmod +x ~/bin/make.cross
>> git checkout 1529a543debdf75fb26e7ecd732da0cc36f78a36
>> # save the attached .config to linux build tree
>> GCC_VERSION=7.4.0 make.cross ARCH=sparc64 
>>
>> If you fix the issue, kindly add following tag
>> Reported-by: kbuild test robot 
>>
>> All error/warnings (new ones prefixed by >>):
>>
>>drivers//md/dm-clone-target.c: In function 'hash_table_init':
>> >> drivers//md/dm-clone-target.c:563:14: error: implicit declaration of function 'vmalloc'; did you mean 'kmalloc'? [-Werror=implicit-function-declaration]
>>  clone->ht = vmalloc(sz * sizeof(struct hash_table_bucket));
>>  ^~~
>>  kmalloc
>> >> drivers//md/dm-clone-target.c:563:12: warning: assignment makes pointer from integer without a cast [-Wint-conversion]
>>  clone->ht = vmalloc(sz * sizeof(struct hash_table_bucket));
>>^
>>drivers//md/dm-clone-target.c: In function 'hash_table_exit':
>> >> drivers//md/dm-clone-target.c:579:2: error: implicit declaration of function 'vfree'; did you mean 'kfree'? [-Werror=implicit-function-declaration]
>>  vfree(clone->ht);
>>  ^
>>  kfree
>>cc1: some warnings being treated as errors
>> --
>>drivers//md/dm-clone-metadata.c: In function 'dirty_map_init':
>> >> drivers//md/dm-clone-metadata.c:466:28: error: implicit declaration of function 'vzalloc'; did you mean 'kvzalloc'? [-Werror=implicit-function-declaration]
>>  md->dmap[0].dirty_words = vzalloc(bitmap_size(md->nr_words));
>>^~~
>>kvzalloc
>> >> drivers//md/dm-clone-metadata.c:466:26: warning: assignment makes pointer from integer without a cast [-Wint-conversion]
>>  md->dmap[0].dirty_words = vzalloc(bitmap_size(md->nr_words));
>>  ^
>>drivers//md/dm-clone-metadata.c:474:26: warning: assignment makes pointer 
>> from integer without a cast [-Wint-conversion]
>>  md->dmap[1].dirty_words = vzalloc(bitmap_size(md->nr_words));
>>  ^
>> >> drivers//md/dm-clone-metadata.c:478:3: error: implicit declaration of function 'vfree'; did you mean 'kvfree'? [-Werror=implicit-function-declaration]
>>   vfree(md->dmap[0].dirty_words);
>>   ^
>>   kvfree
>>drivers//md/dm-clone-metadata.c: In function 'dm_clone_metadata_open':
>> >> drivers//md/dm-clone-metadata.c:553:19: error: implicit declaration of function 'vmalloc'; did you mean 'kvmalloc'? [-Werror=implicit-function-declaration]
>>  md->region_map = vmalloc(bitmap_size(md->nr_regions));
>>   ^~~
>>   kvmalloc
>>drivers//md/dm-clone-metadata.c:553:17: warning: assignment makes pointer 
>> from integer without a cast [-Wint-conversion]
>>  md->region_map = vmalloc(bitmap_size(md->nr_regions));
>> ^
>>cc1: some warnings being treated as errors
>>
>> vim +563 drivers//md/dm-clone-target.c
>>
>>549   
>>550   #define bucket_lock_irqsave(bucket, flags) \
>>551   spin_lock_irqsave(&(bucket)->lock, flags)
>>552   
>>553   #define bucket_unlock_irqrestore(bucket, flags) \
>>554   spin_unlock_irqrestore(&(bucket)->lock, flags)
>>555   
>>556   static int hash_table_init(struct clone *clone)
>>557   {
>>558   unsigned int i, sz;
>>   

Re: [dm-devel] [RFC PATCH v2 0/1] dm: add clone target

2019-09-11 Thread Nikos Tsironis
Hi Mike,

I just noticed commit 6cf2a73cb2bc42 ("docs: device-mapper: move it to
the admin-guide"), which moves Documentation/device-mapper/ to
Documentation/admin-guide/device-mapper/.

I sent a v3 which moves dm-clone.rst under
Documentation/admin-guide/device-mapper/.

Sorry for that,
Nikos.

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel


[dm-devel] [RFC PATCH v3 0/1] dm: add clone target

2019-09-11 Thread Nikos Tsironis
This patch adds the dm-clone target, which allows cloning of arbitrary
block devices.

dm-clone produces a one-to-one copy of an existing, read-only source
device into a writable destination device: It presents a virtual block
device which makes all data appear immediately, and redirects reads and
writes accordingly.

The main use case of dm-clone is to clone a potentially remote,
high-latency, read-only, archival-type block device into a writable,
fast, primary-type device for fast, low-latency I/O. The cloned device
is visible/mountable immediately and the copy of the source device to
the destination device happens in the background, in parallel with user
I/O.

For example, one could restore an application backup from a read-only
copy, accessible through a network storage protocol (NBD, Fibre Channel,
iSCSI, AoE, etc.), into a local SSD or NVMe device, and start using the
device immediately, without waiting for the restore to complete.

When the cloning completes, the dm-clone table can be removed altogether
and be replaced, e.g., by a linear table, mapping directly to the
destination device.
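
As an illustration, assuming the dm-clone device is named 'clone' and the
destination device is /dev/sdb (both names are placeholders), the final
switch to a linear table could look like:

  dmsetup suspend clone
  dmsetup load clone --table "0 $(blockdev --getsz /dev/sdb) linear /dev/sdb 0"
  dmsetup resume clone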

dm-clone is optimized for small, random writes, with size equal to
dm-clone's region size, e.g., 4K.

For more information regarding dm-clone's operation, please read the
attached documentation.

A preliminary test suite for dm-clone can be found at
https://github.com/arrikto/device-mapper-test-suite/tree/feature-dm-clone

Changes in v3:
  - Move Documentation/device-mapper/dm-clone.rst to
Documentation/admin-guide/device-mapper/dm-clone.rst

v2: https://www.redhat.com/archives/dm-devel/2019-September/msg00061.html
v1: https://www.redhat.com/archives/dm-devel/2019-July/msg00088.html

Nikos Tsironis (1):
  dm: add clone target

 .../admin-guide/device-mapper/dm-clone.rst |  333 +++
 drivers/md/Kconfig |   14 +
 drivers/md/Makefile|2 +
 drivers/md/dm-clone-metadata.c |  963 +
 drivers/md/dm-clone-metadata.h |  158 ++
 drivers/md/dm-clone-target.c   | 2190 
 6 files changed, 3660 insertions(+)
 create mode 100644 Documentation/admin-guide/device-mapper/dm-clone.rst
 create mode 100644 drivers/md/dm-clone-metadata.c
 create mode 100644 drivers/md/dm-clone-metadata.h
 create mode 100644 drivers/md/dm-clone-target.c

-- 
2.11.0

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel


[dm-devel] [RFC PATCH v3 1/1] dm: add clone target

2019-09-11 Thread Nikos Tsironis
Add the dm-clone target, which allows cloning of arbitrary block
devices.

dm-clone produces a one-to-one copy of an existing, read-only source
device into a writable destination device: It presents a virtual block
device which makes all data appear immediately, and redirects reads and
writes accordingly.

The main use case of dm-clone is to clone a potentially remote,
high-latency, read-only, archival-type block device into a writable,
fast, primary-type device for fast, low-latency I/O. The cloned device
is visible/mountable immediately and the copy of the source device to
the destination device happens in the background, in parallel with user
I/O.

When the cloning completes, the dm-clone table can be removed altogether
and be replaced, e.g., by a linear table, mapping directly to the
destination device.

For further information and examples of how to use dm-clone, please read
Documentation/admin-guide/device-mapper/dm-clone.rst

Suggested-by: Vangelis Koukis 
Co-developed-by: Ilias Tsitsimpis 
Signed-off-by: Ilias Tsitsimpis 
Signed-off-by: Nikos Tsironis 
---
 .../admin-guide/device-mapper/dm-clone.rst |  333 +++
 drivers/md/Kconfig |   14 +
 drivers/md/Makefile|2 +
 drivers/md/dm-clone-metadata.c |  963 +
 drivers/md/dm-clone-metadata.h |  158 ++
 drivers/md/dm-clone-target.c   | 2190 
 6 files changed, 3660 insertions(+)
 create mode 100644 Documentation/admin-guide/device-mapper/dm-clone.rst
 create mode 100644 drivers/md/dm-clone-metadata.c
 create mode 100644 drivers/md/dm-clone-metadata.h
 create mode 100644 drivers/md/dm-clone-target.c

diff --git a/Documentation/admin-guide/device-mapper/dm-clone.rst 
b/Documentation/admin-guide/device-mapper/dm-clone.rst
new file mode 100644
index ..b43a34c1430a
--- /dev/null
+++ b/Documentation/admin-guide/device-mapper/dm-clone.rst
@@ -0,0 +1,333 @@
+.. SPDX-License-Identifier: GPL-2.0-only
+
+
+dm-clone
+
+
+Introduction
+
+
+dm-clone is a device mapper target which produces a one-to-one copy of an
+existing, read-only source device into a writable destination device: It
+presents a virtual block device which makes all data appear immediately, and
+redirects reads and writes accordingly.
+
+The main use case of dm-clone is to clone a potentially remote, high-latency,
+read-only, archival-type block device into a writable, fast, primary-type 
device
+for fast, low-latency I/O. The cloned device is visible/mountable immediately
+and the copy of the source device to the destination device happens in the
+background, in parallel with user I/O.
+
+For example, one could restore an application backup from a read-only copy,
+accessible through a network storage protocol (NBD, Fibre Channel, iSCSI, AoE,
+etc.), into a local SSD or NVMe device, and start using the device immediately,
+without waiting for the restore to complete.
+
+When the cloning completes, the dm-clone table can be removed altogether and be
+replaced, e.g., by a linear table, mapping directly to the destination device.
+
+The dm-clone target reuses the metadata library used by the thin-provisioning
+target.
+
+Glossary
+
+
+   Hydration
+ The process of filling a region of the destination device with data from
+ the same region of the source device, i.e., copying the region from the
+ source to the destination device.
+
+Once a region gets hydrated we redirect all I/O regarding it to the destination
+device.
+
+Design
+==
+
+Sub-devices
+---
+
+The target is constructed by passing three devices to it (along with other
+parameters detailed later):
+
+1. A source device - the read-only device that gets cloned and source of the
+   hydration.
+
+2. A destination device - the destination of the hydration, which will become a
+   clone of the source device.
+
+3. A small metadata device - it records which regions are already valid in the
+   destination device, i.e., which regions have already been hydrated, or have
+   been written to directly, via user I/O.
+
+The size of the destination device must be at least equal to the size of the
+source device.
+
+Regions
+---
+
+dm-clone divides the source and destination devices in fixed sized regions.
+Regions are the unit of hydration, i.e., the minimum amount of data copied from
+the source to the destination device.
+
+The region size is configurable when you first create the dm-clone device. The
+recommended region size is the same as the file system block size, which 
usually
+is 4KB. The region size must be between 8 sectors (4KB) and 2097152 sectors
+(1GB) and a power of two.
+
+Reads and writes from/to hydrated regions are serviced from the destination
+device.
+
+A read to a not yet hydrated region is serviced directly from the source 
device.
+
+A write to a not yet hydrated region will be delayed until

[dm-devel] [RFC PATCH v2 1/1] dm: add clone target

2019-09-06 Thread Nikos Tsironis
Add the dm-clone target, which allows cloning of arbitrary block
devices.

dm-clone produces a one-to-one copy of an existing, read-only source
device into a writable destination device: It presents a virtual block
device which makes all data appear immediately, and redirects reads and
writes accordingly.

The main use case of dm-clone is to clone a potentially remote,
high-latency, read-only, archival-type block device into a writable,
fast, primary-type device for fast, low-latency I/O. The cloned device
is visible/mountable immediately and the copy of the source device to
the destination device happens in the background, in parallel with user
I/O.

When the cloning completes, the dm-clone table can be removed altogether
and be replaced, e.g., by a linear table, mapping directly to the
destination device.

For further information and examples of how to use dm-clone, please read
Documentation/device-mapper/dm-clone.rst

Suggested-by: Vangelis Koukis 
Co-developed-by: Ilias Tsitsimpis 
Signed-off-by: Ilias Tsitsimpis 
Signed-off-by: Nikos Tsironis 
---
 Documentation/device-mapper/dm-clone.rst |  333 +
 drivers/md/Kconfig   |   14 +
 drivers/md/Makefile  |2 +
 drivers/md/dm-clone-metadata.c   |  963 +
 drivers/md/dm-clone-metadata.h   |  158 +++
 drivers/md/dm-clone-target.c | 2190 ++
 6 files changed, 3660 insertions(+)
 create mode 100644 Documentation/device-mapper/dm-clone.rst
 create mode 100644 drivers/md/dm-clone-metadata.c
 create mode 100644 drivers/md/dm-clone-metadata.h
 create mode 100644 drivers/md/dm-clone-target.c

diff --git a/Documentation/device-mapper/dm-clone.rst 
b/Documentation/device-mapper/dm-clone.rst
new file mode 100644
index ..b43a34c1430a
--- /dev/null
+++ b/Documentation/device-mapper/dm-clone.rst
@@ -0,0 +1,333 @@
+.. SPDX-License-Identifier: GPL-2.0-only
+
+
+dm-clone
+
+
+Introduction
+
+
+dm-clone is a device mapper target which produces a one-to-one copy of an
+existing, read-only source device into a writable destination device: It
+presents a virtual block device which makes all data appear immediately, and
+redirects reads and writes accordingly.
+
+The main use case of dm-clone is to clone a potentially remote, high-latency,
+read-only, archival-type block device into a writable, fast, primary-type 
device
+for fast, low-latency I/O. The cloned device is visible/mountable immediately
+and the copy of the source device to the destination device happens in the
+background, in parallel with user I/O.
+
+For example, one could restore an application backup from a read-only copy,
+accessible through a network storage protocol (NBD, Fibre Channel, iSCSI, AoE,
+etc.), into a local SSD or NVMe device, and start using the device immediately,
+without waiting for the restore to complete.
+
+When the cloning completes, the dm-clone table can be removed altogether and be
+replaced, e.g., by a linear table, mapping directly to the destination device.
+
+The dm-clone target reuses the metadata library used by the thin-provisioning
+target.
+
+Glossary
+
+
+   Hydration
+ The process of filling a region of the destination device with data from
+ the same region of the source device, i.e., copying the region from the
+ source to the destination device.
+
+Once a region gets hydrated we redirect all I/O regarding it to the destination
+device.
+
+Design
+==
+
+Sub-devices
+---
+
+The target is constructed by passing three devices to it (along with other
+parameters detailed later):
+
+1. A source device - the read-only device that gets cloned and source of the
+   hydration.
+
+2. A destination device - the destination of the hydration, which will become a
+   clone of the source device.
+
+3. A small metadata device - it records which regions are already valid in the
+   destination device, i.e., which regions have already been hydrated, or have
+   been written to directly, via user I/O.
+
+The size of the destination device must be at least equal to the size of the
+source device.
+
+Regions
+---
+
+dm-clone divides the source and destination devices in fixed sized regions.
+Regions are the unit of hydration, i.e., the minimum amount of data copied from
+the source to the destination device.
+
+The region size is configurable when you first create the dm-clone device. The
+recommended region size is the same as the file system block size, which 
usually
+is 4KB. The region size must be between 8 sectors (4KB) and 2097152 sectors
+(1GB) and a power of two.
+
+Reads and writes from/to hydrated regions are serviced from the destination
+device.
+
+A read to a not yet hydrated region is serviced directly from the source 
device.
+
+A write to a not yet hydrated region will be delayed until the corresponding
+region has been hydrated and the hydration of the region starts immediately.
+
+Note

[dm-devel] [RFC PATCH v2 0/1] dm: add clone target

2019-09-06 Thread Nikos Tsironis
This patch adds the dm-clone target, which allows cloning of arbitrary
block devices.

dm-clone produces a one-to-one copy of an existing, read-only source
device into a writable destination device: It presents a virtual block
device which makes all data appear immediately, and redirects reads and
writes accordingly.

The main use case of dm-clone is to clone a potentially remote,
high-latency, read-only, archival-type block device into a writable,
fast, primary-type device for fast, low-latency I/O. The cloned device
is visible/mountable immediately and the copy of the source device to
the destination device happens in the background, in parallel with user
I/O.

For example, one could restore an application backup from a read-only
copy, accessible through a network storage protocol (NBD, Fibre Channel,
iSCSI, AoE, etc.), into a local SSD or NVMe device, and start using the
device immediately, without waiting for the restore to complete.

When the cloning completes, the dm-clone table can be removed altogether
and be replaced, e.g., by a linear table, mapping directly to the
destination device.

dm-clone is optimized for small, random writes, with size equal to
dm-clone's region size, e.g., 4K.
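
For a concrete feel of the interface, a hypothetical invocation with a 4K (8
sector) region size; device names are placeholders and the argument order is
assumed to be metadata, destination, source, see the documentation in the
patch for the authoritative constructor syntax:

  dmsetup create clone --table "0 $(blockdev --getsz /dev/nbd0) clone \
      /dev/vg/metadata /dev/vg/dest /dev/nbd0 8"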

For more information regarding dm-clone's operation, please read the
attached documentation.

A preliminary test suite for dm-clone can be found at
https://github.com/arrikto/device-mapper-test-suite/tree/feature-dm-clone

Changes in v2:
  - Remove needless empty newlines.
  - Use the term "region" consistently and never call it a block.
  - Rename "origin" device to "source" device and "clone" device to
"destination" device.
  - Express "hydration_threshold" in multiples of a region.
  - Rename "hydration_block_size" to "hydration_batch_size" and express
it in multiples of a region.
  - Add missing "static" keyword to __load_bitset_in_core() and
__metadata_commit().
  - clone_message(): Don't print misleading error message about
"unsupported message" in case of correct messages with wrong number
of arguments.
  - Rename module parameter "clone_copy_throttle" to
"clone_hydration_throttle" to be consistent with its description.

I also updated the test suite to reflect these changes.

v1: https://www.redhat.com/archives/dm-devel/2019-July/msg00088.html

Nikos Tsironis (1):
  dm: add clone target

 Documentation/device-mapper/dm-clone.rst |  333 +
 drivers/md/Kconfig   |   14 +
 drivers/md/Makefile  |2 +
 drivers/md/dm-clone-metadata.c   |  963 +
 drivers/md/dm-clone-metadata.h   |  158 +++
 drivers/md/dm-clone-target.c | 2190 ++
 6 files changed, 3660 insertions(+)
 create mode 100644 Documentation/device-mapper/dm-clone.rst
 create mode 100644 drivers/md/dm-clone-metadata.c
 create mode 100644 drivers/md/dm-clone-metadata.h
 create mode 100644 drivers/md/dm-clone-target.c

-- 
2.11.0



Re: [dm-devel] [RFC PATCH 1/1] dm: add clone target

2019-08-31 Thread Nikos Tsironis
On 8/29/19 7:19 PM, Mike Snitzer wrote:
> On Tue, Jul 09 2019 at 10:15am -0400,
> Nikos Tsironis  wrote:
> 
>> Add the dm-clone target, which allows cloning of arbitrary block
>> devices.
>>
>> dm-clone produces a one-to-one copy of an existing, read-only device
>> (origin) into a writable device (clone): It presents a virtual block
>> device which makes all data appear immediately, and redirects reads and
>> writes accordingly.
>>
>> The main use case of dm-clone is to clone a potentially remote,
>> high-latency, read-only, archival-type block device into a writable,
>> fast, primary-type device for fast, low-latency I/O. The cloned device
>> is visible/mountable immediately and the copy of the origin device to
>> the clone device happens in the background, in parallel with user I/O.
>>
>> When the cloning completes, the dm-clone table can be removed altogether
>> and be replaced, e.g., by a linear table, mapping directly to the clone
>> device.
>>
>> For further information and examples of how to use dm-clone, please read
>> Documentation/device-mapper/dm-clone.rst
>>
>> Suggested-by: Vangelis Koukis 
>> Co-developed-by: Ilias Tsitsimpis 
>> Signed-off-by: Ilias Tsitsimpis 
>> Signed-off-by: Nikos Tsironis 
>> ---
>>  Documentation/device-mapper/dm-clone.rst |  334 +
>>  drivers/md/Kconfig   |   13 +
>>  drivers/md/Makefile  |2 +
>>  drivers/md/dm-clone-metadata.c   |  991 +
>>  drivers/md/dm-clone-metadata.h   |  158 +++
>>  drivers/md/dm-clone-target.c | 2244 ++
>>  6 files changed, 3742 insertions(+)
>>  create mode 100644 Documentation/device-mapper/dm-clone.rst
>>  create mode 100644 drivers/md/dm-clone-metadata.c
>>  create mode 100644 drivers/md/dm-clone-metadata.h
>>  create mode 100644 drivers/md/dm-clone-target.c
>>
>> diff --git a/Documentation/device-mapper/dm-clone.rst b/Documentation/device-mapper/dm-clone.rst
>> new file mode 100644
>> index 000000000000..948b7ce31ce3
>> --- /dev/null
>> +++ b/Documentation/device-mapper/dm-clone.rst
>> @@ -0,0 +1,334 @@
>> +.. SPDX-License-Identifier: GPL-2.0-only
>> +
>> +========
>> +dm-clone
>> +========
>> +
>> +Introduction
>> +============
>> +
>> +dm-clone is a device mapper target which produces a one-to-one copy of an
>> +existing, read-only device (origin) into a writable device (clone): It presents
>> +a virtual block device which makes all data appear immediately, and redirects
>> +reads and writes accordingly.
>> +
>> +The main use case of dm-clone is to clone a potentially remote, high-latency,
>> +read-only, archival-type block device into a writable, fast, primary-type device
>> +for fast, low-latency I/O. The cloned device is visible/mountable immediately
>> +and the copy of the origin device to the clone device happens in the background,
>> +in parallel with user I/O.
>> +
>> +For example, one could restore an application backup from a read-only copy,
>> +accessible through a network storage protocol (NBD, Fibre Channel, iSCSI, AoE,
>> +etc.), into a local SSD or NVMe device, and start using the device immediately,
>> +without waiting for the restore to complete.
>> +
>> +When the cloning completes, the dm-clone table can be removed altogether and be
>> +replaced, e.g., by a linear table, mapping directly to the clone device.
>> +
>> +The dm-clone target reuses the metadata library used by the thin-provisioning
>> +target.
>> +
>> +Glossary
>> +========
>> +
>> +   Region
>> + A fixed sized block. The unit of hydration.
>> +
>> +   Hydration
>> + The process of filling a region of the clone device with data from the same
>> + region of the origin device, i.e., copying the region from the origin to
>> + the clone device.
>> +
>> +Once a region gets hydrated we redirect all I/O regarding it to the clone
>> +device.
> 
> There is a lot of awkward jargon that you're mixing into this target.
> 
> Why "region" and not "block"?  I can let "region" go but please be
> consistent (don't fall back to calling a region a block anywhere).
> 

I used the term "region" to avoid confusion with a device's
logical/physical block size. A "region" is the unit of copying from the
source to t

Re: [dm-devel] [RFC PATCH 1/1] dm: add clone target

2019-08-28 Thread Nikos Tsironis
On 8/27/19 6:34 PM, Mike Snitzer wrote:
> On Tue, Aug 27 2019 at 10:09am -0400,
> Nikos Tsironis  wrote:
> 
>> Hello,
>>
>> This is a kind reminder for this patch set. I'm bumping this thread to
>> solicit your feedback.
>>
>> Following the discussion with Heinz, I have provided extensive
>> benchmarks that show dm-clone's significant performance increase
>> compared to a dm-snapshot/dm-raid1 stack.
>>
>> How can we move forward with the review of dm-clone, so it can
>> eventually be merged upstream?
>>
>> Looking forward to your feedback,
> 
> I actually pulled it into my local dm-5.4 branch yesterday and have
> started reviewing.  First pass it looks like you've got solid code; a
> lot of familiar code patterns too (borrowed from thinp, etc).
> 
> But the first thing that is tripping me up is the name "dm-clone"
> considering how cloning is so fundamental to all DM.  The second term
> that is just awkward is "hydration".  But that is just my initial
> thought.  I'll need the rest of the week to really dig in and have more
> constructive feedback for you.
> 

Hi Mike,

Thank you for your prompt response and also thank you in advance for all
the effort you will put in reviewing dm-clone.

Looking forward to your feedback,

Nikos

> Thanks for the ping; wasn't needed in this instance but it never hurts.
> 
> Mike
> 

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel


Re: [dm-devel] [RFC PATCH 1/1] dm: add clone target

2019-08-27 Thread Nikos Tsironis
Hello,

This is a kind reminder for this patch set. I'm bumping this thread to
solicit your feedback.

Following the discussion with Heinz, I have provided extensive
benchmarks that show dm-clone's significant performance increase
compared to a dm-snapshot/dm-raid1 stack.

How can we move forward with the review of dm-clone, so it can
eventually be merged upstream?

Looking forward to your feedback,

Nikos

On 7/30/19 1:13 PM, Nikos Tsironis wrote:
> On 7/30/19 12:20 AM, Heinz Mauelshagen wrote:
>> Hi Nikos,
>>
>> thanks for providing these benchmarks which seem to confirm the
>> advantages of clone vs. a snapshot/raid1 stack.
>>
>> Can you please provide 'dmsetup table' for both configurations for 
>> completeness?
>>
>> Heinz
>>
> 
> Hi Heinz,
> 
> Yes, of course. The below 'dmsetup table' output is for the 4K
> region/chunk size benchmark. The 'dmsetup table' output for the rest of
> the benchmarks is the same, changing only the region/chunk sizes of
> dm-clone and dm-snapshot.
> 
> dm-clone stack (dmsetup table)
> ==============================
> 
> source--vg-origin--lv: 0 629145600 linear 8:16 2048
> dest--vg-meta--lv: 0 65536 linear 259:0 629147648
> clone: 0 629145600 clone 254:1 254:0 254:2 8
> dest--vg-clone--lv: 0 629145600 linear 259:0 2048
> 
> dm-snapshot + dm-raid stack (dmsetup table)
> ===========================================
> 
> mirrorvg-snap-cow: 0 104857600 linear 259:0 629155840
> mirrorvg-raid1--lv_rimage_1: 0 629145600 linear 259:0 10240
> mirrorvg-snap: 0 629145600 snapshot 254:5 254:6 P 8
> mirrorvg-raid1--lv_rimage_0: 0 629145600 linear 8:16 10240
> mirrorvg-raid1--lv-real: 0 629145600 raid raid1 3 0 region_size 1024 2 254:0 254:1 254:2 254:3
> mirrorvg-raid1--lv: 0 629145600 snapshot-origin 254:5
> mirrorvg-raid1--lv_rmeta_1: 0 8192 linear 259:0 2048
> mirrorvg-raid1--lv_rmeta_0: 0 8192 linear 8:16 2048
> 
> Nikos
> 
>> On 7/22/19 10:16 PM, Nikos Tsironis wrote:
>>> On 7/17/19 5:41 PM, Heinz Mauelshagen wrote:
>>>> Hi Nikos,
>>>>
>>>> thanks for elaborating on those details.
>>>>
>>>> Hash table collisions, exception store entry commit overhead,
>>>> SSD cache flush issues etc. are all valid points relative to performance
>>>> and work set footprints in general.
>>>>
>>>> Do you have any performance numbers for your solution vs.
>>>> a snapshot one showing the approach is actually superior in
>>>> real configurations?
>>> Hi Heinz,
>>>
>>> Please see below for detailed benchmark results.
>>>
>>>> I'm asking this particularly in the context of your remark
>>>>
>>>> "A write to a not yet hydrated region will be delayed until the
>>>> corresponding
>>>> region has been hydrated and the hydration of the region starts
>>>> immediately."
>>>>
>>>> which'll cause a potentially large working set of delayed writes unless those
>>>> cover the whole eventually larger than 4K region.
>>>> How does your 'clone' target perform on such heavy write situations?
>>>>
>>> This situation occurs only when the writes are smaller than the region
>>> size of dm-clone. E.g., if the user sets a region size of 64K and issues
>>> 4K writes.
>>>
>>> In this case, we experience a performance drop due to COW. This is true
>>> _both_ for dm-snapshot and dm-clone and is _unavoidable_.
>>>
>>> But, the common case will be setting a region size equal to the file
>>> system block size, e.g., 4K, and thus avoiding the COW overhead. Note
>>> that LVM snapshots _already_ use 4K as the _default_ chunk size.
>>>
>>> Nevertheless, even for larger region/chunk sizes, dm-clone outperforms
>>> the dm-snapshot based solution, as is evident by the following
>>> performance measurements.
>>>
>>>> In general, performance and storage footprint test results based on the same
>>>> set of read/write tests including heavy loads with region size variations run
>>>> on 'clone' and 'snapshot' would help your point.
>>>>
>>>> Heinz
>>>>
>>> I used fio to run a series of read and write tests that compare the
>>> performance of dm-clone against your proposed dm-snapshot over dm-raid
>>> solution.
>>>
>>> I used a 375GB spinning disk as the origin device storing the data to be
>>> cloned and a 375GB SSD as the clone device and for storing b

Re: [dm-devel] [RFC PATCH 1/1] dm: add clone target

2019-07-30 Thread Nikos Tsironis
On 7/30/19 12:20 AM, Heinz Mauelshagen wrote:
> Hi Nikos,
> 
> thanks for providing these benchmarks which seem to confirm the
> advantages of clone vs. a snapshot/raid1 stack.
> 
> Can you please provide 'dmsetup table' for both configurations for 
> completeness?
> 
> Heinz
> 

Hi Heinz,

Yes, of course. The below 'dmsetup table' output is for the 4K
region/chunk size benchmark. The 'dmsetup table' output for the rest of
the benchmarks is the same, changing only the region/chunk sizes of
dm-clone and dm-snapshot.

dm-clone stack (dmsetup table)
==============================

source--vg-origin--lv: 0 629145600 linear 8:16 2048
dest--vg-meta--lv: 0 65536 linear 259:0 629147648
clone: 0 629145600 clone 254:1 254:0 254:2 8
dest--vg-clone--lv: 0 629145600 linear 259:0 2048

dm-snapshot + dm-raid stack (dmsetup table)
===========================================

mirrorvg-snap-cow: 0 104857600 linear 259:0 629155840
mirrorvg-raid1--lv_rimage_1: 0 629145600 linear 259:0 10240
mirrorvg-snap: 0 629145600 snapshot 254:5 254:6 P 8
mirrorvg-raid1--lv_rimage_0: 0 629145600 linear 8:16 10240
mirrorvg-raid1--lv-real: 0 629145600 raid raid1 3 0 region_size 1024 2 254:0 254:1 254:2 254:3
mirrorvg-raid1--lv: 0 629145600 snapshot-origin 254:5
mirrorvg-raid1--lv_rmeta_1: 0 8192 linear 259:0 2048
mirrorvg-raid1--lv_rmeta_0: 0 8192 linear 8:16 2048

Nikos

> On 7/22/19 10:16 PM, Nikos Tsironis wrote:
>> On 7/17/19 5:41 PM, Heinz Mauelshagen wrote:
>>> Hi Nikos,
>>>
>>> thanks for elaborating on those details.
>>>
>>> Hash table collisions, exception store entry commit overhead,
>>> SSD cache flush issues etc. are all valid points relative to performance
>>> and work set footprints in general.
>>>
>>> Do you have any performance numbers for your solution vs.
>>> a snapshot one showing the approach is actually superior in
>>> real configurations?
>> Hi Heinz,
>>
>> Please see below for detailed benchmark results.
>>
>>> I'm asking this particularly in the context of your remark
>>>
>>> "A write to a not yet hydrated region will be delayed until the
>>> corresponding
>>> region has been hydrated and the hydration of the region starts
>>> immediately."
>>>
>>> which'll cause a potentially large working set of delayed writes unless those
>>> cover the whole eventually larger than 4K region.
>>> How does your 'clone' target perform on such heavy write situations?
>>>
>> This situation occurs only when the writes are smaller than the region
>> size of dm-clone. E.g., if the user sets a region size of 64K and issues
>> 4K writes.
>>
>> In this case, we experience a performance drop due to COW. This is true
>> _both_ for dm-snapshot and dm-clone and is _unavoidable_.
>>
>> But, the common case will be setting a region size equal to the file
>> system block size, e.g., 4K, and thus avoiding the COW overhead. Note
>> that LVM snapshots _already_ use 4K as the _default_ chunk size.
>>
>> Nevertheless, even for larger region/chunk sizes, dm-clone outperforms
>> the dm-snapshot based solution, as is evident by the following
>> performance measurements.
>>
>>> In general, performance and storage footprint test results based on the same
>>> set of read/write tests including heavy loads with region size variations run
>>> on 'clone' and 'snapshot' would help your point.
>>>
>>> Heinz
>>>
>> I used fio to run a series of read and write tests that compare the
>> performance of dm-clone against your proposed dm-snapshot over dm-raid
>> solution.
>>
>> I used a 375GB spinning disk as the origin device storing the data to be
>> cloned and a 375GB SSD as the clone device and for storing both
>> dm-clone's metadata and dm-snapshot's exceptions (COW space).
>>
>> dm-clone stack (dmsetup ls --tree)
>> ==================================
>>
>> clone (254:3)
>>   ├─source--vg-origin--lv (254:2)
>>   │  └─ (8:16)
>>   ├─dest--vg-clone--lv (254:0)
>>   │  └─ (259:0)
>>   └─dest--vg-meta--lv (254:1)
>>  └─ (259:0)
>>
>> dm-snapshot + dm-raid stack (dmsetup ls --tree)
>> ===============================================
>>
>> mirrorvg-snap (254:7)
>>   ├─mirrorvg-snap-cow (254:6)
>>   │  └─ (259:0)
>>   └─mirrorvg-raid1--lv-real (254:5)
>>  ├─mirrorvg-raid1--lv_rimage_1 (254:3)
>>  │  └─ (259:0)
>>  ├─mirrorvg-raid1--lv_rmeta_1 (254:2)
>>  │  └─ (259:0)
>>  ├─mirrorvg-raid1--lv_rimage_0 (254:1)
>>

Re: [dm-devel] [RFC PATCH 1/1] dm: add clone target

2019-07-22 Thread Nikos Tsironis
 33.6 msec  |
|   64 KB   |  2.6 msec |  31.2 msec  |
|   128 KB  |  3.8 msec |  35.7 msec  |
+-----------+-----------+-------------+

* dm-snapshot+dm-raid has 7.5 to 90 times _more_ write latency than
  dm-clone.

* For the common case of a 4 KB region/chunk size, dm-clone has minimal
  overhead over the SSD device.

* Even for region/chunk sizes greater than 4KB dm-clone's overhead is
  minimal compared to dm-snapshot+dm-raid.
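
For reference, a random-write latency run of this shape could be reproduced
with an fio invocation along these lines (illustrative only; the exact job
options used are not stated in this thread):

  fio --name=randwrite-lat --filename=/dev/mapper/clone --direct=1 \
      --rw=randwrite --bs=4k --ioengine=libaio --iodepth=1 \
      --time_based --runtime=60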

2. Random read latency

+-------------------+----------+-------------+
| region/chunk size | dm-clone | dm-snapshot |
+-------------------+----------+-------------+
|       4 KB        | 1.5 msec |  10.7 msec  |
|       8 KB        | 1.5 msec |   9.7 msec  |
|      16 KB        | 1.5 msec |  11.9 msec  |
|      32 KB        | 1.5 msec |  28.6 msec  |
|      64 KB        | 1.5 msec |  27.5 msec  |
|     128 KB        | 1.5 msec |  27.3 msec  |
+-------------------+----------+-------------+

* dm-snapshot+dm-raid has 6.5 to 19 times _more_ read latency than
  dm-clone.

* For all region/chunk sizes dm-clone has minimal overhead over the HDD
  device.

IOPS benchmark
--------------

1. Random write IOPS

+-------------------+----------+-------------+
| region/chunk size | dm-clone | dm-snapshot |
+-------------------+----------+-------------+
|       4 KB        |  62347   |    3758     |
|       8 KB        |   696    |     388     |
|      16 KB        |   667    |     217     |
|      32 KB        |   614    |     207     |
|      64 KB        |   531    |     186     |
|     128 KB        |   417    |     159     |
+-------------------+----------+-------------+

* dm-clone achieves 1.8 to 16.6 times _more_ IOPS than
  dm-snapshot+dm-raid.

* For the common case of a 4 KB region/chunk size, dm-clone has minimal
  overhead over the SSD device.

* Even for region/chunk sizes greater than 4KB dm-clone achieves
  significantly more IOPS than dm-snapshot+dm-raid.

2. Random read IOPS

+-------------------+----------+-------------+
| region/chunk size | dm-clone | dm-snapshot |
+-------------------+----------+-------------+
|       4 KB        |   767    |     680     |
|       8 KB        |   714    |     677     |
|      16 KB        |   715    |     338     |
|      32 KB        |   717    |     338     |
|      64 KB        |   720    |     338     |
|     128 KB        |   724    |     339     |
+-------------------+----------+-------------+

* dm-clone achieves 1.1 to 2.1 times _more_ IOPS than
  dm-snapshot+dm-raid.

Bandwidth benchmark
-------------------

1. Sequential write BW

+-------------------+------------+-------------+
| region/chunk size |  dm-clone  | dm-snapshot |
+-------------------+------------+-------------+
|       4 KB        | 389.4 MB/s |  135.3 MB/s |
|       8 KB        | 390.5 MB/s |  231.7 MB/s |
|      16 KB        | 390.5 MB/s |  213.1 MB/s |
|      32 KB        | 390.4 MB/s |  214.0 MB/s |
|      64 KB        | 390.3 MB/s |  214.0 MB/s |
|     128 KB        | 390.5 MB/s |  211.3 MB/s |
+-------------------+------------+-------------+

* dm-clone achieves 1.7 to 2.9 times more write BW than
  dm-snapshot+dm-raid.

* For all region/chunk sizes dm-clone achieves the same write BW as the
  SSD device.

2. Sequential read BW

+-------------------+------------+-------------+
| region/chunk size |  dm-clone  | dm-snapshot |
+-------------------+------------+-------------+
|       4 KB        | 442.8 MB/s |  217.3 MB/s |
|       8 KB        | 443.8 MB/s |  288.8 MB/s |
|      16 KB        | 443.8 MB/s |  275.3 MB/s |
|      32 KB        | 443.8 MB/s |  276.1 MB/s |
|      64 KB        | 443.6 MB/s |  276.1 MB/s |
|     128 KB        | 443.6 MB/s |  275.2 MB/s |
+-------------------+------------+-------------+

* dm-clone achieves 1.5 to 2 times more read BW than
  dm-snapshot+dm-raid.

Metadata/Storage overhead
=========================

dm-clone had a _maximum_ metadata overhead of around 20 MB for all
benchmarks. As dm-clone doesn't require any extra COW space for
temporarily storing the written data (writes just go directly to the
clone device) this is the _only_ storage overhead incurred by dm-clone,
irrespective of the amount of written data.

On the other hand, the COW space utilization of dm-snapshot, for the
bandwidth benchmarks, varied from 11.95 GB to 20.41 GB, depending on the
amount of written data.

I want to emphasize that after the cloning/syncing is complete we have
to merge this multi-gigabyte COW space back to the clone/destination
device. This will cause _further_ performance degradation, which is
_not_ reflected in the above performance measurements, but _will_ be
present in real workloads, if the dm-snapshot based solution is used.
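
For concreteness, the merge step in question would be something like this
(a sketch, assuming the LVM names from the 'dmsetup table' output above):

  lvconvert --merge mirrorvg/snap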


To summarize, dm-clone performs _significantly_ better than a
dm-snapshot based solution, on all aspects (latency, IOPS, BW), and with
a _fraction_ of the storage/metadata overhead.

If you have any more questions, I would be more than happy to discuss
them with you.

Thanks,
Nikos

>

Re: [dm-devel] dm kcopyd: Increase sub-job size to 512KiB

2019-07-16 Thread Nikos Tsironis
On 7/16/19 5:14 PM, Mike Snitzer wrote:
> On Tue, Jul 16 2019 at 10:11am -0400,
> Mike Snitzer  wrote:
> 
>> On Tue, Jul 16 2019 at  9:59am -0400,
>> Nikos Tsironis  wrote:
>>
>>> On 7/15/19 9:22 PM, Mike Snitzer wrote:
>>>> On Fri, Jul 12 2019 at  9:45am -0400,
>>>> Nikos Tsironis  wrote:
>>>>
>>>>> Hi Mike,
>>>>>
>>>>> A kind reminder about this patch. Do you require any changes or will you
>>>>> merge it as is?
>>>>
>>>> I think we need changes to expose knob(s) to tune this value on a global
>>>> _and_ device level via sysfs.  E.g.:
>>>>
>>>> 1) dm_mod module param for global
>>>> 2) but also allow a per-device override, like:
>>>>echo 512 > /sys/block/dm-X/dm/kcopyd_subjob_size
>>>>
>>>
>>> Hi Mike,
>>>
>>> Thanks for your feedback. I agree, this sounds like the best thing to do.
>>>
>>>> 1 is super easy and is a start.  Layering in 2 is a bit more involved.
>>>
>>> Maybe I could help with (2). We could discuss about it and how you think
>>> it's best to do it and then I could proceed with an implementation.
>>>
>>> Please let me know what you think.
>>>
>>>>
>>>> In hindsight (given how risk-averse I am on changing the default) I
>>>> should've kept the default 128 but allowed override with modparam
>>>> dm_mod.kcopyd_subjob_size=1024
>>>>
>>>> Would this be an OK first step?
>>>
>>> Yes, this would be great.
>>>
>>>>
>>>> If so, we're still in the 5.3 merge window, I'll see what I can do.
>>>
>>> Shall I proceed with a patch adding the dm_mod.kcopyd_subjob_size
>>> modparam?
>>
>> Sure.  And it could be that we won't need 2.
>>
>> Ideally the default would work for every setup.  Less knobs the better.
>> But as a stop-gap I think we need to expose a knob that allows override.
>>
>> Thinking further, I don't think changing the default to 512K is too
>> risky (famous last words).  So please just update your original patch to
>> include the modparam so that users can get the old 64K back if needed.
>>
>> BTW, the param name should probably be "kcopyd_subjob_size_kb" to
>> reflect the value is KB.
> 
> One other thing: not sure what the max should be on this
> modparam.. maybe 1024K?

I think 1024K is a reasonable maximum value.

I will add the "kcopyd_subjob_size_kb" modparam and send a second
version of the patch.
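
With such a modparam the override would presumably look like this (a sketch,
assuming the knob lands under dm_mod as discussed):

  # at module load time
  modprobe dm_mod kcopyd_subjob_size_kb=512
  # or at runtime
  echo 512 > /sys/module/dm_mod/parameters/kcopyd_subjob_size_kb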

Thanks,
Nikos

> 
> Mike
> 



Re: [dm-devel] dm kcopyd: Increase sub-job size to 512KiB

2019-07-16 Thread Nikos Tsironis
On 7/15/19 9:22 PM, Mike Snitzer wrote:
> On Fri, Jul 12 2019 at  9:45am -0400,
> Nikos Tsironis  wrote:
> 
>> Hi Mike,
>>
>> A kind reminder about this patch. Do you require any changes or will you
>> merge it as is?
> 
> I think we need changes to expose knob(s) to tune this value on a global
> _and_ device level via sysfs.  E.g.:
> 
> 1) dm_mod module param for global
> 2) but also allow a per-device override, like:
>echo 512 > /sys/block/dm-X/dm/kcopyd_subjob_size
> 

Hi Mike,

Thanks for your feedback. I agree, this sounds like the best thing to do.

> 1 is super easy and is a start.  Layering in 2 is a bit more involved.

Maybe I could help with (2). We could discuss about it and how you think
it's best to do it and then I could proceed with an implementation.

Please let me know what you think.

> 
> In hindsight (given how risk-averse I am on changing the default) I
> should've kept the default 128 but allowed override with modparam
> dm_mod.kcopyd_subjob_size=1024
> 
> Would this be an OK first step?

Yes, this would be great.

> 
> If so, we're still in the 5.3 merge window, I'll see what I can do.

Shall I proceed with a patch adding the dm_mod.kcopyd_subjob_size
modparam?

Thanks,
Nikos

> 
> Thanks,
> Mike
>



Re: [dm-devel] [PATCH] dm kcopyd: Increase sub-job size to 512KiB

2019-07-12 Thread Nikos Tsironis
Hi Mike,

A kind reminder about this patch. Do you require any changes or will you
merge it as is?

Thanks,
Nikos

On 6/3/19 4:40 PM, Nikos Tsironis wrote:
> Currently, kcopyd has a sub-job size of 64KiB and a maximum number of 8
> sub-jobs. As a result, for any kcopyd job, we have a maximum of 512KiB
> of I/O in flight.
> 
> This upper limit to the amount of in-flight I/O under-utilizes fast
> devices and results in decreased throughput, e.g., when writing to a
> snapshotted thin LV with I/O size less than the pool's block size (so
> COW is performed using kcopyd).
> 
> Increase kcopyd's sub-job size to 512KiB, so we have a maximum of 4MiB
> of I/O in flight for each kcopyd job. This results in an up to 96%
> improvement of bandwidth when writing to a snapshotted thin LV, with I/O
> sizes less than the pool's block size.
> 
> We evaluate the performance impact of the change by running the
> snap_breaking_throughput benchmark, from the device mapper test suite
> [1].
> 
> The benchmark:
> 
>   1. Creates a 1G thin LV
>   2. Provisions the thin LV
>   3. Takes a snapshot of the thin LV
>   4. Writes to the thin LV with:
> 
>   dd if=/dev/zero of=/dev/vg/thin_lv oflag=direct bs=<size>
> 
> Running this benchmark with various thin pool block sizes and dd I/O
> sizes (all combinations triggering the use of kcopyd) we get the
> following results:
> 
> +-----------------+-------------+------------------+-----------------+
> | Pool block size | dd I/O size | BW before (MB/s) | BW after (MB/s) |
> +-----------------+-------------+------------------+-----------------+
> |      1 MB       |   256 KB    |       242        |       280       |
> |      1 MB       |   512 KB    |       238        |       295       |
> |                 |             |                  |                 |
> |      2 MB       |   256 KB    |       238        |       354       |
> |      2 MB       |   512 KB    |       241        |       380       |
> |      2 MB       |    1 MB     |       245        |       394       |
> |                 |             |                  |                 |
> |      4 MB       |   256 KB    |       248        |       412       |
> |      4 MB       |   512 KB    |       234        |       432       |
> |      4 MB       |    1 MB     |       251        |       474       |
> |      4 MB       |    2 MB     |       257        |       504       |
> |                 |             |                  |                 |
> |      8 MB       |   256 KB    |       239        |       420       |
> |      8 MB       |   512 KB    |       256        |       431       |
> |      8 MB       |    1 MB     |       264        |       467       |
> |      8 MB       |    2 MB     |       264        |       502       |
> |      8 MB       |    4 MB     |       281        |       537       |
> +-----------------+-------------+------------------+-----------------+
> 
> [1] https://github.com/jthornber/device-mapper-test-suite
> 
> Signed-off-by: Nikos Tsironis 
> ---
>  drivers/md/dm-kcopyd.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/md/dm-kcopyd.c b/drivers/md/dm-kcopyd.c
> index 671c24332802..db0a7d1e33b7 100644
> --- a/drivers/md/dm-kcopyd.c
> +++ b/drivers/md/dm-kcopyd.c
> @@ -28,7 +28,7 @@
>  
>  #include "dm-core.h"
>  
> -#define SUB_JOB_SIZE 128
> +#define SUB_JOB_SIZE 1024
>  #define SPLIT_COUNT  8
>  #define MIN_JOBS 8
>  #define RESERVE_PAGES (DIV_ROUND_UP(SUB_JOB_SIZE << SECTOR_SHIFT, PAGE_SIZE))
> 



Re: [dm-devel] [PATCH v2] dm snapshot: add optional discard support features

2019-07-12 Thread Nikos Tsironis
Hi Mike,

I have reviewed the patch. A few comments below.

On 7/11/19 11:46 PM, Mike Snitzer wrote:
> discard_zeroes_cow - a discard issued to the snapshot device that maps
> to entire chunks will zero the corresponding exception(s) in the
> snapshot's exception store.
> 
> discard_passdown_origin - a discard to the snapshot device is passed down
> to the snapshot-origin's underlying device.  This doesn't cause copy-out
> to the snapshot exception store because the snapshot-origin target is
> bypassed.
> 
> The discard_passdown_origin feature depends on the discard_zeroes_cow
> feature being enabled.
> 
> When these 2 features are enabled they allow a temporarily read-only
> device that has completely exhausted its free space to recover space.
> To do so dm-snapshot provides temporary buffer to accommodate writes
> that the temporarily read-only device cannot handle yet.  Once the upper
> layer frees space (e.g. fstrim to XFS) the discards issued to the
> dm-snapshot target will be issued to underlying read-only device whose
> free space was exhausted.  In addition those discards will also cause
> zeroes to be written to the snapshot exception store if corresponding
> exceptions exist.  If the underlying origin device provides
> deduplication for zero blocks then if/when the snapshot is merged backed
> to the origin those blocks will become unused.  Once the origin has
> gained adequate space, merging the snapshot back to the thinly
> provisioned device will permit continued use of that device without the
> temporary space provided by the snapshot.
> 
> Requested-by: John Dorminy 
> Signed-off-by: Mike Snitzer 
> ---
>  Documentation/device-mapper/snapshot.txt |  16 +++
>  drivers/md/dm-snap.c | 186 +++
>  2 files changed, 181 insertions(+), 21 deletions(-)
> 
> diff --git a/Documentation/device-mapper/snapshot.txt b/Documentation/device-mapper/snapshot.txt
> index b8bbb516f989..1810833f6dc6 100644
> --- a/Documentation/device-mapper/snapshot.txt
> +++ b/Documentation/device-mapper/snapshot.txt
> @@ -31,6 +31,7 @@ its visible content unchanged, at least until the <COW device> fills up.
>  
>  
>  *) snapshot <origin> <COW device> <persistent?> <chunksize>
> +   [<# feature args> [<arg>]*]
>  
>  A snapshot of the <origin> block device is created. Changed chunks of
>  <chunksize> sectors will be stored on the <COW device>.  Writes will
> @@ -53,8 +54,23 @@ When loading or unloading the snapshot target, the corresponding
>  snapshot-origin or snapshot-merge target must be suspended. A failure to
>  suspend the origin target could result in data corruption.
>  
> +Optional features:
> +
> +   discard_zeroes_cow - a discard issued to the snapshot device that
> +   maps to entire chunks will zero the corresponding exception(s) in
> +   the snapshot's exception store.
> +
> +   discard_passdown_origin - a discard to the snapshot device is passed
> +   down to the snapshot-origin's underlying device.  This doesn't cause
> +   copy-out to the snapshot exception store because the snapshot-origin
> +   target is bypassed.
> +
> +   The discard_passdown_origin feature depends on the discard_zeroes_cow
> +   feature being enabled.
> +
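As a sketch of the resulting table syntax (devices and sizes are placeholders;
the feature-argument layout follows the documentation above):

  dmsetup create snap --table "0 629145600 snapshot /dev/vg/origin_lv \
      /dev/vg/cow_lv P 8 2 discard_zeroes_cow discard_passdown_origin"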
>  
>  * snapshot-merge <origin> <COW device> <persistent?> <chunksize>
> +  [<# feature args> [<arg>]*]
>  
>  takes the same table arguments as the snapshot target except it only
>  works with persistent snapshots.  This target assumes the role of the
> diff --git a/drivers/md/dm-snap.c b/drivers/md/dm-snap.c
> index 3107f2b1988b..63916e1dc569 100644
> --- a/drivers/md/dm-snap.c
> +++ b/drivers/md/dm-snap.c
> @@ -1,6 +1,4 @@
>  /*
> - * dm-snapshot.c
> - *
>   * Copyright (C) 2001-2002 Sistina Software (UK) Limited.
>   *
>   * This file is released under the GPL.
> @@ -134,7 +132,10 @@ struct dm_snapshot {
>* - I/O error while merging
>*  => stop merging; set merge_failed; process I/O normally.
>*/
> - int merge_failed;
> + bool merge_failed:1;
> +
> + bool discard_zeroes_cow:1;
> + bool discard_passdown_origin:1;
>  
>   /*
>* Incoming bios that overlap with chunks being merged must wait
> @@ -1173,12 +1174,64 @@ static void stop_merge(struct dm_snapshot *s)
>   clear_bit(SHUTDOWN_MERGE, &s->state_bits);
>  }
>  
> +static int parse_snapshot_features(struct dm_arg_set *as, struct dm_snapshot *s,
> +struct dm_target *ti)
> +{
> + int r;
> + unsigned argc;
> + const char *arg_name;
> +
> + static const struct dm_arg _args[] = {
> + {0, 2, "Invalid number of feature arguments"},
> + };
> +
> + /*
> +  * No feature arguments supplied.
> +  */
> + if (!as->argc)
> + return 0;
> +
> + r = dm_read_arg_group(_args, as, &argc, &ti->error);
> + if (r)
> + return -EINVAL;
> +
> + while (argc && !r) {
> + arg_name = dm_shift_arg(as);
> + argc--;
> +
> + if (!strcasecmp(arg_name, "discard_zeroes_cow"))
> + s->discard_zeroes_cow = true;
> +
> +  

Re: [dm-devel] [RFC PATCH 1/1] dm: add clone target

2019-07-10 Thread Nikos Tsironis
On 7/10/19 12:28 AM, Heinz Mauelshagen wrote:
> Hi Nikos,
> 
> what is the crucial factor your target offers vs. resynchronizing such a
> latency distinct 2-legged mirror with a read-write snapshot (local, fast
> exception store) on top, tearing the mirror down keeping the local leg once
> fully in sync and merging the snapshot back into it?
> 
> Heinz
> 

Hi Heinz,

The most significant benefits of dm-clone over the solution you propose
is significantly better performance, no need for extra COW space, no
need to merge back a snapshot, and the ability to skip syncing the
unused space of a file system.

1. In order to ensure snapshot consistency, dm-snapshot needs to
   commit a completed exception, before signaling the completion of the
   write that triggered it to upper layers.

   The persistent exception store commits exceptions every time a
   metadata area is filled or when there are no more exceptions
   in-flight. For a 4K chunk size we have 256 exceptions per metadata
   area, so the best case scenario is one commit per 256 writes. Here I
   assume a write with size equal to the chunk size of dm-snapshot,
   e.g., 4K, so there is no COW overhead, and that we write to new
   chunks, so we need to allocate new exceptions.

   Part of committing the metadata is flushing the cache of the
   underlying device, if there is one. We have seen SSDs which can
   sustain hundreds of thousands of random write IOPS, but they take up
   to 8ms to flush their cache. In such a case, flushing the SSD cache
   every few writes significantly degrades performance.

   Moreover, dm-snapshot forces exceptions to complete in the order they
   were allocated, to avoid snapshot space leak on crash (commit
   230c83afdd9cd). This inserts further latency in exception completions
   and thus user write completions.

   On the other hand, when cloning a device we don't need to be so
   strict and can rely on committing the metadata every time a FLUSH or
   FUA bio is written, or periodically, like dm-thin and dm-cache do.

   dm-clone does exactly that. When a region/chunk is cloned or
   over-written by a write, we just set a bit in the relevant in-core
   bitmap. The metadata are committed once every second or when we
   receive a FLUSH or FUA bio.

   This improves performance significantly and results in increased IOPS
   and reduced latency, especially in cases where flushing the disk
   cache is very expensive.

2. For large devices, e.g. multi terabyte disks, resynchronizing the
   local leg can take a lot of time. If the application running over the
   local device is write-heavy, dm-snapshot will end up allocating a
   large number of exceptions. This increases the number of hash table
   collisions and thus increases the time we need to do a hash table
   lookup.

   dm-snapshot needs to look up the exception hash tables in order to
   service an I/O, so this increases latency and degrades performance.

   On the other hand, dm-clone is just testing a bit to see if a region
   is cloned or not and decides what to do based on that test.

3. With dm-clone there is no need to reserve extra COW space for
   temporarily storing the written data, while the clone device is
   syncing. Nor would one need to worry about monitoring and expanding
   the COW device to prevent it from filling up.

4. With dm-clone there is no need to merge back potentially several
   gigabytes once cloning/syncing completes. We also avoid the relevant
   performance degradation incurred by the merging process. Writes just
   go directly to the clone device.

5. dm-clone implements support for discards, so it can skip
   cloning/syncing the relevant regions. In the case of a large block
   device which contains a filesystem with empty space, e.g. a 2TB
   device containing 500GB of useful data in a filesystem, this can
   significantly reduce the time needed to sync/clone.
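
   As a sketch, taking advantage of this could be as simple as the following
   (assuming the clone device carries a filesystem that supports discard):

     mount /dev/mapper/clone /mnt/clone
     fstrim -v /mnt/clone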

This was a rather long email, but I hope it makes the significant
benefits of dm-clone over using dm-snapshot, and our rationale behind
the decision to implement a new target clearer.

I would be more than happy to continue the conversation and focus on any
other questions you may have.

Thanks,
Nikos



  1   2   >