Re: [dm-devel] Improve processing efficiency for addition and deletion of multipath devices

2016-11-30 Thread tang.junhui
Hello Ben, Hannes 

I'm sorry for the late reply.

> You can't just get the wwid with no work (look at all the work
> uev_add_path does, specifically alloc_path_with_pathinfo). Now you could
> reorder this, but there isn't much point, since it is doing useful
> things, like checking if this is a spurious uevent, and necessary
> things, like figuring out the device type and using that and the
> configuration to figure out HOW to get the wwid.

IMO, the WWID can be obtained from the uevent itself, since the uevent body 
already carries an ID_SERIAL field.
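
For illustration, a rough sketch (untested) of pulling ID_SERIAL out of a
uevent's environment strings; it assumes envp is a NULL-terminated array of
"KEY=VALUE" strings as udev delivers them, which is not necessarily the exact
multipathd uevent layout, and the helper name is made up:

#include <stddef.h>
#include <string.h>

/* Return the value of ID_SERIAL from a NULL-terminated array of
 * "KEY=VALUE" strings, or NULL if the key is absent (in which case the
 * existing pathinfo() work would still be needed). */
static const char *uevent_get_id_serial(char **envp)
{
        static const char key[] = "ID_SERIAL=";
        int i;

        for (i = 0; envp && envp[i]; i++) {
                if (!strncmp(envp[i], key, sizeof(key) - 1))
                        return envp[i] + sizeof(key) - 1;
        }
        return NULL;
}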

> It seems like what you want to do is to
> call uev_add_path multiple times, but defer most of the work that
> ev_add_path does (creating or updating the multipath device), until
> you've processed all the paths.

Not exactly: the input parameter "struct uevent *uev" is a batch of 
merged uevents, so uev_add_path() is called only once to process the 
merged uevent.

> split off uev_add_path() and
> ev_add_path().
> Then uev_add_path could generate a list of fully-formed paths which
> ev_add_path() would process.
> IE generalize coalesce_paths() to work on a passed-in list rather than
> the normal vecs->pathvec.

Hannes, I think my thoughts are now close to your idea. In uev_add_path() 
we gather all the information for the merged paths, and then call 
ev_add_path() to create or update the multipath device. Perhaps the 
difference between us is "processing all uevents of the same type (add, 
etc.) in ev_add_path()" versus "processing all uevents of the same type 
(add, etc.) that came from the same LUN in ev_add_path()". I think your 
idea can also reach the goal and reduce the number of DM reloads, so we 
will try to code it your way: in list_merger_uevents(&uevq_tmp) we will 
only merge uevents of the same type (add, etc.) and stop merging as soon 
as a uevent of another type occurs (a rough sketch of that rule follows 
below).
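
A rough sketch (untested) of that merge rule; the structure and field names
below are hypothetical, not the real multipathd uevent structure or its list
macros:

#include <stddef.h>
#include <string.h>

struct uev {
        const char *action;     /* "add", "remove", "change", ... */
        struct uev *next;       /* singly linked queue, oldest first */
        struct uev *merged;     /* chain of uevents folded into this one */
};

/* Fold the uevents following 'head' into it for as long as they carry
 * the same action; stop at the first different action so that ordering
 * between add/remove/change batches is preserved.  (Grouping by WWID/LUN
 * would be an additional comparison here.)  Returns the first unmerged
 * uevent, i.e. the start of the next batch. */
static struct uev *merge_same_type(struct uev *head)
{
        struct uev *cur = head->next, *tail = head;

        while (cur && !strcmp(cur->action, head->action)) {
                struct uev *next = cur->next;

                cur->next = NULL;
                tail->merged = cur;     /* append to head's merge chain */
                tail = cur;
                cur = next;
        }
        head->next = cur;       /* detach the merged batch from the queue */
        return cur;
}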

Thanks all,
Tang
 



From:    Hannes Reinecke 
To:      Benjamin Marzinski , tang.jun...@zte.com.cn
Cc:      dm-devel@redhat.com, zhang.ka...@zte.com.cn, Martin Wilck ,
         Bart Van Assche 
Date:    2016/11/28 23:46
Subject: Re: [dm-devel] Improve processing efficiency for addition and
         deletion of multipath devices
Sender:  dm-devel-boun...@redhat.com



On 11/28/2016 04:25 PM, Benjamin Marzinski wrote:
> On Mon, Nov 28, 2016 at 10:19:15AM +0800, tang.jun...@zte.com.cn wrote:
>>Hello Christophe, Ben, Hannes, Martin, Bart,
>>I am a member of the host-side software development team of the ZXUSP
>>storage project at ZTE Corporation. Facing the market demand, our team
>>has decided to write code next month to improve multipath efficiency.
>>The whole idea is in the mail below. We hope to participate in and make
>>progress with the open source community, so any suggestions and
>>comments are welcome.
> 
> Like I mentioned before, I think this is a good idea in general, but the
> devil is in the details here.
> 
>>
>>Thanks,
>>Tang
>>
>> --------------------------------------------------------------------------
>> --------------------------------------------------------------------------
>>1. Problem
>>In these scenarios, multipath processing efficiency is low:
>>1) Many paths exist in each multipath device,
>>2) Device addition or deletion during iSCSI login/logout or FC link up/down.
> 
> 
> 
>>4. Proposal
>>Rather than processing uevents one by one, uevents coming from the same
>>LUN can be merged into one, so the uevent processing thread only needs
>>to process it once and only one DM addition uevent is produced, which
>>reduces system resource consumption.
>>
>>The example from Chapter 2 is used again to explain the proposal:
>>1) Multipath receives block device addition uevents from udev:
>>UDEV  [89068.806214] add /devices/platform/host3/session44/target3:0:0/3:0:0:0/block/sdc (block)
>>UDEV  [89068.909457] add /devices/platform/host3/session44/target3:0:0/3:0:0:2/block/sdg (block)
>>UDEV  [89068.944956] add /devices/platform/host3/session44/target3:0:0/3:0:0:1/block/sde (block)
>>UDEV  [89068.959215] add /devices/platform/host5/session46/target5:0:0/5:0:0:0/block/sdh (block)
>>UDEV  [89068.978558] add /devices/platform/host5/session46/target5:0:0/5:0:0:2/block/sdk (block)
>>UDEV  [89069.004217] add /devices/platform/host5/session46/target5:0:0/5:0:0:1/block/sdj (block)
>>UDEV  [89069.486361] add /devices/platform/host4/session45/target4:0:0/4:0:0:1/block/sdf (block)
>>UDEV  [89069.495194] add /devices/platform/host4/session45/target4:0:0/4:0:0:0/block/sdd (block)
>>UDEV  [89069.511628] add /devices/platform/host4/session45/target4:0:0/4:0:0:2/block/sdi (block)
>>UDEV  [89069.716292] add /devices/platform/host6/session47/target6:0:0/6:0:0:0/block/sdl (block)
>>UDEV  [8906

Re: [dm-devel] [PATCH] libmultipath: ensure dev_loss_tmo will be update to MAX_DEV_LOSS_TMO if no_path_retry set to queue

2016-11-30 Thread peng.liang5
If fast_io_fail_tmo isn't set, select_fast_io_fail will fall back to
DEFAULT_FAST_IO_FAIL, so multipath will not apply the 600-second limit to
dev_loss_tmo.

And I think using MP_FAST_IO_FAIL_UNSET as the condition is meaningless,
because by that point multipath has already run select_fast_io_fail, even
when fast_io_fail_tmo was not set explicitly.
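
As a plain illustration of that point (untested; the sentinel and default
values below are illustrative, not the exact libmultipath definitions):

#define DEFAULT_FAST_IO_FAIL    5       /* illustrative default, in seconds */
#define MP_FAST_IO_FAIL_UNSET   0       /* illustrative sentinel value */

/* Once a select_*() helper has replaced the "unset" sentinel with a
 * default, any later test against MP_FAST_IO_FAIL_UNSET can never be
 * true, which is the objection raised above. */
static void select_fast_io_fail(int *fast_io_fail)
{
        if (*fast_io_fail == MP_FAST_IO_FAIL_UNSET)
                *fast_io_fail = DEFAULT_FAST_IO_FAIL;
}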






Original mail

From:    Benjamin Marzinski
To:      Peng Liang 10137102
Cc:      <dm-devel@redhat.com>, Zhang Kai 10072500
Date:    2016-11-29 08:30
Subject: Re: [dm-devel] [PATCH] libmultipath: ensure dev_loss_tmo will be update to
         MAX_DEV_LOSS_TMO if no_path_retry set to queue





On Fri, Nov 25, 2016 at 02:36:04PM +0800, peng.lia...@zte.com.cn wrote:
> From: PengLiang <peng.lia...@zte.com.cn>
> 
> If no_path_retry is set to queue, we should make sure dev_loss_tmo is
> updated to MAX_DEV_LOSS_TMO.
> But it will be limited to 600 if fast_io_fail_tmo is set to off or 0.

Doesn't the system still limit dev_loss_tmo to 600 if fast_io_fail_tmo isn't
set? Multipath was using this limit because the underlying system uses it.

-Ben

> 
> Signed-off-by: PengLiang <peng.lia...@zte.com.cn>
> ---
>  libmultipath/discovery.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/libmultipath/discovery.c b/libmultipath/discovery.c
> index aaa915c..05b0842 100644
> --- a/libmultipath/discovery.c
> +++ b/libmultipath/discovery.c
> @@ -608,7 +608,8 @@ sysfs_set_rport_tmo(struct multipath *mpp, struct path *pp)
>                                 goto out;
>                         }
>                 }
> -       } else if (mpp->dev_loss > DEFAULT_DEV_LOSS_TMO) {
> +       } else if (mpp->dev_loss > DEFAULT_DEV_LOSS_TMO &&
> +                  mpp->no_path_retry != NO_PATH_RETRY_QUEUE) {
>                 condlog(3, "%s: limiting dev_loss_tmo to %d, since "
>                         "fast_io_fail is not set",
>                         rport_id, DEFAULT_DEV_LOSS_TMO);
> -- 
> 2.8.1.windows.1


[dm-devel] [PATCH] dm-persistent-data: free sm_metadata on failed create

2016-11-30 Thread Benjamin Marzinski
In dm_sm_metadata_create we temporarily change the dm_space_map
operations from ops, whose destroy function deallocates the
sm_metadata, to bootstrap_ops, whose destroy function doesn't. If we
fail in dm_ll_new_metadata or sm_ll_extend, we exit back to
dm_tm_create_internal, which calls dm_sm_destroy with the intention
of freeing the sm_metadata, but it doesn't.
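
A toy model of the hazard and the fix (illustrative only; the types and
functions here are made up, not the kernel's):

#include <stdlib.h>

struct space_map;
struct sm_ops { void (*destroy)(struct space_map *sm); };
struct space_map { struct sm_ops ops; };

static void real_destroy(struct space_map *sm) { free(sm); }
static void noop_destroy(struct space_map *sm) { (void)sm; }    /* bootstrap: nothing to free yet */

static const struct sm_ops ops = { real_destroy };
static const struct sm_ops bootstrap_ops = { noop_destroy };

/* The caller's cleanup only ever calls sm->ops.destroy(sm).  If the
 * bootstrap ops are still installed when we return an error, that call
 * is a no-op and the object leaks; restoring the real ops before
 * returning (as the patch below does) lets the cleanup really free it. */
static int create(struct space_map *sm, int fail)
{
        sm->ops = bootstrap_ops;
        /* ... bootstrap work that may fail ... */
        sm->ops = ops;                  /* restore unconditionally */
        return fail ? -1 : 0;
}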

This patch sets the dm_space_map operations back to ops if
dm_sm_metadata_create fails while they are still set to bootstrap_ops.

Signed-off-by: Benjamin Marzinski 
---
 drivers/md/persistent-data/dm-space-map-metadata.c | 14 ++
 1 file changed, 6 insertions(+), 8 deletions(-)

diff --git a/drivers/md/persistent-data/dm-space-map-metadata.c b/drivers/md/persistent-data/dm-space-map-metadata.c
index 7e44005..20557e2 100644
--- a/drivers/md/persistent-data/dm-space-map-metadata.c
+++ b/drivers/md/persistent-data/dm-space-map-metadata.c
@@ -775,17 +775,15 @@ int dm_sm_metadata_create(struct dm_space_map *sm,
memcpy(&smm->sm, &bootstrap_ops, sizeof(smm->sm));
 
r = sm_ll_new_metadata(&smm->ll, tm);
+   if (!r) {
+   if (nr_blocks > DM_SM_METADATA_MAX_BLOCKS)
+   nr_blocks = DM_SM_METADATA_MAX_BLOCKS;
+   r = sm_ll_extend(&smm->ll, nr_blocks);
+   }
+   memcpy(&smm->sm, &ops, sizeof(smm->sm));
if (r)
return r;
 
-   if (nr_blocks > DM_SM_METADATA_MAX_BLOCKS)
-   nr_blocks = DM_SM_METADATA_MAX_BLOCKS;
-   r = sm_ll_extend(&smm->ll, nr_blocks);
-   if (r)
-   return r;
-
-   memcpy(&smm->sm, &ops, sizeof(smm->sm));
-
/*
 * Now we need to update the newly created data structures with the
 * allocated blocks that they were built from.
-- 
2.1.0



[dm-devel] [PATCH v2] dm raid: add raid4/5/6 journaling support

2016-11-30 Thread Heinz Mauelshagen
Add md raid4/5/6 journaling support (upstream commit bac624f3f86a started
the implementation) which closes the write hole (i.e. non-atomic updates
to stripes) using a dedicated journal device.

Background:
raid4/5/6 stripes hold N data payloads per stripe, plus one parity (raid4/5)
or two P/Q syndrome (raid6) payloads, kept in an in-memory stripe cache.
Parity or P/Q syndromes used to recover any data payloads in case of a disk
failure are calculated from the N data payloads and need to be updated on the
different component devices of the raid device.  Those are non-atomic,
persistent updates.  Hence a crash can cause failure to update all stripe
payloads persistently and thus cause data loss during stripe recovery.
This problem gets addressed by writing whole stripe cache entries (together with
journal metadata) to a persistent journal entry on a dedicated journal device.
Only once that journal entry has been written successfully is the stripe cache
entry updated on the component devices of the raid device (i.e. writethrough).
In case of a crash, the entry can be recovered from the journal and written
again, thus ensuring a consistent stripe payload suitable for data recovery.
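
As a minimal illustration of the parity relation described above (raid5 parity
is the byte-wise XOR of the data payloads; the raid6 Q syndrome over GF(2^8)
is omitted), with names and parameters that are illustrative rather than
md/dm-raid internals:

#include <stddef.h>

/* P must be rewritten on its component device whenever any data chunk
 * changes, which is why a stripe update spans several non-atomic writes. */
static void compute_parity(const unsigned char *const data[], size_t ndisks,
                           size_t chunk_bytes, unsigned char *parity)
{
        size_t i, d;

        for (i = 0; i < chunk_bytes; i++) {
                unsigned char p = 0;

                for (d = 0; d < ndisks; d++)
                        p ^= data[d][i];
                parity[i] = p;
        }
}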

Future dependencies:
once the writeback caching currently being worked on (to compensate for the
throughput implications of the writethrough overhead) is supported with
journaling upstream, an additional patch based on this one will support it
in dm-raid.

Journal resilience related remarks:
because stripes are recovered from the journal in case of a crash, the
journal device had better be resilient.  Resilience becomes mandatory with
future writeback support, because losing the working set in the log
means data loss, as opposed to writethrough, where the loss of the
journal device 'only' reintroduces the write hole.

Resolves: rhbz1400194

Fix comment on data offsets in parse_dev_params()
and initialize new_data_offset as well.

Signed-off-by: Heinz Mauelshagen 
---
 Documentation/device-mapper/dm-raid.txt |  13 +++
 drivers/md/dm-raid.c| 152 
 2 files changed, 147 insertions(+), 18 deletions(-)

diff --git a/Documentation/device-mapper/dm-raid.txt b/Documentation/device-mapper/dm-raid.txt
index 5e3786f..67f69b2 100644
--- a/Documentation/device-mapper/dm-raid.txt
+++ b/Documentation/device-mapper/dm-raid.txt
@@ -161,6 +161,15 @@ The target is named "raid" and it accepts the following parameters:
the RAID type (i.e. the allocation algorithm) as well, e.g.
changing from raid5_ls to raid5_n.
 
+	[journal_dev <dev>]
+		This option adds a journal device to raid4/5/6 raid sets and
+		uses it to close the 'write hole' caused by the non-atomic updates
+		to the component devices which can cause data loss during recovery.
+		The journal device is used as writethrough thus causing writes to
+		be throttled versus non-journaled raid4/5/6 sets.
+		Takeover/reshape is not possible with a raid4/5/6 journal device;
+		it has to be deconfigured before requesting these.
+
 <#raid_devs>: The number of devices composing the array.
Each device consists of two entries.  The first is the device
containing the metadata (if any); the second is the one containing the
@@ -245,6 +254,9 @@ recovery.  Here is a fuller description of the individual fields:
 <data_offset>	The current data offset to the start of the user data on
	each component device of a raid set (see the respective
	raid parameter to support out-of-place reshaping).
+	'J' - active raid4/5/6 journal device.
+	'D' - dead journal device.
+	'-' - no journal device.
 
 
 Message Interface
@@ -314,3 +326,4 @@ Version History
 1.9.0   Add support for RAID level takeover/reshape/region size
and set size reduction.
 1.9.1   Fix activation of existing RAID 4/10 mapped devices
+1.10.0  Add support for raid4/5/6 journal device
diff --git a/drivers/md/dm-raid.c b/drivers/md/dm-raid.c
index 9d5c6bb..215285b 100644
--- a/drivers/md/dm-raid.c
+++ b/drivers/md/dm-raid.c
@@ -24,6 +24,11 @@
  */
 #define MIN_FREE_RESHAPE_SPACE to_sector(4*4096)
 
+/*
+ * Minimum journal space 4 MiB in sectors.
+ */
+#define MIN_RAID456_JOURNAL_SPACE (4*2048)
+
 static bool devices_handle_discard_safely = false;
 
 /*
@@ -73,6 +78,9 @@ struct raid_dev {
 #define __CTR_FLAG_DATA_OFFSET 13 /* 2 */ /* Only with reshapable raid4/5/6/10! */
 #define __CTR_FLAG_RAID10_USE_NEAR_SETS 14 /* 2 */ /* Only with raid10! */
 
+/* New for v1.10.0 */
+#define __CTR_FLAG_JOURNAL_DEV 15 /* 2 */ /* Only with raid4/5/6! */
+
 /*
  * Flags for rs->ctr_flags field.
  */
@@ -91,6 +99,7 @@ struct raid_dev {
 #define CTR_FLAG_DELTA_DISKS   (1 << __CTR_FLAG_DELTA_DISKS)
 #define CTR_FLAG_DATA_OFFSET   (1 << __CTR_FLAG_DATA_OFFSET)
 #defin
