On 2018/1/11 15:57, Gang He wrote:
>
>> On 2018/1/11 15:19, Gang He wrote:
>>>
>>>> On 2018/1/11 12:31, Gang He wrote:
>>>>> Hi Changwei,
>>>>>
>>>>>> On 2018/1/11 11:33, Gang He wrote:
>>>>>>> Hi Changwei,
>>>>>>>
>>>>>>>> On 2018/1/11 10:07, Gang He wrote:
>>>>>>>>> Hi Changwei,
>>>>>>>>>
>>>>>>>>>> On 2018/1/10 18:14, Gang He wrote:
>>>>>>>>>>> Hi Changwei,
>>>>>>>>>>>
>>>>>>>>>>>> On 2018/1/10 17:05, Gang He wrote:
>>>>>>>>>>>>> Hi Changwei,
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi Gang,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 2017/12/14 13:16, Gang He wrote:
>>>>>>>>>>>>>>> As you know, ocfs2 supports trimming the underlying disk via the
>>>>>>>>>>>>>>> fstrim command. But there is a problem: ocfs2 is a shared-disk cluster
>>>>>>>>>>>>>>> file system, and if the user configures a scheduled fstrim job on each
>>>>>>>>>>>>>>> file system node, multiple nodes will trim the shared disk simultaneously.
>>>>>>>>>>>>>>> This is very wasteful in CPU and IO consumption, and it might also
>>>>>>>>>>>>>>> negatively affect the lifetime of poor-quality SSD devices.
>>>>>>>>>>>>>>> So we introduce a trimfs dlm lock to coordinate the nodes in this case,
>>>>>>>>>>>>>>> which lets only one fstrim command do the trimming on a shared disk
>>>>>>>>>>>>>>> within the cluster; the fstrim commands from the other nodes wait for
>>>>>>>>>>>>>>> the first fstrim to finish and return success directly, to avoid running
>>>>>>>>>>>>>>> the same trim on the shared disk again.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Compared with the first version, I changed the fstrim command's return
>>>>>>>>>>>>>>> value and behavior for the case where an fstrim command is already
>>>>>>>>>>>>>>> running on a shared disk.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Signed-off-by: Gang He <g...@suse.com>
>>>>>>>>>>>>>>> ---
>>>>>>>>>>>>>>>   fs/ocfs2/alloc.c | 44 ++++++++++++++++++++++++++++++++++++++++++++
>>>>>>>>>>>>>>>   1 file changed, 44 insertions(+)
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> diff --git a/fs/ocfs2/alloc.c b/fs/ocfs2/alloc.c
>>>>>>>>>>>>>>> index ab5105f..5c9c3e2 100644
>>>>>>>>>>>>>>> --- a/fs/ocfs2/alloc.c
>>>>>>>>>>>>>>> +++ b/fs/ocfs2/alloc.c
>>>>>>>>>>>>>>> @@ -7382,6 +7382,7 @@ int ocfs2_trim_fs(struct super_block *sb, struct fstrim_range *range)
>>>>>>>>>>>>>>>   	struct buffer_head *gd_bh = NULL;
>>>>>>>>>>>>>>>   	struct ocfs2_dinode *main_bm;
>>>>>>>>>>>>>>>   	struct ocfs2_group_desc *gd = NULL;
>>>>>>>>>>>>>>> +	struct ocfs2_trim_fs_info info, *pinfo = NULL;
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I think *pinfo* is not necessary.
>>>>>>>>>>>>> This pointer is necessary, since it can be NULL or non-NULL depending on
>>>>>>>>>>>>> the code logic.
>>>>>>>>>>>>
>>>>>>>>>>>> This point is OK for me.
>>>>>>>>>>>>>
>>>>>>>>>>>>>>>   	start = range->start >> osb->s_clustersize_bits;
>>>>>>>>>>>>>>>   	len = range->len >> osb->s_clustersize_bits;
>>>>>>>>>>>>>>> @@ -7419,6 +7420,42 @@ int ocfs2_trim_fs(struct super_block *sb, struct fstrim_range *range)
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>   	trace_ocfs2_trim_fs(start, len, minlen);
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> +	ocfs2_trim_fs_lock_res_init(osb);
>>>>>>>>>>>>>>> +	ret = ocfs2_trim_fs_lock(osb, NULL, 1);
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I don't get why you try to lock here and, if that fails, acquire the same
>>>>>>>>>>>>>> lock again later but wait until it is granted.
>>>>>>>>>>>>> Please think about the use case; the patch is only meant to handle this case.
>>>>>>>>>>>>> When the administrator configures a scheduled fstrim task on each node,
>>>>>>>>>>>>> each node will trigger an fstrim on the shared disks concurrently.
>>>>>>>>>>>>> In this case, we should avoid a duplicated fstrim on a shared disk, since it
>>>>>>>>>>>>> wastes CPU/IO resources and can affect SSD lifetime.
>>>>>>>>>>>>
>>>>>>>>>>>> I'm not worried that trimfs will affect the SSD's lifetime very much, since
>>>>>>>>>>>> the physical-to-logical address mapping table resides in RAM while the SSD
>>>>>>>>>>>> is working, and that table won't be at a big scale. My point here does not
>>>>>>>>>>>> affect this patch; just a tip.
>>>>>>>>>>> This depends on the SSD firmware implementation, but for secure trim it
>>>>>>>>>> really can affect SSD lifetime.
>>>>>>>>>>>
>>>>>>>>>>>>> Firstly, we use try-lock to get the fstrim dlm lock, to identify whether
>>>>>>>>>>>>> any other node is doing fstrim on the disk.
>>>>>>>>>>>>> If not, this node is the first one, and it should do the fstrim operation
>>>>>>>>>>>>> on the disk.
>>>>>>>>>>>>> If yes, this node is not the first one; it should wait until the first node
>>>>>>>>>>>>> has finished its fstrim operation, then return the result from the DLM
>>>>>>>>>>>>> lock's value.
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Can it just acquire the _trimfs_ lock as a blocking one directly here?
>>>>>>>>>>>>> We can not do a blocking lock directly, since we need to identify whether
>>>>>>>>>>>>> any other node is already doing an fstrim operation when this node starts
>>>>>>>>>>>>> its fstrim.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks for your elaboration.
>>>>>>>>>>>>
>>>>>>>>>>>> Well, how about a third node trying to trim the fs too?
>>>>>>>>>>>> It needs the LVB from the second node.
>>>>>>>>>>>> But it seems that the second node can't provide a valid LVB,
>>>>>>>>>>>> so the third node will perform trimfs once more.
>>>>>>>>>>> No, the second node does not change the DLM lock's value, but the DLM
>>>>>>>>>> lock's value is still valid.
>>>>>>>>>>> The third node also refers to this DLM lock's value and then follows the
>>>>>>>>>> same logic as the second node.
>>>>>>>>>>
>>>>>>>>>> Hi Gang,
>>>>>>>>>> I don't see any place where ocfs2_lock_res::ocfs2_lock_res_ops::set_lvb is
>>>>>>>>>> set while the flag LOCK_TYPE_USES_LVB is added.
>>>>>>>>>>
>>>>>>>>>> Are you sure the below code path can work well?
>>>>>>>>> Yes, I have done full testing on two and three nodes.
>>>>>>>>>
>>>>>>>>>> ocfs2_process_blocked_lock
>>>>>>>>>>   ocfs2_unblock_lock
>>>>>>>>>>     Reference to ::set_lvb since LOCK_TYPE_USES_LVB is set.
>>>>>>>>>>
>>>>>>>>> The set_lvb callback function is not necessary if we update the DLM lock
>>>>>>>> value ourselves before unlocking.
>>>>>>>>
>>>>>>>> I think this may relate to the *LOCK_TYPE_REQUIRES_REFRESH* flag.
>>>>>>> Alright, I think the *LOCK_TYPE_REQUIRES_REFRESH* flag is harmless here,
>>>>>> although it is probably unnecessary.
>>>>>>
>>>>>> OK, I agree with adding *LOCK_TYPE_REQUIRES_REFRESH* for now.
>>>>>> Can you give an explanation for my other concern, about three nodes
>>>>>> trimming the fs concurrently?
>>>>>>
>>>>>> For your convenience, I paste it here:
>>>>>>
>>>>>> The LVB passing path should be like below:
>>>>>>
>>>>>> NODE 1 lvb (EX granted at time1) -> NODE 2 lvb (EX granted at time2) ->
>>>>>> NODE 3 lvb (EX granted at time3), with time1 < time2 < time3.
>>>>>>
>>>>>> So I think NODE 3 can't obtain the LVB from NODE 1, only from NODE 2.
>>>>> Yes. If node1 finished the fstrim successfully, node1 updated the DLM lock
>>>> value; this DLM lock value is then valid, and the lock is unlocked/dropped.
>>>>> node2 returned the result from this DLM lock value; node2 did not change the
>>>> DLM lock value (but the value is still valid) and unlocked/dropped.
>>>>> node3 returned the result from this DLM lock value (which node1 updated);
>>>> node3 did not change the DLM lock value (but the value is still valid) and
>>>> unlocked/dropped.
>>>>> After the last nodeN has returned and unlocked/dropped this lock resource,
>>>> the DLM lock value becomes invalid.
>>>>>
>>>>> In the next round, the first node gets the lock and writes its fstrim result
>>>> to the DLM lock value directly, following the same logic as before.
>>>>>
>>>>> Moreover, the validity of this DLM lock value still holds if some node
>>>> suddenly crashes during the above sequence.
>>>>
>>>> I got your point.
>>>> You don't update the LVB when node 2 or node 3 gets the EX lock granted, so
>>>> the original LVB from node 1 will be transferred back to node 1, right?
>>> No. The DLM lock value block is transparent to the DLM lock mechanism.
>>> When one node gets the lock from the cluster, DLM considers the lock value
>> block valid.
>>> As for the contents of the DLM lock value block, you can update it manually
>> via a pointer when you hold the lock.
>>> If you do not update the DLM lock value block while holding the lock, that is
>> also fine; DLM does not care whether you update that buffer, it just copies it
>> back.
>>
>> Hi Gang,
>> I am puzzled. Perhaps my question is not clear.
>> Your answer is No, but your elaboration seems to say Yes...
>>
>> Anyway, if you don't update the LVB even when the EX lock is granted to node 2,
>> I suggest just adding a comment to your patch. :)
> The remaining waiting nodes get the lock and do not update the DLM lock value
> block, but this DLM lock value block is still valid. Actually, the DLM lock
> value block should only be updated by the node that really does an fstrim
> operation on the disk.
Hi Gang,
I think this trick is acceptable, but a little hacky. So it's up to you whether
to add a comment to indicate your intention.
>
>> Because I feel a bit strange about this patch's trick.
>>
>> And I still think the double lock (UNQUEUE first, QUEUE later) is not necessary.
>> A single QUEUE cluster lock can still do the same thing, I suppose.
> Since we need to identify whether there is a node which has been doing the
> fstrim when we start to do an fstrim.

Why do we have to identify whether there is a node doing trimfs via an extra
UNQUEUE cluster lock? If we directly request a QUEUE cluster lock, we can still
tell whether a node is doing fstrim by checking ocfs2_trim_fs_lvb::lvb_nodenum.
Specifically, if ocfs2_trim_fs_lvb::lvb_nodenum is set to another node, there
should be another node doing trimfs.

Do you agree?

Thanks,
Changwei

> In ocfs2, in most cases, the code takes a cluster lock and caches it until
> downgrade.
> This trick is better for avoiding network traffic, but it cannot provide a
> DLM-layer trylock feature, since the node does not drop the lock until the
> memory object (e.g. inode) is evicted.
>
> Thanks
> Gang
>
>> You drop the ocfs2 lock resource once the trimfs is done, and the LVB is
>> cleared to a zeroed space.
>> The first node will not get a valid LVB.
>> So the second node to trim the fs will get a LVB with ::lvb_nodenum set to
>> the first node.
>>
>> Thanks,
>> Changwei
>>
>>> Thanks
>>> Gang
>>>
>>>>> Thanks
>>>>> Gang
>>>>>
>>>>>> Moreover, if node 1 is the master of the trimfs lock resource, node 1's
>>>>>> LVB will be updated to be the same as node 2's.
>>>>>>
>>>>>>>> Actually, I don't see why this flag is necessary for _orphan scan_.
>>>>>>>> Why can't _orphan scan_ also set the LVB during
>>>>>>>> ocfs2_process_blocked_lock->ocfs2_unblock_lock?
>>>>>>>>
>>>>>>>> And it seems that _orphan scan_ also doesn't need to persist anything
>>>>>>>> from the LVB to disk.
>>>>>>> One more comment: please look at the dlmglue.c file carefully.
>>>>>>> set_lvb is a callback function, which is invoked automatically before
>>>>>> downgrade.
>>>>>>> You can use this mechanism, but you do not have to; you just need to make
>>>>>>> sure the DLM lock value is updated before unlock/downgrade.
>>>>>>>
>>>>>>> Thanks
>>>>>>> Gang
>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Changwei
>>>>>>>>
>>>>>>>>> By the way, the code is transparent to the underlying DLM stack (o2cb
>>>>>>>> or pcmk).
>>>>>>>>
>>>>>>>> True.
>>>>>>>>>
>>>>>>>>> Thanks
>>>>>>>>> Gang
>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Changwei
>>>>>>>>>>
>>>>>>>>>>>> IOW, three nodes are trying to trim the fs concurrently. Is your
>>>>>>>>>>>> patch able to handle such a scenario?
>>>>>>>>>>> Yes, the patch can handle this case.
>>>>>>>>>>>
>>>>>>>>>>>> Even a second lock request with QUEUE set that just follows
>>>>>>>>>>>> ocfs2_trim_fs_lock_res_uninit() will not get rid of concurrent
>>>>>>>>>>>> trimfs.
>>>>>>>>>>>>>
>>>>>>>>>>>>>>> +	if (ret < 0) {
>>>>>>>>>>>>>>> +		if (ret != -EAGAIN) {
>>>>>>>>>>>>>>> +			mlog_errno(ret);
>>>>>>>>>>>>>>> +			ocfs2_trim_fs_lock_res_uninit(osb);
>>>>>>>>>>>>>>> +			goto out_unlock;
>>>>>>>>>>>>>>> +		}
>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>> +		mlog(ML_NOTICE, "Wait for trim on device (%s) to "
>>>>>>>>>>>>>>> +		     "finish, which is running from another node.\n",
>>>>>>>>>>>>>>> +		     osb->dev_str);
>>>>>>>>>>>>>>> +		ret = ocfs2_trim_fs_lock(osb, &info, 0);
>>>>>>>>>>>>>>> +		if (ret < 0) {
>>>>>>>>>>>>>>> +			mlog_errno(ret);
>>>>>>>>>>>>>>> +			ocfs2_trim_fs_lock_res_uninit(osb);
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> In ocfs2_trim_fs_lock_res_uninit(), you drop the lock. But it was never
>>>>>>>>>>>>>> granted. Do we still need to drop the lock resource?
>>>>>>>>>>>>> Yes, we need to init/uninit the fstrim dlm lock resource each time.
>>>>>>>>>>>>> Otherwise, trylock does not work; this is a little different from other
>>>>>>>>>>>> dlm lock usage in ocfs2.
>>>>>>>>>>>>
>>>>>>>>>>>> This point is OK for now, too.
>>>>>>>>>>>>>
>>>>>>>>>>>>>>> +			goto out_unlock;
>>>>>>>>>>>>>>> +		}
>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>> +		if (info.tf_valid && info.tf_success &&
>>>>>>>>>>>>>>> +		    info.tf_start == start && info.tf_len == len &&
>>>>>>>>>>>>>>> +		    info.tf_minlen == minlen) {
>>>>>>>>>>>>>>> +			/* Avoid sending duplicated trim to a shared device */
>>>>>>>>>>>>>>> +			mlog(ML_NOTICE, "The same trim on device (%s) was "
>>>>>>>>>>>>>>> +			     "just done from node (%u), return.\n",
>>>>>>>>>>>>>>> +			     osb->dev_str, info.tf_nodenum);
>>>>>>>>>>>>>>> +			range->len = info.tf_trimlen;
>>>>>>>>>>>>>>> +			goto out_trimunlock;
>>>>>>>>>>>>>>> +		}
>>>>>>>>>>>>>>> +	}
>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>> +	info.tf_nodenum = osb->node_num;
>>>>>>>>>>>>>>> +	info.tf_start = start;
>>>>>>>>>>>>>>> +	info.tf_len = len;
>>>>>>>>>>>>>>> +	info.tf_minlen = minlen;
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> If we failed while doing trimfs, I think we should not cache the above
>>>>>>>>>>>>>> info in the LVB.
>>>>>>>>>>>>> It is necessary. If the second node is waiting for the first node and the
>>>>>>>>>>>>> first node fails to do the fstrim, the first node should update the dlm
>>>>>>>>>>>> lock's value, so that the second node gets the latest dlm lock value
>>>>>>>>>>>> (rather than the value from the last round); the second node will then do
>>>>>>>>>>>> the fstrim again, since the first node has failed.
>>>>>>>>>>>>
>>>>>>>>>>>> Yes, it makes sense.
>>>>>>>>>>>>>
>>>>>>>>>>>>>> BTW, it seems that this patch is on top of the 'try lock' patches which
>>>>>>>>>>>>>> you previously sent out.
>>>>>>>>>>>>>> Are they related?
>>>>>>>>>>>>> The try lock patch is related to non-blocking aio support for ocfs2.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>> Gang
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>> Changwei
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>   	/* Determine first and last group to examine based on start and len */
>>>>>>>>>>>>>>>   	first_group = ocfs2_which_cluster_group(main_bm_inode, start);
>>>>>>>>>>>>>>>   	if (first_group == osb->first_cluster_group_blkno)
>>>>>>>>>>>>>>> @@ -7463,6 +7500,13 @@ int ocfs2_trim_fs(struct super_block *sb, struct fstrim_range *range)
>>>>>>>>>>>>>>>   		group += ocfs2_clusters_to_blocks(sb, osb->bitmap_cpg);
>>>>>>>>>>>>>>>   	}
>>>>>>>>>>>>>>>   	range->len = trimmed * sb->s_blocksize;
>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>> +	info.tf_trimlen = range->len;
>>>>>>>>>>>>>>> +	info.tf_success = (ret ? 0 : 1);
>>>>>>>>>>>>>>> +	pinfo = &info;
>>>>>>>>>>>>>>> +out_trimunlock:
>>>>>>>>>>>>>>> +	ocfs2_trim_fs_unlock(osb, pinfo);
>>>>>>>>>>>>>>> +	ocfs2_trim_fs_lock_res_uninit(osb);
>>>>>>>>>>>>>>>   out_unlock:
>>>>>>>>>>>>>>>   	ocfs2_inode_unlock(main_bm_inode, 0);
>>>>>>>>>>>>>>>   	brelse(main_bm_bh);
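
For anyone trying to follow the flow under discussion, here is a condensed sketch
of the trylock-then-wait logic pieced together from the hunks quoted above. The
helper names (ocfs2_trim_fs_lock_res_init/_uninit, ocfs2_trim_fs_lock/_unlock)
and the ocfs2_trim_fs_info fields are taken from the patch, but the struct layout
and field types shown here are only inferred from their usage and may not match
the real definitions; treat this as an illustration, not the submitted code.

/*
 * Condensed, illustrative sketch of the locking flow discussed above.
 * Assumes the ocfs2 headers from the patch series; the ocfs2_trim_fs_info
 * field types are inferred from the quoted hunks and may differ from the
 * real definition.
 */
#include <linux/types.h>

struct ocfs2_trim_fs_info {
	u8  tf_valid;		/* LVB contents are usable */
	u8  tf_success;		/* last trim finished without error */
	u32 tf_nodenum;		/* node that performed the trim */
	u32 tf_start;		/* trimmed range, in clusters */
	u32 tf_len;
	u32 tf_minlen;
	u64 tf_trimlen;		/* bytes actually trimmed */
};

/* Sketch of the flow added to ocfs2_trim_fs(); not the complete function. */
static int trim_fs_flow_sketch(struct ocfs2_super *osb,
			       struct fstrim_range *range,
			       u32 start, u32 len, u32 minlen)
{
	struct ocfs2_trim_fs_info info, *pinfo = NULL;
	int ret;

	ocfs2_trim_fs_lock_res_init(osb);

	/* Non-blocking attempt: are we the first node to trim this device? */
	ret = ocfs2_trim_fs_lock(osb, NULL, 1);
	if (ret < 0) {
		if (ret != -EAGAIN)
			goto out_uninit;	/* real error, lock never granted */

		/* Another node is trimming: wait for it, then reuse its
		 * result if it covered exactly the same range. */
		ret = ocfs2_trim_fs_lock(osb, &info, 0);
		if (ret < 0)
			goto out_uninit;

		if (info.tf_valid && info.tf_success &&
		    info.tf_start == start && info.tf_len == len &&
		    info.tf_minlen == minlen) {
			range->len = info.tf_trimlen;
			goto out_unlock;	/* pinfo stays NULL: LVB untouched */
		}
	}

	/* We are the trimming node: record what we are about to do. */
	info.tf_nodenum = osb->node_num;
	info.tf_start = start;
	info.tf_len = len;
	info.tf_minlen = minlen;

	ret = 0;	/* the per-cluster-group trim loop elided here sets range->len */

	info.tf_trimlen = range->len;
	info.tf_success = (ret ? 0 : 1);
	pinfo = &info;	/* only the node that trimmed publishes a new LVB */
out_unlock:
	ocfs2_trim_fs_unlock(osb, pinfo);
out_uninit:
	ocfs2_trim_fs_lock_res_uninit(osb);
	return ret;
}

The point under debate is the out_unlock path: a waiting node unlocks with
pinfo == NULL, so the LVB written by the node that actually trimmed is passed on
unchanged to any later waiter.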