Re: [Ocfs2-devel] [PATCH] ocfs2: Fix start offset to ocfs2_zero_range_for_truncate()
Good catch! Thank you for the fix Reviewed-by: Srinivas Eeda <srinivas.e...@oracle.com> On 08/11/2016 04:12 PM, Ashish Samant wrote: > If we do fallocate with punch hole option on a reflink, with start offset > on a cluster boundary and end offset somewhere in another cluster, we > dont COW the first cluster starting at the start offset. But in this > case, we were wrongly passing this cluster to > ocfs2_zero_range_for_truncate() to zero out. > > Fix this by skipping this cluster in such a scenario. > > Signed-off-by: Ashish Samant <ashish.sam...@oracle.com> > --- > fs/ocfs2/file.c | 34 -- > 1 file changed, 24 insertions(+), 10 deletions(-) > > diff --git a/fs/ocfs2/file.c b/fs/ocfs2/file.c > index 4a6e130..ab305aa 100644 > --- a/fs/ocfs2/file.c > +++ b/fs/ocfs2/file.c > @@ -1522,7 +1522,8 @@ static int ocfs2_zero_partial_clusters(struct inode > *inode, > u64 start, u64 len) > { > int ret = 0; > - u64 tmpend, end = start + len; > + u64 tmpend = 0; > + u64 end = start + len; > struct ocfs2_super *osb = OCFS2_SB(inode->i_sb); > unsigned int csize = osb->s_clustersize; > handle_t *handle; > @@ -1554,18 +1555,31 @@ static int ocfs2_zero_partial_clusters(struct inode > *inode, > } > > /* > - * We want to get the byte offset of the end of the 1st cluster. > + * If start is on a cluster boundary and end is somewhere in another > + * cluster, we have not COWed the cluster starting at start, unless > + * end is also within the same cluster. So, in this case, we skip this > + * first call to ocfs2_zero_range_for_truncate() truncate and move on > + * to the next one. >*/ > - tmpend = (u64)osb->s_clustersize + (start & ~(osb->s_clustersize - 1)); > - if (tmpend > end) > - tmpend = end; > + if ((start & (csize - 1)) != 0) { > + /* > + * We want to get the byte offset of the end of the 1st > + * cluster. > + */ > + tmpend = (u64)osb->s_clustersize + > + (start & ~(osb->s_clustersize - 1)); > + if (tmpend > end) > + tmpend = end; > > - trace_ocfs2_zero_partial_clusters_range1((unsigned long long)start, > - (unsigned long long)tmpend); > + trace_ocfs2_zero_partial_clusters_range1( > + (unsigned long long)start, > + (unsigned long long)tmpend); > > - ret = ocfs2_zero_range_for_truncate(inode, handle, start, tmpend); > - if (ret) > - mlog_errno(ret); > + ret = ocfs2_zero_range_for_truncate(inode, handle, start, > + tmpend); > + if (ret) > + mlog_errno(ret); > + } > > if (tmpend < end) { > /* ___ Ocfs2-devel mailing list Ocfs2-devel@oss.oracle.com https://oss.oracle.com/mailman/listinfo/ocfs2-devel
Re: [Ocfs2-devel] what MLE wants to do?
In simple terms, mle's get into life at the beginning of the lock mastery and ends at the end of lock mastery(either new lockres or during lockres migration). It's purpose is to handle the race when more than one node try to master or find master of a lockres. On 05/03/2016 02:23 AM, Gechangwei wrote: Hi OCFS2 experts, Recently, I am working hard on OCFS2 DLM related code. I think it is an essential part of OCFS2 which must be mastered. But I was confused with MLE (stands for master list entry) related procedure. Would you please introduce me some study material covering MLE or just directly give me a clue on what MLE wants to do briefly via an email. Thanks a lot. Br. Gechangwei H3C Technologies Co., Limited > -- > -- > - - 本邮件及其附件含有杭州华三通信技术有限公司的保密信息,仅限于发送给上面 地址中列出 的个人或群组。禁止任何其他人以任何形式使用(包括但不限于全部或部分地泄 露、复制、 或散发)本邮件中的信息。如果您错收了本邮件,请您立即电话或邮件通知发件 人并删除本 邮件! This e-mail and its attachments contain confidential information from H3C, which is intended only for the person or entity whose address is listed above. Any use of the information contained herein in any way (including, but not limited to, total or partial disclosure, reproduction, or dissemination) by persons other than the intended recipient(s) is prohibited. If you receive this e-mail in error, please notify the sender by phone or email immediately and delete it! ___ Ocfs2-devel mailing list Ocfs2-devel@oss.oracle.com https://oss.oracle.com/mailman/listinfo/ocfs2-devel ___ Ocfs2-devel mailing list Ocfs2-devel@oss.oracle.com https://oss.oracle.com/mailman/listinfo/ocfs2-devel
Re: [Ocfs2-devel] [patch 19/25] ocfs2: o2hb: add negotiate timer
On 03/28/2016 05:41 AM, jiangyiwen wrote: > On 2016/3/28 9:23, Junxiao Bi wrote: >> Hi Yiwen, >> >> On 03/26/2016 10:54 AM, jiangyiwen wrote: >>> Hi, Junxiao >>> This patch may have a problem. That is journal of every nodes become >>> abort when storage down, and then when storage up, because journal >>> has become abort, all of operations of metadata will fail. So how to >>> restore environment? panic or reset? how to trigger? >> Journal aborted means io error was returned by storage, right? >> If so, o2hb_thread should also get io error, in this case, nego process >> will be bypassed, and nodes will be fenced at last, see "[patch 23/25] >> ocfs2: o2hb: don't negotiate if last hb fail". >> >> Thanks, >> Junxiao. >>> Thanks, >>> Yiwen Jiang. >> >> . >> > yes, you are right, sorry I don't see this patch before. > > But I understand the results of storage down should return IO error > rather than getting hang. that is upto the driver or storage. If they return I/O error, o2hb i/o's will get i/o errors and will be addressed accordingly. These changes are made for cases where storage is not yet responding due to head failovers/recovery on the storage array. > > Thanks, > Yiwen Jiang. > > > ___ > Ocfs2-devel mailing list > Ocfs2-devel@oss.oracle.com > https://oss.oracle.com/mailman/listinfo/ocfs2-devel ___ Ocfs2-devel mailing list Ocfs2-devel@oss.oracle.com https://oss.oracle.com/mailman/listinfo/ocfs2-devel
Re: [Ocfs2-devel] [PATCH] ocfs2: fix SGID not inherited issue
Thanks Junxiao! Acked-by: Srinivas Eeda <srinivas.e...@oracle.com> On 12/06/2015 08:09 PM, Junxiao Bi wrote: > commit 8f1eb48758aa ("ocfs2: fix umask ignored issue") introduced an issue, > SGID of sub dir was not inherited from its parents dir. It is because SGID > is set into "inode->i_mode" in ocfs2_get_init_inode(), but is overwritten > by "mode" which don't have SGID set later. > > Fixes: 8f1eb48758aa ("ocfs2: fix umask ignored issue") > Signed-off-by: Junxiao Bi <junxiao...@oracle.com> > Cc: <sta...@vger.kernel.org> > --- > fs/ocfs2/namei.c |4 +--- > 1 file changed, 1 insertion(+), 3 deletions(-) > > diff --git a/fs/ocfs2/namei.c b/fs/ocfs2/namei.c > index a03f6f4..3123408 100644 > --- a/fs/ocfs2/namei.c > +++ b/fs/ocfs2/namei.c > @@ -367,13 +367,11 @@ static int ocfs2_mknod(struct inode *dir, > goto leave; > } > > - status = posix_acl_create(dir, , _acl, ); > + status = posix_acl_create(dir, >i_mode, _acl, ); > if (status) { > mlog_errno(status); > goto leave; > } > - /* update inode->i_mode after mask with "umask". */ > - inode->i_mode = mode; > > handle = ocfs2_start_trans(osb, ocfs2_mknod_credits(osb->sb, > S_ISDIR(mode), ___ Ocfs2-devel mailing list Ocfs2-devel@oss.oracle.com https://oss.oracle.com/mailman/listinfo/ocfs2-devel
Re: [Ocfs2-devel] [PATCH v2 2/4] ocfs2: sysfile interfaces for online file check
On 11/24/2015 01:46 PM, Mark Fasheh wrote: > On Tue, Nov 03, 2015 at 04:20:27PM +0800, Junxiao Bi wrote: >> Hi Gang, >> >> On 11/03/2015 03:54 PM, Gang He wrote: >>> Hi Junxiao, >>> >>> Thank for your reviewing. >>> Current design, we use a sysfile as a interface to check/fix a file (via >>> pass a ino number). >>> But, this operation is manually triggered by user, instead of automatically >>> fix in the kernel. >>> Why? >>> 1) we should let users make this decision, since some users do not want to >>> fix when encountering a file system corruption, maybe they want to keep the >>> file system unchanged for a further investigation. >> If user don't want this, they should not use error=continue option, let >> fs go after a corruption is very dangerous. > Maybe we need another errors=XXX flag (maybe errors=fix)? Great idea Mark! I think adding errors=fix would be a good way to address both concerns :) It gives some control if anyone is uncomfortable of things getting checked/fixed automatically. > > You both make good points, here's what I gather from the conversation: > > - Some customers would be sad if they have to manually fix corruptions. > This takes effort on their part, and if the FS can handle it > automatically, it should. > > - There are valid concerns that automatically fixing things is a change in > behavior that might not be welcome, or worse might lead to unforseeable > circumstances. > > - I will add that fixing things automatically implies checking them > automatically which could introduce some performance impact depending on > how much checking we're doing. > > So if the user wants errors to be fixed automatically, they could mount with > errros=fix, and everyone else would have no change in behavior unless they > wanted to make use of the new feature. > > >>> 2) frankly speaking, this feature will probably bring a second corruption >>> if there is some error in the code, I do not suggest to use automatically >>> fix by default in the first version. >> I think if this feature could bring more corruption, then this should be >> fixed first. > Btw, I am pretty sure that Gang is referring to the feature being new and > thus more likely to have problems. There is nothing I see in here that is > file system corrupting. > --Mark > > > -- > Mark Fasheh > > ___ > Ocfs2-devel mailing list > Ocfs2-devel@oss.oracle.com > https://oss.oracle.com/mailman/listinfo/ocfs2-devel ___ Ocfs2-devel mailing list Ocfs2-devel@oss.oracle.com https://oss.oracle.com/mailman/listinfo/ocfs2-devel
Re: [Ocfs2-devel] [PATCH v2 0/4] Add online file check feature
Hi Gang, thanks for pointing to explanation of the feature. What I am curious about is ... what were the real cases that you came across prompted this change and how this change would help in that case. Thanks, --Srini On 10/28/2015 09:44 PM, Gang He wrote: > Hello Srini, > > There is a doc about ocfs2 online file check. > > OCFS2 online file check > --- > > This document will describe OCFS2 online file check feature. > > Introduction > > OCFS2 is often used in high-availaibility systems. However, OCFS2 usually > converts the filesystem to read-only on errors. This may not be necessary, > since > turning the filesystem read-only would affect other running processes as well, > decreasing availability. Then, a mount option (errors=continue) was > introduced, > which would return the EIO to the calling process and terminate furhter > processing so that the filesystem is not corrupted further. So,the filesystem > is > not converted to read-only, and the problematic file's inode number is > reported > in the kernel log so that the user can try to check/fix this file via online > filecheck feature. > > Scope > = > This effort is to check/fix small issues which may hinder day-to-day > operations > of a cluster filesystem by turning the filesystem read-only. The scope of > checking/fixing is at the file level, initially for regular files and > eventually > to all files (including system files) of the filesystem. > > In case of directory to file links is incorrect, the directory inode is > reported as erroneous. > > This feature is not suited for extravagant checks which involve dependency of > other components of the filesystem, such as but not limited to, checking if > the > bits for file blocks in the allocation has been set. In case of such an error, > the offline fsck should/would be recommended. > > Finally, such an operation/feature should not be automated lest the filesystem > may end up with more damage than before the repair attempt. So, this has to > be performed using user interaction and consent. > > User interface > == > When there are errors in the OCFS2 filesystem, they are usually accompanied > by the inode number which caused the error. This inode number would be the > input to check/fix the file. > > There is a sysfs file for each OCFS2 file system mounting: > >/sys/fs/ocfs2//filecheck > > Here, indicates the name of OCFS2 volumn device which has been > already > mounted. The file above would accept inode numbers. This could be used to > communicate with kernel space, tell which file(inode number) will be checked > or > fixed. Currently, three operations are supported, which includes checking > inode, fixing inode and setting the size of result record history. > > 1. If you want to know what error exactly happened to before fixing, > do > ># echo "CHECK " > /sys/fs/ocfs2//filecheck ># cat /sys/fs/ocfs2//filecheck > > The output is like this: >INOTYPEDONEERROR > 39502 0 1 GENERATION > > lists the inode numbers. > is what kind of operation you've done, 0 for inode check,1 for inode > fix. > indicates whether the operation has been finished. > says what kind of errors was found. For the details, please refer to > the > file linux/fs/ocfs2/filecheck.h. > > 2. If you determine to fix this inode, do > ># echo "FIX " > /sys/fs/ocfs2//filecheck ># cat /sys/fs/ocfs2//filecheck > > The output is like this: >INOTYPEDONEERROR > 39502 1 1 SUCCESS > > This time, the column indicates whether this fix is successful or not. > > 3. The record cache is used to store the history of check/fix result. Its > defalut size is 10, and can be adjust between the range of 10 ~ 100. You can > adjust the size like this: > ># echo "SET " > /sys/fs/ocfs2//filecheck > > Fixing stuff > > On receivng the inode, the filesystem would read the inode and the > file metadata. In case of errors, the filesystem would fix the errors > and report the problems it fixed in the kernel log. As a precautionary > measure, > the inode must first be checked for errors before performing a final fix. > > The inode and the fix history will be maintained temporarily in a > small linked list buffer which would contain the last (N) inodes > fixed/checked, along with the logs of what errors were reported/fixed. > > Thanks > Gang > > >> Hi Gang, >> >> thank you for implementing this. I would like to understand this better >> on where and how it helps ... would you mind sharing couple >> examples(real scenarios). >> >> Thanks, >> --Srini >> >> >> On 10/27/2015 11:25 PM, Gang He wrote: >>> When there are errors in the ocfs2 filesystem, >>> they are usually accompanied by the inode number which caused the error. >>> This inode number would be the input to fixing the file. >>> One of
Re: [Ocfs2-devel] [PATCH v2 0/4] Add online file check feature
Hi Gang, thank you for implementing this. I would like to understand this better on where and how it helps ... would you mind sharing couple examples(real scenarios). Thanks, --Srini On 10/27/2015 11:25 PM, Gang He wrote: > When there are errors in the ocfs2 filesystem, > they are usually accompanied by the inode number which caused the error. > This inode number would be the input to fixing the file. > One of these options could be considered: > A file in the sys filesytem which would accept inode numbers. > This could be used to communication back what has to be fixed or is fixed. > You could write: > $# echo "CHECK " > /sys/fs/ocfs2/devname/filecheck > or > $# echo "FIX " > /sys/fs/ocfs2/devname/filecheck > > Compare with first version, I use strncasecmp instead of double strncmp > functions. Second, update the source file contribution vendor. > > Gang He (4): >ocfs2: export ocfs2_kset for online file check >ocfs2: sysfile interfaces for online file check >ocfs2: create/remove sysfile for online file check >ocfs2: check/fix inode block for online file check > > fs/ocfs2/Makefile | 3 +- > fs/ocfs2/filecheck.c | 566 > + > fs/ocfs2/filecheck.h | 48 + > fs/ocfs2/inode.c | 196 - > fs/ocfs2/inode.h | 3 + > fs/ocfs2/ocfs2_trace.h | 2 + > fs/ocfs2/stackglue.c | 3 +- > fs/ocfs2/stackglue.h | 2 + > fs/ocfs2/super.c | 5 + > 9 files changed, 820 insertions(+), 8 deletions(-) > create mode 100644 fs/ocfs2/filecheck.c > create mode 100644 fs/ocfs2/filecheck.h > ___ Ocfs2-devel mailing list Ocfs2-devel@oss.oracle.com https://oss.oracle.com/mailman/listinfo/ocfs2-devel
Re: [Ocfs2-devel] [Ocfs2-test-devel] OCFS2 test project current status ANNOUNCEMENT
Thank you all, for all you contributions in fixing ocfs2 test suite, nice effort :) On 10/08/2015 11:09 PM, Junxiao Bi wrote: > Hi Eric, > > Thank you for reviewing patches and make a summary of the project. Let > put more effort to make it better in the future. > > Thanks, > Junxiao. > > On 10/09/2015 01:12 PM, Eric Ren wrote: >> Hi all, >> >> Now, we have already collected 55 patches into Mark's github repo[1] >> since Mar 16, 2012. >> [1] https://github.com/markfasheh/ocfs2-test >> >> Thanks a lot for their contributors: >> Junxiao Bi: 23 >> Jeff Liu: 11 >> Tiger Yang: 12 >> Goldwynr: 5 >> Eric Ren: 3 >> Gang He: 1 >> >> This patches below are dropped, need improved, or conflicted with >> previous ones: >> >> [PATCH 09/10] Trigger udev rescan to get volume label that mkfs made >> remotely. -- relate to blkid.conf, needs improved >> [PATCH 42/59] mmap-test: fix segment fault -- needs improved >> [PATCH 48/59] ocfs2-test: fix build error -- conflicted, dropped >> [PATCH 45/59] flock-test: disable posix lock test -- needs improved >> [PATCH 53/59] discontig: fix remote mount -- conflicted, dropped >> [PATCH 32/59] xattr-test: fix single test failed-- conflicted, dropped >> [PATCH 17/59] Splice: Fix conflict definition of splice(2) for splice tests. >>-- conflicted, dropped >> [PATCH 13/59] aio-stress: fix aio-stress Makefile to link pthread >> explicitly. -- conflicted, dropped >> >> There might be some incorrectness in my conclusion. For details, please >> refer to ocfs2-test-devel mailist! >> >> Test project is very important to ensure the quality of ocfs2. Welcome >> more contributions from you! Let's make it better;-) >> >> Thanks, >> Eric >> >> ___ >> Ocfs2-devel mailing list >> Ocfs2-devel@oss.oracle.com >> https://oss.oracle.com/mailman/listinfo/ocfs2-devel >> > > ___ > Ocfs2-test-devel mailing list > ocfs2-test-de...@oss.oracle.com > https://oss.oracle.com/mailman/listinfo/ocfs2-test-devel ___ Ocfs2-devel mailing list Ocfs2-devel@oss.oracle.com https://oss.oracle.com/mailman/listinfo/ocfs2-devel
Re: [Ocfs2-devel] ocfs2 version issue
Hi Goldwyn On 09/01/2015 03:32 AM, Goldwyn Rodrigues wrote: > Hi Junxiao, > > On 08/31/2015 09:22 PM, Junxiao Bi wrote: >> Hi Goldwyn, >> >> Ocfs2 kernel version is removed from commit >> ff8fb335221e2c446b0d4cbea26be371fd2feb64 ("ocfs2: remove versioning >> information"), but Oracle CRS depends on this version, and this made >> Oracle CRS can't be installed. So i think we should revert this commit >> and sync this version with ocfs2-tools. Do you have any other concern >> about this? > We removed the version because it did not make sense. I am not sure how other kernel modules work, but for me a OCFS2 maintaining it's own version is a good idea :). This version number defines the feature-set it currently supports. It also allows a feature-set to be backported to older kernels(if need arises). > Even if we put in > the version back we will have to maintain it and will have a similar > case where the version number in the kernel is way behind the tools > versions because of lack of updating. Add to it the confusion of users > who would not know which version to quote. I agree that this got out of sync so probably we should fix that part than to remove it all together. I also don't see why tools version has to dictate kernel module version or vise-versa. Isn't it practical for a kernel module to implement a new feature-set but doesn't require any tools changes or vise-versa ? ;) > I suggest the Oracle CRS should be modified to use the kernel version as > opposed to the ocfs2 kernel module specific versions. It may not be just Oracle CRS ... there may be other applications that might be using this. The problem is that we don't want a customer to upgrade a kernel and run into a failure which just looks bad on kernel/filesystem :(. One may argue to write notes/readme's to inform customers but it's not practical and is a painful for some customers to update application software and kernel at the same time. Anyway, if we decide to remove the versioning forever then it may be a good idea to deprecate it first so applications have enough time to incorporate these changes. But my vote would be for ocfs2 to maintain it's own version number :) ___ Ocfs2-devel mailing list Ocfs2-devel@oss.oracle.com https://oss.oracle.com/mailman/listinfo/ocfs2-devel
Re: [Ocfs2-devel] [RFC] ocfs2: Idea to make ocfs2_search_chain high efficiency
Hi Norton, while the localalloc bmp is enabled the chances of a particular gd ending with zero free bits is very minimal. local alloc bmp will pick next gd once min number of free bits falls below localalloc bmp size. So next gd is picked while the current gd still has free space. Having said that, I understand the problem you are describing. Since we only move the gd with more free bits ahead of the previous unusable gd we end of rescanning unusable gds whenever the newly picked gd becomes unusable. This is inefficient and causes performance problem when fs gets fragmented I am not sure if we will be allowed to use the bg_reserved2 ... but if we are then we could use it to remember next gd we should look at. For example, if we assume there are 'N' GDs and we are currently seeing that GD10 has more free space ... so the order of GDs are GD10, GD1, GD2, ... GDN. Now if we end up using most of the space in GD10, according to current logic we re-read GD1, GD2.. GD9 and then read GD11 and pick that (assuming it has free space). But instead if we had GD10 remembers GD11 as the next one to look at ... then we can skip reading GD1..GD9. Thanks, --Srini On 08/24/2015 05:30 AM, Norton.Zhu wrote: In ocfs2_search_chain, I found it has low efficiency while searching an available gd in some circumstances: 1) The lun has a great many gd(it reads lots of unavailable gd(free bits is zero)); 2) Not too many gd, but the available gd is scattered in the unavailable gd(fragmentation); So I have an idea to optimize the search method: 1) Use the reserved member in the ocfs2_group_desc to make an available chain list(gds in the list are all available, free bits more than zero); 2) At the beginning, the chain list is the same with origin chain list; 3) While do allocation, it searches gd in the available chain list; 4) After each allocation, if some gd's free bits is zero, Remove it from the available chain list; 5) After each reclaim, if some gd's free bits change from zero to positive, Append it to the head of the available chain list; Once started with the basics outlined above, no unavailable gd will be read. Anyone has better ideas or advices? ___ Ocfs2-devel mailing list Ocfs2-devel@oss.oracle.com https://oss.oracle.com/mailman/listinfo/ocfs2-devel ___ Ocfs2-devel mailing list Ocfs2-devel@oss.oracle.com https://oss.oracle.com/mailman/listinfo/ocfs2-devel
Re: [Ocfs2-devel] [PATCH] ocfs2: optimize error handling in dlm_request_join
On 08/20/2015 04:50 AM, Norton.Zhu wrote: Currently error handling in dlm_request_join is a little obscure. So optimize it to promote readability. Signed-off-by: Norton.Zhu norton@huawei.com --- fs/ocfs2/dlm/dlmdomain.c | 69 ++-- 1 file changed, 37 insertions(+), 32 deletions(-) diff --git a/fs/ocfs2/dlm/dlmdomain.c b/fs/ocfs2/dlm/dlmdomain.c index 7df88a6..af4f7aa 100644 --- a/fs/ocfs2/dlm/dlmdomain.c +++ b/fs/ocfs2/dlm/dlmdomain.c @@ -1465,39 +1465,44 @@ static int dlm_request_join(struct dlm_ctxt *dlm, if (status == -ENOPROTOOPT) { status = 0; *response = JOIN_OK_NO_MAP; - } else if (packet.code == JOIN_DISALLOW || -packet.code == JOIN_OK_NO_MAP) { - *response = packet.code; - } else if (packet.code == JOIN_PROTOCOL_MISMATCH) { - mlog(ML_NOTICE, - This node requested DLM locking protocol %u.%u and - filesystem locking protocol %u.%u. At least one of - the protocol versions on node %d is not compatible, - disconnecting\n, - dlm-dlm_locking_proto.pv_major, - dlm-dlm_locking_proto.pv_minor, - dlm-fs_locking_proto.pv_major, - dlm-fs_locking_proto.pv_minor, - node); - status = -EPROTO; - *response = packet.code; - } else if (packet.code == JOIN_OK) { - *response = packet.code; - /* Use the same locking protocol as the remote node */ - dlm-dlm_locking_proto.pv_minor = packet.dlm_minor; - dlm-fs_locking_proto.pv_minor = packet.fs_minor; - mlog(0, - Node %d responds JOIN_OK with DLM locking protocol - %u.%u and fs locking protocol %u.%u\n, - node, - dlm-dlm_locking_proto.pv_major, - dlm-dlm_locking_proto.pv_minor, - dlm-fs_locking_proto.pv_major, - dlm-fs_locking_proto.pv_minor); } else { - status = -EINVAL; - mlog(ML_ERROR, invalid response %d from node %u\n, - packet.code, node); + *response = packet.code; Norton, it looks much better :) one minor comment. we don't want to reset *response with packet.code if it's unrecognized. We should leave the value to JOIN_DISALLOW; rest looks good. + switch (packet.code) { + case JOIN_DISALLOW: + case JOIN_OK_NO_MAP: + break; + case JOIN_PROTOCOL_MISMATCH: + mlog(ML_NOTICE, + This node requested DLM locking protocol %u.%u and + filesystem locking protocol %u.%u. At least one of + the protocol versions on node %d is not compatible, + disconnecting\n, + dlm-dlm_locking_proto.pv_major, + dlm-dlm_locking_proto.pv_minor, + dlm-fs_locking_proto.pv_major, + dlm-fs_locking_proto.pv_minor, + node); + status = -EPROTO; + break; + case JOIN_OK: + /* Use the same locking protocol as the remote node */ + dlm-dlm_locking_proto.pv_minor = packet.dlm_minor; + dlm-fs_locking_proto.pv_minor = packet.fs_minor; + mlog(0, + Node %d responds JOIN_OK with DLM locking protocol + %u.%u and fs locking protocol %u.%u\n, + node, + dlm-dlm_locking_proto.pv_major, + dlm-dlm_locking_proto.pv_minor, + dlm-fs_locking_proto.pv_major, + dlm-fs_locking_proto.pv_minor); + break; + default: + status = -EINVAL; + mlog(ML_ERROR, invalid response %d from node %u\n, + packet.code, node); + break; + } } mlog(0, status %d, node %d response is %d\n, status, node, ___ Ocfs2-devel mailing list Ocfs2-devel@oss.oracle.com https://oss.oracle.com/mailman/listinfo/ocfs2-devel
Re: [Ocfs2-devel] ocfs2: should not use le32_add_cpu to set ocfs2_dinode i_flags
Ok, understood. Thanks for explaining :) On 01/21/2015 05:07 PM, Joseph Qi wrote: Hi Srini, On 2015/1/22 1:55, Srinivas Eeda wrote: Hi Joesph, thanks a lot for submitting the above patch. I am trying to understand what kind of flag corruption have you noticed and under what circumstances ? I agree with the patch that bitwise operations are better than adding, but I am not able to understand the corruption it could cause. Can you please share! Suppose a the following case: The dinode i_flag already has the OCFS2_ORPHANED_FL bit set, and calling le32_add_cpu again. Then it will corrupt the i_flag. -- Joseph Thanks, --Srini ___ Ocfs2-devel mailing list Ocfs2-devel@oss.oracle.com https://oss.oracle.com/mailman/listinfo/ocfs2-devel
Re: [Ocfs2-devel] Git repos?
git repos for ocfs2 kernel code is linux.git and linux-next.git on kernel.org For tools it's git://oss.oracle.com/git/ocfs2-tools.git On 01/08/2015 10:43 AM, Jonathan Fraser wrote: Hi, Sorry for the newbe question, but where are the git repos for the staged changes? I've searched high and low. I'd like to test with some of the recent changes. Thanks, Jon Fraser ___ Ocfs2-devel mailing list Ocfs2-devel@oss.oracle.com https://oss.oracle.com/mailman/listinfo/ocfs2-devel ___ Ocfs2-devel mailing list Ocfs2-devel@oss.oracle.com https://oss.oracle.com/mailman/listinfo/ocfs2-devel
Re: [Ocfs2-devel] ocfs2-tools repository and documentation
Hi Goldwyn, Germano I have updated ocfs2-tools git repo with latest fixes :) Germano, following documentation is available for OCFS2 http://docs.oracle.com/cd/E37670_01/E37355/html/ol_ocfs2.html http://www.oracle.com/us/technologies/linux/025995.htm https://oss.oracle.com/projects/ocfs2/dist/documentation/v1.8/ocfs2-1_8_2-manpages.pdf https://oss.oracle.com/projects/ocfs2/documentation/ On 12/15/2014 10:47 AM, Germano Percossi wrote: Hi Goldwyn, Thanks for your answer. I will have a look at your code. I just want to be sure that the source code I am looking at is the latest one to avoid fixing things that maybe have been already fixed elsewhere. Same is true for documentation and package repositories. If I could have links to the latest, no matter how old and not maintained, documentation and repository, it would be helpful, as well. Do you know of any other distro, besides Suse, maintaining a stack of patches? Cheers, Germano On 14/12/14 07:24, Goldwyn Rodrigues wrote: Yes, if you don't count the last 2 patches, it is more close to 3 years now ;) Since the patches were not being updated. I started maintaining an alternate repository where I am putting all the bugs reported at SUSE: https://github.com/goldwynr/ocfs2-tools Branch suse-fixes has the fixes found by SUSE over the upstream branch. Branch nocontrold has the patches for the feature of doing away with ocfs2_controld to work with the latest corosync/pacemaker stack. Patches for the kernel are already in the kernel but the ones in the tools need some review. I had a mail conversation with Srini and he has promised to update the upstream branch soon. Regards, ___ Ocfs2-devel mailing list Ocfs2-devel@oss.oracle.com https://oss.oracle.com/mailman/listinfo/ocfs2-devel ___ Ocfs2-devel mailing list Ocfs2-devel@oss.oracle.com https://oss.oracle.com/mailman/listinfo/ocfs2-devel
Re: [Ocfs2-devel] [PATCH v2] ocfs2: o2net: fix connect expired
looks good. Thanks for your explanation and fix Reviewed-by: Srinivas Eeda srinivas.e...@oracle.com On 10/30/2014 11:08 PM, Junxiao Bi wrote: Set nn_persistent_error to -ENOTCONN will stop reconnect since the stop condition in o2net_start_connect() will be true. stop = (nn-nn_sc || (nn-nn_persistent_error (nn-nn_persistent_error != -ENOTCONN || timeout == 0))); This will make connection never be established if the first connection request is lost. Set nn_persistent_error to 0 when connect expired to fix this. With this changes, dlm will not be waken up when connect expired, this is OK since dlm depends on network, dlm can do nothing in this case if waken up. Let it wait there for network recover and connect built again to continue. Signed-off-by: Junxiao Bi junxiao...@oracle.com --- fs/ocfs2/cluster/tcp.c |2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/fs/ocfs2/cluster/tcp.c b/fs/ocfs2/cluster/tcp.c index 97de0fb..4d6b645 100644 --- a/fs/ocfs2/cluster/tcp.c +++ b/fs/ocfs2/cluster/tcp.c @@ -1736,7 +1736,7 @@ static void o2net_connect_expired(struct work_struct *work) o2net_idle_timeout() / 1000, o2net_idle_timeout() % 1000); - o2net_set_nn_state(nn, NULL, 0, -ENOTCONN); + o2net_set_nn_state(nn, NULL, 0, 0); } spin_unlock(nn-nn_lock); } ___ Ocfs2-devel mailing list Ocfs2-devel@oss.oracle.com https://oss.oracle.com/mailman/listinfo/ocfs2-devel
Re: [Ocfs2-devel] [PATCH] ocfs2: o2net: fix connect expired
Hi Junxiao, I am hesitant to make changes to o2net_set_nn_state calls as it is a sensitive code. Also, currently, nn_timeout is only used to determine if a reconnect needs to be allowed or not. So, lets rename that to nn_reconnect and use it for that purpose. With your change, o2net_connect_expired now doesn't pass -ENOTCONN when a network hb timeout happens. So o2net_set_nn_state now fails to wake_up any threads stuck on nn_sc_wq(check o2net_send_message_vec function) ? Thanks, --Srini On 10/30/2014 04:58 AM, Junxiao Bi wrote: - srinivas.e...@oracle.com wrote: Hi Junxiao, thanks for explaining. For this case allowing a reconnect (setting atomic_set(nn-nn_timeout, 1); ) in o2net_connect_expired should work ? Hi Srini, Yes, that should also work for this case. But it breaks the usage of nn_timeout, it is only used for idle timeout. Thanks, Junxiao. Thanks, --Srini On 10/29/2014 10:32 PM, Junxiao Bi wrote: Hi Srini, On 10/30/2014 01:16 PM, Srinivas Eeda wrote: Junxiao, can you please describe under what circumstances you saw this problem? My understanding is o2net_connect_expired is only queued when connection actually broke and ENOTCONN is the right error in that case. This happened when o2net was issuing the first connect request to some node, but the request packet is lost due to some error like network broken, then the connect would be expired, in o2net_connect_expired() that set nn-nn_persistent_error to -ENOTCONN but timeout was zero, so o2net_start_connect() would return without sending another connect request, connection to the node will never be built. Thanks, Junxiao. Thanks, --Srini On 10/29/2014 06:41 PM, Junxiao Bi wrote: Set nn_persistent_error to -ENOTCONN will stop reconnect since the stop condition in o2net_start_connect() will be true. stop = (nn-nn_sc || (nn-nn_persistent_error (nn-nn_persistent_error != -ENOTCONN || timeout == 0))); This will make connection never be established if the first connection request is lost. Signed-off-by: Junxiao Bi junxiao...@oracle.com --- fs/ocfs2/cluster/tcp.c |2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/fs/ocfs2/cluster/tcp.c b/fs/ocfs2/cluster/tcp.c index 97de0fb..4d6b645 100644 --- a/fs/ocfs2/cluster/tcp.c +++ b/fs/ocfs2/cluster/tcp.c @@ -1736,7 +1736,7 @@ static void o2net_connect_expired(struct work_struct *work) o2net_idle_timeout() / 1000, o2net_idle_timeout() % 1000); -o2net_set_nn_state(nn, NULL, 0, -ENOTCONN); +o2net_set_nn_state(nn, NULL, 0, 0); } spin_unlock(nn-nn_lock); } ___ Ocfs2-devel mailing list Ocfs2-devel@oss.oracle.com https://oss.oracle.com/mailman/listinfo/ocfs2-devel
Re: [Ocfs2-devel] [PATCH] ocfs2: o2net: fix connect expired
Junxiao, can you please describe under what circumstances you saw this problem? My understanding is o2net_connect_expired is only queued when connection actually broke and ENOTCONN is the right error in that case. Thanks, --Srini On 10/29/2014 06:41 PM, Junxiao Bi wrote: Set nn_persistent_error to -ENOTCONN will stop reconnect since the stop condition in o2net_start_connect() will be true. stop = (nn-nn_sc || (nn-nn_persistent_error (nn-nn_persistent_error != -ENOTCONN || timeout == 0))); This will make connection never be established if the first connection request is lost. Signed-off-by: Junxiao Bi junxiao...@oracle.com --- fs/ocfs2/cluster/tcp.c |2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/fs/ocfs2/cluster/tcp.c b/fs/ocfs2/cluster/tcp.c index 97de0fb..4d6b645 100644 --- a/fs/ocfs2/cluster/tcp.c +++ b/fs/ocfs2/cluster/tcp.c @@ -1736,7 +1736,7 @@ static void o2net_connect_expired(struct work_struct *work) o2net_idle_timeout() / 1000, o2net_idle_timeout() % 1000); - o2net_set_nn_state(nn, NULL, 0, -ENOTCONN); + o2net_set_nn_state(nn, NULL, 0, 0); } spin_unlock(nn-nn_lock); } ___ Ocfs2-devel mailing list Ocfs2-devel@oss.oracle.com https://oss.oracle.com/mailman/listinfo/ocfs2-devel
Re: [Ocfs2-devel] [PATCH] ocfs2: o2net: fix connect expired
Hi Junxiao, thanks for explaining. For this case allowing a reconnect (setting atomic_set(nn-nn_timeout, 1); ) in o2net_connect_expired should work ? Thanks, --Srini On 10/29/2014 10:32 PM, Junxiao Bi wrote: Hi Srini, On 10/30/2014 01:16 PM, Srinivas Eeda wrote: Junxiao, can you please describe under what circumstances you saw this problem? My understanding is o2net_connect_expired is only queued when connection actually broke and ENOTCONN is the right error in that case. This happened when o2net was issuing the first connect request to some node, but the request packet is lost due to some error like network broken, then the connect would be expired, in o2net_connect_expired() that set nn-nn_persistent_error to -ENOTCONN but timeout was zero, so o2net_start_connect() would return without sending another connect request, connection to the node will never be built. Thanks, Junxiao. Thanks, --Srini On 10/29/2014 06:41 PM, Junxiao Bi wrote: Set nn_persistent_error to -ENOTCONN will stop reconnect since the stop condition in o2net_start_connect() will be true. stop = (nn-nn_sc || (nn-nn_persistent_error (nn-nn_persistent_error != -ENOTCONN || timeout == 0))); This will make connection never be established if the first connection request is lost. Signed-off-by: Junxiao Bi junxiao...@oracle.com --- fs/ocfs2/cluster/tcp.c |2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/fs/ocfs2/cluster/tcp.c b/fs/ocfs2/cluster/tcp.c index 97de0fb..4d6b645 100644 --- a/fs/ocfs2/cluster/tcp.c +++ b/fs/ocfs2/cluster/tcp.c @@ -1736,7 +1736,7 @@ static void o2net_connect_expired(struct work_struct *work) o2net_idle_timeout() / 1000, o2net_idle_timeout() % 1000); -o2net_set_nn_state(nn, NULL, 0, -ENOTCONN); +o2net_set_nn_state(nn, NULL, 0, 0); } spin_unlock(nn-nn_lock); } ___ Ocfs2-devel mailing list Ocfs2-devel@oss.oracle.com https://oss.oracle.com/mailman/listinfo/ocfs2-devel
[Ocfs2-devel] [PATCH 1/1] o2dlm: fix a race between purge and master query
Node A sends master query request to node B which is the master. At this time lockres happens to be on purgelist. dlm_master_request_handler gets the dlm spinlock, finds the resource and releases the dlm spin lock. Right at this dlm_thread on this node could purge the lockres. dlm_master_request_handler can then acquire lockres spinlock and reply to Node A that node B is the master even though lockres on node B is purged. The above scenario will now make node A falsely think node B is the master which is inconsistent. Further if another node C tries to master the same resource, every node will respond they are not the master. Node C then masters the resource and sends assert master to all nodes. This will now make node A crash with the following message. dlm_assert_master_handler:1831 ERROR: DIE! Mastery assert from 9, but current owner is 10! Signed-off-by: Srinivas Eeda srinivas.e...@oracle.com --- fs/ocfs2/dlm/dlmmaster.c | 12 1 file changed, 12 insertions(+) diff --git a/fs/ocfs2/dlm/dlmmaster.c b/fs/ocfs2/dlm/dlmmaster.c index 215e41a..3689b35 100644 --- a/fs/ocfs2/dlm/dlmmaster.c +++ b/fs/ocfs2/dlm/dlmmaster.c @@ -1460,6 +1460,18 @@ way_up_top: /* take care of the easy cases up front */ spin_lock(res-spinlock); + + /* +* Right after dlm spinlock was released, dlm_thread could have +* purged the lockres. Check if lockres got unhashed. If so +* start over. +*/ + if (hlist_unhashed(res-hash_node)) { + spin_unlock(res-spinlock); + dlm_lockres_put(res); + goto way_up_top; + } + if (res-state (DLM_LOCK_RES_RECOVERING| DLM_LOCK_RES_MIGRATING)) { spin_unlock(res-spinlock); -- 1.9.1 ___ Ocfs2-devel mailing list Ocfs2-devel@oss.oracle.com https://oss.oracle.com/mailman/listinfo/ocfs2-devel
Re: [Ocfs2-devel] [PATCH] [ocfs2]: drop(unlock) dentry before iput in ocfs2_do_drop_dentry_lock
This fix will cause problems for regular file unlinks. If the inode is not cleared on node 2, then both node 1 and node 2 will fail to get try open lock and fail to clear the orphan inodes. I am assuming the deadlock you have seen is because of quota's enabled and fix e3a767b60fd8a9f5e133f42f4970cff77ec43173 should have avoided this deadlock by delaying the dropping of the dquot reference. That didn't happen ? Why else is downconvert task taking the inode lock ? Thanks, --Srini On 10/19/2014 10:28 PM, Wengang Wang wrote: There is dead lock case like this(the Dentry is a dentry to the Inode): node 1node 2 --- -- granted inode lock requesting dentry lock responding to node 1 request on dentry lock, it calls ocfs2_do_drop_dentry_lock with the call trace: #4 wait_for_completion #5 __ocfs2_cluster_lock #6 ocfs2_inode_lock_full_nested #7 ocfs2_evict_inode #8 evict #9 iput #10 ocfs2_do_drop_dentry_lock You can see node 2 is requesting the inode lock before unlocking dentry lock thus a two nodes dead lock formed. The story is like this: Responding node 1's request, Node 2, ocfs2_dentry_convert_worker, returned UNBLOCK_STOP_POST which means don't downconvert but do post_unlock action. It was doing so because it thought it doesn't need to downconvert but will do an unlock instead in the post_unlock action ocfs2_dentry_post_unlock(). By doing so it can save a clusted downconvert request for performace consideration. But in ocfs2_dentry_post_unlock() before unlock the dentry lock, it was requesting the inode lock forming the deadlock expectedly. Fix: unlock dentry lock first before the requesting inode lock in ocfs2_do_drop_dentry_lock. Signed-off-by: Wengang Wang wen.gang.w...@oracle.com --- fs/ocfs2/dcache.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/fs/ocfs2/dcache.c b/fs/ocfs2/dcache.c index e2e05a1..19a7d6c 100644 --- a/fs/ocfs2/dcache.c +++ b/fs/ocfs2/dcache.c @@ -369,9 +369,9 @@ out_attach: static void ocfs2_drop_dentry_lock(struct ocfs2_super *osb, struct ocfs2_dentry_lock *dl) { - iput(dl-dl_inode); ocfs2_simple_drop_lockres(osb, dl-dl_lockres); ocfs2_lock_res_free(dl-dl_lockres); + iput(dl-dl_inode); kfree(dl); } ___ Ocfs2-devel mailing list Ocfs2-devel@oss.oracle.com https://oss.oracle.com/mailman/listinfo/ocfs2-devel
Re: [Ocfs2-devel] Ocfs2-devel Digest, Vol 127, Issue 25
On 10/09/2014 12:53 AM, Joseph Qi wrote: On 2014/10/9 15:16, Guozhonghua wrote: Hi Joseph and Srinivas, We had merged and test the two patches: 1.ocfs2: o2net: set tcp user timeout to max value 8e9801dfe37c9e68cdbfcd15988df2187191864e 2.ocfs2: o2net: don't shutdown connection when idle timeout c43c363def04cdaed0d9e26dae846081f55714e7 They are works well as we shut down and up the Ethernet interface manually and intervals to create the scenarios with shell scripts, the issues cat not be recreated. Thanks you for reviews and better advices. There is another question. As the node number rises to 32 or 128 in one cluster, we think the TCP keep alive MSG interval should be make longer from 2 seconds to 10 seconds and the idle timeout value should be 6ms or 9ms. We think it can reduce the non-useful keep alive messages and improve the performance of the TCP connection. O2CB_IDLE_TIMEOUT_MS=3 to 9 O2CB_KEEPALIVE_DELAY_MS=2000 to 1 In my opinion, O2CB_IDLE_TIMEOUT_MS can be changed to suit your scenario. But for O2CB_KEEPALIVE_DELAY_MS, I don't think you have to change it. It will send keepalive packet only if there is no DLM messages between nodes. We test the values and the changes works well. As Joseph explained they both are tunables, you can run service o2cb configure to change values. It falls down to how sensitive your application want to be to detect network problems. Can you please elaborate on what you meant by changes works well. How did you test it and what you found ? Is there any side effect? We are forward your better thoughts about that. Thanks. - 本邮件及其附件含有杭州华三通信技术有限公司的保密信息,仅限于发送给上面地址中列出 的个人或群组。禁止任何其他人以任何形式使用(包括但不限于全部或部分地泄露、复制、 或散发)本邮件中的信息。如果您错收了本邮件,请您立即电话或邮件通知发件人并删除本 邮件! This e-mail and its attachments contain confidential information from H3C, which is intended only for the person or entity whose address is listed above. Any use of the information contained herein in any way (including, but not limited to, total or partial disclosure, reproduction, or dissemination) by persons other than the intended recipient(s) is prohibited. If you receive this e-mail in error, please notify the sender by phone or email immediately and delete it! ___ Ocfs2-devel mailing list Ocfs2-devel@oss.oracle.com https://oss.oracle.com/mailman/listinfo/ocfs2-devel
Re: [Ocfs2-devel] One node hangs up issue requiring goog idea, thanks
Joseph, yes I agree with you that the proposed change won't work for all cases. The only way to fix if we allow reconnects is to track all messages and replay the last few in case of reconnects. However this problem shouldn't arise because of the following two fixes Junxiao put in. Can you guys review and comment on them ? 1.ocfs2: o2net: set tcp user timeout to max value 8e9801dfe37c9e68cdbfcd15988df2187191864e 2.ocfs2: o2net: don't shutdown connection when idle timeout c43c363def04cdaed0d9e26dae846081f55714e7 The above will not close the socket connection in case of network hiccup and should avoid loss of messages. Thanks, --Srini On 09/30/2014 12:21 AM, Joseph Qi wrote: I don't think this is right. There are two scenario when returns -ENOTCONN or -EHOSTDOWN: 1) Connection down when sending the message; 2) Connection down when responding the messages; So just resend the message may lead to remote handles twice. On 2014/9/26 20:06, Guozhonghua wrote: Hi, all, As we use OCFS2, the network is not good. When the converting request message can’t send to the another node, there will be a node hangs up which will still waiting for the dlm. CAS2/logdir/var/log/syslog.1-6778-Sep 16 20:57:16 CAS2 kernel: [516366.623623] o2net: Connection to node CAS1 (num 1) at 10.172.254.1:7100 has been idle for 30.87 secs, shutting it down. CAS2/logdir/var/log/syslog.1-6779-Sep 16 20:57:16 CAS2 kernel: [516366.623631] o2net_idle_timer 1621: Local and remote node is heartbeating, and try connect CAS2/logdir/var/log/syslog.1-6780-Sep 16 20:57:16 CAS2 kernel: [516366.623792] o2net: No longer connected to node CAS1 (num 1) at 10.172.254.1:7100 CAS2/logdir/var/log/syslog.1:6781:Sep 16 20:57:16 CAS2 kernel: [516366.623881] (dlm_thread,5140,4):dlm_send_proxy_ast_msg:482 ERROR: B258FD07DDD64710B68EB9683FD7D1B9: res M00046e0117, error -112 send AST to node 1 CAS2/logdir/var/log/syslog.1-6782-Sep 16 20:57:16 CAS2 kernel: [516366.623900] (dlm_thread,5140,4):dlm_flush_asts:596 ERROR: status = -112 CAS2/logdir/var/log/syslog.1-6783-Sep 16 20:57:16 CAS2 kernel: [516366.623937] (dlm_thread,5140,4):dlm_send_proxy_ast_msg:482 ERROR: B258FD07DDD64710B68EB9683FD7D1B9: res M0016260110, error -107 send AST to node 1 CAS2/logdir/var/log/syslog.1-6784-Sep 16 20:57:16 CAS2 kernel: [516366.623946] (dlm_thread,5140,4):dlm_flush_asts:596 ERROR: status = -107 CAS2/logdir/var/log/syslog.1-6785-Sep 16 20:57:16 CAS2 kernel: [516366.623997] Connect node 1 OK, and set timeout 0 CAS2/logdir/var/log/syslog.1-6786-Sep 16 20:57:17 CAS2 kernel: [516367.623592] o2net: Connected to node CAS1 (num 1) at 10.172.254.1:7100 debugfs: fs_locks -B Lockres: M00046e0117 Mode: Protected Read Flags: Initialized Attached Busy RO Holders: 0 EX Holders: 0 Pending Action: Convert Pending Unlock Action: None Requested Mode: Exclusive Blocking Mode: No Lock PR Gets: 318317 Fails: 0Waits (usec) Total: 128622 Max: 3 EX Gets: 706878 Fails: 0Waits (usec) Total: 284967 Max: 2 Disk Refreshes: 0 debugfs: dlm_locks M00046e0117 Lockres: M00046e0117 Owner: 2State: 0x0 Last Used: 0 ASTs Reserved: 0Inflight: 0Migration Pending: No Refs: 4Locks: 2On Lists: None Reference Map: 1 Lock-Queue Node Level Conv Cookie Refs AST BAST Pending-Action Granted 1 PR -11:1952 No NoNone Converting 2 PR EX2:1962 No NoNone We reviews the code, and want to resend the dlm message to avoid it. The patch is required reviewing. The patch has been test when the network interface is shut down and up manually to recreate the issue. If the TCP channel between two node set up within 5 seconds, resend msg works well. We are forward to appreciate another better way to avoid it. Thanks. --- ocfs2/dlm/dlmthread.c 2014-06-07 10:40:09.0 +0800 +++ ocfs2/dlm/dlmthread.c 2014-09-26 16:42:36.0 +0800 @@ -517,6 +517,9 @@ static void dlm_flush_asts(struct dlm_ct struct dlm_lock_resource *res; u8 hi; +/* resend the msg again */ +int send_times = 0; + spin_lock(dlm-ast_lock); while (!list_empty(dlm-pending_asts)) { lock = list_entry(dlm-pending_asts.next, @@ -539,9 +542,16 @@ static void dlm_flush_asts(struct dlm_ct spin_unlock(dlm-ast_lock); if (lock-ml.node != dlm-node_num) { - ret = dlm_do_remote_ast(dlm, res, lock); - if (ret 0) + ret = dlm_do_remote_ast(dlm, res, lock); + if (ret 0) { mlog_errno(ret); + while ((ret == -112 || ret == -107) send_times++ 5 ) { +
Re: [Ocfs2-devel] [PATCH v2] ocfs2: don't fire quorum before connection established
Looks good to me. Thanks for the patch Reviewed-by: Srinivas Eeda srinivas.e...@oracle.com On 09/15/2014 10:15 PM, Junxiao Bi wrote: Firing quorum before connection established can cause unexpected node to reboot. Assume there are 3 nodes in the cluster, Node 1, 2, 3. Node 2 and 3 have wrong ip address of Node 1 in cluster.conf and global heartbeat is enabled in the cluster. After the heatbeat are started on these three nodes, Node 1 will reboot due to quorum fencing. It is similar case if Node 1's networking is not ready when starting the global heatbeat. The reboot is not friendly as customer is not fully ready for ocfs2 to work. Fix it by not allow firing quorum before connection established. In this case, ocfs2 will wait until wrong configure fixed or networking up to continue. Also update the log to guide user where to check when connection is not built for a long time. Signed-off-by: Junxiao Bi junxiao...@oracle.com Reviewed-by: Srinivas Eeda srinivas.e...@oracle.com --- fs/ocfs2/cluster/tcp.c |5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/fs/ocfs2/cluster/tcp.c b/fs/ocfs2/cluster/tcp.c index ea34952..b2cc010 100644 --- a/fs/ocfs2/cluster/tcp.c +++ b/fs/ocfs2/cluster/tcp.c @@ -536,7 +536,7 @@ static void o2net_set_nn_state(struct o2net_node *nn, if (nn-nn_persistent_error || nn-nn_sc_valid) wake_up(nn-nn_sc_wq); - if (!was_err nn-nn_persistent_error) { + if (was_valid !was_err nn-nn_persistent_error) { o2quo_conn_err(o2net_num_from_nn(nn)); queue_delayed_work(o2net_wq, nn-nn_still_up, msecs_to_jiffies(O2NET_QUORUM_DELAY_MS)); @@ -1721,7 +1721,8 @@ static void o2net_connect_expired(struct work_struct *work) spin_lock(nn-nn_lock); if (!nn-nn_sc_valid) { printk(KERN_NOTICE o2net: No connection established with -node %u after %u.%u seconds, giving up.\n, +node %u after %u.%u seconds, check network and + cluster configuration.\n, o2net_num_from_nn(nn), o2net_idle_timeout() / 1000, o2net_idle_timeout() % 1000); ___ Ocfs2-devel mailing list Ocfs2-devel@oss.oracle.com https://oss.oracle.com/mailman/listinfo/ocfs2-devel
Re: [Ocfs2-devel] [PATCH] fs: ocfs2: dir.c: Cleaning up uninitialized variables
Acked-by: Srinivas Eeda srinivas.e...@oracle.com On 06/01/2014 06:53 AM, Rickard Strandqvist wrote: There is a risk that the variable will be used without being initialized. This was largely found by using a static code analysis program called cppcheck. Signed-off-by: Rickard Strandqvist rickard_strandqv...@spectrumdigital.se --- fs/ocfs2/dir.c |2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/fs/ocfs2/dir.c b/fs/ocfs2/dir.c index 0717662..27aa4a1 100644 --- a/fs/ocfs2/dir.c +++ b/fs/ocfs2/dir.c @@ -3738,7 +3738,7 @@ static int ocfs2_dx_dir_rebalance(struct ocfs2_super *osb, struct inode *dir, int credits, ret, i, num_used, did_quota = 0; u32 cpos, split_hash, insert_hash = hinfo-major_hash; u64 orig_leaves_start; - int num_dx_leaves; + int num_dx_leaves = 0; struct buffer_head **orig_dx_leaves = NULL; struct buffer_head **new_dx_leaves = NULL; struct ocfs2_alloc_context *data_ac = NULL, *meta_ac = NULL; ___ Ocfs2-devel mailing list Ocfs2-devel@oss.oracle.com https://oss.oracle.com/mailman/listinfo/ocfs2-devel
[Ocfs2-devel] ocfs2: Fix panic on kfree(xattr-name)
Hi Andrew, can you please pull the following patch from Tetsuo Handa. It fixes a regression in ocfs2/mainline and linux-next Thanks, --Srini ___ Ocfs2-devel mailing list Ocfs2-devel@oss.oracle.com https://oss.oracle.com/mailman/listinfo/ocfs2-devel
[Ocfs2-devel] [PATCH 1/1] ocfs2: Fix panic on kfree(xattr-name)
From: Tetsuo Handa penguin-ker...@i-love.sakura.ne.jp Commit 9548906b 'xattr: Constify -name member of struct xattr.' missed that ocfs2 is calling kfree(xattr-name). As a result, kernel panic occurs upon calling kfree(xattr-name) because xattr-name refers static constant names. This patch removes kfree(xattr-name) from ocfs2_mknod() and ocfs2_symlink(). Reported-by: Tariq Saeed tariq.x.sa...@oracle.com Signed-off-by: Tetsuo Handa penguin-ker...@i-love.sakura.ne.jp Reviewed-by: Srinivas Eeda srinivas.e...@oracle.com --- fs/ocfs2/namei.c |2 -- 1 file changed, 2 deletions(-) diff --git a/fs/ocfs2/namei.c b/fs/ocfs2/namei.c index 3683643..feed025f 100644 --- a/fs/ocfs2/namei.c +++ b/fs/ocfs2/namei.c @@ -450,7 +450,6 @@ leave: brelse(new_fe_bh); brelse(parent_fe_bh); - kfree(si.name); kfree(si.value); ocfs2_free_dir_lookup_result(lookup); @@ -1855,7 +1854,6 @@ bail: brelse(new_fe_bh); brelse(parent_fe_bh); - kfree(si.name); kfree(si.value); ocfs2_free_dir_lookup_result(lookup); if (inode_ac) -- 1.7.9.5 ___ Ocfs2-devel mailing list Ocfs2-devel@oss.oracle.com https://oss.oracle.com/mailman/listinfo/ocfs2-devel
Re: [Ocfs2-devel] [PATCH] ocfs2: do not put bh when buffer_uptodate failed
Thanks for explaining Reviewed-by: Srinivas Eeda srinivas.e...@oracle.com On 03/25/2014 06:25 PM, alex chen wrote: On 2014/3/26 2:45, Srinivas Eeda wrote: These changes looks good to me. However ocfs2_read_blocks and ocfs2_read_blocks_sync needs the same fix ? :) There is no need to do this in ocfs2_read_blocks and ocfs2_read_blocks_sync, because bh will be set to NULL after put_bh, and brelse will handle it. On 03/25/2014 12:05 AM, alex chen wrote: Do not put bh when buffer_uptodate failed in ocfs2_write_block and ocfs2_write_super_or_backup, because it will put bh in b_end_io. Otherwise it will hit a warning VFS: brelse: Trying to free free buffer. Signed-off-by: Alex Chen alex.c...@huawei.com Reviewed-by: Joseph Qi joseph...@huawei.com --- fs/ocfs2/buffer_head_io.c | 2 -- 1 file changed, 2 deletions(-) diff --git a/fs/ocfs2/buffer_head_io.c b/fs/ocfs2/buffer_head_io.c index 5b704c6..1edcb14 100644 --- a/fs/ocfs2/buffer_head_io.c +++ b/fs/ocfs2/buffer_head_io.c @@ -90,7 +90,6 @@ int ocfs2_write_block(struct ocfs2_super *osb, struct buffer_head *bh, * information for this bh as it's not marked locally * uptodate. */ ret = -EIO; - put_bh(bh); mlog_errno(ret); } @@ -420,7 +419,6 @@ int ocfs2_write_super_or_backup(struct ocfs2_super *osb, if (!buffer_uptodate(bh)) { ret = -EIO; - put_bh(bh); mlog_errno(ret); } ___ Ocfs2-devel mailing list Ocfs2-devel@oss.oracle.com https://oss.oracle.com/mailman/listinfo/ocfs2-devel ___ Ocfs2-devel mailing list Ocfs2-devel@oss.oracle.com https://oss.oracle.com/mailman/listinfo/ocfs2-devel
Re: [Ocfs2-devel] [PATCH] ocfs2: do not put bh when buffer_uptodate failed
These changes looks good to me. However ocfs2_read_blocks and ocfs2_read_blocks_sync needs the same fix ? :) On 03/25/2014 12:05 AM, alex chen wrote: Do not put bh when buffer_uptodate failed in ocfs2_write_block and ocfs2_write_super_or_backup, because it will put bh in b_end_io. Otherwise it will hit a warning VFS: brelse: Trying to free free buffer. Signed-off-by: Alex Chen alex.c...@huawei.com Reviewed-by: Joseph Qi joseph...@huawei.com --- fs/ocfs2/buffer_head_io.c | 2 -- 1 file changed, 2 deletions(-) diff --git a/fs/ocfs2/buffer_head_io.c b/fs/ocfs2/buffer_head_io.c index 5b704c6..1edcb14 100644 --- a/fs/ocfs2/buffer_head_io.c +++ b/fs/ocfs2/buffer_head_io.c @@ -90,7 +90,6 @@ int ocfs2_write_block(struct ocfs2_super *osb, struct buffer_head *bh, * information for this bh as it's not marked locally * uptodate. */ ret = -EIO; - put_bh(bh); mlog_errno(ret); } @@ -420,7 +419,6 @@ int ocfs2_write_super_or_backup(struct ocfs2_super *osb, if (!buffer_uptodate(bh)) { ret = -EIO; - put_bh(bh); mlog_errno(ret); } ___ Ocfs2-devel mailing list Ocfs2-devel@oss.oracle.com https://oss.oracle.com/mailman/listinfo/ocfs2-devel
Re: [Ocfs2-devel] One Patch reported, Anyone review it, thanks
Guozhonghua, idea of the fix looks good, but the patch doesn't seem to meet mainline standards. Can you please follow the instructions that Jeff emailed you last time and resubmit the patch. On 03/11/2014 04:09 AM, Guozhonghua wrote: I review the code of the kernel 3.11.10. The difference is as below: -- ../linux-3.11.10/fs/ocfs2/dlm/dlmdomain.c 2013-11-30 02:42:37.0 +0800 +++ ocfs2-ko-3.11/dlm/dlmdomain.c 2014-03-11 18:57:23.323897515 +0800 @@ -1135,6 +1135,9 @@ static int dlm_query_region_handler(stru int status = 0; int locked = 0; +/* Wether domain locked */ comment probably not required +int domain_locked = 0; + qr = (struct dlm_query_region *) msg-buf; mlog(0, Node %u queries hb regions on domain %s\n, qr-qr_node, @@ -1150,6 +1153,7 @@ static int dlm_query_region_handler(stru status = -EINVAL; spin_lock(dlm_domain_lock); +domain_locked = 1; dlm = __dlm_lookup_domain_full(qr-qr_domain, qr-qr_namelen); if (!dlm) { mlog(ML_ERROR, Node %d queried hb regions on domain %s @@ -1181,9 +1185,12 @@ static int dlm_query_region_handler(stru bail: if (locked) spin_unlock(dlm-spinlock); - spin_unlock(dlm_domain_lock); - kfree(local); +if (domain_locked) + spin_unlock(dlm_domain_lock); + +if (local) +kfree(local); no need to check if (local), kfree will take care of that part return status; } - ??,? ?(?? ???)?,?? ??! This e-mail and its attachments contain confidential information from H3C, which is intended only for the person or entity whose address is listed above. Any use of the information contained herein in any way (including, but not limited to, total or partial disclosure, reproduction, or dissemination) by persons other than the intended recipient(s) is prohibited. If you receive this e-mail in error, please notify the sender by phone or email immediately and delete it! ___ Ocfs2-devel mailing list Ocfs2-devel@oss.oracle.com https://oss.oracle.com/mailman/listinfo/ocfs2-devel ___ Ocfs2-devel mailing list Ocfs2-devel@oss.oracle.com https://oss.oracle.com/mailman/listinfo/ocfs2-devel
Re: [Ocfs2-devel] [PATCH] ocfs2: dlm: fix lock migration crash
looks good, thanks Junxiao for fixing this. Reviewed-by: Srinivas Eedasrinivas.e...@oracle.com On 02/25/2014 11:47 PM, Junxiao Bi wrote: This issue was introduced by commit 800deef3 where it replaced list_for_each with list_for_each_entry. The variable lock will point to invalid data if tmpq list is empty and a panic will be triggered due to this. Sunil advised reverting it back, but the old version was also not right. At the end of the outer for loop, that list_for_each_entry will also set lock to an invalid data, then in the next loop, if the tmpq list is empty, lock will be an stale invalid data and cause the panic. So reverting the list_for_each back and reset lock to NULL to fix this issue. Another concern is that this seemes can not happen because the tmpq list should not be empty. Let me describe how. old lock resource owner(node 1): migratation target(node 2): image there's lockres with a EX lock from node 2 in granted list, a NR lock from node x with convert_type EX in converting list. dlm_empty_lockres() { dlm_pick_migration_target() { pick node 2 as target as its lock is the first one in granted list. } dlm_migrate_lockres() { dlm_mark_lockres_migrating() { res-state |= DLM_LOCK_RES_BLOCK_DIRTY; wait_event(dlm-ast_wq, !dlm_lockres_is_dirty(dlm, res)); //after the above code, we can not dirty lockres any more, // so dlm_thread shuffle list will not run downconvert lock from EX to NR upconvert lock from NR to EX migration may schedule out here, then node 2 send down convert request to convert type from EX to NR, then send up convert request to convert type from NR to EX, at this time, lockres granted list is empty, and two locks in the converting list, node x up convert lock followed by node 2 up convert lock. // will set lockres RES_MIGRATING flag, the following // lock/unlock can not run dlm_lockres_release_ast(dlm, res); } dlm_send_one_lockres() dlm_process_recovery_data() for (i=0; imres-num_locks; i++) if (ml-node == dlm-node_num) for (j = DLM_GRANTED_LIST; j = DLM_BLOCKED_LIST; j++) { list_for_each_entry(lock, tmpq, list) if (lock) break; lock is invalid as grant list is empty. } if (lock-ml.node != ml-node) BUG() crash here } I see the above locks status from a vmcore of our internal bug. Signed-off-by: Junxiao Bi junxiao...@oracle.com Cc: Sunil Mushran sunil.mush...@gmail.com Cc: Srinivas Eeda srinivas.e...@oracle.com Cc: sta...@vger.kernel.org --- fs/ocfs2/dlm/dlmrecovery.c | 14 -- 1 file changed, 8 insertions(+), 6 deletions(-) diff --git a/fs/ocfs2/dlm/dlmrecovery.c b/fs/ocfs2/dlm/dlmrecovery.c index 7035af0..c2dd258 100644 --- a/fs/ocfs2/dlm/dlmrecovery.c +++ b/fs/ocfs2/dlm/dlmrecovery.c @@ -1750,13 +1750,13 @@ static int dlm_process_recovery_data(struct dlm_ctxt *dlm, struct dlm_migratable_lockres *mres) { struct dlm_migratable_lock *ml; - struct list_head *queue; + struct list_head *queue, *iter; struct list_head *tmpq = NULL; struct dlm_lock *newlock = NULL; struct dlm_lockstatus *lksb = NULL; int ret = 0; int i, j, bad; - struct dlm_lock *lock = NULL; + struct dlm_lock *lock; u8 from = O2NM_MAX_NODES; unsigned int added = 0; __be64 c; @@ -1791,14 +1791,16 @@ static int dlm_process_recovery_data(struct dlm_ctxt *dlm, /* MIGRATION ONLY! */ BUG_ON(!(mres-flags DLM_MRES_MIGRATION)); + lock = NULL; spin_lock(res-spinlock); for (j = DLM_GRANTED_LIST; j = DLM_BLOCKED_LIST; j++) { tmpq = dlm_list_idx_to_ptr(res, j); - list_for_each_entry(lock, tmpq, list) { - if (lock-ml.cookie != ml-cookie) - lock = NULL; - else + list_for_each(iter, tmpq
Re: [Ocfs2-devel] [PATCH] ocfs2: fix dlm lock migration crash
Junxiao, thanks for looking into this issue. Please see my comment below On 02/24/2014 01:07 AM, Junxiao Bi wrote: Hi, On 07/19/2012 09:59 AM, Sunil Mushran wrote: Different issues. On Wed, Jul 18, 2012 at 6:34 PM, Junxiao Bi junxiao...@oracle.com mailto:junxiao...@oracle.com wrote: On 07/19/2012 12:36 AM, Sunil Mushran wrote: This bug was detected during code audit. Never seen a crash. If it does hit, then we have bigger problems. So no point posting to stable. I read a lot of dlm recovery code recently, I found this bug could happen at the following scenario. node 1: migrate target node x: dlm_unregister_domain() dlm_migrate_all_locks() dlm_empty_lockres() select node x as migrate target node since there is a node x lock on the granted list. dlm_migrate_lockres() dlm_mark_lockres_migrating() { wait_event(dlm-ast_wq, !dlm_lockres_is_dirty(dlm, res)); node x unlock may happen here, res-granted list can be empty. If the unlock request got sent at this point, and if the request was *processed*, lock must have been removed from the granted_list. If the request was *not yet processed*, then the DLM_LOCK_RES_MIGRATING set in dlm_lockres_release_ast would make dlm_unlock handler to return DLM_MIGRATING to the caller (in this case node x). So I don't see how granted_list could have stale lock. Am I missing something ? I do think there is such race that you pointed below exist, but I am not sure if it was due to the above race described. dlm_lockres_release_ast(dlm, res); } dlm_send_one_lockres() dlm_process_recovery_data() { tmpq is res-granted list and is empty. list_for_each_entry(lock, tmpq, list) { if (lock-ml.cookie != ml-cookie) lock = NULL; else break; } lock will be invalid here. if (lock-ml.node != ml-node) BUG() -- crash here. } Thanks, Junxiao. Our customer can reproduce it. Also I saw you were assigned a similar bug before, see https://oss.oracle.com/bugzilla/show_bug.cgi?id=1220, is it the same BUG? On Tue, Jul 17, 2012 at 6:36 PM, Junxiao Bi junxiao...@oracle.com mailto:junxiao...@oracle.com wrote: Hi Sunil, On 07/18/2012 03:49 AM, Sunil Mushran wrote: On Tue, Jul 17, 2012 at 12:10 AM, Junxiao Bi junxiao...@oracle.com mailto:junxiao...@oracle.com wrote: In the target node of the dlm lock migration, the logic to find the local dlm lock is wrong, it shouldn't change the loop variable lock in the list_for_each_entry loop. This will cause a NULL-pointer accessing crash. Signed-off-by: Junxiao Bi junxiao...@oracle.com mailto:junxiao...@oracle.com Cc: sta...@vger.kernel.org mailto:sta...@vger.kernel.org --- fs/ocfs2/dlm/dlmrecovery.c | 12 +++- 1 file changed, 7 insertions(+), 5 deletions(-) diff --git a/fs/ocfs2/dlm/dlmrecovery.c b/fs/ocfs2/dlm/dlmrecovery.c index 01ebfd0..0b9cc88 100644 --- a/fs/ocfs2/dlm/dlmrecovery.c +++ b/fs/ocfs2/dlm/dlmrecovery.c @@ -1762,6 +1762,7 @@ static int dlm_process_recovery_data(struct dlm_ctxt *dlm, u8 from = O2NM_MAX_NODES; unsigned int added = 0; __be64 c; + int found; mlog(0, running %d locks for this lockres\n, mres-num_locks); for (i=0; imres-num_locks; i++) { @@ -1793,22 +1794,23 @@ static int dlm_process_recovery_data(struct dlm_ctxt *dlm, /* MIGRATION ONLY! */ BUG_ON(!(mres-flags DLM_MRES_MIGRATION)); + found = 0; spin_lock(res-spinlock); for (j = DLM_GRANTED_LIST; j = DLM_BLOCKED_LIST; j++) { tmpq = dlm_list_idx_to_ptr(res, j); list_for_each_entry(lock, tmpq, list) { - if (lock-ml.cookie != ml-cookie) - lock = NULL; - else + if (lock-ml.cookie == ml-cookie) { + found = 1;
Re: [Ocfs2-devel] [PATCH 0/6 v2] ocfs2: Avoid pending orphaned inodes
Hi Jan, thanks a lot for these patches. They all look good to me ... I just have one question on patch 5 Thanks, --Srini On 02/20/2014 07:18 AM, Jan Kara wrote: Hello, here is a second version of my patchset to solve a deadlocks when we do not defer dropping of inode reference from downconvert worker. I have tested the patches (also with lockdep enabled) and they seem to work fine. Comments are welcome! Honza ___ Ocfs2-devel mailing list Ocfs2-devel@oss.oracle.com https://oss.oracle.com/mailman/listinfo/ocfs2-devel
Re: [Ocfs2-devel] [PATCH 3/6] quota: Provide function to grab quota structure reference
looks good to me Reviewed-by: Srinivas Eeda srinivas.e...@oracle.com On 02/20/2014 07:18 AM, Jan Kara wrote: Provide dqgrab() function to get quota structure reference when we are sure it already has at least one active reference. Make use of this function inside quota code. Signed-off-by: Jan Kara j...@suse.cz --- fs/quota/dquot.c | 4 ++-- include/linux/quotaops.h | 8 2 files changed, 10 insertions(+), 2 deletions(-) diff --git a/fs/quota/dquot.c b/fs/quota/dquot.c index 831d49a4111f..e3f09e34d0b2 100644 --- a/fs/quota/dquot.c +++ b/fs/quota/dquot.c @@ -528,7 +528,7 @@ restart: if (atomic_read(dquot-dq_count)) { DEFINE_WAIT(wait); - atomic_inc(dquot-dq_count); + dqgrab(dquot); prepare_to_wait(dquot-dq_wait_unused, wait, TASK_UNINTERRUPTIBLE); spin_unlock(dq_list_lock); @@ -624,7 +624,7 @@ int dquot_writeback_dquots(struct super_block *sb, int type) /* Now we have active dquot from which someone is * holding reference so we can safely just increase * use count */ - atomic_inc(dquot-dq_count); + dqgrab(dquot); spin_unlock(dq_list_lock); dqstats_inc(DQST_LOOKUPS); err = sb-dq_op-write_dquot(dquot); diff --git a/include/linux/quotaops.h b/include/linux/quotaops.h index 6965fe394c3b..1d3eee594cd6 100644 --- a/include/linux/quotaops.h +++ b/include/linux/quotaops.h @@ -46,6 +46,14 @@ void inode_reclaim_rsv_space(struct inode *inode, qsize_t number); void dquot_initialize(struct inode *inode); void dquot_drop(struct inode *inode); struct dquot *dqget(struct super_block *sb, struct kqid qid); +static inline struct dquot *dqgrab(struct dquot *dquot) +{ + /* Make sure someone else has active reference to dquot */ + WARN_ON_ONCE(!atomic_read(dquot-dq_count)); + WARN_ON_ONCE(!test_bit(DQ_ACTIVE_B, dquot-dq_flags)); + atomic_inc(dquot-dq_count); + return dquot; +} void dqput(struct dquot *dquot); int dquot_scan_active(struct super_block *sb, int (*fn)(struct dquot *dquot, unsigned long priv), ___ Ocfs2-devel mailing list Ocfs2-devel@oss.oracle.com https://oss.oracle.com/mailman/listinfo/ocfs2-devel
Re: [Ocfs2-devel] [PATCH 4/6] ocfs2: Implement delayed dropping of last dquot reference
looks good to me Reviewed-by: Srinivas Eeda srinivas.e...@oracle.com On 02/20/2014 07:18 AM, Jan Kara wrote: We cannot drop last dquot reference from downconvert thread as that creates the following deadlock: NODE 1 NODE2 holds dentry lock for 'foo' holds inode lock for GLOBAL_BITMAP_SYSTEM_INODE dquot_initialize(bar) ocfs2_dquot_acquire() ocfs2_inode_lock(USER_QUOTA_SYSTEM_INODE) ... downconvert thread (triggered from another node or a different process from NODE2) ocfs2_dentry_post_unlock() ... iput(foo) ocfs2_evict_inode(foo) ocfs2_clear_inode(foo) dquot_drop(inode) ... ocfs2_dquot_release() ocfs2_inode_lock(USER_QUOTA_SYSTEM_INODE) - blocks finds we need more space in quota file ... ocfs2_extend_no_holes() ocfs2_inode_lock(GLOBAL_BITMAP_SYSTEM_INODE) - deadlocks waiting for downconvert thread We solve the problem by postponing dropping of the last dquot reference to a workqueue if it happens from the downconvert thread. Signed-off-by: Jan Kara j...@suse.cz --- fs/ocfs2/ocfs2.h| 5 + fs/ocfs2/quota.h| 2 ++ fs/ocfs2/quota_global.c | 35 +++ fs/ocfs2/super.c| 8 4 files changed, 50 insertions(+) diff --git a/fs/ocfs2/ocfs2.h b/fs/ocfs2/ocfs2.h index 553f53cc73ae..64c02239ba46 100644 --- a/fs/ocfs2/ocfs2.h +++ b/fs/ocfs2/ocfs2.h @@ -30,6 +30,7 @@ #include linux/sched.h #include linux/wait.h #include linux/list.h +#include linux/llist.h #include linux/rbtree.h #include linux/workqueue.h #include linux/kref.h @@ -419,6 +420,10 @@ struct ocfs2_super struct ocfs2_dentry_lock *dentry_lock_list; struct work_struct dentry_lock_work; + /* List of dquot structures to drop last reference to */ + struct llist_head dquot_drop_list; + struct work_struct dquot_drop_work; + wait_queue_head_t osb_mount_event; /* Truncate log info */ diff --git a/fs/ocfs2/quota.h b/fs/ocfs2/quota.h index d5ab56cbe5c5..f266d67df3c6 100644 --- a/fs/ocfs2/quota.h +++ b/fs/ocfs2/quota.h @@ -28,6 +28,7 @@ struct ocfs2_dquot { unsigned int dq_use_count; /* Number of nodes having reference to this entry in global quota file */ s64 dq_origspace; /* Last globally synced space usage */ s64 dq_originodes; /* Last globally synced inode usage */ + struct llist_node list; /* Member of list of dquots to drop */ }; /* Description of one chunk to recover in memory */ @@ -110,6 +111,7 @@ int ocfs2_read_quota_phys_block(struct inode *inode, u64 p_block, int ocfs2_create_local_dquot(struct dquot *dquot); int ocfs2_local_release_dquot(handle_t *handle, struct dquot *dquot); int ocfs2_local_write_dquot(struct dquot *dquot); +void ocfs2_drop_dquot_refs(struct work_struct *work); extern const struct dquot_operations ocfs2_quota_operations; extern struct quota_format_type ocfs2_quota_format; diff --git a/fs/ocfs2/quota_global.c b/fs/ocfs2/quota_global.c index aaa50611ec66..7921e209c64b 100644 --- a/fs/ocfs2/quota_global.c +++ b/fs/ocfs2/quota_global.c @@ -10,6 +10,7 @@ #include linux/jiffies.h #include linux/writeback.h #include linux/workqueue.h +#include linux/llist.h #include cluster/masklog.h @@ -679,6 +680,27 @@ static int ocfs2_calc_qdel_credits(struct super_block *sb, int type) OCFS2_INODE_UPDATE_CREDITS; } +void ocfs2_drop_dquot_refs(struct work_struct *work) +{ + struct ocfs2_super *osb = container_of(work, struct ocfs2_super, +dquot_drop_work); + struct llist_node *list; + struct ocfs2_dquot *odquot, *next_odquot; + + list = llist_del_all(osb-dquot_drop_list); + llist_for_each_entry_safe(odquot, next_odquot, list, list) { + /* Drop the reference we acquired in ocfs2_dquot_release() */ + dqput(odquot-dq_dquot); + } +} + +/* + * Called when the last reference to dquot is dropped. If we are called from + * downconvert thread, we cannot do all the handling here because grabbing + * quota lock could deadlock (the node holding the quota lock could need some + * other cluster lock to proceed but with blocked downconvert thread we cannot + * release any lock). + */ static int
Re: [Ocfs2-devel] [PATCH 1/6] ocfs2: Remove OCFS2_INODE_SKIP_DELETE flag
Reviewed-by: Srinivas Eeda srinivas.e...@oracle.com On 02/20/2014 07:18 AM, Jan Kara wrote: The flag was never set, delete it. Signed-off-by: Jan Kara j...@suse.cz --- fs/ocfs2/inode.c | 6 -- fs/ocfs2/inode.h | 8 +++- fs/ocfs2/journal.c | 6 -- 3 files changed, 3 insertions(+), 17 deletions(-) diff --git a/fs/ocfs2/inode.c b/fs/ocfs2/inode.c index f29a90fde619..b4baaefe4dd4 100644 --- a/fs/ocfs2/inode.c +++ b/fs/ocfs2/inode.c @@ -822,12 +822,6 @@ static int ocfs2_inode_is_valid_to_delete(struct inode *inode) goto bail_unlock; } - /* If we have allowd wipe of this inode for another node, it - * will be marked here so we can safely skip it. Recovery will - * cleanup any inodes we might inadvertently skip here. */ - if (oi-ip_flags OCFS2_INODE_SKIP_DELETE) - goto bail_unlock; - ret = 1; bail_unlock: spin_unlock(oi-ip_lock); diff --git a/fs/ocfs2/inode.h b/fs/ocfs2/inode.h index 621fc73bf23d..f60bc314ee0a 100644 --- a/fs/ocfs2/inode.h +++ b/fs/ocfs2/inode.h @@ -84,8 +84,6 @@ struct ocfs2_inode_info #define OCFS2_INODE_BITMAP 0x0004 /* This inode has been wiped from disk */ #define OCFS2_INODE_DELETED 0x0008 -/* Another node is deleting, so our delete is a nop */ -#define OCFS2_INODE_SKIP_DELETE 0x0010 /* Has the inode been orphaned on another node? * * This hints to ocfs2_drop_inode that it should clear i_nlink before @@ -100,11 +98,11 @@ struct ocfs2_inode_info * rely on ocfs2_delete_inode to sort things out under the proper * cluster locks. */ -#define OCFS2_INODE_MAYBE_ORPHANED 0x0020 +#define OCFS2_INODE_MAYBE_ORPHANED 0x0010 /* Does someone have the file open O_DIRECT */ -#define OCFS2_INODE_OPEN_DIRECT 0x0040 +#define OCFS2_INODE_OPEN_DIRECT 0x0020 /* Tell the inode wipe code it's not in orphan dir */ -#define OCFS2_INODE_SKIP_ORPHAN_DIR 0x0080 +#define OCFS2_INODE_SKIP_ORPHAN_DIR 0x0040 static inline struct ocfs2_inode_info *OCFS2_I(struct inode *inode) { diff --git a/fs/ocfs2/journal.c b/fs/ocfs2/journal.c index 44fc3e530c3d..03ea9314fecd 100644 --- a/fs/ocfs2/journal.c +++ b/fs/ocfs2/journal.c @@ -2132,12 +2132,6 @@ static int ocfs2_recover_orphans(struct ocfs2_super *osb, iter = oi-ip_next_orphan; spin_lock(oi-ip_lock); - /* The remote delete code may have set these on the - * assumption that the other node would wipe them - * successfully. If they are still in the node's - * orphan dir, we need to reset that state. */ - oi-ip_flags = ~(OCFS2_INODE_DELETED|OCFS2_INODE_SKIP_DELETE); - /* Set the proper information to get us going into * ocfs2_delete_inode. */ oi-ip_flags |= OCFS2_INODE_MAYBE_ORPHANED; ___ Ocfs2-devel mailing list Ocfs2-devel@oss.oracle.com https://oss.oracle.com/mailman/listinfo/ocfs2-devel
Re: [Ocfs2-devel] [PATCH 2/6] ocfs2: Move dquot_initialize() in ocfs2_delete_inode() somewhat later
looks good to me Reviewed-by: Srinivas Eeda srinivas.e...@oracle.com On 02/20/2014 07:18 AM, Jan Kara wrote: Move dquot_initalize() call in ocfs2_delete_inode() after the moment we verify inode is actually a sane one to delete. We certainly don't want to initialize quota for system inodes etc. This also avoids calling into quota code from downconvert thread. Add more details into the comment why bailing out from ocfs2_delete_inode() when we are in downconvert thread is OK. Signed-off-by: Jan Kara j...@suse.cz --- fs/ocfs2/inode.c | 16 +--- 1 file changed, 9 insertions(+), 7 deletions(-) diff --git a/fs/ocfs2/inode.c b/fs/ocfs2/inode.c index b4baaefe4dd4..3b0d722de35e 100644 --- a/fs/ocfs2/inode.c +++ b/fs/ocfs2/inode.c @@ -804,11 +804,13 @@ static int ocfs2_inode_is_valid_to_delete(struct inode *inode) goto bail; } - /* If we're coming from downconvert_thread we can't go into our own - * voting [hello, deadlock city!], so unforuntately we just - * have to skip deleting this guy. That's OK though because - * the node who's doing the actual deleting should handle it - * anyway. */ + /* + * If we're coming from downconvert_thread we can't go into our own + * voting [hello, deadlock city!] so we cannot delete the inode. But + * since we dropped last inode ref when downconverting dentry lock, + * we cannot have the file open and thus the node doing unlink will + * take care of deleting the inode. + */ if (current == osb-dc_task) goto bail; @@ -954,8 +956,6 @@ static void ocfs2_delete_inode(struct inode *inode) if (is_bad_inode(inode) || !OCFS2_I(inode)-ip_blkno) goto bail; - dquot_initialize(inode); - if (!ocfs2_inode_is_valid_to_delete(inode)) { /* It's probably not necessary to truncate_inode_pages * here but we do it for safety anyway (it will most @@ -964,6 +964,8 @@ static void ocfs2_delete_inode(struct inode *inode) goto bail; } + dquot_initialize(inode); + /* We want to block signals in delete_inode as the lock and * messaging paths may return us -ERESTARTSYS. Which would * cause us to exit early, resulting in inodes being orphaned ___ Ocfs2-devel mailing list Ocfs2-devel@oss.oracle.com https://oss.oracle.com/mailman/listinfo/ocfs2-devel
Re: [Ocfs2-devel] [PATCH 5/6] ocfs2: Avoid blocking in ocfs2_mark_lockres_freeing() in downconvert thread
I like the idea of dc_task handling queued basts in ocfs2_mark_lockres_freeing. I am wondering if we should call lockres-l_ops-post_unlock(osb, lockres) ? Would there be another node waiting for a bast response ? On 02/20/2014 07:18 AM, Jan Kara wrote: If we are dropping last inode reference from downconvert thread, we will end up calling ocfs2_mark_lockres_freeing() which can block if the lock we are freeing is queued thus creating an A-A deadlock. Luckily, since we are the downconvert thread, we can immediately dequeue the lock and thus avoid waiting in this case. Signed-off-by: Jan Kara j...@suse.cz --- fs/ocfs2/dlmglue.c | 33 +++-- fs/ocfs2/dlmglue.h | 3 ++- fs/ocfs2/inode.c | 7 --- 3 files changed, 37 insertions(+), 6 deletions(-) diff --git a/fs/ocfs2/dlmglue.c b/fs/ocfs2/dlmglue.c index 19986959d149..b7580157ef01 100644 --- a/fs/ocfs2/dlmglue.c +++ b/fs/ocfs2/dlmglue.c @@ -3150,7 +3150,8 @@ out: * it safe to drop. * * You can *not* attempt to call cluster_lock on this lockres anymore. */ -void ocfs2_mark_lockres_freeing(struct ocfs2_lock_res *lockres) +void ocfs2_mark_lockres_freeing(struct ocfs2_super *osb, + struct ocfs2_lock_res *lockres) { int status; struct ocfs2_mask_waiter mw; @@ -3160,6 +3161,33 @@ void ocfs2_mark_lockres_freeing(struct ocfs2_lock_res *lockres) spin_lock_irqsave(lockres-l_lock, flags); lockres-l_flags |= OCFS2_LOCK_FREEING; + if (lockres-l_flags OCFS2_LOCK_QUEUED current == osb-dc_task) { + unsigned long flags; + + /* + * We know the downconvert is queued but not in progress + * because we are the downconvert thread and processing + * different lock. So we can just remove the lock from the + * queue. This is not only an optimization but also a way + * to avoid the following deadlock: + * ocfs2_dentry_post_unlock() + * ocfs2_dentry_lock_put() + * ocfs2_drop_dentry_lock() + * iput() + * ocfs2_evict_inode() + * ocfs2_clear_inode() + * ocfs2_mark_lockres_freeing() + * ... blocks waiting for OCFS2_LOCK_QUEUED + * since we are the downconvert thread which + * should clear the flag. + */ + spin_lock_irqsave(osb-dc_task_lock, flags); + list_del_init(lockres-l_blocked_list); + osb-blocked_lock_count--; + spin_unlock_irqrestore(osb-dc_task_lock, flags); + lockres_clear_flags(lockres, OCFS2_LOCK_QUEUED); + goto out_unlock; + } while (lockres-l_flags OCFS2_LOCK_QUEUED) { lockres_add_mask_waiter(lockres, mw, OCFS2_LOCK_QUEUED, 0); spin_unlock_irqrestore(lockres-l_lock, flags); @@ -3172,6 +3200,7 @@ void ocfs2_mark_lockres_freeing(struct ocfs2_lock_res *lockres) spin_lock_irqsave(lockres-l_lock, flags); } +out_unlock: spin_unlock_irqrestore(lockres-l_lock, flags); } @@ -3180,7 +3209,7 @@ void ocfs2_simple_drop_lockres(struct ocfs2_super *osb, { int ret; - ocfs2_mark_lockres_freeing(lockres); + ocfs2_mark_lockres_freeing(osb, lockres); ret = ocfs2_drop_lock(osb, lockres); if (ret) mlog_errno(ret); diff --git a/fs/ocfs2/dlmglue.h b/fs/ocfs2/dlmglue.h index 1d596d8c4a4a..d293a22c32c5 100644 --- a/fs/ocfs2/dlmglue.h +++ b/fs/ocfs2/dlmglue.h @@ -157,7 +157,8 @@ int ocfs2_refcount_lock(struct ocfs2_refcount_tree *ref_tree, int ex); void ocfs2_refcount_unlock(struct ocfs2_refcount_tree *ref_tree, int ex); -void ocfs2_mark_lockres_freeing(struct ocfs2_lock_res *lockres); +void ocfs2_mark_lockres_freeing(struct ocfs2_super *osb, + struct ocfs2_lock_res *lockres); void ocfs2_simple_drop_lockres(struct ocfs2_super *osb, struct ocfs2_lock_res *lockres); diff --git a/fs/ocfs2/inode.c b/fs/ocfs2/inode.c index 3b0d722de35e..9661f8db21dc 100644 --- a/fs/ocfs2/inode.c +++ b/fs/ocfs2/inode.c @@ -1053,6 +1053,7 @@ static void ocfs2_clear_inode(struct inode *inode) { int status; struct ocfs2_inode_info *oi = OCFS2_I(inode); + struct ocfs2_super *osb = OCFS2_SB(inode-i_sb); clear_inode(inode); trace_ocfs2_clear_inode((unsigned long long)oi-ip_blkno, @@ -1069,9 +1070,9 @@ static void ocfs2_clear_inode(struct inode *inode) /* Do these before all the other work so that we don't bounce * the downconvert thread while waiting to destroy the locks. */ - ocfs2_mark_lockres_freeing(oi-ip_rw_lockres); -
Re: [Ocfs2-devel] [patch 03/11] ocfs2/o2net: incorrect to terminate accepting connections loop upon rejecting an invalid one
On 01/24/2014 01:55 PM, Mark Fasheh wrote: On Fri, Jan 24, 2014 at 12:47:02PM -0800, a...@linux-foundation.org wrote: From: Tariq Saeed tariq.x.sa...@oracle.com Subject: ocfs2/o2net: incorrect to terminate accepting connections loop upon rejecting an invalid one When o2net-accept-one() rejects an illegal connection, it terminates the loop picking up the remaining queued connections. This fix will continue accepting connections till the queue is emtpy. Addresses Orabug 17489469. Thanks for sending this, review comments below. diff -puN fs/ocfs2/cluster/tcp.c~ocfs2-o2net-incorrect-to-terminate-accepting-connections-loop-upon-rejecting-an-invalid-one fs/ocfs2/cluster/tcp.c --- a/fs/ocfs2/cluster/tcp.c~ocfs2-o2net-incorrect-to-terminate-accepting-connections-loop-upon-rejecting-an-invalid-one +++ a/fs/ocfs2/cluster/tcp.c @@ -1826,7 +1826,7 @@ int o2net_register_hb_callbacks(void) /* */ -static int o2net_accept_one(struct socket *sock) +static int o2net_accept_one(struct socket *sock, int *more) { int ret, slen; struct sockaddr_in sin; @@ -1837,6 +1837,7 @@ static int o2net_accept_one(struct socke struct o2net_node *nn; BUG_ON(sock == NULL); +*more = 0; ret = sock_create_lite(sock-sk-sk_family, sock-sk-sk_type, sock-sk-sk_protocol, new_sock); if (ret) @@ -1848,6 +1849,7 @@ static int o2net_accept_one(struct socke if (ret 0) goto out; +*more = 1; new_sock-sk-sk_allocation = GFP_ATOMIC; ret = o2net_set_nodelay(new_sock); @@ -1949,8 +1951,15 @@ out: static void o2net_accept_many(struct work_struct *work) { struct socket *sock = o2net_listen_sock; -while (o2net_accept_one(sock) == 0) +int more; +int err; + +for (;;) { +err = o2net_accept_one(sock, more); +if (!more) +break; We're throwing out 'err' here and trusting the variable 'more'. However, err could be set and more would be 0 regardless of whether there actually are more connections to be had. This makes more sense given when 'more' is set: Hi Mark, thanks for reviewing this :). Please correct me if I am wrong. The idea was if accept itself fails exit the loop, otherwise continue scanning all queued connections. For eg: if a connection request from unknown node followed by known node happens at the same time, we can ignore the ret value of the first connection request and still continue accepting the next connection. ret value appears to be of not much importance other than just for logging or debugging purpose. if (err) break; /* only trust the value of 'more' when err == 0 */ if (more) break; Thanks, --Mark -- Mark Fasheh ___ Ocfs2-devel mailing list Ocfs2-devel@oss.oracle.com https://oss.oracle.com/mailman/listinfo/ocfs2-devel ___ Ocfs2-devel mailing list Ocfs2-devel@oss.oracle.com https://oss.oracle.com/mailman/listinfo/ocfs2-devel
Re: [Ocfs2-devel] [PATCH] ocfs2: Code cleanup: Removed unused functions
Hi Goldwyn, I am not 100% sure ... but my understanding is ocfs2_fs.h defines metadata structures and helper functions. This file is shared between ocfs2-tools and kernel modules. The functions you mentioned are used by ocfs2-tools code and hence exist in this file. Thanks, --Srini On 01/20/2014 05:47 AM, Goldwyn Rodrigues wrote: These functions are either coded in individual files as static or not used at all. Remove them. Signed-off-by: Goldwyn Rodrigues rgold...@suse.com --- fs/ocfs2/ocfs2_fs.h | 40 1 file changed, 40 deletions(-) diff --git a/fs/ocfs2/ocfs2_fs.h b/fs/ocfs2/ocfs2_fs.h index 938387a..3ba3851 100644 --- a/fs/ocfs2/ocfs2_fs.h +++ b/fs/ocfs2/ocfs2_fs.h @@ -1325,16 +1325,6 @@ static inline int ocfs2_extent_recs_per_dx_root(struct super_block *sb) return size / sizeof(struct ocfs2_extent_rec); } -static inline int ocfs2_chain_recs_per_inode(struct super_block *sb) -{ - int size; - - size = sb-s_blocksize - - offsetof(struct ocfs2_dinode, id2.i_chain.cl_recs); - - return size / sizeof(struct ocfs2_chain_rec); -} - static inline u16 ocfs2_extent_recs_per_eb(struct super_block *sb) { int size; @@ -1493,16 +1483,6 @@ static inline int ocfs2_extent_recs_per_inode(int blocksize) return size / sizeof(struct ocfs2_extent_rec); } -static inline int ocfs2_chain_recs_per_inode(int blocksize) -{ - int size; - - size = blocksize - - offsetof(struct ocfs2_dinode, id2.i_chain.cl_recs); - - return size / sizeof(struct ocfs2_chain_rec); -} - static inline int ocfs2_extent_recs_per_eb(int blocksize) { int size; @@ -1589,12 +1569,6 @@ static inline int ocfs2_xattr_recs_per_xb(int blocksize) #endif /* __KERNEL__ */ -static inline int ocfs2_system_inode_is_global(int type) -{ - return ((type = 0) - (type = OCFS2_LAST_GLOBAL_SYSTEM_INODE)); -} - static inline int ocfs2_sprintf_system_inode_name(char *buf, int len, int type, int slot) { @@ -1622,19 +1596,5 @@ static inline void ocfs2_set_de_type(struct ocfs2_dir_entry *de, de-file_type = ocfs2_type_by_mode[(mode S_IFMT)S_SHIFT]; } -static inline int ocfs2_gd_is_discontig(struct ocfs2_group_desc *gd) -{ - if ((offsetof(struct ocfs2_group_desc, bg_bitmap) + - le16_to_cpu(gd-bg_size)) != - offsetof(struct ocfs2_group_desc, bg_list)) - return 0; - /* - * Only valid to check l_next_free_rec if - * bg_bitmap + bg_size == bg_list. - */ - if (!gd-bg_list.l_next_free_rec) - return 0; - return 1; -} #endif /* _OCFS2_FS_H */ ___ Ocfs2-devel mailing list Ocfs2-devel@oss.oracle.com https://oss.oracle.com/mailman/listinfo/ocfs2-devel
Re: [Ocfs2-devel] [PATCH 5/5] ocfs2: Implement delayed dropping of last dquot reference
On 01/20/2014 07:31 AM, Goldwyn Rodrigues wrote: On 01/16/2014 04:58 PM, Jan Kara wrote: On Thu 16-01-14 23:28:49, Jan Kara wrote: We cannot drop last dquot reference from downconvert thread as that creates the following deadlock: NODE 1 NODE2 holds dentry lock for 'foo' holds inode lock for GLOBAL_BITMAP_SYSTEM_INODE dquot_initialize(bar) ocfs2_dquot_acquire() ocfs2_inode_lock(USER_QUOTA_SYSTEM_INODE) ... downconvert thread (triggered from another node or a different process from NODE2) ocfs2_dentry_post_unlock() ... iput(foo) ocfs2_evict_inode(foo) ocfs2_clear_inode(foo) dquot_drop(inode) ... ocfs2_dquot_release() ocfs2_inode_lock(USER_QUOTA_SYSTEM_INODE) - blocks finds we need more space in quota file ... ocfs2_extend_no_holes() ocfs2_inode_lock(GLOBAL_BITMAP_SYSTEM_INODE) - deadlocks waiting for downconvert thread We solve the problem by postponing dropping of the last dquot reference to a workqueue if it happens from the downconvert thread. Hum, now looking again into ocfs2_clear_inode() there are more problems than I originally thought. Look for example at ocfs2_mark_lockres_freeing(). That will block on rw/inode/open lock if there is downconvert pending waiting for that downconvert to finish. However that never happens when ocfs2_clear_inode() is called from the downconvert thread. So we are back to square one - I don't see a way how to fix these deadlocks without postponing dropping of inode reference to a workqueue :(. Since the reason of the unlink performance is the delay in calling ocfs2_open_unlock(), and the ocfs2_mark_lockres_freeing() comes after ocfs2_open_unlock(): can we move the call to ocfs2_open_unlock() to ocfs2_evict_inode() and then perform ocfs2_clear_inode() in a deferred way? once ocfs2_evict_inode is returned, vfs would destroy the inode, so I think we should do the cleanup before that and hence we cannot differ ocfs2_clear_inode from inside ocfs2_evict_inode May be we should queue ocfs2_blocking_ast call itself to down convert thread. That way it doesn't prevent down convert thread from clearing the inode. Once when down convert thread comes to processes the queued ast/bast and finds lockres cleared it can just return. Honza Signed-off-by: Jan Kara j...@suse.cz --- fs/ocfs2/ocfs2.h| 5 + fs/ocfs2/quota.h| 2 ++ fs/ocfs2/quota_global.c | 35 +++ fs/ocfs2/super.c| 8 4 files changed, 50 insertions(+) diff --git a/fs/ocfs2/ocfs2.h b/fs/ocfs2/ocfs2.h index ca81f6b49236..f6134345fe42 100644 --- a/fs/ocfs2/ocfs2.h +++ b/fs/ocfs2/ocfs2.h @@ -30,6 +30,7 @@ #include linux/sched.h #include linux/wait.h #include linux/list.h +#include linux/llist.h #include linux/rbtree.h #include linux/workqueue.h #include linux/kref.h @@ -410,6 +411,10 @@ struct ocfs2_super struct list_head blocked_lock_list; unsigned long blocked_lock_count; +/* List of dquot structures to drop last reference to */ +struct llist_head dquot_drop_list; +struct work_struct dquot_drop_work; + wait_queue_head_tosb_mount_event; /* Truncate log info */ diff --git a/fs/ocfs2/quota.h b/fs/ocfs2/quota.h index d5ab56cbe5c5..f266d67df3c6 100644 --- a/fs/ocfs2/quota.h +++ b/fs/ocfs2/quota.h @@ -28,6 +28,7 @@ struct ocfs2_dquot { unsigned int dq_use_count;/* Number of nodes having reference to this entry in global quota file */ s64 dq_origspace;/* Last globally synced space usage */ s64 dq_originodes;/* Last globally synced inode usage */ +struct llist_node list;/* Member of list of dquots to drop */ }; /* Description of one chunk to recover in memory */ @@ -110,6 +111,7 @@ int ocfs2_read_quota_phys_block(struct inode *inode, u64 p_block, int ocfs2_create_local_dquot(struct dquot *dquot); int ocfs2_local_release_dquot(handle_t *handle, struct dquot *dquot); int ocfs2_local_write_dquot(struct dquot *dquot); +void ocfs2_drop_dquot_refs(struct work_struct *work); extern const struct dquot_operations ocfs2_quota_operations; extern struct quota_format_type ocfs2_quota_format; diff --git a/fs/ocfs2/quota_global.c b/fs/ocfs2/quota_global.c index aaa50611ec66..7921e209c64b 100644 --- a/fs/ocfs2/quota_global.c +++ b/fs/ocfs2/quota_global.c @@ -10,6 +10,7 @@ #include linux/jiffies.h #include linux/writeback.h #include linux/workqueue.h
Re: [Ocfs2-devel] [PATCH] Revert iput deferring code in ocfs2_drop_dentry_lock
Hi Jan, thanks a lot for explaining the problem. Please see my comment below. On 01/16/2014 06:02 AM, Jan Kara wrote: On Thu 16-01-14 07:35:58, Goldwyn Rodrigues wrote: On 01/15/2014 08:47 PM, Jan Kara wrote: On Wed 15-01-14 17:17:55, Goldwyn Rodrigues wrote: On 01/15/2014 09:53 AM, Jan Kara wrote: On Wed 15-01-14 09:32:39, Goldwyn Rodrigues wrote: On 01/15/2014 07:33 AM, Jan Kara wrote: On Wed 15-01-14 06:51:51, Goldwyn Rodrigues wrote: The following patches are reverted in this patch because these patches caused regression in the unlink() calls. ea455f8ab68338ba69f5d3362b342c115bea8e13 - ocfs2: Push out dropping of dentry lock to ocfs2_wq f7b1aa69be138ad9d7d3f31fa56f4c9407f56b6a - ocfs2: Fix deadlock on umount 5fd131893793567c361ae64cbeb28a2a753bbe35 - ocfs2: Don't oops in ocfs2_kill_sb on a failed mount The regression is caused because these patches delay the iput() in case of dentry unlocks. This also delays the unlocking of the open lockres. The open lockresource is required to test if the inode can be wiped from disk on not. When the deleting node does not get the open lock, it marks it as orphan (even though it is not in use by another node/process) and causes a journal checkpoint. This delays operations following the inode eviction. This also moves the inode to the orphaned inode which further causes more I/O and a lot of unneccessary orphans. As for the fix, I tried reproducing the problem by using quotas and enabling lockdep, but the issue could not be reproduced. So I was thinking about this for a while trying to remember... What it really boils down to is whether it is OK to call ocfs2_delete_inode() from DLM downconvert thread. Bear in mind that ocfs2_delete_inode() takes all kinds of interesting cluster locks meaning that the downconvert thread can block waiting to acquire some cluster lock. That seems like asking for trouble to me (i.e., what if the node holding the lock we need from downconvert thread needs a lock from us first?) but maybe it is OK these days. The only lock it tries to take is the inode lock resource, which seems to be fine to take. Why is it obviously fine? Some other node might be holding the inode lock and still write to the inode. If that needs a cluster lock for block allocation and that allocation cluster lock is currently hold by our node, we are screwed. Aren't we? No. If another thread is using the allocation cluster lock implies we already have the inode lock. Cluster locks are also taken in a strict order in order to avoid ABBA locking. I agree with what you wrote above but it's not exactly what I'm talking about. What seems to be possible is: NODE 1 NODE2 holds dentry lock for 'foo' holds inode lock for GLOBAL_BITMAP_SYSTEM_INODE dquot_initialize(bar) ocfs2_dquot_acquire() ocfs2_inode_lock(USER_QUOTA_SYSTEM_INODE) ... downconvert thread (triggered from another node or a different process from NODE2) ocfs2_dentry_post_unlock() ... iput(foo) ocfs2_evict_inode(foo) ocfs2_clear_inode(foo) dquot_drop(inode) ... ocfs2_inode_lock(USER_QUOTA_SYSTEM_INODE) - blocks finds we need more space in quota file ... ocfs2_extend_no_holes() ocfs2_inode_lock(GLOBAL_BITMAP_SYSTEM_INODE) - deadlocks waiting for downconvert thread Thanks. This explains the situation much better. Whats stopping NODE1 from releasing the GLOBAL_BITMAP_SYSTEM_INODE? (to break from hold and wait condition) The fact that to release the lock, it has to be downconverted. And there is only one downconvert thread (ocfs2dc) and that is busy waiting for USER_QUOTA_SYSTEM_INODE lock... Yes that indeed seems to be a problem, thanks a lot for explaining :). I do not know much about quota's code but wondering if we can fix this problem by enforcing the lock ordering ? Can NODE2 check first if it needs space before getting lock on USER_QUOTA_SYSTEM_INODE. If it does then it should acquire GLOBAL_BITMAP_SYSTEM_INODE before acquiring lock on USER_QUOTA_SYSTEM_INODE ? (Basically we enforce GLOBAL_BITMAP_SYSTEM_INODE cannot be taken while node has USER_QUOTA_SYSTEM_INODE lock) However the positive part of this is that the race I was originally afraid of - when ocfs2_delete_inode() would be triggered from ocfs2_dentry_post_unlock() - isn't possible because ocfs2 seems to keep invariant that node can cache dentry for 'foo' without holding inode lock
Re: [Ocfs2-devel] [PATCH 1/1] o2dlm: fix NULL pointer dereference in o2dlm_blocking_ast_wrapper
On 01/13/2014 08:06 PM, Joseph Qi wrote: On 2014/1/11 9:19, Srinivas Eeda wrote: From: Srinivas Eeda seeda@srini.(none) A tiny race between BAST and unlock message causes the NULL dereference. A node sends an unlock request to master and receives a response. Before processing the response it receives a BAST from the master. Since both requests are processed by different threads it creates a race. While the BAST is being processed, lock can get freed by unlock code. This patch makes bast to return immediately if lock is found but unlock is pending. The code should handle this race. We also have to fix master node to skip sending BAST after receiving unlock message. Below is the crash stack BUG: unable to handle kernel NULL pointer dereference at 0048 IP: [a015e023] o2dlm_blocking_ast_wrapper+0xd/0x16 [a034e3db] dlm_do_local_bast+0x8e/0x97 [ocfs2_dlm] [a034f366] dlm_proxy_ast_handler+0x838/0x87e [ocfs2_dlm] [a0308abe] o2net_process_message+0x395/0x5b8 [ocfs2_nodemanager] [a030aac8] o2net_rx_until_empty+0x762/0x90d [ocfs2_nodemanager] [81071802] worker_thread+0x14d/0x1ed Signed-off-by: Srinivas Eeda srinivas.e...@oracle.com --- fs/ocfs2/dlm/dlmast.c |9 +++-- 1 file changed, 7 insertions(+), 2 deletions(-) diff --git a/fs/ocfs2/dlm/dlmast.c b/fs/ocfs2/dlm/dlmast.c index b46278f..dbc6cee 100644 --- a/fs/ocfs2/dlm/dlmast.c +++ b/fs/ocfs2/dlm/dlmast.c @@ -385,8 +385,13 @@ int dlm_proxy_ast_handler(struct o2net_msg *msg, u32 len, void *data, head = res-granted; list_for_each_entry(lock, head, list) { -if (lock-ml.cookie == cookie) -goto do_ast; +/* if lock is found but unlock is pending ignore the bast */ +if (lock-ml.cookie == cookie) { +if (lock-unlock_pending) +break; +else +goto do_ast; +} } mlog(0, Got %sast for unknown lock! cookie=%u:%llu, name=%.*s, I found you sent a version on Jan 30, 2012. https://oss.oracle.com/pipermail/ocfs2-devel/2012-January/008469.html Compared with the old version, this version only saves a little bit CPU, am I right? Yes you are right. I made the change as Goldwyn suggested which is a good thing to have :) ___ Ocfs2-devel mailing list Ocfs2-devel@oss.oracle.com https://oss.oracle.com/mailman/listinfo/ocfs2-devel
Re: [Ocfs2-devel] [PATCH 1/1] o2dlm: fix NULL pointer dereference in o2dlm_blocking_ast_wrapper
On 01/13/2014 07:37 AM, Joel Becker wrote: On Fri, Jan 10, 2014 at 05:19:13PM -0800, Srinivas Eeda wrote: From: Srinivas Eeda seeda@srini.(none) A tiny race between BAST and unlock message causes the NULL dereference. A node sends an unlock request to master and receives a response. Before processing the response it receives a BAST from the master. Since both requests are processed by different threads it creates a race. While the BAST is being processed, lock can get freed by unlock code. This patch makes bast to return immediately if lock is found but unlock is pending. The code should handle this race. We also have to fix master node to skip sending BAST after receiving unlock message. Did the master send the BAST after the unlock, or does that race too? Does the master know the unlock has succeeded, or does it just think so? I think it's due to a race but I haven't debugged the master. My guess is unlock request sneaked in before the dlm_flush_asts was called. However non master node should handle this race as well, so just did that part which fixed a bug we were seeing. @@ -385,8 +385,13 @@ int dlm_proxy_ast_handler(struct o2net_msg *msg, u32 len, void *data, head = res-granted; list_for_each_entry(lock, head, list) { -if (lock-ml.cookie == cookie) -goto do_ast; +/* if lock is found but unlock is pending ignore the bast */ +if (lock-ml.cookie == cookie) { +if (lock-unlock_pending) +break; +else +goto do_ast; +} This breaks out for asts as well as basts. Can't that cause problems with the unlock ast expected by the caller? if unlock_pending is set, then the node is trying to unlock an existing lock and shouldn't receive any asts ? Joel ___ Ocfs2-devel mailing list Ocfs2-devel@oss.oracle.com https://oss.oracle.com/mailman/listinfo/ocfs2-devel
[Ocfs2-devel] [PATCH 1/1] o2dlm: fix NULL pointer dereference in o2dlm_blocking_ast_wrapper
From: Srinivas Eeda seeda@srini.(none) A tiny race between BAST and unlock message causes the NULL dereference. A node sends an unlock request to master and receives a response. Before processing the response it receives a BAST from the master. Since both requests are processed by different threads it creates a race. While the BAST is being processed, lock can get freed by unlock code. This patch makes bast to return immediately if lock is found but unlock is pending. The code should handle this race. We also have to fix master node to skip sending BAST after receiving unlock message. Below is the crash stack BUG: unable to handle kernel NULL pointer dereference at 0048 IP: [a015e023] o2dlm_blocking_ast_wrapper+0xd/0x16 [a034e3db] dlm_do_local_bast+0x8e/0x97 [ocfs2_dlm] [a034f366] dlm_proxy_ast_handler+0x838/0x87e [ocfs2_dlm] [a0308abe] o2net_process_message+0x395/0x5b8 [ocfs2_nodemanager] [a030aac8] o2net_rx_until_empty+0x762/0x90d [ocfs2_nodemanager] [81071802] worker_thread+0x14d/0x1ed Signed-off-by: Srinivas Eeda srinivas.e...@oracle.com --- fs/ocfs2/dlm/dlmast.c |9 +++-- 1 file changed, 7 insertions(+), 2 deletions(-) diff --git a/fs/ocfs2/dlm/dlmast.c b/fs/ocfs2/dlm/dlmast.c index b46278f..dbc6cee 100644 --- a/fs/ocfs2/dlm/dlmast.c +++ b/fs/ocfs2/dlm/dlmast.c @@ -385,8 +385,13 @@ int dlm_proxy_ast_handler(struct o2net_msg *msg, u32 len, void *data, head = res-granted; list_for_each_entry(lock, head, list) { - if (lock-ml.cookie == cookie) - goto do_ast; + /* if lock is found but unlock is pending ignore the bast */ + if (lock-ml.cookie == cookie) { + if (lock-unlock_pending) + break; + else + goto do_ast; + } } mlog(0, Got %sast for unknown lock! cookie=%u:%llu, name=%.*s, -- 1.7.9.5 ___ Ocfs2-devel mailing list Ocfs2-devel@oss.oracle.com https://oss.oracle.com/mailman/listinfo/ocfs2-devel
Re: [Ocfs2-devel] What's the need of OCFS2_INODE_MAYBE_ORPHANED?
On 01/09/2014 08:34 AM, Goldwyn Rodrigues wrote: On 01/09/2014 10:06 AM, Srinivas Eeda wrote: On 01/09/2014 07:44 AM, Goldwyn Rodrigues wrote: Hi Srini, Thanks for the reply. On 01/08/2014 11:30 PM, Srinivas Eeda wrote: From the comments in fs/ocfs2/inode.h:90 it seems, this was used in legacy ocfs2 systems when a node received unlink votes. Since unlink votes has been done away with and replaced with open locks, is this flag still required? If yes, why? My understanding is that unlink voting protocol was heavy. So the following was done to address it. To do an unlink, dentry has to be removed. In order to do that the node has to get EX lock on the dentry which means all other nodes have to downconvert. In general EX lock on dentry is acquired only in unlink and I assume rename case. So all nodes which down convert the lock mark their inode OCFS2_INODE_MAYBE_ORPHANED. The only problem with this is that dentry on a node can get purged because of memory pressure which marks inode as OCFS2_INODE_MAYBE_ORPHANED even when no unlink was done on this inode. I think you are getting confused between dentry_lock (dentry_lockres) and open lock (ip_open_lockres). AFAICS, dentry locks are used to control the remote dentries. I was trying to answer why we need OCFS2_INODE_MAYBE_ORPHANED flag, I guess I wasn't clear. I'll make an other attempt :). One way for node A to tell node B that an unlink had happened on node A is by sending an explicit message(something similar to what we had in old release). When node B received such communication it marked inode with OCFS2_INODE_MAYBE_ORPHANED flag if it still had the inode in use. The other way(current implementation) is to indirectly tell it by asking node B to purge dentry lockres. Once node B has been informed that dentry lock has to be released, it assumes inode might have been unlinked somewhere and marks the inode with OCFS2_INODE_MAYBE_ORPHANED flag. So, we need OCFS2_INODE_MAYBE_ORPHANED flag to tell node B that it should finish the second phase of unlink(remove the inode from file system) when it closes the file. Okay, but why should node B do the cleanup/wipe when node A initiated the unlink()? Shouldn't it be done by node A? All node B should do is to write the inode and clear it from the cache. The sequence is synchronized by dentry_lock. Right? removing dentry is only the first part. An inode can still be open after that. Yes, I did not consider that. How about using open locks ro_holders count to identify this? That may just work. Thanks! One problem I see in using open lock for this is it could be late. Consider the scenario where node A removes the dentry and then the node crashes before trying the try_open_lock. Node B does the file close later but it doesn't know that the file was unlinked and doesn't do the clean up. To me it appears OCFS2_INODE_MAYBE_ORPHANED is necessary. Any delay it is causing must be addressed differently. Removing those 3 patches I pointed should help in synchronously deleting the file. But I am not sure if it would speed up the process. We are performing ocfs2_inode_lock() anyways which is re-reading the inode from disk (for node A) node B should do the cleanup in this case because it is the last node to close the file. Node A, will do the first phase of unlink(remove dentry) and node B will do the second phase of unlink(remove the inode). Agree. Lets see if we can use open locks for this. Thanks! ___ Ocfs2-devel mailing list Ocfs2-devel@oss.oracle.com https://oss.oracle.com/mailman/listinfo/ocfs2-devel
Re: [Ocfs2-devel] What's the need of OCFS2_INODE_MAYBE_ORPHANED?
Hi Goldwyn, On 01/08/2014 04:12 PM, Goldwyn Rodrigues wrote: Hi, From the comments in fs/ocfs2/inode.h:90 it seems, this was used in legacy ocfs2 systems when a node received unlink votes. Since unlink votes has been done away with and replaced with open locks, is this flag still required? If yes, why? My understanding is that unlink voting protocol was heavy. So the following was done to address it. To do an unlink, dentry has to be removed. In order to do that the node has to get EX lock on the dentry which means all other nodes have to downconvert. In general EX lock on dentry is acquired only in unlink and I assume rename case. So all nodes which down convert the lock mark their inode OCFS2_INODE_MAYBE_ORPHANED. The only problem with this is that dentry on a node can get purged because of memory pressure which marks inode as OCFS2_INODE_MAYBE_ORPHANED even when no unlink was done on this inode. From my ongoing investigation of unlink() times, it seems this flag is causing the delay with releasing the open locks while downconverting dentry locks. The flag is set _everytime_ a dentry downconvert is performed even if the file is not scheduled to be deleted. If not, we can be smartly evict the inodes which are *not* to be deleted (i_nlink0) by not offloading to ocfs2_wq. This way open lock will release faster speeding up unlink on the deleting node. Are you referring to the delay caused by ocfs2_drop_dentry_lock queueing dentry locks to dentry_lock_list ?. If that's the case, have you tried removing following patches which introduced that behavior ? I think that quota's deadlock bug might have to be addressed differently ? ea455f8ab68338ba69f5d3362b342c115bea8e13 eb90e46458b08bc7c1c96420ca0eb4263dc1d6e5 bb44bf820481e19381ec549118e4ee0b89d56191 The above patches were leaving orphan files around which was causing a big problem to some applications that removes lot of files which inturn caused intermittent hangs ___ Ocfs2-devel mailing list Ocfs2-devel@oss.oracle.com https://oss.oracle.com/mailman/listinfo/ocfs2-devel
Re: [Ocfs2-devel] What's the need of OCFS2_INODE_MAYBE_ORPHANED?
On 01/08/2014 07:12 PM, Goldwyn Rodrigues wrote: Hi Srini, On 01/08/2014 07:29 PM, Srinivas Eeda wrote: Hi Goldwyn, On 01/08/2014 04:12 PM, Goldwyn Rodrigues wrote: Hi, From the comments in fs/ocfs2/inode.h:90 it seems, this was used in legacy ocfs2 systems when a node received unlink votes. Since unlink votes has been done away with and replaced with open locks, is this flag still required? If yes, why? My understanding is that unlink voting protocol was heavy. So the following was done to address it. To do an unlink, dentry has to be removed. In order to do that the node has to get EX lock on the dentry which means all other nodes have to downconvert. In general EX lock on dentry is acquired only in unlink and I assume rename case. So all nodes which down convert the lock mark their inode OCFS2_INODE_MAYBE_ORPHANED. The only problem with this is that dentry on a node can get purged because of memory pressure which marks inode as OCFS2_INODE_MAYBE_ORPHANED even when no unlink was done on this inode. I think you are getting confused between dentry_lock (dentry_lockres) and open lock (ip_open_lockres). AFAICS, dentry locks are used to control the remote dentries. I was trying to answer why we need OCFS2_INODE_MAYBE_ORPHANED flag, I guess I wasn't clear. I'll make an other attempt :). One way for node A to tell node B that an unlink had happened on node A is by sending an explicit message(something similar to what we had in old release). When node B received such communication it marked inode with OCFS2_INODE_MAYBE_ORPHANED flag if it still had the inode in use. The other way(current implementation) is to indirectly tell it by asking node B to purge dentry lockres. Once node B has been informed that dentry lock has to be released, it assumes inode might have been unlinked somewhere and marks the inode with OCFS2_INODE_MAYBE_ORPHANED flag. So, we need OCFS2_INODE_MAYBE_ORPHANED flag to tell node B that it should finish the second phase of unlink(remove the inode from file system) when it closes the file. From my ongoing investigation of unlink() times, it seems this flag is causing the delay with releasing the open locks while downconverting dentry locks. The flag is set _everytime_ a dentry downconvert is performed even if the file is not scheduled to be deleted. If not, we can be smartly evict the inodes which are *not* to be deleted (i_nlink0) by not offloading to ocfs2_wq. This way open lock will release faster speeding up unlink on the deleting node. Are you referring to the delay caused by ocfs2_drop_dentry_lock queueing dentry locks to dentry_lock_list ?. If that's the case, have you tried removing following patches which introduced that behavior ? I think that quota's deadlock bug might have to be addressed differently ? ea455f8ab68338ba69f5d3362b342c115bea8e13 Yes, that should make some difference. Let me try that. However, I was suggesting we do not set the OCFS2_INODE_MAYBE_ORPHANED flag in ocfs2_dentry_convert_worker as well, but I am not sure of the consequences and that is the reason I asked why it is used. eb90e46458b08bc7c1c96420ca0eb4263dc1d6e5 bb44bf820481e19381ec549118e4ee0b89d56191 I did not find these gits. Which tree are you referring to? Sorry, my bad. Those commit id's were from my local repo. I meant f7b1aa69be138ad9d7d3f31fa56f4c9407f56b6a and 5fd131893793567c361ae64cbeb28a2a753bbe35 The above patches were leaving orphan files around which was causing a big problem to some applications that removes lot of files which inturn caused intermittent hangs ___ Ocfs2-devel mailing list Ocfs2-devel@oss.oracle.com https://oss.oracle.com/mailman/listinfo/ocfs2-devel
Re: [Ocfs2-devel] [PATCH 1/1] ocfs2/dlm: ocfs2 dlm umount skip migrating lockres
Joseph, thanks for pointing it out. Andrew, can you please pull the following patch acked by Sunil and Joel https://oss.oracle.com/pipermail/ocfs2-devel/2012-August/008670.html Thanks, --Srini On 09/11/2013 07:07 PM, Joseph Qi wrote: On 2013/9/12 2:40, Tariq Saeed wrote: umount thread could race with migrate handler thread receiving resource migration request from other node. When this happens, migrating thread could set this node as the master along with DLM_LOCK_RES_MIGRATING flag. umount thread should skip migrating this newly owned lockres until DLM_LOCK_RES_MIGRATING flag is unset by migrate handler thread. umount thread will ultimately migrate this lockres during another pass of the lockres hash list. Signed-off-by: Tariq Saeed tariq.x.sa...@oracle.com Signed-off-by: Srinivas Eeda srinivas.e...@oracle.com --- fs/ocfs2/dlm/dlmmaster.c |4 1 files changed, 4 insertions(+), 0 deletions(-) diff --git a/fs/ocfs2/dlm/dlmmaster.c b/fs/ocfs2/dlm/dlmmaster.c index 33ecbe0..1643b58 100644 --- a/fs/ocfs2/dlm/dlmmaster.c +++ b/fs/ocfs2/dlm/dlmmaster.c @@ -2360,6 +2360,10 @@ static int dlm_is_lockres_migrateable(struct dlm_ctxt *dlm, if (res-owner != dlm-node_num) return 0; +if (res-state DLM_LOCK_RES_MIGRATING) { +return 0; +} + for (idx = DLM_GRANTED_LIST; idx = DLM_BLOCKED_LIST; idx++) { queue = dlm_list_idx_to_ptr(res, idx); list_for_each_entry(lock, queue, list) { This patch was sent before by Jiufei Xue on Aug 2012, which was titled as ocfs2: delay migration when the lockres is in migration state. Please refer the below link for details: https://oss.oracle.com/pipermail/ocfs2-devel/2012-August/008670.html As I remembered, the patch was alerady acked by Sunil and Joel, and then it was in ocfs2.git. I am not sure why it wasn't merged to mainline. ___ Ocfs2-devel mailing list Ocfs2-devel@oss.oracle.com https://oss.oracle.com/mailman/listinfo/ocfs2-devel ___ Ocfs2-devel mailing list Ocfs2-devel@oss.oracle.com https://oss.oracle.com/mailman/listinfo/ocfs2-devel
Re: [Ocfs2-devel] [PATCH] ocfs2: resend master request when lost connection with someone
On 05/31/2013 03:38 AM, Xue jiufei wrote: Hi, Xiaowei It's OK to simlify the patch just as you did. But we don't want to resend master request to all others nodes in consideration of network traffic. So we record those maybe down nodes in down_nodemap. 于 2013/5/28 14:12, xiaowei.hu 写道: Hi, I reviewed this patch , it did could fix a temp lost connection problem, but a few questions: 1. since we don't need to know the node numbers of down nodes, if simply replace the down_nodemap[BITS_TO_LONGS(O2NM_MAX_NODES)], with a int named for example mreq_msg_send_fail ? 2.since the final work is to return -EAGAIN, the resend all master requests. How about we simply do this?: while ((nodenum = dlm_node_iter_next(iter)) = 0) { ret = dlm_do_master_request(res, mle, nodenum); -if (ret 0) +if (ret 0) { mlog_errno(ret); +wait_on_recovery = 1; +msleep(DLM_NODE_DEATH_WAIT_MAX); +goto redo_request; +} Am I missing something? Thanks, Xiaowei On 12/22/2012 03:00 PM, Xue jiufei wrote: Function dlm_get_lock_resource() sends master request to all nodes in domain_map and waits for their responses when the node(say nodeA) doesn't known who the master is. When nodeA sends the master request, it happened that network of nodeB down for a while, and then restore. The master request from nodeA does not reach nodeB. NodeA may wait again and again in dlm_wait_for_lock_mastery() and never returns. This patch resend the mater request when a node lost connection with some other nodes. Yes, with the current reconnect code there is a possibility of message loss and can happen to all kinds of messages and responses. You are assuming the message never received by node B. Message might have been received by node B but the response might have lost too. Currently there is no way of telling the difference. DLM layer shouldn't worry about nodes loosing connection temporarily. Right way to fix this is to fix o2net reconnect code. Signed-off-by: xuejiufei xuejiu...@huawei.com --- fs/ocfs2/dlm/dlmmaster.c | 41 +++-- 1 files changed, 35 insertions(+), 6 deletions(-) diff --git a/fs/ocfs2/dlm/dlmmaster.c b/fs/ocfs2/dlm/dlmmaster.c index c491f97..2a99a95 100644 --- a/fs/ocfs2/dlm/dlmmaster.c +++ b/fs/ocfs2/dlm/dlmmaster.c @@ -106,7 +106,7 @@ static int dlm_do_master_request(struct dlm_lock_resource *res, static int dlm_wait_for_lock_mastery(struct dlm_ctxt *dlm, struct dlm_lock_resource *res, struct dlm_master_list_entry *mle, - int *blocked); + int *blocked, int *retry, int host_down); static int dlm_restart_lock_mastery(struct dlm_ctxt *dlm, struct dlm_lock_resource *res, struct dlm_master_list_entry *mle, @@ -712,6 +712,8 @@ struct dlm_lock_resource * dlm_get_lock_resource(struct dlm_ctxt *dlm, unsigned int hash; int tries = 0; int bit, wait_on_recovery = 0; +int retry = 0; +unsigned long down_nodemap[BITS_TO_LONGS(O2NM_MAX_NODES)]; BUG_ON(!lockid); @@ -910,11 +912,25 @@ redo_request: goto wait; ret = -EINVAL; -dlm_node_iter_init(mle-vote_map, iter); +if (!retry) +dlm_node_iter_init(mle-vote_map, iter); +else { +mlog(0, %s:%.*s: retrying, send master request to maybe down node\n, +dlm-name, res-lockname.len, res-lockname.name); +dlm_node_iter_init(down_nodemap, iter); +} +memset(down_nodemap, 0, sizeof(down_nodemap)); + while ((nodenum = dlm_node_iter_next(iter)) = 0) { ret = dlm_do_master_request(res, mle, nodenum); -if (ret 0) +if (ret 0) { mlog_errno(ret); +if (dlm_is_host_down(ret)) { +mlog(0, %s:%.*s: node %u maybe dead, set down_nodemap\n, +dlm-name, res-lockname.len, res-lockname.name, nodenum); +set_bit(nodenum, down_nodemap); +} +} if (mle-master != O2NM_MAX_NODES) { /* found a master ! */ if (mle-master = nodenum) @@ -931,9 +947,11 @@ redo_request: wait: /* keep going until the response map includes all nodes */ -ret = dlm_wait_for_lock_mastery(dlm, res, mle, blocked); +ret = dlm_wait_for_lock_mastery(dlm, res, mle, blocked, retry, +find_next_bit(down_nodemap, O2NM_MAX_NODES, 0) O2NM_MAX_NODES); if (ret 0) { -wait_on_recovery = 1; +if (!retry) +wait_on_recovery = 1; mlog(0, %s: res %.*s, Node map changed, redo the master request now, blocked=%d\n, dlm-name, res-lockname.len, res-lockname.name, blocked); @@ -980,7 +998,7 @@
[Ocfs2-devel] ocfs2 discontig localalloc patches (ver 2)
Hi Joel, et al, sorry for the delay in resending discontiguous localalloc patches. Can you please review when you get a chance. I'll email the tools patches once kernel patches are approved. I came across two use cases where this feature will help. 1. On a customer site, after running an application that creates, deletes ~200 hundred files per second, filesystem got fragmented which impacts localalloc bitmap. Filesystem usage was about 52%. The fragmentation significantly reduces localalloc bitmap size from about 256 mb to closer to 1M and eventually gets disabled. 2. A filesystem of 2TB has a cluster size of 4K which divides filesystem into about 16000 subgroups. In the begining, localalloc bitmap looks for about 2048 contiguous clusters(8mb). In the begining customer was seeing about 600Mb/sec throughput. But once the filesystem was used to about 51% usage, there are very few contiguous chunks that are larger than 2048 clusters(but are scattered). So everytime localalloc bitmap needs to be refilled, it has to scan quite a bit of subgroups before finding one big chunk.This behavior brought down throughput from 600mb/sec down to around 15mb/sec. When I reduced the bitmap to half, performance was back up. Changes: 1. Allow localalloc bitmap to use multiple contiguous chunks. 2. Each localalloc extent is 128K to start with. This doesn't limit the bitmap from getting one big chunk to fill the bitmap. I have recreated customer filesystem on an iscsi storage with the o2image captured from the customer system. When ran a test of creating 100Mb files, I see a throughput of ~70M/sec without the discontig-la feature and about 280M/sec with discontig-la. Thanks, --Srini ___ Ocfs2-devel mailing list Ocfs2-devel@oss.oracle.com https://oss.oracle.com/mailman/listinfo/ocfs2-devel
[Ocfs2-devel] [PATCH 5/5] ocfs2 set min default contig localalloc size
When discontiguous localalloc is enabled, localalloc should just try for a minimum smaller contiguous chunk. This will improve performance when the file system is fragmented. This setting doesn't stop localalloc from getting bigger chunk. The default chunk size is set to fill 128Kb. localalloc will try to look for atleast that big of chunk. If it's not available then it reduces the size by half and retries. #define OCFS2_DEFAULT_LOCALALLOC_CHUNK_SIZE (128*1024) /* 128kb */ Signed-off-by: Srinivas Eeda srinivas.e...@oracle.com --- fs/ocfs2/localalloc.c | 28 +--- 1 files changed, 21 insertions(+), 7 deletions(-) diff --git a/fs/ocfs2/localalloc.c b/fs/ocfs2/localalloc.c index 82653f2..569e131 100644 --- a/fs/ocfs2/localalloc.c +++ b/fs/ocfs2/localalloc.c @@ -271,6 +271,19 @@ ocfs2_local_alloc_adjust_bits_wanted(struct ocfs2_dinode *alloc, return wanted; } +#define OCFS2_DEFAULT_LOCALALLOC_CHUNK_SIZE (128*1024) /* 128kb */ + +void ocfs2_set_default_min_alloc_size(struct ocfs2_super *osb) +{ + if (ocfs2_supports_discontig_la(osb)) { + osb-local_alloc_bits = OCFS2_DEFAULT_LOCALALLOC_CHUNK_SIZE / + osb-s_clustersize; + if (!osb-local_alloc_bits) + osb-local_alloc_bits = 1; + } else + osb-local_alloc_bits = osb-local_alloc_default_bits; +} + void ocfs2_la_set_sizes(struct ocfs2_super *osb, int requested_mb) { struct super_block *sb = osb-sb; @@ -294,8 +307,7 @@ void ocfs2_la_set_sizes(struct ocfs2_super *osb, int requested_mb) osb-local_alloc_default_bits = ocfs2_megabytes_to_clusters(sb, requested_mb); } - - osb-local_alloc_bits = osb-local_alloc_default_bits; + ocfs2_set_default_min_alloc_size(osb); } static inline int ocfs2_la_state_enabled(struct ocfs2_super *osb) @@ -1219,7 +1231,7 @@ static int ocfs2_recalc_la_window(struct ocfs2_super *osb, * low space. */ if (osb-local_alloc_state != OCFS2_LA_THROTTLED) - osb-local_alloc_bits = osb-local_alloc_default_bits; + ocfs2_set_default_min_alloc_size(osb); out_unlock: state = osb-local_alloc_state; @@ -1369,10 +1381,12 @@ static int ocfs2_local_alloc_new_window(struct ocfs2_super *osb, if (!alloc-id1.bitmap1.i_total) goto bail; - spin_lock(osb-osb_lock); - if (cluster_count osb-local_alloc_bits) - osb-local_alloc_bits = cluster_count; - spin_unlock(osb-osb_lock); + if (!ocfs2_supports_discontig_la(osb)) { + spin_lock(osb-osb_lock); + if (cluster_count osb-local_alloc_bits) + osb-local_alloc_bits = cluster_count; + spin_unlock(osb-osb_lock); + } osb-la_last_gd = ac-ac_last_group; bitmap = ocfs2_local_alloc_bitmap(osb, la); -- 1.5.4.3 ___ Ocfs2-devel mailing list Ocfs2-devel@oss.oracle.com https://oss.oracle.com/mailman/listinfo/ocfs2-devel
[Ocfs2-devel] [PATCH 1/5] ocfs2: modifiy reservation code to support discontiguous allocations
Currently reservation code assumes that bitmap given to it is one contigous chunk. This patch enhances it to handle a discontigous chunks. It adds new fields m_bitmap_ext_cnt and m_bitmap_ext_arr. m_bitmap_ext_arr tracks the sizes of each contigous chunk and m_bitmap_ext_cnt trackes size of m_bitmap_ext_arr Callers should pass number of discontiguous chunks during ocfs2_resmap_restart and later call ocfs2_resmap_set_extent_size for every chunk. Signed-off-by: Srinivas Eeda srinivas.e...@oracle.com --- fs/ocfs2/localalloc.c |4 +++- fs/ocfs2/reservations.c | 39 ++- fs/ocfs2/reservations.h |7 ++- 3 files changed, 43 insertions(+), 7 deletions(-) diff --git a/fs/ocfs2/localalloc.c b/fs/ocfs2/localalloc.c index a9f78c7..53a162f 100644 --- a/fs/ocfs2/localalloc.c +++ b/fs/ocfs2/localalloc.c @@ -1191,9 +1191,11 @@ retry_enospc: memset(OCFS2_LOCAL_ALLOC(alloc)-la_bitmap, 0, le16_to_cpu(la-la_size)); - ocfs2_resmap_restart(osb-osb_la_resmap, cluster_count, + ocfs2_resmap_restart(osb-osb_la_resmap, 1, cluster_count, OCFS2_LOCAL_ALLOC(alloc)-la_bitmap); + ocfs2_resmap_set_extent_size(osb-osb_la_resmap, 0, cluster_count); + trace_ocfs2_local_alloc_new_window_result( OCFS2_LOCAL_ALLOC(alloc)-la_bm_off, le32_to_cpu(alloc-id1.bitmap1.i_total)); diff --git a/fs/ocfs2/reservations.c b/fs/ocfs2/reservations.c index 41ffd36..e1034ed 100644 --- a/fs/ocfs2/reservations.c +++ b/fs/ocfs2/reservations.c @@ -291,7 +291,7 @@ static void ocfs2_resmap_clear_all_resv(struct ocfs2_reservation_map *resmap) } } -void ocfs2_resmap_restart(struct ocfs2_reservation_map *resmap, +void ocfs2_resmap_restart(struct ocfs2_reservation_map *resmap, u32 ext_cnt, unsigned int clen, char *disk_bitmap) { if (ocfs2_resmap_disabled(resmap)) @@ -300,12 +300,32 @@ void ocfs2_resmap_restart(struct ocfs2_reservation_map *resmap, spin_lock(resv_lock); ocfs2_resmap_clear_all_resv(resmap); + + /* free existing extent array */ + kfree(resmap-m_bitmap_ext_arr); + resmap-m_bitmap_len = clen; resmap-m_disk_bitmap = disk_bitmap; + resmap-m_bitmap_ext_cnt = ext_cnt; + resmap-m_bitmap_ext_arr = kmalloc((sizeof(u32) * ext_cnt), GFP_NOFS); + if (!resmap-m_bitmap_ext_arr) { + mlog_errno(-ENOMEM); + resmap-m_osb-osb_resv_level = 0; /* disable reservations */ + } + spin_unlock(resv_lock); } +void ocfs2_resmap_set_extent_size(struct ocfs2_reservation_map *resmap, + int indx, u32 sz) +{ + if (ocfs2_resmap_disabled(resmap)) + return; + + resmap-m_bitmap_ext_arr[indx] = sz; +} + void ocfs2_resmap_uninit(struct ocfs2_reservation_map *resmap) { /* Does nothing for now. Keep this around for API symmetry */ @@ -419,21 +439,30 @@ static int ocfs2_resmap_find_free_bits(struct ocfs2_reservation_map *resmap, unsigned int *rlen) { void *bitmap = resmap-m_disk_bitmap; - unsigned int best_start, best_len = 0; + unsigned int best_start, len, ext, best_len = 0; int offset, start, found; trace_ocfs2_resmap_find_free_bits_begin(search_start, search_len, wanted, resmap-m_bitmap_len); - found = best_start = best_len = 0; - + found = best_start = best_len = ext = 0; start = search_start; + len = resmap-m_bitmap_ext_arr[ext++]; + while ((offset = ocfs2_find_next_zero_bit(bitmap, resmap-m_bitmap_len, -start)) != -1) { + start)) != -1) { /* Search reached end of the region */ if (offset = (search_start + search_len)) break; + if (offset = len) { + if (ext = resmap-m_bitmap_ext_cnt) + break; + len += resmap-m_bitmap_ext_arr[ext++]; + found = 1; + start = offset + 1; + } + if (offset == start) { /* we found a zero */ found++; diff --git a/fs/ocfs2/reservations.h b/fs/ocfs2/reservations.h index 42c2b80..fed3aae 100644 --- a/fs/ocfs2/reservations.h +++ b/fs/ocfs2/reservations.h @@ -56,6 +56,8 @@ struct ocfs2_reservation_map { u32 m_bitmap_len; /* Number of valid * bits available */ + u32 m_bitmap_ext_cnt; + u32 *m_bitmap_ext_arr; struct list_headm_lru; /* LRU of reservations * structures
[Ocfs2-devel] [PATCH 3/5] ocfs2: new structure to implement discontiguous localalloc bitmap
Currently localalloc bitmap uses single contiguous free chunk of clusters. It's size gets reduced as the contiguous free chunks gets smaller. This ultimately gets disabled if the largest contiguous free chunk is smaller than 1Mb. This patch localalloc bitmap to use space from multiple contiguous free chunk clusters. This patch introduces new ocfs2_local_alloc_rec structure which tracks a single contiguous free chunk. When discontigous localalloc is enabled, ocfs2_local_alloc now holds an array of these records each pointing to a free chunk on the disk. In best case there is one record and as the filesystem gets fragmented you may see multipe records. This feature can be enabled/disabled when the file system is offline. Signed-off-by: Srinivas Eeda srinivas.e...@oracle.com --- fs/ocfs2/localalloc.c | 22 ++-- fs/ocfs2/ocfs2.h |7 + fs/ocfs2/ocfs2_fs.h | 65 - 3 files changed, 79 insertions(+), 15 deletions(-) diff --git a/fs/ocfs2/localalloc.c b/fs/ocfs2/localalloc.c index 45818df..98961f4 100644 --- a/fs/ocfs2/localalloc.c +++ b/fs/ocfs2/localalloc.c @@ -48,6 +48,13 @@ #define OCFS2_LOCAL_ALLOC(dinode) (((dinode)-id2.i_lab)) +#define OCFS2_LOCAL_ALLOC_REC_SZ(la) (le16_to_cpu(la-la_rec_count) *\ +sizeof(struct ocfs2_local_alloc_rec)) +#define OCFS2_LOCAL_ALLOC_BITS_PER_REC (sizeof(struct ocfs2_local_alloc_rec)*8) + +#define OCFS2_LOCAL_ALLOC_SIZE(osb)ocfs2_local_alloc_size(osb-sb,\ + osb-s_feature_incompat) + static u32 ocfs2_local_alloc_count_bits(struct ocfs2_dinode *alloc); static int ocfs2_local_alloc_find_clear_bits(struct ocfs2_super *osb, @@ -74,6 +81,15 @@ static int ocfs2_local_alloc_new_window(struct ocfs2_super *osb, static int ocfs2_local_alloc_slide_window(struct ocfs2_super *osb); +static inline u8 *ocfs2_local_alloc_bitmap(struct ocfs2_super *osb, + struct ocfs2_local_alloc *la) +{ + if (ocfs2_supports_discontig_la(osb)) + return (u8 *)la-la_recs + le16_to_cpu(la-la_bm_start); + else + return la-la_bitmap; +} + /* * ocfs2_la_default_mb() - determine a default size, in megabytes of * the local alloc. @@ -184,7 +200,7 @@ unsigned int ocfs2_la_default_mb(struct ocfs2_super *osb) /* We can't store more bits than we can in a block. */ la_max_mb = ocfs2_clusters_to_megabytes(osb-sb, - ocfs2_local_alloc_size(sb) * 8); + OCFS2_LOCAL_ALLOC_SIZE(osb) * 8); if (la_mb la_max_mb) la_mb = la_max_mb; @@ -198,7 +214,7 @@ void ocfs2_la_set_sizes(struct ocfs2_super *osb, int requested_mb) unsigned int la_max_mb; la_max_mb = ocfs2_clusters_to_megabytes(sb, - ocfs2_local_alloc_size(sb) * 8); + OCFS2_LOCAL_ALLOC_SIZE(osb) * 8); trace_ocfs2_la_set_sizes(requested_mb, la_max_mb, la_default_mb); @@ -329,7 +345,7 @@ int ocfs2_load_local_alloc(struct ocfs2_super *osb) } if ((la-la_size == 0) || - (le16_to_cpu(la-la_size) ocfs2_local_alloc_size(inode-i_sb))) { + (le16_to_cpu(la-la_size) OCFS2_LOCAL_ALLOC_SIZE(osb))) { mlog(ML_ERROR, Local alloc size is invalid (la_size = %u)\n, le16_to_cpu(la-la_size)); status = -EINVAL; diff --git a/fs/ocfs2/ocfs2.h b/fs/ocfs2/ocfs2.h index 836a647..7ab1afe 100644 --- a/fs/ocfs2/ocfs2.h +++ b/fs/ocfs2/ocfs2.h @@ -523,6 +523,13 @@ static inline int ocfs2_supports_discontig_bg(struct ocfs2_super *osb) return 0; } +static inline int ocfs2_supports_discontig_la(struct ocfs2_super *osb) +{ + if (osb-s_feature_incompat OCFS2_FEATURE_INCOMPAT_DISCONTIG_LA) + return 1; + return 0; +} + static inline unsigned int ocfs2_link_max(struct ocfs2_super *osb) { if (ocfs2_supports_indexed_dirs(osb)) diff --git a/fs/ocfs2/ocfs2_fs.h b/fs/ocfs2/ocfs2_fs.h index 938387a..409b2b0 100644 --- a/fs/ocfs2/ocfs2_fs.h +++ b/fs/ocfs2/ocfs2_fs.h @@ -102,7 +102,8 @@ | OCFS2_FEATURE_INCOMPAT_INDEXED_DIRS \ | OCFS2_FEATURE_INCOMPAT_REFCOUNT_TREE \ | OCFS2_FEATURE_INCOMPAT_DISCONTIG_BG \ -| OCFS2_FEATURE_INCOMPAT_CLUSTERINFO) +| OCFS2_FEATURE_INCOMPAT_CLUSTERINFO \ +| OCFS2_FEATURE_INCOMPAT_DISCONTIG_LA) #define OCFS2_FEATURE_RO_COMPAT_SUPP (OCFS2_FEATURE_RO_COMPAT_UNWRITTEN \ | OCFS2_FEATURE_RO_COMPAT_USRQUOTA
[Ocfs2-devel] [PATCH 4/5] ocfs2: implement discontiguous localalloc bitmap
This patch adds code to support discontiguous localalloc bitmap. At any given time there can be a combination of volumes that have discontigous feature enabled or disabled. Signed-off-by: Srinivas Eeda srinivas.e...@oracle.com --- fs/ocfs2/localalloc.c | 478 +++-- 1 files changed, 341 insertions(+), 137 deletions(-) diff --git a/fs/ocfs2/localalloc.c b/fs/ocfs2/localalloc.c index 98961f4..82653f2 100644 --- a/fs/ocfs2/localalloc.c +++ b/fs/ocfs2/localalloc.c @@ -55,14 +55,16 @@ #define OCFS2_LOCAL_ALLOC_SIZE(osb)ocfs2_local_alloc_size(osb-sb,\ osb-s_feature_incompat) -static u32 ocfs2_local_alloc_count_bits(struct ocfs2_dinode *alloc); +static u32 ocfs2_local_alloc_count_bits(struct ocfs2_super *osb, + struct ocfs2_dinode *alloc); static int ocfs2_local_alloc_find_clear_bits(struct ocfs2_super *osb, struct ocfs2_dinode *alloc, u32 *numbits, struct ocfs2_alloc_reservation *resv); -static void ocfs2_clear_local_alloc(struct ocfs2_dinode *alloc); +static void ocfs2_clear_local_alloc(struct ocfs2_super *osb, + struct ocfs2_dinode *alloc); static int ocfs2_sync_local_to_main(struct ocfs2_super *osb, handle_t *handle, @@ -207,6 +209,68 @@ unsigned int ocfs2_la_default_mb(struct ocfs2_super *osb) return la_mb; } +static inline u32 ocfs2_local_bitmap_to_cluster(struct ocfs2_super *osb, + struct ocfs2_local_alloc *la, + u32 bit) +{ + u32 start, prev, offset; + int rec; + + if (!ocfs2_supports_discontig_la(osb)) + return le32_to_cpu(la-la_bm_off) + bit; + + rec = start = prev = 0; + for (rec = 0; rec le16_to_cpu(la-la_rec_count); rec++) { + prev = start; + start += le32_to_cpu(la-la_recs[rec].la_clusters); + if (bit start) + break; + } + offset = le32_to_cpu(la-la_recs[rec].la_start) + (bit - prev); + + return offset; +} + +/* + * This function is called before allocating a new chunk for the localalloc + * bitmap to make sure there is enough space in the bitmap for the new record. + * It adjusts bits wanted and returns 0(if full) or a possitive + * number(bits required) + */ +static unsigned int +ocfs2_local_alloc_adjust_bits_wanted(struct ocfs2_dinode *alloc, +struct ocfs2_alloc_context *ac) +{ + unsigned int wanted, bits_used, total_bits; + struct ocfs2_local_alloc *la; + int bits_left; + + BUG_ON(ac-ac_bits_given ac-ac_bits_wanted); + wanted = ac-ac_bits_wanted - ac-ac_bits_given; + if (wanted == 0) + return 0; + + la = OCFS2_LOCAL_ALLOC(alloc); + total_bits = le16_to_cpu(la-la_dc_size) 3; + + /* space for localalloc bitmap + space used for extent records */ + bits_used = le32_to_cpu(alloc-id1.bitmap1.i_total) + + (le16_to_cpu(la-la_rec_count) * +OCFS2_LOCAL_ALLOC_BITS_PER_REC); + + bits_left = total_bits - bits_used; + if (bits_left = OCFS2_LOCAL_ALLOC_BITS_PER_REC) + return 0; + + /* space fore new record */ + bits_left -= OCFS2_LOCAL_ALLOC_BITS_PER_REC; + if (wanted bits_left) { + wanted = bits_left; + ac-ac_bits_wanted = ac-ac_bits_given + bits_left; + } + return wanted; +} + void ocfs2_la_set_sizes(struct ocfs2_super *osb, int requested_mb) { struct super_block *sb = osb-sb; @@ -344,28 +408,38 @@ int ocfs2_load_local_alloc(struct ocfs2_super *osb) goto bail; } - if ((la-la_size == 0) || - (le16_to_cpu(la-la_size) OCFS2_LOCAL_ALLOC_SIZE(osb))) { + if (!ocfs2_supports_discontig_la(osb) + (!la-la_size || +(le16_to_cpu(la-la_size) OCFS2_LOCAL_ALLOC_SIZE(osb { mlog(ML_ERROR, Local alloc size is invalid (la_size = %u)\n, le16_to_cpu(la-la_size)); status = -EINVAL; goto bail; } + if (ocfs2_supports_discontig_la(osb) + (!la-la_dc_size || +(le16_to_cpu(la-la_dc_size) OCFS2_LOCAL_ALLOC_SIZE(osb { + mlog(ML_ERROR, Local alloc size is invalid +(la_dc_size = %u)\n, le16_to_cpu(la-la_dc_size)); + status = -EINVAL; + goto bail; + } + /* do a little verification. */ - num_used = ocfs2_local_alloc_count_bits(alloc); + num_used = ocfs2_local_alloc_count_bits(osb, alloc); /* hopefully the local alloc has always been recovered
[Ocfs2-devel] [PATCH 2/5] ocfs2: move ocfs2-local-alloc-inode to ocfs2-super
ocfs2_local_alloc_inode is used in multiple functions. It is convenient if we move ocfs2_local_alloc_inode to ocfs2 super. Signed-off-by: Srinivas Eeda srinivas.e...@oracle.com --- fs/ocfs2/localalloc.c | 53 - fs/ocfs2/ocfs2.h |1 + 2 files changed, 14 insertions(+), 40 deletions(-) diff --git a/fs/ocfs2/localalloc.c b/fs/ocfs2/localalloc.c index 53a162f..45818df 100644 --- a/fs/ocfs2/localalloc.c +++ b/fs/ocfs2/localalloc.c @@ -72,8 +72,7 @@ static int ocfs2_local_alloc_new_window(struct ocfs2_super *osb, handle_t *handle, struct ocfs2_alloc_context *ac); -static int ocfs2_local_alloc_slide_window(struct ocfs2_super *osb, - struct inode *local_alloc_inode); +static int ocfs2_local_alloc_slide_window(struct ocfs2_super *osb); /* * ocfs2_la_default_mb() - determine a default size, in megabytes of @@ -352,7 +351,8 @@ int ocfs2_load_local_alloc(struct ocfs2_super *osb) le32_to_cpu(alloc-id1.bitmap1.i_total), OCFS2_LOCAL_ALLOC(alloc)-la_bm_off); - osb-local_alloc_bh = alloc_bh; + osb-local_alloc_bh= alloc_bh; + osb-local_alloc_inode = inode; osb-local_alloc_state = OCFS2_LA_ENABLED; bail: @@ -379,7 +379,6 @@ void ocfs2_shutdown_local_alloc(struct ocfs2_super *osb) { int status; handle_t *handle; - struct inode *local_alloc_inode = NULL; struct buffer_head *bh = NULL; struct buffer_head *main_bm_bh = NULL; struct inode *main_bm_inode = NULL; @@ -392,16 +391,6 @@ void ocfs2_shutdown_local_alloc(struct ocfs2_super *osb) if (osb-local_alloc_state == OCFS2_LA_UNUSED) goto out; - local_alloc_inode = - ocfs2_get_system_file_inode(osb, - LOCAL_ALLOC_SYSTEM_INODE, - osb-slot_num); - if (!local_alloc_inode) { - status = -ENOENT; - mlog_errno(status); - goto out; - } - osb-local_alloc_state = OCFS2_LA_DISABLED; ocfs2_resmap_uninit(osb-osb_la_resmap); @@ -441,7 +430,8 @@ void ocfs2_shutdown_local_alloc(struct ocfs2_super *osb) } memcpy(alloc_copy, alloc, bh-b_size); - status = ocfs2_journal_access_di(handle, INODE_CACHE(local_alloc_inode), + status = ocfs2_journal_access_di(handle, +INODE_CACHE(osb-local_alloc_inode), bh, OCFS2_JOURNAL_ACCESS_WRITE); if (status 0) { mlog_errno(status); @@ -473,9 +463,6 @@ out_mutex: iput(main_bm_inode); out: - if (local_alloc_inode) - iput(local_alloc_inode); - if (alloc_copy) kfree(alloc_copy); } @@ -631,22 +618,11 @@ int ocfs2_reserve_local_alloc_bits(struct ocfs2_super *osb, { int status; struct ocfs2_dinode *alloc; - struct inode *local_alloc_inode; unsigned int free_bits; BUG_ON(!ac); - local_alloc_inode = - ocfs2_get_system_file_inode(osb, - LOCAL_ALLOC_SYSTEM_INODE, - osb-slot_num); - if (!local_alloc_inode) { - status = -ENOENT; - mlog_errno(status); - goto bail; - } - - mutex_lock(local_alloc_inode-i_mutex); + mutex_lock(osb-local_alloc_inode-i_mutex); /* * We must double check state and allocator bits because @@ -680,8 +656,7 @@ int ocfs2_reserve_local_alloc_bits(struct ocfs2_super *osb, le32_to_cpu(alloc-id1.bitmap1.i_used); if (bits_wanted free_bits) { /* uhoh, window change time. */ - status = - ocfs2_local_alloc_slide_window(osb, local_alloc_inode); + status = ocfs2_local_alloc_slide_window(osb); if (status 0) { if (status != -ENOSPC) mlog_errno(status); @@ -704,7 +679,8 @@ int ocfs2_reserve_local_alloc_bits(struct ocfs2_super *osb, goto bail; } - ac-ac_inode = local_alloc_inode; + ac-ac_inode = osb-local_alloc_inode; + igrab(ac-ac_inode); /* We should never use localalloc from another slot */ ac-ac_alloc_slot = osb-slot_num; ac-ac_which = OCFS2_AC_USE_LOCAL; @@ -712,10 +688,8 @@ int ocfs2_reserve_local_alloc_bits(struct ocfs2_super *osb, ac-ac_bh = osb-local_alloc_bh; status = 0; bail: - if (status 0 local_alloc_inode) { - mutex_unlock(local_alloc_inode-i_mutex); - iput(local_alloc_inode); - } + if (status 0) + mutex_unlock(osb
Re: [Ocfs2-devel] RFC: OCFS2 heartbeat improvements
On 8/22/2012 7:17 AM, Jie Liu wrote: Hi All, These days, I am investigating an issue regarding OCFS2 unexpected reboot in some real world use cases. This problem occurred when the network status goes south, when the disk IO load is too high, etc... I suspect it might caused by ocfs2 fencing if it's BIO reading/writing can not be scheduled and processed quickly, or something like this happened in the network IO heartbeat thread. Now am trying to reproduce this problem locally. In the meantime, I'd like to ping you guys with some rough ideas to improve the disk IO heartbeat to see if they are sounds reasonable or not. Firstly, if an OCFS2 node is suffer from heavy disk IO, how about to fix the bio read/write to make this IO request can not be preempted by other requests? e.g, for o2hb_issue_node_write(), currently, it do bio submission with WRITE only, 'submit_bio(WRITE, bio)'. If we change the flag to WRITE_SYNC, or even submit the request combine with REQ_FUA, maybe could get highest priority for disk IO request. This was submitted before by Noboru Iwamatsu and acked by sunil and tao but some how didn't get merged https://oss.oracle.com/pipermail/ocfs2-devel/2011-December/008438.html Secondly, the comments for bio allocation at o2hb_setup_one_bio() indicates that we can pre-allocate bio instead of acquire for each time. But I have not saw any code snippet doing such things in kernel. :( how about creating a private bio set for each o2hb_region, so that we can do allocation out of it? maybe it's faster than do allocation from global bio sets. Also, does it make sense if creating a memory pool on each o2hb_region, so that we can have continuous pages bind to those bios? Any comments are appreciated! Thanks, -Jeff ___ Ocfs2-devel mailing list Ocfs2-devel@oss.oracle.com https://oss.oracle.com/mailman/listinfo/ocfs2-devel ___ Ocfs2-devel mailing list Ocfs2-devel@oss.oracle.com https://oss.oracle.com/mailman/listinfo/ocfs2-devel
Re: [Ocfs2-devel] a bug about deadlock when enable quota on ocfs2
Hi Jan, thanks for helping. Jan Kara wrote: Hello, his comments: @ With those patches in, all other nodes will now queue downgrade of dentry @ locks to ocfs2_wq thread. Then Node 1 gets a lock is in use when it calls @ ocfs2_try_open_lock and so does other nodes and hence orphans lie around. Now @ orphans will keep growing and only gets cleared when all nodes umount the @ volume. This causes a different problems 1)space is not cleared 2) as orphans @ keep growing, orphan thread takes long time to scan all orphans(but still @ fails to clear oprhans because of open lock still around) and hence will @ block new unlinks for that duration because it gets a EX on orphan scan lock. I think the analysis is not completely correct (or I misunderstood it). We defer only putting of inode reference to workqueue (lockres is freed already in ocfs2_drop_dentry_lock()). However it is correct that the queue of inodes to put can get long and the system gets into trouble. Sorry for not being clear. This is an issue when thread running unlink and ocfs2_wq on other node end up running ocfs2_delete_inode at the same time. They both call ocfs2_try_open_lock during query wipe inode and get EAGAIN. So they both defer the actual clean up. This will become a problem if a user deletes tons of files at the same time. Lot of orphans gets queued and it becomes a problem when user continues to delete. My questions are 1.) what kind of potential deadlock in your comment? Dropping inode reference can result in deletion of inode when this was the last reference to an unlinked file. However ocfs2_delete_inode() needs to take locks which rank above locks held when ocfs2_drop_dentry_lock() is called. You can check this by removing my patches, enabling CONFIG_PROVE_LOCKING and see the warning lockdep spits. I am not familiar with which locks get out of order. Tiger can you please check this. .) I have tried removing this patch, ocfs2 became more durable, although it caused another panic but not get deadlock again, could we remove this patch and just to fix the new problem? may the new problem is the potential deadlock you mentioned. I talked about possible solutions to Wengang already. Basically before we start unlinking we could check whether there aren't too many queued puts of inode references and if yes, drop some of them directly from unlink process before we acquire any cluster locks or so. We could do this but if there is a deadlock bug we could still run into it when we try deleting directly right? Honza PS: I CC'ed ocfs2-devel so that this discussion gets archived and other developers can help as well. PS2: I'm going for a longer vacation now so I won't be responding to email for some time. Have a good vacation :) ___ Ocfs2-devel mailing list Ocfs2-devel@oss.oracle.com https://oss.oracle.com/mailman/listinfo/ocfs2-devel
Re: [Ocfs2-devel] [PATCH] Fix waiting status race condition in dlm recovery
comments inline On 5/24/2012 10:53 PM, xiaowei...@oracle.com wrote: From: Xiaowei.Huxiaowei...@oracle.com when the master requested locks ,but one/some of the live nodes died, after it received the request msg and before send out the locks packages, the recovery will fall into endless loop,waiting for the status changed to finalize NodeA NodeB selected as recovery master dlm_remaster_locks - dlm_requeset_all_locks this send request locks msg to B received the msg from A, queue worker dlm_request_all_locks_worker return 0 go on set state to requested wait for the state become done NodeB lost connection due to network before the worker begin, or it die. NodeA still waiting for the change of reco state. It won't end if it not get data done msg And at this time nodeB do not realize this (or it just died), it won't send the msg for ever, nodeA left in the recovery process forever. This patch let the recovery master check if the node still in live node map when it stay in REQUESTED status. Signed-off-by: Xiaowei.Huxiaowei...@oracle.com --- fs/ocfs2/dlm/dlmrecovery.c |9 + 1 files changed, 9 insertions(+), 0 deletions(-) diff --git a/fs/ocfs2/dlm/dlmrecovery.c b/fs/ocfs2/dlm/dlmrecovery.c index 01ebfd0..62659e8 100644 --- a/fs/ocfs2/dlm/dlmrecovery.c +++ b/fs/ocfs2/dlm/dlmrecovery.c @@ -555,6 +555,7 @@ static int dlm_remaster_locks(struct dlm_ctxt *dlm, u8 dead_node) int all_nodes_done; int destroy = 0; int pass = 0; + int dying = 0; do { /* we have become recovery master. there is no escaping @@ -659,6 +660,7 @@ static int dlm_remaster_locks(struct dlm_ctxt *dlm, u8 dead_node) list_for_each_entry(ndata,dlm-reco.node_data, list) { mlog(0, checking recovery state of node %u\n, ndata-node_num); + dying = 0; switch (ndata-state) { case DLM_RECO_NODE_DATA_INIT: case DLM_RECO_NODE_DATA_REQUESTING: @@ -679,6 +681,13 @@ static int dlm_remaster_locks(struct dlm_ctxt *dlm, u8 dead_node) dlm-name, ndata-node_num, ndata-state==DLM_RECO_NODE_DATA_RECEIVING ? receiving : requested); + spin_lock(dlm-spinlock); + dying = !test_bit(ndata-node_num, dlm-live_nodes_map); + spin_unlock(dlm-spinlock); + if (dying) { + ndata-state = DLM_RECO_NODE_DATA_DEAD; + break; + } all_nodes_done = 0; break; case DLM_RECO_NODE_DATA_DONE: fix seems to address the issue, but can you please add a function dlm_is_node_in_livemap similar to dlm_is_node_dead so that it' improves readability. You can then add the following to check if the node is still alive +if (!dlm_is_node_in_livemap(dlm, ndata-node_num)) +ndate-state = DLM_RECO_NODE_DATA_DEAD; +else +all_nodes_done = 0; ___ Ocfs2-devel mailing list Ocfs2-devel@oss.oracle.com http://oss.oracle.com/mailman/listinfo/ocfs2-devel
[Ocfs2-devel] [PATCH 3/3] ocfs2: modify reservation code to support discontigous localalloc
Currently reservation code assumes a bitmap given to it is all one contigous chunk. This patch enhances it to handle a discontigous chunks. It adds new fields m_bitmap_ext_cnt and m_bitmap_ext_arr. m_bitmap_ext_arr tracks the sizes of each contigous free bits and m_bitmap_ext_cnt trackes number of m_bitmap_ext_arr. Signed-off-by: Srinivas Eeda srinivas.e...@oracle.com --- fs/ocfs2/reservations.c | 41 ++--- fs/ocfs2/reservations.h |7 ++- 2 files changed, 40 insertions(+), 8 deletions(-) diff --git a/fs/ocfs2/reservations.c b/fs/ocfs2/reservations.c index 41ffd36..fea93d7 100644 --- a/fs/ocfs2/reservations.c +++ b/fs/ocfs2/reservations.c @@ -291,7 +291,15 @@ static void ocfs2_resmap_clear_all_resv(struct ocfs2_reservation_map *resmap) } } -void ocfs2_resmap_restart(struct ocfs2_reservation_map *resmap, +void ocfs2_resmap_set_ext(struct ocfs2_reservation_map *resmap, int arr, u32 sz) +{ + if (ocfs2_resmap_disabled(resmap)) + return; + + resmap-m_bitmap_ext_arr[arr] = sz; +} + +void ocfs2_resmap_restart(struct ocfs2_reservation_map *resmap, u32 ext_cnt, unsigned int clen, char *disk_bitmap) { if (ocfs2_resmap_disabled(resmap)) @@ -300,9 +308,21 @@ void ocfs2_resmap_restart(struct ocfs2_reservation_map *resmap, spin_lock(resv_lock); ocfs2_resmap_clear_all_resv(resmap); + + /* free existing extent array */ + if (resmap-m_bitmap_ext_arr) + kfree(resmap-m_bitmap_ext_arr); + resmap-m_bitmap_len = clen; resmap-m_disk_bitmap = disk_bitmap; + resmap-m_bitmap_ext_cnt = ext_cnt; + resmap-m_bitmap_ext_arr = kmalloc((sizeof(u32) * ext_cnt), GFP_NOFS); + if (!resmap-m_bitmap_ext_arr) { + mlog_errno(-ENOMEM); + resmap-m_osb-osb_resv_level = 0; + } + spin_unlock(resv_lock); } @@ -419,20 +439,26 @@ static int ocfs2_resmap_find_free_bits(struct ocfs2_reservation_map *resmap, unsigned int *rlen) { void *bitmap = resmap-m_disk_bitmap; - unsigned int best_start, best_len = 0; + unsigned int best_start, len, ext, best_len = 0; int offset, start, found; trace_ocfs2_resmap_find_free_bits_begin(search_start, search_len, wanted, resmap-m_bitmap_len); - found = best_start = best_len = 0; - + found = best_start = best_len = ext = 0; start = search_start; + len = resmap-m_bitmap_ext_arr[ext++]; while ((offset = ocfs2_find_next_zero_bit(bitmap, resmap-m_bitmap_len, -start)) != -1) { + start)) != -1) { /* Search reached end of the region */ if (offset = (search_start + search_len)) - break; + goto out; + + if (offset = len) { + len += resmap-m_bitmap_ext_arr[ext]; + found = 1; + start = offset + 1; + } if (offset == start) { /* we found a zero */ @@ -450,9 +476,10 @@ static int ocfs2_resmap_find_free_bits(struct ocfs2_reservation_map *resmap, } if (found = wanted) - break; + goto out; } +out: if (best_len == 0) return 0; diff --git a/fs/ocfs2/reservations.h b/fs/ocfs2/reservations.h index 42c2b80..bb5e94f 100644 --- a/fs/ocfs2/reservations.h +++ b/fs/ocfs2/reservations.h @@ -56,6 +56,8 @@ struct ocfs2_reservation_map { u32 m_bitmap_len; /* Number of valid * bits available */ + u32 m_bitmap_ext_cnt; + u32 *m_bitmap_ext_arr; struct list_headm_lru; /* LRU of reservations * structures. */ @@ -94,6 +96,9 @@ void ocfs2_resv_discard(struct ocfs2_reservation_map *resmap, int ocfs2_resmap_init(struct ocfs2_super *osb, struct ocfs2_reservation_map *resmap); +void ocfs2_resmap_set_ext(struct ocfs2_reservation_map *resmap, int arr, + u32 sz); + /** * ocfs2_resmap_restart() - restart a reservation bitmap * @resmap: reservations bitmap @@ -107,7 +112,7 @@ int ocfs2_resmap_init(struct ocfs2_super *osb, * reservations. A future version will recalculate existing * reservations based on the new bitmap. */ -void ocfs2_resmap_restart(struct ocfs2_reservation_map *resmap, +void ocfs2_resmap_restart(struct ocfs2_reservation_map *resmap, u32 ext_cnt, unsigned int clen, char *disk_bitmap); /** -- 1.5.4.3
[Ocfs2-devel] [PATCH 1/3] ocfs2: new structure to implment discontiguous local alloc bitmap
Current local alloc handles single contiguous free chunk of clusters. This patch enhances local alloc to handle discontigous free chunks. It adds a new ocfs2_local_alloc_rec structure which tracks single contiguous free chunk. An array of these sit in the bitmap itself and track discontiguous chunks. In best case there is only one record and increases as the filesystem gets fragmented. Number of records at a time are limited depending on the size of the bitmap and the max limit is defined by OCFS2_MAX_LOCAL_ALLOC_RECS. Signed-off-by: Srinivas Eeda srinivas.e...@oracle.com --- fs/ocfs2/localalloc.c | 10 ++ fs/ocfs2/ocfs2.h |8 fs/ocfs2/ocfs2_fs.h | 48 ++-- 3 files changed, 60 insertions(+), 6 deletions(-) diff --git a/fs/ocfs2/localalloc.c b/fs/ocfs2/localalloc.c index 210c352..4190e53 100644 --- a/fs/ocfs2/localalloc.c +++ b/fs/ocfs2/localalloc.c @@ -48,6 +48,16 @@ #define OCFS2_LOCAL_ALLOC(dinode) (((dinode)-id2.i_lab)) +#define OCFS2_LOCAL_ALLOC_REC_SZ(la) (le16_to_cpu(la-la_rec_count) *\ +sizeof(struct ocfs2_local_alloc_rec)) +#define OCFS2_LOCAL_ALLOC_BITMAP(la)((char *)((la-la_recs)) +\ +OCFS2_LOCAL_ALLOC_REC_SZ(la)) +#define OCFS2_LOCAL_ALLOC_BITS_PER_REC (sizeof(struct ocfs2_local_alloc_rec)*8) + +/* Maximum number of local alloc records */ +#define OCFS2_MAX_LOCAL_ALLOC_REC_LIMIT128 + + static u32 ocfs2_local_alloc_count_bits(struct ocfs2_dinode *alloc); static int ocfs2_local_alloc_find_clear_bits(struct ocfs2_super *osb, diff --git a/fs/ocfs2/ocfs2.h b/fs/ocfs2/ocfs2.h index d355e6e..d4c36d2 100644 --- a/fs/ocfs2/ocfs2.h +++ b/fs/ocfs2/ocfs2.h @@ -367,6 +367,7 @@ struct ocfs2_super * by osb_lock */ struct buffer_head *local_alloc_bh; + struct inode *local_alloc_inode; u64 la_last_gd; @@ -522,6 +523,13 @@ static inline int ocfs2_supports_discontig_bg(struct ocfs2_super *osb) return 0; } +static inline int ocfs2_supports_discontig_la(struct ocfs2_super *osb) +{ + if (osb-s_feature_incompat OCFS2_FEATURE_INCOMPAT_DISCONTIG_LA) + return 1; + return 0; +} + static inline unsigned int ocfs2_link_max(struct ocfs2_super *osb) { if (ocfs2_supports_indexed_dirs(osb)) diff --git a/fs/ocfs2/ocfs2_fs.h b/fs/ocfs2/ocfs2_fs.h index 938387a..6a0fe02 100644 --- a/fs/ocfs2/ocfs2_fs.h +++ b/fs/ocfs2/ocfs2_fs.h @@ -102,7 +102,8 @@ | OCFS2_FEATURE_INCOMPAT_INDEXED_DIRS \ | OCFS2_FEATURE_INCOMPAT_REFCOUNT_TREE \ | OCFS2_FEATURE_INCOMPAT_DISCONTIG_BG \ -| OCFS2_FEATURE_INCOMPAT_CLUSTERINFO) +| OCFS2_FEATURE_INCOMPAT_CLUSTERINFO \ +| OCFS2_FEATURE_INCOMPAT_DISCONTIG_LA) #define OCFS2_FEATURE_RO_COMPAT_SUPP (OCFS2_FEATURE_RO_COMPAT_UNWRITTEN \ | OCFS2_FEATURE_RO_COMPAT_USRQUOTA \ | OCFS2_FEATURE_RO_COMPAT_GRPQUOTA) @@ -177,6 +178,9 @@ */ #define OCFS2_FEATURE_INCOMPAT_CLUSTERINFO 0x4000 +/* Discontiguous local alloc */ +#define OCFS2_FEATURE_INCOMPAT_DISCONTIG_LA0x8000 + /* * backup superblock flag is used to indicate that this volume * has backup superblocks. @@ -664,14 +668,19 @@ struct ocfs2_super_block { * Local allocation bitmap for OCFS2 slots * Note that it exists inside an ocfs2_dinode, so all offsets are * relative to the start of ocfs2_dinode.id2. + * Each ocfs2_local_alloc_rec tracks one contigous chunk of clusters. */ +struct ocfs2_local_alloc_rec { + __le32 la_start;/* 1st cluster in this extent */ + __le32 la_clusters; /* Number of contiguous clusters */ +}; + struct ocfs2_local_alloc { /*00*/ __le32 la_bm_off; /* Starting bit offset in main bitmap */ __le16 la_size; /* Size of included bitmap, in bytes */ - __le16 la_reserved1; - __le64 la_reserved2; -/*10*/ __u8 la_bitmap[0]; + __le16 la_rec_count;/* Number of discontiguous records */ + struct ocfs2_local_alloc_rec la_recs[0]; /* Localalloc records */ }; /* @@ -1380,11 +1389,24 @@ static inline u16 ocfs2_local_alloc_size(struct super_block *sb) u16 size; size = sb-s_blocksize - - offsetof(struct ocfs2_dinode, id2.i_lab.la_bitmap); + offsetof(struct ocfs2_dinode, id2.i_lab.la_recs); + size -= sizeof(struct ocfs2_local_alloc_rec); return size; } +/* effectively this is also the bitmap size */ +static inline u32 ocfs2_local_alloc_cluster_count(struct ocfs2_local_alloc *la) +{ + u32 i, clusters; + + clusters = 0; + for (i
[Ocfs2-devel] ocfs2 discontiguous localalloc patches
Hi all, can you please review following 3 patches that implement discontiguous localalloc bitmap support for ocfs2 file system. This feature helps applications that significantly fragment the filesystem. These fixes needs changes to ocfs2 tools as well. I am sending those patches for review separately. A write up on this feature is available at http://oss.oracle.com/osswiki/OCFS2/DesignDocs/DiscontiguousLocalAlloc.html Thanks, --Srini ___ Ocfs2-devel mailing list Ocfs2-devel@oss.oracle.com http://oss.oracle.com/mailman/listinfo/ocfs2-devel
[Ocfs2-devel] [PATCH 2/3] ocfs2: implement discontiguous localalloc bitmap
This patch adds supporting functions and modifies localalloc code to implement discontiguous localalloc bitmap. Signed-off-by: Srinivas Eeda srinivas.e...@oracle.com --- fs/ocfs2/localalloc.c | 523 - 1 files changed, 342 insertions(+), 181 deletions(-) diff --git a/fs/ocfs2/localalloc.c b/fs/ocfs2/localalloc.c index 4190e53..f63381e 100644 --- a/fs/ocfs2/localalloc.c +++ b/fs/ocfs2/localalloc.c @@ -48,6 +48,9 @@ #define OCFS2_LOCAL_ALLOC(dinode) (((dinode)-id2.i_lab)) +/* defines minimum contiguous required */ +#define OCFS2_LOCAL_ALLOC_MIN_BITS 2 + #define OCFS2_LOCAL_ALLOC_REC_SZ(la) (le16_to_cpu(la-la_rec_count) *\ sizeof(struct ocfs2_local_alloc_rec)) #define OCFS2_LOCAL_ALLOC_BITMAP(la)((char *)((la-la_recs)) +\ @@ -58,7 +61,8 @@ #define OCFS2_MAX_LOCAL_ALLOC_REC_LIMIT128 -static u32 ocfs2_local_alloc_count_bits(struct ocfs2_dinode *alloc); +static u32 ocfs2_local_alloc_count_bits(struct ocfs2_super *osb, + struct ocfs2_dinode *alloc); static int ocfs2_local_alloc_find_clear_bits(struct ocfs2_super *osb, struct ocfs2_dinode *alloc, @@ -82,8 +86,7 @@ static int ocfs2_local_alloc_new_window(struct ocfs2_super *osb, handle_t *handle, struct ocfs2_alloc_context *ac); -static int ocfs2_local_alloc_slide_window(struct ocfs2_super *osb, - struct inode *local_alloc_inode); +static int ocfs2_local_alloc_slide_window(struct ocfs2_super *osb); /* * ocfs2_la_default_mb() - determine a default size, in megabytes of @@ -202,6 +205,74 @@ unsigned int ocfs2_la_default_mb(struct ocfs2_super *osb) return la_mb; } +static u32 ocfs2_local_bitmap_to_cluster(struct ocfs2_local_alloc *la, u32 bit) +{ + u32 start, prev, offset; + int rec; + + rec = start = prev = 0; + for (rec = 0; rec le16_to_cpu(la-la_rec_count); rec++) { + prev = start; + start += le32_to_cpu(la-la_recs[rec].la_clusters); + if (bit start) + break; + } + offset = le32_to_cpu(la-la_recs[rec].la_start) + (bit - prev); + + return offset; +} + +/* + * This function is called before allocating a new chunk for the localalloc + * bitmap to make sure there is enough space in the bitmap for the new record + */ +static u32 ocfs2_local_alloc_adjust_bits_wanted(struct ocfs2_local_alloc *la, + struct ocfs2_alloc_context *ac) +{ + u32 required, available, cluster_cnt; + + if (ac-ac_bits_given == ac-ac_bits_wanted) + return 0; + + /* total bits available in bitmap */ + available = le16_to_cpu(la-la_size) 3; + cluster_cnt = ocfs2_local_alloc_cluster_count(la); + + /* +* Wanted shouldn't be greater than bitmap size and given should be +* equal to cluster count +*/ + BUG_ON(ac-ac_bits_given ac-ac_bits_wanted); + BUG_ON(ac-ac_bits_wanted available); + BUG_ON(ac-ac_bits_given != cluster_cnt); + + /* reduce bits taken by each record structure */ + available -= (le16_to_cpu(la-la_rec_count) * + OCFS2_LOCAL_ALLOC_BITS_PER_REC); + + /* reduce space reserved for bitmap for already allocated clusters */ + available -= cluster_cnt; + + /* if available bits are not enough to fit a new record return 0 */ + if (available (OCFS2_LOCAL_ALLOC_BITS_PER_REC + 1)) + return 0; + + /* Adjust space that will be consumed by new record structure */ + available -= OCFS2_LOCAL_ALLOC_BITS_PER_REC; + + required = ac-ac_bits_wanted - ac-ac_bits_given; + + /* +* we can't allocate clusters more than the bits available. Adjust +* bits wanted +*/ + if (required available) { + ac-ac_bits_wanted = ac-ac_bits_given + available; + return available; + } else + return required; +} + void ocfs2_la_set_sizes(struct ocfs2_super *osb, int requested_mb) { struct super_block *sb = osb-sb; @@ -239,12 +310,14 @@ void ocfs2_local_alloc_seen_free_bits(struct ocfs2_super *osb, unsigned int num_clusters) { spin_lock(osb-osb_lock); - if (osb-local_alloc_state == OCFS2_LA_DISABLED || - osb-local_alloc_state == OCFS2_LA_THROTTLED) - if (num_clusters = osb-local_alloc_default_bits) { - cancel_delayed_work(osb-la_enable_wq); + if (osb-local_alloc_state == OCFS2_LA_DISABLED) { + cancel_delayed_work(osb-la_enable_wq); + if (num_clusters = osb-local_alloc_bits) + osb-local_alloc_state
Re: [Ocfs2-devel] ocfs2 discontiguous localalloc patches
Joel Becker wrote: On Mon, May 07, 2012 at 04:21:27PM -0700, Srinivas Eeda wrote: can you please review following 3 patches that implement discontiguous localalloc bitmap support for ocfs2 file system. This feature helps applications that significantly fragment the filesystem. Hi Srini. Have you some performance numbers backing this? That is, I believe that the described filesystem turned off local alloc. Do you have proof that these patches, turning it back on, improved the customer's performance? Joel Hi Joel, thanks a lot for the quick reply. I have some stat_sysdir.sh snapshots at http://oss.oracle.com/~seeda/diag/stat_sysdir/ collected from a system. It has 4 snapshots collected when the file system usage is at 8%, 19%, 21% and 52%. In file stat_sysdir_52_percent_usage_slow_del.out, for the filesystem that has UUID: 3A6F54DF288C4AF2ABD1E00FC49BE7ED you could see that local_alloc: bitmap total is 38 and is 0(disabled) for local_alloc:0001, and local_alloc:0002. for the filesystem that has uuid AC444DB162AE427C899BA89E076DD479, all localalloc appears to be disabled. Sorry I didn't collect /sys/kernel/debug/fs/uuid/fs_state. But, given the file system state, even if localalloc is not disabled localalloc need to be refilled every 40 clusters. Thanks, --Srini ___ Ocfs2-devel mailing list Ocfs2-devel@oss.oracle.com http://oss.oracle.com/mailman/listinfo/ocfs2-devel
Re: [Ocfs2-devel] [PATCH 2/3] ocfs2: implement discontiguous localalloc bitmap
Joel Becker wrote: On Mon, May 07, 2012 at 04:21:29PM -0700, Srinivas Eeda wrote: OH MY DOG NO. NEVER EVER DO THIS. You cannot update an old filesystem on the fly! What about other nodes that are running older versions of the software? They will crash or corrupt data! The entire point of feature bits is to make sure all nodes are speaking the same code. NAK NAK NAK This explains why you trusted la_rec_count earlier. But that is broken. When your patches are done, the code should use la_bm_off and la_bitmap when !DISCONTIG_LA and then use la_rec_count, etc when DISCONTIG_LA. The only way to transition between them is a tunefs.ocfs2 operation that walks the filesystem, flushes the bitmap, and then sets/clears la_rec_count appropriately depending on the direction.. Please please don't hate me :( ... the changes takes care of old formats as well ... I used the reserved space in the structure so that the code changes will be minimal and still compatible with old file system formats. I agree that we need to have some reserved space still available. So as discussed I'll redo the changes accordingly. Please ignore all the patches. Thanks, --Srini ___ Ocfs2-devel mailing list Ocfs2-devel@oss.oracle.com http://oss.oracle.com/mailman/listinfo/ocfs2-devel
[Ocfs2-devel] [PATCH 1/1] ocfs2: use spinlock irqsave for downconvert lock.patch
When ocfs2dc thread holds dc_task_lock spinlock and receives soft IRQ it deadlock itself trying to get same spinlock in ocfs2_wake_downconvert_thread. Below is the stack snippet. The patch disables interrupts when acquiring dc_task_lock spinlock. ocfs2_wake_downconvert_thread ocfs2_rw_unlock ocfs2_dio_end_io dio_complete . bio_endio req_bio_endio scsi_io_completion blk_done_softirq __do_softirq do_softirq irq_exit do_IRQ ocfs2_downconvert_thread [kthread] Signed-off-by: Srinivas Eeda srinivas.e...@oracle.com --- fs/ocfs2/dlmglue.c | 30 ++ 1 files changed, 18 insertions(+), 12 deletions(-) diff --git a/fs/ocfs2/dlmglue.c b/fs/ocfs2/dlmglue.c index 81a4cd2..d8552a5 100644 --- a/fs/ocfs2/dlmglue.c +++ b/fs/ocfs2/dlmglue.c @@ -3932,6 +3932,8 @@ unqueue: static void ocfs2_schedule_blocked_lock(struct ocfs2_super *osb, struct ocfs2_lock_res *lockres) { + unsigned long flags; + assert_spin_locked(lockres-l_lock); if (lockres-l_flags OCFS2_LOCK_FREEING) { @@ -3945,21 +3947,22 @@ static void ocfs2_schedule_blocked_lock(struct ocfs2_super *osb, lockres_or_flags(lockres, OCFS2_LOCK_QUEUED); - spin_lock(osb-dc_task_lock); + spin_lock_irqsave(osb-dc_task_lock, flags); if (list_empty(lockres-l_blocked_list)) { list_add_tail(lockres-l_blocked_list, osb-blocked_lock_list); osb-blocked_lock_count++; } - spin_unlock(osb-dc_task_lock); + spin_unlock_irqrestore(osb-dc_task_lock, flags); } static void ocfs2_downconvert_thread_do_work(struct ocfs2_super *osb) { unsigned long processed; + unsigned long flags; struct ocfs2_lock_res *lockres; - spin_lock(osb-dc_task_lock); + spin_lock_irqsave(osb-dc_task_lock, flags); /* grab this early so we know to try again if a state change and * wake happens part-way through our work */ osb-dc_work_sequence = osb-dc_wake_sequence; @@ -3972,38 +3975,40 @@ static void ocfs2_downconvert_thread_do_work(struct ocfs2_super *osb) struct ocfs2_lock_res, l_blocked_list); list_del_init(lockres-l_blocked_list); osb-blocked_lock_count--; - spin_unlock(osb-dc_task_lock); + spin_unlock_irqrestore(osb-dc_task_lock, flags); BUG_ON(!processed); processed--; ocfs2_process_blocked_lock(osb, lockres); - spin_lock(osb-dc_task_lock); + spin_lock_irqsave(osb-dc_task_lock, flags); } - spin_unlock(osb-dc_task_lock); + spin_unlock_irqrestore(osb-dc_task_lock, flags); } static int ocfs2_downconvert_thread_lists_empty(struct ocfs2_super *osb) { int empty = 0; + unsigned long flags; - spin_lock(osb-dc_task_lock); + spin_lock_irqsave(osb-dc_task_lock, flags); if (list_empty(osb-blocked_lock_list)) empty = 1; - spin_unlock(osb-dc_task_lock); + spin_unlock_irqrestore(osb-dc_task_lock, flags); return empty; } static int ocfs2_downconvert_thread_should_wake(struct ocfs2_super *osb) { int should_wake = 0; + unsigned long flags; - spin_lock(osb-dc_task_lock); + spin_lock_irqsave(osb-dc_task_lock, flags); if (osb-dc_work_sequence != osb-dc_wake_sequence) should_wake = 1; - spin_unlock(osb-dc_task_lock); + spin_unlock_irqrestore(osb-dc_task_lock, flags); return should_wake; } @@ -4033,10 +4038,11 @@ static int ocfs2_downconvert_thread(void *arg) void ocfs2_wake_downconvert_thread(struct ocfs2_super *osb) { - spin_lock(osb-dc_task_lock); + unsigned long flags; + spin_lock_irqsave(osb-dc_task_lock, flags); /* make sure the voting thread gets a swipe at whatever changes * the caller may have made to the voting state */ osb-dc_wake_sequence++; - spin_unlock(osb-dc_task_lock); + spin_unlock_irqrestore(osb-dc_task_lock, flags); wake_up(osb-dc_event); } -- 1.5.4.3 ___ Ocfs2-devel mailing list Ocfs2-devel@oss.oracle.com http://oss.oracle.com/mailman/listinfo/ocfs2-devel
[Ocfs2-devel] [PATCH 1/1] ocfs2: use spinlock irqsave for downconvert lock.patch
When ocfs2dc thread holds dc_task_lock spinlock and receives soft IRQ it deadlock itself trying to get same spinlock in ocfs2_wake_downconvert_thread. Below is the stack snippet. The patch disables interrupts when acquiring dc_task_lock spinlock. ocfs2_wake_downconvert_thread ocfs2_rw_unlock ocfs2_dio_end_io dio_complete . bio_endio req_bio_endio scsi_io_completion blk_done_softirq __do_softirq do_softirq irq_exit do_IRQ ocfs2_downconvert_thread [kthread] Signed-off-by: Srinivas Eeda srinivas.e...@oracle.com --- fs/ocfs2/dlmglue.c | 31 +++ 1 files changed, 19 insertions(+), 12 deletions(-) diff --git a/fs/ocfs2/dlmglue.c b/fs/ocfs2/dlmglue.c index 81a4cd2..67af5db 100644 --- a/fs/ocfs2/dlmglue.c +++ b/fs/ocfs2/dlmglue.c @@ -3932,6 +3932,8 @@ unqueue: static void ocfs2_schedule_blocked_lock(struct ocfs2_super *osb, struct ocfs2_lock_res *lockres) { + unsigned long flags; + assert_spin_locked(lockres-l_lock); if (lockres-l_flags OCFS2_LOCK_FREEING) { @@ -3945,21 +3947,22 @@ static void ocfs2_schedule_blocked_lock(struct ocfs2_super *osb, lockres_or_flags(lockres, OCFS2_LOCK_QUEUED); - spin_lock(osb-dc_task_lock); + spin_lock_irqsave(osb-dc_task_lock, flags); if (list_empty(lockres-l_blocked_list)) { list_add_tail(lockres-l_blocked_list, osb-blocked_lock_list); osb-blocked_lock_count++; } - spin_unlock(osb-dc_task_lock); + spin_unlock_irqrestore(osb-dc_task_lock, flags); } static void ocfs2_downconvert_thread_do_work(struct ocfs2_super *osb) { unsigned long processed; + unsigned long flags; struct ocfs2_lock_res *lockres; - spin_lock(osb-dc_task_lock); + spin_lock_irqsave(osb-dc_task_lock, flags); /* grab this early so we know to try again if a state change and * wake happens part-way through our work */ osb-dc_work_sequence = osb-dc_wake_sequence; @@ -3972,38 +3975,40 @@ static void ocfs2_downconvert_thread_do_work(struct ocfs2_super *osb) struct ocfs2_lock_res, l_blocked_list); list_del_init(lockres-l_blocked_list); osb-blocked_lock_count--; - spin_unlock(osb-dc_task_lock); + spin_unlock_irqrestore(osb-dc_task_lock, flags); BUG_ON(!processed); processed--; ocfs2_process_blocked_lock(osb, lockres); - spin_lock(osb-dc_task_lock); + spin_lock_irqsave(osb-dc_task_lock, flags); } - spin_unlock(osb-dc_task_lock); + spin_unlock_irqrestore(osb-dc_task_lock, flags); } static int ocfs2_downconvert_thread_lists_empty(struct ocfs2_super *osb) { int empty = 0; + unsigned long flags; - spin_lock(osb-dc_task_lock); + spin_lock_irqsave(osb-dc_task_lock, flags); if (list_empty(osb-blocked_lock_list)) empty = 1; - spin_unlock(osb-dc_task_lock); + spin_unlock_irqrestore(osb-dc_task_lock, flags); return empty; } static int ocfs2_downconvert_thread_should_wake(struct ocfs2_super *osb) { int should_wake = 0; + unsigned long flags; - spin_lock(osb-dc_task_lock); + spin_lock_irqsave(osb-dc_task_lock, flags); if (osb-dc_work_sequence != osb-dc_wake_sequence) should_wake = 1; - spin_unlock(osb-dc_task_lock); + spin_unlock_irqrestore(osb-dc_task_lock, flags); return should_wake; } @@ -4033,10 +4038,12 @@ static int ocfs2_downconvert_thread(void *arg) void ocfs2_wake_downconvert_thread(struct ocfs2_super *osb) { - spin_lock(osb-dc_task_lock); + unsigned long flags; + + spin_lock_irqsave(osb-dc_task_lock, flags); /* make sure the voting thread gets a swipe at whatever changes * the caller may have made to the voting state */ osb-dc_wake_sequence++; - spin_unlock(osb-dc_task_lock); + spin_unlock_irqrestore(osb-dc_task_lock, flags); wake_up(osb-dc_event); } -- 1.5.4.3 ___ Ocfs2-devel mailing list Ocfs2-devel@oss.oracle.com http://oss.oracle.com/mailman/listinfo/ocfs2-devel
Re: [Ocfs2-devel] [PATCH 1/1] ocfs2: use spinlock irqsave for downconvert lock.patch
sorry ignore this patch, resent another one after adding the new line. On 1/30/2012 9:47 PM, Srinivas Eeda wrote: When ocfs2dc thread holds dc_task_lock spinlock and receives soft IRQ it deadlock itself trying to get same spinlock in ocfs2_wake_downconvert_thread. Below is the stack snippet. The patch disables interrupts when acquiring dc_task_lock spinlock. ocfs2_wake_downconvert_thread ocfs2_rw_unlock ocfs2_dio_end_io dio_complete . bio_endio req_bio_endio scsi_io_completion blk_done_softirq __do_softirq do_softirq irq_exit do_IRQ ocfs2_downconvert_thread [kthread] Signed-off-by: Srinivas Eedasrinivas.e...@oracle.com --- fs/ocfs2/dlmglue.c | 30 ++ 1 files changed, 18 insertions(+), 12 deletions(-) diff --git a/fs/ocfs2/dlmglue.c b/fs/ocfs2/dlmglue.c index 81a4cd2..d8552a5 100644 --- a/fs/ocfs2/dlmglue.c +++ b/fs/ocfs2/dlmglue.c @@ -3932,6 +3932,8 @@ unqueue: static void ocfs2_schedule_blocked_lock(struct ocfs2_super *osb, struct ocfs2_lock_res *lockres) { + unsigned long flags; + assert_spin_locked(lockres-l_lock); if (lockres-l_flags OCFS2_LOCK_FREEING) { @@ -3945,21 +3947,22 @@ static void ocfs2_schedule_blocked_lock(struct ocfs2_super *osb, lockres_or_flags(lockres, OCFS2_LOCK_QUEUED); - spin_lock(osb-dc_task_lock); + spin_lock_irqsave(osb-dc_task_lock, flags); if (list_empty(lockres-l_blocked_list)) { list_add_tail(lockres-l_blocked_list, osb-blocked_lock_list); osb-blocked_lock_count++; } - spin_unlock(osb-dc_task_lock); + spin_unlock_irqrestore(osb-dc_task_lock, flags); } static void ocfs2_downconvert_thread_do_work(struct ocfs2_super *osb) { unsigned long processed; + unsigned long flags; struct ocfs2_lock_res *lockres; - spin_lock(osb-dc_task_lock); + spin_lock_irqsave(osb-dc_task_lock, flags); /* grab this early so we know to try again if a state change and * wake happens part-way through our work */ osb-dc_work_sequence = osb-dc_wake_sequence; @@ -3972,38 +3975,40 @@ static void ocfs2_downconvert_thread_do_work(struct ocfs2_super *osb) struct ocfs2_lock_res, l_blocked_list); list_del_init(lockres-l_blocked_list); osb-blocked_lock_count--; - spin_unlock(osb-dc_task_lock); + spin_unlock_irqrestore(osb-dc_task_lock, flags); BUG_ON(!processed); processed--; ocfs2_process_blocked_lock(osb, lockres); - spin_lock(osb-dc_task_lock); + spin_lock_irqsave(osb-dc_task_lock, flags); } - spin_unlock(osb-dc_task_lock); + spin_unlock_irqrestore(osb-dc_task_lock, flags); } static int ocfs2_downconvert_thread_lists_empty(struct ocfs2_super *osb) { int empty = 0; + unsigned long flags; - spin_lock(osb-dc_task_lock); + spin_lock_irqsave(osb-dc_task_lock, flags); if (list_empty(osb-blocked_lock_list)) empty = 1; - spin_unlock(osb-dc_task_lock); + spin_unlock_irqrestore(osb-dc_task_lock, flags); return empty; } static int ocfs2_downconvert_thread_should_wake(struct ocfs2_super *osb) { int should_wake = 0; + unsigned long flags; - spin_lock(osb-dc_task_lock); + spin_lock_irqsave(osb-dc_task_lock, flags); if (osb-dc_work_sequence != osb-dc_wake_sequence) should_wake = 1; - spin_unlock(osb-dc_task_lock); + spin_unlock_irqrestore(osb-dc_task_lock, flags); return should_wake; } @@ -4033,10 +4038,11 @@ static int ocfs2_downconvert_thread(void *arg) void ocfs2_wake_downconvert_thread(struct ocfs2_super *osb) { - spin_lock(osb-dc_task_lock); + unsigned long flags; + spin_lock_irqsave(osb-dc_task_lock, flags); /* make sure the voting thread gets a swipe at whatever changes * the caller may have made to the voting state */ osb-dc_wake_sequence++; - spin_unlock(osb-dc_task_lock); + spin_unlock_irqrestore(osb-dc_task_lock, flags); wake_up(osb-dc_event); } ___ Ocfs2-devel mailing list Ocfs2-devel@oss.oracle.com http://oss.oracle.com/mailman/listinfo/ocfs2-devel
[Ocfs2-devel] [PATCH 1/1] o2dlm: fix NULL pointer dereference in o2dlm_blocking_ast_wrapper
A tiny race between BAST and unlock message causes the NULL dereference. A node sends an unlock request to master and receives a response. Before processing the response it receives a BAST from the master. Since both requests are processed by different threads it creates a race. While the BAST is being processed, lock can get freed by unlock code. This patch makes bast to return immediately if lock is found but unlock is pending. The code should handle this race. We also have to fix master node to skip sending BAST after receiving unlock message. Below is the crash stack BUG: unable to handle kernel NULL pointer dereference at 0048 IP: [a015e023] o2dlm_blocking_ast_wrapper+0xd/0x16 [a034e3db] dlm_do_local_bast+0x8e/0x97 [ocfs2_dlm] [a034f366] dlm_proxy_ast_handler+0x838/0x87e [ocfs2_dlm] [a0308abe] o2net_process_message+0x395/0x5b8 [ocfs2_nodemanager] [a030aac8] o2net_rx_until_empty+0x762/0x90d [ocfs2_nodemanager] [81071802] worker_thread+0x14d/0x1ed Signed-off-by: Srinivas Eeda srinivas.e...@oracle.com --- fs/ocfs2/dlm/dlmast.c |3 ++- 1 files changed, 2 insertions(+), 1 deletions(-) diff --git a/fs/ocfs2/dlm/dlmast.c b/fs/ocfs2/dlm/dlmast.c index 3a3ed4b..1281e8a 100644 --- a/fs/ocfs2/dlm/dlmast.c +++ b/fs/ocfs2/dlm/dlmast.c @@ -386,8 +386,9 @@ int dlm_proxy_ast_handler(struct o2net_msg *msg, u32 len, void *data, head = res-granted; list_for_each(iter, head) { + /* if lock is found but unlock is pending ignore the bast */ lock = list_entry (iter, struct dlm_lock, list); - if (lock-ml.cookie == cookie) + if ((lock-ml.cookie == cookie) (!lock-unlock_pending)) goto do_ast; } -- 1.5.4.3 ___ Ocfs2-devel mailing list Ocfs2-devel@oss.oracle.com http://oss.oracle.com/mailman/listinfo/ocfs2-devel
Re: [Ocfs2-devel] [PATCH 1/1] ocfs2: use spinlock irqsave for downconvert lock.patch
Hi Tao, thanks for reviewing. When ocfs2dc thread holds dc_task_lock spinlock and receives soft IRQ for I/O completion it deadlock itself trying to get same spinlock in ocfs2_wake_downconvert_thread could you please describe it in more detail? When ocfs2dc thread is running on a cpu and acquired dc_task_lock spin lock, an interrupt came in for i/o completion. Interrupt handler then tries to acquire same spinlock and ends up in deadlock. Below is the stack that gives more details. [814fd509] default_do_nmi+0x39/0x200 [814fd750] do_nmi+0x80/0xa0 [814fced0] nmi+0x20/0x30 [8103ee83] ? __ticket_spin_lock+0x13/0x20 EOE IRQ [814fc41e] _raw_spin_lock+0xe/0x20 [a04e32e8] ocfs2_wake_downconvert_thread+0x28/0x60 [ocfs2] [a04eb388] ocfs2_rw_unlock+0xe8/0x1a0 [ocfs2] [8110c635] ? mempool_free+0x95/0xa0 [a04d18c1] ocfs2_dio_end_io+0x61/0xb0 [ocfs2] [8119ebdb] dio_complete+0xbb/0xe0 [8119ec6d] dio_bio_end_aio+0x6d/0xc0 [8119a1bd] bio_endio+0x1d/0x40 [81234643] req_bio_endio+0xa3/0xe0 [8123589f] blk_update_request+0xff/0x490 [8119bec4] ? bio_free+0x64/0x70 [a000193a] end_clone_bio+0x5a/0x90 [dm_mod] [8119a1bd] bio_endio+0x1d/0x40 [81234643] req_bio_endio+0xa3/0xe0 [813558a6] ? scsi_done+0x26/0x60 [8123589f] blk_update_request+0xff/0x490 [a01ee77a] ? qla2x00_process_completed_request+0x5a/0xe0 [qla2xxx] [81235c57] blk_update_bidi_request+0x27/0xb0 [81236d8f] blk_end_bidi_request+0x2f/0x80 [81236e30] blk_end_request+0x10/0x20 [8135d600] scsi_end_request+0x40/0xb0 [8135d96f] scsi_io_completion+0x9f/0x5c0 [8103ef19] ? default_spin_lock_flags+0x9/0x10 [81354a59] scsi_finish_command+0xc9/0x130 [8135dff7] scsi_softirq_done+0x147/0x170 [8123ca32] blk_done_softirq+0x82/0xa0 [8106f987] __do_softirq+0xb7/0x210 [814fc41e] ? _raw_spin_lock+0xe/0x20 [8150599c] call_softirq+0x1c/0x30 [81016365] do_softirq+0x65/0xa0 [8106f78d] irq_exit+0xbd/0xe0 [815061f6] do_IRQ+0x66/0xe0 @ [814fc953] common_interrupt+0x13/0x13 EOI [81261625] ? __list_del_entry+0x35/0xd0 [a04e9d53] ocfs2_downconvert_thread+0x133/0x2a0 [ocfs2] [814f9d10] ? __schedule+0x3f0/0x800 [8108b930] ? wake_up_bit+0x40/0x40 [a04e9c20] ? ocfs2_process_blocked_lock+0x270/0x270 [ocfs2] [8108b346] kthread+0x96/0xa0 [815058a4] kernel_thread_helper+0x4/0x10 [8108b2b0] ? kthread_worker_fn+0x1a0/0x1a0 [815058a0] ? gs_change+0x13/0x13 @ ---[ end trace 628bf423ee0bc8a8 ]--- btw, if you are afraid of deadlocking in soft IRQ, I guess we should use spin_lock_bh not spin_lock_irqsave? Yes, looks like we could, but I am not 100% sure if it will be safe? Thanks Tao The patch disables interrupts when acquiring dc_task_lock spinlock Signed-off-by: Srinivas Eedasrinivas.e...@oracle.com --- fs/ocfs2/dlmglue.c | 30 ++ 1 files changed, 18 insertions(+), 12 deletions(-) diff --git a/fs/ocfs2/dlmglue.c b/fs/ocfs2/dlmglue.c index 81a4cd2..d8552a5 100644 --- a/fs/ocfs2/dlmglue.c +++ b/fs/ocfs2/dlmglue.c @@ -3932,6 +3932,8 @@ unqueue: static void ocfs2_schedule_blocked_lock(struct ocfs2_super *osb, struct ocfs2_lock_res *lockres) { +unsigned long flags; + assert_spin_locked(lockres-l_lock); if (lockres-l_flags OCFS2_LOCK_FREEING) { @@ -3945,21 +3947,22 @@ static void ocfs2_schedule_blocked_lock(struct ocfs2_super *osb, lockres_or_flags(lockres, OCFS2_LOCK_QUEUED); -spin_lock(osb-dc_task_lock); +spin_lock_irqsave(osb-dc_task_lock, flags); if (list_empty(lockres-l_blocked_list)) { list_add_tail(lockres-l_blocked_list, osb-blocked_lock_list); osb-blocked_lock_count++; } -spin_unlock(osb-dc_task_lock); +spin_unlock_irqrestore(osb-dc_task_lock, flags); } static void ocfs2_downconvert_thread_do_work(struct ocfs2_super *osb) { unsigned long processed; +unsigned long flags; struct ocfs2_lock_res *lockres; -spin_lock(osb-dc_task_lock); +spin_lock_irqsave(osb-dc_task_lock, flags); /* grab this early so we know to try again if a state change and * wake happens part-way through our work */ osb-dc_work_sequence = osb-dc_wake_sequence; @@ -3972,38 +3975,40 @@ static void ocfs2_downconvert_thread_do_work(struct ocfs2_super *osb) struct ocfs2_lock_res, l_blocked_list); list_del_init(lockres-l_blocked_list); osb-blocked_lock_count--; -spin_unlock(osb-dc_task_lock); +spin_unlock_irqrestore(osb-dc_task_lock, flags); BUG_ON(!processed); processed--;
[Ocfs2-devel] [PATCH 1/1] ocfs2: use spinlock irqsave for downconvert lock.patch
When ocfs2dc thread holds dc_task_lock spinlock and receives soft IRQ for I/O completion it deadlock itself trying to get same spinlock in ocfs2_wake_downconvert_thread The patch disables interrupts when acquiring dc_task_lock spinlock Signed-off-by: Srinivas Eeda srinivas.e...@oracle.com --- fs/ocfs2/dlmglue.c | 30 ++ 1 files changed, 18 insertions(+), 12 deletions(-) diff --git a/fs/ocfs2/dlmglue.c b/fs/ocfs2/dlmglue.c index 81a4cd2..d8552a5 100644 --- a/fs/ocfs2/dlmglue.c +++ b/fs/ocfs2/dlmglue.c @@ -3932,6 +3932,8 @@ unqueue: static void ocfs2_schedule_blocked_lock(struct ocfs2_super *osb, struct ocfs2_lock_res *lockres) { + unsigned long flags; + assert_spin_locked(lockres-l_lock); if (lockres-l_flags OCFS2_LOCK_FREEING) { @@ -3945,21 +3947,22 @@ static void ocfs2_schedule_blocked_lock(struct ocfs2_super *osb, lockres_or_flags(lockres, OCFS2_LOCK_QUEUED); - spin_lock(osb-dc_task_lock); + spin_lock_irqsave(osb-dc_task_lock, flags); if (list_empty(lockres-l_blocked_list)) { list_add_tail(lockres-l_blocked_list, osb-blocked_lock_list); osb-blocked_lock_count++; } - spin_unlock(osb-dc_task_lock); + spin_unlock_irqrestore(osb-dc_task_lock, flags); } static void ocfs2_downconvert_thread_do_work(struct ocfs2_super *osb) { unsigned long processed; + unsigned long flags; struct ocfs2_lock_res *lockres; - spin_lock(osb-dc_task_lock); + spin_lock_irqsave(osb-dc_task_lock, flags); /* grab this early so we know to try again if a state change and * wake happens part-way through our work */ osb-dc_work_sequence = osb-dc_wake_sequence; @@ -3972,38 +3975,40 @@ static void ocfs2_downconvert_thread_do_work(struct ocfs2_super *osb) struct ocfs2_lock_res, l_blocked_list); list_del_init(lockres-l_blocked_list); osb-blocked_lock_count--; - spin_unlock(osb-dc_task_lock); + spin_unlock_irqrestore(osb-dc_task_lock, flags); BUG_ON(!processed); processed--; ocfs2_process_blocked_lock(osb, lockres); - spin_lock(osb-dc_task_lock); + spin_lock_irqsave(osb-dc_task_lock, flags); } - spin_unlock(osb-dc_task_lock); + spin_unlock_irqrestore(osb-dc_task_lock, flags); } static int ocfs2_downconvert_thread_lists_empty(struct ocfs2_super *osb) { int empty = 0; + unsigned long flags; - spin_lock(osb-dc_task_lock); + spin_lock_irqsave(osb-dc_task_lock, flags); if (list_empty(osb-blocked_lock_list)) empty = 1; - spin_unlock(osb-dc_task_lock); + spin_unlock_irqrestore(osb-dc_task_lock, flags); return empty; } static int ocfs2_downconvert_thread_should_wake(struct ocfs2_super *osb) { int should_wake = 0; + unsigned long flags; - spin_lock(osb-dc_task_lock); + spin_lock_irqsave(osb-dc_task_lock, flags); if (osb-dc_work_sequence != osb-dc_wake_sequence) should_wake = 1; - spin_unlock(osb-dc_task_lock); + spin_unlock_irqrestore(osb-dc_task_lock, flags); return should_wake; } @@ -4033,10 +4038,11 @@ static int ocfs2_downconvert_thread(void *arg) void ocfs2_wake_downconvert_thread(struct ocfs2_super *osb) { - spin_lock(osb-dc_task_lock); + unsigned long flags; + spin_lock_irqsave(osb-dc_task_lock, flags); /* make sure the voting thread gets a swipe at whatever changes * the caller may have made to the voting state */ osb-dc_wake_sequence++; - spin_unlock(osb-dc_task_lock); + spin_unlock_irqrestore(osb-dc_task_lock, flags); wake_up(osb-dc_event); } -- 1.5.4.3 ___ Ocfs2-devel mailing list Ocfs2-devel@oss.oracle.com http://oss.oracle.com/mailman/listinfo/ocfs2-devel
Re: [Ocfs2-devel] [PATCH 0/3] ocfs2: fix slow deleting
Below is excerpts from Joel's email for the same question :) Currently, orphan scan just iterate all the slots and call ocfs2_queue_recovery_completion, but I don't think it is proper for a node to query another mounted one since that node will query it by itself. Node 1 has an inode it was using. The dentry went away due to memory pressure. Node 1 closes the inode, but it's on the free list. The node has the open lock. Node 2 unlinks the inode. It grabs the dentry lock to notify others, but node 1 has no dentry and doesn't get the message. It trylocks the open lock, sees that another node has a PR, and does nothing. Later node 2 runs its orphan dir. It igets the inode, trylocks the open lock, sees the PR still, and does nothing. Basically, we have to trigger an orphan iput on node 1. The only way for this to happen is if node 1 runs node 2's orphan dir. This patch exists because that wasn't happening. On 7/7/2011 1:26 PM, Sunil Mushran wrote: On 07/07/2011 01:02 PM, Sunil Mushran wrote: On 07/06/2011 11:19 PM, Srinivas Eeda wrote: On 7/5/2011 11:17 PM, Sunil Mushran wrote: 2. All nodes have to scan all slots. Even live slots. I remember we did for a reason. And that reason should be in the comment in the patch written by Srini. When a node unlinks a file it inserts an entry into it's own orphan slot. If another node is the last one to close the file and dentry got flushed then it will not do the cleanup as it doesn't know the file was orphaned. The file will remain in the orphan slot till the node umounts and the same slot is reused again. To overcome this problem a node has to rescan all slots(including live slots) and try to do the cleanup. The qs is not why are we scanning all live slots on all nodes. As in, why not just recover the local slot. There was a reason for that. Yes, we have to recover unused slots for the reason listed previously. bleh let me rephrase. The qs is why are we scanning all live slots on all nodes. Wengangs patch limits the scanning to the local (live) slot only. And I remember we had a reason for it. ___ Ocfs2-devel mailing list Ocfs2-devel@oss.oracle.com http://oss.oracle.com/mailman/listinfo/ocfs2-devel
Re: [Ocfs2-devel] [PATCH 0/3] ocfs2: fix slow deleting
On 7/5/2011 11:17 PM, Sunil Mushran wrote: 2. All nodes have to scan all slots. Even live slots. I remember we did for a reason. And that reason should be in the comment in the patch written by Srini. When a node unlinks a file it inserts an entry into it's own orphan slot. If another node is the last one to close the file and dentry got flushed then it will not do the cleanup as it doesn't know the file was orphaned. The file will remain in the orphan slot till the node umounts and the same slot is reused again. To overcome this problem a node has to rescan all slots(including live slots) and try to do the cleanup. ___ Ocfs2-devel mailing list Ocfs2-devel@oss.oracle.com http://oss.oracle.com/mailman/listinfo/ocfs2-devel
Re: [Ocfs2-devel] [PATCH] remove lockres from purge list when we are getting it for creating lock
+ spin_unlock(tmpres-spinlock); + spin_unlock(dlm-spinlock); lockres could still get added to purgelist at this point and we could still have the same problem? I think, here we need some mechanism that marks the lockres is in use that would protect it from adding to the purgelist? Seems there shouldn't be the possibility that the lockres was moved to purge list again. I think you are right about that once removed from the purge list it will not be put back again on the purgelist in this case :) I still think that there is a hole here. Is it possible that the spinlocks are released here and right then dlm_thread was walking through dirty list and then it could put this on purge list for the first time? If it's already on purgelist, then we seem to be safe by removing it. But what is preventing the lockres getting onto to purge list right at this point. locks are added to lockres a little later but till then it is not safe?(I think). We are in the case of a create lock, there should not be any outstanding convert/unlock request in progress concurrently. If there is one, the there must be lock(s) attached to the lockres. But if there are locks attached, the lockres can't be purged. Also for concurrent create lock, there is no problem either because there must be a lock attached to the lockres. I think your patch works too. But changing DLM_LOCK_RES_IN_USE to DLM_LOCK_RES_CREATING_LOCK can help understand better. I am ok with the name :). The thing we should do here is once we found a lockres and moving ahead with create lock request, we should be sure it's protected before releasing the spinlocks. I also thought about inflight_locks, but so far I failed to find correct places to drop it. thanks, wengang. if (res) dlm_lockres_put(res); res = tmpres; ___ Ocfs2-devel mailing list Ocfs2-devel@oss.oracle.com http://oss.oracle.com/mailman/listinfo/ocfs2-devel
Re: [Ocfs2-devel] [PATCH] remove lockres from purge list when we are getting it for creating lock
I think I have seen this problem in ocfs2-1.2 and it was addressed by using a new state DLM_LOCK_RES_IN_USE. But we didn't merge into mainline as sunil suggested we need to look for a different approach http://oss.oracle.com/pipermail/ocfs2-devel/2010-June/006669.html http://www.mail-archive.com/ocfs2-devel@oss.oracle.com/msg05733.html one comment inline On 6/8/2011 3:04 AM, Wengang Wang wrote: When we are to purge a lockres, we check if it's unused. the check includes 1. no locks attached to the lockres. 2. not dirty. 3. not in recovery. 4. no interested nodes in refmap. If the the above four are satisfied, we are going to purge it(remove it from hash table). While, when a create lock is in progress especially when lockres is owned remotely(spinlocks are dropped when networking), the lockres can satisfy above four condition. If it's put to purge list, it can be purged out from hash table when we are still accessing on it(sending request to owner for example). That's not what we want. I have met such a problem (orabug 12405575). The lockres is in the purge list already(there is a delay for real purge work) and the create lock request comes. When we are sending network message to the owner in dlmlock_remote(), the owner crashed. So we get DLM_RECOVERING as return value and retries dlmlock_remote(). And before the owner crash, we have purged the lockres. So the lockres become stale(on lockres-onwer). Thus the code calls dlmlock_remote() infinitely. fix: we remove the lockres from purge list if it's there in dlm_get_lock_resource() which is called for only createlock case. So that the lockres can't be purged when we are in progress of createlock. Signed-off-by: Wengang Wangwen.gang.w...@oracle.com --- fs/ocfs2/dlm/dlmmaster.c | 41 + 1 files changed, 29 insertions(+), 12 deletions(-) diff --git a/fs/ocfs2/dlm/dlmmaster.c b/fs/ocfs2/dlm/dlmmaster.c index 11eefb8..511d43c 100644 --- a/fs/ocfs2/dlm/dlmmaster.c +++ b/fs/ocfs2/dlm/dlmmaster.c @@ -709,28 +709,27 @@ lookup: spin_lock(dlm-spinlock); tmpres = __dlm_lookup_lockres_full(dlm, lockid, namelen, hash); if (tmpres) { - int dropping_ref = 0; - - spin_unlock(dlm-spinlock); - spin_lock(tmpres-spinlock); /* We wait for the other thread that is mastering the resource */ if (tmpres-owner == DLM_LOCK_RES_OWNER_UNKNOWN) { + spin_unlock(dlm-spinlock); __dlm_wait_on_lockres(tmpres); BUG_ON(tmpres-owner == DLM_LOCK_RES_OWNER_UNKNOWN); + spin_unlock(tmpres-spinlock); + dlm_lockres_put(tmpres); + tmpres = NULL; + goto lookup; } if (tmpres-owner == dlm-node_num) { BUG_ON(tmpres-state DLM_LOCK_RES_DROPPING_REF); dlm_lockres_grab_inflight_ref(dlm, tmpres); - } else if (tmpres-state DLM_LOCK_RES_DROPPING_REF) - dropping_ref = 1; - spin_unlock(tmpres-spinlock); - - /* wait until done messaging the master, drop our ref to allow - * the lockres to be purged, start over. */ - if (dropping_ref) { - spin_lock(tmpres-spinlock); + } else if (tmpres-state DLM_LOCK_RES_DROPPING_REF) { + /* + * wait until done messaging the master, drop our ref to + * allow the lockres to be purged, start over. + */ + spin_unlock(dlm-spinlock); __dlm_wait_on_lockres_flags(tmpres, DLM_LOCK_RES_DROPPING_REF); spin_unlock(tmpres-spinlock); dlm_lockres_put(tmpres); @@ -739,6 +738,24 @@ lookup: } mlog(0, found in hash!\n); + /* + * we are going to do a create-lock next. so remove the lockres + * from purge list to avoid the case that we will access on the + * lockres which is already purged out from hash table in + * dlm_run_purge_list() path. + * otherwise, we could run into a problem: + * the owner dead(recovery didn't take care of this lockres + * since it's not in hashtable), and the code keeps sending + * request to the dead node and getting DLM_RECOVERING and + * then retrying infinitely. + */ + if (!list_empty(tmpres-purge)) { + list_del_init(tmpres-purge); + dlm_lockres_put(tmpres); + } + + spin_unlock(tmpres-spinlock); + spin_unlock(dlm-spinlock); lockres could still get added to purgelist at this
[Ocfs2-devel] [PATCH 2/2] ocfs2/cluster: Add per-region debugfs file to show the elapsed time
From: Sunil Mushran sunil.mush...@oracle.com Mainline fa16655a622e7c0fda76ca5155db6efc86968c65 A per-region debugfs file, elapsed_time_in_ms, shows the time since the heartbeat timer was last armed. Signed-off-by: Sunil Mushran sunil.mush...@oracle.com --- fs/ocfs2/cluster/heartbeat.c | 46 ++ 1 files changed, 46 insertions(+), 0 deletions(-) diff --git a/fs/ocfs2/cluster/heartbeat.c b/fs/ocfs2/cluster/heartbeat.c index 8580e66..0ae0dea 100644 --- a/fs/ocfs2/cluster/heartbeat.c +++ b/fs/ocfs2/cluster/heartbeat.c @@ -63,6 +63,7 @@ static DECLARE_WAIT_QUEUE_HEAD(o2hb_steady_queue); #define O2HB_DEBUG_DIR o2hb #define O2HB_DEBUG_LIVENODES livenodes +#define O2HB_DEBUG_REGION_ELAPSED_TIME elapsed_time_in_ms static struct dentry *o2hb_debug_dir; static struct dentry *o2hb_debug_livenodes; @@ -135,6 +136,7 @@ struct o2hb_region { struct o2hb_disk_slot *hr_slots; struct dentry *hr_debug_dir; + struct dentry *hr_debug_elapsed_time; /* let the person setting up hb wait for it to return until it * has reached a 'steady' state. This will be fixed when we have @@ -950,6 +952,29 @@ bail: return -ENOMEM; } +static int o2hb_region_debug_open(struct inode *inode, struct file *file) +{ + struct o2hb_region *reg = inode-i_private; + char *buf = NULL; + int out = 0; + + buf = kmalloc(PAGE_SIZE, GFP_KERNEL); + if (!buf) + goto bail; + + out += snprintf(buf + out, PAGE_SIZE - out, %u\n, + jiffies_to_msecs(jiffies - +reg-hr_last_timeout_start)); + + i_size_write(inode, out); + + file-private_data = buf; + + return 0; +bail: + return -ENOMEM; +} + static int o2hb_debug_release(struct inode *inode, struct file *file) { kfree(file-private_data); @@ -985,6 +1010,13 @@ static struct file_operations o2hb_debug_fops = { .llseek = generic_file_llseek, }; +static struct file_operations o2hb_region_debug_fops = { + .open = o2hb_region_debug_open, + .release = o2hb_debug_release, + .read = o2hb_debug_read, + .llseek = generic_file_llseek, +}; + void o2hb_exit(void) { if (o2hb_debug_livenodes) @@ -1087,6 +1119,7 @@ static void o2hb_region_release(struct config_item *item) if (reg-hr_slots) kfree(reg-hr_slots); + debugfs_remove(reg-hr_debug_elapsed_time); debugfs_remove(reg-hr_debug_dir); spin_lock(o2hb_live_lock); @@ -1617,6 +1650,17 @@ static struct config_item *o2hb_heartbeat_group_make_item(struct config_group *g goto out; } + reg-hr_debug_elapsed_time = + debugfs_create_file(O2HB_DEBUG_REGION_ELAPSED_TIME, + S_IFREG|S_IRUSR, + reg-hr_debug_dir, + reg, + o2hb_region_debug_fops); + if (!reg-hr_debug_elapsed_time) { + mlog_errno(-ENOMEM); + goto out; + } + spin_lock(o2hb_live_lock); list_add_tail(reg-hr_all_item, o2hb_all_regions); spin_unlock(o2hb_live_lock); @@ -1624,6 +1668,8 @@ static struct config_item *o2hb_heartbeat_group_make_item(struct config_group *g return reg-hr_item; out: + if (reg reg-hr_debug_dir) + debugfs_remove(reg-hr_debug_dir); kfree(reg); return NULL; } -- 1.5.6.5 ___ Ocfs2-devel mailing list Ocfs2-devel@oss.oracle.com http://oss.oracle.com/mailman/listinfo/ocfs2-devel
[Ocfs2-devel] backports to 1.4 from mainline
The following two patches are backports from mainline to 1.4. These patches create debugfs entry for heartbeat regions and to show elapsed time. ___ Ocfs2-devel mailing list Ocfs2-devel@oss.oracle.com http://oss.oracle.com/mailman/listinfo/ocfs2-devel
[Ocfs2-devel] [PATCH 1/2] ocfs2/cluster: Create debugfs dir for heartbeat regions
From: Sunil Mushran sunil.mush...@oracle.com Mainline 0841ed580fe8a3e51ba9dbb133dafc787cce428f Signed-off-by: Sunil Mushran sunil.mush...@oracle.com --- fs/ocfs2/cluster/heartbeat.c | 27 +++ 1 files changed, 19 insertions(+), 8 deletions(-) diff --git a/fs/ocfs2/cluster/heartbeat.c b/fs/ocfs2/cluster/heartbeat.c index 55e8718..8580e66 100644 --- a/fs/ocfs2/cluster/heartbeat.c +++ b/fs/ocfs2/cluster/heartbeat.c @@ -134,6 +134,8 @@ struct o2hb_region { struct block_device *hr_bdev; struct o2hb_disk_slot *hr_slots; + struct dentry *hr_debug_dir; + /* let the person setting up hb wait for it to return until it * has reached a 'steady' state. This will be fixed when we have * a more complete api that doesn't lead to this sort of fragility. */ @@ -1085,6 +1087,8 @@ static void o2hb_region_release(struct config_item *item) if (reg-hr_slots) kfree(reg-hr_slots); + debugfs_remove(reg-hr_debug_dir); + spin_lock(o2hb_live_lock); list_del(reg-hr_all_item); spin_unlock(o2hb_live_lock); @@ -1597,24 +1601,31 @@ static struct config_item *o2hb_heartbeat_group_make_item(struct config_group *g const char *name) { struct o2hb_region *reg = NULL; - struct config_item *ret = NULL; reg = kzalloc(sizeof(struct o2hb_region), GFP_KERNEL); - if (reg == NULL) - goto out; /* ENOMEM */ + if (reg == NULL) { + mlog_errno(-ENOMEM); + goto out; + } config_item_init_type_name(reg-hr_item, name, o2hb_region_type); - ret = reg-hr_item; + reg-hr_debug_dir = + debugfs_create_dir(config_item_name(reg-hr_item), o2hb_debug_dir); + if (!reg-hr_debug_dir) { + mlog_errno(-ENOMEM); + goto out; + } spin_lock(o2hb_live_lock); list_add_tail(reg-hr_all_item, o2hb_all_regions); spin_unlock(o2hb_live_lock); -out: - if (ret == NULL) - kfree(reg); - return ret; + return reg-hr_item; + +out: + kfree(reg); + return NULL; } static void o2hb_heartbeat_group_drop_item(struct config_group *group, -- 1.5.6.5 ___ Ocfs2-devel mailing list Ocfs2-devel@oss.oracle.com http://oss.oracle.com/mailman/listinfo/ocfs2-devel
[Ocfs2-devel] [PATCH] ocfs2: validate bg_free_bits_count after update
This patch adds a safe check to ensure bg_free_bits_count doesn't exceed bg_bits in a group descriptor. This is to avoid on disk corruption that was seen recently. debugfs: group 52803072 Group Chain: 179 Parent Inode: 11 Generation: 2959379682 CRC32: ECC: ## Block#TotalUsed Free Contig Size 052803072 322564294965350 34202182074032 .. Signed-off-by: Srinivas Eeda srinivas.e...@oracle.com --- fs/ocfs2/suballoc.c | 16 1 files changed, 16 insertions(+), 0 deletions(-) diff --git a/fs/ocfs2/suballoc.c b/fs/ocfs2/suballoc.c index 849c2f0..5fed60d 100644 --- a/fs/ocfs2/suballoc.c +++ b/fs/ocfs2/suballoc.c @@ -1380,6 +1380,14 @@ static inline int ocfs2_block_group_set_bits(handle_t *handle, } le16_add_cpu(bg-bg_free_bits_count, -num_bits); + if (le16_to_cpu(bg-bg_free_bits_count) le16_to_cpu(bg-bg_bits)) { + ocfs2_error(alloc_inode-i_sb, Group descriptor # %llu has bit +count %u but claims %u are freed. num_bits %d, + (unsigned long long)le64_to_cpu(bg-bg_blkno), + le16_to_cpu(bg-bg_bits), + le16_to_cpu(bg-bg_free_bits_count), num_bits); + return -EROFS; + } while(num_bits--) ocfs2_set_bit(bit_off++, bitmap); @@ -2419,6 +2427,14 @@ static int ocfs2_block_group_clear_bits(handle_t *handle, (unsigned long *) undo_bg-bg_bitmap); } le16_add_cpu(bg-bg_free_bits_count, num_bits); + if (le16_to_cpu(bg-bg_free_bits_count) le16_to_cpu(bg-bg_bits)) { + ocfs2_error(alloc_inode-i_sb, Group descriptor # %llu has bit +count %u but claims %u are freed. num_bits %d, + (unsigned long long)le64_to_cpu(bg-bg_blkno), + le16_to_cpu(bg-bg_bits), + le16_to_cpu(bg-bg_free_bits_count), num_bits); + return -EROFS; + } if (undo_fn) jbd_unlock_bh_state(group_bh); -- 1.5.6.5 ___ Ocfs2-devel mailing list Ocfs2-devel@oss.oracle.com http://oss.oracle.com/mailman/listinfo/ocfs2-devel
[Ocfs2-devel] [PATCH 1/1] ocfs2: validate bg_free_bits_count after update
This patch adds a safe check to ensure bg_free_bits_count doesn't exceed bg_bits in a group descriptor. This is to avoid on disk corruption that was seen recently. debugfs: group 52803072 Group Chain: 179 Parent Inode: 11 Generation: 2959379682 CRC32: ECC: ## Block#TotalUsed Free Contig Size 052803072 322564294965350 34202182074032 .. Signed-off-by: Srinivas Eeda srinivas.e...@oracle.com --- fs/ocfs2/suballoc.c | 16 1 files changed, 16 insertions(+), 0 deletions(-) diff --git a/fs/ocfs2/suballoc.c b/fs/ocfs2/suballoc.c index 849c2f0..d606a79 100644 --- a/fs/ocfs2/suballoc.c +++ b/fs/ocfs2/suballoc.c @@ -1380,6 +1380,14 @@ static inline int ocfs2_block_group_set_bits(handle_t *handle, } le16_add_cpu(bg-bg_free_bits_count, -num_bits); + if (le16_to_cpu(bg-bg_free_bits_count) = le16_to_cpu(bg-bg_bits)) { + ocfs2_error(alloc_inode-i_sb, Group descriptor # %llu has bit +count %u but claims %u are freed. num_bits %d, + (unsigned long long)le64_to_cpu(bg-bg_blkno), + le16_to_cpu(bg-bg_bits), + le16_to_cpu(bg-bg_free_bits_count), num_bits); + return -EIO; + } while(num_bits--) ocfs2_set_bit(bit_off++, bitmap); @@ -2419,6 +2427,14 @@ static int ocfs2_block_group_clear_bits(handle_t *handle, (unsigned long *) undo_bg-bg_bitmap); } le16_add_cpu(bg-bg_free_bits_count, num_bits); + if (le16_to_cpu(bg-bg_free_bits_count) = le16_to_cpu(bg-bg_bits)) { + ocfs2_error(alloc_inode-i_sb, Group descriptor # %llu has bit +count %u but claims %u are freed. num_bits %d, + (unsigned long long)le64_to_cpu(bg-bg_blkno), + le16_to_cpu(bg-bg_bits), + le16_to_cpu(bg-bg_free_bits_count), num_bits); + return -EIO; + } if (undo_fn) jbd_unlock_bh_state(group_bh); -- 1.5.6.5 ___ Ocfs2-devel mailing list Ocfs2-devel@oss.oracle.com http://oss.oracle.com/mailman/listinfo/ocfs2-devel
[Ocfs2-devel] [PATCH] o2dlm: force free mles during dlm exit
While umounting, a block mle doesn't get freed if dlm is shutdown after master request is received but before assert master. This results in unclean shutdown of dlm domain. This patch frees all mles that lie around after other nodes were notified about exiting the dlm and marking dlm state as leaving. Only block mles are expected to be around, so we log ERROR for other mles but still free them. Signed-off-by: Srinivas Eeda srinivas.e...@oracle.com --- fs/ocfs2/dlm/dlmcommon.h |1 + fs/ocfs2/dlm/dlmdomain.c |1 + fs/ocfs2/dlm/dlmmaster.c | 34 ++ 3 files changed, 36 insertions(+), 0 deletions(-) diff --git a/fs/ocfs2/dlm/dlmcommon.h b/fs/ocfs2/dlm/dlmcommon.h index 4b6ae2c..7652989 100644 --- a/fs/ocfs2/dlm/dlmcommon.h +++ b/fs/ocfs2/dlm/dlmcommon.h @@ -1030,6 +1030,7 @@ int dlm_drop_lockres_ref(struct dlm_ctxt *dlm, struct dlm_lock_resource *res); void dlm_clean_master_list(struct dlm_ctxt *dlm, u8 dead_node); +void dlm_force_free_mles(struct dlm_ctxt *dlm); int dlm_lock_basts_flushed(struct dlm_ctxt *dlm, struct dlm_lock *lock); int __dlm_lockres_has_locks(struct dlm_lock_resource *res); int __dlm_lockres_unused(struct dlm_lock_resource *res); diff --git a/fs/ocfs2/dlm/dlmdomain.c b/fs/ocfs2/dlm/dlmdomain.c index 153abb5..11a5c87 100644 --- a/fs/ocfs2/dlm/dlmdomain.c +++ b/fs/ocfs2/dlm/dlmdomain.c @@ -693,6 +693,7 @@ void dlm_unregister_domain(struct dlm_ctxt *dlm) dlm_mark_domain_leaving(dlm); dlm_leave_domain(dlm); + dlm_force_free_mles(dlm); dlm_complete_dlm_shutdown(dlm); } dlm_put(dlm); diff --git a/fs/ocfs2/dlm/dlmmaster.c b/fs/ocfs2/dlm/dlmmaster.c index ffb4c68..156f420 100644 --- a/fs/ocfs2/dlm/dlmmaster.c +++ b/fs/ocfs2/dlm/dlmmaster.c @@ -3433,3 +3433,37 @@ void dlm_lockres_release_ast(struct dlm_ctxt *dlm, wake_up(res-wq); wake_up(dlm-migration_wq); } + +void dlm_force_free_mles(struct dlm_ctxt *dlm) +{ + int i; + struct hlist_head *bucket; + struct dlm_master_list_entry *mle; + struct hlist_node *tmp, *list; + + /* We notified all other nodes that we are exiting the domain and +* marked the dlm state to DLM_CTXT_LEAVING. If any mles are still +* around we force free them and wake any processes that are waiting +* on the mles */ + spin_lock(dlm-spinlock); + spin_lock(dlm-master_lock); + for (i = 0; i DLM_HASH_BUCKETS; i++) { + bucket = dlm_master_hash(dlm, i); + hlist_for_each_safe(list, tmp, bucket) { + mle = hlist_entry(list, struct dlm_master_list_entry, + master_hash_node); + if (mle-type != DLM_MLE_BLOCK) { + mlog(ML_ERROR, bad mle: %p\n, mle); + dlm_print_one_mle(mle); + } + atomic_set(mle-woken, 1); + wake_up(mle-wq); + + __dlm_unlink_mle(dlm, mle); + __dlm_mle_detach_hb_events(dlm, mle); + __dlm_put_mle(mle); + } + } + spin_unlock(dlm-master_lock); + spin_unlock(dlm-spinlock); +} -- 1.5.6.5 ___ Ocfs2-devel mailing list Ocfs2-devel@oss.oracle.com http://oss.oracle.com/mailman/listinfo/ocfs2-devel
[Ocfs2-devel] [PATCH 1/1] o2dlm: force free mles during dlm exit
While umounting, a block mle doesn't get freed if dlm is shutdown after master request is received but before assert master. This results in unclean shutdown of dlm domain. This patch frees all mles that lie around after other nodes were notified about exiting the dlm and marking dlm state as leaving. Only block mles are expected to be around, so we log ERROR for other mles but still free them. Signed-off-by: Srinivas Eeda srinivas.e...@oracle.com --- fs/ocfs2/dlm/dlmcommon.h |1 + fs/ocfs2/dlm/dlmdomain.c |1 + fs/ocfs2/dlm/dlmmaster.c | 40 3 files changed, 42 insertions(+), 0 deletions(-) diff --git a/fs/ocfs2/dlm/dlmcommon.h b/fs/ocfs2/dlm/dlmcommon.h index 4b6ae2c..7652989 100644 --- a/fs/ocfs2/dlm/dlmcommon.h +++ b/fs/ocfs2/dlm/dlmcommon.h @@ -1030,6 +1030,7 @@ int dlm_drop_lockres_ref(struct dlm_ctxt *dlm, struct dlm_lock_resource *res); void dlm_clean_master_list(struct dlm_ctxt *dlm, u8 dead_node); +void dlm_force_free_mles(struct dlm_ctxt *dlm); int dlm_lock_basts_flushed(struct dlm_ctxt *dlm, struct dlm_lock *lock); int __dlm_lockres_has_locks(struct dlm_lock_resource *res); int __dlm_lockres_unused(struct dlm_lock_resource *res); diff --git a/fs/ocfs2/dlm/dlmdomain.c b/fs/ocfs2/dlm/dlmdomain.c index 153abb5..11a5c87 100644 --- a/fs/ocfs2/dlm/dlmdomain.c +++ b/fs/ocfs2/dlm/dlmdomain.c @@ -693,6 +693,7 @@ void dlm_unregister_domain(struct dlm_ctxt *dlm) dlm_mark_domain_leaving(dlm); dlm_leave_domain(dlm); + dlm_force_free_mles(dlm); dlm_complete_dlm_shutdown(dlm); } dlm_put(dlm); diff --git a/fs/ocfs2/dlm/dlmmaster.c b/fs/ocfs2/dlm/dlmmaster.c index ffb4c68..f564b0e 100644 --- a/fs/ocfs2/dlm/dlmmaster.c +++ b/fs/ocfs2/dlm/dlmmaster.c @@ -3433,3 +3433,43 @@ void dlm_lockres_release_ast(struct dlm_ctxt *dlm, wake_up(res-wq); wake_up(dlm-migration_wq); } + +void dlm_force_free_mles(struct dlm_ctxt *dlm) +{ + int i; + struct hlist_head *bucket; + struct dlm_master_list_entry *mle; + struct hlist_node *tmp, *list; + + /* +* We notified all other nodes that we are exiting the domain and +* marked the dlm state to DLM_CTXT_LEAVING. If any mles are still +* around we force free them and wake any processes that are waiting +* on the mles +*/ + spin_lock(dlm-spinlock); + spin_lock(dlm-master_lock); + + BUG_ON(dlm-dlm_state != DLM_CTXT_LEAVING); + BUG_ON((find_next_bit(dlm-domain_map, O2NM_MAX_NODES, 0) O2NM_MAX_NODES)); + + for (i = 0; i DLM_HASH_BUCKETS; i++) { + bucket = dlm_master_hash(dlm, i); + hlist_for_each_safe(list, tmp, bucket) { + mle = hlist_entry(list, struct dlm_master_list_entry, + master_hash_node); + if (mle-type != DLM_MLE_BLOCK) { + mlog(ML_ERROR, bad mle: %p\n, mle); + dlm_print_one_mle(mle); + } + atomic_set(mle-woken, 1); + wake_up(mle-wq); + + __dlm_unlink_mle(dlm, mle); + __dlm_mle_detach_hb_events(dlm, mle); + __dlm_put_mle(mle); + } + } + spin_unlock(dlm-master_lock); + spin_unlock(dlm-spinlock); +} -- 1.5.6.5 ___ Ocfs2-devel mailing list Ocfs2-devel@oss.oracle.com http://oss.oracle.com/mailman/listinfo/ocfs2-devel
[Ocfs2-devel] [PATCH 1/1] o2dlm: free block mles during shutdown
If a node initiates shutdown after another node initiated the lock mastery process, this node might have created block mle but will not release it if it doesn't get the assert master from the other node. This causes block mle's to lie around unfreed. This patch frees any block mles that exists on master list after the node sent DLM_EXIT_DOMAIN_MSG to other nodes. Signed-off-by: Srinivas Eeda srinivas.e...@oracle.com --- fs/ocfs2/dlm/dlmcommon.h |1 + fs/ocfs2/dlm/dlmdomain.c |1 + fs/ocfs2/dlm/dlmmaster.c | 33 + 3 files changed, 35 insertions(+), 0 deletions(-) diff --git a/fs/ocfs2/dlm/dlmcommon.h b/fs/ocfs2/dlm/dlmcommon.h index 4b6ae2c..48282dd 100644 --- a/fs/ocfs2/dlm/dlmcommon.h +++ b/fs/ocfs2/dlm/dlmcommon.h @@ -1030,6 +1030,7 @@ int dlm_drop_lockres_ref(struct dlm_ctxt *dlm, struct dlm_lock_resource *res); void dlm_clean_master_list(struct dlm_ctxt *dlm, u8 dead_node); +void dlm_free_block_mles(struct dlm_ctxt *dlm); int dlm_lock_basts_flushed(struct dlm_ctxt *dlm, struct dlm_lock *lock); int __dlm_lockres_has_locks(struct dlm_lock_resource *res); int __dlm_lockres_unused(struct dlm_lock_resource *res); diff --git a/fs/ocfs2/dlm/dlmdomain.c b/fs/ocfs2/dlm/dlmdomain.c index 153abb5..8744fff 100644 --- a/fs/ocfs2/dlm/dlmdomain.c +++ b/fs/ocfs2/dlm/dlmdomain.c @@ -693,6 +693,7 @@ void dlm_unregister_domain(struct dlm_ctxt *dlm) dlm_mark_domain_leaving(dlm); dlm_leave_domain(dlm); + dlm_free_block_mles(dlm); dlm_complete_dlm_shutdown(dlm); } dlm_put(dlm); diff --git a/fs/ocfs2/dlm/dlmmaster.c b/fs/ocfs2/dlm/dlmmaster.c index ffb4c68..5f4d6fd 100644 --- a/fs/ocfs2/dlm/dlmmaster.c +++ b/fs/ocfs2/dlm/dlmmaster.c @@ -3433,3 +3433,36 @@ void dlm_lockres_release_ast(struct dlm_ctxt *dlm, wake_up(res-wq); wake_up(dlm-migration_wq); } + +void dlm_free_block_mles(struct dlm_ctxt *dlm) +{ + int i; + struct hlist_head *bucket; + struct dlm_master_list_entry *mle; + struct hlist_node *list; + + spin_lock(dlm-spinlock); + spin_lock(dlm-master_lock); + for (i = 0; i DLM_HASH_BUCKETS; i++) { + bucket = dlm_master_hash(dlm, i); + hlist_for_each(list, bucket) { + mle = hlist_entry(list, struct dlm_master_list_entry, + master_hash_node); + if (mle-type != DLM_MLE_BLOCK) { + mlog(ML_ERROR, mle for %.*s not destroyed, +type %d\n, +mle-mnamelen, mle-mname, mle-type); + continue; + } + spin_lock(mle-spinlock); + atomic_set(mle-woken, 1); + spin_unlock(mle-spinlock); + wake_up(mle-wq); + + __dlm_mle_detach_hb_events(dlm, mle); + __dlm_put_mle(mle); + } + } + spin_unlock(dlm-master_lock); + spin_unlock(dlm-spinlock); +} -- 1.5.6.5 ___ Ocfs2-devel mailing list Ocfs2-devel@oss.oracle.com http://oss.oracle.com/mailman/listinfo/ocfs2-devel
Re: [Ocfs2-devel] [PATCH] ocfs2/dlm: avoid incorrect bit set in refmap on recovery master
thanks for making this patch, it looks good just few minor changes about comments On 7/23/2010 5:15 AM, Wengang Wang wrote: In the following situation, there remains an incorrect bit in refmap on the recovery master. Finally the recovery master will fail at purging the lockres due to the incorrect bit in refmap. 1) node A has no interest on lockres A any longer, so it is purging it. 2) the owner of lockres A is node B, so node A is sending de-ref message to node B. 3) at this time, node B crashed. node C becomes the recovery master. it recovers lockres A(because the master is the dead node B). 4) node A migrated lockres A to node C with a refbit there. 5) node A failed to send de-ref message to node B because it crashed. The failure is ignored. no other action is done for lockres A any more. For mormal, re-send the deref message to it to recovery master can fix it. Well, ignoring the failure of deref to the original master and not recovering the lockres to recovery master has the same effect. And the later is simpler. Signed-off-by: Wengang Wang wen.gang.w...@oracle.com --- fs/ocfs2/dlm/dlmrecovery.c | 17 +++-- fs/ocfs2/dlm/dlmthread.c | 28 +--- 2 files changed, 32 insertions(+), 13 deletions(-) diff --git a/fs/ocfs2/dlm/dlmrecovery.c b/fs/ocfs2/dlm/dlmrecovery.c index 9dfaac7..06640f6 100644 --- a/fs/ocfs2/dlm/dlmrecovery.c +++ b/fs/ocfs2/dlm/dlmrecovery.c @@ -1997,6 +1997,8 @@ void dlm_move_lockres_to_recovery_list(struct dlm_ctxt *dlm, struct list_head *queue; struct dlm_lock *lock, *next; + assert_spin_locked(dlm-spinlock); + assert_spin_locked(res-spinlock); res-state |= DLM_LOCK_RES_RECOVERING; if (!list_empty(res-recovering)) { mlog(0, @@ -2336,9 +2338,20 @@ static void dlm_do_local_recovery_cleanup(struct dlm_ctxt *dlm, u8 dead_node) /* the wake_up for this will happen when the * RECOVERING flag is dropped later */ remove above comment as it doesn't seem to be relevant anymore. - res-state = ~DLM_LOCK_RES_DROPPING_REF; + if (res-state DLM_LOCK_RES_DROPPING_REF) { + /* + * don't migrate a lockres which is in + * progress of dropping ref + */ move this comment to before the if condition + mlog(ML_NOTICE, %.*s ignored for + migration\n, res-lockname.len, + res-lockname.name); This information only helps us in diagnosing any related issue and is not helpful in normal cases. So it should be 0 instead of ML_NOTICE. + res-state = + ~DLM_LOCK_RES_DROPPING_REF; we don't need to clear this state as dlm_purge_lockres removes it. + } else + dlm_move_lockres_to_recovery_list(dlm, + res); - dlm_move_lockres_to_recovery_list(dlm, res); } else if (res-owner == dlm-node_num) { dlm_free_dead_locks(dlm, res, dead_node); __dlm_lockres_calc_usage(dlm, res); diff --git a/fs/ocfs2/dlm/dlmthread.c b/fs/ocfs2/dlm/dlmthread.c index dd78ca3..47420ce 100644 --- a/fs/ocfs2/dlm/dlmthread.c +++ b/fs/ocfs2/dlm/dlmthread.c @@ -92,17 +92,23 @@ int __dlm_lockres_has_locks(struct dlm_lock_resource *res) * truly ready to be freed. */ int __dlm_lockres_unused(struct dlm_lock_resource *res) { - if (!__dlm_lockres_has_locks(res) - (list_empty(res-dirty) !(res-state DLM_LOCK_RES_DIRTY))) { - /* try not to scan the bitmap unless the first two - * conditions are already true */ - int bit = find_next_bit(res-refmap, O2NM_MAX_NODES, 0); - if (bit = O2NM_MAX_NODES) { - /* since the bit for dlm-node_num is not - * set, inflight_locks better be zero */ - BUG_ON(res-inflight_locks != 0); - return 1; - } + int bit; + + if (__dlm_lockres_has_locks(res)) + return 0; + + if (!list_empty(res-dirty) || res-state DLM_LOCK_RES_DIRTY) + return 0; + + if (res-state DLM_LOCK_RES_RECOVERING) + return 0; + + bit = find_next_bit(res-refmap, O2NM_MAX_NODES, 0); + if (bit = O2NM_MAX_NODES) { + /* since the bit for dlm-node_num is not + * set, inflight_locks better be zero */ +
[Ocfs2-devel] [PATCH 1/1] ocfs2 fix o2dlm dlm run purgelist (rev 3)
This patch fixes two problems in dlm_run_purgelist 1. If a lockres is found to be in use, dlm_run_purgelist keeps trying to purge the same lockres instead of trying the next lockres. 2. When a lockres is found unused, dlm_run_purgelist releases lockres spinlock before setting DLM_LOCK_RES_DROPPING_REF and calls dlm_purge_lockres. spinlock is reacquired but in this window lockres can get reused. This leads to BUG. This patch modifies dlm_run_purgelist to skip lockres if it's in use and purge next lockres. It also sets DLM_LOCK_RES_DROPPING_REF before releasing the lockres spinlock protecting it from getting reused. Signed-off-by: Srinivas Eeda srinivas.e...@oracle.com Acked-by: Sunil Mushran sunil.mush...@oracle.com --- fs/ocfs2/dlm/dlmthread.c | 80 +++-- 1 files changed, 34 insertions(+), 46 deletions(-) diff --git a/fs/ocfs2/dlm/dlmthread.c b/fs/ocfs2/dlm/dlmthread.c index 11a6d1f..960dc8d 100644 --- a/fs/ocfs2/dlm/dlmthread.c +++ b/fs/ocfs2/dlm/dlmthread.c @@ -152,45 +152,25 @@ void dlm_lockres_calc_usage(struct dlm_ctxt *dlm, spin_unlock(dlm-spinlock); } -static int dlm_purge_lockres(struct dlm_ctxt *dlm, +static void dlm_purge_lockres(struct dlm_ctxt *dlm, struct dlm_lock_resource *res) { int master; int ret = 0; - spin_lock(res-spinlock); - if (!__dlm_lockres_unused(res)) { - mlog(0, %s:%.*s: tried to purge but not unused\n, -dlm-name, res-lockname.len, res-lockname.name); - __dlm_print_one_lock_resource(res); - spin_unlock(res-spinlock); - BUG(); - } - - if (res-state DLM_LOCK_RES_MIGRATING) { - mlog(0, %s:%.*s: Delay dropref as this lockres is -being remastered\n, dlm-name, res-lockname.len, -res-lockname.name); - /* Re-add the lockres to the end of the purge list */ - if (!list_empty(res-purge)) { - list_del_init(res-purge); - list_add_tail(res-purge, dlm-purge_list); - } - spin_unlock(res-spinlock); - return 0; - } + assert_spin_locked(dlm-spinlock); + assert_spin_locked(res-spinlock); master = (res-owner == dlm-node_num); - if (!master) - res-state |= DLM_LOCK_RES_DROPPING_REF; - spin_unlock(res-spinlock); mlog(0, purging lockres %.*s, master = %d\n, res-lockname.len, res-lockname.name, master); if (!master) { + res-state |= DLM_LOCK_RES_DROPPING_REF; /* drop spinlock... retake below */ + spin_unlock(res-spinlock); spin_unlock(dlm-spinlock); spin_lock(res-spinlock); @@ -208,31 +188,35 @@ static int dlm_purge_lockres(struct dlm_ctxt *dlm, mlog(0, %s:%.*s: dlm_deref_lockres returned %d\n, dlm-name, res-lockname.len, res-lockname.name, ret); spin_lock(dlm-spinlock); + spin_lock(res-spinlock); } - spin_lock(res-spinlock); if (!list_empty(res-purge)) { mlog(0, removing lockres %.*s:%p from purgelist, master = %d\n, res-lockname.len, res-lockname.name, res, master); list_del_init(res-purge); - spin_unlock(res-spinlock); dlm_lockres_put(res); dlm-purge_count--; - } else - spin_unlock(res-spinlock); + } + + if (!__dlm_lockres_unused(res)) { + mlog(ML_ERROR, found lockres %s:%.*s: in use after deref\n, +dlm-name, res-lockname.len, res-lockname.name); + __dlm_print_one_lock_resource(res); + BUG(); + } __dlm_unhash_lockres(res); /* lockres is not in the hash now. drop the flag and wake up * any processes waiting in dlm_get_lock_resource. */ if (!master) { - spin_lock(res-spinlock); res-state = ~DLM_LOCK_RES_DROPPING_REF; spin_unlock(res-spinlock); wake_up(res-wq); - } - return 0; + } else + spin_unlock(res-spinlock); } static void dlm_run_purge_list(struct dlm_ctxt *dlm, @@ -251,17 +235,7 @@ static void dlm_run_purge_list(struct dlm_ctxt *dlm, lockres = list_entry(dlm-purge_list.next, struct dlm_lock_resource, purge); - /* Status of the lockres *might* change so double -* check. If the lockres is unused, holding the dlm -* spinlock will prevent people from getting and more -* refs on it -- there's no need to keep the lockres -* spinlock. */ spin_lock(lockres-spinlock
Re: [Ocfs2-devel] [PATCH 1/1] ocfs2 fix o2dlm dlm run purgelist(bug 9094491) - rev3
Joel Becker wrote: On Tue, Jun 22, 2010 at 10:48:28PM -0700, Srinivas Eeda wrote: +if (!__dlm_lockres_unused) { +mlog(ML_ERROR, found lockres %s:%.*s: in use after deref\n, + dlm-name, res-lockname.len, res-lockname.name); +__dlm_print_one_lock_resource(res); +BUG(); +} /build/jlbec/linux-2.6/working/fs/ocfs2/dlm/dlmthread.c: In function ‘dlm_purge_lockres’: /build/jlbec/linux-2.6/working/fs/ocfs2/dlm/dlmthread.c:203: warning: the address of ‘__dlm_lockres_unused’ will always evaluate as ‘true’ Was this even tested? I'm leaving this patch out of 'fixes' until corrected and tested. Sorry, I had the typo while making the review changes. I ran the usual tests but that didn't catch this problem. I should have payed more attention to the build log. I made the change and tested it, will send you the modified patch Joel ___ Ocfs2-devel mailing list Ocfs2-devel@oss.oracle.com http://oss.oracle.com/mailman/listinfo/ocfs2-devel
[Ocfs2-devel] [PATCH 1/1] ocfs2 fix o2dlm dlm run purgelist(bug 9094491) - rev3
There are two problems in dlm_run_purgelist 1. If a lockres is found to be in use, dlm_run_purgelist keeps trying to purge the same lockres instead of trying the next lockres. 2. When a lockres is found unused, dlm_run_purgelist releases lockres spinlock before setting DLM_LOCK_RES_DROPPING_REF and calls dlm_purge_lockres. spinlock is reacquired but in this window lockres can get reused. This leads to BUG. This patch modifies dlm_run_purgelist to skip lockres if it's in use and purge next lockres. It also sets DLM_LOCK_RES_DROPPING_REF before releasing the lockres spinlock protecting it from getting reused. Signed-off-by: Srinivas Eeda srinivas.e...@oracle.com Signed-off-by: Sunil Mushransunil.mush...@oracle.com --- fs/ocfs2/dlm/dlmthread.c | 79 +++-- 1 files changed, 33 insertions(+), 46 deletions(-) diff --git a/fs/ocfs2/dlm/dlmthread.c b/fs/ocfs2/dlm/dlmthread.c index 11a6d1f..76ab4aa 100644 --- a/fs/ocfs2/dlm/dlmthread.c +++ b/fs/ocfs2/dlm/dlmthread.c @@ -152,45 +152,25 @@ void dlm_lockres_calc_usage(struct dlm_ctxt *dlm, spin_unlock(dlm-spinlock); } -static int dlm_purge_lockres(struct dlm_ctxt *dlm, +static void dlm_purge_lockres(struct dlm_ctxt *dlm, struct dlm_lock_resource *res) { int master; int ret = 0; - spin_lock(res-spinlock); - if (!__dlm_lockres_unused(res)) { - mlog(0, %s:%.*s: tried to purge but not unused\n, -dlm-name, res-lockname.len, res-lockname.name); - __dlm_print_one_lock_resource(res); - spin_unlock(res-spinlock); - BUG(); - } - - if (res-state DLM_LOCK_RES_MIGRATING) { - mlog(0, %s:%.*s: Delay dropref as this lockres is -being remastered\n, dlm-name, res-lockname.len, -res-lockname.name); - /* Re-add the lockres to the end of the purge list */ - if (!list_empty(res-purge)) { - list_del_init(res-purge); - list_add_tail(res-purge, dlm-purge_list); - } - spin_unlock(res-spinlock); - return 0; - } + assert_spin_locked(dlm-spinlock); + assert_spin_locked(res-spinlock); master = (res-owner == dlm-node_num); - if (!master) - res-state |= DLM_LOCK_RES_DROPPING_REF; - spin_unlock(res-spinlock); mlog(0, purging lockres %.*s, master = %d\n, res-lockname.len, res-lockname.name, master); if (!master) { + res-state |= DLM_LOCK_RES_DROPPING_REF; /* drop spinlock... retake below */ + spin_unlock(res-spinlock); spin_unlock(dlm-spinlock); spin_lock(res-spinlock); @@ -208,31 +188,35 @@ static int dlm_purge_lockres(struct dlm_ctxt *dlm, mlog(0, %s:%.*s: dlm_deref_lockres returned %d\n, dlm-name, res-lockname.len, res-lockname.name, ret); spin_lock(dlm-spinlock); + spin_lock(res-spinlock); } - spin_lock(res-spinlock); if (!list_empty(res-purge)) { mlog(0, removing lockres %.*s:%p from purgelist, master = %d\n, res-lockname.len, res-lockname.name, res, master); list_del_init(res-purge); - spin_unlock(res-spinlock); dlm_lockres_put(res); dlm-purge_count--; - } else - spin_unlock(res-spinlock); + } + + if (!__dlm_lockres_unused(res)) { + mlog(ML_ERROR, found lockres %s:%.*s: in use after deref\n, +dlm-name, res-lockname.len, res-lockname.name); + __dlm_print_one_lock_resource(res); + BUG(); + } __dlm_unhash_lockres(res); /* lockres is not in the hash now. drop the flag and wake up * any processes waiting in dlm_get_lock_resource. */ if (!master) { - spin_lock(res-spinlock); res-state = ~DLM_LOCK_RES_DROPPING_REF; spin_unlock(res-spinlock); wake_up(res-wq); - } - return 0; + } else + spin_unlock(res-spinlock); } static void dlm_run_purge_list(struct dlm_ctxt *dlm, @@ -251,17 +235,7 @@ static void dlm_run_purge_list(struct dlm_ctxt *dlm, lockres = list_entry(dlm-purge_list.next, struct dlm_lock_resource, purge); - /* Status of the lockres *might* change so double -* check. If the lockres is unused, holding the dlm -* spinlock will prevent people from getting and more -* refs on it -- there's no need to keep the lockres -* spinlock. */ spin_lock(lockres-spinlock
Re: [Ocfs2-devel] [PATCH 1/1] ocfs2 fix o2dlm dlm run purgelist(bug 9094491) - rev3
Sunil, Joel, Wengang. Thanks for reviewing the patch and your comments. On 6/23/2010 10:00 AM, Sunil Mushran wrote: Signed-off-by: Sunil Mushransunil.mush...@oracle.com On 06/22/2010 10:48 PM, Srinivas Eeda wrote: There are two problems in dlm_run_purgelist 1. If a lockres is found to be in use, dlm_run_purgelist keeps trying to purge the same lockres instead of trying the next lockres. 2. When a lockres is found unused, dlm_run_purgelist releases lockres spinlock before setting DLM_LOCK_RES_DROPPING_REF and calls dlm_purge_lockres. spinlock is reacquired but in this window lockres can get reused. This leads to BUG. This patch modifies dlm_run_purgelist to skip lockres if it's in use and purge next lockres. It also sets DLM_LOCK_RES_DROPPING_REF before releasing the lockres spinlock protecting it from getting reused. Signed-off-by: Srinivas Eedasrinivas.e...@oracle.com --- fs/ocfs2/dlm/dlmthread.c | 79 +++-- 1 files changed, 33 insertions(+), 46 deletions(-) diff --git a/fs/ocfs2/dlm/dlmthread.c b/fs/ocfs2/dlm/dlmthread.c index 11a6d1f..6822f9a 100644 --- a/fs/ocfs2/dlm/dlmthread.c +++ b/fs/ocfs2/dlm/dlmthread.c @@ -152,45 +152,25 @@ void dlm_lockres_calc_usage(struct dlm_ctxt *dlm, spin_unlock(dlm-spinlock); } -static int dlm_purge_lockres(struct dlm_ctxt *dlm, +static void dlm_purge_lockres(struct dlm_ctxt *dlm, struct dlm_lock_resource *res) { int master; int ret = 0; -spin_lock(res-spinlock); -if (!__dlm_lockres_unused(res)) { -mlog(0, %s:%.*s: tried to purge but not unused\n, - dlm-name, res-lockname.len, res-lockname.name); -__dlm_print_one_lock_resource(res); -spin_unlock(res-spinlock); -BUG(); -} - -if (res-state DLM_LOCK_RES_MIGRATING) { -mlog(0, %s:%.*s: Delay dropref as this lockres is - being remastered\n, dlm-name, res-lockname.len, - res-lockname.name); -/* Re-add the lockres to the end of the purge list */ -if (!list_empty(res-purge)) { -list_del_init(res-purge); -list_add_tail(res-purge,dlm-purge_list); -} -spin_unlock(res-spinlock); -return 0; -} +assert_spin_locked(dlm-spinlock); +assert_spin_locked(res-spinlock); master = (res-owner == dlm-node_num); -if (!master) -res-state |= DLM_LOCK_RES_DROPPING_REF; -spin_unlock(res-spinlock); mlog(0, purging lockres %.*s, master = %d\n, res-lockname.len, res-lockname.name, master); if (!master) { +res-state |= DLM_LOCK_RES_DROPPING_REF; /* drop spinlock... retake below */ +spin_unlock(res-spinlock); spin_unlock(dlm-spinlock); spin_lock(res-spinlock); @@ -208,31 +188,35 @@ static int dlm_purge_lockres(struct dlm_ctxt *dlm, mlog(0, %s:%.*s: dlm_deref_lockres returned %d\n, dlm-name, res-lockname.len, res-lockname.name, ret); spin_lock(dlm-spinlock); +spin_lock(res-spinlock); } -spin_lock(res-spinlock); if (!list_empty(res-purge)) { mlog(0, removing lockres %.*s:%p from purgelist, master = %d\n, res-lockname.len, res-lockname.name, res, master); list_del_init(res-purge); -spin_unlock(res-spinlock); dlm_lockres_put(res); dlm-purge_count--; -} else -spin_unlock(res-spinlock); +} + +if (!__dlm_lockres_unused) { +mlog(ML_ERROR, found lockres %s:%.*s: in use after deref\n, + dlm-name, res-lockname.len, res-lockname.name); +__dlm_print_one_lock_resource(res); +BUG(); +} __dlm_unhash_lockres(res); /* lockres is not in the hash now. drop the flag and wake up * any processes waiting in dlm_get_lock_resource. */ if (!master) { -spin_lock(res-spinlock); res-state= ~DLM_LOCK_RES_DROPPING_REF; spin_unlock(res-spinlock); wake_up(res-wq); -} -return 0; +} else +spin_unlock(res-spinlock); } static void dlm_run_purge_list(struct dlm_ctxt *dlm, @@ -251,17 +235,7 @@ static void dlm_run_purge_list(struct dlm_ctxt *dlm, lockres = list_entry(dlm-purge_list.next, struct dlm_lock_resource, purge); -/* Status of the lockres *might* change so double - * check. If the lockres is unused, holding the dlm - * spinlock will prevent people from getting and more - * refs on it -- there's no need to keep the lockres - * spinlock. */ spin_lock(lockres-spinlock); -unused = __dlm_lockres_unused(lockres); -spin_unlock(lockres-spinlock); - -if (!unused) -continue; purge_jiffies = lockres
[Ocfs2-devel] [PATCH 1/1] ocfs2 fix o2dlm dlm run purgelist(bug 9094491) - rev3
There are two problems in dlm_run_purgelist 1. If a lockres is found to be in use, dlm_run_purgelist keeps trying to purge the same lockres instead of trying the next lockres. 2. When a lockres is found unused, dlm_run_purgelist releases lockres spinlock before setting DLM_LOCK_RES_DROPPING_REF and calls dlm_purge_lockres. spinlock is reacquired but in this window lockres can get reused. This leads to BUG. This patch modifies dlm_run_purgelist to skip lockres if it's in use and purge next lockres. It also sets DLM_LOCK_RES_DROPPING_REF before releasing the lockres spinlock protecting it from getting reused. Signed-off-by: Srinivas Eeda srinivas.e...@oracle.com --- fs/ocfs2/dlm/dlmthread.c | 76 -- 1 files changed, 33 insertions(+), 43 deletions(-) diff --git a/fs/ocfs2/dlm/dlmthread.c b/fs/ocfs2/dlm/dlmthread.c index 11a6d1f..cb74689 100644 --- a/fs/ocfs2/dlm/dlmthread.c +++ b/fs/ocfs2/dlm/dlmthread.c @@ -152,45 +152,26 @@ void dlm_lockres_calc_usage(struct dlm_ctxt *dlm, spin_unlock(dlm-spinlock); } -static int dlm_purge_lockres(struct dlm_ctxt *dlm, +static void dlm_purge_lockres(struct dlm_ctxt *dlm, struct dlm_lock_resource *res) { int master; int ret = 0; - spin_lock(res-spinlock); - if (!__dlm_lockres_unused(res)) { - mlog(0, %s:%.*s: tried to purge but not unused\n, -dlm-name, res-lockname.len, res-lockname.name); - __dlm_print_one_lock_resource(res); - spin_unlock(res-spinlock); - BUG(); - } - - if (res-state DLM_LOCK_RES_MIGRATING) { - mlog(0, %s:%.*s: Delay dropref as this lockres is -being remastered\n, dlm-name, res-lockname.len, -res-lockname.name); - /* Re-add the lockres to the end of the purge list */ - if (!list_empty(res-purge)) { - list_del_init(res-purge); - list_add_tail(res-purge, dlm-purge_list); - } - spin_unlock(res-spinlock); - return 0; - } + assert_spin_locked(dlm-spinlock); + assert_spin_locked(res-spinlock); master = (res-owner == dlm-node_num); if (!master) res-state |= DLM_LOCK_RES_DROPPING_REF; - spin_unlock(res-spinlock); mlog(0, purging lockres %.*s, master = %d\n, res-lockname.len, res-lockname.name, master); if (!master) { /* drop spinlock... retake below */ + spin_unlock(res-spinlock); spin_unlock(dlm-spinlock); spin_lock(res-spinlock); @@ -208,31 +189,35 @@ static int dlm_purge_lockres(struct dlm_ctxt *dlm, mlog(0, %s:%.*s: dlm_deref_lockres returned %d\n, dlm-name, res-lockname.len, res-lockname.name, ret); spin_lock(dlm-spinlock); + spin_lock(res-spinlock); } - spin_lock(res-spinlock); if (!list_empty(res-purge)) { mlog(0, removing lockres %.*s:%p from purgelist, master = %d\n, res-lockname.len, res-lockname.name, res, master); list_del_init(res-purge); - spin_unlock(res-spinlock); dlm_lockres_put(res); dlm-purge_count--; - } else - spin_unlock(res-spinlock); + } - __dlm_unhash_lockres(res); + if (__dlm_lockres_unused(res)) + __dlm_unhash_lockres(res); + else { + mlog(ML_ERROR, found lockres %s:%.*s: in use after deref\n, +dlm-name, res-lockname.len, res-lockname.name); + __dlm_print_one_lock_resource(res); + BUG(); + } /* lockres is not in the hash now. drop the flag and wake up * any processes waiting in dlm_get_lock_resource. */ if (!master) { - spin_lock(res-spinlock); res-state = ~DLM_LOCK_RES_DROPPING_REF; spin_unlock(res-spinlock); wake_up(res-wq); - } - return 0; + } else + spin_unlock(res-spinlock); } static void dlm_run_purge_list(struct dlm_ctxt *dlm, @@ -251,17 +236,7 @@ static void dlm_run_purge_list(struct dlm_ctxt *dlm, lockres = list_entry(dlm-purge_list.next, struct dlm_lock_resource, purge); - /* Status of the lockres *might* change so double -* check. If the lockres is unused, holding the dlm -* spinlock will prevent people from getting and more -* refs on it -- there's no need to keep the lockres -* spinlock. */ spin_lock(lockres-spinlock); - unused = __dlm_lockres_unused(lockres
[Ocfs2-devel] [PATCH 1/1] ocfs2 fix o2dlm dlm run purgelist(bug 9094491) - rev3
There are two problems in dlm_run_purgelist 1. If a lockres is found to be in use, dlm_run_purgelist keeps trying to purge the same lockres instead of trying the next lockres. 2. When a lockres is found unused, dlm_run_purgelist releases lockres spinlock before setting DLM_LOCK_RES_DROPPING_REF and calls dlm_purge_lockres. spinlock is reacquired but in this window lockres can get reused. This leads to BUG. This patch modifies dlm_run_purgelist to skip lockres if it's in use and purge next lockres. It also sets DLM_LOCK_RES_DROPPING_REF before releasing the lockres spinlock protecting it from getting reused. Signed-off-by: Srinivas Eeda srinivas.e...@oracle.com --- fs/ocfs2/dlm/dlmthread.c | 79 +++-- 1 files changed, 33 insertions(+), 46 deletions(-) diff --git a/fs/ocfs2/dlm/dlmthread.c b/fs/ocfs2/dlm/dlmthread.c index 11a6d1f..6822f9a 100644 --- a/fs/ocfs2/dlm/dlmthread.c +++ b/fs/ocfs2/dlm/dlmthread.c @@ -152,45 +152,25 @@ void dlm_lockres_calc_usage(struct dlm_ctxt *dlm, spin_unlock(dlm-spinlock); } -static int dlm_purge_lockres(struct dlm_ctxt *dlm, +static void dlm_purge_lockres(struct dlm_ctxt *dlm, struct dlm_lock_resource *res) { int master; int ret = 0; - spin_lock(res-spinlock); - if (!__dlm_lockres_unused(res)) { - mlog(0, %s:%.*s: tried to purge but not unused\n, -dlm-name, res-lockname.len, res-lockname.name); - __dlm_print_one_lock_resource(res); - spin_unlock(res-spinlock); - BUG(); - } - - if (res-state DLM_LOCK_RES_MIGRATING) { - mlog(0, %s:%.*s: Delay dropref as this lockres is -being remastered\n, dlm-name, res-lockname.len, -res-lockname.name); - /* Re-add the lockres to the end of the purge list */ - if (!list_empty(res-purge)) { - list_del_init(res-purge); - list_add_tail(res-purge, dlm-purge_list); - } - spin_unlock(res-spinlock); - return 0; - } + assert_spin_locked(dlm-spinlock); + assert_spin_locked(res-spinlock); master = (res-owner == dlm-node_num); - if (!master) - res-state |= DLM_LOCK_RES_DROPPING_REF; - spin_unlock(res-spinlock); mlog(0, purging lockres %.*s, master = %d\n, res-lockname.len, res-lockname.name, master); if (!master) { + res-state |= DLM_LOCK_RES_DROPPING_REF; /* drop spinlock... retake below */ + spin_unlock(res-spinlock); spin_unlock(dlm-spinlock); spin_lock(res-spinlock); @@ -208,31 +188,35 @@ static int dlm_purge_lockres(struct dlm_ctxt *dlm, mlog(0, %s:%.*s: dlm_deref_lockres returned %d\n, dlm-name, res-lockname.len, res-lockname.name, ret); spin_lock(dlm-spinlock); + spin_lock(res-spinlock); } - spin_lock(res-spinlock); if (!list_empty(res-purge)) { mlog(0, removing lockres %.*s:%p from purgelist, master = %d\n, res-lockname.len, res-lockname.name, res, master); list_del_init(res-purge); - spin_unlock(res-spinlock); dlm_lockres_put(res); dlm-purge_count--; - } else - spin_unlock(res-spinlock); + } + + if (!__dlm_lockres_unused) { + mlog(ML_ERROR, found lockres %s:%.*s: in use after deref\n, +dlm-name, res-lockname.len, res-lockname.name); + __dlm_print_one_lock_resource(res); + BUG(); + } __dlm_unhash_lockres(res); /* lockres is not in the hash now. drop the flag and wake up * any processes waiting in dlm_get_lock_resource. */ if (!master) { - spin_lock(res-spinlock); res-state = ~DLM_LOCK_RES_DROPPING_REF; spin_unlock(res-spinlock); wake_up(res-wq); - } - return 0; + } else + spin_unlock(res-spinlock); } static void dlm_run_purge_list(struct dlm_ctxt *dlm, @@ -251,17 +235,7 @@ static void dlm_run_purge_list(struct dlm_ctxt *dlm, lockres = list_entry(dlm-purge_list.next, struct dlm_lock_resource, purge); - /* Status of the lockres *might* change so double -* check. If the lockres is unused, holding the dlm -* spinlock will prevent people from getting and more -* refs on it -- there's no need to keep the lockres -* spinlock. */ spin_lock(lockres-spinlock); - unused = __dlm_lockres_unused(lockres
[Ocfs2-devel] [PATCH 1/1] ocfs2 fix o2dlm dlm run purgelist
There are two problems in dlm_run_purgelist 1. If a lockres is found to be in use, dlm_run_purgelist keeps trying to purge the same lockres instead of trying the next lockres. 2. When a lockres is found unused, dlm_run_purgelist releases lockres spinlock before setting DLM_LOCK_RES_DROPPING_REF and calls dlm_purge_lockres. spinlock is reacquired but in this window lockres can get reused. This leads to BUG. This patch modifies dlm_run_purgelist to skip lockres if it's in use and purge next lockres. It also sets DLM_LOCK_RES_DROPPING_REF before releasing the lockres spinlock protecting it from getting reused. Signed-off-by: Srinivas Eeda srinivas.e...@oracle.com --- fs/ocfs2/dlm/dlmthread.c | 55 + 1 files changed, 26 insertions(+), 29 deletions(-) diff --git a/fs/ocfs2/dlm/dlmthread.c b/fs/ocfs2/dlm/dlmthread.c index 11a6d1f..79d1ef6 100644 --- a/fs/ocfs2/dlm/dlmthread.c +++ b/fs/ocfs2/dlm/dlmthread.c @@ -158,15 +158,6 @@ static int dlm_purge_lockres(struct dlm_ctxt *dlm, int master; int ret = 0; - spin_lock(res-spinlock); - if (!__dlm_lockres_unused(res)) { - mlog(0, %s:%.*s: tried to purge but not unused\n, -dlm-name, res-lockname.len, res-lockname.name); - __dlm_print_one_lock_resource(res); - spin_unlock(res-spinlock); - BUG(); - } - if (res-state DLM_LOCK_RES_MIGRATING) { mlog(0, %s:%.*s: Delay dropref as this lockres is being remastered\n, dlm-name, res-lockname.len, @@ -184,13 +175,13 @@ static int dlm_purge_lockres(struct dlm_ctxt *dlm, if (!master) res-state |= DLM_LOCK_RES_DROPPING_REF; - spin_unlock(res-spinlock); mlog(0, purging lockres %.*s, master = %d\n, res-lockname.len, res-lockname.name, master); if (!master) { /* drop spinlock... retake below */ + spin_unlock(res-spinlock); spin_unlock(dlm-spinlock); spin_lock(res-spinlock); @@ -208,30 +199,34 @@ static int dlm_purge_lockres(struct dlm_ctxt *dlm, mlog(0, %s:%.*s: dlm_deref_lockres returned %d\n, dlm-name, res-lockname.len, res-lockname.name, ret); spin_lock(dlm-spinlock); + spin_lock(res-spinlock); } - spin_lock(res-spinlock); if (!list_empty(res-purge)) { mlog(0, removing lockres %.*s:%p from purgelist, master = %d\n, res-lockname.len, res-lockname.name, res, master); list_del_init(res-purge); - spin_unlock(res-spinlock); dlm_lockres_put(res); dlm-purge_count--; - } else - spin_unlock(res-spinlock); + } - __dlm_unhash_lockres(res); + if (__dlm_lockres_unused(res)) + __dlm_unhash_lockres(res); + else { + mlog(ML_ERROR, found lockres %s:%.*s: in use after deref\n, +dlm-name, res-lockname.len, res-lockname.name); + __dlm_print_one_lock_resource(res); + } /* lockres is not in the hash now. drop the flag and wake up * any processes waiting in dlm_get_lock_resource. */ if (!master) { - spin_lock(res-spinlock); res-state = ~DLM_LOCK_RES_DROPPING_REF; spin_unlock(res-spinlock); wake_up(res-wq); - } + } else + spin_unlock(res-spinlock); return 0; } @@ -251,17 +246,7 @@ static void dlm_run_purge_list(struct dlm_ctxt *dlm, lockres = list_entry(dlm-purge_list.next, struct dlm_lock_resource, purge); - /* Status of the lockres *might* change so double -* check. If the lockres is unused, holding the dlm -* spinlock will prevent people from getting and more -* refs on it -- there's no need to keep the lockres -* spinlock. */ spin_lock(lockres-spinlock); - unused = __dlm_lockres_unused(lockres); - spin_unlock(lockres-spinlock); - - if (!unused) - continue; purge_jiffies = lockres-last_used + msecs_to_jiffies(DLM_PURGE_INTERVAL_MS); @@ -273,15 +258,27 @@ static void dlm_run_purge_list(struct dlm_ctxt *dlm, * in tail order, we can stop at the first * unpurgable resource -- anyone added after * him will have a greater last_used value */ + spin_unlock(lockres-spinlock); break; } + /* Status of the lockres *might* change so double +* check. If the lockres
Re: [Ocfs2-devel] [PATCH 2/2] ocfs2: o2dlm fix race in purge lockres and newlock (orabug 9094491)
On 6/17/2010 7:11 PM, Sunil Mushran wrote: This patch looks ok on the surface. Should be usable. But checking into the repo is another matter. ok, won't checkin but will give this as one-off fix for now. My problem is that this flag is very much like inflight_locks but is not the same. inflight_locks are taken only on lockres' that are mastered by the node. This new flag does the same but for non mastered locks too. A better solution will be to merge the two. And that means cleaning up inflight_locks. will look into this. The reason for the NAK for check in is that the code is adding more messiness to an area that is already very messy. Sunil On 06/15/2010 09:43 PM, Srinivas Eeda wrote: This patch fixes the following hole. dlmlock tries to create a new lock on a lockres that is on purge list. It calls dlm_get_lockresource and later adds a lock to blocked list. But in this window, dlm_thread can purge the lockres and unhash it. This will cause a BUG, as when the AST comes back from the master lockres is not found This patch marks the lockres with a new state DLM_LOCK_RES_IN_USE which would protect lockres from dlm_thread purging it. Signed-off-by: Srinivas Eedasrinivas.e...@oracle.com --- fs/ocfs2/dlm/dlmcommon.h |1 + fs/ocfs2/dlm/dlmlock.c |4 fs/ocfs2/dlm/dlmmaster.c |5 - fs/ocfs2/dlm/dlmthread.c |1 + 4 files changed, 10 insertions(+), 1 deletions(-) diff --git a/fs/ocfs2/dlm/dlmcommon.h b/fs/ocfs2/dlm/dlmcommon.h index 0102be3..6cf3c08 100644 --- a/fs/ocfs2/dlm/dlmcommon.h +++ b/fs/ocfs2/dlm/dlmcommon.h @@ -279,6 +279,7 @@ static inline void __dlm_set_joining_node(struct dlm_ctxt *dlm, #define DLM_LOCK_RES_DIRTY0x0008 #define DLM_LOCK_RES_IN_PROGRESS 0x0010 #define DLM_LOCK_RES_MIGRATING0x0020 +#define DLM_LOCK_RES_IN_USE 0x0100 #define DLM_LOCK_RES_DROPPING_REF 0x0040 #define DLM_LOCK_RES_BLOCK_DIRTY 0x1000 #define DLM_LOCK_RES_SETREF_INPROG0x2000 diff --git a/fs/ocfs2/dlm/dlmlock.c b/fs/ocfs2/dlm/dlmlock.c index 777..501ac40 100644 --- a/fs/ocfs2/dlm/dlmlock.c +++ b/fs/ocfs2/dlm/dlmlock.c @@ -134,6 +134,8 @@ static enum dlm_status dlmlock_master(struct dlm_ctxt *dlm, if (status != DLM_NORMAL lock-ml.node != dlm-node_num) { /* erf. state changed after lock was dropped. */ +/* DLM_LOCK_RES_IN_USE is set in dlm_get_lock_resource */ +res-state= ~DLM_LOCK_RES_IN_USE; spin_unlock(res-spinlock); dlm_error(status); return status; @@ -180,6 +182,7 @@ static enum dlm_status dlmlock_master(struct dlm_ctxt *dlm, kick_thread = 1; } } +res-state= ~DLM_LOCK_RES_IN_USE; /* reduce the inflight count, this may result in the lockres * being purged below during calc_usage */ if (lock-ml.node == dlm-node_num) @@ -246,6 +249,7 @@ static enum dlm_status dlmlock_remote(struct dlm_ctxt *dlm, spin_lock(res-spinlock); res-state= ~DLM_LOCK_RES_IN_PROGRESS; +res-state= ~DLM_LOCK_RES_IN_USE; lock-lock_pending = 0; if (status != DLM_NORMAL) { if (status == DLM_RECOVERING diff --git a/fs/ocfs2/dlm/dlmmaster.c b/fs/ocfs2/dlm/dlmmaster.c index 9289b43..f0f2d97 100644 --- a/fs/ocfs2/dlm/dlmmaster.c +++ b/fs/ocfs2/dlm/dlmmaster.c @@ -719,6 +719,7 @@ lookup: if (tmpres) { int dropping_ref = 0; +tmpres-state |= DLM_LOCK_RES_IN_USE; spin_unlock(dlm-spinlock); spin_lock(tmpres-spinlock); @@ -731,8 +732,10 @@ lookup: if (tmpres-owner == dlm-node_num) { BUG_ON(tmpres-state DLM_LOCK_RES_DROPPING_REF); dlm_lockres_grab_inflight_ref(dlm, tmpres); -} else if (tmpres-state DLM_LOCK_RES_DROPPING_REF) +} else if (tmpres-state DLM_LOCK_RES_DROPPING_REF) { +tmpres-state= ~DLM_LOCK_RES_IN_USE; dropping_ref = 1; +} spin_unlock(tmpres-spinlock); /* wait until done messaging the master, drop our ref to allow diff --git a/fs/ocfs2/dlm/dlmthread.c b/fs/ocfs2/dlm/dlmthread.c index fb0be6c..2b82e0b 100644 --- a/fs/ocfs2/dlm/dlmthread.c +++ b/fs/ocfs2/dlm/dlmthread.c @@ -93,6 +93,7 @@ int __dlm_lockres_has_locks(struct dlm_lock_resource *res) int __dlm_lockres_unused(struct dlm_lock_resource *res) { if (!__dlm_lockres_has_locks(res) +!(res-state DLM_LOCK_RES_IN_USE) (list_empty(res-dirty) !(res-state DLM_LOCK_RES_DIRTY))) { /* try not to scan the bitmap unless the first two * conditions are already true */ ___ Ocfs2-devel mailing list Ocfs2-devel@oss.oracle.com http://oss.oracle.com/mailman/listinfo/ocfs2-devel
Re: [Ocfs2-devel] [PATCH 1/2] ocfs2 fix o2dlm dlm run purgelist
On 6/17/2010 1:32 AM, Joel Becker wrote: On Wed, Jun 16, 2010 at 06:44:43PM -0700, Sunil Mushran wrote: One way to skip a lockres in the purgelist is to list_del_init() and list_add_tail(). That simplifies the patch a lot. I have attached a quick dirty patch. See if that satisfies all the requirements. As far as I can tell from reading the code, the time_after() check is because they are time ordered. Wouldn't moving it to the end violate that? right. that's why I didn't want to move used lockres to tail :) Joel ___ Ocfs2-devel mailing list Ocfs2-devel@oss.oracle.com http://oss.oracle.com/mailman/listinfo/ocfs2-devel
Re: [Ocfs2-devel] [PATCH 1/2] ocfs2 fix o2dlm dlm run purgelist
On 6/17/2010 7:48 AM, Sunil Mushran wrote: On 06/17/2010 01:35 AM, Srinivas Eeda wrote: On 6/17/2010 1:32 AM, Joel Becker wrote: On Wed, Jun 16, 2010 at 06:44:43PM -0700, Sunil Mushran wrote: One way to skip a lockres in the purgelist is to list_del_init() and list_add_tail(). That simplifies the patch a lot. I have attached a quick dirty patch. See if that satisfies all the requirements. As far as I can tell from reading the code, the time_after() check is because they are time ordered. Wouldn't moving it to the end violate that? right. that's why I didn't want to move used lockres to tail :) No. There is no need for time ordering. Or, am I missing something? We delay the purge incase the file system changes its mind and wants to reuse it. By delaying, we hold onto the mastery information. That's it. So, should we preserve this? or purge the lockres when it's on purge list? OR we could just update the last_used and move it to end of the list if it can be purged. ___ Ocfs2-devel mailing list Ocfs2-devel@oss.oracle.com http://oss.oracle.com/mailman/listinfo/ocfs2-devel
Re: [Ocfs2-devel] [PATCH 1/2] ocfs2 fix o2dlm dlm run purgelist
Sunil, as of now, there is still a window in dlm_get_lock_resource, where it finds the lockres but it doesn't protect it from getting purged. Second patch fixes this by marking it in_use, can you please review that one as well. Thanks, --Srini On 6/17/2010 8:06 AM, Sunil Mushran wrote: On 06/15/2010 11:06 PM, Wengang Wang wrote: still the question. If you have sent DEREF request to the master, and the lockres became in-use again, then the lockres remains in the hash table and also in the purge list. So 1) If this node is the last ref, there is a possibility that the master purged the lockres after receiving DEREF request from this node. In this case, when this node does dlmlock_remote(), the lockres won't be found on the master. How to deal with it? 2) The lockres on this node is going to be purged again, it means it will send secondary DEREFs to the master. This is not good I think. A thought is setting lockres-owner to DLM_LOCK_RES_OWNER_UNKNOWN after sending a DEREF request againt this lockres. Also redo master reqeust before locking on it. The fix we are working towards is to ensure that we set DLM_LOCK_RES_DROPPING_REF once we are determined to purge the lockres. As in, we should not let go of the spinlock before we have either set the flag or decided against purging that resource. Once the flag is set, new users looking up the resource via dlm_get_lock_resource() will notice the flag and will then wait for that flag to be cleared before looking up the lockres hash again. If all goes well, the lockres will not be found (because it has since been unhashed) and it will be forced to go thru the full mastery process. ___ Ocfs2-devel mailing list Ocfs2-devel@oss.oracle.com http://oss.oracle.com/mailman/listinfo/ocfs2-devel
Re: [Ocfs2-devel] [PATCH] ocfs2/dlm: delay the migration when the lockres is in recovery
patch looks good, it fixes the umount code path which prevents a lockres from migrating if it needs to be recovered. I have few comments on the scenario you described. On 6/11/2010 3:25 AM, Wengang Wang wrote: Any comment on this patch? regards, wengang. On 10-05-25 15:59, Wengang Wang wrote: We shouldn't migrate a lockres in recovery state. Otherwise, it has the following problem: 1) Recovery happened as recovery master on a node(node A) which is in umount migrating all lockres' it owned(master is node A) to other nodes, say a node B. 2) So node A wants to take over all the lockres' those are mastered by the crashed node C. 3) Receiving request_locks request from node A, node B send mig_lockres requests(for recovery) to node A for all lockres' that was mastered by the crashed node C. It can also send the request for a lockres(lockres A) which is not in node A's hashtable. why wouldn't lockres A be in node A's hashtable? if it's not in hash table, then it won't be migratable 4) Receiving the mig_lockres request for lockres A from node B, a new lockres object lockres A', with INRECOVERING flag set, is created and inserted to hash table. 5) The recovery for lockres A' is going on on node A, it finally mastered the lockres A'. And now, RECOVERING flag is not cleared from lockres A' nor from lockres A on node B. 6) The migration for lockres A' goes since now node A mastered lockres A' already. the mig_lockres request(for migration) is sent to node B. 7) Node B responsed with -EFAULT because now lockres A is still in recovery state. 8) Node A BUG() on the -EFAULT. fix: The recovery state is cleared on node A(recovery master) after it's cleared on node B. We wait until the in recovery state is cleared from node A and migrate it to node B. Signed-off-by: Wengang Wang wen.gang.w...@oracle.com --- fs/ocfs2/dlm/dlmmaster.c |3 +++ 1 files changed, 3 insertions(+), 0 deletions(-) diff --git a/fs/ocfs2/dlm/dlmmaster.c b/fs/ocfs2/dlm/dlmmaster.c index 9289b43..de9c128 100644 --- a/fs/ocfs2/dlm/dlmmaster.c +++ b/fs/ocfs2/dlm/dlmmaster.c @@ -2371,6 +2371,9 @@ static int dlm_is_lockres_migrateable(struct dlm_ctxt *dlm, goto leave; } + if (unlikely(res-state DLM_LOCK_RES_RECOVERING)) + goto leave; + ret = 0; queue = res-granted; for (i = 0; i 3; i++) { -- 1.6.6.1 ___ Ocfs2-devel mailing list Ocfs2-devel@oss.oracle.com http://oss.oracle.com/mailman/listinfo/ocfs2-devel ___ Ocfs2-devel mailing list Ocfs2-devel@oss.oracle.com http://oss.oracle.com/mailman/listinfo/ocfs2-devel ___ Ocfs2-devel mailing list Ocfs2-devel@oss.oracle.com http://oss.oracle.com/mailman/listinfo/ocfs2-devel
[Ocfs2-devel] [PATCH 2/2] ocfs2: o2dlm fix race in purge lockres and newlock (orabug 9094491)
This patch fixes the following hole. dlmlock tries to create a new lock on a lockres that is on purge list. It calls dlm_get_lockresource and later adds a lock to blocked list. But in this window, dlm_thread can purge the lockres and unhash it. This will cause a BUG, as when the AST comes back from the master lockres is not found This patch marks the lockres with a new state DLM_LOCK_RES_IN_USE which would protect lockres from dlm_thread purging it. Signed-off-by: Srinivas Eeda srinivas.e...@oracle.com --- fs/ocfs2/dlm/dlmcommon.h |1 + fs/ocfs2/dlm/dlmlock.c |4 fs/ocfs2/dlm/dlmmaster.c |5 - fs/ocfs2/dlm/dlmthread.c |1 + 4 files changed, 10 insertions(+), 1 deletions(-) diff --git a/fs/ocfs2/dlm/dlmcommon.h b/fs/ocfs2/dlm/dlmcommon.h index 0102be3..6cf3c08 100644 --- a/fs/ocfs2/dlm/dlmcommon.h +++ b/fs/ocfs2/dlm/dlmcommon.h @@ -279,6 +279,7 @@ static inline void __dlm_set_joining_node(struct dlm_ctxt *dlm, #define DLM_LOCK_RES_DIRTY0x0008 #define DLM_LOCK_RES_IN_PROGRESS 0x0010 #define DLM_LOCK_RES_MIGRATING0x0020 +#define DLM_LOCK_RES_IN_USE 0x0100 #define DLM_LOCK_RES_DROPPING_REF 0x0040 #define DLM_LOCK_RES_BLOCK_DIRTY 0x1000 #define DLM_LOCK_RES_SETREF_INPROG0x2000 diff --git a/fs/ocfs2/dlm/dlmlock.c b/fs/ocfs2/dlm/dlmlock.c index 777..501ac40 100644 --- a/fs/ocfs2/dlm/dlmlock.c +++ b/fs/ocfs2/dlm/dlmlock.c @@ -134,6 +134,8 @@ static enum dlm_status dlmlock_master(struct dlm_ctxt *dlm, if (status != DLM_NORMAL lock-ml.node != dlm-node_num) { /* erf. state changed after lock was dropped. */ + /* DLM_LOCK_RES_IN_USE is set in dlm_get_lock_resource */ + res-state = ~DLM_LOCK_RES_IN_USE; spin_unlock(res-spinlock); dlm_error(status); return status; @@ -180,6 +182,7 @@ static enum dlm_status dlmlock_master(struct dlm_ctxt *dlm, kick_thread = 1; } } + res-state = ~DLM_LOCK_RES_IN_USE; /* reduce the inflight count, this may result in the lockres * being purged below during calc_usage */ if (lock-ml.node == dlm-node_num) @@ -246,6 +249,7 @@ static enum dlm_status dlmlock_remote(struct dlm_ctxt *dlm, spin_lock(res-spinlock); res-state = ~DLM_LOCK_RES_IN_PROGRESS; + res-state = ~DLM_LOCK_RES_IN_USE; lock-lock_pending = 0; if (status != DLM_NORMAL) { if (status == DLM_RECOVERING diff --git a/fs/ocfs2/dlm/dlmmaster.c b/fs/ocfs2/dlm/dlmmaster.c index 9289b43..f0f2d97 100644 --- a/fs/ocfs2/dlm/dlmmaster.c +++ b/fs/ocfs2/dlm/dlmmaster.c @@ -719,6 +719,7 @@ lookup: if (tmpres) { int dropping_ref = 0; + tmpres-state |= DLM_LOCK_RES_IN_USE; spin_unlock(dlm-spinlock); spin_lock(tmpres-spinlock); @@ -731,8 +732,10 @@ lookup: if (tmpres-owner == dlm-node_num) { BUG_ON(tmpres-state DLM_LOCK_RES_DROPPING_REF); dlm_lockres_grab_inflight_ref(dlm, tmpres); - } else if (tmpres-state DLM_LOCK_RES_DROPPING_REF) + } else if (tmpres-state DLM_LOCK_RES_DROPPING_REF) { + tmpres-state = ~DLM_LOCK_RES_IN_USE; dropping_ref = 1; + } spin_unlock(tmpres-spinlock); /* wait until done messaging the master, drop our ref to allow diff --git a/fs/ocfs2/dlm/dlmthread.c b/fs/ocfs2/dlm/dlmthread.c index fb0be6c..2b82e0b 100644 --- a/fs/ocfs2/dlm/dlmthread.c +++ b/fs/ocfs2/dlm/dlmthread.c @@ -93,6 +93,7 @@ int __dlm_lockres_has_locks(struct dlm_lock_resource *res) int __dlm_lockres_unused(struct dlm_lock_resource *res) { if (!__dlm_lockres_has_locks(res) + !(res-state DLM_LOCK_RES_IN_USE) (list_empty(res-dirty) !(res-state DLM_LOCK_RES_DIRTY))) { /* try not to scan the bitmap unless the first two * conditions are already true */ -- 1.5.6.5 ___ Ocfs2-devel mailing list Ocfs2-devel@oss.oracle.com http://oss.oracle.com/mailman/listinfo/ocfs2-devel
Re: [Ocfs2-devel] [PATCH] ocfs2/dlm: check dlm_state under spinlock
The lock order in this code causes dead lock, not caused by your patch. The lock order in dlm_query_join_handler is dlm_domain_lock -dlm-spinlock dead locks with .. dlm_lockres_put calls dlm_lockres_release while holding dlm-spinlock which calls dlm_put which gets dlm_domain_lock. So the spin lock order here is dlm-spinlock - dlm_domain_lock On 6/15/2010 9:08 PM, Wengang Wang wrote: We should check dlm-dlm_state under dlm-spinlock though maybe in this case it doesn't hurt. Signed-off-by: Wengang Wang wen.gang.w...@oracle.com --- fs/ocfs2/dlm/dlmdomain.c | 13 ++--- 1 files changed, 6 insertions(+), 7 deletions(-) diff --git a/fs/ocfs2/dlm/dlmdomain.c b/fs/ocfs2/dlm/dlmdomain.c index 6b5a492..ab82add 100644 --- a/fs/ocfs2/dlm/dlmdomain.c +++ b/fs/ocfs2/dlm/dlmdomain.c @@ -796,7 +796,7 @@ static int dlm_query_join_handler(struct o2net_msg *msg, u32 len, void *data, spin_lock(dlm_domain_lock); dlm = __dlm_lookup_domain_full(query-domain, query-name_len); if (!dlm) - goto unlock_respond; + goto unlock_domain_respond; /* * There is a small window where the joining node may not see the @@ -811,7 +811,7 @@ static int dlm_query_join_handler(struct o2net_msg *msg, u32 len, void *data, have node %u in its nodemap\n, query-node_idx, nodenum); packet.code = JOIN_DISALLOW; - goto unlock_respond; + goto unlock_domain_respond; } } nodenum++; @@ -821,9 +821,9 @@ static int dlm_query_join_handler(struct o2net_msg *msg, u32 len, void *data, * to be put in someone's domain map. * Also, explicitly disallow joining at certain troublesome * times (ie. during recovery). */ - if (dlm dlm-dlm_state != DLM_CTXT_LEAVING) { + spin_lock(dlm-spinlock); + if (dlm-dlm_state != DLM_CTXT_LEAVING) { int bit = query-node_idx; - spin_lock(dlm-spinlock); if (dlm-dlm_state == DLM_CTXT_NEW dlm-joining_node == DLM_LOCK_RES_OWNER_UNKNOWN) { @@ -869,10 +869,9 @@ static int dlm_query_join_handler(struct o2net_msg *msg, u32 len, void *data, __dlm_set_joining_node(dlm, query-node_idx); } } - - spin_unlock(dlm-spinlock); } -unlock_respond: + spin_unlock(dlm-spinlock); +unlock_domain_respond: spin_unlock(dlm_domain_lock); respond: ___ Ocfs2-devel mailing list Ocfs2-devel@oss.oracle.com http://oss.oracle.com/mailman/listinfo/ocfs2-devel
Re: [Ocfs2-devel] [PATCH 1/1] ocfs2: o2dlm fix race in purge lockres and newlock (orabug 9094491)
Please ignore this patch. I'll resend this patch along with dlm purge lockres for completeness. On 6/9/2010 7:07 PM, Srinivas Eeda wrote: dlm_thread sends a deref message to the master node. At the same time, another thread sends a new lock request to the master node. Since dlm_thread wouldn't know about it, it would unhash the lockres after it gets the response. lock request AST would then won't find the lockres and hence BUGs. The fix is to add new state DLM_LOCK_RES_IN_USE which would prevent dlm_thread from purging the lockres and/or unhashing lockres. Signed-off-by: Srinivas Eeda srinivas.e...@oracle.com --- fs/ocfs2/dlm/dlmcommon.h |1 + fs/ocfs2/dlm/dlmlock.c |4 fs/ocfs2/dlm/dlmmaster.c |5 - fs/ocfs2/dlm/dlmthread.c |4 +++- 4 files changed, 12 insertions(+), 2 deletions(-) diff --git a/fs/ocfs2/dlm/dlmcommon.h b/fs/ocfs2/dlm/dlmcommon.h index 0102be3..0290612 100644 --- a/fs/ocfs2/dlm/dlmcommon.h +++ b/fs/ocfs2/dlm/dlmcommon.h @@ -280,6 +280,7 @@ static inline void __dlm_set_joining_node(struct dlm_ctxt *dlm, #define DLM_LOCK_RES_IN_PROGRESS 0x0010 #define DLM_LOCK_RES_MIGRATING0x0020 #define DLM_LOCK_RES_DROPPING_REF 0x0040 +#define DLM_LOCK_RES_IN_USE 0x0100 #define DLM_LOCK_RES_BLOCK_DIRTY 0x1000 #define DLM_LOCK_RES_SETREF_INPROG0x2000 diff --git a/fs/ocfs2/dlm/dlmlock.c b/fs/ocfs2/dlm/dlmlock.c index 777..501ac40 100644 --- a/fs/ocfs2/dlm/dlmlock.c +++ b/fs/ocfs2/dlm/dlmlock.c @@ -134,6 +134,8 @@ static enum dlm_status dlmlock_master(struct dlm_ctxt *dlm, if (status != DLM_NORMAL lock-ml.node != dlm-node_num) { /* erf. state changed after lock was dropped. */ + /* DLM_LOCK_RES_IN_USE is set in dlm_get_lock_resource */ + res-state = ~DLM_LOCK_RES_IN_USE; spin_unlock(res-spinlock); dlm_error(status); return status; @@ -180,6 +182,7 @@ static enum dlm_status dlmlock_master(struct dlm_ctxt *dlm, kick_thread = 1; } } + res-state = ~DLM_LOCK_RES_IN_USE; /* reduce the inflight count, this may result in the lockres * being purged below during calc_usage */ if (lock-ml.node == dlm-node_num) @@ -246,6 +249,7 @@ static enum dlm_status dlmlock_remote(struct dlm_ctxt *dlm, spin_lock(res-spinlock); res-state = ~DLM_LOCK_RES_IN_PROGRESS; + res-state = ~DLM_LOCK_RES_IN_USE; lock-lock_pending = 0; if (status != DLM_NORMAL) { if (status == DLM_RECOVERING diff --git a/fs/ocfs2/dlm/dlmmaster.c b/fs/ocfs2/dlm/dlmmaster.c index 9289b43..f0f2d97 100644 --- a/fs/ocfs2/dlm/dlmmaster.c +++ b/fs/ocfs2/dlm/dlmmaster.c @@ -719,6 +719,7 @@ lookup: if (tmpres) { int dropping_ref = 0; + tmpres-state |= DLM_LOCK_RES_IN_USE; spin_unlock(dlm-spinlock); spin_lock(tmpres-spinlock); @@ -731,8 +732,10 @@ lookup: if (tmpres-owner == dlm-node_num) { BUG_ON(tmpres-state DLM_LOCK_RES_DROPPING_REF); dlm_lockres_grab_inflight_ref(dlm, tmpres); - } else if (tmpres-state DLM_LOCK_RES_DROPPING_REF) + } else if (tmpres-state DLM_LOCK_RES_DROPPING_REF) { + tmpres-state = ~DLM_LOCK_RES_IN_USE; dropping_ref = 1; + } spin_unlock(tmpres-spinlock); /* wait until done messaging the master, drop our ref to allow diff --git a/fs/ocfs2/dlm/dlmthread.c b/fs/ocfs2/dlm/dlmthread.c index 11a6d1f..b2315cb 100644 --- a/fs/ocfs2/dlm/dlmthread.c +++ b/fs/ocfs2/dlm/dlmthread.c @@ -93,6 +93,7 @@ int __dlm_lockres_has_locks(struct dlm_lock_resource *res) int __dlm_lockres_unused(struct dlm_lock_resource *res) { if (!__dlm_lockres_has_locks(res) + !(res-state DLM_LOCK_RES_IN_USE) (list_empty(res-dirty) !(res-state DLM_LOCK_RES_DIRTY))) { /* try not to scan the bitmap unless the first two * conditions are already true */ @@ -222,7 +223,8 @@ static int dlm_purge_lockres(struct dlm_ctxt *dlm, } else spin_unlock(res-spinlock); - __dlm_unhash_lockres(res); + if (__dlm_lockres_unused(res)) + __dlm_unhash_lockres(res); /* lockres is not in the hash now. drop the flag and wake up * any processes waiting in dlm_get_lock_resource. */ ___ Ocfs2-devel mailing list Ocfs2-devel@oss.oracle.com http://oss.oracle.com/mailman/listinfo/ocfs2-devel
Re: [Ocfs2-devel] [PATCH] ocfs2/dlm: remove lockres from purge when a lock is added
Wengang, thanks for the patch. My comments are inline :) On 6/8/2010 7:38 AM, Wengang Wang wrote: dlm_thread(when purges a lockres) races with another thread that is running on dlmlock_remote(). dlmlock_remote() can add a lock to the blocked list of the lockres without taking dlm-spinlock. This can lead a BUG() in dlm_thread because of the added lock. Fix We take dlm-spinlock when adding a lock to the blocked list and make sure the lockres is not in purge list. If we removed the lockres from purge list, try to add it back. Signed-off-by: Wengang Wang wen.gang.w...@oracle.com --- fs/ocfs2/dlm/dlmcommon.h | 15 +++ fs/ocfs2/dlm/dlmlock.c | 18 +++--- fs/ocfs2/dlm/dlmthread.c | 28 3 files changed, 58 insertions(+), 3 deletions(-) diff --git a/fs/ocfs2/dlm/dlmcommon.h b/fs/ocfs2/dlm/dlmcommon.h index 4b6ae2c..134973a 100644 --- a/fs/ocfs2/dlm/dlmcommon.h +++ b/fs/ocfs2/dlm/dlmcommon.h @@ -1002,6 +1002,9 @@ void dlm_move_lockres_to_recovery_list(struct dlm_ctxt *dlm, /* will exit holding res-spinlock, but may drop in function */ void __dlm_wait_on_lockres_flags(struct dlm_lock_resource *res, int flags); +void __dlm_wait_on_lockres_flags_lock_dlm(struct dlm_ctxt *dlm, + struct dlm_lock_resource *res, + int flags); void __dlm_wait_on_lockres_flags_set(struct dlm_lock_resource *res, int flags); /* will exit holding res-spinlock, but may drop in function */ @@ -1012,6 +1015,18 @@ static inline void __dlm_wait_on_lockres(struct dlm_lock_resource *res) DLM_LOCK_RES_MIGRATING)); } +/* + * will exit holding dlm-spinlock and res-spinlock, but may drop in + * function + */ +static inline void __dlm_wait_on_lockres_lock_dlm(struct dlm_ctxt *dlm, + struct dlm_lock_resource *res) +{ + __dlm_wait_on_lockres_flags_lock_dlm(dlm, res, + (DLM_LOCK_RES_IN_PROGRESS| + DLM_LOCK_RES_RECOVERING| + DLM_LOCK_RES_MIGRATING)); +} void __dlm_unlink_mle(struct dlm_ctxt *dlm, struct dlm_master_list_entry *mle); void __dlm_insert_mle(struct dlm_ctxt *dlm, struct dlm_master_list_entry *mle); diff --git a/fs/ocfs2/dlm/dlmlock.c b/fs/ocfs2/dlm/dlmlock.c index 69cf369..fb3c178 100644 --- a/fs/ocfs2/dlm/dlmlock.c +++ b/fs/ocfs2/dlm/dlmlock.c @@ -222,23 +222,35 @@ static enum dlm_status dlmlock_remote(struct dlm_ctxt *dlm, struct dlm_lock *lock, int flags) { enum dlm_status status = DLM_DENIED; - int lockres_changed = 1; + int lockres_changed = 1, recalc; mlog_entry(type=%d\n, lock-ml.type); mlog(0, lockres %.*s, flags = 0x%x\n, res-lockname.len, res-lockname.name, flags); + spin_lock(dlm-spinlock); spin_lock(res-spinlock); /* will exit this call with spinlock held */ - __dlm_wait_on_lockres(res); + __dlm_wait_on_lockres_lock_dlm(dlm, res); res-state |= DLM_LOCK_RES_IN_PROGRESS; /* add lock to local (secondary) queue */ dlm_lock_get(lock); list_add_tail(lock-list, res-blocked); lock-lock_pending = 1; + /* + * make sure this lockres is not in purge list if we remove the + * lockres from purge list, we try to add it back if locking failed. + * + * there is a race with dlm_thread purging this lockres. + * in this case, it can trigger a BUG() because we added a lock to the + * blocked list. + */ + recalc = !list_empty(res-purge); + __dlm_lockres_calc_usage(dlm, res); Thanks for this patch, looks good. A few comments on another approach, let me know your thoughts. The patch I sent for review marks the lockres to be in use (before dlmlock_remote is called) in dlm_get_lockresource. I have to do that to protect the lockres from unhashing once it's being used. But since lockres is still in purgelist the BUG you mentioned can still trigger in dlm_thread. once a lockres is on purge list there are quite a few cases(case this patch is addressing, case the patch I just sent for review is addresing, and the case of the recovery) that can trigger the lockres to be reused, before or while it is getting purged. dlm_thread(dlm_run_purge_list) already expects that a lockres on purge list could just getting started to be reused. However it doesn't handle the above cases. I think it's better to enhance dlm_run_purge_list/dlm_purge_lockres code to handle these cases. I'am working on this change. There also seems to be a case thats not handled. Assume dlm_thread sent deref message and waiting for response, now dlmlock_remote removed it from purge list. dlm_thread got response from master and continues and
[Ocfs2-devel] [PATCH 1/1] ocfs2: o2dlm fix race in purge lockres and newlock (orabug 9094491)
dlm_thread sends a deref message to the master node. At the same time, another thread sends a new lock request to the master node. Since dlm_thread wouldn't know about it, it would unhash the lockres after it gets the response. lock request AST would then won't find the lockres and hence BUGs. The fix is to add new state DLM_LOCK_RES_IN_USE which would prevent dlm_thread from purging the lockres and/or unhashing lockres. Signed-off-by: Srinivas Eeda srinivas.e...@oracle.com --- fs/ocfs2/dlm/dlmcommon.h |1 + fs/ocfs2/dlm/dlmlock.c |4 fs/ocfs2/dlm/dlmmaster.c |5 - fs/ocfs2/dlm/dlmthread.c |4 +++- 4 files changed, 12 insertions(+), 2 deletions(-) diff --git a/fs/ocfs2/dlm/dlmcommon.h b/fs/ocfs2/dlm/dlmcommon.h index 0102be3..0290612 100644 --- a/fs/ocfs2/dlm/dlmcommon.h +++ b/fs/ocfs2/dlm/dlmcommon.h @@ -280,6 +280,7 @@ static inline void __dlm_set_joining_node(struct dlm_ctxt *dlm, #define DLM_LOCK_RES_IN_PROGRESS 0x0010 #define DLM_LOCK_RES_MIGRATING0x0020 #define DLM_LOCK_RES_DROPPING_REF 0x0040 +#define DLM_LOCK_RES_IN_USE 0x0100 #define DLM_LOCK_RES_BLOCK_DIRTY 0x1000 #define DLM_LOCK_RES_SETREF_INPROG0x2000 diff --git a/fs/ocfs2/dlm/dlmlock.c b/fs/ocfs2/dlm/dlmlock.c index 777..501ac40 100644 --- a/fs/ocfs2/dlm/dlmlock.c +++ b/fs/ocfs2/dlm/dlmlock.c @@ -134,6 +134,8 @@ static enum dlm_status dlmlock_master(struct dlm_ctxt *dlm, if (status != DLM_NORMAL lock-ml.node != dlm-node_num) { /* erf. state changed after lock was dropped. */ + /* DLM_LOCK_RES_IN_USE is set in dlm_get_lock_resource */ + res-state = ~DLM_LOCK_RES_IN_USE; spin_unlock(res-spinlock); dlm_error(status); return status; @@ -180,6 +182,7 @@ static enum dlm_status dlmlock_master(struct dlm_ctxt *dlm, kick_thread = 1; } } + res-state = ~DLM_LOCK_RES_IN_USE; /* reduce the inflight count, this may result in the lockres * being purged below during calc_usage */ if (lock-ml.node == dlm-node_num) @@ -246,6 +249,7 @@ static enum dlm_status dlmlock_remote(struct dlm_ctxt *dlm, spin_lock(res-spinlock); res-state = ~DLM_LOCK_RES_IN_PROGRESS; + res-state = ~DLM_LOCK_RES_IN_USE; lock-lock_pending = 0; if (status != DLM_NORMAL) { if (status == DLM_RECOVERING diff --git a/fs/ocfs2/dlm/dlmmaster.c b/fs/ocfs2/dlm/dlmmaster.c index 9289b43..f0f2d97 100644 --- a/fs/ocfs2/dlm/dlmmaster.c +++ b/fs/ocfs2/dlm/dlmmaster.c @@ -719,6 +719,7 @@ lookup: if (tmpres) { int dropping_ref = 0; + tmpres-state |= DLM_LOCK_RES_IN_USE; spin_unlock(dlm-spinlock); spin_lock(tmpres-spinlock); @@ -731,8 +732,10 @@ lookup: if (tmpres-owner == dlm-node_num) { BUG_ON(tmpres-state DLM_LOCK_RES_DROPPING_REF); dlm_lockres_grab_inflight_ref(dlm, tmpres); - } else if (tmpres-state DLM_LOCK_RES_DROPPING_REF) + } else if (tmpres-state DLM_LOCK_RES_DROPPING_REF) { + tmpres-state = ~DLM_LOCK_RES_IN_USE; dropping_ref = 1; + } spin_unlock(tmpres-spinlock); /* wait until done messaging the master, drop our ref to allow diff --git a/fs/ocfs2/dlm/dlmthread.c b/fs/ocfs2/dlm/dlmthread.c index 11a6d1f..b2315cb 100644 --- a/fs/ocfs2/dlm/dlmthread.c +++ b/fs/ocfs2/dlm/dlmthread.c @@ -93,6 +93,7 @@ int __dlm_lockres_has_locks(struct dlm_lock_resource *res) int __dlm_lockres_unused(struct dlm_lock_resource *res) { if (!__dlm_lockres_has_locks(res) + !(res-state DLM_LOCK_RES_IN_USE) (list_empty(res-dirty) !(res-state DLM_LOCK_RES_DIRTY))) { /* try not to scan the bitmap unless the first two * conditions are already true */ @@ -222,7 +223,8 @@ static int dlm_purge_lockres(struct dlm_ctxt *dlm, } else spin_unlock(res-spinlock); - __dlm_unhash_lockres(res); + if (__dlm_lockres_unused(res)) + __dlm_unhash_lockres(res); /* lockres is not in the hash now. drop the flag and wake up * any processes waiting in dlm_get_lock_resource. */ -- 1.5.6.5 ___ Ocfs2-devel mailing list Ocfs2-devel@oss.oracle.com http://oss.oracle.com/mailman/listinfo/ocfs2-devel