from:"Srinivas Eeda"

Re: [Ocfs2-devel] [PATCH] ocfs2: Fix start offset to ocfs2_zero_range_for_truncate()

2016-08-22 Thread Srinivas Eeda

Good catch! Thank you for the fix

Reviewed-by: Srinivas Eeda <srinivas.e...@oracle.com>

On 08/11/2016 04:12 PM, Ashish Samant wrote:
> If we do fallocate with punch hole option on a reflink, with start offset
> on a cluster boundary and end offset somewhere in another cluster, we
> dont COW the first cluster starting at the start offset. But in this
> case, we were wrongly passing this cluster to
> ocfs2_zero_range_for_truncate() to zero out.
>
> Fix this by skipping this cluster in such a scenario.
>
> Signed-off-by: Ashish Samant <ashish.sam...@oracle.com>
> ---
>   fs/ocfs2/file.c | 34 --
>   1 file changed, 24 insertions(+), 10 deletions(-)
>
> diff --git a/fs/ocfs2/file.c b/fs/ocfs2/file.c
> index 4a6e130..ab305aa 100644
> --- a/fs/ocfs2/file.c
> +++ b/fs/ocfs2/file.c
> @@ -1522,7 +1522,8 @@ static int ocfs2_zero_partial_clusters(struct inode 
> *inode,
>  u64 start, u64 len)
>   {
>   int ret = 0;
> - u64 tmpend, end = start + len;
> + u64 tmpend = 0;
> + u64 end = start + len;
>   struct ocfs2_super *osb = OCFS2_SB(inode->i_sb);
>   unsigned int csize = osb->s_clustersize;
>   handle_t *handle;
> @@ -1554,18 +1555,31 @@ static int ocfs2_zero_partial_clusters(struct inode 
> *inode,
>   }
>   
>   /*
> -  * We want to get the byte offset of the end of the 1st cluster.
> +  * If start is on a cluster boundary and end is somewhere in another
> +  * cluster, we have not COWed the cluster starting at start, unless
> +  * end is also within the same cluster. So, in this case, we skip this
> +  * first call to ocfs2_zero_range_for_truncate() truncate and move on
> +  * to the next one.
>*/
> - tmpend = (u64)osb->s_clustersize + (start & ~(osb->s_clustersize - 1));
> - if (tmpend > end)
> - tmpend = end;
> + if ((start & (csize - 1)) != 0) {
> + /*
> +  * We want to get the byte offset of the end of the 1st
> +  * cluster.
> +  */
> + tmpend = (u64)osb->s_clustersize +
> + (start & ~(osb->s_clustersize - 1));
> + if (tmpend > end)
> + tmpend = end;
>   
> - trace_ocfs2_zero_partial_clusters_range1((unsigned long long)start,
> -  (unsigned long long)tmpend);
> + trace_ocfs2_zero_partial_clusters_range1(
> + (unsigned long long)start,
> + (unsigned long long)tmpend);
>   
> - ret = ocfs2_zero_range_for_truncate(inode, handle, start, tmpend);
> - if (ret)
> - mlog_errno(ret);
> + ret = ocfs2_zero_range_for_truncate(inode, handle, start,
> + tmpend);
> + if (ret)
> + mlog_errno(ret);
> + }
>   
>   if (tmpend < end) {
>   /*


___
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel

Re: [Ocfs2-devel] what MLE wants to do?

2016-05-03 Thread Srinivas Eeda

In simple terms, mle's get into life at the beginning of the lock
mastery and ends at the end of lock mastery(either new lockres or during
lockres migration). It's purpose is to handle the race when more than
one node try to master or find master of a lockres.

On 05/03/2016 02:23 AM, Gechangwei wrote:

Hi OCFS2 experts,

Recently, I am working hard on OCFS2 DLM related code. I think it is
an essential part of OCFS2 which must be mastered.

But I was confused with MLE (stands for master list entry) related
procedure.

Would you please introduce me some study material covering MLE or just
directly give me a clue on what MLE wants to do briefly via an email.

Thanks a lot.

Br.

Gechangwei

H3C Technologies Co., Limited

> --

> -

-
本邮件及其附件含有杭州华三通信技术有限公司的保密信息，仅限于发送给上面
地址中列出
的个人或群组。禁止任何其他人以任何形式使用（包括但不限于全部或部分地泄
露、复制、
或散发）本邮件中的信息。如果您错收了本邮件，请您立即电话或邮件通知发件
人并删除本

邮件！
This e-mail and its attachments contain confidential information from
H3C, which is
intended only for the person or entity whose address is listed above.
Any use of the
information contained herein in any way (including, but not limited
to, total or partial
disclosure, reproduction, or dissemination) by persons other than the
intended
recipient(s) is prohibited. If you receive this e-mail in error,
please notify the sender

by phone or email immediately and delete it!

___
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel

Re: [Ocfs2-devel] [patch 19/25] ocfs2: o2hb: add negotiate timer

2016-03-28 Thread Srinivas Eeda

On 03/28/2016 05:41 AM, jiangyiwen wrote:
> On 2016/3/28 9:23, Junxiao Bi wrote:
>> Hi Yiwen,
>>
>> On 03/26/2016 10:54 AM, jiangyiwen wrote:
>>> Hi, Junxiao
>>> This patch may have a problem. That is journal of every nodes become
>>> abort when storage down, and then when storage up, because journal
>>> has become abort, all of operations of metadata will fail. So how to
>>> restore environment? panic or reset? how to trigger?
>> Journal aborted means io error was returned by storage, right?
>> If so, o2hb_thread should also get io error, in this case, nego process
>> will be bypassed, and nodes will be fenced at last, see "[patch 23/25]
>> ocfs2: o2hb: don't negotiate if last hb fail".
>>
>> Thanks,
>> Junxiao.
>>> Thanks,
>>> Yiwen Jiang.
>>
>> .
>>
> yes, you are right, sorry I don't see this patch before.
>
> But I understand the results of storage down should return IO error
> rather than getting hang.
that is upto the driver or storage. If they return I/O error, o2hb i/o's 
will get i/o errors and will be addressed accordingly. These changes are 
made for cases where storage is not yet responding due to head 
failovers/recovery on the storage array.

>
> Thanks,
> Yiwen Jiang.
>
>
> ___
> Ocfs2-devel mailing list
> Ocfs2-devel@oss.oracle.com
> https://oss.oracle.com/mailman/listinfo/ocfs2-devel


___
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel

Re: [Ocfs2-devel] [PATCH] ocfs2: fix SGID not inherited issue

2015-12-07 Thread Srinivas Eeda

Thanks Junxiao!

Acked-by: Srinivas Eeda <srinivas.e...@oracle.com>

On 12/06/2015 08:09 PM, Junxiao Bi wrote:
> commit 8f1eb48758aa ("ocfs2: fix umask ignored issue") introduced an issue,
> SGID of sub dir was not inherited from its parents dir. It is because SGID
> is set into "inode->i_mode" in ocfs2_get_init_inode(), but is overwritten
> by "mode" which don't have SGID set later.
>
> Fixes: 8f1eb48758aa ("ocfs2: fix umask ignored issue")
> Signed-off-by: Junxiao Bi <junxiao...@oracle.com>
> Cc: <sta...@vger.kernel.org>
> ---
>   fs/ocfs2/namei.c |4 +---
>   1 file changed, 1 insertion(+), 3 deletions(-)
>
> diff --git a/fs/ocfs2/namei.c b/fs/ocfs2/namei.c
> index a03f6f4..3123408 100644
> --- a/fs/ocfs2/namei.c
> +++ b/fs/ocfs2/namei.c
> @@ -367,13 +367,11 @@ static int ocfs2_mknod(struct inode *dir,
>   goto leave;
>   }
>   
> - status = posix_acl_create(dir, , _acl, );
> + status = posix_acl_create(dir, >i_mode, _acl, );
>   if (status) {
>   mlog_errno(status);
>   goto leave;
>   }
> - /* update inode->i_mode after mask with "umask". */
> - inode->i_mode = mode;
>   
>   handle = ocfs2_start_trans(osb, ocfs2_mknod_credits(osb->sb,
>   S_ISDIR(mode),


___
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel

Re: [Ocfs2-devel] [PATCH v2 2/4] ocfs2: sysfile interfaces for online file check

2015-11-24 Thread Srinivas Eeda

On 11/24/2015 01:46 PM, Mark Fasheh wrote:
> On Tue, Nov 03, 2015 at 04:20:27PM +0800, Junxiao Bi wrote:
>> Hi Gang,
>>
>> On 11/03/2015 03:54 PM, Gang He wrote:
>>> Hi Junxiao,
>>>
>>> Thank for your reviewing.
>>> Current design, we use a sysfile as a interface to check/fix a file (via 
>>> pass a ino number).
>>> But, this operation is manually triggered by user, instead of automatically 
>>>  fix in the kernel.
>>> Why?
>>> 1) we should let users make this decision, since some users do not want to 
>>> fix when encountering a file system corruption, maybe they want to keep the 
>>> file system unchanged for a further investigation.
>> If user don't want this, they should not use error=continue option, let
>> fs go after a corruption is very dangerous.
> Maybe we need another errors=XXX flag (maybe errors=fix)?
Great idea Mark! I think adding errors=fix would be a good way to 
address both concerns :) It gives some control if anyone is 
uncomfortable of things getting checked/fixed automatically.

>
> You both make good points, here's what I gather from the conversation:
>
>   - Some customers would be sad if they have to manually fix corruptions.
> This takes effort on their part, and if the FS can handle it
> automatically, it should.
>
>   - There are valid concerns that automatically fixing things is a change in
> behavior that might not be welcome, or worse might lead to unforseeable
> circumstances.
>
>   - I will add that fixing things automatically implies checking them
> automatically which could introduce some performance impact depending on
> how much checking we're doing.
>
> So if the user wants errors to be fixed automatically, they could mount with
> errros=fix, and everyone else would have no change in behavior unless they
> wanted to make use of the new feature.
>
>
>>> 2) frankly speaking, this feature will probably bring a second corruption 
>>> if there is some error in the code, I do not suggest to use automatically 
>>> fix by default in the first version.
>> I think if this feature could bring more corruption, then this should be
>> fixed first.
> Btw, I am pretty sure that Gang is referring to the feature being new and
> thus more likely to have problems. There is nothing I see in here that is
> file system corrupting.
>   --Mark
>
>
> --
> Mark Fasheh
>
> ___
> Ocfs2-devel mailing list
> Ocfs2-devel@oss.oracle.com
> https://oss.oracle.com/mailman/listinfo/ocfs2-devel


___
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel

Re: [Ocfs2-devel] [PATCH v2 0/4] Add online file check feature

2015-10-29 Thread Srinivas Eeda

Hi Gang,

thanks for pointing to explanation of the feature.

What I am curious about is ... what were the real cases that you came 
across prompted this change and how this change would help in that case.

Thanks,
--Srini


On 10/28/2015 09:44 PM, Gang He wrote:
> Hello Srini,
>
> There is a doc about ocfs2 online file check.
>
> OCFS2 online file check
> ---
>
> This document will describe OCFS2 online file check feature.
>
> Introduction
> 
> OCFS2 is often used in high-availaibility systems. However, OCFS2 usually
> converts the filesystem to read-only on errors. This may not be necessary, 
> since
> turning the filesystem read-only would affect other running processes as well,
> decreasing availability. Then, a mount option (errors=continue) was 
> introduced,
> which would return the EIO to the calling process and terminate furhter
> processing so that the filesystem is not corrupted further. So,the filesystem 
> is
> not converted to read-only, and the problematic file's inode number is 
> reported
> in the kernel log so that the user can try to check/fix this file via online
> filecheck feature.
>
> Scope
> =
> This effort is to check/fix small issues which may hinder day-to-day 
> operations
> of a cluster filesystem by turning the filesystem read-only. The scope of
> checking/fixing is at the file level, initially for regular files and 
> eventually
> to all files (including system files) of the filesystem.
>
> In case of directory to file links is incorrect, the directory inode is
> reported as erroneous.
>
> This feature is not suited for extravagant checks which involve dependency of
> other components of the filesystem, such as but not limited to, checking if 
> the
> bits for file blocks in the allocation has been set. In case of such an error,
> the offline fsck should/would be recommended.
>
> Finally, such an operation/feature should not be automated lest the filesystem
> may end up with more damage than before the repair attempt. So, this has to
> be performed using user interaction and consent.
>
> User interface
> ==
> When there are errors in the OCFS2 filesystem, they are usually accompanied
> by the inode number which caused the error. This inode number would be the
> input to check/fix the file.
>
> There is a sysfs file for each OCFS2 file system mounting:
>
>/sys/fs/ocfs2//filecheck
>
> Here,  indicates the name of OCFS2 volumn device which has been 
> already
> mounted. The file above would accept inode numbers. This could be used to
> communicate with kernel space, tell which file(inode number) will be checked 
> or
> fixed. Currently, three operations are supported, which includes checking
> inode, fixing inode and setting the size of result record history.
>
> 1. If you want to know what error exactly happened to  before fixing, 
> do
>
># echo "CHECK " > /sys/fs/ocfs2//filecheck
># cat /sys/fs/ocfs2//filecheck
>
> The output is like this:
>INOTYPEDONEERROR
> 39502 0   1   GENERATION
>
>  lists the inode numbers.
>  is what kind of operation you've done, 0 for inode check,1 for inode 
> fix.
> indicates whether the operation has been finished.
>  says what kind of errors was found. For the details, please refer to 
> the
> file linux/fs/ocfs2/filecheck.h.
>
> 2. If you determine to fix this inode, do
>
># echo "FIX " > /sys/fs/ocfs2//filecheck
># cat /sys/fs/ocfs2//filecheck
>
> The output is like this:
>INOTYPEDONEERROR
> 39502 1   1   SUCCESS
>
> This time, the  column indicates whether this fix is successful or not.
>
> 3. The record cache is used to store the history of check/fix result. Its
> defalut size is 10, and can be adjust between the range of 10 ~ 100. You can
> adjust the size like this:
>
># echo "SET " > /sys/fs/ocfs2//filecheck
>
> Fixing stuff
> 
> On receivng the inode, the filesystem would read the inode and the
> file metadata. In case of errors, the filesystem would fix the errors
> and report the problems it fixed in the kernel log. As a precautionary 
> measure,
> the inode must first be checked for errors before performing a final fix.
>
> The inode and the fix history will be maintained temporarily in a
> small linked list buffer which would contain the last (N) inodes
> fixed/checked, along with the logs of what errors were reported/fixed.
>
> Thanks
> Gang
>
>
>> Hi Gang,
>>
>> thank you for implementing this. I would like to understand this better
>> on where and how it helps ... would you mind sharing couple
>> examples(real scenarios).
>>
>> Thanks,
>> --Srini
>>
>>
>> On 10/27/2015 11:25 PM, Gang He wrote:
>>> When there are errors in the ocfs2 filesystem,
>>> they are usually accompanied by the inode number which caused the error.
>>> This inode number would be the input to fixing the file.
>>> One of

Re: [Ocfs2-devel] [PATCH v2 0/4] Add online file check feature

2015-10-28 Thread Srinivas Eeda

Hi Gang,

thank you for implementing this. I would like to understand this better 
on where and how it helps ... would you mind sharing couple 
examples(real scenarios).

Thanks,
--Srini


On 10/27/2015 11:25 PM, Gang He wrote:
> When there are errors in the ocfs2 filesystem,
> they are usually accompanied by the inode number which caused the error.
> This inode number would be the input to fixing the file.
> One of these options could be considered:
> A file in the sys filesytem which would accept inode numbers.
> This could be used to communication back what has to be fixed or is fixed.
> You could write:
> $# echo "CHECK " > /sys/fs/ocfs2/devname/filecheck
> or
> $# echo "FIX " > /sys/fs/ocfs2/devname/filecheck
>
> Compare with first version, I use strncasecmp instead of double strncmp
> functions. Second, update the source file contribution vendor.
>
> Gang He (4):
>ocfs2: export ocfs2_kset for online file check
>ocfs2: sysfile interfaces for online file check
>ocfs2: create/remove sysfile for online file check
>ocfs2: check/fix inode block for online file check
>
>   fs/ocfs2/Makefile  |   3 +-
>   fs/ocfs2/filecheck.c   | 566 
> +
>   fs/ocfs2/filecheck.h   |  48 +
>   fs/ocfs2/inode.c   | 196 -
>   fs/ocfs2/inode.h   |   3 +
>   fs/ocfs2/ocfs2_trace.h |   2 +
>   fs/ocfs2/stackglue.c   |   3 +-
>   fs/ocfs2/stackglue.h   |   2 +
>   fs/ocfs2/super.c   |   5 +
>   9 files changed, 820 insertions(+), 8 deletions(-)
>   create mode 100644 fs/ocfs2/filecheck.c
>   create mode 100644 fs/ocfs2/filecheck.h
>


___
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel

Re: [Ocfs2-devel] [Ocfs2-test-devel] OCFS2 test project current status ANNOUNCEMENT

2015-10-09 Thread Srinivas Eeda

Thank you all, for all you contributions in fixing ocfs2 test suite, 
nice effort :)


On 10/08/2015 11:09 PM, Junxiao Bi wrote:
> Hi Eric,
>
> Thank you for reviewing patches and make a summary of the project. Let
> put more effort to make it better in the future.
>
> Thanks,
> Junxiao.
>
> On 10/09/2015 01:12 PM, Eric Ren wrote:
>> Hi all,
>>
>> Now, we have already collected 55 patches into Mark's github repo[1]
>> since Mar 16, 2012.
>> [1] https://github.com/markfasheh/ocfs2-test
>>
>> Thanks a lot for their contributors:
>>  Junxiao Bi: 23
>>  Jeff Liu: 11
>>  Tiger Yang: 12
>>  Goldwynr: 5
>>  Eric Ren: 3
>>  Gang He: 1
>>
>> This patches below are dropped, need improved, or conflicted with
>> previous ones:
>>
>> [PATCH 09/10] Trigger udev rescan to get volume label that mkfs made 
>> remotely.   -- relate to blkid.conf, needs improved
>> [PATCH 42/59] mmap-test: fix segment fault  -- needs improved
>> [PATCH 48/59] ocfs2-test: fix build error   -- conflicted, dropped
>> [PATCH 45/59] flock-test: disable posix lock test   -- needs improved
>> [PATCH 53/59] discontig: fix remote mount   -- conflicted, dropped
>> [PATCH 32/59] xattr-test: fix single test failed-- conflicted, dropped
>> [PATCH 17/59] Splice: Fix conflict definition of splice(2) for splice tests. 
>>-- conflicted, dropped
>> [PATCH 13/59] aio-stress: fix aio-stress Makefile to link pthread 
>> explicitly.   -- conflicted, dropped
>>
>> There might be some incorrectness in my conclusion. For details, please
>> refer to ocfs2-test-devel mailist!
>>
>> Test project is very important to ensure the quality of ocfs2. Welcome
>> more contributions from you! Let's make it better;-)
>>
>> Thanks,
>> Eric
>>
>> ___
>> Ocfs2-devel mailing list
>> Ocfs2-devel@oss.oracle.com
>> https://oss.oracle.com/mailman/listinfo/ocfs2-devel
>>
>
> ___
> Ocfs2-test-devel mailing list
> ocfs2-test-de...@oss.oracle.com
> https://oss.oracle.com/mailman/listinfo/ocfs2-test-devel


___
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel

Re: [Ocfs2-devel] ocfs2 version issue

2015-09-01 Thread Srinivas Eeda

Hi Goldwyn

On 09/01/2015 03:32 AM, Goldwyn Rodrigues wrote:
> Hi Junxiao,
>
> On 08/31/2015 09:22 PM, Junxiao Bi wrote:
>> Hi Goldwyn,
>>
>> Ocfs2 kernel version is removed from commit
>> ff8fb335221e2c446b0d4cbea26be371fd2feb64 ("ocfs2: remove versioning
>> information"), but Oracle CRS depends on this version, and this made
>> Oracle CRS can't be installed. So i think we should revert this commit
>> and sync this version with ocfs2-tools. Do you have any other concern
>> about this?
> We removed the version because it did not make sense.
I am not sure how other kernel modules work, but for me a OCFS2 
maintaining it's own version is a good idea :). This version number 
defines the feature-set it currently supports. It also allows a 
feature-set to be backported to older kernels(if need arises).

>   Even if we put in
> the version back we will have to maintain it and will have a similar
> case where the version number in the kernel is way behind the tools
> versions because of lack of updating. Add to it the confusion of users
> who would not know which version to quote.
I agree that this got out of sync so probably we should fix that part 
than to remove it all together. I also don't see why tools version has 
to dictate kernel module version or vise-versa. Isn't it practical for a 
kernel module to implement a new feature-set but doesn't require any 
tools changes or vise-versa ? ;)

> I suggest the Oracle CRS should be modified to use the kernel version as
> opposed to the ocfs2 kernel module specific versions.
It may not be just Oracle CRS ... there may be other applications that 
might be using this. The problem is that we don't want a customer to 
upgrade a kernel and run into a failure which just looks bad on 
kernel/filesystem :(.  One may argue to write notes/readme's to inform 
customers but it's not practical and is a painful for some customers to 
update application software and kernel at the same time.

Anyway, if we decide to remove the versioning forever then it may be a 
good idea to deprecate it first so applications have enough time to 
incorporate these changes. But my vote would be for ocfs2 to maintain 
it's own version number :)


___
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel

Re: [Ocfs2-devel] [RFC] ocfs2: Idea to make ocfs2_search_chain high efficiency

2015-08-24 Thread Srinivas Eeda

Hi Norton,

while the localalloc bmp is enabled the chances of a particular gd 
ending with zero free bits is very minimal. local alloc bmp will pick 
next gd once min number of free bits falls below localalloc bmp size. So 
next gd is picked while the current gd still has free space.

Having said that, I understand the problem you are describing. Since we 
only move the gd with more free bits ahead of the previous unusable gd 
we end of rescanning unusable gds whenever the newly picked gd becomes 
unusable. This is inefficient and causes performance problem when fs 
gets fragmented

I am not sure if we will be allowed to use the bg_reserved2 ... but if 
we are then we could use it to remember  next gd we should look at.

For example, if we assume there are 'N' GDs and we are currently seeing 
that GD10 has more free space ... so the order of GDs are

GD10, GD1, GD2, ... GDN.

Now if we end up using most of the space in GD10, according to current 
logic we re-read GD1, GD2.. GD9 and then read GD11 and pick that 
(assuming it has free space). But instead if we had GD10 remembers GD11 
as the next one to look at ... then we can skip reading GD1..GD9.

Thanks,
--Srini


On 08/24/2015 05:30 AM, Norton.Zhu wrote:
 In ocfs2_search_chain, I found it has low efficiency  while searching an 
 available gd in some circumstances:
 1) The lun has a great many gd(it reads lots of unavailable gd(free bits is 
 zero));
 2) Not too many gd, but the available gd is scattered in the unavailable 
 gd(fragmentation);

 So I have an idea to optimize the search method:
 1) Use the reserved member in the ocfs2_group_desc to make an available chain 
 list(gds in the list are all available, free bits more than zero);
 2) At the beginning, the chain list is the same with origin chain list;
 3) While do allocation, it searches gd in the available chain list;
 4) After each allocation, if some gd's free bits is zero, Remove it from the 
 available chain list;
 5) After each reclaim, if some gd's free bits change from zero to positive, 
 Append it to the head of the available chain list;

 Once started with the basics outlined above, no unavailable gd will be read.

 Anyone has better ideas or advices?


 ___
 Ocfs2-devel mailing list
 Ocfs2-devel@oss.oracle.com
 https://oss.oracle.com/mailman/listinfo/ocfs2-devel


___
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel

Re: [Ocfs2-devel] [PATCH] ocfs2: optimize error handling in dlm_request_join

2015-08-20 Thread Srinivas Eeda

On 08/20/2015 04:50 AM, Norton.Zhu wrote:
 Currently error handling in dlm_request_join is a little obscure.
 So optimize it to promote readability.

 Signed-off-by: Norton.Zhu norton@huawei.com
 ---
   fs/ocfs2/dlm/dlmdomain.c | 69 
 ++--
   1 file changed, 37 insertions(+), 32 deletions(-)

 diff --git a/fs/ocfs2/dlm/dlmdomain.c b/fs/ocfs2/dlm/dlmdomain.c
 index 7df88a6..af4f7aa 100644
 --- a/fs/ocfs2/dlm/dlmdomain.c
 +++ b/fs/ocfs2/dlm/dlmdomain.c
 @@ -1465,39 +1465,44 @@ static int dlm_request_join(struct dlm_ctxt *dlm,
   if (status == -ENOPROTOOPT) {
   status = 0;
   *response = JOIN_OK_NO_MAP;
 - } else if (packet.code == JOIN_DISALLOW ||
 -packet.code == JOIN_OK_NO_MAP) {
 - *response = packet.code;
 - } else if (packet.code == JOIN_PROTOCOL_MISMATCH) {
 - mlog(ML_NOTICE,
 -  This node requested DLM locking protocol %u.%u and 
 -  filesystem locking protocol %u.%u.  At least one of 
 -  the protocol versions on node %d is not compatible, 
 -  disconnecting\n,
 -  dlm-dlm_locking_proto.pv_major,
 -  dlm-dlm_locking_proto.pv_minor,
 -  dlm-fs_locking_proto.pv_major,
 -  dlm-fs_locking_proto.pv_minor,
 -  node);
 - status = -EPROTO;
 - *response = packet.code;
 - } else if (packet.code == JOIN_OK) {
 - *response = packet.code;
 - /* Use the same locking protocol as the remote node */
 - dlm-dlm_locking_proto.pv_minor = packet.dlm_minor;
 - dlm-fs_locking_proto.pv_minor = packet.fs_minor;
 - mlog(0,
 -  Node %d responds JOIN_OK with DLM locking protocol 
 -  %u.%u and fs locking protocol %u.%u\n,
 -  node,
 -  dlm-dlm_locking_proto.pv_major,
 -  dlm-dlm_locking_proto.pv_minor,
 -  dlm-fs_locking_proto.pv_major,
 -  dlm-fs_locking_proto.pv_minor);
   } else {
 - status = -EINVAL;
 - mlog(ML_ERROR, invalid response %d from node %u\n,
 -  packet.code, node);
 + *response = packet.code;
Norton, it looks much better :)

one minor comment. we don't want to reset *response with packet.code 
if it's unrecognized. We should leave the value to JOIN_DISALLOW;

rest looks good.

 + switch (packet.code) {
 + case JOIN_DISALLOW:
 + case JOIN_OK_NO_MAP:
 + break;
 + case JOIN_PROTOCOL_MISMATCH:
 + mlog(ML_NOTICE,
 +  This node requested DLM locking protocol %u.%u 
 and 
 +  filesystem locking protocol %u.%u.  At least one 
 of 
 +  the protocol versions on node %d is not 
 compatible, 
 +  disconnecting\n,
 +  dlm-dlm_locking_proto.pv_major,
 +  dlm-dlm_locking_proto.pv_minor,
 +  dlm-fs_locking_proto.pv_major,
 +  dlm-fs_locking_proto.pv_minor,
 +  node);
 + status = -EPROTO;
 + break;
 + case JOIN_OK:
 + /* Use the same locking protocol as the remote node */
 + dlm-dlm_locking_proto.pv_minor = packet.dlm_minor;
 + dlm-fs_locking_proto.pv_minor = packet.fs_minor;
 + mlog(0,
 +  Node %d responds JOIN_OK with DLM locking 
 protocol 
 +  %u.%u and fs locking protocol %u.%u\n,
 +  node,
 +  dlm-dlm_locking_proto.pv_major,
 +  dlm-dlm_locking_proto.pv_minor,
 +  dlm-fs_locking_proto.pv_major,
 +  dlm-fs_locking_proto.pv_minor);
 + break;
 + default:
 + status = -EINVAL;
 + mlog(ML_ERROR, invalid response %d from node %u\n,
 +  packet.code, node);
 + break;
 + }
   }

   mlog(0, status %d, node %d response is %d\n, status, node,


___
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel

Re: [Ocfs2-devel] ocfs2: should not use le32_add_cpu to set ocfs2_dinode i_flags

2015-01-21 Thread Srinivas Eeda

Ok, understood. Thanks for explaining :)


On 01/21/2015 05:07 PM, Joseph Qi wrote:
 Hi Srini,

 On 2015/1/22 1:55, Srinivas Eeda wrote:
 Hi Joesph,

 thanks a lot for submitting the above patch. I am trying to understand what 
 kind of flag corruption have you noticed and under what circumstances ? I 
 agree with the patch that bitwise operations are better than adding, but I 
 am not able to understand the corruption it could cause. Can you please 
 share!

 Suppose a the following case:
 The dinode i_flag already has the OCFS2_ORPHANED_FL bit set, and calling
 le32_add_cpu again. Then it will corrupt the i_flag.

 --
 Joseph

 Thanks,
 --Srini






___
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel

Re: [Ocfs2-devel] Git repos?

2015-01-08 Thread Srinivas Eeda

git repos for ocfs2 kernel code is linux.git and linux-next.git on 
kernel.org
For tools it's git://oss.oracle.com/git/ocfs2-tools.git

On 01/08/2015 10:43 AM, Jonathan Fraser wrote:
 Hi,

 Sorry for the newbe question, but where are the git repos for the staged
 changes?  I've searched high and low. I'd like to test with some of the
 recent changes.

 Thanks,
 Jon Fraser
 ___
 Ocfs2-devel mailing list
 Ocfs2-devel@oss.oracle.com
 https://oss.oracle.com/mailman/listinfo/ocfs2-devel


___
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel

Re: [Ocfs2-devel] ocfs2-tools repository and documentation

2014-12-15 Thread Srinivas Eeda

Hi Goldwyn, Germano

I have updated ocfs2-tools git repo with latest fixes :)

Germano,

following documentation is available for OCFS2

http://docs.oracle.com/cd/E37670_01/E37355/html/ol_ocfs2.html
http://www.oracle.com/us/technologies/linux/025995.htm
https://oss.oracle.com/projects/ocfs2/dist/documentation/v1.8/ocfs2-1_8_2-manpages.pdf
https://oss.oracle.com/projects/ocfs2/documentation/


On 12/15/2014 10:47 AM, Germano Percossi wrote:
 Hi Goldwyn,

 Thanks for your answer. I will have a look at your code.
 I just want to be sure that the source code I am looking at
 is the latest one to avoid fixing things that maybe have been already
 fixed elsewhere.

 Same is true for documentation and package repositories.
 If I could have links to the latest, no matter how old and not
 maintained, documentation and repository, it would be helpful,
 as well.

 Do you know of any other distro, besides Suse, maintaining a stack
 of patches?

 Cheers,
 Germano

 On 14/12/14 07:24, Goldwyn Rodrigues wrote:
 Yes, if you don't count the last 2 patches, it is more close to 3 years
 now ;)

 Since the patches were not being updated. I started maintaining an
 alternate repository where I am putting all the bugs reported at SUSE:

 https://github.com/goldwynr/ocfs2-tools

 Branch suse-fixes has the fixes found by SUSE over the upstream branch.

 Branch nocontrold has the patches for the feature of doing away with
 ocfs2_controld to work with the latest corosync/pacemaker stack. Patches
 for the kernel are already in the kernel but the ones in the tools need
 some review.

 I had a mail conversation with Srini and he has promised to update the
 upstream branch soon.

 Regards,


 ___
 Ocfs2-devel mailing list
 Ocfs2-devel@oss.oracle.com
 https://oss.oracle.com/mailman/listinfo/ocfs2-devel


___
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel

Re: [Ocfs2-devel] [PATCH v2] ocfs2: o2net: fix connect expired

2014-10-31 Thread Srinivas Eeda

looks good. Thanks for your explanation and fix

Reviewed-by: Srinivas Eeda srinivas.e...@oracle.com

On 10/30/2014 11:08 PM, Junxiao Bi wrote:
 Set nn_persistent_error to -ENOTCONN will stop reconnect since the
 stop condition in o2net_start_connect() will be true.

 stop = (nn-nn_sc ||
   (nn-nn_persistent_error 
   (nn-nn_persistent_error != -ENOTCONN || timeout == 0)));

 This will make connection never be established if the first connection request
 is lost.

 Set nn_persistent_error to 0 when connect expired to fix this. With this
 changes, dlm will not be waken up when connect expired, this is OK since
 dlm depends on network, dlm can do nothing in this case if waken up. Let
 it wait there for network recover and connect built again to continue.

 Signed-off-by: Junxiao Bi junxiao...@oracle.com
 ---
   fs/ocfs2/cluster/tcp.c |2 +-
   1 file changed, 1 insertion(+), 1 deletion(-)

 diff --git a/fs/ocfs2/cluster/tcp.c b/fs/ocfs2/cluster/tcp.c
 index 97de0fb..4d6b645 100644
 --- a/fs/ocfs2/cluster/tcp.c
 +++ b/fs/ocfs2/cluster/tcp.c
 @@ -1736,7 +1736,7 @@ static void o2net_connect_expired(struct work_struct 
 *work)
o2net_idle_timeout() / 1000,
o2net_idle_timeout() % 1000);
   
 - o2net_set_nn_state(nn, NULL, 0, -ENOTCONN);
 + o2net_set_nn_state(nn, NULL, 0, 0);
   }
   spin_unlock(nn-nn_lock);
   }


___
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel

Re: [Ocfs2-devel] [PATCH] ocfs2: o2net: fix connect expired

2014-10-30 Thread Srinivas Eeda

Hi Junxiao,

I am hesitant to make changes to o2net_set_nn_state calls as it is a 
sensitive code. Also, currently, nn_timeout is only used to determine if 
a reconnect needs to be allowed or not. So, lets rename that to 
nn_reconnect and use it for that purpose.

With your change, o2net_connect_expired now doesn't pass -ENOTCONN when 
a network hb timeout happens. So o2net_set_nn_state now fails to wake_up 
any threads stuck on nn_sc_wq(check o2net_send_message_vec function) ?

Thanks,
--Srini



On 10/30/2014 04:58 AM, Junxiao Bi wrote:
 - srinivas.e...@oracle.com wrote:

 Hi Junxiao,

 thanks for explaining. For this case allowing a reconnect (setting
 atomic_set(nn-nn_timeout, 1); ) in o2net_connect_expired should
 work ?
 Hi Srini,

 Yes, that should also work for this case. But it breaks the usage of 
 nn_timeout, it is only used for idle timeout.

 Thanks,
 Junxiao.
 Thanks,
 --Srini


 On 10/29/2014 10:32 PM, Junxiao Bi wrote:
 Hi Srini,

 On 10/30/2014 01:16 PM, Srinivas Eeda wrote:
 Junxiao,

 can you please describe under what circumstances you saw this
 problem?
 My understanding is o2net_connect_expired is only queued when
 connection
 actually broke and ENOTCONN is the right error in that case.
 This happened when o2net was issuing the first connect request to
 some
 node, but the request packet is lost due to some error like network
 broken, then the connect would be expired, in
 o2net_connect_expired()
 that set nn-nn_persistent_error to -ENOTCONN but timeout was zero,
 so
 o2net_start_connect() would return without sending another connect
 request, connection to the node will never be built.

 Thanks,
 Junxiao.

 Thanks,
 --Srini

 On 10/29/2014 06:41 PM, Junxiao Bi wrote:
 Set nn_persistent_error to -ENOTCONN will stop reconnect since
 the
 stop condition in o2net_start_connect() will be true.

 stop = (nn-nn_sc ||
   (nn-nn_persistent_error 
   (nn-nn_persistent_error != -ENOTCONN || timeout ==
 0)));
 This will make connection never be established if the first
 connection
 request
 is lost.

 Signed-off-by: Junxiao Bi junxiao...@oracle.com
 ---
 fs/ocfs2/cluster/tcp.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

 diff --git a/fs/ocfs2/cluster/tcp.c b/fs/ocfs2/cluster/tcp.c
 index 97de0fb..4d6b645 100644
 --- a/fs/ocfs2/cluster/tcp.c
 +++ b/fs/ocfs2/cluster/tcp.c
 @@ -1736,7 +1736,7 @@ static void o2net_connect_expired(struct
 work_struct *work)
  o2net_idle_timeout() / 1000,
  o2net_idle_timeout() % 1000);
 -o2net_set_nn_state(nn, NULL, 0, -ENOTCONN);
 +o2net_set_nn_state(nn, NULL, 0, 0);
 }
 spin_unlock(nn-nn_lock);
 }


___
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel

Re: [Ocfs2-devel] [PATCH] ocfs2: o2net: fix connect expired

2014-10-29 Thread Srinivas Eeda

Junxiao,

can you please describe under what circumstances you saw this problem? 
My understanding is o2net_connect_expired is only queued when connection 
actually broke and ENOTCONN is the right error in that case.

Thanks,
--Srini

On 10/29/2014 06:41 PM, Junxiao Bi wrote:
 Set nn_persistent_error to -ENOTCONN will stop reconnect since the
 stop condition in o2net_start_connect() will be true.

 stop = (nn-nn_sc ||
   (nn-nn_persistent_error 
   (nn-nn_persistent_error != -ENOTCONN || timeout == 0)));

 This will make connection never be established if the first connection request
 is lost.

 Signed-off-by: Junxiao Bi junxiao...@oracle.com
 ---
   fs/ocfs2/cluster/tcp.c |2 +-
   1 file changed, 1 insertion(+), 1 deletion(-)

 diff --git a/fs/ocfs2/cluster/tcp.c b/fs/ocfs2/cluster/tcp.c
 index 97de0fb..4d6b645 100644
 --- a/fs/ocfs2/cluster/tcp.c
 +++ b/fs/ocfs2/cluster/tcp.c
 @@ -1736,7 +1736,7 @@ static void o2net_connect_expired(struct work_struct 
 *work)
o2net_idle_timeout() / 1000,
o2net_idle_timeout() % 1000);
   
 - o2net_set_nn_state(nn, NULL, 0, -ENOTCONN);
 + o2net_set_nn_state(nn, NULL, 0, 0);
   }
   spin_unlock(nn-nn_lock);
   }


___
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel

Re: [Ocfs2-devel] [PATCH] ocfs2: o2net: fix connect expired

2014-10-29 Thread Srinivas Eeda

Hi Junxiao,

thanks for explaining. For this case allowing a reconnect (setting 
atomic_set(nn-nn_timeout, 1); ) in o2net_connect_expired should work ?

Thanks,
--Srini


On 10/29/2014 10:32 PM, Junxiao Bi wrote:
 Hi Srini,

 On 10/30/2014 01:16 PM, Srinivas Eeda wrote:
 Junxiao,

 can you please describe under what circumstances you saw this problem?
 My understanding is o2net_connect_expired is only queued when connection
 actually broke and ENOTCONN is the right error in that case.
 This happened when o2net was issuing the first connect request to some
 node, but the request packet is lost due to some error like network
 broken, then the connect would be expired, in o2net_connect_expired()
 that set nn-nn_persistent_error to -ENOTCONN but timeout was zero, so
 o2net_start_connect() would return without sending another connect
 request, connection to the node will never be built.

 Thanks,
 Junxiao.

 Thanks,
 --Srini

 On 10/29/2014 06:41 PM, Junxiao Bi wrote:
 Set nn_persistent_error to -ENOTCONN will stop reconnect since the
 stop condition in o2net_start_connect() will be true.

 stop = (nn-nn_sc ||
  (nn-nn_persistent_error 
  (nn-nn_persistent_error != -ENOTCONN || timeout == 0)));

 This will make connection never be established if the first connection
 request
 is lost.

 Signed-off-by: Junxiao Bi junxiao...@oracle.com
 ---
fs/ocfs2/cluster/tcp.c |2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

 diff --git a/fs/ocfs2/cluster/tcp.c b/fs/ocfs2/cluster/tcp.c
 index 97de0fb..4d6b645 100644
 --- a/fs/ocfs2/cluster/tcp.c
 +++ b/fs/ocfs2/cluster/tcp.c
 @@ -1736,7 +1736,7 @@ static void o2net_connect_expired(struct
 work_struct *work)
 o2net_idle_timeout() / 1000,
 o2net_idle_timeout() % 1000);
-o2net_set_nn_state(nn, NULL, 0, -ENOTCONN);
 +o2net_set_nn_state(nn, NULL, 0, 0);
}
spin_unlock(nn-nn_lock);
}


___
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel

[Ocfs2-devel] [PATCH 1/1] o2dlm: fix a race between purge and master query

2014-10-28 Thread Srinivas Eeda

Node A sends master query request to node B which is the master. At this time
lockres happens to be on purgelist. dlm_master_request_handler gets the dlm
spinlock, finds the resource and releases the dlm spin lock. Right at this
dlm_thread on this node could purge the lockres. dlm_master_request_handler
can then acquire lockres spinlock and reply to Node A that node B is the
master even though lockres on node B is purged.

The above scenario will now make node A falsely think node B is the master
which is inconsistent. Further if another node C tries to master the same
resource, every node will respond they are not the master. Node C then masters
the resource and sends assert master to all nodes. This will now make node A
crash with the following message.

dlm_assert_master_handler:1831 ERROR: DIE! Mastery assert from 9, but current
owner is 10!

Signed-off-by: Srinivas Eeda srinivas.e...@oracle.com
---
 fs/ocfs2/dlm/dlmmaster.c | 12 
 1 file changed, 12 insertions(+)

diff --git a/fs/ocfs2/dlm/dlmmaster.c b/fs/ocfs2/dlm/dlmmaster.c
index 215e41a..3689b35 100644
--- a/fs/ocfs2/dlm/dlmmaster.c
+++ b/fs/ocfs2/dlm/dlmmaster.c
@@ -1460,6 +1460,18 @@ way_up_top:
 
/* take care of the easy cases up front */
spin_lock(res-spinlock);
+
+   /*
+* Right after dlm spinlock was released, dlm_thread could have
+* purged the lockres. Check if lockres got unhashed. If so
+* start over.
+*/
+   if (hlist_unhashed(res-hash_node)) {
+   spin_unlock(res-spinlock);
+   dlm_lockres_put(res);
+   goto way_up_top;
+   }
+
if (res-state  (DLM_LOCK_RES_RECOVERING|
  DLM_LOCK_RES_MIGRATING)) {
spin_unlock(res-spinlock);
-- 
1.9.1


___
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel

Re: [Ocfs2-devel] [PATCH] [ocfs2]: drop(unlock) dentry before iput in ocfs2_do_drop_dentry_lock

2014-10-20 Thread Srinivas Eeda

This fix will cause problems for regular file unlinks. If the inode is 
not cleared on node 2, then both node 1 and node 2 will fail to get try 
open lock and fail to clear the orphan inodes.

I am assuming the deadlock you have seen is because of quota's enabled 
and fix e3a767b60fd8a9f5e133f42f4970cff77ec43173 should have avoided 
this deadlock by delaying the dropping of the dquot reference. That 
didn't happen ? Why else is downconvert task taking the inode lock ?

Thanks,
--Srini

On 10/19/2014 10:28 PM, Wengang Wang wrote:
 There is dead lock case like this(the Dentry is a dentry to the Inode):

 node 1node 2
 --- --

 granted inode lock

 requesting dentry lock
  responding to node 1 request on dentry lock,
   it calls ocfs2_do_drop_dentry_lock with the
   call trace:
 #4   wait_for_completion
 #5   __ocfs2_cluster_lock
 #6   ocfs2_inode_lock_full_nested
 #7   ocfs2_evict_inode
 #8   evict
 #9   iput
 #10  ocfs2_do_drop_dentry_lock

 You can see node 2 is requesting the inode lock before unlocking dentry lock
 thus a two nodes dead lock formed.

 The story is like this:

 Responding node 1's request, Node 2, ocfs2_dentry_convert_worker, returned
 UNBLOCK_STOP_POST which means don't downconvert but do post_unlock action.
 It was doing so because it thought it doesn't need to downconvert but will do
 an unlock instead in the post_unlock action ocfs2_dentry_post_unlock().
 By doing so it can save a clusted downconvert request for performace
 consideration. But in ocfs2_dentry_post_unlock() before unlock the dentry 
 lock,
 it was requesting the inode lock forming the deadlock expectedly.

 Fix:
 unlock dentry lock first before the requesting inode lock in
 ocfs2_do_drop_dentry_lock.

 Signed-off-by: Wengang Wang wen.gang.w...@oracle.com
 ---
   fs/ocfs2/dcache.c | 2 +-
   1 file changed, 1 insertion(+), 1 deletion(-)

 diff --git a/fs/ocfs2/dcache.c b/fs/ocfs2/dcache.c
 index e2e05a1..19a7d6c 100644
 --- a/fs/ocfs2/dcache.c
 +++ b/fs/ocfs2/dcache.c
 @@ -369,9 +369,9 @@ out_attach:
   static void ocfs2_drop_dentry_lock(struct ocfs2_super *osb,
  struct ocfs2_dentry_lock *dl)
   {
 - iput(dl-dl_inode);
   ocfs2_simple_drop_lockres(osb, dl-dl_lockres);
   ocfs2_lock_res_free(dl-dl_lockres);
 + iput(dl-dl_inode);
   kfree(dl);
   }
   


___
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel

Re: [Ocfs2-devel] Ocfs2-devel Digest, Vol 127, Issue 25

2014-10-09 Thread Srinivas Eeda

On 10/09/2014 12:53 AM, Joseph Qi wrote:
On 2014/10/9 15:16, Guozhonghua wrote:
Hi Joseph and Srinivas,

We had merged and test the two patches:
1.ocfs2: o2net: set tcp user timeout to max value
8e9801dfe37c9e68cdbfcd15988df2187191864e
2.ocfs2: o2net: don't shutdown connection when idle timeout
c43c363def04cdaed0d9e26dae846081f55714e7

They are works well as we shut down and up the Ethernet interface manually
and intervals to create the scenarios with shell scripts, the issues cat not
be recreated.
Thanks you for reviews and better advices.

There is another question.
As the node number rises to 32 or 128 in one cluster, we think the TCP keep
alive MSG interval should be make longer from 2 seconds to 10 seconds and
the idle timeout value should be 6ms or 9ms.
We think it can reduce the non-useful keep alive messages and improve the
performance of the TCP connection.
O2CB_IDLE_TIMEOUT_MS=3 to 9
O2CB_KEEPALIVE_DELAY_MS=2000 to 1

In my opinion, O2CB_IDLE_TIMEOUT_MS can be changed to suit your scenario.
But for O2CB_KEEPALIVE_DELAY_MS, I don't think you have to change it.
It will send keepalive packet only if there is no DLM messages between nodes.

We test the values and the changes works well.
As Joseph explained they both are tunables, you can run service o2cb
configure to change values. It falls down to how sensitive your
application want to be to detect network problems.

Can you please elaborate on what you meant by changes works well. How
did you test it and what you found ?

Is there any side effect? We are forward your better thoughts about that.
Thanks.

-
本邮件及其附件含有杭州华三通信技术有限公司的保密信息，仅限于发送给上面地址中列出
的个人或群组。禁止任何其他人以任何形式使用（包括但不限于全部或部分地泄露、复制、
或散发）本邮件中的信息。如果您错收了本邮件，请您立即电话或邮件通知发件人并删除本
邮件！
This e-mail and its attachments contain confidential information from H3C,
which is
intended only for the person or entity whose address is listed above. Any
use of the
information contained herein in any way (including, but not limited to,
total or partial
disclosure, reproduction, or dissemination) by persons other than the
intended
recipient(s) is prohibited. If you receive this e-mail in error, please
notify the sender
by phone or email immediately and delete it!

___
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel

Re: [Ocfs2-devel] One node hangs up issue requiring goog idea, thanks

2014-09-30 Thread Srinivas Eeda

Joseph,

yes I agree with you that the proposed change won't work for all cases. 
The only way to fix if we allow reconnects is to track all messages and 
replay the last few in case of reconnects.

However this problem shouldn't arise because of the following two fixes 
Junxiao put in. Can you guys review and comment on them ?

1.ocfs2: o2net: set tcp user timeout to max value
 8e9801dfe37c9e68cdbfcd15988df2187191864e

2.ocfs2: o2net: don't shutdown connection when idle timeout
 c43c363def04cdaed0d9e26dae846081f55714e7

The above will not close the socket connection in case of network hiccup 
and should avoid loss of messages.

Thanks,
--Srini


On 09/30/2014 12:21 AM, Joseph Qi wrote:
 I don't think this is right.
 There are two scenario when returns -ENOTCONN or -EHOSTDOWN:
 1) Connection down when sending the message;
 2) Connection down when responding the messages;
 So just resend the message may lead to remote handles twice.

 On 2014/9/26 20:06, Guozhonghua wrote:
 Hi, all,

 As we use OCFS2, the network is not good.
 When the converting request message can’t send to the another node, there 
 will be a node hangs up which will still waiting for the dlm.

 CAS2/logdir/var/log/syslog.1-6778-Sep 16 20:57:16 CAS2 kernel: 
 [516366.623623] o2net: Connection to node CAS1 (num 1) at 10.172.254.1:7100 
 has been idle for 30.87 secs, shutting it down.
 CAS2/logdir/var/log/syslog.1-6779-Sep 16 20:57:16 CAS2 kernel: 
 [516366.623631] o2net_idle_timer 1621: Local and remote node is 
 heartbeating, and try connect
 CAS2/logdir/var/log/syslog.1-6780-Sep 16 20:57:16 CAS2 kernel: 
 [516366.623792] o2net: No longer connected to node CAS1 (num 1) at 
 10.172.254.1:7100
 CAS2/logdir/var/log/syslog.1:6781:Sep 16 20:57:16 CAS2 kernel: 
 [516366.623881] (dlm_thread,5140,4):dlm_send_proxy_ast_msg:482 ERROR: 
 B258FD07DDD64710B68EB9683FD7D1B9: res M00046e0117, error 
 -112 send AST to node 1
 CAS2/logdir/var/log/syslog.1-6782-Sep 16 20:57:16 CAS2 kernel: 
 [516366.623900] (dlm_thread,5140,4):dlm_flush_asts:596 ERROR: status = -112
 CAS2/logdir/var/log/syslog.1-6783-Sep 16 20:57:16 CAS2 kernel: 
 [516366.623937] (dlm_thread,5140,4):dlm_send_proxy_ast_msg:482 ERROR: 
 B258FD07DDD64710B68EB9683FD7D1B9: res M0016260110, error 
 -107 send AST to node 1
 CAS2/logdir/var/log/syslog.1-6784-Sep 16 20:57:16 CAS2 kernel: 
 [516366.623946] (dlm_thread,5140,4):dlm_flush_asts:596 ERROR: status = -107
 CAS2/logdir/var/log/syslog.1-6785-Sep 16 20:57:16 CAS2 kernel: 
 [516366.623997] Connect node 1 OK, and set timeout 0
 CAS2/logdir/var/log/syslog.1-6786-Sep 16 20:57:17 CAS2 kernel: 
 [516367.623592] o2net: Connected to node CAS1 (num 1) at 10.172.254.1:7100

 debugfs: fs_locks -B
 Lockres: M00046e0117  Mode: Protected Read
 Flags: Initialized Attached Busy
 RO Holders: 0  EX Holders: 0
 Pending Action: Convert  Pending Unlock Action: None
 Requested Mode: Exclusive  Blocking Mode: No Lock
 PR  Gets: 318317  Fails: 0Waits (usec) Total: 128622  Max: 3
 EX  Gets: 706878  Fails: 0Waits (usec) Total: 284967  Max: 2
 Disk Refreshes: 0

 debugfs: dlm_locks M00046e0117
 Lockres: M00046e0117   Owner: 2State: 0x0
 Last Used: 0  ASTs Reserved: 0Inflight: 0Migration Pending: No
 Refs: 4Locks: 2On Lists: None
 Reference Map: 1
   Lock-Queue  Node  Level  Conv  Cookie   Refs  AST  BAST  
 Pending-Action
   Granted 1 PR -11:1952 No   NoNone
   Converting  2 PR EX2:1962 No   NoNone

 We reviews the code, and want to resend the dlm message to avoid it.

 The patch is required reviewing.
 The patch has been test when the network interface is shut down and up 
 manually to recreate the issue.
 If the TCP channel between two node set up within 5 seconds, resend msg 
 works well.

 We are forward to appreciate another better way to avoid it.
 Thanks.


 --- ocfs2/dlm/dlmthread.c 2014-06-07 10:40:09.0 +0800
 +++ ocfs2/dlm/dlmthread.c 2014-09-26 16:42:36.0 +0800
 @@ -517,6 +517,9 @@ static void dlm_flush_asts(struct dlm_ct
  struct dlm_lock_resource *res;
  u8 hi;

 +/* resend the msg again */
 +int send_times = 0;
 +
  spin_lock(dlm-ast_lock);
  while (!list_empty(dlm-pending_asts)) {
  lock = list_entry(dlm-pending_asts.next,
 @@ -539,9 +542,16 @@ static void dlm_flush_asts(struct dlm_ct
  spin_unlock(dlm-ast_lock);

  if (lock-ml.node != dlm-node_num) {
 -   ret = dlm_do_remote_ast(dlm, res, lock);
 -   if (ret  0)
 +   ret = dlm_do_remote_ast(dlm, res, lock);
 +   if (ret  0) {
  mlog_errno(ret);
 +   while ((ret == -112 || ret == -107)  
 send_times++  5 ) {
 +

Re: [Ocfs2-devel] [PATCH v2] ocfs2: don't fire quorum before connection established

2014-09-16 Thread Srinivas Eeda

Looks good to me. Thanks for the patch

Reviewed-by: Srinivas Eeda srinivas.e...@oracle.com

On 09/15/2014 10:15 PM, Junxiao Bi wrote:
 Firing quorum before connection established can cause unexpected node to 
 reboot.
 Assume there are 3 nodes in the cluster, Node 1, 2, 3. Node 2 and 3 have
 wrong ip address of Node 1 in cluster.conf and global heartbeat is enabled
 in the cluster. After the heatbeat are started on these three nodes, Node 1
 will reboot due to quorum fencing. It is similar case if Node 1's networking
 is not ready when starting the global heatbeat.
 The reboot is not friendly as customer is not fully ready for ocfs2 to work.
 Fix it by not allow firing quorum before connection established. In this
 case, ocfs2 will wait until wrong configure fixed or networking up to 
 continue.
 Also update the log to guide user where to check when connection is not built
 for a long time.

 Signed-off-by: Junxiao Bi junxiao...@oracle.com
 Reviewed-by: Srinivas Eeda srinivas.e...@oracle.com
 ---
   fs/ocfs2/cluster/tcp.c |5 +++--
   1 file changed, 3 insertions(+), 2 deletions(-)

 diff --git a/fs/ocfs2/cluster/tcp.c b/fs/ocfs2/cluster/tcp.c
 index ea34952..b2cc010 100644
 --- a/fs/ocfs2/cluster/tcp.c
 +++ b/fs/ocfs2/cluster/tcp.c
 @@ -536,7 +536,7 @@ static void o2net_set_nn_state(struct o2net_node *nn,
   if (nn-nn_persistent_error || nn-nn_sc_valid)
   wake_up(nn-nn_sc_wq);
   
 - if (!was_err  nn-nn_persistent_error) {
 + if (was_valid  !was_err  nn-nn_persistent_error) {
   o2quo_conn_err(o2net_num_from_nn(nn));
   queue_delayed_work(o2net_wq, nn-nn_still_up,
  msecs_to_jiffies(O2NET_QUORUM_DELAY_MS));
 @@ -1721,7 +1721,8 @@ static void o2net_connect_expired(struct work_struct 
 *work)
   spin_lock(nn-nn_lock);
   if (!nn-nn_sc_valid) {
   printk(KERN_NOTICE o2net: No connection established with 
 -node %u after %u.%u seconds, giving up.\n,
 +node %u after %u.%u seconds, check network and
 + cluster configuration.\n,
o2net_num_from_nn(nn),
o2net_idle_timeout() / 1000,
o2net_idle_timeout() % 1000);


___
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel

Re: [Ocfs2-devel] [PATCH] fs: ocfs2: dir.c: Cleaning up uninitialized variables

2014-06-02 Thread Srinivas Eeda

Acked-by: Srinivas Eeda srinivas.e...@oracle.com

On 06/01/2014 06:53 AM, Rickard Strandqvist wrote:
 There is a risk that the variable will be used without being initialized.

 This was largely found by using a static code analysis program called 
 cppcheck.

 Signed-off-by: Rickard Strandqvist rickard_strandqv...@spectrumdigital.se
 ---
   fs/ocfs2/dir.c |2 +-
   1 file changed, 1 insertion(+), 1 deletion(-)

 diff --git a/fs/ocfs2/dir.c b/fs/ocfs2/dir.c
 index 0717662..27aa4a1 100644
 --- a/fs/ocfs2/dir.c
 +++ b/fs/ocfs2/dir.c
 @@ -3738,7 +3738,7 @@ static int ocfs2_dx_dir_rebalance(struct ocfs2_super 
 *osb, struct inode *dir,
   int credits, ret, i, num_used, did_quota = 0;
   u32 cpos, split_hash, insert_hash = hinfo-major_hash;
   u64 orig_leaves_start;
 - int num_dx_leaves;
 + int num_dx_leaves = 0;
   struct buffer_head **orig_dx_leaves = NULL;
   struct buffer_head **new_dx_leaves = NULL;
   struct ocfs2_alloc_context *data_ac = NULL, *meta_ac = NULL;


___
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel

[Ocfs2-devel] ocfs2: Fix panic on kfree(xattr-name)

2014-03-28 Thread Srinivas Eeda

Hi Andrew,

can you please pull the following patch from Tetsuo Handa. It fixes a
regression in ocfs2/mainline and linux-next

Thanks,
--Srini


___
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel

[Ocfs2-devel] [PATCH 1/1] ocfs2: Fix panic on kfree(xattr-name)

2014-03-28 Thread Srinivas Eeda

From: Tetsuo Handa penguin-ker...@i-love.sakura.ne.jp

Commit 9548906b 'xattr: Constify -name member of struct xattr.' missed that
ocfs2 is calling kfree(xattr-name). As a result, kernel panic occurs upon
calling kfree(xattr-name) because xattr-name refers static constant names.
This patch removes kfree(xattr-name) from ocfs2_mknod() and ocfs2_symlink().

Reported-by: Tariq Saeed tariq.x.sa...@oracle.com
Signed-off-by: Tetsuo Handa penguin-ker...@i-love.sakura.ne.jp
Reviewed-by: Srinivas Eeda srinivas.e...@oracle.com
---
 fs/ocfs2/namei.c |2 --
 1 file changed, 2 deletions(-)

diff --git a/fs/ocfs2/namei.c b/fs/ocfs2/namei.c
index 3683643..feed025f 100644
--- a/fs/ocfs2/namei.c
+++ b/fs/ocfs2/namei.c
@@ -450,7 +450,6 @@ leave:
 
brelse(new_fe_bh);
brelse(parent_fe_bh);
-   kfree(si.name);
kfree(si.value);
 
ocfs2_free_dir_lookup_result(lookup);
@@ -1855,7 +1854,6 @@ bail:
 
brelse(new_fe_bh);
brelse(parent_fe_bh);
-   kfree(si.name);
kfree(si.value);
ocfs2_free_dir_lookup_result(lookup);
if (inode_ac)
-- 
1.7.9.5


___
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel

Re: [Ocfs2-devel] [PATCH] ocfs2: do not put bh when buffer_uptodate failed

2014-03-26 Thread Srinivas Eeda

Thanks for explaining
Reviewed-by: Srinivas Eeda srinivas.e...@oracle.com

On 03/25/2014 06:25 PM, alex chen wrote:
 On 2014/3/26 2:45, Srinivas Eeda wrote:
 These changes looks good to me. However ocfs2_read_blocks and
 ocfs2_read_blocks_sync needs the same fix ? :)
 There is no need to do this in ocfs2_read_blocks and
 ocfs2_read_blocks_sync, because bh will be set to NULL after put_bh,
 and brelse will handle it.

 On 03/25/2014 12:05 AM, alex chen wrote:
 Do not put bh when buffer_uptodate failed in ocfs2_write_block and
 ocfs2_write_super_or_backup, because it will put bh in b_end_io.
 Otherwise it will hit a warning VFS: brelse: Trying to free free
 buffer.

 Signed-off-by: Alex Chen alex.c...@huawei.com
 Reviewed-by: Joseph Qi joseph...@huawei.com
 ---
fs/ocfs2/buffer_head_io.c | 2 --
1 file changed, 2 deletions(-)

 diff --git a/fs/ocfs2/buffer_head_io.c b/fs/ocfs2/buffer_head_io.c
 index 5b704c6..1edcb14 100644
 --- a/fs/ocfs2/buffer_head_io.c
 +++ b/fs/ocfs2/buffer_head_io.c
 @@ -90,7 +90,6 @@ int ocfs2_write_block(struct ocfs2_super *osb, struct 
 buffer_head *bh,
  * information for this bh as it's not marked locally
  * uptodate. */
 ret = -EIO;
 -   put_bh(bh);
 mlog_errno(ret);
 }

 @@ -420,7 +419,6 @@ int ocfs2_write_super_or_backup(struct ocfs2_super *osb,

 if (!buffer_uptodate(bh)) {
 ret = -EIO;
 -   put_bh(bh);
 mlog_errno(ret);
 }


 ___
 Ocfs2-devel mailing list
 Ocfs2-devel@oss.oracle.com
 https://oss.oracle.com/mailman/listinfo/ocfs2-devel




___
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel

Re: [Ocfs2-devel] [PATCH] ocfs2: do not put bh when buffer_uptodate failed

2014-03-25 Thread Srinivas Eeda

These changes looks good to me. However ocfs2_read_blocks and 
ocfs2_read_blocks_sync needs the same fix ? :)

On 03/25/2014 12:05 AM, alex chen wrote:
 Do not put bh when buffer_uptodate failed in ocfs2_write_block and
 ocfs2_write_super_or_backup, because it will put bh in b_end_io.
 Otherwise it will hit a warning VFS: brelse: Trying to free free
 buffer.

 Signed-off-by: Alex Chen alex.c...@huawei.com
 Reviewed-by: Joseph Qi joseph...@huawei.com
 ---
   fs/ocfs2/buffer_head_io.c | 2 --
   1 file changed, 2 deletions(-)

 diff --git a/fs/ocfs2/buffer_head_io.c b/fs/ocfs2/buffer_head_io.c
 index 5b704c6..1edcb14 100644
 --- a/fs/ocfs2/buffer_head_io.c
 +++ b/fs/ocfs2/buffer_head_io.c
 @@ -90,7 +90,6 @@ int ocfs2_write_block(struct ocfs2_super *osb, struct 
 buffer_head *bh,
* information for this bh as it's not marked locally
* uptodate. */
   ret = -EIO;
 - put_bh(bh);
   mlog_errno(ret);
   }

 @@ -420,7 +419,6 @@ int ocfs2_write_super_or_backup(struct ocfs2_super *osb,

   if (!buffer_uptodate(bh)) {
   ret = -EIO;
 - put_bh(bh);
   mlog_errno(ret);
   }



___
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel

Re: [Ocfs2-devel] One Patch reported, Anyone review it, thanks

2014-03-11 Thread Srinivas Eeda

Guozhonghua, idea of the fix looks good, but the patch doesn't seem to 
meet mainline standards. Can you please follow the instructions that 
Jeff emailed you last time and resubmit the patch.


On 03/11/2014 04:09 AM, Guozhonghua wrote:


I review the code of the kernel 3.11.10.

The difference is as below:

-- ../linux-3.11.10/fs/ocfs2/dlm/dlmdomain.c   2013-11-30 
02:42:37.0 +0800


+++ ocfs2-ko-3.11/dlm/dlmdomain.c 2014-03-11 18:57:23.323897515 +0800

@@ -1135,6 +1135,9 @@ static int dlm_query_region_handler(stru

  int status = 0;

  int locked = 0;

+/* Wether domain locked */


comment probably not required


+int domain_locked = 0;

+

  qr = (struct dlm_query_region *) msg-buf;

   mlog(0, Node %u queries hb regions on domain %s\n, qr-qr_node,

@@ -1150,6 +1153,7 @@ static int dlm_query_region_handler(stru

  status = -EINVAL;

spin_lock(dlm_domain_lock);

+domain_locked = 1;

  dlm = __dlm_lookup_domain_full(qr-qr_domain, qr-qr_namelen);

  if (!dlm) {

mlog(ML_ERROR, Node %d queried hb regions on domain %s 

@@ -1181,9 +1185,12 @@ static int dlm_query_region_handler(stru

bail:

  if (locked)

spin_unlock(dlm-spinlock);

- spin_unlock(dlm_domain_lock);

-   kfree(local);

+if (domain_locked)

+ spin_unlock(dlm_domain_lock);

+

+if (local)

+kfree(local);


no need to check if (local),  kfree will take care of that part


   return status;

}

-
??,?
?(??
???)?,??
??!
This e-mail and its attachments contain confidential information from 
H3C, which is
intended only for the person or entity whose address is listed above. 
Any use of the
information contained herein in any way (including, but not limited 
to, total or partial
disclosure, reproduction, or dissemination) by persons other than the 
intended
recipient(s) is prohibited. If you receive this e-mail in error, 
please notify the sender

by phone or email immediately and delete it!


___
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel


___
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel

Re: [Ocfs2-devel] [PATCH] ocfs2: dlm: fix lock migration crash

2014-02-26 Thread Srinivas Eeda

looks good, thanks Junxiao for fixing this.

Reviewed-by: Srinivas Eedasrinivas.e...@oracle.com

On 02/25/2014 11:47 PM, Junxiao Bi wrote:
 This issue was introduced by commit 800deef3 where it replaced list_for_each
 with list_for_each_entry. The variable lock will point to invalid data if
 tmpq list is empty and a panic will be triggered due to this.
 Sunil advised reverting it back, but the old version was also not right. At
 the end of the outer for loop, that list_for_each_entry will also set lock
 to an invalid data, then in the next loop, if the tmpq list is empty, lock
 will be an stale invalid data and cause the panic. So reverting the 
 list_for_each
 back and reset lock to NULL to fix this issue.

 Another concern is that this seemes can not happen because the tmpq list 
 should
 not be empty. Let me describe how.
 old lock resource owner(node 1):  migratation 
 target(node 2):
 image there's lockres with a EX lock from node 2 in
 granted list, a NR lock from node x with convert_type
 EX in converting list.
 dlm_empty_lockres() {
   dlm_pick_migration_target() {
 pick node 2 as target as its lock is the first one
 in granted list.
   }
   dlm_migrate_lockres() {
 dlm_mark_lockres_migrating() {
   res-state |= DLM_LOCK_RES_BLOCK_DIRTY;
   wait_event(dlm-ast_wq, !dlm_lockres_is_dirty(dlm, res));
//after the above code, we can not dirty lockres any more,
   // so dlm_thread shuffle list will not run
 
 downconvert lock from EX to NR
 upconvert 
 lock from NR to EX
  migration may schedule out here, then
  node 2 send down convert request to convert type from EX to
  NR, then send up convert request to convert type from NR to
  EX, at this time, lockres granted list is empty, and two locks
  in the converting list, node x up convert lock followed by
  node 2 up convert lock.

// will set lockres RES_MIGRATING flag, the following
// lock/unlock can not run
   dlm_lockres_release_ast(dlm, res);
 }

 dlm_send_one_lockres()
   
 dlm_process_recovery_data()
 for (i=0; 
 imres-num_locks; i++)
   if 
 (ml-node == dlm-node_num)
 for 
 (j = DLM_GRANTED_LIST; j = DLM_BLOCKED_LIST; j++) {
  
 list_for_each_entry(lock, tmpq, list)
  if 
 (lock) break;  lock is invalid as grant list is empty.
 }
 if 
 (lock-ml.node != ml-node)
   
 BUG()  crash here
   }
 I see the above locks status from a vmcore of our internal bug.

 Signed-off-by: Junxiao Bi junxiao...@oracle.com
 Cc: Sunil Mushran sunil.mush...@gmail.com
 Cc: Srinivas Eeda srinivas.e...@oracle.com
 Cc: sta...@vger.kernel.org
 ---
   fs/ocfs2/dlm/dlmrecovery.c |   14 --
   1 file changed, 8 insertions(+), 6 deletions(-)

 diff --git a/fs/ocfs2/dlm/dlmrecovery.c b/fs/ocfs2/dlm/dlmrecovery.c
 index 7035af0..c2dd258 100644
 --- a/fs/ocfs2/dlm/dlmrecovery.c
 +++ b/fs/ocfs2/dlm/dlmrecovery.c
 @@ -1750,13 +1750,13 @@ static int dlm_process_recovery_data(struct dlm_ctxt 
 *dlm,
struct dlm_migratable_lockres *mres)
   {
   struct dlm_migratable_lock *ml;
 - struct list_head *queue;
 + struct list_head *queue, *iter;
   struct list_head *tmpq = NULL;
   struct dlm_lock *newlock = NULL;
   struct dlm_lockstatus *lksb = NULL;
   int ret = 0;
   int i, j, bad;
 - struct dlm_lock *lock = NULL;
 + struct dlm_lock *lock;
   u8 from = O2NM_MAX_NODES;
   unsigned int added = 0;
   __be64 c;
 @@ -1791,14 +1791,16 @@ static int dlm_process_recovery_data(struct dlm_ctxt 
 *dlm,
   /* MIGRATION ONLY! */
   BUG_ON(!(mres-flags  DLM_MRES_MIGRATION));
   
 + lock = NULL;
   spin_lock(res-spinlock);
   for (j = DLM_GRANTED_LIST; j = DLM_BLOCKED_LIST; j++) {
   tmpq = dlm_list_idx_to_ptr(res, j);
 - list_for_each_entry(lock, tmpq, list) {
 - if (lock-ml.cookie != ml-cookie)
 - lock = NULL;
 - else
 + list_for_each(iter, tmpq

Re: [Ocfs2-devel] [PATCH] ocfs2: fix dlm lock migration crash

2014-02-24 Thread Srinivas Eeda


Junxiao, thanks for looking into this issue. Please see my comment below

On 02/24/2014 01:07 AM, Junxiao Bi wrote:

Hi,

On 07/19/2012 09:59 AM, Sunil Mushran wrote:

Different issues.

On Wed, Jul 18, 2012 at 6:34 PM, Junxiao Bi junxiao...@oracle.com 
mailto:junxiao...@oracle.com wrote:


On 07/19/2012 12:36 AM, Sunil Mushran wrote:

This bug was detected during code audit. Never seen a crash. If
it does hit,
then we have bigger problems. So no point posting to stable.


I read a lot of dlm recovery code recently, I found this bug could 
happen at the following scenario.


node 1: migrate target node x:
dlm_unregister_domain()
 dlm_migrate_all_locks()
  dlm_empty_lockres()
   select node x as migrate target node
   since there is a node x lock on the granted list.
   dlm_migrate_lockres()
dlm_mark_lockres_migrating() {
 wait_event(dlm-ast_wq, !dlm_lockres_is_dirty(dlm, res));
 node x unlock may happen here, res-granted list can be empty.
If the unlock request got sent at this point, and if the request was 
*processed*, lock must have been removed from the granted_list. If the 
request was *not yet processed*, then the DLM_LOCK_RES_MIGRATING set in 
dlm_lockres_release_ast would make dlm_unlock handler to return 
DLM_MIGRATING to the caller (in this case node x). So I don't see how 
granted_list could have stale lock. Am I missing something ?


I do think there is such race that you pointed below exist, but I am not 
sure if it was due to the above race described.



dlm_lockres_release_ast(dlm, res);
}
dlm_send_one_lockres()
dlm_process_recovery_data() {
tmpq is 
res-granted list and is empty.

list_for_each_entry(lock, tmpq, list) {
 if 
(lock-ml.cookie != ml-cookie)

  lock = NULL;
 else
  break;
}
lock will be 
invalid here.
if (lock-ml.node 
!= ml-node)
BUG() -- 
crash here.

   }

Thanks,
Junxiao.


Our customer can reproduce it. Also I saw you were assigned a
similar bug before, see
https://oss.oracle.com/bugzilla/show_bug.cgi?id=1220, is it the
same BUG?


On Tue, Jul 17, 2012 at 6:36 PM, Junxiao Bi
junxiao...@oracle.com mailto:junxiao...@oracle.com wrote:

Hi Sunil,

On 07/18/2012 03:49 AM, Sunil Mushran wrote:

On Tue, Jul 17, 2012 at 12:10 AM, Junxiao Bi
junxiao...@oracle.com mailto:junxiao...@oracle.com wrote:

In the target node of the dlm lock migration, the logic
to find
the local dlm lock is wrong, it shouldn't change the
loop variable
lock in the list_for_each_entry loop. This will cause
a NULL-pointer
accessing crash.

Signed-off-by: Junxiao Bi junxiao...@oracle.com
mailto:junxiao...@oracle.com
Cc: sta...@vger.kernel.org mailto:sta...@vger.kernel.org
---
 fs/ocfs2/dlm/dlmrecovery.c |   12 +++-
 1 file changed, 7 insertions(+), 5 deletions(-)

diff --git a/fs/ocfs2/dlm/dlmrecovery.c
b/fs/ocfs2/dlm/dlmrecovery.c
index 01ebfd0..0b9cc88 100644
--- a/fs/ocfs2/dlm/dlmrecovery.c
+++ b/fs/ocfs2/dlm/dlmrecovery.c
@@ -1762,6 +1762,7 @@ static int
dlm_process_recovery_data(struct dlm_ctxt *dlm,
u8 from = O2NM_MAX_NODES;
unsigned int added = 0;
__be64 c;
+   int found;

mlog(0, running %d locks for this lockres\n,
mres-num_locks);
for (i=0; imres-num_locks; i++) {
@@ -1793,22 +1794,23 @@ static int
dlm_process_recovery_data(struct dlm_ctxt *dlm,
/* MIGRATION ONLY! */
BUG_ON(!(mres-flags  DLM_MRES_MIGRATION));

+   found = 0;
spin_lock(res-spinlock);
for (j = DLM_GRANTED_LIST; j =
DLM_BLOCKED_LIST; j++) {
tmpq =
dlm_list_idx_to_ptr(res, j);
list_for_each_entry(lock, tmpq, list) {
-   if
(lock-ml.cookie != ml-cookie)
-   lock = NULL;
- else
+   if
(lock-ml.cookie == ml-cookie) {
+   found = 1;

Re: [Ocfs2-devel] [PATCH 0/6 v2] ocfs2: Avoid pending orphaned inodes

2014-02-20 Thread Srinivas Eeda

Hi Jan,

thanks a lot for these patches. They all look good to me ... I just have 
one question on patch 5

Thanks,
--Srini

On 02/20/2014 07:18 AM, Jan Kara wrote:
Hello,

here is a second version of my patchset to solve a deadlocks when we do not
 defer dropping of inode reference from downconvert worker. I have tested the
 patches (also with lockdep enabled) and they seem to work fine. Comments are
 welcome!

   Honza


___
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel

Re: [Ocfs2-devel] [PATCH 3/6] quota: Provide function to grab quota structure reference

2014-02-20 Thread Srinivas Eeda

looks good to me
Reviewed-by: Srinivas Eeda srinivas.e...@oracle.com

On 02/20/2014 07:18 AM, Jan Kara wrote:
 Provide dqgrab() function to get quota structure reference when we are
 sure it already has at least one active reference. Make use of this
 function inside quota code.

 Signed-off-by: Jan Kara j...@suse.cz
 ---
   fs/quota/dquot.c | 4 ++--
   include/linux/quotaops.h | 8 
   2 files changed, 10 insertions(+), 2 deletions(-)

 diff --git a/fs/quota/dquot.c b/fs/quota/dquot.c
 index 831d49a4111f..e3f09e34d0b2 100644
 --- a/fs/quota/dquot.c
 +++ b/fs/quota/dquot.c
 @@ -528,7 +528,7 @@ restart:
   if (atomic_read(dquot-dq_count)) {
   DEFINE_WAIT(wait);
   
 - atomic_inc(dquot-dq_count);
 + dqgrab(dquot);
   prepare_to_wait(dquot-dq_wait_unused, wait,
   TASK_UNINTERRUPTIBLE);
   spin_unlock(dq_list_lock);
 @@ -624,7 +624,7 @@ int dquot_writeback_dquots(struct super_block *sb, int 
 type)
   /* Now we have active dquot from which someone is
* holding reference so we can safely just increase
* use count */
 - atomic_inc(dquot-dq_count);
 + dqgrab(dquot);
   spin_unlock(dq_list_lock);
   dqstats_inc(DQST_LOOKUPS);
   err = sb-dq_op-write_dquot(dquot);
 diff --git a/include/linux/quotaops.h b/include/linux/quotaops.h
 index 6965fe394c3b..1d3eee594cd6 100644
 --- a/include/linux/quotaops.h
 +++ b/include/linux/quotaops.h
 @@ -46,6 +46,14 @@ void inode_reclaim_rsv_space(struct inode *inode, qsize_t 
 number);
   void dquot_initialize(struct inode *inode);
   void dquot_drop(struct inode *inode);
   struct dquot *dqget(struct super_block *sb, struct kqid qid);
 +static inline struct dquot *dqgrab(struct dquot *dquot)
 +{
 + /* Make sure someone else has active reference to dquot */
 + WARN_ON_ONCE(!atomic_read(dquot-dq_count));
 + WARN_ON_ONCE(!test_bit(DQ_ACTIVE_B, dquot-dq_flags));
 + atomic_inc(dquot-dq_count);
 + return dquot;
 +}
   void dqput(struct dquot *dquot);
   int dquot_scan_active(struct super_block *sb,
 int (*fn)(struct dquot *dquot, unsigned long priv),


___
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel

Re: [Ocfs2-devel] [PATCH 4/6] ocfs2: Implement delayed dropping of last dquot reference

2014-02-20 Thread Srinivas Eeda

looks good to me
Reviewed-by: Srinivas Eeda srinivas.e...@oracle.com

On 02/20/2014 07:18 AM, Jan Kara wrote:
 We cannot drop last dquot reference from downconvert thread as that
 creates the following deadlock:

 NODE 1  NODE2
 holds dentry lock for 'foo'
 holds inode lock for GLOBAL_BITMAP_SYSTEM_INODE
  dquot_initialize(bar)
ocfs2_dquot_acquire()
  
 ocfs2_inode_lock(USER_QUOTA_SYSTEM_INODE)
  ...
 downconvert thread (triggered from another
 node or a different process from NODE2)
ocfs2_dentry_post_unlock()
  ...
  iput(foo)
ocfs2_evict_inode(foo)
  ocfs2_clear_inode(foo)
dquot_drop(inode)
  ...
   ocfs2_dquot_release()
ocfs2_inode_lock(USER_QUOTA_SYSTEM_INODE)
 - blocks
  finds we need more space in
  quota file
  ...
  ocfs2_extend_no_holes()

 ocfs2_inode_lock(GLOBAL_BITMAP_SYSTEM_INODE)
  - deadlocks waiting for
downconvert thread

 We solve the problem by postponing dropping of the last dquot reference
 to a workqueue if it happens from the downconvert thread.

 Signed-off-by: Jan Kara j...@suse.cz
 ---
   fs/ocfs2/ocfs2.h|  5 +
   fs/ocfs2/quota.h|  2 ++
   fs/ocfs2/quota_global.c | 35 +++
   fs/ocfs2/super.c|  8 
   4 files changed, 50 insertions(+)

 diff --git a/fs/ocfs2/ocfs2.h b/fs/ocfs2/ocfs2.h
 index 553f53cc73ae..64c02239ba46 100644
 --- a/fs/ocfs2/ocfs2.h
 +++ b/fs/ocfs2/ocfs2.h
 @@ -30,6 +30,7 @@
   #include linux/sched.h
   #include linux/wait.h
   #include linux/list.h
 +#include linux/llist.h
   #include linux/rbtree.h
   #include linux/workqueue.h
   #include linux/kref.h
 @@ -419,6 +420,10 @@ struct ocfs2_super
   struct ocfs2_dentry_lock *dentry_lock_list;
   struct work_struct dentry_lock_work;
   
 + /* List of dquot structures to drop last reference to */
 + struct llist_head dquot_drop_list;
 + struct work_struct dquot_drop_work;
 +
   wait_queue_head_t   osb_mount_event;
   
   /* Truncate log info */
 diff --git a/fs/ocfs2/quota.h b/fs/ocfs2/quota.h
 index d5ab56cbe5c5..f266d67df3c6 100644
 --- a/fs/ocfs2/quota.h
 +++ b/fs/ocfs2/quota.h
 @@ -28,6 +28,7 @@ struct ocfs2_dquot {
   unsigned int dq_use_count;  /* Number of nodes having reference to 
 this entry in global quota file */
   s64 dq_origspace;   /* Last globally synced space usage */
   s64 dq_originodes;  /* Last globally synced inode usage */
 + struct llist_node list; /* Member of list of dquots to drop */
   };
   
   /* Description of one chunk to recover in memory */
 @@ -110,6 +111,7 @@ int ocfs2_read_quota_phys_block(struct inode *inode, u64 
 p_block,
   int ocfs2_create_local_dquot(struct dquot *dquot);
   int ocfs2_local_release_dquot(handle_t *handle, struct dquot *dquot);
   int ocfs2_local_write_dquot(struct dquot *dquot);
 +void ocfs2_drop_dquot_refs(struct work_struct *work);
   
   extern const struct dquot_operations ocfs2_quota_operations;
   extern struct quota_format_type ocfs2_quota_format;
 diff --git a/fs/ocfs2/quota_global.c b/fs/ocfs2/quota_global.c
 index aaa50611ec66..7921e209c64b 100644
 --- a/fs/ocfs2/quota_global.c
 +++ b/fs/ocfs2/quota_global.c
 @@ -10,6 +10,7 @@
   #include linux/jiffies.h
   #include linux/writeback.h
   #include linux/workqueue.h
 +#include linux/llist.h
   
   #include cluster/masklog.h
   
 @@ -679,6 +680,27 @@ static int ocfs2_calc_qdel_credits(struct super_block 
 *sb, int type)
  OCFS2_INODE_UPDATE_CREDITS;
   }
   
 +void ocfs2_drop_dquot_refs(struct work_struct *work)
 +{
 + struct ocfs2_super *osb = container_of(work, struct ocfs2_super,
 +dquot_drop_work);
 + struct llist_node *list;
 + struct ocfs2_dquot *odquot, *next_odquot;
 +
 + list = llist_del_all(osb-dquot_drop_list);
 + llist_for_each_entry_safe(odquot, next_odquot, list, list) {
 + /* Drop the reference we acquired in ocfs2_dquot_release() */
 + dqput(odquot-dq_dquot);
 + }
 +}
 +
 +/*
 + * Called when the last reference to dquot is dropped. If we are called from
 + * downconvert thread, we cannot do all the handling here because grabbing
 + * quota lock could deadlock (the node holding the quota lock could need some
 + * other cluster lock to proceed but with blocked downconvert thread we 
 cannot
 + * release any lock).
 + */
   static int

Re: [Ocfs2-devel] [PATCH 1/6] ocfs2: Remove OCFS2_INODE_SKIP_DELETE flag

2014-02-20 Thread Srinivas Eeda

Reviewed-by: Srinivas Eeda srinivas.e...@oracle.com

On 02/20/2014 07:18 AM, Jan Kara wrote:
 The flag was never set, delete it.

 Signed-off-by: Jan Kara j...@suse.cz
 ---
   fs/ocfs2/inode.c   | 6 --
   fs/ocfs2/inode.h   | 8 +++-
   fs/ocfs2/journal.c | 6 --
   3 files changed, 3 insertions(+), 17 deletions(-)

 diff --git a/fs/ocfs2/inode.c b/fs/ocfs2/inode.c
 index f29a90fde619..b4baaefe4dd4 100644
 --- a/fs/ocfs2/inode.c
 +++ b/fs/ocfs2/inode.c
 @@ -822,12 +822,6 @@ static int ocfs2_inode_is_valid_to_delete(struct inode 
 *inode)
   goto bail_unlock;
   }
   
 - /* If we have allowd wipe of this inode for another node, it
 -  * will be marked here so we can safely skip it. Recovery will
 -  * cleanup any inodes we might inadvertently skip here. */
 - if (oi-ip_flags  OCFS2_INODE_SKIP_DELETE)
 - goto bail_unlock;
 -
   ret = 1;
   bail_unlock:
   spin_unlock(oi-ip_lock);
 diff --git a/fs/ocfs2/inode.h b/fs/ocfs2/inode.h
 index 621fc73bf23d..f60bc314ee0a 100644
 --- a/fs/ocfs2/inode.h
 +++ b/fs/ocfs2/inode.h
 @@ -84,8 +84,6 @@ struct ocfs2_inode_info
   #define OCFS2_INODE_BITMAP  0x0004
   /* This inode has been wiped from disk */
   #define OCFS2_INODE_DELETED 0x0008
 -/* Another node is deleting, so our delete is a nop */
 -#define OCFS2_INODE_SKIP_DELETE  0x0010
   /* Has the inode been orphaned on another node?
*
* This hints to ocfs2_drop_inode that it should clear i_nlink before
 @@ -100,11 +98,11 @@ struct ocfs2_inode_info
* rely on ocfs2_delete_inode to sort things out under the proper
* cluster locks.
*/
 -#define OCFS2_INODE_MAYBE_ORPHANED   0x0020
 +#define OCFS2_INODE_MAYBE_ORPHANED   0x0010
   /* Does someone have the file open O_DIRECT */
 -#define OCFS2_INODE_OPEN_DIRECT  0x0040
 +#define OCFS2_INODE_OPEN_DIRECT  0x0020
   /* Tell the inode wipe code it's not in orphan dir */
 -#define OCFS2_INODE_SKIP_ORPHAN_DIR 0x0080
 +#define OCFS2_INODE_SKIP_ORPHAN_DIR 0x0040
   
   static inline struct ocfs2_inode_info *OCFS2_I(struct inode *inode)
   {
 diff --git a/fs/ocfs2/journal.c b/fs/ocfs2/journal.c
 index 44fc3e530c3d..03ea9314fecd 100644
 --- a/fs/ocfs2/journal.c
 +++ b/fs/ocfs2/journal.c
 @@ -2132,12 +2132,6 @@ static int ocfs2_recover_orphans(struct ocfs2_super 
 *osb,
   iter = oi-ip_next_orphan;
   
   spin_lock(oi-ip_lock);
 - /* The remote delete code may have set these on the
 -  * assumption that the other node would wipe them
 -  * successfully.  If they are still in the node's
 -  * orphan dir, we need to reset that state. */
 - oi-ip_flags = ~(OCFS2_INODE_DELETED|OCFS2_INODE_SKIP_DELETE);
 -
   /* Set the proper information to get us going into
* ocfs2_delete_inode. */
   oi-ip_flags |= OCFS2_INODE_MAYBE_ORPHANED;


___
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel

Re: [Ocfs2-devel] [PATCH 2/6] ocfs2: Move dquot_initialize() in ocfs2_delete_inode() somewhat later

2014-02-20 Thread Srinivas Eeda

looks good to me

Reviewed-by: Srinivas Eeda srinivas.e...@oracle.com

On 02/20/2014 07:18 AM, Jan Kara wrote:
 Move dquot_initalize() call in ocfs2_delete_inode() after the moment we
 verify inode is actually a sane one to delete. We certainly don't want
 to initialize quota for system inodes etc. This also avoids calling into
 quota code from downconvert thread.

 Add more details into the comment why bailing out from
 ocfs2_delete_inode() when we are in downconvert thread is OK.

 Signed-off-by: Jan Kara j...@suse.cz
 ---
   fs/ocfs2/inode.c | 16 +---
   1 file changed, 9 insertions(+), 7 deletions(-)

 diff --git a/fs/ocfs2/inode.c b/fs/ocfs2/inode.c
 index b4baaefe4dd4..3b0d722de35e 100644
 --- a/fs/ocfs2/inode.c
 +++ b/fs/ocfs2/inode.c
 @@ -804,11 +804,13 @@ static int ocfs2_inode_is_valid_to_delete(struct inode 
 *inode)
   goto bail;
   }
   
 - /* If we're coming from downconvert_thread we can't go into our own
 -  * voting [hello, deadlock city!], so unforuntately we just
 -  * have to skip deleting this guy. That's OK though because
 -  * the node who's doing the actual deleting should handle it
 -  * anyway. */
 + /*
 +  * If we're coming from downconvert_thread we can't go into our own
 +  * voting [hello, deadlock city!] so we cannot delete the inode. But
 +  * since we dropped last inode ref when downconverting dentry lock,
 +  * we cannot have the file open and thus the node doing unlink will
 +  * take care of deleting the inode.
 +  */
   if (current == osb-dc_task)
   goto bail;
   
 @@ -954,8 +956,6 @@ static void ocfs2_delete_inode(struct inode *inode)
   if (is_bad_inode(inode) || !OCFS2_I(inode)-ip_blkno)
   goto bail;
   
 - dquot_initialize(inode);
 -
   if (!ocfs2_inode_is_valid_to_delete(inode)) {
   /* It's probably not necessary to truncate_inode_pages
* here but we do it for safety anyway (it will most
 @@ -964,6 +964,8 @@ static void ocfs2_delete_inode(struct inode *inode)
   goto bail;
   }
   
 + dquot_initialize(inode);
 +
   /* We want to block signals in delete_inode as the lock and
* messaging paths may return us -ERESTARTSYS. Which would
* cause us to exit early, resulting in inodes being orphaned


___
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel

Re: [Ocfs2-devel] [PATCH 5/6] ocfs2: Avoid blocking in ocfs2_mark_lockres_freeing() in downconvert thread

2014-02-20 Thread Srinivas Eeda

I like the idea of dc_task handling queued basts in 
ocfs2_mark_lockres_freeing.

I am wondering if we should call lockres-l_ops-post_unlock(osb, 
lockres) ? Would there be another node waiting for a bast response ?

On 02/20/2014 07:18 AM, Jan Kara wrote:
 If we are dropping last inode reference from downconvert thread, we will
 end up calling ocfs2_mark_lockres_freeing() which can block if the lock
 we are freeing is queued thus creating an A-A deadlock. Luckily, since
 we are the downconvert thread, we can immediately dequeue the lock and
 thus avoid waiting in this case.

 Signed-off-by: Jan Kara j...@suse.cz
 ---
   fs/ocfs2/dlmglue.c | 33 +++--
   fs/ocfs2/dlmglue.h |  3 ++-
   fs/ocfs2/inode.c   |  7 ---
   3 files changed, 37 insertions(+), 6 deletions(-)

 diff --git a/fs/ocfs2/dlmglue.c b/fs/ocfs2/dlmglue.c
 index 19986959d149..b7580157ef01 100644
 --- a/fs/ocfs2/dlmglue.c
 +++ b/fs/ocfs2/dlmglue.c
 @@ -3150,7 +3150,8 @@ out:
* it safe to drop.
*
* You can *not* attempt to call cluster_lock on this lockres anymore. */
 -void ocfs2_mark_lockres_freeing(struct ocfs2_lock_res *lockres)
 +void ocfs2_mark_lockres_freeing(struct ocfs2_super *osb,
 + struct ocfs2_lock_res *lockres)
   {
   int status;
   struct ocfs2_mask_waiter mw;
 @@ -3160,6 +3161,33 @@ void ocfs2_mark_lockres_freeing(struct ocfs2_lock_res 
 *lockres)
   
   spin_lock_irqsave(lockres-l_lock, flags);
   lockres-l_flags |= OCFS2_LOCK_FREEING;
 + if (lockres-l_flags  OCFS2_LOCK_QUEUED  current == osb-dc_task) {
 + unsigned long flags;
 +
 + /*
 +  * We know the downconvert is queued but not in progress
 +  * because we are the downconvert thread and processing
 +  * different lock. So we can just remove the lock from the
 +  * queue. This is not only an optimization but also a way
 +  * to avoid the following deadlock:
 +  *   ocfs2_dentry_post_unlock()
 +  * ocfs2_dentry_lock_put()
 +  *   ocfs2_drop_dentry_lock()
 +  * iput()
 +  *   ocfs2_evict_inode()
 +  * ocfs2_clear_inode()
 +  *   ocfs2_mark_lockres_freeing()
 +  * ... blocks waiting for OCFS2_LOCK_QUEUED
 +  * since we are the downconvert thread which
 +  * should clear the flag.
 +  */
 + spin_lock_irqsave(osb-dc_task_lock, flags);
 + list_del_init(lockres-l_blocked_list);
 + osb-blocked_lock_count--;
 + spin_unlock_irqrestore(osb-dc_task_lock, flags);
 + lockres_clear_flags(lockres, OCFS2_LOCK_QUEUED);
 + goto out_unlock;
 + }
   while (lockres-l_flags  OCFS2_LOCK_QUEUED) {
   lockres_add_mask_waiter(lockres, mw, OCFS2_LOCK_QUEUED, 0);
   spin_unlock_irqrestore(lockres-l_lock, flags);
 @@ -3172,6 +3200,7 @@ void ocfs2_mark_lockres_freeing(struct ocfs2_lock_res 
 *lockres)
   
   spin_lock_irqsave(lockres-l_lock, flags);
   }
 +out_unlock:
   spin_unlock_irqrestore(lockres-l_lock, flags);
   }
   
 @@ -3180,7 +3209,7 @@ void ocfs2_simple_drop_lockres(struct ocfs2_super *osb,
   {
   int ret;
   
 - ocfs2_mark_lockres_freeing(lockres);
 + ocfs2_mark_lockres_freeing(osb, lockres);
   ret = ocfs2_drop_lock(osb, lockres);
   if (ret)
   mlog_errno(ret);
 diff --git a/fs/ocfs2/dlmglue.h b/fs/ocfs2/dlmglue.h
 index 1d596d8c4a4a..d293a22c32c5 100644
 --- a/fs/ocfs2/dlmglue.h
 +++ b/fs/ocfs2/dlmglue.h
 @@ -157,7 +157,8 @@ int ocfs2_refcount_lock(struct ocfs2_refcount_tree 
 *ref_tree, int ex);
   void ocfs2_refcount_unlock(struct ocfs2_refcount_tree *ref_tree, int ex);
   
   
 -void ocfs2_mark_lockres_freeing(struct ocfs2_lock_res *lockres);
 +void ocfs2_mark_lockres_freeing(struct ocfs2_super *osb,
 + struct ocfs2_lock_res *lockres);
   void ocfs2_simple_drop_lockres(struct ocfs2_super *osb,
  struct ocfs2_lock_res *lockres);
   
 diff --git a/fs/ocfs2/inode.c b/fs/ocfs2/inode.c
 index 3b0d722de35e..9661f8db21dc 100644
 --- a/fs/ocfs2/inode.c
 +++ b/fs/ocfs2/inode.c
 @@ -1053,6 +1053,7 @@ static void ocfs2_clear_inode(struct inode *inode)
   {
   int status;
   struct ocfs2_inode_info *oi = OCFS2_I(inode);
 + struct ocfs2_super *osb = OCFS2_SB(inode-i_sb);
   
   clear_inode(inode);
   trace_ocfs2_clear_inode((unsigned long long)oi-ip_blkno,
 @@ -1069,9 +1070,9 @@ static void ocfs2_clear_inode(struct inode *inode)
   
   /* Do these before all the other work so that we don't bounce
* the downconvert thread while waiting to destroy the locks. */
 - ocfs2_mark_lockres_freeing(oi-ip_rw_lockres);
 -

Re: [Ocfs2-devel] [patch 03/11] ocfs2/o2net: incorrect to terminate accepting connections loop upon rejecting an invalid one

2014-01-24 Thread Srinivas Eeda

On 01/24/2014 01:55 PM, Mark Fasheh wrote:
 On Fri, Jan 24, 2014 at 12:47:02PM -0800, a...@linux-foundation.org wrote:
 From: Tariq Saeed tariq.x.sa...@oracle.com
 Subject: ocfs2/o2net: incorrect to terminate accepting connections loop upon 
 rejecting an invalid one

 When o2net-accept-one() rejects an illegal connection, it terminates the
 loop picking up the remaining queued connections.  This fix will continue
 accepting connections till the queue is emtpy.

 Addresses Orabug 17489469.
 Thanks for sending this, review comments below.


 diff -puN 
 fs/ocfs2/cluster/tcp.c~ocfs2-o2net-incorrect-to-terminate-accepting-connections-loop-upon-rejecting-an-invalid-one
  fs/ocfs2/cluster/tcp.c
 --- 
 a/fs/ocfs2/cluster/tcp.c~ocfs2-o2net-incorrect-to-terminate-accepting-connections-loop-upon-rejecting-an-invalid-one
 +++ a/fs/ocfs2/cluster/tcp.c
 @@ -1826,7 +1826,7 @@ int o2net_register_hb_callbacks(void)
   
   /*  */
   
 -static int o2net_accept_one(struct socket *sock)
 +static int o2net_accept_one(struct socket *sock, int *more)
   {
  int ret, slen;
  struct sockaddr_in sin;
 @@ -1837,6 +1837,7 @@ static int o2net_accept_one(struct socke
  struct o2net_node *nn;
   
  BUG_ON(sock == NULL);
 +*more = 0;
  ret = sock_create_lite(sock-sk-sk_family, sock-sk-sk_type,
 sock-sk-sk_protocol, new_sock);
  if (ret)
 @@ -1848,6 +1849,7 @@ static int o2net_accept_one(struct socke
  if (ret  0)
  goto out;
   
 +*more = 1;
  new_sock-sk-sk_allocation = GFP_ATOMIC;
   
  ret = o2net_set_nodelay(new_sock);
 @@ -1949,8 +1951,15 @@ out:
   static void o2net_accept_many(struct work_struct *work)
   {
  struct socket *sock = o2net_listen_sock;
 -while (o2net_accept_one(sock) == 0)
 +int more;
 +int err;
 +
 +for (;;) {
 +err = o2net_accept_one(sock, more);
 +if (!more)
 +break;
 We're throwing out 'err' here and trusting the variable 'more'. However, err
 could be set and more would be 0 regardless of whether there actually are
 more connections to be had. This makes more sense given when 'more' is set:
Hi Mark,

thanks for reviewing this :). Please correct me if I am wrong.

The idea was if accept itself fails exit the loop, otherwise continue 
scanning all queued connections. For eg: if a connection request from 
unknown node followed by known node happens at the same time, we can 
ignore the ret value of the first connection request and still continue 
accepting the next connection.

ret value appears to be of not much importance other than just for 
logging or debugging purpose.



   if (err)
   break;
   /* only trust the value of 'more' when err == 0 */
   if (more)
   break;

 Thanks,
   --Mark

 --
 Mark Fasheh

 ___
 Ocfs2-devel mailing list
 Ocfs2-devel@oss.oracle.com
 https://oss.oracle.com/mailman/listinfo/ocfs2-devel


___
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel

Re: [Ocfs2-devel] [PATCH] ocfs2: Code cleanup: Removed unused functions

2014-01-21 Thread Srinivas Eeda

Hi Goldwyn,

I am not 100% sure ... but my understanding is ocfs2_fs.h defines 
metadata structures and helper functions. This file is shared between 
ocfs2-tools and kernel modules. The functions you mentioned are used by 
ocfs2-tools code and hence exist in this file.

Thanks,
--Srini

On 01/20/2014 05:47 AM, Goldwyn Rodrigues wrote:
 These functions are either coded in individual files as static or
 not used at all. Remove them.

 Signed-off-by: Goldwyn Rodrigues rgold...@suse.com
 ---
   fs/ocfs2/ocfs2_fs.h | 40 
   1 file changed, 40 deletions(-)

 diff --git a/fs/ocfs2/ocfs2_fs.h b/fs/ocfs2/ocfs2_fs.h
 index 938387a..3ba3851 100644
 --- a/fs/ocfs2/ocfs2_fs.h
 +++ b/fs/ocfs2/ocfs2_fs.h
 @@ -1325,16 +1325,6 @@ static inline int ocfs2_extent_recs_per_dx_root(struct 
 super_block *sb)
   return size / sizeof(struct ocfs2_extent_rec);
   }
   
 -static inline int ocfs2_chain_recs_per_inode(struct super_block *sb)
 -{
 - int size;
 -
 - size = sb-s_blocksize -
 - offsetof(struct ocfs2_dinode, id2.i_chain.cl_recs);
 -
 - return size / sizeof(struct ocfs2_chain_rec);
 -}
 -
   static inline u16 ocfs2_extent_recs_per_eb(struct super_block *sb)
   {
   int size;
 @@ -1493,16 +1483,6 @@ static inline int ocfs2_extent_recs_per_inode(int 
 blocksize)
   return size / sizeof(struct ocfs2_extent_rec);
   }
   
 -static inline int ocfs2_chain_recs_per_inode(int blocksize)
 -{
 - int size;
 -
 - size = blocksize -
 - offsetof(struct ocfs2_dinode, id2.i_chain.cl_recs);
 -
 - return size / sizeof(struct ocfs2_chain_rec);
 -}
 -
   static inline int ocfs2_extent_recs_per_eb(int blocksize)
   {
   int size;
 @@ -1589,12 +1569,6 @@ static inline int ocfs2_xattr_recs_per_xb(int 
 blocksize)
   #endif  /* __KERNEL__ */
   
   
 -static inline int ocfs2_system_inode_is_global(int type)
 -{
 - return ((type = 0) 
 - (type = OCFS2_LAST_GLOBAL_SYSTEM_INODE));
 -}
 -
   static inline int ocfs2_sprintf_system_inode_name(char *buf, int len,
 int type, int slot)
   {
 @@ -1622,19 +1596,5 @@ static inline void ocfs2_set_de_type(struct 
 ocfs2_dir_entry *de,
   de-file_type = ocfs2_type_by_mode[(mode  S_IFMT)S_SHIFT];
   }
   
 -static inline int ocfs2_gd_is_discontig(struct ocfs2_group_desc *gd)
 -{
 - if ((offsetof(struct ocfs2_group_desc, bg_bitmap) +
 -  le16_to_cpu(gd-bg_size)) !=
 - offsetof(struct ocfs2_group_desc, bg_list))
 - return 0;
 - /*
 -  * Only valid to check l_next_free_rec if
 -  * bg_bitmap + bg_size == bg_list.
 -  */
 - if (!gd-bg_list.l_next_free_rec)
 - return 0;
 - return 1;
 -}
   #endif  /* _OCFS2_FS_H */
   


___
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel

Re: [Ocfs2-devel] [PATCH 5/5] ocfs2: Implement delayed dropping of last dquot reference

2014-01-20 Thread Srinivas Eeda

On 01/20/2014 07:31 AM, Goldwyn Rodrigues wrote:
 On 01/16/2014 04:58 PM, Jan Kara wrote:
 On Thu 16-01-14 23:28:49, Jan Kara wrote:
 We cannot drop last dquot reference from downconvert thread as that
 creates the following deadlock:

 NODE 1  NODE2
 holds dentry lock for 'foo'
 holds inode lock for GLOBAL_BITMAP_SYSTEM_INODE
  dquot_initialize(bar)
 ocfs2_dquot_acquire()
 ocfs2_inode_lock(USER_QUOTA_SYSTEM_INODE)
  ...
 downconvert thread (triggered from another
 node or a different process from NODE2)
ocfs2_dentry_post_unlock()
  ...
  iput(foo)
ocfs2_evict_inode(foo)
  ocfs2_clear_inode(foo)
dquot_drop(inode)
  ...
 ocfs2_dquot_release()
ocfs2_inode_lock(USER_QUOTA_SYSTEM_INODE)
 - blocks
  finds we need more 
 space in
  quota file
  ...
 ocfs2_extend_no_holes()
 ocfs2_inode_lock(GLOBAL_BITMAP_SYSTEM_INODE)
  - deadlocks waiting 
 for
downconvert thread

 We solve the problem by postponing dropping of the last dquot reference
 to a workqueue if it happens from the downconvert thread.
Hum, now looking again into ocfs2_clear_inode() there are more 
 problems
 than I originally thought. Look for example at
 ocfs2_mark_lockres_freeing(). That will block on rw/inode/open lock if
 there is downconvert pending waiting for that downconvert to finish.
 However that never happens when ocfs2_clear_inode() is called from the
 downconvert thread.

 So we are back to square one - I don't see a way how to fix these 
 deadlocks
 without postponing dropping of inode reference to a workqueue :(.


 Since the reason of the unlink performance is the delay in calling 
 ocfs2_open_unlock(), and the ocfs2_mark_lockres_freeing() comes after 
 ocfs2_open_unlock(): can we move the call to ocfs2_open_unlock() to 
 ocfs2_evict_inode() and then perform ocfs2_clear_inode() in a deferred 
 way?
once ocfs2_evict_inode is returned, vfs would destroy the inode, so I 
think we should do the cleanup before that and hence we cannot differ 
ocfs2_clear_inode from inside ocfs2_evict_inode

May be we should queue ocfs2_blocking_ast call itself to down convert 
thread. That way it doesn't prevent down convert thread from clearing 
the inode. Once when down convert thread comes to processes the queued 
ast/bast and finds lockres cleared it can just return.




 Honza



 Signed-off-by: Jan Kara j...@suse.cz
 ---
   fs/ocfs2/ocfs2.h|  5 +
   fs/ocfs2/quota.h|  2 ++
   fs/ocfs2/quota_global.c | 35 +++
   fs/ocfs2/super.c|  8 
   4 files changed, 50 insertions(+)

 diff --git a/fs/ocfs2/ocfs2.h b/fs/ocfs2/ocfs2.h
 index ca81f6b49236..f6134345fe42 100644
 --- a/fs/ocfs2/ocfs2.h
 +++ b/fs/ocfs2/ocfs2.h
 @@ -30,6 +30,7 @@
   #include linux/sched.h
   #include linux/wait.h
   #include linux/list.h
 +#include linux/llist.h
   #include linux/rbtree.h
   #include linux/workqueue.h
   #include linux/kref.h
 @@ -410,6 +411,10 @@ struct ocfs2_super
   struct list_head blocked_lock_list;
   unsigned long blocked_lock_count;

 +/* List of dquot structures to drop last reference to */
 +struct llist_head dquot_drop_list;
 +struct work_struct dquot_drop_work;
 +
   wait_queue_head_tosb_mount_event;

   /* Truncate log info */
 diff --git a/fs/ocfs2/quota.h b/fs/ocfs2/quota.h
 index d5ab56cbe5c5..f266d67df3c6 100644
 --- a/fs/ocfs2/quota.h
 +++ b/fs/ocfs2/quota.h
 @@ -28,6 +28,7 @@ struct ocfs2_dquot {
   unsigned int dq_use_count;/* Number of nodes having 
 reference to this entry in global quota file */
   s64 dq_origspace;/* Last globally synced space usage */
   s64 dq_originodes;/* Last globally synced inode usage */
 +struct llist_node list;/* Member of list of dquots to drop */
   };

   /* Description of one chunk to recover in memory */
 @@ -110,6 +111,7 @@ int ocfs2_read_quota_phys_block(struct inode 
 *inode, u64 p_block,
   int ocfs2_create_local_dquot(struct dquot *dquot);
   int ocfs2_local_release_dquot(handle_t *handle, struct dquot *dquot);
   int ocfs2_local_write_dquot(struct dquot *dquot);
 +void ocfs2_drop_dquot_refs(struct work_struct *work);

   extern const struct dquot_operations ocfs2_quota_operations;
   extern struct quota_format_type ocfs2_quota_format;
 diff --git a/fs/ocfs2/quota_global.c b/fs/ocfs2/quota_global.c
 index aaa50611ec66..7921e209c64b 100644
 --- a/fs/ocfs2/quota_global.c
 +++ b/fs/ocfs2/quota_global.c
 @@ -10,6 +10,7 @@
   #include linux/jiffies.h
   #include linux/writeback.h
   #include linux/workqueue.h

Re: [Ocfs2-devel] [PATCH] Revert iput deferring code in ocfs2_drop_dentry_lock

2014-01-16 Thread Srinivas Eeda

Hi Jan,
thanks a lot for explaining the problem. Please see my comment below.

On 01/16/2014 06:02 AM, Jan Kara wrote:
 On Thu 16-01-14 07:35:58, Goldwyn Rodrigues wrote:
 On 01/15/2014 08:47 PM, Jan Kara wrote:
 On Wed 15-01-14 17:17:55, Goldwyn Rodrigues wrote:
 On 01/15/2014 09:53 AM, Jan Kara wrote:
 On Wed 15-01-14 09:32:39, Goldwyn Rodrigues wrote:
 On 01/15/2014 07:33 AM, Jan Kara wrote:
 On Wed 15-01-14 06:51:51, Goldwyn Rodrigues wrote:
 The following patches are reverted in this patch because these
 patches caused regression in the unlink() calls.

 ea455f8ab68338ba69f5d3362b342c115bea8e13 - ocfs2: Push out dropping
 of dentry lock to ocfs2_wq
 f7b1aa69be138ad9d7d3f31fa56f4c9407f56b6a - ocfs2: Fix deadlock on 
 umount
 5fd131893793567c361ae64cbeb28a2a753bbe35 - ocfs2: Don't oops in
 ocfs2_kill_sb on a failed mount

 The regression is caused because these patches delay the iput() in case
 of dentry unlocks. This also delays the unlocking of the open lockres.
 The open lockresource is required to test if the inode can be wiped 
 from
 disk on not. When the deleting node does not get the open lock, it 
 marks
 it as orphan (even though it is not in use by another node/process)
 and causes a journal checkpoint. This delays operations following the
 inode eviction. This also moves the inode to the orphaned inode
 which further causes more I/O and a lot of unneccessary orphans.

 As for the fix, I tried reproducing the problem by using quotas and 
 enabling
 lockdep, but the issue could not be reproduced.
So I was thinking about this for a while trying to remember... What 
 it
 really boils down to is whether it is OK to call ocfs2_delete_inode() 
 from
 DLM downconvert thread. Bear in mind that ocfs2_delete_inode() takes all
 kinds of interesting cluster locks meaning that the downconvert thread 
 can
 block waiting to acquire some cluster lock. That seems like asking for
 trouble to me (i.e., what if the node holding the lock we need from
 downconvert thread needs a lock from us first?) but maybe it is OK these
 days.

 The only lock it tries to take is the inode lock resource, which
 seems to be fine to take.
Why is it obviously fine? Some other node might be holding the inode 
 lock
 and still write to the inode. If that needs a cluster lock for block
 allocation and that allocation cluster lock is currently hold by our node,
 we are screwed. Aren't we?

 No. If another thread is using the allocation cluster lock implies
 we already have the inode lock. Cluster locks are also taken in a
 strict order in order to avoid ABBA locking.
I agree with what you wrote above but it's not exactly what I'm talking
 about. What seems to be possible is:
 NODE 1  NODE2
 holds dentry lock for 'foo'
 holds inode lock for GLOBAL_BITMAP_SYSTEM_INODE
 dquot_initialize(bar)
   ocfs2_dquot_acquire()
 
 ocfs2_inode_lock(USER_QUOTA_SYSTEM_INODE)
 ...
 downconvert thread (triggered from another
 node or a different process from NODE2)
ocfs2_dentry_post_unlock()
  ...
  iput(foo)
ocfs2_evict_inode(foo)
  ocfs2_clear_inode(foo)
dquot_drop(inode)
 ...
  ocfs2_inode_lock(USER_QUOTA_SYSTEM_INODE)
   - blocks
 finds we need more space in
 quota file
 ...
 ocfs2_extend_no_holes()
   
 ocfs2_inode_lock(GLOBAL_BITMAP_SYSTEM_INODE)
 - deadlocks waiting for
   downconvert thread

 Thanks. This explains the situation much better. Whats stopping
 NODE1 from releasing the GLOBAL_BITMAP_SYSTEM_INODE? (to break from
 hold and wait condition)
The fact that to release the lock, it has to be downconverted. And there
 is only one downconvert thread (ocfs2dc) and that is busy waiting for
 USER_QUOTA_SYSTEM_INODE lock...
Yes that indeed seems to be a problem, thanks a lot for explaining :). I 
do not know much about quota's code but wondering if we can fix this 
problem by enforcing the lock ordering ? Can NODE2 check first if it 
needs space before getting lock on USER_QUOTA_SYSTEM_INODE. If it does 
then it should acquire GLOBAL_BITMAP_SYSTEM_INODE before acquiring lock 
on USER_QUOTA_SYSTEM_INODE ? (Basically we enforce 
GLOBAL_BITMAP_SYSTEM_INODE cannot be taken while node has 
USER_QUOTA_SYSTEM_INODE lock)


 However the positive part of this is that the race I was originally afraid
 of - when ocfs2_delete_inode() would be triggered from
 ocfs2_dentry_post_unlock() - isn't possible because ocfs2 seems to keep
 invariant that node can cache dentry for 'foo' without holding inode lock

Re: [Ocfs2-devel] [PATCH 1/1] o2dlm: fix NULL pointer dereference in o2dlm_blocking_ast_wrapper

2014-01-13 Thread Srinivas Eeda

On 01/13/2014 08:06 PM, Joseph Qi wrote:
 On 2014/1/11 9:19, Srinivas Eeda wrote:
 From: Srinivas Eeda seeda@srini.(none)

 A tiny race between BAST and unlock message causes the NULL dereference.

 A node sends an unlock request to master and receives a response. Before
 processing the response it receives a BAST from the master. Since both 
 requests
 are processed by different threads it creates a race. While the BAST is being
 processed, lock can get freed by unlock code.

 This patch makes bast to return immediately if lock is found but unlock is
 pending. The code should handle this race. We also have to fix master node to
 skip sending BAST after receiving unlock message.

 Below is the crash stack

 BUG: unable to handle kernel NULL pointer dereference at 0048
 IP: [a015e023] o2dlm_blocking_ast_wrapper+0xd/0x16
 [a034e3db] dlm_do_local_bast+0x8e/0x97 [ocfs2_dlm]
 [a034f366] dlm_proxy_ast_handler+0x838/0x87e [ocfs2_dlm]
 [a0308abe] o2net_process_message+0x395/0x5b8 [ocfs2_nodemanager]
 [a030aac8] o2net_rx_until_empty+0x762/0x90d [ocfs2_nodemanager]
 [81071802] worker_thread+0x14d/0x1ed

 Signed-off-by: Srinivas Eeda srinivas.e...@oracle.com
 ---
   fs/ocfs2/dlm/dlmast.c |9 +++--
   1 file changed, 7 insertions(+), 2 deletions(-)

 diff --git a/fs/ocfs2/dlm/dlmast.c b/fs/ocfs2/dlm/dlmast.c
 index b46278f..dbc6cee 100644
 --- a/fs/ocfs2/dlm/dlmast.c
 +++ b/fs/ocfs2/dlm/dlmast.c
 @@ -385,8 +385,13 @@ int dlm_proxy_ast_handler(struct o2net_msg *msg, u32 
 len, void *data,
  head = res-granted;
   
  list_for_each_entry(lock, head, list) {
 -if (lock-ml.cookie == cookie)
 -goto do_ast;
 +/* if lock is found but unlock is pending ignore the bast */
 +if (lock-ml.cookie == cookie) {
 +if (lock-unlock_pending)
 +break;
 +else
 +goto do_ast;
 +}
  }
   
  mlog(0, Got %sast for unknown lock! cookie=%u:%llu, name=%.*s, 

 I found you sent a version on Jan 30, 2012.
 https://oss.oracle.com/pipermail/ocfs2-devel/2012-January/008469.html
 Compared with the old version, this version only saves a little bit CPU,
 am I right?
Yes you are right. I made the change as Goldwyn suggested which is a 
good thing to have :)

___
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel

Re: [Ocfs2-devel] [PATCH 1/1] o2dlm: fix NULL pointer dereference in o2dlm_blocking_ast_wrapper

2014-01-13 Thread Srinivas Eeda

On 01/13/2014 07:37 AM, Joel Becker wrote:
 On Fri, Jan 10, 2014 at 05:19:13PM -0800, Srinivas Eeda wrote:
 From: Srinivas Eeda seeda@srini.(none)

 A tiny race between BAST and unlock message causes the NULL dereference.

 A node sends an unlock request to master and receives a response. Before
 processing the response it receives a BAST from the master. Since both 
 requests
 are processed by different threads it creates a race. While the BAST is being
 processed, lock can get freed by unlock code.

 This patch makes bast to return immediately if lock is found but unlock is
 pending. The code should handle this race. We also have to fix master node to
 skip sending BAST after receiving unlock message.
 Did the master send the BAST after the unlock, or does that race too?
 Does the master know the unlock has succeeded, or does it just think so?
I think it's due to a race but I haven't debugged the master. My guess 
is unlock request sneaked in before the dlm_flush_asts was called. 
However non master node should handle this race as well, so just did 
that part which fixed a bug we were seeing.



 @@ -385,8 +385,13 @@ int dlm_proxy_ast_handler(struct o2net_msg *msg, u32 
 len, void *data,
  head = res-granted;
   
  list_for_each_entry(lock, head, list) {
 -if (lock-ml.cookie == cookie)
 -goto do_ast;
 +/* if lock is found but unlock is pending ignore the bast */
 +if (lock-ml.cookie == cookie) {
 +if (lock-unlock_pending)
 +break;
 +else
 +goto do_ast;
 +}
 This breaks out for asts as well as basts.  Can't that cause problems
 with the unlock ast expected by the caller?
if unlock_pending is set, then the node is trying to unlock an existing 
lock and shouldn't receive any asts ?

 Joel




___
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel

[Ocfs2-devel] [PATCH 1/1] o2dlm: fix NULL pointer dereference in o2dlm_blocking_ast_wrapper

2014-01-10 Thread Srinivas Eeda

From: Srinivas Eeda seeda@srini.(none)

A tiny race between BAST and unlock message causes the NULL dereference.

A node sends an unlock request to master and receives a response. Before
processing the response it receives a BAST from the master. Since both requests
are processed by different threads it creates a race. While the BAST is being
processed, lock can get freed by unlock code.

This patch makes bast to return immediately if lock is found but unlock is
pending. The code should handle this race. We also have to fix master node to
skip sending BAST after receiving unlock message.

Below is the crash stack

BUG: unable to handle kernel NULL pointer dereference at 0048
IP: [a015e023] o2dlm_blocking_ast_wrapper+0xd/0x16
[a034e3db] dlm_do_local_bast+0x8e/0x97 [ocfs2_dlm]
[a034f366] dlm_proxy_ast_handler+0x838/0x87e [ocfs2_dlm]
[a0308abe] o2net_process_message+0x395/0x5b8 [ocfs2_nodemanager]
[a030aac8] o2net_rx_until_empty+0x762/0x90d [ocfs2_nodemanager]
[81071802] worker_thread+0x14d/0x1ed

Signed-off-by: Srinivas Eeda srinivas.e...@oracle.com
---
 fs/ocfs2/dlm/dlmast.c |9 +++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/fs/ocfs2/dlm/dlmast.c b/fs/ocfs2/dlm/dlmast.c
index b46278f..dbc6cee 100644
--- a/fs/ocfs2/dlm/dlmast.c
+++ b/fs/ocfs2/dlm/dlmast.c
@@ -385,8 +385,13 @@ int dlm_proxy_ast_handler(struct o2net_msg *msg, u32 len, 
void *data,
head = res-granted;
 
list_for_each_entry(lock, head, list) {
-   if (lock-ml.cookie == cookie)
-   goto do_ast;
+   /* if lock is found but unlock is pending ignore the bast */
+   if (lock-ml.cookie == cookie) {
+   if (lock-unlock_pending)
+   break;
+   else
+   goto do_ast;
+   }
}
 
mlog(0, Got %sast for unknown lock! cookie=%u:%llu, name=%.*s, 
-- 
1.7.9.5


___
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel

Re: [Ocfs2-devel] What's the need of OCFS2_INODE_MAYBE_ORPHANED?

2014-01-09 Thread Srinivas Eeda

On 01/09/2014 08:34 AM, Goldwyn Rodrigues wrote:
 On 01/09/2014 10:06 AM, Srinivas Eeda wrote:
 On 01/09/2014 07:44 AM, Goldwyn Rodrigues wrote:
 Hi Srini,

 Thanks for the reply.

 On 01/08/2014 11:30 PM, Srinivas Eeda wrote:
 From the comments in fs/ocfs2/inode.h:90 it seems, this was 
 used in
 legacy ocfs2 systems when a node received unlink votes. Since 
 unlink
 votes has been done away with and replaced with open locks, is this
 flag still required? If yes, why?
 My understanding is that unlink voting protocol was heavy. So the
 following was done to address it.

 To do an unlink, dentry has to be removed. In order to do that the
 node
 has to get EX lock on the dentry which means all other nodes have to
 downconvert. In general EX lock on dentry is acquired only in
 unlink and
 I assume rename case. So all nodes which down convert the lock mark
 their inode OCFS2_INODE_MAYBE_ORPHANED. The only problem with 
 this is
 that dentry on a node can get purged because of memory pressure 
 which
 marks inode as OCFS2_INODE_MAYBE_ORPHANED even when no unlink was 
 done
 on this inode.


 I think you are getting confused between dentry_lock (dentry_lockres)
 and open lock (ip_open_lockres). AFAICS, dentry locks are used to
 control the remote dentries.
 I was trying to answer why we need OCFS2_INODE_MAYBE_ORPHANED flag, I
 guess I wasn't clear. I'll make an other attempt :).

 One way for node A to tell node B that an unlink had happened on 
 node A
 is by sending an explicit message(something similar to what we had in
 old release). When node B received such communication it marked inode
 with OCFS2_INODE_MAYBE_ORPHANED flag if it still had the inode in use.

 The other way(current implementation) is to indirectly tell it by 
 asking
 node B to purge dentry lockres. Once node B has been informed that
 dentry lock has to be released, it assumes inode might have been
 unlinked somewhere and marks the inode with OCFS2_INODE_MAYBE_ORPHANED
 flag.

 So, we need OCFS2_INODE_MAYBE_ORPHANED flag to tell node B that it
 should finish the second phase of unlink(remove the inode from file
 system) when it closes the file.

 Okay, but  why should node B do the cleanup/wipe when node A initiated
 the unlink()? Shouldn't it be done by node A? All node B should do is
 to write the inode and clear it from the cache. The sequence is
 synchronized by dentry_lock. Right?
 removing dentry is only the first part. An inode can still be open after
 that.

 Yes, I did not consider that.
 How about using open locks ro_holders count to identify this? That may 
 just work. Thanks!
One problem I see in using open lock for this is it could be late. 
Consider the scenario where node A removes the dentry and then the node 
crashes before trying the try_open_lock. Node B does the file close 
later but it doesn't know that the file was unlinked and doesn't do the 
clean up.

To me it appears OCFS2_INODE_MAYBE_ORPHANED is necessary. Any delay it 
is causing must be addressed differently.

Removing those 3 patches I pointed should help in synchronously deleting 
the file. But I am not sure if it would speed up the process.




 We are performing ocfs2_inode_lock() anyways which is re-reading the
 inode from disk (for node A)
 node B should do the cleanup in this case because it is the last node to
 close the file. Node A, will do the first phase of unlink(remove dentry)
 and node B will do the second phase of unlink(remove the inode).


 Agree. Lets see if we can use open locks for this.

 Thanks!



___
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel

Re: [Ocfs2-devel] What's the need of OCFS2_INODE_MAYBE_ORPHANED?

2014-01-08 Thread Srinivas Eeda

Hi Goldwyn,

On 01/08/2014 04:12 PM, Goldwyn Rodrigues wrote:
 Hi,

 From the comments in fs/ocfs2/inode.h:90 it seems, this was used in
 legacy ocfs2 systems when a node received unlink votes. Since unlink
 votes has been done away with and replaced with open locks, is this
 flag still required? If yes, why?
My understanding is that unlink voting protocol was heavy. So the 
following was done to address it.

To do an unlink, dentry has to be removed. In order to do that the node 
has to get EX lock on the dentry which means all other nodes have to 
downconvert. In general EX lock on dentry is acquired only in unlink and 
I assume rename case. So all nodes which down convert the lock mark 
their inode OCFS2_INODE_MAYBE_ORPHANED. The only problem with this is 
that dentry on a node can get purged because of memory pressure which 
marks inode as OCFS2_INODE_MAYBE_ORPHANED even when no unlink was done 
on this inode.


 From my ongoing investigation of unlink() times, it seems this flag is
 causing the delay with releasing the open locks while downconverting
 dentry locks. The flag is set  _everytime_ a dentry downconvert is
 performed even if the file  is not scheduled to be deleted. If not, we
 can be smartly evict the inodes which are *not* to be deleted
 (i_nlink0) by not offloading to ocfs2_wq. This way open lock will
 release faster speeding up unlink on the deleting node.


Are you referring to the delay caused by ocfs2_drop_dentry_lock queueing 
dentry locks to dentry_lock_list ?. If that's the case, have you tried 
removing following patches which introduced that behavior ? I think that 
quota's deadlock bug might have to be addressed differently ?

ea455f8ab68338ba69f5d3362b342c115bea8e13
eb90e46458b08bc7c1c96420ca0eb4263dc1d6e5
bb44bf820481e19381ec549118e4ee0b89d56191

The above patches were leaving orphan files around which was causing a 
big problem to some applications that removes lot of files which inturn 
caused intermittent hangs

___
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel

Re: [Ocfs2-devel] What's the need of OCFS2_INODE_MAYBE_ORPHANED?

2014-01-08 Thread Srinivas Eeda

On 01/08/2014 07:12 PM, Goldwyn Rodrigues wrote:
 Hi Srini,

 On 01/08/2014 07:29 PM, Srinivas Eeda wrote:
 Hi Goldwyn,

 On 01/08/2014 04:12 PM, Goldwyn Rodrigues wrote:
 Hi,

 From the comments in fs/ocfs2/inode.h:90 it seems, this was used in
 legacy ocfs2 systems when a node received unlink votes. Since unlink
 votes has been done away with and replaced with open locks, is this
 flag still required? If yes, why?
 My understanding is that unlink voting protocol was heavy. So the
 following was done to address it.

 To do an unlink, dentry has to be removed. In order to do that the node
 has to get EX lock on the dentry which means all other nodes have to
 downconvert. In general EX lock on dentry is acquired only in unlink and
 I assume rename case. So all nodes which down convert the lock mark
 their inode OCFS2_INODE_MAYBE_ORPHANED. The only problem with this is
 that dentry on a node can get purged because of memory pressure which
 marks inode as OCFS2_INODE_MAYBE_ORPHANED even when no unlink was done
 on this inode.


 I think you are getting confused between dentry_lock (dentry_lockres) 
 and open lock (ip_open_lockres). AFAICS, dentry locks are used to 
 control the remote dentries.
I was trying to answer why we need OCFS2_INODE_MAYBE_ORPHANED flag, I 
guess I wasn't clear. I'll make an other attempt :).

One way for node A to tell node B that an unlink had happened on node A 
is by sending an explicit message(something similar to what we had in 
old release). When node B received such communication it marked inode 
with OCFS2_INODE_MAYBE_ORPHANED flag if it still had the inode in use.

The other way(current implementation) is to indirectly tell it by asking 
node B to purge dentry lockres. Once node B has been informed that 
dentry lock has to be released, it assumes inode might have been 
unlinked somewhere and marks the inode with OCFS2_INODE_MAYBE_ORPHANED flag.

So, we need OCFS2_INODE_MAYBE_ORPHANED flag to tell node B that it 
should finish the second phase of unlink(remove the inode from file 
system) when it closes the file.



 From my ongoing investigation of unlink() times, it seems this flag is
 causing the delay with releasing the open locks while downconverting
 dentry locks. The flag is set  _everytime_ a dentry downconvert is
 performed even if the file  is not scheduled to be deleted. If not, we
 can be smartly evict the inodes which are *not* to be deleted
 (i_nlink0) by not offloading to ocfs2_wq. This way open lock will
 release faster speeding up unlink on the deleting node.


 Are you referring to the delay caused by ocfs2_drop_dentry_lock queueing
 dentry locks to dentry_lock_list ?. If that's the case, have you tried
 removing following patches which introduced that behavior ? I think that
 quota's deadlock bug might have to be addressed differently ?

 ea455f8ab68338ba69f5d3362b342c115bea8e13

 Yes, that should make some difference. Let me try that. However, I was 
 suggesting we do not set the OCFS2_INODE_MAYBE_ORPHANED flag in 
 ocfs2_dentry_convert_worker as well, but I am not sure of the 
 consequences and that is the reason I asked why it is used.

 eb90e46458b08bc7c1c96420ca0eb4263dc1d6e5
 bb44bf820481e19381ec549118e4ee0b89d56191

 I did not find these gits. Which tree are you referring to?

Sorry, my bad. Those commit id's were from my local repo. I meant
f7b1aa69be138ad9d7d3f31fa56f4c9407f56b6a and
5fd131893793567c361ae64cbeb28a2a753bbe35


 The above patches were leaving orphan files around which was causing a
 big problem to some applications that removes lot of files which inturn
 caused intermittent hangs





___
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel

Re: [Ocfs2-devel] [PATCH 1/1] ocfs2/dlm: ocfs2 dlm umount skip migrating lockres

2013-09-11 Thread Srinivas Eeda

Joseph,
thanks for pointing it out.

Andrew,
can you please pull the following patch acked by Sunil and Joel

https://oss.oracle.com/pipermail/ocfs2-devel/2012-August/008670.html


Thanks,
--Srini


On 09/11/2013 07:07 PM, Joseph Qi wrote:
 On 2013/9/12 2:40, Tariq Saeed wrote:
 umount thread could race with migrate handler thread receiving resource
 migration request from other node. When this happens, migrating thread
 could set this node as the master along with DLM_LOCK_RES_MIGRATING flag.
 umount thread should skip migrating this newly owned lockres until
 DLM_LOCK_RES_MIGRATING flag is unset by migrate handler thread. umount
 thread will ultimately migrate this lockres during another pass of the
 lockres hash list.

 Signed-off-by: Tariq Saeed tariq.x.sa...@oracle.com
 Signed-off-by: Srinivas Eeda srinivas.e...@oracle.com
 ---
   fs/ocfs2/dlm/dlmmaster.c |4 
   1 files changed, 4 insertions(+), 0 deletions(-)

 diff --git a/fs/ocfs2/dlm/dlmmaster.c b/fs/ocfs2/dlm/dlmmaster.c
 index 33ecbe0..1643b58 100644
 --- a/fs/ocfs2/dlm/dlmmaster.c
 +++ b/fs/ocfs2/dlm/dlmmaster.c
 @@ -2360,6 +2360,10 @@ static int dlm_is_lockres_migrateable(struct dlm_ctxt 
 *dlm,
  if (res-owner != dlm-node_num)
  return 0;
   
 +if (res-state  DLM_LOCK_RES_MIGRATING) {
 +return 0;
 +}
 +
   for (idx = DLM_GRANTED_LIST; idx = DLM_BLOCKED_LIST; idx++) {
  queue = dlm_list_idx_to_ptr(res, idx);
  list_for_each_entry(lock, queue, list) {

 This patch was sent before by Jiufei Xue on Aug 2012, which was titled
 as ocfs2: delay migration when the lockres is in migration state.
 Please refer the below link for details:
 https://oss.oracle.com/pipermail/ocfs2-devel/2012-August/008670.html

 As I remembered, the patch was alerady acked by Sunil and Joel, and then
 it was in ocfs2.git.
 I am not sure why it wasn't merged to mainline.



 ___
 Ocfs2-devel mailing list
 Ocfs2-devel@oss.oracle.com
 https://oss.oracle.com/mailman/listinfo/ocfs2-devel


___
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel

Re: [Ocfs2-devel] [PATCH] ocfs2: resend master request when lost connection with someone

2013-05-31 Thread Srinivas Eeda

On 05/31/2013 03:38 AM, Xue jiufei wrote:
 Hi, Xiaowei
 It's OK to simlify the patch just as you did. But we don't want to resend
 master request to all others nodes in consideration of network traffic.
 So we record those maybe down nodes in down_nodemap.

 于 2013/5/28 14:12, xiaowei.hu 写道:
 Hi,

 I reviewed this patch , it did could fix a temp lost connection problem, but 
 a few questions:

 1. since we don't need to know the node numbers of down nodes, if simply 
 replace the down_nodemap[BITS_TO_LONGS(O2NM_MAX_NODES)], with a int named 
 for example mreq_msg_send_fail ?

 2.since the final work is to return -EAGAIN, the resend all master requests. 
 How about we simply do this?:

   while ((nodenum = dlm_node_iter_next(iter)) = 0) {
   ret = dlm_do_master_request(res, mle, nodenum);
 -if (ret  0)
 +if (ret  0) {
   mlog_errno(ret);
 +wait_on_recovery = 1;
 +msleep(DLM_NODE_DEATH_WAIT_MAX);
 +goto redo_request;
 +}

 Am I missing something?

 Thanks,
 Xiaowei

 On 12/22/2012 03:00 PM, Xue jiufei wrote:
 Function dlm_get_lock_resource() sends master request to all nodes in
 domain_map and waits for their responses when the node(say nodeA) doesn't
 known who the master is.
 When nodeA sends the master request, it happened that network of
 nodeB down for a while, and then restore. The master request
 from nodeA does not reach nodeB. NodeA may wait again and again in
 dlm_wait_for_lock_mastery() and never returns.
 This patch resend the mater request when a node lost connection with
 some other nodes.
Yes, with the current reconnect code there is a possibility of message 
loss and can happen to all kinds of messages and responses. You are 
assuming the message never received by node B. Message might have been 
received by node B but the response might have lost too. Currently there 
is no way of telling the difference. DLM layer shouldn't worry about 
nodes loosing connection temporarily. Right way to fix this is to fix 
o2net reconnect code.

 Signed-off-by: xuejiufei xuejiu...@huawei.com
 ---
fs/ocfs2/dlm/dlmmaster.c |   41 +++--
1 files changed, 35 insertions(+), 6 deletions(-)

 diff --git a/fs/ocfs2/dlm/dlmmaster.c b/fs/ocfs2/dlm/dlmmaster.c
 index c491f97..2a99a95 100644
 --- a/fs/ocfs2/dlm/dlmmaster.c
 +++ b/fs/ocfs2/dlm/dlmmaster.c
 @@ -106,7 +106,7 @@ static int dlm_do_master_request(struct 
 dlm_lock_resource *res,
static int dlm_wait_for_lock_mastery(struct dlm_ctxt *dlm,
 struct dlm_lock_resource *res,
 struct dlm_master_list_entry *mle,
 - int *blocked);
 + int *blocked, int *retry, int host_down);
static int dlm_restart_lock_mastery(struct dlm_ctxt *dlm,
struct dlm_lock_resource *res,
struct dlm_master_list_entry *mle,
 @@ -712,6 +712,8 @@ struct dlm_lock_resource * dlm_get_lock_resource(struct 
 dlm_ctxt *dlm,
unsigned int hash;
int tries = 0;
int bit, wait_on_recovery = 0;
 +int retry = 0;
 +unsigned long down_nodemap[BITS_TO_LONGS(O2NM_MAX_NODES)];
  BUG_ON(!lockid);
@@ -910,11 +912,25 @@ redo_request:
goto wait;
  ret = -EINVAL;
 -dlm_node_iter_init(mle-vote_map, iter);
 +if (!retry)
 +dlm_node_iter_init(mle-vote_map, iter);
 +else {
 +mlog(0, %s:%.*s: retrying, send master request to maybe down 
 node\n,
 +dlm-name, res-lockname.len, res-lockname.name);
 +dlm_node_iter_init(down_nodemap, iter);
 +}
 +memset(down_nodemap, 0, sizeof(down_nodemap));
 +
while ((nodenum = dlm_node_iter_next(iter)) = 0) {
ret = dlm_do_master_request(res, mle, nodenum);
 -if (ret  0)
 +if (ret  0) {
mlog_errno(ret);
 +if (dlm_is_host_down(ret)) {
 +mlog(0, %s:%.*s: node %u maybe dead, set down_nodemap\n,
 +dlm-name, res-lockname.len, res-lockname.name, 
 nodenum);
 +set_bit(nodenum, down_nodemap);
 +}
 +}
if (mle-master != O2NM_MAX_NODES) {
/* found a master ! */
if (mle-master = nodenum)
 @@ -931,9 +947,11 @@ redo_request:
  wait:
/* keep going until the response map includes all nodes */
 -ret = dlm_wait_for_lock_mastery(dlm, res, mle, blocked);
 +ret = dlm_wait_for_lock_mastery(dlm, res, mle, blocked, retry,
 +find_next_bit(down_nodemap, O2NM_MAX_NODES, 0)  
 O2NM_MAX_NODES);
if (ret  0) {
 -wait_on_recovery = 1;
 +if (!retry)
 +wait_on_recovery = 1;
mlog(0, %s: res %.*s, Node map changed, redo the master 
 request now, blocked=%d\n, dlm-name, res-lockname.len,
 res-lockname.name, blocked);
 @@ -980,7 +998,7 @@

[Ocfs2-devel] ocfs2 discontig localalloc patches (ver 2)

2012-09-18 Thread Srinivas Eeda

Hi Joel, et al,

sorry for the delay in resending discontiguous localalloc patches. Can you 
please review when you get a chance. I'll email the tools patches once kernel 
patches are approved.


I came across two use cases where this feature will help.

1. On a customer site, after running an application that creates, deletes ~200 
hundred files per second, filesystem got fragmented which impacts localalloc 
bitmap. Filesystem usage was about 52%. The fragmentation significantly reduces 
localalloc bitmap size from about 256 mb to closer to 1M and eventually gets 
disabled.

2. A filesystem of 2TB has a cluster size of 4K which divides filesystem into 
about 16000 subgroups. In the begining, localalloc bitmap looks for about 2048 
contiguous clusters(8mb). In the begining customer was seeing about 600Mb/sec 
throughput. But once the filesystem was used to about 51% usage, there are very 
few contiguous chunks that are larger than 2048 clusters(but are scattered). So 
everytime localalloc bitmap needs to be refilled, it has to scan quite a bit of 
subgroups before finding one big chunk.This behavior brought down throughput 
from 600mb/sec down to around  15mb/sec. When I reduced the bitmap to half, 
performance was back up.


Changes:
1. Allow localalloc bitmap to use multiple contiguous chunks.

2. Each localalloc extent is 128K to start with. This doesn't limit the bitmap 
from getting one big chunk to fill the bitmap.


I have recreated customer filesystem on an iscsi storage with the o2image 
captured from the customer system. When ran a test of creating 100Mb files, I 
see a throughput of ~70M/sec without the discontig-la feature and about 
280M/sec with discontig-la.

Thanks,
--Srini


___
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel

[Ocfs2-devel] [PATCH 5/5] ocfs2 set min default contig localalloc size

2012-09-18 Thread Srinivas Eeda

When discontiguous localalloc is enabled, localalloc should just try for a
minimum smaller contiguous chunk. This will improve performance when the
file system is fragmented. This setting doesn't stop localalloc from getting
bigger chunk.

The default chunk size is set to fill 128Kb. localalloc will try to look for
atleast that big of chunk. If it's not available then it reduces the size by
half and retries.

#define OCFS2_DEFAULT_LOCALALLOC_CHUNK_SIZE (128*1024) /* 128kb */

Signed-off-by: Srinivas Eeda srinivas.e...@oracle.com
---
 fs/ocfs2/localalloc.c |   28 +---
 1 files changed, 21 insertions(+), 7 deletions(-)

diff --git a/fs/ocfs2/localalloc.c b/fs/ocfs2/localalloc.c
index 82653f2..569e131 100644
--- a/fs/ocfs2/localalloc.c
+++ b/fs/ocfs2/localalloc.c
@@ -271,6 +271,19 @@ ocfs2_local_alloc_adjust_bits_wanted(struct ocfs2_dinode 
*alloc,
return wanted;
 }
 
+#define OCFS2_DEFAULT_LOCALALLOC_CHUNK_SIZE (128*1024) /* 128kb */
+
+void ocfs2_set_default_min_alloc_size(struct ocfs2_super *osb)
+{
+   if (ocfs2_supports_discontig_la(osb)) {
+   osb-local_alloc_bits = OCFS2_DEFAULT_LOCALALLOC_CHUNK_SIZE /
+   osb-s_clustersize;
+   if (!osb-local_alloc_bits)
+   osb-local_alloc_bits = 1;
+   } else
+   osb-local_alloc_bits = osb-local_alloc_default_bits;
+}
+
 void ocfs2_la_set_sizes(struct ocfs2_super *osb, int requested_mb)
 {
struct super_block *sb = osb-sb;
@@ -294,8 +307,7 @@ void ocfs2_la_set_sizes(struct ocfs2_super *osb, int 
requested_mb)
osb-local_alloc_default_bits =
ocfs2_megabytes_to_clusters(sb, requested_mb);
}
-
-   osb-local_alloc_bits = osb-local_alloc_default_bits;
+   ocfs2_set_default_min_alloc_size(osb);
 }
 
 static inline int ocfs2_la_state_enabled(struct ocfs2_super *osb)
@@ -1219,7 +1231,7 @@ static int ocfs2_recalc_la_window(struct ocfs2_super *osb,
 * low space.
 */
if (osb-local_alloc_state != OCFS2_LA_THROTTLED)
-   osb-local_alloc_bits = osb-local_alloc_default_bits;
+   ocfs2_set_default_min_alloc_size(osb);
 
 out_unlock:
state = osb-local_alloc_state;
@@ -1369,10 +1381,12 @@ static int ocfs2_local_alloc_new_window(struct 
ocfs2_super *osb,
if (!alloc-id1.bitmap1.i_total)
goto bail;
 
-   spin_lock(osb-osb_lock);
-   if (cluster_count  osb-local_alloc_bits)
-   osb-local_alloc_bits = cluster_count;
-   spin_unlock(osb-osb_lock);
+   if (!ocfs2_supports_discontig_la(osb)) {
+   spin_lock(osb-osb_lock);
+   if (cluster_count  osb-local_alloc_bits)
+   osb-local_alloc_bits = cluster_count;
+   spin_unlock(osb-osb_lock);
+   }
osb-la_last_gd = ac-ac_last_group;
 
bitmap = ocfs2_local_alloc_bitmap(osb, la);
-- 
1.5.4.3


___
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel

[Ocfs2-devel] [PATCH 1/5] ocfs2: modifiy reservation code to support discontiguous allocations

2012-09-18 Thread Srinivas Eeda

Currently reservation code assumes that bitmap given to it is one contigous
chunk. This patch enhances it to handle a discontigous chunks. It adds new
fields m_bitmap_ext_cnt and m_bitmap_ext_arr. m_bitmap_ext_arr tracks the sizes
of each contigous chunk and m_bitmap_ext_cnt trackes size of m_bitmap_ext_arr

Callers should pass number of discontiguous chunks during ocfs2_resmap_restart
and later call ocfs2_resmap_set_extent_size for every chunk.

Signed-off-by: Srinivas Eeda srinivas.e...@oracle.com
---
 fs/ocfs2/localalloc.c   |4 +++-
 fs/ocfs2/reservations.c |   39 ++-
 fs/ocfs2/reservations.h |7 ++-
 3 files changed, 43 insertions(+), 7 deletions(-)

diff --git a/fs/ocfs2/localalloc.c b/fs/ocfs2/localalloc.c
index a9f78c7..53a162f 100644
--- a/fs/ocfs2/localalloc.c
+++ b/fs/ocfs2/localalloc.c
@@ -1191,9 +1191,11 @@ retry_enospc:
memset(OCFS2_LOCAL_ALLOC(alloc)-la_bitmap, 0,
   le16_to_cpu(la-la_size));
 
-   ocfs2_resmap_restart(osb-osb_la_resmap, cluster_count,
+   ocfs2_resmap_restart(osb-osb_la_resmap, 1, cluster_count,
 OCFS2_LOCAL_ALLOC(alloc)-la_bitmap);
 
+   ocfs2_resmap_set_extent_size(osb-osb_la_resmap, 0, cluster_count);
+
trace_ocfs2_local_alloc_new_window_result(
OCFS2_LOCAL_ALLOC(alloc)-la_bm_off,
le32_to_cpu(alloc-id1.bitmap1.i_total));
diff --git a/fs/ocfs2/reservations.c b/fs/ocfs2/reservations.c
index 41ffd36..e1034ed 100644
--- a/fs/ocfs2/reservations.c
+++ b/fs/ocfs2/reservations.c
@@ -291,7 +291,7 @@ static void ocfs2_resmap_clear_all_resv(struct 
ocfs2_reservation_map *resmap)
}
 }
 
-void ocfs2_resmap_restart(struct ocfs2_reservation_map *resmap,
+void ocfs2_resmap_restart(struct ocfs2_reservation_map *resmap, u32 ext_cnt,
  unsigned int clen, char *disk_bitmap)
 {
if (ocfs2_resmap_disabled(resmap))
@@ -300,12 +300,32 @@ void ocfs2_resmap_restart(struct ocfs2_reservation_map 
*resmap,
spin_lock(resv_lock);
 
ocfs2_resmap_clear_all_resv(resmap);
+
+   /* free existing extent array */
+   kfree(resmap-m_bitmap_ext_arr);
+
resmap-m_bitmap_len = clen;
resmap-m_disk_bitmap = disk_bitmap;
 
+   resmap-m_bitmap_ext_cnt = ext_cnt;
+   resmap-m_bitmap_ext_arr = kmalloc((sizeof(u32) * ext_cnt), GFP_NOFS);
+   if (!resmap-m_bitmap_ext_arr) {
+   mlog_errno(-ENOMEM);
+   resmap-m_osb-osb_resv_level = 0; /* disable reservations */
+   }
+
spin_unlock(resv_lock);
 }
 
+void ocfs2_resmap_set_extent_size(struct ocfs2_reservation_map *resmap,
+ int indx, u32 sz)
+{
+   if (ocfs2_resmap_disabled(resmap))
+   return;
+
+   resmap-m_bitmap_ext_arr[indx] = sz;
+}
+
 void ocfs2_resmap_uninit(struct ocfs2_reservation_map *resmap)
 {
/* Does nothing for now. Keep this around for API symmetry */
@@ -419,21 +439,30 @@ static int ocfs2_resmap_find_free_bits(struct 
ocfs2_reservation_map *resmap,
   unsigned int *rlen)
 {
void *bitmap = resmap-m_disk_bitmap;
-   unsigned int best_start, best_len = 0;
+   unsigned int best_start, len, ext, best_len = 0;
int offset, start, found;
 
trace_ocfs2_resmap_find_free_bits_begin(search_start, search_len,
wanted, resmap-m_bitmap_len);
 
-   found = best_start = best_len = 0;
-
+   found = best_start = best_len = ext = 0;
start = search_start;
+   len = resmap-m_bitmap_ext_arr[ext++];
+
while ((offset = ocfs2_find_next_zero_bit(bitmap, resmap-m_bitmap_len,
-start)) != -1) {
+ start)) != -1) {
/* Search reached end of the region */
if (offset = (search_start + search_len))
break;
 
+   if (offset = len) {
+   if (ext = resmap-m_bitmap_ext_cnt)
+   break;
+   len += resmap-m_bitmap_ext_arr[ext++];
+   found = 1;
+   start = offset + 1;
+   }
+
if (offset == start) {
/* we found a zero */
found++;
diff --git a/fs/ocfs2/reservations.h b/fs/ocfs2/reservations.h
index 42c2b80..fed3aae 100644
--- a/fs/ocfs2/reservations.h
+++ b/fs/ocfs2/reservations.h
@@ -56,6 +56,8 @@ struct ocfs2_reservation_map {
u32 m_bitmap_len;   /* Number of valid
 * bits available */
 
+   u32 m_bitmap_ext_cnt;
+   u32 *m_bitmap_ext_arr;
struct list_headm_lru;  /* LRU of reservations
 * structures

[Ocfs2-devel] [PATCH 3/5] ocfs2: new structure to implement discontiguous localalloc bitmap

2012-09-18 Thread Srinivas Eeda

Currently localalloc bitmap uses single contiguous free chunk of clusters. It's
size gets reduced as the contiguous free chunks gets smaller. This ultimately
gets disabled if the largest contiguous free chunk is smaller than 1Mb.

This patch localalloc bitmap to use space from multiple contiguous free chunk
clusters.

This patch introduces new ocfs2_local_alloc_rec structure which tracks a
single contiguous free chunk. When discontigous localalloc is enabled,
ocfs2_local_alloc now holds an array of these records each pointing to a free
chunk on the disk. In best case there is one record and as the filesystem gets
fragmented you may see multipe records.

This feature can be enabled/disabled when the file system is offline.

Signed-off-by: Srinivas Eeda srinivas.e...@oracle.com
---
 fs/ocfs2/localalloc.c |   22 ++--
 fs/ocfs2/ocfs2.h  |7 +
 fs/ocfs2/ocfs2_fs.h   |   65 -
 3 files changed, 79 insertions(+), 15 deletions(-)

diff --git a/fs/ocfs2/localalloc.c b/fs/ocfs2/localalloc.c
index 45818df..98961f4 100644
--- a/fs/ocfs2/localalloc.c
+++ b/fs/ocfs2/localalloc.c
@@ -48,6 +48,13 @@
 
 #define OCFS2_LOCAL_ALLOC(dinode)  (((dinode)-id2.i_lab))
 
+#define OCFS2_LOCAL_ALLOC_REC_SZ(la)   (le16_to_cpu(la-la_rec_count) *\
+sizeof(struct ocfs2_local_alloc_rec))
+#define OCFS2_LOCAL_ALLOC_BITS_PER_REC (sizeof(struct ocfs2_local_alloc_rec)*8)
+
+#define OCFS2_LOCAL_ALLOC_SIZE(osb)ocfs2_local_alloc_size(osb-sb,\
+  osb-s_feature_incompat)
+
 static u32 ocfs2_local_alloc_count_bits(struct ocfs2_dinode *alloc);
 
 static int ocfs2_local_alloc_find_clear_bits(struct ocfs2_super *osb,
@@ -74,6 +81,15 @@ static int ocfs2_local_alloc_new_window(struct ocfs2_super 
*osb,
 
 static int ocfs2_local_alloc_slide_window(struct ocfs2_super *osb);
 
+static inline u8 *ocfs2_local_alloc_bitmap(struct ocfs2_super *osb,
+  struct ocfs2_local_alloc *la)
+{
+   if (ocfs2_supports_discontig_la(osb))
+   return (u8 *)la-la_recs + le16_to_cpu(la-la_bm_start);
+   else
+   return la-la_bitmap;
+}
+
 /*
  * ocfs2_la_default_mb() - determine a default size, in megabytes of
  * the local alloc.
@@ -184,7 +200,7 @@ unsigned int ocfs2_la_default_mb(struct ocfs2_super *osb)
 
/* We can't store more bits than we can in a block. */
la_max_mb = ocfs2_clusters_to_megabytes(osb-sb,
-   ocfs2_local_alloc_size(sb) * 8);
+   OCFS2_LOCAL_ALLOC_SIZE(osb) * 8);
if (la_mb  la_max_mb)
la_mb = la_max_mb;
 
@@ -198,7 +214,7 @@ void ocfs2_la_set_sizes(struct ocfs2_super *osb, int 
requested_mb)
unsigned int la_max_mb;
 
la_max_mb = ocfs2_clusters_to_megabytes(sb,
-   ocfs2_local_alloc_size(sb) * 8);
+   OCFS2_LOCAL_ALLOC_SIZE(osb) * 8);
 
trace_ocfs2_la_set_sizes(requested_mb, la_max_mb, la_default_mb);
 
@@ -329,7 +345,7 @@ int ocfs2_load_local_alloc(struct ocfs2_super *osb)
}
 
if ((la-la_size == 0) ||
-   (le16_to_cpu(la-la_size)  ocfs2_local_alloc_size(inode-i_sb))) {
+   (le16_to_cpu(la-la_size)  OCFS2_LOCAL_ALLOC_SIZE(osb))) {
mlog(ML_ERROR, Local alloc size is invalid (la_size = %u)\n,
 le16_to_cpu(la-la_size));
status = -EINVAL;
diff --git a/fs/ocfs2/ocfs2.h b/fs/ocfs2/ocfs2.h
index 836a647..7ab1afe 100644
--- a/fs/ocfs2/ocfs2.h
+++ b/fs/ocfs2/ocfs2.h
@@ -523,6 +523,13 @@ static inline int ocfs2_supports_discontig_bg(struct 
ocfs2_super *osb)
return 0;
 }
 
+static inline int ocfs2_supports_discontig_la(struct ocfs2_super *osb)
+{
+   if (osb-s_feature_incompat  OCFS2_FEATURE_INCOMPAT_DISCONTIG_LA)
+   return 1;
+   return 0;
+}
+
 static inline unsigned int ocfs2_link_max(struct ocfs2_super *osb)
 {
if (ocfs2_supports_indexed_dirs(osb))
diff --git a/fs/ocfs2/ocfs2_fs.h b/fs/ocfs2/ocfs2_fs.h
index 938387a..409b2b0 100644
--- a/fs/ocfs2/ocfs2_fs.h
+++ b/fs/ocfs2/ocfs2_fs.h
@@ -102,7 +102,8 @@
 | OCFS2_FEATURE_INCOMPAT_INDEXED_DIRS \
 | OCFS2_FEATURE_INCOMPAT_REFCOUNT_TREE 
\
 | OCFS2_FEATURE_INCOMPAT_DISCONTIG_BG  
\
-| OCFS2_FEATURE_INCOMPAT_CLUSTERINFO)
+| OCFS2_FEATURE_INCOMPAT_CLUSTERINFO \
+| OCFS2_FEATURE_INCOMPAT_DISCONTIG_LA)
 #define OCFS2_FEATURE_RO_COMPAT_SUPP   (OCFS2_FEATURE_RO_COMPAT_UNWRITTEN \
 | OCFS2_FEATURE_RO_COMPAT_USRQUOTA

[Ocfs2-devel] [PATCH 4/5] ocfs2: implement discontiguous localalloc bitmap

2012-09-18 Thread Srinivas Eeda

This patch adds code to support discontiguous localalloc bitmap. At any given
time there can be a combination of volumes that have discontigous feature
enabled or disabled.

Signed-off-by: Srinivas Eeda srinivas.e...@oracle.com
---
 fs/ocfs2/localalloc.c |  478 +++--
 1 files changed, 341 insertions(+), 137 deletions(-)

diff --git a/fs/ocfs2/localalloc.c b/fs/ocfs2/localalloc.c
index 98961f4..82653f2 100644
--- a/fs/ocfs2/localalloc.c
+++ b/fs/ocfs2/localalloc.c
@@ -55,14 +55,16 @@
 #define OCFS2_LOCAL_ALLOC_SIZE(osb)ocfs2_local_alloc_size(osb-sb,\
   osb-s_feature_incompat)
 
-static u32 ocfs2_local_alloc_count_bits(struct ocfs2_dinode *alloc);
+static u32 ocfs2_local_alloc_count_bits(struct ocfs2_super *osb,
+   struct ocfs2_dinode *alloc);
 
 static int ocfs2_local_alloc_find_clear_bits(struct ocfs2_super *osb,
 struct ocfs2_dinode *alloc,
 u32 *numbits,
 struct ocfs2_alloc_reservation 
*resv);
 
-static void ocfs2_clear_local_alloc(struct ocfs2_dinode *alloc);
+static void ocfs2_clear_local_alloc(struct ocfs2_super *osb,
+   struct ocfs2_dinode *alloc);
 
 static int ocfs2_sync_local_to_main(struct ocfs2_super *osb,
handle_t *handle,
@@ -207,6 +209,68 @@ unsigned int ocfs2_la_default_mb(struct ocfs2_super *osb)
return la_mb;
 }
 
+static inline u32 ocfs2_local_bitmap_to_cluster(struct ocfs2_super *osb,
+   struct ocfs2_local_alloc *la,
+   u32 bit)
+{
+   u32 start, prev, offset;
+   int rec;
+
+   if (!ocfs2_supports_discontig_la(osb))
+   return le32_to_cpu(la-la_bm_off) + bit;
+
+   rec = start = prev = 0;
+   for (rec = 0; rec  le16_to_cpu(la-la_rec_count); rec++) {
+   prev = start;
+   start += le32_to_cpu(la-la_recs[rec].la_clusters);
+   if (bit  start)
+   break;
+   }
+   offset = le32_to_cpu(la-la_recs[rec].la_start) + (bit - prev);
+
+   return offset;
+}
+
+/*
+ * This function is called before allocating a new chunk for the localalloc
+ * bitmap to make sure there is enough space in the bitmap for the new record.
+ * It adjusts bits wanted and returns 0(if full) or a possitive
+ * number(bits required)
+ */
+static unsigned int
+ocfs2_local_alloc_adjust_bits_wanted(struct ocfs2_dinode *alloc,
+struct ocfs2_alloc_context *ac)
+{
+   unsigned int wanted, bits_used, total_bits;
+   struct ocfs2_local_alloc *la;
+   int bits_left;
+
+   BUG_ON(ac-ac_bits_given  ac-ac_bits_wanted);
+   wanted = ac-ac_bits_wanted - ac-ac_bits_given;
+   if (wanted == 0)
+   return 0;
+
+   la = OCFS2_LOCAL_ALLOC(alloc);
+   total_bits = le16_to_cpu(la-la_dc_size)  3;
+
+   /* space for localalloc bitmap + space used for extent records */
+   bits_used  = le32_to_cpu(alloc-id1.bitmap1.i_total) +
+   (le16_to_cpu(la-la_rec_count) *
+OCFS2_LOCAL_ALLOC_BITS_PER_REC);
+
+   bits_left = total_bits - bits_used;
+   if (bits_left = OCFS2_LOCAL_ALLOC_BITS_PER_REC)
+   return 0;
+
+   /* space fore new record */
+   bits_left -= OCFS2_LOCAL_ALLOC_BITS_PER_REC;
+   if (wanted  bits_left) {
+   wanted = bits_left;
+   ac-ac_bits_wanted = ac-ac_bits_given + bits_left;
+   }
+   return wanted;
+}
+
 void ocfs2_la_set_sizes(struct ocfs2_super *osb, int requested_mb)
 {
struct super_block *sb = osb-sb;
@@ -344,28 +408,38 @@ int ocfs2_load_local_alloc(struct ocfs2_super *osb)
goto bail;
}
 
-   if ((la-la_size == 0) ||
-   (le16_to_cpu(la-la_size)  OCFS2_LOCAL_ALLOC_SIZE(osb))) {
+   if (!ocfs2_supports_discontig_la(osb) 
+   (!la-la_size ||
+(le16_to_cpu(la-la_size)  OCFS2_LOCAL_ALLOC_SIZE(osb {
mlog(ML_ERROR, Local alloc size is invalid (la_size = %u)\n,
 le16_to_cpu(la-la_size));
status = -EINVAL;
goto bail;
}
 
+   if (ocfs2_supports_discontig_la(osb) 
+   (!la-la_dc_size ||
+(le16_to_cpu(la-la_dc_size)  OCFS2_LOCAL_ALLOC_SIZE(osb {
+   mlog(ML_ERROR, Local alloc size is invalid 
+(la_dc_size = %u)\n, le16_to_cpu(la-la_dc_size));
+   status = -EINVAL;
+   goto bail;
+   }
+
/* do a little verification. */
-   num_used = ocfs2_local_alloc_count_bits(alloc);
+   num_used = ocfs2_local_alloc_count_bits(osb, alloc);
 
/* hopefully the local alloc has always been recovered

[Ocfs2-devel] [PATCH 2/5] ocfs2: move ocfs2-local-alloc-inode to ocfs2-super

2012-09-18 Thread Srinivas Eeda

ocfs2_local_alloc_inode is used in multiple functions. It is convenient if we
move ocfs2_local_alloc_inode to ocfs2 super.

Signed-off-by: Srinivas Eeda srinivas.e...@oracle.com
---
 fs/ocfs2/localalloc.c |   53 -
 fs/ocfs2/ocfs2.h  |1 +
 2 files changed, 14 insertions(+), 40 deletions(-)

diff --git a/fs/ocfs2/localalloc.c b/fs/ocfs2/localalloc.c
index 53a162f..45818df 100644
--- a/fs/ocfs2/localalloc.c
+++ b/fs/ocfs2/localalloc.c
@@ -72,8 +72,7 @@ static int ocfs2_local_alloc_new_window(struct ocfs2_super 
*osb,
handle_t *handle,
struct ocfs2_alloc_context *ac);
 
-static int ocfs2_local_alloc_slide_window(struct ocfs2_super *osb,
- struct inode *local_alloc_inode);
+static int ocfs2_local_alloc_slide_window(struct ocfs2_super *osb);
 
 /*
  * ocfs2_la_default_mb() - determine a default size, in megabytes of
@@ -352,7 +351,8 @@ int ocfs2_load_local_alloc(struct ocfs2_super *osb)
 le32_to_cpu(alloc-id1.bitmap1.i_total),
 OCFS2_LOCAL_ALLOC(alloc)-la_bm_off);
 
-   osb-local_alloc_bh = alloc_bh;
+   osb-local_alloc_bh= alloc_bh;
+   osb-local_alloc_inode = inode;
osb-local_alloc_state = OCFS2_LA_ENABLED;
 
 bail:
@@ -379,7 +379,6 @@ void ocfs2_shutdown_local_alloc(struct ocfs2_super *osb)
 {
int status;
handle_t *handle;
-   struct inode *local_alloc_inode = NULL;
struct buffer_head *bh = NULL;
struct buffer_head *main_bm_bh = NULL;
struct inode *main_bm_inode = NULL;
@@ -392,16 +391,6 @@ void ocfs2_shutdown_local_alloc(struct ocfs2_super *osb)
if (osb-local_alloc_state == OCFS2_LA_UNUSED)
goto out;
 
-   local_alloc_inode =
-   ocfs2_get_system_file_inode(osb,
-   LOCAL_ALLOC_SYSTEM_INODE,
-   osb-slot_num);
-   if (!local_alloc_inode) {
-   status = -ENOENT;
-   mlog_errno(status);
-   goto out;
-   }
-
osb-local_alloc_state = OCFS2_LA_DISABLED;
 
ocfs2_resmap_uninit(osb-osb_la_resmap);
@@ -441,7 +430,8 @@ void ocfs2_shutdown_local_alloc(struct ocfs2_super *osb)
}
memcpy(alloc_copy, alloc, bh-b_size);
 
-   status = ocfs2_journal_access_di(handle, INODE_CACHE(local_alloc_inode),
+   status = ocfs2_journal_access_di(handle,
+INODE_CACHE(osb-local_alloc_inode),
 bh, OCFS2_JOURNAL_ACCESS_WRITE);
if (status  0) {
mlog_errno(status);
@@ -473,9 +463,6 @@ out_mutex:
iput(main_bm_inode);
 
 out:
-   if (local_alloc_inode)
-   iput(local_alloc_inode);
-
if (alloc_copy)
kfree(alloc_copy);
 }
@@ -631,22 +618,11 @@ int ocfs2_reserve_local_alloc_bits(struct ocfs2_super 
*osb,
 {
int status;
struct ocfs2_dinode *alloc;
-   struct inode *local_alloc_inode;
unsigned int free_bits;
 
BUG_ON(!ac);
 
-   local_alloc_inode =
-   ocfs2_get_system_file_inode(osb,
-   LOCAL_ALLOC_SYSTEM_INODE,
-   osb-slot_num);
-   if (!local_alloc_inode) {
-   status = -ENOENT;
-   mlog_errno(status);
-   goto bail;
-   }
-
-   mutex_lock(local_alloc_inode-i_mutex);
+   mutex_lock(osb-local_alloc_inode-i_mutex);
 
/*
 * We must double check state and allocator bits because
@@ -680,8 +656,7 @@ int ocfs2_reserve_local_alloc_bits(struct ocfs2_super *osb,
le32_to_cpu(alloc-id1.bitmap1.i_used);
if (bits_wanted  free_bits) {
/* uhoh, window change time. */
-   status =
-   ocfs2_local_alloc_slide_window(osb, local_alloc_inode);
+   status = ocfs2_local_alloc_slide_window(osb);
if (status  0) {
if (status != -ENOSPC)
mlog_errno(status);
@@ -704,7 +679,8 @@ int ocfs2_reserve_local_alloc_bits(struct ocfs2_super *osb,
goto bail;
}
 
-   ac-ac_inode = local_alloc_inode;
+   ac-ac_inode = osb-local_alloc_inode;
+   igrab(ac-ac_inode);
/* We should never use localalloc from another slot */
ac-ac_alloc_slot = osb-slot_num;
ac-ac_which = OCFS2_AC_USE_LOCAL;
@@ -712,10 +688,8 @@ int ocfs2_reserve_local_alloc_bits(struct ocfs2_super *osb,
ac-ac_bh = osb-local_alloc_bh;
status = 0;
 bail:
-   if (status  0  local_alloc_inode) {
-   mutex_unlock(local_alloc_inode-i_mutex);
-   iput(local_alloc_inode);
-   }
+   if (status  0)
+   mutex_unlock(osb

Re: [Ocfs2-devel] RFC: OCFS2 heartbeat improvements

2012-08-22 Thread srinivas eeda



On 8/22/2012 7:17 AM, Jie Liu wrote:

Hi All,

These days, I am investigating an issue regarding OCFS2 unexpected 
reboot in some real world use cases.
This problem occurred when the network status goes south, when the 
disk IO load is too high, etc...
I suspect it might caused by ocfs2 fencing if it's BIO reading/writing 
can not be scheduled and processed quickly, or

something like this happened in the network IO heartbeat thread.

Now am trying to reproduce this problem locally.  In the meantime, I'd 
like to ping you guys with some rough ideas
to improve the disk IO heartbeat to see if they are sounds reasonable 
or not.


Firstly, if an OCFS2 node is suffer from heavy disk IO, how about to 
fix the bio read/write to make this IO request can not
be preempted by other requests? e.g,  for o2hb_issue_node_write(), 
currently, it do bio submission with WRITE only,
'submit_bio(WRITE, bio)'.   If we change the flag to WRITE_SYNC, or 
even submit the request combine with REQ_FUA,

maybe could get highest priority for disk IO request.
This was submitted before by Noboru Iwamatsu and acked by sunil and tao 
but some how didn't get merged


https://oss.oracle.com/pipermail/ocfs2-devel/2011-December/008438.html


Secondly, the comments for bio allocation at o2hb_setup_one_bio() 
indicates that we can pre-allocate bio instead of
acquire for each time.  But I have not saw any code snippet doing such 
things in kernel. :(
how about creating a private bio set for each o2hb_region, so that we 
can do allocation out of it?
maybe it's faster than do allocation from global bio sets.  Also, does 
it make sense if creating a memory pool
on each o2hb_region, so that we can have continuous pages bind to 
those bios?



Any comments are appreciated!

Thanks,
-Jeff








___
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel
___
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel

Re: [Ocfs2-devel] a bug about deadlock when enable quota on ocfs2

2012-07-16 Thread Srinivas Eeda

Hi Jan,

thanks for helping.

Jan Kara wrote:
   Hello,
 his comments:
 @ With those patches in, all other nodes will now queue downgrade of dentry
 @ locks to ocfs2_wq thread. Then Node 1 gets a lock is in use when it calls
 @ ocfs2_try_open_lock and so does other nodes and hence orphans lie
 around. Now
 @ orphans will keep growing and only gets cleared when all nodes umount the
 @ volume. This causes a different problems 1)space is not cleared 2)
 as orphans
 @ keep growing, orphan thread takes long time to scan all orphans(but still
 @ fails to clear oprhans because of open lock still around) and hence will
 @ block new unlinks for that duration because it gets a EX on orphan
 scan lock.
 
   I think the analysis is not completely correct (or I misunderstood it).
 We defer only putting of inode reference to workqueue (lockres is freed
 already in ocfs2_drop_dentry_lock()). However it is correct that the queue
 of inodes to put can get long and the system gets into trouble.
   
Sorry for not being clear. This is an issue when thread running unlink 
and ocfs2_wq on other node end up running ocfs2_delete_inode at the same 
time. They both call ocfs2_try_open_lock during query wipe inode and get 
EAGAIN. So they both defer the actual clean up.

This will become a problem if a user deletes tons of files at the same 
time. Lot of  orphans gets queued and it becomes a problem when user 
continues to delete.
 My questions are
 1.) what kind of potential deadlock in your comment?
 
   Dropping inode reference can result in deletion of inode when this was
 the last reference to an unlinked file. However ocfs2_delete_inode() needs
 to take locks which rank above locks held when ocfs2_drop_dentry_lock() is
 called. You can check this by removing my patches, enabling
 CONFIG_PROVE_LOCKING and see the warning lockdep spits.
I am not familiar with which locks get out of order. Tiger can you 
please check this.
 .) I have tried removing this patch, ocfs2 became more durable,
 although it caused another panic but not get deadlock again, could
 we remove this patch and just to fix the new problem? may the new
 problem is the potential deadlock you mentioned.
 
   I talked about possible solutions to Wengang already. Basically before we
 start unlinking we could check whether there aren't too many queued puts of
 inode references and if yes, drop some of them directly from unlink process
 before we acquire any cluster locks or so.
   
We could do this but if there is a deadlock bug we could still run into 
it when we try deleting directly right?
   Honza

 PS: I CC'ed ocfs2-devel so that this discussion gets archived and other
 developers can help as well.
 PS2: I'm going for a longer vacation now so I won't be responding to email
 for some time.
   
Have a good vacation :)


___
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel

Re: [Ocfs2-devel] [PATCH] Fix waiting status race condition in dlm recovery

2012-05-25 Thread srinivas eeda

comments inline

On 5/24/2012 10:53 PM, xiaowei...@oracle.com wrote:
 From: Xiaowei.Huxiaowei...@oracle.com

 when the master requested locks ,but one/some of the live nodes died,
 after it received the request msg and before send out the locks packages,
 the recovery will fall into endless loop,waiting for the status changed to 
 finalize

 NodeA NodeB
 selected as recovery master
 dlm_remaster_locks
-  dlm_requeset_all_locks
this send request locks msg to B
received the msg from A,
queue worker 
 dlm_request_all_locks_worker
return 0
 go on set state to requested
 wait for the state become done
NodeB lost connection due to 
 network
before the worker begin, or it die.
 NodeA still waiting for the
 change of reco state.
 It won't end if it not get data done msg
 And at this time nodeB do not realize this (or it just died),
 it won't send the msg for ever, nodeA left in the recovery process forever.

 This patch let the recovery master check if the node still in live node
 map when it stay in REQUESTED status.

 Signed-off-by: Xiaowei.Huxiaowei...@oracle.com
 ---
   fs/ocfs2/dlm/dlmrecovery.c |9 +
   1 files changed, 9 insertions(+), 0 deletions(-)

 diff --git a/fs/ocfs2/dlm/dlmrecovery.c b/fs/ocfs2/dlm/dlmrecovery.c
 index 01ebfd0..62659e8 100644
 --- a/fs/ocfs2/dlm/dlmrecovery.c
 +++ b/fs/ocfs2/dlm/dlmrecovery.c
 @@ -555,6 +555,7 @@ static int dlm_remaster_locks(struct dlm_ctxt *dlm, u8 
 dead_node)
   int all_nodes_done;
   int destroy = 0;
   int pass = 0;
 + int dying = 0;

   do {
   /* we have become recovery master.  there is no escaping
 @@ -659,6 +660,7 @@ static int dlm_remaster_locks(struct dlm_ctxt *dlm, u8 
 dead_node)
   list_for_each_entry(ndata,dlm-reco.node_data, list) {
   mlog(0, checking recovery state of node %u\n,
ndata-node_num);
 + dying = 0;
   switch (ndata-state) {
   case DLM_RECO_NODE_DATA_INIT:
   case DLM_RECO_NODE_DATA_REQUESTING:
 @@ -679,6 +681,13 @@ static int dlm_remaster_locks(struct dlm_ctxt *dlm, u8 
 dead_node)
dlm-name, ndata-node_num,

 ndata-state==DLM_RECO_NODE_DATA_RECEIVING ?
receiving : requested);
 + spin_lock(dlm-spinlock);
 + dying = !test_bit(ndata-node_num, 
 dlm-live_nodes_map);
 + spin_unlock(dlm-spinlock);
 + if (dying) {
 + ndata-state = 
 DLM_RECO_NODE_DATA_DEAD;
 + break;
 + }
   all_nodes_done = 0;
   break;
   case DLM_RECO_NODE_DATA_DONE:
fix seems to address the issue, but can you please add a function 
dlm_is_node_in_livemap similar to dlm_is_node_dead so that it' improves 
readability. You can then add the following to check if the node is 
still alive
+if (!dlm_is_node_in_livemap(dlm, ndata-node_num))
+ndate-state = DLM_RECO_NODE_DATA_DEAD;
+else
+all_nodes_done = 0;

___
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-devel

[Ocfs2-devel] [PATCH 3/3] ocfs2: modify reservation code to support discontigous localalloc

2012-05-07 Thread Srinivas Eeda

Currently reservation code assumes a bitmap given to it is all one contigous
chunk. This patch enhances it to handle a discontigous chunks. It adds new
fields m_bitmap_ext_cnt and m_bitmap_ext_arr. m_bitmap_ext_arr tracks the sizes
of each contigous free bits and m_bitmap_ext_cnt trackes number of
m_bitmap_ext_arr.

Signed-off-by: Srinivas Eeda srinivas.e...@oracle.com
---
 fs/ocfs2/reservations.c |   41 ++---
 fs/ocfs2/reservations.h |7 ++-
 2 files changed, 40 insertions(+), 8 deletions(-)

diff --git a/fs/ocfs2/reservations.c b/fs/ocfs2/reservations.c
index 41ffd36..fea93d7 100644
--- a/fs/ocfs2/reservations.c
+++ b/fs/ocfs2/reservations.c
@@ -291,7 +291,15 @@ static void ocfs2_resmap_clear_all_resv(struct 
ocfs2_reservation_map *resmap)
}
 }
 
-void ocfs2_resmap_restart(struct ocfs2_reservation_map *resmap,
+void ocfs2_resmap_set_ext(struct ocfs2_reservation_map *resmap, int arr, u32 
sz)
+{
+   if (ocfs2_resmap_disabled(resmap))
+   return;
+
+   resmap-m_bitmap_ext_arr[arr] = sz;
+}
+
+void ocfs2_resmap_restart(struct ocfs2_reservation_map *resmap, u32 ext_cnt,
  unsigned int clen, char *disk_bitmap)
 {
if (ocfs2_resmap_disabled(resmap))
@@ -300,9 +308,21 @@ void ocfs2_resmap_restart(struct ocfs2_reservation_map 
*resmap,
spin_lock(resv_lock);
 
ocfs2_resmap_clear_all_resv(resmap);
+
+   /* free existing extent array */
+   if (resmap-m_bitmap_ext_arr)
+   kfree(resmap-m_bitmap_ext_arr);
+
resmap-m_bitmap_len = clen;
resmap-m_disk_bitmap = disk_bitmap;
 
+   resmap-m_bitmap_ext_cnt = ext_cnt;
+   resmap-m_bitmap_ext_arr = kmalloc((sizeof(u32) * ext_cnt), GFP_NOFS);
+   if (!resmap-m_bitmap_ext_arr) {
+   mlog_errno(-ENOMEM);
+   resmap-m_osb-osb_resv_level = 0;
+   }
+
spin_unlock(resv_lock);
 }
 
@@ -419,20 +439,26 @@ static int ocfs2_resmap_find_free_bits(struct 
ocfs2_reservation_map *resmap,
   unsigned int *rlen)
 {
void *bitmap = resmap-m_disk_bitmap;
-   unsigned int best_start, best_len = 0;
+   unsigned int best_start, len, ext, best_len = 0;
int offset, start, found;
 
trace_ocfs2_resmap_find_free_bits_begin(search_start, search_len,
wanted, resmap-m_bitmap_len);
 
-   found = best_start = best_len = 0;
-
+   found = best_start = best_len = ext = 0;
start = search_start;
+   len = resmap-m_bitmap_ext_arr[ext++];
while ((offset = ocfs2_find_next_zero_bit(bitmap, resmap-m_bitmap_len,
-start)) != -1) {
+ start)) != -1) {
/* Search reached end of the region */
if (offset = (search_start + search_len))
-   break;
+   goto out;
+
+   if (offset = len) {
+   len += resmap-m_bitmap_ext_arr[ext];
+   found = 1;
+   start = offset + 1;
+   }
 
if (offset == start) {
/* we found a zero */
@@ -450,9 +476,10 @@ static int ocfs2_resmap_find_free_bits(struct 
ocfs2_reservation_map *resmap,
}
 
if (found = wanted)
-   break;
+   goto out;
}
 
+out:
if (best_len == 0)
return 0;
 
diff --git a/fs/ocfs2/reservations.h b/fs/ocfs2/reservations.h
index 42c2b80..bb5e94f 100644
--- a/fs/ocfs2/reservations.h
+++ b/fs/ocfs2/reservations.h
@@ -56,6 +56,8 @@ struct ocfs2_reservation_map {
u32 m_bitmap_len;   /* Number of valid
 * bits available */
 
+   u32 m_bitmap_ext_cnt;
+   u32 *m_bitmap_ext_arr;
struct list_headm_lru;  /* LRU of reservations
 * structures. */
 
@@ -94,6 +96,9 @@ void ocfs2_resv_discard(struct ocfs2_reservation_map *resmap,
 int ocfs2_resmap_init(struct ocfs2_super *osb,
  struct ocfs2_reservation_map *resmap);
 
+void ocfs2_resmap_set_ext(struct ocfs2_reservation_map *resmap, int arr,
+ u32 sz);
+
 /**
  * ocfs2_resmap_restart() - restart a reservation bitmap
  * @resmap: reservations bitmap
@@ -107,7 +112,7 @@ int ocfs2_resmap_init(struct ocfs2_super *osb,
  * reservations. A future version will recalculate existing
  * reservations based on the new bitmap.
  */
-void ocfs2_resmap_restart(struct ocfs2_reservation_map *resmap,
+void ocfs2_resmap_restart(struct ocfs2_reservation_map *resmap, u32 ext_cnt,
  unsigned int clen, char *disk_bitmap);
 
 /**
-- 
1.5.4.3

[Ocfs2-devel] [PATCH 1/3] ocfs2: new structure to implment discontiguous local alloc bitmap

2012-05-07 Thread Srinivas Eeda

Current local alloc handles single contiguous free chunk of clusters. This
patch enhances local alloc to handle discontigous free chunks. It adds a new
ocfs2_local_alloc_rec structure which tracks single contiguous free chunk. An
array of these sit in the bitmap itself and track discontiguous chunks. In
best case there is only one record and increases as the filesystem gets
fragmented. Number of records at a time are limited depending on the size
of the bitmap and the max limit is defined by OCFS2_MAX_LOCAL_ALLOC_RECS.

Signed-off-by: Srinivas Eeda srinivas.e...@oracle.com
---
 fs/ocfs2/localalloc.c |   10 ++
 fs/ocfs2/ocfs2.h  |8 
 fs/ocfs2/ocfs2_fs.h   |   48 ++--
 3 files changed, 60 insertions(+), 6 deletions(-)

diff --git a/fs/ocfs2/localalloc.c b/fs/ocfs2/localalloc.c
index 210c352..4190e53 100644
--- a/fs/ocfs2/localalloc.c
+++ b/fs/ocfs2/localalloc.c
@@ -48,6 +48,16 @@
 
 #define OCFS2_LOCAL_ALLOC(dinode)  (((dinode)-id2.i_lab))
 
+#define OCFS2_LOCAL_ALLOC_REC_SZ(la)   (le16_to_cpu(la-la_rec_count) *\
+sizeof(struct ocfs2_local_alloc_rec))
+#define OCFS2_LOCAL_ALLOC_BITMAP(la)((char *)((la-la_recs)) +\
+OCFS2_LOCAL_ALLOC_REC_SZ(la))
+#define OCFS2_LOCAL_ALLOC_BITS_PER_REC (sizeof(struct ocfs2_local_alloc_rec)*8)
+
+/* Maximum number of local alloc records */
+#define OCFS2_MAX_LOCAL_ALLOC_REC_LIMIT128
+
+
 static u32 ocfs2_local_alloc_count_bits(struct ocfs2_dinode *alloc);
 
 static int ocfs2_local_alloc_find_clear_bits(struct ocfs2_super *osb,
diff --git a/fs/ocfs2/ocfs2.h b/fs/ocfs2/ocfs2.h
index d355e6e..d4c36d2 100644
--- a/fs/ocfs2/ocfs2.h
+++ b/fs/ocfs2/ocfs2.h
@@ -367,6 +367,7 @@ struct ocfs2_super
 * by osb_lock */
 
struct buffer_head *local_alloc_bh;
+   struct inode   *local_alloc_inode;
 
u64 la_last_gd;
 
@@ -522,6 +523,13 @@ static inline int ocfs2_supports_discontig_bg(struct 
ocfs2_super *osb)
return 0;
 }
 
+static inline int ocfs2_supports_discontig_la(struct ocfs2_super *osb)
+{
+   if (osb-s_feature_incompat  OCFS2_FEATURE_INCOMPAT_DISCONTIG_LA)
+   return 1;
+   return 0;
+}
+
 static inline unsigned int ocfs2_link_max(struct ocfs2_super *osb)
 {
if (ocfs2_supports_indexed_dirs(osb))
diff --git a/fs/ocfs2/ocfs2_fs.h b/fs/ocfs2/ocfs2_fs.h
index 938387a..6a0fe02 100644
--- a/fs/ocfs2/ocfs2_fs.h
+++ b/fs/ocfs2/ocfs2_fs.h
@@ -102,7 +102,8 @@
 | OCFS2_FEATURE_INCOMPAT_INDEXED_DIRS \
 | OCFS2_FEATURE_INCOMPAT_REFCOUNT_TREE 
\
 | OCFS2_FEATURE_INCOMPAT_DISCONTIG_BG  
\
-| OCFS2_FEATURE_INCOMPAT_CLUSTERINFO)
+| OCFS2_FEATURE_INCOMPAT_CLUSTERINFO \
+| OCFS2_FEATURE_INCOMPAT_DISCONTIG_LA)
 #define OCFS2_FEATURE_RO_COMPAT_SUPP   (OCFS2_FEATURE_RO_COMPAT_UNWRITTEN \
 | OCFS2_FEATURE_RO_COMPAT_USRQUOTA \
 | OCFS2_FEATURE_RO_COMPAT_GRPQUOTA)
@@ -177,6 +178,9 @@
  */
 #define OCFS2_FEATURE_INCOMPAT_CLUSTERINFO 0x4000
 
+/* Discontiguous local alloc */
+#define OCFS2_FEATURE_INCOMPAT_DISCONTIG_LA0x8000
+
 /*
  * backup superblock flag is used to indicate that this volume
  * has backup superblocks.
@@ -664,14 +668,19 @@ struct ocfs2_super_block {
  * Local allocation bitmap for OCFS2 slots
  * Note that it exists inside an ocfs2_dinode, so all offsets are
  * relative to the start of ocfs2_dinode.id2.
+ * Each ocfs2_local_alloc_rec tracks one contigous chunk of clusters.
  */
+struct ocfs2_local_alloc_rec {
+   __le32 la_start;/* 1st cluster in this extent */
+   __le32 la_clusters; /* Number of contiguous clusters */
+};
+
 struct ocfs2_local_alloc
 {
 /*00*/ __le32 la_bm_off;   /* Starting bit offset in main bitmap */
__le16 la_size; /* Size of included bitmap, in bytes */
-   __le16 la_reserved1;
-   __le64 la_reserved2;
-/*10*/ __u8   la_bitmap[0];
+   __le16 la_rec_count;/* Number of discontiguous records */
+   struct ocfs2_local_alloc_rec la_recs[0]; /* Localalloc records */
 };
 
 /*
@@ -1380,11 +1389,24 @@ static inline u16 ocfs2_local_alloc_size(struct 
super_block *sb)
u16 size;
 
size = sb-s_blocksize -
-   offsetof(struct ocfs2_dinode, id2.i_lab.la_bitmap);
+   offsetof(struct ocfs2_dinode, id2.i_lab.la_recs);
+   size -= sizeof(struct ocfs2_local_alloc_rec);
 
return size;
 }
 
+/* effectively this is also the bitmap size */
+static inline u32 ocfs2_local_alloc_cluster_count(struct ocfs2_local_alloc *la)
+{
+   u32 i, clusters;
+
+   clusters = 0;
+   for (i

[Ocfs2-devel] ocfs2 discontiguous localalloc patches

2012-05-07 Thread Srinivas Eeda

Hi all,

can you please review following 3 patches that implement discontiguous
localalloc bitmap support for ocfs2 file system. This feature helps
applications that significantly fragment the filesystem.

These fixes needs changes to ocfs2 tools as well. I am sending those patches
for review separately.

A write up on this feature is available at
http://oss.oracle.com/osswiki/OCFS2/DesignDocs/DiscontiguousLocalAlloc.html

Thanks,
--Srini


___
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-devel

[Ocfs2-devel] [PATCH 2/3] ocfs2: implement discontiguous localalloc bitmap

2012-05-07 Thread Srinivas Eeda

This patch adds supporting functions and modifies localalloc code to implement
discontiguous localalloc bitmap.

Signed-off-by: Srinivas Eeda srinivas.e...@oracle.com
---
 fs/ocfs2/localalloc.c |  523 -
 1 files changed, 342 insertions(+), 181 deletions(-)

diff --git a/fs/ocfs2/localalloc.c b/fs/ocfs2/localalloc.c
index 4190e53..f63381e 100644
--- a/fs/ocfs2/localalloc.c
+++ b/fs/ocfs2/localalloc.c
@@ -48,6 +48,9 @@
 
 #define OCFS2_LOCAL_ALLOC(dinode)  (((dinode)-id2.i_lab))
 
+/* defines minimum contiguous required */
+#define OCFS2_LOCAL_ALLOC_MIN_BITS 2
+
 #define OCFS2_LOCAL_ALLOC_REC_SZ(la)   (le16_to_cpu(la-la_rec_count) *\
 sizeof(struct ocfs2_local_alloc_rec))
 #define OCFS2_LOCAL_ALLOC_BITMAP(la)((char *)((la-la_recs)) +\
@@ -58,7 +61,8 @@
 #define OCFS2_MAX_LOCAL_ALLOC_REC_LIMIT128
 
 
-static u32 ocfs2_local_alloc_count_bits(struct ocfs2_dinode *alloc);
+static u32 ocfs2_local_alloc_count_bits(struct ocfs2_super *osb,
+   struct ocfs2_dinode *alloc);
 
 static int ocfs2_local_alloc_find_clear_bits(struct ocfs2_super *osb,
 struct ocfs2_dinode *alloc,
@@ -82,8 +86,7 @@ static int ocfs2_local_alloc_new_window(struct ocfs2_super 
*osb,
handle_t *handle,
struct ocfs2_alloc_context *ac);
 
-static int ocfs2_local_alloc_slide_window(struct ocfs2_super *osb,
- struct inode *local_alloc_inode);
+static int ocfs2_local_alloc_slide_window(struct ocfs2_super *osb);
 
 /*
  * ocfs2_la_default_mb() - determine a default size, in megabytes of
@@ -202,6 +205,74 @@ unsigned int ocfs2_la_default_mb(struct ocfs2_super *osb)
return la_mb;
 }
 
+static u32 ocfs2_local_bitmap_to_cluster(struct ocfs2_local_alloc *la, u32 bit)
+{
+   u32 start, prev, offset;
+   int rec;
+
+   rec = start = prev = 0;
+   for (rec = 0; rec  le16_to_cpu(la-la_rec_count); rec++) {
+   prev = start;
+   start += le32_to_cpu(la-la_recs[rec].la_clusters);
+   if (bit  start)
+   break;
+   }
+   offset = le32_to_cpu(la-la_recs[rec].la_start) + (bit - prev);
+
+   return offset;
+}
+
+/*
+ * This function is called before allocating a new chunk for the localalloc
+ * bitmap to make sure there is enough space in the bitmap for the new record
+ */
+static u32 ocfs2_local_alloc_adjust_bits_wanted(struct ocfs2_local_alloc *la,
+   struct ocfs2_alloc_context *ac)
+{
+   u32 required, available, cluster_cnt;
+
+   if (ac-ac_bits_given == ac-ac_bits_wanted)
+   return 0;
+
+   /* total bits available in bitmap */
+   available   = le16_to_cpu(la-la_size)  3;
+   cluster_cnt = ocfs2_local_alloc_cluster_count(la);
+
+   /*
+* Wanted shouldn't be greater than bitmap size and given should be
+* equal to cluster count
+*/
+   BUG_ON(ac-ac_bits_given  ac-ac_bits_wanted);
+   BUG_ON(ac-ac_bits_wanted  available);
+   BUG_ON(ac-ac_bits_given != cluster_cnt);
+
+   /* reduce bits taken by each record structure */
+   available -= (le16_to_cpu(la-la_rec_count) *
+ OCFS2_LOCAL_ALLOC_BITS_PER_REC);
+
+   /* reduce space reserved for bitmap for already allocated clusters */
+   available -= cluster_cnt;
+
+   /* if available bits are not enough to fit a new record return 0 */
+   if (available  (OCFS2_LOCAL_ALLOC_BITS_PER_REC + 1))
+   return 0;
+
+   /* Adjust space that will be consumed by new record structure */
+   available -= OCFS2_LOCAL_ALLOC_BITS_PER_REC;
+
+   required = ac-ac_bits_wanted - ac-ac_bits_given;
+
+   /*
+* we can't allocate clusters more than the bits available. Adjust
+* bits wanted
+*/
+   if (required  available) {
+   ac-ac_bits_wanted = ac-ac_bits_given + available;
+   return available;
+   } else
+   return required;
+}
+
 void ocfs2_la_set_sizes(struct ocfs2_super *osb, int requested_mb)
 {
struct super_block *sb = osb-sb;
@@ -239,12 +310,14 @@ void ocfs2_local_alloc_seen_free_bits(struct ocfs2_super 
*osb,
  unsigned int num_clusters)
 {
spin_lock(osb-osb_lock);
-   if (osb-local_alloc_state == OCFS2_LA_DISABLED ||
-   osb-local_alloc_state == OCFS2_LA_THROTTLED)
-   if (num_clusters = osb-local_alloc_default_bits) {
-   cancel_delayed_work(osb-la_enable_wq);
+   if (osb-local_alloc_state == OCFS2_LA_DISABLED) {
+   cancel_delayed_work(osb-la_enable_wq);
+   if (num_clusters = osb-local_alloc_bits)
+   osb-local_alloc_state

Re: [Ocfs2-devel] ocfs2 discontiguous localalloc patches

2012-05-07 Thread Srinivas Eeda

Joel Becker wrote:
 On Mon, May 07, 2012 at 04:21:27PM -0700, Srinivas Eeda wrote:
   
 can you please review following 3 patches that implement discontiguous
 localalloc bitmap support for ocfs2 file system. This feature helps
 applications that significantly fragment the filesystem.
 

   Hi Srini.  Have you some performance numbers backing this?  That
 is, I believe that the described filesystem turned off local alloc.  Do
 you have proof that these patches, turning it back on, improved the
 customer's performance?

 Joel
   
Hi Joel,

thanks a lot for the quick reply.

I have some stat_sysdir.sh snapshots at 
http://oss.oracle.com/~seeda/diag/stat_sysdir/ collected from a system. 
It has 4 snapshots collected when the file system usage is at 8%, 19%, 
21% and 52%.

In file stat_sysdir_52_percent_usage_slow_del.out, for the filesystem 
that has UUID: 3A6F54DF288C4AF2ABD1E00FC49BE7ED you could see that 
local_alloc: bitmap total is 38 and is 0(disabled) for 
local_alloc:0001, and local_alloc:0002. for the filesystem that has uuid 
AC444DB162AE427C899BA89E076DD479, all localalloc appears to be disabled. 
Sorry I didn't collect /sys/kernel/debug/fs/uuid/fs_state. But, given 
the file system state, even if localalloc is not disabled localalloc 
need to be refilled every 40 clusters.

Thanks,
--Srini








___
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-devel

Re: [Ocfs2-devel] [PATCH 2/3] ocfs2: implement discontiguous localalloc bitmap

2012-05-07 Thread Srinivas Eeda

Joel Becker wrote:
 On Mon, May 07, 2012 at 04:21:29PM -0700, Srinivas Eeda wrote:
   

   OH MY DOG NO.  NEVER EVER DO THIS.  You cannot update an old
 filesystem on the fly!  What about other nodes that are running older
 versions of the software?  They will crash or corrupt data!  The entire
 point of feature bits is to make sure all nodes are speaking the same
 code.

 NAK NAK NAK

   This explains why you trusted la_rec_count earlier.  But that is
 broken.  When your patches are done, the code should use la_bm_off and
 la_bitmap when !DISCONTIG_LA and then use la_rec_count, etc when
 DISCONTIG_LA.  The only way to transition between them is a tunefs.ocfs2
 operation that walks the filesystem, flushes the bitmap, and then
 sets/clears la_rec_count appropriately depending on the direction..
   
Please please don't hate me :( ... the changes takes care of old formats 
as well ...  I used the reserved space in the structure so that the code 
changes will be minimal and still compatible with old file system 
formats. I agree that we need to have some reserved space still 
available. So as discussed I'll redo the changes accordingly. Please 
ignore all the patches.

Thanks,
--Srini

___
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-devel

[Ocfs2-devel] [PATCH 1/1] ocfs2: use spinlock irqsave for downconvert lock.patch

2012-01-30 Thread Srinivas Eeda

When ocfs2dc thread holds dc_task_lock spinlock and receives soft IRQ it
deadlock itself trying to get same spinlock in ocfs2_wake_downconvert_thread.
Below is the stack snippet.

The patch disables interrupts when acquiring dc_task_lock spinlock.

ocfs2_wake_downconvert_thread
ocfs2_rw_unlock
ocfs2_dio_end_io
dio_complete
.
bio_endio
req_bio_endio

scsi_io_completion
blk_done_softirq
__do_softirq
do_softirq
irq_exit
do_IRQ
ocfs2_downconvert_thread
[kthread]

Signed-off-by: Srinivas Eeda srinivas.e...@oracle.com
---
 fs/ocfs2/dlmglue.c |   30 ++
 1 files changed, 18 insertions(+), 12 deletions(-)

diff --git a/fs/ocfs2/dlmglue.c b/fs/ocfs2/dlmglue.c
index 81a4cd2..d8552a5 100644
--- a/fs/ocfs2/dlmglue.c
+++ b/fs/ocfs2/dlmglue.c
@@ -3932,6 +3932,8 @@ unqueue:
 static void ocfs2_schedule_blocked_lock(struct ocfs2_super *osb,
struct ocfs2_lock_res *lockres)
 {
+   unsigned long flags;
+
assert_spin_locked(lockres-l_lock);
 
if (lockres-l_flags  OCFS2_LOCK_FREEING) {
@@ -3945,21 +3947,22 @@ static void ocfs2_schedule_blocked_lock(struct 
ocfs2_super *osb,
 
lockres_or_flags(lockres, OCFS2_LOCK_QUEUED);
 
-   spin_lock(osb-dc_task_lock);
+   spin_lock_irqsave(osb-dc_task_lock, flags);
if (list_empty(lockres-l_blocked_list)) {
list_add_tail(lockres-l_blocked_list,
  osb-blocked_lock_list);
osb-blocked_lock_count++;
}
-   spin_unlock(osb-dc_task_lock);
+   spin_unlock_irqrestore(osb-dc_task_lock, flags);
 }
 
 static void ocfs2_downconvert_thread_do_work(struct ocfs2_super *osb)
 {
unsigned long processed;
+   unsigned long flags;
struct ocfs2_lock_res *lockres;
 
-   spin_lock(osb-dc_task_lock);
+   spin_lock_irqsave(osb-dc_task_lock, flags);
/* grab this early so we know to try again if a state change and
 * wake happens part-way through our work  */
osb-dc_work_sequence = osb-dc_wake_sequence;
@@ -3972,38 +3975,40 @@ static void ocfs2_downconvert_thread_do_work(struct 
ocfs2_super *osb)
 struct ocfs2_lock_res, l_blocked_list);
list_del_init(lockres-l_blocked_list);
osb-blocked_lock_count--;
-   spin_unlock(osb-dc_task_lock);
+   spin_unlock_irqrestore(osb-dc_task_lock, flags);
 
BUG_ON(!processed);
processed--;
 
ocfs2_process_blocked_lock(osb, lockres);
 
-   spin_lock(osb-dc_task_lock);
+   spin_lock_irqsave(osb-dc_task_lock, flags);
}
-   spin_unlock(osb-dc_task_lock);
+   spin_unlock_irqrestore(osb-dc_task_lock, flags);
 }
 
 static int ocfs2_downconvert_thread_lists_empty(struct ocfs2_super *osb)
 {
int empty = 0;
+   unsigned long flags;
 
-   spin_lock(osb-dc_task_lock);
+   spin_lock_irqsave(osb-dc_task_lock, flags);
if (list_empty(osb-blocked_lock_list))
empty = 1;
 
-   spin_unlock(osb-dc_task_lock);
+   spin_unlock_irqrestore(osb-dc_task_lock, flags);
return empty;
 }
 
 static int ocfs2_downconvert_thread_should_wake(struct ocfs2_super *osb)
 {
int should_wake = 0;
+   unsigned long flags;
 
-   spin_lock(osb-dc_task_lock);
+   spin_lock_irqsave(osb-dc_task_lock, flags);
if (osb-dc_work_sequence != osb-dc_wake_sequence)
should_wake = 1;
-   spin_unlock(osb-dc_task_lock);
+   spin_unlock_irqrestore(osb-dc_task_lock, flags);
 
return should_wake;
 }
@@ -4033,10 +4038,11 @@ static int ocfs2_downconvert_thread(void *arg)
 
 void ocfs2_wake_downconvert_thread(struct ocfs2_super *osb)
 {
-   spin_lock(osb-dc_task_lock);
+   unsigned long flags;
+   spin_lock_irqsave(osb-dc_task_lock, flags);
/* make sure the voting thread gets a swipe at whatever changes
 * the caller may have made to the voting state */
osb-dc_wake_sequence++;
-   spin_unlock(osb-dc_task_lock);
+   spin_unlock_irqrestore(osb-dc_task_lock, flags);
wake_up(osb-dc_event);
 }
-- 
1.5.4.3


___
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-devel

[Ocfs2-devel] [PATCH 1/1] ocfs2: use spinlock irqsave for downconvert lock.patch

2012-01-30 Thread Srinivas Eeda

When ocfs2dc thread holds dc_task_lock spinlock and receives soft IRQ it
deadlock itself trying to get same spinlock in ocfs2_wake_downconvert_thread.
Below is the stack snippet.

The patch disables interrupts when acquiring dc_task_lock spinlock.

ocfs2_wake_downconvert_thread
ocfs2_rw_unlock
ocfs2_dio_end_io
dio_complete
.
bio_endio
req_bio_endio

scsi_io_completion
blk_done_softirq
__do_softirq
do_softirq
irq_exit
do_IRQ
ocfs2_downconvert_thread
[kthread]

Signed-off-by: Srinivas Eeda srinivas.e...@oracle.com
---
 fs/ocfs2/dlmglue.c |   31 +++
 1 files changed, 19 insertions(+), 12 deletions(-)

diff --git a/fs/ocfs2/dlmglue.c b/fs/ocfs2/dlmglue.c
index 81a4cd2..67af5db 100644
--- a/fs/ocfs2/dlmglue.c
+++ b/fs/ocfs2/dlmglue.c
@@ -3932,6 +3932,8 @@ unqueue:
 static void ocfs2_schedule_blocked_lock(struct ocfs2_super *osb,
struct ocfs2_lock_res *lockres)
 {
+   unsigned long flags;
+
assert_spin_locked(lockres-l_lock);
 
if (lockres-l_flags  OCFS2_LOCK_FREEING) {
@@ -3945,21 +3947,22 @@ static void ocfs2_schedule_blocked_lock(struct 
ocfs2_super *osb,
 
lockres_or_flags(lockres, OCFS2_LOCK_QUEUED);
 
-   spin_lock(osb-dc_task_lock);
+   spin_lock_irqsave(osb-dc_task_lock, flags);
if (list_empty(lockres-l_blocked_list)) {
list_add_tail(lockres-l_blocked_list,
  osb-blocked_lock_list);
osb-blocked_lock_count++;
}
-   spin_unlock(osb-dc_task_lock);
+   spin_unlock_irqrestore(osb-dc_task_lock, flags);
 }
 
 static void ocfs2_downconvert_thread_do_work(struct ocfs2_super *osb)
 {
unsigned long processed;
+   unsigned long flags;
struct ocfs2_lock_res *lockres;
 
-   spin_lock(osb-dc_task_lock);
+   spin_lock_irqsave(osb-dc_task_lock, flags);
/* grab this early so we know to try again if a state change and
 * wake happens part-way through our work  */
osb-dc_work_sequence = osb-dc_wake_sequence;
@@ -3972,38 +3975,40 @@ static void ocfs2_downconvert_thread_do_work(struct 
ocfs2_super *osb)
 struct ocfs2_lock_res, l_blocked_list);
list_del_init(lockres-l_blocked_list);
osb-blocked_lock_count--;
-   spin_unlock(osb-dc_task_lock);
+   spin_unlock_irqrestore(osb-dc_task_lock, flags);
 
BUG_ON(!processed);
processed--;
 
ocfs2_process_blocked_lock(osb, lockres);
 
-   spin_lock(osb-dc_task_lock);
+   spin_lock_irqsave(osb-dc_task_lock, flags);
}
-   spin_unlock(osb-dc_task_lock);
+   spin_unlock_irqrestore(osb-dc_task_lock, flags);
 }
 
 static int ocfs2_downconvert_thread_lists_empty(struct ocfs2_super *osb)
 {
int empty = 0;
+   unsigned long flags;
 
-   spin_lock(osb-dc_task_lock);
+   spin_lock_irqsave(osb-dc_task_lock, flags);
if (list_empty(osb-blocked_lock_list))
empty = 1;
 
-   spin_unlock(osb-dc_task_lock);
+   spin_unlock_irqrestore(osb-dc_task_lock, flags);
return empty;
 }
 
 static int ocfs2_downconvert_thread_should_wake(struct ocfs2_super *osb)
 {
int should_wake = 0;
+   unsigned long flags;
 
-   spin_lock(osb-dc_task_lock);
+   spin_lock_irqsave(osb-dc_task_lock, flags);
if (osb-dc_work_sequence != osb-dc_wake_sequence)
should_wake = 1;
-   spin_unlock(osb-dc_task_lock);
+   spin_unlock_irqrestore(osb-dc_task_lock, flags);
 
return should_wake;
 }
@@ -4033,10 +4038,12 @@ static int ocfs2_downconvert_thread(void *arg)
 
 void ocfs2_wake_downconvert_thread(struct ocfs2_super *osb)
 {
-   spin_lock(osb-dc_task_lock);
+   unsigned long flags;
+
+   spin_lock_irqsave(osb-dc_task_lock, flags);
/* make sure the voting thread gets a swipe at whatever changes
 * the caller may have made to the voting state */
osb-dc_wake_sequence++;
-   spin_unlock(osb-dc_task_lock);
+   spin_unlock_irqrestore(osb-dc_task_lock, flags);
wake_up(osb-dc_event);
 }
-- 
1.5.4.3


___
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-devel

Re: [Ocfs2-devel] [PATCH 1/1] ocfs2: use spinlock irqsave for downconvert lock.patch

2012-01-30 Thread srinivas eeda

sorry ignore this patch, resent another one after adding the new line.

On 1/30/2012 9:47 PM, Srinivas Eeda wrote:
 When ocfs2dc thread holds dc_task_lock spinlock and receives soft IRQ it
 deadlock itself trying to get same spinlock in ocfs2_wake_downconvert_thread.
 Below is the stack snippet.

 The patch disables interrupts when acquiring dc_task_lock spinlock.

   ocfs2_wake_downconvert_thread
   ocfs2_rw_unlock
   ocfs2_dio_end_io
   dio_complete
   .
   bio_endio
   req_bio_endio
   
   scsi_io_completion
   blk_done_softirq
   __do_softirq
   do_softirq
   irq_exit
   do_IRQ
   ocfs2_downconvert_thread
   [kthread]

 Signed-off-by: Srinivas Eedasrinivas.e...@oracle.com
 ---
   fs/ocfs2/dlmglue.c |   30 ++
   1 files changed, 18 insertions(+), 12 deletions(-)

 diff --git a/fs/ocfs2/dlmglue.c b/fs/ocfs2/dlmglue.c
 index 81a4cd2..d8552a5 100644
 --- a/fs/ocfs2/dlmglue.c
 +++ b/fs/ocfs2/dlmglue.c
 @@ -3932,6 +3932,8 @@ unqueue:
   static void ocfs2_schedule_blocked_lock(struct ocfs2_super *osb,
   struct ocfs2_lock_res *lockres)
   {
 + unsigned long flags;
 +
   assert_spin_locked(lockres-l_lock);

   if (lockres-l_flags  OCFS2_LOCK_FREEING) {
 @@ -3945,21 +3947,22 @@ static void ocfs2_schedule_blocked_lock(struct 
 ocfs2_super *osb,

   lockres_or_flags(lockres, OCFS2_LOCK_QUEUED);

 - spin_lock(osb-dc_task_lock);
 + spin_lock_irqsave(osb-dc_task_lock, flags);
   if (list_empty(lockres-l_blocked_list)) {
   list_add_tail(lockres-l_blocked_list,
   osb-blocked_lock_list);
   osb-blocked_lock_count++;
   }
 - spin_unlock(osb-dc_task_lock);
 + spin_unlock_irqrestore(osb-dc_task_lock, flags);
   }

   static void ocfs2_downconvert_thread_do_work(struct ocfs2_super *osb)
   {
   unsigned long processed;
 + unsigned long flags;
   struct ocfs2_lock_res *lockres;

 - spin_lock(osb-dc_task_lock);
 + spin_lock_irqsave(osb-dc_task_lock, flags);
   /* grab this early so we know to try again if a state change and
* wake happens part-way through our work  */
   osb-dc_work_sequence = osb-dc_wake_sequence;
 @@ -3972,38 +3975,40 @@ static void ocfs2_downconvert_thread_do_work(struct 
 ocfs2_super *osb)
struct ocfs2_lock_res, l_blocked_list);
   list_del_init(lockres-l_blocked_list);
   osb-blocked_lock_count--;
 - spin_unlock(osb-dc_task_lock);
 + spin_unlock_irqrestore(osb-dc_task_lock, flags);

   BUG_ON(!processed);
   processed--;

   ocfs2_process_blocked_lock(osb, lockres);

 - spin_lock(osb-dc_task_lock);
 + spin_lock_irqsave(osb-dc_task_lock, flags);
   }
 - spin_unlock(osb-dc_task_lock);
 + spin_unlock_irqrestore(osb-dc_task_lock, flags);
   }

   static int ocfs2_downconvert_thread_lists_empty(struct ocfs2_super *osb)
   {
   int empty = 0;
 + unsigned long flags;

 - spin_lock(osb-dc_task_lock);
 + spin_lock_irqsave(osb-dc_task_lock, flags);
   if (list_empty(osb-blocked_lock_list))
   empty = 1;

 - spin_unlock(osb-dc_task_lock);
 + spin_unlock_irqrestore(osb-dc_task_lock, flags);
   return empty;
   }

   static int ocfs2_downconvert_thread_should_wake(struct ocfs2_super *osb)
   {
   int should_wake = 0;
 + unsigned long flags;

 - spin_lock(osb-dc_task_lock);
 + spin_lock_irqsave(osb-dc_task_lock, flags);
   if (osb-dc_work_sequence != osb-dc_wake_sequence)
   should_wake = 1;
 - spin_unlock(osb-dc_task_lock);
 + spin_unlock_irqrestore(osb-dc_task_lock, flags);

   return should_wake;
   }
 @@ -4033,10 +4038,11 @@ static int ocfs2_downconvert_thread(void *arg)

   void ocfs2_wake_downconvert_thread(struct ocfs2_super *osb)
   {
 - spin_lock(osb-dc_task_lock);
 + unsigned long flags;
 + spin_lock_irqsave(osb-dc_task_lock, flags);
   /* make sure the voting thread gets a swipe at whatever changes
* the caller may have made to the voting state */
   osb-dc_wake_sequence++;
 - spin_unlock(osb-dc_task_lock);
 + spin_unlock_irqrestore(osb-dc_task_lock, flags);
   wake_up(osb-dc_event);
   }

___
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-devel

[Ocfs2-devel] [PATCH 1/1] o2dlm: fix NULL pointer dereference in o2dlm_blocking_ast_wrapper

2012-01-30 Thread Srinivas Eeda

A tiny race between BAST and unlock message causes the NULL dereference.

A node sends an unlock request to master and receives a response. Before
processing the response it receives a BAST from the master. Since both requests
are processed by different threads it creates a race. While the BAST is being
processed, lock can get freed by unlock code.

This patch makes bast to return immediately if lock is found but unlock is
pending. The code should handle this race. We also have to fix master node to
skip sending BAST after receiving unlock message.

Below is the crash stack

BUG: unable to handle kernel NULL pointer dereference at 0048
IP: [a015e023] o2dlm_blocking_ast_wrapper+0xd/0x16
[a034e3db] dlm_do_local_bast+0x8e/0x97 [ocfs2_dlm]
[a034f366] dlm_proxy_ast_handler+0x838/0x87e [ocfs2_dlm]
[a0308abe] o2net_process_message+0x395/0x5b8 [ocfs2_nodemanager]
[a030aac8] o2net_rx_until_empty+0x762/0x90d [ocfs2_nodemanager]
[81071802] worker_thread+0x14d/0x1ed

Signed-off-by: Srinivas Eeda srinivas.e...@oracle.com
---
 fs/ocfs2/dlm/dlmast.c |3 ++-
 1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/fs/ocfs2/dlm/dlmast.c b/fs/ocfs2/dlm/dlmast.c
index 3a3ed4b..1281e8a 100644
--- a/fs/ocfs2/dlm/dlmast.c
+++ b/fs/ocfs2/dlm/dlmast.c
@@ -386,8 +386,9 @@ int dlm_proxy_ast_handler(struct o2net_msg *msg, u32 len, 
void *data,
head = res-granted;
 
list_for_each(iter, head) {
+   /* if lock is found but unlock is pending ignore the bast */
lock = list_entry (iter, struct dlm_lock, list);
-   if (lock-ml.cookie == cookie)
+   if ((lock-ml.cookie == cookie)  (!lock-unlock_pending))
goto do_ast;
}
 
-- 
1.5.4.3


___
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-devel

Re: [Ocfs2-devel] [PATCH 1/1] ocfs2: use spinlock irqsave for downconvert lock.patch

2012-01-29 Thread srinivas eeda

Hi Tao,
thanks for reviewing.

 When ocfs2dc thread holds dc_task_lock spinlock and receives soft IRQ for
 I/O completion it deadlock itself trying to get same spinlock in
 ocfs2_wake_downconvert_thread
 could you please describe it in more detail?
When ocfs2dc thread is running on a cpu and acquired dc_task_lock spin 
lock, an interrupt came in for i/o completion. Interrupt handler then 
tries to acquire same spinlock and ends up in deadlock. Below is the 
stack that gives more details.

[814fd509] default_do_nmi+0x39/0x200
[814fd750] do_nmi+0x80/0xa0
[814fced0] nmi+0x20/0x30
[8103ee83] ? __ticket_spin_lock+0x13/0x20
EOE IRQ  [814fc41e] _raw_spin_lock+0xe/0x20
[a04e32e8] ocfs2_wake_downconvert_thread+0x28/0x60 [ocfs2]
[a04eb388] ocfs2_rw_unlock+0xe8/0x1a0 [ocfs2]
[8110c635] ? mempool_free+0x95/0xa0
[a04d18c1] ocfs2_dio_end_io+0x61/0xb0 [ocfs2]
[8119ebdb] dio_complete+0xbb/0xe0
[8119ec6d] dio_bio_end_aio+0x6d/0xc0
[8119a1bd] bio_endio+0x1d/0x40
[81234643] req_bio_endio+0xa3/0xe0
[8123589f] blk_update_request+0xff/0x490
[8119bec4] ? bio_free+0x64/0x70
[a000193a] end_clone_bio+0x5a/0x90 [dm_mod]
[8119a1bd] bio_endio+0x1d/0x40
[81234643] req_bio_endio+0xa3/0xe0
[813558a6] ? scsi_done+0x26/0x60
[8123589f] blk_update_request+0xff/0x490
[a01ee77a] ? qla2x00_process_completed_request+0x5a/0xe0 [qla2xxx]
[81235c57] blk_update_bidi_request+0x27/0xb0
[81236d8f] blk_end_bidi_request+0x2f/0x80
[81236e30] blk_end_request+0x10/0x20
[8135d600] scsi_end_request+0x40/0xb0
[8135d96f] scsi_io_completion+0x9f/0x5c0
[8103ef19] ? default_spin_lock_flags+0x9/0x10
[81354a59] scsi_finish_command+0xc9/0x130
[8135dff7] scsi_softirq_done+0x147/0x170
[8123ca32] blk_done_softirq+0x82/0xa0
[8106f987] __do_softirq+0xb7/0x210
[814fc41e] ? _raw_spin_lock+0xe/0x20
[8150599c] call_softirq+0x1c/0x30
[81016365] do_softirq+0x65/0xa0
[8106f78d] irq_exit+0xbd/0xe0
[815061f6] do_IRQ+0x66/0xe0
@ [814fc953] common_interrupt+0x13/0x13
EOI  [81261625] ? __list_del_entry+0x35/0xd0
[a04e9d53] ocfs2_downconvert_thread+0x133/0x2a0 [ocfs2]
[814f9d10] ? __schedule+0x3f0/0x800
[8108b930] ? wake_up_bit+0x40/0x40
[a04e9c20] ? ocfs2_process_blocked_lock+0x270/0x270 [ocfs2]
[8108b346] kthread+0x96/0xa0
[815058a4] kernel_thread_helper+0x4/0x10
[8108b2b0] ? kthread_worker_fn+0x1a0/0x1a0
[815058a0] ? gs_change+0x13/0x13
@ ---[ end trace 628bf423ee0bc8a8 ]---
 btw, if you are afraid of deadlocking in soft IRQ, I guess we should use
 spin_lock_bh not spin_lock_irqsave?
Yes, looks like we could, but I am not 100% sure if it will be safe?
 Thanks
 Tao
 The patch disables interrupts when acquiring dc_task_lock spinlock

 Signed-off-by: Srinivas Eedasrinivas.e...@oracle.com
 ---
   fs/ocfs2/dlmglue.c |   30 ++
   1 files changed, 18 insertions(+), 12 deletions(-)

 diff --git a/fs/ocfs2/dlmglue.c b/fs/ocfs2/dlmglue.c
 index 81a4cd2..d8552a5 100644
 --- a/fs/ocfs2/dlmglue.c
 +++ b/fs/ocfs2/dlmglue.c
 @@ -3932,6 +3932,8 @@ unqueue:
   static void ocfs2_schedule_blocked_lock(struct ocfs2_super *osb,
  struct ocfs2_lock_res *lockres)
   {
 +unsigned long flags;
 +
  assert_spin_locked(lockres-l_lock);

  if (lockres-l_flags  OCFS2_LOCK_FREEING) {
 @@ -3945,21 +3947,22 @@ static void ocfs2_schedule_blocked_lock(struct 
 ocfs2_super *osb,

  lockres_or_flags(lockres, OCFS2_LOCK_QUEUED);

 -spin_lock(osb-dc_task_lock);
 +spin_lock_irqsave(osb-dc_task_lock, flags);
  if (list_empty(lockres-l_blocked_list)) {
  list_add_tail(lockres-l_blocked_list,
  osb-blocked_lock_list);
  osb-blocked_lock_count++;
  }
 -spin_unlock(osb-dc_task_lock);
 +spin_unlock_irqrestore(osb-dc_task_lock, flags);
   }

   static void ocfs2_downconvert_thread_do_work(struct ocfs2_super *osb)
   {
  unsigned long processed;
 +unsigned long flags;
  struct ocfs2_lock_res *lockres;

 -spin_lock(osb-dc_task_lock);
 +spin_lock_irqsave(osb-dc_task_lock, flags);
  /* grab this early so we know to try again if a state change and
   * wake happens part-way through our work  */
  osb-dc_work_sequence = osb-dc_wake_sequence;
 @@ -3972,38 +3975,40 @@ static void ocfs2_downconvert_thread_do_work(struct 
 ocfs2_super *osb)
   struct ocfs2_lock_res, l_blocked_list);
  list_del_init(lockres-l_blocked_list);
  osb-blocked_lock_count--;
 -spin_unlock(osb-dc_task_lock);
 +spin_unlock_irqrestore(osb-dc_task_lock, flags);

  BUG_ON(!processed);
  processed--;

[Ocfs2-devel] [PATCH 1/1] ocfs2: use spinlock irqsave for downconvert lock.patch

2012-01-28 Thread Srinivas Eeda

When ocfs2dc thread holds dc_task_lock spinlock and receives soft IRQ for
I/O completion it deadlock itself trying to get same spinlock in
ocfs2_wake_downconvert_thread

The patch disables interrupts when acquiring dc_task_lock spinlock

Signed-off-by: Srinivas Eeda srinivas.e...@oracle.com
---
 fs/ocfs2/dlmglue.c |   30 ++
 1 files changed, 18 insertions(+), 12 deletions(-)

diff --git a/fs/ocfs2/dlmglue.c b/fs/ocfs2/dlmglue.c
index 81a4cd2..d8552a5 100644
--- a/fs/ocfs2/dlmglue.c
+++ b/fs/ocfs2/dlmglue.c
@@ -3932,6 +3932,8 @@ unqueue:
 static void ocfs2_schedule_blocked_lock(struct ocfs2_super *osb,
struct ocfs2_lock_res *lockres)
 {
+   unsigned long flags;
+
assert_spin_locked(lockres-l_lock);
 
if (lockres-l_flags  OCFS2_LOCK_FREEING) {
@@ -3945,21 +3947,22 @@ static void ocfs2_schedule_blocked_lock(struct 
ocfs2_super *osb,
 
lockres_or_flags(lockres, OCFS2_LOCK_QUEUED);
 
-   spin_lock(osb-dc_task_lock);
+   spin_lock_irqsave(osb-dc_task_lock, flags);
if (list_empty(lockres-l_blocked_list)) {
list_add_tail(lockres-l_blocked_list,
  osb-blocked_lock_list);
osb-blocked_lock_count++;
}
-   spin_unlock(osb-dc_task_lock);
+   spin_unlock_irqrestore(osb-dc_task_lock, flags);
 }
 
 static void ocfs2_downconvert_thread_do_work(struct ocfs2_super *osb)
 {
unsigned long processed;
+   unsigned long flags;
struct ocfs2_lock_res *lockres;
 
-   spin_lock(osb-dc_task_lock);
+   spin_lock_irqsave(osb-dc_task_lock, flags);
/* grab this early so we know to try again if a state change and
 * wake happens part-way through our work  */
osb-dc_work_sequence = osb-dc_wake_sequence;
@@ -3972,38 +3975,40 @@ static void ocfs2_downconvert_thread_do_work(struct 
ocfs2_super *osb)
 struct ocfs2_lock_res, l_blocked_list);
list_del_init(lockres-l_blocked_list);
osb-blocked_lock_count--;
-   spin_unlock(osb-dc_task_lock);
+   spin_unlock_irqrestore(osb-dc_task_lock, flags);
 
BUG_ON(!processed);
processed--;
 
ocfs2_process_blocked_lock(osb, lockres);
 
-   spin_lock(osb-dc_task_lock);
+   spin_lock_irqsave(osb-dc_task_lock, flags);
}
-   spin_unlock(osb-dc_task_lock);
+   spin_unlock_irqrestore(osb-dc_task_lock, flags);
 }
 
 static int ocfs2_downconvert_thread_lists_empty(struct ocfs2_super *osb)
 {
int empty = 0;
+   unsigned long flags;
 
-   spin_lock(osb-dc_task_lock);
+   spin_lock_irqsave(osb-dc_task_lock, flags);
if (list_empty(osb-blocked_lock_list))
empty = 1;
 
-   spin_unlock(osb-dc_task_lock);
+   spin_unlock_irqrestore(osb-dc_task_lock, flags);
return empty;
 }
 
 static int ocfs2_downconvert_thread_should_wake(struct ocfs2_super *osb)
 {
int should_wake = 0;
+   unsigned long flags;
 
-   spin_lock(osb-dc_task_lock);
+   spin_lock_irqsave(osb-dc_task_lock, flags);
if (osb-dc_work_sequence != osb-dc_wake_sequence)
should_wake = 1;
-   spin_unlock(osb-dc_task_lock);
+   spin_unlock_irqrestore(osb-dc_task_lock, flags);
 
return should_wake;
 }
@@ -4033,10 +4038,11 @@ static int ocfs2_downconvert_thread(void *arg)
 
 void ocfs2_wake_downconvert_thread(struct ocfs2_super *osb)
 {
-   spin_lock(osb-dc_task_lock);
+   unsigned long flags;
+   spin_lock_irqsave(osb-dc_task_lock, flags);
/* make sure the voting thread gets a swipe at whatever changes
 * the caller may have made to the voting state */
osb-dc_wake_sequence++;
-   spin_unlock(osb-dc_task_lock);
+   spin_unlock_irqrestore(osb-dc_task_lock, flags);
wake_up(osb-dc_event);
 }
-- 
1.5.4.3


___
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-devel

Re: [Ocfs2-devel] [PATCH 0/3] ocfs2: fix slow deleting

2011-07-08 Thread Srinivas Eeda

Below is excerpts from Joel's email for the same question :)

   Currently, orphan scan just iterate all the slots and call
  ocfs2_queue_recovery_completion, but I don't think it is proper for a node
  to query another mounted one since that node will query it by
  itself.

Node 1 has an inode it was using.  The dentry went away due to
memory pressure.  Node 1 closes the inode, but it's on the free list.
The node has the open lock.
Node 2 unlinks the inode.  It grabs the dentry lock to notify
others, but node 1 has no dentry and doesn't get the message.  It
trylocks the open lock, sees that another node has a PR, and does
nothing.
Later node 2 runs its orphan dir.  It igets the inode, trylocks
the open lock, sees the PR still, and does nothing.
Basically, we have to trigger an orphan iput on node 1.  The
only way for this to happen is if node 1 runs node 2's orphan dir.  This
patch exists because that wasn't happening.


On 7/7/2011 1:26 PM, Sunil Mushran wrote:
 On 07/07/2011 01:02 PM, Sunil Mushran wrote:
 On 07/06/2011 11:19 PM, Srinivas Eeda wrote:
 On 7/5/2011 11:17 PM, Sunil Mushran wrote:
 2. All nodes have to scan all slots. Even live slots. I remember we
 did for
 a reason. And that reason should be in the comment in the patch 
 written
 by Srini.
 When a node unlinks a file it inserts an entry into it's own orphan
 slot. If another node is the last one to close the file and dentry got
 flushed then it will not do the cleanup as it doesn't know the file was
 orphaned. The file will remain in the orphan slot  till the node 
 umounts
 and the same slot is reused again. To overcome this problem a node has
 to rescan all slots(including live slots) and try to do the cleanup.
 The qs is not why are we scanning all live slots on all nodes. As in,
 why not just recover the local slot. There was a reason for that.
 Yes, we have to recover unused slots for the reason listed previously.

 bleh let me rephrase.

 The qs is why are we scanning all live slots on all nodes. Wengangs
 patch limits the scanning to the local (live) slot only. And I remember
 we had a reason for it.

___
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-devel

Re: [Ocfs2-devel] [PATCH 0/3] ocfs2: fix slow deleting

2011-07-07 Thread Srinivas Eeda

On 7/5/2011 11:17 PM, Sunil Mushran wrote:
 2. All nodes have to scan all slots. Even live slots. I remember we 
 did for
 a reason. And that reason should be in the comment in the patch written
 by Srini.
When a node unlinks a file it inserts an entry into it's own orphan 
slot. If another node is the last one to close the file and dentry got 
flushed then it will not do the cleanup as it doesn't know the file was 
orphaned. The file will remain in the orphan slot  till the node umounts 
and the same slot is reused again. To overcome this problem a node has 
to rescan all slots(including live slots) and try to do the cleanup.



___
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-devel

Re: [Ocfs2-devel] [PATCH] remove lockres from purge list when we are getting it for creating lock

2011-06-09 Thread Srinivas Eeda


 +   spin_unlock(tmpres-spinlock);
 +   spin_unlock(dlm-spinlock);
 lockres could still get added to purgelist at this point and we
 could still have the same problem? I think, here we need some
 mechanism that marks the lockres is in use that would protect it
 from adding to the purgelist?
 Seems there shouldn't be the possibility that the lockres was moved to purge
 list again.
I think you are right about that once removed from the purge list it 
will not be put back again on the purgelist in this case :)

I still think that there is a hole here. Is it possible that the 
spinlocks are released here and right then dlm_thread was walking 
through dirty list and then it could put this on purge list for the 
first time?

If it's already on purgelist, then we seem to be safe by removing it. 
But what is preventing the lockres getting onto to purge list right at 
this point. locks are added to lockres a little later but till then it 
is not safe?(I think).

   We are in the case of a create lock, there should not be any
 outstanding convert/unlock request in progress concurrently. If there is one,
 the there must be lock(s) attached to the lockres. But if there are locks
 attached, the lockres can't be purged. Also for concurrent create lock, 
 there
 is no problem either because there must be a lock attached to the lockres.

 I think your patch works too. But changing
 DLM_LOCK_RES_IN_USE
 to
 DLM_LOCK_RES_CREATING_LOCK
 can help understand better.
I am ok with the name :). The thing we should do here is once we found a 
lockres and moving ahead with create lock request, we should be sure 
it's protected before releasing the spinlocks.
 I also thought about inflight_locks, but so far I failed to find correct
 places to drop it.

 thanks,
 wengang.
 if (res)
 dlm_lockres_put(res);
 res = tmpres;

___
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-devel

Re: [Ocfs2-devel] [PATCH] remove lockres from purge list when we are getting it for creating lock

2011-06-08 Thread Srinivas Eeda

I think I have seen this problem in ocfs2-1.2 and it was addressed by 
using a new state DLM_LOCK_RES_IN_USE. But we didn't merge into mainline 
as sunil suggested we need to look for a different approach

http://oss.oracle.com/pipermail/ocfs2-devel/2010-June/006669.html
http://www.mail-archive.com/ocfs2-devel@oss.oracle.com/msg05733.html

one comment inline

On 6/8/2011 3:04 AM, Wengang Wang wrote:
 When we are to purge a lockres, we check if it's unused.
 the check includes
 1. no locks attached to the lockres.
 2. not dirty.
 3. not in recovery.
 4. no interested nodes in refmap.
 If the the above four are satisfied, we are going to purge it(remove it
 from hash table).

 While, when a create lock is in progress especially when lockres is owned
 remotely(spinlocks are dropped when networking), the lockres can satisfy above
 four condition. If it's put to purge list, it can be purged out from hash 
 table
 when we are still accessing on it(sending request to owner for example). 
 That's
 not what we want.

 I have met such a problem (orabug 12405575).
 The lockres is in the purge list already(there is a delay for real purge work)
 and the create lock request comes. When we are sending network message to the
 owner in dlmlock_remote(), the owner crashed. So we get DLM_RECOVERING as 
 return
 value and retries dlmlock_remote(). And before the owner crash, we have purged
 the lockres. So the lockres become stale(on lockres-onwer). Thus the code 
 calls
 dlmlock_remote() infinitely.

 fix:
 we remove the lockres from purge list if it's there in dlm_get_lock_resource()
 which is called for only createlock case. So that the lockres can't be purged
 when we are in progress of createlock.

 Signed-off-by: Wengang Wangwen.gang.w...@oracle.com
 ---
   fs/ocfs2/dlm/dlmmaster.c |   41 +
   1 files changed, 29 insertions(+), 12 deletions(-)

 diff --git a/fs/ocfs2/dlm/dlmmaster.c b/fs/ocfs2/dlm/dlmmaster.c
 index 11eefb8..511d43c 100644
 --- a/fs/ocfs2/dlm/dlmmaster.c
 +++ b/fs/ocfs2/dlm/dlmmaster.c
 @@ -709,28 +709,27 @@ lookup:
   spin_lock(dlm-spinlock);
   tmpres = __dlm_lookup_lockres_full(dlm, lockid, namelen, hash);
   if (tmpres) {
 - int dropping_ref = 0;
 -
 - spin_unlock(dlm-spinlock);
 -
   spin_lock(tmpres-spinlock);
   /* We wait for the other thread that is mastering the resource 
 */
   if (tmpres-owner == DLM_LOCK_RES_OWNER_UNKNOWN) {
 + spin_unlock(dlm-spinlock);
   __dlm_wait_on_lockres(tmpres);
   BUG_ON(tmpres-owner == DLM_LOCK_RES_OWNER_UNKNOWN);
 + spin_unlock(tmpres-spinlock);
 + dlm_lockres_put(tmpres);
 + tmpres = NULL;
 + goto lookup;
   }

   if (tmpres-owner == dlm-node_num) {
   BUG_ON(tmpres-state  DLM_LOCK_RES_DROPPING_REF);
   dlm_lockres_grab_inflight_ref(dlm, tmpres);
 - } else if (tmpres-state  DLM_LOCK_RES_DROPPING_REF)
 - dropping_ref = 1;
 - spin_unlock(tmpres-spinlock);
 -
 - /* wait until done messaging the master, drop our ref to allow
 -  * the lockres to be purged, start over. */
 - if (dropping_ref) {
 - spin_lock(tmpres-spinlock);
 + } else if (tmpres-state  DLM_LOCK_RES_DROPPING_REF) {
 + /*
 +  * wait until done messaging the master, drop our ref to
 +  * allow the lockres to be purged, start over.
 +  */
 + spin_unlock(dlm-spinlock);
   __dlm_wait_on_lockres_flags(tmpres, 
 DLM_LOCK_RES_DROPPING_REF);
   spin_unlock(tmpres-spinlock);
   dlm_lockres_put(tmpres);
 @@ -739,6 +738,24 @@ lookup:
   }

   mlog(0, found in hash!\n);
 + /*
 +  * we are going to do a create-lock next. so remove the lockres
 +  * from purge list to avoid the case that we will access on the
 +  * lockres which is already purged out from hash table in
 +  * dlm_run_purge_list() path.
 +  * otherwise, we could run into a problem:
 +  * the owner dead(recovery didn't take care of this lockres
 +  * since it's not in hashtable), and the code keeps sending
 +  * request to the dead node and getting DLM_RECOVERING and
 +  * then retrying infinitely.
 +  */
 + if (!list_empty(tmpres-purge)) {
 + list_del_init(tmpres-purge);
 + dlm_lockres_put(tmpres);
 + }
 + 
 + spin_unlock(tmpres-spinlock);
 + spin_unlock(dlm-spinlock);
lockres could still get added to purgelist at this

[Ocfs2-devel] [PATCH 2/2] ocfs2/cluster: Add per-region debugfs file to show the elapsed time

2010-11-02 Thread Srinivas Eeda

From: Sunil Mushran sunil.mush...@oracle.com

Mainline fa16655a622e7c0fda76ca5155db6efc86968c65
A per-region debugfs file, elapsed_time_in_ms, shows the time since the
heartbeat timer was last armed.

Signed-off-by: Sunil Mushran sunil.mush...@oracle.com
---
 fs/ocfs2/cluster/heartbeat.c |   46 ++
 1 files changed, 46 insertions(+), 0 deletions(-)

diff --git a/fs/ocfs2/cluster/heartbeat.c b/fs/ocfs2/cluster/heartbeat.c
index 8580e66..0ae0dea 100644
--- a/fs/ocfs2/cluster/heartbeat.c
+++ b/fs/ocfs2/cluster/heartbeat.c
@@ -63,6 +63,7 @@ static DECLARE_WAIT_QUEUE_HEAD(o2hb_steady_queue);
 
 #define O2HB_DEBUG_DIR o2hb
 #define O2HB_DEBUG_LIVENODES   livenodes
+#define O2HB_DEBUG_REGION_ELAPSED_TIME elapsed_time_in_ms
 static struct dentry *o2hb_debug_dir;
 static struct dentry *o2hb_debug_livenodes;
 
@@ -135,6 +136,7 @@ struct o2hb_region {
struct o2hb_disk_slot   *hr_slots;
 
struct dentry   *hr_debug_dir;
+   struct dentry   *hr_debug_elapsed_time;
 
/* let the person setting up hb wait for it to return until it
 * has reached a 'steady' state.  This will be fixed when we have
@@ -950,6 +952,29 @@ bail:
return -ENOMEM;
 }
 
+static int o2hb_region_debug_open(struct inode *inode, struct file *file)
+{
+   struct o2hb_region *reg = inode-i_private;
+   char *buf = NULL;
+   int out = 0;
+
+   buf = kmalloc(PAGE_SIZE, GFP_KERNEL);
+   if (!buf)
+   goto bail;
+
+   out += snprintf(buf + out, PAGE_SIZE - out, %u\n,
+   jiffies_to_msecs(jiffies -
+reg-hr_last_timeout_start));
+
+   i_size_write(inode, out);
+
+   file-private_data = buf;
+
+   return 0;
+bail:
+   return -ENOMEM;
+}
+
 static int o2hb_debug_release(struct inode *inode, struct file *file)
 {
kfree(file-private_data);
@@ -985,6 +1010,13 @@ static struct file_operations o2hb_debug_fops = {
.llseek =   generic_file_llseek,
 };
 
+static struct file_operations o2hb_region_debug_fops = {
+   .open = o2hb_region_debug_open,
+   .release =  o2hb_debug_release,
+   .read = o2hb_debug_read,
+   .llseek =   generic_file_llseek,
+};
+
 void o2hb_exit(void)
 {
if (o2hb_debug_livenodes)
@@ -1087,6 +1119,7 @@ static void o2hb_region_release(struct config_item *item)
if (reg-hr_slots)
kfree(reg-hr_slots);
 
+   debugfs_remove(reg-hr_debug_elapsed_time);
debugfs_remove(reg-hr_debug_dir);
 
spin_lock(o2hb_live_lock);
@@ -1617,6 +1650,17 @@ static struct config_item 
*o2hb_heartbeat_group_make_item(struct config_group *g
goto out;
}
 
+   reg-hr_debug_elapsed_time =
+   debugfs_create_file(O2HB_DEBUG_REGION_ELAPSED_TIME,
+   S_IFREG|S_IRUSR,
+   reg-hr_debug_dir,
+   reg,
+   o2hb_region_debug_fops);
+   if (!reg-hr_debug_elapsed_time) {
+   mlog_errno(-ENOMEM);
+   goto out;
+   }
+
spin_lock(o2hb_live_lock);
list_add_tail(reg-hr_all_item, o2hb_all_regions);
spin_unlock(o2hb_live_lock);
@@ -1624,6 +1668,8 @@ static struct config_item 
*o2hb_heartbeat_group_make_item(struct config_group *g
return reg-hr_item;
 
 out:
+   if (reg  reg-hr_debug_dir)
+   debugfs_remove(reg-hr_debug_dir);
kfree(reg);
return NULL;
 }
-- 
1.5.6.5


___
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-devel

[Ocfs2-devel] backports to 1.4 from mainline

2010-11-02 Thread Srinivas Eeda

The following two patches are backports from mainline to 1.4. These patches
create debugfs entry for heartbeat regions and to show elapsed time.

___
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-devel

[Ocfs2-devel] [PATCH 1/2] ocfs2/cluster: Create debugfs dir for heartbeat regions

2010-11-02 Thread Srinivas Eeda

From: Sunil Mushran sunil.mush...@oracle.com

Mainline 0841ed580fe8a3e51ba9dbb133dafc787cce428f
Signed-off-by: Sunil Mushran sunil.mush...@oracle.com
---
 fs/ocfs2/cluster/heartbeat.c |   27 +++
 1 files changed, 19 insertions(+), 8 deletions(-)

diff --git a/fs/ocfs2/cluster/heartbeat.c b/fs/ocfs2/cluster/heartbeat.c
index 55e8718..8580e66 100644
--- a/fs/ocfs2/cluster/heartbeat.c
+++ b/fs/ocfs2/cluster/heartbeat.c
@@ -134,6 +134,8 @@ struct o2hb_region {
struct block_device *hr_bdev;
struct o2hb_disk_slot   *hr_slots;
 
+   struct dentry   *hr_debug_dir;
+
/* let the person setting up hb wait for it to return until it
 * has reached a 'steady' state.  This will be fixed when we have
 * a more complete api that doesn't lead to this sort of fragility. */
@@ -1085,6 +1087,8 @@ static void o2hb_region_release(struct config_item *item)
if (reg-hr_slots)
kfree(reg-hr_slots);
 
+   debugfs_remove(reg-hr_debug_dir);
+
spin_lock(o2hb_live_lock);
list_del(reg-hr_all_item);
spin_unlock(o2hb_live_lock);
@@ -1597,24 +1601,31 @@ static struct config_item 
*o2hb_heartbeat_group_make_item(struct config_group *g
  const char *name)
 {
struct o2hb_region *reg = NULL;
-   struct config_item *ret = NULL;
 
reg = kzalloc(sizeof(struct o2hb_region), GFP_KERNEL);
-   if (reg == NULL)
-   goto out; /* ENOMEM */
+   if (reg == NULL) {
+   mlog_errno(-ENOMEM);
+   goto out;
+   }
 
config_item_init_type_name(reg-hr_item, name, o2hb_region_type);
 
-   ret = reg-hr_item;
+   reg-hr_debug_dir =
+   debugfs_create_dir(config_item_name(reg-hr_item), 
o2hb_debug_dir);
+   if (!reg-hr_debug_dir) {
+   mlog_errno(-ENOMEM);
+   goto out;
+   }
 
spin_lock(o2hb_live_lock);
list_add_tail(reg-hr_all_item, o2hb_all_regions);
spin_unlock(o2hb_live_lock);
-out:
-   if (ret == NULL)
-   kfree(reg);
 
-   return ret;
+   return reg-hr_item;
+
+out:
+   kfree(reg);
+   return NULL;
 }
 
 static void o2hb_heartbeat_group_drop_item(struct config_group *group,
-- 
1.5.6.5


___
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-devel

[Ocfs2-devel] [PATCH] ocfs2: validate bg_free_bits_count after update

2010-10-05 Thread Srinivas Eeda

This patch adds a safe check to ensure bg_free_bits_count doesn't exceed
bg_bits in a group descriptor. This is to avoid on disk corruption that was
seen recently.

debugfs: group 52803072
   Group Chain: 179   Parent Inode: 11  Generation: 2959379682
   CRC32:    ECC: 
   ##   Block#TotalUsed Free Contig   Size
   052803072  322564294965350   34202182074032
   ..

Signed-off-by: Srinivas Eeda srinivas.e...@oracle.com
---
 fs/ocfs2/suballoc.c |   16 
 1 files changed, 16 insertions(+), 0 deletions(-)

diff --git a/fs/ocfs2/suballoc.c b/fs/ocfs2/suballoc.c
index 849c2f0..5fed60d 100644
--- a/fs/ocfs2/suballoc.c
+++ b/fs/ocfs2/suballoc.c
@@ -1380,6 +1380,14 @@ static inline int ocfs2_block_group_set_bits(handle_t 
*handle,
}
 
le16_add_cpu(bg-bg_free_bits_count, -num_bits);
+   if (le16_to_cpu(bg-bg_free_bits_count)  le16_to_cpu(bg-bg_bits)) {
+   ocfs2_error(alloc_inode-i_sb, Group descriptor # %llu has bit
+count %u but claims %u are freed. num_bits %d,
+   (unsigned long long)le64_to_cpu(bg-bg_blkno),
+   le16_to_cpu(bg-bg_bits),
+   le16_to_cpu(bg-bg_free_bits_count), num_bits);
+   return -EROFS;
+   }
while(num_bits--)
ocfs2_set_bit(bit_off++, bitmap);
 
@@ -2419,6 +2427,14 @@ static int ocfs2_block_group_clear_bits(handle_t *handle,
(unsigned long *) undo_bg-bg_bitmap);
}
le16_add_cpu(bg-bg_free_bits_count, num_bits);
+   if (le16_to_cpu(bg-bg_free_bits_count)  le16_to_cpu(bg-bg_bits)) {
+   ocfs2_error(alloc_inode-i_sb, Group descriptor # %llu has bit
+count %u but claims %u are freed. num_bits %d,
+   (unsigned long long)le64_to_cpu(bg-bg_blkno),
+   le16_to_cpu(bg-bg_bits),
+   le16_to_cpu(bg-bg_free_bits_count), num_bits);
+   return -EROFS;
+   }
 
if (undo_fn)
jbd_unlock_bh_state(group_bh);
-- 
1.5.6.5


___
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-devel

[Ocfs2-devel] [PATCH 1/1] ocfs2: validate bg_free_bits_count after update

2010-10-01 Thread Srinivas Eeda

This patch adds a safe check to ensure bg_free_bits_count doesn't exceed
bg_bits in a group descriptor. This is to avoid on disk corruption that was
seen recently.

debugfs: group 52803072
   Group Chain: 179   Parent Inode: 11  Generation: 2959379682
   CRC32:    ECC: 
   ##   Block#TotalUsed Free Contig   Size
   052803072  322564294965350   34202182074032
   ..

Signed-off-by: Srinivas Eeda srinivas.e...@oracle.com
---
 fs/ocfs2/suballoc.c |   16 
 1 files changed, 16 insertions(+), 0 deletions(-)

diff --git a/fs/ocfs2/suballoc.c b/fs/ocfs2/suballoc.c
index 849c2f0..d606a79 100644
--- a/fs/ocfs2/suballoc.c
+++ b/fs/ocfs2/suballoc.c
@@ -1380,6 +1380,14 @@ static inline int ocfs2_block_group_set_bits(handle_t 
*handle,
}
 
le16_add_cpu(bg-bg_free_bits_count, -num_bits);
+   if (le16_to_cpu(bg-bg_free_bits_count) = le16_to_cpu(bg-bg_bits)) {
+   ocfs2_error(alloc_inode-i_sb, Group descriptor # %llu has bit
+count %u but claims %u are freed. num_bits %d,
+   (unsigned long long)le64_to_cpu(bg-bg_blkno),
+   le16_to_cpu(bg-bg_bits),
+   le16_to_cpu(bg-bg_free_bits_count), num_bits);
+   return -EIO;
+   }
while(num_bits--)
ocfs2_set_bit(bit_off++, bitmap);
 
@@ -2419,6 +2427,14 @@ static int ocfs2_block_group_clear_bits(handle_t *handle,
(unsigned long *) undo_bg-bg_bitmap);
}
le16_add_cpu(bg-bg_free_bits_count, num_bits);
+   if (le16_to_cpu(bg-bg_free_bits_count) = le16_to_cpu(bg-bg_bits)) {
+   ocfs2_error(alloc_inode-i_sb, Group descriptor # %llu has bit
+count %u but claims %u are freed. num_bits %d,
+   (unsigned long long)le64_to_cpu(bg-bg_blkno),
+   le16_to_cpu(bg-bg_bits),
+   le16_to_cpu(bg-bg_free_bits_count), num_bits);
+   return -EIO;
+   }
 
if (undo_fn)
jbd_unlock_bh_state(group_bh);
-- 
1.5.6.5


___
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-devel

[Ocfs2-devel] [PATCH] o2dlm: force free mles during dlm exit

2010-09-21 Thread Srinivas Eeda

While umounting, a block mle doesn't get freed if dlm is shutdown after
master request is received but before assert master. This results in unclean
shutdown of dlm domain.

This patch frees all mles that lie around after other nodes were notified about
exiting the dlm and marking dlm state as leaving. Only block mles are expected
to be around, so we log ERROR for other mles but still free them.

Signed-off-by: Srinivas Eeda srinivas.e...@oracle.com
---
 fs/ocfs2/dlm/dlmcommon.h |1 +
 fs/ocfs2/dlm/dlmdomain.c |1 +
 fs/ocfs2/dlm/dlmmaster.c |   34 ++
 3 files changed, 36 insertions(+), 0 deletions(-)

diff --git a/fs/ocfs2/dlm/dlmcommon.h b/fs/ocfs2/dlm/dlmcommon.h
index 4b6ae2c..7652989 100644
--- a/fs/ocfs2/dlm/dlmcommon.h
+++ b/fs/ocfs2/dlm/dlmcommon.h
@@ -1030,6 +1030,7 @@ int dlm_drop_lockres_ref(struct dlm_ctxt *dlm,
 struct dlm_lock_resource *res);
 void dlm_clean_master_list(struct dlm_ctxt *dlm,
   u8 dead_node);
+void dlm_force_free_mles(struct dlm_ctxt *dlm);
 int dlm_lock_basts_flushed(struct dlm_ctxt *dlm, struct dlm_lock *lock);
 int __dlm_lockres_has_locks(struct dlm_lock_resource *res);
 int __dlm_lockres_unused(struct dlm_lock_resource *res);
diff --git a/fs/ocfs2/dlm/dlmdomain.c b/fs/ocfs2/dlm/dlmdomain.c
index 153abb5..11a5c87 100644
--- a/fs/ocfs2/dlm/dlmdomain.c
+++ b/fs/ocfs2/dlm/dlmdomain.c
@@ -693,6 +693,7 @@ void dlm_unregister_domain(struct dlm_ctxt *dlm)
 
dlm_mark_domain_leaving(dlm);
dlm_leave_domain(dlm);
+   dlm_force_free_mles(dlm);
dlm_complete_dlm_shutdown(dlm);
}
dlm_put(dlm);
diff --git a/fs/ocfs2/dlm/dlmmaster.c b/fs/ocfs2/dlm/dlmmaster.c
index ffb4c68..156f420 100644
--- a/fs/ocfs2/dlm/dlmmaster.c
+++ b/fs/ocfs2/dlm/dlmmaster.c
@@ -3433,3 +3433,37 @@ void dlm_lockres_release_ast(struct dlm_ctxt *dlm,
wake_up(res-wq);
wake_up(dlm-migration_wq);
 }
+
+void dlm_force_free_mles(struct dlm_ctxt *dlm)
+{
+   int i;
+   struct hlist_head *bucket;
+   struct dlm_master_list_entry *mle;
+   struct hlist_node *tmp, *list;
+
+   /* We notified all other nodes that we are exiting the domain and
+* marked the dlm state to DLM_CTXT_LEAVING. If any mles are still
+* around we force free them and wake any processes that are waiting
+* on the mles */
+   spin_lock(dlm-spinlock);
+   spin_lock(dlm-master_lock);
+   for (i = 0; i  DLM_HASH_BUCKETS; i++) {
+   bucket = dlm_master_hash(dlm, i);
+   hlist_for_each_safe(list, tmp, bucket) {
+   mle = hlist_entry(list, struct dlm_master_list_entry,
+ master_hash_node);
+   if (mle-type != DLM_MLE_BLOCK) {
+   mlog(ML_ERROR, bad mle: %p\n, mle);
+   dlm_print_one_mle(mle);
+   }
+   atomic_set(mle-woken, 1);
+   wake_up(mle-wq);
+
+   __dlm_unlink_mle(dlm, mle);
+   __dlm_mle_detach_hb_events(dlm, mle);
+   __dlm_put_mle(mle);
+   }
+   }
+   spin_unlock(dlm-master_lock);
+   spin_unlock(dlm-spinlock);
+}
-- 
1.5.6.5


___
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-devel

[Ocfs2-devel] [PATCH 1/1] o2dlm: force free mles during dlm exit

2010-09-21 Thread Srinivas Eeda

While umounting, a block mle doesn't get freed if dlm is shutdown after
master request is received but before assert master. This results in unclean
shutdown of dlm domain.

This patch frees all mles that lie around after other nodes were notified about
exiting the dlm and marking dlm state as leaving. Only block mles are expected
to be around, so we log ERROR for other mles but still free them.

Signed-off-by: Srinivas Eeda srinivas.e...@oracle.com
---
 fs/ocfs2/dlm/dlmcommon.h |1 +
 fs/ocfs2/dlm/dlmdomain.c |1 +
 fs/ocfs2/dlm/dlmmaster.c |   40 
 3 files changed, 42 insertions(+), 0 deletions(-)

diff --git a/fs/ocfs2/dlm/dlmcommon.h b/fs/ocfs2/dlm/dlmcommon.h
index 4b6ae2c..7652989 100644
--- a/fs/ocfs2/dlm/dlmcommon.h
+++ b/fs/ocfs2/dlm/dlmcommon.h
@@ -1030,6 +1030,7 @@ int dlm_drop_lockres_ref(struct dlm_ctxt *dlm,
 struct dlm_lock_resource *res);
 void dlm_clean_master_list(struct dlm_ctxt *dlm,
   u8 dead_node);
+void dlm_force_free_mles(struct dlm_ctxt *dlm);
 int dlm_lock_basts_flushed(struct dlm_ctxt *dlm, struct dlm_lock *lock);
 int __dlm_lockres_has_locks(struct dlm_lock_resource *res);
 int __dlm_lockres_unused(struct dlm_lock_resource *res);
diff --git a/fs/ocfs2/dlm/dlmdomain.c b/fs/ocfs2/dlm/dlmdomain.c
index 153abb5..11a5c87 100644
--- a/fs/ocfs2/dlm/dlmdomain.c
+++ b/fs/ocfs2/dlm/dlmdomain.c
@@ -693,6 +693,7 @@ void dlm_unregister_domain(struct dlm_ctxt *dlm)
 
dlm_mark_domain_leaving(dlm);
dlm_leave_domain(dlm);
+   dlm_force_free_mles(dlm);
dlm_complete_dlm_shutdown(dlm);
}
dlm_put(dlm);
diff --git a/fs/ocfs2/dlm/dlmmaster.c b/fs/ocfs2/dlm/dlmmaster.c
index ffb4c68..f564b0e 100644
--- a/fs/ocfs2/dlm/dlmmaster.c
+++ b/fs/ocfs2/dlm/dlmmaster.c
@@ -3433,3 +3433,43 @@ void dlm_lockres_release_ast(struct dlm_ctxt *dlm,
wake_up(res-wq);
wake_up(dlm-migration_wq);
 }
+
+void dlm_force_free_mles(struct dlm_ctxt *dlm)
+{
+   int i;
+   struct hlist_head *bucket;
+   struct dlm_master_list_entry *mle;
+   struct hlist_node *tmp, *list;
+
+   /*
+* We notified all other nodes that we are exiting the domain and
+* marked the dlm state to DLM_CTXT_LEAVING. If any mles are still
+* around we force free them and wake any processes that are waiting
+* on the mles
+*/
+   spin_lock(dlm-spinlock);
+   spin_lock(dlm-master_lock);
+
+   BUG_ON(dlm-dlm_state != DLM_CTXT_LEAVING);
+   BUG_ON((find_next_bit(dlm-domain_map, O2NM_MAX_NODES, 0)  
O2NM_MAX_NODES));
+
+   for (i = 0; i  DLM_HASH_BUCKETS; i++) {
+   bucket = dlm_master_hash(dlm, i);
+   hlist_for_each_safe(list, tmp, bucket) {
+   mle = hlist_entry(list, struct dlm_master_list_entry,
+ master_hash_node);
+   if (mle-type != DLM_MLE_BLOCK) {
+   mlog(ML_ERROR, bad mle: %p\n, mle);
+   dlm_print_one_mle(mle);
+   }
+   atomic_set(mle-woken, 1);
+   wake_up(mle-wq);
+
+   __dlm_unlink_mle(dlm, mle);
+   __dlm_mle_detach_hb_events(dlm, mle);
+   __dlm_put_mle(mle);
+   }
+   }
+   spin_unlock(dlm-master_lock);
+   spin_unlock(dlm-spinlock);
+}
-- 
1.5.6.5


___
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-devel

[Ocfs2-devel] [PATCH 1/1] o2dlm: free block mles during shutdown

2010-09-20 Thread Srinivas Eeda

If a node initiates shutdown after another node initiated the lock mastery
process, this node might have created block mle but will not release it if it
doesn't get the assert master from the other node. This causes block mle's to
lie around unfreed.

This patch frees any block mles that exists on master list after the node sent
DLM_EXIT_DOMAIN_MSG to other nodes.

Signed-off-by: Srinivas Eeda srinivas.e...@oracle.com
---
 fs/ocfs2/dlm/dlmcommon.h |1 +
 fs/ocfs2/dlm/dlmdomain.c |1 +
 fs/ocfs2/dlm/dlmmaster.c |   33 +
 3 files changed, 35 insertions(+), 0 deletions(-)

diff --git a/fs/ocfs2/dlm/dlmcommon.h b/fs/ocfs2/dlm/dlmcommon.h
index 4b6ae2c..48282dd 100644
--- a/fs/ocfs2/dlm/dlmcommon.h
+++ b/fs/ocfs2/dlm/dlmcommon.h
@@ -1030,6 +1030,7 @@ int dlm_drop_lockres_ref(struct dlm_ctxt *dlm,
 struct dlm_lock_resource *res);
 void dlm_clean_master_list(struct dlm_ctxt *dlm,
   u8 dead_node);
+void dlm_free_block_mles(struct dlm_ctxt *dlm);
 int dlm_lock_basts_flushed(struct dlm_ctxt *dlm, struct dlm_lock *lock);
 int __dlm_lockres_has_locks(struct dlm_lock_resource *res);
 int __dlm_lockres_unused(struct dlm_lock_resource *res);
diff --git a/fs/ocfs2/dlm/dlmdomain.c b/fs/ocfs2/dlm/dlmdomain.c
index 153abb5..8744fff 100644
--- a/fs/ocfs2/dlm/dlmdomain.c
+++ b/fs/ocfs2/dlm/dlmdomain.c
@@ -693,6 +693,7 @@ void dlm_unregister_domain(struct dlm_ctxt *dlm)
 
dlm_mark_domain_leaving(dlm);
dlm_leave_domain(dlm);
+   dlm_free_block_mles(dlm);
dlm_complete_dlm_shutdown(dlm);
}
dlm_put(dlm);
diff --git a/fs/ocfs2/dlm/dlmmaster.c b/fs/ocfs2/dlm/dlmmaster.c
index ffb4c68..5f4d6fd 100644
--- a/fs/ocfs2/dlm/dlmmaster.c
+++ b/fs/ocfs2/dlm/dlmmaster.c
@@ -3433,3 +3433,36 @@ void dlm_lockres_release_ast(struct dlm_ctxt *dlm,
wake_up(res-wq);
wake_up(dlm-migration_wq);
 }
+
+void dlm_free_block_mles(struct dlm_ctxt *dlm)
+{
+   int i;
+   struct hlist_head *bucket;
+   struct dlm_master_list_entry *mle;
+   struct hlist_node *list;
+
+   spin_lock(dlm-spinlock);
+   spin_lock(dlm-master_lock);
+   for (i = 0; i  DLM_HASH_BUCKETS; i++) {
+   bucket = dlm_master_hash(dlm, i);
+   hlist_for_each(list, bucket) {
+   mle = hlist_entry(list, struct dlm_master_list_entry,
+ master_hash_node);
+   if (mle-type != DLM_MLE_BLOCK) {
+   mlog(ML_ERROR, mle for %.*s not destroyed, 
+type %d\n,
+mle-mnamelen, mle-mname, mle-type);
+   continue;
+   }
+   spin_lock(mle-spinlock);
+   atomic_set(mle-woken, 1);
+   spin_unlock(mle-spinlock);
+   wake_up(mle-wq);
+
+   __dlm_mle_detach_hb_events(dlm, mle);
+   __dlm_put_mle(mle);
+   }
+   }
+   spin_unlock(dlm-master_lock);
+   spin_unlock(dlm-spinlock);
+}
-- 
1.5.6.5


___
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-devel

Re: [Ocfs2-devel] [PATCH] ocfs2/dlm: avoid incorrect bit set in refmap on recovery master

2010-07-23 Thread Srinivas Eeda

thanks for making this patch, it looks good just few minor changes about 
comments

On 7/23/2010 5:15 AM, Wengang Wang wrote:
 In the following situation, there remains an incorrect bit in refmap on the
 recovery master. Finally the recovery master will fail at purging the lockres
 due to the incorrect bit in refmap.

 1) node A has no interest on lockres A any longer, so it is purging it.
 2) the owner of lockres A is node B, so node A is sending de-ref message
 to node B.
 3) at this time, node B crashed. node C becomes the recovery master. it 
 recovers
 lockres A(because the master is the dead node B).
 4) node A migrated lockres A to node C with a refbit there.
 5) node A failed to send de-ref message to node B because it crashed. The 
 failure
 is ignored. no other action is done for lockres A any more.

 For mormal, re-send the deref message to it to recovery master can fix it. 
 Well,
 ignoring the failure of deref to the original master and not recovering the 
 lockres
 to recovery master has the same effect. And the later is simpler.

 Signed-off-by: Wengang Wang wen.gang.w...@oracle.com
 ---
  fs/ocfs2/dlm/dlmrecovery.c |   17 +++--
  fs/ocfs2/dlm/dlmthread.c   |   28 +---
  2 files changed, 32 insertions(+), 13 deletions(-)

 diff --git a/fs/ocfs2/dlm/dlmrecovery.c b/fs/ocfs2/dlm/dlmrecovery.c
 index 9dfaac7..06640f6 100644
 --- a/fs/ocfs2/dlm/dlmrecovery.c
 +++ b/fs/ocfs2/dlm/dlmrecovery.c
 @@ -1997,6 +1997,8 @@ void dlm_move_lockres_to_recovery_list(struct dlm_ctxt 
 *dlm,
   struct list_head *queue;
   struct dlm_lock *lock, *next;
  
 + assert_spin_locked(dlm-spinlock);
 + assert_spin_locked(res-spinlock);
   res-state |= DLM_LOCK_RES_RECOVERING;
   if (!list_empty(res-recovering)) {
   mlog(0,
 @@ -2336,9 +2338,20 @@ static void dlm_do_local_recovery_cleanup(struct 
 dlm_ctxt *dlm, u8 dead_node)
  
   /* the wake_up for this will happen when the
* RECOVERING flag is dropped later */
   
remove above comment as it doesn't seem to be relevant anymore.
 - res-state = ~DLM_LOCK_RES_DROPPING_REF;
 + if (res-state  DLM_LOCK_RES_DROPPING_REF) {
 + /*
 +  * don't migrate a lockres which is in
 +  * progress of dropping ref
 +  */
   
move this comment to before the if condition
 + mlog(ML_NOTICE, %.*s ignored for 
 +  migration\n, res-lockname.len,
 +  res-lockname.name);
   
This information only helps us in diagnosing any related issue and is 
not helpful in normal cases. So it should be 0 instead of ML_NOTICE.
 + res-state =
 + ~DLM_LOCK_RES_DROPPING_REF;
   
we don't need to clear this state as dlm_purge_lockres removes it.
 + } else
 + dlm_move_lockres_to_recovery_list(dlm,
 +   res);
  
 - dlm_move_lockres_to_recovery_list(dlm, res);
   } else if (res-owner == dlm-node_num) {
   dlm_free_dead_locks(dlm, res, dead_node);
   __dlm_lockres_calc_usage(dlm, res);
 diff --git a/fs/ocfs2/dlm/dlmthread.c b/fs/ocfs2/dlm/dlmthread.c
 index dd78ca3..47420ce 100644
 --- a/fs/ocfs2/dlm/dlmthread.c
 +++ b/fs/ocfs2/dlm/dlmthread.c
 @@ -92,17 +92,23 @@ int __dlm_lockres_has_locks(struct dlm_lock_resource *res)
   * truly ready to be freed. */
  int __dlm_lockres_unused(struct dlm_lock_resource *res)
  {
 - if (!__dlm_lockres_has_locks(res) 
 - (list_empty(res-dirty)  !(res-state  DLM_LOCK_RES_DIRTY))) {
 - /* try not to scan the bitmap unless the first two
 -  * conditions are already true */
 - int bit = find_next_bit(res-refmap, O2NM_MAX_NODES, 0);
 - if (bit = O2NM_MAX_NODES) {
 - /* since the bit for dlm-node_num is not
 -  * set, inflight_locks better be zero */
 - BUG_ON(res-inflight_locks != 0);
 - return 1;
 - }
 + int bit;
 +
 + if (__dlm_lockres_has_locks(res))
 + return 0;
 +
 + if (!list_empty(res-dirty) || res-state  DLM_LOCK_RES_DIRTY)
 + return 0;
 +
 + if (res-state  DLM_LOCK_RES_RECOVERING)
 + return 0;
 +
 + bit = find_next_bit(res-refmap, O2NM_MAX_NODES, 0);
 + if (bit = O2NM_MAX_NODES) {
 + /* since the bit for dlm-node_num is not
 +  * set, inflight_locks better be zero */
 +

[Ocfs2-devel] [PATCH 1/1] ocfs2 fix o2dlm dlm run purgelist (rev 3)

2010-07-19 Thread Srinivas Eeda

This patch fixes two problems in dlm_run_purgelist

1. If a lockres is found to be in use, dlm_run_purgelist keeps trying to purge
the same lockres instead of trying the next lockres.

2. When a lockres is found unused, dlm_run_purgelist releases lockres spinlock
before setting DLM_LOCK_RES_DROPPING_REF and calls dlm_purge_lockres.
spinlock is reacquired but in this window lockres can get reused. This leads
to BUG.

This patch modifies dlm_run_purgelist to skip lockres if it's in use and purge
 next lockres. It also sets DLM_LOCK_RES_DROPPING_REF before releasing the
lockres spinlock protecting it from getting reused.

Signed-off-by: Srinivas Eeda srinivas.e...@oracle.com
Acked-by: Sunil Mushran sunil.mush...@oracle.com

---
 fs/ocfs2/dlm/dlmthread.c |   80 +++--
 1 files changed, 34 insertions(+), 46 deletions(-)

diff --git a/fs/ocfs2/dlm/dlmthread.c b/fs/ocfs2/dlm/dlmthread.c
index 11a6d1f..960dc8d 100644
--- a/fs/ocfs2/dlm/dlmthread.c
+++ b/fs/ocfs2/dlm/dlmthread.c
@@ -152,45 +152,25 @@ void dlm_lockres_calc_usage(struct dlm_ctxt *dlm,
spin_unlock(dlm-spinlock);
 }
 
-static int dlm_purge_lockres(struct dlm_ctxt *dlm,
+static void dlm_purge_lockres(struct dlm_ctxt *dlm,
 struct dlm_lock_resource *res)
 {
int master;
int ret = 0;
 
-   spin_lock(res-spinlock);
-   if (!__dlm_lockres_unused(res)) {
-   mlog(0, %s:%.*s: tried to purge but not unused\n,
-dlm-name, res-lockname.len, res-lockname.name);
-   __dlm_print_one_lock_resource(res);
-   spin_unlock(res-spinlock);
-   BUG();
-   }
-
-   if (res-state  DLM_LOCK_RES_MIGRATING) {
-   mlog(0, %s:%.*s: Delay dropref as this lockres is 
-being remastered\n, dlm-name, res-lockname.len,
-res-lockname.name);
-   /* Re-add the lockres to the end of the purge list */
-   if (!list_empty(res-purge)) {
-   list_del_init(res-purge);
-   list_add_tail(res-purge, dlm-purge_list);
-   }
-   spin_unlock(res-spinlock);
-   return 0;
-   }
+   assert_spin_locked(dlm-spinlock);
+   assert_spin_locked(res-spinlock);
 
master = (res-owner == dlm-node_num);
 
-   if (!master)
-   res-state |= DLM_LOCK_RES_DROPPING_REF;
-   spin_unlock(res-spinlock);
 
mlog(0, purging lockres %.*s, master = %d\n, res-lockname.len,
 res-lockname.name, master);
 
if (!master) {
+   res-state |= DLM_LOCK_RES_DROPPING_REF;
/* drop spinlock...  retake below */
+   spin_unlock(res-spinlock);
spin_unlock(dlm-spinlock);
 
spin_lock(res-spinlock);
@@ -208,31 +188,35 @@ static int dlm_purge_lockres(struct dlm_ctxt *dlm,
mlog(0, %s:%.*s: dlm_deref_lockres returned %d\n,
 dlm-name, res-lockname.len, res-lockname.name, ret);
spin_lock(dlm-spinlock);
+   spin_lock(res-spinlock);
}
 
-   spin_lock(res-spinlock);
if (!list_empty(res-purge)) {
mlog(0, removing lockres %.*s:%p from purgelist, 
 master = %d\n, res-lockname.len, res-lockname.name,
 res, master);
list_del_init(res-purge);
-   spin_unlock(res-spinlock);
dlm_lockres_put(res);
dlm-purge_count--;
-   } else
-   spin_unlock(res-spinlock);
+   }
+
+   if (!__dlm_lockres_unused(res)) {
+   mlog(ML_ERROR, found lockres %s:%.*s: in use after deref\n,
+dlm-name, res-lockname.len, res-lockname.name);
+   __dlm_print_one_lock_resource(res);
+   BUG();
+   }
 
__dlm_unhash_lockres(res);
 
/* lockres is not in the hash now.  drop the flag and wake up
 * any processes waiting in dlm_get_lock_resource. */
if (!master) {
-   spin_lock(res-spinlock);
res-state = ~DLM_LOCK_RES_DROPPING_REF;
spin_unlock(res-spinlock);
wake_up(res-wq);
-   }
-   return 0;
+   } else
+   spin_unlock(res-spinlock);
 }
 
 static void dlm_run_purge_list(struct dlm_ctxt *dlm,
@@ -251,17 +235,7 @@ static void dlm_run_purge_list(struct dlm_ctxt *dlm,
lockres = list_entry(dlm-purge_list.next,
 struct dlm_lock_resource, purge);
 
-   /* Status of the lockres *might* change so double
-* check. If the lockres is unused, holding the dlm
-* spinlock will prevent people from getting and more
-* refs on it -- there's no need to keep the lockres
-* spinlock. */
spin_lock(lockres-spinlock

Re: [Ocfs2-devel] [PATCH 1/1] ocfs2 fix o2dlm dlm run purgelist(bug 9094491) - rev3

2010-07-12 Thread Srinivas Eeda

Joel Becker wrote:
 On Tue, Jun 22, 2010 at 10:48:28PM -0700, Srinivas Eeda wrote:
   
 +if (!__dlm_lockres_unused) {
 +mlog(ML_ERROR, found lockres %s:%.*s: in use after deref\n,
 + dlm-name, res-lockname.len, res-lockname.name);
 +__dlm_print_one_lock_resource(res);
 +BUG();
 +}
 

 /build/jlbec/linux-2.6/working/fs/ocfs2/dlm/dlmthread.c: In function 
 ‘dlm_purge_lockres’:
 /build/jlbec/linux-2.6/working/fs/ocfs2/dlm/dlmthread.c:203: warning: the 
 address of ‘__dlm_lockres_unused’ will always evaluate as ‘true’

   Was this even tested?  I'm leaving this patch out of 'fixes'
 until corrected and tested.
   
Sorry, I had the typo while making the review changes. I ran the usual 
tests but that  didn't catch this problem. I should have payed more 
attention to the build log.

I made the change and tested it, will send you the modified patch
 Joel

   


___
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-devel

[Ocfs2-devel] [PATCH 1/1] ocfs2 fix o2dlm dlm run purgelist(bug 9094491) - rev3

2010-07-12 Thread Srinivas Eeda

There are two problems in dlm_run_purgelist

1. If a lockres is found to be in use, dlm_run_purgelist keeps trying to purge
the same lockres instead of trying the next lockres.

2. When a lockres is found unused, dlm_run_purgelist releases lockres spinlock
before setting DLM_LOCK_RES_DROPPING_REF and calls dlm_purge_lockres.
spinlock is reacquired but in this window lockres can get reused. This leads
to BUG.

This patch modifies dlm_run_purgelist to skip lockres if it's in use and purge
 next lockres. It also sets DLM_LOCK_RES_DROPPING_REF before releasing the
lockres spinlock protecting it from getting reused.

Signed-off-by: Srinivas Eeda srinivas.e...@oracle.com
Signed-off-by: Sunil Mushransunil.mush...@oracle.com
---
 fs/ocfs2/dlm/dlmthread.c |   79 +++--
 1 files changed, 33 insertions(+), 46 deletions(-)

diff --git a/fs/ocfs2/dlm/dlmthread.c b/fs/ocfs2/dlm/dlmthread.c
index 11a6d1f..76ab4aa 100644
--- a/fs/ocfs2/dlm/dlmthread.c
+++ b/fs/ocfs2/dlm/dlmthread.c
@@ -152,45 +152,25 @@ void dlm_lockres_calc_usage(struct dlm_ctxt *dlm,
spin_unlock(dlm-spinlock);
 }
 
-static int dlm_purge_lockres(struct dlm_ctxt *dlm,
+static void dlm_purge_lockres(struct dlm_ctxt *dlm,
 struct dlm_lock_resource *res)
 {
int master;
int ret = 0;
 
-   spin_lock(res-spinlock);
-   if (!__dlm_lockres_unused(res)) {
-   mlog(0, %s:%.*s: tried to purge but not unused\n,
-dlm-name, res-lockname.len, res-lockname.name);
-   __dlm_print_one_lock_resource(res);
-   spin_unlock(res-spinlock);
-   BUG();
-   }
-
-   if (res-state  DLM_LOCK_RES_MIGRATING) {
-   mlog(0, %s:%.*s: Delay dropref as this lockres is 
-being remastered\n, dlm-name, res-lockname.len,
-res-lockname.name);
-   /* Re-add the lockres to the end of the purge list */
-   if (!list_empty(res-purge)) {
-   list_del_init(res-purge);
-   list_add_tail(res-purge, dlm-purge_list);
-   }
-   spin_unlock(res-spinlock);
-   return 0;
-   }
+   assert_spin_locked(dlm-spinlock);
+   assert_spin_locked(res-spinlock);
 
master = (res-owner == dlm-node_num);
 
-   if (!master)
-   res-state |= DLM_LOCK_RES_DROPPING_REF;
-   spin_unlock(res-spinlock);
 
mlog(0, purging lockres %.*s, master = %d\n, res-lockname.len,
 res-lockname.name, master);
 
if (!master) {
+   res-state |= DLM_LOCK_RES_DROPPING_REF;
/* drop spinlock...  retake below */
+   spin_unlock(res-spinlock);
spin_unlock(dlm-spinlock);
 
spin_lock(res-spinlock);
@@ -208,31 +188,35 @@ static int dlm_purge_lockres(struct dlm_ctxt *dlm,
mlog(0, %s:%.*s: dlm_deref_lockres returned %d\n,
 dlm-name, res-lockname.len, res-lockname.name, ret);
spin_lock(dlm-spinlock);
+   spin_lock(res-spinlock);
}
 
-   spin_lock(res-spinlock);
if (!list_empty(res-purge)) {
mlog(0, removing lockres %.*s:%p from purgelist, 
 master = %d\n, res-lockname.len, res-lockname.name,
 res, master);
list_del_init(res-purge);
-   spin_unlock(res-spinlock);
dlm_lockres_put(res);
dlm-purge_count--;
-   } else
-   spin_unlock(res-spinlock);
+   }
+
+   if (!__dlm_lockres_unused(res)) {
+   mlog(ML_ERROR, found lockres %s:%.*s: in use after deref\n,
+dlm-name, res-lockname.len, res-lockname.name);
+   __dlm_print_one_lock_resource(res);
+   BUG();
+   }
 
__dlm_unhash_lockres(res);
 
/* lockres is not in the hash now.  drop the flag and wake up
 * any processes waiting in dlm_get_lock_resource. */
if (!master) {
-   spin_lock(res-spinlock);
res-state = ~DLM_LOCK_RES_DROPPING_REF;
spin_unlock(res-spinlock);
wake_up(res-wq);
-   }
-   return 0;
+   } else
+   spin_unlock(res-spinlock);
 }
 
 static void dlm_run_purge_list(struct dlm_ctxt *dlm,
@@ -251,17 +235,7 @@ static void dlm_run_purge_list(struct dlm_ctxt *dlm,
lockres = list_entry(dlm-purge_list.next,
 struct dlm_lock_resource, purge);
 
-   /* Status of the lockres *might* change so double
-* check. If the lockres is unused, holding the dlm
-* spinlock will prevent people from getting and more
-* refs on it -- there's no need to keep the lockres
-* spinlock. */
spin_lock(lockres-spinlock

Re: [Ocfs2-devel] [PATCH 1/1] ocfs2 fix o2dlm dlm run purgelist(bug 9094491) - rev3

2010-06-23 Thread Srinivas Eeda

Sunil, Joel, Wengang. Thanks for reviewing the patch and your comments.

On 6/23/2010 10:00 AM, Sunil Mushran wrote:
 Signed-off-by: Sunil Mushransunil.mush...@oracle.com


 On 06/22/2010 10:48 PM, Srinivas Eeda wrote:
 There are two problems in dlm_run_purgelist

 1. If a lockres is found to be in use, dlm_run_purgelist keeps trying 
 to purge
 the same lockres instead of trying the next lockres.

 2. When a lockres is found unused, dlm_run_purgelist releases lockres 
 spinlock
 before setting DLM_LOCK_RES_DROPPING_REF and calls dlm_purge_lockres.
 spinlock is reacquired but in this window lockres can get reused. 
 This leads
 to BUG.

 This patch modifies dlm_run_purgelist to skip lockres if it's in use 
 and purge
   next lockres. It also sets DLM_LOCK_RES_DROPPING_REF before 
 releasing the
 lockres spinlock protecting it from getting reused.

 Signed-off-by: Srinivas Eedasrinivas.e...@oracle.com
 ---
   fs/ocfs2/dlm/dlmthread.c |   79 
 +++--
   1 files changed, 33 insertions(+), 46 deletions(-)

 diff --git a/fs/ocfs2/dlm/dlmthread.c b/fs/ocfs2/dlm/dlmthread.c
 index 11a6d1f..6822f9a 100644
 --- a/fs/ocfs2/dlm/dlmthread.c
 +++ b/fs/ocfs2/dlm/dlmthread.c
 @@ -152,45 +152,25 @@ void dlm_lockres_calc_usage(struct dlm_ctxt *dlm,
   spin_unlock(dlm-spinlock);
   }

 -static int dlm_purge_lockres(struct dlm_ctxt *dlm,
 +static void dlm_purge_lockres(struct dlm_ctxt *dlm,
struct dlm_lock_resource *res)
   {
   int master;
   int ret = 0;

 -spin_lock(res-spinlock);
 -if (!__dlm_lockres_unused(res)) {
 -mlog(0, %s:%.*s: tried to purge but not unused\n,
 - dlm-name, res-lockname.len, res-lockname.name);
 -__dlm_print_one_lock_resource(res);
 -spin_unlock(res-spinlock);
 -BUG();
 -}
 -
 -if (res-state  DLM_LOCK_RES_MIGRATING) {
 -mlog(0, %s:%.*s: Delay dropref as this lockres is 
 - being remastered\n, dlm-name, res-lockname.len,
 - res-lockname.name);
 -/* Re-add the lockres to the end of the purge list */
 -if (!list_empty(res-purge)) {
 -list_del_init(res-purge);
 -list_add_tail(res-purge,dlm-purge_list);
 -}
 -spin_unlock(res-spinlock);
 -return 0;
 -}
 +assert_spin_locked(dlm-spinlock);
 +assert_spin_locked(res-spinlock);

   master = (res-owner == dlm-node_num);

 -if (!master)
 -res-state |= DLM_LOCK_RES_DROPPING_REF;
 -spin_unlock(res-spinlock);

   mlog(0, purging lockres %.*s, master = %d\n, res-lockname.len,
res-lockname.name, master);

   if (!master) {
 +res-state |= DLM_LOCK_RES_DROPPING_REF;
   /* drop spinlock...  retake below */
 +spin_unlock(res-spinlock);
   spin_unlock(dlm-spinlock);

   spin_lock(res-spinlock);
 @@ -208,31 +188,35 @@ static int dlm_purge_lockres(struct dlm_ctxt *dlm,
   mlog(0, %s:%.*s: dlm_deref_lockres returned %d\n,
dlm-name, res-lockname.len, res-lockname.name, ret);
   spin_lock(dlm-spinlock);
 +spin_lock(res-spinlock);
   }

 -spin_lock(res-spinlock);
   if (!list_empty(res-purge)) {
   mlog(0, removing lockres %.*s:%p from purgelist, 
master = %d\n, res-lockname.len, res-lockname.name,
res, master);
   list_del_init(res-purge);
 -spin_unlock(res-spinlock);
   dlm_lockres_put(res);
   dlm-purge_count--;
 -} else
 -spin_unlock(res-spinlock);
 +}
 +
 +if (!__dlm_lockres_unused) {
 +mlog(ML_ERROR, found lockres %s:%.*s: in use after deref\n,
 + dlm-name, res-lockname.len, res-lockname.name);
 +__dlm_print_one_lock_resource(res);
 +BUG();
 +}

   __dlm_unhash_lockres(res);

   /* lockres is not in the hash now.  drop the flag and wake up
* any processes waiting in dlm_get_lock_resource. */
   if (!master) {
 -spin_lock(res-spinlock);
   res-state= ~DLM_LOCK_RES_DROPPING_REF;
   spin_unlock(res-spinlock);
   wake_up(res-wq);
 -}
 -return 0;
 +} else
 +spin_unlock(res-spinlock);
   }

   static void dlm_run_purge_list(struct dlm_ctxt *dlm,
 @@ -251,17 +235,7 @@ static void dlm_run_purge_list(struct dlm_ctxt 
 *dlm,
   lockres = list_entry(dlm-purge_list.next,
struct dlm_lock_resource, purge);

 -/* Status of the lockres *might* change so double
 - * check. If the lockres is unused, holding the dlm
 - * spinlock will prevent people from getting and more
 - * refs on it -- there's no need to keep the lockres
 - * spinlock. */
   spin_lock(lockres-spinlock);
 -unused = __dlm_lockres_unused(lockres);
 -spin_unlock(lockres-spinlock);
 -
 -if (!unused)
 -continue;

   purge_jiffies = lockres

[Ocfs2-devel] [PATCH 1/1] ocfs2 fix o2dlm dlm run purgelist(bug 9094491) - rev3

2010-06-22 Thread Srinivas Eeda

There are two problems in dlm_run_purgelist

1. If a lockres is found to be in use, dlm_run_purgelist keeps trying to purge
the same lockres instead of trying the next lockres.

2. When a lockres is found unused, dlm_run_purgelist releases lockres spinlock
before setting DLM_LOCK_RES_DROPPING_REF and calls dlm_purge_lockres.
spinlock is reacquired but in this window lockres can get reused. This leads
to BUG.

This patch modifies dlm_run_purgelist to skip lockres if it's in use and purge
 next lockres. It also sets DLM_LOCK_RES_DROPPING_REF before releasing the
lockres spinlock protecting it from getting reused.

Signed-off-by: Srinivas Eeda srinivas.e...@oracle.com
---
 fs/ocfs2/dlm/dlmthread.c |   76 --
 1 files changed, 33 insertions(+), 43 deletions(-)

diff --git a/fs/ocfs2/dlm/dlmthread.c b/fs/ocfs2/dlm/dlmthread.c
index 11a6d1f..cb74689 100644
--- a/fs/ocfs2/dlm/dlmthread.c
+++ b/fs/ocfs2/dlm/dlmthread.c
@@ -152,45 +152,26 @@ void dlm_lockres_calc_usage(struct dlm_ctxt *dlm,
spin_unlock(dlm-spinlock);
 }
 
-static int dlm_purge_lockres(struct dlm_ctxt *dlm,
+static void dlm_purge_lockres(struct dlm_ctxt *dlm,
 struct dlm_lock_resource *res)
 {
int master;
int ret = 0;
 
-   spin_lock(res-spinlock);
-   if (!__dlm_lockres_unused(res)) {
-   mlog(0, %s:%.*s: tried to purge but not unused\n,
-dlm-name, res-lockname.len, res-lockname.name);
-   __dlm_print_one_lock_resource(res);
-   spin_unlock(res-spinlock);
-   BUG();
-   }
-
-   if (res-state  DLM_LOCK_RES_MIGRATING) {
-   mlog(0, %s:%.*s: Delay dropref as this lockres is 
-being remastered\n, dlm-name, res-lockname.len,
-res-lockname.name);
-   /* Re-add the lockres to the end of the purge list */
-   if (!list_empty(res-purge)) {
-   list_del_init(res-purge);
-   list_add_tail(res-purge, dlm-purge_list);
-   }
-   spin_unlock(res-spinlock);
-   return 0;
-   }
+   assert_spin_locked(dlm-spinlock);
+   assert_spin_locked(res-spinlock);
 
master = (res-owner == dlm-node_num);
 
if (!master)
res-state |= DLM_LOCK_RES_DROPPING_REF;
-   spin_unlock(res-spinlock);
 
mlog(0, purging lockres %.*s, master = %d\n, res-lockname.len,
 res-lockname.name, master);
 
if (!master) {
/* drop spinlock...  retake below */
+   spin_unlock(res-spinlock);
spin_unlock(dlm-spinlock);
 
spin_lock(res-spinlock);
@@ -208,31 +189,35 @@ static int dlm_purge_lockres(struct dlm_ctxt *dlm,
mlog(0, %s:%.*s: dlm_deref_lockres returned %d\n,
 dlm-name, res-lockname.len, res-lockname.name, ret);
spin_lock(dlm-spinlock);
+   spin_lock(res-spinlock);
}
 
-   spin_lock(res-spinlock);
if (!list_empty(res-purge)) {
mlog(0, removing lockres %.*s:%p from purgelist, 
 master = %d\n, res-lockname.len, res-lockname.name,
 res, master);
list_del_init(res-purge);
-   spin_unlock(res-spinlock);
dlm_lockres_put(res);
dlm-purge_count--;
-   } else
-   spin_unlock(res-spinlock);
+   }
 
-   __dlm_unhash_lockres(res);
+   if (__dlm_lockres_unused(res))
+   __dlm_unhash_lockres(res);
+   else {
+   mlog(ML_ERROR, found lockres %s:%.*s: in use after deref\n,
+dlm-name, res-lockname.len, res-lockname.name);
+   __dlm_print_one_lock_resource(res);
+   BUG();
+   }
 
/* lockres is not in the hash now.  drop the flag and wake up
 * any processes waiting in dlm_get_lock_resource. */
if (!master) {
-   spin_lock(res-spinlock);
res-state = ~DLM_LOCK_RES_DROPPING_REF;
spin_unlock(res-spinlock);
wake_up(res-wq);
-   }
-   return 0;
+   } else
+   spin_unlock(res-spinlock);
 }
 
 static void dlm_run_purge_list(struct dlm_ctxt *dlm,
@@ -251,17 +236,7 @@ static void dlm_run_purge_list(struct dlm_ctxt *dlm,
lockres = list_entry(dlm-purge_list.next,
 struct dlm_lock_resource, purge);
 
-   /* Status of the lockres *might* change so double
-* check. If the lockres is unused, holding the dlm
-* spinlock will prevent people from getting and more
-* refs on it -- there's no need to keep the lockres
-* spinlock. */
spin_lock(lockres-spinlock);
-   unused = __dlm_lockres_unused(lockres

[Ocfs2-devel] [PATCH 1/1] ocfs2 fix o2dlm dlm run purgelist(bug 9094491) - rev3

2010-06-22 Thread Srinivas Eeda

There are two problems in dlm_run_purgelist

1. If a lockres is found to be in use, dlm_run_purgelist keeps trying to purge
the same lockres instead of trying the next lockres.

2. When a lockres is found unused, dlm_run_purgelist releases lockres spinlock
before setting DLM_LOCK_RES_DROPPING_REF and calls dlm_purge_lockres.
spinlock is reacquired but in this window lockres can get reused. This leads
to BUG.

This patch modifies dlm_run_purgelist to skip lockres if it's in use and purge
 next lockres. It also sets DLM_LOCK_RES_DROPPING_REF before releasing the
lockres spinlock protecting it from getting reused.

Signed-off-by: Srinivas Eeda srinivas.e...@oracle.com
---
 fs/ocfs2/dlm/dlmthread.c |   79 +++--
 1 files changed, 33 insertions(+), 46 deletions(-)

diff --git a/fs/ocfs2/dlm/dlmthread.c b/fs/ocfs2/dlm/dlmthread.c
index 11a6d1f..6822f9a 100644
--- a/fs/ocfs2/dlm/dlmthread.c
+++ b/fs/ocfs2/dlm/dlmthread.c
@@ -152,45 +152,25 @@ void dlm_lockres_calc_usage(struct dlm_ctxt *dlm,
spin_unlock(dlm-spinlock);
 }
 
-static int dlm_purge_lockres(struct dlm_ctxt *dlm,
+static void dlm_purge_lockres(struct dlm_ctxt *dlm,
 struct dlm_lock_resource *res)
 {
int master;
int ret = 0;
 
-   spin_lock(res-spinlock);
-   if (!__dlm_lockres_unused(res)) {
-   mlog(0, %s:%.*s: tried to purge but not unused\n,
-dlm-name, res-lockname.len, res-lockname.name);
-   __dlm_print_one_lock_resource(res);
-   spin_unlock(res-spinlock);
-   BUG();
-   }
-
-   if (res-state  DLM_LOCK_RES_MIGRATING) {
-   mlog(0, %s:%.*s: Delay dropref as this lockres is 
-being remastered\n, dlm-name, res-lockname.len,
-res-lockname.name);
-   /* Re-add the lockres to the end of the purge list */
-   if (!list_empty(res-purge)) {
-   list_del_init(res-purge);
-   list_add_tail(res-purge, dlm-purge_list);
-   }
-   spin_unlock(res-spinlock);
-   return 0;
-   }
+   assert_spin_locked(dlm-spinlock);
+   assert_spin_locked(res-spinlock);
 
master = (res-owner == dlm-node_num);
 
-   if (!master)
-   res-state |= DLM_LOCK_RES_DROPPING_REF;
-   spin_unlock(res-spinlock);
 
mlog(0, purging lockres %.*s, master = %d\n, res-lockname.len,
 res-lockname.name, master);
 
if (!master) {
+   res-state |= DLM_LOCK_RES_DROPPING_REF;
/* drop spinlock...  retake below */
+   spin_unlock(res-spinlock);
spin_unlock(dlm-spinlock);
 
spin_lock(res-spinlock);
@@ -208,31 +188,35 @@ static int dlm_purge_lockres(struct dlm_ctxt *dlm,
mlog(0, %s:%.*s: dlm_deref_lockres returned %d\n,
 dlm-name, res-lockname.len, res-lockname.name, ret);
spin_lock(dlm-spinlock);
+   spin_lock(res-spinlock);
}
 
-   spin_lock(res-spinlock);
if (!list_empty(res-purge)) {
mlog(0, removing lockres %.*s:%p from purgelist, 
 master = %d\n, res-lockname.len, res-lockname.name,
 res, master);
list_del_init(res-purge);
-   spin_unlock(res-spinlock);
dlm_lockres_put(res);
dlm-purge_count--;
-   } else
-   spin_unlock(res-spinlock);
+   }
+
+   if (!__dlm_lockres_unused) {
+   mlog(ML_ERROR, found lockres %s:%.*s: in use after deref\n,
+dlm-name, res-lockname.len, res-lockname.name);
+   __dlm_print_one_lock_resource(res);
+   BUG();
+   }
 
__dlm_unhash_lockres(res);
 
/* lockres is not in the hash now.  drop the flag and wake up
 * any processes waiting in dlm_get_lock_resource. */
if (!master) {
-   spin_lock(res-spinlock);
res-state = ~DLM_LOCK_RES_DROPPING_REF;
spin_unlock(res-spinlock);
wake_up(res-wq);
-   }
-   return 0;
+   } else
+   spin_unlock(res-spinlock);
 }
 
 static void dlm_run_purge_list(struct dlm_ctxt *dlm,
@@ -251,17 +235,7 @@ static void dlm_run_purge_list(struct dlm_ctxt *dlm,
lockres = list_entry(dlm-purge_list.next,
 struct dlm_lock_resource, purge);
 
-   /* Status of the lockres *might* change so double
-* check. If the lockres is unused, holding the dlm
-* spinlock will prevent people from getting and more
-* refs on it -- there's no need to keep the lockres
-* spinlock. */
spin_lock(lockres-spinlock);
-   unused = __dlm_lockres_unused(lockres

[Ocfs2-devel] [PATCH 1/1] ocfs2 fix o2dlm dlm run purgelist

2010-06-19 Thread Srinivas Eeda

There are two problems in dlm_run_purgelist

1. If a lockres is found to be in use, dlm_run_purgelist keeps trying to purge
the same lockres instead of trying the next lockres.

2. When a lockres is found unused, dlm_run_purgelist releases lockres spinlock
before setting DLM_LOCK_RES_DROPPING_REF and calls dlm_purge_lockres.
spinlock is reacquired but in this window lockres can get reused. This leads
to BUG.

This patch modifies dlm_run_purgelist to skip lockres if it's in use and purge
 next lockres. It also sets DLM_LOCK_RES_DROPPING_REF before releasing the
lockres spinlock protecting it from getting reused.

Signed-off-by: Srinivas Eeda srinivas.e...@oracle.com
---
 fs/ocfs2/dlm/dlmthread.c |   55 +
 1 files changed, 26 insertions(+), 29 deletions(-)

diff --git a/fs/ocfs2/dlm/dlmthread.c b/fs/ocfs2/dlm/dlmthread.c
index 11a6d1f..79d1ef6 100644
--- a/fs/ocfs2/dlm/dlmthread.c
+++ b/fs/ocfs2/dlm/dlmthread.c
@@ -158,15 +158,6 @@ static int dlm_purge_lockres(struct dlm_ctxt *dlm,
int master;
int ret = 0;
 
-   spin_lock(res-spinlock);
-   if (!__dlm_lockres_unused(res)) {
-   mlog(0, %s:%.*s: tried to purge but not unused\n,
-dlm-name, res-lockname.len, res-lockname.name);
-   __dlm_print_one_lock_resource(res);
-   spin_unlock(res-spinlock);
-   BUG();
-   }
-
if (res-state  DLM_LOCK_RES_MIGRATING) {
mlog(0, %s:%.*s: Delay dropref as this lockres is 
 being remastered\n, dlm-name, res-lockname.len,
@@ -184,13 +175,13 @@ static int dlm_purge_lockres(struct dlm_ctxt *dlm,
 
if (!master)
res-state |= DLM_LOCK_RES_DROPPING_REF;
-   spin_unlock(res-spinlock);
 
mlog(0, purging lockres %.*s, master = %d\n, res-lockname.len,
 res-lockname.name, master);
 
if (!master) {
/* drop spinlock...  retake below */
+   spin_unlock(res-spinlock);
spin_unlock(dlm-spinlock);
 
spin_lock(res-spinlock);
@@ -208,30 +199,34 @@ static int dlm_purge_lockres(struct dlm_ctxt *dlm,
mlog(0, %s:%.*s: dlm_deref_lockres returned %d\n,
 dlm-name, res-lockname.len, res-lockname.name, ret);
spin_lock(dlm-spinlock);
+   spin_lock(res-spinlock);
}
 
-   spin_lock(res-spinlock);
if (!list_empty(res-purge)) {
mlog(0, removing lockres %.*s:%p from purgelist, 
 master = %d\n, res-lockname.len, res-lockname.name,
 res, master);
list_del_init(res-purge);
-   spin_unlock(res-spinlock);
dlm_lockres_put(res);
dlm-purge_count--;
-   } else
-   spin_unlock(res-spinlock);
+   }
 
-   __dlm_unhash_lockres(res);
+   if (__dlm_lockres_unused(res))
+   __dlm_unhash_lockres(res);
+   else {
+   mlog(ML_ERROR, found lockres %s:%.*s: in use after deref\n,
+dlm-name, res-lockname.len, res-lockname.name);
+   __dlm_print_one_lock_resource(res);
+   }
 
/* lockres is not in the hash now.  drop the flag and wake up
 * any processes waiting in dlm_get_lock_resource. */
if (!master) {
-   spin_lock(res-spinlock);
res-state = ~DLM_LOCK_RES_DROPPING_REF;
spin_unlock(res-spinlock);
wake_up(res-wq);
-   }
+   } else
+   spin_unlock(res-spinlock);
return 0;
 }
 
@@ -251,17 +246,7 @@ static void dlm_run_purge_list(struct dlm_ctxt *dlm,
lockres = list_entry(dlm-purge_list.next,
 struct dlm_lock_resource, purge);
 
-   /* Status of the lockres *might* change so double
-* check. If the lockres is unused, holding the dlm
-* spinlock will prevent people from getting and more
-* refs on it -- there's no need to keep the lockres
-* spinlock. */
spin_lock(lockres-spinlock);
-   unused = __dlm_lockres_unused(lockres);
-   spin_unlock(lockres-spinlock);
-
-   if (!unused)
-   continue;
 
purge_jiffies = lockres-last_used +
msecs_to_jiffies(DLM_PURGE_INTERVAL_MS);
@@ -273,15 +258,27 @@ static void dlm_run_purge_list(struct dlm_ctxt *dlm,
 * in tail order, we can stop at the first
 * unpurgable resource -- anyone added after
 * him will have a greater last_used value */
+   spin_unlock(lockres-spinlock);
break;
}
 
+   /* Status of the lockres *might* change so double
+* check. If the lockres

Re: [Ocfs2-devel] [PATCH 2/2] ocfs2: o2dlm fix race in purge lockres and newlock (orabug 9094491)

2010-06-18 Thread Srinivas Eeda

On 6/17/2010 7:11 PM, Sunil Mushran wrote:
 This patch looks ok on the surface. Should be usable. But checking into
 the repo is another matter.
ok, won't checkin but will give this as one-off fix for now.

 My problem is that this flag is very much like inflight_locks but
 is not the same. inflight_locks are taken only on lockres' that are
 mastered by the node. This new flag does the same but for non
 mastered locks too. A better solution will be to merge the two.
 And that means cleaning up inflight_locks.
will look into this.

 The reason for the NAK for check in is that the code is adding
 more messiness to an area that is already very messy.

 Sunil

 On 06/15/2010 09:43 PM, Srinivas Eeda wrote:
 This patch fixes the following hole.
 dlmlock tries to create a new lock on a lockres that is on purge 
 list. It calls
 dlm_get_lockresource and later adds a lock to blocked list. But in 
 this window,
 dlm_thread can purge the lockres and unhash it. This will cause a 
 BUG, as when
 the AST comes back from the master lockres is not found

 This patch marks the lockres with a new state DLM_LOCK_RES_IN_USE 
 which would
 protect lockres from dlm_thread purging it.

 Signed-off-by: Srinivas Eedasrinivas.e...@oracle.com
 ---
   fs/ocfs2/dlm/dlmcommon.h |1 +
   fs/ocfs2/dlm/dlmlock.c   |4 
   fs/ocfs2/dlm/dlmmaster.c |5 -
   fs/ocfs2/dlm/dlmthread.c |1 +
   4 files changed, 10 insertions(+), 1 deletions(-)

 diff --git a/fs/ocfs2/dlm/dlmcommon.h b/fs/ocfs2/dlm/dlmcommon.h
 index 0102be3..6cf3c08 100644
 --- a/fs/ocfs2/dlm/dlmcommon.h
 +++ b/fs/ocfs2/dlm/dlmcommon.h
 @@ -279,6 +279,7 @@ static inline void __dlm_set_joining_node(struct 
 dlm_ctxt *dlm,
   #define DLM_LOCK_RES_DIRTY0x0008
   #define DLM_LOCK_RES_IN_PROGRESS  0x0010
   #define DLM_LOCK_RES_MIGRATING0x0020
 +#define DLM_LOCK_RES_IN_USE   0x0100
   #define DLM_LOCK_RES_DROPPING_REF 0x0040
   #define DLM_LOCK_RES_BLOCK_DIRTY  0x1000
   #define DLM_LOCK_RES_SETREF_INPROG0x2000
 diff --git a/fs/ocfs2/dlm/dlmlock.c b/fs/ocfs2/dlm/dlmlock.c
 index 777..501ac40 100644
 --- a/fs/ocfs2/dlm/dlmlock.c
 +++ b/fs/ocfs2/dlm/dlmlock.c
 @@ -134,6 +134,8 @@ static enum dlm_status dlmlock_master(struct 
 dlm_ctxt *dlm,
   if (status != DLM_NORMAL
   lock-ml.node != dlm-node_num) {
   /* erf.  state changed after lock was dropped. */
 +/* DLM_LOCK_RES_IN_USE is set in dlm_get_lock_resource */
 +res-state= ~DLM_LOCK_RES_IN_USE;
   spin_unlock(res-spinlock);
   dlm_error(status);
   return status;
 @@ -180,6 +182,7 @@ static enum dlm_status dlmlock_master(struct 
 dlm_ctxt *dlm,
   kick_thread = 1;
   }
   }
 +res-state= ~DLM_LOCK_RES_IN_USE;
   /* reduce the inflight count, this may result in the lockres
* being purged below during calc_usage */
   if (lock-ml.node == dlm-node_num)
 @@ -246,6 +249,7 @@ static enum dlm_status dlmlock_remote(struct 
 dlm_ctxt *dlm,

   spin_lock(res-spinlock);
   res-state= ~DLM_LOCK_RES_IN_PROGRESS;
 +res-state= ~DLM_LOCK_RES_IN_USE;
   lock-lock_pending = 0;
   if (status != DLM_NORMAL) {
   if (status == DLM_RECOVERING
 diff --git a/fs/ocfs2/dlm/dlmmaster.c b/fs/ocfs2/dlm/dlmmaster.c
 index 9289b43..f0f2d97 100644
 --- a/fs/ocfs2/dlm/dlmmaster.c
 +++ b/fs/ocfs2/dlm/dlmmaster.c
 @@ -719,6 +719,7 @@ lookup:
   if (tmpres) {
   int dropping_ref = 0;

 +tmpres-state |= DLM_LOCK_RES_IN_USE;
   spin_unlock(dlm-spinlock);

   spin_lock(tmpres-spinlock);
 @@ -731,8 +732,10 @@ lookup:
   if (tmpres-owner == dlm-node_num) {
   BUG_ON(tmpres-state  DLM_LOCK_RES_DROPPING_REF);
   dlm_lockres_grab_inflight_ref(dlm, tmpres);
 -} else if (tmpres-state  DLM_LOCK_RES_DROPPING_REF)
 +} else if (tmpres-state  DLM_LOCK_RES_DROPPING_REF) {
 +tmpres-state= ~DLM_LOCK_RES_IN_USE;
   dropping_ref = 1;
 +}
   spin_unlock(tmpres-spinlock);

   /* wait until done messaging the master, drop our ref to allow
 diff --git a/fs/ocfs2/dlm/dlmthread.c b/fs/ocfs2/dlm/dlmthread.c
 index fb0be6c..2b82e0b 100644
 --- a/fs/ocfs2/dlm/dlmthread.c
 +++ b/fs/ocfs2/dlm/dlmthread.c
 @@ -93,6 +93,7 @@ int __dlm_lockres_has_locks(struct 
 dlm_lock_resource *res)
   int __dlm_lockres_unused(struct dlm_lock_resource *res)
   {
   if (!__dlm_lockres_has_locks(res)
 +!(res-state  DLM_LOCK_RES_IN_USE)
   (list_empty(res-dirty)  !(res-state  DLM_LOCK_RES_DIRTY))) {
   /* try not to scan the bitmap unless the first two
* conditions are already true */



___
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-devel

Re: [Ocfs2-devel] [PATCH 1/2] ocfs2 fix o2dlm dlm run purgelist

2010-06-17 Thread Srinivas Eeda


On 6/17/2010 1:32 AM, Joel Becker wrote:

On Wed, Jun 16, 2010 at 06:44:43PM -0700, Sunil Mushran wrote:
  

One way to skip a lockres in the purgelist is to list_del_init() and
list_add_tail(). That simplifies the patch a lot.

I have attached a quick  dirty patch. See if that satisfies all the
requirements.



As far as I can tell from reading the code, the time_after()
check is because they are time ordered.  Wouldn't moving it to the end
violate that?
  

right. that's why I didn't want to move used lockres to tail :)

Joel

  
___
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-devel

Re: [Ocfs2-devel] [PATCH 1/2] ocfs2 fix o2dlm dlm run purgelist

2010-06-17 Thread Srinivas Eeda


On 6/17/2010 7:48 AM, Sunil Mushran wrote:

On 06/17/2010 01:35 AM, Srinivas Eeda wrote:

On 6/17/2010 1:32 AM, Joel Becker wrote:

On Wed, Jun 16, 2010 at 06:44:43PM -0700, Sunil Mushran wrote:
  

One way to skip a lockres in the purgelist is to list_del_init() and
list_add_tail(). That simplifies the patch a lot.

I have attached a quick  dirty patch. See if that satisfies all the
requirements.



As far as I can tell from reading the code, the time_after()
check is because they are time ordered.  Wouldn't moving it to the end
violate that?
  

right. that's why I didn't want to move used lockres to tail :)



No. There is no need for time ordering. Or, am I missing something?

We delay the purge incase the file system changes its mind and wants
to reuse it. By delaying, we hold onto the mastery information. That's it.

So, should we preserve this? or purge the lockres  when it's on purge 
list? OR we could just update the last_used and move it to end of the 
list if it can be purged.
___
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-devel

Re: [Ocfs2-devel] [PATCH 1/2] ocfs2 fix o2dlm dlm run purgelist

2010-06-17 Thread Srinivas Eeda

Sunil,

as of now, there is still a window in dlm_get_lock_resource, where it 
finds the lockres but it doesn't protect it from getting purged. Second 
patch fixes this by marking it in_use, can you please review that one as 
well.

Thanks,
--Srini

On 6/17/2010 8:06 AM, Sunil Mushran wrote:
 On 06/15/2010 11:06 PM, Wengang Wang wrote:
 still the question.
 If you have sent DEREF request to the master, and the lockres became 
 in-use
 again, then the lockres remains in the hash table and also in the 
 purge list.
 So
 1) If this node is the last ref, there is a possibility that the master
 purged the lockres after receiving DEREF request from this node. In this
 case, when this node does dlmlock_remote(), the lockres won't be 
 found on the
 master. How to deal with it?

 2) The lockres on this node is going to be purged again, it means it 
 will send
 secondary DEREFs to the master. This is not good I think.

 A thought is setting lockres-owner to DLM_LOCK_RES_OWNER_UNKNOWN after
 sending a DEREF request againt this lockres. Also redo master reqeust
 before locking on it.


 The fix we are working towards is to ensure that we set
 DLM_LOCK_RES_DROPPING_REF once we are determined
 to purge the lockres. As in, we should not let go of the spinlock
 before we have either set the flag or decided against purging
 that resource.

 Once the flag is set, new users looking up the resource via
 dlm_get_lock_resource() will notice the flag and will then wait
 for that flag to be cleared before looking up the lockres hash
 again. If all goes well, the lockres will not be found (because it
 has since been unhashed) and it will be forced to go thru the
 full mastery process.

___
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-devel

Re: [Ocfs2-devel] [PATCH] ocfs2/dlm: delay the migration when the lockres is in recovery

2010-06-16 Thread Srinivas Eeda

patch looks good, it fixes the umount code path which prevents a lockres 
from migrating if it needs to be recovered. I have few comments on the 
scenario you described.


On 6/11/2010 3:25 AM, Wengang Wang wrote:

Any comment on this patch?

regards,
wengang.
On 10-05-25 15:59, Wengang Wang wrote:
  

We shouldn't migrate a lockres in recovery state.
Otherwise, it has the following problem:

1) Recovery happened as recovery master on a node(node A) which is in umount
migrating all lockres' it owned(master is node A) to other nodes, say a node B.
2) So node A wants to take over all the lockres' those are mastered by the
crashed node C. 
3) Receiving request_locks request from node A, node B send mig_lockres

requests(for recovery) to node A for all lockres' that was mastered by the
crashed node C. It can also send the request for a lockres(lockres A) which is
not in node A's hashtable.

why wouldn't lockres A be in node A's hashtable? if it's not in hash 
table, then it won't be migratable

4) Receiving the mig_lockres request for lockres A from node B, a new lockres
object lockres A', with INRECOVERING flag set, is created and inserted to hash
table.
5) The recovery for lockres A' is going on on node A, it finally mastered the
lockres A'. And now, RECOVERING flag is not cleared from lockres A' nor from
lockres A on node B.
6) The migration for lockres A' goes since now node A mastered lockres A' 
already.
the mig_lockres request(for migration) is sent to node B.
7) Node B responsed with -EFAULT because now lockres A is still in recovery 
state.
8) Node A BUG() on the -EFAULT.

fix:
The recovery state is cleared on node A(recovery master) after it's cleared on
node B. We wait until the in recovery state is cleared from node A and migrate
it to node B. 


Signed-off-by: Wengang Wang wen.gang.w...@oracle.com
---
 fs/ocfs2/dlm/dlmmaster.c |3 +++
 1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/fs/ocfs2/dlm/dlmmaster.c b/fs/ocfs2/dlm/dlmmaster.c
index 9289b43..de9c128 100644
--- a/fs/ocfs2/dlm/dlmmaster.c
+++ b/fs/ocfs2/dlm/dlmmaster.c
@@ -2371,6 +2371,9 @@ static int dlm_is_lockres_migrateable(struct dlm_ctxt 
*dlm,
goto leave;
}
 
+	if (unlikely(res-state  DLM_LOCK_RES_RECOVERING))

+   goto leave;
+
ret = 0;
queue = res-granted;
for (i = 0; i  3; i++) {
--
1.6.6.1


___
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-devel



___
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-devel
  
___
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-devel

[Ocfs2-devel] [PATCH 2/2] ocfs2: o2dlm fix race in purge lockres and newlock (orabug 9094491)

2010-06-15 Thread Srinivas Eeda

This patch fixes the following hole.
dlmlock tries to create a new lock on a lockres that is on purge list. It calls
dlm_get_lockresource and later adds a lock to blocked list. But in this window,
dlm_thread can purge the lockres and unhash it. This will cause a BUG, as when
the AST comes back from the master lockres is not found

This patch marks the lockres with a new state DLM_LOCK_RES_IN_USE which would
protect lockres from dlm_thread purging it.

Signed-off-by: Srinivas Eeda srinivas.e...@oracle.com
---
 fs/ocfs2/dlm/dlmcommon.h |1 +
 fs/ocfs2/dlm/dlmlock.c   |4 
 fs/ocfs2/dlm/dlmmaster.c |5 -
 fs/ocfs2/dlm/dlmthread.c |1 +
 4 files changed, 10 insertions(+), 1 deletions(-)

diff --git a/fs/ocfs2/dlm/dlmcommon.h b/fs/ocfs2/dlm/dlmcommon.h
index 0102be3..6cf3c08 100644
--- a/fs/ocfs2/dlm/dlmcommon.h
+++ b/fs/ocfs2/dlm/dlmcommon.h
@@ -279,6 +279,7 @@ static inline void __dlm_set_joining_node(struct dlm_ctxt 
*dlm,
 #define DLM_LOCK_RES_DIRTY0x0008
 #define DLM_LOCK_RES_IN_PROGRESS  0x0010
 #define DLM_LOCK_RES_MIGRATING0x0020
+#define DLM_LOCK_RES_IN_USE   0x0100
 #define DLM_LOCK_RES_DROPPING_REF 0x0040
 #define DLM_LOCK_RES_BLOCK_DIRTY  0x1000
 #define DLM_LOCK_RES_SETREF_INPROG0x2000
diff --git a/fs/ocfs2/dlm/dlmlock.c b/fs/ocfs2/dlm/dlmlock.c
index 777..501ac40 100644
--- a/fs/ocfs2/dlm/dlmlock.c
+++ b/fs/ocfs2/dlm/dlmlock.c
@@ -134,6 +134,8 @@ static enum dlm_status dlmlock_master(struct dlm_ctxt *dlm,
if (status != DLM_NORMAL 
lock-ml.node != dlm-node_num) {
/* erf.  state changed after lock was dropped. */
+   /* DLM_LOCK_RES_IN_USE is set in dlm_get_lock_resource */
+   res-state = ~DLM_LOCK_RES_IN_USE;
spin_unlock(res-spinlock);
dlm_error(status);
return status;
@@ -180,6 +182,7 @@ static enum dlm_status dlmlock_master(struct dlm_ctxt *dlm,
kick_thread = 1;
}
}
+   res-state = ~DLM_LOCK_RES_IN_USE;
/* reduce the inflight count, this may result in the lockres
 * being purged below during calc_usage */
if (lock-ml.node == dlm-node_num)
@@ -246,6 +249,7 @@ static enum dlm_status dlmlock_remote(struct dlm_ctxt *dlm,
 
spin_lock(res-spinlock);
res-state = ~DLM_LOCK_RES_IN_PROGRESS;
+   res-state = ~DLM_LOCK_RES_IN_USE;
lock-lock_pending = 0;
if (status != DLM_NORMAL) {
if (status == DLM_RECOVERING 
diff --git a/fs/ocfs2/dlm/dlmmaster.c b/fs/ocfs2/dlm/dlmmaster.c
index 9289b43..f0f2d97 100644
--- a/fs/ocfs2/dlm/dlmmaster.c
+++ b/fs/ocfs2/dlm/dlmmaster.c
@@ -719,6 +719,7 @@ lookup:
if (tmpres) {
int dropping_ref = 0;
 
+   tmpres-state |= DLM_LOCK_RES_IN_USE;
spin_unlock(dlm-spinlock);
 
spin_lock(tmpres-spinlock);
@@ -731,8 +732,10 @@ lookup:
if (tmpres-owner == dlm-node_num) {
BUG_ON(tmpres-state  DLM_LOCK_RES_DROPPING_REF);
dlm_lockres_grab_inflight_ref(dlm, tmpres);
-   } else if (tmpres-state  DLM_LOCK_RES_DROPPING_REF)
+   } else if (tmpres-state  DLM_LOCK_RES_DROPPING_REF) {
+   tmpres-state = ~DLM_LOCK_RES_IN_USE;
dropping_ref = 1;
+   }
spin_unlock(tmpres-spinlock);
 
/* wait until done messaging the master, drop our ref to allow
diff --git a/fs/ocfs2/dlm/dlmthread.c b/fs/ocfs2/dlm/dlmthread.c
index fb0be6c..2b82e0b 100644
--- a/fs/ocfs2/dlm/dlmthread.c
+++ b/fs/ocfs2/dlm/dlmthread.c
@@ -93,6 +93,7 @@ int __dlm_lockres_has_locks(struct dlm_lock_resource *res)
 int __dlm_lockres_unused(struct dlm_lock_resource *res)
 {
if (!__dlm_lockres_has_locks(res) 
+   !(res-state  DLM_LOCK_RES_IN_USE) 
(list_empty(res-dirty)  !(res-state  DLM_LOCK_RES_DIRTY))) {
/* try not to scan the bitmap unless the first two
 * conditions are already true */
-- 
1.5.6.5


___
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-devel

Re: [Ocfs2-devel] [PATCH] ocfs2/dlm: check dlm_state under spinlock

2010-06-15 Thread Srinivas Eeda

The lock order in this code causes dead lock, not caused by your patch. 
The lock order in dlm_query_join_handler is

dlm_domain_lock
-dlm-spinlock

dead locks with ..

dlm_lockres_put calls dlm_lockres_release while holding dlm-spinlock 
which calls dlm_put which gets dlm_domain_lock. So the spin lock order 
here is
dlm-spinlock
- dlm_domain_lock


On 6/15/2010 9:08 PM, Wengang Wang wrote:
 We should check dlm-dlm_state under dlm-spinlock though maybe in this case 
 it
 doesn't hurt.

 Signed-off-by: Wengang Wang wen.gang.w...@oracle.com
 ---
  fs/ocfs2/dlm/dlmdomain.c |   13 ++---
  1 files changed, 6 insertions(+), 7 deletions(-)

 diff --git a/fs/ocfs2/dlm/dlmdomain.c b/fs/ocfs2/dlm/dlmdomain.c
 index 6b5a492..ab82add 100644
 --- a/fs/ocfs2/dlm/dlmdomain.c
 +++ b/fs/ocfs2/dlm/dlmdomain.c
 @@ -796,7 +796,7 @@ static int dlm_query_join_handler(struct o2net_msg *msg, 
 u32 len, void *data,
   spin_lock(dlm_domain_lock);
   dlm = __dlm_lookup_domain_full(query-domain, query-name_len);
   if (!dlm)
 - goto unlock_respond;
 + goto unlock_domain_respond;
  
   /*
* There is a small window where the joining node may not see the
 @@ -811,7 +811,7 @@ static int dlm_query_join_handler(struct o2net_msg *msg, 
 u32 len, void *data,
have node %u in its nodemap\n,
query-node_idx, nodenum);
   packet.code = JOIN_DISALLOW;
 - goto unlock_respond;
 + goto unlock_domain_respond;
   }
   }
   nodenum++;
 @@ -821,9 +821,9 @@ static int dlm_query_join_handler(struct o2net_msg *msg, 
 u32 len, void *data,
* to be put in someone's domain map.
* Also, explicitly disallow joining at certain troublesome
* times (ie. during recovery). */
 - if (dlm  dlm-dlm_state != DLM_CTXT_LEAVING) {
 + spin_lock(dlm-spinlock);
 + if (dlm-dlm_state != DLM_CTXT_LEAVING) {
   int bit = query-node_idx;
 - spin_lock(dlm-spinlock);
  
   if (dlm-dlm_state == DLM_CTXT_NEW 
   dlm-joining_node == DLM_LOCK_RES_OWNER_UNKNOWN) {
 @@ -869,10 +869,9 @@ static int dlm_query_join_handler(struct o2net_msg *msg, 
 u32 len, void *data,
   __dlm_set_joining_node(dlm, query-node_idx);
   }
   }
 -
 - spin_unlock(dlm-spinlock);
   }
 -unlock_respond:
 + spin_unlock(dlm-spinlock);
 +unlock_domain_respond:
   spin_unlock(dlm_domain_lock);
  
  respond:
   

___
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-devel

Re: [Ocfs2-devel] [PATCH 1/1] ocfs2: o2dlm fix race in purge lockres and newlock (orabug 9094491)

2010-06-14 Thread Srinivas Eeda

Please ignore this patch. I'll resend this patch along with dlm purge 
lockres for completeness.

On 6/9/2010 7:07 PM, Srinivas Eeda wrote:
 dlm_thread sends a deref message to the master node. At the same time,
 another thread sends a new lock request to the master node.
 Since dlm_thread wouldn't know about it, it would unhash the lockres
 after it gets the response. lock request AST would then won't find
 the lockres and hence BUGs.

 The fix is to add new state DLM_LOCK_RES_IN_USE which would prevent
 dlm_thread from purging the lockres and/or unhashing lockres.

 Signed-off-by: Srinivas Eeda srinivas.e...@oracle.com
 ---
  fs/ocfs2/dlm/dlmcommon.h |1 +
  fs/ocfs2/dlm/dlmlock.c   |4 
  fs/ocfs2/dlm/dlmmaster.c |5 -
  fs/ocfs2/dlm/dlmthread.c |4 +++-
  4 files changed, 12 insertions(+), 2 deletions(-)

 diff --git a/fs/ocfs2/dlm/dlmcommon.h b/fs/ocfs2/dlm/dlmcommon.h
 index 0102be3..0290612 100644
 --- a/fs/ocfs2/dlm/dlmcommon.h
 +++ b/fs/ocfs2/dlm/dlmcommon.h
 @@ -280,6 +280,7 @@ static inline void __dlm_set_joining_node(struct dlm_ctxt 
 *dlm,
  #define DLM_LOCK_RES_IN_PROGRESS  0x0010
  #define DLM_LOCK_RES_MIGRATING0x0020
  #define DLM_LOCK_RES_DROPPING_REF 0x0040
 +#define DLM_LOCK_RES_IN_USE   0x0100
  #define DLM_LOCK_RES_BLOCK_DIRTY  0x1000
  #define DLM_LOCK_RES_SETREF_INPROG0x2000
  
 diff --git a/fs/ocfs2/dlm/dlmlock.c b/fs/ocfs2/dlm/dlmlock.c
 index 777..501ac40 100644
 --- a/fs/ocfs2/dlm/dlmlock.c
 +++ b/fs/ocfs2/dlm/dlmlock.c
 @@ -134,6 +134,8 @@ static enum dlm_status dlmlock_master(struct dlm_ctxt 
 *dlm,
   if (status != DLM_NORMAL 
   lock-ml.node != dlm-node_num) {
   /* erf.  state changed after lock was dropped. */
 + /* DLM_LOCK_RES_IN_USE is set in dlm_get_lock_resource */
 + res-state = ~DLM_LOCK_RES_IN_USE;
   spin_unlock(res-spinlock);
   dlm_error(status);
   return status;
 @@ -180,6 +182,7 @@ static enum dlm_status dlmlock_master(struct dlm_ctxt 
 *dlm,
   kick_thread = 1;
   }
   }
 + res-state = ~DLM_LOCK_RES_IN_USE;
   /* reduce the inflight count, this may result in the lockres
* being purged below during calc_usage */
   if (lock-ml.node == dlm-node_num)
 @@ -246,6 +249,7 @@ static enum dlm_status dlmlock_remote(struct dlm_ctxt 
 *dlm,
  
   spin_lock(res-spinlock);
   res-state = ~DLM_LOCK_RES_IN_PROGRESS;
 + res-state = ~DLM_LOCK_RES_IN_USE;
   lock-lock_pending = 0;
   if (status != DLM_NORMAL) {
   if (status == DLM_RECOVERING 
 diff --git a/fs/ocfs2/dlm/dlmmaster.c b/fs/ocfs2/dlm/dlmmaster.c
 index 9289b43..f0f2d97 100644
 --- a/fs/ocfs2/dlm/dlmmaster.c
 +++ b/fs/ocfs2/dlm/dlmmaster.c
 @@ -719,6 +719,7 @@ lookup:
   if (tmpres) {
   int dropping_ref = 0;
  
 + tmpres-state |= DLM_LOCK_RES_IN_USE;
   spin_unlock(dlm-spinlock);
  
   spin_lock(tmpres-spinlock);
 @@ -731,8 +732,10 @@ lookup:
   if (tmpres-owner == dlm-node_num) {
   BUG_ON(tmpres-state  DLM_LOCK_RES_DROPPING_REF);
   dlm_lockres_grab_inflight_ref(dlm, tmpres);
 - } else if (tmpres-state  DLM_LOCK_RES_DROPPING_REF)
 + } else if (tmpres-state  DLM_LOCK_RES_DROPPING_REF) {
 + tmpres-state = ~DLM_LOCK_RES_IN_USE;
   dropping_ref = 1;
 + }
   spin_unlock(tmpres-spinlock);
  
   /* wait until done messaging the master, drop our ref to allow
 diff --git a/fs/ocfs2/dlm/dlmthread.c b/fs/ocfs2/dlm/dlmthread.c
 index 11a6d1f..b2315cb 100644
 --- a/fs/ocfs2/dlm/dlmthread.c
 +++ b/fs/ocfs2/dlm/dlmthread.c
 @@ -93,6 +93,7 @@ int __dlm_lockres_has_locks(struct dlm_lock_resource *res)
  int __dlm_lockres_unused(struct dlm_lock_resource *res)
  {
   if (!__dlm_lockres_has_locks(res) 
 + !(res-state  DLM_LOCK_RES_IN_USE) 
   (list_empty(res-dirty)  !(res-state  DLM_LOCK_RES_DIRTY))) {
   /* try not to scan the bitmap unless the first two
* conditions are already true */
 @@ -222,7 +223,8 @@ static int dlm_purge_lockres(struct dlm_ctxt *dlm,
   } else
   spin_unlock(res-spinlock);
  
 - __dlm_unhash_lockres(res);
 + if (__dlm_lockres_unused(res))
 + __dlm_unhash_lockres(res);
  
   /* lockres is not in the hash now.  drop the flag and wake up
* any processes waiting in dlm_get_lock_resource. */
   

___
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-devel

Re: [Ocfs2-devel] [PATCH] ocfs2/dlm: remove lockres from purge when a lock is added

2010-06-10 Thread Srinivas Eeda

Wengang, thanks for the patch. My comments are inline :)

On 6/8/2010 7:38 AM, Wengang Wang wrote:
 dlm_thread(when purges a lockres) races with another thread that is running on
 dlmlock_remote(). dlmlock_remote() can add a lock to the blocked list of the
 lockres without taking dlm-spinlock. This can lead a BUG() in dlm_thread 
 because
 of the added lock.

 Fix
 We take dlm-spinlock when adding a lock to the blocked list and make sure the
 lockres is not in purge list. If we removed the lockres from purge list, try 
 to
 add it back.

 Signed-off-by: Wengang Wang wen.gang.w...@oracle.com
 ---
  fs/ocfs2/dlm/dlmcommon.h |   15 +++
  fs/ocfs2/dlm/dlmlock.c   |   18 +++---
  fs/ocfs2/dlm/dlmthread.c |   28 
  3 files changed, 58 insertions(+), 3 deletions(-)

 diff --git a/fs/ocfs2/dlm/dlmcommon.h b/fs/ocfs2/dlm/dlmcommon.h
 index 4b6ae2c..134973a 100644
 --- a/fs/ocfs2/dlm/dlmcommon.h
 +++ b/fs/ocfs2/dlm/dlmcommon.h
 @@ -1002,6 +1002,9 @@ void dlm_move_lockres_to_recovery_list(struct dlm_ctxt 
 *dlm,
  
  /* will exit holding res-spinlock, but may drop in function */
  void __dlm_wait_on_lockres_flags(struct dlm_lock_resource *res, int flags);
 +void __dlm_wait_on_lockres_flags_lock_dlm(struct dlm_ctxt *dlm,
 +   struct dlm_lock_resource *res,
 +   int flags);
  void __dlm_wait_on_lockres_flags_set(struct dlm_lock_resource *res, int 
 flags);
  
  /* will exit holding res-spinlock, but may drop in function */
 @@ -1012,6 +1015,18 @@ static inline void __dlm_wait_on_lockres(struct 
 dlm_lock_resource *res)
 DLM_LOCK_RES_MIGRATING));
  }
  
 +/*
 + * will exit holding dlm-spinlock and res-spinlock, but may drop in
 + * function
 + */
 +static inline void __dlm_wait_on_lockres_lock_dlm(struct dlm_ctxt *dlm,
 +   struct dlm_lock_resource *res)
 +{
 + __dlm_wait_on_lockres_flags_lock_dlm(dlm, res,
 +  (DLM_LOCK_RES_IN_PROGRESS|
 +   DLM_LOCK_RES_RECOVERING|
 +   DLM_LOCK_RES_MIGRATING));
 +}
  void __dlm_unlink_mle(struct dlm_ctxt *dlm, struct dlm_master_list_entry 
 *mle);
  void __dlm_insert_mle(struct dlm_ctxt *dlm, struct dlm_master_list_entry 
 *mle);
  
 diff --git a/fs/ocfs2/dlm/dlmlock.c b/fs/ocfs2/dlm/dlmlock.c
 index 69cf369..fb3c178 100644
 --- a/fs/ocfs2/dlm/dlmlock.c
 +++ b/fs/ocfs2/dlm/dlmlock.c
 @@ -222,23 +222,35 @@ static enum dlm_status dlmlock_remote(struct dlm_ctxt 
 *dlm,
 struct dlm_lock *lock, int flags)
  {
   enum dlm_status status = DLM_DENIED;
 - int lockres_changed = 1;
 + int lockres_changed = 1, recalc;
  
   mlog_entry(type=%d\n, lock-ml.type);
   mlog(0, lockres %.*s, flags = 0x%x\n, res-lockname.len,
res-lockname.name, flags);
  
 + spin_lock(dlm-spinlock);
   spin_lock(res-spinlock);
  
   /* will exit this call with spinlock held */
 - __dlm_wait_on_lockres(res);
 + __dlm_wait_on_lockres_lock_dlm(dlm, res);
   res-state |= DLM_LOCK_RES_IN_PROGRESS;
  
   /* add lock to local (secondary) queue */
   dlm_lock_get(lock);
   list_add_tail(lock-list, res-blocked);
   lock-lock_pending = 1;
 + /*
 +  * make sure this lockres is not in purge list if we remove the
 +  * lockres from purge list, we try to add it back if locking failed.
 +  *
 +  * there is a race with dlm_thread purging this lockres.
 +  * in this case, it can trigger a BUG() because we added a lock to the
 +  * blocked list.
 +  */
 + recalc = !list_empty(res-purge);
 + __dlm_lockres_calc_usage(dlm, res);
   
Thanks for this patch, looks good. A few comments on another approach, 
let me know your thoughts.

The patch I sent for review marks the lockres to be in use (before 
dlmlock_remote is called) in dlm_get_lockresource. I have to do that to 
protect the lockres from unhashing once it's being used. But since 
lockres is still in purgelist the BUG you mentioned can still trigger in 
dlm_thread.

once a lockres is on purge list there are quite a few cases(case this 
patch is addressing, case the patch I just sent for review is addresing, 
and the case of the recovery) that can trigger the lockres to be reused, 
before or while it is getting purged. dlm_thread(dlm_run_purge_list) 
already expects that a lockres on purge list could just getting started 
to be reused. However it doesn't handle the above cases. I think it's 
better to enhance dlm_run_purge_list/dlm_purge_lockres code to handle 
these cases. I'am working on this change.

There also seems to be a case thats not handled. Assume dlm_thread sent 
deref message and waiting for response, now dlmlock_remote removed it 
from purge list. dlm_thread got response from master and continues and

[Ocfs2-devel] [PATCH 1/1] ocfs2: o2dlm fix race in purge lockres and newlock (orabug 9094491)

2010-06-09 Thread Srinivas Eeda

dlm_thread sends a deref message to the master node. At the same time,
another thread sends a new lock request to the master node.
Since dlm_thread wouldn't know about it, it would unhash the lockres
after it gets the response. lock request AST would then won't find
the lockres and hence BUGs.

The fix is to add new state DLM_LOCK_RES_IN_USE which would prevent
dlm_thread from purging the lockres and/or unhashing lockres.

Signed-off-by: Srinivas Eeda srinivas.e...@oracle.com
---
 fs/ocfs2/dlm/dlmcommon.h |1 +
 fs/ocfs2/dlm/dlmlock.c   |4 
 fs/ocfs2/dlm/dlmmaster.c |5 -
 fs/ocfs2/dlm/dlmthread.c |4 +++-
 4 files changed, 12 insertions(+), 2 deletions(-)

diff --git a/fs/ocfs2/dlm/dlmcommon.h b/fs/ocfs2/dlm/dlmcommon.h
index 0102be3..0290612 100644
--- a/fs/ocfs2/dlm/dlmcommon.h
+++ b/fs/ocfs2/dlm/dlmcommon.h
@@ -280,6 +280,7 @@ static inline void __dlm_set_joining_node(struct dlm_ctxt 
*dlm,
 #define DLM_LOCK_RES_IN_PROGRESS  0x0010
 #define DLM_LOCK_RES_MIGRATING0x0020
 #define DLM_LOCK_RES_DROPPING_REF 0x0040
+#define DLM_LOCK_RES_IN_USE   0x0100
 #define DLM_LOCK_RES_BLOCK_DIRTY  0x1000
 #define DLM_LOCK_RES_SETREF_INPROG0x2000
 
diff --git a/fs/ocfs2/dlm/dlmlock.c b/fs/ocfs2/dlm/dlmlock.c
index 777..501ac40 100644
--- a/fs/ocfs2/dlm/dlmlock.c
+++ b/fs/ocfs2/dlm/dlmlock.c
@@ -134,6 +134,8 @@ static enum dlm_status dlmlock_master(struct dlm_ctxt *dlm,
if (status != DLM_NORMAL 
lock-ml.node != dlm-node_num) {
/* erf.  state changed after lock was dropped. */
+   /* DLM_LOCK_RES_IN_USE is set in dlm_get_lock_resource */
+   res-state = ~DLM_LOCK_RES_IN_USE;
spin_unlock(res-spinlock);
dlm_error(status);
return status;
@@ -180,6 +182,7 @@ static enum dlm_status dlmlock_master(struct dlm_ctxt *dlm,
kick_thread = 1;
}
}
+   res-state = ~DLM_LOCK_RES_IN_USE;
/* reduce the inflight count, this may result in the lockres
 * being purged below during calc_usage */
if (lock-ml.node == dlm-node_num)
@@ -246,6 +249,7 @@ static enum dlm_status dlmlock_remote(struct dlm_ctxt *dlm,
 
spin_lock(res-spinlock);
res-state = ~DLM_LOCK_RES_IN_PROGRESS;
+   res-state = ~DLM_LOCK_RES_IN_USE;
lock-lock_pending = 0;
if (status != DLM_NORMAL) {
if (status == DLM_RECOVERING 
diff --git a/fs/ocfs2/dlm/dlmmaster.c b/fs/ocfs2/dlm/dlmmaster.c
index 9289b43..f0f2d97 100644
--- a/fs/ocfs2/dlm/dlmmaster.c
+++ b/fs/ocfs2/dlm/dlmmaster.c
@@ -719,6 +719,7 @@ lookup:
if (tmpres) {
int dropping_ref = 0;
 
+   tmpres-state |= DLM_LOCK_RES_IN_USE;
spin_unlock(dlm-spinlock);
 
spin_lock(tmpres-spinlock);
@@ -731,8 +732,10 @@ lookup:
if (tmpres-owner == dlm-node_num) {
BUG_ON(tmpres-state  DLM_LOCK_RES_DROPPING_REF);
dlm_lockres_grab_inflight_ref(dlm, tmpres);
-   } else if (tmpres-state  DLM_LOCK_RES_DROPPING_REF)
+   } else if (tmpres-state  DLM_LOCK_RES_DROPPING_REF) {
+   tmpres-state = ~DLM_LOCK_RES_IN_USE;
dropping_ref = 1;
+   }
spin_unlock(tmpres-spinlock);
 
/* wait until done messaging the master, drop our ref to allow
diff --git a/fs/ocfs2/dlm/dlmthread.c b/fs/ocfs2/dlm/dlmthread.c
index 11a6d1f..b2315cb 100644
--- a/fs/ocfs2/dlm/dlmthread.c
+++ b/fs/ocfs2/dlm/dlmthread.c
@@ -93,6 +93,7 @@ int __dlm_lockres_has_locks(struct dlm_lock_resource *res)
 int __dlm_lockres_unused(struct dlm_lock_resource *res)
 {
if (!__dlm_lockres_has_locks(res) 
+   !(res-state  DLM_LOCK_RES_IN_USE) 
(list_empty(res-dirty)  !(res-state  DLM_LOCK_RES_DIRTY))) {
/* try not to scan the bitmap unless the first two
 * conditions are already true */
@@ -222,7 +223,8 @@ static int dlm_purge_lockres(struct dlm_ctxt *dlm,
} else
spin_unlock(res-spinlock);
 
-   __dlm_unhash_lockres(res);
+   if (__dlm_lockres_unused(res))
+   __dlm_unhash_lockres(res);
 
/* lockres is not in the hash now.  drop the flag and wake up
 * any processes waiting in dlm_get_lock_resource. */
-- 
1.5.6.5


___
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-devel

1 2 >

1 - 100 of 146 matches

Mail list logo