Re: [Ocfs2-devel] Buffer read will get starvation in case reading/writing the same file from different nodes concurrently

2015-12-07 Thread Gang He
Hello Joseph,


> Hi Gang,
> Eric and I have discussed this case before.
> Using NONBLOCK here is because there is a lock inversion between inode
> lock and page lock. You can refer to the comments of
> ocfs2_inode_lock_with_page for details.
> Actually I have found that NONBLOCK mode is only used in lock inversion
> cases.
Could you give more details? For an inode object, there are three types of cluster
locks:
ip_open_lockres is used to protect the open/delete POSIX semantics,
ip_rw_lockres looks to be used to protect writes in the direct-io case (the name
"rw" is a bit confusing),
ip_inode_lockres is used much more widely. Could you explain why there has to be a
lock inversion here, i.e. why we cannot take the lock in blocking mode in the
read/write code path?
Second, when the read path cannot get the lock in NONBLOCK mode, why does the next
code call ocfs2_inode_lock(inode, ret_bh, ex) in blocking mode?
Actually, both lock wrappers use the same cluster lock (ip_inode_lockres), and, as
you know, that subsequent blocking acquisition can cost a lot of time.
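For reference, I mean the three lock resources embedded in struct ocfs2_inode_info
(my rough reading of fs/ocfs2/inode.h, trimmed to the relevant fields):

struct ocfs2_inode_info
{
	u64			ip_blkno;

	struct ocfs2_lock_res	ip_rw_lockres;		/* write ordering, direct-io path(?)    */
	struct ocfs2_lock_res	ip_inode_lockres;	/* the general inode meta/data lock     */
	struct ocfs2_lock_res	ip_open_lockres;	/* open/unlink (delete) POSIX semantics */
	/* ... */
};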
 
Thanks a lot.
Gang  

> 
> Thanks,
> Joseph
> 
> On 2015/12/8 11:21, Gang He wrote:
>> Hello Guys,
>> 
>> There is an issue from a customer, who is complaining that a buffered read 
>> sometimes takes too much time (1 - 10 seconds) when reading/writing the same 
>> file from different nodes concurrently.
>> Using the customer's demo code, we can also reproduce this issue in house 
>> (running the test program on a SLES11 SP4 OCFS2 cluster); the issue can also be 
>> reproduced on openSUSE 13.2 (newer code), but in direct-io mode it disappears.
>> Based on my investigation, the root cause is that buffered IO uses the cluster 
>> lock differently from direct IO, and I do not know why buffered IO uses the 
>> cluster lock this way.
>> The code details are as below,
>> in aops.c file,
>>  281 static int ocfs2_readpage(struct file *file, struct page *page)
>>  282 {
>>  283 struct inode *inode = page->mapping->host;
>>  284 struct ocfs2_inode_info *oi = OCFS2_I(inode);
>>  285 loff_t start = (loff_t)page->index << PAGE_CACHE_SHIFT;
>>  286 int ret, unlock = 1;
>>  287
>>  288 trace_ocfs2_readpage((unsigned long long)oi->ip_blkno,
>>  289  (page ? page->index : 0));
>>  290
>>  291 ret = ocfs2_inode_lock_with_page(inode, NULL, 0, page);  <<== 
> this line
>>  292 if (ret != 0) {
>>  293 if (ret == AOP_TRUNCATED_PAGE)
>>  294 unlock = 0;
>>  295 mlog_errno(ret);
>>  296 goto out;
>>  297 } 
>>  
>> in dlmglue.c file,
>> 2442 int ocfs2_inode_lock_with_page(struct inode *inode,
>> 2443   struct buffer_head **ret_bh,
>> 2444   int ex,
>> 2445   struct page *page)
>> 2446 {
>> 2447 int ret;
>> 2448
>> 2449 ret = ocfs2_inode_lock_full(inode, ret_bh, ex, 
> OCFS2_LOCK_NONBLOCK); <<== there, why using NONBLOCK mode to get the cluster 
> lock? this way will let reading IO get starvation. 
>> 2450 if (ret == -EAGAIN) {
>> 2451 unlock_page(page);
>> 2452 if (ocfs2_inode_lock(inode, ret_bh, ex) == 0)
>> 2453 ocfs2_inode_unlock(inode, ex);
>> 2454 ret = AOP_TRUNCATED_PAGE;
>> 2455 }
>> 2456
>> 2457 return ret;
>> 2458 }
>> 
>> If you know the background behind this code, please tell us: why not take the 
>> lock in blocking mode when reading a page, so that read IO gets the page fairly 
>> when there is concurrent write IO from another node?
>> Second, I tried to change that line from OCFS2_LOCK_NONBLOCK to 0 (switching to 
>> blocking mode); read IO is then no longer blocked for long (which would address 
>> the customer's complaint), but a new problem arises: sometimes the read IO and 
>> write IO deadlock (why they deadlock, I am still investigating). 
>> 
>> 
>> Thanks
>> Gang  
>> 
>> 
>> 
>> .
>> 



Re: [Ocfs2-devel] Buffer read will get starvation in case reading/writing the same file from different nodes concurrently

2015-12-07 Thread Joseph Qi
On 2015/12/8 12:51, Eric Ren wrote:
> Hi,
> 
> On Tue, Dec 08, 2015 at 11:55:18AM +0800, joseph wrote: 
>> Hi Gang,
>> Eric and I have discussed this case before.
>> Using NONBLOCK here is because there is a lock inversion between inode
>> lock and page lock. You can refer to the comments of
>> ocfs2_inode_lock_with_page for details.
>> Actually I have found that NONBLOCK mode is only used in lock inversion
>> cases.
> Yes, those comments help ;-) Let me try to explain it; please correct me if 
> anything is wrong.
> 
> On node 1, when calling ocfs2_inode_lock_with_page(), thread A (likely doing the 
> read) has locked this page. Before thread A steps into ocfs2_inode_lock_full(), 
> thread B on node 2 (doing the write) requests an EX lock on the same inode; since 
> lockres->l_level is PR, the lockres needs to be downconverted PR->NL on node 1, 
> i.e. ocfs2_blocking_ast()->ocfs2_wake_downconvert_thread().
> 
> On node 1, the ocfs2dc thread then proceeds like this:
> ocfs2_downconvert_thread()
>  ocfs2_downconvert_thread_do_work()
>   ocfs2_process_blocked_lock()
>ocfs2_unblock_lock()
> lockres->l_ops->downconvert_worker(lockres, blocking)
>   ocfs2_data_convert_worker() {
> ...
> 3557 if (blocking == DLM_LOCK_EX) {
> 3558 truncate_inode_pages(mapping, 0);
> 3559 } else {
> 3560 /* We only need to wait on the I/O if we're not also
> 3561  * truncating pages because truncate_inode_pages waits
> 3562  * for us above. We don't truncate pages if we're
> 3563  * blocking anything < EXMODE because we want to keep
> 3564  * them around in that case. */
> 3565 filemap_fdatawait(mapping);
> 3566 }
> ...
>   }
> We can see truncate_inode_pages()
> truncate_inode_pages_range()
>   trylock_page()/lock_page()/find_lock_page()
> 
> If thread A calls ocfs2_inode_lock_full() in BLOCK mode, there is a window in 
> which the OCFS2_LOCK_BLOCKED flag and ->l_blocking=EX have already been set by 
> the BAST, i.e. ocfs2_blocking_ast()->ocfs2_generic_handle_bast(), so thread A 
> may block.
> 
> Now ocfs2dc wants the page lock, but thread A holds the page lock and has been 
> blocked because of thread B. Deadlock occurs, right? Again, please correct me if 
> anything is wrong.
Yes, you are right.
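Putting your description into one timeline (everything on node 1 unless noted):

  thread A (read)                        ocfs2dc (downconvert thread)
  lock_page(page)
                                         node 2 requests EX; BAST sets
                                         OCFS2_LOCK_BLOCKED, l_blocking = EX
  ocfs2_inode_lock_full() in BLOCK
  mode -> must wait for the downconvert
                                         ocfs2_data_convert_worker()
                                           truncate_inode_pages()
                                             lock_page(page) -> waits for A

A waits for the downconvert, and the downconvert waits for A's page lock: deadlock.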

> 
> So the author had to use the NONBLOCK mode here, which creates the livelock 
> problem. But I cannot understand very clearly how the livelock comes up. Joseph 
> or anyone else, could you please elaborate?
First it takes the inode lock in NONBLOCK mode; if the request cannot be fulfilled
right now, it returns -EAGAIN, unlocks the page, and lets the downconvert go on.
This is what breaks the deadlock.
But we do have to hold both the page lock and the inode lock to safely complete the
request, so it has to return to the VFS and retry with the page lock. Taking the
inode lock in BLOCK mode and then unlocking it immediately is meant to cache the
inode lock on this node and increase the chance that the next retry succeeds. But
if another node takes the inode lock again immediately after this node unlocks it,
the flow you described above happens again and this node still cannot be fulfilled.
And so on.
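From the VFS side, the retry loop looks roughly like this (a paraphrase of the
generic buffered-read path around ->readpage(), not the exact kernel code):

/* paraphrase of how the read path reacts to AOP_TRUNCATED_PAGE */
static int read_one_page(struct file *filp, struct address_space *mapping,
			 pgoff_t index)
{
	struct page *page;
	int error;

retry:
	page = find_or_create_page(mapping, index, GFP_KERNEL);	/* locked */
	if (!page)
		return -ENOMEM;
	error = mapping->a_ops->readpage(filp, page);	/* -> ocfs2_readpage() */
	if (error == AOP_TRUNCATED_PAGE) {
		/* ocfs2 already unlocked the page; drop it and start over.
		 * The blocking lock/unlock pair in ocfs2_inode_lock_with_page()
		 * is what tries to make this retry succeed. */
		page_cache_release(page);
		goto retry;
	}
	if (error) {
		page_cache_release(page);
		return error;
	}
	wait_on_page_locked(page);	/* ->readpage() unlocks on IO completion */
	page_cache_release(page);
	return 0;
}

So each -EAGAIN turns into one more trip through this loop; if the other node keeps
re-grabbing the inode lock, the loop can spin for a long time, which is the
starvation Gang is seeing.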

> 
> Thanks,
> Eric
> 
>>
>> Thanks,
>> Joseph
>>
>> On 2015/12/8 11:21, Gang He wrote:
>>> Hello Guys,
>>>
>>> There is an issue from a customer, who is complaining that a buffered read 
>>> sometimes takes too much time (1 - 10 seconds) when reading/writing the same 
>>> file from different nodes concurrently.
>>> Using the customer's demo code, we can also reproduce this issue in house 
>>> (running the test program on a SLES11 SP4 OCFS2 cluster); the issue can also 
>>> be reproduced on openSUSE 13.2 (newer code), but in direct-io mode it 
>>> disappears.
>>> Based on my investigation, the root cause is that buffered IO uses the cluster 
>>> lock differently from direct IO, and I do not know why buffered IO uses the 
>>> cluster lock this way.
>>> The code details are as below,
>>> in aops.c file,
>>>  281 static int ocfs2_readpage(struct file *file, struct page *page)
>>>  282 {
>>>  283 struct inode *inode = page->mapping->host;
>>>  284 struct ocfs2_inode_info *oi = OCFS2_I(inode);
>>>  285 loff_t start = (loff_t)page->index << PAGE_CACHE_SHIFT;
>>>  286 int ret, unlock = 1;
>>>  287
>>>  288 trace_ocfs2_readpage((unsigned long long)oi->ip_blkno,
>>>  289  (page ? page->index : 0));
>>>  290
>>>  291 ret = ocfs2_inode_lock_with_page(inode, NULL, 0, page);  <<== 
>>> this line
>>>  292 if (ret != 0) {
>>>  293 if (ret == AOP_TRUNCATED_PAGE)
>>>  294 unlock = 0;
>>>  295 mlog_errno(ret);
>>>  296 goto out;
>>>  297 } 
>>>  
>>> in dlmglue.c file,
>>> 2 int 

Re: [Ocfs2-devel] [PATCH] ocfs2: dlm: fix deadlock due to nested lock

2015-12-07 Thread Eric Ren
Hi Junxiao,

On Mon, Dec 07, 2015 at 02:44:21PM +0800, Junxiao Bi wrote: 
> Hi Eric,
> 
> On 12/04/2015 06:07 PM, Eric Ren wrote:
> > Hi Junxiao,
> > 
> > The patch is likely unfair to the blocked lock on the remote node (node Y in 
> > your case). The original code let the second request go through only if it was
> > compatible with the level we predict we would downconvert to for node Y.
> > Considering a more extreme situation with even more acquisitions from node
> > X, node Y could starve heavily that way, right?
> With this fix, lock request is not blocked if ex_holders and ro_holders
> not zero. So if new lock request is always coming before old lock is
Yes.
> released, node Y will starve. Could this happen in a real user case?
Actually, we're dealing with a customer complaint about long IO delay peaks
when reading/writing the same file from different nodes. By peak, I mean that
most of the time the IO latency of both reads and writes is acceptable, but
occasionally there is a long delay.

On SLES10, before the dlmglue layer was introduced, the R/W times were very
constant and fair in this case. The throughput looks better now, but the
customer still prefers consistent performance.

I also tested this patch, and it makes the situation worse on my side.
Could you run a test as well, so that we can cross-check each other's results if needed?
> 
> I think there would be a window where the locks are released; at that time,
> node Y could get the lock. Indeed, ping-pong locking between nodes will
> hurt performance, so holding a lock on a node for a short while is good
> for performance.
> > 
> > Just trying to provide some thoughts on this ;-)
> That's good. Thank you for the review.
> > 
> > On Thu, Dec 03, 2015 at 04:50:03PM +0800, Junxiao Bi wrote: 
> >> DLM allows nested cluster locking: one node X can acquire a cluster lock
> >> twice before releasing it. But between these two acquisitions, if another
> >> node Y asks for the same lock and is blocked, a BAST will be sent to
> >> node X and OCFS2_LOCK_BLOCKED will be set in that lock's lockres. In this
> >> case, the second acquisition of that lock on node X will deadlock.
> >> Not blocking nested locking requests fixes this.
> > Could you please describe the deadlock more specifically?
> Process A on node X              Process B on node Y
> lock_XYZ(EX)
>                                  lock_XYZ(EX)
> lock_XYZ(EX)  blocked forever
Thanks! Yeah, it's really bad...
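In code terms, using the real wrappers (the exact call site does not matter, this
is just to spell it out, simplified):

	struct buffer_head *bh = NULL;

	/* process A on node X */
	ocfs2_inode_lock(inode, &bh, 1);	/* first EX acquire, granted */
	/* ... node Y now asks for EX; the BAST sets OCFS2_LOCK_BLOCKED ... */
	ocfs2_inode_lock(inode, NULL, 1);	/* nested acquire: waits for the
						 * downconvert, which in turn waits
						 * for us to drop the first hold,
						 * so it blocks forever */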
> 
> >>
> >> Running the ocfs2-test multiple reflink test can reproduce this on a v4.3
> >> kernel: the whole cluster hangs, with the following call trace.
> > Could the deadlock happen on older kernels as well? I didn't see this
> > issue when testing reflink on multiple nodes on an older kernel.
> We never reproduced this on old kernels. Commit 743b5f1434f5 is the key to
> reproducing this issue, as it takes one cluster lock twice in one process
> before releasing it.
Got it, thanks!

Thanks,
Eric
> 
> Thanks,
> Junxiao.
> > 
> > Thanks,
> > Eric
> >>
> >>  INFO: task multi_reflink_t:10118 blocked for more than 120 seconds.
> >>Tainted: G   OE   4.3.0 #1
> >>  "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> >>  multi_reflink_t D 88003e816980 0 10118  10117 0x0080
> >>   880005b735f8 0082 81a25500 88003e75
> >>   880005b735c8 8117992f ea929f80 88003e816980
> >>   7fff  0001 ea929f80
> >>  Call Trace:
> >>   [] ? find_get_entry+0x2f/0xc0
> >>   [] schedule+0x3e/0x80
> >>   [] schedule_timeout+0x1c8/0x220
> >>   [] ? ocfs2_inode_cache_unlock+0x14/0x20 [ocfs2]
> >>   [] ? ocfs2_metadata_cache_unlock+0x19/0x30 [ocfs2]
> >>   [] ? ocfs2_buffer_cached+0x99/0x170 [ocfs2]
> >>   [] ? ocfs2_inode_cache_unlock+0x14/0x20 [ocfs2]
> >>   [] ? ocfs2_metadata_cache_unlock+0x19/0x30 [ocfs2]
> >>   [] ? 
> >> __raw_callee_save___pv_queued_spin_unlock+0x11/0x20
> >>   [] wait_for_completion+0xde/0x110
> >>   [] ? try_to_wake_up+0x240/0x240
> >>   [] __ocfs2_cluster_lock+0x20d/0x720 [ocfs2]
> >>   [] ? 
> >> __raw_callee_save___pv_queued_spin_unlock+0x11/0x20
> >>   [] ocfs2_inode_lock_full_nested+0x181/0x400 [ocfs2]
> >>   [] ? ocfs2_iop_get_acl+0x53/0x113 [ocfs2]
> >>   [] ? igrab+0x42/0x70
> >>   [] ocfs2_iop_get_acl+0x53/0x113 [ocfs2]
> >>   [] get_acl+0x53/0x70
> >>   [] posix_acl_create+0x73/0x130
> >>   [] ocfs2_mknod+0x7cf/0x1140 [ocfs2]
> >>   [] ocfs2_create+0x62/0x110 [ocfs2]
> >>   [] ? __d_alloc+0x65/0x190
> >>   [] ? __inode_permission+0x4e/0xd0
> >>   [] vfs_create+0xd5/0x100
> >>   [] ? lookup_real+0x1d/0x60
> >>   [] lookup_open+0x173/0x1a0
> >>   [] ? percpu_down_read+0x16/0x70
> >>   [] do_last+0x31a/0x830
> >>   [] ? __inode_permission+0x4e/0xd0
> >>   [] ? inode_permission+0x18/0x50
> >>   [] ? link_path_walk+0x290/0x550
> >>   [] path_openat+0x7c/0x140
> >>   [] do_filp_open+0x85/0xe0
> >>   [] ? getname_flags+0x7f/0x1f0
> >>   [] do_sys_open+0x11a/0x220
> >> 

Re: [Ocfs2-devel] [PATCH] ocfs2: dlm: fix deadlock due to nested lock

2015-12-07 Thread Junxiao Bi
Hi Eric,

On 12/07/2015 05:01 PM, Eric Ren wrote:
> Hi Junxiao,
> 
> On Mon, Dec 07, 2015 at 02:44:21PM +0800, Junxiao Bi wrote: 
>> Hi Eric,
>>
>> On 12/04/2015 06:07 PM, Eric Ren wrote:
>>> Hi Junxiao,
>>>
>>> The patch is likely unfair to the blocked lock on the remote node (node Y in 
>>> your case). The original code let the second request go through only if it was
>>> compatible with the level we predict we would downconvert to for node Y.
>>> Considering a more extreme situation with even more acquisitions from node
>>> X, node Y could starve heavily that way, right?
>> With this fix, lock request is not blocked if ex_holders and ro_holders
>> not zero. So if new lock request is always coming before old lock is
> Yes.
>> released, node Y will starve. Could this happen in a real user case?
> Actually, we're dealing with a customer complaint about long IO delay peaks
> when reading/writing the same file from different nodes. By peak, I mean that
> most of the time the IO latency of both reads and writes is acceptable, but
> occasionally there is a long delay.
> 
> On SLES10, before the dlmglue layer was introduced, the R/W times were very
> constant and fair in this case. The throughput looks better now, but the
> customer still prefers consistent performance.
I think the peak latency may be caused by downconvert. To downconvert an EX
lock, all dirty data needs to be flushed to disk, so the latency depends on
how much dirty data there is.
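Conceptually, before the DLM level can drop from EX the downconvert worker has to
do something like the following (a sketch of the idea, not the exact ocfs2 code
path):

	/* push every dirty page of the file to disk and wait for the IO;
	 * only then is it safe to let the other node's request be granted */
	filemap_write_and_wait(inode->i_mapping);

So a writer that has dirtied a lot of pages directly translates into a long
downconvert, and hence a long wait for the reader on the other node.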

> 
> I also tested this patch, and it makes the situation worse on my side.
> Could you run a test as well, so that we can cross-check each other's results if needed?
I don't have a test case for it. Can you share yours and your perf data?
If there is starvation, I may need to add a pid check to the cluster lock code,
so that only nested locking within one process skips the blocking; see the rough
idea below.
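Purely hypothetical sketch of that pid check (lockres_holder_task() is a made-up
helper standing for whatever would record the task that currently holds the
lockres; this is not a real patch):

	/* only a nested acquire by the task that already holds the lock may
	 * skip the OCFS2_LOCK_BLOCKED wait; every other task still queues
	 * behind the pending downconvert, so node Y is not starved */
	bool nested = (lockres->l_ro_holders || lockres->l_ex_holders) &&
		      lockres_holder_task(lockres) == current;	/* hypothetical */

	if ((lockres->l_flags & OCFS2_LOCK_BLOCKED) && !nested)
		goto wait_for_downconvert;	/* hypothetical label */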

Thanks,
Junxiao.
>>
>> I think there would be a window where the locks are released; at that time,
>> node Y could get the lock. Indeed, ping-pong locking between nodes will
>> hurt performance, so holding a lock on a node for a short while is good
>> for performance.
>>>
>>> Just trying to provide some thoughts on this ;-)
>> That's good. Thank you for the review.
>>>
>>> On Thu, Dec 03, 2015 at 04:50:03PM +0800, Junxiao Bi wrote: 
 DLM allows nested cluster locking: one node X can acquire a cluster lock
 twice before releasing it. But between these two acquisitions, if another
 node Y asks for the same lock and is blocked, a BAST will be sent to
 node X and OCFS2_LOCK_BLOCKED will be set in that lock's lockres. In this
 case, the second acquisition of that lock on node X will deadlock.
 Not blocking nested locking requests fixes this.
>>> Could you please describe the deadlock more specifically?
>> Process A on node X              Process B on node Y
>> lock_XYZ(EX)
>>                                  lock_XYZ(EX)
>> lock_XYZ(EX)  blocked forever
> Thanks! Yeah, it's really bad...
>>

 Running the ocfs2-test multiple reflink test can reproduce this on a v4.3 kernel:
 the whole cluster hangs, with the following call trace.
>>> Could the deadlock happen on older kernels as well? I didn't see this
>>> issue when testing reflink on multiple nodes on an older kernel.
>> We never reproduced this on old kernels. Commit 743b5f1434f5 is the key to
>> reproducing this issue, as it takes one cluster lock twice in one process
>> before releasing it.
> Got it, thanks!
> 
> Thanks,
> Eric
>>
>> Thanks,
>> Junxiao.
>>>
>>> Thanks,
>>> Eric

  INFO: task multi_reflink_t:10118 blocked for more than 120 seconds.
Tainted: G   OE   4.3.0 #1
  "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
  multi_reflink_t D 88003e816980 0 10118  10117 0x0080
   880005b735f8 0082 81a25500 88003e75
   880005b735c8 8117992f ea929f80 88003e816980
   7fff  0001 ea929f80
  Call Trace:
   [] ? find_get_entry+0x2f/0xc0
   [] schedule+0x3e/0x80
   [] schedule_timeout+0x1c8/0x220
   [] ? ocfs2_inode_cache_unlock+0x14/0x20 [ocfs2]
   [] ? ocfs2_metadata_cache_unlock+0x19/0x30 [ocfs2]
   [] ? ocfs2_buffer_cached+0x99/0x170 [ocfs2]
   [] ? ocfs2_inode_cache_unlock+0x14/0x20 [ocfs2]
   [] ? ocfs2_metadata_cache_unlock+0x19/0x30 [ocfs2]
   [] ? 
 __raw_callee_save___pv_queued_spin_unlock+0x11/0x20
   [] wait_for_completion+0xde/0x110
   [] ? try_to_wake_up+0x240/0x240
   [] __ocfs2_cluster_lock+0x20d/0x720 [ocfs2]
   [] ? 
 __raw_callee_save___pv_queued_spin_unlock+0x11/0x20
   [] ocfs2_inode_lock_full_nested+0x181/0x400 [ocfs2]
   [] ? ocfs2_iop_get_acl+0x53/0x113 [ocfs2]
   [] ? igrab+0x42/0x70
   [] ocfs2_iop_get_acl+0x53/0x113 [ocfs2]
   [] get_acl+0x53/0x70
   [] posix_acl_create+0x73/0x130
   [] ocfs2_mknod+0x7cf/0x1140 [ocfs2]
   [] ocfs2_create+0x62/0x110 [ocfs2]
  

[Ocfs2-devel] Buffer read will get starvation in case reading/writing the same file from different nodes concurrently

2015-12-07 Thread Gang He
Hello Guys,

There is an issue from a customer, who is complaining that a buffered read 
sometimes takes too much time (1 - 10 seconds) when reading/writing the same file 
from different nodes concurrently.
Using the customer's demo code, we can also reproduce this issue in house (running 
the test program on a SLES11 SP4 OCFS2 cluster); the issue can also be reproduced 
on openSUSE 13.2 (newer code), but in direct-io mode it disappears.
Based on my investigation, the root cause is that buffered IO uses the cluster 
lock differently from direct IO, and I do not know why buffered IO uses the 
cluster lock this way.
The code details are as below,
in aops.c file,
 281 static int ocfs2_readpage(struct file *file, struct page *page)
 282 {
 283 struct inode *inode = page->mapping->host;
 284 struct ocfs2_inode_info *oi = OCFS2_I(inode);
 285 loff_t start = (loff_t)page->index << PAGE_CACHE_SHIFT;
 286 int ret, unlock = 1;
 287
 288 trace_ocfs2_readpage((unsigned long long)oi->ip_blkno,
 289  (page ? page->index : 0));
 290
 291 ret = ocfs2_inode_lock_with_page(inode, NULL, 0, page);  <<== this line
 292 if (ret != 0) {
 293 if (ret == AOP_TRUNCATED_PAGE)
 294 unlock = 0;
 295 mlog_errno(ret);
 296 goto out;
 297 } 
 
in dlmglue.c file,
2442 int ocfs2_inode_lock_with_page(struct inode *inode,
2443   struct buffer_head **ret_bh,
2444   int ex,
2445   struct page *page)
2446 {
2447 int ret;
2448
2449 ret = ocfs2_inode_lock_full(inode, ret_bh, ex, OCFS2_LOCK_NONBLOCK);
 <<== here: why use NONBLOCK mode to get the cluster lock? This way, read IO can starve.
2450 if (ret == -EAGAIN) {
2451 unlock_page(page);
2452 if (ocfs2_inode_lock(inode, ret_bh, ex) == 0)
2453 ocfs2_inode_unlock(inode, ex);
2454 ret = AOP_TRUNCATED_PAGE;
2455 }
2456
2457 return ret;
2458 }

If you know the background behind this code, please tell us: why not take the lock 
in blocking mode when reading a page, so that read IO gets the page fairly when 
there is concurrent write IO from another node?
Second, I tried to change that line from OCFS2_LOCK_NONBLOCK to 0 (switching to 
blocking mode); read IO is then no longer blocked for long (which would address 
the customer's complaint), but a new problem arises: sometimes the read IO and 
write IO deadlock (why they deadlock, I am still investigating). 


Thanks
Gang  





Re: [Ocfs2-devel] Buffer read will get starvation in case reading/writing the same file from different nodes concurrently

2015-12-07 Thread Joseph Qi
Hi Gang,
Eric and I have discussed this case before.
Using NONBLOCK here is because there is a lock inversion between inode
lock and page lock. You can refer to the comments of
ocfs2_inode_lock_with_page for details.
Actually I have found that NONBLOCK mode is only used in lock inversion
cases.

Thanks,
Joseph

On 2015/12/8 11:21, Gang He wrote:
> Hello Guys,
> 
> There is an issue from a customer, who is complaining that a buffered read 
> sometimes takes too much time (1 - 10 seconds) when reading/writing the same 
> file from different nodes concurrently.
> Using the customer's demo code, we can also reproduce this issue in house 
> (running the test program on a SLES11 SP4 OCFS2 cluster); the issue can also be 
> reproduced on openSUSE 13.2 (newer code), but in direct-io mode it disappears.
> Based on my investigation, the root cause is that buffered IO uses the cluster 
> lock differently from direct IO, and I do not know why buffered IO uses the 
> cluster lock this way.
> The code details are as below,
> in aops.c file,
>  281 static int ocfs2_readpage(struct file *file, struct page *page)
>  282 {
>  283 struct inode *inode = page->mapping->host;
>  284 struct ocfs2_inode_info *oi = OCFS2_I(inode);
>  285 loff_t start = (loff_t)page->index << PAGE_CACHE_SHIFT;
>  286 int ret, unlock = 1;
>  287
>  288 trace_ocfs2_readpage((unsigned long long)oi->ip_blkno,
>  289  (page ? page->index : 0));
>  290
>  291 ret = ocfs2_inode_lock_with_page(inode, NULL, 0, page);  <<== 
> this line
>  292 if (ret != 0) {
>  293 if (ret == AOP_TRUNCATED_PAGE)
>  294 unlock = 0;
>  295 mlog_errno(ret);
>  296 goto out;
>  297 } 
>  
> in dlmglue.c file,
> 2442 int ocfs2_inode_lock_with_page(struct inode *inode,
> 2443   struct buffer_head **ret_bh,
> 2444   int ex,
> 2445   struct page *page)
> 2446 {
> 2447 int ret;
> 2448
> 2449 ret = ocfs2_inode_lock_full(inode, ret_bh, ex, 
> OCFS2_LOCK_NONBLOCK); <<== there, why using NONBLOCK mode to get the cluster 
> lock? this way will let reading IO get starvation. 
> 2450 if (ret == -EAGAIN) {
> 2451 unlock_page(page);
> 2452 if (ocfs2_inode_lock(inode, ret_bh, ex) == 0)
> 2453 ocfs2_inode_unlock(inode, ex);
> 2454 ret = AOP_TRUNCATED_PAGE;
> 2455 }
> 2456
> 2457 return ret;
> 2458 }
> 
> If you know the background behind this code, please tell us: why not take the 
> lock in blocking mode when reading a page, so that read IO gets the page fairly 
> when there is concurrent write IO from another node?
> Second, I tried to change that line from OCFS2_LOCK_NONBLOCK to 0 (switching to 
> blocking mode); read IO is then no longer blocked for long (which would address 
> the customer's complaint), but a new problem arises: sometimes the read IO and 
> write IO deadlock (why they deadlock, I am still investigating). 
> 
> 
> Thanks
> Gang  
> 
> 
> 
> .
> 





Re: [Ocfs2-devel] Buffer read will get starvation in case reading/writing the same file from different nodes concurrently

2015-12-07 Thread Eric Ren
Hi,

On Tue, Dec 08, 2015 at 11:55:18AM +0800, joseph wrote: 
> Hi Gang,
> Eric and I have discussed this case before.
> Using NONBLOCK here is because there is a lock inversion between inode
> lock and page lock. You can refer to the comments of
> ocfs2_inode_lock_with_page for details.
> Actually I have found that NONBLOCK mode is only used in lock inversion
> cases.
Yes, those comments help ;-) Let me try to explain it; please correct me if 
anything is wrong.

On node 1, when calling ocfs2_inode_lock_with_page(), thread A (likely doing the 
read) has locked this page. Before thread A steps into ocfs2_inode_lock_full(), 
thread B on node 2 (doing the write) requests an EX lock on the same inode; since 
lockres->l_level is PR, the lockres needs to be downconverted PR->NL on node 1, 
i.e. ocfs2_blocking_ast()->ocfs2_wake_downconvert_thread().

On node 1, the ocfs2dc thread then proceeds like this:
ocfs2_downconvert_thread()
 ocfs2_downconvert_thread_do_work()
  ocfs2_process_blocked_lock()
   ocfs2_unblock_lock()
lockres->l_ops->downconvert_worker(lockres, blocking)
  ocfs2_data_convert_worker() {
...
3557 if (blocking == DLM_LOCK_EX) {
3558 truncate_inode_pages(mapping, 0);
3559 } else {
3560 /* We only need to wait on the I/O if we're not also
3561  * truncating pages because truncate_inode_pages waits
3562  * for us above. We don't truncate pages if we're
3563  * blocking anything < EXMODE because we want to keep
3564  * them around in that case. */
3565 filemap_fdatawait(mapping);
3566 }
...
  }
We can see truncate_inode_pages()
truncate_inode_pages_range()
  trylock_page()/lock_page()/find_lock_page()

If thread A calls ocfs2_inode_lock_full() in BLOCK mode, there is a window in 
which the OCFS2_LOCK_BLOCKED flag and ->l_blocking=EX have already been set by 
the BAST, i.e. ocfs2_blocking_ast()->ocfs2_generic_handle_bast(), so thread A may 
block.

Now ocfs2dc wants the page lock, but thread A holds the page lock and has been 
blocked because of thread B. Deadlock occurs, right? Again, please correct me if 
anything is wrong.

So the author had to use the NONBLOCK mode here, which creates the livelock 
problem. But I cannot understand very clearly how the livelock comes up. Joseph 
or anyone else, could you please elaborate?

Thanks,
Eric

> 
> Thanks,
> Joseph
> 
> On 2015/12/8 11:21, Gang He wrote:
> > Hello Guys,
> > 
> > There is an issue from a customer, who is complaining that a buffered read 
> > sometimes takes too much time (1 - 10 seconds) when reading/writing the same 
> > file from different nodes concurrently.
> > Using the customer's demo code, we can also reproduce this issue in house 
> > (running the test program on a SLES11 SP4 OCFS2 cluster); the issue can also 
> > be reproduced on openSUSE 13.2 (newer code), but in direct-io mode it 
> > disappears.
> > Based on my investigation, the root cause is that buffered IO uses the 
> > cluster lock differently from direct IO, and I do not know why buffered IO 
> > uses the cluster lock this way.
> > The code details are as below,
> > in aops.c file,
> >  281 static int ocfs2_readpage(struct file *file, struct page *page)
> >  282 {
> >  283 struct inode *inode = page->mapping->host;
> >  284 struct ocfs2_inode_info *oi = OCFS2_I(inode);
> >  285 loff_t start = (loff_t)page->index << PAGE_CACHE_SHIFT;
> >  286 int ret, unlock = 1;
> >  287
> >  288 trace_ocfs2_readpage((unsigned long long)oi->ip_blkno,
> >  289  (page ? page->index : 0));
> >  290
> >  291 ret = ocfs2_inode_lock_with_page(inode, NULL, 0, page);  <<== 
> > this line
> >  292 if (ret != 0) {
> >  293 if (ret == AOP_TRUNCATED_PAGE)
> >  294 unlock = 0;
> >  295 mlog_errno(ret);
> >  296 goto out;
> >  297 } 
> >  
> > in dlmglue.c file,
> > 2442 int ocfs2_inode_lock_with_page(struct inode *inode,
> > 2443   struct buffer_head **ret_bh,
> > 2444   int ex,
> > 2445   struct page *page)
> > 2446 {
> > 2447 int ret;
> > 2448
> > 2449 ret = ocfs2_inode_lock_full(inode, ret_bh, ex, 
> > OCFS2_LOCK_NONBLOCK); <<== there, why using NONBLOCK mode to get the 
> > cluster lock? this way will let reading IO get starvation. 
> > 2450 if (ret == -EAGAIN) {
> > 2451 unlock_page(page);
> > 2452 if (ocfs2_inode_lock(inode, ret_bh, ex) == 0)
> > 2453 ocfs2_inode_unlock(inode, ex);
> > 2454 ret = AOP_TRUNCATED_PAGE;
> > 2455 }
> > 2456
> > 2457 return ret;
> > 2458 }
> > 
> > If you know the background behind the code, please tell 

Re: [Ocfs2-devel] [PATCH V2] jbd2: fix null committed data return in undo_access

2015-12-07 Thread Theodore Ts'o
On Wed, Dec 02, 2015 at 02:54:15PM +0800, Junxiao Bi wrote:
> commit de92c8caf16c ("jbd2: speedup jbd2_journal_get_[write|undo]_access()")
> introduced jbd2_write_access_granted() to improve write|undo_access
> speed, but missed checking the status of b_committed_data, which caused
> a kernel panic on ocfs2.

Thanks, applied.  (And already upstream).
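For the archives: as I read the fix, jbd2_write_access_granted() now bails out of
its fast path when called for undo access while jh->b_committed_data has not been
set up yet, so the slow path copies the committed data as before. Roughly (not the
exact patch text):

	/* undo access needs b_committed_data; if it is not there yet,
	 * fall back to the slow path that allocates and copies it */
	if (undo && !jh->b_committed_data)
		goto out;	/* not granted via the fast path */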

- Ted



Re: [Ocfs2-devel] [PATCH] ocfs2: fix SGID not inherited issue

2015-12-07 Thread Srinivas Eeda
Thanks Junxiao!

Acked-by: Srinivas Eeda 

On 12/06/2015 08:09 PM, Junxiao Bi wrote:
> commit 8f1eb48758aa ("ocfs2: fix umask ignored issue") introduced an issue:
> the SGID bit of a sub-directory was not inherited from its parent directory.
> This is because SGID is set in "inode->i_mode" in ocfs2_get_init_inode(), but
> is later overwritten by "mode", which does not have SGID set.
>
> Fixes: 8f1eb48758aa ("ocfs2: fix umask ignored issue")
> Signed-off-by: Junxiao Bi 
> Cc: 
> ---
>   fs/ocfs2/namei.c |4 +---
>   1 file changed, 1 insertion(+), 3 deletions(-)
>
> diff --git a/fs/ocfs2/namei.c b/fs/ocfs2/namei.c
> index a03f6f4..3123408 100644
> --- a/fs/ocfs2/namei.c
> +++ b/fs/ocfs2/namei.c
> @@ -367,13 +367,11 @@ static int ocfs2_mknod(struct inode *dir,
>   goto leave;
>   }
>   
> - status = posix_acl_create(dir, &mode, &default_acl, &acl);
> + status = posix_acl_create(dir, &inode->i_mode, &default_acl, &acl);
>   if (status) {
>   mlog_errno(status);
>   goto leave;
>   }
> - /* update inode->i_mode after mask with "umask". */
> - inode->i_mode = mode;
>   
>   handle = ocfs2_start_trans(osb, ocfs2_mknod_credits(osb->sb,
>   S_ISDIR(mode),
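For context: the SGID bit mentioned in the changelog comes from inode_init_owner()
(called from ocfs2_get_init_inode()); roughly, as I remember the generic code
(trimmed):

void inode_init_owner(struct inode *inode, const struct inode *dir,
		      umode_t mode)
{
	inode->i_uid = current_fsuid();
	if (dir && dir->i_mode & S_ISGID) {
		inode->i_gid = dir->i_gid;
		/* directories always inherit S_ISGID from the parent */
		if (S_ISDIR(mode))
			mode |= S_ISGID;
	} else
		inode->i_gid = current_fsgid();
	inode->i_mode = mode;	/* this is the i_mode the old code then clobbered */
}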

