Hi all,

On Thu, Dec 17, 2015 at 08:08:42AM -0700, He Gang wrote: 
> Hello Mark and all,
> In the past few days, Eric and I have been looking at a customer issue: the
> customer complains that a buffered read sometimes takes far too long (1 - 10
> seconds) when the same file is read and written concurrently from different
> nodes. A few days ago I sent a mail to the list for discussion; you can read
> the details at
> https://oss.oracle.com/pipermail/ocfs2-devel/2015-December/011389.html.
> The problem does not happen under SLES10 (SP1 - SP4); it appeared after the
> customer upgraded to SLES11 (SP3 or SP4), which is why he is complaining. He
> hopes we can investigate how to make OCFS2 buffered read/write behavior
> consistent with SLES10.
> From our code review and some testing, we found the root cause that lets
> buffered reads starve.
> The suspicious code is in aops.c:
> static int ocfs2_readpage(struct file *file, struct page *page)
> {
> 	struct inode *inode = page->mapping->host;
> 	struct ocfs2_inode_info *oi = OCFS2_I(inode);
> 	loff_t start = (loff_t)page->index << PAGE_CACHE_SHIFT;
> 	int ret, unlock = 1;
> 	long delta;
> 	struct timespec t_enter, t_mid1, t_mid2, t_exit;
> 
> 	trace_ocfs2_readpage((unsigned long long)oi->ip_blkno,
> 			     (page ? page->index : 0));
> 
> 	/* <<= here: taking the lock in a nonblocking way causes many
> 	 *     retries and costs too much time */
> 	ret = ocfs2_inode_lock_with_page(inode, NULL, 0, page);
> 	if (ret != 0) {
> 		if (ret == AOP_TRUNCATED_PAGE)
> 			unlock = 0;
> 		mlog_errno(ret);
> 		goto out;
> 	}
> 
> 	/* <<= here: the same problem as above */
> 	if (down_read_trylock(&oi->ip_alloc_sem) == 0) {
> 		/*
> 		 * Unlock the page and cycle ip_alloc_sem so that we don't
> 		 * busyloop waiting for ip_alloc_sem to unlock
> 		 */
> 		ret = AOP_TRUNCATED_PAGE;
> 		unlock_page(page);
> 		unlock = 0;
> 		down_read(&oi->ip_alloc_sem);
> 		up_read(&oi->ip_alloc_sem);
> 		goto out_inode_unlock;
> 	}
> 
> 
> As you can see, taking the lock in a nonblocking way causes many retries
> and costs too much time.
> We can't change the code to take the lock in a blocking way, as that would
> introduce a deadlock.
> In fact, we did some testing with the lock taken in a blocking way here,
> and we ran into the deadlocks.
> But the SLES10 source code never takes a lock in a nonblocking way in the
> buffered read/write path, which is why buffered reads and writes get IO
> very fairly when the same file is read and written from multiple nodes.
SLES10, with a kernel version around 2.6.16.x, used the blocking way, i.e.
down_read(), which has a potential deadlock between the page lock and
ip_alloc_sem when one node gets the cluster lock and does both writing and
reading on the same file. This deadlock was fixed by this commit:
---
commit e9dfc0b2bc42761410e8db6c252c6c5889e178b8
Author: Mark Fasheh <mark.fas...@oracle.com>
Date:   Mon May 14 11:38:51 2007 -0700

    ocfs2: trylock in ocfs2_readpage()

    Similarly to the page lock / cluster lock inversion in ocfs2_readpage, we
    can deadlock on ip_alloc_sem. We can down_read_trylock() instead and just
    return AOP_TRUNCATED_PAGE if the operation fails.

    Signed-off-by: Mark Fasheh <mark.fas...@oracle.com>

diff --git a/fs/ocfs2/aops.c b/fs/ocfs2/aops.c
index 8e7cafb..3030670 100644
--- a/fs/ocfs2/aops.c
+++ b/fs/ocfs2/aops.c
@@ -222,7 +222,10 @@ static int ocfs2_readpage(struct file *file, struct page *page)
                goto out;
        }

-       down_read(&OCFS2_I(inode)->ip_alloc_sem);
+       if (down_read_trylock(&OCFS2_I(inode)->ip_alloc_sem) == 0) {
+               ret = AOP_TRUNCATED_PAGE;
+               goto out_meta_unlock;
+       }

        /*
         * i_size might have just been updated as we grabed the meta lock.  We
@@ -258,6 +261,7 @@ static int ocfs2_readpage(struct file *file, struct page *page)
        ocfs2_data_unlock(inode, 0);
 out_alloc:
        up_read(&OCFS2_I(inode)->ip_alloc_sem);
+out_meta_unlock:
        ocfs2_meta_unlock(inode, 0);
 out:
        if (unlock)
---
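
To make the inversion concrete, here is a small userspace analogue of the
pattern the commit introduces (a sketch only: pthread locks stand in for the
page lock and ip_alloc_sem, and the reader/writer bodies are hypothetical,
not OCFS2 code):

---
/* Userspace analogue (hypothetical): the reader takes "page_lock" then
 * "alloc_sem", the writer takes them in the opposite order.  Taking
 * alloc_sem with a trylock and backing off (releasing page_lock first)
 * is what breaks the AB-BA cycle. */
#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t page_lock = PTHREAD_MUTEX_INITIALIZER;   /* "page lock" */
static pthread_rwlock_t alloc_sem = PTHREAD_RWLOCK_INITIALIZER; /* "ip_alloc_sem" */

/* Reader: holds page_lock, must not block on alloc_sem while doing so. */
static void *reader(void *arg)
{
	(void)arg;
	for (;;) {
		pthread_mutex_lock(&page_lock);
		if (pthread_rwlock_tryrdlock(&alloc_sem) != 0) {
			/* Contended: back off, like AOP_TRUNCATED_PAGE. */
			pthread_mutex_unlock(&page_lock);
			/* Wait for the writer without holding page_lock,
			 * so we don't busyloop on the retry. */
			pthread_rwlock_rdlock(&alloc_sem);
			pthread_rwlock_unlock(&alloc_sem);
			continue;	/* retry from the top */
		}
		/* ... read the page ... */
		pthread_rwlock_unlock(&alloc_sem);
		pthread_mutex_unlock(&page_lock);
		return NULL;
	}
}

/* Writer: takes alloc_sem first, then the page lock (opposite order). */
static void *writer(void *arg)
{
	(void)arg;
	pthread_rwlock_wrlock(&alloc_sem);
	pthread_mutex_lock(&page_lock);
	/* ... change the allocation, touch the page ... */
	pthread_mutex_unlock(&page_lock);
	pthread_rwlock_unlock(&alloc_sem);
	return NULL;
}

int main(void)
{
	pthread_t r, w;

	pthread_create(&r, NULL, reader, NULL);
	pthread_create(&w, NULL, writer, NULL);
	pthread_join(r, NULL);
	pthread_join(w, NULL);
	puts("no deadlock");
	return 0;
}
---

If the reader took a plain rdlock while still holding page_lock, the two
threads could deadlock whenever the writer already holds the write lock; the
trylock plus back-off avoids that, at the cost of a full retry.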

But somehow, with this patch, performance in this scenario becomes very bad.
I don't see how this could happen, because the reading node has only one
thread reading the shared file, so down_read_trylock() should always get
ip_alloc_sem successfully, right? If not, who else might be racing for
ip_alloc_sem?
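
Part of the cost, as far as I can tell, is that each failed attempt is not a
cheap spin: when ->readpage returns AOP_TRUNCATED_PAGE, the generic read loop
throws the page away and restarts from the very top, which for OCFS2 includes
another nonblocking cluster lock attempt. Roughly (a simplified sketch from
memory of the mm/filemap.c read loop of that era, not the exact code):

---
for (;;) {
	/* look the page up (again) in the page cache */
	page = find_get_page(mapping, index);
	/* ... allocate and lock the page as needed ... */
	error = mapping->a_ops->readpage(filp, page);
	if (error == AOP_TRUNCATED_PAGE) {
		/* readpage already dropped the page lock; release our
		 * reference and restart the whole lookup */
		page_cache_release(page);
		continue;
	}
	/* ... wait for the read to finish, copy to userspace ... */
	break;
}
---

So a single contended read can go around this loop many times, and each lap
pays for another cluster lock attempt as well.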

Thanks,
Eric
> Why do the deadlocks happen on SLES11? If you look at the source code,
> there are some code changes, especially around the inode alloc_sem lock.
> On SLES11, taking the inode alloc_sem lock was moved into ocfs2_readpage
> and ocfs2_write_begin. Why do we need to do that? It introduces a deadlock
> factor, and to avoid the deadlocks we have to take some locks in
> ocfs2_readpage in a nonblocking way, with the result that buffered reads
> get IO unfairly. On top of that, to avoid a CPU busy loop, code was added
> to take the lock in a blocking way when the nonblocking attempt fails,
> which also wastes a lot of time.
> Finally, I want to discuss with you guys how to fix this issue. Could we
> move the inode alloc_sem lock back to
> ocfs2_file_aio_read/ocfs2_file_aio_write?
> We could take the inode alloc_sem lock before calling into
> ocfs2_readpage/ocfs2_write_begin, just like SLES10.
> Since I don't have enough background on these code changes, I hope you can
> give some comments.
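
If I read the proposal right, the ordering change would look roughly like the
sketch below (hypothetical and untested; it ignores the other locks
ocfs2_file_aio_read actually takes, and the real change would have to audit
every other path into ocfs2_readpage, e.g. mmap faults, so that ip_alloc_sem
is always held consistently):

---
/* Hypothetical sketch of the proposed ordering, untested: take
 * ip_alloc_sem in the file read path, before any page lock, so that
 * ocfs2_readpage no longer needs the trylock/AOP_TRUNCATED_PAGE dance. */
static ssize_t ocfs2_file_aio_read(struct kiocb *iocb,
				   const struct iovec *iov,
				   unsigned long nr_segs, loff_t pos)
{
	struct inode *inode = iocb->ki_filp->f_path.dentry->d_inode;
	struct ocfs2_inode_info *oi = OCFS2_I(inode);
	ssize_t ret;

	down_read(&oi->ip_alloc_sem);	/* held across the whole read */
	ret = generic_file_aio_read(iocb, iov, nr_segs, pos);
	up_read(&oi->ip_alloc_sem);

	return ret;
}
---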
> 
> Thanks a lot.
> Gang
> 

_______________________________________________
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel
