Hi Eric,

On 2015/11/12 17:48, Eric Ren wrote:
> Hi Joseph,
>
> On 11/12/15 16:00, Joseph Qi wrote:
>> On 2015/11/12 15:23, Eric Ren wrote:
>>> Hi Joseph,
>>>
>>> Thanks for your reply! There are more details I'd like to ask about ;-)
>>>
>>> On 11/12/15 11:05, Joseph Qi wrote:
>>>> Hi Eric,
>>>> You reported an issue that sometimes the IO response time may be long.
>>>>
>>>> From your test case information, I think it was caused by downconvert.
>>> From what I learned from fs/dlm, the lock manager grants all
>>> down-conversion requests in place, i.e. on the grant queue. Here are
>>> some silly questions:
>>> 1. Who may request a down-conversion?
>>> 2. When does a down-conversion happen?
>>> 3. How could a down-conversion take so long?
>> IMO, it happens mostly in two cases.
>> 1. The owner knows another node is waiting on the lock, in other words,
>> it has blocked another node's request. It may be triggered in ast, bast,
>> or unlock.
>> 2. ocfs2cmt does a periodic commit.
>>
>> One case that can lead to a long downconvert is that it indeed has too
>> much work to do. I am not sure if there are any other cases or a code
>> bug.
> OK, I'm not familiar with ocfs2cmt. Could I bother you to explain what
> ocfs2cmt is used to do, its relation with R/W, and why down-conversion
> can be triggered when it commits?
Sorry, the above explanation is not right and may mislead you.
jbd2/xxx (previously called kjournald2?) does the periodic commit; the
default interval is 5s and can be set with the mount option "commit=".
ocfs2cmt does the checkpoint. It can be woken up by:
a) an unblock lock during downconvert; if jbd2/xxx has already done the
commit, ocfs2cmt won't actually be woken up because it has already been
checkpointed. So ocfs2cmt works together with jbd2/xxx.
b) evicting an inode and then doing the downconvert.
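For reference, the commit interval can be tuned at mount time, e.g.
(illustrative value; device and mount point are placeholders):

  # commit journal data and metadata every 10s instead of the default 5s
  mount -t ocfs2 -o commit=10 /dev/sdb1 /mnt/ocfs2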
>>> Could you describe more about this case?
>>>> And it seemed reasonable because it had to.
>>>>
>>>> Node 1 wrote the file, and node 2 read it. Since you used buffered IO,
>>>> after node 1 had finished writing, the data might still be in the page
>>>> cache.
>>> Sorry, I cannot understand the relationship between "still in page
>>> cache" and "so...downconvert".
>>>> So node 1 should downconvert first, then node 2's read could continue.
>>>> That was why you said it seemed ocfs2_inode_lock_with_page spent most
>>> Actually, what surprises me more is the sheer length of time spent,
>>> rather than it taking the *most* time compared to the "readpage"
>>> stuff ;-)
>>>> of the time. More specifically, it was ocfs2_inode_lock after trying
>>>> the nonblock lock and returning -EAGAIN.
>>> You mean the read process would repeatedly try the nonblock lock until
>>> the write process's down-conversion completes?
>> No, after the nonblock lock returns -EAGAIN, it will unlock the page and
>> then call ocfs2_inode_lock and ocfs2_inode_unlock. And ocfs2_inode_lock
>> will
> Yes.
>> wait until downconvert completes on another node.
> Another node, meaning the node the read or write process is on?
Yes, the node that blocks my request. For example, if node 1 holds EX and
node 2 wants to get PR, node 2 should wait for node 1 to downconvert first.

Thanks,
Joseph

>> This is for a lock inversion case. You can refer to the comments of
>> ocfs2_inode_lock_with_page.
> Yeah, actually I read these comments again and again, but still fail to
> get the idea. Could you please explain how this works? I'm really really
> interested ;-) Forgive me for pasting the code below, to make it
> convenient to refer to.
>
> /*
>  * This is working around a lock inversion between tasks acquiring DLM
>  * locks while holding a page lock and the downconvert thread which
>  * blocks dlm lock acquiry while acquiring page locks.
>  *
>  * ** These _with_page variants are only intended to be called from aop
>  * methods that hold page locks and return a very specific *positive* error
>  * code that aop methods pass up to the VFS -- test for errors with != 0. **
>  *
>  * The DLM is called such that it returns -EAGAIN if it would have
>  * blocked waiting for the downconvert thread. In that case we unlock
>  * our page so the downconvert thread can make progress. Once we've
>  * done this we have to return AOP_TRUNCATED_PAGE so the aop method
>  * that called us can bubble that back up into the VFS who will then
>  * immediately retry the aop call.
>  *
>  * We do a blocking lock and immediate unlock before returning, though, so
>  * that the lock has a great chance of being cached on this node by the time
>  * the VFS calls back to retry the aop. This has a potential to livelock as
>  * nodes ping locks back and forth, but that's a risk we're willing to take
>  * to avoid the lock inversion simply.
>  */
> int ocfs2_inode_lock_with_page(struct inode *inode,
>                                struct buffer_head **ret_bh,
>                                int ex,
>                                struct page *page)
> {
>         int ret;
>
>         ret = ocfs2_inode_lock_full(inode, ret_bh, ex, OCFS2_LOCK_NONBLOCK);
>         if (ret == -EAGAIN) {
>                 unlock_page(page);
>                 if (ocfs2_inode_lock(inode, ret_bh, ex) == 0)
>                         ocfs2_inode_unlock(inode, ex);
>                 ret = AOP_TRUNCATED_PAGE;
>         }
>
>         return ret;
> }
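To see why returning AOP_TRUNCATED_PAGE works, look at it from the
caller's side. Below is a simplified sketch of the VFS read path,
modeled loosely on do_generic_file_read() in mm/filemap.c (names as in
that file; error paths and page-uptodate handling elided), not the
exact kernel code:

        struct page *page;
        int error;

find_page:
        page = find_get_page(mapping, index);
        ...
        error = mapping->a_ops->readpage(filp, page);
        if (error == AOP_TRUNCATED_PAGE) {
                /*
                 * readpage (via ocfs2_inode_lock_with_page) has already
                 * unlocked the page so the downconvert thread can make
                 * progress; drop our reference and retry. By now the
                 * blocking lock/unlock pair has likely left the DLM lock
                 * cached on this node, so the retry usually succeeds.
                 */
                page_cache_release(page);
                goto find_page;
        }

So there is no repeated nonblock polling: the blocking ocfs2_inode_lock
in between does the waiting, and the retry is just one more pass through
the aop method.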
>
> Thanks,
> Eric
>>>> And this also explained why direct IO didn't have the issue, but took
>>>> more time.
>>>>
>>>> I am not sure if your test case is the same as what the customer has
>>>> reported. I think you should recheck the operations on each node.
>>> Yes, we've verified several times, both on SLES 10 and SLES 11. On
>>> SLES 10, each IO time is smooth, with no long IO peaks.
>>>> And we have reported a case before about a DLM handling issue. I am
>>>> not sure if it is related.
>>>> https://oss.oracle.com/pipermail/ocfs2-devel/2015-August/011045.html
>>> Thanks, I've read this post. I cannot see any relation yet. Actually,
>>> fs/dlm also implements it that way; it's the so-called "conversion
>>> deadlock" mentioned in section 2.3.7.3 of the "Programming Locking
>>> Applications" book.
>>>
>>> There are only two processes from two nodes. Process A is blocked on
>>> the wait queue, caused by process B in the convert queue, which leaves
>>> the grant queue empty. Is this possible?
>> So we have to investigate why the convert request cannot be satisfied.
>> If dlm still works fine, it is impossible. Otherwise it is a bug.
>>
>>> You know I'm new here; maybe some questions are improper, please point
>>> out if so ;-)
>>>
>>> Thanks,
>>> Eric

_______________________________________________
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel