[jira] [Commented] (HDFS-15553) Improve NameNode RPC throughput with ReadWriteRpcCallQueue
[ https://issues.apache.org/jira/browse/HDFS-15553?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17404858#comment-17404858 ]

liutongwei commented on HDFS-15553:
---
Hi [~suxingfate], I have implemented a customized ReadWriteRpcScheduler that classifies incoming calls into three kinds: Unknown, Read, and Write. The ReadWriteRpcScheduler leverages the existing FairCallQueue as the backend queue implementation. In my test environment, it turned out that when using the unfair lock, the default LinkedBlockingQueue's QPS is a little better than FairCallQueue with ReadWriteRpcScheduler, and ReadWriteRpcScheduler with FairCallQueue is indeed better than FairCallQueue with DecayRpcScheduler. It seems the unfair lock with the default LinkedBlockingQueue has the best performance. Have you tested ReadWriteRpcCallQueue against the default LinkedBlockingQueue in unfair-lock mode? Can you give me some advice or show me the prototype implementation of ReadWriteRpcCallQueue? Looking forward to your reply.

> Improve NameNode RPC throughput with ReadWriteRpcCallQueue
> ---
>
> Key: HDFS-15553
> URL: https://issues.apache.org/jira/browse/HDFS-15553
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: namenode
> Reporter: Wang, Xinglong
> Priority: Major
>
> *Current*
> In our production cluster, a typical traffic model is a read-to-write ratio of
> 10:1, and sometimes the ratio goes to 30:1.
> NameNode uses ReentrantReadWriteLock under the hood of FSNamesystemLock.
> The read lock is a shared lock while the write lock is an exclusive lock.
> Read RPCs and Write RPCs come to the namenode randomly. This mixes reads and
> writes up, so only a small fraction of reads can really share their read lock.
> Currently we have the default callqueue and faircallqueue, and we can
> refreshCallQueue on the fly. This opens room to design a new call queue.
> *Idea*
> If we reorder the rpc calls in the callqueue to group read rpcs together and
> write rpcs together, we gain some control to let a batch of read rpcs reach
> the handlers together and possibly share the same read lock. Thus we can
> reduce fragmentation of read locks.
> This only improves the chance of sharing the read lock among a batch of read
> rpcs, because some namenode-internal write locking happens outside the call
> queue.
> Under ReentrantReadWriteLock, there is a queue to manage threads asking for
> locks. We can give an example.
> R: stands for read rpc
> W: stands for write rpc
> e.g.
> RWRWRWRWRWRWRWRW
> In this case, we need 16 lock timeslices.
> optimized
> RRRRRRRRWWWWWWWW
> In this case, we only need 9 lock timeslices (one shared slice for all 8
> reads, plus one per write).
> *Correctness*
> Since the execution order of any 2 concurrent or queued rpcs in the namenode
> is not guaranteed, we can reorder the rpcs in the callqueue into a read group
> and a write group, and then dequeue from these 2 queues by a designed
> strategy: let's say dequeue 100 reads, then 5 writes, then reads again, then
> writes again.
> Since FairCallQueue also reorders rpc calls in the callqueue, for this part I
> think they share the same logic to guarantee rpc result correctness.
> *Performance*
> In a test environment, we see a 15% - 20% NameNode RPC throughput improvement
> compared with the default callqueue.
> Test traffic is 30 read : 3 write : 1 list using NNLoadGeneratorMR.
> This performance is not a surprise. Because some write rpcs are not managed
> in the callqueue, we can't reorder them by reordering calls in the callqueue.
> But we could still do a full read/write reorder if we redesign
> ReentrantReadWriteLock to achieve this. That will be a further step after
> this one.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
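The prototype discussed above is not attached to the jira, so the following is only a hypothetical sketch of the three-way classification liutongwei describes (class and method names are invented; the real ClientProtocol has many more methods):

```java
import java.util.Set;

/**
 * Hypothetical sketch: classify an incoming RPC by method name into
 * READ, WRITE, or UNKNOWN, as a ReadWriteRpcScheduler might.
 */
public class ReadWriteClassifier {
    public enum Kind { READ, WRITE, UNKNOWN }

    // Illustrative subsets only; a real scheduler would cover the full protocol.
    private static final Set<String> READS =
        Set.of("getFileInfo", "getListing", "getBlockLocations");
    private static final Set<String> WRITES =
        Set.of("create", "mkdirs", "rename", "delete", "setPermission");

    public static Kind classify(String method) {
        if (READS.contains(method)) {
            return Kind.READ;
        }
        if (WRITES.contains(method)) {
            return Kind.WRITE;
        }
        // e.g. addBlock, which takes the read lock and later the write lock
        return Kind.UNKNOWN;
    }
}
```

A custom call queue or scheduler is normally wired in through the `ipc.<port>.callqueue.impl` and `ipc.<port>.scheduler.impl` keys and can be swapped at runtime with `hdfs dfsadmin -refreshCallQueue`, which is what makes experiments like this feasible on a live NameNode.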
[jira] [Commented] (HDFS-15553) Improve NameNode RPC throughput with ReadWriteRpcCallQueue
[ https://issues.apache.org/jira/browse/HDFS-15553?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17218440#comment-17218440 ]

Xiaoyu Yao commented on HDFS-15553:
---
bq. Do you see the queue overflow message in the NN log? The fixed queue size of 1024 may not be ideal.

[~kihwal] I've seen the block report queue (default = 1024) get full frequently on larger clusters with thousands of datanodes. This can lead to all IBRs blocking on the queue add, which can exhaust the service RPC handler threads and make DNs lose heartbeats. In addition to increasing dfs.namenode.blockreport.queue.size, adjusting dfs.blockreport.incremental.intervalMsec and dfs.blockreport.split.threshold should also help.

With regard to increasing dfs.namenode.blockreport.queue.size, can you share some details, such as the scale of the cluster and the value being used/tested? Thanks!
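The three knobs mentioned above live in hdfs-site.xml. The values below are purely illustrative, not recommendations; appropriate settings depend on cluster size and block counts:

```xml
<!-- Illustrative values only; tune to your cluster. -->
<property>
  <name>dfs.namenode.blockreport.queue.size</name>
  <value>4096</value> <!-- default 1024; a larger queue absorbs FBR/IBR bursts -->
</property>
<property>
  <name>dfs.blockreport.incremental.intervalMsec</name>
  <value>300</value> <!-- default 0 (send immediately); >0 batches IBRs -->
</property>
<property>
  <name>dfs.blockreport.split.threshold</name>
  <value>1000000</value> <!-- above this block count, send per-volume reports -->
</property>
```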
[jira] [Commented] (HDFS-15553) Improve NameNode RPC throughput with ReadWriteRpcCallQueue
[ https://issues.apache.org/jira/browse/HDFS-15553?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17189294#comment-17189294 ]

Daryn Sharp commented on HDFS-15553:
---
I've only skimmed the jira. The basic premise that re-ordering will allow more reader concurrency is flawed if the fsn lock is configured to be unfair – which I believe is the default. The last time I studied r/w lock internals: all queued threads waiting for the read lock are woken when one thread acquires the read lock, or a new reader acquires it immediately if another thread is currently holding the read lock. I.e. RWRWRWRWRWRWRWRW is already effectively optimized by the unfair lock to RRRRRRRRWWWWWWWW.
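Daryn's observation about reader sharing can be seen directly with java.util.concurrent.locks.ReentrantReadWriteLock. The small demo below (not NameNode code) shows that with the unfair (default) lock mode, many threads hold the read lock at the same moment:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class SharedReadDemo {
    /** Returns the max number of threads observed holding the read lock at once. */
    public static int maxConcurrentReaders(int n) throws InterruptedException {
        ReentrantReadWriteLock lock = new ReentrantReadWriteLock(false); // unfair (default)
        CountDownLatch allIn = new CountDownLatch(n);
        AtomicInteger observed = new AtomicInteger();
        Thread[] readers = new Thread[n];
        for (int i = 0; i < n; i++) {
            readers[i] = new Thread(() -> {
                lock.readLock().lock();
                try {
                    allIn.countDown();
                    allIn.await(); // releases only once all n threads hold the read lock
                    observed.accumulateAndGet(lock.getReadLockCount(), Math::max);
                } catch (InterruptedException ignored) {
                } finally {
                    lock.readLock().unlock();
                }
            });
            readers[i].start();
        }
        for (Thread t : readers) {
            t.join();
        }
        return observed.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(maxConcurrentReaders(8)); // prints 8: all reads shared the lock
    }
}
```

With an exclusive lock the same eight reads would need eight timeslices; here they collapse into one, which is the effect the call-queue reordering (or the unfair lock itself, per Daryn) is after.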
[jira] [Commented] (HDFS-15553) Improve NameNode RPC throughput with ReadWriteRpcCallQueue
[ https://issues.apache.org/jira/browse/HDFS-15553?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17188628#comment-17188628 ]

Wang, Xinglong commented on HDFS-15553:
---
[~kihwal] Thank you for the comment. It is really helpful.
{quote}Also there are write calls that initially acquire the read lock then reacquire the write lock.{quote}
Yes, I do notice this; the addBlock rpc is such an example that takes the read lock and then the write lock. It's a trouble maker. Currently I think the read-lock part can be shared with other read rpcs, so I put it in my read rpc queue.

For the internal rpc calls, I agree with you. E.g., an IBR may have an impact on a complete rpc call's result if there is a reorder. For other scenarios, I don't have a concrete case for now.
{quote}are you using async edit logging and audit logging?{quote}
Yes, we are using both, and we have some enhancements for the async edit logging part. I can share those in another thread.
{quote}HDFS-9198{quote}
This patch is also in our production env. As for the 1024 queue size, I don't have data right now; let me grab some logs and get back. For lock combination, the batch size can't be too small, nor can it be too big. That is the tricky part. :)
{quote}Key to the success of this approach would depend on how smart the dynamic read/write allocation mechanism works. This may be less critical if the workload pattern is easily predictable or slowly changing, or if you want to enforce a certain ratio or priority between reads and writes. In environments where the workload is highly varied, there might be difficulty utilizing this in its fullest extent.{quote}
Yes. I can share my version of the dequeue strategy later by uploading some code as well as some benchmarks. It is much like the FairCallQueue strategy: a weight-based rpc scheduler.

[~hexiaoqiao] One of our clusters mostly has 10:1 read-to-write traffic. This approach will indeed not bring much benefit for cases with too many reads or too many writes; it's designed for the moderate case.
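Wang's weight-based dequeue strategy has not been uploaded yet, so the following is only one plausible sketch under stated assumptions (all names hypothetical): split a dequeue batch between the read and write queues in proportion to their current backlogs, so that neither starves:

```java
/** Hypothetical sketch of a length-weighted read/write dequeue split. */
public class WeightedDequeue {
    /**
     * Decide how many calls of a batch to drain from the read queue before
     * switching to the write queue, proportional to current queue lengths.
     */
    public static int readBatch(int readLen, int writeLen, int batch) {
        if (readLen + writeLen == 0) {
            return 0; // nothing queued on either side
        }
        int share = (int) ((long) batch * readLen / (readLen + writeLen));
        // Never starve a non-empty queue: each side gets at least one slot.
        if (readLen > 0 && share == 0) {
            share = 1;
        }
        if (writeLen > 0 && share == batch) {
            share = batch - 1;
        }
        return share;
    }
}
```

With 100 reads and 5 writes queued and a batch of 21, this yields 20 reads then 1 write, roughly the "dequeue 100 reads then 5 writes" cadence from the description, but adapting automatically as the ratio drifts.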
[jira] [Commented] (HDFS-15553) Improve NameNode RPC throughput with ReadWriteRpcCallQueue
[ https://issues.apache.org/jira/browse/HDFS-15553?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17188580#comment-17188580 ]

Kihwal Lee commented on HDFS-15553:
---
It is fine to reorder user requests in general. As [~suxingfate] described, state changes by clients are synchronous, and clients are guaranteed to see a state change only after the state-changing write call returns. Read or write calls issued while the state-changing write call is outstanding may or may not see the state update. Reordering write requests by users is also fine.

However, some of the internal RPC calls from datanodes are not safe to reorder. Outstanding calls (IBRs, FBRs, etc.) from the same source may have implicit distributed dependencies. Some are also internally semi-synchronous to users' state-changing requests. Over the years, some of them have been made less critical to timing and ordering, but there still are conditions that can cause issues. We could call that bad design/assumptions, but it was a design decision made for the balance between consistency, durability and performance of that time. But we can always revisit and improve things when old assumptions no longer hold. Also, there are write calls that initially acquire the read lock and then reacquire the write lock.

It should be safe to simply reorder user requests for read/write lock-combining purposes. Key to the success of this approach is how smart the dynamic read/write allocation mechanism is. This may be less critical if the workload pattern is easily predictable or slowly changing, or if you want to enforce a certain ratio or priority between reads and writes. In environments where the workload is highly varied, it may be difficult to utilize this to its fullest extent.

Just out of curiosity, are you using async edit logging and audit logging? Some write combining is done in HDFS-9198 for the incremental block reports. Do you see the queue overflow message in the NN log? The fixed queue size of 1024 may not be ideal.
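The "read lock then write lock" calls Kihwal mentions matter because ReentrantReadWriteLock does not allow upgrading a held read lock to a write lock: the read lock must be released first, and the state re-checked once the write lock is held. A generic sketch of that pattern (not the FSNamesystem code):

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class UpgradeSketch {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    private int value = 0;

    /** Raise value to at least floor, taking the write lock only when needed. */
    public int ensureAtLeast(int floor) {
        lock.readLock().lock();
        try {
            if (value >= floor) {
                return value; // fast path under the shared lock
            }
        } finally {
            lock.readLock().unlock(); // cannot upgrade in place; release first
        }
        lock.writeLock().lock();
        try {
            // Another thread may have changed state between the locks: re-check.
            if (value < floor) {
                value = floor;
            }
            return value;
        } finally {
            lock.writeLock().unlock();
        }
    }
}
```

The gap between releasing the read lock and acquiring the write lock is exactly the window in which a reordered call queue (or any other thread) can slip in, which is why such calls are awkward to classify as purely "read" or "write".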
[jira] [Commented] (HDFS-15553) Improve NameNode RPC throughput with ReadWriteRpcCallQueue
[ https://issues.apache.org/jira/browse/HDFS-15553?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17188476#comment-17188476 ]

Xiaoqiao He commented on HDFS-15553:
---
Thanks [~suxingfate] for your quick response. It makes sense to me. Would you like to attach a design doc for your proposal (it would be better if it includes a POC and benchmark results)? I believe that will make it easier for other reviewers to discuss deeply and push this forward. Thanks.
[jira] [Commented] (HDFS-15553) Improve NameNode RPC throughput with ReadWriteRpcCallQueue
[ https://issues.apache.org/jira/browse/HDFS-15553?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17188442#comment-17188442 ]

Wang, Xinglong commented on HDFS-15553:
---
Thank you for the comment, [~hexiaoqiao]. I think even in the current code base, we can't make sure Client B will always see the result of Client A. I will give some counter-cases against that scenario:

1. In the FairCallQueue case, if Client A is downgraded to a low-priority queue, then its rpc may be executed after Client B's rpc.
2. In unfair-lock mode, a later rpc may be executed before an earlier rpc due to the unfair mechanism.
3. Due to network latency, if T1 and T2 are client-side timestamps, then the order of Client A's rpc and Client B's rpc is not guaranteed on the server side.
4. Say Client A and Client B are in the same JVM. Usually Client A will wait until it gets the response from the NN and then perform other actions, like notifying others to do things. There needs to be synchronization between Client A and Client B if there is a dependency. If the user just spawns Client A first and Client B later with no synchronization between them, the outcome can go many ways; we can't say for sure it will have a guaranteed result.
5. If Client A and Client B are in different JVMs, I think it's even harder to guarantee the result based only on rpc send time.

For the dequeue strategy, I currently implemented a weight-based dynamic dequeue strategy that decides how many rpcs should be dequeued from the read callqueue and the write callqueue based on the current read-queue and write-queue lengths, to make sure we consume both queues and prevent either queue from starving.
[jira] [Commented] (HDFS-15553) Improve NameNode RPC throughput with ReadWriteRpcCallQueue
[ https://issues.apache.org/jira/browse/HDFS-15553?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17188422#comment-17188422 ]

Xiaoqiao He commented on HDFS-15553:
---
Thanks [~suxingfate] for your interesting proposal. My initial feeling is that it is useful in most cases, but some particular scenarios may not be very proper. For example: a) Client A sends a Write request at T1 which creates file foo, b) while Client B sends a Read request at T2 which gets the filestatus of file foo. Considering T1 < T2, if the calls are reordered, Client B's read may be served before Client A's write and not see file foo.
[jira] [Commented] (HDFS-15553) Improve NameNode RPC throughput with ReadWriteRpcCallQueue
[ https://issues.apache.org/jira/browse/HDFS-15553?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17188406#comment-17188406 ]

Xiaoqiao He commented on HDFS-15553:
---
It seems to improve NameNode performance based on the proposal. Moving to the HDFS sub-project.