[jira] [Commented] (HDFS-15553) Improve NameNode RPC throughput with ReadWriteRpcCallQueue

Kihwal Lee (Jira) Tue, 01 Sep 2020 08:47:06 -0700


    [ 
https://issues.apache.org/jira/browse/HDFS-15553?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17188580#comment-17188580
 ]


Kihwal Lee commented on HDFS-15553:
-----------------------------------

It is fine to reorder user requests in general. As [~suxingfate] described, 
state changes by clients are synchronous, and clients are guaranteed to see 
a state change only after the state-changing write call returns. Read or write 
calls issued while the state-changing write call is outstanding may or may not 
see the state update. Reordering write requests from users is also fine. 
However, some of the internal RPC calls from datanodes are not safe to reorder. 
Outstanding calls (IBRs, FBRs, etc.) from the same source may have implicit 
distributed dependencies. Some are also internally semi-synchronous to users' 
state-changing requests. Over the years, some of them have been made less 
sensitive to timing and ordering, but there are still conditions that can cause 
issues. We could call that bad design or bad assumptions, but it was a design 
decision made to balance consistency, durability and performance at that 
time. We can always revisit and improve things when old assumptions no longer 
hold. Also, there are write calls that initially acquire the read lock and 
then reacquire the write lock.

It should be safe to simply reorder user requests for read/write lock combining 
purposes.
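
To make the distinction concrete, here is a minimal sketch of the enqueue 
side, assuming each call can be tagged with its origin. All names 
(ReorderSketch, Source, Kind, Call) are hypothetical and are not Hadoop 
classes. User reads and writes go to separate lanes where they may be 
regrouped, while datanode calls stay in a single FIFO lane in arrival order.

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Hypothetical sketch; none of these names come from Hadoop.
class ReorderSketch {
    enum Source { USER, DATANODE }
    enum Kind { READ, WRITE }

    static final class Call {
        final Source source;
        final Kind kind;
        final String name;

        Call(Source source, Kind kind, String name) {
            this.source = source;
            this.kind = kind;
            this.name = name;
        }
    }

    // User calls are split by lock type so each lane can be drained in batches.
    final Queue<Call> userReads = new ArrayDeque<>();
    final Queue<Call> userWrites = new ArrayDeque<>();
    // Datanode calls (IBRs, FBRs, etc.) keep their arrival order.
    final Queue<Call> ordered = new ArrayDeque<>();

    void enqueue(Call c) {
        if (c.source == Source.DATANODE) {
            ordered.add(c);          // never regrouped
        } else if (c.kind == Kind.READ) {
            userReads.add(c);
        } else {
            userWrites.add(c);
        }
    }
}
```

This is single-threaded and leaves out the dequeue policy. It also does not 
handle write calls that start under the read lock, which would need their 
own classification.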

The success of this approach will depend on how well the dynamic 
read/write allocation mechanism works. This may be less critical if the 
workload pattern is easily predictable or slowly changing, or if you want to 
enforce a certain ratio or priority between reads and writes. In environments 
where the workload is highly varied, it may be difficult to use this to its 
fullest extent.
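
For the simpler case of enforcing a fixed ratio, a dequeue policy might look 
like the sketch below. It is not the patch's implementation. 
RatioDequeueSketch and its parameters readBatch and writeBatch are 
hypothetical names. It serves up to readBatch reads, then up to writeBatch 
writes, and switches early when the current lane is empty.

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Hypothetical fixed-ratio dequeuer; single-threaded, for illustration only.
class RatioDequeueSketch<T> {
    private final Queue<T> reads = new ArrayDeque<>();
    private final Queue<T> writes = new ArrayDeque<>();
    private final int readBatch;
    private final int writeBatch;
    private boolean servingReads = true;
    private int servedInBatch = 0;

    RatioDequeueSketch(int readBatch, int writeBatch) {
        if (readBatch < 1 || writeBatch < 1) {
            throw new IllegalArgumentException("batch sizes must be >= 1");
        }
        this.readBatch = readBatch;
        this.writeBatch = writeBatch;
    }

    void addRead(T call) { reads.add(call); }
    void addWrite(T call) { writes.add(call); }

    // Returns the next call to hand to a handler, or null if both lanes are empty.
    T poll() {
        if (reads.isEmpty() && writes.isEmpty()) {
            return null;
        }
        Queue<T> current = servingReads ? reads : writes;
        int limit = servingReads ? readBatch : writeBatch;
        if (current.isEmpty() || servedInBatch >= limit) {
            // Batch finished or lane drained: switch to the other lane.
            servingReads = !servingReads;
            servedInBatch = 0;
            current = servingReads ? reads : writes;
            if (current.isEmpty()) {
                // Other lane has nothing; start a fresh batch on the original lane.
                servingReads = !servingReads;
                current = servingReads ? reads : writes;
            }
        }
        servedInBatch++;
        return current.poll();
    }
}
```

A dynamic mechanism would adjust readBatch and writeBatch from observed 
traffic. The sketch keeps them fixed, which is the case where allocation 
quality matters least.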

Just out of curiosity, are you using async edit logging and audit logging? Some 
write combining was done in HDFS-9198 for the incremental block reports. Do 
you see the queue overflow message in the NN log? The fixed queue size of 1024 
may not be ideal.

> Improve NameNode RPC throughput with ReadWriteRpcCallQueue 
> -----------------------------------------------------------
>
>                 Key: HDFS-15553
>                 URL: https://issues.apache.org/jira/browse/HDFS-15553
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: namenode
>            Reporter: Wang, Xinglong
>            Priority: Major
>
> *Current*
>  In our production cluster, the typical read-to-write ratio is 
> 10:1, and sometimes the ratio goes to 30:1.
>  NameNode uses ReentrantReadWriteLock under the hood of FSNamesystemLock. 
> The read lock is a shared lock, while the write lock is an exclusive lock.
> Read RPCs and write RPCs arrive at the namenode in random order, so reads 
> and writes are mixed up, and only a small fraction of reads can actually 
> share their read lock.
> Currently we have the default callqueue and FairCallQueue, and we can 
> refreshCallQueue on the fly. This leaves room to design a new call queue.
> *Idea*
>  If we reorder the RPC calls in the callqueue to group read RPCs together 
> and write RPCs together, we gain some control to let a batch of read RPCs 
> reach the handlers together and possibly share the same read lock. Thus we 
> can reduce fragmentation of read locks.
>  This only improves the chance of sharing the read lock among a batch of 
> read RPCs, because some namenode-internal write lock requests do not go 
> through the call queue.
> Under ReentrantReadWriteLock, there is a queue that manages threads asking 
> for locks. Here is an example.
>  R: stands for read rpc
>  W: stands for write rpc
>  e.g.
>  RRRRWRRRRWRRRRWRRRRWRRRRWRRRRWRRRRWRRRRW
>  In this case, we need 16 lock timeslices.
> Optimized:
>  RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRWWWWWWWW
>  In this case, we only need 9 lock timeslices.
> *Correctness*
>  Since the execution order of any two concurrent or queued RPCs in the 
> namenode is not guaranteed, we can reorder the RPCs in the callqueue into a 
> read group and a write group, and then dequeue from these two queues using a 
> designed strategy: for example, dequeue 100 reads, then 5 writes, then reads 
> again, then writes again.
>  Since FairCallQueue also reorders RPC calls in the callqueue, I think the 
> same logic guarantees the correctness of RPC results here.
> *Performance*
>  In a test environment, we see a 15% - 20% NameNode RPC throughput 
> improvement compared with the default callqueue. 
>  Test traffic is 30 read : 3 write : 1 list using NNLoadGeneratorMR.
> This performance is not a surprise. Because some write RPCs are not managed 
> in the callqueue, we can't reorder them by reordering calls in the callqueue. 
>  But we could still do a full read/write reorder if we redesign 
> ReentrantReadWriteLock to achieve this. That would be a further step after 
> this.
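
The timeslice counts in the quoted example can be checked with a small 
helper. It assumes an idealized model: every maximal run of consecutive 
reads shares one read-lock timeslice, and every write takes one exclusive 
timeslice. LockTimesliceSketch is a hypothetical name, and the model ignores 
handler counts and lock fairness.

```java
// Hypothetical helper for the R/W example; idealized model, not a measurement.
class LockTimesliceSketch {
    // Counts timeslices for a sequence such as "RRRRW": one per run of
    // consecutive 'R' calls plus one per 'W' call.
    static int timeslices(String calls) {
        int slices = 0;
        boolean inReadRun = false;
        for (char c : calls.toCharArray()) {
            if (c == 'R') {
                if (!inReadRun) {
                    slices++;        // a new shared read timeslice starts
                    inReadRun = true;
                }
            } else if (c == 'W') {
                slices++;            // each write holds the lock exclusively
                inReadRun = false;
            } else {
                throw new IllegalArgumentException("unexpected call type: " + c);
            }
        }
        return slices;
    }
}
```

Under this model, eight repetitions of RRRRW give 16 timeslices, and 32 
reads followed by 8 writes give 9, matching the numbers in the description.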



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
