[jira] [Updated] (HDFS-11384) Add option for balancer to disperse getBlocks calls to avoid NameNode's rpc.CallQueueLength spike

Konstantin Shvachko (JIRA) Fri, 21 Apr 2017 19:15:15 -0700

     [ 
https://issues.apache.org/jira/browse/HDFS-11384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Konstantin Shvachko updated HDFS-11384:
---------------------------------------
    Attachment: HDFS-11384.006.patch

* You are right, the rate of {{getBlocks}} RPCs is not guaranteed. Balancer can 
only do its best. The actual rate can only be guaranteed on the NameNode, but 
we don't want to go there.
I made this clear in the comment for {{BALANCER_NUM_RPC_PER_SEC}}.
* Added a description for the delay.
* It is pretty hard to measure the rate of operations on the NN. Here is what I did.
I created a spy FSNamesystem. The spy calls a modified {{getBlocks()}} when 
the corresponding RPC is invoked.
The modified {{getBlocks()}} first calls the original method, then records the 
number of calls and the times of the first and the last call to {{getBlocks()}}. 
Given the number of calls and the interval, we can estimate the rate later on.
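
The counting idea described above can be sketched roughly as follows. This is only an illustration, not the code in the attached patch: the class {{CallRateProbe}} and its methods are hypothetical names, and the rate is assumed to be the number of calls divided by the time between the first and the last call.

```java
import java.util.function.Supplier;

/**
 * Hypothetical helper (not taken from the patch): wraps a call, counts
 * invocations, and remembers the times of the first and the last call
 * so that an average rate can be estimated afterwards.
 */
final class CallRateProbe {
  private long numCalls = 0;
  private long firstCallNanos = Long.MAX_VALUE;
  private long lastCallNanos = Long.MIN_VALUE;

  /** Calls the original method first, then records that the call happened. */
  <T> T invoke(Supplier<T> original) {
    T result = original.get();
    record(System.nanoTime());
    return result;
  }

  /** Records one call at the given timestamp (nanoseconds). */
  synchronized void record(long nowNanos) {
    numCalls++;
    // min/max keeps the interval correct even if concurrent callers
    // report their timestamps slightly out of order.
    firstCallNanos = Math.min(firstCallNanos, nowNanos);
    lastCallNanos = Math.max(lastCallNanos, nowNanos);
  }

  synchronized long getNumCalls() {
    return numCalls;
  }

  /**
   * Estimated calls per second: number of calls over the interval between
   * the first and the last call. Returns 0 if no interval was observed yet.
   */
  synchronized double callsPerSecond() {
    if (numCalls < 2) {
      return 0.0;
    }
    long intervalNanos = lastCallNanos - firstCallNanos;
    if (intervalNanos <= 0) {
      return 0.0;
    }
    return numCalls * 1e9 / intervalNanos;
  }
}
```

In a test, the spy's replacement for {{getBlocks()}} would delegate to something like {{invoke(...)}} with the original method as the supplier, and the test would compare {{callsPerSecond()}} against the configured limit once the balancer run finishes. Since the balancer can only do its best, such a comparison probably needs some tolerance.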

> Add option for balancer to disperse getBlocks calls to avoid NameNode's 
> rpc.CallQueueLength spike
> -------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-11384
>                 URL: https://issues.apache.org/jira/browse/HDFS-11384
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: balancer & mover
>    Affects Versions: 2.7.3
>            Reporter: yunjiong zhao
>            Assignee: yunjiong zhao
>         Attachments: balancer.day.png, balancer.week.png, 
> HDFS-11384.001.patch, HDFS-11384.002.patch, HDFS-11384.003.patch, 
> HDFS-11384.004.patch, HDFS-11384.005.patch, HDFS-11384.006.patch
>
>
> Running the balancer on a Hadoop cluster with more than 3000 DataNodes 
> causes a spike in the NameNode's rpc.CallQueueLength. We observed that this 
> can cause HBase cluster failure due to RegionServer WAL timeouts.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

