[jira] [Comment Edited] (HDFS-11384) Add option for balancer to disperse getBlocks calls to avoid NameNode's rpc.CallQueueLength spike

2017-03-30 Thread Vinitha Reddy Gankidi (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-11384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15949978#comment-15949978
 ] 

Vinitha Reddy Gankidi edited comment on HDFS-11384 at 3/30/17 10:29 PM:


[~shv] I'm leaning towards (4) instead of (3).
{{isGoodBlockCandidate}} needs a global view of the block replicas. There is 
also some additional logic to deal with erasure-coded (EC) blocks, which may be 
a blocker for reading from DNs. [~zhz], you probably have more context regarding 
the EC blocks.
{code}
  /**
   * Decide if the block/blockGroup is a good candidate to be moved from source
   * to target. A block is a good candidate if
   * 1. the block is not in the process of being moved/has not been moved;
   * 2. the block does not have a replica/internalBlock on the target;
   * 3. doing the move does not reduce the number of racks that the block has
   */
  private boolean isGoodBlockCandidate(StorageGroup source, StorageGroup target,
      StorageType targetStorageType, DBlock block) {
{code}

I agree that (2) and (4) are complementary. 
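
To illustrate why the three conditions in the Javadoc above need a global view
of replica locations, here is a deliberately simplified sketch. It is not the
Balancer's code: the class CandidateCheck, the use of plain strings for
DataNodes and racks, and the boolean "moving" flag are all assumptions made for
the example, and it ignores EC block groups and storage types entirely.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical, simplified stand-in for the check quoted above. */
final class CandidateCheck {
  private CandidateCheck() {}

  /**
   * @param replicaNodes every node currently holding a replica (the global view)
   * @param rackOf       assumed node-to-rack mapping
   * @param moving       true if the block is being moved or was already moved
   */
  static boolean isGoodCandidate(String source, String target,
      Set<String> replicaNodes, Map<String, String> rackOf, boolean moving) {
    if (moving) {
      return false;                        // condition 1
    }
    if (replicaNodes.contains(target)) {
      return false;                        // condition 2
    }
    // Condition 3: compare rack counts before and after the move.
    Set<String> after = new HashSet<>(replicaNodes);
    after.remove(source);
    after.add(target);
    return countRacks(after, rackOf) >= countRacks(replicaNodes, rackOf);
  }

  private static int countRacks(Set<String> nodes, Map<String, String> rackOf) {
    Set<String> racks = new HashSet<>();
    for (String node : nodes) {
      racks.add(rackOf.get(node));
    }
    return racks.size();
  }
}
```

Conditions 2 and 3 both depend on the full set of replica locations, which a
single DN does not have on its own.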


was (Author: redvine):
[~shv] I'm leaning towards reading from (4) instead of (3).
{{isGoodBlockCandidate}} needs a global view of the block replicas. Also there 
is some additional logic to deal with erasure coded(EC) blocks and this may be 
a blocker for reading from DNs. [~zhz] you probably have more context regarding 
the EC blocks.
{code}
 /**
   * Decide if the block/blockGroup is a good candidate to be moved from source
   * to target. A block is a good candidate if
   * 1. the block is not in the process of being moved/has not been moved;
   * 2. the block does not have a replica/internalBlock on the target;
   * 3. doing the move does not reduce the number of racks that the block has
   */
  private boolean isGoodBlockCandidate(StorageGroup source, StorageGroup target,
  StorageType targetStorageType, DBlock block) {
{code}

I agree that (2) and (4) are complimentary. 

> Add option for balancer to disperse getBlocks calls to avoid NameNode's 
> rpc.CallQueueLength spike
> -
>
> Key: HDFS-11384
> URL: https://issues.apache.org/jira/browse/HDFS-11384
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: balancer & mover
> Affects Versions: 2.7.3
> Reporter: yunjiong zhao
> Assignee: yunjiong zhao
> Attachments: balancer.day.png, balancer.week.png, 
> HDFS-11384.001.patch, HDFS-11384.002.patch
>
>
> Running the balancer on a Hadoop cluster with more than 3000 DataNodes 
> causes NameNode's rpc.CallQueueLength to spike. We observed that this 
> can cause HBase cluster failure due to RegionServer WAL timeouts.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (HDFS-11384) Add option for balancer to disperse getBlocks calls to avoid NameNode's rpc.CallQueueLength spike

2017-03-30 Thread Vinitha Reddy Gankidi (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-11384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15949757#comment-15949757
 ] 

Vinitha Reddy Gankidi edited comment on HDFS-11384 at 3/30/17 8:36 PM:
---

If we were to offload the calls to DN, dispersing calls wouldn't be a pressing 
issue. I would like to get some feedback on the various approaches discussed. 
[~benoyantony], [~daryn], [~liuml07] and [~zhaoyunjiong] I would love to hear 
your opinions.


was (Author: redvine):
If we were to offload the calls to DN, dispersing calls wouldn't be a pressing 
issue. I would like to get some feedback  on the various approaches discussed. 
[~benoyantony] [~daryn] [~liuml07] [~zhaoyunjiong] I would love to hear your 
opinions.




[jira] [Comment Edited] (HDFS-11384) Add option for balancer to disperse getBlocks calls to avoid NameNode's rpc.CallQueueLength spike

2017-03-30 Thread Vinitha Reddy Gankidi (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-11384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15949757#comment-15949757
 ] 

Vinitha Reddy Gankidi edited comment on HDFS-11384 at 3/30/17 8:36 PM:
---

If we were to offload the calls to DN, dispersing calls wouldn't be a pressing 
issue. I would like to get some feedback on the various approaches discussed. 
[~benoyantony] [~daryn] [~liuml07] [~zhaoyunjiong] I would love to hear your 
opinions.


was (Author: redvine):
If we were to offload the calls to DN, dispersing calls wouldn't be a pressing 
issue. I would like to get some feedback  on the various approaches discussed. 
[~benoyantony] [~daryn] [~liuml07] I would love to hear your opinions.




[jira] [Comment Edited] (HDFS-11384) Add option for balancer to disperse getBlocks calls to avoid NameNode's rpc.CallQueueLength spike

2017-03-20 Thread yunjiong zhao (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-11384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15933354#comment-15933354
 ] 

yunjiong zhao edited comment on HDFS-11384 at 3/20/17 7:26 PM:
---

Thanks [~shv] for the review.
Balancer allows only one thread to issue getBlocks() at any given time, and 
only when dfs.balancer.getBlocks.interval.millis is set to a non-zero value. 
Otherwise this patch doesn't change anything.
So there is actually only one change.

If we use wait(), it will release the lock, so we can't make sure that only one 
thread calls getBlocks() at a time.

By default, this patch doesn't change anything. So if you need to run Balancer 
aggressively, don't set dfs.balancer.getBlocks.interval.millis.
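
As a rough illustration of the behavior described above (not the patch itself),
the sketch below serializes calls and sleeps while still holding the monitor.
The class name GetBlocksThrottle, the Supplier-based signature, and the
constructor argument standing in for dfs.balancer.getBlocks.interval.millis are
assumptions made for the example.

```java
import java.util.function.Supplier;

/** Hypothetical helper: one caller at a time, spaced by a fixed interval. */
final class GetBlocksThrottle {
  // Stands in for dfs.balancer.getBlocks.interval.millis (0 = disabled).
  private final long intervalMillis;
  private final Object lock = new Object();

  GetBlocksThrottle(long intervalMillis) {
    this.intervalMillis = intervalMillis;
  }

  <T> T call(Supplier<T> getBlocks) {
    if (intervalMillis <= 0) {
      // Default: no throttling, callers run concurrently as before.
      return getBlocks.get();
    }
    synchronized (lock) {
      T result = getBlocks.get();
      try {
        // Thread.sleep keeps the monitor, so no other thread can enter until
        // the interval has passed. lock.wait(intervalMillis) would release
        // the monitor and let another thread issue its call right away.
        Thread.sleep(intervalMillis);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
      return result;
    }
  }
}
```

With the interval left at 0 the helper is a pass-through, which matches the
"doesn't change anything by default" behavior.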



{quote}
Can we add some heuristics so that the Balancer could adjust by itself instead 
of adding the configuration parameter
{quote}
I thought about this before. The best approach I can think of is to add a new 
function in IPC that lets clients get the CallQueueLength; if CallQueueLength is 
too high, block getBlocks() until the CallQueueLength becomes normal again.
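
A minimal sketch of that idea, purely for illustration: the IPC function does
not exist, so the queue length is modeled as an IntSupplier, and the class name
QueueAwareGate, the threshold, and the poll interval are all assumptions.

```java
import java.util.function.IntSupplier;
import java.util.function.Supplier;

/** Hypothetical gate: waits while the reported queue length is too high. */
final class QueueAwareGate {
  private final IntSupplier callQueueLength; // assumed probe, not a real IPC API
  private final int threshold;               // assumed "too high" mark
  private final long pollMillis;             // how often to re-check

  QueueAwareGate(IntSupplier callQueueLength, int threshold, long pollMillis) {
    this.callQueueLength = callQueueLength;
    this.threshold = threshold;
    this.pollMillis = pollMillis;
  }

  <T> T call(Supplier<T> getBlocks) {
    // Hold back until the queue length drops to the threshold or below.
    while (callQueueLength.getAsInt() > threshold) {
      try {
        Thread.sleep(pollMillis);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        break; // stop waiting if interrupted, then proceed with the call
      }
    }
    return getBlocks.get();
  }
}
```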




was (Author: zhaoyunjiong):
Thanks [~shv] for review.
Only when you set dfs.balancer.getBlocks.interval.millis to non-zero, Balancer 
will only allow one thread to issue {code}getBlocks(){code} at any given time. 
Otherwise this patch doesn't change anything.
So only one change actually.

If use wait, it will release the lock, so can't make sure there are only one 
thread will call {code}getBlocks(){code}.

By default, this patch doesn't change anything. So if you need run Balancer 
aggressively, don't set   dfs.balancer.getBlocks.interval.millis.



{quote}
Can we add some heuristics so that the Balancer could adjust by itself instead 
of adding the configuration parameter
{quote}
I though this before. The best way I can thought is add new function in IPC 
that let clients get the CallQueueLength, if CallQueueLength is too high, block 
getBlocks() until the CallQueueLength become normal again.






[jira] [Comment Edited] (HDFS-11384) Add option for balancer to disperse getBlocks calls to avoid NameNode's rpc.CallQueueLength spike

2017-03-01 Thread Benoy Antony (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-11384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15890957#comment-15890957
 ] 

Benoy Antony edited comment on HDFS-11384 at 3/1/17 8:17 PM:
-

Sleeping inside the *synchronized* block should be avoided, as it prevents 
other threads from obtaining the lock while the thread is sleeping. 
One tradeoff of sleeping for a variable rather than a fixed time is that the 
code gets more complicated. Since the delay is not applied by default, it is 
okay to sleep for a fixed interval after getBlocks(). 
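
One possible way to keep the sleep out of the synchronized block, sketched
under assumptions: the lock is held only to reserve the next start time, and
the thread sleeps after releasing it. This is a variation on the fixed sleep
after getBlocks() suggested above, not the patch's approach; the class
DispersedCaller and its method names are hypothetical.

```java
import java.util.function.Supplier;

/** Hypothetical helper: spaces out call start times without sleeping under a lock. */
final class DispersedCaller {
  private final long intervalMillis;
  private long nextSlotMillis = 0L; // guarded by "this"

  DispersedCaller(long intervalMillis) {
    this.intervalMillis = intervalMillis;
  }

  /** Reserves the next start slot and returns how long the caller should wait. */
  synchronized long reserveDelay(long nowMillis) {
    long start = Math.max(nowMillis, nextSlotMillis);
    nextSlotMillis = start + intervalMillis;
    return start - nowMillis;
  }

  <T> T call(Supplier<T> getBlocks) {
    long delay = reserveDelay(System.currentTimeMillis());
    if (delay > 0) {
      try {
        Thread.sleep(delay); // the monitor is not held here
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    }
    return getBlocks.get();
  }
}
```

The bookkeeping is a little more involved than a fixed sleep, which is the
complexity tradeoff mentioned above.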


was (Author: benoyantony):
Sleeping inside the *Synchronized* block should be avoided as it will lock 
prevent other threads from obtaining the lock while the thread is sleeping. 
One tradeoff in sleeping fixed vs variable time is that code gets complicated. 
Since by default, the delay is not applied, it is okay to sleep for a fixed 
interval after getBlocks(). 

> Add option for balancer to disperse getBlocks calls to avoid NameNode's 
> rpc.CallQueueLength spike
> -
>
> Key: HDFS-11384
> URL: https://issues.apache.org/jira/browse/HDFS-11384
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: balancer & mover
> Affects Versions: 2.7.3
> Reporter: yunjiong zhao
> Assignee: yunjiong zhao
> Attachments: balancer.day.png, balancer.week.png, HDFS-11384.001.patch
>
>
> Running the balancer on a Hadoop cluster with more than 3000 DataNodes 
> causes NameNode's rpc.CallQueueLength to spike. We observed that this 
> can cause HBase cluster failure due to RegionServer WAL timeouts.


