Please unsubscribe me from the list..thanks

Punitha S Tue, 08 Apr 2014 16:01:25 -0700

On Wed, Apr 9, 2014 at 8:55 AM, Ming Ma (JIRA) <[email protected]> wrote:


>
>      [
> https://issues.apache.org/jira/browse/HDFS-6178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel]
>
> Ming Ma updated HDFS-6178:
> --------------------------
>
>     Attachment: HDFS-6178-2.patch
>
> Thanks, Jing. Updated patch per suggestion.
>
> > Decommission on standby NN couldn't finish
> > ------------------------------------------
> >
> >                 Key: HDFS-6178
> >                 URL: https://issues.apache.org/jira/browse/HDFS-6178
> >             Project: Hadoop HDFS
> >          Issue Type: Bug
> >          Components: namenode
> >            Reporter: Ming Ma
> >         Attachments: HDFS-6178-2.patch, HDFS-6178.patch
> >
> >
> > Currently, decommissioning machines in an HA-enabled cluster requires
> running refreshNodes on both the active and standby nodes. Sometimes
> decommissioning won't finish from the standby NN's point of view. Here is the
> diagnosis of why it could happen.
> > Standby NN's blockManager manages block replication and block
> invalidation as if it were the active NN, even though DNs will ignore block
> commands coming from standby NN. When standby NN makes block operation
> decisions such as the target of block replication and the node to remove
> excess blocks from, the decision is independent of active NN. So active NN
> and standby NN could have different states. When we try to decommission
> nodes on the standby NN, such state inconsistency might prevent standby NN
> from making progress. Here is an example.
> > Machine A
> > Machine B
> > Machine C
> > Machine D
> > Machine E
> > Machine F
> > Machine G
> > Machine H
> > 1. For a given block, both active and standby have 5 replicas on machines
> A, B, C, D, E. So both active and standby decide to pick excess nodes to
> invalidate.
> > Active picked D and E as excess DNs. After the next block reports from D
> and E, active NN has 3 active replicas (A, B, C), 0 excess replicas.
> > {noformat}
> > 2014-03-27 01:50:14,410 INFO BlockStateChange: BLOCK*
> chooseExcessReplicates: (E:50010, blk_-5207804474559026159_121186764) is
> added to invalidated blocks set
> > 2014-03-27 01:50:15,539 INFO BlockStateChange: BLOCK*
> chooseExcessReplicates: (D:50010, blk_-5207804474559026159_121186764) is
> added to invalidated blocks set
> > {noformat}
> > Standby picked C and E as excess DNs. Given that DNs ignore commands from
> standby, after the next block reports from C, D, E, standby has 2 active
> replicas (A, B), 1 excess replica (C).
> > {noformat}
> > 2014-03-27 01:51:49,543 INFO BlockStateChange: BLOCK*
> chooseExcessReplicates: (E:50010, blk_-5207804474559026159_121186764) is
> added to invalidated blocks set
> > 2014-03-27 01:51:49,894 INFO BlockStateChange: BLOCK*
> chooseExcessReplicates: (C:50010, blk_-5207804474559026159_121186764) is
> added to invalidated blocks set
> > {noformat}
> > 2. Machine A decomm request was sent to standby. Standby only had one
> live replica and picked machines G, H as targets, but given that standby
> commands were ignored by DNs, G, H remained in the pending replication queue
> until they timed out. At this point, you have one decommissioning replica
> (A), 1 active replica (B), one excess replica (C).
> > {noformat}
> > 2014-03-27 04:42:52,258 INFO BlockStateChange: BLOCK* ask A:50010 to
> replicate blk_-5207804474559026159_121186764 to datanode(s) G:50010 H:50010
> > {noformat}
> > 3. Machine A decomm request was sent to active NN. Active NN picked
> machine F as the target. It finished properly. So active NN had 3 active
> replicas (B, C, F), one decommissioned replica (A).
> > {noformat}
> > 2014-03-27 04:44:15,239 INFO BlockStateChange: BLOCK* ask
> 10.42.246.110:50010 to replicate blk_-5207804474559026159_121186764 to
> datanode(s) F:50010
> > 2014-03-27 04:44:16,083 INFO BlockStateChange: BLOCK* addStoredBlock:
> blockMap updated: F:50010 is added to blk_-5207804474559026159_121186764
> size 7100065
> > {noformat}
> > 4. Standby NN picked up F as a new replica. Thus standby had one
> decommissioning replica (A), 2 active replicas (B, F), one excess replica
> (C). Standby NN kept trying to schedule replication work, but DNs ignored
> the commands.
> > {noformat}
> > 2014-03-27 04:44:16,084 INFO BlockStateChange: BLOCK* addStoredBlock:
> blockMap updated: F:50010 is added to blk_-5207804474559026159_121186764
> size 7100065
> > 2014-03-28 23:06:11,970 INFO
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Block:
> blk_-5207804474559026159_121186764, Expected Replicas: 3, live replicas: 2,
> corrupt replicas: 0, decommissioned replicas: 1, excess replicas: 1, Is
> Open File: false, Datanodes having this block: C:50010 B:50010 A:50010
> F:50010 , Current Datanode: A:50010, Is current datanode decommissioning:
> true
> > {noformat}
>
>
>
> --
> This message was sent by Atlassian JIRA
> (v6.2#6252)
>
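
The four steps quoted above can be replayed as a small Python model. This is
only a sketch under stated assumptions: BlockView, its methods, and simulate()
are hypothetical names made up for illustration, not HDFS code, and the
completion check (live replicas >= expected replicas) is a simplification of
what the real BlockManager does. The replica sets and the expected count of 3
come from the quoted description and log lines.

```python
from dataclasses import dataclass, field

EXPECTED_REPLICAS = 3  # "Expected Replicas: 3" in the quoted log line


@dataclass
class BlockView:
    """One NameNode's private view of a single block (hypothetical model)."""
    live: set = field(default_factory=set)
    excess: set = field(default_factory=set)
    decommissioning: set = field(default_factory=set)

    def mark_excess(self, *nodes):
        # The NN decides these replicas are surplus and should be invalidated.
        for n in nodes:
            self.live.discard(n)
            self.excess.add(n)

    def report_deleted(self, *nodes):
        # A block report shows the replica no longer exists on these nodes.
        for n in nodes:
            self.live.discard(n)
            self.excess.discard(n)

    def start_decommission(self, node):
        if node in self.live:
            self.live.discard(node)
            self.decommissioning.add(node)

    def add_stored(self, node):
        # Corresponds to "addStoredBlock" in the quoted logs.
        self.live.add(node)

    def decommission_done(self):
        # Simplifying assumption: done once enough live replicas exist.
        return len(self.live) >= EXPECTED_REPLICAS


def simulate():
    active = BlockView(live=set("ABCDE"))
    standby = BlockView(live=set("ABCDE"))

    # Step 1: each NN picks excess nodes on its own; DNs obey only active.
    active.mark_excess("D", "E")
    standby.mark_excess("C", "E")
    active.report_deleted("D", "E")
    standby.report_deleted("D", "E")  # C was never deleted, so it stays excess

    # Step 2: decommission A on standby; its targets G, H are never written.
    standby.start_decommission("A")

    # Step 3: decommission A on active; F receives a real replica.
    active.start_decommission("A")
    active.add_stored("F")

    # Step 4: standby also learns about F from the DN.
    standby.add_stored("F")
    return active, standby


if __name__ == "__main__":
    active, standby = simulate()
    print("active :", sorted(active.live), active.decommission_done())
    print("standby:", sorted(standby.live), sorted(standby.excess),
          standby.decommission_done())
```

In this model the standby ends with live replicas B and F plus C stuck as
excess, so it stays one live replica short of 3 and the decommission of A
never completes from its point of view, while the active finishes with B, C
and F.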
