[jira] [Commented] (HDFS-8341) HDFS mover stuck in loop after failing to move block, doesn't move rest of blocks, can't get data back off decommissioning external storage tier as a result

Tsz Wo Nicholas Sze (JIRA) Mon, 11 May 2015 13:51:29 -0700

    [ https://issues.apache.org/jira/browse/HDFS-8341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14538625#comment-14538625 ]


Tsz Wo Nicholas Sze commented on HDFS-8341:
-------------------------------------------

> ... If replica scheduled successfully it will return, but here it should 
> continue for next replica.

The reason for returning is that we don't want to move multiple replicas of 
the same block at once.

> Now the problem is, if file have more than one block for example 10 ... 

Do you mean "a block has more than one replica, for example 10"?

> ... and some problem in moving first replica then scheduleMoves4Block() API 
> will always schedule first replica in each iteration and it will return.

The locations are shuffled so that the first replica is not necessarily the 
same in each iteration.  Am I missing anything?
{code}
//scheduleMoves4Block
      Collections.shuffle(locations);
{code}
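
To make the two points above concrete, here is a minimal, self-contained sketch. It is not the actual Mover code: the class ShuffleScheduleSketch, the method scheduleOneReplica, and the trySchedule predicate are hypothetical names used only for illustration. It shows a pass that shuffles the replica locations and returns as soon as one replica is scheduled, so at most one replica of a block is moved per pass and the replica tried first can differ between iterations.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Optional;
import java.util.Random;
import java.util.function.Predicate;

public class ShuffleScheduleSketch {

    /**
     * Hypothetical stand-in for one scheduling pass over a block's replicas.
     * Returns the single replica that was scheduled, or empty if none could be.
     */
    static Optional<String> scheduleOneReplica(List<String> locations,
                                               Predicate<String> trySchedule,
                                               Random rng) {
        // Work on a copy so the caller's list is left untouched.
        List<String> shuffled = new ArrayList<>(locations);
        // Random order: the first candidate is not fixed across passes.
        Collections.shuffle(shuffled, rng);
        for (String replica : shuffled) {
            if (trySchedule.test(replica)) {
                // Return after the first success so that multiple replicas
                // of the same block are never moved at once.
                return Optional.of(replica);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        List<String> locations =
            Arrays.asList("dn1:ARCHIVE", "dn2:ARCHIVE", "dn3:ARCHIVE");
        Random rng = new Random();
        for (int i = 0; i < 5; i++) {
            // Every replica is treated as schedulable in this demo.
            Optional<String> picked =
                scheduleOneReplica(locations, replica -> true, rng);
            System.out.println("iteration " + i + " -> " + picked.orElse("none"));
        }
    }
}
```

Under these assumptions, a replica that cannot be scheduled is simply skipped in that pass, and the next pass may start from a different replica.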


> HDFS mover stuck in loop after failing to move block, doesn't move rest of 
> blocks, can't get data back off decommissioning external storage tier as a 
> result
> ------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-8341
>                 URL: https://issues.apache.org/jira/browse/HDFS-8341
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: balancer & mover
>    Affects Versions: 2.6.0
>         Environment: HDP 2.2
>            Reporter: Hari Sekhon
>            Assignee: surendra singh lilhore
>            Priority: Blocker
>
> HDFS mover gets stuck looping on a block that fails to move and doesn't 
> migrate the rest of the blocks.
> This is preventing recovery of data from a decommissioning external storage 
> tier used for archive (we've had problems with that proprietary "hyperscale" 
> storage product which is why a couple blocks here and there have checksum 
> problems or premature eof as shown below), but this should not prevent moving 
> all the other blocks to recover our data:
> {code}hdfs mover -p /apps/hive/warehouse/<custom_scrubbed>
> 15/05/07 14:52:50 INFO mover.Mover: namenodes = 
> {hdfs://nameservice1=[/apps/hive/warehouse/<custom_scrubbed>]}
> 15/05/07 14:52:51 INFO balancer.KeyManager: Block token params received from 
> NN: update interval=10hrs, 0sec, token lifetime=10hrs, 0sec
> 15/05/07 14:52:51 INFO block.BlockTokenSecretManager: Setting block keys
> 15/05/07 14:52:51 INFO balancer.KeyManager: Update block keys every 2hrs, 
> 30mins, 0sec
> 15/05/07 14:52:52 INFO block.BlockTokenSecretManager: Setting block keys
> 15/05/07 14:52:52 INFO net.NetworkTopology: Adding a new node: 
> /default-rack/<ip>:1019
> 15/05/07 14:52:52 INFO net.NetworkTopology: Adding a new node: 
> /default-rack/<ip>:1019
> 15/05/07 14:52:52 INFO net.NetworkTopology: Adding a new node: 
> /default-rack/<ip>:1019
> 15/05/07 14:52:52 INFO net.NetworkTopology: Adding a new node: 
> /default-rack/<ip>:1019
> 15/05/07 14:52:52 INFO net.NetworkTopology: Adding a new node: 
> /default-rack/<ip>:1019
> 15/05/07 14:52:52 INFO net.NetworkTopology: Adding a new node: 
> /default-rack/<ip>:1019
> 15/05/07 14:52:52 WARN balancer.Dispatcher: Failed to move 
> blk_1075156654_1438349 with size=134217728 from <ip>:1019:ARCHIVE to 
> <ip>:1019:DISK through <ip>:1019: block move is failed: opReplaceBlock 
> BP-120244285-<ip>-1417023863606:blk_1075156654_1438349 received exception 
> java.io.EOFException: Premature EOF: no length prefix available
> <NOW IT STARTS LOOPING ON SAME BLOCK>
> 15/05/07 14:53:31 INFO net.NetworkTopology: Adding a new node: 
> /default-rack/<ip>:1019
> 15/05/07 14:53:31 INFO net.NetworkTopology: Adding a new node: 
> /default-rack/<ip>:1019
> 15/05/07 14:53:31 INFO net.NetworkTopology: Adding a new node: 
> /default-rack/<ip>:1019
> 15/05/07 14:53:31 INFO net.NetworkTopology: Adding a new node: 
> /default-rack/<ip>:1019
> 15/05/07 14:53:31 INFO net.NetworkTopology: Adding a new node: 
> /default-rack/<ip>:1019
> 15/05/07 14:53:31 INFO net.NetworkTopology: Adding a new node: 
> /default-rack/<ip>:1019
> 15/05/07 14:53:31 WARN balancer.Dispatcher: Failed to move 
> blk_1075156654_1438349 with size=134217728 from <ip>:1019:ARCHIVE to 
> <ip>:1019:DISK through <ip>:1019: block move is failed: opReplaceBlock 
> BP-120244285-<ip>-1417023863606:blk_1075156654_1438349 received exception 
> java.io.EOFException: Premature EOF: no length prefix available
> ...<repeat indefinitely>...
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
