Hari Sekhon created HDFS-8341:
---------------------------------

             Summary: HDFS mover stuck in loop after failing to move block, 
doesn't move rest of blocks, can't get data back off decommissioning external 
storage tier as a result
                 Key: HDFS-8341
                 URL: https://issues.apache.org/jira/browse/HDFS-8341
             Project: Hadoop HDFS
          Issue Type: Bug
          Components: balancer & mover
    Affects Versions: 2.6.0
         Environment: HDP 2.2
            Reporter: Hari Sekhon
            Priority: Blocker


The HDFS mover gets stuck looping on a block that fails to move and never 
migrates the rest of the blocks.

This is preventing recovery of data from a decommissioning external storage tier 
used for archive. We've had problems with that proprietary "hyperscale" storage 
product, which is why a couple of blocks here and there have checksum problems 
or premature EOF, as shown below. A few bad blocks should not prevent the mover 
from moving all the other blocks so that we can recover our data:
{code}hdfs mover -p /apps/hive/warehouse/<custom_scrubbed>
15/05/07 14:52:50 INFO mover.Mover: namenodes = 
{hdfs://nameservice1=[/apps/hive/warehouse/<custom_scrubbed>]}
15/05/07 14:52:51 INFO balancer.KeyManager: Block token params received from 
NN: update interval=10hrs, 0sec, token lifetime=10hrs, 0sec
15/05/07 14:52:51 INFO block.BlockTokenSecretManager: Setting block keys
15/05/07 14:52:51 INFO balancer.KeyManager: Update block keys every 2hrs, 
30mins, 0sec
15/05/07 14:52:52 INFO block.BlockTokenSecretManager: Setting block keys
15/05/07 14:52:52 INFO net.NetworkTopology: Adding a new node: 
/default-rack/<ip>:1019
15/05/07 14:52:52 INFO net.NetworkTopology: Adding a new node: 
/default-rack/<ip>:1019
15/05/07 14:52:52 INFO net.NetworkTopology: Adding a new node: 
/default-rack/<ip>:1019
15/05/07 14:52:52 INFO net.NetworkTopology: Adding a new node: 
/default-rack/<ip>:1019
15/05/07 14:52:52 INFO net.NetworkTopology: Adding a new node: 
/default-rack/<ip>:1019
15/05/07 14:52:52 INFO net.NetworkTopology: Adding a new node: 
/default-rack/<ip>:1019
15/05/07 14:52:52 WARN balancer.Dispatcher: Failed to move 
blk_1075156654_1438349 with size=134217728 from <ip>:1019:ARCHIVE to 
<ip>:1019:DISK through <ip>:1019: block move is failed: opReplaceBlock 
BP-120244285-<ip>-1417023863606:blk_1075156654_1438349 received exception 
java.io.EOFException: Premature EOF: no length prefix available
<NOW IT STARTS LOOPING ON SAME BLOCK>
15/05/07 14:53:31 INFO net.NetworkTopology: Adding a new node: 
/default-rack/<ip>:1019
15/05/07 14:53:31 INFO net.NetworkTopology: Adding a new node: 
/default-rack/<ip>:1019
15/05/07 14:53:31 INFO net.NetworkTopology: Adding a new node: 
/default-rack/<ip>:1019
15/05/07 14:53:31 INFO net.NetworkTopology: Adding a new node: 
/default-rack/<ip>:1019
15/05/07 14:53:31 INFO net.NetworkTopology: Adding a new node: 
/default-rack/<ip>:1019
15/05/07 14:53:31 INFO net.NetworkTopology: Adding a new node: 
/default-rack/<ip>:1019
15/05/07 14:53:31 WARN balancer.Dispatcher: Failed to move 
blk_1075156654_1438349 with size=134217728 from <ip>:1019:ARCHIVE to 
<ip>:1019:DISK through <ip>:1019: block move is failed: opReplaceBlock 
BP-120244285-<ip>-1417023863606:blk_1075156654_1438349 received exception 
java.io.EOFException: Premature EOF: no length prefix available
...<repeat indefinitely>...
{code}
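
The loop in the output above can be confirmed from captured mover output by counting how often each block ID appears in a "Failed to move" warning. The following is a minimal sketch, not part of the mover itself. The function names and the threshold are hypothetical, and it assumes the WARN lines have the same shape as the ones in this report (including the line wrap between "Failed to move" and the block ID).

```python
import re
from collections import Counter

# Matches the Dispatcher warning shown in the log above. \s+ tolerates the
# line wrap between "Failed to move" and the block ID.
FAILED_MOVE = re.compile(r"Failed to move\s+(blk_\d+_\d+)")


def count_failed_moves(log_text):
    """Return a Counter mapping block ID -> number of failed move attempts."""
    return Counter(FAILED_MOVE.findall(log_text))


def repeatedly_failing_blocks(log_text, threshold=2):
    """Return the sorted block IDs that failed at least `threshold` times.

    `threshold` is an illustrative cut-off, not a value taken from HDFS.
    """
    counts = count_failed_moves(log_text)
    return sorted(blk for blk, n in counts.items() if n >= threshold)
```

Run against the output in this report, a block such as blk_1075156654_1438349 that shows up in every iteration is the one the mover is stuck on.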



