[
https://issues.apache.org/jira/browse/HDFS-8341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14802746#comment-14802746
]
Hari Sekhon commented on HDFS-8341:
-----------------------------------
[~szetszwo] I believe this ticket is still valid:
There were holes in the data because that storage tier had replication factor 1.
Replication was supposed to be handled within the proprietary hyperscale
storage solution underpinning that tier, so there was no point in storing
multiple HDFS replicas there. When a given block's checksum failed, HDFS Mover
looped on that block (probably hoping to find another valid replica to use,
but there were none, so it was stuck retrying the one corrupt replica). It
never got past that block, so it didn't transfer the rest of the data's
blocks.
The same problem would occur if all replicas were corrupt, or if a block was
under-replicated (which happens often) and the existing replica was corrupt.
So this jira is still valid: if HDFS Mover can't find a valid, non-corrupt
replica, it doesn't proceed to move the other blocks, which prevented
decommissioning of this storage tier. This is why I scripted a custom recovery
job under the hood of Hadoop: the other blocks were fine, and Mover was
leaving a lot of data behind on the external storage tier.
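As a rough illustration of how this condition could be spotted from the
outside, here is a minimal Python sketch that scans Mover output for blocks
whose move failed more than once. It assumes raw (unquoted) log text in the
same format as the log in this ticket's description. The function name
`repeated_failures` and the `threshold` parameter are hypothetical helpers
for this example, not part of Hadoop.

```python
import re
from collections import Counter

# Matches the block ID in Dispatcher warnings such as:
#   WARN balancer.Dispatcher: Failed to move blk_1075156654_1438349 with size=...
# \s+ tolerates a line wrap between "move" and the block ID.
FAILED_MOVE = re.compile(r"Failed to move\s+(blk_\d+_\d+)")


def repeated_failures(log_text, threshold=2):
    """Return {block_id: failure_count} for blocks that failed at least
    `threshold` times. `threshold` is an assumed cut-off for this sketch."""
    counts = Counter(FAILED_MOVE.findall(log_text))
    return {blk: n for blk, n in counts.items() if n >= threshold}


# Two WARN lines taken from the log in the description (placeholders kept).
sample = (
    "15/05/07 14:52:52 WARN balancer.Dispatcher: Failed to move "
    "blk_1075156654_1438349 with size=134217728 from <ip>:1019:ARCHIVE to "
    "<ip>:1019:DISK through <ip>:1019: block move is failed\n"
    "15/05/07 14:53:31 WARN balancer.Dispatcher: Failed to move "
    "blk_1075156654_1438349 with size=134217728 from <ip>:1019:ARCHIVE to "
    "<ip>:1019:DISK through <ip>:1019: block move is failed\n"
)

print(repeated_failures(sample))  # {'blk_1075156654_1438349': 2}
```

This only reports which blocks keep failing; it does not change Mover's
behaviour or skip the block.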
> (Summary & Description may be invalid) HDFS mover stuck in loop after failing
> to move block, doesn't move rest of blocks, can't get data back off
> decommissioning external storage tier as a result
> ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
> Key: HDFS-8341
> URL: https://issues.apache.org/jira/browse/HDFS-8341
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: balancer & mover
> Affects Versions: 2.6.0
> Environment: HDP 2.2
> Reporter: Hari Sekhon
> Priority: Minor
>
> HDFS mover gets stuck looping on a block that fails to move and doesn't
> migrate the rest of the blocks.
> This is preventing recovery of data from a decommissioning external storage
> tier used for archive. We've had problems with that proprietary "hyperscale"
> storage product, which is why a couple of blocks here and there have checksum
> problems or premature EOF, as shown below. This should not prevent moving
> all the other blocks to recover our data:
> {code}
> hdfs mover -p /apps/hive/warehouse/<custom_scrubbed>
> 15/05/07 14:52:50 INFO mover.Mover: namenodes =
> {hdfs://nameservice1=[/apps/hive/warehouse/<custom_scrubbed>]}
> 15/05/07 14:52:51 INFO balancer.KeyManager: Block token params received from
> NN: update interval=10hrs, 0sec, token lifetime=10hrs, 0sec
> 15/05/07 14:52:51 INFO block.BlockTokenSecretManager: Setting block keys
> 15/05/07 14:52:51 INFO balancer.KeyManager: Update block keys every 2hrs,
> 30mins, 0sec
> 15/05/07 14:52:52 INFO block.BlockTokenSecretManager: Setting block keys
> 15/05/07 14:52:52 INFO net.NetworkTopology: Adding a new node:
> /default-rack/<ip>:1019
> 15/05/07 14:52:52 INFO net.NetworkTopology: Adding a new node:
> /default-rack/<ip>:1019
> 15/05/07 14:52:52 INFO net.NetworkTopology: Adding a new node:
> /default-rack/<ip>:1019
> 15/05/07 14:52:52 INFO net.NetworkTopology: Adding a new node:
> /default-rack/<ip>:1019
> 15/05/07 14:52:52 INFO net.NetworkTopology: Adding a new node:
> /default-rack/<ip>:1019
> 15/05/07 14:52:52 INFO net.NetworkTopology: Adding a new node:
> /default-rack/<ip>:1019
> 15/05/07 14:52:52 WARN balancer.Dispatcher: Failed to move
> blk_1075156654_1438349 with size=134217728 from <ip>:1019:ARCHIVE to
> <ip>:1019:DISK through <ip>:1019: block move is failed: opReplaceBlock
> BP-120244285-<ip>-1417023863606:blk_1075156654_1438349 received exception
> java.io.EOFException: Premature EOF: no length prefix available
> <NOW IT STARTS LOOPING ON SAME BLOCK>
> 15/05/07 14:53:31 INFO net.NetworkTopology: Adding a new node:
> /default-rack/<ip>:1019
> 15/05/07 14:53:31 INFO net.NetworkTopology: Adding a new node:
> /default-rack/<ip>:1019
> 15/05/07 14:53:31 INFO net.NetworkTopology: Adding a new node:
> /default-rack/<ip>:1019
> 15/05/07 14:53:31 INFO net.NetworkTopology: Adding a new node:
> /default-rack/<ip>:1019
> 15/05/07 14:53:31 INFO net.NetworkTopology: Adding a new node:
> /default-rack/<ip>:1019
> 15/05/07 14:53:31 INFO net.NetworkTopology: Adding a new node:
> /default-rack/<ip>:1019
> 15/05/07 14:53:31 WARN balancer.Dispatcher: Failed to move
> blk_1075156654_1438349 with size=134217728 from <ip>:1019:ARCHIVE to
> <ip>:1019:DISK through <ip>:1019: block move is failed: opReplaceBlock
> BP-120244285-<ip>-1417023863606:blk_1075156654_1438349 received exception
> java.io.EOFException: Premature EOF: no length prefix available
> ...<repeat indefinitely>...
> {code}