[
https://issues.apache.org/jira/browse/HDFS-8341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14875802#comment-14875802
]
Tsz Wo Nicholas Sze commented on HDFS-8341:
-------------------------------------------
> ... original code paste showing it looped on the same block number on each
> run and never got past it.
Do you mean the code posted on [this
comment|https://issues.apache.org/jira/browse/HDFS-8341?focusedCommentId=14536366&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14536366]?
It loops over the locations, but it exits once it has tried all of them. So
"never got past it" seems to be an incorrect statement.
[~surendrasingh], could you comment on this?
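To make the point about termination concrete, here is a minimal sketch of a loop that tries each location exactly once and then returns. It is not the actual Mover or Dispatcher code: the class name {{BoundedMoveSketch}}, the method {{tryMoveOnce}}, and the location strings are all hypothetical, and the move itself is simulated by a predicate.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical illustration only; this is NOT the HDFS Mover implementation.
class BoundedMoveSketch {

    /**
     * Attempts a move from every location exactly once.
     * The loop is bounded by the size of the list, so it always terminates,
     * even if every attempt fails. Returns the locations whose move failed.
     */
    static List<String> tryMoveOnce(List<String> locations, Predicate<String> move) {
        List<String> failed = new ArrayList<>();
        for (String location : locations) {
            // A failed attempt is recorded, not retried within this pass.
            if (!move.test(location)) {
                failed.add(location);
            }
        }
        return failed;
    }

    public static void main(String[] args) {
        // Assumed example data: three replica locations, one of which always fails.
        List<String> locations = Arrays.asList("dn1:ARCHIVE", "dn2:ARCHIVE", "dn3:ARCHIVE");
        List<String> failed = tryMoveOnce(locations, loc -> !loc.startsWith("dn2"));
        System.out.println("failed=" + failed);
    }
}
```

In this sketch a failing location cannot stall the pass. Any repetition would have to come from whatever calls the pass again, which the sketch does not model.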
> ... so I don't have any way to reproduce it right now. Perhaps a new cluster
> with a storage tiering with rep factor 1 where the block is intentionally
> corrupted might be able to reproduce this.
Could you try to reproduce the problem? If we cannot reproduce it, we should
resolve this JIRA for the moment. We can reopen it or file a new JIRA if we
see the problem later.
IMO, this bug does not exist. You may indeed have encountered a similar
problem, but it might have been caused by something other than the Mover.
> HDFS mover stuck in loop trying to move corrupt block with no other valid
> replicas, doesn't move rest of other data blocks
> --------------------------------------------------------------------------------------------------------------------------
>
> Key: HDFS-8341
> URL: https://issues.apache.org/jira/browse/HDFS-8341
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: balancer & mover
> Affects Versions: 2.6.0
> Environment: HDP 2.2
> Reporter: Hari Sekhon
> Priority: Minor
>
> HDFS mover gets stuck looping on a block that fails to move and doesn't
> migrate the rest of the blocks.
> This is preventing recovery of data from a decommissioning external storage
> tier used for archive (we've had problems with that proprietary "hyperscale"
> storage product which is why a couple blocks here and there have checksum
> problems or premature eof as shown below), but this should not prevent moving
> all the other blocks to recover our data:
> {code}hdfs mover -p /apps/hive/warehouse/<custom_scrubbed>
> 15/05/07 14:52:50 INFO mover.Mover: namenodes =
> {hdfs://nameservice1=[/apps/hive/warehouse/<custom_scrubbed>]}
> 15/05/07 14:52:51 INFO balancer.KeyManager: Block token params received from
> NN: update interval=10hrs, 0sec, token lifetime=10hrs, 0sec
> 15/05/07 14:52:51 INFO block.BlockTokenSecretManager: Setting block keys
> 15/05/07 14:52:51 INFO balancer.KeyManager: Update block keys every 2hrs,
> 30mins, 0sec
> 15/05/07 14:52:52 INFO block.BlockTokenSecretManager: Setting block keys
> 15/05/07 14:52:52 INFO net.NetworkTopology: Adding a new node:
> /default-rack/<ip>:1019
> 15/05/07 14:52:52 INFO net.NetworkTopology: Adding a new node:
> /default-rack/<ip>:1019
> 15/05/07 14:52:52 INFO net.NetworkTopology: Adding a new node:
> /default-rack/<ip>:1019
> 15/05/07 14:52:52 INFO net.NetworkTopology: Adding a new node:
> /default-rack/<ip>:1019
> 15/05/07 14:52:52 INFO net.NetworkTopology: Adding a new node:
> /default-rack/<ip>:1019
> 15/05/07 14:52:52 INFO net.NetworkTopology: Adding a new node:
> /default-rack/<ip>:1019
> 15/05/07 14:52:52 WARN balancer.Dispatcher: Failed to move
> blk_1075156654_1438349 with size=134217728 from <ip>:1019:ARCHIVE to
> <ip>:1019:DISK through <ip>:1019: block move is failed: opReplaceBlock
> BP-120244285-<ip>-1417023863606:blk_1075156654_1438349 received exception
> java.io.EOFException: Premature EOF: no length prefix available
> <NOW IT STARTS LOOPING ON SAME BLOCK>
> 15/05/07 14:53:31 INFO net.NetworkTopology: Adding a new node:
> /default-rack/<ip>:1019
> 15/05/07 14:53:31 INFO net.NetworkTopology: Adding a new node:
> /default-rack/<ip>:1019
> 15/05/07 14:53:31 INFO net.NetworkTopology: Adding a new node:
> /default-rack/<ip>:1019
> 15/05/07 14:53:31 INFO net.NetworkTopology: Adding a new node:
> /default-rack/<ip>:1019
> 15/05/07 14:53:31 INFO net.NetworkTopology: Adding a new node:
> /default-rack/<ip>:1019
> 15/05/07 14:53:31 INFO net.NetworkTopology: Adding a new node:
> /default-rack/<ip>:1019
> 15/05/07 14:53:31 WARN balancer.Dispatcher: Failed to move
> blk_1075156654_1438349 with size=134217728 from <ip>:1019:ARCHIVE to
> <ip>:1019:DISK through <ip>:1019: block move is failed: opReplaceBlock
> BP-120244285-<ip>-1417023863606:blk_1075156654_1438349 received exception
> java.io.EOFException: Premature EOF: no length prefix available
> ...<repeat indefinitely>...
> {code}