[
https://issues.apache.org/jira/browse/HDFS-14847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16954481#comment-16954481
]
Fei Hui commented on HDFS-14847:
--------------------------------
[~ayushtkn] Thanks for your review!
For an EC block (RS-6-3-1024k), the internal block index ranges from 0 to 8.
We use the bitSet to find the live indices: the bit for each live index is set
to true, which prevents that index from being added to srcIndices.
So the size of the bitSet must be greater than or equal to 9.
If we used the size of srcNodes, we would have to guarantee that it is greater
than or equal to 9.
I think we should not depend on srcNodes, because we cannot guarantee that its
size is greater than or equal to 9.
That is my opinion.
I have added a check to verify that the storageinfos are all in the indices
array and that the indexes from 0 to 8 all exist.
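
To illustrate the bitSet reasoning and the check described above, here is a
minimal sketch. It is not the code from the attached patches: the class name,
method names, and the inService array are hypothetical stand-ins, and the
RS-6-3 layout (indices 0 to 8) is hard-coded as an assumption.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

/** Hypothetical sketch of the bitSet idea; not taken from the HDFS-14847 patch. */
class LiveIndexSketch {
  // Assumed RS-6-3 policy: 6 data + 3 parity internal blocks, indices 0..8.
  static final int TOTAL_UNITS = 6 + 3;

  /** Rejects internal block indices outside 0..8. */
  private static int checked(byte idx) {
    if (idx < 0 || idx >= TOTAL_UNITS) {
      throw new IllegalArgumentException("internal block index out of range: " + idx);
    }
    return idx;
  }

  /**
   * blockIndices[i] is the internal block index held by the i-th source node;
   * inService[i] says whether that node is in service (not decommissioning).
   * Returns the positions of leaving nodes whose index is not live elsewhere.
   */
  static List<Integer> findLeavingSources(byte[] blockIndices, boolean[] inService) {
    if (blockIndices.length != inService.length) {
      throw new IllegalArgumentException("array lengths differ");
    }
    // Sized from the EC policy (9), not from the number of source nodes.
    // java.util.BitSet treats this as an initial capacity.
    BitSet live = new BitSet(TOTAL_UNITS);
    for (int i = 0; i < blockIndices.length; i++) {
      if (inService[i]) {
        live.set(checked(blockIndices[i]));
      }
    }
    List<Integer> srcIndices = new ArrayList<>();
    for (int i = 0; i < blockIndices.length; i++) {
      // A set bit means the index is already live, so it is not added.
      if (!inService[i] && !live.get(checked(blockIndices[i]))) {
        srcIndices.add(i);
      }
    }
    return srcIndices;
  }

  /** Check in the spirit of the one mentioned above: every index 0..8 is present. */
  static boolean coversAllIndices(byte[] blockIndices) {
    BitSet seen = new BitSet(TOTAL_UNITS);
    for (byte idx : blockIndices) {
      seen.set(checked(idx));
    }
    return seen.cardinality() == TOTAL_UNITS;
  }
}
```

For example, with blockIndices {0, 1, 2, 3, 4, 5, 6, 7, 8, 0}, where positions
0 and 1 are decommissioning and position 9 is an in-service copy of index 0,
findLeavingSources returns only position 1, so no further work is created for
the block that was already copied.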
Watching HDFS-14768
> Erasure Coding: Blocks are over-replicated while EC decommissioning
> -------------------------------------------------------------------
>
> Key: HDFS-14847
> URL: https://issues.apache.org/jira/browse/HDFS-14847
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: ec
> Affects Versions: 3.2.0, 3.0.3, 3.1.2, 3.3.0
> Reporter: Fei Hui
> Assignee: Fei Hui
> Priority: Critical
> Attachments: HDFS-14847.001.patch, HDFS-14847.002.patch,
> HDFS-14847.003.patch, HDFS-14847.004.patch
>
>
> Found that some blocks are over-replicated during EC decommissioning. The
> log shows messages like the following:
> {quote}
> INFO BlockStateChange: Block: blk_-9223372035714984112_363779142, Expected
> Replicas: 9, live replicas: 8, corrupt replicas: 0, decommissioned replicas:
> 0, decommissioning replicas: 3, maintenance replicas: 0, live entering
> maintenance replicas: 0, excess replicas: 0, Is Open File: false, Datanodes
> having this block: 10.254.41.34:50010 10.254.54.53:50010 10.254.28.53:50010
> 10.254.56.55:50010 10.254.32.21:50010 10.254.33.19:50010 10.254.63.17:50010
> 10.254.31.19:50010 10.254.35.29:50010 10.254.51.57:50010 10.254.40.58:50010
> 10.254.69.31:50010 10.254.47.18:50010 10.254.51.18:50010 10.254.43.57:50010
> 10.254.50.47:50010 10.254.42.37:50010 10.254.57.29:50010 10.254.67.40:50010
> 10.254.44.16:50010 10.254.59.38:50010 10.254.53.56:50010 10.254.45.11:50010
> 10.254.39.22:50010 10.254.30.16:50010 10.254.35.53:50010 10.254.22.30:50010
> 10.254.26.34:50010 10.254.17.58:50010 10.254.65.53:50010 10.254.60.39:50010
> 10.254.61.20:50010 10.254.64.23:50010 10.254.21.13:50010 10.254.37.35:50010
> 10.254.68.30:50010 10.254.62.37:50010 10.254.25.58:50010 10.254.52.54:50010
> 10.254.58.31:50010 10.254.49.11:50010 10.254.55.52:50010 10.254.19.19:50010
> 10.254.36.40:50010 10.254.18.30:50010 10.254.20.39:50010 10.254.66.52:50010
> 10.254.56.32:50010 10.254.24.55:50010 10.254.34.11:50010 10.254.29.58:50010
> 10.254.27.40:50010 10.254.46.33:50010 10.254.23.19:50010 10.254.74.12:50010
> 10.254.74.13:50010 10.254.41.35:50010 10.254.67.58:50010 10.254.54.11:50010
> 10.254.68.14:50010 10.254.27.14:50010 10.254.51.29:50010 10.254.45.21:50010
> 10.254.50.56:50010 10.254.47.31:50010 10.254.40.14:50010 10.254.65.21:50010
> 10.254.62.22:50010 10.254.57.16:50010 10.254.36.52:50010 10.254.30.13:50010
> 10.254.35.12:50010 10.254.69.34:50010 10.254.34.58:50010 10.254.17.50:50010
> 10.254.63.12:50010 10.254.28.21:50010 10.254.58.30:50010 10.254.24.57:50010
> 10.254.33.50:50010 10.254.44.52:50010 10.254.32.48:50010 10.254.43.39:50010
> 10.254.20.37:50010 10.254.56.59:50010 10.254.22.33:50010 10.254.60.34:50010
> 10.254.49.19:50010 10.254.52.21:50010 10.254.23.59:50010 10.254.21.16:50010
> 10.254.42.55:50010 10.254.29.33:50010 10.254.53.17:50010 10.254.19.14:50010
> 10.254.64.51:50010 10.254.46.20:50010 10.254.66.22:50010 10.254.18.38:50010
> 10.254.39.17:50010 10.254.37.57:50010 10.254.31.54:50010 10.254.55.33:50010
> 10.254.25.17:50010 10.254.61.33:50010 10.254.26.40:50010 10.254.59.23:50010
> 10.254.59.35:50010 10.254.66.48:50010 10.254.41.15:50010 10.254.54.31:50010
> 10.254.61.50:50010 10.254.62.31:50010 10.254.17.56:50010 10.254.29.18:50010
> 10.254.45.16:50010 10.254.63.48:50010 10.254.22.34:50010 10.254.37.51:50010
> 10.254.65.49:50010 10.254.58.21:50010 10.254.42.12:50010 10.254.55.17:50010
> 10.254.27.13:50010 10.254.57.17:50010 10.254.67.18:50010 10.254.31.31:50010
> 10.254.28.12:50010 10.254.36.12:50010 10.254.21.59:50010 10.254.30.30:50010
> 10.254.26.50:50010 10.254.40.40:50010 10.254.32.17:50010 10.254.47.55:50010
> 10.254.60.55:50010 10.254.49.33:50010 10.254.68.47:50010 10.254.39.21:50010
> 10.254.56.14:50010 10.254.33.54:50010 10.254.69.57:50010 10.254.43.50:50010
> 10.254.50.13:50010 10.254.25.49:50010 10.254.18.20:50010 10.254.52.23:50010
> 10.254.19.11:50010 10.254.20.21:50010 10.254.74.16:50010 10.254.64.55:50010
> 10.254.24.48:50010 10.254.46.29:50010 10.254.51.12:50010 10.254.23.56:50010
> 10.254.44.59:50010 10.254.53.58:50010 10.254.34.38:50010 10.254.35.37:50010
> 10.254.35.16:50010 10.254.36.23:50010 10.254.41.47:50010 10.254.54.12:50010
> 10.254.20.59:50010 , Current Datanode: 10.254.56.55:50010, Is current
> datanode decommissioning: true, Is current datanode entering maintenance:
> false
> {quote}
> Decommissioning hangs for a long time.
> Digging into the code, I found a problem in ErasureCodingWork.java.
> For example, suppose 2 nodes (dn0, dn1) are decommissioning and an EC block
> group has blocks on both nodes. After an ErasureCodingWork is created to
> reconstruct, it creates 2 replication works.
> If the replication from dn0 succeeds and the replication from dn1 fails, it
> will always create replication work for dn0. The block on dn0 becomes
> over-replicated and the block on dn1 is never replicated.
> Here is the initial patch for this.