KevinWikant opened a new pull request, #7179:
URL: https://github.com/apache/hadoop/pull/7179
## Problem
Problem background:
- A datanode should only enter decommissioned state if all the blocks on the
datanode are sufficiently replicated to other live datanodes.
- This expectation is violated for Under Construction blocks, which are not
considered by the DatanodeAdminMonitor at all.
- DatanodeAdminMonitor currently only considers blocks in the
DatanodeDescriptor StorageInfos, which exclude Under Construction blocks (see
the client-side sketch after this list):
- For a newly created HDFS block, it is not added to the StorageInfos until
the HDFS client closes the DFSOutputStream & the block becomes finalized
- For an existing HDFS block that was opened for append:
- First, the block version with the previous generation stamp is marked
stale & removed from the StorageInfos
- Next, the block version with the new generation stamp is not added to
the StorageInfos until the HDFS client closes the DFSOutputStream & the
block becomes finalized
There is logic in the DatanodeAdminManager/DatanodeAdminMonitor to avoid
transitioning datanodes to decommissioned state when they have open (i.e. Under
Construction) blocks:
-
https://github.com/apache/hadoop/blob/cd2cffe73f909a106ba47653acf525220f2665cf/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeAdminManager.java#L357
-
https://github.com/apache/hadoop/blob/cd2cffe73f909a106ba47653acf525220f2665cf/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeAdminManager.java#L305
This logic does not work correctly because, as mentioned above,
[DatanodeAdminMonitor currently only considers blocks in the DatanodeDescriptor
StorageInfos](https://github.com/apache/hadoop/blob/cd2cffe73f909a106ba47653acf525220f2665cf/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeAdminDefaultMonitor.java#L385),
and the StorageInfos do not include Under Construction blocks whose
DFSOutputStream has not yet been closed.
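For illustration, the shape of that scan is roughly the following (the types
here are simplified stand-ins for the real Hadoop classes, not the actual
DatanodeAdminDefaultMonitor code):

```java
import java.util.List;

class StorageInfoScanSketch {
  // Illustrative stand-ins for the real blockmanagement types.
  interface Block {}
  interface StorageInfo { List<Block> getBlocks(); }
  interface Datanode { List<StorageInfo> getStorageInfos(); }

  /**
   * Only blocks already present in the per-storage block lists are
   * visited, so an Under Construction block that has not yet been
   * finalized is never checked at all.
   */
  static boolean allVisibleBlocksSafe(Datanode dn) {
    for (StorageInfo storage : dn.getStorageInfos()) {
      for (Block block : storage.getBlocks()) {
        if (!isSufficientlyReplicated(block)) {
          // Block still needs replicas; decommissioning must wait.
          return false;
        }
      }
    }
    // A datanode whose only blocks are Under Construction falls through
    // to here and (incorrectly) looks safe to decommission.
    return true;
  }

  // Stand-in for the real replication check done via the BlockManager.
  private static boolean isSufficientlyReplicated(Block block) {
    return true; // placeholder
  }
}
```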
There is also logic in the HDFS [DataStreamer client which will replace
bad/dead datanodes in the block write
pipeline](https://github.com/apache/hadoop/blob/cd2cffe73f909a106ba47653acf525220f2665cf/hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DataStreamer.java#L1716).
Note that:
- this logic cannot work if the replication factor is 1, because there is no
surviving replica to transfer the block from
- if the replication factor is greater than 1, this logic still fails if all
the datanodes in the block write pipeline are decommissioned/terminated at
around the same time (the relevant client-side settings are sketched after
this list)
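For context, this pipeline-recovery behavior is governed by client-side
settings. A brief sketch of the relevant keys follows (the key names come from
the HDFS client configuration; the comments reflect my understanding of the
policies, not text from this patch):

```java
import org.apache.hadoop.conf.Configuration;

public class PipelineRecoverySettings {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Enable replacement of bad/dead datanodes in the write pipeline
    // (this is already the default).
    conf.setBoolean(
        "dfs.client.block.write.replace-datanode-on-failure.enable", true);
    // DEFAULT only adds a replacement datanode under certain conditions;
    // ALWAYS is stricter; NEVER disables replacement entirely.
    conf.set(
        "dfs.client.block.write.replace-datanode-on-failure.policy",
        "DEFAULT");
    // Regardless of these settings, with replication factor 1 there is
    // no surviving replica to transfer the block from, so pipeline
    // recovery cannot succeed once the sole datanode is gone.
  }
}
```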
Overall, the Namenode should not be putting datanodes with open blocks into
decommissioned state & hoping that the DataStreamer client is able to replace
them when the decommissioned datanodes are terminated. Whether this works
depends on timing & therefore it is not a solution which guarantees
correctness.
The Namenode needs to honor the rule that "a datanode should only enter
decommissioned state if all the blocks on the datanode are sufficiently
replicated to other live datanodes", even for blocks which are currently Under
Construction.
## Potential Solutions
One possible opinion is that if the DFSOutputStream has not been successfully
closed yet, then the client should be able to replay all the data if there is
a failure. The client should not have any expectation that the data is
committed to HDFS until the DFSOutputStream is closed. There are a few reasons
I do not think this makes sense:
- The methods hflush/hsync do not result in the data already appended to the
DFSOutputStream being persisted/finalized. This is confusing when compared to
the standard semantics of stream flush/sync methods.
- This does not handle the case where a block is re-opened by a new
DFSOutputStream after having been previously closed by a different client. In
this case, the problem leads to data loss for data that was previously
committed by the other client & cannot be replayed by the new one.
- To solve this problem, we could try not removing the old block version from
the StorageInfos when a new block version is created; however, this change is
likely to have wider implications on block management.
Another possible option is to add blocks to the StorageInfos before they are
finalized. However, this change is also likely to have wider implications on
block management.
Without modifying any existing block management logic, we can add a new data
structure (UnderConstructionBlocks) which temporarily tracks the Under
Construction blocks in-memory until they are committed/finalized & added to the
StorageInfos.
## Solution
Add a new data structure (UnderConstructionBlocks), sketched below, which
temporarily tracks the Under Construction blocks in-memory until they are
committed/finalized & added to the StorageInfos.
Pros:
- works for newly created HDFS blocks
- works for re-opened HDFS blocks (i.e. opened for append)
- works for blocks with any replication factor
- does not change existing logic in the BlockManager; the new
UnderConstructionBlocks data structure & its associated logic are purely
additive
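As a rough illustration of the idea (the method names and the keying by
datanode UUID here are hypothetical, not necessarily the exact API in this
patch), the structure could look something like:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/**
 * Sketch of an in-memory tracker for Under Construction blocks. Entries
 * live only from the time a write pipeline opens a block until the block
 * is committed/finalized & appears in the StorageInfos.
 */
public class UnderConstructionBlocks {
  // Keyed by datanode UUID; values are the IDs of blocks currently open
  // for write on that datanode.
  private final Map<String, Set<Long>> ucBlocksByDatanode = new HashMap<>();

  public synchronized void addUcBlock(String datanodeUuid, long blockId) {
    ucBlocksByDatanode
        .computeIfAbsent(datanodeUuid, k -> new HashSet<>())
        .add(blockId);
  }

  public synchronized void removeUcBlock(String datanodeUuid, long blockId) {
    Set<Long> blocks = ucBlocksByDatanode.get(datanodeUuid);
    if (blocks != null) {
      blocks.remove(blockId);
      if (blocks.isEmpty()) {
        ucBlocksByDatanode.remove(datanodeUuid);
      }
    }
  }

  /**
   * The DatanodeAdminMonitor would consult this before transitioning a
   * datanode from DECOMMISSION_INPROGRESS to DECOMMISSIONED.
   */
  public synchronized boolean hasUnderConstructionBlocks(String datanodeUuid) {
    Set<Long> blocks = ucBlocksByDatanode.get(datanodeUuid);
    return blocks != null && !blocks.isEmpty();
  }
}
```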
### How was this patch tested?
`TODO - will add detailed test results`
### For code changes:
- [X] Does the title of this PR start with the corresponding JIRA issue id
(e.g. 'HADOOP-17799. Your PR title ...')?
- [X] Object storage: have the integration tests been executed and the
endpoint declared according to the connector-specific documentation?
- [n/a] If adding new dependencies to the code, are these dependencies
licensed in a way that is compatible for inclusion under [ASF
2.0](http://www.apache.org/legal/resolved.html#category-a)?
- [n/a] If applicable, have you updated the `LICENSE`, `LICENSE-binary`,
`NOTICE-binary` files?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]