[
https://issues.apache.org/jira/browse/HDFS-12422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16403807#comment-16403807
]
Konstantin Shvachko commented on HDFS-12422:
--------------------------------------------
For the record, this piece of code was [introduced way
back|http://svn.apache.org/viewvc?view=revision&revision=1091515] by HDFS-1606.
I think the current code is actually correct. We are in the
{{BlockConstructionStage.PIPELINE_CLOSE}} state. Adding nodes while the pipeline
is closing doesn't make sense to me: something went wrong, so the
client should just salvage whatever remains and let the NN recover the block.
And it seems the client does just that. I see that in
{{processDatanodeOrExternalError()}}, if the stage is {{PIPELINE_CLOSE}}, it
closes the block. I also see that this block replica is complete and good.
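To make the argument concrete, here is a minimal sketch of the decision as I read it. It is a simplified model, not the actual {{DataStreamer}} code: the class {{PipelineCloseSketch}}, the two-value {{Stage}} enum, the {{Action}} enum, and {{onPipelineError()}} are all hypothetical names used only for illustration.

```java
// Simplified, hypothetical model of the client-side decision.
// This is NOT the real DataStreamer logic; all names here are assumptions.
public class PipelineCloseSketch {

    // Hypothetical subset of stages, named after BlockConstructionStage values.
    enum Stage { DATA_STREAMING, PIPELINE_CLOSE }

    // Hypothetical outcomes of handling a pipeline error.
    enum Action { MAY_REPLACE_DATANODE, CLOSE_BLOCK_WITH_REMAINING_NODES }

    // While the pipeline is closing, all data has already been sent, so the
    // sketch skips datanode replacement and closes the block with what is left.
    static Action onPipelineError(Stage stage) {
        if (stage == Stage.PIPELINE_CLOSE) {
            return Action.CLOSE_BLOCK_WITH_REMAINING_NODES;
        }
        // In other stages a replacement may be attempted (policy dependent).
        return Action.MAY_REPLACE_DATANODE;
    }

    public static void main(String[] args) {
        for (Stage s : Stage.values()) {
            System.out.println(s + " -> " + onPipelineError(s));
        }
    }
}
```

The point of the sketch is only that the close stage is treated differently from the streaming stage; whether a replacement is actually attempted while streaming depends on the client's replace-datanode policy.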
Besides, adding DNs as you propose only makes the case rarer; it doesn't fully
solve it. If adding DNs fails, you get the same problem again.
So it seems you should look into why the NN does not replicate such a block. I
did not check the current code base, but here is how it should work.
# The pipeline failed with only one replica remaining, so the NN will not allow
the client to close the file. The write fails.
# The NN will not replicate the block because it is still under construction.
# One hour later the file lease expires and the NN starts lease recovery,
which triggers replica recovery.
# Once recovery finishes, the NN closes the file, and the block becomes
under-replicated.
# The replication monitor starts replication.
So eventually the block should be recovered; it just takes more than an hour.
If that doesn't happen, then we have a problem. Let me know.
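The expected sequence can be sketched as a small state machine. This is only an illustration of the steps listed above under my assumptions, not NameNode code: {{BlockRecoveryTimeline}}, {{BlockState}}, {{next()}}, and {{eligibleForReplication()}} are hypothetical names, and the 60-minute constant simply encodes the "one hour" lease expiry mentioned in the steps.

```java
// Hypothetical model of the recovery sequence described in the steps above.
// None of these names exist in Hadoop; they are assumptions for illustration.
public class BlockRecoveryTimeline {

    enum BlockState {
        UNDER_CONSTRUCTION,        // write failed, file still open
        IN_LEASE_RECOVERY,         // lease expired, replica recovery running
        COMPLETE_UNDER_REPLICATED, // file closed, fewer replicas than wanted
        COMPLETE_FULLY_REPLICATED  // replication monitor has caught up
    }

    // Assumed lease expiry, taken from the "one hour" in the comment.
    static final long LEASE_EXPIRY_MINUTES = 60;

    // Only a closed (complete) block is a candidate for replication.
    static boolean eligibleForReplication(BlockState state) {
        return state == BlockState.COMPLETE_UNDER_REPLICATED;
    }

    // Advance one step given the minutes elapsed since the failed write.
    static BlockState next(BlockState state, long minutesSinceFailedWrite) {
        switch (state) {
            case UNDER_CONSTRUCTION:
                // Nothing happens until the lease expires.
                return minutesSinceFailedWrite >= LEASE_EXPIRY_MINUTES
                        ? BlockState.IN_LEASE_RECOVERY
                        : BlockState.UNDER_CONSTRUCTION;
            case IN_LEASE_RECOVERY:
                // Recovery finishes and the file is closed.
                return BlockState.COMPLETE_UNDER_REPLICATED;
            case COMPLETE_UNDER_REPLICATED:
                // The replication monitor schedules the missing replicas.
                return BlockState.COMPLETE_FULLY_REPLICATED;
            default:
                return state;
        }
    }

    public static void main(String[] args) {
        BlockState s = BlockState.UNDER_CONSTRUCTION;
        System.out.println("t=30min: " + next(s, 30));
        // After expiry, walk the remaining transitions.
        for (int i = 0; i < 3; i++) {
            s = next(s, 61);
            System.out.println("after expiry, step " + (i + 1) + ": " + s);
        }
    }
}
```

In this model the block is never eligible for replication while it is under construction, which matches step 2; if the real NN never leaves that first state after the lease expires, that is the problem to investigate.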
> Replace DataNode in Pipeline when waiting for Last Packet fails
> ---------------------------------------------------------------
>
> Key: HDFS-12422
> URL: https://issues.apache.org/jira/browse/HDFS-12422
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: hdfs, hdfs-client
> Reporter: Lukas Majercak
> Assignee: Lukas Majercak
> Priority: Major
> Labels: hdfs
> Attachments: HDFS-12422.001.patch, HDFS-12422.002.patch
>
>
> # Create a file with replicationFactor = 4, minReplicas = 2
> # Fail waiting for the last packet, followed by 2 exceptions when recovering
> the leftover pipeline
> # The leftover pipeline will only have one DN and NN will never close such
> block, resulting in failure to write
> The block will stay there forever, unable to be replicated, ultimately going
> missing.