[ https://issues.apache.org/jira/browse/HDFS-4389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13551581#comment-13551581 ]
Daryn Sharp commented on HDFS-4389:
-----------------------------------
The problem was discovered while debugging random failures in
{{TestPersistBlocks}}. The {{TestRestartDfsWithFlush}} test does the following
(a code sketch follows the list):
# open a stream
# write 5 blocks
# flush
# wait for at least 1 block to be finalized, record size
# bounce the NN
# ensure file is at least as big as before bounce
# write 5 more blocks <- race condition blows up here
# close stream
# ensure all data is there
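A minimal sketch of that flow against the {{MiniDFSCluster}} test API; the
constants and write sizes here are illustrative, not the exact ones in
{{TestPersistBlocks}}:
{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.HdfsConfiguration;
import org.apache.hadoop.hdfs.MiniDFSCluster;

public class RestartDfsWithFlushSketch {
  // stand-in for 5 blocks of data; the real test sizes this from the block size
  static final byte[] DATA = new byte[5 * 4096];

  public static void main(String[] args) throws Exception {
    Configuration conf = new HdfsConfiguration();
    MiniDFSCluster cluster = new MiniDFSCluster.Builder(conf).build();
    try {
      FileSystem fs = cluster.getFileSystem();
      Path file = new Path("/persist-blocks");

      FSDataOutputStream out = fs.create(file);        // 1. open a stream
      out.write(DATA);                                 // 2. write 5 blocks
      out.hflush();                                    // 3. flush
      // 4. wait for at least 1 block to be finalized, record the size
      long sizeBefore = fs.getFileStatus(file).getLen();

      cluster.restartNameNode();                       // 5. bounce the NN

      // 6. ensure the file is at least as big as before the bounce
      if (fs.getFileStatus(file).getLen() < sizeBefore) {
        throw new AssertionError("file shrank across the NN restart");
      }

      out.write(DATA);                                 // 7. blows up here
      out.close();                                     // 8. close the stream
      // 9. ensure all data is there (read the file back and compare)
    } finally {
      cluster.shutdown();
    }
  }
}
{code}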
The problem occurs when {{DFSOutputStream.DataStreamer}} needs to call
{{DFSClient#addBlock}} while the NN is down. It receives a
{{ConnectException}} from the IPC layer, which isn't handled, so the streamer
stores the exception away and shuts down the stream. The next write to the
stream after the NN restarts then rethrows the stored connect exception.
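In simplified form the stored-exception pattern looks roughly like this; this
is a sketch of the behavior described above, not the actual
{{DFSOutputStream}} code:
{code:java}
import java.io.IOException;

// Sketch of the failure mode: the streamer thread caches the first
// IOException it sees and every later write rethrows it, even after
// the NN has come back.
class DataStreamerSketch extends Thread {
  private volatile IOException lastException;
  private volatile boolean streamerClosed;

  @Override
  public void run() {
    try {
      addBlockViaIpc();        // DFSClient#addBlock while the NN is down
    } catch (IOException e) {
      lastException = e;       // ConnectException stored away...
      streamerClosed = true;   // ...and the stream shuts down
    }
  }

  // called at the top of write(); rethrows the stale ConnectException
  void checkClosed() throws IOException {
    if (streamerClosed && lastException != null) {
      throw lastException;
    }
  }

  // placeholder for the IPC call that fails while the NN is down
  private void addBlockViaIpc() throws IOException {
    throw new java.net.ConnectException("NameNode is down");
  }
}
{code}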
The end result is that streams cannot survive a NN restart or a network
interruption that lasts longer than the time it takes to write a block. The
issue is probably general to all client methods.
> Non-HA DFSClients do not attempt reconnects
> -------------------------------------------
>
> Key: HDFS-4389
> URL: https://issues.apache.org/jira/browse/HDFS-4389
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: ha, hdfs-client
> Affects Versions: 2.0.0-alpha, 3.0.0
> Reporter: Daryn Sharp
> Priority: Critical
>
> The HA retry policy implementation appears to have broken non-HA
> {{DFSClient}} connect retries. The ipc
> {{Client.Connection#handleConnectionFailure}} used to perform 45 connection
> attempts, but now it consults a retry policy. For non-HA proxies, the policy
> does not handle {{ConnectException}}.
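For reference, one shape a fix could take is a non-HA retry policy that
treats {{ConnectException}} as retriable, restoring the old 45-attempt
behavior. The sketch below uses the {{o.a.h.io.retry.RetryPolicy}} interface,
whose signature has varied across Hadoop versions, so treat it as
illustrative only:
{code:java}
import java.net.ConnectException;
import org.apache.hadoop.io.retry.RetryPolicy;

// Illustrative only: retry ConnectException with a fixed sleep, up to the
// 45 attempts the old ipc Client.Connection#handleConnectionFailure made.
public class RetryOnConnectExceptionPolicy implements RetryPolicy {
  private static final int MAX_RETRIES = 45;
  private static final long SLEEP_MILLIS = 1000;

  @Override
  public RetryAction shouldRetry(Exception e, int retries, int failovers,
      boolean isIdempotentOrAtMostOnce) throws Exception {
    if (e instanceof ConnectException && retries < MAX_RETRIES) {
      return new RetryAction(RetryAction.RetryDecision.RETRY, SLEEP_MILLIS);
    }
    return RetryAction.FAIL;
  }
}
{code}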