[
https://issues.apache.org/jira/browse/HDFS-16127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17379961#comment-17379961
]
Kihwal Lee commented on HDFS-16127:
-----------------------------------
*How the bug manifests*
When the DataStreamer thread's main loop sees an empty close packet on the dataQueue, it
flushes everything and waits until all outstanding acks are received. It then
sends the close packet to signal the datanodes to finalize the replicas, and
waits for the ack to that final packet by calling waitForAllAcks().
Prior to HDFS-15813 this wait involved no network activity, but after that
change network failures became possible during it. The following is the client
log entry from one of the failure/data loss cases.
{noformat}
org.apache.hadoop.hdfs.DataStreamer: DataStreamer Exception
java.io.IOException: Connection reset by peer
    at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
    at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47)
    at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
    at sun.nio.ch.IOUtil.write(IOUtil.java:65)
    at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:468)
    at org.apache.hadoop.net.SocketOutputStream$Writer.performIO(SocketOutputStream.java:63)
    at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
    at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:159)
    at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:117)
    at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
    at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140)
    at java.io.DataOutputStream.flush(DataOutputStream.java:123)
    at org.apache.hadoop.hdfs.DataStreamer.sendPacket(DataStreamer.java:857)
    at org.apache.hadoop.hdfs.DataStreamer.sendHeartbeat(DataStreamer.java:875)
    at org.apache.hadoop.hdfs.DataStreamer.waitForAllAcks(DataStreamer.java:845)
    at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:798)
{noformat}
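The trace shows where the new network activity comes from: waitForAllAcks() sends
periodic heartbeats over the pipeline while it waits, and each heartbeat is a
socket write that can fail. The following is only a simplified sketch of that
shape, inferred from the trace; it is not the actual DataStreamer code, and the
interval constant is hypothetical.
{code:java}
// Illustrative sketch only (not the real DataStreamer implementation).
// Method and field names follow the stack trace above; HEARTBEAT_INTERVAL_MS is made up.
private void waitForAllAcks() throws IOException, InterruptedException {
  synchronized (dataQueue) {
    while (!ackQueue.isEmpty()) {
      // Since HDFS-15813, the wait is no longer passive: a heartbeat packet is
      // written to the pipeline to keep it alive. If the datanodes have already
      // finalized the block and closed their sockets, this write can fail with
      // "Connection reset by peer" even though the final ack was delivered and
      // processed by the ResponseProcessor.
      sendHeartbeat();
      dataQueue.wait(HEARTBEAT_INTERVAL_MS);
    }
  }
}
{code}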
This exception resulted in a close recovery, because the pipeline stage was set
to PIPELINE_CLOSE at that point. However, the client had seen no error from its
ResponseProcessor, meaning it had actually received the final ack and removed
the final packet from the ackQueue. Since the ResponseProcessor shut itself down
cleanly, there was no sign of any read or write error from that thread.
The following is the main part of the close recovery code, after a connection
is successfully established.
{code:java}
DFSPacket endOfBlockPacket = dataQueue.remove(); // remove the end of block packet
assert endOfBlockPacket.isLastPacketInBlock();
assert lastAckedSeqno == endOfBlockPacket.getSeqno() - 1;
lastAckedSeqno = endOfBlockPacket.getSeqno();
pipelineRecoveryCount = 0;
dataQueue.notifyAll();
{code}
The asserts would have prevented the bug from propagating if they had been
active. (They are only enabled during testing.) The recovery code blindly
requeues the content of the ackQueue, assuming the unacked final packet is still
there. In this failure case it is not, as the final packet was actually acked.
The datanodes closed the connections normally, which resulted in a "connection
reset" for the data streamer, which was stuck sending a heartbeat. The recovery
then simply dequeues one packet and tosses it away, since that packet is
supposed to contain no data, and it even erroneously updates lastAckedSeqno.
At this point the first packet belonging to the next block has been thrown
away. When the next block is written, the datanodes complain that the first
packet's offset is non-zero. This is irrecoverable, and after 5 retries the
write fails.
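For illustration only, a runtime guard along the following lines would have
stopped the faulty dequeue at this point. This is a sketch of the idea, not
necessarily the actual HDFS-16127 fix; it reuses the names from the snippet
above.
{code:java}
// Illustrative guard only, not necessarily the committed fix: verify at runtime
// (not just via asserts) that the head of dataQueue really is the still-unacked
// end-of-block packet before discarding it.
DFSPacket head = dataQueue.peek();
if (head == null || !head.isLastPacketInBlock()
    || head.getSeqno() != lastAckedSeqno + 1) {
  // The close packet was already acked and removed; do not throw away a packet
  // that belongs to the next block, and do not advance lastAckedSeqno.
  throw new IOException("Unexpected dataQueue state in close recovery: " + head);
}
DFSPacket endOfBlockPacket = dataQueue.remove();
lastAckedSeqno = endOfBlockPacket.getSeqno();
pipelineRecoveryCount = 0;
dataQueue.notifyAll();
{code}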
*Data loss*
These errors usually cause a permanent write failure: the client cannot write
any further because the first packet it sends for the block (actually the 2nd
packet, but the bug makes it the first one to be sent) starts at a non-zero
offset. The errors are therefore propagated back to the user and result in a
task attempt failure, etc.
However, if the remaining bytes to be written to a new block fit in one packet,
something worse happens. When the client hits the close-recovery bug, the one
data packet dropped by the faulty recovery is the only data packet for the next
block; the packet left after it is that block's final close packet. When the
client continues to the next block, the datanode rejects the write because the
close packet has a non-zero offset.
But instead of causing a permanent write failure, the client now enters a
close-recovery phase, since the pipeline stage was set to PIPELINE_CLOSE while
sending that packet. The new connection header tells the datanodes that it is a
close recovery, so they simply close the zero-byte block file with the specified
gen stamp. The recovery appears successful, so the client gets no error and the
file is closed normally. Users do not see any exception even though the data has
been dropped. The namenode will later report a missing block due to the size
mismatch.
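As a concrete illustration of this sequence, here is a small self-contained
walk-through (toy code, not HDFS internals; packet fields, seqnos and sizes are
made up) of how the faulty recovery leaves only the close packet of the next
block behind.
{code:java}
// Toy, self-contained walk-through of the silent data-loss case described above.
import java.util.ArrayDeque;
import java.util.Deque;

public class CloseRecoveryDataLossDemo {
  // Minimal stand-in for DFSPacket: seqno, offset in block, last-packet flag.
  record Packet(long seqno, long offsetInBlock, boolean lastPacketInBlock) {}

  public static void main(String[] args) {
    Deque<Packet> dataQueue = new ArrayDeque<>();

    // Block N was fully acked, including its close packet, but the heartbeat
    // write failed, so the client still enters close recovery. Meanwhile the
    // writer has already queued block N+1: one small data packet (all
    // remaining bytes fit in it) plus that block's close packet.
    dataQueue.add(new Packet(101, 0, false));   // only data packet of block N+1
    dataQueue.add(new Packet(102, 512, true));  // close packet of block N+1

    // Faulty close recovery: it assumes the head of dataQueue is still the
    // unacked close packet of block N and discards it. With asserts disabled,
    // nothing notices that real data was just thrown away.
    Packet discarded = dataQueue.remove();
    System.out.println("Recovery silently dropped: " + discarded);

    // All that remains for block N+1 is its close packet at a non-zero offset.
    // The datanode rejects it as a normal write, the client retries it as a
    // close recovery, and the zero-byte block is finalized: the 512 bytes are
    // gone with no error surfaced to the user.
    System.out.println("Remaining for next block: " + dataQueue);
  }
}
{code}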
> Improper pipeline close recovery causes a permanent write failure or data
> loss.
> -------------------------------------------------------------------------------
>
> Key: HDFS-16127
> URL: https://issues.apache.org/jira/browse/HDFS-16127
> Project: Hadoop HDFS
> Issue Type: Bug
> Reporter: Kihwal Lee
> Priority: Major
>
> When a block is being closed, the data streamer in the client waits for the
> final ACK to be delivered. If an exception is received during this wait, the
> close is retried. This assumption was invalidated by HDFS-15813, resulting
> in permanent write failures in some close error cases involving slow nodes.
> There are also less frequent cases of data loss.