[
https://issues.apache.org/jira/browse/HDFS-16127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17379961#comment-17379961
]
Kihwal Lee commented on HDFS-16127:
-----------------------------------
*How the bug manifests*
When the DataStreamer thread's main loop sees an empty close packet on the dataQueue, it
flushes everything and waits until all outstanding acks are received. It then
sends the close packet to signal the datanodes to finalize the replicas, and
waits for the ack to that final packet by calling waitForAllAcks().
Prior to HDFS-15813 this wait involved no network activity, but after that
change network failures became possible during it. The following is the client
log entry from one of the failure/data loss cases.
{noformat}
org.apache.hadoop.hdfs.DataStreamer: DataStreamer Exception
java.io.IOException: Connection reset by peer
    at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
    at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47)
    at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
    at sun.nio.ch.IOUtil.write(IOUtil.java:65)
    at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:468)
    at org.apache.hadoop.net.SocketOutputStream$Writer.performIO(SocketOutputStream.java:63)
    at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
    at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:159)
    at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:117)
    at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
    at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140)
    at java.io.DataOutputStream.flush(DataOutputStream.java:123)
    at org.apache.hadoop.hdfs.DataStreamer.sendPacket(DataStreamer.java:857)
    at org.apache.hadoop.hdfs.DataStreamer.sendHeartbeat(DataStreamer.java:875)
    at org.apache.hadoop.hdfs.DataStreamer.waitForAllAcks(DataStreamer.java:845)
    at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:798)
{noformat}
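The trace shows where the new network activity comes from: waitForAllAcks() sends
periodic heartbeats over the pipeline while it waits, and each heartbeat is a
socket write that can fail. The following is only a simplified sketch of that
shape, inferred from the trace; it is not the actual DataStreamer code, and the
interval constant is hypothetical.
{code:java}
// Illustrative sketch only (not the real DataStreamer implementation).
// Method and field names follow the stack trace above; HEARTBEAT_INTERVAL_MS is made up.
private void waitForAllAcks() throws IOException, InterruptedException {
  synchronized (dataQueue) {
    while (!ackQueue.isEmpty()) {
      // Since HDFS-15813, the wait is no longer passive: a heartbeat packet is
      // written to the pipeline to keep it alive. If the datanodes have already
      // finalized the block and closed their sockets, this write can fail with
      // "Connection reset by peer" even though the final ack was delivered and
      // processed by the ResponseProcessor.
      sendHeartbeat();
      dataQueue.wait(HEARTBEAT_INTERVAL_MS);
    }
  }
}
{code}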
This exception resulted in a close recovery, because the pipeline stage was set
to PIPELINE_CLOSE at that point. However, the client had seen no error from its
ResponseProcessor, meaning it had actually received the final ack and removed
the final packet from the ackQueue. Since the ResponseProcessor shut itself down
cleanly, there was no sign of any read or write error from that thread.
The following is the main part of the close recovery code, after a connection
is successfully established.
{code:java}
DFSPacket endOfBlockPacket = dataQueue.remove(); // remove the end of block packet
assert endOfBlockPacket.isLastPacketInBlock();
assert lastAckedSeqno == endOfBlockPacket.getSeqno() - 1;
lastAckedSeqno = endOfBlockPacket.getSeqno();
pipelineRecoveryCount = 0;
dataQueue.notifyAll();
{code}
The asserts would have prevented the bug from propagating if they had been
active. (They are only enabled during testing.) The recovery code blindly
requeues the content of the ackQueue, assuming the unacked final packet is still
there. In this failure case it is not, as the final packet was actually acked.
The datanodes closed the connections normally, which resulted in a "connection
reset" for the data streamer, which was stuck sending a heartbeat. The recovery
then simply dequeues one packet and tosses it away, since that packet is
supposed to contain no data, and it even erroneously updates lastAckedSeqno.
At this point the first packet belonging to the next block has been thrown
away. When the next block is written, the datanodes complain that the first
packet's offset is non-zero. This is irrecoverable, and after 5 retries the
write fails.
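For illustration only, a runtime guard along the following lines would have
stopped the faulty dequeue at this point. This is a sketch of the idea, not
necessarily the actual HDFS-16127 fix; it reuses the names from the snippet
above.
{code:java}
// Illustrative guard only, not necessarily the committed fix: verify at runtime
// (not just via asserts) that the head of dataQueue really is the still-unacked
// end-of-block packet before discarding it.
DFSPacket head = dataQueue.peek();
if (head == null || !head.isLastPacketInBlock()
    || head.getSeqno() != lastAckedSeqno + 1) {
  // The close packet was already acked and removed; do not throw away a packet
  // that belongs to the next block, and do not advance lastAckedSeqno.
  throw new IOException("Unexpected dataQueue state in close recovery: " + head);
}
DFSPacket endOfBlockPacket = dataQueue.remove();
lastAckedSeqno = endOfBlockPacket.getSeqno();
pipelineRecoveryCount = 0;
dataQueue.notifyAll();
{code}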
*Data loss*
These errors usually cause a permanent write failure: the client cannot write
any further because the first packet it sends for the block (actually the 2nd
packet, but the bug makes it the first one to be sent) starts at a non-zero
offset. The errors are therefore propagated back to the user and result in a
task attempt failure, etc.
However, if the remaining bytes to be written to a new block fit in one packet,
something worse happens. When the client hits the close-recovery bug, the one
data packet dropped by the faulty recovery is the only data packet for the next
block; the packet left after it is that block's final close packet. When the
client continues to the next block, the datanode rejects the write because the
close packet has a non-zero offset.
But instead of causing a permanent write failure, the client now enters a
close-recovery phase, since the pipeline stage was set to PIPELINE_CLOSE while
sending that packet. The new connection header tells the datanodes that it is a
close recovery, so they simply close the zero-byte block file with the specified
gen stamp. The recovery appears successful, so the client gets no error and the
file is closed normally. Users do not see any exception even though the data has
been dropped. The namenode will later report a missing block due to the size
mismatch.
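As a concrete illustration of this sequence, here is a small self-contained
walk-through (toy code, not HDFS internals; packet fields, seqnos and sizes are
made up) of how the faulty recovery leaves only the close packet of the next
block behind.
{code:java}
// Toy, self-contained walk-through of the silent data-loss case described above.
import java.util.ArrayDeque;
import java.util.Deque;

public class CloseRecoveryDataLossDemo {
  // Minimal stand-in for DFSPacket: seqno, offset in block, last-packet flag.
  record Packet(long seqno, long offsetInBlock, boolean lastPacketInBlock) {}

  public static void main(String[] args) {
    Deque<Packet> dataQueue = new ArrayDeque<>();

    // Block N was fully acked, including its close packet, but the heartbeat
    // write failed, so the client still enters close recovery. Meanwhile the
    // writer has already queued block N+1: one small data packet (all
    // remaining bytes fit in it) plus that block's close packet.
    dataQueue.add(new Packet(101, 0, false));   // only data packet of block N+1
    dataQueue.add(new Packet(102, 512, true));  // close packet of block N+1

    // Faulty close recovery: it assumes the head of dataQueue is still the
    // unacked close packet of block N and discards it. With asserts disabled,
    // nothing notices that real data was just thrown away.
    Packet discarded = dataQueue.remove();
    System.out.println("Recovery silently dropped: " + discarded);

    // All that remains for block N+1 is its close packet at a non-zero offset.
    // The datanode rejects it as a normal write, the client retries it as a
    // close recovery, and the zero-byte block is finalized: the 512 bytes are
    // gone with no error surfaced to the user.
    System.out.println("Remaining for next block: " + dataQueue);
  }
}
{code}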
> Improper pipeline close recovery causes a permanent write failure or data
> loss.
> -------------------------------------------------------------------------------
>
> Key: HDFS-16127
> URL: https://issues.apache.org/jira/browse/HDFS-16127
> Project: Hadoop HDFS
> Issue Type: Bug
> Reporter: Kihwal Lee
> Priority: Major
>
> When a block is being closed, the data streamer in the client waits for the
> final ACK to be delivered. If an exception is received during this wait, the
> close is retried. This assumption was invalidated by HDFS-15813, resulting
> in permanent write failures in some close error cases involving slow nodes.
> There are also less frequent cases of data loss.