[ https://issues.apache.org/jira/browse/HDFS-4504?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13745691#comment-13745691 ]
Colin Patrick McCabe commented on HDFS-4504:
--------------------------------------------

Vinay wrote:

bq. While handling the Zombie stream, ZombieStreamManager can report to NameNode via some new RPC as this stream is zombie.

A DFSOutputStream is a zombie for one of two reasons:
1. The client can't contact the NameNode (perhaps because of a network problem).
2. The client asked the NameNode to complete the file and it refused, because the NN does not (yet?) have a record that all of the file's blocks are present and complete.

In scenario #1, we can't tell the NameNode anything, because we can't talk to it. In scenario #2, the NameNode already knows everything it needs to know about the file. It doesn't care whether we consider the file a zombie or not-- why would it? All it knows is that the file isn't complete yet.

The big picture for this change is that we're trying to prevent a scenario where the DFSOutputStream is never closeable and leaks resources forever. In order to do that, we sometimes have to make some unpleasant choices. One of them is that if there was a data streamer failure, we complete the file anyway after a configurable time period (currently 2 minutes). If you don't like this policy, you can just set the period so long that it corresponds to the lease recovery period (see the configuration sketch at the end of this message).

As I said before, the current code doesn't do anything special in the case of a data streamer failure in DFSOutputStream#close. It just throws up its hands and says "oh well, guess that data's gone!" After the hard-lease period expires, we will complete the file anyway. So the behavior with this patch is exactly the same as without it-- only the timeout is different.

It sounds like what you want to do is somehow "try harder" to fix the data streamer failure when you know the file is being closed. This might be a good idea, but we should do it in a future JIRA. This patch is big enough, and changes enough things, already.

> DFSOutputStream#close doesn't always release resources (such as leases)
> ------------------------------------------------------------------------
>
>                 Key: HDFS-4504
>                 URL: https://issues.apache.org/jira/browse/HDFS-4504
>             Project: Hadoop HDFS
>          Issue Type: Bug
>            Reporter: Colin Patrick McCabe
>            Assignee: Colin Patrick McCabe
>         Attachments: HDFS-4504.001.patch, HDFS-4504.002.patch, HDFS-4504.007.patch, HDFS-4504.008.patch, HDFS-4504.009.patch, HDFS-4504.010.patch, HDFS-4504.011.patch, HDFS-4504.014.patch, HDFS-4504.015.patch
>
>
> {{DFSOutputStream#close}} can throw an {{IOException}} in some cases. One example is if there is a pipeline error and then pipeline recovery fails. Unfortunately, in this case, some of the resources used by the {{DFSOutputStream}} are leaked. One particularly important resource is file leases.
>
> So it's possible for a long-lived HDFS client, such as Flume, to write many blocks to a file but then fail to close it. Unfortunately, the {{LeaseRenewerThread}} inside the client will continue to renew the lease for the "undead" file. Future attempts to close the file will just rethrow the previous exception, and no progress can be made by the client.
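To make the timeout policy above concrete, here is a minimal sketch of how a client could stretch the complete-anyway period out to the hard-lease limit. Note that {{dfs.client.close.timeout.ms}} is a placeholder key name for illustration only; the real key is whatever the patch defines.

{code:java}
import org.apache.hadoop.conf.Configuration;

public class CloseTimeoutSketch {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Placeholder key name -- the actual key is defined by the patch.
    // Raising the complete-anyway period to the hard-lease limit (1 hour)
    // effectively restores the old "wait for lease recovery" behavior.
    conf.setLong("dfs.client.close.timeout.ms", 60L * 60L * 1000L);
    System.out.println("close timeout (ms): "
        + conf.getLong("dfs.client.close.timeout.ms", 0L));
  }
}
{code}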
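And to make the description's failure mode concrete: the leak happens because {{close()}} can throw before the lease bookkeeping runs, and a later {{close()}} just rethrows the remembered exception. Here is a minimal sketch of that pattern and of the gist of the fix, using illustrative names rather than the real {{DFSOutputStream}} internals:

{code:java}
import java.io.Closeable;
import java.io.IOException;

class OutputStreamSketch implements Closeable {
  private IOException lastException; // remembered across failed close() attempts

  @Override
  public void close() throws IOException {
    if (lastException != null) {
      // The "undead file" behavior from the description: every later
      // close() just rethrows the earlier failure, so the client can
      // never make progress on this file.
      throw lastException;
    }
    try {
      flushInternal();  // may throw if pipeline recovery failed
      completeFile();   // ask the NameNode to complete the file
    } catch (IOException e) {
      lastException = e;
      throw e;
    } finally {
      // The gist of the fix: stop renewing the lease whether or not the
      // flush/complete succeeded, so the stream can't leak forever.
      endFileLease();
    }
  }

  private void flushInternal() throws IOException { /* flush queued packets */ }
  private void completeFile() throws IOException { /* complete-file RPC to the NN */ }
  private void endFileLease() { /* remove this file from the lease renewer */ }
}
{code}

The {{finally}} block is the point: whether or not the flush and complete succeed, the client stops renewing the lease, so an "undead" file can't pin resources forever.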