[jira] [Commented] (HDFS-7314) When the DFSClient lease cannot be renewed, abort open-for-write files rather than the entire DFSClient

Hudson (JIRA) Thu, 16 Jul 2015 17:11:21 -0700

    [ 
https://issues.apache.org/jira/browse/HDFS-7314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14630577#comment-14630577
 ]


Hudson commented on HDFS-7314:
------------------------------

FAILURE: Integrated in Hadoop-Hdfs-trunk-Java8 #247 (See 
[https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/247/])
HDFS-7314. When the DFSClient lease cannot be renewed, abort open-for-write 
files rather than the entire DFSClient. (mingma) (mingma: rev 
fbd88f1062f3c4b208724d208e3f501eb196dfab)
* 
hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestDFSClientRetries.java
* 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSClient.java
* hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
* 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/client/impl/LeaseRenewer.java
Move HDFS-7314 to 2.8 section in CHANGES.txt (mingma: rev 
0bda84fd48681ac1748a4770cff2f23e8336d276)
* hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt


> When the DFSClient lease cannot be renewed, abort open-for-write files rather 
> than the entire DFSClient
> -------------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-7314
>                 URL: https://issues.apache.org/jira/browse/HDFS-7314
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>            Reporter: Ming Ma
>            Assignee: Ming Ma
>              Labels: BB2015-05-TBR
>             Fix For: 2.8.0
>
>         Attachments: HDFS-7314-2.patch, HDFS-7314-3.patch, HDFS-7314-4.patch, 
> HDFS-7314-5.patch, HDFS-7314-6.patch, HDFS-7314-7.patch, HDFS-7314-8.patch, 
> HDFS-7314-9.patch, HDFS-7314.patch
>
>
> It happened in YARN nodemanger scenario. But it could happen to any long 
> running service that use cached instance of DistrbutedFileSystem.
> 1. Active NN is under heavy load. So it became unavailable for 10 minutes; 
> any DFSClient request will get ConnectTimeoutException.
> 2. YARN nodemanager use DFSClient for certain write operation such as log 
> aggregator or shared cache in YARN-1492. DFSClient used by YARN NM's 
> renewLease RPC got ConnectTimeoutException.
> {noformat}
> 2014-10-29 01:36:19,559 WARN org.apache.hadoop.hdfs.LeaseRenewer: Failed to 
> renew lease for [DFSClient_NONMAPREDUCE_-550838118_1] for 372 seconds.  
> Aborting ...
> {noformat}
> 3. After DFSClient is in Aborted state, YARN NM can't use that cached 
> instance of DistributedFileSystem.
> {noformat}
> 2014-10-29 20:26:23,991 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService:
>  Failed to download rsrc...
> java.io.IOException: Filesystem closed
>         at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:727)
>         at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:1780)
>         at 
> org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1124)
>         at 
> org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1120)
>         at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>         at 
> org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1120)
>         at org.apache.hadoop.yarn.util.FSDownload.copy(FSDownload.java:237)
>         at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:340)
>         at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:57)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>         at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>         at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>         at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>         at java.lang.Thread.run(Thread.java:745)
> {noformat}
> We can make YARN or DFSClient more tolerant to temporary NN unavailability. 
> Given the callstack is YARN -> DistributedFileSystem -> DFSClient, this can 
> be addressed at different layers.
> * YARN closes the DistributedFileSystem object when it receives some well 
> defined exception. Then the next HDFS call will create a new instance of 
> DistributedFileSystem. We have to fix all the places in YARN. Plus other HDFS 
> applications need to address this as well.
> * DistributedFileSystem detects Aborted DFSClient and create a new instance 
> of DFSClient. We will need to fix all the places DistributedFileSystem calls 
> DFSClient.
> * After DFSClient gets into Aborted state, it doesn't have to reject all 
> requests , instead it can retry. If NN is available again it can transition 
> to healthy state.
> Comments?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (HDFS-7314) When the DFSClient lease cannot be renewed, abort open-for-write files rather than the entire DFSClient

Reply via email to