[jira] [Updated] (HDFS-7314) Aborted DFSClient's impact on long running service like YARN

Ming Ma (JIRA) Mon, 03 Nov 2014 22:53:03 -0800

     [ https://issues.apache.org/jira/browse/HDFS-7314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Ming Ma updated HDFS-7314:
--------------------------
    Attachment: HDFS-7314.patch

Thanks [~kihwal] and [~cmccabe] for the good suggestions.

Here is an initial patch that changes the behavior of DFSClient's abort. Some
scenarios might prefer the current behavior, so the new behavior is
configurable. Unit test results look good, so we don't have to define a new
abortOutputStream function. To make sure it works when the application tries
to create files while the LeaseRenewer thread is aborting, the LeaseRenewer
thread no longer exits when it receives SocketTimeoutException; otherwise, it
is possible that no thread will handle lease renewal for the newly created
files.
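
To illustrate the lease renewal change described above, here is a minimal,
self-contained sketch. It is not the code in the patch: LeaseRpc and
LeaseRenewerSketch are hypothetical names, the RPC is a stand-in interface, and
every timeout is treated as an abort for simplicity. It only shows why the
renewal loop keeps running after a SocketTimeoutException, so that a file
created during or after the abort still has a thread renewing its lease.

```java
import java.net.SocketTimeoutException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for the renewLease RPC to the NameNode. */
interface LeaseRpc {
    void renewLease(String clientName) throws SocketTimeoutException;
}

/** Simplified, hypothetical model of a lease renewer; not the real LeaseRenewer. */
class LeaseRenewerSketch {
    private final LeaseRpc rpc;
    private final String clientName;
    private final List<String> filesBeingWritten = new ArrayList<>();
    private int abortCount = 0;
    private int renewCount = 0;

    LeaseRenewerSketch(LeaseRpc rpc, String clientName) {
        this.rpc = rpc;
        this.clientName = clientName;
    }

    synchronized void addFile(String path) { filesBeingWritten.add(path); }
    synchronized int openFileCount() { return filesBeingWritten.size(); }
    synchronized int abortCount() { return abortCount; }
    synchronized int renewCount() { return renewCount; }

    /**
     * One iteration of the renewal loop.
     * Returns true if the loop should keep running.
     */
    synchronized boolean renewOnce() {
        try {
            rpc.renewLease(clientName);
            renewCount++;
        } catch (SocketTimeoutException e) {
            // Abort only the streams that were open when the renewal failed.
            filesBeingWritten.clear();
            abortCount++;
            // Do not exit here: a file created while the abort is in
            // progress still needs this loop to renew its lease.
        }
        return true;
    }
}
```

In this sketch, a caller would invoke renewOnce() in a loop; a file added
after a failed renewal is picked up by the next iteration instead of being
left without a renewer.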

The patch also fixes the incorrect log message and adds some helper functions
to LeaseRenewer to help with unit tests.

> Aborted DFSClient's impact on long running service like YARN
> ------------------------------------------------------------
>
>                 Key: HDFS-7314
>                 URL: https://issues.apache.org/jira/browse/HDFS-7314
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>            Reporter: Ming Ma
>         Attachments: HDFS-7314.patch
>
>
> This happened in a YARN NodeManager scenario, but it could happen to any
> long running service that uses a cached instance of DistributedFileSystem.
> 1. The active NN was under heavy load and became unavailable for 10 minutes;
> any DFSClient request got ConnectTimeoutException.
> 2. The YARN NodeManager uses DFSClient for certain write operations, such as
> the log aggregator or the shared cache in YARN-1492. The renewLease RPC of
> the DFSClient used by the YARN NM got ConnectTimeoutException.
> {noformat}
> 2014-10-29 01:36:19,559 WARN org.apache.hadoop.hdfs.LeaseRenewer: Failed to renew lease for [DFSClient_NONMAPREDUCE_-550838118_1] for 372 seconds.  Aborting ...
> {noformat}
> 3. After DFSClient is in Aborted state, YARN NM can't use that cached 
> instance of DistributedFileSystem.
> {noformat}
> 2014-10-29 20:26:23,991 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Failed to download rsrc...
> java.io.IOException: Filesystem closed
>         at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:727)
>         at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:1780)
>         at org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1124)
>         at org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1120)
>         at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>         at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1120)
>         at org.apache.hadoop.yarn.util.FSDownload.copy(FSDownload.java:237)
>         at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:340)
>         at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:57)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>         at java.lang.Thread.run(Thread.java:745)
> {noformat}
> We can make YARN or DFSClient more tolerant of temporary NN unavailability.
> Given that the call stack is YARN -> DistributedFileSystem -> DFSClient, this
> can be addressed at different layers.
> * YARN closes the DistributedFileSystem object when it receives some
> well-defined exception. The next HDFS call then creates a new instance of
> DistributedFileSystem. We would have to fix all the places in YARN, and other
> HDFS applications would need to address this as well.
> * DistributedFileSystem detects the Aborted DFSClient and creates a new
> instance of DFSClient. We would need to fix all the places where
> DistributedFileSystem calls DFSClient.
> * After DFSClient gets into the Aborted state, it doesn't have to reject all
> requests; instead, it can retry. If the NN is available again, it can
> transition back to a healthy state.
> Comments?
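
As a rough illustration of the second option in the quoted description (a
wrapping layer detects the closed client and creates a new instance), here is a
minimal sketch. FsClient and RecoveringFsClient are hypothetical names and do
not correspond to Hadoop classes. Matching on the "Filesystem closed" message
from the stack trace above is an assumption made for brevity and would be
brittle in practice.

```java
import java.io.IOException;
import java.util.function.Supplier;

/** Hypothetical, minimal client interface; not a Hadoop API. */
interface FsClient {
    String getFileInfo(String path) throws IOException;
}

/** Hypothetical wrapper that replaces a closed client and retries once. */
class RecoveringFsClient implements FsClient {
    private final Supplier<FsClient> factory;
    private FsClient delegate;

    RecoveringFsClient(Supplier<FsClient> factory) {
        this.factory = factory;
        this.delegate = factory.get();
    }

    @Override
    public synchronized String getFileInfo(String path) throws IOException {
        try {
            return delegate.getFileInfo(path);
        } catch (IOException e) {
            // Assumption: a closed client reports exactly this message,
            // as in the stack trace quoted in the issue description.
            if (!"Filesystem closed".equals(e.getMessage())) {
                throw e; // unrelated failure: propagate unchanged
            }
            // Replace the unusable client and retry the call once.
            delegate = factory.get();
            return delegate.getFileInfo(path);
        }
    }
}
```

The drawback noted in the description still applies: every call path that
reaches the client would need to go through a wrapper like this.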



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
