[jira] [Commented] (HDFS-7314) Aborted DFSClient's impact on long running service like YARN
[ https://issues.apache.org/jira/browse/HDFS-7314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14619618#comment-14619618 ] Hadoop QA commented on HDFS-7314:
----------------------------------

(x) *{color:red}-1 overall{color}*

|| Vote || Subsystem || Runtime || Comment ||
| {color:red}-1{color} | pre-patch | 17m 38s | Pre-patch trunk has 1 extant Findbugs (version 3.0.0) warning. |
| {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. |
| {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 1 new or modified test file. |
| {color:red}-1{color} | javac | 8m 9s | The applied patch generated 1 additional warning message. |
| {color:green}+1{color} | javadoc | 10m 6s | There were no new javadoc warning messages. |
| {color:green}+1{color} | release audit | 0m 21s | The applied patch does not increase the total number of release audit warnings. |
| {color:red}-1{color} | checkstyle | 1m 26s | The applied patch generated 3 new checkstyle issues (total was 138, now 139). |
| {color:green}+1{color} | whitespace | 0m 0s | The patch has no lines that end in whitespace. |
| {color:green}+1{color} | install | 1m 27s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse | 0m 33s | The patch built with eclipse:eclipse. |
| {color:green}+1{color} | findbugs | 2m 37s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. |
| {color:green}+1{color} | native | 3m 10s | Pre-build of native portion. |
| {color:red}-1{color} | hdfs tests | 159m 27s | Tests failed in hadoop-hdfs. |
| | | 204m 59s | |

|| Reason || Tests ||
| Failed unit tests | hadoop.hdfs.TestLeaseRecovery2 |

|| Subsystem || Report/Notes ||
| Patch URL | http://issues.apache.org/jira/secure/attachment/12744319/HDFS-7314-8.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | trunk / 2e3d83f |
| Pre-patch Findbugs warnings | https://builds.apache.org/job/PreCommit-HDFS-Build/11630/artifact/patchprocess/trunkFindbugsWarningshadoop-hdfs.html |
| javac | https://builds.apache.org/job/PreCommit-HDFS-Build/11630/artifact/patchprocess/diffJavacWarnings.txt |
| checkstyle | https://builds.apache.org/job/PreCommit-HDFS-Build/11630/artifact/patchprocess/diffcheckstylehadoop-hdfs.txt |
| hadoop-hdfs test log | https://builds.apache.org/job/PreCommit-HDFS-Build/11630/artifact/patchprocess/testrun_hadoop-hdfs.txt |
| Test Results | https://builds.apache.org/job/PreCommit-HDFS-Build/11630/testReport/ |
| Java | 1.7.0_55 |
| uname | Linux asf904.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | https://builds.apache.org/job/PreCommit-HDFS-Build/11630/console |

This message was automatically generated.

> Aborted DFSClient's impact on long running service like YARN
> -------------------------------------------------------------
>
>                 Key: HDFS-7314
>                 URL: https://issues.apache.org/jira/browse/HDFS-7314
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>            Reporter: Ming Ma
>            Assignee: Ming Ma
>              Labels: BB2015-05-TBR
>         Attachments: HDFS-7314-2.patch, HDFS-7314-3.patch, HDFS-7314-4.patch, HDFS-7314-5.patch, HDFS-7314-6.patch, HDFS-7314-7.patch, HDFS-7314-8.patch, HDFS-7314.patch
>
> This happened in a YARN nodemanager scenario, but it could happen to any long-running service that uses a cached instance of DistributedFileSystem.
> 1. The active NN is under heavy load, so it became unavailable for 10 minutes; any DFSClient request will get ConnectTimeoutException.
> 2. The YARN nodemanager uses DFSClient for certain write operations, such as the log aggregator or the shared cache in YARN-1492. The renewLease RPC of the DFSClient used by the YARN NM got ConnectTimeoutException.
> {noformat}
> 2014-10-29 01:36:19,559 WARN org.apache.hadoop.hdfs.LeaseRenewer: Failed to renew lease for [DFSClient_NONMAPREDUCE_-550838118_1] for 372 seconds. Aborting ...
> {noformat}
> 3. After the DFSClient is in the Aborted state, the YARN NM can't use that cached instance of DistributedFileSystem.
> {noformat}
> 2014-10-29 20:26:23,991 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Failed to download rsrc...
> java.io.IOException: Filesystem closed
>         at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:727)
>         at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:1780)
>         at org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1124)
>         at org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1120)
>         at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>         at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1120)
>         at org.apache.hadoop.yarn.util.FSDownload.copy(FSDownload.java:237)
>         at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:340)
>         at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:57)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>         at java.lang.Thread.run(Thread.java:745)
> {noformat}
> We can make YARN or DFSClient more tolerant of temporary NN unavailability. Given that the call stack is YARN -> DistributedFileSystem -> DFSClient, this can be addressed at different layers:
> * YARN closes the DistributedFileSystem object when it receives some well-defined exception; the next HDFS call will then create a new instance of DistributedFileSystem. We would have to fix all the relevant places in YARN, and other HDFS applications would need to do the same.
> * DistributedFileSystem detects an aborted DFSClient and creates a new instance of DFSClient. We would need to fix all the places DistributedFileSystem calls DFSClient.
> * After DFSClient gets into the Aborted state, it doesn't have to reject all requests; instead it can retry, and if the NN becomes available again it can transition back to the healthy state.
> Comments?
[jira] [Commented] (HDFS-7314) Aborted DFSClient's impact on long running service like YARN
[ https://issues.apache.org/jira/browse/HDFS-7314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14297266#comment-14297266 ] Gera Shegalov commented on HDFS-7314:
--------------------------------------

Actually I need to take #1 back; I misspoke. DFS#close does call super.close():
{code}
  @Override
  public void close() throws IOException {
    try {
      dfs.closeOutputStreams(false);
      super.close();
    } finally {
      dfs.close();
    }
  }
{code}
So it's only about #2.
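As a side note on #1, the eviction path that {{super.close()}} reaches is easy to observe with the public API alone. Below is a minimal sketch, assuming only hadoop-common on the classpath and the default {{fs.defaultFS}}; the class name is made up for the demo:
{code}
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class FsCacheEvictionDemo {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();

    FileSystem fs1 = FileSystem.get(conf); // creates and caches an instance
    FileSystem fs2 = FileSystem.get(conf); // cache hit: the same object
    System.out.println("same cached instance: " + (fs1 == fs2)); // true

    fs1.close(); // FileSystem#close evicts this entry from the cache

    FileSystem fs3 = FileSystem.get(conf); // fresh instance after eviction
    System.out.println("fresh instance after close: " + (fs3 != fs1)); // true
    fs3.close();
  }
}
{code}
An aborted-but-never-closed DFSClient never runs this close path, which is why the dead instance keeps being handed out from the cache.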
[jira] [Commented] (HDFS-7314) Aborted DFSClient's impact on long running service like YARN
[ https://issues.apache.org/jira/browse/HDFS-7314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14297133#comment-14297133 ] Ming Ma commented on HDFS-7314:
--------------------------------

Thanks, [~jira.shegalov]. That is interesting. That might work when applications request a new FileSystem object. However, there is the scenario where an application still holds a reference to the aborted FileSystem object and wants to use it to create files; that application would then need to be modified to catch the exception and recreate the FileSystem object. At the beginning of the jira, one of the three solutions proposed was to keep DistributedFileSystem alive and recreate the DFSClient. Regardless of the approach, it would be good to keep it transparent to the applications.
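To make the "transparent to the applications" idea concrete, here is a hypothetical sketch, not taken from any attached patch: a facade that keeps the reference held by the application stable and swaps the underlying instance when the cached client has aborted. {{SelfHealingFs}} and its single delegated method are invented names; only the public {{FileSystem}} API is used:
{code}
import java.io.IOException;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/**
 * Hypothetical facade: callers keep one stable reference, and the facade
 * replaces the underlying FileSystem when the cached client has aborted
 * (every call then fails with "Filesystem closed").
 */
class SelfHealingFs {
  private final URI uri;
  private final Configuration conf;
  private volatile FileSystem delegate;

  SelfHealingFs(URI uri, Configuration conf) throws IOException {
    this.uri = uri;
    this.conf = conf;
    // newInstance bypasses the FileSystem cache, so we own this object.
    this.delegate = FileSystem.newInstance(uri, conf);
  }

  FileStatus getFileStatus(Path p) throws IOException {
    try {
      return delegate.getFileStatus(p);
    } catch (IOException e) {
      if (!"Filesystem closed".equals(e.getMessage())) {
        throw e; // unrelated failure: propagate
      }
      delegate = FileSystem.newInstance(uri, conf); // swap in a live client
      return delegate.getFileStatus(p);
    }
  }
}
{code}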
[jira] [Commented] (HDFS-7314) Aborted DFSClient's impact on long running service like YARN
[ https://issues.apache.org/jira/browse/HDFS-7314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14296425#comment-14296425 ] Hadoop QA commented on HDFS-7314:
----------------------------------

{color:green}+1 overall{color}. Here are the results of testing the latest attachment
  http://issues.apache.org/jira/secure/attachment/12680729/HDFS-7314-7.patch
  against trunk revision 5a0051f.

    {color:green}+1 @author{color}. The patch does not contain any @author tags.
    {color:green}+1 tests included{color}. The patch appears to include 2 new or modified test files.
    {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.
    {color:green}+1 javadoc{color}. There were no new javadoc warning messages.
    {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse.
    {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings.
    {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.
    {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-hdfs-project/hadoop-hdfs.

Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/9366//testReport/
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/9366//console

This message is automatically generated.
[jira] [Commented] (HDFS-7314) Aborted DFSClient's impact on long running service like YARN
[ https://issues.apache.org/jira/browse/HDFS-7314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14296296#comment-14296296 ] Gera Shegalov commented on HDFS-7314:
--------------------------------------

I think the real problem is that the {{FileSystem}}-level CACHE entry is not invalidated/evicted even though the DFSClient is closed.
# DistributedFileSystem#close does not call super.close(), which would achieve this.
# DFSClient#abort does not close the wrapping DFS object, nor does DFS try to intercept checkOpen to do this.
Solving these two issues would solve the scenario described in the JIRA. What do you think, [~mingma]?
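Until the cache eviction is fixed, a call site can also sidestep the stale entry entirely. A minimal sketch under the same assumptions (invented helper name, public {{FileSystem}} API only):
{code}
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class AbortTolerantStat {
  /**
   * Try the cached FileSystem first; if its DFSClient has aborted, every
   * call fails with "Filesystem closed", so fall back to a cache-bypassing
   * instance created via FileSystem.newInstance.
   */
  public static FileStatus stat(Configuration conf, Path path) throws IOException {
    try {
      return FileSystem.get(path.toUri(), conf).getFileStatus(path);
    } catch (IOException e) {
      if (!"Filesystem closed".equals(e.getMessage())) {
        throw e; // unrelated failure: propagate
      }
      // newInstance never consults the FileSystem CACHE, so the stale
      // cached entry (with its aborted DFSClient) is not used here.
      try (FileSystem fresh = FileSystem.newInstance(path.toUri(), conf)) {
        return fresh.getFileStatus(path);
      }
    }
  }
}
{code}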
[jira] [Commented] (HDFS-7314) Aborted DFSClient's impact on long running service like YARN
[ https://issues.apache.org/jira/browse/HDFS-7314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14219683#comment-14219683 ] Ming Ma commented on HDFS-7314:
--------------------------------

Thanks, Colin.
1. There is an existing static method, {{LeaseRenewer}}#{{getInstance}}. The methods of {{LeaseRenewer}} and {{LeaseRenewer#Factory}} are synchronized, but only at the level of each class instance. Some of these race conditions come from the lack of synchronization across class instances. We can try to fix those scenarios.
2. Alternatively, we can get rid of the {{LeaseRenewer}} thread/object recycle logic. For a short-duration program like MR job submission, it won't kick in anyway. For long-running services like YARN, it doesn't really matter: they already create several long-running threads, so it should be OK to keep a few {{LeaseRenewer}} threads around. In addition, given that these services are likely to use HDFS regularly, the {{LeaseRenewer}} threads would be recreated or just kept around anyway.
[jira] [Commented] (HDFS-7314) Aborted DFSClient's impact on long running service like YARN
[ https://issues.apache.org/jira/browse/HDFS-7314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14219046#comment-14219046 ] Colin Patrick McCabe commented on HDFS-7314:
---------------------------------------------

I need to think about this more. I think the root of the problem is the decision to expose the {{LeaseRenewer}} thread object instances outside LeaseRenewer.java, rather than simply having static methods (or the equivalent) that act on "the current lease renewer for your UGI", without forcing you to know or care what that is. I am fine with fixing this in another JIRA, but I really feel like we should fix it first. I don't feel good about the current synchronization at all. Thanks for your patience, [~mingma].
[jira] [Commented] (HDFS-7314) Aborted DFSClient's impact on long running service like YARN
[ https://issues.apache.org/jira/browse/HDFS-7314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14211584#comment-14211584 ] Ming Ma commented on HDFS-7314:
--------------------------------

Thanks, Colin, for the good point. I also noticed that during the analysis, but assumed it was part of the original design.
1. The issue you described exists in trunk. It can happen whenever the LeaseRenewer goes away, due to either SocketTimeoutException or RenewerExpired.
2. In your steps above, the LeaseRenewer object is added to Factory.INSTANCE in step #3, not in steps #4 and #5. But that doesn't change the issue. What will happen is that when the first thread calls endFileLease, it will get hold of LR2, so LR1 will keep renewing the lease even after all files have been closed.
It appears we have discovered a bunch of race conditions regardless of whether the original issue is addressed or not. Given that, we can consider fixing the original issue here and opening another jira to address these race conditions. As you mentioned, the issues come from the fact that LeaseRenewer tries to clean up the object and the thread when they are no longer used. IMHO, that is not necessary; we could just keep LeaseRenewer objects and their threads around once they are created, which was the idea in the original patch. LeaseRenewer objects are keyed by NN address and ugi. In a normal setup with HDFS federation you can have several NN addresses, but the number of ugis should be limited, so it isn't expensive to keep these objects and their threads around.
If the long-term fix is to keep the LeaseRenewer object and thread around, we can start with the fix for SocketTimeoutException in this patch and open another jira to address the RenewerExpired scenario later.
[jira] [Commented] (HDFS-7314) Aborted DFSClient's impact on long running service like YARN
[ https://issues.apache.org/jira/browse/HDFS-7314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14211284#comment-14211284 ] Colin Patrick McCabe commented on HDFS-7314:
---------------------------------------------

I feel like this code is still not quite right. We can get two {{LeaseRenewer}} objects now, right?
1. beginFileLease calls into getLeaseRenewer, gets LeaseRenewer #1.
2. LeaseRenewer#closeClient (for LeaseRenewer #1) removes itself from Factory.INSTANCE.
3. Another thread calls beginFileLease. There is no LeaseRenewer object in Factory.INSTANCE any more, so a new one is created (call it #2).
4. The first thread calls put, adds the DFSClient to LeaseRenewer #1 and LR1 to Factory.INSTANCE.
5. The second thread calls put, adds the DFSClient to LeaseRenewer #2 and LR2 to Factory.INSTANCE.
Won't we end up with two {{LeaseRenewer}} objects after this point? The problem is basically that if we allow the {{LeaseRenewer}} object to escape from LeaseRenewer.java, and we accept that these objects can "die", we have to accept that people can be using dead LeaseRenewer objects. I'm not sure what the best way to fix this is... it is kind of a mess. I guess maybe it's a pre-existing problem too, if I'm understanding the situation correctly.
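For illustration only (this is not the HDFS code, and every name below is invented), one way to keep a dead renewer from escaping the factory is to make creation and retirement atomic per key and have callers retry when they lose the race:
{code}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class RenewerFactory {
  static final class Renewer {
    private boolean dead; // guarded by this

    /** Registers a client; fails if this renewer was already retired. */
    synchronized boolean tryAddClient(Object client) {
      return !dead; // client bookkeeping omitted in this sketch
    }

    /** Called once when the renewer thread gives up (abort/expiry). */
    synchronized void retire() {
      dead = true;
    }
  }

  private final Map<String, Renewer> renewers = new ConcurrentHashMap<>();

  /** Hands out a live renewer, atomically replacing a retired one. */
  Renewer getLive(String key, Object client) {
    while (true) {
      Renewer r = renewers.computeIfAbsent(key, k -> new Renewer());
      if (r.tryAddClient(client)) {
        return r; // registered under r's lock, so r was live when we joined
      }
      renewers.remove(key, r); // lost the race: drop the dead entry, retry
    }
  }
}
{code}
The invariant the sketch enforces is narrow: a caller is never handed an instance that a concurrent closeClient()/retire() already retired, which is exactly the escape described in steps 1 through 5 above.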
[jira] [Commented] (HDFS-7314) Aborted DFSClient's impact on long running service like YARN
[ https://issues.apache.org/jira/browse/HDFS-7314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14205997#comment-14205997 ] Hadoop QA commented on HDFS-7314:
----------------------------------

{color:green}+1 overall{color}. Here are the results of testing the latest attachment
  http://issues.apache.org/jira/secure/attachment/12680729/HDFS-7314-7.patch
  against trunk revision 58e9bf4.

    {color:green}+1 @author{color}. The patch does not contain any @author tags.
    {color:green}+1 tests included{color}. The patch appears to include 2 new or modified test files.
    {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.
    {color:green}+1 javadoc{color}. There were no new javadoc warning messages.
    {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse.
    {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings.
    {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.
    {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-hdfs-project/hadoop-hdfs.
    {color:green}+1 contrib tests{color}. The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/8711//testReport/
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/8711//console

This message is automatically generated.
[jira] [Commented] (HDFS-7314) Aborted DFSClient's impact on long running service like YARN
[ https://issues.apache.org/jira/browse/HDFS-7314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14205631#comment-14205631 ] Hadoop QA commented on HDFS-7314:
----------------------------------

{color:red}-1 overall{color}. Here are the results of testing the latest attachment
  http://issues.apache.org/jira/secure/attachment/12680685/HDFS-7314-6.patch
  against trunk revision 68a0508.

    {color:green}+1 @author{color}. The patch does not contain any @author tags.
    {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files.
    {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.
    {color:green}+1 javadoc{color}. There were no new javadoc warning messages.
    {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse.
    {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings.
    {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.
    {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-hdfs-project/hadoop-hdfs:
                  org.apache.hadoop.hdfs.TestDistributedFileSystem
    {color:green}+1 contrib tests{color}. The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/8707//testReport/
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/8707//console

This message is automatically generated.
[jira] [Commented] (HDFS-7314) Aborted DFSClient's impact on long running service like YARN
[ https://issues.apache.org/jira/browse/HDFS-7314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14205270#comment-14205270 ] Colin Patrick McCabe commented on HDFS-7314:
---------------------------------------------

Good catch. This code is certainly somewhat subtle. I think the {{currentId}} variable was intended to address the problem you're describing.
Keeping the thread running seems strange. Is it going to abort the clients it's tracking more than once? I would rather stop it if at all possible. It seems like maybe what we should do here is set {{emptyTime}} to 0 and break out of the loop to exit the thread. This will lead to the current {{LeaseRenewer}} thread being considered "expired" and not used in {{LeaseRenewer#put}}. So there should be no race condition then, because {{LeaseRenewer#put}} will create a new thread (and increment {{currentId}}) if the current one is expired.
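A minimal sketch of that shutdown sequence, with invented names standing in for the {{LeaseRenewer}} internals (this is not the actual HDFS code):
{code}
/**
 * On a hard abort the thread marks itself expired and exits, so a later
 * put() sees isExpired() and creates a fresh renewer instead of reusing
 * this one.
 */
class SelfExpiringRenewer implements Runnable {
  private static final long GRACE_MS = 60_000;
  private long emptyTime = Long.MAX_VALUE; // guarded by this

  synchronized boolean isExpired(long nowMs) {
    return nowMs - emptyTime > GRACE_MS;
  }

  @Override
  public void run() {
    while (true) {
      try {
        renewLeases();
        Thread.sleep(1_000);
      } catch (HardAbort e) {
        synchronized (this) {
          emptyTime = 0; // any current time now counts as expired
        }
        break; // exit the thread; put() must create a replacement
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        break;
      }
    }
  }

  private void renewLeases() throws HardAbort {
    // renew leases over RPC; throw HardAbort once the hard limit elapses
  }

  static final class HardAbort extends Exception {}
}
{code}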
[jira] [Commented] (HDFS-7314) Aborted DFSClient's impact on long running service like YARN
[ https://issues.apache.org/jira/browse/HDFS-7314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14205241#comment-14205241 ] Ming Ma commented on HDFS-7314:
--------------------------------

Thanks, Colin. The reason to keep the thread running is to handle the following race condition:
1. The lease renewal thread is aborting.
2. The application creates files before the LeaseRenewer is removed from the factory, so the DFSClient is added to the LeaseRenewer object.
3. The LeaseRenewer thread exits, so nobody will renew the lease for that DFSClient.
[jira] [Commented] (HDFS-7314) Aborted DFSClient's impact on long running service like YARN
[ https://issues.apache.org/jira/browse/HDFS-7314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14205204#comment-14205204 ] Colin Patrick McCabe commented on HDFS-7314:
{code}
@@ -450,10 +455,11 @@ private void run(final int id) throws InterruptedException {
               + (elapsed/1000) + " seconds. Aborting ...", ie);
           synchronized (this) {
             while (!dfsclients.isEmpty()) {
-              dfsclients.get(0).abort();
+              DFSClient dfsClient = dfsclients.get(0);
+              dfsClient.closeAllFilesBeingWritten(true);
+              closeClient(dfsClient);
             }
           }
-          break;
         } catch (IOException ie) {
           LOG.warn("Failed to renew lease for " + clientsString() + " for "
               + (elapsed/1000) + " seconds. Will retry shortly ...", ie);
{code}
It seems like getting rid of the "break" here will lead to the {{LeaseRenewer}} thread for the client continuing to run after the client's lease has been aborted. That doesn't seem like what we want. After all, we are going to create a new {{LeaseRenewer}} if the {{DFSClient}} opens another file for write.
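For context, the renewer thread is meant to shut itself down once its client list has stayed empty past a grace period rather than running forever; a rough, self-contained sketch of that idle-exit check (hypothetical names and constants, not the actual LeaseRenewer code):
{code}
import java.util.ArrayList;
import java.util.List;

// Sketch (hypothetical names): exit once the client list has been
// empty for longer than a grace period, instead of looping forever.
public class IdleExitCheck {
  private static final long GRACE_MS = 60_000L;  // assumed grace period

  private final List<Object> clients = new ArrayList<>();
  private long emptyTimeMs = Long.MAX_VALUE;     // "not empty" sentinel

  public synchronized boolean shouldExit(long nowMs) {
    if (!clients.isEmpty()) {
      emptyTimeMs = Long.MAX_VALUE;  // still serving a client
      return false;
    }
    if (emptyTimeMs == Long.MAX_VALUE) {
      emptyTimeMs = nowMs;           // list just became empty
    }
    return nowMs - emptyTimeMs > GRACE_MS;
  }
}
{code}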
[jira] [Commented] (HDFS-7314) Aborted DFSClient's impact on long running service like YARN
[ https://issues.apache.org/jira/browse/HDFS-7314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14203256#comment-14203256 ] Hadoop QA commented on HDFS-7314: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12680355/HDFS-7314-5.patch against trunk revision 4a114dd.
{color:green}+1 @author{color}. The patch does not contain any @author tags.
{color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files.
{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.
{color:green}+1 javadoc{color}. There were no new javadoc warning messages.
{color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse.
{color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings.
{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.
{color:green}+1 core tests{color}. The patch passed unit tests in hadoop-hdfs-project/hadoop-hdfs.
{color:green}+1 contrib tests{color}. The patch passed contrib unit tests.
Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/8696//testReport/
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/8696//console
This message is automatically generated.
[jira] [Commented] (HDFS-7314) Aborted DFSClient's impact on long running service like YARN
[ https://issues.apache.org/jira/browse/HDFS-7314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14202971#comment-14202971 ] Colin Patrick McCabe commented on HDFS-7314:
bq. It turns out a new bug not related to this was discovered by this change. If the DataStreamer thread exits and closes the stream before the application closes it, DFSClient will keep renewing the lease. That is because DataStreamer's closeInternal marks the stream closed but doesn't call DFSClient's endFileLease. Later, when the application closes the stream, it skips DFSClient's endFileLease because the stream is already marked closed.
You're right that there is a bug here. There is a lot of discussion about what to do about this issue in HDFS-4504. It's not as simple as just calling {{endFileLease}}... if we missed calling {{completeFile}}, the NN will continue to think that we have a lease open on this file. I think we should avoid modifying {{DFSOutputStream#close}} here and keep this JIRA focused on just the description. Plus, HDFS-4504 is a complex issue, not easy to solve.
{{TestDFSClientRetries.java}}: let's get rid of the unnecessary whitespace change in the current patch.
I like the idea of getting rid of the {{DFSClient#abort}} function. The patch looks good once these things are removed; it should be ready to go soon!
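A minimal sketch of why {{endFileLease}} alone is insufficient (hypothetical stubs, not the actual DFSOutputStream code): the NameNode-side {{completeFile}} call and the client-side lease bookkeeping have to happen together on the close path.
{code}
// Sketch (hypothetical stubs) of the close ordering at issue in HDFS-4504.
class StreamCloseSketch {
  void completeFile() { /* NN-side RPC: finalize the file, releasing the NN's lease record */ }
  void endFileLease() { /* client-side: stop renewing the lease for this file */ }

  void close() {
    // Calling endFileLease() alone is not enough: without completeFile()
    // the NameNode still believes the client holds a lease on the file.
    completeFile();
    endFileLease();
  }
}
{code}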
[jira] [Commented] (HDFS-7314) Aborted DFSClient's impact on long running service like YARN
[ https://issues.apache.org/jira/browse/HDFS-7314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14201763#comment-14201763 ] Hadoop QA commented on HDFS-7314: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12680087/HDFS-7314-4.patch against trunk revision ba0a42c.
{color:green}+1 @author{color}. The patch does not contain any @author tags.
{color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files.
{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.
{color:green}+1 javadoc{color}. There were no new javadoc warning messages.
{color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse.
{color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings.
{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.
{color:red}-1 core tests{color}. The following test timeouts occurred in hadoop-common-project/hadoop-common hadoop-hdfs-project/hadoop-hdfs:
org.apache.hadoop.hdfs.server.namenode.TestFsck
org.apache.hadoop.hdfs.server.namenode.TestDeleteRace
{color:green}+1 contrib tests{color}. The patch passed contrib unit tests.
Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/8686//testReport/
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/8686//console
This message is automatically generated.
[jira] [Commented] (HDFS-7314) Aborted DFSClient's impact on long running service like YARN
[ https://issues.apache.org/jira/browse/HDFS-7314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14201218#comment-14201218 ] Hadoop QA commented on HDFS-7314: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12679945/HDFS-7314-3.patch against trunk revision 1670578.
{color:green}+1 @author{color}. The patch does not contain any @author tags.
{color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files.
{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.
{color:green}+1 javadoc{color}. There were no new javadoc warning messages.
{color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse.
{color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings.
{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.
{color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-hdfs-project/hadoop-hdfs:
org.apache.hadoop.hdfs.TestDFSClientRetries
The following test timeouts occurred in hadoop-hdfs-project/hadoop-hdfs:
org.apache.hadoop.hdfs.qjournal.client.TestQuorumJournalManager
{color:green}+1 contrib tests{color}. The patch passed contrib unit tests.
Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/8681//testReport/
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/8681//console
This message is automatically generated.
[jira] [Commented] (HDFS-7314) Aborted DFSClient's impact on long running service like YARN
[ https://issues.apache.org/jira/browse/HDFS-7314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14200568#comment-14200568 ] Colin Patrick McCabe commented on HDFS-7314:
bq. 1. abort is only used for this scenario. After we have LeaseRenewer call abortOpenFiles, abort won't be called by any functions.
Good point. Let's get rid of {{DFSClient#abort}} completely then; we don't need this function any more.
bq. 2. In addition to having DFSClient call closeAllFilesBeingWritten, LeaseRenewer also needs to remove the DFSClient from its list via dfsclients.remove(dfsc); so that DFSClient doesn't renew the lease when there are no files opened. This is achieved via LeaseRenewer's closeClient.
When a lease timeout occurs, {{LeaseRenewer}} can call {{DFSClient#closeAllFilesBeingWritten(abort=true)}} and then invoke {{LeaseRenewer#closeClient}} on itself. This avoids the need to modify {{LeaseRenewer#closeClient}}.
{code}
@@ -447,16 +453,17 @@ private void run(final int id) throws InterruptedException {
           lastRenewed = Time.now();
         } catch (SocketTimeoutException ie) {
           LOG.warn("Failed to renew lease for " + clientsString() + " for "
-              + (elapsed/1000) + " seconds. Aborting ...", ie);
+              + ((Time.now() - lastRenewed)/1000) + " seconds. Aborting ...",
+              ie);
           synchronized (this) {
             while (!dfsclients.isEmpty()) {
{code}
I don't think we need this change and the other similar change.
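A compact, self-contained sketch of the sequencing described here (hypothetical names; the real types live in DFSClient and LeaseRenewer):
{code}
import java.util.ArrayList;
import java.util.List;

// Sketch (hypothetical names) of the timeout handling described above:
// abort each client's open output streams, then deregister the client
// from the renewer so it stops receiving lease renewals.
class RenewerSketch {
  interface Client {
    void closeAllFilesBeingWritten(boolean abort);
  }

  private final List<Client> dfsclients = new ArrayList<>();

  synchronized void closeClient(Client c) {
    dfsclients.remove(c);  // the client no longer gets lease renewals
  }

  synchronized void onLeaseTimeout() {
    while (!dfsclients.isEmpty()) {
      Client c = dfsclients.get(0);
      c.closeAllFilesBeingWritten(true);  // abort = true: drop open streams
      closeClient(c);                     // then deregister from the renewer
    }
  }
}
{code}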
[jira] [Commented] (HDFS-7314) Aborted DFSClient's impact on long running service like YARN
[ https://issues.apache.org/jira/browse/HDFS-7314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14198998#comment-14198998 ] Ming Ma commented on HDFS-7314: --- Thanks, Colin. Here are more explanations for the changes. Please let me know your thoughts; I appreciate your input.
1. {{abort}} is only used for this scenario. After we have {{LeaseRenewer}} call {{abortOpenFiles}}, {{abort}} won't be called by any functions.
2. In addition to having {{DFSClient}} call {{closeAllFilesBeingWritten}}, {{LeaseRenewer}} also needs to remove the {{DFSClient}} from its list via {{dfsclients.remove(dfsc);}} so that the {{DFSClient}} doesn't renew the lease when there are no files opened. This is achieved via {{LeaseRenewer}}'s {{closeClient}}.
3. Whether {{LeaseRenewer}} should be removed from the factory when it gets SocketTimeoutException: given that the {{LeaseRenewer}} thread won't exit on SocketTimeoutException as part of this fix, removing the {{LeaseRenewer}} object from the factory could leak the {{LeaseRenewer}} thread even though the old {{LeaseRenewer}} object isn't used by other objects. In reality, the {{LeaseRenewer}} won't be removed from the factory inside {{closeClient}}, given that {{isRenewerExpired()}} will return false. So {{removeFromFactory}} is there mostly for the semantics; it isn't strictly necessary.
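A small sketch of point 3 (hypothetical names, a loose simplification of the factory): the factory entry stays in place while the renewer thread is alive, so a later lookup reuses the same renewer instead of leaking a second thread for the same key.
{code}
import java.util.HashMap;
import java.util.Map;

// Sketch (hypothetical names) of the factory semantics in point 3.
class RenewerFactorySketch {
  private final Map<String, Object> renewers = new HashMap<>();

  // Reuses the live renewer for a key, or creates a fresh one.
  synchronized Object getInstance(String key) {
    return renewers.computeIfAbsent(key, k -> new Object());
  }

  // Removal only happens once the renewer has actually expired;
  // while its thread is still running, the entry stays mapped so
  // no second thread is created for the same key.
  synchronized void remove(String key, boolean renewerExpired) {
    if (renewerExpired) {
      renewers.remove(key);
    }
  }
}
{code}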
[jira] [Commented] (HDFS-7314) Aborted DFSClient's impact on long running service like YARN
[ https://issues.apache.org/jira/browse/HDFS-7314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14198765#comment-14198765 ] Colin Patrick McCabe commented on HDFS-7314: HDFS-7314-2.patch just seems to rename {{abort}} to {{abortOpenFiles}}. What I was suggesting was creating a separate function, different from {{abort}}, which the {{LeaseRenewer}} would call. Actually, looking at it, I wonder if the lease renewer can just call {{closeAllFilesBeingWritten}}? I haven't looked at it in detail so maybe there's something else the lease renewer needs to do, but this at least looks like a good start. We don't need all this {{boolean removeFromFactory}} stuff. {{getInstance}} will re-add the {{DFSClient}} to the map later if needed.
[jira] [Commented] (HDFS-7314) Aborted DFSClient's impact on long running service like YARN
[ https://issues.apache.org/jira/browse/HDFS-7314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14197061#comment-14197061 ] Hadoop QA commented on HDFS-7314: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12679296/HDFS-7314-2.patch against trunk revision 1eed102.
{color:green}+1 @author{color}. The patch does not contain any @author tags.
{color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files.
{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.
{color:green}+1 javadoc{color}. There were no new javadoc warning messages.
{color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse.
{color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings.
{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.
{color:green}+1 core tests{color}. The patch passed unit tests in hadoop-hdfs-project/hadoop-hdfs.
{color:green}+1 contrib tests{color}. The patch passed contrib unit tests.
Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/8642//testReport/
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/8642//console
This message is automatically generated.
[jira] [Commented] (HDFS-7314) Aborted DFSClient's impact on long running service like YARN
[ https://issues.apache.org/jira/browse/HDFS-7314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14196603#comment-14196603 ] Colin Patrick McCabe commented on HDFS-7314: Thanks, [~mingma]. It's interesting that all the unit tests pass with the changed behavior of {{DFSClient#abort}}. I would prefer not to add this new configuration key, because I really can't think of any cases where I'd like to set it to {{true}}. I think it would be better just to have the lease timeout logic call a function other than {{DFSClient#abort}}. Basically create something like {{DFSClient#abortOpenFiles}} and have the lease timeout code call this instead of abort. That way we don't get confused about what abort means, but we also have the nice behavior that our client continues to be useful after a lease timeout.
[jira] [Commented] (HDFS-7314) Aborted DFSClient's impact on long running service like YARN
[ https://issues.apache.org/jira/browse/HDFS-7314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14195968#comment-14195968 ] Hadoop QA commented on HDFS-7314: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12679168/HDFS-7314.patch against trunk revision 2bb327e.
{color:green}+1 @author{color}. The patch does not contain any @author tags.
{color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files.
{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.
{color:green}+1 javadoc{color}. There were no new javadoc warning messages.
{color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse.
{color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings.
{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.
{color:green}+1 core tests{color}. The patch passed unit tests in hadoop-hdfs-project/hadoop-hdfs.
{color:green}+1 contrib tests{color}. The patch passed contrib unit tests.
Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/8638//testReport/
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/8638//console
This message is automatically generated.
[jira] [Commented] (HDFS-7314) Aborted DFSClient's impact on long running service like YARN
[ https://issues.apache.org/jira/browse/HDFS-7314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14194790#comment-14194790 ] Colin Patrick McCabe commented on HDFS-7314:
bq. I think DFSClient#abort() can be changed, so that only the existing output streams are aborted. The underlying IPC client can try to reopen connection later.
Good idea. The only thing to watch out for is that some unit tests might be using {{abort}} and expecting the current semantics. Perhaps we can create a new function, {{abortOpenStreams}}?
[jira] [Commented] (HDFS-7314) Aborted DFSClient's impact on long running service like YARN
[ https://issues.apache.org/jira/browse/HDFS-7314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14191986#comment-14191986 ] Kihwal Lee commented on HDFS-7314: -- I think DFSClient#abort() can be changed, so that only the existing output streams are aborted. The underlying IPC client can try to reopen connection later.
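A minimal, self-contained sketch of this suggestion (hypothetical names, not the actual DFSClient code), assuming a map of open output streams keyed by file ID:
{code}
import java.io.Closeable;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

// Sketch (hypothetical names): abort only the open output streams on a
// lease-renewal failure, and leave the client itself open so later RPC
// calls can retry once the NameNode is reachable again.
class ClientSketch {
  private final Map<Long, Closeable> filesBeingWritten = new HashMap<>();
  private volatile boolean open = true;

  void abortOpenStreams() {
    synchronized (filesBeingWritten) {
      for (Closeable out : filesBeingWritten.values()) {
        try {
          out.close();
        } catch (IOException ignored) {
          // best effort: the in-flight writes are lost after the timeout
        }
      }
      filesBeingWritten.clear();
    }
    // Deliberately NOT setting open = false here: read paths such as
    // getFileInfo() keep working instead of throwing "Filesystem closed".
  }

  void checkOpen() throws IOException {
    if (!open) {
      throw new IOException("Filesystem closed");
    }
  }
}
{code}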