[ https://issues.apache.org/jira/browse/HDFS-7314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14619618#comment-14619618 ]
Hadoop QA commented on HDFS-7314:
---------------------------------
\\
\\
| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:red}-1{color} | pre-patch | 17m 38s | Pre-patch trunk has 1 extant Findbugs (version 3.0.0) warnings. |
| {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. |
| {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 1 new or modified test files. |
| {color:red}-1{color} | javac | 8m 9s | The applied patch generated 1 additional warning messages. |
| {color:green}+1{color} | javadoc | 10m 6s | There were no new javadoc warning messages. |
| {color:green}+1{color} | release audit | 0m 21s | The applied patch does not increase the total number of release audit warnings. |
| {color:red}-1{color} | checkstyle | 1m 26s | The applied patch generated 3 new checkstyle issues (total was 138, now 139). |
| {color:green}+1{color} | whitespace | 0m 0s | The patch has no lines that end in whitespace. |
| {color:green}+1{color} | install | 1m 27s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse | 0m 33s | The patch built with eclipse:eclipse. |
| {color:green}+1{color} | findbugs | 2m 37s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. |
| {color:green}+1{color} | native | 3m 10s | Pre-build of native portion |
| {color:red}-1{color} | hdfs tests | 159m 27s | Tests failed in hadoop-hdfs. |
| | | 204m 59s | |
\\
\\
|| Reason || Tests ||
| Failed unit tests | hadoop.hdfs.TestLeaseRecovery2 |
\\
\\
|| Subsystem || Report/Notes ||
| Patch URL | http://issues.apache.org/jira/secure/attachment/12744319/HDFS-7314-8.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | trunk / 2e3d83f |
| Pre-patch Findbugs warnings | https://builds.apache.org/job/PreCommit-HDFS-Build/11630/artifact/patchprocess/trunkFindbugsWarningshadoop-hdfs.html |
| javac | https://builds.apache.org/job/PreCommit-HDFS-Build/11630/artifact/patchprocess/diffJavacWarnings.txt |
| checkstyle | https://builds.apache.org/job/PreCommit-HDFS-Build/11630/artifact/patchprocess/diffcheckstylehadoop-hdfs.txt |
| hadoop-hdfs test log | https://builds.apache.org/job/PreCommit-HDFS-Build/11630/artifact/patchprocess/testrun_hadoop-hdfs.txt |
| Test Results | https://builds.apache.org/job/PreCommit-HDFS-Build/11630/testReport/ |
| Java | 1.7.0_55 |
| uname | Linux asf904.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | https://builds.apache.org/job/PreCommit-HDFS-Build/11630/console |
This message was automatically generated.
> Aborted DFSClient's impact on long running service like YARN
> ------------------------------------------------------------
>
> Key: HDFS-7314
> URL: https://issues.apache.org/jira/browse/HDFS-7314
> Project: Hadoop HDFS
> Issue Type: Improvement
> Reporter: Ming Ma
> Assignee: Ming Ma
> Labels: BB2015-05-TBR
> Attachments: HDFS-7314-2.patch, HDFS-7314-3.patch, HDFS-7314-4.patch,
> HDFS-7314-5.patch, HDFS-7314-6.patch, HDFS-7314-7.patch, HDFS-7314-8.patch,
> HDFS-7314.patch
>
>
> It happened in a YARN nodemanager scenario, but it could happen to any long
> running service that uses a cached instance of DistributedFileSystem.
> 1. The active NN was under heavy load, so it became unavailable for 10 minutes;
> any DFSClient request got a ConnectTimeoutException.
> 2. The YARN nodemanager uses DFSClient for certain write operations such as log
> aggregation or the shared cache in YARN-1492. The renewLease RPC of the
> DFSClient used by the YARN NM got a ConnectTimeoutException.
> {noformat}
> 2014-10-29 01:36:19,559 WARN org.apache.hadoop.hdfs.LeaseRenewer: Failed to renew lease for [DFSClient_NONMAPREDUCE_-550838118_1] for 372 seconds. Aborting ...
> {noformat}
> 3. After DFSClient is in Aborted state, YARN NM can't use that cached
> instance of DistributedFileSystem.
> {noformat}
> 2014-10-29 20:26:23,991 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Failed to download rsrc...
> java.io.IOException: Filesystem closed
> at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:727)
> at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:1780)
> at org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1124)
> at org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1120)
> at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
> at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1120)
> at org.apache.hadoop.yarn.util.FSDownload.copy(FSDownload.java:237)
> at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:340)
> at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:57)
> at java.util.concurrent.FutureTask.run(FutureTask.java:262)
> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
> at java.util.concurrent.FutureTask.run(FutureTask.java:262)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {noformat}
> We can make YARN or DFSClient more tolerant of temporary NN unavailability.
> Given that the call stack is YARN -> DistributedFileSystem -> DFSClient, this
> can be addressed at different layers.
> * YARN closes the DistributedFileSystem object when it receives some
> well-defined exception. Then the next HDFS call will create a new instance of
> DistributedFileSystem. We would have to fix all the places in YARN, and other
> HDFS applications would need to address this as well.
> * DistributedFileSystem detects an Aborted DFSClient and creates a new
> instance of DFSClient. We would need to fix all the places where
> DistributedFileSystem calls DFSClient.
> * After DFSClient gets into the Aborted state, it doesn't have to reject all
> requests; instead it can retry. If the NN is available again, it can
> transition back to a healthy state.
> Comments?
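
As a rough illustration of the second option in the list above (the file system layer detects an aborted client and creates a new one), here is a minimal, self-contained Java sketch. SketchClient and SelfHealingFileSystem are hypothetical stand-ins invented for this example; they are not the Hadoop DFSClient or DistributedFileSystem classes, and the sketch makes no claim about what any of the attached patches does. It also ignores state a real replacement would have to deal with, such as open output streams and leases.

```java
import java.io.IOException;
import java.util.function.Supplier;

/** Hypothetical stand-in for DFSClient: once aborted, every call is rejected. */
class SketchClient {
    private volatile boolean open = true;

    /** Simulates the lease renewer giving up and aborting the client. */
    void abort() {
        open = false;
    }

    boolean isOpen() {
        return open;
    }

    String getFileInfo(String path) throws IOException {
        if (!open) {
            // Same message as in the stack trace quoted in the description.
            throw new IOException("Filesystem closed");
        }
        return "status:" + path;
    }
}

/** Hypothetical stand-in for DistributedFileSystem that replaces an aborted client. */
class SelfHealingFileSystem {
    private final Supplier<SketchClient> factory;
    private SketchClient client;
    private int clientsCreated;

    SelfHealingFileSystem(Supplier<SketchClient> factory) {
        this.factory = factory;
        this.client = newClient();
    }

    private SketchClient newClient() {
        clientsCreated++;
        return factory.get();
    }

    /** Returns a usable client, creating a fresh one if the cached one was aborted. */
    synchronized SketchClient client() {
        if (!client.isOpen()) {
            client = newClient();
        }
        return client;
    }

    /** Every call site goes through client() instead of touching the field directly. */
    synchronized String getFileStatus(String path) throws IOException {
        return client().getFileInfo(path);
    }

    synchronized int clientsCreated() {
        return clientsCreated;
    }
}
```

In this sketch, a caller that holds a cached SelfHealingFileSystem keeps working after the underlying client is aborted: the next getFileStatus call goes through a newly created client instead of failing with "Filesystem closed". The cost noted in the description still applies, since every place that uses the client has to go through the checking accessor.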
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)