[jira] [Commented] (YARN-2074) Preemption of AM containers shouldn't count towards AM failures
[ https://issues.apache.org/jira/browse/YARN-2074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14033479#comment-14033479 ] Wangda Tan commented on YARN-2074: -- [~jianhe], thanks for your clarification. I think the testAMPreemptedNotCountedForAMFailures is exactly what I meant. LGTM, +1. Preemption of AM containers shouldn't count towards AM failures --- Key: YARN-2074 URL: https://issues.apache.org/jira/browse/YARN-2074 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Vinod Kumar Vavilapalli Assignee: Jian He Attachments: YARN-2074.1.patch, YARN-2074.2.patch, YARN-2074.3.patch, YARN-2074.4.patch, YARN-2074.5.patch, YARN-2074.6.patch, YARN-2074.6.patch One orthogonal concern with issues like YARN-2055 and YARN-2022 is that AM containers getting preempted shouldn't count towards AM failures and thus shouldn't eventually fail applications. We should explicitly handle AM container preemption/kill as a separate issue and not count it towards the limit on AM failures. -- This message was sent by Atlassian JIRA (v6.2#6252)
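The gist of the fix can be sketched as a check on the AM container's exit status when deciding whether an attempt failure should count; this is a minimal illustration assuming a ContainerExitStatus.PREEMPTED constant, not the committed patch itself:
{code}
import org.apache.hadoop.yarn.api.records.ContainerExitStatus;
import org.apache.hadoop.yarn.api.records.ContainerStatus;

// Sketch only: a preempted AM should not consume one of the
// application's limited AM attempts; genuine failures still do.
public final class AmFailurePolicySketch {
  public static boolean countsTowardsAmFailures(ContainerStatus amStatus) {
    return amStatus.getExitStatus() != ContainerExitStatus.PREEMPTED;
  }
}
{code}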
[jira] [Commented] (YARN-2022) Preempting an Application Master container can be kept as least priority when multiple applications are marked for preemption by ProportionalCapacityPreemptionPolicy
[ https://issues.apache.org/jira/browse/YARN-2022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14033491#comment-14033491 ] Hadoop QA commented on YARN-2022: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12650734/YARN-2022.6.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4010//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4010//console This message is automatically generated. Preempting an Application Master container can be kept as least priority when multiple applications are marked for preemption by ProportionalCapacityPreemptionPolicy - Key: YARN-2022 URL: https://issues.apache.org/jira/browse/YARN-2022 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.4.0 Reporter: Sunil G Assignee: Sunil G Attachments: YARN-2022-DesignDraft.docx, YARN-2022.2.patch, YARN-2022.3.patch, YARN-2022.4.patch, YARN-2022.5.patch, YARN-2022.6.patch, Yarn-2022.1.patch Cluster Size = 16GB [2 NMs] Queue A Capacity = 50% Queue B Capacity = 50% Consider there are 3 applications running in Queue A which have taken the full cluster capacity. J1 = 2GB AM + 1GB * 4 Maps J2 = 2GB AM + 1GB * 4 Maps J3 = 2GB AM + 1GB * 2 Maps Another job J4 is submitted in Queue B [J4 needs a 2GB AM + 1GB * 2 Maps]. Currently in this scenario, job J3 will get killed, including its AM. It would be better if the AM could be given the least priority among multiple applications. In this same scenario, map tasks from J3 and J2 can be preempted instead. Later, when the cluster is free, maps can be allocated to these jobs. -- This message was sent by Atlassian JIRA (v6.2#6252)
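The behavior YARN-2022 asks for, preempting AM containers only as a last resort, amounts to ordering the victim list so AM containers come last. A sketch under the assumption that RMContainer exposes an isAMContainer() flag:
{code}
import java.util.Comparator;
import java.util.List;
import org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainer;

// Sketch only: false sorts before true, so non-AM containers are
// selected for preemption first and AM containers last.
public final class AmLastOrderingSketch {
  public static void orderForPreemption(List<RMContainer> candidates) {
    candidates.sort(Comparator.comparing(RMContainer::isAMContainer));
  }
}
{code}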
[jira] [Updated] (YARN-2074) Preemption of AM containers shouldn't count towards AM failures
[ https://issues.apache.org/jira/browse/YARN-2074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jian He updated YARN-2074: -- Attachment: YARN-2074.7.patch Preemption of AM containers shouldn't count towards AM failures --- Key: YARN-2074 URL: https://issues.apache.org/jira/browse/YARN-2074 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Vinod Kumar Vavilapalli Assignee: Jian He Attachments: YARN-2074.1.patch, YARN-2074.2.patch, YARN-2074.3.patch, YARN-2074.4.patch, YARN-2074.5.patch, YARN-2074.6.patch, YARN-2074.6.patch, YARN-2074.7.patch One orthogonal concern with issues like YARN-2055 and YARN-2022 is that AM containers getting preempted shouldn't count towards AM failures and thus shouldn't eventually fail applications. We should explicitly handle AM container preemption/kill as a separate issue and not count it towards the limit on AM failures. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2074) Preemption of AM containers shouldn't count towards AM failures
[ https://issues.apache.org/jira/browse/YARN-2074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14033496#comment-14033496 ] Jian He commented on YARN-2074: --- Thanks for pointing out RMAppAttemptImpl.isLastAttempt; there's an existing bug in calculating isLastAttempt. I updated the patch and test case accordingly. Preemption of AM containers shouldn't count towards AM failures --- Key: YARN-2074 URL: https://issues.apache.org/jira/browse/YARN-2074 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Vinod Kumar Vavilapalli Assignee: Jian He Attachments: YARN-2074.1.patch, YARN-2074.2.patch, YARN-2074.3.patch, YARN-2074.4.patch, YARN-2074.5.patch, YARN-2074.6.patch, YARN-2074.6.patch, YARN-2074.7.patch One orthogonal concern with issues like YARN-2055 and YARN-2022 is that AM containers getting preempted shouldn't count towards AM failures and thus shouldn't eventually fail applications. We should explicitly handle AM container preemption/kill as a separate issue and not count it towards the limit on AM failures. -- This message was sent by Atlassian JIRA (v6.2#6252)
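For context, the corrected check presumably reduces to comparing the 1-based attempt id against the configured limit; a schematic sketch (names assumed), where an off-by-one is the easy mistake:
{code}
// Sketch only: an attempt is the last allowed one when its 1-based id
// has reached the app's max-attempts limit; comparing with '>' instead
// of '>=' (or using a 0-based id) silently grants an extra attempt.
public final class LastAttemptSketch {
  public static boolean isLastAttempt(int attemptId, int maxAppAttempts) {
    return attemptId >= maxAppAttempts;
  }
}
{code}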
[jira] [Updated] (YARN-2074) Preemption of AM containers shouldn't count towards AM failures
[ https://issues.apache.org/jira/browse/YARN-2074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jian He updated YARN-2074: -- Attachment: YARN-2074.7.patch Preemption of AM containers shouldn't count towards AM failures --- Key: YARN-2074 URL: https://issues.apache.org/jira/browse/YARN-2074 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Vinod Kumar Vavilapalli Assignee: Jian He Attachments: YARN-2074.1.patch, YARN-2074.2.patch, YARN-2074.3.patch, YARN-2074.4.patch, YARN-2074.5.patch, YARN-2074.6.patch, YARN-2074.6.patch, YARN-2074.7.patch, YARN-2074.7.patch One orthogonal concern with issues like YARN-2055 and YARN-2022 is that AM containers getting preempted shouldn't count towards AM failures and thus shouldn't eventually fail applications. We should explicitly handle AM container preemption/kill as a separate issue and not count it towards the limit on AM failures. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1480) RM web services getApps() accepts many more filters than ApplicationCLI list command
[ https://issues.apache.org/jira/browse/YARN-1480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14033502#comment-14033502 ] Zhijie Shen commented on YARN-1480: --- Hi [~kj-ki], thanks for the patch. Here're some meta comments on it: 1. I looked into the current RMWebServices#getApps(), and below is the list of options missing in ApplicationCLI. queue (the current queue option is for the movetoqueue command) and tags are not covered in the patch. If it's not a big addition, is it better to include these two options in the option list? {code} @QueryParam("finalStatus") String finalStatusQuery, @QueryParam("user") String userQuery, @QueryParam("queue") String queueQuery, @QueryParam("limit") String count, @QueryParam("startedTimeBegin") String startedBegin, @QueryParam("startedTimeEnd") String startedEnd, @QueryParam("finishedTimeBegin") String finishBegin, @QueryParam("finishedTimeEnd") String finishEnd, @QueryParam("applicationTags") Set<String> applicationTags {code} 2. ApplicationClientProtocol#getApplications already supports the full filters, while YarnClient does not seem to support the full set of options yet. IMHO, the right way here is to make YarnClient support the full filters and have ApplicationCLI simply call that API. Pulling a long app list from the RM and doing local filtering is inefficient. RM web services getApps() accepts many more filters than ApplicationCLI list command -- Key: YARN-1480 URL: https://issues.apache.org/jira/browse/YARN-1480 Project: Hadoop YARN Issue Type: Bug Reporter: Zhijie Shen Assignee: Kenji Kikushima Attachments: YARN-1480-2.patch, YARN-1480-3.patch, YARN-1480-4.patch, YARN-1480-5.patch, YARN-1480.patch Nowadays RM web services getApps() accepts many more filters than the ApplicationCLI list command, which only accepts state and type. IMHO, ideally, different interfaces should provide consistent functionality. Is it better to allow more filters in ApplicationCLI? -- This message was sent by Atlassian JIRA (v6.2#6252)
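To make the suggested direction concrete, a hedged sketch of building a server-side filter with GetApplicationsRequest (the setter names reflect the protocol records of this era and should be treated as assumptions):
{code}
import java.util.Collections;
import java.util.EnumSet;
import org.apache.hadoop.yarn.api.protocolrecords.GetApplicationsRequest;
import org.apache.hadoop.yarn.api.records.YarnApplicationState;

// Sketch only: push filtering into the RM request instead of pulling
// the full application list and filtering locally in ApplicationCLI.
public final class AppsRequestSketch {
  public static GetApplicationsRequest build() {
    GetApplicationsRequest req = GetApplicationsRequest.newInstance();
    req.setUsers(Collections.singleton("alice"));         // user filter
    req.setQueues(Collections.singleton("default"));      // queue filter
    req.setApplicationStates(EnumSet.of(YarnApplicationState.RUNNING));
    req.setApplicationTags(Collections.singleton("etl")); // tags filter
    req.setLimit(50);                                     // cap result size
    return req;
  }
}
{code}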
[jira] [Commented] (YARN-2074) Preemption of AM containers shouldn't count towards AM failures
[ https://issues.apache.org/jira/browse/YARN-2074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14033531#comment-14033531 ] Hadoop QA commented on YARN-2074: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12650742/YARN-2074.7.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 4 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4011//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4011//console This message is automatically generated. Preemption of AM containers shouldn't count towards AM failures --- Key: YARN-2074 URL: https://issues.apache.org/jira/browse/YARN-2074 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Vinod Kumar Vavilapalli Assignee: Jian He Attachments: YARN-2074.1.patch, YARN-2074.2.patch, YARN-2074.3.patch, YARN-2074.4.patch, YARN-2074.5.patch, YARN-2074.6.patch, YARN-2074.6.patch, YARN-2074.7.patch, YARN-2074.7.patch One orthogonal concern with issues like YARN-2055 and YARN-2022 is that AM containers getting preempted shouldn't count towards AM failures and thus shouldn't eventually fail applications. We should explicitly handle AM container preemption/kill as a separate issue and not count it towards the limit on AM failures. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2142) Add one service to check the nodes' TRUST status
[ https://issues.apache.org/jira/browse/YARN-2142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] anders updated YARN-2142: - Attachment: trust.patch Test whether this patch can work Add one service to check the nodes' TRUST status - Key: YARN-2142 URL: https://issues.apache.org/jira/browse/YARN-2142 Project: Hadoop YARN Issue Type: New Feature Components: nodemanager, resourcemanager, scheduler Affects Versions: 2.2.0 Environment: OS:Ubuntu 13.04; JAVA:OpenJDK 7u51-2.4.4-0 Reporter: anders Priority: Minor Labels: patch Fix For: 2.2.0 Attachments: trust.patch, trust.patch Original Estimate: 1m Remaining Estimate: 1m Because of the critical computing environment, we must test every node's TRUST status in the cluster (we can get the TRUST status via the API of the OAT server), so I added this feature into Hadoop's scheduling. Through the TRUST check service, a node can get its own TRUST status and then send that status to the resource manager via the heartbeat for scheduling. In the scheduling step, if a node's TRUST status is 'false', it will be passed over until its TRUST status turns to 'true'. ***The logic of this feature is similar to the node's health-check service. -- This message was sent by Atlassian JIRA (v6.2#6252)
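Since the description likens the feature to the node health-check service, the shape would be a small polling service whose cached flag rides on the heartbeat; everything below, including the OAT client interface, is invented for illustration:
{code}
import java.util.Timer;
import java.util.TimerTask;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch of the described TRUST-check service.
public final class NodeTrustCheckerSketch {
  /** Invented stand-in for the OAT attestation API. */
  public interface OatClient { boolean isNodeTrusted(); }

  private final AtomicBoolean trusted = new AtomicBoolean(false);

  public void start(final OatClient oat, long intervalMs) {
    Timer timer = new Timer("trust-check", true); // daemon thread
    timer.scheduleAtFixedRate(new TimerTask() {
      @Override public void run() {
        // Poll the OAT server; the heartbeat later reports this flag,
        // and the RM skips the node while it is false.
        trusted.set(oat.isNodeTrusted());
      }
    }, 0, intervalMs);
  }

  public boolean isTrusted() { return trusted.get(); }
}
{code}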
[jira] [Commented] (YARN-2167) LeveldbIterator should get closed in NMLeveldbStateStoreService#loadLocalizationState() within finally block
[ https://issues.apache.org/jira/browse/YARN-2167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14033674#comment-14033674 ] Hudson commented on YARN-2167: -- SUCCESS: Integrated in Hadoop-Yarn-trunk #586 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/586/]) YARN-2167. LeveldbIterator should get closed in NMLeveldbStateStoreService#loadLocalizationState() within finally block. Contributed by Junping Du (jlowe: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1603039) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/recovery/NMLeveldbStateStoreService.java LeveldbIterator should get closed in NMLeveldbStateStoreService#loadLocalizationState() within finally block Key: YARN-2167 URL: https://issues.apache.org/jira/browse/YARN-2167 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: Junping Du Assignee: Junping Du Fix For: 3.0.0, 2.5.0 Attachments: YARN-2167.patch In NMLeveldbStateStoreService#loadLocalizationState(), we use a LeveldbIterator to read the NM's localization state, but it does not get closed in a finally block. We should close this connection to the DB as a common practice. -- This message was sent by Atlassian JIRA (v6.2#6252)
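The fix pattern is the usual try/finally close; an abridged sketch of the shape (loop body elided, not the committed code):
{code}
import org.apache.hadoop.yarn.server.utils.LeveldbIterator;

// Sketch only: close the iterator even when loading throws, so the
// leveldb connection is not leaked on an error path.
public final class LoadStateSketch {
  static void loadLocalizationState(LeveldbIterator iter) throws Exception {
    try {
      while (iter.hasNext()) {
        iter.next(); // read and rebuild the NM localization state here
      }
    } finally {
      iter.close(); // always release the DB resources
    }
  }
}
{code}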
[jira] [Commented] (YARN-2159) Better logging in SchedulerNode#allocateContainer
[ https://issues.apache.org/jira/browse/YARN-2159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14033673#comment-14033673 ] Hudson commented on YARN-2159: -- SUCCESS: Integrated in Hadoop-Yarn-trunk #586 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/586/]) YARN-2159. Better logging in SchedulerNode#allocateContainer. (Ray Chiang via kasha) (kasha: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1603003) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/SchedulerNode.java Better logging in SchedulerNode#allocateContainer - Key: YARN-2159 URL: https://issues.apache.org/jira/browse/YARN-2159 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Reporter: Ray Chiang Assignee: Ray Chiang Priority: Trivial Labels: newbie, supportability Fix For: 2.5.0 Attachments: YARN2159-01.patch This bit of code: {quote} LOG.info("Assigned container " + container.getId() + " of capacity " + container.getResource() + " on host " + rmNode.getNodeAddress() + ", which currently has " + numContainers + " containers, " + getUsedResource() + " used and " + getAvailableResource() + " available"); {quote} results in a line like: {quote} 2014-05-30 16:17:43,573 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSSchedulerNode: Assigned container container_14000_0009_01_00 of capacity <memory:1536, vCores:1> on host machine.host.domain.com:8041, which currently has 18 containers, <memory:27648, vCores:18> used and <memory:3072, vCores:0> available {quote} That message is fine in most cases, but looks pretty bad after the last available allocation, since it says something like "vCores:0 available". Here is one suggested phrasing: "which has 18 containers, <memory:27648, vCores:18> used and <memory:3072, vCores:0> available after allocation" -- This message was sent by Atlassian JIRA (v6.2#6252)
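Putting the suggestion together, the revised statement would read roughly as follows (quotes restored; "after allocation" appended per the suggestion above):
{code}
LOG.info("Assigned container " + container.getId() + " of capacity "
    + container.getResource() + " on host " + rmNode.getNodeAddress()
    + ", which has " + numContainers + " containers, "
    + getUsedResource() + " used and " + getAvailableResource()
    + " available after allocation");
{code}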
[jira] [Commented] (YARN-1339) Recover DeletionService state upon nodemanager restart
[ https://issues.apache.org/jira/browse/YARN-1339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14033677#comment-14033677 ] Hudson commented on YARN-1339: -- SUCCESS: Integrated in Hadoop-Yarn-trunk #586 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/586/]) YARN-1339. Recover DeletionService state upon nodemanager restart. (Contributed by Jason Lowe) (junping_du: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1603036) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/DeletionService.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeManager.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/recovery/NMLeveldbStateStoreService.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/recovery/NMNullStateStoreService.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/recovery/NMStateStoreService.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/proto/yarn_server_nodemanager_recovery.proto * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestDeletionService.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/recovery/NMMemoryStateStoreService.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/recovery/TestNMLeveldbStateStoreService.java Recover DeletionService state upon nodemanager restart -- Key: YARN-1339 URL: https://issues.apache.org/jira/browse/YARN-1339 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: 2.3.0 Reporter: Jason Lowe Assignee: Jason Lowe Fix For: 2.5.0 Attachments: YARN-1339.patch, YARN-1339v2.patch, YARN-1339v3-and-YARN-1987.patch, YARN-1339v4.patch, YARN-1339v5.patch, YARN-1339v6.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1885) RM may not send the app-finished signal after RM restart to some nodes where the application ran before RM restarts
[ https://issues.apache.org/jira/browse/YARN-1885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14033679#comment-14033679 ] Hudson commented on YARN-1885: -- SUCCESS: Integrated in Hadoop-Yarn-trunk #586 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/586/]) YARN-1885. Fixed a bug that RM may not send application-clean-up signal to NMs where the completed applications previously ran in case of RM restart. Contributed by Wangda Tan (jianhe: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1603028) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client/src/test/java/org/apache/hadoop/yarn/client/TestResourceTrackerOnHA.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/api/protocolrecords/RegisterNodeManagerRequest.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/api/protocolrecords/impl/pb/RegisterNodeManagerRequestPBImpl.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/proto/yarn_server_common_service_protos.proto * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/test/java/org/apache/hadoop/yarn/server/api/protocolrecords/TestProtocolRecords.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/test/java/org/apache/hadoop/yarn/server/api/protocolrecords/TestRegisterNodeManagerRequest.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeStatusUpdaterImpl.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ResourceTrackerService.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/RMApp.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/RMAppEventType.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/RMAppImpl.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/RMAppRunningOnNodeEvent.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/attempt/RMAppAttempt.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/attempt/RMAppAttemptEventType.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/attempt/RMAppAttemptImpl.java * 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/attempt/event/RMAppAttemptContainerAcquiredEvent.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmcontainer/RMContainerImpl.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmnode/RMNodeImpl.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmnode/RMNodeStartedEvent.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/MockNM.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/MockRM.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestApplicationCleanup.java *
[jira] [Created] (YARN-2169) NMSimulator of sls should catch more Exception
Beckham007 created YARN-2169: Summary: NMSimulator of sls should catch more Exception Key: YARN-2169 URL: https://issues.apache.org/jira/browse/YARN-2169 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.4.0 Reporter: Beckham007 In the method middleStep() of NMSimulator, sending a heartbeat may cause an InterruptedException or other Exception if the load is heavy. If these exceptions are not handled, the NMSimulator task cannot be added back to the executor queue, so the NM will be lost. In my setup, the pool size is 4000, the NM count is 2000, and the AM count is 1500. Some NMs get lost. -- This message was sent by Atlassian JIRA (v6.2#6252)
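The proposed hardening is to wrap the heartbeat in a broad catch so a single failed beat cannot end the recurring task; a simplified sketch (the sendHeartBeat() hook stands in for the simulator's real RPC and is assumed, not actual API):
{code}
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;

// Sketch only: if an exception escapes middleStep(), the task is never
// resubmitted to the executor and the simulated NM looks lost to the RM.
public abstract class SafeNmSimulatorStepSketch {
  private static final Log LOG =
      LogFactory.getLog(SafeNmSimulatorStepSketch.class);

  public void middleStep() {
    try {
      sendHeartBeat();
    } catch (Exception e) {
      // Log and carry on; the next scheduled cycle retries the beat.
      LOG.error("NM heartbeat failed; retrying on next cycle", e);
    }
  }

  protected abstract void sendHeartBeat() throws Exception;
}
{code}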
[jira] [Updated] (YARN-2169) NMSimulator of sls should catch more Exception
[ https://issues.apache.org/jira/browse/YARN-2169?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Beckham007 updated YARN-2169: - Attachment: YARN-2169.patch NMSimulator of sls should catch more Exception -- Key: YARN-2169 URL: https://issues.apache.org/jira/browse/YARN-2169 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.4.0 Reporter: Beckham007 Attachments: YARN-2169.patch In the method middleStep() of NMSimulator, sending a heartbeat may cause an InterruptedException or other Exception if the load is heavy. If these exceptions are not handled, the NMSimulator task cannot be added back to the executor queue, so the NM will be lost. In my setup, the pool size is 4000, the NM count is 2000, and the AM count is 1500. Some NMs get lost. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-2170) Fix components' version information in the web page 'About the Cluster'
Jun Gong created YARN-2170: -- Summary: Fix components' version information in the web page 'About the Cluster' Key: YARN-2170 URL: https://issues.apache.org/jira/browse/YARN-2170 Project: Hadoop YARN Issue Type: Bug Reporter: Jun Gong Priority: Minor In the web page 'About the Cluster', the build version of YARN components (e.g. the ResourceManager) is currently shown as the Hadoop version. This is caused by mistakenly calling getVersion() instead of _getVersion() in VersionInfo.java. -- This message was sent by Atlassian JIRA (v6.2#6252)
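The described bug is a static-versus-instance mix-up: the instance method _getVersion() reads the component's own properties, while the static getVersion() answers for common Hadoop. A simplified sketch of the distinction (not the actual class):
{code}
import java.util.Properties;

// Simplified sketch of the VersionInfo pattern described above.
public class VersionInfoSketch {
  private final Properties props = new Properties();

  public VersionInfoSketch(String component) {
    // The real class loads <component>-version-info.properties here;
    // this sketch just stores a placeholder.
    props.setProperty("version", component + "-version");
  }

  /** Instance lookup: answers for THIS component (e.g. the RM). */
  protected String _getVersion() {
    return props.getProperty("version", "Unknown");
  }

  private static final VersionInfoSketch COMMON =
      new VersionInfoSketch("common");

  /** Static lookup: always answers for common Hadoop, so calling this
   *  from a component's page shows the Hadoop version by mistake. */
  public static String getVersion() {
    return COMMON._getVersion();
  }
}
{code}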
[jira] [Updated] (YARN-2170) Fix components' version information in the web page 'About the Cluster'
[ https://issues.apache.org/jira/browse/YARN-2170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jun Gong updated YARN-2170: --- Attachment: YARN-2170.patch Fix components' version information in the web page 'About the Cluster' --- Key: YARN-2170 URL: https://issues.apache.org/jira/browse/YARN-2170 Project: Hadoop YARN Issue Type: Bug Reporter: Jun Gong Priority: Minor Attachments: YARN-2170.patch In the web page 'About the Cluster', the build version of YARN components (e.g. the ResourceManager) is currently shown as the Hadoop version. This is caused by mistakenly calling getVersion() instead of _getVersion() in VersionInfo.java. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1339) Recover DeletionService state upon nodemanager restart
[ https://issues.apache.org/jira/browse/YARN-1339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14033829#comment-14033829 ] Hudson commented on YARN-1339: -- SUCCESS: Integrated in Hadoop-Hdfs-trunk #1777 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1777/]) YARN-1339. Recover DeletionService state upon nodemanager restart. (Contributed by Jason Lowe) (junping_du: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1603036) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/DeletionService.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeManager.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/recovery/NMLeveldbStateStoreService.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/recovery/NMNullStateStoreService.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/recovery/NMStateStoreService.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/proto/yarn_server_nodemanager_recovery.proto * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestDeletionService.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/recovery/NMMemoryStateStoreService.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/recovery/TestNMLeveldbStateStoreService.java Recover DeletionService state upon nodemanager restart -- Key: YARN-1339 URL: https://issues.apache.org/jira/browse/YARN-1339 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: 2.3.0 Reporter: Jason Lowe Assignee: Jason Lowe Fix For: 2.5.0 Attachments: YARN-1339.patch, YARN-1339v2.patch, YARN-1339v3-and-YARN-1987.patch, YARN-1339v4.patch, YARN-1339v5.patch, YARN-1339v6.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2167) LeveldbIterator should get closed in NMLeveldbStateStoreService#loadLocalizationState() within finally block
[ https://issues.apache.org/jira/browse/YARN-2167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14033826#comment-14033826 ] Hudson commented on YARN-2167: -- SUCCESS: Integrated in Hadoop-Hdfs-trunk #1777 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1777/]) YARN-2167. LeveldbIterator should get closed in NMLeveldbStateStoreService#loadLocalizationState() within finally block. Contributed by Junping Du (jlowe: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1603039) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/recovery/NMLeveldbStateStoreService.java LeveldbIterator should get closed in NMLeveldbStateStoreService#loadLocalizationState() within finally block Key: YARN-2167 URL: https://issues.apache.org/jira/browse/YARN-2167 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: Junping Du Assignee: Junping Du Fix For: 3.0.0, 2.5.0 Attachments: YARN-2167.patch In NMLeveldbStateStoreService#loadLocalizationState(), we use a LeveldbIterator to read the NM's localization state, but it does not get closed in a finally block. We should close this connection to the DB as a common practice. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1885) RM may not send the app-finished signal after RM restart to some nodes where the application ran before RM restarts
[ https://issues.apache.org/jira/browse/YARN-1885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14033831#comment-14033831 ] Hudson commented on YARN-1885: -- SUCCESS: Integrated in Hadoop-Hdfs-trunk #1777 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1777/]) YARN-1885. Fixed a bug that RM may not send application-clean-up signal to NMs where the completed applications previously ran in case of RM restart. Contributed by Wangda Tan (jianhe: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1603028) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client/src/test/java/org/apache/hadoop/yarn/client/TestResourceTrackerOnHA.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/api/protocolrecords/RegisterNodeManagerRequest.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/api/protocolrecords/impl/pb/RegisterNodeManagerRequestPBImpl.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/proto/yarn_server_common_service_protos.proto * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/test/java/org/apache/hadoop/yarn/server/api/protocolrecords/TestProtocolRecords.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/test/java/org/apache/hadoop/yarn/server/api/protocolrecords/TestRegisterNodeManagerRequest.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeStatusUpdaterImpl.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ResourceTrackerService.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/RMApp.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/RMAppEventType.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/RMAppImpl.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/RMAppRunningOnNodeEvent.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/attempt/RMAppAttempt.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/attempt/RMAppAttemptEventType.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/attempt/RMAppAttemptImpl.java * 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/attempt/event/RMAppAttemptContainerAcquiredEvent.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmcontainer/RMContainerImpl.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmnode/RMNodeImpl.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmnode/RMNodeStartedEvent.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/MockNM.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/MockRM.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestApplicationCleanup.java *
[jira] [Commented] (YARN-2159) Better logging in SchedulerNode#allocateContainer
[ https://issues.apache.org/jira/browse/YARN-2159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14033825#comment-14033825 ] Hudson commented on YARN-2159: -- SUCCESS: Integrated in Hadoop-Hdfs-trunk #1777 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1777/]) YARN-2159. Better logging in SchedulerNode#allocateContainer. (Ray Chiang via kasha) (kasha: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1603003) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/SchedulerNode.java Better logging in SchedulerNode#allocateContainer - Key: YARN-2159 URL: https://issues.apache.org/jira/browse/YARN-2159 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Reporter: Ray Chiang Assignee: Ray Chiang Priority: Trivial Labels: newbie, supportability Fix For: 2.5.0 Attachments: YARN2159-01.patch This bit of code: {quote} LOG.info("Assigned container " + container.getId() + " of capacity " + container.getResource() + " on host " + rmNode.getNodeAddress() + ", which currently has " + numContainers + " containers, " + getUsedResource() + " used and " + getAvailableResource() + " available"); {quote} results in a line like: {quote} 2014-05-30 16:17:43,573 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSSchedulerNode: Assigned container container_14000_0009_01_00 of capacity <memory:1536, vCores:1> on host machine.host.domain.com:8041, which currently has 18 containers, <memory:27648, vCores:18> used and <memory:3072, vCores:0> available {quote} That message is fine in most cases, but looks pretty bad after the last available allocation, since it says something like "vCores:0 available". Here is one suggested phrasing: "which has 18 containers, <memory:27648, vCores:18> used and <memory:3072, vCores:0> available after allocation" -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-2171) AMs block on the CapacityScheduler lock during allocate()
Jason Lowe created YARN-2171: Summary: AMs block on the CapacityScheduler lock during allocate() Key: YARN-2171 URL: https://issues.apache.org/jira/browse/YARN-2171 Project: Hadoop YARN Issue Type: Bug Components: capacityscheduler Affects Versions: 2.4.0, 0.23.10 Reporter: Jason Lowe Assignee: Jason Lowe Priority: Critical When AMs heartbeat into the RM via the allocate() call they are blocking on the CapacityScheduler lock when trying to get the number of nodes in the cluster via getNumClusterNodes. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2171) AMs block on the CapacityScheduler lock during allocate()
[ https://issues.apache.org/jira/browse/YARN-2171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14033864#comment-14033864 ] Jason Lowe commented on YARN-2171: -- When the CapacityScheduler scheduler thread is running full-time due to a constant stream of events (e.g.: a large number of running applications with a large number of cluster nodes), the CapacityScheduler lock is held by that scheduler loop most of the time. As AMs heartbeat into the RM to try to get their resources, the capacity scheduler code goes out of its way to avoid having the AMs grab the scheduler lock. Unfortunately this call to fetch a single integer value was missed. The AMs therefore end up piling up on the scheduler lock, filling all of the IPC handlers of the ApplicationMasterService while the rest back up on the call queue. Once the scheduler releases the lock it quickly grabs it again, so only a few AMs get through the gate and the IPC handlers fill again with the next batch of AMs blocking on the scheduler lock. This causes average RPC response times for AMs to skyrocket. AMs experience large delays getting their allocations, which in turn leads to lower cluster utilization and increased application runtimes. AMs block on the CapacityScheduler lock during allocate() - Key: YARN-2171 URL: https://issues.apache.org/jira/browse/YARN-2171 Project: Hadoop YARN Issue Type: Bug Components: capacityscheduler Affects Versions: 0.23.10, 2.4.0 Reporter: Jason Lowe Assignee: Jason Lowe Priority: Critical When AMs heartbeat into the RM via the allocate() call they are blocking on the CapacityScheduler lock when trying to get the number of nodes in the cluster via getNumClusterNodes. -- This message was sent by Atlassian JIRA (v6.2#6252)
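The usual remedy for this pattern is to serve the cheap read from an atomic field so allocate() never touches the scheduler's monitor; a sketch of the idea (not the committed patch):
{code}
import java.util.concurrent.atomic.AtomicInteger;

// Sketch only: maintain the node count outside the big scheduler lock
// so AM heartbeats can read it without contending for the monitor.
public class NodeCountSketch {
  private final AtomicInteger numNodes = new AtomicInteger();

  // Updated under the scheduler lock when nodes join or leave.
  void nodeAdded()   { numNodes.incrementAndGet(); }
  void nodeRemoved() { numNodes.decrementAndGet(); }

  // Lock-free read path usable from allocate().
  public int getNumClusterNodes() { return numNodes.get(); }
}
{code}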
[jira] [Commented] (YARN-409) Allow apps to be killed via the RM REST API
[ https://issues.apache.org/jira/browse/YARN-409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14033874#comment-14033874 ] Romain Rigaux commented on YARN-409: dup of https://issues.apache.org/jira/browse/YARN-1702? Allow apps to be killed via the RM REST API --- Key: YARN-409 URL: https://issues.apache.org/jira/browse/YARN-409 Project: Hadoop YARN Issue Type: New Feature Components: api, resourcemanager Affects Versions: 2.0.3-alpha Reporter: Sandy Ryza Assignee: Sandy Ryza The RM REST API currently allows getting information about running applications. Adding the capability to kill applications would allow systems like Hue to perform their functions over HTTP. -- This message was sent by Atlassian JIRA (v6.2#6252)
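For reference, YARN-1702's kill-over-REST takes the shape of a PUT against the app's state resource; a hedged usage sketch (host, port, and application id are placeholders):
{code}
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Sketch only: PUT {"state":"KILLED"} to the RM web services app-state
// resource; URL layout per the RM REST API, values are placeholders.
public final class RestKillSketch {
  public static void main(String[] args) throws Exception {
    URL url = new URL("http://rm-host:8088/ws/v1/cluster/apps/"
        + "application_1400000000000_0001/state");
    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
    conn.setRequestMethod("PUT");
    conn.setRequestProperty("Content-Type", "application/json");
    conn.setDoOutput(true);
    try (OutputStream out = conn.getOutputStream()) {
      out.write("{\"state\":\"KILLED\"}".getBytes(StandardCharsets.UTF_8));
    }
    System.out.println("HTTP " + conn.getResponseCode());
  }
}
{code}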
[jira] [Created] (YARN-2172) Suspend/Resume Hadoop Jobs
Richard Chen created YARN-2172: -- Summary: Suspend/Resume Hadoop Jobs Key: YARN-2172 URL: https://issues.apache.org/jira/browse/YARN-2172 Project: Hadoop YARN Issue Type: New Feature Components: resourcemanager, webapp Affects Versions: 2.2.0 Environment: CentOS 6.5, Hadoop 2.2.0 Reporter: Richard Chen Fix For: 2.2.0 In a multi-application cluster environment, jobs running inside Hadoop application may be of lower-priority than jobs running inside other applications like HBase. To give way to other higher-priority jobs inside Hadoop, a user or some cluster-level resource scheduling service should be able to suspend and/or resume some particular jobs within Hadoop application. When target jobs inside Hadoop are suspended, those already allocated and running task containers will continue to run until their completion or active preemption by other ways. But no more new containers would be allocated to the target jobs. In contrast, when suspended jobs are put into resume mode, they will continue to run from the previous job progress and have new task containers allocated to complete the rest of the jobs. -- This message was sent by Atlassian JIRA (v6.2#6252)
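Mechanically, the semantics described reduce to a per-application gate on the allocation path: running containers are left alone and only new allocations are withheld while suspended. A hypothetical sketch (all names invented):
{code}
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of the suspend/resume gate described above.
public final class SuspendGateSketch {
  private final Set<String> suspended = ConcurrentHashMap.newKeySet();

  public void suspend(String appId) { suspended.add(appId); }
  public void resume(String appId)  { suspended.remove(appId); }

  /** The scheduler would consult this before granting a NEW container;
   *  already-running containers keep going either way. */
  public boolean mayAllocate(String appId) {
    return !suspended.contains(appId);
  }
}
{code}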
[jira] [Updated] (YARN-2172) Suspend/Resume Hadoop Jobs
[ https://issues.apache.org/jira/browse/YARN-2172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Richard Chen updated YARN-2172: --- Description: In a multi-application cluster environment, jobs running inside Hadoop YARN may be of lower-priority than jobs running outside Hadoop YARN like HBase. To give way to other higher-priority jobs inside Hadoop, a user or some cluster-level resource scheduling service should be able to suspend and/or resume some particular jobs within Hadoop YARN. When target jobs inside Hadoop are suspended, those already allocated and running task containers will continue to run until their completion or active preemption by other ways. But no more new containers would be allocated to the target jobs. In contrast, when suspended jobs are put into resume mode, they will continue to run from the previous job progress and have new task containers allocated to complete the rest of the jobs. was: In a multi-application cluster environment, jobs running inside Hadoop YARN may be of lower-priority than jobs running outside Hadoop YARN like HBase. To give way to other higher-priority jobs inside Hadoop, a user or some cluster-level resource scheduling service should be able to suspend and/or resume some particular jobs within Hadoop application. When target jobs inside Hadoop are suspended, those already allocated and running task containers will continue to run until their completion or active preemption by other ways. But no more new containers would be allocated to the target jobs. In contrast, when suspended jobs are put into resume mode, they will continue to run from the previous job progress and have new task containers allocated to complete the rest of the jobs. Suspend/Resume Hadoop Jobs -- Key: YARN-2172 URL: https://issues.apache.org/jira/browse/YARN-2172 Project: Hadoop YARN Issue Type: New Feature Components: resourcemanager, webapp Affects Versions: 2.2.0 Environment: CentOS 6.5, Hadoop 2.2.0 Reporter: Richard Chen Labels: hadoop, jobs, resume, suspend Fix For: 2.2.0 Original Estimate: 336h Remaining Estimate: 336h In a multi-application cluster environment, jobs running inside Hadoop YARN may be of lower-priority than jobs running outside Hadoop YARN like HBase. To give way to other higher-priority jobs inside Hadoop, a user or some cluster-level resource scheduling service should be able to suspend and/or resume some particular jobs within Hadoop YARN. When target jobs inside Hadoop are suspended, those already allocated and running task containers will continue to run until their completion or active preemption by other ways. But no more new containers would be allocated to the target jobs. In contrast, when suspended jobs are put into resume mode, they will continue to run from the previous job progress and have new task containers allocated to complete the rest of the jobs. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2172) Suspend/Resume Hadoop Jobs
[ https://issues.apache.org/jira/browse/YARN-2172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Richard Chen updated YARN-2172: --- Description: In a multi-application cluster environment, jobs running inside Hadoop YARN may be of lower-priority than jobs running outside Hadoop YARN like HBase. To give way to other higher-priority jobs inside Hadoop, a user or some cluster-level resource scheduling service should be able to suspend and/or resume some particular jobs within Hadoop application. When target jobs inside Hadoop are suspended, those already allocated and running task containers will continue to run until their completion or active preemption by other ways. But no more new containers would be allocated to the target jobs. In contrast, when suspended jobs are put into resume mode, they will continue to run from the previous job progress and have new task containers allocated to complete the rest of the jobs. was: In a multi-application cluster environment, jobs running inside Hadoop application may be of lower-priority than jobs running inside other applications like HBase. To give way to other higher-priority jobs inside Hadoop, a user or some cluster-level resource scheduling service should be able to suspend and/or resume some particular jobs within Hadoop application. When target jobs inside Hadoop are suspended, those already allocated and running task containers will continue to run until their completion or active preemption by other ways. But no more new containers would be allocated to the target jobs. In contrast, when suspended jobs are put into resume mode, they will continue to run from the previous job progress and have new task containers allocated to complete the rest of the jobs. Suspend/Resume Hadoop Jobs -- Key: YARN-2172 URL: https://issues.apache.org/jira/browse/YARN-2172 Project: Hadoop YARN Issue Type: New Feature Components: resourcemanager, webapp Affects Versions: 2.2.0 Environment: CentOS 6.5, Hadoop 2.2.0 Reporter: Richard Chen Labels: hadoop, jobs, resume, suspend Fix For: 2.2.0 Original Estimate: 336h Remaining Estimate: 336h In a multi-application cluster environment, jobs running inside Hadoop YARN may be of lower-priority than jobs running outside Hadoop YARN like HBase. To give way to other higher-priority jobs inside Hadoop, a user or some cluster-level resource scheduling service should be able to suspend and/or resume some particular jobs within Hadoop application. When target jobs inside Hadoop are suspended, those already allocated and running task containers will continue to run until their completion or active preemption by other ways. But no more new containers would be allocated to the target jobs. In contrast, when suspended jobs are put into resume mode, they will continue to run from the previous job progress and have new task containers allocated to complete the rest of the jobs. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2172) Suspend/Resume Hadoop Jobs
[ https://issues.apache.org/jira/browse/YARN-2172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Richard Chen updated YARN-2172: --- Description: In a multi-application cluster environment, jobs running inside Hadoop YARN may be of lower-priority than jobs running outside Hadoop YARN like HBase. To give way to other higher-priority jobs inside Hadoop, a user or some cluster-level resource scheduling service should be able to suspend and/or resume some particular jobs within Hadoop YARN. When target jobs inside Hadoop are suspended, those already allocated and running task containers will continue to run until their completion or active preemption by other ways. But no more new containers would be allocated to the target jobs. In contrast, when suspended jobs are put into resume mode, they will continue to run from the previous job progress and have new task containers allocated to complete the rest of the jobs. My team has completed its implementation and our tests showed it works in a rather solid way. was: In a multi-application cluster environment, jobs running inside Hadoop YARN may be of lower-priority than jobs running outside Hadoop YARN like HBase. To give way to other higher-priority jobs inside Hadoop, a user or some cluster-level resource scheduling service should be able to suspend and/or resume some particular jobs within Hadoop YARN. When target jobs inside Hadoop are suspended, those already allocated and running task containers will continue to run until their completion or active preemption by other ways. But no more new containers would be allocated to the target jobs. In contrast, when suspended jobs are put into resume mode, they will continue to run from the previous job progress and have new task containers allocated to complete the rest of the jobs. My team has completed its implementation and our tests showed it is working in a rather solid way. Suspend/Resume Hadoop Jobs -- Key: YARN-2172 URL: https://issues.apache.org/jira/browse/YARN-2172 Project: Hadoop YARN Issue Type: New Feature Components: resourcemanager, webapp Affects Versions: 2.2.0 Environment: CentOS 6.5, Hadoop 2.2.0 Reporter: Richard Chen Labels: hadoop, jobs, resume, suspend Fix For: 2.2.0 Original Estimate: 336h Remaining Estimate: 336h In a multi-application cluster environment, jobs running inside Hadoop YARN may be of lower-priority than jobs running outside Hadoop YARN like HBase. To give way to other higher-priority jobs inside Hadoop, a user or some cluster-level resource scheduling service should be able to suspend and/or resume some particular jobs within Hadoop YARN. When target jobs inside Hadoop are suspended, those already allocated and running task containers will continue to run until their completion or active preemption by other ways. But no more new containers would be allocated to the target jobs. In contrast, when suspended jobs are put into resume mode, they will continue to run from the previous job progress and have new task containers allocated to complete the rest of the jobs. My team has completed its implementation and our tests showed it works in a rather solid way. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2172) Suspend/Resume Hadoop Jobs
[ https://issues.apache.org/jira/browse/YARN-2172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Richard Chen updated YARN-2172: --- Description: In a multi-application cluster environment, jobs running inside Hadoop YARN may be of lower-priority than jobs running outside Hadoop YARN like HBase. To give way to other higher-priority jobs inside Hadoop, a user or some cluster-level resource scheduling service should be able to suspend and/or resume some particular jobs within Hadoop YARN. When target jobs inside Hadoop are suspended, those already allocated and running task containers will continue to run until their completion or active preemption by other ways. But no more new containers would be allocated to the target jobs. In contrast, when suspended jobs are put into resume mode, they will continue to run from the previous job progress and have new task containers allocated to complete the rest of the jobs. My team has completed its implementation and our tests showed it is working in a rather solid way. was: In a multi-application cluster environment, jobs running inside Hadoop YARN may be of lower-priority than jobs running outside Hadoop YARN like HBase. To give way to other higher-priority jobs inside Hadoop, a user or some cluster-level resource scheduling service should be able to suspend and/or resume some particular jobs within Hadoop YARN. When target jobs inside Hadoop are suspended, those already allocated and running task containers will continue to run until their completion or active preemption by other ways. But no more new containers would be allocated to the target jobs. In contrast, when suspended jobs are put into resume mode, they will continue to run from the previous job progress and have new task containers allocated to complete the rest of the jobs. Suspend/Resume Hadoop Jobs -- Key: YARN-2172 URL: https://issues.apache.org/jira/browse/YARN-2172 Project: Hadoop YARN Issue Type: New Feature Components: resourcemanager, webapp Affects Versions: 2.2.0 Environment: CentOS 6.5, Hadoop 2.2.0 Reporter: Richard Chen Labels: hadoop, jobs, resume, suspend Fix For: 2.2.0 Original Estimate: 336h Remaining Estimate: 336h In a multi-application cluster environment, jobs running inside Hadoop YARN may be of lower-priority than jobs running outside Hadoop YARN like HBase. To give way to other higher-priority jobs inside Hadoop, a user or some cluster-level resource scheduling service should be able to suspend and/or resume some particular jobs within Hadoop YARN. When target jobs inside Hadoop are suspended, those already allocated and running task containers will continue to run until their completion or active preemption by other ways. But no more new containers would be allocated to the target jobs. In contrast, when suspended jobs are put into resume mode, they will continue to run from the previous job progress and have new task containers allocated to complete the rest of the jobs. My team has completed its implementation and our tests showed it is working in a rather solid way. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2167) LeveldbIterator should get closed in NMLeveldbStateStoreService#loadLocalizationState() within finally block
[ https://issues.apache.org/jira/browse/YARN-2167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14033906#comment-14033906 ] Hudson commented on YARN-2167: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk #1804 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1804/]) YARN-2167. LeveldbIterator should get closed in NMLeveldbStateStoreService#loadLocalizationState() within finally block. Contributed by Junping Du (jlowe: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1603039) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/recovery/NMLeveldbStateStoreService.java LeveldbIterator should get closed in NMLeveldbStateStoreService#loadLocalizationState() within finally block Key: YARN-2167 URL: https://issues.apache.org/jira/browse/YARN-2167 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: Junping Du Assignee: Junping Du Fix For: 3.0.0, 2.5.0 Attachments: YARN-2167.patch In NMLeveldbStateStoreService#loadLocalizationState(), we use a LeveldbIterator to read the NM's localization state, but it does not get closed in a finally block. We should close this connection to the DB as a common practice. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2159) Better logging in SchedulerNode#allocateContainer
[ https://issues.apache.org/jira/browse/YARN-2159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14033905#comment-14033905 ] Hudson commented on YARN-2159: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk #1804 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1804/]) YARN-2159. Better logging in SchedulerNode#allocateContainer. (Ray Chiang via kasha) (kasha: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1603003) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/SchedulerNode.java Better logging in SchedulerNode#allocateContainer - Key: YARN-2159 URL: https://issues.apache.org/jira/browse/YARN-2159 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Reporter: Ray Chiang Assignee: Ray Chiang Priority: Trivial Labels: newbie, supportability Fix For: 2.5.0 Attachments: YARN2159-01.patch This bit of code: {quote} LOG.info("Assigned container " + container.getId() + " of capacity " + container.getResource() + " on host " + rmNode.getNodeAddress() + ", which currently has " + numContainers + " containers, " + getUsedResource() + " used and " + getAvailableResource() + " available"); {quote} results in a line like: {quote} 2014-05-30 16:17:43,573 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSSchedulerNode: Assigned container container_14000_0009_01_00 of capacity <memory:1536, vCores:1> on host machine.host.domain.com:8041, which currently has 18 containers, <memory:27648, vCores:18> used and <memory:3072, vCores:0> available {quote} That message is fine in most cases, but looks pretty bad after the last available allocation, since it says something like "vCores:0 available". Here is one suggested phrasing: "which has 18 containers, <memory:27648, vCores:18> used and <memory:3072, vCores:0> available after allocation" -- This message was sent by Atlassian JIRA (v6.2#6252)
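Applying the suggested phrasing, the statement would read roughly as follows (a sketch of the rewording only; the surrounding code is unchanged from the snippet above):
{code}
LOG.info("Assigned container " + container.getId() + " of capacity "
    + container.getResource() + " on host " + rmNode.getNodeAddress()
    + ", which has " + numContainers + " containers, "
    + getUsedResource() + " used and " + getAvailableResource()
    + " available after allocation");
{code}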
[jira] [Commented] (YARN-1339) Recover DeletionService state upon nodemanager restart
[ https://issues.apache.org/jira/browse/YARN-1339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14033909#comment-14033909 ] Hudson commented on YARN-1339: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk #1804 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1804/]) YARN-1339. Recover DeletionService state upon nodemanager restart. (Contributed by Jason Lowe) (junping_du: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1603036) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/DeletionService.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeManager.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/recovery/NMLeveldbStateStoreService.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/recovery/NMNullStateStoreService.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/recovery/NMStateStoreService.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/proto/yarn_server_nodemanager_recovery.proto * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestDeletionService.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/recovery/NMMemoryStateStoreService.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/recovery/TestNMLeveldbStateStoreService.java Recover DeletionService state upon nodemanager restart -- Key: YARN-1339 URL: https://issues.apache.org/jira/browse/YARN-1339 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: 2.3.0 Reporter: Jason Lowe Assignee: Jason Lowe Fix For: 2.5.0 Attachments: YARN-1339.patch, YARN-1339v2.patch, YARN-1339v3-and-YARN-1987.patch, YARN-1339v4.patch, YARN-1339v5.patch, YARN-1339v6.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1885) RM may not send the app-finished signal after RM restart to some nodes where the application ran before RM restarts
[ https://issues.apache.org/jira/browse/YARN-1885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14033911#comment-14033911 ] Hudson commented on YARN-1885: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk #1804 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1804/]) YARN-1885. Fixed a bug that RM may not send application-clean-up signal to NMs where the completed applications previously ran in case of RM restart. Contributed by Wangda Tan (jianhe: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1603028) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client/src/test/java/org/apache/hadoop/yarn/client/TestResourceTrackerOnHA.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/api/protocolrecords/RegisterNodeManagerRequest.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/api/protocolrecords/impl/pb/RegisterNodeManagerRequestPBImpl.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/proto/yarn_server_common_service_protos.proto * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/test/java/org/apache/hadoop/yarn/server/api/protocolrecords/TestProtocolRecords.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/test/java/org/apache/hadoop/yarn/server/api/protocolrecords/TestRegisterNodeManagerRequest.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeStatusUpdaterImpl.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ResourceTrackerService.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/RMApp.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/RMAppEventType.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/RMAppImpl.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/RMAppRunningOnNodeEvent.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/attempt/RMAppAttempt.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/attempt/RMAppAttemptEventType.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/attempt/RMAppAttemptImpl.java * 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/attempt/event/RMAppAttemptContainerAcquiredEvent.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmcontainer/RMContainerImpl.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmnode/RMNodeImpl.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmnode/RMNodeStartedEvent.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/MockNM.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/MockRM.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestApplicationCleanup.java *
[jira] [Updated] (YARN-2171) AMs block on the CapacityScheduler lock during allocate()
[ https://issues.apache.org/jira/browse/YARN-2171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jason Lowe updated YARN-2171: - Attachment: YARN-2171.patch Patch to use AtomicInteger for the number of nodes so we can avoid grabbing the lock to access the value. I also added a unit test to verify allocate doesn't try to grab the capacity scheduler lock. AMs block on the CapacityScheduler lock during allocate() - Key: YARN-2171 URL: https://issues.apache.org/jira/browse/YARN-2171 Project: Hadoop YARN Issue Type: Bug Components: capacityscheduler Affects Versions: 0.23.10, 2.4.0 Reporter: Jason Lowe Assignee: Jason Lowe Priority: Critical Attachments: YARN-2171.patch When AMs heartbeat into the RM via the allocate() call they are blocking on the CapacityScheduler lock when trying to get the number of nodes in the cluster via getNumClusterNodes. -- This message was sent by Atlassian JIRA (v6.2#6252)
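For illustration, a minimal sketch of the approach the patch describes (getNumClusterNodes is the method named above; the field and helper names here are illustrative):
{code}
// Publish the node count through an AtomicInteger so the AM heartbeat
// path can read it without entering the scheduler's monitor.
private final AtomicInteger numNodes = new AtomicInteger();

private synchronized void addNode(FiCaSchedulerNode node) {
  // ... existing bookkeeping under the scheduler lock ...
  numNodes.incrementAndGet();
}

private synchronized void removeNode(FiCaSchedulerNode node) {
  // ... existing bookkeeping under the scheduler lock ...
  numNodes.decrementAndGet();
}

public int getNumClusterNodes() {
  return numNodes.get();   // lock-free read for allocate()
}
{code}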
[jira] [Commented] (YARN-409) Allow apps to be killed via the RM REST API
[ https://issues.apache.org/jira/browse/YARN-409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14033950#comment-14033950 ] Sandy Ryza commented on YARN-409: - Definitely. Will close this one, since there seems to be more activity there. Allow apps to be killed via the RM REST API --- Key: YARN-409 URL: https://issues.apache.org/jira/browse/YARN-409 Project: Hadoop YARN Issue Type: New Feature Components: api, resourcemanager Affects Versions: 2.0.3-alpha Reporter: Sandy Ryza Assignee: Sandy Ryza The RM REST API currently allows getting information about running applications. Adding the capability to kill applications would allow systems like Hue to perform their functions over HTTP. -- This message was sent by Atlassian JIRA (v6.2#6252)
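For context, the capability being asked for amounts to a client call along these lines (a sketch; the endpoint shape follows the app-state REST API that YARN eventually shipped via the duplicating JIRA, and the host, port, and appId are placeholders):
{code}
// Ask the RM to kill an application over HTTP (sketch).
URL url = new URL("http://rm-host:8088/ws/v1/cluster/apps/"
    + appId + "/state");
HttpURLConnection conn = (HttpURLConnection) url.openConnection();
conn.setRequestMethod("PUT");
conn.setRequestProperty("Content-Type", "application/json");
conn.setDoOutput(true);
conn.getOutputStream().write("{\"state\":\"KILLED\"}".getBytes("UTF-8"));
System.out.println("HTTP " + conn.getResponseCode());
{code}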
[jira] [Resolved] (YARN-409) Allow apps to be killed via the RM REST API
[ https://issues.apache.org/jira/browse/YARN-409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandy Ryza resolved YARN-409. - Resolution: Duplicate Allow apps to be killed via the RM REST API --- Key: YARN-409 URL: https://issues.apache.org/jira/browse/YARN-409 Project: Hadoop YARN Issue Type: New Feature Components: api, resourcemanager Affects Versions: 2.0.3-alpha Reporter: Sandy Ryza Assignee: Sandy Ryza The RM REST API currently allows getting information about running applications. Adding the capability to kill applications would allow systems like Hue to perform their functions over HTTP. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2162) Fair Scheduler :ability to configure minResources and maxResources in terms of percentage
[ https://issues.apache.org/jira/browse/YARN-2162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ashwin Shankar updated YARN-2162: - Description: minResources and maxResources in fair scheduler configs are expressed in terms of absolute numbers X mb, Y vcores. As a result, when we expand or shrink our hadoop cluster, we need to recalculate and change minResources/maxResources accordingly, which is pretty inconvenient. We can circumvent this problem if we can optionally configure these properties in terms of percentage of cluster capacity. was: minResources and maxResources in fair scheduler configs are expressed in terms of absolute numbers X mb, Y vcores. As a result, when we expand or shrink our hadoop cluster, we need to recalculate and change minResources/maxResources accordingly, which is pretty inconvenient. We can circumvent this problem if we can (optionally) configure these properties in terms of percentage of cluster capacity. Fair Scheduler :ability to configure minResources and maxResources in terms of percentage - Key: YARN-2162 URL: https://issues.apache.org/jira/browse/YARN-2162 Project: Hadoop YARN Issue Type: Improvement Components: scheduler Reporter: Ashwin Shankar Labels: scheduler minResources and maxResources in fair scheduler configs are expressed in terms of absolute numbers X mb, Y vcores. As a result, when we expand or shrink our hadoop cluster, we need to recalculate and change minResources/maxResources accordingly, which is pretty inconvenient. We can circumvent this problem if we can optionally configure these properties in terms of percentage of cluster capacity. -- This message was sent by Atlassian JIRA (v6.2#6252)
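To make the proposal concrete, a hypothetical allocation-file snippet (this syntax is illustrative of the request, not a committed format):
{code}
<queue name="analytics">
  <!-- scale with the cluster instead of hard-coding X mb, Y vcores -->
  <minResources>10% memory, 10% cpu</minResources>
  <maxResources>50% memory, 50% cpu</maxResources>
</queue>
{code}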
[jira] [Created] (YARN-2173) Enabling HTTPS for the reader REST APIs
Zhijie Shen created YARN-2173: - Summary: Enabling HTTPS for the reader REST APIs Key: YARN-2173 URL: https://issues.apache.org/jira/browse/YARN-2173 Project: Hadoop YARN Issue Type: Sub-task Reporter: Zhijie Shen Assignee: Zhijie Shen -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2162) Fair Scheduler :ability to optionally configure minResources and maxResources in terms of percentage
[ https://issues.apache.org/jira/browse/YARN-2162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ashwin Shankar updated YARN-2162: - Summary: Fair Scheduler :ability to optionally configure minResources and maxResources in terms of percentage (was: Fair Scheduler :ability to configure minResources and maxResources in terms of percentage) Fair Scheduler :ability to optionally configure minResources and maxResources in terms of percentage Key: YARN-2162 URL: https://issues.apache.org/jira/browse/YARN-2162 Project: Hadoop YARN Issue Type: Improvement Components: scheduler Reporter: Ashwin Shankar Labels: scheduler minResources and maxResources in fair scheduler configs are expressed in terms of absolute numbers X mb, Y vcores. As a result, when we expand or shrink our hadoop cluster, we need to recalculate and change minResources/maxResources accordingly, which is pretty inconvenient. We can circumvent this problem if we can optionally configure these properties in terms of percentage of cluster capacity. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-2174) Enabling HTTPs for the writer REST API
Zhijie Shen created YARN-2174: - Summary: Enabling HTTPs for the writer REST API Key: YARN-2174 URL: https://issues.apache.org/jira/browse/YARN-2174 Project: Hadoop YARN Issue Type: Sub-task Reporter: Zhijie Shen -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2162) Fair Scheduler :ability to optionally configure minResources and maxResources in terms of percentage
[ https://issues.apache.org/jira/browse/YARN-2162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14034019#comment-14034019 ] Ashwin Shankar commented on YARN-2162: -- [~maysamyabandeh], yes that was the intention. Changed title and description to make it clear. Fair Scheduler :ability to optionally configure minResources and maxResources in terms of percentage Key: YARN-2162 URL: https://issues.apache.org/jira/browse/YARN-2162 Project: Hadoop YARN Issue Type: Improvement Components: scheduler Reporter: Ashwin Shankar Labels: scheduler minResources and maxResources in fair scheduler configs are expressed in terms of absolute numbers X mb, Y vcores. As a result, when we expand or shrink our hadoop cluster, we need to recalculate and change minResources/maxResources accordingly, which is pretty inconvenient. We can circumvent this problem if we can optionally configure these properties in terms of percentage of cluster capacity. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Assigned] (YARN-2174) Enabling HTTPs for the writer REST API
[ https://issues.apache.org/jira/browse/YARN-2174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen reassigned YARN-2174: - Assignee: Zhijie Shen Enabling HTTPs for the writer REST API -- Key: YARN-2174 URL: https://issues.apache.org/jira/browse/YARN-2174 Project: Hadoop YARN Issue Type: Sub-task Reporter: Zhijie Shen Assignee: Zhijie Shen -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2174) Enabling HTTPs for the writer REST API
[ https://issues.apache.org/jira/browse/YARN-2174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-2174: -- Description: Since we'd like to allow the application to put the timeline data from the client, the AM, and even the containers, we need to provide a way to distribute the keystore. Enabling HTTPs for the writer REST API -- Key: YARN-2174 URL: https://issues.apache.org/jira/browse/YARN-2174 Project: Hadoop YARN Issue Type: Sub-task Reporter: Zhijie Shen Assignee: Zhijie Shen Since we'd like to allow the application to put the timeline data from the client, the AM, and even the containers, we need to provide a way to distribute the keystore. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1341) Recover NMTokens upon nodemanager restart
[ https://issues.apache.org/jira/browse/YARN-1341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14034021#comment-14034021 ] Junping Du commented on YARN-1341: -- [~jlowe], Thanks for the patch here. I am currently reviewing it, and it looks like some of the code (LeveldbIterator, NMStateStoreService) has already been committed in other patches. Would you resync the patch against trunk? Thanks! Recover NMTokens upon nodemanager restart - Key: YARN-1341 URL: https://issues.apache.org/jira/browse/YARN-1341 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: 2.3.0 Reporter: Jason Lowe Assignee: Jason Lowe Attachments: YARN-1341.patch, YARN-1341v2.patch, YARN-1341v3.patch, YARN-1341v4-and-YARN-1987.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2102) More generalized timeline ACLs
[ https://issues.apache.org/jira/browse/YARN-2102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-2102: -- Summary: More generalized timeline ACLs (was: Extend access control for configured user/group list) More generalized timeline ACLs -- Key: YARN-2102 URL: https://issues.apache.org/jira/browse/YARN-2102 Project: Hadoop YARN Issue Type: Sub-task Reporter: Zhijie Shen Assignee: Zhijie Shen Like ApplicationACLsManager, we should also allow configured user/group to access the timeline data. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2102) More generalized timeline ACLs
[ https://issues.apache.org/jira/browse/YARN-2102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-2102: -- Description: We need to differentiate the access controls for reading and writing operations, and we need to think about cross-entity access control. For example, if we are executing a workflow of MR jobs that writes the timeline data of this workflow, we don't want other users to pollute that timeline data by putting something under it. (was: Like ApplicationACLsManager, we should also allow configured user/group to access the timeline data.) More generalized timeline ACLs -- Key: YARN-2102 URL: https://issues.apache.org/jira/browse/YARN-2102 Project: Hadoop YARN Issue Type: Sub-task Reporter: Zhijie Shen Assignee: Zhijie Shen We need to differentiate the access controls for reading and writing operations, and we need to think about cross-entity access control. For example, if we are executing a workflow of MR jobs that writes the timeline data of this workflow, we don't want other users to pollute that timeline data by putting something under it. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2171) AMs block on the CapacityScheduler lock during allocate()
[ https://issues.apache.org/jira/browse/YARN-2171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14034034#comment-14034034 ] Hadoop QA commented on YARN-2171: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12650819/YARN-2171.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.yarn.server.resourcemanager.ahs.TestRMApplicationHistoryWriter {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4014//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4014//console This message is automatically generated. AMs block on the CapacityScheduler lock during allocate() - Key: YARN-2171 URL: https://issues.apache.org/jira/browse/YARN-2171 Project: Hadoop YARN Issue Type: Bug Components: capacityscheduler Affects Versions: 0.23.10, 2.4.0 Reporter: Jason Lowe Assignee: Jason Lowe Priority: Critical Attachments: YARN-2171.patch When AMs heartbeat into the RM via the allocate() call they are blocking on the CapacityScheduler lock when trying to get the number of nodes in the cluster via getNumClusterNodes. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2083) In fair scheduler, Queue should not been assigned more containers when its usedResource had reach the maxResource limit
[ https://issues.apache.org/jira/browse/YARN-2083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yi Tian updated YARN-2083: -- Attachment: YARN-2083-2.patch Moved the test code to TestFSQueue.java. In fair scheduler, Queue should not been assigned more containers when its usedResource had reach the maxResource limit --- Key: YARN-2083 URL: https://issues.apache.org/jira/browse/YARN-2083 Project: Hadoop YARN Issue Type: Bug Components: scheduler Affects Versions: 2.3.0 Reporter: Yi Tian Labels: assignContainer, fair, scheduler Fix For: 2.4.1 Attachments: YARN-2083-1.patch, YARN-2083-2.patch, YARN-2083.patch In the fair scheduler, FSParentQueue and FSLeafQueue do an assignContainerPreCheck to guarantee the queue is not over its limit. But the fitsIn function in Resource.java does not return false when usedResource equals maxResource. I think we should create a new function, fitsInWithoutEqual, to use instead of fitsIn in this case. -- This message was sent by Atlassian JIRA (v6.2#6252)
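A sketch of the strict check the description proposes (the name fitsInWithoutEqual is taken from the description; whether every dimension should be strictly smaller is a design point for the patch):
{code}
// Unlike fitsIn, this returns false once usage has reached the cap.
public static boolean fitsInWithoutEqual(Resource smaller, Resource bigger) {
  return smaller.getMemory() < bigger.getMemory()
      && smaller.getVirtualCores() < bigger.getVirtualCores();
}
{code}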
[jira] [Updated] (YARN-868) YarnClient should set the service address in tokens returned by getRMDelegationToken()
[ https://issues.apache.org/jira/browse/YARN-868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hitesh Shah updated YARN-868: - Target Version/s: 2.5.0 (was: 2.1.0-beta) YarnClient should set the service address in tokens returned by getRMDelegationToken() -- Key: YARN-868 URL: https://issues.apache.org/jira/browse/YARN-868 Project: Hadoop YARN Issue Type: Bug Reporter: Hitesh Shah Either the client should set this information into the token or the client layer should expose an api that returns the service address. -- This message was sent by Atlassian JIRA (v6.2#6252)
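For the first option, a minimal sketch of what the client layer could do, assuming the token comes back from the getRMDelegationToken() call under discussion (the surrounding plumbing is elided):
{code}
// Stamp the RM address into the token's service field so callers can
// use the token without knowing the RM location themselves.
Token<RMDelegationTokenIdentifier> token = getRMDelegationToken(renewer);
InetSocketAddress rmAddress = conf.getSocketAddr(
    YarnConfiguration.RM_ADDRESS,
    YarnConfiguration.DEFAULT_RM_ADDRESS,
    YarnConfiguration.DEFAULT_RM_PORT);
SecurityUtil.setTokenService(token, rmAddress);
{code}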
[jira] [Commented] (YARN-2171) AMs block on the CapacityScheduler lock during allocate()
[ https://issues.apache.org/jira/browse/YARN-2171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14034060#comment-14034060 ] Vinod Kumar Vavilapalli commented on YARN-2171: --- The code changes look fine enough to me. The test is not so useful beyond validating this ticket, but that's okay. I see that we don't have any test validating the number of nodes itself explicitly, shall we add that here? AMs block on the CapacityScheduler lock during allocate() - Key: YARN-2171 URL: https://issues.apache.org/jira/browse/YARN-2171 Project: Hadoop YARN Issue Type: Bug Components: capacityscheduler Affects Versions: 0.23.10, 2.4.0 Reporter: Jason Lowe Assignee: Jason Lowe Priority: Critical Attachments: YARN-2171.patch When AMs heartbeat into the RM via the allocate() call they are blocking on the CapacityScheduler lock when trying to get the number of nodes in the cluster via getNumClusterNodes. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-365) Each NM heartbeat should not generate an event for the Scheduler
[ https://issues.apache.org/jira/browse/YARN-365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jason Lowe updated YARN-365: Attachment: YARN-365.branch-0.23.patch Patch for branch-0.23. RM unit tests pass, and I manually tested it as well on a single-node cluster forcing the scheduler to run slower than the heartbeat interval. Each NM heartbeat should not generate an event for the Scheduler Key: YARN-365 URL: https://issues.apache.org/jira/browse/YARN-365 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager, scheduler Affects Versions: 0.23.5 Reporter: Siddharth Seth Assignee: Xuan Gong Fix For: 2.1.0-beta Attachments: Prototype2.txt, Prototype3.txt, YARN-365.1.patch, YARN-365.10.patch, YARN-365.2.patch, YARN-365.3.patch, YARN-365.4.patch, YARN-365.5.patch, YARN-365.6.patch, YARN-365.7.patch, YARN-365.8.patch, YARN-365.9.patch, YARN-365.branch-0.23.patch Follow up from YARN-275 https://issues.apache.org/jira/secure/attachment/12567075/Prototype.txt -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2083) In fair scheduler, Queue should not been assigned more containers when its usedResource had reach the maxResource limit
[ https://issues.apache.org/jira/browse/YARN-2083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14034154#comment-14034154 ] Hadoop QA commented on YARN-2083: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12650834/YARN-2083-2.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.TestFSQueue {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4015//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4015//console This message is automatically generated. In fair scheduler, Queue should not been assigned more containers when its usedResource had reach the maxResource limit --- Key: YARN-2083 URL: https://issues.apache.org/jira/browse/YARN-2083 Project: Hadoop YARN Issue Type: Bug Components: scheduler Affects Versions: 2.3.0 Reporter: Yi Tian Labels: assignContainer, fair, scheduler Fix For: 2.4.1 Attachments: YARN-2083-1.patch, YARN-2083-2.patch, YARN-2083.patch In the fair scheduler, FSParentQueue and FSLeafQueue do an assignContainerPreCheck to guarantee the queue is not over its limit. But the fitsIn function in Resource.java does not return false when usedResource equals maxResource. I think we should create a new function, fitsInWithoutEqual, to use instead of fitsIn in this case. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1972) Implement secure Windows Container Executor
[ https://issues.apache.org/jira/browse/YARN-1972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14034160#comment-14034160 ] Vinod Kumar Vavilapalli commented on YARN-1972: --- bq. All in all a very high privilege required for NM. We are considering a future iteration in which we extract the privileged operations into a dedicated NT service (=daemon) and bestow the high privileges only to this service. Thanks. Let's document this in a Windows-specific docs page. bq. You are launching so many commands for every container - to chown files, to copy files etc. bq. We'll measure. [..] I don't think that moving the localization into native code would result in much benefit over a proper Java implementation. I'd file an investigation ticket to track this. bq. DCE and WCE no longer create user file cache, this is done solely by the localizer initDirs. DCE Test modified to reflect this. DCE.createUserCacheDirs renamed to createUserAppCacheDirs accordingly Regarding the division of responsibility between launching multiple commands before starting the localizer and the work that happens inside the localizer: unfortunately, this still isn't ideal. Having userCache created by the ContainerExecutor but not file-cache is asymmetric and confusing. I propose that we split this refactoring into a separate JIRA and stick to your original code. Apologies for the back-and-forth on this one. bq. There is more feedback to address (DRY between LCE and WCE localization launch, proper place for localization classpath jar). So, you will work on them here itself, right? Looks fine otherwise, except for the above comments and a request for some basic documentation. Implement secure Windows Container Executor --- Key: YARN-1972 URL: https://issues.apache.org/jira/browse/YARN-1972 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Reporter: Remus Rusanu Assignee: Remus Rusanu Labels: security, windows Attachments: YARN-1972.1.patch, YARN-1972.2.patch h1. Windows Secure Container Executor (WCE) YARN-1063 adds the necessary infrastructure to launch a process as a domain user as a solution for the problem of having a security boundary between processes executed in YARN containers and the Hadoop services. The WCE is a container executor that leverages the winutils capabilities introduced in YARN-1063 and launches containers as an OS process running as the job submitter user. A description of the S4U infrastructure used by YARN-1063 and the alternatives considered can be read on that JIRA. The WCE is based on the DefaultContainerExecutor. It relies on the DCE to drive the flow of execution, but it overrides some methods to the effect of: * changes the DCE-created user cache directories to be owned by the job user and by the nodemanager group. * changes the actual container run command to use the 'createAsUser' command of winutils task instead of 'create' * runs the localization as a standalone process instead of an in-process Java method call. This in turn relies on the winutils createAsUser feature to run the localization as the job user. When compared to LinuxContainerExecutor (LCE), the WCE has some minor differences: * it does not delegate the creation of the user cache directories to the native implementation. * it does not require special handling to be able to delete user files The approach to the WCE came from practical trial and error. 
I had to iron out some issues around the Windows script shell limitations (command line length) to get it to work, the biggest issue being the huge CLASSPATH that is commonplace in Hadoop container executions. The job container itself already deals with this via a so-called 'classpath jar', see HADOOP-8899 and YARN-316 for details. For the WCE localizer launch as a separate container the same issue had to be resolved, and I used the same 'classpath jar' approach. h2. Deployment Requirements To use the WCE one needs to set `yarn.nodemanager.container-executor.class` to `org.apache.hadoop.yarn.server.nodemanager.WindowsSecureContainerExecutor` and set `yarn.nodemanager.windows-secure-container-executor.group` to a Windows security group that the nodemanager service principal is a member of (the equivalent of the LCE `yarn.nodemanager.linux-container-executor.group`). Unlike the LCE, the WCE does not require any configuration outside of Hadoop's own yarn-site.xml. For the WCE to work the nodemanager must run as a service principal that is a member of the local Administrators group or LocalSystem. This is derived from the need to invoke the LoadUserProfile API, whose specification mentions these requirements. This is in addition
[jira] [Commented] (YARN-1972) Implement secure Windows Container Executor
[ https://issues.apache.org/jira/browse/YARN-1972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14034179#comment-14034179 ] Remus Rusanu commented on YARN-1972: Thanks for the update Vinod. I have updated the item description to act as documentation. Do you think anything more is needed? Implement secure Windows Container Executor --- Key: YARN-1972 URL: https://issues.apache.org/jira/browse/YARN-1972 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Reporter: Remus Rusanu Assignee: Remus Rusanu Labels: security, windows Attachments: YARN-1972.1.patch, YARN-1972.2.patch h1. Windows Secure Container Executor (WCE) YARN-1063 adds the necessary infrastructure to launch a process as a domain user as a solution for the problem of having a security boundary between processes executed in YARN containers and the Hadoop services. The WCE is a container executor that leverages the winutils capabilities introduced in YARN-1063 and launches containers as an OS process running as the job submitter user. A description of the S4U infrastructure used by YARN-1063 and the alternatives considered can be read on that JIRA. The WCE is based on the DefaultContainerExecutor. It relies on the DCE to drive the flow of execution, but it overrides some methods to the effect of: * changes the DCE-created user cache directories to be owned by the job user and by the nodemanager group. * changes the actual container run command to use the 'createAsUser' command of winutils task instead of 'create' * runs the localization as a standalone process instead of an in-process Java method call. This in turn relies on the winutils createAsUser feature to run the localization as the job user. When compared to LinuxContainerExecutor (LCE), the WCE has some minor differences: * it does not delegate the creation of the user cache directories to the native implementation. * it does not require special handling to be able to delete user files The approach to the WCE came from practical trial and error. I had to iron out some issues around the Windows script shell limitations (command line length) to get it to work, the biggest issue being the huge CLASSPATH that is commonplace in Hadoop container executions. The job container itself already deals with this via a so-called 'classpath jar', see HADOOP-8899 and YARN-316 for details. For the WCE localizer launch as a separate container the same issue had to be resolved, and I used the same 'classpath jar' approach. h2. Deployment Requirements To use the WCE one needs to set `yarn.nodemanager.container-executor.class` to `org.apache.hadoop.yarn.server.nodemanager.WindowsSecureContainerExecutor` and set `yarn.nodemanager.windows-secure-container-executor.group` to a Windows security group that the nodemanager service principal is a member of (the equivalent of the LCE `yarn.nodemanager.linux-container-executor.group`). Unlike the LCE, the WCE does not require any configuration outside of Hadoop's own yarn-site.xml. For the WCE to work the nodemanager must run as a service principal that is a member of the local Administrators group or LocalSystem. This is derived from the need to invoke the LoadUserProfile API, whose specification mentions these requirements. This is in addition to the SE_TCB privilege mentioned in YARN-1063, but this requirement automatically implies that the SE_TCB privilege is held by the nodemanager. For the Linux speakers in the audience, the requirement is basically to run the NM as root. 
h2. Dedicated high-privilege Service Due to the high privilege required by the WCE, we had discussed the need to isolate the high-privilege operations into a separate process, an 'executor' service that is solely responsible for starting the containers (including the localizer). The NM would have to authenticate, authorize, and communicate with this service via an IPC mechanism and use it to launch the containers. I still believe we'll end up deploying such a service, but the effort to onboard such a new platform-specific service onto the project is not trivial. -- This message was sent by Atlassian JIRA (v6.2#6252)
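For reference, the two settings named in the deployment requirements above would land in yarn-site.xml roughly as follows (the group value is an example):
{code}
<property>
  <name>yarn.nodemanager.container-executor.class</name>
  <value>org.apache.hadoop.yarn.server.nodemanager.WindowsSecureContainerExecutor</value>
</property>
<property>
  <name>yarn.nodemanager.windows-secure-container-executor.group</name>
  <!-- example group name that the NM service principal belongs to -->
  <value>nodemanager-group</value>
</property>
{code}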
[jira] [Commented] (YARN-1367) After restart NM should resync with the RM without killing containers
[ https://issues.apache.org/jira/browse/YARN-1367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14034186#comment-14034186 ] Jian He commented on YARN-1367: --- [~adhoot], mind updating the patch please? I'm happy to work on it if you are busy. After restart NM should resync with the RM without killing containers - Key: YARN-1367 URL: https://issues.apache.org/jira/browse/YARN-1367 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Bikas Saha Assignee: Anubhav Dhoot Attachments: YARN-1367.prototype.patch After RM restart, the RM sends a resync response to NMs that heartbeat to it. Upon receiving the resync response, the NM kills all containers and re-registers with the RM. The NM should be changed to not kill the container and instead inform the RM about all currently running containers including their allocations etc. After the re-register, the NM should send all pending container completions to the RM as usual. -- This message was sent by Atlassian JIRA (v6.2#6252)
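The behavior the description asks for could look roughly like this on the NM side (an illustrative sketch; every method name here is invented):
{code}
// On RESYNC: keep containers alive, re-register, and report what is running.
private void onResyncWithRM() throws YarnException, IOException {
  // Instead of killing containers, gather their current statuses...
  List<NMContainerStatus> running = getRunningContainerStatuses();
  // ...re-register and report them so the RM can rebuild its view...
  registerWithRM(running);
  // ...then send the pending container completions as usual.
  sendPendingContainerCompletions();
}
{code}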
[jira] [Commented] (YARN-1972) Implement secure Windows Container Executor
[ https://issues.apache.org/jira/browse/YARN-1972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14034268#comment-14034268 ] Vinod Kumar Vavilapalli commented on YARN-1972: --- That looks fine. I was suggesting we create one more document at hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/apt/. You can create that doc and add it to the patch together with addressing my review in the last comment. Tx again for working on this, it's almost there. Implement secure Windows Container Executor --- Key: YARN-1972 URL: https://issues.apache.org/jira/browse/YARN-1972 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Reporter: Remus Rusanu Assignee: Remus Rusanu Labels: security, windows Attachments: YARN-1972.1.patch, YARN-1972.2.patch h1. Windows Secure Container Executor (WCE) YARN-1063 adds the necessary infrastructure to launch a process as a domain user as a solution for the problem of having a security boundary between processes executed in YARN containers and the Hadoop services. The WCE is a container executor that leverages the winutils capabilities introduced in YARN-1063 and launches containers as an OS process running as the job submitter user. A description of the S4U infrastructure used by YARN-1063 and the alternatives considered can be read on that JIRA. The WCE is based on the DefaultContainerExecutor. It relies on the DCE to drive the flow of execution, but it overrides some methods to the effect of: * changes the DCE-created user cache directories to be owned by the job user and by the nodemanager group. * changes the actual container run command to use the 'createAsUser' command of winutils task instead of 'create' * runs the localization as a standalone process instead of an in-process Java method call. This in turn relies on the winutils createAsUser feature to run the localization as the job user. When compared to LinuxContainerExecutor (LCE), the WCE has some minor differences: * it does not delegate the creation of the user cache directories to the native implementation. * it does not require special handling to be able to delete user files The approach to the WCE came from practical trial and error. I had to iron out some issues around the Windows script shell limitations (command line length) to get it to work, the biggest issue being the huge CLASSPATH that is commonplace in Hadoop container executions. The job container itself already deals with this via a so-called 'classpath jar', see HADOOP-8899 and YARN-316 for details. For the WCE localizer launch as a separate container the same issue had to be resolved, and I used the same 'classpath jar' approach. h2. Deployment Requirements To use the WCE one needs to set `yarn.nodemanager.container-executor.class` to `org.apache.hadoop.yarn.server.nodemanager.WindowsSecureContainerExecutor` and set `yarn.nodemanager.windows-secure-container-executor.group` to a Windows security group that the nodemanager service principal is a member of (the equivalent of the LCE `yarn.nodemanager.linux-container-executor.group`). Unlike the LCE, the WCE does not require any configuration outside of Hadoop's own yarn-site.xml. For the WCE to work the nodemanager must run as a service principal that is a member of the local Administrators group or LocalSystem. This is derived from the need to invoke the LoadUserProfile API, whose specification mentions these requirements. 
This is in addition to the SE_TCB privilege mentioned in YARN-1063, but this requirement automatically implies that the SE_TCB privilege is held by the nodemanager. For the Linux speakers in the audience, the requirement is basically to run the NM as root. h2. Dedicated high-privilege Service Due to the high privilege required by the WCE, we had discussed the need to isolate the high-privilege operations into a separate process, an 'executor' service that is solely responsible for starting the containers (including the localizer). The NM would have to authenticate, authorize, and communicate with this service via an IPC mechanism and use it to launch the containers. I still believe we'll end up deploying such a service, but the effort to onboard such a new platform-specific service onto the project is not trivial. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1367) After restart NM should resync with the RM without killing containers
[ https://issues.apache.org/jira/browse/YARN-1367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14034359#comment-14034359 ] Anubhav Dhoot commented on YARN-1367: - I am still working on it. Will have an update soon After restart NM should resync with the RM without killing containers - Key: YARN-1367 URL: https://issues.apache.org/jira/browse/YARN-1367 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Bikas Saha Assignee: Anubhav Dhoot Attachments: YARN-1367.prototype.patch After RM restart, the RM sends a resync response to NMs that heartbeat to it. Upon receiving the resync response, the NM kills all containers and re-registers with the RM. The NM should be changed to not kill the container and instead inform the RM about all currently running containers including their allocations etc. After the re-register, the NM should send all pending container completions to the RM as usual. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2171) AMs block on the CapacityScheduler lock during allocate()
[ https://issues.apache.org/jira/browse/YARN-2171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jason Lowe updated YARN-2171: - Attachment: YARN-2171v2.patch The point of the unit test was to catch regressions at a high level. If anyone changes the code such that calling allocate() grabs the scheduler lock, the test will fail, whether that's a regression in this particular method or in some newly added method, called by ApplicationMasterService or CapacityScheduler itself, that grabs the lock. I added a separate unit test to exercise the getNumClusterNodes method. The AHS unit test failure seems unrelated, and it passes for me locally even with this change. AMs block on the CapacityScheduler lock during allocate() - Key: YARN-2171 URL: https://issues.apache.org/jira/browse/YARN-2171 Project: Hadoop YARN Issue Type: Bug Components: capacityscheduler Affects Versions: 0.23.10, 2.4.0 Reporter: Jason Lowe Assignee: Jason Lowe Priority: Critical Attachments: YARN-2171.patch, YARN-2171v2.patch When AMs heartbeat into the RM via the allocate() call they are blocking on the CapacityScheduler lock when trying to get the number of nodes in the cluster via getNumClusterNodes. -- This message was sent by Atlassian JIRA (v6.2#6252)
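The regression-test idea reads roughly like this (a sketch, assuming the scheduler synchronizes on itself, which is what its synchronized methods give it; the node-count expectation matches an empty test cluster):
{code}
synchronized (cs) {
  // With the scheduler lock held here, any call that needs it would hang.
  ExecutorService pool = Executors.newSingleThreadExecutor();
  Future<Integer> nodes = pool.submit(new Callable<Integer>() {
    @Override
    public Integer call() {
      return cs.getNumClusterNodes();   // must not block on the CS lock
    }
  });
  assertEquals(0, nodes.get(5, TimeUnit.SECONDS).intValue());
  pool.shutdownNow();
}
{code}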
[jira] [Created] (YARN-2175) Container localization has no timeouts and tasks can be stuck there for a long time
Anubhav Dhoot created YARN-2175: --- Summary: Container localization has no timeouts and tasks can be stuck there for a long time Key: YARN-2175 URL: https://issues.apache.org/jira/browse/YARN-2175 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: Anubhav Dhoot There are no timeouts that can be used to limit the time taken by various container startup operations. Localization, for example, could take a long time, and there is no way to kill a task if it is stuck in one of these states. These delays may have nothing to do with the task itself and could be an issue within the platform. Ideally there should be configurable limits for the various states within the NodeManager. The RM does not care about most of these; it is only between the AM and the NM. We can start by making these global configurable defaults, and in the future we can make it fancier by letting the AM override them in the start-container request. This JIRA will be used to limit localization time, and we can open others if we need to limit other operations. -- This message was sent by Atlassian JIRA (v6.2#6252)
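As a sketch of the global default the description suggests (the property name and the enforcement hook are invented for illustration; they are not existing YARN configuration):
{code}
// Hypothetical NM-side knob and check.
public static final String NM_LOCALIZATION_TIMEOUT_MS =
    "yarn.nodemanager.localizer.timeout-ms";

void checkLocalizationTimeout(long startedAtMs, long nowMs, long timeoutMs) {
  if (nowMs - startedAtMs > timeoutMs) {
    // Fail the container with a diagnostic instead of letting it hang.
    killContainerWithDiagnostic("localization exceeded " + timeoutMs + " ms");
  }
}
{code}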
[jira] [Updated] (YARN-2175) Container localization has no timeouts and tasks can be stuck there for a long time
[ https://issues.apache.org/jira/browse/YARN-2175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anubhav Dhoot updated YARN-2175: Affects Version/s: 2.4.0 Container localization has no timeouts and tasks can be stuck there for a long time --- Key: YARN-2175 URL: https://issues.apache.org/jira/browse/YARN-2175 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.4.0 Reporter: Anubhav Dhoot There are no timeouts that can be used to limit the time taken by various container startup operations. Localization, for example, could take a long time, and there is no way to kill a task if it is stuck in one of these states. These delays may have nothing to do with the task itself and could be an issue within the platform. Ideally there should be configurable limits for the various states within the NodeManager. The RM does not care about most of these; it is only between the AM and the NM. We can start by making these global configurable defaults, and in the future we can make it fancier by letting the AM override them in the start-container request. This JIRA will be used to limit localization time, and we can open others if we need to limit other operations. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Assigned] (YARN-2175) Container localization has no timeouts and tasks can be stuck there for a long time
[ https://issues.apache.org/jira/browse/YARN-2175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anubhav Dhoot reassigned YARN-2175: --- Assignee: Anubhav Dhoot Container localization has no timeouts and tasks can be stuck there for a long time --- Key: YARN-2175 URL: https://issues.apache.org/jira/browse/YARN-2175 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.4.0 Reporter: Anubhav Dhoot Assignee: Anubhav Dhoot There are no timeouts that can be used to limit the time taken by various container startup operations. Localization, for example, could take a long time, and there is no way to kill a task if it is stuck in one of these states. These delays may have nothing to do with the task itself and could be an issue within the platform. Ideally there should be configurable limits for the various states within the NodeManager. The RM does not care about most of these; it is only between the AM and the NM. We can start by making these global configurable defaults, and in the future we can make it fancier by letting the AM override them in the start-container request. This JIRA will be used to limit localization time, and we can open others if we need to limit other operations. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-2176) CapacityScheduler loops over all running applications rather than actively requesting apps
Jason Lowe created YARN-2176: Summary: CapacityScheduler loops over all running applications rather than actively requesting apps Key: YARN-2176 URL: https://issues.apache.org/jira/browse/YARN-2176 Project: Hadoop YARN Issue Type: Improvement Components: capacityscheduler Affects Versions: 2.4.0 Reporter: Jason Lowe The capacity scheduler performance is primarily dominated by LeafQueue.assignContainers, and that currently loops over all applications that are running in the queue. It would be more efficient if we looped over just the applications that are actively asking for resources rather than all applications, as there could be thousands of applications running but only a few hundred that are currently asking for resources. -- This message was sent by Atlassian JIRA (v6.2#6252)
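A sketch of the proposed bookkeeping (the tracking set and update hook are illustrative, not from an attached patch):
{code}
// Track only the apps that currently have outstanding requests.
private final Set<FiCaSchedulerApp> appsWithPendingRequests =
    Collections.newSetFromMap(
        new ConcurrentHashMap<FiCaSchedulerApp, Boolean>());

void onRequestsUpdated(FiCaSchedulerApp app, boolean hasPending) {
  if (hasPending) {
    appsWithPendingRequests.add(app);     // app started asking
  } else {
    appsWithPendingRequests.remove(app);  // nothing outstanding
  }
}

void assignContainers(FiCaSchedulerNode node) {
  // Walk the asking subset, not every running application in the queue.
  for (FiCaSchedulerApp app : appsWithPendingRequests) {
    // ... try to satisfy this app's pending requests on 'node' ...
  }
}
{code}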
[jira] [Commented] (YARN-1367) After restart NM should resync with the RM without killing containers
[ https://issues.apache.org/jira/browse/YARN-1367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14034405#comment-14034405 ] Anubhav Dhoot commented on YARN-1367: - I am still working on it and will have it ready soon. After restart NM should resync with the RM without killing containers - Key: YARN-1367 URL: https://issues.apache.org/jira/browse/YARN-1367 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Bikas Saha Assignee: Anubhav Dhoot Attachments: YARN-1367.prototype.patch After RM restart, the RM sends a resync response to NMs that heartbeat to it. Upon receiving the resync response, the NM kills all containers and re-registers with the RM. The NM should be changed to not kill the container and instead inform the RM about all currently running containers including their allocations etc. After the re-register, the NM should send all pending container completions to the RM as usual. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (YARN-1373) Transition RMApp and RMAppAttempt state to RUNNING after restart for recovered running apps
[ https://issues.apache.org/jira/browse/YARN-1373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kumar Vavilapalli resolved YARN-1373. --- Resolution: Duplicate Assignee: Omkar Vinit Joshi (was: Anubhav Dhoot) bq. Currently the RM moves recovered app attempts to a terminal recovered state and starts a new attempt. This is no longer an issue and has not been since YARN-1210. Even in non-work-preserving RM restart, the RM explicitly never kills the AMs; it's the nodes that kill all containers - this was done in YARN-1210. The state machines are already set up correctly, so no changes are needed here. Closing as a duplicate of YARN-1210. Transition RMApp and RMAppAttempt state to RUNNING after restart for recovered running apps --- Key: YARN-1373 URL: https://issues.apache.org/jira/browse/YARN-1373 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Bikas Saha Assignee: Omkar Vinit Joshi Currently the RM moves recovered app attempts to a terminal recovered state and starts a new attempt. Instead, it will have to transition the last attempt to a running state such that it can proceed as normal once the running attempt has resynced with the ApplicationMasterService (YARN-1365 and YARN-1366). If the RM had started the application container before dying then the AM would be up and trying to contact the RM. The RM may have died before launching the container. For this case, the RM should wait for the AM liveliness period and issue a kill container for the stored master container. It should transition this attempt to some RECOVER_ERROR state and proceed to start a new attempt. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2174) Enabling HTTPs for the writer REST API of TimelineServer
[ https://issues.apache.org/jira/browse/YARN-2174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kumar Vavilapalli updated YARN-2174: -- Summary: Enabling HTTPs for the writer REST API of TimelineServer (was: Enabling HTTPs for the writer REST API) Enabling HTTPs for the writer REST API of TimelineServer Key: YARN-2174 URL: https://issues.apache.org/jira/browse/YARN-2174 Project: Hadoop YARN Issue Type: Sub-task Reporter: Zhijie Shen Assignee: Zhijie Shen Since we'd like to allow the application to put the timeline data from the client, the AM, and even the containers, we need to provide a way to distribute the keystore. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2173) Enabling HTTPS for the reader REST APIs of TimelineServer
[ https://issues.apache.org/jira/browse/YARN-2173?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kumar Vavilapalli updated YARN-2173: -- Summary: Enabling HTTPS for the reader REST APIs of TimelineServer (was: Enabling HTTPS for the reader REST APIs) Enabling HTTPS for the reader REST APIs of TimelineServer - Key: YARN-2173 URL: https://issues.apache.org/jira/browse/YARN-2173 Project: Hadoop YARN Issue Type: Sub-task Reporter: Zhijie Shen Assignee: Zhijie Shen -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2052) ContainerId creation after work preserving restart is broken
[ https://issues.apache.org/jira/browse/YARN-2052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14034448#comment-14034448 ] Vinod Kumar Vavilapalli commented on YARN-2052: --- bq. BTW, I think we should update CheckpointAMPreemptionPolicy after this JIRA. Ideally this should be the container-allocation timestamp, and we should depend on that instead of comparing container-IDs. IAC, let's fix it separately. ContainerId creation after work preserving restart is broken Key: YARN-2052 URL: https://issues.apache.org/jira/browse/YARN-2052 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Tsuyoshi OZAWA Assignee: Tsuyoshi OZAWA Attachments: YARN-2052.1.patch, YARN-2052.2.patch, YARN-2052.3.patch Container ids are made unique by using the app identifier and appending a monotonically increasing sequence number to it. Since container creation is a high-churn activity, the RM does not store the sequence number per app. So after a restart it does not know what the new sequence number should be for new allocations. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2052) ContainerId creation after work preserving restart is broken
[ https://issues.apache.org/jira/browse/YARN-2052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14034452#comment-14034452 ] Jian He commented on YARN-2052: --- Another question is how we are going to show the containerId string, specifically in the toString() method. If we just use the original containerId string + UUID, it'll be inconvenient for debugging, as the UUID has no meaning. ContainerId creation after work preserving restart is broken Key: YARN-2052 URL: https://issues.apache.org/jira/browse/YARN-2052 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Tsuyoshi OZAWA Assignee: Tsuyoshi OZAWA Attachments: YARN-2052.1.patch, YARN-2052.2.patch, YARN-2052.3.patch Container ids are made unique by using the app identifier and appending a monotonically increasing sequence number to it. Since container creation is a high-churn activity, the RM does not store the sequence number per app. So after a restart it does not know what the new sequence number should be for new allocations. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2171) AMs block on the CapacityScheduler lock during allocate()
[ https://issues.apache.org/jira/browse/YARN-2171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14034456#comment-14034456 ] Hadoop QA commented on YARN-2171: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12650880/YARN-2171v2.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.yarn.server.resourcemanager.ahs.TestRMApplicationHistoryWriter {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4016//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4016//console This message is automatically generated. AMs block on the CapacityScheduler lock during allocate() - Key: YARN-2171 URL: https://issues.apache.org/jira/browse/YARN-2171 Project: Hadoop YARN Issue Type: Bug Components: capacityscheduler Affects Versions: 0.23.10, 2.4.0 Reporter: Jason Lowe Assignee: Jason Lowe Priority: Critical Attachments: YARN-2171.patch, YARN-2171v2.patch When AMs heartbeat into the RM via the allocate() call they are blocking on the CapacityScheduler lock when trying to get the number of nodes in the cluster via getNumClusterNodes. -- This message was sent by Atlassian JIRA (v6.2#6252)
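For illustration, one common way to take such a read off a coarse lock (a sketch of the general technique, not the actual YARN-2171 patch) is to cache the node count in a volatile field that the writers maintain while they already hold the lock:

{code:java}
import java.util.HashMap;
import java.util.Map;

// Sketch: writers (node add/remove) already run under the scheduler lock,
// modeled here by synchronized methods, so they can maintain a volatile
// count; readers such as the AM allocate() path then read it lock-free.
class NodeTracker {
  private final Map<String, Object> nodes = new HashMap<String, Object>();
  private volatile int numClusterNodes = 0;

  synchronized void addNode(String nodeId, Object node) {
    nodes.put(nodeId, node);
    numClusterNodes = nodes.size();
  }

  synchronized void removeNode(String nodeId) {
    nodes.remove(nodeId);
    numClusterNodes = nodes.size();
  }

  // Lock-free read: volatile guarantees visibility of the latest count.
  int getNumClusterNodes() {
    return numClusterNodes;
  }
}
{code}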
[jira] [Commented] (YARN-2052) ContainerId creation after work preserving restart is broken
[ https://issues.apache.org/jira/browse/YARN-2052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14034474#comment-14034474 ] Tsuyoshi OZAWA commented on YARN-2052: -- Vinod, OK. I'll create a new JIRA to address it. {quote} Another question is how we are going to show the containerId string, specifically in the toString() method. If we just use the original containerId string + UUID, it'll be inconvenient for debugging, as the UUID has no meaning. {quote} From a developer's point of view, you're right. One idea is to show the RM_ID instead of a UUID, validating the RM_ID at startup and confirming that it does not include an underscore. One concern with this approach is that we'll break backward compatibility of yarn-site.xml. If we can accept that, it's the better approach. ContainerId creation after work preserving restart is broken Key: YARN-2052 URL: https://issues.apache.org/jira/browse/YARN-2052 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Tsuyoshi OZAWA Assignee: Tsuyoshi OZAWA Attachments: YARN-2052.1.patch, YARN-2052.2.patch, YARN-2052.3.patch Container ids are made unique by using the app identifier and appending a monotonically increasing sequence number to it. Since container creation is a high-churn activity, the RM does not store the sequence number per app. So after a restart it does not know what the new sequence number should be for new allocations. -- This message was sent by Atlassian JIRA (v6.2#6252)
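A minimal sketch of the startup-validation idea above, assuming the HA id from {{yarn.resourcemanager.ha.id}} is the RM_ID in question; the check itself is hypothetical, not committed code:

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

// Hypothetical validation: reject an RM id containing '_' because '_' is
// the separator used inside id strings such as
// container_<clusterTimestamp>_<appId>_<attemptId>_<containerId>.
public class RmIdValidator {
  public static void validate(Configuration conf) {
    String rmId = conf.get(YarnConfiguration.RM_HA_ID);
    if (rmId != null && rmId.contains("_")) {
      throw new IllegalArgumentException("Invalid "
          + YarnConfiguration.RM_HA_ID + " '" + rmId
          + "': it must not contain an underscore");
    }
  }
}
{code}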
[jira] [Updated] (YARN-1341) Recover NMTokens upon nodemanager restart
[ https://issues.apache.org/jira/browse/YARN-1341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jason Lowe updated YARN-1341: - Attachment: YARN-1341v5.patch Thanks for taking a look, Junping! I've updated the patch to trunk. Recover NMTokens upon nodemanager restart - Key: YARN-1341 URL: https://issues.apache.org/jira/browse/YARN-1341 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: 2.3.0 Reporter: Jason Lowe Assignee: Jason Lowe Attachments: YARN-1341.patch, YARN-1341v2.patch, YARN-1341v3.patch, YARN-1341v4-and-YARN-1987.patch, YARN-1341v5.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2052) ContainerId creation after work preserving restart is broken
[ https://issues.apache.org/jira/browse/YARN-2052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14034541#comment-14034541 ] Jian He commented on YARN-2052: --- There seem to be more problems with the randomId approach: if a user wants to kill a container, the user has to be aware of the random ID. Had an offline discussion with Vinod. Maybe it's still better to persist a sequence number that indicates the number of RM restarts when the RM starts up. Today containerId#id is an int (32 bits); we can reserve some bits at the front for the number of RM restarts, e.g. the 32 bits divided as 8 bits for the number of RM restarts and 24 bits for the number of containers. Each time the RM restarts, we increase the RM sequence number. Also, we should have a follow-up JIRA to change the containerId/appId from integer to long and deprecate the old one. [~ozawa], do you agree? ContainerId creation after work preserving restart is broken Key: YARN-2052 URL: https://issues.apache.org/jira/browse/YARN-2052 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Tsuyoshi OZAWA Assignee: Tsuyoshi OZAWA Attachments: YARN-2052.1.patch, YARN-2052.2.patch, YARN-2052.3.patch Container ids are made unique by using the app identifier and appending a monotonically increasing sequence number to it. Since container creation is a high-churn activity, the RM does not store the sequence number per app. So after a restart it does not know what the new sequence number should be for new allocations. -- This message was sent by Atlassian JIRA (v6.2#6252)
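A minimal sketch of the 8/24-bit split described in that comment; the widths and helper names are taken from the comment for illustration and are not a committed design:

{code:java}
// Top 8 bits: number of RM restarts (epoch); low 24 bits: the per-app
// container sequence number, exactly as in the 8/24 split above.
public final class EpochContainerId {
  private static final int EPOCH_BITS = 8;
  private static final int SEQ_BITS = 32 - EPOCH_BITS;       // 24
  private static final int SEQ_MASK = (1 << SEQ_BITS) - 1;   // 0x00FFFFFF

  public static int pack(int epoch, int sequence) {
    // Note: an epoch above 255 no longer fits, which is the overflow
    // concern discussed in the follow-up comments on this issue.
    return (epoch << SEQ_BITS) | (sequence & SEQ_MASK);
  }

  public static int epochOf(int id) {
    return id >>> SEQ_BITS;
  }

  public static int sequenceOf(int id) {
    return id & SEQ_MASK;
  }

  public static void main(String[] args) {
    int id = pack(1, 42);  // first restart, 42nd container
    System.out.println(epochOf(id) + " " + sequenceOf(id));  // prints: 1 42
  }
}
{code}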
[jira] [Commented] (YARN-1341) Recover NMTokens upon nodemanager restart
[ https://issues.apache.org/jira/browse/YARN-1341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14034588#comment-14034588 ] Hadoop QA commented on YARN-1341: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12650914/YARN-1341v5.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 3 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4017//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4017//console This message is automatically generated. Recover NMTokens upon nodemanager restart - Key: YARN-1341 URL: https://issues.apache.org/jira/browse/YARN-1341 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: 2.3.0 Reporter: Jason Lowe Assignee: Jason Lowe Attachments: YARN-1341.patch, YARN-1341v2.patch, YARN-1341v3.patch, YARN-1341v4-and-YARN-1987.patch, YARN-1341v5.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2147) client lacks delegation token exception details when application submit fails
[ https://issues.apache.org/jira/browse/YARN-2147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14034612#comment-14034612 ] Daryn Sharp commented on YARN-2147: --- I don't think the patch handles the use case it's designed for. If job submission failed with a bland "Read timed out", then logging all the tokens in the RM log doesn't help the end user; nor does the RM log even answer the question of which token timed out. What you really want to do is change {{DelegationTokenRenewer#handleAppSubmitEvent}} to trap exceptions from {{renewToken}}. Wrap the exception with a more descriptive exception that stringifies to the user as "Can't renew token blah: Read timed out". client lacks delegation token exception details when application submit fails - Key: YARN-2147 URL: https://issues.apache.org/jira/browse/YARN-2147 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.4.0 Reporter: Jason Lowe Assignee: Chen He Priority: Minor Attachments: YARN-2147-v2.patch, YARN-2147.patch When a client submits an application and the delegation token process fails, the client can lack critical details needed to understand the nature of the error. Only the message of the error exception is conveyed to the client, which sometimes isn't enough to debug. -- This message was sent by Atlassian JIRA (v6.2#6252)
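A rough sketch of the wrapping Daryn suggests, with method and variable names simplified for illustration (the real code lives in DelegationTokenRenewer, not in this standalone class):

{code:java}
import java.io.IOException;
import org.apache.hadoop.security.token.Token;

// Illustrative only: renewToken(...) stands in for the real renewal call
// made from DelegationTokenRenewer#handleAppSubmitEvent.
public class RenewalWrapper {
  void renewWithContext(Token<?> token) throws IOException {
    try {
      renewToken(token);
    } catch (IOException e) {
      // Surface which token failed instead of a bare "Read timed out".
      throw new IOException("Can't renew token " + token + ": "
          + e.getMessage(), e);
    }
  }

  private void renewToken(Token<?> token) throws IOException {
    // placeholder for the actual renewal logic
  }
}
{code}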
[jira] [Commented] (YARN-2144) Add logs when preemption occurs
[ https://issues.apache.org/jira/browse/YARN-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14034624#comment-14034624 ] Jian He commented on YARN-2144: --- The patch needs a rebase; can you update it, please? Thanks. Add logs when preemption occurs --- Key: YARN-2144 URL: https://issues.apache.org/jira/browse/YARN-2144 Project: Hadoop YARN Issue Type: Improvement Components: capacityscheduler Affects Versions: 2.5.0 Reporter: Tassapol Athiapinya Assignee: Wangda Tan Attachments: AM-page-preemption-info.png, YARN-2144.patch, YARN-2144.patch, YARN-2144.patch There should be easy-to-read logs when preemption does occur. 1. For debugging purposes, the RM should log this. 2. For administrative purposes, the RM webpage should have a page showing recent preemption events. RM logs should have the following properties: * Logs are retrievable while an application is still running, and are flushed often. * Can distinguish between AM container preemption and task container preemption, with the container ID shown. * Should be INFO-level logs. -- This message was sent by Atlassian JIRA (v6.2#6252)
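As a hypothetical example of a log line meeting those requirements (the names here are invented for illustration and are not from the attached patch):

{code:java}
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;

// Hypothetical helper: an INFO-level line that distinguishes AM-container
// preemption from task-container preemption and always carries the
// container id, per the requirements listed in the issue description.
public class PreemptionLogger {
  private static final Log LOG = LogFactory.getLog(PreemptionLogger.class);

  static void logPreemption(String containerId, String applicationId,
      boolean isAMContainer) {
    LOG.info("Preempting " + (isAMContainer ? "AM" : "task")
        + " container " + containerId + " of application " + applicationId);
  }
}
{code}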
[jira] [Commented] (YARN-2052) ContainerId creation after work preserving restart is broken
[ https://issues.apache.org/jira/browse/YARN-2052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14034637#comment-14034637 ] Tsuyoshi OZAWA commented on YARN-2052: -- Basically, I agree with the approach. If we take the sequence-number approach, we should define the behavior for when the sequence number overflows. One simple way is to fall back to the RM restart behavior implemented in YARN-128. After changing the containerId/appId from integer to long, overflow will happen very rarely. [~jianhe], what do you think about the behavior? ContainerId creation after work preserving restart is broken Key: YARN-2052 URL: https://issues.apache.org/jira/browse/YARN-2052 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Tsuyoshi OZAWA Assignee: Tsuyoshi OZAWA Attachments: YARN-2052.1.patch, YARN-2052.2.patch, YARN-2052.3.patch Container ids are made unique by using the app identifier and appending a monotonically increasing sequence number to it. Since container creation is a high-churn activity, the RM does not store the sequence number per app. So after a restart it does not know what the new sequence number should be for new allocations. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2052) ContainerId creation after work preserving restart is broken
[ https://issues.apache.org/jira/browse/YARN-2052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14034691#comment-14034691 ] Bikas Saha commented on YARN-2052: -- bq. Had an offline discussion with Vinod. Maybe it's still better to persist a sequence number that indicates the number of RM restarts when the RM starts up. Is this the same as the epoch number that was mentioned earlier in this JIRA? https://issues.apache.org/jira/browse/YARN-2052?focusedCommentId=13996675page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13996675. It seems to me that it's the same, with the epoch number renamed to num-rm-restarts. ContainerId creation after work preserving restart is broken Key: YARN-2052 URL: https://issues.apache.org/jira/browse/YARN-2052 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Tsuyoshi OZAWA Assignee: Tsuyoshi OZAWA Attachments: YARN-2052.1.patch, YARN-2052.2.patch, YARN-2052.3.patch Container ids are made unique by using the app identifier and appending a monotonically increasing sequence number to it. Since container creation is a high-churn activity, the RM does not store the sequence number per app. So after a restart it does not know what the new sequence number should be for new allocations. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1373) Transition RMApp and RMAppAttempt state to RUNNING after restart for recovered running apps
[ https://issues.apache.org/jira/browse/YARN-1373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14034700#comment-14034700 ] Bikas Saha commented on YARN-1373: -- Sorry, I am not clear on how this is a dup. This JIRA is tracking new behavior in the RM that will transition a recovered (and still actually running) RMAppImpl/RMAppAttemptImpl app to a RUNNING state instead of a terminal recovered state. This is to ensure that the state machines are in the correct state for the running AM to resync and continue as running. This is not related to killing the app master process on the NM. Transition RMApp and RMAppAttempt state to RUNNING after restart for recovered running apps --- Key: YARN-1373 URL: https://issues.apache.org/jira/browse/YARN-1373 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Bikas Saha Assignee: Omkar Vinit Joshi Currently the RM moves recovered app attempts to a terminal recovered state and starts a new attempt. Instead, it will have to transition the last attempt to a running state such that it can proceed as normal once the running attempt has resynced with the ApplicationMasterService (YARN-1365 and YARN-1366). If the RM had started the application container before dying, then the AM would be up and trying to contact the RM. The RM may have died before launching the container. In this case, the RM should wait for the AM liveliness period and then issue a kill for the stored master container. It should transition this attempt to some RECOVER_ERROR state and proceed to start a new attempt. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2144) Add logs when preemption occurs
[ https://issues.apache.org/jira/browse/YARN-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-2144: - Attachment: YARN-2144.patch Rebased the patch to the latest trunk. Add logs when preemption occurs --- Key: YARN-2144 URL: https://issues.apache.org/jira/browse/YARN-2144 Project: Hadoop YARN Issue Type: Improvement Components: capacityscheduler Affects Versions: 2.5.0 Reporter: Tassapol Athiapinya Assignee: Wangda Tan Attachments: AM-page-preemption-info.png, YARN-2144.patch, YARN-2144.patch, YARN-2144.patch, YARN-2144.patch There should be easy-to-read logs when preemption does occur. 1. For debugging purposes, the RM should log this. 2. For administrative purposes, the RM webpage should have a page showing recent preemption events. RM logs should have the following properties: * Logs are retrievable while an application is still running, and are flushed often. * Can distinguish between AM container preemption and task container preemption, with the container ID shown. * Should be INFO-level logs. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2052) ContainerId creation after work preserving restart is broken
[ https://issues.apache.org/jira/browse/YARN-2052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14034702#comment-14034702 ] Tsuyoshi OZAWA commented on YARN-2052: -- [~bikassaha], Yes, I think it's the same. ContainerId creation after work preserving restart is broken Key: YARN-2052 URL: https://issues.apache.org/jira/browse/YARN-2052 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Tsuyoshi OZAWA Assignee: Tsuyoshi OZAWA Attachments: YARN-2052.1.patch, YARN-2052.2.patch, YARN-2052.3.patch Container ids are made unique by using the app identifier and appending a monotonically increasing sequence number to it. Since container creation is a high-churn activity, the RM does not store the sequence number per app. So after a restart it does not know what the new sequence number should be for new allocations. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2052) ContainerId creation after work preserving restart is broken
[ https://issues.apache.org/jira/browse/YARN-2052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14034716#comment-14034716 ] Jian He commented on YARN-2052: --- bq. One simple way is to fall back to the RM restart behavior implemented in YARN-128. Can you clarify what you mean? ContainerId creation after work preserving restart is broken Key: YARN-2052 URL: https://issues.apache.org/jira/browse/YARN-2052 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Tsuyoshi OZAWA Assignee: Tsuyoshi OZAWA Attachments: YARN-2052.1.patch, YARN-2052.2.patch, YARN-2052.3.patch Container ids are made unique by using the app identifier and appending a monotonically increasing sequence number to it. Since container creation is a high-churn activity, the RM does not store the sequence number per app. So after a restart it does not know what the new sequence number should be for new allocations. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2052) ContainerId creation after work preserving restart is broken
[ https://issues.apache.org/jira/browse/YARN-2052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14034722#comment-14034722 ] Tsuyoshi OZAWA commented on YARN-2052: -- I meant starting apps from a clean state after the restart, as in RM restart phase 1. If the sequence numbers are reset to zero, some applications can behave unexpectedly because {{ContainerId#compareTo}} no longer works correctly. If the apps start from a clean state, we can avoid that situation. ContainerId creation after work preserving restart is broken Key: YARN-2052 URL: https://issues.apache.org/jira/browse/YARN-2052 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Tsuyoshi OZAWA Assignee: Tsuyoshi OZAWA Attachments: YARN-2052.1.patch, YARN-2052.2.patch, YARN-2052.3.patch Container ids are made unique by using the app identifier and appending a monotonically increasing sequence number to it. Since container creation is a high-churn activity, the RM does not store the sequence number per app. So after a restart it does not know what the new sequence number should be for new allocations. -- This message was sent by Atlassian JIRA (v6.2#6252)
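To make the concern concrete, here is a small example (with hypothetical timestamps and ids) of how a reset sequence number inverts the expected ordering:

{code:java}
import org.apache.hadoop.yarn.api.records.ApplicationAttemptId;
import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.api.records.ContainerId;

// Hypothetical values: the attempt had already reached container #5 before
// the restart; with a reset sequence its next allocation would be #1.
public class CompareToDemo {
  public static void main(String[] args) {
    ApplicationId appId = ApplicationId.newInstance(1403000000000L, 1);
    ApplicationAttemptId attempt = ApplicationAttemptId.newInstance(appId, 1);
    ContainerId beforeRestart = ContainerId.newInstance(attempt, 5);
    ContainerId afterRestart = ContainerId.newInstance(attempt, 1);
    // The newer container now compares as "older", breaking any consumer
    // that assumes ids grow monotonically over time.
    System.out.println(afterRestart.compareTo(beforeRestart) < 0);  // true
  }
}
{code}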
[jira] [Commented] (YARN-2144) Add logs when preemption occurs
[ https://issues.apache.org/jira/browse/YARN-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14034725#comment-14034725 ] Hadoop QA commented on YARN-2144: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12650937/YARN-2144.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 7 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4018//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4018//console This message is automatically generated. Add logs when preemption occurs --- Key: YARN-2144 URL: https://issues.apache.org/jira/browse/YARN-2144 Project: Hadoop YARN Issue Type: Improvement Components: capacityscheduler Affects Versions: 2.5.0 Reporter: Tassapol Athiapinya Assignee: Wangda Tan Attachments: AM-page-preemption-info.png, YARN-2144.patch, YARN-2144.patch, YARN-2144.patch, YARN-2144.patch There should be easy-to-read logs when preemption does occur. 1. For debugging purpose, RM should log this. 2. For administrative purpose, RM webpage should have a page to show recent preemption events. RM logs should have following properties: * Logs are retrievable when an application is still running and often flushed. * Can distinguish between AM container preemption and task container preemption with container ID shown. * Should be INFO level log. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2052) ContainerId creation after work preserving restart is broken
[ https://issues.apache.org/jira/browse/YARN-2052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14034731#comment-14034731 ] Bikas Saha commented on YARN-2052: -- Why would ContainerId#compareTo fail? Existing containerIds should remain unchanged after an RM restart; only new container ids should have a different epoch number. ContainerId creation after work preserving restart is broken Key: YARN-2052 URL: https://issues.apache.org/jira/browse/YARN-2052 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Tsuyoshi OZAWA Assignee: Tsuyoshi OZAWA Attachments: YARN-2052.1.patch, YARN-2052.2.patch, YARN-2052.3.patch Container ids are made unique by using the app identifier and appending a monotonically increasing sequence number to it. Since container creation is a high-churn activity, the RM does not store the sequence number per app. So after a restart it does not know what the new sequence number should be for new allocations. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2052) ContainerId creation after work preserving restart is broken
[ https://issues.apache.org/jira/browse/YARN-2052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14034732#comment-14034732 ] Bikas Saha commented on YARN-2052: -- Ah. I did not see the rest of the comment. Yes. Integer overflow is a problem. We should make it a long in the same release as the epoch number addition so that we don't have to worry about that. ContainerId creation after work preserving restart is broken Key: YARN-2052 URL: https://issues.apache.org/jira/browse/YARN-2052 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Tsuyoshi OZAWA Assignee: Tsuyoshi OZAWA Attachments: YARN-2052.1.patch, YARN-2052.2.patch, YARN-2052.3.patch Container ids are made unique by using the app identifier and appending a monotonically increasing sequence number to it. Since container creation is a high-churn activity, the RM does not store the sequence number per app. So after a restart it does not know what the new sequence number should be for new allocations. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2083) In fair scheduler, Queue should not been assigned more containers when its usedResource had reach the maxResource limit
[ https://issues.apache.org/jira/browse/YARN-2083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yi Tian updated YARN-2083: -- Attachment: YARN-2083-3.patch A small change to account for YARN-1474 (Make schedulers services). In fair scheduler, Queue should not been assigned more containers when its usedResource had reach the maxResource limit --- Key: YARN-2083 URL: https://issues.apache.org/jira/browse/YARN-2083 Project: Hadoop YARN Issue Type: Bug Components: scheduler Affects Versions: 2.3.0 Reporter: Yi Tian Labels: assignContainer, fair, scheduler Fix For: 2.4.1 Attachments: YARN-2083-1.patch, YARN-2083-2.patch, YARN-2083-3.patch, YARN-2083.patch In the fair scheduler, FSParentQueue and FSLeafQueue do an assignContainerPreCheck to guarantee the queue is not over its limit. But the fitsIn function in Resource.java does not return false when usedResource equals maxResource. I think we should create a new function, fitsInWithoutEqual, to use instead of fitsIn in this case. -- This message was sent by Atlassian JIRA (v6.2#6252)
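One possible reading of the proposed helper, sketched against the 2.x Resource API; the name comes from the issue description, and the exact semantics (strict in both dimensions) are an assumption, not trunk code:

{code:java}
import org.apache.hadoop.yarn.api.records.Resource;

// Strict variant of the Resources#fitsIn idea: returns false once usage
// has reached the maximum in either dimension, so a queue already at its
// limit would fail the assignContainerPreCheck.
public class StrictFit {
  public static boolean fitsInWithoutEqual(Resource smaller, Resource bigger) {
    return smaller.getMemory() < bigger.getMemory()
        && smaller.getVirtualCores() < bigger.getVirtualCores();
  }
}
{code}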
[jira] [Commented] (YARN-2052) ContainerId creation after work preserving restart is broken
[ https://issues.apache.org/jira/browse/YARN-2052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14034746#comment-14034746 ] Tsuyoshi OZAWA commented on YARN-2052: -- {quote} We should make it a long in the same release as the epoch number addition so that we don't have to worry about that. {quote} +1 to doing this in the same release. We plan to do the improvement in another JIRA. That's OK, but I think it's important that we decide the behavior for when the overflow happens. We have two options: just abort the RM for now, or start apps from a clean state after the restart. We're planning to make the id a long just after this JIRA, so for simplicity we can take the abort approach to prevent unexpected behavior. [~bikassaha], [~jianhe], what do you think about this? ContainerId creation after work preserving restart is broken Key: YARN-2052 URL: https://issues.apache.org/jira/browse/YARN-2052 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Tsuyoshi OZAWA Assignee: Tsuyoshi OZAWA Attachments: YARN-2052.1.patch, YARN-2052.2.patch, YARN-2052.3.patch Container ids are made unique by using the app identifier and appending a monotonically increasing sequence number to it. Since container creation is a high-churn activity, the RM does not store the sequence number per app. So after a restart it does not know what the new sequence number should be for new allocations. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2083) In fair scheduler, Queue should not been assigned more containers when its usedResource had reach the maxResource limit
[ https://issues.apache.org/jira/browse/YARN-2083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14034777#comment-14034777 ] Hadoop QA commented on YARN-2083: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12650950/YARN-2083-3.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4019//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4019//console This message is automatically generated. In fair scheduler, Queue should not been assigned more containers when its usedResource had reach the maxResource limit --- Key: YARN-2083 URL: https://issues.apache.org/jira/browse/YARN-2083 Project: Hadoop YARN Issue Type: Bug Components: scheduler Affects Versions: 2.3.0 Reporter: Yi Tian Labels: assignContainer, fair, scheduler Fix For: 2.4.1 Attachments: YARN-2083-1.patch, YARN-2083-2.patch, YARN-2083-3.patch, YARN-2083.patch In the fair scheduler, FSParentQueue and FSLeafQueue do an assignContainerPreCheck to guarantee the queue is not over its limit. But the fitsIn function in Resource.java does not return false when usedResource equals maxResource. I think we should create a new function, fitsInWithoutEqual, to use instead of fitsIn in this case. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2083) In fair scheduler, Queue should not been assigned more containers when its usedResource had reach the maxResource limit
[ https://issues.apache.org/jira/browse/YARN-2083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yi Tian updated YARN-2083: -- Fix Version/s: (was: 2.4.1) In fair scheduler, Queue should not been assigned more containers when its usedResource had reach the maxResource limit --- Key: YARN-2083 URL: https://issues.apache.org/jira/browse/YARN-2083 Project: Hadoop YARN Issue Type: Bug Components: scheduler Affects Versions: 2.3.0 Reporter: Yi Tian Labels: assignContainer, fair, scheduler Attachments: YARN-2083-1.patch, YARN-2083-2.patch, YARN-2083-3.patch, YARN-2083.patch In the fair scheduler, FSParentQueue and FSLeafQueue do an assignContainerPreCheck to guarantee the queue is not over its limit. But the fitsIn function in Resource.java does not return false when usedResource equals maxResource. I think we should create a new function, fitsInWithoutEqual, to use instead of fitsIn in this case. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2083) In fair scheduler, Queue should not been assigned more containers when its usedResource had reach the maxResource limit
[ https://issues.apache.org/jira/browse/YARN-2083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14034812#comment-14034812 ] Yi Tian commented on YARN-2083: --- [~ywskycn], thanks for your advice. YARN-2083-3.patch works fine on trunk, and YARN-2083-2.patch works fine on branch-2.4.1. Is it possible to apply this patch to the YARN project? In fair scheduler, Queue should not been assigned more containers when its usedResource had reach the maxResource limit --- Key: YARN-2083 URL: https://issues.apache.org/jira/browse/YARN-2083 Project: Hadoop YARN Issue Type: Bug Components: scheduler Affects Versions: 2.3.0 Reporter: Yi Tian Labels: assignContainer, fair, scheduler Attachments: YARN-2083-1.patch, YARN-2083-2.patch, YARN-2083-3.patch, YARN-2083.patch In the fair scheduler, FSParentQueue and FSLeafQueue do an assignContainerPreCheck to guarantee the queue is not over its limit. But the fitsIn function in Resource.java does not return false when usedResource equals maxResource. I think we should create a new function, fitsInWithoutEqual, to use instead of fitsIn in this case. -- This message was sent by Atlassian JIRA (v6.2#6252)