[jira] [Commented] (YARN-1389) ApplicationClientProtocol and ApplicationHistoryProtocol should expose analog APIs
[ https://issues.apache.org/jira/browse/YARN-1389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13917826#comment-13917826 ] Hadoop QA commented on YARN-1389: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12632192/YARN-1389-2.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:red}-1 javac{color:red}. The patch appears to cause the build to fail. Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3228//console This message is automatically generated. ApplicationClientProtocol and ApplicationHistoryProtocol should expose analog APIs -- Key: YARN-1389 URL: https://issues.apache.org/jira/browse/YARN-1389 Project: Hadoop YARN Issue Type: Sub-task Reporter: Mayank Bansal Assignee: Mayank Bansal Attachments: YARN-1389-1.patch, YARN-1389-2.patch As we plan to have the APIs in ApplicationHistoryProtocol to expose the reports of *finished* application attempts and containers, we should do the same for ApplicationClientProtocol, which will return the reports of *running* attempts and containers. Later on, we can improve YarnClient to direct the query of running instance to ApplicationClientProtocol, while that of finished instance to ApplicationHistoryProtocol, making it transparent to the users. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1759) Configuration settings can potentially disappear post YARN-1666
[ https://issues.apache.org/jira/browse/YARN-1759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13917895#comment-13917895 ] Steve Loughran commented on YARN-1759: -- I think Hitesh's concern is of the workflow # load YARN config # subclass overrides values in its serviceInit() # new YarnConfig overwrites this. I'm not sure this happens, certainly {{new YarnConfig(Configuration)}} doesn't -it pops up in a few places -hence some logic in {{AbstractService.init(Configuration)}} to recognise and handle this situation by updating its own {{config}} field. A small unit test should be able to replicate the problem if it does exist. Configuration settings can potentially disappear post YARN-1666 --- Key: YARN-1759 URL: https://issues.apache.org/jira/browse/YARN-1759 Project: Hadoop YARN Issue Type: Bug Reporter: Hitesh Shah By implicitly loading core-site and yarn-site again in the RM::serviceInit(), some configs may be unintentionally overridden. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1206) Container logs link is broken on RM web UI after application finished
[ https://issues.apache.org/jira/browse/YARN-1206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13918064#comment-13918064 ] Rohith commented on YARN-1206: -- I am able to reproduce this issue in today's trunk with log aggregation disable. I verified hadoop-2.1 , this issue does not ocure. I just going through fix for YARN-649, found there is null check for container in ContainerLogsUtils.getContainerLogDirs() method. {noformat} if (container == null) { throw new YarnException(Container does not exist.); } {noformat} In hadoop-2.1, above piece of code not there. I am not pretty sure why this is added.!! Basically if container is COMPLETED than it will be removed from NMContext ( NodeStatusUpdaterImpl.updateAndGetContainerStatuses() ). NM does not have any information regarding this container. Is it really required to have this check ? Container logs link is broken on RM web UI after application finished - Key: YARN-1206 URL: https://issues.apache.org/jira/browse/YARN-1206 Project: Hadoop YARN Issue Type: Bug Reporter: Jian He Priority: Blocker With log aggregation disabled, when container is running, its logs link works properly, but after the application is finished, the link shows 'Container does not exist.' -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1771) many getFileStatus calls made from node manager for localizing a public distributed cache resource
[ https://issues.apache.org/jira/browse/YARN-1771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13918112#comment-13918112 ] Jason Lowe commented on YARN-1771: -- Here's a thought to possibly avoid checking each directory level individually: what if the NM simply tried to read the file as the user requesting it to be public? The NM should already have the necessary tokens to access the resource, so it should be able to use doAs to read the file as the requesting user. The rationale for this approach being that if the user can read the resource and is asking for it to be public then they can trivially make the data public themselves by copying to /tmp and make the copy publicly accessible. many getFileStatus calls made from node manager for localizing a public distributed cache resource -- Key: YARN-1771 URL: https://issues.apache.org/jira/browse/YARN-1771 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: Sangjin Lee Assignee: Sangjin Lee Priority: Critical We're observing that the getFileStatus calls are putting a fair amount of load on the name node as part of checking the public-ness for localizing a resource that belong in the public cache. We see 7 getFileStatus calls made for each of these resource. We should look into reducing the number of calls to the name node. One example: {noformat} 2014-02-27 18:07:27,351 INFO audit: ... cmd=getfileinfo src=/tmp/temp-887708724/tmp883330348/foo-0.0.44.jar ... 2014-02-27 18:07:27,352 INFO audit: ... cmd=getfileinfo src=/tmp/temp-887708724/tmp883330348/foo-0.0.44.jar ... 2014-02-27 18:07:27,352 INFO audit: ... cmd=getfileinfo src=/tmp/temp-887708724/tmp883330348 ... 2014-02-27 18:07:27,353 INFO audit: ... cmd=getfileinfo src=/tmp/temp-887708724 ... 2014-02-27 18:07:27,353 INFO audit: ... cmd=getfileinfo src=/tmp ... 2014-02-27 18:07:27,354 INFO audit: ... cmd=getfileinfo src=/... 2014-02-27 18:07:27,354 INFO audit: ... cmd=getfileinfo src=/tmp/temp-887708724/tmp883330348/foo-0.0.44.jar ... 2014-02-27 18:07:27,355 INFO audit: ... cmd=open src=/tmp/temp-887708724/tmp883330348/foo-0.0.44.jar ... {noformat} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1769) CapacityScheduler: Improve reservations
[ https://issues.apache.org/jira/browse/YARN-1769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves updated YARN-1769: Attachment: YARN-1769.patch In this patch I tried to minimize the code changes. I choose to keep the accounting/book keeping of reservations the same to hopefully minimize the impact of this and keep it small. I made this change configurable (which is refreshable via yarn rmadmin -refreshQueues). At a high level what it does is: - for the limit checks, it does the normal checks but then if it has hit a limit and this is configured on, it does the check again subtracting out the amount reserved. If that is under the limit it allows it to go on to see if it could unreserve a spot and use the current node. - for the number of reservation limit, we simply delay that check and if we could allocate on the current node by unreserving then we do. CapacityScheduler: Improve reservations Key: YARN-1769 URL: https://issues.apache.org/jira/browse/YARN-1769 Project: Hadoop YARN Issue Type: Improvement Components: capacityscheduler Affects Versions: 2.3.0 Reporter: Thomas Graves Assignee: Thomas Graves Attachments: YARN-1769.patch Currently the CapacityScheduler uses reservations in order to handle requests for large containers and the fact there might not currently be enough space available on a single host. The current algorithm for reservations is to reserve as many containers as currently required and then it will start to reserve more above that after a certain number of re-reservations (currently biased against larger containers). Anytime it hits the limit of number reserved it stops looking at any other nodes. This results in potentially missing nodes that have enough space to fullfill the request. The other place for improvement is currently reservations count against your queue capacity. If you have reservations you could hit the various limits which would then stop you from looking further at that node. The above 2 cases can cause an application requesting a larger container to take a long time to gets it resources. We could improve upon both of those by simply continuing to look at incoming nodes to see if we could potentially swap out a reservation for an actual allocation. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1769) CapacityScheduler: Improve reservations
[ https://issues.apache.org/jira/browse/YARN-1769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13918192#comment-13918192 ] Hadoop QA commented on YARN-1769: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12632271/YARN-1769.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 5 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:red}-1 findbugs{color}. The patch appears to introduce 2 new Findbugs (version 1.3.9) warnings. {color:red}-1 release audit{color}. The applied patch generated 1 release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.yarn.server.resourcemanager.recovery.TestFSRMStateStore {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/3229//testReport/ Release audit warnings: https://builds.apache.org/job/PreCommit-YARN-Build/3229//artifact/trunk/patchprocess/patchReleaseAuditProblems.txt Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/3229//artifact/trunk/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-resourcemanager.html Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3229//console This message is automatically generated. CapacityScheduler: Improve reservations Key: YARN-1769 URL: https://issues.apache.org/jira/browse/YARN-1769 Project: Hadoop YARN Issue Type: Improvement Components: capacityscheduler Affects Versions: 2.3.0 Reporter: Thomas Graves Assignee: Thomas Graves Attachments: YARN-1769.patch Currently the CapacityScheduler uses reservations in order to handle requests for large containers and the fact there might not currently be enough space available on a single host. The current algorithm for reservations is to reserve as many containers as currently required and then it will start to reserve more above that after a certain number of re-reservations (currently biased against larger containers). Anytime it hits the limit of number reserved it stops looking at any other nodes. This results in potentially missing nodes that have enough space to fullfill the request. The other place for improvement is currently reservations count against your queue capacity. If you have reservations you could hit the various limits which would then stop you from looking further at that node. The above 2 cases can cause an application requesting a larger container to take a long time to gets it resources. We could improve upon both of those by simply continuing to look at incoming nodes to see if we could potentially swap out a reservation for an actual allocation. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1771) many getFileStatus calls made from node manager for localizing a public distributed cache resource
[ https://issues.apache.org/jira/browse/YARN-1771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13918247#comment-13918247 ] Sangjin Lee commented on YARN-1771: --- Would it be a little weaker condition than the current public check? The current check calls for the READ permission by others. One possible case here is if the user has a group READ permission on the file (but others' READ permission is off). Then the user's doAs would succeed even though others do not have the READ permission. many getFileStatus calls made from node manager for localizing a public distributed cache resource -- Key: YARN-1771 URL: https://issues.apache.org/jira/browse/YARN-1771 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: Sangjin Lee Assignee: Sangjin Lee Priority: Critical We're observing that the getFileStatus calls are putting a fair amount of load on the name node as part of checking the public-ness for localizing a resource that belong in the public cache. We see 7 getFileStatus calls made for each of these resource. We should look into reducing the number of calls to the name node. One example: {noformat} 2014-02-27 18:07:27,351 INFO audit: ... cmd=getfileinfo src=/tmp/temp-887708724/tmp883330348/foo-0.0.44.jar ... 2014-02-27 18:07:27,352 INFO audit: ... cmd=getfileinfo src=/tmp/temp-887708724/tmp883330348/foo-0.0.44.jar ... 2014-02-27 18:07:27,352 INFO audit: ... cmd=getfileinfo src=/tmp/temp-887708724/tmp883330348 ... 2014-02-27 18:07:27,353 INFO audit: ... cmd=getfileinfo src=/tmp/temp-887708724 ... 2014-02-27 18:07:27,353 INFO audit: ... cmd=getfileinfo src=/tmp ... 2014-02-27 18:07:27,354 INFO audit: ... cmd=getfileinfo src=/... 2014-02-27 18:07:27,354 INFO audit: ... cmd=getfileinfo src=/tmp/temp-887708724/tmp883330348/foo-0.0.44.jar ... 2014-02-27 18:07:27,355 INFO audit: ... cmd=open src=/tmp/temp-887708724/tmp883330348/foo-0.0.44.jar ... {noformat} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-1776) renewDelegationToken should survive RM failover
Zhijie Shen created YARN-1776: - Summary: renewDelegationToken should survive RM failover Key: YARN-1776 URL: https://issues.apache.org/jira/browse/YARN-1776 Project: Hadoop YARN Issue Type: Sub-task Reporter: Zhijie Shen Assignee: Zhijie Shen When a delegation token is renewed, two RMStateStore operations: 1) removing the old DT, and 2) storing the new DT will happen. If RM fails in between. There would be problem. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1771) many getFileStatus calls made from node manager for localizing a public distributed cache resource
[ https://issues.apache.org/jira/browse/YARN-1771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13918301#comment-13918301 ] Jason Lowe commented on YARN-1771: -- Yes, it would be a weaker condition check, but I'm wondering if the weaker check still meets the security needs of the dist cache. A user is requesting a resource to be publicly localized. If they have read permissions to it then even if others lack access then the original user can trivially work around that obstacle by copying to a publicly accessible location (e.g.: /tmp). So in that sense the user has a legitimate way to make the resource data public even if it isn't right now. A subsequent request for the same resource would check the timestamp doing the same doAs logic, so if another user doesn't have access then they won't localize. It's true that the other user's container can still access the resource by avoiding explicit localization and instead scanning/scraping the local public distcache area directly once it runs. However the original user who requested the resource asked for it to be public and has the means to make it public, so they probably aren't concerned that the public can access it. This approach would also be useful to the shared cache design in YARN-1492, where it was calling for the ability to make something a public resource directly from a user's staging area. There may be some security concerns that I've missed, but if this ends up being a possibility then it would eliminate all of the parent directory stat calls on public localization. many getFileStatus calls made from node manager for localizing a public distributed cache resource -- Key: YARN-1771 URL: https://issues.apache.org/jira/browse/YARN-1771 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: Sangjin Lee Assignee: Sangjin Lee Priority: Critical We're observing that the getFileStatus calls are putting a fair amount of load on the name node as part of checking the public-ness for localizing a resource that belong in the public cache. We see 7 getFileStatus calls made for each of these resource. We should look into reducing the number of calls to the name node. One example: {noformat} 2014-02-27 18:07:27,351 INFO audit: ... cmd=getfileinfo src=/tmp/temp-887708724/tmp883330348/foo-0.0.44.jar ... 2014-02-27 18:07:27,352 INFO audit: ... cmd=getfileinfo src=/tmp/temp-887708724/tmp883330348/foo-0.0.44.jar ... 2014-02-27 18:07:27,352 INFO audit: ... cmd=getfileinfo src=/tmp/temp-887708724/tmp883330348 ... 2014-02-27 18:07:27,353 INFO audit: ... cmd=getfileinfo src=/tmp/temp-887708724 ... 2014-02-27 18:07:27,353 INFO audit: ... cmd=getfileinfo src=/tmp ... 2014-02-27 18:07:27,354 INFO audit: ... cmd=getfileinfo src=/... 2014-02-27 18:07:27,354 INFO audit: ... cmd=getfileinfo src=/tmp/temp-887708724/tmp883330348/foo-0.0.44.jar ... 2014-02-27 18:07:27,355 INFO audit: ... cmd=open src=/tmp/temp-887708724/tmp883330348/foo-0.0.44.jar ... {noformat} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1734) RM should get the updated Configurations when it transits from Standby to Active
[ https://issues.apache.org/jira/browse/YARN-1734?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13918307#comment-13918307 ] Xuan Gong commented on YARN-1734: - bq. It appears the AdminService#refreshAll is called on transition to active. However, calling any of the refresh commands on the Standby throws StandbyException. This can lead to confusion - we throw an exception even though the refresh command takes affect when the RM transitions to Active. After rm.transitionToActive() is successfully executed, the rm is at Active state. So, it will not throw out StandbyException. RM should get the updated Configurations when it transits from Standby to Active Key: YARN-1734 URL: https://issues.apache.org/jira/browse/YARN-1734 Project: Hadoop YARN Issue Type: Sub-task Reporter: Xuan Gong Assignee: Xuan Gong Priority: Critical Fix For: 2.4.0 Attachments: YARN-1734.1.patch, YARN-1734.2.patch, YARN-1734.3.patch, YARN-1734.4.patch, YARN-1734.5.patch, YARN-1734.6.patch, YARN-1734.7.patch Currently, we have ConfigurationProvider which can support LocalConfiguration, and FileSystemBasedConfiguration. When HA is enabled, and FileSystemBasedConfiguration is enabled, RM can not get the updated Configurations when it transits from Standby to Active -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Assigned] (YARN-1759) Configuration settings can potentially disappear post YARN-1666
[ https://issues.apache.org/jira/browse/YARN-1759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuan Gong reassigned YARN-1759: --- Assignee: Xuan Gong Configuration settings can potentially disappear post YARN-1666 --- Key: YARN-1759 URL: https://issues.apache.org/jira/browse/YARN-1759 Project: Hadoop YARN Issue Type: Bug Reporter: Hitesh Shah Assignee: Xuan Gong By implicitly loading core-site and yarn-site again in the RM::serviceInit(), some configs may be unintentionally overridden. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1734) RM should get the updated Configurations when it transits from Standby to Active
[ https://issues.apache.org/jira/browse/YARN-1734?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13918365#comment-13918365 ] Karthik Kambatla commented on YARN-1734: In our case, we plan to use the LocalConfiguration and not the FileSystemBased one. So, in the HA case, we would update the local configs on both RMs and call the appropriate refresh command on both RMs - this is what we do for HDFS as well. The expectation is that the Active picks these up immediately, and the Standby picks them eventually when it becomes Active. In other words, the expectation is that these updates are not lost. With the current code, the Standby would throw a StandbyException, thereby telling the user that the config refresh has failed. This is not exactly true, because the Standby would actually pick the latest configs when transitioning to Active. No? Let me think more on this, but thought I should raise this concern. RM should get the updated Configurations when it transits from Standby to Active Key: YARN-1734 URL: https://issues.apache.org/jira/browse/YARN-1734 Project: Hadoop YARN Issue Type: Sub-task Reporter: Xuan Gong Assignee: Xuan Gong Priority: Critical Fix For: 2.4.0 Attachments: YARN-1734.1.patch, YARN-1734.2.patch, YARN-1734.3.patch, YARN-1734.4.patch, YARN-1734.5.patch, YARN-1734.6.patch, YARN-1734.7.patch Currently, we have ConfigurationProvider which can support LocalConfiguration, and FileSystemBasedConfiguration. When HA is enabled, and FileSystemBasedConfiguration is enabled, RM can not get the updated Configurations when it transits from Standby to Active -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1768) yarn kill non-existent application is too verbose
[ https://issues.apache.org/jira/browse/YARN-1768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13918381#comment-13918381 ] Ravi Prakash commented on YARN-1768: With this patch the return code is wrong (0). Earlier it was returning a non-zero error code. Please also consider adding that check to the test yarn kill non-existent application is too verbose - Key: YARN-1768 URL: https://issues.apache.org/jira/browse/YARN-1768 Project: Hadoop YARN Issue Type: Bug Components: client Affects Versions: 2.2.0 Reporter: Hitesh Shah Assignee: Tsuyoshi OZAWA Priority: Minor Attachments: YARN-1768.1.patch, YARN-1768.2.patch Instead of catching ApplicationNotFound and logging a simple app not found message, the whole stack trace is logged. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1389) ApplicationClientProtocol and ApplicationHistoryProtocol should expose analog APIs
[ https://issues.apache.org/jira/browse/YARN-1389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13918401#comment-13918401 ] Zhijie Shen commented on YARN-1389: --- [~mayank_bansal], the patch still doesn't compile. Would you please check it again? ApplicationClientProtocol and ApplicationHistoryProtocol should expose analog APIs -- Key: YARN-1389 URL: https://issues.apache.org/jira/browse/YARN-1389 Project: Hadoop YARN Issue Type: Sub-task Reporter: Mayank Bansal Assignee: Mayank Bansal Attachments: YARN-1389-1.patch, YARN-1389-2.patch As we plan to have the APIs in ApplicationHistoryProtocol to expose the reports of *finished* application attempts and containers, we should do the same for ApplicationClientProtocol, which will return the reports of *running* attempts and containers. Later on, we can improve YarnClient to direct the query of running instance to ApplicationClientProtocol, while that of finished instance to ApplicationHistoryProtocol, making it transparent to the users. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-90) NodeManager should identify failed disks becoming good back again
[ https://issues.apache.org/jira/browse/YARN-90?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13918405#comment-13918405 ] Varun Vasudev commented on YARN-90: --- Ravi, are you still working on this ticket? Do you mind if I take over? NodeManager should identify failed disks becoming good back again - Key: YARN-90 URL: https://issues.apache.org/jira/browse/YARN-90 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Reporter: Ravi Gummadi Attachments: YARN-90.1.patch, YARN-90.patch, YARN-90.patch, YARN-90.patch, YARN-90.patch MAPREDUCE-3121 makes NodeManager identify disk failures. But once a disk goes down, it is marked as failed forever. To reuse that disk (after it becomes good), NodeManager needs restart. This JIRA is to improve NodeManager to reuse good disks(which could be bad some time back). -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-1777) Nodemanager fails to detect Full disk and try to launch container
Yesha Vora created YARN-1777: Summary: Nodemanager fails to detect Full disk and try to launch container Key: YARN-1777 URL: https://issues.apache.org/jira/browse/YARN-1777 Project: Hadoop YARN Issue Type: Bug Reporter: Yesha Vora Nodemanager is not able to recognize that the disk is full. it keeps retrying to launch a container on full disk. -- 2013-06-06 17:45:25,319 INFO container.Container (ContainerImpl.java:handle(852)) - Container container_1370473246485_0136_01_18 transitioned from LOCALIZING to LOCALIZED 2013-06-06 17:45:25,328 INFO container.Container (ContainerImpl.java:handle(852)) - Container container_1370473246485_0136_01_19 transitioned from LOCALIZED to RUNNING 2013-06-06 17:45:25,329 WARN launcher.ContainerLaunch (ContainerLaunch.java:call(255)) - Failed to launch container. java.io.IOException: mkdir of /tmp/1/hdp/yarn/local/usercache/hrt_qa/appcache/application_1370473246485_0136/container_1370473246485_0136_01_19 failed at org.apache.hadoop.fs.FileSystem.primitiveMkdir(FileSystem.java:1044) at org.apache.hadoop.fs.DelegateToFileSystem.mkdir(DelegateToFileSystem.java:150) at org.apache.hadoop.fs.FilterFs.mkdir(FilterFs.java:187) at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:730) at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:726) at org.apache.hadoop.fs.FileContext$FSLinkResolver.resolve(FileContext.java:2379) at org.apache.hadoop.fs.FileContext.mkdir(FileContext.java:726) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.createDir(DefaultContainerExecutor.java:412) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:130) at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:250) at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:73) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) at java.util.concurrent.FutureTask.run(FutureTask.java:138) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) at java.lang.Thread.run(Thread.java:662) 2013-06-06 17:45:25,330 INFO container.Container (ContainerImpl.java:handle(852)) - Container container_1370473246485_0136_01_19 transitioned from RUNNING to EXITED_WITH_FAILURE 2013-06-06 17:45:25,330 INFO launcher.ContainerLaunch (ContainerLaunch.java:cleanupContainer(307)) - Cleaning up container container_1370473246485_0136_01_19 2013-06-06 17:45:25,333 WARN launcher.ContainerLaunch (ContainerLaunch.java:call(255)) - Failed to launch container. 
java.io.IOException: mkdir of /tmp/1/hdp/yarn/local/usercache/hrt_qa/appcache/application_1370473246485_0136/container_1370473246485_0136_01_18 failed at org.apache.hadoop.fs.FileSystem.primitiveMkdir(FileSystem.java:1044) at org.apache.hadoop.fs.DelegateToFileSystem.mkdir(DelegateToFileSystem.java:150) at org.apache.hadoop.fs.FilterFs.mkdir(FilterFs.java:187) at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:730) at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:726) at org.apache.hadoop.fs.FileContext$FSLinkResolver.resolve(FileContext.java:2379) at org.apache.hadoop.fs.FileContext.mkdir(FileContext.java:726) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.createDir(DefaultContainerExecutor.java:412) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:130) at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:250) at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:73) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) at java.util.concurrent.FutureTask.run(FutureTask.java:138) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) at java.lang.Thread.run(Thread.java:662) -- -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1734) RM should get the updated Configurations when it transits from Standby to Active
[ https://issues.apache.org/jira/browse/YARN-1734?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13918433#comment-13918433 ] Xuan Gong commented on YARN-1734: - So, if the Standby RM transits to Active, it will pick the latest configuration. For calling refresh* in standby RM, it will throw a standbyException and trigger the retry. In that case, even if we call refresh* in Standby RM, it actually do the refresh* in active RM. bq. With the current code, the Standby would throw a StandbyException, thereby telling the user that the config refresh has failed. This is not exactly true, because the Standby would actually pick the latest configs when transitioning to Active. No? When RM is at Standby state, all of the active services have already been stopped. I think this pick the latest configs should mean all the related services pick the latest configs, such as CapacityScheduler, NodesListManager, ClientRMService, ResourceTrackerService, etc. But since most of these services are stopped in standby mode, they can not get the latest configurations. RM should get the updated Configurations when it transits from Standby to Active Key: YARN-1734 URL: https://issues.apache.org/jira/browse/YARN-1734 Project: Hadoop YARN Issue Type: Sub-task Reporter: Xuan Gong Assignee: Xuan Gong Priority: Critical Fix For: 2.4.0 Attachments: YARN-1734.1.patch, YARN-1734.2.patch, YARN-1734.3.patch, YARN-1734.4.patch, YARN-1734.5.patch, YARN-1734.6.patch, YARN-1734.7.patch Currently, we have ConfigurationProvider which can support LocalConfiguration, and FileSystemBasedConfiguration. When HA is enabled, and FileSystemBasedConfiguration is enabled, RM can not get the updated Configurations when it transits from Standby to Active -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1734) RM should get the updated Configurations when it transits from Standby to Active
[ https://issues.apache.org/jira/browse/YARN-1734?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13918437#comment-13918437 ] Karthik Kambatla commented on YARN-1734: I guess the ambiguity stems from the definition of success for {{rmadmin -refresh*}} commands. I propose adding a config - yarn.resourcemanager.ha.refresh-all-rms. When set, the refresh commands should attempt to refresh on all RMs and fail if it can't - i.e., this should fail when called on the StandbyRM? When cleared, the refresh command should attempt to refresh only on this RM and should succeed as long as the configs are refreshed as early as they are required - i.e., it should be okay to refresh on transition to active and the StandbyRM should also succeed? [~xgong], [~vinodkv] - do you think this captures the behavior well enough and is reasonable? RM should get the updated Configurations when it transits from Standby to Active Key: YARN-1734 URL: https://issues.apache.org/jira/browse/YARN-1734 Project: Hadoop YARN Issue Type: Sub-task Reporter: Xuan Gong Assignee: Xuan Gong Priority: Critical Fix For: 2.4.0 Attachments: YARN-1734.1.patch, YARN-1734.2.patch, YARN-1734.3.patch, YARN-1734.4.patch, YARN-1734.5.patch, YARN-1734.6.patch, YARN-1734.7.patch Currently, we have ConfigurationProvider which can support LocalConfiguration, and FileSystemBasedConfiguration. When HA is enabled, and FileSystemBasedConfiguration is enabled, RM can not get the updated Configurations when it transits from Standby to Active -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-90) NodeManager should identify failed disks becoming good back again
[ https://issues.apache.org/jira/browse/YARN-90?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13918439#comment-13918439 ] Ravi Prakash commented on YARN-90: -- I'm not working on it. Please feel free to take it over. Thanks Varun NodeManager should identify failed disks becoming good back again - Key: YARN-90 URL: https://issues.apache.org/jira/browse/YARN-90 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Reporter: Ravi Gummadi Attachments: YARN-90.1.patch, YARN-90.patch, YARN-90.patch, YARN-90.patch, YARN-90.patch MAPREDUCE-3121 makes NodeManager identify disk failures. But once a disk goes down, it is marked as failed forever. To reuse that disk (after it becomes good), NodeManager needs restart. This JIRA is to improve NodeManager to reuse good disks(which could be bad some time back). -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1752) Unexpected Unregistered event at Attempt Launched state
[ https://issues.apache.org/jira/browse/YARN-1752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13918459#comment-13918459 ] Jian He commented on YARN-1752: --- Patch looks good overall, some minors: - styling: exceed the 80 column limit, {code} public void unregisterAppAttempt(final FinishApplicationMasterRequest req,boolean waitForStateRunning) {code} - we can consolidate the exception comments like this ? {code} * This exception is thrown when an ApplicationMaster asks for resources by * calling {@link ApplicationMasterProtocol#allocate(AllocateRequest)} or tries * to unregister by calling * {@link ApplicationMasterProtocol#finishApplicationMaster(FinishApplicationMasterRequest)} * without first registering with ResourceManager by calling * {@link ApplicationMasterProtocol#registerApplicationMaster(RegisterApplicationMasterRequest)} * or if it tries to register more than once. {code} - Test: we can check the attempt state to be Launched state after this call. Simply, we can just use MockRM.launchAM {code} MockAM am1 = rm.sendAMLaunched(attempt1.getAppAttemptId()); {code} Unexpected Unregistered event at Attempt Launched state --- Key: YARN-1752 URL: https://issues.apache.org/jira/browse/YARN-1752 Project: Hadoop YARN Issue Type: Bug Reporter: Jian He Assignee: Rohith Attachments: YARN-1752.1.patch, YARN-1752.2.patch, YARN-1752.3.patch {code} 2014-02-21 14:56:03,453 ERROR org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: Can't handle this event at current state org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: UNREGISTERED at LAUNCHED at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:305) at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:647) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:103) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:733) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:714) at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173) at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106) at java.lang.Thread.run(Thread.java:695) {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1734) RM should get the updated Configurations when it transits from Standby to Active
[ https://issues.apache.org/jira/browse/YARN-1734?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13918441#comment-13918441 ] Karthik Kambatla commented on YARN-1734: bq. For calling refresh* in standby RM, it will throw a standbyException and trigger the retry. In that case, even if we call refresh* in Standby RM, it actually do the refresh* in active RM. Sorry, I missed this while browsing through the code. Let me try this on a cluster and report. RM should get the updated Configurations when it transits from Standby to Active Key: YARN-1734 URL: https://issues.apache.org/jira/browse/YARN-1734 Project: Hadoop YARN Issue Type: Sub-task Reporter: Xuan Gong Assignee: Xuan Gong Priority: Critical Fix For: 2.4.0 Attachments: YARN-1734.1.patch, YARN-1734.2.patch, YARN-1734.3.patch, YARN-1734.4.patch, YARN-1734.5.patch, YARN-1734.6.patch, YARN-1734.7.patch Currently, we have ConfigurationProvider which can support LocalConfiguration, and FileSystemBasedConfiguration. When HA is enabled, and FileSystemBasedConfiguration is enabled, RM can not get the updated Configurations when it transits from Standby to Active -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1758) MiniYARNCluster broken post YARN-1666
[ https://issues.apache.org/jira/browse/YARN-1758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13918462#comment-13918462 ] Vinod Kumar Vavilapalli commented on YARN-1758: --- This looks fine enough for me for now. In the interest of progress, let's track YARN-1759 separately. +1. Checking this in now. MiniYARNCluster broken post YARN-1666 - Key: YARN-1758 URL: https://issues.apache.org/jira/browse/YARN-1758 Project: Hadoop YARN Issue Type: Bug Reporter: Hitesh Shah Assignee: Xuan Gong Priority: Blocker Attachments: YARN-1758.1.patch, YARN-1758.2.patch NPE seen when trying to use MiniYARNCluster -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1759) Configuration settings can potentially disappear post YARN-1666
[ https://issues.apache.org/jira/browse/YARN-1759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13918465#comment-13918465 ] Hitesh Shah commented on YARN-1759: --- [~ste...@apache.org] A common case will be with mini clusters where the code itself updates the config based on what ports the daemon binds to. Configuration settings can potentially disappear post YARN-1666 --- Key: YARN-1759 URL: https://issues.apache.org/jira/browse/YARN-1759 Project: Hadoop YARN Issue Type: Bug Reporter: Hitesh Shah Assignee: Xuan Gong By implicitly loading core-site and yarn-site again in the RM::serviceInit(), some configs may be unintentionally overridden. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (YARN-1777) Nodemanager fails to detect Full disk and try to launch container
[ https://issues.apache.org/jira/browse/YARN-1777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jason Lowe resolved YARN-1777. -- Resolution: Duplicate This is a duplicate of YARN-257. Nodemanager fails to detect Full disk and try to launch container - Key: YARN-1777 URL: https://issues.apache.org/jira/browse/YARN-1777 Project: Hadoop YARN Issue Type: Bug Reporter: Yesha Vora Nodemanager is not able to recognize that the disk is full. it keeps retrying to launch a container on full disk. -- 2013-06-06 17:45:25,319 INFO container.Container (ContainerImpl.java:handle(852)) - Container container_1370473246485_0136_01_18 transitioned from LOCALIZING to LOCALIZED 2013-06-06 17:45:25,328 INFO container.Container (ContainerImpl.java:handle(852)) - Container container_1370473246485_0136_01_19 transitioned from LOCALIZED to RUNNING 2013-06-06 17:45:25,329 WARN launcher.ContainerLaunch (ContainerLaunch.java:call(255)) - Failed to launch container. java.io.IOException: mkdir of /tmp/1/hdp/yarn/local/usercache/hrt_qa/appcache/application_1370473246485_0136/container_1370473246485_0136_01_19 failed at org.apache.hadoop.fs.FileSystem.primitiveMkdir(FileSystem.java:1044) at org.apache.hadoop.fs.DelegateToFileSystem.mkdir(DelegateToFileSystem.java:150) at org.apache.hadoop.fs.FilterFs.mkdir(FilterFs.java:187) at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:730) at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:726) at org.apache.hadoop.fs.FileContext$FSLinkResolver.resolve(FileContext.java:2379) at org.apache.hadoop.fs.FileContext.mkdir(FileContext.java:726) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.createDir(DefaultContainerExecutor.java:412) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:130) at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:250) at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:73) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) at java.util.concurrent.FutureTask.run(FutureTask.java:138) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) at java.lang.Thread.run(Thread.java:662) 2013-06-06 17:45:25,330 INFO container.Container (ContainerImpl.java:handle(852)) - Container container_1370473246485_0136_01_19 transitioned from RUNNING to EXITED_WITH_FAILURE 2013-06-06 17:45:25,330 INFO launcher.ContainerLaunch (ContainerLaunch.java:cleanupContainer(307)) - Cleaning up container container_1370473246485_0136_01_19 2013-06-06 17:45:25,333 WARN launcher.ContainerLaunch (ContainerLaunch.java:call(255)) - Failed to launch container. 
java.io.IOException: mkdir of /tmp/1/hdp/yarn/local/usercache/hrt_qa/appcache/application_1370473246485_0136/container_1370473246485_0136_01_18 failed at org.apache.hadoop.fs.FileSystem.primitiveMkdir(FileSystem.java:1044) at org.apache.hadoop.fs.DelegateToFileSystem.mkdir(DelegateToFileSystem.java:150) at org.apache.hadoop.fs.FilterFs.mkdir(FilterFs.java:187) at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:730) at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:726) at org.apache.hadoop.fs.FileContext$FSLinkResolver.resolve(FileContext.java:2379) at org.apache.hadoop.fs.FileContext.mkdir(FileContext.java:726) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.createDir(DefaultContainerExecutor.java:412) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:130) at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:250) at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:73) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) at java.util.concurrent.FutureTask.run(FutureTask.java:138) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) at java.lang.Thread.run(Thread.java:662) -- --
[jira] [Commented] (YARN-1764) Handle RM fail overs after the submitApplication call.
[ https://issues.apache.org/jira/browse/YARN-1764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13918499#comment-13918499 ] Xuan Gong commented on YARN-1764: - Let us continue our discussions on case 3: Handle RM fail overs after the submitApplication call. Reply to [~kkambatl]‘s comment: “ I don't see 3 to be as straight-forward, and suspect would require revisiting the state machine.” We will only consider the case that failover happens after submitApplication call. It means when failover happens, we have already received the SubmitApplicationResponse. When the failover happens, we will *not re-entry* clientRMService#submitApplication() again. What will happen next is that getApplicationReport() will start to execute. And YarnClient will start to re-try until it finds the next active RM, and continue execute getApplicationReport(). Now we have two cases to handle: * RMStateStore already saved the ApplicationState when failover happens. * RMStateStore does not save the ApplicationState when failover happens. For case1, we do not need to make any changes. For case2, if the failover happens, when we try to execute getApplicationReport, we will get ApplicationNotFoundException. I think this is the only case we should handle here. Handle RM fail overs after the submitApplication call. -- Key: YARN-1764 URL: https://issues.apache.org/jira/browse/YARN-1764 Project: Hadoop YARN Issue Type: Sub-task Reporter: Xuan Gong Assignee: Xuan Gong -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1729) TimelineWebServices always passes primary and secondary filters as strings
[ https://issues.apache.org/jira/browse/YARN-1729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13918507#comment-13918507 ] Zhijie Shen commented on YARN-1729: --- 1. mapper is not necessary, objectReader and objectReader should be final, and both constants can be initiated in a static block. And please follow the name convention of the static final constant. {code} + private static ObjectMapper mapper = new ObjectMapper(); + private static ObjectReader objectReader = mapper.reader(Object.class); + private static ObjectWriter objectWriter = mapper.writer(); {code} 2. Similar problem here. {code} + private static ObjectReader objectReader = + new ObjectMapper().reader(Object.class); {code} 3. In the test case, would you mind adding one more test case of other:123abc to show the difference? {code} +ClientResponse response = r.path(ws).path(v1).path(timeline) +.path(type_1).queryParam(primaryFilter, other:\123abc\) {code} Other than that, the patch looks good to me. In addition, I'm aware of an additional issue of the leveldb implementation, which is aware of JSON input specification. This means whenever our RESTful APIs allows to take XML input, the current implementation may not work correctly. IMHO, ideally the store should be isolated from the RESTful interface input types. Anyway, let's leave the issue separately not to block this patch, as the issue happens before this patch as well. TimelineWebServices always passes primary and secondary filters as strings -- Key: YARN-1729 URL: https://issues.apache.org/jira/browse/YARN-1729 Project: Hadoop YARN Issue Type: Sub-task Reporter: Billie Rinaldi Assignee: Billie Rinaldi Attachments: YARN-1729.1.patch, YARN-1729.2.patch, YARN-1729.3.patch, YARN-1729.4.patch, YARN-1729.5.patch Primary filters and secondary filter values can be arbitrary json-compatible Object. The web services should determine if the filters specified as query parameters are objects or strings before passing them to the store. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (YARN-1704) Review LICENSE and NOTICE to reflect new levelDB releated libraries being used
[ https://issues.apache.org/jira/browse/YARN-1704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kumar Vavilapalli resolved YARN-1704. --- Resolution: Fixed Fix Version/s: 2.4.0 Hadoop Flags: Reviewed Committed this to trunk, branch-2 and branch-2.4. Thanks Billie! Review LICENSE and NOTICE to reflect new levelDB releated libraries being used -- Key: YARN-1704 URL: https://issues.apache.org/jira/browse/YARN-1704 Project: Hadoop YARN Issue Type: Sub-task Reporter: Billie Rinaldi Assignee: Billie Rinaldi Priority: Blocker Fix For: 2.4.0 Attachments: YARN-1704.1.patch, YARN-1704.2.patch, YARN-1704.3.patch Make any changes necessary in LICENSE and NOTICE related to dependencies introduced by the application timeline store. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1704) Review LICENSE and NOTICE to reflect new levelDB releated libraries being used
[ https://issues.apache.org/jira/browse/YARN-1704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13918526#comment-13918526 ] Hudson commented on YARN-1704: -- SUCCESS: Integrated in Hadoop-trunk-Commit #5254 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/5254/]) YARN-1704. Modified LICENSE and NOTICE files to reflect newly used levelDB related libraries. Contributed by Billie Rinaldi. (vinodkv: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1573702) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/LICENSE.txt * /hadoop/common/trunk/hadoop-yarn-project/NOTICE.txt Review LICENSE and NOTICE to reflect new levelDB releated libraries being used -- Key: YARN-1704 URL: https://issues.apache.org/jira/browse/YARN-1704 Project: Hadoop YARN Issue Type: Sub-task Reporter: Billie Rinaldi Assignee: Billie Rinaldi Priority: Blocker Fix For: 2.4.0 Attachments: YARN-1704.1.patch, YARN-1704.2.patch, YARN-1704.3.patch Make any changes necessary in LICENSE and NOTICE related to dependencies introduced by the application timeline store. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1758) MiniYARNCluster broken post YARN-1666
[ https://issues.apache.org/jira/browse/YARN-1758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13918525#comment-13918525 ] Hudson commented on YARN-1758: -- SUCCESS: Integrated in Hadoop-trunk-Commit #5254 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/5254/]) YARN-1758. Fixed ResourceManager to not mandate the presence of site specific configuration files and thus fix failures in downstream tests. Contributed by Xuan Gong. (vinodkv: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1573695) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/FileSystemBasedConfigurationProvider.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/AdminService.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ApplicationMasterService.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ClientRMService.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ResourceManager.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ResourceTrackerService.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacityScheduler.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestRMAdminService.java MiniYARNCluster broken post YARN-1666 - Key: YARN-1758 URL: https://issues.apache.org/jira/browse/YARN-1758 Project: Hadoop YARN Issue Type: Bug Reporter: Hitesh Shah Assignee: Xuan Gong Priority: Blocker Fix For: 2.4.0 Attachments: YARN-1758.1.patch, YARN-1758.2.patch NPE seen when trying to use MiniYARNCluster -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1765) Write test cases to verify that killApplication API works in RM HA
[ https://issues.apache.org/jira/browse/YARN-1765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13918545#comment-13918545 ] Vinod Kumar Vavilapalli commented on YARN-1765: --- Looks good. +1. Checking this in. Write test cases to verify that killApplication API works in RM HA -- Key: YARN-1765 URL: https://issues.apache.org/jira/browse/YARN-1765 Project: Hadoop YARN Issue Type: Sub-task Reporter: Xuan Gong Assignee: Xuan Gong Attachments: YARN-1765.1.patch, YARN-1765.2.patch, YARN-1765.2.patch, YARN-1765.3.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1771) many getFileStatus calls made from node manager for localizing a public distributed cache resource
[ https://issues.apache.org/jira/browse/YARN-1771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13918548#comment-13918548 ] Chris Douglas commented on YARN-1771: - The simpler check doesn't seem to have any practical issues. Since the cache is keyed on Paths, the case where a user can refer to an object without access to it seems pretty esoteric. As long as the public cache runs with lowered privileges, and the check isn't necessary to verify that the public resource isn't private to YARN. Copying with the user's HDFS credentials avoids that, though that seems like a heavyweight solution if reducing getFileStatus calls is the only motivation. many getFileStatus calls made from node manager for localizing a public distributed cache resource -- Key: YARN-1771 URL: https://issues.apache.org/jira/browse/YARN-1771 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: Sangjin Lee Assignee: Sangjin Lee Priority: Critical We're observing that the getFileStatus calls are putting a fair amount of load on the name node as part of checking the public-ness for localizing a resource that belong in the public cache. We see 7 getFileStatus calls made for each of these resource. We should look into reducing the number of calls to the name node. One example: {noformat} 2014-02-27 18:07:27,351 INFO audit: ... cmd=getfileinfo src=/tmp/temp-887708724/tmp883330348/foo-0.0.44.jar ... 2014-02-27 18:07:27,352 INFO audit: ... cmd=getfileinfo src=/tmp/temp-887708724/tmp883330348/foo-0.0.44.jar ... 2014-02-27 18:07:27,352 INFO audit: ... cmd=getfileinfo src=/tmp/temp-887708724/tmp883330348 ... 2014-02-27 18:07:27,353 INFO audit: ... cmd=getfileinfo src=/tmp/temp-887708724 ... 2014-02-27 18:07:27,353 INFO audit: ... cmd=getfileinfo src=/tmp ... 2014-02-27 18:07:27,354 INFO audit: ... cmd=getfileinfo src=/... 2014-02-27 18:07:27,354 INFO audit: ... cmd=getfileinfo src=/tmp/temp-887708724/tmp883330348/foo-0.0.44.jar ... 2014-02-27 18:07:27,355 INFO audit: ... cmd=open src=/tmp/temp-887708724/tmp883330348/foo-0.0.44.jar ... {noformat} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1765) Write test cases to verify that killApplication API works in RM HA
[ https://issues.apache.org/jira/browse/YARN-1765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13918557#comment-13918557 ] Hudson commented on YARN-1765: -- SUCCESS: Integrated in Hadoop-trunk-Commit #5255 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/5255/]) YARN-1765. Added test cases to verify that killApplication API works across ResourceManager failover. Contributed by Xuan Gong. (vinodkv: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1573735) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt Write test cases to verify that killApplication API works in RM HA -- Key: YARN-1765 URL: https://issues.apache.org/jira/browse/YARN-1765 Project: Hadoop YARN Issue Type: Sub-task Reporter: Xuan Gong Assignee: Xuan Gong Fix For: 2.4.0 Attachments: YARN-1765.1.patch, YARN-1765.2.patch, YARN-1765.2.patch, YARN-1765.3.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1675) Application does not change to RUNNING after being scheduled
[ https://issues.apache.org/jira/browse/YARN-1675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13918558#comment-13918558 ] Hudson commented on YARN-1675: -- SUCCESS: Integrated in Hadoop-trunk-Commit #5255 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/5255/]) YARN-1675. Added the previously missed new file. (vinodkv: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1573736) * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestKillApplicationWithRMHA.java Application does not change to RUNNING after being scheduled Key: YARN-1675 URL: https://issues.apache.org/jira/browse/YARN-1675 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.4.0 Reporter: Trupti Dhavle I don't see any stack traces in the logs, but the debug logs show negative vcores: {noformat} 2014-01-29 18:42:26,357 DEBUG capacity.LeafQueue (LeafQueue.java:assignContainers(808)) - assignContainers: node=hor11n39.gq1.ygridcore.net #applications=5 2014-01-29 18:42:26,357 DEBUG capacity.LeafQueue (LeafQueue.java:assignContainers(827)) - pre-assignContainers for application application_1390986573180_0269 2014-01-29 18:42:26,358 DEBUG scheduler.SchedulerApplicationAttempt (SchedulerApplicationAttempt.java:showRequests(326)) - showRequests: application=application_1390986573180_0269 headRoom=memory:22528, vCores:0 currentConsumption=2048 2014-01-29 18:42:26,358 DEBUG scheduler.SchedulerApplicationAttempt (SchedulerApplicationAttempt.java:showRequests(330)) - showRequests: application=application_1390986573180_0269 request={Priority: 0, Capability: memory:2048, vCores:1, # Containers: 0, Location: *, Relax Locality: true} 2014-01-29 18:42:26,358 DEBUG capacity.LeafQueue (LeafQueue.java:assignContainers(911)) - post-assignContainers for application application_1390986573180_0269 2014-01-29 18:42:26,358 DEBUG scheduler.SchedulerApplicationAttempt (SchedulerApplicationAttempt.java:showRequests(326)) - showRequests: application=application_1390986573180_0269 headRoom=memory:22528, vCores:0 currentConsumption=2048 2014-01-29 18:42:26,358 DEBUG scheduler.SchedulerApplicationAttempt (SchedulerApplicationAttempt.java:showRequests(330)) - showRequests: application=application_1390986573180_0269 request={Priority: 0, Capability: memory:2048, vCores:1, # Containers: 0, Location: *, Relax Locality: true} 2014-01-29 18:42:26,358 DEBUG capacity.LeafQueue (LeafQueue.java:assignContainers(827)) - pre-assignContainers for application application_1390986573180_0272 2014-01-29 18:42:26,358 DEBUG scheduler.SchedulerApplicationAttempt (SchedulerApplicationAttempt.java:showRequests(326)) - showRequests: application=application_1390986573180_0272 headRoom=memory:18432, vCores:-2 currentConsumption=2048 2014-01-29 18:42:26,359 DEBUG scheduler.SchedulerApplicationAttempt (SchedulerApplicationAttempt.java:showRequests(330)) - showRequests: application=application_1390986573180_0272 request={Priority: 0, Capability: memory:2048, vCores:1, # Containers: 0, Location: *, Relax Locality: true} 2014-01-29 18:42:26,359 DEBUG capacity.LeafQueue (LeafQueue.java:assignContainers(911)) - post-assignContainers for application application_1390986573180_0272 2014-01-29 18:42:26,359 DEBUG scheduler.SchedulerApplicationAttempt (SchedulerApplicationAttempt.java:showRequests(326)) - showRequests: application=application_1390986573180_0272 headRoom=memory:18432, vCores:-2 
currentConsumption=2048 2014-01-29 18:42:26,359 DEBUG scheduler.SchedulerApplicationAttempt (SchedulerApplicationAttempt.java:showRequests(330)) - showRequests: application=application_1390986573180_0272 request={Priority: 0, Capability: memory:2048, vCores:1, # Containers: 0, Location: *, Relax Locality: true} 2014-01-29 18:42:26,359 DEBUG capacity.LeafQueue (LeafQueue.java:assignContainers(827)) - pre-assignContainers for application application_1390986573180_0273 2014-01-29 18:42:26,359 DEBUG scheduler.SchedulerApplicationAttempt (SchedulerApplicationAttempt.java:showRequests(326)) - showRequests: application=application_1390986573180_0273 headRoom=memory:18432, vCores:-2 currentConsumption=2048 2014-01-29 18:42:26,359 DEBUG scheduler.SchedulerApplicationAttempt (SchedulerApplicationAttempt.java:showRequests(330)) - showRequests: application=application_1390986573180_0273 request={Priority: 0, Capability: memory:2048, vCores:1, # Containers: 0, Location: *, Relax Locality: true} 2014-01-29 18:42:26,360 DEBUG capacity.LeafQueue (LeafQueue.java:assignContainers(911)) - post-assignContainers for application
[jira] [Commented] (YARN-1771) many getFileStatus calls made from node manager for localizing a public distributed cache resource
[ https://issues.apache.org/jira/browse/YARN-1771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13918565#comment-13918565 ] Jason Lowe commented on YARN-1771: -- Today the public cache localizes as the NM user, so the public checking is important to avoid a security problem where the user could convince the NM to localize a file for which the user does not have privileges but the NM user does (e.g.: please localize that other job's .jhist file, aggregated logs, etc.). So I think we need some kind of access check, either running as the requesting user or keeping explicit access checks as it does today, to avoid a malicious client obtaining access to private files via the NM. many getFileStatus calls made from node manager for localizing a public distributed cache resource -- Key: YARN-1771 URL: https://issues.apache.org/jira/browse/YARN-1771 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: Sangjin Lee Assignee: Sangjin Lee Priority: Critical We're observing that the getFileStatus calls are putting a fair amount of load on the name node as part of checking the public-ness for localizing a resource that belongs in the public cache. We see 7 getFileStatus calls made for each of these resources. We should look into reducing the number of calls to the name node. One example: {noformat} 2014-02-27 18:07:27,351 INFO audit: ... cmd=getfileinfo src=/tmp/temp-887708724/tmp883330348/foo-0.0.44.jar ... 2014-02-27 18:07:27,352 INFO audit: ... cmd=getfileinfo src=/tmp/temp-887708724/tmp883330348/foo-0.0.44.jar ... 2014-02-27 18:07:27,352 INFO audit: ... cmd=getfileinfo src=/tmp/temp-887708724/tmp883330348 ... 2014-02-27 18:07:27,353 INFO audit: ... cmd=getfileinfo src=/tmp/temp-887708724 ... 2014-02-27 18:07:27,353 INFO audit: ... cmd=getfileinfo src=/tmp ... 2014-02-27 18:07:27,354 INFO audit: ... cmd=getfileinfo src=/... 2014-02-27 18:07:27,354 INFO audit: ... cmd=getfileinfo src=/tmp/temp-887708724/tmp883330348/foo-0.0.44.jar ... 2014-02-27 18:07:27,355 INFO audit: ... cmd=open src=/tmp/temp-887708724/tmp883330348/foo-0.0.44.jar ... {noformat} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Assigned] (YARN-1670) aggregated log writer can write more log data than it says is the log length
[ https://issues.apache.org/jira/browse/YARN-1670?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mit Desai reassigned YARN-1670: --- Assignee: Mit Desai aggregated log writer can write more log data than it says is the log length Key: YARN-1670 URL: https://issues.apache.org/jira/browse/YARN-1670 Project: Hadoop YARN Issue Type: Bug Affects Versions: 0.23.10, 2.2.0 Reporter: Thomas Graves Assignee: Mit Desai Priority: Critical We have seen exceptions when using 'yarn logs' to read log files. at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65) at java.lang.Long.parseLong(Long.java:441) at java.lang.Long.parseLong(Long.java:483) at org.apache.hadoop.yarn.logaggregation.AggregatedLogFormat$LogReader.readAContainerLogsForALogType(AggregatedLogFormat.java:518) at org.apache.hadoop.yarn.logaggregation.LogDumper.dumpAContainerLogs(LogDumper.java:178) at org.apache.hadoop.yarn.logaggregation.LogDumper.run(LogDumper.java:130) at org.apache.hadoop.yarn.logaggregation.LogDumper.main(LogDumper.java:246) We traced it down to the reader trying to read the file type of the next file, but what it reads is still log data from the previous file. What happened was the Log Length was written as a certain size but the log data was actually longer than that. Inside of the write() routine in LogValue it first writes what the logfile length is, but then when it goes to write the log itself it just goes to the end of the file. There is a race condition here: if someone is still writing to the file when it goes to be aggregated, the length written could be too small. We should have the write() routine stop when it writes whatever it said was the length. It would be nice if we could somehow tell the user it might be truncated but I'm not sure of a good way to do this. We also noticed a bug in readAContainerLogsForALogType: it is using an int for curRead whereas it should be using a long. while (len != -1 && curRead < fileLength) { This isn't actually a problem right now as it looks like the underlying decoder is doing the right thing and the len condition exits. -- This message was sent by Atlassian JIRA (v6.2#6252)
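The fix suggested in the description - have write() stop at the length it already declared - amounts to a length-capped copy. A minimal sketch of the idea with hypothetical names (not the actual AggregatedLogFormat.LogValue code):
{code}
import java.io.DataOutputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

public class CappedLogWrite {
  // Write the declared length first, then copy at most that many bytes,
  // so a file still being appended to cannot overrun its length header.
  static void writeLogFile(DataOutputStream out, String logPath,
      long declaredLength) throws IOException {
    out.writeBytes(String.valueOf(declaredLength) + "\n");
    byte[] buf = new byte[65536];
    long remaining = declaredLength;
    try (InputStream in = new FileInputStream(logPath)) {
      while (remaining > 0) {
        int n = in.read(buf, 0, (int) Math.min(buf.length, remaining));
        if (n == -1) {
          break; // file shrank; a real fix might pad or flag truncation here
        }
        out.write(buf, 0, n);
        remaining -= n;
      }
    }
  }
}
{code}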
[jira] [Updated] (YARN-1729) TimelineWebServices always passes primary and secondary filters as strings
[ https://issues.apache.org/jira/browse/YARN-1729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Billie Rinaldi updated YARN-1729: - Attachment: YARN-1729.6.patch Thanks for the additional review. I've attached a new patch addressing your comments. TimelineWebServices always passes primary and secondary filters as strings -- Key: YARN-1729 URL: https://issues.apache.org/jira/browse/YARN-1729 Project: Hadoop YARN Issue Type: Sub-task Reporter: Billie Rinaldi Assignee: Billie Rinaldi Attachments: YARN-1729.1.patch, YARN-1729.2.patch, YARN-1729.3.patch, YARN-1729.4.patch, YARN-1729.5.patch, YARN-1729.6.patch Primary filters and secondary filter values can be arbitrary json-compatible Object. The web services should determine if the filters specified as query parameters are objects or strings before passing them to the store. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1771) many getFileStatus calls made from node manager for localizing a public distributed cache resource
[ https://issues.apache.org/jira/browse/YARN-1771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13918605#comment-13918605 ] Gera Shegalov commented on YARN-1771: - Orthogonal to this, we have been discussing adding a FileStatus[] getFileStatus(Path f) API that returns a FileStatus for each path component of f in a single RPC. Interested in comments about this idea. many getFileStatus calls made from node manager for localizing a public distributed cache resource -- Key: YARN-1771 URL: https://issues.apache.org/jira/browse/YARN-1771 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: Sangjin Lee Assignee: Sangjin Lee Priority: Critical We're observing that the getFileStatus calls are putting a fair amount of load on the name node as part of checking the public-ness for localizing a resource that belongs in the public cache. We see 7 getFileStatus calls made for each of these resources. We should look into reducing the number of calls to the name node. One example: {noformat} 2014-02-27 18:07:27,351 INFO audit: ... cmd=getfileinfo src=/tmp/temp-887708724/tmp883330348/foo-0.0.44.jar ... 2014-02-27 18:07:27,352 INFO audit: ... cmd=getfileinfo src=/tmp/temp-887708724/tmp883330348/foo-0.0.44.jar ... 2014-02-27 18:07:27,352 INFO audit: ... cmd=getfileinfo src=/tmp/temp-887708724/tmp883330348 ... 2014-02-27 18:07:27,353 INFO audit: ... cmd=getfileinfo src=/tmp/temp-887708724 ... 2014-02-27 18:07:27,353 INFO audit: ... cmd=getfileinfo src=/tmp ... 2014-02-27 18:07:27,354 INFO audit: ... cmd=getfileinfo src=/... 2014-02-27 18:07:27,354 INFO audit: ... cmd=getfileinfo src=/tmp/temp-887708724/tmp883330348/foo-0.0.44.jar ... 2014-02-27 18:07:27,355 INFO audit: ... cmd=open src=/tmp/temp-887708724/tmp883330348/foo-0.0.44.jar ... {noformat} -- This message was sent by Atlassian JIRA (v6.2#6252)
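A hypothetical shape for the proposed call (names are illustrative; Java cannot overload getFileStatus by return type alone, so a distinct method name is used here):
{code}
import java.io.IOException;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.Path;

public interface BatchStat {
  /**
   * Return a FileStatus for every component of f, root first, in a single
   * NameNode RPC -- replacing the one-call-per-ancestor walk that produces
   * the audit trail shown above.
   */
  FileStatus[] getFileStatusAlongPath(Path f) throws IOException;
}
{code}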
[jira] [Commented] (YARN-1729) TimelineWebServices always passes primary and secondary filters as strings
[ https://issues.apache.org/jira/browse/YARN-1729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13918611#comment-13918611 ] Hadoop QA commented on YARN-1729: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12632356/YARN-1729.6.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 3 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:red}-1 release audit{color}. The applied patch generated 1 release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/3230//testReport/ Release audit warnings: https://builds.apache.org/job/PreCommit-YARN-Build/3230//artifact/trunk/patchprocess/patchReleaseAuditProblems.txt Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3230//console This message is automatically generated. TimelineWebServices always passes primary and secondary filters as strings -- Key: YARN-1729 URL: https://issues.apache.org/jira/browse/YARN-1729 Project: Hadoop YARN Issue Type: Sub-task Reporter: Billie Rinaldi Assignee: Billie Rinaldi Attachments: YARN-1729.1.patch, YARN-1729.2.patch, YARN-1729.3.patch, YARN-1729.4.patch, YARN-1729.5.patch, YARN-1729.6.patch Primary filters and secondary filter values can be arbitrary json-compatible Object. The web services should determine if the filters specified as query parameters are objects or strings before passing them to the store. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1771) many getFileStatus calls made from node manager for localizing a public distributed cache resource
[ https://issues.apache.org/jira/browse/YARN-1771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13918625#comment-13918625 ] Chris Douglas commented on YARN-1771: - bq. Orthogonal to this, we have been discussing adding a FileStatus[] getFileStatus(Path f) API that returns a FileStatus for each path component of f in a single RPC. Symlinks might be awkward to support, but that discussion is for a separate ticket. Do you have a JIRA ref? bq. So I think we need some kind of access check, either running as the requesting user or keeping explicit access checks as it does today, to avoid a malicious client obtaining access to private files via the NM. An HDFS nobody account? A cache would probably be correct in almost all cases, though. Since the check is only performed when the resource is localized, there could be cases where the filesystem is never in the cached state, but those are rare (and as Sandy points out, already in the current design). To attack the cache, the writer would need to take an unprotected directory, change its permissions, then populate it with private data (whose attributes are guessable). Expiring after short intervals and not populating the cache with failed localization attempts could help mitigate the attack's effectiveness. many getFileStatus calls made from node manager for localizing a public distributed cache resource -- Key: YARN-1771 URL: https://issues.apache.org/jira/browse/YARN-1771 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: Sangjin Lee Assignee: Sangjin Lee Priority: Critical We're observing that the getFileStatus calls are putting a fair amount of load on the name node as part of checking the public-ness for localizing a resource that belongs in the public cache. We see 7 getFileStatus calls made for each of these resources. We should look into reducing the number of calls to the name node. One example: {noformat} 2014-02-27 18:07:27,351 INFO audit: ... cmd=getfileinfo src=/tmp/temp-887708724/tmp883330348/foo-0.0.44.jar ... 2014-02-27 18:07:27,352 INFO audit: ... cmd=getfileinfo src=/tmp/temp-887708724/tmp883330348/foo-0.0.44.jar ... 2014-02-27 18:07:27,352 INFO audit: ... cmd=getfileinfo src=/tmp/temp-887708724/tmp883330348 ... 2014-02-27 18:07:27,353 INFO audit: ... cmd=getfileinfo src=/tmp/temp-887708724 ... 2014-02-27 18:07:27,353 INFO audit: ... cmd=getfileinfo src=/tmp ... 2014-02-27 18:07:27,354 INFO audit: ... cmd=getfileinfo src=/... 2014-02-27 18:07:27,354 INFO audit: ... cmd=getfileinfo src=/tmp/temp-887708724/tmp883330348/foo-0.0.44.jar ... 2014-02-27 18:07:27,355 INFO audit: ... cmd=open src=/tmp/temp-887708724/tmp883330348/foo-0.0.44.jar ... {noformat} -- This message was sent by Atlassian JIRA (v6.2#6252)
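A minimal sketch of the mitigations Chris describes - short expiry, and never caching failed checks (a hypothetical helper, not existing NM code):
{code}
import java.util.concurrent.ConcurrentHashMap;

public class PublicnessCache {
  private static final long TTL_MS = 30_000; // expire after short intervals

  private static final class Entry {
    final long expiresAt;
    Entry(long expiresAt) { this.expiresAt = expiresAt; }
  }

  private final ConcurrentHashMap<String, Entry> known =
      new ConcurrentHashMap<String, Entry>();

  /** @return true only if the path was recently verified as public. */
  public boolean isKnownPublic(String path) {
    Entry e = known.get(path);
    if (e == null) {
      return false;
    }
    if (System.currentTimeMillis() > e.expiresAt) {
      known.remove(path); // stale; force a fresh NameNode check
      return false;
    }
    return true;
  }

  /** Only successful checks are cached; failures always re-check. */
  public void markPublic(String path) {
    known.put(path, new Entry(System.currentTimeMillis() + TTL_MS));
  }
}
{code}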
[jira] [Updated] (YARN-1751) Improve MiniYarnCluster and LogCLIHelpers for log aggregation testing
[ https://issues.apache.org/jira/browse/YARN-1751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ming Ma updated YARN-1751: -- Attachment: YARN-1751.patch Here is the patch. Improve MiniYarnCluster and LogCLIHelpers for log aggregation testing - Key: YARN-1751 URL: https://issues.apache.org/jira/browse/YARN-1751 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Reporter: Ming Ma Attachments: YARN-1751.patch MiniYarnCluster specifies individual remote log aggregation root dir for each NM. Test code that uses MiniYarnCluster won't be able to get the value of log aggregation root dir. The following code isn't necessary in MiniYarnCluster. File remoteLogDir = new File(testWorkDir, MiniYARNCluster.this.getName() + "-remoteLogDir-nm-" + index); remoteLogDir.mkdir(); config.set(YarnConfiguration.NM_REMOTE_APP_LOG_DIR, remoteLogDir.getAbsolutePath()); In LogCLIHelpers.java, dumpAllContainersLogs should pass its conf object to FileContext.getFileContext() call. -- This message was sent by Atlassian JIRA (v6.2#6252)
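On the LogCLIHelpers point: the no-arg FileContext.getFileContext() resolves the default filesystem from a fresh Configuration, which is why passing the conf through matters. A minimal sketch of the suggested change (names are assumptions):
{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileContext;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.UnsupportedFileSystemException;

public class RemoteLogContext {
  // Resolve the remote app-log dir against the caller's Configuration so a
  // non-default fs.defaultFS (e.g. a MiniYarnCluster's HDFS) is honored.
  static FileContext get(Configuration conf, Path remoteRootLogDir)
      throws UnsupportedFileSystemException {
    return FileContext.getFileContext(remoteRootLogDir.toUri(), conf);
  }
}
{code}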
[jira] [Updated] (YARN-1751) Improve MiniYarnCluster and LogCLIHelpers for log aggregation testing
[ https://issues.apache.org/jira/browse/YARN-1751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ming Ma updated YARN-1751: -- Attachment: (was: YARN-1751.patch) Improve MiniYarnCluster and LogCLIHelpers for log aggregation testing - Key: YARN-1751 URL: https://issues.apache.org/jira/browse/YARN-1751 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Reporter: Ming Ma Attachments: YARN-1751-trunk.patch MiniYarnCluster specifies individual remote log aggregation root dir for each NM. Test code that uses MiniYarnCluster won't be able to get the value of log aggregation root dir. The following code isn't necessary in MiniYarnCluster. File remoteLogDir = new File(testWorkDir, MiniYARNCluster.this.getName() + "-remoteLogDir-nm-" + index); remoteLogDir.mkdir(); config.set(YarnConfiguration.NM_REMOTE_APP_LOG_DIR, remoteLogDir.getAbsolutePath()); In LogCLIHelpers.java, dumpAllContainersLogs should pass its conf object to FileContext.getFileContext() call. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1771) many getFileStatus calls made from node manager for localizing a public distributed cache resource
[ https://issues.apache.org/jira/browse/YARN-1771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13918649#comment-13918649 ] Jason Lowe commented on YARN-1771: -- Agreed, a nobody account would make the check similarly cheap. I also like the idea of caching these a bit more rather than pinging the namenode each time a new container arrives with an existing resource requested. That latter idea is similar to what Koji was asking for way back in MAPREDUCE-2011. many getFileStatus calls made from node manager for localizing a public distributed cache resource -- Key: YARN-1771 URL: https://issues.apache.org/jira/browse/YARN-1771 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: Sangjin Lee Assignee: Sangjin Lee Priority: Critical We're observing that the getFileStatus calls are putting a fair amount of load on the name node as part of checking the public-ness for localizing a resource that belongs in the public cache. We see 7 getFileStatus calls made for each of these resources. We should look into reducing the number of calls to the name node. One example: {noformat} 2014-02-27 18:07:27,351 INFO audit: ... cmd=getfileinfo src=/tmp/temp-887708724/tmp883330348/foo-0.0.44.jar ... 2014-02-27 18:07:27,352 INFO audit: ... cmd=getfileinfo src=/tmp/temp-887708724/tmp883330348/foo-0.0.44.jar ... 2014-02-27 18:07:27,352 INFO audit: ... cmd=getfileinfo src=/tmp/temp-887708724/tmp883330348 ... 2014-02-27 18:07:27,353 INFO audit: ... cmd=getfileinfo src=/tmp/temp-887708724 ... 2014-02-27 18:07:27,353 INFO audit: ... cmd=getfileinfo src=/tmp ... 2014-02-27 18:07:27,354 INFO audit: ... cmd=getfileinfo src=/... 2014-02-27 18:07:27,354 INFO audit: ... cmd=getfileinfo src=/tmp/temp-887708724/tmp883330348/foo-0.0.44.jar ... 2014-02-27 18:07:27,355 INFO audit: ... cmd=open src=/tmp/temp-887708724/tmp883330348/foo-0.0.44.jar ... {noformat} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1771) many getFileStatus calls made from node manager for localizing a public distributed cache resource
[ https://issues.apache.org/jira/browse/YARN-1771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13918661#comment-13918661 ] Gera Shegalov commented on YARN-1771: - bq. Symlinks might be awkward to support, but that discussion is for a separate ticket. Do you have a JIRA ref? Now I do: HDFS-6045 many getFileStatus calls made from node manager for localizing a public distributed cache resource -- Key: YARN-1771 URL: https://issues.apache.org/jira/browse/YARN-1771 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: Sangjin Lee Assignee: Sangjin Lee Priority: Critical We're observing that the getFileStatus calls are putting a fair amount of load on the name node as part of checking the public-ness for localizing a resource that belongs in the public cache. We see 7 getFileStatus calls made for each of these resources. We should look into reducing the number of calls to the name node. One example: {noformat} 2014-02-27 18:07:27,351 INFO audit: ... cmd=getfileinfo src=/tmp/temp-887708724/tmp883330348/foo-0.0.44.jar ... 2014-02-27 18:07:27,352 INFO audit: ... cmd=getfileinfo src=/tmp/temp-887708724/tmp883330348/foo-0.0.44.jar ... 2014-02-27 18:07:27,352 INFO audit: ... cmd=getfileinfo src=/tmp/temp-887708724/tmp883330348 ... 2014-02-27 18:07:27,353 INFO audit: ... cmd=getfileinfo src=/tmp/temp-887708724 ... 2014-02-27 18:07:27,353 INFO audit: ... cmd=getfileinfo src=/tmp ... 2014-02-27 18:07:27,354 INFO audit: ... cmd=getfileinfo src=/... 2014-02-27 18:07:27,354 INFO audit: ... cmd=getfileinfo src=/tmp/temp-887708724/tmp883330348/foo-0.0.44.jar ... 2014-02-27 18:07:27,355 INFO audit: ... cmd=open src=/tmp/temp-887708724/tmp883330348/foo-0.0.44.jar ... {noformat} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1751) Improve MiniYarnCluster and LogCLIHelpers for log aggregation testing
[ https://issues.apache.org/jira/browse/YARN-1751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13918674#comment-13918674 ] Hadoop QA commented on YARN-1751: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12632365/YARN-1751-trunk.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:red}-1 release audit{color}. The applied patch generated 1 release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-tests. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/3231//testReport/ Release audit warnings: https://builds.apache.org/job/PreCommit-YARN-Build/3231//artifact/trunk/patchprocess/patchReleaseAuditProblems.txt Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3231//console This message is automatically generated. Improve MiniYarnCluster and LogCLIHelpers for log aggregation testing - Key: YARN-1751 URL: https://issues.apache.org/jira/browse/YARN-1751 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Reporter: Ming Ma Attachments: YARN-1751-trunk.patch MiniYarnCluster specifies individual remote log aggregation root dir for each NM. Test code that uses MiniYarnCluster won't be able to get the value of log aggregation root dir. The following code isn't necessary in MiniYarnCluster. File remoteLogDir = new File(testWorkDir, MiniYARNCluster.this.getName() + "-remoteLogDir-nm-" + index); remoteLogDir.mkdir(); config.set(YarnConfiguration.NM_REMOTE_APP_LOG_DIR, remoteLogDir.getAbsolutePath()); In LogCLIHelpers.java, dumpAllContainersLogs should pass its conf object to FileContext.getFileContext() call. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1748) hadoop-yarn-server-tests packages core-site.xml breaking downstream tests
[ https://issues.apache.org/jira/browse/YARN-1748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kumar Vavilapalli updated YARN-1748: -- Priority: Blocker (was: Major) hadoop-yarn-server-tests packages core-site.xml breaking downstream tests - Key: YARN-1748 URL: https://issues.apache.org/jira/browse/YARN-1748 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.3.0 Reporter: Sravya Tirukkovalur Assignee: Vinod Kumar Vavilapalli Priority: Blocker Attachments: YARN-1748-1.patch, YARN-1748-1.patch Jars should not package config files, as this might come into the classpaths of clients causing the clients to break. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1766) When RM does the initiation, it should use loaded Configuration instead of bootstrap configuration.
[ https://issues.apache.org/jira/browse/YARN-1766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13918691#comment-13918691 ] Xuan Gong commented on YARN-1766: - Created the patch based on the latest trunk code. When RM does the initiation, it should use loaded Configuration instead of bootstrap configuration. --- Key: YARN-1766 URL: https://issues.apache.org/jira/browse/YARN-1766 Project: Hadoop YARN Issue Type: Sub-task Reporter: Xuan Gong Assignee: Xuan Gong Attachments: YARN-1766.1.patch, YARN-1766.2.patch Right now, we have FileSystemBasedConfigurationProvider to let Users upload the configurations into remote File System, and let different RMs share the same configurations. During the initiation, RM will load the configurations from Remote File System. So when RM initiates the services, it should use the loaded Configurations instead of using the bootstrap configurations. -- This message was sent by Atlassian JIRA (v6.2#6252)
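For reference, the provider described here is switched on in yarn-site.xml roughly as follows (a sketch; treat the exact key and class names as assumptions based on this JIRA series):
{code}
<property>
  <name>yarn.resourcemanager.configuration.provider-class</name>
  <value>org.apache.hadoop.yarn.FileSystemBasedConfigurationProvider</value>
</property>
{code}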
[jira] [Updated] (YARN-1766) When RM does the initiation, it should use loaded Configuration instead of bootstrap configuration.
[ https://issues.apache.org/jira/browse/YARN-1766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuan Gong updated YARN-1766: Attachment: YARN-1766.2.patch When RM does the initiation, it should use loaded Configuration instead of bootstrap configuration. --- Key: YARN-1766 URL: https://issues.apache.org/jira/browse/YARN-1766 Project: Hadoop YARN Issue Type: Sub-task Reporter: Xuan Gong Assignee: Xuan Gong Attachments: YARN-1766.1.patch, YARN-1766.2.patch Right now, we have FileSystemBasedConfigurationProvider to let Users upload the configurations into remote File System, and let different RMs share the same configurations. During the initiation, RM will load the configurations from Remote File System. So when RM initiates the services, it should use the loaded Configurations instead of using the bootstrap configurations. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-986) RM DT token service should have service addresses of both RMs
[ https://issues.apache.org/jira/browse/YARN-986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13918700#comment-13918700 ] Vinod Kumar Vavilapalli commented on YARN-986: -- Looking at it now. RM DT token service should have service addresses of both RMs - Key: YARN-986 URL: https://issues.apache.org/jira/browse/YARN-986 Project: Hadoop YARN Issue Type: Sub-task Reporter: Vinod Kumar Vavilapalli Assignee: Karthik Kambatla Priority: Blocker Attachments: yarn-986-1.patch, yarn-986-2.patch, yarn-986-prelim-0.patch Previously: YARN should use cluster-id as token service address This needs to be done to support non-ip based fail over of RM. Once the server sets the token service address to be this generic ClusterId/ServiceId, clients can translate it to appropriate final IP and then be able to select tokens via TokenSelectors. Some workarounds for other related issues were put in place at YARN-945. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1748) hadoop-yarn-server-tests packages core-site.xml breaking downstream tests
[ https://issues.apache.org/jira/browse/YARN-1748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13918702#comment-13918702 ] Sravya Tirukkovalur commented on YARN-1748: --- Great, thanks Vinod! hadoop-yarn-server-tests packages core-site.xml breaking downstream tests - Key: YARN-1748 URL: https://issues.apache.org/jira/browse/YARN-1748 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.3.0 Reporter: Sravya Tirukkovalur Assignee: Sravya Tirukkovalur Priority: Blocker Fix For: 2.4.0 Attachments: YARN-1748-1.patch, YARN-1748-1.patch Jars should not package config files, as this might come into the classpaths of clients causing the clients to break. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1729) TimelineWebServices always passes primary and secondary filters as strings
[ https://issues.apache.org/jira/browse/YARN-1729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13918706#comment-13918706 ] Hadoop QA commented on YARN-1729: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12632368/YARN-1729.7.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 3 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:red}-1 release audit{color}. The applied patch generated 1 release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/3232//testReport/ Release audit warnings: https://builds.apache.org/job/PreCommit-YARN-Build/3232//artifact/trunk/patchprocess/patchReleaseAuditProblems.txt Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3232//console This message is automatically generated. TimelineWebServices always passes primary and secondary filters as strings -- Key: YARN-1729 URL: https://issues.apache.org/jira/browse/YARN-1729 Project: Hadoop YARN Issue Type: Sub-task Reporter: Billie Rinaldi Assignee: Billie Rinaldi Attachments: YARN-1729.1.patch, YARN-1729.2.patch, YARN-1729.3.patch, YARN-1729.4.patch, YARN-1729.5.patch, YARN-1729.6.patch, YARN-1729.7.patch Primary filters and secondary filter values can be arbitrary json-compatible Object. The web services should determine if the filters specified as query parameters are objects or strings before passing them to the store. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1747) Better physical memory monitoring for containers
[ https://issues.apache.org/jira/browse/YARN-1747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13918729#comment-13918729 ] Colin Patrick McCabe commented on YARN-1747: My first thought here is to read /proc/pid/maps and look for the [stack] and [heap] sections, and just count those. There might be something I'm not considering, though. I wonder if there is ever a case where we'd want to charge an application for the page cache its use of a file takes up? Better physical memory monitoring for containers Key: YARN-1747 URL: https://issues.apache.org/jira/browse/YARN-1747 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Affects Versions: 2.3.0 Reporter: Karthik Kambatla YARN currently uses RSS to compute the physical memory being used by a container. This can lead to issues, as noticed in HDFS-5957. -- This message was sent by Atlassian JIRA (v6.2#6252)
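A minimal sketch of Colin's idea (note it sums the mapped extent of the [stack] and [heap] regions; counting truly resident pages would need the Rss: fields of /proc/<pid>/smaps instead):
{code}
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

public class StackHeapSize {
  // Each maps line starts with "start-end" hex addresses; the lines for
  // the process's stack and heap end with "[stack]" / "[heap]".
  static long stackPlusHeapBytes(int pid) throws IOException {
    long total = 0;
    for (String line :
        Files.readAllLines(Paths.get("/proc/" + pid + "/maps"))) {
      if (line.endsWith("[stack]") || line.endsWith("[heap]")) {
        String[] range = line.split("\\s+")[0].split("-");
        total += Long.parseLong(range[1], 16) - Long.parseLong(range[0], 16);
      }
    }
    return total;
  }
}
{code}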
[jira] [Commented] (YARN-1748) hadoop-yarn-server-tests packages core-site.xml breaking downstream tests
[ https://issues.apache.org/jira/browse/YARN-1748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13918745#comment-13918745 ] Hudson commented on YARN-1748: -- SUCCESS: Integrated in Hadoop-trunk-Commit #5257 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/5257/]) YARN-1748. Excluded core-site.xml from hadoop-yarn-server-tests package's jar and thus avoid breaking downstream tests. Contributed by Sravya Tirukkovalur. (vinodkv: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1573795) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-tests/pom.xml hadoop-yarn-server-tests packages core-site.xml breaking downstream tests - Key: YARN-1748 URL: https://issues.apache.org/jira/browse/YARN-1748 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.3.0 Reporter: Sravya Tirukkovalur Assignee: Sravya Tirukkovalur Priority: Blocker Fix For: 2.4.0 Attachments: YARN-1748-1.patch, YARN-1748-1.patch Jars should not package config files, as this might come into the classpaths of clients causing the clients to break. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-986) RM DT token service should have service addresses of both RMs
[ https://issues.apache.org/jira/browse/YARN-986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13918755#comment-13918755 ] Vinod Kumar Vavilapalli commented on YARN-986: -- Some more comments:
- Let's mark YarnConfiguration.getClusterId() as Private
- Can we move the getRMDelegationTokenService() API to ClientRMProxy? (The latter, BTW, is missing the visibility annotations.) That seems like a better place.
- There are some related TODOs in ClientRMProxy.setupTokens() that we put in before. Search for YARN-986. We can fix them here or separately.
- getRMDelegationTokenService() API: Not sure why we are doing {{yarnConf.set(YarnConfiguration.RM_HA_ID, rmId);}}. And like I mentioned before,
{code}
+    services.add(SecurityUtil.buildTokenService(
+        yarnConf.getSocketAddr(YarnConfiguration.RM_ADDRESS,
+            YarnConfiguration.DEFAULT_RM_ADDRESS,
+            YarnConfiguration.DEFAULT_RM_PORT)).toString());
{code}
is looking at RM_ADDRESS instead of HAUtil.addSuffix(YarnConfiguration.RM_ADDRESS, rmId). It should do the latter, no? RM DT token service should have service addresses of both RMs - Key: YARN-986 URL: https://issues.apache.org/jira/browse/YARN-986 Project: Hadoop YARN Issue Type: Sub-task Reporter: Vinod Kumar Vavilapalli Assignee: Karthik Kambatla Priority: Blocker Attachments: yarn-986-1.patch, yarn-986-2.patch, yarn-986-prelim-0.patch Previously: YARN should use cluster-id as token service address This needs to be done to support non-ip based fail over of RM. Once the server sets the token service address to be this generic ClusterId/ServiceId, clients can translate it to appropriate final IP and then be able to select tokens via TokenSelectors. Some workarounds for other related issues were put in place at YARN-945. -- This message was sent by Atlassian JIRA (v6.2#6252)
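For readers following along, the per-RM loop under review amounts to something like the sketch below - one service address per rmId, joined into a single token service string (a paraphrase of the patch under discussion, not committed code):
{code}
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.security.SecurityUtil;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class RMDTServiceSketch {
  static String rmDtService(Configuration conf, List<String> rmIds) {
    StringBuilder service = new StringBuilder();
    for (String rmId : rmIds) {
      YarnConfiguration yarnConf = new YarnConfiguration(conf);
      // getSocketAddr consults RM_HA_ID, so each iteration resolves the
      // rmId-suffixed RM address (the point settled later in this thread).
      yarnConf.set(YarnConfiguration.RM_HA_ID, rmId);
      if (service.length() > 0) {
        service.append(',');
      }
      service.append(SecurityUtil.buildTokenService(
          yarnConf.getSocketAddr(YarnConfiguration.RM_ADDRESS,
              YarnConfiguration.DEFAULT_RM_ADDRESS,
              YarnConfiguration.DEFAULT_RM_PORT)).toString());
    }
    return service.toString();
  }
}
{code}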
[jira] [Commented] (YARN-986) RM DT token service should have service addresses of both RMs
[ https://issues.apache.org/jira/browse/YARN-986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13918757#comment-13918757 ] Vinod Kumar Vavilapalli commented on YARN-986: -- In my earlier review comment, I thought that the MR changes implied other apps need to change too; I was wrong. MR wraps our delegation-token APIs, so it needed to change. RM DT token service should have service addresses of both RMs - Key: YARN-986 URL: https://issues.apache.org/jira/browse/YARN-986 Project: Hadoop YARN Issue Type: Sub-task Reporter: Vinod Kumar Vavilapalli Assignee: Karthik Kambatla Priority: Blocker Attachments: yarn-986-1.patch, yarn-986-2.patch, yarn-986-prelim-0.patch Previously: YARN should use cluster-id as token service address This needs to be done to support non-ip based fail over of RM. Once the server sets the token service address to be this generic ClusterId/ServiceId, clients can translate it to appropriate final IP and then be able to select tokens via TokenSelectors. Some workarounds for other related issues were put in place at YARN-945. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1774) FS: Submitting to non-leaf queue throws NPE
[ https://issues.apache.org/jira/browse/YARN-1774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13918786#comment-13918786 ] Anubhav Dhoot commented on YARN-1774: - The manual test consisted of a) configuring YARN to use the FairScheduler, b) creating a hierarchical queue in fair-scheduler.xml (see the allocation-file sketch below), and c) trying to run a job assigned to a parent queue. Without the fix the ResourceManager would terminate with the exception in the FairScheduler. With the fix the job submission is rejected with an error and the ResourceManager continues running. FS: Submitting to non-leaf queue throws NPE --- Key: YARN-1774 URL: https://issues.apache.org/jira/browse/YARN-1774 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.3.0 Reporter: Anubhav Dhoot Assignee: Anubhav Dhoot Priority: Blocker Attachments: YARN-1774.patch If you create a hierarchy of queues and assign a job to parent queue, FairScheduler quits with a NPE. -- This message was sent by Atlassian JIRA (v6.2#6252)
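Step b) above uses a nested queue in the allocation file, for example (a minimal illustration; the queue names are made up):
{code}
<?xml version="1.0"?>
<allocations>
  <queue name="parent">
    <!-- submitting to "root.parent" (a non-leaf queue) triggered the NPE -->
    <queue name="child">
      <minResources>1024 mb,1 vcores</minResources>
    </queue>
  </queue>
</allocations>
{code}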
[jira] [Commented] (YARN-1766) When RM does the initiation, it should use loaded Configuration instead of bootstrap configuration.
[ https://issues.apache.org/jira/browse/YARN-1766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13918800#comment-13918800 ] Hadoop QA commented on YARN-1766: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12632378/YARN-1766.2.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:red}-1 release audit{color}. The applied patch generated 1 release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.yarn.server.resourcemanager.recovery.TestFSRMStateStore {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/3233//testReport/ Release audit warnings: https://builds.apache.org/job/PreCommit-YARN-Build/3233//artifact/trunk/patchprocess/patchReleaseAuditProblems.txt Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3233//console This message is automatically generated. When RM does the initiation, it should use loaded Configuration instead of bootstrap configuration. --- Key: YARN-1766 URL: https://issues.apache.org/jira/browse/YARN-1766 Project: Hadoop YARN Issue Type: Sub-task Reporter: Xuan Gong Assignee: Xuan Gong Attachments: YARN-1766.1.patch, YARN-1766.2.patch Right now, we have FileSystemBasedConfigurationProvider to let Users upload the configurations into remote File System, and let different RMs share the same configurations. During the initiation, RM will load the configurations from Remote File System. So when RM initiates the services, it should use the loaded Configurations instead of using the bootstrap configurations. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1747) Better physical memory monitoring for containers
[ https://issues.apache.org/jira/browse/YARN-1747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13918802#comment-13918802 ] Adar Dembo commented on YARN-1747: -- If you're willing to use the memory cgroup subsystem, you can get more accurate RSS (i.e. w/o pages from mapped files) in memory.stat. Is that an option? Better physical memory monitoring for containers Key: YARN-1747 URL: https://issues.apache.org/jira/browse/YARN-1747 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Affects Versions: 2.3.0 Reporter: Karthik Kambatla YARN currently uses RSS to compute the physical memory being used by a container. This can lead to issues, as noticed in HDFS-5957. -- This message was sent by Atlassian JIRA (v6.2#6252)
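A minimal sketch of what Adar suggests, reading the anonymous-RSS counter from a cgroup-v1 memory controller (the cgroup directory layout is an assumption; the rss line in memory.stat counts anonymous pages in bytes and excludes page cache):
{code}
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

public class CgroupAnonRss {
  static long anonRssBytes(String containerCgroupDir) throws IOException {
    for (String line : Files.readAllLines(
        Paths.get(containerCgroupDir, "memory.stat"))) {
      String[] kv = line.split(" ");
      if (kv[0].equals("rss")) { // anonymous + swap cache, in bytes
        return Long.parseLong(kv[1]);
      }
    }
    return -1; // counter not found
  }
}
{code}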
[jira] [Commented] (YARN-1774) FS: Submitting to non-leaf queue throws NPE
[ https://issues.apache.org/jira/browse/YARN-1774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13918815#comment-13918815 ] Tsuyoshi OZAWA commented on YARN-1774: -- +1. Confirmed that the problem reproduces, and the patch fixes the NPE. The test failure is obviously unrelated - it says java.lang.UnsupportedOperationException: libhadoop cannot be loaded. We should discuss it on another JIRA. [~sandyr], can you take a look? FS: Submitting to non-leaf queue throws NPE --- Key: YARN-1774 URL: https://issues.apache.org/jira/browse/YARN-1774 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.3.0 Reporter: Anubhav Dhoot Assignee: Anubhav Dhoot Priority: Blocker Attachments: YARN-1774.patch If you create a hierarchy of queues and assign a job to parent queue, FairScheduler quits with a NPE. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-1778) TestFSRMStateStore fails on trunk
Xuan Gong created YARN-1778: --- Summary: TestFSRMStateStore fails on trunk Key: YARN-1778 URL: https://issues.apache.org/jira/browse/YARN-1778 Project: Hadoop YARN Issue Type: Bug Reporter: Xuan Gong -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1778) TestFSRMStateStore fails on trunk
[ https://issues.apache.org/jira/browse/YARN-1778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13918845#comment-13918845 ] Tsuyoshi OZAWA commented on YARN-1778: -- A log of the test failure is available here: https://builds.apache.org/job/PreCommit-YARN-Build/3234//testReport/ TestFSRMStateStore fails on trunk - Key: YARN-1778 URL: https://issues.apache.org/jira/browse/YARN-1778 Project: Hadoop YARN Issue Type: Bug Reporter: Xuan Gong -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1766) When RM does the initiation, it should use loaded Configuration instead of bootstrap configuration.
[ https://issues.apache.org/jira/browse/YARN-1766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13918862#comment-13918862 ] Vinod Kumar Vavilapalli commented on YARN-1766: --- The patch looks fine to me, but I wonder how we missed this before. This seems like a basic thing that our tests should have caught earlier. When RM does the initiation, it should use loaded Configuration instead of bootstrap configuration. --- Key: YARN-1766 URL: https://issues.apache.org/jira/browse/YARN-1766 Project: Hadoop YARN Issue Type: Sub-task Reporter: Xuan Gong Assignee: Xuan Gong Attachments: YARN-1766.1.patch, YARN-1766.2.patch Right now, we have FileSystemBasedConfigurationProvider to let Users upload the configurations into remote File System, and let different RMs share the same configurations. During the initiation, RM will load the configurations from Remote File System. So when RM initiates the services, it should use the loaded Configurations instead of using the bootstrap configurations. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-1779) Handle AMRMTokens across RM failover
Karthik Kambatla created YARN-1779: -- Summary: Handle AMRMTokens across RM failover Key: YARN-1779 URL: https://issues.apache.org/jira/browse/YARN-1779 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.3.0 Reporter: Karthik Kambatla Assignee: Karthik Kambatla Priority: Blocker Verify if AMRMTokens continue to work against RM failover. If not, we will have to do something along the lines of YARN-986. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1761) RMAdminCLI should check whether HA is enabled before executes transitionToActive/transitionToStandby
[ https://issues.apache.org/jira/browse/YARN-1761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13918879#comment-13918879 ] Vinod Kumar Vavilapalli commented on YARN-1761: --- The remote-configuration-provider on the RM is a server-side property. We will not use it to specify client-side configuration. Given that, why do we need to use the config-provider on the client side? RMAdminCLI should check whether HA is enabled before executes transitionToActive/transitionToStandby Key: YARN-1761 URL: https://issues.apache.org/jira/browse/YARN-1761 Project: Hadoop YARN Issue Type: Sub-task Reporter: Xuan Gong Assignee: Xuan Gong Attachments: YARN-1761.1.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1729) TimelineWebServices always passes primary and secondary filters as strings
[ https://issues.apache.org/jira/browse/YARN-1729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13918890#comment-13918890 ] Hudson commented on YARN-1729: -- SUCCESS: Integrated in Hadoop-trunk-Commit #5258 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/5258/]) YARN-1729. Made TimelineWebServices deserialize the string primary- and secondary-filters param into the JSON-compatible object. Contributed by Billie Rinaldi. (zjshen: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1573825) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/applicationhistoryservice/timeline/GenericObjectMapper.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/applicationhistoryservice/timeline/MemoryTimelineStore.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/applicationhistoryservice/webapp/TimelineWebServices.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/test/java/org/apache/hadoop/yarn/server/applicationhistoryservice/timeline/TestGenericObjectMapper.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/test/java/org/apache/hadoop/yarn/server/applicationhistoryservice/timeline/TimelineStoreTestUtils.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/test/java/org/apache/hadoop/yarn/server/applicationhistoryservice/webapp/TestTimelineWebServices.java TimelineWebServices always passes primary and secondary filters as strings -- Key: YARN-1729 URL: https://issues.apache.org/jira/browse/YARN-1729 Project: Hadoop YARN Issue Type: Sub-task Reporter: Billie Rinaldi Assignee: Billie Rinaldi Fix For: 2.4.0 Attachments: YARN-1729.1.patch, YARN-1729.2.patch, YARN-1729.3.patch, YARN-1729.4.patch, YARN-1729.5.patch, YARN-1729.6.patch, YARN-1729.7.patch Primary filters and secondary filter values can be arbitrary json-compatible Object. The web services should determine if the filters specified as query parameters are objects or strings before passing them to the store. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1778) TestFSRMStateStore fails on trunk
[ https://issues.apache.org/jira/browse/YARN-1778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13918899#comment-13918899 ] Tsuyoshi OZAWA commented on YARN-1778: -- The error message reported on HDFS-6048 is exactly the same. TestFSRMStateStore fails on trunk - Key: YARN-1778 URL: https://issues.apache.org/jira/browse/YARN-1778 Project: Hadoop YARN Issue Type: Bug Reporter: Xuan Gong -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-986) RM DT token service should have service addresses of both RMs
[ https://issues.apache.org/jira/browse/YARN-986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla updated YARN-986: -- Attachment: yarn-986-3.patch RM DT token service should have service addresses of both RMs - Key: YARN-986 URL: https://issues.apache.org/jira/browse/YARN-986 Project: Hadoop YARN Issue Type: Sub-task Reporter: Vinod Kumar Vavilapalli Assignee: Karthik Kambatla Priority: Blocker Attachments: yarn-986-1.patch, yarn-986-2.patch, yarn-986-3.patch, yarn-986-prelim-0.patch Previously: YARN should use cluster-id as token service address This needs to be done to support non-ip based fail over of RM. Once the server sets the token service address to be this generic ClusterId/ServiceId, clients can translate it to appropriate final IP and then be able to select tokens via TokenSelectors. Some workarounds for other related issues were put in place at YARN-945. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-986) RM DT token service should have service addresses of both RMs
[ https://issues.apache.org/jira/browse/YARN-986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13918951#comment-13918951 ] Karthik Kambatla commented on YARN-986: --- bq. There are some related TODOs in ClientRMProxy.setupTokens() that we put in before. Search for YARN-986. We can fix them here or separately. Created YARN-1779 to address AMRMTokens. This JIRA is only for RMDTTokens. bq. getRMDelegationTokenService() API: Not sure why we are doing yarnConf.set(YarnConfiguration.RM_HA_ID, rmId); bq. you are only building the service against one address RM_ADDRESS. Discussed with Vinod offline. YarnConfiguration#getSocketAddr already handles the HA case. Updated its javadoc to reflect that. Addressed other comments. Again, verified manually by running Oozie jobs. RM DT token service should have service addresses of both RMs - Key: YARN-986 URL: https://issues.apache.org/jira/browse/YARN-986 Project: Hadoop YARN Issue Type: Sub-task Reporter: Vinod Kumar Vavilapalli Assignee: Karthik Kambatla Priority: Blocker Attachments: yarn-986-1.patch, yarn-986-2.patch, yarn-986-3.patch, yarn-986-prelim-0.patch Previously: YARN should use cluster-id as token service address This needs to be done to support non-ip based fail over of RM. Once the server sets the token service address to be this generic ClusterId/ServiceId, clients can translate it to appropriate final IP and then be able to select tokens via TokenSelectors. Some workarounds for other related issues were put in place at YARN-945. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1768) yarn kill non-existent application is too verbose
[ https://issues.apache.org/jira/browse/YARN-1768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi OZAWA updated YARN-1768: - Attachment: YARN-1768.3.patch Fixed the exit code to return a non-zero value (-1) when the application doesn't exist. yarn kill non-existent application is too verbose - Key: YARN-1768 URL: https://issues.apache.org/jira/browse/YARN-1768 Project: Hadoop YARN Issue Type: Bug Components: client Affects Versions: 2.2.0 Reporter: Hitesh Shah Assignee: Tsuyoshi OZAWA Priority: Minor Attachments: YARN-1768.1.patch, YARN-1768.2.patch, YARN-1768.3.patch Instead of catching ApplicationNotFound and logging a simple app not found message, the whole stack trace is logged. -- This message was sent by Atlassian JIRA (v6.2#6252)
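The shape of the fix, roughly (an illustrative fragment, not the actual ApplicationCLI diff):
{code}
import java.io.IOException;
import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.exceptions.ApplicationNotFoundException;
import org.apache.hadoop.yarn.exceptions.YarnException;

public class KillAppSketch {
  static int kill(YarnClient client, ApplicationId appId)
      throws IOException, YarnException {
    try {
      client.killApplication(appId);
      return 0;
    } catch (ApplicationNotFoundException e) {
      // One short message instead of a stack trace, and a non-zero exit.
      System.err.println("Application with id '" + appId
          + "' doesn't exist in RM.");
      return -1;
    }
  }
}
{code}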
[jira] [Commented] (YARN-1445) Separate FINISHING and FINISHED state in YarnApplicationState
[ https://issues.apache.org/jira/browse/YARN-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13918984#comment-13918984 ] Zhijie Shen commented on YARN-1445: --- I've thought about postponing the unregistration success notification until the application is at FINISHED. It seems impossible, because it will result in the deadlock below:
1. The AM container is waiting for unregistration to finish so it can move on and exit;
2. Unregistration is waiting for the RM to notify success;
3. The RM is waiting for the RMApp to move from FINISHING to FINISHED to return success;
4. The RMApp is waiting for the RMAppAttempt to move from FINISHING to FINISHED;
5. The RMAppAttempt is waiting for the AM container to finish.
Then, if we return a prior state to the client given the internal FINISHING, and still return unregistration success when the RMApp reaches FINISHING, the client will see, for example, RUNNING, while the unregistration is already successful. The inconsistency here may result in a race condition for a process that relies on checking the final state. For example, the MR client will direct the user to the AM if the application is said not to be in a final state. Then, it is possible that the AM is unregistered while the RM tells the client that the application is still running. When the client moves on to contact the AM, the AM has proceeded and exited before being able to respond to the client's request. It seems that we cannot avoid splitting the user-facing state, and FINISHING can map to the period of an application's life cycle from unregistration to process exit. [~jlowe] and [~jianhe], what do you think about it? Separate FINISHING and FINISHED state in YarnApplicationState - Key: YARN-1445 URL: https://issues.apache.org/jira/browse/YARN-1445 Project: Hadoop YARN Issue Type: Bug Reporter: Xuan Gong Assignee: Xuan Gong Attachments: YARN-1445.1.patch, YARN-1445.2.patch, YARN-1445.3.patch, YARN-1445.4.patch, YARN-1445.5.patch, YARN-1445.5.patch, YARN-1445.6.patch Today, we will transmit both RMAppState.FINISHING and RMAppState.FINISHED to YarnApplicationState.FINISHED. -- This message was sent by Atlassian JIRA (v6.2#6252)
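To make the trade-off concrete, the "return a prior state" alternative argued against above would map states roughly as below (an illustrative reduction; the real enums have more values):
{code}
import org.apache.hadoop.yarn.api.records.YarnApplicationState;

public class StateMappingSketch {
  enum InternalState { RUNNING, FINISHING, FINISHED }

  static YarnApplicationState toUserFacing(InternalState s) {
    switch (s) {
      case FINISHING:
        // Reported as RUNNING even though the AM already unregistered --
        // the client-visible inconsistency described in the comment above.
        return YarnApplicationState.RUNNING;
      case FINISHED:
        return YarnApplicationState.FINISHED;
      default:
        return YarnApplicationState.RUNNING;
    }
  }
}
{code}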
[jira] [Created] (YARN-1780) Improve logging in timeline service
Zhijie Shen created YARN-1780: - Summary: Improve logging in timeline service Key: YARN-1780 URL: https://issues.apache.org/jira/browse/YARN-1780 Project: Hadoop YARN Issue Type: Sub-task Reporter: Zhijie Shen Assignee: Zhijie Shen The server side of the timeline service lacks logging information, which makes debugging difficult. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1752) Unexpected Unregistered event at Attempt Launched state
[ https://issues.apache.org/jira/browse/YARN-1752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rohith updated YARN-1752: - Attachment: YARN-1752.4.patch Attaching a patch that fixes the review comments. Please review. Unexpected Unregistered event at Attempt Launched state --- Key: YARN-1752 URL: https://issues.apache.org/jira/browse/YARN-1752 Project: Hadoop YARN Issue Type: Bug Reporter: Jian He Assignee: Rohith Attachments: YARN-1752.1.patch, YARN-1752.2.patch, YARN-1752.3.patch, YARN-1752.4.patch {code} 2014-02-21 14:56:03,453 ERROR org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: Can't handle this event at current state org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: UNREGISTERED at LAUNCHED at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:305) at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:647) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:103) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:733) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:714) at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173) at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106) at java.lang.Thread.run(Thread.java:695) {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
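A minimal, generic sketch of the failure mode in the stack trace above, with simplified stand-ins for the RMAppAttempt state machine (this is not the actual StateMachineFactory API or the fix in the attached patches): the exception fires when an event arrives in a state that has no registered transition, and one common fix pattern is to register an explicit transition for that event in that state:
{code}
import java.util.EnumMap;
import java.util.Map;

public class AttemptStateMachineSketch {
  enum State { LAUNCHED, RUNNING, FINAL_SAVING }
  enum Event { REGISTERED, UNREGISTERED }

  static final Map<State, Map<Event, State>> TABLE = new EnumMap<>(State.class);
  static {
    Map<Event, State> atLaunched = new EnumMap<>(Event.class);
    atLaunched.put(Event.REGISTERED, State.RUNNING);
    // Without this entry, UNREGISTERED at LAUNCHED throws, mirroring
    // the "Invalid event: UNREGISTERED at LAUNCHED" in the log above.
    atLaunched.put(Event.UNREGISTERED, State.FINAL_SAVING);
    TABLE.put(State.LAUNCHED, atLaunched);
  }

  static State doTransition(State current, Event event) {
    Map<Event, State> row = TABLE.get(current);
    if (row == null || !row.containsKey(event)) {
      throw new IllegalStateException("Invalid event: " + event + " at " + current);
    }
    return row.get(event);
  }

  public static void main(String[] args) {
    // Succeeds because the transition is registered; remove the entry
    // above to reproduce the invalid-event failure.
    System.out.println(doTransition(State.LAUNCHED, Event.UNREGISTERED));
  }
}
{code}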
[jira] [Commented] (YARN-1752) Unexpected Unregistered event at Attempt Launched state
[ https://issues.apache.org/jira/browse/YARN-1752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13919019#comment-13919019 ] Hadoop QA commented on YARN-1752: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12632436/YARN-1752.4.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 4 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:red}-1 release audit{color}. The applied patch generated 1 release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.yarn.server.resourcemanager.recovery.TestFSRMStateStore {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/3236//testReport/ Release audit warnings: https://builds.apache.org/job/PreCommit-YARN-Build/3236//artifact/trunk/patchprocess/patchReleaseAuditProblems.txt Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3236//console This message is automatically generated. Unexpected Unregistered event at Attempt Launched state --- Key: YARN-1752 URL: https://issues.apache.org/jira/browse/YARN-1752 Project: Hadoop YARN Issue Type: Bug Reporter: Jian He Assignee: Rohith Attachments: YARN-1752.1.patch, YARN-1752.2.patch, YARN-1752.3.patch, YARN-1752.4.patch {code} 2014-02-21 14:56:03,453 ERROR org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: Can't handle this event at current state org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: UNREGISTERED at LAUNCHED at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:305) at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:647) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:103) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:733) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:714) at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173) at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106) at java.lang.Thread.run(Thread.java:695) {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1766) When RM does the initiation, it should use loaded Configuration instead of bootstrap configuration.
[ https://issues.apache.org/jira/browse/YARN-1766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13919029#comment-13919029 ] Xuan Gong commented on YARN-1766: - Our previous tests missed this. They covered: starting the RM without LocalConfigurationProvider/FSBasedConfigurationProvider, doing refresh* with LocalConfigurationProvider/FSBasedConfigurationProvider, and RM HA with FSBasedConfigurationProvider. But we did not verify whether all RM services get the correct configuration when the RM initializes with FSBasedConfigurationProvider. When RM does the initiation, it should use loaded Configuration instead of bootstrap configuration. --- Key: YARN-1766 URL: https://issues.apache.org/jira/browse/YARN-1766 Project: Hadoop YARN Issue Type: Sub-task Reporter: Xuan Gong Assignee: Xuan Gong Attachments: YARN-1766.1.patch, YARN-1766.2.patch Right now, we have FileSystemBasedConfigurationProvider to let users upload the configurations into a remote file system, and let different RMs share the same configurations. During initialization, the RM will load the configurations from the remote file system. So when the RM initializes its services, it should use the loaded configurations instead of the bootstrap configurations. -- This message was sent by Atlassian JIRA (v6.2#6252)
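A minimal sketch of the ordering this JIRA asks for, using a hypothetical provider interface and a plain Map in place of Hadoop's Configuration (none of these names are the real RM code): load the authoritative configuration through the provider first, then initialize services against the loaded configuration rather than the bootstrap one:
{code}
import java.util.HashMap;
import java.util.Map;

public class RmInitOrderSketch {
  // Hypothetical provider; the FS-based variant would read core-site
  // and yarn-site from a remote file system.
  interface ConfigurationProvider {
    Map<String, String> load(Map<String, String> bootstrap);
  }

  static Map<String, String> serviceInit(Map<String, String> bootstrapConf,
                                         ConfigurationProvider provider) {
    // 1. Load the authoritative configuration first...
    Map<String, String> loadedConf = provider.load(bootstrapConf);
    // 2. ...then initialize every RM service against loadedConf, not
    //    bootstrapConf, so remotely stored overrides take effect.
    initServices(loadedConf);
    return loadedConf;
  }

  static void initServices(Map<String, String> conf) {
    System.out.println("initializing services with " + conf);
  }

  public static void main(String[] args) {
    Map<String, String> bootstrap = new HashMap<>();
    bootstrap.put("yarn.scheduler.maximum-allocation-mb", "8192");
    // A provider that overrides one value, as a remote copy might.
    ConfigurationProvider provider = b -> {
      Map<String, String> loaded = new HashMap<>(b);
      loaded.put("yarn.scheduler.maximum-allocation-mb", "16384");
      return loaded;
    };
    serviceInit(bootstrap, provider);
  }
}
{code}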
[jira] [Commented] (YARN-1734) RM should get the updated Configurations when it transits from Standby to Active
[ https://issues.apache.org/jira/browse/YARN-1734?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13919031#comment-13919031 ] Karthik Kambatla commented on YARN-1734: Sorry for all the confusion caused here - I forgot that the rmadmin command also uses ConfiguredRMFailoverProxyProvider. Tested on a cluster with local configurations. It behaves as expected: refresh* refreshes the Active, and the Standby refreshes everything on transition to Active. Thanks [~xgong] for fixing the refresh commands, and for being patient with my questions/concerns. RM should get the updated Configurations when it transits from Standby to Active Key: YARN-1734 URL: https://issues.apache.org/jira/browse/YARN-1734 Project: Hadoop YARN Issue Type: Sub-task Reporter: Xuan Gong Assignee: Xuan Gong Priority: Critical Fix For: 2.4.0 Attachments: YARN-1734.1.patch, YARN-1734.2.patch, YARN-1734.3.patch, YARN-1734.4.patch, YARN-1734.5.patch, YARN-1734.6.patch, YARN-1734.7.patch Currently, we have ConfigurationProvider, which can support LocalConfiguration and FileSystemBasedConfiguration. When HA is enabled and FileSystemBasedConfiguration is enabled, the RM cannot get the updated configurations when it transitions from Standby to Active. -- This message was sent by Atlassian JIRA (v6.2#6252)
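A minimal sketch of the behavior described above, with stand-in method names rather than the real AdminService API: the Active RM applies refresh* commands as they arrive, while a Standby simply reloads everything when it transitions to Active:
{code}
public class FailoverRefreshSketch {
  private boolean active = false;

  void transitionToActive() {
    // Reload all configuration so changes made while this RM was
    // standby (and therefore skipped) take effect now.
    refreshAll();
    active = true;
  }

  void refreshQueues() {
    if (!active) {
      // Standby defers; transitionToActive() will reload everything.
      return;
    }
    System.out.println("re-reading scheduler configuration");
  }

  void refreshAll() {
    System.out.println("re-reading all configuration files");
  }

  public static void main(String[] args) {
    FailoverRefreshSketch rm = new FailoverRefreshSketch();
    rm.refreshQueues();      // no-op while standby
    rm.transitionToActive(); // reloads everything
    rm.refreshQueues();      // now applied directly
  }
}
{code}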
[jira] [Commented] (YARN-986) RM DT token service should have service addresses of both RMs
[ https://issues.apache.org/jira/browse/YARN-986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13919038#comment-13919038 ] Hadoop QA commented on YARN-986: {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12632426/yarn-986-3.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 3 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:red}-1 release audit{color}. The applied patch generated 1 release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.mapreduce.v2.TestNonExistentJob org.apache.hadoop.mapred.TestSpecialCharactersInOutputPath org.apache.hadoop.mapred.TestClusterMapReduceTestCase org.apache.hadoop.mapred.TestJobName org.apache.hadoop.mapreduce.v2.TestMiniMRProxyUser org.apache.hadoop.fs.TestDFSIO org.apache.hadoop.mapreduce.v2.TestUberAM org.apache.hadoop.mapreduce.TestMRJobClient org.apache.hadoop.mapred.TestMerge org.apache.hadoop.mapred.TestReduceFetch org.apache.hadoop.mapred.TestLazyOutput org.apache.hadoop.mapred.TestReduceFetchFromPartialMem org.apache.hadoop.mapreduce.v2.TestMRJobs org.apache.hadoop.mapred.TestMRCJCFileInputFormat org.apache.hadoop.mapred.TestMiniMRWithDFSWithDistinctUsers org.apache.hadoop.mapred.TestJobSysDirWithDFS org.apache.hadoop.mapreduce.security.TestMRCredentials org.apache.hadoop.mapreduce.TestMapReduceLazyOutput org.apache.hadoop.mapreduce.lib.join.TestJoinProperties org.apache.hadoop.ipc.TestMRCJCSocketFactory org.apache.hadoop.mapred.TestMiniMRClasspath org.apache.hadoop.mapreduce.security.ssl.TestEncryptedShuffle org.apache.hadoop.conf.TestNoDefaultsJobConf org.apache.hadoop.mapred.TestMiniMRChildTask org.apache.hadoop.mapreduce.lib.input.TestDelegatingInputFormat org.apache.hadoop.mapred.join.TestDatamerge org.apache.hadoop.mapred.lib.TestDelegatingInputFormat org.apache.hadoop.fs.TestFileSystem org.apache.hadoop.mapreduce.lib.join.TestJoinDatamerge org.apache.hadoop.mapreduce.security.TestBinaryTokenFile org.apache.hadoop.mapreduce.lib.input.TestCombineFileInputFormat org.apache.hadoop.yarn.server.resourcemanager.recovery.TestFSRMStateStore {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/3237//testReport/ Release audit warnings: https://builds.apache.org/job/PreCommit-YARN-Build/3237//artifact/trunk/patchprocess/patchReleaseAuditProblems.txt Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3237//console This message is automatically generated. 
RM DT token service should have service addresses of both RMs - Key: YARN-986 URL: https://issues.apache.org/jira/browse/YARN-986 Project: Hadoop YARN Issue Type: Sub-task Reporter: Vinod Kumar Vavilapalli Assignee: Karthik Kambatla Priority: Blocker Attachments: yarn-986-1.patch, yarn-986-2.patch, yarn-986-3.patch, yarn-986-prelim-0.patch Previously: YARN should use cluster-id as token service address This needs to be done to support non-IP-based failover of the RM. Once the server sets the token service address to be this generic ClusterId/ServiceId, clients can translate it to the appropriate final IP and then select tokens via TokenSelectors. Some workarounds for other related issues were put in place at YARN-945. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1408) Preemption caused Invalid State Event: ACQUIRED at KILLED and caused a task timeout for 30mins
[ https://issues.apache.org/jira/browse/YARN-1408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13919086#comment-13919086 ] Hadoop QA commented on YARN-1408: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12629000/Yarn-1408.4.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/3238//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3238//console This message is automatically generated. Preemption caused Invalid State Event: ACQUIRED at KILLED and caused a task timeout for 30mins -- Key: YARN-1408 URL: https://issues.apache.org/jira/browse/YARN-1408 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.2.0 Reporter: Sunil G Fix For: 2.4.0 Attachments: Yarn-1408.1.patch, Yarn-1408.2.patch, Yarn-1408.3.patch, Yarn-1408.4.patch, Yarn-1408.patch Capacity preemption is enabled as follows. * yarn.resourcemanager.scheduler.monitor.enable=true, * yarn.resourcemanager.scheduler.monitor.policies=org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.ProportionalCapacityPreemptionPolicy Queue = a,b Capacity of Queue A = 80% Capacity of Queue B = 20% Step 1: Submit a big jobA to queue a which uses the full cluster capacity Step 2: Submit a jobB to queue b which would use less than 20% of the cluster capacity The jobA task which uses queue b's capacity is preempted and killed. This caused the following problem: 1. A new container was allocated for jobA in queue a upon a node update from an NM. 2. This container was immediately preempted. The ACQUIRED at KILLED invalid state exception occurred when the next AM heartbeat reached the RM. ERROR org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: Can't handle this event at current state org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: ACQUIRED at KILLED This also caused the task to time out after 30 minutes, as this container was already killed by preemption. attempt_1380289782418_0003_m_00_0 Timed out after 1800 secs -- This message was sent by Atlassian JIRA (v6.2#6252)
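A minimal sketch of the race this issue describes, using simplified stand-ins for the RMContainerImpl states and events (this is not the actual fix in the attached patches): preemption kills the container before the AM heartbeat acquires it, so ACQUIRED arrives at KILLED; one common fix pattern is to treat late events at a terminal state as ignorable instead of throwing:
{code}
public class LateEventSketch {
  enum ContainerState { ALLOCATED, ACQUIRED, KILLED }
  enum ContainerEvent { ACQUIRE, KILL }

  static ContainerState handle(ContainerState current, ContainerEvent event) {
    if (current == ContainerState.KILLED) {
      // Terminal state: an ACQUIRE that raced with preemption is
      // ignored, so the AM learns the container is gone instead of
      // the RM throwing an invalid-state-transition exception.
      return current;
    }
    switch (event) {
      case ACQUIRE:
        return ContainerState.ACQUIRED;
      case KILL:
        return ContainerState.KILLED;
      default:
        throw new IllegalStateException("Invalid event: " + event + " at " + current);
    }
  }

  public static void main(String[] args) {
    ContainerState s = ContainerState.ALLOCATED;
    s = handle(s, ContainerEvent.KILL);    // preempted before acquisition
    s = handle(s, ContainerEvent.ACQUIRE); // late AM heartbeat: ignored
    System.out.println(s);                 // prints KILLED
  }
}
{code}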