[jira] [Resolved] (YARN-10290) Resourcemanager recover failed when fair scheduler queue acl changed

2020-06-01 Thread Wilfred Spiegelenburg (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10290?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wilfred Spiegelenburg resolved YARN-10290.
--
Resolution: Duplicate

This issue is fixed in YARN-7913.
That change fixes a number of issues around restores that fail.

The change was not backported to Hadoop 2.x
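For illustration only, this is the kind of defensive guard such a restore fix needs in the fair scheduler: tolerate an application that was rejected earlier in the recovery instead of dereferencing it. A hypothetical sketch (not the actual YARN-7913 patch), matching the addApplicationAttempt() frame in the stack trace quoted below:

{code:java}
// Hypothetical sketch of a guard in FairScheduler.addApplicationAttempt() during
// recovery; the real YARN-7913 change is broader than this.
SchedulerApplication<FSAppAttempt> application =
    applications.get(applicationAttemptId.getApplicationId());
if (application == null) {
  LOG.warn("Skipping attempt " + applicationAttemptId
      + ": the application was not added (likely rejected by a queue ACL during recovery)");
  return;
}
{code}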

> Resourcemanager recover failed when fair scheduler queue acl changed
> 
>
> Key: YARN-10290
> URL: https://issues.apache.org/jira/browse/YARN-10290
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.7.2
>Reporter: yehuanhuan
>Priority: Blocker
>
> ResourceManager recovery fails when a fair scheduler queue ACL has changed. Because 
> the queue ACL changed, the application is rejected on recovery (addApplication() in 
> FairScheduler). Recovering the applicationAttempt 
> (addApplicationAttempt() in FairScheduler) then finds a null application. This 
> leaves both RMs stuck in standby. Steps to reproduce:
>  
> # Run a long-running application as a user.
> # Change the queue ACL (aclSubmitApps) so that the user no longer has permission.
> # Restart the RM.
> {code:java}
> 2020-05-25 16:04:06,191 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Updating 
> application application_1590393162216_0005 with final state: FAILED
> 2020-05-25 16:04:06,192 ERROR 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Failed to 
> load/recover state
> java.lang.NullPointerException
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.addApplicationAttempt(FairScheduler.java:663)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1246)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:116)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl$AttemptRecoveredTransition.transition(RMAppAttemptImpl.java:1072)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl$AttemptRecoveredTransition.transition(RMAppAttemptImpl.java:1036)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:789)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:105)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.recoverAppAttempts(RMAppImpl.java:845)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.access$1900(RMAppImpl.java:102)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl$RMAppRecoveredTransition.transition(RMAppImpl.java:897)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl$RMAppRecoveredTransition.transition(RMAppImpl.java:850)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.handle(RMAppImpl.java:723)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:322)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recover(RMAppManager.java:427)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.recover(ResourceManager.java:1173)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:584)
> at 
> org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:980)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1021)
> at 
> 

[jira] [Resolved] (YARN-10063) Usage output of container-executor binary needs to include --http/--https argument

2020-04-07 Thread Wilfred Spiegelenburg (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wilfred Spiegelenburg resolved YARN-10063.
--
Fix Version/s: 3.4.0
   3.3.0
   Resolution: Fixed

Committed to 3.3.

Removed the backport to the 3.2 branch: YARN-6586 was not backported to 3.2.

> Usage output of container-executor binary needs to include --http/--https 
> argument
> --
>
> Key: YARN-10063
> URL: https://issues.apache.org/jira/browse/YARN-10063
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Siddharth Ahuja
>Assignee: Siddharth Ahuja
>Priority: Minor
> Fix For: 3.3.0, 3.4.0
>
> Attachments: YARN-10063.001.patch, YARN-10063.002.patch, 
> YARN-10063.003.patch, YARN-10063.004.patch
>
>
> YARN-8448/YARN-6586 introduced new options - "\--http" 
> (the default) and "\--https" - that can be passed to the 
> container-executor binary, see:
> https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/impl/main.c#L564
> and 
> https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/impl/main.c#L521
> however, the usage output seems to have missed this:
> https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/impl/main.c#L74
> Raising this jira to improve this.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Resolved] (YARN-9940) avoid continuous scheduling thread crashes while sorting nodes get 'Comparison method violates its general contract'

2020-03-17 Thread Wilfred Spiegelenburg (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wilfred Spiegelenburg resolved YARN-9940.
-
Resolution: Not A Problem

This issue is fixed in later versions via YARN-8373. It does not exist in the 
version this issue is logged against.

The custom code that caused the issue to show up is a mix of Hadoop 2.7 and 
Hadoop 2.9.

> avoid continuous scheduling thread crashes while sorting nodes get 
> 'Comparison method violates its general contract'
> 
>
> Key: YARN-9940
> URL: https://issues.apache.org/jira/browse/YARN-9940
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Affects Versions: 2.7.2
>Reporter: kailiu_dev
>Assignee: kailiu_dev
>Priority: Major
> Attachments: YARN-9940-branch-2.7.2.001.patch
>
>
> 2019-10-16 09:14:51,215 ERROR 
> org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread 
> Thread[FairSchedulerContinuousScheduling,5,main] threw an Exception.
> java.lang.IllegalArgumentException: Comparison method violates its general 
> contract!
>     at java.util.TimSort.mergeHi(TimSort.java:868)
>     at java.util.TimSort.mergeAt(TimSort.java:485)
>     at java.util.TimSort.mergeForceCollapse(TimSort.java:426)
>     at java.util.TimSort.sort(TimSort.java:223)
>     at java.util.TimSort.sort(TimSort.java:173)
>     at java.util.Arrays.sort(Arrays.java:659)
>     at java.util.Collections.sort(Collections.java:217)
>     at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.continuousSchedulingAttempt(FairScheduler.java:1117)
>     at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$ContinuousSchedulingThread.run(FairScheduler.java:296)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Resolved] (YARN-10182) SLS run fails with error: Couldn't create /yarn-leader-election/yarnRM

2020-03-05 Thread Wilfred Spiegelenburg (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wilfred Spiegelenburg resolved YARN-10182.
--
Resolution: Fixed

JIRA is not the correct place to ask questions on how to set up or run components. 
Please use the mailing lists if you need help: u...@hadoop.apache.org

> SLS run fails with error: Couldn't create /yarn-leader-election/yarnRM
> ---
>
> Key: YARN-10182
> URL: https://issues.apache.org/jira/browse/YARN-10182
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
> Environment: Cloudera Express 6.0.0
> RM1: active, RM2: standby
> Kerberos is on
> yarn-site.xml: /etc/hadoop/conf.cloudera.yarn/yarn-site.xml
> keytab: /etc/krb5.keytab ===> the keytab of yarn
> When I run slsrun.sh, I get an error:
> Exception in thread "main" org.apache.hadoop.service.ServiceStateException: 
> java.io.IOException: Couldn't create /yarn-leader-election/yarnRM
> If I use sample-conf/yarn-site.xml, I get "KerberosAuthException: Login 
> failure for user: yarn from keytab /etc/krb5.keytab 
> javax.security.auth.login.LoginException: Unable to obtain password from user"
> How do I resolve this?
>  
>Reporter: zhangyu
>Priority: Major
> Attachments: slsrun.log.txt
>
>
> RM1: active, RM2: standby
> Kerberos is on
> yarn-site.xml: /etc/hadoop/conf.cloudera.yarn/yarn-site.xml
> keytab: /etc/krb5.keytab ===> the keytab of yarn
> When I run slsrun.sh on RM1, I get an error:
> Exception in thread "main" org.apache.hadoop.service.ServiceStateException: 
> java.io.IOException: Couldn't create /yarn-leader-election/yarnRM
> If I use sample-conf/yarn-site.xml, I get "KerberosAuthException: Login 
> failure for user: yarn from keytab /etc/krb5.keytab 
> javax.security.auth.login.LoginException: Unable to obtain password from user"
> How do I resolve this?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-10087) ATS possible NPE on REST API when data is missing

2020-01-15 Thread Wilfred Spiegelenburg (Jira)
Wilfred Spiegelenburg created YARN-10087:


 Summary: ATS possible NPE on REST API when data is missing
 Key: YARN-10087
 URL: https://issues.apache.org/jira/browse/YARN-10087
 Project: Hadoop YARN
  Issue Type: Bug
  Components: ATSv2
Reporter: Wilfred Spiegelenburg


If the data stored by the ATS is not complete, REST calls to the ATS can return 
an NPE instead of results.

{{{"exception":"NullPointerException","javaClassName":"java.lang.NullPointerException"}}}

The issue shows up when the ATS was down for a short period and new applications 
were started in that time. This causes certain parts of the application 
data to be missing in the ATS store. In most cases this is not a problem and 
data will be returned, but when you start filtering the data the filtering fails, 
throwing the NPE.
 In this case the request was for: 
{{http://:8188/ws/v1/applicationhistory/apps?user=hive'}}

If certain pieces of data are missing, the ATS should not even consider 
returning that data, filtered or not. We should not display partial or 
incomplete data.
 In case of missing user information, ACL checks cannot be performed 
correctly and we could see more issues.

A similar issue was fixed in YARN-7118, where the queue details were missing. It 
just _skips_ the app to prevent the NPE, but that is not the correct approach when 
the user is missing.
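
A minimal sketch of the kind of guard argued for here (hypothetical, not the actual ATS code): treat a record with a missing mandatory field, such as the user, as incomplete and leave it out of the response entirely rather than letting the filter throw the NPE.

{code:java}
// Hypothetical guard before user-based filtering; the ApplicationReport accessors
// are the public YARN API, the surrounding loop is illustrative only.
for (ApplicationReport report : reports) {
  if (report.getUser() == null) {
    LOG.warn("Dropping incomplete ATS record for " + report.getApplicationId()
        + ": user is missing, ACL checks and user filtering cannot be applied");
    continue;
  }
  // ... apply the ?user= filter and ACL checks on complete records only ...
}
{code}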



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Resolved] (YARN-8470) Fair scheduler exception with SLS

2020-01-14 Thread Wilfred Spiegelenburg (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-8470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wilfred Spiegelenburg resolved YARN-8470.
-
  Assignee: Wilfred Spiegelenburg  (was: Szilard Nemeth)
Resolution: Duplicate

> Fair scheduler exception with SLS
> -
>
> Key: YARN-8470
> URL: https://issues.apache.org/jira/browse/YARN-8470
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Miklos Szegedi
>Assignee: Wilfred Spiegelenburg
>Priority: Major
>
> I ran into the following exception with sls:
> 2018-06-26 13:34:04,358 ERROR resourcemanager.ResourceManager: Received 
> RMFatalEvent of type CRITICAL_THREAD_CRASH, caused by a critical thread, 
> FSPreemptionThread, that exited unexpectedly: java.lang.NullPointerException
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSPreemptionThread.identifyContainersToPreemptOnNode(FSPreemptionThread.java:207)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSPreemptionThread.identifyContainersToPreemptForOneContainer(FSPreemptionThread.java:161)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSPreemptionThread.identifyContainersToPreempt(FSPreemptionThread.java:121)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSPreemptionThread.run(FSPreemptionThread.java:81)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Resolved] (YARN-7012) yarn refuses application if a user has no groups associated

2019-12-22 Thread Wilfred Spiegelenburg (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-7012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wilfred Spiegelenburg resolved YARN-7012.
-
  Assignee: Wilfred Spiegelenburg
Resolution: Not A Bug

The group mapping you configure should handle both local and non-local users. A 
user must have at least a primary group; it is needed for ownership of files, 
etc.
The exception you see shows that the user is not considered valid, and this is 
supposed to happen.

> yarn refuses application if a user has no groups associated
> ---
>
> Key: YARN-7012
> URL: https://issues.apache.org/jira/browse/YARN-7012
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.7.3
> Environment: redhat 7.3
>Reporter: Michael Salmon
>Assignee: Wilfred Spiegelenburg
>Priority: Major
>
> When an application is submitted the list of queue mappings is searched for 
> the default queue for the user submitting the application. If there is a 
> queue mapping for a group and the user does not have any groups then an 
> exception is thrown and the application is rejected:
> org.apache.hadoop.yarn.exceptions.YarnException: Failed to submit application 
> application_1499325026025_0002 submitted by user hive reason: No groups found 
> for user hive
> The exception is thrown at UserGroupMappingPlacementRule.java:153, although the lookup is at line 122.
> This problem arises when using local users and AD groups.
> Users without groups should be treated as users with no matching groups.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-10053) Placement rules do not use correct group service init

2019-12-22 Thread Wilfred Spiegelenburg (Jira)
Wilfred Spiegelenburg created YARN-10053:


 Summary: Placement rules do not use correct group service init
 Key: YARN-10053
 URL: https://issues.apache.org/jira/browse/YARN-10053
 Project: Hadoop YARN
  Issue Type: Bug
  Components: yarn
Affects Versions: 3.1.3
Reporter: Wilfred Spiegelenburg
Assignee: Wilfred Spiegelenburg


The placement rules, CS and FS, all create a new group service instead of using 
the shared group mapping service. This means that the cache for the placement 
rules is not the same as the one used for the ACLs and other parts of the RM 
service.

It could also cause an issue with the configuration that is passed in to create 
the cache: the scheduler config might not have the same values as the service 
config and could thus cause issues. This second issue seems to affect only the 
CS, not the FS.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-9993) Remove incorrectly committed files from YARN-9011

2019-11-27 Thread Wilfred Spiegelenburg (Jira)
Wilfred Spiegelenburg created YARN-9993:
---

 Summary: Remove incorrectly committed files from YARN-9011
 Key: YARN-9993
 URL: https://issues.apache.org/jira/browse/YARN-9993
 Project: Hadoop YARN
  Issue Type: Bug
  Components: yarn
Affects Versions: 3.2.2
Reporter: Wilfred Spiegelenburg
Assignee: Wilfred Spiegelenburg


With the checkin of YARN-9011 a number of files were added that should not have 
been in the commit:
[https://github.com/apache/hadoop/tree/branch-3.2/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications]

This causes the ASF license check to fail on the 3.2 branch build.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-9984) FSPreemptionThread crash with NullPointerException

2019-11-17 Thread Wilfred Spiegelenburg (Jira)
Wilfred Spiegelenburg created YARN-9984:
---

 Summary: FSPreemptionThread crash with NullPointerException
 Key: YARN-9984
 URL: https://issues.apache.org/jira/browse/YARN-9984
 Project: Hadoop YARN
  Issue Type: Bug
  Components: fairscheduler
Affects Versions: 3.0.0
Reporter: Wilfred Spiegelenburg
Assignee: Wilfred Spiegelenburg


When an application is unregistered there is a chance that there are still 
containers running on a node for that application. In all cases we handle the 
application missing from the RM gracefully (log a message and continue), except 
for the FS pre-emption thread.

In case the application is removed but some containers are still linked to a 
node, the FSPreemptionThread will crash with an NPE when it tries to retrieve the 
application id for the attempt:
{code:java}
FSAppAttempt app =
scheduler.getSchedulerApp(container.getApplicationAttemptId());
ApplicationId appId = app.getApplicationId();{code}
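
A minimal sketch of the graceful handling described above (hypothetical; the committed fix may differ): log a message and skip the container when the attempt can no longer be resolved.

{code:java}
// Hypothetical guard around the lines quoted above; "continue" assumes this sits
// inside the loop over containers considered for preemption.
FSAppAttempt app =
    scheduler.getSchedulerApp(container.getApplicationAttemptId());
if (app == null) {
  LOG.info("Application for attempt " + container.getApplicationAttemptId()
      + " no longer exists, skipping container " + container.getContainerId());
  continue;
}
ApplicationId appId = app.getApplicationId();
{code}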



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Resolved] (YARN-9952) Continuous scheduling thread crashes

2019-11-08 Thread Wilfred Spiegelenburg (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wilfred Spiegelenburg resolved YARN-9952.
-
Resolution: Duplicate

> Continuous scheduling thread crashes
> ---
>
> Key: YARN-9952
> URL: https://issues.apache.org/jira/browse/YARN-9952
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Affects Versions: 2.7.2
>Reporter: kailiu_dev
>Priority: Major
>
> 2019-10-16 09:14:51,215 ERROR 
> org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread 
> Thread[FairSchedulerContinuousScheduling,5,main] threw an Exception.
> java.lang.IllegalArgumentException: Comparison method violates its general 
> contract!
>     at java.util.TimSort.mergeHi(TimSort.java:868)
>     at java.util.TimSort.mergeAt(TimSort.java:485)
>     at java.util.TimSort.mergeForceCollapse(TimSort.java:426)
>     at java.util.TimSort.sort(TimSort.java:223)
>     at java.util.TimSort.sort(TimSort.java:173)
>     at java.util.Arrays.sort(Arrays.java:659)
>     at java.util.Collections.sort(Collections.java:217)
>     at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.continuousSchedulingAttempt(FairScheduler.java:1117)
>     at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$ContinuousSchedulingThread.run(FairScheduler.java:296)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Resolved] (YARN-9516) move application between queues,not check target queue acl permission

2019-05-01 Thread Wilfred Spiegelenburg (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wilfred Spiegelenburg resolved YARN-9516.
-
Resolution: Duplicate

This has been fixed in 3.0 by YARN-5554 (MoveApplicationAcrossQueues does not 
check user permission on the target queue).


> move application between queues,not check target queue acl permission
> -
>
> Key: YARN-9516
> URL: https://issues.apache.org/jira/browse/YARN-9516
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler, resourcemanager
>Affects Versions: 2.6.0
>Reporter: powerinf
>Priority: Critical
>
> User test1 can submit an application to queue root.test.test1, but not to 
> queue root.test.test2. When I submit an application to queue root.test.test1 
> as user test1 and try to move the application to root.test.test2, the move 
> succeeds without checking the target queue ACL permission.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-9431) flaky junit test fair.TestAppRunnability after YARN-8967

2019-03-31 Thread Wilfred Spiegelenburg (JIRA)
Wilfred Spiegelenburg created YARN-9431:
---

 Summary: flaky junit test fair.TestAppRunnability after YARN-8967
 Key: YARN-9431
 URL: https://issues.apache.org/jira/browse/YARN-9431
 Project: Hadoop YARN
  Issue Type: Bug
  Components: fairscheduler, test
Affects Versions: 3.3.0
Reporter: Wilfred Spiegelenburg
Assignee: Wilfred Spiegelenburg


In YARN-4901 one of the scheduler tests failed. This seems to be linked to the 
changes around the placement rules introduced in YARN-8967.

Applications submitted in the tests are accepted and rejected at the same time:
{code}
2019-04-01 12:00:57,269 INFO  [main] fair.FairScheduler 
(FairScheduler.java:addApplication(540)) - Accepted application 
application_0_0001 from user: user1, in queue: root.user1, currently num of 
applications: 1
2019-04-01 12:00:57,269 INFO  [AsyncDispatcher event handler] 
fair.FairScheduler (FairScheduler.java:rejectApplicationWithMessage(1344)) - 
Reject application application_0_0001 submitted by user user1 application 
rejected by placement rules.
{code}
This should never happen and is most likely due to the way the test generates 
the application and events.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-9417) Implement FS equivalent of AppNameMappingPlacementRule

2019-03-27 Thread Wilfred Spiegelenburg (JIRA)
Wilfred Spiegelenburg created YARN-9417:
---

 Summary: Implement FS equivalent of AppNameMappingPlacementRule
 Key: YARN-9417
 URL: https://issues.apache.org/jira/browse/YARN-9417
 Project: Hadoop YARN
  Issue Type: Improvement
Affects Versions: 3.3.0
Reporter: Wilfred Spiegelenburg
Assignee: Wilfred Spiegelenburg


The AppNameMappingPlacementRule is only available for the CS. We need the same 
kind of rule for the FS.
The rule should use the application name as set in the submission context.

This allows Spark, MR or Tez jobs to be run in their own queues.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-9416) Add filter options to FS placement rules

2019-03-27 Thread Wilfred Spiegelenburg (JIRA)
Wilfred Spiegelenburg created YARN-9416:
---

 Summary: Add filter options to FS placement rules
 Key: YARN-9416
 URL: https://issues.apache.org/jira/browse/YARN-9416
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: fairscheduler
Affects Versions: 3.3.0
Reporter: Wilfred Spiegelenburg
Assignee: Wilfred Spiegelenburg


The placement rules should allow filtering of the groups and/or users that 
match the rule.

In the case of the user rule you might want it to match only if the user is a 
member of a specific group. Another example would be to allow only specific 
users to match the specified rule.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Resolved] (YARN-5387) FairScheduler: add the ability to specify a parent queue to all placement rules

2019-03-27 Thread Wilfred Spiegelenburg (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-5387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wilfred Spiegelenburg resolved YARN-5387.
-
Resolution: Implemented

This has been included as part of the YARN-8967 changes.
Documentation is still outstanding and will be added as part of YARN-9415.

> FairScheduler: add the ability to specify a parent queue to all placement 
> rules
> ---
>
> Key: YARN-5387
> URL: https://issues.apache.org/jira/browse/YARN-5387
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: fairscheduler
>Reporter: Wilfred Spiegelenburg
>Assignee: Wilfred Spiegelenburg
>Priority: Major
>  Labels: supportability
>
> In the current placement policy, all rules generate a queue name under 
> the root. The only exception is the nestedUserQueue rule. This rule allows a 
> queue to be created under a parent queue defined by a second rule.
> Instead of creating new rules to also allow nested groups, secondary groups 
> or nested queues for each new rule we think of, we should generalise this by 
> allowing a parent attribute to be specified in each rule, like the create flag.
> The optional parent attribute for a rule should allow the following values:
> - empty (which is the same as not specifying the attribute)
> - a rule
> - a fixed value (with or without the root prefix)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Resolved] (YARN-2257) Add user to queue mappings to automatically place users' apps into specific queues

2019-03-27 Thread Wilfred Spiegelenburg (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-2257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wilfred Spiegelenburg resolved YARN-2257.
-
Resolution: Duplicate

This has been fixed as part of YARN-8948 and YARN-9298, and finally integrated in 
YARN-8967. Both schedulers use the same placement manager and placement rule 
code. The rules differ between the two schedulers, as the FS uses a slightly 
different setup with rule chaining and creation of queues that do not exist.

The fix is in 3.3 and later: marking this as a duplicate of YARN-8967.

> Add user to queue mappings to automatically place users' apps into specific 
> queues
> --
>
> Key: YARN-2257
> URL: https://issues.apache.org/jira/browse/YARN-2257
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: scheduler
>Reporter: Patrick Liu
>Assignee: Vinod Kumar Vavilapalli
>Priority: Major
>  Labels: features
>
> Currently, the fair-scheduler supports two modes, default queue or individual 
> queue for each user.
> Apparently, the default queue is not a good option, because the resources 
> cannot be managed for each user or group.
> However, individual queue for each user is not good enough. Especially when 
> connecting yarn with hive. There will be increasing hive users in a corporate 
> environment. If we create a queue for a user, the resource management will be 
> hard to maintain.
> I think the problem can be solved like this:
> 1. Define user->queue mapping in Fair-Scheduler.xml. Inside each queue, use 
> aclSubmitApps to control the user's ability.
> 2. Each time a user submits an app to YARN, if the user is mapped to a queue, 
> the app will be scheduled to that queue; otherwise, the app will be submitted 
> to the default queue.
> 3. If the user cannot pass the aclSubmitApps limits, the app will not be accepted.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-9415) Document FS placement rule changes from YARN-8967

2019-03-27 Thread Wilfred Spiegelenburg (JIRA)
Wilfred Spiegelenburg created YARN-9415:
---

 Summary: Document FS placement rule changes from YARN-8967
 Key: YARN-9415
 URL: https://issues.apache.org/jira/browse/YARN-9415
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: documentation, fairscheduler
Affects Versions: 3.3.0
Reporter: Wilfred Spiegelenburg
Assignee: Wilfred Spiegelenburg


With the changes introduced by YARN-8967 we now allow parent rules on all 
existing rules. This should be documented.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-9298) Implement FS placement rules using PlacementRule interface

2019-02-12 Thread Wilfred Spiegelenburg (JIRA)
Wilfred Spiegelenburg created YARN-9298:
---

 Summary: Implement FS placement rules using PlacementRule interface
 Key: YARN-9298
 URL: https://issues.apache.org/jira/browse/YARN-9298
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: scheduler
Reporter: Wilfred Spiegelenburg
Assignee: Wilfred Spiegelenburg


Implement existing placement rules of the FS using the PlacementRule interface.

Preparation for YARN-8967



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Resolved] (YARN-1122) FairScheduler user-as-default-queue always defaults to 'default'

2019-01-12 Thread Wilfred Spiegelenburg (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-1122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wilfred Spiegelenburg resolved YARN-1122.
-
Resolution: Duplicate

Closing this old JIRA as a duplicate per the comment.

> FairScheduler user-as-default-queue always defaults to 'default'
> 
>
> Key: YARN-1122
> URL: https://issues.apache.org/jira/browse/YARN-1122
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: scheduler
>Affects Versions: 2.0.5-alpha
>Reporter: Lohit Vijayarenu
>Priority: Major
> Attachments: YARN-1122.1.patch
>
>
> By default the YARN fairscheduler should use the user name as the queue name, but 
> we see that in our clusters all jobs were ending up in the default queue. Even 
> after picking YARN-333, which is part of trunk, the behavior remains the same. Jobs 
> do end up in the right queue, but from the UI perspective they are shown as running 
> under the default queue. It looks like there is a small bug with
> {noformat}
> RMApp rmApp = rmContext.getRMApps().get(applicationAttemptId);
> {noformat}
> which should actually be
> {noformat}
> RMApp rmApp = 
> rmContext.getRMApps().get(applicationAttemptId.getApplicationId());
> {noformat}
> There is also a simple JS change needed for filtering of jobs on the 
> fairscheduler UI page.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-9173) FairShare calculation broken for large values after YARN-8833

2019-01-03 Thread Wilfred Spiegelenburg (JIRA)
Wilfred Spiegelenburg created YARN-9173:
---

 Summary: FairShare calculation broken for large values after 
YARN-8833
 Key: YARN-9173
 URL: https://issues.apache.org/jira/browse/YARN-9173
 Project: Hadoop YARN
  Issue Type: Bug
  Components: fairscheduler
Affects Versions: 3.3.0
Reporter: Wilfred Spiegelenburg
Assignee: Wilfred Spiegelenburg


After the fix for the infinite loop in YARN-8833 we now get the wrong values 
back from the fair share calculations under certain circumstances. The current 
implementation works when the total resource is smaller than Integer.MAX_VALUE.

When the total resource goes above that value, the number of iterations is not 
enough to converge to the correct value.

The new test {{testResourceUsedWithWeightToResourceRatio()}} only checks that 
the calculation does not hang but does not check the outcome of the calculation.
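
As a standalone illustration of the convergence problem (generic Java, not the scheduler code): a binary-search style calculation over a resource range needs roughly log2(total) halvings to reach integer precision, so an iteration budget sized for totals up to Integer.MAX_VALUE (about 31 halvings) stops being enough once the cluster total exceeds that value.

{code:java}
// Standalone demo: iterations needed to narrow a range of size `total` down to 1.
public class ConvergenceDemo {
  public static void main(String[] args) {
    long[] totals = {Integer.MAX_VALUE, 16L * 1024 * 1024 * 1024}; // second value > Integer.MAX_VALUE
    for (long total : totals) {
      int iterations = 0;
      for (long range = total; range > 1; range >>= 1) {
        iterations++;
      }
      System.out.println("total=" + total + " needs ~" + iterations + " iterations");
    }
  }
}
{code}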




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-9047) FairScheduler: default resource calculator is not resource type aware

2018-11-22 Thread Wilfred Spiegelenburg (JIRA)
Wilfred Spiegelenburg created YARN-9047:
---

 Summary: FairScheduler: default resource calculator is not 
resource type aware
 Key: YARN-9047
 URL: https://issues.apache.org/jira/browse/YARN-9047
 Project: Hadoop YARN
  Issue Type: Bug
  Components: fairscheduler
Reporter: Wilfred Spiegelenburg


FairScheduler#getResourceCalculator always returns the default resource 
calculator. The default calculator is not resource-type aware and should only 
be used if there are no additional resource types configured.

We need to make sure that the direct hard-coded reference to 
{{RESOURCE_CALCULATOR}} is either safe to use in all cases or is not used in 
the scheduler.
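
A minimal sketch of the kind of safeguard this implies (assuming the standard calculators from org.apache.hadoop.yarn.util.resource; this is not the committed change): pick a resource-type-aware calculator whenever more than the two built-in types are configured.

{code:java}
import org.apache.hadoop.yarn.util.resource.DefaultResourceCalculator;
import org.apache.hadoop.yarn.util.resource.DominantResourceCalculator;
import org.apache.hadoop.yarn.util.resource.ResourceCalculator;
import org.apache.hadoop.yarn.util.resource.ResourceUtils;

// Hypothetical helper, not FairScheduler code: DominantResourceCalculator considers
// all configured resource types, DefaultResourceCalculator only looks at memory.
class CalculatorChoice {
  static ResourceCalculator pickCalculator() {
    return ResourceUtils.getNumberOfKnownResourceTypes() > 2
        ? new DominantResourceCalculator()
        : new DefaultResourceCalculator();
  }
}
{code}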



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-8994) Fix for race condition in move app and queue cleanup in FS

2018-11-08 Thread Wilfred Spiegelenburg (JIRA)
Wilfred Spiegelenburg created YARN-8994:
---

 Summary: Fix for race condition in move app and queue cleanup in FS
 Key: YARN-8994
 URL: https://issues.apache.org/jira/browse/YARN-8994
 Project: Hadoop YARN
  Issue Type: Bug
  Components: fairscheduler
Affects Versions: 3.2.0
Reporter: Wilfred Spiegelenburg
Assignee: Wilfred Spiegelenburg


Similar to YARN-8990, and also introduced by YARN-8191, there is a race condition 
while moving an application. The pre-move check looks for the queue and, when it 
finds the queue, it progresses. The real move then retrieves the queue and does 
further checks before updating the app and queues.

The move uses the retrieved queue object, but the queue could have become empty 
while the checks are performed. If the cleanup runs at that same time the app 
will be moved to a deleted queue and lost.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-8990) FS" race condition in app submit and queue cleanup

2018-11-08 Thread Wilfred Spiegelenburg (JIRA)
Wilfred Spiegelenburg created YARN-8990:
---

 Summary: FS" race condition in app submit and queue cleanup
 Key: YARN-8990
 URL: https://issues.apache.org/jira/browse/YARN-8990
 Project: Hadoop YARN
  Issue Type: Bug
  Components: fairscheduler
Affects Versions: 3.2.0
Reporter: Wilfred Spiegelenburg
Assignee: Wilfred Spiegelenburg


With the introduction of dynamic queue deletion in YARN-8191 a race 
condition was introduced that can cause a queue to be removed while an 
application submit is in progress.

The issue occurs in {{FairScheduler.addApplication()}} when an application is 
submitted to a dynamic queue which is empty or does not exist yet. If, 
during the processing of the application submit, the 
{{AllocationFileLoaderService}} kicks off an update, the queue cleanup will 
be run first. The application submit first creates the queue and gets a 
reference back to the queue. 
Other checks are performed and, as the last action before getting ready to 
generate an AppAttempt, the queue is updated to show the submitted application 
ID.

The time between the queue creation and the queue update to show the submit is 
long enough for the queue to be removed. The application, however, is lost and 
will never get any resources assigned.
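
As a standalone illustration of this check-then-act window (generic Java, not the FairScheduler code): a submitter creates or looks up a queue, performs further checks, and only then records the application, while a concurrent cleanup removes queues that are still empty.

{code:java}
import java.util.concurrent.*;

// Generic demo of the race: the submitter's queue object can be removed from the
// map before the application is recorded in it, so the app ends up in an orphaned
// queue object and is effectively lost.
public class QueueCleanupRaceDemo {
  static final ConcurrentHashMap<String, CopyOnWriteArrayList<String>> queues =
      new ConcurrentHashMap<>();

  public static void main(String[] args) throws Exception {
    ExecutorService pool = Executors.newFixedThreadPool(2);
    pool.submit(() -> {                       // plays the role of addApplication()
      CopyOnWriteArrayList<String> q =
          queues.computeIfAbsent("root.dynamic", k -> new CopyOnWriteArrayList<>());
      sleep(50);                              // other submit-time checks happen here
      q.add("application_0_0001");            // queue may already have been removed
    });
    pool.submit(() -> {                       // plays the role of the queue cleanup
      sleep(10);
      queues.entrySet().removeIf(e -> e.getValue().isEmpty());
    });
    pool.shutdown();
    pool.awaitTermination(1, TimeUnit.SECONDS);
    System.out.println("queues left: " + queues.keySet());  // "root.dynamic" is gone
  }

  static void sleep(long ms) {
    try { Thread.sleep(ms); } catch (InterruptedException ignored) { }
  }
}
{code}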



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-8985) FSParentQueue: debug log missing when assigning container

2018-11-08 Thread Wilfred Spiegelenburg (JIRA)
Wilfred Spiegelenburg created YARN-8985:
---

 Summary: FSParentQueue: debug log missing when assigning container
 Key: YARN-8985
 URL: https://issues.apache.org/jira/browse/YARN-8985
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: fairscheduler
Affects Versions: 3.3.0
Reporter: Wilfred Spiegelenburg
Assignee: Wilfred Spiegelenburg


Tracking assignments in the queue hierarchy is not possible at the DEBUG level 
because the FSParentQueue does not log a node being offered to the queue.
This means that if a parent queue has no leaf queues it will be impossible 
to track the offering, leaving a hole in the tracking.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-8967) Change FairScheduler to use PlacementRule interface

2018-11-02 Thread Wilfred Spiegelenburg (JIRA)
Wilfred Spiegelenburg created YARN-8967:
---

 Summary: Change FairScheduler to use PlacementRule interface
 Key: YARN-8967
 URL: https://issues.apache.org/jira/browse/YARN-8967
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: capacityscheduler, fairscheduler
Reporter: Wilfred Spiegelenburg
Assignee: Wilfred Spiegelenburg


The PlacementRule interface was introduced to be used by all schedulers as per 
YARN-3635. The CapacityScheduler is using it but the FairScheduler is not and 
is using its own rule definition.

YARN-8948 cleans up the implementation and removes the CS references which 
should allow this change to go through.
This would be the first step in using one placement rule engine for both 
schedulers.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-8944) TestContainerAllocation.testUserLimitAllocationMultipleContainers failure after YARN-8896

2018-10-25 Thread Wilfred Spiegelenburg (JIRA)
Wilfred Spiegelenburg created YARN-8944:
---

 Summary: 
TestContainerAllocation.testUserLimitAllocationMultipleContainers failure after 
YARN-8896
 Key: YARN-8944
 URL: https://issues.apache.org/jira/browse/YARN-8944
 Project: Hadoop YARN
  Issue Type: Test
  Components: capacity scheduler
Affects Versions: 3.3.0
Reporter: Wilfred Spiegelenburg
Assignee: Wilfred Spiegelenburg


YARN-8896 changes the behaviour of the CapacityScheduler by limiting the number 
of containers that can be allocated in one heartbeat. It is an undocumented 
change in behaviour.

The change breaks the junit test: 
{{TestContainerAllocation.testUserLimitAllocationMultipleContainers}}

The maximum number of containers that gets assigned per heartbeat is now 100, 
while the test expects 199 to be assigned.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-8904) TestRMDelegationTokens can fail in testRMDTMasterKeyStateOnRollingMasterKey

2018-10-17 Thread Wilfred Spiegelenburg (JIRA)
Wilfred Spiegelenburg created YARN-8904:
---

 Summary: TestRMDelegationTokens can fail in 
testRMDTMasterKeyStateOnRollingMasterKey
 Key: YARN-8904
 URL: https://issues.apache.org/jira/browse/YARN-8904
 Project: Hadoop YARN
  Issue Type: Test
  Components: test
Affects Versions: 3.1.0
Reporter: Wilfred Spiegelenburg
Assignee: Wilfred Spiegelenburg


In build 
[link|https://builds.apache.org/job/PreCommit-YARN-Build/22215/artifact/out/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt],
 TestRMDelegationTokens fails for a test case:

* TestRMDelegationTokens.testRMDTMasterKeyStateOnRollingMasterKey

The test fails with an extra key in the list. It can be easily reproduced by 
introducing a short sleep in the thread.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-8865) RMStateStore contains large number of expired RMDelegationToken

2018-10-10 Thread Wilfred Spiegelenburg (JIRA)
Wilfred Spiegelenburg created YARN-8865:
---

 Summary: RMStateStore contains large number of expired 
RMDelegationToken
 Key: YARN-8865
 URL: https://issues.apache.org/jira/browse/YARN-8865
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 3.1.0
Reporter: Wilfred Spiegelenburg
Assignee: Wilfred Spiegelenburg


When the RM state store is restored, expired delegation tokens are restored and 
added to the system. These expired tokens do not get cleaned up or removed. The 
exact reason why the tokens are still in the store is not clear. We have seen 
as many as 250,000 tokens in the store, some of which were 2 years old.

This has two side effects:
* for the zookeeper store this leads to a jute buffer exhaustion issue and 
prevents the RM from becoming active.
* restore takes longer than needed and heap usage is higher than it should be

We should not restore already expired tokens since they cannot be renewed or 
used.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-8605) TestDominantResourceFairnessPolicy.testModWhileSorting: flaky

2018-07-30 Thread Wilfred Spiegelenburg (JIRA)
Wilfred Spiegelenburg created YARN-8605:
---

 Summary: TestDominantResourceFairnessPolicy.testModWhileSorting: 
flaky
 Key: YARN-8605
 URL: https://issues.apache.org/jira/browse/YARN-8605
 Project: Hadoop YARN
  Issue Type: Bug
  Components: fairscheduler
Affects Versions: 3.2.0
Reporter: Wilfred Spiegelenburg
Assignee: Wilfred Spiegelenburg


TestDominantResourceFairnessPolicy.testModWhileSorting: the test for the old 
comparison method is flaky.
The test relies on the sorting having started when the modification starts, and 
that seems to be too tricky to time.

Introduced with YARN-8436



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-8494) TestQueueManagementDynamicEditPolicy.testEditSchedule is flaky

2018-07-05 Thread Wilfred Spiegelenburg (JIRA)
Wilfred Spiegelenburg created YARN-8494:
---

 Summary: TestQueueManagementDynamicEditPolicy.testEditSchedule is 
flaky 
 Key: YARN-8494
 URL: https://issues.apache.org/jira/browse/YARN-8494
 Project: Hadoop YARN
  Issue Type: Bug
  Components: capacityscheduler
Affects Versions: 3.1.0
Reporter: Wilfred Spiegelenburg


The TestQueueManagementDynamicEditPolicy.testEditSchedule has been failing in a 
number of jenkins jobs:

org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestQueueManagementDynamicEditPolicy.testEditSchedule

*Error Message*
expected:<0.5> but was:<0.0>
*Stacktrace*
java.lang.AssertionError: expected:<0.5> but was:<0.0>
at org.junit.Assert.fail(Assert.java:88)
at org.junit.Assert.failNotEquals(Assert.java:743)
at org.junit.Assert.assertEquals(Assert.java:519)
at org.junit.Assert.assertEquals(Assert.java:609)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestCapacitySchedulerAutoCreatedQueueBase.validateCapacities(TestCapacitySchedulerAutoCreatedQueueBase.java:445)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestQueueManagementDynamicEditPolicy.testEditSchedule(TestQueueManagementDynamicEditPolicy.java:103)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-8436) FSParentQueue: Comparison method violates its general contract

2018-06-19 Thread Wilfred Spiegelenburg (JIRA)
Wilfred Spiegelenburg created YARN-8436:
---

 Summary: FSParentQueue: Comparison method violates its general 
contract
 Key: YARN-8436
 URL: https://issues.apache.org/jira/browse/YARN-8436
 Project: Hadoop YARN
  Issue Type: Bug
  Components: fairscheduler
Affects Versions: 3.1.0
Reporter: Wilfred Spiegelenburg


The ResourceManager can fail while sorting queues if an update comes in:
{code:java}
FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in 
handling event type NODE_UPDATE to the scheduler
java.lang.IllegalArgumentException: Comparison method violates its general 
contract!
at java.util.TimSort.mergeLo(TimSort.java:777)
at java.util.TimSort.mergeAt(TimSort.java:514)
...
at java.util.Collections.sort(Collections.java:175)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.assignContainer(FSParentQueue.java:223){code}
The reason it breaks is a change in the sorted object itself. 
This is why it fails:
 * an update from a node comes in as a heartbeat.
 * the update triggers a check to see if we can assign a container on the node.
>  * walk over the queue hierarchy to find a queue to assign a container to: top 
> down.
>  * for each parent queue we sort the child queues in {{assignContainer}} to 
> decide which queue to descend into.
>  * we lock the parent queue while sorting to prevent changes, but we do not lock 
> the child queues that we are sorting.

If during this sorting a different node update changes a child queue then we 
allow that. This means that the objects that we are trying to sort now might be 
out of order. That causes the issue with the comparator. The comparator itself 
is not broken.
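
As a standalone demonstration of that failure mode (generic Java, not YARN code): sorting a list while another thread keeps changing the sort keys can make TimSort report that the comparison method violates its general contract, even though the comparator itself is consistent.

{code:java}
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Random;

// Generic demo: mutate the sort keys from another thread while sorting.
public class ConcurrentSortDemo {
  static class QueueUsage {
    volatile int usage;
    QueueUsage(int u) { usage = u; }
  }

  public static void main(String[] args) {
    final List<QueueUsage> children = new ArrayList<>();
    final Random rand = new Random();
    for (int i = 0; i < 10_000; i++) {
      children.add(new QueueUsage(rand.nextInt(1_000)));
    }
    Thread nodeUpdate = new Thread(() -> {     // plays the role of a second node update
      while (!Thread.currentThread().isInterrupted()) {
        children.get(rand.nextInt(children.size())).usage = rand.nextInt(1_000);
      }
    });
    nodeUpdate.setDaemon(true);
    nodeUpdate.start();
    for (int i = 0; i < 1_000; i++) {
      // Consistent comparator, unstable inputs: some runs throw
      // "Comparison method violates its general contract!" here.
      children.sort(Comparator.comparingInt(q -> q.usage));
    }
    nodeUpdate.interrupt();
  }
}
{code}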



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Resolved] (YARN-6941) Allow Queue placement policies to be ordered by attribute

2018-05-10 Thread Wilfred Spiegelenburg (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-6941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wilfred Spiegelenburg resolved YARN-6941.
-
Resolution: Won't Fix

The XML is already ordered so it does not make sense to override the ordering 
which is in the XML.

> Allow Queue placement policies to be ordered by attribute
> -
>
> Key: YARN-6941
> URL: https://issues.apache.org/jira/browse/YARN-6941
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: fairscheduler
>Reporter: Yufei Gu
>Priority: Minor
>
> It would be nice to add a feature that would allow users to provide an 
> "order" or "index" in which the placement policies should apply, rather than just 
> the native policy order as included in the XML.
> For instance, the following two examples would be the same:
> Natural order:
> 
> 
> 
> 
> 
> Indexed Order:
> 
> 
> 
> 
> 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Resolved] (YARN-5399) Add configuration to remember ad-hoc queues upon configuration reload

2018-05-10 Thread Wilfred Spiegelenburg (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-5399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wilfred Spiegelenburg resolved YARN-5399.
-
Resolution: Duplicate

We should have this covered as part of YARN-8191.

We're adding a flag to queues for ad-hoc or dynamic queues. Dynamic queues that 
are empty will be removed. Queues that are no longer in the configuration will 
be removed when they become empty.

> Add configuration to remember ad-hoc queues upon configuration reload
> -
>
> Key: YARN-5399
> URL: https://issues.apache.org/jira/browse/YARN-5399
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: fairscheduler
>Reporter: Ray Chiang
>Assignee: Wilfred Spiegelenburg
>Priority: Major
>  Labels: supportability
>
> By default, FairScheduler detects and loads a changed configuration file.  
> When that load happens, ad-hoc queues are not re-created.  This can cause 
> issues with those ad-hoc queues that still have jobs running.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Resolved] (YARN-5128) Investigate potential race condition in the scheduler nodeUpdate() method

2018-03-29 Thread Wilfred Spiegelenburg (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-5128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wilfred Spiegelenburg resolved YARN-5128.
-
Resolution: Duplicate

The comment was added into the CapacityScheduler only as part of YARN-3223. The 
discussion led to YARN-4677. I will close this as a dupe of that jira and 
attach a fix there.

> Investigate potential race condition in the scheduler nodeUpdate() method
> -
>
> Key: YARN-5128
> URL: https://issues.apache.org/jira/browse/YARN-5128
> Project: Hadoop YARN
>  Issue Type: Task
>Reporter: Ray Chiang
>Assignee: Wilfred Spiegelenburg
>Priority: Major
>
> This section of code exists in the various schedulers in the method 
> nodeUpdate():
> {code}
> // If the node is decommissioning, send an update to have the total
> // resource equal to the used resource, so no available resource to
> // schedule.
> // TODO: Fix possible race-condition when request comes in before
> // update is propagated
> if (nm.getState() == NodeState.DECOMMISSIONING) {
>   this.rmContext
>   .getDispatcher()
>   .getEventHandler()
>   .handle(
>   new RMNodeResourceUpdateEvent(nm.getNodeID(), ResourceOption
>   .newInstance(getSchedulerNode(nm.getNodeID())
>   .getAllocatedResource(), 0)));
> }
> {code}
> Investigate the TODO section.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Resolved] (YARN-1495) Allow moving apps between queues

2018-03-07 Thread Wilfred Spiegelenburg (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wilfred Spiegelenburg resolved YARN-1495.
-
Resolution: Fixed

Closing, as the last open JIRA (YARN-1558) was done in 2.9 as part of YARN-5932.

> Allow moving apps between queues
> 
>
> Key: YARN-1495
> URL: https://issues.apache.org/jira/browse/YARN-1495
> Project: Hadoop YARN
>  Issue Type: New Feature
>  Components: scheduler
>Affects Versions: 2.2.0
>Reporter: Sandy Ryza
>Assignee: Sandy Ryza
>Priority: Major
>
> This is an umbrella JIRA for work needed to allow moving YARN applications 
> from one queue to another.  The work will consist of additions in the command 
> line options, additions in the client RM protocol, and changes in the 
> schedulers to support this.
> I have a picture of how this should function in the Fair Scheduler, but I'm 
> not familiar enough with the Capacity Scheduler for the same there.  
> Ultimately, the decision as to whether an application can be moved should go 
> down to the scheduler - some schedulers may wish not to support this at all.  
> However, schedulers that do support it should share some common semantics 
> around ACLs and what happens to running containers.
> Here is how I see the general semantics working out:
> * A move request is issued by the client.  After it gets past ACLs, the 
> scheduler checks whether executing the move will violate any constraints. For 
> the Fair Scheduler, these would be queue maxRunningApps and queue 
> maxResources constraints
> * All running containers are transferred from the old queue to the new queue
> * All outstanding requests are transferred from the old queue to the new queue
> Here is how I see the ACLs working out:
> * To move an app from a queue a user must have modify access on the app or 
> administer access on the queue
> * To move an app to a queue a user must have submit access on the queue or 
> administer access on the queue 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Resolved] (YARN-1558) After apps are moved across queues, store new queue info in the RM state store

2018-03-07 Thread Wilfred Spiegelenburg (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wilfred Spiegelenburg resolved YARN-1558.
-
Resolution: Duplicate

Closing as YARN-5932 is in 2.9 and later

> After apps are moved across queues, store new queue info in the RM state store
> --
>
> Key: YARN-1558
> URL: https://issues.apache.org/jira/browse/YARN-1558
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: scheduler
>Reporter: Sandy Ryza
>Assignee: Varun Saxena
>Priority: Major
>
> The result of moving an app to a new queue should persist across RM restarts. 
>  This will require updating the ApplicationSubmissionContext, the single 
> source of truth upon state recovery, with the new queue info.
> There will be a brief window after the move completes before the move is 
> stored.  If the RM dies during this window, the recovered RM will include the 
> old queue info.  Schedulers should be resilient to this situation.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-7755) Clean up deprecation messages for allocation increments in FS config

2018-01-15 Thread Wilfred Spiegelenburg (JIRA)
Wilfred Spiegelenburg created YARN-7755:
---

 Summary: Clean up deprecation messages for allocation increments 
in FS config
 Key: YARN-7755
 URL: https://issues.apache.org/jira/browse/YARN-7755
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: fairscheduler
Affects Versions: 3.1.0
Reporter: Wilfred Spiegelenburg
Assignee: Wilfred Spiegelenburg


See the comment in YARN-6486: deprecation messages in the FS configuration are 
missing and the Javadoc needs a cleanup.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-7689) TestRMContainerAllocator fails after YARN-6124

2017-12-28 Thread Wilfred Spiegelenburg (JIRA)
Wilfred Spiegelenburg created YARN-7689:
---

 Summary: TestRMContainerAllocator fails after YARN-6124
 Key: YARN-7689
 URL: https://issues.apache.org/jira/browse/YARN-7689
 Project: Hadoop YARN
  Issue Type: Bug
  Components: scheduler
Affects Versions: 3.1.0
Reporter: Wilfred Spiegelenburg
Assignee: Wilfred Spiegelenburg


After the change that was made for YARN-6124 multiple tests in the 
TestRMContainerAllocator from MapReduce fail with the following NPE:
{code}
java.lang.NullPointerException
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler.reinitialize(AbstractYarnScheduler.java:1437)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fifo.FifoScheduler.reinitialize(FifoScheduler.java:320)
at 
org.apache.hadoop.mapreduce.v2.app.rm.TestRMContainerAllocator$ExcessReduceContainerAllocateScheduler.(TestRMContainerAllocator.java:1808)
at 
org.apache.hadoop.mapreduce.v2.app.rm.TestRMContainerAllocator$MyResourceManager2.createScheduler(TestRMContainerAllocator.java:970)
at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceInit(ResourceManager.java:659)
at 
org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.createAndInitActiveServices(ResourceManager.java:1133)
at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceInit(ResourceManager.java:316)
at 
org.apache.hadoop.yarn.server.resourcemanager.MockRM.serviceInit(MockRM.java:1334)
at 
org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
at 
org.apache.hadoop.yarn.server.resourcemanager.MockRM.(MockRM.java:162)
at 
org.apache.hadoop.yarn.server.resourcemanager.MockRM.(MockRM.java:141)
at 
org.apache.hadoop.yarn.server.resourcemanager.MockRM.(MockRM.java:137)
at 
org.apache.hadoop.mapreduce.v2.app.rm.TestRMContainerAllocator$MyResourceManager.(TestRMContainerAllocator.java:928)
{code}

In the test we just call reinitialize on a scheduler and never call init.
The stop of the service is guarded against this, and the start and the re-init 
should be guarded in the same way.
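A minimal sketch of that guard, with an illustrative service class rather than 
the real AbstractYarnScheduler:

{code}
// Illustrative only: guard re-init the same way stop() is guarded, so calling
// reinitialize() on a scheduler that was never init()-ed does not hit an NPE.
class GuardedService {
  private Object context;                    // created by init(), may still be null

  void init(Object ctx) { this.context = ctx; }

  void reinitialize(Object ctx) {
    if (context == null) {                   // never initialised: nothing to re-wire
      init(ctx);
      return;
    }
    // ... re-read configuration and update internal state using 'context' ...
  }

  void stop() {
    if (context == null) {                   // the existing guard on stop()
      return;
    }
    // ... release resources ...
  }
}
{code}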




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-7665) Allow FS scheduler state dump to be turned on/off separate from FS debug

2017-12-17 Thread Wilfred Spiegelenburg (JIRA)
Wilfred Spiegelenburg created YARN-7665:
---

 Summary: Allow FS scheduler state dump to be turned on/off 
separate from FS debug
 Key: YARN-7665
 URL: https://issues.apache.org/jira/browse/YARN-7665
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 3.0.0
Reporter: Wilfred Spiegelenburg
Assignee: Wilfred Spiegelenburg


The FS state dump can currently not be turned on or off independently of the FS 
debug logging.
The logic for dumping the state uses a mixture of the {{FairScheduler}} and 
{{FairScheduler.statedump}} loggers to check whether it dumps. It should just 
use the state dump logger.
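A sketch of the intended check, assuming a dedicated ".statedump" logger name 
(the exact logger name here is an assumption):

{code}
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Illustrative only: whether to dump state is decided by the state dump logger
// alone, so it can be enabled or disabled independently of FairScheduler DEBUG.
final class StateDumpCheck {
  private static final Logger STATE_DUMP_LOG = LoggerFactory.getLogger(
      "org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.statedump");

  static boolean shouldDumpState() {
    return STATE_DUMP_LOG.isDebugEnabled();
  }
}
{code}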



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Resolved] (YARN-7534) Fair scheduler assign resources may exceed maxResources

2017-12-11 Thread Wilfred Spiegelenburg (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-7534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wilfred Spiegelenburg resolved YARN-7534.
-
Resolution: Cannot Reproduce

No issue found: the code shows that we check the queue size in the FS, and we 
have no logs that show this is not working.

> Fair scheduler assign resources may exceed maxResources
> ---
>
> Key: YARN-7534
> URL: https://issues.apache.org/jira/browse/YARN-7534
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Reporter: YunFan Zhou
>Assignee: Wilfred Spiegelenburg
>
> The logic we use when scheduling now is to check whether the resources used by 
> the queue have exceeded *maxResources* before assigning the container. This can 
> lead to the queue using more resources than *maxResources* after this container 
> has been assigned.
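For reference, the check being discussed, sketched with plain numbers instead of 
the YARN Resource classes: only assign when the post-assignment usage still fits 
within maxResources.

{code}
// Illustrative only: require usage + container <= max, not just usage < max.
final class MaxResourcesCheck {
  static boolean canAssign(long usedMemory, long containerMemory, long maxMemory) {
    return usedMemory + containerMemory <= maxMemory;
  }
}
{code}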



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-7585) NodeManager should go unhealthy when state store throws DBException

2017-11-29 Thread Wilfred Spiegelenburg (JIRA)
Wilfred Spiegelenburg created YARN-7585:
---

 Summary: NodeManager should go unhealthy when state store throws 
DBException 
 Key: YARN-7585
 URL: https://issues.apache.org/jira/browse/YARN-7585
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Reporter: Wilfred Spiegelenburg
Assignee: Wilfred Spiegelenburg


If work preserving recovery is enabled the NM will not start up if the state 
store does not initialise. However, if the state store becomes unavailable after 
that for any reason the NM does not go unhealthy. 
Since the state store is not available new containers can not be started any 
more and the NM should become unhealthy:
{code}
AMLauncher: Error launching appattempt_1508806289867_268617_01. Got 
exception: org.apache.hadoop.yarn.exceptions.YarnException: 
java.io.IOException: org.iq80.leveldb.DBException: IO error: 
/dsk/app/var/lib/hadoop-yarn/yarn-nm-recovery/yarn-nm-state/028269.log: 
Read-only file system
at o.a.h.yarn.ipc.RPCUtil.getRemoteException(RPCUtil.java:38)
at 
o.a.h.y.s.n.cm.ContainerManagerImpl.startContainers(ContainerManagerImpl.java:721)
...
Caused by: java.io.IOException: org.iq80.leveldb.DBException: IO error: 
/dsk/app/var/lib/hadoop-yarn/yarn-nm-recovery/yarn-nm-state/028269.log: 
Read-only file system
at 
o.a.h.y.s.n.r.NMLeveldbStateStoreService.storeApplication(NMLeveldbStateStoreService.java:374)
at 
o.a.h.y.s.n.cm.ContainerManagerImpl.startContainerInternal(ContainerManagerImpl.java:848)
at 
o.a.h.y.s.n.cm.ContainerManagerImpl.startContainers(ContainerManagerImpl.java:712)
{code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-7580) ContainersMonitorImpl logged message lacks detail when exceeding memory limits

2017-11-28 Thread Wilfred Spiegelenburg (JIRA)
Wilfred Spiegelenburg created YARN-7580:
---

 Summary: ContainersMonitorImpl logged message lacks detail when 
exceeding memory limits
 Key: YARN-7580
 URL: https://issues.apache.org/jira/browse/YARN-7580
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: nodemanager
Affects Versions: 3.1.0
Reporter: Wilfred Spiegelenburg
Assignee: Wilfred Spiegelenburg


Currently, memory usage for a container that exceeds the memory limit is 
reported in the RM logs like this:
{code}
2016-06-14 09:15:36,694 INFO [AsyncDispatcher event handler] 
org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics report 
from attempt_1464251583966_0932_r_000876_0: Container 
[pid=134938,containerID=container_1464251583966_0932_01_002237] is running 
beyond physical memory limits. Current usage: 1.0 GB of 1 GB physical memory 
used; 1.9 GB of 2.1 GB virtual memory used. Killing container.
{code}

Two enhancements as part of this jira:
- make it clearer which limit we exceed
- show exactly how much we exceeded the limit by
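A sketch of what the clearer message could look like; the wording and helper are 
illustrative, not the actual ContainersMonitorImpl change:

{code}
// Illustrative only: name the limit that was exceeded and the amount it was
// exceeded by, instead of only printing current usage and the configured limits.
final class LimitMessage {
  static String limitExceeded(String containerId, long usedBytes, long limitBytes,
      boolean physical) {
    long overBytes = usedBytes - limitBytes;
    return String.format(
        "Container %s is running %d bytes beyond the %s memory limit "
            + "(used %d of %d bytes). Killing container.",
        containerId, overBytes, physical ? "PHYSICAL" : "VIRTUAL", usedBytes, limitBytes);
  }
}
{code}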



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-7524) Remove unused FairSchedulerEventLog

2017-11-16 Thread Wilfred Spiegelenburg (JIRA)
Wilfred Spiegelenburg created YARN-7524:
---

 Summary: Remove unused FairSchedulerEventLog
 Key: YARN-7524
 URL: https://issues.apache.org/jira/browse/YARN-7524
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: fairscheduler
Reporter: Wilfred Spiegelenburg
Assignee: Wilfred Spiegelenburg


The FairSchedulerEventLog is no longer used. It is only being written to in one 
location in the FS (see YARN-1383) and the functionality requested in that jira 
has been implemented using the normal OOTB logging in the AbstractYarnScheduler.

The functionality the scheduler event log used to provide has been replaced 
with normal logging and the scheduler state dump in YARN-6042.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-7513) FindBugs in FSAppAttempt.getWeight()

2017-11-16 Thread Wilfred Spiegelenburg (JIRA)
Wilfred Spiegelenburg created YARN-7513:
---

 Summary: FindBugs in FSAppAttempt.getWeight()
 Key: YARN-7513
 URL: https://issues.apache.org/jira/browse/YARN-7513
 Project: Hadoop YARN
  Issue Type: Bug
  Components: fairscheduler
Affects Versions: 3.1.0
Reporter: Wilfred Spiegelenburg
Assignee: Wilfred Spiegelenburg
Priority: Minor


With the change from YARN-7414 a new FindBugs warning was introduced.
The code that was moved from the FairScheduler to the FSAppAttempt can also be 
simplified by removing the unneeded locking.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-7139) FairScheduler: finished applications are always restored to default queue

2017-08-30 Thread Wilfred Spiegelenburg (JIRA)
Wilfred Spiegelenburg created YARN-7139:
---

 Summary: FairScheduler: finished applications are always restored 
to default queue
 Key: YARN-7139
 URL: https://issues.apache.org/jira/browse/YARN-7139
 Project: Hadoop YARN
  Issue Type: Bug
  Components: fairscheduler
Affects Versions: 2.8.1
Reporter: Wilfred Spiegelenburg
Assignee: Wilfred Spiegelenburg


The queue an application gets submitted to is defined by the placement policy 
in the FS. The placement policy returns the queue and the application object is 
updated. When an application is stored in the state store the application 
submission context is used which has not been updated after the placement rules 
have run. 

This means that the original queue from the submission is still stored which is 
the incorrect queue. On restore we then read back the wrong queue and display 
the wrong queue in the RM web UI.

We should update the submission context after we have run the placement 
policies to make sure that we store the correct queue for the application.
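A minimal sketch of the proposed fix; the helper class is illustrative, but 
ApplicationSubmissionContext#setQueue is the real API such a fix would rely on:

{code}
import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;

// Illustrative only: once the placement rules have resolved the real queue,
// write it back into the submission context before the app is persisted, so
// recovery and the web UI see the resolved queue rather than the requested one.
final class QueuePlacementFix {
  static void recordResolvedQueue(ApplicationSubmissionContext ctx, String resolvedQueue) {
    ctx.setQueue(resolvedQueue);
  }
}
{code}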



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-6615) AmIpFilter drops query parameters on redirect

2017-05-17 Thread Wilfred Spiegelenburg (JIRA)
Wilfred Spiegelenburg created YARN-6615:
---

 Summary: AmIpFilter drops query parameters on redirect
 Key: YARN-6615
 URL: https://issues.apache.org/jira/browse/YARN-6615
 Project: Hadoop YARN
  Issue Type: Bug
  Components: amrmproxy
Affects Versions: 3.0.0-alpha2
Reporter: Wilfred Spiegelenburg
Assignee: Wilfred Spiegelenburg


When an AM web request is redirected to the RM the query parameters are dropped 
from the web request.

This happens for Spark as described in SPARK-20772.
The repro steps are:
- Start up the spark-shell in yarn mode and run a job
- Try to access the job details through http://:4040/jobs/job?id=0
- An HTTP ERROR 400 is thrown (requirement failed: missing id parameter)

This works fine in local or standalone mode, but does not work on Yarn where 
the query parameter is dropped. If the UI filter 
org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter is removed from the 
config the request works, which shows that the problem is in the filter.
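A sketch of the behaviour the filter should have (not the actual AmIpFilter 
code): keep the query string when building the redirect URL.

{code}
import javax.servlet.http.HttpServletRequest;

// Illustrative only: carry "?id=0" (and any other query string) over to the
// proxied URL instead of dropping it on redirect.
final class RedirectUrl {
  static String buildRedirect(String proxyBase, HttpServletRequest req) {
    String target = proxyBase + req.getRequestURI();
    String query = req.getQueryString();
    return query == null ? target : target + "?" + query;
  }
}
{code}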



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-6513) Fix for FindBugs getPendingLogFilesToUpload() possible NPE

2017-04-22 Thread Wilfred Spiegelenburg (JIRA)
Wilfred Spiegelenburg created YARN-6513:
---

 Summary: Fix for FindBugs getPendingLogFilesToUpload() possible NPE
 Key: YARN-6513
 URL: https://issues.apache.org/jira/browse/YARN-6513
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Wilfred Spiegelenburg


{code}
Possible null pointer dereference in 
org.apache.hadoop.yarn.logaggregation.AggregatedLogFormat$LogValue.getPendingLogFilesToUpload(File)
 due to return value of called method
Bug type NP_NULL_ON_SOME_PATH_FROM_RETURN_VALUE (click for details) 
In class org.apache.hadoop.yarn.logaggregation.AggregatedLogFormat$LogValue
In method 
org.apache.hadoop.yarn.logaggregation.AggregatedLogFormat$LogValue.getPendingLogFilesToUpload(File)
Local variable stored in JVM register ?
Method invoked at AggregatedLogFormat.java:[line 314]
Known null at AggregatedLogFormat.java:[line 314]
{code}
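This is the classic File#listFiles() null return. A hedged sketch of the guard 
(not the actual AggregatedLogFormat code):

{code}
import java.io.File;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Illustrative only: listFiles() returns null on an I/O error or a missing
// directory, so guard it before iterating over the pending log files.
final class PendingLogs {
  static List<File> candidateFiles(File appLogDir) {
    File[] files = appLogDir.listFiles();
    return files == null ? Collections.<File>emptyList() : Arrays.asList(files);
  }
}
{code}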



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-6512) Fix for FindBugs getProcessList() possible NPE

2017-04-22 Thread Wilfred Spiegelenburg (JIRA)
Wilfred Spiegelenburg created YARN-6512:
---

 Summary: Fix for FindBugs getProcessList() possible NPE
 Key: YARN-6512
 URL: https://issues.apache.org/jira/browse/YARN-6512
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Wilfred Spiegelenburg
Assignee: Wilfred Spiegelenburg


Findbugs output:
{code}
Possible null pointer dereference in 
org.apache.hadoop.yarn.util.ProcfsBasedProcessTree.getProcessList() due to 
return value of called method
Bug type NP_NULL_ON_SOME_PATH_FROM_RETURN_VALUE
In class org.apache.hadoop.yarn.util.ProcfsBasedProcessTree
In method org.apache.hadoop.yarn.util.ProcfsBasedProcessTree.getProcessList()
Value loaded from processDirs
Dereferenced at ProcfsBasedProcessTree.java:[line 487]
Known null at ProcfsBasedProcessTree.java:[line 484]
{code}
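Same class of fix, sketched for the /proc listing (illustrative, not the actual 
ProcfsBasedProcessTree code):

{code}
import java.io.File;

// Illustrative only: File#list() returns null if /proc cannot be read, so
// return an empty array instead of dereferencing the null result.
final class ProcList {
  static String[] processDirs(File procfsDir) {
    String[] names = procfsDir.list();
    return names == null ? new String[0] : names;
  }
}
{code}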



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-6510) Fix warning - procfs stat file is not in the expected format: YARN-3344 is not enough

2017-04-21 Thread Wilfred Spiegelenburg (JIRA)
Wilfred Spiegelenburg created YARN-6510:
---

 Summary: Fix warning - procfs stat file is not in the expected 
format: YARN-3344 is not enough
 Key: YARN-6510
 URL: https://issues.apache.org/jira/browse/YARN-6510
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 3.0.0-alpha2, 2.8.0
Reporter: Wilfred Spiegelenburg
Assignee: Wilfred Spiegelenburg


Even with the fix for YARN-3344 we still have issues with the procfs format.

This is the case that is causing issues:
{code}
[user@nm1 ~]$ cat /proc/2406/stat
2406 (ib_fmr(mlx4_0)) S 2 0 0 0 -1 2149613632 0 0 0 0 166 126908 0 0 20 0 1 0 
4284 0 0 18446744073709551615 0 0 0 0 0 0 0 2147483647 0 18446744073709551615 0 
0 17 6 0 0 0 0 0
{code}

We do not handle the parentheses in the name, which causes the pattern matching 
to fail.
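A hedged sketch of a parse that tolerates parentheses inside the name: treat 
everything between the first '(' and the last ')' as the command name, then 
split the remaining fields on whitespace (illustrative, not the actual YARN 
parser).

{code}
// Illustrative only: "2406 (ib_fmr(mlx4_0)) S 2 0 ..." parses into
// ["2406", "ib_fmr(mlx4_0)", "S", "2", "0", ...].
final class ProcfsStatLine {
  static String[] split(String statLine) {
    int open = statLine.indexOf('(');
    int close = statLine.lastIndexOf(')');          // the LAST ')' ends the comm field
    String pid = statLine.substring(0, open).trim();
    String comm = statLine.substring(open + 1, close);
    String[] rest = statLine.substring(close + 1).trim().split("\\s+");
    String[] fields = new String[rest.length + 2];
    fields[0] = pid;
    fields[1] = comm;
    System.arraycopy(rest, 0, fields, 2, rest.length);
    return fields;
  }
}
{code}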



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-6490) Turn on assign multiple after removing continuous scheduling

2017-04-17 Thread Wilfred Spiegelenburg (JIRA)
Wilfred Spiegelenburg created YARN-6490:
---

 Summary: Turn on assign multiple after removing continuous 
scheduling
 Key: YARN-6490
 URL: https://issues.apache.org/jira/browse/YARN-6490
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Wilfred Spiegelenburg
Assignee: Wilfred Spiegelenburg


To help loading up a cluster when not using continuous scheduling change the 
default for {{yarn.scheduler.fair.assignmultiple}} from {{false}} to {{true}}.

This requires the change from YARN-5035 to make sure that we leverage assigning 
more than one container to a node per heartbeat.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-6489) Remove continuous scheduling code

2017-04-17 Thread Wilfred Spiegelenburg (JIRA)
Wilfred Spiegelenburg created YARN-6489:
---

 Summary: Remove continuous scheduling code
 Key: YARN-6489
 URL: https://issues.apache.org/jira/browse/YARN-6489
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Wilfred Spiegelenburg
Assignee: Wilfred Spiegelenburg






--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-6488) Remove continuous scheduling tests

2017-04-17 Thread Wilfred Spiegelenburg (JIRA)
Wilfred Spiegelenburg created YARN-6488:
---

 Summary: Remove continuous scheduling tests
 Key: YARN-6488
 URL: https://issues.apache.org/jira/browse/YARN-6488
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: fairscheduler
Reporter: Wilfred Spiegelenburg
Assignee: Wilfred Spiegelenburg


Remove all continuous scheduling tests from the code



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-6487) FairScheduler: remove continuous scheduling (YARN-1010)

2017-04-17 Thread Wilfred Spiegelenburg (JIRA)
Wilfred Spiegelenburg created YARN-6487:
---

 Summary: FairScheduler: remove continuous scheduling (YARN-1010)
 Key: YARN-6487
 URL: https://issues.apache.org/jira/browse/YARN-6487
 Project: Hadoop YARN
  Issue Type: Task
  Components: fairscheduler
Affects Versions: 2.7.0
Reporter: Wilfred Spiegelenburg
Assignee: Wilfred Spiegelenburg


Remove the deprecated FairScheduler continuous scheduling code.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-6486) FairScheduler: Deprecate continuous scheduling in 2.9

2017-04-17 Thread Wilfred Spiegelenburg (JIRA)
Wilfred Spiegelenburg created YARN-6486:
---

 Summary: FairScheduler: Deprecate continuous scheduling in 2.9
 Key: YARN-6486
 URL: https://issues.apache.org/jira/browse/YARN-6486
 Project: Hadoop YARN
  Issue Type: Task
  Components: fairscheduler
Affects Versions: 2.9.0
Reporter: Wilfred Spiegelenburg
Assignee: Wilfred Spiegelenburg


Mark continuous scheduling as deprecated in 2.9 and remove the code in 3.0. 
Removing continuous scheduling from the code will be tracked in a separate jira.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-5895) TestRMRestart#testFinishedAppRemovalAfterRMRestart is still flakey

2016-11-16 Thread Wilfred Spiegelenburg (JIRA)
Wilfred Spiegelenburg created YARN-5895:
---

 Summary: TestRMRestart#testFinishedAppRemovalAfterRMRestart is 
still flakey 
 Key: YARN-5895
 URL: https://issues.apache.org/jira/browse/YARN-5895
 Project: Hadoop YARN
  Issue Type: Bug
  Components: test
Affects Versions: 3.0.0-alpha1
Reporter: Wilfred Spiegelenburg


Even after YARN-5362 the test is still flaky:
{code}
Tests run: 29, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 100.652 sec 
<<< FAILURE! - in org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart
testFinishedAppRemovalAfterRMRestart(org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart)
  Time elapsed: 0.338 sec  <<< FAILURE!
java.lang.AssertionError: expected null, but was:
at org.junit.Assert.fail(Assert.java:88)
at org.junit.Assert.failNotNull(Assert.java:664)
at org.junit.Assert.assertNull(Assert.java:646)
at org.junit.Assert.assertNull(Assert.java:656)
at 
org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart.testFinishedAppRemovalAfterRMRestart(TestRMRestart.java:1659)
{code}

The test finishes with two asserts. It is the second assert that fails; 
YARN-5362 looked at a failure on the first of the two asserts.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-5722) FairScheduler hides group resolution exceptions when assigning queue

2016-10-11 Thread Wilfred Spiegelenburg (JIRA)
Wilfred Spiegelenburg created YARN-5722:
---

 Summary: FairScheduler hides group resolution exceptions when 
assigning queue 
 Key: YARN-5722
 URL: https://issues.apache.org/jira/browse/YARN-5722
 Project: Hadoop YARN
  Issue Type: Bug
  Components: fairscheduler
Affects Versions: 3.0.0-alpha1, 2.6.5
Reporter: Wilfred Spiegelenburg
Assignee: Wilfred Spiegelenburg


When a group based placement rule is used and the user does not have any groups 
the reason for rejecting the application is hidden. An assignment will fail as 
follows:

{code}
 
 
{code}

The error logged on the client side:
{code}
09/30 15:59:27 INFO mapreduce.JobSubmitter: Cleaning up the staging area 
/user/test_user/.staging/job_1475223610304_6043 
16/09/30 15:59:27 WARN security.UserGroupInformation: 
PriviledgedActionException as:test_user (auth:SIMPLE) 
cause:java.io.IOException: Failed to run job : Error assigning app to queue 
default 
java.io.IOException: Failed to run job : Error assigning app to queue default 
at org.apache.hadoop.mapred.YARNRunner.submitJob(YARNRunner.java:301) 
at 
org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:244)
 
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1307) 
{code}

The {{default}} queue name is passed in as part of the application submission 
and is not actually the queue that was tried.
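A sketch of the behaviour argued for here (names are illustrative): surface the 
underlying group resolution failure in the rejection instead of only reporting 
the requested queue.

{code}
// Illustrative only: include the cause of the placement failure in the message
// that reaches the client, e.g. "no groups found for user test_user".
final class PlacementRejection {
  static String message(String requestedQueue, Exception groupLookupFailure) {
    return "Error assigning app to a queue (requested queue: " + requestedQueue
        + "): " + groupLookupFailure.getMessage();
  }
}
{code}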



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Resolved] (YARN-2093) Fair Scheduler IllegalStateException after upgrade from 2.2.0 to 2.4.1-SNAP

2016-09-29 Thread Wilfred Spiegelenburg (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wilfred Spiegelenburg resolved YARN-2093.
-
Resolution: Duplicate

> Fair Scheduler IllegalStateException after upgrade from 2.2.0 to 2.4.1-SNAP
> ---
>
> Key: YARN-2093
> URL: https://issues.apache.org/jira/browse/YARN-2093
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.4.1
>Reporter: Jon Bringhurst
>Assignee: Wilfred Spiegelenburg
>
> After upgrading from 2.2.0 to 2.4.1-SNAP, I ran into the following on startup:
> {noformat}
> 21:19:34,308  INFO RMAppAttemptImpl:659 - 
> appattempt_1400092144371_0003_09 State change from SUBMITTED to SCHEDULED
> 21:19:34,309  INFO RMAppAttemptImpl:659 - 
> appattempt_1400092144371_0004_08 State change from SUBMITTED to SCHEDULED
> 21:19:34,310  INFO RMAppAttemptImpl:659 - 
> appattempt_1400092144371_0003_10 State change from SUBMITTED to SCHEDULED
> 21:19:34,310  INFO RMAppAttemptImpl:659 - 
> appattempt_1400092144371_0003_11 State change from SUBMITTED to SCHEDULED
> 21:19:34,317  INFO FairScheduler:673 - Added Application Attempt 
> appattempt_1400092144371_0004_09 to scheduler from user: 
> samza-perf-playground
> 21:19:34,318  INFO FairScheduler:673 - Added Application Attempt 
> appattempt_1400092144371_0004_10 to scheduler from user: 
> samza-perf-playground
> 21:19:34,318  INFO RMAppAttemptImpl:659 - 
> appattempt_1400092144371_0004_09 State change from SUBMITTED to SCHEDULED
> 21:19:34,318  INFO FairScheduler:733 - Application 
> appattempt_1400092144371_0003_05 is done. finalState=FAILED
> 21:19:34,319  INFO RMAppAttemptImpl:659 - 
> appattempt_1400092144371_0004_10 State change from SUBMITTED to SCHEDULED
> 21:19:34,319  INFO AppSchedulingInfo:108 - Application 
> application_1400092144371_0003 requests cleared
> 21:19:34,319  INFO FairScheduler:673 - Added Application Attempt 
> appattempt_1400092144371_0004_11 to scheduler from user: 
> samza-perf-playground
> 21:19:34,320  INFO FairScheduler:733 - Application 
> appattempt_1400092144371_0003_06 is done. finalState=FAILED
> 21:19:34,320  INFO AppSchedulingInfo:108 - Application 
> application_1400092144371_0003 requests cleared
> 21:19:34,320  INFO RMAppAttemptImpl:659 - 
> appattempt_1400092144371_0004_11 State change from SUBMITTED to SCHEDULED
> 21:19:34,323 FATAL ResourceManager:600 - Error in handling event type 
> APP_ATTEMPT_REMOVED to the scheduler
> java.lang.IllegalStateException: Given app to remove 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSSchedulerApp@429f809d
>  does not exist in queue [root.samza-perf-playground, demand= vCores:0>, running=, share=, 
> w=]
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.removeApp(FSLeafQueue.java:93)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.removeApplicationAttempt(FairScheduler.java:774)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1201)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:122)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:591)
>   at java.lang.Thread.run(Thread.java:744)
> 21:19:34,330  INFO ResourceManager:604 - Exiting, bbye..
> 21:19:34,335  INFO log:67 - Stopped SelectChannelConnector@:8088
> 21:19:34,437  INFO Server:2398 - Stopping server on 8033
> 21:19:34,438  INFO Server:694 - Stopping IPC Server listener on 8033
> {noformat}
> Last commit message for this build is (branch-2.4 on 
> github.com/apache/hadoop-common):
> {noformat}
> commit 09e24d5519187c0db67aacc1992be5d43829aa1e
> Author: Arpit Agarwal 
> Date:   Tue May 20 20:18:46 2014 +
> HADOOP-10562. Fix CHANGES.txt entry again
> 
> git-svn-id: 
> https://svn.apache.org/repos/asf/hadoop/common/branches/branch-2.4@1596389 
> 13f79535-47bb-0310-9956-ffa450edef68
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-5674) FairScheduler handles "dots" in user names inconsistently in the config

2016-09-26 Thread Wilfred Spiegelenburg (JIRA)
Wilfred Spiegelenburg created YARN-5674:
---

 Summary: FairScheduler handles "dots" in user names inconsistently 
in the config
 Key: YARN-5674
 URL: https://issues.apache.org/jira/browse/YARN-5674
 Project: Hadoop YARN
  Issue Type: Bug
  Components: fairscheduler
Affects Versions: 2.6.0
Reporter: Wilfred Spiegelenburg
Assignee: Wilfred Spiegelenburg


A user name can contain a dot. Because the user name could be used as the queue 
name, we replace the dot with a defined separator: when defining queues in the 
configuration for users containing a dot we expect that the dot is replaced by 
the "\_dot\_" string.
In the user limits we do not do that and user limits need a normal dot in the 
user name. This is confusing: when you create a scheduler configuration, in some 
places you need to replace the dot and in others you do not. This can cause 
issues where user limits are not enforced as expected.

We should use one way to specify the user, and since the queue naming can not be 
changed we should also use the same "\_dot\_" in the user limits and enforce 
them correctly.
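A sketch of the single, consistent mapping proposed above (illustrative helper, 
not the actual FS config parser):

{code}
// Illustrative only: apply the same "_dot_" substitution when building queue
// names AND when looking up per-user limits, so "first.last" is always treated
// as "first_dot_last" in the scheduler configuration.
final class UserNameMapping {
  static String userToConfigKey(String userName) {
    return userName.replace(".", "_dot_");
  }
}
{code}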



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-5672) FairScheduler: wrong queue name in log when adding application

2016-09-26 Thread Wilfred Spiegelenburg (JIRA)
Wilfred Spiegelenburg created YARN-5672:
---

 Summary: FairScheduler: wrong queue name in log when adding 
application
 Key: YARN-5672
 URL: https://issues.apache.org/jira/browse/YARN-5672
 Project: Hadoop YARN
  Issue Type: Bug
  Components: fairscheduler
Affects Versions: 2.6.0
Reporter: Wilfred Spiegelenburg
Assignee: Wilfred Spiegelenburg
Priority: Minor


The FairScheduler logs the passed-in queue name when adding an application 
instead of the queue returned by the placement policy. Later log entries show 
the correct info:
{code}
INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: 
Accepted application application_1471982804173_6181 from user: wilfred, in 
queue: default, currently num of applications: 1
...
INFO 
org.apache.hadoop.yarn.server.resourcemanager.RMAppManager$ApplicationSummary: 
appId=application_1471982804173_6181,name=oozie:launcher:XXX,user=wilfred,queue=root.wilfred,state=FAILED,trackingUrl=https://10.10.10.10:8088/cluster/app/application_1471982804173_6181,appMasterHost=N/A,startTime=1473580802079,finishTime=1473580809148,finalStatus=FAILED
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-5387) FairScheduler: add the ability to specify a parent queue to all placement rules

2016-07-15 Thread Wilfred Spiegelenburg (JIRA)
Wilfred Spiegelenburg created YARN-5387:
---

 Summary: FairScheduler: add the ability to specify a parent queue 
to all placement rules
 Key: YARN-5387
 URL: https://issues.apache.org/jira/browse/YARN-5387
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: fairscheduler
Reporter: Wilfred Spiegelenburg
Assignee: Wilfred Spiegelenburg


In the current placement policy all rules generate a queue name under the 
root. The only exception is the nestedUserQueue rule. This rule allows a queue 
to be created under a parent queue defined by a second rule.

Instead of creating new rules to also allow nested groups, secondary groups or 
nested queues for any new rules we think of, we should generalise this by 
allowing a parent attribute to be specified in each rule, like the create flag.

The optional parent attribute for a rule should allow the following values:
- empty (which is the same as not specifying the attribute)
- a rule
- a fixed value (with or without the root prefix)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-5272) FairScheduler handles "invalid" queue names inconsistently even after YARN-3241

2016-06-18 Thread Wilfred Spiegelenburg (JIRA)
Wilfred Spiegelenburg created YARN-5272:
---

 Summary: FairScheduler handles "invalid" queue names 
inconsistently even after YARN-3241
 Key: YARN-5272
 URL: https://issues.apache.org/jira/browse/YARN-5272
 Project: Hadoop YARN
  Issue Type: Bug
  Components: fairscheduler
Affects Versions: 2.8.0
Reporter: Wilfred Spiegelenburg
Assignee: Wilfred Spiegelenburg


The fix used in YARN-3241 uses the JDK trim() method to remove leading and 
trailing spaces. The QueueMetrics uses a guava based trim when it splits the 
queues.

The guava based trim uses the unicode definition of a white space which is 
different than the java trim as can be seen 
[here|https://docs.google.com/a/cloudera.com/spreadsheets/d/1kq4ECwPjHX9B8QUCTPclgsDCXYaj7T-FlT4tB5q3ahk/pub]

A queue name with a non-breaking white space will thus still cause the same 
"Metrics source XXX already exists!" MetricsException.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Resolved] (YARN-4698) Negative value in RM UI counters due to double container release

2016-03-08 Thread Wilfred Spiegelenburg (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4698?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wilfred Spiegelenburg resolved YARN-4698.
-
Resolution: Duplicate

This is the same as described in YARN-3933. It handles the case of the 
duplicate release and thus should fix the root cause of the web UI issue. The 
diff attached to the jira fixes the same code.

> Negative value in RM UI counters due to double container release
> 
>
> Key: YARN-4698
> URL: https://issues.apache.org/jira/browse/YARN-4698
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler, resourcemanager
>Affects Versions: 2.5.1
>Reporter: Dmytro Kabakchei
>Assignee: Wilfred Spiegelenburg
>Priority: Minor
> Attachments: Example.log-cut, mitigating2.5.1.diff
>
>
> We noticed that on our cluster there are negative values in RM UI counters:
> - Containers Running: -19
> - Memory Used: -38GB
> - Vcores Used: -19
> After we checked RM logs, we found, that the following events had happened:
> - Assigned container: 67019 times
> - Released container: 67019 times
> - Invalid container released: 19 times
> Some log records related can be found within "Example.log-cut" attachment.
> After some investigation we made a conclusion that there is some kind of race 
> condition for container that was scheduled for killing, but was completed 
> successfully before kill.
> Also, there is a patch that possibly mitigates effects of the issue, but 
> doesn't solve the original problem (see mitigating2.5.1.diff).
> Unfortunately, the cluster and all other logs are lost, because the report 
> was made about a year ago, but wasn't submitted properly. Also, we don't know 
> if the issue exist in other versions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-3821) Scheduler spams log with messages at INFO level

2015-06-17 Thread Wilfred Spiegelenburg (JIRA)
Wilfred Spiegelenburg created YARN-3821:
---

 Summary: Scheduler spams log with messages at INFO level
 Key: YARN-3821
 URL: https://issues.apache.org/jira/browse/YARN-3821
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: capacityscheduler, fairscheduler
Affects Versions: 2.8.0
Reporter: Wilfred Spiegelenburg
Assignee: Wilfred Spiegelenburg
Priority: Minor


The schedulers spam the logs with messages that do not provide any 
actionable information. There is no action taken in the code and there is 
nothing that needs to be done from an administrative point of view. 

Even after the improvements for the messages from YARN-3197 and YARN-3495 
administrators get confused and ask what needs to be done to prevent the log 
spam.

Moving the messages to a debug log level makes far more sense.
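A sketch of what that looks like in practice (message and names are 
illustrative): log at DEBUG so the messages are off by default.

{code}
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Illustrative only: nothing actionable for an administrator, so DEBUG not INFO.
final class SchedulerLogging {
  private static final Logger LOG = LoggerFactory.getLogger(SchedulerLogging.class);

  static void logSkippedNode(String node, String reason) {
    LOG.debug("Skipping node {}: {}", node, reason);
  }
}
{code}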



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-3742) YARN RM will shut down if ZKClient creation times out

2015-05-29 Thread Wilfred Spiegelenburg (JIRA)
Wilfred Spiegelenburg created YARN-3742:
---

 Summary: YARN RM  will shut down if ZKClient creation times out 
 Key: YARN-3742
 URL: https://issues.apache.org/jira/browse/YARN-3742
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.7.0
Reporter: Wilfred Spiegelenburg


The RM goes down, showing the following stacktrace, if the ZK client connection 
fails to be created. We should not exit but transition to standby, stop the 
active services, and let the other RM take over.

{code}
2015-04-19 01:22:20,513  FATAL 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received a 
org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type 
STATE_STORE_OP_FAILED. Cause:
java.io.IOException: Wait for ZKClient creation timed out
at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1066)
at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1090)
at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.existsWithRetries(ZKRMStateStore.java:996)
at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.updateApplicationStateInternal(ZKRMStateStore.java:643)
at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$UpdateAppTransition.transition(RMStateStore.java:162)
at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$UpdateAppTransition.transition(RMStateStore.java:147)
at 
org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362)
at 
org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
at 
org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
at 
org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:806)
at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:879)
at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:874)
at 
org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173)
at 
org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106)
at java.lang.Thread.run(Thread.java:745)
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-3350) YARN RackResolver spams logs with messages at info level

2015-03-15 Thread Wilfred Spiegelenburg (JIRA)
Wilfred Spiegelenburg created YARN-3350:
---

 Summary: YARN RackResolver spams logs with messages at info level
 Key: YARN-3350
 URL: https://issues.apache.org/jira/browse/YARN-3350
 Project: Hadoop YARN
  Issue Type: Improvement
Affects Versions: 2.6.0
Reporter: Wilfred Spiegelenburg
Assignee: Wilfred Spiegelenburg


When you run an application the container logs show a lot of messages for the 
RackResolver:

2015-03-10 00:58:30,483 INFO [RMCommunicator Allocator] 
org.apache.hadoop.yarn.util.RackResolver: Resolved node175.example.com to 
/rack15

A real-world example: a large job generated 20+ messages in 2 milliseconds 
during a sustained period of time, flooding the logs and causing the node to 
run out of disk space.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-2910) FSLeafQueue can throw ConcurrentModificationException

2014-11-26 Thread Wilfred Spiegelenburg (JIRA)
Wilfred Spiegelenburg created YARN-2910:
---

 Summary: FSLeafQueue can throw ConcurrentModificationException
 Key: YARN-2910
 URL: https://issues.apache.org/jira/browse/YARN-2910
 Project: Hadoop YARN
  Issue Type: Bug
  Components: fairscheduler
Affects Versions: 2.5.0
Reporter: Wilfred Spiegelenburg




The lists that maintain the runnable and the non-runnable apps are standard 
ArrayLists, but there is no guarantee that they will only be manipulated by one 
thread in the system. This can lead to the following exception:

2014-11-12 02:29:01,169 ERROR [RMCommunicator Allocator] 
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: ERROR IN CONTACTING 
RM.
java.util.ConcurrentModificationException: 
java.util.ConcurrentModificationException
at java.util.ArrayList$Itr.checkForComodification(ArrayList.java:859)
at java.util.ArrayList$Itr.next(ArrayList.java:831)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.getResourceUsage(FSLeafQueue.java:147)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt.getHeadroom(FSAppAttempt.java:180)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.allocate(FairScheduler.java:923)
at 
org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:516)

Full stack trace in the attached file.

We should guard against that by using a thread-safe alternative such as 
java.util.concurrent.CopyOnWriteArrayList.
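A sketch of the suggested guard (field names are illustrative): iterating a 
CopyOnWriteArrayList works on a snapshot, so a concurrent add or remove can no 
longer throw the exception above.

{code}
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

// Illustrative only: back the per-queue app list with a CopyOnWriteArrayList so
// getResourceUsage() style iteration is safe while other threads mutate the list.
final class LeafQueueUsage {
  private final List<Long> appUsages = new CopyOnWriteArrayList<>();

  void addApp(long usage) { appUsages.add(usage); }

  long getResourceUsage() {
    long total = 0;
    for (long u : appUsages) {   // iterates a snapshot; no ConcurrentModificationException
      total += u;
    }
    return total;
  }
}
{code}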




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-2578) NM does not failover timely if RM node network connection fails

2014-09-21 Thread Wilfred Spiegelenburg (JIRA)
Wilfred Spiegelenburg created YARN-2578:
---

 Summary: NM does not failover timely if RM node network connection 
fails
 Key: YARN-2578
 URL: https://issues.apache.org/jira/browse/YARN-2578
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.5.1
Reporter: Wilfred Spiegelenburg


The NM does not fail over correctly when the network cable of the RM is 
unplugged or the failure is simulated by a service network stop or a firewall 
that drops all traffic on the node. The RM fails over to the standby node when 
the failure is detected as expected. The NM should then re-register with the 
new active RM. This re-register takes a long time (15 minutes or more). Until 
then the cluster has no nodes for processing and applications are stuck.

Reproduction test case which can be used in any environment:
- create a cluster with 3 nodes
node 1: ZK, NN, JN, ZKFC, DN, RM, NM
node 2: ZK, NN, JN, ZKFC, DN, RM, NM
node 3: ZK, JN, DN, NM
- start all services make sure they are in good health
- kill the network connection of the RM that is active using one of the network 
kills from above
- observe the NN and RM failover
- the DN's fail over to the new active NN
- the NM does not recover for a long time
- the logs show a long delay and traces show no change at all

The stack traces of the NM all show the same set of threads. The main thread 
which should be used in the re-register is the Node Status Updater. This 
thread is stuck in:
{code}
Node Status Updater prio=10 tid=0x7f5a6cc99800 nid=0x18d0 in 
Object.wait() [0x7f5a51fc1000]
   java.lang.Thread.State: WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
- waiting on 0xed62f488 (a org.apache.hadoop.ipc.Client$Call)
at java.lang.Object.wait(Object.java:503)
at org.apache.hadoop.ipc.Client.call(Client.java:1395)
- locked 0xed62f488 (a org.apache.hadoop.ipc.Client$Call)
at org.apache.hadoop.ipc.Client.call(Client.java:1362)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
at com.sun.proxy.$Proxy26.nodeHeartbeat(Unknown Source)
at 
org.apache.hadoop.yarn.server.api.impl.pb.client.ResourceTrackerPBClientImpl.nodeHeartbeat(ResourceTrackerPBClientImpl.java:80)
{code}

The client connection which goes through the proxy can be traced back to the 
ResourceTrackerPBClientImpl. The generated proxy does not time out and we 
should be using a version which takes the RPC timeout (from the configuration) 
as a parameter.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)