[jira] [Assigned] (YARN-9035) Allow better troubleshooting of FS container assignments and lack of container assignments
[ https://issues.apache.org/jira/browse/YARN-9035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Szilard Nemeth reassigned YARN-9035: Assignee: (was: Szilard Nemeth) > Allow better troubleshooting of FS container assignments and lack of container assignments > -- > > Key: YARN-9035 > URL: https://issues.apache.org/jira/browse/YARN-9035 > Project: Hadoop YARN > Issue Type: Improvement > Reporter: Szilard Nemeth > Priority: Major > Attachments: YARN-9035.001.patch > > > The call chain starting from {{FairScheduler.attemptScheduling}}, through {{FSQueue}} (parent / leaf) {{assignContainer}} and down to {{FSAppAttempt#assignContainer}} involves many calls and many potential conditions where {{Resources.none()}} can be returned, meaning the container is not allocated. > A number of these empty assignments do not come with a debug log statement, so it is very hard to tell which condition led the {{FairScheduler}} to a decision where containers are not allocated. > On top of that, in many places it is also difficult to tell why a container was allocated to an app attempt. > The goal is to have a common place (i.e. a class) that does all the logging, so users can conveniently control all the logs if they are curious why (or why not) container assignments happened. > Also, it would be handy if readers of the log could easily tell which {{AppAttempt}} a log record was created for; in other words, every log record should include the ID of the application / app attempt, if possible. > > Details of implementation: > As most of the already in-place debug messages were guarded by a condition that checks whether the debug level is enabled on the loggers, I followed a similar pattern. All the relevant log messages are created with the class {{ResourceAssignment}}. > This class is a wrapper for the assigned {{Resource}} object and has a single logger, so clients should use its helper methods to create log records. 
There is a helper method called {{shouldLogReservationActivity}} that checks whether DEBUG or TRACE level is activated on the logger. > See the javadoc on this class for further information. > > {{ResourceAssignment}} is also responsible for adding the app / app attempt ID to every log record (with some exceptions). > A couple of check classes are introduced: they are responsible for running and storing the results of checks that are prerequisites of a successful container allocation. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
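The wrapper described above can be sketched roughly as follows. This is an illustration, not the code from YARN-9035.001.patch: only the class name and {{shouldLogReservationActivity}} come from the issue text; the field names, log format, and helper methods are assumptions, and java.util.logging stands in for the SLF4J logger Hadoop actually uses.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

// Sketch of the ResourceAssignment wrapper described in the issue.
// All names except the class and shouldLogReservationActivity() are
// illustrative assumptions.
class ResourceAssignment {
  private static final Logger LOG = Logger.getLogger("ResourceAssignment");

  private final String appAttemptId; // every record carries the attempt ID

  ResourceAssignment(String appAttemptId) {
    this.appAttemptId = appAttemptId;
  }

  // Single gate for all assignment logging, mirroring the existing
  // isDebugEnabled()-style guards: log only at DEBUG (FINE) or TRACE (FINER).
  static boolean shouldLogReservationActivity() {
    return LOG.isLoggable(Level.FINE) || LOG.isLoggable(Level.FINER);
  }

  // Hypothetical helper that turns on debug-level assignment logging.
  static void enableDebugLogging() {
    LOG.setLevel(Level.FINE);
  }

  // Formats a "no container assigned" record tagged with the app attempt
  // ID, so readers can filter the log per application / app attempt.
  String formatSkipRecord(String reason) {
    return "[" + appAttemptId + "] no container assigned: " + reason;
  }
}
```

The key design point is the single gate: every empty-assignment site calls one helper instead of scattering its own level checks, so enabling DEBUG on one logger surfaces all assignment decisions at once.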
[jira] [Assigned] (YARN-9856) Remove log-aggregation related duplicate function
[ https://issues.apache.org/jira/browse/YARN-9856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Szilard Nemeth reassigned YARN-9856: Assignee: (was: Szilard Nemeth) > Remove log-aggregation related duplicate function > - > > Key: YARN-9856 > URL: https://issues.apache.org/jira/browse/YARN-9856 > Project: Hadoop YARN > Issue Type: Task > Components: log-aggregation, yarn >Affects Versions: 3.3.0 >Reporter: Adam Antal >Priority: Trivial > Attachments: YARN-9856.001.patch, YARN-9856.002.patch > > > [~snemeth] has noticed a duplication in two of the log-aggregation related > functions. > {quote}I noticed duplicated code in > org.apache.hadoop.yarn.logaggregation.LogToolUtils#outputContainerLog, > duplicated in > org.apache.hadoop.yarn.logaggregation.AggregatedLogFormat.LogReader#readContainerLogs. > [...] > {quote} > We should remove the duplication. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Assigned] (YARN-10843) [Umbrella] Tools to help migration from Fair Scheduler to Capacity Scheduler - part II
[ https://issues.apache.org/jira/browse/YARN-10843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Szilard Nemeth reassigned YARN-10843: - Assignee: (was: Szilard Nemeth) > [Umbrella] Tools to help migration from Fair Scheduler to Capacity Scheduler > - part II > -- > > Key: YARN-10843 > URL: https://issues.apache.org/jira/browse/YARN-10843 > Project: Hadoop YARN > Issue Type: Task > Components: capacity scheduler, capacityscheduler >Reporter: Peter Bacsko >Priority: Major > Labels: fs2cs > > Remaining tasks for fs2cs converter. > Phase I was completed under YARN-9698. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Assigned] (YARN-10249) Various ResourceManager tests are failing on branch-3.2
[ https://issues.apache.org/jira/browse/YARN-10249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Szilard Nemeth reassigned YARN-10249: - Assignee: (was: Szilard Nemeth) > Various ResourceManager tests are failing on branch-3.2 > --- > > Key: YARN-10249 > URL: https://issues.apache.org/jira/browse/YARN-10249 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager > Affects Versions: 3.2.0 > Reporter: Benjamin Teke > Priority: Major > Attachments: YARN-10249.branch-3.2.POC001.patch, YARN-10249.branch-3.2.POC002.patch, YARN-10249.branch-3.2.POC003.patch > > > Various tests are failing on branch-3.2. Some examples can be found in YARN-10003, YARN-10002, and YARN-10237. The seemingly common thread is that all of the failing tests are RM/Capacity Scheduler related, and that the failures are flaky. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-7505) RM REST endpoints generate malformed JSON
[ https://issues.apache.org/jira/browse/YARN-7505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Szilard Nemeth updated YARN-7505: - Description: For all endpoints that return DAOs that contain maps, the generated JSON is malformed. For example: {code:java} % curl 'http://localhost:8088/ws/v1/cluster/apps' {"apps":{"app":[{"id":"application_1510777276702_0001","user":"daniel","name":"QuasiMonteCarlo","queue":"root.daniel","state":"RUNNING","finalStatus":"UNDEFINED","progress":5.0,"trackingUI":"ApplicationMaster","trackingUrl":"http://dhcp-10-16-0-181.pa.cloudera.com:8088/proxy/application_1510777276702_0001/","diagnostics":"","clusterId":1510777276702,"applicationType":"MAPREDUCE","applicationTags":"","priority":0,"startedTime":1510777317853,"finishedTime":0,"elapsedTime":21623,"amContainerLogs":"http://dhcp-10-16-0-181.pa.cloudera.com:8042/node/containerlogs/container_1510777276702_0001_01_01/daniel","amHostHttpAddress":"dhcp-10-16-0-181.pa.cloudera.com:8042","amRPCAddress":"dhcp-10-16-0-181.pa.cloudera.com:63371","allocatedMB":5120,"allocatedVCores":4,"reservedMB":0,"reservedVCores":0,"runningContainers":4,"memorySeconds":49820,"vcoreSeconds":26,"queueUsagePercentage":62.5,"clusterUsagePercentage":62.5,"resourceSecondsMap":{"entry":{"key":"test2","value":"0"},"entry":{"key":"test","value":"0"},"entry":{"key":"memory-mb","value":"49820"},"entry":{"key":"vcores","value":"26"}},"preemptedResourceMB":0,"preemptedResourceVCores":0,"numNonAMContainerPreempted":0,"numAMContainerPreempted":0,"preemptedMemorySeconds":0,"preemptedVcoreSeconds":0,"preemptedResourceSecondsMap":{},"resourceRequests":[{"priority":20,"resourceName":"dhcp-10-16-0-181.pa.cloudera.com","capability":{"memory":1024,"vCores":1},"numContainers":8,"relaxLocality":true,"nodeLabelExpression":"","executionTypeRequest":{"executionType":"GUARANTEED","enforceExecutionType":true},"enforceExecutionType":false},{"priority":20,"resourceName":"/default-rack","capability":{"memory":1024,"vCores
":1},"numContainers":8,"relaxLocality":true,"nodeLabelExpression":"","executionTypeRequest":{"executionType":"GUARANTEED","enforceExecutionType":true},"enforceExecutionType":false},{"priority":20,"resourceName":"*","capability":{"memory":1024,"vCores":1},"numContainers":8,"relaxLocality":true,"nodeLabelExpression":"","executionTypeRequest":{"executionType":"GUARANTEED","enforceExecutionType":true},"enforceExecutionType":false}],"logAggregationStatus":"DISABLED","unmanagedApplication":false,"amNodeLabelExpression":"","timeouts":{"timeout":[{"type":"LIFETIME","expiryTime":"UNLIMITED","remainingTimeInSeconds":-1}]}}]}} {code} was: For all endpoints that return DAOs that contain maps, the generated JSON is malformed. For example: % curl 'http://localhost:8088/ws/v1/cluster/apps' {"apps":{"app":[{"id":"application_1510777276702_0001","user":"daniel","name":"QuasiMonteCarlo","queue":"root.daniel","state":"RUNNING","finalStatus":"UNDEFINED","progress":5.0,"trackingUI":"ApplicationMaster","trackingUrl":"http://dhcp-10-16-0-181.pa.cloudera.com:8088/proxy/application_1510777276702_0001/","diagnostics":"","clusterId":1510777276702,"applicationType":"MAPREDUCE","applicationTags":"","priority":0,"startedTime":1510777317853,"finishedTime":0,"elapsedTime":21623,"amContainerLogs":"http://dhcp-10-16-0-181.pa.cloudera.com:8042/node/containerlogs/container_1510777276702_0001_01_01/daniel","amHostHttpAddress":"dhcp-10-16-0-181.pa.cloudera.com:8042","amRPCAddress":"dhcp-10-16-0-181.pa.cloudera.com:63371","allocatedMB":5120,"allocatedVCores":4,"reservedMB":0,"reservedVCores":0,"runningContainers":4,"memorySeconds":49820,"vcoreSeconds":26,"queueUsagePercentage":62.5,"clusterUsagePercentage":62.5,"resourceSecondsMap":{"entry":{"key":"test2","value":"0"},"entry":{"key":"test","value":"0"},"entry":{"key":"memory-mb","value":"49820"},"entry":{"key":"vcores","value":"26"}},"preemptedResourceMB":0,"preemptedResourceVCores":0,"numNonAMContainerPreempted":0,"numAMContainerPreempted":0,"preemptedM
emorySeconds":0,"preemptedVcoreSeconds":0,"preemptedResourceSecondsMap":{},"resourceRequests":[{"priority":20,"resourceName":"dhcp-10-16-0-181.pa.cloudera.com","capability":{"memory":1024,"vCores":1},"numContainers":8,"relaxLocality":true,"nodeLabelExpression":"","executionTypeRequest":{"executionType":"GUARANTEED","enforceExecutionType":true},"enforceExecutionType":false},{"priority":20,"resourceName":"/default-rack","capability":{"memory":1024,"vCores":1},"numContainers":8,"relaxLocality":true,"nodeLabelExpression":"","executionTypeRequest":{"executionType":"GUARANTEED","enforceExecutionType":true},"enforceExecutionType":false},{"priority":20,"resourceName":"*","capability":{"memory":1024,"vCores":1},"numContainers":8,"relaxLocality":true,"nodeLabelExpression":"","executionTypeRequest":{"executionType":"GUARANTEED","enforceExecutionType":true},"enforceExecutionType":false}],"logAggregationStatus":"DISABLED","unmanaged
[jira] [Assigned] (YARN-7505) RM REST endpoints generate malformed JSON
[ https://issues.apache.org/jira/browse/YARN-7505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Szilard Nemeth reassigned YARN-7505: Assignee: (was: Szilard Nemeth) > RM REST endpoints generate malformed JSON > - > > Key: YARN-7505 > URL: https://issues.apache.org/jira/browse/YARN-7505 > Project: Hadoop YARN > Issue Type: Bug > Components: restapi >Affects Versions: 3.0.0 >Reporter: Daniel Templeton >Priority: Critical > Attachments: YARN-7505.001.patch, YARN-7505.002.patch > > > For all endpoints that return DAOs that contain maps, the generated JSON is > malformed. For example: > % curl 'http://localhost:8088/ws/v1/cluster/apps' > {"apps":{"app":[{"id":"application_1510777276702_0001","user":"daniel","name":"QuasiMonteCarlo","queue":"root.daniel","state":"RUNNING","finalStatus":"UNDEFINED","progress":5.0,"trackingUI":"ApplicationMaster","trackingUrl":"http://dhcp-10-16-0-181.pa.cloudera.com:8088/proxy/application_1510777276702_0001/","diagnostics":"","clusterId":1510777276702,"applicationType":"MAPREDUCE","applicationTags":"","priority":0,"startedTime":1510777317853,"finishedTime":0,"elapsedTime":21623,"amContainerLogs":"http://dhcp-10-16-0-181.pa.cloudera.com:8042/node/containerlogs/container_1510777276702_0001_01_01/daniel","amHostHttpAddress":"dhcp-10-16-0-181.pa.cloudera.com:8042","amRPCAddress":"dhcp-10-16-0-181.pa.cloudera.com:63371","allocatedMB":5120,"allocatedVCores":4,"reservedMB":0,"reservedVCores":0,"runningContainers":4,"memorySeconds":49820,"vcoreSeconds":26,"queueUsagePercentage":62.5,"clusterUsagePercentage":62.5,"resourceSecondsMap":{"entry":{"key":"test2","value":"0"},"entry":{"key":"test","value":"0"},"entry":{"key":"memory-mb","value":"49820"},"entry":{"key":"vcores","value":"26"}},"preemptedResourceMB":0,"preemptedResourceVCores":0,"numNonAMContainerPreempted":0,"numAMContainerPreempted":0,"preemptedMemorySeconds":0,"preemptedVcoreSeconds":0,"preemptedResourceSecondsMap":{},"resourceRequests":[{"priority":20,"resourceName
":"dhcp-10-16-0-181.pa.cloudera.com","capability":{"memory":1024,"vCores":1},"numContainers":8,"relaxLocality":true,"nodeLabelExpression":"","executionTypeRequest":{"executionType":"GUARANTEED","enforceExecutionType":true},"enforceExecutionType":false},{"priority":20,"resourceName":"/default-rack","capability":{"memory":1024,"vCores":1},"numContainers":8,"relaxLocality":true,"nodeLabelExpression":"","executionTypeRequest":{"executionType":"GUARANTEED","enforceExecutionType":true},"enforceExecutionType":false},{"priority":20,"resourceName":"*","capability":{"memory":1024,"vCores":1},"numContainers":8,"relaxLocality":true,"nodeLabelExpression":"","executionTypeRequest":{"executionType":"GUARANTEED","enforceExecutionType":true},"enforceExecutionType":false}],"logAggregationStatus":"DISABLED","unmanagedApplication":false,"amNodeLabelExpression":"","timeouts":{"timeout":[{"type":"LIFETIME","expiryTime":"UNLIMITED","remainingTimeInSeconds":-1}]}}]}} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
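The malformation is visible in the {{resourceSecondsMap}} output above: the serializer emits the literal key "entry" once per map element, producing an object with duplicate member names. That is technically parseable, but per the JSON spec most parsers silently keep only the last "entry", so all but one map element is lost. The fix belongs in the JAXB/Jersey serialization configuration; the sketch below (a hand-rolled helper with assumed names, not the actual patch) only illustrates the correct target shape, where each map key becomes a distinct JSON member.

```java
import java.util.Map;

// Illustrative sketch: emit each map key as its own JSON member, e.g.
//   {"memory-mb":49820,"vcores":26}
// instead of the broken repeated-"entry" form
//   {"entry":{"key":"memory-mb","value":"49820"},"entry":{...}}
class JsonMapWriter {
  static String toJson(Map<String, Long> map) {
    StringBuilder sb = new StringBuilder("{");
    boolean first = true;
    for (Map.Entry<String, Long> e : map.entrySet()) {
      if (!first) {
        sb.append(',');
      }
      first = false;
      // Each distinct map key becomes a distinct, unique JSON member name.
      sb.append('"').append(e.getKey()).append("\":").append(e.getValue());
    }
    return sb.append('}').toString();
  }
}
```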
[jira] [Assigned] (YARN-9450) TestCapacityOverTimePolicy#testAllocation fails sporadically
[ https://issues.apache.org/jira/browse/YARN-9450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Szilard Nemeth reassigned YARN-9450: Assignee: (was: Szilard Nemeth) > TestCapacityOverTimePolicy#testAllocation fails sporadically > > > Key: YARN-9450 > URL: https://issues.apache.org/jira/browse/YARN-9450 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler, test >Affects Versions: 3.2.0 >Reporter: Prabhu Joseph >Priority: Major > > TestCapacityOverTimePolicy#testAllocation fails sporadically. Observed in > multiple builds ran for - YARN-9447, YARN-8193, YARN-8051. > {code} > Failed > org.apache.hadoop.yarn.server.resourcemanager.reservation.TestCapacityOverTimePolicy.testAllocation[Duration > 90,000,000, height 0.25, numSubmission 1, periodic 8640)] > Failing for the past 1 build (Since Failed#23900 ) > Took 34 ms. > Stacktrace > junit.framework.AssertionFailedError > at junit.framework.Assert.fail(Assert.java:55) > at junit.framework.Assert.fail(Assert.java:64) > at junit.framework.TestCase.fail(TestCase.java:235) > at > org.apache.hadoop.yarn.server.resourcemanager.reservation.BaseSharingPolicyTest.runTest(BaseSharingPolicyTest.java:146) > at > org.apache.hadoop.yarn.server.resourcemanager.reservation.TestCapacityOverTimePolicy.testAllocation(TestCapacityOverTimePolicy.java:136) > at sun.reflect.GeneratedMethodAccessor31.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50) > at > org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) > at > org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47) > at > org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) > at > 
org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26) > at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325) > at > org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78) > at > org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57) > at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290) > at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71) > at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288) > at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58) > at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268) > at org.junit.runners.ParentRunner.run(ParentRunner.java:363) > at org.junit.runners.Suite.runChild(Suite.java:128) > at org.junit.runners.Suite.runChild(Suite.java:27) > at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290) > at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71) > at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288) > at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58) > at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268) > at org.junit.runners.ParentRunner.run(ParentRunner.java:363) > at > org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:365) > at > org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:273) > at > org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:238) > at > org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:159) > at > org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:384) > at > org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:345) > at > org.apache.maven.surefire.booter.ForkedBooter.execute(ForkedBooter.java:126) > at > org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:418) > Standard Output > 
2019-04-05 23:46:19,022 INFO [main] recovery.RMStateStore > (RMStateStore.java:transition(591)) - Storing reservation > allocation.reservation_-4277767163553399219_8391370105871519867 > 2019-04-05 23:46:19,022 INFO [main] recovery.RMStateStore > (MemoryRMStateStore.java:storeReservationState(258)) - Storing > reservationallocation for > reservation_-4277767163553399219_8391370105871519867 for plan dedicated > 2019-04-05 23:46:19,023 INFO [main] reservation.InMemoryPlan > (InMemoryPlan.java:addReservation(373)) - Successfully added reservation: > reservation_-4277767163553399219_8391370105871519867
[jira] [Assigned] (YARN-10877) SLSSchedulerCommons: Consider using application map from AbstractYarnScheduler and make event handling more consistent
[ https://issues.apache.org/jira/browse/YARN-10877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Szilard Nemeth reassigned YARN-10877: - Assignee: (was: Szilard Nemeth) > SLSSchedulerCommons: Consider using application map from > AbstractYarnScheduler and make event handling more consistent > -- > > Key: YARN-10877 > URL: https://issues.apache.org/jira/browse/YARN-10877 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Szilard Nemeth >Priority: Minor > > This is a follow-up of YARN-10552. > The improvements and things to check are coming from [this > comment|https://issues.apache.org/jira/browse/YARN-10552?focusedCommentId=17277991&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17277991]. > {quote} > appQueueMap was not present in SLSFairScheduler before (it was in > SLSCapacityScheduler) however from > https://github.com/apache/hadoop/blob/trunk/hadoop-tools/hadoop-sls/src/main/java/org/apache/hadoop/yarn/sls/scheduler/SLSFairScheduler.java#L163, > it seems that the super class of the schedulers - > https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/AbstractYarnScheduler.java#L159 > has this already. As such, do we really need to define a new map as a common > map at all in SLSSchedulerCommons or can we somehow reuse the super class's > map? It might need some code updates though. > In regards to the above point, considering SLSFairScheduler did not > previously have any of the following code in handle() method: > {quote} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Assigned] (YARN-10799) Eliminate queue name replacement in ApplicationSubmissionContext based on placement context
[ https://issues.apache.org/jira/browse/YARN-10799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Szilard Nemeth reassigned YARN-10799: - Assignee: (was: Szilard Nemeth) > Eliminate queue name replacement in ApplicationSubmissionContext based on > placement context > --- > > Key: YARN-10799 > URL: https://issues.apache.org/jira/browse/YARN-10799 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Szilard Nemeth >Priority: Major > > This is the long-term fix for YARN-10787: The task is to investigate if it's > possible to eliminate RMAppManager#copyPlacementQueueToSubmissionContext. > This could introduce nasty backward incompatible issues with recovery, so it > should be thought through really carefully. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Assigned] (YARN-9511) TestAuxServices#testRemoteAuxServiceClassPath YarnRuntimeException: The remote jarfile should not be writable by group or others. The current Permission is 436
[ https://issues.apache.org/jira/browse/YARN-9511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Szilard Nemeth reassigned YARN-9511: Assignee: (was: Szilard Nemeth) > TestAuxServices#testRemoteAuxServiceClassPath YarnRuntimeException: The > remote jarfile should not be writable by group or others. The current > Permission is 436 > --- > > Key: YARN-9511 > URL: https://issues.apache.org/jira/browse/YARN-9511 > Project: Hadoop YARN > Issue Type: Bug > Components: test >Reporter: Siyao Meng >Priority: Major > > Found in maven JDK 11 unit test run. Compiled on JDK 8. > {code} > [ERROR] > testRemoteAuxServiceClassPath(org.apache.hadoop.yarn.server.nodemanager.containermanager.TestAuxServices) > Time elapsed: 0.551 s <<< > ERROR!org.apache.hadoop.yarn.exceptions.YarnRuntimeException: The remote > jarfile should not be writable by group or others. The current Permission is > 436 > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices.serviceInit(AuxServices.java:202) > at > org.apache.hadoop.service.AbstractService.init(AbstractService.java:164) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.TestAuxServices.testRemoteAuxServiceClassPath(TestAuxServices.java:268) > at > java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.base/java.lang.reflect.Method.invoke(Method.java:566) > at > org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47) > at > org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) > at > org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44) > at > org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) > at 
org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:271) > at > org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:70) > at > org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:50) > at org.junit.runners.ParentRunner$3.run(ParentRunner.java:238) > at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:63) > at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:236) > at org.junit.runners.ParentRunner.access$000(ParentRunner.java:53) > at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:229) > at org.junit.runners.ParentRunner.run(ParentRunner.java:309) > at > org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:365) > at > org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:273) > at > org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:238) > at > org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:159) > at > org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:384) > at > org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:345) > at > org.apache.maven.surefire.booter.ForkedBooter.execute(ForkedBooter.java:126) > at > org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:418) > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
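The confusing "Permission is 436" in the exception above is a decimal rendering of an octal mode: 436 decimal is 0664 octal, i.e. rw-rw-r--, so the jar really is group-writable and the AuxServices check rightly rejects it. A sketch of that kind of permission test, assuming the mode arrives as a plain short/int as in Hadoop's FsPermission (the names below are illustrative, not the actual AuxServices code):

```java
// Sketch of a group/other-writability check over a numeric POSIX mode.
class JarPermissionCheck {
  private static final int GROUP_WRITE = 0020; // octal bit masks
  private static final int OTHER_WRITE = 0002;

  // True when the mode has the group-write or other-write bit set,
  // which is what the remote-jarfile check rejects.
  static boolean isWritableByGroupOrOthers(int perm) {
    return (perm & (GROUP_WRITE | OTHER_WRITE)) != 0;
  }

  // Render the mode in octal too, so the error message is less confusing
  // than the bare decimal "436" in the stack trace above.
  static String describe(int perm) {
    return "decimal " + perm + " = octal 0" + Integer.toOctalString(perm);
  }
}
```

Fixing the test likely means creating the jar with 0644 (or tighter) permissions, since 0664 trips the check by design.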
[jira] [Assigned] (YARN-10264) Add container launch related env / classpath debug info to container logs when a container fails
[ https://issues.apache.org/jira/browse/YARN-10264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Szilard Nemeth reassigned YARN-10264: - Assignee: (was: Szilard Nemeth) > Add container launch related env / classpath debug info to container logs > when a container fails > > > Key: YARN-10264 > URL: https://issues.apache.org/jira/browse/YARN-10264 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Szilard Nemeth >Priority: Major > > Sometimes when a container fails to launch, it can be pretty hard to figure > out why it has failed. > Similar to YARN-4309, we can add a switch to control if the printing of > environment variables and Java classpath should be done. > As a bonus, > [jdeps|https://docs.oracle.com/javase/8/docs/technotes/tools/unix/jdeps.html] > could also be utilized to print some verbose info about the classpath. > When log aggregation occurs, all this information will automatically get > collected and make debugging such container launch failures much easier. > Below is an example output when the user faces a classpath configuration > issue while launching an application: > {code:java} > End of LogType:prelaunch.err > ** > 2020-04-19 05:49:12,145 DEBUG:app_info:Diagnostics of the failed app > 2020-04-19 05:49:12,145 DEBUG:app_info:Application > application_1587300264561_0001 failed 2 times due to AM Container for > appattempt_1587300264561_0001_02 exited with exitCode: 1 > Failing this attempt.Diagnostics: [2020-04-19 12:45:01.955]Exception from > container-launch. > Container id: container_e60_1587300264561_0001_02_01 > Exit code: 1 > Exception message: Launch container failed > Shell output: main : command provided 1 > main : run as user is systest > main : requested yarn user is systest > Getting exit code file... > Creating script paths... > Writing pid file... 
> Writing to tmp file > /dataroot/ycloud/yarn/nm/nmPrivate/application_1587300264561_0001/container_e60_1587300264561_0001_02_01/container_e60_1587300264561_0001_02_01.pid.tmp > Writing to cgroup task files... > Creating local dirs... > Launching container... > Getting exit code file... > Creating script paths... > [2020-04-19 12:45:01.984]Container exited with a non-zero exit code 1. Error > file: prelaunch.err. > Last 4096 bytes of prelaunch.err : > Last 4096 bytes of stderr : > Error: Could not find or load main class > org.apache.hadoop.mapreduce.v2.app.MRAppMaster > Please check whether your etc/hadoop/mapred-site.xml contains the below > configuration: > > yarn.app.mapreduce.am.env > HADOOP_MAPRED_HOME=${full path of your hadoop distribution > directory} > > > mapreduce.map.env > HADOOP_MAPRED_HOME=${full path of your hadoop distribution > directory} > > > mapreduce.reduce.env > HADOOP_MAPRED_HOME=${full path of your hadoop distribution > directory} > > [2020-04-19 12:45:01.985]Container exited with a non-zero exit code 1. Error > file: prelaunch.err. > Last 4096 bytes of prelaunch.err : > Last 4096 bytes of stderr : > Error: Could not find or load main class > org.apache.hadoop.mapreduce.v2.app.MRAppMaster > Please check whether your etc/hadoop/mapred-site.xml contains the below > configuration: > > yarn.app.mapreduce.am.env > HADOOP_MAPRED_HOME=${full path of your hadoop distribution > directory} > > > mapreduce.map.env > HADOOP_MAPRED_HOME=${full path of your hadoop distribution > directory} > > > mapreduce.reduce.env > HADOOP_MAPRED_HOME=${full path of your hadoop distribution > directory} > > For more detailed output, check the application tracking page: > http://quasar-plnefj-2.quasar-plnefj.root.hwx.site:8088/cluster/app/application_1587300264561_0001 > Then click on links to logs of each attempt. > ... 
> 2020-04-19 05:49:12,148 INFO:util:* End test_app_API > (yarn.suite.YarnAPITests) * > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
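The proposed debug dump could look roughly like the sketch below: when the (proposed, switch-guarded) option is on and a container fails to launch, the launch environment and classpath are appended to the container log, so log aggregation collects them automatically. Everything here, class and method names included, is an illustrative assumption rather than YARN code.

```java
import java.util.Map;
import java.util.TreeMap;

// Sketch of the proposed env / classpath debug output for failed launches.
class LaunchDebugInfo {
  // Dump the container's launch environment, sorted for stable,
  // diff-friendly output (handy when comparing two failed attempts).
  static String dumpEnvironment(Map<String, String> env) {
    StringBuilder sb = new StringBuilder("Container launch environment:\n");
    for (Map.Entry<String, String> e : new TreeMap<>(env).entrySet()) {
      sb.append("  ").append(e.getKey()).append('=')
        .append(e.getValue()).append('\n');
    }
    return sb.toString();
  }

  // One classpath entry per line makes a missing jar (like the absent
  // MRAppMaster above) easy to spot.
  static String dumpClasspath(String classpath) {
    return "Container classpath:\n  "
        + String.join("\n  ", classpath.split(":"));
  }
}
```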
[jira] [Assigned] (YARN-10798) Enhancements in RMAppManager: createAndPopulateNewRMApp and copyPlacementQueueToSubmissionContext
[ https://issues.apache.org/jira/browse/YARN-10798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Szilard Nemeth reassigned YARN-10798: - Assignee: (was: Szilard Nemeth) > Enhancements in RMAppManager: createAndPopulateNewRMApp and copyPlacementQueueToSubmissionContext > - > > Key: YARN-10798 > URL: https://issues.apache.org/jira/browse/YARN-10798 > Project: Hadoop YARN > Issue Type: Improvement > Reporter: Szilard Nemeth > Priority: Major > > As a follow-up of YARN-10787, we need to do the following: > 1. Rename RMAppManager#copyPlacementQueueToSubmissionContext: this method does not really copy anything; it simply overrides the queue value. > 2. Add a debug log to print the csqueue object before the authorization code: [Code block|https://github.com/apache/hadoop/blob/2541efa496ba0e7e096ee5ec3c08d64b62036402/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/RMAppManager.java#L459-L475] > 3. Fix log messages: as 'copyPlacementQueueToSubmissionContext' overrides (rather than copies) the original queue name with the queue name from the PlacementContext, all calls to submissionContext.getQueue() will return the short queue name. This results in very misleading log messages, including the exception message itself: > {code} > org.apache.hadoop.yarn.exceptions.YarnException: > org.apache.hadoop.security.AccessControlException: User someuser1 does not > have permission to submit application_1621540945412_0001 to queue somequeue > {code} > All log messages should print the original submission queue, if possible. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Assigned] (YARN-9047) FairScheduler: default resource calculator is not resource type aware
[ https://issues.apache.org/jira/browse/YARN-9047?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Szilard Nemeth reassigned YARN-9047: Assignee: (was: Szilard Nemeth) > FairScheduler: default resource calculator is not resource type aware > - > > Key: YARN-9047 > URL: https://issues.apache.org/jira/browse/YARN-9047 > Project: Hadoop YARN > Issue Type: Bug > Components: fairscheduler > Reporter: Wilfred Spiegelenburg > Priority: Major > Attachments: YARN-9047.001.patch, YARN-9047.002.patch, YARN-9047.003.patch > > > The FairScheduler#getResourceCalculator always returns the default resource calculator. The default calculator is not resource type aware and should only be used if there are no resource types configured. > We need to make sure that the direct hard-coded reference to {{RESOURCE_CALCULATOR}} is either safe to use in all cases or is not used in the scheduler. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
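The selection rule the issue implies can be sketched as follows: the default (memory-only) calculator is only safe while memory and vcores are the sole configured resource types; anything beyond them calls for a dominant-resource style calculator. The method below is an illustration of that rule with assumed names, not Hadoop's actual API.

```java
import java.util.Set;

// Sketch: pick a calculator name based on the configured resource types.
class CalculatorChoice {
  static String chooseCalculator(Set<String> configuredResourceTypes) {
    // memory-mb and vcores are always present; any additional type
    // (e.g. "gpu") means the memory-only default calculator would
    // silently ignore resources during comparisons.
    for (String type : configuredResourceTypes) {
      if (!type.equals("memory-mb") && !type.equals("vcores")) {
        return "DominantResourceCalculator";
      }
    }
    return "DefaultResourceCalculator";
  }
}
```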
[jira] [Assigned] (YARN-8078) TestDistributedShell#testDSShellWithoutDomainV2 fails on trunk
[ https://issues.apache.org/jira/browse/YARN-8078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Szilard Nemeth reassigned YARN-8078: Assignee: (was: Szilard Nemeth) > TestDistributedShell#testDSShellWithoutDomainV2 fails on trunk > -- > > Key: YARN-8078 > URL: https://issues.apache.org/jira/browse/YARN-8078 > Project: Hadoop YARN > Issue Type: Test >Reporter: Weiwei Yang >Priority: Major > Labels: UT > > java.lang.AssertionError: Unexpected number of YARN_CONTAINER_FINISHED event > published. > Expected :1 > Actual :0 > at org.junit.Assert.failNotEquals(Assert.java:743) > at org.junit.Assert.assertEquals(Assert.java:118) > at org.junit.Assert.assertEquals(Assert.java:555) > at > org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell.verifyEntityForTimelineV2(TestDistributedShell.java:692) > at > org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell.checkTimelineV2(TestDistributedShell.java:584) > at > org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell.testDSShell(TestDistributedShell.java:450) > at > org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell.testDSShell(TestDistributedShell.java:309) > at > org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell.testDSShellWithoutDomainV2(TestDistributedShell.java:305) -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Assigned] (YARN-9421) Implement SafeMode for ResourceManager by defining a resource threshold
[ https://issues.apache.org/jira/browse/YARN-9421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Szilard Nemeth reassigned YARN-9421: Assignee: (was: Szilard Nemeth) > Implement SafeMode for ResourceManager by defining a resource threshold > --- > > Key: YARN-9421 > URL: https://issues.apache.org/jira/browse/YARN-9421 > Project: Hadoop YARN > Issue Type: New Feature >Reporter: Szilard Nemeth >Priority: Major > Attachments: client-log.log, nodemanager.log, resourcemanager.log > > > We have a hypothetical testcase in our test suite that tests Resource Types. > The test does the following: > 1. Sets up a resource named "gpu" > 2. Out of 9 NodeManager nodes, 1 node has 100 of "gpu". > 3. It executes a sleep job with resource requests: > "-Dmapreduce.reduce.resource.gpu=7" and > "-Dyarn.app.mapreduce.am.resource.gpu=11" > Sometimes, we encounter situations when the app submission fails with: > {code:java} > 2019-02-25 06:09:56,795 WARN > org.apache.hadoop.yarn.server.resourcemanager.RMAppManager: RM app submission > failed in validating AM resource request for application > application_1551103768202_0001 > org.apache.hadoop.yarn.exceptions.InvalidResourceRequestException: Invalid > resource request! Cannot allocate containers as requested resource is greater > than maximum allowed allocation. Requested resource type=[gpu], Requested > resource=, maximum allowed > allocation=, please note that maximum allowed > allocation is calculated by scheduler based on maximum resource of registered > NodeManagers, which might be less than configured maximum > allocation={code} > It's clearly visible that the maximum allowed allocation does not have any > "gpu" resources. > > Looking into the logs further, I realized that sometimes the node having the > "gpu" resources is registered after the app is submitted. > In a real world situation and even with this very special test execution, we > can't be sure in which order NMs register with the RM. 
> With the advent of resource types, this issue became more likely to surface. > If we have a cluster with some "rare" resources like GPUs only on some nodes > out of 100, we can quickly run into a situation when the NMs with GPUs are > registering later than the normal nodes. While the critical NMs are still > registering, we will most likely experience the same > InvalidResourceRequestException if we submit jobs requesting GPUs. > There is a naive solution to this: > 1. Give some time for RM to wait for NMs to be able to register themselves > and put submitted applications on hold. This could work in some situations > but it's not the most flexible solution as different clusters can have > different requirements. Of course, we can make this more flexible by making > the timeout value configurable. > *A more flexible alternative would be:* > 2. We define a threshold of Resource capability: While we haven't reached > this threshold, we put submitted jobs on hold. Once we have reached the threshold, > we let jobs pass through. > This is very similar to an already existing concept, the SafeMode in HDFS > ([https://hadoop.apache.org/docs/r2.4.1/hadoop-project-dist/hadoop-hdfs/HdfsUserGuide.html#Safemode]). > Back to my GPU example above, the threshold could be: 8 vcores, 16GB, 3 > GPUs. > Defining a threshold like this, we can ensure most of the submitted jobs > won't be lost, just "parked" until NMs are registered. > The final solution could be the Resource threshold, or the combination of the > threshold and timeout value. I'm open to any other suggestions as well. > *Last but not least, a very easy way to reproduce the issue on a 3 node > cluster:* > 1. Configure a resource type, named 'testres'. > 2. Node1 runs RM, Node 2/3 runs NMs > 3. Node2 has 1 testres > 4. Node3 has 0 testres > 5. Stop all nodes > 6. Start RM on Node1 > 7. Start NM on Node3 (the one without the resource) > 8. 
Start a pi job, request 1 testres for the AM > Here's the command to start the job: > {code:java} > MY_HADOOP_VERSION=3.3.0-SNAPSHOT;pushd /opt/hadoop;bin/yarn jar > "./share/hadoop/mapreduce/hadoop-mapreduce-examples-$MY_HADOOP_VERSION.jar" > pi -Dyarn.app.mapreduce.am.resource.testres=1 1 1000;popd{code} > > *Configurations*: > node1: yarn-site.xml of ResourceManager: > {code:java} > <property> > <name>yarn.resource-types</name> > <value>testres</value> > </property> > {code} > node2: yarn-site.xml of NodeManager: > {code:java} > <property> > <name>yarn.resource-types</name> > <value>testres</value> > </property> > <property> > <name>yarn.nodemanager.resource-type.testres</name> > <value>1</value> > </property> > {code} > node3: yarn-site.xml of NodeManager: > {code:java} > <property> > <name>yarn.resource-types</name> > <value>testres</value> > </property> > {code} > Please see full process logs from RM, NM, YARN-client attached. -- This message was sent by Atlassian Jira (v8.20.10#820010) ---
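The Resource-threshold idea described above can be sketched as a simple capability check (hypothetical class and resource-name keys; not actual ResourceManager code): submitted apps stay "parked" while the aggregate capability of registered NMs is below the configured threshold.

```java
// Illustrative sketch of the proposed "safe mode" threshold for YARN-9421.
// SafeModeThreshold and the resource keys are assumptions, not Hadoop APIs.
import java.util.HashMap;
import java.util.Map;

public class SafeModeThreshold {
    private final Map<String, Long> threshold = new HashMap<>();
    private final Map<String, Long> registered = new HashMap<>();

    public SafeModeThreshold(long vcores, long memoryMb, long gpus) {
        threshold.put("vcores", vcores);
        threshold.put("memory-mb", memoryMb);
        threshold.put("gpu", gpus);
    }

    // Called whenever a NodeManager registers: add its capability to the total.
    public void nodeRegistered(Map<String, Long> nodeResources) {
        nodeResources.forEach((name, value) -> registered.merge(name, value, Long::sum));
    }

    // Submitted apps are held back while any resource is below its threshold.
    public boolean inSafeMode() {
        return threshold.entrySet().stream()
            .anyMatch(e -> registered.getOrDefault(e.getKey(), 0L) < e.getValue());
    }

    public static void main(String[] args) {
        // The GPU example from the description: 8 vcores, 16 GB, 3 GPUs.
        SafeModeThreshold sm = new SafeModeThreshold(8, 16384, 3);
        System.out.println("before any NM registers: inSafeMode=" + sm.inSafeMode());
        Map<String, Long> gpuNode = new HashMap<>();
        gpuNode.put("vcores", 8L);
        gpuNode.put("memory-mb", 16384L);
        gpuNode.put("gpu", 3L);
        sm.nodeRegistered(gpuNode);
        System.out.println("after the GPU node registers: inSafeMode=" + sm.inSafeMode());
    }
}
```

Combining this check with the timeout idea from point 1 would simply mean forcing `inSafeMode()` to false once the configured wait period expires.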
[jira] [Assigned] (YARN-5684) testDecreaseAfterIncreaseWithAllocationExpiration fails intermittently
[ https://issues.apache.org/jira/browse/YARN-5684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Szilard Nemeth reassigned YARN-5684: Assignee: (was: Szilard Nemeth) > testDecreaseAfterIncreaseWithAllocationExpiration fails intermittently > --- > > Key: YARN-5684 > URL: https://issues.apache.org/jira/browse/YARN-5684 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Reporter: Rohith Sharma K S >Priority: Major > > Saw the following in a precommit: > {code} > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestIncreaseAllocationExpirer > testDecreaseAfterIncreaseWithAllocationExpiration(org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestIncreaseAllocationExpirer) > Time elapsed: 10.726 sec <<< FAILURE! > java.lang.AssertionError: expected:<3> but was:<2> > at org.junit.Assert.fail(Assert.java:88) > at org.junit.Assert.failNotEquals(Assert.java:743) > at org.junit.Assert.assertEquals(Assert.java:118) > at org.junit.Assert.assertEquals(Assert.java:555) > at org.junit.Assert.assertEquals(Assert.java:542) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestIncreaseAllocationExpirer.testDecreaseAfterIncreaseWithAllocationExpiration(TestIncreaseAllocationExpirer.java:367) > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Assigned] (YARN-6286) TestCapacityScheduler.testAMLimitUsage throws UndeclaredThrowableException
[ https://issues.apache.org/jira/browse/YARN-6286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Szilard Nemeth reassigned YARN-6286: Assignee: (was: Szilard Nemeth) > TestCapacityScheduler.testAMLimitUsage throws UndeclaredThrowableException > -- > > Key: YARN-6286 > URL: https://issues.apache.org/jira/browse/YARN-6286 > Project: Hadoop YARN > Issue Type: Bug > Components: test >Reporter: Sunil G >Priority: Major > Labels: capacityscheduler > > {code} > testAMLimitUsage(org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestCapacityScheduler) > Time elapsed: 0.124 sec <<< ERROR! > java.lang.reflect.UndeclaredThrowableException: null > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.validateResourceRequest(SchedulerUtils.java:253) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.normalizeAndValidateRequest(SchedulerUtils.java:218) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.normalizeAndValidateRequest(SchedulerUtils.java:189) > at > org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.validateAndCreateResourceRequest(RMAppManager.java:497) > at > org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.createAndPopulateNewRMApp(RMAppManager.java:384) > at > org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.submitApplication(RMAppManager.java:295) > at > org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.submitApplication(ClientRMService.java:664) > at > org.apache.hadoop.yarn.server.resourcemanager.MockRM$2.run(MockRM.java:752) > at > org.apache.hadoop.yarn.server.resourcemanager.MockRM$2.run(MockRM.java:746) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1965) > at > org.apache.hadoop.yarn.server.resourcemanager.MockRM.submitApp(MockRM.java:765) > at > 
org.apache.hadoop.yarn.server.resourcemanager.MockRM.submitApp(MockRM.java:665) > at > org.apache.hadoop.yarn.server.resourcemanager.MockRM.submitApp(MockRM.java:572) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestCapacityScheduler.verifyAMLimitForLeafQueue(TestCapacityScheduler.java:3370) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestCapacityScheduler.testAMLimitUsage(TestCapacityScheduler.java:3232) > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-5897) Use drainEvent to replace sleep-wait in MockRM#waitForState
[ https://issues.apache.org/jira/browse/YARN-5897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Szilard Nemeth updated YARN-5897: - Summary: Use drainEvent to replace sleep-wait in MockRM#waitForState (was: using drainEvent to replace sleep-wait in MockRM#waitForState) > Use drainEvent to replace sleep-wait in MockRM#waitForState > --- > > Key: YARN-5897 > URL: https://issues.apache.org/jira/browse/YARN-5897 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: sandflee >Assignee: Szilard Nemeth >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Assigned] (YARN-8818) Yarn log aggregation of spark streaming job
[ https://issues.apache.org/jira/browse/YARN-8818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Szilard Nemeth reassigned YARN-8818: Assignee: (was: Szilard Nemeth) > Yarn log aggregation of spark streaming job > --- > > Key: YARN-8818 > URL: https://issues.apache.org/jira/browse/YARN-8818 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Ayush Chauhan >Priority: Major > > By default, YARN aggregates logs after an application completes. But I am > trying to aggregate logs for a spark streaming job which in theory will run > forever. I have set the following properties for log aggregation and > restarted yarn by restarting {{hadoop-yarn-nodemanager}} for core & task > nodes and {{hadoop-yarn-resourcemanager}} for master node on my emr cluster. > I can view my changes in [http://node-ip:8088/conf]. > {noformat} > yarn.log-aggregation-enable => true{noformat} > {noformat} > yarn.log-aggregation.retain-seconds => 172800{noformat} > {noformat} > yarn.log-aggregation.retain-check-interval-seconds => -1 {noformat} > {noformat} > yarn.nodemanager.log-aggregation.roll-monitoring-interval-seconds => > 3600{noformat} > All the articles and resources only mention including the > {{yarn.nodemanager.log-aggregation.roll-monitoring-interval-seconds}} > property, after which yarn will start aggregating logs for running jobs. But it is not > working in my case. > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Assigned] (YARN-7631) ResourceRequest with different Capacity (Resource) overrides each other in RM and thus lost
[ https://issues.apache.org/jira/browse/YARN-7631?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Szilard Nemeth reassigned YARN-7631: Assignee: (was: Szilard Nemeth) > ResourceRequest with different Capacity (Resource) overrides each other in RM > and thus lost > --- > > Key: YARN-7631 > URL: https://issues.apache.org/jira/browse/YARN-7631 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Botong Huang >Priority: Major > Attachments: resourcebug.patch > > > Today in AMRMClientImpl, the ResourceRequests (RR) are kept as: RequestId -> > Priority -> ResourceName -> ExecutionType -> Resource (Capacity) -> > ResourceRequestInfo (the actual RR). This means that only RRs with the same > (requestId, priority, resourcename, executionType, resource) will be grouped > and aggregated together. > While in RM side, the mapping is SchedulerRequestKey (RequestId, priority) -> > LocalityAppPlacementAllocator (ResourceName -> RR). > The issue is that in RM side Resource is not in the key to the RR at all. > (Note that executionType is also not in the RM side, but it is fine because > RM handles it separately as container update requests.) This means that under > the same value of (requestId, priority, resourcename), RRs with different > Resource values will be grouped together and override each other in RM. As a > result, some of the container requests are lost and will never be allocated. > Furthermore, since the two RRs are kept under different keys in AMRMClient > side, allocation of RR1 will only trigger cancel for RR1; the pending RR2 > will not get resent either. > I've attached a unit test (resourcebug.patch), which is failing in trunk, to > illustrate this issue. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
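The key mismatch described above can be illustrated with plain maps (hypothetical key types mirroring the two mappings quoted in the description; this is not actual YARN code): the client keys a request by (requestId, priority, resourceName, executionType, resource), while the RM side drops the Resource from the key, so two requests differing only in Resource collide and one overrides the other.

```java
// Illustrative sketch of the YARN-7631 key mismatch. ClientKey/RmKey are
// assumptions modeling the mappings described in the issue, not Hadoop types.
import java.util.HashMap;
import java.util.Map;

public class KeyMismatchDemo {
    record ClientKey(long requestId, int priority, String resourceName,
                     String executionType, String resource) {}
    record RmKey(long requestId, int priority, String resourceName) {}

    static int clientSideEntries() {
        Map<ClientKey, String> requests = new HashMap<>();
        // Two requests differing only in the requested Resource value.
        requests.put(new ClientKey(1, 0, "*", "GUARANTEED", "memory=2048,vcores=1"), "RR1");
        requests.put(new ClientKey(1, 0, "*", "GUARANTEED", "memory=4096,vcores=2"), "RR2");
        return requests.size(); // both requests survive as distinct entries
    }

    static int rmSideEntries() {
        Map<RmKey, String> requests = new HashMap<>();
        // Same two requests under the RM's key, which omits the Resource:
        // the second put overrides the first, so one request is silently lost.
        requests.put(new RmKey(1, 0, "*"), "RR1");
        requests.put(new RmKey(1, 0, "*"), "RR2");
        return requests.size();
    }

    public static void main(String[] args) {
        System.out.println("client-side entries: " + clientSideEntries()); // 2
        System.out.println("RM-side entries: " + rmSideEntries());         // 1
    }
}
```

This also shows why the client never resends RR2: from the client's perspective RR1 and RR2 are independent entries, so cancelling RR1 does nothing for the RR2 that the RM already overwrote.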
[jira] [Assigned] (YARN-7903) Method getStarvedResourceRequests() only consider the first encountered resource
[ https://issues.apache.org/jira/browse/YARN-7903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Szilard Nemeth reassigned YARN-7903: Assignee: (was: Szilard Nemeth) > Method getStarvedResourceRequests() only consider the first encountered > resource > > > Key: YARN-7903 > URL: https://issues.apache.org/jira/browse/YARN-7903 > Project: Hadoop YARN > Issue Type: Bug > Components: fairscheduler >Affects Versions: 3.1.0 >Reporter: Yufei Gu >Priority: Major > > We need to specify rack and ANY while submitting a node local resource > request, as YARN-7561 discussed. For example: > {code} > ResourceRequest nodeRequest = > createResourceRequest(GB, node1.getHostName(), 1, 1, false); > ResourceRequest rackRequest = > createResourceRequest(GB, node1.getRackName(), 1, 1, false); > ResourceRequest anyRequest = > createResourceRequest(GB, ResourceRequest.ANY, 1, 1, false); > List<ResourceRequest> resourceRequests = > Arrays.asList(nodeRequest, rackRequest, anyRequest); > {code} > However, method getStarvedResourceRequests() only considers the first > encountered resource, which most likely is ResourceRequest.ANY. That's a > mismatch for locality requests. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Assigned] (YARN-10837) Break down effectiveMinRatio calculation in ResourceConfigMode
[ https://issues.apache.org/jira/browse/YARN-10837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Szilard Nemeth reassigned YARN-10837: - Assignee: (was: Szilard Nemeth) > Break down effectiveMinRatio calculation in ResourceConfigMode > -- > > Key: YARN-10837 > URL: https://issues.apache.org/jira/browse/YARN-10837 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Andras Gyori >Priority: Major > > In ResourceConfigMode, the effectiveMinRatio resource calculation is hard to > understand, not documented and involves long methods. It must be refactored > and cleaned up in order to eliminate the future code debt. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Assigned] (YARN-4929) Explore a better way than sleeping for a while in some test cases
[ https://issues.apache.org/jira/browse/YARN-4929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Szilard Nemeth reassigned YARN-4929: Assignee: (was: Szilard Nemeth) > Explore a better way than sleeping for a while in some test cases > - > > Key: YARN-4929 > URL: https://issues.apache.org/jira/browse/YARN-4929 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Yufei Gu >Priority: Major > > The following unit test cases failed because we removed the minimum wait time > for attempt in YARN-4807. I manually added sleeps so the tests pass and added > a TODO in the code. We can explore a better way to do it. > - TestAMRestart.testRMAppAttemptFailuresValidityInterval > - TestApplicationMasterService.testResourceTypes > - TestContainerResourceUsage.testUsageAfterAMRestartWithMultipleContainers > - TestRMApplicationHistoryWriter.testRMWritingMassiveHistoryForFairSche > - TestRMApplicationHistoryWriter.testRMWritingMassiveHistoryForCapacitySche -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Assigned] (YARN-7908) TestSystemMetricsPublisher#testPublishContainerMetrics can fail with an NPE
[ https://issues.apache.org/jira/browse/YARN-7908?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Szilard Nemeth reassigned YARN-7908: Assignee: (was: Szilard Nemeth) > TestSystemMetricsPublisher#testPublishContainerMetrics can fail with an NPE > --- > > Key: YARN-7908 > URL: https://issues.apache.org/jira/browse/YARN-7908 > Project: Hadoop YARN > Issue Type: Bug > Components: test >Affects Versions: 2.8.3 >Reporter: Jason Darrell Lowe >Priority: Major > > testPublishContainerMetrics can fail with a NullPointerException: > {noformat} > Running > org.apache.hadoop.yarn.server.resourcemanager.metrics.TestSystemMetricsPublisher > Tests run: 5, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 4.42 sec <<< > FAILURE! - in > org.apache.hadoop.yarn.server.resourcemanager.metrics.TestSystemMetricsPublisher > testPublishContainerMetrics(org.apache.hadoop.yarn.server.resourcemanager.metrics.TestSystemMetricsPublisher) > Time elapsed: 0.031 sec <<< ERROR! > java.lang.NullPointerException: null > at > org.apache.hadoop.yarn.server.resourcemanager.metrics.TestSystemMetricsPublisher.testPublishContainerMetrics(TestSystemMetricsPublisher.java:454) > {noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Assigned] (YARN-10836) Clean up queue config mode methods
[ https://issues.apache.org/jira/browse/YARN-10836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Szilard Nemeth reassigned YARN-10836: - Assignee: (was: Szilard Nemeth) > Clean up queue config mode methods > -- > > Key: YARN-10836 > URL: https://issues.apache.org/jira/browse/YARN-10836 > Project: Hadoop YARN > Issue Type: Improvement > Components: capacity scheduler >Reporter: Andras Gyori >Priority: Major > > After YARN-10759 is merged, it would be advisable to refactor long methods > inside the different classes. Also the error messages could be improved. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-7548) TestCapacityOverTimePolicy.testAllocation is flaky
[ https://issues.apache.org/jira/browse/YARN-7548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17820070#comment-17820070 ] Szilard Nemeth commented on YARN-7548: -- [~susheel_7] Sure, feel free to assign it to yourself and work on it. > TestCapacityOverTimePolicy.testAllocation is flaky > -- > > Key: YARN-7548 > URL: https://issues.apache.org/jira/browse/YARN-7548 > Project: Hadoop YARN > Issue Type: Bug > Components: reservation system >Affects Versions: 3.0.0-beta1 >Reporter: Haibo Chen >Assignee: Susheel Gupta >Priority: Major > > *Reported at: 15/Nov/18 20:32* > It failed in both YARN-7337 and YARN-6921 jenkins jobs. > org.apache.hadoop.yarn.server.resourcemanager.reservation.TestCapacityOverTimePolicy.testAllocation[Duration > 90,000,000, height 0.25, numSubmission 1, periodic 8640)] > *Stacktrace* > {code:java} > junit.framework.AssertionFailedError: null > at junit.framework.Assert.fail(Assert.java:55) > at junit.framework.Assert.fail(Assert.java:64) > at junit.framework.TestCase.fail(TestCase.java:235) > at > org.apache.hadoop.yarn.server.resourcemanager.reservation.BaseSharingPolicyTest.runTest(BaseSharingPolicyTest.java:146) > at > org.apache.hadoop.yarn.server.resourcemanager.reservation.TestCapacityOverTimePolicy.testAllocation(TestCapacityOverTimePolicy.java:136){code} > *Standard Output* > {code:java} > 2017-11-20 23:57:03,759 INFO [main] recovery.RMStateStore > (RMStateStore.java:transition(538)) - Storing reservation > allocation.reservation_-9026698577416205920_6337917439559340517 > 2017-11-20 23:57:03,759 INFO [main] recovery.RMStateStore > (MemoryRMStateStore.java:storeReservationState(247)) - Storing > reservationallocation for > reservation_-9026698577416205920_6337917439559340517 for plan dedicated > 2017-11-20 23:57:03,760 INFO [main] reservation.InMemoryPlan > (InMemoryPlan.java:addReservation(373)) - Successfully added reservation: > reservation_-9026698577416205920_6337917439559340517 to plan. 
> In-memory Plan: Parent Queue: dedicatedTotal Capacity: vCores:1000>Step: 1000reservation_-9026698577416205920_6337917439559340517 > user:u1 startTime: 0 endTime: 8640 Periodiciy: 8640 alloc: > [Period: 8640 > 0: > 3423748: > 86223748: > 8640: > 9223372036854775807: null > ] > {code} > *Reported at: 21/Feb/24* > Ran TestCapacityOverTimePolicy testcase locally 100 times in a row and found > it failed 5 times with the below error: > [INFO] Running > org.apache.hadoop.yarn.server.resourcemanager.reservation.TestCapacityOverTimePolicy > [ERROR] Tests run: 30, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: > 0.503 s <<< FAILURE! - in > org.apache.hadoop.yarn.server.resourcemanager.reservation.TestCapacityOverTimePolicy > [ERROR] testAllocation[Duration 60,000, height 0.25, numSubmission 3, > periodic > 720)](org.apache.hadoop.yarn.server.resourcemanager.reservation.TestCapacityOverTimePolicy) > Time elapsed: 0.009 s <<< ERROR! > org.apache.hadoop.yarn.server.resourcemanager.reservation.exceptions.PlanningQuotaException: > Integral (avg over time) quota capacity 0.25 over a window of 86400 seconds, > would be exceeded by accepting reservation: > reservation_-7619846766601560789_3793931544284185119 > at > org.apache.hadoop.yarn.server.resourcemanager.reservation.CapacityOverTimePolicy.validate(CapacityOverTimePolicy.java:206) > at > org.apache.hadoop.yarn.server.resourcemanager.reservation.InMemoryPlan.addReservation(InMemoryPlan.java:348) > at > org.apache.hadoop.yarn.server.resourcemanager.reservation.BaseSharingPolicyTest.runTest(BaseSharingPolicyTest.java:141) > at > org.apache.hadoop.yarn.server.resourcemanager.reservation.TestCapacityOverTimePolicy.testAllocation(TestCapacityOverTimePolicy.java:136) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at 
java.lang.reflect.Method.invoke(Method.java:498) > at > org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59) > at > org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) > at > org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56) > at > org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) > at > org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26) > at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306) > at > org.junit.runners.BlockJUnit4ClassRunner$1.evaluate(B
[jira] [Assigned] (YARN-10853) Add more tests to TestUsersManager
[ https://issues.apache.org/jira/browse/YARN-10853?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Szilard Nemeth reassigned YARN-10853: - Assignee: (was: Szilard Nemeth) > Add more tests to TestUsersManager > -- > > Key: YARN-10853 > URL: https://issues.apache.org/jira/browse/YARN-10853 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Szilard Nemeth >Priority: Minor > Attachments: UsersManager.html > > > Running TestUsersManager with code coverage measurements only gives 18% line > coverage for class "UsersManager". This value is pretty low. > See the attached coverage report for that class. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Resolved] (YARN-11590) RM process stuck after calling confStore.format() when ZK SSL/TLS is enabled, as netty thread waits indefinitely
[ https://issues.apache.org/jira/browse/YARN-11590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Szilard Nemeth resolved YARN-11590. --- Hadoop Flags: Reviewed Resolution: Fixed > RM process stuck after calling confStore.format() when ZK SSL/TLS is enabled, > as netty thread waits indefinitely > - > > Key: YARN-11590 > URL: https://issues.apache.org/jira/browse/YARN-11590 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Reporter: Ferenc Erdelyi >Assignee: Ferenc Erdelyi >Priority: Major > Labels: pull-request-available > Fix For: 3.4.0 > > > YARN-11468 enabled Zookeeper SSL/TLS support for YARN. > Curator uses ClientCnxnSocketNetty for secured connection and the thread > needs to be closed after calling confStore.format() to avoid the netty thread > waiting indefinitely, which renders the RM unresponsive after deleting the > confstore when started with the "-format-conf-store" arg. > The unclosed thread, which keeps RM running: > {code:java} > 2023-10-10 12:13:01,000 INFO > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: The > Thread[main-SendThread(ferdelyi-1.ferdelyi.root.hwx.site:2182),5,main]TIMED_WAITING > is stands at [sun.misc.Unsafe.park(Native Method), > java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215), > java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078), > > java.util.concurrent.LinkedBlockingDeque.pollFirst(LinkedBlockingDeque.java:522), > java.util.concurrent.LinkedBlockingDeque.poll(LinkedBlockingDeque.java:684), > org.apache.zookeeper.ClientCnxnSocketNetty.doTransport(ClientCnxnSocketNetty.java:275), > org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1289)] > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-11590) RM process stuck after calling confStore.format() when ZK SSL/TLS is enabled, as netty thread waits indefinitely
[ https://issues.apache.org/jira/browse/YARN-11590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Szilard Nemeth updated YARN-11590: -- Fix Version/s: 3.4.0 > RM process stuck after calling confStore.format() when ZK SSL/TLS is enabled, > as netty thread waits indefinitely > - > > Key: YARN-11590 > URL: https://issues.apache.org/jira/browse/YARN-11590 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Reporter: Ferenc Erdelyi >Assignee: Ferenc Erdelyi >Priority: Major > Labels: pull-request-available > Fix For: 3.4.0 > > > YARN-11468 enabled Zookeeper SSL/TLS support for YARN. > Curator uses ClientCnxnSocketNetty for secured connection and the thread > needs to be closed after calling confStore.format() to avoid the netty thread > waiting indefinitely, which renders the RM unresponsive after deleting the > confstore when started with the "-format-conf-store" arg. > The unclosed thread, which keeps RM running: > {code:java} > 2023-10-10 12:13:01,000 INFO > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: The > Thread[main-SendThread(ferdelyi-1.ferdelyi.root.hwx.site:2182),5,main]TIMED_WAITING > is stands at [sun.misc.Unsafe.park(Native Method), > java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215), > java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078), > > java.util.concurrent.LinkedBlockingDeque.pollFirst(LinkedBlockingDeque.java:522), > java.util.concurrent.LinkedBlockingDeque.poll(LinkedBlockingDeque.java:684), > org.apache.zookeeper.ClientCnxnSocketNetty.doTransport(ClientCnxnSocketNetty.java:275), > org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1289)] > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Resolved] (YARN-11468) Zookeeper SSL/TLS support
[ https://issues.apache.org/jira/browse/YARN-11468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Szilard Nemeth resolved YARN-11468. --- Fix Version/s: 3.4.0 Resolution: Fixed > Zookeeper SSL/TLS support > - > > Key: YARN-11468 > URL: https://issues.apache.org/jira/browse/YARN-11468 > Project: Hadoop YARN > Issue Type: Improvement > Components: resourcemanager >Reporter: Ferenc Erdelyi >Assignee: Ferenc Erdelyi >Priority: Critical > Labels: pull-request-available > Fix For: 3.4.0 > > > Zookeeper 3.5.5 server can operate with SSL/TLS secure connection with its > clients. > [https://cwiki.apache.org/confluence/display/ZOOKEEPER/ZooKeeper+SSL+User+Guide] > The SSL communication should be possible in the different parts of YARN, > where it communicates with Zookeeper servers. The Zookeeper clients are used > in the following places: > * ResourceManager > * ZKConfigurationStore > * ZKRMStateStore > The yarn.resourcemanager.zk-client-ssl.enabled flag to enable SSL > communication should be provided in the yarn-default.xml and the required > parameters for the keystore and truststore should be picked up from the > core-default.xml (HADOOP-18709) > yarn.resourcemanager.ha.curator-leader-elector.enabled has to be set to true via > yarn-site.xml to make sure Curator is used, otherwise we can't enable SSL. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-11573) Add config option to make container allocation prefer nodes without reserved containers
[ https://issues.apache.org/jira/browse/YARN-11573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Szilard Nemeth updated YARN-11573: -- Fix Version/s: 3.4.0 > Add config option to make container allocation prefer nodes without reserved > containers > --- > > Key: YARN-11573 > URL: https://issues.apache.org/jira/browse/YARN-11573 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler >Reporter: Szilard Nemeth >Assignee: Szilard Nemeth >Priority: Minor > Labels: pull-request-available > Fix For: 3.4.0 > > > Applications could be stuck when the container allocation logic does not > consider more nodes, but only nodes that are having reserved containers. > This behavior can even block new AMs to be allocated on nodes so they don't > reach the running status. > A jira that mentions the same thing is YARN-9598: > {quote}Nodes which have been reserved should be skipped when iterating > candidates in RegularContainerAllocator#allocate, otherwise scheduler may > generate allocation or reservation proposal on these node which will always > be rejected in FiCaScheduler#commonCheckContainerAllocation. > {quote} > Since this jira implements 2 other points, I decided to create this one and > implement the 3rd point separately. > h2. Notes: > 1. FiCaSchedulerApp#commonCheckContainerAllocation will log this: > {code:java} > Trying to allocate from reserved container in async scheduling mode > {code} > in case RegularContainerAllocator creates a reservation proposal for nodes > having reserved container. > 2. A better way is to prevent generating an AM container (or even normal > container) allocation proposal on a node if it already has a reservation on > it and we still have more nodes to check in the preferred node set. > Completely disabling task containers from being allocated to worker nodes > could limit the downscaling ability that we have currently. > h2. 3. CALL HIERARCHY > 1. 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler#nodeUpdate > 2. > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler#allocateContainersToNode(org.apache.hadoop.yarn.api.records.NodeId, > boolean) > 3. > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler#allocateContainersToNode(org.apache.hadoop.yarn.server.resourcemanager.scheduler.placement.CandidateNodeSet, > boolean) > 3.1. This is the place where it is decided whether to call > allocateContainerOnSingleNode or allocateContainersOnMultiNodes > 4. > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler#allocateContainersOnMultiNodes > 5. > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler#allocateOrReserveNewContainers > 6. > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue#assignContainers > 7. > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.AbstractParentQueue#assignContainersToChildQueues > 8. > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.AbstractLeafQueue#assignContainers > 9. > org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp#assignContainers > 10. > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator#assignContainers > 11. > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator#allocate > 12. > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator#tryAllocateOnNode > 13. > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator#assignContainersOnNode > 14. > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator#assignNodeLocalContainers > 15. 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator#assignContainer > Logs these lines as an example: > {code:java} > 2023-08-23 17:44:08,129 DEBUG > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator: > assignContainers: node= application=application_1692304118418_3151 > priority=0 pendingAsk= vCores:1>,repeat=1> type=OFF_SWITCH > {code} > h2. 4. DETAILS OF RegularContainerAllocator#allocate > [Method > definition|https://github.com/apache/hadoop/blob/9342ecf6ccd5c7ef443a0eb722852d2addc1d5db/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/allocator/RegularContainerAllocator.java#L826-L896] > 4.1. Defining ordered l
[jira] [Updated] (YARN-11573) Add config option to make container allocation prefer nodes without reserved containers
[ https://issues.apache.org/jira/browse/YARN-11573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Szilard Nemeth updated YARN-11573: -- Description: Applications could be stuck when the container allocation logic does not consider more nodes, but only nodes that are having reserved containers. This behavior can even block new AMs to be allocated on nodes so they don't reach the running status. A jira that mentions the same thing is YARN-9598: {quote}Nodes which have been reserved should be skipped when iterating candidates in RegularContainerAllocator#allocate, otherwise scheduler may generate allocation or reservation proposal on these node which will always be rejected in FiCaScheduler#commonCheckContainerAllocation. {quote} Since this jira implements 2 other points, I decided to create this one and implement the 3rd point separately. h2. Notes: 1. FiCaSchedulerApp#commonCheckContainerAllocation will log this: {code:java} Trying to allocate from reserved container in async scheduling mode {code} in case RegularContainerAllocator creates a reservation proposal for nodes having reserved container. 2. A better way is to prevent generating an AM container (or even normal container) allocation proposal on a node if it already has a reservation on it and we still have more nodes to check in the preferred node set. Completely disabling task containers from being allocated to worker nodes could limit the downscaling ability that we have currently. h2. 3. CALL HIERARCHY 1. org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler#nodeUpdate 2. org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler#allocateContainersToNode(org.apache.hadoop.yarn.api.records.NodeId, boolean) 3. org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler#allocateContainersToNode(org.apache.hadoop.yarn.server.resourcemanager.scheduler.placement.CandidateNodeSet, boolean) 3.1. 
This is the place where it is decided whether to call allocateContainerOnSingleNode or allocateContainersOnMultiNodes 4. org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler#allocateContainersOnMultiNodes 5. org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler#allocateOrReserveNewContainers 6. org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue#assignContainers 7. org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.AbstractParentQueue#assignContainersToChildQueues 8. org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.AbstractLeafQueue#assignContainers 9. org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp#assignContainers 10. org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator#assignContainers 11. org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator#allocate 12. org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator#tryAllocateOnNode 13. org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator#assignContainersOnNode 14. org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator#assignNodeLocalContainers 15. org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator#assignContainer Logs these lines as an example: {code:java} 2023-08-23 17:44:08,129 DEBUG org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator: assignContainers: node= application=application_1692304118418_3151 priority=0 pendingAsk=,repeat=1> type=OFF_SWITCH {code} h2. 4. 
DETAILS OF RegularContainerAllocator#allocate [Method definition|https://github.com/apache/hadoop/blob/9342ecf6ccd5c7ef443a0eb722852d2addc1d5db/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/allocator/RegularContainerAllocator.java#L826-L896] 4.1. Defining ordered list of nodes to allocate containers on: [LINK|https://github.com/apache/hadoop/blob/9342ecf6ccd5c7ef443a0eb722852d2addc1d5db/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/allocator/RegularContainerAllocator.java#L851-L852] {code:java} Iterator iter = schedulingPS.getPreferredNodeIterator( candidates); {code} 4.2. org.apache.hadoop.yarn.server.resourcemanager.scheduler.placement.AppPlacementAllocator#getPreferredNodeIterator 4.3. org.apache.hadoop.yarn.server.resourcemanager.scheduler.placement.MultiNodeSortingManager#getMultiNodeSortIterator ([LINK|https://github.com/apache/hadoop/blob/9
[jira] [Updated] (YARN-11573) Add config option to make container allocation prefer nodes without reserved containers
[ https://issues.apache.org/jira/browse/YARN-11573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Szilard Nemeth updated YARN-11573: -- Description: Applications could be stuck when the container allocation logic does not consider more nodes, but only nodes that are having reserved containers. This behavior can even block new AMs to be allocated on nodes so they don't reach the running status. A jira that mentions the same thing is YARN-9598: {quote}Nodes which have been reserved should be skipped when iterating candidates in RegularContainerAllocator#allocate, otherwise scheduler may generate allocation or reservation proposal on these node which will always be rejected in FiCaScheduler#commonCheckContainerAllocation. {quote} Since this jira implements 2 other points, I decided to create this one and implement the 3rd point separately. Notes: 1. FiCaSchedulerApp#commonCheckContainerAllocation will log this: {code:java} Trying to allocate from reserved container in async scheduling mode {code} in case RegularContainerAllocator creates a reservation proposal for nodes having reserved container. 2. A better way is to prevent generating an AM container (or even normal container) allocation proposal on a node if it already has a reservation on it and we still have more nodes to check in the preferred node set. Completely disabling task containers from being allocated to worker nodes could limit the downscaling ability that we have currently. 3. CALL HIERARCHY 1. org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler#nodeUpdate 2. org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler#allocateContainersToNode(org.apache.hadoop.yarn.api.records.NodeId, boolean) 3. org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler#allocateContainersToNode(org.apache.hadoop.yarn.server.resourcemanager.scheduler.placement.CandidateNodeSet, boolean) 3.1. 
This is the place where it is decided whether to call allocateContainerOnSingleNode or allocateContainersOnMultiNodes 4. org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler#allocateContainersOnMultiNodes 5. org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler#allocateOrReserveNewContainers 6. org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue#assignContainers 7. org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.AbstractParentQueue#assignContainersToChildQueues 8. org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.AbstractLeafQueue#assignContainers 9. org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp#assignContainers 10. org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator#assignContainers 11. org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator#allocate 12. org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator#tryAllocateOnNode 13. org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator#assignContainersOnNode 14. org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator#assignNodeLocalContainers 15. org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator#assignContainer Logs these lines as an example: {code:java} 2023-08-23 17:44:08,129 DEBUG org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator: assignContainers: node= application=application_1692304118418_3151 priority=0 pendingAsk=,repeat=1> type=OFF_SWITCH {code} 4. 
DETAILS OF RegularContainerAllocator#allocate [Method definition|https://github.com/apache/hadoop/blob/9342ecf6ccd5c7ef443a0eb722852d2addc1d5db/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/allocator/RegularContainerAllocator.java#L826-L896] 4.1. Defining ordered list of nodes to allocate containers on: [LINK|https://github.com/apache/hadoop/blob/9342ecf6ccd5c7ef443a0eb722852d2addc1d5db/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/allocator/RegularContainerAllocator.java#L851-L852] {code:java} Iterator iter = schedulingPS.getPreferredNodeIterator( candidates); {code} 4.2. org.apache.hadoop.yarn.server.resourcemanager.scheduler.placement.AppPlacementAllocator#getPreferredNodeIterator 4.3. org.apache.hadoop.yarn.server.resourcemanager.scheduler.placement.MultiNodeSortingManager#getMultiNodeSortIterator ([LINK|https://github.com/apache/hadoop/blob/9342ecf6ccd5c7
[jira] [Created] (YARN-11573) Add config option to make container allocation prefer nodes without reserved containers
Szilard Nemeth created YARN-11573: - Summary: Add config option to make container allocation prefer nodes without reserved containers Key: YARN-11573 URL: https://issues.apache.org/jira/browse/YARN-11573 Project: Hadoop YARN Issue Type: Bug Components: capacity scheduler Reporter: Szilard Nemeth Assignee: Szilard Nemeth -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-11523) CapacityScheduler.md is incorrectly formatted
Szilard Nemeth created YARN-11523: - Summary: CapacityScheduler.md is incorrectly formatted Key: YARN-11523 URL: https://issues.apache.org/jira/browse/YARN-11523 Project: Hadoop YARN Issue Type: Improvement Reporter: Szilard Nemeth I noticed that the headers are not formatted corretly, I can see many "###"s instead of proper markdown headings, I think the space is missing between the hash and the name of the headings. See: https://github.com/apache/hadoop/blob/4bd873b816dbd889f410428d6e618586d4ff1780/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/markdown/CapacityScheduler.md -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-11490) JMX QueueMetrics breaks after mutable config validation in CS
[ https://issues.apache.org/jira/browse/YARN-11490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17721946#comment-17721946 ] Szilard Nemeth commented on YARN-11490: --- Hi [~tdomok], Nice finding. I do agree with your statements. 1. The memory leak {quote} Revert YARN-11211, it's a nasty bug and the "leak" only causes problems if the validation API is abused with unique queue names. Note that YARN-11211 did not solve the leak problem either, details above. {quote} Good that you characterized the nature of the leak, I think it's okay to revert YARN-11211 in this case. Please file a separate bug ticket for the leak. 3. Validate separately: {quote} Spawn a separate process for configuration validation with the proper config / state. Not sure if this is feasible or not, but it would be the cleanest. {quote} I agree that this would be the cleanest approach but given the current state of the codebase I really doubt it's easy to implement. > JMX QueueMetrics breaks after mutable config validation in CS > - > > Key: YARN-11490 > URL: https://issues.apache.org/jira/browse/YARN-11490 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler >Affects Versions: 3.4.0 >Reporter: Tamas Domok >Assignee: Tamas Domok >Priority: Major > Labels: pull-request-available > Attachments: addqueue.xml, defaultqueue.json, > hadoop-tdomok-resourcemanager-tdomok-MBP16.log, removequeue.xml, > stopqueue.json > > > Reproduction steps: > 1. Submit a long running job > {code} > hadoop-3.4.0-SNAPSHOT/bin/yarn jar > hadoop-3.4.0-SNAPSHOT/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-3.4.0-SNAPSHOT-tests.jar > sleep -m 1 -r 1 -rt 120 -mt 20 > {code} > 2. Verify that there is one running app > {code} > $ curl http://localhost:8088/ws/v1/cluster/metrics | jq > {code} > 3. Verify that the JMX endpoint reports 1 running app as well > {code} > $ curl http://localhost:8088/jmx | jq > {code} > 4. 
Validate the configuration (x2) > {code} > $ curl -X POST -H 'Content-Type: application/json' -d @defaultqueue.json > localhost:8088/ws/v1/cluster/scheduler-conf/validate > $ cat defaultqueue.json > {"update-queue":{"queue-name":"root.default","params":{"entry":{"key":"maximum-applications","value":"100"}}},"subClusterId":"","global":null,"global-updates":null} > {code} > 5. Check 2. and 3. again. The cluster metrics should still work but the JMX > endpoint will show 0 running apps, that's the bug. > It is caused by YARN-11211, reverting that patch (or only removing the > _QueueMetrics.clearQueueMetrics();_ line) fixes the issue. But I think that > would re-introduce the memory leak. > It looks like the QUEUE_METRICS hash map is "add-only", the > clearQueueMetrics() was only called from ResourceManager.reinitialize() > method (transitionToActive/transitionToStandby) prior to YARN-11211. > Constantly adding and removing queues with unique names would cause a leak as > well, because there is no remove from QUEUE_METRICS, so it is not just the > validation API that has this problem. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-11482) Fix bug of DRF comparison DominantResourceFairnessComparator2 in fair scheduler
[ https://issues.apache.org/jira/browse/YARN-11482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Szilard Nemeth updated YARN-11482: -- Summary: Fix bug of DRF comparison DominantResourceFairnessComparator2 in fair scheduler (was: Fix bug of DRF comparision DominantResourceFairnessComparator2 in fair scheduler) > Fix bug of DRF comparison DominantResourceFairnessComparator2 in fair > scheduler > --- > > Key: YARN-11482 > URL: https://issues.apache.org/jira/browse/YARN-11482 > Project: Hadoop YARN > Issue Type: Bug > Components: fairscheduler >Reporter: Xiaoqiao He >Assignee: Xiaoqiao He >Priority: Major > Labels: pull-request-available > Fix For: 3.4.0, 3.3.6 > > > DominantResourceFairnessComparator2 was using wrong resource info to get if > one queue is needy or not now. We should fix it. > {code:java} > boolean s1Needy = resourceInfo1[dominant1].getValue() < > minShareInfo1[dominant1].getValue(); > boolean s2Needy = resourceInfo1[dominant2].getValue() < > minShareInfo2[dominant2].getValue(); > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-11464) queue element is added to any other leaf queue, it's queueType becomes QueueType.PARENT_QUEUE
[ https://issues.apache.org/jira/browse/YARN-11464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17719603#comment-17719603 ] Szilard Nemeth commented on YARN-11464: --- Hi [~susheel_7] , Is this a test only issue? >From the title it's not clear for me. > queue element is added to any other leaf queue, it's queueType > becomes QueueType.PARENT_QUEUE > > > Key: YARN-11464 > URL: https://issues.apache.org/jira/browse/YARN-11464 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn >Affects Versions: 3.3.4 >Reporter: Susheel Gupta >Priority: Major > > This testcase clearly reproduces the issue. There is a missing dot before > "auto-queue-creation-v2.enabled" for method call assertNoValueForQueues. > {code:java} > @Test > public void testAutoCreateV2FlagsInWeightMode() { > converter = builder.withPercentages(false).build(); > converter.convertQueueHierarchy(rootQueue); > assertTrue("root autocreate v2 flag", > csConfig.getBoolean( > PREFIX + "root.auto-queue-creation-v2.enabled", false)); > assertTrue("root.admins autocreate v2 flag", > csConfig.getBoolean( > PREFIX + "root.admins.auto-queue-creation-v2.enabled", false)); > assertTrue("root.users autocreate v2 flag", > csConfig.getBoolean( > PREFIX + "root.users.auto-queue-creation-v2.enabled", false)); > assertTrue("root.misc autocreate v2 flag", > csConfig.getBoolean( > PREFIX + "root.misc.auto-queue-creation-v2.enabled", false)); > Set leafs = Sets.difference(ALL_QUEUES, > Sets.newHashSet("root", > "root.default", > "root.admins", > "root.users", > "root.misc")); > assertNoValueForQueues(leafs, "auto-queue-creation-v2.enabled", > csConfig); > } {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Resolved] (YARN-11079) Make an AbstractParentQueue to store common ParentQueue and ManagedParentQueue functionality
[ https://issues.apache.org/jira/browse/YARN-11079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Szilard Nemeth resolved YARN-11079. --- Hadoop Flags: Reviewed Resolution: Fixed > Make an AbstractParentQueue to store common ParentQueue and > ManagedParentQueue functionality > > > Key: YARN-11079 > URL: https://issues.apache.org/jira/browse/YARN-11079 > Project: Hadoop YARN > Issue Type: Sub-task > Components: capacity scheduler >Reporter: Benjamin Teke >Assignee: Susheel Gupta >Priority: Major > Labels: pull-request-available > Fix For: 3.4.0 > > > ParentQueue is an instantiable class which stores the necessary functionality > of parent queues, however it is also extended by the > AbstractManagedParentQueue, which is an abstract class for storing managed > parent queue functionality. Since legacy AQC doesn't allow dynamic queues > next to static ones, managed parent queues technically behave like leaf > queues by not having any static child queues when created. This structure and > behaviour is really error prone, as for example if someone is not completely > aware of this and simply changes the checking order by first checking if the > queue in question is a ParentQueue in a method like > MappingRuleValidationContextImpl.isDynamicParent can result a completely > wrong return value (as a ManagedParent is a dynamic parent, but currently > it's also a ParentQueue, and ManagedParent cannot have the > isEligibleForAutoQueueCreation as true, so the method will return false). > {code:java} > private boolean isDynamicParent(CSQueue queue) { > if (queue == null) { > return false; > } > if (queue instanceof ManagedParentQueue) { > return true; > } > if (queue instanceof ParentQueue) { > return ((ParentQueue)queue).isEligibleForAutoQueueCreation(); > } > return false; > } > {code} > Similarly to YARN-11024 an AbstractParentQueue class should be created to > completely separate the managed parents from the instantiable ParentQueue > class. 
-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-11079) Make an AbstractParentQueue to store common ParentQueue and ManagedParentQueue functionality
[ https://issues.apache.org/jira/browse/YARN-11079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Szilard Nemeth updated YARN-11079: -- Fix Version/s: 3.4.0 > Make an AbstractParentQueue to store common ParentQueue and > ManagedParentQueue functionality > > > Key: YARN-11079 > URL: https://issues.apache.org/jira/browse/YARN-11079 > Project: Hadoop YARN > Issue Type: Sub-task > Components: capacity scheduler >Reporter: Benjamin Teke >Assignee: Susheel Gupta >Priority: Major > Labels: pull-request-available > Fix For: 3.4.0 > > > ParentQueue is an instantiable class which stores the necessary functionality > of parent queues, however it is also extended by the > AbstractManagedParentQueue, which is an abstract class for storing managed > parent queue functionality. Since legacy AQC doesn't allow dynamic queues > next to static ones, managed parent queues technically behave like leaf > queues by not having any static child queues when created. This structure and > behaviour is really error prone, as for example if someone is not completely > aware of this and simply changes the checking order by first checking if the > queue in question is a ParentQueue in a method like > MappingRuleValidationContextImpl.isDynamicParent can result a completely > wrong return value (as a ManagedParent is a dynamic parent, but currently > it's also a ParentQueue, and ManagedParent cannot have the > isEligibleForAutoQueueCreation as true, so the method will return false). > {code:java} > private boolean isDynamicParent(CSQueue queue) { > if (queue == null) { > return false; > } > if (queue instanceof ManagedParentQueue) { > return true; > } > if (queue instanceof ParentQueue) { > return ((ParentQueue)queue).isEligibleForAutoQueueCreation(); > } > return false; > } > {code} > Similarly to YARN-11024 an AbstractParentQueue class should be created to > completely separate the managed parents from the instantiable ParentQueue > class. 
-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-10178) Global Scheduler async thread crash caused by 'Comparison method violates its general contract
[ https://issues.apache.org/jira/browse/YARN-10178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Szilard Nemeth updated YARN-10178: -- Description: Stack trace: {code:java} ERROR org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received RMFatalEvent of type CRITICAL_THREAD_CRASH, caused by a critical thread, Thread-6066574, that exited unexpectedly: java.lang.IllegalArgumentException: Comparison method violates its general contract! at java.util.TimSort.mergeHi(TimSort.java:899) at java.util.TimSort.mergeAt(TimSort.java:516) at java.util.TimSort.mergeForceCollapse(TimSort.java:457) at java.util.TimSort.sort(TimSort.java:254) at java.util.Arrays.sort(Arrays.java:1512) at java.util.ArrayList.sort(ArrayList.java:1462) at java.util.Collections.sort(Collections.java:177) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.policy.PriorityUtilizationQueueOrderingPolicy.getAssignmentIterator(PriorityUtilizationQueueOrderingPolicy.java:221) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.sortAndGetChildrenAllocationIterator(ParentQueue.java:777) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:791) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:623) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateOrReserveNewContainers(CapacityScheduler.java:1635) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainerOnSingleNode(CapacityScheduler.java:1629) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1732) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1481) at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.schedule(CapacityScheduler.java:569) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler$AsyncScheduleThread.run(CapacityScheduler.java:616) {code} In JDK 8, Arrays.sort by default is using the timsort algorithm, and timsort has a few requirements: {code:java} 1.x.compareTo(y) != y.compareTo(x) 2.x>y,y>z --> x > z 3.x=y, x.compareTo(z) == y.compareTo(z) {code} If the Array / List does not satisfy any of these requirements, TimSort will throw a java.lang.IllegalArgumentException. 1. If we take a look into PriorityUtilizationQueueOrderingPolicy.compare method, we can see that Capacity Scheduler these queue fields in order to compare resource usage: {code:java} AbsoluteUsedCapacity UsedCapacity ConfiguredMinResource AbsoluteCapacity {code} 2. In CS, during the execution of AsyncScheduleThread while the queues are being sorted in PriorityUtilizationQueueOrderingPolicy, for choosing the queue to assign the container to this IllegalArgumentException is thrown. 3. If we take a look into the ResourceCommitterService method, it tries to commit a CSAssignment coming from the ResourceCommitRequest, look tryCommit function, the queue resource usage is being updated. {code:java} public boolean tryCommit(Resource cluster, ResourceCommitRequest r, boolean updatePending) { long commitStart = System.nanoTime(); ResourceCommitRequest request = (ResourceCommitRequest) r; ... boolean isSuccess = false; if (attemptId != null) { FiCaSchedulerApp app = getApplicationAttempt(attemptId); // Required sanity check for attemptId - when async-scheduling enabled, // proposal might be outdated if AM failover just finished // and proposal queue was not be consumed in time if (app != null && attemptId.equals(app.getApplicationAttemptId())) { if (app.accept(cluster, request, updatePending) && app.apply(cluster, request, updatePending)) { // apply this resource ... 
} } } return isSuccess; } } {code} {code:java} public boolean apply(Resource cluster, ResourceCommitRequest request, boolean updatePending) { ... if (!reReservation) { getCSLeafQueue().apply(cluster, request); } ... } {code} 4. org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue#apply invokes org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue#allocateResource: {code:java} void allocateResource(Resource clusterResource, Resource resource, String nodePartition) { try { writeLock.lock(); // only lock leaf queue lock queueUsage.incUsed(nodePartition, resource); ++numContainers; CSQueueUtils.updateQue
[jira] [Updated] (YARN-10178) Global Scheduler async thread crash caused by 'Comparison method violates its general contract'
[ https://issues.apache.org/jira/browse/YARN-10178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Szilard Nemeth updated YARN-10178: -- Description: Stack trace: {code:java} ERROR org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received RMFatalEvent of type CRITICAL_THREAD_CRASH, caused by a critical thread, Thread-6066574, that exited unexpectedly: java.lang.IllegalArgumentException: Comparison method violates its general contract! at java.util.TimSort.mergeHi(TimSort.java:899) at java.util.TimSort.mergeAt(TimSort.java:516) at java.util.TimSort.mergeForceCollapse(TimSort.java:457) at java.util.TimSort.sort(TimSort.java:254) at java.util.Arrays.sort(Arrays.java:1512) at java.util.ArrayList.sort(ArrayList.java:1462) at java.util.Collections.sort(Collections.java:177) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.policy.PriorityUtilizationQueueOrderingPolicy.getAssignmentIterator(PriorityUtilizationQueueOrderingPolicy.java:221) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.sortAndGetChildrenAllocationIterator(ParentQueue.java:777) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:791) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:623) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateOrReserveNewContainers(CapacityScheduler.java:1635) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainerOnSingleNode(CapacityScheduler.java:1629) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1732) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1481) at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.schedule(CapacityScheduler.java:569) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler$AsyncScheduleThread.run(CapacityScheduler.java:616) {code} In JDK 8, Arrays.sort uses the TimSort algorithm by default, and TimSort places a few requirements on the comparison method: {code:java} 1. sgn(x.compareTo(y)) == -sgn(y.compareTo(x)) 2. x > y && y > z --> x > z 3. x.compareTo(y) == 0 --> sgn(x.compareTo(z)) == sgn(y.compareTo(z)) {code} If the array / list being sorted violates any of these requirements, TimSort will throw a java.lang.IllegalArgumentException. 1. If we take a look into the PriorityUtilizationQueueOrderingPolicy.compare method, we can see that the Capacity Scheduler compares these queue fields in order to compare resource usage: {code:java} AbsoluteUsedCapacity UsedCapacity ConfiguredMinResource AbsoluteCapacity {code} 2. In CS, this IllegalArgumentException is thrown during the execution of the AsyncScheduleThread, while the queues are being sorted in PriorityUtilizationQueueOrderingPolicy to choose the queue to assign the container to. 3. If we take a look into the ResourceCommitterService, it tries to commit a CSAssignment coming from the ResourceCommitRequest; in the tryCommit function, the queue resource usage is updated. {code:java} public boolean tryCommit(Resource cluster, ResourceCommitRequest r, boolean updatePending) { long commitStart = System.nanoTime(); ResourceCommitRequest request = (ResourceCommitRequest) r; ... boolean isSuccess = false; if (attemptId != null) { FiCaSchedulerApp app = getApplicationAttempt(attemptId); // Required sanity check for attemptId - when async-scheduling enabled, // proposal might be outdated if AM failover just finished // and proposal queue was not be consumed in time if (app != null && attemptId.equals(app.getApplicationAttemptId())) { if (app.accept(cluster, request, updatePending) && app.apply(cluster, request, updatePending)) { // apply this resource ... 
} } } return isSuccess; } } {code} {code:java} public boolean apply(Resource cluster, ResourceCommitRequest request, boolean updatePending) { ... if (!reReservation) { getCSLeafQueue().apply(cluster, request); } ... } {code} 4. org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue#apply invokes org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue#allocateResource: {code:java} void allocateResource(Resource clusterResource, Resource resource, String nodePartition) { try { writeLock.lock(); // only lock leaf queue lock queueUsage.incUsed(nodePartition, resource); ++numContainers; CSQueueUtils.updateQue
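The race described above — the committer thread updating queue usage while the async scheduling thread sorts on it — can be reproduced in miniature. The sketch below uses a hypothetical stand-in class (MockQueue is not a Hadoop class) to show how a mid-sort field update breaks the antisymmetry requirement that TimSort enforces:

```java
import java.util.Comparator;

// Hypothetical stand-in for a CSQueue whose usage field is updated by another
// thread mid-sort; purely an illustration of the broken comparator contract.
class MockQueue {
    volatile double usedCapacity;
    MockQueue(double usedCapacity) { this.usedCapacity = usedCapacity; }
}

public class ComparatorContractDemo {
    public static void main(String[] args) {
        Comparator<MockQueue> byUsage =
            Comparator.comparingDouble(q -> q.usedCapacity);

        MockQueue a = new MockQueue(0.3);
        MockQueue b = new MockQueue(0.5);

        int first = byUsage.compare(a, b);  // negative: a's usage < b's usage

        // Simulates ResourceCommitterService applying an allocation mid-sort.
        b.usedCapacity = 0.1;

        int second = byUsage.compare(b, a); // negative: b's usage < a's usage now

        // Antisymmetry demands sgn(compare(a, b)) == -sgn(compare(b, a));
        // the concurrent update broke it, which is what TimSort detects and
        // reports as "Comparison method violates its general contract!".
        boolean violated = Integer.signum(first) != -Integer.signum(second);
        System.out.println("contract violated: " + violated); // true
    }
}
```

In the real scheduler the same effect occurs because allocateResource() mutates queueUsage while sortAndGetChildrenAllocationIterator() is sorting on values derived from it.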
[jira] [Resolved] (YARN-11415) Refactor TestConfigurationFieldsBase and the connected test classes
[ https://issues.apache.org/jira/browse/YARN-11415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Szilard Nemeth resolved YARN-11415. --- Resolution: Not A Problem > Refactor TestConfigurationFieldsBase and the connected test classes > --- > > Key: YARN-11415 > URL: https://issues.apache.org/jira/browse/YARN-11415 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Benjamin Teke >Assignee: Szilard Nemeth >Priority: Major > Labels: pull-request-available > > YARN-11413 pointed out a strange way of how the configuration tests are > executed. The first problem is that there is a > [Pattern|https://github.com/apache/hadoop/blob/570b503e3e7e7adf5b0a8fabca76003298216543/hadoop-common-project/hadoop-common/src/test/java/org/apache/hadoop/conf/TestConfigurationFieldsBase.java#L197], > that matches only numbers, letters, dots, hyphens and underscores, but not > %, which is used in string replacements (e.g. > {{yarn.nodemanager.aux-services.%s.classpath}} ), so essentially every > property that's present in any configuration object and doesn't match this > pattern is silently skipped, and documenting it will result in invalid test > failures, ergo the test encourages introducing props and not documenting > them. The pattern should be fixed in YARN-11413 for %s, but its necessity > could be checked. > Another issue with this is that it works in a semi-opposite way of what it's > supposed to do. To ensure all of the configuration entries are documented it > should iterate through all of the configuration fields and check if those > have matching xyz-default.xml entries, but currently it just reports the > entries that are present in the xyz-default.xml and missing in the matching > configuration file. Since this test checks all the configuration objects this > might need some other follow-ups to document the missing properties from > other components if there are any. 
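The effect of the missing % in the pattern can be sketched like this (the character classes below are illustrative reconstructions, not the exact pattern from TestConfigurationFieldsBase):

```java
import java.util.regex.Pattern;

public class ConfigKeyPatternDemo {
    public static void main(String[] args) {
        // Illustrative: only letters, digits, dots, hyphens, underscores...
        Pattern withoutPercent = Pattern.compile("^[A-Za-z][A-Za-z0-9_.\\-]+$");
        // ...versus the same class extended with '%' (the YARN-11413 fix).
        Pattern withPercent = Pattern.compile("^[A-Za-z][A-Za-z0-9_.\\-%]+$");

        String key = "yarn.nodemanager.aux-services.%s.classpath";
        System.out.println(withoutPercent.matcher(key).matches()); // false: silently skipped
        System.out.println(withPercent.matcher(key).matches());    // true: recognized
    }
}
```

A key rejected by the first pattern never reaches the "known fields" set, so documenting it in yarn-default.xml produces exactly the invalid test failure described above.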
-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-11415) Refactor TestConfigurationFieldsBase and the connected test classes
[ https://issues.apache.org/jira/browse/YARN-11415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17696896#comment-17696896 ] Szilard Nemeth commented on YARN-11415: --- Based on our offline discussion with [~bteke] , I'm closing this. > Refactor TestConfigurationFieldsBase and the connected test classes > --- > > Key: YARN-11415 > URL: https://issues.apache.org/jira/browse/YARN-11415 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Benjamin Teke >Assignee: Szilard Nemeth >Priority: Major > Labels: pull-request-available > > YARN-11413 pointed out a strange way of how the configuration tests are > executed. The first problem is that there is a > [Pattern|https://github.com/apache/hadoop/blob/570b503e3e7e7adf5b0a8fabca76003298216543/hadoop-common-project/hadoop-common/src/test/java/org/apache/hadoop/conf/TestConfigurationFieldsBase.java#L197], > that matches only numbers, letters, dots, hyphens and underscores, but not > %, which is used in string replacements (e.g. > {{yarn.nodemanager.aux-services.%s.classpath}} ), so essentially every > property that's present in any configuration object and doesn't match this > pattern is silently skipped, and documenting it will result in invalid test > failures, ergo the test encourages introducing props and not documenting > them. The pattern should be fixed in YARN-11413 for %s, but its necessity > could be checked. > Another issue with this is that it works in a semi-opposite way of what it's > supposed to do. To ensure all of the configuration entries are documented it > should iterate through all of the configuration fields and check if those > have matching xyz-default.xml entries, but currently it just reports the > entries that are present in the xyz-default.xml and missing in the matching > configuration file. 
Since this test checks all the configuration objects this > might need some other follow-ups to document the missing properties from > other components if there are any. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-11450) Improvements for TestYarnConfigurationFields and TestConfigurationFieldsBase
Szilard Nemeth created YARN-11450: - Summary: Improvements for TestYarnConfigurationFields and TestConfigurationFieldsBase Key: YARN-11450 URL: https://issues.apache.org/jira/browse/YARN-11450 Project: Hadoop YARN Issue Type: Improvement Reporter: Szilard Nemeth Assignee: Szilard Nemeth -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (YARN-11415) Refactor TestConfigurationFieldsBase and the connected test classes
[ https://issues.apache.org/jira/browse/YARN-11415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17696584#comment-17696584 ] Szilard Nemeth edited comment on YARN-11415 at 3/5/23 4:21 PM: --- Hi [~bteke], I just briefly checked this. {quote}Another issue with this is that it works in a semi-opposite way of what it's supposed to do. To ensure all of the configuration entries are documented it should iterate through all of the configuration fields and check if those have matching xyz-default.xml entries, but currently it just reports the entries that are present in the xyz-default.xml and missing in the matching configuration file. Since this test checks all the configuration objects this might need some other follow-ups to document the missing properties from other components if there are any. {quote} I think what you stated here is true for this method: TestCommonConfigurationFields#testCompareXmlAgainstConfigurationClass. This method compares the properties that are in yarn-default.xml, but not in the Configuration class. *With my first commit* I added a dummy property to yarn-default.xml without adding it to the YarnConfiguration class. The property is called "yarn.nodemanager.missingpropinclass", but the name doesn't really matter. Test result: Failure in TestCommonConfigurationFields#testCompareXmlAgainstConfigurationClass: {code:java} java.lang.AssertionError: yarn-default.xml has 1 properties missing in class org.apache.hadoop.yarn.conf.YarnConfiguration Entries: yarn.nodemanager.missingpropinclass Expected :0 Actual :1 {code} However, we also have TestCommonConfigurationFields#testCompareConfigurationClassAgainstXml which compares the properties that are in the YarnConfiguration class, but not defined in yarn-default.xml. 
{*}So with my second commit{*}, I added this to YarnConfiguration: {code:java} public static final String MISSING_PROP_IN_YARN_DEF = "yarn.missingprop.in.yarndefault"; {code} without touching the yarn-default.xml so the new config was not documented. As I expected, the test case TestConfigurationFieldsBase#testCompareConfigurationClassAgainstXml failed with: {code:java} java.lang.AssertionError: class org.apache.hadoop.yarn.conf.YarnConfiguration has 1 variables missing in yarn-default.xml Entries: yarn.missingprop.in.yarndefault Expected :0 Actual :1 {code} I think so far so good, this is the expected behavior. The main issue before YARN-11413 was: 1. org.apache.hadoop.conf.TestConfigurationFieldsBase#setupTestConfigurationFields is called as a "@Before" method. 2. org.apache.hadoop.conf.TestConfigurationFieldsBase#extractMemberVariablesFromConfigurationFields is called. 3. All of the fields of the class are checked here with certain restrictions. Among these are that it should be a public static final String, and it should match a Pattern. If the pattern is not matched, the field won't be added to the "known fields" for sure. {*}4. So with my third commit{*}, I just removed the percent sign (basically a revert of YARN-11413) to see what happens. TestConfigurationFieldsBase#testCompareXmlAgainstConfigurationClass fails, which is a false positive now. Here's the assertion message: {code:java} java.lang.AssertionError: yarn-default.xml has 2 properties missing in class org.apache.hadoop.yarn.conf.YarnConfiguration Entries: yarn.nodemanager.aux-services.%s.classpath yarn.nodemanager.aux-services.%s.system-classes Expected :0 Actual :2 {code} This is indeed wrong, as per your description. However, I don't see how this could be fixed in a clean way. Here, the configuration fields of YarnConfiguration were not recognized: yarn.nodemanager.aux-services.%s.classpath yarn.nodemanager.aux-services.%s.system-classes. What can we do then? 
*CONCLUSION* The whole point of matching the values of the String fields is to differentiate config keys from other strings like: {code:java} public static final String NVIDIA_DOCKER_V1 = "nvidia-docker-v1"; {code} I cannot see a better way than what we are currently doing with the regex. I think the regex pattern should be continuously maintained and occasions like unmatched real config keys like "yarn.nodemanager.aux-services.%s.classpath" (because the regex didn't contain the percent sign) could be quite rare. [~bteke] Do you have anything in mind about a clean fix for this? Anyway, I will report another jira where I will list my suggested improvements of the test and its base class: TestConfigurationFieldsBase. was (Author: snemeth): Hi [~bteke], I just briefly checked this. {quote} Another issue with this is that it works in a semi-opposite way of what it's supposed to do. To ensure all of the configuration entries are documented it should iterate through all of the configuration fields and check if those have matching xyz-default.xml entries, but cur
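The extraction step discussed in the comment — collecting public static final String fields and filtering them by a regex — can be sketched as follows. The pattern and the FakeConf class are simplified stand-ins, not the actual code of TestConfigurationFieldsBase#extractMemberVariablesFromConfigurationFields:

```java
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;
import java.util.Set;
import java.util.TreeSet;
import java.util.regex.Pattern;

public class FieldExtractionSketch {
    // Simplified pattern requiring dotted segments, so plain constants such as
    // "nvidia-docker-v1" are not mistaken for config keys; '%' is allowed so
    // format-string keys like yarn.nodemanager.aux-services.%s.classpath match.
    static final Pattern CONFIG_KEY =
        Pattern.compile("^[A-Za-z][A-Za-z0-9_%\\-]*(\\.[A-Za-z0-9_%\\-]+)+$");

    // Stand-in for YarnConfiguration: one real-looking key, one plain constant.
    public static class FakeConf {
        public static final String AUX_CLASSPATH =
            "yarn.nodemanager.aux-services.%s.classpath";
        public static final String NVIDIA_DOCKER_V1 = "nvidia-docker-v1";
    }

    static Set<String> extractConfigKeys(Class<?> clazz) throws IllegalAccessException {
        Set<String> keys = new TreeSet<>();
        for (Field f : clazz.getFields()) {
            int mods = f.getModifiers();
            if (Modifier.isStatic(mods) && Modifier.isFinal(mods)
                    && f.getType() == String.class) {
                String value = (String) f.get(null);
                if (CONFIG_KEY.matcher(value).matches()) {
                    keys.add(value); // keep only values that look like config keys
                }
            }
        }
        return keys;
    }

    public static void main(String[] args) throws IllegalAccessException {
        // Only the dotted key survives the filter; NVIDIA_DOCKER_V1 is dropped.
        System.out.println(extractConfigKeys(FakeConf.class));
    }
}
```

This also illustrates why the filter regex is load-bearing: any real key whose characters fall outside the pattern silently disappears from the "known fields" set.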
[jira] [Commented] (YARN-11415) Refactor TestConfigurationFieldsBase and the connected test classes
[ https://issues.apache.org/jira/browse/YARN-11415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17696584#comment-17696584 ] Szilard Nemeth commented on YARN-11415: --- Hi [~bteke], I just briefly checked this. {quote} Another issue with this is that it works in a semi-opposite way of what it's supposed to do. To ensure all of the configuration entries are documented it should iterate through all of the configuration fields and check if those have matching xyz-default.xml entries, but currently it just reports the entries that are present in the xyz-default.xml and missing in the matching configuration file. Since this test checks all the configuration objects this might need some other follow-ups to document the missing properties from other components if there are any. {quote} I think what you stated here is true for this method: TestCommonConfigurationFields#testCompareXmlAgainstConfigurationClass. This method compares the properties that are in yarn-default.xml, but not in the Configuration class. With my first commit I added a dummy property to yarn-default.xml without adding it to the YarnConfiguration class. The property is called "yarn.nodemanager.missingpropinclass", but the name doesn't really matter. Test result: Failure in TestCommonConfigurationFields#testCompareXmlAgainstConfigurationClass: {code} java.lang.AssertionError: yarn-default.xml has 1 properties missing in class org.apache.hadoop.yarn.conf.YarnConfiguration Entries: yarn.nodemanager.missingpropinclass Expected :0 Actual :1 {code} However, we also have TestCommonConfigurationFields#testCompareConfigurationClassAgainstXml which compares the properties that are in the YarnConfiguration class, but not defined in yarn-default.xml. 
So with my second commit, I added this to YarnConfiguration: {code} public static final String MISSING_PROP_IN_YARN_DEF = "yarn.missingprop.in.yarndefault"; {code} without touching the yarn-default.xml so the new config was not documented. As I expected, the test case TestConfigurationFieldsBase#testCompareConfigurationClassAgainstXml failed with: {code} java.lang.AssertionError: class org.apache.hadoop.yarn.conf.YarnConfiguration has 1 variables missing in yarn-default.xml Entries: yarn.missingprop.in.yarndefault Expected :0 Actual :1 {code} I think so far so good, this is the expected behavior. The main issue before YARN-11413 was: 1. org.apache.hadoop.conf.TestConfigurationFieldsBase#setupTestConfigurationFields is called as a "\@Before" method. 2. org.apache.hadoop.conf.TestConfigurationFieldsBase#extractMemberVariablesFromConfigurationFields is called. 3. All of the fields of the class are checked here with certain restrictions. Among these are that it should be a public static final String, and it should match a Pattern. If the pattern is not matched, the field won't be added to the "known fields" for sure. 4. So with my third commit, I just removed the percent sign (basically a revert of YARN-11413) to see what happens. TestConfigurationFieldsBase#testCompareXmlAgainstConfigurationClass fails, which is a false positive now. Here's the assertion message: {code} java.lang.AssertionError: yarn-default.xml has 2 properties missing in class org.apache.hadoop.yarn.conf.YarnConfiguration Entries: yarn.nodemanager.aux-services.%s.classpath yarn.nodemanager.aux-services.%s.system-classes Expected :0 Actual :2 {code} This is indeed wrong, as per your description. However, I don't see how this could be fixed in a clean way. Here, the configuration fields of YarnConfiguration were not recognized: yarn.nodemanager.aux-services.%s.classpath yarn.nodemanager.aux-services.%s.system-classes. What can we do then? 
The whole point of matching the values of the String fields is to differentiate config keys from other strings like: {code} public static final String NVIDIA_DOCKER_V1 = "nvidia-docker-v1"; {code} I cannot see a better way than what we are currently doing with the regex. I think the regex pattern should be continuously maintained and occasions like unmatched real config keys like "yarn.nodemanager.aux-services.%s.classpath" (because the regex didn't contain the percent sign) could be quite rare. [~bteke] Do you have anything in mind about a clean fix for this? Anyway, I will report another jira where I will list my suggested improvements of the test and its base class: TestConfigurationFieldsBase. > Refactor TestConfigurationFieldsBase and the connected test classes > --- > > Key: YARN-11415 > URL: https://issues.apache.org/jira/browse/YARN-11415 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Benjamin Teke >Assignee: Szilard Nemeth >Priority: Major > Labels: pu
[jira] [Updated] (YARN-11427) Pull up the versioned imports in pom of hadoop-mapreduce-client-app to hadoop-project pom
[ https://issues.apache.org/jira/browse/YARN-11427?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Szilard Nemeth updated YARN-11427: -- Description: The versioned imports in pom.xml of hadoop-mapreduce-client-app can be pulled up to hadoop-project pom as it is better for version maintenance and ease of using an IDE to find where things are used {code:java} <dependency> <groupId>org.mockito</groupId> <artifactId>mockito-junit-jupiter</artifactId> <version>4.11.0</version> <scope>test</scope> </dependency> <dependency> <groupId>uk.org.webcompere</groupId> <artifactId>system-stubs-core</artifactId> <version>1.1.0</version> <scope>test</scope> </dependency> <dependency> <groupId>uk.org.webcompere</groupId> <artifactId>system-stubs-jupiter</artifactId> <version>1.1.0</version> <scope>test</scope> </dependency> {code} was: The versioned imports in pom.xml of hadoop-mapreduce-client-app can be pullup to hadoop-project pom as it is better for version maintenance and ease of using an IDE to find where things are used {code:java} <dependency> <groupId>org.mockito</groupId> <artifactId>mockito-junit-jupiter</artifactId> <version>4.11.0</version> <scope>test</scope> </dependency> <dependency> <groupId>uk.org.webcompere</groupId> <artifactId>system-stubs-core</artifactId> <version>1.1.0</version> <scope>test</scope> </dependency> <dependency> <groupId>uk.org.webcompere</groupId> <artifactId>system-stubs-jupiter</artifactId> <version>1.1.0</version> <scope>test</scope> </dependency> {code} > Pull up the versioned imports in pom of hadoop-mapreduce-client-app to > hadoop-project pom > - > > Key: YARN-11427 > URL: https://issues.apache.org/jira/browse/YARN-11427 > Project: Hadoop YARN > Issue Type: Task > Components: yarn >Reporter: Susheel Gupta >Assignee: Susheel Gupta >Priority: Minor > > The versioned imports in pom.xml of hadoop-mapreduce-client-app can be pulled > up to hadoop-project pom as it is better for version maintenance and ease of > using an IDE to find where things are used > > {code:java} > <dependency> > <groupId>org.mockito</groupId> > <artifactId>mockito-junit-jupiter</artifactId> > <version>4.11.0</version> > <scope>test</scope> > </dependency> > <dependency> > <groupId>uk.org.webcompere</groupId> > <artifactId>system-stubs-core</artifactId> > <version>1.1.0</version> > <scope>test</scope> > </dependency> > <dependency> > <groupId>uk.org.webcompere</groupId> > <artifactId>system-stubs-jupiter</artifactId> > <version>1.1.0</version> > <scope>test</scope> > </dependency> > {code} > > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-11427) Pull up the versioned imports in pom of hadoop-mapreduce-client-app to hadoop-project pom
[ https://issues.apache.org/jira/browse/YARN-11427?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Szilard Nemeth updated YARN-11427: -- Summary: Pull up the versioned imports in pom of hadoop-mapreduce-client-app to hadoop-project pom (was: Pullup the versioned imports in pom of hadoop-mapreduce-client-app to hadoop-project pom) > Pull up the versioned imports in pom of hadoop-mapreduce-client-app to > hadoop-project pom > - > > Key: YARN-11427 > URL: https://issues.apache.org/jira/browse/YARN-11427 > Project: Hadoop YARN > Issue Type: Task > Components: yarn >Reporter: Susheel Gupta >Assignee: Susheel Gupta >Priority: Minor > > The versioned imports in pom.xml of hadoop-mapreduce-client-app can be pullup > to hadoop-project pom as it is better for version maintenance and ease of > using an IDE to find where things are used > > {code:java} > > org.mockito > mockito-junit-jupiter > 4.11.0 > test > > > uk.org.webcompere > system-stubs-core > 1.1.0 > test > > > uk.org.webcompere > system-stubs-jupiter > 1.1.0 > test > {code} > > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Assigned] (YARN-11372) Migrate legacy AQC to flexible AQC
[ https://issues.apache.org/jira/browse/YARN-11372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Szilard Nemeth reassigned YARN-11372: - Assignee: Peter Szucs > Migrate legacy AQC to flexible AQC > -- > > Key: YARN-11372 > URL: https://issues.apache.org/jira/browse/YARN-11372 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Benjamin Teke >Assignee: Peter Szucs >Priority: Major > > Currently the Legacy AQC classes > (ManagedParentQueue/ManagedLeafQueue) live next to the basic queue > classes that are used by the flexible AQC. The scope of this task is to > eliminate the former while migrating the functionality of legacy AQC. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-11372) Migrate legacy AQC to flexible AQC
[ https://issues.apache.org/jira/browse/YARN-11372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Szilard Nemeth updated YARN-11372: -- Parent Issue: YARN-10889 (was: YARN-10888) > Migrate legacy AQC to flexible AQC > -- > > Key: YARN-11372 > URL: https://issues.apache.org/jira/browse/YARN-11372 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Benjamin Teke >Priority: Major > > Currently the Legacy AQC classes > (ManagedParentQueue/ManagedLeafQueue) live next to the basic queue > classes that are used by the flexible AQC. The scope of this task is to > eliminate the former while migrating the functionality of legacy AQC. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Resolved] (YARN-10965) Centralize queue resource calculation based on CapacityVectors
[ https://issues.apache.org/jira/browse/YARN-10965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Szilard Nemeth resolved YARN-10965. --- Hadoop Flags: Reviewed Resolution: Fixed > Centralize queue resource calculation based on CapacityVectors > -- > > Key: YARN-10965 > URL: https://issues.apache.org/jira/browse/YARN-10965 > Project: Hadoop YARN > Issue Type: Sub-task > Components: capacity scheduler >Reporter: Andras Gyori >Assignee: Andras Gyori >Priority: Major > Labels: pull-request-available > Fix For: 3.4.0 > > Time Spent: 6.5h > Remaining Estimate: 0h > > With the introduction of YARN-10930 it is possible to unify queue resource > calculation. In order to narrow down the scope of this patch, the base system > is implemented here, without refactoring the existing resource calculation in > updateClusterResource (which will be done in YARN-11000). -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-10965) Centralize queue resource calculation based on CapacityVectors
[ https://issues.apache.org/jira/browse/YARN-10965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Szilard Nemeth updated YARN-10965: -- Fix Version/s: 3.4.0 > Centralize queue resource calculation based on CapacityVectors > -- > > Key: YARN-10965 > URL: https://issues.apache.org/jira/browse/YARN-10965 > Project: Hadoop YARN > Issue Type: Sub-task > Components: capacity scheduler >Reporter: Andras Gyori >Assignee: Andras Gyori >Priority: Major > Labels: pull-request-available > Fix For: 3.4.0 > > Time Spent: 6.5h > Remaining Estimate: 0h > > With the introduction of YARN-10930 it is possible to unify queue resource > calculation. In order to narrow down the scope of this patch, the base system > is implemented here, without refactoring the existing resource calculation in > updateClusterResource (which will be done in YARN-11000). -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-6971) Clean up different ways to create resources
[ https://issues.apache.org/jira/browse/YARN-6971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Szilard Nemeth updated YARN-6971: - Fix Version/s: 3.4.0 > Clean up different ways to create resources > --- > > Key: YARN-6971 > URL: https://issues.apache.org/jira/browse/YARN-6971 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager, scheduler >Reporter: Yufei Gu >Assignee: Riya Khandelwal >Priority: Minor > Labels: newbie, pull-request-available > Fix For: 3.4.0 > > > There are several ways to create a {{resource}} object, e.g., > BuilderUtils.newResource() and Resources.createResource(). These methods not > only cause confusion but also performance issues; for example, > BuilderUtils.newResource() is significantly slower than > Resources.createResource(). > We could merge them somehow, and replace most BuilderUtils.newResource() > with Resources.createResource(). -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Resolved] (YARN-6971) Clean up different ways to create resources
[ https://issues.apache.org/jira/browse/YARN-6971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Szilard Nemeth resolved YARN-6971. -- Hadoop Flags: Reviewed Resolution: Fixed > Clean up different ways to create resources > --- > > Key: YARN-6971 > URL: https://issues.apache.org/jira/browse/YARN-6971 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager, scheduler >Reporter: Yufei Gu >Assignee: Riya Khandelwal >Priority: Minor > Labels: newbie, pull-request-available > Fix For: 3.4.0 > > > There are several ways to create a {{resource}} object, e.g., > BuilderUtils.newResource() and Resources.createResource(). These methods not > only cause confusion but also performance issues; for example, > BuilderUtils.newResource() is significantly slower than > Resources.createResource(). > We could merge them somehow, and replace most BuilderUtils.newResource() > with Resources.createResource(). -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-11416) FS2CS should use CapacitySchedulerConfiguration in FSQueueConverterBuilder
[ https://issues.apache.org/jira/browse/YARN-11416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Szilard Nemeth updated YARN-11416: -- Description: org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.converter.FSQueueConverter and its builder stores the variable capacitySchedulerConfig as a simple Configuration object instead of CapacitySchedulerConfiguration. This is misleading, as capacitySchedulerConfig suggests that it is indeed a CapacitySchedulerConfiguration and it loses access to the convenience methods to check for various properties. Because of this every time a property getter is changed FS2CS should be checked if it reimplemented the same, otherwise there might be behaviour differences or even bugs. (was: org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.converter.FSQueueConverter and it's builder stores the variable capacitySchedulerConfig as a simple Configuration object instead of CapacitySchedulerConfiguration. This is misleading, as capacitySchedulerConfig suggests that it is indeed a CapacitySchedulerConfiguration and it loses access to the convenience methods to check for various properties. Because of this every time a property getter is changed FS2CS should be checked if it reimplemented the same, otherwise there might be behaviour differences or even bugs.) > FS2CS should use CapacitySchedulerConfiguration in FSQueueConverterBuilder > --- > > Key: YARN-11416 > URL: https://issues.apache.org/jira/browse/YARN-11416 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Benjamin Teke >Assignee: Susheel Gupta >Priority: Major > Labels: pull-request-available > > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.converter.FSQueueConverter > and its builder stores the variable capacitySchedulerConfig as a simple > Configuration object instead of CapacitySchedulerConfiguration. 
This is > misleading, as capacitySchedulerConfig suggests that it is indeed a > CapacitySchedulerConfiguration and it loses access to the convenience methods > to check for various properties. Because of this every time a property getter > is changed FS2CS should be checked if it reimplemented the same, otherwise > there might be behaviour differences or even bugs. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
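Why the declared type matters can be shown with stand-in classes (these are not the real Hadoop classes, and the convenience getter name below is hypothetical):

```java
import java.util.HashMap;
import java.util.Map;

// Stand-ins, not Hadoop's classes: a generic key/value Configuration and a
// subclass adding a convenience getter, mirroring how
// CapacitySchedulerConfiguration extends Configuration.
class PlainConfiguration {
    private final Map<String, String> props = new HashMap<>();
    public String get(String key) { return props.get(key); }
    public void set(String key, String value) { props.put(key, value); }
}

class CsConfiguration extends PlainConfiguration {
    // Hypothetical convenience method of the kind FS2CS loses access to
    // when its field is declared as the plain Configuration type.
    public boolean isPreemptionDisabled(String queue) {
        return Boolean.parseBoolean(
            get("yarn.scheduler.capacity." + queue + ".disable_preemption"));
    }
}

public class TypedConfigDemo {
    public static void main(String[] args) {
        CsConfiguration csConf = new CsConfiguration();
        csConf.set("yarn.scheduler.capacity.root.a.disable_preemption", "true");

        PlainConfiguration tooBroad = csConf; // FS2CS-style field typing
        // tooBroad.isPreemptionDisabled("root.a"); // would not compile: the
        // convenience getter is invisible, so callers re-read raw keys instead,
        // duplicating logic that can then drift from the real getter.

        System.out.println(csConf.isPreemptionDisabled("root.a")); // true
    }
}
```

Storing the field as the subclass type removes the duplicated key-reading logic, which is exactly the drift risk the issue describes.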
[jira] [Updated] (YARN-5607) Document TestContainerResourceUsage#waitForContainerCompletion
[ https://issues.apache.org/jira/browse/YARN-5607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Szilard Nemeth updated YARN-5607: - Fix Version/s: 3.4.0 > Document TestContainerResourceUsage#waitForContainerCompletion > -- > > Key: YARN-5607 > URL: https://issues.apache.org/jira/browse/YARN-5607 > Project: Hadoop YARN > Issue Type: Test > Components: resourcemanager, test >Affects Versions: 2.9.0 >Reporter: Karthik Kambatla >Assignee: Susheel Gupta >Priority: Major > Labels: newbie, pull-request-available > Fix For: 3.4.0 > > > The logic in TestContainerResourceUsage#waitForContainerCompletion > (introduced in YARN-5024) is not immediately obvious. It could use some > documentation. Also, this seems like a useful helper method. Should this be > moved to one of the mock classes or to a util class? -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-11404) Add junit5 dependency to hadoop-mapreduce-client-app to fix few unit test failure
[ https://issues.apache.org/jira/browse/YARN-11404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Szilard Nemeth updated YARN-11404: -- Fix Version/s: 3.4.0 > Add junit5 dependency to hadoop-mapreduce-client-app to fix few unit test > failure > - > > Key: YARN-11404 > URL: https://issues.apache.org/jira/browse/YARN-11404 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn >Reporter: Susheel Gupta >Assignee: Susheel Gupta >Priority: Major > Labels: pull-request-available > Fix For: 3.4.0 > > Attachments: > patch-unit-hadoop-mapreduce-project_hadoop-mapreduce-client_hadoop-mapreduce-client-app.txt > > > We need to add Junit 5 dependency in > {code:java} > /hadoop/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/pom.xml{code} > as the testcase TestAMWebServicesJobConf, TestAMWebServicesJobs, > TestAMWebServices, TestAMWebServicesAttempts, TestAMWebServicesTasks were > passing locally but failed at jenkins build in this > [link|https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5119/7/artifact/out/patch-unit-hadoop-mapreduce-project_hadoop-mapreduce-client_hadoop-mapreduce-client-app.txt] > for YARN-5607 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
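The kind of pom.xml addition described above might look like the following sketch. The coordinates are the standard JUnit 5 (Jupiter) artifacts; in practice the versions and exact set of artifacts would be managed by the Hadoop parent pom, so this fragment is an assumption-laden illustration rather than the actual patch.

```xml
<!-- Test-scoped JUnit 5 dependencies (versions assumed to come from the
     parent pom's dependencyManagement). -->
<dependency>
  <groupId>org.junit.jupiter</groupId>
  <artifactId>junit-jupiter-api</artifactId>
  <scope>test</scope>
</dependency>
<dependency>
  <groupId>org.junit.jupiter</groupId>
  <artifactId>junit-jupiter-engine</artifactId>
  <scope>test</scope>
</dependency>
```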
[jira] [Resolved] (YARN-11404) Add junit5 dependency to hadoop-mapreduce-client-app to fix few unit test failure
[ https://issues.apache.org/jira/browse/YARN-11404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Szilard Nemeth resolved YARN-11404. --- Hadoop Flags: Reviewed Resolution: Fixed > Add junit5 dependency to hadoop-mapreduce-client-app to fix few unit test > failure > - > > Key: YARN-11404 > URL: https://issues.apache.org/jira/browse/YARN-11404 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn >Reporter: Susheel Gupta >Assignee: Susheel Gupta >Priority: Major > Labels: pull-request-available > Fix For: 3.4.0 > > Attachments: > patch-unit-hadoop-mapreduce-project_hadoop-mapreduce-client_hadoop-mapreduce-client-app.txt > > > We need to add Junit 5 dependency in > {code:java} > /hadoop/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/pom.xml{code} > as the testcase TestAMWebServicesJobConf, TestAMWebServicesJobs, > TestAMWebServices, TestAMWebServicesAttempts, TestAMWebServicesTasks were > passing locally but failed at jenkins build in this > [link|https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5119/7/artifact/out/patch-unit-hadoop-mapreduce-project_hadoop-mapreduce-client_hadoop-mapreduce-client-app.txt] > for YARN-5607 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-11409) Fix Typo of ResourceManager#webapp module
[ https://issues.apache.org/jira/browse/YARN-11409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Szilard Nemeth updated YARN-11409: -- Summary: Fix Typo of ResourceManager#webapp module (was: Fix Typo of ResourceManager#webapp moudle) > Fix Typo of ResourceManager#webapp module > - > > Key: YARN-11409 > URL: https://issues.apache.org/jira/browse/YARN-11409 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 3.4.0 >Reporter: Shilun Fan >Assignee: Shilun Fan >Priority: Major > Labels: pull-request-available > Fix For: 3.4.0 > > > When finishing YARN-11218, I found some typo problems in RM's RMWebServices. > I checked the java class of the webapp moudle and fixed the typo problems. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-11404) Add junit5 dependency to hadoop-mapreduce-client-app to fix few unit test failure
[ https://issues.apache.org/jira/browse/YARN-11404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Szilard Nemeth updated YARN-11404: -- Attachment: patch-unit-hadoop-mapreduce-project_hadoop-mapreduce-client_hadoop-mapreduce-client-app.txt > Add junit5 dependency to hadoop-mapreduce-client-app to fix few unit test > failure > - > > Key: YARN-11404 > URL: https://issues.apache.org/jira/browse/YARN-11404 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn >Reporter: Susheel Gupta >Assignee: Susheel Gupta >Priority: Major > Attachments: > patch-unit-hadoop-mapreduce-project_hadoop-mapreduce-client_hadoop-mapreduce-client-app.txt > > > We need to add Junit 5 dependency in > {code:java} > /hadoop/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/pom.xml{code} > as the testcase TestAMWebServicesJobConf, TestAMWebServicesJobs, > TestAMWebServices, TestAMWebServicesAttempts, TestAMWebServicesTasks were > passing locally but failed at jenkins build in this > [link|https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5119/7/artifact/out/patch-unit-hadoop-mapreduce-project_hadoop-mapreduce-client_hadoop-mapreduce-client-app.txt] > for YARN-5607 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Assigned] (YARN-11415) Refactor TestConfigurationFieldsBase and the connected test classes
[ https://issues.apache.org/jira/browse/YARN-11415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Szilard Nemeth reassigned YARN-11415: - Assignee: Szilard Nemeth > Refactor TestConfigurationFieldsBase and the connected test classes > --- > > Key: YARN-11415 > URL: https://issues.apache.org/jira/browse/YARN-11415 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Benjamin Teke >Assignee: Szilard Nemeth >Priority: Major > > YARN-11413 pointed out a strange aspect of how the configuration tests are > executed. The first problem is that there is a > [Pattern|https://github.com/apache/hadoop/blob/570b503e3e7e7adf5b0a8fabca76003298216543/hadoop-common-project/hadoop-common/src/test/java/org/apache/hadoop/conf/TestConfigurationFieldsBase.java#L197] > that matches only numbers, letters, dots, hyphens and underscores, but not > %, which is used in string replacements (e.g. > {{yarn.nodemanager.aux-services.%s.classpath}}), so essentially every > property that's present in any configuration object and doesn't match this > pattern is silently skipped, and documenting it will result in invalid test > failures; so the test encourages introducing props and not documenting > them. The pattern should be fixed in YARN-11413 for %s, but its necessity > could also be reviewed. > Another issue is that the test works almost opposite to the way it's > supposed to. To ensure all of the configuration entries are documented it > should iterate through all of the configuration fields and check if those > have matching xyz-default.xml entries, but currently it just reports the > entries that are present in the xyz-default.xml and missing in the matching > configuration file. Since this test checks all the configuration objects this > might need some other follow-ups to document the missing properties from > other components if there are any. 
-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
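The skipping problem described above can be reproduced with a small regex sketch. The exact pattern in TestConfigurationFieldsBase may differ; the two patterns below are illustrative reconstructions showing why a character class without `%` never matches a templated property name.

```java
import java.util.regex.Pattern;

public class PropertyPatternSketch {
    // A whitelist of letters, digits, underscores, dots and hyphens: any
    // property name containing '%' fails to match and is silently skipped.
    static final Pattern NARROW = Pattern.compile("[A-Za-z0-9_.\\-]+");

    // The same class with '%' added, as the YARN-11413-style fix would do.
    static final Pattern WITH_PERCENT = Pattern.compile("[A-Za-z0-9_.%\\-]+");

    public static void main(String[] args) {
        String plain = "yarn.nodemanager.resource.memory-mb";
        String templated = "yarn.nodemanager.aux-services.%s.classpath";

        System.out.println(NARROW.matcher(plain).matches());           // true
        System.out.println(NARROW.matcher(templated).matches());       // false: skipped
        System.out.println(WITH_PERCENT.matcher(templated).matches()); // true
    }
}
```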
[jira] [Commented] (YARN-11355) YARN Client Failovers immediately to rm2 but takes ~30000ms to rm3
[ https://issues.apache.org/jira/browse/YARN-11355?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17656635#comment-17656635 ] Szilard Nemeth commented on YARN-11355: --- Hi [~vineethNaroju], You have to move this jira to Patch available to trigger Jenkins. Nowadays Github PRs are more welcome. Thanks > YARN Client Failovers immediately to rm2 but takes ~30000ms to rm3 > -- > > Key: YARN-11355 > URL: https://issues.apache.org/jira/browse/YARN-11355 > Project: Hadoop YARN > Issue Type: Bug > Components: client >Affects Versions: 3.4.0 >Reporter: Prabhu Joseph >Assignee: Vineeth Naroju >Priority: Major > Attachments: YARN-11355.diff > > > YARN Client Failovers immediately to rm2 but takes ~30000ms to rm3 during > initial retry. > *Repro:* > {code:java} > 1. YARN Cluster with three master nodes rm1,rm2 and rm3 > 2. rm3 is active > 3. yarn node -list or any other yarn client calls takes more than 30 seconds. > {code} > The initial failover to rm2 is immediate but then the failover to rm3 is > after ~30000 ms. The current RetryPolicy does not honor the number of master > nodes. It has to perform at least one immediate failover to every rm. > {code:java} > 2022-10-20 06:37:44,123 INFO client.ConfiguredRMFailoverProxyProvider: > Failing over to rm2 > 2022-10-20 06:37:44,129 INFO retry.RetryInvocationHandler: > java.net.ConnectException: Call From local to remote:8032 failed on > connection exception: java.net.ConnectException: Connection refused; For more > details see: http://wiki.apache.org/hadoop/ConnectionRefused, while invoking > ApplicationClientProtocolPBClientImpl.getClusterNodes over rm2 after 1 > failover attempts. Trying to failover after sleeping for 21139ms. > {code} > > *Workaround:* > Reduce yarn.resourcemanager.connect.retry-interval.ms from 30000 to something like 100. > This will do immediate failover to rm3, but there will be too many retries > when there is no active resourcemanager. 
> > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
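The workaround quoted in the report can be sketched as a yarn-site.xml fragment. This reflects the trade-off stated above: a short interval makes the failover chain reach rm3 quickly, at the cost of many retries while no ResourceManager is active.

```xml
<!-- Workaround sketch: shorten the connect retry interval (default 30000 ms)
     so client failover cycles through all RMs quickly. -->
<property>
  <name>yarn.resourcemanager.connect.retry-interval.ms</name>
  <value>100</value>
</property>
```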
[jira] [Updated] (YARN-11410) Add default methods for StateMachine
[ https://issues.apache.org/jira/browse/YARN-11410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Szilard Nemeth updated YARN-11410: -- Description: YARN-11395 added a new method to the StateMachine interface, which can break compatibility with dependent software; the method should be converted to a default method to prevent this breakage (was: The YARN-11395 created a new method in the StateMachine interface, what can break the compatibility with connected softwares, so the method should be converted to default method, what can prevent this break) > Add default methods for StateMachine > > > Key: YARN-11410 > URL: https://issues.apache.org/jira/browse/YARN-11410 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Bence Kosztolnik >Assignee: Bence Kosztolnik >Priority: Major > Labels: pull-request-available > Fix For: 3.4.0 > > > YARN-11395 added a new method to the StateMachine interface, which can break > compatibility with dependent software; the method should be converted > to a default method to prevent this breakage -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
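The compatibility argument above can be shown with a minimal sketch. The names below are illustrative, not the real org.apache.hadoop.yarn.state types: adding an abstract method to an interface breaks every existing implementor (with an AbstractMethodError for separately compiled code), while a default method lets old implementations keep working unchanged.

```java
interface StateMachine {
    String getCurrentStateName();

    // New method added as a default: classes compiled against the old
    // interface inherit this body instead of failing to implement it.
    default boolean isStateStackEmpty() {
        return true;
    }
}

// Simulates a third-party implementation written before the new method
// existed; it only implements the original method.
class LegacyStateMachine implements StateMachine {
    public String getCurrentStateName() {
        return "RUNNING";
    }
}

public class DefaultMethodSketch {
    public static void main(String[] args) {
        StateMachine sm = new LegacyStateMachine();
        // The legacy class works with the extended interface unmodified.
        System.out.println(sm.getCurrentStateName() + " " + sm.isStateStackEmpty());
    }
}
```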
[jira] [Updated] (YARN-11395) Resource Manager UI, cluster/appattempt/*, can not present FINAL_SAVING state
[ https://issues.apache.org/jira/browse/YARN-11395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Szilard Nemeth updated YARN-11395: -- Description: If an attempt is in *FINAL_SAVING* state, the *RMAppAttemptBlock#createAttemptHeadRoomTable* method fails with a conversion error, which results in a {code:java} RFC6265 Cookie values may not contain character: [ ]{code} error in the UI and in the logs as well. RM log: {code:java} ... at java.lang.Thread.run(Thread.java:750) Caused by: java.lang.IllegalArgumentException: No enum constant org.apache.hadoop.yarn.api.records.YarnApplicationAttemptState.FINAL_SAVING at java.lang.Enum.valueOf(Enum.java:238) at org.apache.hadoop.yarn.api.records.YarnApplicationAttemptState.valueOf(YarnApplicationAttemptState.java:27) at org.apache.hadoop.yarn.server.resourcemanager.webapp.RMAppAttemptBlock.createAttemptHeadRoomTable(RMAppAttemptBlock.java:424) at org.apache.hadoop.yarn.server.webapp.AppAttemptBlock.render(AppAttemptBlock.java:151) at org.apache.hadoop.yarn.webapp.view.HtmlBlock.render(HtmlBlock.java:69) at org.apache.hadoop.yarn.webapp.view.HtmlBlock.renderPartial(HtmlBlock.java:79) at org.apache.hadoop.yarn.webapp.View.render(View.java:243) at org.apache.hadoop.yarn.webapp.view.HtmlPage$Page.subView(HtmlPage.java:49) at org.apache.hadoop.yarn.webapp.hamlet2.HamletImpl$EImp._v(HamletImpl.java:117) at org.apache.hadoop.yarn.webapp.hamlet2.Hamlet$TD.__(Hamlet.java:848) at org.apache.hadoop.yarn.webapp.view.TwoColumnLayout.render(TwoColumnLayout.java:71) at org.apache.hadoop.yarn.webapp.view.HtmlPage.render(HtmlPage.java:82) at org.apache.hadoop.yarn.webapp.Controller.render(Controller.java:216) at org.apache.hadoop.yarn.server.resourcemanager.webapp.RmController.appattempt(RmController.java:62) ... 
63 more 2022-12-05 04:15:33,029 WARN org.eclipse.jetty.server.HttpChannel: /cluster/appattempt/appattempt_1667297151262_0247_01 java.lang.IllegalArgumentException: RFC6265 Cookie values may not contain character: [ ] at org.eclipse.jetty.http.Syntax.requireValidRFC6265CookieValue(Syntax.java:136) ...{code} This bug was introduced by YARN-1345, which also caused a similar error, YARN-4411. In YARN-4411 the enum mapping logic from RMAppAttemptState to YarnApplicationAttemptState was modified like this: if the state is FINAL_SAVING, represent the previous state instead. This error can also occur in the ALLOCATED_SAVING and LAUNCHED_UNMANAGED_SAVING states. So the *createAttemptHeadRoomTable* method should be modified to handle the three states above, just as in YARN-4411. was: If an attempt is in *FINAL_SAVING* state, the *RMAppAttemptBlock#createAttemptHeadRoomTable* method fails with a convert error, what will results a {code:java} RFC6265 Cookie values may not contain character: [ ]{code} error in the UI an in the logs as well. RM log: {code:java} ... 
at java.lang.Thread.run(Thread.java:750) Caused by: java.lang.IllegalArgumentException: No enum constant org.apache.hadoop.yarn.api.records.YarnApplicationAttemptState.FINAL_SAVING at java.lang.Enum.valueOf(Enum.java:238) at org.apache.hadoop.yarn.api.records.YarnApplicationAttemptState.valueOf(YarnApplicationAttemptState.java:27) at org.apache.hadoop.yarn.server.resourcemanager.webapp.RMAppAttemptBlock.createAttemptHeadRoomTable(RMAppAttemptBlock.java:424) at org.apache.hadoop.yarn.server.webapp.AppAttemptBlock.render(AppAttemptBlock.java:151) at org.apache.hadoop.yarn.webapp.view.HtmlBlock.render(HtmlBlock.java:69) at org.apache.hadoop.yarn.webapp.view.HtmlBlock.renderPartial(HtmlBlock.java:79) at org.apache.hadoop.yarn.webapp.View.render(View.java:243) at org.apache.hadoop.yarn.webapp.view.HtmlPage$Page.subView(HtmlPage.java:49) at org.apache.hadoop.yarn.webapp.hamlet2.HamletImpl$EImp._v(HamletImpl.java:117) at org.apache.hadoop.yarn.webapp.hamlet2.Hamlet$TD.__(Hamlet.java:848) at org.apache.hadoop.yarn.webapp.view.TwoColumnLayout.render(TwoColumnLayout.java:71) at org.apache.hadoop.yarn.webapp.view.HtmlPage.render(HtmlPage.java:82) at org.apache.hadoop.yarn.webapp.Controller.render(Controller.java:216) at org.apache.hadoop.yarn.server.resourcemanager.webapp.RmController.appattempt(RmController.java:62) ... 63 more 2022-12-05 04:15:33,029 WARN org.eclipse.jetty.server.HttpChannel: /cluster/appattempt/appattempt_1667297151262_0247_01 java.lang.IllegalArgumentException: RFC6265 Cookie values may not contain character: [ ] at org.eclipse.jetty.http.Syntax.requireValidRFC6265CookieValue(Syntax.java:136) ...{code} This bug was introduced with the YARN-1345 ticket what also caused a similar error called YARN-4411. In case of the YARN-4411 the enum mapping logic from RMAppAttemptStates to YarnApplicationAttemptStat
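The failure and the YARN-4411-style fix can be reconstructed in a minimal form. The enums below are trimmed stand-ins for RMAppAttemptState and YarnApplicationAttemptState, which do not share the transient *_SAVING values; a blind name-based `Enum.valueOf` conversion therefore throws the IllegalArgumentException seen in the stack trace.

```java
enum RMAppAttemptState {
    RUNNING, ALLOCATED_SAVING, LAUNCHED_UNMANAGED_SAVING, FINAL_SAVING, FINISHED
}

enum YarnApplicationAttemptState { RUNNING, FINISHED }

public class AttemptStateSketch {
    // Map the transient saving states to the state held before saving,
    // instead of calling valueOf directly on a name that has no counterpart.
    static YarnApplicationAttemptState convert(RMAppAttemptState current,
                                               RMAppAttemptState beforeSaving) {
        switch (current) {
            case FINAL_SAVING:
            case ALLOCATED_SAVING:
            case LAUNCHED_UNMANAGED_SAVING:
                return YarnApplicationAttemptState.valueOf(beforeSaving.name());
            default:
                return YarnApplicationAttemptState.valueOf(current.name());
        }
    }

    public static void main(String[] args) {
        // valueOf("FINAL_SAVING") would throw IllegalArgumentException here;
        // the mapping reports the pre-saving state instead.
        System.out.println(convert(RMAppAttemptState.FINAL_SAVING,
                                   RMAppAttemptState.RUNNING));
    }
}
```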
[jira] [Commented] (YARN-10905) Investigate if AbstractCSQueue#configuredNodeLabels vs. QueueCapacities#getExistingNodeLabels holds the same data
[ https://issues.apache.org/jira/browse/YARN-10905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17654073#comment-17654073 ] Szilard Nemeth commented on YARN-10905: --- Hi [~pszucs], Thanks for your investigation. I checked the code and your assessment is correct. Feel free to close this ticket. > Investigate if AbstractCSQueue#configuredNodeLabels vs. > QueueCapacities#getExistingNodeLabels holds the same data > - > > Key: YARN-10905 > URL: https://issues.apache.org/jira/browse/YARN-10905 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Szilard Nemeth >Assignee: Peter Szucs >Priority: Minor > > The task is to investigate whether the field > AbstractCSQueue#configuredNodeLabels holds the same data as > QueueCapacities#getExistingNodeLabels. > Obviously, we don't want double-entry bookkeeping so if the data is the same, > we can remove this or that. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Resolved] (YARN-10946) AbstractCSQueue: Create separate class for constructing Queue API objects
[ https://issues.apache.org/jira/browse/YARN-10946?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Szilard Nemeth resolved YARN-10946. --- Hadoop Flags: Reviewed Resolution: Fixed > AbstractCSQueue: Create separate class for constructing Queue API objects > - > > Key: YARN-10946 > URL: https://issues.apache.org/jira/browse/YARN-10946 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Szilard Nemeth >Assignee: Peter Szucs >Priority: Minor > Labels: pull-request-available > Fix For: 3.4.0 > > > Relevant methods are: > - > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.AbstractCSQueue#getQueueConfigurations > - > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.AbstractCSQueue#getQueueInfo > - > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.AbstractCSQueue#getQueueStatistics -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-10946) AbstractCSQueue: Create separate class for constructing Queue API objects
[ https://issues.apache.org/jira/browse/YARN-10946?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Szilard Nemeth updated YARN-10946: -- Fix Version/s: 3.4.0 > AbstractCSQueue: Create separate class for constructing Queue API objects > - > > Key: YARN-10946 > URL: https://issues.apache.org/jira/browse/YARN-10946 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Szilard Nemeth >Assignee: Peter Szucs >Priority: Minor > Labels: pull-request-available > Fix For: 3.4.0 > > > Relevant methods are: > - > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.AbstractCSQueue#getQueueConfigurations > - > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.AbstractCSQueue#getQueueInfo > - > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.AbstractCSQueue#getQueueStatistics -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Reopened] (YARN-10959) Extract common method of two that check if preemption disabled in CSQueuePreemption
[ https://issues.apache.org/jira/browse/YARN-10959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Szilard Nemeth reopened YARN-10959: --- > Extract common method of two that check if preemption disabled in > CSQueuePreemption > --- > > Key: YARN-10959 > URL: https://issues.apache.org/jira/browse/YARN-10959 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Szilard Nemeth >Assignee: Peter Szucs >Priority: Minor > > This is a follow-up of YARN-10913. > After YARN-10913, we have a class called CSQueuePreemption that has 2 methods > that are very similar to each other: > - isQueueHierarchyPreemptionDisabled > - isIntraQueueHierarchyPreemptionDisabled > The goal is to create one method and use it from those 2, merging the common > logic as much as we can. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Resolved] (YARN-10959) Extract common method of two that check if preemption disabled in CSQueuePreemption
[ https://issues.apache.org/jira/browse/YARN-10959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Szilard Nemeth resolved YARN-10959. --- Resolution: Won't Fix > Extract common method of two that check if preemption disabled in > CSQueuePreemption > --- > > Key: YARN-10959 > URL: https://issues.apache.org/jira/browse/YARN-10959 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Szilard Nemeth >Assignee: Peter Szucs >Priority: Minor > > This is a follow-up of YARN-10913. > After YARN-10913, we have a class called CSQueuePreemption that has 2 methods > that are very similar to each other: > - isQueueHierarchyPreemptionDisabled > - isIntraQueueHierarchyPreemptionDisabled > The goal is to create one method and use it from those 2, merging the common > logic as much as we can. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Resolved] (YARN-8262) get_executable in container-executor should provide meaningful error codes
[ https://issues.apache.org/jira/browse/YARN-8262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Szilard Nemeth resolved YARN-8262. -- Hadoop Flags: Reviewed Resolution: Fixed > get_executable in container-executor should provide meaningful error codes > -- > > Key: YARN-8262 > URL: https://issues.apache.org/jira/browse/YARN-8262 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Miklos Szegedi >Assignee: Susheel Gupta >Priority: Minor > Labels: newbie, pull-request-available, trivial > Fix For: 3.4.0 > > > Currently it calls exit(-1) that makes it difficult to debug without stderr. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-8262) get_executable in container-executor should provide meaningful error codes
[ https://issues.apache.org/jira/browse/YARN-8262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Szilard Nemeth updated YARN-8262: - Fix Version/s: 3.4.0 > get_executable in container-executor should provide meaningful error codes > -- > > Key: YARN-8262 > URL: https://issues.apache.org/jira/browse/YARN-8262 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Miklos Szegedi >Assignee: Susheel Gupta >Priority: Minor > Labels: newbie, pull-request-available, trivial > Fix For: 3.4.0 > > > Currently it calls exit(-1) that makes it difficult to debug without stderr. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-11369) Commons.compress throws an IllegalArgumentException with large uids after 1.21
[ https://issues.apache.org/jira/browse/YARN-11369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Szilard Nemeth updated YARN-11369: -- Fix Version/s: 3.4.0 > Commons.compress throws an IllegalArgumentException with large uids after 1.21 > -- > > Key: YARN-11369 > URL: https://issues.apache.org/jira/browse/YARN-11369 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Benjamin Teke >Assignee: Benjamin Teke >Priority: Major > Labels: pull-request-available > Fix For: 3.4.0 > > > Encountering COMPRESS-587 with large uids/gids in > {{hadoop-mapreduce-client-uploader/src/main/java/org/apache/hadoop/mapred/uploader/FrameworkUploader.java}}: > {code:java} > 22/09/13 06:39:05 INFO uploader.FrameworkUploader: Adding > /cs/cloudera/opt/cloudera/cm/lib/plugins/event-publish-7.7.1-shaded.jar > Exception in thread "main" java.lang.IllegalArgumentException: group id > '5049047' is too big ( > 2097151 ). Use STAR or POSIX extensions to overcome > this limit > {code} > A workaround is to specifically set bignumber mode to BIGNUMBER_POSIX or > BIGNUMBER_STAR on the instance. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
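The workaround mentioned above can be sketched with the commons-compress API (this requires org.apache.commons:commons-compress on the classpath; the uid value is taken from the report, the file name is illustrative). `setBigNumberMode` with `BIGNUMBER_POSIX` writes ids above 2097151 (octal 7777777) into PAX extended headers instead of throwing.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;

import org.apache.commons.compress.archivers.tar.TarArchiveEntry;
import org.apache.commons.compress.archivers.tar.TarArchiveOutputStream;

public class BigUidTarSketch {
    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (TarArchiveOutputStream tar = new TarArchiveOutputStream(bytes)) {
            // Without this line, an entry with uid/gid > 2097151 fails with
            // IllegalArgumentException: "group id '...' is too big".
            tar.setBigNumberMode(TarArchiveOutputStream.BIGNUMBER_POSIX);

            TarArchiveEntry entry = new TarArchiveEntry("event-publish.jar");
            entry.setUserId(5049047L);
            entry.setGroupId(5049047L);
            tar.putArchiveEntry(entry);
            tar.closeArchiveEntry();
        }
    }
}
```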
[jira] [Assigned] (YARN-10886) Cluster based and parent based max capacity in Capacity Scheduler
[ https://issues.apache.org/jira/browse/YARN-10886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Szilard Nemeth reassigned YARN-10886: - Assignee: (was: Szilard Nemeth) > Cluster based and parent based max capacity in Capacity Scheduler > - > > Key: YARN-10886 > URL: https://issues.apache.org/jira/browse/YARN-10886 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Szilard Nemeth >Priority: Major > > We want to introduce the percentage modes relative to the cluster, not the > parent, i.e > The property root.users.maximum-capacity will mean one of the following > things: > *Either Parent Percentage:* maximum capacity relative to its parent. If it’s > set to 50, then it means that the capacity is capped with respect to the > parent. This can be covered by the current format, no change there. > *Or Cluster Percentage:* maximum capacity expressed as a percentage of the > overall cluster capacity. This case is the new scenario, for example: > yarn.scheduler.capacity.root.users.max-capacity = c:50% > yarn.scheduler.capacity.root.users.max-capacity = c:50%, c:30% -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Assigned] (YARN-10959) Extract common method of two that check if preemption disabled in CSQueuePreemption
[ https://issues.apache.org/jira/browse/YARN-10959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Szilard Nemeth reassigned YARN-10959: - Assignee: (was: Szilard Nemeth) > Extract common method of two that check if preemption disabled in > CSQueuePreemption > --- > > Key: YARN-10959 > URL: https://issues.apache.org/jira/browse/YARN-10959 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Szilard Nemeth >Priority: Minor > > This is a follow-up of YARN-10913. > After YARN-10913, we have a class called CSQueuePreemption that has 2 methods > that are very similar to each other: > - isQueueHierarchyPreemptionDisabled > - isIntraQueueHierarchyPreemptionDisabled > The goal is to create one method and use it from those 2, merging the common > logic as much as we can. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Assigned] (YARN-10946) AbstractCSQueue: Create separate class for constructing Queue API objects
[ https://issues.apache.org/jira/browse/YARN-10946?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Szilard Nemeth reassigned YARN-10946: - Assignee: (was: Szilard Nemeth) > AbstractCSQueue: Create separate class for constructing Queue API objects > - > > Key: YARN-10946 > URL: https://issues.apache.org/jira/browse/YARN-10946 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Szilard Nemeth >Priority: Minor > > Relevant methods are: > - > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.AbstractCSQueue#getQueueConfigurations > - > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.AbstractCSQueue#getQueueInfo > - > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.AbstractCSQueue#getQueueStatistics -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Assigned] (YARN-10921) AbstractCSQueue: Node Labels logic is scattered and iteration logic is repeated all over the place
[ https://issues.apache.org/jira/browse/YARN-10921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Szilard Nemeth reassigned YARN-10921: - Assignee: (was: Szilard Nemeth) > AbstractCSQueue: Node Labels logic is scattered and iteration logic is > repeated all over the place > -- > > Key: YARN-10921 > URL: https://issues.apache.org/jira/browse/YARN-10921 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Szilard Nemeth >Priority: Minor > > TODO items: > - Check original Node labels epic / jiras? > - Think about ways to improve repetitive iteration on configuredNodeLabels > - Search for: "String label" in code > Code blocks to handle Node labels: > - AbstractCSQueue#setupQueueConfigs > - AbstractCSQueue#getQueueConfigurations > - AbstractCSQueue#accessibleToPartition > - AbstractCSQueue#getNodeLabelsForQueue > - AbstractCSQueue#updateAbsoluteCapacities > - AbstractCSQueue#updateConfigurableResourceRequirement > - CSQueueUtils#loadCapacitiesByLabelsFromConf > - AutoCreatedLeafQueue -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Assigned] (YARN-10926) Test validation after YARN-10504 and YARN-10506: Check if modified test expectations are correct or not
[ https://issues.apache.org/jira/browse/YARN-10926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Szilard Nemeth reassigned YARN-10926: - Assignee: (was: Szilard Nemeth) > Test validation after YARN-10504 and YARN-10506: Check if modified test > expectations are correct or not > --- > > Key: YARN-10926 > URL: https://issues.apache.org/jira/browse/YARN-10926 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Szilard Nemeth >Priority: Minor > > YARN-10504 and YARN-10506 modified some test expectations. > The task is to verify if those expectations are correct. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Resolved] (YARN-10920) Create a dedicated class for Node Labels
[ https://issues.apache.org/jira/browse/YARN-10920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Szilard Nemeth resolved YARN-10920. --- Resolution: Won't Fix Since this is a huge effort and pretty hard, I think there would be very little gain compared to the size of the effort, hence closing this ticket with "Won't fix". > Create a dedicated class for Node Labels > > > Key: YARN-10920 > URL: https://issues.apache.org/jira/browse/YARN-10920 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Szilard Nemeth >Priority: Minor > > In the current codebase, Node labels are simple strings. It's very > error-prone to use Strings as they can contain basically anything. Moreover, > it's easier to keep track of all usages if we have a dedicated class for Node > labels. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Assigned] (YARN-10920) Create a dedicated class for Node Labels
[ https://issues.apache.org/jira/browse/YARN-10920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Szilard Nemeth reassigned YARN-10920: - Assignee: (was: Szilard Nemeth) > Create a dedicated class for Node Labels > > > Key: YARN-10920 > URL: https://issues.apache.org/jira/browse/YARN-10920 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Szilard Nemeth >Priority: Minor > > In the current codebase, Node labels are simple strings. It's very > error-prone to use Strings as they can contain basically anything. Moreover, > it's easier to keep track of all usages if we have a dedicated class for Node > labels. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
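The dedicated class described in YARN-10920 could look like the following minimal sketch. The class name `NodeLabel`, the factory method, and the validation rule are all assumptions for illustration; the real design would need to match the label syntax YARN actually accepts.

```java
// Hypothetical sketch of a dedicated Node Label value type (YARN-10920).
// The name, factory, and validation pattern are assumptions, not YARN API.
public final class NodeLabel {
  private final String name;

  private NodeLabel(String name) {
    this.name = name;
  }

  // Factory that rejects the "can contain basically anything" strings
  // the issue complains about. The allowed pattern here is an assumption.
  public static NodeLabel of(String name) {
    if (name == null || name.isEmpty() || !name.matches("[A-Za-z0-9_-]+")) {
      throw new IllegalArgumentException("Invalid node label: " + name);
    }
    return new NodeLabel(name);
  }

  public String getName() {
    return name;
  }

  @Override
  public boolean equals(Object o) {
    return o instanceof NodeLabel && ((NodeLabel) o).name.equals(name);
  }

  @Override
  public int hashCode() {
    return name.hashCode();
  }

  @Override
  public String toString() {
    return name;
  }
}
```

With such a type, invalid labels fail fast at construction time instead of propagating as raw strings through the scheduler.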
[jira] [Assigned] (YARN-10905) Investigate if AbstractCSQueue#configuredNodeLabels vs. QueueCapacities#getExistingNodeLabels holds the same data
[ https://issues.apache.org/jira/browse/YARN-10905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Szilard Nemeth reassigned YARN-10905: - Assignee: (was: Szilard Nemeth) > Investigate if AbstractCSQueue#configuredNodeLabels vs. > QueueCapacities#getExistingNodeLabels holds the same data > - > > Key: YARN-10905 > URL: https://issues.apache.org/jira/browse/YARN-10905 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Szilard Nemeth >Priority: Minor > > The task is to investigate whether the field > AbstractCSQueue#configuredNodeLabels holds the same data as > QueueCapacities#getExistingNodeLabels. > Obviously, we don't want double-entry bookkeeping, so if the data is the same, > we can remove one or the other. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Resolved] (YARN-10005) Code improvements in MutableCSConfigurationProvider
[ https://issues.apache.org/jira/browse/YARN-10005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Szilard Nemeth resolved YARN-10005. --- Hadoop Flags: Reviewed Resolution: Fixed > Code improvements in MutableCSConfigurationProvider > --- > > Key: YARN-10005 > URL: https://issues.apache.org/jira/browse/YARN-10005 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Szilard Nemeth >Assignee: Peter Szucs >Priority: Minor > Labels: pull-request-available > Fix For: 3.4.0 > > > * Important: constructKeyValueConfUpdate and all related methods seem like a > separate responsibility: how to convert incoming SchedConfUpdateInfo to > Configuration changes (Configuration object) > * Duplicated code block (9 lines) in init / formatConfigurationInStore methods > * Method "getConfStore" could be package-private -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-10005) Code improvements in MutableCSConfigurationProvider
[ https://issues.apache.org/jira/browse/YARN-10005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Szilard Nemeth updated YARN-10005: -- Fix Version/s: 3.4.0 > Code improvements in MutableCSConfigurationProvider > --- > > Key: YARN-10005 > URL: https://issues.apache.org/jira/browse/YARN-10005 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Szilard Nemeth >Assignee: Peter Szucs >Priority: Minor > Labels: pull-request-available > Fix For: 3.4.0 > > > * Important: constructKeyValueConfUpdate and all related methods seem like a > separate responsibility: how to convert incoming SchedConfUpdateInfo to > Configuration changes (Configuration object) > * Duplicated code block (9 lines) in init / formatConfigurationInStore methods > * Method "getConfStore" could be package-private -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-11362) Fix several typos in YARN codebase of misspelled resource
[ https://issues.apache.org/jira/browse/YARN-11362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Szilard Nemeth updated YARN-11362: -- Labels: newbie newbie++ (was: ) > Fix several typos in YARN codebase of misspelled resource > - > > Key: YARN-11362 > URL: https://issues.apache.org/jira/browse/YARN-11362 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Szilard Nemeth >Assignee: Szilard Nemeth >Priority: Major > Labels: newbie, newbie++ > > I noticed that in YARN's codebase, there are several occurrences of > resource misspelled as "Resoure". -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-11362) Fix several typos in YARN codebase of misspelled resource
[ https://issues.apache.org/jira/browse/YARN-11362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Szilard Nemeth updated YARN-11362: -- Description: I noticed that in YARN's codebase, there are several occurrences of resource misspelled as "Resoure". > Fix several typos in YARN codebase of misspelled resource > - > > Key: YARN-11362 > URL: https://issues.apache.org/jira/browse/YARN-11362 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Szilard Nemeth >Assignee: Szilard Nemeth >Priority: Major > > I noticed that in YARN's codebase, there are several occurrences of > resource misspelled as "Resoure". -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-11362) Fix several typos in YARN codebase of misspelled resource
Szilard Nemeth created YARN-11362: - Summary: Fix several typos in YARN codebase of misspelled resource Key: YARN-11362 URL: https://issues.apache.org/jira/browse/YARN-11362 Project: Hadoop YARN Issue Type: Improvement Reporter: Szilard Nemeth Assignee: Szilard Nemeth -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-9361) Write testcase for FSLeafQueue that explicitly checks if non-zero AM-share values are not overwritten for custom resources
[ https://issues.apache.org/jira/browse/YARN-9361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Szilard Nemeth updated YARN-9361: - Component/s: fairscheduler > Write testcase for FSLeafQueue that explicitly checks if non-zero AM-share > values are not overwritten for custom resources > -- > > Key: YARN-9361 > URL: https://issues.apache.org/jira/browse/YARN-9361 > Project: Hadoop YARN > Issue Type: Improvement > Components: fairscheduler >Reporter: Szilard Nemeth >Priority: Major > > This is a follow-up for YARN-9323, covering changes regarding explicit zero > value check that has been discussed with [~templedf] earlier. > YARN-9323 fixed a bug in FSLeafQueue#computeMaxAMResource, so that custom > resource values are also set to the AM share. > We need a new test in TestFSLeafQueue that explicitly checks if the custom > resource value is only being set if the fairshare for that resource is zero. > This way, we can make sure we don't overwrite any meaningful resource value. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-9457) Integrate custom resource metrics better for FairScheduler
[ https://issues.apache.org/jira/browse/YARN-9457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Szilard Nemeth updated YARN-9457: - Component/s: fairscheduler > Integrate custom resource metrics better for FairScheduler > -- > > Key: YARN-9457 > URL: https://issues.apache.org/jira/browse/YARN-9457 > Project: Hadoop YARN > Issue Type: Improvement > Components: fairscheduler >Reporter: Szilard Nemeth >Priority: Major > > YARN-8842 added > org.apache.hadoop.yarn.server.resourcemanager.scheduler.QueueMetricsForCustomResources. > This class stores all metrics data for custom resource types. > QueueMetrics has a field to hold an object of this class. > Similarly, YARN-9322 added FSQueueMetricsForCustomResources and added an > object of this class to FSQueueMetrics. > This jira is to investigate how to integrate > QueueMetricsForCustomResources into QueueMetrics and > FSQueueMetricsForCustomResources into FSQueueMetrics. > The trick is that the Metrics annotation > (org.apache.hadoop.metrics2.annotation.Metric) is used to expose values on > JMX. > We need to implement a mechanism where the QueueMetrics / FSQueueMetrics classes > contain a field for the custom resource values, which is a map with resource > names as keys and longs as values. > This way, we don't need the new classes (QueueMetricsForCustomResources and > FSQueueMetricsForCustomResources), and the code could be much cleaner and more > consistent. > The hardest part is possibly finding a way to expose metrics values from a > map. We obviously can't use the Metrics annotation, so a mechanism is required > to expose the values on JMX. > From a quick search, I haven't found any mechanism like this in the code. > [~wilfreds]: Are you aware of any way to expose values like this? > Most probably, we need to check how the Metrics annotation is processed, > understand the whole flow and check what is the underlying mechanism of the > metrics propagation to the JMX interface. 
-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-9352) Multiple versions of createSchedulingRequest in FairSchedulerTestBase could be cleaned up
[ https://issues.apache.org/jira/browse/YARN-9352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Szilard Nemeth updated YARN-9352: - Component/s: fairscheduler > Multiple versions of createSchedulingRequest in FairSchedulerTestBase could > be cleaned up > - > > Key: YARN-9352 > URL: https://issues.apache.org/jira/browse/YARN-9352 > Project: Hadoop YARN > Issue Type: Improvement > Components: fairscheduler >Reporter: Szilard Nemeth >Assignee: Siddharth Ahuja >Priority: Minor > Labels: newbie, newbie++, trivial > > createSchedulingRequest in FairSchedulerTestBase is overloaded many times. > This could be made cleaner if we introduced a builder instead of calling the > various forms of this method. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
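The builder proposed in YARN-9352 could be sketched as below. The field names and defaults are assumptions based on the typical createSchedulingRequest parameters (memory, vcores, queue, user, priority); here build() just renders the parameters rather than submitting a real request.

```java
// Hypothetical sketch of a builder replacing the createSchedulingRequest
// overloads in FairSchedulerTestBase. Names and defaults are assumptions.
public final class SchedulingRequestBuilder {
  private int memory = 1024;
  private int vCores = 1;
  private String queue = "default";
  private String user = "user";
  private int priority = 1;

  public SchedulingRequestBuilder memory(int memory) {
    this.memory = memory;
    return this;
  }

  public SchedulingRequestBuilder vCores(int vCores) {
    this.vCores = vCores;
    return this;
  }

  public SchedulingRequestBuilder queue(String queue) {
    this.queue = queue;
    return this;
  }

  public SchedulingRequestBuilder user(String user) {
    this.user = user;
    return this;
  }

  public SchedulingRequestBuilder priority(int priority) {
    this.priority = priority;
    return this;
  }

  // In the real test base this would create and submit the request;
  // this sketch just renders the collected parameters.
  public String build() {
    return String.format("queue=%s user=%s mem=%d vcores=%d prio=%d",
        queue, user, memory, vCores, priority);
  }
}
```

Callers would then write something like `new SchedulingRequestBuilder().memory(2048).queue("root.q1").build()`, setting only the parameters a given test cares about.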
[jira] [Updated] (YARN-7239) Possible launch/cleanup race condition in ContainersLauncher
[ https://issues.apache.org/jira/browse/YARN-7239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Szilard Nemeth updated YARN-7239: - Component/s: nodemanager > Possible launch/cleanup race condition in ContainersLauncher > > > Key: YARN-7239 > URL: https://issues.apache.org/jira/browse/YARN-7239 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Reporter: Miklos Szegedi >Priority: Major > Labels: newbie > > ContainersLauncher.handle() submits the launch job and then adds the job into > the collection risking that the cleanup will miss it and return. This should > be in reversed order in all 3 instances: > {code} > containerLauncher.submit(launch); > running.put(containerId, launch); > {code} > The cleanup code that the above code is racing with: > {code} > ContainerLaunch runningContainer = running.get(containerId); > if (runningContainer == null) { > // Container not launched. So nothing needs to be done. > LOG.info("Container " + containerId + " not running, nothing to > signal."); > return; > } > ... > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
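The reordering suggested in YARN-7239 can be sketched as follows. The types are simplified stand-ins for the real ContainersLauncher/ContainerLaunch classes (names and structure are assumptions): the launch is registered in the running map before submission, so a concurrent cleanup can always find it.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Minimal sketch of the ordering fix: put the launch into the running map
// BEFORE submitting it to the executor. Simplified stand-in types only.
public class LaunchOrderSketch {
  private final Map<String, Runnable> running = new ConcurrentHashMap<>();
  private final ExecutorService containerLauncher =
      Executors.newSingleThreadExecutor();

  public void handleLaunch(String containerId, Runnable launch) {
    // Reversed order relative to the buggy code: register first, then submit.
    running.put(containerId, launch);
    containerLauncher.submit(launch);
  }

  public boolean signalCleanup(String containerId) {
    Runnable runningContainer = running.get(containerId);
    if (runningContainer == null) {
      // Container not launched, so nothing needs to be done.
      return false;
    }
    running.remove(containerId);
    return true;
  }

  public void shutdown() {
    containerLauncher.shutdownNow();
  }
}
```

With the registration done first, the cleanup path can no longer observe a submitted-but-unregistered launch, which is exactly the window the issue describes.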
[jira] [Updated] (YARN-6474) CGroupsHandlerImpl.java has a few checkstyle issues left to be fixed after YARN-5301
[ https://issues.apache.org/jira/browse/YARN-6474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Szilard Nemeth updated YARN-6474: - Component/s: nodemanager > CGroupsHandlerImpl.java has a few checkstyle issues left to be fixed after > YARN-5301 > > > Key: YARN-6474 > URL: https://issues.apache.org/jira/browse/YARN-6474 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Reporter: Miklos Szegedi >Priority: Minor > Labels: newbie, trivial > > The main issue is a throw inside a finally block -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-6525) Linux container executor should not propagate application errors
[ https://issues.apache.org/jira/browse/YARN-6525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Szilard Nemeth updated YARN-6525: - Component/s: LCE > Linux container executor should not propagate application errors > > > Key: YARN-6525 > URL: https://issues.apache.org/jira/browse/YARN-6525 > Project: Hadoop YARN > Issue Type: Bug > Components: LCE >Affects Versions: 3.0.0-alpha2 >Reporter: Miklos Szegedi >Priority: Major > Labels: newbie > > wait_and_get_exit_code currently returns the application error code as the LCE > error code. This may overlap with LCE errors. Instead, LCE should return a > fixed "application failed" error code. It should print the application error > into the logs. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-10680) Revisit try blocks without catch blocks but having finally blocks
[ https://issues.apache.org/jira/browse/YARN-10680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Szilard Nemeth updated YARN-10680: -- Fix Version/s: 3.4.0 > Revisit try blocks without catch blocks but having finally blocks > - > > Key: YARN-10680 > URL: https://issues.apache.org/jira/browse/YARN-10680 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Szilard Nemeth >Assignee: Susheel Gupta >Priority: Minor > Labels: newbie, pull-request-available, trivial > Fix For: 3.4.0 > > Attachments: YARN-10860.001.patch > > > This jira is to revisit all try blocks without catch blocks but having > finally blocks in SLS. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Resolved] (YARN-10680) Revisit try blocks without catch blocks but having finally blocks
[ https://issues.apache.org/jira/browse/YARN-10680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Szilard Nemeth resolved YARN-10680. --- Hadoop Flags: Reviewed Resolution: Fixed > Revisit try blocks without catch blocks but having finally blocks > - > > Key: YARN-10680 > URL: https://issues.apache.org/jira/browse/YARN-10680 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Szilard Nemeth >Assignee: Susheel Gupta >Priority: Minor > Labels: newbie, pull-request-available, trivial > Fix For: 3.4.0 > > Attachments: YARN-10860.001.patch > > > This jira is to revisit all try blocks without catch blocks but having > finally blocks in SLS. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
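The pattern YARN-10680 asks to revisit can be illustrated with a minimal sketch: a try block with no catch whose finally only releases a resource can usually be rewritten as try-with-resources. The `Tracker` type below is a hypothetical stand-in so the example stays self-contained.

```java
// Sketch of the try/finally-without-catch pattern and its cleaner form.
// Tracker is a hypothetical AutoCloseable used only for illustration.
public class TryFinallySketch {
  public static final class Tracker implements AutoCloseable {
    public boolean closed = false;

    public int value() {
      return 42;
    }

    @Override
    public void close() {
      closed = true;
    }
  }

  // Before: try with no catch, only a finally doing cleanup.
  public static int readVerbose(Tracker t) {
    try {
      return t.value();
    } finally {
      t.close();
    }
  }

  // After: try-with-resources expresses the same cleanup declaratively
  // and stays correct even if value() were to throw.
  public static int read(Tracker t) {
    try (Tracker resource = t) {
      return resource.value();
    }
  }
}
```

Not every finally block can be converted (some do work other than closing a resource), which is why the issue calls for revisiting them case by case.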
[jira] [Assigned] (YARN-4944) Handle lack of ResourceCalculatorPlugin gracefully
[ https://issues.apache.org/jira/browse/YARN-4944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Szilard Nemeth reassigned YARN-4944: Assignee: Susheel Gupta > Handle lack of ResourceCalculatorPlugin gracefully > -- > > Key: YARN-4944 > URL: https://issues.apache.org/jira/browse/YARN-4944 > Project: Hadoop YARN > Issue Type: Improvement > Components: nodemanager >Affects Versions: 2.8.0 >Reporter: Karthik Kambatla >Assignee: Susheel Gupta >Priority: Major > Labels: newbie++, trivial > > On some systems (e.g. mac), the NM might not be able to instantiate a > ResourceCalculatorPlugin, which leads to logging a bunch of error messages. We > could improve the way we handle this. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Assigned] (YARN-5607) Document TestContainerResourceUsage#waitForContainerCompletion
[ https://issues.apache.org/jira/browse/YARN-5607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Szilard Nemeth reassigned YARN-5607: Assignee: Susheel Gupta (was: Gergely Pollák) > Document TestContainerResourceUsage#waitForContainerCompletion > -- > > Key: YARN-5607 > URL: https://issues.apache.org/jira/browse/YARN-5607 > Project: Hadoop YARN > Issue Type: Test > Components: resourcemanager, test >Affects Versions: 2.9.0 >Reporter: Karthik Kambatla >Assignee: Susheel Gupta >Priority: Major > Labels: newbie > > The logic in TestContainerResourceUsage#waitForContainerCompletion > (introduced in YARN-5024) is not immediately obvious. It could use some > documentation. Also, this seems like a useful helper method. Should this be > moved to one of the mock classes or to a util class? -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Assigned] (YARN-6412) aux-services classpath not documented
[ https://issues.apache.org/jira/browse/YARN-6412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Szilard Nemeth reassigned YARN-6412: Assignee: Riya Khandelwal (was: Siddharth Ahuja) > aux-services classpath not documented > - > > Key: YARN-6412 > URL: https://issues.apache.org/jira/browse/YARN-6412 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Miklos Szegedi >Assignee: Riya Khandelwal >Priority: Minor > Labels: docuentation, newbie > > YARN-4577 introduced two new configuration entries > yarn.nodemanager.aux-services.%s.classpath and > yarn.nodemanager.aux-services.%s.system-classes. These are not documented in > hadoop-yarn-common/.../yarn-default.xml -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Assigned] (YARN-6971) Clean up different ways to create resources
[ https://issues.apache.org/jira/browse/YARN-6971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Szilard Nemeth reassigned YARN-6971: Assignee: Riya Khandelwal > Clean up different ways to create resources > --- > > Key: YARN-6971 > URL: https://issues.apache.org/jira/browse/YARN-6971 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager, scheduler >Reporter: Yufei Gu >Assignee: Riya Khandelwal >Priority: Minor > Labels: newbie > > There are several ways to create a {{resource}} object, e.g., > BuilderUtils.newResource() and Resources.createResource(). These methods not > only cause confusion but also performance issues; for example, > BuilderUtils.newResource() is significantly slower than > Resources.createResource(). > We could merge them somehow, and replace most BuilderUtils.newResource() calls > with Resources.createResource(). -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
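One way the consolidation in YARN-6971 could look is a single canonical factory with the legacy entry point delegating to it. The `Resource` type below is a simplified stand-in, not the real YARN class, and the delegation shape is an assumption about how the merge might be done.

```java
// Hypothetical sketch of merging the two resource factories: keep one
// canonical, lightweight factory and make the legacy one delegate to it.
public class ResourceFactorySketch {
  public static final class Resource {
    public final long memory;
    public final int vCores;

    Resource(long memory, int vCores) {
      this.memory = memory;
      this.vCores = vCores;
    }
  }

  // Canonical factory (analogous to Resources.createResource()).
  public static Resource createResource(long memory, int vCores) {
    return new Resource(memory, vCores);
  }

  // Legacy entry point (analogous to BuilderUtils.newResource()), kept
  // for source compatibility but now just delegating.
  @Deprecated
  public static Resource newResource(long memory, int vCores) {
    return createResource(memory, vCores);
  }
}
```

Delegating rather than deleting the old method lets existing callers migrate gradually while removing the duplicated (and slower) construction path.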
[jira] [Resolved] (YARN-6766) Add helper method in FairSchedulerAppsBlock to print app info
[ https://issues.apache.org/jira/browse/YARN-6766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Szilard Nemeth resolved YARN-6766. -- Hadoop Flags: Reviewed Resolution: Fixed > Add helper method in FairSchedulerAppsBlock to print app info > - > > Key: YARN-6766 > URL: https://issues.apache.org/jira/browse/YARN-6766 > Project: Hadoop YARN > Issue Type: Improvement > Components: webapp >Affects Versions: 2.8.1, 3.0.0-alpha3 >Reporter: Daniel Templeton >Assignee: Riya Khandelwal >Priority: Minor > Labels: newbie, pull-request-available, trivial > Fix For: 3.4.0 > > > The various {{*AppsBlock}} classes are riddled with statements like: > {code}.append(appInfo.getReservedVCores() == -1 ? "N/A" : > String.valueOf(appInfo.getReservedVCores())){code} > The code would be much cleaner if there were a utility method for that > operation, e.g.: > {code}.append(printData(appInfo.getReservedCores())){code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-6766) Add helper method in FairSchedulerAppsBlock to print app info
[ https://issues.apache.org/jira/browse/YARN-6766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Szilard Nemeth updated YARN-6766: - Fix Version/s: 3.4.0 > Add helper method in FairSchedulerAppsBlock to print app info > - > > Key: YARN-6766 > URL: https://issues.apache.org/jira/browse/YARN-6766 > Project: Hadoop YARN > Issue Type: Improvement > Components: webapp >Affects Versions: 2.8.1, 3.0.0-alpha3 >Reporter: Daniel Templeton >Assignee: Riya Khandelwal >Priority: Minor > Labels: newbie, pull-request-available, trivial > Fix For: 3.4.0 > > > The various {{*AppsBlock}} classes are riddled with statements like: > {code}.append(appInfo.getReservedVCores() == -1 ? "N/A" : > String.valueOf(appInfo.getReservedVCores())){code} > The code would be much cleaner if there were a utility method for that > operation, e.g.: > {code}.append(printData(appInfo.getReservedCores())){code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
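The helper suggested in YARN-6766 is small enough to sketch directly. The method name `printData` comes from the issue text; the containing class name is a hypothetical placeholder.

```java
// Sketch of the YARN-6766 helper: one place to map the -1 sentinel to
// "N/A". The class name AppsBlockUtil is a placeholder, not YARN code.
public class AppsBlockUtil {
  public static String printData(long value) {
    return value == -1 ? "N/A" : String.valueOf(value);
  }
}
```

The repeated ternary in the `*AppsBlock` classes then collapses to a call like `.append(AppsBlockUtil.printData(appInfo.getReservedVCores()))`.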