[jira] [Commented] (YARN-11016) Queue weight is incorrectly reset to zero

2022-04-07 Thread Imran Chaush (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-11016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17518921#comment-17518921
 ] 

Imran Chaush commented on YARN-11016:
-

Thanks

> Queue weight is incorrectly reset to zero
> -
>
> Key: YARN-11016
> URL: https://issues.apache.org/jira/browse/YARN-11016
> Project: Hadoop YARN
> Issue Type: Bug
> Components: capacity scheduler
> Reporter: Andras Gyori
> Assignee: Andras Gyori
> Priority: Major
> Labels: pull-request-available
> Fix For: 3.4.0
>
> Time Spent: 50m
> Remaining Estimate: 0h
>
> QueueCapacities#clearConfigurableFields sets the WEIGHT capacity to 0, which
> can cause problems such as in the following scenario:
> 1. Queues are initialized.
> 2. Parent 'parent' has accessibleNodeLabels set, and since accessible node
> labels are inherited, its children (for example 'child') have the 'test'
> label as an accessible node label.
> 3. In LeafQueue#updateClusterResource, we call
> LeafQueue#activateApplications, which then calls
> LeafQueue#calculateAndGetAMResourceLimitPerPartition for each label (see
> getNodeLabelsForQueue).
> In this case, the labels are the accessible node labels (the inherited
> 'test').
> During this call, the ResourceUsage object is updated for the label 'test',
> thus extending its nodeLabelsSet with 'test'.
> 4. In a subsequent updateClusterResource call, for example on an addNode
> event, the 'test' label is now present in ResourceUsage even though it was
> never explicitly configured. We then call
> CSQueueUtils#updateQueueStatistics, which takes the union of the node labels
> from QueueCapacities and ResourceUsage (this union is now the empty default
> label and 'test') and updates QueueCapacities for the label 'test'.
> Now QueueCapacities has 'test' in its nodeLabelsSet as well!
> 5. After a reinitialization (such as an update through the mutation API),
> CSQueueUtils#loadCapacitiesByLabelsFromCon is called, which resets the
> QueueCapacities values to zero (even weight, which is wrong in my opinion)
> and loads the values again from the config.
> The problem here is that the values are reset for all node labels in
> QueueCapacities (even for 'test'), but they are reloaded only for the
> configured node labels (which we did not set, so they default to the empty
> label).
> 6. Now all children of 'parent' have weight=0 for 'test' in QueueCapacities,
> and that is why the update fails.
> This also explains why validation passes: the validation endpoint
> instantiates a brand new CapacityScheduler, in which this cascade of effects
> cannot accumulate (as there are no repeated updateClusterResource calls).
> This scenario manifests as an error when updating via mutation API:
> {noformat}
> Failed to re-init queues : Parent queue 'parent' have children queue used 
> mixed of weight mode, percentage and absolute mode, it is not allowed, please 
> double check, details:{noformat}
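
The cascade described in the issue can be reproduced in a small toy model. The sketch below is only an illustration under stated assumptions: WeightResetSketch, ToyCapacities, register, loadFromConfig, the -1 "unset" marker and the configured weight of 2 are all hypothetical names and values chosen for the example, not the real QueueCapacities, ResourceUsage or CSQueueUtils code. It shows a label that was only ever touched at runtime ending up with weight 0 after a reinitialization, while a freshly built instance never sees that label at all.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Toy model of the scenario; none of these types are the real YARN classes.
final class WeightResetSketch {
    static final String NO_LABEL = "";   // the empty default label
    static final float UNSET = -1f;      // assumed "no weight configured" marker

    /** Hypothetical stand-in for the per-label weights kept by QueueCapacities. */
    static final class ToyCapacities {
        private final Map<String, Float> weightByLabel = new HashMap<>();

        float getWeight(String label) {
            return weightByLabel.getOrDefault(label, UNSET);
        }

        Set<String> labels() {
            return new HashSet<>(weightByLabel.keySet());
        }

        // Touching a label registers it without giving it a weight (step 4).
        void register(String label) {
            weightByLabel.putIfAbsent(label, UNSET);
        }

        // Mirrors the reported behaviour: weight is zeroed for every known label.
        void clearConfigurableFields() {
            weightByLabel.replaceAll((label, weight) -> 0f);
        }

        // Only labels present in the configuration get a value loaded back.
        void loadFromConfig(Map<String, Float> configuredWeights) {
            clearConfigurableFields();
            weightByLabel.putAll(configuredWeights);
        }
    }

    static Map<String, Float> config() {
        Map<String, Float> conf = new HashMap<>();
        conf.put(NO_LABEL, 2f); // hypothetical configured weight for the default label
        return conf;
    }

    /** Long-lived scheduler: labels accumulate between reinitializations. */
    static ToyCapacities runScenario() {
        ToyCapacities capacities = new ToyCapacities();
        capacities.loadFromConfig(config());          // step 1: initialize queues

        Set<String> usageLabels = new HashSet<>();    // toy ResourceUsage label set
        usageLabels.add("test");                      // step 3: inherited label is touched

        Set<String> union = capacities.labels();      // step 4: union of both label sets
        union.addAll(usageLabels);
        union.forEach(capacities::register);

        capacities.loadFromConfig(config());          // step 5: reinitialization
        return capacities;
    }

    /** Validation path: a brand new instance with nothing accumulated. */
    static ToyCapacities freshValidation() {
        ToyCapacities capacities = new ToyCapacities();
        capacities.loadFromConfig(config());
        return capacities;
    }

    public static void main(String[] args) {
        System.out.println("live  'test' weight: " + runScenario().getWeight("test"));
        System.out.println("fresh 'test' weight: " + freshValidation().getWeight("test"));
    }
}
```

In this model the long-lived instance reports a zero weight for 'test' next to a configured weight for the default label, whereas the fresh instance still reports the label as unset, which matches the observation that validation passes while the actual update fails.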



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-11016) Queue weight is incorrectly reset to zero

2021-12-08 Thread Szilard Nemeth (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-11016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17455172#comment-17455172
 ] 

Szilard Nemeth commented on YARN-11016:
---

Thanks [~gandras] for the confirmation. Resolving this jira then.







[jira] [Commented] (YARN-11016) Queue weight is incorrectly reset to zero

2021-12-07 Thread Andras Gyori (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-11016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17455008#comment-17455008
 ] 

Andras Gyori commented on YARN-11016:
-

[~snemeth] Weight mode was introduced in 3.4, so this fix does not need to be 
backported to branch-3.3 or branch-3.2.







[jira] [Commented] (YARN-11016) Queue weight is incorrectly reset to zero

2021-12-07 Thread Szilard Nemeth (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-11016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17454697#comment-17454697
 ] 

Szilard Nemeth commented on YARN-11016:
---

Hi [~gandras],
Just committed your patch to trunk.
Could you please check whether it's required to backport this to branch-3.3 / 
branch-3.2?
Thanks.




