[jira] [Commented] (YARN-11016) Queue weight is incorrectly reset to zero
    [ https://issues.apache.org/jira/browse/YARN-11016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17518921#comment-17518921 ]

Imran Chaush commented on YARN-11016:
-------------------------------------

Thanks

> Queue weight is incorrectly reset to zero
> -----------------------------------------
>
>                 Key: YARN-11016
>                 URL: https://issues.apache.org/jira/browse/YARN-11016
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: capacity scheduler
>            Reporter: Andras Gyori
>            Assignee: Andras Gyori
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 3.4.0
>
>          Time Spent: 50m
>  Remaining Estimate: 0h
>
> QueueCapacities#clearConfigurableFields sets the WEIGHT capacity to 0, which
> can cause problems such as the following scenario:
> 1. Queues are initialized.
> 2. Parent 'parent' has accessibleNodeLabels set, and since accessible node
> labels are inherited, its children, for example 'child', have the 'test'
> label as their accessible-node-label.
> 3. In LeafQueue#updateClusterResource, we call
> LeafQueue#activateApplications, which then calls
> LeafQueue#calculateAndGetAMResourceLimitPerPartition for each label (see
> getNodeLabelsForQueue).
> In this case, the labels are the accessible node labels (the inherited
> 'test').
> During this event, the ResourceUsage object is updated for the label 'test',
> thus extending its nodeLabelsSet with 'test'.
> 4. In a subsequent updateClusterResource call, for example on an addNode
> event, ResourceUsage now contains the 'test' label even though it was never
> explicitly configured. We then call CSQueueUtils#updateQueueStatistics, which
> takes the union of the node labels from QueueCapacities and ResourceUsage
> (this union is now the empty default label AND 'test') and updates
> QueueCapacities with the label 'test'.
> Now QueueCapacities has 'test' in its nodeLabelsSet as well!
> 5. After a reinitialization (like an update from the mutation API),
> CSQueueUtils#loadCapacitiesByLabelsFromCon is called, which resets the
> QueueCapacities values to zero (even weight, which is wrong in my opinion)
> and loads the values again from the config.
> The problem here is that the values are reset for all node labels in
> QueueCapacities (even for 'test'), but we only load the values for the
> configured node labels (which we did not set, so they default to the empty
> label).
> 6. Now all children of 'parent' have weight=0 for 'test' in QueueCapacities,
> and that is why the update fails.
> It even explains why validation passes: the validation endpoint
> instantiates a brand new CapacityScheduler, in which this cascade of effects
> cannot accumulate (there are no repeated updateClusterResource calls).
> This scenario manifests as an error when updating via the mutation API:
> {noformat}
> Failed to re-init queues : Parent queue 'parent' have children queue used
> mixed of weight mode, percentage and absolute mode, it is not allowed, please
> double check, details:{noformat}

--
This message was sent by Atlassian Jira
(v8.20.1#820001)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
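
The cascade in steps 1-6 of the issue description can be reproduced with a toy model. This is only a sketch under stated assumptions, not the Hadoop code: ToyCapacities, updateStatistics, weightAfterReinit and the UNSET sentinel (-1) are hypothetical names made up for illustration, and resetting to a sentinel is shown as one possible alternative to zeroing, not as the committed patch.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Toy model of the label bookkeeping in YARN-11016; NOT the Hadoop classes. */
final class WeightResetSketch {

  /** Assumption for this sketch: -1 means "no weight configured". */
  static final float UNSET = -1f;

  /** Hypothetical stand-in for QueueCapacities: one weight per node label. */
  static final class ToyCapacities {
    private final Map<String, Float> weightByLabel = new HashMap<>();

    /** Returns a mutable copy of the known labels. */
    Set<String> labels() { return new TreeSet<>(weightByLabel.keySet()); }

    float weight(String label) { return weightByLabel.getOrDefault(label, UNSET); }

    /** Registers a label without configuring a weight for it. */
    void touch(String label) { weightByLabel.putIfAbsent(label, UNSET); }

    /** Resets the weight of EVERY known label, configured or not. */
    void resetAll(float value) { weightByLabel.replaceAll((label, old) -> value); }

    /** Loads weights for the configured labels only. */
    void load(Map<String, Float> configured) { weightByLabel.putAll(configured); }
  }

  /** Steps 3-4: labels seen in resource usage leak into the capacities. */
  static void updateStatistics(ToyCapacities caps, Set<String> usageLabels) {
    Set<String> union = caps.labels();
    union.addAll(usageLabels);
    for (String label : union) {
      caps.touch(label);
    }
  }

  /** Runs the whole scenario and returns the weight left on label 'test'. */
  static float weightAfterReinit(float resetValue) {
    // Only the empty default label is configured, and it has no weight.
    Map<String, Float> config = Map.of("", UNSET);
    ToyCapacities caps = new ToyCapacities();
    caps.load(config);                        // step 1: initial load
    updateStatistics(caps, Set.of("test"));   // steps 3-4: 'test' leaks in
    caps.resetAll(resetValue);                // step 5: reset all labels
    caps.load(config);                        // step 5: reload configured labels
    return caps.weight("test");               // step 6: what is left behind
  }

  public static void main(String[] args) {
    System.out.println("reset to 0:  weight(test) = " + weightAfterReinit(0f));
    System.out.println("reset to -1: weight(test) = " + weightAfterReinit(UNSET));
  }
}
```

With a reset value of 0, the never-configured 'test' label is left with weight 0, so in this model it would look like an explicitly weighted label. With the sentinel, it stays unset.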
[jira] [Commented] (YARN-11016) Queue weight is incorrectly reset to zero
    [ https://issues.apache.org/jira/browse/YARN-11016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17455172#comment-17455172 ]

Szilard Nemeth commented on YARN-11016:
---------------------------------------

Thanks [~gandras] for the confirmation.
Resolving this jira then.
[jira] [Commented] (YARN-11016) Queue weight is incorrectly reset to zero
    [ https://issues.apache.org/jira/browse/YARN-11016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17455008#comment-17455008 ]

Andras Gyori commented on YARN-11016:
-------------------------------------

[~snemeth] Weight mode has been introduced in 3.4, therefore we do not need this fix there.
[jira] [Commented] (YARN-11016) Queue weight is incorrectly reset to zero
    [ https://issues.apache.org/jira/browse/YARN-11016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17454697#comment-17454697 ]

Szilard Nemeth commented on YARN-11016:
---------------------------------------

Hi [~gandras],
Just committed your patch to trunk.
Could you please check whether it's required to backport this to branch-3.3 / branch-3.2?
Thanks.