[ 
https://issues.apache.org/jira/browse/YUNIKORN-2731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xi Chen updated YUNIKORN-2731:
------------------------------
    Description: 
We have encountered this issue in one of our clusters every few days. We are
running a version built from branch
[https://github.com/apache/yunikorn-k8shim/commits/branch-1.5/] at commit
fb4e3f11345e6a9866dfaea97770c94b9421807b.

Here is our queues.yaml configuration:
{code:yaml}
partitions:
  - name: default
    nodesortpolicy:
      type: binpacking
    preemption:
      enabled: false
    placementrules:
      - name: tag
        value: namespace
        create: false
    queues:
      - name: root
        submitacl: '*'
        queues:
          - name: c
            resources:
              guaranteed:
                memory: 13000Gi
                vcore: 3250
              max:
                memory: 13000Gi
                vcore: 3250
            properties:
              application.sort.policy: fair
          - name: e
            resources:
              guaranteed:
                memory: 2600Gi
                vcore: 650
              max:
                memory: 2600Gi
                vcore: 650
            properties:
              application.sort.policy: fair
          - name: m1
            resources:
              guaranteed:
                memory: 1000Gi
                vcore: 250
              max:
                memory: 1000Gi
                vcore: 250
            properties:
              application.sort.policy: fair
          - name: m2
            resources:
              guaranteed:
                memory: 62000Gi
                vcore: 15500
              max:
                memory: 62000Gi
                vcore: 15500
            properties:
              application.sort.policy: fair {code}
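To illustrate how "available" could end up reading something like vcore:-56150 despite the max limits above: a minimal Go sketch (hypothetical types and names, not YuniKorn's actual code) of queue headroom accounting that subtracts released allocations without clamping at zero. An accounting bug such as a double release, or a release charged to the wrong queue, would drive availability negative this way.

```go
package main

import "fmt"

// Resource is a simplified stand-in for a scheduler resource map
// (illustrative only; not YuniKorn's actual type).
type Resource map[string]int64

// Sub subtracts alloc from r without a lower bound. If more is
// subtracted than was ever added (e.g. an allocation released twice),
// the remaining availability goes below zero.
func (r Resource) Sub(alloc Resource) {
	for k, v := range alloc {
		r[k] -= v
	}
}

func main() {
	// Queue "e" from the config above: max vcore 650 = 65000 milli-vcores.
	avail := Resource{"vcore": 65000, "memory": 2600 << 30}
	alloc := Resource{"vcore": 500, "memory": 2 << 30}

	// Subtract 200 allocations of 500 milli-vcores each: more than the
	// queue's max, so availability overshoots into negative territory.
	for i := 0; i < 200; i++ {
		avail.Sub(alloc)
	}
	fmt.Println(avail["vcore"]) // negative
}
```

Once availability is negative, every new request "does not fit" in the queue, which would match the scheduler rejecting all pods as seen below.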
The issue is that at some point the scheduler stops starting new containers;
eventually there are 0 containers running and many applications are stuck in
Accepted status.

!Applications stuck in Accepted status.png|width=1211,height=407!

Some logs contain a negative vcore resource, and these logs are closely
correlated with the issue in the timeline.
{code}
2024-07-08T10:19:13.436Z    INFO    core.scheduler    
scheduler/scheduler.go:101    Found outstanding requests that will trigger 
autoscaling    {"number of requests": 1, "total resources": 
"map[memory:2147483648 pods:1 vcore:500]"}
E0708 10:19:13.604563       1 event_broadcaster.go:270] "Server rejected event 
(will not retry!)" err="Event 
\"example-job-1720433945-574-aa32179091daba13-driver.17e03590c7f8bd76\" is 
invalid: [action: Required value, reason: Required value]" 
event="&Event{ObjectMeta:{example-job-1720433945-574-aa32179091daba13-driver.17e03590c7f8bd76
  c    0 0001-01-01 00:00:00 +0000 UTC <nil> <nil> map[] map[] [] [] 
[]},EventTime:2024-07-08 10:19:13.60205325 +0000 UTC 
m=+524778.657618517,Series:nil,ReportingController:yunikorn,ReportingInstance:yunikorn-yunikorn-scheduler-84cb695b5d-lr42h,Action:,Reason:,Regarding:{Pod
 c example-job-1720433945-574-aa32179091daba13-driver 
a1901455-3a23-4dbc-bbb2-9dd2dfc775ea v1 201821875 },Related:nil,Note:Request 
'a1901455-3a23-4dbc-bbb2-9dd2dfc775ea' does not fit in queue 'root.c' 
(requested map[memory:2147483648 pods:1 vcore:500], available 
map[ephemeral-storage:2689798906768 hugepages-1Gi:0 hugepages-2Mi:0 
hugepages-32Mi:0 hugepages-64Ki:0 memory:605510809350 pods:1715 
vcore:-56150]),Type:Normal,DeprecatedSource:{ 
},DeprecatedFirstTimestamp:0001-01-01 00:00:00 +0000 
UTC,DeprecatedLastTimestamp:0001-01-01 00:00:00 +0000 UTC,DeprecatedCount:0,}"

2024-07-08T10:19:05.391Z    INFO    core.scheduler    
scheduler/scheduler.go:101    Found outstanding requests that will trigger 
autoscaling    {"number of requests": 1, "total resources": 
"map[memory:2147483648 pods:1 vcore:500]"}
E0708 10:19:05.601679       1 event_broadcaster.go:270] "Server rejected event 
(will not retry!)" err="Event 
\"example-job-1720433937-295-e7b2229091da99a7-driver.17e0358eeaf728e4\" is 
invalid: [action: Required value, reason: Required value]" 
event="&Event{ObjectMeta:{example-job-1720433937-295-e7b2229091da99a7-driver.17e0358eeaf728e4
  e    0 0001-01-01 00:00:00 +0000 UTC <nil> <nil> map[] map[] [] [] 
[]},EventTime:2024-07-08 10:19:05.599216316 +0000 UTC 
m=+524770.654781585,Series:nil,ReportingController:yunikorn,ReportingInstance:yunikorn-yunikorn-scheduler-84cb695b5d-lr42h,Action:,Reason:,Regarding:{Pod
 e example-job-1720433937-295-e7b2229091da99a7-driver 
14a40bac-5e89-4293-bcb7-936c544694a2 v1 201821666 },Related:nil,Note:Request 
'14a40bac-5e89-4293-bcb7-936c544694a2' does not fit in queue 'root.e' 
(requested map[memory:2147483648 pods:1 vcore:500], available 
map[ephemeral-storage:2689798906768 hugepages-1Gi:0 hugepages-2Mi:0 
hugepages-32Mi:0 hugepages-64Ki:0 memory:605510809350 pods:1715 
vcore:-56150]),Type:Normal,DeprecatedSource:{ 
},DeprecatedFirstTimestamp:0001-01-01 00:00:00 +0000 
UTC,DeprecatedLastTimestamp:0001-01-01 00:00:00 +0000 UTC,DeprecatedCount:0,}"

2024-07-08T10:18:51.325Z    INFO    core.scheduler    
scheduler/scheduler.go:101    Found outstanding requests that will trigger 
autoscaling    {"number of requests": 1, "total resources": 
"map[memory:2147483648 pods:1 vcore:500]"}
E0708 10:18:51.596390       1 event_broadcaster.go:270] "Server rejected event 
(will not retry!)" err="Event 
\"example-job-1720433923-763-378d629091da6500-driver.17e0358ba82f71a5\" is 
invalid: [action: Required value, reason: Required value]" 
event="&Event{ObjectMeta:{example-job-1720433923-763-378d629091da6500-driver.17e0358ba82f71a5
  m1    0 0001-01-01 00:00:00 +0000 UTC <nil> <nil> map[] map[] [] [] 
[]},EventTime:2024-07-08 10:18:51.593930204 +0000 UTC 
m=+524756.649495472,Series:nil,ReportingController:yunikorn,ReportingInstance:yunikorn-yunikorn-scheduler-84cb695b5d-lr42h,Action:,Reason:,Regarding:{Pod
 m1 example-job-1720433923-763-378d629091da6500-driver 
f0c19c6a-6eb5-4e68-808d-389862c197cb v1 201821358 },Related:nil,Note:Request 
'f0c19c6a-6eb5-4e68-808d-389862c197cb' does not fit in queue 'root.m1' 
(requested map[memory:2147483648 pods:1 vcore:500], available 
map[ephemeral-storage:2689798906768 hugepages-1Gi:0 hugepages-2Mi:0 
hugepages-32Mi:0 hugepages-64Ki:0 memory:605510809350 pods:1715 
vcore:-56150]),Type:Normal,DeprecatedSource:{ 
},DeprecatedFirstTimestamp:0001-01-01 00:00:00 +0000 
UTC,DeprecatedLastTimestamp:0001-01-01 00:00:00 +0000 UTC,DeprecatedCount:0,}"

2024-07-08T10:18:03.231Z    INFO    shim.context    cache/context.go:1139    
app request originating pod added    {"appID": 
"spark-26e1b4f9c3124376aad12a9b63c8b711", "original task": 
"ffdf1356-4a7f-4559-9cbd-afa510f96cfe"}
E0708 10:18:03.584031       1 event_broadcaster.go:270] "Server rejected event 
(will not retry!)" err="Event 
\"another-example-job-1720433872-277-38b8b09091d9a492-driver.17e035807a6bb0df\" 
is invalid: [action: Required value, reason: Required value]" 
event="&Event{ObjectMeta:{another-example-job-1720433872-277-38b8b09091d9a492-driver.17e035807a6bb0df
  m2    0 0001-01-01 00:00:00 +0000 UTC <nil> <nil> map[] map[] [] [] 
[]},EventTime:2024-07-08 10:18:03.581485338 +0000 UTC 
m=+524708.637050606,Series:nil,ReportingController:yunikorn,ReportingInstance:yunikorn-yunikorn-scheduler-84cb695b5d-lr42h,Action:,Reason:,Regarding:{Pod
 m2 another-example-job-1720433872-277-38b8b09091d9a492-driver 
9b99dd53-cd1d-48b4-a8e3-c0c58f98a503 v1 201820328 },Related:nil,Note:Request 
'9b99dd53-cd1d-48b4-a8e3-c0c58f98a503' does not fit in queue 'root.m2' 
(requested map[memory:3758096384 pods:1 vcore:500], available 
map[ephemeral-storage:2490103866211 hugepages-1Gi:0 hugepages-2Mi:0 
hugepages-32Mi:0 hugepages-64Ki:0 memory:352023250073 pods:1635 
vcore:-87850]),Type:Normal,DeprecatedSource:{ 
},DeprecatedFirstTimestamp:0001-01-01 00:00:00 +0000 
UTC,DeprecatedLastTimestamp:0001-01-01 00:00:00 +0000 UTC,DeprecatedCount:0,}"
 {code}
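For anyone trying to catch this condition from logs: a small, self-contained Go sketch (hypothetical helper, just log scraping, not a YuniKorn API) that extracts the vcore value from the "available map[...]" section of the rejection events above and flags negative availability.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// vcoreRe pulls the vcore value out of the "available map[...]" section
// of the rejection events (e.g. "available map[... vcore:-56150]").
var vcoreRe = regexp.MustCompile(`available map\[[^\]]*vcore:(-?\d+)`)

// negativeVcore returns the vcore availability found in line and
// whether it is negative, the signature of this issue.
func negativeVcore(line string) (int64, bool) {
	m := vcoreRe.FindStringSubmatch(line)
	if m == nil {
		return 0, false
	}
	v, err := strconv.ParseInt(m[1], 10, 64)
	if err != nil {
		return 0, false
	}
	return v, v < 0
}

func main() {
	line := `does not fit in queue 'root.c' (requested map[memory:2147483648 pods:1 vcore:500], available map[ephemeral-storage:2689798906768 memory:605510809350 pods:1715 vcore:-56150])`
	if v, neg := negativeVcore(line); neg {
		fmt.Printf("negative vcore availability: %d\n", v)
	}
}
```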
There are also warnings that the scheduler is not healthy, but those logs were
already present before the issue started:
{code}
2024-07-08T10:19:24.990Z    WARN    core.scheduler.health    
scheduler/health_checker.go:178    Scheduler is not healthy    {"name": 
"Consistency of data", "description": "Check if a partition's allocated 
resource <= total resource of the partition", "message": "Partitions with 
inconsistent data: [\"[foo-spark]default\"]"} {code}
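The same "Consistency of data" failure should be visible via the scheduler's REST health check endpoint (GET /ws/v1/scheduler/healthcheck). A sketch for pulling out failed checks from such a response; the JSON field names here are assumptions based on the log fields above ("name", "description", "message"), not a verified schema.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// healthCheck approximates one entry of the health endpoint response;
// field names are assumed, not a verified schema.
type healthCheck struct {
	Name             string `json:"Name"`
	Succeeded        bool   `json:"Succeeded"`
	Description      string `json:"Description"`
	DiagnosisMessage string `json:"DiagnosisMessage"`
}

// failedChecks returns the names of all checks that did not succeed,
// e.g. the "Consistency of data" check in the warning above.
func failedChecks(body []byte) ([]string, error) {
	var checks []healthCheck
	if err := json.Unmarshal(body, &checks); err != nil {
		return nil, err
	}
	var failed []string
	for _, c := range checks {
		if !c.Succeeded {
			failed = append(failed, c.Name)
		}
	}
	return failed, nil
}

func main() {
	sample := []byte(`[{"Name":"Consistency of data","Succeeded":false,"Description":"Check if a partition's allocated resource <= total resource of the partition","DiagnosisMessage":"Partitions with inconsistent data: [\"[foo-spark]default\"]"}]`)
	failed, err := failedChecks(sample)
	if err != nil {
		panic(err)
	}
	fmt.Println(failed)
}
```

Polling this endpoint could confirm whether the consistency failure and the negative vcore availability start at the same time.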
 


> YuniKorn stopped scheduling new containers with negative vcore in queue
> -----------------------------------------------------------------------
>
>                 Key: YUNIKORN-2731
>                 URL: https://issues.apache.org/jira/browse/YUNIKORN-2731
>             Project: Apache YuniKorn
>          Issue Type: Bug
>          Components: core - scheduler
>    Affects Versions: 1.5.1
>            Reporter: Xi Chen
>            Priority: Major
>         Attachments: Applications stuck in Accepted status.png
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
