[ 
https://issues.apache.org/jira/browse/YUNIKORN-2785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17870637#comment-17870637
 ] 

Craig Condit edited comment on YUNIKORN-2785 at 8/2/24 10:00 PM:
-----------------------------------------------------------------

This is expected behavior. YuniKorn is stateless by design and cannot know 
about pods that terminate while it is not running. 
However, consumers of the event API can detect when YuniKorn has restarted (as 
it will have a unique ID on the next run). Upon restart, YuniKorn will emit 
events for all applications and tasks which are detected upon startup. The 
event consumer can then compare these events against its own recorded state 
to detect what changed while YuniKorn was down.
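
As a rough illustration of that reconciliation, the sketch below tracks live 
allocations per scheduler run and reports the ones that were not re-emitted 
after a restart. The event shape (dicts with `instance_id`, `kind`, 
`alloc_id`, `ts`) and the `Reconciler` class are assumptions made for this 
example, not the actual YuniKorn event schema.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Reconciler:
    """Tracks live allocations and reconciles them after a scheduler restart."""

    instance_id: str | None = None                        # ID of the current scheduler run
    live: dict[str, float] = field(default_factory=dict)  # alloc_id -> start time
    seen: set[str] = field(default_factory=set)           # alloc_ids seen in the current run

    def handle(self, event: dict) -> None:
        # A changed instance ID means the scheduler restarted: reset the "seen" set.
        if event["instance_id"] != self.instance_id:
            self.instance_id = event["instance_id"]
            self.seen = set()
        alloc_id = event["alloc_id"]
        if event["kind"] == "add":
            self.seen.add(alloc_id)
            # Keep the original start time if the allocation was already known.
            self.live.setdefault(alloc_id, event["ts"])
        elif event["kind"] == "remove":
            self.live.pop(alloc_id, None)

    def missing_after_replay(self) -> list[str]:
        """Allocations known before the restart that were not re-emitted afterwards."""
        return sorted(a for a in self.live if a not in self.seen)


# Hypothetical event stream: exec-2 finishes while the scheduler is down.
r = Reconciler()
for ev in [
    {"instance_id": "run-1", "kind": "add", "alloc_id": "exec-1", "ts": 0.0},
    {"instance_id": "run-1", "kind": "add", "alloc_id": "exec-2", "ts": 1.0},
    {"instance_id": "run-2", "kind": "add", "alloc_id": "exec-1", "ts": 60.0},
]:
    r.handle(ev)
print(r.missing_after_replay())  # ['exec-2']
```

The caller has to decide when the post-restart replay is complete before 
calling `missing_after_replay()`; how to detect that is left open here.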

Alternatively, you could employ a persistent pod informer on the Kubernetes API 
to be notified of pod creations and deletions, though this won't give you 
YuniKorn-specific metadata.
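
A minimal sketch of the bookkeeping such a watcher might do is shown below. 
It assumes events arrive as a type (`ADDED`, `MODIFIED`, `DELETED`) plus the 
pod's name and phase, as a Kubernetes watch would deliver them; the `record` 
function and the `spans` layout are hypothetical, and wiring it to a real 
watch or informer is left out.

```python
from __future__ import annotations

# Kubernetes pod phases that mean the pod has finished running.
TERMINAL_PHASES = {"Succeeded", "Failed"}


def record(spans: dict[str, list[float | None]], event_type: str,
           pod: str, phase: str, ts: float) -> None:
    """Update spans[pod] = [first_seen, ended] from a single watch event."""
    span = spans.setdefault(pod, [ts, None])
    # Record the end time once, on the first terminal phase or deletion.
    if span[1] is None and (event_type == "DELETED" or phase in TERMINAL_PHASES):
        span[1] = ts


# Hypothetical sequence of events for one executor pod.
spans: dict[str, list[float | None]] = {}
record(spans, "ADDED", "exec-1", "Pending", 10.0)
record(spans, "MODIFIED", "exec-1", "Running", 12.0)
record(spans, "MODIFIED", "exec-1", "Succeeded", 95.0)
record(spans, "DELETED", "exec-1", "Succeeded", 120.0)
print(spans)  # {'exec-1': [10.0, 95.0]}
```

Because this runs independently of YuniKorn, the recorded end times survive a 
scheduler outage, at the cost of carrying only pod-level information.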

Also, the APP_SUMMARY log messages should be considered deprecated in favor of 
the event consumer API. They have numerous shortcomings (as you've seen).


> App summary resource usage is inaccurate if Yunikorn restarts
> -------------------------------------------------------------
>
>                 Key: YUNIKORN-2785
>                 URL: https://issues.apache.org/jira/browse/YUNIKORN-2785
>             Project: Apache YuniKorn
>          Issue Type: Bug
>          Components: core - scheduler
>            Reporter: Colin
>            Priority: Minor
>              Labels: events, resource
>
> My team needs to accurately track the resources used by Spark jobs. We're 
> currently using YuniKorn's app summary log emitted by the scheduler when a 
> job completes. However, we're aware that this log is inaccurate if YuniKorn 
> restarts while the job is running since YuniKorn keeps track of app resources 
> in memory only. To address this, we created a sidecar pod that connects to 
> YuniKorn's streaming event endpoint and saves the events to a Kafka topic for 
> persistence, should YuniKorn crash, allowing us to calculate our own app 
> summaries.
> However, we have noticed that if executor pods complete while YuniKorn is 
> down, YuniKorn never emits an allocation cancellation event. Thus, we cannot 
> determine when the executor pod stopped using resources. Using the last event 
> timestamp from the job provides an upper bound on the executor's resource 
> usage, and ignoring the executor entirely, as YuniKorn seems to do, provides 
> a lower bound.
> Below are the results from my testing:
> h3. Test without YuniKorn Restart
> I ran a job for about 5 minutes with a driver pod creating roughly 100 
> executor pods. The first execution was without restarting YuniKorn.
> *My results calculated using the events in the Kafka topic:*
> {code:java}
> Total aggregated resources usage:
> memory: 126643967751900.53
> pods: 9102.207251182002
> vcore: 35909809.17691601 {code}
> *YuniKorn's App Summary Log:*
> {code:java}
> 2024-07-30T23:04:58.526Z INFO core.scheduler.application.usage 
> objects/application_summary.go:60 YK_APP_SUMMARY: {ResourceUsage: 
> TrackedResource{UNKNOWN:pods=9048,UNKNOWN:vcore=35694000,UNKNOWN:memory=125880530632704},
>  PreemptedResource: TrackedResource{}, PlaceholderResource: 
> TrackedResource{}} {code}
> *The difference (my value - YuniKorn app summary value):*
> {code:java}
> memory: 126643967751900.53 - 125880530632704 = 763437119196.53 (my value is 
> 0.60647% greater)
> pods: 9102.207251182002 - 9048 = 54.207251182   (my value is 0.599% greater)
> vcore: 35909809.17691601 - 35694000 = 215809.176916  (my value is 0.6046% 
> greater) {code}
> My value is slightly different because I'm using the event timestamps and not 
> the resource timestamps (if you think it's something else then please share).
>  
> h3. Test with YuniKorn Restart
> I then ran the same job but shut YuniKorn down for about 30 seconds after 
> allocating resources to the driver and all executors, as the executors were 
> nearing completion. Then, I restarted YuniKorn.
>  
> +_Ignoring pods without cancellation events_+
> *My results calculated using the events in the Kafka topic:*
> {code:java}
> Total aggregated resources usage:
> memory: 13299125469337.467
> pods: 945.3453441859999
> vcore: 3760461.7715400006{code}
> *YuniKorn's App Summary Log:*
> {code:java}
> 2024-07-30T23:48:41.044Z INFO core.scheduler.application.usage 
> objects/application_summary.go:60 YK_APP_SUMMARY: {ResourceUsage: 
> TrackedResource{UNKNOWN:memory=12561602838528,UNKNOWN:vcore=3552000,UNKNOWN:pods=893},
>  PreemptedResource: TrackedResource{}, PlaceholderResource: 
> TrackedResource{}}{code}
> *The difference (my value - YuniKorn app summary value):*
> {code:java}
> memory: 13299125469337.467 - 12561602838528 = 737522630809 (my value is 
> 5.87124% greater)
> pods: 945.3453441859999 - 893 = 52.345344186   (my value is 5.8617% greater)
> vcore: 3760461.7715400006 - 3552000 = 208461.77154  (my value is 5.8688% 
> greater){code}
> There's a larger discrepancy this time. Notably, the number of pods shows a 
> significant drop. In typical runs without restarting YuniKorn, the job's pod 
> resource usage hovers around 9k.
>  
> h3. Using Last Event Timestamp as a Replacement
> When using the last event timestamp instead of the allocation cancellation 
> event to calculate resource usage, the results align closer to expectations 
> but remain significantly higher than YuniKorn's summary log, likely 
> representing an overestimate.
>  
> *My results calculated using the events in the Kafka topic:*
> {code:java}
> Number of allocations without matching cancels: 101
> Total aggregated resources usage:
> memory: 159366239582373.12
> pods: 11375.528615858
> vcore: 45109670.74353799{code}
> *The difference (my value - YuniKorn app summary value):*
> {code:java}
> memory: 159366239582373.12 - 12561602838528 = 1.4680464e+14 (my value is 
> 1168.6776% greater)
> pods: 11375.528615858 - 893 = 10482.5286159   (my value is 1173.85538% 
> greater)
> vcore: 45109670.74353799 - 3552000 = 41557670.7435  (my value is 1169.9794% 
> greater){code}
>  
> h3. Conclusion and Inquiry
> Is this a bug in YuniKorn? Besides logging events to a Kafka topic, are there 
> other strategies my team can employ to improve resource usage tracking?
> Any insights or recommendations would be greatly appreciated.
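
The lower and upper bounds described in the quoted report can be sketched as 
follows. This is an illustration only: the allocation records (dicts with 
`amount`, `start`, `end`) and the `usage_bounds` function are hypothetical, 
and the result is in resource-seconds for a single resource type.

```python
from __future__ import annotations


def usage_bounds(allocs: list[dict], last_event_ts: float) -> tuple[float, float]:
    """Return (lower, upper) resource-seconds for one resource type.

    lower: allocations without a cancellation event are ignored.
    upper: such allocations are assumed to run until last_event_ts.
    """
    lower = upper = 0.0
    for a in allocs:
        if a["end"] is not None:
            # Cancellation seen: the usage is known exactly and counts in both bounds.
            used = a["amount"] * (a["end"] - a["start"])
            lower += used
            upper += used
        else:
            # Cancellation lost: only the upper bound grows.
            upper += a["amount"] * (last_event_ts - a["start"])
    return lower, upper


# Hypothetical data: one allocation with a cancellation, one without.
allocs = [
    {"amount": 1000, "start": 0.0, "end": 300.0},
    {"amount": 2000, "start": 10.0, "end": None},
]
lower, upper = usage_bounds(allocs, last_event_ts=310.0)
print(lower, upper)  # 300000.0 900000.0
```

The true usage lies somewhere between the two values; narrowing the gap 
requires an independent source for the missing end times.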


