[jira] [Commented] (YUNIKORN-2532) Resource usage report has an incompatible format change

Craig Condit (Jira) Thu, 04 Apr 2024 11:34:05 -0700


    [ 
https://issues.apache.org/jira/browse/YUNIKORN-2532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17834047#comment-17834047
 ]


Craig Condit commented on YUNIKORN-2532:
----------------------------------------

{quote}In the long term, if we were to switch to event streaming, my 
current understanding is that we need to implement a service that takes the 
event stream, like the informer in YuniKorn that takes k8s events. In order not 
to lose any information, we need to keep the service alive (HA) and be able to 
switch to talking to the new YuniKorn server upon YuniKorn restart. If this 
service is down when YuniKorn is down, it may still lose data.
{quote}
This is correct, though I think you're overstating it a bit. Your current 
solution provides no persistence or HA either. It also has the rather large 
drawback of being unable to account for usage except when an application 
terminates. This makes it unsuitable for long-running applications that may 
exist for hours, days or even months.



YuniKorn exposes its REST API via a service, and provides a unique instance ID, 
so clients that are consuming events can retry connecting until YuniKorn 
responds. If the instance ID changes, the restart can be detected.
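
A minimal sketch of that client-side logic, assuming the caller supplies its 
own function for fetching the instance ID. The names fetch_with_retry, 
RestartDetector and fetch_instance_id are hypothetical and not part of any 
YuniKorn client library; the actual REST call is deliberately left to the 
caller.

```python
import time
from typing import Callable, Optional


def fetch_with_retry(fetch_instance_id: Callable[[], str],
                     max_attempts: int = 5,
                     delay_seconds: float = 1.0) -> str:
    """Call fetch_instance_id until it succeeds or attempts run out.

    fetch_instance_id is a caller-supplied (hypothetical) function that
    returns the scheduler's instance ID and raises ConnectionError while
    the scheduler is unreachable.
    """
    last_error: Optional[Exception] = None
    for _ in range(max_attempts):
        try:
            return fetch_instance_id()
        except ConnectionError as err:  # scheduler not reachable yet
            last_error = err
            time.sleep(delay_seconds)
    raise RuntimeError("scheduler did not respond") from last_error


class RestartDetector:
    """Remembers the last instance ID seen and flags a change."""

    def __init__(self) -> None:
        self.last_id: Optional[str] = None

    def observe(self, instance_id: str) -> bool:
        """Return True if the ID differs from the previous one (a restart)."""
        restarted = self.last_id is not None and instance_id != self.last_id
        self.last_id = instance_id
        return restarted
```

A consumer would call observe() with each ID it obtains and, when it returns 
True, treat any state tied to the previous instance as stale.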

> Resource usage report has an incompatible format change
> -------------------------------------------------------
>
>                 Key: YUNIKORN-2532
>                 URL: https://issues.apache.org/jira/browse/YUNIKORN-2532
>             Project: Apache YuniKorn
>          Issue Type: Bug
>          Components: core - scheduler
>            Reporter: Yongjun Zhang
>            Priority: Major
>
> There is some recent change that caused the application resource usage report 
> to have a new format:
> Prior to the change, the format was:
> {code:java}
> YK_APP_SUMMARY: {"appID": "adf53ee0-experiment-organicad-94520240-1-1", 
> "submissionTime": 1712169262131, "startTime": 1712169264134, "finishTime": 
> 1712173619983, "user": 
> "system:serviceaccount:spark-operator-02:spark-operator", "queue": 
> "root.queue-large", "state": "Completed", "rmID": "test-cluster", 
> "resourceUsage": 
> {"insttype-1":{"memory":139178200478515200,"pods":1729129,"vcore":5183062000},"insttype-2":{"memory":113789789798400,"pods":1413,"vcore":4239000}},
>  "preemptedResource": {}}
>   {code}
> With the change, the new format is:
> {code:java}
>  2024-04-04T00:33:08.532Z     INFO    core.scheduler.application.usage        
> objects/application_summary.go:60       YK_APP_SUMMARY: {ApplicationID: 
> afa303d0-test-trino-sparksql--20240404-2-1, SubmissionTime: 1712190615461, 
> StartTime: 1712190617496, FinishTime: 1712190788532, User: 
> system:serviceaccount:spark-operator-01:spark-operator, Queue: 
> root.queue-large, State: Completed, RmID: test-cluster, ResourceUsage: 
> TrackedResource{UNKNOWN:pods=177,UNKNOWN:vcore=354000,UNKNOWN:memory=1431454089216},
>  PreemptedResource: TrackedResource{}, PlaceholderResource: 
> TrackedResource{}}{code}
> There are several incompatibilities:
> 1. the class name TrackedResource was not there before, now it is.
> 2. the instance type was outside the resource part before, now it's embedded
> 3. the instance type was reported correctly before the change, now it's 
> UNKNOWN
> #3 may be a different issue, but it's observed by us at the same time.
> I think we should change the format back to the original one, as this is an 
> incompatible change. What do you think [~wilfreds], [~pbacsko], [~ccondit]?
> Thanks.
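
To illustrate why the change described in the issue breaks consumers, here is 
a small sketch of a log parser that relies on the payload after the 
YK_APP_SUMMARY prefix being JSON. The function parse_summary and the two 
sample lines are hypothetical and heavily abbreviated from the examples in 
the issue; they are not taken from any real tool.

```python
import json
from typing import Optional

PREFIX = "YK_APP_SUMMARY: "


def parse_summary(line: str) -> Optional[dict]:
    """Return the summary as a dict, or None if the payload is not JSON."""
    _, found, payload = line.partition(PREFIX)
    if not found:
        return None  # not a summary line at all
    try:
        return json.loads(payload)
    except json.JSONDecodeError:
        return None  # prefix present, but payload is not valid JSON


# Abbreviated, made-up samples in the shape of the two formats above.
old_line = ('YK_APP_SUMMARY: {"appID": "app-1", "state": "Completed", '
            '"resourceUsage": {"insttype-1": {"pods": 3}}}')
new_line = ('YK_APP_SUMMARY: {ApplicationID: app-1, State: Completed, '
            'ResourceUsage: TrackedResource{UNKNOWN:pods=3}}')

old_summary = parse_summary(old_line)  # parses into a nested dict
new_summary = parse_summary(new_line)  # None: unquoted keys are not JSON
```

Under these assumptions, a parser written against the old format silently 
yields nothing for the new one, which matches the incompatibility reported.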



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
