[
https://issues.apache.org/jira/browse/YUNIKORN-2532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17834047#comment-17834047
]
Craig Condit commented on YUNIKORN-2532:
----------------------------------------
{quote}In the long term, if we were to switch to event streaming, my
current understanding is that we need to implement a service that takes the
event stream, like the informer in YuniKorn that takes k8s events. In order not
to lose any information, we need to keep the service alive (HA) and be able to
switch to talking to the new YuniKorn server upon YuniKorn restart. If this
service is down when YuniKorn is down, it may still lose data.
{quote}
This is correct, though I think you're overstating it a bit. Your current
solution provides no persistence or HA either. It also has the rather large
drawback of being unable to account for usage except when an application
terminates. This makes it unsuitable for long-running applications that may
exist for hours, days or even months.
YuniKorn exposes its REST API via a service, and provides a unique instance ID,
so clients that are consuming events can retry connecting until YuniKorn
responds. If the instance ID changes, the restart can be detected.
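As a rough illustration of the retry-and-detect idea, here is a minimal Python sketch. It is not YuniKorn client code: fetch_instance_id is a hypothetical callable standing in for whatever request returns the instance ID, and the class and function names, the retry count and the delay are assumptions made for the example.

```python
import time
from typing import Callable, Optional


class RestartDetector:
    """Tracks the instance ID reported by the server across reconnects."""

    def __init__(self) -> None:
        self.last_instance_id: Optional[str] = None

    def observe(self, instance_id: str) -> bool:
        """Record the ID; return True if it differs from the previous one."""
        restarted = (
            self.last_instance_id is not None
            and instance_id != self.last_instance_id
        )
        self.last_instance_id = instance_id
        return restarted


def connect_with_retry(
    fetch_instance_id: Callable[[], str],
    detector: RestartDetector,
    max_attempts: int = 5,
    delay_seconds: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Retry until the server answers; return True if a restart was detected.

    fetch_instance_id is a hypothetical stand-in for the real request. It is
    expected to raise ConnectionError while the server is unreachable.
    """
    for attempt in range(max_attempts):
        try:
            instance_id = fetch_instance_id()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # give up after the last attempt
            sleep(delay_seconds)
            continue
        return detector.observe(instance_id)
    raise ValueError("max_attempts must be at least 1")
```

The first successful call only records the ID. A later call that returns a different ID reports a restart, and what the consumer does at that point (for example, re-reading state) is left to the consumer.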
> Resource usage report has an incompatible format change
> -------------------------------------------------------
>
> Key: YUNIKORN-2532
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2532
> Project: Apache YuniKorn
> Issue Type: Bug
> Components: core - scheduler
> Reporter: Yongjun Zhang
> Priority: Major
>
> A recent change caused the application resource usage report to have a new
> format.
> Prior to the change, the format was:
> {code:java}
> YK_APP_SUMMARY: {"appID": "adf53ee0-experiment-organicad-94520240-1-1",
> "submissionTime": 1712169262131, "startTime": 1712169264134, "finishTime":
> 1712173619983, "user":
> "system:serviceaccount:spark-operator-02:spark-operator", "queue":
> "root.queue-large", "state": "Completed", "rmID": "test-cluster",
> "resourceUsage":
> {"insttype-1":{"memory":139178200478515200,"pods":1729129,"vcore":5183062000},"insttype-2":{"memory":113789789798400,"pods":1413,"vcore":4239000}},
> "preemptedResource": {}}
> {code}
> With the change, the new format is:
> {code:java}
> 2024-04-04T00:33:08.532Z INFO core.scheduler.application.usage
> objects/application_summary.go:60 YK_APP_SUMMARY: {ApplicationID:
> afa303d0-test-trino-sparksql--20240404-2-1, SubmissionTime: 1712190615461,
> StartTime: 1712190617496, FinishTime: 1712190788532, User:
> system:serviceaccount:spark-operator-01:spark-operator, Queue:
> root.queue-large, State: Completed, RmID: test-cluster, ResourceUsage:
> TrackedResource{UNKNOWN:pods=177,UNKNOWN:vcore=354000,UNKNOWN:memory=1431454089216},
> PreemptedResource: TrackedResource{}, PlaceholderResource:
> TrackedResource{}}{code}
> There are several incompatibilities:
> 1. The class name TrackedResource was not there before; now it is.
> 2. The instance type was outside the resource part before; now it's embedded.
> 3. The instance type was reported correctly before the change; now it's
> UNKNOWN.
> #3 may be a different issue, but we observed it at the same time.
> I think we should change the format back to the original one, as this is an
> incompatible change. What do you think, [~wilfreds], [~pbacsko], [~ccondit]?
> Thanks.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)