[ 
https://issues.apache.org/jira/browse/YUNIKORN-2532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17834046#comment-17834046
 ] 

Yongjun Zhang commented on YUNIKORN-2532:
-----------------------------------------

Thanks [~ccondit], [~wilfreds] 

We will patch our own ingestion pipeline instead for the time being.
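
As a rough sketch of what such a patch could look like (Python only for 
illustration; parse_summary and parse_tracked are hypothetical names, and the 
new-format grammar is inferred from the single sample line in the issue 
description, not from the YuniKorn source), the parser below accepts both 
formats and puts the instance type back outside the resource map:

```python
import json
import re

MARKER = "YK_APP_SUMMARY: "
# Matches e.g. "ResourceUsage: TrackedResource{UNKNOWN:pods=177,...}"
TRACKED = re.compile(r"(\w+): TrackedResource\{([^}]*)\}")


def parse_tracked(body):
    """Turn 'inst:res=val,...' into {inst: {res: val}} (assumed grammar)."""
    usage = {}
    for entry in filter(None, body.split(",")):
        key, _, value = entry.rpartition("=")
        inst, _, res = key.rpartition(":")
        usage.setdefault(inst, {})[res] = int(value)
    return usage


def parse_summary(line):
    """Return the summary as a dict, or None if the marker is absent."""
    _, found, payload = line.partition(MARKER)
    if not found:
        return None
    payload = payload.strip()
    try:
        return json.loads(payload)  # old format: plain JSON
    except json.JSONDecodeError:
        pass  # fall through to the new "Key: value" format
    record = {name: parse_tracked(body)
              for name, body in TRACKED.findall(payload)}
    scalars = TRACKED.sub("", payload).strip("{} ")
    for field in scalars.split(", "):
        key, sep, value = field.partition(": ")
        if sep and value:
            record[key] = int(value) if value.isdigit() else value
    return record
```

Note that the key names still differ between the two formats (for example 
appID vs ApplicationID), so a rename map would be needed on top of this.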

The current logging does lose some in-memory data when YuniKorn restarts, but 
since we don't restart it often, it works for us for now: we treat resource 
usage as a close estimate rather than as data with 100% accuracy.

In the long term, if we were to switch to event streaming, my current 
understanding is that we need to implement a service that consumes the event 
stream, much like the informer in YuniKorn that consumes k8s events.  In order 
not to lose any information, we need to keep the service alive (HA) and make it 
able to switch to the new YuniKorn server when YuniKorn restarts.  If this 
service is down when YuniKorn is down, it may still lose data.
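
To make the question concrete, here is a minimal sketch of the reconnect loop 
such a service might run. Everything in it is an assumption: connect is a 
hypothetical callable that opens a stream to whichever scheduler instance is 
currently up and yields newline-delimited JSON events, and handle is whatever 
stores them. It does not address events emitted while the consumer itself is 
down.

```python
import json
import time


def consume(connect, handle, max_attempts=5, backoff_s=1.0, sleep=time.sleep):
    """Read newline-delimited JSON events, reconnecting when the stream drops.

    connect: hypothetical callable returning an iterable of text lines.
    handle:  callback invoked once per decoded event.
    """
    attempts = 0
    while attempts < max_attempts:
        try:
            for line in connect():
                if line.strip():
                    handle(json.loads(line))
                    attempts = 0  # stream is healthy again
            return  # stream ended cleanly
        except OSError:  # includes ConnectionError, e.g. on scheduler restart
            attempts += 1
            sleep(backoff_s * attempts)  # linear backoff before reconnecting
    raise RuntimeError("event stream unavailable")
```

The sleep parameter is injected only so the loop can be exercised without 
real delays.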

Is my understanding correct?

Thanks.

> Resource usage report has an incompatible format change
> -------------------------------------------------------
>
>                 Key: YUNIKORN-2532
>                 URL: https://issues.apache.org/jira/browse/YUNIKORN-2532
>             Project: Apache YuniKorn
>          Issue Type: Bug
>          Components: core - scheduler
>            Reporter: Yongjun Zhang
>            Priority: Major
>
> There is some recent change that caused the application resource usage report 
> to have a new format:
> Prior to the change, the format was:
> {code:java}
> YK_APP_SUMMARY: {"appID": "adf53ee0-experiment-organicad-94520240-1-1", 
> "submissionTime": 1712169262131, "startTime": 1712169264134, "finishTime": 
> 1712173619983, "user": 
> "system:serviceaccount:spark-operator-02:spark-operator", "queue": 
> "root.queue-large", "state": "Completed", "rmID": "test-cluster", 
> "resourceUsage": 
> {"insttype-1":{"memory":139178200478515200,"pods":1729129,"vcore":5183062000},"insttype-2":{"memory":113789789798400,"pods":1413,"vcore":4239000}},
>  "preemptedResource": {}}
>   {code}
> With the change, the new format is:
> {code:java}
>  2024-04-04T00:33:08.532Z     INFO    core.scheduler.application.usage        
> objects/application_summary.go:60       YK_APP_SUMMARY: {ApplicationID: 
> afa303d0-test-trino-sparksql--20240404-2-1, SubmissionTime: 1712190615461, 
> StartTime: 1712190617496, FinishTime: 1712190788532, User: 
> system:serviceaccount:spark-operator-01:spark-operator, Queue: 
> root.queue-large, State: Completed, RmID: test-cluster, ResourceUsage: 
> TrackedResource{UNKNOWN:pods=177,UNKNOWN:vcore=354000,UNKNOWN:memory=1431454089216},
>  PreemptedResource: TrackedResource{}, PlaceholderResource: 
> TrackedResource{}}{code}
> There are several incompatibilities:
> 1. the class name TrackedResource was not there before, now it is.
> 2. the instance type was outside the resource part before, now it's embedded
> 3. the instance type was reported correctly before the change, now it's 
> UNKNOWN
> #3 may be a different issue, but it's observed by us at the same time.
> I think we should change the format back to the original one, as this is an 
> incompatible change. What do you think, [~wilfreds], [~pbacsko], [~ccondit]?
> Thanks.


