Peter,

Thank you for starting this discussion. See inline for further comments.

> Hi all,
>
> Due to the number of problems that we have discovered since the release of
> 1.5.0, I believe it makes sense to create a new Yunikorn release which
> consists of bug fixes only. If I'm not mistaken we haven't done this before
> (at least since leaving the ASF incubator), so this would be the first
> minor Yunikorn release.

+1
I am totally for releasing YuniKorn 1.5.1 with the lock fixes.
Looking at all the work you have done for this release: would you be willing to also step up as a release manager for the 1.5.1 release?

> There are a bunch of fixes that are already on branch-1.5:
>
> - YUNIKORN-2521 Scheduler deadlock (resolved indirectly by YUNIKORN-2544)
> - YUNIKORN-2539 Add optional deadlock detection
> - YUNIKORN-2544 [UMBRELLA] Fix Yunikorn potential locking issues
> - YUNIKORN-2543 Fix locking in RMProxy
> - YUNIKORN-2545 Eliminate multiple lock calls from Queue
> - YUNIKORN-2548 Potential deadlock during concurrent
>   bottom-up/top-down queue traversal
> - YUNIKORN-2550 Fix locking in PartitionContext
> - YUNIKORN-2552 Recursive locking when sending remove queue event
> - YUNIKORN-2553 [core] Enable deadlock detection during unit tests
> - YUNIKORN-2563 [shim] Enable deadlock detection during unit tests
> - YUNIKORN-2574 totalPartitionResource should not be mutated with
>   AddTo/SubFrom
> - YUNIKORN-2562 Nil pointer panic in Application.ReplaceAllocation()

Yes for all the above.

> The following is In Progress for 1.5.1:
>
> - YUNIKORN-2526 Discrepancy between shim cache and core app/task list
>   after scheduler restart

This would be a good one to get in if we have some progress on this.
Do we understand what is going on yet? I looked at the jira and am not sure if we understand the root cause.

> Candidates:
>
> - YUNIKORN-2520 PVC errors in AssumePod() are not handled properly -
>   Resolved, only cherry-picking is needed

Yes, this could be added.
I also think we need to check if we have any CVE fixes that need to be added.
A quick check shows these:
* golang.org/x/net 0.23 (CVE-2023-45288 or GO-2024-2687 via YUNIKORN-2541)
* google.golang.org/protobuf to v1.33.0 (CVE-2024-24786 via YUNIKORN-2469)
* build with golang 1.21.9

To satisfy the scanners, although we are not affected:
* K8s 1.29.4 (CVE-2024-3177)

> - YUNIKORN-2057 FindQueueByAppID is slow - Critical priority, "In
>   progress" since Oct 2023
> - YUNIKORN-1089 Application handling with invalid task group annotations
>   - Critical priority, no progress
> - YUNIKORN-1988 Preemption happens when a queue lower than its
>   guaranteed capacity - Critical priority, "In progress" since Sep 2023

No for the last 3 mentioned. We did not block the 1.5.0 release on these and they have not made enough progress since then.
I would not consider them possible candidates for 1.5.1.

Wilfred

>
> Thoughts, opinions? What should be the scope of 1.5.1?
>
> Thanks,
> Peter