JoaoJandre commented on PR #12758: URL: https://github.com/apache/cloudstack/pull/12758#issuecomment-4092974611
> ## Testing
>
> **The patch coverage is 5.59%.** The classes that handle the most critical operations are almost completely untested.

Sadly, essentially no PRs in this project cover a substantial amount of code with unit tests. I feel that we should not arbitrarily decide which PRs must have coverage and which do not. Maybe we should discuss these standards and create a protocol for future reference, to be enforced in all PRs opened against the project. **However**, I will add more unit tests for the critical operations, and I hope that all PRs are held to the same standard from now on.

> **This is a 12k+ line PR with no Marvin tests.** The manual test matrices in the PR description are good to have, but without automated integration tests, there is no way to catch regressions. Future changes to the VM snapshot code, volume lifecycle, or backup framework could silently break KNIB (or the other way around) with no automated detection.

This feels like another standard that is being brought up arbitrarily. Again, perhaps we should discuss these standards and create a protocol for future reference and enforcement. Anyway, I will add a few Marvin tests as well.

> If the secondary storage selector allows clean separation of backup storage from everything else, this can work. But this needs to be clearly documented — operators must understand that without proper selector configuration, backups will share space and capacity tracking with templates, ISOs, and snapshots.

I plan to add documentation explaining this; I will work on a PR for the docs and link it here.

> The secondary storage selector JavaScript engine has had security issues — see CVE-2025-59302. While this is currently blocked by a global setting, extending the selector framework to also cover backups increases the surface area. This should at minimum be documented with a note about the possible impact.

As far as I know, the security issues have already been addressed: https://github.com/apache/cloudstack/issues/12523#issuecomment-3853102387. Furthermore, this is only accessible to Root Admins, so I hope they can avoid exploiting their own environment.

> Backups are generally expected to be stored in a separate, remote location from the primary infrastructure. If the secondary storage is co-located, the value of backups (surviving an infrastructure failure) is reduced. Was this tested with any meaningful network latency between the compute hosts / SSVM and a remote storage used for backups? What latencies were observed?

This is the same as for the concept of backup repositories in the NAS plugin. It is up to the people entering the configurations (appliances) into ACS.

> The `BackupError` state has no recovery path — no admin API, no background self-healing task, nothing. The spec says an operator must read the logs and recover the VM manually. In a real deployment, a management server restart or agent timeout during a backup window could put many VMs into this state at once. The tenants can't start, stop, migrate, or snapshot their VMs until an admin manually fixes each one. At the very least, this limitation and its operational impact should be clearly documented.

A management server restart will never put the VM in this state; however, an agent restart or timeout could. I fully understand that this state is scary, and I would not use it if it were not necessary. It was created to protect users from themselves: if an unexpected error occurs and ACS has no way of knowing the current state of the VM, it is best to stop the user from operating on a VM in this state. In a perfect world, ACS would be able to detect all possible inconsistencies and fix them; however, this could be very dangerous. If a bug were to occur and ACS "fixed" a VM in the wrong way, it could make the situation much worse.
That is why we decided to take a more defensive approach and avoid letting users operate on VMs in inconsistent states. We could create a "self-healing" process that scans VMs in this state and, if they fit a few criteria, fixes them. I would like to implement such a mechanism, but its absence should not stop this PR from moving forward; again, users of ACS do not need to use this plugin if they do not trust it. This type of inconsistency already happens with snapshot and bitmap corruptions that need to be addressed manually, and/or with the NAS backup plugin, and so on. For example, see https://github.com/apache/cloudstack/issues/12821 and https://github.com/apache/cloudstack/issues/12829. In any case, I will document the reason this state exists, a few possible reasons for a VM to end up in it, and a general checklist on how to go about fixing it.

> ### Changes to Existing Features
>
> This PR modifies 224 files. Many of the changes are not isolated to the KNIB plugin — they touch shared infrastructure.
>
> Affected areas
>
> VM state machine (VirtualMachine.java, UserVmManagerImpl.java) — two new states added. Every component that checks VM state is affected.

Could you point out where the issue is here?

> VM snapshot strategy (KvmFileBasedStorageVmSnapshotStrategy.java, 82 new lines, 0% coverage) — the disk-only VM snapshot revert, create, and delete paths are modified to account for backup deltas. This is an existing feature that other users depend on. A bug here breaks VM snapshots for everyone, not just KNIB users.

I wrote 100% of the implementation here: I made the spec, did the implementation, and I am the one supporting it. If I introduce a bug here, I will fix it. I will also add a few unit tests to the affected processes.

> Volume operations — volume attach, detach, destroy, and migrate paths are all modified to handle backup deltas.

The main path was not changed.
An exception was added so that a few more processes are executed only when using KNIB, and only if the VM uses backups. Is this a problem?

> Secondary storage selectors (HeuristicType.java) — extended for backups.

Yes.

> I don't have a strong opinion on the exact name, but I find "native" a bit unclear — native to what? KVM? CloudStack? The hypervisor? Something more descriptive of the actual mechanism would help operators understand what they're enabling. I'll leave this to the community to decide.

Native to ACS, as explained in https://github.com/apache/cloudstack/pull/12758#issuecomment-4069106439. You need to fully read what is written there: the use of the word "native" is very literal. We used "native" to represent that ACS itself executes the backup processes for the KVM hypervisor. Anyway, I'm open to changing the name; thanks to you all, I realize that another name might be better. What do you think of the names proposed here: https://github.com/apache/cloudstack/pull/12758#issuecomment-4082573985?

> I'd also note that the [coding conventions](https://cwiki.apache.org/confluence/display/CLOUDSTACK/Coding+conventions) document was written by the community and will keep evolving. Citing it as a definitive authority on what is or isn't acceptable doesn't fully capture how we work — the conventions are a living document, and community consensus on what is good for the project's future carries equal weight.

Therefore, claiming that "Native" goes against the convention is also pointless, correct?

> ## Separate APIs vs Reusing Existing APIs
>
> I would strongly prefer reusing the existing backup offering APIs rather than creating parallel ones. The existing `importBackupOffering` API could be extended — for example, making `externalId` optional or using some other value like `internal`, `native` or whatever makes sense as a value for this. This keeps a single unified offering model in the framework, the UI, and the documentation.

While I was not the one who introduced the backup framework, looking at its design, it was clearly intended to have as little configuration as possible on the ACS side, leaving those details to the backup provider. If we add these parameters to the import backup offering API, I'm sure a lot of users will be confused when the parameters do nothing for Veeam and Dell NetWorker. I did not intend to warp the original design of configuring the offerings on the provider side and only importing them into ACS; this is why I created the concept of native offerings. Since with KNIB (and NAS) the provider is ACS, the configurations can still be made on the "backup provider" side, and the import API will follow the same logic it always has.

Since the community is strongly against creating these APIs, I shall follow the consensus and remove them; the concept of "native" backup offerings will no longer exist. I will add a single new API, called `createBackupOffering`. It will, as the name suggests, create a new backup offering. It will have all the necessary parameters to create a new backup offering, plus the parameters that KNIB needs. For now, this API will do nothing for other providers. In the GUI, it will be a new button on the same page as the current backup offerings. The table for native backup offerings will be absorbed into the backup offering details table.

I strongly oppose using the `importBackupOffering` API to create the definition of a backup offering; that is not what the semantics of the API name suggest, and it is confusing. It should be used to import offerings that were already defined externally.

> ## On Reusing and Improving Existing Code
>
> One of the strengths of a community project is that we, as a community, collectively own and improve the codebase.
> _None of us are working on code that "belongs" to us individually_ — we are all changing, refactoring, and improving code that someone else wrote years ago, sometimes by contributors who are no longer active in the project. That is how open source works and how the project stays healthy.
>
> When existing code has gaps or quality concerns, the community expectation is that we improve it — not build a parallel implementation alongside it. If the existing NAS backup plugin has shortcomings (no incremental support, no compression, no validation), those are opportunities for contribution that benefit all users of the existing plugin.
>
> Building KNIB as a completely separate provider with its own offering APIs, its own DB tables, and its own storage model means the NAS plugin's gaps remain unaddressed, and now we have two native backup implementations to maintain. This also sets a precedent: if the reasoning is "the existing code doesn't meet our standards, so we'll build our own," then any contributor could justify creating parallel implementations of any subsystem. Over time, this fragments the codebase rather than strengthening it.
>
> The Apache Ethos emphasizes that community contributions should strengthen the commons, not create parallel silos. I'd encourage us all — including myself — to keep this principle in mind.

There are multiple reasons why KNIB was developed as a separate plugin instead of building on NAS. I will outline the most important ones:

1. The spec was done and early development was already underway internally when NAS was first proposed to the community. Since, from the beginning, we could see that the two were going in very different directions, I decided to continue the KNIB implementation.
2. NAS and KNIB have vastly different approaches to many concepts. To make NAS work like KNIB and have the same features as KNIB, I would have to transform it into KNIB, which I'm sure the community would be against.
3. One plugin does not have to substitute the other. Having both is a good thing: since they work in different ways, users have more freedom to choose the plugin that best fits their needs.

> ## Security
> ### Validation Command Injection
>
> The `backupValidationCommand` VM setting is user-writable by default. From what I can see, the command and arguments are put into a JSON string using `String.format()`:
>
> ```java
> String GUEST_EXEC_COMMAND = "{\"execute\": \"guest-exec\", \"arguments\":{\"path\":\"%s\",\"arg\":%s,\"capture-output\":true}}";
> ```

While it is user-writable by default, validation is a feature that only Root Admins can enable. While I certainly could make it an Admin-only setting by default, I'm sure that most who want to use this feature will let their users edit it. This is because the validation is inherently VM-dependent, and thus most of the time only the owner of the VM will know what needs to be validated.

> Example of how this can be exploited
>
> If the path value is not sanitized, a crafted command string can break out of the JSON structure. For example:
>
> ```java
> String script = "validation.sh;\nls /etc/\n\",\"extra_param\":\"injected_value";
> String command = String.format(GUEST_EXEC_COMMAND, script, "[]");
> ```
>
> This would produce malformed JSON that could be interpreted differently by the QEMU guest agent.

I tested passing this string `"validation.sh;\nls /etc/\n\",\"extra_param\":\"injected_value"` in the `backupValidationCommand` setting; however, no injection occurred, only an error:

```
cannot parse json {"execute": "guest-exec", "arguments":{"path":""validation.sh;\nls /etc/\n\",\"extra_param\":\"injected_value"","arg":[],"capture-output":true}}: lexical error: invalid char in json text.
```

Did you test this? Furthermore, this JSON is passed directly to QEMU and is used as part of the user's backup validation process. Are you concerned that the user will sabotage their own backup validations?
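To make the discussion concrete, here is a minimal standalone sketch (not the actual CloudStack code path; `escapeJson` is a hypothetical helper that does not exist in the codebase) showing why interpolating the crafted string unescaped yields unparseable JSON rather than a silent injection, and how escaping the value before interpolation would keep it inside a single JSON string literal:

```java
// Standalone sketch: behavior of the quoted format string with and without
// escaping. escapeJson is a HYPOTHETICAL helper, shown only for illustration.
public class GuestExecCommandSketch {

    // The format string as quoted in the review.
    public static final String GUEST_EXEC_COMMAND =
            "{\"execute\": \"guest-exec\", \"arguments\":{\"path\":\"%s\",\"arg\":%s,\"capture-output\":true}}";

    // Hypothetical helper: escape characters that could terminate or alter
    // a JSON string literal (quote, backslash, control characters).
    public static String escapeJson(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (char c : s.toCharArray()) {
            switch (c) {
                case '\\': sb.append("\\\\"); break;
                case '"':  sb.append("\\\""); break;
                case '\n': sb.append("\\n");  break;
                case '\r': sb.append("\\r");  break;
                case '\t': sb.append("\\t");  break;
                default:   sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String crafted = "validation.sh;\nls /etc/\n\",\"extra_param\":\"injected_value";

        // Unescaped: the embedded quote closes the "path" literal early, so the
        // result is invalid JSON that a parser rejects (the "lexical error" above).
        String raw = String.format(GUEST_EXEC_COMMAND, crafted, "[]");
        System.out.println(raw);

        // Escaped first: quotes and newlines are neutralized, so the whole value
        // remains inside one JSON string literal.
        String safe = String.format(GUEST_EXEC_COMMAND, escapeJson(crafted), "[]");
        System.out.println(safe);
    }
}
```

Building the command with a proper JSON serializer (or an escape step like the sketch above) would remove the parse-error failure mode entirely, regardless of whether it is exploitable.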
If you find a vulnerability and an example of how to exploit it, I will address it. But please make sure it is really a vulnerability first.

> ### Validation Resource Exhaustion
>
> I don't see a hard cap on the total validation lifecycle time. A user could repeatedly trigger backups on stopped VMs with validation-enabled offerings, causing validation VMs to be created on compute hosts. Combined with `enforce.resource.limit.on.backup.validation.vm` defaulting to `false` and `backup.validation.max.concurrent.operations` defaulting to `0` (unlimited), this could lead to resource exhaustion on compute hosts, affecting other tenants.
>
> Repeated backup commands on stopped VMs could potentially cause Libvirt to become unresponsive on the host, which would then impact all VMs on that host.

Since `backup.validation.max.concurrent.operations.per.host` is 1 by default, at most one validation may occur on each host. Furthermore, this is a question of sensible default configurations; the current defaults are based on feedback from users, and I am certainly open to changing them. Do you think that we should further constrain the default validation configurations? We could change `enforce.resource.limit.on.backup.validation.vm` to `true` by default, for example.
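For reference, a config-style recap of the validation defaults discussed in this thread (only the names and values quoted above; not an actual configuration file):

```properties
# Current defaults, as discussed in this thread
enforce.resource.limit.on.backup.validation.vm=false
backup.validation.max.concurrent.operations=0            # 0 = unlimited
backup.validation.max.concurrent.operations.per.host=1   # at most one validation per host

# Possible more conservative default suggested above
# enforce.resource.limit.on.backup.validation.vm=true
```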
