vishesh92 commented on PR #12758:
URL: https://github.com/apache/cloudstack/pull/12758#issuecomment-4089923209
I'd like to gently remind everyone — myself included — that we are all working toward the same goal here: making CloudStack better. This conversation has gotten a bit heated, and I think part of that is because a large PR with significant architectural decisions landed without earlier community discussion — something I've been guilty of myself in the past. For features of this scale, early alignment with the community helps avoid exactly this kind of friction. When that doesn't happen, it becomes even more important for the author to be open to the feedback that comes during review.

The [Apache Ethos](https://hop.apache.org/community/ethos/) is a good reference for how we want to collaborate as a community. Disagreement on technical matters is healthy; let's just make sure we keep it respectful and constructive.

---

I have tried to go through the spec, the PR conversation, the existing backup framework docs, and the code at a high level. I might have missed some things given the size of this PR, so please correct me if I'm wrong on any point. I can see that a lot of effort has gone into both the spec and the implementation. That said, I have some concerns that I think need to be addressed before this is production-safe.

---

## Testing

**The patch coverage is 5.59%.** The classes that handle the most critical operations are almost completely untested.

<details>
<summary>Unit test coverage details</summary>

| Class | Coverage |
|---|---|
| `NativeBackupServiceImpl.java` | 0% |
| `NativeBackupServiceJobController.java` | 0% |
| `BackupCompressionServiceJobController.java` | 0% |
| `BackupValidationServiceController.java` | 0% |
| `LibvirtTakeKnibBackupCommandWrapper.java` | 0.49% |
| `LibvirtValidateKnibVmCommandWrapper.java` | 0.76% |
| `KvmFileBasedStorageVmSnapshotStrategy.java` | 0% |

These code paths run as root on hypervisor hosts and manipulate live VM disk chains. A bug here can cause data loss for tenants.
This is not something we can rely on manual testing alone for — the next person who touches the VM snapshot strategy or volume migration code has no safety net to tell them they've broken something.
</details>

**This is a 12k+ line PR with no Marvin tests.** The manual test matrices in the PR description are good to have, but without automated integration tests, there is no way to catch regressions. Future changes to the VM snapshot code, volume lifecycle, or backup framework could silently break KNIB (or vice versa) with no automated detection.

---

## Architecture

### Using Secondary Storage for Backups

<details>
<summary>Three concerns with this approach</summary>

- If the secondary storage selector allows clean separation of backup storage from everything else, this can work. But this needs to be **clearly documented** — operators must understand that without proper selector configuration, backups will share space and capacity tracking with templates, ISOs, and snapshots.
- The secondary storage selector JavaScript engine has had security issues — see [CVE-2025-59302](https://cloudstack.apache.org/blog/cve-advisories-4.20.2.0-4.22.0.0/#cve-2025-59302-potential-remote-code-execution-on-javascript-engine-defin). While this is currently blocked by a global setting, extending the selector framework to also cover backups increases the attack surface. At a minimum, this should be documented with a note about the possible impact.
- Backups are generally expected to be stored in a separate, remote location from the primary infrastructure. If the secondary storage is co-located, the value of backups (surviving an infrastructure failure) is reduced. Was this tested with any meaningful network latency between the compute hosts / SSVM and remote storage used for backups? What latencies were observed?
</details>

### `BackupError` VM State

The `BackupError` state has no recovery path — no admin API, no background self-healing task, nothing.
The spec says an operator must read the logs and recover the VM manually. In a real deployment, a management server restart or agent timeout during a backup window could put many VMs into this state at once. Tenants can't start, stop, migrate, or snapshot their VMs until an admin manually fixes each one. At the very least, this limitation and its operational impact should be clearly documented.

### Changes to Existing Features

This PR modifies 224 files. Many of the changes are not isolated to the KNIB plugin — they touch shared infrastructure.

<details>
<summary>Affected areas</summary>

- **VM state machine** (`VirtualMachine.java`, `UserVmManagerImpl.java`) — two new states added. Every component that checks VM state is affected.
- **VM snapshot strategy** (`KvmFileBasedStorageVmSnapshotStrategy.java`, 82 new lines, 0% coverage) — the disk-only VM snapshot revert, create, and delete paths are modified to account for backup deltas. This is an existing feature that other users depend on. A bug here breaks VM snapshots for everyone, not just KNIB users.
- **Volume operations** — the volume attach, detach, destroy, and migrate paths are all modified to handle backup deltas.
- **Secondary storage selectors** (`HeuristicType.java`) — extended to cover backups.

These changes are all reasonable for the feature, but combined with 0% test coverage on the affected paths, they carry a regression risk for users who don't use KNIB at all.
</details>

### Naming

I don't have a strong opinion on the exact name, but I find "native" a bit unclear — native to what? KVM? CloudStack? The hypervisor? Something more descriptive of the actual mechanism would help operators understand what they're enabling. I'll leave this to the community to decide.

I'd also note that the [coding conventions](https://cwiki.apache.org/confluence/display/CLOUDSTACK/Coding+conventions) document was written by the community and will keep evolving.
Citing it as a definitive authority on what is or isn't acceptable doesn't fully capture how we work — the conventions are a living document, and community consensus on what is good for the project's future carries equal weight.

---

## Separate APIs vs Reusing Existing APIs

I would strongly prefer reusing the existing backup offering APIs rather than creating parallel ones. The existing `importBackupOffering` API could be extended — for example, by making `externalId` optional, or by accepting a reserved value such as `internal` or `native`, whichever makes sense. This keeps a single unified offering model in the framework, the UI, and the documentation.

<details>
<summary>Why this matters</summary>

To give an example: when I worked on the GPU feature, the approach was to extend the existing VM deployment APIs rather than create a separate `createGPUEnabledVM` command. The same principle should apply here.

Given that this project does not deprecate APIs, once the new `createNativeBackupOffering` / `listNativeBackupOfferings` / `deleteNativeBackupOffering` APIs ship, we are committed to maintaining them forever alongside the existing offering APIs. This is a decision that is very hard to walk back.

More broadly: if every new backup provider introduces its own offering API, we end up with an ever-growing API surface. **If someone contributes another backup provider tomorrow, would we accept a third set of offering APIs?** I don't think we would — and if we wouldn't accept it then, we should think carefully about accepting it now.
</details>

---

## On Reusing and Improving Existing Code

One of the strengths of a community project is that we, as a community, collectively own and improve the codebase. *None of us are working on code that "belongs" to us individually* — we are all changing, refactoring, and improving code that someone else wrote years ago, sometimes by contributors who are no longer active in the project.
That is how open source works and how the project stays healthy.

<details>
<summary>More on this</summary>

When existing code has gaps or quality concerns, the community expectation is that we improve it — not that we build a parallel implementation alongside it. If the existing NAS backup plugin has shortcomings (no incremental support, no compression, no validation), those are opportunities for contribution that benefit all users of the existing plugin. Building KNIB as a completely separate provider with its own offering APIs, its own DB tables, and its own storage model means the NAS plugin's gaps remain unaddressed, and now we have two native backup implementations to maintain.

This also sets a precedent: if the reasoning is "the existing code doesn't meet our standards, so we'll build our own," then any contributor could justify creating parallel implementations of any subsystem. Over time, this fragments the codebase rather than strengthening it.

The [Apache Ethos](https://hop.apache.org/community/ethos/#_specific_guidelines) emphasizes that community contributions should strengthen the commons, not create parallel silos. I'd encourage us all — including myself — to keep this principle in mind.
</details>

---

## Security

### Validation Command Injection

The `backupValidationCommand` VM setting is user-writable by default. From what I can see, the command and its arguments are interpolated into a JSON string using `String.format()`:

```java
String GUEST_EXEC_COMMAND = "{\"execute\": \"guest-exec\", \"arguments\":{\"path\":\"%s\",\"arg\":%s,\"capture-output\":true}}";
```

<details>
<summary>Example of how this can be exploited</summary>

If the `path` value is not sanitized, a crafted command string can break out of the JSON structure.
For example:

```java
String script = "validation.sh;\nls /etc/\n\",\"extra_param\":\"injected_value";
String command = String.format(GUEST_EXEC_COMMAND, script, "[]");
```

This would produce malformed JSON that could be interpreted differently by the QEMU guest agent.
</details>

### Validation Resource Exhaustion

I don't see a hard cap on the total validation lifecycle time. A user could repeatedly trigger backups on stopped VMs with validation-enabled offerings, causing validation VMs to be created on compute hosts. Combined with `enforce.resource.limit.on.backup.validation.vm` defaulting to `false` and `backup.validation.max.concurrent.operations` defaulting to `0` (unlimited), this could lead to resource exhaustion on compute hosts, affecting other tenants. Repeated backup commands on stopped VMs could also cause libvirt to become unresponsive on the host, which would then impact all VMs on that host.

---

I can see that a lot of work has gone into this. My concerns are about making sure this is safe and maintainable for the long term. Happy to discuss any of these points further.

I'd also like to request everyone — contributors and reviewers alike — to please consider each other's feedback with an open mind. We as a community may not agree on everything, but we can disagree respectfully. This PR will be better for it, and so will the project.
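P.S. To make the injection point above concrete, here is a rough, hedged sketch of one possible mitigation: escaping the user-supplied value before it is interpolated. `GuestExecPayload`, `escapeJson`, and `buildCommand` are hypothetical names for illustration only, not CloudStack code — the real fix should probably build the payload with a proper JSON library (Gson or Jackson) instead of `String.format()` entirely.

```java
// Hypothetical sketch, not CloudStack code: escape a user-supplied path
// before embedding it in the guest-exec JSON template, so quotes and
// newlines cannot break out of the JSON string value.
public class GuestExecPayload {
    static final String GUEST_EXEC_COMMAND =
        "{\"execute\": \"guest-exec\", \"arguments\":{\"path\":\"%s\",\"arg\":%s,\"capture-output\":true}}";

    // Minimal JSON string escaper for illustration; a real JSON library
    // would handle this (and full Unicode edge cases) for us.
    static String escapeJson(String s) {
        StringBuilder out = new StringBuilder();
        for (char c : s.toCharArray()) {
            switch (c) {
                case '"':  out.append("\\\""); break;
                case '\\': out.append("\\\\"); break;
                case '\n': out.append("\\n");  break;
                case '\r': out.append("\\r");  break;
                case '\t': out.append("\\t");  break;
                default:
                    if (c < 0x20) {
                        // Other control characters become \u00XX escapes.
                        out.append(String.format("\\u%04x", (int) c));
                    } else {
                        out.append(c);
                    }
            }
        }
        return out.toString();
    }

    static String buildCommand(String path) {
        return String.format(GUEST_EXEC_COMMAND, escapeJson(path), "[]");
    }

    public static void main(String[] args) {
        // The exploit string from the example above now stays a single,
        // well-formed JSON string value instead of injecting a new key.
        String malicious = "validation.sh;\nls /etc/\n\",\"extra_param\":\"injected_value";
        System.out.println(buildCommand(malicious));
    }
}
```

With this, the exploit payload from the example above is neutralized, but I'd still treat escaping as a stopgap — serializing via a JSON library removes the whole class of bug.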
