vishesh92 commented on PR #12758:
URL: https://github.com/apache/cloudstack/pull/12758#issuecomment-4089923209

   I'd like to gently remind everyone — myself included — that we are all 
working toward the same goal here: making CloudStack better. This conversation 
has gotten a bit heated, and I think part of that is because a large PR with 
significant architectural decisions landed without earlier community discussion 
— something I've been guilty of myself in the past. For features of this scale, 
early alignment with the community helps avoid exactly this kind of friction. 
When that doesn't happen, it becomes even more important for the author to be 
open to the feedback that comes during review.
   
   The [Apache Ethos](https://hop.apache.org/community/ethos/) is a good 
reference for how we want to collaborate as a community. Disagreement on 
technical matters is healthy; let's just make sure we keep it respectful and 
constructive.
   
   ---
   
   I have tried to go through the spec, the PR conversation, the existing 
backup framework docs, and the code at a high level. I might have missed some 
things given the size of this PR, so please correct me if I'm wrong on any 
point.
   
   I can see that a lot of effort has gone into both the spec and the 
implementation. That said, I have some concerns that I think need to be 
addressed before this is production-safe.
   
   ---
   
   ## Testing
   
   **The patch coverage is 5.59%.** The classes that handle the most critical 
operations are almost completely untested.
   
   <details>
   <summary>Unit test coverage details</summary>
   
   | Class | Coverage |
   |---|---|
   | `NativeBackupServiceImpl.java` | 0% |
   | `NativeBackupServiceJobController.java` | 0% |
   | `BackupCompressionServiceJobController.java` | 0% |
   | `BackupValidationServiceController.java` | 0% |
   | `LibvirtTakeKnibBackupCommandWrapper.java` | 0.49% |
   | `LibvirtValidateKnibVmCommandWrapper.java` | 0.76% |
   | `KvmFileBasedStorageVmSnapshotStrategy.java` | 0% |
   
   These code paths run as root on hypervisor hosts and manipulate live VM disk 
chains. A bug here can cause data loss for tenants. This is not something we 
can rely on manual testing alone to catch — the next person who touches the VM 
snapshot strategy or volume migration code has no safety net to tell them when 
they've broken something.
   
   </details>
   
   **This is a 12k+ line PR with no Marvin tests.** The manual test matrices in 
the PR description are good to have, but without automated integration tests, 
there is no way to catch regressions. Future changes to the VM snapshot code, 
volume lifecycle, or backup framework could silently break KNIB (or the other 
way around) with no automated detection.
   
   ---
   
   ## Architecture
   
   ### Using Secondary Storage for Backups
   
   <details>
   <summary>Three concerns with this approach</summary>
   
   - If the secondary storage selector allows clean separation of backup 
storage from everything else, this can work. But this needs to be **clearly 
documented** — operators must understand that without proper selector 
configuration, backups will share space and capacity tracking with templates, 
ISOs, and snapshots.
   - The secondary storage selector JavaScript engine has had security issues — 
see 
[CVE-2025-59302](https://cloudstack.apache.org/blog/cve-advisories-4.20.2.0-4.22.0.0/#cve-2025-59302-potential-remote-code-execution-on-javascript-engine-defin).
 While this is currently blocked by a global setting, extending the selector 
framework to also cover backups increases the surface area. This should at 
minimum be documented with a note about the possible impact.
   - Backups are generally expected to be stored in a separate, remote location 
from the primary infrastructure. If the secondary storage is co-located, the 
value of backups (surviving an infrastructure failure) is reduced. Was this 
tested with any meaningful network latency between the compute hosts / SSVM and 
remote storage used for backups? If so, what latencies were observed?
   
   </details>
   
   ### `BackupError` VM State
   
   The `BackupError` state has no recovery path — no admin API, no background 
self-healing task, nothing. The spec says an operator must read the logs and 
recover the VM manually. In a real deployment, a management server restart or 
agent timeout during a backup window could put many VMs into this state at 
once. Tenants can't start, stop, migrate, or snapshot their VMs until an 
admin manually fixes each one. At the very least, this limitation and its 
operational impact should be clearly documented.
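
   A background self-healing task does not need to be elaborate to be useful. The sketch below is entirely hypothetical — the state names, the `recover` method, and the conservative choice of `STOPPED` as the target state are all mine, not code from this PR; it only illustrates the shape of recovery path I would expect:

   ```java
    import java.util.ArrayList;
    import java.util.List;

    // Hypothetical sketch of a self-healing task for VMs stuck in BackupError.
    // State names and the transition rule are illustrative only.
    public class BackupErrorRecoveryTask {
        enum State { RUNNING, STOPPED, BACKING_UP, BACKUP_ERROR }

        static class Vm {
            State state;
            boolean backupJobStillActive; // e.g. confirmed against the host agent
            Vm(State state, boolean backupJobStillActive) {
                this.state = state;
                this.backupJobStillActive = backupJobStillActive;
            }
        }

        // Moves VMs out of BACKUP_ERROR once no backup job is still touching
        // their disk chain; returns the VMs that were recovered.
        static List<Vm> recover(List<Vm> vms) {
            List<Vm> recovered = new ArrayList<>();
            for (Vm vm : vms) {
                if (vm.state == State.BACKUP_ERROR && !vm.backupJobStillActive) {
                    vm.state = State.STOPPED; // conservative; the admin can start it again
                    recovered.add(vm);
                }
            }
            return recovered;
        }
    }
   ```

   Even a task this simple, run periodically, would keep a management server restart from leaving dozens of VMs stranded until an admin intervenes.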
   
   ### Changes to Existing Features
   
   This PR modifies 224 files. Many of the changes are not isolated to the KNIB 
plugin — they touch shared infrastructure.
   
   <details>
   <summary>Affected areas</summary>
   
   - **VM state machine** (`VirtualMachine.java`, `UserVmManagerImpl.java`) — 
two new states added. Every component that checks VM state is affected.
   - **VM snapshot strategy** (`KvmFileBasedStorageVmSnapshotStrategy.java`, 82 
new lines, 0% coverage) — the disk-only VM snapshot revert, create, and delete 
paths are modified to account for backup deltas. This is an existing feature 
that other users depend on. A bug here breaks VM snapshots for everyone, not 
just KNIB users.
   - **Volume operations** — volume attach, detach, destroy, and migrate paths 
are all modified to handle backup deltas.
   - **Secondary storage selectors** (`HeuristicType.java`) — extended for 
backups.
   
   These changes are all reasonable for the feature, but combined with 0% test 
coverage on the affected paths, they carry a regression risk for users who 
don't use KNIB at all.
   
   </details>
   
   ### Naming
   
   I don't have a strong opinion on the exact name, but I find "native" a bit 
unclear — native to what? KVM? CloudStack? The hypervisor? Something more 
descriptive of the actual mechanism would help operators understand what 
they're enabling. I'll leave this to the community to decide.
   
   I'd also note that the [coding 
conventions](https://cwiki.apache.org/confluence/display/CLOUDSTACK/Coding+conventions)
 document was written by the community and will keep evolving. Citing it as a 
definitive authority on what is or isn't acceptable doesn't fully capture how 
we work — the conventions are a living document, and community consensus on 
what is good for the project's future carries equal weight.
   
   ---
   
   ## Separate APIs vs Reusing Existing APIs
   
   I would strongly prefer reusing the existing backup offering APIs rather 
than creating parallel ones. The existing `importBackupOffering` API could be 
extended — for example, by making `externalId` optional or accepting a sentinel 
value such as `internal` or `native`. This keeps a single unified offering 
model across the framework, the UI, and the documentation.
   
   <details>
   <summary>Why this matters</summary>
   
   To give an example: when I worked on the GPU feature, the approach was to 
extend the existing VM deployment APIs rather than create a separate 
`createGPUEnabledVM` command. The same principle should apply here.
   
   Given that this project does not deprecate APIs, once the new 
`createNativeBackupOffering` / `listNativeBackupOfferings` / 
`deleteNativeBackupOffering` APIs ship, we are committed to maintaining them 
forever alongside the existing offering APIs. This is a decision that is very 
hard to walk back.
   
   More broadly: if every new backup provider introduces its own offering API, 
we end up with an ever-growing API surface. **If someone contributes another 
backup provider tomorrow, would we accept a third set of offering APIs?** I 
don't think we would — and if we wouldn't accept it then, we should think 
carefully about accepting it now.
   
   </details>
   
   ---
   
   ## On Reusing and Improving Existing Code
   
   One of the strengths of a community project is that we, as a community, 
collectively own and improve the codebase. *None of us are working on code that 
"belongs" to us individually* — we are all changing, refactoring, and improving 
code that someone else wrote years ago, sometimes by contributors who are no 
longer active in the project. That is how open source works and how the project 
stays healthy.
   
   <details>
   <summary>More on this</summary>
   
   When existing code has gaps or quality concerns, the community expectation 
is that we improve it — not build a parallel implementation alongside it. If 
the existing NAS backup plugin has shortcomings (no incremental support, no 
compression, no validation), those are opportunities for contribution that 
benefit all users of the existing plugin.
   
   Building KNIB as a completely separate provider with its own offering APIs, 
its own DB tables, and its own storage model means the NAS plugin's gaps remain 
unaddressed, and now we have two native backup implementations to maintain. 
This also sets a precedent: if the reasoning is "the existing code doesn't meet 
our standards, so we'll build our own," then any contributor could justify 
creating parallel implementations of any subsystem. Over time, this fragments 
the codebase rather than strengthening it.
   
   The [Apache 
Ethos](https://hop.apache.org/community/ethos/#_specific_guidelines) emphasizes 
that community contributions should strengthen the commons, not create parallel 
silos. I'd encourage us all — including myself — to keep this principle in mind.
   
   </details>
   
   ---
   
   ## Security
   
   ### Validation Command Injection
   
   The `backupValidationCommand` VM setting is user-writable by default. From 
what I can see, the command and arguments are put into a JSON string using 
`String.format()`:
   ```java
    String GUEST_EXEC_COMMAND = "{\"execute\": \"guest-exec\", \"arguments\":{\"path\":\"%s\",\"arg\":%s,\"capture-output\":true}}";
   ```
   
   <details>
   <summary>Example of how this can be exploited</summary>
   
   If the `path` value is not sanitized, a crafted command string can break out 
of the JSON structure. For example:
   ```java
    String script = "validation.sh;\nls /etc/\n\",\"extra_param\":\"injected_value";
   String command = String.format(GUEST_EXEC_COMMAND, script, "[]");
   ```
   
   This would produce malformed JSON that could be interpreted differently by 
the QEMU guest agent.
   
   </details>
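
   A safer pattern is to never splice user input into a JSON template with `String.format()` at all. The sketch below is my own illustration, not code from this PR (`escapeJson` and `buildGuestExec` are hypothetical names): the user-supplied path is escaped before being embedded, so a hostile value stays inside its quoted JSON string.

   ```java
    public class GuestExecPayload {

        // Minimal JSON string escaper covering the characters an attacker
        // needs to break out of a quoted JSON value.
        static String escapeJson(String s) {
            StringBuilder sb = new StringBuilder();
            for (char c : s.toCharArray()) {
                switch (c) {
                    case '"':  sb.append("\\\""); break;
                    case '\\': sb.append("\\\\"); break;
                    case '\n': sb.append("\\n");  break;
                    case '\r': sb.append("\\r");  break;
                    case '\t': sb.append("\\t");  break;
                    default:
                        if (c < 0x20) {
                            sb.append(String.format("\\u%04x", (int) c));
                        } else {
                            sb.append(c);
                        }
                }
            }
            return sb.toString();
        }

        // argsJson is assumed to be already-validated JSON (e.g. "[]");
        // only the path comes from the user-writable VM setting here.
        static String buildGuestExec(String path, String argsJson) {
            return "{\"execute\": \"guest-exec\", \"arguments\":{\"path\":\""
                    + escapeJson(path) + "\",\"arg\":" + argsJson
                    + ",\"capture-output\":true}}";
        }

        public static void main(String[] args) {
            String hostile = "validation.sh;\nls /etc/\n\",\"extra_param\":\"injected_value";
            System.out.println(buildGuestExec(hostile, "[]"));
        }
    }
   ```

   With this, the injected quote and newline survive only as `\"` and `\n` inside the JSON string value, so the guest agent sees one literal path rather than extra parameters. Building the payload with a proper JSON library (Jackson or Gson, assuming one is available on the agent classpath) would be better still, since it removes the hand-rolled escaping entirely.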
   
   ### Validation Resource Exhaustion
   
   I don't see a hard cap on the total validation lifecycle time. A user could 
repeatedly trigger backups on stopped VMs with validation-enabled offerings, 
causing validation VMs to be created on compute hosts. Combined with 
`enforce.resource.limit.on.backup.validation.vm` defaulting to `false` and 
`backup.validation.max.concurrent.operations` defaulting to `0` (unlimited), 
this could lead to resource exhaustion on compute hosts, affecting other 
tenants.
   
   Repeated backup commands on stopped VMs could potentially cause libvirt to 
become unresponsive on the host, which would then impact all VMs on that host.
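
   Concretely, even a simple bounded semaphore around validation VM creation would give operators a hard ceiling. The sketch below is hypothetical (the class, the method names, and the idea of making the cap mandatory are mine, not this PR's code):

   ```java
    import java.util.concurrent.Semaphore;

    // Hypothetical sketch of a hard concurrency cap for backup validation VMs.
    // Unlike a "0 means unlimited" global setting, the cap here is mandatory.
    public class ValidationThrottle {
        private final Semaphore permits;

        public ValidationThrottle(int maxConcurrentValidations) {
            this.permits = new Semaphore(maxConcurrentValidations);
        }

        // Returns true if a validation slot is free; callers that get false
        // should fail the backup job fast instead of piling up validation VMs.
        public boolean tryStartValidation() {
            return permits.tryAcquire();
        }

        // Must run when the validation VM is destroyed, on success or failure,
        // e.g. from a finally block around the validation lifecycle.
        public void finishValidation() {
            permits.release();
        }
    }
   ```

   A wall-clock bound on the whole validation lifecycle (create, run, destroy) would close the remaining gap on total validation time.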
   
   ---
   
   I can see that a lot of work has gone into this. My concerns are about 
making sure this is safe and maintainable for the long term. Happy to discuss 
any of these points further.
   
   I'd also like to request everyone — contributors and reviewers alike — to 
please consider each other's feedback with an open mind. We as a community may 
not agree on everything, but we can disagree respectfully. This PR will be 
better for it, and so will the project.

