vishesh92 commented on PR #12758:
URL: https://github.com/apache/cloudstack/pull/12758#issuecomment-4089923209
I'd like to gently remind everyone — myself included — that we are all working toward the same goal here: making CloudStack better. This conversation has gotten a bit heated, and I think part of that is because a large PR with significant architectural decisions landed without earlier community discussion — something I've been guilty of myself in the past. For features of this scale, early alignment with the community helps avoid exactly this kind of friction. When that doesn't happen, it becomes even more important for the author to be open to the feedback that comes during review.

The [Apache Ethos](https://hop.apache.org/community/ethos/) is a good reference for how we want to collaborate as a community. Disagreement on technical matters is healthy; let's just make sure we keep it respectful and constructive.

---

I have tried to go through the spec, the PR conversation, the existing backup framework docs, and the code at a high level. I might have missed some things given the size of this PR, so please correct me if I'm wrong on any point. I can see that a lot of effort has gone into both the spec and the implementation. That said, I have some concerns that I think need to be addressed before this is production-safe.

---

## Testing

**The patch coverage is 5.59%.** The classes that handle the most critical operations are almost completely untested.

<details>
<summary>Unit test coverage details</summary>

| Class | Coverage |
|---|---|
| `NativeBackupServiceImpl.java` | 0% |
| `NativeBackupServiceJobController.java` | 0% |
| `BackupCompressionServiceJobController.java` | 0% |
| `BackupValidationServiceController.java` | 0% |
| `LibvirtTakeKnibBackupCommandWrapper.java` | 0.49% |
| `LibvirtValidateKnibVmCommandWrapper.java` | 0.76% |
| `KvmFileBasedStorageVmSnapshotStrategy.java` | 0% |

These code paths run as root on hypervisor hosts and manipulate live VM disk chains. A bug here can cause data loss for tenants.
This is not something we can rely on manual testing alone for — the next person who touches the VM snapshot strategy or volume migration code has no safety net to tell them they've broken something.
</details>

**This is a 12k+ line PR with no Marvin tests.** The manual test matrices in the PR description are good to have, but without automated integration tests, there is no way to catch regressions. Future changes to the VM snapshot code, volume lifecycle, or backup framework could silently break KNIB (or vice versa) with no automated detection.

---

## Architecture

### Using Secondary Storage for Backups

<details>
<summary>Three concerns with this approach</summary>

- If the secondary storage selector allows clean separation of backup storage from everything else, this can work. But this needs to be **clearly documented** — operators must understand that without proper selector configuration, backups will share space and capacity tracking with templates, ISOs, and snapshots.
- The secondary storage selector JavaScript engine has had security issues — see [CVE-2025-59302](https://cloudstack.apache.org/blog/cve-advisories-4.20.2.0-4.22.0.0/#cve-2025-59302-potential-remote-code-execution-on-javascript-engine-defin). While this is currently blocked by a global setting, extending the selector framework to also cover backups increases the attack surface. At a minimum, this should be documented with a note about the possible impact.
- Backups are generally expected to be stored in a separate, remote location from the primary infrastructure. If the secondary storage is co-located, the value of backups (surviving an infrastructure failure) is reduced. Was this tested with any meaningful network latency between the compute hosts / SSVM and remote storage used for backups? What latencies were observed?
</details>

### `BackupError` VM State

The `BackupError` state has no recovery path — no admin API, no background self-healing task, nothing.
The spec says an operator must read the logs and recover the VM manually. In a real deployment, a management server restart or agent timeout during a backup window could put many VMs into this state at once. Tenants can't start, stop, migrate, or snapshot their VMs until an admin manually fixes each one. At the very least, this limitation and its operational impact should be clearly documented.

### Changes to Existing Features

This PR modifies 224 files. Many of the changes are not isolated to the KNIB plugin — they touch shared infrastructure.

<details>
<summary>Affected areas</summary>

- **VM state machine** (`VirtualMachine.java`, `UserVmManagerImpl.java`) — two new states added. Every component that checks VM state is affected.
- **VM snapshot strategy** (`KvmFileBasedStorageVmSnapshotStrategy.java`, 82 new lines, 0% coverage) — the disk-only VM snapshot revert, create, and delete paths are modified to account for backup deltas. This is an existing feature that other users depend on. A bug here breaks VM snapshots for everyone, not just KNIB users.
- **Volume operations** — the volume attach, detach, destroy, and migrate paths are all modified to handle backup deltas.
- **Secondary storage selectors** (`HeuristicType.java`) — extended to cover backups.

These changes are all reasonable for the feature, but combined with 0% test coverage on the affected paths, they carry a regression risk for users who don't use KNIB at all.
</details>

### Naming

I don't have a strong opinion on the exact name, but I find "native" a bit unclear — native to what? KVM? CloudStack? The hypervisor? Something more descriptive of the actual mechanism would help operators understand what they're enabling. I'll leave this to the community to decide.

I'd also note that the [coding conventions](https://cwiki.apache.org/confluence/display/CLOUDSTACK/Coding+conventions) document was written by the community and will keep evolving.
Citing it as a definitive authority on what is or isn't acceptable doesn't fully capture how we work — the conventions are a living document, and community consensus on what is good for the project's future carries equal weight.

---

## Separate APIs vs Reusing Existing APIs

I would strongly prefer reusing the existing backup offering APIs rather than creating parallel ones. The existing `importBackupOffering` API could be extended — for example, by making `externalId` optional, or by accepting a reserved value such as `internal` or `native`, whichever makes sense. This keeps a single unified offering model in the framework, the UI, and the documentation.

<details>
<summary>Why this matters</summary>

To give an example: when I worked on the GPU feature, the approach was to extend the existing VM deployment APIs rather than create a separate `createGPUEnabledVM` command. The same principle should apply here.

Given that this project does not deprecate APIs, once the new `createNativeBackupOffering` / `listNativeBackupOfferings` / `deleteNativeBackupOffering` APIs ship, we are committed to maintaining them forever alongside the existing offering APIs. This is a decision that is very hard to walk back.

More broadly: if every new backup provider introduces its own offering API, we end up with an ever-growing API surface. **If someone contributes another backup provider tomorrow, would we accept a third set of offering APIs?** I don't think we would — and if we wouldn't accept it then, we should think carefully about accepting it now.
</details>

---

## On Reusing and Improving Existing Code

One of the strengths of a community project is that we, as a community, collectively own and improve the codebase. *None of us are working on code that "belongs" to us individually* — we are all changing, refactoring, and improving code that someone else wrote years ago, sometimes by contributors who are no longer active in the project.
That is how open source works and how the project stays healthy.

<details>
<summary>More on this</summary>

When existing code has gaps or quality concerns, the community expectation is that we improve it — not that we build a parallel implementation alongside it. If the existing NAS backup plugin has shortcomings (no incremental support, no compression, no validation), those are opportunities for contribution that benefit all users of the existing plugin. Building KNIB as a completely separate provider with its own offering APIs, its own DB tables, and its own storage model means the NAS plugin's gaps remain unaddressed, and now we have two native backup implementations to maintain.

This also sets a precedent: if the reasoning is "the existing code doesn't meet our standards, so we'll build our own," then any contributor could justify creating parallel implementations of any subsystem. Over time, this fragments the codebase rather than strengthening it.

The [Apache Ethos](https://hop.apache.org/community/ethos/#_specific_guidelines) emphasizes that community contributions should strengthen the commons, not create parallel silos. I'd encourage us all — including myself — to keep this principle in mind.
</details>

---

## Security

### Validation Command Injection

The `backupValidationCommand` VM setting is user-writable by default. From what I can see, the command and its arguments are interpolated into a JSON string using `String.format()`:

```java
String GUEST_EXEC_COMMAND = "{\"execute\": \"guest-exec\", \"arguments\":{\"path\":\"%s\",\"arg\":%s,\"capture-output\":true}}";
```

<details>
<summary>Example of how this can be exploited</summary>

If the `path` value is not sanitized, a crafted command string can break out of the JSON structure.
For example:

```java
String script = "validation.sh;\nls /etc/\n\",\"extra_param\":\"injected_value";
String command = String.format(GUEST_EXEC_COMMAND, script, "[]");
```

This would produce malformed JSON that could be interpreted differently by the QEMU guest agent.
</details>

### Validation Resource Exhaustion

I don't see a hard cap on the total validation lifecycle time. A user could repeatedly trigger backups on stopped VMs with validation-enabled offerings, causing validation VMs to be created on compute hosts. Combined with `enforce.resource.limit.on.backup.validation.vm` defaulting to `false` and `backup.validation.max.concurrent.operations` defaulting to `0` (unlimited), this could lead to resource exhaustion on compute hosts, affecting other tenants. Repeated backup commands on stopped VMs could also cause libvirt to become unresponsive on the host, which would then impact all VMs on that host.

---

I can see that a lot of work has gone into this. My concerns are about making sure this is safe and maintainable for the long term. Happy to discuss any of these points further.

I'd also like to request everyone — contributors and reviewers alike — to please consider each other's feedback with an open mind. We as a community may not agree on everything, but we can disagree respectfully. This PR will be better for it, and so will the project.
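P.S. To make the injection point above concrete, here is a rough, hedged sketch of one possible mitigation: escaping the user-supplied value before it is interpolated. `GuestExecPayload`, `escapeJson`, and `buildCommand` are hypothetical names for illustration only, not CloudStack code — the real fix should probably build the payload with a proper JSON library (Gson or Jackson) instead of `String.format()` entirely.

```java
// Hypothetical sketch, not CloudStack code: escape a user-supplied path
// before embedding it in the guest-exec JSON template, so quotes and
// newlines cannot break out of the JSON string value.
public class GuestExecPayload {
    static final String GUEST_EXEC_COMMAND =
        "{\"execute\": \"guest-exec\", \"arguments\":{\"path\":\"%s\",\"arg\":%s,\"capture-output\":true}}";

    // Minimal JSON string escaper for illustration; a real JSON library
    // would handle this (and full Unicode edge cases) for us.
    static String escapeJson(String s) {
        StringBuilder out = new StringBuilder();
        for (char c : s.toCharArray()) {
            switch (c) {
                case '"':  out.append("\\\""); break;
                case '\\': out.append("\\\\"); break;
                case '\n': out.append("\\n");  break;
                case '\r': out.append("\\r");  break;
                case '\t': out.append("\\t");  break;
                default:
                    if (c < 0x20) {
                        // Other control characters become \u00XX escapes.
                        out.append(String.format("\\u%04x", (int) c));
                    } else {
                        out.append(c);
                    }
            }
        }
        return out.toString();
    }

    static String buildCommand(String path) {
        return String.format(GUEST_EXEC_COMMAND, escapeJson(path), "[]");
    }

    public static void main(String[] args) {
        // The exploit string from the example above now stays a single,
        // well-formed JSON string value instead of injecting a new key.
        String malicious = "validation.sh;\nls /etc/\n\",\"extra_param\":\"injected_value";
        System.out.println(buildCommand(malicious));
    }
}
```

With this, the exploit payload from the example above is neutralized, but I'd still treat escaping as a stopgap — serializing via a JSON library removes the whole class of bug.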
