bhouse-nexthop opened a new pull request, #13361:
URL: https://github.com/apache/cloudstack/pull/13361
### Description
RBD erasure-coded (EC) pool support (#9808) taught the KVM agent to honor
the `rbd_default_data_pool` storage-pool detail, but only in two places:
- `KVMPhysicalDisk.RBDStringBuilder()` — the `qemu-img` URI builder
- `createPhysicalDisk()` — blank volume creation (routes RBD creates to the
QemuImg path when a data pool is set)
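
The `qemu-img` path works because QEMU's rbd driver accepts filenames of the form `rbd:<pool>/<image>[:<key>=<value>...]` and hands each key/value to librados as a config setting, so `rbd_default_data_pool` can ride along in the URI. The sketch below is a minimal illustration of that idea, not CloudStack's actual `RBDStringBuilder`. The class and method names, the pool/image names and the `mon_host` value are hypothetical, and the real builder's option set and ordering may differ. Colons inside values are escaped as `\:` because the rbd filename parser uses `:` as the option separator.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch of a qemu-img RBD URI builder (not CloudStack's real code). */
class RbdUriSketch {
    // Assumed detail key, matching the storage-pool detail named in this PR.
    static final String RBD_DEFAULT_DATA_POOL = "rbd_default_data_pool";

    // Escape characters that QEMU's rbd filename parser treats specially.
    static String escape(String value) {
        return value.replace("\\", "\\\\").replace(":", "\\:");
    }

    // Builds rbd:<pool>/<image>:<key>=<value>... in map iteration order.
    static String build(String pool, String image, Map<String, String> options) {
        StringBuilder sb = new StringBuilder("rbd:").append(pool).append('/').append(image);
        for (Map.Entry<String, String> e : options.entrySet()) {
            sb.append(':').append(e.getKey()).append('=').append(escape(e.getValue()));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, String> opts = new LinkedHashMap<>();
        opts.put("mon_host", "10.0.0.1:6789");
        // Only added when the storage pool has an EC data pool configured.
        opts.put(RBD_DEFAULT_DATA_POOL, "ec-data");
        System.out.println(build("rbd-meta", "vol-1", opts));
    }
}
```

For the options above this prints `rbd:rbd-meta/vol-1:mon_host=10.0.0.1\:6789:rbd_default_data_pool=ec-data`.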
It was **not** applied to `createDiskFromTemplateOnRBD()`, which creates
ROOT volumes from a template using rados-java (`rbd.clone()` / `rbd.create()`)
directly. Those calls build a `Rados` connection that sets `mon_host`, `key`
and `client_mount_timeout` but never `rbd_default_data_pool`, so the resulting
image is created **without a data pool**.
Net effect: every volume cloned from a template onto an EC-backed primary
storage has all of its data objects written to the **replicated metadata pool**
instead of the erasure-coded data pool. This silently defeats EC and consumes
~3x raw space for those volumes. Blank data disks on the same pool are correct,
which makes the inconsistency easy to miss.
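
A back-of-the-envelope comparison shows where the "~3x" figure comes from. This sketch assumes the metadata pool uses the common 3-replica setting and picks a hypothetical `k=4, m=2` erasure-code profile for contrast. Actual overhead depends on the cluster's replication size and EC profile, and on how much data a thinly provisioned volume has actually written.

```java
/** Raw-capacity comparison under assumed pool settings (3 replicas vs. EC 4+2). */
class RawSpaceSketch {
    // Raw bytes consumed per logical byte in a replicated pool of the given size.
    static double replicatedOverhead(int size) {
        return size;
    }

    // Raw bytes consumed per logical byte in an EC pool with k data + m coding chunks.
    static double ecOverhead(int k, int m) {
        return (double) (k + m) / k;
    }

    public static void main(String[] args) {
        double writtenGiB = 100.0; // e.g. 100 GiB actually written to one ROOT volume
        double onReplicated = writtenGiB * replicatedOverhead(3);
        double onEc = writtenGiB * ecOverhead(4, 2);
        System.out.printf("replicated(3): %.0f GiB raw, EC(4+2): %.0f GiB raw%n",
                onReplicated, onEc);
    }
}
```

Under these assumptions, the same 100 GiB costs 300 GiB raw in the metadata pool versus 150 GiB in the EC data pool.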
This was observed in production: of the VMs on an EC primary storage, the
template-derived ROOT volumes had no `data_pool` (`rbd info` shows no
`data_pool:` line and lacks the `data-pool` feature), while blank DATADISKs and
the template base images themselves were correct.
### Fix
In `createDiskFromTemplateOnRBD`, read the destination pool's
`rbd_default_data_pool` detail once and, when present, `confSet` it on the
`Rados` connection **before** `connect()` — in both the same-cluster clone/copy
branch and the cross-cluster copy branch. librbd then uses it as the default
data pool when the new image is created, so template-derived volumes get
`data_pool` set, exactly like blank volumes already do. This mirrors how
`RBDStringBuilder` injects the same key for the `qemu-img` path.
```java
String dataPool = (destDetails == null) ? null
        : destDetails.get(KVMPhysicalDisk.RBD_DEFAULT_DATA_POOL);
...
if (dataPool != null) {
    // Must be set before connect() so librbd uses it as the default data pool
    r.confSet(KVMPhysicalDisk.RBD_DEFAULT_DATA_POOL, dataPool);
}
r.connect();
```
No behavior change for non-EC pools (`dataPool == null` → no-op).
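
The ordering the fix relies on (read the detail once, set it only when present, and always before `connect()`) can be exercised without a Ceph cluster. In the sketch below, `FakeRados` is a hypothetical stand-in for rados-java's `Rados`. It only records calls and deliberately rejects `confSet` after `connect()` to enforce the ordering this PR requires, without claiming anything about librados' own behavior. The `client_mount_timeout` value and the helper names `dataPoolFor`/`open` are also made up for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Sketch of applying an optional data pool to a connection config before connect(). */
class DataPoolConfSketch {
    // Assumed value of KVMPhysicalDisk.RBD_DEFAULT_DATA_POOL.
    static final String RBD_DEFAULT_DATA_POOL = "rbd_default_data_pool";

    /** Hypothetical stand-in for rados-java's Rados; records calls in order. */
    static class FakeRados {
        final List<String> calls = new ArrayList<>();
        boolean connected = false;

        void confSet(String key, String value) {
            if (connected) {
                throw new IllegalStateException("confSet after connect: " + key);
            }
            calls.add(key + "=" + value);
        }

        void connect() {
            connected = true;
            calls.add("connect");
        }
    }

    // Read the detail once; null details or a missing key mean "no EC data pool".
    static String dataPoolFor(Map<String, String> destDetails) {
        return destDetails == null ? null : destDetails.get(RBD_DEFAULT_DATA_POOL);
    }

    // One helper used for both the same-cluster and cross-cluster connections here.
    static FakeRados open(String monHost, String key, String dataPool) {
        FakeRados r = new FakeRados();
        r.confSet("mon_host", monHost);
        r.confSet("key", key);
        r.confSet("client_mount_timeout", "30");
        // Skipped entirely for non-EC pools (dataPool == null).
        if (dataPool != null) {
            r.confSet(RBD_DEFAULT_DATA_POOL, dataPool);
        }
        r.connect();
        return r;
    }

    public static void main(String[] args) {
        Map<String, String> ecDetails = new HashMap<>();
        ecDetails.put(RBD_DEFAULT_DATA_POOL, "ec-data");
        System.out.println(open("10.0.0.1", "secret", dataPoolFor(ecDetails)).calls);
        System.out.println(open("10.0.0.1", "secret", dataPoolFor(null)).calls);
    }
}
```

The first line printed ends with `rbd_default_data_pool=ec-data, connect`. The second omits the data-pool entry, which mirrors the "no-op for non-EC pools" behavior described above.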
### Types of changes
- [x] Bug fix (non-breaking change which fixes an issue)
### How Has This Been Tested?
- Root cause reproduced live: deploying a VM from a template onto an EC
primary storage produced a ROOT clone with no `data_pool` (parent base snapshot
in the metadata pool); the data-pool feature was absent.
- Confirmed the working paths (template seed via `qemu-img`, blank DATADISK
create) already set the data pool, isolating the gap to the rados-java
clone/create path.
- Traced the code path to #9808, which did not touch
`createDiskFromTemplateOnRBD`.
> Note: not compiled locally (no JDK/Maven on the workstation used). The
change is a direct mirror of the adjacent `confSet` calls; CI should validate
the build. A maintainer with an EC RBD pool can verify by deploying a VM from a
template and checking `rbd info <pool>/<root-vol>` reports the expected
`data_pool` and `data-pool` feature.
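
The `rbd info` check described above boils down to two tests: a `data_pool:` line is present, and the `features:` line lists `data-pool`. The helper below is a hypothetical sketch of that check over captured `rbd info` text. The sample output in `main` is illustrative and was not captured from the cluster in this PR. The parser matches only lines starting with `features:`, so an `op_features:` line is not mistaken for it.

```java
/** Hypothetical helper: inspect `rbd info <pool>/<image>` text for a data pool. */
class RbdInfoCheck {
    // Returns the value of the "data_pool:" line, or null when the image has none.
    static String dataPool(String rbdInfo) {
        for (String line : rbdInfo.split("\n")) {
            String t = line.trim();
            if (t.startsWith("data_pool:")) {
                return t.substring("data_pool:".length()).trim();
            }
        }
        return null;
    }

    // True when the "features:" line lists data-pool ("op_features:" is ignored).
    static boolean hasDataPoolFeature(String rbdInfo) {
        for (String line : rbdInfo.split("\n")) {
            String t = line.trim();
            if (!t.startsWith("features:")) {
                continue;
            }
            for (String f : t.substring("features:".length()).split(",")) {
                if (f.trim().equals("data-pool")) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Illustrative output shapes only; names are made up.
        String ecClone = String.join("\n",
                "rbd image 'root-vol':",
                "\tdata_pool: ec-data",
                "\tfeatures: layering, exclusive-lock, data-pool");
        String missing = String.join("\n",
                "rbd image 'root-vol':",
                "\tfeatures: layering, exclusive-lock");
        System.out.println(dataPool(ecClone) + " " + hasDataPoolFeature(ecClone));
        System.out.println(dataPool(missing) + " " + hasDataPoolFeature(missing));
    }
}
```

The first sample reports `ec-data true`, as expected after the fix. The second reports `null false`, which is the pre-fix symptom seen on template-derived ROOT volumes.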