vishesh92 commented on PR #12758: URL: https://github.com/apache/cloudstack/pull/12758#issuecomment-4096191489
Thanks for the detailed response @JoaoJandre. I am glad to see the conversation moving forward, especially the concessions on the APIs, tests, and documentation. I've grouped my follow-ups by topic to keep this concise.

**On testing standards** - I agree the project could do better on test coverage overall, and I'm happy to discuss project-wide standards separately. But I don't think the current state of the project is a reason to skip tests here. One reason I find them particularly valuable is forward merges: unit tests and Marvin tests help ensure a feature works consistently across versions unless there's a deliberate change. Without them, a forward merge can silently break things, and nobody notices until production. Glad you're adding tests to the affected paths. That's what matters: regardless of who originally wrote the code, anyone should be able to confidently modify it without fear of breaking things.

**On the VM state machine and BackupError** - Once a VM enters the `BackupError` state, there's no path to recover, destroy, or expunge it. IMO, an admin should at least be able to destroy or expunge a stuck VM; with the current state machine, fixing one might require database intervention, which we should avoid. As you suggested, a self-healing process would be a good future improvement, and we don't need to block this PR on it right now. IMO, similar approaches would benefit other parts of CloudStack too. Also, the design doc gives "if the MGMT/agent service goes down" as an example trigger, which is what worried me. But as you mentioned, a management server restart won't put a VM into the `BackupError` state, so we just need to update the design doc.

**On volume operations** - I flagged this based on seeing the interface changes, without looking deeper into the implementation. Thanks for clarifying that the main paths are untouched.

**On the CVE** - I understand it's been addressed and the severity was low.
But it was still assigned a CVE, and extending the selector framework to cover backups increases the attack surface. I think that's worth documenting so operators can make an informed decision. I could be wrong, but I believe adding more documentation for security-related concerns is the safer choice.

**On naming & conventions** - I wasn't claiming "Native" goes against the conventions. My point was that the conventions document alone shouldn't be the final word in either direction. What matters is having a meaningful discussion so that we, as a community, can arrive at the best choice.

**On the APIs** - The `createBackupOffering` approach sounds reasonable. Your point about `importBackupOffering` not being semantically right for creating an offering is fair. The other people in the community, including me, can absolutely be wrong too; the important thing is that we talk through these decisions constructively.

**On NAS and KNIB** - I agree that multiple plugins can be a good thing and that users benefit from choice. But I think we should keep things as generic and consistent as possible for the same type of resource. I haven't seen the code for either implementation, so I can't add much here from a technical perspective for now. You mentioned that early development on KNIB was already underway internally when NAS was proposed. Was there any coordination or feedback exchanged between the two efforts at that point? Early communication helps avoid situations like this.

**On the security example** - Thanks for testing the injection; good to know the JSON parser catches it. I did a high-level review and tested only the string-formatting part, not the full end-to-end flow with QEMU. I should have been clearer about that scope in my original comment. I still think we should add proper input validation on the CloudStack side: a CVE could be discovered in libvirt or QEMU that turns this into an attack vector, or if this gets extended to other hypervisors in the future, it could become an issue there.

**On validation defaults** - Fair point on `backup.validation.max.concurrent.operations.per.host` defaulting to 1; I missed that earlier, and it does limit the per-host impact. Changing `enforce.resource.limit.on.backup.validation.vm` to `true` by default sounds like a good step as well.

---

I don't currently have deep familiarity with the existing backup framework implementation, and I'm probably not the best person for a detailed code review of this PR. I came across it because the discussion felt like it wasn't heading in a productive direction, and I wanted to help move things forward. Looking forward to the updates.
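To make the state-machine concern concrete, here is a minimal, hypothetical sketch. This is NOT CloudStack's actual `VirtualMachine.State` machine or its transition code; the `VmState` enum and `TransitionTable` class are invented for illustration. It just shows why a `BackupError` state with no outgoing transitions leaves an admin with nothing but database surgery, and how one extra admin-driven transition fixes that.

```java
import java.util.Collections;
import java.util.EnumSet;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical, simplified model -- not CloudStack's real state machine.
enum VmState { RUNNING, STOPPED, BACKUP_ERROR, DESTROYED }

class TransitionTable {
    private final Map<VmState, Set<VmState>> allowed = new HashMap<>();

    // Register a legal transition between two states.
    void allow(VmState from, VmState to) {
        allowed.computeIfAbsent(from, k -> EnumSet.noneOf(VmState.class)).add(to);
    }

    // Check whether the state machine permits this transition.
    boolean canTransition(VmState from, VmState to) {
        return allowed.getOrDefault(from, Collections.emptySet()).contains(to);
    }
}

class BackupErrorDemo {
    public static void main(String[] args) {
        TransitionTable sm = new TransitionTable();
        sm.allow(VmState.RUNNING, VmState.BACKUP_ERROR);
        sm.allow(VmState.RUNNING, VmState.STOPPED);
        sm.allow(VmState.STOPPED, VmState.DESTROYED);

        // No outgoing edge from BACKUP_ERROR: a stuck VM cannot be
        // destroyed through the state machine at all.
        System.out.println(sm.canTransition(VmState.BACKUP_ERROR, VmState.DESTROYED)); // false

        // An admin-only escape hatch makes the stuck VM destroyable
        // without touching the database.
        sm.allow(VmState.BACKUP_ERROR, VmState.DESTROYED);
        System.out.println(sm.canTransition(VmState.BACKUP_ERROR, VmState.DESTROYED)); // true
    }
}
```

The point of the sketch is only that the fix is small at the state-machine level; whether the real transition should target a destroyed or stopped state is a design decision for the PR.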
