vishesh92 commented on PR #12758:
URL: https://github.com/apache/cloudstack/pull/12758#issuecomment-4096191489

   Thanks for the detailed response @JoaoJandre.
   
   I am glad to see the conversation moving forward, especially the concessions 
on the APIs, tests, and documentation.
   
   I've grouped my follow-ups by topic to keep this concise.
   
   **On testing standards** - I agree the project could do better on test 
coverage overall, and I'm happy to discuss project-wide standards separately. 
But I don't think the current state of the project is a reason to skip them 
here. One reason for which I believe it's really useful is forward merges. Unit 
tests and Marvin tests help ensure 
   a feature works consistently across versions unless there's a deliberate 
change. Without them, a forward merge can silently break things and nobody 
notices until production.
   
   Glad you're adding tests to the affected paths. That's what matters, 
regardless of who originally wrote the code, anyone should be able to 
confidently modify it without fear of breaking things.
   
   **On the VM state machine and BackupError** - Once a VM enters `BackupError` 
state, there's no path to recover, destroy, or expunge it. IMO, an admin should 
at least be able to destroy or expunge a stuck VM. As per the current state 
machine, it might require database intervention for the fix, which we should 
avoid.
   
   As you suggested, a self-healing process would be a good future improvement 
and we don't need to block this PR for it right now. IMO, we will benefit from 
similar approaches elsewhere in CloudStack too.
   
   Also, the design doc says "if the MGMT/agent service goes down" as an 
example trigger. This is what scared me. But as you mentioned, a management 
server restart won't result in `BackupError` state. So, we just need to update 
the design doc.
   
   **On volume operations** - I flagged it based on seeing the interface 
changes without looking deeper into the implementation. Thanks for clarifying 
that the main paths are untouched.
   
   **On the CVE** - I understand it's been addressed and the severity was low. 
But it was still assigned a CVE, and extending the selector framework to cover 
backups increases the surface area. I think that's worth documenting so 
operators can make an informed decision. I could be wrong, but I believe adding 
more documentation for security-related concerns is the safer choice.
   
   **On naming & conventions** - I wasn't claiming "Native" goes against the 
conventions. My point was that the conventions document alone shouldn't be the 
final word in either direction. What matters is having a meaningful discussion 
so we as a community can arrive at the best choice.
   
   **On the APIs** - The `createBackupOffering` approach sounds reasonable. 
Your point about `importBackupOffering` not being semantically right for 
creating an offering is fair. The other people in the community, including me, 
can absolutely be wrong too. The important thing is that we talk through these 
decisions constructively.
   
   **On NAS and KNIB** - I agree that multiple plugins can be a good thing and 
users benefit from choice. But I think we should keep things as generic and 
consistent as possible for the same type of resource. I haven't seen the code 
for both implementations so I wouldn't be able to add much here from a 
technical perspective for now.
   
   You mentioned that early development on KNIB was already underway internally 
when NAS was proposed. Was there any coordination or feedback exchanged between 
the two efforts at that point? Early communication helps avoid situations like 
this.
   
   **On the security example** - Thanks for testing the injection. Good to know 
the JSON parser catches it. I did a high-level review and tested just the 
string formatting part, not the full end-to-end flow with QEMU. I should have 
been clearer about that scope in my original comment.
   
   I still think we should add proper input validation on the CloudStack side. 
It's possible that a CVE gets discovered in libvirt or QEMU that could turn 
this into an attack vector. Or if this gets extended to other hypervisors in 
the future, it could end up as a possible issue there.
   
   **On validation defaults** - Fair point on 
`backup.validation.max.concurrent.operations.per.host` defaulting to 1.
   I missed that earlier. That does limit the per-host impact. Changing 
`enforce.resource.limit.on.backup.validation.vm` to `true` by default sounds 
like a good step as well.
   
   ---
   
   I don't currently have deep familiarity with the existing backup framework 
implementation, and I'm probably not the best person for a detailed code review 
of this PR. I came across it because the discussion felt like it wasn't heading 
in a productive direction, and I wanted to help move things forward. Looking 
forward to the updates.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to