rhtyd opened a new pull request #2638: agent: Fixes #2633 don't wait for 
pending tasks on reconnection
URL: https://github.com/apache/cloudstack/pull/2638
 
 
   When the agent loses its connection to the management server, the
   reconnection logic waits for any pending tasks to finish. However, when
   such tasks do finish, they fail to send an `Answer` back to the management
   server. From the management server's perspective, such pending operations
   are therefore stuck in an FSM state and need manual removal or fixing.
   This is by design: the management server's cmd-answer request pattern is
   code/execution dependent, so even if the answer were sent after the
   management server came back up (reconnected), the management server would
   fail to acknowledge and process it, because the listeners are missing or
   it is not in the exact state needed to handle the answer.
   
   Historically, the Agent would wait for its internal tasks to complete
   before reconnecting, but I found no reason why it should delay
   reconnection at all.
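
The behaviour change described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in this PR: `ReconnectSketch`, its in-progress counter, and the `maxWaitMillis` bound are hypothetical names introduced here (the bound exists only so the demo terminates).

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for the agent; not the actual CloudStack Agent class.
class ReconnectSketch {
    private final AtomicInteger inProgress = new AtomicInteger(); // tasks still running
    private final AtomicBoolean connected = new AtomicBoolean();

    void taskStarted() { inProgress.incrementAndGet(); }
    void taskFinished() { inProgress.decrementAndGet(); }
    int pendingTasks() { return inProgress.get(); }
    boolean isConnected() { return connected.get(); }

    // Behaviour before the fix: reconnect only once no task is pending.
    // maxWaitMillis is an assumption added so this demo cannot hang.
    boolean reconnectAfterPendingTasks(long maxWaitMillis) {
        long deadline = System.currentTimeMillis() + maxWaitMillis;
        while (inProgress.get() > 0) {
            if (System.currentTimeMillis() >= deadline) {
                return false; // still blocked by a long-running task
            }
            try {
                Thread.sleep(5);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
        connected.set(true);
        return true;
    }

    // Behaviour after the fix: reconnect right away; pending tasks keep running.
    boolean reconnectImmediately() {
        connected.set(true);
        return true;
    }

    public static void main(String[] args) {
        ReconnectSketch agent = new ReconnectSketch();
        agent.taskStarted(); // e.g. a long-running snapshot job
        System.out.println("waiting variant reconnected: " + agent.reconnectAfterPendingTasks(50));
        System.out.println("immediate variant reconnected: " + agent.reconnectImmediately());
        System.out.println("tasks still pending: " + agent.pendingTasks());
    }
}
```

In this sketch the waiting variant gives up while the task is still pending, whereas the immediate variant connects regardless of the counter.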
   
   ## Types of changes
   <!--- What types of changes does your code introduce? Put an `x` in all the 
boxes that apply: -->
   - [ ] Breaking change (fix or feature that would cause existing 
functionality to change)
   - [ ] New feature (non-breaking change which adds functionality)
   - [ ] Bug fix (non-breaking change which fixes an issue)
   - [ ] Enhancement (improves an existing feature and functionality)
   - [ ] Cleanup (Code refactoring and cleanup, that may add test cases)
   
   ## How Has This Been Tested?
   
   Before fix: Started a snapshot of a volume, then killed/shut down the
   management server and observed that the agent is blocked until the job
   finishes. When the job finishes, it fails to send its answer. When the mgmt
   server is started again, the snapshot is still in the backing state. The
   agent stays blocked until the job finishes, even if the mgmt server comes
   back online in the meantime. In either case, the pending job fails to
   reply (the link object changes, so the `send` fails).
   
   After fix: Same steps as above, but this time the agent is not blocked by
   any long-running pending job and reconnects faster. The failure scenarios
   remain the same, including any manual fixing needed after the mgmt server
   is back.
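
One way to picture why the late reply fails is a task that still holds the link it started with. The sketch below assumes exactly that; `StaleLinkSketch` and its `Link` class are hypothetical stand-ins written for this illustration, not the agent's real networking classes.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical model of an agent whose link object is replaced on reconnect.
class StaleLinkSketch {
    // Hypothetical stand-in for a connection to the management server.
    static final class Link {
        private volatile boolean open = true;

        void close() { open = false; }

        // Returns true only while this particular connection is still open.
        boolean send(String answer) { return open; }
    }

    private final AtomicReference<Link> current = new AtomicReference<>(new Link());

    Link currentLink() { return current.get(); }

    // Reconnecting swaps in a new link and closes the old one.
    void reconnect() {
        Link old = current.getAndSet(new Link());
        old.close();
    }

    public static void main(String[] args) {
        StaleLinkSketch agent = new StaleLinkSketch();
        Link captured = agent.currentLink(); // the pending job keeps this reference
        agent.reconnect();                   // mgmt server went away and came back
        System.out.println("send on captured link succeeded: " + captured.send("answer"));
        System.out.println("send on current link succeeded: " + agent.currentLink().send("answer"));
    }
}
```

Even a send on the new link would not by itself resolve the stuck operation, since the management server may no longer have a listener for that answer.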
   
   ## Checklist:
   <!--- Go over all the following points, and put an `x` in all the boxes that 
apply. -->
   <!--- If you're unsure about any of these, don't hesitate to ask. We're here 
to help! -->
   - [ ] I have read the [CONTRIBUTING](https://github.com/apache/cloudstack/blob/master/CONTRIBUTING.md) document.
   - [ ] My code follows the code style of this project.
   - [ ] My change requires a change to the documentation.
   - [ ] I have updated the documentation accordingly.
   Testing
   - [ ] I have added tests to cover my changes.
   - [ ] All relevant new and existing integration tests have passed.
   - [ ] A full integration test suite with all tests that can run on my environment has passed.
   
   

