jvarenina commented on pull request #7378: URL: https://github.com/apache/geode/pull/7378#issuecomment-1072272422
Hi @boglesby,

I also assumed that the same race condition is possible for client connections, but I haven't tried to reproduce it. Thanks for pointing this out, for all the other valuable information, and for the extensive testing you have done.

If we decide to go with this solution, I agree that we should make the load-poll-interval parameter configurable for gateway receivers. Lowering it would slightly mitigate the effects of the race condition.

The load-balance gateways command works this way on the server:
- pauses the gateway-sender
- destroys all connections and then relies on the mechanism used during connection creation (ClientConnectionRequest/Response) to achieve better load balancing
- resumes the gateway-sender

This command again results in a burst of connection requests that could hit the issue caused by the race condition.

Maybe, instead of the servers sending load information periodically, the locator could scrape it from the servers (perhaps using CacheServerMXBean) and apply it for all receivers at the same time. The locator could fetch the load when it receives a connection request and its current connection load is stale (e.g., older than 200 ms), as we don't expect many connections from gateway-senders. This way, the locator would at least have an up-to-date connection load taken at a similar time on all servers. This solution should even catch the change in connection load when the load-balance command destroys all connections.
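To make the on-demand refresh idea more concrete, here is a rough, self-contained sketch. It is not Geode code: `OnDemandLoadRefresher`, `pickReceiver`, and the `probe` function are hypothetical names, and `probe` only stands in for whatever call the locator would use to read one receiver's connection load (for example via JMX). The sketch assumes that a failed probe simply keeps the previous snapshot and leaves out the "check profiles and retry" step.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.function.Function;
import java.util.function.LongSupplier;

/** Hypothetical sketch: none of these names exist in Geode. */
class OnDemandLoadRefresher {
  private final long maxAgeMillis;              // e.g. 200 ms (the proposed staleness threshold)
  private final LongSupplier clock;             // injectable time source, in milliseconds
  private final Function<String, Float> probe;  // assumption: returns one receiver's connection load
  private Map<String, Float> loads = new LinkedHashMap<>();
  private long lastRefreshMillis;
  private boolean everRefreshed = false;

  OnDemandLoadRefresher(long maxAgeMillis, LongSupplier clock, Function<String, Float> probe) {
    this.maxAgeMillis = maxAgeMillis;
    this.clock = clock;
    this.probe = probe;
  }

  /** Called per connection request: refresh all loads together if stale, then pick the least loaded. */
  synchronized String pickReceiver(List<String> receivers) {
    long now = clock.getAsLong();
    if (!everRefreshed || now - lastRefreshMillis > maxAgeMillis) {
      refreshAll(receivers, now);
    }
    String best = null;
    for (String receiver : receivers) {
      Float load = loads.get(receiver);
      if (load == null) {
        continue; // no load known for this receiver
      }
      if (best == null || load < loads.get(best)) {
        best = receiver;
      }
    }
    return best; // null if no load is known for any receiver
  }

  /** Probes all receivers asynchronously and swaps in the new snapshot only if every probe succeeds. */
  private void refreshAll(List<String> receivers, long now) {
    Map<String, CompletableFuture<Float>> pending = new LinkedHashMap<>();
    for (String receiver : receivers) {
      pending.put(receiver, CompletableFuture.supplyAsync(() -> probe.apply(receiver)));
    }
    Map<String, Float> fresh = new LinkedHashMap<>();
    try {
      for (Map.Entry<String, CompletableFuture<Float>> entry : pending.entrySet()) {
        fresh.put(entry.getKey(), entry.getValue().join());
      }
    } catch (CompletionException e) {
      return; // a probe failed: keep the old snapshot (retry policy left out of this sketch)
    }
    loads = fresh;
    lastRefreshMillis = now;
    everRefreshed = true;
  }
}
```

Because the whole snapshot is replaced at once, all receivers' loads come from roughly the same moment, which is the property the idea relies on. How quickly the real probes would return is the open question.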
Maybe an algorithm could work this way:
- When a connection request is received, check whether the current connection load is stale (older than a new parameter, load-update-frequency=200ms)
  - if yes, then try to get the connection load from all servers asynchronously
    - if the load is received from all servers, then apply it in the locator
    - if any get fails, then check the profiles again and immediately retry for all servers
  - use the current load immediately
- If no connection request is received, then just get the load periodically, e.g., every 5 seconds (load-poll-interval)

I'm not sure whether this makes sense, as I don't know how fast the locator can scrape the load. I can create a prototype if you think this could work.

-- 
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
To unsubscribe, e-mail: notifications-unsubscr...@geode.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org