jvarenina commented on pull request #7378:
URL: https://github.com/apache/geode/pull/7378#issuecomment-1072272422


   Hi @boglesby,
   
   I also assumed that the same race condition is possible for the client 
connections, but I haven't tried to reproduce it. Thanks for pointing this out 
and lots of other valuable information. Also, thank you for the extensive 
testing you have done.
   
   If we decide to go with this solution, I agree that we should make the 
load-poll-interval parameter configurable for gateway receivers. Lowering it 
would slightly mitigate the effects of the race condition.
   
   The load-balance gateways command works on the server this way:
   - pauses the gateway-sender
   - destroys all connections and then relies on the mechanism used during 
connection creation (ClientConnectionRequest/Response) to balance the load 
better
   - resumes the gateway-sender
   
   This command will again cause a burst of connection requests that could 
hit the issue caused by the race condition.
   
   Maybe instead of sending load information periodically from the servers, the 
locator could scrape it (perhaps using CacheServerMXBean) from the servers and 
apply it simultaneously for all receivers in the locator. The locator could 
fetch the load when it receives a connection request and the current connection 
load is stale (e.g., older than 200 ms). Fetching on demand should be 
acceptable, as we don't expect many connections from gateway-senders. This way, 
the locator would at least have an up-to-date connection load taken at a 
similar time on all servers. This solution should even catch the change in 
connection load when the load-balance command destroys all connections.
   
   Maybe an algorithm could work this way:
   - When a connection request is received, check if the connection load is 
stale (older than a new parameter, load-update-frequency=200ms)
     - if yes, then try to get the connection load from all servers 
asynchronously
       - if the load is received from all servers, then apply it in the locator
       - if any get fails, then check the profiles again and immediately retry 
for all servers
     - Immediately use the current load
   - If no connection request is received, then just get the load 
periodically, e.g., every 5 seconds (load-poll-interval)
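
   A rough sketch of the request-driven part of that flow, only to make the 
idea concrete. Everything in it is hypothetical: OnDemandLoadCache, the 
per-server load fetchers (which stand in for whatever scraping mechanism is 
used, e.g. CacheServerMXBean), and the injected clock are made-up names and not 
existing Geode APIs. It also makes two simplifying assumptions: the request 
waits for the refresh to finish, and a failed refresh is retried only once.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.function.LongSupplier;
import java.util.function.Supplier;

/** Hypothetical locator-side cache of receiver connection loads (not a Geode class). */
class OnDemandLoadCache {
  // Re-read on every attempt, mirroring "check profiles again" before a retry.
  private final Supplier<Map<String, Supplier<Float>>> profiles;
  private final long maxAgeMillis; // the proposed load-update-frequency, e.g. 200
  private final LongSupplier clock; // injected so the staleness check is testable
  private Map<String, Float> loads = Collections.emptyMap();
  private long lastRefreshMillis;
  private boolean hasSnapshot;

  OnDemandLoadCache(Supplier<Map<String, Supplier<Float>>> profiles,
      long maxAgeMillis, LongSupplier clock) {
    this.profiles = profiles;
    this.maxAgeMillis = maxAgeMillis;
    this.clock = clock;
  }

  /** Called per connection request; synchronized so a burst triggers one refresh. */
  synchronized Map<String, Float> onConnectionRequest() {
    long now = clock.getAsLong();
    boolean stale = !hasSnapshot || now - lastRefreshMillis > maxAgeMillis;
    if (stale) {
      refreshAll(now);
    }
    return loads; // the old snapshot is kept if the refresh failed
  }

  /** Fetches all loads in parallel and applies them only if every fetch succeeded. */
  private void refreshAll(long now) {
    for (int attempt = 0; attempt < 2; attempt++) {
      Map<String, CompletableFuture<Float>> pending = new HashMap<>();
      for (Map.Entry<String, Supplier<Float>> e : profiles.get().entrySet()) {
        pending.put(e.getKey(), CompletableFuture.supplyAsync(e.getValue()));
      }
      try {
        Map<String, Float> snapshot = new HashMap<>();
        for (Map.Entry<String, CompletableFuture<Float>> e : pending.entrySet()) {
          snapshot.put(e.getKey(), e.getValue().join());
        }
        // All servers answered: apply the whole snapshot at once.
        loads = Collections.unmodifiableMap(snapshot);
        lastRefreshMillis = now;
        hasSnapshot = true;
        return;
      } catch (CompletionException failed) {
        // At least one fetch failed: discard the partial result and retry all.
      }
    }
  }
}
```

   The periodic fallback (load-poll-interval) is left out; it could be a 
scheduled task that goes through the same refresh path.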
   
   I'm not sure this makes sense, as I don't know how fast the locator can 
scrape the load. I can create a prototype if you think this could work.



