Hi again, I finally figured out why I was not getting the "ServerConnectivityException" when executing a large number of functions in Geode, while I did get the exception when running lots of gets/puts/queries.
The reason is that ConnectionImpl::execute(Op op) does not use the timeout set by PoolFactory::setReadTimeout(int timeout) when the operation is a function. Instead, it uses the timeout set by the following system property: gemfire.CLIENT_FUNCTION_TIMEOUT.

Do you see value in adding a method to the PoolFactory, as well as to the ClientCacheFactory, to set this timeout for functions? How about being able to override this timeout on each function invocation by adding a setReadTimeout method to the FunctionService interface?

/Alberto

On 22/5/19 18:03, Alberto Gomez wrote:
> Hi Anthony,
>
> Thanks again for the information.
>
> I have played a bit with the client timeouts and retries and have seen
> operations being rejected when the load is high due to get or put
> operations. Nevertheless, I have not seen that happen when the load on
> the server is high due to invoked functions. Is there a reason for not
> seeing errors with functions, or was my test simply not good enough to
> hit the limits? What about queries sent with OQL? Do the timeout and
> retries apply? Is there a similar protection in the native C++ API?
>
> I'd be willing to contribute to the improvements you mention. Do you
> already have ideas? Anything written down?
>
> /Alberto
>
>
> On 14/5/19 17:01, Anthony Baker wrote:
>> The primary load limiter between the client tier and the Geode servers
>> is the max connections limit, as noted in this writeup:
>>
>> https://cwiki.apache.org/confluence/display/GEODE/Resource+Management+in+Geode
>>
>> When the load is sufficiently high, operations may time out and a Geode
>> client will fail over to less loaded servers. You can limit the number
>> of retries the client will attempt (each gated by a read timeout) and
>> thus slow down incoming operations.
>>
>> We’re looking into some improvements in the client connection pool to
>> improve both performance and behavior at the ragged edge when resources
>> are saturated. Contributions welcome!
>>
>> Anthony
>>
>>
>>> On May 13, 2019, at 9:02 AM, Alberto Gomez <alberto.go...@est.tech> wrote:
>>>
>>> Hi Anthony!
>>>
>>> Thanks a lot for your prompt answer.
>>>
>>> I think it is great that Geode can preserve the availability and
>>> predictable low latency of the cluster when some members are
>>> unresponsive by means of the GMS.
>>>
>>> My question was more targeted at situations in which the load received
>>> by the cluster is so high that all members struggle to offer low
>>> latency. Under such circumstances, does Geode take any action to back
>>> off some of the incoming load?
>>>
>>> Thanks in advance,
>>>
>>> Alberto
>>>
>>>
>>> On 10/5/19 17:52, Anthony Baker wrote:
>>>
>>> Hi Alberto!
>>>
>>> Great questions. One of the fundamental characteristics of Geode is
>>> its Group Membership System (GMS). You can read more about it here [1].
>>> The membership system ensures that failures due to unresponsive members
>>> and/or network partitions are detected quickly. Given that we use
>>> synchronous replication for consistent updates, the GMS algorithms
>>> fence off unresponsive members to preserve the availability (and
>>> predictable low latency) of the cluster as a whole.
>>>
>>> Another factor of resilience is memory load. Regions can be configured
>>> to automatically evict data to disk based on heap usage. In addition,
>>> when a Region exceeds a critical memory usage threshold, further
>>> updates are blocked until the overload is resolved.
>>>
>>> Geode clients route operations to cluster members based on connection
>>> load. This helps balance CPU load across the entire cluster. Cluster
>>> members can set connection maximums to prevent overrunning the
>>> available capacity of an individual server.
>>>
>>> I hope this helps and feel free to keep asking questions :-)
>>>
>>> Anthony
>>>
>>> [1] https://cwiki.apache.org/confluence/display/GEODE/Core+Distributed+System+Concepts
>>>
>>>
>>> On May 10, 2019, at 3:22 AM, Alberto Gomez <alberto.go...@est.tech> wrote:
>>>
>>> Hi Geode community!
>>>
>>> I'd like to know if Geode implements any kind of self-protection
>>> against overload. What I mean by this is some mechanism that allows
>>> Geode servers (and possibly locators) to reject incoming operations
>>> before processing them, when a server detects that it is not able to
>>> handle the amount of operations received in a reasonable way (with
>>> reasonable latency and without processes crashing).
>>>
>>> The goal would be to make sure that Geode (or some part of it) does
>>> not crash under too heavy load, and also that the latency level is
>>> always under control, at least for the amount of traffic the Geode
>>> cluster is supposed to support.
>>>
>>> If Geode does not offer such a mechanism, I would also like to get
>>> your opinion about this possible feature (if you find it interesting)
>>> and also on how it could be implemented. One possible approach could be
>>> having some measure of the current CPU consumption that allows deciding
>>> whether a given operation should be processed or not, by comparing the
>>> CPU consumption value against an overload threshold.
>>>
>>> Thanks in advance for your answers,
>>>
>>> -Alberto
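The function timeout discussed at the top of this thread is resolved from a JVM system property rather than from the pool's read timeout. The lookup pattern can be sketched in plain Java like this; the property name gemfire.CLIENT_FUNCTION_TIMEOUT is from the thread, but the class name and the 0 ms default used here are illustrative assumptions, not Geode's actual implementation:

```java
// Sketch: resolving a client-side function timeout from a JVM system
// property, with a fallback default when the property is not set.
public class FunctionTimeoutLookup {

    // Property name taken from the thread; the default is illustrative only.
    static final String PROP = "gemfire.CLIENT_FUNCTION_TIMEOUT";
    static final int DEFAULT_MS = 0;

    static int resolveTimeoutMillis() {
        // Integer.getInteger returns the default when the property is
        // missing or cannot be parsed as an integer.
        return Integer.getInteger(PROP, DEFAULT_MS);
    }

    public static void main(String[] args) {
        System.out.println(resolveTimeoutMillis()); // default when unset
        System.setProperty(PROP, "5000");
        System.out.println(resolveTimeoutMillis()); // property value once set
    }
}
```

Because the value is fixed per JVM, every function execution in the process shares it, which is exactly why a PoolFactory/ClientCacheFactory setter (or a per-invocation override) would be more flexible.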
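The connection-maximum limiter Anthony describes can be illustrated with a small self-contained sketch (this is not Geode's implementation; the class name and behavior are simplified assumptions): a server admits a connection only while permits remain and rejects immediately at capacity, so the client can fail over to a less loaded member.

```java
import java.util.concurrent.Semaphore;

// Illustrative sketch of a server-side max-connections gate: one permit
// per admitted client connection, with immediate rejection at capacity.
// NOT Geode's implementation -- just the idea behind the connection limit.
public class ConnectionGate {

    private final Semaphore permits;

    public ConnectionGate(int maxConnections) {
        this.permits = new Semaphore(maxConnections);
    }

    // Returns true if the connection is admitted; false if the server is
    // at capacity and the caller should try another server.
    public boolean tryAdmit() {
        return permits.tryAcquire();
    }

    // Called when an admitted connection closes, freeing its permit.
    public void release() {
        permits.release();
    }

    public static void main(String[] args) {
        ConnectionGate gate = new ConnectionGate(2);
        System.out.println(gate.tryAdmit()); // admitted
        System.out.println(gate.tryAdmit()); // admitted
        System.out.println(gate.tryAdmit()); // rejected: at capacity
        gate.release();
        System.out.println(gate.tryAdmit()); // admitted again after release
    }
}
```

A gate like this limits concurrent work per server rather than measuring CPU directly, which is one way to get back-pressure without the overload-threshold bookkeeping discussed above.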