Hi again, I finally figured out why I was not getting the "ServerConnectivityException" when executing a large number of functions in Geode, while I did get the exception when running lots of gets/puts/queries.
The reason is that ConnectionImpl::execute(Op op) does not use the timeout set by PoolFactory::setReadTimeout(int timeout) when the operation is a function. Instead, it uses the timeout set by the following system property: gemfire.CLIENT_FUNCTION_TIMEOUT.

Do you see value in adding a method to the PoolFactory, as well as to the ClientCacheFactory, to set this timeout for functions? How about being able to override this timeout on each function invocation by adding a setReadTimeout method to the FunctionService interface?

/Alberto

On 22/5/19 18:03, Alberto Gomez wrote:
> Hi Anthony,
>
> Thanks again for the information.
>
> I have played a bit with the client timeouts and retries and have seen
> operations being rejected when the load is high due to get or put
> operations. Nevertheless, I have not seen that happen when the load on
> the server is high due to invoked functions. Is there a reason for not
> seeing errors with functions, or was my test simply not good enough to
> hit the limits? What about queries sent with OQL? Do the timeout and
> retries apply? Is there a similar protection in the native C++ API?
>
> I'd be willing to contribute to the improvements you mention. Do you
> already have ideas? Anything written down?
>
> /Alberto
>
>
> On 14/5/19 17:01, Anthony Baker wrote:
>> The primary load limiter between the client tier and the Geode servers
>> is the max connections limit, as noted in this writeup:
>>
>> https://cwiki.apache.org/confluence/display/GEODE/Resource+Management+in+Geode
>>
>> When the load is sufficiently high, operations may time out and a Geode
>> client will fail over to less loaded servers. You can limit the number
>> of retries the client will attempt (each gated by a read timeout) and
>> thus slow down incoming operations.
>>
>> We’re looking into some improvements in the client connection pool to
>> improve both performance and behavior at the ragged edge when resources
>> are saturated. Contributions welcome!
>>
>> Anthony
>>
>>
>>> On May 13, 2019, at 9:02 AM, Alberto Gomez <alberto.go...@est.tech> wrote:
>>>
>>> Hi Anthony!
>>>
>>> Thanks a lot for your prompt answer.
>>>
>>> I think it is great that Geode can preserve the availability and
>>> predictable low latency of the cluster when some members are
>>> unresponsive by means of the GMS.
>>>
>>> My question was more targeted at situations in which the load received
>>> by the cluster is so high that all members struggle to offer low
>>> latency. Under such circumstances, does Geode take any action to back
>>> off some of the incoming load?
>>>
>>> Thanks in advance,
>>>
>>> Alberto
>>>
>>>
>>> On 10/5/19 17:52, Anthony Baker wrote:
>>>
>>> Hi Alberto!
>>>
>>> Great questions. One of the fundamental characteristics of Geode is
>>> its Group Membership System (GMS). You can read more about it here [1].
>>> The membership system ensures that failures due to unresponsive members
>>> and/or network partitions are detected quickly. Given that we use
>>> synchronous replication for consistent updates, the GMS algorithms
>>> fence off unresponsive members to preserve the availability (and
>>> predictable low latency) of the cluster as a whole.
>>>
>>> Another factor of resilience is memory load. Regions can be configured
>>> to automatically evict data to disk based on heap usage. In addition,
>>> when a Region exceeds a critical memory usage threshold, further
>>> updates are blocked until the overload is resolved.
>>>
>>> Geode clients route operations to cluster members based on connection
>>> load. This helps balance CPU load across the entire cluster. Cluster
>>> members can set connection maximums to prevent overrunning the
>>> available capacity of an individual server.
>>>
>>> I hope this helps and feel free to keep asking questions :-)
>>>
>>> Anthony
>>>
>>> [1] https://cwiki.apache.org/confluence/display/GEODE/Core+Distributed+System+Concepts
>>>
>>>
>>> On May 10, 2019, at 3:22 AM, Alberto Gomez <alberto.go...@est.tech> wrote:
>>>
>>> Hi Geode community!
>>>
>>> I'd like to know if Geode implements any kind of self-protection
>>> against overload. What I mean by this is some mechanism that allows
>>> Geode servers (and possibly locators) to reject incoming operations
>>> before processing them, when a server detects that it is not able to
>>> handle the amount of operations received in a reasonable way (with
>>> reasonable latency and without processes crashing).
>>>
>>> The goal would be to make sure that Geode (or some part of it) does
>>> not crash under too heavy load, and also that the latency level is
>>> always under control, at least for the amount of traffic the Geode
>>> cluster is supposed to support.
>>>
>>> If Geode does not offer such a mechanism, I would also like to get
>>> your opinion about this possible feature (if you find it interesting)
>>> and also on how it could be implemented. One possible approach could be
>>> having some measure of the current CPU consumption that allows deciding
>>> whether a given operation should be processed or not, by comparing the
>>> CPU consumption value against an overload threshold.
>>>
>>> Thanks in advance for your answers,
>>>
>>> -Alberto
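The function timeout discussed at the top of this thread is resolved from a JVM system property rather than from the pool's read timeout. The lookup pattern can be sketched in plain Java like this; the property name gemfire.CLIENT_FUNCTION_TIMEOUT is from the thread, but the class name and the 0 ms default used here are illustrative assumptions, not Geode's actual implementation:

```java
// Sketch: resolving a client-side function timeout from a JVM system
// property, with a fallback default when the property is not set.
public class FunctionTimeoutLookup {

    // Property name taken from the thread; the default is illustrative only.
    static final String PROP = "gemfire.CLIENT_FUNCTION_TIMEOUT";
    static final int DEFAULT_MS = 0;

    static int resolveTimeoutMillis() {
        // Integer.getInteger returns the default when the property is
        // missing or cannot be parsed as an integer.
        return Integer.getInteger(PROP, DEFAULT_MS);
    }

    public static void main(String[] args) {
        System.out.println(resolveTimeoutMillis()); // default when unset
        System.setProperty(PROP, "5000");
        System.out.println(resolveTimeoutMillis()); // property value once set
    }
}
```

Because the value is fixed per JVM, every function execution in the process shares it, which is exactly why a PoolFactory/ClientCacheFactory setter (or a per-invocation override) would be more flexible.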
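The connection-maximum limiter Anthony describes can be illustrated with a small self-contained sketch (this is not Geode's implementation; the class name and behavior are simplified assumptions): a server admits a connection only while permits remain and rejects immediately at capacity, so the client can fail over to a less loaded member.

```java
import java.util.concurrent.Semaphore;

// Illustrative sketch of a server-side max-connections gate: one permit
// per admitted client connection, with immediate rejection at capacity.
// NOT Geode's implementation -- just the idea behind the connection limit.
public class ConnectionGate {

    private final Semaphore permits;

    public ConnectionGate(int maxConnections) {
        this.permits = new Semaphore(maxConnections);
    }

    // Returns true if the connection is admitted; false if the server is
    // at capacity and the caller should try another server.
    public boolean tryAdmit() {
        return permits.tryAcquire();
    }

    // Called when an admitted connection closes, freeing its permit.
    public void release() {
        permits.release();
    }

    public static void main(String[] args) {
        ConnectionGate gate = new ConnectionGate(2);
        System.out.println(gate.tryAdmit()); // admitted
        System.out.println(gate.tryAdmit()); // admitted
        System.out.println(gate.tryAdmit()); // rejected: at capacity
        gate.release();
        System.out.println(gate.tryAdmit()); // admitted again after release
    }
}
```

A gate like this limits concurrent work per server rather than measuring CPU directly, which is one way to get back-pressure without the overload-threshold bookkeeping discussed above.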