[ https://issues.apache.org/jira/browse/CASSANDRA-19958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17886482#comment-17886482 ]

Stefan Miklosovic edited comment on CASSANDRA-19958 at 10/2/24 7:53 PM:
------------------------------------------------------------------------

[[email protected]] don't we want to control the number of threads 
for the HINTS stage at runtime? I can imagine having an MBean method on 
StorageService / StorageProxy so we could resize that pool dynamically if one 
sees it necessary. By lowering the thread pool size, we could basically 
"throttle" hint submissions, which means we might prioritize other operations. 
If we are in a hurry and want to submit all hints as fast as possible, we 
might set it higher so everything is written sooner.

I am sorry if this is already the case. 

I think this will be controlled by nodetool get/setconcurrency, right?
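
For illustration, runtime resizing over JMX could look roughly like the sketch 
below. The MBean name and attribute names are assumptions for the sake of the 
example, not a confirmed API, so verify them against the actual version:
{code:java}
import javax.management.Attribute;
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

// Hypothetical sketch: resize a stage's thread pool at runtime over JMX.
public class ResizeStagePool
{
    public static void main(String[] args) throws Exception
    {
        // Default Cassandra JMX port; adjust host/port as needed.
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://127.0.0.1:7199/jmxrmi");
        try (JMXConnector connector = JMXConnectorFactory.connect(url))
        {
            MBeanServerConnection mbs = connector.getMBeanServerConnection();
            // Assumed MBean name for the stage's pool; verify per version.
            ObjectName stage = new ObjectName(
                    "org.apache.cassandra.request:type=MutationStage");
            // Lower the pool to "throttle" the stage, raise it to drain faster.
            mbs.setAttribute(stage, new Attribute("CorePoolSize", 4));
            mbs.setAttribute(stage, new Attribute("MaximumPoolSize", 4));
        }
    }
}
{code}
nodetool get/setconcurrency would be the CLI counterpart of the same idea.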

Anyway, this is an interesting problem, especially after reading this (1). I 
wonder what would happen if the mutation stage were "slow", so requests would 
time out there, but we would still manage to submit a hint. We would basically 
end up with the request not being written locally while hints for it would 
eventually exist. But ... this is already the case, isn't it? 
{code:java}
hint1,hint2,hint3,....hint100,mutation1,mutation2,....mutation100 {code}
Is this really correct? I think the queue would look more like this:
{code:java}
hint1_1, hint1_2, mutation1, hint2_1, hint2_2, mutation2, hint3, mutation3, 
mutation4 ... {code}
How is it that there would be 100 hints first and only after that all the 
mutations? From the code, it looks like the hints for a request come first, 
then its mutation, and so on.
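
Just to illustrate the ordering question (a toy example in plain Java, not 
Cassandra code): a shared FIFO stage runs tasks in exactly the order they were 
submitted, so mutations only wait behind a whole burst of hints if all those 
hints really were enqueued first:
{code:java}
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Toy sketch of a single shared stage: tasks run strictly in submission order.
public class SharedQueueDemo
{
    public static void main(String[] args) throws InterruptedException
    {
        ExecutorService sharedStage = Executors.newSingleThreadExecutor();

        // Burst scenario: every hint is enqueued before any mutation.
        for (int i = 1; i <= 3; i++)
        {
            int n = i;
            sharedStage.submit(() -> System.out.println("hint" + n));
        }
        for (int i = 1; i <= 3; i++)
        {
            int n = i;
            sharedStage.submit(() -> System.out.println("mutation" + n));
        }

        // Prints hint1..hint3, then mutation1..mutation3: the mutations sit
        // behind every hint that is already in the queue.
        sharedStage.shutdown();
        sharedStage.awaitTermination(5, TimeUnit.SECONDS);
    }
}
{code}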

Btw, looking into the implementation of shouldSendHints:
{code:java}
public boolean shouldSendHints()
{
    if (!DatabaseDescriptor.getEnforceNativeDeadlineForHints())
        return true;

    long now = MonotonicClock.Global.preciseTime.now();
    long clientDeadline = clientDeadline();
    return now < clientDeadline;
} {code}
enforce_native_deadline_for_hints is set to false by default, so we always send 
hints. If you set it to true, execution reaches the "now < clientDeadline" 
check, and the hint will not be sent once the request's deadline has passed. 
Isn't that something which might prevent hints from piling up?
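
To make the effect concrete, here is a minimal sketch of that gate (the names 
are illustrative and System.nanoTime() stands in for MonotonicClock; this is 
not the actual Cassandra code path):
{code:java}
// Illustrative sketch of a deadline-gated hint submission, assuming a
// precomputed client deadline in nanos for the originating request.
public class DeadlineGatedHints
{
    private final boolean enforceDeadline; // enforce_native_deadline_for_hints

    DeadlineGatedHints(boolean enforceDeadline)
    {
        this.enforceDeadline = enforceDeadline;
    }

    boolean shouldSendHint(long clientDeadlineNanos)
    {
        if (!enforceDeadline)
            return true; // default: always send, so hints can pile up
        return System.nanoTime() < clientDeadlineNanos;
    }

    void maybeSubmitHint(Runnable hintWrite, long clientDeadlineNanos)
    {
        if (shouldSendHint(clientDeadlineNanos))
            hintWrite.run(); // in reality this would be enqueued to a stage
        // otherwise the stale hint is dropped: the client already timed out
    }
}
{code}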
(1) 
[https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/transport/Dispatcher.java#L254-L260]


> Local Hints are stepping on local mutations
> -------------------------------------------
>
>                 Key: CASSANDRA-19958
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-19958
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Legacy/Local Write-Read Paths
>            Reporter: Jaydeepkumar Chovatia
>            Priority: Normal
>         Attachments: image-2024-09-26-15-28-20-435.png
>
>
> Cassandra uses the same queue (Stage.MUTATION) to process local mutations as 
> well as local hint writing. CASSANDRA-19534 added timeouts for local 
> mutations, but local hint writing does not honor that timeout by design, as 
> it honors a different timeout, i.e. _max_hint_window_in_ms_
>  
> *The Problem*
> Let's understand the problem by having five nodes Cassandra cluster N1, N2, 
> N3, N4, N5 with the following configuration:
>  * concurrent_writes: 10
>  * native_transport_timeout: 5s 
>  * write_request_timeout_in_ms: 2000 //2 seconds
> +StorageProxy.java snippet...+
>  
> !image-2024-09-26-15-28-20-435.png|width=600,height=200!
>  
> Let's assume N4 and N5 are slow flapping or down. Assume N1 receives a flurry 
> of mutations, so this is what happens on N1:
>  # Line no 1542: Append 100 hints to the Stage.Mutation queue 
>  # Line no 1547: Append 100 local mutations to the Stage.Mutation queue 
>  Stage.Mutation queue on N1 would look as follows:
> {code:java}
> hint1,hint2,hint3,....hint100,mutation1,mutation2,....mutation100 {code}
>  * Assume each hint runnable takes 1 second; with concurrent_writes: 10, 
> processing 100 hints takes (100 / 10) * 1s = 10 seconds, and only after that 
> will the local mutations be processed. 
>  
> So, in production, it would look like N1 is inactive for almost 10 seconds as 
> it is just writing hints locally and not participating in any Quorum, etc.
>  
> The problem becomes really huge if, say, the load is high and hints pile up 
> to 1M; then N1 will choke. The only solution at that point is for an operator 
> to restart N1 to drain all the piled-up hints from the Stage.Mutation queue.
>  
> The above problem happens because local hint writing and local mutations 
> both use the same queue, i.e., Stage.Mutation.
> Local mutation writing is on the hot path, whereas a slight delay in local 
> hint writing does not cause much trouble.
>  
> *Reproducible steps*
>  # Pull the latest 4.1.x release
>  # Create a 5-node cluster
>  # Set the following configuration
> {code:java}
> native_transport_timeout: 10s
> write_request_timeout_in_ms: 2000
> enforce_native_deadline_for_hints: true{code}
>  # Inject 1s of latency inside the following API in _StorageProxy.java_ on 
> all five nodes:
> {code:java}
> private static void performLocally(Stage stage, Replica localReplica,
>                                    final Runnable runnable, final RequestCallback<?> handler,
>                                    Object description, Dispatcher.RequestTime requestTime)
> {
>     stage.maybeExecuteImmediately(new LocalMutationRunnable(localReplica, requestTime)
>     {
>         public void runMayThrow()
>         {
>             try
>             {
>                 Thread.sleep(1000); // Inject latency here
>                 runnable.run();
>                 handler.onResponse(null);
>             }
>             catch (Exception ex)
>             {
>                 if (!(ex instanceof WriteTimeoutException))
>                     logger.error("Failed to apply mutation locally : ", ex);
>                 handler.onFailure(FBUtilities.getBroadcastAddressAndPort(),
>                                   RequestFailureReason.forException(ex));
>             }
>         }
>
>         @Override
>         public String description()
>         {
>             // description is an Object and toString() is called so we do not
>             // have to evaluate Mutation.toString() unless explicitly needed
>             return description.toString();
>         }
>
>         @Override
>         protected Verb verb()
>         {
>             return Verb.MUTATION_REQ;
>         }
>     });
> } {code}
>  # Run a write-only stress test for an hour or so
>  # You will see the Stage.Mutation queue pile up to >1 million entries
>  # Stop the load
>  # Stage.Mutation will not be cleared immediately, and you cannot perform new 
> writes. Basically, at this point the Cassandra cluster has become inoperable 
> from a new-mutations point of view; only reads will be served
>  
> *Solution*
> The solution is to segregate the local mutation queue and local hint writing 
> queue to address the problem above. Here is the PR: 
> [https://github.com/apache/cassandra/pull/3580]
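>
> To sketch the idea (illustrative only; this is not the actual change in the 
> PR): give local hint writing its own executor so a hint backlog can no longer 
> block local mutations:
> {code:java}
> import java.util.concurrent.ExecutorService;
> import java.util.concurrent.Executors;
>
> // Illustrative sketch only: a dedicated pool for local hint writes,
> // separate from Stage.MUTATION, so hints and mutations no longer
> // compete for the same queue.
> public class SegregatedStages
> {
>     // Hot path: local mutations keep their own pool.
>     static final ExecutorService MUTATION_STAGE = Executors.newFixedThreadPool(10);
>
>     // Hints tolerate delay, so they get a smaller, separate pool.
>     static final ExecutorService HINT_WRITE_STAGE = Executors.newFixedThreadPool(2);
>
>     static void submitLocalMutation(Runnable mutation)
>     {
>         MUTATION_STAGE.submit(mutation);
>     }
>
>     static void submitLocalHintWrite(Runnable hintWrite)
>     {
>         HINT_WRITE_STAGE.submit(hintWrite); // no longer behind mutations
>     }
> }
> {code}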
>  


