[jira] [Commented] (IGNITE-26532) Design CMG/MG absence handling logic

Vladislav Pyatkov (Jira) Mon, 24 Nov 2025 03:24:06 -0800


    [ 
https://issues.apache.org/jira/browse/IGNITE-26532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18040289#comment-18040289
 ]


Vladislav Pyatkov commented on IGNITE-26532:
--------------------------------------------

h2. CMG/MG services

A CMG/MG service call takes a timeout that selects one of three modes.
If the timeout is 0, a single attempt is made to send the command for 
replication.
If the timeout is infinite (timeout = Long.MAX_VALUE), the service either 
retries replication indefinitely or waits until it becomes ready.
If the timeout is bounded (any other value), replication attempts continue 
only for the specified duration.
If the replication command cannot be executed because the service is 
unavailable, the call fails with a ReplicationGroupUnavailableException.
{code}
import java.util.concurrent.CompletableFuture;

/**
 * CMG/MG service.
 * The service applies a replication operation to the replication group.
 */
public interface Service {
   /**
    * Invokes a replication operation on the replication group with a timeout.
    *
    * @param operation Replication operation.
    * @param timeout Timeout in milliseconds.
    *                If the timeout is 0, only a single attempt is made to execute the operation.
    *                If the timeout is {@code Long.MAX_VALUE}, the call waits indefinitely
    *                until the operation is executed.
    * @return A future that completes with a result if the group is available while the
    *         replication command is executed. If the group is not available and the timeout
    *         expires, the future completes with {@link ReplicationGroupUnavailableException}.
    */
   CompletableFuture<Boolean> invoke(Object operation, long timeout);
}
{code}
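
The three timeout modes described above can be illustrated with a small retry loop. This is only a minimal sketch under assumptions, not the proposed implementation: {{RetryingInvoker}}, {{RETRY_DELAY_MS}}, the {{BooleanSupplier}} that stands in for one replication attempt, and the local {{ReplicationGroupUnavailableException}} class are all hypothetical names introduced for illustration.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.BooleanSupplier;

/** Sketch of the three timeout modes; every name here is hypothetical. */
class RetryingInvoker {
    /** Local stand-in for the exception named in the comment. */
    static class ReplicationGroupUnavailableException extends RuntimeException {
        ReplicationGroupUnavailableException(String message) {
            super(message);
        }
    }

    /** Assumed pause between attempts; the design does not specify one. */
    static final long RETRY_DELAY_MS = 10;

    private final ScheduledExecutorService scheduler;

    /** Stand-in for one replication attempt: true means the group accepted the command. */
    private final BooleanSupplier attempt;

    RetryingInvoker(ScheduledExecutorService scheduler, BooleanSupplier attempt) {
        this.scheduler = scheduler;
        this.attempt = attempt;
    }

    /** Runs the first attempt immediately, then retries according to the timeout mode. */
    CompletableFuture<Boolean> invoke(long timeoutMs) {
        CompletableFuture<Boolean> result = new CompletableFuture<>();
        tryOnce(result, System.nanoTime(), timeoutMs);
        return result;
    }

    private void tryOnce(CompletableFuture<Boolean> result, long startNanos, long timeoutMs) {
        try {
            if (attempt.getAsBoolean()) {
                result.complete(true);
                return;
            }
        } catch (RuntimeException e) {
            // An unexpected failure of the attempt itself ends the call.
            result.completeExceptionally(e);
            return;
        }

        long elapsedMs = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - startNanos);
        boolean infinite = timeoutMs == Long.MAX_VALUE;

        // With timeoutMs == 0 this branch is always taken after the first failed attempt.
        if (!infinite && elapsedMs >= timeoutMs) {
            result.completeExceptionally(new ReplicationGroupUnavailableException(
                    "Replication group unavailable after " + elapsedMs + " ms"));
            return;
        }

        // Bounded and infinite modes: schedule another attempt without blocking a thread.
        scheduler.schedule(
                () -> tryOnce(result, startNanos, timeoutMs), RETRY_DELAY_MS, TimeUnit.MILLISECONDS);
    }
}
```

The sketch polls with a fixed delay for simplicity. A reactive variant that waits for a readiness notification instead of polling would fit the same contract.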

h2. CMG/MG managers
Manager methods can be invoked either with a timeout or with an infinite wait 
time (the overload without a timeout argument, which passes Long.MAX_VALUE to 
the service).
The execution pattern for both kinds of call is as follows (this code only 
illustrates the implementation approach and is not the final version):
{code}
import java.util.concurrent.CompletableFuture;

/**
 * CMG/MG manager.
 */
public class Manager {
   /** CMG/MG service. */
   Service service;

   /**
    * Invokes the operation with the given timeout.
    *
    * @param operation Replication operation.
    * @param timeout Operation timeout in milliseconds (passed to the service unchanged).
    *                If the timeout is 0, only a single attempt is made to execute the operation.
    *                If the timeout is {@code Long.MAX_VALUE}, the call waits indefinitely
    *                until the operation is executed.
    * @return A future that completes with a result, or exceptionally when the
    *         replication group is unavailable during the timeout.
    */
   public CompletableFuture<Boolean> invoke(Object operation, long timeout) {
      return service.invoke(operation, timeout);
   }

   /**
    * Invokes the operation with an infinite wait.
    *
    * @param operation Replication operation.
    * @return A future that completes with a result, or exceptionally when the
    *         replication group is unavailable.
    */
   public CompletableFuture<Boolean> invoke(Object operation) {
      return service.invoke(operation, Long.MAX_VALUE);
   }
}
{code}
For fail-fast scenarios, a special approach should be provided. This approach 
directly forwards the request to the service and, if the group is not ready, 
immediately returns an exception.
{code}
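import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;

Manager manager = new Manager(); // Assumes manager.service is initialized elsewhere.

// A timeout of 0 makes a single attempt.
CompletableFuture<Boolean> fut = manager.invoke(cmd, 0);

try {
   fut.join();
} catch (CompletionException ex) {
   // join() wraps the failure, so the original exception is the cause.
   if (ex.getCause() instanceof ReplicationGroupUnavailableException) {
      System.err.println("Replication group is unavailable: " + ex.getCause().getMessage());
   } else {
      throw ex;
   }
}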
{code}
A timeout of 0 means a single attempt to execute the command.
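
A fail-fast caller does not have to block on the future. The following is a minimal sketch, assuming the failure is delivered through the future and may be wrapped in a {{CompletionException}} by intermediate stages. {{FailFastHandling}}, {{describe}}, {{unwrap}}, and the local {{ReplicationGroupUnavailableException}} class are hypothetical names used only for illustration.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;

/** Sketch of non-blocking handling of a fail-fast result; all names are hypothetical. */
class FailFastHandling {
    /** Local stand-in for the exception named in the comment. */
    static class ReplicationGroupUnavailableException extends RuntimeException {
        ReplicationGroupUnavailableException(String message) {
            super(message);
        }
    }

    /** Strips the CompletionException wrapper that dependent stages add around a failure. */
    static Throwable unwrap(Throwable t) {
        return (t instanceof CompletionException && t.getCause() != null) ? t.getCause() : t;
    }

    /** Maps the outcome of an invocation to a short status string without blocking. */
    static CompletableFuture<String> describe(CompletableFuture<Boolean> fut) {
        return fut.handle((result, error) -> {
            if (error == null) {
                return "applied: " + result;
            }

            Throwable cause = unwrap(error);

            if (cause instanceof ReplicationGroupUnavailableException) {
                return "group unavailable: " + cause.getMessage();
            }

            // Any other failure is propagated to the caller unchanged.
            throw new CompletionException(cause);
        });
    }
}
```

In this sketch, a caller would pass the future returned by {{manager.invoke(cmd, 0)}} to {{describe}} and react to the status in a later stage.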
 

> Design CMG/MG absence handling logic
> ------------------------------------
>
>                 Key: IGNITE-26532
>                 URL: https://issues.apache.org/jira/browse/IGNITE-26532
>             Project: Ignite
>          Issue Type: Task
>            Reporter: Alexander Lapin
>            Assignee: Vladislav Pyatkov
>            Priority: Major
>              Labels: ignite-3
>
> h3. Motivation
> In case of
>  # loss of majority in *MG* only
>  # loss of majority in *CMG* only
>  # loss of majority in both *CMG* and *MG*
> user operations should behave adequately: within the specified timeouts they 
> attempt to wait for majority restoration, and if it does not happen, they 
> fail with a clear error. At the same time, they should not flood the logs 
> with exceptions on every internal retry.
> We are talking about operations such as:
>  * Schema changes (e.g., creating a table).
>  * Transactions of all types (with partially applied transactions being 
> rolled back).
>  * Adding nodes.
>  * Various {{resetPartitions}}.
>  * …
> At the same time, user operations such as
>  * stopping a node, and
>  * read-only transactions (as in the past)
> must complete successfully without exceptions being logged.
> Internal _system_ operations must wait indefinitely for the restoration of 
> majority in the corresponding system groups (whether via infinite retry or 
> reactively), and under no circumstances should they trigger FG (which is what 
> happens now).
> A node should log reasonably little about the unavailability of a system 
> group, not as excessively as it currently does.
> Cancellation operations (rollback, abort, etc.) should, whenever possible, 
> work even in the absence of CMG/MG. This needs to be verified separately, 
> since it’s unclear if we can guarantee it for everything.
> When CMG/MG is restored, the cluster should return to normal operability.
> h3. Definition of Done
> A design document that addresses the aforementioned questions is ready.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
