[
https://issues.apache.org/jira/browse/IGNITE-26532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18040289#comment-18040289
]
Vladislav Pyatkov commented on IGNITE-26532:
--------------------------------------------
h2. CMG/MG services
The CMG/MG service can be invoked either with a timeout or without one.
If the service is called without a timeout (timeout = 0), a single attempt will
be made to send the command for replication.
If the service is called with an infinite timeout (timeout = Long.MAX_VALUE),
replication attempts will continue indefinitely, or the call will wait until
the service becomes ready.
If the service is called with a bounded timeout, replication attempts will
continue only for the specified duration (timeout).
If the replication command cannot be executed because the service is
unavailable, the returned future completes exceptionally with a
ReplicationGroupUnavailableException.
{code}
import java.util.concurrent.CompletableFuture;

/**
 * CMG/MG service.
 * The service applies a replication operation to the replication group.
 */
public interface Service {
    /**
     * Invokes a replication operation on the replication group with a timeout.
     *
     * @param operation Replication operation.
     * @param timeout Timeout in milliseconds.
     *     If the timeout is 0, only a single attempt is made to execute the
     *     operation.
     *     If the timeout is {@code Long.MAX_VALUE}, the call waits indefinitely
     *     until the operation is executed.
     * @return A future that completes with the result if the group is available
     *     while the replication command is being executed.
     *     If the group is not available and the timeout expires, the future
     *     completes exceptionally with {@link ReplicationGroupUnavailableException}.
     */
    CompletableFuture<Boolean> invoke(Object operation, long timeout);
}
{code}
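To make the three timeout modes concrete, here is a minimal sketch of a retry loop that would satisfy the contract above. It is not the actual Ignite implementation: {{RetryingInvoker}}, the {{RETRY_DELAY_MS}} pause, and the {{Supplier}}-based command are assumptions made for illustration, and the exception class is a stand-in because its real definition is not shown in this comment.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

/** Stand-in for the exception named above (assumed to be unchecked). */
class ReplicationGroupUnavailableException extends RuntimeException {
    ReplicationGroupUnavailableException(String message) {
        super(message);
    }
}

/** Hypothetical helper that illustrates the timeout contract; not Ignite code. */
class RetryingInvoker {
    /** Assumed pause between attempts, in milliseconds. */
    private static final long RETRY_DELAY_MS = 10;

    /** Daemon thread, so pending retries never keep the JVM alive. */
    private final ScheduledExecutorService scheduler =
        Executors.newSingleThreadScheduledExecutor(r -> {
            Thread t = new Thread(r, "retry-sketch");
            t.setDaemon(true);
            return t;
        });

    /** The command throws ReplicationGroupUnavailableException while the group is down. */
    CompletableFuture<Boolean> invoke(Supplier<Boolean> command, long timeoutMs) {
        CompletableFuture<Boolean> result = new CompletableFuture<>();
        attempt(command, timeoutMs, System.nanoTime(), result);
        return result;
    }

    private void attempt(Supplier<Boolean> command, long timeoutMs, long startNanos,
        CompletableFuture<Boolean> result) {
        try {
            result.complete(command.get());
        } catch (ReplicationGroupUnavailableException e) {
            long elapsedMs = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - startNanos);
            // timeoutMs == 0: always true, so exactly one attempt is made.
            // timeoutMs == Long.MAX_VALUE: never true, so retries continue indefinitely.
            if (elapsedMs >= timeoutMs) {
                result.completeExceptionally(e);
            } else {
                scheduler.schedule(() -> attempt(command, timeoutMs, startNanos, result),
                    RETRY_DELAY_MS, TimeUnit.MILLISECONDS);
            }
        } catch (RuntimeException e) {
            // Any other failure is not retried.
            result.completeExceptionally(e);
        }
    }
}
```

In this sketch the timeout is checked only after a failed attempt, so a bounded call may overshoot its timeout by up to one retry delay plus the duration of the last attempt. A reactive variant that waits for a "group is ready" notification instead of polling would satisfy the same contract.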
h2. CMG/MG managers
Manager methods can be invoked either with a timeout or with an infinite wait
time (the overload without a timeout, which passes Long.MAX_VALUE to the
service).
For calls with a timeout, the execution pattern is as follows (this code only
illustrates the implementation approach; it is not the final version):
{code}
import java.util.concurrent.CompletableFuture;

/**
 * CMG/MG manager.
 */
public class Manager {
    /** CMG/MG service. */
    Service service;

    /**
     * Invokes a replication operation with a timeout.
     *
     * @param operation Replication operation.
     * @param timeout Operation timeout in milliseconds.
     *     If the timeout is 0, only a single attempt is made to execute the
     *     operation.
     *     If the timeout is {@code Long.MAX_VALUE}, the call waits indefinitely
     *     until the operation is executed.
     * @return A future that completes with the result, or exceptionally if the
     *     replication group is unavailable for the duration of the timeout.
     */
    public CompletableFuture<Boolean> invoke(Object operation, long timeout) {
        return service.invoke(operation, timeout);
    }

    /**
     * Invokes a replication operation and waits indefinitely for the group.
     *
     * @param operation Replication operation.
     * @return A future that completes with the result, or exceptionally if the
     *     replication group is unavailable.
     */
    public CompletableFuture<Boolean> invoke(Object operation) {
        return service.invoke(operation, Long.MAX_VALUE);
    }
}
{code}
For fail-fast scenarios, a special approach should be provided. This approach
directly forwards the request to the service and, if the group is not ready,
immediately returns an exception.
{code}
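Manager manager = new Manager();
try {
    // A timeout of 0 requests a single attempt.
    CompletableFuture<Boolean> fut = manager.invoke(cmd, 0);
    fut.get();
} catch (ExecutionException ex) {
    // Future.get() wraps the failure, so the cause has to be inspected.
    if (ex.getCause() instanceof ReplicationGroupUnavailableException) {
        System.err.println("Replication group is unavailable: " + ex.getCause().getMessage());
    }
} catch (InterruptedException ex) {
    Thread.currentThread().interrupt();
}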
{code}
A timeout of 0 means a single attempt to execute the command.
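Callers that must not block can react to the same failure asynchronously. The sketch below is an illustration only: {{OutcomeReporter}} and its methods are hypothetical names, and the exception class is again a stand-in assumed to be unchecked. It shows that a dependent stage may receive the failure wrapped in a {{CompletionException}}, so the cause should be unwrapped before it is checked.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;

/** Stand-in for the exception named above (assumed to be unchecked). */
class ReplicationGroupUnavailableException extends RuntimeException {
    ReplicationGroupUnavailableException(String message) {
        super(message);
    }
}

/** Hypothetical helper; not part of Ignite. */
class OutcomeReporter {
    /** Returns the original failure if it was wrapped by a dependent stage. */
    static Throwable unwrap(Throwable t) {
        return t instanceof CompletionException && t.getCause() != null ? t.getCause() : t;
    }

    /** Describes the outcome of an invoke() call without blocking the caller. */
    static CompletableFuture<String> describe(CompletableFuture<Boolean> fut) {
        return fut.handle((res, err) -> {
            if (err == null) {
                return "applied: " + res;
            }

            Throwable cause = unwrap(err);

            if (cause instanceof ReplicationGroupUnavailableException) {
                return "replication group is unavailable: " + cause.getMessage();
            }

            // Unrelated failures are propagated unchanged.
            throw new CompletionException(cause);
        });
    }
}
```

With the manager from this comment, a call would look like {{OutcomeReporter.describe(manager.invoke(cmd, 0)).thenAccept(System.err::println)}}.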
> Design CMG/MG absence handling logic
> ------------------------------------
>
> Key: IGNITE-26532
> URL: https://issues.apache.org/jira/browse/IGNITE-26532
> Project: Ignite
> Issue Type: Task
> Reporter: Alexander Lapin
> Assignee: Vladislav Pyatkov
> Priority: Major
> Labels: ignite-3
>
> h3. Motivation
> In case of
> # loss of majority in *MG* only
> # loss of majority in *CMG* only
> # loss of majority in both *CMG* and *MG*
> user operations should behave adequately: within the specified timeouts they attempt
> to wait for majority restoration, and if it does not happen, they fail with a
> clear error. At the same time, they do not flood the logs with tons of
> exceptions on every internal retry.
> We are talking about operations such as:
> * Schema changes (e.g., creating a table).
> * Transactions of all types (with partially applied transactions being
> rolled back).
> * Adding nodes.
> * Various {{resetPartitions}}.
> * …
> At the same time, user operations such as
> * stopping a node, and
> * read-only transactions (as in the past)
> must complete successfully without exceptions being logged.
> Internal _system_ operations must wait indefinitely for the restoration of
> majority in the corresponding system groups (whether via infinite retry or
> reactively), and under no circumstances should they trigger FG (which is what
> happens now).
> A node should log reasonably little about the unavailability of a system
> group, not as excessively as it currently does.
> Cancellation operations (rollback, abort, etc.) should, whenever possible,
> work even in the absence of CMG/MG. This needs to be verified separately,
> since it’s unclear if we can guarantee it for everything.
> When CMG/MG is restored, the cluster should return to normal operability.
> h3. Definition of Done
> Design document that addresses aforementioned questions is ready.