Regarding PME:

I think certain kinds of exceptions could be caught in
'GridDhtPartitionsExchangeFuture' so that the coordinator can be
notified.
For example, if an exception occurs during cache or index (or
something else) creation on some node, that node can notify the
coordinator, and the coordinator should be able to decide how to
handle the situation. AFAIK a 'DynamicCacheChangeFailureMessage' is
sent in case of some kinds of errors.
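
To make the idea more concrete, here is a minimal, self-contained sketch
(all class and method names below are hypothetical stand-ins for
illustration only, not the actual 'GridDhtPartitionsExchangeFuture' or
'DynamicCacheChangeFailureMessage' code): instead of leaving the exchange
future uncompleted, the failing node reports the error to the coordinator.

public class ExchangeFailureSketch {

    /** Hypothetical stand-in for the message a failed node would send. */
    static class ExchangeFailureNotification {
        final String nodeId;
        final Throwable cause;

        ExchangeFailureNotification(String nodeId, Throwable cause) {
            this.nodeId = nodeId;
            this.cause = cause;
        }
    }

    /** Hypothetical coordinator-side hook deciding how to react. */
    interface Coordinator {
        void onExchangeFailure(ExchangeFailureNotification notification);
    }

    /** Simplified local exchange step (cache start, index creation, etc.). */
    static void runLocalExchangeStep(String nodeId, Runnable step, Coordinator coordinator) {
        try {
            step.run();
        }
        catch (Exception e) {
            // Instead of leaving the exchange future uncompleted (and PME
            // hanging cluster-wide), report the failure so the coordinator
            // can roll back the change or exclude the node.
            coordinator.onExchangeFailure(new ExchangeFailureNotification(nodeId, e));
        }
    }

    public static void main(String[] args) {
        Coordinator coordinator = n -> System.out.println(
            "Coordinator notified: node " + n.nodeId + " failed: " + n.cause);

        runLocalExchangeStep("node-1",
            () -> { throw new IllegalStateException("index creation failed"); },
            coordinator);
    }
}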

It would be great if most of the actions managed by PME were handled
in the same manner. To do this, we could define a set of exception
types and rules on how to handle them, so that the coordinator would
be able to complete a hung PME across the cluster.
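
A rough sketch of such rules (again, everything below is a made-up
illustration, not an existing Ignite API): a mapping from exception type
to the recovery action the coordinator applies, so that a failed exchange
is rolled back or the offending node is excluded instead of the whole
cluster hanging.

import java.util.LinkedHashMap;
import java.util.Map;

public class ExchangeFailurePolicySketch {

    /** Possible coordinator reactions to a reported exchange failure. */
    enum FailureAction {
        ROLLBACK_TO_PREVIOUS_TOPOLOGY,  // e.g. cache creation failed: undo the change
        EXCLUDE_FAILED_NODE,            // e.g. local I/O error on the joining node
        IGNORE_WITH_WARNING             // e.g. non-critical error from a pluggable SPI
    }

    /** Ordered rules: the first matching exception type wins. */
    private final Map<Class<? extends Throwable>, FailureAction> rules = new LinkedHashMap<>();

    void addRule(Class<? extends Throwable> type, FailureAction action) {
        rules.put(type, action);
    }

    FailureAction resolve(Throwable error) {
        for (Map.Entry<Class<? extends Throwable>, FailureAction> e : rules.entrySet()) {
            if (e.getKey().isInstance(error))
                return e.getValue();
        }

        // Unknown failures keep the conservative behaviour by default.
        return FailureAction.ROLLBACK_TO_PREVIOUS_TOPOLOGY;
    }

    public static void main(String[] args) {
        ExchangeFailurePolicySketch policy = new ExchangeFailurePolicySketch();

        policy.addRule(java.io.IOException.class, FailureAction.EXCLUDE_FAILED_NODE);
        policy.addRule(RuntimeException.class, FailureAction.IGNORE_WITH_WARNING);

        System.out.println(policy.resolve(new java.io.IOException("disk failure")));
    }
}
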
On Tue, Dec 4, 2018 at 8:01 PM Ilya Kasnacheev
<ilya.kasnach...@gmail.com> wrote:
>
> Hello!
>
> Currently, Apache Ignite is mostly written in "fail fast" fashion.
>
> All of the Apache Ignite codebase is assumed to have no bugs. When an
> unexpected exception happens, it will be printed to the log and will
> usually *leave the current operation hanging forever*. This is useful to
> developers since they can spot problems right away, and to users since
> they can avoid further data loss, but it often leads to *loss of the data
> involved in the current operation*.
>
> The most notorious case of such errors is hanging PMEs. When PME handling
> on a single node in the cluster results in an exception, the whole cluster
> will hang forever until this node is killed. I guess you can observe data
> loss after exceptions during rebalance. You can also have various
> operations hanging once a remote node throws an unexpected exception.
>
> Most recently, I have been trying to fix IgniteErrorOnRebalanceTest in
> IGNITE-9842. It tests an exception thrown from IndexingSpi; it has never
> worked, and it leads to silent data loss. Once baseline is introduced, it
> will lead to a hanging PME.
>
> Should we fight this problem? The IndexingSpi implementation is external to
> us, so we should either:
> A. Catch any exception that it may throw. If it was thrown during
> rebalance, ignore it with a warning to avoid data loss.
> B. Assume that it never throws exceptions (or that it will only throw
> IgniteSpiException). As soon as any exception is thrown, the behavior of
> the cluster is undefined. This is the current behavior.
>
> *Are we ready to make a leap from B to A?*
>
> Note that currently, if an exception is thrown from IndexingSpi, an
> operation will fail mid-flight, meaning that part of the data could be
> updated while the rest was not. It is possible, for example, that an entry
> was added to the cache but not indexed by SQL. We will need to be able to
> roll back any operation when an error occurs.
>
> With regard to PME:
> - When there is an exception during exchange, we should be able to switch
> back to the previous topology version on all nodes.
> - That means the node which was trying to join is kicked from the topology
> (and not the node on which the exception was thrown). Or the cache is not
> created, or not destroyed, etc.
> - There is no data loss, since all existing nodes happily continue to work
> on the old topology version and the new node did not have any data yet.
>
> Basically, every remote operation should be guarded with a fallback where a
> message is sent to the caller when the operation does not succeed. This
> would mean that no operation ever hangs.
>
> WDYT?
>
> Regards,
> --
> Ilya Kasnacheev



-- 
Best Regards, Vyacheslav D.
