Exchange stucks while node restoring state from WAL

Maxim Muzafarov Thu, 02 Aug 2018 23:44:23 -0700

Hi Igniters,


I'm working on bug [1] and have some questions about the final
implementation. I have probably already found answers to some of
them, but I want to be sure. Please help me clarify the details.


The key problem here is that we read the WAL and restore the
memory state of a newly joined node inside PME. Reading the WAL
can take a huge amount of time, so the whole cluster gets stuck
and waits for the single node.
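
To make the problem concrete, here is a toy model of it. This is
not Ignite code: the class, the method and all the numbers are
made up for illustration. The only assumption is that the exchange
completes when its slowest participant finishes.

```java
import java.util.List;

/** Toy model (hypothetical, not Ignite code) of how long an exchange stays open. */
public class ExchangeStallSketch {
    /** The exchange finishes only when the slowest participant has finished. */
    public static long exchangeDuration(List<Long> perNodeWorkMs) {
        long max = 0;
        for (long ms : perNodeWorkMs)
            max = Math.max(max, ms);
        return max;
    }

    public static void main(String[] args) {
        long normalMs = 50;           // made-up cost of ordinary exchange work
        long walRestoreMs = 600_000;  // made-up cost of reading WAL on the joining node

        // WAL restore done inside the exchange: every node waits for the joining node.
        long inside = exchangeDuration(List.of(normalMs, normalMs, normalMs + walRestoreMs));

        // WAL restore done before the exchange starts: only the joining node pays for it.
        long before = exchangeDuration(List.of(normalMs, normalMs, normalMs));

        System.out.println("restore inside exchange: cluster blocked for " + inside + " ms");
        System.out.println("restore before exchange: cluster blocked for " + before + " ms");
    }
}
```

In this model, moving the restore out of the exchange does not make
the joining node start faster; it only stops the other nodes from
waiting on it.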


1) Is it correct that readMetastore() happens after the node starts
but before the node is included into the ring?

2) Once the onDone() method has been called for LocalJoinFuture on
the local node, can we proceed with initiating PME on the local node?

3) After reading the checkpoint and restoring memory for a newly
joined node, how and when do we update obsolete partition update
counters? During historical rebalance, right?

4) Should we call restoreMemory for a newly joined node before PME
is initiated on the other nodes in the cluster?

5) In our final solution, should readMetastore and restoreMemory be
performed in one step for a newly joined node?


[1] https://issues.apache.org/jira/browse/IGNITE-7196
-- 
Maxim Muzafarov
