Help Needed for APEXCORE-619

Issue : When application is relaunched after long time with stateless
opeartors at the end of the DAG, the stateless operators starts with a very
high windowId. In this case the stateless operator ignors all the data
received till upstream operator catches up with it. This breaks the
*at-least-once* gaurantee while relaunch of the opeartor or when master is
killed and application is restarted.

Solutions:
- Fix windowId for stateless leaf operators from upstream opeartor. But it
has some issues when we have a join with two upstrams operators at
different windowId. If we set the windowID to min(upstream windowId), then
we need to again recalulate the new recovery window ids for upstream paths
from this operators.

- Other solution is to create a empty file in checkpoint directory for
stateless operators. This will help us to identify the checkpoints of
stateless operators during relaunch instead of computing from latest
timestamp.

- Bring the entire DAG to committedWindowId. This could be achived using
writing committedWindowId in a journal. we need to make sure that we are
not puring the checkpointed state until the committedWundowId is saved in
journal.

Let me know your thoughs on this and preferred solution.

Regards,
-Tushar.

Reply via email to