Can't the commitedWindowId be calculated by looking at the physical plan and the existing checkpoints?
On Wed, Mar 1, 2017 at 5:34 AM, Tushar Gosavi <tus...@apache.org> wrote: > Help Needed for APEXCORE-619 > > Issue : When application is relaunched after long time with stateless > opeartors at the end of the DAG, the stateless operators starts with a very > high windowId. In this case the stateless operator ignors all the data > received till upstream operator catches up with it. This breaks the > *at-least-once* gaurantee while relaunch of the opeartor or when master is > killed and application is restarted. > > Solutions: > - Fix windowId for stateless leaf operators from upstream opeartor. But it > has some issues when we have a join with two upstrams operators at > different windowId. If we set the windowID to min(upstream windowId), then > we need to again recalulate the new recovery window ids for upstream paths > from this operators. > > - Other solution is to create a empty file in checkpoint directory for > stateless operators. This will help us to identify the checkpoints of > stateless operators during relaunch instead of computing from latest > timestamp. > > - Bring the entire DAG to committedWindowId. This could be achived using > writing committedWindowId in a journal. we need to make sure that we are > not puring the checkpointed state until the committedWundowId is saved in > journal. > > Let me know your thoughs on this and preferred solution. > > Regards, > -Tushar. >