How about re-launching the app from the same location? If at all they want to store the state we can provide savepoint feature.
On Tue, Sep 20, 2016 at 4:39 AM Tushar Gosavi <tus...@datatorrent.com> wrote: > We have observed that application relaunch takes long time. > The one major reason for delay in application startup during relaunch > is time taken to copy state of exisitng application to new application. > This state could grow in GBs and copy is performed in single thread before > new application is submitted to Yarn. > > The state of previous application constists > - jars > - stram checkpoint/recovery file. > - events > - container file > - stats recording if enabled. > - operator checkpoints > - operator data. > > We could avoid copying debugging data like stat recording which could > run in TB for long > running application and is not required for functioning of new application. > > Similarly operator checkpoints could be read in parallel when they are > launched for first time, > This will also help in copying only required checkpoints and will be > done in parallel > by multiple containers/threads. > > For operator data stored in application directory, we could copy it > completely for now, but > in future we could provide an callback which will allow operator > partition to read only > required state from previous location. > > let me know your though on this. > > Regards, > - Tushar. >