Hi All,

The PR https://github.com/apache/apex-core/pull/422 to solve this issue looks good to me. If there are no other comments, I will merge it soon.
~ Bhupesh

_______________________________________________________

Bhupesh Chawda
E: [email protected] | Twitter: @bhupeshsc
www.datatorrent.com | apex.apache.org

On Wed, Sep 21, 2016 at 8:16 PM, Sandesh Hegde <[email protected]> wrote:

> Relaunching from the same location can be one of the options.
>
> On Tue, Sep 20, 2016, 10:17 PM Tushar Gosavi <[email protected]> wrote:
>
> > In case of application failure, we would like to have the ability to
> > quickly restart the application while keeping the old state for
> > failure analysis. Also, the problem remains the same when we want to
> > start from a savepoint, where we will need to copy state from the
> > savepoint to the application.
> >
> > -Tushar.
> >
> > On Tue, Sep 20, 2016 at 8:34 PM, Sandesh Hegde <[email protected]> wrote:
> >
> > > How about re-launching the app from the same location?
> > >
> > > If at all they want to store the state, we can provide a savepoint
> > > feature.
> > >
> > > On Tue, Sep 20, 2016 at 4:39 AM Tushar Gosavi <[email protected]> wrote:
> > >
> > >> We have observed that application relaunch takes a long time.
> > >> One major reason for the delay in application startup during relaunch
> > >> is the time taken to copy the state of the existing application to the
> > >> new application. This state could grow to GBs, and the copy is
> > >> performed in a single thread before the new application is submitted
> > >> to YARN.
> > >>
> > >> The state of the previous application consists of:
> > >> - jars
> > >> - stram checkpoint/recovery file
> > >> - events
> > >> - container file
> > >> - stats recording, if enabled
> > >> - operator checkpoints
> > >> - operator data
> > >>
> > >> We could avoid copying debugging data like stats recording, which
> > >> could run to TBs for a long-running application and is not required
> > >> for the functioning of the new application.
> > >>
> > >> Similarly, operator checkpoints could be read in parallel when they
> > >> are launched for the first time. This will also help in copying only
> > >> the required checkpoints, and the copy will be done in parallel by
> > >> multiple containers/threads.
> > >>
> > >> For operator data stored in the application directory, we could copy
> > >> it completely for now, but in future we could provide a callback
> > >> which will allow an operator partition to read only the required
> > >> state from the previous location.
> > >>
> > >> Let me know your thoughts on this.
> > >>
> > >> Regards,
> > >> - Tushar.
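
For illustration only, the idea in the original proposal (skip debugging data such as stats recordings, and copy the remaining state with several threads instead of one) could be sketched roughly as below. This is a minimal Python sketch on a local filesystem, not Apex or HDFS code. The directory name `stats`, the helper names `files_to_copy` and `copy_state`, and the worker count are all hypothetical.

```python
import shutil
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

# Hypothetical top-level directories holding debugging data that the
# relaunched application does not need (an assumption, not the Apex layout).
DEFAULT_SKIP_DIRS = frozenset({"stats"})


def files_to_copy(src_root, skip_dirs=DEFAULT_SKIP_DIRS):
    """Return relative paths of files under src_root, leaving out skipped directories."""
    selected = []
    for path in sorted(src_root.rglob("*")):
        rel = path.relative_to(src_root)
        # rel.parts[0] is the top-level entry the file lives under.
        if path.is_file() and rel.parts[0] not in skip_dirs:
            selected.append(rel)
    return selected


def copy_state(src_root, dst_root, skip_dirs=DEFAULT_SKIP_DIRS, workers=4):
    """Copy the selected files using a thread pool; return how many were copied."""

    def copy_one(rel):
        target = dst_root / rel
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src_root / rel, target)

    rels = files_to_copy(src_root, skip_dirs)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # list() drains the iterator so any copy error is raised here.
        list(pool.map(copy_one, rels))
    return len(rels)
```

A call such as `copy_state(Path("old_app"), Path("new_app"))` would copy everything except the skipped directories. Reading operator checkpoints lazily from the previous location, as also suggested in the thread, is a separate step that this sketch does not cover.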
