We have observed that application relaunch takes a long time. One major reason for the delay in application startup during relaunch is the time taken to copy the state of the existing application to the new application. This state can grow to GBs, and the copy is performed in a single thread before the new application is submitted to YARN.
The state of the previous application consists of:
- jars
- stram checkpoint/recovery file
- events
- container file
- stats recordings, if enabled
- operator checkpoints
- operator data

We could avoid copying debugging data such as stats recordings, which can run into TBs for a long-running application and are not required for the new application to function. Similarly, operator checkpoints could be read in parallel by the operators when they are launched for the first time. This would also ensure that only the required checkpoints are copied, and the work would be spread across multiple containers/threads. For operator data stored in the application directory, we could copy it completely for now, but in the future we could provide a callback that allows an operator partition to read only the required state from the previous location.

Let me know your thoughts on this.

Regards,
- Tushar
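
P.S. To make the first two points a bit more concrete, here is a rough sketch of a copy that skips selected subdirectories and spreads the remaining files over a thread pool. It is only an illustration, not the existing relaunch code: the class name SelectiveStateCopy, the method copyState, and the "recordings" directory name are hypothetical, and it uses java.nio.file on a local file system where the real change would go through the Hadoop FileSystem API.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper: copies application state while skipping debug-only data.
public class SelectiveStateCopy {

    /**
     * Copies every regular file under oldAppDir into newAppDir, except files whose
     * top-level entry name (relative to oldAppDir) is listed in skipDirs.
     * Returns the number of files copied.
     */
    public static int copyState(Path oldAppDir, Path newAppDir, Set<String> skipDirs, int threads)
            throws IOException, InterruptedException, ExecutionException {
        // Collect the files to copy up front, filtering out skipped subtrees.
        List<Path> files;
        try (Stream<Path> walk = Files.walk(oldAppDir)) {
            files = walk.filter(Files::isRegularFile)
                    .filter(p -> !skipDirs.contains(oldAppDir.relativize(p).getName(0).toString()))
                    .collect(Collectors.toList());
        }

        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<?>> pending = new ArrayList<>();
            for (Path src : files) {
                Path dst = newAppDir.resolve(oldAppDir.relativize(src));
                pending.add(pool.submit(() -> copyOne(src, dst)));
            }
            // Wait for all copies; get() rethrows the first failure, if any.
            for (Future<?> f : pending) {
                f.get();
            }
        } finally {
            pool.shutdown();
        }
        return files.size();
    }

    // Copies a single file, creating parent directories as needed.
    private static void copyOne(Path src, Path dst) {
        try {
            Files.createDirectories(dst.getParent());
            Files.copy(src, dst, StandardCopyOption.REPLACE_EXISTING);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A call such as SelectiveStateCopy.copyState(oldDir, newDir, Collections.singleton("recordings"), 8) would copy everything except the assumed recordings directory using eight threads. Deferring operator checkpoints to the containers would amount to adding that directory to the skip set and letting each operator fetch its own checkpoint on first launch.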
