We have observed that application relaunch takes long time.
The one major reason for delay in application startup during relaunch
is time taken to copy state of exisitng application to new application.
This state could grow in GBs and copy is performed in single thread before
new application is submitted to Yarn.
The state of previous application constists
- stram checkpoint/recovery file.
- container file
- stats recording if enabled.
- operator checkpoints
- operator data.
We could avoid copying debugging data like stat recording which could
run in TB for long
running application and is not required for functioning of new application.
Similarly operator checkpoints could be read in parallel when they are
launched for first time,
This will also help in copying only required checkpoints and will be
done in parallel
by multiple containers/threads.
For operator data stored in application directory, we could copy it
completely for now, but
in future we could provide an callback which will allow operator
partition to read only
required state from previous location.
let me know your though on this.