HeartSaVioR commented on issue #25577: [WIP][CORE][SPARK-28867] InMemoryStore checkpoint to speed up replay log file in HistoryServer
URL: https://github.com/apache/spark/pull/25577#issuecomment-533680034

One of the main goals of SPARK-28594 is limiting the overall size of the log directory per application. (End users have been concerned about it.) That means we should provide a way to roll the event log file at a deterministic size, which rolling the file by number of lines cannot provide. In a following patch I'll introduce a maximum number of files (the maximum file size is introduced in #25670) and clean up old event files by replacing them with a snapshot file. So it takes a snapshot for a different purpose, though it also helps faster reading.

Given that the two issues take snapshots for different purposes, I'm kind of OK with going with different approaches and consolidating them later (assuming the snapshot files are compatible).

One thing I'm concerned about is that we only talk about the new approach for the in-memory store, whereas Spark hides the KVStore implementation by wrapping it with ElementTrackingStore. The change should be reflected in the KVStore API so that the caller side can deal with snapshotting properly. (Now we only add some necessary methods to KVStore to snapshot from outside, but if we have both sync/async snapshot for KVStore, that should be reflected in the KVStore API.)

To add some context on this: previously (in internal review) I proposed snapshotting the underlying LevelDB for the LevelDB KVStore implementation (archiving the directory would just work), and it was suggested that I find a way to support snapshotting for all implementations of KVStore. That's why the current snapshot mechanism is based on the KVStore interface. As long as we respect the format of the snapshot file, sync and async snapshots would be compatible, but in the same spirit, ideally we should support both snapshot approaches smoothly at the KVStore interface level.
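
The interface-level point above can be sketched roughly as follows. This is a minimal Python illustration, not Spark's KVStore API: `SnapshotableStore`, `InMemoryStore`, `snapshot`, `snapshot_async`, `restore`, and the JSON file format are all hypothetical names and choices made for the example. It only shows that when both the sync and async paths are declared on the interface and write the same file format, a caller can use either one and restore from the result.

```python
import json
import os
import tempfile
import threading
from abc import ABC, abstractmethod
from concurrent.futures import Future, ThreadPoolExecutor


class SnapshotableStore(ABC):
    """Hypothetical KVStore-like interface (an assumption, not Spark's API)."""

    @abstractmethod
    def write(self, key, value): ...

    @abstractmethod
    def read(self, key): ...

    @abstractmethod
    def snapshot(self, path) -> str:
        """Sync snapshot: returns the path once the file is fully written."""

    @abstractmethod
    def snapshot_async(self, path) -> Future:
        """Async snapshot: returns a Future that yields the path when done."""

    @abstractmethod
    def restore(self, path): ...


class InMemoryStore(SnapshotableStore):
    # A single worker thread keeps background snapshot writes ordered.
    _executor = ThreadPoolExecutor(max_workers=1)

    def __init__(self):
        self._data = {}
        self._lock = threading.Lock()

    def write(self, key, value):
        with self._lock:
            self._data[key] = value

    def read(self, key):
        with self._lock:
            return self._data[key]

    def _copy(self):
        # Shallow point-in-time copy of the key/value mapping.
        with self._lock:
            return dict(self._data)

    @staticmethod
    def _dump(data, path):
        # Both snapshot paths share this writer, so the file format is identical.
        with open(path, "w", encoding="utf-8") as f:
            json.dump(data, f, sort_keys=True)
        return path

    def snapshot(self, path):
        return self._dump(self._copy(), path)

    def snapshot_async(self, path):
        # Copy at call time; only the file write happens in the background.
        frozen = self._copy()
        return self._executor.submit(self._dump, frozen, path)

    def restore(self, path):
        with open(path, encoding="utf-8") as f:
            loaded = json.load(f)
        with self._lock:
            self._data = loaded


if __name__ == "__main__":
    store = InMemoryStore()
    store.write("job-1", {"status": "SUCCEEDED"})
    with tempfile.TemporaryDirectory() as tmp:
        future = store.snapshot_async(os.path.join(tmp, "async.snapshot"))
        restored = InMemoryStore()
        restored.restore(future.result())
        print(restored.read("job-1"))
```

In this sketch the async variant copies the mapping at call time, so writes that replace a key after the call do not leak into the snapshot. A LevelDB-backed implementation could satisfy the same interface differently (for example by archiving its directory), which is why the contract here is expressed only in terms of the snapshot file.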
