Ngone51 commented on issue #25577: [WIP][CORE][SPARK-28867] InMemoryStore checkpoint to speed up replay log file in HistoryServer
URL: https://github.com/apache/spark/pull/25577#issuecomment-533485141

> I'm just trying not to let both approaches be diverged (approach A for single event log, approach B for rolling event log).

I remember that in SPARK-28594, `EventLoggingListener` would generate multiple log files in a format like:

`[sequenceId-1, fileSize-1] [sequenceId-2, fileSize-2] ... [sequenceId-n, fileSize-n]`

Measuring the size of the file currently being written could be a tricky part of rolling the log file. What if we also use the number of processed events to roll the log file? For example, if we configure the log to roll every 1000 events, we'll have multiple log files in a format like:

`[sequenceId-1, 1000] [sequenceId-2, 1000] ... [sequenceId-n, 123]`

Then we'll have two cases for integrating SPARK-28667 with SPARK-28594 when snapshotting on the driver is enabled:

1. SHS replays event log files for a `completed` application. In this case, SHS would first load [driver-snapshot, X-events-num] and then locate the event log file by the processed event count X. For example, if X is 1500, SHS would start replaying from file [sequenceId-2, 1000] (rather than [sequenceId-1, 1000]) and skip the first 500 events in file-2. The subsequent behavior would follow SPARK-28594's rolling mechanism as described in the design doc (e.g. snapshot in SHS).

2. SHS replays event log files for an `incomplete` application. In this case, a driver-snapshot would be generated continuously at every interval (e.g. every 1000 events). SHS could always first load the newest [driver-snapshot, X-events-num], then replay the event log file, and finally generate [SHS-snapshot, Y-events-num]. The next time the driver generates a newer [driver-snapshot, Z-events-num], SHS needs to decide which snapshot to load depending on whether Y > Z or Y < Z, and then repeat the replay steps.
But if Z < Y, SHS needs to re-replay an out-of-date event log file, which may already have been deleted. So I'd actually prefer not to snapshot in SHS and to always use the driver-snapshot in this case. WDYT? @HeartSaVioR

There may be a way to integrate SPARK-28667 with SPARK-28594, but I think it's fine for us to focus on SPARK-28594 for now. As mentioned above, they're indeed separate issues. So I think you, @HeartSaVioR, don't need to bring too many SPARK-28667 details into SPARK-28594. SPARK-28667 could later introduce some adjustments to the finished SPARK-28594 to make the two compatible with each other.

> This kind of discussion is ideal to be happening on design phase.

I think we'll later have a new design that includes @zsxwing's idea about two maps in `InMemoryStore`, the way to work with SPARK-28594, and the way to accurately record the number of processed events. Personally, I didn't have a good design for this issue initially, but these discussions keep making the design better.
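
To make the file lookup in case 1 concrete, here is a minimal sketch in Python rather than Spark code. It assumes the per-file event counts are known from the file names, as in the `[sequenceId-n, count]` format proposed above. The function name `locate_replay_start` and its parameters are hypothetical and exist only for illustration; nothing like this is in SPARK-28594 today.

```python
from typing import List, Tuple


def locate_replay_start(file_event_counts: List[int], processed: int) -> Tuple[int, int]:
    """Find where replay should resume after loading a snapshot.

    file_event_counts: events per rolled log file, in sequence order
                       (hypothetical input, e.g. [1000, 1000, 123]).
    processed:         X, the number of events already covered by the snapshot.

    Returns (1-based file sequence id, number of events to skip in that file).
    """
    if processed < 0:
        raise ValueError("processed event count must be non-negative")

    remaining = processed
    for sequence_id, count in enumerate(file_event_counts, start=1):
        if remaining < count:
            # The snapshot ends inside this file: skip `remaining` events here.
            return sequence_id, remaining
        remaining -= count

    if remaining > 0:
        # The snapshot claims more events than the listed files contain.
        raise ValueError("snapshot is ahead of the available event log files")

    # The snapshot covers every listed file exactly; resume at the next file.
    return len(file_event_counts) + 1, 0


# The example from the comment: X = 1500 with 1000 events per file.
print(locate_replay_start([1000, 1000, 123], 1500))
```

With X = 1500 this selects file 2 and skips its first 500 events, matching the example in case 1. If the files rolled by size instead, the per-file event counts would have to be recorded some other way, which is the tricky part mentioned above.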
