HeartSaVioR commented on issue #25577: [WIP][CORE][SPARK-28867] InMemoryStore checkpoint to speed up replay log file in HistoryServer
URL: https://github.com/apache/spark/pull/25577#issuecomment-532855117

> One of the issues I hit in the past is that it cannot render UI for a long-running spark application because replaying events takes too long. For example, if you have a streaming query running 7 days, the event logs will be huge and it may take SHS several days to replay events. If we can take snapshot in driver, the number of events need to replay in SHS will be small.

I guess SPARK-28594 would be the preferred approach for streaming queries in the end. I agree with you on the issue you describe, but retaining a huge single event log file is itself challenging.

Unlike SPARK-28867, where Spark only needs to rely on the events seen by AppStatusListener (assuming events arrive in the same order in EventLoggingListener and AppStatusListener), SPARK-28594 has to deal with a snapshot taken from EventLoggingListener, which doesn't know about AppStatusListener, so it would be harder to take a snapshot from the driver (the two would need to be synced up). That's an advantage of SPARK-28867, but it still doesn't deal with the major issue.

> For example, we can have two maps in InMemoryStore. Firstly, we write to one map. When flushing out, we freeze the current map and new updates go to the other one. We can write out the frozen map asynchronously and any query going to InMemoryStore can just check both two maps. Then flushing them out, we add all the items in backup map to the frozen map and re-activate it. The number of items to copy here should be small.

This seems to assume there are only "appends", but in reality there are also "updates". Updating an existing object will require special care, and the approach needs to choose one of:

1) simply cloning all events
2) copying the map and cloning an object whenever it is updated
3) letting updates be synchronous
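
As a rough illustration of option 2, here is a minimal Python sketch of a copy-on-write variant of the two-map idea. It is not Spark code: `TwoMapStore` and all of its method names are hypothetical, plain dicts stand in for the stored entities, and the asynchronous writer is left out. The point is only that an update arriving while a snapshot is in progress has to clone the object into the active map instead of mutating the copy owned by the frozen map.

```python
import copy


class TwoMapStore:
    """Hypothetical two-map store; an illustration, not Spark's InMemoryStore."""

    def __init__(self):
        self.active = {}    # receives all writes
        self.frozen = None  # read-only while a snapshot is being written out

    def put(self, key, value):
        self.active[key] = value

    def freeze(self):
        # Start a snapshot: the current map becomes read-only,
        # and new writes go to a fresh, empty map.
        self.frozen, self.active = self.active, {}

    def read(self, key):
        # Newest value wins: check the active map first, then the frozen one.
        if key in self.active:
            return self.active[key]
        if self.frozen is not None and key in self.frozen:
            return self.frozen[key]
        raise KeyError(key)

    def update(self, key, field, value):
        # Copy-on-write: never mutate an object owned by the frozen map.
        if key not in self.active:
            self.active[key] = copy.deepcopy(self.read(key))
        self.active[key][field] = value

    def unfreeze(self):
        # Snapshot finished: fold everything written in the meantime back
        # into the frozen map and make it the active map again.
        merged = self.frozen
        merged.update(self.active)
        self.active, self.frozen = merged, None


store = TwoMapStore()
store.put("job-1", {"status": "RUNNING"})
store.freeze()                                   # snapshot starts
store.update("job-1", "status", "SUCCEEDED")     # update during the snapshot
snapshot_view = store.frozen["job-1"]["status"]  # what the snapshot writer sees
live_view = store.read("job-1")["status"]        # what a UI query sees
store.unfreeze()
```

In this sketch the snapshot writer still sees `RUNNING` while queries already see `SUCCEEDED`, so the snapshot stays consistent with the moment of the freeze. The cost is one clone per object updated during a snapshot, which is the trade-off between options 1 and 3.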
