squito commented on issue #25577: [WIP][CORE][SPARK-28867] InMemoryStore checkpoint to speed up replay log file in HistoryServer
URL: https://github.com/apache/spark/pull/25577#issuecomment-542319523

Hmm, now I'm very confused. I don't think anything needs to be done to speed up the replay of completed applications in the SHS. As long as you have the SHS configured to use local disk, after it parses the logs once, it'll just read the leveldb kvstore which will be very fast.

https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala#L341-L343

Is your goal to avoid having the SHS even parse the file one time? Trying to do it from the driver seems so much more complicated than having another dedicated process which *just* parses the eventlog files, and produces the leveldb kvstore. If you really wanted to do that, I'd have the driver just write out the leveldb kvstore when the application terminates. You could update AppStatusStore and SQLAppStatusStore to dump the contents of kvstore to another.

SPARK-28594 could also make the first parse faster, depending on the exact implementation in the end -- because the "first" parse of the completed application will only need to read the end of the event logs, as there was already a lot of parsing done of the logs of the incomplete application to produce the snapshot and rolled logs.

minor: can I ask you to use the terms "in-progress" vs "completed" applications? There are a few times in the discussion when you say "in-complete" which don't really seem to refer to "in-progress", and I'm not sure if that's a typo or my misunderstanding etc. (eg. the pr description seems to mostly focus on in-progress, so I'm surprised you're saying this is primarily for complete applications.)
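
For reference, a minimal sketch of the "SHS configured to use local disk" setup mentioned above, assuming the standard history server settings in spark-defaults.conf. Both directory values are placeholders and are not taken from this thread:

```properties
# Where the SHS looks for application event logs (placeholder location).
spark.history.fs.logDirectory   hdfs:///spark-event-logs

# Local directory for the on-disk kvstore cache (placeholder path).
# When set, the SHS keeps parsed application data on disk rather than
# only in memory, so an already-parsed log does not need a full replay.
spark.history.store.path        /var/lib/spark/history-store
```

With `spark.history.store.path` unset, the history server keeps application data in memory only, so the "parse once, then read the kvstore" behavior described in the comment would not carry over across restarts.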
