squito commented on issue #25577: [WIP][CORE][SPARK-28867] InMemoryStore 
checkpoint to speed up replay log file in HistoryServer
URL: https://github.com/apache/spark/pull/25577#issuecomment-542319523
 
 
   Hmm, now I'm very confused. I don't think anything needs to be done to
speed up the replay of completed applications in the SHS. As long as you have
the SHS configured to use local disk (`spark.history.store.path`), after it
parses the logs once, it'll just read the leveldb kvstore, which will be very
fast.
   
   
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala#L341-L343
   
   Is your goal to avoid having the SHS parse the file even once? Trying to
do that from the driver seems so much more complicated than having another
dedicated process which *just* parses the eventlog files and produces the
leveldb kvstore. If you really wanted to do it from the driver, I'd have the
driver just write out the leveldb kvstore when the application terminates. You
could update AppStatusStore and SQLAppStatusStore to dump the contents of one
kvstore to another.
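
To make the "dump the contents of one kvstore to another" idea concrete, here is a minimal sketch. It does not use Spark's actual kvstore classes: `SimpleStore`, `InMemorySimpleStore` and `dump` are hypothetical names invented for illustration, and a real change would go through AppStatusStore / SQLAppStatusStore and a disk-backed store instead of two in-memory maps.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.Map;

public class KvStoreDumpSketch {

    /** Hypothetical, minimal stand-in for a key-value store; NOT Spark's kvstore API. */
    interface SimpleStore {
        void write(String key, Object value);
        Object read(String key);
        Iterable<Map.Entry<String, Object>> entries();
        long count();
    }

    /** Map-backed implementation; here it plays both the live store and the dump target. */
    static final class InMemorySimpleStore implements SimpleStore {
        private final Map<String, Object> data = new LinkedHashMap<String, Object>();

        public void write(String key, Object value) { data.put(key, value); }

        public Object read(String key) { return data.get(key); }

        // Return a copy so callers can iterate while the store is being written to.
        public Iterable<Map.Entry<String, Object>> entries() {
            return new ArrayList<Map.Entry<String, Object>>(data.entrySet());
        }

        public long count() { return data.size(); }
    }

    /** Copies every entry from source into target and returns how many were copied. */
    static long dump(SimpleStore source, SimpleStore target) {
        long copied = 0;
        for (Map.Entry<String, Object> entry : source.entries()) {
            target.write(entry.getKey(), entry.getValue());
            copied++;
        }
        return copied;
    }

    public static void main(String[] args) {
        // Assumption: the driver would call something like this once, at application end.
        SimpleStore live = new InMemorySimpleStore();
        live.write("job/1", "SUCCEEDED");
        live.write("stage/1.0", "COMPLETE");

        SimpleStore persisted = new InMemorySimpleStore();
        long copied = dump(live, persisted);
        System.out.println("copied " + copied + " entries");
    }
}
```

The point of the sketch is only that the copy is a single pass over the live store at termination, with no event log parsing involved.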
   
   SPARK-28594 could also make the first parse faster, depending on the exact
implementation in the end, because the "first" parse of the completed
application would only need to read the end of the event logs: a lot of the
parsing was already done while the application was in progress, to produce the
snapshot and rolled logs.
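
As a rough illustration of why a snapshot would shorten that first parse, the sketch below replays only the events that come after the point a snapshot already covers. Everything here is hypothetical (`Snapshot`, `replay`, events modelled as plain strings, state modelled as per-type counts); it says nothing about how SPARK-28594 is or will be implemented.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TailReplaySketch {

    /** Hypothetical snapshot: aggregated state plus how many events it already covers. */
    static final class Snapshot {
        final Map<String, Integer> countsByType;
        final int eventsApplied;

        Snapshot(Map<String, Integer> countsByType, int eventsApplied) {
            this.countsByType = new HashMap<String, Integer>(countsByType);
            this.eventsApplied = eventsApplied;
        }
    }

    /** Empty starting state, used for a full replay from the beginning of the log. */
    static final Snapshot EMPTY = new Snapshot(new HashMap<String, Integer>(), 0);

    /** Applies only the events after base.eventsApplied and returns the updated snapshot. */
    static Snapshot replay(Snapshot base, List<String> events) {
        Map<String, Integer> counts = new HashMap<String, Integer>(base.countsByType);
        for (int i = base.eventsApplied; i < events.size(); i++) {
            counts.merge(events.get(i), 1, Integer::sum);
        }
        return new Snapshot(counts, events.size());
    }

    public static void main(String[] args) {
        List<String> log = Arrays.asList("JobStart", "TaskEnd", "TaskEnd", "JobEnd");

        // Snapshot taken while the application was still in progress (first 3 events).
        Snapshot inProgress = replay(EMPTY, log.subList(0, 3));

        // After completion, only the tail (the last event) still has to be read.
        Snapshot completed = replay(inProgress, log);
        System.out.println(completed.countsByType.equals(replay(EMPTY, log).countsByType));
    }
}
```

In this toy model the tail replay touches one event instead of four; with real event logs the saving would depend on how recent the last snapshot is.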
   
   minor: can I ask you to use the terms "in-progress" vs "completed"
applications? There are a few times in the discussion when you say
"in-complete" in a way that doesn't really seem to refer to "in-progress", and
I'm not sure if that's a typo or my misunderstanding etc. (e.g. the PR
description seems to mostly focus on in-progress applications, so I'm surprised
you're saying this is primarily for completed applications.)

