HeartSaVioR commented on issue #25577: [WIP][CORE][SPARK-28867] InMemoryStore 
checkpoint to speed up replay log file in HistoryServer
URL: https://github.com/apache/spark/pull/25577#issuecomment-532855117
 
 
   > One of the issues I hit in the past is that it cannot render the UI for a 
long-running Spark application because replaying events takes too long. For 
example, if you have a streaming query running for 7 days, the event logs will 
be huge and it may take SHS several days to replay the events. If we can take a 
snapshot in the driver, the number of events SHS needs to replay will be small.
   
   I guess SPARK-28594 would be the preferred approach for streaming queries in 
the end. I agree that the issue you describe is real, but retaining a single 
huge event log file is itself a challenge.
   
   With SPARK-28867, Spark only needs to rely on the events seen by 
AppStatusListener (assuming events arrive in the same order at 
EventLoggingListener and AppStatusListener). SPARK-28594, by contrast, has to 
take the snapshot from EventLoggingListener, which doesn't know about 
AppStatusListener, so taking a snapshot in the driver would be harder (the two 
would have to be kept in sync). That is the advantage of SPARK-28867, but it 
still doesn't address the major issue.
   
   > For example, we can have two maps in InMemoryStore. First, we write to 
one map. When flushing out, we freeze the current map and new updates go to the 
other one. We can write out the frozen map asynchronously, and any query going 
to InMemoryStore can just check both maps. After flushing out, we add all the 
items in the backup map to the frozen map and re-activate it. The number of 
items to copy here should be small.
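   To make the quoted proposal concrete, here is a minimal sketch of the 
two-map scheme as I read it. `TwoMapStore`, `beginFlush` and `endFlush` are 
hypothetical names for illustration and are not part of the actual 
InMemoryStore API; the sketch also assumes the asynchronous writer has finished 
with the frozen map before `endFlush` is called.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the quoted two-map idea; not the real InMemoryStore.
class TwoMapStore<K, V> {
    private final Map<K, V> primary = new HashMap<>(); // receives writes outside a flush
    private final Map<K, V> backup = new HashMap<>();  // receives writes during a flush
    private boolean flushing = false;

    synchronized void write(K key, V value) {
        (flushing ? backup : primary).put(key, value);
    }

    // Queries check both maps; the backup map holds the newest entries during a flush.
    synchronized V read(K key) {
        V v = backup.get(key);
        return v != null ? v : primary.get(key);
    }

    // Freeze the primary map and return a read-only view for the asynchronous writer.
    synchronized Map<K, V> beginFlush() {
        flushing = true;
        return Collections.unmodifiableMap(primary);
    }

    // Once the snapshot is written, merge the backup entries and re-activate the primary map.
    synchronized void endFlush() {
        primary.putAll(backup);
        backup.clear();
        flushing = false;
    }
}
```

   Between `beginFlush` and `endFlush`, new keys land only in the backup map, 
so the frozen view stays stable as long as every write is a fresh key.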
   
   This seems to assume there are only "appends", but in reality there are 
also "updates". Updating an existing object needs special care, and the design 
has to choose one of: 1) simply cloning all events; 2) copying the map and 
cloning an object whenever it is updated; 3) making updates synchronous.
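   As a rough illustration of option 2, the sketch below clones an object the 
first time it is updated during a flush, so the frozen map never changes under 
the snapshot writer. `TaskInfo`, `CopyOnWriteStore` and their methods are 
hypothetical names made up for this example, not Spark classes, and the sketch 
assumes the writer is done with the frozen view before `endFlush` runs.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a mutable entity kept in the store.
class TaskInfo {
    final long id;
    String status;
    TaskInfo(long id, String status) { this.id = id; this.status = status; }
    TaskInfo copy() { return new TaskInfo(id, status); }
}

// Hypothetical two-map store that clones on update while a flush is in progress.
class CopyOnWriteStore {
    private final Map<Long, TaskInfo> primary = new HashMap<>();
    private final Map<Long, TaskInfo> backup = new HashMap<>();
    private boolean flushing = false;

    synchronized void put(TaskInfo t) {
        (flushing ? backup : primary).put(t.id, t);
    }

    synchronized void updateStatus(long id, String status) {
        if (!flushing) {
            // No snapshot in progress: mutating in place is safe.
            TaskInfo t = primary.get(id);
            if (t != null) t.status = status;
            return;
        }
        // Snapshot in progress: never touch the frozen object; update a clone instead.
        TaskInfo t = backup.get(id);
        if (t == null) {
            TaskInfo frozen = primary.get(id);
            if (frozen == null) return; // unknown id, nothing to update
            t = frozen.copy();
            backup.put(id, t);
        }
        t.status = status;
    }

    synchronized TaskInfo read(long id) {
        TaskInfo t = backup.get(id);
        return t != null ? t : primary.get(id);
    }

    synchronized Map<Long, TaskInfo> beginFlush() {
        flushing = true;
        return Collections.unmodifiableMap(primary);
    }

    // Merging replaces frozen objects with their updated clones.
    synchronized void endFlush() {
        primary.putAll(backup);
        backup.clear();
        flushing = false;
    }
}
```

   Without the clone, `updateStatus` would mutate the very object the writer is 
serializing. The price is that the number of items copied back in `endFlush` 
grows with the number of objects updated during the flush, not only with the 
number of new ones.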

