Ngone51 commented on issue #25577: [WIP][CORE][SPARK-28867] InMemoryStore 
checkpoint to speed up replay log file in HistoryServer
URL: https://github.com/apache/spark/pull/25577#issuecomment-533485141
 
 
   >  I'm just trying not to let both approaches be diverged (approach A for 
single event log, approach B for rolling event log).
   
   I remember that in SPARK-28594, `EventLoggingListener` would generate 
multiple log files in a format like:
   
   `[sequenceId-1, fileSize-1] [sequenceId-2, fileSize-2] ... [sequenceId-n, 
fileSize-n]`
   
   And measuring the size of the file currently being written could be a 
tricky part of rolling the log file.
   
   What if we also use the number of processed events to roll the log file? 
For example, we configure it to roll the log file every 1000 events. Then 
we'll have multiple log files in a format like:
   
   `[sequenceId-1, 1000] [sequenceId-2, 1000] ... [sequenceId-n, 123]`
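   
   To make the count-based rolling idea concrete, here is a minimal Python 
sketch. It is not code from Spark or from either PR: `CountBasedRoller`, 
`max_events` and `on_event` are hypothetical names, and the sketch only tracks 
counters rather than writing any files.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class CountBasedRoller:
    """Hypothetical roller: starts a new log file every `max_events` events."""
    max_events: int
    sequence_id: int = 1
    events_in_current: int = 0
    closed_files: List[Tuple[int, int]] = field(default_factory=list)

    def on_event(self) -> None:
        # Roll before recording the event if the current file is already full.
        if self.events_in_current == self.max_events:
            self.closed_files.append((self.sequence_id, self.events_in_current))
            self.sequence_id += 1
            self.events_in_current = 0
        self.events_in_current += 1

    def files(self) -> List[Tuple[int, int]]:
        # Closed files plus the file currently being written, if it is non-empty.
        current = []
        if self.events_in_current:
            current = [(self.sequence_id, self.events_in_current)]
        return self.closed_files + current


roller = CountBasedRoller(max_events=1000)
for _ in range(2123):
    roller.on_event()
print(roller.files())
```

   Because the roll decision depends only on a counter, the writer never has 
to measure the size of the file it is currently writing.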
   
   Then, we'll have two cases for integrating SPARK-28867 with SPARK-28594 
when snapshotting on the driver is enabled:
   
   1.  SHS replays event log files for a `completed` application
   
   In this case, SHS would first load [driver-snapshot, X-events-num]. Then it 
locates the event log file by the processed number X. For example, if X is 
1500, SHS would start replaying from file [sequenceId-2, 1000] (rather than 
[sequenceId-1, 1000]), skipping the first 500 events in file-2. The subsequent 
behavior would follow SPARK-28594's rolling mechanism, which is described in 
the design doc (e.g. snapshot in SHS).
   
   2. SHS replays event log files for an `incomplete` application
   
   In this case, a driver-snapshot would be generated continuously at every 
interval (e.g. every 1000 events). SHS could always first load the newest 
[driver-snapshot, X-events-num], then replay the event log files, and finally 
generate [SHS-snapshot, Y-events-num]. The next time the driver generates a 
newer [driver-snapshot, Z-events-num], SHS needs to decide which snapshot to 
load depending on whether Y > Z or Y < Z, and then repeat the replay steps. But 
if Z < Y, SHS needs to re-replay out-of-date event log files, which may already 
have been deleted. So, actually, I'd prefer not to snapshot in SHS and to 
always use the driver-snapshot in this case.
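   
   The file lookup described in case 1 can be sketched as follows. This is 
only an illustration in Python under the assumption that every file's event 
count is known up front; `locate_replay_start` is a hypothetical helper, not 
an existing SHS API.

```python
from typing import List, Optional, Tuple


def locate_replay_start(
    files: List[Tuple[int, int]], processed: int
) -> Optional[Tuple[int, int]]:
    """Return (sequence_id, events_to_skip) for the first file that still
    holds events not covered by the snapshot, or None if there is none.

    `files` is a list of (sequence_id, event_count) pairs in write order;
    `processed` is the X recorded alongside the driver snapshot.
    """
    remaining = processed
    for sequence_id, count in files:
        if remaining < count:
            # The snapshot ends inside this file: skip `remaining` events here.
            return sequence_id, remaining
        # The snapshot fully covers this file, so it need not be replayed.
        remaining -= count
    return None


files = [(1, 1000), (2, 1000), (3, 123)]
print(locate_replay_start(files, 1500))
```

   With X = 1500 the helper points at file 2 with 500 events to skip, matching 
the example above. Any file before the returned one is fully covered by the 
snapshot.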
   
   WDYT ? @HeartSaVioR 
   
   There may be a way to integrate SPARK-28867 with SPARK-28594, but I think 
it's fine for us to focus on SPARK-28594 for now. As mentioned above, they're 
separate issues indeed. So I think you, @HeartSaVioR, don't need to bring too 
many SPARK-28867 details into SPARK-28594. SPARK-28867 could introduce some 
adjustments to the finished SPARK-28594 later to make them compatible with 
each other.
   
   > This kind of discussion is ideal to be happening on design phase.
   
   I think we'll have a new design later that includes @zsxwing's idea about 
two maps in `InMemoryStore`, the way to work with SPARK-28594, and the way to 
accurately record the number of processed events. Personally, I didn't have a 
good design for this issue initially, but these discussions are making the 
design better and better.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services
