gaborgsomogyi commented on a change in pull request #27398: [SPARK-30481][DOCS][FOLLOWUP] Document event log compaction into new section of monitoring.md
URL: https://github.com/apache/spark/pull/27398#discussion_r380310264
 
 

 ##########
 File path: docs/monitoring.md
 ##########
 @@ -95,6 +95,49 @@ The history server can be configured as follows:
   </tr>
 </table>
 
+### Applying compaction on rolling event log files
+
+A long-running application (e.g. streaming) can bring a huge single event log file which may cost a lot to maintain and
+also requires a bunch of resource to replay per each update in Spark History Server.
+
+Enabling <code>spark.eventLog.rolling.enabled</code> and <code>spark.eventLog.rolling.maxFileSize</code> would
+let you have rolling event log files instead of single huge event log file which may help some scenarios on its own,
+but it still doesn't help you reducing the overall size of logs.
+
+Spark History Server can apply 'compaction' on the rolling event log files to reduce the overall size of
+logs, via setting the configuration <code>spark.history.fs.eventLog.rolling.maxFilesToRetain</code> on the
+Spark History Server.
+
+Details will be described below, but please note in prior that 'compaction' is LOSSY operation.
+'Compaction' will discard some events which will be no longer seen on UI - you may want to check which events will be discarded
+before enabling the option.
+
+When the compaction happens, the History Server lists all the available event log files for the application, and considers
+the event log files having less index than the file with smallest index which will be retained as target of compaction.
+For example, if the application A has 5 event log files and <code>spark.history.fs.eventLog.rolling.maxFilesToRetain</code> is set to 2, then first 3 log files will be selected to be compacted.
+
+Once it selects the target, it analyzes them to figure out which events can be excluded, and rewrites them
+into one compact file with discarding events which are decided to exclude.
+
+The compaction tries to exclude the events which point to the outdated things like jobs, and so on. As of now, below describes
 
 Review comment:
   Nit: s/events which point to the outdated things like jobs/events which point to outdated data
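
   For context, the file-selection rule in the quoted text (5 event log files with `maxFilesToRetain` set to 2 means the first 3 are compacted) can be sketched roughly as below. This is only an illustration of the documented behavior, not the History Server's actual code: the function name `select_compaction_targets` and its parameters are hypothetical, and it assumes the files are identified by integer indices.

```python
from typing import Iterable, List


def select_compaction_targets(file_indices: Iterable[int],
                              max_files_to_retain: int) -> List[int]:
    """Return the indices of event log files that would be compacted.

    Hypothetical helper: it mirrors the rule in the quoted docs, where every
    file with an index lower than the oldest retained file is a target.
    """
    if max_files_to_retain < 1:
        raise ValueError("max_files_to_retain must be at least 1")
    ordered = sorted(file_indices)
    # The newest `max_files_to_retain` files are kept as-is.
    cutoff = len(ordered) - max_files_to_retain
    # If there are not more files than the retain limit, nothing is compacted.
    return ordered[:max(cutoff, 0)]


# The example from the docs: 5 files, retain 2, so the first 3 are selected.
print(select_compaction_targets([1, 2, 3, 4, 5], 2))
```

   When the application has no more files than the retain limit, the sketch returns an empty list, meaning nothing would be compacted.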

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services