GitHub user ericvandenbergfb opened a pull request:
https://github.com/apache/spark/pull/18791
[SPARK-21571][WEB UI] Spark history server leaves incomplete or unreadable
history files around forever.
Fix logic:
1. checkForLogs excluded 0-size files, so they stuck around forever.
2. checkForLogs / mergeApplicationListing indefinitely ignored files that
were not parseable or from which an app ID could not be extracted, so they
stuck around forever.
The above logic is only applied if spark.history.fs.cleaner.aggressive=true.
Also fixed a race condition in an existing test ("SPARK-3697: ignore files
that cannot be read") where mergeApplicationListing could be invoked more
than once, because FsHistoryProvider would spin up an executor that also
calls checkForLogs in parallel with the test unless spark.testing=true is
configured.
Added a unit test covering all cases with both aggressive and non-aggressive
cleanup logic.
## What changes were proposed in this pull request?
The Spark history server doesn't clean up certain history files outside the
retention window, leading to thousands of such files lingering on our
servers. The log checking and cleanup logic skipped 0-byte files and expired
in-progress or complete history files that weren't properly parseable (e.g.
no app ID could be extracted). These files most likely resulted from aborted
jobs or earlier Spark / file system driver bugs. To mitigate this,
FsHistoryProvider.checkForLogs now internally identifies these untracked
files and removes them once they expire outside the cleaner retention
window.
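The retention check described above can be sketched as follows. This is an
illustrative model only, not the actual FsHistoryProvider code; the function
name and parameters are hypothetical:

```python
def should_clean_untracked(mtime_ms, now_ms, max_age_ms, aggressive):
    """Decide whether an untracked history file (0-byte, or one from which
    no app ID could be extracted) should be deleted.

    Untracked files are only removed when the aggressive cleaner is enabled
    AND the file has aged out of the retention window; otherwise the default
    behavior (leave the file alone) is preserved.
    """
    return aggressive and (now_ms - mtime_ms) > max_age_ms


# An old untracked file is removed only under the aggressive setting:
print(should_clean_untracked(0, 10_000, 5_000, aggressive=True))   # True
print(should_clean_untracked(0, 10_000, 5_000, aggressive=False))  # False
# A file still inside the retention window is kept either way:
print(should_clean_untracked(8_000, 10_000, 5_000, aggressive=True))  # False
```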
This behavior is controlled via the configuration
spark.history.fs.cleaner.aggressive=true, which enables the more aggressive
cleaning.
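For example, the new flag would be enabled in spark-defaults.conf alongside
the existing cleaner settings before starting the history server (the
aggressive property is the one proposed by this patch; the first two are
existing Spark properties):

```
spark.history.fs.cleaner.enabled     true
spark.history.fs.cleaner.maxAge      7d
spark.history.fs.cleaner.aggressive  true
```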
## How was this patch tested?
Implemented a unit test that exercises the above cases both with and without
aggressive cleaning to ensure correct results in all cases. Note that
FsHistoryProvider in one place uses the file system to get the current time
and in others the local system time; this seems inconsistent and possibly
buggy, but I did not attempt to fix it in this commit. I had to change the
method FsHistoryProvider.getNewLastScanTime() so the test could properly
mock the clock.
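Making the scan time injectable is a standard way to get a deterministic
clock in tests. A minimal sketch of the pattern, with illustrative class and
method names (not Spark's actual API):

```python
import time


class Provider:
    """Toy stand-in for a log provider whose clock is injectable."""

    def __init__(self, clock=None):
        # Default to the real system clock (ms); tests pass a fake one.
        self._clock = clock or (lambda: int(time.time() * 1000))

    def get_new_last_scan_time(self):
        # The injected clock makes scan timestamps deterministic in tests.
        return self._clock()


class ManualClock:
    """Fake clock a test can advance explicitly."""

    def __init__(self, now_ms):
        self.now_ms = now_ms

    def __call__(self):
        return self.now_ms


# In a test, the manual clock replaces the wall clock:
clock = ManualClock(42_000)
provider = Provider(clock=clock)
print(provider.get_new_last_scan_time())  # 42000
clock.now_ms += 1_000
print(provider.get_new_last_scan_time())  # 43000
```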
Also ran a history server and touched some files to verify they were
properly removed.
ericvandenberg@localhost /tmp/spark-events % ls -la
total 808K
drwxr-xr-x 8 ericvandenberg 272 Jul 31 18:22 .
drwxrwxrwt 127 root
-rw-r--r-- 1 ericvandenberg 0 Jan 1 2016 local-123.inprogress
-rwxr-x--- 1 ericvandenberg 342K Jan 1 2016 local-1501549952084
-rwxrwx--- 1 ericvandenberg 342K Jan 1 2016 local-1501549952084.inprogress
-rwxrwx--- 1 ericvandenberg 59K Jul 31 18:19 local-1501550073208
-rwxrwx--- 1 ericvandenberg 59K Jul 31 18:21 local-1501550473508.inprogress
-rw-r--r-- 1 ericvandenberg 0 Jan 1 2016 local-234
Observed in history server logs:
17/07/31 18:23:52 INFO FsHistoryProvider: Aggressively cleaned up 4 untracked history files.
ericvandenberg@localhost /tmp/spark-events % ls -la
total 120K
drwxr-xr-x 4 ericvandenberg 136 Jul 31 18:24 .
drwxrwxrwt 127 root 4.3K Jul 31 18:07 ..
-rwxrwx--- 1 ericvandenberg 59K Jul 31 18:19 local-1501550073208
-rwxrwx--- 1 ericvandenberg 59K Jul 31 18:22 local-1501550473508
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/ericvandenbergfb/spark cleanup.untracked.history.files
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/18791.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #18791
----
commit c52b1cfd2eee9c881267d3d4cd9ea83fb6a767eb
Author: Eric Vandenberg <[email protected]>
Date: 2017-07-31T22:02:54Z
[SPARK-21571][WEB UI] Spark history server leaves incomplete or unreadable
history files around forever.
Fix logic
1. checkForLogs excluded 0-size files so they stuck around forever.
2. checkForLogs / mergeApplicationListing indefinitely ignored files
that were not parseable/couldn't extract an appID, so they stuck around
forever.
Only apply above logic if spark.history.fs.cleaner.aggressive=true.
Fixed race condition in a test (SPARK-3697: ignore files that cannot be
read.) where the number of mergeApplicationListings could be more than 1
since the FsHistoryProvider would spin up an executor that also calls
checkForLogs in parallel with the test.
Added unit test to cover all cases with aggressive and non-aggressive
clean up logic.
----
---