GitHub user ericvandenbergfb opened a pull request:
https://github.com/apache/spark/pull/19770
[SPARK-21571][WEB UI] Spark history server leaves incomplete or unreadable
logs around forever
## What changes were proposed in this pull request?
**Updated this pull request after other refactoring went into FsHistoryProvider.**
Fix logic:
1. checkForLogs excluded zero-size files, so they stuck around forever.
2. checkForLogs / mergeApplicationListing indefinitely ignored files that could not be parsed (no appId could be extracted), so they also stuck around forever.
The cleanup above is applied only when spark.history.fs.cleaner.aggressive=true; a sketch of the intended behavior follows this list.
Also fixed a race condition in an existing test (SPARK-3697: ignore files that cannot be read.) where the number of mergeApplicationListing calls could exceed 1, because FsHistoryProvider spins up an executor that calls checkForLogs in parallel with the test.
Added unit tests covering all cases for both aggressive and non-aggressive cleanup.
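For illustration, here is a minimal Scala sketch of the intended cleanup behavior. The class and member names (`UntrackedLogCleaner`, `isUntracked`, `maybeClean`, `maxAgeMs`) are placeholders rather than the actual FsHistoryProvider code; only the config key `spark.history.fs.cleaner.aggressive` comes from this PR.

```scala
// Illustrative sketch only; names other than the config key are hypothetical.
import org.apache.hadoop.fs.{FileStatus, FileSystem}

class UntrackedLogCleaner(fs: FileSystem, aggressive: Boolean, maxAgeMs: Long) {

  // A listing entry is "untracked" if it is empty (0 bytes) or no
  // application ID could be parsed out of it.
  def isUntracked(status: FileStatus, parsedAppId: Option[String]): Boolean =
    status.getLen == 0 || parsedAppId.isEmpty

  // Delete untracked files once they are older than maxAgeMs, but only when
  // aggressive cleanup is enabled; with the default (false) they are left
  // alone, which matches the previous behavior.
  def maybeClean(status: FileStatus, parsedAppId: Option[String], nowMs: Long): Unit = {
    if (aggressive && isUntracked(status, parsedAppId) &&
        nowMs - status.getModificationTime > maxAgeMs) {
      fs.delete(status.getPath, true)
    }
  }
}
```

The aggressive flag would be read from the provider's SparkConf, e.g. `conf.getBoolean("spark.history.fs.cleaner.aggressive", false)`, so existing deployments keep today's behavior unless they explicitly opt in.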
## How was this patch tested?
Added a test that extensively verifies untracked files are cleaned up when the aggressive cleaner is configured; a rough outline of its fixtures is below.
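As a rough outline, the kinds of files the new test exercises look like the following. The object and file names are illustrative; the real suite uses its own helpers to wire these files into an FsHistoryProvider and run its scan/clean cycle.

```scala
// Sketch of the test fixtures only; not the actual suite code.
import java.io.File
import java.nio.file.Files

object UntrackedLogFixtures {
  def main(args: Array[String]): Unit = {
    val logDir = Files.createTempDirectory("spark-history-test").toFile

    // Case 1: a zero-byte event log, previously skipped forever by checkForLogs.
    val emptyLog = new File(logDir, "app-empty")
    assert(emptyLog.createNewFile() && emptyLog.length() == 0)

    // Case 2: a non-empty file from which no appId can be parsed, previously
    // ignored forever by mergeApplicationListing.
    val corruptLog = new File(logDir, "app-corrupt")
    Files.write(corruptLog.toPath, "not a valid event log".getBytes("UTF-8"))

    // Expected outcome: with spark.history.fs.cleaner.aggressive=true both
    // files are removed once they exceed the max age; with the default
    // setting both are left in place.
    println(s"fixtures created under ${logDir.getAbsolutePath}")
  }
}
```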
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/ericvandenbergfb/spark master
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/19770.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #19770
----
commit c52b1cfd2eee9c881267d3d4cd9ea83fb6a767eb
Author: Eric Vandenberg <[email protected]>
Date: 2017-07-31T22:02:54Z
[SPARK-21571][WEB UI] Spark history server leaves incomplete or unreadable
history files
around forever.
Fix logic
1. checkForLogs excluded 0-size files so they stuck around forever.
2. checkForLogs / mergeApplicationListing indefinitely ignored files
that were not parseable/couldn't extract an appID, so they stuck around
forever.
Only apply above logic if spark.history.fs.cleaner.aggressive=true.
Fixed race condition in a test (SPARK-3697: ignore files that cannot be
read.) where the number of mergeApplicationListings could be more than 1
since the FsHistoryProvider would spin up an executor that also calls
checkForLogs in parallel with the test.
Added unit test to cover all cases with aggressive and non-aggressive
clean up logic.
commit 08ea4ace02b7f8bf39190d5af53e7ced5e2807a0
Author: Eric Vandenberg <[email protected]>
Date: 2017-11-15T20:03:21Z
Merge branch 'master' of github.com:ericvandenbergfb/spark into
cleanup.untracked.history.files
* 'master' of github.com:ericvandenbergfb/spark: (637 commits)
[SPARK-22469][SQL] Accuracy problem in comparison with string and numeric
[SPARK-22490][DOC] Add PySpark doc for SparkSession.builder
[SPARK-22422][ML] Add Adjusted R2 to RegressionMetrics
[SPARK-20791][PYTHON][FOLLOWUP] Check for unicode column names in
createDataFrame with Arrow
[SPARK-22514][SQL] move ColumnVector.Array and ColumnarBatch.Row to
individual files
[SPARK-12375][ML] VectorIndexerModel support handle unseen categories via
handleInvalid
[SPARK-21087][ML] CrossValidator, TrainValidationSplit expose sub models
after fitting: Scala
[SPARK-22511][BUILD] Update maven central repo address
[SPARK-22519][YARN] Remove unnecessary stagingDirPath null check in
ApplicationMaster.cleanupStagingDir()
[SPARK-20652][SQL] Store SQL UI data in the new app status store.
[SPARK-20648][CORE] Port JobsTab and StageTab to the new UI backend.
[SPARK-17074][SQL] Generate equi-height histogram in column statistics
[SPARK-17310][SQL] Add an option to disable record-level filter in
Parquet-side
[SPARK-21911][ML][FOLLOW-UP] Fix doc for parallel ML Tuning in PySpark
[SPARK-22377][BUILD] Use /usr/sbin/lsof if lsof does not exists in
release-build.sh
[SPARK-22487][SQL][FOLLOWUP] still keep spark.sql.hive.version
[MINOR][CORE] Using bufferedInputStream for dataDeserializeStream
[SPARK-20791][PYSPARK] Use Arrow to create Spark DataFrame from Pandas
[SPARK-21693][R][ML] Reduce max iterations in Linear SVM test in R to
speed up AppVeyor build
[SPARK-21720][SQL] Fix 64KB JVM bytecode limit problem with AND or OR
...
commit aee2fd3ffb9d720d33d032fdb924e9d1f4d20a4c
Author: Eric Vandenberg <[email protected]>
Date: 2017-11-16T20:33:39Z
[SPARK-21571][WEB UI] Spark history server cleans up untracked files.
The history provider code was changed so I reimplemented the fix to
clean up empty or corrupt history files that otherwise would stay
around forever.
commit 3431d5a2c427f1d2f2a859e014ed62c30a45ebdb
Author: ericvandenbergfb <[email protected]>
Date: 2017-11-16T21:27:48Z
Merge pull request #1 from ericvandenbergfb/cleanup.untracked.history.files
[SPARK-21571][WEB UI] Spark history server leaves incomplete or unreadable
logs around forever
----
---