GitHub user ericvandenbergfb opened a pull request:

    https://github.com/apache/spark/pull/19770

    [SPARK-21571][WEB UI] Spark history server leaves incomplete or unreadable logs around forever

    ## What changes were proposed in this pull request?
    
    ** Updated pull request based on some other refactoring that went into FsHistoryProvider **
    
    Fix logic:
    
    1. checkForLogs excluded 0-size files, so they stuck around forever.
    2. checkForLogs / mergeApplicationListing indefinitely ignored files
       that were not parseable or from which no appID could be extracted,
       so they stuck around forever.
    
    Only apply the above logic if spark.history.fs.cleaner.aggressive=true.
    
    Fixed a race condition in a test (SPARK-3697: ignore files that cannot be
    read.) where the number of mergeApplicationListings could be more than 1,
    because the FsHistoryProvider would spin up an executor that also calls
    checkForLogs in parallel with the test.
    
    Added unit test to cover all cases with aggressive and non-aggressive
    clean up logic.
    
    ## How was this patch tested?
    
    Added a test that extensively checks that untracked files are cleaned up when configured.
    
    Please review http://spark.apache.org/contributing.html before opening a pull request.
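
The fix logic described in the pull request above can be illustrated with a
small Python model. This is only a sketch of the described behaviour, not the
Scala code in FsHistoryProvider: the LogFile type, the files_to_delete helper,
and the max_age threshold are all hypothetical names introduced here, and the
assumption that untracked files are removed only once they exceed an age limit
is not stated in the description.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class LogFile:
    """Hypothetical stand-in for an event log entry in the history directory."""
    path: str
    size: int                # bytes on disk
    app_id: Optional[str]    # None when no appID could be extracted
    mtime: float             # last-modified time, in seconds


def is_untracked(f: LogFile) -> bool:
    # Empty files and files without an extractable appID never make it into
    # the application listing, so a cleaner that only looks at listed
    # applications never considers them.
    return f.size == 0 or f.app_id is None


def files_to_delete(files: List[LogFile], now: float, max_age: float,
                    aggressive: bool) -> List[str]:
    """Return the paths of untracked files this model would remove."""
    if not aggressive:
        # Models spark.history.fs.cleaner.aggressive=false:
        # untracked files are left alone, as before the fix.
        return []
    # Assumption: only untracked files older than max_age are removed.
    return [f.path for f in files
            if is_untracked(f) and now - f.mtime > max_age]


logs = [
    LogFile("app-1", size=2048, app_id="app-1", mtime=0.0),   # tracked
    LogFile("empty", size=0, app_id=None, mtime=0.0),         # 0-size
    LogFile("corrupt", size=512, app_id=None, mtime=0.0),     # unparseable
    LogFile("fresh-empty", size=0, app_id=None, mtime=90.0),  # too recent
]

print(files_to_delete(logs, now=100.0, max_age=50.0, aggressive=True))
print(files_to_delete(logs, now=100.0, max_age=50.0, aggressive=False))
```

With the aggressive flag set, only the old empty and corrupt files are
selected; with it unset, nothing is selected, which matches the behaviour the
description says existed before this change.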


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/ericvandenbergfb/spark master

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/19770.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #19770
    
----
commit c52b1cfd2eee9c881267d3d4cd9ea83fb6a767eb
Author: Eric Vandenberg <[email protected]>
Date:   2017-07-31T22:02:54Z

    [SPARK-21571][WEB UI] Spark history server leaves incomplete or unreadable history files around forever.
    
    Fix logic
    1. checkForLogs excluded 0-size files so they stuck around forever.
    2. checkForLogs / mergeApplicationListing indefinitely ignored files
    that were not parseable/couldn't extract an appID, so they stuck around
    forever.
    
    Only apply above logic if spark.history.fs.cleaner.aggressive=true.
    
    Fixed race condition in a test (SPARK-3697: ignore files that cannot be
    read.) where the number of mergeApplicationListings could be more than 1
    since the FsHistoryProvider would spin up an executor that also calls
    checkForLogs in parallel with the test.
    
    Added unit test to cover all cases with aggressive and non-aggressive
    clean up logic.

commit 08ea4ace02b7f8bf39190d5af53e7ced5e2807a0
Author: Eric Vandenberg <[email protected]>
Date:   2017-11-15T20:03:21Z

    Merge branch 'master' of github.com:ericvandenbergfb/spark into cleanup.untracked.history.files
    
    * 'master' of github.com:ericvandenbergfb/spark: (637 commits)
      [SPARK-22469][SQL] Accuracy problem in comparison with string and numeric
      [SPARK-22490][DOC] Add PySpark doc for SparkSession.builder
      [SPARK-22422][ML] Add Adjusted R2 to RegressionMetrics
      [SPARK-20791][PYTHON][FOLLOWUP] Check for unicode column names in createDataFrame with Arrow
      [SPARK-22514][SQL] move ColumnVector.Array and ColumnarBatch.Row to individual files
      [SPARK-12375][ML] VectorIndexerModel support handle unseen categories via handleInvalid
      [SPARK-21087][ML] CrossValidator, TrainValidationSplit expose sub models after fitting: Scala
      [SPARK-22511][BUILD] Update maven central repo address
      [SPARK-22519][YARN] Remove unnecessary stagingDirPath null check in ApplicationMaster.cleanupStagingDir()
      [SPARK-20652][SQL] Store SQL UI data in the new app status store.
      [SPARK-20648][CORE] Port JobsTab and StageTab to the new UI backend.
      [SPARK-17074][SQL] Generate equi-height histogram in column statistics
      [SPARK-17310][SQL] Add an option to disable record-level filter in Parquet-side
      [SPARK-21911][ML][FOLLOW-UP] Fix doc for parallel ML Tuning in PySpark
      [SPARK-22377][BUILD] Use /usr/sbin/lsof if lsof does not exists in release-build.sh
      [SPARK-22487][SQL][FOLLOWUP] still keep spark.sql.hive.version
      [MINOR][CORE] Using bufferedInputStream for dataDeserializeStream
      [SPARK-20791][PYSPARK] Use Arrow to create Spark DataFrame from Pandas
      [SPARK-21693][R][ML] Reduce max iterations in Linear SVM test in R to speed up AppVeyor build
      [SPARK-21720][SQL] Fix 64KB JVM bytecode limit problem with AND or OR
      ...

commit aee2fd3ffb9d720d33d032fdb924e9d1f4d20a4c
Author: Eric Vandenberg <[email protected]>
Date:   2017-11-16T20:33:39Z

    [SPARK-21571][WEB UI] Spark history server cleans up untracked files.
    
    The history provider code was changed so I reimplemented the fix to
    clean up empty or corrupt history files that otherwise would stay
    around forever.

commit 3431d5a2c427f1d2f2a859e014ed62c30a45ebdb
Author: ericvandenbergfb <[email protected]>
Date:   2017-11-16T21:27:48Z

    Merge pull request #1 from ericvandenbergfb/cleanup.untracked.history.files
    
     [SPARK-21571][WEB UI] Spark history server leaves incomplete or unreadable logs around forever

----

