[ https://issues.apache.org/jira/browse/SPARK-21571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16109282#comment-16109282 ]

Eric Vandenberg commented on SPARK-21571:
-----------------------------------------

Link to pull request: https://github.com/apache/spark/pull/18791

> Spark history server leaves incomplete or unreadable history files around 
> forever.
> ----------------------------------------------------------------------------------
>
>                 Key: SPARK-21571
>                 URL: https://issues.apache.org/jira/browse/SPARK-21571
>             Project: Spark
>          Issue Type: Bug
>          Components: Scheduler
>    Affects Versions: 2.2.0
>            Reporter: Eric Vandenberg
>            Priority: Minor
>
> We have noticed that history server logs are sometimes never cleaned up.  The 
> current history server logic *ONLY* cleans up history files if they are 
> complete, since in general it doesn't make sense to clean up in-progress 
> history files (after all, the job is presumably still running).  Note that 
> in-progress history files would generally not be targeted for cleanup anyway, 
> assuming they regularly flush logs and the file system accurately updates the 
> history log's last modified time/size; while this is likely, it is not 
> guaranteed behavior.
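> For reference, the staleness test driving cleanup is essentially a 
> modification-time comparison; a minimal sketch of the idea (illustrative 
> names, not the actual FsHistoryProvider code):
>
>   // An in-progress log that keeps flushing keeps refreshing its
>   // modification time, so it never looks stale; maxAgeMs is assumed to
>   // come from spark.history.fs.cleaner.maxAge.
>   def isStale(modTimeMs: Long, nowMs: Long, maxAgeMs: Long): Boolean =
>     nowMs - modTimeMs > maxAgeMs
>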
> As a consequence of the current cleanup logic, combined with unclean 
> shutdowns, various file system bugs, earlier Spark bugs, etc., we have 
> accumulated thousands of these dead history files associated with long-gone 
> jobs.
> For example (with spark.history.fs.cleaner.maxAge=14d):
> -rw-rw----   3 xxxxxx        ooooooo  14382 2016-09-13 15:40 /user/hadoop/xxxxxxxxxxxxxx/spark/logs/qqqqqq1974_ppppppppppp-8812_110586000000195_dev4384_jjjjjjjjjjjj-53982.zstandard
> -rw-rw----   3 xxxx          ooooooo   5933 2016-11-01 20:16 /user/hadoop/xxxxxxxxxxxxxx/spark/logs/qqqqqq2016_ppppppppppp-8812_126507000000673_dev5365_jjjjjjjjjjjj-65313.lz4
> -rw-rw----   3 yyy           ooooooo      0 2017-01-19 11:59 /user/hadoop/xxxxxxxxxxxxxx/spark/logs/yyyyyyyyyyyyyyyy0057_zzzz326_mmmmmmmmm-57863.lz4.inprogress
> -rw-rw----   3 xxxxxxxxx     ooooooo      0 2017-01-19 14:17 /user/hadoop/xxxxxxxxxxxxxx/spark/logs/yyyyyyyyyyyyyyyy0063_zzzz688_mmmmmmmmm-33246.lz4.inprogress
> -rw-rw----   3 yyy           ooooooo      0 2017-01-20 10:56 /user/hadoop/xxxxxxxxxxxxxx/spark/logs/yyyyyyyyyyyyyyyy1030_zzzz326_mmmmmmmmm-45195.lz4.inprogress
> -rw-rw----   3 xxxxxxxxxxxx  ooooooo  11955 2017-01-20 17:55 /user/hadoop/xxxxxxxxxxxxxx/spark/logs/yyyyyyyyyyyyyyyy1314_wwww54_kkkkkkkkkkkkkk-64671.lz4.inprogress
> -rw-rw----   3 xxxxxxxxxxxx  ooooooo  11958 2017-01-20 17:55 /user/hadoop/xxxxxxxxxxxxxx/spark/logs/yyyyyyyyyyyyyyyy1315_wwww1667_kkkkkkkkkkkkkk-58968.lz4.inprogress
> -rw-rw----   3 xxxxxxxxxxxx  ooooooo  11960 2017-01-20 17:55 /user/hadoop/xxxxxxxxxxxxxx/spark/logs/yyyyyyyyyyyyyyyy1316_wwww54_kkkkkkkkkkkkkk-48058.lz4.inprogress
> Based on the current logic, cleanup candidates are skipped in several cases 
> (see the sketch after this list):
> 1. If a file has 0 bytes, it is completely ignored.
> 2. If a file is in progress and not parseable (the appID can't be 
> extracted), it is completely ignored.
> 3. If a file is complete but not parseable (the appID can't be extracted), 
> it is completely ignored.
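> As a hedged illustration of what an aggressive pass could do about these 
> three cases (this is a sketch, not the actual FsHistoryProvider code; the 
> aggressiveClean and parseAppId names are assumptions):
>
>   // Sketch: delete history files the normal cleaner skips, once they are
>   // older than the configured max age. fs, logDir, maxAgeMs and parseAppId
>   // are illustrative, not actual Spark internals.
>   import org.apache.hadoop.fs.{FileStatus, FileSystem, Path}
>
>   def aggressiveClean(fs: FileSystem, logDir: Path, maxAgeMs: Long,
>       parseAppId: FileStatus => Option[String]): Unit = {
>     val now = System.currentTimeMillis()
>     fs.listStatus(logDir).foreach { status =>
>       val stale = now - status.getModificationTime > maxAgeMs
>       val zeroByte = status.getLen == 0                   // case (1)
>       val unparseable = parseAppId(status).isEmpty        // cases (2), (3)
>       if (stale && (zeroByte || unparseable)) {
>         fs.delete(status.getPath, false)
>       }
>     }
>   }
>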
> To address this edge case and provide a way to clean out orphaned history 
> files, I propose a new configuration option:
> spark.history.fs.cleaner.aggressive={true, false}, default is false.
> If true, the history server will more aggressively garbage collect history 
> files in cases (1), (2), and (3).  Since the default is false, existing 
> customers won't be affected unless they explicitly opt in.  Customers who 
> leak similar garbage over time will then have the option of aggressively 
> cleaning it up.  Also note that aggressive cleanup may not be appropriate 
> for some customers if they have long-running jobs that exceed the 
> cleaner.maxAge time frame and/or have buggy file systems.
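> For illustration, opting in might look like this in spark-defaults.conf 
> (spark.history.fs.cleaner.aggressive is the proposed option, not an 
> existing Spark setting; the other two options already exist):
>
>   spark.history.fs.cleaner.enabled     true
>   spark.history.fs.cleaner.maxAge      14d
>   spark.history.fs.cleaner.aggressive  true
>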
> Would like to get feedback on whether this seems like a reasonable solution.



