Github user JoshRosen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21390#discussion_r190105740
  
    --- Diff: core/src/main/scala/org/apache/spark/deploy/worker/Worker.scala 
---
    @@ -97,6 +97,10 @@ private[deploy] class Worker(
       private val APP_DATA_RETENTION_SECONDS =
         conf.getLong("spark.worker.cleanup.appDataTtl", 7 * 24 * 3600)
     
    +  // Whether or not to clean up the non-shuffle files when an executor finishes.
    +  private val CLEANUP_NON_SHUFFLE_FILES_ENABLED =
    +    conf.getBoolean("spark.worker.cleanup.nonShuffleFiles.enabled", true)
    --- End diff ---
    
    I think these configurations might actually be independent:
    
    - `spark.worker.cleanup.enabled` controls whether directories from finished applications (i.e. terminated drivers) are cleaned up, a decision which can impact users' ability to debug failed applications. In environments where clusters / underlying physical workers are ephemeral this configuration tends to matter a little less, but in environments with long-lived machines which host many Spark drivers over a long period of time (e.g. on-prem) it can matter quite a bit. For example, I think that setting `spark.worker.cleanup.enabled = true` could mean that completed applications' executor logs get purged from the underlying host machine after a time period.
    - The new configuration being proposed here generally matters only _within a long-running Spark application_ and is more or less orthogonal to the choice to keep or discard failed applications' executor directories. I think we're only removing files within block managers' nested folders, right? So there should be no user-controlled / user-generated files here, at least in theory. Therefore the impact of this configuration on ordinary users' ability to debug is likely to be small (I think you'd only ever really want access to the files we're deleting if you're going to hex-dump them to debug corruption issues which trigger segfaults, an unlikely scenario). A rough sketch of how the two knobs stay independent follows this list.
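    To make that independence concrete, here's a rough sketch (a hypothetical helper class mirroring the `getBoolean` / `getLong` reads in the diff above, not the real `Worker` code) of the two knobs being read and defaulted separately:

```scala
import org.apache.spark.SparkConf

// Rough sketch only: a hypothetical helper, not the actual Worker code, meant to
// illustrate that the two cleanup knobs are read and defaulted independently.
class WorkerCleanupSettingsSketch(conf: SparkConf) {

  // Existing mechanism: periodic, TTL-based purge of finished applications'
  // work directories (executor logs included); off by default so debugging
  // artifacts survive on long-lived hosts.
  val appDirCleanupEnabled: Boolean =
    conf.getBoolean("spark.worker.cleanup.enabled", false)
  val appDataTtlSeconds: Long =
    conf.getLong("spark.worker.cleanup.appDataTtl", 7 * 24 * 3600)

  // Mechanism proposed in this PR: when an executor exits within a still-running
  // application, delete its non-shuffle block manager files; on by default since
  // those files aren't user-generated and are rarely needed for debugging.
  val cleanupNonShuffleFilesEnabled: Boolean =
    conf.getBoolean("spark.worker.cleanup.nonShuffleFiles.enabled", true)
}
```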
    
    Given the different motivations and lifecycles of these two cleanup mechanisms, and the different defaults they call for, I'm thinking it might make sense to rename the new configuration so that we don't confuse users with similar-sounding names.
    
    What do you think about something like 
`spark.storage.removeNonShuffleBlockManagerFilesAfterExecutorDeath`? Really 
verbose, but a precise description of what's going on. I'm open to better name 
suggestions (since I tend to pick overly-complicated ones); this is just a 
conversation-starting strawman.
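    For concreteness, under that strawman the only change to the hunk quoted above would be the key being read (the key below is just this proposal, not an existing Spark configuration):

```scala
// Strawman rename only: same default and semantics as the config in the diff above,
// just under a key that's harder to confuse with spark.worker.cleanup.enabled.
private val CLEANUP_NON_SHUFFLE_FILES_ENABLED =
  conf.getBoolean(
    "spark.storage.removeNonShuffleBlockManagerFilesAfterExecutorDeath", true)
```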


---
