[ https://issues.apache.org/jira/browse/SPARK-21942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16158320#comment-16158320 ]

Saisai Shao commented on SPARK-21942:
-------------------------------------

Personally I would prefer to fail fast if such things happen. Here it happened 
that the root folder was cleaned, and using {{mkdirs}} can handle this 
particular issue, but if some persistent block or shuffle index file is removed 
(after it has already been closed), I think there's no way to handle it. So 
instead of trying to work around it, exposing an exception to the user might be 
more useful, and would let the user know about the issue earlier.
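
A rough sketch of the two options (this is not the actual Spark code; 
{{localDir}} stands for one of the _blockmgr-XXX..._ root folders and 
{{subDirName}} for its hashed sub-folder):

{code}
import java.io.{File, IOException}

object LocalDirOptions {
  // Option 1: work around an externally removed root folder -- mkdirs() recreates
  // the missing parent directories as well, so getFile() would keep working.
  def getOrCreate(localDir: File, subDirName: String): File = {
    val subDir = new File(localDir, subDirName)
    if (!subDir.exists() && !subDir.mkdirs()) {
      throw new IOException(s"Failed to create local dir in $subDir")
    }
    subDir
  }

  // Option 2: fail fast -- if the root folder is gone, any persisted block or
  // shuffle index file that lived under it is gone too, so surface the problem
  // to the user immediately instead of silently recreating an empty folder.
  def requireIntact(localDir: File, subDirName: String): File = {
    if (!localDir.isDirectory) {
      throw new IOException(s"Spark local dir $localDir no longer exists")
    }
    new File(localDir, subDirName)
  }
}
{code}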

> DiskBlockManager crashing when a root local folder has been externally 
> deleted by OS
> ------------------------------------------------------------------------------------
>
>                 Key: SPARK-21942
>                 URL: https://issues.apache.org/jira/browse/SPARK-21942
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 1.6.1, 1.6.2, 1.6.3, 2.0.0, 2.0.1, 2.0.2, 2.1.0, 2.1.1, 
> 2.2.0, 2.2.1, 2.3.0, 3.0.0
>            Reporter: Ruslan Shestopalyuk
>            Priority: Minor
>              Labels: storage
>             Fix For: 2.3.0
>
>
> _DiskBlockManager_ has a notion of "scratch" local folder(s), which can be 
> configured via the _spark.local.dir_ option and which defaults to the system's 
> _/tmp_. The hierarchy is two-level, e.g. _/blockmgr-XXX.../YY_, where the 
> _YY_ part is a hash-based sub-folder name, used to spread files evenly.
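> For illustration, the scratch location can be pointed away from _/tmp_ (the 
> path below is just a hypothetical example; _spark.local.dir_ accepts a 
> comma-separated list of directories):
> {code}
> val conf = new org.apache.spark.SparkConf()
>   .set("spark.local.dir", "/data/spark-scratch")
> {code}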
> The function _DiskBlockManager.getFile_ expects the top-level directories 
> (_blockmgr-XXX..._) to always exist (they get created once, when the Spark 
> context is first created); otherwise it fails with a message like:
> {code}
> ... java.io.IOException: Failed to create local dir in /tmp/blockmgr-XXX.../YY
> {code}
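> For context, the lookup inside _DiskBlockManager.getFile_ is roughly the 
> following (a simplified, self-contained sketch, not the exact Spark code; the 
> real implementation caches and synchronizes on the sub-directories):
> {code}
> import java.io.{File, IOException}
>
> object GetFileSketch {
>   // In Spark, localDirs are the blockmgr-XXX... roots and subDirsPerLocalDir
>   // defaults to 64 (spark.diskStore.subDirectories); placeholder values here.
>   val localDirs: Array[File] = Array(new File("/tmp/blockmgr-XXX"))
>   val subDirsPerLocalDir = 64
>
>   def getFile(filename: String): File = {
>     val hash = filename.hashCode & Int.MaxValue   // non-negative hash of the block name
>     val dirId = hash % localDirs.length           // which blockmgr-XXX... root
>     val subDirId = (hash / localDirs.length) % subDirsPerLocalDir  // which "YY" sub-folder
>     val subDir = new File(localDirs(dirId), "%02x".format(subDirId))
>     // This is where the reported IOException comes from: mkdir() (unlike mkdirs())
>     // does not recreate the blockmgr-XXX... parent after the OS has removed it.
>     if (!subDir.exists() && !subDir.mkdir()) {
>       throw new IOException(s"Failed to create local dir in $subDir")
>     }
>     new File(subDir, filename)
>   }
> }
> {code}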
> However, this may not always be the case.
> In particular, *if it's the default _/tmp_ folder*, the OS may remove files 
> from it automatically, following different strategies:
> * at boot time
> * on a regular basis (e.g. once per day via a system cron job)
> * based on file age
> The symptom is that after the process (in our case, a service) using Spark has 
> been running for a while (a few days), it may no longer be able to load files, 
> since the top-level scratch directories are gone and 
> _DiskBlockManager.getFile_ crashes.
> Please note that this is different from someone arbitrarily removing files 
> manually.
> We have two facts here: _/tmp_ is the default in the Spark config, and the 
> system is entitled to tamper with its contents, which it will do with high 
> probability after some period of time.


