[
https://issues.apache.org/jira/browse/FLINK-6505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Stefan Richter updated FLINK-6505:
----------------------------------
Description:
In {{RocksDBKeyedStateBackend}}, the {{instanceBasePath}} is cleared on
{{dispose()}}. I think it might make sense to also clear this directory when
the backend is created, in case something crashed and the backend never reached
{{dispose()}}. At least for previous runs of the same job, we can know what to
delete on restart.
In general, it is very important for this backend to clean up the local FS,
because the local quota might be very limited compared to the DFS. A node
that runs out of local disk space can bring down the whole job with no way to
recover, because the job might always get rescheduled to that node.
was:
In `RocksDBKeyedStateBackend`, the `instanceBasePath` is cleared on
`dispose()`. I think it might make sense to also clear this directory when the
backend is created, in case something crashed and the backend never reached
`dispose()`.
In general, it is very important for this backend to clean up the local FS,
because the local quota might be very limited compared to the DFS. And a node
that runs out of local disk space can bring down the whole job, with no way to
recover (it might always get rescheduled to that node).
> Proactively cleanup local FS for RocksDBKeyedStateBackend on startup
> --------------------------------------------------------------------
>
> Key: FLINK-6505
> URL: https://issues.apache.org/jira/browse/FLINK-6505
> Project: Flink
> Issue Type: Bug
> Components: State Backends, Checkpointing
> Affects Versions: 1.3.0
> Reporter: Stefan Richter
>
> In {{RocksDBKeyedStateBackend}}, the {{instanceBasePath}} is cleared on
> {{dispose()}}. I think it might make sense to also clear this directory when
> the backend is created, in case something crashed and the backend never
> reached {{dispose()}}. At least for previous runs of the same job, we can
> know what to delete on restart.
> In general, it is very important for this backend to clean up the local FS,
> because the local quota might be very limited compared to the DFS. A node
> that runs out of local disk space can bring down the whole job with no way
> to recover, because the job might always get rescheduled to that node.
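
The startup cleanup proposed in the description could look roughly like the
sketch below. It is only an illustration under stated assumptions: the class
{{InstanceBasePathCleanup}} and the method {{prepareCleanDirectory}} are
hypothetical names, not part of Flink, and the sketch assumes that the
{{instanceBasePath}} of a restarted job resolves to the same local directory
as in the crashed run. It uses only java.nio.file.

```java
import java.io.IOException;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.attribute.BasicFileAttributes;

// Hypothetical helper for illustration only; not an actual Flink class.
public final class InstanceBasePathCleanup {

    private InstanceBasePathCleanup() {}

    /**
     * Removes anything a crashed previous run may have left under
     * instanceBasePath, then recreates the directory empty.
     */
    public static void prepareCleanDirectory(Path instanceBasePath) throws IOException {
        if (Files.exists(instanceBasePath)) {
            Files.walkFileTree(instanceBasePath, new SimpleFileVisitor<Path>() {
                @Override
                public FileVisitResult visitFile(Path file, BasicFileAttributes attrs)
                        throws IOException {
                    // Delete regular files (and symlinks, which are not followed).
                    Files.delete(file);
                    return FileVisitResult.CONTINUE;
                }

                @Override
                public FileVisitResult postVisitDirectory(Path dir, IOException exc)
                        throws IOException {
                    if (exc != null) {
                        throw exc;
                    }
                    // A directory is deleted only after all of its entries are gone.
                    Files.delete(dir);
                    return FileVisitResult.CONTINUE;
                }
            });
        }
        // Start from an empty directory, whether or not leftovers existed.
        Files.createDirectories(instanceBasePath);
    }
}
```

Calling such a helper when the backend is created, before RocksDB opens its
files, would cover the case where {{dispose()}} was never reached. It would
not help with directories left behind by other jobs, which matches the "at
least for previous runs of the same job" caveat in the description.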