[
https://issues.apache.org/jira/browse/FLINK-1730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14736582#comment-14736582
]
ASF GitHub Bot commented on FLINK-1730:
---------------------------------------
Github user fhueske commented on the pull request:
https://github.com/apache/flink/pull/1083#issuecomment-138861596
You are certainly right that there should be an API call to explicitly
persist data in memory (spilling transparently to disk if memory is short) and
to access this data later, within the same or another job. However, this
feature can be implemented in different ways, for example in the network stack
or at the operator level. Even if one implementation looks straightforward, it
can have severe limitations and implications for the behavior of the system.
That is why such features should be discussed before anyone starts implementing
them, even if they look easy to do.
Doing it at the operator level has several shortcomings:
- Persisted data sets cannot be used for recovery. If it is done at the
network stack level, basically the same code can be used for both persistence
and recovery.
- Data cannot (easily) be shared across jobs. Operators are expected to
return their memory when a job is done; otherwise the memory leaks. There is
no way to free that memory once the job has finished without releasing it.
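The second shortcoming can be illustrated with a minimal sketch. This is not Flink code: `ToyMemoryManager`, `PersistSketch`, and all method names are hypothetical stand-ins, and the page counts are made up. The sketch only assumes that managed memory is accounted per job and that operators are expected to hand it back when the job ends.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for a memory manager that accounts pages per owner (job). */
final class ToyMemoryManager {
    private int freePages;
    private final Map<String, Integer> pagesByOwner = new HashMap<>();

    ToyMemoryManager(int totalPages) {
        this.freePages = totalPages;
    }

    /** Hands out pages to an owner; fails if the pool is exhausted. */
    void allocate(String owner, int pages) {
        if (pages > freePages) {
            throw new IllegalStateException("out of managed memory");
        }
        freePages -= pages;
        pagesByOwner.merge(owner, pages, Integer::sum);
    }

    /** Returns pages from an owner to the pool. */
    void release(String owner, int pages) {
        int held = heldBy(owner);
        if (pages > held) {
            throw new IllegalArgumentException("releasing more than held");
        }
        if (pages == held) {
            pagesByOwner.remove(owner);
        } else {
            pagesByOwner.put(owner, held - pages);
        }
        freePages += pages;
    }

    int heldBy(String owner) {
        return pagesByOwner.getOrDefault(owner, 0);
    }

    int freePages() {
        return freePages;
    }
}

final class PersistSketch {
    /**
     * Simulates a job whose operator keeps some pages after the job ends so that
     * a later job could read them. Returns the pages still held by the finished job.
     */
    static int runJobThatPersistsInOperator(
            ToyMemoryManager mm, String jobId, int workPages, int persistedPages) {
        mm.allocate(jobId, workPages + persistedPages);
        // ... the operator does its work here ...
        mm.release(jobId, workPages); // working memory is returned as expected
        // The persisted pages are deliberately NOT released. Once the job is gone,
        // nothing in this sketch is responsible for freeing them.
        return mm.heldBy(jobId);
    }

    public static void main(String[] args) {
        ToyMemoryManager mm = new ToyMemoryManager(16);
        int stillHeld = runJobThatPersistsInOperator(mm, "job-1", 4, 6);
        System.out.println("pages still held by finished job-1: " + stillHeld);
        System.out.println("free pages left for the next job: " + mm.freePages());
    }
}
```

In this toy model the finished job still holds the pages it persisted, so the next job sees a smaller pool, and no component owns the task of releasing them. An implementation in which a runtime-level component owns the persisted data would presumably avoid this, but that is a design question for the discussion rather than something the sketch settles.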
@uce, @StephanEwen You are more familiar with this feature. Did I miss
something?
> Add a FlinkTools.persist style method to the Data Set.
> ------------------------------------------------------
>
> Key: FLINK-1730
> URL: https://issues.apache.org/jira/browse/FLINK-1730
> Project: Flink
> Issue Type: New Feature
> Reporter: Stephan Ewen
> Priority: Minor
>
> I think this is an operation that will be needed more prominently. It defines
> a point where one long logical program is broken into different executions.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)