[
https://issues.apache.org/jira/browse/FLINK-1730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14736502#comment-14736502
]
ASF GitHub Bot commented on FLINK-1730:
---------------------------------------
Github user sachingoel0101 commented on the pull request:
https://github.com/apache/flink/pull/1083#issuecomment-138844139
Hey @fhueske , thanks for your comments.
I was not aware this was intended to allow for recovery of failed jobs.
For reusing among different jobs in the same session, I don't see how this
doesn't solve the issue. If the Memory manager is alive, the results will be
there for any job to use.
For true cross-job sharing, one possible feature would be to add a
method for initialization from the environment as `getPersistedSource(String)`
which would access results from a persisted data set from some entirely
independent job.
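
A rough sketch of what such a lookup could look like, using a toy in-memory model rather than Flink's actual classes. `ToyEnvironment` and `persist(String, List)` are hypothetical names made up for illustration; only `getPersistedSource(String)` comes from the proposal above, and its real signature is not settled anywhere in this discussion.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model only: none of these classes are Flink classes.
class ToyEnvironment {
    // Stands in for memory that outlives a single job within a session.
    private final Map<String, List<?>> persisted = new HashMap<>();

    // Hypothetical: a job stores a finished result under a name.
    <T> void persist(String name, List<T> result) {
        persisted.put(name, new ArrayList<>(result)); // defensive copy
    }

    // Hypothetical counterpart of the proposed getPersistedSource(String):
    // a later, independent job looks the result up by name.
    @SuppressWarnings("unchecked")
    <T> List<T> getPersistedSource(String name) {
        List<?> result = persisted.get(name);
        if (result == null) {
            throw new IllegalStateException("No persisted data set named " + name);
        }
        return (List<T>) result;
    }
}
```

The name-based lookup is what decouples the producing job from the consuming one: the second job needs only the string key, not a reference to the first job's plan.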
Further, this kind of makes sense on an operator level. Users should have the
ability to explicitly persist a data set in memory, which calls for providing a
function call. I was only drawing the analogy from Spark's API. I have no idea
how they internally implement this, but if an API function is to be provided,
it can only be done in two ways. Either return a new Operator, as a
transformation on the original data set, or just by returning the same data set
[like `withBroadcastSet` does]. The former seemed easier to work with, because
it doesn't interfere with the existing mechanisms.
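
The two API shapes can be contrasted with a small self-contained sketch. This is a toy model, not Flink's `DataSet`: `ToyDataSet`, `persist()` and `withPersistence()` are hypothetical names chosen for illustration, and caching is reduced to holding a list in a field.

```java
import java.util.List;
import java.util.function.Supplier;

// Toy model only: not Flink's DataSet.
class ToyDataSet<T> {
    private final Supplier<List<T>> compute; // how to (re)compute the data
    private boolean cacheEnabled = false;
    private List<T> cached;

    ToyDataSet(Supplier<List<T>> compute) {
        this.compute = compute;
    }

    // Recomputes on every call unless caching is enabled for this data set.
    List<T> collect() {
        if (!cacheEnabled) {
            return compute.get();
        }
        if (cached == null) {
            cached = compute.get();
        }
        return cached;
    }

    // Option 1: return a new operator derived from this data set.
    // The original data set is left untouched.
    ToyDataSet<T> persist() {
        ToyDataSet<T> persisted = new ToyDataSet<>(compute);
        persisted.cacheEnabled = true;
        return persisted;
    }

    // Option 2: flag this data set and return the same instance,
    // in the style of a method that configures and returns `this`.
    ToyDataSet<T> withPersistence() {
        cacheEnabled = true;
        return this;
    }
}
```

With option 1, code that still holds the original data set keeps its old behavior; with option 2, every holder of the reference sees the change.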
I have implemented no new internal functionality, but only used the
existing system. I would've loved more discussion on this but frankly, once I
started going through the internal mechanisms, it seemed like a pretty trivial
thing to implement. Of course that was when I wasn't aware it was intended to
be used for recovery.
If there is some work on persisting intermediate results for recovery, the
same mechanism can be used for a persist operation, in which case this work is
moot anyway. But there has to be an API call to allow users to explicitly
cache results in memory. This is a major problem I'm facing in implementing a
randomized splitting algorithm.
> Add a FlinkTools.persist style method to the Data Set.
> ------------------------------------------------------
>
> Key: FLINK-1730
> URL: https://issues.apache.org/jira/browse/FLINK-1730
> Project: Flink
> Issue Type: New Feature
> Reporter: Stephan Ewen
> Priority: Minor
>
> I think this is an operation that will be needed more prominently. Defining a
> point where one long logical program is broken into different executions.