[ 
https://issues.apache.org/jira/browse/FLINK-1730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14736502#comment-14736502
 ] 

ASF GitHub Bot commented on FLINK-1730:
---------------------------------------

Github user sachingoel0101 commented on the pull request:

    https://github.com/apache/flink/pull/1083#issuecomment-138844139
  
    Hey @fhueske, thanks for your comments.
    I was not aware this was intended to allow for recovery of failed jobs.
    For reuse among different jobs in the same session, I don't see how this 
doesn't solve the issue. As long as the memory manager is alive, the results 
will be there for any job to use. 
    For true cross-job sharing, one possible feature would be to add an 
initialization method on the environment, such as `getPersistedSource(String)`, 
which would access the results of a data set persisted by some entirely 
independent job.
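    A minimal sketch of what such a lookup might look like, as a toy model 
rather than Flink code. Only the name `getPersistedSource(String)` comes from 
the comment above; `PersistedResults`, `Environment`, the generic signature and 
the in-memory map are all assumptions made for illustration.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Toy model only: none of these types are Flink classes.
final class PersistedSourceSketch {

    // Stands in for storage that outlives a single job (hypothetical).
    static final class PersistedResults {
        private final Map<String, List<?>> byName = new ConcurrentHashMap<>();

        void persist(String name, List<?> data) {
            byName.put(name, List.copyOf(data)); // store an immutable snapshot
        }

        List<?> lookup(String name) {
            List<?> data = byName.get(name);
            if (data == null) {
                throw new IllegalArgumentException("No persisted data set named " + name);
            }
            return data;
        }
    }

    // Stands in for the execution environment of one job (hypothetical).
    static final class Environment {
        private final PersistedResults store;

        Environment(PersistedResults store) {
            this.store = store;
        }

        // Method name taken from the comment; the signature is assumed.
        @SuppressWarnings("unchecked")
        <T> List<T> getPersistedSource(String name) {
            return (List<T>) store.lookup(name);
        }
    }

    public static void main(String[] args) {
        PersistedResults shared = new PersistedResults();

        // "Job A" persists a result under a name.
        shared.persist("word-counts", List.of(3, 1, 4));

        // "Job B", with its own environment, reads it back by name.
        Environment jobB = new Environment(shared);
        List<Integer> counts = jobB.getPersistedSource("word-counts");
        System.out.println(counts);
    }
}
```

    The point of the sketch is only that the second job needs nothing from the 
first except the name and a store that is still alive.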
    
    Further, this kind of makes sense on an operator level. Users should have 
the ability to explicitly persist a data set in memory, which calls for 
providing a function call. I was only drawing the analogy from Spark's API. I 
have no idea how they implement this internally, but if an API function is to 
be provided, it can only be done in two ways: either by returning a new 
Operator, as a transformation on the original data set, or by returning the 
same data set [like `withBroadcastSet` does]. The former seemed easier to work 
with, because it doesn't interfere with the existing mechanisms.
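    The two API shapes described above can be sketched roughly as follows. 
This is a toy model, not Flink's `DataSet` or operator classes; the names 
`persistAsOperator`, `persistInPlace`, `PersistOperator` and the `cached` flag 
are hypothetical and exist only to contrast the two options.

```java
import java.util.List;

// Toy model only: these classes are not Flink's DataSet or operators.
final class PersistApiShapes {

    static class DataSet<T> {
        final List<T> elements;
        boolean cached; // used only by option B

        DataSet(List<T> elements) {
            this.elements = elements;
        }

        // Option A: persist is a transformation that yields a new operator
        // wrapping the input; the input itself is left untouched.
        PersistOperator<T> persistAsOperator() {
            return new PersistOperator<>(this);
        }

        // Option B: persist flags this data set and returns the same object,
        // in the style the comment attributes to withBroadcastSet.
        DataSet<T> persistInPlace() {
            this.cached = true;
            return this;
        }
    }

    static final class PersistOperator<T> extends DataSet<T> {
        final DataSet<T> input;

        PersistOperator(DataSet<T> input) {
            super(input.elements);
            this.input = input;
        }
    }

    public static void main(String[] args) {
        DataSet<String> words = new DataSet<>(List.of("a", "b"));

        DataSet<String> a = words.persistAsOperator();
        System.out.println(a != words);   // option A adds a separate node
        System.out.println(words.cached); // and does not touch the original

        DataSet<String> b = words.persistInPlace();
        System.out.println(b == words);   // option B mutates and returns the same node
    }
}
```

    In this model option A leaves every existing object unchanged, which is 
one way to read "doesn't interfere with the existing mechanisms".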
    
    I have implemented no new internal functionality, but only used the 
existing system. I would've loved more discussion on this, but frankly, once I 
started going through the internal mechanisms, it seemed like a pretty trivial 
thing to implement. Of course, that was when I wasn't aware it was intended to 
be used for recovery.
    If there is some work on persisting intermediate results for recovery, the 
same mechanism can be used for a persist operation, in which case this work is 
moot anyway. But there has to be an API call to allow users to explicitly 
cache results in memory. The lack of one is a major problem I'm facing in 
implementing a randomized splitting algorithm.
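    One plausible reading of the randomized splitting problem, sketched as a 
toy model: if a lazily evaluated plan draws random numbers, each re-evaluation 
draws new ones, so two splits computed separately need not be complementary. 
This is an assumption about the use case, not a description of the pull 
request; `lazyTagging`, `persist` and `side` are hypothetical helpers and no 
Flink API is used.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.function.Supplier;

// Toy model only: a Supplier stands in for a lazily evaluated data set.
final class RandomSplitSketch {

    // One element paired with the random draw that decides its side.
    static final class Tagged {
        final int value;
        final double draw;

        Tagged(int value, double draw) {
            this.value = value;
            this.draw = draw;
        }
    }

    // Lazy plan: each get() re-runs the random tagging from scratch.
    static Supplier<List<Tagged>> lazyTagging(List<Integer> input, Random rng) {
        return () -> {
            List<Tagged> out = new ArrayList<>();
            for (int v : input) {
                out.add(new Tagged(v, rng.nextDouble()));
            }
            return out;
        };
    }

    // Stand-in for an explicit persist call: evaluate once, then replay.
    static <T> Supplier<T> persist(Supplier<T> plan) {
        T materialized = plan.get();
        return () -> materialized;
    }

    // Selects the elements whose draw falls on the requested side of 0.5.
    static List<Integer> side(Supplier<List<Tagged>> plan, boolean left) {
        List<Integer> out = new ArrayList<>();
        for (Tagged t : plan.get()) {
            if ((t.draw < 0.5) == left) {
                out.add(t.value);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Integer> input = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            input.add(i);
        }

        // Two separate evaluations: the halves need not be complementary.
        Supplier<List<Tagged>> lazy = lazyTagging(input, new Random(42));
        int lazyTotal = side(lazy, true).size() + side(lazy, false).size();

        // One evaluation shared by both halves: an exact partition of the input.
        Supplier<List<Tagged>> cached = persist(lazyTagging(input, new Random(42)));
        int cachedTotal = side(cached, true).size() + side(cached, false).size();

        System.out.println("lazy total = " + lazyTotal + ", cached total = " + cachedTotal);
    }
}
```

    With the cached plan the two sides always add up to the input size, 
because both are filtered from the same draws; the lazy plan gives no such 
guarantee.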


> Add a FlinkTools.persist style method to the Data Set.
> ------------------------------------------------------
>
>                 Key: FLINK-1730
>                 URL: https://issues.apache.org/jira/browse/FLINK-1730
>             Project: Flink
>          Issue Type: New Feature
>            Reporter: Stephan Ewen
>            Priority: Minor
>
> I think this is an operation that will be needed more prominently. Defining a 
> point where one long logical program is broken into different executions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
