[ 
https://issues.apache.org/jira/browse/FLINK-1730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14709648#comment-14709648
 ] 

Sachin Goel commented on FLINK-1730:
------------------------------------

This would be equivalent to Spark's persist method, right? 
Drawing from the description there, we would need different levels of 
persistence. 
Disk-level persistence is pretty easy: it would work exactly as it does in FlinkMLTools.
For faster access, we would need to implement in-memory and 
combined in-memory-and-disk levels. 
For an initial in-memory implementation:
As far as I can tell, it would require setting up a new single-input 
operator and plan node, along with a new driver. The driver, however, instead of 
reading data from its inputs, reads it from a cache built on top of the Memory Manager. 
When the {{run}} function of the driver is invoked, it checks with the cache 
whether the results have been persisted and, if so, simply fetches them 
from there. Otherwise, it reads all the input and returns it while 
simultaneously caching it.
Does this seem like a valid approach? 
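
To make the check-then-cache logic concrete, here is a rough sketch in plain 
Java. It is not Flink code: {{ResultCache}}, {{PersistDriver}} and the string 
key are hypothetical names made up for illustration, the cache is a simple heap 
map rather than anything backed by the Memory Manager, and a {{Consumer}} 
stands in for the output collector.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

// Hypothetical sketch only: none of these types exist in Flink.
final class PersistSketch {

    /** Stand-in for a cache built on top of the Memory Manager (here just a heap map). */
    static final class ResultCache<T> {
        private final Map<String, List<T>> entries = new HashMap<>();

        /** Returns the persisted records for the key, or null if nothing is cached. */
        List<T> get(String key) {
            return entries.get(key);
        }

        void put(String key, List<T> records) {
            entries.put(key, records);
        }
    }

    /** Stand-in for the new driver of the single-input persist operator. */
    static final class PersistDriver<T> {
        private final String key;
        private final ResultCache<T> cache;

        PersistDriver(String key, ResultCache<T> cache) {
            this.key = key;
            this.cache = cache;
        }

        /** Emits the persisted result if there is one; otherwise reads the input, emitting and caching it. */
        void run(Iterator<T> input, Consumer<T> out) {
            List<T> cached = cache.get(key);
            if (cached != null) {
                // Results were persisted earlier: serve them without touching the input.
                cached.forEach(out);
                return;
            }
            List<T> buffer = new ArrayList<>();
            while (input.hasNext()) {
                T record = input.next();
                buffer.add(record);   // cache the record ...
                out.accept(record);   // ... while passing it downstream
            }
            // Publish only after the whole input has been consumed.
            cache.put(key, buffer);
        }
    }
}
```

Publishing the buffer only after the input is exhausted keeps a partially read 
result from being served as if it were complete. A real implementation would 
also have to deal with spilling and eviction, which this sketch ignores.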

Valid or not, I also remember reading something about plans to add 
cached data sets to Flink, though I can't find where I read it. If those plans 
exist, this could perhaps be implemented on top of them.

> Add a FlinkTools.persist style method to the Data Set.
> ------------------------------------------------------
>
>                 Key: FLINK-1730
>                 URL: https://issues.apache.org/jira/browse/FLINK-1730
>             Project: Flink
>          Issue Type: New Feature
>            Reporter: Stephan Ewen
>            Priority: Minor
>
> I think this is an operation that will be needed more prominently. Defining a 
> point where one long logical program is broken into different executions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
