[
https://issues.apache.org/jira/browse/FLINK-1730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14709648#comment-14709648
]
Sachin Goel commented on FLINK-1730:
------------------------------------
This would be equivalent to Spark's persist method, right?
Drawing from the description there, we would need different levels of
persistence.
Disk-level persistence is pretty easy. It would work exactly as is done in
FlinkMLTools.
For faster access, we would also need in-memory and in-memory-cum-disk levels.
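As a rough illustration of the three levels, here is a sketch in plain Java.
The enum and its method names are hypothetical and not part of Flink; they
only loosely mirror Spark's DISK_ONLY, MEMORY_ONLY and MEMORY_AND_DISK storage
levels.

```java
/** Hypothetical persistence levels; none of these names exist in Flink. */
enum PersistenceLevel {
    DISK(false, true),            // write to disk, read back from disk
    MEMORY(true, false),          // keep everything in memory
    MEMORY_AND_DISK(true, true);  // keep in memory, fall back to disk

    private final boolean usesMemory;
    private final boolean usesDisk;

    PersistenceLevel(boolean usesMemory, boolean usesDisk) {
        this.usesMemory = usesMemory;
        this.usesDisk = usesDisk;
    }

    boolean usesMemory() { return usesMemory; }

    boolean usesDisk() { return usesDisk; }

    /** True only for the level that can move records from memory to disk. */
    boolean canSpill() { return usesMemory && usesDisk; }
}
```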
For an initial in-memory implementation:
As far as I could figure out, it would require setting up a new single-input
operator and plan node, along with a new driver. The driver, however, instead
of reading data directly from its inputs, goes through a cache built on top of
the Memory Manager. When the {{run}} function of the driver is invoked, it
checks with the cache whether the results have been persisted and, if so,
simply fetches them from there. Otherwise, it reads all of the input and
returns it while simultaneously caching it.
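A minimal sketch of that read-through behaviour, in plain Java rather than
against Flink's actual driver interfaces. ResultCache and PersistDriverSketch
are hypothetical names, and the HashMap merely stands in for a cache built on
the Memory Manager.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

/** Hypothetical stand-in for a cache built on top of the Memory Manager. */
final class ResultCache<T> {
    private final Map<String, List<T>> entries = new HashMap<>();

    boolean contains(String key) { return entries.containsKey(key); }

    List<T> get(String key) { return entries.get(key); }

    void put(String key, List<T> records) { entries.put(key, records); }
}

/** Hypothetical driver: serves records from the cache, or reads the input and caches it. */
final class PersistDriverSketch<T> {
    private final String key;
    private final ResultCache<T> cache;

    PersistDriverSketch(String key, ResultCache<T> cache) {
        this.key = key;
        this.cache = cache;
    }

    /** Emits every record to out; consumes input only on a cache miss. */
    void run(Iterator<T> input, Consumer<T> out) {
        if (cache.contains(key)) {
            // Already persisted: fetch the results from the cache, leave the input alone.
            for (T record : cache.get(key)) {
                out.accept(record);
            }
            return;
        }
        // Not yet persisted: forward each record while collecting a copy.
        List<T> buffer = new ArrayList<>();
        while (input.hasNext()) {
            T record = input.next();
            buffer.add(record);
            out.accept(record);
        }
        // Publish only after the input is fully read, so an aborted run
        // does not leave a partial result in the cache.
        cache.put(key, buffer);
    }
}
```

The first call to run with a given key drains the input and fills the cache;
later calls with the same key and cache emit the cached records and never
touch their input. A real driver would also have to deal with memory limits
and eviction, which this sketch ignores.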
Does this seem like a valid approach?
Valid or not, I also remember reading something about plans to add cached
data sets to Flink, though I can't find where I read it. If those are coming,
this could perhaps just be implemented on top of them.
> Add a FlinkTools.persist style method to the Data Set.
> ------------------------------------------------------
>
> Key: FLINK-1730
> URL: https://issues.apache.org/jira/browse/FLINK-1730
> Project: Flink
> Issue Type: New Feature
> Reporter: Stephan Ewen
> Priority: Minor
>
> I think this is an operation that will be needed more prominently: defining a
> point where one long logical program is broken into different executions.