GitHub user fhueske commented on the pull request:
https://github.com/apache/flink/pull/1083#issuecomment-138861596
You are certainly right that there should be an API call to explicitly
persist data in memory (or transparently on disk if memory is short) and later
access this data (within the same or another job). However, this feature can be
implemented in different ways, for example in the network stack or at the
operator level. Even if one implementation looks straightforward, it can have
severe limitations and implications for the behavior of the system. That is why
such features should be discussed before taking action, even if they look
easy to do.
Doing it at the operator level has several shortcomings:
- Persisted data sets cannot be used for recovery. If done at the network
stack level, basically the same code can be used for both.
- Data cannot (easily) be shared across jobs. Operators are expected to
return their memory when a job is done; otherwise there is a memory leak.
There is no way to free the memory once the job has finished without
releasing it.
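
To make the second point concrete, here is a toy sketch in Python. It is not
Flink code: `MemoryManager`, `run_job`, and `run_leaky_job` are made-up names,
and the assumption that the manager keeps no record of who holds which segment
is a simplification for illustration. It shows the dilemma: an operator that
follows the contract destroys the persisted data at job end, and one that
keeps the data leaks the memory.

```python
class MemoryManager:
    """Toy pool of memory segments (hypothetical, not Flink's API).

    Assumption: the manager does not track which operator holds a segment,
    so only the holder of the handles can give them back.
    """

    def __init__(self, num_segments):
        self.free_segments = list(range(num_segments))

    def allocate(self, n):
        if n > len(self.free_segments):
            raise MemoryError("not enough free segments")
        taken = self.free_segments[:n]
        self.free_segments = self.free_segments[n:]
        return taken

    def release(self, segments):
        self.free_segments.extend(segments)


def run_job(manager, persist_segments):
    """Operator persists data in its own segments, then obeys the contract."""
    segments = manager.allocate(persist_segments)
    # ... the job runs and finishes here ...
    # Returning the memory means the persisted data is gone before any
    # later job could read it.
    manager.release(segments)


def run_leaky_job(manager, persist_segments):
    """Operator keeps its segments so the data survives the job."""
    manager.allocate(persist_segments)
    # The handles are dropped when the job ends, so in this toy model
    # nobody can release these segments afterwards.


manager = MemoryManager(num_segments=10)

run_job(manager, 4)
free_after_clean_job = len(manager.free_segments)

run_leaky_job(manager, 4)
free_after_leaky_job = len(manager.free_segments)
```

In this model the first job leaves the pool complete but keeps nothing for a
later job to read, while the second keeps the data at the cost of four
segments that can no longer be reclaimed.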
@uce, @StephanEwen You are more familiar with this feature. Did I miss
something?