[
https://issues.apache.org/jira/browse/MAHOUT-1817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15213618#comment-15213618
]
Andrew Palumbo commented on MAHOUT-1817:
----------------------------------------
I merged #203 as a workaround for this. I changing the {{checkpoint}} call to
throw an exception for anything other than {{CacheHint.DISK_ONLY}} or
{{CacheHint.NONE}} but some of the tests must have other cache hints so I had
left it as is. There's still probably room for improvement here. But this
should work for now.
As Implemented now, a {{checkpoint()}} call on a Drm will trigger a logical
optimization and then create a physical Flink plan. After the plan is created,
the {{cache}} call will assign a name to the Drm (if {{cache}} has not been
called before) persist the backing {{DataSet}} to either the directory as
specified in the {{taskmanager.tmp.dirs}} properties if set in a
{{$MAHOUT_HOME/conf/flink-config.yaml}} file or to {{/tmp}} if no file exists
or if the property is not set. This will trigger the evaluation of the
physical plan. The backing {{DataSet}} is then re-read from the given
directory, wrapped into a Drm and returned by the {{cache}} function.
If {{cache}} has already been called on this Drm, the dataset simply reads the
previously persisted {{DataSet}} from the filesystem, wraps that into a Drm and
returns it.
There is some question about how to reset the parallelism degree of the cached
{{DataSet}} remaining.
> Implement caching in Flink Bindings
> -----------------------------------
>
> Key: MAHOUT-1817
> URL: https://issues.apache.org/jira/browse/MAHOUT-1817
> Project: Mahout
> Issue Type: New Feature
> Components: Flink
> Affects Versions: 0.11.2
> Reporter: Andrew Palumbo
> Assignee: Andrew Palumbo
> Priority: Blocker
> Fix For: 0.12.0
>
>
> Flink does not have in-memory caching analogous to that of Spark. We need
> find a way to honour the {{checkpoint()}} contract in Flink Bindings.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)