[
https://issues.apache.org/jira/browse/MAHOUT-1570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14969108#comment-14969108
]
ASF GitHub Bot commented on MAHOUT-1570:
----------------------------------------
Github user tillrohrmann commented on the pull request:
https://github.com/apache/mahout/pull/161#issuecomment-150209353
I assume the reason why this has never been a problem for the Spark and H2O
bindings is precisely that you've registered (at least for the sparkbindings) a
proper Kryo serializer for the vector types. This is not done for the Flink
bindings.
Let me clarify a little how Flink serializes the collection of
`MatrixVectorView`. Flink's type extractor detects that this type is not a POJO
and thus assigns it a `GenericTypeInfo`. Types with this type info are
serialized using Kryo. Since Kryo has no special serializer registered for the
type, it will use the default one, which in this case is the `FieldSerializer`.
Thus, it will simply serialize all fields, as sketched below.
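For illustration, a minimal sketch of that fallback (the `TypeInfoDemo` harness
is hypothetical; it assumes Mahout's math module and Flink's Java type utilities
on the classpath, and relies on `viewRow` of a `DenseMatrix` returning a
`MatrixVectorView`):
```scala
import com.esotericsoftware.kryo.Kryo
import com.esotericsoftware.kryo.serializers.FieldSerializer
import org.apache.flink.api.java.typeutils.{GenericTypeInfo, TypeExtractor}
import org.apache.mahout.math.DenseMatrix

object TypeInfoDemo extends App {
  // viewRow on a DenseMatrix yields a MatrixVectorView under the hood.
  val view = new DenseMatrix(2, 2).viewRow(0)

  // The type extractor cannot analyze this type as a POJO, so it falls
  // back to GenericTypeInfo, which implies Kryo serialization.
  val info = TypeExtractor.getForObject(view)
  println(info.isInstanceOf[GenericTypeInfo[_]]) // true

  // With no special serializer registered, Kryo falls back to its default
  // FieldSerializer, which simply writes every field of the object.
  val kryo = new Kryo()
  println(kryo.getDefaultSerializer(view.getClass).isInstanceOf[FieldSerializer[_]]) // true
}
```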
That's how the `CollectionInputFormat` can serialize the `MatrixVectorView`
elements. Consequently, `CollectionInputFormat` itself can be serialized using
Java serialization, because it takes care of serializing its data collection on
its own.
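A hypothetical sketch of that path (`CollectionDemo` and the 2x2 matrix are
made up; `fromCollection` is the call that wraps the data in a
`CollectionInputFormat`):
```scala
import org.apache.flink.api.scala._
import org.apache.mahout.math.{DenseMatrix, Vector}

object CollectionDemo extends App {
  val env = ExecutionEnvironment.getExecutionEnvironment

  // Each viewRow below is a MatrixVectorView.
  val m = new DenseMatrix(2, 2)
  val rows: Seq[Vector] = Seq(m.viewRow(0), m.viewRow(1))

  // fromCollection wraps the data in a CollectionInputFormat: the elements
  // are written with the element serializer (Kryo here, since Vector ends up
  // as a generic type), while the input format object itself travels through
  // Java serialization.
  val ds = env.fromCollection(rows)
  ds.map(_.zSum()).print()
}
```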
However, I've just checked and noticed that Flink currently does not allow you
to specify default serializers for Kryo. With that capability in place, you
could do the same as for the spark bindings. I'll fix this.
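To make the intent concrete, a rough sketch of what such a registration could
look like once the hook exists, modeled on Flink's
`ExecutionConfig#addDefaultKryoSerializer`; the `DenseVectorKryoSerializer`
below is a hypothetical stand-in for the real vector serializers used by the
sparkbindings:
```scala
import com.esotericsoftware.kryo.{Kryo, Serializer}
import com.esotericsoftware.kryo.io.{Input, Output}
import org.apache.flink.api.scala._
import org.apache.mahout.math.{DenseVector, Vector}

// Hypothetical serializer, for illustration only: writes any Vector densely.
// The real sparkbindings serializers are considerably more careful.
class DenseVectorKryoSerializer extends Serializer[Vector] with Serializable {
  override def write(kryo: Kryo, out: Output, v: Vector): Unit = {
    out.writeInt(v.size())
    var i = 0
    while (i < v.size()) { out.writeDouble(v.getQuick(i)); i += 1 }
  }

  override def read(kryo: Kryo, in: Input, cls: Class[Vector]): Vector = {
    val values = Array.fill(in.readInt())(in.readDouble())
    new DenseVector(values)
  }
}

object RegisterDemo extends App {
  val env = ExecutionEnvironment.getExecutionEnvironment
  // Register a default Kryo serializer for all Vector subtypes, mirroring
  // what the sparkbindings achieve through Spark's Kryo registrator.
  env.getConfig.addDefaultKryoSerializer(classOf[Vector], new DenseVectorKryoSerializer)
}
```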
> Adding support for Apache Flink as a backend for the Mahout DSL
> ---------------------------------------------------------------
>
> Key: MAHOUT-1570
> URL: https://issues.apache.org/jira/browse/MAHOUT-1570
> Project: Mahout
> Issue Type: Improvement
> Reporter: Till Rohrmann
> Assignee: Alexey Grigorev
> Labels: DSL, flink, scala
> Fix For: 0.11.1
>
>
> With the finalized abstraction of the Mahout DSL plans from the backend
> operations (MAHOUT-1529), it should be possible to integrate further backends
> for the Mahout DSL. Apache Flink would be a suitable candidate for such an
> execution backend.
> With respect to the implementation, the biggest difference between Spark and
> Flink at the moment is probably the incremental rollout of plans, which is
> triggered by Spark's actions and which is not supported by Flink yet.
> However, the Flink community is working on this issue. For the moment, it
> should be possible to circumvent this problem by writing intermediate results
> required by an action to HDFS and reading from there.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)