[ https://issues.apache.org/jira/browse/MAHOUT-1570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14969108#comment-14969108 ]

ASF GitHub Bot commented on MAHOUT-1570:
----------------------------------------

Github user tillrohrmann commented on the pull request:

    https://github.com/apache/mahout/pull/161#issuecomment-150209353
  
    I assume the reason why this has never been a problem for the spark and
    h2o bindings is that you've registered (at least for the sparkbindings)
    a proper Kryo serializer for the vector types. This is not done for the
    flink bindings.
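
    For reference, here is a minimal sketch of what such a registration can
    look like on the Spark side. The class names and the Writable round trip
    are illustrative assumptions, not the exact code from the sparkbindings:

        import com.esotericsoftware.kryo.{Kryo, Serializer}
        import com.esotericsoftware.kryo.io.{Input, Output}
        import java.io.{DataInputStream, DataOutputStream}
        import org.apache.mahout.math.{DenseVector, Vector, VectorWritable}
        import org.apache.spark.SparkConf
        import org.apache.spark.serializer.KryoRegistrator

        // Hypothetical serializer: round-trips a Mahout Vector through its
        // Hadoop Writable form instead of Kryo's field-by-field default.
        class VectorKryoSerializer extends Serializer[Vector] {
          override def write(kryo: Kryo, output: Output, vector: Vector): Unit =
            new VectorWritable(vector).write(new DataOutputStream(output))

          override def read(kryo: Kryo, input: Input, tpe: Class[Vector]): Vector = {
            val writable = new VectorWritable()
            writable.readFields(new DataInputStream(input))
            writable.get()
          }
        }

        // Hypothetical registrator wiring the serializer into Spark's Kryo.
        class VectorRegistrator extends KryoRegistrator {
          override def registerClasses(kryo: Kryo): Unit =
            kryo.register(classOf[DenseVector], new VectorKryoSerializer)
        }

        val conf = new SparkConf()
          .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
          .set("spark.kryo.registrator", "VectorRegistrator")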
    
    Let me clarify a bit how Flink serializes the collection of
    `MatrixVectorView`. Flink's type extractor detects that this type is not
    a POJO and thus assigns it a `GenericTypeInfo`. Types with this type info
    are serialized using Kryo. Since Kryo has no special serializer
    registered for this type, it falls back to its default, the
    `FieldSerializer`, which simply serializes all fields.
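
    To see what the extractor decides, you can inspect the type info directly
    (a quick sketch; the printed form may vary across Flink versions):

        import org.apache.flink.api.java.typeutils.{GenericTypeInfo, TypeExtractor}
        import org.apache.mahout.math.MatrixVectorView

        // MatrixVectorView has no no-arg constructor and no public
        // getters/setters for its fields, so it fails the POJO analysis:
        val info = TypeExtractor.createTypeInfo(classOf[MatrixVectorView])
        println(info.isInstanceOf[GenericTypeInfo[_]]) // true -> Kryo is used
        println(info) // e.g. GenericType<org.apache.mahout.math.MatrixVectorView>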
    
    That's how the `CollectionInputFormat` can serialize the
    `MatrixVectorView` elements. Consequently, the `CollectionInputFormat`
    itself can be serialized using Java serialization, because it takes care
    of serializing its data collection on its own.
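
    Schematically, that pattern looks like the following (a simplified
    sketch of the idea, not the actual Flink source; the stream wrapper
    names are the ones from current Flink):

        import java.io.{ObjectInputStream, ObjectOutputStream}
        import org.apache.flink.api.common.typeutils.TypeSerializer
        import org.apache.flink.core.memory.{DataInputViewStreamWrapper, DataOutputViewStreamWrapper}

        // Java-serializable holder whose elements are written with a Flink
        // TypeSerializer (for generic types: Kryo) instead of plain Java
        // serialization.
        class SerializedCollectionHolder[T](
            @transient private var elements: Seq[T],
            serializer: TypeSerializer[T]) extends Serializable {

          private def writeObject(out: ObjectOutputStream): Unit = {
            out.defaultWriteObject()         // writes the serializer itself
            out.writeInt(elements.size)
            val view = new DataOutputViewStreamWrapper(out)
            elements.foreach(serializer.serialize(_, view))
          }

          private def readObject(in: ObjectInputStream): Unit = {
            in.defaultReadObject()           // restores the serializer
            val size = in.readInt()
            val view = new DataInputViewStreamWrapper(in)
            elements = Seq.fill(size)(serializer.deserialize(view))
          }
        }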
    
    However, I've just checked and noticed that Flink currently does not
    allow you to specify default serializers for Kryo. Once it does, you can
    do the same as for the spark bindings. I'll fix this.
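
    Once that's possible, the flink bindings could do something along these
    lines (reusing the `VectorKryoSerializer` sketched above; treat
    `addDefaultKryoSerializer` as an assumption about the API the fix will
    add):

        import org.apache.flink.api.scala.ExecutionEnvironment
        import org.apache.mahout.math.{DenseVector, Vector}

        val env = ExecutionEnvironment.getExecutionEnvironment

        // Exact-type registration works today, but has to be repeated for
        // every concrete implementation (DenseVector, MatrixVectorView, ...):
        env.getConfig.registerTypeWithKryoSerializer(
          classOf[DenseVector], classOf[VectorKryoSerializer])

        // A default serializer would cover the whole Vector hierarchy at once:
        env.getConfig.addDefaultKryoSerializer(
          classOf[Vector], classOf[VectorKryoSerializer])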


> Adding support for Apache Flink as a backend for the Mahout DSL
> ---------------------------------------------------------------
>
>                 Key: MAHOUT-1570
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1570
>             Project: Mahout
>          Issue Type: Improvement
>            Reporter: Till Rohrmann
>            Assignee: Alexey Grigorev
>              Labels: DSL, flink, scala
>             Fix For: 0.11.1
>
>
> With the finalized abstraction of the Mahout DSL plans from the backend 
> operations (MAHOUT-1529), it should be possible to integrate further backends 
> for the Mahout DSL. Apache Flink would be a suitable candidate for an 
> execution backend. 
> With respect to the implementation, the biggest difference between Spark and 
> Flink at the moment is probably the incremental rollout of plans, which is 
> triggered by Spark's actions and which is not supported by Flink yet. 
> However, the Flink community is working on this issue. For the moment, it 
> should be possible to circumvent this problem by writing intermediate results 
> required by an action to HDFS and reading them back from there (see the 
> sketch below). 
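
A minimal sketch of that HDFS round trip with Flink's batch API (the paths,
types, and job are illustrative, not taken from the flink bindings):

    import org.apache.flink.api.scala._

    object CheckpointToHdfs {
      def main(args: Array[String]): Unit = {
        val env = ExecutionEnvironment.getExecutionEnvironment

        // An intermediate result that an "action" would need.
        val intermediate: DataSet[(Int, Double)] =
          env.fromElements(1 -> 1.0, 2 -> 4.0).map { case (k, v) => (k, v * 2) }

        // Materialize it to HDFS and run the plan up to this point ...
        intermediate.writeAsCsv("hdfs:///tmp/mahout-intermediate")
        env.execute("materialize intermediate result")

        // ... then read it back and continue with a fresh plan.
        val resumed = env.readCsvFile[(Int, Double)]("hdfs:///tmp/mahout-intermediate")
        println(resumed.count()) // count() triggers the second execution
      }
    }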



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
