Hello,

thank you for the fast reply. We are glad you like the idea!


As a next step, we will focus on implementing an end-to-end integration based on 
your suggestions. We think this initial integration is a good starting point for 
further discussions based on the concrete implementation in the pull request.


Best

Andreas

________________________________
From: Matthias Boehm <[email protected]>
Sent: Thursday, March 3, 2016 06:44
To: [email protected]
Subject: Re: Add Apache Flink as new backend


Thanks, guys, for sharing the details of this prototype. In general, I really 
like the idea of having a Flink backend in SystemML. We just need to structure 
the code (similar to our Spark backend) in a way that Flink libraries are not 
necessarily required when running in Spark or MapReduce execution modes.
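
To make that concrete, here is a minimal sketch of the kind of isolation I mean 
(the class names are made up for this email, not the actual SystemML code): as 
long as Flink-specific classes are referenced only in the branch taken in Flink 
execution mode, a standard JVM typically does not resolve or load them when 
running in Spark or MapReduce mode.

// Sketch only; ExecutionContextFactory, FlinkExecutionContext, etc. are
// hypothetical placeholders standing in for the real backend context classes.
public class ExecutionContextFactory {

    public enum ExecType { CP, MR, SPARK, FLINK }

    // placeholder context types for illustration
    static class ExecutionContext { }
    static class SparkExecutionContext extends ExecutionContext { }
    static class FlinkExecutionContext extends ExecutionContext { }

    public static ExecutionContext create(ExecType et) {
        switch (et) {
            case SPARK:
                return new SparkExecutionContext();
            case FLINK:
                // only this branch references Flink-related classes, so they are
                // typically not loaded in Spark or MapReduce execution modes
                return new FlinkExecutionContext();
            default:
                return new ExecutionContext(); // local (CP) or MapReduce execution
        }
    }
}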

To answer your questions in detail:

1) Shared Functionality: I would recommend reusing the upper levels (i.e., 
language, hops, lops, etc.) and the core block operations but keeping the 
instructions (and everything that accesses Flink APIs) independent. Yes, this 
separation comes at the cost of code duplication, but it allows running one 
backend without the need for the libraries of the other backends. Note that we 
did the same for our Spark backend, which allows us to run the same jar in old 
MapReduce v1 environments where these libraries are not available at runtime. 
Down the road, we might consolidate common functionality such as the runtime 
maintenance of matrix characteristics.
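
As a very rough illustration of this separation (the class and method names 
below are hypothetical, not SystemML's actual API), a Flink instruction 
hierarchy could mirror the Spark one: the DataSet plumbing lives in its own 
flink package, while the per-block math is delegated to the shared core runtime.

// Hypothetical base class, analogous in spirit to the Spark instruction hierarchy.
public abstract class FLInstruction {

    protected final String opcode;     // e.g. "tsmm", "mapmm"
    protected final String instString; // original instruction string, kept for debugging

    protected FLInstruction(String opcode, String instString) {
        this.opcode = opcode;
        this.instString = instString;
    }

    // Concrete subclasses (e.g. a TsmmFLInstruction) would translate the operation
    // into Flink DataSet transformations and reuse the core block operations for the
    // actual matrix computations.
    public abstract void processInstruction(Object executionContext) throws Exception;

    public String getOpcode() { return opcode; }
}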

2) Execution Modes: Yes, please add two new execution modes in 
DMLScript.RUNTIME_PLATFORM and one new execution type in Lop.ExecType. Once 
this is done, you can already run end-to-end scripts with '-exec hybrid_flink'. 
Of course, we can have more detailed discussions about how and when Flink 
operators should be selected in hybrid_flink mode. As a start, I 
would recommend our default heuristic of compiling Flink operators whenever the 
memory estimate of an operation exceeds the local memory budget of the 
driver/client process.
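
For illustration, a minimal sketch of the suggested extensions and the 
operator-selection heuristic (the enum values and the helper method below are 
written out for this email; the real fields live in DMLScript.RUNTIME_PLATFORM 
and Lop.ExecType):

public class FlinkCompilerSketch {

    // new values to add to DMLScript.RUNTIME_PLATFORM (existing values omitted here)
    enum RUNTIME_PLATFORM { FLINK, HYBRID_FLINK }

    // new value to add to Lop.ExecType; CP stands in for the existing local execution type
    enum ExecType { CP, FLINK }

    // Default heuristic in hybrid_flink mode: compile a Flink operator whenever the
    // operation's memory estimate exceeds the local memory budget of the driver/client.
    static ExecType chooseExecType(double memEstimateBytes, double localMemBudgetBytes) {
        return (memEstimateBytes > localMemBudgetBytes) ? ExecType.FLINK : ExecType.CP;
    }

    public static void main(String[] args) {
        double budget = 1024L * 1024 * 1024; // assume a 1 GB local memory budget
        System.out.println(chooseExecType(8e9, budget)); // large estimate -> FLINK
        System.out.println(chooseExecType(1e6, budget)); // small estimate -> CP
    }
}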

3) Pull Requests: I would recommend multiple stages: (1) initially a minimal 
end-to-end integration, (2) multiple packages of "instruction sets" incl. tests, 
(3) specific rewrites / optimizer extensions, and later (4) continuous 
improvements. For the initial end-to-end integration, I would focus on two or 
three simple yet very important instructions (tsmm, mapmm, mapmmchain), basic 
converter utils, and the basic end-to-end plumbing (execution types, 
serialization, buffer pool, etc.). Having tsmm and mapmm (plus optionally 
mapmmchain) would already allow you to run end-to-end algorithms like LinregDS, 
LinregCG, GLM, L2SVM, PageRank, etc. for common scenarios where only 
transpose-self matrix multiplications or matrix-vector multiplications are 
compiled to distributed operations, while the remaining operators are executed 
in the driver/client and the vectors are small enough to be broadcast to 
mapmm/mapmmchain.
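
To give an idea of the shape of such an instruction, here is a rough sketch of 
tsmm, i.e. t(X) %*% X, expressed with Flink's DataSet API. It is not SystemML 
code: it uses plain double[][] row blocks instead of MatrixBlock and a made-up 
class name. Each row block contributes a partial product t(X_i) %*% X_i, and 
the partial products are summed element-wise.

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.common.functions.ReduceFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;

import java.util.Arrays;

public class TsmmFlinkSketch {

    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // two row blocks of a 4 x 2 input matrix X (plain arrays instead of MatrixBlock)
        DataSet<double[][]> blocks = env.fromElements(
                new double[][] { { 1, 2 }, { 3, 4 } },
                new double[][] { { 5, 6 }, { 7, 8 } });

        DataSet<double[][]> tsmm = blocks
                // per-block partial product t(X_i) %*% X_i
                .map(new MapFunction<double[][], double[][]>() {
                    @Override
                    public double[][] map(double[][] x) {
                        int m = x.length, n = x[0].length;
                        double[][] out = new double[n][n];
                        for (int k = 0; k < m; k++)
                            for (int i = 0; i < n; i++)
                                for (int j = 0; j < n; j++)
                                    out[i][j] += x[k][i] * x[k][j];
                        return out;
                    }
                })
                // element-wise sum of partial products: t(X) %*% X = sum_i t(X_i) %*% X_i
                .reduce(new ReduceFunction<double[][]>() {
                    @Override
                    public double[][] reduce(double[][] a, double[][] b) {
                        for (int i = 0; i < a.length; i++)
                            for (int j = 0; j < a[i].length; j++)
                                a[i][j] += b[i][j];
                        return a;
                    }
                });

        // expected result for this X: [[84.0, 100.0], [100.0, 120.0]]
        System.out.println(Arrays.deepToString(tsmm.collect().get(0)));
    }
}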

4) Refactoring runtime package: Again, don't worry about the refactoring of our 
runtime packages. The focus there is mainly our block runtime: restructuring it 
such that this runtime can be easily packaged as an individual jar and 
distributed/consumed independently of SystemML.

I'm looking forward to many more discussions on this topic.

Regards
Matthias


From: "Kunft, Andreas" <[email protected]>
To: "[email protected]" <[email protected]>
Date: 03/02/2016 11:51 AM
Subject: Add Apache Flink as new backend

________________________________



Hi all,

we are a group of researchers from the Database group (DIMA) at TU Berlin. We 
would like to add Apache Flink as an execution backend to SystemML in addition 
to Hadoop MR and Spark.
To this end, we started implementing a proof of concept consisting of several 
instructions together with the necessary de-/serialization and execution logic. 
You can see the current state of our fork [1], including two test cases that 
show what we currently support [2][3].

For our simple POC implementation, we realized that we had to duplicate a lot of 
functionality (especially from the Spark instructions). We saw that people have 
already raised the topic of refactoring the runtime package [4][5], which would 
potentially make it easier to integrate further backend systems.
Given that this would be a bigger change, it would be helpful to get some input 
from the SystemML community regarding this effort.

In particular, we would like to discuss the following questions:

 * How should we deal with shared functionality between the different backends 
   (Flink, Spark, etc.) to avoid code duplication, especially in instructions, 
   but also to introduce modularity? And is this modularization even desired?
 * How should we integrate Flink into the different runtime modes (Flink-only, 
   Flink-hybrid, etc.)?
 * How should we structure the integration (multiple/single commits)?

We're looking forward to feedback and hope the community likes the idea of 
adding Flink as an execution backend to SystemML.

Best,
Andreas Kunft
Christoph Brücke
Felix Schüler

[1] https://github.com/stratosphere/incubator-systemml/tree/flink-integration
[2] 
https://github.com/stratosphere/incubator-systemml/blob/flink-integration/src/test/java/org/apache/sysml/runtime/instructions/flink/TsmmFLInstructionTest.java
[3] 
https://github.com/stratosphere/incubator-systemml/blob/flink-integration/src/test/java/org/apache/sysml/runtime/instructions/flink/utils/DataSetConverterUtilsTest.java
[4] https://issues.apache.org/jira/browse/SYSTEMML-33
[5] 
https://www.mail-archive.com/search?l=dev%40systemml.incubator.apache.org&q=subject%3A%22Runtime+package+refactoring%22&o=newest&f=1?


