Author: smarthi
Date: Fri Apr 8 18:40:36 2016
New Revision: 1738281
URL: http://svn.apache.org/viewvc?rev=1738281&view=rev
Log:
MAHOUT-1779: Brief overview of Mahout Flink Engine
Added:
mahout/site/mahout_cms/trunk/content/users/flinkbindings/
mahout/site/mahout_cms/trunk/content/users/flinkbindings/flink-internals.mdtext
Added:
mahout/site/mahout_cms/trunk/content/users/flinkbindings/flink-internals.mdtext
URL:
http://svn.apache.org/viewvc/mahout/site/mahout_cms/trunk/content/users/flinkbindings/flink-internals.mdtext?rev=1738281&view=auto
==============================================================================
---
mahout/site/mahout_cms/trunk/content/users/flinkbindings/flink-internals.mdtext
(added)
+++
mahout/site/mahout_cms/trunk/content/users/flinkbindings/flink-internals.mdtext
Fri Apr 8 18:40:36 2016
@@ -0,0 +1,30 @@
+
+#Introduction
+
+This document provides an overview of how the Mahout Samsara environment is
implemented over the Apache Flink backend engine. It also describes the code
layout of the Flink backend engine, the source code for which can be found
under the flink/ directory in the Mahout codebase.
+
+Apache Flink is a distributed big data streaming engine that supports both
Streaming and Batch interfaces. Batch processing is an extension of Flink's
Stream processing engine.
+
+The Mahout Flink integration presently supports Flink's batch processing
capabilities leveraging the DataSet API.
+
+The Mahout DRM, or Distributed Row Matrix, is an abstraction for storing a
large matrix of numbers in-memory in a cluster by distributing logical rows
among servers. Mahout's Scala DSL defines an abstract API on DRMs, and each
backend engine provides its own implementation of that API. An example is the
Spark backend engine. Each engine has its own design for mapping the abstract
API onto its data model and provides implementations of the algebraic operators
over that mapping.
+
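The idea of distributing logical rows among servers behind an abstract,
engine-neutral API can be illustrated with a small toy model. This is only a
sketch under stated assumptions: `DrmSketch`, `Engine`, `LocalEngine`,
`parallelize`, `timesVector` and `collect` are hypothetical names invented for
this illustration and are not Mahout's actual API, and the round-robin row
placement is an assumption made for simplicity.

```scala
// Hypothetical toy model of a distributed row matrix; not Mahout's API.
object DrmSketch {
  // A logical row: (row key, row values).
  type Row = (Int, Vector[Double])

  // Abstract API that every backend engine would have to implement.
  trait Engine {
    type Drm // backend-specific representation of the distributed matrix
    def parallelize(rows: Seq[Vector[Double]], numPartitions: Int): Drm
    def timesVector(drm: Drm, x: Vector[Double]): Drm
    def collect(drm: Drm): Seq[Vector[Double]]
  }

  // One possible backend: each partition (standing in for a server)
  // is a plain in-memory Vector of rows.
  object LocalEngine extends Engine {
    type Drm = Vector[Vector[Row]]

    def parallelize(rows: Seq[Vector[Double]], numPartitions: Int): Drm = {
      require(numPartitions > 0, "numPartitions must be positive")
      val keyed = rows.zipWithIndex.map { case (r, i) => (i, r) }
      // Assumed placement: round-robin assignment of rows to partitions.
      Vector.tabulate(numPartitions) { p =>
        keyed.filter(_._1 % numPartitions == p).toVector
      }
    }

    // A %*% x, computed independently inside each partition.
    def timesVector(drm: Drm, x: Vector[Double]): Drm =
      drm.map(_.map { case (k, row) =>
        (k, Vector(row.zip(x).map { case (a, b) => a * b }.sum))
      })

    // Gather all partitions and restore the logical row order.
    def collect(drm: Drm): Seq[Vector[Double]] =
      drm.flatten.sortBy(_._1).map(_._2)
  }

  def main(args: Array[String]): Unit = {
    val a = LocalEngine.parallelize(
      Seq(Vector(1.0, 2.0), Vector(3.0, 4.0), Vector(5.0, 6.0)),
      numPartitions = 2)
    val ax = LocalEngine.collect(LocalEngine.timesVector(a, Vector(1.0, 1.0)))
    println(ax) // prints Vector(Vector(3.0), Vector(7.0), Vector(11.0))
  }
}
```

In this sketch a second backend would only need to supply another `Engine`
implementation with its own `Drm` type; code written against the trait would
stay unchanged.
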
+#Flink Environment Engine
+
+The Flink backend implements the abstract DRM as a Flink DataSet. A Flink job
runs in the context of an ExecutionEnvironment (from the Flink Batch processing
API).
+
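As a rough illustration of the paragraph above, the following sketch creates a
DRM from an in-core matrix inside a Flink ExecutionEnvironment and evaluates a
Samsara expression on it. It assumes that the Mahout math-scala and
flinkbindings modules and Flink's Scala batch API are on the classpath, and
that the Flink context class is `FlinkDistributedContext` wrapping an
`ExecutionEnvironment`; the object name `FlinkDrmSketch` and the sample matrix
are made up for the example. Treat it as a sketch to adapt, not as the
reference usage.

```scala
import org.apache.flink.api.scala.ExecutionEnvironment
import org.apache.mahout.flinkbindings.FlinkDistributedContext
import org.apache.mahout.math.drm._
import org.apache.mahout.math.drm.RLikeDrmOps._
import org.apache.mahout.math.scalabindings._
import org.apache.mahout.math.scalabindings.RLikeOps._

// Illustrative object name; not part of the Mahout codebase.
object FlinkDrmSketch {
  def main(args: Array[String]): Unit = {
    // The Flink batch job runs in the context of an ExecutionEnvironment.
    val env = ExecutionEnvironment.getExecutionEnvironment

    // Assumption: the Mahout distributed context for Flink wraps that environment.
    implicit val ctx: FlinkDistributedContext = new FlinkDistributedContext(env)

    // Small in-core matrix (made-up values), distributed as a DRM.
    val inCoreA = dense((1, 2), (3, 4), (5, 6))
    val drmA = drmParallelize(inCoreA, numPartitions = 2)

    // A' A expressed in the Samsara DSL; collect brings the result in-core.
    val inCoreAtA = (drmA.t %*% drmA).collect
    println(inCoreAtA)
  }
}
```

The same DSL expression is meant to be engine-neutral: only the implicit
context decides which backend evaluates it.
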
+#Source Layout
+
+Within mahout.git, the top-level directory flink/ holds all the source code
for the Flink backend engine. Sections of code that interface with the rest of
the Mahout components are in Scala, and sections of the code that interface
with the Flink DataSet API and implement algebraic operators are in Java. Here
is a brief overview of the functionality found within the flink/ folder.
+
+flink/ - top-level directory containing all Flink-related code
+
+flink/src/main/scala/org/apache/mahout/flinkbindings/blas/*.scala - Physical
operator code for the Samsara DSL algebra
+
+flink/src/main/scala/org/apache/mahout/flinkbindings/drm/*.scala - Flink
DataSet DRM and broadcast implementation
+
+flink/src/main/scala/org/apache/mahout/flinkbindings/io/*.scala - Read / Write
between DRMDataSet and files on HDFS
+
+flink/src/main/scala/org/apache/mahout/flinkbindings/FlinkEngine.scala - DSL
operator graph evaluator and various abstract API implementations for a
distributed engine.
+
+