[
https://issues.apache.org/jira/browse/FLINK-838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14064729#comment-14064729
]
Fabian Hueske commented on FLINK-838:
-------------------------------------
Yes, the notion and scheduling of tasks in Flink differ from Hadoop's.
A Hadoop setup offers a fixed number of Map and Reduce Slots which are filled
with Map and Reduce tasks. Each InputSplit translates to one MapTask, the
number of ReduceTasks can be freely chosen. If a job consists of more tasks
than available slots, they are scheduled into slots as these become available.
In Flink, the number of tasks is equal to the number of concurrently running
processing threads. InputSplits are lazily assigned to DataSourceTasks, i.e.,
one DataSourceTask can read more than one InputSplit. Due to Flink's pipelined
processing model it is not (yet) possible to put data aside to process it later
as Hadoop does.
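To illustrate the lazy-assignment point above, here is a minimal sketch (hypothetical names, not the actual Flink API): source tasks pull splits from a shared provider until none remain, so one task may end up reading many splits while the number of running tasks stays fixed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

// Hypothetical sketch of lazy InputSplit assignment; Split, SplitProvider
// and runSourceTask are illustrative stand-ins, not Flink classes.
public class LazySplitAssignment {
    // Stands in for an InputSplit; carries only an id here.
    record Split(int id) {}

    // Stands in for the master handing out splits on request.
    static class SplitProvider {
        private final Queue<Split> splits = new ConcurrentLinkedQueue<>();
        SplitProvider(int numSplits) {
            for (int i = 0; i < numSplits; i++) splits.add(new Split(i));
        }
        Split nextSplit() { return splits.poll(); } // null when exhausted
    }

    // One "DataSourceTask": keeps requesting splits until the provider is empty.
    static List<Integer> runSourceTask(SplitProvider provider) {
        List<Integer> processed = new ArrayList<>();
        for (Split s = provider.nextSplit(); s != null; s = provider.nextSplit()) {
            processed.add(s.id()); // "read" the split
        }
        return processed;
    }

    public static void main(String[] args) {
        SplitProvider provider = new SplitProvider(5);
        // Two tasks, five splits: splits are not bound 1:1 to tasks.
        List<Integer> task1 = runSourceTask(provider);
        List<Integer> task2 = runSourceTask(provider);
        System.out.println(task1.size() + task2.size()); // 5 splits total
    }
}
```

The contrast with Hadoop is that here the task count is chosen first and splits flow to whichever task asks next, instead of each split spawning its own MapTask.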
If a Hadoop job requires, say, 1000 ReduceTasks, which would be executed
using the available ReduceSlots, this would (naively) translate to 1000
concurrently running tasks in Flink, which might not work so well (depending on
cluster capacity).
This means that we need to come up with some way to choose the number of
MapTasks (= DataSourceTasks) and ReduceTasks. I would propose adding parameters
which emulate Hadoop's number-of-slots parameters.
These would serve as an upper bound in case a Job requires more Map or Reduce
tasks.
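The proposed upper bound could behave like the following sketch (the parameter semantics are an assumption of this comment, not an existing Flink config key): a job may request any number of tasks, but the effective parallelism is capped at the configured slot-style limit.

```java
// Hypothetical sketch: "maxSlots" models the proposed
// number-of-slots-style parameter, not an actual Flink setting.
public class SlotBoundedParallelism {
    static int effectiveParallelism(int requestedTasks, int maxSlots) {
        if (requestedTasks <= 0 || maxSlots <= 0) {
            throw new IllegalArgumentException("counts must be positive");
        }
        // Never run more concurrent tasks than the configured bound allows.
        return Math.min(requestedTasks, maxSlots);
    }

    public static void main(String[] args) {
        // A Hadoop job asking for 1000 ReduceTasks on a cluster configured
        // with 64 reduce-style slots would run 64 concurrent Flink tasks.
        System.out.println(effectiveParallelism(1000, 64)); // 64
        // A smaller request stays below the bound.
        System.out.println(effectiveParallelism(10, 64));   // 10
    }
}
```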
What do you think?
> GSoC Summer Project: Implement full Hadoop Compatibility Layer for
> Stratosphere
> -------------------------------------------------------------------------------
>
> Key: FLINK-838
> URL: https://issues.apache.org/jira/browse/FLINK-838
> Project: Flink
> Issue Type: Improvement
> Reporter: GitHub Import
> Labels: github-import
> Fix For: pre-apache
>
>
> This is a meta issue for tracking @atsikiridis's progress with implementing a
> full Hadoop Compatibility Layer for Stratosphere.
> Some documentation can be found in the Wiki:
> https://github.com/stratosphere/stratosphere/wiki/%5BGSoC-14%5D-A-Hadoop-abstraction-layer-for-Stratosphere-(Project-Map-and-Notes)
> As well as the project proposal:
> https://github.com/stratosphere/stratosphere/wiki/GSoC-2014-Project-Proposal-Draft-by-Artem-Tsikiridis
> Most importantly, there is the following **schedule**:
> *19 May - 27 June (Midterm)*
> 1) Work on the Hadoop tasks, their Context and the mapping of Hadoop's
> Configuration to the one of Stratosphere. By successfully bridging the Hadoop
> tasks with Stratosphere, we already cover the most basic Hadoop Jobs. This
> can be determined by running some popular Hadoop examples on Stratosphere
> (e.g. WordCount, k-means, join) (4 - 5 weeks)
> 2) Understand how the running of these jobs works (e.g. command line
> interface) for the wrapper. Implement how the user will run them. (1 - 2
> weeks).
> *27 June - 11 August*
> 1) Continue wrapping more "advanced" Hadoop Interfaces (Comparators,
> Partitioners, Distributed Cache etc.) There are quite a few interfaces and it
> will be a challenge to support all of them. (5 full weeks)
> 2) Profiling of the application and optimizations (if applicable)
> *11 August - 18 August*
> Write documentation on code, write a README with care and add more
> unit-tests. (1 week)
> ---------------- Imported from GitHub ----------------
> Url: https://github.com/stratosphere/stratosphere/issues/838
> Created by: [rmetzger|https://github.com/rmetzger]
> Labels: core, enhancement, parent-for-major-feature,
> Milestone: Release 0.7 (unplanned)
> Created at: Tue May 20 10:11:34 CEST 2014
> State: open
--
This message was sent by Atlassian JIRA
(v6.2#6252)