[jira] [Commented] (FLINK-838) GSoC Summer Project: Implement full Hadoop Compatibility Layer for Stratosphere

Fabian Hueske (JIRA) Tue, 29 Jul 2014 03:40:34 -0700

    [ 
https://issues.apache.org/jira/browse/FLINK-838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14077597#comment-14077597
 ]


Fabian Hueske commented on FLINK-838:
-------------------------------------

While commenting the PR, I thought about some additional features.

1. There is a way to implement Hadoop jobs with multiple inputs, e.g., 
https://hadoop.apache.org/docs/stable/api/org/apache/hadoop/mapreduce/lib/input/MultipleInputs.html.
We should support this by adding multiple sources and union them before the 
mapper. There might be some additional logic required to determine which source 
a tuple came from. This needs to be checked.

2. The Flink Java API provides `DataSet.runOperation(CustomUnaryOperation<T, X> 
operation)` to perform a complex (i.e., multi-operator) operation on a 
`DataSet`. We could provide a HadoopJobOperation, that takes the `DataSet` as 
input, applies the Hadoop Job and provides the result as a new `DataSet`. In 
this case, the Input and OutputFormats of the job would be ignored and only 
`processing` steps be executed.

What do you guys think?


> GSoC Summer Project: Implement full Hadoop Compatibility Layer for 
> Stratosphere
> -------------------------------------------------------------------------------
>
>                 Key: FLINK-838
>                 URL: https://issues.apache.org/jira/browse/FLINK-838
>             Project: Flink
>          Issue Type: Improvement
>            Reporter: GitHub Import
>              Labels: github-import
>             Fix For: pre-apache
>
>
> This is a meta issue for tracking @atsikiridis progress with implementing a 
> full Hadoop Compatibliltiy Layer for Stratosphere.
> Some documentation can be found in the Wiki: 
> https://github.com/stratosphere/stratosphere/wiki/%5BGSoC-14%5D-A-Hadoop-abstraction-layer-for-Stratosphere-(Project-Map-and-Notes)
> As well as the project proposal: 
> https://github.com/stratosphere/stratosphere/wiki/GSoC-2014-Project-Proposal-Draft-by-Artem-Tsikiridis
> Most importantly, there is the following **schedule**:
> *19 May - 27 June (Midterm)*
> 1) Work on the Hadoop tasks, their Context and the mapping of Hadoop's 
> Configuration to the one of Stratosphere. By successfully bridging the Hadoop 
> tasks with Stratosphere, we already cover the most basic Hadoop Jobs. This 
> can be determined by running some popular Hadoop examples on Stratosphere 
> (e.g. WordCount, k-means, join) (4 - 5 weeks)
> 2) Understand how the running of these jobs works (e.g. command line 
> interface) for the wrapper. Implement how will the user run them. (1 - 2 
> weeks).
> *27 June - 11 August*
> 1) Continue wrapping more "advanced" Hadoop Interfaces (Comparators, 
> Partitioners, Distributed Cache etc.) There are quite a few interfaces and it 
> will be a challenge to support all of them. (5 full weeks)
> 2) Profiling of the application and optimizations (if applicable)
> *11 August - 18 August*
> Write documentation on code, write a README with care and add more 
> unit-tests. (1 week)
> ---------------- Imported from GitHub ----------------
> Url: https://github.com/stratosphere/stratosphere/issues/838
> Created by: [rmetzger|https://github.com/rmetzger]
> Labels: core, enhancement, parent-for-major-feature, 
> Milestone: Release 0.7 (unplanned)
> Created at: Tue May 20 10:11:34 CEST 2014
> State: open



--
This message was sent by Atlassian JIRA
(v6.2#6252)

[jira] [Commented] (FLINK-838) GSoC Summer Project: Implement full Hadoop Compatibility Layer for Stratosphere

Reply via email to