[ 
https://issues.apache.org/jira/browse/FLINK-838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14089930#comment-14089930
 ] 

Fabian Hueske commented on FLINK-838:
-------------------------------------

I created a working prototype to support custom partitioner, sorter, and 
grouping comparators for Hadoop ReduceFunctions.

>From an API point-of-view, the integration is quite nice. You call 
>{{DataSet.reduce(HadoopReduceFunction)}} and the system will automatically 
>extract the partitioner, sorter, and grouper from the Hadoop jobConf, i.e., 
>the user does not need to define the grouping key. This might be not the 
>optimal solution if only the function (and not the partitioner, sorter, 
>grouper) should be used, however, we can extend the API for this case.

This ease-of-use comes at a price. Internally, we have a special 
HadoopReduceDriver at the execution level, which sorts the data with the 
(wrapped) custom sort comparator and groups it with the (wrapped) custom 
grouping comparator. The partitioning is done similar to the KeySelector in the 
regular API. We also need special operators (java-api and generic), an 
optimizer node, driver strategy, etc. for the HadoopReduce. These are only a 
few lines of code (due to the fantastic design of the optimizer!!!) but still a 
lot of classes only to support Hadoop ReduceFunctions.

A few things haven't been done yet, such as proper resource allocation (the 
sorter takes always 16MB or so) and Combiners do not work either (I guess we 
need the custom sorter and comparator there as well. This should be checked). 
Code needs to be cleaned as well, but it is compiling and all tests pass.

I pushed my code into my repo at: 
[https://github.com/fhueske/incubator-flink/commit/f8c5ff07936580d529f52a8c7907f3f1b5b8229b].
 

I would like to get some feedback on the prototype. Should I continue or do 
some things differently? 
If we do not support these custom partitioners, sorter, groupers, I think we do 
not need to continue with the HadoopCompatibility mode because any MR job with 
custom data types will fail...

[~StephanEwen]: Any comments?
Artem: can you check which sorters and groupers are used for combining within 
the Hadoop code. I saw something about a special CombineGrouper in the JobConf.

> GSoC Summer Project: Implement full Hadoop Compatibility Layer for 
> Stratosphere
> -------------------------------------------------------------------------------
>
>                 Key: FLINK-838
>                 URL: https://issues.apache.org/jira/browse/FLINK-838
>             Project: Flink
>          Issue Type: Improvement
>            Reporter: GitHub Import
>              Labels: github-import
>             Fix For: pre-apache
>
>
> This is a meta issue for tracking @atsikiridis progress with implementing a 
> full Hadoop Compatibliltiy Layer for Stratosphere.
> Some documentation can be found in the Wiki: 
> https://github.com/stratosphere/stratosphere/wiki/%5BGSoC-14%5D-A-Hadoop-abstraction-layer-for-Stratosphere-(Project-Map-and-Notes)
> As well as the project proposal: 
> https://github.com/stratosphere/stratosphere/wiki/GSoC-2014-Project-Proposal-Draft-by-Artem-Tsikiridis
> Most importantly, there is the following **schedule**:
> *19 May - 27 June (Midterm)*
> 1) Work on the Hadoop tasks, their Context and the mapping of Hadoop's 
> Configuration to the one of Stratosphere. By successfully bridging the Hadoop 
> tasks with Stratosphere, we already cover the most basic Hadoop Jobs. This 
> can be determined by running some popular Hadoop examples on Stratosphere 
> (e.g. WordCount, k-means, join) (4 - 5 weeks)
> 2) Understand how the running of these jobs works (e.g. command line 
> interface) for the wrapper. Implement how will the user run them. (1 - 2 
> weeks).
> *27 June - 11 August*
> 1) Continue wrapping more "advanced" Hadoop Interfaces (Comparators, 
> Partitioners, Distributed Cache etc.) There are quite a few interfaces and it 
> will be a challenge to support all of them. (5 full weeks)
> 2) Profiling of the application and optimizations (if applicable)
> *11 August - 18 August*
> Write documentation on code, write a README with care and add more 
> unit-tests. (1 week)
> ---------------- Imported from GitHub ----------------
> Url: https://github.com/stratosphere/stratosphere/issues/838
> Created by: [rmetzger|https://github.com/rmetzger]
> Labels: core, enhancement, parent-for-major-feature, 
> Milestone: Release 0.7 (unplanned)
> Created at: Tue May 20 10:11:34 CEST 2014
> State: open



--
This message was sent by Atlassian JIRA
(v6.2#6252)

Reply via email to