[jira] [Commented] (HAMA-983) Hama runner for DataFlow
    [ https://issues.apache.org/jira/browse/HAMA-983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16009911#comment-16009911 ]

Edward J. Yoon commented on HAMA-983:
-------------------------------------

{code}
# create a new branch inside your directory 'current'
git checkout -b HAMA-983

# ... do some changes to the files ...

# commit changes to the branch
git commit -a -m '[HAMA-983] Hama runner for DataFlow'

# push the branch to the remote
git push origin HAMA-983
{code}

Then go to your GitHub HAMA page and open a pull request.

Hi JongYoon, you can create a new branch as shown above.

> Hama runner for DataFlow
> ------------------------
>
>                 Key: HAMA-983
>                 URL: https://issues.apache.org/jira/browse/HAMA-983
>             Project: Hama
>          Issue Type: Bug
>            Reporter: Edward J. Yoon
>              Labels: gsoc2016
>
> As you already know, Apache Beam provides a unified programming model for both
> batch and streaming inputs.
> The APIs are generally associated with data filtering and transforming, so
> we'll need to implement a data processing runner like
> https://github.com/dapurv5/MapReduce-BSP-Adapter/blob/master/src/main/java/org/apache/hama/mapreduce/examples/WordCount.java
> Also, implementing a similarity join can be fun. According to
> http://www.ruizhang.info/publications/TPDS2015-Heads_Join.pdf, Apache Hama is
> the clear winner compared with Apache Hadoop and Apache Spark.
> Since it consists of transformation, aggregation, and partition computations,
> I think it's possible to implement using the Apache Beam APIs.

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
[jira] [Commented] (HAMA-983) Hama runner for DataFlow
    [ https://issues.apache.org/jira/browse/HAMA-983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15984090#comment-15984090 ]

JongYoon Lim commented on HAMA-983:
-----------------------------------

So, I'll try to create the branch. Thank you :)
[jira] [Commented] (HAMA-983) Hama runner for DataFlow
    [ https://issues.apache.org/jira/browse/HAMA-983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15984007#comment-15984007 ]

Edward J. Yoon commented on HAMA-983:
-------------------------------------

Sorry for the late reply.

{quote}could you create a branch called 'beam_support' on github?{quote}

Sure. Or, you can also create the branch yourself because you're a committer. I can do it this weekend.
[jira] [Commented] (HAMA-983) Hama runner for DataFlow
    [ https://issues.apache.org/jira/browse/HAMA-983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15978003#comment-15978003 ]

JongYoon Lim commented on HAMA-983:
-----------------------------------

Hi Edward, could you create a branch called 'beam_support' on GitHub?

I'm working on this issue again these days, and it looks like it is working. Currently I'm working on it in my local Beam branch, but I think it'd be better to work on Hama's branch until it can support a recent Beam version. (My work is based on Beam's release-0.3.0-incubating, but they have already released version 0.6.0.)

I think working in Hama makes it easier to get reviews and feedback from other developers. After that, it could be contributed to the Beam project.
[jira] [Commented] (HAMA-983) Hama runner for DataFlow
    [ https://issues.apache.org/jira/browse/HAMA-983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15730744#comment-15730744 ]

Edward J. Yoon commented on HAMA-983:
-------------------------------------

cool, let me check.
[jira] [Commented] (HAMA-983) Hama runner for DataFlow
    [ https://issues.apache.org/jira/browse/HAMA-983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15730739#comment-15730739 ]

JongYoon Lim commented on HAMA-983:
-----------------------------------

I added a link to the skeleton code for hama-runner. I added a TranslationContext class for executing the batch job: the results (supersteps) from the translators are added to a list in TranslationContext, and after all translation is done, it executes the supersteps one by one.

However, each result (superstep) I add is an object, not a class. So I'm wondering whether there is an easy way to create the same object in the grooms, because those results (objects) are created on the master. I also wonder whether this approach is correct or not.
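The approach described in this comment (translators append supersteps to a list held by the context, and the list is then run in order) can be sketched in plain Java. This is only an illustration under assumptions: `SketchSuperstep`, `SketchTranslationContext`, and `TranslationContextDemo` are hypothetical names, not Hama or Beam classes, and the steps run locally instead of being shipped to grooms.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical stand-in for a Hama superstep: transforms the records of one stage. */
interface SketchSuperstep {
  List<String> compute(List<String> input);
}

/** Hypothetical context: collects supersteps during translation, then runs them in order. */
class SketchTranslationContext {
  private final List<SketchSuperstep> supersteps = new ArrayList<>();

  /** Called by a translator once per translated transform. */
  void addSuperstep(SketchSuperstep step) {
    supersteps.add(step);
  }

  /** Runs the collected supersteps one by one, feeding each output into the next step. */
  List<String> execute(List<String> input) {
    List<String> current = input;
    for (SketchSuperstep step : supersteps) {
      current = step.compute(current);
    }
    return current;
  }
}

class TranslationContextDemo {
  static List<String> run(List<String> lines) {
    SketchTranslationContext context = new SketchTranslationContext();
    // First "translated transform": split lines into words.
    context.addSuperstep(input -> {
      List<String> words = new ArrayList<>();
      for (String line : input) {
        for (String word : line.split("[^a-zA-Z']+")) {
          if (!word.isEmpty()) {
            words.add(word);
          }
        }
      }
      return words;
    });
    // Second "translated transform": upper-case each word.
    context.addSuperstep(input -> {
      List<String> out = new ArrayList<>();
      for (String word : input) {
        out.add(word.toUpperCase());
      }
      return out;
    });
    return context.execute(lines);
  }

  public static void main(String[] args) {
    System.out.println(run(Arrays.asList("hi there", "hi sue")));
  }
}
```

Keeping the steps in an ordered list makes the execution order explicit, but it also means the context holds live objects, which is exactly why the question about recreating them on the grooms comes up.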
[jira] [Commented] (HAMA-983) Hama runner for DataFlow
    [ https://issues.apache.org/jira/browse/HAMA-983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15730450#comment-15730450 ]

Edward J. Yoon commented on HAMA-983:
-------------------------------------

Here's my skeleton code with an example that counts words. You should implement the HamaPipelineRunner: just translate the pipeline and execute the batch job. I think you can find how to translate the transforms in Flink's code:
https://github.com/dataArtisans/flink-dataflow/blob/aad5d936abd41240f3e15d294ea181fb9cca05e0/runner/src/main/java/com/dataartisans/flink/dataflow/translation/FlinkBatchTransformTranslators.java#L410

{code}
public class WordCountTest {

  static final String[] WORDS_ARRAY = new String[] {
      "hi there", "hi", "hi sue bob", "hi sue", "", "bob hi" };

  static final List<String> WORDS = Arrays.asList(WORDS_ARRAY);

  static final String[] COUNTS_ARRAY = new String[] {
      "hi: 5", "there: 1", "sue: 2", "bob: 2" };

  /**
   * Example test that tests a PTransform by using an in-memory input and
   * inspecting the output.
   */
  @Test
  @Category(RunnableOnService.class)
  public void testCountWords() throws Exception {
    HamaOptions options = PipelineOptionsFactory.as(HamaOptions.class);
    options.setRunner(HamaPipelineRunner.class);
    Pipeline p = Pipeline.create(options);

    PCollection<String> input = p.apply(Create.of(WORDS).withCoder(
        StringUtf8Coder.of()));

    PCollection<String> output = input
        .apply(new WordCount())
        .apply(MapElements.via(new FormatAsTextFn()));
        //.apply(TextIO.Write.to("/tmp/result"));

    PAssert.that(output).containsInAnyOrder(COUNTS_ARRAY);
    p.run().waitUntilFinish();
  }

  public static class WordCount extends
      PTransform<PCollection<String>, PCollection<KV<String, Long>>> {
    private static final long serialVersionUID = 1L;

    @Override
    public PCollection<KV<String, Long>> apply(PCollection<String> lines) {

      // Convert lines of text into individual words.
      PCollection<String> words = lines.apply(ParDo.of(new DoFn<String, String>() {
        private static final long serialVersionUID = 1L;
        private final Aggregator<Long, Long> emptyLines =
            createAggregator("emptyLines", new Sum.SumLongFn());

        @ProcessElement
        public void processElement(ProcessContext c) {
          if (c.element().trim().isEmpty()) {
            emptyLines.addValue(1L);
          }

          // Split the line into words.
          String[] words = c.element().split("[^a-zA-Z']+");

          // Output each word encountered into the output PCollection.
          for (String word : words) {
            if (!word.isEmpty()) {
              c.output(word);
            }
          }
        }
      }));

      // Count the number of times each word occurs.
      PCollection<KV<String, Long>> wordCounts =
          words.apply(Count.<String>perElement());

      return wordCounts;
    }
  }

  // TODO
  public static class HamaPipelineRunner extends
      PipelineRunner<HamaPipelineResult> {

    public static HamaPipelineRunner fromOptions(PipelineOptions x) {
      return new HamaPipelineRunner();
    }

    @Override
    public <Output extends POutput, Input extends PInput> Output apply(
        PTransform<Input, Output> transform, Input input) {
      return super.apply(transform, input);
    }

    @Override
    public HamaPipelineResult run(Pipeline pipeline) {
      // TODO Auto-generated method stub
      System.out.println("Executing pipeline using HamaPipelineRunner.");

      // TODO you need to translate pipeline to Hama program
      // and execute pipeline
      // return the result

      return null;
    }
  }

  public class HamaPipelineResult implements PipelineResult {

    @Override
    public State getState() {
      // TODO Auto-generated method stub
      return null;
    }

    @Override
    public State cancel() throws IOException {
      // TODO Auto-generated method stub
      return null;
    }

    @Override
    public State waitUntilFinish(Duration duration) {
      // TODO Auto-generated method stub
      return null;
    }

    @Override
    public State waitUntilFinish() {
      // TODO Auto-generated method stub
      return null;
    }

    @Override
    public <T> AggregatorValues<T> getAggregatorValues(
        Aggregator<?, T> aggregator) throws AggregatorRetrievalException {
      // TODO Auto-generated method stub
      return null;
    }

    @Override
    public MetricResults metrics() {
      // TODO Auto-generated method stub
      return null;
    }
  }

  public static interface HamaOptions extends PipelineOptions {
  }
}
{code}
[jira] [Commented] (HAMA-983) Hama runner for DataFlow
    [ https://issues.apache.org/jira/browse/HAMA-983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15730085#comment-15730085 ]

JongYoon Lim commented on HAMA-983:
-----------------------------------

Hi Edward. First of all, sorry for the long delay. This is the process for testing beam-hama-runner.

1. Define a test ParDo, for example, as below.
{code}
PCollection> output = input.apply("test", ParDo.of(new DoFn, KV>() {
  @ProcessElement
  public void processElement(ProcessContext c) {
    for (String word : c.element().toString().split("[^a-zA-Z']+")) {
      if (!word.isEmpty()) {
        c.output(KV.of(new Text(word), new LongWritable(11)));
      }
    }
  }
}));
{code}

2. For the translation of ParDo, I can pass the ParDo's function to DoFnFunction, which is a subclass of Superstep and has an OldDoFn.ProcessContext. Here, I'd like to create the dofn instance in the Hama cluster after finishing all translation, and I'm not sure how I can do that easily...
{code}
private static TransformTranslator> parDo() {
  return new TransformTranslator>() {
    @Override
    public void translate(final ParDo.Bound transform, TranslationContext context) {
      //context.addSuperstep(TestSuperStep.class);
      DoFnFunction dofn = new DoFnFunction((OldDoFn) transform.getFn());
      //context.addSuperstep(dofn.getClass());
    }
  };
}
{code}
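One common way to get "the same instance" onto a worker is to serialize the configured function object on the master and deserialize it on the worker, which is possible here because Beam requires DoFn instances to be serializable. The following is a minimal sketch using only standard Java serialization. `SuffixFn` and `FnShipping` are hypothetical names, and the idea of carrying the encoded string in the job configuration is an assumption, not something the thread confirms for Hama.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.util.Base64;

/** Hypothetical stand-in for a user function such as a DoFn; it must be Serializable. */
class SuffixFn implements Serializable {
  private static final long serialVersionUID = 1L;
  private final String suffix;

  SuffixFn(String suffix) {
    this.suffix = suffix;
  }

  String apply(String input) {
    return input + suffix;
  }
}

class FnShipping {
  /** Master side: encode the configured instance as a plain string. */
  static String encode(Serializable fn) {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
      out.writeObject(fn);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return Base64.getEncoder().encodeToString(bytes.toByteArray());
  }

  /** Worker side: rebuild an equivalent instance from the encoded string. */
  static Object decode(String encoded) {
    byte[] raw = Base64.getDecoder().decode(encoded);
    try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(raw))) {
      return in.readObject();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    } catch (ClassNotFoundException e) {
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    String shipped = encode(new SuffixFn("!"));
    SuffixFn copy = (SuffixFn) decode(shipped);
    System.out.println(copy.apply("hi"));
  }
}
```

With this scheme the superstep class registered with the job can stay generic, and only the encoded function differs per translated transform.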
[jira] [Commented] (HAMA-983) Hama runner for DataFlow
    [ https://issues.apache.org/jira/browse/HAMA-983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15672033#comment-15672033 ]

Edward J. Yoon commented on HAMA-983:
-------------------------------------

I don't understand exactly, can you please share your progress?
[jira] [Commented] (HAMA-983) Hama runner for DataFlow
    [ https://issues.apache.org/jira/browse/HAMA-983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15660806#comment-15660806 ]

JongYoon Lim commented on HAMA-983:
-----------------------------------

Hi Edward, could you give me some idea of the recommended way to create, on a groom server, the same instance as one that was created by a translator (Beam) on the master?
[jira] [Commented] (HAMA-983) Hama runner for DataFlow
    [ https://issues.apache.org/jira/browse/HAMA-983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15502033#comment-15502033 ]

Edward J. Yoon commented on HAMA-983:
-------------------------------------

>> once PoC is done

Great. If you need any help, feel free to let me know :-)
[jira] [Commented] (HAMA-983) Hama runner for DataFlow
    [ https://issues.apache.org/jira/browse/HAMA-983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15502030#comment-15502030 ]

JongYoon Lim commented on HAMA-983:
-----------------------------------

Yes, I can add a PR link to https://issues.apache.org/jira/browse/BEAM-612 once PoC is done.
[jira] [Commented] (HAMA-983) Hama runner for DataFlow
    [ https://issues.apache.org/jira/browse/HAMA-983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15502017#comment-15502017 ]

Edward J. Yoon commented on HAMA-983:
-------------------------------------

Why don't we contribute this feature to Apache Beam directly?
https://github.com/apache/incubator-beam/tree/master/runners
[jira] [Commented] (HAMA-983) Hama runner for DataFlow
    [ https://issues.apache.org/jira/browse/HAMA-983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15502014#comment-15502014 ]

JongYoon Lim commented on HAMA-983:
-----------------------------------

Thank you for your feedback. Do you think it's better to branch from Hama for this, or to have an independent repo (GitHub)?
[jira] [Commented] (HAMA-983) Hama runner for DataFlow
    [ https://issues.apache.org/jira/browse/HAMA-983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15501973#comment-15501973 ]

Edward J. Yoon commented on HAMA-983:
-------------------------------------

https://cloud.google.com/dataflow/examples/wordcount-example

This page describes the Beam concepts well. The flow is as below:

{code}
Creating the Pipeline
Applying transforms to the Pipeline
    Reading input (in this example: reading text files)
    Applying ParDo transforms
    Applying SDK-provided transforms (in this example: Count)
    Writing output (in this example: writing to Google Cloud Storage)
Running the Pipeline
{code}

Once we have created the Hama pipeline, we should be able to run a program like the one below:

{code}
public static void main(String[] args) {
  // Create a pipeline parameterized by commandline flags.
  Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args));

  p.apply(TextIO.Read.from("gs://..."))   // Read input.
   .apply(new CountWords())               // Do some processing.
   .apply(TextIO.Write.to("gs://..."));   // Write output.

  // Run the pipeline.
  p.run();
}
{code}

For I/O operations, you can refer to https://github.com/apache/incubator-beam/blob/master/runners/spark/src/main/java/org/apache/beam/runners/spark/io/hadoop/HadoopIO.java (instead of org.apache.hadoop.mapreduce.lib.input.FileInputFormat, you should use https://github.com/apache/hama/blob/master/core/src/main/java/org/apache/hama/bsp/FileInputFormat.java).

{quote}BSP for dataflow could be similar to SuperstepBSP{quote}

I think so. GroupByKey seems to be a built-in processor that groups records by key. We should implement it using a superstep.
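To make the "GroupByKey as a superstep" idea concrete, here is a small single-process simulation. It assumes the grouping would be done by sending each record to the peer that owns its key (chosen by hash), synchronizing, and then grouping each peer's received messages. `GroupByKeySketch` and `Message` are hypothetical names, and no Hama API is used; in a real job the send and receive phases would presumably go through the BSP peer's message passing and barrier synchronization.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class GroupByKeySketch {

  /** A key/value record exchanged between simulated peers. */
  static final class Message {
    final String key;
    final long value;

    Message(String key, long value) {
      this.key = key;
      this.value = value;
    }
  }

  /**
   * Simulates one superstep. Input: the local records of each peer.
   * Output: for each peer, the values it received grouped by key.
   */
  static List<Map<String, List<Long>>> groupByKey(List<List<Message>> recordsPerPeer) {
    int numPeers = recordsPerPeer.size();
    List<List<Message>> inboxes = new ArrayList<>();
    for (int i = 0; i < numPeers; i++) {
      inboxes.add(new ArrayList<>());
    }

    // Send phase: every record goes to the peer that owns its key.
    for (List<Message> local : recordsPerPeer) {
      for (Message m : local) {
        int owner = Math.floorMod(m.key.hashCode(), numPeers);
        inboxes.get(owner).add(m);
      }
    }

    // (A BSP barrier would separate the two phases here.)

    // Receive phase: each peer groups the messages in its inbox by key.
    List<Map<String, List<Long>>> grouped = new ArrayList<>();
    for (List<Message> inbox : inboxes) {
      Map<String, List<Long>> groups = new TreeMap<>();
      for (Message m : inbox) {
        groups.computeIfAbsent(m.key, k -> new ArrayList<>()).add(m.value);
      }
      grouped.add(groups);
    }
    return grouped;
  }
}
```

Because the owner is derived from the key alone, all values for a key end up on exactly one peer, which is the property GroupByKey needs.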
[jira] [Commented] (HAMA-983) Hama runner for DataFlow
    [ https://issues.apache.org/jira/browse/HAMA-983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15500951#comment-15500951 ]

JongYoon Lim commented on HAMA-983:
-----------------------------------

Hi, it took some time to understand the Beam API and the Spark and Flink runners for Beam. It seems that Beam's transforms can be translated to Hama's API as follows, and the BSP for dataflow could be similar to SuperstepBSP. (If I have misunderstood anything, please correct me.)

BEAM -> HAMA
ParDo -> Superstep
Read.Bound -> RecordReader
Write.Bound -> RecordWriter
Combine -> Combiner
GroupByKey -> ?

I'm going to start with batch mode first, until Hama's streaming is ready. I'll add sub-tasks for this soon.
[jira] [Commented] (HAMA-983) Hama runner for DataFlow
    [ https://issues.apache.org/jira/browse/HAMA-983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15463592#comment-15463592 ]

JongYoon Lim commented on HAMA-983:
-----------------------------------

Yes, I'll check the streaming mode as well.
[jira] [Commented] (HAMA-983) Hama runner for DataFlow
    [ https://issues.apache.org/jira/browse/HAMA-983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15463593#comment-15463593 ]

JongYoon Lim commented on HAMA-983:
-----------------------------------

FlinkPipelineRunner internally has translators for both the pipeline and its transforms. It seems that a translator translates Beam operators to their Flink counterparts and saves the related information in a TranslationContext, which is used for Flink job processing. I think this patch can start by implementing a simple translator for batch jobs first.
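A translator layer of the kind described here usually boils down to a lookup from transform type to a translator object. The sketch below shows that shape in plain Java. All names (`SketchParDo`, `SketchGroupByKey`, `SketchUnknown`, `SketchTranslator`, `TranslatorRegistry`) are hypothetical placeholders, the returned strings only label the intended target, and nothing here is taken from the Flink runner's actual code.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical marker types standing in for Beam transforms. */
class SketchParDo {}

class SketchGroupByKey {}

class SketchUnknown {}

/** Hypothetical translator: names the backend operation a transform becomes. */
interface SketchTranslator {
  String translate();
}

class TranslatorRegistry {
  private static final Map<Class<?>, SketchTranslator> TRANSLATORS = new HashMap<>();

  static {
    // One entry per supported transform type; unsupported types have no entry.
    TRANSLATORS.put(SketchParDo.class, () -> "Superstep");
    TRANSLATORS.put(SketchGroupByKey.class, () -> "Superstep with message exchange");
  }

  /** Looks up the translator by the transform's class and applies it. */
  static String translate(Object transform) {
    SketchTranslator translator = TRANSLATORS.get(transform.getClass());
    if (translator == null) {
      throw new UnsupportedOperationException(
          "No translator for " + transform.getClass().getSimpleName());
    }
    return translator.translate();
  }

  public static void main(String[] args) {
    System.out.println(translate(new SketchParDo()));
  }
}
```

Failing loudly on an unregistered transform keeps a batch-only first version honest about what it does not support yet.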
[jira] [Commented] (HAMA-983) Hama runner for DataFlow
    [ https://issues.apache.org/jira/browse/HAMA-983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15454239#comment-15454239 ]

Edward J. Yoon commented on HAMA-983:
-------------------------------------

Just FYI, Apache Beam's basic example is wordcount. I guess the batch mode can be similar to org.apache.hama.examples.PiEstimator: (n - 1) tasks parse and count the words, and 1 task aggregates the word counts and emits the final result. I'm not sure about the streaming mode, so you'll need to check how it handles I/O.
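The task layout suggested here, with (n - 1) counting tasks and one aggregating task, can be simulated sequentially in plain Java. This is a sketch under assumptions: `WordCountBspSketch` is a hypothetical name, lines are dealt to the counting tasks round-robin, and the hand-off that a real BSP job would do with messages and a barrier is replaced by a list of partial results.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class WordCountBspSketch {

  /** Work of one counting task: tokenize its lines and count words locally. */
  static Map<String, Long> countLocally(List<String> lines) {
    Map<String, Long> counts = new TreeMap<>();
    for (String line : lines) {
      for (String word : line.split("[^a-zA-Z']+")) {
        if (!word.isEmpty()) {
          counts.merge(word, 1L, Long::sum);
        }
      }
    }
    return counts;
  }

  /** Work of the single aggregating task: merge the partial counts it received. */
  static Map<String, Long> aggregate(List<Map<String, Long>> partials) {
    Map<String, Long> total = new TreeMap<>();
    for (Map<String, Long> partial : partials) {
      partial.forEach((word, count) -> total.merge(word, count, Long::sum));
    }
    return total;
  }

  /** Simulates numTasks tasks: the first numTasks - 1 count, the last one aggregates. */
  static Map<String, Long> run(List<String> lines, int numTasks) {
    if (numTasks < 2) {
      throw new IllegalArgumentException("need at least one counter and one aggregator");
    }
    int counters = numTasks - 1;
    List<List<String>> splits = new ArrayList<>();
    for (int i = 0; i < counters; i++) {
      splits.add(new ArrayList<>());
    }
    // Deal the input lines to the counting tasks round-robin.
    for (int i = 0; i < lines.size(); i++) {
      splits.get(i % counters).add(lines.get(i));
    }
    List<Map<String, Long>> partials = new ArrayList<>();
    for (List<String> split : splits) {
      partials.add(countLocally(split));
    }
    return aggregate(partials);
  }
}
```

A single aggregator is simple but becomes a bottleneck for large vocabularies, so partitioning the keys across several aggregating peers would be the natural next step.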
[jira] [Commented] (HAMA-983) Hama runner for DataFlow
[ https://issues.apache.org/jira/browse/HAMA-983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15453619#comment-15453619 ] JongYoon Lim commented on HAMA-983: --- Thank you for the information. I'm interested in this feature. I'll start by analyzing the Flink runner.
[jira] [Commented] (HAMA-983) Hama runner for DataFlow
[ https://issues.apache.org/jira/browse/HAMA-983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15451100#comment-15451100 ] Edward J. Yoon commented on HAMA-983: - Hi, I haven't looked at Dataflow (Apache Beam) closely, but: >> Do you mean that each superstep can be executed in data pipeline as a pcollection? I guess yes, or a single job can be executed, as the case may be. If you're interested in working on this, you can refer to https://github.com/dataArtisans/flink-dataflow/blob/master/runner/src/main/java/com/dataartisans/flink/dataflow/FlinkPipelineRunner.java And before we do this, I guess HAMA-940 and a data processing BSP may need to come first. Please feel free to share your opinion and contribute patches. :-) If you have any questions, let me know.
[jira] [Commented] (HAMA-983) Hama runner for DataFlow
[ https://issues.apache.org/jira/browse/HAMA-983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15450866#comment-15450866 ] JongYoon Lim commented on HAMA-983: --- Do you mean that each superstep can be executed in a data pipeline as a PCollection? Could you add more details if I didn't understand correctly?