[jira] [Commented] (HAMA-983) Hama runner for DataFlow
[ https://issues.apache.org/jira/browse/HAMA-983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15502033#comment-15502033 ]

Edward J. Yoon commented on HAMA-983:
-------------------------------------

>> once PoC is done

Great. If you need any help, feel free to let me know :-)

> Hama runner for DataFlow
> ------------------------
>
>                 Key: HAMA-983
>                 URL: https://issues.apache.org/jira/browse/HAMA-983
>             Project: Hama
>          Issue Type: Bug
>            Reporter: Edward J. Yoon
>              Labels: gsoc2016
>
> As you already know, Apache Beam provides a unified programming model for both batch and streaming inputs.
> The APIs are generally associated with data filtering and transforming. So we'll need to implement a data processing runner like
> https://github.com/dapurv5/MapReduce-BSP-Adapter/blob/master/src/main/java/org/apache/hama/mapreduce/examples/WordCount.java
> Also, implementing similarity join could be fun. According to http://www.ruizhang.info/publications/TPDS2015-Heads_Join.pdf, Apache Hama is clearly the winner compared with Apache Hadoop and Apache Spark.
> Since it consists of transformation, aggregation, and partition computations, I think it's possible to implement it using the Apache Beam APIs.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Commented] (HAMA-983) Hama runner for DataFlow
[ https://issues.apache.org/jira/browse/HAMA-983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15502030#comment-15502030 ]

JongYoon Lim commented on HAMA-983:
-----------------------------------

Yes, I can add a PR link to https://issues.apache.org/jira/browse/BEAM-612 once the PoC is done.
[jira] [Commented] (HAMA-983) Hama runner for DataFlow
[ https://issues.apache.org/jira/browse/HAMA-983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15502017#comment-15502017 ]

Edward J. Yoon commented on HAMA-983:
-------------------------------------

Why don't we contribute this feature to Apache Beam directly?
https://github.com/apache/incubator-beam/tree/master/runners
[jira] [Commented] (HAMA-983) Hama runner for DataFlow
[ https://issues.apache.org/jira/browse/HAMA-983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15502014#comment-15502014 ]

JongYoon Lim commented on HAMA-983:
-----------------------------------

Thank you for your feedback. Do you think it's better to branch from Hama for this, or to have an independent repo (GitHub)?
[jira] [Commented] (HAMA-983) Hama runner for DataFlow
[ https://issues.apache.org/jira/browse/HAMA-983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15501973#comment-15501973 ]

Edward J. Yoon commented on HAMA-983:
-------------------------------------

https://cloud.google.com/dataflow/examples/wordcount-example

This page describes the Beam concepts well. The flow is as follows:

{code}
Creating the Pipeline
Applying transforms to the Pipeline
    Reading input (in this example: reading text files)
    Applying ParDo transforms
    Applying SDK-provided transforms (in this example: Count)
    Writing output (in this example: writing to Google Cloud Storage)
Running the Pipeline
{code}

Once we have created the Hama pipeline, we should be able to run a program like the one below:

{code}
public static void main(String[] args) {
  // Create a pipeline parameterized by command-line flags.
  Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

  p.apply(TextIO.Read.from("gs://..."))   // Read input.
   .apply(new CountWords())               // Do some processing.
   .apply(TextIO.Write.to("gs://..."));   // Write output.

  // Run the pipeline.
  p.run();
}
{code}

For I/O operations, you can refer to https://github.com/apache/incubator-beam/blob/master/runners/spark/src/main/java/org/apache/beam/runners/spark/io/hadoop/HadoopIO.java (instead of org.apache.hadoop.mapreduce.lib.input.FileInputFormat, you should use https://github.com/apache/hama/blob/master/core/src/main/java/org/apache/hama/bsp/FileInputFormat.java).

{quote}BSP for dataflow could be similar to SuperstepBSP{quote}

I think so. GroupByKey seems to be a built-in processor that groups records by key. We should implement it using a superstep.
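The idea of implementing GroupByKey with a superstep can be sketched roughly as below. This is only a single-process simulation in plain Java, not Hama or Beam code: the class name SuperstepGroupByKey, the methods ownerOf and groupByKey, and the use of in-memory lists as peer inboxes are all assumptions made for illustration. In a real BSP job the routing step would be message sends between peers, followed by a barrier synchronization.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical single-process simulation of GroupByKey built from one BSP superstep. */
final class SuperstepGroupByKey {

    /** Picks the peer that owns a key, the way a hash partitioner would. */
    static int ownerOf(String key, int numPeers) {
        return Math.floorMod(key.hashCode(), numPeers);
    }

    /**
     * Takes the local records of each peer and returns, per peer,
     * the groups (key -> values) that the peer owns after the shuffle.
     */
    static List<Map<String, List<Integer>>> groupByKey(
            List<List<Map.Entry<String, Integer>>> inputPerPeer) {
        int numPeers = inputPerPeer.size();

        // Communication phase: every peer "sends" each record to the
        // inbox of the peer that owns the record's key.
        List<List<Map.Entry<String, Integer>>> inboxes = new ArrayList<>();
        for (int i = 0; i < numPeers; i++) {
            inboxes.add(new ArrayList<>());
        }
        for (List<Map.Entry<String, Integer>> local : inputPerPeer) {
            for (Map.Entry<String, Integer> kv : local) {
                inboxes.get(ownerOf(kv.getKey(), numPeers)).add(kv);
            }
        }

        // Barrier: in a real BSP job all peers would synchronize here.
        // In this simulation, finishing the loop above plays that role.

        // Next superstep: each peer groups the messages it received.
        List<Map<String, List<Integer>>> groupedPerPeer = new ArrayList<>();
        for (List<Map.Entry<String, Integer>> inbox : inboxes) {
            Map<String, List<Integer>> grouped = new TreeMap<>();
            for (Map.Entry<String, Integer> kv : inbox) {
                grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>())
                       .add(kv.getValue());
            }
            groupedPerPeer.add(grouped);
        }
        return groupedPerPeer;
    }

    public static void main(String[] args) {
        List<List<Map.Entry<String, Integer>>> input = List.of(
                List.of(Map.entry("hama", 1), Map.entry("beam", 5)),
                List.of(Map.entry("hama", 2)));
        List<Map<String, List<Integer>>> result = groupByKey(input);
        for (int peer = 0; peer < result.size(); peer++) {
            System.out.println("peer " + peer + " owns " + result.get(peer));
        }
    }
}
```

Because every record for a given key is routed to the same peer, each key ends up grouped on exactly one peer, which is the property a GroupByKey translation would need.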
[jira] [Comment Edited] (HAMA-983) Hama runner for DataFlow
[ https://issues.apache.org/jira/browse/HAMA-983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15500951#comment-15500951 ]

JongYoon Lim edited comment on HAMA-983 at 9/18/16 1:27 PM:
------------------------------------------------------------

Hi, it took some time to understand the Beam API and the Spark and Flink runners for Beam. It seems that Beam's transforms can be translated to Hama's API as follows, and the BSP for dataflow could be similar to SuperstepBSP. (If I have misunderstood anything, please correct me.)

BEAM -> HAMA
ParDo -> Superstep
Read.Bound -> RecordReader
Write.Bound -> RecordWriter
Combine -> Combiner
GroupByKey -> ?

I'm going to start with batch mode first, until Hama's streaming is ready. I'll add sub-tasks for this soon.

was (Author: seedengine):
Hi, it takes some time to understand the Beam API and the Spark and Flink runners for Beam. It seems that Beam's transforms can be translated to Hama's API as follows, and the BSP for dataflow could be similar to SuperstepBSP. (If I have misunderstood anything, please correct me.)

BEAM -> HAMA
ParDo -> Superstep
Read.Bound -> RecordReader
Write.Bound -> RecordWriter
Combine -> Combiner
GroupByKey -> ?

I'm going to start with batch mode first, until Hama's streaming is ready. I'll add sub-tasks for this soon.
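One way a runner could hold the proposed BEAM -> HAMA mapping is a simple lookup table. The following is a minimal sketch in plain Java; the class name TransformMapping and the method hamaCounterpart are hypothetical, the entries are just the names from the comment above, and GroupByKey is deliberately left unmapped because the comment marks it with "?".

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

/** Hypothetical lookup table for the proposed Beam-to-Hama translation. */
final class TransformMapping {

    // Insertion-ordered so that printing follows the order of the proposal.
    static final Map<String, String> BEAM_TO_HAMA = new LinkedHashMap<>();

    static {
        BEAM_TO_HAMA.put("ParDo", "Superstep");
        BEAM_TO_HAMA.put("Read.Bound", "RecordReader");
        BEAM_TO_HAMA.put("Write.Bound", "RecordWriter");
        BEAM_TO_HAMA.put("Combine", "Combiner");
        // GroupByKey has no entry: its counterpart is still an open question.
    }

    /** Returns the proposed Hama counterpart, or empty if none has been decided. */
    static Optional<String> hamaCounterpart(String beamTransform) {
        return Optional.ofNullable(BEAM_TO_HAMA.get(beamTransform));
    }

    public static void main(String[] args) {
        for (String name : new String[] {"ParDo", "Combine", "GroupByKey"}) {
            System.out.println(name + " -> " + hamaCounterpart(name).orElse("?"));
        }
    }
}
```

Returning an empty Optional for an unmapped transform keeps the open question visible instead of guessing at a translation.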
[jira] [Commented] (HAMA-983) Hama runner for DataFlow
[ https://issues.apache.org/jira/browse/HAMA-983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15500951#comment-15500951 ]

JongYoon Lim commented on HAMA-983:
-----------------------------------

Hi, it takes some time to understand the Beam API and the Spark and Flink runners for Beam. It seems that Beam's transforms can be translated to Hama's API as follows, and the BSP for dataflow could be similar to SuperstepBSP. (If I have misunderstood anything, please correct me.)

BEAM -> HAMA
ParDo -> Superstep
Read.Bound -> RecordReader
Write.Bound -> RecordWriter
Combine -> Combiner
GroupByKey -> ?

I'm going to start with batch mode first, until Hama's streaming is ready. I'll add sub-tasks for this soon.