[jira] [Commented] (HAMA-983) Hama runner for DataFlow

2016-09-18 Thread Edward J. Yoon (JIRA)

[ 
https://issues.apache.org/jira/browse/HAMA-983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15502033#comment-15502033
 ] 

Edward J. Yoon commented on HAMA-983:
-

>> once PoC is done

Great. If you need any help, feel free to let me know :-)

> Hama runner for DataFlow
> ------------------------
>
> Key: HAMA-983
> URL: https://issues.apache.org/jira/browse/HAMA-983
> Project: Hama
> Issue Type: Bug
> Reporter: Edward J. Yoon
> Labels: gsoc2016
>
> As you already know, Apache Beam provides a unified programming model for both 
> batch and streaming inputs.
> The APIs are generally associated with data filtering and transforming, so 
> we'll need to implement a data processing runner like 
> https://github.com/dapurv5/MapReduce-BSP-Adapter/blob/master/src/main/java/org/apache/hama/mapreduce/examples/WordCount.java
> Also, implementing a similarity join could be fun. According to 
> http://www.ruizhang.info/publications/TPDS2015-Heads_Join.pdf, Apache Hama is 
> clearly the winner over Apache Hadoop and Apache Spark.
> Since it consists of transformation, aggregation, and partition computations, 
> I think it's possible to implement it using the Apache Beam APIs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HAMA-983) Hama runner for DataFlow

2016-09-18 Thread JongYoon Lim (JIRA)

[ 
https://issues.apache.org/jira/browse/HAMA-983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15502030#comment-15502030
 ] 

JongYoon Lim commented on HAMA-983:
---

Yes, I can add a PR link to https://issues.apache.org/jira/browse/BEAM-612 once 
the PoC is done. 



[jira] [Commented] (HAMA-983) Hama runner for DataFlow

2016-09-18 Thread Edward J. Yoon (JIRA)

[ 
https://issues.apache.org/jira/browse/HAMA-983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15502017#comment-15502017
 ] 

Edward J. Yoon commented on HAMA-983:
-

Why don't we contribute this feature to Apache Beam directly? 
https://github.com/apache/incubator-beam/tree/master/runners



[jira] [Commented] (HAMA-983) Hama runner for DataFlow

2016-09-18 Thread JongYoon Lim (JIRA)

[ 
https://issues.apache.org/jira/browse/HAMA-983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15502014#comment-15502014
 ] 

JongYoon Lim commented on HAMA-983:
---

Thank you for your feedback. Do you think it's better to branch from Hama 
for this, or to have an independent repo (GitHub)?



[jira] [Commented] (HAMA-983) Hama runner for DataFlow

2016-09-18 Thread Edward J. Yoon (JIRA)

[ 
https://issues.apache.org/jira/browse/HAMA-983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15501973#comment-15501973
 ] 

Edward J. Yoon commented on HAMA-983:
-

https://cloud.google.com/dataflow/examples/wordcount-example

This page describes the Beam concepts well. The flow is as follows:

{code}
Creating the Pipeline
Applying transforms to the Pipeline
Reading input (in this example: reading text files)
Applying ParDo transforms
Applying SDK-provided transforms (in this example: Count)
Writing output (in this example: writing to Google Cloud Storage)
Running the Pipeline
{code}

Once we have created a Hama pipeline, we should be able to run a program like the one below:

{code}
  public static void main(String[] args) {
    // Create a pipeline parameterized by command-line flags.
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    p.apply(TextIO.Read.from("gs://..."))   // Read input.
     .apply(new CountWords())               // Do some processing (user-defined transform).
     .apply(TextIO.Write.to("gs://..."));   // Write output.

    // Run the pipeline.
    p.run();
  }
{code}

For I/O operations, you can refer to 
https://github.com/apache/incubator-beam/blob/master/runners/spark/src/main/java/org/apache/beam/runners/spark/io/hadoop/HadoopIO.java
 (instead of org.apache.hadoop.mapreduce.lib.input.FileInputFormat, you should 
use 
https://github.com/apache/hama/blob/master/core/src/main/java/org/apache/hama/bsp/FileInputFormat.java).

{quote}BSP for dataflow could be similar to SuperstepBSP{quote}

I think so. GroupByKey seems to be a built-in processor that groups records by key. 
We should implement it using a superstep.
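
To make the idea concrete, here is a minimal sketch of GroupByKey as a single superstep, written as a plain-Java simulation. It does not use any Hama or Beam classes; the `Record` type, the `groupByKey` method, and the hash-based choice of the owning peer are all assumptions made for illustration. Each simulated peer sends every record to the peer that owns its key, and after the (simulated) barrier each peer groups what it received.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Plain-Java simulation of GroupByKey as one BSP superstep; not Hama API. */
public class GroupByKeySuperstep {

  /** A key/value record (hypothetical helper type, standing in for a Beam KV). */
  public static final class Record {
    public final String key;
    public final int value;

    public Record(String key, int value) {
      this.key = key;
      this.value = value;
    }
  }

  /**
   * Takes the records held by each simulated peer and returns, per peer,
   * the values grouped by the keys that peer owns after the superstep.
   */
  public static List<Map<String, List<Integer>>> groupByKey(List<List<Record>> peerInputs) {
    int numPeers = peerInputs.size();

    // Communication phase: each peer sends every record to the peer owning its key.
    List<List<Record>> inboxes = new ArrayList<>();
    for (int i = 0; i < numPeers; i++) {
      inboxes.add(new ArrayList<>());
    }
    for (List<Record> input : peerInputs) {
      for (Record r : input) {
        int owner = Math.floorMod(r.key.hashCode(), numPeers);
        inboxes.get(owner).add(r);
      }
    }

    // A real BSP job would synchronize at a barrier here. In this
    // single-threaded simulation all messages are already delivered.

    // Local phase: each peer groups the values it received by key.
    List<Map<String, List<Integer>>> grouped = new ArrayList<>();
    for (List<Record> inbox : inboxes) {
      Map<String, List<Integer>> byKey = new TreeMap<>();
      for (Record r : inbox) {
        byKey.computeIfAbsent(r.key, k -> new ArrayList<>()).add(r.value);
      }
      grouped.add(byKey);
    }
    return grouped;
  }

  public static void main(String[] args) {
    List<List<Record>> inputs = Arrays.asList(
        Arrays.asList(new Record("a", 1), new Record("b", 2)),
        Arrays.asList(new Record("a", 3), new Record("c", 4)));
    System.out.println(groupByKey(inputs));
  }
}
```

In an actual runner the send and barrier steps would presumably go through Hama's peer messaging, which this sketch leaves out.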







[jira] [Comment Edited] (HAMA-983) Hama runner for DataFlow

2016-09-18 Thread JongYoon Lim (JIRA)

[ 
https://issues.apache.org/jira/browse/HAMA-983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15500951#comment-15500951
 ] 

JongYoon Lim edited comment on HAMA-983 at 9/18/16 1:27 PM:


Hi, it took some time to understand the Beam API and the Spark and Flink runners for Beam. 
It seems that Beam's transforms can be translated to Hama's API as follows, 
and BSP for dataflow could be similar to SuperstepBSP. (If I have 
misunderstood anything, please correct me.)  
BEAM -> HAMA
ParDo -> Superstep
Read.Bound -> RecordReader
Write.Bound -> RecordWriter
Combine -> Combiner
GroupByKey -> ? 

I'm going to start with batch mode first, until Hama's streaming is ready. 
I'll add sub-tasks for this soon. 
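
As a rough illustration of the mapping above, a translator could start from a simple lookup table. This is only a sketch under assumptions: the class `BeamToHamaMapping` and the method `hamaCounterpart` are hypothetical names, and the strings are labels taken from the list above, not real Beam or Hama class references. GroupByKey is left out on purpose because its counterpart is still an open question here.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

/** Hypothetical lookup table for the proposed Beam-to-Hama translation. */
public class BeamToHamaMapping {

  private static final Map<String, String> MAPPING = new LinkedHashMap<>();

  static {
    MAPPING.put("ParDo", "Superstep");
    MAPPING.put("Read.Bound", "RecordReader");
    MAPPING.put("Write.Bound", "RecordWriter");
    MAPPING.put("Combine", "Combiner");
    // GroupByKey is deliberately absent: its Hama counterpart is undecided.
  }

  /** Returns the proposed Hama concept for a Beam transform, if one is listed. */
  public static Optional<String> hamaCounterpart(String beamTransform) {
    return Optional.ofNullable(MAPPING.get(beamTransform));
  }

  public static void main(String[] args) {
    String[] transforms = {"ParDo", "Read.Bound", "Write.Bound", "Combine", "GroupByKey"};
    for (String t : transforms) {
      System.out.println(t + " -> " + hamaCounterpart(t).orElse("? (not decided yet)"));
    }
  }
}
```

Returning an `Optional` lets a translator fail early with a clear message when it meets a transform that has no agreed counterpart yet.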



was (Author: seedengine):
Hi, it takes some time to understand the Beam API and the Spark and Flink runners for Beam. 
It seems that Beam's transforms can be translated to Hama's API as follows, 
and BSP for dataflow could be similar to SuperstepBSP. (If I have 
misunderstood anything, please correct me.)  
BEAM -> HAMA
ParDo -> Superstep
Read.Bound -> RecordReader
Write.Bound -> RecordWriter
Combine -> Combiner
GroupByKey -> ? 

I'm going to start with batch mode first, until Hama's streaming is ready. 
I'll add sub-tasks for this soon. 




[jira] [Commented] (HAMA-983) Hama runner for DataFlow

2016-09-18 Thread JongYoon Lim (JIRA)

[ 
https://issues.apache.org/jira/browse/HAMA-983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15500951#comment-15500951
 ] 

JongYoon Lim commented on HAMA-983:
---

Hi, it takes some time to understand the Beam API and the Spark and Flink runners for Beam. 
It seems that Beam's transforms can be translated to Hama's API as follows, 
and BSP for dataflow could be similar to SuperstepBSP. (If I have 
misunderstood anything, please correct me.)  
BEAM -> HAMA
ParDo -> Superstep
Read.Bound -> RecordReader
Write.Bound -> RecordWriter
Combine -> Combiner
GroupByKey -> ? 

I'm going to start with batch mode first, until Hama's streaming is ready. 
I'll add sub-tasks for this soon. 

