[jira] [Commented] (HAMA-983) Hama runner for DataFlow

2017-05-14 Thread Edward J. Yoon (JIRA)

[ 
https://issues.apache.org/jira/browse/HAMA-983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16009911#comment-16009911
 ] 

Edward J. Yoon commented on HAMA-983:
-

{code}
# create a new branch inside your directory 'current'
git checkout -b HAMA-983
# ... do some changes to the files ...
# commit changes to the branch
git commit -a -m '[HAMA-983] Hama runner for DataFlow'
# push the branch to the remote repository
git push origin HAMA-983
{code}

Then go to your GitHub HAMA page and open a Pull Request.

Hi JongYoon, you can create a new branch as shown above.

> Hama runner for DataFlow
> 
>
> Key: HAMA-983
> URL: https://issues.apache.org/jira/browse/HAMA-983
> Project: Hama
> Issue Type: Bug
> Reporter: Edward J. Yoon
> Labels: gsoc2016
>
> As you already know, Apache Beam provides a unified programming model for both 
> batch and streaming inputs.
> The APIs are generally associated with data filtering and transforming. So 
> we'll need to implement a data processing runner like 
> https://github.com/dapurv5/MapReduce-BSP-Adapter/blob/master/src/main/java/org/apache/hama/mapreduce/examples/WordCount.java
> Also, implementing a similarity join could be fun. According to 
> http://www.ruizhang.info/publications/TPDS2015-Heads_Join.pdf, Apache Hama is 
> the clear winner compared with Apache Hadoop and Apache Spark.
> Since it consists of transformation, aggregation, and partition computations, 
> I think it's possible to implement using the Apache Beam APIs.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (HAMA-983) Hama runner for DataFlow

2017-04-25 Thread JongYoon Lim (JIRA)

[ 
https://issues.apache.org/jira/browse/HAMA-983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15984090#comment-15984090
 ] 

JongYoon Lim commented on HAMA-983:
---

So, I'll try to create the branch. 
Thank you :)



[jira] [Commented] (HAMA-983) Hama runner for DataFlow

2017-04-25 Thread Edward J. Yoon (JIRA)

[ 
https://issues.apache.org/jira/browse/HAMA-983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15984007#comment-15984007
 ] 

Edward J. Yoon commented on HAMA-983:
-

Sorry for the late reply. 

{quote}could you create a branch called 'beam_support' on github?{quote} 

Sure. Alternatively, you can create the branch yourself, since you're a 
committer. Otherwise, I can do it this weekend.



[jira] [Commented] (HAMA-983) Hama runner for DataFlow

2017-04-20 Thread JongYoon Lim (JIRA)

[ 
https://issues.apache.org/jira/browse/HAMA-983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15978003#comment-15978003
 ] 

JongYoon Lim commented on HAMA-983:
---

Hi Edward, could you create a branch called 'beam_support' on github? I've 
been working on this issue again recently, and it seems to be working. At the 
moment I'm working on it in my local Beam branch, but I think it would be 
better to work on a Hama branch until it supports a recent Beam version. (My 
work is based on Beam's release-0.3.0-incubating, but they have already 
released version 0.6.0.) I think working in Hama makes it easier to get 
reviews and feedback from other developers. After that, it could be 
contributed to the Beam project. 



[jira] [Commented] (HAMA-983) Hama runner for DataFlow

2016-12-07 Thread Edward J. Yoon (JIRA)

[ 
https://issues.apache.org/jira/browse/HAMA-983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15730744#comment-15730744
 ] 

Edward J. Yoon commented on HAMA-983:
-

Cool, let me check.



[jira] [Commented] (HAMA-983) Hama runner for DataFlow

2016-12-07 Thread JongYoon Lim (JIRA)

[ 
https://issues.apache.org/jira/browse/HAMA-983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15730739#comment-15730739
 ] 

JongYoon Lim commented on HAMA-983:
---

I added a link to the skeleton code for the hama-runner. 
I added a TranslationContext class for executing the batch job: the results 
(supersteps) produced by the translators are added to a list in 
TranslationContext, and once all translation has finished, it executes the 
supersteps one by one. However, each result (superstep) I add is an object, not 
a class. So I'm wondering whether there is an easy way to create the same 
objects on the grooms, because those results (objects) are created on the 
master. I'm also not sure whether this approach is correct. 
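
One possible way to approach the question above is to serialize the translated object on the master and rebuild it on each groom, which should be feasible because Beam's DoFn is Serializable. The sketch below only illustrates that round trip with plain Java serialization. `SuperstepShipping`, `PrefixStep`, `encode` and `decode` are hypothetical names, not Hama or Beam APIs, and carrying the encoded string to the grooms (for example through the job configuration) is an assumption rather than something confirmed in this thread.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.Base64;

public class SuperstepShipping {

  /** Hypothetical stand-in for a translated step; not a Hama or Beam class. */
  public static class PrefixStep implements Serializable {
    private static final long serialVersionUID = 1L;
    private final String prefix;

    public PrefixStep(String prefix) {
      this.prefix = prefix;
    }

    public String process(String element) {
      return prefix + element;
    }
  }

  /** Master side: turn the configured object into a plain string. */
  public static String encode(Serializable step) throws IOException {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
      out.writeObject(step);
    }
    return Base64.getEncoder().encodeToString(bytes.toByteArray());
  }

  /** Worker side: rebuild an equivalent object from the string. */
  public static Object decode(String encoded)
      throws IOException, ClassNotFoundException {
    byte[] raw = Base64.getDecoder().decode(encoded);
    try (ObjectInputStream in =
        new ObjectInputStream(new ByteArrayInputStream(raw))) {
      return in.readObject();
    }
  }

  public static void main(String[] args) throws Exception {
    // The object is created once (as on the master) ...
    String shipped = encode(new PrefixStep("word:"));
    // ... and an equal copy is rebuilt elsewhere (as on a groom).
    PrefixStep onWorker = (PrefixStep) decode(shipped);
    System.out.println(onWorker.process("hi"));
  }
}
```

With this kind of round trip, the translator can keep adding objects to the context, and each groom would only need the class on its classpath plus the encoded string.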



[jira] [Commented] (HAMA-983) Hama runner for DataFlow

2016-12-07 Thread Edward J. Yoon (JIRA)

[ 
https://issues.apache.org/jira/browse/HAMA-983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15730450#comment-15730450
 ] 

Edward J. Yoon commented on HAMA-983:
-

Here's my skeleton code, with an example that counts words. You should 
implement the HamaPipelineRunner: just translate the pipeline and execute it 
as a batch job. I think you can see how to translate the transforms in 
Flink's code: 
https://github.com/dataArtisans/flink-dataflow/blob/aad5d936abd41240f3e15d294ea181fb9cca05e0/runner/src/main/java/com/dataartisans/flink/dataflow/translation/FlinkBatchTransformTranslators.java#L410

{code}
public class WordCountTest {

  static final String[] WORDS_ARRAY = new String[] { "hi there", "hi",
      "hi sue bob", "hi sue", "", "bob hi" };

  static final List<String> WORDS = Arrays.asList(WORDS_ARRAY);

  static final String[] COUNTS_ARRAY = new String[] { "hi: 5", "there: 1",
      "sue: 2", "bob: 2" };

  /**
   * Example test that tests a PTransform by using an in-memory input and
   * inspecting the output.
   */
  @Test
  @Category(RunnableOnService.class)
  public void testCountWords() throws Exception {
    HamaOptions options = PipelineOptionsFactory.as(HamaOptions.class);
    options.setRunner(HamaPipelineRunner.class);
    Pipeline p = Pipeline.create(options);

    PCollection<String> input = p.apply(Create.of(WORDS).withCoder(
        StringUtf8Coder.of()));

    PCollection<String> output = input
        .apply(new WordCount())
        .apply(MapElements.via(new FormatAsTextFn()));
    //.apply(TextIO.Write.to("/tmp/result"));

    PAssert.that(output).containsInAnyOrder(COUNTS_ARRAY);
    p.run().waitUntilFinish();
  }

  public static class WordCount extends
      PTransform<PCollection<String>, PCollection<KV<String, Long>>> {

    private static final long serialVersionUID = 1L;

    @Override
    public PCollection<KV<String, Long>> apply(PCollection<String> lines) {

      // Convert lines of text into individual words.
      PCollection<String> words = lines.apply(ParDo.of(new DoFn<String, String>() {
        private static final long serialVersionUID = 1L;
        private final Aggregator<Long, Long> emptyLines =
            createAggregator("emptyLines", new Sum.SumLongFn());

        @ProcessElement
        public void processElement(ProcessContext c) {
          if (c.element().trim().isEmpty()) {
            emptyLines.addValue(1L);
          }

          // Split the line into words.
          String[] words = c.element().split("[^a-zA-Z']+");

          // Output each word encountered into the output PCollection.
          for (String word : words) {
            if (!word.isEmpty()) {
              c.output(word);
            }
          }
        }
      }));

      // Count the number of times each word occurs.
      PCollection<KV<String, Long>> wordCounts = words.apply(Count
          .<String> perElement());

      return wordCounts;
    }
  }

  // TODO
  public static class HamaPipelineRunner extends
      PipelineRunner<HamaPipelineResult> {

    public static HamaPipelineRunner fromOptions(PipelineOptions x) {
      return new HamaPipelineRunner();
    }

    @Override
    public <Output extends POutput, Input extends PInput> Output apply(
        PTransform<Input, Output> transform, Input input) {
      return super.apply(transform, input);
    }

    @Override
    public HamaPipelineResult run(Pipeline pipeline) {
      // TODO Auto-generated method stub
      System.out.println("Executing pipeline using HamaPipelineRunner.");

      // TODO you need to translate pipeline to Hama program
      // and execute pipeline
      // return the result
      return null;
    }

  }

  public class HamaPipelineResult implements PipelineResult {

    @Override
    public State getState() {
      // TODO Auto-generated method stub
      return null;
    }

    @Override
    public State cancel() throws IOException {
      // TODO Auto-generated method stub
      return null;
    }

    @Override
    public State waitUntilFinish(Duration duration) {
      // TODO Auto-generated method stub
      return null;
    }

    @Override
    public State waitUntilFinish() {
      // TODO Auto-generated method stub
      return null;
    }

    @Override
    public <T> AggregatorValues<T> getAggregatorValues(
        Aggregator<?, T> aggregator) throws AggregatorRetrievalException {
      // TODO Auto-generated method stub
      return null;
    }

    @Override
    public MetricResults metrics() {
      // TODO Auto-generated method stub
      return null;
    }

  }

  public static interface HamaOptions extends PipelineOptions {

  }

}
{code}


[jira] [Commented] (HAMA-983) Hama runner for DataFlow

2016-12-07 Thread JongYoon Lim (JIRA)

[ 
https://issues.apache.org/jira/browse/HAMA-983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15730085#comment-15730085
 ] 

JongYoon Lim commented on HAMA-983:
---

Hi, Edward. First of all, sorry for the long delay. 

This is my process for testing the beam-hama-runner. 
1. Define a ParDo for testing, for example as below. 
{code}
PCollection> output = input.apply("test", 
    ParDo.of(new DoFn, KV>() {
      @ProcessElement
      public void processElement(ProcessContext c) {
        for (String word : c.element().toString().split("[^a-zA-Z']+")) {
          if (!word.isEmpty()) {
            c.output(KV.of(new Text(word), new LongWritable(11)));
          }
        }
      }
    }));
{code}
2. To translate the ParDo, I can pass it to DoFnFunction, which is a subclass 
of Superstep and has an OldDoFn.ProcessContext. Here, I'd like to create the 
DoFn instance in the Hama cluster once all translation has finished, but I'm 
not sure how to do that easily. 
{code}
  private static  TransformTranslator> parDo() {
    return new TransformTranslator>() {
      @Override
      public void translate(final ParDo.Bound transform, 
          TranslationContext context) {
        //context.addSuperstep(TestSuperStep.class);
        DoFnFunction dofn = new DoFnFunction((OldDoFn) transform.getFn());
        //context.addSuperstep(dofn.getClass());
      }
    };
  }
{code}



[jira] [Commented] (HAMA-983) Hama runner for DataFlow

2016-11-16 Thread Edward J. Yoon (JIRA)

[ 
https://issues.apache.org/jira/browse/HAMA-983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15672033#comment-15672033
 ] 

Edward J. Yoon commented on HAMA-983:
-

I don't quite understand. Could you please share your progress? 



[jira] [Commented] (HAMA-983) Hama runner for DataFlow

2016-11-12 Thread JongYoon Lim (JIRA)

[ 
https://issues.apache.org/jira/browse/HAMA-983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15660806#comment-15660806
 ] 

JongYoon Lim commented on HAMA-983:
---

Hi Edward, could you give me some idea of the recommended way to recreate, on a 
groom server, the same instance that a translator (Beam) created on the 
master? 



[jira] [Commented] (HAMA-983) Hama runner for DataFlow

2016-09-18 Thread Edward J. Yoon (JIRA)

[ 
https://issues.apache.org/jira/browse/HAMA-983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15502033#comment-15502033
 ] 

Edward J. Yoon commented on HAMA-983:
-

>> once PoC is done

Great. If you need any help, feel free to let me know :-)



[jira] [Commented] (HAMA-983) Hama runner for DataFlow

2016-09-18 Thread JongYoon Lim (JIRA)

[ 
https://issues.apache.org/jira/browse/HAMA-983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15502030#comment-15502030
 ] 

JongYoon Lim commented on HAMA-983:
---

Yes, I can add a PR link to https://issues.apache.org/jira/browse/BEAM-612 once 
PoC is done. 



[jira] [Commented] (HAMA-983) Hama runner for DataFlow

2016-09-18 Thread Edward J. Yoon (JIRA)

[ 
https://issues.apache.org/jira/browse/HAMA-983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15502017#comment-15502017
 ] 

Edward J. Yoon commented on HAMA-983:
-

Why don't we contribute this feature to Apache Beam directly? 
https://github.com/apache/incubator-beam/tree/master/runners



[jira] [Commented] (HAMA-983) Hama runner for DataFlow

2016-09-18 Thread JongYoon Lim (JIRA)

[ 
https://issues.apache.org/jira/browse/HAMA-983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15502014#comment-15502014
 ] 

JongYoon Lim commented on HAMA-983:
---

Thank you for your feedback. Do you think it's better to create a branch in 
Hama for this, or to have an independent repo (GitHub)?



[jira] [Commented] (HAMA-983) Hama runner for DataFlow

2016-09-18 Thread Edward J. Yoon (JIRA)

[ 
https://issues.apache.org/jira/browse/HAMA-983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15501973#comment-15501973
 ] 

Edward J. Yoon commented on HAMA-983:
-

https://cloud.google.com/dataflow/examples/wordcount-example

This page describes the Beam concepts well. The flow is as follows:

{code}
Creating the Pipeline
Applying transforms to the Pipeline
Reading input (in this example: reading text files)
Applying ParDo transforms
Applying SDK-provided transforms (in this example: Count)
Writing output (in this example: writing to Google Cloud Storage)
Running the Pipeline
{code}

Once we have created the Hama pipeline runner, we should be able to run a 
program like the one below:

{code}
  public static void main(String[] args) {
    // Create a pipeline parameterized by commandline flags.
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    p.apply(TextIO.Read.from("gs://..."))   // Read input.
     .apply(new CountWords())               // Do some processing.
     .apply(TextIO.Write.to("gs://..."));   // Write output.

    // Run the pipeline.
    p.run();
  }
{code}

For I/O operations, you can refer to 
https://github.com/apache/incubator-beam/blob/master/runners/spark/src/main/java/org/apache/beam/runners/spark/io/hadoop/HadoopIO.java
 (instead of org.apache.hadoop.mapreduce.lib.input.FileInputFormat, you should 
use 
https://github.com/apache/hama/blob/master/core/src/main/java/org/apache/hama/bsp/FileInputFormat.java).

{quote}BSP for dataflow could be similar to SuperstepBSP{quote}

I think so. GroupByKey seems to be a built-in processor that groups records by 
key. We should implement it using a superstep.
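
To make the idea concrete, here is a minimal in-memory sketch of how a GroupByKey could fit into a single shuffle superstep: every peer sends each record to the peer that owns its key, all peers wait at the barrier, and then each peer groups what it received. This is only a simulation in plain Java. `GroupByKeySketch` and `Pair` are hypothetical names, hash partitioning is an assumed routing rule, and no Hama API is used.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class GroupByKeySketch {

  /** Minimal key/value record; a stand-in for Beam's KV. */
  public static final class Pair {
    final String key;
    final long value;

    public Pair(String key, long value) {
      this.key = key;
      this.value = value;
    }
  }

  /**
   * Simulates one shuffle superstep. The outer list has one entry per peer,
   * both for the input and for the grouped result.
   */
  public static List<Map<String, List<Long>>> groupByKey(
      List<List<Pair>> inputPerPeer) {
    int numPeers = inputPerPeer.size();

    // Send phase: each peer routes every record to the peer owning its key.
    List<List<Pair>> inbox = new ArrayList<>();
    for (int i = 0; i < numPeers; i++) {
      inbox.add(new ArrayList<>());
    }
    for (List<Pair> local : inputPerPeer) {
      for (Pair p : local) {
        int owner = Math.floorMod(p.key.hashCode(), numPeers);
        inbox.get(owner).add(p);
      }
    }

    // The barrier would sit here: no peer reads its inbox before every
    // peer has finished sending.

    // Receive phase: each peer groups the values it received by key.
    List<Map<String, List<Long>>> grouped = new ArrayList<>();
    for (List<Pair> received : inbox) {
      Map<String, List<Long>> groups = new TreeMap<>();
      for (Pair p : received) {
        groups.computeIfAbsent(p.key, k -> new ArrayList<>()).add(p.value);
      }
      grouped.add(groups);
    }
    return grouped;
  }

  public static void main(String[] args) {
    List<List<Pair>> input = new ArrayList<>();
    input.add(Arrays.asList(new Pair("hi", 1), new Pair("sue", 1)));
    input.add(Arrays.asList(new Pair("hi", 1), new Pair("bob", 1)));
    // Each key ends up on exactly one peer, with all of its values.
    System.out.println(groupByKey(input));
  }
}
```

Because all values for a key land on one peer, a following Combine or ParDo could run locally on each peer without another exchange.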







[jira] [Commented] (HAMA-983) Hama runner for DataFlow

2016-09-18 Thread JongYoon Lim (JIRA)

[ 
https://issues.apache.org/jira/browse/HAMA-983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15500951#comment-15500951
 ] 

JongYoon Lim commented on HAMA-983:
---

Hi, it took some time to understand the Beam API and the Spark and Flink 
runners for Beam. It seems that Beam's transforms can be translated to Hama's 
API as follows. And BSP for dataflow could be similar to SuperstepBSP. (If I 
have misunderstood anything, please correct me.)  
BEAM -> HAMA
ParDo -> Superstep
Read.Bound -> RecordReader
Write.Bound -> RecordWriter
Combine -> Combiner
GroupByKey -> ? 

I'll start with batch mode first, until Hama's streaming support is ready. And 
I'll add sub-tasks for this soon. 




[jira] [Commented] (HAMA-983) Hama runner for DataFlow

2016-09-04 Thread JongYoon Lim (JIRA)

[ 
https://issues.apache.org/jira/browse/HAMA-983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15463592#comment-15463592
 ] 

JongYoon Lim commented on HAMA-983:
---

Yes, I'll check the streaming mode as well. 



[jira] [Commented] (HAMA-983) Hama runner for DataFlow

2016-09-04 Thread JongYoon Lim (JIRA)

[ 
https://issues.apache.org/jira/browse/HAMA-983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15463593#comment-15463593
 ] 

JongYoon Lim commented on HAMA-983:
---

FlinkPipelineRunner internally has translators for both the pipeline and its 
transforms. It seems that the translator maps Beam operators to their Flink 
counterparts and saves the related information in a TranslationContext, which 
is then used to run the Flink job. I think this patch can start by 
implementing a simple translator for batch jobs.
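
As a rough illustration of that translator pattern, the sketch below keeps a registry from transform type to translator, walks the pipeline in order, and lets each translator record its result in a context. Everything here is a hypothetical stand-in written in plain Java: `Transform`, `ReadText`, `ParDoLike`, `WriteText`, `TranslationContext` and `TransformTranslator` mimic the roles described above and are not the real Beam, Flink or Hama classes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TranslatorSketch {

  /** Hypothetical stand-ins for Beam transforms. */
  public interface Transform {}
  public static class ReadText implements Transform {}
  public static class ParDoLike implements Transform {}
  public static class WriteText implements Transform {}

  /** Collects what the translators produce, in pipeline order. */
  public static class TranslationContext {
    private final List<String> steps = new ArrayList<>();

    public void addStep(String description) {
      steps.add(description);
    }

    public List<String> getSteps() {
      return steps;
    }
  }

  /** Translates one kind of transform into entries in the context. */
  public interface TransformTranslator<T extends Transform> {
    void translate(T transform, TranslationContext context);
  }

  private static final Map<Class<? extends Transform>, TransformTranslator<?>>
      TRANSLATORS = new HashMap<>();

  private static <T extends Transform> void register(
      Class<T> type, TransformTranslator<T> translator) {
    TRANSLATORS.put(type, translator);
  }

  static {
    register(ReadText.class, (t, ctx) -> ctx.addStep("read"));
    register(ParDoLike.class, (t, ctx) -> ctx.addStep("parDo"));
    register(WriteText.class, (t, ctx) -> ctx.addStep("write"));
  }

  /** Walks the pipeline and applies the matching translator to each transform. */
  @SuppressWarnings("unchecked")
  public static TranslationContext translate(List<Transform> pipeline) {
    TranslationContext context = new TranslationContext();
    for (Transform transform : pipeline) {
      TransformTranslator<Transform> translator =
          (TransformTranslator<Transform>) TRANSLATORS.get(transform.getClass());
      if (translator == null) {
        throw new UnsupportedOperationException(
            "No translator for " + transform.getClass().getSimpleName());
      }
      translator.translate(transform, context);
    }
    return context;
  }

  public static void main(String[] args) {
    List<Transform> pipeline =
        Arrays.asList(new ReadText(), new ParDoLike(), new WriteText());
    System.out.println(translate(pipeline).getSteps());
  }
}
```

Failing fast on an unregistered transform keeps a batch-only first version honest about what it does not support yet.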




[jira] [Commented] (HAMA-983) Hama runner for DataFlow

2016-08-31 Thread Edward J. Yoon (JIRA)

[ 
https://issues.apache.org/jira/browse/HAMA-983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15454239#comment-15454239
 ] 

Edward J. Yoon commented on HAMA-983:
-

Just FYI, Apache Beam's basic example is wordcount. I guess the batch mode can 
be similar to org.apache.hama.examples.PiEstimator: (n - 1) tasks parse and 
count the words, and 1 task aggregates the word counts and emits the final 
result. I'm not sure about the streaming mode, so you'll need to check how it 
handles I/O.
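
A rough sketch of that batch layout, written as a plain-Python simulation rather than against the Hama BSP API. The function name bsp_word_count, the round-robin input split, and the in-memory inbox standing in for message passing are all assumptions made for illustration.

```python
# Plain-Python simulation; it does not use the Hama BSP API.
from collections import Counter


def bsp_word_count(lines, num_tasks):
    """Simulate (n - 1) counting tasks plus 1 aggregating task."""
    if num_tasks < 2:
        raise ValueError("need at least one worker and one aggregator")
    num_workers = num_tasks - 1

    # Superstep 1: each of the (n - 1) workers counts words in its split
    # and "sends" the partial counts to the aggregator.
    inbox = []  # stands in for messages delivered at the barrier
    for worker in range(num_workers):
        split = lines[worker::num_workers]  # assumed round-robin split
        local = Counter(word for line in split for word in line.split())
        inbox.append(local)

    # Superstep 2: the remaining task merges the partial counts
    # and emits the final result.
    total = Counter()
    for partial in inbox:
        total.update(partial)
    return dict(total)
```

For example, bsp_word_count(["a b a", "b c", "a"], 3) runs two counting tasks and one aggregating task over three input lines.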



[jira] [Commented] (HAMA-983) Hama runner for DataFlow

2016-08-31 Thread JongYoon Lim (JIRA)

[ 
https://issues.apache.org/jira/browse/HAMA-983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15453619#comment-15453619
 ] 

JongYoon Lim commented on HAMA-983:
---

Thank you for the information. I'm interested in this feature. I'll start by 
analyzing the Flink runner.



[jira] [Commented] (HAMA-983) Hama runner for DataFlow

2016-08-30 Thread Edward J. Yoon (JIRA)

[ 
https://issues.apache.org/jira/browse/HAMA-983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15451100#comment-15451100
 ] 

Edward J. Yoon commented on HAMA-983:
-

Hi, I haven't looked at Dataflow (Apache Beam) closely, but:

>> Do you mean that each superstep can be executed in a data pipeline as a 
>> PCollection? 

I guess yes, or a single job can be executed, as the case may be.

If you're interested in working on this, you can refer to 
https://github.com/dataArtisans/flink-dataflow/blob/master/runner/src/main/java/com/dataartisans/flink/dataflow/FlinkPipelineRunner.java

Also, before we do this, I guess HAMA-940 and the data processing BSP may need 
to come first. Please feel free to share your opinion and contribute patches. :-)

If you have any questions, let me know.



[jira] [Commented] (HAMA-983) Hama runner for DataFlow

2016-08-30 Thread JongYoon Lim (JIRA)

[ 
https://issues.apache.org/jira/browse/HAMA-983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15450866#comment-15450866
 ] 

JongYoon Lim commented on HAMA-983:
---

Do you mean that each superstep can be executed in a data pipeline as a 
PCollection? Could you add more details in case I didn't understand correctly?
